How to implement vaex dataframes in sections of Large Python/Pandas Application? #1610
Replies: 1 comment 2 replies
Hi Russel,
I don't understand why you think this. The pandas dataframe seems to use ~250 MB, right? If you convert it to a Vaex dataframe, the process still uses the same amount of memory, so it seems we do not create a copy, which is good, right? We cannot use less memory than the data itself, right? Is that what you thought Vaex would do?
No, it will use the same amount of memory.
Yes, more so, and if you store your data on disk as hdf5 or arrow, we'll memory-map it so that it doesn't waste precious memory. Hope this helps.
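A minimal sketch of the memory-mapping idea using plain numpy (vaex's hdf5/arrow readers rely on the same OS mechanism; the file name here is made up for illustration):

```python
import os
import tempfile

import numpy as np

# One-time "export": write a large column to disk. In vaex this role is
# played by exporting the dataframe to an hdf5 or arrow file.
path = os.path.join(tempfile.mkdtemp(), "weights_col.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

# Later runs "open" the file memory-mapped: pages are pulled in lazily by
# the OS on access instead of reading the whole column into RAM up front.
col = np.load(path, mmap_mode="r")
print(type(col).__name__)  # memmap
print(col[123])            # 123.0
```

The key point is that after the one-time export, reopening the data costs almost no resident memory until you actually touch it.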
I am currently optimizing some of our Data Pipeline jobs (primarily optimizing for memory) and I had the idea of converting our pandas dataframes into vaex dataframes in our Load functions, which use psycopg2 to read from Postgres DB tables.
Upon converting to a vaex dataframe I was expecting memory to free up on the line where I made the change, but it is actually increasing memory usage slightly. The runtime improved slightly, though.
Our application then stores each of the loaded tables in global variables that are processed later in the pipeline.
Here is an example:
The parent function LoadCalEffects() uses three different global variables containing DataFrames that were loaded into memory by earlier Load functions in the pipeline. It is at the end of those three Load functions that I convert the pandas DataFrames to vaex DataFrames.
End of LoadWeights() function:
```python
gbl.weights = weights
gbl.weights_min_date = int(weights[params.ods_day_id].min().strftime('%Y%m%d'))
# Convert to vaex dataframes
gbl.weights = vx.from_pandas(gbl.weights)
```
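A toy illustration of why rebinding the global cannot lower memory here (assuming, as the maintainer's reply suggests, that `vx.from_pandas` wraps the existing column buffers rather than copying them; `WrappedFrame` below is a made-up stand-in, not vaex's real class):

```python
import weakref

import numpy as np

col = np.zeros(1_000_000)        # stands in for a pandas column (~8 MB)
alive = weakref.ref(col)         # lets us observe when the buffer dies

class WrappedFrame:
    """Stand-in for a vaex DataFrame built via from_pandas."""
    def __init__(self, arr):
        self.arr = arr           # keeps a reference to the SAME buffer

gbl_weights = WrappedFrame(col)  # like gbl.weights = vx.from_pandas(...)
del col                          # dropping the pandas-side name...
still_alive = alive() is not None  # ...does not free it: the wrapper pins it

del gbl_weights                  # only when every reference is gone...
freed = alive() is None          # ...can the memory actually be reclaimed
print(still_alive, freed)        # True True
```

So a zero-copy conversion at the end of the function can never show a memory drop in the profile: the same buffers stay alive, just reachable through a different object.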
If I do a memory profile of the function, I do not see any memory improvement from the conversion from a pandas DataFrame to a vaex DataFrame.
LoadWeights() - Pandas
```
Filename: ./load.py
Line # Mem usage Increment Occurences Line Contents
2281 203.8 MiB 203.8 MiB 1 @benchmarking.memory
2282 def LoadWeights():
2283
2284 203.8 MiB 0.0 MiB 1 logger.debug(log_params.START)
2285
2286 203.8 MiB 0.0 MiB 1 try:
2287 204.0 MiB 0.2 MiB 1 weights_query = ODSBuildQuery(dict=query_cfg['weights'])
2288 208.9 MiB 4.9 MiB 1 weights = ODSConnection.execute_select_query(weights_query)
2289 .rename(columns={
2290 208.9 MiB 0.0 MiB 1 'trend': 'Trend'
2291 208.9 MiB 0.0 MiB 1 , 'tuwin': 'TuWin'
2292 208.9 MiB 0.0 MiB 1 , 'wewin': 'WeWin'
2293 208.9 MiB 0.0 MiB 1 , 'thwin': 'ThWin'
2294 208.9 MiB 0.0 MiB 1 , 'frwin': 'FrWin'
2295 208.9 MiB 0.0 MiB 1 , 'sawin': 'SaWin'
2296 208.9 MiB 0.0 MiB 1 , 'suwin': 'SuWin'
2297 208.9 MiB 0.0 MiB 1 , 'mosum': 'MoSum'
2298 208.9 MiB 0.0 MiB 1 , 'tusum': 'TuSum'
2299 208.9 MiB 0.0 MiB 1 , 'wesum': 'WeSum'
2300 208.9 MiB 0.0 MiB 1 , 'thsum': 'ThSum'
2301 208.9 MiB 0.0 MiB 1 , 'frsum': 'FrSum'
2302 208.9 MiB 0.0 MiB 1 , 'sasum': 'SaSum'
2303 208.9 MiB 0.0 MiB 1 , 'susum': 'SuSum'
2304 208.9 MiB 0.0 MiB 1 , 'feb' : 'Feb'
2305 208.9 MiB 0.0 MiB 1 , 'mar' : 'Mar'
2306 208.9 MiB 0.0 MiB 1 , 'apr' : 'Apr'
2307 208.9 MiB 0.0 MiB 1 , 'may' : 'May'
2308 208.9 MiB 0.0 MiB 1 , 'jun' : 'Jun'
2309 208.9 MiB 0.0 MiB 1 , 'jul' : 'Jul'
2310 208.9 MiB 0.0 MiB 1 , 'aug' : 'Aug'
2311 208.9 MiB 0.0 MiB 1 , 'sep' : 'Sep'
2312 208.9 MiB 0.0 MiB 1 , 'oct' : 'Oct'
2313 208.9 MiB 0.0 MiB 1 , 'nov' : 'Nov'
2314 208.9 MiB 0.0 MiB 1 , 'dec' : 'Dec'
2315 })
2316 .astype({
2317 208.9 MiB 0.0 MiB 1 'Feb' : np.float64
2318 208.9 MiB 0.0 MiB 1 , 'Mar' : np.float64
2319 208.9 MiB 0.0 MiB 1 , 'Apr' : np.float64
2320 208.9 MiB 0.0 MiB 1 , 'May' : np.float64
2321 208.9 MiB 0.0 MiB 1 , 'Jun' : np.float64
2322 208.9 MiB 0.0 MiB 1 , 'Jul' : np.float64
2323 208.9 MiB 0.0 MiB 1 , 'Aug' : np.float64
2324 208.9 MiB 0.0 MiB 1 , 'Sep' : np.float64
2325 208.9 MiB 0.0 MiB 1 , 'Oct' : np.float64
2326 208.9 MiB 0.0 MiB 1 , 'Nov' : np.float64
2327 208.6 MiB -0.2 MiB 1 , 'Dec' : np.float64
2328 })
2329 216.5 MiB 7.9 MiB 1 weights.to_csv(params.local__weights, index = False)
2335 216.5 MiB 0.0 MiB 1 bucket_name = gbl.s3_bucket_extracts
2336 216.5 MiB 0.0 MiB 1 folder_name = gbl.s3_folder_extracts
2337
2338 216.6 MiB 0.1 MiB 1 DAE_S3.S3Connection(bucket_name, folder_name, params.local_seasonal_weights).S3upload()
2339
2340 except:
2341 logger.warning("Problem when loading weights.")
2342 weights = None
2343
2344 216.6 MiB 0.0 MiB 1 gbl.weights = weights
2345 216.6 MiB 0.0 MiB 1 gbl.weights_min_date = int(weights[params.ods_day_id].min().strftime('%Y%m%d'))
2347 216.6 MiB 0.0 MiB 1 logger.debug(log_params.END)
```
LoadWeights() - vaex
```
Line # Mem usage Increment Occurences Line Contents
2286 246.4 MiB 246.4 MiB 1 @benchmarking.memory
2287 def LoadWeights():
2288
2289 246.4 MiB 0.0 MiB 1 logger.debug(log_params.START)
2290
2291 246.4 MiB 0.0 MiB 1 try:
2292 246.4 MiB 0.0 MiB 1 weights_query = ODSBuildQuery(dict=query_cfg['seasonal_weights'])
2293 251.0 MiB 4.6 MiB 1 weights = execute_select_query(weights_query)
2294 .rename(columns={
2295 251.0 MiB 0.0 MiB 1 'trend': 'Trend'
2296 251.0 MiB 0.0 MiB 1 , 'tuwin': 'TuWin'
2297 251.0 MiB 0.0 MiB 1 , 'wewin': 'WeWin'
2298 251.0 MiB 0.0 MiB 1 , 'thwin': 'ThWin'
2299 251.0 MiB 0.0 MiB 1 , 'frwin': 'FrWin'
2300 251.0 MiB 0.0 MiB 1 , 'sawin': 'SaWin'
2301 251.0 MiB 0.0 MiB 1 , 'suwin': 'SuWin'
2302 251.0 MiB 0.0 MiB 1 , 'mosum': 'MoSum'
2303 251.0 MiB 0.0 MiB 1 , 'tusum': 'TuSum'
2304 251.0 MiB 0.0 MiB 1 , 'wesum': 'WeSum'
2305 251.0 MiB 0.0 MiB 1 , 'thsum': 'ThSum'
2306 251.0 MiB 0.0 MiB 1 , 'frsum': 'FrSum'
2307 251.0 MiB 0.0 MiB 1 , 'sasum': 'SaSum'
2308 251.0 MiB 0.0 MiB 1 , 'susum': 'SuSum'
2309 251.0 MiB 0.0 MiB 1 , 'feb' : 'Feb'
2310 251.0 MiB 0.0 MiB 1 , 'mar' : 'Mar'
2311 251.0 MiB 0.0 MiB 1 , 'apr' : 'Apr'
2312 251.0 MiB 0.0 MiB 1 , 'may' : 'May'
2313 251.0 MiB 0.0 MiB 1 , 'jun' : 'Jun'
2314 251.0 MiB 0.0 MiB 1 , 'jul' : 'Jul'
2315 251.0 MiB 0.0 MiB 1 , 'aug' : 'Aug'
2316 251.0 MiB 0.0 MiB 1 , 'sep' : 'Sep'
2317 251.0 MiB 0.0 MiB 1 , 'oct' : 'Oct'
2318 251.0 MiB 0.0 MiB 1 , 'nov' : 'Nov'
2319 251.0 MiB 0.0 MiB 1 , 'dec' : 'Dec'
2320 })
2321 .astype({
2322 251.0 MiB 0.0 MiB 1 'Feb' : np.float64
2323 251.0 MiB 0.0 MiB 1 , 'Mar' : np.float64
2324 251.0 MiB 0.0 MiB 1 , 'Apr' : np.float64
2325 251.0 MiB 0.0 MiB 1 , 'May' : np.float64
2326 251.0 MiB 0.0 MiB 1 , 'Jun' : np.float64
2327 251.0 MiB 0.0 MiB 1 , 'Jul' : np.float64
2328 251.0 MiB 0.0 MiB 1 , 'Aug' : np.float64
2329 251.0 MiB 0.0 MiB 1 , 'Sep' : np.float64
2330 251.0 MiB 0.0 MiB 1 , 'Oct' : np.float64
2331 251.0 MiB 0.0 MiB 1 , 'Nov' : np.float64
2332 250.6 MiB -0.4 MiB 1 , 'Dec' : np.float64
2333 })
2334 254.7 MiB 4.1 MiB 1 weights.to_csv(params.local_seasonal_weights, index=False)
2340 254.7 MiB 0.0 MiB 1 bucket_name = gbl.s3_bucket_extracts
2341 254.7 MiB 0.0 MiB 1 folder_name = gbl.s3_folder_extracts
2343 254.9 MiB 0.2 MiB 1 DAE_S3.S3Connection(bucket_name, folder_name, params.local_seasonal_weights).S3upload()
2344
2345 except:
2346 logger.warning("Problem when loading seasonal_weights.")
2347 weights = None
2348
2349 254.9 MiB 0.0 MiB 1 gbl.weights = seasonal_weights
2350 254.9 MiB 0.0 MiB 1 gbl.weights_min_date = int(seasonal_weights[params.ods_day_id].min().strftime('%Y%m%d'))
2351 # Convert to vx dataframes
2352 254.9 MiB 0.0 MiB 1 gbl.weights = vx.from_pandas(gbl.weights)
2353
2354 254.9 MiB 0.0 MiB 1 logger.debug(log_params.END)
```
As you can see, I only convert to vaex at the end of the function; my thinking was that this would free up the memory used by the pandas version of the weights dataframe stored in gbl.weights.
The reason I am only converting at the end is that I want to be minimally invasive in terms of refactoring these functions (due to time constraints). I think we will be looking into moving over to vaex in the future.
I would greatly appreciate it if anyone could tell me why this is not working as I expect, as my understanding may be incorrect. And of course, if anyone could point out the correct way to implement what I am trying to achieve, that would be appreciated too.
Edit
I think my understanding of how vaex works may not have been correct. I believe the memory and runtime benefits are only realized when all the data manipulation is done using vaex dataframes (instead of pandas). I was only converting the global variables to vaex before the Load functions return, so that they return vaex dataframes.
I thought that by doing this I would free up the memory taken up by the pandas DataFrames, since the data would now be stored as vaex dataframes.
I believe that vaex would be a great help if I were to refactor all of our load functions to process our data using vaex dataframes, but that is out of the scope of my current task.
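A rough mental model of where those benefits would come from (a toy sketch in plain numpy, not vaex's actual machinery): pandas-style operations materialize a full-size result at every step, while vaex records expressions as "virtual columns" and evaluates them lazily, in chunks, only when needed.

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Eager (pandas-style): each step allocates a full-size intermediate array.
eager = np.sqrt(x) + 1.0

# Lazy (vaex-style virtual column): store the recipe, not the result.
# Nothing full-size is allocated until evaluation is requested.
recipe = lambda a: np.sqrt(a) + 1.0

# Evaluate only a chunk at a time, which caps peak memory per step.
chunk = recipe(x[:100_000])
print(np.allclose(chunk, eager[:100_000]))  # True
```

This is why converting at the very last line of a pandas pipeline changes nothing: all the full-size intermediates have already been allocated by then.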
I think for version 3.0 of our application I will be pushing for: