How to implement vaex dataframes in sections of Large Python/Pandas Application? #1610
Replies: 1 comment 2 replies
Hi Russel,
I don't understand why you think this. The pandas dataframe seems to use ~250 MB, right? If you convert it to a Vaex dataframe, the process still uses the same amount of memory, so it seems we do not create a copy, which is good, right? We cannot use less memory than the data itself, right? Is that what you thought Vaex would do?
No, it will use the same amount of memory.
Yes, more so, and if you store your data on disk as hdf5 or arrow, we'll memory-map it so that it doesn't waste precious memory. Hope this helps.
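A minimal sketch of the memory-mapping idea using plain numpy (vaex's hdf5/arrow readers rely on the same OS mechanism; the file name here is made up for illustration):

```python
import os
import tempfile

import numpy as np

# One-time "export": write a large column to disk. In vaex this role is
# played by exporting the dataframe to an hdf5 or arrow file.
path = os.path.join(tempfile.mkdtemp(), "weights_col.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

# Later runs "open" the file memory-mapped: pages are pulled in lazily by
# the OS on access instead of reading the whole column into RAM up front.
col = np.load(path, mmap_mode="r")
print(type(col).__name__)  # memmap
print(col[123])            # 123.0
```

The key point is that after the one-time export, reopening the data costs almost no resident memory until you actually touch it.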
I am currently optimizing some of our Data Pipeline jobs (primarily optimizing for memory) and I had the idea of converting our pandas dataframes into vaex dataframes in our Load functions, which use psycopg2 to read from Postgres DB tables.
Upon converting to a vaex dataframe I was expecting memory to free up on the line where I made the change, but it is actually increasing memory usage slightly. The runtime improved slightly, though.
Our application then stores each of the loaded tables in global variables that are processed later in the pipeline.
Here is an example:
The parent function LoadCalEffects() uses three different global variables containing DataFrames that were loaded into memory by earlier Load functions in the pipeline. It is at the end of those three Load functions that I convert the pandas DataFrames to vaex DataFrames.
End of LoadWeights() function:
```python
gbl.weights = weights
gbl.weights_min_date = int(weights[params.ods_day_id].min().strftime('%Y%m%d'))
# Convert to vaex dataframes
gbl.weights = vx.from_pandas(gbl.weights)
```
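A toy illustration of why rebinding the global cannot lower memory here (assuming, as the maintainer's reply suggests, that `vx.from_pandas` wraps the existing column buffers rather than copying them; `WrappedFrame` below is a made-up stand-in, not vaex's real class):

```python
import weakref

import numpy as np

col = np.zeros(1_000_000)        # stands in for a pandas column (~8 MB)
alive = weakref.ref(col)         # lets us observe when the buffer dies

class WrappedFrame:
    """Stand-in for a vaex DataFrame built via from_pandas."""
    def __init__(self, arr):
        self.arr = arr           # keeps a reference to the SAME buffer

gbl_weights = WrappedFrame(col)  # like gbl.weights = vx.from_pandas(...)
del col                          # dropping the pandas-side name...
still_alive = alive() is not None  # ...does not free it: the wrapper pins it

del gbl_weights                  # only when every reference is gone...
freed = alive() is None          # ...can the memory actually be reclaimed
print(still_alive, freed)        # True True
```

So a zero-copy conversion at the end of the function can never show a memory drop in the profile: the same buffers stay alive, just reachable through a different object.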
If I do a memory profile of the function, I do not see any memory improvement from the conversion from a pandas DataFrame to a vaex DataFrame.
LoadWeights() - Pandas
```
Filename: ./load.py
Line # Mem usage Increment Occurences Line Contents
2281 203.8 MiB 203.8 MiB 1 @benchmarking.memory
2282 def LoadWeights():
2283
2284 203.8 MiB 0.0 MiB 1 logger.debug(log_params.START)
2285
2286 203.8 MiB 0.0 MiB 1 try:
2287 204.0 MiB 0.2 MiB 1 weights_query = ODSBuildQuery(dict=query_cfg['weights'])
2288 208.9 MiB 4.9 MiB 1 weights = ODSConnection.execute_select_query(weights_query)
2289 .rename(columns={
2290 208.9 MiB 0.0 MiB 1 'trend': 'Trend'
2291 208.9 MiB 0.0 MiB 1 , 'tuwin': 'TuWin'
2292 208.9 MiB 0.0 MiB 1 , 'wewin': 'WeWin'
2293 208.9 MiB 0.0 MiB 1 , 'thwin': 'ThWin'
2294 208.9 MiB 0.0 MiB 1 , 'frwin': 'FrWin'
2295 208.9 MiB 0.0 MiB 1 , 'sawin': 'SaWin'
2296 208.9 MiB 0.0 MiB 1 , 'suwin': 'SuWin'
2297 208.9 MiB 0.0 MiB 1 , 'mosum': 'MoSum'
2298 208.9 MiB 0.0 MiB 1 , 'tusum': 'TuSum'
2299 208.9 MiB 0.0 MiB 1 , 'wesum': 'WeSum'
2300 208.9 MiB 0.0 MiB 1 , 'thsum': 'ThSum'
2301 208.9 MiB 0.0 MiB 1 , 'frsum': 'FrSum'
2302 208.9 MiB 0.0 MiB 1 , 'sasum': 'SaSum'
2303 208.9 MiB 0.0 MiB 1 , 'susum': 'SuSum'
2304 208.9 MiB 0.0 MiB 1 , 'feb' : 'Feb'
2305 208.9 MiB 0.0 MiB 1 , 'mar' : 'Mar'
2306 208.9 MiB 0.0 MiB 1 , 'apr' : 'Apr'
2307 208.9 MiB 0.0 MiB 1 , 'may' : 'May'
2308 208.9 MiB 0.0 MiB 1 , 'jun' : 'Jun'
2309 208.9 MiB 0.0 MiB 1 , 'jul' : 'Jul'
2310 208.9 MiB 0.0 MiB 1 , 'aug' : 'Aug'
2311 208.9 MiB 0.0 MiB 1 , 'sep' : 'Sep'
2312 208.9 MiB 0.0 MiB 1 , 'oct' : 'Oct'
2313 208.9 MiB 0.0 MiB 1 , 'nov' : 'Nov'
2314 208.9 MiB 0.0 MiB 1 , 'dec' : 'Dec'
2315 })
2316 .astype({
2317 208.9 MiB 0.0 MiB 1 'Feb' : np.float64
2318 208.9 MiB 0.0 MiB 1 , 'Mar' : np.float64
2319 208.9 MiB 0.0 MiB 1 , 'Apr' : np.float64
2320 208.9 MiB 0.0 MiB 1 , 'May' : np.float64
2321 208.9 MiB 0.0 MiB 1 , 'Jun' : np.float64
2322 208.9 MiB 0.0 MiB 1 , 'Jul' : np.float64
2323 208.9 MiB 0.0 MiB 1 , 'Aug' : np.float64
2324 208.9 MiB 0.0 MiB 1 , 'Sep' : np.float64
2325 208.9 MiB 0.0 MiB 1 , 'Oct' : np.float64
2326 208.9 MiB 0.0 MiB 1 , 'Nov' : np.float64
2327 208.6 MiB -0.2 MiB 1 , 'Dec' : np.float64
2328 })
2329 216.5 MiB 7.9 MiB 1 weights.to_csv(params.local__weights, index = False)
2335 216.5 MiB 0.0 MiB 1 bucket_name = gbl.s3_bucket_extracts
2336 216.5 MiB 0.0 MiB 1 folder_name = gbl.s3_folder_extracts
2337
2338 216.6 MiB 0.1 MiB 1 DAE_S3.S3Connection(bucket_name, folder_name, params.local_seasonal_weights).S3upload()
2339
2340 except:
2341 logger.warning("Problem when loading weights.")
2342 weights = None
2343
2344 216.6 MiB 0.0 MiB 1 gbl.weights = weights
2345 216.6 MiB 0.0 MiB 1 gbl.weights_min_date = int(weights[params.ods_day_id].min().strftime('%Y%m%d'))
2347 216.6 MiB 0.0 MiB 1 logger.debug(log_params.END)
```
LoadWeights() - vaex
```
Line # Mem usage Increment Occurences Line Contents
2286 246.4 MiB 246.4 MiB 1 @benchmarking.memory
2287 def LoadWeights():
2288
2289 246.4 MiB 0.0 MiB 1 logger.debug(log_params.START)
2290
2291 246.4 MiB 0.0 MiB 1 try:
2292 246.4 MiB 0.0 MiB 1 weights_query = ODSBuildQuery(dict=query_cfg['seasonal_weights'])
2293 251.0 MiB 4.6 MiB 1 weights = execute_select_query(weights_query)
2294 .rename(columns={
2295 251.0 MiB 0.0 MiB 1 'trend': 'Trend'
2296 251.0 MiB 0.0 MiB 1 , 'tuwin': 'TuWin'
2297 251.0 MiB 0.0 MiB 1 , 'wewin': 'WeWin'
2298 251.0 MiB 0.0 MiB 1 , 'thwin': 'ThWin'
2299 251.0 MiB 0.0 MiB 1 , 'frwin': 'FrWin'
2300 251.0 MiB 0.0 MiB 1 , 'sawin': 'SaWin'
2301 251.0 MiB 0.0 MiB 1 , 'suwin': 'SuWin'
2302 251.0 MiB 0.0 MiB 1 , 'mosum': 'MoSum'
2303 251.0 MiB 0.0 MiB 1 , 'tusum': 'TuSum'
2304 251.0 MiB 0.0 MiB 1 , 'wesum': 'WeSum'
2305 251.0 MiB 0.0 MiB 1 , 'thsum': 'ThSum'
2306 251.0 MiB 0.0 MiB 1 , 'frsum': 'FrSum'
2307 251.0 MiB 0.0 MiB 1 , 'sasum': 'SaSum'
2308 251.0 MiB 0.0 MiB 1 , 'susum': 'SuSum'
2309 251.0 MiB 0.0 MiB 1 , 'feb' : 'Feb'
2310 251.0 MiB 0.0 MiB 1 , 'mar' : 'Mar'
2311 251.0 MiB 0.0 MiB 1 , 'apr' : 'Apr'
2312 251.0 MiB 0.0 MiB 1 , 'may' : 'May'
2313 251.0 MiB 0.0 MiB 1 , 'jun' : 'Jun'
2314 251.0 MiB 0.0 MiB 1 , 'jul' : 'Jul'
2315 251.0 MiB 0.0 MiB 1 , 'aug' : 'Aug'
2316 251.0 MiB 0.0 MiB 1 , 'sep' : 'Sep'
2317 251.0 MiB 0.0 MiB 1 , 'oct' : 'Oct'
2318 251.0 MiB 0.0 MiB 1 , 'nov' : 'Nov'
2319 251.0 MiB 0.0 MiB 1 , 'dec' : 'Dec'
2320 })
2321 .astype({
2322 251.0 MiB 0.0 MiB 1 'Feb' : np.float64
2323 251.0 MiB 0.0 MiB 1 , 'Mar' : np.float64
2324 251.0 MiB 0.0 MiB 1 , 'Apr' : np.float64
2325 251.0 MiB 0.0 MiB 1 , 'May' : np.float64
2326 251.0 MiB 0.0 MiB 1 , 'Jun' : np.float64
2327 251.0 MiB 0.0 MiB 1 , 'Jul' : np.float64
2328 251.0 MiB 0.0 MiB 1 , 'Aug' : np.float64
2329 251.0 MiB 0.0 MiB 1 , 'Sep' : np.float64
2330 251.0 MiB 0.0 MiB 1 , 'Oct' : np.float64
2331 251.0 MiB 0.0 MiB 1 , 'Nov' : np.float64
2332 250.6 MiB -0.4 MiB 1 , 'Dec' : np.float64
2333 })
2334 254.7 MiB 4.1 MiB 1 weights.to_csv(params.local_seasonal_weights, index=False)
2340 254.7 MiB 0.0 MiB 1 bucket_name = gbl.s3_bucket_extracts
2341 254.7 MiB 0.0 MiB 1 folder_name = gbl.s3_folder_extracts
2343 254.9 MiB 0.2 MiB 1 DAE_S3.S3Connection(bucket_name, folder_name, params.local_seasonal_weights).S3upload()
2344
2345 except:
2346 logger.warning("Problem when loading seasonal_weights.")
2347 weights = None
2348
2349 254.9 MiB 0.0 MiB 1 gbl.weights = seasonal_weights
2350 254.9 MiB 0.0 MiB 1 gbl.weights_min_date = int(seasonal_weights[params.ods_day_id].min().strftime('%Y%m%d'))
2351 # Convert to vx dataframes
2352 254.9 MiB 0.0 MiB 1 gbl.weights = vx.from_pandas(gbl.weights)
2353
2354 254.9 MiB 0.0 MiB 1 logger.debug(log_params.END)
```
As you can see, I only convert to vaex at the end of the function; my thinking was that this would free up the memory used by the pandas version of the weights dataframe stored in gbl.weights.
The reason I am only converting at the end is that I want to be minimally invasive in terms of refactoring these functions (due to time constraints). I think we will be looking into moving over to vaex in the future.
I would greatly appreciate it if anyone could tell me why this is not working as I expect, as my understanding may be incorrect. And of course, if anyone could point out the correct way to implement what I am trying to achieve, that would be appreciated too.
Edit
I think my understanding of how vaex works may not have been correct. I believe the memory and runtime benefits are only realized when all the data manipulation is done using vaex dataframes (instead of pandas). I was only converting the global variables to vaex before the Load functions return, so that they return vaex dataframes.
I thought that by doing this I would free up the memory taken up by the pandas DataFrames, since the data would now be stored as vaex dataframes.
I believe that vaex would be a great help if I were to refactor all of our load functions to process our data using vaex dataframes, but that is out of the scope of my current task.
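A rough mental model of where those benefits would come from (a toy sketch in plain numpy, not vaex's actual machinery): pandas-style operations materialize a full-size result at every step, while vaex records expressions as "virtual columns" and evaluates them lazily, in chunks, only when needed.

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Eager (pandas-style): each step allocates a full-size intermediate array.
eager = np.sqrt(x) + 1.0

# Lazy (vaex-style virtual column): store the recipe, not the result.
# Nothing full-size is allocated until evaluation is requested.
recipe = lambda a: np.sqrt(a) + 1.0

# Evaluate only a chunk at a time, which caps peak memory per step.
chunk = recipe(x[:100_000])
print(np.allclose(chunk, eager[:100_000]))  # True
```

This is why converting at the very last line of a pandas pipeline changes nothing: all the full-size intermediates have already been allocated by then.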
I think for version 3.0 of our application I will be pushing for: