[BUG-REPORT] MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268 #1067
Comments
Hi Andriy, very good questions. First, reading from CSV using pandas (which is now the default) will put the data in memory, and worse, the string data ends up as an ndarray with dtype=object. To work with the strings efficiently, we have to convert them to Arrow on the fly (in memory). Saving the CSV to HDF5 or Arrow would solve that, and we make that easy with the convert argument:

df = vaex.from_csv(path, convert=True)  # does a one-time conversion the first time

This will change once #1028 gets in, which is fast enough to do on-the-fly CSV reading in the proper format, but I don't think we'll get that into v4.

The next issue is that vaex by default processes 1,048,576 rows (1024**2) at a time. I just opened PR #1068 to fix this (a long-standing annoyance), but for now you can do:

df = vaex.from_csv(path, convert=True)  # does a one-time conversion the first time
df.executor.buffer_size = 10_000  # process 10k rows at a time; note this will change soon, via the PR mentioned above

Also, I see you are using apply.

cheers, Maarten
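Putting the two suggestions together, a minimal sketch (the CSV path is a placeholder; buffer_size is the value suggested above):

```python
import vaex

# Convert the CSV to a memory-mappable format once; later runs reuse the converted file.
df = vaex.from_csv("documents.csv", convert=True)  # "documents.csv" is a placeholder path

# Process 10,000 rows per chunk instead of the default 1024**2.
df.executor.buffer_size = 10_000
```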
Thank you, Maarten. As for "apply", I believe I need it because I use the hyperscan library to extract keywords from the text documents. In each document, hyperscan searches for matches of ~50,000 regular expressions. I don't think it can be vectorized. Do you see how?

Your suggestion of setting df.executor.buffer_size = 10_000 didn't help; it still says "MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268", so it still seems to try to allocate memory for the entire dataset. Here's the code:

Can you see where this allocation of memory for 67033 records could happen?
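For illustration, a rough sketch of this kind of pipeline; Python's re module stands in for the hyperscan matcher, and the paths and the "text" column name are assumptions rather than the reporter's actual code:

```python
import re
import vaex

# Placeholder patterns; the real workload compiles ~50,000 regular expressions with hyperscan.
patterns = [re.compile(p) for p in (r"foo", r"bar")]

def extract_keywords(text):
    # Stand-in UDF: return a ';'-joined list of the patterns that match the document.
    return ";".join(p.pattern for p in patterns if p.search(text))

df = vaex.from_csv("documents.csv", convert=True)    # one-time conversion to HDF5/Arrow
df["keywords"] = df["text"].apply(extract_keywords)  # lazy UDF, evaluated chunk by chunk
df[["text", "keywords"]].export_csv("keywords.csv")  # the step where the MemoryError appears
```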
Finally reproduced your issue; luckily the solution is quite simple. export_csv takes an extra argument (e.g. chunk_size=10_000), and then it will write that many rows at a time (and thus only put that many rows into memory). Indeed, in a memory-constrained Docker container this gets killed without the argument; with it, it works. Hope this solves it!
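Assuming the same placeholder names as in the sketch above, that looks roughly like:

```python
# Write 10,000 rows per chunk so only that many rows of string data are materialized at a time.
df[["text", "keywords"]].export_csv("keywords.csv", chunk_size=10_000)
```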
It's actually an interesting use case of vaex; I never thought of it as something you'd use on tiny VMs, but it makes sense. Let us know if you encounter more memory issues.
This should be fixed in the latest release.
Description
I'm currently testing vaex on a very tiny VM on AWS just to see its limits. I run a UDF on a dataset of 67k rows, where each row has a single text column. The UDF extracts keywords from each text. In the end, I save the extracted keywords and the original texts (two columns) to a CSV file.
If I reduce the size of the dataset to 10k rows by slicing, everything works fine. But for the entire set of 67k rows I get the following error:
I'm trying to figure out why vaex tries to put all of the data into memory. Why doesn't it split the data into smaller chunks so that it fits? Also, it's not clear when the error happens. Does it happen when vaex tries to save the result to a CSV?
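For reference, the slicing workaround mentioned above would look roughly like this (paths are placeholders):

```python
# Exporting a 10k-row slice succeeds; exporting all 67k rows without chunking runs out of memory.
df[:10_000].export_csv("keywords_10k.csv")
```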
Software information
Vaex version (import vaex; vaex.__version__): {'vaex-core': '4.0.0a5'}