
[BUG-REPORT] MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268 #1067

Closed
aburkov opened this issue Nov 22, 2020 · 5 comments

Comments

@aburkov

aburkov commented Nov 22, 2020

Description
I'm currently testing vaex on a very small VM on AWS just to see its limits. I run a UDF on a dataset of 67k rows; each row has a single text column. The UDF extracts keywords from each text. At the end, I save the extracted keywords and the original texts (two columns) to a CSV file.

If I reduce the dataset to 10k rows by slicing, everything works fine. But for the entire set of 67k rows I get the following error:

ERROR:ThreadPoolExecutor-2_0:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 101, in evaluate
    result = self[expression]
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 163, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'lambda_function(original_description)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/execution.py", line 163, in execute_async
    async for element in self.thread_pool.map_async(self.process_part, dataset.chunk_iterator(columns),
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/multithreading.py", line 90, in map_async
    value = await value
  File "/home/someuser/anaconda3/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/multithreading.py", line 86, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/multithreading.py", line 78, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/execution.py", line 229, in process_part
    block_dict = {expression: block_scope.evaluate(expression) for expression in run.expressions}
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/execution.py", line 229, in <dictcomp>
    block_dict = {expression: block_scope.evaluate(expression) for expression in run.expressions}
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 101, in evaluate
    result = self[expression]
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 154, in __getitem__
    values = self.evaluate(expression)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 107, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/arrow/numpy_dispatch.py", line 75, in wrapper
    result = f(*args, **kwargs)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/expression.py", line 1175, in __call__
    result = np.array(result)
MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268

I'm trying to figure out why vaex tries to put the entire dataset into memory. Why doesn't it split the data into smaller chunks that fit in memory? Also, it's not clear when the error happens. Does it happen when vaex tries to save the result to a CSV?

Software information

  • Vaex version (import vaex; vaex.__version__): {'vaex-core': '4.0.0a5'}
  • Vaex was installed via: pip
  • OS: Amazon Linux V2
@maartenbreddels
Member

maartenbreddels commented Nov 22, 2020

Hi Andriy,

very good questions. First, reading from CSV using pandas (which is currently the default) puts the data in memory, and worse, the string data ends up as an ndarray with dtype=object. To work with the strings efficiently, we have to convert them to Arrow on the fly (in memory). Converting the CSV to HDF5 or Arrow would solve that, and we make that easy with the convert argument:

df = vaex.from_csv(path, convert=True)  # does a one-time conversion the first time it runs

This will change once #1028 gets in, which is fast enough to do on-the-fly CSV reading in the proper format. But I don't think we'll get this into v4.

The next issue here is that vaex by default processes 1,048,576 rows (1024**2) at a time. I just opened PR #1068 to fix this (it was a long-standing annoyance), but for now, you can do:

df = vaex.from_csv(path, convert=True)  # does a one-time conversion the first time it runs
df.executor.buffer_size = 10_000  # process 10k rows at a time; note that this will change soon with the PR mentioned above

Also, I see you are using .apply. While it is a good last resort, avoid it if possible. .apply is meant as a black box and is not vectorized. Using, for instance, df.original_description.str.lower() etc. would be much faster and more memory-efficient (see e.g. https://vaex.io/blog/vaex-a-dataframe-with-super-strings, and the sketch below). We can try to improve .apply in various ways, but I think it should not be used, because we lose information about what a user actually wants to do and cannot optimize anything (we are forced to execute an arbitrary Python function that holds the GIL).
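
For illustration only (the data and operations here are made up, not your actual keyword extraction), a vectorized version built from vaex's string methods stays lazy and never calls back into Python per row:

import vaex

# hypothetical data, just to show vectorized string expressions
df = vaex.from_arrays(original_description=["Foo bar", "BAR baz", "foo foo"])
df["description_lower"] = df.original_description.str.lower()  # lazy virtual column
df["foo_count"] = df.description_lower.str.count("foo")        # vectorized, no per-row Python call
print(df)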

cheers,

Maarten

@aburkov
Author

aburkov commented Nov 22, 2020

Thank you, Maarten. As for "apply", I believe I need it because I use the hyperscan library to extract keywords from the text documents. In each document, hyperscan searches for matches of ~50,000 regular expressions. I don't think it can be vectorized. Do you see a way?

Your suggestion of setting df.executor.buffer_size = 10_000 didn't help; it still says "MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268", so it still seems to try to allocate memory for the entire dataset.

Here's the code:

df = vaex.open('s3://...parquet')

df = df[["original_description"]]
df.executor.buffer_size = 10_000

extractor = Extractor()  # here hyperscan creates a huge database of regular expressions

def extractor_function(input):
    return json.dumps(extractor.match(input))

df["keywords"] = df.apply(extractor_function, arguments=[df.original_description])

df.export_csv('/home/username/output_data.csv')

Can you see where this allocation of memory for 67033 records could happen?

@maartenbreddels
Member

I finally reproduced your issue; luckily, the solution is quite simple: export_csv takes an extra argument (e.g. chunk_size=10_000), and it will then write that many rows at a time (and thus only put that many rows into memory).
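
Applied to your snippet above (the chunk size is just an example value):

# write the CSV in chunks, so only chunk_size rows of the evaluated
# keywords column are materialized in memory at a time
df.export_csv('/home/username/output_data.csv', chunk_size=10_000)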

Indeed, in a memory-constrained Docker container this gets killed without the argument; with it, it works.

Hope this solved it!

@maartenbreddels
Member

It's actually an interesting use case for vaex; I never thought of it as something you'd use on tiny VMs, but it makes sense. Let us know if you encounter more memory issues.

@maartenbreddels
Member

This should be fixed in the latest release:

pip install vaex-core==4.0.0a10
or
pip install vaex==4.0.0a5
