
[BUG-REPORT] MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268 #1067

Closed
aburkov opened this issue Nov 22, 2020 · 5 comments

Comments

@aburkov

aburkov commented Nov 22, 2020

Description
I'm currently testing vaex on a very small VM on AWS just to see its limits. I run a UDF on a dataset of 67k rows; each row has a single text column. The UDF extracts keywords from each text. At the end, I save the extracted keywords and the original texts (two columns) to a CSV file.

If I reduce the dataset to 10k rows by slicing, everything works fine. But for the entire set of 67k rows I get the following error:

ERROR:ThreadPoolExecutor-2_0:vaex.execution:error in task, flush task queue
Traceback (most recent call last):
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 101, in evaluate
    result = self[expression]
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 163, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: 'lambda_function(original_description)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/execution.py", line 163, in execute_async
    async for element in self.thread_pool.map_async(self.process_part, dataset.chunk_iterator(columns),
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/multithreading.py", line 90, in map_async
    value = await value
  File "/home/someuser/anaconda3/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/multithreading.py", line 86, in <lambda>
    iterator = (loop.run_in_executor(self, lambda value=value: wrapped(value)) for value in iterator)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/multithreading.py", line 78, in wrapped
    return callable(self.local.index, *args, **kwargs, **kwargs_extra)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/execution.py", line 229, in process_part
    block_dict = {expression: block_scope.evaluate(expression) for expression in run.expressions}
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/execution.py", line 229, in <dictcomp>
    block_dict = {expression: block_scope.evaluate(expression) for expression in run.expressions}
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 101, in evaluate
    result = self[expression]
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 154, in __getitem__
    values = self.evaluate(expression)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/scopes.py", line 107, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/arrow/numpy_dispatch.py", line 75, in wrapper
    result = f(*args, **kwargs)
  File "/home/someuser/anaconda3/lib/python3.8/site-packages/vaex/expression.py", line 1175, in __call__
    result = np.array(result)
MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268

I'm trying to figure out why vaex tries to put the entire dataset into memory. Why doesn't it split the data into smaller chunks that fit in memory? Also, it's not clear when the error happens. Does it happen when vaex tries to save the result to a CSV?

Software information

  • Vaex version (import vaex; vaex.__version__): {'vaex-core': '4.0.0a5'}
  • Vaex was installed via: pip
  • OS: Amazon Linux V2
@maartenbreddels
Member

maartenbreddels commented Nov 22, 2020

Hi Andriy,

very good questions. First, reading from CSV using pandas (which is currently the default) puts the data in memory, and worse, the string data ends up as an ndarray with dtype=object. To work with the strings efficiently, we have to convert them to Arrow on the fly (in memory). Converting the CSV to HDF5 or Arrow would solve that, and we make that easy with the convert argument:

df = vaex.from_csv(path, convert=True)  # does a one-time conversion the first time it runs

This will change once #1028 gets in, which is fast enough to do on-the-fly CSV reading in the proper format. But I don't think we'll get this into v4.

The next issue here is that vaex by default processes 1,048,576 rows (1024**2) at a time. I just opened PR #1068 to fix this (it was a long-standing annoyance), but for now, you can do:

df = vaex.from_csv(path, convert=True)  # does a one-time conversion the first time it runs
df.executor.buffer_size = 10_000  # process 10k rows at a time; note that this will change soon with the PR mentioned above

Also, I see you are using .apply. While it is a good last resort, avoid it if possible. .apply is meant as a black box and is not vectorized. Using, for instance, df.original_description.str.lower() etc. would be much faster and more memory-efficient (see e.g. https://vaex.io/blog/vaex-a-dataframe-with-super-strings, and the sketch below). We can try to improve .apply in various ways, but I think it should not be used, because we lose information about what a user actually wants to do and cannot optimize anything (we are forced to execute an arbitrary Python function that holds the GIL).
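
For illustration only (the data and operations here are made up, not your actual keyword extraction), a vectorized version built from vaex's string methods stays lazy and never calls back into Python per row:

import vaex

# hypothetical data, just to show vectorized string expressions
df = vaex.from_arrays(original_description=["Foo bar", "BAR baz", "foo foo"])
df["description_lower"] = df.original_description.str.lower()  # lazy virtual column
df["foo_count"] = df.description_lower.str.count("foo")        # vectorized, no per-row Python call
print(df)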

cheers,

Maarten

@aburkov
Author

aburkov commented Nov 22, 2020

Thank you, Maarten. As for "apply", I believe I need it because I use the hyperscan library to extract keywords from the text documents. In each document, hyperscan searches for matches of ~50,000 regular expressions. I don't think it can be vectorized. Do you see a way?

Your suggestion of setting df.executor.buffer_size = 10_000 didn't help; it still says "MemoryError: Unable to allocate 1.07 GiB for an array with shape (67033,) and data type <U4268", so it still seems to try to allocate memory for the entire dataset.

Here's the code:

df = vaex.open('s3://...parquet')

df = df[["original_description"]]
df.executor.buffer_size = 10_000

extractor = Extractor()  # here hyperscan creates a huge database of regular expressions

def extractor_function(input):
    return json.dumps(extractor.match(input))

df["keywords"] = df.apply(extractor_function, arguments=[df.original_description])

df.export_csv('/home/username/output_data.csv')

Can you see where this allocation of memory for 67033 records could happen?

@maartenbreddels
Member

I finally reproduced your issue; luckily, the solution is quite simple: export_csv takes an extra argument (e.g. chunk_size=10_000), and it will then write that many rows at a time (and thus only put that many rows into memory).
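
Applied to your snippet above (the chunk size is just an example value):

# write the CSV in chunks, so only chunk_size rows of the evaluated
# keywords column are materialized in memory at a time
df.export_csv('/home/username/output_data.csv', chunk_size=10_000)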

Indeed, in a memory-constrained Docker container this gets killed without the argument; with it, it works.

Hope this solved it!

@maartenbreddels
Member

It's actually an interesting use case for vaex; I never thought of it as something you'd use on tiny VMs, but it makes sense. Let us know if you encounter more memory issues.

@maartenbreddels
Member

This should be fixed in the latest release:

pip install vaex-core==4.0.0a10
or
pip install vaex==4.0.0a5
