Remove data spilling that acts on `target` threshold #8820

fjetter · 2024-08-07T12:49:51Z

We currently have two distinct spill-to-disk mechanism

If a task completes and the memory of a worker is above target it write the data to disk immediately, i.e. this is basically a LIFO policy. Note: This functionality is implemented in zict
If during a recurring (i.e. PeriodicCallback) check, the worker memory is above spill we are starting to write data to disk based on a simple LRU policy.

I believe we should get rid of the entire 1.) mechanism, i.e. deprecate the target threshold and disable implict/automatic zict based spilling entirely.

First of all, this mechanism is confusing and I believe very few people actually understand the differences, let alone know which value to increase when and why.

More importantly, though, the first and second mechanism are working on contradicting eviction policies. While it's non-trivial to tell which policy is the best it is not helpful to have two contradicting policies live.

I am suggesting to remove target over spill for a couple of different reasons

I think LRU is better than LIFO especially for reducing memory pressure since we hope to process done results as quickly as possible again shortly after it finished
The background coroutine can counteract memory pressure while a task is running
The target mechanism actually adds significantly to code complexity due to how error handling is implemented (the mutable mapping interface is pretty leaky here, e.g. if serialization fails)
The background task decouples us to a large extend from zict which opens the possibilities for much easier asnyc spilling / finer control over what we spill and where (note: there is a ticket somewhere about async disk that already suggests to drop the mutable mapping interface entirely)

The text was updated successfully, but these errors were encountered:

hendrikmakait · 2024-08-22T12:18:56Z

If a task completes and the memory of a worker is above target it write the data to disk immediately, i.e. this is basically a LIFO policy. Note: This functionality is implemented in zict

From what I understand, this mechanism works slightly different from what is described here. If a task completes and the size of its output data is above target, we write the data to disk immediately.

Chances are pretty high that this will evict everything from the LRU on the next callback because overall memory consumption exceeded the spill threshold. Unless there's just a single one of these objects on the worker, it might also run into memory issues after this. Another potential way of dealing with this case would be failing any task that creates output data exceeding target.

hendrikmakait · 2024-08-22T12:22:53Z

Yet another alternative would be to immediately trigger a new execution of the callback. This would take care of spilling and potentially pausing the worker before another task starts adding even more memory pressure.

fjetter · 2024-08-23T07:44:39Z

From what I understand, this mechanism works slightly different from what is described here. If a task completes and the size of its output data is above target, we write the data to disk immediately.

Well, you are right. I checked again and I think the logic changed over the years. The LRU logic in zict is defined here
https://github.com/dask/zict/blob/7e5dafedcf016a4ac61286badbf1f8da3741d2d3/zict/lru.py#L135-L166
which defines this behavior.
An important assumption of this ticket is indeed wrong. As long as that single key we're writing is not above the target threshold (which is a case that probably never actually happens) we are evicting based on LRU policy after all. That explains why we haven't seen any large change in behavior when disabling this.

I still stand by removing this. I feel the encapsulated spilling in zict is adding complexity and the two thresholds are not user friendly.

github-actions bot added the needs triage label Aug 7, 2024

fjetter added enhancement Improve existing functionality or make things work better performance memory and removed needs triage labels Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove data spilling that acts on `target` threshold #8820

Remove data spilling that acts on `target` threshold #8820

fjetter commented Aug 7, 2024

hendrikmakait commented Aug 22, 2024

hendrikmakait commented Aug 22, 2024

fjetter commented Aug 23, 2024

Remove data spilling that acts on target threshold #8820

Remove data spilling that acts on target threshold #8820

Comments

fjetter commented Aug 7, 2024

hendrikmakait commented Aug 22, 2024

hendrikmakait commented Aug 22, 2024

fjetter commented Aug 23, 2024

Remove data spilling that acts on `target` threshold #8820

Remove data spilling that acts on `target` threshold #8820