Fix a potential data corruption for Pandas UDF #9942

firestarman · 2023-12-04T06:56:05Z

This PR moves the BatchQueue into the DataProducer so that it shares the same lock as the output iterator returned by asIterator, and makes moving a batch from the input iterator to the batch queue an atomic operation. This eliminates the race that could occur when batches were appended to the queue.
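The locking idea described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the plugin's code: the class name `ProducerSketch` and its methods `as_iterator` and `remove` are hypothetical stand-ins for the roles of DataProducer, asIterator, and BatchQueue. The assumption is that the output iterator and the queue consumer may run on different threads, so pulling a batch from the input and appending it to the queue happen under one lock.

```python
import threading
from collections import deque


class ProducerSketch:
    """Hypothetical stand-in: wraps an input iterator and owns the pending-batch queue."""

    def __init__(self, source):
        self._source = iter(source)
        self._pending = deque()        # plays the role of the batch queue
        self._lock = threading.Lock()  # one lock shared by both access paths

    def as_iterator(self):
        """Output iterator: pulls from the input and records each batch in the queue."""
        while True:
            with self._lock:
                # Pull and enqueue under the same lock, so no other thread can
                # see a batch as taken from the input but not yet queued.
                try:
                    batch = next(self._source)
                except StopIteration:
                    return
                self._pending.append(batch)
            # The lock is released before the batch is handed to the caller.
            yield batch

    def remove(self):
        """Take the oldest pending batch, or None if nothing is queued."""
        with self._lock:
            return self._pending.popleft() if self._pending else None
```

With a single lock, a consumer calling `remove()` either sees a batch fully queued or not at all, which is the property the PR description refers to as atomic batch movement.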

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

firestarman · 2023-12-04T07:59:32Z

build

* Download Maven from apache.org archives (#10225)

  Fixes #10224

  Replace broken install using apt by downloading Maven from apache.org.

  Signed-off-by: Gera Shegalov <gera@apache.org>

* Fix a hang for Pandas UDFs on DB 13.3[databricks] (#9833)

  fix #9493
  fix #9844

  The Python runner uses two separate threads to write data to and read data from the Python processes. On DB 13.3, however, it becomes single-threaded, which means reading and writing run on the same thread, and the first read always happens before the first write. The original BatchQueue blocks the first read until the first write is done, so it waits forever.

  Changes made:
  - Update the BatchQueue to support asking for a batch instead of waiting until one is inserted into the queue. This eliminates the ordering requirement between reading and writing.
  - Introduce a new class named BatchProducer to work with the new BatchQueue and support peeking the row count on demand for reading.
  - Apply this new BatchQueue to the relevant plans.
  - Update the Python runners to support writing one batch at a time for the single-threaded model.
  - Found an issue with PythonUDAF and RunningWindowFunctionExec, which may be a bug specific to DB 13.3, and added a test (test_window_aggregate_udf_on_cpu) for it.
  - Other small refactors

  ---------

  Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Fix a potential data corruption for Pandas UDF (#9942)

  This PR moves the BatchQueue into the DataProducer to share the same lock as the output iterator returned by asIterator, and makes the batch movement from the input iterator to the batch queue an atomic operation to eliminate the race when appending the batches to the queue.

* Do some refactor for the Python UDF code to try to reduce duplicate code. (#9902)

  Signed-off-by: Firestarman <firestarmanllc@gmail.com>

* Fixed 330db Shims to Adopt the PythonRunner Changes [databricks] (#10232)

  This PR removes the old 330db shims in favor of the new Shims, similar to the one in 341db.

  **Tests:** Ran udf_test.py on Databricks 11.3 and they all passed.

  fixes #10228

  ---------

  Signed-off-by: raza jafri <rjafri@nvidia.com>

---------

Signed-off-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: raza jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gera@apache.org>
Co-authored-by: Liangcai Li <firestarmanllc@gmail.com>
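The hang fix in #9833 described in the commit message above rests on the queue asking its source for a batch rather than blocking until a writer inserts one. A rough Python sketch of that idea follows, assuming a single-threaded caller. The class `OnDemandQueue` and its methods `peek_num_rows` and `remove` are hypothetical names for illustration, not the plugin's API.

```python
from collections import deque


class OnDemandQueue:
    """Hypothetical queue that fetches from its source instead of blocking when empty."""

    def __init__(self, source):
        self._source = iter(source)
        self._pending = deque()

    def _fill_one(self):
        # Pull one batch from the source only if nothing is queued yet.
        if not self._pending:
            batch = next(self._source, None)
            if batch is not None:
                self._pending.append(batch)

    def peek_num_rows(self):
        """Row count of the next batch, or None when the input is exhausted."""
        self._fill_one()
        return len(self._pending[0]) if self._pending else None

    def remove(self):
        """Return the next batch, or None when the input is exhausted."""
        self._fill_one()
        return self._pending.popleft() if self._pending else None
```

Because the reader can trigger the fetch itself, it no longer matters whether the first read or the first write happens first, which is what removes the deadlock in the single-threaded model.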

Fix a potential data corruption for Pandas UDF

5de4639

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

firestarman requested review from revans2, GaryShen2008 and winningsix December 4, 2023 08:42

revans2 approved these changes Dec 4, 2023

sameerz added the bug Something isn't working label Dec 4, 2023

firestarman merged commit 07c6163 into NVIDIA:branch-24.02 Dec 5, 2023
39 checks passed

firestarman deleted the udf-lock branch December 5, 2023 01:26
