
[feat] optimize read, pushdown limit to file level for to_arrow #1038

Closed

kevinjqliu opened this issue Aug 11, 2024 · 3 comments

@kevinjqliu (Contributor)

Feature Request / Improvement

As of now, the limit is only checked after an entire parquet file has been read:

executor = ExecutorFactory.get_or_create()
futures = [
    executor.submit(
        _task_to_table,
        fs,
        task,
        bound_row_filter,
        projected_schema,
        projected_field_ids,
        deletes_per_file.get(task.file.file_path),
        case_sensitive,
        table_metadata.name_mapping(),
        use_large_types,
    )
    for task in tasks
]
total_row_count = 0
# for consistent ordering, we need to maintain future order
futures_index = {f: i for i, f in enumerate(futures)}
completed_futures: SortedList[Future[pa.Table]] = SortedList(iterable=[], key=lambda f: futures_index[f])
for future in concurrent.futures.as_completed(futures):
    completed_futures.add(future)
    if table_result := future.result():
        total_row_count += len(table_result)
    # stop early if limit is satisfied
    if limit is not None and total_row_count >= limit:
        break
# by now, we've either completed all tasks or satisfied the limit
if limit is not None:
    _ = [f.cancel() for f in futures if not f.done()]

Optimization: push the limit down to the parquet reading level, so a scan can stop producing rows as soon as the limit is satisfied instead of reading each file in full.

For more details, see this comment
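A minimal, hypothetical sketch of what stopping a parquet read early could look like with plain PyArrow, assuming ParquetFile.iter_batches is used to stream record batches instead of materializing the whole file. This is not the PyIceberg internals; read_with_limit is an illustrative helper name only.

import pyarrow as pa
import pyarrow.parquet as pq


def read_with_limit(path: str, limit: int) -> pa.Table:
    """Read at most `limit` rows from a parquet file, stopping early."""
    parquet_file = pq.ParquetFile(path)
    batches = []
    rows_read = 0
    for batch in parquet_file.iter_batches():
        if rows_read + len(batch) > limit:
            # trim the final batch so exactly `limit` rows are returned
            batch = batch.slice(0, limit - rows_read)
        batches.append(batch)
        rows_read += len(batch)
        if rows_read >= limit:
            break
    return pa.Table.from_batches(batches, schema=parquet_file.schema_arrow)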

@soumya-ghosh (Contributor)

@kevinjqliu I would like to work on this one.

@kevinjqliu (Contributor, Author)

sure @soumya-ghosh, assigned to you

The solution might look similar to what is already done for project_batches in #1042:

for task in tasks:
    # stop early if limit is satisfied
    if limit is not None and total_row_count >= limit:
        break
    batches = _task_to_record_batches(
        fs,
        task,
        bound_row_filter,
        projected_schema,
        projected_field_ids,
        deletes_per_file.get(task.file.file_path),
        case_sensitive,
        table_metadata.name_mapping(),
        use_large_types,
    )
    for batch in batches:
        if limit is not None:
            if total_row_count >= limit:
                break
            elif total_row_count + len(batch) >= limit:
                batch = batch.slice(0, limit - total_row_count)
        yield batch
        total_row_count += len(batch)
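A tiny standalone check of the slicing step above, just to illustrate how the final batch is trimmed so the total row count lands exactly on the limit (values are made up):

import pyarrow as pa

# suppose 5 rows have already been yielded and the limit is 7
batch = pa.RecordBatch.from_pydict({"id": list(range(10))})
limit, total_row_count = 7, 5
if total_row_count + len(batch) >= limit:
    batch = batch.slice(0, limit - total_row_count)
assert len(batch) == 2  # only the rows needed to reach the limit are kept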

@kevinjqliu (Contributor, Author)

Closed by #1043 (see comment)
