Improve `opds2_feed_reaper` performance for large feeds. (PP-1756) #2089

tdilauro · 2024-09-26T21:41:50Z

Description

Makes memory use and database query size more predictable for the opds2_feed_reaper script.

Motivation and Context

The previous approach could not handle very large feeds (somewhere above 150K items) and performed poorly even on smaller feeds.

[Jira PP-1756]

How Has This Been Tested?

Tested in a local dev environment, using a database from the dev server.

Checklist

N/A - I have updated the documentation accordingly.
All new and existing tests passed.

codecov · 2024-09-26T21:50:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.78%. Comparing base (344383b) to head (7742eb4).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2089      +/-   ##
==========================================
+ Coverage   90.67%   90.78%   +0.10%     
==========================================
  Files         344      344              
  Lines       40585    40580       -5     
  Branches     6583     8819    +2236     
==========================================
+ Hits        36801    36839      +38     
+ Misses       2506     2484      -22     
+ Partials     1278     1257      -21

tdilauro · 2024-09-27T12:34:13Z

bin/opds2_reaper_monitor

-        to_be_reaped_qu = unlimited_access_license_pools_qu.join(Identifier).filter(
-            ~Identifier.id.in_(identifier_ids)
-        )


The key change starts here. We remove this join and its `~Identifier.id.in_(identifier_ids)` filter, which scaled poorly because the SELECT grew linearly with the number of identifiers in the feed.

The other part of this change is below, where we iterate over all eligible license pools locally and reap the ones that are not mentioned in the feed.

tdilauro · 2024-09-27T12:35:14Z

bin/opds2_reaper_monitor

+        for pool in eligible_license_pools_qu.options(raiseload("*")).yield_per(
+            query_batch_size
+        ):
+            if pool.identifier_id not in identifier_ids:
+                reap_count += 1
+                # Don't actually reap, unless this is explicitly NOT a dry run.
+                if self.dry_run is False:
+                    pool.unlimited_access = False


This is the second part of the main change: we iterate over all eligible license pools locally and reap the ones that are not mentioned in the feed.
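For readers without the codebase at hand, the pattern in the hunk above can be sketched without a database. This is only an illustration under stated assumptions: `Pool`, `stream_in_batches`, and `reap` are hypothetical names, the batching generator merely imitates what `yield_per` does for a real query, and the feed's identifier IDs are assumed to be held in a `set` so that each membership check is cheap.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Set


@dataclass
class Pool:
    """Hypothetical stand-in for a license pool row (not the real model)."""

    identifier_id: int
    unlimited_access: bool = True


def stream_in_batches(pools: List[Pool], batch_size: int) -> Iterator[Pool]:
    """Yield pools one slice at a time, loosely imitating `yield_per`."""
    for start in range(0, len(pools), batch_size):
        yield from pools[start : start + batch_size]


def reap(pools: Iterable[Pool], feed_identifier_ids: Set[int], dry_run: bool = True) -> int:
    """Count pools missing from the feed; only modify them when not a dry run."""
    reap_count = 0
    for pool in pools:
        # Membership is checked locally, so no query has to carry the ID list.
        if pool.identifier_id not in feed_identifier_ids:
            reap_count += 1
            if dry_run is False:
                pool.unlimited_access = False
    return reap_count


# Ten hypothetical pools; the feed only mentions the even identifier IDs.
pools = [Pool(identifier_id=i) for i in range(1, 11)]
feed_ids = {2, 4, 6, 8, 10}

dry_count = reap(stream_in_batches(pools, batch_size=3), feed_ids, dry_run=True)
real_count = reap(stream_in_batches(pools, batch_size=3), feed_ids, dry_run=False)
```

In this sketch the work per pool is a single set lookup, and the number of rows held at once is bounded by the batch size rather than by the size of the feed.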

jonathangreen

Looks good

Improve opds2_feed_reaper performance for large feeds. (PP-1756)

210e3f2

tdilauro requested a review from jonathangreen September 26, 2024 21:42

tdilauro commented Sep 27, 2024

jonathangreen approved these changes Sep 27, 2024

Also need to batch identifier lookups.

7742eb4
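The follow-up commit above mentions batching identifier lookups, but its diff is not shown here. As a generic sketch of that idea only, and not the code from the commit, a long list of identifiers can be split into fixed-size chunks so that each lookup stays bounded. The names `batched`, `lookup_ids`, and `fake_lookup` are hypothetical, and the in-memory dictionary stands in for whatever the real lookup queries.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Set, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def lookup_ids(
    urns: Iterable[str],
    lookup_batch: Callable[[List[str]], Iterable[int]],
    batch_size: int,
) -> Set[int]:
    """Resolve identifiers one bounded batch at a time and merge the results."""
    found: Set[int] = set()
    for batch in batched(urns, batch_size):
        found.update(lookup_batch(batch))
    return found


# Hypothetical data: a dict stands in for the identifier table.
fake_db: Dict[str, int] = {"urn:a": 1, "urn:b": 2, "urn:c": 3}
batch_sizes: List[int] = []


def fake_lookup(batch: List[str]) -> List[int]:
    batch_sizes.append(len(batch))  # record how large each lookup was
    return [fake_db[urn] for urn in batch if urn in fake_db]


ids = lookup_ids(["urn:a", "urn:x", "urn:b", "urn:c", "urn:y"], fake_lookup, batch_size=2)
```

With this shape, the size of any single lookup depends on `batch_size` rather than on the length of the feed.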
