Dropped records from one republication event to another #5301

Closed
dshorthouse opened this issue Apr 18, 2024 · 5 comments

@dshorthouse

Most (all?) of the Arctos-based IPTs recently experienced a significant drop in the number of records they serve and it's unclear to me if anyone noticed:

https://ipt.vertnet.org/resource?r=mvz_herp from 290k to 10k
https://ipt.vertnet.org/resource?r=uam_herb_vascular from 202k to 21k
https://ipt.vertnet.org/resource?r=mvz_bird from 195k to 3k
https://ipt.vertnet.org/resource?r=ucm_herps from 68k to 40
https://ipt.vertnet.org/resource?r=mvz_mammal from 245k to 19k

I contacted the data publishers for the above & their technical support. No need to do so again.

However...

Besides checking for shifting occurrenceIDs, I suggest that the processing pipeline at GBIF likewise halt ingestion, and create a GitHub issue here, when a republication event shows a significant drop in the number of records served, as happened with the resources above. A 25% drop in the record count might be a reasonable threshold.
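
A minimal sketch of the kind of check suggested here, for illustration only. It is not GBIF's pipeline code; the function names are hypothetical, the counts are the rounded figures quoted above, and the 25% default is just the threshold floated in this comment.

```python
def dropped_fraction(previous_count: int, current_count: int) -> float:
    """Fraction of records lost between two publications (0.0 if none were lost)."""
    if previous_count <= 0:
        # No earlier publication to compare against.
        return 0.0
    return max(0.0, (previous_count - current_count) / previous_count)


def should_pause(previous_count: int, current_count: int, threshold: float = 0.25) -> bool:
    """True when the drop meets or exceeds the threshold (assumed 25% by default)."""
    return dropped_fraction(previous_count, current_count) >= threshold


# Rounded counts from the list above (k = thousand).
examples = {
    "mvz_herp": (290_000, 10_000),
    "ucm_herps": (68_000, 40),
}
for name, (before, after) in examples.items():
    print(name, should_pause(before, after))
```

A relative threshold rather than an absolute one keeps the check meaningful for both small and large datasets.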

@dbloom

dbloom commented Apr 18, 2024

I agree with @dshorthouse that a pause on resources that experience a significant reduction in records during a publication event would be helpful, but I would add that it would be most helpful if the pause were accompanied by a notification describing the reason for it. GBIF already does this for resources when a significant number of occurrenceIDs have changed, and as an Admin on the VertNet IPT, I receive an email from @jhnwllr detailing the rationale. I can then either reply that the change was intentional and request that the pause be lifted, or I am alerted to return to the IPT and the resource in question to look for errors.

In this particular issue with Arctos, notifications would be extra helpful, because all of these resources are set to publish automatically and I cannot monitor these events in person. I expect other Admins who have resources set to publish on a regular schedule would appreciate the extra layer of safety, too.

Cc'ing others on the GBIF Helpdesk here (other than JW). @ahahn-gbif @ManonGros @CecSve

@dshorthouse dshorthouse changed the title Dropped records from one replication event to another Dropped records from one republication event to another Apr 18, 2024
@dbloom

dbloom commented Apr 18, 2024

Update: the backend Arctos issues have been fixed and all affected resources have been republished with corrected data via the VertNet IPT, but the issue presented here is still relevant.

@jhnwllr jhnwllr transferred this issue from gbif/ingestion-management Apr 19, 2024
@jhnwllr

jhnwllr commented Apr 19, 2024

@dshorthouse @dbloom I think our occurrenceId checker currently only catches datasets when they have a big increase in records, but not a decrease. It might be possible to have it catch both. @muttcg
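
A rough sketch of what catching both directions could look like, returning a reason that could go into a notification. This is not the actual occurrenceId checker; `record_count_alert` and both thresholds are hypothetical.

```python
from typing import Optional


def record_count_alert(
    previous_count: int,
    current_count: int,
    max_increase: float = 0.25,  # assumed threshold, not a GBIF setting
    max_decrease: float = 0.25,  # assumed threshold, not a GBIF setting
) -> Optional[str]:
    """Return a human-readable reason when the record count changes too much, else None."""
    if previous_count <= 0:
        # First publication: nothing to compare against.
        return None
    change = (current_count - previous_count) / previous_count
    if change >= max_increase:
        return f"record count rose {change:.0%} ({previous_count} -> {current_count})"
    if change <= -max_decrease:
        return f"record count fell {-change:.0%} ({previous_count} -> {current_count})"
    return None


print(record_count_alert(195_000, 3_000))
print(record_count_alert(100, 110))
```

Keeping separate thresholds for increases and decreases would let the two cases be tuned independently.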

@timrobertson100
Member

It'd be good for the GBIF ingestion to have a threshold check, but the IPT 3.1.0 should also catch this at source when this issue is addressed. Please comment on that issue if you have any wishes.

@dshorthouse
Author

Closing this ticket as addressed, assuming other tickets have been created elsewhere to target solutions that help monitor mishaps such as significant decreases in the number of records per dataset.
