
BCI-3492 [LogPoller]: Allow withObservedExecAndRowsAffected to report non-zero rows affected #14057

Merged: 5 commits into develop from BCI-3492-fix/logpoller-observe-rowsaffected on Aug 8, 2024

Conversation

reductionista (Contributor)

Addresses this bug: https://smartcontract-it.atlassian.net/browse/BCI-3492

Previously, rows affected was only being reported for log & block pruning when the query returned an error, so the reported value was always 0.

  • Always report rows affected for log & block pruning (see the sketch below)
  • Change the behavior of DeleteExpiredLogs so it also deletes logs which don't match any filter
  • Add a test case to ensure the dataset size is published properly during pruning
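In rough terms, the fix moves the rows-affected reporting out of the error path so it is recorded whenever the query succeeds. A minimal sketch of that pattern, assuming hypothetical metric fields (queryDuration, datasetSize) and a simplified signature; the actual diff appears in the review below:

```go
// Sketch only: field names and the exact signature are assumptions, not the real code.
func withObservedExecAndRowsAffected(o *ObservedORM, queryName string, queryType queryType,
	exec func() (int64, error)) (int64, error) {
	queryStarted := time.Now()
	rowsAffected, err := exec()
	o.queryDuration.
		WithLabelValues(o.chainId, queryName, string(queryType)).
		Observe(float64(time.Since(queryStarted)))

	// Record rows affected on success rather than only on error,
	// so pruning metrics can actually report non-zero values.
	if err == nil {
		o.datasetSize.
			WithLabelValues(o.chainId, queryName, string(queryType)).
			Set(float64(rowsAffected))
	}
	return rowsAffected, err
}
```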
@reductionista reductionista requested a review from a team as a code owner August 7, 2024 02:53
@reductionista reductionista marked this pull request as draft August 7, 2024 04:24
@reductionista reductionista marked this pull request as ready for review August 7, 2024 04:26
dhaidashenko previously approved these changes Aug 7, 2024
@@ -285,7 +285,7 @@ func withObservedExecAndRowsAffected(o *ObservedORM, queryName string, queryType
WithLabelValues(o.chainId, queryName, string(queryType)).
Observe(float64(time.Since(queryStarted)))

if err != nil {
Collaborator

Nice catch!

// DeleteExpiredLogs removes any logs which either:
// - don't match any currently registered filters, or
// - have a timestamp older than any matching filter's retention, UNLESS there is at
// least one matching filter with retention=0
func (o *DSORM) DeleteExpiredLogs(ctx context.Context, limit int64) (int64, error) {
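To spell out the rule in the doc comment above as code, here is a rough sketch of the per-log decision; the Log and Filter types and their fields are simplified assumptions, not the ORM's actual schema:

```go
// Sketch only: illustrative types, not the real ORM models.
type Filter struct {
	Retention time.Duration // 0 means "keep matching logs forever"
}

type Log struct {
	BlockTimestamp time.Time
}

// isExpired reports whether a log may be pruned, given the filters that match it.
func isExpired(log Log, matching []Filter, now time.Time) bool {
	if len(matching) == 0 {
		return true // no registered filter matches this log: prune it
	}
	for _, f := range matching {
		if f.Retention == 0 {
			return false // at least one matching filter has retention=0: keep forever
		}
		if now.Sub(log.BlockTimestamp) <= f.Retention {
			return false // still within some matching filter's retention window
		}
	}
	return true // older than every matching filter's retention: prune it
}
```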
Collaborator

Is it possible to accidentally remove logs when a job is updated? Let's imagine the following scenario:

  • We have a job that registered multiple LogPoller filters during init.
  • CLO proposes a job replacement: a slight change to the job definition, but the contracts stay the same, so the filters (and therefore the logs required for the plugin to work) would be the same.
  • The NOP accepts the job, which triggers removing the old job and creating a new one. This includes unregistering the old filters and registering new ones.
    • In the meantime, LogPoller's PruneExpiredLogs kicks in to remove stale logs. It sees a new set of "abandoned" logs, because the filters were just removed, and deletes them immediately.
  • Plugin init registers the new filters (which are the same), but the logs are gone, so we might need to run a replay.

I wonder whether this could happen and impact the system. How does it work for other products? What if a job is removed and added in a two-step process (via the API)? AFAIK CLO does this entire flow in one large db transaction, but I guess that might change at some point because of the LOOPPs architecture, right?

Contributor Author

@reductionista reductionista Aug 7, 2024


Good question! I'm pretty sure it's okay for now, since as you say CLO does the whole flow in a single db transaction. Someone could manually remove a job, and then add a similar one later... but I think it would be a natural user expectation to assume that there are no guarantees about events the old job cared about still being there for the new job to use. If CLO changes so that it's no longer in a single db transaction, that seems more problematic since it's presented to the user as an "Update" rather than a "Remove" followed by a separate "Create". But I think this would cause similar problems with or without this change. One way or another, we'll have to solve some issues if we can no longer guarantee the ability for CLO to make atomic job updates.

> Plugin init registers the new filters (which are the same), but the logs are gone, so we might need to run a replay.

Any plugin which cares about past logs already does a replay to the appropriate block when it's initialized. (For example, going back to pick up an OCR2 configuration event.) So I think the only issue is with missing logs that get emitted around the time the old job is getting swapped out for the new job. But even without this change, there is a potential for missing some logs while that's happening. This change would just increase the likelihood of it happening, since it would apply not only to events emitted between the delete and create operations but also to those emitted shortly before the delete, which the old job hadn't had time to process before being deleted.

One way I can think of to solve this issue is with ChainReader: if it can expose a method for updating a filter that does UnregisterFilter and then RegisterFilter within the same db transaction, that should solve it, I think. We might even already be doing this with Bind(); I'm not sure, but @ilija42 should know.
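As a rough illustration of that idea, here is a sketch using plain database/sql; the table name and the insertNew callback are placeholders, not the actual ChainReader or ORM API:

```go
// updateFilterAtomically unregisters and re-registers a filter in a single DB
// transaction, so the pruner never sees a window in which the logs are unmatched.
// Sketch only: evm.log_poller_filters and insertNew are illustrative placeholders.
func updateFilterAtomically(ctx context.Context, db *sql.DB, name string, insertNew func(tx *sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless no-op after a successful Commit

	// Unregister: remove the old filter's rows.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM evm.log_poller_filters WHERE name = $1`, name); err != nil {
		return err
	}
	// Register: insert the replacement filter within the same transaction.
	if err := insertNew(tx); err != nil {
		return err
	}
	return tx.Commit()
}
```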

Collaborator

Sounds good. One thing that comes to mind that might work as an additional safety check is to wait for some "sane" threshold before deleting logs that no longer have a matching filter: when retention is null, we would only match logs older than X days. Not sure what that threshold should be (maybe configurable?), but that might prevent us from accidentally losing logs right after deleting a job in some weird way. WDYT?
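A sketch of how such a threshold could slot into the pruning decision, reusing the illustrative Log type from the earlier sketch; the constant name and its default are placeholders, not an agreed design:

```go
// Sketch only: a hypothetical, possibly configurable, grace window for unmatched logs.
const unmatchedLogRetention = 24 * time.Hour

// prunableWithoutFilter only treats a log with no matching filter as prunable
// once it is older than the grace window.
func prunableWithoutFilter(log Log, now time.Time) bool {
	return now.Sub(log.BlockTimestamp) > unmatchedLogRetention
}
```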

Contributor Author

Yes, I was thinking that might be what we have to do, if we can't solve things with ChainReader.

Contributor Author

Actually, I just thought of a simple solution which I think would be pretty robust: in UnregisterFilter(), we could return success but delay the actual removal of the filter by 1 hour (or any long enough time; it could even be 24 hours, but that seems like overkill). As long as a call to RegisterFilter() that matches the logs which would have been deleted arrives within that 1-hour grace period, there is no point in time at which the pruner would be allowed to delete them.
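A sketch of that grace-period idea, kept in memory and with hypothetical names (the real filters live in the database, and the real methods take a context and return errors); the Filter type is the illustrative one from the earlier sketch:

```go
// Sketch only: names and structure are hypothetical.
const unregisterGracePeriod = time.Hour

type filterRecord struct {
	filter      Filter
	deleteAfter time.Time // zero value means "not scheduled for deletion"
}

type logPoller struct {
	lock    sync.Mutex
	filters map[string]filterRecord
}

// UnregisterFilter reports success immediately but only schedules the filter for
// deletion after a grace period, so a prompt re-registration keeps its logs safe.
func (lp *logPoller) UnregisterFilter(name string) {
	lp.lock.Lock()
	defer lp.lock.Unlock()
	if rec, ok := lp.filters[name]; ok {
		rec.deleteAfter = time.Now().Add(unregisterGracePeriod)
		lp.filters[name] = rec
	}
}

// RegisterFilter clears any pending deletion for a filter with the same name.
func (lp *logPoller) RegisterFilter(name string, f Filter) {
	lp.lock.Lock()
	defer lp.lock.Unlock()
	lp.filters[name] = filterRecord{filter: f}
}

// expiredFilters is what the pruner would consult: a filter only counts as gone
// once its grace period has elapsed.
func (lp *logPoller) expiredFilters(now time.Time) []string {
	lp.lock.Lock()
	defer lp.lock.Unlock()
	var gone []string
	for name, rec := range lp.filters {
		if !rec.deleteAfter.IsZero() && now.After(rec.deleteAfter) {
			gone = append(gone, name)
		}
	}
	return gone
}
```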

@reductionista reductionista added this pull request to the merge queue Aug 8, 2024
Merged via the queue into develop with commit e0850a6 Aug 8, 2024
120 checks passed
@reductionista reductionista deleted the BCI-3492-fix/logpoller-observe-rowsaffected branch August 8, 2024 18:11
momentmaker added a commit that referenced this pull request Aug 8, 2024
* develop:
  CRIB CI integration (#13924)
  fix: refactor sonarqube scan args (#13875)
  BCI-3492 [LogPoller]: Allow withObservedExecAndRowsAffected to report non-zero rows affected (#14057)
  Add error mapping for Astar (#13990)
  [BCI-3862][chainlink] - Change DSL Block primitive to string instead of int (#14033)
  [KS-412] Validate called DON membership in TriggerPublisher (#14040)
  [TT-1429] notifying guardian on some failures (#14073)
  Add Mantle errors (#14053)
  Fix write target nil dereferences (#14059)
  Allow retrying failed transmissions (#14017)
  auto-11214: migrate more tests to foundry (#13934)