-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Draft grouphashmetadata
table
#75980
Conversation
This PR has a migration; here is the generated SQL for --
-- Create model GroupHashMetadata
--
CREATE TABLE "sentry_grouphashmetadata" ("id" bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, "date_added" timestamp with time zone NOT NULL, "grouping_method" integer NOT NULL CHECK ("grouping_method" >= 0), "grouping_config" varchar NOT NULL, "hash_basis" varchar NULL, "hashing_metadata" text NULL, "enhancements" text NULL, "fingerprint" text NULL, "seer_model" varchar NULL, "seer_date_sent" timestamp with time zone NULL, "seer_results" text NULL, "event_sent" varchar(32) NULL, "grouphash_id" bigint NOT NULL UNIQUE, "secondary_hash_match_id" bigint NULL);
ALTER TABLE "sentry_grouphashmetadata" ADD CONSTRAINT "sentry_grouphashmeta_grouphash_id_c47122d9_fk_sentry_gr" FOREIGN KEY ("grouphash_id") REFERENCES "sentry_grouphash" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "sentry_grouphashmetadata" VALIDATE CONSTRAINT "sentry_grouphashmeta_grouphash_id_c47122d9_fk_sentry_gr";
ALTER TABLE "sentry_grouphashmetadata" ADD CONSTRAINT "sentry_grouphashmeta_secondary_hash_match_f564cbf0_fk_sentry_gr" FOREIGN KEY ("secondary_hash_match_id") REFERENCES "sentry_grouphash" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "sentry_grouphashmetadata" VALIDATE CONSTRAINT "sentry_grouphashmeta_secondary_hash_match_f564cbf0_fk_sentry_gr";
CREATE INDEX CONCURRENTLY "sentry_grouphashmetadata_secondary_hash_match_id_f564cbf0" ON "sentry_grouphashmetadata" ("secondary_hash_match_id"); |
Test Failures Detected: Due to failing tests, we cannot provide coverage reports at this time. ❌ Failed Test Results:Completed 21748 tests with View the full list of failed testspytest
|
adac841
to
519b3a5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty cool to see. Good start 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[optional] You can add a module comment to help folks get the gist of what this module is about.
FULL_STACKTRACE = "stack" | ||
APP_STACKTRACE = "app_stack" | ||
FINGERPRINT = "fingerprint" | ||
# TODO: Are there others? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Custom fingerprint.
Today I've found this new way of grouping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I actually think I need to go through all of the strategies, variants, and components to figure out which of them can turn into an over-arching reason for grouping. Will definitely take some more research, but I'll cross that bridge when I come to it.
# Most recent config to produce this hash | ||
# TODO: do we need first, also? | ||
# TODO: Is there a way to get choices for this from grouping.strategies.configurations.CONFIGURATIONS? | ||
grouping_config = models.CharField() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beware: I'm handwaving here!
Maybe something like this:
from foo import CONFIGURATIONS
class ProjectConfig(models.TextChoices):
CONFIG_0 = CONFIGURATIONS[0]
CONFIG_1 = CONFIGURATIONS[1]
...
CONFIG_N = CONFIGURATIONS[N]
Also, we should add an assertion for the size of it somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I think the big issue here is going to be circular dependencies - lots of the grouping modules import from models
, and in this model I'd have to be importing from one of those modules. Even if it technically works right now (because whatever I need doesn't happen to come from modules which import from models
), I'd be nervous to rely on that continuing to the the case. So move it to a neutral location? But then it's populated by grouping code...
|
||
@region_silo_model | ||
class GroupHashMetadata(Model): | ||
__relocation_scope__ = RelocationScope.Excluded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand the implications of using this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of that backup/relocation project Alex has been working on. When you migrate an org from one region to another, or from on-prem to SAAS, not all of your data comes along for the ride, and in particular none of your event/group data does.
# Eg. message, in-app stacktrace, full stacktrace, fingerprint, etc | ||
hash_basis = models.CharField(choices=HashBasis, null=True) | ||
# Eg. if message parameterized, what type of parameterization?, etc) | ||
hashing_metadata: models.Field[dict[str, Any] | None, dict[str, Any] | None] = JSONField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example to help me understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are examples of all of these in the Notion doc, but one possibility for this field might be:
{
"message_parametrized": true,
"parameterization_types": ["int", "date"]
}
) | ||
# Built-in and custom stacktrace rules used | ||
enhancements: models.Field[dict[str, Any] | None, dict[str, Any] | None] = JSONField(null=True) | ||
# Client fingerprint, value, matched rule, rule source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have examples of each property you are referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup - in the Notion doc linked in the above comment.
# Client fingerprint, value, matched rule, rule source | ||
fingerprint: models.Field[dict[str, Any] | None, dict[str, Any] | None] = JSONField(null=True) | ||
# If this hash was associated to its group via a grouping config transition, what secondary hash | ||
# made the link? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble picturing when this would happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a config transition, if the secondary config calculates hash A and the new config calculates hash B, we use the group associated with hash A for hash B as well. This field would exist on hash B's metadata, and would point to hash A.
) | ||
|
||
# SEER | ||
seer_model = models.CharField(null=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What could the values be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we're going with vX
-type versioning:
sentry/src/sentry/conf/server.py
Line 3443 in 3c02b74
SEER_SIMILARITY_MODEL_VERSION = "v0" |
# This will be different than `date_added` if we send it to Seer as part of a backfill | ||
seer_date_sent = models.DateTimeField(null=True) | ||
# TODO Figure out how to best reuse the types in seer.similarity.types | ||
seer_results: models.Field[list[Any] | None, list[Any] | None] = JSONField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What can the seer results be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be a list of dict versions of SeerSimilarIssueData
instances. For now there'll only ever be one item in the list, since we're only planning to store the single best match (if any), but here and elsewhere in the code it's a list just in case that changes in the future.
sentry/src/sentry/seer/similarity/types.py
Lines 51 to 56 in 3c02b74
class SeerSimilarIssueData: | |
stacktrace_distance: float | |
message_distance: float | |
should_group: bool | |
parent_group_id: int | |
parent_hash: str |
grouphashmetadata
tablegrouphashmetadata
table
This pull request has gone three weeks without activity. In another week, I will close it. But! If you comment or otherwise update it, I will reset the clock, and if you add the label "A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀 |
NOTE: This is an overview of what the eventual state of the table might look like, but is not meant to be merged. It's just a place to have conversation about the overall plan.
My first stab at creating the new table. General feedback welcome, but also, I have a two general questions:
This is all of the fields that - as of now - are likely to be added in this first go-round. I'm sure there'll be changes, though. Do you think it's better to create the table with all the fields and possibly write migrations to change fields, or just add the fields as I add code related to them?
Are there any bits I'm missing, things besides just the fields?