Tokenization Stage Misalignment Fix #198
Important: Review skipped
Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the
You can disable this status message by setting the
Walkthrough
The changes involve significant modifications to the
Changes
Sequence Diagram(s)
sequenceDiagram
participant A as DataFrame
participant B as extract_statics_and_schema
participant C as Output DataFrame
A->>B: Input DataFrame with static and schema data
B->>B: Perform full outer join
B->>B: Coalesce null values
B->>C: Return processed Output DataFrame
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/MEDS_transforms/transforms/tokenization.py (2 hunks)
- tests/MEDS_Transforms/test_tokenization.py (4 hunks)
Additional comments not posted (6)
tests/MEDS_Transforms/test_tokenization.py (5)
15-18: LGTM! The code changes are approved.
82-91: LGTM! The code changes are approved. The new DataFrame WANT_SCHEMAS_TRAIN_0_MISSING_STATIC is well-defined and will be useful for validating the tokenization process against expected outputs when static data is missing.
227-232: LGTM! The code changes are approved. The new dictionary WANT_SCHEMAS_MISSING_STATIC is well-defined and will enhance the organization of expected schema outputs.
241-281: LGTM! The code changes are approved. The new string WANT_TRAIN_0 and variable NORMALIZED_SHARDS_MISSING_STATIC are well-defined and will be useful for testing the tokenization process when static data is missing.
302-321: LGTM! The code changes are approved. The new test function test_tokenization_missing_static is well-defined and will significantly improve the robustness of the testing framework by ensuring that edge cases related to missing static data are adequately covered.
src/MEDS_transforms/transforms/tokenization.py (1)
147-178: LGTM! The changes improve the function's robustness and maintainability.
The key improvements are:
- Changing from inner join to full outer join allows the function to handle scenarios where there are no matching entries between the static_by_subject and schema_by_subject DataFrames. This enhances the function's ability to handle diverse data scenarios.
- The coalesce=True parameter ensures that any null values resulting from the join are handled appropriately, improving the output's integrity.
- The extensive doctests provide clear examples of the function's expected behavior with sample data. They illustrate how the function processes and transforms the data, and demonstrate the output shapes and types. This greatly improves the code's maintainability.
Also applies to: 193-193
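To make the join change above concrete, here is a minimal plain-Python sketch of the difference between an inner join and a full outer join with a coalesced key. It is not the code from tokenization.py: the helper names, the column names (subject_id, static_code, start_time), and the sample rows are all hypothetical, and it assumes at most one row per subject on each side.

```python
def inner_join(left, right, key, left_cols, right_cols):
    """Keep only keys present on BOTH sides (hypothetical helper)."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for l_row in left:
        r_row = right_by_key.get(l_row[key])
        if r_row is None:
            continue  # a subject missing on the right side is dropped entirely
        row = {key: l_row[key]}
        row.update({c: l_row.get(c) for c in left_cols})
        row.update({c: r_row.get(c) for c in right_cols})
        joined.append(row)
    return joined


def full_outer_join(left, right, key, left_cols, right_cols):
    """Keep keys present on EITHER side, with a single coalesced key column."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    joined = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l_row = left_by_key.get(k, {})
        r_row = right_by_key.get(k, {})
        # Coalesced key: one key column that is never None, whichever side had the row.
        row = {key: k}
        # Columns from the missing side are filled with None instead of dropping the row.
        row.update({c: l_row.get(c) for c in left_cols})
        row.update({c: r_row.get(c) for c in right_cols})
        joined.append(row)
    return joined


# Hypothetical sample data: subject 2 has schema (dynamic) data but no static data.
static_by_subject = [{"subject_id": 1, "static_code": "EYE_COLOR//BLUE"}]
schema_by_subject = [
    {"subject_id": 1, "start_time": "2020-01-01"},
    {"subject_id": 2, "start_time": "2021-06-15"},
]

inner = inner_join(
    static_by_subject, schema_by_subject, "subject_id", ["static_code"], ["start_time"]
)
full = full_outer_join(
    static_by_subject, schema_by_subject, "subject_id", ["static_code"], ["start_time"]
)

print([row["subject_id"] for row in inner])  # subject 2 is lost
print([row["subject_id"] for row in full])   # subject 2 is kept with static_code=None
```

With the inner join, a subject that has no static row disappears from the output; with the full outer join it is retained, with its static column set to None and its key still populated.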
Codecov Report
All modified and coverable lines are covered by tests ✅
✅ All tests successful. No failed tests found.
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #198      +/-   ##
==========================================
+ Coverage   94.23%   94.38%   +0.14%
==========================================
  Files          27       27
  Lines        2100     2100
==========================================
+ Hits         1979     1982       +3
+ Misses        121      118       -3
Tiny fix requested re the naming of the imported shards from the normalization test; otherwise, this looks good to go in.
…ad outputs of the normalization stage
Resolves issue #197
Added doctests and integration tests that reproduce the issue and confirm that it is resolved.