Release Candidate 0.0.6 #175
…match and revise to be fully implementable, so it is a scaffold for now until #119 goes in
…xtract Remove aggregation of code metadata from default extraction ETL.
…ected normalization outputs and update the numbers.
…arize_all_codes is True
Fix tests to expect Float32 throughout and time derived to specifically compute age in float32
…er_bug Fixes and expands tests for `aggregate_code_metadata` across various aggregations
… in preparation for multi-stage test
…ests-for-pre-processing Adds a multi-stage integration test for pre-processing.
Exit metadata extraction if there is no _metadata in the event configs
…he MIMIC and eICU example READMEs.
Added badges to the README.
Walkthrough
The recent changes involve updates to GitHub Actions workflows for Python code quality and testing, enhancing version management through a matrix strategy. The Python setup actions have been upgraded to their latest versions, and the workflows now support multiple Python versions. Additionally, modifications to the pre-commit configuration and type annotations in the codebase improve clarity and compatibility. Overall, these changes streamline the development and testing processes.
Sequence Diagram(s)
sequenceDiagram
participant Developer
participant GitHubActions
participant PythonEnv
participant Codebase
Developer->>GitHubActions: Push changes
GitHubActions->>PythonEnv: Set up matrix for Python versions
PythonEnv->>Codebase: Run tests and quality checks
Codebase-->>GitHubActions: Return results
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #175      +/-   ##
==========================================
+ Coverage   93.38%   93.51%   +0.12%
==========================================
  Files          25       26       +1
  Lines        1859     1988     +129
==========================================
+ Hits         1736     1859     +123
- Misses        123      129       +6
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (2)
src/MEDS_transforms/configs/extract.yaml (1)
Review and Update References to `aggregate_code_metadata`
The `aggregate_code_metadata` stage is referenced in multiple files, which suggests that its removal might affect tests, documentation, and example scripts. Please review and update the following files to ensure consistency:
pyproject.toml
tests/transform_tester_base.py
docs/preprocessing_operation_prototypes.md
tests/test_aggregate_code_metadata.py
MIMIC-IV_Example/joint_script_slurm.sh
MIMIC-IV_Example/joint_script.sh
Make sure that all references to `aggregate_code_metadata` are either removed or updated to reflect the current state of the pipeline. This will prevent potential issues with testing, documentation, and script execution.
Analysis chain
Line range hint 7-29: LGTM! Verify the impact of `aggregate_code_metadata` stage removal.
The removal of the `aggregate_code_metadata` stage simplifies the pipeline. Ensure that this change aligns with the overall objectives and does not introduce issues.
Run the following script to verify the impact of `aggregate_code_metadata` stage removal:
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the removal of `aggregate_code_metadata` stage does not affect the pipeline.
# Test: Search for occurrences of `aggregate_code_metadata` usage. Expect: No occurrences.
rg --type yaml $'aggregate_code_metadata'
Length of output: 40
Script:
#!/bin/bash
# Description: Search for occurrences of `aggregate_code_metadata` in the entire codebase to ensure its removal is safe.
# Test: Search for occurrences of `aggregate_code_metadata` in all file types.
rg 'aggregate_code_metadata' --hidden -A 5
Length of output: 5070
tests/test_extract_no_metadata.py (1)
9-11: Consider adding error handling for `rootutils.setup_root`.
The `setup_root` function is used consistently across multiple test files without error handling. Adding error handling can help manage potential issues during the setup process.
tests/test_extract_no_metadata.py
tests/test_extract.py
tests/transform_tester_base.py
Analysis chain
Verify the usage of `rootutils`.
Ensure that `rootutils.setup_root` is correctly setting up the environment. Consider adding error handling if the setup fails.
Run the following script to verify the usage of `rootutils.setup_root`:
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `rootutils.setup_root`.
# Test: Search for `setup_root` usage. Expect: Correct setup with error handling.
rg --type python -A 5 $'setup_root'
Length of output: 65
Script:
#!/bin/bash
# Description: Verify the usage of `rootutils.setup_root`.
# Test: Search for `setup_root` usage. Expect: Correct setup with error handling.
rg --type py -A 5 $'setup_root'
Length of output: 1203
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (28)
- MIMIC-IV_Example/README.md (2 hunks)
- README.md (1 hunks)
- eICU_Example/README.md (1 hunks)
- src/MEDS_transforms/aggregate_code_metadata.py (4 hunks)
- src/MEDS_transforms/configs/extract.yaml (2 hunks)
- src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml (1 hunks)
- src/MEDS_transforms/extract/extract_code_metadata.py (1 hunks)
- src/MEDS_transforms/fit_vocabulary_indices.py (1 hunks)
- src/MEDS_transforms/transforms/add_time_derived_measurements.py (4 hunks)
- src/MEDS_transforms/transforms/tensorization.py (2 hunks)
- src/MEDS_transforms/transforms/tokenization.py (1 hunks)
- src/MEDS_transforms/utils.py (2 hunks)
- tests/test_add_time_derived_measurements.py (1 hunks)
- tests/test_aggregate_code_metadata.py (1 hunks)
- tests/test_extract.py (6 hunks)
- tests/test_extract_no_metadata.py (1 hunks)
- tests/test_filter_measurements.py (2 hunks)
- tests/test_filter_patients.py (1 hunks)
- tests/test_fit_vocabulary_indices.py (2 hunks)
- tests/test_multi_stage_preprocess_pipeline.py (1 hunks)
- tests/test_normalization.py (3 hunks)
- tests/test_occlude_outliers.py (1 hunks)
- tests/test_reorder_measurements.py (1 hunks)
- tests/test_reshard_to_split.py (1 hunks)
- tests/test_tensorization.py (2 hunks)
- tests/test_tokenization.py (10 hunks)
- tests/transform_tester_base.py (13 hunks)
- tests/utils.py (4 hunks)
Files skipped from review due to trivial changes (7)
- MIMIC-IV_Example/README.md
- README.md
- src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml
- src/MEDS_transforms/fit_vocabulary_indices.py
- tests/test_add_time_derived_measurements.py
- tests/test_filter_measurements.py
- tests/test_reshard_to_split.py
Additional context used
Ruff
tests/test_extract_no_metadata.py
410-410: Use `key in dict` instead of `key in dict.keys()`
Remove `.keys()` (SIM118)
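As a minimal sketch of what SIM118 asks for, the two membership tests below are equivalent, and the second is the form Ruff prefers. The dictionary `shard_rows` and its contents are hypothetical and are not taken from `tests/test_extract_no_metadata.py`.

```python
# Hypothetical dictionary, used only for illustration.
shard_rows = {"train/0": 4, "train/1": 2, "held_out/0": 1}

# Flagged by SIM118: `.keys()` is redundant in a membership test.
found_verbose = "train/0" in shard_rows.keys()

# Preferred: a membership test on a dict already checks its keys.
found_direct = "train/0" in shard_rows

print(found_verbose, found_direct)
```

Both expressions give the same answer for any key, so the change is purely stylistic.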
Additional comments not posted (39)
tests/test_tensorization.py (2)
28-29: LGTM! Verify consistency of parameter name changes across the codebase.
The change from `want_outputs` to `want_data` aligns with the new naming convention. Ensure that this change is consistently applied throughout the codebase.
Run the following script to verify the consistency of parameter name changes:
Verification successful
Parameter Name Change Verified Across Codebase
The parameter name `want_data` is consistently used across the codebase, confirming that the change from `want_outputs` to `want_data` has been applied correctly.
- The parameter `want_data` appears in multiple test files, including:
tests/transform_tester_base.py
tests/test_tensorization.py
tests/test_reorder_measurements.py
tests/test_reshard_to_split.py
tests/test_normalization.py
tests/test_occlude_outliers.py
tests/test_tokenization.py
tests/test_filter_patients.py
tests/test_filter_measurements.py
tests/test_multi_stage_preprocess_pipeline.py
tests/test_add_time_derived_measurements.py
This confirms the consistency of the parameter name change throughout the codebase.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the parameter name `want_data` is consistently used in the codebase.
# Test: Search for occurrences of `want_data` usage. Expect: Consistent parameter name.
rg --type python -A 5 $'want_data'
Length of output: 64
Script:
#!/bin/bash
# Description: Verify that the parameter name `want_data` is consistently used in the codebase.
# Test: Search for occurrences of `want_data` usage. Expect: Consistent parameter name.
rg --type py -A 5 $'want_data'
Length of output: 5410
Line range hint 16-20: LGTM! Verify consistency of key changes across the codebase.
The change to append `.nrt` to each key in `WANT_NRTS` improves clarity. Ensure that this change is consistently applied throughout the codebase.
Run the following script to verify the consistency of key changes:
Verification successful
Change is consistent within its scope. The `.nrt` suffix is applied to keys in `WANT_NRTS` within `tests/test_tensorization.py`, and no other occurrences were found in the codebase. The change is localized and consistent.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the `.nrt` suffix is consistently applied to keys in the codebase.
# Test: Search for occurrences of `WANT_NRTS` usage. Expect: Consistent key format.
rg --type python -A 5 $'WANT_NRTS'
Length of output: 64
Script:
#!/bin/bash
# Description: Verify that the `.nrt` suffix is consistently applied to keys in the codebase.
# Corrected Test: Search for occurrences of `WANT_NRTS` usage in Python files. Expect: Consistent key format.
rg --glob '*.py' -A 5 'WANT_NRTS'
Length of output: 516
tests/test_fit_vocabulary_indices.py (2)
35-36: LGTM! Verify consistency of parameter name changes across the codebase.
The change from `want_outputs` to `want_metadata` aligns with the new naming convention. The direct use of `parse_code_metadata_csv` simplifies the test setup.
Run the following script to verify the consistency of parameter name changes:
Verification successful
Parameter Name Change Verified: `want_metadata` is Consistently Used Across the Codebase
The parameter name `want_metadata` is consistently used across multiple test files, confirming the successful application of the naming convention change.
tests/transform_tester_base.py
tests/test_fit_vocabulary_indices.py
tests/test_aggregate_code_metadata.py
tests/test_multi_stage_preprocess_pipeline.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the parameter name `want_metadata` is consistently used in the codebase.
# Test: Search for occurrences of `want_metadata` usage. Expect: Consistent parameter name.
rg --type python -A 5 $'want_metadata'
Length of output: 68
Script:
#!/bin/bash
# Description: Verify that the parameter name `want_metadata` is consistently used in the codebase.
# Test: Search for occurrences of `want_metadata` usage. Expect: Consistent parameter name.
rg -t py -A 5 'want_metadata'
Length of output: 3659
10-12: LGTM! Verify the impact of `WANT_SCHEMA` and `WANT_DF` removal.
The removal of `WANT_SCHEMA` and `WANT_DF` reduces complexity. Ensure that this removal does not affect other parts of the codebase.
Run the following script to verify the impact of `WANT_SCHEMA` and `WANT_DF` removal:
Verification successful
No impact from removal of `WANT_SCHEMA` and `WANT_DF`.
The specific identifiers `WANT_SCHEMA` and `WANT_DF` were not found in the codebase, indicating that their removal does not affect other parts of the code. However, similar identifiers like `WANT_SCHEMAS_TRAIN_0` exist, which should be manually verified to ensure no related issues.
- File: `tests/test_tokenization.py` (contains similar identifiers)
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that the removal of `WANT_SCHEMA` and `WANT_DF` does not affect the codebase.
# Test: Search for occurrences of `WANT_SCHEMA` and `WANT_DF` usage. Expect: No occurrences.
rg --type python $'WANT_SCHEMA|WANT_DF'
Length of output: 71
Script:
#!/bin/bash
# Description: Verify that the removal of `WANT_SCHEMA` and `WANT_DF` does not affect the codebase.
# Test: Search for occurrences of `WANT_SCHEMA` and `WANT_DF` usage. Expect: No occurrences.
rg 'WANT_SCHEMA|WANT_DF'
Length of output: 708
tests/test_filter_patients.py (1)
85-85: Parameter name change acknowledged.
The parameter name change from `want_outputs` to `want_data` reflects a semantic shift in the expected data. Ensure that this change is consistently applied across all relevant tests and documentation.
tests/test_reorder_measurements.py (1)
114-114: Parameter name change acknowledged.
The parameter name change from `want_outputs` to `want_data` indicates a semantic shift in the expected data. Ensure that this change is consistently applied across all relevant tests and documentation.
src/MEDS_transforms/transforms/tensorization.py (1)
91-101: Enhanced error handling and logging acknowledged.
The added validation checks for empty columns and the use of logging improve error handling and observability. Ensure that these changes are documented and that any dependent code is updated to handle the new exceptions appropriately.
tests/test_occlude_outliers.py (1)
174-174: Parameter name changed for clarity.
The parameter `want_outputs` has been renamed to `want_data` to better reflect its purpose. Ensure that this change is consistent across all related tests and documentation.
tests/utils.py (2)
17-18: Schema update: Precision and new field added.
The `numeric_value` type has been changed from `pl.Float64` to `pl.Float32`, which may affect precision. Additionally, a new field `numeric_value/is_inlier` of type `pl.Boolean` has been added. Ensure that these changes are compatible with the rest of the codebase.
111-112: New parameters improve command execution robustness.
The `run_command` function now includes `stage_name` and `do_pass_stage_name` parameters. This change enhances robustness by enforcing necessary parameters for certain operations. Ensure that all function calls are updated accordingly.
tests/test_aggregate_code_metadata.py (1)
1-186: New test file added for code metadata aggregation.
The `test_aggregate_code_metadata.py` file introduces a comprehensive test for the `aggregate_code_metadata` script. The test is well-structured and utilizes polars for DataFrame operations, ensuring clarity and efficiency in testing.
tests/test_tokenization.py (3)
44-44: Verify adequacy of `Float32` precision for `time_delta_days`.
The change to `Float32` reduces precision but optimizes memory usage. Ensure that this precision is sufficient for the application's requirements.
Run the following script to verify the adequacy of `Float32` precision for the `time_delta_days` field:
Line range hint 71-102: Verify correctness of updated test data values.
The test data values have been updated to align with the new schema. Ensure that these values are correct and consistent with the schema changes.
Run the following script to verify the correctness of the updated test data values:
Also applies to: 119-133, 145-158, 179-196
42-42: Verify compatibility of `UInt8` for `code` field.
The change to `UInt8` optimizes memory usage, but ensure that all values in the `code` field are within the range of `UInt8` (0-255).
Run the following script to verify the compatibility of `UInt8` for the `code` field:
Verification successful
Change to `UInt8` for `code` field is compatible.
The values assigned to the `code` field in the test cases are within the valid range for `UInt8` (0-255), confirming compatibility with the change. No issues found.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all values in the `code` field are within the range of `UInt8`.
# Test: Search for the `code` field values. Expect: All values should be within 0-255.
rg --type python -A 5 $'code: [[5], [1, 10, 11], [10, 11], [10, 11], [10, 11], [4]]'
Length of output: 116
Script:
#!/bin/bash
# Description: Search for assignments and manipulations of the `code` field to verify compatibility with `UInt8`.
# Search for the `code` field assignments in Python files.
rg 'code\s*:\s*\[\[' --glob '*.py' -A 5
# Search for any manipulations or calculations involving the `code` field.
rg 'code' --glob '*.py' -A 5
Length of output: 340503
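The range check described above can also be done directly rather than by text search. The sketch below applies it to the nested `code` values quoted in the first search pattern; the helper `fits_uint8` is hypothetical and only illustrates the 0-255 bound.

```python
UINT8_MAX = 2**8 - 1  # 255, the largest value an unsigned 8-bit integer holds

# Code indices copied from the search pattern quoted in the script above.
codes = [[5], [1, 10, 11], [10, 11], [10, 11], [10, 11], [4]]


def fits_uint8(nested_codes):
    """Return True if every code in every event lies within 0..255."""
    return all(0 <= c <= UINT8_MAX for event in nested_codes for c in event)


print(fits_uint8(codes))
```

A check like this only covers the test fixtures; a vocabulary with more than 256 codes would still need a wider integer type.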
tests/test_normalization.py (3)
134-161: Verify correctness of updated expected output strings.
The expected output strings have been adjusted to align with new calculations. Ensure that these values are correct and consistent with the schema changes.
Run the following script to verify the correctness of the updated expected output strings:
Also applies to: 168-179, 186-190, 197-205
90-90: Verify correctness of output formatting change.
The output formatting now converts each value to a float using a list comprehension. Ensure that this change does not affect the correctness of the output.
Run the following script to verify the correctness of the output formatting change:
Verification successful
Output formatting change verified.
The conversion of values to floats before printing in `tests/test_normalization.py` is correctly implemented and does not affect the correctness of the output.
- Location: `tests/test_normalization.py`, line with `print([float(x) for x in subj_vals])`
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness of the output formatting change.
# Test: Search for output formatting change. Expect: Correctness should be maintained.
rg --type python -A 5 $'print([float(x) for x in subj_vals])'
Length of output: 93
Script:
#!/bin/bash
# Description: Verify the correctness of the output formatting change by searching for the specific print statement.
# Correcting the previous command by removing the unrecognized option.
rg -A 5 'print([float(x) for x in subj_vals])'
Length of output: 21752
84-84: Verify adequacy of `np.float32` precision for normalization calculations.
The change to `np.float32` reduces precision but optimizes memory usage. Ensure that this precision is sufficient for the application's requirements.
Run the following script to verify the adequacy of `np.float32` precision for normalization calculations:
eICU_Example/README.md (3)
31-44: Verify overall clarity and usability of the README.
The README has been updated to improve the usability of the installation instructions. Ensure that the overall clarity and usability are improved.
Run the following script to verify the overall clarity and usability of the README:
Verification successful
README Clarity and Usability Verified
The installation instructions in `eICU_Example/README.md` are clear and formatted correctly. No issues with clarity or usability were found.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the overall clarity and usability of the README.
# Test: Search for updated sections in the README. Expect: Overall clarity and usability should be improved.
rg --type markdown -A 5 $'## Step 0: Installation'
Length of output: 761
37-44: Verify correctness and effectiveness of script download and permission setting commands.
New commands download necessary scripts using `wget` and set executable permissions. Ensure that these commands are correct and effective.
Run the following script to verify the correctness and effectiveness of the script download and permission setting commands:
Verification successful
Commands are correctly documented in the README.md
The `wget` and `chmod` commands for downloading and setting permissions on the scripts are correctly documented in the `README.md`. Assuming they are executed in an environment with internet access and proper permissions, they should work as intended. No further issues found.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness and effectiveness of script download and permission setting commands.
# Test: Search for script download and permission setting commands. Expect: Commands should be correct and effective.
rg --type markdown -A 5 $'wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/eICU_Example/joint_script.sh'
Length of output: 544
31-36: Verify clarity and correctness of installation instructions.
The installation instructions specify the installation of `MEDS_transforms` with the `local_parallelism` option. Ensure that these instructions are clear and correct.
Run the following script to verify the clarity and correctness of the installation instructions:
src/MEDS_transforms/transforms/tokenization.py (1)
187-198: LGTM! But verify the function usage in the codebase.
The changes improve type safety and memory efficiency. Ensure that all function calls to `extract_seq_of_patient_events` are consistent with the updated data types.
Run the following script to verify the function usage:
Also applies to: 203-210
Verification successful
Function usage is consistent within the file.
The `extract_seq_of_patient_events` function is defined and used within `src/MEDS_transforms/transforms/tokenization.py`. There are no external usages, ensuring that the changes do not impact other parts of the codebase.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all function calls to `extract_seq_of_patient_events`.
# Test: Search for the function usage. Expect: Consistency with updated data types.
rg --type python -A 5 $'extract_seq_of_patient_events'
Length of output: 84
Script:
#!/bin/bash
# Description: Verify all function calls to `extract_seq_of_patient_events`.
# Test: Search for the function usage. Expect: Consistency with updated data types.
rg --type py -A 5 $'extract_seq_of_patient_events'
Length of output: 1732
src/MEDS_transforms/utils.py (1)
247-247: LGTM! But verify the function usage in the codebase.
The changes reflect a shift in data flow, particularly in metadata input directory handling. Ensure that all function calls to `populate_stage` are consistent with the updated logic.
Run the following script to verify the function usage:
Also applies to: 307-307
Verification successful
Function usage verified successfully.
The `populate_stage` function is used consistently with the updated logic, as evidenced by the doctests and its integration with `OmegaConf`. No further action is required.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all function calls to `populate_stage`.
# Test: Search for the function usage. Expect: Consistency with updated logic.
rg --type python -A 5 $'populate_stage'
Length of output: 69
Script:
#!/bin/bash
# Description: Verify all function calls to `populate_stage`.
# Test: Search for the function usage. Expect: Consistency with updated logic.
rg --type py -A 5 $'populate_stage'
Length of output: 4187
src/MEDS_transforms/transforms/add_time_derived_measurements.py (1)
233-237: LGTM! But verify the function usage in the codebase.
The changes improve type consistency and flexibility in type conversion. Ensure that all function calls to `age_fntr` are consistent with the updated data types.
Run the following script to verify the function usage:
Also applies to: 253-253
Verification successful
Function Usage Verified
The function `age_fntr` is used consistently within the codebase, and its usage aligns with the expected data types. The examples provided in the code further confirm its correct application.
- Usage examples and integration with `add_new_events_fntr` are present in `src/MEDS_transforms/transforms/add_time_derived_measurements.py`.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all function calls to `age_fntr`.
# Test: Search for the function usage. Expect: Consistency with updated data types.
rg --type python -A 5 $'age_fntr'
Length of output: 63
Script:
#!/bin/bash
# Description: Verify all function calls to `age_fntr`.
# Test: Search for the function usage. Expect: Consistency with updated data types.
rg -t py -A 5 'age_fntr'
Length of output: 3357
tests/transform_tester_base.py (6)
7-12: LGTM! Imports are correctly updated for YAML handling.
The addition of YAML handling imports is appropriate for the new functionality.
206-208: LGTM! The `parse_shards_yaml` function is well-implemented.
The function effectively processes YAML strings into DataFrames with dynamic schema updates.
210-217: LGTM! The `parse_code_metadata_csv` function is correctly implemented.
The function effectively parses CSV strings and handles the `parent_codes` column as a list.
352-393: LGTM! The `check_outputs` function is well-structured.
The function effectively consolidates output verification logic and supports both data and metadata checks.
Line range hint 396-440: LGTM! The `single_stage_transform_tester` function improvements enhance readability.
The changes align with best practices and improve the function's structure.
443-501: LGTM! The `multi_stage_transform_tester` function is comprehensive.
The function integrates previous logic into a more robust framework for testing multiple transformation stages.
src/MEDS_transforms/extract/extract_code_metadata.py (1)
367-370: LGTM! The new conditional check enhances robustness.
The added check prevents unnecessary processing when metadata blocks are absent.
tests/test_extract.py (2)
263-268: LGTM! The simplification of `MEDS_OUTPUT_CODE_METADATA_FILE` improves readability.
The changes enhance the maintainability of the test code by focusing on essential validations.
544-544: LGTM! The updates in `test_extraction` align with the new testing strategy.
The removal of `AGGREGATE_CODE_METADATA_SCRIPT` simplifies the test logic.
tests/test_extract_no_metadata.py (2)
128-134: Verify the correctness of expected splits.
Ensure that the expected splits match the actual splits generated by the test. Consider adding comments explaining the expected values.
Run the following script to verify the correctness of expected splits:
Verification successful
Expected splits are subject to change due to non-determinism.
The expected splits in the test files are placeholders that may need updating if the data or random seed changes. The comments in the code already explain this potential variability.
tests/test_extract_no_metadata.py
tests/test_extract.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness of expected splits.
# Test: Search for the expected splits usage. Expect: Correct values with explanations.
rg --type python -A 5 $'EXPECTED_SPLITS'
Length of output: 70
Script:
#!/bin/bash
# Description: Verify the correctness of expected splits.
# Test: Search for the expected splits usage. Expect: Correct values with explanations.
rg --type py -A 5 $'EXPECTED_SPLITS'
Length of output: 1862
16-24: Verify the environment variable usage.
Ensure that the environment variable `DO_USE_LOCAL_SCRIPTS` is set and used correctly. Consider adding default values or error handling.
Run the following script to verify the usage of the environment variable:
Verification successful
Environment Variable Usage Verified
The environment variable `DO_USE_LOCAL_SCRIPTS` is used correctly with a default value of "0", ensuring that the code behaves as expected even if the variable is not set. No additional error handling is necessary.
- Usage found in:
tests/test_extract_no_metadata.py
tests/transform_tester_base.py
tests/test_extract.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `DO_USE_LOCAL_SCRIPTS` environment variable.
# Test: Search for the environment variable usage. Expect: Correct usage with default values or error handling.
rg --type python -A 5 $'os.environ.get("DO_USE_LOCAL_SCRIPTS")'
Length of output: 95
Script:
#!/bin/bash
# Description: Verify the usage of `DO_USE_LOCAL_SCRIPTS` environment variable.
# Test: Search for the environment variable usage. Expect: Correct usage with default values or error handling.
rg --type py -A 5 $'os.environ.get("DO_USE_LOCAL_SCRIPTS")'
Length of output: 60
Script:
#!/bin/bash
# Description: Search for any usage of the `DO_USE_LOCAL_SCRIPTS` environment variable in the codebase.
# Test: Search for any access to the `DO_USE_LOCAL_SCRIPTS` environment variable.
rg --type py -A 5 'DO_USE_LOCAL_SCRIPTS'
Length of output: 7365
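The default-value pattern described above might look roughly like the following. This is a sketch, not the code from the test files: the helper `use_local_scripts` and the comparison against "1" are assumptions, and only the variable name and the "0" default come from the review.

```python
import os


def use_local_scripts(environ=None) -> bool:
    """Read DO_USE_LOCAL_SCRIPTS, treating an unset variable as "0"."""
    if environ is None:
        environ = os.environ
    # Assumption: only the string "1" turns the flag on.
    return environ.get("DO_USE_LOCAL_SCRIPTS", "0") == "1"


# Passing a mapping explicitly keeps the example independent of the real environment.
print(use_local_scripts({}))
print(use_local_scripts({"DO_USE_LOCAL_SCRIPTS": "1"}))
```

Because `get` supplies the fallback, a missing variable never raises `KeyError`, which is why no extra error handling is needed.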
src/MEDS_transforms/aggregate_code_metadata.py (4)
459-463: Verify the correctness of quantile handling.
Ensure that the quantile handling logic is correct and efficient. Consider optimizing if necessary.
Run the following script to verify the correctness of quantile handling:
714-718: Verify the correctness of file operations.
Ensure that file operations are correctly handled and optimized for performance. Consider adding error handling if necessary.
Run the following script to verify the correctness of file operations:
Verification successful
File operations are correctly handled and optimized.
The file operations involving `reducer_fp` are correctly implemented with directory creation and file writing. The use of `exist_ok=True` ensures that existing directories do not cause errors, which is a good practice. No further issues were found.
`aggregate_code_metadata.py`: Correct handling of directory creation and file writing.
`extract_code_metadata.py`: Similar operations are correctly implemented.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness of file operations.
# Test: Search for file operations. Expect: Correct and optimized handling.
rg --type python -A 5 $'reducer_fp'
Length of output: 65
Script:
#!/bin/bash
# Description: Verify the correctness of file operations.
# Test: Search for file operations. Expect: Correct and optimized handling.
rg --type py -A 5 'reducer_fp'
Length of output: 1960
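For readers unfamiliar with the `exist_ok=True` idiom mentioned above, a minimal sketch follows. The path and file contents are placeholders chosen for illustration; the real `reducer_fp` is built elsewhere in the pipeline and is not reproduced here.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder output path, standing in for the real `reducer_fp`.
    reducer_fp = Path(tmp) / "metadata" / "codes.parquet"

    # Create any missing parent directories. With exist_ok=True a second
    # call on an existing directory is a no-op instead of FileExistsError.
    reducer_fp.parent.mkdir(parents=True, exist_ok=True)
    reducer_fp.parent.mkdir(parents=True, exist_ok=True)

    reducer_fp.write_bytes(b"placeholder")
    wrote_file = reducer_fp.is_file()

print(wrote_file)
```

This makes the write step safe to re-run, which is useful when several workers or repeated runs target the same output directory.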
711-718: Verify the correctness of aggregation operations.
Ensure that the aggregation operations are correctly applied and optimized for performance.
Run the following script to verify the correctness of aggregation operations:
Verification successful
Aggregation operations are correctly implemented.
The aggregation operations in the code are structured and applied correctly using the `agg_operations` dictionary. They are dynamically configured and used in data frame operations, ensuring flexibility and correctness.
- The operations are mapped from `CODE_METADATA_AGGREGATIONS` and applied using functions like `group_by` and `select`.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the correctness of aggregation operations.
# Test: Search for aggregation operations. Expect: Correct and optimized operations.
rg --type python -A 5 $'agg_operations'
Length of output: 69
Script:
#!/bin/bash
# Description: Verify the correctness of aggregation operations.
# Test: Search for aggregation operations. Expect: Correct and optimized operations.
rg --type py -A 5 'agg_operations'
Length of output: 3679
470-472
: Verify the summarization logic. Ensure that the summarization logic is correct and aligns with the configuration. Consider adding comments for clarity.
Run the following script to verify the summarization logic:
tests/test_multi_stage_preprocess_pipeline.py (1)
276-360
: Verify the correctness of expected outlier detection results. Ensure that the expected outlier detection results match the actual results generated by the test. Consider adding comments explaining the expected values.
Run the following script to verify the correctness of expected outlier detection results:
MEDS-transforms should be usable under python 3.11 as well as 3.12
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (9)
- .github/workflows/code-quality-main.yaml (1 hunks)
- .github/workflows/code-quality-pr.yaml (1 hunks)
- .github/workflows/python-build.yaml (1 hunks)
- .github/workflows/tests.yaml (1 hunks)
- .pre-commit-config.yaml (1 hunks)
- pyproject.toml (1 hunks)
- src/MEDS_transforms/aggregate_code_metadata.py (5 hunks)
- src/MEDS_transforms/extract/split_and_shard_patients.py (4 hunks)
- src/MEDS_transforms/mapreduce/utils.py (2 hunks)
Files skipped from review due to trivial changes (1)
- pyproject.toml
Files skipped from review as they are similar to previous changes (1)
- src/MEDS_transforms/aggregate_code_metadata.py
Additional context used
Ruff
src/MEDS_transforms/extract/split_and_shard_patients.py
21-21: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
Additional comments not posted (19)
.github/workflows/code-quality-main.yaml (4)
14-17: LGTM! The introduction of the matrix strategy for Python version management enhances flexibility for future updates.
19-19: LGTM! Updating the checkout action to version 4 ensures that the workflow utilizes the most recent features and fixes.
21-24: LGTM! Updating the setup-python action to version 5 ensures that the workflow utilizes the most recent features and fixes. Dynamically referencing the Python version from the matrix improves maintainability.
26-26: LGTM! Using the pre-commit action ensures that code quality checks are consistently applied.
.github/workflows/code-quality-pr.yaml (4)
17-20: LGTM! The introduction of the matrix strategy for Python version management enhances flexibility for future updates.
22-22: LGTM! Updating the checkout action to version 4 ensures that the workflow utilizes the most recent features and fixes.
24-27: LGTM! Updating the setup-python action to version 5 ensures that the workflow utilizes the most recent features and fixes. Dynamically referencing the Python version from the matrix improves maintainability.
Line range hint 35-35: LGTM! Using the pre-commit action ensures that code quality checks are consistently applied.
.github/workflows/tests.yaml (4)
14-15: LGTM! The introduction of the matrix strategy for Python version management enhances flexibility for future updates.
22-22: LGTM! Updating the checkout action to version 4 ensures that the workflow utilizes the most recent features and fixes.
24-27: LGTM! Updating the setup-python action to version 5 ensures that the workflow utilizes the most recent features and fixes. Dynamically referencing the Python version from the matrix improves maintainability.
Line range hint 37-45: LGTM! Using the Codecov actions ensures that coverage and test results are consistently uploaded and tracked.
Tools
actionlint
30-30: shellcheck reported issue in this script: SC2102:info:1:17: Ranges can only match single chars (mentioned due to duplicates)
(shellcheck)
.github/workflows/python-build.yaml (2)
10-13: LGTM! The matrix strategy for the Python version is correctly implemented and enhances the flexibility of the workflow.
The code changes are approved.
16-19: LGTM! The changes to set up Python with the matrix variable and update the action version ensure that the latest features and fixes are utilized.
The code changes are approved.
.pre-commit-config.yaml (1)
48-48: LGTM! The update to the `pyupgrade` hook configuration aligns the codebase with the latest Python standards and allows for the use of new language features introduced in Python 3.11.
The code changes are approved.
src/MEDS_transforms/extract/split_and_shard_patients.py (2)
20-23: LGTM! The changes to the function signature enhance type safety by specifying more precise types for the function parameters and return values.
The code changes are approved.
Tools
Ruff
21-21: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
109-117: LGTM! The changes improve the precision of the validation and enhance the clarity of the feedback for users.
The code changes are approved.
src/MEDS_transforms/mapreduce/utils.py (2)
8-8: LGTM! The introduction of the type variable `DF_T` enhances type hinting capabilities, improving type safety and clarity.
The code changes are approved.
15-15: LGTM! The modification to the `rwlock_wrap` function's signature streamlines its declaration while retaining the type hint for the `read_fn` parameter, improving the clarity of the function's interface.
The code changes are approved.
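To illustrate what a module-level type variable buys in a signature like this, here is a minimal sketch. It does not reproduce the real `rwlock_wrap` signature, which is not shown in this review; the function `read_then_compute` and its parameters are hypothetical, and only the name `DF_T` is taken from the comment above.

```python
from collections.abc import Callable
from pathlib import Path
from typing import TypeVar

# Stands in for whatever data frame type the reader returns.
DF_T = TypeVar("DF_T")


def read_then_compute(
    in_fp: Path,
    read_fn: Callable[[Path], DF_T],
    compute_fn: Callable[[DF_T], DF_T],
) -> DF_T:
    """Read with read_fn, transform with compute_fn, and return the result."""
    # A type checker infers DF_T from read_fn and checks that compute_fn
    # accepts and returns the same type.
    return compute_fn(read_fn(in_fp))


# Toy stand-ins: the "data frame" here is just a list of strings.
result = read_then_compute(
    Path("data.txt"),
    read_fn=lambda fp: [fp.name],
    compute_fn=lambda rows: rows + ["done"],
)
```

A `TypeVar` declared this way works on Python 3.11, whereas the `def f[T](...)` generic syntax requires Python 3.12.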
  patients: np.ndarray,
  n_patients_per_shard: int = 50000,
- external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
+ external_splits: dict[str, Sequence[int]] | None = None,
  split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
Avoid using mutable data structures for argument defaults.
Using mutable default arguments can lead to unexpected behavior. Set the default value to `None` and initialize within the function.
Apply this diff to fix the issue:
- split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+ split_fracs_dict: dict[str, float] | None = None,
Then, initialize the default value within the function:
if external_splits is None:
external_splits = {}
+ if split_fracs_dict is None:
+ split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}
Tools
Ruff
21-21: Do not use mutable data structures for argument defaults. Replace with `None`; initialize within function (B006)
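The behavior behind B006 can be shown with a small standalone sketch. The functions below are hypothetical and unrelated to `shard_patients`; they only show that a mutable default is created once and shared across calls, while a `None` sentinel gives each call a fresh object.

```python
def add_split_shared(name, splits={}):
    # The default dict is created once, at function definition time,
    # so every call without an explicit argument mutates the same object.
    splits[name] = True
    return splits


def add_split_fresh(name, splits=None):
    # None sentinel: a new dict is built on each call that omits the argument.
    if splits is None:
        splits = {}
    splits[name] = True
    return splits


shared_first = add_split_shared("train")
shared_second = add_split_shared("tuning")  # also contains "train"

fresh_first = add_split_fresh("train")
fresh_second = add_split_fresh("tuning")  # contains only "tuning"
```

A mutable default only causes trouble if the function (or a caller) mutates it, but the `None` sentinel removes the risk entirely at the cost of two extra lines.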
…ed integration tests.
Adds an 'extract_values' transform to extract values and retype them from input MEDS data.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (1)
src/MEDS_transforms/parser.py (1)
Line range hint 41-79: LGTM! But consider removing `.keys()`
The function is correctly implemented and provides detailed error messages. However, you can simplify the code by removing `.keys()` in line 69.
- if not all(isinstance(k, str) for k in matcher_cfg.keys()):
+ if not all(isinstance(k, str) for k in matcher_cfg):
The code changes are approved.
Tools
Ruff
69-69: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
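The SIM118 suggestion relies on the fact that iterating over a dict yields its keys. A minimal sketch follows, using a made-up `matcher_cfg` value rather than anything from the project:

```python
# Hypothetical matcher configuration, for illustration only.
matcher_cfg = {"code": "HR", "unit": "bpm"}

# Both generator expressions visit exactly the same keys in the same order.
with_keys = all(isinstance(k, str) for k in matcher_cfg.keys())
without_keys = all(isinstance(k, str) for k in matcher_cfg)

same_iteration = list(matcher_cfg) == list(matcher_cfg.keys())
```

Dropping `.keys()` changes nothing about behavior; it only removes a redundant method call.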
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (10)
- pyproject.toml (2 hunks)
- src/MEDS_transforms/__init__.py (1 hunks)
- src/MEDS_transforms/mapreduce/mapper.py (6 hunks)
- src/MEDS_transforms/parser.py (14 hunks)
- src/MEDS_transforms/transforms/add_time_derived_measurements.py (6 hunks)
- src/MEDS_transforms/transforms/extract_values.py (1 hunks)
- tests/test_extract_values.py (1 hunks)
- tests/test_filter_measurements.py (2 hunks)
- tests/transform_tester_base.py (14 hunks)
- tests/utils.py (4 hunks)
Files skipped from review as they are similar to previous changes (4)
- pyproject.toml
- src/MEDS_transforms/transforms/add_time_derived_measurements.py
- tests/test_filter_measurements.py
- tests/utils.py
Additional context used
Ruff
src/MEDS_transforms/parser.py
69-69: Use `key in dict` instead of `key in dict.keys()`. Remove `.keys()` (SIM118)
Additional comments not posted (16)
src/MEDS_transforms/__init__.py (1)
35-41: LGTM! The addition of `INFERRED_STAGE_KEYS` enhances the module's capability to manage and reference specific stages of data processing.
The code changes are approved.
tests/test_extract_values.py (1)
1-116: LGTM! The test file is well-structured and provides comprehensive test cases for the `extract_values` script.
The code changes are approved.
src/MEDS_transforms/transforms/extract_values.py (1)
1-134: LGTM! The script is well-structured and provides a clear and efficient way to extract values from the MEDS dataset. The use of `hydra` for configuration management and `polars` for data manipulation ensures that the script is both flexible and performant. The detailed docstrings and examples are helpful for understanding how to use the script.
The code changes are approved.
tests/transform_tester_base.py (5)
208-210: LGTM! The function is correctly implemented and enhances flexibility in data parsing.
The code changes are approved.
Line range hint 298-351: LGTM! The function is correctly implemented and enhances modularity by handling `input_code_metadata` as a string and parsing it.
The code changes are approved.
354-365: LGTM! The function is correctly implemented and improves robustness by setting `outputs_from_cohort_dir` to `True` by default.
The code changes are approved.
Line range hint 398-443: LGTM! The function is correctly implemented and enhances readability and structure by utilizing the new context manager and adding additional parameters.
The code changes are approved.
445-503: LGTM! The function is correctly implemented and enhances the testing framework by allowing testing of multiple transformation stages in a single call and handling `stage_configs` and `do_pass_stage_name` as dictionaries.
The code changes are approved.
src/MEDS_transforms/mapreduce/mapper.py (3)
229-249: LGTM! The class is correctly implemented and enhances flexibility in the matching process by defining two distinct modes: `MATCH_AND_REVISE` and `MULTI_MATCH_AND_REVISE`.
The code changes are approved.
311-315: LGTM! The function is correctly implemented and improves robustness by providing more informative feedback when a matcher is invalid.
The code changes are approved.
Line range hint 442-481: LGTM! The function is correctly implemented and enhances flexibility in the matching process by incorporating the new `match_revise_mode` and updating control flow to handle different behaviors.
The code changes are approved.
src/MEDS_transforms/parser.py (5)
119-131: LGTM! The function is correctly implemented and provides detailed error messages.
The code changes are approved.
146-185: LGTM! The new member and method are correctly implemented and provide detailed error messages.
The code changes are approved.
Line range hint 247-267: LGTM! The method is correctly implemented and provides detailed error messages.
The code changes are approved.
Line range hint 303-333: LGTM! The method is correctly implemented and provides detailed error messages.
The code changes are approved.
521-535: LGTM! The function is correctly implemented and provides detailed error messages.
The code changes are approved.
Compiling the recent dev changes into a new main release in preparation for a larger change to dev to support MEDS v0.3.2 in #173
Summary by CodeRabbit
New Features
Bug Fixes
`shard_patients` function for better user feedback on split fraction validation.
Documentation
`pyproject.toml` to reflect the new minimum Python version requirement, broadening compatibility.
Style
Tests
`extract values` script to validate patient data extraction.