
Release Candidate 0.0.6 #175

Merged: 67 commits, Aug 27, 2024

Conversation

@mmcdermott (Owner) commented Aug 21, 2024

Compiling the recent dev changes into a new main release in preparation for a larger change to dev to support MEDS v0.3.2 in #173

Summary by CodeRabbit

  • New Features

    • Enhanced GitHub Actions workflows to support multiple Python versions, improving compatibility and future scalability.
    • Updated pre-commit hook to align with Python 3.11 features, ensuring code modernization.
    • Introduced a new transformation utility for extracting numeric and categorical values from the MEDS dataset (an illustrative sketch follows this summary).
  • Bug Fixes

    • Improved error handling in the shard_patients function for better user feedback on split fraction validation.
  • Documentation

    • Updated pyproject.toml to reflect the new minimum Python version requirement, broadening compatibility.
  • Style

    • Enhanced type safety and clarity in function signatures across various modules.
  • Tests

    • Expanded testing capabilities with matrix strategy to run tests across Python 3.11 and 3.12.
    • Introduced a comprehensive testing framework for the extract values script to validate patient data extraction.
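The new extract-values transform is only summarized above; as a rough sketch of the idea (hypothetical column names and sample data, not the package's actual API), a raw string value column can be split into numeric and categorical outputs with polars:

```python
import polars as pl

# Illustrative only: hypothetical columns and values, not MEDS_transforms' real schema.
df = pl.DataFrame({"code": ["HR", "DX//flu", "HR"], "value": ["88.5", "positive", "91"]})

numeric = pl.col("value").cast(pl.Float32, strict=False)  # non-numeric strings become null
out = df.with_columns(
    numeric_value=numeric,
    categorical_value=pl.when(numeric.is_null()).then(pl.col("value")).otherwise(None),
)
print(out)
```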

mmcdermott and others added 30 commits August 6, 2024 11:53

  • …match and revise to be fully implementable, so it is a scaffold for now until #119 goes in
  • …xtract
  • Remove aggregation of code metadata from default extraction ETL.
  • …ected normalization outputs and update the numbers.
  • Fix tests to expect Float32 throughout and time derived to specifically compute age in float32
  • …er_bug
  • Fixes and expands tests for `aggregate_code_metadata` across various aggregations
coderabbitai bot (Contributor) commented Aug 21, 2024

Walkthrough

The recent changes involve updates to GitHub Actions workflows for Python code quality and testing, enhancing version management through a matrix strategy. The Python setup actions have been upgraded to their latest versions, and the workflows now support multiple Python versions. Additionally, modifications to the pre-commit configuration and type annotations in the codebase improve clarity and compatibility. Overall, these changes streamline the development and testing processes.

Changes

Files and change summaries:

  • .github/workflows/code-quality-main.yaml, .github/workflows/code-quality-pr.yaml, .github/workflows/python-build.yaml, .github/workflows/tests.yaml: Updated GitHub Actions workflows to use a matrix strategy for Python versions; upgraded checkout and setup-python actions to their latest versions.
  • .pre-commit-config.yaml: Updated pyupgrade hook argument from --py310-plus to --py311-plus for compatibility with Python 3.11 features.
  • pyproject.toml: Changed required Python version from 3.12 to a minimum of 3.11 to broaden compatibility.
  • src/MEDS_transforms/aggregate_code_metadata.py: Enhanced mapper_fntr function with a more complex configuration structure and improved aggregation logic.
  • src/MEDS_transforms/extract/split_and_shard_patients.py: Modified shard_patients function signature and internal logic for better type safety and error handling.
  • src/MEDS_transforms/mapreduce/utils.py: Introduced a new type variable DF_T for improved type hinting in the rwlock_wrap function.
  • src/MEDS_transforms/parser.py, src/MEDS_transforms/mapreduce/mapper.py: Enhanced matcher functionality with improved error handling and new enumeration class for flexible operations.
  • tests/test_extract_values.py, tests/test_filter_measurements.py, tests/transform_tester_base.py: Introduced new tests and modified existing ones for improved clarity and functionality; updated parameter names for better understanding.
  • tests/utils.py: Updated schema for numeric values and enhanced CSV parsing logic for better flexibility in data handling.

Sequence Diagram(s)

sequenceDiagram
    participant Developer
    participant GitHubActions
    participant PythonEnv
    participant Codebase

    Developer->>GitHubActions: Push changes
    GitHubActions->>PythonEnv: Set up matrix for Python versions
    PythonEnv->>Codebase: Run tests and quality checks
    Codebase-->>GitHubActions: Return results

🐰 In the meadow where code does play,
The workflows dance, brightening the day.
Python versions now hop in a row,
With tests that leap, and improvements that glow.
A rabbit's cheer for the code so fine,
In this garden of changes, all will align! 🐇✨



@codecov-commenter commented Aug 21, 2024

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 96.27329% with 6 lines in your changes missing coverage. Please review.

Project coverage is 93.51%. Comparing base (6636428) to head (3f73a35).
Report is 2 commits behind head on main.

✅ All tests successful. No failed tests found.

Files Patch % Lines
src/MEDS_transforms/transforms/extract_values.py 90.90% 4 Missing ⚠️
src/MEDS_transforms/mapreduce/mapper.py 95.45% 1 Missing ⚠️
src/MEDS_transforms/transforms/tensorization.py 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #175      +/-   ##
==========================================
+ Coverage   93.38%   93.51%   +0.12%     
==========================================
  Files          25       26       +1     
  Lines        1859     1988     +129     
==========================================
+ Hits         1736     1859     +123     
- Misses        123      129       +6     


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (2)
src/MEDS_transforms/configs/extract.yaml (1)

Review and Update References to aggregate_code_metadata

The aggregate_code_metadata stage is referenced in multiple files, which suggests that its removal might affect tests, documentation, and example scripts. Please review and update the following files to ensure consistency:

  • pyproject.toml
  • tests/transform_tester_base.py
  • docs/preprocessing_operation_prototypes.md
  • tests/test_aggregate_code_metadata.py
  • MIMIC-IV_Example/joint_script_slurm.sh
  • MIMIC-IV_Example/joint_script.sh

Make sure that all references to aggregate_code_metadata are either removed or updated to reflect the current state of the pipeline. This will prevent potential issues with testing, documentation, and script execution.

Analysis chain

Line range hint 7-29: LGTM! Verify the impact of aggregate_code_metadata stage removal.

The removal of the aggregate_code_metadata stage simplifies the pipeline. Ensure that this change aligns with the overall objectives and does not introduce issues.

Run the following script to verify the impact of aggregate_code_metadata stage removal:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the removal of `aggregate_code_metadata` stage does not affect the pipeline.

# Test: Search for occurrences of `aggregate_code_metadata` usage. Expect: No occurrences.
rg --type yaml $'aggregate_code_metadata'

Length of output: 40


Script:

#!/bin/bash
# Description: Search for occurrences of `aggregate_code_metadata` in the entire codebase to ensure its removal is safe.

# Test: Search for occurrences of `aggregate_code_metadata` in all file types.
rg 'aggregate_code_metadata' --hidden -A 5

Length of output: 5070

tests/test_extract_no_metadata.py (1)

9-11: Consider adding error handling for rootutils.setup_root.

The setup_root function is used consistently across multiple test files without error handling. Adding error handling can help manage potential issues during the setup process.

  • tests/test_extract_no_metadata.py
  • tests/test_extract.py
  • tests/transform_tester_base.py
Analysis chain

Verify the usage of rootutils.

Ensure that rootutils.setup_root is correctly setting up the environment. Consider adding error handling if the setup fails.

Run the following script to verify the usage of rootutils.setup_root:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `rootutils.setup_root`.

# Test: Search for `setup_root` usage. Expect: Correct setup with error handling.
rg --type python -A 5 $'setup_root'

Length of output: 65


Script:

#!/bin/bash
# Description: Verify the usage of `rootutils.setup_root`.

# Test: Search for `setup_root` usage. Expect: Correct setup with error handling.
rg --type py -A 5 $'setup_root'

Length of output: 1203

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e7ed727 and 775ba5e.

Files selected for processing (28)
  • MIMIC-IV_Example/README.md (2 hunks)
  • README.md (1 hunks)
  • eICU_Example/README.md (1 hunks)
  • src/MEDS_transforms/aggregate_code_metadata.py (4 hunks)
  • src/MEDS_transforms/configs/extract.yaml (2 hunks)
  • src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml (1 hunks)
  • src/MEDS_transforms/extract/extract_code_metadata.py (1 hunks)
  • src/MEDS_transforms/fit_vocabulary_indices.py (1 hunks)
  • src/MEDS_transforms/transforms/add_time_derived_measurements.py (4 hunks)
  • src/MEDS_transforms/transforms/tensorization.py (2 hunks)
  • src/MEDS_transforms/transforms/tokenization.py (1 hunks)
  • src/MEDS_transforms/utils.py (2 hunks)
  • tests/test_add_time_derived_measurements.py (1 hunks)
  • tests/test_aggregate_code_metadata.py (1 hunks)
  • tests/test_extract.py (6 hunks)
  • tests/test_extract_no_metadata.py (1 hunks)
  • tests/test_filter_measurements.py (2 hunks)
  • tests/test_filter_patients.py (1 hunks)
  • tests/test_fit_vocabulary_indices.py (2 hunks)
  • tests/test_multi_stage_preprocess_pipeline.py (1 hunks)
  • tests/test_normalization.py (3 hunks)
  • tests/test_occlude_outliers.py (1 hunks)
  • tests/test_reorder_measurements.py (1 hunks)
  • tests/test_reshard_to_split.py (1 hunks)
  • tests/test_tensorization.py (2 hunks)
  • tests/test_tokenization.py (10 hunks)
  • tests/transform_tester_base.py (13 hunks)
  • tests/utils.py (4 hunks)
Files skipped from review due to trivial changes (7)
  • MIMIC-IV_Example/README.md
  • README.md
  • src/MEDS_transforms/configs/stage_configs/fit_vocabulary_indices.yaml
  • src/MEDS_transforms/fit_vocabulary_indices.py
  • tests/test_add_time_derived_measurements.py
  • tests/test_filter_measurements.py
  • tests/test_reshard_to_split.py
Additional context used
Ruff
tests/test_extract_no_metadata.py

410-410: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (39)
tests/test_tensorization.py (2)

28-29: LGTM! Verify consistency of parameter name changes across the codebase.

The change from want_outputs to want_data aligns with the new naming convention. Ensure that this change is consistently applied throughout the codebase.

Run the following script to verify the consistency of parameter name changes:

Verification successful

Parameter Name Change Verified Across Codebase

The parameter name want_data is consistently used across the codebase, confirming that the change from want_outputs to want_data has been applied correctly.

  • The parameter want_data appears in multiple test files, including:
    • tests/transform_tester_base.py
    • tests/test_tensorization.py
    • tests/test_reorder_measurements.py
    • tests/test_reshard_to_split.py
    • tests/test_normalization.py
    • tests/test_occlude_outliers.py
    • tests/test_tokenization.py
    • tests/test_filter_patients.py
    • tests/test_filter_measurements.py
    • tests/test_multi_stage_preprocess_pipeline.py
    • tests/test_add_time_derived_measurements.py

This confirms the consistency of the parameter name change throughout the codebase.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the parameter name `want_data` is consistently used in the codebase.

# Test: Search for occurrences of `want_data` usage. Expect: Consistent parameter name.
rg --type python -A 5 $'want_data'

Length of output: 64


Script:

#!/bin/bash
# Description: Verify that the parameter name `want_data` is consistently used in the codebase.

# Test: Search for occurrences of `want_data` usage. Expect: Consistent parameter name.
rg --type py -A 5 $'want_data'

Length of output: 5410


Line range hint 16-20: LGTM! Verify consistency of key changes across the codebase.

The change to append .nrt to each key in WANT_NRTS improves clarity. Ensure that this change is consistently applied throughout the codebase.

Run the following script to verify the consistency of key changes:

Verification successful

Change is consistent within its scope. The .nrt suffix is applied to keys in WANT_NRTS within tests/test_tensorization.py, and no other occurrences were found in the codebase. The change is localized and consistent.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the `.nrt` suffix is consistently applied to keys in the codebase.

# Test: Search for occurrences of `WANT_NRTS` usage. Expect: Consistent key format.
rg --type python -A 5 $'WANT_NRTS'

Length of output: 64


Script:

#!/bin/bash
# Description: Verify that the `.nrt` suffix is consistently applied to keys in the codebase.

# Corrected Test: Search for occurrences of `WANT_NRTS` usage in Python files. Expect: Consistent key format.
rg --glob '*.py' -A 5 'WANT_NRTS'

Length of output: 516

tests/test_fit_vocabulary_indices.py (2)

35-36: LGTM! Verify consistency of parameter name changes across the codebase.

The change from want_outputs to want_metadata aligns with the new naming convention. The direct use of parse_code_metadata_csv simplifies the test setup.

Run the following script to verify the consistency of parameter name changes:

Verification successful

Parameter Name Change Verified: want_metadata is Consistently Used Across the Codebase

The parameter name want_metadata is consistently used across multiple test files, confirming the successful application of the naming convention change.

  • tests/transform_tester_base.py
  • tests/test_fit_vocabulary_indices.py
  • tests/test_aggregate_code_metadata.py
  • tests/test_multi_stage_preprocess_pipeline.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the parameter name `want_metadata` is consistently used in the codebase.

# Test: Search for occurrences of `want_metadata` usage. Expect: Consistent parameter name.
rg --type python -A 5 $'want_metadata'

Length of output: 68


Script:

#!/bin/bash
# Description: Verify that the parameter name `want_metadata` is consistently used in the codebase.

# Test: Search for occurrences of `want_metadata` usage. Expect: Consistent parameter name.
rg -t py -A 5 'want_metadata'

Length of output: 3659


10-12: LGTM! Verify the impact of WANT_SCHEMA and WANT_DF removal.

The removal of WANT_SCHEMA and WANT_DF reduces complexity. Ensure that this removal does not affect other parts of the codebase.

Run the following script to verify the impact of WANT_SCHEMA and WANT_DF removal:

Verification successful

No impact from removal of WANT_SCHEMA and WANT_DF.

The specific identifiers WANT_SCHEMA and WANT_DF were not found in the codebase, indicating that their removal does not affect other parts of the code. However, similar identifiers like WANT_SCHEMAS_TRAIN_0 exist, which should be manually verified to ensure no related issues.

  • File: tests/test_tokenization.py (contains similar identifiers)
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the removal of `WANT_SCHEMA` and `WANT_DF` does not affect the codebase.

# Test: Search for occurrences of `WANT_SCHEMA` and `WANT_DF` usage. Expect: No occurrences.
rg --type python $'WANT_SCHEMA|WANT_DF'

Length of output: 71


Script:

#!/bin/bash
# Description: Verify that the removal of `WANT_SCHEMA` and `WANT_DF` does not affect the codebase.

# Test: Search for occurrences of `WANT_SCHEMA` and `WANT_DF` usage. Expect: No occurrences.
rg 'WANT_SCHEMA|WANT_DF'

Length of output: 708

tests/test_filter_patients.py (1)

85-85: Parameter name change acknowledged.

The parameter name change from want_outputs to want_data reflects a semantic shift in the expected data. Ensure that this change is consistently applied across all relevant tests and documentation.

tests/test_reorder_measurements.py (1)

114-114: Parameter name change acknowledged.

The parameter name change from want_outputs to want_data indicates a semantic shift in the expected data. Ensure that this change is consistently applied across all relevant tests and documentation.

src/MEDS_transforms/transforms/tensorization.py (1)

91-101: Enhanced error handling and logging acknowledged.

The added validation checks for empty columns and the use of logging improve error handling and observability. Ensure that these changes are documented and that any dependent code is updated to handle the new exceptions appropriately.
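The new validation code is not reproduced in the review; below is a hedged sketch of what an empty-column guard with logging could look like (generic polars/logging code, not the actual tensorization implementation):

```python
import logging

import polars as pl

logger = logging.getLogger(__name__)


def require_nonempty(df: pl.DataFrame, col: str) -> pl.Series:
    # Sketch only: reject columns with no usable values before tensorizing, and log why.
    s = df.get_column(col)
    if s.null_count() == len(s):
        logger.error("Column %r has no non-null values; cannot tensorize this shard", col)
        raise ValueError(f"Column {col!r} is empty")
    return s
```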

tests/test_occlude_outliers.py (1)

174-174: Parameter name changed for clarity.

The parameter want_outputs has been renamed to want_data to better reflect its purpose. Ensure that this change is consistent across all related tests and documentation.

tests/utils.py (2)

17-18: Schema update: Precision and new field added.

The numeric_value type has been changed from pl.Float64 to pl.Float32, which may affect precision. Additionally, a new field numeric_value/is_inlier of type pl.Boolean has been added. Ensure that these changes are compatible with the rest of the codebase.
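As a generic illustration of this schema change (sample data only; the test utilities' real schema may differ), the Float64-to-Float32 cast and the new boolean flag look like this in polars:

```python
import polars as pl

df = pl.DataFrame({"numeric_value": [0.123456789]}, schema={"numeric_value": pl.Float64})
df = df.with_columns(
    pl.col("numeric_value").cast(pl.Float32),       # Float32 keeps ~7 significant digits
    pl.lit(True).alias("numeric_value/is_inlier"),  # hypothetical flag value
)
print(df.schema)  # numeric_value: Float32, numeric_value/is_inlier: Boolean
```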


111-112: New parameters improve command execution robustness.

The run_command function now includes stage_name and do_pass_stage_name parameters. This change enhances robustness by enforcing necessary parameters for certain operations. Ensure that all function calls are updated accordingly.

tests/test_aggregate_code_metadata.py (1)

1-186: New test file added for code metadata aggregation.

The test_aggregate_code_metadata.py file introduces a comprehensive test for the aggregate_code_metadata script. The test is well-structured and utilizes polars for DataFrame operations, ensuring clarity and efficiency in testing.

tests/test_tokenization.py (3)

44-44: Verify adequacy of Float32 precision for time_delta_days.

The change to Float32 reduces precision but optimizes memory usage. Ensure that this precision is sufficient for the application's requirements.

Run the following script to verify the adequacy of Float32 precision for the time_delta_days field:


Line range hint 71-102: Verify correctness of updated test data values.

The test data values have been updated to align with the new schema. Ensure that these values are correct and consistent with the schema changes.

Run the following script to verify the correctness of the updated test data values:

Also applies to: 119-133, 145-158, 179-196


42-42: Verify compatibility of UInt8 for code field.

The change to UInt8 optimizes memory usage, but ensure that all values in the code field are within the range of UInt8 (0-255).

Run the following script to verify the compatibility of UInt8 for the code field:

Verification successful

Change to UInt8 for code field is compatible.

The values assigned to the code field in the test cases are within the valid range for UInt8 (0-255), confirming compatibility with the change. No issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all values in the `code` field are within the range of `UInt8`.

# Test: Search for the `code` field values. Expect: All values should be within 0-255.
rg --type python -A 5 $'code: [[5], [1, 10, 11], [10, 11], [10, 11], [10, 11], [4]]'

Length of output: 116


Script:

#!/bin/bash
# Description: Search for assignments and manipulations of the `code` field to verify compatibility with `UInt8`.

# Search for the `code` field assignments in Python files.
rg 'code\s*:\s*\[\[' --glob '*.py' -A 5

# Search for any manipulations or calculations involving the `code` field.
rg 'code' --glob '*.py' -A 5

Length of output: 340503

tests/test_normalization.py (3)

134-161: Verify correctness of updated expected output strings.

The expected output strings have been adjusted to align with new calculations. Ensure that these values are correct and consistent with the schema changes.

Run the following script to verify the correctness of the updated expected output strings:

Also applies to: 168-179, 186-190, 197-205


90-90: Verify correctness of output formatting change.

The output formatting now converts each value to a float using a list comprehension. Ensure that this change does not affect the correctness of the output.

Run the following script to verify the correctness of the output formatting change:

Verification successful

Output formatting change verified.

The conversion of values to floats before printing in tests/test_normalization.py is correctly implemented and does not affect the correctness of the output.

  • Location: tests/test_normalization.py line with print([float(x) for x in subj_vals])
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of the output formatting change.

# Test: Search for output formatting change. Expect: Correctness should be maintained.
rg --type python -A 5 $'print([float(x) for x in subj_vals])'

Length of output: 93


Script:

#!/bin/bash
# Description: Verify the correctness of the output formatting change by searching for the specific print statement.

# Correcting the previous command by removing the unrecognized option.
rg -A 5 'print([float(x) for x in subj_vals])'

Length of output: 21752


84-84: Verify adequacy of np.float32 precision for normalization calculations.

The change to np.float32 reduces precision but optimizes memory usage. Ensure that this precision is sufficient for the application's requirements.

Run the following script to verify the adequacy of np.float32 precision for normalization calculations:

eICU_Example/README.md (3)

31-44: Verify overall clarity and usability of the README.

The README has been updated to improve the usability of the installation instructions. Ensure that the overall clarity and usability are improved.

Run the following script to verify the overall clarity and usability of the README:

Verification successful

README Clarity and Usability Verified

The installation instructions in eICU_Example/README.md are clear and formatted correctly. No issues with clarity or usability were found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the overall clarity and usability of the README.

# Test: Search for updated sections in the README. Expect: Overall clarity and usability should be improved.
rg --type markdown -A 5 $'## Step 0: Installation'

Length of output: 761


37-44: Verify correctness and effectiveness of script download and permission setting commands.

New commands download necessary scripts using wget and set executable permissions. Ensure that these commands are correct and effective.

Run the following script to verify the correctness and effectiveness of the script download and permission setting commands:

Verification successful

Commands are correctly documented in the README.md

The wget and chmod commands for downloading and setting permissions on the scripts are correctly documented in the README.md. Assuming they are executed in an environment with internet access and proper permissions, they should work as intended. No further issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness and effectiveness of script download and permission setting commands.

# Test: Search for script download and permission setting commands. Expect: Commands should be correct and effective.
rg --type markdown -A 5 $'wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/eICU_Example/joint_script.sh'

Length of output: 544


31-36: Verify clarity and correctness of installation instructions.

The installation instructions specify the installation of MEDS_transforms with the local_parallelism option. Ensure that these instructions are clear and correct.

Run the following script to verify the clarity and correctness of the installation instructions:

src/MEDS_transforms/transforms/tokenization.py (1)

187-198: LGTM! But verify the function usage in the codebase.

The changes improve type safety and memory efficiency. Ensure that all function calls to extract_seq_of_patient_events are consistent with the updated data types.

Run the following script to verify the function usage:

Also applies to: 203-210

Verification successful

Function usage is consistent within the file.

The extract_seq_of_patient_events function is defined and used within src/MEDS_transforms/transforms/tokenization.py. There are no external usages, ensuring that the changes do not impact other parts of the codebase.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `extract_seq_of_patient_events`.

# Test: Search for the function usage. Expect: Consistency with updated data types.
rg --type python -A 5 $'extract_seq_of_patient_events'

Length of output: 84


Script:

#!/bin/bash
# Description: Verify all function calls to `extract_seq_of_patient_events`.

# Test: Search for the function usage. Expect: Consistency with updated data types.
rg --type py -A 5 $'extract_seq_of_patient_events'

Length of output: 1732

src/MEDS_transforms/utils.py (1)

247-247: LGTM! But verify the function usage in the codebase.

The changes reflect a shift in data flow, particularly in metadata input directory handling. Ensure that all function calls to populate_stage are consistent with the updated logic.

Run the following script to verify the function usage:

Also applies to: 307-307

Verification successful

Function usage verified successfully.

The populate_stage function is used consistently with the updated logic, as evidenced by the doctests and its integration with OmegaConf. No further action is required.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `populate_stage`.

# Test: Search for the function usage. Expect: Consistency with updated logic.
rg --type python -A 5 $'populate_stage'

Length of output: 69


Script:

#!/bin/bash
# Description: Verify all function calls to `populate_stage`.

# Test: Search for the function usage. Expect: Consistency with updated logic.
rg --type py -A 5 $'populate_stage'

Length of output: 4187

src/MEDS_transforms/transforms/add_time_derived_measurements.py (1)

233-237: LGTM! But verify the function usage in the codebase.

The changes improve type consistency and flexibility in type conversion. Ensure that all function calls to age_fntr are consistent with the updated data types.

Run the following script to verify the function usage:

Also applies to: 253-253

Verification successful

Function Usage Verified

The function age_fntr is used consistently within the codebase, and its usage aligns with the expected data types. The examples provided in the code further confirm its correct application.

  • Usage examples and integration with add_new_events_fntr are present in src/MEDS_transforms/transforms/add_time_derived_measurements.py.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `age_fntr`.

# Test: Search for the function usage. Expect: Consistency with updated data types.
rg --type python -A 5 $'age_fntr'

Length of output: 63


Script:

#!/bin/bash
# Description: Verify all function calls to `age_fntr`.

# Test: Search for the function usage. Expect: Consistency with updated data types.
rg -t py -A 5 'age_fntr'

Length of output: 3357

tests/transform_tester_base.py (6)

7-12: LGTM! Imports are correctly updated for YAML handling.

The addition of YAML handling imports is appropriate for the new functionality.


206-208: LGTM! The parse_shards_yaml function is well-implemented.

The function effectively processes YAML strings into DataFrames with dynamic schema updates.
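The helper's body is not shown in this excerpt; a minimal sketch of the described behavior, assuming a YAML mapping of shard names to rows (hypothetical field names; requires PyYAML), might look like:

```python
import polars as pl
import yaml  # PyYAML

raw = """
train/0:
  - {patient_id: 1, code: "HR", numeric_value: 88.5}
  - {patient_id: 2, code: "DX//flu", numeric_value: null}
"""

# One DataFrame per shard; the real helper also applies dynamic schema updates.
shards = {name: pl.DataFrame(rows) for name, rows in yaml.safe_load(raw).items()}
print(shards["train/0"])
```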


210-217: LGTM! The parse_code_metadata_csv function is correctly implemented.

The function effectively parses CSV strings and handles the parent_codes column as a list.


352-393: LGTM! The check_outputs function is well-structured.

The function effectively consolidates output verification logic and supports both data and metadata checks.


Line range hint 396-440: LGTM! The single_stage_transform_tester function improvements enhance readability.

The changes align with best practices and improve the function's structure.


443-501: LGTM! The multi_stage_transform_tester function is comprehensive.

The function integrates previous logic into a more robust framework for testing multiple transformation stages.

src/MEDS_transforms/extract/extract_code_metadata.py (1)

367-370: LGTM! The new conditional check enhances robustness.

The added check prevents unnecessary processing when metadata blocks are absent.

tests/test_extract.py (2)

263-268: LGTM! The simplification of MEDS_OUTPUT_CODE_METADATA_FILE improves readability.

The changes enhance the maintainability of the test code by focusing on essential validations.


544-544: LGTM! The updates in test_extraction align with the new testing strategy.

The removal of AGGREGATE_CODE_METADATA_SCRIPT simplifies the test logic.

tests/test_extract_no_metadata.py (2)

128-134: Verify the correctness of expected splits.

Ensure that the expected splits match the actual splits generated by the test. Consider adding comments explaining the expected values.

Run the following script to verify the correctness of expected splits:

Verification successful

Expected splits are subject to change due to non-determinism.

The expected splits in the test files are placeholders that may need updating if the data or random seed changes. The comments in the code already explain this potential variability.

  • tests/test_extract_no_metadata.py
  • tests/test_extract.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of expected splits.

# Test: Search for the expected splits usage. Expect: Correct values with explanations.
rg --type python -A 5 $'EXPECTED_SPLITS'

Length of output: 70


Script:

#!/bin/bash
# Description: Verify the correctness of expected splits.

# Test: Search for the expected splits usage. Expect: Correct values with explanations.
rg --type py -A 5 $'EXPECTED_SPLITS'

Length of output: 1862


16-24: Verify the environment variable usage.

Ensure that the environment variable DO_USE_LOCAL_SCRIPTS is set and used correctly. Consider adding default values or error handling.

Run the following script to verify the usage of the environment variable:

Verification successful

Environment Variable Usage Verified

The environment variable DO_USE_LOCAL_SCRIPTS is used correctly with a default value of "0", ensuring that the code behaves as expected even if the variable is not set. No additional error handling is necessary.

  • Usage found in:
    • tests/test_extract_no_metadata.py
    • tests/transform_tester_base.py
    • tests/test_extract.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `DO_USE_LOCAL_SCRIPTS` environment variable.

# Test: Search for the environment variable usage. Expect: Correct usage with default values or error handling.
rg --type python -A 5 $'os.environ.get("DO_USE_LOCAL_SCRIPTS")'

Length of output: 95


Script:

#!/bin/bash
# Description: Verify the usage of `DO_USE_LOCAL_SCRIPTS` environment variable.

# Test: Search for the environment variable usage. Expect: Correct usage with default values or error handling.
rg --type py -A 5 $'os.environ.get("DO_USE_LOCAL_SCRIPTS")'

Length of output: 60


Script:

#!/bin/bash
# Description: Search for any usage of the `DO_USE_LOCAL_SCRIPTS` environment variable in the codebase.

# Test: Search for any access to the `DO_USE_LOCAL_SCRIPTS` environment variable.
rg --type py -A 5 'DO_USE_LOCAL_SCRIPTS'

Length of output: 7365

src/MEDS_transforms/aggregate_code_metadata.py (4)

459-463: Verify the correctness of quantile handling.

Ensure that the quantile handling logic is correct and efficient. Consider optimizing if necessary.

Run the following script to verify the correctness of quantile handling:


714-718: Verify the correctness of file operations.

Ensure that file operations are correctly handled and optimized for performance. Consider adding error handling if necessary.

Run the following script to verify the correctness of file operations:

Verification successful

File operations are correctly handled and optimized.

The file operations involving reducer_fp are correctly implemented with directory creation and file writing. The use of exist_ok=True ensures that existing directories do not cause errors, which is a good practice. No further issues were found.

  • aggregate_code_metadata.py: Correct handling of directory creation and file writing.
  • extract_code_metadata.py: Similar operations are correctly implemented.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of file operations.

# Test: Search for file operations. Expect: Correct and optimized handling.
rg --type python -A 5 $'reducer_fp'

Length of output: 65


Script:

#!/bin/bash
# Description: Verify the correctness of file operations.

# Test: Search for file operations. Expect: Correct and optimized handling.
rg --type py -A 5 'reducer_fp'

Length of output: 1960


711-718: Verify the correctness of aggregation operations.

Ensure that the aggregation operations are correctly applied and optimized for performance.

Run the following script to verify the correctness of aggregation operations:

Verification successful

Aggregation operations are correctly implemented.

The aggregation operations in the code are structured and applied correctly using the agg_operations dictionary. They are dynamically configured and used in data frame operations, ensuring flexibility and correctness.

  • The operations are mapped from CODE_METADATA_AGGREGATIONS and applied using functions like group_by and select.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of aggregation operations.

# Test: Search for aggregation operations. Expect: Correct and optimized operations.
rg --type python -A 5 $'agg_operations'

Length of output: 69


Script:

#!/bin/bash
# Description: Verify the correctness of aggregation operations.

# Test: Search for aggregation operations. Expect: Correct and optimized operations.
rg --type py -A 5 'agg_operations'

Length of output: 3679


470-472: Verify the summarization logic.

Ensure that the summarization logic is correct and aligns with the configuration. Consider adding comments for clarity.

Run the following script to verify the summarization logic:

tests/test_multi_stage_preprocess_pipeline.py (1)

276-360: Verify the correctness of expected outlier detection results.

Ensure that the expected outlier detection results match the actual results generated by the test. Consider adding comments explaining the expected values.

Run the following script to verify the correctness of expected outlier detection results:

Resolved review threads: tests/test_extract_no_metadata.py, tests/test_multi_stage_preprocess_pipeline.py
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 775ba5e and 1f80b15.

Files selected for processing (9)
  • .github/workflows/code-quality-main.yaml (1 hunks)
  • .github/workflows/code-quality-pr.yaml (1 hunks)
  • .github/workflows/python-build.yaml (1 hunks)
  • .github/workflows/tests.yaml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • pyproject.toml (1 hunks)
  • src/MEDS_transforms/aggregate_code_metadata.py (5 hunks)
  • src/MEDS_transforms/extract/split_and_shard_patients.py (4 hunks)
  • src/MEDS_transforms/mapreduce/utils.py (2 hunks)
Files skipped from review due to trivial changes (1)
  • pyproject.toml
Files skipped from review as they are similar to previous changes (1)
  • src/MEDS_transforms/aggregate_code_metadata.py
Additional context used
Ruff
src/MEDS_transforms/extract/split_and_shard_patients.py

21-21: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Additional comments not posted (19)
.github/workflows/code-quality-main.yaml (4)

14-17: LGTM!

The introduction of the matrix strategy for Python version management enhances flexibility for future updates.


19-19: LGTM!

Updating the checkout action to version 4 ensures that the workflow utilizes the most recent features and fixes.


21-24: LGTM!

Updating the setup-python action to version 5 ensures that the workflow utilizes the most recent features and fixes. Dynamically referencing the Python version from the matrix improves maintainability.


26-26: LGTM!

Using the pre-commit action ensures that code quality checks are consistently applied.

.github/workflows/code-quality-pr.yaml (4)

17-20: LGTM!

The introduction of the matrix strategy for Python version management enhances flexibility for future updates.


22-22: LGTM!

Updating the checkout action to version 4 ensures that the workflow utilizes the most recent features and fixes.


24-27: LGTM!

Updating the setup-python action to version 5 ensures that the workflow utilizes the most recent features and fixes. Dynamically referencing the Python version from the matrix improves maintainability.


Line range hint 35-35: LGTM!

Using the pre-commit action ensures that code quality checks are consistently applied.

.github/workflows/tests.yaml (4)

14-15: LGTM!

The introduction of the matrix strategy for Python version management enhances flexibility for future updates.


22-22: LGTM!

Updating the checkout action to version 4 ensures that the workflow utilizes the most recent features and fixes.


24-27: LGTM!

Updating the setup-python action to version 5 ensures that the workflow utilizes the most recent features and fixes. Dynamically referencing the Python version from the matrix improves maintainability.


Line range hint 37-45: LGTM!

Using the Codecov actions ensures that coverage and test results are consistently uploaded and tracked.

Tools
actionlint

30-30: shellcheck reported issue in this script: SC2102:info:1:17: Ranges can only match single chars (mentioned due to duplicates)

(shellcheck)

.github/workflows/python-build.yaml (2)

10-13: LGTM!

The matrix strategy for the Python version is correctly implemented and enhances the flexibility of the workflow.

The code changes are approved.


16-19: LGTM!

The changes to set up Python with the matrix variable and update the action version ensure that the latest features and fixes are utilized.

The code changes are approved.

.pre-commit-config.yaml (1)

48-48: LGTM!

The update to the pyupgrade hook configuration aligns the codebase with the latest Python standards and allows for the use of new language features introduced in Python 3.11.

The code changes are approved.

src/MEDS_transforms/extract/split_and_shard_patients.py (2)

20-23: LGTM!

The changes to the function signature enhance type safety by specifying more precise types for the function parameters and return values.

The code changes are approved.

Tools
Ruff

21-21: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


109-117: LGTM!

The changes improve the precision of the validation and enhance the clarity of the feedback for users.

The code changes are approved.
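The updated validation itself is not quoted in this comment; an illustrative sketch of the kind of split-fraction check described (not the function's actual code):

```python
def validate_split_fracs(split_fracs: dict[str, float]) -> None:
    # Sketch only: give the user precise feedback when fractions do not sum to 1.
    total = sum(split_fracs.values())
    if abs(total - 1.0) > 1e-8:
        raise ValueError(f"Split fractions must sum to 1; got {total:.6f} from {split_fracs}")


validate_split_fracs({"train": 0.8, "tuning": 0.1, "held_out": 0.1})  # passes silently
```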

src/MEDS_transforms/mapreduce/utils.py (2)

8-8: LGTM!

The introduction of the type variable DF_T enhances type hinting capabilities, improving type safety and clarity.

The code changes are approved.
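As background on the DF_T pattern (a hypothetical signature, not rwlock_wrap's real one), a TypeVar lets the return type of the supplied reader flow through to the wrapper's result:

```python
from collections.abc import Callable
from pathlib import Path
from typing import TypeVar

DF_T = TypeVar("DF_T")  # stands in for whatever dataframe type the reader returns


def read_with(read_fn: Callable[[Path], DF_T], fp: Path) -> DF_T:
    # Hypothetical helper: the annotation ties the result type to read_fn's return type.
    return read_fn(fp)
```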


15-15: LGTM!

The modification to the rwlock_wrap function's signature streamlines its declaration while retaining the type hint for the read_fn parameter, improving the clarity of the function's interface.

The code changes are approved.

patients: np.ndarray,
n_patients_per_shard: int = 50000,
external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
external_splits: dict[str, Sequence[int]] | None = None,
split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},

Avoid using mutable data structures for argument defaults.

Using mutable default arguments can lead to unexpected behavior. Set the default value to None and initialize within the function.

Apply this diff to fix the issue:

-    split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+    split_fracs_dict: dict[str, float] | None = None,

Then, initialize the default value within the function:

    if external_splits is None:
        external_splits = {}
+    if split_fracs_dict is None:
+        split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}
Tools
Ruff

21-21: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
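For context, a minimal standalone demonstration of the pitfall B006 guards against (generic Python, unrelated to the repository's code):

```python
def shard(split_fracs: dict[str, float] = {}):  # noqa: B006 -- deliberately buggy
    split_fracs.setdefault("train", 0.8)
    return split_fracs


first = shard()
first["held_out"] = 0.2  # mutates the one shared default object
print(shard())           # {'train': 0.8, 'held_out': 0.2} -- state leaks across calls
```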

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
src/MEDS_transforms/parser.py (1)

Line range hint 41-79: LGTM! But consider removing .keys()

The function is correctly implemented and provides detailed error messages. However, you can simplify the code by removing .keys() in line 69.

- if not all(isinstance(k, str) for k in matcher_cfg.keys()):
+ if not all(isinstance(k, str) for k in matcher_cfg):

The code changes are approved.

Tools
Ruff

69-69: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1f80b15 and 3f73a35.

Files selected for processing (10)
  • pyproject.toml (2 hunks)
  • src/MEDS_transforms/__init__.py (1 hunks)
  • src/MEDS_transforms/mapreduce/mapper.py (6 hunks)
  • src/MEDS_transforms/parser.py (14 hunks)
  • src/MEDS_transforms/transforms/add_time_derived_measurements.py (6 hunks)
  • src/MEDS_transforms/transforms/extract_values.py (1 hunks)
  • tests/test_extract_values.py (1 hunks)
  • tests/test_filter_measurements.py (2 hunks)
  • tests/transform_tester_base.py (14 hunks)
  • tests/utils.py (4 hunks)
Files skipped from review as they are similar to previous changes (4)
  • pyproject.toml
  • src/MEDS_transforms/transforms/add_time_derived_measurements.py
  • tests/test_filter_measurements.py
  • tests/utils.py
Additional context used
Ruff
src/MEDS_transforms/parser.py

69-69: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (16)
src/MEDS_transforms/__init__.py (1)

35-41: LGTM!

The addition of INFERRED_STAGE_KEYS enhances the module's capability to manage and reference specific stages of data processing.

The code changes are approved.

tests/test_extract_values.py (1)

1-116: LGTM!

The test file is well-structured and provides comprehensive test cases for the extract_values script.

The code changes are approved.

src/MEDS_transforms/transforms/extract_values.py (1)

1-134: LGTM!

The script is well-structured and provides a clear and efficient way to extract values from the MEDS dataset. The use of hydra for configuration management and polars for data manipulation ensures that the script is both flexible and performant. The detailed docstrings and examples are helpful for understanding how to use the script.

The code changes are approved.

tests/transform_tester_base.py (5)

208-210: LGTM!

The function is correctly implemented and enhances flexibility in data parsing.

The code changes are approved.


Line range hint 298-351: LGTM!

The function is correctly implemented and enhances modularity by handling input_code_metadata as a string and parsing it.

The code changes are approved.


354-365: LGTM!

The function is correctly implemented and improves robustness by setting outputs_from_cohort_dir to True by default.

The code changes are approved.


Line range hint 398-443: LGTM!

The function is correctly implemented and enhances readability and structure by utilizing the new context manager and adding additional parameters.

The code changes are approved.


445-503: LGTM!

The function is correctly implemented and enhances the testing framework by allowing testing of multiple transformation stages in a single call and handling stage_configs and do_pass_stage_name as dictionaries.

The code changes are approved.

src/MEDS_transforms/mapreduce/mapper.py (3)

229-249: LGTM!

The class is correctly implemented and enhances flexibility in the matching process by defining two distinct modes: MATCH_AND_REVISE and MULTI_MATCH_AND_REVISE.

The code changes are approved.
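The enumeration itself is not shown here; a hedged sketch using the two mode names mentioned above (the actual class in mapreduce/mapper.py may be structured differently):

```python
from enum import StrEnum  # available from Python 3.11, the new minimum version


class MatchReviseMode(StrEnum):
    MATCH_AND_REVISE = "match_and_revise"
    MULTI_MATCH_AND_REVISE = "multi_match_and_revise"


# Configuration strings map directly onto members, keeping mode handling flexible.
print(MatchReviseMode("match_and_revise") is MatchReviseMode.MATCH_AND_REVISE)  # True
```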


311-315: LGTM!

The function is correctly implemented and improves robustness by providing more informative feedback when a matcher is invalid.

The code changes are approved.


Line range hint 442-481: LGTM!

The function is correctly implemented and enhances flexibility in the matching process by incorporating the new match_revise_mode and updating control flow to handle different behaviors.

The code changes are approved.

src/MEDS_transforms/parser.py (5)

119-131: LGTM!

The function is correctly implemented and provides detailed error messages.

The code changes are approved.


146-185: LGTM!

The new member and method are correctly implemented and provide detailed error messages.

The code changes are approved.


Line range hint 247-267: LGTM!

The method is correctly implemented and provides detailed error messages.

The code changes are approved.


Line range hint 303-333: LGTM!

The method is correctly implemented and provides detailed error messages.

The code changes are approved.


521-535: LGTM!

The function is correctly implemented and provides detailed error messages.

The code changes are approved.

@mmcdermott mmcdermott merged commit 9549d7e into main Aug 27, 2024
7 checks passed