Release candidate for 0.0.4 #146

Merged · 83 commits · Aug 11, 2024
Conversation

@mmcdermott (Owner) commented Aug 11, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a new .editorconfig file to enforce consistent coding styles across the project.
    • Added a GitHub Actions workflow for automated building and publishing of Python packages to PyPI.
    • Enhanced testing workflow to generate JUnit-style XML reports and upload results to Codecov.
    • Added documentation configuration for Read the Docs to streamline the documentation build process.
    • Introduced scripts for generating code reference pages automatically and improved documentation organization.
    • Implemented functionality for resharding data processing workflows, including dynamic configuration handling and improved data validation.
  • Bug Fixes

    • Adjusted various scripts and configuration files to improve data handling and processing, ensuring consistent paths and file management.
  • Documentation

    • Updated multiple README files and documentation to enhance clarity and organization.
    • Added detailed guides on the data processing pipeline setup and usage.
  • Tests

    • Enhanced test coverage for new features related to data resharding and filtering.
    • Consolidated and improved readability of test cases and their corresponding functions.
  • Chores

    • Refactored imports across files for improved readability and maintainability.
    • Streamlined configuration management for testing and running commands.

mmcdermott and others added 30 commits July 29, 2024 17:33
…an example of loading MEDS_transforms configs as defaults dynamically.
…low for pushing to pypi. The pypi push workflow requires local tagging via 'git tag 0.0.3' then 'git push origin 0.0.3'
This moves the parser to a general import, and absorbs some type checking commits from another PR to clean up the PR.
Improved compliance by removing creation of shards.json file and adding `patient_splits.parquet` file.
Adds a reshard-by-split stage as a transform.
…arallel_mode

Fixes brittle reduce stage checking in Reshard stage.
@coderabbitai bot (Contributor) commented Aug 11, 2024

Walkthrough

This update introduces several enhancements across the project, including a new .editorconfig for consistent coding styles, automated workflows for building and publishing Python packages, and improved testing processes with JUnit reports. Additionally, documentation has been structured with MkDocs and Read the Docs configurations. The introduction of new features, such as a reshard_to_split mechanism, further expands the functionality, ensuring better data handling and processing throughout the MEDS Transforms pipeline.

Changes

| File(s) | Change Summary |
| --- | --- |
| `.editorconfig` | New file added to establish coding style guidelines, including indentation, line endings, and specific settings for YAML files. |
| `.github/workflows/python-build.yaml` | New GitHub Actions workflow for building and publishing Python packages to PyPI and TestPyPI, including jobs for building, publishing, and releasing. |
| `.github/workflows/tests.yaml` | Enhancements to the testing job, including JUnit XML report generation and Codecov result uploads. |
| `.pre-commit-config.yaml` | Updates to excluded files and added new Markdown formatting tools. |
| `.readthedocs.yaml` | New configuration file for building documentation with Read the Docs, specifying the build environment and package installation methods. |
| `docs/gen_ref_pages.py`, `docs/index.md` | New script for generating code reference pages and a new index file linking to the README for improved documentation navigation. |
| `docs/javascripts/mathjax.js` | New configuration for the MathJax library to render mathematical expressions in documentation. |
| `docs/pipeline_configuration.md`, `docs/preprocessing_operation_prototypes.md` | Formatting adjustments to enhance readability without changing content. |
| `mkdocs.yml` | New configuration file for the MkDocs documentation site, including navigation structure and markdown extensions. |
| `pyproject.toml` | Addition of documentation dependencies and a new script entry for transformation functionality. |
| `src/MEDS_transforms/__init__.py` | New constants and data definitions added for mandatory columns and types, improving data handling. |
| `src/MEDS_transforms/aggregate_code_metadata.py` | New aggregation type for quantiles added with corresponding function and updates to existing mappings. |
| `src/MEDS_transforms/extract/...` | Various changes across multiple files, including adjustments to file paths, enhanced functionality, and updates to data handling. |
| `tests/...` | Significant enhancements to various test files, including consolidated import statements, new test functions, and improved testing frameworks. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant GitHub
    participant PyPI
    participant Docs

    User->>GitHub: Push code changes
    GitHub->>GitHub: Run build workflow
    GitHub->>PyPI: Publish package
    GitHub->>Docs: Generate documentation
    Docs-->>User: Access updated docs
```

🐇 In the land of code where changes grow,
A rabbit hops to share the flow.
With neat configs and workflows bright,
Our data dances in the light.
From docs to tests, all polished and grand,
Join the celebration, lend a hand! 🌟



@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

Outside diff range, codebase verification and nitpick comments (21)
src/MEDS_transforms/__init__.py (1)

26-33: DEPRECATED_NAMES is not used in the codebase.

The DEPRECATED_NAMES dictionary is defined but not utilized anywhere in the codebase. Ensure that these mappings are applied where necessary to maintain backward compatibility in data transformations.

  • src/MEDS_transforms/__init__.py: Declaration of DEPRECATED_NAMES.
Analysis chain

Check DEPRECATED_NAMES for backward compatibility.

The DEPRECATED_NAMES dictionary maps old column names to new ones. Ensure these mappings are correctly applied to maintain backward compatibility.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `DEPRECATED_NAMES` in the codebase.

# Test: Search for `DEPRECATED_NAMES` usage. Expect: Correct application in data transformation logic.
rg --type python 'DEPRECATED_NAMES'

Length of output: 66


Script:

#!/bin/bash
# Description: Verify the usage of `DEPRECATED_NAMES` in the codebase.

# Test: Search for `DEPRECATED_NAMES` usage. Expect: Correct application in data transformation logic.
rg --type py 'DEPRECATED_NAMES'

Length of output: 84
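
To make the intended backward compatibility concrete, here is a minimal sketch of how such a mapping could be applied once it is wired in. The mapping contents and the `rename_deprecated` helper are hypothetical, since the constant is not yet used anywhere:

```python
import polars as pl

# Hypothetical old -> new column names, standing in for DEPRECATED_NAMES.
DEPRECATED_NAMES = {"numerical_value": "numeric_value", "timestamp": "time"}


def rename_deprecated(df: pl.DataFrame) -> pl.DataFrame:
    """Rename any deprecated columns present in `df` to their current names."""
    present = {old: new for old, new in DEPRECATED_NAMES.items() if old in df.columns}
    return df.rename(present)
```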

docs/gen_ref_pages.py (1)

1-1: Add a detailed module docstring.

The current docstring provides a brief description. Consider expanding it to include details about the script's purpose, usage, and any assumptions or dependencies.

-"""Generate the code reference pages."""
+"""Generate the code reference pages for the project.

This script traverses the source directory to create markdown documentation files
for each Python module. It integrates with MkDocs to build a navigable API reference.
"""
.github/workflows/python-build.yaml (2)

1-1: Clarify the workflow name.

The name "Publish Python 🐍 distribution 📦 to PyPI and TestPyPI" suggests publishing to both PyPI and TestPyPI, but the workflow only targets PyPI. Consider updating the name for clarity.

- name: Publish Python 🐍 distribution 📦 to PyPI and TestPyPI
+ name: Publish Python 🐍 distribution 📦 to PyPI

82-82: Improve GitHub release notes.

Currently, the release notes are empty. Consider adding meaningful release notes to provide context for the release.

- --notes ""
+ --notes "Release notes for version ${{ github.ref_name }}"
.pre-commit-config.yaml (1)

129-129: Improve readability of nbqa-flake8 args.

The args for nbqa-flake8 have been reformatted to a single line. Consider using a multi-line format for better readability.

- args: ["--extend-ignore=E203,E402,E501,F401,F841", "--exclude=logs/*,data/*"]
+ args:
+   - "--extend-ignore=E203,E402,E501,F401,F841"
+   - "--exclude=logs/*,data/*"
docs/pipeline_configuration.md (4)

16-29: Correct punctuation in the list of stages.

There are loose punctuation marks at the start of each stage description. Consider reformatting for consistency.

- 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the
+ 1. `stage_1`: A metadata map-reduce stage (e.g., counting the occurrence rates of the codes in the data).
- 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes
+ 2. `stage_2`: A metadata-only processing stage (e.g., filtering the code dataframe to only codes that occur more than 10 times).
- 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the
+ 3. `stage_3`: A data processing stage (e.g., filtering the data to only rows with a code that is in the current running metadata file).
Tools
LanguageTool

[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1. stage_1: A metadata map-reduce stage (e.g., coun...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2. stage_2: A metadata-only processing stage (e.g.,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3. stage_3: A data processing stage (e.g., filterin...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4. stage_4: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5. stage_5: A data processing stage (e.g., occludin...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6. stage_6: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7. stage_7: A data processing stage (e.g., normaliz...

(UNLIKELY_OPENING_PUNCTUATION)


46-46: Consider adding a comma for clarity.

A comma is missing after "directory" for improved readability.

- This stage will read in the data from the `$INPUT_DIR/data` directory as there has still been no
+ This stage will read in the data from the `$INPUT_DIR/data` directory, as there has still been no
Tools
LanguageTool

[uncategorized] ~46-~46: Possible missing comma found.
Context: ... in the data from the $INPUT_DIR/data directory as there has still been no prio...

(AI_HYDRA_LEO_MISSING_COMMA)


82-82: Clarify the use of 'reduce' in the context.

The use of 'reduce' might be unclear. Consider rephrasing for clarity.

- (which do not have reduce stages).
+ (which do not include reduction stages).
Tools
LanguageTool

[grammar] ~82-~82: The verb form ‘reduce’ does not seem to be suitable in this context.
Context: ...ssing stages (which do not have reduce stages).

(HAVE_VB)


1-7: Misalignment Between Documentation and Code Implementation

The introduction in the documentation mentions the design philosophy of being "simple, reusable, efficient, and configurable." However, the implementation primarily references "simple" and lacks mentions of "reusable," "efficient," and "configurable." This suggests a potential misalignment between the documentation and the codebase. Consider reviewing the implementation to ensure consistency with the documented design philosophy.

  • Files with "simple": Multiple occurrences in src/MEDS_transforms/parser.py, aggregate_code_metadata.py, and others.
  • Missing Terms: No instances of "reusable," "efficient," or "configurable" found in the implementation.
Analysis chain

Ensure clarity and consistency in the introduction.

The introduction outlines the design philosophy. Ensure it aligns with the actual implementation and is clear to the intended audience.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the alignment of the introduction with the implementation.

# Test: Search for design philosophy keywords in the implementation. Expect: Consistent terminology.
rg 'simple|reusable|efficient|configurable' src/

Length of output: 3042

tests/utils.py (4)

107-107: Consider renaming do_use_config_yaml for clarity.

The parameter do_use_config_yaml could be renamed to something more intuitive, like use_yaml_config, to improve readability.

-    do_use_config_yaml: bool = False,
+    use_yaml_config: bool = False,

143-143: Improve command logging for better debugging.

Consider adding more context to the command log, such as the environment or additional parameters, to aid in debugging.

-    err_cmd_lines.append(f"Running command: {full_cmd}")
+    err_cmd_lines.append(f"Running command: {full_cmd} with environment: {os.environ}")

Line range hint 12-12:
Reminder: Add doctests.

The TODO comment indicates that doctests are missing. Adding them will ensure the function behaves as expected.

Do you want me to generate the doctests or open a GitHub issue to track this task?


114-116: Consider enhancing config_name validation.

The current validation ensures config_name is not None when do_use_config_yaml is True. However, consider adding checks to ensure it is a valid non-empty string to prevent potential issues.

  • tests/utils.py: Enhance validation logic for config_name.
Analysis chain

Ensure config_name validation is sufficient.

The check for config_name when do_use_config_yaml is True is crucial. Ensure that this validation is comprehensive and that config_name is not just None but also a valid string.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that `config_name` is a valid string when `do_use_config_yaml` is True.

# Test: Check for non-empty string assignment to `config_name`.
rg --type python -A 5 $'do_use_config_yaml=True' | grep -B 5 -A 5 'config_name'

Length of output: 107


Script:

#!/bin/bash
# Corrected script to verify that `config_name` is a valid string when `do_use_config_yaml` is True.

# Search for instances where `do_use_config_yaml` is set to True and check how `config_name` is handled.
rg 'do_use_config_yaml\s*=\s*True' -A 5 --type py

Length of output: 158


Script:

#!/bin/bash
# Extend the search to include more context around the usage of `do_use_config_yaml=True` in `tests/test_filter_measurements.py`.

# Search for the instance and include more lines to see how `config_name` is handled.
rg 'do_use_config_yaml\s*=\s*True' -A 10 -B 5 tests/test_filter_measurements.py

Length of output: 359


Script:

#!/bin/bash
# Search for `config_name` in `tests/test_filter_measurements.py` to understand its usage and validation.

rg 'config_name' -A 5 -B 5 tests/test_filter_measurements.py

Length of output: 60


Script:

#!/bin/bash
# Search the entire codebase for `config_name` to identify its definition and validation logic.

rg 'config_name' -A 5 -B 5 --type py

Length of output: 21468
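
A minimal sketch of the stricter check suggested above (the standalone helper is illustrative; in tests/utils.py the validation is inline):

```python
def validate_config_name(config_name, do_use_config_yaml: bool) -> None:
    """Require a non-empty string `config_name` whenever a YAML config is used."""
    if not do_use_config_yaml:
        return
    if not isinstance(config_name, str) or not config_name.strip():
        raise ValueError(
            "config_name must be a non-empty string when do_use_config_yaml=True; "
            f"got {config_name!r}"
        )
```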

eICU_Example/README.md (1)

13-27: Address markdownlint indentation issues.

Adjust the indentation to match markdownlint expectations for unordered lists.

-    - [ ] Identifying the pre-MEDS steps for eICU
+  - [ ] Identifying the pre-MEDS steps for eICU
-    - [ ] Test that it runs at all.
+  - [ ] Test that it runs at all.
-    - [ ] Test that the output is as expected.
+  - [ ] Test that the output is as expected.
-    - [ ] Sub-sharding
+  - [ ] Sub-sharding
-    - [ ] Patient split gathering
+  - [ ] Patient split gathering
-    - [ ] Event extraction
+  - [ ] Event extraction
-    - [ ] Merging
+  - [ ] Merging
-    - [ ] Basic validation
+  - [ ] Basic validation
-    - [ ] Detailed validation
+  - [ ] Detailed validation
Tools
Markdownlint

13-13: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


15-15: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


16-16: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


21-21: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


22-22: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


23-23: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


24-24: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


26-26: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


27-27: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

docs/preprocessing_operation_prototypes.md (1)

74-74: Consider rephrasing for conciseness.

The phrase "prior to" might be wordy. Consider using "before" for conciseness.

- patient data filters should be applied prior to aggregation.
+ patient data filters should be applied before aggregation.
src/MEDS_transforms/extract/README.md (2)

83-83: Correct grammatical error: Use "an" instead of "a".

Change "a input" to "an input" for grammatical correctness.

-    literal or a input column reference) is interpreted
+    literal or an input column reference) is interpreted
Tools
LanguageTool

[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...

(EN_A_VS_AN)


255-255: Consider more formal language.

Replace "feel free to" with a more formal alternative to enhance professionalism.

-    Feel free to reach out if you have any questions or concerns about this.
+    Please reach out if you have any questions or concerns about this.
Tools
LanguageTool

[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...

(FEEL_FREE_TO_STYLE_ME)

src/MEDS_transforms/utils.py (1)

Line range hint 61-98: Update the function signature and return statement.

The stage_init function's return type annotation should be updated to reflect the removal of shards_map_fp from the return tuple. The return type annotation should match the actual return statement.

- def stage_init(cfg: DictConfig) -> tuple[Path, Path, Path, Path]:
+ def stage_init(cfg: DictConfig) -> tuple[Path, Path, Path]:
src/MEDS_transforms/mapreduce/utils.py (1)

16-46: Add error logging for exception handling.

In the is_complete_parquet_file function, consider adding logging for exceptions to aid in debugging when a file is not complete.

- except Exception:
+ except Exception as e:
+     logger.error(f"Failed to read Parquet file {fp}: {e}")
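
For context, a hedged sketch of what a completeness check with this logging could look like; it assumes pyarrow and a loguru-style logger and is not the module's actual implementation:

```python
from pathlib import Path

import pyarrow.parquet as pq
from loguru import logger


def is_complete_parquet_file(fp: Path) -> bool:
    """Return True if `fp` has a readable Parquet footer (i.e., the write finished)."""
    try:
        pq.read_metadata(fp)  # raises if the file is truncated or not Parquet
        return True
    except Exception as e:
        logger.error(f"Failed to read Parquet file {fp}: {e}")
        return False
```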
tests/test_extract.py (1)

444-445: Update assertion message for clarity.

The assertion message should clearly indicate the expected file path for clarity.

- assert shards_fp.is_file(), f"Expected splits @ {str(shards_fp.resolve())} to exist."
+ assert shards_fp.is_file(), f"Expected shards file @ {str(shards_fp.resolve())} to exist."
README.md (1)

98-108: Consider style improvement and approve changes.

The modifications enhance clarity. However, consider addressing the style issue highlighted by static analysis regarding the missing comma:

- script in different terminal sessions or all at once via the Hydra `joblib` launcher
+ script in different terminal sessions, or all at once via the Hydra `joblib` launcher
Tools
LanguageTool

[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...

(MD_BASEFORM)


[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra joblib l...

(AI_HYDRA_LEO_MISSING_COMMA)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d0e7930 and cda4f06.

Files selected for processing (42)
  • .editorconfig (1 hunks)
  • .github/workflows/python-build.yaml (1 hunks)
  • .github/workflows/tests.yaml (1 hunks)
  • .pre-commit-config.yaml (3 hunks)
  • .readthedocs.yaml (1 hunks)
  • MIMIC-IV_Example/README.md (2 hunks)
  • README.md (8 hunks)
  • docs/gen_ref_pages.py (1 hunks)
  • docs/index.md (1 hunks)
  • docs/javascripts/mathjax.js (1 hunks)
  • docs/pipeline_configuration.md (1 hunks)
  • docs/preprocessing_operation_prototypes.md (7 hunks)
  • eICU_Example/README.md (10 hunks)
  • eICU_Example/configs/table_preprocessors.yaml (1 hunks)
  • mkdocs.yml (1 hunks)
  • pyproject.toml (2 hunks)
  • src/MEDS_transforms/__init__.py (2 hunks)
  • src/MEDS_transforms/aggregate_code_metadata.py (12 hunks)
  • src/MEDS_transforms/configs/extract.yaml (1 hunks)
  • src/MEDS_transforms/configs/preprocess.yaml (2 hunks)
  • src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml (1 hunks)
  • src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1 hunks)
  • src/MEDS_transforms/extract/README.md (6 hunks)
  • src/MEDS_transforms/extract/__init__.py (2 hunks)
  • src/MEDS_transforms/extract/convert_to_sharded_events.py (2 hunks)
  • src/MEDS_transforms/extract/extract_code_metadata.py (2 hunks)
  • src/MEDS_transforms/extract/finalize_MEDS_metadata.py (2 hunks)
  • src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2 hunks)
  • src/MEDS_transforms/extract/split_and_shard_patients.py (7 hunks)
  • src/MEDS_transforms/mapreduce/mapper.py (4 hunks)
  • src/MEDS_transforms/mapreduce/utils.py (8 hunks)
  • src/MEDS_transforms/reshard_to_split.py (1 hunks)
  • src/MEDS_transforms/transforms/tokenization.py (5 hunks)
  • src/MEDS_transforms/utils.py (8 hunks)
  • tests/test_add_time_derived_measurements.py (1 hunks)
  • tests/test_extract.py (1 hunks)
  • tests/test_filter_measurements.py (2 hunks)
  • tests/test_occlude_outliers.py (1 hunks)
  • tests/test_reorder_measurements.py (1 hunks)
  • tests/test_reshard_to_split.py (1 hunks)
  • tests/transform_tester_base.py (7 hunks)
  • tests/utils.py (2 hunks)
Files skipped from review due to trivial changes (11)
  • .editorconfig
  • .readthedocs.yaml
  • MIMIC-IV_Example/README.md
  • docs/index.md
  • eICU_Example/configs/table_preprocessors.yaml
  • src/MEDS_transforms/configs/extract.yaml
  • src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml
  • src/MEDS_transforms/extract/extract_code_metadata.py
  • tests/test_add_time_derived_measurements.py
  • tests/test_occlude_outliers.py
  • tests/test_reorder_measurements.py
Additional context used
LanguageTool
docs/pipeline_configuration.md

[uncategorized] ~16-~16: Loose punctuation mark.
Context: ...s of the following stages: 1. stage_1: A metadata map-reduce stage (e.g., coun...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~18-~18: Loose punctuation mark.
Context: ...the codes in the data). 2. stage_2: A metadata-only processing stage (e.g.,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~20-~20: Loose punctuation mark.
Context: ... occur more than 10 times). 3. stage_3: A data processing stage (e.g., filterin...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~22-~22: Loose punctuation mark.
Context: ... occur more than 10 times). 4. stage_4: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~24-~24: Loose punctuation mark.
Context: ...al values in the data). 5. stage_5: A data processing stage (e.g., occludin...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~26-~26: Loose punctuation mark.
Context: ... deviations from the mean). 6. stage_6: A metadata map-reduce stage (e.g., comp...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~28-~28: Loose punctuation mark.
Context: ...al values in the data). 7. stage_7: A data processing stage (e.g., normaliz...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~33-~33: Loose punctuation mark.
Context: ...s in the following manner. 1. stage_1: - As there is no preceding data sta...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...1/codes.parquet` directory. 2. `stage_2`: - This stage will read in the metad...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~45-~45: Loose punctuation mark.
Context: ...2/codes.parquet` directory. 3. `stage_3`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~46-~46: Possible missing comma found.
Context: ... in the data from the $INPUT_DIR/data directory as there has still been no prio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~51-~51: Loose punctuation mark.
Context: ...$SHARD_NAME.parquet` files. 4. `stage_4`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~60-~60: Loose punctuation mark.
Context: ...tage_4/codes.parquet` file. 5. `stage_5`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...$SHARD_NAME.parquet` files. 6. `stage_6`: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...n to the stage directories. 7. stage_7: - This stage will read in the data ...

(UNLIKELY_OPENING_PUNCTUATION)


[grammar] ~82-~82: The verb form ‘reduce’ does not seem to be suitable in this context.
Context: ...ssing stages (which do not have reduce stages).

(HAVE_VB)

docs/preprocessing_operation_prototypes.md

[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative.
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...

(EN_WORDINESS_PREMIUM_PRIOR_TO)

src/MEDS_transforms/extract/README.md

[misspelling] ~83-~83: Use “an” instead of ‘a’ if the following word starts with a vowel sound, e.g. ‘an article’, ‘an hour’.
Context: ...t (again either a string literal or a input column reference) is interpreted ...

(EN_A_VS_AN)


[style] ~255-~255: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... extraction tool for your raw data. Feel free to reach out if you have any questions or ...

(FEEL_FREE_TO_STYLE_ME)

README.md

[grammar] ~103-~103: The modal verb ‘will’ requires the verb’s base form.
Context: ...rker jobs. Note this will typically required a distributed file system to work corre...

(MD_BASEFORM)


[uncategorized] ~106-~106: Possible missing comma found.
Context: ...e same script in different terminal sessions or all at once via the Hydra joblib l...

(AI_HYDRA_LEO_MISSING_COMMA)


[style] ~181-~181: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing.
Context: ... and external ontologies? Will they need to specify select key-event concepts to id...

(REP_NEED_TO_VB)


[style] ~306-~306: This phrase is redundant. Consider using “outside”.
Context: ... - Removing numeric values that fall outside of pre-specified, per-code ranges (e.g., f...

(OUTSIDE_OF)


[style] ~315-~315: Consider using “who” when you are referring to a person instead of an object.
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...

(THAT_WHO)


[style] ~324-~324: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ... each code occurs in the dataset. - Computing appropriate numeric bins for each code ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...

(KIND_OF_A)

Ruff
src/MEDS_transforms/reshard_to_split.py

132-132: Function definition does not bind loop variable in_fps

(B023)


138-138: Function definition does not bind loop variable in_fps

(B023)


147-147: Function definition does not bind loop variable subshard_dir

(B023)


147-147: Function definition does not bind loop variable out_fp

(B023)


153-153: Function definition does not bind loop variable subshard_dir

(B023)


154-154: Function definition does not bind loop variable in_fps

(B023)


158-158: Function definition does not bind loop variable subshard_dir

(B023)


159-159: Function definition does not bind loop variable subshard_dir

(B023)


164-164: Function definition does not bind loop variable subshard_dir

(B023)


166-166: Function definition does not bind loop variable subshard_dir

(B023)


167-167: Function definition does not bind loop variable subshard_dir

(B023)
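
For reference, a self-contained illustration of the pattern behind these B023 findings and the conventional fix: bind loop variables as default arguments so each closure captures its own iteration's values. All names below are stand-ins, not the module's actual variables:

```python
work_items = [("shard_0", ["a.parquet"]), ("shard_1", ["b.parquet"])]

# Buggy: closures share one cell per loop variable, so both see the last value.
buggy = [lambda: (subshard_dir, in_fps) for subshard_dir, in_fps in work_items]

# Fixed: default arguments are evaluated at definition time, once per iteration.
fixed = [
    lambda subshard_dir=subshard_dir, in_fps=in_fps: (subshard_dir, in_fps)
    for subshard_dir, in_fps in work_items
]

assert buggy[0]() == ("shard_1", ["b.parquet"])  # stale capture
assert fixed[0]() == ("shard_0", ["a.parquet"])  # correct per-iteration value
```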

src/MEDS_transforms/extract/split_and_shard_patients.py

22-22: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
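
A minimal sketch of the suggested fix: a mutable default is created once at function definition and shared across all calls, so use a None sentinel and initialize inside the body. The signature is a hypothetical stand-in, not the actual one in split_and_shard_patients.py:

```python
def shard_patients(patient_ids: list[int], external_splits: dict | None = None) -> dict:
    if external_splits is None:
        external_splits = {}  # fresh dict per call, never shared between callers
    external_splits.setdefault("train", patient_ids)
    return external_splits
```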

src/MEDS_transforms/mapreduce/utils.py

275-278: Use ternary operator add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now()) instead of if-else-block

Replace if-else-block with add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

(SIM108)

src/MEDS_transforms/mapreduce/mapper.py

574-574: Do not perform function call partial in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)
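
Likewise for the B008 finding: a `partial(...)` call placed in a default argument runs once, at definition time. A toy illustration of resolving the default inside the function instead (names are illustrative, not mapper.py's actual parameters):

```python
from functools import partial


def apply_all(values: list[int], fn=None) -> list[int]:
    """Apply `fn` to each value; the default is constructed per call, as B008 advises."""
    if fn is None:
        fn = partial(pow, 2)  # fn(v) == pow(2, v); built at call time, not import time
    return [fn(v) for v in values]


assert apply_all([1, 2, 3]) == [2, 4, 8]
```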

Markdownlint
eICU_Example/README.md

13-13: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


15-15: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


16-16: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


21-21: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


22-22: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


23-23: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


24-24: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


26-26: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


27-27: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)

Additional comments not posted (73)
src/MEDS_transforms/configs/stage_configs/split_and_shard_patients.yaml (1)

3-3: LGTM! But verify the impact of the directory change.

The change in the output_dir to ${cohort_dir}/metadata is a positive step towards better file organization.

However, ensure that the rest of the code that relies on the output directory is updated accordingly.

Verification successful

No references to old output directory structure found.

The change to output_dir: ${cohort_dir}/metadata does not appear to impact other parts of the codebase, as no references to the old structure were found in YAML files.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for code that references the old output directory structure.

# Test: Search for usage of the old output directory. Expect: No references to the old structure.
# Note: Adjust the pattern if the variable is used differently in the code.
ast-g-r --lang yaml --pattern $'output_dir: $COHORT_DIR'

Length of output: 132


Script:

#!/bin/bash
# Description: Check for code that references the old output directory structure.

# Test: Search for usage of the old output directory. Expect: No references to the old structure.
# Note: Adjust the pattern if the variable is used differently in the code.
ast-grep --lang yaml --pattern $'output_dir: $COHORT_DIR'

Length of output: 167


Script:

#!/bin/bash
# Description: Search for references to the old output directory structure in YAML files.

# Use ripgrep to search for the pattern in YAML files.
rg 'output_dir: \$COHORT_DIR' --glob '*.yaml' -A 2

Length of output: 52

docs/javascripts/mathjax.js (1)

1-19: MathJax configuration is well-structured.

The configuration for MathJax is standard and should work well for rendering mathematical expressions in the documentation.

src/MEDS_transforms/__init__.py (3)

4-4: Consider the impact of the polars import.

The introduction of polars suggests a shift towards using this library for data manipulation. Ensure that all necessary dependencies are installed and documented.

Verification successful

Polars is included in the dependencies.

The polars library is listed in the pyproject.toml file, ensuring it is part of the project's dependencies. No further action is needed regarding its inclusion.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the presence of `polars` in the dependencies.

# Test: Search for `polars` in the dependency files. Expect: Occurrences in requirements or setup files.
rg --type yaml --type toml --type ini --type txt 'polars'

Length of output: 85


Script:

#!/bin/bash
# Description: Re-verify the presence of `polars` in the dependencies.

# Test: Search for `polars` in the common dependency files. Expect: Occurrences in requirements or setup files.
rg --type yaml --type toml --type txt 'polars' --glob '*requirements*.txt' --glob 'Pipfile' --glob 'pyproject.toml' --glob 'setup.cfg'

Length of output: 369


17-24: Verify MANDATORY_TYPES for type consistency.

The MANDATORY_TYPES dictionary specifies expected data types using polars. Ensure these types are enforced and compatible with data sources.

Verification successful

Indirect enforcement of MANDATORY_TYPES through related constructs

The MANDATORY_TYPES dictionary itself is not directly used in the codebase. However, related constructs like MEDS_DATA_MANDATORY_TYPES and MEDS_METADATA_MANDATORY_TYPES are used in various files to enforce type consistency. These related constructs ensure that the expected data types are enforced across the codebase.

  • finalize_MEDS_metadata.py
  • extract_code_metadata.py
  • finalize_MEDS_data.py

These files contain logic to validate and enforce schema types, indirectly reflecting the intended use of MANDATORY_TYPES.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage and enforcement of `MANDATORY_TYPES` in the codebase.

# Test: Search for `MANDATORY_TYPES` usage. Expect: Consistent enforcement in data validation logic.
rg --type python 'MANDATORY_TYPES'

Length of output: 65


Script:

#!/bin/bash
# Description: Verify the usage and enforcement of `MANDATORY_TYPES` in the codebase.

# Corrected Test: Search for `MANDATORY_TYPES` usage in Python files.
rg 'MANDATORY_TYPES' --glob '*.py' -A 5

Length of output: 8562
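
As an illustration of how such a mapping is typically enforced with polars, a hedged sketch; the dtypes below are a hypothetical subset, not the package's actual constants:

```python
import polars as pl

MEDS_DATA_MANDATORY_TYPES = {  # hypothetical subset for illustration
    "patient_id": pl.Int64,
    "time": pl.Datetime("us"),
    "code": pl.Utf8,
    "numeric_value": pl.Float32,
}


def enforce_types(df: pl.DataFrame) -> pl.DataFrame:
    """Cast mandatory columns to their required dtypes, failing fast if any is missing."""
    missing = set(MEDS_DATA_MANDATORY_TYPES) - set(df.columns)
    if missing:
        raise ValueError(f"Missing mandatory columns: {sorted(missing)}")
    return df.with_columns(pl.col(c).cast(t) for c, t in MEDS_DATA_MANDATORY_TYPES.items())
```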


15-15: Ensure MANDATORY_COLUMNS aligns with data processing requirements.

The MANDATORY_COLUMNS list defines essential columns for data processing. Verify that these columns are consistently used throughout the data pipeline.

Verification successful

MANDATORY_COLUMNS is consistently used in data processing.

The MANDATORY_COLUMNS list is effectively utilized in the data processing pipeline, ensuring that data types are checked and columns are selected appropriately. This aligns with the data processing requirements.

  • src/MEDS_transforms/extract/finalize_MEDS_data.py: Used for retyping columns and selecting mandatory columns in the DataFrame.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `MANDATORY_COLUMNS` in the codebase.

# Test: Search for `MANDATORY_COLUMNS` usage. Expect: Consistent references in data processing logic.
rg --type python 'MANDATORY_COLUMNS'

Length of output: 67


Script:

#!/bin/bash
# Description: Verify the usage of `MANDATORY_COLUMNS` in the codebase.

# Test: Search for `MANDATORY_COLUMNS` usage. Expect: Consistent references in data processing logic.
rg 'MANDATORY_COLUMNS' --glob '*.py' -A 3

Length of output: 801


Script:

#!/bin/bash
# Description: Verify the usage of `MEDS_DATA_MANDATORY_TYPES` in the codebase.

# Test: Search for `MEDS_DATA_MANDATORY_TYPES` usage. Expect: Consistent references in data processing logic.
rg 'MEDS_DATA_MANDATORY_TYPES' --glob '*.py' -A 3

Length of output: 1858

src/MEDS_transforms/configs/preprocess.yaml (1)

4-4: Evaluate the addition of reshard_to_split.

The reshard_to_split stage enhances the pipeline's resharding capabilities. Verify its integration and ensure it aligns with the overall data processing strategy.

Verification successful

Integration of reshard_to_split verified.

The reshard_to_split stage is integrated into the pipeline, with references found in configuration and test files, indicating proper setup and testing. Further verification of its logic and alignment with the data processing strategy may be needed manually.

  • tests/test_reshard_to_split.py: Contains a test function for reshard_to_split.
  • src/MEDS_transforms/configs/stage_configs/reshard_to_split.yaml: Configuration entry for reshard_to_split.
  • src/MEDS_transforms/configs/preprocess.yaml: Includes reshard_to_split in the list.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of `reshard_to_split` in the pipeline.

# Test: Search for `reshard_to_split` usage. Expect: Proper integration in pipeline configuration and logic.
rg --type yaml --type python 'reshard_to_split'

Length of output: 78


Script:

#!/bin/bash
# Description: Verify the integration of `reshard_to_split` in the pipeline.

# Test: Search for `reshard_to_split` usage. Expect: Proper integration in pipeline configuration and logic.
rg --type yaml --type py 'reshard_to_split'

Length of output: 514

.github/workflows/tests.yaml (2)

36-36: Review the addition of JUnit XML reporting.

The --junitxml=junit.xml option enhances test reporting. Ensure the generated reports are utilized effectively in the CI/CD pipeline.


42-46: Check the conditional upload of test results to Codecov.

The step to upload test results is conditioned on the job not being cancelled. Verify that this condition is correctly implemented and that the results are uploaded as expected.

mkdocs.yml (6)

1-4: Ensure accurate project metadata.

The site metadata, such as site_name, repo_url, site_description, and site_author, should accurately reflect the project details. Ensure these values are correct and up-to-date.


14-16: Review theme settings.

Ensure the chosen theme and locale settings meet the project's requirements. The material theme is widely used and customizable, but verify that it aligns with the desired presentation.


18-29: Validate markdown extensions configuration.

The listed markdown extensions enhance the documentation's capabilities. Verify that these extensions are necessary and correctly configured for the project's needs.


31-33: Check extra JavaScript inclusion.

Ensure the JavaScript files included are necessary and correctly configured. Verify that the URLs are correct and accessible.


35-45: Review plugin configuration.

Ensure that all listed plugins are necessary and correctly configured. Verify that they are compatible with the current MkDocs setup and project requirements.


6-12: Verify navigation links.

Ensure that the navigation links point to the correct markdown files and URLs. This includes checking that all referenced files exist and are properly named.

Verification successful

Navigation links verified successfully.

All referenced markdown files in mkdocs.yml exist and are correctly named.

  • index.md found at docs/index.md
  • pipeline_configuration.md found at docs/pipeline_configuration.md
  • preprocessing_operation_prototypes.md found at docs/preprocessing_operation_prototypes.md
  • terminology.md found at docs/terminology.md
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all navigation links in mkdocs.yml point to existing files.

# Test: Check for the existence of the referenced markdown files.
nav_files=("index.md" "pipeline_configuration.md" "preprocessing_operation_prototypes.md" "terminology.md")
for file in "${nav_files[@]}"; do
  if ! fd --type f --name "$file" docs; then
    echo "Missing file: $file"
  fi
done

Length of output: 1757


Script:

#!/bin/bash
# Corrected script to verify that all navigation links in mkdocs.yml point to existing files.

# Test: Check for the existence of the referenced markdown files.
nav_files=("index.md" "pipeline_configuration.md" "preprocessing_operation_prototypes.md" "terminology.md")
for file in "${nav_files[@]}"; do
  if ! fd --type f "$file"; then
    echo "Missing file: $file"
  fi
done

Length of output: 489

docs/gen_ref_pages.py (5)

12-15: Ensure correct path handling.

The script uses Path.rglob to find Python files. Verify that this approach correctly captures all relevant files, especially in complex directory structures.


32-32: Verify navigation structure.

Ensure that the navigation structure correctly reflects the project's module hierarchy. This affects how the documentation is organized and presented.


34-35: Check module documentation generation.

Ensure that the mkdocstrings syntax (::: {ident}) is correctly generating documentation for each module.


37-40: Ensure file writing operations are correct.

Verify that the file writing operations correctly create and update the documentation files without errors.


42-43: Validate navigation file generation.

Ensure the SUMMARY.md file is generated correctly and reflects the intended navigation structure.
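
For orientation, these five points map onto the standard mkdocstrings "gen_ref_pages" recipe, which scripts like this typically follow. A condensed, illustrative version (not necessarily the repository's exact file):

```python
"""Generate code reference pages (sketch of the standard mkdocstrings recipe)."""
from pathlib import Path

import mkdocs_gen_files

nav = mkdocs_gen_files.Nav()

for path in sorted(Path("src").rglob("*.py")):
    module_path = path.relative_to("src").with_suffix("")
    doc_path = path.relative_to("src").with_suffix(".md")
    full_doc_path = Path("reference", doc_path)

    parts = tuple(module_path.parts)
    if parts[-1] == "__init__":
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
        full_doc_path = full_doc_path.with_name("index.md")
    elif parts[-1] == "__main__":
        continue

    nav[parts] = doc_path.as_posix()

    # Write a stub page containing a mkdocstrings directive for the module.
    with mkdocs_gen_files.open(full_doc_path, "w") as fd:
        fd.write(f"::: {'.'.join(parts)}")

# Emit the navigation summary consumed by the literate-nav plugin.
with mkdocs_gen_files.open("reference/SUMMARY.md", "w") as nav_file:
    nav_file.writelines(nav.build_literate_nav())
```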

pyproject.toml (2)

30-34: Verify documentation dependencies.

Ensure that the specified versions of documentation dependencies are compatible with each other and meet the project's requirements.


50-50: Check new script entry for correctness.

The new script entry MEDS_transform-reshard_to_split should be verified to ensure it correctly points to the intended function and module.

.github/workflows/python-build.yaml (3)

39-39: Replace placeholder in environment URL.

The URL contains a placeholder <package-name>. Ensure this is replaced with the actual package name before deploying.

- url: https://pypi.org/p/MEDS-transforms # Replace <package-name> with your PyPI project name
+ url: https://pypi.org/p/<actual-package-name>

Likely invalid or redundant comment.


51-51: Ensure correct version of PyPI publish action.

Using a specific version or release tag for pypa/gh-action-pypi-publish ensures stability and compatibility.

Verification successful

Version of PyPI publish action is correctly specified.

The workflow uses pypa/gh-action-pypi-publish@release/v1, which is a specific release tag ensuring stability and compatibility. No changes are necessary.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the version of `pypa/gh-action-pypi-publish` used in the workflow.

# Test: Check if a specific version or release tag is used. Expect: Specific version or release tag.
rg 'uses: pypa/gh-action-pypi-publish@release/v1' .github/workflows/python-build.yaml

Length of output: 140


15-15: Verify Python version compatibility.

Ensure that Python 3.12 is compatible with your project dependencies and target environments, as it is a relatively new release.

Verification successful

Python 3.12 Compatibility Confirmed

The pyproject.toml file specifies requires-python = ">=3.12", indicating compatibility with Python 3.12. There are no conflicting constraints in other dependency files.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify compatibility of Python 3.12 with project dependencies.

# Test: Check if Python 3.12 is specified in the dependency files. Expect: Compatibility declarations.
rg --type yaml 'python_version: "3.12"' || rg --type toml 'python = "3.12"'

Length of output: 76


Script:

#!/bin/bash
# Description: Search for Python version constraints in common dependency files.

# Check for Python version constraints in requirements.txt
rg 'python_version' $(fd requirements.txt)

# Check for Python version constraints in setup.py
rg 'python_requires' $(fd setup.py)

# Check for Python version constraints in pyproject.toml
rg 'requires-python' $(fd pyproject.toml)

Length of output: 162

.pre-commit-config.yaml (2)

101-103: Ensure additional dependencies are installed.

The added Markdown formatting tools (mdformat-mkdocs, mdformat-toc, mdformat-admon) require additional dependencies. Verify these are installed in the environment.


4-4: Review the updated exclude pattern.

The exclude pattern now includes specific README files. Ensure these are intentionally excluded from pre-commit checks.

tests/utils.py (2)

Line range hint 56-56:
LGTM!

The dict_to_hydra_kwargs function is well-implemented with comprehensive error handling and documentation.
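
For readers unfamiliar with the helper, a minimal sketch of the flattening idea its name suggests; the actual implementation (quoting rules, error handling) may differ:

```python
def dict_to_hydra_kwargs(d: dict, prefix: str = "") -> list[str]:
    """Flatten a nested dict such as {"a": {"b": 1}} into Hydra overrides like ["a.b=1"]."""
    out: list[str] = []
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else str(k)
        if isinstance(v, dict):
            out.extend(dict_to_hydra_kwargs(v, key))
        else:
            out.append(f"{key}={v}")
    return out


assert dict_to_hydra_kwargs({"stage": {"n": 3}, "worker": 1}) == ["stage.n=3", "worker=1"]
```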


125-127: Remember to clean up temporary directories.

Ensure that conf_dir.cleanup() is called in all execution paths to prevent resource leaks.
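
One robust pattern, sketched under the assumption that conf_dir is a tempfile.TemporaryDirectory: use it as a context manager so cleanup happens on every path, including test failures:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as conf_dir:
    config_fp = Path(conf_dir) / "config.yaml"
    config_fp.write_text("defaults: []\n")  # stand-in config contents
    # ... run the command under test here ...
# The directory is removed here, whether the block succeeded or raised.
```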

src/MEDS_transforms/reshard_to_split.py (5)

27-35: LGTM!

The valid_json_file function is well-implemented with proper error handling for JSON decoding.


52-53: LGTM!

The write_json function is well-implemented and correctly handles JSON serialization.


182-183: LGTM!

The __main__ block is standard and correctly invokes the main function.


38-49: LGTM! But verify external dependencies.

The make_new_shards_fn function is well-implemented. Ensure that the shard_patients function and configuration settings are correctly defined and used.


59-179: Verify re-sharding logic and external utilities.

The main function appears to handle the re-sharding process correctly. Ensure that the logic is sound and that external utilities such as rwlock_wrap and shard_iterator are correctly used and defined.

Tools
Ruff

132-132: Function definition does not bind loop variable in_fps

(B023)


138-138: Function definition does not bind loop variable in_fps

(B023)


147-147: Function definition does not bind loop variable subshard_dir

(B023)


147-147: Function definition does not bind loop variable out_fp

(B023)


153-153: Function definition does not bind loop variable subshard_dir

(B023)


154-154: Function definition does not bind loop variable in_fps

(B023)


158-158: Function definition does not bind loop variable subshard_dir

(B023)


159-159: Function definition does not bind loop variable subshard_dir

(B023)


164-164: Function definition does not bind loop variable subshard_dir

(B023)


166-166: Function definition does not bind loop variable subshard_dir

(B023)


167-167: Function definition does not bind loop variable subshard_dir

(B023)

tests/test_reshard_to_split.py (2)

11-190: LGTM!

The test setup is comprehensive and covers various scenarios for the re-sharding process.


193-202: Verify test utility integration.

The test_reshard_to_split function appears to correctly test the re-sharding process. Ensure that the single_stage_transform_tester utility is correctly integrated and used.

Verification successful

Integration of single_stage_transform_tester is correct.

The single_stage_transform_tester function is correctly defined in the transform_tester_base module and imported into the test file. Its integration and usage in the test_reshard_to_split function are appropriate. No issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the definition and usage of `single_stage_transform_tester`.

# Test: Search for the `single_stage_transform_tester` function definition.
ast-grep --lang python --pattern $'def single_stage_transform_tester($_, $_, $_, $_, $_, $_):'

Length of output: 95


Script:

#!/bin/bash
# Description: Search for import statements related to `single_stage_transform_tester`.

# Test: Search for the import of `single_stage_transform_tester`.
rg 'import .*single_stage_transform_tester' -A 2

# Test: Search for `from ... import ...` statements that include `single_stage_transform_tester`.
rg 'from .* import .*single_stage_transform_tester' -A 2

Length of output: 4185


Script:

#!/bin/bash
# Description: Locate the definition of `single_stage_transform_tester` in `transform_tester_base`.

# Find the file `transform_tester_base` and search for the function definition.
fd 'transform_tester_base' | xargs rg 'def single_stage_transform_tester'

Length of output: 108

tests/test_filter_measurements.py (3)

8-8: Consolidated imports improve readability.

The consolidation of imports enhances clarity without affecting functionality.


Line range hint 106-116:
Existing test function is well-structured.

The test_filter_measurements function uses the single_stage_transform_tester effectively to validate filtering logic.


211-228: New test function is well-structured and comprehensive.

The test_match_revise_filter_measurements function effectively tests filtering logic with specific criteria, enhancing test coverage.

src/MEDS_transforms/extract/finalize_MEDS_metadata.py (1)

Line range hint 149-196:
Improved handling of shards_map_fp enhances maintainability.

The direct assignment of shards_map_fp from configuration clarifies its source and reduces dependency on stage_init output.

eICU_Example/README.md (1)

Line range hint 92-195:
Formatting changes enhance document readability.

The adjustments to step numbering and indentation improve the logical flow and visual structure of the document.

src/MEDS_transforms/transforms/tokenization.py (5)

Line range hint 34-50: Function fill_to_nans is well-implemented.

The function is correctly implemented with clear documentation and examples.


Line range hint 53-69: Function split_static_and_dynamic is well-implemented.

The function effectively splits data into static and dynamic components using Polars LazyFrame operations.


Line range hint 72-108: Function extract_statics_and_schema is well-implemented.

The function correctly extracts static data and schema information using group-by operations.


Line range hint 111-143: Function extract_seq_of_patient_events is well-implemented.

The function effectively extracts sequences of patient events using Polars expressions.


Line range hint 231-267: Function main is well-implemented with improved logic.

The changes streamline data processing and enhance clarity. The conditional check for include_only_train adds robustness.

Ensure that shard_iterator is used correctly throughout the codebase.

Verification successful

Consistent Usage of shard_iterator Confirmed

The shard_iterator function is consistently used across multiple files in the codebase, including tokenization.py, reshard_to_split.py, and mapper.py. This indicates that the changes in tokenization.py regarding the use of shard_iterator are aligned with the overall structure and usage patterns within the project.

  • Files with shard_iterator usage:
    • src/MEDS_transforms/transforms/tokenization.py
    • src/MEDS_transforms/reshard_to_split.py
    • src/MEDS_transforms/mapreduce/mapper.py
    • src/MEDS_transforms/mapreduce/utils.py
    • src/MEDS_transforms/transforms/tensorization.py
    • src/MEDS_transforms/extract/merge_to_MEDS_cohort.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `shard_iterator` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage across the codebase.
rg --type python -A 5 $'shard_iterator'

Length of output: 69


Script:

#!/bin/bash
# Description: Verify the usage of `shard_iterator` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage across the codebase.
rg --type py -A 5 $'shard_iterator'

Length of output: 12884

src/MEDS_transforms/extract/merge_to_MEDS_cohort.py (2)

Line range hint 16-99: Function merge_subdirs_and_sort is well-implemented.

The function effectively merges and sorts parquet files with comprehensive documentation and logging.


Line range hint 102-144: Function main is well-implemented with enhanced shard handling.

The use of shard_iterator_by_shard_map improves shard handling. The configuration of read_fn using partial is efficient.

Ensure that shard_iterator_by_shard_map is used correctly throughout the codebase.

Verification successful

Correct usage of shard_iterator_by_shard_map confirmed.

The shard_iterator_by_shard_map function is correctly defined and used within the codebase. It is imported and utilized appropriately in merge_subdirs_and_sort in merge_to_MEDS_cohort.py.

  • File: src/MEDS_transforms/extract/merge_to_MEDS_cohort.py
  • Function: merge_subdirs_and_sort
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `shard_iterator_by_shard_map` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage across the codebase.
rg --type python -A 5 $'shard_iterator_by_shard_map'

Length of output: 82


Script:

#!/bin/bash
# Corrected: Verify the usage of `shard_iterator_by_shard_map` in the codebase.

# Test: Search for the function usage. Expect: Consistent usage across the codebase.
rg --type py -A 5 $'shard_iterator_by_shard_map'

Length of output: 3856

src/MEDS_transforms/extract/split_and_shard_patients.py (3)

111-135: LGTM! Validation logic is correctly adjusted.

The validation of the sum of split fractions now only occurs when there are patients to split, which optimizes the control flow.
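In other words, the check is now conditional; schematically (a sketch, not the actual code):

def validate_fracs(split_fracs_dict: dict[str, float], n_to_split: int) -> None:
    """Sketch: only validate fractions when patients remain to be split."""
    if n_to_split == 0:
        return  # external splits covered everyone; fractions are irrelevant
    total = sum(split_fracs_dict.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"Split fractions must sum to 1, got {total}")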


Line range hint 205-272: LGTM! Improved file handling and logging.

The changes enhance clarity and reliability in constructing paths for logging and writing output.


136-142: Verify handling of split_fracs_dict when None.

Ensure that the logic correctly handles cases where split_fracs_dict is None and external splits cover all patients.

Verification successful

Handling of split_fracs_dict when None is correctly implemented.

The shard_patients function allows split_fracs_dict to be None when external splits fully specify the population, as indicated in the docstring. This ensures that the logic handles the scenario appropriately without errors. No further action is needed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify handling of `split_fracs_dict` when it is `None`.

# Test: Search for the function call to `shard_patients` with `split_fracs_dict=None`. Expect: No runtime errors.
rg --type python 'shard_patients\(.*split_fracs_dict=None.*\)'

Length of output: 95


Script:

#!/bin/bash
# Search for the definition of the `shard_patients` function and inspect its implementation.
ast-grep --lang python --pattern 'def shard_patients($_, split_fracs_dict, $$$):'

Length of output: 83


Script:

#!/bin/bash
# Search for calls to `shard_patients` with `split_fracs_dict=None`.
rg 'shard_patients\(.*split_fracs_dict=None.*\)' -A 3

Length of output: 571


Script:

#!/bin/bash
# Search for the definition of the `shard_patients` function using `rg`.
rg 'def shard_patients' -A 10

Length of output: 1042


Script:

#!/bin/bash
# Extract more lines from the `shard_patients` function to understand its handling of `split_fracs_dict`.
rg 'def shard_patients' -A 50

Length of output: 6031

tests/transform_tester_base.py (1)

Line range hint 303-374: LGTM! Enhanced configurability and data handling.

The introduction of new parameters and refactoring improves flexibility and readability. Ensure that the new parameters are correctly integrated.

src/MEDS_transforms/utils.py (1)

352-352: Ensure logic consistency with train_only.

The change from process_split to train_only should be verified for consistency across the codebase to ensure it aligns with the intended logic.

Verification successful

The train_only logic change is consistent across the codebase.

The train_only key is used consistently in configuration and logic checks across multiple files, indicating that the change aligns with the intended functionality.

  • Instances found in src/MEDS_transforms/utils.py, src/MEDS_transforms/mapreduce/utils.py, and src/MEDS_transforms/mapreduce/mapper.py.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistency of `train_only` usage across the codebase.

# Test: Search for `train_only` usage. Expect: Consistent usage across all instances.
rg --type python "train_only"

Length of output: 60


Script:

#!/bin/bash
# Correcting the file type for Python and searching for `train_only` usage.
rg --type py "train_only"

Length of output: 2069

src/MEDS_transforms/mapreduce/utils.py (4)

120-124: LGTM!

The default_file_checker function correctly checks for file existence and completeness.
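One plausible form for such a checker (whether "completeness" means non-empty or something stronger is an assumption here):

from pathlib import Path

def default_file_checker(fp: Path) -> bool:
    """Sketch: treat an output as done only if it exists and is non-empty."""
    return fp.is_file() and fp.stat().st_size > 0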


505-588: LGTM!

The shard_iterator_by_shard_map function correctly handles shard iteration based on a map file.
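Sketched, a map-driven iterator might read a JSON shard map and yield input/output path pairs (the file layout and names are assumptions):

import json
from collections.abc import Iterator
from pathlib import Path

def shard_iterator_by_shard_map_sketch(
    shard_map_fp: Path, in_dir: Path, out_dir: Path
) -> Iterator[tuple[Path, Path]]:
    """Sketch: yield (in_fp, out_fp) for each shard named in the map file."""
    shard_map = json.loads(shard_map_fp.read_text())
    for shard_name in shard_map:
        yield in_dir / shard_name, out_dir / f"{shard_name}.parquet"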


136-137: Ensure out_fp_checker is used consistently.

The addition of out_fp_checker enhances flexibility. Ensure its usage is consistent throughout the function.

Verification successful

Verification Successful: Consistent Usage of out_fp_checker

The out_fp_checker is used consistently within the rwlock_wrap function to determine the existence of the output file path, influencing the function's control flow appropriately. No issues found.
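That control flow reduces to an early exit; a sketch of the relevant fragment (signature simplified relative to the real function):

from collections.abc import Callable
from pathlib import Path

def rwlock_wrap_sketch(
    in_fp: Path,
    out_fp: Path,
    read_fn: Callable,
    write_fn: Callable,
    compute_fn: Callable,
    out_fp_checker: Callable[[Path], bool] = lambda fp: fp.is_file(),
) -> bool:
    """Sketch: skip the shard entirely when the checker says it is done."""
    if out_fp_checker(out_fp):
        return False  # another worker already produced this output
    write_fn(compute_fn(read_fn(in_fp)), out_fp)
    return True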

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of `out_fp_checker` in `rwlock_wrap`.

# Test: Check for all instances of `out_fp_checker` in `rwlock_wrap`.
ast-grep --lang python --pattern $'def rwlock_wrap[$$$](
    $$$,
    out_fp_checker: Callable[[Path], bool] = default_file_checker,
    $$$
) -> bool:
    $$$
    if out_fp_checker(out_fp):
        $$$
    $$$
'

Length of output: 11117


480-490: Verify train_only logic for shard filtering.

The logic for filtering shards with train_only should be verified to ensure it behaves as expected.

Verification successful

Verified train_only logic in shard_iterator.

The logic for filtering shards based on the train_only configuration is correctly implemented in the shard_iterator function. The function filters shards to include only those with the "train/" prefix when train_only is true and such shards exist. Otherwise, it logs an informational message and processes all shards.

  • The train_only logic is correctly applied in the shard_iterator function.
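That gating, restated as a minimal sketch:

def filter_to_train(shards: list[str], train_only: bool) -> tuple[list[str], bool]:
    """Sketch: keep only 'train/' shards when requested and any exist;
    otherwise fall through to all shards."""
    train_shards = [s for s in shards if s.startswith("train/")]
    if train_only and train_shards:
        return train_shards, True
    return shards, False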
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify `train_only` logic for shard filtering in `shard_iterator`.

# Test: Ensure `train_only` correctly filters shards.
ast-grep --lang python --pattern $'def shard_iterator[$$$](
    $$$,
    train_only: bool = False,
    $$$
) -> tuple[list[tuple[Path, Path]], bool]:
    $$$
    train_only = cfg.stage_cfg.get("train_only", None)
    train_shards = [shard_name for shard_name in shards if shard_name.startswith("train/")]
    if train_only and train_shards:
        shards = train_shards
        includes_only_train = True
    $$$
'

Length of output: 416


Script:

#!/bin/bash
# Search for the `shard_iterator` function in the codebase to verify its implementation.
rg 'def shard_iterator' -A 20

Length of output: 4139


Script:

#!/bin/bash
# Search for the full implementation of the `shard_iterator` function to verify the `train_only` logic.
rg 'def shard_iterator' -A 50

Length of output: 9686


Script:

#!/bin/bash
# Search for the full implementation of the `shard_iterator` function with more lines to verify the `train_only` logic.
rg 'def shard_iterator' -A 100

Length of output: 16912


Script:

#!/bin/bash
# Search for the full implementation of the `shard_iterator` function with more lines to verify the `train_only` logic.
rg 'def shard_iterator' -A 150

Length of output: 21323

tests/test_extract.py (1)

447-447: LGTM!

The changes to the test logic from splits to shards maintain the integrity of the test.

src/MEDS_transforms/mapreduce/mapper.py (6)

228-239: Function implementation looks good.

The is_match_revise function correctly checks for the presence of the _match_revise key in the configuration.
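The check is essentially a key lookup; a sketch:

from omegaconf import DictConfig

def is_match_revise_sketch(stage_cfg: DictConfig) -> bool:
    """Sketch: a stage runs in match-revise mode iff `_match_revise` is set."""
    return bool(stage_cfg.get("_match_revise", None))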


242-288: Validation logic is comprehensive.

The validate_match_revise function effectively ensures the configuration is in the correct format. Consider adding test cases to cover all possible edge cases.
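A sketch of the kind of shape validation meant here (the required `_matcher` key is an assumption, not the confirmed schema):

from omegaconf import DictConfig, ListConfig

def validate_match_revise_sketch(stage_cfg: DictConfig) -> None:
    """Sketch: `_match_revise` must be a list of dicts, each with a matcher."""
    mr = stage_cfg.get("_match_revise", None)
    if not isinstance(mr, (list, ListConfig)):
        raise ValueError(f"_match_revise must be a list, got {type(mr)}")
    for i, block in enumerate(mr):
        if not isinstance(block, (dict, DictConfig)):
            raise ValueError(f"Entry {i} must be a dict, got {type(block)}")
        if "_matcher" not in block:
            raise ValueError(f"Entry {i} is missing a `_matcher` key")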


291-444: Complex function requires thorough testing.

The match_revise_fntr function is well-documented and handles complex logic for match and revise operations. Ensure comprehensive test coverage to validate its behavior under various configurations.


Line range hint 447-568: Pattern matching logic is well-implemented.

The bind_compute_fn function effectively uses pattern matching to bind parameters. Verify that all possible compute function types are correctly handled.


571-616: Control flow refactoring looks good.

The map_over function has been refactored to include match and revise logic and handle "train only" splits. Verify the control flow to ensure all scenarios are correctly managed.

Tools
Ruff

574-574: Do not perform function call partial in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


157-219: LGTM! Verify updated function usage.

The change to use a filter expression instead of a list of patient IDs enhances flexibility. Ensure that all calls to read_and_filter_fntr in the codebase are updated to match this new signature.
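The new signature amounts to closing over an expression; sketched:

from collections.abc import Callable
from pathlib import Path

import polars as pl

def read_and_filter_fntr_sketch(
    filter_expr: pl.Expr, read_fn: Callable[[Path], pl.LazyFrame]
) -> Callable[[Path], pl.LazyFrame]:
    """Sketch: wrap a reader so every shard is filtered by one expression."""
    def read_and_filter(in_fp: Path) -> pl.LazyFrame:
        return read_fn(in_fp).filter(filter_expr)
    return read_and_filter

# e.g., restrict to one split's patients:
# reader = read_and_filter_fntr_sketch(pl.col("patient_id").is_in(ids), pl.scan_parquet)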

Verification successful

Function usage updated correctly.

The read_and_filter_fntr function is used with the new signature in the codebase, ensuring compatibility with the recent changes.

  • File: src/MEDS_transforms/reshard_to_split.py
  • Usage: Matches the new signature with a pl.Expr for filtering.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `read_and_filter_fntr` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type python -A 5 $'read_and_filter_fntr'

Length of output: 75


Script:

#!/bin/bash
# Description: Verify all function calls to `read_and_filter_fntr` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type py -A 5 $'read_and_filter_fntr'

Length of output: 2936

README.md (3)

3-4: Badges enhance visibility.

The addition of badges for code coverage and documentation status improves the project's visibility and accessibility. Ensure the links are correct and up-to-date.


25-33: Installation instructions are clearer.

The reformatting of installation instructions improves readability. Ensure the commands are accurate and up-to-date.


40-78: Consider grammatical improvement and approve formatting.

The reformatting enhances clarity. However, consider addressing the grammatical issue highlighted by static analysis regarding the use of "will":

- Note this will typically required a distributed file system
+ Note this will typically require a distributed file system

Likely invalid or redundant comment.

src/MEDS_transforms/aggregate_code_metadata.py (3)

106-151: Quantile reducer function looks good.

The quantile_reducer function effectively calculates specified quantiles. Verify its integration with existing aggregation logic to ensure seamless functionality.
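A sketch of a Polars quantile reducer in this spirit (the struct layout and key names are assumptions):

import polars as pl

def quantile_reducer_sketch(col: pl.Expr, quantiles: list[float]) -> pl.Expr:
    """Sketch: gather the requested quantiles of a column into one struct."""
    return pl.struct(
        [col.quantile(q).alias(f"values/quantile/{q}") for q in quantiles]
    )

df = pl.DataFrame({"code": ["A", "A", "B"], "value": [1.0, 3.0, 5.0]})
print(
    df.group_by("code").agg(
        quantile_reducer_sketch(pl.col("value"), [0.25, 0.5, 0.75]).alias(
            "values/quantiles"
        )
    )
)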


64-67: Enumeration update is appropriate.

The addition of VALUES_QUANTILES to the METADATA_FN enumeration enhances statistical analysis capabilities. Ensure it is correctly integrated into the system.


165-171: Aggregation dictionary update is correct.

The addition of values/quantiles mapped to quantile_reducer in the CODE_METADATA_AGGREGATIONS dictionary appears correct. Verify its usage in the aggregation process to ensure proper functionality.

src/MEDS_transforms/extract/convert_to_sharded_events.py (2)

20-20: Consolidated imports for improved readability.

The consolidation of imports from MEDS_transforms.utils into a single line enhances the readability of the code. This change is approved.


683-685: Streamlined initialization of shards.

The change from returning shards_map_fn to directly using cfg.shards_map_fp simplifies the code and improves readability. Ensure that all usages of stage_init in the codebase are updated to reflect this change.

Verification successful

Consistent usage of stage_init across the codebase.

The changes to the stage_init function's return values and their usage in convert_to_sharded_events.py are consistent with other parts of the codebase. The updates have been correctly applied.

  • Confirmed in src/MEDS_transforms/extract/convert_to_sharded_events.py and other related files.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `stage_init` and ensure it aligns with the new return values.

# Test: Search for the function usage. Expect: Only occurrences of the new usage pattern.
rg --type python -A 5 $'stage_init'

Length of output: 65


Script:

#!/bin/bash
# Correct the `rg` command to search for Python files by using the `--type-add` option.
rg --type-add 'py:*.py' --type py -A 5 $'stage_init'

Length of output: 6830

src/MEDS_transforms/extract/__init__.py (2)

3-3: Imports are well-structured and necessary.

The addition of MANDATORY_COLUMNS and MANDATORY_TYPES is appropriate as they are used in constructing MEDS_DATA_MANDATORY_TYPES.


16-16: Dynamic construction of MEDS_DATA_MANDATORY_TYPES is efficient and maintainable.

The use of a dictionary comprehension to map columns to types enhances flexibility and centralizes configuration, which is a good practice.
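The pattern being praised is simply the following (names and types here are illustrative, not the package's actual constants):

import polars as pl

MANDATORY_COLUMNS = ["patient_id", "time", "code", "numeric_value"]
MANDATORY_TYPES = {
    "patient_id": pl.Int64,
    "time": pl.Datetime("us"),
    "code": pl.Utf8,
    "numeric_value": pl.Float64,
}

# One comprehension keeps column order and types defined in a single place:
MEDS_DATA_MANDATORY_TYPES = {col: MANDATORY_TYPES[col] for col in MANDATORY_COLUMNS}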

@@ -19,7 +19,7 @@ def shard_patients(
     patients: np.ndarray,
     n_patients_per_shard: int = 50000,
     external_splits: dict[str, Sequence[SUBJ_ID_T]] | None = None,
-    split_fracs_dict: dict[str, float] = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+    split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},

Avoid mutable default arguments.

Using mutable data structures like dictionaries as default arguments can lead to unexpected behavior. Replace with None and initialize within the function.

-    split_fracs_dict: dict[str, float] | None = {"train": 0.8, "tuning": 0.1, "held_out": 0.1},
+    split_fracs_dict: dict[str, float] | None = None,

Then, initialize within the function:

if split_fracs_dict is None:
    split_fracs_dict = {"train": 0.8, "tuning": 0.1, "held_out": 0.1}
Tools
Ruff

22-22: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Comment on lines +21 to +30
if parts[-1] == "__init__":
parts = parts[:-1]
doc_path = doc_path.with_name("index.md")
full_doc_path = full_doc_path.with_name("index.md")

readme_path = Path("/".join(parts + ("README.md",)))
if (src / readme_path).exists():
md_file_lines.append(f'--8<-- "src/{str(readme_path)}"')
elif parts[-1] == "__main__":
continue

Optimize conditional logic for special files.

The handling of __init__.py and __main__.py files is clear, but consider refactoring the logic for clarity and maintainability.

if parts[-1] in {"__init__", "__main__"}:
    if parts[-1] == "__init__":
        parts = parts[:-1]
        doc_path = doc_path.with_name("index.md")
        full_doc_path = full_doc_path.with_name("index.md")
        readme_path = Path("/".join(parts + ("README.md",)))
        if (src / readme_path).exists():
            md_file_lines.append(f'--8<-- "src/{str(readme_path)}"')
    continue

@@ -0,0 +1,95 @@
name: Publish Python 🐍 distribution 📦 to PyPI and TestPyPI

on: push

Trigger on specific branches or tags.

Currently, the workflow triggers on all pushes. Consider specifying branches or tags to avoid unnecessary builds.

on:
  push:
    branches:
      - main
      - 'release/*'
    tags:
      - 'v*'

Comment on lines +252 to +287
def shuffle_shards(shards: list[str], cfg: DictConfig) -> list[str]:
    """Shuffle the shards in a deterministic, pseudo-random way based on the worker ID in the configuration.

    Args:
        shards: The list of shards to shuffle.
        cfg: The configuration dictionary for the overall pipeline. Should (possibly) contain the following
            keys (some are optional, as marked below):
            - `worker` (optional): The worker ID for the MR worker; this is also used to seed the
              randomization process. If not provided, the randomization process will be unseeded.

    Returns:
        The shuffled list of shards.

    Examples:
        >>> cfg = DictConfig({"worker": 1})
        >>> shards = ["train/0", "train/1", "tuning", "held_out"]
        >>> shuffle_shards(shards, cfg)
        ['train/1', 'held_out', 'tuning', 'train/0']
        >>> cfg = DictConfig({"worker": 2})
        >>> shuffle_shards(shards, cfg)
        ['tuning', 'held_out', 'train/1', 'train/0']
    """

    if "worker" in cfg:
        add_str = str(cfg["worker"])
    else:
        add_str = str(datetime.now())

    shard_keys = []
    for shard in shards:
        shard_hash = hashlib.sha256((add_str + shard).encode("utf-8")).hexdigest()
        if shard_hash in shard_keys:
            raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
        shard_keys.append(int(shard_hash, 16))

    return [shard for _, shard in sorted(zip(shard_keys, shards))]

Handle potential hash collisions.

Consider adding a mechanism to handle hash collisions more gracefully in shuffle_shards.

- if shard_hash in shard_keys:
-     raise ValueError(f"Hash collision for shard {shard} with add_str {add_str}!")
+ if shard_hash in shard_keys:
+     logger.warning(f"Hash collision for shard {shard} with add_str {add_str}. Retrying with a new seed.")
+     return shuffle_shards(shards, cfg)

Committable suggestion was skipped due to low confidence.

Tools
Ruff

275-278: Use ternary operator add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now()) instead of if-else-block

Replace if-else-block with add_str = str(cfg["worker"]) if "worker" in cfg else str(datetime.now())

(SIM108)

Comment on lines +444 to +446
files. Similar to task configuration files, but for models.
2. Figure out how to ensure that each pre-processing step reads from the right prior files. Likely need some
kind of a "prior stage name" config variable.
kind of a "prior stage name" config variable.

Consider grammatical improvement and approve content.

The section provides clear insights. However, consider addressing the grammatical issue highlighted by static analysis regarding the use of "kind of a":

- kind of a "prior stage name" config variable
+ kind of "prior stage name" config variable
Tools
LanguageTool

[grammar] ~446-~446: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.)
Context: ...right prior files. Likely need some kind of a "prior stage name" config variable. ##...

(KIND_OF_A)

@mmcdermott mmcdermott merged commit b5f0277 into main Aug 11, 2024
10 checks passed