Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjusted join in flat reps to account for different timestamps with t… #107

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

pargaw
Copy link
Collaborator

@pargaw pargaw commented May 3, 2024

No description provided.

Copy link

coderabbitai bot commented May 3, 2024

Walkthrough

This update introduces several enhancements and optimizations to the EventStream project. New parameters were added to improve the handling of caching, error messages, and data processing, including overwrite_cache_filtered_task, tuning_subset_size, and tuning_subset_seed. Additionally, subject IDs are now consistently converted to strings for filtering across functions. These changes improve the robustness and flexibility of the dataset processing tasks.

Changes

File Change Summary
EventStream/baseline/FT_task_baseline.py Added parameter overwrite_cache_filtered_task to load_flat_rep function and updated related caching logic.
EventStream/data/dataset_base.py Updated DataFrame names for better error messages, and added progress bars for processing events and measurements.
EventStream/data/dataset_polars.py Modified handling of subject IDs by converting them to strings, and updated filtering mechanisms accordingly.
scripts/build_flat_reps.py Added a main function to build flat representation datasets, incorporating dataset loading, cache resolution, and configuration.
EventStream/data/config.py Added tuning_subset_size and tuning_subset_seed parameters to PytorchDatasetConfig, and included validation logic for these.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@pargaw pargaw requested a review from mmcdermott May 3, 2024 18:25
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Out of diff range and nitpick comments (2)
EventStream/baseline/FT_task_baseline.py (2)

49-49: Add a docstring entry for the new parameter overwrite_cache_filtered_task.

It's good practice to document all parameters in the function's docstring to improve code maintainability and readability.


Line range hint 243-243: Use isinstance() for type checking instead of comparing types directly.

- if type(window_sizes) is not list:
+ if not isinstance(window_sizes, list):

This change uses the more Pythonic isinstance() for type checking, which is generally preferred over direct type comparison.

Comment on lines 188 to 189
df = filter_join_df.join_asof(df, by='subject_id', on='timestamp',
strategy='forward' if '-' in window_size else 'backward')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor the join_asof method to handle potential discrepancies in data alignment.

- df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', 
-                               strategy='forward' if '-' in window_size else 'backward')
+ strategy = 'forward' if '-' in window_size else 'backward'
+ try:
+     df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', strategy=strategy)
+ except Exception as e:
+     # Handle or log the exception
+     raise e

This refactoring adds error handling around the join_asof method, which can be crucial when dealing with large datasets and complex joins.


Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
df = filter_join_df.join_asof(df, by='subject_id', on='timestamp',
strategy='forward' if '-' in window_size else 'backward')
strategy = 'forward' if '-' in window_size else 'backward'
try:
df = filter_join_df.join_asof(df, by='subject_id', on='timestamp', strategy=strategy)
except Exception as e:
# Handle or log the exception
raise e

Repository owner deleted a comment from coderabbitai bot May 3, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Out of diff range and nitpick comments (1)
EventStream/baseline/FT_task_baseline.py (1)

Line range hint 247-247: Use isinstance() for type checking instead of comparing types directly. This is more Pythonic and handles subclasses properly.

- if type(window_sizes) is not list:
+ if not isinstance(window_sizes, list):

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b006195 and 22dca2d.

Files selected for processing (2)
  • EventStream/data/dataset_polars.py (3 hunks)
  • scripts/build_flat_reps.py (1 hunks)
Additional context used
Ruff
scripts/build_flat_reps.py

13-13: omegaconf.OmegaConf imported but unused (F401)

Remove unused import: omegaconf.OmegaConf

EventStream/data/dataset_polars.py

204-204: Do not compare types, use isinstance() (E721)


1618-1618: Function definition does not bind loop variable m (B023)


1618-1618: Function definition does not bind loop variable key_prefix (B023)


1622-1622: Function definition does not bind loop variable m (B023)


1622-1622: Function definition does not bind loop variable val_prefix (B023)


1624-1624: Function definition does not bind loop variable m (B023)


1624-1624: Function definition does not bind loop variable val_prefix (B023)


1627-1627: Function definition does not bind loop variable m (B023)


1627-1627: Function definition does not bind loop variable val_prefix (B023)


1628-1628: Function definition does not bind loop variable m (B023)


1628-1628: Function definition does not bind loop variable val_prefix (B023)


1629-1629: Function definition does not bind loop variable m (B023)


1629-1629: Function definition does not bind loop variable val_prefix (B023)


1639-1639: Function definition does not bind loop variable m (B023)

Additional comments not posted (6)
scripts/build_flat_reps.py (3)

4-9: Good handling of optional dependencies with graceful degradation.


25-41: Well-structured data processing logic with clear separation of configuration and execution.


26-41: Effective use of logging to track process flow and completion.

EventStream/data/dataset_polars.py (3)

1420-1421: The changes to cast subject_id to string before filtering align with the PR objectives and improve consistency in ID handling.


1481-1482: The modifications to cast subject_id to string in _summarize_time_dependent_measurements are consistent with the changes in other methods, ensuring uniform handling of IDs.


1545-1547: The changes in _summarize_dynamic_measurements to cast subject_id to string and filter before joining are consistent with similar modifications in other methods, ensuring uniform ID handling.

Comment on lines 21 to 23
save_dir = Path(cfg.pop("save_dir"))
window_sizes = cfg.pop("window_sizes")
subjects_per_output_file = cfg.pop("subjects_per_output_file") if 'subjects_per_output_file' in cfg else None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding error handling for missing configuration items to prevent runtime errors.

# Example of adding default values and error handling
save_dir = Path(cfg.pop("save_dir", "default/path"))
window_sizes = cfg.pop("window_sizes", [default_window_size])
subjects_per_output_file = cfg.pop("subjects_per_output_file", default_subjects_per_output_file)


from pathlib import Path
import hydra
from omegaconf import DictConfig, OmegaConf
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unused import to clean up the code.

- from omegaconf import DictConfig, OmegaConf
+ from omegaconf import DictConfig
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from omegaconf import DictConfig, OmegaConf
from omegaconf import DictConfig
Tools
Ruff

13-13: omegaconf.OmegaConf imported but unused (F401)

Remove unused import: omegaconf.OmegaConf

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Outside diff range and nitpick comments (3)
EventStream/data/dataset_base.py (1)

Line range hint 226-269: Consider optimizing the exception handling for better clarity and maintainability.

The current implementation of exception handling within the loop (lines 232 and 269) could be improved for clarity and maintainability. Specifically, the generic Exception is caught, which might obscure the source of errors. It's generally a good practice to catch more specific exceptions. Additionally, the error messages could provide more context about the failure. Here's a proposed refactor:

try:
    df = cls._load_input_df(df_name, all_columns, subject_id_col, subject_ids_map, subject_id_dtype)
except SpecificExceptionType as e:  # Replace SpecificExceptionType with the actual expected exception types
    raise ValueError(f"Error loading data from {df_name}: {str(e)}") from e

...

try:
    new_events = cls._inc_df_col(events, "event_id", running_event_id_max)
except SpecificExceptionType as e:  # Replace SpecificExceptionType with the actual expected exception types
    raise ValueError(f"Error incrementing event_id for {event_type}: {str(e)}") from e

This change not only makes the code easier to understand and maintain but also aids in debugging by providing clearer, more descriptive error messages.

EventStream/data/dataset_polars.py (2)

Line range hint 204-204: Use isinstance() instead of direct type comparison for better practice and readability.

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 1618-1618: Bind loop variables correctly in function definitions to avoid late binding issues.

- .map_alias(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .map_alias(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Also applies to: 1622-1622, 1624-1624, 1627-1627, 1628-1628, 1629-1629, 1639-1639

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 22dca2d and 435d968.

Files selected for processing (2)
  • EventStream/data/dataset_base.py (2 hunks)
  • EventStream/data/dataset_polars.py (6 hunks)
Additional context used
Ruff
EventStream/data/dataset_polars.py

204-204: Do not compare types, use isinstance() (E721)


1618-1618: Function definition does not bind loop variable m (B023)


1618-1618: Function definition does not bind loop variable key_prefix (B023)


1622-1622: Function definition does not bind loop variable m (B023)


1622-1622: Function definition does not bind loop variable val_prefix (B023)


1624-1624: Function definition does not bind loop variable m (B023)


1624-1624: Function definition does not bind loop variable val_prefix (B023)


1627-1627: Function definition does not bind loop variable m (B023)


1627-1627: Function definition does not bind loop variable val_prefix (B023)


1628-1628: Function definition does not bind loop variable m (B023)


1628-1628: Function definition does not bind loop variable val_prefix (B023)


1629-1629: Function definition does not bind loop variable m (B023)


1629-1629: Function definition does not bind loop variable val_prefix (B023)


1639-1639: Function definition does not bind loop variable m (B023)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Outside diff range and nitpick comments (2)
EventStream/data/dataset_polars.py (2)

Line range hint 204-204: Use isinstance() instead of comparing types directly.

Direct type comparisons are generally less flexible and can lead to errors if subtypes are involved. Replace the type comparison with isinstance().

- if type(qq) is Path:
+ if isinstance(qq, Path):

Line range hint 1620-1620: Address loop variable binding issues in lambda functions.

Python's late binding behavior can lead to unexpected results when using loop variables inside lambda functions. To fix this, you can use default arguments in the lambda functions to bind the current value of the loop variable at each iteration.

- .map_alias(lambda c: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),
+ .map_alias(lambda c, m=m, key_prefix=key_prefix: f"dynamic/{m}/{c.replace(key_prefix, '')}/count"),

Apply this pattern to all similar occurrences in the code to ensure that the lambda functions capture the correct values of m and key_prefix or val_prefix at each iteration.

Also applies to: 1624-1624, 1626-1626, 1629-1629, 1630-1630, 1631-1631, 1641-1641

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 435d968 and 5c6cb4b.

Files selected for processing (3)
  • EventStream/data/dataset_base.py (2 hunks)
  • EventStream/data/dataset_polars.py (5 hunks)
  • scripts/build_flat_reps.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • EventStream/data/dataset_base.py
  • scripts/build_flat_reps.py
Additional context used
Ruff
EventStream/data/dataset_polars.py

204-204: Do not compare types, use isinstance() (E721)


1620-1620: Function definition does not bind loop variable m (B023)


1620-1620: Function definition does not bind loop variable key_prefix (B023)


1624-1624: Function definition does not bind loop variable m (B023)


1624-1624: Function definition does not bind loop variable val_prefix (B023)


1626-1626: Function definition does not bind loop variable m (B023)


1626-1626: Function definition does not bind loop variable val_prefix (B023)


1629-1629: Function definition does not bind loop variable m (B023)


1629-1629: Function definition does not bind loop variable val_prefix (B023)


1630-1630: Function definition does not bind loop variable m (B023)


1630-1630: Function definition does not bind loop variable val_prefix (B023)


1631-1631: Function definition does not bind loop variable m (B023)


1631-1631: Function definition does not bind loop variable val_prefix (B023)


1641-1641: Function definition does not bind loop variable m (B023)

Comment on lines +1420 to +1423
self.subjects_df = self.subjects_df.with_columns(pl.col("subject_id").cast(pl.Utf8))
df = self.subjects_df.filter(
pl.col("subject_id").is_in([str(id) for id in include_only_subjects])
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure consistent data type handling for subject_id across methods.

It's a good practice to maintain consistency in how data types are handled across different methods. Here, the subject_id is cast to string in several filtering operations. Consider creating a helper function to perform this casting and filtering operation, which can then be reused across these methods, enhancing code reusability and maintainability.

- self.subjects_df = self.subjects_df.with_columns(pl.col("subject_id").cast(pl.Utf8))
- df = self.subjects_df.filter(pl.col("subject_id").is_in([str(id) for id in include_only_subjects]))
+ df = self._filter_subjects_by_id(include_only_subjects)

And add a new method in the class:

def _filter_subjects_by_id(self, include_only_subjects):
    self.subjects_df = self.subjects_df.with_columns(pl.col("subject_id").cast(pl.Utf8))
    return self.subjects_df.filter(pl.col("subject_id").is_in([str(id) for id in include_only_subjects]))

Also applies to: 1483-1484, 1547-1551

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5c6cb4b and 69b99ce.

Files selected for processing (1)
  • EventStream/baseline/FT_task_baseline.py (6 hunks)

@@ -46,6 +46,7 @@ def load_flat_rep(
do_update_if_missing: bool = True,
task_df_name: str | None = None,
do_cache_filtered_task: bool = True,
overwrite_cache_filtered_task: bool = False,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure proper documentation for the new parameter.

The new parameter overwrite_cache_filtered_task should be included in the function's docstring to maintain comprehensive documentation.

+        overwrite_cache_filtered_task: If `True`, the flat representations will be regenerated. If `False`, the cached file will be loaded if exists.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
overwrite_cache_filtered_task: bool = False,
overwrite_cache_filtered_task: bool = False,

Comment on lines +79 to +80
overwrite_cache_filtered_task: If `True`, the flat representations will be regenerated. If `False`,
the cached file will be loaded if exists.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tip

Codebase Verification

Ensure all calls to load_flat_rep include the new parameter overwrite_cache_filtered_task.

The following calls to load_flat_rep in sample_data/examine_synthetic_data.ipynb do not include the new parameter and should be updated:

  • flat_reps = load_flat_rep(ESD, window_sizes=['7d'])
  • flat_reps = load_flat_rep(ESD, window_sizes=['FULL'], feature_inclusion_frequency=0.001)
  • flat_reps = load_flat_rep(ESD, window_sizes=['1d', '7d', 'FULL'])
  • flat_reps = load_flat_rep(ESD, window_sizes=['2d'], do_update_if_missing=False)
  • flat_reps = load_flat_rep(ESD, window_sizes=['1d', '7d', 'FULL'], subjects_included={'train': {0, 1, 2}})
  • flat_reps = load_flat_rep(ESD, window_sizes=['7d'], task_df_name='single_label_binary_classification')
  • flat_reps = load_flat_rep(ESD, window_sizes=['FULL', '1d'], task_df_name='multi_class_classification', do_cache_filtered_task=False)
  • flat_reps = load_flat_rep(ESD, window_sizes=['FULL', '1d'], task_df_name='single_label_binary_classification', subjects_included={'train': {0, 1, 2}})

Please update these calls to include the overwrite_cache_filtered_task parameter as needed.

Analysis chain

LGTM! But verify the new parameter usage in the codebase.

The code changes are approved.

However, ensure that all function calls to load_flat_rep are updated to use the new parameter if necessary.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `load_flat_rep` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type python $'load_flat_rep'

Length of output: 63


Script:

#!/bin/bash
# Description: Verify all function calls to `load_flat_rep` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg 'load_flat_rep'

Length of output: 1215


Script:

#!/bin/bash
# Description: Extract lines around each call to `load_flat_rep` to verify the usage of the new parameter.

# Extract lines around the calls in the Jupyter notebook
rg -A 3 -B 3 'load_flat_rep' sample_data/examine_synthetic_data.ipynb

# Extract lines around the calls in the Python file
rg -A 3 -B 3 'load_flat_rep' EventStream/baseline/FT_task_baseline.py

Length of output: 3273

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
EventStream/data/config.py (1)

806-809: Add Documentation for New Parameters.

The new parameters tuning_subset_size and tuning_subset_seed should be documented in the docstring of the class.

+        tuning_subset_size: If the tuning data should be subsampled randomly, this specifies the size of the
+            tuning subset. If `None` or "FULL", then the full tuning data is used.
+        tuning_subset_seed: If the tuning data should be subsampled randomly, this specifies the seed for
+            that random subsampling.
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 69b99ce and bf453e1.

Files selected for processing (1)
  • EventStream/data/config.py (3 hunks)
Additional comments not posted (2)
EventStream/data/config.py (2)

880-881: Add New Parameters to Class Definition.

The new parameters tuning_subset_size and tuning_subset_seed are correctly added to the class definition.


916-931: Verify Validation Logic for New Parameters.

The validation logic for the new parameters tuning_subset_size and tuning_subset_seed in the __post_init__ method looks good. It ensures that the parameters are within the required range and sets a default seed if not provided.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants