From 110bd81a7f49bd3fad079eecac84361cc1cbf110 Mon Sep 17 00:00:00 2001
From: Matthew McDermott
Date: Thu, 29 Aug 2024 12:25:43 -0400
Subject: [PATCH] Updated the README to reflect the new usage.

---
 MIMIC-IV_Example/README.md | 155 +++++++++++++------------------
 1 file changed, 53 insertions(+), 102 deletions(-)

diff --git a/MIMIC-IV_Example/README.md b/MIMIC-IV_Example/README.md
index 6bf348d..dbfebf9 100644
--- a/MIMIC-IV_Example/README.md
+++ b/MIMIC-IV_Example/README.md
@@ -6,33 +6,34 @@ up from this one).
 
 ## Step 0: Installation
 
-Download this repository and install the requirements:
-If you want to install via pypi, (note that for now, you still need to copy some files locally even with a
-pypi installation, which is covered below, so make sure you are in a suitable directory) use:
-
 ```bash
 conda create -n MEDS python=3.12
 conda activate MEDS
-pip install "MEDS_transforms[local_parallelism]"
-mkdir MIMIC-IV_Example
-cd MIMIC-IV_Example
-wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script.sh
-wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script_slurm.sh
-wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/pre_MEDS.py
-chmod +x joint_script.sh
-chmod +x joint_script_slurm.sh
-chmod +x pre_MEDS.py
-cd ..
+pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
 ```
 
-If you want to install locally, use:
+If you want to profile the time and memory costs of your ETL, also install: `pip install hydra-profiler`.
 
+## Step 0.5: Set-up
+
+Set some environment variables and download the necessary files:
+
 ```bash
-git clone git@github.com:mmcdermott/MEDS_transforms.git
-cd MEDS_transforms
-conda create -n MEDS python=3.12
-conda activate MEDS
-pip install .[local_parallelism]
+export MIMICIV_RAW_DIR=??? # set to the directory in which you want to store the raw MIMIC-IV data
+export MIMICIV_PRE_MEDS_DIR=??? # set to the directory in which you want to store the intermediate (pre-MEDS) MIMIC-IV data
+export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS MIMIC-IV cohort
+
+export VERSION=0.0.6 # or whatever version you want
+export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"
+
+wget $URL/run.sh
+wget $URL/pre_MEDS.py
+wget $URL/local_parallelism_runner.yaml
+wget $URL/slurm_runner.yaml
+mkdir configs
+cd configs
+wget $URL/configs/extract_MIMIC.yaml
+cd ..
+chmod +x run.sh
+chmod +x pre_MEDS.py
 ```
 
 ## Step 1: Download MIMIC-IV
 
@@ -46,101 +47,51 @@ the root directory of where the resulting _core data files_ are stored -- e.g.,
 
 ```bash
 cd $MIMIC_RAW_DIR
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/d_labitems_to_loinc.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/inputevents_to_rxnorm.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/lab_itemid_to_loinc.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_main.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/meas_chartevents_value.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/numerics-summary.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/outputevents_to_loinc.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_datetimeevents.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/proc_itemid.csv
-wget https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map/waveforms-summary.csv
+export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
+wget $MIMIC_URL/d_labitems_to_loinc.csv
+wget $MIMIC_URL/inputevents_to_rxnorm.csv
+wget $MIMIC_URL/lab_itemid_to_loinc.csv
+wget $MIMIC_URL/meas_chartevents_main.csv
+wget $MIMIC_URL/meas_chartevents_value.csv
+wget $MIMIC_URL/numerics-summary.csv
+wget $MIMIC_URL/outputevents_to_loinc.csv
+wget $MIMIC_URL/proc_datetimeevents.csv
+wget $MIMIC_URL/proc_itemid.csv
+wget $MIMIC_URL/waveforms-summary.csv
 ```
 
-## Step 2: Run the basic MEDS ETL
-
-This step contains several sub-steps; luckily, all these substeps can be run via a single script, with the
-`joint_script.sh` script which uses the Hydra `joblib` launcher to run things with local parallelism (make
-sure you enable this feature by including the `[local_parallelism]` option during installation) or via
-`joint_script_slurm.sh` which uses the Hydra `submitit` launcher to run things through slurm (make sure you
-enable this feature by including the `[slurm_parallelism]` option during installation). This script entails
-several steps:
-
-### Step 2.1: Get the data ready for base MEDS extraction
-
-This is a step in a few parts:
-
-1. Join a few tables by `hadm_id` to get the right times in the right rows for processing. In
-   particular, we need to join:
-   - the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
-     `hadm_id`.
-   - the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
-2. Convert the subject's static data to a more parseable form. This entails:
-   - Get the subject's DOB in a format that is usable for MEDS, rather than the integral `anchor_year` and
-     `anchor_offset` fields.
-   - Merge the subject's `dod` with the `deathtime` from the `admissions` table.
-
-After these steps, modified files or symlinks to the original files will be written in a new directory which
-will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PREMEDS_DIR` to denote this
-directory.
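+
+If you have not yet downloaded the raw MIMIC-IV files themselves, PhysioNet supports a command-line download
+for credentialed accounts. The command below is only an illustrative sketch (it is not part of this
+repository's scripts): adjust the MIMIC-IV version to the one you intend to use, substitute your own
+PhysioNet username, and point `$MIMICIV_RAW_DIR` at the resulting files.
+
+```bash
+# Illustrative only: recursive download of MIMIC-IV from PhysioNet (requires credentialed access).
+# Replace YOUR_PHYSIONET_USERNAME and the version (2.2 shown here) as appropriate for your project.
+wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password \
+    https://physionet.org/files/mimiciv/2.2/
+```
+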
+## Step 2: Run the MEDS ETL
 
-This step is run in the `joint_script.sh` script or the `joint_script_slurm.sh` script, but in either case the
-base command that is run is as follows (assumed to be run **not** from this directory but from the
-root directory of this repository):
+To run the MEDS ETL, run the following command:
 
 ```bash
-./MIMIC-IV_Example/pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PREMEDS_DIR
+./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
 ```
 
-In practice, on a machine with 150 GB of RAM and 10 cores, this step takes less than 5 minutes in total.
+To not unzip the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.
 
-### Step 2.2: Run the MEDS extraction ETL
+To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
+additional argument:
 
-We will assume you want to output the final MEDS dataset into a directory we'll denote as `$MIMICIV_MEDS_DIR`.
-Note this is a different directory than the pre-MEDS directory (though, of course, they can both be
-subdirectories of the same root directory).
-
-This is a step in 4 parts:
-
-1. Sub-shard the raw files. Run this command as many times simultaneously as you would like to have workers
-   performing this sub-sharding step. See below for how to automate this parallelism using hydra launchers.
-
-   This step uses the `./scripts/extraction/shard_events.py` script. See `joint_script*.sh` for the expected
-   format of the command.
-
-2. Extract and form the subject splits and sub-shards. The `./scripts/extraction/split_and_shard_subjects.py`
-   script is used for this step. See `joint_script*.sh` for the expected format of the command.
-
-3. Extract subject sub-shards and convert to MEDS events. The
-   `./scripts/extraction/convert_to_sharded_events.py` script is used for this step. See `joint_script*.sh` for
-   the expected format of the command.
-
-4. Merge the MEDS events into a single file per subject sub-shard. The
-   `./scripts/extraction/merge_to_MEDS_cohort.py` script is used for this step. See `joint_script*.sh` for the
-   expected format of the command.
-
-5. (Optional) Generate preliminary code statistics and merge to external metadata. This is not performed
-   currently in the `joint_script*.sh` scripts.
-
-## Limitations / TO-DOs:
-
-Currently, some tables are ignored, including:
+```bash
+export N_WORKERS=5
+./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
+    stage_runner_fp=slurm_runner.yaml
+```
 
-1. `hosp/emar_detail`
-2. `hosp/microbiologyevents`
-3. `hosp/services`
-4. `icu/datetimeevents`
-5. `icu/ingredientevents`
+The `N_WORKERS` environment variable (set before the command) controls the maximum number of parallel
+workers that will be used.
 
-Lots of questions remain about how to appropriately handle times of the data -- e.g., things like HCPCS
-events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the
-timeline which is otherwise stored at the _datetime_ resolution?
+The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
+worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
+so that the partition names are correct before use.**_ The memory and time costs are viable in the current
+configuration, but if your nodes are sufficiently different you may need to adjust those as well.
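+
+If you are unsure which partition names are valid on your cluster, standard Slurm tooling (independent of
+this repository) can list them before you edit `slurm_runner.yaml`, for example:
+
+```bash
+# List the partitions available on this Slurm cluster; use one of these names in slurm_runner.yaml.
+sinfo --summarize
+```
+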
-Other questions:
+The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
+launching machine. There are no additional arguments needed for this runner beyond the `N_WORKERS` environment
+variable, and there is nothing to customize in this file.
 
-1. How to handle merging the deathtimes between the hosp table and the subjects table?
-2. How to handle the dob nonsense MIMIC has?
+To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end.
 
 ## Notes