
Commit 110bd81
Updated the README to reflect the new usage.
mmcdermott committed Aug 29, 2024
1 parent 4cea12c commit 110bd81
Showing 1 changed file with 53 additions and 102 deletions: MIMIC-IV_Example/README.md

## Step 0: Installation

If you want to install via PyPI, use the following commands. Note that, for now, you will still need to copy
some files locally even with a PyPI installation (this is covered in Step 0.5 below), so make sure you are in
a suitable working directory first.

```bash
conda create -n MEDS python=3.12
conda activate MEDS
pip install "MEDS_transforms[local_parallelism]"
mkdir MIMIC-IV_Example
cd MIMIC-IV_Example
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script_slurm.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/pre_MEDS.py
chmod +x joint_script.sh
chmod +x joint_script_slurm.sh
chmod +x pre_MEDS.py
cd ..
pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
```
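
To confirm the package is visible in the active `MEDS` environment, a quick check like the following can be
used (a minimal sanity check; nothing beyond a standard `pip` lookup is assumed):

```bash
# Shows the installed MEDS_transforms version and its install location, if the install succeeded.
pip show MEDS_transforms
```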

If you want to install locally, use:

```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[local_parallelism]
```

If you want to profile the time and memory costs of your ETL, also install: `pip install hydra-profiler`.

## Step 0.5: Set-up
Set some environment variables and download the necessary files:
```bash
export MIMICIV_RAW_DIR=??? # set to the directory in which you want to store the raw MIMIC-IV data
export MIMICIV_PRE_MEDS_DIR=??? # set to the directory in which you want to store the intermediate, pre-MEDS MIMIC-IV data
export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS MIMIC-IV cohort

export VERSION=0.0.6 # or whatever version you want
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"

wget $URL/run.sh
wget $URL/pre_MEDS.py
wget $URL/local_parallelism_runner.yaml
wget $URL/slurm_runner.yaml
mkdir configs
cd configs
wget $URL/configs/extract_MIMIC.yaml
cd ..
chmod +x run.sh
chmod +x pre_MEDS.py
```
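
For concreteness, a hypothetical set-up might look like the following; the paths are illustrative only, and
the `ls` simply confirms that the files downloaded above are in place:

```bash
# Example values only; point these at any directories you like.
export MIMICIV_RAW_DIR="$HOME/mimiciv/raw"
export MIMICIV_PRE_MEDS_DIR="$HOME/mimiciv/pre_meds"
export MIMICIV_MEDS_COHORT_DIR="$HOME/mimiciv/meds_cohort"

# Confirm the downloaded files are present in the current working directory.
ls run.sh pre_MEDS.py local_parallelism_runner.yaml slurm_runner.yaml configs/extract_MIMIC.yaml
```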

## Step 1: Download MIMIC-IV
Download the raw MIMIC-IV data from PhysioNet. We will use `$MIMICIV_RAW_DIR` (set in Step 0.5) to denote the
root directory of where the resulting _core data files_ are stored. Then, download the following MIMIC-IV to
MEDS concept-map files from the `mimic-code` repository into that directory:

```bash
cd $MIMICIV_RAW_DIR
export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
wget $MIMIC_URL/d_labitems_to_loinc.csv
wget $MIMIC_URL/inputevents_to_rxnorm.csv
wget $MIMIC_URL/lab_itemid_to_loinc.csv
wget $MIMIC_URL/meas_chartevents_main.csv
wget $MIMIC_URL/meas_chartevents_value.csv
wget $MIMIC_URL/numerics-summary.csv
wget $MIMIC_URL/outputevents_to_loinc.csv
wget $MIMIC_URL/proc_datetimeevents.csv
wget $MIMIC_URL/proc_itemid.csv
wget $MIMIC_URL/waveforms-summary.csv
```
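
If you have not already pulled down the raw MIMIC-IV files themselves, PhysioNet's credentialed download
typically looks like the following; the username placeholder and the version path are assumptions, so adjust
them to your account and to the MIMIC-IV release you are using:

```bash
# Requires credentialed PhysioNet access; wget will prompt for your password.
wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password https://physionet.org/files/mimiciv/2.2/
# Note: a plain recursive wget nests the files under physionet.org/files/mimiciv/2.2/;
# either point $MIMICIV_RAW_DIR at that nested directory or move the files up.
```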

## Step 2: Run the MEDS ETL

This step contains several sub-steps; luckily, all of these sub-steps can be run via the single `run.sh`
script downloaded in Step 0.5. The script can parallelize work either locally (make sure you enable this
feature by including the `[local_parallelism]` option during installation) or through slurm via the Hydra
`submitit` launcher (make sure you enable this feature by including the `[slurm_parallelism]` option during
installation); see the stage runner files described in Step 2.2 below. The script entails several steps:

### Step 2.1: Get the data ready for base MEDS extraction

This is a step in a few parts:

1. Join a few tables by `hadm_id` to get the right times in the right rows for processing. In
particular, we need to join:
- the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
`hadm_id`.
- the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
2. Convert the subject's static data to a more parseable form. This entails:
    - Getting the subject's DOB in a format that is usable for MEDS, rather than the integral `anchor_year`
      and `anchor_offset` fields.
    - Merging the subject's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory, which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PRE_MEDS_DIR` to denote this
directory. This pre-MEDS step is performed by the `pre_MEDS.py` script downloaded in Step 0.5. In practice,
on a machine with 150 GB of RAM and 10 cores, it takes less than 5 minutes in total.
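
The pre-MEDS step can also be invoked on its own via the `pre_MEDS.py` script; assuming you are in the
directory into which you downloaded it in Step 0.5, the invocation looks like:

```bash
# Runs only the pre-MEDS preparation, writing its output to $MIMICIV_PRE_MEDS_DIR.
./pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PRE_MEDS_DIR
```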

### Step 2.2: Run the MEDS extraction ETL

To run the MEDS ETL, run the following command:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
```

To not unzip the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.
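
For example, if your raw MIMIC-IV files are already decompressed, the same command with unzipping disabled
would be:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=false
```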

The final MEDS dataset will be written to `$MIMICIV_MEDS_COHORT_DIR` (set in Step 0.5). Note this is a
different directory than the pre-MEDS directory (though, of course, they can both be subdirectories of the
same root directory).

Under the hood, the extraction ETL proceeds in several parts:

1. Sub-shard the raw files. This step can be run by as many workers simultaneously as you would like; see
   the stage runner files described below for how to automate this parallelism using hydra launchers. It is
   performed by the `shard_events.py` script.

2. Extract and form the subject splits and sub-shards, via the `split_and_shard_subjects.py` script.

3. Extract subject sub-shards and convert them to MEDS events, via the `convert_to_sharded_events.py`
   script.

4. Merge the MEDS events into a single file per subject sub-shard, via the `merge_to_MEDS_cohort.py` script.

5. (Optional) Generate preliminary code statistics and merge to external metadata; this is not run as part
   of the default pipeline.
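
Once the ETL finishes, the MEDS cohort is written under `$MIMICIV_MEDS_COHORT_DIR`; a quick way to eyeball
what was produced (the exact subdirectory layout is not assumed here) is:

```bash
# List a few of the output data files; MEDS data are stored as parquet shards.
find "$MIMICIV_MEDS_COHORT_DIR" -name '*.parquet' | head
```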

To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
additional argument:
```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
stage_runner_fp=slurm_runner.yaml
```

The `N_WORKERS` environment variable set before the command controls how many parallel workers should be used
at maximum.

The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
so that the partition names are correct before use.**_ The memory and time costs are viable in the current
configuration, but if your nodes are sufficiently different you may need to adjust those as well.
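
Before a slurm run, it can help to check where the partition names live in the runner file (the exact YAML
keys are not assumed here, only that partitions appear in the file):

```bash
# Show the lines of the slurm runner configuration that mention partitions.
grep -n -i "partition" slurm_runner.yaml
```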

The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
launching machine. There are no additional arguments needed for this stage beyond the `N_WORKERS` environment
variable and there is nothing to customize in this file.
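
For example, to run with local parallelism across 5 workers, mirroring the slurm invocation above with only
the stage runner file swapped:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=local_parallelism_runner.yaml
```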

To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end (this requires
the `hydra-profiler` package mentioned in Step 0).

## Limitations / TO-DOs:

Currently, some tables are ignored, including:

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`

Lots of questions remain about how to appropriately handle times of the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the
timeline, which is otherwise stored at the _datetime_ resolution?

Other questions:

1. How to handle merging the deathtimes between the `hosp` table and the subjects table?
2. How to handle the DOB nonsense MIMIC has?

## Notes
