
Commit 110bd81
Updated the README to reflect the new usage.
mmcdermott committed Aug 29, 2024
1 parent 4cea12c commit 110bd81
Showing 1 changed file with 53 additions and 102 deletions: MIMIC-IV_Example/README.md

## Step 0: Installation

If you want to install via PyPI, use the following commands. Note that, for now, you will still need to copy
some files locally even with a PyPI installation (this is covered in Step 0.5 below), so make sure you are in
a suitable working directory first.

```bash
conda create -n MEDS python=3.12
conda activate MEDS
pip install "MEDS_transforms[local_parallelism]"
mkdir MIMIC-IV_Example
cd MIMIC-IV_Example
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/joint_script_slurm.sh
wget https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/main/MIMIC-IV_Example/pre_MEDS.py
chmod +x joint_script.sh
chmod +x joint_script_slurm.sh
chmod +x pre_MEDS.py
cd ..
pip install "MEDS_transforms[local_parallelism,slurm_parallelism]"
```
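
To confirm the package is visible in the active `MEDS` environment, a quick check like the following can be
used (a minimal sanity check; nothing beyond a standard `pip` lookup is assumed):

```bash
# Shows the installed MEDS_transforms version and its install location, if the install succeeded.
pip show MEDS_transforms
```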

If you want to install locally, use:

```bash
git clone git@github.com:mmcdermott/MEDS_transforms.git
cd MEDS_transforms
conda create -n MEDS python=3.12
conda activate MEDS
pip install .[local_parallelism]
```

If you want to profile the time and memory costs of your ETL, also install: `pip install hydra-profiler`.

## Step 0.5: Set-up
Set some environment variables and download the necessary files:
```bash
export MIMICIV_RAW_DIR=??? # set to the directory in which you want to store the raw MIMIC-IV data
export MIMICIV_PRE_MEDS_DIR=??? # set to the directory in which you want to store the intermediate, pre-MEDS MIMIC-IV data
export MIMICIV_MEDS_COHORT_DIR=??? # set to the directory in which you want to store the final MEDS MIMIC-IV cohort

export VERSION=0.0.6 # or whatever version you want
export URL="https://raw.githubusercontent.com/mmcdermott/MEDS_transforms/$VERSION/MIMIC-IV_Example"

wget $URL/run.sh
wget $URL/pre_MEDS.py
wget $URL/local_parallelism_runner.yaml
wget $URL/slurm_runner.yaml
mkdir configs
cd configs
wget $URL/configs/extract_MIMIC.yaml
cd ..
chmod +x run.sh
chmod +x pre_MEDS.py
```
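
For concreteness, a hypothetical set-up might look like the following; the paths are illustrative only, and
the `ls` simply confirms that the files downloaded above are in place:

```bash
# Example values only; point these at any directories you like.
export MIMICIV_RAW_DIR="$HOME/mimiciv/raw"
export MIMICIV_PRE_MEDS_DIR="$HOME/mimiciv/pre_meds"
export MIMICIV_MEDS_COHORT_DIR="$HOME/mimiciv/meds_cohort"

# Confirm the downloaded files are present in the current working directory.
ls run.sh pre_MEDS.py local_parallelism_runner.yaml slurm_runner.yaml configs/extract_MIMIC.yaml
```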

## Step 1: Download MIMIC-IV
Download the raw MIMIC-IV data from PhysioNet. We will use `$MIMICIV_RAW_DIR` (set in Step 0.5) to denote the
root directory of where the resulting _core data files_ are stored. Then, download the following MIMIC-IV to
MEDS concept-map files from the `mimic-code` repository into that directory:

```bash
cd $MIMICIV_RAW_DIR
export MIMIC_URL=https://raw.githubusercontent.com/MIT-LCP/mimic-code/v2.4.0/mimic-iv/concepts/concept_map
wget $MIMIC_URL/d_labitems_to_loinc.csv
wget $MIMIC_URL/inputevents_to_rxnorm.csv
wget $MIMIC_URL/lab_itemid_to_loinc.csv
wget $MIMIC_URL/meas_chartevents_main.csv
wget $MIMIC_URL/meas_chartevents_value.csv
wget $MIMIC_URL/numerics-summary.csv
wget $MIMIC_URL/outputevents_to_loinc.csv
wget $MIMIC_URL/proc_datetimeevents.csv
wget $MIMIC_URL/proc_itemid.csv
wget $MIMIC_URL/waveforms-summary.csv
```
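
If you have not already pulled down the raw MIMIC-IV files themselves, PhysioNet's credentialed download
typically looks like the following; the username placeholder and the version path are assumptions, so adjust
them to your account and to the MIMIC-IV release you are using:

```bash
# Requires credentialed PhysioNet access; wget will prompt for your password.
wget -r -N -c -np --user YOUR_PHYSIONET_USERNAME --ask-password https://physionet.org/files/mimiciv/2.2/
# Note: a plain recursive wget nests the files under physionet.org/files/mimiciv/2.2/;
# either point $MIMICIV_RAW_DIR at that nested directory or move the files up.
```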

## Step 2: Run the MEDS ETL

This step contains several sub-steps; luckily, all of these sub-steps can be run via the single `run.sh`
script downloaded in Step 0.5. The script can parallelize work either locally (make sure you enable this
feature by including the `[local_parallelism]` option during installation) or through slurm via the Hydra
`submitit` launcher (make sure you enable this feature by including the `[slurm_parallelism]` option during
installation); see the stage runner files described in Step 2.2 below. The script entails several steps:

### Step 2.1: Get the data ready for base MEDS extraction

This is a step in a few parts:

1. Join a few tables by `hadm_id` to get the right times in the right rows for processing. In
particular, we need to join:
- the `hosp/diagnoses_icd` table with the `hosp/admissions` table to get the `dischtime` for each
`hadm_id`.
- the `hosp/drgcodes` table with the `hosp/admissions` table to get the `dischtime` for each `hadm_id`.
2. Convert the subject's static data to a more parseable form. This entails:
    - Getting the subject's DOB in a format that is usable for MEDS, rather than the integral `anchor_year`
      and `anchor_offset` fields.
    - Merging the subject's `dod` with the `deathtime` from the `admissions` table.

After these steps, modified files or symlinks to the original files will be written in a new directory, which
will be used as the input to the actual MEDS extraction ETL. We'll use `$MIMICIV_PRE_MEDS_DIR` to denote this
directory. This pre-MEDS step is performed by the `pre_MEDS.py` script downloaded in Step 0.5. In practice,
on a machine with 150 GB of RAM and 10 cores, it takes less than 5 minutes in total.
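
The pre-MEDS step can also be invoked on its own via the `pre_MEDS.py` script; assuming you are in the
directory into which you downloaded it in Step 0.5, the invocation looks like:

```bash
# Runs only the pre-MEDS preparation, writing its output to $MIMICIV_PRE_MEDS_DIR.
./pre_MEDS.py raw_cohort_dir=$MIMICIV_RAW_DIR output_dir=$MIMICIV_PRE_MEDS_DIR
```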

### Step 2.2: Run the MEDS extraction ETL

To run the MEDS ETL, run the following command:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true
```

To not unzip the `.csv.gz` files, set `do_unzip=false` instead of `do_unzip=true`.
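
For example, if your raw MIMIC-IV files are already decompressed, the same command with unzipping disabled
would be:

```bash
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=false
```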

The final MEDS dataset will be written to `$MIMICIV_MEDS_COHORT_DIR` (set in Step 0.5). Note this is a
different directory than the pre-MEDS directory (though, of course, they can both be subdirectories of the
same root directory).

Under the hood, the extraction ETL proceeds in several parts:

1. Sub-shard the raw files. This step can be run by as many workers simultaneously as you would like; see
   the stage runner files described below for how to automate this parallelism using hydra launchers. It is
   performed by the `shard_events.py` script.

2. Extract and form the subject splits and sub-shards, via the `split_and_shard_subjects.py` script.

3. Extract subject sub-shards and convert them to MEDS events, via the `convert_to_sharded_events.py`
   script.

4. Merge the MEDS events into a single file per subject sub-shard, via the `merge_to_MEDS_cohort.py` script.

5. (Optional) Generate preliminary code statistics and merge to external metadata; this is not run as part
   of the default pipeline.
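
Once the ETL finishes, the MEDS cohort is written under `$MIMICIV_MEDS_COHORT_DIR`; a quick way to eyeball
what was produced (the exact subdirectory layout is not assumed here) is:

```bash
# List a few of the output data files; MEDS data are stored as parquet shards.
find "$MIMICIV_MEDS_COHORT_DIR" -name '*.parquet' | head
```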

To use a specific stage runner file (e.g., to set different parallelism options), you can specify it as an
additional argument:
```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
stage_runner_fp=slurm_runner.yaml
```

The `N_WORKERS` environment variable set before the command controls how many parallel workers should be used
at maximum.

The `slurm_runner.yaml` file (downloaded above) runs each stage across several workers on separate slurm
worker nodes using the `submitit` launcher. _**You will need to customize this file to your own slurm system
so that the partition names are correct before use.**_ The memory and time costs are viable in the current
configuration, but if your nodes are sufficiently different you may need to adjust those as well.
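
Before a slurm run, it can help to check where the partition names live in the runner file (the exact YAML
keys are not assumed here, only that partitions appear in the file):

```bash
# Show the lines of the slurm runner configuration that mention partitions.
grep -n -i "partition" slurm_runner.yaml
```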

The `local_parallelism_runner.yaml` file (downloaded above) runs each stage via separate processes on the
launching machine. There are no additional arguments needed for this stage beyond the `N_WORKERS` environment
variable and there is nothing to customize in this file.
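
For example, to run with local parallelism across 5 workers, mirroring the slurm invocation above with only
the stage runner file swapped:

```bash
export N_WORKERS=5
./run.sh $MIMICIV_RAW_DIR $MIMICIV_PRE_MEDS_DIR $MIMICIV_MEDS_COHORT_DIR do_unzip=true \
    stage_runner_fp=local_parallelism_runner.yaml
```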

To profile the time and memory costs of your ETL, add the `do_profile=true` flag at the end (this requires
the `hydra-profiler` package mentioned in Step 0).

## Limitations / TO-DOs:

Currently, some tables are ignored, including:

1. `hosp/emar_detail`
2. `hosp/microbiologyevents`
3. `hosp/services`
4. `icu/datetimeevents`
5. `icu/ingredientevents`

Lots of questions remain about how to appropriately handle times of the data -- e.g., things like HCPCS
events are stored at the level of the _date_, not the _datetime_. How should those be slotted into the
timeline, which is otherwise stored at the _datetime_ resolution?

Other questions:

1. How to handle merging the deathtimes between the `hosp` table and the subjects table?
2. How to handle the DOB nonsense MIMIC has?

## Notes
