Zoey Jiang edited this page Feb 24, 2023 · 6 revisions

Introduction

After reading the README.md file, you should have a basic idea of CEHR-BERT and be able to get started with fine-tuning on your own dataset. As a data engineer, software developer, or researcher, you may be more interested in how to prepare your data to feed into the model and how to create your own prediction tasks. This wiki serves as a developer guide and gives you step-by-step instructions.


1. Data Engineering

[Figure: CEHR-BERT data engineering architecture]

To test your own model, the first step is the data engineering process, which consists of five parts: cohort builder, prediction task builder, model selection, pre-training data generation, and sequence data generation.

1.1 Prediction Task Builder


A prediction task can be phrased as follows: “among a particular group of people, who will go on to experience some event?” One can think of this problem as defining a target cohort, which represents the initial group of people, and an outcome cohort, which represents the subset of that group who experience a particular event; for example, among type 2 diabetes patients, who will go on to develop heart failure? Both target and outcome cohorts can be defined as groups of people who satisfy certain inclusion criteria for a certain period of time. Typically, a cohort definition includes a cohort entry event and a set of inclusion criteria (an exclusion criterion can be thought of as an inclusion criterion with 0 occurrences). Specifically, the cohort entry event defines the index date, at which patients enter the cohort, and the inclusion criteria add further constraints where applicable, such as required diagnoses, medications, procedures, or temporal relationships among criteria.

In addition, a prediction window needs to be specified to generate the ground truth labels for the given target and outcome cohorts. If the outcome index date falls between the index date of the target cohort and the end of the prediction window, the case is labeled positive; otherwise it is labeled negative.
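The labeling rule above can be sketched in a few lines of Python. This is only an illustration, not the CEHR-BERT implementation: the function name `label_patients`, the dictionary inputs, and the choice to treat both ends of the window as inclusive are all assumptions made for this example.

```python
from datetime import date, timedelta


def label_patients(target_index, outcome_index, prediction_window_days):
    """Return {person_id: 0 or 1} for every person in the target cohort.

    target_index and outcome_index map person_id -> cohort index date.
    (Hypothetical helper; the inclusive window boundaries are an assumption.)
    """
    window = timedelta(days=prediction_window_days)
    labels = {}
    for person_id, start in target_index.items():
        outcome_date = outcome_index.get(person_id)
        # Positive only if the outcome index date falls inside
        # [target index date, target index date + prediction window].
        is_positive = (
            outcome_date is not None
            and start <= outcome_date <= start + window
        )
        labels[person_id] = int(is_positive)
    return labels


# Toy cohorts: person_id -> index date (made-up values).
target = {1: date(2020, 1, 1), 2: date(2020, 3, 1), 3: date(2020, 6, 1)}
outcome = {1: date(2020, 6, 1), 2: date(2022, 1, 1)}

labels = label_patients(target, outcome, prediction_window_days=365)
print(labels)
```

Person 1 is positive because the outcome occurs within the window, person 2 is negative because the outcome occurs after the window ends, and person 3 is negative because they never enter the outcome cohort.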

You can refer to the Type 2 Diabetes Mellitus cohort definition as an example of how to create an individual cohort. You can then mix and match any two cohorts as your target and outcome cohorts.

1.2 Model Selection

After defining your target and outcome cohorts, you can choose which type of model to train: sequence-based BERT models, including CEHR-BERT, MedBERT, and BEHRT, or frequency-based models, such as Logistic Regression and XGBoost.
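To make the distinction concrete, a frequency-based model typically consumes a fixed-length vector of concept counts and ignores event order. The sketch below shows one generic way to build such a vector; the helper `count_features`, the toy vocabulary, and the concept names are assumptions for illustration and are not taken from the CEHR-BERT code base.

```python
from collections import Counter


def count_features(concept_ids, vocabulary):
    """Turn a patient's concept history into a count vector.

    The vector has one entry per vocabulary concept, in vocabulary order.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter(concept_ids)
    # Counter returns 0 for concepts the patient never had.
    return [counts[concept] for concept in vocabulary]


# Made-up vocabulary and patient history.
vocabulary = ["diabetes", "metformin", "hba1c_test", "heart_failure"]
history = ["diabetes", "hba1c_test", "metformin", "hba1c_test"]

features = count_features(history, vocabulary)
print(features)
```

A vector like this could be passed to Logistic Regression or XGBoost, whereas the sequence-based models take the ordered history itself as input.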

1.3 Pre-training Data Generation

In this step, the sequence representation method is used to generate patient sequences, which are then fed into the model.
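As a rough illustration of what a patient sequence is, the sketch below orders one patient's events chronologically and emits the concepts as a token list. It shows only the generic idea: the actual CEHR-BERT sequence representation is not described on this page, and the helper `build_sequence` and the event tuples are assumptions made for this example.

```python
from datetime import date


def build_sequence(events):
    """Order (event_date, concept) pairs by date and return the concepts.

    (Hypothetical helper; not the CEHR-BERT representation itself.)
    """
    # sorted() is stable, so same-day events keep their input order.
    ordered = sorted(events, key=lambda event: event[0])
    return [concept for _, concept in ordered]


# Made-up events for a single patient, deliberately out of order.
events = [
    (date(2020, 3, 5), "metformin"),
    (date(2020, 1, 10), "diabetes"),
    (date(2020, 2, 1), "hba1c_test"),
]

sequence = build_sequence(events)
print(sequence)
```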

2. BERT Pre-training

[Figure: BERT pre-training]

3. BERT Fine-tuning

[Figure: BERT fine-tuning (non-hierarchical)]