<sup>1</sup> Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton<br>
<sup>2</sup> School of Primary Care Population Sciences and Medical Education, University of Southampton <br>
<sup>3</sup> Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University <br>
<br>
<sup>*</sup>Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk
### 🖋 How to cite this work
*Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk*
### Citation
> Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing
## 🙌 Introduction
This project generates the medical coding lists that define the cohort phenotypes used for inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria.
The output code list is then used by data providers to select MELD-B cohorts.
## Introduction
This tool automates the verification, translation and organisation of medical coding lists defining cohort phenotypes for inclusion criteria. By processing externally sourced clinical inclusion criteria into actionable code lists, this tool ensures consistent and efficient curation of cohort definitions. These code lists can be subsequently used by data providers (e.g. SAIL) to construct study cohorts.
## 📃 Method
## Methods
### Process
1. Approved MELD-B concepts are defined in a CSV spreadsheet (currently `PHEN_summary_working.csv`).
2. Imported Code Lists in `/src` are verified against all NHS TRUD registered codes
3. Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
### Workflow Overview
1. Approved MELD-B concepts are outlined in a CSV spreadsheet (e.g., `PHEN_summary_working.csv`).
2. Imported code lists in the `/src` directory are validated against NHS TRUD-registered codes.
3. Mappings from imported code lists to outputted MELD-B concepts are defined in the `PHEN_assign_v3.json` file.
- See "JSON Phenotype Mapping" section for more details
4. Process is executed from command line either manually or from bash script `run.sh`
- See "Usage" section for more details
5. Output Concept Code Lists are saved to the `/concepts` git repository and any changes are tracked.
6. Output Concept Code Lists can be exported into SAIL or any other Data Bank
4. The process is executed from the command line. Refer to the "Usage" section for execution instructions.
5. Outputted concept code lists are saved to the `/concepts` Git repository, with all changes tracked.
6. The code lists can be exported to SAIL or any other Data Bank.
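The validation in step 2 can be pictured as a set-membership check of each imported code against the registered codes. The sketch below is illustrative only: `validate_codes` and the sample codes are hypothetical and are not taken from the tool's source.

```python
def validate_codes(imported, registered):
    """Split imported codes into (valid, invalid) by membership in the registered set."""
    registered_set = set(registered)
    valid = [code for code in imported if code in registered_set]
    invalid = [code for code in imported if code not in registered_set]
    return valid, invalid


# Hypothetical sample data, standing in for a registered code table
# and an imported code list.
registered = ["C10..", "G30..", "H33.."]
imported = ["C10..", "ZZZZZ", "H33.."]

valid, invalid = validate_codes(imported, registered)
```

Keeping the rejected codes in a separate list, rather than dropping them silently, makes it easier to report which entries of an imported list failed verification.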
### Medical Coding Standards Supported
| Code Type | Verification | Maps to |
### Supported Medical Coding Standards
The tool supports verification and mapping across various diagnostic coding formats:
@@ -43,53 +44,51 @@ The output code list is then used by data providers to select MELD-B cohorts.
| OPCS4 | NHS TRUD | |
| ATC | None | |
MELD-B refers to various diagnostic code formats included in target datasets.
* Read V2
* Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 (https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9)
* SAIL only supports five-character Read V2 codes
* SNOMED-CT was adopted by the NHS around 2018
* CPRD AURUM uses SNOMED codes and includes mappings to Read codes, but no other database (CPRD Gold, SAIL) does.
* Mappings exist from SNOMED to Read codes, some provided by CPRD and others by NHS TRUD
* ICD-10 codes are used in hospital settings and are important for the HES-linked datasets.
* ATC codes are internationally accepted for the classification of medicines and are maintained by the WHO.
## ⚙️ Setup
### Code Translation Tables
1. Due to the licencing of NHS TRUD resources, you <mark>MUST first [Sign Up](https://isd.digital.nhs.uk/trud/user/guest/filters/0/account/form) to NHS TRUD and accept the following licences</mark>:
2. Once all licences are accepted, get your [API Key](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/account/manage) for NHS TRUD.
3. Finally, run the automated extraction script, supplying your API Key to grant temporary access to the resources above. Use the command `python trud_api.py --key <INSERT KEY>` (replacing `<INSERT KEY>` with your key).
- The conversion tables will be saved as `.parquet` tables in the folder `maps/processed/`.
- NHS TRUD defines one-way mappings and does <b>NOT ADVISE</b> reversing the mappings. If you still wish to reverse these into two-way mappings, duplicate the given `.parquet` table and reverse the filename (e.g. `read2_code_to_snomed_code.parquet` to `snomed_code_to_read2_code.parquet`)
4. Populate the SQLite3 database with OMOP Vocabularies. These can be downloaded from https://athena.ohdsi.org/vocabulary/list.
- Install the following vocabularies by ticking the box:
- 1-SNOMED
- 2-ICD9CM
- 17-Readv2
- 21-ATC
- 55-OPCS4
- 57-HES Specialty
- 70-ICD10CM
- 75-dm+d
- 144-UK Biobank
- 154-NHS Ethnic Category
- 155-NHS Place of Service
- Use the command `python omop_api.py --install <INSERT PATH>` to load the vocabularies into the database (replacing `<INSERT PATH>` with the path to your unzipped download folder).
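The setup steps above note that a one-way NHS TRUD mapping can, against NHS TRUD's advice, be turned into a two-way mapping by duplicating the `.parquet` table under a reversed filename. A minimal sketch of that file operation follows, using only the Python standard library. The helper names `reversed_mapping_name` and `duplicate_reversed` are hypothetical and not part of the tool, and the sketch assumes filenames follow the `<source>_to_<target>.parquet` pattern shown in the example.

```python
from pathlib import Path
import shutil


def reversed_mapping_name(filename):
    """Turn '<source>_to_<target>.parquet' into '<target>_to_<source>.parquet'."""
    path = Path(filename)
    source, sep, target = path.stem.partition("_to_")
    if not sep or not source or not target:
        raise ValueError(f"not a '<source>_to_<target>' filename: {filename}")
    return f"{target}_to_{source}{path.suffix}"


def duplicate_reversed(table_path):
    """Copy a mapping table next to itself under the reversed filename."""
    src = Path(table_path)
    dst = src.with_name(reversed_mapping_name(src.name))
    shutil.copyfile(src, dst)  # contents are copied unchanged
    return dst
```

For example, `duplicate_reversed("maps/processed/read2_code_to_snomed_code.parquet")` would create `maps/processed/snomed_code_to_read2_code.parquet` alongside the original.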
### JSON phenotype mapping
Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
#### Defining the Structure for Folders and Files:
```
2. **Obtain API Key:** Retrieve your API key from [NHS TRUD Account Management](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/account/manage).
3. **Install TRUD:** Download and install NHS TRUD medical code resources.
Execute the script using the command: `python trud_api.py --key <API_KEY>`.
Processed tables will be saved as `.parquet` files in the `maps/processed/` directory.
- *Note: NHS TRUD defines one-way mappings and does <b>NOT ADVISE</b> reversing the mappings. If you still wish to reverse these into two-way mappings, duplicate the given `.parquet` table and reverse the filename (e.g. `read2_code_to_snomed_code.parquet` to `snomed_code_to_read2_code.parquet`)*
4. **Optional: Install OMOP Database:** Download and install OMOP vocabularies from [Athena OHDSI](https://athena.ohdsi.org/vocabulary/list).
- Required vocabularies include:
- 1) SNOMED
- 2) ICD9CM
- 17) Readv2
- 21) ATC
- 55) OPCS4
- 57) HES Specialty
- 70) ICD10CM
- 75) dm+d
- 144) UK Biobank
- 154) NHS Ethnic Category
- 155) NHS Place of Service
- Unzip the downloaded folder and copy its path.
- In certain cases, where you wish to sub-divide a code list table or where a column features multiple code types, additional processing is required. Add an `actions` object inside the `file` object.
### Additional preprocessing (if required):
In certain cases, where you wish to sub-divide a code list table or where a column features multiple code types, additional processing is required. Add an `actions` object inside the `file` object.
- Table with a sub-categorical column:
- To sub-divide a table by a categorical column, use the "divide_col" action.
- e.g. ``` "actions":{"divide_col": "MMCode"}```
- Table with multiple code types in single column:
- The column must be split into multiple columns so that each column holds only one code type.
- The "split_col" attribute is the categorical column indicating the code type in that row. The <b>category names should replace column</b> names in the `columns` properties.
- The "codes_col" attribute is the code column that holds multiple code types.
- e.g.
```
"actions":{
"split_col":"coding_system",
"codes_col":"code"
},
"columns":{
"read2_code":"Read codes v2",
"med_code":"Med codes",
"icd10_code":"ICD10 codes",
"metadata":["description"]
},
```
#### Table with a sub-categorical column:
To sub-divide a table by a categorical column, use the "divide_col" action.
```json
"actions":{
"divide_col":"MMCode"
}
```
#### Table with multiple code types in single column:
The column must be split into multiple columns so that each column holds only one code type.
- The "split_col" attribute is the categorical column indicating the code type in that row. The <b>category names should replace column</b> names in the `columns` properties.
- The "codes_col" attribute is the code column that holds multiple code types.
```json
"actions":{
"split_col":"coding_system",
"codes_col":"code"
},
"columns":{
"read2_code":"Read codes v2",
"med_code":"Med codes",
"icd10_code":"ICD10 codes",
"metadata":["description"]
},
```
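The split described above can be sketched as moving each code into a column named after its category, which is why the category names ("Read codes v2", "ICD10 codes") appear as column names in the `columns` properties. This is a minimal illustration under stated assumptions: `split_code_column`, the sample rows and their descriptions are hypothetical and do not come from the tool's source.

```python
def split_code_column(rows, split_col, codes_col):
    """Give each code type its own column, named after the category in split_col."""
    result = []
    for row in rows:
        # Keep every other field (e.g. metadata) unchanged.
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        # The category value becomes the column name for this row's code.
        new_row[row[split_col]] = row[codes_col]
        result.append(new_row)
    return result


# Hypothetical rows with mixed code types in a single "code" column.
rows = [
    {"coding_system": "Read codes v2", "code": "C10..", "description": "Diabetes mellitus"},
    {"coding_system": "ICD10 codes", "code": "E11", "description": "Type 2 diabetes mellitus"},
]

split_rows = split_code_column(rows, "coding_system", "code")
```

After the split, no row contains the original `coding_system` or `code` fields, and each code sits under a column holding a single code type.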
* <b>Large code lists</b> with numerous phenotypes (e.g. Ho et al.) require a large amount of JSON to be generated. See the "Ho generate JSON" section in `process_codes_WP.ipynb` for example code to generate it.
*<b>Large code lists</b> with numerous phenotypes (e.g. Ho et al.) require a large amount of JSON to be generated. See the "Ho generate JSON" section in `process_codes_WP.ipynb` for example code to generate it.*
## ⚡ Usage
## Usage
The script preprocesses code lists and maps them to a given concept/phenotype.
@@ -180,12 +180,11 @@ git tag -a v1.0.0 -m "added features ..."
git push
```
## 🏦 Funding
This project has received funding from the National Institute for Health and Care Research under grant agreement NIHR203988.
## Acknowledgements
This project was developed in the context of the [MELD-B](https://www.southampton.ac.uk/publicpolicy/support-for-policymakers/policy-projects/Current%20projects/meld-b.page) project, which is funded by the UK [National Institute for Health and Care Research](https://www.nihr.ac.uk/) under grant agreement NIHR203988.