

A Tool for Automating the Curation of Medical Concepts derived from Coding Lists
Jakub J. Dylag 1, Roberta Chiovoloni 3, Ashley Akbari 3, Simon D. Fraser 2, Michael J. Boniface 1
1 Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton
2 School of Primary Care Population Sciences and Medical Education, University of Southampton
3 Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University
*Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk
How to cite this work
Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing
Introduction
This project generates the medical coding lists that define the cohort phenotypes used as inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria.
The output code list is then used by data providers to select MELD-B cohorts.
Method
Process
- Approved MELD-B concepts are defined in a CSV spreadsheet (currently PHEN_summary_working.csv).
- Imported Code Lists in `/src` are verified against all NHS TRUD registered codes
- Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
  - See "JSON Phenotype Mapping" section for more details
- Process is executed from command line either manually or from bash script `run.sh`
  - See "Usage" section for more details
- Output Concept Code Lists are saved to the `/concepts` git repository and any changes are tracked.
- Output Concept Code Lists can be exported into SAIL or any other Data Bank
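The verification step in the process above can be pictured as a set-membership check. The following is a minimal sketch rather than the project's implementation: `verify_codes` and the sample codes are hypothetical, and it assumes the registered codes for one coding standard have already been loaded into a Python set.

```python
def verify_codes(imported, registered):
    """Split imported codes into (valid, invalid) against a registered set.

    Hypothetical helper for illustration; the order of `imported` is preserved.
    """
    valid, invalid = [], []
    for code in imported:
        # Strip surrounding whitespace so spreadsheet padding does not
        # cause a genuine code to be rejected.
        cleaned = str(code).strip()
        (valid if cleaned in registered else invalid).append(cleaned)
    return valid, invalid


# Made-up placeholder codes, not real clinical codes.
registered = {"AAA01", "AAA02", "BBB10"}
valid, invalid = verify_codes(["AAA01", " BBB10 ", "ZZZ99"], registered)
print(valid)    # ['AAA01', 'BBB10']
print(invalid)  # ['ZZZ99']
```

Keeping the rejected codes in a separate list, instead of silently dropping them, is one way to feed an error log such as the one described in the "Usage" section.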
Medical Coding Standards Supported
Code Type | Verification | Maps to |
---|---|---|
Readv2 | NHS TRUD | Readv3, SNOMED, ICD10, OPCS4, ATC |
Readv3 (CTV3) | NHS TRUD | Readv3, SNOMED, ICD10, OPCS4 |
ICD10 | NHS TRUD | |
SNOMED | NHS TRUD | |
OPCS4 | NHS TRUD | |
ATC | None | |
MED | None | |
CPRD Product | None | |
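For readers who prefer to see the table as data, the sketch below transcribes it into a Python dictionary. The short keys (`read2`, `read3`, and so on) and the helper `can_translate` are assumptions made for illustration; they are not names taken from the tool.

```python
# Translation targets per source code type, transcribed from the table above.
TRANSLATIONS = {
    "read2": ["read3", "snomed", "icd10", "opcs4", "atc"],
    "read3": ["read3", "snomed", "icd10", "opcs4"],
    "icd10": [],
    "snomed": [],
    "opcs4": [],
    "atc": [],
    "med": [],
    "cprd_product": [],
}

# Code types that the table lists as verified against NHS TRUD.
VERIFIED_BY_TRUD = {"read2", "read3", "icd10", "snomed", "opcs4"}


def can_translate(source, target):
    """Return True if the table lists `target` as a mapping for `source`."""
    return target in TRANSLATIONS.get(source, [])


print(can_translate("read2", "atc"))    # True
print(can_translate("icd10", "read2"))  # False
```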
MELD-B refers to various diagnostic code formats included in target datasets.
- Read V2
  - Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018: https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
  - SAIL only supports five-character Read V2 codes
- SNOMED-CT was adopted by the NHS around 2018
  - CPRD AURUM uses SNOMED codes and includes mappings to Read codes, but no other database (CPRD Gold, SAIL) does.
  - Mappings exist from SNOMED to Read codes, some provided by CPRD and others by NHS TRUD
- ICD-10 codes are used in hospital settings and are important for the HES-linked datasets.
- ATC codes are internationally accepted for the classification of medicines and are maintained by the WHO.
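Because SAIL only supports five-character Read V2 codes, a code list may need normalising before export. The sketch below is one possible approach under stated assumptions, not the tool's actual behaviour: it assumes a Read V2 code is five characters (letters, digits, or `.` padding), optionally followed by a two-character term code that can be dropped. The function name `to_five_char_read2` is hypothetical and the sample values are illustrative only.

```python
import re

# Assumed shape: five characters, optionally followed by a two-character
# term code. This is a simplification for illustration.
READ2_PATTERN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9.]{4})([A-Za-z0-9]{2})?$")


def to_five_char_read2(code):
    """Return the five-character form of a Read V2 code, or None if malformed."""
    match = READ2_PATTERN.match(str(code).strip())
    return match.group(1) if match else None


print(to_five_char_read2("C10E.00"))  # C10E.
print(to_five_char_read2("C10E."))    # C10E.
print(to_five_char_read2("C10"))      # None
```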
Setup
- Delete corrupted files that cannot be read with `bash import.sh`
Code Translation Tables
- Due to the licensing of NHS TRUD coding tables, the following resources must be downloaded separately:
- Next, prepare the conversion tables by saving them as `.parquet` tables.
  - See "Mappings" section in process_codes_WP.ipynb to generate table with appropriate name
  - For reversible conversions create a duplicate table with the name reversed. However be aware this is NOT ADVISED and goes against NHS guidance.
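One plausible reason for the warning about reversed tables is that a forward mapping can be many-to-one, so the reversed table becomes one-to-many and a single code expands into several candidates. The sketch below illustrates this with made-up codes. `translate` and `reverse_table` are hypothetical helpers working on in-memory pairs, not the tool's parquet-backed implementation.

```python
from collections import defaultdict


def translate(codes, table):
    """Translate codes using (source, target) pairs.

    Returns (translated, unmapped); a source with several targets
    contributes all of them.
    """
    lookup = defaultdict(list)
    for source, target in table:
        lookup[source].append(target)
    translated, unmapped = [], []
    for code in codes:
        if code in lookup:
            translated.extend(lookup[code])
        else:
            unmapped.append(code)
    return translated, unmapped


def reverse_table(table):
    """Swap source and target in every pair."""
    return [(target, source) for source, target in table]


# Made-up codes: two source codes share one target.
forward = [("A1", "X"), ("A2", "X"), ("B1", "Y")]
print(translate(["A1", "B1", "C9"], forward))     # (['X', 'Y'], ['C9'])
print(translate(["X"], reverse_table(forward)))   # (['A1', 'A2'], [])
```

In the reversed direction the single code `X` yields both `A1` and `A2`, which may broaden a concept beyond what was clinically approved.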
JSON phenotype mapping
Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
Defining the Structure for Folders and Files:
"folder":"codes/Medication code source",
"description":"Medication Codes - downloaded 15/12/23",
"files": [
{
"file":"WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx"
}
]
Define Column Code Types
"columns":{
"read2_code":"READCODE",
"metadata":["DESCRIPTION"]
},
Define Concepts to be mapped to
"meldb_phenotypes": ["ALL_MEDICATIONS"]
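To show how the three fragments above might fit together, the sketch below assembles them into one entry and prints it as JSON. The nesting is an assumption: it places `columns` and `meldb_phenotypes` inside each `file` object, which the fragments themselves do not confirm. Check `PHEN_assign_v3.json` for the authoritative layout.

```python
import json

# Assumed nesting: per-file settings live inside each object in "files".
entry = {
    "folder": "codes/Medication code source",
    "description": "Medication Codes - downloaded 15/12/23",
    "files": [
        {
            "file": "WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx",
            "columns": {
                "read2_code": "READCODE",
                "metadata": ["DESCRIPTION"],
            },
            "meldb_phenotypes": ["ALL_MEDICATIONS"],
        }
    ],
}

# Serialise with indentation so the structure is easy to review in a diff.
text = json.dumps(entry, indent=2)
print(text)
```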
Actions: Additional preprocessing (if required):
- In certain cases where you wish to sub-divide a code list table, or where a column features multiple code types, additional processing is required. Add an `actions` object inside of the `file` object.
- Table with a sub-categorical column:
  - In order to sub-divide a table by a categorical column use the "divide_col" action
  - e.g. `"actions":{"divide_col": "MMCode"}`
- Table with multiple code types in single column:
  - Need to split column into multiple columns, so only one code type per column.
  - The "split_col" attribute is the categorical column indicating the code type in that row. The category names should replace column names in the `columns` properties.
  - The "codes_col" attribute is the code column with multiple code types in a single column
  - e.g. `"actions":{ "split_col":"coding_system", "codes_col":"code" }, "columns":{ "read2_code":"Read codes v2", "med_code":"Med codes", "icd10_code":"ICD10 codes", "metadata":["description"] },`
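The effect of the two actions can be sketched on a small table held as a list of dictionaries. This is an interpretation of the descriptions above, not the tool's code: `apply_divide_col` and `apply_split_col` are hypothetical names, and the sample rows use placeholder codes.

```python
from collections import defaultdict


def apply_divide_col(rows, column):
    """Group rows into sub-tables keyed by the value found in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def apply_split_col(rows, split_col, codes_col):
    """Move each code into a column named after its code-type category."""
    out = []
    for row in rows:
        # Keep every column except the two being restructured.
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        new_row[row[split_col]] = row[codes_col]
        out.append(new_row)
    return out


rows = [
    {"coding_system": "Read codes v2", "code": "AAA01", "description": "first"},
    {"coding_system": "ICD10 codes", "code": "B00", "description": "second"},
]
split_rows = apply_split_col(rows, "coding_system", "code")
print(split_rows)
# [{'description': 'first', 'Read codes v2': 'AAA01'}, {'description': 'second', 'ICD10 codes': 'B00'}]
print(sorted(apply_divide_col(rows, "coding_system")))
# ['ICD10 codes', 'Read codes v2']
```

After the split, the new column names (`Read codes v2`, `ICD10 codes`) are the category names, which is why they appear as values in the `columns` properties of the example above.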
*Large code lists with numerous phenotypes (e.g. Ho et al) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code that generates it.
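As a rough idea of what generating the JSON programmatically can look like, the sketch below builds one file object per phenotype from a list of pairs. It is not the notebook's code: `build_files`, the file names, and the phenotype names are all hypothetical, and the nesting of `columns` and `meldb_phenotypes` inside each file object is an assumption.

```python
import json


def build_files(phenotype_files, columns):
    """Create one file object per (filename, phenotype) pair."""
    return [
        {
            "file": filename,
            # Copy so that later edits to one entry do not affect the others.
            "columns": dict(columns),
            "meldb_phenotypes": [phenotype],
        }
        for filename, phenotype in phenotype_files
    ]


# Hypothetical file and phenotype names, for illustration only.
pairs = [("asthma.csv", "ASTHMA"), ("copd.csv", "COPD")]
folder = {
    "folder": "codes/example source",
    "description": "Example only",
    "files": build_files(pairs, {"read2_code": "code", "metadata": ["description"]}),
}
print(json.dumps(folder, indent=2))
```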
Usage
The script preprocesses code lists and maps them to the given concept/phenotype.
Execution (Bash Script)
bash ./run.sh
Execution (Shell Command)
usage: python main.py [-h] [-r2] [-r3] [-i] [-s] [-o] [-a] [-m] [-c] [--no-translate] [--no-verify] [--output] [--error-log] mapping_file
positional arguments:
- `mapping_file` Concept/Phenotype Assignment File (json)
optional arguments:
- `-r2`, `--read2-code` Read V2 Codes Column name in Source File
- `-r3`, `--read3-code` Read V3 Codes Column name in Source File
- `-i`, `--icd10-code` ICD10 Codes Column name in Source File
- `-s`, `--snomed-code` SNOMED Codes Column name in Source File
- `-o`, `--opcs4-code` OPCS4 Codes Column name in Source File
- `-a`, `--atc-code` ATC Codes Column name in Source File
- `-m`, `--med-code` Med Codes Column name in Source File
- `-c`, `--cprd-code` CPRD Product Codes Column name in Source File
- `--no-translate` Do not translate code types
- `--no-verify` Do not verify codes are correct
- `--output` Filepath to save output to
- `--error-log` Filepath to save error log to
EXAMPLE:
python main.py PHEN_assign_v3.json -r2 --output output/MELD_concepts_readv2.csv --error-log output/MELD_errors.csv
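The command-line interface above can be approximated with `argparse`. This is a sketch reconstructed from the usage text, not a copy of `main.py`. It assumes the code-type options are boolean switches (as the usage line and the example suggest) and that `--output` and `--error-log` each take a file path (as the example suggests). The function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("mapping_file",
                        help="Concept/Phenotype Assignment File (json)")
    code_flags = [
        ("-r2", "--read2-code", "Read V2"),
        ("-r3", "--read3-code", "Read V3"),
        ("-i", "--icd10-code", "ICD10"),
        ("-s", "--snomed-code", "SNOMED"),
        ("-o", "--opcs4-code", "OPCS4"),
        ("-a", "--atc-code", "ATC"),
        ("-m", "--med-code", "Med"),
        ("-c", "--cprd-code", "CPRD Product"),
    ]
    for short, long_name, label in code_flags:
        # Assumption: each code-type option is an on/off switch.
        parser.add_argument(short, long_name, action="store_true",
                            help=f"{label} Codes Column name in Source File")
    parser.add_argument("--no-translate", action="store_true",
                        help="Do not translate code types")
    parser.add_argument("--no-verify", action="store_true",
                        help="Do not verify codes are correct")
    parser.add_argument("--output", help="Filepath to save output to")
    parser.add_argument("--error-log", help="Filepath to save error log to")
    return parser


# Parse the documented example invocation.
args = build_parser().parse_args([
    "PHEN_assign_v3.json", "-r2",
    "--output", "output/MELD_concepts_readv2.csv",
    "--error-log", "output/MELD_errors.csv",
])
print(args.read2_code, args.output)  # True output/MELD_concepts_readv2.csv
```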
Contributing
Commit to GitLab
git add .
git commit -m "my message ..."
git tag -a v1.0.0 -m "added features ..."
git push
Funding
This project has received funding from the National Institute of Health Research under grant agreement NIHR203988.
License
This work is licensed under the Apache License, Version 2.0.