<sup>*</sup>Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk
### 🖋 How to cite this work
> Dylag J. J., Chiovoloni R., Akbari A., Fraser S. D., Boniface M. J., A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. May 2024. https://git.soton.ac.uk/meld/meldb/concepts-processing
> Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meld/meldb/concepts-processing
## 🙌 Introduction
This project generates the medical coding lists that define the cohort phenotypes used as inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria.
...
...
@@ -24,18 +24,14 @@ The output code list is then used by data providers to select MELD-B cohorts.
## 📃 Method
### Process
1. MELD-B conditions are defined in an Excel spreadsheet (currently CONC_summary_working.xlsx).
2. Each sheet in the file includes a mapping from a MELD-B condition to a source MLTC condition that is then associated with a diagnostic code list
3. Each sheet is processed to create a master code list
* READ_CODE: full 7-character Read code ID
* CPRD_GOLD_MEDICAL_CODE_ID: CPRD GOLD medical code id
* CPRD_AURUM_MEDICAL_CODE_ID: CPRD AURUM medical code id
* DESCRIPTION: diagnosis description
* MELDB_CONDITION: MELD-B multimorbidity label
* DATABASE: list of databases mapped
* SOURCE: list of sources mapped
* any other meta data columns (including descriptions etc)
1. Approved MELD-B concepts are defined in an Excel spreadsheet (currently CONC_summary_working.xlsx).
2. Imported Code Lists in `/codes` are verified against all NHS TRUD registered codes
3. Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
- See "JSON Phenotype Mapping" section for more details
4. The process is executed from the command line, either manually or from the bash script `run.sh`
- See "Usage" section for more details
5. Output Concept Code Lists are saved to the `/concepts` git repository and any changes are tracked.
6. Output Concept Code Lists can be exported into SAIL or any other Data Bank
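The verification in step 2 can be pictured with a minimal sketch. It assumes the registered codes are already available as an in-memory set. The function name `verify_codes` and the sample codes are hypothetical placeholders and are not taken from the project or from NHS TRUD.

```python
def verify_codes(imported, registered):
    """Split imported codes into those present in the registry and those absent."""
    valid, invalid = [], []
    for code in imported:
        cleaned = code.strip()  # tolerate stray whitespace from spreadsheets
        if cleaned in registered:
            valid.append(cleaned)
        else:
            invalid.append(cleaned)
    return valid, invalid


# Hypothetical placeholder codes, not real TRUD content.
registered = {"A1", "B2", "C3"}
valid, invalid = verify_codes(["A1 ", "Z9", "C3"], registered)
print("valid:", valid)
print("invalid:", invalid)
```

Keeping the rejected codes in a separate list, rather than silently dropping them, makes it easier to review why an imported code list did not fully match.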
### Medical Coding Standards Supported
| Code Type | Verification | Maps to |
...
...
@@ -51,11 +47,11 @@ The output code list is then used by data providers to select MELD-B cohorts.
MELD-B refers to various diagnostic code formats included in target datasets.
* Read V2
* Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
* SAIL only supports five character read codes V2
* SNOMED-CT was adopted by the NHS around 2018
* CPRD AURUM uses SNOMED codes and includes a mapping to Read codes, but no other database (CPRD Gold, SAIL) does.
* Mappings exist from SNOMED to Read codes, some provided by CPRD and others by NHS TRUD
* ICD-10 codes are used in hospital settings and are important for the HES linked datasets.
* ATC codes are internationally accepted for the classification of medicines and are maintained by the WHO.
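Since SAIL only supports five-character Read V2 codes while other sources carry a seven-character form, a conversion step is implied. The sketch below assumes the seven-character form is a five-character Read code followed by a two-character term identifier, so truncation is sufficient. The function name `to_five_char_read` is hypothetical and the sample codes are for illustration only.

```python
def to_five_char_read(code: str) -> str:
    """Return the five-character Read V2 code for a five- or seven-character input."""
    code = code.strip()
    if len(code) == 5:
        return code
    if len(code) == 7:
        # Assumption: the last two characters are a term identifier.
        return code[:5]
    raise ValueError(f"unexpected Read code length: {code!r}")


# Truncation can produce duplicates, so de-duplicate while keeping order.
codes = ["C10E.00", "C10E.11", "G30.."]
five_char = list(dict.fromkeys(to_five_char_read(c) for c in codes))
print(five_char)
```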
...
...
@@ -65,12 +61,12 @@ MELD-B refers to various diagnostic code formats included in target datasets.
MELD-B has defined a set of phenotypes for MLTC conditions that are considered burdensome. Each phenotype includes one or more diagnoses.
* Ho et al - https://cronfa.swan.ac.uk/Record/cronfa60877/Download/60877__25107__3700915cf20e418aae714e5639722449.pdf
* The diagnostic codes have been mapped to SAIL by the THINKINGGroup https://github.com/THINKINGGroup/phenotypes which has been replicated by the RSF https://github.com/aim-rsf/phenotypes
* Azcoaga-Lorenzo, A., Akbari, A., Davies, J., Khunti, K., Kadam, U.T., Lyons, R., McCowan, C., Mercer, S.W., Nirantharakumar, K., Staniszewska, S. and Guthrie, B., 2022. Measuring multimorbidity in research: Delphi consensus study. BMJ medicine, 1(1), p.e000247.
* Hanlon et al - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8901063/#pmed.1003931.s005
* Hanlon, P., Jani, B.D., Nicholl, B., Lewsey, J., McAllister, D.A. and Mair, F.S., 2022. Associations between multimorbidity and adverse health outcomes in UK Biobank and the SAIL Databank: A comparison of longitudinal cohort studies. PLoS Medicine, 19(3), p.e1003931.
- ClinicalCodes Project, University of Manchester - https://clinicalcodes.rss.mhs.man.ac.uk/
...
...
@@ -80,7 +76,7 @@ MELD-B has defined a set of phenotypes for MLTC conditions that are considered b
- Gilbert et al (for Frailty Secondary Care) - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946808/
- Gilbert T, Neuburger J, Kraindler J, Keeble E, Smith P, Ariti C, Arora S, Street A, Parker S, Roberts HC, Bardsley M, Conroy S. Development and validation of a Hospital Frailty Risk Score focusing on older people in acute care settings using electronic hospital records: an observational study. Lancet. 2018 May 5;391(10132):1775-1782. doi: 10.1016/S0140-6736(18)30668-8. Epub 2018 Apr 26. PMID: 29706364; PMCID: PMC5946808.
- Abbasizanjani et al (for Long Covid) https://www.adruk.org/news-publications/publications-reports/data-insight-clinical-coding-and-capture-of-long-covid-a-cohort-study-in-wales-using-linked-health-and-demographic-data-824/
- Abbasizanjani et al (for Long Covid) https://www.adruk.org/news-publications/publications-reports/data-insight-clinical-coding-and-capture-of-long-covid-a-cohort-study-in-wales-using-linked-health-and-demographic-data-824
- Abbasizanjani H, Bedston S, Robinson L, Curds M, Akbari A. Clinical coding and capture of Long COVID: a cohort study in Wales using linked health and demographic data. ADR Wales Data Insight. August 2023. https://adrwales.org/wp-content/uploads/2023/08/Clinical-coding-and-capture-of-Long-COVID.pdf
...
...
@@ -112,29 +108,76 @@ MELD-B uses drug codes as a proxy indicator of burden. This codes are derived fr
## ⚙️ Setup
1. Delete corrupted files: `bash import.sh`
- Delete corrupted files that cannot be read with `bash import.sh`
### Code Translation Tables
1. Due to the licensing of NHS TRUD coding tables, the following resources <mark>must be downloaded separately</mark>:
Due to the licensing of NHS TRUD coding tables, the following
2. Update Conversion Tables:
2. Next, prepare the conversion tables by saving them as `.parquet` tables.
- See "Mappings" section in process_codes_WP.ipynb to generate table with appropriate name
- For reversible conversions, create a duplicate table with the name reversed. However, be aware that this is <b>NOT ADVISED</b> and goes against NHS guidance.
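One general reason for caution with reversed tables can be sketched as follows. This illustrates a property of many-to-one mappings in general and is not a statement of the NHS rationale. The function name `reverse_mapping` and the codes `S1`, `T1` and so on are hypothetical placeholders.

```python
from collections import defaultdict


def reverse_mapping(pairs):
    """Invert (source, target) pairs into target -> list of sources."""
    reversed_map = defaultdict(list)
    for source, target in pairs:
        reversed_map[target].append(source)
    return dict(reversed_map)


# Hypothetical forward table: two source codes map to the same target code.
forward = [("S1", "T1"), ("S2", "T1"), ("S3", "T2")]
reverse = reverse_mapping(forward)

# Targets with more than one source cannot be reversed unambiguously.
ambiguous = {t: s for t, s in reverse.items() if len(s) > 1}
print(ambiguous)
```

A check like this could be run before creating a reversed table to see how many codes would become ambiguous.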
### JSON phenotype mapping
3. Update JSON Codes List
- Manually Edit the PHEN_asssign_v2.json
- Use "Ho generate JSON" section in process_codes_WP.ipynb to generate JSON for Ho
- Cases which require additional preprocessing
<!-- - Large Table with sub-categorical column
- Need to split table by categorical column
- Then read each categorical file individually
- USE "divide_col" action
- Table with multiple code types in single column
- Need to split column into multiple columns, so only one code type per column
- USE "split_col" action -->
Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
#### Defining the Structure for Folders and Files:
- In certain cases, where you wish to sub-divide a code list table or where a column features multiple code types, additional processing is required. Add an `actions` object inside of the `file` object.
- Table with a sub-categorical column:
- In order to sub-divide a table by a categorical column use the "divide_col" action
- e.g. ``` "actions":{"divide_col": "MMCode"}```
- Table with multiple code types in single column:
- Need to split column into multiple columns, so only one code type per column.
- The "split_col" attribute is the categorical column indicating the code type in that row. The <b>category names should replace column</b> names in the `columns` properties.
- The "codes_col" attribute is the code column with multiple code types in a single column
- e.g.
```
"actions":{
    "split_col":"coding_system",
    "codes_col":"code"
},
"columns":{
    "read2_code":"Read codes v2",
    "med_code":"Med codes",
    "icd10_code":"ICD10 codes",
    "metadata":["description"]
},
```
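The effect of the two actions can be sketched on plain Python rows. This is only an illustration of the behaviour described above, under the assumption that a table is a list of dictionaries. The helper names `divide_by_column` and `split_code_column` and the sample rows are hypothetical, and the project's own implementation may differ.

```python
from collections import defaultdict


def divide_by_column(rows, divide_col):
    """'divide_col': one sub-table per distinct value of the categorical column."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[divide_col]].append(row)
    return dict(tables)


def split_code_column(rows, split_col, codes_col):
    """'split_col'/'codes_col': move each code into a column named after its type."""
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        new_row[row[split_col]] = row[codes_col]  # category name becomes the column name
        result.append(new_row)
    return result


# Hypothetical rows for illustration only.
rows = [
    {"coding_system": "Read codes v2", "code": "C10E.", "description": "example A"},
    {"coding_system": "ICD10 codes", "code": "E10", "description": "example B"},
]
wide = split_code_column(rows, "coding_system", "code")
groups = divide_by_column(
    [{"MMCode": "1", "code": "a"}, {"MMCode": "2", "code": "b"}, {"MMCode": "1", "code": "c"}],
    "MMCode",
)
print(wide)
print(sorted(groups))
```

After the split, the new column names are the category names ("Read codes v2", "ICD10 codes"), which is why those names appear as the values in the `columns` properties.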
- <b>Large code lists</b> with numerous phenotypes (e.g. Ho et al) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code to generate