From b66f3d89edeff4d1655943134a074752c61e69df Mon Sep 17 00:00:00 2001
From: Jakub Dylag <jjd1c23@soton.ac.uk>
Date: Thu, 18 Apr 2024 23:10:33 +0100
Subject: [PATCH] Updated README

---
 README.md | 121 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 82 insertions(+), 39 deletions(-)

diff --git a/README.md b/README.md
index 0abd9fd..7886583 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 <sup>*</sup>Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk
 ### 🖋 How to cite this work
-> Dylag J. J., Chiovoloni R., Akbari A., Fraser S. D., Boniface M. J., A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. May 2024. https://git.soton.ac.uk/meld/meldb/concepts-processing
+> Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meld/meldb/concepts-processing
 ## 🙌 Introduction
 This project generate the medical coding lists that defines cohort phenotypes used for inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria.
@@ -24,18 +24,14 @@ The output code list is then used by data providers to select MELD-B cohorts.
 ## 📃 Method
 ### Process
-1. MELB-B conditions are defined in a Excel spreadsheet (currently CONC_summary_working.xlsx).
-2.
-2. Each sheet in the file includes a mapping from a MELD-B condition to source MLTC condition that is then associated with a diagnostic code list
-3. Each sheet is processed to create a master code list
-    * READ_CODE: full 7 character read code id
-    * CPRD_GOLD_MEDICAL_CODE_ID: CPRD GOLD medical code id
-    * CPRD_AURUM_MEDICAL_CODE_ID: CPRD AURUM medical code id
-    * DESCRIPTION: diagnosis description
-    * MELDB_CONDITION: meld b multimorbidity label
-    * DATABASE: list of databases mapped
-    * SOURCE: list of sources mapped
-    * any other meta data columns (including descriptions etc)
+1. Approved MELD-B concepts are defined in an Excel spreadsheet (currently CONC_summary_working.xlsx).
+2. Imported Code Lists in `/codes` are verified against all NHS TRUD registered codes
+3. Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
+    - See "JSON Phenotype Mapping" section for more details
+4. Process is executed from command line either manually or from bash script `run.sh`
+    - See "Usage" section for more details
+5. Output Concept Code Lists are saved to the `/concepts` git repository and any changes are tracked.
+6. Output Concept Code Lists can be exported into SAIL or any other Data Bank
 ### Medical Coding Standards Supported
 | Code Type | Verification | Maps to |
@@ -51,11 +47,11 @@ The output code list is then used by data providers to select MELD-B cohorts.
 MELD-B refers to various diagnostic code formats included in target datasets.
 * Read V2
-    * Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
-    * SAIL only supports five character read codes V2
+    * Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
+    * SAIL only supports five character read codes V2
 * SNOMED-CT was adopted by the NHS around 2018
-    * CPRD AURUM uses SNOWMED codes and include mapping to read codes but no other database (CPRD Gold, SAIL) does.
-    * Mappings exist from SNOWMED to Read codes, some provided by CPRD and others NHS Trud
+    * CPRD AURUM uses SNOMED codes and includes mapping to read codes but no other database (CPRD Gold, SAIL) does.
+    * Mappings exist from SNOMED to Read codes, some provided by CPRD and others by NHS TRUD
 * ICD-10 are codes used in hospital settings and are importnat for the HES linked datasets.
 * ATC codes are interntionally accepted for the classification of medicinces and maintained by the WHO.
@@ -65,12 +61,12 @@ MELD-B refers to various diagnostic code formats included in target datasets.
 MELD-B has defined a set of phenotypes for MLTC conditions that are considered burdensome. Each Phenotype includes one or more diagnosis.
 * Ho et al
     - https://cronfa.swan.ac.uk/Record/cronfa60877/Download/60877__25107__3700915cf20e418aae714e5639722449.pdf
-    * The diagnositc codes have been mapped by SAIL to by the ThinkingGroup https://github.com/THINKINGGroup/phenotypes which has been replicated by the RSF https://github.com/aim-rsf/phenotypes
-    * Azcoaga-Lorenzo, A., Akbari, A., Davies, J., Khunti, K., Kadam, U.T., Lyons, R., McCowan, C., Mercer, S.W., Nirantharakumar, K., Staniszewska, S. and Guthrie, B., 2022. Measuring multimorbidity in research: Delphi consensus study. BMJ medicine, 1(1), p.e000247.
+    * The diagnostic codes have been mapped by SAIL to by the ThinkingGroup https://github.com/THINKINGGroup/phenotypes which has been replicated by the RSF https://github.com/aim-rsf/phenotypes
+    * Azcoaga-Lorenzo, A., Akbari, A., Davies, J., Khunti, K., Kadam, U.T., Lyons, R., McCowan, C., Mercer, S.W., Nirantharakumar, K., Staniszewska, S. and Guthrie, B., 2022. Measuring multimorbidity in research: Delphi consensus study. BMJ medicine, 1(1), p.e000247.
 * Hanlon et al
     - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8901063/#pmed.1003931.s005
-    * https://github.com/dmcalli2/dynamic_protocols/blob/master/defining_comorbidities_SAIL.md??
-    * Hanlon, P., Jani, B.D., Nicholl, B., Lewsey, J., McAllister, D.A. and Mair, F.S., 2022. Associations between multimorbidity and adverse health outcomes in UK Biobank and the SAIL Databank: A comparison of longitudinal cohort studies. PLoS Medicine, 19(3), p.e1003931.
+    * https://github.com/dmcalli2/dynamic_protocols/blob/master/defining_comorbidities_SAIL.md??
+    * Hanlon, P., Jani, B.D., Nicholl, B., Lewsey, J., McAllister, D.A. and Mair, F.S., 2022. Associations between multimorbidity and adverse health outcomes in UK Biobank and the SAIL Databank: A comparison of longitudinal cohort studies. PLoS Medicine, 19(3), p.e1003931.
 - ClinicalCodes Project, University of Manchester
     - https://clinicalcodes.rss.mhs.man.ac.uk/
@@ -80,7 +76,7 @@ MELD-B has defined a set of phenotypes for MLTC conditions that are considered b
 - Gilbert et al (for Frailty Secondary Care)
     - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946808/
     - Gilbert T, Neuburger J, Kraindler J, Keeble E, Smith P, Ariti C, Arora S, Street A, Parker S, Roberts HC, Bardsley M, Conroy S. Development and validation of a Hospital Frailty Risk Score focusing on older people in acute care settings using electronic hospital records: an observational study. Lancet. 2018 May 5;391(10132):1775-1782. doi: 10.1016/S0140-6736(18)30668-8. Epub 2018 Apr 26. PMID: 29706364; PMCID: PMC5946808.
-- Abbasizanjani et al (for Long Covid) https://www.adruk.org/news-publications/publications-reports/data-insight-clinical-coding-and-capture-of-long-covid-a-cohort-study-in-wales-using-linked-health-and-demographic-data-824/
+- Abbasizanjani et al (for Long Covid) https://www.adruk.org/news-publications/publications-reports/data-insight-clinical-coding-and-capture-of-long-covid-a-cohort-study-in-wales-using-linked-health-and-demographic-data-824
     - Abbasizanjani H, Bedston S, Robinson L, Curds M, Akbari A. Clinical coding and capture of Long COVID: a cohort study in Wales using linked health and demographic data. ADR Wales Data Insight. August 2023. https://adrwales.org/wp-content/uploads/2023/08/Clinical-coding-and-capture-of-Long-COVID.pdf
@@ -112,29 +108,76 @@ MELD-B uses drug codes as a proxy indicator of burden. This codes are derived fr
 ## ⚙️ Setup
-1. Delete corrupted files: `bash import.sh`
+- Delete corrupted files that cannot be read by running `bash import.sh`
-### Code Translationg Tables
+### Code Translation Tables
+1. Due to the licensing of NHS TRUD coding tables, the following resources <mark>must be downloaded separately</mark>:
+    - [nhs_readbrowser_25.0.0_20180401000001](https://isd.digital.nhs.uk/trud/users/guest/filters/2/categories/9/items/8/releases)
+    - [nhs_datamigration_29.0.0_20200401000001](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/9/items/9/releases)
+    - [ICD10_Edition5_XML_20160401](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/28/items/258/releases?source=summary)
+    - [OPCS-4.10 Data files](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/10/items/119/releases)
+    - [BNF/Snomed Mapping data.xlsx](https://www.nhsbsa.nhs.uk/prescription-data/understanding-our-data/bnf-snomed-mapping)
-Due to the licencing of NHS TRUD coding tables, the following
-
-2. Update Convertion Tables:
+2. Next, prepare the conversion tables by saving them as `.parquet` tables.
     - See "Mappings" section in process_codes_WP.ipynb to generate table with appropriate name
-    - For reversible convertions create a duplicate table with the name reversed
+    - For reversible conversions, create a duplicate table with the name reversed. However, be aware that this is <b>NOT ADVISED</b> and goes against NHS guidance.
 ### JSON phenotype mapping
-3. Update JSON Codes List
-    - Manually Edit the PHEN_asssign_v2.json
-    - Use "Ho generate JSON" section in process_codes_WP.ipynb to generate JSON for Ho
-    - Cases which require additional preprocessing
-    <!-- - Large Table with sub-categorical column
-        - Need to split table by categorical column
-        - Then read each categorical file individually
-        - USE "divide_col" action
-    - Table with multiple code types in single column
-        - Need to split column into multiple columns, so only one code type per column
-        - USE "split_col" action -->
+Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`.
+
+#### Defining the Structure for Folders and Files:
+```
+"folder":"codes/Medication code source",
+"description":"Medication Codes - downloaded 15/12/23",
+"files": [
+    {
+        "file":"WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx"
+    }
+]
+```
+
+#### Define Column Code Types
+```
+"columns":{
+    "read2_code":"READCODE",
+    "metadata":["DESCRIPTION"]
+},
+```
+
+#### Define Concepts to be mapped to
+```
+"meldb_phenotypes": ["ALL_MEDICATIONS"]
+```
+
+#### Actions: Additional preprocessing (if required):
+- In certain cases, where you wish to sub-divide a code list table or where a column features multiple code types, additional processing is required. Add an `actions` object inside the `file` object.
+
+- Table with a sub-categorical column:
+    - In order to sub-divide a table by a categorical column use the "divide_col" action
+    - e.g. ``` "actions":{"divide_col": "MMCode"}```
+
+- Table with multiple code types in single column:
+    - Need to split column into multiple columns, so only one code type per column.
+    - The "split_col" attribute is the categorical column indicating the code type in that row. The <b>category names should replace column</b> names in the `columns` properties.
+    - The "codes_col" attribute is the code column with multiple code types in a single column
+    - e.g.
+    ```
+    "actions":{
+        "split_col":"coding_system",
+        "codes_col":"code"
+    },
+    "columns":{
+        "read2_code":"Read codes v2",
+        "med_code":"Med codes",
+        "icd10_code":"ICD10 codes",
+        "metadata":["description"]
+    },
+    ```
+
+
+*<b>Large Code lists</b> with numerous phenotypes (e.g. Ho et al) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code to generate it.
+
 ## ⚡ Usage
-- 
GitLab
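
The `split_col` / `codes_col` action introduced in the final hunk above can be illustrated with a small sketch. This is not the repository's implementation: the helper `split_codes_by_type`, the enclosing braces around the JSON fragment, and the sample rows with their placeholder codes are all assumptions made here for illustration. Only the `actions` and `columns` keys and their values are taken from the README text.

```python
import json
from collections import defaultdict

# Mapping fragment copied from the README example. The enclosing braces are
# an assumption, added only so the fragment parses as standalone JSON.
FILE_ENTRY = json.loads("""
{
  "actions": {"split_col": "coding_system", "codes_col": "code"},
  "columns": {
    "read2_code": "Read codes v2",
    "med_code": "Med codes",
    "icd10_code": "ICD10 codes",
    "metadata": ["description"]
  }
}
""")


def split_codes_by_type(rows, entry):
    """Group codes by code type (hypothetical helper, not the tool's own code)."""
    split_col = entry["actions"]["split_col"]
    codes_col = entry["actions"]["codes_col"]
    # Invert "columns": category name found in the table -> code type.
    category_to_type = {
        category: code_type
        for code_type, category in entry["columns"].items()
        if code_type != "metadata"
    }
    grouped = defaultdict(list)
    for row in rows:
        code_type = category_to_type.get(row[split_col])
        if code_type is None:
            continue  # skip categories the mapping does not mention
        grouped[code_type].append(row[codes_col])
    return dict(grouped)


# Invented sample rows with placeholder codes, for illustration only.
ROWS = [
    {"coding_system": "Read codes v2", "code": "READ2-EXAMPLE-1", "description": "row A"},
    {"coding_system": "ICD10 codes", "code": "ICD10-EXAMPLE-1", "description": "row B"},
    {"coding_system": "Read codes v2", "code": "READ2-EXAMPLE-2", "description": "row C"},
    {"coding_system": "Other", "code": "OTHER-EXAMPLE-1", "description": "row D"},
]

result = split_codes_by_type(ROWS, FILE_ENTRY)
print(result)
```

Inverting the `columns` mapping reflects the README's note that, when `split_col` is used, the category names take the place of column names. Rows whose category is not listed are skipped in this sketch, although the real tool may handle them differently.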