A Tool for Automating the Curation of Medical Concepts derived from Coding Lists

Jakub J. Dylag 1, Roberta Chiovoloni 3, Ashley Akbari 3, Simon D. Fraser 2, Michael J. Boniface 1

1 Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton
2 School of Primary Care Population Sciences and Medical Education, University of Southampton
3 Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University

*Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk

How to cite this work

Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing

Introduction

This project generates the medical coding lists that define the cohort phenotypes used as inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria.

The output code list is then used by data providers to select MELD-B cohorts.

Method

Process

  1. Approved MELD-B concepts are defined in a CSV spreadsheet (currently PHEN_summary_working.csv).
  2. Imported code lists in /src are verified against all NHS TRUD registered codes.
  3. Mappings from imported code lists to the output MELD-B concept code lists are defined in JSON format within PHEN_assign_v3.json.
    • See "JSON phenotype mapping" section for more details
  4. The process is executed from the command line, either manually or from the bash script run.sh.
    • See "Usage" section for more details
  5. Output concept code lists are saved to the /concepts git repository, where any changes are tracked.
  6. Output concept code lists can be exported into SAIL or any other data bank.
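The verification in step 2 can be pictured as a membership check against a reference set of registered codes. The following is a minimal sketch of that idea only, not the project's implementation: the function name `verify_codes` and all code values are hypothetical placeholders, and the real tool checks against NHS TRUD tables.

```python
def verify_codes(codes, registered):
    """Split codes into those found in the registered set and those not found."""
    valid, invalid = [], []
    for code in codes:
        cleaned = code.strip()  # tolerate stray whitespace from spreadsheets
        (valid if cleaned in registered else invalid).append(cleaned)
    return valid, invalid


# Placeholder values for illustration only; these are not real clinical codes.
registered_codes = {"AAA1", "BBB2", "CCC3"}
imported_codes = ["AAA1", " BBB2 ", "ZZZ9"]

valid, invalid = verify_codes(imported_codes, registered_codes)
print(valid)    # ['AAA1', 'BBB2']
print(invalid)  # ['ZZZ9']
```

Codes that fail the check are returned separately so they can be written to an error log instead of being silently dropped.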

Medical Coding Standards Supported

Code Type      | Verification | Maps to
---------------|--------------|----------------------------------
Readv2         | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4, ATC
Readv3 (CTV3)  | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4
ICD10          | NHS TRUD     |
SNOMED         | NHS TRUD     |
OPCS4          | NHS TRUD     |
ATC            | None         |
MED            | None         |
CPRD Product   | None         |
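For scripting around the tool, the table above can be held as a simple lookup. This is a sketch under assumptions: the dictionary name `SUPPORTED_TRANSLATIONS` and the helper `can_translate` are hypothetical and do not come from the project; the entries are copied from the table as written.

```python
# Translation targets per source code type, copied from the table above.
# An empty list means the table lists no translation targets for that type.
SUPPORTED_TRANSLATIONS = {
    "Readv2": ["Readv3", "SNOMED", "ICD10", "OPCS4", "ATC"],
    "Readv3 (CTV3)": ["Readv3", "SNOMED", "ICD10", "OPCS4"],
    "ICD10": [],
    "SNOMED": [],
    "OPCS4": [],
    "ATC": [],
    "MED": [],
    "CPRD Product": [],
}


def can_translate(source, target):
    """Return True if the table lists target as a translation of source."""
    return target in SUPPORTED_TRANSLATIONS.get(source, [])


print(can_translate("Readv2", "ATC"))    # True
print(can_translate("ICD10", "SNOMED"))  # False
```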

MELD-B refers to various diagnostic code formats included in target datasets.

  • Read V2
  • SNOMED-CT was adopted by the NHS around 2018
    • CPRD AURUM uses SNOMED codes and includes mappings to Read codes; no other database (CPRD Gold, SAIL) does.
    • Mappings exist from SNOMED to Read codes, some provided by CPRD and others by NHS TRUD.
  • ICD-10 codes are used in hospital settings and are important for the HES-linked datasets.
  • ATC codes are internationally accepted for the classification of medicines and are maintained by the WHO.

Setup

  • Run bash import.sh to delete corrupted files that cannot be read

Code Translation Tables

  1. Due to the licensing of NHS TRUD coding tables, the following resources must be downloaded separately:

  2. Next, prepare the conversion tables by saving them as .parquet tables.

    • See the "Mappings" section in process_codes_WP.ipynb to generate a table with the appropriate name
    • For reversible conversions, create a duplicate table with the name reversed. However, be aware that this is NOT ADVISED and goes against NHS guidance.
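One practical issue with a duplicated, name-reversed table is that a forward mapping can send several source codes to the same target, so the reversed lookup is no longer one-to-one. The sketch below illustrates that general point only; it is not a statement of the reasoning behind the NHS guidance, and the function name and code values are hypothetical placeholders.

```python
from collections import defaultdict


def reverse_mapping(pairs):
    """Swap (source, target) pairs and group by target, keeping every candidate."""
    reversed_map = defaultdict(list)
    for source, target in pairs:
        reversed_map[target].append(source)
    return dict(reversed_map)


# Placeholder codes for illustration only; not real clinical codes.
forward_pairs = [("SRC1", "TGT_A"), ("SRC2", "TGT_A"), ("SRC3", "TGT_B")]

reversed_map = reverse_mapping(forward_pairs)

# Targets that map back to more than one source are ambiguous when reversed.
ambiguous = {t: s for t, s in reversed_map.items() if len(s) > 1}
print(ambiguous)  # {'TGT_A': ['SRC1', 'SRC2']}
```

Checking a reversed table for ambiguous targets like this before using it makes the loss of precision visible.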

JSON phenotype mapping

Mappings from imported code lists to the output MELD-B concept code lists are defined in JSON format within PHEN_assign_v3.json.

Defining the structure for folders and files:

"folder":"codes/Medication code source",
"description":"Medication Codes - downloaded 15/12/23",
"files": [
		{
			"file":"WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx"
		}
]

Define Column Code Types

"columns":{
	"read2_code":"READCODE",
	"metadata":["DESCRIPTION"]
},

Define Concepts to be mapped to

"meldb_phenotypes": ["ALL_MEDICATIONS"]

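The three fragments above can be assembled and checked with Python's json module. This is a minimal sketch that assumes "columns" and "meldb_phenotypes" sit inside each file object, alongside "file"; that nesting is an assumption for illustration, so check it against PHEN_assign_v3.json before relying on it.

```python
import json

# Assumed nesting: per-file settings live inside the file object.
file_entry = {
    "file": "WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx",
    "columns": {"read2_code": "READCODE", "metadata": ["DESCRIPTION"]},
    "meldb_phenotypes": ["ALL_MEDICATIONS"],
}

folder_entry = {
    "folder": "codes/Medication code source",
    "description": "Medication Codes - downloaded 15/12/23",
    "files": [file_entry],
}

# Serialise and parse back to confirm the structure is valid JSON.
text = json.dumps(folder_entry, indent=2)
round_trip = json.loads(text)
print(text)
```

Building the entry as a Python dictionary and serialising it avoids the trailing-comma and quoting mistakes that are easy to make when editing JSON by hand.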
Actions: Additional preprocessing (if required):

  • In certain cases, where you wish to sub-divide a code list table or where a column features multiple code types, additional processing is required. Add an "actions" object inside the file object.

  • Table with a sub-categorical column:

    • To sub-divide a table by a categorical column, use the "divide_col" action
    • e.g. "actions":{"divide_col": "MMCode"}
  • Table with multiple code types in a single column:

    • The column needs to be split into multiple columns, so that there is only one code type per column.
    • The "split_col" attribute is the categorical column indicating the code type in that row. The category names should replace the column names in the "columns" properties.
    • The "codes_col" attribute is the code column containing multiple code types
    • e.g.
     "actions":{
     	"split_col":"coding_system",
     	"codes_col":"code"
     },
     "columns":{
     	"read2_code":"Read codes v2",
     	"med_code":"Med codes",
     	"icd10_code":"ICD10 codes",
     	"metadata":["description"]
     },
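To make the two actions concrete, here is a minimal sketch of what "divide_col" and "split_col"/"codes_col" could do to a table held as a list of row dictionaries. It is an illustration under assumptions, not the project's code: the function names are hypothetical and the code values are placeholders.

```python
from collections import defaultdict


def divide_by_column(rows, divide_col):
    """Group rows into sub-tables keyed by the value in divide_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[divide_col]].append(row)
    return dict(groups)


def split_codes_column(rows, split_col, codes_col):
    """Give each code type its own column, named after the category in split_col."""
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        new_row[row[split_col]] = row[codes_col]
        out.append(new_row)
    return out


# Placeholder rows; the code values are not real clinical codes.
rows = [
    {"coding_system": "Read codes v2", "code": "AAA1", "description": "example one"},
    {"coding_system": "ICD10 codes", "code": "BBB2", "description": "example two"},
]

split_rows = split_codes_column(rows, "coding_system", "code")
print(split_rows[0])  # {'description': 'example one', 'Read codes v2': 'AAA1'}

groups = divide_by_column(rows, "coding_system")
print(sorted(groups))  # ['ICD10 codes', 'Read codes v2']
```

After the split, the new column names are the category values, which is why the "columns" mapping in the example refers to "Read codes v2", "Med codes" and "ICD10 codes".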

*Large code lists with numerous phenotypes (e.g. Ho et al.) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code that generates it.
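As a rough idea of what such generation can look like, the sketch below builds one file object per phenotype from a list of pairs. It does not reproduce the notebook code: the function name, file names and phenotype names are hypothetical, and the per-file layout is assumed.

```python
import json


def build_file_entries(phenotype_files, code_column, code_type="read2_code"):
    """Create one file object per (file name, phenotype) pair."""
    return [
        {
            "file": file_name,
            "columns": {code_type: code_column},
            "meldb_phenotypes": [phenotype],
        }
        for file_name, phenotype in phenotype_files
    ]


# Hypothetical file and phenotype names, for illustration only.
phenotype_files = [
    ("example_asthma.csv", "EXAMPLE_ASTHMA"),
    ("example_diabetes.csv", "EXAMPLE_DIABETES"),
]

entries = build_file_entries(phenotype_files, code_column="READCODE")
print(json.dumps({"files": entries}, indent=2))
```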

Usage

The script preprocesses code lists and maps them to a given concept/phenotype.

Execution (Bash Script)

bash ./run.sh

Execution (Shell Command)

usage: python main.py [-h] [-r2] [-r3] [-i] [-s] [-o] [-a] [-m] [-c] [--no-translate] [--no-verify] [--output] [--error-log] mapping_file

positional arguments:

  • mapping_file Concept/Phenotype Assignment File (json)

optional arguments:

  • -r2, --read2-code Read V2 Codes Column name in Source File
  • -r3, --read3-code Read V3 Codes Column name in Source File
  • -i, --icd10-code ICD10 Codes Column name in Source File
  • -s, --snomed-code SNOMED Codes Column name in Source File
  • -o, --opcs4-code OPCS4 Codes Column name in Source File
  • -a, --atc-code ATC Codes Column name in Source File
  • -m, --med-code Med Codes Column name in Source File
  • -c, --cprd-code CPRD Product Codes Column name in Source File
  • --no-translate Do not translate code types
  • --no-verify Do not verify codes are correct
  • --output Filepath to save output to
  • --error-log Filepath to save error log to

EXAMPLE: python main.py PHEN_assign_v3.json -r2 --output output/MELD_concepts_readv2.csv --error-log output/MELD_errors.csv
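For readers who want to see how such an interface can be declared, here is a minimal argparse sketch that mirrors the usage text. It is not taken from main.py. It assumes the code-type options are simple on/off switches (as in the example, where -r2 is given without a value) and that --output and --error-log each take a file path.

```python
import argparse

CODE_TYPE_FLAGS = [
    ("-r2", "--read2-code", "Read V2"),
    ("-r3", "--read3-code", "Read V3"),
    ("-i", "--icd10-code", "ICD10"),
    ("-s", "--snomed-code", "SNOMED"),
    ("-o", "--opcs4-code", "OPCS4"),
    ("-a", "--atc-code", "ATC"),
    ("-m", "--med-code", "Med"),
    ("-c", "--cprd-code", "CPRD Product"),
]


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("mapping_file", help="Concept/Phenotype Assignment File (json)")
    for short, long, label in CODE_TYPE_FLAGS:
        # Assumption: each code-type option is an on/off switch.
        parser.add_argument(short, long, action="store_true", help=f"{label} codes")
    parser.add_argument("--no-translate", action="store_true", help="Do not translate code types")
    parser.add_argument("--no-verify", action="store_true", help="Do not verify codes are correct")
    parser.add_argument("--output", help="Filepath to save output to")
    parser.add_argument("--error-log", help="Filepath to save error log to")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args([
    "PHEN_assign_v3.json", "-r2",
    "--output", "output/MELD_concepts_readv2.csv",
    "--error-log", "output/MELD_errors.csv",
])
print(args.read2_code, args.output)  # True output/MELD_concepts_readv2.csv
```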

Contributing

Commit to GitLab

git add .
git commit -m "my message ..."
git tag -a v1.0.0 -m "added features ..."
git push

Funding

This project has received funding from the National Institute for Health and Care Research (NIHR) under grant agreement NIHR203988.

License

This work is licensed under the Apache License, Version 2.0.