A Tool for Automating the Curation of Medical Concepts derived from Coding Lists (ACMC)

Jakub J. Dylag 1, Roberta Chiovoloni 3, Ashley Akbari 3, Simon D. Fraser 2, Michael J. Boniface 1

1 Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton
2 School of Primary Care Population Sciences and Medical Education, University of Southampton
3 Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University

Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk

Citation

Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing

Introduction

This tool automates the verification, translation and organisation of medical coding lists defining phenotypes for inclusion criteria in cohort analysis. By processing externally sourced clinical inclusion criteria into actionable code lists, this tool ensures consistent and efficient curation of cohort definitions. These code lists can be subsequently used by data providers to construct study cohorts.

Overview

Workflow

The high-level steps for using the tool are outlined below:

1. Define concept sets: A domain expert defines a list of concept sets for each observable characteristic of the phenotype using CSV file format (e.g., PHEN_concept_sets.csv).

2. Define code lists for concept sets: A domain expert defines code lists for each concept set within the phenotype using supported coding list formats and stores them in the /src directory.

3. Define mapping from code lists to concept sets: A domain expert defines a phenotype mapping that maps code lists to concept sets in JSON file format (e.g., PHEN_assign_v3.json).

4. Generate versioned phenotype coding lists and translations: A domain expert or data engineer processes the phenotype mapping using the command line tool, which validates the codes against NHS TRUD-registered codes and mappings and generates versioned concept set code lists with translations between coding standards.

Supported Medical Coding Standards

The tool supports verification and mapping across the medical coding standards below:

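| Medical Code  | Verification | Translation to                    |
|---------------|--------------|-----------------------------------|
| Readv2        | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4, ATC |
| Readv3 (CTV3) | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4      |
| ICD10         | NHS TRUD     | None                              |
| SNOMED        | NHS TRUD     | None                              |
| OPCS4         | NHS TRUD     | None                              |
| ATC           | None         | None                              |
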
  • Read V2: NHS clinical terminology standard used in primary care, replaced by SNOMED-CT in 2018; still supported by some data providers (e.g. SAIL Databank) because it remains widely used in primary care data.
  • SNOMED-CT: international standard for clinical terminology in Electronic Healthcare Records, adopted by the NHS in 2018; mappings to Read codes are partially provided by the Clinical Practice Research Datalink (CPRD) and NHS Technology Reference Update Distribution (TRUD).
  • ICD-10: International Classification of Diseases (ICD) is a medical classification list from the World Health Organization (WHO), widely used in hospital settings, e.g. Hospital Episode Statistics (HES).
  • ATC Codes: Anatomical Therapeutic Chemical (ATC) Classification is a drug classification list from the World Health Organization (WHO).

Installation

  1. Set Up Conda Environment: Download and install the Python environment. Follow the instructions to install Miniconda from https://docs.conda.io/en/latest/miniconda.html.
  • Create the environment: conda env create -f conda.yaml
  • Activate the environment: conda activate acmc

  2. Sign Up: Register at NHS TRUD.

  3. Subscribe and accept the following licenses:

Each data file has a "Subscribe" link that takes you to the licence. You will need to complete the "Tell us about your subscription request" field, summarising why you need access to the data. Your subscription will remain in the "pending" state until it is approved, which usually happens within 24 hours.

  4. Get API Key: Retrieve your API key from NHS TRUD Account Management.

  5. Install TRUD: Download and install the NHS TRUD medical code resources.

Execute the script using the command: python trud.py --key <API_KEY>

Processed tables will be saved as .parquet files in the maps/processed/ directory.

Note: NHS TRUD defines one-way mappings and does NOT ADVISE reversing them. If you still wish to use a mapping in both directions, duplicate the given .parquet table and reverse the filename (e.g. read2_code_to_snomed_code.parquet to snomed_code_to_read2_code.parquet).
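If you decide to reverse a mapping despite that advice, the duplicate-and-rename step can be scripted. The following is a minimal sketch, assuming mapping files follow the `<source>_to_<target>.parquet` naming used in the example above; the helper names `reversed_mapping_name` and `duplicate_reversed` are hypothetical and not part of the tool.

```python
import shutil
from pathlib import Path


def reversed_mapping_name(filename: str) -> str:
    """Swap the two sides of a '<source>_to_<target>.parquet' filename."""
    path = Path(filename)
    source, sep, target = path.stem.partition("_to_")
    if not sep or not source or not target:
        raise ValueError(f"not a '<source>_to_<target>' mapping name: {filename}")
    return f"{target}_to_{source}{path.suffix}"


def duplicate_reversed(mapping_path: str) -> Path:
    """Copy a mapping table next to the original under the reversed filename."""
    src = Path(mapping_path)
    dst = src.with_name(reversed_mapping_name(src.name))
    shutil.copyfile(src, dst)  # contents are copied unchanged; only the name differs
    return dst
```

For example, `duplicate_reversed("maps/processed/read2_code_to_snomed_code.parquet")` would create `maps/processed/snomed_code_to_read2_code.parquet` alongside the original.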

  6. (Optional) Install OMOP Database: Download and install OMOP vocabularies from Athena OHDSI.
    • Required vocabularies include:
        1. SNOMED
        2. ICD9CM
        3. Readv2
        4. ATC
        5. OPCS4
        6. HES Specialty
        7. ICD10CM
        8. dm+d
        9. UK Biobank
        10. NHS Ethnic Category
        11. NHS Place of Service
    • Unzip the downloaded folder and copy its path.
    • Install vocabularies using:
      python omop_api.py --install <PATH_TO_DOWNLOADED_FILES>

Configuration

The JSON configuration file specifies how input codes are grouped into concept sets, which are collections of related codes used for defining phenotypes or other data subsets. The configuration is divided into two main components: the "concept_sets" object and the "codes" list. The "codes" list specifies the input code lists: their file paths, column names and code types, as well as any formatting actions that may be necessary. The "concept_sets" object defines the groupings to which the input codes will be assigned. All files must be formatted as shown below.

{
	"concept_sets": {
	},
	"codes":[
	]
}

EXAMPLE: Configuration file used in the MELD-B project: https://git.soton.ac.uk/meldb/concepts/-/blob/main/PHEN_assign_v3.json?ref_type=heads
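Before running the tool, it can help to confirm that a configuration file has the two top-level components described above. This is a minimal sketch using only the standard library; the function name `check_config_skeleton` is hypothetical, and the checks are an assumption about what a sensible pre-flight validation looks like, not the tool's own validation logic.

```python
import json


def check_config_skeleton(config: dict) -> list:
    """Return a list of problems found in the top-level structure (empty if none)."""
    problems = []
    if not isinstance(config.get("concept_sets"), dict):
        problems.append('"concept_sets" must be an object')
    if not isinstance(config.get("codes"), list):
        problems.append('"codes" must be a list')
    return problems


# Parse the empty skeleton shown above and check it.
config = json.loads('{"concept_sets": {}, "codes": []}')
print(check_config_skeleton(config))
```

To check a real file, load it with `json.load(open(path))` and pass the result to the same function.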

Folder and File Definitions

The "codes" section defines the location and description of all input medical coding lists required for processing. Each folder is defined as an object within the "codes" list. Similarly, each file is an object within that folder's "files" list.

  • folder: Specifies the directory containing the input files.
  • description: Provides a brief summary of the content or purpose of the files, often including additional context such as the date the data was downloaded.
  • files: Lists the files within the specified folder. Each file is represented as an object with the key "file" and the file name as its value. Definitions of the columns in each file are detailed below.
"codes":[
	{
		"folder": "codes/Medication code source",
		"description": "Medication Codes - downloaded 15/12/23",
		"files": [
			{
				"file": "WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx"
			}
		]
	}
]
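To illustrate how the folder and file entries combine into file paths, the sketch below walks a "codes" list and joins each "folder" with its "file" names. It is an illustration only, under the assumption that paths are resolved relative to the working directory; `list_input_files` is a hypothetical helper, not a function of the tool.

```python
from pathlib import Path


def list_input_files(codes: list) -> list:
    """Return the path of every file declared in the "codes" list."""
    paths = []
    for folder_entry in codes:
        folder = Path(folder_entry["folder"])
        for file_entry in folder_entry.get("files", []):
            paths.append(folder / file_entry["file"])
    return paths


# The "codes" list from the example above, written as a Python literal.
codes = [
    {
        "folder": "codes/Medication code source",
        "description": "Medication Codes - downloaded 15/12/23",
        "files": [
            {"file": "WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx"}
        ],
    }
]

for path in list_input_files(codes):
    print(path.as_posix())
```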

Column Definitions in Files

The "columns" property within a file object specifies the type and corresponding names of columns in the input file. Each key in the object represents a column type, while the associated value denotes the name of the column in the input file.

The supported column types include:

  • read2_code: Read Version 2 codes
  • read3_code: Read Version 3 codes
  • icd10_code: International Classification of Diseases, 10th Revision
  • snomed_code: SNOMED-CT codes
  • opcs4_code: OPCS Classification of Interventions and Procedures, Version 4
  • atc_code: Anatomical Therapeutic Chemical classification codes

Additionally, the "metadata" key ensures that any remaining columns not covered by the supported column types are preserved in the output file. These columns are specified as an array of column names to be copied directly.

"files": [
	{
		"file":"WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx",
		"columns": {
			"read2_code": "READCODE",
			"metadata": ["DESCRIPTION"]
		}
	}
]
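The structure of a "columns" object can be summarised in a few lines of Python. The sketch below separates the code-type columns from the "metadata" columns and reports any key that is not in the list of supported column types. The helper name `split_columns` and the decision to report, rather than reject, unrecognised keys are assumptions made for illustration.

```python
SUPPORTED_CODE_TYPES = {
    "read2_code", "read3_code", "icd10_code",
    "snomed_code", "opcs4_code", "atc_code",
}


def split_columns(columns: dict) -> tuple:
    """Split a "columns" object into (code columns, metadata columns, unrecognised keys)."""
    code_columns = {}
    unrecognised = []
    for key, value in columns.items():
        if key == "metadata":
            continue  # handled separately below
        if key in SUPPORTED_CODE_TYPES:
            code_columns[key] = value
        else:
            unrecognised.append(key)
    metadata = list(columns.get("metadata", []))
    return code_columns, metadata, unrecognised


code_columns, metadata, unrecognised = split_columns(
    {"read2_code": "READCODE", "metadata": ["DESCRIPTION"]}
)
print(code_columns, metadata, unrecognised)
```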

Concept Set Assignment

The "concept_sets" object defines the structure and rules for grouping input codes into concept sets based on a source CSV file. Key elements include:

  • file: Specifies the CSV file used as the input for defining concept sets.

  • version: Identifies the version of the concept set definitions being used. This can help track changes over time.

  • concept_set: Defines a list of concept_set objects along with their attributes:

    • concept_set_name: Specifies the name of the concept set.
    • concept_set_status: Indicates the status of the concept set. Only concept sets with the "AGREED" status are written to the output.
    • metadata (optional): A set of additional properties that will be copied into the output. Can be used for descriptive or contextual purposes.

The "codes" object specifies the source files containing input codes and assigns them to the corresponding concept sets through the "concept_set" field.

  • concept_set: Lists the concept sets to which all codes within this file will be assigned.
{
	"concept_sets": {
		"version": "3.2.10",
		"omop": {
			"vocabulary_id": "MELDB",
			"vocabulary_name": "Multidisciplinary Ecosystem to study Lifecourse Determinants and Prevention of Early-onset Burdensome Multimorbidity",
			"vocabulary_reference": "https://www.it-innovation.soton.ac.uk/projects/meldb"
		},
		"concept_set": [
			{
				"concept_set_name": "ABDO_PAIN",
				"concept_set_status": "AGREED",
				"metadata": {
					"concept_set_description": "Abdominal pain"
				}
			}
		]
	},
	"codes":[
		{
			"folder": "codes/Medication code source",
			"description": "Medication Codes - downloaded 15/12/23",
			"files": [
				{
					"file": "WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx",
					"concept_set": ["ALL_MEDICATIONS"]
				}
			]
		}
	]
}
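Because only concept sets with the "AGREED" status reach the output, it can be useful to list them before processing. The following is a small sketch of that filter over a parsed configuration; `agreed_concept_sets` is a hypothetical helper, and the second concept set and its "DRAFT" status are invented purely to show an entry being excluded.

```python
def agreed_concept_sets(config: dict) -> list:
    """Return the names of concept sets whose status is "AGREED"."""
    return [
        concept_set["concept_set_name"]
        for concept_set in config["concept_sets"]["concept_set"]
        if concept_set.get("concept_set_status") == "AGREED"
    ]


config = {
    "concept_sets": {
        "version": "3.2.10",
        "concept_set": [
            {"concept_set_name": "ABDO_PAIN", "concept_set_status": "AGREED"},
            # Hypothetical entry and status, used only to show the filter.
            {"concept_set_name": "EXAMPLE_SET", "concept_set_status": "DRAFT"},
        ],
    },
    "codes": [],
}

print(agreed_concept_sets(config))
```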

Additional preprocessing (if required):

In certain cases, where you wish to sub-divide a code list table or where a column contains multiple code types, additional processing is required. Add an "actions" object inside the file object.

Table with a sub-categorical column:

To sub-divide a table by a categorical column, use the "divide_col" action:

"actions":{
	"divide_col": "MMCode"
}
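Conceptually, dividing by a column produces one sub-table per distinct value in that column. The sketch below shows the idea on a table held as a list of row dictionaries. This representation, the helper name `divide_by_column` and the row values are all assumptions for illustration; the tool's own implementation may differ.

```python
from collections import defaultdict


def divide_by_column(rows: list, divide_col: str) -> dict:
    """Group rows into one sub-table per distinct value of divide_col."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[divide_col]].append(row)
    return dict(tables)


# Hypothetical rows; "MMCode" is the categorical column from the example above.
rows = [
    {"MMCode": "GROUP_A", "code": "CODE_1"},
    {"MMCode": "GROUP_B", "code": "CODE_2"},
    {"MMCode": "GROUP_A", "code": "CODE_3"},
]

for category, table in divide_by_column(rows, "MMCode").items():
    print(category, [row["code"] for row in table])
```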

Table with multiple code types in single column:

When a single column contains multiple code types, it must be split into multiple columns so that each column holds only one code type.

  • The "split_col" attribute is the categorical column indicating the code type in that row. The category names should replace column names in the "columns" property.
  • The "codes_col" attribute is the code column containing multiple code types.

"actions":{
	"split_col":"coding_system",
	"codes_col":"code"
},
"columns":{
	"read2_code":"Read codes v2",
	"med_code":"Med codes",
	"icd10_code":"ICD10 codes",
	"metadata":["description"]
},
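One way to read this action is that each code is moved into a new column named after its category, so that the category names can then be used as column names in "columns". The sketch below implements that interpretation on a list of row dictionaries. It is an assumption about the behaviour, not the tool's code; `split_code_column` and the row values are hypothetical.

```python
def split_code_column(rows: list, split_col: str, codes_col: str) -> list:
    """Move each code into a column named after its category in split_col."""
    result = []
    for row in rows:
        # Keep every column except the two being replaced.
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        new_row[row[split_col]] = row[codes_col]
        result.append(new_row)
    return result


# Hypothetical rows using the column names from the example above.
rows = [
    {"coding_system": "Read codes v2", "code": "CODE_A", "description": "first"},
    {"coding_system": "ICD10 codes", "code": "CODE_B", "description": "second"},
]

for row in split_code_column(rows, "coding_system", "code"):
    print(row)
```

After the split, a "columns" entry such as "read2_code": "Read codes v2" refers to one of the newly created columns.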

Note: Large code lists with numerous phenotypes (e.g. Ho et al.) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code to generate it.
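Independently of that notebook, the general idea of generating the configuration programmatically can be sketched as follows. The helper `build_codes_entry`, the folder, the description, the file names and the concept set names are all hypothetical; only the key names ("folder", "description", "files", "file", "concept_set") come from the format described above.

```python
import json


def build_codes_entry(folder: str, description: str, file_to_concept_sets: dict) -> dict:
    """Build one entry of the "codes" list from a {file name: [concept sets]} mapping."""
    return {
        "folder": folder,
        "description": description,
        "files": [
            {"file": name, "concept_set": list(concept_sets)}
            for name, concept_sets in file_to_concept_sets.items()
        ],
    }


entry = build_codes_entry(
    "codes/Example source",   # hypothetical folder
    "Example code lists",     # hypothetical description
    {"list_a.csv": ["CONCEPT_A"], "list_b.csv": ["CONCEPT_B"]},
)

# Tab indentation matches the layout of the configuration examples above.
print(json.dumps({"codes": [entry]}, indent="\t"))
```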

Usage

The script preprocesses code lists and maps them to the given concepts/phenotypes.

Execute Command Line

Execute via shell with customizable parameters:

python main.py [-h] [-r2] [-r3] [-i] [-s] [-o] [-a] [--no-translate] [--no-verify] [--output] [--error-log] mapping_file

Required Arguments:

  • mapping_file Concept/Phenotype Assignment File (json)
  • --output Filepath to save output to CSV or OMOP SQLite Database

Optional Arguments:

  • -r2, --read2-code Read V2 Codes Column name in Source File
  • -r3, --read3-code Read V3 Codes Column name in Source File
  • -i, --icd10-code ICD10 Codes Column name in Source File
  • -s, --snomed-code SNOMED Codes Column name in Source File
  • -o, --opcs4-code OPCS4 Codes Column name in Source File
  • -a, --atc-code ATC Codes Column name in Source File
  • --no-translate Do not translate code types
  • --no-verify Do not verify codes are correct
  • --error-log Filepath to save error log to

EXAMPLE: python main.py PHEN_assign_v3.json -r2 --output output/MELD_concepts_readv2.csv --error-log output/MELD_errors.csv
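When the tool is run as part of a larger pipeline, the same command can be assembled and launched from Python. This is a sketch that mirrors the EXAMPLE above and uses only the flags documented in this section; `build_command` and `run_acmc` are hypothetical wrappers, and it is assumed that main.py is in the current working directory.

```python
import subprocess
import sys


def build_command(mapping_file: str, output: str, error_log=None, flags=()) -> list:
    """Assemble the argument list for main.py from the documented options."""
    command = [sys.executable, "main.py", mapping_file, *flags, "--output", output]
    if error_log is not None:
        command += ["--error-log", error_log]
    return command


def run_acmc(command: list) -> None:
    """Run the command and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


command = build_command(
    "PHEN_assign_v3.json",
    "output/MELD_concepts_readv2.csv",
    error_log="output/MELD_errors.csv",
    flags=["-r2"],
)
print(" ".join(command[1:]))
```

Calling `run_acmc(command)` from the repository root would then execute the tool with those arguments.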

Contributing

Commit to GitLab

git add .
git commit -m "my message ..."
git tag -a v1.0.0 -m "added features ..."
git push
git push origin v1.0.0

Acknowledgements

This project was developed in the context of the MELD-B project, which is funded by the UK National Institute for Health and Care Research (NIHR) under grant agreement NIHR203988.

License

This work is licensed under the Apache License, Version 2.0.
