

A Tool for Automating the Curation of Medical Concepts derived from Coding Lists (ACMC)
Jakub J. Dylag 1, Roberta Chiovoloni 3, Ashley Akbari 3, Simon D. Fraser 2, Michael J. Boniface 1
1 Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton
2 School of Primary Care Population Sciences and Medical Education, University of Southampton
3 Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University
Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk
Citation
Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing
Introduction
This tool automates the verification, translation and organisation of medical coding lists defining phenotypes for inclusion criteria in cohort analysis. By processing externally sourced clinical inclusion criteria into actionable code lists, this tool ensures consistent and efficient curation of cohort definitions. These code lists can be subsequently used by data providers to construct study cohorts.
Overview
Workflow
The high-level steps to use the tool are outlined below:
1. Define concept sets: A domain expert defines a list of concept sets for each observable characteristic of the phenotype using CSV file format (e.g., PHEN_concept_sets.csv).
2. Define concept code lists for concept sets: A domain expert defines code lists for each concept set within the phenotype using supported coding list formats and stores them in the /src directory.
3. Define mapping from code lists to concept sets: A domain expert defines a phenotype mapping that maps code lists to concept sets.
4. Generate versioned phenotype coding lists and translations: A domain expert or data engineer processes the phenotype mappings using the command line tool to validate them against NHS TRUD-registered codes and mappings, and to generate versioned concept set code lists with translations between coding standards.
Supported Medical Coding Standards
The tool supports verification and translation across the medical coding formats below:
Medical Code | Verification | Translation to |
---|---|---|
Readv2 | NHS TRUD | Readv3, SNOMED, ICD10, OPCS4, ATC |
Readv3 (CTV3) | NHS TRUD | Readv3, SNOMED, ICD10, OPCS4 |
ICD10 | NHS TRUD | None |
SNOMED | NHS TRUD | None |
OPCS4 | NHS TRUD | None |
ATC | None | None |
- Read V2: NHS clinical terminology standard used in primary care and replaced by SNOMED-CT in 2018; still supported by some data providers (e.g. SAIL Databank) because it remains widely used in primary care data.
- SNOMED-CT: international standard for clinical terminology in Electronic Healthcare Records, adopted by the NHS in 2018; mappings to Read codes are partially provided by the Clinical Practice Research Datalink (CPRD) and NHS Technology Reference Update Distribution (TRUD).
- ICD-10: International Classification of Diseases (ICD) is a medical classification list from the World Health Organization (WHO), widely used in hospital settings, e.g. Hospital Episode Statistics (HES).
- ATC Codes: Anatomical Therapeutic Chemical (ATC) Classification is a drug classification list from the World Health Organization (WHO).
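The support table above can also be written down as a small lookup, which is handy when checking a planned translation before running the tool. This is only an illustrative sketch transcribed from the table; the names `SUPPORTED_TRANSLATIONS`, `VERIFIED_BY_TRUD` and `can_translate` are hypothetical and are not part of ACMC.

```python
# Supported translations, transcribed from the table above.
# Keys are source coding systems; values are the targets listed for them.
SUPPORTED_TRANSLATIONS = {
    "Readv2": {"Readv3", "SNOMED", "ICD10", "OPCS4", "ATC"},
    "Readv3": {"Readv3", "SNOMED", "ICD10", "OPCS4"},
    "ICD10": set(),
    "SNOMED": set(),
    "OPCS4": set(),
    "ATC": set(),
}

# Coding systems the table lists as verified against NHS TRUD.
VERIFIED_BY_TRUD = {"Readv2", "Readv3", "ICD10", "SNOMED", "OPCS4"}


def can_translate(source: str, target: str) -> bool:
    """Return True if the table lists `target` as a translation of `source`."""
    return target in SUPPORTED_TRANSLATIONS.get(source, set())


print(can_translate("Readv2", "SNOMED"))  # True
print(can_translate("ICD10", "SNOMED"))   # False
```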
Notes
OMOP
Content of your package
Vocabularies release version: v20240830
Linux/macOS:
export ACMC_TRUD_API_KEY="your_api_key"
export ACMC_GITLAB_PAT="your_personal_access_token"
export ACMC_GITHUB_PAT="your_personal_access_token"
Windows (Command prompt):
set ACMC_TRUD_API_KEY=your_api_key
set ACMC_GITLAB_PAT=your_personal_access_token
set ACMC_GITHUB_PAT=your_personal_access_token
Windows (PowerShell):
$env:ACMC_TRUD_API_KEY="your_api_key"
$env:ACMC_GITLAB_PAT="your_personal_access_token"
$env:ACMC_GITHUB_PAT="your_personal_access_token"
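Before running the tool it can be useful to confirm that the variables above are visible to Python. The following is a minimal sketch; the helper `missing_acmc_vars` is hypothetical, and whether every variable is needed for a given command is an assumption, not something this document states.

```python
import os

# Variable names taken from the export/set examples above.
ACMC_VARS = ("ACMC_TRUD_API_KEY", "ACMC_GITLAB_PAT", "ACMC_GITHUB_PAT")


def missing_acmc_vars(env=None):
    """Return the names of the variables above that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in ACMC_VARS if not env.get(name)]


missing = missing_acmc_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All ACMC environment variables are set.")
```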
Installation
1. Set up Conda Environment
ACMC requires Python and the environment is maintained using conda.
- Ensure you have conda installed, e.g. following instructions for miniconda from https://docs.conda.io/en/latest/miniconda.html.
- Create environment:
conda env create -f conda.yaml
- Activate environment:
conda activate acmc
2. Register at NHS TRUD to access clinically assured terminology mappings
3. Subscribe and accept the following licenses
ACMC uses clinically assured medical terminologies provided by the NHS. The datafiles are downloaded automatically but you need to register, request subscription and obtain an API key.
Each data file has a "Subscribe" link that will take you to the licence. You will need to complete the "Tell us about your subscription request" field with a summary of why you need access to the data, e.g. for a specific research project. Your subscription will not be approved immediately and will remain in the "pending" state until it is. Approval usually takes less than 24 hours.
4. Get TRUD API Key
Go to your NHS TRUD Account Management and copy your API key to a safe place, e.g. a personal key store. The API key is required by ACMC tools to download TRUD resources.
5. Download and install TRUD resources
Execute the following command to download, install and process TRUD resources:
python acmc.py trud install --key <API_KEY>
Processed TRUD resources are saved as .parquet files in the build/maps/processed/ directory.
Note: NHS TRUD defines one-way mappings and does not advise reversing them. If you still wish to turn these into two-way mappings, duplicate the given .parquet table and reverse the filename (e.g. read2_code_to_snomed_code.parquet to snomed_code_to_read2_code.parquet).
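The duplicate-and-rename step described in the note can be scripted. The sketch below assumes mapping files follow the `<source>_to_<target>.parquet` naming shown in the example; the helper names are hypothetical and the code only copies the file under the reversed name, exactly as the note describes.

```python
import shutil
from pathlib import Path


def reversed_mapping_name(filename: str) -> str:
    """Swap the two sides of a '<source>_to_<target>.parquet' filename."""
    path = Path(filename)
    source, sep, target = path.stem.partition("_to_")
    if not sep or not source or not target:
        raise ValueError(f"Not a '<source>_to_<target>' filename: {filename}")
    return f"{target}_to_{source}{path.suffix}"


def duplicate_reversed(mapping_file: Path) -> Path:
    """Copy a mapping table next to itself under the reversed filename."""
    destination = mapping_file.with_name(reversed_mapping_name(mapping_file.name))
    shutil.copyfile(mapping_file, destination)
    return destination


print(reversed_mapping_name("read2_code_to_snomed_code.parquet"))
# snomed_code_to_read2_code.parquet
```

For example, `duplicate_reversed(Path("build/maps/processed/read2_code_to_snomed_code.parquet"))` would create the reversed copy in the same directory, assuming that file exists.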
6. Optional: Install OMOP Database:
ACMC optionally supports outputting coding lists to a structured OMOP database. To do this, you will need to register with Athena and then manually download the following vocabularies from OHDSI Athena.
- Required vocabularies include:
  - SNOMED
  - ICD9CM
  - Readv2
  - ATC
  - OPCS4
  - HES Specialty
  - ICD10CM
  - dm+d
  - UK Biobank
  - NHS Ethnic Category
  - NHS Place of Service
- The vocabularies will not be available immediately; you will be notified by email when they are ready. This process cannot be automated due to the way that Athena delivers vocabularies for download.
- Unzip the downloaded folder and copy its path.
- Install the vocabularies using the following command:
python acmc.py omop install -f <Path to extracted OMOP downloads folder>
Defining phenotypes
Phenotypes are defined in a JSON configuration file. The file describes how source concept codes (i.e. a code list) are mapped to the collection of concept sets included in the phenotype.
- concept_sets: defines the collection of observable characteristics of the phenotype. See Observational Health Data Sciences and Informatics (OHDSI) definition for Concept Set
- codes: defines lists of source concept codes associated with a specific concept set and declarative actions (e.g. filepaths, column names, code types, actions) to process source concept code files. See OMOP Common Data Model definition for Concept Codes
An example concept set and code list for Abdominal Pain is shown below:
{
  "concept_sets": {
    "version": "3.2.10",
    "omop": {
      "vocabulary_id": "MELDB",
      "vocabulary_name": "Multidisciplinary Ecosystem to study Lifecourse Determinants and Prevention of Early-onset Burdensome Multimorbidity",
      "vocabulary_reference": "https://www.it-innovation.soton.ac.uk/projects/meldb"
    },
    "concept_set": [
      {
        "concept_set_name": "ABDO_PAIN",
        "concept_set_status": "AGREED",
        "metadata": {
          "#": "18",
          "CONCEPT DESCRIPTION": "Abdominal pain",
          "CONCEPT TYPE": "Workload indicator (symptom)",
          "DATE ADDED ": "2023-08-25",
          "REQUEST REASON ": "Clinician SF - requested by email - symptom example from Qualitative Evidence Synthesis",
          "SOURCE INFO": "YES",
          "FUNCTION": "QUERY BY CODING LIST",
          "FUNCTION.1": "https://clinicalcodes.rss.mhs.man.ac.uk/",
          "CODING LIST": "https://git.soton.ac.uk/meld/meldb-external/phenotype/-/tree/main/codes/ClinicalCodes.org%20from%20the%20University%20of%20Manchester/Symptom%20code%20lists/Abdominal%20pain/res176-abdominal-pain.csv ",
          "NOTES": "2023-09-08: Clinical SF confirmed that the clinical view would be that this would need to be recurrent or persistent."
        }
      }
    ]
  },
  "codes": [
    {
      "folder": "clinical-codes-org",
      "description": "SF's clinical codes - downloaded 16/11/23",
      "files": [
        {
          "file": "Symptom code lists/Abdominal pain/res176-abdominal-pain.csv",
          "columns": {
            "read2_code": "code",
            "metadata": [
              "description"
            ]
          },
          "concept_set": [
            "ABDO_PAIN"
          ]
        }
      ]
    }
  ]
}
A full example of the phenotype for burdensome multiple long-term conditions is available from the MELDB project.
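Because file entries refer to concept sets by name, a quick consistency check on a configuration can catch typos before processing. The sketch below uses only the keys shown in the Abdominal Pain example; the function `undefined_concept_sets` is hypothetical and is not necessarily how ACMC itself validates configurations.

```python
import json


def undefined_concept_sets(config: dict) -> set:
    """Return concept set names used under "codes" but not defined under "concept_sets"."""
    defined = {cs["concept_set_name"] for cs in config["concept_sets"]["concept_set"]}
    referenced = {
        name
        for folder in config["codes"]
        for file in folder["files"]
        for name in file.get("concept_set", [])
    }
    return referenced - defined


# A trimmed copy of the example configuration above.
example = json.loads("""
{
  "concept_sets": {
    "version": "3.2.10",
    "concept_set": [
      {"concept_set_name": "ABDO_PAIN", "concept_set_status": "AGREED"}
    ]
  },
  "codes": [
    {
      "folder": "clinical-codes-org",
      "files": [
        {
          "file": "Symptom code lists/Abdominal pain/res176-abdominal-pain.csv",
          "columns": {"read2_code": "code", "metadata": ["description"]},
          "concept_set": ["ABDO_PAIN"]
        }
      ]
    }
  ]
}
""")

print(undefined_concept_sets(example))  # set()
```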
Defining concept sets
The "concept_sets" object defines the structure for grouping input codes into concept sets based on source concepts. Key elements include:
- version: Identifies the version of the concept set definitions being used. This can help track changes over time.
- concept_set: Defines a list of concept_set objects along with their attributes:
  - concept_set_name: Specifies the name of the concept set.
  - concept_set_status: Indicates the status of the concept set. Only concept sets with the "AGREED" status are written to the output.
  - metadata (optional): A list of additional properties that will be copied into the output. Can be used for descriptive or contextual purposes.
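The effect of the status rule can be illustrated with a short filter over a "concept_sets" object. This is a sketch only: the function name is hypothetical, and the "DRAFT" status and "EXAMPLE_SET" name in the sample are made up for illustration (the only status value named in this document is "AGREED").

```python
def agreed_concept_sets(concept_sets: dict) -> list:
    """Return the names of concept sets whose status is "AGREED"."""
    return [
        cs["concept_set_name"]
        for cs in concept_sets["concept_set"]
        if cs.get("concept_set_status") == "AGREED"
    ]


sample = {
    "version": "3.2.10",
    "concept_set": [
        {"concept_set_name": "ABDO_PAIN", "concept_set_status": "AGREED"},
        # Hypothetical entry with a made-up status, to show it being skipped.
        {"concept_set_name": "EXAMPLE_SET", "concept_set_status": "DRAFT"},
    ],
}

print(agreed_concept_sets(sample))  # ['ABDO_PAIN']
```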
Defining concept codes
The "codes" object defines the location and description of all input medical coding lists required for processing. Each "folder" is defined as an object within the "codes" list. Similarly, all files are objects within the "files" list.
- folder: Specifies the directory containing the input files.
- description: Provides a brief summary of the content or purpose of the files, often including additional context such as the date the data was downloaded.
- files: Lists the files within the specified folder. Each file is represented as an object with the key "file" and the file name as its value. Definitions of the columns in each file are detailed below.
Mapping source column definitions in files to standard vocabulary types
The "columns" property within a file object specifies the type and corresponding names of columns in the input file. Each key in the object represents a column type, while the associated value denotes the name of the column in the input file.
The supported column types include:
- read2_code: Read Version 2 codes
- read3_code: Read Version 3 codes
- icd10_code: International Classification of Diseases, 10th Revision
- snomed_code: SNOMED-CT codes
- opcs4_code: OPCS Classification of Interventions and Procedures, Version 4
- atc_code: Anatomical Therapeutic Chemical classification codes
Additionally, the "metadata" property ensures that any remaining columns not explicitly categorized by the supported column types are preserved in the output file. These columns are specified as an array of column names to be copied directly.
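To make the "columns" mapping concrete, the sketch below applies one to an in-memory CSV: source columns are renamed to their code type and the listed metadata columns are copied across. It is an illustration under assumptions, not ACMC's implementation: the function name is hypothetical, the code value `CODE1` is a made-up placeholder, and dropping columns that are not listed is assumed behaviour.

```python
import csv
import io

# Column types listed above.
CODE_TYPES = {"read2_code", "read3_code", "icd10_code", "snomed_code", "opcs4_code", "atc_code"}


def standardise_rows(csv_text: str, columns: dict) -> list:
    """Rename source columns to their code types and keep the listed metadata columns."""
    metadata = columns.get("metadata", [])
    # Invert the mapping: source column name -> standard code type.
    renames = {
        source: code_type
        for code_type, source in columns.items()
        if code_type in CODE_TYPES
    }
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out = {code_type: row[source] for source, code_type in renames.items()}
        out.update({name: row[name] for name in metadata})
        rows.append(out)
    return rows


# Placeholder data: "CODE1" is not a real Read code.
sample_csv = "code,description,extra\nCODE1,Abdominal pain,ignored\n"
sample_columns = {"read2_code": "code", "metadata": ["description"]}

print(standardise_rows(sample_csv, sample_columns))
# [{'read2_code': 'CODE1', 'description': 'Abdominal pain'}]
```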
Mapping codes to concept sets
Entries in the "codes" object are mapped to their corresponding concept sets through the "concept_set" field.
- concept_set: Lists the concept sets to which all codes within this file will be assigned.
Additional preprocessing actions supported
In certain cases, such as when you wish to sub-divide a code list table or when a column contains multiple code types, additional processing is required. Add an "actions" object inside the file object.
Table with a sub-categorical column:
In order to sub-divide a table by a categorical column use the "divide_col" action
"actions":{
"divide_col": "MMCode"
}
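Conceptually, dividing by a categorical column groups the rows of a table by the distinct values in that column. The following sketch shows that idea on plain dictionaries; the function name and sample values are hypothetical, and how ACMC names or assigns the resulting sub-lists is not described here.

```python
from collections import defaultdict


def divide_by_column(rows: list, divide_col: str) -> dict:
    """Group rows by the value found in the categorical column `divide_col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[divide_col]].append(row)
    return dict(groups)


# Made-up rows; "MMCode" is the column name used in the example above.
rows = [
    {"MMCode": "1", "code": "CODE1"},
    {"MMCode": "2", "code": "CODE2"},
    {"MMCode": "1", "code": "CODE3"},
]

for category, members in divide_by_column(rows, "MMCode").items():
    print(category, [r["code"] for r in members])
# 1 ['CODE1', 'CODE3']
# 2 ['CODE2']
```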
Table with multiple code types in single column:
The column must be split into multiple columns so that each column holds only one code type.
- The "split_col" attribute names the categorical column indicating the code type of each row. The category names should replace the column names in the "columns" property.
- The "codes_col" attribute names the column that contains multiple code types.
"actions":{
"split_col":"coding_system",
"codes_col":"code"
},
"columns":{
"read2_code":"Read codes v2",
"med_code":"Med codes",
"icd10_code":"ICD10 codes",
"metadata":["description"]
},
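The split described above can be pictured as moving each code into a column named after its category, so that the category names line up with the values in "columns". This is a minimal sketch of that idea with a hypothetical function name and made-up placeholder codes; it is not taken from the ACMC source.

```python
def split_codes_column(rows: list, split_col: str, codes_col: str) -> list:
    """Move each code into a column named after its category in `split_col`."""
    out = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        # The category name becomes the column name holding the code.
        new_row[row[split_col]] = row[codes_col]
        out.append(new_row)
    return out


# Made-up rows using the category names from the example above.
rows = [
    {"coding_system": "Read codes v2", "code": "CODE1", "description": "first"},
    {"coding_system": "ICD10 codes", "code": "CODE2", "description": "second"},
]

for row in split_codes_column(rows, "coding_system", "code"):
    print(row)
# {'description': 'first', 'Read codes v2': 'CODE1'}
# {'description': 'second', 'ICD10 codes': 'CODE2'}
```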
Note: Large code lists with numerous phenotypes (e.g. Ho et al.) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code to generate it.
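As a rough idea of what generating the configuration programmatically can look like, the sketch below builds "files" entries from a mapping of file names to concept sets. It does not reproduce the notebook mentioned above; the function name, file paths and concept set names are all hypothetical.

```python
import json


def build_file_entries(files_to_concept_sets: dict, code_column: str,
                       code_type: str = "read2_code") -> list:
    """Build "files" entries in the structure used by the phenotype configuration."""
    return [
        {
            "file": file,
            "columns": {code_type: code_column},
            "concept_set": list(concept_sets),
        }
        for file, concept_sets in sorted(files_to_concept_sets.items())
    ]


# Made-up file names and concept set names, for illustration only.
entries = build_file_entries(
    {"lists/example_b.csv": ["EXAMPLE_B"], "lists/example_a.csv": ["EXAMPLE_A"]},
    code_column="code",
)
print(json.dumps(entries, indent=2))
```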
Usage - ACMC Command-Line Tool
Usage
The tool follows a structured command system:
python acmc.py <command> <subcommand> [options]
Available Commands
- trud – Manage TRUD components
- omop – Manage OMOP codes and database
- map – Process mapping configurations
TRUD Command
Install TRUD Components
acmc trud install -k <TRUD_API_KEY>
Options:
- -k, --api-key (required) – TRUD API key
OMOP Commands
Install OMOP Codes
acmc omop install -f <OMOP_FOLDER_PATH>
Options:
- -f, --omop-folder (required) – Path to extracted OMOP downloads folder
Clear OMOP Data
acmc omop clear
Removes OMOP data from the database.
Delete OMOP Database
acmc omop delete
Deletes the entire OMOP database.
MAP Commands
Process Phenotype Configuration
acmc map process -c <CONFIG_FILE> -s <SOURCE_CODES_DIR> -o <OUTPUT_DIR> -t <TARGET_CODING> [options]
Required Options:
- -c, --config-file – Path to the phenotype configuration file
- -s, --source-codes-dir – Root directory of source codes
- -o, --output-dir – Directory for CSV or OMOP database output
- -t, --target-coding – Target coding system (choices: read2, read3, icd10, snomed, opcs4)
Optional Flags:
- -tr, --translate – Enable code translation (default: disabled)
- -v, --verify – Enable code verification (default: disabled)
Optional Arguments:
- -l, --error-log – Filepath to save error log (default: error.csv)
Examples
Install TRUD Components
acmc trud install -k my-trud-api-key
Install OMOP Codes
acmc omop install -f /path/to/omop
Process Mapping Configuration with Read2 Target Coding
acmc map process -c config.json -s /data/source -o /data/output -t read2 --translate --verify
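When the tool is driven from a script, the command line can be assembled from the options documented above. The helper below is a hypothetical sketch that only builds the argument list (it does not run anything) and uses only the flags listed in this section.

```python
import shlex

# Target coding choices listed under "Required Options".
TARGET_CODINGS = {"read2", "read3", "icd10", "snomed", "opcs4"}


def build_map_process_command(config_file, source_codes_dir, output_dir,
                              target_coding, translate=False, verify=False,
                              error_log=None):
    """Assemble an `acmc map process` argument list from the documented options."""
    if target_coding not in TARGET_CODINGS:
        raise ValueError(f"Unsupported target coding: {target_coding}")
    cmd = ["acmc", "map", "process",
           "-c", config_file,
           "-s", source_codes_dir,
           "-o", output_dir,
           "-t", target_coding]
    if translate:
        cmd.append("--translate")
    if verify:
        cmd.append("--verify")
    if error_log:
        cmd += ["-l", error_log]
    return cmd


cmd = build_map_process_command("config.json", "/data/source", "/data/output",
                                "read2", translate=True, verify=True)
print(shlex.join(cmd))
# acmc map process -c config.json -s /data/source -o /data/output -t read2 --translate --verify
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)`, assuming the `acmc` command is available on the PATH.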
License
MIT License
Support
For issues, open a ticket in the repository or contact support@example.com.
Contributing
Commit to GitLab
git add .
git commit -m "my message ..."
git tag -a v1.0.0 -m "added features ..."
git push
Acknowledgements
This project was developed in the context of the MELD-B project, which is funded by the UK National Institute for Health and Care Research (NIHR) under grant agreement NIHR203988.
License
This work is licensed under the Apache License, Version 2.0.