Commit b613e078 authored by mjbonifa

updated readme; updated handling of build directories as these were not created automatically

parent d6a686c1
.gitignore

@@ -13,6 +13,7 @@ __pycache__
~$*
# Build
build/*
output/
concepts-output/
archive/
...
README.md

@@ -3,7 +3,7 @@
<img src="img/swansea-university-logo-vector.png" height="100" />
</center>

# A Tool for Automating the Curation of Medical Concepts derived from Coding Lists (ACMC)

### Jakub J. Dylag <sup>1</sup>, Roberta Chiovoloni <sup>3</sup>, Ashley Akbari <sup>3</sup>, Simon D. Fraser <sup>2</sup>, Michael J. Boniface <sup>1</sup>

@@ -16,60 +16,72 @@

### Citation

> Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing
## Introduction

This tool automates the verification, translation and organisation of medical coding lists that define phenotypes for inclusion criteria in cohort analysis. By processing externally sourced clinical inclusion criteria into actionable code lists, the tool ensures consistent and efficient curation of cohort definitions. These code lists can subsequently be used by data providers to construct study cohorts.
## Overview

### Workflow

The high-level steps for using the tool are outlined below:

**1. Define concept sets:** A domain expert defines a list of [concept sets](#concept-set-assigment), one for each observable characteristic of the phenotype, in CSV file format (e.g., `PHEN_concept_sets.csv`; a hypothetical example is sketched after this list).

**2. Define code lists for concept sets:** A domain expert defines [code lists](#???) for each concept set within the phenotype using supported coding list formats and stores them in the `/src` directory.

**3. Define mapping from code lists to concept sets:** A domain expert defines a [phenotype mapping](#???) that maps code lists to concept sets in JSON file format (`PHEN_assign_v3.json`).

**4. Generate versioned phenotype coding lists and translations:** A domain expert or data engineer processes the phenotype mappings [using the command line tool](#usage) to validate the code lists against NHS TRUD-registered codes and mappings, and to generate versioned concept set code lists with translations between coding standards.
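The exact schemas of `PHEN_concept_sets.csv` and `PHEN_assign_v3.json` are defined by the project; purely as a hypothetical illustration of steps 1 and 3, the sketch below invents a pair of concept set names, CSV columns and a JSON structure, and loads them with pandas and the standard library:

```python
# Hypothetical sketch only: the real schemas of PHEN_concept_sets.csv and
# PHEN_assign_v3.json are defined by the project, not by this example.
import io
import json

import pandas as pd

# Step 1: one concept set per observable characteristic of the phenotype
concept_sets = pd.read_csv(io.StringIO(
    "concept_set_name,description\n"
    "CKD,Chronic kidney disease\n"
    "DIABETES_T2,Type 2 diabetes mellitus\n"
))
print(concept_sets)

# Step 3: map code list files in /src to the concept sets defined above
phen_assign = json.loads("""
{
  "concept_sets": [
    {"name": "CKD", "files": ["src/ckd_read2.csv"]},
    {"name": "DIABETES_T2", "files": ["src/t2dm_read2.csv"]}
  ]
}
""")
for cs in phen_assign["concept_sets"]:
    print(cs["name"], "->", cs["files"])
```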
### Supported Medical Coding Standards

The tool supports verification and mapping across the diagnostic coding formats listed below; a short sketch of applying one of these translations follows the list of standards:
| Medical Code  | Verification | Translation to                    |
|---------------|--------------|-----------------------------------|
| Readv2        | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4, ATC |
| Readv3 (CTV3) | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4      |
| ICD10         | NHS TRUD     | None                              |
| SNOMED        | NHS TRUD     | None                              |
| OPCS4         | NHS TRUD     | None                              |
| ATC           | None         | None                              |
- [**Read V2:**](https://digital.nhs.uk/services/terminology-and-classifications/read-codes) NHS clinical terminology standard used in primary care and replaced by SNOMED-CT in 2018; still supported by some data providers, e.g. [SAIL Databank](https://saildatabank.com/), as it was widely used in primary care
- [**SNOMED-CT:**](https://www.snomed.org/) international clinical terminology standard for Electronic Healthcare Records, adopted by the NHS in 2018; mappings to Read codes are partially provided by the [Clinical Practice Research Datalink (CPRD)](https://www.cprd.com/) and the [NHS Technology Reference data Update Distribution (TRUD)](https://isd.digital.nhs.uk/trud)
- [**ICD-10:**](https://icd.who.int/browse10/2019/en) the International Classification of Diseases (ICD), a medical classification list from the World Health Organization (WHO) widely used in hospital settings, e.g. Hospital Episode Statistics (HES)
- [**ATC Codes:**](https://www.who.int/tools/atc-ddd-toolkit/atc-classification) the Anatomical Therapeutic Chemical (ATC) Classification, a drug classification list from the World Health Organization (WHO)
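As a minimal sketch of how one of the translations in the table above might be applied once the TRUD resources are installed (see Installation), the processed mapping tables can be joined against a source code list with pandas. The parquet path and column names follow the conventions used by `trud_api.py`; the input Read v2 codes here are invented:

```python
# Minimal sketch, assuming trud_api.py has already populated
# build/maps/processed/ (see Installation). The input codes are invented.
import pandas as pd

# One-way Read v2 -> SNOMED mapping table produced by trud_api.py
mapping = pd.read_parquet("build/maps/processed/read2_code_to_snomed_code.parquet")

# A source code list, e.g. loaded from a file in /src
codes = pd.DataFrame({"read2_code": ["C10E.", "G20.."]})

# Translate by joining on the source coding standard; unmapped codes become NaN
translated = codes.merge(mapping, on="read2_code", how="left")
print(translated)
```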
## Installation
1. **Set up Conda Environment:** Download and install Miniconda by following the instructions at [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html).
   - Run the following command to recreate the environment: `conda env create -f conda.yaml`
   - Activate the environment: `conda activate acmc`
2. **Sign Up:** Register at [NHS TRUD](https://isd.digital.nhs.uk/trud/user/guest/group/0/account/form)
3. **Subscribe** and accept the following licenses:
   - [NHS Read Browser](https://isd.digital.nhs.uk/trud/users/guest/filters/2/categories/9/items/8/releases)
   - [NHS Data Migration](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/9/items/9/releases)
   - https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/categories/8/items/9/releases
   - [ICD10 Edition 5 XML](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/categories/28/items/259/releases)
   - [OPCS-4.10 Data Files](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/10/items/119/releases)
   <!-- - [BNF/Snomed Mapping data.xlsx](https://www.nhsbsa.nhs.uk/prescription-data/understanding-our-data/bnf-snomed-mapping) -->
   Each data file has a "Subscribe" link that takes you to the licence. You will need to "Tell us about your subscription request" with a summary of why you need access to the data. Your subscription will not be approved immediately and will remain in the "pending" state until it is; approval usually takes less than 24 hours.
4. **Get API Key:** Retrieve your API key from [NHS TRUD Account Management](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/account/manage).
5. **Install TRUD:** Download and install the NHS TRUD medical code resources by executing the script: `python trud_api.py --key <API_KEY>`.
   Processed tables will be saved as `.parquet` files in the `build/maps/processed/` directory.
   - *Note: NHS TRUD defines one-way mappings and does <b>NOT ADVISE</b> reversing the mappings. If you still wish to reverse these into two-way mappings, duplicate the given `.parquet` table and reverse the filename (e.g. `read2_code_to_snomed_code.parquet` to `snomed_code_to_read2_code.parquet`), as sketched below.*
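A minimal sketch of the reversal described in the note, assuming a mapping has already been processed by `trud_api.py` into `build/maps/processed/`; whether reversing is appropriate is a modelling decision, as NHS TRUD advises against it:

```python
# Sketch only: duplicates a one-way mapping table under the reversed filename.
# NHS TRUD does not advise reversing mappings; use with caution.
import pandas as pd

src = "build/maps/processed/read2_code_to_snomed_code.parquet"
dst = "build/maps/processed/snomed_code_to_read2_code.parquet"

df = pd.read_parquet(src)
# Swap the column order so the target standard becomes the lookup key
df = df[["snomed_code", "read2_code"]].drop_duplicates()
df.to_parquet(dst, index=False)
```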
6. **Optional: Install OMOP Database:** Download and install OMOP vocabularies from [Athena OHDSI](https://athena.ohdsi.org/vocabulary/list).
   - Required vocabularies include:
     - 1) SNOMED
     - 2) ICD9CM
...
conda.yaml

name: acmc
channels:
  - conda-forge
dependencies:
...
trud_api.py

@@ -3,6 +3,7 @@ import sys
import requests
import json
import argparse
import shutil
from pathlib import Path
from base import bcolors
@@ -29,9 +30,10 @@ def get_releases(item_id, API_KEY, latest=False):
    url = f"https://{FQDN}/trud/api/v1/keys/{API_KEY}/items/{item_id}/releases"
    if latest:
        url += "?latest"
    response = requests.get(url)
    if response.status_code != 200:
        error_exit(f"Failed to fetch releases for item {item_id}. Status code: {response.status_code}, error {response.json()['message']}. If no releases were found for your API key, please ensure you are subscribed to the data release and that the subscription is not pending approval")
    data = response.json()
    if data.get("message") != "OK":
@@ -39,7 +41,7 @@ def get_releases(item_id, API_KEY, latest=False):
    return data.get("releases", [])
def download_release_file(item_id, release_ordinal, release, file_json_prefix, file_type=None, items_folder="build/maps/downloads"):
    """Download specified file type for a given release of an item."""
    file_type = file_type or file_json_prefix
    file_url = release.get(f"{file_json_prefix}FileUrl")
@@ -49,7 +51,7 @@ def download_release_file(item_id, release_ordinal, release, file_json_prefix, f
    if not file_url or not file_name:
        error_exit(f"Missing {file_type} file information for release {release_ordinal} of item {item_id}.")
    print(f"Downloading item {item_id} {file_type} file: {file_name} from {file_url} to {file_destination}")
    response = requests.get(file_url, stream=True)
    if response.status_code == 200:
@@ -68,137 +70,176 @@ def validate_download_hash(file_destination:str, item_hash:str):
    else:
        error_exit(f"Could not validate origin of {file_destination}. The SHA-256 hash should be: {item_hash}, but got {hash} instead")

def unzip_download(file_destination:str, items_folder="build/maps/downloads"):
    with zipfile.ZipFile(file_destination, 'r') as zip_ref:
        zip_ref.extractall(items_folder)
def extract_icd10():
    #ICD10_edition5
    file_path = Path('build') / 'maps' / 'downloads' / 'ICD10_Edition5_XML_20160401' / 'Content' / 'ICD10_Edition5_CodesAndTitlesAndMetadata_GB_20160401.xml'
    df = pd.read_xml(file_path)
    df = df[["CODE", "ALT_CODE", "DESCRIPTION"]]
    df = df.rename(columns={"CODE":"icd10_code",
                            "ALT_CODE":"icd10_alt_code",
                            "DESCRIPTION":"description"})
    df.to_parquet("build/maps/processed/icd10_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/icd10_code.parquet")
def extract_opsc4():
    file_path = Path('build') / 'maps' / 'downloads' / 'OPCS410 Data files txt' / 'OPCS410 CodesAndTitles Nov 2022 V1.0.txt'
    df = pd.read_csv(file_path, sep='\t', dtype=str, header=None)
    df = df.rename(columns={0:"opcs4_code", 1:"description"})
    df.to_parquet("build/maps/processed/opcs4_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/opcs4_code.parquet")
def extract_nhs_data_migrations():
    #NHS Data Migrations
    file_path = Path('build') / 'maps' / 'downloads' / 'Mapping Tables' / 'Updated' / 'Clinically Assured' / 'sctcremap_uk_20200401000001.txt'
    #snomed only
    df = pd.read_csv(file_path, sep='\t')
    df = df[["SCT_CONCEPTID"]]
    df = df.rename(columns={"SCT_CONCEPTID":"snomed_code"})
    df = df.drop_duplicates()
    df = df.astype(str)
    df.to_parquet("build/maps/processed/snomed_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/snomed_code.parquet")

    #r2 -> r3
    file_path = Path('build') / 'maps' / 'downloads' / 'Mapping Tables' / 'Updated' / 'Clinically Assured' / 'rctctv3map_uk_20200401000001.txt'
    df = pd.read_csv(file_path, sep='\t')
    df = df[["V2_CONCEPTID", "CTV3_CONCEPTID"]]
    df = df.rename(columns={"V2_CONCEPTID":"read2_code",
                            "CTV3_CONCEPTID":"read3_code"})
    df.to_parquet("build/maps/processed/read2_code_to_read3_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read2_code_to_read3_code.parquet")

    #r3->r2
    file_path = Path('build') / 'maps' / 'downloads' / 'Mapping Tables' / 'Updated' / 'Clinically Assured' / 'ctv3rctmap_uk_20200401000002.txt'
    df = pd.read_csv(file_path, sep='\t')
    df = df[["CTV3_CONCEPTID", "V2_CONCEPTID"]]
    df = df.rename(columns={"CTV3_CONCEPTID":"read3_code",
                            "V2_CONCEPTID":"read2_code"})
    df = df.drop_duplicates()
    df = df[~df["read2_code"].str.match("^.*_.*$")] #remove r2 codes with '_'
    df.to_parquet("build/maps/processed/read3_code_to_read2_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read3_code_to_read2_code.parquet")

    #r2 -> snomed
    file_path = Path('build') / 'maps' / 'downloads' / 'Mapping Tables' / 'Updated' / 'Clinically Assured' / 'rcsctmap2_uk_20200401000001.txt'
    df = pd.read_csv(file_path, sep='\t', dtype=str)
    df = df[["ReadCode", "ConceptId"]]
    df = df.rename(columns={"ReadCode":"read2_code",
                            "ConceptId":"snomed_code"})
    df.to_parquet("build/maps/processed/read2_code_to_snomed_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read2_code_to_snomed_code.parquet")
    #r3->snomed
    file_path = Path('build') / 'maps' / 'downloads' / 'Mapping Tables' / 'Updated' / 'Clinically Assured' / 'ctv3sctmap2_uk_20200401000001.txt'
    df = pd.read_csv(file_path, sep='\t')
    df = df[["CTV3_TERMID", "SCT_CONCEPTID"]]
    df = df.rename(columns={"CTV3_TERMID":"read3_code",
                            "SCT_CONCEPTID":"snomed_code"})
    df["snomed_code"] = df["snomed_code"].astype(str)
    df = df[~df["snomed_code"].str.match("^.*_.*$")] #remove snomed codes with '_'
    df.to_parquet("build/maps/processed/read3_code_to_snomed_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read3_code_to_snomed_code.parquet")
def extract_nhs_read_browser():
    #r2 only
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V2/ANCESTOR.DBF').to_dataframe()
    df = pd.concat([df['READCODE'], df['DESCENDANT']])
    df = pd.DataFrame(df.drop_duplicates())
    df = df.rename(columns={0:"read2_code"})
    df.to_parquet("build/maps/processed/read2_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read2_code.parquet")

    #r2 -> atc
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V2/ATC.DBF').to_dataframe()
    df = df[["READCODE", "ATC"]]
    df = df.rename(columns={"READCODE":"read2_code", "ATC":"atc_code"})
    df.to_parquet("build/maps/processed/read2_code_to_atc_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read2_code_to_atc_code.parquet")

    #r2 -> icd10
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V2/ICD10.DBF').to_dataframe()
    df = df[["READ_CODE", "TARG_CODE"]]
    df = df.rename(columns={"READ_CODE":"read2_code", "TARG_CODE":"icd10_code"})
    df = df[~df["icd10_code"].str.match("^.*-.*$")] #remove codes with '-'
    df = df[~df["read2_code"].str.match("^.*-.*$")] #remove codes with '-'
    df.to_parquet("build/maps/processed/read2_code_to_icd10_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read2_code_to_icd10_code.parquet")

    #r2 -> opcs4
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V2/OPCS4V3.DBF').to_dataframe()
    df = df[["READ_CODE", "TARG_CODE"]]
    df = df.rename(columns={"READ_CODE":"read2_code", "TARG_CODE":"opcs4_code"})
    df = df[~df["opcs4_code"].str.match("^.*-.*$")] #remove codes with '-'
    df = df[~df["read2_code"].str.match("^.*-.*$")] #remove codes with '-'
    df.to_parquet("build/maps/processed/read2_code_to_opcs4_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read2_code_to_opcs4_code.parquet")

    #r3 only
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V3/ANCESTOR.DBF').to_dataframe()
    df = pd.concat([df['READCODE'], df['DESCENDANT']])
    df = pd.DataFrame(df.drop_duplicates())
    df = df.rename(columns={0:"read3_code"})
    df.to_parquet("build/maps/processed/read3_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read3_code.parquet")

    #r3 -> icd10
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V3/ICD10.DBF').to_dataframe()
    df = df[["READ_CODE", "TARG_CODE"]]
    df = df.rename(columns={"READ_CODE":"read3_code", "TARG_CODE":"icd10_code"})
    df = df[~df["icd10_code"].str.match("^.*-.*$")] #remove codes with '-'
    df = df[~df["read3_code"].str.match("^.*-.*$")] #remove codes with '-'
    df.to_parquet("build/maps/processed/read3_code_to_icd10_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read3_code_to_icd10_code.parquet")

    #r3 -> icd9
    # dbf = simpledbf.Dbf5('build/maps/downloads/Standard/V3/ICD9V3.DBF')

    #r3 -> opcs4
    df = simpledbf.Dbf5('build/maps/downloads/Standard/V3/OPCS4V3.DBF').to_dataframe()
    df = df[["READ_CODE", "TARG_CODE"]]
    df = df.rename(columns={"READ_CODE":"read3_code", "TARG_CODE":"opcs4_code"})
    df = df[~df["opcs4_code"].str.match("^.*-.*$")] #remove codes with '-'
    df = df[~df["read3_code"].str.match("^.*-.*$")] #remove codes with '-'
    df.to_parquet("build/maps/processed/read3_code_to_opcs4_code.parquet", index=False)
    print("Extracted ", "build/maps/processed/read3_code_to_opcs4_code.parquet")
def create_build_directories(build_dir='build'):
    """Create build directories."""
    build_path = Path(build_dir)

    # Check if build directory exists
    create_build_dirs = False
    if build_path.exists() and build_path.is_dir():
        user_input = input(f"The build directory {build_path} already exists. Do you want to delete and recreate all data? (y/n): ").strip().lower()
        if user_input == "y":
            # delete all build files
            shutil.rmtree(build_path)
            create_build_dirs = True
    else:
        create_build_dirs = True

    if create_build_dirs:
        # create build directory
        build_path.mkdir(parents=True, exist_ok=True)

        # create maps directories
        maps_path = build_path / 'maps'
        maps_path.mkdir(parents=True, exist_ok=True)
        maps_download_path = maps_path / 'downloads'
        maps_download_path.mkdir(parents=True, exist_ok=True)
        maps_processed_path = maps_path / 'processed'
        maps_processed_path.mkdir(parents=True, exist_ok=True)
def main():
    parser = argparse.ArgumentParser(
@@ -214,8 +255,10 @@ def main():
    args = parser.parse_args()

    create_build_directories()

    items_latest = True
    items_folder = "build/maps/downloads"
    items = [
        {
            "id": 259,
...