Re-write README

Commit 3876e7cd authored 6 months ago by Jakub Dylag
--- a/README.md
+++ b/README.md
@@ -10,31 +10,32 @@
 <sup>1</sup> Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton<br>
 <sup>2</sup> School of Primary Care Population Sciences and Medical Education, University of Southampton <br>
 <sup>3</sup> Population Data Science, Swansea University Medical School, Faculty of Medicine, Health & Life Science, Swansea University <br>
-<br>
-<sup>*</sup>Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk

-### 🖋 How to cite this work
+*Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk*
+
+### Citation
 > Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meldb/concepts-processing

-## 🙌 Introduction 
-This project generate the medical coding lists that defines cohort phenotypes used for inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria. 

-The output code list is then used by data providers to select MELD-B cohorts. 
+## Introduction 
+This tool automates the verification, translation and organisation of medical coding lists defining cohort phenotypes for inclusion criteria. By processing externally sourced clinical inclusion criteria into actionable code lists, this tool ensures consistent and efficient curation of cohort definitions. These code lists can be subsequently used by data providers (e.g. SAIL) to construct study cohorts.
+

-## 📃 Method
+## Methods

-### Process
-1. Approved MELB-B concepts are defined in a CSV spreadsheet (currently PHEN_summary_working.csv).
-2. Imported Code Lists in `/src` are verified against all NHS TRUD registered codes
-3. Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`. 
+### Workflow Overview
+1. Approved MELD-B concepts are outlined in a CSV spreadsheet (e.g., `PHEN_summary_working.csv`).
+2. Imported code lists in the `/src` directory are validated against NHS TRUD-registered codes.
+3. Mappings from imported code lists to outputted MELD-B concepts are defined in the `PHEN_assign_v3.json` file.
 	- See "JSON Phenotype Mapping" section for more details 
-4. Process is executed from command line either manually or from bash script `run.sh` 
-	- See "Usage" section for more details 
-5. Output Concept Code Lists are saved to the `/concepts` git repository and any changes are tracked.
-6. Output Concept Code Lists can be exported into SAIL or any other Data Bank 
+4. The process is executed from the command line. Refer to the "Usage" section for execution instructions.
+5. Outputted concept code lists are saved to the `/concepts` Git repository, with all changes tracked.
+6. The code lists can be exported to SAIL or any other Data Bank.
+
+### Supported Medical Coding Standards
+The tool supports verification and mapping across various diagnostic coding formats:

-### Medical Coding Standards Supported
-| Code Type     | Verification | Maps to                           |
+| Medical Code  | Verification | Translation to                    |
 |---------------|--------------|-----------------------------------|
 | Readv2        | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4, ATC |
 | Readv3 (CTV3) | NHS TRUD     | Readv3, SNOMED, ICD10, OPCS4      |
@@ -43,53 +44,51 @@ The output code list is then used by data providers to select MELD-B cohorts.
 | OPCS4         | NHS TRUD     |                                   |
 | ATC           | None         |                                   |

-MELD-B refers to various diagnostic code formats included in target datasets. 
-* Read V2 
-	* Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
-	* SAIL only supports five character read codes V2 
-* SNOMED-CT was adopted by the NHS around 2018
-	* CPRD AURUM uses SNOWMED codes and include mapping to read codes but no other database (CPRD Gold, SAIL) does.
-	* Mappings exist from SNOWMED to Read codes, some provided by CPRD and others NHS Trud
-* ICD-10 are codes used in hospital settings and are importnat for the HES linked datasets. 
-* ATC codes are interntionally accepted for the classification of medicinces and maintained by the WHO.
-
-## ⚙️ Setup
-
-### Code Translation Tables
-1. Due to the licencing of NHS TRUD resources, you <mark>MUST first [Sign Up](https://isd.digital.nhs.uk/trud/user/guest/filters/0/account/form) to NHS TRUD and accept the following licences</mark>:
-	- [nhs_readbrowser_25.0.0_20180401000001](https://isd.digital.nhs.uk/trud/users/guest/filters/2/categories/9/items/8/releases)
-	- [nhs_datamigration_29.0.0_20200401000001](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/9/items/9/releases)
-	- [ICD10_Edition5_XML_20160401](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/categories/28/items/259/releases)  
-	- [OPCS-4.10 Data files](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/10/items/119/releases)
-	<!-- - [BNF/Snomed Mapping data.xlsx](https://www.nhsbsa.nhs.uk/prescription-data/understanding-our-data/bnf-snomed-mapping) -->
-	
-2. Once all licences are accepted, get your [API Key](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/account/manage) for NHS TRUD. 
+#### Notes on Code Systems:
+- **Read V2:** Replaced by SNOMED-CT from around 2018, but still supported by SAIL (restricted to five-character codes).
+- **SNOMED-CT:** Adopted by the NHS from around 2018; mappings to Read codes are provided partly by CPRD and partly by NHS TRUD.
+- **ICD-10:** Widely used in hospital settings and critical for HES-linked datasets.
+- **ATC Codes:** Maintained by WHO and used internationally for medication classification.

-3. Finally, run the automated extraction script, inputting your API Key to granty temporary access to the resources above. Use the command `python trud_api.py --key <INSERT KEY>` (replacing your key in the marked area).
-	- The convertion Tables will be saved as `.parquet` tables in the folder `maps/processed/`.
-	- NHS TRUD defines one-way mappings and does <b>NOT ADVISE</b> reversing the mappings. If you still wish to reverse these into two-way mappings, duplicate the given `.parquet` table and reverse the filename (e.g. `read2_code_to_snomed_code.parquet` to `snomed_code_to_read2_code.parquet`)
+## Installation

-4. Populate the SQLite3 database with OMOP Vocabularies. These can be download from https://athena.ohdsi.org/vocabulary/list.
-	-  Install the following vocabularies by ticking the box:
-		- 1-SNOMED
-		- 2-ICD9CM
-		- 17-Readv2
-		- 21-ATC
-		- 55-OPCS4
-		- 57-HES Specialty
-		- 70-ICD10CM
-		- 75-dm+d
-		- 144-UK Biobank
-		- 154-NHS Ethnic Category
-		- 155-NHS Place of Service
-	- Use the command `python omop_api.py --install <INSERT PATH>` to load vocabularies into database (insert your own path to unzipped download folder).  
-
-### JSON phenotype mapping
-
-Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are defined in JSON format within `PHEN_assign_v3.json`. 
+1. **Sign Up:** Register at [NHS TRUD](https://isd.digital.nhs.uk/trud/user/guest/group/0/account/form) and accept the following licenses:
+   - [NHS Read Browser](https://isd.digital.nhs.uk/trud/users/guest/filters/2/categories/9/items/8/releases)
+   - [NHS Data Migration](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/9/items/9/releases)
+   - [ICD10 Edition 5 XML](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/categories/28/items/259/releases)
+   - [OPCS-4.10 Data Files](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/10/items/119/releases)
+   	<!-- - [BNF/Snomed Mapping data.xlsx](https://www.nhsbsa.nhs.uk/prescription-data/understanding-our-data/bnf-snomed-mapping) -->
 	
-#### Defining the Strucutre for Folders and Files:
-```
+2. **Obtain API Key:** Retrieve your API key from [NHS TRUD Account Management](https://isd.digital.nhs.uk/trud/users/authenticated/filters/0/account/manage).
+
+3. **Install TRUD:** Download and install the NHS TRUD medical code resources.
+Run the script with the command `python trud_api.py --key <API_KEY>`.
+Processed tables will be saved as `.parquet` files in the `maps/processed/` directory.
+	- *Note: NHS TRUD defines one-way mappings and does <b>NOT ADVISE</b> reversing the mappings. If you still wish to reverse these into two-way mappings, duplicate the given `.parquet` table and reverse the filename (e.g. `read2_code_to_snomed_code.parquet` to `snomed_code_to_read2_code.parquet`)*
+
+4. **Optional: Install OMOP Database:** Download and install OMOP vocabularies from [Athena OHDSI](https://athena.ohdsi.org/vocabulary/list).
+	- Required vocabularies include:
+   		- 1) SNOMED
+		- 2) ICD9CM
+		- 17) Readv2
+		- 21) ATC
+		- 55) OPCS4
+		- 57) HES Specialty
+		- 70) ICD10CM
+		- 75) dm+d
+		- 144) UK Biobank
+		- 154) NHS Ethnic Category
+		- 155) NHS Place of Service
+   - Unzip the downloaded folder and copy its path.  
+   - Install vocabularies using:  
+     `python omop_api.py --install <PATH_TO_DOWNLOADED_FILES>`
+
+## Configuration
+
+The mappings from imported code lists to outputted MELD-B concept code lists are defined in JSON format in `PHEN_assign_v3.json`.
+
+### Folder and File Definitions:
+```json
 "folder":"codes/Medication code source",
 "description":"Medication Codes - downloaded 15/12/23",
 "files": [
@@ -99,32 +98,35 @@ Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are de
 ]
 ```

-#### Define Column Code Types
-```
+### Columns in Files:
+```json
 "columns":{
 	"read2_code":"READCODE",
 	"metadata":["DESCRIPTION"]
 },
 ```

-#### Define Concepts to be mapped to
-```
+### Concept Set Assignment
+```json
 "meldb_phenotypes": ["ALL_MEDICATIONS"]
 ```

-#### Actions: Additional preprocessing (if required):
- In certain cases where you wish to sub-divde a code list table or a column features multiple code types additional processing is required. Add a `action` object inside of the `file` object.
+### Additional preprocessing (if required):
+In certain cases, such as when you wish to sub-divide a code list table or when a column contains multiple code types, additional processing is required. Add an `actions` object inside the `file` object.

- Table with a sub-categorical column:
-	- In order to sub-divide a table by a categorical column use the "divide_col" action
-	- e.g. ``` "actions":{"divide_col": "MMCode"}```
+#### Table with a sub-categorical column:
+To sub-divide a table by a categorical column, use the "divide_col" action:
+```json
+"actions":{
+	"divide_col": "MMCode"
+}
+```

- Table with multiple code types in single column:
-	- Need to split column into multiple columns, so only one code type per column.
+#### Table with multiple code types in single column:
+The column must be split into multiple columns so that each column holds only one code type.
 - The "split_col" attribute is the categorical column indicating the code type in that row. The <b>category names should replace the column names</b> in the `columns` property.
 - The "codes_col" attribute is the code column that contains multiple code types.
-	- e.g. 
-	```
+```json
 "actions":{
 	"split_col":"coding_system",
 	"codes_col":"code"
@@ -137,40 +139,38 @@ Mappings from Imported Code Lists to Outputted MELD-B Concept's Code list are de
 },
 ```

+*<b>Large code lists</b> with numerous phenotypes (e.g. Ho et al.) require a lot of JSON to be generated. See the "Ho generate JSON" section in `process_codes_WP.ipynb` for example code that generates it.*

-*<b>Large Code lists</b> with numerous phenotypes (e.g. Ho et al), require lots of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code to generate 


-
-## ⚡ Usage
+## Usage
 The script preprocesses code lists and maps them to the given concepts/phenotypes.

-### Execution (Bash Script)
-`bash ./run.sh`
-
-### Execution (Shell Command)
-usage: `python main.py [-h] [-r2] [-r3] [-i] [-s] [-o] [-a] [-m] [-c] [--no-translate] [--no-verify] [--output] [--error-log] mapping_file`
+### Execute Command Line
+Execute via shell with customizable parameters:
+```bash
+python main.py [OPTIONS] mapping_file
+```
+usage: `python main.py [-h] [-r2] [-r3] [-i] [-s] [-o] [-a] [--no-translate] [--no-verify] [--output] [--error-log] mapping_file`

-positional arguments:
+**Required Arguments:**
  - `mapping_file`         Concept/Phenotype Assignment File (json)
+  - `--output`       Filepath to save output to CSV or OMOP SQLite Database

-optional arguments:
+**Optional Arguments:**
  - `-r2`, `--read2-code`  Read V2 Codes Column name in Source File
  - `-r3`, `--read3-code`  Read V3 Codes Column name in Source File
  - `-i`, `--icd10-code`  ICD10 Codes Column name in Source File
  - `-s`, `--snomed-code`  SNOMED Codes Column name in Source File
  - `-o`, `--opcs4-code`  OPCS4 Codes Column name in Source File
  - `-a`, `--atc-code`  ATC Codes Column name in Source File
-  - `-m`, `--med-code`  Med Codes Column name in Source File
-  - `-c`, `--cprd-code`  CPRD Product Codes Column name in Source File 
  - `--no-translate`     Do not translate code types
  - `--no-verify`    Do not verify codes are correct 
-  - `--output`       Filepath to save output to
  - `--error-log`    Filepath to save error log to

 > **_EXAMPLE:_**  `python main.py PHEN_assign_v3.json -r2 --output output/MELD_concepts_readv2.csv --error-log output/MELD_errors.csv`

-## ❤️ Contributing
+## Contributing

 ### Commit to GitLab
 ```
@@ -180,12 +180,11 @@ git tag -a v1.0.0 -m "added features ..."
 git push
 ```

-## 🏦 Funding 
-This project has received funding from the National Institute of Health Research under grant agreement NIHR203988.
+## Acknowledgements  
+This project was developed in the context of the [MELD-B](https://www.southampton.ac.uk/publicpolicy/support-for-policymakers/policy-projects/Current%20projects/meld-b.page) project, which is funded by the UK [National Institute for Health and Care Research](https://www.nihr.ac.uk/) under grant agreement NIHR203988.

-<img src="img/nihr-logo-1200-375.jpg" height="100" />
+## License
+This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

-## ⚖️ License
 ![apache2](https://img.shields.io/github/license/saltstack/salt)

-This work is licensed under a [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
\ No newline at end of file
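
The "divide_col" and "split_col" actions described in the Configuration part of the diff can be pictured with a small Python sketch. This is not the tool's implementation, which the diff does not show; it only illustrates one plausible reading of the two actions on in-memory rows. The function names, the sample column names, and the placeholder values (`TYPE_X`, `CODE_A`, and so on) are all hypothetical.

```python
from collections import defaultdict


def divide_by_column(rows, divide_col):
    """Illustrative 'divide_col': group rows into sub-tables keyed by the
    value found in the categorical column `divide_col`."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[divide_col]].append(row)
    return dict(tables)


def split_codes_by_type(rows, split_col, codes_col):
    """Illustrative 'split_col' / 'codes_col': move each code into a column
    named after its code type, so each column holds one code type only.

    Assumption: the category value in `split_col` is used directly as the
    new column name.
    """
    result = []
    for row in rows:
        # Keep every field except the two columns being restructured.
        new_row = {k: v for k, v in row.items() if k not in (split_col, codes_col)}
        new_row[row[split_col]] = row[codes_col]
        result.append(new_row)
    return result


# Hypothetical rows with placeholder values (not real medical codes).
rows = [
    {"coding_system": "TYPE_X", "code": "CODE_A", "group": "G1"},
    {"coding_system": "TYPE_Y", "code": "CODE_B", "group": "G1"},
    {"coding_system": "TYPE_X", "code": "CODE_C", "group": "G2"},
]

divided = divide_by_column(rows, "group")
split_rows = split_codes_by_type(rows, "coding_system", "code")
```

Under these assumptions, `divided` holds one sub-table per value of `group`, and each row in `split_rows` carries its code under a column named after its code type, which is the form the `columns` mapping would then refer to.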