From b66f3d89edeff4d1655943134a074752c61e69df Mon Sep 17 00:00:00 2001
From: Jakub Dylag <jjd1c23@soton.ac.uk>
Date: Thu, 18 Apr 2024 23:10:33 +0100
Subject: [PATCH] Updated README

---
 README.md | 121 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 82 insertions(+), 39 deletions(-)

diff --git a/README.md b/README.md
index 0abd9fd..7886583 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@
 <sup>*</sup>Correspondence to: Jakub J. Dylag, Digital Health and Biomedical Engineering, School of Electronics and Computer Science, Faculty of Engineering and Physical Sciences, University of Southampton, J.J.Dylag@soton.ac.uk
 
 ### 🖋 How to cite this work
-> Dylag J. J., Chiovoloni R., Akbari A., Fraser S. D., Boniface M. J., A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. May 2024. https://git.soton.ac.uk/meld/meldb/concepts-processing
+> Dylag JJ, Chiovoloni R, Akbari A, Fraser SD, Boniface MJ. A Tool for Automating the Curation of Medical Concepts derived from Coding Lists. GitLab [Internet]. May 2024. Available from: https://git.soton.ac.uk/meld/meldb/concepts-processing
 
 ## 🙌 Introduction 
 This project generate the medical coding lists that defines cohort phenotypes used for inclusion criteria in MELD-B. The goal is to automatically prepare a code list from an approved clinical specification of inclusion criteria. 
@@ -24,18 +24,14 @@ The output code list is then used by data providers to select MELD-B cohorts.
 ## 📃 Method
 
 ### Process
-1. MELB-B conditions are defined in a Excel spreadsheet (currently CONC_summary_working.xlsx).
-2. 
-2. Each sheet in the file includes a mapping from a MELD-B condition to source MLTC condition that is then associated with a diagnostic code list
-3. Each sheet is processed to create a master code list
-	* READ_CODE: full 7 character read code id
-	* CPRD_GOLD_MEDICAL_CODE_ID: CPRD GOLD medical code id
-	* CPRD_AURUM_MEDICAL_CODE_ID: CPRD AURUM medical code id
-	* DESCRIPTION: diagnosis description 
-	* MELDB_CONDITION: meld b multimorbidity label
-	* DATABASE: list of databases mapped
-	* SOURCE: list of sources mapped
-	* any other meta data columns (including descriptions etc)
+1. Approved MELD-B concepts are defined in an Excel spreadsheet (currently CONC_summary_working.xlsx).
+2. Imported Code Lists in `/codes` are verified against all NHS TRUD registered codes
+3. Mappings from Imported Code Lists to the Output MELD-B Concept Code Lists are defined in JSON format within `PHEN_assign_v3.json`.
+	- See "JSON Phenotype Mapping" section for more details 
+4. The process is executed from the command line, either manually or from the bash script `run.sh`.
+	- See "Usage" section for more details 
+5. Output Concept Code Lists are saved to the `/concepts` git repository and any changes are tracked.
+6. Output Concept Code Lists can be exported into SAIL or any other Data Bank 
 
 ### Medical Coding Standards Supported
 | Code Type     | Verification | Maps to                           |
@@ -51,11 +47,11 @@ The output code list is then used by data providers to select MELD-B cohorts.
 
 MELD-B refers to various diagnostic code formats included in target datasets. 
 * Read V2 
-  * Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
-  * SAIL only supports five character read codes V2 
+	* Read codes were used widely in primary care but were replaced by SNOMED-CT from around 2018 https://isd.digital.nhs.uk/trud/user/guest/group/0/pack/9
+	* SAIL only supports five-character Read V2 codes
 * SNOMED-CT was adopted by the NHS around 2018
-  * CPRD AURUM uses SNOWMED codes and include mapping to read codes but no other database (CPRD Gold, SAIL) does.
-  * Mappings exist from SNOWMED to Read codes, some provided by CPRD and others NHS Trud
+	* CPRD AURUM uses SNOMED codes and includes mappings to Read codes, but no other database (CPRD Gold, SAIL) does.
+	* Mappings exist from SNOMED to Read codes, some provided by CPRD and others by NHS TRUD
 * ICD-10 are codes used in hospital settings and are importnat for the HES linked datasets. 
 * ATC codes are interntionally accepted for the classification of medicinces and maintained by the WHO.
 
@@ -65,12 +61,12 @@ MELD-B refers to various diagnostic code formats included in target datasets.
 MELD-B has defined a set of phenotypes for MLTC conditions that are considered burdensome. Each Phenotype includes one or more diagnosis.  
 
 * Ho et al - https://cronfa.swan.ac.uk/Record/cronfa60877/Download/60877__25107__3700915cf20e418aae714e5639722449.pdf 
-  * The diagnositc codes have been mapped by SAIL to by the ThinkingGroup https://github.com/THINKINGGroup/phenotypes which has been replicated by the RSF https://github.com/aim-rsf/phenotypes
-  * Azcoaga-Lorenzo, A., Akbari, A., Davies, J., Khunti, K., Kadam, U.T., Lyons, R., McCowan, C., Mercer, S.W., Nirantharakumar, K., Staniszewska, S. and Guthrie, B., 2022. Measuring multimorbidity in research: Delphi consensus study. BMJ medicine, 1(1), p.e000247.
+	* The diagnostic codes have been mapped to SAIL by the ThinkingGroup https://github.com/THINKINGGroup/phenotypes which has been replicated by the RSF https://github.com/aim-rsf/phenotypes
+	* Azcoaga-Lorenzo, A., Akbari, A., Davies, J., Khunti, K., Kadam, U.T., Lyons, R., McCowan, C., Mercer, S.W., Nirantharakumar, K., Staniszewska, S. and Guthrie, B., 2022. Measuring multimorbidity in research: Delphi consensus study. BMJ medicine, 1(1), p.e000247.
 
 * Hanlon et al - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8901063/#pmed.1003931.s005
-  * https://github.com/dmcalli2/dynamic_protocols/blob/master/defining_comorbidities_SAIL.md??
-  * Hanlon, P., Jani, B.D., Nicholl, B., Lewsey, J., McAllister, D.A. and Mair, F.S., 2022. Associations between multimorbidity and adverse health outcomes in UK Biobank and the SAIL Databank: A comparison of longitudinal cohort studies. PLoS Medicine, 19(3), p.e1003931.
+	* https://github.com/dmcalli2/dynamic_protocols/blob/master/defining_comorbidities_SAIL.md??
+	* Hanlon, P., Jani, B.D., Nicholl, B., Lewsey, J., McAllister, D.A. and Mair, F.S., 2022. Associations between multimorbidity and adverse health outcomes in UK Biobank and the SAIL Databank: A comparison of longitudinal cohort studies. PLoS Medicine, 19(3), p.e1003931.
 
 - ClinicalCodes Project, University of Manchester - https://clinicalcodes.rss.mhs.man.ac.uk/
 
@@ -80,7 +76,7 @@ MELD-B has defined a set of phenotypes for MLTC conditions that are considered b
 - Gilbert et al (for Frailty Secondary Care) - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946808/
     - Gilbert T, Neuburger J, Kraindler J, Keeble E, Smith P, Ariti C, Arora S, Street A, Parker S, Roberts HC, Bardsley M, Conroy S. Development and validation of a Hospital Frailty Risk Score focusing on older people in acute care settings using electronic hospital records: an observational study. Lancet. 2018 May 5;391(10132):1775-1782. doi: 10.1016/S0140-6736(18)30668-8. Epub 2018 Apr 26. PMID: 29706364; PMCID: PMC5946808.
 
-- Abbasizanjani et al (for Long Covid) https://www.adruk.org/news-publications/publications-reports/data-insight-clinical-coding-and-capture-of-long-covid-a-cohort-study-in-wales-using-linked-health-and-demographic-data-824/
+- Abbasizanjani et al (for Long Covid) https://www.adruk.org/news-publications/publications-reports/data-insight-clinical-coding-and-capture-of-long-covid-a-cohort-study-in-wales-using-linked-health-and-demographic-data-824
     - Abbasizanjani H, Bedston S, Robinson L, Curds M, Akbari A. Clinical coding and capture of Long COVID: a cohort study in Wales using linked health and demographic data. ADR Wales Data Insight. August 2023. https://adrwales.org/wp-content/uploads/2023/08/Clinical-coding-and-capture-of-Long-COVID.pdf
 
 
@@ -112,29 +108,76 @@ MELD-B uses drug codes as a proxy indicator of burden. This codes are derived fr
 
 ## ⚙️ Setup
 
-1. Delete corrupted files: `bash import.sh`
+- Run `bash import.sh` to delete corrupted files that cannot be read
 
-### Code Translationg Tables
+### Code Translation Tables
+1. Due to the licensing of NHS TRUD coding tables, the following resources <mark>must be downloaded separately</mark>:
+	- [nhs_readbrowser_25.0.0_20180401000001](https://isd.digital.nhs.uk/trud/users/guest/filters/2/categories/9/items/8/releases)
+	- [nhs_datamigration_29.0.0_20200401000001](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/9/items/9/releases)
+	- [ICD10_Edition5_XML_20160401](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/28/items/258/releases?source=summary)  
+	- [OPCS-4.10 Data files](https://isd.digital.nhs.uk/trud/users/guest/filters/0/categories/10/items/119/releases)
+	- [BNF/Snomed Mapping data.xlsx](https://www.nhsbsa.nhs.uk/prescription-data/understanding-our-data/bnf-snomed-mapping)
 
-Due to the licencing of NHS TRUD coding tables, the following 
-
-2. Update Convertion Tables:
+2. Next, prepare the conversion tables by saving them as `.parquet` files.
 	- See "Mappings" section in process_codes_WP.ipynb to generate table with appropriate name
-	- For reversible convertions create a duplicate table with the name reversed
+	- For reversible conversions, create a duplicate table with the name reversed. However, be aware that this is <b>NOT ADVISED</b> and goes against NHS guidance.
 
 ### JSON phenotype mapping
 
-3. Update JSON Codes List
-	- Manually Edit the PHEN_asssign_v2.json
-	- Use "Ho generate JSON" section in process_codes_WP.ipynb to generate JSON for Ho 
-  - Cases which require additional preprocessing
-    <!-- - Large Table with sub-categorical column
-        - Need to split table by categorical column
-        - Then read each categorical file individually
-      - USE "divide_col" action
-    - Table with multiple code types in single column
-        - Need to split column into multiple columns, so only one code type per column
-      - USE "split_col" action -->
+Mappings from Imported Code Lists to the Output MELD-B Concept Code Lists are defined in JSON format within `PHEN_assign_v3.json`.
+
+#### Defining the Structure for Folders and Files:
+```
+"folder":"codes/Medication code source",
+"description":"Medication Codes - downloaded 15/12/23",
+"files": [
+		{
+			"file":"WP02_SAIL_WILK_matched_drug_codes_with_categories.xlsx"
+		}
+]
+```
+
+#### Define Column Code Types
+```
+"columns":{
+	"read2_code":"READCODE",
+	"metadata":["DESCRIPTION"]
+},
+```
+
+#### Define Concepts to be mapped to
+```
+"meldb_phenotypes": ["ALL_MEDICATIONS"]
+```
+
+#### Actions: Additional preprocessing (if required):
+- Additional preprocessing is required when you wish to subdivide a code list table, or when a single column contains multiple code types. Add an `actions` object inside the `file` object.
+
+- Table with a sub-categorical column:
+	- In order to sub-divide a table by a categorical column use the "divide_col" action
+	- e.g. ``` "actions":{"divide_col": "MMCode"}```
+
+- Table with multiple code types in single column:
+	- Need to split column into multiple columns, so only one code type per column.
+	- The "split_col" attribute is the categorical column indicating the code type in that row. The <b>category names should replace column</b> names in the `columns` properties.
+	- The "codes_col" attribute is the code column that contains multiple code types
+	- e.g. 
+	```
+	"actions":{
+		"split_col":"coding_system",
+		"codes_col":"code"
+	},
+	"columns":{
+		"read2_code":"Read codes v2",
+		"med_code":"Med codes",
+		"icd10_code":"ICD10 codes",
+		"metadata":["description"]
+	},
+	```
+
+
+*<b>Large code lists</b> with numerous phenotypes (e.g. Ho et al) require a lot of JSON to be generated. See the "Ho generate JSON" section in process_codes_WP.ipynb for example code to generate it.
+
 
 
 ## ⚡ Usage
-- 
GitLab