120 commits
fc0e9c7
Add initial script to get individiual results
mart-r Nov 7, 2025
f138663
Add script to get overall results (startup, warm, cold)
mart-r Nov 7, 2025
fe04f49
Fix default args
mart-r Nov 7, 2025
7ce5e8d
Add master script for getting load speed for multiple models
mart-r Nov 7, 2025
06a513f
Add v1 and v2 (and my localy setup) specific scripts for getting load…
mart-r Nov 7, 2025
a519fcc
Avoid unknown run types
mart-r Nov 7, 2025
a0df0e6
Add option to specify number of repeats when doing all load experimen…
mart-r Nov 7, 2025
98a49fd
Move to a timeit based approach
mart-r Nov 7, 2025
e294df4
Add output folder
mart-r Nov 7, 2025
074e7a1
Add automatic json output
mart-r Nov 7, 2025
a80b780
Fix type of save json argument
mart-r Nov 7, 2025
2b05b70
Always save results to a file when doing in bulk
mart-r Nov 7, 2025
39a47b5
Allow overwriting output prefix if/when required
mart-r Nov 7, 2025
3ac3d0d
Separated speed scripts from (future) performance ones
mart-r Nov 7, 2025
2652fd5
Move a bunch of code to a separate module
mart-r Nov 7, 2025
d70054d
Allow for a more general error handling when running subprocesses
mart-r Nov 7, 2025
7f2ff90
Add a few overarching scripts to run all the speed scripts at once
mart-r Nov 7, 2025
45f03a6
Centralise combining of experiments
mart-r Nov 7, 2025
9d2ee46
Only produce results for run types that are required
mart-r Nov 7, 2025
6c0164f
Add uncommitted changes from last commit
mart-r Nov 7, 2025
2390d2b
Add modules to get inference speed
mart-r Nov 7, 2025
fc7b065
Fix serialisation issue
mart-r Nov 7, 2025
21ecc3d
Add overall inference speed getter
mart-r Nov 7, 2025
48cf5f3
Add setup-specific scripts for inference speed
mart-r Nov 7, 2025
a9bd75f
Allow scripts to actually run
mart-r Nov 7, 2025
b5283a2
Fix a small issue (running load speed instead of inference speed)
mart-r Nov 8, 2025
340f4fe
Fix some argument issues in bash scripts
mart-r Nov 8, 2025
b96a879
Make names file-name safe
mart-r Nov 8, 2025
546a7d8
Change divider type between scripts
mart-r Nov 8, 2025
25a1fbc
Fix a bash script logic issue
mart-r Nov 8, 2025
9f6e087
Read output from last line
mart-r Nov 8, 2025
7cf0c40
Fix issue with errorenoushly newlines
mart-r Nov 9, 2025
3511a05
Improve output for getting time from stdout
mart-r Nov 10, 2025
ab90769
Add some more output when doing inference speed
mart-r Nov 10, 2025
2abeec8
Fix some comment
mart-r Nov 10, 2025
7676ced
Some whitespace changes
mart-r Nov 10, 2025
4aeb859
Fix typo
mart-r Nov 10, 2025
636c4cc
Remove unneeded output
mart-r Nov 11, 2025
f218e7f
Fix script running for specific version
mart-r Nov 11, 2025
7341c42
Remove unused empty method
mart-r Nov 12, 2025
fecf5ab
Add script to get unsupervised training speed as well
mart-r Nov 12, 2025
42e888a
Add script to summarise output
mart-r Nov 12, 2025
b5e611c
Add script to combine all unsuperivsed training output for a particul…
mart-r Nov 12, 2025
f49dab3
Add scripts to get unsupervised speed overall
mart-r Nov 12, 2025
c3d8672
Add folder for inference and unsupervised training output
mart-r Nov 12, 2025
e70a9db
Removed empty / old files
mart-r Nov 12, 2025
de88018
Improve / fix profiling
mart-r Nov 13, 2025
3370562
Move version specified to common module
mart-r Nov 13, 2025
9f626c7
Reset subanmes after model load if v2
mart-r Nov 13, 2025
58ed893
Fix typo
mart-r Nov 13, 2025
17c85a8
Add subname reset when doing unsupervised training speed
mart-r Nov 13, 2025
5b1c82d
Add some minor comments
mart-r Nov 13, 2025
b2eded2
Add initial regression performance script
mart-r Nov 18, 2025
9ab2190
some linting / whitespace fixes
mart-r Nov 18, 2025
f87dcc4
some further linting / whitespace fixes
mart-r Nov 18, 2025
d14af99
Add out/performance folder
mart-r Nov 18, 2025
f2aa642
Add script to get all of regression
mart-r Nov 18, 2025
8905e3f
Add conversion script for MDACE
mart-r Nov 19, 2025
c5449ec
Add mapping from ICD to Snomed
mart-r Nov 19, 2025
3205385
Add conversion for distemist dataset
mart-r Nov 19, 2025
2e1282e
Add a new stats methodology for multi-optioned datasets
mart-r Nov 20, 2025
00a1967
Minor updates to new stats method
mart-r Nov 20, 2025
7fa25a5
Update stats to allow projct processing with project filters
mart-r Nov 20, 2025
f56b944
Add v1 implementation for missing stuff (hopefully)
mart-r Nov 20, 2025
6fbc6a6
Fix minor import path issues
mart-r Nov 20, 2025
e19a7a7
Fix problematic dunder call method
mart-r Nov 20, 2025
f22e0d3
Fix typo in name
mart-r Nov 20, 2025
54e7a91
Add performance script for model and dataset(s)
mart-r Nov 20, 2025
bba1171
Remove commented code
mart-r Nov 20, 2025
993c692
Allow filtering before disamb (optionally)
mart-r Nov 20, 2025
d22a85f
Add README for MDACE dataset
mart-r Nov 20, 2025
2553fbe
Add README for distemist dataset
mart-r Nov 20, 2025
aad0e06
Add conversion script - from linking challenge to trainer export
mart-r Nov 20, 2025
817feba
Add README for linking challenge data prep
mart-r Nov 20, 2025
dad18e2
Add README for COMETA dataset
mart-r Nov 20, 2025
56d8bbc
Add cometa dataset conversion script
mart-r Nov 20, 2025
adfe353
Add medmentions conversion scripts
mart-r Nov 20, 2025
9a9283f
Remove some unneeded code
mart-r Nov 20, 2025
5961786
Add MedMentions dataset README
mart-r Nov 21, 2025
a878241
Keep unsupervised data folder
mart-r Nov 21, 2025
5f62fd8
Add script to get all performance
mart-r Nov 21, 2025
bbb0625
Fix performance script
mart-r Nov 21, 2025
98e64e1
CU-869b9h7y6: Add faster linker that only links to primary names
mart-r Nov 25, 2025
d72b4f9
CU-869b9h7y6: Remove debug output
mart-r Nov 25, 2025
0839a24
CU-869b9h7y6: Add proper filtering as well as usage of single-possibl…
mart-r Nov 25, 2025
48396af
CU-869b9h7y6: Add a simple test for the new linker
mart-r Nov 25, 2025
9ea321d
Merge branch 'main' into medcat-v2-paper
mart-r Nov 27, 2025
6d7910b
Add a few scripts to show possible variance in performance and throug…
mart-r Nov 27, 2025
12cca89
Merge branch 'feat/medcat/CU-869b9h7y6-add-faster-linker' into medcat…
mart-r Nov 27, 2025
146662e
Update script to include embedding linker in it
mart-r Nov 27, 2025
f3a545d
Add embedding linker stuff to script
mart-r Nov 27, 2025
bb5951c
Start moving towards a better format for variance getting (get 1 outp…
mart-r Nov 27, 2025
1f92f83
Remove some echo /debug output
mart-r Nov 27, 2025
0b4ecab
Add dataset name to output
mart-r Nov 27, 2025
2c8feaf
Add header to output
mart-r Nov 27, 2025
59aaeef
Add run time to output
mart-r Nov 27, 2025
0307818
Merge branch 'main' into medcat-v2-paper-and-faster-linker
mart-r Nov 30, 2025
b037556
Add 1 time embedding linker conversion script
mart-r Nov 30, 2025
a5202a2
Some whitespace changes
mart-r Nov 30, 2025
ff4c0ab
Make last line of conversion be the model path
mart-r Nov 30, 2025
1d63dae
Convert embedding model once
mart-r Nov 30, 2025
279e59e
Merge branch 'main' into medcat-v2-paper-and-faster-linker
mart-r Dec 17, 2025
040af18
Try to redo filtering for embedding linker
mart-r Dec 18, 2025
540caaa
Try to redo filtering for embedding linker (attempt no 2)
mart-r Dec 18, 2025
d336bfe
Try to redo filtering for embedding linker (attempt no 3)
mart-r Dec 18, 2025
a0fc0d8
Merge branch 'main' into medcat-v2-paper-and-faster-linker
mart-r Feb 6, 2026
8f48e50
Merge branch 'main' into medcat-v2-paper-and-faster-linker
mart-r Feb 11, 2026
c8d80ad
Add a throughput script
mart-r Feb 11, 2026
56c6cfb
Add throughput to variance calculations
mart-r Feb 11, 2026
93daa30
Revert "Add a throughput script"
mart-r Feb 11, 2026
2908e69
Merge branch 'main' into medcat-v2-paper-and-faster-linker
github-actions[bot] Apr 8, 2026
3e3c018
Merge branch 'main' into medcat-v2-paper-and-faster-linker
github-actions[bot] Apr 9, 2026
7663949
Update model paths
github-actions[bot] Apr 10, 2026
fcfd725
Run performance against 2023 models again
github-actions[bot] Apr 10, 2026
ae74fce
Add a script to run everything at once
github-actions[bot] Apr 10, 2026
1db6f6a
CU-869cw9zmj: Use faster way to calculate unit vector
github-actions[bot] Apr 13, 2026
887a180
CU-869cw9zmj: Speed up context vector obtaining
github-actions[bot] Apr 13, 2026
de3c25e
Merge branch 'main' into medcat-v2-paper-and-faster-linker-w-faster-gcv
github-actions[bot] Jun 3, 2026
7972292
Add variance plotting script
github-actions[bot] Jun 3, 2026
d9a74fb
Revert changes to matutils
github-actions[bot] Jun 4, 2026
18 changes: 18 additions & 0 deletions medcat-v2/paper/data/supervised/MDACE/raw/README.md
@@ -0,0 +1,18 @@
First we download the MDACE dataset and prepare it with MIMIC-IV as per instructions:
https://github.com/3mcloud/MDACE

Then, we need to convert the data to a format MedCAT can understand using:
```bash
python convert_to_mct_export.py # no need for arguments if in this folder
```

However, that export still contains only ICD-10 codes, whereas the models we're comparing against use SNOMED.

So we then need to convert to SNOMED by doing:
```bash
python map_from_icd_to_snomed.py <model_pack_path> ../icd10_convert.json ../mct_export_with_candidates.json
```

This will create a trainer export that has multiple CUIs as options for each annotation.
That is because an ICD-10 code can map to multiple different SNOMED concepts and there is no automated way to create a one-to-one mapping.
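
The one-to-many relationship described above can be sketched as follows. This is a minimal illustration, not the script itself: it assumes a plain dictionary from SNOMED CUI to a list of ICD-10 code strings, and the `S-1`, `S-2`, `S-3` identifiers are hypothetical placeholders rather than real concept IDs.

```python
from collections import defaultdict


def invert_mapping(cui2icd10: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a SNOMED -> ICD-10 mapping into ICD-10 -> SNOMED candidates."""
    icd2snomed: dict[str, list[str]] = defaultdict(list)
    for cui, icd_codes in cui2icd10.items():
        for icd in icd_codes:
            # Several SNOMED concepts may share one ICD-10 code,
            # so each ICD-10 code collects a list of candidates.
            icd2snomed[icd].append(cui)
    return dict(icd2snomed)


# Hypothetical placeholder CUIs (assumption), mapped to ICD-10 codes
example = {
    "S-1": ["I10"],
    "S-2": ["I10"],
    "S-3": ["E11.9"],
}
candidates = invert_mapping(example)
print(candidates["I10"])
```

Because two placeholder concepts share the same ICD-10 code here, an annotation carrying that code ends up with two candidate CUIs, which is why the export keeps all of them rather than picking one.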
87 changes: 87 additions & 0 deletions medcat-v2/paper/data/supervised/MDACE/raw/convert_to_mct_export.py
@@ -0,0 +1,87 @@
import json
import os
import sys
from datetime import datetime
from typing import Iterator

from medcat.data.mctexport import (
    MedCATTrainerExport, MedCATTrainerExportProject,
    MedCATTrainerExportDocument, MedCATTrainerExportAnnotation)
from medcat.data.mctexport import count_all_annotations, count_all_docs

DEFAULT_INPUT_DIR = "with_text/gold"
DEFAULT_OUTPUT_PATH = "../icd10_convert.json"


def get_all_jsons(input_dir: str) -> Iterator[str]:
    for fn in os.listdir(input_dir):
        path = os.path.join(input_dir, fn)
        if os.path.isdir(path):
            yield from get_all_jsons(path)
        elif path.endswith(".json"):
            yield path


def do_conversion(
        input_dir: str = DEFAULT_INPUT_DIR,
        output_file: str = DEFAULT_OUTPUT_PATH):
    mod_time = datetime.now().isoformat()
    all_out: MedCATTrainerExport = {
        "projects": []
    }

    for path in get_all_jsons(input_dir):
        if not path.endswith(".json"):
            continue
        with open(path) as f:
            in_data = json.load(f)
        documents: list[MedCATTrainerExportDocument] = []
        proj_id = in_data["hadm_id"]
        proj_name = f'MDACE_{proj_id}'
        project: MedCATTrainerExportProject = {
            "documents": documents,
            "name": proj_name,
            "id": proj_id,
            "cuis": "",
            "tuis": "",
        }
        all_out["projects"].append(project)

        in_notes = in_data["notes"]  # guess name
        for in_doc in in_notes:
            doc_id = in_doc["note_id"]
            doc_name = f'{in_doc["description"]}_{doc_id}'
            anns: list[MedCATTrainerExportAnnotation] = []
            documents.append(
                {
                    "name": doc_name,
                    "id": doc_id,
                    "last_modified": mod_time,
                    "text": in_doc["text"],
                    "annotations": anns,
                }
            )

            for ann_num, ann in enumerate(in_doc["annotations"]):
                anns.append(
                    {
                        "start": ann["begin"],
                        "end": ann["end"],
                        # NOTE: this is currently in ICD
                        "cui": ann["code"],
                        "value": ann["covered_text"],
                        "id": f"{proj_name}_{doc_name}_{ann_num}",
                        "meta_anns": [],
                        "validated": True,
                    }
                )
    print("GOT", len(all_out["projects"]), "projects",
          "with", count_all_annotations(all_out), "annotations",
          "across", count_all_docs(all_out), "documents")

    with open(output_file, "w") as of:
        json.dump(all_out, of, indent=2)


if __name__ == "__main__":
    do_conversion(*sys.argv[1:])
104 changes: 104 additions & 0 deletions medcat-v2/paper/data/supervised/MDACE/raw/map_from_icd_to_snomed.py
@@ -0,0 +1,104 @@
import sys
import json
from collections import defaultdict

from medcat.cat import CAT
from medcat.data.mctexport import (
    MedCATTrainerExport, MedCATTrainerExportAnnotation,
    count_all_annotations, count_all_docs)


def load_export(path: str) -> MedCATTrainerExport:
    with open(path) as f:
        return json.load(f)


def icd2snomed(cat: CAT) -> dict[str, list[str]]:
    # Invert the model's SNOMED -> ICD-10 mapping so that each ICD-10
    # code lists every SNOMED CUI that maps to it
    code2snomed: dict[str, list[str]] = defaultdict(list)
    cui2icd10 = cat.cdb.addl_info["cui2icd10"]
    for cui_info in cat.cdb.cui2info.values():
        cui = cui_info["cui"]
        for icd10 in cui2icd10.get(cui, []):
            code2snomed[icd10].append(cui)
    print("GOT", len(code2snomed), "ICD codes")
    print("Mapped to", sum(len(v) for v in code2snomed.values()),
          "total Snomed CUIs")
    return code2snomed


def pick_concept(cat: CAT,
                 mapper: dict[str, list[str]],
                 ann: MedCATTrainerExportAnnotation) -> list[str] | None:
    # NOTE: I could try and select 1 - the best
    #       but there isn't really a good way to do that.
    #       Instead, we'll use all as candidates
    # Returns the list of candidate CUIs, or None if the code is unmapped
    return mapper.get(ann["cui"])


def convert_export(
        cat: CAT, export: MedCATTrainerExport
        ) -> MedCATTrainerExport:
    mapper = icd2snomed(cat)
    return {
        "projects": [
            {
                "id": proj["id"],
                "name": proj["name"],
                "cuis": proj["cuis"],
                "tuis": proj["tuis"],
                "documents": docs
            }
            for proj in export["projects"]
            if (docs := [
                {
                    "id": doc["id"],
                    "name": doc["name"],
                    "last_modified": doc["last_modified"],
                    "text": doc["text"],
                    "annotations": anns
                } for doc in proj["documents"]
                if (anns := [
                    {
                        "id": ann["id"],
                        "start": ann["start"],
                        "end": ann["end"],
                        "value": ann["value"],
                        "cui": mapped_cui,
                        "meta_anns": ann["meta_anns"],
                        "validated": ann["validated"]
                    } for ann in doc["annotations"]
                    if (mapped_cui := pick_concept(cat, mapper, ann))
                ])
            ])
        ]
    }


def main(model_pack_path: str,
         icd10_export_path: str,
         final_export_path: str):
    print("Loading model pack", model_pack_path)
    cat = CAT.load_model_pack(model_pack_path)
    print("Loading export")
    export = load_export(icd10_export_path)
    print("Initial import has", count_all_docs(export), "docs",
          "and", count_all_annotations(export), "anns within",
          len(export["projects"]), "projects")
    print("Converting...")
    converted = convert_export(cat, export)
    print("CONVERTED export HAS", count_all_docs(converted), "docs",
          "and", count_all_annotations(converted), "anns within",
          len(converted["projects"]), "projects")
    from medcat.data.mctexport import iter_anns
    lens = []
    for _, _, ann in iter_anns(converted):
        lens.append(len(ann["cui"]) if isinstance(ann["cui"], list) else 1)
    print("Total", len(lens), "annotations with", sum(lens) / len(lens),
          "values on average")
    print("Saving to", final_export_path)
    with open(final_export_path, 'w') as f:
        json.dump(converted, f)


if __name__ == "__main__":
    main(*sys.argv[1:])
7 changes: 7 additions & 0 deletions medcat-v2/paper/data/supervised/cometa/raw/README.md
@@ -0,0 +1,7 @@
First, we need to download the dataset:
https://metatext.io/datasets/cometa

Then we need to convert it to a format MedCAT understands:
```bash
python conversion/converter.py chv.csv ../mct_export.json
```
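
The core of this conversion is locating every occurrence of the term inside its example text, ignoring case. A minimal sketch of that idea is shown here; `find_spans` is a hypothetical helper name used only for illustration, and the sample sentence is made up.

```python
def find_spans(term: str, text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every case-insensitive match of term."""
    lowered_term = term.lower()
    lowered_text = text.lower()
    spans: list[tuple[int, int]] = []
    search_from = 0
    # str.find returns -1 once there are no further matches
    while (idx := lowered_text.find(lowered_term, search_from)) >= 0:
        end = idx + len(lowered_term)
        spans.append((idx, end))
        # Continue after this match so matches never overlap
        search_from = end
    return spans


text = "Flu season again. The flu hit me hard."
spans = find_spans("flu", text)
# Slice the original text so the annotation keeps its original casing
values = [text[start:end] for start, end in spans]
print(spans, values)
```

Offsets are computed on the lowercased text but applied to the original, which assumes lowercasing does not change the string length; that holds for ASCII text but not for every Unicode character.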
115 changes: 115 additions & 0 deletions medcat-v2/paper/data/supervised/cometa/raw/conversion/converter.py
@@ -0,0 +1,115 @@
from sys import argv
import json
import os.path
from datetime import datetime

from tqdm import tqdm
import pandas as pd

from medcat.data.mctexport import (
    MedCATTrainerExport, MedCATTrainerExportProject,
    MedCATTrainerExportAnnotation)
from medcat.data.mctexport import count_all_docs, count_all_annotations


COLS = ['Term', 'General SNOMED Label', 'General SNOMED ID',
        'Specific SNOMED Label', 'Specific SNOMED ID', 'Example',
        'Example Link', 'Origin_Sheet']
COL4VALUE = "Term"
COL4CUI = "Specific SNOMED ID"
COL4TEXT = "Example"
COL4LINK = "Example Link"

# November 2020
LAST_MODIFIED = datetime(year=2020, month=11, day=1).isoformat()


def find_annotations(value: str, text: str, cui: str
                     ) -> list[MedCATTrainerExportAnnotation]:
    value = value.lower()
    orig_text = text
    text = text.lower()
    if value not in text:
        raise ValueError(f"{repr(value)} not in text ({repr(text)})")
    cur_start = 0
    anns: list[MedCATTrainerExportAnnotation] = []
    while (cur_index := text.find(value, cur_start)) >= 0:
        start = cur_index
        end = cur_index + len(value)
        anns.append(
            {
                "cui": str(cui),
                "value": orig_text[start: end],
                "start": start,
                "end": end,
            }
        )
        cur_start = end
        if len(anns) > 100:
            raise KeyError(
                f"Too many annotations!, {start}, {end}, for {value}. "
                f"cur start at {cur_start}")
    return anns


def do_conversion(df: pd.DataFrame, proj_base_id: str, proj_base_name: str
                  ) -> MedCATTrainerExport:
    projects: list[MedCATTrainerExportProject] = []
    for line_num, (index, line) in enumerate(tqdm(df.iterrows(),
                                                  total=len(df.index))):
        text = line[COL4TEXT]
        cui = line[COL4CUI]
        try:
            anns = find_annotations(
                line[COL4VALUE], text, cui)
        except ValueError as e:
            print("LINE", line_num, "at index", index,
                  "Failed to load(VE):", str(e))
            continue
        except AttributeError as e:
            print("LINE", line_num, "at index", index,
                  "Failed to load(AE):", str(e))
            continue
        proj_id = proj_base_id + str(index)
        proj_name = proj_base_name + "@" + str(index)
        # NOTE: each document is a project so that I can use per-project
        #       filters and thus only focus on the CUI in question and not
        #       the other terms in the text
        projects.append({
            "documents": [
                {
                    "text": text,
                    "annotations": anns,
                    "id": str(index),
                    "name": f"LINK: {line[COL4LINK]}; ID: {index}",
                    "last_modified": LAST_MODIFIED
                }
            ],
            "id": proj_id,
            "name": proj_name,
            "cuis": f'{cui}',
            "tuis": '',
        })
    return {"projects": projects}


def main(file_path: str,
         export_path: str,
         # TODO: options
         ):
    df = pd.read_csv(file_path, sep='\t', index_col=0, header=0).sort_index()
    proj_name = export_path.split(os.path.sep + "cometa" + os.path.sep, 1)[-1]
    proj_id = ".".join(proj_name.split(os.path.sep)[-2:]).replace(".csv", "")
    print("Giving 'project' a name of", repr(proj_name))
    print("And setting ID to", proj_id)
    mct_export = do_conversion(df, proj_id, proj_name)
    print("Got", len(mct_export["projects"]), "projects with a total of",
          count_all_docs(mct_export), "documents and a total of",
          count_all_annotations(mct_export), "annotations")
    print("Saving to", repr(export_path))
    with open(export_path, 'w') as f:
        json.dump(mct_export, f)


if __name__ == "__main__":
    main(*argv[1:])
11 changes: 11 additions & 0 deletions medcat-v2/paper/data/supervised/distemist/raw/README.md
@@ -0,0 +1,11 @@
First we need to download and extract the distemist dataset:
https://temu.bsc.es/distemist/distemist-linking/

Subsequently, we convert it to a MedCAT-supported format:
```bash
python convert_to_mct_export.py distemist_zenodo/multilingual_resources/training_text_files/en distemist_zenodo/multilingual_resources/en ../mct_export.json
```

NOTE:
In at least some cases, the underlying dataset links a single annotation to multiple concepts.
Because of that, the output also allows a subset of concepts for an annotation.
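
One way to score predictions against annotations that carry several acceptable concepts is to count a prediction as correct when it matches any of them. The sketch below illustrates that idea only; it is an assumption about how such data could be evaluated, not the stats methodology used in this PR, and `is_correct`, `accuracy`, and the `C1` to `C5` identifiers are hypothetical.

```python
from typing import Union

Gold = Union[str, list[str]]


def is_correct(predicted_cui: str, gold: Gold) -> bool:
    """Return True if the prediction matches any acceptable gold concept."""
    candidates = [gold] if isinstance(gold, str) else gold
    return predicted_cui in candidates


def accuracy(pairs: list[tuple[str, Gold]]) -> float:
    """Fraction of (prediction, gold) pairs counted as correct."""
    if not pairs:
        return 0.0
    return sum(is_correct(pred, gold) for pred, gold in pairs) / len(pairs)


# Hypothetical placeholder identifiers, not real concept IDs
pairs = [
    ("C1", ["C1", "C2"]),  # matches one of two candidates
    ("C3", ["C1", "C2"]),  # matches neither candidate
    ("C4", "C4"),          # single gold concept, matched
    ("C5", "C4"),          # single gold concept, missed
]
score = accuracy(pairs)
print(score)
```

Treating any candidate as a hit is lenient by design: it avoids penalising a model for choosing a different but equally valid concept when the gold data itself cannot single one out.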