Standalone pipeline to score SNVs with the MPact model and estimate m6A effect size.
cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
# Create env (first time only)
python -m venv /g/data/qq78/akanksha/m6A-snp/.venv
source /g/data/qq78/akanksha/m6A-snp/.venv/bin/activate
pip install -r requirements.txt
# Run on bundled sample TSV
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
--input mini_scoreable.tsv \
--output-tsv smoke_test_results.tsv \
--fasta hg38.fa \
--model-path model_window_501.h5 \
--rediportal-gz TABLE1_hg38_v3.txt.gz \
--scan-radius 5 \
--batch-size 256- Script:
score_mpact.py - Default model:
model_window_501.h5(501 nt window) - Alternative trained models:
model_window_101.h5andmodel_window_201.h5(use with--window-size 101or--window-size 201) - FASTA:
hg38.fa - FASTA index:
hg38.fa.fai - GTF:
Homo_sapiens.GRCh38.110.gtf.gz - A-to-I reference:
TABLE1_hg38_v3.txt.gz - Sample scoreable TSV:
mini_scoreable.tsv - Sample input VCF:
mini_unannotated_test.vcf
Required annotation files:
Homo_sapiens.GRCh38.110.gtf.gzfor strand inferenceTABLE1_hg38_v3.txt.gzfor A-to-I annotation
- TSV
- VCF
- VCF.GZ
TSV requires these columns (tab-separated):
| #Chromosome | Position | Reference | Alteration |
|---|---|---|---|
| chr1 | 100000 | A | G |
VCF uses standard CHROM POS ID REF ALT columns.
transcript_id/ENSTto improve transcript-aware strand mapping.strand(if user-provided).- Extra columns are preserved into output.
cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
--input /path/to/variants.tsv \
--output-tsv /path/to/predictions.tsv \
--fasta hg38.fa \
--model-path model_window_501.h5 \
--gtf Homo_sapiens.GRCh38.110.gtf.gz \
--rediportal-gz TABLE1_hg38_v3.txt.gz \
--output-plot /path/to/delta_histogram.png \
--scan-radius 5 \
--batch-size 512cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
--input /path/to/variants.vcf.gz \
--output-tsv /path/to/predictions.tsv \
--fasta hg38.fa \
--model-path model_window_501.h5 \
--gtf Homo_sapiens.GRCh38.110.gtf.gz \
--rediportal-gz TABLE1_hg38_v3.txt.gz \
--scan-radius 5 \
--batch-size 512cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
--input mini_scoreable.tsv \
--output-tsv sample_predictions.tsv \
--fasta hg38.fa \
--model-path model_window_501.h5 \
--gtf Homo_sapiens.GRCh38.110.gtf.gz \
--rediportal-gz TABLE1_hg38_v3.txt.gz \
--output-plot sample_delta_hist.pngFrom current score_mpact.py:
--gtfis required for strand inference and usesHomo_sapiens.GRCh38.110.gtf.gzby default.--rediportal-gzis required for A-to-I annotation and usesTABLE1_hg38_v3.txt.gzby default.- Default
--scan-radius:5 - Default
--batch-size:1024 - Default
--input-chunk-size:50000 - VCF mode defaults to genic-only filtering (
--genic-onlyis on).
If a run stops partway through, rerun the same command with --resume.
You can also set --resume-from-row to continue from a specific input row and --checkpoint-path to point at the saved checkpoint JSON.
Example:
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
--input mini_scoreable.tsv \
--output-tsv smoke_test_results.tsv \
--fasta hg38.fa \
--model-path model_window_501.h5 \
--gtf Homo_sapiens.GRCh38.110.gtf.gz \
--rediportal-gz TABLE1_hg38_v3.txt.gz \
--resumeIf you want to keep intergenic VCF rows too, add:
--allow-nongenicMain output is a TSV with one row per scored candidate A-site.
The output contains:
- All normalized input variant columns (for example
#Chromosome,Position,Reference,Alteration, and any additional input metadata columns) - Plus the MPact-generated columns below
| Header | Meaning |
|---|---|
| strand_gencode | Final strand used for scoring (+ or -), inferred from GTF transcript/interval logic. |
| strand_gencode_source | Label of the GTF reference used (for example GRCh38.110). |
| score_type | Candidate class: disruption, creation, or context. |
| a_genomic_pos1 | 1-based genomic coordinate of the candidate A center that was scored. |
| snp_to_a_mRNA_offset | Signed offset from SNP to candidate A in oriented transcript coordinates. |
| ref_center_is_A | Whether reference 501-nt oriented sequence has A at center position. |
| alt_center_is_A | Whether alternate 501-nt oriented sequence has A at center position. |
| overlaps_AtoI_exact | True if candidate A exactly overlaps an indexed A-to-I site; otherwise False. |
| near_AtoI_5nt | True if nearest A-to-I site is within 5 nt; otherwise False. |
| near_AtoI_10nt | True if nearest A-to-I site is within 10 nt; otherwise False. |
| mpact_ref_score | MPact model score on reference-oriented 501-nt sequence. |
| mpact_alt_score | MPact model score on alternate-oriented 501-nt sequence. |
| mpact_ref_stoichiometry_pct | Reference score scaled to percent (mpact_ref_score * 100). |
| mpact_alt_stoichiometry_pct | Alternate score scaled to percent (mpact_alt_score * 100). |
| mpact_delta_stoichiometry_pct | Percent delta ((alt - ref) * 100). |
| alt_center_A_destroyed | True when ALT no longer has center A (not alt_center_is_A). |
| mpact_delta_alt_minus_ref | Raw effect size: mpact_alt_score - mpact_ref_score. |
| mpact_abs_delta | Absolute effect size: abs(mpact_delta_alt_minus_ref). |
| delta_zscore | Z-score of raw delta against global delta distribution for the run. |
| delta_p_two_sided | Two-sided p-value computed from delta_zscore. |
| ref_scan_seq | Centered reference context slice from the scored window; length follows --scan-radius, capped between 5 and 21 nt. |
| alt_scan_seq | Centered alternate context slice from the scored window; length follows --scan-radius, capped between 5 and 21 nt. |
Note:
delta_zscoreanddelta_p_two_sidedare computed after chunk scoring using global delta mean/std over the temporary scored output.
All input columns are preserved first in the output. This means original variant annotations (for example ClinVar labels, INFO-derived fields, cohort tags) stay attached to each scored candidate row.
Important behavior:
- One input SNV can produce multiple output rows, because MPact scans candidate A centers around the SNV (
--scan-radius). - The same input variant may therefore appear many times, each with a different
a_genomic_pos1and possibly differentscore_type.
strand_gencode: The strand actually used for sequence orientation and scoring. This is the critical strand field to trust for downstream analysis.strand_gencode_source: The GTF build label used to infer strand. Useful for reproducibility and cross-run auditing.a_genomic_pos1: Candidate m6A-centered genomic coordinate (1-based).snp_to_a_mRNA_offset: SNP-to-center distance in transcript orientation.
Offset sign meaning:
- Positive value: SNP is downstream of candidate A in the oriented transcript frame.
- Negative value: SNP is upstream of candidate A in the oriented transcript frame.
- Zero: SNP overlaps the candidate center position.
ref_center_is_Aandalt_center_is_Adescribe whether the center base is A before and after applying ALT.score_typeis derived from these booleans:disruption: REF center is A and ALT center is not Acreation: REF center is not A and ALT center is Acontext: REF center is A and ALT center is A
alt_center_A_destroyedis simplynot alt_center_is_Aand is most informative for disruptive events.
These annotate known editing context around the candidate center:
overlaps_AtoI_exact: direct overlapnear_AtoI_5nt: local neighborhood overlap within 5 ntnear_AtoI_10nt: broader neighborhood overlap within 10 nt
Practical use:
- Use
overlaps_AtoI_exact == Truefor strict known-site overlap analyses. - Use
near_AtoI_5ntornear_AtoI_10ntwhen testing local editing-environment enrichment.
Core numeric fields:
mpact_ref_score: model score on REF sequencempact_alt_score: model score on ALT sequencempact_delta_alt_minus_ref: ALT minus REF (main direction-aware effect size)mpact_abs_delta: absolute magnitude of effect
Stoichiometry percent fields use a dedicated model stoichiometry head when present; otherwise the scorer falls back to class scores scaled by 100.
Direction interpretation for mpact_delta_alt_minus_ref:
- Negative: ALT decreases predicted m6A signal relative to REF
- Positive: ALT increases predicted m6A signal relative to REF
- Near zero: limited predicted effect for that candidate center
delta_zscore: standardized delta across all candidate rows in the rundelta_p_two_sided: two-sided p-value from that z-score
These are run-level normalized values, so they depend on the cohort/distribution in that specific run. If you compare different runs, compare carefully because z-score baselines can shift.
ref_scan_seqandalt_scan_seqare short centered context strings from the full 501-nt windows.- They are useful for quick motif-level sanity checks and visual inspection.
- They are not a replacement for full-window model inputs, but they are helpful for debugging and reporting.
If a row has:
score_type = disruptionmpact_delta_alt_minus_ref = -0.42mpact_abs_delta = 0.42delta_p_two_sided = 0.001
Then that candidate center is predicted to show a strong ALT-driven loss of m6A signal, with relatively extreme effect size within the run distribution.
Optional histogram PNG is written when --output-plot is set.
cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
--input mini_scoreable.tsv \
--output-tsv smoke_test_results.tsv \
--fasta hg38.fa \
--model-path model_window_501.h5 \
--scan-radius 5 \
--batch-size 256Observed result:
- Exit code
0 - Output TSV created successfully
- 5 lines total (header + 4 scored candidates)
Template script: submit_mpact_scoring.pbs
Before qsub, update:
#PBS -P, queue, walltime, ncpus, mem, storageINPUT_PATH,OUTPUT_DIR,FASTA,MODEL- optional
REDIPORTAL_GZ,GTF
Submit with:
qsub submit_mpact_scoring.pbsThe run is resumable with the updated PBS wrapper:
qsub /g/data/qq78/akanksha/m6A-snp/Scripts/run_gnomad_genic_snv_dtm6a501_scan20.pbsThe wrapper calls score_mpact.py with --resume, so already-scored chunk outputs in external_validation/gnomad/chrom_chunks are skipped and only missing chunks are processed.
The launcher now passes absolute paths for:
- FASTA:
MPact_Scoring_Pipeline/hg38.fa - GTF:
MPact_Scoring_Pipeline/Homo_sapiens.GRCh38.110.gtf.gz - REDIportal:
MPact_Scoring_Pipeline/TABLE1_hg38_v3.txt.gz
samtools faidx /path/to/hg38.famodule load samtools
which samtoolsTensorFlow may print CUDA warnings on CPU-only nodes. This is expected if no GPU is present, and CPU inference still runs.
Use a smaller batch size:
--batch-size 256