MPact m6A Variant Scoring Pipeline

Standalone pipeline to score SNVs with the MPact model and estimate m6A effect size.

Quick Start (Copy-Paste)

cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline

# Create env (first time only)
python -m venv /g/data/qq78/akanksha/m6A-snp/.venv
source /g/data/qq78/akanksha/m6A-snp/.venv/bin/activate
pip install -r requirements.txt

# Run on bundled sample TSV
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
  --input mini_scoreable.tsv \
  --output-tsv smoke_test_results.tsv \
  --fasta hg38.fa \
  --model-path model_window_501.h5 \
  --rediportal-gz TABLE1_hg38_v3.txt.gz \
  --scan-radius 5 \
  --batch-size 256

Bundled Paths (All In This Directory)

Script: score_mpact.py
Default model: model_window_501.h5 (501 nt window)
Alternative trained models: model_window_101.h5 and model_window_201.h5 (use with --window-size 101 or --window-size 201)
FASTA: hg38.fa
FASTA index: hg38.fa.fai
GTF: Homo_sapiens.GRCh38.110.gtf.gz
A-to-I reference: TABLE1_hg38_v3.txt.gz
Sample scoreable TSV: mini_scoreable.tsv
Sample input VCF: mini_unannotated_test.vcf

Required annotation files:

Homo_sapiens.GRCh38.110.gtf.gz for strand inference
TABLE1_hg38_v3.txt.gz for A-to-I annotation

Inputs

Supported file types

TSV
VCF
VCF.GZ

Required variant fields

TSV requires these columns (tab-separated):

#Chromosome	Position	Reference	Alteration
chr1	100000	A	G

VCF uses standard CHROM POS ID REF ALT columns.

Optional fields

transcript_id / ENST to improve transcript-aware strand mapping.
strand (if user-provided).
Extra columns are preserved into output.

Core Run Commands

1. Run on your own TSV

cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
  --input /path/to/variants.tsv \
  --output-tsv /path/to/predictions.tsv \
  --fasta hg38.fa \
  --model-path model_window_501.h5 \
  --gtf Homo_sapiens.GRCh38.110.gtf.gz \
  --rediportal-gz TABLE1_hg38_v3.txt.gz \
  --output-plot /path/to/delta_histogram.png \
  --scan-radius 5 \
  --batch-size 512

2. Run on your own VCF

cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
  --input /path/to/variants.vcf.gz \
  --output-tsv /path/to/predictions.tsv \
  --fasta hg38.fa \
  --model-path model_window_501.h5 \
  --gtf Homo_sapiens.GRCh38.110.gtf.gz \
  --rediportal-gz TABLE1_hg38_v3.txt.gz \
  --scan-radius 5 \
  --batch-size 512

3. Run on included sample (recommended first check)

cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
  --input mini_scoreable.tsv \
  --output-tsv sample_predictions.tsv \
  --fasta hg38.fa \
  --model-path model_window_501.h5 \
  --gtf Homo_sapiens.GRCh38.110.gtf.gz \
  --rediportal-gz TABLE1_hg38_v3.txt.gz \
  --output-plot sample_delta_hist.png

Important Runtime Defaults

From current score_mpact.py:

--gtf is required for strand inference and uses Homo_sapiens.GRCh38.110.gtf.gz by default.
--rediportal-gz is required for A-to-I annotation and uses TABLE1_hg38_v3.txt.gz by default.
Default --scan-radius: 5
Default --batch-size: 1024
Default --input-chunk-size: 50000
VCF mode defaults to genic-only filtering (--genic-only is on).

Resume an interrupted run

If a run stops partway through, rerun the same command with --resume. You can also set --resume-from-row to continue from a specific input row and --checkpoint-path to point at the saved checkpoint JSON.

Example:

/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
  --input mini_scoreable.tsv \
  --output-tsv smoke_test_results.tsv \
  --fasta hg38.fa \
  --model-path model_window_501.h5 \
  --gtf Homo_sapiens.GRCh38.110.gtf.gz \
  --rediportal-gz TABLE1_hg38_v3.txt.gz \
  --resume

If you want to keep intergenic VCF rows too, add:

--allow-nongenic

Output

Main output is a TSV with one row per scored candidate A-site.

Header dictionary (all output columns)

The output contains:

All normalized input variant columns (for example #Chromosome, Position, Reference, Alteration, and any additional input metadata columns)
Plus the MPact-generated columns below

Header	Meaning
strand_gencode	Final strand used for scoring (`+` or `-`), inferred from GTF transcript/interval logic.
strand_gencode_source	Label of the GTF reference used (for example `GRCh38.110`).
score_type	Candidate class: `disruption`, `creation`, or `context`.
a_genomic_pos1	1-based genomic coordinate of the candidate A center that was scored.
snp_to_a_mRNA_offset	Signed offset from SNP to candidate A in oriented transcript coordinates.
ref_center_is_A	Whether reference 501-nt oriented sequence has `A` at center position.
alt_center_is_A	Whether alternate 501-nt oriented sequence has `A` at center position.
overlaps_AtoI_exact	`True` if candidate A exactly overlaps an indexed A-to-I site; otherwise `False`.
near_AtoI_5nt	`True` if nearest A-to-I site is within 5 nt; otherwise `False`.
near_AtoI_10nt	`True` if nearest A-to-I site is within 10 nt; otherwise `False`.
mpact_ref_score	MPact model score on reference-oriented 501-nt sequence.
mpact_alt_score	MPact model score on alternate-oriented 501-nt sequence.
mpact_ref_stoichiometry_pct	Reference score scaled to percent (`mpact_ref_score * 100`).
mpact_alt_stoichiometry_pct	Alternate score scaled to percent (`mpact_alt_score * 100`).
mpact_delta_stoichiometry_pct	Percent delta (`(alt - ref) * 100`).
alt_center_A_destroyed	`True` when ALT no longer has center A (`not alt_center_is_A`).
mpact_delta_alt_minus_ref	Raw effect size: `mpact_alt_score - mpact_ref_score`.
mpact_abs_delta	Absolute effect size: `abs(mpact_delta_alt_minus_ref)`.
delta_zscore	Z-score of raw delta against global delta distribution for the run.
delta_p_two_sided	Two-sided p-value computed from `delta_zscore`.
ref_scan_seq	Centered reference context slice from the scored window; length follows `--scan-radius`, capped between 5 and 21 nt.
alt_scan_seq	Centered alternate context slice from the scored window; length follows `--scan-radius`, capped between 5 and 21 nt.

Note:

delta_zscore and delta_p_two_sided are computed after chunk scoring using global delta mean/std over the temporary scored output.

Detailed interpretation guide

1) Input passthrough columns

All input columns are preserved first in the output. This means original variant annotations (for example ClinVar labels, INFO-derived fields, cohort tags) stay attached to each scored candidate row.

Important behavior:

One input SNV can produce multiple output rows, because MPact scans candidate A centers around the SNV (--scan-radius).
The same input variant may therefore appear many times, each with a different a_genomic_pos1 and possibly different score_type.

2) Strand and coordinate columns

strand_gencode: The strand actually used for sequence orientation and scoring. This is the critical strand field to trust for downstream analysis.
strand_gencode_source: The GTF build label used to infer strand. Useful for reproducibility and cross-run auditing.
a_genomic_pos1: Candidate m6A-centered genomic coordinate (1-based).
snp_to_a_mRNA_offset: SNP-to-center distance in transcript orientation.

Offset sign meaning:

Positive value: SNP is downstream of candidate A in the oriented transcript frame.
Negative value: SNP is upstream of candidate A in the oriented transcript frame.
Zero: SNP overlaps the candidate center position.

3) Event typing columns

ref_center_is_A and alt_center_is_A describe whether the center base is A before and after applying ALT.
score_type is derived from these booleans:
- disruption: REF center is A and ALT center is not A
- creation: REF center is not A and ALT center is A
- context: REF center is A and ALT center is A
alt_center_A_destroyed is simply not alt_center_is_A and is most informative for disruptive events.

4) A-to-I proximity columns

These annotate known editing context around the candidate center:

overlaps_AtoI_exact: direct overlap
near_AtoI_5nt: local neighborhood overlap within 5 nt
near_AtoI_10nt: broader neighborhood overlap within 10 nt

Practical use:

Use overlaps_AtoI_exact == True for strict known-site overlap analyses.
Use near_AtoI_5nt or near_AtoI_10nt when testing local editing-environment enrichment.

5) Score and delta columns

Core numeric fields:

mpact_ref_score: model score on REF sequence
mpact_alt_score: model score on ALT sequence
mpact_delta_alt_minus_ref: ALT minus REF (main direction-aware effect size)
mpact_abs_delta: absolute magnitude of effect

Stoichiometry percent fields use a dedicated model stoichiometry head when present; otherwise the scorer falls back to class scores scaled by 100.

Direction interpretation for mpact_delta_alt_minus_ref:

Negative: ALT decreases predicted m6A signal relative to REF
Positive: ALT increases predicted m6A signal relative to REF
Near zero: limited predicted effect for that candidate center

6) Statistical normalization columns

delta_zscore: standardized delta across all candidate rows in the run
delta_p_two_sided: two-sided p-value from that z-score

These are run-level normalized values, so they depend on the cohort/distribution in that specific run. If you compare different runs, compare carefully because z-score baselines can shift.

7) Local context columns

ref_scan_seq and alt_scan_seq are short centered context strings from the full 501-nt windows.
They are useful for quick motif-level sanity checks and visual inspection.
They are not a replacement for full-window model inputs, but they are helpful for debugging and reporting.

8) Row-level interpretation example

If a row has:

score_type = disruption
mpact_delta_alt_minus_ref = -0.42
mpact_abs_delta = 0.42
delta_p_two_sided = 0.001

Then that candidate center is predicted to show a strong ALT-driven loss of m6A signal, with relatively extreme effect size within the run distribution.

Optional histogram PNG is written when --output-plot is set.

Test command

cd /g/data/qq78/akanksha/m6A-snp/MPact_Scoring_Pipeline
/g/data/qq78/akanksha/m6A-snp/.venv/bin/python score_mpact.py \
  --input mini_scoreable.tsv \
  --output-tsv smoke_test_results.tsv \
  --fasta hg38.fa \
  --model-path model_window_501.h5 \
  --scan-radius 5 \
  --batch-size 256

Observed result:

Exit code 0
Output TSV created successfully
5 lines total (header + 4 scored candidates)

HPC PBS Usage

Template script: submit_mpact_scoring.pbs

Before qsub, update:

#PBS -P, queue, walltime, ncpus, mem, storage
INPUT_PATH, OUTPUT_DIR, FASTA, MODEL
optional REDIPORTAL_GZ, GTF

Submit with:

qsub submit_mpact_scoring.pbs

Resume Note

The run is resumable with the updated PBS wrapper:

qsub /g/data/qq78/akanksha/m6A-snp/Scripts/run_gnomad_genic_snv_dtm6a501_scan20.pbs

The wrapper calls score_mpact.py with --resume, so already-scored chunk outputs in external_validation/gnomad/chrom_chunks are skipped and only missing chunks are processed.

The launcher now passes absolute paths for:

FASTA: MPact_Scoring_Pipeline/hg38.fa
GTF: MPact_Scoring_Pipeline/Homo_sapiens.GRCh38.110.gtf.gz
REDIportal: MPact_Scoring_Pipeline/TABLE1_hg38_v3.txt.gz

Troubleshooting

Missing FASTA index

samtools faidx /path/to/hg38.fa

`samtools` not found

module load samtools
which samtools

Slow startup or no GPU on login node

TensorFlow may print CUDA warnings on CPU-only nodes. This is expected if no GPU is present, and CPU inference still runs.

Memory pressure

Use a smaller batch size:

--batch-size 256

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitignore		.gitignore
QUICKSTART.md		QUICKSTART.md
README.md		README.md
README.pdf		README.pdf
ato_i_filter.py		ato_i_filter.py
example_run.sh		example_run.sh
mini_annotated_e2e.gtf		mini_annotated_e2e.gtf
mini_annotated_e2e.vcf		mini_annotated_e2e.vcf
mini_annotated_test.vcf		mini_annotated_test.vcf
mini_mixed.tsv		mini_mixed.tsv
mini_scoreable.tsv		mini_scoreable.tsv
mini_unannotated_test.vcf		mini_unannotated_test.vcf
model_window_101.h5		model_window_101.h5
model_window_201.h5		model_window_201.h5
model_window_501.h5		model_window_501.h5
requirements.txt		requirements.txt
score_mpact.py		score_mpact.py
submit_mpact_scoring.pbs		submit_mpact_scoring.pbs
test_dbsnp_5k.tsv		test_dbsnp_5k.tsv

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

MPact m6A Variant Scoring Pipeline

Quick Start (Copy-Paste)

Bundled Paths (All In This Directory)

Inputs

Supported file types

Required variant fields

Optional fields

Core Run Commands

1. Run on your own TSV

2. Run on your own VCF

3. Run on included sample (recommended first check)

Important Runtime Defaults

Resume an interrupted run

Output

Header dictionary (all output columns)

Detailed interpretation guide

1) Input passthrough columns

2) Strand and coordinate columns

3) Event typing columns

4) A-to-I proximity columns

5) Score and delta columns

6) Statistical normalization columns

7) Local context columns

8) Row-level interpretation example

Test command

HPC PBS Usage

Resume Note

Troubleshooting

Missing FASTA index

samtools not found

Slow startup or no GPU on login node

Memory pressure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`samtools` not found

Packages