Chapter 05 · Reading the evidence

Live calls — what the corrupted bases actually say.

After every pipeline pass has finished editing the sequence, GenAIRR runs an evidence-based caller against whatever bases survived. It scores every reference allele, keeps the strict tie at max score, and orders the surviving truth allele first when it's in the tie. The result is published in v_call / d_call / j_call, sitting beside truth_v_call in the same record so an aligner-style call and the sampled-at-source truth can be compared row-by-row.

01 Score every allele: +1 per matching base, structural + elastic
02 Strict tie at max: only alleles with the highest score survive
03 Truth first if it survives: keeps v_call.split(",")[0] aligned with the alignment fields

Three deterministic steps. No reject-and-retry, no stochastic search — every base contributes one score increment to every allele it matches, and the call falls out of arithmetic.

PART 01

Why GenAIRR records two call fields per segment.

truth_v_call tells you what was sampled at recombination — fixed at the source, immune to any downstream pass. v_call tells you what an evidence-based caller can recover from the bases that survived corruption. After heavy SHM or 5′ loss the two can legitimately diverge, and that divergence is itself the signal: it's a window into how ambiguity grows as the molecule gets chewed.

The pre-Y6 problem

The earlier caller used a strict-intersect over compatible alleles per base: as soon as one base ruled out the last shared allele, v_call collapsed to empty. A single mutation could wipe the entire call field. The caller wasn't wrong — it was honest about having no consensus — but the field stopped carrying useful information.
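To make the collapse concrete, here is a minimal sketch of a strict-intersect caller. The allele names, the toy sequences, and the `strict_intersect_call` helper are all hypothetical and exist only for illustration; this is not GenAIRR's earlier code.

```python
def strict_intersect_call(read, germlines):
    """Keep only alleles compatible with every base (hypothetical illustration)."""
    compatible = set(germlines)
    for pos, base in enumerate(read):
        # Alleles whose germline byte matches the observed base at this position.
        matching = {name for name, seq in germlines.items() if seq[pos] == base}
        compatible &= matching
    return sorted(compatible)


# Toy reference: two alleles that differ at index 3 only.
germlines = {"A*01": "ACGTAC", "A*02": "ACGAAC"}

clean = strict_intersect_call("ACGTAC", germlines)
# One mutation at the last base matches no allele, so the intersection empties.
mutated = strict_intersect_call("ACGTAG", germlines)
```

In this toy case `clean` still names one allele, while `mutated` is an empty list even though five of the six bases agree with `A*01`.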

What you see in the record now

# single confident call
v_call       "IGHV3-23*04"

# tie-set: paralogs the surviving bases can't tell apart (truth listed first)
v_call       "IGHV3-23*01, IGHV3-23*04"
truth_v_call "IGHV3-23*01"

# honest drift: truth dropped out of the evidence tie
d_call       "IGHD6-13*01, IGHD6-19*01"
truth_d_call "IGHD6-25*01"

PART 02

Score every allele. Take the strict tie at max.

Per segment, GenAIRR builds a score vector of length N alleles. It walks each base the segment claims, looks up which alleles have that base at the projected reference position, and increments their counts. The final call is the set of alleles tied at the maximum score — nothing fancier.

The structural walk

Each base inside V's (or D's, or J's) own region is scored against every reference allele's germline byte at that ref position. Indel-inserted bases (no ref position) are skipped without failing the call.
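A rough sketch of the structural walk, assuming each observed base arrives paired with its projected ref position, or `None` for an indel-inserted base. The `structural_scores` helper, the data layout, and the toy alleles are assumptions made for this example, not GenAIRR internals.

```python
def structural_scores(bases, germlines):
    """Score alleles over one segment.

    bases: list of (observed_base, ref_pos) pairs; ref_pos is None for
           indel-inserted bases (assumed layout).
    germlines: dict mapping allele name -> germline sequence.
    """
    scores = {name: 0 for name in germlines}
    for base, ref_pos in bases:
        if ref_pos is None:
            continue  # inserted base: no ref position, skipped without failing
        for name, seq in germlines.items():
            # +1 for every allele whose germline byte matches at this position.
            if ref_pos < len(seq) and seq[ref_pos] == base:
                scores[name] += 1
    return scores


germlines = {"A*01": "ACGTAC", "A*02": "ACGAAC"}
bases = [("A", 0), ("C", 1), ("G", 2), ("T", None), ("T", 3), ("A", 4), ("G", 5)]
scores = structural_scores(bases, germlines)
```

Here the inserted `T` contributes nothing, the mutated last base matches neither allele, and `A*01` finishes one point ahead because it alone matches at ref position 3.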

Elastic boundary extension

After the structural walk, the caller steps into the adjacent NP region — and optionally into the next structural neighbour — looking for bases that still fit each allele's projected continuation. Each match adds to the score; the walk halts when no allele can cover the projected position.

Two caps on the walker

Trim cap: the walk can reach at most trim_5 ref positions before the structural start (and trim_3 after the end) — the sampled allele's own footprint bounds the search. Structural boundary cap: the walk can overlap into the one neighbour segment, but never skip past it into more-distant territory.
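The 3′ side of the elastic walk could look roughly like the sketch below. It assumes the caller hands over the NP bases (optionally followed by the one neighbouring segment, so nothing beyond that neighbour is ever visible) and the sampled allele's `trim_3` value. The `extend_3prime` helper and its parameters are hypothetical names chosen for this example.

```python
def extend_3prime(scores, germlines, ref_end, adjacent_bases, trim_3):
    """Hypothetical 3' elastic walk.

    ref_end: first ref position after the structural region.
    adjacent_bases: NP bases, optionally followed by the one neighbour segment.
    trim_3: germline positions trimmed from the sampled allele's 3' end.
    """
    extended = dict(scores)
    for offset, base in enumerate(adjacent_bases):
        if offset >= trim_3:
            break  # trim cap: never walk past the sampled allele's footprint
        ref_pos = ref_end + offset
        covering = [name for name, seq in germlines.items() if ref_pos < len(seq)]
        if not covering:
            break  # no allele can cover the projected position
        for name in covering:
            if germlines[name][ref_pos] == base:
                extended[name] += 1
    return extended


germlines = {"A*01": "ACGTACGG", "A*02": "ACGTACTT"}
scores = {"A*01": 6, "A*02": 6}  # tied after the structural walk
extended = extend_3prime(scores, germlines, ref_end=6,
                         adjacent_bases="GGCA", trim_3=2)
```

In this toy case two NP bases happen to continue `A*01`'s germline, which breaks the structural tie; the walk stops after two positions because `trim_3` is 2.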

Strict tie at max

The call is the set of alleles whose final score equals the maximum. Alleles with strictly more matches win outright. When the max score is zero (extreme corruption, every base edited away from every allele's germline), the full pool ships — every allele is equally consistent with the absence of evidence.
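The tie rule itself is small enough to sketch in a few lines. The `tie_at_max` helper is a hypothetical name; the point is that the zero-score case needs no special branch, since every allele ties at zero.

```python
def tie_at_max(scores):
    """Return every allele whose score equals the maximum (illustrative helper)."""
    best = max(scores.values())
    # If best == 0, all alleles tie at zero and the full pool is returned.
    return [name for name, score in scores.items() if score == best]


winner = tie_at_max({"A*01": 5, "A*02": 4, "B*01": 3})
tied = tie_at_max({"A*01": 5, "A*02": 5, "B*01": 3})
no_evidence = tie_at_max({"A*01": 0, "A*02": 0, "B*01": 0})
```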

PART 03

Truth-first ordering keeps the row internally consistent.

The AIRR row carries one germline_alignment, one v_identity, one v_cigar — singular fields, computed against a single projected allele. When the tie-set has more than one member, the call has to pick one for projection. GenAIRR prefers the truth allele when it sits inside the tie-set: it's the only allele whose germline bytes really do match the pool bases.

Two seams, one rule
  • projected_allele_id: picks truth from the tie-set; falls back to the first tied id; ultimately falls back to truth even when there's no live-call layer
  • live_call_name: orders the comma-separated string so the truth allele appears first when it's in the tie

Fields that follow the projection
  • germline_alignment: filled with the projected allele's bytes at every claimed ref position
  • v_identity · d_identity · j_identity: matches over the alignment span vs. the projected allele's bytes
  • v_cigar · d_cigar · j_cigar: M / I / D ops referenced to the projected allele
  • v_germline_start / end: ref-position span the projection covered

Net effect: v_call.split(",")[0] always names the allele the alignment-side fields were computed against. A downstream consumer that reads only the first call gets coherent metadata; one that parses the whole tie-set gets the honest ambiguity envelope.
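The two seams can be sketched as a pair of small functions. `pick_projected` and `format_call` are hypothetical stand-ins written for this example, and the bare `","` separator is an assumption; they mirror the described behaviour of projected_allele_id and live_call_name, not their actual signatures.

```python
def pick_projected(tie, truth):
    """Choose the allele the alignment-side fields are computed against."""
    if not tie:
        return truth  # no live-call layer: fall back to truth
    return truth if truth in tie else tie[0]


def format_call(tie, truth):
    """Comma-separated call string with truth first when it is in the tie."""
    # sorted() is stable, so the non-truth alleles keep their original order.
    ordered = sorted(tie, key=lambda name: name != truth)
    return ",".join(ordered)


tie = ["IGHV3-23*04", "IGHV3-23*01"]
call = format_call(tie, truth="IGHV3-23*01")
projected = pick_projected(tie, truth="IGHV3-23*01")

drift_tie = ["IGHD6-13*01", "IGHD6-19*01"]
drift_call = format_call(drift_tie, truth="IGHD6-25*01")
drift_projected = pick_projected(drift_tie, truth="IGHD6-25*01")
```

Because both functions apply the same preference, the first entry of the call string and the projected allele agree in both the truth-in-tie case and the drift case.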

PART 04

Reading the tie-set.

The shape of v_call across a batch carries diagnostic information about how hard the simulated reads are to call. Three patterns to recognise.

Single allele — confident call

The surviving bases narrow the evidence to one winner. Most records under moderate corruption land here; v_call equals truth_v_call verbatim. This is the regime in which an aligner is expected to recover the call cleanly.

Two or three alleles — paralog ambiguity

Close paralogs (*01 / *04 of the same gene, sibling D genes like IGHD4-11 / IGHD4-4) tie when the surviving bases don't cover the discriminating SNPs. D ties most often — the segment is only ~17 bp and a single coincidental NP match is enough to lift another allele to truth's score.

Truth dropped from the tie

Under heavy SHM or corruption the evidence can shift away from the originally-sampled allele entirely. v_call reports the alleles that genuinely scored highest; truth_v_call remembers what was really drawn. The gap is the honest aligner-drift case a benchmarking pipeline wants to detect.
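One way to tally the three patterns across a batch is sketched below. The `classify_call` helper and its labels are hypothetical, and the sample rows are hand-written strings rather than GenAIRR output. A call that names a single allele other than truth is counted as drift here, which is an assumption of this sketch.

```python
from collections import Counter


def classify_call(call, truth):
    """Label one call string as 'confident', 'ambiguous', or 'drift'."""
    # strip() tolerates optional whitespace after the comma separator.
    tie = [allele.strip() for allele in call.split(",") if allele.strip()]
    if truth not in tie:
        return "drift"
    return "confident" if len(tie) == 1 else "ambiguous"


# Hand-written (call, truth) pairs, one per pattern.
rows = [
    ("IGHV3-23*04", "IGHV3-23*04"),
    ("IGHV3-23*01, IGHV3-23*04", "IGHV3-23*01"),
    ("IGHD6-13*01, IGHD6-19*01", "IGHD6-25*01"),
]
summary = Counter(classify_call(call, truth) for call, truth in rows)
```

Comparing the three counts across corruption settings gives a quick read on how fast ambiguity and drift grow.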

Walking the tie-set in Python
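import GenAIRR as ga

records = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(count=25)
       .corrupt_pcr(count=8)
       .run_records(n=1000, seed=42, expose_provenance=True)
)

# Placeholder: map sequence_id -> V call from the aligner under test.
# Populate this from your aligner's output before running the loop.
predictions = {}

n_drift = 0
n_aligner_hit = 0
for rec in records:
    # strip() guards against whitespace after the comma separator
    tie = [allele.strip() for allele in rec["v_call"].split(",")]
    truth_in_tie = rec["truth_v_call"] in tie

    # benchmark: did an aligner's call appear anywhere in our tie?
    aligner_in_tie = predictions.get(rec["sequence_id"]) in tie
    n_aligner_hit += aligner_in_tie

    # diagnostic: did the evidence drift off truth?
    drift = not truth_in_tie
    n_drift += drift

print(f"drift: {n_drift}, aligner in tie: {n_aligner_hit}")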