After every pipeline pass has finished editing the sequence, GenAIRR runs an evidence-based
caller against whatever bases survived. It scores every reference allele, keeps only the
alleles tied at the maximum score, and lists the truth allele first when it is in that tie. The
result is published in v_call / d_call / j_call,
sitting beside truth_v_call in the same record so an aligner-style call and the
sampled-at-source truth can be compared row-by-row.
Three deterministic steps. No reject-and-retry, no stochastic search — every base contributes one score increment to every allele it matches, and the call falls out of arithmetic.
truth_v_call tells you what was sampled at recombination — fixed at the source,
immune to any downstream pass. v_call tells you what an evidence-based caller
can recover from the bases that survived corruption. After heavy SHM or 5′ loss the two can
legitimately diverge, and that divergence is itself the signal: it's a window into how
ambiguity grows as the molecule gets chewed.
The earlier caller used a strict-intersect over compatible alleles per base: as soon as one
base ruled out the last shared allele, v_call collapsed to empty. A single
mutation could wipe the entire call field. The caller wasn't wrong — it was honest about
having no consensus — but the field stopped carrying useful information.
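The collapse is easy to reproduce on toy data. The sketch below is illustrative only, not GenAIRR's code: the three-allele reference, its sequences, and both helper functions are hypothetical. It contrasts a per-base strict intersection with a max-score call on a read carrying one mutation.

```python
# Hypothetical three-allele reference; sequences are toy data, not real germlines.
REFERENCE = {
    "A*01": "ACGTAC",
    "A*02": "ACGTTC",
    "A*03": "TCGTAC",
}

def strict_intersect_call(read):
    """Earlier-style caller: intersect the alleles compatible with each base."""
    surviving = set(REFERENCE)
    for pos, base in enumerate(read):
        surviving &= {name for name, seq in REFERENCE.items() if seq[pos] == base}
    return sorted(surviving)

def max_score_call(read):
    """Score-based caller: count matches per allele, keep the tie at the maximum."""
    scores = {
        name: sum(1 for pos, base in enumerate(read) if seq[pos] == base)
        for name, seq in REFERENCE.items()
    }
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score == best)

mutated = "ACGGAC"  # A*01 with one mutation at position 3 (T -> G)
print(strict_intersect_call(mutated))  # []
print(max_score_call(mutated))         # ['A*01']
```

The mutated base matches no allele, so the intersection empties out, while the score-based call still has five matching bases pointing at A*01.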
# single confident call
v_call "IGHV3-23*04"
# tie-set — paralogs the surviving bases can't tell apart
v_call "IGHV3-23*01, IGHV3-23*04"
truth_v_call "IGHV3-23*01"
# honest drift — truth dropped out of the evidence tie
v_call "IGHD6-13*01, IGHD6-19*01"
truth_d_call "IGHD6-25*01"
Per segment, GenAIRR builds a score vector of length N alleles. It walks each base the segment claims, looks up which alleles have that base at the projected reference position, and increments their counts. The final call is the set of alleles tied at the maximum score — nothing fancier.
Each base inside V's (or D's, or J's) own region is scored against every reference allele's germline byte at that ref position. Indel-inserted bases (no ref position) are skipped without failing the call.
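A minimal sketch of the structural scoring step, assuming each surviving base is carried as a (base, ref_pos) pair with ref_pos set to None for indel-inserted bases. That representation, the function name, and the toy reference are assumptions made for illustration; GenAIRR's internal data structures may differ.

```python
# Toy reference: allele name -> germline sequence (hypothetical data).
REFERENCE = {
    "V*01": "ACGTAC",
    "V*02": "ACGTTC",
}

def score_structural(bases, reference):
    """Count, per allele, how many surviving bases match the germline.

    `bases` is a list of (base, ref_pos) pairs; ref_pos is None for a base
    introduced by an insertion, which has no reference coordinate.
    """
    scores = {name: 0 for name in reference}
    for base, ref_pos in bases:
        if ref_pos is None:
            continue  # inserted base: skipped, contributes to no allele
        for name, germline in reference.items():
            if ref_pos < len(germline) and germline[ref_pos] == base:
                scores[name] += 1
    return scores

# Read "ACGGTAC": an inserted G sits between reference positions 2 and 3.
read = [("A", 0), ("C", 1), ("G", 2), ("G", None), ("T", 3), ("A", 4), ("C", 5)]
print(score_structural(read, REFERENCE))  # {'V*01': 6, 'V*02': 5}
```

The inserted base changes neither score, so the one discriminating position (ref position 4) still separates the two alleles.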
After the structural walk, the caller steps into the adjacent NP region — and optionally into the next structural neighbour — looking for bases that still fit each allele's projected continuation. Each match adds to the score; the walk halts when no allele can cover the projected position.
Trim cap: the walk can reach at most trim_5 ref positions before the structural start (and trim_3 after the end), so the sampled allele's own footprint bounds the search.
Structural boundary cap: the walk can extend into the immediately adjacent segment, but never past it into more distant territory.
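A simplified sketch of the 3′ half of the extension walk, under stated assumptions: the function name, its arguments, and the toy D alleles are hypothetical, and the structural boundary cap is modelled by the caller passing in only the flank bases the walk is allowed to see (the NP region plus the one adjacent segment). It is not GenAIRR's implementation.

```python
# Toy D alleles of different lengths (hypothetical data).
REFERENCE = {
    "D*01": "GGTATAGC",
    "D*02": "GGTATACC",
    "D*03": "GGTATA",
}

def extend_3prime(scores, reference, flank, end_ref, trim_3):
    """Walk past the structural end, scoring matches against each allele's continuation.

    flank   : bases following the segment in the read (NP region first)
    end_ref : reference position of the last structural base
    trim_3  : number of germline bases trimmed from the 3' end (the trim cap)
    """
    scores = dict(scores)  # leave the caller's scores untouched
    for step, base in enumerate(flank[:trim_3], start=1):
        ref_pos = end_ref + step
        covering = [name for name, germ in reference.items() if ref_pos < len(germ)]
        if not covering:
            break  # no allele reaches this projected position
        for name in covering:
            if reference[name][ref_pos] == base:
                scores[name] += 1
    return scores

# All three alleles tie on the six structural bases (ref positions 0..5).
structural = {"D*01": 6, "D*02": 6, "D*03": 6}
print(extend_3prime(structural, REFERENCE, flank="GCTT", end_ref=5, trim_3=2))
# {'D*01': 8, 'D*02': 7, 'D*03': 6}
```

Two flank bases are enough to break a three-way tie here, which is also why a coincidental NP match can lift a non-truth allele.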
The call is the set of alleles whose final score equals the maximum. Alleles with strictly more matches win outright. When the max score is zero (extreme corruption, every base edited away from every allele's germline), the full pool ships — every allele is equally consistent with the absence of evidence.
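The selection rule needs no special case for the zero-score situation, as this small sketch shows. The function name and the score dictionaries are hypothetical.

```python
def tie_set(scores):
    """Return every allele whose score equals the maximum, in input order."""
    best = max(scores.values())
    return [name for name, score in scores.items() if score == best]

print(tie_set({"V*01": 41, "V*02": 44, "V*03": 44}))  # ['V*02', 'V*03']
print(tie_set({"V*01": 0, "V*02": 0, "V*03": 0}))     # ['V*01', 'V*02', 'V*03']
```

When every score is zero, the maximum is zero and every allele ties at it, so the full pool is returned by the same rule.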
The AIRR row carries one germline_alignment, one v_identity, one
v_cigar — singular fields, computed against a single projected allele. When
the tie-set has more than one member, the call has to pick one for projection. GenAIRR
prefers the truth allele when it sits inside the tie-set: it's the only allele
whose germline bytes really do match the pool bases.
Net effect: v_call.split(",")[0] always names the allele the alignment-side
fields were computed against. A downstream consumer that reads only the first call gets
coherent metadata; one that parses the whole tie-set gets the honest ambiguity envelope.
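The ordering rule can be sketched in a few lines. The helper name is hypothetical, the plain "," separator is an assumption, and leaving the tie in its existing order when truth is absent is a choice made for this sketch; the text above does not say which allele leads in that case.

```python
def order_call(tie, truth):
    """Put the truth allele first if it is in the tie; otherwise keep the order."""
    if truth in tie:
        return [truth] + [allele for allele in tie if allele != truth]
    return list(tie)

call = ",".join(order_call(["IGHV3-23*04", "IGHV3-23*01"], "IGHV3-23*01"))
print(call)                # IGHV3-23*01,IGHV3-23*04
print(call.split(",")[0])  # IGHV3-23*01
```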
The shape of v_call across a batch carries diagnostic information about how
hard the simulated reads are to call. Three patterns to recognise.
The surviving bases narrow the evidence to one winner. Most records under moderate
corruption land here; v_call equals truth_v_call verbatim. This
is the regime in which an aligner is expected to recover the call cleanly.
Close paralogs (*01 / *04 of the same gene, sibling D genes
like IGHD4-11 / IGHD4-4) tie when the surviving bases don't
cover the discriminating SNPs. D ties most often — the segment is only ~17 bp and a
single coincidental NP match is enough to lift another allele to truth's score.
Under heavy SHM or corruption the evidence can shift away from the originally-sampled
allele entirely. v_call reports the alleles that genuinely scored highest;
truth_v_call remembers what was really drawn. The gap is the honest
aligner-drift case a benchmarking pipeline wants to detect.
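The three patterns can be told apart from the call string and the truth field alone. In the sketch below, the labels "single", "tie", and "drift" and the function name are hypothetical; the parser strips whitespace so it accepts tie-sets written with or without a space after each comma.

```python
from collections import Counter

def classify(call, truth):
    """Label a record as 'single', 'tie', or 'drift' from its call and truth fields."""
    tie = [allele.strip() for allele in call.split(",") if allele.strip()]
    if truth not in tie:
        return "drift"   # truth dropped out of the evidence tie
    return "single" if len(tie) == 1 else "tie"

batch = [
    ("IGHV3-23*04", "IGHV3-23*04"),
    ("IGHV3-23*01, IGHV3-23*04", "IGHV3-23*01"),
    ("IGHD6-13*01, IGHD6-19*01", "IGHD6-25*01"),
]
print(Counter(classify(call, truth) for call, truth in batch))
# Counter({'single': 1, 'tie': 1, 'drift': 1})
```

Tallying these labels across a batch gives a quick read on how hard the simulated reads are to call.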
import GenAIRR as ga

# Placeholder: map sequence_id -> the V call your aligner produced.
predictions = {}

records = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(count=25)
    .corrupt_pcr(count=8)
    .run_records(n=1000, seed=42, expose_provenance=True)
)

for rec in records:
    # strip() tolerates a space after each comma in the tie-set
    tie = [allele.strip() for allele in rec["v_call"].split(",")]
    truth_in_tie = rec["truth_v_call"] in tie
    # benchmark: did an aligner's call appear anywhere in our tie?
    aligner_in_tie = predictions.get(rec["sequence_id"]) in tie
    # diagnostic: did the evidence drift off truth?
    drift = not truth_in_tie