A GenAIRR record is not an accumulator of fields filled in across the pipeline. It's
computed once, at the end, by reading the final IR snapshot. Coordinates, CIGARs,
junctions, identities, calls — everything is derived from a single source of truth, so
nothing can drift out of sync. And alongside the conventional AIRR schema, GenAIRR carries
a parallel set of truth_* fields that record what actually happened
during simulation — independent of what an aligner would later infer.
Many simulators carry an AIRR record alongside the sequence and update its fields as the pipeline progresses. An indel happens, and every downstream coordinate has to be shifted; a trim arrives, and the gene-end bookkeeping has to be redone. Every update is a chance to introduce a desynchronization bug.
GenAIRR keeps no such running record. Coordinates exist only on the final pool. CIGARs are
walked once at the end. productive is decided by reading the junction codons
from the final state. There is no parallel state to fall out of sync — there is nothing to
fall out of sync with.
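To make the idea concrete, here is a minimal sketch of deriving coordinates and a CIGAR from a final pool of bases. It is not GenAIRR's implementation: the `Base` layout, the segment labels, and the helpers `make_segment`, `segment_spans`, and `cigar` are all hypothetical names chosen for illustration, and mismatches are simply folded into `M`.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Base:
    """One nucleotide of the final pool plus its provenance (hypothetical layout)."""
    nt: str
    segment: str             # e.g. "V", "NP1", "J"
    ref_pos: Optional[int]   # germline position; None for an inserted base


def make_segment(seq: str, segment: str, templated: bool = True) -> List[Base]:
    """Build a run of bases; templated bases remember their germline position."""
    return [Base(nt, segment, i if templated else None) for i, nt in enumerate(seq)]


def segment_spans(pool: List[Base]) -> Dict[str, Tuple[int, int]]:
    """Half-open (start, end) span of each segment, read off the final pool."""
    spans: Dict[str, Tuple[int, int]] = {}
    for i, base in enumerate(pool):
        start, _ = spans.get(base.segment, (i, i))
        spans[base.segment] = (start, i + 1)
    return spans


def cigar(pool: List[Base], segment: str) -> str:
    """Walk one segment once and emit M/I/D runs against its germline positions."""
    ops: List[str] = []
    prev_ref: Optional[int] = None
    for base in pool:
        if base.segment != segment:
            continue
        if base.ref_pos is None:
            ops.append("I")          # inserted base: no germline counterpart
            continue
        if prev_ref is not None:
            # a jump in germline position means bases were deleted in between
            ops.extend("D" * (base.ref_pos - prev_ref - 1))
        ops.append("M")
        prev_ref = base.ref_pos
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


pool = (
    make_segment("ACGTAC", "V")
    + make_segment("G", "NP1", templated=False)
    + make_segment("TTGG", "J")
)
spans_before = segment_spans(pool)

# An insertion inside V is just an edit to the pool; no coordinates are patched.
pool.insert(3, Base("T", "V", None))
spans_after = segment_spans(pool)

print(spans_before["J"], "->", spans_after["J"])   # (7, 11) -> (8, 12)
print(cigar(pool, "V"))                            # 3M1I3M
```

The J span moves on its own because it is recomputed from the pool, not adjusted by hand. That is the property the text above describes: with no running record, there is nothing to forget to update.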
Two questions sit at the heart of every AIRR record: "What gene did this come from?" and "What gene does this sequence look like?" Most of the time the answers agree. Under heavy somatic hypermutation they don't, and the disagreement is exactly the signal you want to study when benchmarking an aligner. GenAIRR records both, always, in parallel columns.
The truth_* call fields, such as truth_v_call, hold the names of the alleles drawn from the reference library at the moment of recombination. They never change afterwards. They're the ground truth.
The evidence calls, such as v_call, are computed at record time by scoring every reference
allele against the final sequence and returning every allele tied at the maximum score.
When the truth allele is part of that tie-set it is listed first, so
v_call.split(",")[0] stays consistent with the alignment and identity fields. After heavy
SHM or corruption the tie-set may diverge from truth_* entirely, which is exactly the case
where an aligner could be misled.
Same record, two answers. The evidence caller can't tell *01 from *04 using the
surviving bases (both score the same), so v_call reports both, truth first.
truth_v_call remembers that the actual draw was *01.
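A rough sketch of the tie-set logic might look like the following. The scoring function here is a toy ungapped match count, not GenAIRR's scorer, and the allele names, sequences, and the `evidence_call` helper are all invented for illustration.

```python
from typing import Dict


def score(seq: str, ref: str) -> int:
    """Toy score: number of matching positions over the compared length."""
    return sum(a == b for a, b in zip(seq, ref))


def evidence_call(seq: str, references: Dict[str, str], truth_name: str) -> str:
    """Return every allele tied at the maximum score, truth first if it is tied."""
    scores = {name: score(seq, ref) for name, ref in references.items()}
    best = max(scores.values())
    tied = sorted(name for name, s in scores.items() if s == best)
    if truth_name in tied:
        tied.remove(truth_name)
        tied.insert(0, truth_name)
    return ",".join(tied)


# Toy reference library: *01 and *04 differ only at their last base.
references = {
    "TOYV1*01": "ACGTACGTAA",
    "TOYV1*04": "ACGTACGTAC",
    "TOYV2*01": "TTGTACCTAA",
}

# A 3' trim removed the only base that separates *01 from *04.
trimmed = "ACGTACGTA"
tied_call = evidence_call(trimmed, references, truth_name="TOYV1*04")
print(tied_call)        # TOYV1*04,TOYV1*01

# Heavy mutation can push the best score onto a different allele altogether.
mutated = "TTGTACCTA"
diverged_call = evidence_call(mutated, references, truth_name="TOYV1*01")
print(diverged_call)    # TOYV2*01
```

In the first case the truth allele is in the tie-set and is moved to the front. In the second case it drops out of the tie-set, so the evidence call and the truth call disagree.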
The junction is the most important region of an immune receptor, and the most likely to shift around as trims and mutations accumulate. To find it on the final sequence, GenAIRR doesn't measure offsets from the start. It looks for biological anchors: the conserved Cysteine codon at the 3′ end of V, and the Tryptophan (heavy chain) or Phenylalanine (κ, λ, and most TR chains) codon at the 5′ end of J. Both anchors are baked into the per-base provenance, so the lookup survives any amount of mutation, trim, or sequencing loss.
rec = exp.run(1)[0]
# the final sequence has been:
# - mutated at 11 positions in V
# - 5'-corrupted (lost the first 28 bp)
# - has an indel insert at position 142
# the junction is STILL found correctly:
rec.final_simulation().junction_span()
# (267, 309) — Cys-codon start, anchor-codon end
rec.junction_aa
# 'CARDVPYAFDIW' — anchor + payload + anchor
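If you want to see why an anchor-based lookup is robust, here is a small sketch under stated assumptions. It tags the anchor-codon bases directly on a toy pool. The `Base` class, the `"V_CYS"` and `"J_ANCHOR"` tag values, and this standalone `junction_span` function are hypothetical and are not the library's internals.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Base:
    nt: str
    anchor: Optional[str] = None   # "V_CYS" or "J_ANCHOR" on anchor-codon bases


def tag(seq: str, anchor: Optional[str] = None) -> List[Base]:
    """Turn a string into bases, optionally tagging all of them as an anchor codon."""
    return [Base(nt, anchor) for nt in seq]


def junction_span(pool: List[Base]) -> Optional[Tuple[int, int]]:
    """Half-open span from the first V-anchor base to the last J-anchor base."""
    v = [i for i, b in enumerate(pool) if b.anchor == "V_CYS"]
    j = [i for i, b in enumerate(pool) if b.anchor == "J_ANCHOR"]
    if not v or not j:
        return None   # in this toy, a lost anchor means no junction is reported
    return (v[0], j[-1] + 1)


pool = (
    tag("GAGGTG")
    + tag("TGT", "V_CYS")      # conserved Cys codon
    + tag("GCGAGA")            # junction payload
    + tag("TGG", "J_ANCHOR")   # Trp codon
    + tag("GGC")
)
full_span = junction_span(pool)
print(full_span)               # (6, 18)

# 5' corruption drops four bases, then a mutation hits the Cys codon itself.
corrupted = pool[4:]
corrupted[2] = replace(corrupted[2], nt="A")   # the tag travels with the base
shifted_span = junction_span(corrupted)
print(shifted_span)            # (2, 14)

start, end = shifted_span
junction = "".join(b.nt for b in corrupted[start:end])
print(junction)                # AGTGCGAGATGG
```

The lookup never inspects the nucleotides, only the tags, so it does not matter that the Cys codon now reads AGT or that every index moved by four.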
Every record has the same shape regardless of which passes ran. Unused phases simply leave their fields at sentinel defaults, so there is no schema branching. The fields fall into six families, each derived from the same pool snapshot.
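One way to picture "same shape, sentinel defaults" is a record type whose fields all have defaults. The sketch below is illustrative only: `ToyRecord`, its field names other than v_call and truth_v_call, and the chosen sentinel values are assumptions, not GenAIRR's actual schema.

```python
from dataclasses import asdict, dataclass


@dataclass
class ToyRecord:
    # Always populated.
    sequence: str = ""
    v_call: str = ""
    truth_v_call: str = ""
    # Filled only when the corresponding pass ran; otherwise left at a sentinel.
    n_mutations: int = 0
    n_indels: int = 0
    corrupted_5p_length: int = 0


def build_record(sequence: str, v_call: str, truth_v_call: str, **phase_outputs) -> ToyRecord:
    """Build a record; passes that did not run contribute nothing and keep defaults."""
    return ToyRecord(sequence=sequence, v_call=v_call, truth_v_call=truth_v_call, **phase_outputs)


plain = build_record("ACGT", "TOYV1*01", "TOYV1*01")
mutated = build_record("ACGA", "TOYV1*01", "TOYV1*01", n_mutations=1)

# Both records expose exactly the same columns, in the same order.
print(list(asdict(plain)) == list(asdict(mutated)))   # True
print(plain.n_mutations, mutated.n_mutations)         # 0 1
```

Downstream code can then read any column from any record without first checking which passes were configured.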
Beyond the AIRR record, every random draw the engine made is logged in an addressed trace. Think of it as the black-box flight recorder for one simulation. If a record looks surprising, you don't have to guess why — you can look up the exact draw that produced it.
trace = outcome.trace()
# single draw, by exact address
trace.find("sample_allele.v").value
# 146 — pool index of the V allele drawn
# every draw under a prefix
trace.prefix_query("np.np1")
# [length=5, bases[0]='A', bases[1]='C', ...]
# count the draws below a prefix
trace.prefix_count("mutate.s5f")
# 11 — number of SHM substitutions in this record
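To show what an addressed trace could look like underneath, here is a self-contained toy. It borrows the method names find, prefix_query, and prefix_count from the calls above, but the `ToyTrace` and `Draw` classes, the `record` method, and the dotted address layout are assumptions made for this sketch and are not the engine's real data structures.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass(frozen=True)
class Draw:
    address: str   # dotted path naming where in the pipeline the draw happened
    value: Any     # whatever was sampled


class ToyTrace:
    def __init__(self) -> None:
        self._draws: List[Draw] = []

    def record(self, address: str, value: Any) -> None:
        """Log one random draw under its address, in the order draws occur."""
        self._draws.append(Draw(address, value))

    def find(self, address: str) -> Optional[Draw]:
        """Return the first draw with exactly this address, or None."""
        for draw in self._draws:
            if draw.address == address:
                return draw
        return None

    def prefix_query(self, prefix: str) -> List[Draw]:
        """Return every draw at the prefix or nested below it."""
        return [
            d for d in self._draws
            if d.address == prefix or d.address.startswith(prefix + ".")
        ]

    def prefix_count(self, prefix: str) -> int:
        return len(self.prefix_query(prefix))


trace = ToyTrace()
trace.record("sample_allele.v", 146)
trace.record("np.np1.length", 2)
trace.record("np.np1.bases[0]", "A")
trace.record("np.np1.bases[1]", "C")
trace.record("np.np2.length", 0)
trace.record("mutate.s5f.site[0]", 41)
trace.record("mutate.s5f.site[1]", 97)

print(trace.find("sample_allele.v").value)               # 146
print([d.address for d in trace.prefix_query("np.np1")])
print(trace.prefix_count("mutate.s5f"))                  # 2
```

Matching on `prefix + "."` rather than a bare `startswith(prefix)` keeps a query for np.np1 from also picking up a sibling address such as np.np10.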