Chapter 04 · The answer key

The AIRR record — ~70 fields, all derived.

A GenAIRR record is not an accumulator of fields filled in across the pipeline. It's computed once, at the end, by reading the final IR snapshot. Coordinates, CIGARs, junctions, identities, calls — everything is derived from a single source of truth, so nothing can drift out of sync. And alongside the conventional AIRR schema, GenAIRR carries a parallel set of truth_* fields that record what actually happened during simulation — independent of what an aligner would later infer.

01 Read pool: final nucleotide pool · regions · assignments

02 Locate anchors: V Cys · J anchor · junction span

03 Derive fields: coordinates · CIGARs · calls · truth columns

PART 01

Derived, not accumulated.

Many simulators carry an AIRR record alongside the sequence and update its fields as the pipeline progresses. An indel happens, and every downstream coordinate has to be shifted; a trim arrives, and the recorded gene end has to be adjusted. Every update is a chance to introduce a desynchronization bug.

GenAIRR keeps no such running record. Coordinates exist only on the final pool. CIGARs are walked once at the end. productive is decided by reading the junction codons from the final state. There is no parallel state to fall out of sync — there is nothing to fall out of sync with.

Most simulators

incremental: each phase updates AIRR fields in place
brittle: indels and trims cascade through coordinates
silent drift: an off-by-one survives until much later

GenAIRR

terminal: record is computed once from the final IR
single source: every field reads the same pool snapshot
guaranteed: desynchronization is categorically impossible
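
The contrast above can be sketched in a few lines. This is a minimal illustration rather than GenAIRR's implementation: `Base`, `derive_record`, the region labels, and the half-open 0-based coordinates are all assumptions made for the example. The point is that coordinates are read off the final pool in one pass, so an earlier insertion never needs a coordinate fix-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    nt: str      # nucleotide in the final sequence
    region: str  # hypothetical region label: "V", "NP1", "D", "NP2", "J"


def derive_record(pool: list[Base]) -> dict:
    """Compute every field in one pass over the final pool (hypothetical sketch)."""
    record = {
        "sequence": "".join(b.nt for b in pool),
        "sequence_length": len(pool),
    }
    for region in ("V", "D", "J"):
        positions = [i for i, b in enumerate(pool) if b.region == region]
        key = region.lower()
        # Half-open [start, end) coordinates; None is the assumed sentinel
        # for a region that is absent from this pool.
        record[f"{key}_sequence_start"] = positions[0] if positions else None
        record[f"{key}_sequence_end"] = positions[-1] + 1 if positions else None
    return record


pool = ([Base(nt, "V") for nt in "ACGT"]
        + [Base("G", "NP1")]
        + [Base(nt, "J") for nt in "TTG"])
before = derive_record(pool)

# Simulate an insertion inside V: edit the pool, then derive again.
# No stored coordinate has to be patched, because none was stored.
pool.insert(2, Base("A", "V"))
after = derive_record(pool)

print(before["j_sequence_start"], after["j_sequence_start"])  # 5 6
```

In an incremental design, the same insertion would require finding and shifting every stored coordinate downstream of position 2.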

PART 02

Birth certificate vs lookalike.

Two questions sit at the heart of every AIRR record: what gene did this come from? and what gene does this sequence look like? Most of the time these agree. Under heavy somatic hypermutation they don't — and the disagreement is exactly the signal you want to study when benchmarking an aligner. GenAIRR records both, always, in parallel columns.

Sampled truth — the birth certificate

truth_v_call: the V allele GenAIRR sampled at recombination
truth_d_call: the D allele sampled, fixed at the source
truth_j_call: the J allele sampled, fixed at the source

These are the names of the alleles drawn from the reference library at the moment of recombination. They don't change. They're the ground truth.

Evidence-based call — what the sequence looks like now

v_call: V allele(s) the corrupted sequence supports, comma-separated when paralogs can't be told apart
d_call: D allele(s) the (often short) middle region supports; ties are most common here
j_call: J allele(s) the final 3′ region supports

These calls are computed at record time by scoring every reference allele against the final sequence and returning the strict tie-set at the maximum score. When the truth allele is part of the tie-set it is listed first, so v_call.split(",")[0] stays consistent with the alignment and identity fields. After heavy SHM or corruption the tie-set may diverge from truth_* entirely, which is exactly the case where an aligner could be misled.

truth (sampled)

IGHV3-23*01

≠

evidence (tie-set after SHM)

IGHV3-23*01, IGHV3-23*04

Same record. The evidence caller can't tell *01 from *04 using the surviving bases, since both score the same, so v_call reports both, truth first. truth_v_call remembers that the actual draw was *01.
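
The score-and-tie rule described above can be sketched as follows. This is a toy model under stated assumptions: `match_score` and `call_alleles` are hypothetical helpers, the scoring is a plain position-by-position match count rather than whatever GenAIRR actually uses, and the reference sequences are invented 10-mers attached to real allele names only for illustration.

```python
def match_score(observed: str, germline: str) -> int:
    """Count matching positions; an 'N' in the observed read carries no evidence."""
    return sum(1 for o, g in zip(observed, germline) if o != "N" and o == g)


def call_alleles(observed: str, references: dict[str, str], truth: str) -> str:
    """Return the strict tie-set at max score, truth allele first if present."""
    scores = {name: match_score(observed, seq) for name, seq in references.items()}
    best = max(scores.values())
    tie_set = sorted(name for name, score in scores.items() if score == best)
    if truth in tie_set:
        tie_set.remove(truth)
        tie_set.insert(0, truth)
    return ",".join(tie_set)


# Invented toy sequences, not real germline alleles.
references = {
    "IGHV3-23*01": "ACGTACGTAC",
    "IGHV3-23*04": "ACGTACGTAT",  # differs from *01 only at the last base
    "IGHV1-2*02": "TTGTACCTAC",
}

observed = "ACGTACGTAN"  # the one distinguishing base was lost
v_call = call_alleles(observed, references, truth="IGHV3-23*01")
print(v_call)                # IGHV3-23*01,IGHV3-23*04
print(v_call.split(",")[0])  # IGHV3-23*01
```

If the observed sequence had drifted far enough to score best against a different allele, the truth allele would simply be absent from the returned tie-set, which is the divergence case worth studying.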

PART 03

Locating the junction by landmarks.

The junction is the most important region of an immune receptor, and the most likely to shift around as trims and mutations accumulate. To find it on the final sequence, GenAIRR doesn't measure offsets from the start. It looks for biological anchors: the conserved Cysteine codon at the 3′ end of V, and the conserved Tryptophan (heavy chain) or Phenylalanine (κ, λ, and most TR) codon at the 5′ end of J. Both anchors are baked into the per-base provenance, so the lookup survives mutation, trimming, and sequencing loss as long as the anchor bases themselves remain in the sequence.

Anchor-driven coordinate fields

junction_start / junction_end: first base of the Cys codon · last base of the J anchor codon
junction · junction_aa: nucleotide and amino-acid sequence spanning both anchors, inclusive
cdr3_start / cdr3_end: the junction minus the flanking anchor codons

Survives even heavy trim and SHM

rec = exp.run(1)[0]

# the final sequence:
#   - was mutated at 11 positions in V
#   - was 5'-corrupted (lost the first 28 bp)
#   - carries an inserted indel at position 142
# the junction is STILL found correctly:

rec.final_simulation().junction_span()
# (267, 309) — Cys-codon start, anchor-codon end

rec.junction_aa
# 'CARDVPYAFDIW'  — anchor + payload + anchor
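
One way to read "anchors baked into the per-base provenance" is that each surviving base carries a tag saying which germline role it plays. The sketch below works under that assumption; `Base`, the tag values `"V_CYS"` and `"J_ANCHOR"`, and this standalone `junction_span` function are hypothetical and are not GenAIRR's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Base:
    nt: str
    anchor: Optional[str] = None  # hypothetical tag: "V_CYS", "J_ANCHOR", or None


def junction_span(pool: list[Base]) -> Optional[tuple[int, int]]:
    """Half-open (start, end): first V-anchor base to one past the last J-anchor base."""
    v = [i for i, b in enumerate(pool) if b.anchor == "V_CYS"]
    j = [i for i, b in enumerate(pool) if b.anchor == "J_ANCHOR"]
    if not v or not j:
        return None  # an anchor was trimmed away, so no junction can be located
    return (v[0], j[-1] + 1)


def tagged(seq: str, anchor: Optional[str] = None) -> list[Base]:
    return [Base(nt, anchor) for nt in seq]


pool = (tagged("GAGGTG") + tagged("TGT", "V_CYS") + tagged("GCGAGA")
        + tagged("TGG", "J_ANCHOR") + tagged("GGC"))
print(junction_span(pool))  # (6, 18)

# Lose 4 bases at the 5' end and substitute the third base of the Cys codon.
damaged = pool[4:]
damaged[4] = Base("C", "V_CYS")  # the tag travels with the position, not the letter
start, end = junction_span(damaged)
print((start, end))                                # (2, 14)
print("".join(b.nt for b in damaged[start:end]))   # TGCGCGAGATGG
```

Because the lookup keys on the tag rather than on an offset or a sequence motif, neither the 5′ loss nor the substitution disturbs it. It only fails when every tagged base of an anchor has been removed.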

PART 04

The ~70 fields, by family.

Every record has the same shape regardless of which passes ran. Unused phases simply leave their fields at sentinel defaults, so there is no schema branching. The fields fall into six families, each derived from the same pool snapshot.

Identity & truth

sequence_id: str
truth_v_call: IGHV3-23*01
truth_d_call: IGHD3-10*01
truth_j_call: IGHJ4*02
v_call · d_call · j_call: evidence tie-set
locus · chain: IGH · heavy

Sequence & coordinates

sequence: bytes
sequence_length: 320
v_sequence_start / end: 0 · 258
d_sequence_start / end: 263 · 276
j_sequence_start / end: 281 · 320
v_germline_start / end: 0 · 286
plus d_germline_*, j_germline_*, np1_*, np2_*

Junction & productivity

junction: TGT…TGG
junction_aa: CARDVPYAFDIW
junction_length: 42
cdr3_aa: ARDVPYAFDI
productive: True
stop_codon: False
vj_in_frame: True
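
The fields in this family can all be read from the junction nucleotides alone. The sketch below assumes the simplest possible rule, "productive means in frame with no stop codon"; GenAIRR's actual definition may include further checks. `junction_fields` is a hypothetical helper, and the junction strings are short invented examples.

```python
from itertools import product

# Standard genetic code, laid out in TCAG order (last base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def junction_fields(junction: str) -> dict:
    """Derive junction-family fields from the junction nucleotides (sketch)."""
    in_frame = len(junction) % 3 == 0
    # Translate complete codons only; unknown codons (e.g. with N) become 'X'.
    aa = "".join(
        CODON_TABLE.get(junction[i:i + 3], "X")
        for i in range(0, len(junction) - 2, 3)
    )
    stop = "*" in aa
    return {
        "junction_length": len(junction),
        "junction_aa": aa,
        "cdr3_aa": aa[1:-1],  # junction minus the two anchor residues
        "vj_in_frame": in_frame,
        "stop_codon": stop,
        "productive": in_frame and not stop,
    }


ok = junction_fields("TGTGCGAGAGATTACTGG")
print(ok["junction_aa"], ok["cdr3_aa"], ok["productive"])  # CARDYW ARDY True

with_stop = junction_fields("TGTGCGTGAGATTACTGG")  # third codon is TGA
print(with_stop["stop_codon"], with_stop["productive"])    # True False
```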

Alignment & CIGAR

v_cigar: 3S280M
d_cigar: 12M
j_cigar: 35M4S
v_identity: 0.962
d_identity · j_identity: float · float
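
A CIGAR string is a run-length encoding of per-base alignment operations, so it can be produced in a single walk once those operations are known. The sketch below assumes the operations have already been read off the final pool as a string of op letters; `walk_cigar` and `identity` are hypothetical helpers, and the identity definition (matches over gap-free aligned positions) is one common convention, not necessarily GenAIRR's.

```python
from itertools import groupby


def walk_cigar(ops: str) -> str:
    """Run-length encode a per-base operation string, e.g. 'SSSMMMM' -> '3S4M'."""
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def identity(observed: str, germline: str) -> float:
    """Fraction of gap-free aligned positions where observed equals germline."""
    pairs = [(o, g) for o, g in zip(observed, germline) if o != "-" and g != "-"]
    if not pairs:
        return 0.0
    return sum(o == g for o, g in pairs) / len(pairs)


print(walk_cigar("S" * 3 + "M" * 280))   # 3S280M
print(walk_cigar("M" * 35 + "S" * 4))    # 35M4S
print(walk_cigar("MMMIMMDDM"))           # 3M1I2M2D1M
print(identity("ACGT", "ACGA"))          # 0.75
```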

Mutation

truth_n_mutations: 11
mutation_rate: 0.031
shm_model: s5f
mutation_pos / ref / alt: list[int] · list[str]

Corruption (lab effects)

corruption_added: True
corrupted_5prime · _3prime: 28 · 12
n_count: 3
indels_added: 1
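
The "same shape, sentinel defaults" idea can be illustrated with a fixed dataclass. This is only a sketch: `Record` holds a handful of the fields listed above, and the specific sentinel values (empty string, 0, False) are assumptions, since the chapter does not say which sentinels GenAIRR uses.

```python
from dataclasses import asdict, dataclass


@dataclass
class Record:
    # A small subset of the fields; defaults stand in for "phase did not run".
    sequence: str = ""
    truth_n_mutations: int = 0
    corruption_added: bool = False
    corrupted_5prime: int = 0
    indels_added: int = 0


plain = Record(sequence="ACGT")
corrupted = Record(sequence="GT", corruption_added=True, corrupted_5prime=2)

# Both records expose exactly the same columns, in the same order.
print(list(asdict(plain)) == list(asdict(corrupted)))  # True
print(asdict(plain)["corrupted_5prime"])               # 0
```

Downstream code can therefore load any batch of records into one table without checking which passes were enabled.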

PART 05

The trace — every decision, addressable.

Beyond the AIRR record, every random draw the engine made is logged in an addressed trace. Think of it as the black-box flight recorder for one simulation. If a record looks surprising, you don't have to guess why — you can look up the exact draw that produced it.

Trace addresses, by example

sample_allele.v: the V pool index that was drawn
trim.v_3: the 3′ V trim length
np.np1.length: the NP1 region length
np.np1.bases[3]: the 4th NP1 base drawn
mutate.s5f.position[7]: the 8th SHM mutation location

Querying the trace

trace = outcome.trace()

# single draw, by exact address
trace.find("sample_allele.v").value
# 146 — pool index of the V allele drawn

# every draw under a prefix
trace.prefix_query("np.np1")
# [length=5, bases[0]='A', bases[1]='C', ...]

# count the draws below a prefix
trace.prefix_count("mutate.s5f")
# 11 — number of SHM substitutions in this record
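
To make the addressing model concrete, here is a minimal stand-in for an addressed trace. The classes `Draw` and `Trace` and the `record` method are hypothetical; only the method names `find`, `prefix_query`, and `prefix_count` mirror the calls shown above, and the real implementation may store and match addresses differently.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Draw:
    address: str
    value: Any


@dataclass
class Trace:
    draws: list[Draw] = field(default_factory=list)

    def record(self, address: str, value: Any) -> None:
        """Log one random draw under a dotted address."""
        self.draws.append(Draw(address, value))

    def find(self, address: str) -> Optional[Draw]:
        """Return the first draw at exactly this address, or None."""
        return next((d for d in self.draws if d.address == address), None)

    @staticmethod
    def _under(prefix: str, address: str) -> bool:
        # Match whole path segments: "np.np1" covers "np.np1.length"
        # but not "np.np10.length".
        return address == prefix or address.startswith(prefix + ".")

    def prefix_query(self, prefix: str) -> list[Draw]:
        return [d for d in self.draws if self._under(prefix, d.address)]

    def prefix_count(self, prefix: str) -> int:
        return len(self.prefix_query(prefix))


trace = Trace()
trace.record("sample_allele.v", 146)
trace.record("np.np1.length", 2)
trace.record("np.np1.bases[0]", "A")
trace.record("np.np1.bases[1]", "C")
trace.record("np.np2.length", 0)

print(trace.find("sample_allele.v").value)  # 146
print(trace.prefix_count("np.np1"))         # 3
```

Matching on whole path segments rather than raw string prefixes is a design choice in this sketch; it keeps a query for one region from accidentally picking up a sibling whose name merely starts with the same characters.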
