Chapter 04 · The answer key

The AIRR record — ~70 fields, all derived.

A GenAIRR record is not an accumulator of fields filled in across the pipeline. It's computed once, at the end, by reading the final IR snapshot. Coordinates, CIGARs, junctions, identities, calls — everything is derived from a single source of truth, so nothing can drift out of sync. And alongside the conventional AIRR schema, GenAIRR carries a parallel set of truth_* fields that record what actually happened during simulation — independent of what an aligner would later infer.

01 Read pool: final nucleotide pool · regions · assignments
02 Locate anchors: V Cys · J anchor · junction span
03 Derive fields: coordinates · CIGARs · calls · truth columns
PART 01

Derived, not accumulated.

Many simulators carry an AIRR record alongside the sequence and update its fields as the pipeline progresses. An indel happens, and every downstream coordinate has to be shifted; a trim arrives, and the gene end has to be re-booked. Every update is a chance to introduce a desynchronization bug.

GenAIRR keeps no such running record. Coordinates exist only on the final pool. CIGARs are walked once at the end. The productive flag is decided by reading the junction codons from the final state. With no parallel state, there is nothing to fall out of sync with.

Most simulators
  • incremental: each phase updates AIRR fields in place
  • brittle: indels and trims cascade through coordinates
  • silent drift: an off-by-one survives until much later
GenAIRR
  • terminal: record is computed once from the final IR
  • single source: every field reads the same pool snapshot
  • guaranteed: desynchronization is categorically impossible
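
The terminal-derivation idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GenAIRR's implementation: the `Base` tuple, the segment labels, and `derive_spans` are hypothetical names. Each base in the final pool carries the segment it came from, and coordinates are read off in a single pass, so an insertion upstream shifts every span without any bookkeeping.

```python
from typing import NamedTuple


class Base(NamedTuple):
    nt: str       # nucleotide in the final sequence
    segment: str  # hypothetical provenance label: "V", "NP1", "D", "NP2" or "J"


def derive_spans(pool: list[Base]) -> dict[str, tuple[int, int]]:
    """Read half-open [start, end) spans per segment off the final pool.

    Assumes each segment occupies one contiguous run of the pool.
    """
    spans: dict[str, tuple[int, int]] = {}
    for i, base in enumerate(pool):
        start, _ = spans.get(base.segment, (i, i))
        spans[base.segment] = (start, i + 1)
    return spans


def make_pool(v: str, np1: str, j: str) -> list[Base]:
    """Build a toy pool from three labelled stretches of sequence."""
    return (
        [Base(n, "V") for n in v]
        + [Base(n, "NP1") for n in np1]
        + [Base(n, "J") for n in j]
    )


pool = make_pool("TGTGCG", "AC", "TGGGGC")
before = derive_spans(pool)

# Simulate an insertion inside V: only the pool changes, nothing is patched.
pool.insert(3, Base("A", "V"))
after = derive_spans(pool)
```

Because `derive_spans` is only ever called on the final pool, there is no earlier set of coordinates that could disagree with it.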
PART 02

Birth certificate vs lookalike.

Two questions sit at the heart of every AIRR record: what gene did this come from? and what gene does this sequence look like? Most of the time these agree. Under heavy somatic hypermutation they don't — and the disagreement is exactly the signal you want to study when benchmarking an aligner. GenAIRR records both, always, in parallel columns.

Sampled truth — the birth certificate
  • truth_v_call: the V allele GenAIRR sampled at recombination
  • truth_d_call: the D allele sampled, fixed at the source
  • truth_j_call: the J allele sampled, fixed at the source

These are the names of the alleles drawn from the reference library at the moment of recombination. They don't change. They're the ground truth.

Evidence-based call — what the sequence looks like now
  • v_call: V allele(s) the corrupted sequence supports; comma-separated when paralogs can't be told apart
  • d_call: D allele(s) the (often short) middle region supports; ties are most common for D
  • j_call: J allele(s) the final 3′ region supports

Computed at record time by scoring every reference allele against the final sequence and returning the strict tie-set at max score. When the truth allele survives the tie it is listed first, so v_call.split(",")[0] stays consistent with the alignment and identity fields. After heavy SHM or corruption the tie-set may diverge from truth_* entirely — exactly the case where an aligner could be misled.

truth (sampled)
IGHV3-23*01
evidence (tie-set after SHM)
IGHV3-23*01, IGHV3-23*04

Same record. From the surviving bases, the evidence caller can't tell *01 from *04 (both score the same), so v_call reports both, truth first. truth_v_call remembers that the actual draw was *01.
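
The score-and-tie rule described above can be sketched as follows. This is a toy under stated assumptions: `score` counts matching positions as a stand-in for whatever alignment score the real caller uses, the allele sequences are made up, and `evidence_call` is a hypothetical name rather than a GenAIRR function.

```python
def score(read: str, allele: str) -> int:
    """Count matching positions; a stand-in for a real alignment score."""
    return sum(a == b for a, b in zip(read, allele))


def evidence_call(read: str, reference: dict[str, str], truth: str) -> str:
    """Return the strict tie-set at max score, truth allele first if it survives."""
    scores = {name: score(read, seq) for name, seq in reference.items()}
    best = max(scores.values())
    tied = sorted(name for name, s in scores.items() if s == best)
    if truth in tied:
        tied.remove(truth)
        tied.insert(0, truth)
    return ",".join(tied)


# Made-up sequences: *01 and *04 differ only at the last base.
reference = {
    "IGHV3-23*01": "ACGTACGTAC",
    "IGHV3-23*04": "ACGTACGTAT",
    "IGHV1-69*01": "TTGTACCTAC",
}

# The last base was mutated to G, erasing the only difference between *01 and *04.
read = "ACGTACGTAG"

call_truth_01 = evidence_call(read, reference, truth="IGHV3-23*01")
call_truth_04 = evidence_call(read, reference, truth="IGHV3-23*04")
```

The same read yields the same tie-set either way; only the ordering depends on which allele was actually sampled, which is why the first entry of the call stays aligned with the truth column whenever the truth allele survives the tie.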

PART 03

Locating the junction by landmarks.

The junction is the most important region of an immune receptor, and the most likely to shift around as trims and mutations accumulate. To find it on the final sequence, GenAIRR doesn't measure offsets from the start. It looks for biological anchors: the conserved Cysteine codon near the 3′ end of V, and the conserved Tryptophan (heavy chain) or Phenylalanine (κ / λ / TR) codon near the 5′ end of J. Both anchors are baked into the per-base provenance, so the lookup survives any amount of mutation, trim, or sequencing loss.

Anchor-driven coordinate fields
  • junction_start / junction_end: first base of Cys codon · last base of J anchor codon
  • junction · junction_aa: nucleotide and amino-acid sequence between anchors
  • cdr3_start / cdr3_end: junction minus the flanking anchor codons
Survives even heavy trim and SHM
rec = exp.run(1)[0]

# the final sequence:
#   - was mutated at 11 positions in V
#   - was 5'-corrupted (lost the first 28 bp)
#   - has an insertion at position 142
# the junction is STILL found correctly:

rec.final_simulation().junction_span()
# (267, 309) — Cys-codon start, anchor-codon end

rec.junction_aa
# 'CARDVPYAFDIW'  — anchor + payload + anchor
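
To make the landmark lookup concrete, here is a self-contained sketch of finding the junction through per-base provenance instead of offsets. Everything in it is an assumption for illustration: the `Base` tuple, the `germline_pos` field, and the toy allele sequences are hypothetical and do not reflect GenAIRR's internal data model.

```python
from typing import NamedTuple, Optional


class Base(NamedTuple):
    nt: str
    segment: str                 # "V", "NP1" or "J" in this toy
    germline_pos: Optional[int]  # index in the source allele; None if inserted


BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_base(pool, segment, germline_pos):
    """Index of the pool base that descends from a given germline position."""
    for i, base in enumerate(pool):
        if base.segment == segment and base.germline_pos == germline_pos:
            return i
    return None


def junction_span(pool, v_anchor, j_anchor):
    """Half-open span from the first Cys base to the last J-anchor base."""
    start = find_base(pool, "V", v_anchor)
    last = find_base(pool, "J", j_anchor + 2)
    if start is None or last is None:
        return None  # an anchor base is no longer present in the pool
    return start, last + 1


def translate(nt):
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def build(seq, segment, tracked=True):
    return [Base(n, segment, i if tracked else None) for i, n in enumerate(seq)]


# Toy alleles: Cys codon (TGT) at V position 3, Trp codon (TGG) at J position 6.
pool = build("GGATGTGCG", "V") + build("AGA", "NP1", tracked=False) + build("GACTACTGGGGC", "J")
span_clean = junction_span(pool, v_anchor=3, j_anchor=6)

# Corrupt it: lose 2 bases at the 5' end, insert 1 base, mutate V position 7.
corrupted = pool[2:]
corrupted.insert(0, Base("A", "V", None))
hit = find_base(corrupted, "V", 7)
corrupted[hit] = corrupted[hit]._replace(nt="T")

span_corrupted = junction_span(corrupted, v_anchor=3, j_anchor=6)
start, end = span_corrupted
junction_aa = translate("".join(b.nt for b in corrupted[start:end]))
```

The span moves from `(3, 21)` to `(2, 20)` after corruption, but it is still found by asking which bases descend from the anchor positions, never by adjusting a stored offset.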
PART 04

The ~70 fields, by family.

Every record has the same shape regardless of which passes ran. Unused phases simply leave their fields at sentinel defaults; there is no schema branching. The fields fall into six families, each derived from the same pool snapshot.

Identity & truth
  • sequence_id: str
  • truth_v_call: IGHV3-23*01
  • truth_d_call: IGHD3-10*01
  • truth_j_call: IGHJ4*02
  • v_call · d_call · j_call: evidence tie-set
  • locus · chain: IGH · heavy
Sequence & coordinates
  • sequence: bytes
  • sequence_length: 320
  • v_sequence_start / end: 0 · 258
  • d_sequence_start / end: 263 · 276
  • j_sequence_start / end: 281 · 320
  • v_germline_start / end: 0 · 286
  • + d_germline_*, j_germline_*, np1_*, np2_*
Junction & productivity
  • junction: TGT…TGG
  • junction_aa: CARDVPYAFDIW
  • junction_length: 42
  • cdr3_aa: ARDVPYAFDI
  • productive: True
  • stop_codon: False
  • vj_in_frame: True
Alignment & CIGAR
  • v_cigar: 3S280M
  • d_cigar: 12M
  • j_cigar: 35M4S
  • v_identity: 0.962
  • d_identity · j_identity: float · float
Mutation
  • truth_n_mutations: 11
  • mutation_rate: 0.031
  • shm_model: s5f
  • mutation_pos / ref / alt: list[int] · list[str]
Corruption (lab effects)
  • corruption_added: True
  • corrupted_5prime · _3prime: 28 · 12
  • n_count: 3
  • indels_added: 1
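
As an illustration of how an entry in the Alignment & CIGAR family could be walked once from the final state, the sketch below builds a CIGAR string from per-base germline positions. It is a simplified assumption, not GenAIRR's algorithm: the input format (a list of germline indices, with `None` for inserted bases) and the function name `segment_cigar` are hypothetical, and soft clips are left out.

```python
from itertools import groupby
from typing import Optional


def segment_cigar(germline_positions: list[Optional[int]]) -> str:
    """Walk one segment's bases once and run-length encode the operations.

    M: base descends from a germline position (match or mismatch)
    I: base has no germline position (inserted)
    D: germline positions skipped between two surviving bases (deleted)
    """
    ops: list[str] = []
    prev: Optional[int] = None
    for pos in germline_positions:
        if pos is None:
            ops.append("I")
            continue
        if prev is not None and pos > prev + 1:
            ops.extend("D" * (pos - prev - 1))
        ops.append("M")
        prev = pos
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


# Three germline bases, one inserted base, two more germline bases,
# then germline positions 5 and 6 are missing before 7 and 8 resume.
cigar = segment_cigar([0, 1, 2, None, 3, 4, 7, 8])
```

Substitutions stay inside `M` runs here, which matches the usual CIGAR convention; identity would be computed separately by comparing the bases themselves.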
PART 05

The trace — every decision, addressable.

Beyond the AIRR record, every random draw the engine made is logged in an addressed trace. Think of it as the black-box flight recorder for one simulation. If a record looks surprising, you don't have to guess why — you can look up the exact draw that produced it.

Trace addresses, by example
  • sample_allele.v: the V pool index that was drawn
  • trim.v_3: the 3′ V trim length
  • np.np1.length: the NP1 region length
  • np.np1.bases[3]: the 4th NP1 base drawn
  • mutate.s5f.position[7]: the position of the 8th SHM mutation
Querying the trace
trace = outcome.trace()

# single draw, by exact address
trace.find("sample_allele.v").value
# 146 — pool index of the V allele drawn

# every draw under a prefix
trace.prefix_query("np.np1")
# [length=5, bases[0]='A', bases[1]='C', ...]

# count the draws below a prefix
trace.prefix_count("mutate.s5f")
# 11 — number of SHM substitutions in this record