Chapter 02 · The architecture

Persistent IR — every base remembers.

Most simulators store a sequence as a string and mutate it in place. GenAIRR stores it as a persistent intermediate representation (IR): an immutable snapshot per phase, where every nucleotide carries its own provenance. The result is Git for DNA: a complete, replayable history of how the sequence came to be, with coordinates that are correct by construction.

R0 · Empty: pool length 0, no regions
R1 · Recombined: V·NP1·D·NP2·J in the pool
R2 · Mutated: SHM substitutions flagged

PART 01

The Nucleotide Pool — every base carries a passport.

Instead of a flat "ACGT…" string, GenAIRR stores the sequence as a pool of nucleotide records. Each base is a small package: what it is right now, what it was originally, which segment it came from, which gene position it occupied, and a set of flags marking mutations, insertions, or ambiguous calls.

A base, unpacked
  • current_base: A · C · G · T · N (the base as it stands today)
  • germline_base: the original base before any mutation or error
  • germline_position: where this base lived in its source allele
  • segment: V · D · J · NP1 · NP2 (the structural role)
  • flags: MUTATED · INDEL_INS · N_CORRUPTION · …
Why per-base provenance
# A mutation arrives. The base flips A → G.
nuc.current_base   = "G"      # new value
nuc.germline_base  = "A"      # preserved
nuc.flags         |= MUTATED        # marked

# Later, an indel inserts a base nearby.
inserted.germline_base = "\0"  # no germline origin
inserted.flags        |= INDEL_INS

# A coordinate query at the end still works:
# "where did v_germline_end land after all this?"
# the pool answers — no manual bookkeeping needed.
One nucleotide, fully described
  • pool_index: 187
  • current_base: G
  • germline_base: A
  • germline_position: 187
  • segment: V
  • allele_id: IGHV3-23*01
  • flags: MUTATED
  • plus: codon_phase, region_offset, edit_history
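The record described above can be sketched as a small immutable data structure. This is a minimal illustration, not GenAIRR's actual implementation: the `Nucleotide` and `Flags` classes, the `mutate` helper, and the use of `None` for "no germline origin" are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from enum import IntFlag
from typing import Optional


class Flags(IntFlag):
    # Hypothetical flag values; only the names come from the text above.
    NONE = 0
    MUTATED = 1
    INDEL_INS = 2
    N_CORRUPTION = 4


@dataclass(frozen=True)
class Nucleotide:
    current_base: str
    germline_base: Optional[str]       # None: no germline origin (assumed convention)
    germline_position: Optional[int]   # None for inserted bases
    segment: str
    flags: Flags = Flags.NONE


def mutate(nuc: Nucleotide, new_base: str) -> Nucleotide:
    """Return a copy with a new current base; the germline fields are untouched."""
    return replace(nuc, current_base=new_base, flags=nuc.flags | Flags.MUTATED)


original = Nucleotide("A", "A", 187, "V")
mutated = mutate(original, "G")

# An inserted base has no germline base or position to point back to.
inserted = Nucleotide("T", None, None, "V", Flags.INDEL_INS)
```

Because the record is frozen, `mutate` cannot overwrite `original`; the germline value survives in both copies, which is the property the passport metaphor relies on.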
PART 02

Structure is separate from content.

Alongside the nucleotide pool, the IR carries a structural overlay: regions (segment spans), assignments (which alleles were picked), and the trace (every random draw, keyed by address). Bases live in one layer; structure lives in another. You can substitute an allele, retrim a segment, or pivot the junction without disturbing the surrounding pool, because the two layers compose rather than entangle.

Layers
  • Nucleotide pool: the bases themselves, each with provenance
  • Regions: V · NP1 · D · NP2 · J spans on the pool
  • Assignments: which V·D·J alleles were sampled
  • Trace: every random draw, addressed and replayable
The layers, observed
sim = outcome.final_simulation()

# layer 1 — the bases (pool)
sim.bases()              # b'gag...gtg'

# layer 2 — regions (structure)
[ r.segment for r in sim.regions() ]
# ['V', 'NP1', 'D', 'NP2', 'J']

# layer 3 — assignments (identity)
sim.v_allele_id()        # 146 → 'IGHV3-23*01'

# layer 4 — trace (history)
outcome.trace().find("sample_allele.v").value
# 146 — the actual draw that produced this record
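One way to picture the separation of layers is a snapshot whose regions are index spans over the pool rather than copies of it. The sketch below is an assumption-laden illustration: `Region`, `Snapshot`, `build_regions`, and `bases_of` are hypothetical names, the bases are toy data, and the relabelling at the end only demonstrates that the layers are separate objects.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Region:
    segment: str
    start: int  # inclusive pool index
    end: int    # exclusive pool index


@dataclass(frozen=True)
class Snapshot:
    pool: Tuple[str, ...]         # content layer
    regions: Tuple[Region, ...]   # structure layer
    assignments: Dict[str, str]   # identity layer


def build_regions(parts):
    """Lay (segment, bases) parts end to end and record the span of each."""
    pool, regions = [], []
    for segment, bases in parts:
        start = len(pool)
        pool.extend(bases)
        regions.append(Region(segment, start, len(pool)))
    return tuple(pool), tuple(regions)


def bases_of(snap: Snapshot, segment: str) -> str:
    """Read a segment's bases by slicing the pool with that segment's span."""
    region = next(r for r in snap.regions if r.segment == segment)
    return "".join(snap.pool[region.start:region.end])


# Toy bases, not real allele sequences.
pool, regions = build_regions(
    [("V", "GAGGTG"), ("NP1", "TT"), ("D", "ACT"), ("NP2", "C"), ("J", "GG")]
)
snap = Snapshot(pool, regions, {"V": "IGHV3-23*01"})

# Editing the assignment layer yields a new snapshot that shares the same pool.
relabelled = replace(snap, assignments={"V": "hypothetical-other-allele"})
```

Here `bases_of(snap, "D")` reads the D segment through its span, and `relabelled.pool is snap.pool` holds because only the identity layer was replaced.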
PART 03

Every pass writes a new revision.

A pass never mutates a previous revision in place. It reads the latest snapshot, computes its changes, and commits a new revision pointing back at the old one. The result is an append-only history of the simulation — every intermediate state is preserved and queryable.

Three steps, every pass
  • Analyze: read the latest revision and find where changes can apply
  • Plan: decide the edits (sample positions, draw substitutes, etc.)
  • Commit: emit a new revision; the previous revision is preserved
Stepping through revisions
outcome = exp.run(1)[0]

# every pass that ran, in order
outcome.pass_names()
# ['sample_allele.v', 'sample_allele.d', 'sample_allele.j',
#  'trim.v_3', 'generate_np.np1', 'assemble.v', ..., 'mutate.s5f']

# the IR snapshot after the i-th pass
outcome.revision(5).bases()         # mid-recombination state
outcome.revision(-1).bases()        # final state (alias)

# the first snapshot a named pass produced
outcome.revision_after("mutate.s5f")
# Simulation — pool + regions immediately after SHM
[Figure: revision timeline. R0 · empty (pool length 0) → Rk · recombined (V, D, J) → Rn · mutated (V, D, J)]
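The analyze, plan, commit cycle can be sketched with an append-only list of frozen revisions. This is a simplified model under stated assumptions, not the library's internals: `Revision`, `History`, and `substitute_pass` are hypothetical names, and a revision here holds only a tuple of bases.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Revision:
    pass_name: str
    bases: Tuple[str, ...]
    parent: Optional["Revision"]  # link back to the revision this one was built from


class History:
    """Append-only list of revisions; earlier entries are never modified."""

    def __init__(self) -> None:
        self._revisions: List[Revision] = [Revision("empty", (), None)]

    def latest(self) -> Revision:
        return self._revisions[-1]

    def revision(self, index: int) -> Revision:
        return self._revisions[index]

    def commit(self, pass_name: str, bases) -> Revision:
        rev = Revision(pass_name, tuple(bases), self.latest())
        self._revisions.append(rev)
        return rev


def substitute_pass(history: History, position: int, new_base: str) -> Revision:
    latest = history.latest()                      # analyze: read the latest revision
    if not 0 <= position < len(latest.bases):
        raise IndexError("position outside the pool")
    edited = list(latest.bases)                    # plan: build the edit on a copy
    edited[position] = new_base
    return history.commit("mutate", edited)        # commit: append a new revision


history = History()
history.commit("recombine", "ACGT")
substitute_pass(history, 0, "G")
```

After these calls, `history.revision(1).bases` still holds the recombined state while `history.revision(-1).bases` holds the mutated one, and each revision's `parent` points at the state it was derived from.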
PART 04

What you get because of this.

Persistent IR isn't an academic flourish. Three concrete properties fall out of it, and every claim GenAIRR makes about correctness rests on at least one of them.

Audit-ready output

Every truth field, every coordinate, every CIGAR character traces back to a specific random draw in the trace. You don't have to trust the result — you can replay the decision that produced it.
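Replaying a decision can be illustrated with a trace that stores each draw under its address. The sketch below is an assumption, not GenAIRR's trace format: `Trace` and `draw_index` are hypothetical, the pool of 200 choices is arbitrary, and only the address string `sample_allele.v` is taken from the examples above.

```python
import random
from typing import Dict, Optional


class Trace:
    """Records each random draw under a string address."""

    def __init__(self) -> None:
        self.draws: Dict[str, int] = {}

    def record(self, address: str, value: int) -> int:
        self.draws[address] = value
        return value


def draw_index(address: str, n_choices: int, rng: random.Random,
               trace: Trace, replay: Optional[Trace] = None) -> int:
    """Draw an index in [0, n_choices); reuse the recorded value when replaying."""
    if replay is not None:
        value = replay.draws[address]      # replay: take the recorded decision
    else:
        value = rng.randrange(n_choices)   # fresh run: make a new random draw
    return trace.record(address, value)


# First run: a fresh draw, recorded under its address.
first = Trace()
picked = draw_index("sample_allele.v", 200, random.Random(7), first)

# Second run: a different RNG seed, but the recorded draw is replayed.
second = Trace()
replayed = draw_index("sample_allele.v", 200, random.Random(999), second, replay=first)
```

The replayed run reproduces the original decision regardless of the generator's state, which is what makes a recorded draw auditable after the fact.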

Coordinate correctness by construction

Coordinates are derived from the final pool, not maintained incrementally. An indel can't desynchronize anything, because there is no separate counter to fall out of step with. The off-by-one errors that come from stale offsets are ruled out by design.
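The idea of deriving coordinates from the pool can be sketched as a scan over per-base records. This is a minimal illustration assuming each segment occupies one contiguous run of the pool; `Base`, `segment_span`, and `germline_end` are hypothetical names, and the pool is toy data.

```python
from typing import List, NamedTuple, Optional, Tuple


class Base(NamedTuple):
    current: str
    segment: str
    germline_position: Optional[int]  # None for bases with no germline origin


def segment_span(pool: List[Base], segment: str) -> Tuple[int, int]:
    """Half-open [start, end) span of a segment, computed by scanning the pool."""
    indices = [i for i, base in enumerate(pool) if base.segment == segment]
    if not indices:
        raise ValueError(f"segment {segment!r} not in pool")
    return indices[0], indices[-1] + 1


def germline_end(pool: List[Base], segment: str) -> int:
    """One past the highest germline position that survives in the segment."""
    positions = [
        base.germline_position
        for base in pool
        if base.segment == segment and base.germline_position is not None
    ]
    return max(positions) + 1


pool = [
    Base("A", "V", 0), Base("C", "V", 1), Base("G", "V", 2),
    Base("T", "NP1", None),
    Base("A", "D", 0),
]

# An indel inserts one base inside V; no offsets are updated by hand.
with_insert = pool[:1] + [Base("T", "V", None)] + pool[1:]
```

Calling `segment_span(with_insert, "D")` returns the shifted span purely from the records, while `germline_end(with_insert, "V")` is unchanged because the inserted base carries no germline position.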

A real answer key

Because every base remembers its origin, GenAIRR can tell you exactly which call an aligner got wrong, not approximately and not statistically. That makes its output a ground-truth answer key for benchmarking annotation tools.
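A comparison against the answer key can be as simple as a field-by-field diff. The sketch below assumes truth and aligner output are plain dictionaries keyed by AIRR-style field names (`v_call`, `d_call`, `j_call`); `wrong_calls` is a hypothetical helper and the D and J values are placeholders, not real allele names.

```python
def wrong_calls(truth, predicted, fields=("v_call", "d_call", "j_call")):
    """Return {field: (expected, got)} for every field the aligner got wrong."""
    return {
        field: (truth[field], predicted.get(field))
        for field in fields
        if predicted.get(field) != truth[field]
    }


# Placeholder values for illustration only.
truth = {"v_call": "IGHV3-23*01", "d_call": "D_TRUE", "j_call": "J_TRUE"}
predicted = {"v_call": "IGHV3-23*01", "d_call": "D_OTHER", "j_call": "J_TRUE"}

errors = wrong_calls(truth, predicted)
```

In this toy case `errors` contains only the `d_call` entry, pairing the expected value with what the aligner reported.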
