Chapter 02 · The architecture

Persistent IR — every base remembers.

Most simulators store a sequence as a string and mutate it in place. GenAIRR stores it as a persistent intermediate representation (IR): an immutable snapshot per phase, in which every nucleotide carries its own provenance. The result is Git for DNA: a complete, replayable history of how the sequence came to be, with coordinates that are derived from the snapshot itself rather than tracked by hand.

R0 · Empty: pool length 0, no regions

R1 · Recombined: V·NP1·D·NP2·J in the pool

R2 · Mutated: SHM substitutions flagged

PART 01

The Nucleotide Pool — every base carries a passport.

Instead of a flat "ACGT…" string, GenAIRR stores the sequence as a pool of nucleotide records. Each base is a small package: what it is right now, what it was originally, which segment it came from, which gene position it occupied, and a set of flags marking mutations, insertions, or ambiguous calls.

A base, unpacked

current_base: A · C · G · T · N, the base as it stands today
germline_base: the original base before any mutation or error
germline_position: where this base lived in its source allele
segment: V · D · J · NP1 · NP2, the structural role
flags: MUTATED · INDEL_INS · N_CORRUPTION · …
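
The fields above map naturally onto a small immutable record. The following is a minimal sketch of that idea, not GenAIRR's actual data structure: the class names `Flags` and `Nucleotide`, the numeric flag values, and the use of `None` for a missing germline origin are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import Optional


class Flags(IntFlag):
    # Hypothetical flag values; only the names come from the list above.
    NONE = 0
    MUTATED = 1
    INDEL_INS = 2
    N_CORRUPTION = 4


@dataclass(frozen=True)
class Nucleotide:
    current_base: str                 # A, C, G, T or N
    germline_base: Optional[str]      # None when there is no germline origin
    germline_position: Optional[int]  # index in the source allele, if any
    segment: str                      # "V", "D", "J", "NP1" or "NP2"
    flags: Flags = Flags.NONE


# One untouched germline base and one base that was mutated from A to G.
untouched = Nucleotide("A", "A", 187, "V")
mutated = Nucleotide("G", "A", 187, "V", Flags.MUTATED)

print(bool(mutated.flags & Flags.MUTATED))    # is the MUTATED bit set?
print(bool(untouched.flags & Flags.MUTATED))
```

Freezing the dataclass means a record cannot be changed after it is created, which matches the snapshot model described in this chapter.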

Why per-base provenance

# A mutation arrives. The base flips A → G.
nuc.current_base   = "G"      # new value
nuc.germline_base  = "A"      # preserved
nuc.flags         |= MUTATED        # marked

# Later, an indel inserts a base nearby.
inserted.germline_base = "\0"  # no germline origin
inserted.flags        |= INDEL_INS

# A coordinate query at the end still works:
# "where did v_germline_end land after all this?"
# the pool answers — no manual bookkeeping needed.
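
The snippet above uses attribute assignment for brevity. With immutable records, the same two edits could be written as functions that return a new pool and leave the old one untouched. This is a sketch under that assumption; `mutate`, `insert`, and the record layout are hypothetical names, not GenAIRR functions.

```python
from dataclasses import dataclass, replace
from enum import IntFlag
from typing import Optional, Tuple


class Flags(IntFlag):
    NONE = 0
    MUTATED = 1
    INDEL_INS = 2


@dataclass(frozen=True)
class Nucleotide:
    current_base: str
    germline_base: Optional[str]   # None marks "no germline origin"
    segment: str
    flags: Flags = Flags.NONE


Pool = Tuple[Nucleotide, ...]


def mutate(pool: Pool, index: int, new_base: str) -> Pool:
    """Return a new pool with one base substituted and flagged as MUTATED."""
    old = pool[index]
    new = replace(old, current_base=new_base, flags=old.flags | Flags.MUTATED)
    return pool[:index] + (new,) + pool[index + 1:]


def insert(pool: Pool, index: int, base: str, segment: str) -> Pool:
    """Return a new pool with an inserted base that has no germline origin."""
    inserted = Nucleotide(base, None, segment, Flags.INDEL_INS)
    return pool[:index] + (inserted,) + pool[index:]


original = tuple(Nucleotide(b, b, "V") for b in "ACA")
after_shm = mutate(original, 2, "G")         # A -> G at index 2
after_indel = insert(after_shm, 1, "T", "V")  # new base before index 1

print("".join(n.current_base for n in original))
print("".join(n.current_base for n in after_indel))
```

The mutated base keeps `germline_base == "A"` even after the insertion shifts its index, which is the property the comments above rely on.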

One nucleotide, fully described

pool_index: 187
current_base: G
germline_base: A
germline_position: 187
segment: V
allele_id: IGHV3-23*01
flags: MUTATED
additional fields: codon_phase, region_offset, edit_history

PART 02

Structure is separate from content.

Alongside the nucleotide pool, the IR carries a structural overlay: regions (segment spans), assignments (which alleles were picked), and the trace (every random draw, keyed by address). Bases live in one layer; structure lives in the others. You can substitute an allele, retrim a segment, or pivot the junction without disturbing the surrounding pool, because the layers compose rather than entangle.

Layers

Nucleotide pool: the bases themselves, each with provenance
Regions: V · NP1 · D · NP2 · J spans on the pool
Assignments: which V·D·J alleles were sampled
Trace: every random draw, addressed and replayable
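
One way to picture the separation is a snapshot object that holds each layer as its own immutable value. The sketch below is illustrative only: `Region`, `Snapshot`, the field layout, and the second allele label are hypothetical, and a plain tuple of characters stands in for the nucleotide records.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Region:
    segment: str
    start: int
    end: int  # exclusive


@dataclass(frozen=True)
class Snapshot:
    pool: Tuple[str, ...]                     # stand-in for nucleotide records
    regions: Tuple[Region, ...]               # spans over the pool
    assignments: Tuple[Tuple[str, str], ...]  # (segment, allele name)
    trace: Tuple[Tuple[str, int], ...]        # (address, drawn value)


snap = Snapshot(
    pool=tuple("GAGGTG"),
    regions=(Region("V", 0, 4), Region("J", 4, 6)),
    assignments=(("V", "IGHV3-23*01"),),
    trace=(("sample_allele.v", 146),),
)

# Changing one layer yields a new snapshot; the other layers are shared as-is.
relabelled = replace(snap, assignments=(("V", "hypothetical_other_allele"),))

print(relabelled.pool is snap.pool)  # the pool object is reused, not copied
print(snap.assignments)              # the earlier snapshot is unchanged
```

Because each layer is a separate field, an edit to assignments or regions never has to touch the bases.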

The layers, observed

sim = outcome.final_simulation()

# layer 1 — the bases (pool)
sim.bases()              # b'gag...gtg'

# layer 2 — regions (structure)
[ r.segment for r in sim.regions() ]
# ['V', 'NP1', 'D', 'NP2', 'J']

# layer 3 — assignments (identity)
sim.v_allele_id()        # 146 → 'IGHV3-23*01'

# layer 4 — trace (history)
outcome.trace().find("sample_allele.v").value
# 146 — the actual draw that produced this record

PART 03

Every pass writes a new revision.

A pass never mutates a previous revision in place. It reads the latest snapshot, computes its changes, and commits a new revision pointing back at the old one. The result is an append-only history of the simulation — every intermediate state is preserved and queryable.

Three steps, every pass

Analyze: read the latest revision and find where changes can apply
Plan: decide the edits (sample positions, draw substitutes, etc.)
Commit: emit a new revision; the previous revision is preserved
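
The three steps can be sketched as a function that takes the latest revision and returns a new one. This is a toy model under stated assumptions: `Revision`, `mutate_pass`, and the uniform choice of position and base are hypothetical and much simpler than a real SHM model.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Revision:
    bases: str
    parent: Optional["Revision"] = None  # link back to the previous revision
    produced_by: str = "init"


def mutate_pass(latest: Revision, rng: random.Random) -> Revision:
    # Analyze: which positions may change? In this toy model, all of them.
    candidates = range(len(latest.bases))
    # Plan: pick one position and a base that differs from the current one.
    pos = rng.choice(candidates)
    new_base = rng.choice([b for b in "ACGT" if b != latest.bases[pos]])
    # Commit: build a new revision; the old one is left untouched.
    bases = latest.bases[:pos] + new_base + latest.bases[pos + 1:]
    return Revision(bases, parent=latest, produced_by="mutate")


r0 = Revision("ACGTACGT")
r1 = mutate_pass(r0, random.Random(0))

print(r0.bases)         # unchanged by the pass
print(r1.parent is r0)  # history is reachable through parent links
```

Walking the `parent` links from the newest revision back to `r0` recovers every intermediate state, which is the append-only history described above.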

Stepping through revisions

outcome = exp.run(1)[0]

# every pass that ran, in order
outcome.pass_names()
# ['sample_allele.v', 'sample_allele.d', 'sample_allele.j',
#  'trim.v_3', 'generate_np.np1', 'assemble.v', ..., 'mutate.s5f']

# the IR snapshot after the i-th pass
outcome.revision(5).bases()         # mid-recombination state
outcome.revision(-1).bases()        # final state (alias)

# the first snapshot a named pass produced
outcome.revision_after("mutate.s5f")
# Simulation — pool + regions immediately after SHM

[Figure: revision chain. R0 · empty (pool length 0) → R_k · recombined (V, D, J) → R_n · mutated (V, D, J)]

PART 04

What you get because of this.

Persistent IR isn't an academic flourish. Three concrete properties fall out of it, and every claim GenAIRR makes about correctness rests on at least one of them.

Audit-ready output

Every truth field, every coordinate, every CIGAR character traces back to a specific random draw in the trace. You don't have to trust the result — you can replay the decision that produced it.
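
Replay from an addressed trace can be illustrated with a small recorder. The sketch below is an assumption about how such a mechanism might look, not GenAIRR's implementation: the `Trace` class, its `draw` method, and the choice counts are hypothetical, while the address strings follow the pass names shown earlier.

```python
import random


class Trace:
    """Records each draw under an address, or replays recorded values."""

    def __init__(self, rng=None, recorded=None):
        self.rng = rng
        self.replaying = recorded is not None
        self.recorded = dict(recorded or {})

    def draw(self, address, n_choices):
        if self.replaying:
            # Replay mode: return the stored decision instead of sampling.
            return self.recorded[address]
        value = self.rng.randrange(n_choices)
        self.recorded[address] = value
        return value


def simulate(trace):
    # Two addressed decisions; the choice counts are arbitrary.
    v = trace.draw("sample_allele.v", 200)
    j = trace.draw("sample_allele.j", 6)
    return v, j


live = Trace(rng=random.Random(42))
first = simulate(live)
replayed = simulate(Trace(recorded=live.recorded))

print(first == replayed)  # the replay reproduces every decision
```

Keying draws by address rather than by order means a single decision can be looked up, inspected, or overridden without rerunning the rest.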

Coordinate correctness by construction

Coordinates are derived from the final pool, not maintained incrementally. An indel can't desynchronize anything, because there is no separate counter to fall out of step with. The whole class of off-by-one bugs caused by stale incremental counters is ruled out by design.
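
Deriving coordinates from the pool can be as simple as scanning the per-base segment labels. The following is a minimal sketch that assumes each segment occupies one contiguous run; `region_spans` is a hypothetical helper, and a list of labels stands in for the full pool.

```python
from itertools import groupby


def region_spans(segments):
    """Derive (segment, start, end) spans, end exclusive, from per-base labels."""
    spans = []
    start = 0
    for segment, run in groupby(segments):
        length = sum(1 for _ in run)
        spans.append((segment, start, start + length))
        start += length
    return spans


before = ["V"] * 4 + ["NP1"] * 2 + ["D"] * 3
after = before[:2] + ["V"] + before[2:]  # one base inserted inside V

print(region_spans(before))  # [('V', 0, 4), ('NP1', 4, 6), ('D', 6, 9)]
print(region_spans(after))   # [('V', 0, 5), ('NP1', 5, 7), ('D', 7, 10)]
```

Every downstream boundary shifts by one after the insertion without any counter being updated, because the spans are recomputed from the labels each time.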

A real answer key

Because every base remembers its origin, GenAIRR can tell you exactly which call an aligner got wrong — not approximately, not statistically. A perfect ground-truth comparison for benchmarking annotation tools.
