Most simulators store a sequence as a string and mutate it in place. GenAIRR stores it as a persistent intermediate representation: an immutable snapshot per phase, where every nucleotide carries its own provenance. The result is Git for DNA: a complete, replayable history of how the sequence came to be, with coordinates that are derived from the stored bases rather than tracked separately.
Instead of a flat "ACGT…" string, GenAIRR stores the sequence as a pool of
nucleotide records. Each base is a small package: what it is right now, what it was
originally, which segment it came from, which gene position it occupied, and a set of
flags marking mutations, insertions, or ambiguous calls.
# A mutation arrives. The base flips A → G.
nuc.current_base = "G" # new value
nuc.germline_base = "A" # preserved
nuc.flags |= MUTATED # marked
# Later, an indel inserts a base nearby.
inserted.germline_base = "\0" # no germline origin
inserted.flags |= INDEL_INS
# A coordinate query at the end still works:
# "where did v_germline_end land after all this?"
# the pool answers — no manual bookkeeping needed.
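To make the shape of a nucleotide record concrete, here is a minimal sketch of one possible layout. The class name `Nucleotide`, the `Flag` enum and its bit values, and the helpers `mutate` and `make_insertion` are all hypothetical stand-ins for illustration, not GenAIRR's actual types. The sketch returns a new record instead of editing the old one, which is one way to keep earlier snapshots intact.

```python
from dataclasses import dataclass, replace
from enum import IntFlag
from typing import Optional


class Flag(IntFlag):
    """Hypothetical flag bits; the real names and values may differ."""
    NONE = 0
    MUTATED = 1
    INDEL_INS = 2
    AMBIGUOUS = 4


@dataclass(frozen=True)
class Nucleotide:
    current_base: str            # what the base is right now
    germline_base: str           # what it was originally ("\0" if none)
    segment: str                 # e.g. "V", "NP1", "D", "NP2", "J"
    germline_pos: Optional[int]  # position in the source gene, if any
    flags: Flag = Flag.NONE


def mutate(nuc: Nucleotide, new_base: str) -> Nucleotide:
    """Return a copy with a new visible base; the germline base is kept."""
    return replace(nuc, current_base=new_base, flags=nuc.flags | Flag.MUTATED)


def make_insertion(base: str, segment: str) -> Nucleotide:
    """Build a base with no germline origin, as an indel insertion would."""
    return Nucleotide(base, "\0", segment, None, Flag.INDEL_INS)


original = Nucleotide("A", "A", "V", 42)   # germline A at gene position 42
mutated = mutate(original, "G")            # A -> G, germline A preserved
inserted = make_insertion("T", "V")        # inserted base, no germline origin
```

Because the record is frozen, `original` still reads `A` after the mutation, so any earlier snapshot that holds it is unaffected.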
Alongside the nucleotide pool, the IR carries a structural overlay: regions (segment spans), assignments (which alleles were picked), and the trace (every random draw, keyed by address). Bases live in one layer; structure lives in another. You can substitute an allele, retrim a segment, or pivot the junction without disturbing the surrounding pool, because the two layers compose rather than entangle.
sim = outcome.final_simulation()
# layer 1 — the bases (pool)
sim.bases() # b'gag...gtg'
# layer 2 — regions (structure)
[r.segment for r in sim.regions()]
# ['V', 'NP1', 'D', 'NP2', 'J']
# layer 3 — assignments (identity)
sim.v_allele_id() # 146 → 'IGHV3-23*01'
# layer 4 — trace (history)
outcome.trace().find("sample_allele.v").value
# 146 — the actual draw that produced this record
A pass never mutates a previous revision in place. It reads the latest snapshot, computes its changes, and commits a new revision pointing back at the old one. The result is an append-only history of the simulation — every intermediate state is preserved and queryable.
outcome = exp.run(1)[0]
# every pass that ran, in order
outcome.pass_names()
# ['sample_allele.v', 'sample_allele.d', 'sample_allele.j',
# 'trim.v_3', 'generate_np.np1', 'assemble.v', ..., 'mutate.s5f']
# the IR snapshot after the i-th pass
outcome.revision(5).bases() # mid-recombination state
outcome.revision(-1).bases() # final state (alias)
# the first snapshot a named pass produced
outcome.revision_after("mutate.s5f")
# Simulation — pool + regions immediately after SHM
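The append-only behaviour described above can be sketched with a toy history class. This is an illustration under stated assumptions, not GenAIRR's implementation: `Revision`, `History`, and `commit` are hypothetical names, each snapshot here holds only a `bytes` value, and the pass name `mutate.toy` is made up for the example. The lookup methods mirror the `revision` and `revision_after` calls shown above only in spirit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Revision:
    pass_name: str
    bases: bytes
    parent: Optional["Revision"] = None  # link back to the previous snapshot


class History:
    """Toy append-only log: commits add revisions, nothing is overwritten."""

    def __init__(self, initial: bytes) -> None:
        self._revisions: List[Revision] = [Revision("init", initial)]

    def commit(self, pass_name: str,
               transform: Callable[[bytes], bytes]) -> Revision:
        latest = self._revisions[-1]               # read the latest snapshot
        new = Revision(pass_name, transform(latest.bases), parent=latest)
        self._revisions.append(new)                # append, never replace
        return new

    def revision(self, index: int) -> Revision:
        return self._revisions[index]              # -1 is the final state

    def revision_after(self, pass_name: str) -> Revision:
        for rev in self._revisions:                # first match wins
            if rev.pass_name == pass_name:
                return rev
        raise KeyError(pass_name)


history = History(b"")
history.commit("assemble.v", lambda b: b + b"gaggtg")
history.commit("mutate.toy", lambda b: b.replace(b"a", b"c", 1))

before = history.revision_after("assemble.v").bases  # b"gaggtg"
after = history.revision(-1).bases                    # b"gcggtg"
```

The earlier snapshot stays readable after the later pass runs, which is the property the revision API relies on.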
Persistent IR isn't an academic flourish. Three concrete properties fall out of it, and every claim GenAIRR makes about correctness rests on at least one of them.
Every truth field, every coordinate, every CIGAR character traces back to a specific random draw in the trace. You don't have to trust the result — you can replay the decision that produced it.
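One way to picture replay from an address-keyed trace is the sketch below. It is a simplified model, not GenAIRR's trace format: the `Trace` class, its `choose` method, and the allele lists are hypothetical, and only the address strings echo the ones shown earlier. A recorded draw is reused whenever its address is already present, so a replay does not depend on the random generator at all.

```python
import random
from typing import Dict, Optional

# Toy allele tables; the names are placeholders, not real reference data.
V_ALLELES = ["V-a", "V-b", "V-c"]
J_ALLELES = ["J-a", "J-b"]


class Trace:
    """Records each random draw under a string address."""

    def __init__(self, recorded: Optional[Dict[str, int]] = None) -> None:
        self.draws: Dict[str, int] = dict(recorded or {})

    def choose(self, address: str, rng: random.Random, n: int) -> int:
        if address in self.draws:        # replay: reuse the recorded draw
            return self.draws[address]
        value = rng.randrange(n)         # fresh run: draw and record it
        self.draws[address] = value
        return value


def simulate(rng: random.Random, trace: Trace) -> str:
    v = V_ALLELES[trace.choose("sample_allele.v", rng, len(V_ALLELES))]
    j = J_ALLELES[trace.choose("sample_allele.j", rng, len(J_ALLELES))]
    return v + "+" + j


first = Trace()
result = simulate(random.Random(7), first)

# Replaying with a different seed reproduces the same decisions,
# because every draw is looked up by address instead of re-sampled.
replayed = simulate(random.Random(999), Trace(first.draws))
```

Each chosen allele can be traced to a single entry in `first.draws`, which is the sense in which a result is tied to the draw that produced it.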
Coordinates are derived from the final pool, not maintained incrementally. An indel can't desynchronize anything, because there is no separate counter to fall out of step with. Off-by-one drift between a stored coordinate and the sequence is ruled out by construction.
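A small sketch shows why derived coordinates cannot drift. It assumes a pool in which every base is tagged with its segment and each segment is contiguous; the `Base` tuple and `region_spans` function are hypothetical names chosen for this example. Spans are recomputed by scanning the pool, so an insertion shifts every downstream boundary without any counter being updated.

```python
from typing import Dict, List, NamedTuple, Tuple


class Base(NamedTuple):
    base: str
    segment: str


def region_spans(pool: List[Base]) -> Dict[str, Tuple[int, int]]:
    """Derive half-open [start, end) spans per segment by scanning the pool."""
    spans: Dict[str, Tuple[int, int]] = {}
    for i, nuc in enumerate(pool):
        start, _ = spans.get(nuc.segment, (i, i))
        spans[nuc.segment] = (start, i + 1)
    return spans


def build(segments: List[Tuple[str, str]]) -> List[Base]:
    """Expand (segment, sequence) pairs into a flat pool of tagged bases."""
    return [Base(b, seg) for seg, seq in segments for b in seq]


pool = build([("V", "gag"), ("NP1", "tt"), ("J", "gtg")])
spans_before = region_spans(pool)
# {'V': (0, 3), 'NP1': (3, 5), 'J': (5, 8)}

# An indel inserts one base inside V. No stored coordinate is touched;
# the new pool is simply scanned again.
pool_after = pool[:1] + [Base("c", "V")] + pool[1:]
spans_after = region_spans(pool_after)
# {'V': (0, 4), 'NP1': (4, 6), 'J': (6, 9)}
```

The end of V moves from 3 to 4 and every later boundary follows, with no bookkeeping step that could be forgotten.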
Because every base remembers its origin, GenAIRR can tell you exactly which call an aligner got wrong, base by base rather than approximately or statistically. That makes it an exact ground truth for benchmarking annotation tools.
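As a rough illustration of such a comparison, the sketch below checks an aligner's per-base segment calls against the true segment of each base. The function name `compare_calls` and the two label lists are invented for the example; a real benchmark would read the truth from the simulated record and the calls from the tool under test.

```python
from typing import List, Tuple


def compare_calls(truth: List[str],
                  called: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, true_segment, called_segment) for every mismatch."""
    if len(truth) != len(called):
        raise ValueError("truth and calls must cover the same bases")
    return [(i, t, c)
            for i, (t, c) in enumerate(zip(truth, called))
            if t != c]


# True segment of each base, as the simulated record would report it.
truth = ["V", "V", "V", "NP1", "D", "D", "J"]
# Hypothetical aligner output that extends V one base too far.
called = ["V", "V", "V", "V", "D", "D", "J"]

errors = compare_calls(truth, called)
# [(3, 'NP1', 'V')]
accuracy = 1 - len(errors) / len(truth)
```

The mismatch list names the exact position and the exact wrong label, so an error report can point at a specific base instead of an aggregate score.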