A GenAIRR simulation is a small composition of passes applied in a fixed order to a persistent IR. The DSL surfaces them as three phases (recombine, mutate, corrupt) because that is the order in which biology and the lab act on a sequence. The engine surfaces them as a flat list of pure functions, so you can insert, replace, or reorder any of them.
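The "flat list of pure functions" idea can be illustrated without GenAIRR at all. The sketch below is a toy under stated assumptions: `ToyIR`, `run`, and the four pass functions are hypothetical names invented for illustration, not part of the GenAIRR API. It only shows why immutable state plus a plain list makes inserting or reordering a pass trivial.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyIR:
    """Hypothetical stand-in for the IR: an immutable sequence plus a log."""
    sequence: str = ""
    log: Tuple[str, ...] = ()


ToyPass = Callable[[ToyIR], ToyIR]


def recombine(ir: ToyIR) -> ToyIR:
    # Toy "recombination": just produce a fixed sequence.
    return replace(ir, sequence="ACGTACGT", log=ir.log + ("recombine",))


def mutate(ir: ToyIR) -> ToyIR:
    # Toy "mutation": substitute the first base with T.
    return replace(ir, sequence="T" + ir.sequence[1:], log=ir.log + ("mutate",))


def corrupt(ir: ToyIR) -> ToyIR:
    # Toy "corruption": clip two bases from the 5' end.
    return replace(ir, sequence=ir.sequence[2:], log=ir.log + ("corrupt",))


def add_tail(ir: ToyIR) -> ToyIR:
    # A user-written pass: append four A bases.
    return replace(ir, sequence=ir.sequence + "AAAA", log=ir.log + ("add_tail",))


def run(passes: List[ToyPass], ir: ToyIR = ToyIR()) -> ToyIR:
    """Apply each pass in list order; every pass returns a new ToyIR."""
    return reduce(lambda state, p: p(state), passes, ir)


pipeline = [recombine, mutate, corrupt]
result = run(pipeline)

# Inserting a pass is ordinary list manipulation.
extended = pipeline[:2] + [add_tail] + pipeline[2:]
extended_result = run(extended)

print(result.sequence, result.log)
print(extended_result.sequence, extended_result.log)
```

Because no pass mutates its input, the original `pipeline` and `result` are unaffected by building `extended`.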
The simulator samples a V, D, and J allele from the reference, trims each at biologically plausible distributions, and stitches in NP regions of TdT-style untemplated nucleotides. Every coordinate is logged at the moment it's chosen.
# one line of DSL, many passes underneath
exp = (Experiment("human_igh_naive")
       .recombine())  # ← all four passes above, in order
rec = exp.run(1)[0]
# rec.truth_v_call → "IGHV3-23*01"
# rec.junction_aa → "CARDVPYAFDIW"
# rec.productive → True
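To make the recombination steps concrete, here is a minimal toy sketch of "sample, trim, stitch, log coordinates". Everything in it is an assumption made for illustration: the three short segments in `TOY_REFERENCE` are made up, trimming and NP lengths are drawn uniformly rather than from biologically plausible distributions, and `toy_recombine` is not a GenAIRR function.

```python
import random

# Hypothetical toy germline segments; real reference alleles are much longer.
TOY_REFERENCE = {
    "V": {"TOYV1*01": "CAGGTGCAGCTGGTGGAGTCTGGG"},
    "D": {"TOYD1*01": "GGTATAGCAGCAGCTGGTAC"},
    "J": {"TOYJ1*01": "ACTACTTTGACTACTGGGGCCAG"},
}


def toy_recombine(rng, max_trim=3, max_np=4):
    """Assemble V-NP-D-NP-J and record where each segment landed."""
    sequence = ""
    truth = {}
    for segment in "VDJ":
        if segment != "V":
            # NP region: untemplated bases inserted before D and before J.
            np_len = rng.randint(0, max_np)
            sequence += "".join(rng.choice("ACGT") for _ in range(np_len))
        name = rng.choice(sorted(TOY_REFERENCE[segment]))
        germline = TOY_REFERENCE[segment][name]
        # Trim only the ends that take part in a junction.
        trim_5 = 0 if segment == "V" else rng.randint(0, max_trim)
        trim_3 = 0 if segment == "J" else rng.randint(0, max_trim)
        kept = germline[trim_5:len(germline) - trim_3]
        start = len(sequence)
        sequence += kept
        # Log the coordinates at the moment they are chosen.
        truth[segment] = {
            "call": name,
            "start": start,
            "end": len(sequence),
            "trim_5": trim_5,
            "trim_3": trim_3,
        }
    return sequence, truth


seq, truth = toy_recombine(random.Random(0))
print(seq)
for segment, info in truth.items():
    print(segment, info)
```

Recording `start` and `end` while the sequence is being built means the ground truth never has to be recovered by alignment afterwards.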
Once a B-cell receptor is recombined, repeated encounters with antigen drive somatic hypermutation: AID-mediated substitutions concentrated in the V region. GenAIRR ships two SHM models — uniform (rate-only) and S5F (motif-aware) — and both log every substitution at base-pair resolution.
exp = (Experiment("human_igh_naive")
       .recombine()
       .mutate(model="s5f",
               count=(5, 25)))  # number of substitutions drawn uniformly between 5 and 25
rec = exp.run(1)[0]
# rec.truth_n_mutations → 11
# rec.mutation_rate → 0.031
# rec.v_identity → 0.962
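The rate-only (uniform) model is simple enough to sketch in a few lines. The version below is a toy, not GenAIRR's implementation: `toy_uniform_shm` is a hypothetical name, the count range is assumed to be inclusive at both ends, and positions are chosen uniformly without any motif weighting (which is what a model like S5F would add).

```python
import random


def toy_uniform_shm(sequence, count_range, rng):
    """Apply a uniformly drawn number of substitutions and log each one."""
    low, high = count_range
    # Assumption: the range is inclusive, capped at the sequence length.
    n_subs = rng.randint(low, min(high, len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_subs))
    bases = list(sequence)
    events = []
    for pos in positions:
        old = bases[pos]
        # A substitution must change the base, so exclude the original.
        new = rng.choice([b for b in "ACGT" if b != old])
        bases[pos] = new
        events.append((pos, old, new))
    return "".join(bases), events


germline = "CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTC"
mutated, events = toy_uniform_shm(germline, (5, 25), random.Random(7))

n_mutations = len(events)
mutation_rate = n_mutations / len(germline)
print(n_mutations, round(mutation_rate, 3))
print(events[:3])
```

Because each event stores the position together with the old and new base, the mutated sequence can be reverted to germline, or the mutation count recomputed, from the log alone.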
Real reads aren't clean. Library prep clips ends. Sequencers introduce N-bases and substitutions. PCR amplifies its own errors. GenAIRR models each artifact as its own pass, with explicit rates — and remembers what the sequence was before any of it.
exp = (Experiment("human_igh_naive")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .corrupt_5prime_loss(p=0.30, max_loss=60)
       .corrupt_3prime_loss(p=0.15, max_loss=30)
       .corrupt_indels(rate=0.005)
       .corrupt_ns(rate=0.001))
rec = exp.run(1)[0]
# rec.sequence_length → 320 (was 360 before loss)
# rec.corruption_added → True
# rec.truth_v_call → "IGHV3-23*01" (still known!)
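The key property here, that corruption changes the observed read while the pre-corruption sequence stays available, can be sketched as follows. This is a toy under stated assumptions: `ToyRead`, `toy_5prime_loss`, and `toy_ns` are hypothetical names, and the sampling rules (a uniform loss length, an independent per-base N probability) are simplifications chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyRead:
    """Hypothetical record: the observed read plus what it was beforehand."""
    sequence: str          # what the "sequencer" reports
    pristine: str          # the sequence before any corruption
    events: Tuple = ()     # one entry per artifact applied


def toy_5prime_loss(read, rng, p, max_loss):
    """With probability p, clip between 1 and max_loss bases from the 5' end."""
    if rng.random() >= p:
        return read
    loss = rng.randint(1, min(max_loss, len(read.sequence) - 1))
    return ToyRead(read.sequence[loss:], read.pristine,
                   read.events + (("5prime_loss", loss),))


def toy_ns(read, rng, rate):
    """Replace each base with N independently with probability `rate`."""
    hits = tuple(i for i in range(len(read.sequence)) if rng.random() < rate)
    if not hits:
        return read
    bases = list(read.sequence)
    for i in hits:
        bases[i] = "N"
    return ToyRead("".join(bases), read.pristine, read.events + (("ns", hits),))


clean = "CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCC"
start = ToyRead(sequence=clean, pristine=clean)

rng = random.Random(1)
clipped = toy_5prime_loss(start, rng, p=1.0, max_loss=10)  # p=1.0 forces a loss
noisy = toy_ns(clipped, rng, rate=0.05)

print(len(clean), len(noisy.sequence), noisy.events)
```

Each artifact is its own function with an explicit rate, and `pristine` is carried through untouched, so the truth stays known however many artifacts are stacked.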
The pipeline transforms the IR. The contracts watch what it transforms into.
A contract is a predicate that the engine can either verify after a pass has run
or use to filter sampling choices before they are made. The canonical productive()
bundle composes three:
productive=False and excluded from productive().
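The two ways of using a contract, verifying after the fact versus filtering choices up front, can be sketched with plain predicates. The code below is a toy illustration only: `in_frame` and `no_stop_codon` are example predicates chosen for the sketch and are not claimed to be the contracts that `productive()` bundles, and `all_of`, `allowed_inserts`, and `toy_bundle` are hypothetical names.

```python
import random
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame(seq):
    """Example predicate: the length is a multiple of three."""
    return len(seq) % 3 == 0


def no_stop_codon(seq):
    """Example predicate: no stop codon when read in frame from position 0."""
    return all(seq[i:i + 3] not in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def all_of(*contracts):
    """Compose predicates into a bundle that holds only if all of them hold."""
    return lambda seq: all(contract(seq) for contract in contracts)


toy_bundle = all_of(in_frame, no_stop_codon)


def allowed_inserts(prefix, suffix, contract, max_len=2):
    """Filter mode: keep only the candidate inserts that satisfy the contract."""
    candidates = ["".join(p)
                  for n in range(max_len + 1)
                  for p in product("ACGT", repeat=n)]
    return [ins for ins in candidates if contract(prefix + ins + suffix)]


prefix, suffix = "TGTGCGT", "ATGG"

# Filter before sampling: every remaining choice is valid by construction.
allowed = allowed_inserts(prefix, suffix, toy_bundle)
chosen = random.Random(0).choice(allowed)
assembled = prefix + chosen + suffix

# Verify after the fact: the same predicate, applied to a finished sequence.
print(allowed, toy_bundle(assembled))
print(toy_bundle(prefix + "A" + suffix))  # insert "A" creates the codon TAA
```

Filtering costs an enumeration of the candidates but never wastes a sample; verification is cheaper per call but leaves it to the caller to decide what happens on failure.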
A pass is anything that implements the Pass trait: it takes an IR, returns a new IR, and emits
zero or more events into the trace. The built-ins have no privileged status:
write your own pass, drop it into the chain, and it composes with everything else.
The contracts still apply. The trace still records every choice.
class PolyaTailPass(Pass):
    """Append a poly-A tail of variable length."""

    def apply(self, ir: Simulation, rt: PassRuntime):
        n = rt.rng.range(20, 60)
        ir = ir.with_appended_bases("A" * n)
        rt.trace.emit("polya_added", length=n)
        return ir

exp.add_pass(PolyaTailPass())  # composes with the rest
Every base in its arena. Every revision a new layer. How "just clone the IR" makes branching, retries, and trace replay free.
The 5-lesson tutorial follows one IGH sequence through all three phases and shows the AIRR record fill in field by field.