Chapter 01 · The architecture

The Simulation Pipeline.

A GenAIRR simulation is a small composition of passes applied in a fixed order to a persistent intermediate representation (IR). The DSL surfaces them as three phases (recombine, mutate, corrupt) because that mirrors the order of events in biology and in the lab. The engine surfaces them as a flat list of pure functions, so you can insert, replace, or reorder any of them.

01 Recombine: V·D·J selection · trimming · NP synthesis
02 Mutate: somatic hypermutation
03 Corrupt: lab + sequencing artifacts
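
The "flat list of pure functions" idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not GenAIRR's engine: ToyIR, run_pipeline, and the three stand-in passes are hypothetical names invented for illustration, and each pass does something trivially deterministic so the result is easy to follow.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyIR:
    """Hypothetical stand-in for the IR: immutable bases plus a log of passes."""
    bases: str = ""
    log: Tuple[str, ...] = ()


ToyPass = Callable[[ToyIR], ToyIR]


def recombine(ir: ToyIR) -> ToyIR:
    # Pretend a V(D)J sequence was assembled.
    return replace(ir, bases="GGTACCTTTGAC", log=ir.log + ("recombine",))


def mutate(ir: ToyIR) -> ToyIR:
    # Substitute the first base (fixed, to keep the example deterministic).
    return replace(ir, bases="A" + ir.bases[1:], log=ir.log + ("mutate",))


def corrupt(ir: ToyIR) -> ToyIR:
    # Clip two bases from the 5' end.
    return replace(ir, bases=ir.bases[2:], log=ir.log + ("corrupt",))


def run_pipeline(passes: List[ToyPass], ir: ToyIR) -> ToyIR:
    """Fold the passes over the IR, left to right."""
    return reduce(lambda acc, p: p(acc), passes, ir)


def tag(ir: ToyIR) -> ToyIR:
    # An extra pass that only leaves a mark in the log.
    return replace(ir, log=ir.log + ("tag",))


empty = ToyIR()
pipeline = [recombine, mutate, corrupt]
result = run_pipeline(pipeline, empty)

# Inserting a pass is ordinary list editing.
extended = pipeline[:2] + [tag] + pipeline[2:]
extended_result = run_pipeline(extended, empty)

print(result.bases, result.log)
print(extended_result.log)
```

Because every pass returns a new ToyIR instead of editing the old one, the empty starting IR is still intact after both runs, which is what makes reordering and re-running cheap to reason about.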
PHASE 01

Recombination — building a clean V(D)J.

The simulator samples a V, a D, and a J allele from the reference, trims each by a length drawn from a biologically plausible distribution, and joins the segments with NP regions of untemplated nucleotides of the kind terminal deoxynucleotidyl transferase (TdT) adds. Every coordinate is logged at the moment it is chosen.

Passes used
  • SampleAllelePass: pick V·D·J from reference, weighted by family prevalence
  • TrimPass: draw 3′ V / 5′ D / 3′ D / 5′ J trim lengths
  • AssembleSegmentPass: copy the kept germline slice into the nucleotide pool
  • GenerateNPPass: synthesize N1/N2 from a TdT-like base distribution
The DSL call
#  one line of DSL — many passes underneath
exp = (Experiment("human_igh_naive")
       .recombine())   # ← all four passes above, in order

rec = exp.run(1)[0]
# rec.truth_v_call    → "IGHV3-23*01"
# rec.junction_aa     → "CARDVPYAFDIW"
# rec.productive      → True
[Figure: the IR before recombine (empty, pool length 0) and after recombine (V, D, and J segments).]
Truth fields recorded
  • truth_v_call: IGHV3-23*01
  • truth_d_call: IGHD3-10*01
  • truth_j_call: IGHJ4*02
  • v_trim_3: 4 bp
  • d_trim_5 / 3: 2 / 3 bp
  • j_trim_5: 1 bp
  • n1_length: 5 bp
  • n2_length: 4 bp
  • junction: TGTGCG…GGGTGG
  • junction_aa: CARDVPYAFDIW
  • productive: True
  • + 16 segment-coordinate fields
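
To make the recombination steps concrete, here is a minimal sketch of sample, trim, generate NP, and assemble, with coordinates logged as they are chosen. Everything in it is an assumption for illustration: the germline strings and allele names are made up and far shorter than real alleles, trim and NP lengths are drawn uniformly instead of from empirical distributions, and the function names are hypothetical, not GenAIRR's passes.

```python
import random

# Made-up toy germline segments with toy prevalence weights.
REFERENCE = {
    "V": {"TOYV1*01": ("CAGGTGCAGCTGGTGTATTACTGTGCGAGA", 3.0),
          "TOYV2*01": ("GAGGTGCAGCTGTTGTATTACTGTGCGAAA", 1.0)},
    "D": {"TOYD1*01": ("GGTATAGCAGCTGGTAC", 1.0)},
    "J": {"TOYJ1*01": ("ACTACTTTGACTACTGGGGCCAGGGA", 1.0)},
}


def sample_allele(rng, segment):
    """Pick one allele name and sequence, weighted by the toy prevalence."""
    names = sorted(REFERENCE[segment])
    weights = [REFERENCE[segment][n][1] for n in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    return name, REFERENCE[segment][name][0]


def random_np(rng, max_len):
    """Untemplated bases; uniform over ACGT here, which is an assumption."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))


def recombine(rng):
    v_call, v = sample_allele(rng, "V")
    d_call, d = sample_allele(rng, "D")
    j_call, j = sample_allele(rng, "J")

    # Trim lengths, drawn uniformly for simplicity.
    v_trim_3 = rng.randint(0, 4)
    d_trim_5 = rng.randint(0, 3)
    d_trim_3 = rng.randint(0, 3)
    j_trim_5 = rng.randint(0, 4)

    v_kept = v[:len(v) - v_trim_3]
    d_kept = d[d_trim_5:len(d) - d_trim_3]
    j_kept = j[j_trim_5:]
    n1 = random_np(rng, 6)
    n2 = random_np(rng, 6)

    # Assemble, recording every coordinate at the moment it is known.
    sequence = v_kept + n1 + d_kept + n2 + j_kept
    d_start = len(v_kept) + len(n1)
    j_start = d_start + len(d_kept) + len(n2)
    truth = {
        "truth_v_call": v_call, "truth_d_call": d_call, "truth_j_call": j_call,
        "v_trim_3": v_trim_3, "d_trim_5": d_trim_5,
        "d_trim_3": d_trim_3, "j_trim_5": j_trim_5,
        "n1_length": len(n1), "n2_length": len(n2),
        "v_end": len(v_kept), "d_start": d_start,
        "d_end": d_start + len(d_kept), "j_start": j_start,
    }
    return sequence, truth


sequence, truth = recombine(random.Random(7))
print(sequence)
print(truth)
```

The point of the sketch is the bookkeeping: because the truth dictionary is filled in while the sequence is built, no later alignment step is needed to recover where each segment starts and ends.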
PHASE 02

Maturation — somatic hypermutation.

Once a B-cell receptor is recombined, repeated encounters with antigen drive somatic hypermutation: AID-mediated substitutions concentrated in the V region. GenAIRR ships two SHM models — uniform (rate-only) and S5F (motif-aware) — and both log every substitution at base-pair resolution.

Passes used
  • S5FMutationPass: 5-mer-conditioned substitution probabilities (Yaari et al., 2013)
  • UniformMutationPass: flat-rate substitutions, useful as a null model
The DSL call
exp = (Experiment("human_igh_naive")
       .recombine()
       .mutate(model="s5f",
               count=(5, 25)))   # number of substitutions, drawn uniformly from this range

rec = exp.run(1)[0]
# rec.truth_n_mutations  → 11
# rec.mutation_rate      → 0.031
# rec.v_identity         → 0.962
[Figure: the V, D, and J segments before and after mutate.]
Truth fields recorded
  • truth_n_mutations: 11
  • mutation_rate: 0.031
  • v_identity: 0.962
  • shm_model: s5f
  • mutation_pos: [34, 67, 92, …]
  • mutation_ref: [A, C, G, …]
  • mutation_alt: [G, T, A, …]
  • + per-region mutation counts
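
A flat-rate model is simple enough to sketch in full. The version below is a toy, not UniformMutationPass itself: the function name is hypothetical, the count range is treated as inclusive, and mutation_rate is computed as substitutions divided by sequence length, all of which are assumptions. A motif-aware model such as S5F would replace the uniform choice of positions and replacement bases with context-dependent weights.

```python
import random


def uniform_mutate(sequence, count_range, rng):
    """Apply a uniformly drawn number of substitutions and log each one."""
    low, high = count_range
    n = rng.randint(low, high)  # inclusive bounds: an assumption
    # Distinct positions, so no site is hit twice.
    positions = sorted(rng.sample(range(len(sequence)), n))

    bases = list(sequence)
    refs, alts = [], []
    for pos in positions:
        ref = bases[pos]
        # A substitution must change the base.
        alt = rng.choice([b for b in "ACGT" if b != ref])
        bases[pos] = alt
        refs.append(ref)
        alts.append(alt)

    truth = {
        "truth_n_mutations": n,
        "mutation_rate": n / len(sequence),
        "mutation_pos": positions,
        "mutation_ref": refs,
        "mutation_alt": alts,
    }
    return "".join(bases), truth


rng = random.Random(1)
germline = "".join(rng.choice("ACGT") for _ in range(120))  # toy sequence
mutated, truth = uniform_mutate(germline, (5, 25), rng)

print(truth["truth_n_mutations"], round(truth["mutation_rate"], 3))
```

Logging position, reference base, and alternate base together means the germline can be rebuilt from the mutated sequence and the log alone.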
PHASE 03

Corruption — what the lab does to your sequence.

Real reads aren't clean. Library prep clips ends. Sequencers introduce N-bases and substitutions. PCR amplifies its own errors. GenAIRR models each artifact as its own pass, with explicit rates — and remembers what the sequence was before any of it.

Passes used
  • EndLossPass: 5′ and 3′ random truncation (read-length variation)
  • IndelPass: low-rate insertions / deletions (homopolymer drift)
  • NCorruptionPass: scatter ambiguous bases (low-quality calls)
  • PCRErrorPass: substitution errors weighted toward early-cycle amplification
  • ContaminantPass: inject off-target sequence fragments
  • QualityErrorPass: platform-shaped quality score degradation
The DSL call
exp = (Experiment("human_igh_naive")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .corrupt_5prime_loss(p=0.30, max_loss=60)
       .corrupt_3prime_loss(p=0.15, max_loss=30)
       .corrupt_indels(rate=0.005)
       .corrupt_ns(rate=0.001))

rec = exp.run(1)[0]
# rec.sequence_length    → 320       (was 360 before loss)
# rec.corruption_added   → True
# rec.truth_v_call       → "IGHV3-23*01"  (still known!)
[Figure: the V, D, and J segments before corrupt and after corrupt, where the 5′ and 3′ ends are each marked ×.]
Truth fields recorded
  • sequence_length: 320 bp
  • corruption_added: True
  • 5_prime_loss: 28 bp
  • 3_prime_loss: 12 bp
  • n_count: 3
  • n_positions: [89, 207, 275]
  • indel_count: 1
  • indel_events: [ins@142]
  • + per-pass corruption flags
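
The end-loss numbers in the example add up by simple subtraction: 360 − 28 − 12 = 320. The sketch below shows that arithmetic together with N scattering, while keeping the uncorrupted sequence alongside the read. It is a toy under assumptions: p is read as the probability that a loss happens at all, max_loss as the largest number of bases removed, and rate as a per-base probability. The function and the original_sequence key are hypothetical, and indels are left out.

```python
import random


def corrupt(sequence, rng, p5=0.30, max5=60, p3=0.15, max3=30, n_rate=0.001):
    """Clip the ends and scatter N bases, remembering what was there before."""
    # Each end is clipped with its own probability; the loss length is uniform.
    loss5 = rng.randint(1, max5) if rng.random() < p5 else 0
    loss3 = rng.randint(1, max3) if rng.random() < p3 else 0
    kept = sequence[loss5:len(sequence) - loss3]

    # Replace each remaining base with N independently at the given rate.
    bases = list(kept)
    n_positions = [i for i in range(len(bases)) if rng.random() < n_rate]
    for i in n_positions:
        bases[i] = "N"

    truth = {
        "original_sequence": sequence,  # the pre-corruption sequence is kept
        "sequence_length": len(bases),
        "5_prime_loss": loss5,
        "3_prime_loss": loss3,
        "n_count": len(n_positions),
        "n_positions": n_positions,     # coordinates in the clipped read
        "corruption_added": bool(loss5 or loss3 or n_positions),
    }
    return "".join(bases), truth


original = "ACGT" * 90  # a 360 bp toy sequence
# A high N rate is used here only so the effect is visible in a short demo.
read, truth = corrupt(original, random.Random(5), n_rate=0.01)

print(truth["sequence_length"], truth["5_prime_loss"], truth["3_prime_loss"])
```

Since the truth record carries the original sequence and both loss lengths, any base in the read can be mapped back to its pre-corruption position.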
Across every phase

What stays true.

The pipeline transforms the IR; the contracts constrain what it may transform into. A contract is a predicate the engine can use in two ways: to verify the IR after a pass has run, or to filter candidate sampling choices before they are committed. The canonical productive() bundle composes three:

  • anchor_preserved: Trimming may eat into a segment, but the conserved anchor codon (Cys-104 on V, Trp/Phe on J) must remain in the retained slice. If a candidate trim would remove it, the sampler tries again.
  • productive_junction_frame: Length of the V-anchor → J-anchor junction must be divisible by three. Out-of-frame rearrangements are valid in nature but get marked productive=False and excluded from productive().
  • no_stop_codon_in_junction: No TAA / TAG / TGA inside CDR3. The filter mode checks candidate NP bases and candidate substitutions before they're committed, so an unproductive draw never leaves a partial record behind.
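
The three contracts are easy to write down as plain predicates, and the two modes differ only in when the predicate is called. The sketch below is a toy under assumptions: the function signatures, the coordinate convention for the anchor, and the use of rejection sampling for filter mode are all hypothetical choices made for illustration, not GenAIRR's contract API.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def anchor_preserved(v_length, anchor_start, v_trim_3):
    """The anchor codon occupies [anchor_start, anchor_start + 3) on V."""
    return v_length - v_trim_3 >= anchor_start + 3


def productive_junction_frame(junction):
    return len(junction) % 3 == 0


def no_stop_codon_in_junction(junction):
    # Read complete codons in frame from the first junction base.
    codons = [junction[i:i + 3] for i in range(0, len(junction) - 2, 3)]
    return not any(codon in STOP_CODONS for codon in codons)


def productive(junction):
    """Verify mode: check a finished junction after the fact."""
    return productive_junction_frame(junction) and no_stop_codon_in_junction(junction)


def sample_np_filtered(rng, prefix, suffix, length, max_tries=1000):
    """Filter mode: only return an NP region that keeps the junction stop-free."""
    for _ in range(max_tries):
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if no_stop_codon_in_junction(prefix + candidate + suffix):
            return candidate  # committed only once the contract holds
    raise RuntimeError("no valid NP region found within max_tries")


np_region = sample_np_filtered(random.Random(11), "TGTGCG", "TGG", 6)
junction = "TGTGCG" + np_region + "TGG"

print(junction, productive(junction))
print(productive("TGTTGAGGG"))  # in-frame TGA, so the check fails
```

In verify mode a failing record already exists and has to be flagged or discarded; in filter mode the rejected candidates never reach the record at all.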
Open at the seams

Every pass is a function.
So is yours.

A pass implements the Pass trait: it takes an IR, returns a new IR, and emits zero or more events into the trace. The built-ins have no privileged status: write your own pass, drop it into the chain, and it composes with everything else. The contracts still apply, and the trace still records every choice.

Guide → write a custom pass
class PolyaTailPass(Pass):
    """Append a poly-A tail of variable length."""

    def apply(self, ir: Simulation, rt: PassRuntime):
        n = rt.rng.range(20, 60)
        ir = ir.with_appended_bases("A" * n)
        rt.trace.emit("polya_added", length=n)
        return ir

exp.add_pass(PolyaTailPass())   # composes with the rest
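
To try the shape of a custom pass without GenAIRR installed, the same poly-A pass can be run against a few stand-in classes. ToyIR, ToyTrace, ToyRuntime, ToyPass, and run_chain below are hypothetical and exist only for this sketch; they mimic the method names used in the snippet above but say nothing about how the real Simulation or PassRuntime types are built. The standard-library randint call stands in for rt.rng.range, whose exact bounds are not specified here.

```python
import random
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyIR:
    """Immutable stand-in for the IR."""
    bases: str = ""

    def with_appended_bases(self, extra):
        # Returns a new IR; the original is left untouched.
        return replace(self, bases=self.bases + extra)


@dataclass
class ToyTrace:
    events: list = field(default_factory=list)

    def emit(self, name, **payload):
        self.events.append((name, payload))


@dataclass
class ToyRuntime:
    rng: random.Random
    trace: ToyTrace = field(default_factory=ToyTrace)


class ToyPass:
    def apply(self, ir, rt):
        raise NotImplementedError


class PolyaTailPass(ToyPass):
    """Append a poly-A tail of variable length."""

    def apply(self, ir, rt):
        n = rt.rng.randint(20, 60)  # stand-in for rt.rng.range(20, 60)
        new_ir = ir.with_appended_bases("A" * n)
        rt.trace.emit("polya_added", length=n)
        return new_ir


def run_chain(passes, ir, rt):
    """Apply each pass in order, threading the IR through."""
    for p in passes:
        ir = p.apply(ir, rt)
    return ir


start = ToyIR("ACGT")
rt = ToyRuntime(random.Random(3))
out = run_chain([PolyaTailPass()], start, rt)

print(len(out.bases), rt.trace.events)
```

The pass never mutates its input, and the only side effect is the event it emits, so the trace alone is enough to explain how the output differs from the input.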
Next chapter

02 / The Persistent IR

Every base in its arena. Every revision a new layer. How "just clone the IR" makes branching, retries, and trace replay free.

Or run it

Walk a real molecule.

The 5-lesson tutorial follows one IGH sequence through all three phases and shows the AIRR record fill in field by field.
