GenAIRR — Simulate immune receptor sequences with absolute ground truth

Every record produced by a single Experiment.run() carries annotations that are ground truth: placed by the simulator at the moment of construction, not inferred by an aligner. A record starts as a clean V–D–J rearrangement, then picks up mutations and sequencing artifacts, and its full provenance stays recoverable throughout.

What it's for

A DSL for AIRR. An engine that knows the answer.

GenAIRR pairs a fluent simulation language with an engine that records every decision it makes. The result: realistic synthetic repertoires whose ground truth is surgically precise — usable as benchmark datasets, principled null models, or experimental sandboxes for any phenomenon you can encode.

01 / BENCHMARK

Tools, models, and pipelines against absolute truth

Test annotation pipelines, aligners, and ML models on data where every V/D/J call, mutation, and trimming length is known by construction. Stable evaluation at any scale, across 22 species and 106 built-in configs.
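As a minimal sketch of what scoring against known truth can look like, the snippet below compares a tool's V calls with the simulator's. The record layout and the field names `true_v_call` and `pred_v_call` are hypothetical stand-ins chosen for illustration, not GenAIRR's output schema.

```python
from typing import Dict, List


def call_accuracy(records: List[Dict[str, str]], truth_key: str, pred_key: str) -> float:
    """Fraction of records whose predicted call equals the ground-truth call."""
    if not records:
        raise ValueError("no records to score")
    hits = sum(1 for r in records if r[truth_key] == r[pred_key])
    return hits / len(records)


# Hypothetical records: truth would come from the simulator,
# predictions from the annotation tool under test.
records = [
    {"true_v_call": "IGHV1-2*02", "pred_v_call": "IGHV1-2*02"},
    {"true_v_call": "IGHV3-23*01", "pred_v_call": "IGHV3-23*04"},
    {"true_v_call": "IGHV4-34*01", "pred_v_call": "IGHV4-34*01"},
    {"true_v_call": "IGHV1-69*01", "pred_v_call": "IGHV1-69*01"},
]

accuracy = call_accuracy(records, "true_v_call", "pred_v_call")
print(f"V-call accuracy: {accuracy:.2f}")
```

Because the truth column is known by construction, the same function applies unchanged to D calls, J calls, or trimming lengths once the key names are swapped.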

02 / NULL MODEL

Principled controls for repertoire analyses

Generate matched controls for clonal expansion, diversity metrics, lineage inference, and convergent-selection studies. Hold biology fixed, vary one phase at a time, and watch the downstream signal change.

03 / PHENOMENA

A composable lab for what shapes a repertoire

Compose biological mechanisms (S5F mutation, antigen selection, class switch, receptor revision) and lab effects (5′/3′ loss, indels, N-bases) in isolation. The persistent IR keeps every decision traceable across phases.

ENGINE
compiled Rust kernel: scales linearly per record
~70 ground-truth fields per record
persistent IR: every base knows its origin
bit-identical across Linux · macOS · Windows

What GenAIRR does

Six capabilities. One composable engine.

The DSL is a thin surface on top of an engine that does more than chain biological phases. Compose recombination, maturation, and lab effects on the fluent side; the engine layer underneath enforces constraints at sample time, branches into clonal families on demand, and remembers every base's origin and every random draw it made.

01 DSL · .recombine()

Recombine

Sample V · D · J alleles from the reference, draw biologically plausible trim lengths, synthesize NP regions, assemble the rearrangement. The whole V(D)J ceremony in one call.

02 DSL · .mutate()

Mutate

Somatic hypermutation via S5F (motif-aware, Yaari et al. 2013) or a uniform null model. BCR-only by design — TCR configs reject the call so the simulation stays biologically honest.

03 DSL · .with_clonal_structure()

Clone

Fork after recombination — N parents × K descendants each. Clonal families with shared V(D)J ancestry and divergent mutation histories. Every record carries a clone_id.
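The N × K shape can be pictured with a toy model. This is an illustrative sketch only: the random "parent" string stands in for a real rearrangement, the uniform substitution stands in for a mutation model, and the helper names `mutate` and `clonal_families` are hypothetical, not part of GenAIRR.

```python
import random
from typing import Dict, List

BASES = "ACGT"


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Substitute n_mutations distinct positions with a different base (uniform toy model)."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_mutations):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def clonal_families(n_clones: int, size: int, length: int, seed: int) -> List[Dict[str, object]]:
    """Build n_clones parents, each forked into `size` independently mutated descendants."""
    rng = random.Random(seed)
    records: List[Dict[str, object]] = []
    for clone_id in range(n_clones):
        # Stand-in for one recombination event shared by the whole family.
        parent = "".join(rng.choice(BASES) for _ in range(length))
        for _ in range(size):
            records.append(
                {"clone_id": clone_id, "parent": parent, "sequence": mutate(parent, 3, rng)}
            )
    return records


families = clonal_families(n_clones=2, size=3, length=30, seed=42)
for record in families:
    print(record["clone_id"], record["sequence"])
```

Descendants in the same family share one parent and differ only in where their substitutions landed, which is the property a lineage-inference benchmark relies on.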

04 DSL · .corrupt_*()

Corrupt

Eight independent passes for wet-lab and sequencing reality: PCR errors, quality decay, indels, 5′/3′ loss, N-bases, reverse-complement, contaminants. Compose only what your platform actually does.

05 ENGINE · respect=ga.productive()

Constrain

Pruning-aware sampling: pass a contract bundle and the engine filters every draw's candidate distribution so productive recombinations fall out by construction. No reject-and-retry. Downstream corruption can still degrade productivity — that's the part you're modelling. Strict mode for hard guarantees at sample time.
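The idea of filtering a candidate distribution before drawing, rather than drawing and retrying, can be sketched in a few lines. This is a toy illustration under stated assumptions: the NP-length weights, the `fixed_length` value, and the in-frame rule are invented for the example and do not reflect how the engine represents its contracts.

```python
import random
from typing import Callable, Dict


def sample_pruned(
    weights: Dict[int, float], allowed: Callable[[int], bool], rng: random.Random
) -> int:
    """Draw from `weights`, restricted to candidates that satisfy `allowed`."""
    candidates = {k: w for k, w in weights.items() if w > 0 and allowed(k)}
    if not candidates:
        raise ValueError("constraint leaves no admissible candidate")
    keys = list(candidates)
    # random.choices normalizes the remaining weights internally.
    return rng.choices(keys, weights=[candidates[k] for k in keys], k=1)[0]


rng = random.Random(42)

# Hypothetical NP-length distribution (length -> relative weight).
np_length_weights = {0: 1.0, 1: 2.0, 2: 3.0, 3: 3.0, 4: 2.0, 5: 1.0, 6: 0.5}
fixed_length = 40  # hypothetical: bases already committed by earlier draws


def in_frame(np_length: int) -> bool:
    """Toy productivity rule: the total length must be a multiple of 3."""
    return (fixed_length + np_length) % 3 == 0


draws = [sample_pruned(np_length_weights, in_frame, rng) for _ in range(1000)]
print(sorted(set(draws)))
```

Every draw satisfies the constraint on the first attempt, and the relative weights among the surviving candidates are preserved, so no sample is ever discarded.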

06 ENGINE · persistent IR + trace

Replay & Audit

Every pass writes a new IR snapshot, every base carries its provenance, every random draw is addressed (np.np1.bases[3]). Fork from any phase, replay any draw, derive ~70 truth fields on demand.
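Addressed draws are easiest to see in a toy. The sketch below records each random choice under a string address and can replay or override any of them; the `TracedRng` class and `draw_np1` helper are hypothetical illustrations, and only the address format `np.np1.bases[3]` is taken from the text above.

```python
import random
from typing import Dict, Optional


class TracedRng:
    """Records each draw under a string address; replays recorded values when given a trace."""

    def __init__(self, seed: int, replay: Optional[Dict[str, str]] = None) -> None:
        self._rng = random.Random(seed)
        self._replay = replay or {}
        self.trace: Dict[str, str] = {}

    def choice(self, address: str, options: str) -> str:
        if address in self._replay:
            value = self._replay[address]  # pinned by a previous trace
        else:
            value = self._rng.choice(options)  # fresh draw
        self.trace[address] = value
        return value


def draw_np1(rng: TracedRng, length: int) -> str:
    """Draw an NP1 region one base at a time, addressing every base."""
    return "".join(rng.choice(f"np.np1.bases[{i}]", "ACGT") for i in range(length))


first = TracedRng(seed=42)
original = draw_np1(first, 6)

# Replaying the full trace reproduces the sequence, even under a different seed.
replayed = draw_np1(TracedRng(seed=7, replay=first.trace), 6)

# Overriding a single address changes exactly that base and nothing else.
edited_trace = dict(first.trace)
edited_trace["np.np1.bases[3]"] = "N"
patched = draw_np1(TracedRng(seed=7, replay=edited_trace), 6)

print(original, replayed, patched)
```

Keying draws by address rather than by call order is what makes a single decision replaceable without disturbing the others.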

simulate.py — every capability in one chain

import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")

      # Optional — attach sample metadata to every record
      .with_metadata(donor="D001", protocol="MiSeq-v3")

      # 01 — V(D)J recombination
      .recombine()

      # 02 — somatic hypermutation (BCR-only)
      .mutate(model="s5f", count=(5, 25))

      # 03 — clonal family fork is also a single call:
      #       .with_clonal_structure(n_clones=50, size=20)
      #       (skipped here: incompatible with productive contracts below)

      # 04 — lab effects · compose what your platform actually does
      .corrupt_5prime_loss(length=(5, 30))
      .corrupt_3prime_loss(length=(0, 15))
      .corrupt_indels(count=(0, 2), insertion_prob=0.5)
      .corrupt_ns(count=(0, 5))
      .corrupt_pcr(count=(0, 10))
      .corrupt_quality(count=(0, 8))
      .corrupt_reverse_complement(prob=0.5)

      .run_records(
          n=1000, seed=42,
          # 05 — productive contract bundle (strict=True to enforce)
          respect=ga.productive(),
          # 06 — ground-truth columns alongside the evidence calls
          expose_provenance=True,
      )
)

Learn by doing

Build intuition in 5 lessons.

Each lesson takes ~3 minutes and ends with a runnable code snippet and an exercise you can drop straight into your own analysis.

Open the tutorial →

Install

One command. No compiler.

$ pip install GenAIRR

Python 3.9+ · pre-built wheels: Linux, macOS, Windows · zero mandatory dependencies

If you use GenAIRR in your research, please cite: GenAIRR — Briefings in Bioinformatics, 2024. DOI: 10.1093/bib/bbae556