Above is one record from a single Experiment.run(). Every annotation
is ground truth — placed by the simulator at the moment of construction, not
inferred by an aligner. Cycle through the specimens to watch a clean rearrangement
pick up mutations, then sequencing artifacts, then reveal its full provenance.
GenAIRR pairs a fluent simulation language with an engine that records every decision it makes. The result: realistic synthetic repertoires whose ground truth is surgically precise — usable as benchmark datasets, principled null models, or experimental sandboxes for any phenomenon you can encode.
Test annotation pipelines, aligners, and ML models on data where every V/D/J call, mutation, and trimming length is known by construction. Stable evaluation at any scale, across 22 species and 106 built-in configs.
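As a rough illustration of the benchmarking use case, the sketch below scores an annotator's V calls against known truth at allele and gene level. The field names `true_v_call` and `predicted_v_call` are hypothetical placeholders, not GenAIRR's actual column names, and the records are hand-written toy data.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical field names; real simulator output columns may differ.
    true_v_call: str
    predicted_v_call: str


def gene_of(allele: str) -> str:
    """Strip the allele suffix, e.g. 'IGHV1-2*02' -> 'IGHV1-2'."""
    return allele.split("*", 1)[0]


def v_call_accuracy(records):
    """Return (allele-level accuracy, gene-level accuracy)."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    allele_hits = sum(r.true_v_call == r.predicted_v_call for r in records)
    gene_hits = sum(
        gene_of(r.true_v_call) == gene_of(r.predicted_v_call) for r in records
    )
    return allele_hits / n, gene_hits / n


# Toy data: two exact hits, one wrong allele of the right gene, one wrong gene.
records = [
    Record("IGHV1-2*02", "IGHV1-2*02"),
    Record("IGHV1-2*02", "IGHV1-2*04"),
    Record("IGHV3-23*01", "IGHV3-30*01"),
    Record("IGHV4-34*01", "IGHV4-34*01"),
]
allele_acc, gene_acc = v_call_accuracy(records)
print(f"allele-level: {allele_acc:.2f}, gene-level: {gene_acc:.2f}")
```

Because the truth is known by construction, the same scoring can be rerun at any sample size without a second aligner acting as referee.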
Generate matched controls for clonal expansion, diversity metrics, lineage inference, and convergent-selection studies. Hold biology fixed, vary one phase at a time, and watch the downstream signal change.
Compose biological mechanisms (S5F mutation, antigen selection, class switch, receptor revision) and lab effects (5'/3' loss, indels, N-bases) in isolation. The persistent IR keeps every decision traceable across phases.
The DSL is a thin surface on top of an engine that does more than chain biological phases. Compose recombination, maturation, and lab effects on the fluent side; the engine layer underneath enforces constraints at sample time, branches into clonal families on demand, and remembers every base's origin and every random draw it made.
Sample V · D · J alleles from the reference, draw biologically plausible trim lengths, synthesize NP regions, and assemble the rearrangement. The whole V(D)J ceremony in one call.
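The steps named above can be sketched in plain Python. This is a toy model under stated assumptions, not GenAIRR's engine: the three-gene `REFERENCE`, the uniform trim and NP-length distributions, and every function name here are invented for illustration.

```python
import random

# Hypothetical toy germline reference; real reference sets are far larger.
REFERENCE = {
    "V": {"V1*01": "CAGGTGCAGCTGGTG", "V2*01": "GAGGTGCAGCTGTTG"},
    "D": {"D1*01": "GGTATAACTGG", "D2*01": "AGGATATTGTAG"},
    "J": {"J1*01": "ACTACTTTGACTAC", "J2*01": "CTGGTTCGACCCC"},
}


def random_bases(rng, n):
    """Synthesize n non-templated bases (uniform over A, C, G, T)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def recombine(rng, max_trim=3, max_np=4):
    """Sample alleles, trim the junction-facing ends, add NP regions, assemble."""
    calls = {seg: rng.choice(sorted(REFERENCE[seg])) for seg in "VDJ"}
    germ = {seg: REFERENCE[seg][calls[seg]] for seg in "VDJ"}
    # Uniform trims are an assumption; real trim distributions are empirical.
    trims = {name: rng.randint(0, max_trim) for name in ("v_3", "d_5", "d_3", "j_5")}
    v = germ["V"][: len(germ["V"]) - trims["v_3"]]
    d = germ["D"][trims["d_5"] : len(germ["D"]) - trims["d_3"]]
    j = germ["J"][trims["j_5"] :]
    np1 = random_bases(rng, rng.randint(0, max_np))
    np2 = random_bases(rng, rng.randint(0, max_np))
    # Every decision is kept next to the sequence, so truth is known by construction.
    return {
        "calls": calls,
        "trims": trims,
        "np1": np1,
        "np2": np2,
        "sequence": v + np1 + d + np2 + j,
    }


record = recombine(random.Random(7))
print(record["calls"], record["trims"], record["sequence"])
```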
Somatic hypermutation via S5F (motif-aware, Yaari et al. 2013) or a uniform null model. BCR-only by design — TCR configs reject the call so the simulation stays biologically honest.
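For intuition, here is a minimal sketch of the simpler of the two options, a uniform null model: positions are chosen uniformly at random and each is substituted with a different base. It does not implement S5F's motif-dependent mutability, and `mutate_uniform` is a hypothetical helper, not a GenAIRR function.

```python
import random


def mutate_uniform(seq, count, rng):
    """Substitute `count` distinct positions chosen uniformly at random.

    Returns the mutated sequence and the sorted list of mutated positions.
    """
    if count > len(seq):
        raise ValueError("count exceeds sequence length")
    bases = list(seq)
    positions = sorted(rng.sample(range(len(seq)), count))
    for pos in positions:
        # Always pick a base different from the current one, so every
        # chosen position is a real substitution.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases), positions


germline = "CAGGTGCAGCTGGTGCAGTCTGGG"
mutated, positions = mutate_uniform(germline, 5, random.Random(42))
print(mutated, positions)
```

Recording `positions` at mutation time is what makes the mutation list ground truth rather than something inferred afterwards by comparison with the germline.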
Fork after recombination — N parents × K descendants each. Clonal families with shared V(D)J ancestry and divergent mutation histories. Every record carries a clone_id.
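The N × K fork can be pictured with a small standalone sketch. Here a random string stands in for each recombined parent, and each descendant receives its own independent point mutations; the function names and record keys other than `clone_id` are assumptions made for the example.

```python
import random


def point_mutate(seq, n_mut, rng):
    """Substitute n_mut distinct positions with a different base."""
    bases = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def simulate_clones(n_clones, size, rng, length=30, n_mut=3):
    """Build n_clones parents, then fork each into `size` descendants."""
    records = []
    for clone_id in range(n_clones):
        # Stand-in for a recombined V(D)J parent: a random sequence.
        parent = "".join(rng.choice("ACGT") for _ in range(length))
        for _ in range(size):
            records.append(
                {
                    "clone_id": clone_id,
                    "parent": parent,  # shared ancestry within the clone
                    "sequence": point_mutate(parent, n_mut, rng),
                }
            )
    return records


records = simulate_clones(n_clones=3, size=4, rng=random.Random(1))
print(len(records), sorted({r["clone_id"] for r in records}))
```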
Eight independent passes for wet-lab and sequencing reality: PCR errors, quality decay, indels, 5′ loss, 3′ loss, N-bases, reverse-complement, and contaminants. Compose only what your platform actually does.
Pruning-aware sampling: pass a contract bundle and the engine filters every draw's candidate distribution so productive recombinations fall out by construction. No reject-and-retry. Downstream corruption can still degrade productivity — that's the part you're modelling. Strict mode for hard guarantees at sample time.
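The difference from reject-and-retry can be shown with a toy constraint. In this sketch the only rule is that the assembled length must stay in frame; the candidate NP lengths, their weights, and `sample_pruned` are illustrative assumptions and say nothing about how GenAIRR's contracts are implemented.

```python
import random


def sample_pruned(candidates, weights, is_allowed, rng):
    """Draw once from the candidates that satisfy the constraint.

    Inadmissible candidates are removed before sampling, so no draw is
    ever thrown away and retried.
    """
    kept = [(c, w) for c, w in zip(candidates, weights) if w > 0 and is_allowed(c)]
    if not kept:
        raise ValueError("constraint leaves no admissible candidate")
    values, kept_weights = zip(*kept)
    # random.choices normalises the weights internally.
    return rng.choices(values, weights=kept_weights, k=1)[0]


fixed_length = 40  # bases already committed before the NP region is drawn
np_lengths = list(range(10))
np_weights = [1, 3, 5, 6, 5, 4, 3, 2, 1, 1]  # hypothetical length distribution


def in_frame(np_length):
    """Toy productivity rule: total length must be a multiple of three."""
    return (fixed_length + np_length) % 3 == 0


rng = random.Random(0)
draws = [sample_pruned(np_lengths, np_weights, in_frame, rng) for _ in range(200)]
print(sorted(set(draws)))
```

For a single draw, sampling from the filtered and renormalised distribution gives the same conditional distribution that rejection sampling would reach, just without the wasted attempts.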
Every pass writes a new IR snapshot, every base carries its provenance, every random draw is addressed (np.np1.bases[3]). Fork from any phase, replay any draw, derive ~70 truth fields on demand.
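One way to make random draws addressable and replayable is to derive an independent generator from the run seed plus the draw's address. The sketch below does this with a SHA-256 hash; the scheme and the helper names are assumptions for illustration, and only the address format `np.np1.bases[3]` comes from the text above.

```python
import hashlib
import random


def rng_for(seed: int, address: str) -> random.Random:
    """Derive a generator that depends only on (seed, address)."""
    digest = hashlib.sha256(f"{seed}:{address}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def draw_base(seed: int, address: str) -> str:
    """Draw one nucleotide for the given address."""
    return rng_for(seed, address).choice("ACGT")


# Build a six-base NP1 region, one addressed draw per base.
np1 = "".join(draw_base(42, f"np.np1.bases[{i}]") for i in range(6))

# Replay a single draw later without re-running anything else.
replayed = draw_base(42, "np.np1.bases[3]")
print(np1, replayed)
```

Because each draw depends only on its own address, adding or removing other draws elsewhere in a pipeline does not shift its value.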
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    # Optional: attach sample metadata to every record
    .with_metadata(donor="D001", protocol="MiSeq-v3")
    # 01: V(D)J recombination
    .recombine()
    # 02: somatic hypermutation (BCR-only)
    .mutate(model="s5f", count=(5, 25))
    # 03: clonal family fork is also a single call:
    #     .with_clonal_structure(n_clones=50, size=20)
    #     (skipped here: incompatible with productive contracts below)
    # 04: lab effects; compose what your platform actually does
    .corrupt_5prime_loss(length=(5, 30))
    .corrupt_3prime_loss(length=(0, 15))
    .corrupt_indels(count=(0, 2), insertion_prob=0.5)
    .corrupt_ns(count=(0, 5))
    .corrupt_pcr(count=(0, 10))
    .corrupt_quality(count=(0, 8))
    .corrupt_reverse_complement(prob=0.5)
    .run_records(
        n=1000, seed=42,
        # 05: productive contract bundle (strict=True to enforce)
        respect=ga.productive(),
        # 06: ground-truth columns alongside the evidence calls
        expose_provenance=True,
    )
)
Each lesson takes ~3 minutes and ends with a runnable code snippet and an exercise you can drop straight into your own analysis.
$ pip install GenAIRR
Python 3.9+ · pre-built wheels: Linux, macOS, Windows · zero mandatory dependencies
If you use GenAIRR in your research, please cite: GenAIRR — Briefings in Bioinformatics, 2024. DOI: 10.1093/bib/bbae556