Experiment

Experiment is GenAIRR's fluent pipeline builder. Every method returns the same Experiment extended by one pipeline stage; the pipeline runs when you call .run_records(...), .run(...), or .compile().run(...). For the conceptual walk-through — which methods are ancestor-phase vs descendant-phase, which clonal model to choose, how compile reuse works — see the Experiment builder guide. The reference below catalogues the public surface.

Common methods

The methods you'll reach for in most real pipelines:

| Method | Purpose |
|---|---|
| .recombine(...) | Add the V(D)J recombination pass, the foundational ancestor-phase mechanism |
| .productive_only() | Constrain sampling so only productive rearrangements survive |
| .mutate(...) | Apply biological SHM (uniform or S5F) on top of recombination |
| .clonal_lineage(...) | Grow BCR affinity-maturation lineage trees |
| .clonal_repertoire(...) | Generate TCR / flat-BCR abundance repertoires with clone sizes and duplicate_count |
| .expand_clones(...) | Legacy fixed-size clonal star model |
| .paired_end(...) | Project assembled sequences as paired R1 / R2 reads |
| .run_records(...) | Compile + run + return a SimulationResult |

For the full pipeline-stage map (constraints, trim overrides, all the corruption methods, compile reuse, strict mode), see The Experiment builder.

Class reference

The autogenerated reference below comes straight from the docstrings in src/GenAIRR/experiment.py. This page is the pilot for the wider generated-reference effort.

GenAIRR.experiment.Experiment

Fluent builder for a simulation pipeline.

Build an Experiment from a config name, a :class:DataConfig, or a :class:GenAIRR._engine.RefDataConfig via :meth:Experiment.on, chain configuration steps (currently just :meth:recombine), then call :meth:run (or :meth:compile for explicit two-stage flow).

The builder is stateful but not destructive: each fluent call returns self after appending a step. The same Experiment can be compile()-d and run()-d multiple times with different seeds.

on(source) classmethod

Start an experiment against the given reference data.

source is one of:

  • a config-name string (e.g. "human_igh"),
  • a :class:GenAIRR.DataConfig instance,
  • a :class:GenAIRR._engine.RefDataConfig.

When source is a config name or a DataConfig, the underlying empirical distributions (NP lengths, per-gene trims) are kept on the experiment so :meth:recombine can use them as the default sampling distributions. A bare RefDataConfig has no such backing — :meth:recombine falls through to its uniform [0..6] placeholder unless the caller passes np1_lengths / np2_lengths explicitly.

recombine(*, np1_lengths=None, np2_lengths=None, v_allele_weights=None, d_allele_weights=None, j_allele_weights=None)

Append a standard V(D)J recombination step.

Compiles to:

  • VJ: sample V → sample J → (trim V_3, J_5) → assemble V → generate NP1 → assemble J.
  • VDJ: sample V → sample D → sample J → (trim V_3, D_5, D_3, J_5) → assemble V → generate NP1 → assemble D → generate NP2 → assemble J.

np1_lengths / np2_lengths default to the species' empirical NP-length distributions (from DataConfig.NP_lengths) when the experiment is bound to a DataConfig. For raw-RefDataConfig experiments where no empirical data is available, both fall back to the uniform [(0, 1.0), ..., (6, 1.0)] distribution and emit a :class:UserWarning so the caller knows the synthetic default is being used. Pass an explicit iterable of (length, weight) tuples to override the default. Passing np2_lengths on a VJ chain raises ValueError (VJ chains have no NP2 region — there's no D segment to bracket).

Exonuclease trim is enabled by default and uses the empirical per-segment trim distributions from the bound DataConfig when available. To disable trim, or supply custom trim distributions, call :meth:trim after :meth:recombine in the chain (before any mutation / corruption step). On a raw RefDataConfig without trim data, recombine emits a :class:UserWarning and falls back to a no-op trim.

v_allele_weights / d_allele_weights / j_allele_weights — optional {allele_name: weight} dicts that bias allele sampling. Listed alleles get the supplied positive weight; unlisted alleles default to 1.0, so e.g. v_allele_weights={"IGHV3-23*01": 100} boosts that allele while keeping every other V allele possible at 1/100th the rate. Mutually exclusive with the per-segment :meth:restrict_alleles restriction. Raises ValueError for unknown allele names or non-positive weights.
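The weighting rules above can be sketched in plain Python. This is an illustration of the documented behaviour only, not GenAIRR's implementation: the helper names `resolve_allele_weights` and `sample_allele` and the three-allele toy pool are assumptions made for the example.

```python
import random


def resolve_allele_weights(pool, overrides=None):
    """Return {allele: weight} for every allele in the pool (hypothetical helper)."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(pool))
    if unknown:
        raise ValueError(f"unknown allele names: {unknown}")
    for name, weight in overrides.items():
        if weight <= 0:
            raise ValueError(f"non-positive weight for {name}: {weight}")
    # Unlisted alleles keep the default weight of 1.0.
    return {name: float(overrides.get(name, 1.0)) for name in pool}


def sample_allele(weights, rng):
    """Draw one allele name in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Toy pool, assumed for illustration; a real config carries many more alleles.
pool = ["IGHV3-23*01", "IGHV1-69*01", "IGHV4-34*01"]
weights = resolve_allele_weights(pool, {"IGHV3-23*01": 100})

rng = random.Random(0)
draws = [sample_allele(weights, rng) for _ in range(2000)]
boosted_fraction = draws.count("IGHV3-23*01") / len(draws)
```

With this toy pool the boosted allele has probability 100/102, while the other two remain reachable at 1/102 each.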

productive_only()

Require every emitted record to be a productive sequence.

Attaches the canonical productive-sequence contract bundle to the experiment: junction frame in-register, no stop codons in the junction, and V/J anchor amino acids preserved. The bundle is enforced during recombination, mutation, and corruption passes by narrowing each pass's action support before sampling. If a constrained support is empty, permissive execution uses the pass's explicit no-op / sentinel behavior; use strict=True at run time to raise instead.

Failure surfaces (see docs/productive_failure_mode_audit.md for the full matrix):

  • Compile-time precondition — when a sampling distribution is statically impossible under the bundle (e.g. every NP1 length violates frame), :meth:compile raises ValueError regardless of the strict flag.
  • Runtime fresh strict (run(..., strict=True)) — when dynamic state makes a sampler's admissible support empty, raises :class:GenAIRR._engine.StrictSamplingError with structured (pass_name, address, reason) args.
  • Runtime fresh permissive (strict=False, default) — the pass records its declared sentinel (indel site=-1, NP length 0, NP base N, trim 0) or skips the slot; the record continues.
  • Trace replay (replay_from_trace_file) — consumes recorded values verbatim, does not re-evaluate admissibility. A permissive-sentinel trace replays cleanly even with strict=True.
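The strict and permissive runtime behaviours can be pictured with a small stand-alone sketch. Everything here is assumed for illustration: `StrictSamplingError` is a local stand-in rather than the class imported from GenAIRR, and `sample_np_length` with its toy frame rule is a hypothetical helper that only mimics "narrow the support, then sample or fall back to the sentinel".

```python
import random


class StrictSamplingError(Exception):
    """Local stand-in for the engine's error class (not imported from GenAIRR)."""


def sample_np_length(lengths, is_admissible, rng, strict=False, sentinel=0):
    """Sample an NP length from the admissible part of a (length, weight) list."""
    # Narrow the action support before sampling.
    support = [(length, w) for length, w in lengths if is_admissible(length)]
    if not support:
        if strict:
            # Structured (pass_name, address, reason) args, as documented.
            raise StrictSamplingError("np1", "np1.length", "empty admissible support")
        return sentinel  # permissive mode records the declared sentinel
    values, weights = zip(*support)
    return rng.choices(values, weights=weights, k=1)[0]


def in_frame(np_length):
    """Toy frame rule: assume the flanking segments contribute 10 bases."""
    return (10 + np_length) % 3 == 0


lengths = [(n, 1.0) for n in range(7)]  # uniform over 0..6
rng = random.Random(1)
draw = sample_np_length(lengths, in_frame, rng)             # 2 or 5
fallback = sample_np_length(lengths, lambda n: False, rng)  # sentinel 0
```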

Order-independent. This method is a constraint declaration, not a pipeline step — it can be called anywhere in the chain and the result is identical. Convention is to place it last (right before run_records()) so the constraint reads as a post-hoc requirement on the emitted records.

TCR refdata accepts the call but raises ValueError at :meth:compile time because TCRs don't have somatic hypermutation and the productive bundle's anchor checks assume BCR semantics. Catch this early at the builder if it matters to you.

Example::

result = (
    Experiment.on("human_igh")
    .recombine()
    .mutate(count=(5, 15))
    .productive_only()
    .run_records(seed=42)
)

mutate(*, model='s5f', count=None, rate=None, s5f_model='hh_s5f', segment_rates=None, v_subregion_rates=None)

Append a somatic-hypermutation step.

model selects the mutation kernel:

  • "s5f" (default): context-dependent SHM via the bundled S5F kernel named in s5f_model. Available kernels: "hh_s5f", "hh_s5f_60", "hh_s5f_opposite", "hkl_s5f".
  • "uniform": position-independent SHM. Each mutated position gets a uniformly drawn A/C/G/T replacement.

Specify intensity with exactly one of rate or count.

rate is the per-base mutation rate (e.g. 0.03 for 3 % SHM, which is roughly memory B-cell SHM). At execute time the engine draws count ~ Poisson(rate × pool_len) against each record's current sequence length — so the realized count scales with each record's actual length, matching how immunologists report SHM in the literature. This is the canonical, biology-default form.
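To make the length scaling concrete, here is a minimal sketch of drawing a Poisson count from rate × length. It is not the engine's sampler: `poisson_draw` (Knuth's multiplication method) and `mutation_count` are hypothetical helpers, and the 400-base record length is an assumed example value.

```python
import math
import random


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return k
        k += 1


def mutation_count(rate, sequence_length, rng):
    """Draw count ~ Poisson(rate * sequence_length) for one record."""
    return poisson_draw(rate * sequence_length, rng)


rng = random.Random(42)
# A 400-base record at 3 % SHM has an expected 12 mutations.
counts = [mutation_count(0.03, 400, rng) for _ in range(5000)]
mean_count = sum(counts) / len(counts)
```

Because the Poisson mean is rate × length, a record twice as long receives on average twice as many mutations at the same rate.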

count is the legacy explicit count distribution, useful for benchmark scripts that want a deterministic count independent of record length:

  • count=15: fixed, every simulation gets exactly 15 mutations.
  • count=(5, 25): uniform integer in [5, 25] (both endpoints inclusive).
  • count=[(5, 1.0), (10, 2.0), ...]: explicit empirical (count, weight) distribution.

Passing both count and rate raises ValueError. Passing neither raises ValueError.
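The three count shapes and the exactly-one-of rule can be sketched as a small validator. This is an assumption-laden illustration: `normalize_count` and `resolve_intensity` are hypothetical names, and the real builder may represent distributions differently.

```python
def normalize_count(spec):
    """Turn any documented count shape into a list of (count, weight) pairs."""
    if isinstance(spec, int):
        return [(spec, 1.0)]  # fixed count
    if (isinstance(spec, tuple) and len(spec) == 2
            and all(isinstance(x, int) for x in spec)):
        low, high = spec
        if low > high:
            raise ValueError(f"empty range: ({low}, {high})")
        # Uniform over the closed interval [low, high].
        return [(c, 1.0) for c in range(low, high + 1)]
    # Explicit empirical (count, weight) distribution.
    return [(int(c), float(w)) for c, w in spec]


def resolve_intensity(count=None, rate=None):
    """Enforce that exactly one of count / rate is supplied."""
    if (count is None) == (rate is None):
        raise ValueError("specify exactly one of count or rate")
    if rate is not None:
        return ("rate", float(rate))
    return ("count", normalize_count(count))


fixed = resolve_intensity(count=15)
ranged = resolve_intensity(count=(5, 7))
empirical = resolve_intensity(count=[(5, 1.0), (10, 2.0)])
by_rate = resolve_intensity(rate=0.03)
```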

TCR guard: somatic hypermutation is a B-cell phenomenon — T-cells do not undergo SHM in the periphery. Calling .mutate() on a TCR-configured experiment raises ValueError to prevent silent biological misuse. Use pcr_amplify / sequencing_errors for sequencing-error realism on TCR data instead.

expand_clones(*, n_clones, per_clone)

Expand the pipeline into clonal lineages.

Marks the per-clone / per-descendant boundary in the chain: steps appended before this call run once per clone — typically just :meth:recombine, which establishes the parent V/D/J + trim + NP + assembled IR for the clonal family. Steps appended after this call run once per read inside the family — typically :meth:mutate and the library-prep / sequencing-stage steps, which introduce per-read divergence within the clone.

Concrete shape::

exp = (Experiment.on("human_igh")
       .recombine()
       .expand_clones(n_clones=10, per_clone=20)
       .mutate(rate=0.05)
       .pcr_amplify(count=2))
result = exp.run_records(seed=0)
# 10 clones × 20 descendants = 200 records.
# Each record carries a ``clone_id`` integer in [0, 10).

n can be omitted from :meth:run_records for a clonal experiment — the runtime expands n_clones * per_clone records automatically. Passing n is allowed only when n == n_clones * per_clone.

Constraints:

  • Both n_clones and per_clone must be positive ints.
  • At most one expansion per pipeline; calling this method twice raises ValueError.
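The record-count rule can be written down as a short check. This is a sketch, not the runtime's code: `resolve_record_count` is a hypothetical helper, and raising ValueError for a mismatched n or a non-positive argument is an assumption, since the text above only says which values are allowed.

```python
def resolve_record_count(n, n_clones, per_clone):
    """Return how many records a clonal run emits, validating n if given."""
    for name, value in (("n_clones", n_clones), ("per_clone", per_clone)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive int, got {value!r}")
    expected = n_clones * per_clone
    if n is not None and n != expected:
        # Assumed error type; the documented rule is only that n must match.
        raise ValueError(f"n={n} does not match n_clones * per_clone = {expected}")
    return expected


omitted = resolve_record_count(None, n_clones=10, per_clone=20)
explicit = resolve_record_count(200, n_clones=10, per_clone=20)
```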

Implementation note: the runtime forks the parent's IR (final Simulation after the pre-fork plan) into descendants by running the post-fork plan from that IR with distinct seeds. Within a clone, every descendant shares the same recombination provenance (V allele, trim, NP bases) and only diverges through the post-fork passes.
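A toy model of the fork may help picture this. It is not the engine: `expand`, the three placeholder allele names, and the use of a mutation count as the only post-fork divergence are all assumptions made for the sketch.

```python
import random


def expand(n_clones, per_clone, seed):
    """Toy clonal expansion returning records in clone-major order."""
    master = random.Random(seed)
    records = []
    for clone_id in range(n_clones):
        # Pre-fork plan: runs once per clone and fixes the shared provenance.
        parent = {
            "v_call": master.choice(["V1", "V2", "V3"]),
            "np1": "".join(master.choice("ACGT") for _ in range(4)),
        }
        for _ in range(per_clone):
            # Post-fork plan: each descendant runs from the parent state
            # with its own seed, so only post-fork fields can differ.
            child_rng = random.Random(master.getrandbits(32))
            records.append({
                "clone_id": clone_id,
                **parent,
                "n_mutations": child_rng.randint(0, 10),
            })
    return records


records = expand(n_clones=3, per_clone=4, seed=0)
```

Every record in a family carries the same v_call and np1 in this model; only n_mutations varies between siblings.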

paired_end(*, r1_length, r2_length=None, insert_size)

Append a paired-end / read-layout step.

Models the Illumina paired-end read layout: each fragment produces R1 (forward from the 5' adapter) and R2 (reverse-complemented from the 3' adapter) windows over the final projected molecule, plus an insert size that locates R2's 3' end. The DSL exposes three integer distributions:

  • r1_length — required.
  • r2_length — defaults to r1_length when None. Many Illumina libraries do run asymmetric (R2 quality drops faster); the explicit shape lets callers opt in.
  • insert_size — required.

Each accepts the same three shapes the rest of the DSL already uses for length-like distributions:

  • int — fixed value.
  • (low, high) — uniform integer in the closed interval [low, high].
  • [(value, weight), …] — explicit empirical distribution.

Engine path: a trace-only :class:~GenAIRR._engine.PairedEndSamplingPass records three Ints at paired_end.r1_length / paired_end.r2_length / paired_end.insert_size; the AIRR builder reads them back at projection time and populates the eight read_layout / r1_sequence / r2_sequence / r1_start / r1_end / r2_start / r2_end / insert_size fields via the Slice B projection kernel. rec.sequence is the only coordinate space — end-loss and rev-comp projections have already finalised the molecule by the time paired-end windows are drawn (design doc §6 / §7).
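As a rough picture of the projection, the sketch below derives R1 and R2 windows from a finished sequence. The coordinate conventions are assumptions made for illustration and are not taken from the engine: the fragment is assumed to start at position 0, insert_size is treated as the fragment length (clipped to the sequence), and coordinates are half-open. `paired_end_windows` is a hypothetical helper; only the output field names come from the text above.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an upper-case nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_end_windows(sequence, r1_length, insert_size, r2_length=None):
    """Compute R1/R2 windows under the assumptions stated in the lead-in."""
    if r2_length is None:
        r2_length = r1_length  # documented default
    fragment_end = min(insert_size, len(sequence))
    # R1 reads forward from the 5' end of the fragment.
    r1_start, r1_end = 0, min(r1_length, fragment_end)
    # R2 covers the 3' end of the fragment and is reverse-complemented.
    r2_start, r2_end = max(0, fragment_end - r2_length), fragment_end
    return {
        "r1_start": r1_start, "r1_end": r1_end,
        "r2_start": r2_start, "r2_end": r2_end,
        "r1_sequence": sequence[r1_start:r1_end],
        "r2_sequence": reverse_complement(sequence[r2_start:r2_end]),
        "insert_size": insert_size,
    }


layout = paired_end_windows("AACCGGTTACGT", r1_length=4, insert_size=10)
```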

Both VDJ and VJ chains supported. Paired-end is a sequencing-stage observable, not a biology mechanism; it makes sense on every chain.

At most once per experiment. Calling :meth:paired_end twice raises ValueError — last-one-wins semantics would silently override an over-eager builder.

Position in the chain. The compile pre-pass extracts the step and pushes the engine pass at the end of the plan, after every IR-mutating / corruption / orientation step. Even though the pass is trace-only, recording the choices last keeps the trace order aligned with the biological/readout order (recombine → mutation → corruption → end-loss → paired-end).
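The reordering can be sketched with step names standing in for real pipeline steps. `plan_order` is a hypothetical helper, and modelling steps as plain strings is an assumption for brevity.

```python
def plan_order(steps):
    """Move the single paired_end step to the end of the plan."""
    paired = [s for s in steps if s == "paired_end"]
    if len(paired) > 1:
        raise ValueError("paired_end may be called at most once")
    # Keep every other step in its declared order, then append paired_end.
    return [s for s in steps if s != "paired_end"] + paired


ordered = plan_order(["recombine", "paired_end", "mutate", "pcr_amplify"])
```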

Returns self so the call chains fluently.

run_records(*, n=None, seed=0, strict=False, expose_provenance=False, allow_curatable_refdata=None, validate_records=False)

Compile and run, then return the batch as a :class:SimulationResult ready for .to_csv / .to_fasta / .to_dataframe export.

For non-clonal experiments n defaults to 1. For clonal experiments (when the pipeline contains :meth:with_clonal_structure) n defaults to n_clones * size and may be omitted; passing n explicitly is allowed only if it matches that product.

expose_provenance=True appends truth_v_call, truth_d_call, truth_j_call columns containing the originally-sampled allele names — distinct from the evidence-driven v_call / d_call / j_call fields an aligner would produce. Useful for benchmarking aligners against ground truth without keeping a side truth file.
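One way to use the truth columns is a per-segment accuracy score. The sketch below assumes the records have already been exported to a list of dicts and uses hand-written rows rather than real output; `call_accuracy` is a hypothetical helper, and exact string equality is a simplifying assumption about how calls are compared.

```python
def call_accuracy(rows, segment):
    """Fraction of rows whose evidence-driven call equals the sampled truth."""
    hits = sum(
        1 for row in rows
        if row[f"{segment}_call"] == row[f"truth_{segment}_call"]
    )
    return hits / len(rows)


# Hand-written rows for illustration only; not produced by a simulation.
rows = [
    {"v_call": "IGHV3-23*01", "truth_v_call": "IGHV3-23*01"},
    {"v_call": "IGHV3-23*04", "truth_v_call": "IGHV3-23*01"},
    {"v_call": "IGHV1-69*01", "truth_v_call": "IGHV1-69*01"},
    {"v_call": "IGHV4-34*01", "truth_v_call": "IGHV4-34*01"},
]
v_accuracy = call_accuracy(rows, "v")
```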

strict semantics match :meth:run — strict-mode applies only to fresh sampling. Trace replay (:meth:CompiledExperiment.replay_from_trace_file) consumes recorded sentinel values verbatim, so a permissive trace replays cleanly even with strict=True. See docs/productive_failure_mode_audit.md §5.

validate_records=True runs :meth:SimulationResult.validate_records on the freshly built batch before returning. If any record fails the postcondition validator the call raises :class:GenAIRR._validation.RecordValidationFailedError (a :class:RuntimeError subclass) carrying a machine-greppable summary of the failures. The check costs roughly one outcome-side re-derivation per record, so it defaults to False; flip it on in CI or when chasing a suspected projection bug. The validator runs before any with_metadata stamps are applied, matching the order :meth:SimulationResult.validate_records would see on a separate post-hoc call (metadata columns are per-batch annotations, not engine-derived fields).

Returns a :class:SimulationResult; clonal records carry an integer clone_id field per row.

run(*, n=None, seed=0, strict=False, allow_curatable_refdata=None)

Compile and run this experiment n times.

Equivalent to self.compile().run(n=n, seed=seed, strict=strict). Returns a list of :class:GenAIRR._engine.Outcome objects in clone-major order for clonal experiments.

Attach :meth:productive_only (or any future constraint method) to the chain to require admissible records; the runtime filters NP base draws, length samples, and mutation / contamination substitutions in real time so the resulting sequences satisfy the bundle by construction.

Statically impossible contract configurations fail during compile() with ValueError. For runtime residue — i.e., empty admissible support emerging dynamically at sample time — strict=False (default) lets a pass consume the slot as its explicit no-op / sentinel; strict=True raises :class:GenAIRR._engine.StrictSamplingError instead.

Note the two error paths use different exception classes: ValueError for compile-time preconditions, StrictSamplingError (subclass of Exception, NOT of ValueError) for runtime empty-support. A bare except ValueError: will not catch the runtime case. See docs/productive_failure_mode_audit.md §6.1.
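The consequence for exception handlers can be shown without the library. `StrictSamplingError` below is a local stand-in with the documented base class (Exception, not ValueError), and `run_once` is a hypothetical function that simply raises one error or the other.

```python
class StrictSamplingError(Exception):
    """Local stand-in: subclasses Exception, not ValueError."""


def run_once(fail_with=None):
    """Pretend run that fails on the compile-time or the runtime path."""
    if fail_with == "compile":
        raise ValueError("statically impossible sampling distribution")
    if fail_with == "runtime":
        raise StrictSamplingError("np1", "np1.length", "empty admissible support")
    return "ok"


def classify(fail_with=None):
    """Handle both failure paths with separate except clauses."""
    try:
        return run_once(fail_with)
    except ValueError:
        return "compile-time precondition"
    except StrictSamplingError as err:
        pass_name, address, reason = err.args
        return f"runtime empty support in {pass_name} at {address}: {reason}"


def naive(fail_with=None):
    """A handler that only knows about ValueError misses the runtime case."""
    try:
        return run_once(fail_with)
    except ValueError:
        return "caught"


outcomes = [classify(None), classify("compile"), classify("runtime")]
```

Calling naive("runtime") lets the StrictSamplingError propagate, which is the pitfall described above.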

strict only governs fresh sampling. Trace replay (:meth:CompiledExperiment.replay_from_trace_file) consumes recorded values verbatim; a permissive-recorded sentinel trace replays cleanly even with strict=True. To re-execute a trace under strict-fresh semantics, call simulator.run(seed=<original_seed>, strict=True) instead.

Output-correctness validation is on :meth:run_records only. This method returns raw Outcome objects, which have no projected AIRR record to validate; pass validate_records=True to :meth:run_records to opt into the post-build check (which raises :class:GenAIRR._validation.RecordValidationFailedError on any failure). For an outcome-by-outcome post-hoc check without re-running, build a :class:SimulationResult via :meth:SimulationResult.from_outcomes and call :meth:SimulationResult.validate_records on it.

compile(*, allow_curatable_refdata=None)

Compile the recorded steps into a reusable :class:CompiledExperiment (or :class:CompiledClonalExperiment when the pipeline contains a :meth:expand_clones fork).

Idempotent: calling compile() twice produces two distinct compiled instances with structurally-equal simulators.

Constraints declared via :meth:productive_only (or future bundle methods) are baked into the compiled simulator at this step; they're not runtime knobs. To run without constraints, omit the constraint methods from the chain.

allow_curatable_refdata selects the refdata validation mode. None (default) inherits the instance flag set by :meth:allow_curatable_refdata; an explicit True / False overrides per-call. False runs the gate in strict mode — every issue rejects compile with a :class:ValueError. True runs the lenient mode — Fatal issues (empty pool, duplicates, invalid byte, anchor out of bounds) still reject, but Curatable issues (pseudogene-shape anchor anomalies) pass.
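The flag resolution and the two severities can be sketched as a small gate. This only illustrates the documented decision table: `refdata_gate` is a hypothetical function, and the string issue labels are made up for the example.

```python
# Issue labels are illustrative stand-ins for the documented Fatal categories.
FATAL = {"empty_pool", "duplicate", "invalid_byte", "anchor_out_of_bounds"}


def refdata_gate(issues, allow_curatable, instance_flag):
    """Return the effective mode, or raise ValueError if compile is rejected."""
    # None inherits the instance flag; an explicit bool overrides per call.
    lenient = instance_flag if allow_curatable is None else allow_curatable
    # Fatal issues always reject; Curatable ones reject only in strict mode.
    rejecting = [issue for issue in issues if issue in FATAL or not lenient]
    if rejecting:
        raise ValueError(f"refdata validation failed: {rejecting}")
    return "lenient" if lenient else "strict"


curatable = ["pseudogene_anchor_anomaly"]
per_call = refdata_gate(curatable, allow_curatable=True, instance_flag=False)
inherited = refdata_gate(curatable, allow_curatable=None, instance_flag=True)
clean_strict = refdata_gate([], allow_curatable=False, instance_flag=True)
```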

describe()

Render a biology-style narrative of this experiment.

The output is one line per step, prefixed with its position in the chain. Locks, allele weights, NP-length distributions, SHM kernels, and per-corruption rates are all surfaced. Use this to sanity-check what a fluent chain actually encodes — if it doesn't read like an immunology protocol, the chain is too murky.

Returns a multi-line string with no trailing newline. Safe to print(exp.describe()).

Example::

>>> print(Experiment.on("human_igh").recombine().mutate(count=(5, 15)).describe())
Experiment on human_igh (vdj, DataConfig)
  1. V(D)J recombination: sample V/D/J alleles; empirical exonuclease trim (V3', D5', D3', J5'); insert NP1 (0–11 weighted bases) and NP2 (0–11 weighted bases)
  2. Somatic hypermutation (S5F context model, human heavy-chain (HH_S5F)): 5–15 mutations/record