Experiment
Experiment is GenAIRR's fluent pipeline
builder. Every method returns the same Experiment
extended by one pipeline stage; the pipeline runs when you call
.run_records(...), .run(...), or
.compile().run(...). For the conceptual walk-through —
which methods are ancestor-phase vs descendant-phase, which clonal
model to choose, how compile reuse works — see
the Experiment builder
guide. The reference below catalogues the public surface.
Common methods
The methods you'll reach for in most real pipelines:
| Method | Purpose |
|---|---|
| .recombine(...) | Add the V(D)J recombination pass, the foundational ancestor-phase mechanism |
| .productive_only() | Constrain sampling so only productive rearrangements survive |
| .mutate(...) | Apply biological SHM (uniform or S5F) on top of recombination |
| .clonal_lineage(...) | Grow BCR affinity-maturation lineage trees |
| .clonal_repertoire(...) | Generate TCR / flat-BCR abundance repertoires with clone sizes and duplicate_count |
| .expand_clones(...) | Legacy fixed-size clonal star model |
| .paired_end(...) | Project assembled sequences as paired R1 / R2 reads |
| .run_records(...) | Compile + run + return a SimulationResult |
For the full pipeline-stage map (constraints, trim overrides, all the corruption methods, compile reuse, strict mode), see The Experiment builder.
Class reference
The autogenerated reference below comes straight from the
docstrings in src/GenAIRR/experiment.py. This page is the
pilot for the wider generated-reference effort.
GenAIRR.experiment.Experiment
Fluent builder for a simulation pipeline.
Build an Experiment from a config name, a :class:DataConfig,
or a :class:GenAIRR._engine.RefDataConfig via :meth:Experiment.on,
chain configuration steps (:meth:recombine, :meth:mutate, and the
other builder methods documented below), then
call :meth:run (or :meth:compile for explicit two-stage flow).
The builder is stateful but not destructive: each fluent call
returns self after appending a step. The same Experiment
can be compile()-d and run()-d multiple times with
different seeds.
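The "returns self after appending a step" contract can be pictured with a toy builder. This is a minimal sketch of the pattern only, not GenAIRR's implementation: `MiniExperiment`, its `steps` list, and the string step labels are hypothetical names invented for illustration.

```python
class MiniExperiment:
    """Toy stand-in for a fluent builder (hypothetical, not GenAIRR code)."""

    def __init__(self, source):
        self.source = source
        self.steps = []          # recorded pipeline steps, in call order

    def recombine(self):
        self.steps.append("recombine")
        return self              # same object, extended by one step

    def mutate(self, rate):
        self.steps.append(f"mutate(rate={rate})")
        return self

    def compile(self):
        # Compiling snapshots the steps without consuming the builder,
        # so the same builder can be compiled (and run) again later.
        return tuple(self.steps)


exp = MiniExperiment("human_igh")
chained = exp.recombine().mutate(rate=0.03)
plan_a = exp.compile()
plan_b = exp.compile()
```

Here `chained` is the very same object as `exp`, and the two compiled plans are equal because compiling leaves the recorded steps untouched.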
on(source)
classmethod
Start an experiment against the given reference data.
source is one of:
- a config-name string (e.g. "human_igh"),
- a :class:GenAIRR.DataConfig instance,
- a :class:GenAIRR._engine.RefDataConfig.
When source is a config name or a DataConfig, the
underlying empirical distributions (NP lengths, per-gene
trims) are kept on the experiment so :meth:recombine can
use them as the default sampling distributions. A bare
RefDataConfig has no such backing — :meth:recombine
falls through to its uniform [0..6] placeholder unless
the caller passes np1_lengths / np2_lengths
explicitly.
recombine(*, np1_lengths=None, np2_lengths=None, v_allele_weights=None, d_allele_weights=None, j_allele_weights=None)
Append a standard V(D)J recombination step.
Compiles to:
- VJ: sample V → sample J → (trim V_3, J_5) → assemble V → generate NP1 → assemble J.
- VDJ: sample V → sample D → sample J → (trim V_3, D_5, D_3, J_5) → assemble V → generate NP1 → assemble D → generate NP2 → assemble J.
np1_lengths / np2_lengths default to the species'
empirical NP-length distributions (from
DataConfig.NP_lengths) when the experiment is bound to
a DataConfig. For raw-RefDataConfig experiments where no
empirical data is available, both fall back to the uniform
[(0, 1.0), ..., (6, 1.0)] distribution and emit a
:class:UserWarning so the caller knows the synthetic
default is being used. Pass an explicit iterable of
(length, weight) tuples to override the default.
Passing np2_lengths on a VJ chain raises ValueError
(VJ chains have no NP2 region — there's no D segment to
bracket).
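To make the (length, weight) shape concrete, the sketch below builds the uniform 0..6 placeholder and draws from a custom distribution using only the standard library. It illustrates the data shape, not the engine's sampler; `sample_length` and `custom_np1` are hypothetical names.

```python
import random

# The uniform placeholder described above: lengths 0..6, equal weight.
uniform_np = [(length, 1.0) for length in range(7)]


def sample_length(dist, rng):
    """Draw one length from an iterable of (length, weight) tuples."""
    lengths, weights = zip(*dist)
    return rng.choices(lengths, weights=weights, k=1)[0]


rng = random.Random(0)
custom_np1 = [(0, 1.0), (3, 4.0), (6, 1.0)]   # hypothetical override
draws = [sample_length(custom_np1, rng) for _ in range(1000)]
```

With these weights, length 3 should appear roughly four times as often as length 0 or 6. A list in this shape is what would be passed as np1_lengths or np2_lengths.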
Exonuclease trim is enabled by default and uses the
empirical per-segment trim distributions from the bound
DataConfig when available. To disable trim, or supply
custom trim distributions, call :meth:trim after
:meth:recombine in the chain (before any mutation /
corruption step). On a raw RefDataConfig without trim
data, recombine emits a :class:UserWarning and falls back
to a no-op trim.
v_allele_weights / d_allele_weights /
j_allele_weights — optional {allele_name:
weight} dicts that bias allele sampling. Listed alleles
get the supplied positive weight; unlisted alleles default
to 1.0, so e.g. v_allele_weights={"IGHV3-23*01": 100}
boosts that allele while keeping every other V allele
possible at 1/100th the rate. Mutually exclusive with the
per-segment :meth:restrict_alleles restriction. Raises
ValueError for unknown allele names or non-positive
weights.
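The weighting rule (listed alleles take the supplied weight, unlisted alleles default to 1.0) can be sketched as a small normalisation helper. This is an illustration of the documented semantics under stated assumptions, not the engine's code: `resolve_allele_weights` and the three-allele pool are hypothetical.

```python
def resolve_allele_weights(pool, overrides=None):
    """Return {allele: probability}; unlisted alleles default to weight 1.0."""
    overrides = overrides or {}
    for name, weight in overrides.items():
        if name not in pool:
            raise ValueError(f"unknown allele: {name}")
        if weight <= 0:
            raise ValueError(f"non-positive weight for {name}: {weight}")
    weights = {name: float(overrides.get(name, 1.0)) for name in pool}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical three-allele pool, for illustration only.
pool = ["IGHV3-23*01", "IGHV1-69*01", "IGHV4-34*01"]
probs = resolve_allele_weights(pool, {"IGHV3-23*01": 100})
```

In this toy pool the boosted allele is drawn with probability 100/102 and each other allele with probability 1/102, which is the "1/100th the rate" relationship described above.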
productive_only()
Require every emitted record to be a productive sequence.
Attaches the canonical productive-sequence contract bundle to
the experiment: junction frame in-register, no stop codons in
the junction, and V/J anchor amino acids preserved. The bundle
is enforced during recombination, mutation, and corruption
passes by narrowing each pass's action support before
sampling. If a constrained support is empty, permissive
execution uses the pass's explicit no-op / sentinel behavior;
use strict=True at run time to raise instead.
Failure surfaces (see
docs/productive_failure_mode_audit.md for the full matrix):
- Compile-time precondition: when a sampling distribution
  is statically impossible under the bundle (e.g. every NP1
  length violates frame), :meth:compile raises ValueError
  regardless of the strict flag.
- Runtime fresh strict (run(..., strict=True)): when dynamic
  state makes a sampler's admissible support empty, raises
  :class:GenAIRR._engine.StrictSamplingError with structured
  (pass_name, address, reason) args.
- Runtime fresh permissive (strict=False, default): the pass
  records its declared sentinel (indel site=-1, NP length 0,
  NP base N, trim 0) or skips the slot; the record continues.
- Trace replay (replay_from_trace_file): consumes recorded
  values verbatim, does not re-evaluate admissibility. A
  permissive-sentinel trace replays cleanly even with strict=True.
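The idea of narrowing a pass's support before sampling, with a sentinel or an exception when nothing admissible remains, can be sketched as follows. This is a simplified illustration, not the engine's logic: it models only the frame rule, and `admissible_np_lengths`, `pick_np_length`, `EmptySupport`, and the `fixed_junction_len` parameter are hypothetical names.

```python
import random


class EmptySupport(Exception):
    """Hypothetical stand-in for a strict-mode sampling failure."""


def admissible_np_lengths(candidates, fixed_junction_len):
    """Keep NP lengths that leave the junction length a multiple of 3."""
    return [n for n in candidates if (fixed_junction_len + n) % 3 == 0]


def pick_np_length(candidates, fixed_junction_len, rng, strict=False):
    support = admissible_np_lengths(candidates, fixed_junction_len)
    if not support:
        if strict:
            # Structured (pass_name, address, reason) style arguments.
            raise EmptySupport("np1", "np1.length", "no in-frame length")
        return 0      # permissive mode: the documented NP-length sentinel
    return rng.choice(support)


rng = random.Random(0)
in_frame = admissible_np_lengths(range(7), fixed_junction_len=10)
fallback = pick_np_length([0, 3], fixed_junction_len=10, rng=rng)
```

With 10 junction bases already fixed, only NP lengths 2 and 5 keep the frame. The candidate set [0, 3] has no admissible member, so the permissive call returns the sentinel 0 while the same call with strict=True would raise.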
Order-independent. This method is a constraint declaration,
not a pipeline step — it can be called anywhere in the chain
and the result is identical. Convention is to place it last
(right before run_records()) so the constraint reads as a
post-hoc requirement on the emitted records.
TCR refdata accepts the call but raises ValueError at
:meth:compile time because TCRs don't have somatic
hypermutation and the productive bundle's anchor checks
assume BCR semantics. Catch this early at the builder if it
matters to you.
Example::
    from GenAIRR.experiment import Experiment

    result = (
        Experiment.on("human_igh")
        .recombine()
        .mutate(count=(5, 15))
        .productive_only()
        .run_records(seed=42)
    )
mutate(*, model='s5f', count=None, rate=None, s5f_model='hh_s5f', segment_rates=None, v_subregion_rates=None)
Append a somatic-hypermutation step.
model selects the mutation kernel:
- "s5f" (default) — context-dependent SHM via the bundled
S5F kernel named in s5f_model. Available kernels:
"hh_s5f", "hh_s5f_60", "hh_s5f_opposite",
"hkl_s5f".
- "uniform" — position-independent SHM. Each mutated
position gets a uniformly drawn A/C/G/T replacement.
Specify intensity with exactly one of rate or count.
rate is the per-base mutation rate (e.g. 0.03 for 3 %
SHM, which is roughly memory B-cell SHM). At execute time the
engine draws count ~ Poisson(rate × pool_len) against each
record's current sequence length — so the realized count
scales with each record's actual length, matching how
immunologists report SHM in the literature. This is the
canonical, biology-default form.
count is the legacy explicit count distribution, useful
for benchmark scripts that want a deterministic count
independent of record length:
- count=15 — fixed: every simulation gets exactly 15
mutations.
- count=(5, 25) — uniform integer in [5, 25] (both
endpoints inclusive).
- count=[(5, 1.0), (10, 2.0), ...] — explicit empirical
(count, weight) distribution.
Passing both count and rate raises ValueError.
Passing neither raises ValueError.
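The two intensity forms can be sketched in plain Python: the three count shapes expand to (count, weight) pairs, and the rate form yields a Poisson draw whose mean is rate × length. This is an illustrative sketch, not the engine's sampler; `normalize_count`, `check_intensity`, `poisson`, and the record length of 360 are assumptions made for the example.

```python
import math
import random


def normalize_count(count):
    """Expand the three accepted count shapes into (count, weight) pairs."""
    if isinstance(count, int):
        return [(count, 1.0)]                       # fixed count
    if isinstance(count, tuple):
        low, high = count
        return [(c, 1.0) for c in range(low, high + 1)]   # closed interval
    return [(int(c), float(w)) for c, w in count]   # explicit distribution


def check_intensity(count=None, rate=None):
    """Enforce 'exactly one of rate or count'."""
    if (count is None) == (rate is None):
        raise ValueError("pass exactly one of count or rate")


def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(1)
seq_len = 360                       # hypothetical record length
draws = [poisson(0.03 * seq_len, rng) for _ in range(2000)]
mean_count = sum(draws) / len(draws)   # expected to be near 0.03 * 360 = 10.8
```

The point of the rate form is visible in the last lines: a longer record raises the Poisson mean proportionally, whereas a count distribution stays the same regardless of length.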
TCR guard: somatic hypermutation is a B-cell
phenomenon — T-cells do not undergo SHM in the periphery.
Calling .mutate() on a TCR-configured experiment raises
ValueError to prevent silent biological misuse. Use
pcr_amplify / sequencing_errors for sequencing-error
realism on TCR data instead.
expand_clones(*, n_clones, per_clone)
Expand the pipeline into clonal lineages.
Marks the per-clone / per-descendant boundary in the chain:
steps appended before this call run once per clone —
typically just :meth:recombine, which establishes the
parent V/D/J + trim + NP + assembled IR for the clonal
family. Steps appended after this call run once per
read inside the family — typically :meth:mutate and the
library-prep / sequencing-stage steps, which introduce
per-read divergence within the clone.
Concrete shape::
    from GenAIRR.experiment import Experiment

    exp = (Experiment.on("human_igh")
           .recombine()
           .expand_clones(n_clones=10, per_clone=20)
           .mutate(rate=0.05)
           .pcr_amplify(count=2))
    result = exp.run_records(seed=0)
    # 10 clones × 20 descendants = 200 records.
    # Each record carries a ``clone_id`` integer in [0, 10).
n can be omitted from :meth:run_records for a clonal
experiment — the runtime expands n_clones * per_clone
records automatically. Passing n is allowed only when
n == n_clones * per_clone.
Constraints:
- Both n_clones and per_clone must be positive ints.
- At most one expansion per pipeline; calling this method
twice raises ValueError.
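The record-count rule and the positivity constraints can be summarised in a few lines. This is a sketch of the documented behaviour only: `resolve_n` is a hypothetical helper, and raising ValueError for a mismatched n is an assumption, since the text above does not name the exception type for that case.

```python
def resolve_n(n, n_clones, per_clone):
    """Return the total record count for a clonal run, validating inputs."""
    for name, value in (("n_clones", n_clones), ("per_clone", per_clone)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive int")
    expected = n_clones * per_clone
    if n is not None and n != expected:
        # Assumption: the exception type for this case is not documented.
        raise ValueError(f"n={n} does not match n_clones * per_clone = {expected}")
    return expected


total = resolve_n(None, n_clones=10, per_clone=20)
# Clone-major order: records 0..19 belong to clone 0, 20..39 to clone 1, ...
clone_ids = [i // 20 for i in range(total)]
```

For the 10 × 20 example above this gives 200 records whose clone_id values run from 0 to 9.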
Implementation note: the runtime forks the parent's IR
(final Simulation after the pre-fork plan) into
descendants by running the post-fork plan from that IR
with distinct seeds. Within a clone, every descendant
shares the same recombination provenance (V allele, trim,
NP bases) and only diverges through the post-fork passes.
paired_end(*, r1_length, r2_length=None, insert_size)
Append a paired-end / read-layout step.
Models the Illumina paired-end read layout: each fragment produces R1 (forward from the 5' adapter) and R2 (reverse-complemented from the 3' adapter) windows over the final projected molecule, plus an insert size that locates R2's 3' end. The DSL exposes three integer distributions:
- r1_length: required.
- r2_length: defaults to r1_length when None. Many Illumina libraries do run asymmetric (R2 quality drops faster); the explicit shape lets callers opt in.
- insert_size: required.

Each accepts the same three shapes the rest of the DSL already uses for length-like distributions:
- int: fixed value.
- (low, high): uniform integer in the closed interval [low, high].
- [(value, weight), …]: explicit empirical distribution.
Engine path: a trace-only
:class:~GenAIRR._engine.PairedEndSamplingPass records
three Ints at paired_end.r1_length /
paired_end.r2_length / paired_end.insert_size;
the AIRR builder reads them back at projection time and
populates the eight read_layout / r1_sequence /
r2_sequence / r1_start / r1_end / r2_start /
r2_end / insert_size fields via the Slice B
projection kernel. rec.sequence is the only
coordinate space — end-loss and rev-comp projections have
already finalised the molecule by the time paired-end
windows are drawn (design doc §6 / §7).
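As a rough picture of how three sampled integers turn into read windows, the sketch below cuts R1 and R2 out of a finished molecule. It is not the projection kernel: it assumes the fragment starts at position 0 of the molecule, that insert_size is clipped to the molecule length, and that coordinates are half-open. `paired_end_windows` is a hypothetical helper; only the returned field names follow the list above.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def paired_end_windows(molecule, r1_length, r2_length=None, *, insert_size):
    """Return R1/R2 windows over `molecule` (assumed layout, see lead-in)."""
    if r2_length is None:
        r2_length = r1_length                       # documented default
    fragment_end = min(insert_size, len(molecule))  # assumption: clip to molecule
    r1_start, r1_end = 0, min(r1_length, fragment_end)
    r2_start, r2_end = max(0, fragment_end - r2_length), fragment_end
    return {
        "r1_sequence": molecule[r1_start:r1_end],
        # R2 is read from the opposite strand: complement, then reverse.
        "r2_sequence": molecule[r2_start:r2_end].translate(COMPLEMENT)[::-1],
        "r1_start": r1_start, "r1_end": r1_end,
        "r2_start": r2_start, "r2_end": r2_end,
        "insert_size": fragment_end,
    }


w = paired_end_windows("ATGCCGTAAG", r1_length=4, insert_size=8)
```

For this ten-base toy molecule, R1 covers positions 0 to 4 and R2 covers positions 4 to 8 on the reverse strand. The real kernel may place and clip windows differently.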
Both VDJ and VJ chains supported. Paired-end is a sequencing-stage observable, not a biology mechanism; it makes sense on every chain.
At most once per experiment. Calling
:meth:paired_end twice raises ValueError rather than
applying last-one-wins semantics, which would silently
discard the earlier call.
Position in the chain. The compile pre-pass extracts the step and pushes the engine pass at the end of the plan, after every IR-mutating / corruption / orientation step. Even though the pass is trace-only, recording the choices last keeps the trace order aligned with the biological/readout order (recombine → mutation → corruption → end-loss → paired-end).
Returns self so the call chains fluently.
run_records(*, n=None, seed=0, strict=False, expose_provenance=False, allow_curatable_refdata=None, validate_records=False)
Compile and run, then return the batch as a
:class:SimulationResult ready for .to_csv / .to_fasta
/ .to_dataframe export.
For non-clonal experiments n defaults to 1. For clonal
experiments (when the pipeline contains
:meth:with_clonal_structure) n defaults to n_clones * size and may
be omitted; passing n explicitly is allowed only if it
matches that product.
expose_provenance=True appends truth_v_call,
truth_d_call, truth_j_call columns containing the
originally-sampled allele names — distinct from the
evidence-driven v_call / d_call / j_call fields
an aligner would produce. Useful for benchmarking aligners
against ground truth without keeping a side truth file.
strict semantics match :meth:run — strict-mode applies
only to fresh sampling. Trace replay
(:meth:CompiledExperiment.replay_from_trace_file) consumes
recorded sentinel values verbatim, so a permissive trace
replays cleanly even with strict=True. See
docs/productive_failure_mode_audit.md §5.
validate_records=True runs
:meth:SimulationResult.validate_records on the freshly
built batch before returning. If any record fails the
postcondition validator the call raises
:class:GenAIRR._validation.RecordValidationFailedError
(a :class:RuntimeError subclass) carrying a
machine-greppable summary of the failures. The check costs
roughly one outcome-side re-derivation per record, so it
defaults to False; flip it on in CI or when chasing a
suspected projection bug. The validator runs before
any with_metadata stamps are applied, matching the
order :meth:SimulationResult.validate_records would see
on a separate post-hoc call (metadata columns are
per-batch annotations, not engine-derived fields).
Returns a :class:SimulationResult; clonal records carry
an integer clone_id field per row.
run(*, n=None, seed=0, strict=False, allow_curatable_refdata=None)
Compile and run this experiment n times.
Equivalent to
self.compile().run(n=n, seed=seed, strict=strict).
Returns a list of :class:GenAIRR._engine.Outcome objects in
clone-major order for clonal experiments.
Attach :meth:productive_only (or any future constraint
method) to the chain to require admissible records; the
runtime filters NP base draws, length samples, and mutation
/ contamination substitutions in real time so the resulting
sequences satisfy the bundle by construction.
Statically impossible contract configurations fail during
compile() with ValueError. For runtime residue
— i.e., empty admissible support emerging dynamically at
sample time — strict=False (default) lets a pass consume
the slot as its explicit no-op / sentinel; strict=True
raises :class:GenAIRR._engine.StrictSamplingError instead.
Note the two error paths use different exception classes:
ValueError for compile-time preconditions,
StrictSamplingError (subclass of Exception, NOT of
ValueError) for runtime empty-support. A bare
except ValueError: will not catch the runtime case. See
docs/productive_failure_mode_audit.md §6.1.
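The consequence for error handling can be demonstrated without the library by mirroring the documented hierarchy. In the sketch below `StrictSamplingError` is a local stand-in class (the real one lives in GenAIRR._engine), and `run_strict`, `classify`, and the argument values are hypothetical.

```python
class StrictSamplingError(Exception):
    """Local stand-in mirroring the documented hierarchy (not a ValueError)."""


def run_strict():
    # Stand-in for a strict run whose admissible support came up empty.
    raise StrictSamplingError("np1_length", "np1.length", "empty support")


def classify(fn):
    try:
        fn()
    except ValueError:
        return "compile-time precondition"
    except StrictSamplingError as err:
        pass_name, address, reason = err.args
        return f"runtime empty support in {pass_name}: {reason}"
    return "ok"


outcome = classify(run_strict)
```

The ValueError clause is skipped and the second clause handles the failure, so code that must survive both paths needs to name both exception classes.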
strict only governs fresh sampling. Trace replay
(:meth:CompiledExperiment.replay_from_trace_file) consumes
recorded values verbatim; a permissive-recorded sentinel
trace replays cleanly even with strict=True. To re-execute
a trace under strict-fresh semantics, call
simulator.run(seed=<original_seed>, strict=True) instead.
Output-correctness validation is on :meth:run_records
only. This method returns raw Outcome objects, which
have no projected AIRR record to validate; pass
validate_records=True to :meth:run_records to opt into
the post-build check (which raises
:class:GenAIRR._validation.RecordValidationFailedError on
any failure). For an outcome-by-outcome post-hoc check
without re-running, build a :class:SimulationResult via
:meth:SimulationResult.from_outcomes and call
:meth:SimulationResult.validate_records on it.
compile(*, allow_curatable_refdata=None)
Compile the recorded steps into a reusable
:class:CompiledExperiment (or :class:CompiledClonalExperiment
when the pipeline contains a :meth:expand_clones
fork).
Idempotent: calling compile() twice produces two distinct
compiled instances with structurally-equal simulators.
Constraints declared via :meth:productive_only (or future
bundle methods) are baked into the compiled simulator at this
step; they're not runtime knobs. To run without constraints,
omit the constraint methods from the chain.
allow_curatable_refdata selects the refdata validation
mode. None (default) inherits the instance flag set by
:meth:allow_curatable_refdata; an explicit True /
False overrides per-call. False runs the gate in
strict mode — every issue rejects compile with a
:class:ValueError. True runs the lenient mode — Fatal
issues (empty pool, duplicates, invalid byte, anchor out of
bounds) still reject, but Curatable issues (pseudogene-shape
anchor anomalies) pass.
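The strict and lenient gate modes, together with the None-inherits-the-instance-flag rule, can be sketched as a small function. This is an illustration of the decision table only, not the validator itself: `refdata_gate`, the `(severity, message)` issue shape, and the `instance_default` parameter are hypothetical.

```python
FATAL, CURATABLE = "fatal", "curatable"


def refdata_gate(issues, *, allow_curatable=None, instance_default=False):
    """Raise ValueError if any issue blocks compile; return the mode used.

    issues is a list of (severity, message) pairs.
    """
    # None inherits the instance-level flag; True/False override per call.
    lenient = instance_default if allow_curatable is None else allow_curatable
    blocking = [msg for sev, msg in issues if sev == FATAL or not lenient]
    if blocking:
        raise ValueError("; ".join(blocking))
    return lenient


curatable_only = [(CURATABLE, "pseudogene-shape anchor anomaly")]
mode = refdata_gate(curatable_only, allow_curatable=True)
```

A Fatal issue blocks compile in either mode, while a Curatable issue blocks it only when the gate runs strict.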
describe()
Render a biology-style narrative of this experiment.
The output is one line per step, prefixed with its position in the chain. Locks, allele weights, NP-length distributions, SHM kernels, and per-corruption rates are all surfaced. Use this to sanity-check what a fluent chain actually encodes — if it doesn't read like an immunology protocol, the chain is too murky.
Returns a multi-line string ending without a trailing newline.
Safe to print(exp.describe()).
Example::
    >>> print(Experiment.on("human_igh").recombine().mutate(count=(5, 15)).describe())
    Experiment on human_igh (vdj, DataConfig)
    1. V(D)J recombination: sample V/D/J alleles; empirical exonuclease trim (V3', D5', D3', J5'); insert NP1 (0–11 weighted bases) and NP2 (0–11 weighted bases)
    2. Somatic hypermutation (S5F context model, human heavy-chain (HH_S5F)): 5–15 mutations/record