A contract is a predicate over the simulation state. Pass respect=… to
run() and the engine prunes the candidate distribution at every sampling step so
only choices that satisfy the contract are eligible. No reject-and-retry. No post-hoc
filtering. The constraint is enforced at the moment a decision is made — biology comes
out of the engine valid by construction.
A fourth predicate, NoStopCodonInJunction, runs during the codon walk and is included
in the bundle below. Together these four compose productive().
Most simulators build a sequence first and check it afterwards. If it's broken, they discard and resample. That's slow, statistically biased toward weakly-constrained edges of the distribution, and offers no way to reason about which draws are admissible. GenAIRR inverts the order: the contract is consulted before each random draw, and only candidates that keep the contract intact stay in the support.
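The difference between the two orders can be sketched in a few lines of plain Python. This is a toy illustration, not GenAIRR's engine: the function names, the candidate list, and the weights are all made up for the example, and the only constraint is "length divisible by 3".

```python
import random


def rejection_sample(candidates, weights, is_valid, rng, max_tries=10_000):
    """Reject-and-retry: draw from the full distribution, discard invalid draws."""
    for tries in range(1, max_tries + 1):
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if is_valid(choice):
            return choice, tries
    raise RuntimeError("no valid draw within max_tries")


def pruned_sample(candidates, weights, is_valid, rng):
    """Prune-then-draw: restrict the support first, then draw exactly once."""
    admissible = [(c, w) for c, w in zip(candidates, weights) if is_valid(c)]
    if not admissible:
        raise ValueError("empty admissible support")
    values, kept_weights = zip(*admissible)
    # random.choices treats the weights as relative, so no explicit
    # renormalisation is needed after pruning.
    return rng.choices(values, weights=kept_weights, k=1)[0]


def in_frame(length):
    """Toy contract: only lengths divisible by 3 are admissible."""
    return length % 3 == 0


# Hypothetical candidate lengths 0..9 with made-up weights.
lengths = list(range(10))
weights = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

rng = random.Random(42)
draws = [pruned_sample(lengths, weights, in_frame, rng) for _ in range(1000)]
```

With pruning, every draw lands in the admissible set on the first attempt, and the admissible set itself is an ordinary list you can inspect before sampling.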
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .run(n=1000, seed=42,
         respect=ga.productive())
)
# every record is functional by construction
# productive = True · in-frame · no stops · anchors intact
ga.productive() is not a black box. It's an explicit composition of four small
predicates, each enforcing one biological invariant. You can use the bundle as-is, drop a
predicate to relax the rules, or stack additional contracts on top.
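The idea of a bundle built from small predicates can be sketched generically. Everything below is hypothetical scaffolding for illustration: ToyState, all_of, and the predicate function names are not GenAIRR's API, and the real predicates operate on the engine's own state rather than on a finished junction string.

```python
from dataclasses import dataclass
from typing import Callable

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for simulation state (not GenAIRR's state object)."""
    v_anchor_intact: bool
    j_anchor_intact: bool
    junction: str  # nucleotides from Cys codon start through anchor codon end


Predicate = Callable[[ToyState], bool]


def v_anchor_preserved(state: ToyState) -> bool:
    return state.v_anchor_intact


def j_anchor_preserved(state: ToyState) -> bool:
    return state.j_anchor_intact


def junction_in_frame(state: ToyState) -> bool:
    return len(state.junction) % 3 == 0


def no_stop_codon(state: ToyState) -> bool:
    # Walk complete codons only; a trailing partial codon is ignored.
    junction = state.junction
    codons = (junction[i:i + 3] for i in range(0, len(junction) - 2, 3))
    return not any(codon in STOP_CODONS for codon in codons)


def all_of(*predicates: Predicate) -> Predicate:
    """Compose predicates into one that holds only if every part holds."""
    def combined(state: ToyState) -> bool:
        return all(p(state) for p in predicates)
    return combined


# The full bundle, a relaxed bundle, and a stacked bundle.
productive_like = all_of(v_anchor_preserved, j_anchor_preserved,
                         junction_in_frame, no_stop_codon)
relaxed = all_of(v_anchor_preserved, j_anchor_preserved, junction_in_frame)
stacked = all_of(productive_like, lambda s: len(s.junction) <= 60)

good = ToyState(True, True, "TGTGCGAGATGG")
with_stop = ToyState(True, True, "TGTTAAAGATGG")
```

Here productive_like(good) holds, while with_stop fails the full bundle but passes relaxed, which is the effect of dropping one predicate.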
The V segment's conserved Cysteine codon — the 5′ junction landmark — must survive whatever 3′ trim the engine draws. If a trim length would remove the Cys, that trim is not in the admissible set.
The J segment's conserved Tryptophan (IGH) or Phenylalanine (IGK / IGL / TR) codon, the 3′ junction landmark, must survive the J's 5′ trim. Same admissibility logic, opposite end.
Junction length, measured from the start of the Cys codon through the end of the anchor codon, must be divisible by 3. This is enforced at NP-region length sampling: only the candidate lengths that produce an in-frame junction remain in the support.
After segments and NP regions are assembled, the junction is read codon-by-codon. If a candidate NP base would form a stop codon (TAA · TAG · TGA), it's pruned from the base distribution before the draw.
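Each of these invariants reduces to computing an admissible set before a draw. The sketch below shows the arithmetic under simplifying assumptions: positions are 0-based, the function names and arguments are invented for the example, and the stop-codon check assumes the prefix starts in frame at the Cys codon and looks only at the codon the next base would complete. The J-side trim check mirrors the V-side one and is omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def admissible_v_trims(v_length, cys_start, max_trim):
    """3' trim lengths that leave the whole Cys codon in place.

    The codon occupies cys_start .. cys_start + 2, so only the bases
    after it may be trimmed.
    """
    bases_after_cys = v_length - (cys_start + 3)
    return [t for t in range(max_trim + 1) if t <= bases_after_cys]


def admissible_np_lengths(fixed_junction_bases, max_np):
    """NP lengths for which fixed bases plus NP bases is divisible by 3."""
    return [n for n in range(max_np + 1)
            if (fixed_junction_bases + n) % 3 == 0]


def admissible_next_bases(prefix, weights):
    """Drop bases that would complete a stop codon, then renormalise.

    `prefix` is the junction so far, starting at the Cys codon.
    """
    if len(prefix) % 3 != 2:
        return dict(weights)  # the next base does not complete a codon
    head = prefix[-2:]
    kept = {b: w for b, w in weights.items() if head + b not in STOP_CODONS}
    total = sum(kept.values())
    if total == 0:
        return {}  # empty admissible support
    return {b: w / total for b, w in kept.items()}


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(admissible_v_trims(v_length=20, cys_start=14, max_trim=6))   # [0, 1, 2, 3]
print(admissible_np_lengths(fixed_junction_bases=10, max_np=8))    # [2, 5, 8]
print(admissible_next_bases("TGTTA", uniform))                     # {'C': 0.5, 'T': 0.5}
```

In the last call the prefix ends in TA, so A and G would complete TAA and TAG; they are removed and the remaining probability mass is shared between C and T.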
Some configurations are over-constrained — a particular allele pair simply has no admissible NP length that keeps the junction in-frame. You get to choose what GenAIRR does in that case: relax and continue, or stop and complain.
import GenAIRR as ga

try:
    ga.Experiment.on("human_igh").recombine().run(
        n=100, seed=42,
        respect=ga.productive(),
        strict=True,
    )
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    # pass_name → "generate_np.np1"
    # address → "np.np1.length"
    # reason → "empty_admissible_support"
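The two policies can be illustrated with a toy sampler. This is a sketch under stated assumptions, not GenAIRR's implementation: EmptySupportError and constrained_draw are hypothetical names, and "relax" is modelled here as falling back to the unconstrained distribution, which may differ from what the engine actually does.

```python
import random


class EmptySupportError(Exception):
    """Hypothetical strict-mode error (not GenAIRR's StrictSamplingError)."""


def constrained_draw(candidates, weights, is_admissible, rng,
                     strict=False, address="toy.choice"):
    """Draw from the admissible subset; handle an empty subset per policy.

    Returns (value, relaxed), where relaxed is True if the constraint
    had to be dropped for this draw.
    """
    kept = [(c, w) for c, w in zip(candidates, weights) if is_admissible(c)]
    if kept:
        values, kept_weights = zip(*kept)
        return rng.choices(values, weights=kept_weights, k=1)[0], False
    if strict:
        # Stop and complain: report where the support became empty.
        raise EmptySupportError(address, "empty_admissible_support")
    # Relax and continue: assumed fallback to the unconstrained distribution.
    return rng.choices(candidates, weights=weights, k=1)[0], True


rng = random.Random(0)
lengths, weights = [1, 2, 4], [1, 1, 1]


def in_frame(length):
    return length % 3 == 0  # none of 1, 2, 4 qualifies


value, relaxed = constrained_draw(lengths, weights, in_frame, rng)

try:
    constrained_draw(lengths, weights, in_frame, rng, strict=True)
except EmptySupportError as err:
    address, reason = err.args
```

Returning a flag alongside the value keeps relaxed draws identifiable downstream, so a permissive run can still be audited after the fact.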
Contracts shape what you simulate. The right setting depends on what you're doing with the records downstream.
Use productive(). You want clean, biology-shaped data so the model
learns the signal, not the artifact distribution of broken receptors.
Drop the contracts. You want both productive and non-productive sequences so the classifier sees the full distribution it has to discriminate.
Drop the contracts. The natural productive : non-productive ratio is itself a biological signal you may want to measure or perturb.