Chapter 03 · The discipline

Contracts — sample inside the rules.

A contract is a predicate over the simulation state. Pass respect=… to run() and the engine prunes the candidate distribution at every sampling step so only choices that satisfy the contract are eligible. No reject-and-retry. No post-hoc filtering. The constraint is enforced at the moment a decision is made — biology comes out of the engine valid by construction.

01  AnchorPreserved (V)      V Cys anchor present after trim
02  AnchorPreserved (J)      J Trp/Phe anchor present after trim
03  ProductiveJunctionFrame  junction length ≡ 0 (mod 3)

A fourth predicate — NoStopCodonInJunction — runs at the codon walk and is included in the bundle below. Together these four compose productive().

PART 01

Constraint-aware sampling, not reject-and-retry.

Most simulators build a sequence first and check it afterwards. If it's broken, they discard and resample. That's slow, statistically biased toward weakly-constrained edges of the distribution, and offers no way to reason about which draws are admissible. GenAIRR inverts the order: the contract is consulted before each random draw, and only candidates that keep the contract intact stay in the support.
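The contrast can be sketched in plain Python. This is an illustration of the idea only, not GenAIRR's internals: `sample_pruned`, `sample_reject_retry`, and the `admits` callback are hypothetical names, and the candidates are toy NP lengths with uniform weights.

```python
import random

def sample_pruned(candidates, weights, admits, rng):
    """Draw one candidate after pruning inadmissible ones from the support."""
    kept = [(c, w) for c, w in zip(candidates, weights) if admits(c)]
    if not kept:
        raise ValueError("empty admissible support")
    choices, kept_weights = zip(*kept)
    # random.choices treats weights as relative, so no explicit renormalisation is needed.
    return rng.choices(choices, weights=kept_weights, k=1)[0]

def sample_reject_retry(candidates, weights, admits, rng, max_tries=1000):
    """Baseline for contrast: draw from the full support and discard failures."""
    for tries in range(1, max_tries + 1):
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if admits(choice):
            return choice, tries
    raise RuntimeError("no admissible draw within max_tries")

rng = random.Random(42)
lengths = list(range(10))          # toy candidate lengths 0..9
weights = [1.0] * len(lengths)     # toy uniform weights
in_frame = lambda n: n % 3 == 0    # toy contract: length divisible by 3

pruned_draws = [sample_pruned(lengths, weights, in_frame, rng) for _ in range(100)]
value, tries = sample_reject_retry(lengths, weights, in_frame, rng)
```

The pruned sampler makes exactly one draw per decision and exposes the admissible set directly. The retry loop wastes draws and only learns that a draw was inadmissible after making it.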

A contract, mechanically
  • admits(state, choice): would this choice keep the contract satisfied?
  • verify(state): does the current state satisfy the contract?
  • name: stable string for trace and error messages
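One way to picture this three-member interface is as a Python protocol. The sketch below is an assumption about shape, not GenAIRR's actual class hierarchy: `Contract` and the toy `MaxLength` contract are hypothetical, and the state is a plain string rather than a simulation state.

```python
from dataclasses import dataclass
from typing import Any, Protocol

class Contract(Protocol):
    """Hypothetical shape of a contract: two predicates and a stable name."""
    name: str

    def admits(self, state: Any, choice: Any) -> bool: ...
    def verify(self, state: Any) -> bool: ...

@dataclass(frozen=True)
class MaxLength:
    """Toy contract: the state is a string that must not exceed `limit` characters."""
    limit: int
    name: str = "max_length"

    def admits(self, state: str, choice: str) -> bool:
        # Would appending `choice` keep the contract satisfied?
        return len(state) + len(choice) <= self.limit

    def verify(self, state: str) -> bool:
        # Does the current state satisfy the contract?
        return len(state) <= self.limit

contract: Contract = MaxLength(limit=5)
```

The split matters: `admits` is asked before a draw about a state plus a candidate, while `verify` is asked about a state that already exists.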
The bumpers, in practice
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .run(n=1000, seed=42,
            respect=ga.productive())
)

# records are productive by construction, except where a permissive fallback fires (see PART 03)
# productive = True · in-frame · no stops · anchors intact
PART 02

The four predicates of productive().

ga.productive() is not a black box. It's an explicit composition of four small predicates, each enforcing one biological invariant. You can use the bundle as-is, drop a predicate to relax the rules, or stack additional contracts on top.
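If composition means conjunction (an assumption here, since the exact combinator is not spelled out), a bundle admits a choice only when every member does. The sketch below uses a hypothetical `all_of` helper and two toy predicates over a dictionary state. It does not use GenAIRR's API.

```python
def all_of(*predicates):
    """Combine admits-style predicates: the bundle admits iff every member admits."""
    def admits(state, choice):
        return all(p(state, choice) for p in predicates)
    return admits

# Toy predicates over a dict state and an integer NP length (both hypothetical).
def in_frame(state, np_len):
    return (state["fixed_len"] + np_len) % 3 == 0

def short_np(state, np_len):
    return np_len <= 6

bundle = all_of(in_frame, short_np)   # full bundle
relaxed = all_of(in_frame)            # same bundle with one predicate dropped

state = {"fixed_len": 31}
bundle_ok = [n for n in range(10) if bundle(state, n)]
relaxed_ok = [n for n in range(10) if relaxed(state, n)]
```

Dropping a predicate can only widen the admissible set, and stacking one can only narrow it.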

AnchorPreserved (V)

The V segment's conserved Cysteine codon — the 5′ junction landmark — must survive whatever 3′ trim the engine draws. If a trim length would remove the Cys, that trim is not in the admissible set.

AnchorPreserved (J)

The J segment's conserved Tryptophan (IGH) or Phenylalanine (IGK / IGL / TR) codon — the 3′ junction landmark — must survive the J's 5′ trim. Same admissibility logic, opposite end.
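Both anchor rules reduce to an upper bound on the trim length. The sketch below assumes 0-based anchor positions within the segment and a three-base anchor codon. The function names and the example coordinates are hypothetical, not taken from GenAIRR or any reference allele.

```python
def admissible_v_3prime_trims(segment_len, anchor_start, max_trim):
    """3' trim lengths that leave the V anchor codon (3 bases at anchor_start) intact."""
    anchor_end = anchor_start + 3                  # exclusive end of the anchor codon
    bases_after_anchor = segment_len - anchor_end  # only these bases may be trimmed
    return list(range(0, min(max_trim, bases_after_anchor) + 1))

def admissible_j_5prime_trims(anchor_start, max_trim):
    """5' trim lengths that leave the J anchor codon starting at anchor_start intact."""
    # Exactly `anchor_start` bases precede the anchor, so that is the trim ceiling.
    return list(range(0, min(max_trim, anchor_start) + 1))

# Made-up coordinates for illustration.
v_trims = admissible_v_3prime_trims(segment_len=296, anchor_start=285, max_trim=12)
j_trims = admissible_j_5prime_trims(anchor_start=14, max_trim=20)
```

Any trim length outside these lists would cut into the anchor codon, so it never enters the distribution being sampled.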

ProductiveJunctionFrame

Junction length — Cys-codon start through anchor-codon end — must be divisible by 3. Enforced at NP-region length sampling, where the candidate lengths that produce an in-frame junction are the only ones drawn from.
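The frame rule is a modular filter on candidate lengths. In the sketch below, `fixed_len` stands for the junction bases contributed by everything except the NP region being sampled. That decomposition and the function name are assumptions made for illustration.

```python
def in_frame_np_lengths(fixed_len, candidate_lengths):
    """Keep the NP lengths for which the total junction length is a multiple of 3."""
    return [n for n in candidate_lengths if (fixed_len + n) % 3 == 0]

# 31 fixed junction bases: 31 % 3 == 1, so the NP length must be 2 (mod 3).
admissible = in_frame_np_lengths(fixed_len=31, candidate_lengths=range(0, 11))

# If the candidate range is too narrow, the admissible support is empty.
empty = in_frame_np_lengths(fixed_len=31, candidate_lengths=[0, 1])
```

The second call shows how an over-constrained configuration arises: no candidate length lands on the required residue.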

NoStopCodonInJunction

After segments and NP regions are assembled, the junction is read codon-by-codon. If a candidate NP base would form a stop codon (TAA · TAG · TGA), it's pruned from the base distribution before the draw.
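A simplified version of the base-level pruning looks like this. It assumes the junction prefix starts at the first base of the V anchor codon, so codon boundaries fall at multiples of three. It only checks the codon that the new base would complete. `admissible_bases` is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def admissible_bases(junction_prefix, bases="ACGT"):
    """Bases that can be appended to the prefix without completing a stop codon."""
    if len(junction_prefix) % 3 != 2:
        # The new base lands in codon position 1 or 2, so it cannot close a codon yet.
        return list(bases)
    partial = junction_prefix[-2:]  # first two bases of the codon being completed
    return [b for b in bases if partial + b not in STOP_CODONS]

after_ta = admissible_bases("TGTGCGTA")   # open codon is "TA?"
after_tg = admissible_bases("TGTGCGTG")   # open codon is "TG?"
mid_codon = admissible_bases("TGTGCGT")   # new base is codon position 2
```

A base placed in codon position 1 or 2 can still form a stop codon together with fixed bases that follow it, for example at an NP/segment boundary. Handling that case needs lookahead, which this sketch leaves out.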

PART 03

What happens when a contract can't be satisfied?

Some configurations are over-constrained — a particular allele pair simply has no admissible NP length that keeps the junction in-frame. You get to choose what GenAIRR does in that case: relax and continue, or stop and complain.

Permissive (default)
  • strict=False: if the admissible support is empty, fall back to the unconstrained distribution and continue
  • use when: you want a steady stream of records and can tolerate the occasional non-productive sequence
  • trace shows: a fallback marker on the affected draw, so every fallback is fully observable
Strict
  • strict=True: raise StrictSamplingError with the pass name, trace address, and reason
  • use when: 100% contract compliance is required (training data, validation, formal benchmarks)
  • recovery: catch the error, log it, and surface the failing constraint to the caller
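The two modes differ only in what happens when pruning leaves nothing to draw from. The sketch below models that branch in plain Python. `constrained_draw` and `EmptySupportError` are hypothetical stand-ins, not GenAIRR's sampler or its StrictSamplingError, and the returned boolean plays the role of the fallback marker in the trace.

```python
import random

class EmptySupportError(Exception):
    """Hypothetical stand-in for a strict-mode failure."""

def constrained_draw(candidates, weights, admits, rng, strict=False):
    """Return (choice, fell_back). Prune first; handle an empty support per mode."""
    kept = [(c, w) for c, w in zip(candidates, weights) if admits(c)]
    if kept:
        choices, kept_weights = zip(*kept)
        return rng.choices(choices, weights=kept_weights, k=1)[0], False
    if strict:
        raise EmptySupportError("empty_admissible_support")
    # Permissive: draw from the unconstrained distribution and flag the fallback.
    return rng.choices(candidates, weights=weights, k=1)[0], True

rng = random.Random(0)
never = lambda n: False  # toy contract that nothing satisfies

fallback_value, fell_back = constrained_draw([1, 2], [1.0, 1.0], never, rng)
normal_value, normal_flag = constrained_draw([1, 2, 3], [1.0] * 3, lambda n: n == 3, rng)
```

Returning the flag alongside the value keeps the permissive path auditable: downstream code can count or filter the affected records instead of guessing which ones were relaxed.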
Catching a strict failure
import GenAIRR as ga

try:
    ga.Experiment.on("human_igh").recombine().run(
        n=100, seed=42,
        respect=ga.productive(),
        strict=True,
    )
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    # pass_name → "generate_np.np1"
    # address   → "np.np1.length"
    # reason    → "empty_admissible_support"
PART 04

When to switch contracts on or off.

Contracts shape what you simulate. The right setting depends on what you're doing with the records downstream.

Training an aligner / annotation model

Use productive(). You want clean, biology-shaped data so the model learns the signal, not the artifact distribution of broken receptors.

Benchmarking a productivity classifier

Drop the contracts. You want both productive and non-productive sequences so the classifier sees the full distribution it has to discriminate.

Studying natural repertoire composition

Drop the contracts. The natural productive : non-productive ratio is itself a biological signal you may want to measure or perturb.
