Add one keyword argument and every record GenAIRR emits will be a functional receptor — in frame, no stop codons, anchors intact. No reject-and-retry loop; the engine prunes the candidate distribution at sample time so productive sequences fall out by construction.
The contract bundle ga.productive() combines four predicates: V anchor
preserved, J anchor preserved, junction length divisible by three, no stop codons in the
junction. Pass it as respect= and the engine refuses to pick any draw that
would break those invariants.
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .run(n=1000, seed=42,
         respect=ga.productive())
)
# every record is productive by construction
# productive = True · vj_in_frame = True · stop_codon = False
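To make the four predicates concrete, here is a minimal post-hoc checker in plain Python. It is an illustration only and is not how GenAIRR enforces the contract (the engine constrains draws at sample time). The helper name `check_junction` is hypothetical, and the sketch assumes an AIRR-style junction string that starts at the conserved Cys codon and ends at the conserved Trp/Phe codon.

```python
# Hypothetical helper, not part of GenAIRR: checks the four predicates on a
# nucleotide junction string after the fact.
STOP_CODONS = {"TAA", "TAG", "TGA"}
V_ANCHOR_CODONS = {"TGT", "TGC"}         # Cys (assumed V anchor at junction start)
J_ANCHOR_CODONS = {"TGG", "TTT", "TTC"}  # Trp or Phe (assumed J anchor at junction end)


def check_junction(junction: str) -> dict:
    """Return one boolean per predicate plus an overall 'productive' flag."""
    seq = junction.upper()
    # Read complete codons from the start of the junction; ignore a trailing
    # partial codon if the length is not a multiple of three.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    checks = {
        "v_anchor": seq[:3] in V_ANCHOR_CODONS,
        "j_anchor": seq[-3:] in J_ANCHOR_CODONS,
        "in_frame": len(seq) >= 6 and len(seq) % 3 == 0,
        "no_stop": not any(c in STOP_CODONS for c in codons),
    }
    checks["productive"] = all(checks.values())
    return checks


good = check_junction("TGTGCGAGAGATTACTGG")         # Cys ... Trp, 18 nt
with_stop = check_junction("TGTGCGTGAGATTACTGG")    # TGA in the third codon
out_of_frame = check_junction("TGTGCGAGAGATTACTTGG")  # 19 nt
```

A checker like this is useful for auditing output from any source; it does not replace passing `respect=ga.productive()` when you want the constraint applied during sampling.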
For most allele pairs the admissible support is plenty wide. But some configurations are
over-constrained: a particular V-J combination may simply have no NP length that keeps the
junction in frame. By default the engine relaxes the constraint and continues, so you still
get a record, though it may be non-productive. Pass strict=True if you'd rather have the
failure raised as an exception.
try:
    ga.Experiment.on("human_igh").recombine().run(
        n=100, seed=42,
        respect=ga.productive(),
        strict=True,
    )
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    # pass_name = "generate_np.np1"
    # address   = "np.np1.length"
    # reason    = "empty_admissible_support"
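The difference between permissive and strict behaviour can be sketched with a toy sampler. This is a simplified model under stated assumptions, not GenAIRR's implementation: `sample_np_length`, `EmptySupportError`, and the idea of a single fixed remainder length are all hypothetical. It prunes candidate NP lengths to those that keep the junction in frame, then either relaxes or raises when nothing survives.

```python
import random


class EmptySupportError(Exception):
    """Hypothetical stand-in for a strict-mode sampling failure."""


def sample_np_length(weights, fixed_len, rng, strict=False):
    """Draw an NP length so that fixed_len + NP length is a multiple of three.

    weights   -- dict mapping candidate NP length -> non-negative weight
                 (assumed to contain at least one positive weight)
    fixed_len -- junction nucleotides contributed by everything else (assumed fixed)
    """
    positive = {n: w for n, w in weights.items() if w > 0}
    # Prune the distribution before drawing instead of rejecting afterwards.
    admissible = {n: w for n, w in positive.items() if (fixed_len + n) % 3 == 0}
    if not admissible:
        if strict:
            raise EmptySupportError("empty_admissible_support")
        admissible = positive  # permissive mode: relax the constraint
    lengths = list(admissible)
    return rng.choices(lengths, weights=[admissible[n] for n in lengths], k=1)[0]


rng = random.Random(0)
# Wide support: only length 2 keeps a 16-nt remainder in frame.
draws = [sample_np_length({0: 0.2, 1: 0.3, 2: 0.3, 3: 0.2}, 16, rng) for _ in range(50)]

# Over-constrained: neither 0 nor 3 brings 16 to a multiple of three.
relaxed = sample_np_length({0: 0.5, 3: 0.5}, 16, rng)
try:
    sample_np_length({0: 0.5, 3: 0.5}, 16, rng, strict=True)
    strict_raised = False
except EmptySupportError:
    strict_raised = True
```

Because the pruning happens before the draw, there is no retry loop: every draw from a non-empty admissible set is in frame, and the only way to get an out-of-frame result is the relaxed fallback.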
Sanity-check a small batch before committing to a large run. Without contracts, the natural
productive rate ranges from ~30% (no constraint) to ~50% (with implicit anchor checks).
With ga.productive() in permissive mode you should see 99%+; in strict mode
it's 100% by definition (or the run raises an exception).
result = ga.Experiment.on("human_igh").recombine().run(
    n=1000, seed=42, respect=ga.productive(),
)
rate = sum(1 for o in result
           if o.final_simulation().productive) / len(result)
assert rate >= 0.99, f"productive rate too low: {rate:.3f}"
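On a small batch the observed rate is noisy, so a point estimate alone can mislead in either direction. One optional refinement, shown here as a sketch, is to look at a Wilson score lower bound on the rate. The helper `wilson_lower_bound` is hypothetical and not part of GenAIRR; it only needs the count of productive records and the batch size.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width


# With 1000 of 1000 productive, the bound sits just below 1.
bound_full = wilson_lower_bound(1000, 1000)
# With 99 of 100 productive, the bound is noticeably lower than 0.99.
bound_small = wilson_lower_bound(99, 100)
```

If the lower bound on a small batch is already comfortably above the natural rate you expect without contracts, that is reasonable evidence the bundle is being applied before you launch the large run.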
The productive bundle is a tool for sampling, not a default to leave on. Some studies need the full distribution of natural outcomes — productive and not — and turning the bundle on silently distorts your input.
Use it. You want clean, biology-shaped data so the model learns the signal, not the artifact distribution of broken receptors.
Drop it. You want both productive and non-productive sequences so the classifier sees the full discrimination problem.
Drop it. The natural productive : non-productive ratio is itself a biological signal you may want to measure or perturb.