.with_clonal_structure(n_clones, size) forks the pipeline. Passes before
the fork are shared by every descendant of a clone (the V/D/J/NP backbone). Passes
after the fork run independently per descendant (the SHM history that makes each
member of the lineage unique). Every output record carries a clone_id so you can
group descendants back together for lineage analysis.
Output: n_clones × size records total, grouped by clone_id ∈ [0,
n_clones). Each clone's members share V/D/J/NP exactly; downstream passes diverge.
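The grouping contract above can be checked mechanically on any list of records. The following is a minimal sketch, assuming each record is a dict with a clone_id key, as in the examples in this section. check_clonal_layout is a hypothetical helper, not part of GenAIRR, and the records are synthetic stand-ins rather than simulator output.

```python
from collections import Counter

def check_clonal_layout(records, n_clones, size):
    """Return True if records form n_clones groups of exactly `size` members."""
    counts = Counter(r["clone_id"] for r in records)
    # clone_id must cover 0 .. n_clones-1, and each id must appear `size` times
    return (
        len(records) == n_clones * size
        and set(counts) == set(range(n_clones))
        and all(c == size for c in counts.values())
    )

# synthetic stand-in records (hypothetical, not produced by GenAIRR)
fake_records = [{"clone_id": c} for c in range(3) for _ in range(4)]
layout_ok = check_clonal_layout(fake_records, n_clones=3, size=4)
layout_bad = check_clonal_layout(fake_records[:-1], n_clones=3, size=4)
```

With real output you would pass result.records and the same n_clones and size you gave to .with_clonal_structure(...).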
Where you place .with_clonal_structure(...) in the chain decides what each
clone's descendants share and what they each draw fresh. Put it too late and every member
of a "lineage" in your output is identical; put it before .recombine() and the pipeline is rejected.
Place .with_clonal_structure(...) between .recombine() and the
first divergent pass. Omit n from run_records() — for clonal
experiments the total record count is auto-derived as n_clones × size.
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()                                   # shared by descendants
    .with_clonal_structure(n_clones=10, size=8)    # fork here
    .mutate(model="s5f", count=(5, 25))            # divergent
    .run_records(seed=42, expose_provenance=True)
)

# 80 records total (10 clones × 8 descendants), each carrying clone_id
print(len(result.records))  # 80

# group descendants back together by clone_id
from collections import defaultdict
by_clone = defaultdict(list)
for r in result.records:
    by_clone[r["clone_id"]].append(r)

# within any clone: V/D/J identical, mutation count varies
clone0 = by_clone[0]
assert len({r["truth_v_call"] for r in clone0}) == 1  # one V allele
assert len({r["n_mutations"] for r in clone0}) > 1    # varied SHM
Two ways to get the chain wrong. The first produces technically valid records that look wrong on inspection; the second is rejected at compile time.
# DON'T: mutate before with_clonal_structure
result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(model="s5f", count=(5, 25))            # shared by all descendants
    .with_clonal_structure(n_clones=10, size=8)    # fork AFTER mutate
    .run_records(seed=42, expose_provenance=True)
)
# every descendant in a clone has IDENTICAL sequence — the SHM was applied once,
# before the fork, so all 8 members of clone 0 inherit the same mutated parent.
# the records carry n_mutations=0 because each descendant added zero of its own.
# NOT SUPPORTED: recombine() can't run after with_clonal_structure —
# the engine raises at compile time. The fork is supposed to inherit a parent
# recombination, so recombine() always comes first. Mentioned for completeness.
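The first mistake above (mutating before the fork) can be caught with a quick check: if a clone with several members contains only one distinct sequence, the divergent pass probably ran before the fork. The following is a minimal sketch, assuming each record exposes its nucleotide sequence under a "sequence" key. That key name is an assumption; this section only shows clone_id, truth_v_call, n_mutations and productive. find_collapsed_clones is a hypothetical helper, not part of GenAIRR, and the records are synthetic.

```python
from collections import defaultdict

def find_collapsed_clones(records, seq_key="sequence"):
    """Return sorted clone_ids whose members all carry the same sequence."""
    seqs = defaultdict(set)
    sizes = defaultdict(int)
    for r in records:
        seqs[r["clone_id"]].add(r[seq_key])
        sizes[r["clone_id"]] += 1
    # a clone of size 1 is trivially uniform, so only flag clones with 2+ members
    return sorted(c for c in seqs if sizes[c] > 1 and len(seqs[c]) == 1)

# synthetic stand-in records: clone 0 diverged, clone 1 collapsed
fake_records = [
    {"clone_id": 0, "sequence": "ACGT"},
    {"clone_id": 0, "sequence": "ACGA"},
    {"clone_id": 1, "sequence": "TTGC"},
    {"clone_id": 1, "sequence": "TTGC"},
]
collapsed = find_collapsed_clones(fake_records)
```

An empty result does not prove the chain is correct, but a non-empty one is a strong hint that the fork sits after the mutation pass.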
The .with_clonal_structure(...) fork composes with
respect=ga.productive(). The contract is enforced at the pre-fork recombination
phase: every descendant inherits a parent V/D/J/NP that is in-frame, has both anchors
intact, and starts with a clean junction. Downstream per-descendant passes (SHM,
corruption) can still break productivity: an SHM substitution that hits the junction can
introduce a stop codon, and an indel can shift the frame. Filter post-hoc on r["productive"]
if you need 100% productive output.
result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .with_clonal_structure(n_clones=25, size=20)
    .mutate(model="s5f", count=(5, 25))
    .run_records(seed=42, respect=ga.productive(),
                 expose_provenance=True)
)
# pre-fork enforcement: every parent is productive by construction
# post-fork: SHM in the junction can introduce stop codons → some descendants are flagged non-productive
prod_rate = sum(1 for r in result.records if r["productive"]) / len(result.records)
print(f"productive rate: {prod_rate:.2%}") # ~85-95% typical
productive_only = [r for r in result.records if r["productive"]]
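Because productivity is lost per descendant, a single overall rate can hide clones that lost most of their members. The following is a minimal sketch of a per-clone breakdown, assuming records are dicts with the clone_id and productive keys used above. productive_rate_by_clone is a hypothetical helper, not part of GenAIRR, and the records are synthetic.

```python
from collections import defaultdict

def productive_rate_by_clone(records):
    """Map clone_id -> fraction of that clone's members flagged productive."""
    total = defaultdict(int)
    prod = defaultdict(int)
    for r in records:
        total[r["clone_id"]] += 1
        prod[r["clone_id"]] += bool(r["productive"])
    return {c: prod[c] / total[c] for c in total}

# synthetic stand-in records (hypothetical, not produced by GenAIRR)
fake_records = [
    {"clone_id": 0, "productive": True},
    {"clone_id": 0, "productive": False},
    {"clone_id": 1, "productive": True},
    {"clone_id": 1, "productive": True},
]
rates = productive_rate_by_clone(fake_records)
```

If lineage analysis needs a minimum number of productive members per clone, this mapping shows which clones to keep after filtering.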