Recipe A · 03 · ~12 min · intermediate

Generate clonal families.

.with_clonal_structure(n_clones, size) forks the pipeline. Passes before the fork are shared by every descendant of a clone (the V/D/J/NP backbone). Passes after the fork run independently per descendant (the SHM history that makes each member of the lineage unique). Every output record carries a clone_id so you can group descendants back together for lineage analysis.

01 Pre-fork passes: recombine (runs once per clone)
02 Fork: .with_clonal_structure(n_clones, size) (splits each clone into size descendants)
03 Post-fork passes: mutate · corrupt (run once per descendant)

Output: n_clones × size records total, grouped by clone_id ∈ [0, n_clones). Each clone's members share V/D/J/NP exactly; downstream passes diverge.
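The output contract above can be checked mechanically. The following is a minimal sketch that works on plain dicts; check_clonal_shape is a hypothetical helper (not part of GenAIRR), and the toy records stand in for result.records.

```python
from collections import Counter

def check_clonal_shape(records, n_clones, size):
    """Return True if records form n_clones groups of exactly `size` members."""
    counts = Counter(r["clone_id"] for r in records)
    ids_ok = set(counts) == set(range(n_clones))        # clone_id covers [0, n_clones)
    sizes_ok = all(c == size for c in counts.values())  # every clone has `size` members
    return len(records) == n_clones * size and ids_ok and sizes_ok

# toy stand-in for result.records: 3 clones x 2 descendants
toy_records = [{"clone_id": c} for c in range(3) for _ in range(2)]
shape_ok = check_clonal_shape(toy_records, n_clones=3, size=2)
print(shape_ok)
```

Running a check like this right after run_records() catches a misplaced or missing fork before any lineage analysis starts.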

PART 01

Pipeline order is the whole story.

Where you place .with_clonal_structure(...) in the chain decides what each clone's descendants share and what each one draws fresh. Put the fork too late and the members of every "lineage" come out identical; try to recombine after it and the chain is rejected, because descendants would otherwise have unrelated V/D/J.

Shared across descendants (pre-fork)
  • .recombine(): always pre-fork; the shared V/D/J ancestor is the point of a clone
  • .with_metadata(): put pre-fork if the metadata applies to the whole clone
Divergent across descendants (post-fork)
  • .mutate(...): post-fork; each descendant gets its own SHM history (the canonical pattern)
  • .corrupt_*(...): post-fork; every descendant goes through the sequencer separately
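The shared-versus-divergent split can be pictured with a toy model. This is a conceptual sketch only, not GenAIRR's implementation: run_toy_pipeline and its field names are hypothetical, and random draws stand in for recombination and SHM.

```python
import random

def run_toy_pipeline(n_clones, size, seed=0):
    """Toy fork: one pre-fork draw per clone, one post-fork draw per descendant."""
    rng = random.Random(seed)
    records = []
    for clone_id in range(n_clones):
        # pre-fork: drawn once per clone (stands in for the shared V/D/J/NP backbone)
        backbone = rng.choice(["V1", "V2", "V3"])
        for _ in range(size):
            # post-fork: drawn once per descendant (stands in for per-member SHM)
            n_mut = rng.randint(5, 25)
            records.append({"clone_id": clone_id, "backbone": backbone, "n_mut": n_mut})
    return records

toy = run_toy_pipeline(n_clones=4, size=5, seed=1)
print(len(toy))  # 20
```

Moving a draw from the inner loop to the outer loop is the toy equivalent of moving a pass from post-fork to pre-fork: the value becomes shared by every member of the clone.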
PART 02

The canonical clonal-family chain.

Place .with_clonal_structure(...) between .recombine() and the first divergent pass. Omit n from run_records() — for clonal experiments the total record count is auto-derived as n_clones × size.

10 clones × 8 descendants — shared V/D/J, divergent SHM
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
      .recombine()                                  # shared by descendants
      .with_clonal_structure(n_clones=10, size=8)  # fork here
      .mutate(model="s5f", count=(5, 25))        # divergent
      .run_records(seed=42, expose_provenance=True)
)

# 80 records total (10 clones × 8 descendants), each carrying clone_id
print(len(result.records))   # 80

# group descendants back together by clone_id
from collections import defaultdict
by_clone = defaultdict(list)
for r in result.records:
    by_clone[r["clone_id"]].append(r)

# within any clone: V/D/J identical, mutation count varies
clone0 = by_clone[0]
assert len({r["truth_v_call"] for r in clone0}) == 1     # one V allele
assert len({r["n_mutations"] for r in clone0}) > 1      # varied SHM
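To see how far the members of each clone have diverged, the grouping above extends to a per-clone summary. This is a minimal sketch on plain dicts; summarize_mutations is a hypothetical helper, and the toy records stand in for result.records (only the clone_id and n_mutations fields shown above are used).

```python
from collections import defaultdict
from statistics import mean

def summarize_mutations(records):
    """Map clone_id -> min, max, and mean n_mutations across its members."""
    by_clone = defaultdict(list)
    for r in records:
        by_clone[r["clone_id"]].append(r["n_mutations"])
    return {
        cid: {"min": min(v), "max": max(v), "mean": mean(v)}
        for cid, v in by_clone.items()
    }

# toy stand-in for result.records
toy_records = [
    {"clone_id": 0, "n_mutations": 6},
    {"clone_id": 0, "n_mutations": 14},
    {"clone_id": 1, "n_mutations": 9},
    {"clone_id": 1, "n_mutations": 21},
]
summary = summarize_mutations(toy_records)
print(summary[0]["min"], summary[0]["max"])  # 6 14
```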
PART 03

Common pitfalls.

Two ways to get the chain wrong. The first produces technically valid records that look wrong on inspection; the second is rejected by the engine before any records are produced.

Wrong: mutation before the fork — every clone member is identical
import GenAIRR as ga

# DON'T: mutate before with_clonal_structure
result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .mutate(model="s5f", count=(5, 25))         # shared by all descendants
      .with_clonal_structure(n_clones=10, size=8)   # fork AFTER mutate
      .run_records(seed=42, expose_provenance=True)
)

# every descendant in a clone has IDENTICAL sequence — the SHM was applied once,
# before the fork, so all 8 members of clone 0 inherit the same mutated parent.
# the records carry n_mutations=0 because each descendant added zero of its own.
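This pitfall is easy to detect after the fact. The sketch below flags clones whose members all carry the same sequence. It assumes each record exposes its nucleotide sequence under a "sequence" key, which is an assumption (adjust seq_key to the real field name); find_collapsed_clones is a hypothetical helper, not part of GenAIRR.

```python
from collections import defaultdict

def find_collapsed_clones(records, seq_key="sequence"):
    """Return sorted clone_ids whose members (two or more) all share one sequence."""
    seqs = defaultdict(list)
    for r in records:
        seqs[r["clone_id"]].append(r[seq_key])
    return sorted(cid for cid, s in seqs.items() if len(s) > 1 and len(set(s)) == 1)

# toy stand-in for result.records
toy_records = [
    {"clone_id": 0, "sequence": "ACGT"},
    {"clone_id": 0, "sequence": "ACGT"},  # identical to its sibling: collapsed
    {"clone_id": 1, "sequence": "ACGT"},
    {"clone_id": 1, "sequence": "ACGA"},  # diverged from its sibling
]
collapsed = find_collapsed_clones(toy_records)
print(collapsed)  # [0]
```

A non-empty result on a run that was meant to have divergent SHM suggests the mutate pass sits before the fork.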
Wrong: recombination after the fork (not supported; raises at compile time)
# NOT SUPPORTED: recombine() can't run after with_clonal_structure —
# the engine raises at compile time. The fork is supposed to inherit a parent
# recombination, so recombine() always comes first. Mentioned for completeness.
PART 04

Clonal + productive: works as of v1.1.

The .with_clonal_structure(...) fork composes with respect=ga.productive(). The contract is enforced at the pre-fork recombination phase: every descendant inherits a parent V/D/J/NP that is in-frame, has both anchors intact, and starts with a clean junction. Downstream per-descendant passes (SHM, corruption) can still break productivity: an SHM substitution that hits the junction can introduce a stop codon, and an indel can shift the frame. Filter post hoc on r["productive"] if you need a 100% productive set.

Productive clonal lineages — measure-then-filter
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .with_clonal_structure(n_clones=25, size=20)
      .mutate(model="s5f", count=(5, 25))
      .run_records(seed=42, respect=ga.productive(),
                   expose_provenance=True)
)

# pre-fork enforcement: every parent is productive by construction
# post-fork: SHM in the junction can introduce stop codons → some descendants become non-productive
# (they stay in result.records with productive=False until you filter them out)
prod_rate = sum(1 for r in result.records if r["productive"]) / len(result.records)
print(f"productive rate: {prod_rate:.2%}")   # ~85-95% typical

productive_only = [r for r in result.records if r["productive"]]
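A global filter leaves lineages with uneven sizes, so it can help to look at productivity per clone before filtering. This is a minimal sketch on plain dicts; productive_by_clone and keep_clones_with are hypothetical helpers (not part of GenAIRR), and the toy records stand in for result.records.

```python
from collections import defaultdict

def productive_by_clone(records):
    """Map clone_id -> (productive members, total members)."""
    tally = defaultdict(lambda: [0, 0])
    for r in records:
        tally[r["clone_id"]][0] += bool(r["productive"])
        tally[r["clone_id"]][1] += 1
    return {cid: tuple(t) for cid, t in tally.items()}

def keep_clones_with(records, min_productive):
    """Keep productive records from clones that retain at least min_productive members."""
    counts = productive_by_clone(records)
    return [
        r for r in records
        if r["productive"] and counts[r["clone_id"]][0] >= min_productive
    ]

# toy stand-in for result.records: clone 0 keeps 2 of 3, clone 1 keeps 1 of 3
flags = {0: [True, True, False], 1: [True, False, False]}
toy_records = [{"clone_id": c, "productive": p} for c, ps in flags.items() for p in ps]

per_clone = productive_by_clone(toy_records)
kept = keep_clones_with(toy_records, min_productive=2)
print(per_clone)  # {0: (2, 3), 1: (1, 3)}
print(len(kept))  # 2
```

The min_productive threshold is a design choice for the analysis, not something the engine enforces; lineage methods that need several members per clone usually want it above 1.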
Related recipes

Where to next.

C · 03 · Replay with one knob changed →

The lock-and-replay pattern — pin V/D/J via .using(...) to get the clonal-family spirit without the fork.

A · 02 · Sample only productive sequences →

Background on the respect=ga.productive() bundle that composes with clonal forks.