
Clonal repertoires (TCR & abundance)

Where clonal_lineage grows BCR affinity-maturation trees, clonal_repertoire builds a non-tree clonal repertoire: each clone is one rearrangement proliferated to a clone size drawn from a heavy-tailed distribution, and those copies are emitted as reads through the library-prep / sequencing passes. Identical reads collapse into AIRR records carrying the AIRR-standard duplicate_count. It is the model for TCR repertoires (T cells don't somatically hypermutate) and for flat BCR clonal abundance — the modern replacement for the deprecated expand_clones, with realistic clone sizes instead of a fixed per-clone count.

What it is & when to use it

A real repertoire is a population of clones with wildly uneven sizes: a few huge expanded clones and a long tail of singletons. clonal_repertoire reproduces that structure. For each of n_clones clones it:

  1. runs the pre-fork plan (recombine()) once to fix the clone's V/D/J + trim + NP backbone — the single rearrangement that defines the clone;
  2. draws a size from a heavy-tailed clone-size distribution (rounded power-law / Zipf-like by default, log-normal optional), with a controllable unexpanded-singleton fraction;
  3. emits that many reads through the post-fork library-prep / sequencing passes, so reads diverge only by technical noise;
  4. genotype-collapses identical reads into AIRR records, each carrying a clone_id (ground-truth clone label) and a duplicate_count (abundance).
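The four steps above can be sketched in plain Python. This is a toy model under stated assumptions, not GenAIRR's implementation: the "rearrangement" is a random 30-mer, the size is an unbounded Pareto draw that is rounded and clamped, and the only noise is per-base substitution. The function name `toy_clonal_repertoire` and its parameters are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def toy_clonal_repertoire(n_clones, max_size=50, error_rate=0.01, seed=0):
    """Toy model: one founder per clone, `size` noisy copies, then collapse."""
    rng = random.Random(seed)
    records = []
    for clone_id in range(n_clones):
        # Step 1: one "rearrangement" per clone (here just a random 30-mer).
        founder = "".join(rng.choice(BASES) for _ in range(30))
        # Step 2: heavy-tailed size; paretovariate(1.0) has density ~ x^-2.
        size = min(max_size, max(1, round(rng.paretovariate(1.0))))
        # Step 3: emit `size` reads that differ only by substitution noise.
        reads = [
            "".join(
                rng.choice(BASES.replace(b, "")) if rng.random() < error_rate else b
                for b in founder
            )
            for _ in range(size)
        ]
        # Step 4: collapse identical reads into records with duplicate_count.
        for sequence, count in Counter(reads).items():
            records.append({"clone_id": clone_id, "sequence": sequence,
                            "duplicate_count": count})
    return records

for rec in toy_clonal_repertoire(5, seed=0)[:5]:
    print(rec["clone_id"], rec["duplicate_count"], rec["sequence"])
```

With `error_rate=0.0` every clone collapses to exactly one record, which mirrors the no-post-fork-pass behaviour described later on this page.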

Reach for it when you want a repertoire whose ground truth is clone membership + abundance — the input clone-callers and abundance-aware tools expect — rather than a per-clone mutation genealogy.

How it compares

| | What it models | Ground truth | Loci |
|---|---|---|---|
| clonal_repertoire | Non-tree clonal abundance; one rearrangement × N copies + technical noise | clone_id + duplicate_count | TCR and flat BCR |
| clonal_lineage | BCR affinity-maturation trees (per-division SHM, selection) | Lineage tree + per-cell records | BCR only |
| expand_clones (deprecated) | Star: fixed per_clone count, no size distribution | clone_id + parent_id | BCR / TCR |

clonal_repertoire is the modern replacement for flat clonal expansion: instead of expand_clones' fixed per_clone count, every clone draws a realistic heavy-tailed size. For BCR lineage trees (genealogy, ancestral sequences, selection), use clonal_lineage instead.

The biology

A T-cell clone is the progeny of one rearranged T cell proliferated to many identical copies. T cells do not somatically hypermutate, so all the sequence diversity you observe within a TCR clone is technical (PCR and sequencing error introduced during library prep), not biological. That is exactly the shape clonal_repertoire produces: one rearrangement per clone, copied size times, with divergence only through the post-fork sequencing passes.

Clone sizes are empirically heavy-tailed: TCR clone-size distributions are approximately power-law (a handful of enormous clones, a long tail of singletons). The default size_distribution="power_law" (with the default exponent=2.0) captures that, and unexpanded_fraction sets the share of clones forced to be never-expanded singletons (size 1), which model the resting naive cells that were never clonally expanded.

For BCR the same flat model is useful when you want clonal abundance without a genealogy. SHM is optional and applied flat across the clone's copies via a post-fork .mutate() (see below) — there is no mutation tree.

Quick start (TCR)

import GenAIRR as ga

result = (ga.Experiment.on("human_tcrb").allow_curatable_refdata().recombine()
          .clonal_repertoire(n_clones=200, size_distribution="power_law",
                             exponent=2.0, max_size=500, unexpanded_fraction=0.5)
          .sequencing_errors(rate=0.005)         # per-read technical noise
          .run_records(seed=0))

# Each record carries a ground-truth clone label + an abundance count:
for rec in result.records[:5]:
    print(rec["clone_id"], rec["duplicate_count"], rec["v_call"], rec["j_call"])

# T cells don't hypermutate — every record has zero SHM:
assert all(rec.get("n_mutations", 0) == 0 for rec in result.records)

allow_curatable_refdata() is the usual opt-in for sampling from the bundled TCR catalogue (which includes pseudogene/ORF alleles). sequencing_errors(rate=...) is the per-read technical-noise pass — rate is a per-base error probability in [0, 1] (drawn as count ~ Poisson(rate × read_len) per read). pcr_amplify(rate=...) has the same shape; any of the post-fork library-prep passes (sequencing_errors, pcr_amplify, polymerase_indels, end_loss_*, ambiguous_base_calls, random_strand_orientation) can follow the fork.
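To make the `count ~ Poisson(rate × read_len)` shape concrete, here is a minimal standalone sketch. It assumes substitution-only errors at distinct positions, which is a simplification chosen for illustration; the helper names `poisson` and `add_sequencing_errors` are hypothetical and are not part of GenAIRR.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def add_sequencing_errors(read, rate, rng):
    """Substitute count ~ Poisson(rate * len(read)) distinct positions."""
    # Assumption: errors are substitutions only, capped at the read length.
    n_errors = min(len(read), poisson(rate * len(read), rng))
    bases = list(read)
    for pos in rng.sample(range(len(read)), n_errors):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)

rng = random.Random(0)
read = "ACGT" * 100                      # 400 bases
noisy = add_sequencing_errors(read, rate=0.005, rng=rng)
print(sum(a != b for a, b in zip(read, noisy)), "substitutions")
```

At `rate=0.005` a 400-base read carries two errors on average, so most copies of a large clone are no longer identical to the founder and collapse into several low-count records alongside the dominant one.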

Quick start (flat BCR with SHM)

The same model works for BCR. Optionally add a post-fork .mutate() to apply flat SHM independently to each copy (this is not a lineage tree — it's per-read mutation off the shared founder):

import GenAIRR as ga

result = (ga.Experiment.on("human_igh").recombine()
          .clonal_repertoire(n_clones=100, max_size=300, unexpanded_fraction=0.3)
          .mutate(model="s5f", rate=0.01)        # flat SHM on each copy
          .sequencing_errors(rate=0.005)
          .run_records(seed=0))

With no post-fork passes at all, every copy of a clone is identical, so the clone collapses to a single record whose duplicate_count equals the drawn size:

import GenAIRR as ga
from collections import Counter

# No post-fork passes: every copy of a clone is identical.
r = (ga.Experiment.on("human_igh").recombine()
     .clonal_repertoire(n_clones=10, max_size=100)
     .run_records(seed=1))

per_clone = Counter(rec["clone_id"] for rec in r.records)
assert all(c == 1 for c in per_clone.values())     # one record per clone
assert all(rec["duplicate_count"] >= 1 for rec in r.records)

mutate is BCR-only. A post-fork .mutate() on a TCR experiment is rejected by mutate's own TCR guard — T cells don't hypermutate. On TCR, leave SHM out; the post-fork sequencing passes provide all within-clone variation.

How it works

flowchart TB
    A["For each clone c = 0..n_clones-1:<br/>recombine() once → founder rearrangement"] --> B["Draw size_c from the clone-size distribution<br/>(deterministic for seed)"]
    B --> C{"Post-fork passes?"}
    C -->|"no"| D["1 record, duplicate_count = size_c"]
    C -->|"yes"| E["Emit size_c reads through the<br/>library-prep / sequencing passes<br/>(each read its own seed)"]
    E --> F["Collapse identical sequences →<br/>records with duplicate_count"]
    D --> G["Stamp clone_id = c"]
    F --> G

Per clone, the engine runs the founder recombination once, draws the clone's size, then plays that many reads through the post-fork plan (each with its own derived seed) and collapses them by emitted sequence. clone_id is the ground-truth clone index; duplicate_count is the post-collapse abundance.

Determinism. Everything is keyed on seed: clone sizes come from a seeded draw, clone c recombines from seed + c × 1_000_000, and each read within a clone draws from a derived sub-seed. Re-running with the same seed reproduces the records and duplicate_counts byte-for-byte.
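The seeding scheme can be illustrated with a small standalone sketch. The clone stride of 1_000_000 comes from the paragraph above; the per-read sub-seed derivation is not documented here, so `read_seed` below is an assumption made purely for illustration, and all function names are hypothetical.

```python
import random

CLONE_STRIDE = 1_000_000  # from the text: clone c recombines from seed + c * 1_000_000

def clone_seed(seed, c):
    return seed + c * CLONE_STRIDE

def read_seed(seed, c, r):
    # ASSUMPTION: the real sub-seed derivation is not documented on this page.
    # Offsetting by the read index is one simple choice that stays
    # collision-free while a clone has fewer than CLONE_STRIDE reads.
    return clone_seed(seed, c) + 1 + r

def toy_clone(seed, c, n_reads, error_rate=0.05):
    founder_rng = random.Random(clone_seed(seed, c))
    founder = "".join(founder_rng.choice("ACGT") for _ in range(20))
    reads = []
    for r in range(n_reads):
        rng = random.Random(read_seed(seed, c, r))  # each read has its own stream
        reads.append("".join(
            rng.choice("ACGT") if rng.random() < error_rate else base
            for base in founder
        ))
    return founder, reads

# Same seed, same clone index: identical founder and identical reads.
assert toy_clone(0, 3, 5) == toy_clone(0, 3, 5)
```

Keying each clone and each read on its own seed means a clone's output does not depend on how many reads earlier clones consumed, which is one plausible reason for a per-clone stride.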

Read-count cost. When post-fork passes are present, the total reads simulated is roughly the sum of the drawn clone sizes (before collapse). A heavy tail with a large max_size can therefore blow up runtime — keep max_size modest (the default is 1000) and remember a few clones near max_size dominate the cost.
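A rough way to budget runtime before launching a run is to simulate the size draw alone and sum it. The sketch below assumes sizes are an unbounded Pareto draw (density proportional to x^-exponent) that is rounded and then clamped to `max_size`; the simulator's exact draw may differ, and `estimate_total_reads` is a hypothetical helper.

```python
import random

def estimate_total_reads(n_clones, exponent=2.0, max_size=1000, trials=200, seed=0):
    """Monte Carlo mean of sum(sizes) for rounded, clamped Pareto sizes."""
    if exponent <= 1.0:
        raise ValueError("this sketch needs exponent > 1")
    rng = random.Random(seed)
    shape = exponent - 1.0  # paretovariate(shape) has density ~ x^-(shape + 1)
    totals = []
    for _ in range(trials):
        sizes = [min(max_size, max(1, round(rng.paretovariate(shape))))
                 for _ in range(n_clones)]
        totals.append(sum(sizes))
    return sum(totals) / trials

for cap in (100, 1000, 10000):
    print(cap, round(estimate_total_reads(200, max_size=cap)))
```

Because the same seed produces the same underlying draws, the estimate can only grow as `max_size` grows, which makes the effect of the clamp on read count easy to compare across settings.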

Parameters

| Parameter | Default | Meaning |
|---|---|---|
| n_clones | | Number of clones to simulate (positive int). Sets the number of distinct clone_ids. |
| size_distribution | "power_law" | Clone-size law: "power_law" (Zipf-like, heavy-tailed) or "lognormal". |
| exponent | 2.0 | Power-law exponent (> 0), used when size_distribution="power_law". Higher ⇒ steeper tail / more singletons; ~2–3 is typical for TCR. |
| mu | 1.0 | Log-normal location parameter, used when size_distribution="lognormal". |
| sigma | 1.0 | Log-normal scale (>= 0), used when size_distribution="lognormal". Larger ⇒ heavier tail. |
| max_size | 1000 | Upper clamp on any clone size. Bounds runtime (total reads ≈ Σ sizes when post-fork passes are present), so keep it modest. |
| unexpanded_fraction | 0.0 | Fraction of clones forced to size 1 (never-expanded singletons), in [0, 1]. The forced count is round(n_clones × fraction). |

Singletons come from two places. The power-law size is drawn by a continuous inverse-CDF and then rounded to an integer in [1, max_size], so even with unexpanded_fraction=0 a large share of clones round to size 1 (for exponent=2 the singleton share is ~1/3). unexpanded_fraction adds a forced floor of size-1 clones on top of that natural tail, so raise it when you want an even larger never-expanded population. Because the power-law is a continuous draw that is rounded — not an exact discrete Zipf PMF — the realized distribution is an approximation of true discrete Zipf.
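The ~1/3 figure can be checked with a short sketch. It assumes the continuous draw has density proportional to x^-exponent on [1, ∞) and is rounded to the nearest integer, so a clone is a singleton exactly when the draw falls below 1.5. The function names are hypothetical, and the forced-singleton handling follows the round(n_clones × fraction) rule from the table.

```python
import random

def natural_singleton_share(exponent):
    """P(round(x) == 1) = P(x < 1.5) for density ~ x^-exponent on [1, inf)."""
    return 1.0 - 1.5 ** (1.0 - exponent)

def draw_sizes(n_clones, exponent=2.0, max_size=1000, unexpanded_fraction=0.0, seed=0):
    rng = random.Random(seed)
    n_forced = round(n_clones * unexpanded_fraction)   # forced never-expanded clones
    sizes = [1] * n_forced
    for _ in range(n_clones - n_forced):
        x = rng.paretovariate(exponent - 1.0)          # continuous inverse-CDF draw
        sizes.append(min(max_size, max(1, round(x))))  # round, then clamp
    return sizes

sizes = draw_sizes(100_000, exponent=2.0)
print("analytic :", natural_singleton_share(2.0))
print("simulated:", sizes.count(1) / len(sizes))

# Forced singletons stack on top of the natural tail.
mixed = draw_sizes(100_000, exponent=2.0, unexpanded_fraction=0.5)
print("with unexpanded_fraction=0.5:", mixed.count(1) / len(mixed))
```

Under these assumptions the overall singleton share is roughly f + (1 − f) × natural share, where f is `unexpanded_fraction`, so f=0.5 at exponent=2 gives about two thirds singletons.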

Ground truth & tooling

Each record carries the two ground-truth fields the downstream ecosystem consumes:

  • clone_id — the planted clone membership label (which rearrangement this read descends from).
  • duplicate_count — the abundance of the collapsed record (AIRR-standard field name).
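When post-fork passes split a clone into several records, the two fields combine to recover per-clone abundance: sum duplicate_count over each clone_id. The sketch below uses hand-written stand-in records rather than simulator output, and `clone_abundance` is a hypothetical helper.

```python
from collections import defaultdict

def clone_abundance(records):
    """Sum duplicate_count over each clone's collapsed records."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["clone_id"]] += rec["duplicate_count"]
    return dict(totals)

# Hand-written stand-in records (not simulator output).
records = [
    {"clone_id": 0, "duplicate_count": 40},
    {"clone_id": 0, "duplicate_count": 2},   # an error variant of clone 0
    {"clone_id": 1, "duplicate_count": 1},
]
abundance = clone_abundance(records)
print(abundance)  # {0: 42, 1: 1}

total = sum(abundance.values())
frequencies = {cid: n / total for cid, n in abundance.items()}
```

Assuming no pass discards reads, each clone's summed duplicate_count should equal the size drawn for it.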

Write the records to an AIRR TSV (keeping the ground-truth label under a non-AIRR column name so it doesn't collide with a tool's inferred clone_id):

import pandas as pd
df = pd.DataFrame(result.records).rename(columns={"clone_id": "true_clone_id"})
df.to_csv("repertoire.tsv", sep="\t", index=False)

The output format is what clone-callers and abundance-aware tools consume: Change-O DefineClones and SCOPer read AIRR TSV with duplicate_count, and TCR tools like tcrdist read V/J + junction. You can score a tool's inferred clusters against the planted true_clone_id (e.g. with sklearn.metrics.adjusted_rand_score). The records are format-compatible with these tools; this page does not claim the tools were validated against clonal_repertoire output.
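For readers who want to see what that score measures, here is a dependency-free sketch of the adjusted Rand index, the quantity that sklearn.metrics.adjusted_rand_score computes. The label lists are made-up examples, and each record is counted once (duplicate_count is ignored), which is a simplification.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same records."""
    n = len(true_labels)
    if n != len(pred_labels) or n < 2:
        raise ValueError("need two equally long labelings with at least 2 items")
    pair_counts = Counter(zip(true_labels, pred_labels))   # contingency table
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:          # degenerate case, e.g. all singletons
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

true_clone_id = [0, 0, 1, 1]
inferred = ["a", "a", "b", "b"]        # same partition, different label names
print(adjusted_rand_index(true_clone_id, inferred))  # 1.0
```

The score depends only on which records are grouped together, not on the label values, so a tool's arbitrary cluster names can be compared directly against true_clone_id.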

Limitations

  • No mutation tree. clonal_repertoire is a flat, non-tree model — there are no ancestral nodes, generations, or selection. For a BCR genealogy (Newick / FASTA / node table, affinity maturation), use clonal_lineage. A post-fork .mutate() on BCR applies flat SHM per copy, not a lineage.
  • Power-law is continuous-rounded, not exact discrete Zipf. Sizes come from a continuous inverse-CDF rounded to integers; this approximates true discrete Zipf rather than matching its PMF exactly.
  • Cost scales with Σ sizes. With post-fork passes present, the simulator emits roughly the sum of the drawn sizes in reads before collapse. Large clones combined with heavy post-fork passes mean many reads — keep max_size modest.
  • One fork per pipeline. clonal_repertoire, expand_clones, and clonal_lineage are mutually exclusive in a single chain; descendant-phase steps (e.g. .mutate() and the library-prep / sequencing passes) must come after the fork.