A simulation can be technically correct (truth columns consistent, contracts respected) and biologically unrealistic at the same time — V usage flat, CDR3 lengths the wrong shape, mutation hotspots in the wrong places. This recipe runs four standard marginal checks against an empirical reference panel so you find that gap before you publish from it.
A simulator's job is not to be indistinguishable from the wet lab — it can't be. But the marginal distributions of fields that downstream analyses depend on should align with empirical observations of the same locus. Four marginals catch ~90% of the issues you'd actually want to know about.
Empirical V usage is famously skewed — IGHV3-23, IGHV4-39, IGHV4-59 dominate. A uniform distribution from your sim is a red flag; reweight with the empirical frequencies (see C · 02).
Empirical CDR3 lengths are roughly log-normal, centred ~15 amino acids. NP-length priors that produce a flat or bimodal distribution distort downstream diversity metrics.
Codon usage in the NP regions reflects TdT activity. Deviations (over-representation of G/C, or hard symmetry between NP1 and NP2) indicate a mis-calibrated NP sampler.
Without contracts, the natural productive rate is ~30–50%. If your sim sits outside that band, either contracts are silently on or your junction-frame sampling is biased.
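The CDR3 shape expectation above can be tested directly by fitting a log-normal to the length distribution. A minimal sketch with scipy — the lengths here are a synthetic stand-in for `empirical["cdr3_aa"].str.len()`:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)
# synthetic stand-in for empirical CDR3 lengths: log-normal around ~15 aa
lengths = rng.lognormal(mean=np.log(15), sigma=0.25, size=5000)

# floc=0 pins the location so the fit stays a pure two-parameter log-normal
shape, loc, scale = lognorm.fit(lengths, floc=0)
print(f"fitted median ~{scale:.1f} aa, shape (sigma) ~{shape:.2f}")
```

A simulated panel whose fitted median drifts far from ~15 aa, or whose shape parameter collapses toward 0 (too narrow) or balloons (bimodal-ish tails), points at a mis-tuned NP-length prior.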
Generate a simulation panel of the same shape as your reference (same locus, same size). Pull both into pandas, compute the four marginals, and compare.
import GenAIRR as ga
import pandas as pd

empirical = pd.read_csv("oas_subset.tsv", sep="\t")

sim = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(model="s5f", count=(0, 30))
    .run_records(
        n=len(empirical),
        seed=42,
        expose_provenance=True,
    )
).to_dataframe()
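Before computing the marginals, it is cheap to assert that both panels expose the columns the checks below rely on — the sim panel carries truth-prefixed annotation columns, the empirical panel does not. A sketch with one-row stand-in frames (column names follow the ones used in this recipe):

```python
import pandas as pd

# one-row stand-ins for the empirical and simulated panels loaded above
empirical = pd.DataFrame({"v_call": ["IGHV3-23*01"], "cdr3_aa": ["CARDYW"], "productive": [True]})
sim = pd.DataFrame({"truth_v_call": ["IGHV3-23*01"], "cdr3_aa": ["CARDYW"], "productive": [True]})

shared = ["cdr3_aa", "productive"]
missing = [c for c in shared if c not in empirical.columns or c not in sim.columns]
assert not missing, f"missing shared columns: {missing}"
# the sim panel exposes truth-prefixed annotations; the empirical panel does not
assert "v_call" in empirical.columns and "truth_v_call" in sim.columns
print("column check passed")
```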
Each check is a one-liner. Compute it on both panels, then compare with KL divergence, KS statistic, or just an eyeball overlay.
def v_gene(s):
    """Strip the allele suffix: IGHV3-23*01 -> IGHV3-23."""
    return s.str.split("*").str[0]

emp_usage = v_gene(empirical["v_call"]).value_counts(normalize=True)
sim_usage = v_gene(sim["truth_v_call"]).value_counts(normalize=True)

# fill_value=0 keeps genes that appear in only one panel in the comparison
diff = emp_usage.sub(sim_usage, fill_value=0).abs().nlargest(10)
print("largest V-gene mismatches:")
print(diff)
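The absolute-difference ranking above flags the biggest gaps; the KL divergence mentioned earlier compresses the whole comparison into one scalar. A minimal helper with add-ε smoothing, so categories present in only one panel don't produce infinities (the two Series below are illustrative stand-ins for the usage tables):

```python
import numpy as np
import pandas as pd

def kl_divergence(p: pd.Series, q: pd.Series, eps: float = 1e-9) -> float:
    """KL(p || q) over the union of categories, with add-eps smoothing."""
    idx = p.index.union(q.index)
    p = p.reindex(idx, fill_value=0.0) + eps
    q = q.reindex(idx, fill_value=0.0) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# illustrative stand-ins for the empirical and simulated usage tables
emp = pd.Series({"IGHV3-23": 0.5, "IGHV4-39": 0.3, "IGHV4-59": 0.2})
uni = pd.Series({"IGHV3-23": 1 / 3, "IGHV4-39": 1 / 3, "IGHV4-59": 1 / 3})
print(f"KL(empirical || uniform) = {kl_divergence(emp, uni):.3f}")
```

Note that KL is asymmetric: KL(empirical || sim) penalises categories the sim under-produces, which is usually the direction you care about.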
from scipy.stats import ks_2samp

# CDR3 length
ks = ks_2samp(empirical["cdr3_aa"].str.len(),
              sim["cdr3_aa"].str.len())
print("CDR3 length KS:", ks.statistic, ks.pvalue)

# productive rate
print("empirical productive:", empirical["productive"].mean())
print("sim productive:", sim["productive"].mean())
# junction codon usage (top 10 trinucleotides)
from collections import Counter

def trinuc_freq(s):
    counts = Counter()
    for j in s.dropna():
        for i in range(0, len(j) - 2, 3):  # step in codon frame
            counts[j[i:i+3]] += 1
    return pd.Series(counts).sort_values(ascending=False)

# assumes an AIRR-style `junction` nucleotide column in both panels
print(trinuc_freq(empirical["junction"]).head(10))
print(trinuc_freq(sim["junction"]).head(10))
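A coarser companion to the codon table is the pooled G/C fraction of the junctions, since G/C over-representation was flagged above as a TdT mis-calibration signal. A sketch — the `junction` nucleotide column is an AIRR-convention assumption, and the two sequences below are illustrative stand-ins:

```python
import pandas as pd

def gc_fraction(junctions: pd.Series) -> float:
    """Pooled fraction of G/C bases across all junction sequences."""
    seqs = junctions.dropna().str.upper()
    total = int(seqs.str.len().sum())
    gc = int((seqs.str.count("G") + seqs.str.count("C")).sum())
    return gc / total if total else float("nan")

# illustrative stand-in for empirical["junction"]
emp_junctions = pd.Series(["TGTGCGAGAGAT", "TGTGCGAAAGGG"])
print(f"GC fraction: {gc_fraction(emp_junctions):.2f}")
```

Compare the scalar between panels before digging into the full trinucleotide table; a large gap here already tells you the NP sampler's base composition is off.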
Each marginal points at a specific knob in the simulation. Mismatch ≠ broken sim — it means your sim doesn't yet model whatever shaped the empirical distribution.
Recipe C · 02 shows how to feed a target frequency table to the V-allele sampler.
Pass np1_lengths= / np2_lengths= to recombine()
with a list of (length, weight) tuples matching your empirical histogram.
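Building that (length, weight) table from data is a two-liner; a sketch assuming the empirical panel carries an AIRR-style `np1_length` column (the short Series below stands in for it):

```python
import pandas as pd

# stand-in for empirical["np1_length"]
np1 = pd.Series([3, 5, 5, 7, 5, 3])

# normalized histogram -> list of (length, weight) tuples
hist = np1.value_counts(normalize=True).sort_index()
np1_lengths = list(zip(hist.index.tolist(), hist.values.tolist()))
print(np1_lengths)
```

The same two lines with `np2_length` produce the second table.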
Check that you're not passing respect=ga.productive() unintentionally. The
natural rate without contracts is ~30–50%; anything near 99% means a contract is on.
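That band can run as a regression check so a silently-enabled contract fails loudly. A minimal sketch (the 30–50% band is the one quoted above; the short Series stands in for the sim panel's productive column):

```python
import pandas as pd

def check_productive_band(productive: pd.Series, lo: float = 0.30, hi: float = 0.50) -> float:
    """Assert the panel's productive rate sits in the natural (no-contract) band."""
    rate = float(productive.mean())
    assert lo <= rate <= hi, (
        f"productive rate {rate:.2%} outside [{lo:.0%}, {hi:.0%}]; "
        "is respect=ga.productive() on, or is junction-frame sampling biased?"
    )
    return rate

# illustrative stand-in for sim["productive"]
rate = check_productive_band(pd.Series([True, False, False, True, False]))
print(f"sim productive rate: {rate:.2%}")
```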