A simulation can be technically correct (truth columns consistent, contracts respected) and biologically unrealistic at the same time — V usage flat, CDR3 lengths the wrong shape, mutation hotspots in the wrong places. This recipe runs four standard marginal checks against an empirical reference panel so you find that gap before you publish from it.
A simulator's job is not to be indistinguishable from the wet lab — it can't be. But the marginal distributions of fields that downstream analyses depend on should align with empirical observations of the same locus. Four marginals catch ~90% of the issues you'd actually want to know about.
Empirical V usage is famously skewed — IGHV3-23, IGHV4-39, IGHV4-59 dominate. A uniform distribution from your sim is a red flag; reweight with the empirical frequencies (see C · 02).
Empirical CDR3 lengths are roughly log-normal, centred ~15 amino acids. NP-length priors that produce a flat or bimodal distribution distort downstream diversity metrics.
Codon usage in the NP regions reflects TdT activity. Deviations (over-representation of G/C, or hard symmetry between NP1 and NP2) indicate a mis-calibrated NP sampler.
Without contracts, the natural productive rate is ~30–50%. If your sim sits outside that band, either contracts are silently on or your junction-frame sampling is biased.
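The CDR3 shape expectation above can be tested directly by fitting a log-normal to the length distribution. A minimal sketch with scipy — the lengths here are a synthetic stand-in for `empirical["cdr3_aa"].str.len()`:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)
# synthetic stand-in for empirical CDR3 lengths: log-normal around ~15 aa
lengths = rng.lognormal(mean=np.log(15), sigma=0.25, size=5000)

# floc=0 pins the location so the fit stays a pure two-parameter log-normal
shape, loc, scale = lognorm.fit(lengths, floc=0)
print(f"fitted median ~{scale:.1f} aa, shape (sigma) ~{shape:.2f}")
```

A simulated panel whose fitted median drifts far from ~15 aa, or whose shape parameter collapses toward 0 (too narrow) or balloons (bimodal-ish tails), points at a mis-tuned NP-length prior.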
Generate a simulation panel of the same shape as your reference (same locus, same size). Pull both into pandas, compute the four marginals, and compare.
import GenAIRR as ga
import pandas as pd

empirical = pd.read_csv("oas_subset.tsv", sep="\t")

sim = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(model="s5f", count=(0, 30))
    .run_records(
        n=len(empirical),
        seed=42,
        expose_provenance=True,
    )
).to_dataframe()
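Before computing the marginals, it is cheap to assert that both panels expose the columns the checks below rely on — the sim panel carries truth-prefixed annotation columns, the empirical panel does not. A sketch with one-row stand-in frames (column names follow the ones used in this recipe):

```python
import pandas as pd

# one-row stand-ins for the empirical and simulated panels loaded above
empirical = pd.DataFrame({"v_call": ["IGHV3-23*01"], "cdr3_aa": ["CARDYW"], "productive": [True]})
sim = pd.DataFrame({"truth_v_call": ["IGHV3-23*01"], "cdr3_aa": ["CARDYW"], "productive": [True]})

shared = ["cdr3_aa", "productive"]
missing = [c for c in shared if c not in empirical.columns or c not in sim.columns]
assert not missing, f"missing shared columns: {missing}"
# the sim panel exposes truth-prefixed annotations; the empirical panel does not
assert "v_call" in empirical.columns and "truth_v_call" in sim.columns
print("column check passed")
```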
Each check is a one-liner. Compute it on both panels, then compare with KL divergence, KS statistic, or just an eyeball overlay.
def v_gene(s):
    """Strip the allele suffix: IGHV3-23*01 -> IGHV3-23."""
    return s.str.split("*").str[0]

emp_usage = v_gene(empirical["v_call"]).value_counts(normalize=True)
sim_usage = v_gene(sim["truth_v_call"]).value_counts(normalize=True)

# fill_value=0 keeps genes that appear in only one panel in the comparison
diff = emp_usage.sub(sim_usage, fill_value=0).abs().nlargest(10)
print("largest V-gene mismatches:")
print(diff)
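The absolute-difference ranking above flags the biggest gaps; the KL divergence mentioned earlier compresses the whole comparison into one scalar. A minimal helper with add-ε smoothing, so categories present in only one panel don't produce infinities (the two Series below are illustrative stand-ins for the usage tables):

```python
import numpy as np
import pandas as pd

def kl_divergence(p: pd.Series, q: pd.Series, eps: float = 1e-9) -> float:
    """KL(p || q) over the union of categories, with add-eps smoothing."""
    idx = p.index.union(q.index)
    p = p.reindex(idx, fill_value=0.0) + eps
    q = q.reindex(idx, fill_value=0.0) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# illustrative stand-ins for the empirical and simulated usage tables
emp = pd.Series({"IGHV3-23": 0.5, "IGHV4-39": 0.3, "IGHV4-59": 0.2})
uni = pd.Series({"IGHV3-23": 1 / 3, "IGHV4-39": 1 / 3, "IGHV4-59": 1 / 3})
print(f"KL(empirical || uniform) = {kl_divergence(emp, uni):.3f}")
```

Note that KL is asymmetric: KL(empirical || sim) penalises categories the sim under-produces, which is usually the direction you care about.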
from scipy.stats import ks_2samp

# CDR3 length
ks = ks_2samp(empirical["cdr3_aa"].str.len(),
              sim["cdr3_aa"].str.len())
print("CDR3 length KS:", ks.statistic, ks.pvalue)

# productive rate
print("empirical productive:", empirical["productive"].mean())
print("sim productive:", sim["productive"].mean())
# junction codon usage (top 10 trinucleotides)
from collections import Counter

def trinuc_freq(s):
    counts = Counter()
    for j in s.dropna():
        for i in range(0, len(j) - 2, 3):  # step in codon frame
            counts[j[i:i+3]] += 1
    return pd.Series(counts).sort_values(ascending=False)

# assumes an AIRR-style `junction` nucleotide column in both panels
print(trinuc_freq(empirical["junction"]).head(10))
print(trinuc_freq(sim["junction"]).head(10))
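A coarser companion to the codon table is the pooled G/C fraction of the junctions, since G/C over-representation was flagged above as a TdT mis-calibration signal. A sketch — the `junction` nucleotide column is an AIRR-convention assumption, and the two sequences below are illustrative stand-ins:

```python
import pandas as pd

def gc_fraction(junctions: pd.Series) -> float:
    """Pooled fraction of G/C bases across all junction sequences."""
    seqs = junctions.dropna().str.upper()
    total = int(seqs.str.len().sum())
    gc = int((seqs.str.count("G") + seqs.str.count("C")).sum())
    return gc / total if total else float("nan")

# illustrative stand-in for empirical["junction"]
emp_junctions = pd.Series(["TGTGCGAGAGAT", "TGTGCGAAAGGG"])
print(f"GC fraction: {gc_fraction(emp_junctions):.2f}")
```

Compare the scalar between panels before digging into the full trinucleotide table; a large gap here already tells you the NP sampler's base composition is off.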
Each marginal points at a specific knob in the simulation. Mismatch ≠ broken sim — it means your sim doesn't yet model whatever shaped the empirical distribution.
Recipe C · 02 shows how to feed a target frequency table to the V-allele sampler.
Pass np1_lengths= / np2_lengths= to recombine()
with a list of (length, weight) tuples matching your empirical histogram.
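Building that (length, weight) table from data is a two-liner; a sketch assuming the empirical panel carries an AIRR-style `np1_length` column (the short Series below stands in for it):

```python
import pandas as pd

# stand-in for empirical["np1_length"]
np1 = pd.Series([3, 5, 5, 7, 5, 3])

# normalized histogram -> list of (length, weight) tuples
hist = np1.value_counts(normalize=True).sort_index()
np1_lengths = list(zip(hist.index.tolist(), hist.values.tolist()))
print(np1_lengths)
```

The same two lines with `np2_length` produce the second table.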
Check that you're not passing respect=ga.productive() unintentionally. The
natural rate without contracts is ~30–50%; anything near 99% means a contract is on.
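That band can run as a regression check so a silently-enabled contract fails loudly. A minimal sketch (the 30–50% band is the one quoted above; the short Series stands in for the sim panel's productive column):

```python
import pandas as pd

def check_productive_band(productive: pd.Series, lo: float = 0.30, hi: float = 0.50) -> float:
    """Assert the panel's productive rate sits in the natural (no-contract) band."""
    rate = float(productive.mean())
    assert lo <= rate <= hi, (
        f"productive rate {rate:.2%} outside [{lo:.0%}, {hi:.0%}]; "
        "is respect=ga.productive() on, or is junction-frame sampling biased?"
    )
    return rate

# illustrative stand-in for sim["productive"]
rate = check_productive_band(pd.Series([True, False, False, True, False]))
print(f"sim productive rate: {rate:.2%}")
```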