By default the V allele sampler draws uniformly from the reference pool. Real repertoires
are not uniform — IGHV3-23, IGHV4-39, and IGHV4-59 routinely make up a quarter of the
sample. Pass a frequency table to recombine() and the sampler reweights to
match your panel, an OAS subset, or any AIRR TSV you can read into pandas.
The v_allele_weights argument is a dictionary from allele name (or family name) to a non-negative relative weight. The sampler normalises the weights, so you don't need probabilities that sum to one. Alleles not in the dict default to weight 1.0; set an allele's weight to 0 to exclude it.
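Concretely, the weighting semantics just described can be sketched in a few lines. This is an illustrative helper, not GenAIRR's internal implementation:

```python
def weights_to_probs(alleles, weights=None):
    """Sketch of the weighting rule: alleles missing from the dict get
    weight 1.0, zero-weight alleles are excluded, and the remaining
    weights are normalised into sampling probabilities."""
    weights = weights or {}
    w = {a: weights.get(a, 1.0) for a in alleles}
    w = {a: x for a, x in w.items() if x > 0}  # drop excluded alleles
    total = sum(w.values())
    return {a: x / total for a, x in w.items()}

pool = ["IGHV3-23*01", "IGHV4-39*01", "IGHV1-2*02"]
probs = weights_to_probs(pool, {"IGHV3-23*01": 450, "IGHV1-2*02": 0})
# IGHV1-2*02 is excluded; IGHV4-39*01 keeps the default weight 1.0
```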
import GenAIRR as ga
# weights are relative; you can use raw counts
v_weights = {
    "IGHV3-23*01": 450,
    "IGHV4-39*01": 280,
    "IGHV4-59*02": 210,
    "IGHV1-69*06": 150,
    # everything else uses default weight 1.0
}
result = (
    ga.Experiment.on("human_igh")
    .recombine(v_allele_weights=v_weights)
    .run(n=10_000, seed=42)
)
If you have a reference panel (your own data, an OAS pull, the example from any aligner
benchmark), you can derive the weights directly from it. You can weight at two
granularities: by allele (IGHV3-23*01) or by family (IGHV3-23). Family-level
weighting is more forgiving of small reference panels.
import pandas as pd
panel = pd.read_csv("my_repertoire.tsv", sep="\t")
# allele-grain (use this if your reference panel is large)
v_weights = panel["v_call"].value_counts().to_dict()
# family-grain (smoother — drop the *NN suffix)
fam_counts = (
    panel["v_call"].str.split("*").str[0]
    .value_counts().to_dict()
)
# the sampler will match any allele whose family is in the dict
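If you want to be explicit about what that family-level matching means, here is a sketch of the rule in plain Python. The helper is hypothetical, written only to mirror the behaviour described above:

```python
def expand_family_weights(alleles, fam_weights):
    """Hypothetical helper mirroring the matching rule above: each
    allele inherits its family's weight, and alleles whose family is
    absent from the dict keep the default weight 1.0."""
    out = {}
    for allele in alleles:
        family = allele.split("*")[0]  # drop the *NN allele suffix
        out[allele] = fam_weights.get(family, 1.0)
    return out

pool = ["IGHV3-23*01", "IGHV3-23*04", "IGHV1-2*02"]
expanded = expand_family_weights(pool, {"IGHV3-23": 730})
# → {'IGHV3-23*01': 730, 'IGHV3-23*04': 730, 'IGHV1-2*02': 1.0}
```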
The same mechanism applies to the D and J pools. The D pool is small enough that an even distribution is usually fine. The J pool is small too, but consider weighting it explicitly if your reference repertoire shows any asymmetry; IGHJ4 and IGHJ6 often dominate.
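The J (and D) weights are derived exactly like the V weights, from the j_call and d_call columns. A sketch, using a toy DataFrame as a stand-in for your real reference panel (the data is hypothetical):

```python
import pandas as pd

# toy stand-in for a real reference panel's j_call column
toy_panel = pd.DataFrame({
    "j_call": ["IGHJ4*02", "IGHJ4*02", "IGHJ6*03",
               "IGHJ4*02", "IGHJ6*03", "IGHJ3*01"],
})
# raw counts work as weights; the sampler normalises them
j_weights = toy_panel["j_call"].value_counts().to_dict()
# → {'IGHJ4*02': 3, 'IGHJ6*03': 2, 'IGHJ3*01': 1}
```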
# d_weights and j_weights are built the same way as v_weights above
result = (
    ga.Experiment.on("human_igh")
    .recombine(
        v_allele_weights=v_weights,
        d_allele_weights=d_weights,
        j_allele_weights=j_weights,
    )
    .run(n=10_000, seed=42)
)
After a run, compare the truth_v_call distribution of the simulated sequences to your
reference panel. Compute a KL divergence or eyeball a bar chart; the two distributions should be close.
import numpy as np
df = result.to_dataframe()
emp = panel["v_call"].value_counts(normalize=True)
sim = df["truth_v_call"].value_counts(normalize=True)
shared = emp.index.intersection(sim.index)
e, s = emp[shared].to_numpy(), sim[shared].to_numpy()
e, s = e / e.sum(), s / s.sum()  # renormalise over the shared support
kl = (e * np.log(e / s)).sum()
print(f"KL(empirical || simulated): {kl:.4f}")
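Restricting the comparison to the shared support silently drops alleles that appear in only one of the two distributions. If that matters for your panel, a small pseudocount over the union of alleles keeps every category in play. A sketch, not part of GenAIRR:

```python
import numpy as np
import pandas as pd

def smoothed_kl(p_counts, q_counts, eps=1e-6):
    """KL(p || q) over the union of categories, with a pseudocount so
    categories absent from one side don't have to be discarded."""
    idx = p_counts.index.union(q_counts.index)
    p = p_counts.reindex(idx, fill_value=0).to_numpy(dtype=float) + eps
    q = q_counts.reindex(idx, fill_value=0).to_numpy(dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()  # renormalise after smoothing
    return float((p * np.log(p / q)).sum())

a = pd.Series({"IGHV3-23*01": 450, "IGHV4-39*01": 280})
b = pd.Series({"IGHV3-23*01": 430, "IGHV4-59*02": 10})
smoothed_kl(a, a)  # → 0.0 (identical distributions)
```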