By default the V allele sampler draws uniformly from the reference pool. Real repertoires
are not uniform — IGHV3-23, IGHV4-39, and IGHV4-59 routinely make up a quarter of the
sample. Pass a frequency table to recombine() and the sampler reweights to
match your panel, an OAS subset, or any AIRR TSV you can read into pandas.
The v_allele_weights argument is a dictionary from allele name (or family name) to a non-negative relative weight. The sampler normalises the weights, so you don't need probabilities that sum to one. Alleles not in the dict default to weight 1.0; set an allele's weight to 0 to exclude it.
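Concretely, the weighting semantics just described can be sketched in a few lines. This is an illustrative helper, not GenAIRR's internal implementation:

```python
def weights_to_probs(alleles, weights=None):
    """Sketch of the weighting rule: alleles missing from the dict get
    weight 1.0, zero-weight alleles are excluded, and the remaining
    weights are normalised into sampling probabilities."""
    weights = weights or {}
    w = {a: weights.get(a, 1.0) for a in alleles}
    w = {a: x for a, x in w.items() if x > 0}  # drop excluded alleles
    total = sum(w.values())
    return {a: x / total for a, x in w.items()}

pool = ["IGHV3-23*01", "IGHV4-39*01", "IGHV1-2*02"]
probs = weights_to_probs(pool, {"IGHV3-23*01": 450, "IGHV1-2*02": 0})
# IGHV1-2*02 is excluded; IGHV4-39*01 keeps the default weight 1.0
```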
import GenAIRR as ga
# weights are relative; you can use raw counts
v_weights = {
    "IGHV3-23*01": 450,
    "IGHV4-39*01": 280,
    "IGHV4-59*02": 210,
    "IGHV1-69*06": 150,
    # everything else uses default weight 1.0
}
result = (
    ga.Experiment.on("human_igh")
    .recombine(v_allele_weights=v_weights)
    .run(n=10_000, seed=42)
)
If you have a reference panel (your own data, an OAS pull, the example from any aligner
benchmark), you can derive the weights directly from it. You can weight at two
granularities: by allele (IGHV3-23*01) or by family (IGHV3-23). Family-level
weighting is more forgiving of small reference panels.
import pandas as pd
panel = pd.read_csv("my_repertoire.tsv", sep="\t")
# allele-grain (use this if your reference panel is large)
v_weights = panel["v_call"].value_counts().to_dict()
# family-grain (smoother — drop the *NN suffix)
fam_counts = (
    panel["v_call"].str.split("*").str[0]
    .value_counts().to_dict()
)
# the sampler will match any allele whose family is in the dict
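If you want to be explicit about what that family-level matching means, here is a sketch of the rule in plain Python. The helper is hypothetical, written only to mirror the behaviour described above:

```python
def expand_family_weights(alleles, fam_weights):
    """Hypothetical helper mirroring the matching rule above: each
    allele inherits its family's weight, and alleles whose family is
    absent from the dict keep the default weight 1.0."""
    out = {}
    for allele in alleles:
        family = allele.split("*")[0]  # drop the *NN allele suffix
        out[allele] = fam_weights.get(family, 1.0)
    return out

pool = ["IGHV3-23*01", "IGHV3-23*04", "IGHV1-2*02"]
expanded = expand_family_weights(pool, {"IGHV3-23": 730})
# → {'IGHV3-23*01': 730, 'IGHV3-23*04': 730, 'IGHV1-2*02': 1.0}
```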
The same mechanism applies to the D and J pools. The D pool is small enough that an even distribution is usually fine. The J pool is small too, but consider weighting it explicitly if your reference repertoire shows any asymmetry; IGHJ4 and IGHJ6 often dominate.
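The J (and D) weights are derived exactly like the V weights, from the j_call and d_call columns. A sketch, using a toy DataFrame as a stand-in for your real reference panel (the data is hypothetical):

```python
import pandas as pd

# toy stand-in for a real reference panel's j_call column
toy_panel = pd.DataFrame({
    "j_call": ["IGHJ4*02", "IGHJ4*02", "IGHJ6*03",
               "IGHJ4*02", "IGHJ6*03", "IGHJ3*01"],
})
# raw counts work as weights; the sampler normalises them
j_weights = toy_panel["j_call"].value_counts().to_dict()
# → {'IGHJ4*02': 3, 'IGHJ6*03': 2, 'IGHJ3*01': 1}
```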
# d_weights and j_weights are built the same way as v_weights above
result = (
    ga.Experiment.on("human_igh")
    .recombine(
        v_allele_weights=v_weights,
        d_allele_weights=d_weights,
        j_allele_weights=j_weights,
    )
    .run(n=10_000, seed=42)
)
After a run, compare the truth_v_call distribution of the simulated sequences to your
reference panel. Compute a KL divergence or eyeball a bar chart; the two distributions should be close.
import numpy as np
df = result.to_dataframe()
emp = panel["v_call"].value_counts(normalize=True)
sim = df["truth_v_call"].value_counts(normalize=True)
shared = emp.index.intersection(sim.index)
e, s = emp[shared].to_numpy(), sim[shared].to_numpy()
e, s = e / e.sum(), s / s.sum()  # renormalise over the shared support
kl = (e * np.log(e / s)).sum()
print(f"KL(empirical || simulated): {kl:.4f}")
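Restricting the comparison to the shared support silently drops alleles that appear in only one of the two distributions. If that matters for your panel, a small pseudocount over the union of alleles keeps every category in play. A sketch, not part of GenAIRR:

```python
import numpy as np
import pandas as pd

def smoothed_kl(p_counts, q_counts, eps=1e-6):
    """KL(p || q) over the union of categories, with a pseudocount so
    categories absent from one side don't have to be discarded."""
    idx = p_counts.index.union(q_counts.index)
    p = p_counts.reindex(idx, fill_value=0).to_numpy(dtype=float) + eps
    q = q_counts.reindex(idx, fill_value=0).to_numpy(dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()  # renormalise after smoothing
    return float((p * np.log(p / q)).sum())

a = pd.Series({"IGHV3-23*01": 450, "IGHV4-39*01": 280})
b = pd.Series({"IGHV3-23*01": 430, "IGHV4-59*02": 10})
smoothed_kl(a, a)  # → 0.0 (identical distributions)
```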