GenAIRR ships with 106 built-in configs covering 23 species. But if you're working with a
panel-specific reference, an in-house annotation, or a non-stock locus, you can hand the
engine a RefDataConfig directly. This recipe walks you through the smallest
valid VJ config, scales up to a heavy-chain VDJ config, and points at the things you'll get
wrong the first time.
Before you build one, check whether the engine already has what you need.
GenAIRR.list_configs() returns all 106 built-ins; if your species or locus is
in there, point Experiment.on(name) at it and skip the rest of this recipe.
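One way to run that check is to filter the returned names by keyword. This is a minimal sketch that assumes `list_configs()` yields plain-string names; `find_configs` and the sample names are hypothetical and exist only for illustration.

```python
def find_configs(names, *keywords):
    """Return the names that contain every keyword, case-insensitively."""
    wanted = [k.lower() for k in keywords]
    return sorted(n for n in names if all(k in n.lower() for k in wanted))


# Hypothetical names for illustration; in practice pass ga.list_configs().
sample_names = ["human_igh", "human_igk", "mouse_igh"]
matches = find_configs(sample_names, "human", "igk")
```

An empty result is the signal to keep reading and build a custom config.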
Build a custom config when you're working with a panel-specific reference, an in-house annotation, or a non-stock locus that the built-ins don't cover.
A VJ config has just two pools (V and J). Build it with
RefDataConfig.vj(), populate the pools with add_v_allele and
add_j_allele, and you're ready to simulate. Each anchor is the 0-based position of
the first base of the allele's conserved codon (Cys in V, Trp or Phe in J). GenAIRR uses that
field to locate the junction on the final sequence, regardless of trimming or mutation.
import GenAIRR as ga
cfg = ga.RefDataConfig.vj()
# V alleles — family, sequence, Cys-codon position
cfg.add_v_allele(
    "IGKV1-39*01", "IGKV1-39",
    b"GACATCCAGATGACCCAGTCTCCATCCT...",
    anchor=288,
)
cfg.add_v_allele(
    "IGKV3-20*01", "IGKV3-20",
    b"GAAATTGTGTTGACGCAGTCTCCAGGCA...",
    anchor=285,
)
# J alleles — Trp/Phe-codon position
cfg.add_j_allele(
    "IGKJ1*01", "IGKJ1",
    b"GTGGACGTTCGGCCAAGGGACCAAGGT...",
    anchor=7,  # TTC (Phe) starts at 0-based index 7
)
cfg.add_j_allele(
    "IGKJ2*01", "IGKJ2",
    b"TGTACACTTTTGGCCAGGGGACCAAGCT...",
    anchor=8,  # in-frame TTT (Phe) starts at 0-based index 8
)
# simulate from the custom config
result = ga.Experiment.on(cfg).recombine().run(n=100, seed=42)
VJ has no D segment and no NP2. Light chains (IGK, IGL) and the α / γ T-cell chains are
all VJ. If you need three segments (heavy chain, β, δ), use .vdj() below.
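Because the anchor is a plain 0-based index into the allele sequence, it can be checked before the allele is added. The sketch below is independent of GenAIRR: `anchor_codon` and `anchor_ok` are hypothetical helper names, the toy sequences are not real alleles, and the check needs the full, untruncated sequence to be meaningful.

```python
ANCHOR_CODONS = {
    "V": {b"TGT", b"TGC"},          # Cys
    "J": {b"TGG", b"TTT", b"TTC"},  # Trp or Phe
}


def anchor_codon(seq: bytes, anchor: int) -> bytes:
    """Return the three bases starting at the 0-based anchor position."""
    if anchor < 0 or anchor + 3 > len(seq):
        raise ValueError(
            f"anchor {anchor} leaves no full codon in a {len(seq)}-base sequence"
        )
    return seq[anchor:anchor + 3].upper()


def anchor_ok(seq: bytes, anchor: int, segment: str) -> bool:
    """True if the anchor codon is Cys for 'V' or Trp/Phe for 'J'."""
    return anchor_codon(seq, anchor) in ANCHOR_CODONS[segment]


# Toy sequences, not real alleles: Cys (TGT) at index 12, Trp (TGG) at index 9.
toy_v = b"GCCGTGTATTACTGTGCGAGA"
toy_j = b"TTTGACTACTGGGGCCAGGGA"
v_check = anchor_ok(toy_v, 12, "V")
j_check = anchor_ok(toy_j, 9, "J")
```

Running `anchor_ok` on every allele catches off-by-one anchors before any sequence is simulated.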
For VDJ loci, swap .vj() for .vdj() and populate the D pool with
add_d_allele. D alleles have no anchor — they're short and entirely contained
between NP1 and NP2 — so the third argument is just the sequence.
cfg = ga.RefDataConfig.vdj()
cfg.add_v_allele("IGHV3-23*01", "IGHV3-23",
                 b"GAGGTGCAGCTGGTGGAGTCT...", anchor=288)
cfg.add_v_allele("IGHV1-69*01", "IGHV1-69",
                 b"CAGGTGCAGCTGGTGCAGTCT...", anchor=288)
cfg.add_d_allele("IGHD3-10*01", "IGHD3-10",
                 b"GTATTACTATGGTTCGGGGAGTTATTATAAC")
cfg.add_j_allele("IGHJ4*02", "IGHJ4",
                 b"ACTACTTTGACTACTGGGGCCAGGGAACCCTG...",
                 anchor=14)  # TGG (Trp) starts at 0-based index 14
# productive() works against custom configs the same way
result = ga.Experiment.on(cfg).recombine().run(
    n=100, seed=42, respect=ga.productive(),
)
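With more than a handful of alleles, the add_* calls are easier to drive from a table. The sketch below assumes the second positional argument is always the allele name with its `*NN` suffix removed, which holds for every example in this recipe but is not confirmed as a rule. `gene_of`, `register_alleles` and `RecordingConfig` are hypothetical names, and the toy records are not real alleles.

```python
def gene_of(allele_name: str) -> str:
    """Strip the allele suffix: 'IGHV3-23*01' -> 'IGHV3-23'."""
    return allele_name.split("*", 1)[0]


def register_alleles(cfg, v=(), d=(), j=()):
    """Add (name, seq, anchor) V/J records and (name, seq) D records to cfg."""
    for name, seq, anchor in v:
        cfg.add_v_allele(name, gene_of(name), seq, anchor=anchor)
    for name, seq in d:
        cfg.add_d_allele(name, gene_of(name), seq)
    for name, seq, anchor in j:
        cfg.add_j_allele(name, gene_of(name), seq, anchor=anchor)
    return cfg


class RecordingConfig:
    """Stand-in that records calls; swap in ga.RefDataConfig.vdj() for real use."""

    def __init__(self):
        self.calls = []

    def add_v_allele(self, name, gene, seq, anchor):
        self.calls.append(("V", name, gene, anchor))

    def add_d_allele(self, name, gene, seq):
        self.calls.append(("D", name, gene, None))

    def add_j_allele(self, name, gene, seq, anchor):
        self.calls.append(("J", name, gene, anchor))


demo = register_alleles(
    RecordingConfig(),
    v=[("TOYV1*01", b"GCCGTGTATTACTGTGCGAGA", 12)],
    d=[("TOYD1*01", b"GGTATAGC")],
    j=[("TOYJ1*01", b"TTTGACTACTGGGGCCAGGGA", 9)],
)
```

Keeping the records in one table also makes it straightforward to compare the number of rows against the pool sizes afterwards.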
A custom config can be plausible-looking and still subtly wrong — an anchor off by one base breaks every junction call. Run a small batch and check three things before you trust the output.
# did every add_* land?
cfg.v_pool_size() # expected: number you added
cfg.d_pool_size() # 0 for vj()
cfg.j_pool_size()
sim = result[0].final_simulation()
span = sim.junction_span()
junc = sim.bases()[span[0]:span[1]]
# junc should start with Cys-codon (TGT/TGC)
# and end with Trp (TGG) or Phe (TTT/TTC)
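The start and end codon check can be automated across a whole batch. The sketch below is a hypothetical helper (`junction_problems` is not part of GenAIRR) and accepts either `bytes` or `str`, since the return type of `bases()` is not stated here.

```python
def junction_problems(junc) -> list:
    """List the ways a junction fails the start/end codon check."""
    if isinstance(junc, (bytes, bytearray)):
        text = bytes(junc).decode("ascii")
    else:
        text = str(junc)
    text = text.upper()
    problems = []
    if text[:3] not in ("TGT", "TGC"):
        problems.append(f"starts with {text[:3]!r}, expected Cys (TGT/TGC)")
    if text[-3:] not in ("TGG", "TTT", "TTC"):
        problems.append(
            f"ends with {text[-3:]!r}, expected Trp (TGG) or Phe (TTT/TTC)"
        )
    return problems


# Toy junctions, not simulator output.
clean = junction_problems("TGTGCGAGAGATTACTGG")
broken = junction_problems(b"AAAGCGAGAGGG")
```

An empty list means both ends look right. If the same problem shows up on most records, suspect a misplaced anchor before anything else.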
productive_rate = sum(
    1 for o in result
    if o.final_simulation().productive
) / len(result)
# without contracts: ~30-50% is typical
# with productive(): >99%