Recipe A · 01 · ~15 min · intermediate

Build a config for your locus or species.

GenAIRR ships with 106 built-in configs covering 23 species. But if you're working with a panel-specific reference, an in-house annotation, or a non-stock locus, you can hand the engine a RefDataConfig directly. This recipe walks you through the smallest valid VJ config, scales up to a heavy-chain VDJ config, and points at the things you'll get wrong the first time.

01 Construct: RefDataConfig.vj() · .vdj()
02 Add alleles: name · family · sequence · anchor
03 Verify: run, inspect, validate
PART 01

When you actually need a custom config.

Before you build one, check whether the engine already has what you need. GenAIRR.list_configs() returns all 106 built-ins; if your species or locus is in there, point Experiment.on(name) at it and skip the rest of this recipe. Build a custom config when the answer to any of the following is yes:

Reasons to build your own
  • Custom germline pull: a newer IMGT release, an OGRDB-curated set, or an in-house annotation
  • Panel-specific reference: a subset matched to the alleles your wet-lab assay actually amplifies
  • Non-stock species or locus: something not in the 23-species shipped pool
  • Designed allele: synthetic or modified V/J alleles for engineering experiments
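Before committing to any of these, the built-in check can be scripted. The sketch below is a hedged illustration: list_configs() is the only GenAIRR call involved, its return shape (an iterable of name strings) is an assumption, and find_builtin plus the placeholder names are hypothetical, so the search runs against a plain list.

```python
def find_builtin(names, *keywords):
    """Return the config names that contain every keyword (case-insensitive).

    `names` is assumed to be an iterable of config-name strings.
    """
    wanted = [k.lower() for k in keywords]
    return [n for n in names if all(k in str(n).lower() for k in wanted)]


# With GenAIRR installed (assumption: list_configs() yields name strings):
#   import GenAIRR as ga
#   hits = find_builtin(ga.list_configs(), "human", "igk")

# Placeholder names, made up for illustration only.
example_names = ["HUMAN_IGH_EXAMPLE", "HUMAN_IGK_EXAMPLE", "MOUSE_IGH_EXAMPLE"]
hits = find_builtin(example_names, "human", "igk")
print(hits or "no built-in match: build a custom config")
```

An empty result is the signal to keep reading; a non-empty one means Experiment.on(name) is probably all you need.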
What you'll need
  • Allele sequences: FASTA-style nucleotide strings for every V, D (if VDJ), and J allele
  • Anchor positions: 0-based offset of the V Cys codon and the J Trp/Phe codon (first base of the codon)
  • Family labels: a coarse grouping per allele; IGHV3-23*01 and *02 share family "IGHV3-23"
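If your alleles live in a FASTA file, the first and third items can be prepared together. This is a minimal sketch, not part of GenAIRR: read_fasta and family_of are hypothetical helpers, the header is assumed to start with the bare allele name, and the family is assumed to be everything before the `*`. The DEMO record is invented.

```python
def family_of(allele_name):
    """Family label: the allele name without its '*NN' suffix."""
    return allele_name.split("*", 1)[0]


def read_fasta(text):
    """Parse FASTA text into a list of (name, family, sequence_bytes)."""
    records = []
    name, chunks = None, []

    def flush():
        # Store the record collected so far, if there is one.
        if name is not None:
            seq = "".join(chunks).upper().encode("ascii")
            records.append((name, family_of(name), seq))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            # Assumption: the first whitespace-delimited token is the allele name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    flush()
    return records


demo = """>IGHD3-10*01
GTATTACTATGGTTCGG
GGAGTTATTATAAC
>DEMO1-1*01
acgtacgt
"""
for name, family, seq in read_fasta(demo):
    print(name, family, len(seq))
```

Pipe-delimited headers (as in IMGT downloads) would need the name extracted from the right field first; that step is left out here.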
PART 02

The smallest valid config — VJ.

A VJ config has just two pools (V and J). Build it with RefDataConfig.vj(), populate the pools with add_v_allele and add_j_allele, and you're ready to simulate. Anchors are 0-based positions of the codon's first base — that's the field GenAIRR uses to locate the junction on the final sequence, regardless of trim or mutation.
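Because a wrong anchor is silent, it can help to sanity-check anchors before they go into the config. The sketch below is an assumption-laden helper, not a GenAIRR API: find_j_anchor looks for the conserved Trp/Phe-Gly-X-Gly motif that most functional J genes carry, and is_cys_anchor only confirms that a V anchor points at TGT or TGC. Unusual or pseudogene alleles may not match, and the toy sequences are invented.

```python
import re

# Trp (TGG) or Phe (TTT/TTC), then Gly (GGN), any codon, Gly (GGN).
_J_MOTIF = re.compile(rb"(?:TGG|TT[CT])GG[ACGT]{4}GG[ACGT]")


def find_j_anchor(seq):
    """0-based index of the first base of the Trp/Phe codon, or None."""
    match = _J_MOTIF.search(bytes(seq).upper())
    return match.start() if match else None


def is_cys_anchor(seq, anchor):
    """True if the codon starting at `anchor` is a Cys codon (TGT/TGC)."""
    return bytes(seq).upper()[anchor:anchor + 3] in (b"TGT", b"TGC")


toy_j = b"ACTGGGGCCAGGGAACC"   # invented J-like sequence
toy_v = b"GAGGTGTGTGCG"        # invented V-like tail
print(find_j_anchor(toy_j))
print(is_cys_anchor(toy_v, 6), is_cys_anchor(toy_v, 7))
```

Treat a None from find_j_anchor as a prompt to check the allele by hand rather than as proof that it has no anchor.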

A 2-allele light-chain config
import GenAIRR as ga

cfg = ga.RefDataConfig.vj()

# V alleles: name, family, sequence, Cys-codon anchor
cfg.add_v_allele(
    "IGKV1-39*01", "IGKV1-39",
    b"GACATCCAGATGACCCAGTCTCCATCCT...",
    anchor=288,
)
cfg.add_v_allele(
    "IGKV3-20*01", "IGKV3-20",
    b"GAAATTGTGTTGACGCAGTCTCCAGGCA...",
    anchor=285,
)

# J alleles: name, family, sequence, Trp/Phe-codon anchor
cfg.add_j_allele(
    "IGKJ1*01", "IGKJ1",
    b"GTGGACGTTCGGCCAAGGGACCAAGGT...",
    anchor=7,  # first base of the Phe codon (TTC)
)
cfg.add_j_allele(
    "IGKJ2*01", "IGKJ2",
    b"TGTACACTTTTGGCCAGGGGACCAAGCT...",
    anchor=8,  # first base of the Phe codon (TTT)
)

# simulate from the custom config
result = ga.Experiment.on(cfg).recombine().run(n=100, seed=42)

VJ has no D segment and no NP2. Light chains (IGK, IGL) and the α / γ T-cell chains are all VJ. If you need three segments (heavy chain, β, δ), use .vdj() below.
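If you build configs for several loci from one script, the VJ/VDJ split above can be captured in a lookup. This is a small sketch: the locus codes are the standard IMGT abbreviations for the chains named above, while CHAIN_KIND and constructor_name are hypothetical names, not part of GenAIRR.

```python
# Segment layout per locus: "vj" chains have no D pool and no NP2.
CHAIN_KIND = {
    "IGK": "vj", "IGL": "vj",    # light chains
    "TRA": "vj", "TRG": "vj",    # alpha / gamma T-cell chains
    "IGH": "vdj",                # heavy chain
    "TRB": "vdj", "TRD": "vdj",  # beta / delta T-cell chains
}


def constructor_name(locus):
    """Return "vj" or "vdj" for a locus code, or raise on an unknown locus."""
    try:
        return CHAIN_KIND[locus.upper()]
    except KeyError:
        raise ValueError(f"unknown locus: {locus!r}") from None


# With GenAIRR installed, the name selects the constructor used in this recipe:
#   cfg = getattr(ga.RefDataConfig, constructor_name("IGK"))()
print(constructor_name("IGK"), constructor_name("igh"))
```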

PART 03

Adding a D pool for heavy chains.

For VDJ loci, swap .vj() for .vdj() and populate the D pool with add_d_allele. D alleles have no anchor: they are short and sit entirely between NP1 and NP2, so add_d_allele takes only the name, family, and sequence.

Heavy-chain config sketch
cfg = ga.RefDataConfig.vdj()

cfg.add_v_allele("IGHV3-23*01", "IGHV3-23",
                  b"GAGGTGCAGCTGGTGGAGTCT...", anchor=288)
cfg.add_v_allele("IGHV1-69*01", "IGHV1-69",
                  b"CAGGTGCAGCTGGTGCAGTCT...", anchor=288)

cfg.add_d_allele("IGHD3-10*01", "IGHD3-10",
                  b"GTATTACTATGGTTCGGGGAGTTATTATAAC")

cfg.add_j_allele("IGHJ4*02", "IGHJ4",
                  b"ACTACTTTGACTACTGGGGCCAGGGAACCCTG...",
                  anchor=14)  # first base of the Trp codon (TGG)

# productive() works against custom configs the same way
result = ga.Experiment.on(cfg).recombine().run(
    n=100, seed=42, respect=ga.productive(),
)
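Real references have dozens of alleles, so the add_* calls usually end up in a loop. Here is a hedged sketch of one way to do that: populate is a hypothetical helper that relies only on the three add_* methods shown above, and RecordingConfig is a stand-in so the example runs without GenAIRR installed. The DEMO rows are invented.

```python
class RecordingConfig:
    """Stand-in for a RefDataConfig; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def add_v_allele(self, name, family, seq, anchor):
        self.calls.append(("V", name, family, anchor))

    def add_d_allele(self, name, family, seq):
        self.calls.append(("D", name, family, None))

    def add_j_allele(self, name, family, seq, anchor):
        self.calls.append(("J", name, family, anchor))


def populate(cfg, v_rows=(), d_rows=(), j_rows=()):
    """Add every row to cfg, rejecting duplicate allele names.

    V and J rows are (name, family, seq, anchor); D rows are (name, family, seq).
    """
    seen = set()

    def claim(name):
        if name in seen:
            raise ValueError(f"duplicate allele name: {name}")
        seen.add(name)

    for name, family, seq, anchor in v_rows:
        claim(name)
        cfg.add_v_allele(name, family, seq, anchor=anchor)
    for name, family, seq in d_rows:
        claim(name)
        cfg.add_d_allele(name, family, seq)
    for name, family, seq, anchor in j_rows:
        claim(name)
        cfg.add_j_allele(name, family, seq, anchor=anchor)
    return cfg


v_rows = [("DEMOV1*01", "DEMOV1", b"GAGGTGTGTGCG", 6)]
d_rows = [("DEMOD1*01", "DEMOD1", b"GTATTAC")]
j_rows = [("DEMOJ1*01", "DEMOJ1", b"ACTGGGGCCAGGGAACC", 2)]
cfg = populate(RecordingConfig(), v_rows, d_rows, j_rows)
print(len(cfg.calls), "alleles added")
```

With the real library, pass ga.RefDataConfig.vdj() in place of RecordingConfig().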
PART 04

Verify it works.

A custom config can be plausible-looking and still subtly wrong — an anchor off by one base breaks every junction call. Run a small batch and check three things before you trust the output.

1 · Pool sizes match expectation
# did every add_* land?
cfg.v_pool_size()  # expected: number you added
cfg.d_pool_size()  # 0 for vj()
cfg.j_pool_size()
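To turn that eyeball check into something a script can fail on, compare the reported sizes against the counts you expect. A minimal sketch: pool_size_mismatches is a hypothetical helper, and the numbers are illustrative values for the 2-allele VJ config.

```python
def pool_size_mismatches(actual, expected):
    """Return {pool: (actual, expected)} for every pool whose size differs."""
    return {
        pool: (actual.get(pool), want)
        for pool, want in expected.items()
        if actual.get(pool) != want
    }


# With a real config, fill `actual` from the calls shown above:
#   actual = {"V": cfg.v_pool_size(), "D": cfg.d_pool_size(), "J": cfg.j_pool_size()}
actual = {"V": 2, "D": 0, "J": 2}    # illustrative values
expected = {"V": 2, "D": 0, "J": 2}

problems = pool_size_mismatches(actual, expected)
print(problems or "pool sizes match")
```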
2 · Junction sits where the anchor says
sim = result[0].final_simulation()
start, end = sim.junction_span()
junc = sim.bases()[start:end]
# junc should start with a Cys codon (TGT/TGC)
# and end with Trp (TGG) or Phe (TTT/TTC)
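The same two conditions can be checked for every sequence in the batch rather than by eye. This sketch assumes the junction arrives as bytes or str; junction_problems is a hypothetical helper and the example junctions are invented.

```python
CYS_CODONS = (b"TGT", b"TGC")
TRP_PHE_CODONS = (b"TGG", b"TTT", b"TTC")


def junction_problems(junc):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    if isinstance(junc, str):
        junc = junc.encode("ascii")
    junc = bytes(junc).upper()

    problems = []
    if len(junc) < 6:
        problems.append("shorter than two codons")
    if junc[:3] not in CYS_CODONS:
        problems.append("does not start with a Cys codon (TGT/TGC)")
    if junc[-3:] not in TRP_PHE_CODONS:
        problems.append("does not end with Trp (TGG) or Phe (TTT/TTC)")
    return problems


good = b"TGTGCGAAAGACTACTGG"   # invented junction, anchors in place
shifted = good[1:]             # what a V anchor off by one base would give
print(junction_problems(good))
print(junction_problems(shifted))
```

If most sequences report the same problem, suspect the anchor of the allele they share rather than the simulation itself.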
3 · Productivity rate is sane
productive_rate = sum(
    1 for o in result
    if o.final_simulation().productive
) / len(result)

# without contracts: ~30-50% is typical
# with productive(): >99%
Related recipes

Where to next.

A · 03 · Sample only productive sequences →

Pair your new config with the productive() bundle for clean training data.

C · 02 · Match an empirical V-gene usage distribution →

Reweight allele sampling so your config matches a target frequency table.