Recipe A · 04 · ~25 min · advanced

Wire a custom SHM model.

Out of the box GenAIRR ships two mutation models: S5F (motif-aware, Yaari et al. 2013) and Uniform (rate-only, useful as a null). Both are good. Neither is your model — if you have a per-base substitution distribution from your own data, you can drop it in as a new mutation pass while leaving the rest of the pipeline intact.

01 Pick the surface per-base · per-motif · position-aware

02 Implement probability table + sampler

03 Replace swap into the mutate pass

PART 01

Pick the right level of granularity.

A mutation model has to answer two questions per substitution event: where in the sequence does the mutation land, and which base does it become. The difference between S5F, uniform, and your-own-model is purely how those probabilities are conditioned.

Per-base (uniform-ish)

Each position is independently a target with probability proportional to a fixed rate. Substitution is uniform over the 3 alternative bases. Useful as a null model.

Per-motif (S5F-style)

Target probability conditioned on a small window — typically the 5-mer centred on the position. Substitution table is also conditioned on the motif. Captures AID hotspots.

Position-aware

Probability conditioned on the position relative to the gene start (e.g. CDR/FWR aware). Most flexible; needs more parameters and a calibration dataset.

PART 02

A custom per-motif model — sketch.

A motif-aware model is two tables and a sampler. The targeting table maps each motif to a probability of being targeted; the substitution table maps each motif (and target base) to a distribution over the 4 nucleotides.

Custom 5-mer model — Python pass

from GenAIRR._engine import Simulation, Rng
import json

class MotifMutationPass:
    def __init__(self, model_path: str, count: tuple):
        with open(model_path) as f:
            m = json.load(f)
        self.target_w = m["targeting"]    # 5-mer → weight
        self.sub_dist = m["substitution"]  # 5-mer → {A,C,G,T weights}
        self.count_lo, self.count_hi = count

    def name(self): return "mutate.custom_motif"
    def effects(self): return frozenset({"point_mutation"})

    def apply(self, sim, rng, contracts):
        n_subs = rng.randint(self.count_lo, self.count_hi)
        builder = sim.to_builder()
        for _ in range(n_subs):
            pos    = self._draw_target(sim, rng)
            new_b  = self._draw_sub(sim, pos, rng)
            builder.set_base(pos, new_b)
            builder.record_choice(
                f"mutate.custom_motif.pos[{pos}]",
                {"pos": pos, "alt": new_b},
            )
        return builder.build()

The _draw_target and _draw_sub helpers are where your model's physics lives. Implement them however you like — weighted reservoir sampling, alias method, or a precomputed inverse-CDF lookup. The engine only cares that they return valid positions and bases.

PART 03

Plug it in.

Drop the custom pass into a manually-assembled PassPlan in the same slot the built-in mutate pass would occupy. The DSL stays available for the surrounding phases.

Wiring into a plan

from GenAIRR._engine import PassPlan, CompiledSimulator
import GenAIRR as ga

refdata = ga.dataconfig_to_refdata(ga.HUMAN_IGH_OGRDB)
plan = PassPlan.recombine(refdata)

# swap in your motif model for the canonical S5F pass
plan.push_python_pass(
    MotifMutationPass("./my_5mer_model.json", count=(5, 25))
)

# layer the standard corruption phase after
plan.push_corrupt_5prime_loss([(5, 30, 1.0)])

sim = CompiledSimulator(plan)
out = sim.run(n=1000, seed=42)

PART 04

Verify the rates land where you expect.

A custom model is only useful if it produces the distributions it advertises. Two orthogonal checks: per-base mutation rate, and motif-bias on the targeting distribution.

Mutation rate per base

import numpy as np
rates = np.array([
    sim.truth_n_mutations / sim.sequence_length
    for o in out
    for sim in [o.final_simulation()]
])
print(rates.mean(), rates.std())

Motif-bias check

from collections import Counter
counts = Counter()
for o in out:
    sim = o.final_simulation()
    for pos in sim.mutation_positions():
        motif = sim.germline_window(pos, 5)
        counts[motif] += 1
# compare with your model's targeting table

Related recipes

Where to next.

B · 02 · Compare two SHM models →

Once you have your model wired, compare its output against S5F or Uniform on the same recombination seed.

A · 02 · Compose a custom Pass →

The general pattern for any custom pass, not just SHM.