Reference models and rules

The typed dataclasses that describe a cartridge's empirical-model plane (allele usage, NP-base model, trim and length distributions) and its rules plane (anchor expectations, alphabet, severity policy). These specs are the schemas the ReferenceCartridgeBuilder estimators write into; you can also hand-author them when estimating from records isn't possible.

For the conceptual model — four typed planes (identity, catalogue, rules, empirical models) — see the reference-cartridge concept page.

Empirical models

ReferenceEmpiricalModels

The bundle that carries the empirical models (allele usage, NP-base model, trim distributions, NP-length distributions). An instance is attached to DataConfig.reference_models after .build().

GenAIRR.reference_models.ReferenceEmpiricalModels dataclass

The empirical models plane of a reference cartridge — default distributions the engine samples from when the user doesn't override them at recombine time.

A cartridge has four planes:

  • identity (species, locus, reference set, name, source)
  • catalogue (V/D/J alleles)
  • rules (see ReferenceRulesSpec): how to interpret alleles
  • empirical models (this object): defaults for NP lengths and exonuclease trim distributions

Two dicts, both keyed by stable string identifiers:

  • np_lengths: {"NP1", "NP2"} → length distribution.
  • trims: {"V_3", "D_5", "D_3", "J_5"} → trim distribution.

Each value is an EmpiricalDistributionSpec: a flat list of (value, weight) pairs that the engine consumes directly, with no marginalisation step in between (unlike the legacy nested-dict extraction path).

Empty dicts are valid (model absent → fall back to the legacy DataConfig.NP_lengths/trim_dicts extraction, then to the uniform placeholder). Partially-populated dicts are valid (NP1 set, NP2 absent → NP1 used; NP2 falls through to legacy).

validate(chain_type=...) enforces shape: known keys only, non-empty distributions, positive finite weights. The chain_type argument ("vj" or "vdj") rejects D-trim models on VJ cartridges — D trims are nonsense without a D segment, and silently ignoring them would mask the user's authoring mistake.
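The shape rules above can be sketched in plain Python. This is a minimal illustration, not the GenAIRR implementation: the function name check_models_shape is hypothetical, it collects problems in a list rather than raising, and it only assumes the key sets and weight rules stated above.

```python
import math

# Key sets taken from the description above.
NP_KEYS = {"NP1", "NP2"}
TRIM_KEYS = {"V_3", "D_5", "D_3", "J_5"}


def check_models_shape(np_lengths, trims, chain_type="vdj"):
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []
    for label, table, known in (
        ("np_lengths", np_lengths, NP_KEYS),
        ("trims", trims, TRIM_KEYS),
    ):
        for key, pairs in table.items():
            if key not in known:
                problems.append(f"{label}: unknown key {key!r}")
                continue
            if not pairs:
                problems.append(f"{label}[{key!r}]: empty distribution")
            for value, weight in pairs:
                # Weights must be finite and strictly positive.
                if not (math.isfinite(weight) and weight > 0):
                    problems.append(
                        f"{label}[{key!r}]: bad weight {weight!r} for value {value}"
                    )
    if chain_type == "vj":
        # D trims make no sense without a D segment.
        for key in trims:
            if key.startswith("D_"):
                problems.append(f"trims[{key!r}]: D trim on a VJ cartridge")
    return problems
```

Empty dicts pass, as do partially populated ones; a D_5 entry passes for "vdj" but is reported for "vj".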

See docs/reference_cartridge.md for end-to-end examples.

EmpiricalDistributionSpec

The [(value, weight), ...] shape used for trim and NP-length distributions. Weights renormalise; zero-weight entries drop out of the proposal support.

GenAIRR.reference_models.EmpiricalDistributionSpec dataclass

One empirical distribution as a flat (value, weight) list.

values is a list of (int, float) pairs. Values are non-negative ints (lengths, trim amounts). Weights are positive finite floats; the engine normalises them at sample time, so callers don't need to normalise them, only keep them positive. Duplicate values are not rejected at this layer (the engine accumulates their weights), but they are discouraged as a matter of style.
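As a rough illustration of what "normalise at sample time" and "the engine accumulates" amount to, the following standalone sketch turns a (value, weight) list into probabilities. The helper name normalise_pairs is hypothetical and the engine's actual code path is not shown here.

```python
from collections import defaultdict


def normalise_pairs(pairs):
    """Sum weights of duplicate values, then scale so the total is 1."""
    totals = defaultdict(float)
    for value, weight in pairs:
        totals[value] += weight
    total = sum(totals.values())
    return {value: w / total for value, w in sorted(totals.items())}


# Value 0 appears twice; its weights are added before normalising.
probs = normalise_pairs([(0, 2.0), (1, 1.0), (0, 1.0)])
print(probs)  # {0: 0.75, 1: 0.25}
```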

NpBaseModelSpec

The NP-base sampling model. Besides the uniform default, there are two kinds: empirical-first-base (per-base marginal) and Markov (first-base marginal + 4×4 transition matrix). The Markov form is the default produced by estimate_np_base_model(records, kind="markov").

GenAIRR.reference_models.NpBaseModelSpec dataclass

One V(D)J N-addition base sampling model.

Three kinds are recognised by the Python validator:

  • "uniform" (default-equivalent) — every NP position samples uniformly from A/C/G/T. Byte-identical to the pre-slice engine when no model is configured. first_base and transitions must be None.
  • "empirical_first_base" — every NP position samples independently from the supplied first_base categorical distribution. first_base is required; transitions must be None. This is the v1 cartridge-owned biology surface for the typed NP base model.
  • "markov" — true Markov chain where each base is sampled conditional on the previously emitted base via the transitions matrix. Both first_base and transitions are required.

Wired end-to-end as of the NP-Markov slice. The engine ships an NpBaseGenerator trait whose MarkovBaseGenerator implementation threads the previous base through sample_base(rng, previous_base, …); the typed Python spec lowers directly into it via push_generate_np(..., markov_transitions=...). Plan-signature replay folds the full Markov payload (first-base row + 4 transition rows in canonical A/C/G/T order). See docs/junction_n_addition_audit.md (Markov shipped) and the validation matrix's "NP base models / Markov N-addition" row.
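To make the difference between the kinds concrete, here is a small pure-Python sketch of the sampling logic. It does not call the engine: sample_np is a hypothetical name, and passing transitions=None is used here only to stand in for the "empirical_first_base" kind.

```python
import random

BASES = "ACGT"  # canonical A/C/G/T order


def sample_np(length, first_base, transitions, rng):
    """Sample an NP region of the given length.

    With transitions=None every position is drawn from first_base
    (empirical-first-base behaviour). Otherwise the first position is
    drawn from first_base and each later one from the row of the
    previously emitted base (Markov behaviour).
    """
    out = []
    for _ in range(length):
        if not out or transitions is None:
            row = first_base
        else:
            row = transitions[out[-1]]
        weights = [row.get(b, 0.0) for b in BASES]
        out.append(rng.choices(BASES, weights=weights, k=1)[0])
    return "".join(out)


# A deterministic chain makes the conditioning visible: A->C->G->T->A.
cycle = {"A": {"C": 1.0}, "C": {"G": 1.0}, "G": {"T": 1.0}, "T": {"A": 1.0}}
print(sample_np(6, {"A": 1.0}, cycle, random.Random(0)))  # ACGTAC
```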

Validation rules:

  • kind must be one of {"uniform", "empirical_first_base", "markov"}.
  • first_base (when present) is dict[str, float] with keys from {"A", "C", "G", "T"}. Values are finite, non-negative, with at least one strictly positive. Unknown bases are rejected.
  • transitions (when present) is dict[str, dict[str, float]] — outer key is the previously-emitted base, inner is the next-base categorical. Same per-base validation rules; each row must be non-empty and have at least one positive weight. Outer-key coverage of A/C/G/T is enforced — if any from-base is missing, the validator raises (a partial Markov matrix is almost certainly an authoring bug).
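The validation rules above can be restated as a standalone check. This is a sketch under the rules as written on this page, not the library's validator; check_row and check_np_base_model are hypothetical names, and raising ValueError is an assumption about the error type.

```python
import math

BASES = ("A", "C", "G", "T")
KINDS = {"uniform", "empirical_first_base", "markov"}


def check_row(row, label):
    """One categorical over bases: known keys, finite >= 0, one weight > 0."""
    if not row:
        raise ValueError(f"{label}: empty")
    for base, weight in row.items():
        if base not in BASES:
            raise ValueError(f"{label}: unknown base {base!r}")
        if not math.isfinite(weight) or weight < 0:
            raise ValueError(f"{label}: bad weight {weight!r}")
    if not any(weight > 0 for weight in row.values()):
        raise ValueError(f"{label}: no strictly positive weight")


def check_np_base_model(kind, first_base=None, transitions=None):
    if kind not in KINDS:
        raise ValueError(f"unknown kind {kind!r}")
    if kind == "uniform":
        if first_base is not None or transitions is not None:
            raise ValueError("uniform takes no first_base or transitions")
        return
    if first_base is None:
        raise ValueError(f"{kind} requires first_base")
    check_row(first_base, "first_base")
    if kind == "empirical_first_base":
        if transitions is not None:
            raise ValueError("empirical_first_base takes no transitions")
        return
    if transitions is None:
        raise ValueError("markov requires transitions")
    # Every from-base must have a row; a partial matrix is rejected.
    missing = [b for b in BASES if b not in transitions]
    if missing:
        raise ValueError(f"transitions: missing from-bases {missing}")
    for prev, row in transitions.items():
        if prev not in BASES:
            raise ValueError(f"transitions: unknown from-base {prev!r}")
        check_row(row, f"transitions[{prev!r}]")
```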

AlleleUsageSpec

Per-segment allele weights for V, D, and J. Weights renormalise within each segment.

GenAIRR.reference_models.AlleleUsageSpec dataclass

Per-segment allele-usage weights authored on a cartridge.

Each segment field is a {allele_name: weight} dict — names are AIRR-convention allele names ("IGHV1-2*02" etc.); weights are positive finite floats. Weights need NOT sum to 1.0 at authoring time; the bridge resolver normalises per segment when it lowers into the dense pool-aligned Tuple[float, ...] the engine consumes. An empty segment dict means "fall back to uniform for that segment" — useful when the author only has data for one or two segments and wants the others uniform.
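The per-segment normalisation can be pictured with a short standalone sketch. It is not the bridge resolver: dense_usage is a hypothetical name, and two behaviours are assumptions made for illustration only, namely that pool alleles without an authored weight get 0.0 and that names missing from the pool raise KeyError.

```python
def dense_usage(pool, weights):
    """Lower a {allele_name: weight} dict into a pool-aligned tuple.

    pool    -- allele names in pool order
    weights -- authored weights; an empty dict means uniform usage
    """
    if not weights:
        return tuple(1.0 / len(pool) for _ in pool)
    unknown = sorted(set(weights) - set(pool))
    if unknown:
        # Assumption: unknown names are an error at this stage.
        raise KeyError(f"alleles not in pool: {unknown}")
    total = sum(weights.values())
    # Assumption: alleles without an authored weight get zero usage.
    return tuple(weights.get(name, 0.0) / total for name in pool)


pool = ["IGHV1-2*02", "IGHV1-2*04", "IGHV3-23*01"]
print(dense_usage(pool, {"IGHV1-2*02": 3.0, "IGHV3-23*01": 1.0}))
# (0.75, 0.0, 0.25)
```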

Validation:

  • Each name is a non-empty string.
  • Each weight is a finite, strictly positive float.
  • When chain_type="vj" is supplied to validate, non-empty D-segment entries are rejected (D on a VJ cartridge is an authoring bug, mirrored from the trims / NP-base / P-nucleotide planes).
  • Allele-name existence in the cartridge's V/D/J pools is NOT checked here — the pool lookup lives in the bridge resolver (_dataconfig_extract._allele_usage_from_models) so the spec layer stays decoupled from a specific catalogue.

Introduced in the Allele Usage Estimation v1 slice. Produced either by hand or by GenAIRR.ReferenceCartridgeBuilder.estimate_allele_usage.

Rules

ReferenceRulesSpec

Anchor expectations + alphabet + severity policy. Hand-authored or attached via .with_rules(...) on the builder.

GenAIRR.reference_rules.ReferenceRulesSpec dataclass

The rules plane of a reference cartridge — how the engine should interpret the catalogue's alleles.

A cartridge has four planes:

  • identity (species, locus, reference set, name, source)
  • catalogue (V/D/J alleles)
  • rules — this object
  • empirical models (NP-length + trim distributions, see ReferenceEmpiricalModels)

Plus an orthogonal concept:

  • curation — which subset of the catalogue participates in simulation (Experiment.curate_refdata(...)).

This spec mirrors the Rust ReferenceRules struct one-to-one and is shape-validated here (single-char bases / amino acids, severity strings, A/C/G/T present in the alphabet). The Rust RefDataConfig.validate() remains the authority on whether the catalogue is usable under these rules.

The default constructor reproduces the Rust ReferenceRules::default(): A/C/G/T/N alphabet, V expects C, J accepts W or F, both severities curatable. When the spec is attached to a DataConfig, dataconfig_to_refdata ships it verbatim into the cartridge, overriding the loader's bundled-locus inference (which narrows J to ["W"] for IGH, ["F"] for IGK/IGL/TR*).
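The shape checks and defaults described above can be mimicked with a plain dict. This is only a stand-in for illustration: the key names (alphabet, v_expected_aa, j_expected_aa, severities) and the function rules_shape_problems are hypothetical and are not the fields or methods of ReferenceRulesSpec.

```python
# Stand-in for the defaults described above; key names are hypothetical.
DEFAULT_RULES = {
    "alphabet": ["A", "C", "G", "T", "N"],
    "v_expected_aa": ["C"],
    "j_expected_aa": ["W", "F"],
    "severities": ["curatable", "curatable"],
}


def rules_shape_problems(rules):
    """Return shape problems for a rules dict (empty list means OK)."""
    problems = []
    for key in ("alphabet", "v_expected_aa", "j_expected_aa"):
        for symbol in rules[key]:
            # Bases and amino acids must be single characters.
            if not (isinstance(symbol, str) and len(symbol) == 1):
                problems.append(f"{key}: {symbol!r} is not a single character")
    missing = [b for b in "ACGT" if b not in rules["alphabet"]]
    if missing:
        problems.append(f"alphabet: missing {missing}")
    for severity in rules["severities"]:
        # Severity strings are lower-case "fatal" or "curatable".
        if severity not in ("fatal", "curatable"):
            problems.append(f"severities: unknown value {severity!r}")
    return problems


# A custom J anchor (Y) keeps the shape valid.
custom = dict(DEFAULT_RULES, j_expected_aa=["Y"])
print(rules_shape_problems(custom))  # []
```

Passing the shape check says nothing about whether a catalogue is usable under the rules; as noted above, that decision stays with the Rust validator.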

See docs/reference_cartridge.md for the cartridge model and end-to-end examples (custom J anchor Y, extended alphabet, non-standard species).

AnchorRuleSpec

Per-anchor expected amino-acid plus a required flag. Members of ReferenceRulesSpec.anchors are AnchorRuleSpec instances.

GenAIRR.reference_rules.AnchorRuleSpec dataclass

Anchor rule for one segment (V or J) — one of the rules in a reference cartridge.

Rules define how the engine interprets the catalogue's alleles. The V/J anchor rule says which amino acids the anchor codon may translate to, whether an anchor is required at all, and how anchor-related issues are classified (Fatal vs Curatable) for the strict-vs-AllowCuratable validation gate.

Fields mirror the Rust AnchorRule struct one-to-one. Severity strings are "fatal" or "curatable" (lower-case) — the same vocabulary the validator's issue dicts use.

The default constructor is intentionally restrictive (anchor required, both severities curatable). Pass an empty expected_aa list with required=False to model an anchorless rule (legacy/exploratory catalogues).
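One way to picture how an anchor rule is applied is the standalone sketch below. It is an assumption-laden illustration, not the engine's logic: anchor_issue and its parameter names missing_severity and mismatch_severity are hypothetical, the codon table covers only a few amino acids, and treating an empty expected_aa list as "accept anything" is an assumption about the anchorless case.

```python
# Partial standard-genetic-code table, enough for C / W / F / Y anchors.
CODON_TO_AA = {
    "TGT": "C", "TGC": "C",
    "TGG": "W",
    "TTT": "F", "TTC": "F",
    "TAT": "Y", "TAC": "Y",
}


def anchor_issue(codon, expected_aa, required=True,
                 missing_severity="curatable",
                 mismatch_severity="curatable"):
    """Return None when the anchor is acceptable, else (severity, message).

    codon is None when the allele carries no anchor at all.
    """
    if codon is None:
        if required:
            return (missing_severity, "anchor missing")
        return None
    if not expected_aa:
        # Assumption: an empty list means there is nothing to compare against.
        return None
    aa = CODON_TO_AA.get(codon.upper())
    if aa in expected_aa:
        return None
    return (mismatch_severity,
            f"anchor codon {codon} translates to {aa}, expected one of {expected_aa}")


print(anchor_issue("TGT", ["C"]))                 # None
print(anchor_issue(None, [], required=False))     # None
print(anchor_issue("TAT", ["W", "F"])[0])         # curatable
```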

See docs/reference_cartridge.md for the cartridge model and the full validation/curation/compile flow.