Reference models and rules¶
The typed dataclasses that describe a cartridge's
empirical-model plane (allele usage, NP-base model, trim and
length distributions) and its rules plane (anchor expectations,
alphabet, severity policy). These specs are the schemas the
ReferenceCartridgeBuilder
estimators write into; you can also hand-author them when
estimating from records isn't possible.
For the conceptual model — four typed planes (identity, catalogue, rules, empirical models) — see the reference-cartridge concept page.
Empirical models¶
ReferenceEmpiricalModels¶
The bundle that carries the four empirical models (allele usage,
NP-base model, trim distributions, NP-length distributions). One of these
attaches to DataConfig.reference_models after .build().
GenAIRR.reference_models.ReferenceEmpiricalModels
dataclass
¶
The empirical models plane of a reference cartridge — default distributions the engine samples from when the user doesn't override them at recombine time.
A cartridge has four planes:
- identity (species, locus, reference set, name, source)
- catalogue (V/D/J alleles)
- rules (see ReferenceRulesSpec): how to interpret alleles
- empirical models (this object): defaults for NP lengths and exonuclease trim distributions
Two dicts, both keyed by stable string identifiers:
- np_lengths: {"NP1", "NP2"} → length distribution.
- trims: {"V_3", "D_5", "D_3", "J_5"} → trim distribution.
Each value is an EmpiricalDistributionSpec: a flat list
of (value, weight) pairs the engine consumes directly (no
marginalisation magic in between, unlike the legacy nested-dict
extraction path).
Empty dicts are valid (model absent → fall back to the legacy
DataConfig.NP_lengths/trim_dicts extraction, then to the
uniform placeholder). Partially-populated dicts are valid (NP1
set, NP2 absent → NP1 used; NP2 falls through to legacy).
validate(chain_type=...) enforces shape: known keys only,
non-empty distributions, positive finite weights. The
chain_type argument ("vj" or "vdj") rejects D-trim
models on VJ cartridges — D trims are nonsense without a D
segment, and silently ignoring them would mask the user's
authoring mistake.
See docs/reference_cartridge.md for end-to-end examples.
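The shape checks that validate(chain_type=...) is described as enforcing can be sketched in plain Python. This is a standalone illustration, not GenAIRR's implementation: the function name check_empirical_models and its error messages are hypothetical, and plain lists of (value, weight) pairs stand in for EmpiricalDistributionSpec.

```python
import math

NP_KEYS = {"NP1", "NP2"}
TRIM_KEYS = {"V_3", "D_5", "D_3", "J_5"}


def check_empirical_models(np_lengths, trims, chain_type):
    """Hypothetical stand-in for the shape checks described above."""
    if chain_type not in ("vj", "vdj"):
        raise ValueError(f"unknown chain_type: {chain_type!r}")
    for name, models, known in (("np_lengths", np_lengths, NP_KEYS),
                                ("trims", trims, TRIM_KEYS)):
        # Known keys only.
        unknown = set(models) - known
        if unknown:
            raise ValueError(f"{name}: unknown keys {sorted(unknown)}")
        for key, pairs in models.items():
            # Non-empty distributions with positive finite weights.
            if not pairs:
                raise ValueError(f"{name}[{key!r}]: empty distribution")
            for _value, weight in pairs:
                if not (math.isfinite(weight) and weight > 0):
                    raise ValueError(f"{name}[{key!r}]: bad weight {weight!r}")
    # D trims make no sense without a D segment.
    if chain_type == "vj":
        d_keys = sorted(k for k in trims if k.startswith("D_"))
        if d_keys:
            raise ValueError(f"D-trim models on a VJ cartridge: {d_keys}")


# Empty and partially populated dicts both pass.
check_empirical_models({}, {}, "vdj")
check_empirical_models({"NP1": [(0, 1.0), (3, 2.5)]}, {"V_3": [(0, 1.0)]}, "vj")
```

Raising on a D-trim key for a VJ cartridge, rather than dropping it, follows the rationale given above: silently ignoring the entry would hide an authoring mistake.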
EmpiricalDistributionSpec¶
The [(value, weight), ...] shape used for trim and NP-length
distributions. Weights renormalise; zero-weight entries drop out
of the proposal support.
GenAIRR.reference_models.EmpiricalDistributionSpec
dataclass
¶
One empirical distribution as a flat (value, weight) list.
values is a list of (int, float) pairs. Values are
non-negative ints (lengths, trim amounts). Weights are positive
finite floats; the engine normalises them at sample time, so
callers don't need to normalise themselves — they just need to
be positive. Duplicate values are not rejected at this layer (the
engine accumulates them), but they are discouraged as a matter of style.
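As a rough illustration of how a flat (value, weight) list turns into a sampling distribution, the sketch below merges duplicate values and rescales the weights. It is plain Python, not the engine's code: the helper names normalise and sample are hypothetical, and summing the weights of duplicate values is an assumption about what "accumulates" means.

```python
import random
from collections import defaultdict


def normalise(pairs):
    """Merge duplicate values (assumed: weights add) and rescale to sum to 1."""
    totals = defaultdict(float)
    for value, weight in pairs:
        totals[value] += weight
    grand_total = sum(totals.values())
    return {value: weight / grand_total for value, weight in sorted(totals.items())}


def sample(pairs, rng):
    """Draw one value with probability proportional to its total weight."""
    probs = normalise(pairs)
    values = list(probs)
    return rng.choices(values, weights=[probs[v] for v in values], k=1)[0]


# Weights need not sum to 1; the value 1 appears twice on purpose.
trim_v3 = [(0, 2.0), (1, 1.0), (2, 1.0), (1, 4.0)]
print(normalise(trim_v3))  # {0: 0.25, 1: 0.625, 2: 0.125}
print(sample(trim_v3, random.Random(0)))
```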
NpBaseModelSpec¶
The NP-base sampling model. Three kinds: uniform, empirical-first-base
(per-base marginal), and Markov (first-base marginal + 4×4
transition matrix). The Markov form is the default produced by
estimate_np_base_model(records, kind="markov").
GenAIRR.reference_models.NpBaseModelSpec
dataclass
¶
One V(D)J N-addition base sampling model.
Three kinds are recognised by the Python validator:
- "uniform" (default-equivalent): every NP position samples uniformly from A/C/G/T. Byte-identical to the pre-slice engine when no model is configured. first_base and transitions must be None.
- "empirical_first_base": every NP position samples independently from the supplied first_base categorical distribution. first_base is required; transitions must be None. This is the v1 cartridge-owned biology surface for the typed NP base model.
- "markov": true Markov chain where each base is sampled conditional on the previously emitted base via the transitions matrix. Both first_base and transitions are required.
Wired end-to-end as of the NP-Markov slice. The
engine ships an NpBaseGenerator trait whose
MarkovBaseGenerator implementation threads the
previous base through sample_base(rng,
previous_base, …); the typed Python spec lowers
directly into it via push_generate_np(...,
markov_transitions=...). Plan-signature replay
folds the full Markov payload (first-base row + 4
transition rows in canonical A/C/G/T order). See
docs/junction_n_addition_audit.md (Markov
shipped) and the validation matrix's "NP base
models / Markov N-addition" row.
Validation rules:
- kind must be one of {"uniform", "empirical_first_base", "markov"}.
- first_base (when present) is dict[str, float] with keys from {"A", "C", "G", "T"}. Values are finite, non-negative, with at least one strictly positive. Unknown bases are rejected.
- transitions (when present) is dict[str, dict[str, float]]: the outer key is the previously emitted base, the inner dict is the next-base categorical. Same per-base validation rules; each row must be non-empty and have at least one positive weight. Outer-key coverage of A/C/G/T is enforced: if any from-base is missing, the validator raises (a partial Markov matrix is almost certainly an authoring bug).
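The "markov" kind can be pictured with a few lines of plain Python. This is only a sketch of the sampling scheme, not the engine's MarkovBaseGenerator: the function names sample_np and _draw are hypothetical, and the dict shapes follow the first_base and transitions descriptions above.

```python
import random

BASES = "ACGT"


def _draw(weights, rng):
    """Draw one base from a {base: weight} categorical, skipping zero weights."""
    present = [b for b in BASES if weights.get(b, 0.0) > 0]
    return rng.choices(present, weights=[weights[b] for b in present], k=1)[0]


def sample_np(length, first_base, transitions, rng):
    """First base from the marginal, each later base from the previous base's row."""
    if length == 0:
        return ""
    seq = [_draw(first_base, rng)]
    while len(seq) < length:
        seq.append(_draw(transitions[seq[-1]], rng))
    return "".join(seq)


# A deterministic toy matrix (A→C→G→T→A) makes the conditioning visible.
cycle = {"A": {"C": 1.0}, "C": {"G": 1.0}, "G": {"T": 1.0}, "T": {"A": 1.0}}
print(sample_np(6, {"A": 1.0}, cycle, random.Random(0)))  # ACGTAC
```

The sketch also shows why full outer-key coverage matters: a missing from-base row would raise a KeyError the first time that base is emitted.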
AlleleUsageSpec¶
Per-segment allele weights for V, D, and J. Weights renormalise within each segment.
GenAIRR.reference_models.AlleleUsageSpec
dataclass
¶
Per-segment allele-usage weights authored on a cartridge.
Each segment field is a {allele_name: weight} dict — names
are AIRR-convention allele names ("IGHV1-2*02" etc.); weights
are positive finite floats. Weights need NOT sum to 1.0 at
authoring time; the bridge resolver normalises per segment when
it lowers into the dense pool-aligned Tuple[float, ...] the
engine consumes. An empty segment dict means "fall back to
uniform for that segment" — useful when the author only has
data for one or two segments and wants the others uniform.
Validation:
- Each name is a non-empty string.
- Each weight is a finite, strictly positive float.
- When chain_type="vj" is supplied to validate, non-empty D-segment entries are rejected (D on a VJ cartridge is an authoring bug, mirrored from the trims / NP-base / P-nucleotide planes).
- Allele-name existence in the cartridge's V/D/J pools is NOT checked here. The pool lookup lives in the bridge resolver (_dataconfig_extract._allele_usage_from_models) so the spec layer stays decoupled from a specific catalogue.
Slice: Allele Usage Estimation v1. Produced either by hand or
by GenAIRR.ReferenceCartridgeBuilder.estimate_allele_usage.
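The per-segment normalisation described above can be sketched as follows. This is not the bridge resolver's code: lower_usage is a hypothetical helper, and two behaviours are assumptions made for the illustration, namely that pool alleles absent from the dict get weight 0 and that a name missing from the pool raises.

```python
def lower_usage(weights, pool):
    """Normalise a {allele_name: weight} dict into a pool-aligned tuple."""
    if not weights:
        # Empty segment dict: fall back to uniform over the pool.
        return tuple(1.0 / len(pool) for _ in pool)
    missing = set(weights) - set(pool)
    if missing:
        # Assumption: unknown names are an error at lowering time.
        raise KeyError(f"alleles not in pool: {sorted(missing)}")
    total = sum(weights.values())
    # Assumption: pool alleles without an authored weight get 0.
    return tuple(weights.get(name, 0.0) / total for name in pool)


v_pool = ["IGHV1-2*02", "IGHV1-3*01", "IGHV3-23*01"]
print(lower_usage({"IGHV1-2*02": 3.0, "IGHV3-23*01": 1.0}, v_pool))  # (0.75, 0.0, 0.25)
print(lower_usage({}, v_pool))
```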
Rules¶
ReferenceRulesSpec¶
Anchor expectations + alphabet + severity policy. Hand-authored
or attached via .with_rules(...) on the builder.
GenAIRR.reference_rules.ReferenceRulesSpec
dataclass
¶
The rules plane of a reference cartridge — how the engine should interpret the catalogue's alleles.
A cartridge has four planes:
- identity (species, locus, reference set, name, source)
- catalogue (V/D/J alleles)
- rules — this object
- empirical models (NP-length + trim distributions, see ReferenceEmpiricalModels)
Plus an orthogonal concept:
- curation: which subset of the catalogue participates in simulation (Experiment.curate_refdata(...)).
This spec mirrors the Rust ReferenceRules struct one-to-one
and is shape-validated here (single-char bases / amino acids,
severity strings, A/C/G/T present in the alphabet). The Rust
RefDataConfig.validate() remains the authority on whether the
catalogue is usable under these rules.
Default constructor reproduces the Rust ReferenceRules::default():
A/C/G/T/N alphabet, V expects C, J accepts W or F,
both severities curatable. When the spec is attached to a
DataConfig, dataconfig_to_refdata ships it verbatim
into the cartridge — overriding the loader's bundled-locus
inference (which narrows J to ["W"] for IGH, ["F"] for
IGK/IGL/TR*).
See docs/reference_cartridge.md for the cartridge model and
end-to-end examples (custom J anchor Y, extended alphabet,
non-standard species).
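The Python-side shape validation mentioned above (single-character symbols, A/C/G/T present in the alphabet, known severity strings) might look roughly like this. The function check_rules_shape and its parameter names are hypothetical and do not reflect the real field layout of ReferenceRulesSpec; the sketch only mirrors the checks named in the text.

```python
SEVERITIES = {"fatal", "curatable"}


def check_rules_shape(alphabet, v_expected_aa, j_expected_aa, severities):
    """Hypothetical shape check; catalogue usability is a separate concern."""
    for symbol in [*alphabet, *v_expected_aa, *j_expected_aa]:
        if not (isinstance(symbol, str) and len(symbol) == 1):
            raise ValueError(f"expected a single character, got {symbol!r}")
    missing = set("ACGT") - set(alphabet)
    if missing:
        raise ValueError(f"alphabet is missing {sorted(missing)}")
    bad = set(severities) - SEVERITIES
    if bad:
        raise ValueError(f"unknown severity strings: {sorted(bad)}")


# Values matching the defaults described above: A/C/G/T/N alphabet,
# V expects C, J accepts W or F, both severities curatable.
check_rules_shape(["A", "C", "G", "T", "N"], ["C"], ["W", "F"],
                  ["curatable", "curatable"])

# A custom J anchor (Y) passes the same shape check.
check_rules_shape(["A", "C", "G", "T", "N"], ["C"], ["Y"],
                  ["fatal", "curatable"])
```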
AnchorRuleSpec¶
Per-anchor expected amino acids plus a required flag. Members of
ReferenceRulesSpec.anchors are AnchorRuleSpec instances.
GenAIRR.reference_rules.AnchorRuleSpec
dataclass
¶
Anchor rule for one segment (V or J) — one of the rules in a reference cartridge.
Rules define how the engine interprets the catalogue's alleles. The V/J anchor rule says which amino acids the anchor codon may translate to, whether an anchor is required at all, and how anchor-related issues are classified (Fatal vs Curatable) for the strict-vs-AllowCuratable validation gate.
Fields mirror the Rust AnchorRule struct one-to-one. Severity
strings are "fatal" or "curatable" (lower-case) — the same
vocabulary the validator's issue dicts use.
The default constructor is intentionally restrictive (anchor
required, both severities curatable). Pass an empty
expected_aa list with required=False to model an
anchorless rule (legacy/exploratory catalogues).
See docs/reference_cartridge.md for the cartridge model and
the full validation/curation/compile flow.
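To make the interplay of expected_aa and required concrete, here is a small stand-in dataclass. It is not AnchorRuleSpec itself: the class AnchorRuleSketch and its accepts method are hypothetical, severity fields are left out, and treating an empty expected_aa as "any residue is acceptable" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AnchorRuleSketch:
    """Hypothetical stand-in with only the two fields discussed above."""
    expected_aa: Tuple[str, ...] = ("C",)
    required: bool = True

    def accepts(self, anchor_aa: Optional[str]) -> bool:
        """anchor_aa is the translated anchor residue, or None if absent."""
        if anchor_aa is None:
            return not self.required
        # Assumption: an empty expected_aa places no constraint on the residue.
        return not self.expected_aa or anchor_aa in self.expected_aa


j_rule = AnchorRuleSketch(expected_aa=("W", "F"))
print(j_rule.accepts("W"), j_rule.accepts("Y"), j_rule.accepts(None))  # True False False

# An anchorless rule for legacy or exploratory catalogues.
anchorless = AnchorRuleSketch(expected_aa=(), required=False)
print(anchorless.accepts(None))  # True
```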