ReferenceCartridgeBuilder
A fluent builder for custom reference cartridges
from raw FASTA. Every call returns the same builder extended by
one stage; at .build() the engine validates the
typed planes and stamps a schema_sha256 on the
resulting DataConfig. Every stage's inputs,
warnings, and rejection details are captured in
builder.report() — the cartridge ships with an
auditable trail.
For the broader workflow, see Build a reference cartridge; for the estimator deep dive, see Estimate cartridge models from real data.
Common methods
| Method | Purpose |
|---|---|
| .from_fasta(...) | Class-level constructor; parses V/D/J FASTA into the catalogue |
| .infer_identity(...) | Resolve species / chain / locus / reference-set / name |
| .infer_v_subregions() | Annotate V alleles with FWR/CDR intervals (required for v_subregion_rates) |
| .with_rules(...) | Attach an explicit ReferenceRulesSpec |
| .with_models(...) | Attach an explicit ReferenceEmpiricalModels bundle |
| .estimate_allele_usage(records) | Estimate per-segment allele weights |
| .estimate_trim_distributions(records) | Estimate per-key trim distributions |
| .estimate_np_length_distributions(records) | Estimate NP1 / NP2 length distributions |
| .estimate_np_base_model(records, kind=...) | Estimate NP-base model (Markov or empirical first-base) |
| .estimate_p_nucleotide_lengths(records) | Estimate palindromic-insertion length distributions |
| .build() | Finalise, stamp checksum, return the DataConfig |
| .report() | Get the CartridgeBuildReport with stages / warnings / rejections |
Class reference
GenAIRR.cartridge_builder.ReferenceCartridgeBuilder
Stage-based builder for a :class:~GenAIRR.DataConfig.
See :mod:GenAIRR.cartridge_builder module docs and
docs/reference_cartridge_authoring_audit.md for the design.
Usage::
    builder = ReferenceCartridgeBuilder.from_fasta(
        v_fasta="igh_v.fasta",
        j_fasta="igh_j.fasta",
        d_fasta="igh_d.fasta",
        chain_type="BCR_HEAVY",
    )
    builder.infer_identity(species="HUMAN", reference_set="MY_REF")
    builder.infer_v_subregions()
    builder.with_rules(my_rules_spec)
    builder.with_models(my_empirical_models)
    cfg = builder.build()
    report = builder.report()
Each stage method mutates the builder in place and returns
self so calls can be chained.
from_fasta(*, v_fasta, j_fasta, d_fasta=None, c_fasta=None, chain_type)
classmethod
Build a builder from FASTA inputs.
v_fasta and j_fasta are required; d_fasta is
required only when the chain type has a D segment
(BCR_HEAVY / TCR_BETA / TCR_DELTA). c_fasta
is accepted, but the parsed C alleles are not yet wired into
the resulting cartridge; v1 is limited to V/D/J authoring.
Records dropped during parsing land in
:attr:CartridgeBuildReport.rejected.
Each *_fasta argument can be a path-like, an open text
file, or a raw FASTA string (a string containing a newline
and starting with > is parsed directly).
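As a rough illustration of the three accepted input shapes, the sketch below resolves a source to FASTA text and then parses it. read_fasta_text and parse_fasta are hypothetical helper names, not part of GenAIRR, and the real parser may treat headers and sequences differently.

```python
import io
import os
from pathlib import Path


def read_fasta_text(source):
    """Return FASTA text from an open text file, a raw string, or a path-like."""
    # Open text file: anything exposing .read().
    if hasattr(source, "read"):
        return source.read()
    # Raw FASTA: a string containing a newline and starting with ">".
    if isinstance(source, str) and "\n" in source and source.startswith(">"):
        return source
    # Anything else is treated as a filesystem path.
    return Path(os.fspath(source)).read_text()


def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


# The same content reaches the parser whether it arrives as a handle or a string.
raw = ">IGHV1*01\nacgt\nAC\n>IGHV2*01\nGG\n"
alleles = parse_fasta(read_fasta_text(io.StringIO(raw)))
```

Checking for a file handle first keeps the raw-string test from ever being applied to a non-string object.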
infer_identity(*, species=None, locus=None, reference_set=None, name=None, source='ReferenceCartridgeBuilder')
Populate the cartridge's identity / metadata plane.
species / locus / reference_set are user-supplied
— none of these can be reliably inferred from FASTA alone, so
the audit (§7) requires explicit input. When locus is
omitted the conventional label is derived from the chain
type (BCR_HEAVY → "IGH", etc.).
source is recorded on every parsed allele's .source
field so downstream consumers can identify cartridges
produced by this builder vs the bundled OGRDB / IMGT
loaders.
infer_v_subregions()
Derive per-V-allele IMGT subregion intervals.
Walks every V allele with a non-empty gapped_seq and
runs :func:compute_v_region_boundaries, attaching the
resulting {label: (start, end)} dict to
allele.subregions. Alleles without gapped_seq (or
with a derivation that raises) are left unset and counted
in the per-stage report.
Idempotent — calling twice re-derives and overwrites the
previous output, with a replaced=True flag on the
report entry.
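To make the {label: (start, end)} shape concrete, here is a minimal sketch of one way to map IMGT-gapped positions onto ungapped coordinates. It is not the implementation of compute_v_region_boundaries: the label names, the 0-based half-open convention, and the helper name subregions_from_gapped are assumptions. The gapped bounds follow the standard IMGT unique numbering (codons 1-26, 27-38, 39-55, 56-65, 66-104).

```python
# 0-based half-open nucleotide offsets into the *gapped* V sequence.
IMGT_GAPPED_BOUNDS = {
    "FWR1": (0, 78),
    "CDR1": (78, 114),
    "FWR2": (114, 165),
    "CDR2": (165, 195),
    "FWR3": (195, 312),
}


def subregions_from_gapped(gapped_seq, gap_chars=".-"):
    """Return {label: (start, end)} in ungapped coordinates."""
    # prefix[i] is the number of non-gap characters in gapped_seq[:i].
    prefix = [0]
    for ch in gapped_seq:
        prefix.append(prefix[-1] + (ch not in gap_chars))

    intervals = {}
    for label, (g_start, g_end) in IMGT_GAPPED_BOUNDS.items():
        if g_start >= len(gapped_seq):
            break  # the allele is truncated before this region starts
        g_end = min(g_end, len(gapped_seq))
        intervals[label] = (prefix[g_start], prefix[g_end])
    return intervals


# Toy allele: CDR1 carries 12 gap positions, CDR2 carries 6.
gapped = "A" * 78 + "G" * 24 + "." * 12 + "C" * 51 + "T" * 24 + "." * 6 + "A" * 117
intervals = subregions_from_gapped(gapped)
```

Because the lookup only needs the gapped sequence, calling it twice on the same allele yields the same intervals, which is consistent with the overwrite-on-rerun behaviour described above.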
with_rules(reference_rules)
Attach an explicit :class:ReferenceRulesSpec.
The user-supplied spec is validated immediately so an author's bug surfaces at build-stage time, not at compile time. The report records the rule fields the spec defines.
with_models(reference_models)
Attach an explicit :class:ReferenceEmpiricalModels.
The supplied bundle is validated against the chain type (D-end keys rejected on VJ chains, etc.). The report records which typed planes are populated.
estimate_allele_usage(rearrangements, *, min_count=1.0, ambiguous='fractional', replace=True)
Estimate per-segment allele-usage weights from observed AIRR rearrangement records.
rearrangements accepts:
- a list of dicts (each row is one AIRR record),
- a path-like (filesystem path to an AIRR TSV, parsed via
  csv.DictReader with tab delimiter), or
- an open text file handle pointing at AIRR TSV.
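The three accepted shapes can be normalised into one list of row dicts along these lines. This is a sketch only; load_rearrangements is a hypothetical name, and the builder's own loader may stream rows rather than materialise them.

```python
import csv
import io
import os


def load_rearrangements(source):
    """Normalise a list of dicts, an open handle, or a path into row dicts."""
    if isinstance(source, list):
        return source
    if hasattr(source, "read"):
        # Open text file handle pointing at AIRR TSV.
        return list(csv.DictReader(source, delimiter="\t"))
    # Path-like: open it ourselves; newline="" is what the csv module expects.
    with open(os.fspath(source), newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


tsv = "sequence_id\tv_call\tj_call\nseq1\tIGHV1*01\tIGHJ4*02\n"
rows = load_rearrangements(io.StringIO(tsv))
```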
ambiguous selects the tie-set policy (column values
like "IGHV1*01,IGHV2*01"):
- "fractional" (default): split one row's credit 1.0 evenly across all known alleles in the tie set. Unknown allele names in the tie set are excluded; if NO names in the tie set are known to the cartridge, the row is recorded as unknown_allele and skipped.
- "truth_first": credit only the first comma-separated allele in the tie set. Matches the existing :func:GenAIRR._mcp_summary convention.
- "reject": drop ambiguous (multi-call) rows entirely and record them in report.rejected.
min_count drops alleles whose final per-segment count
is strictly below the threshold. The drop happens before
normalisation; the per-segment weights that remain are then
renormalised to sum to 1.0.
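The tie-set policies and the min_count step can be sketched for a single segment as follows. This is an illustration under assumptions, not the builder's code: estimate_usage is a hypothetical name, the "ambiguous_call" reason label is invented for the sketch, and treating an unknown first allele under "truth_first" as unknown_allele is a guess.

```python
from collections import defaultdict


def estimate_usage(calls, known, ambiguous="fractional", min_count=1.0):
    """Return (weights, skipped) for one segment's call column."""
    counts = defaultdict(float)
    skipped = []
    for call in calls:
        names = [n.strip() for n in call.split(",") if n.strip()]
        if ambiguous == "reject" and len(names) > 1:
            skipped.append((call, "ambiguous_call"))  # hypothetical label
            continue
        if ambiguous == "truth_first":
            names = names[:1]
        hits = [n for n in names if n in known]
        if not hits:
            skipped.append((call, "unknown_allele"))
            continue
        # Each row carries a total credit of 1.0, split across known alleles.
        for name in hits:
            counts[name] += 1.0 / len(hits)

    # Drop alleles strictly below min_count, then renormalise the rest.
    kept = {n: c for n, c in counts.items() if c >= min_count}
    total = sum(kept.values())
    weights = {n: c / total for n, c in kept.items()} if total else {}
    return weights, skipped


calls = ["IGHV1*01", "IGHV1*01,IGHV2*01", "IGHV2*01,IGHV9*99", "IGHV9*99"]
known = {"IGHV1*01", "IGHV2*01"}
fractional, skipped_fractional = estimate_usage(calls, known)
rejected, skipped_reject = estimate_usage(calls, known, ambiguous="reject")
```

In the fractional run the third row gives its full credit to IGHV2*01 because the unknown IGHV9*99 is excluded from the tie set before splitting.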
replace (default True) controls idempotency. When
True, calling :meth:estimate_allele_usage twice
overwrites the previous spec and writes replaced=True
on the new stage entry. When False, the second call
raises :class:ValueError.
Updates self._reference_models.allele_usage so a
downstream :meth:build carries the estimated spec into
the cartridge.
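The chaining and replace rules are easy to see on a toy stand-in. ToyBuilder below is hypothetical and mirrors only the two behaviours described here: stage methods return self, and a second call either overwrites (marking the new stage entry with replaced=True) or raises ValueError.

```python
class ToyBuilder:
    """Stand-in for the real builder: chaining and replace semantics only."""

    def __init__(self):
        self.allele_usage = None
        self.stages = []

    def estimate_allele_usage(self, weights, *, replace=True):
        already_set = self.allele_usage is not None
        if already_set and not replace:
            raise ValueError("allele_usage already estimated; pass replace=True")
        self.allele_usage = dict(weights)
        entry = {"stage": "estimate_allele_usage"}
        if already_set:
            entry["replaced"] = True
        self.stages.append(entry)
        return self  # returning self is what makes chaining possible


toy = (
    ToyBuilder()
    .estimate_allele_usage({"IGHV1*01": 1.0})
    .estimate_allele_usage({"IGHV2*01": 1.0})  # overwrites the first spec
)
```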
estimate_trim_distributions(rearrangements, *, min_count=1, pseudocount=0.0, replace=True)
Estimate per-key trim distributions from observed AIRR rearrangement records.
Writes EmpiricalDistributionSpec instances into
self._reference_models.trims keyed by "V_3" /
"D_5" / "D_3" / "J_5" (see
:data:GenAIRR.reference_models.TRIM_KEYS). VJ cartridges
only produce the V_3 and J_5 keys (see
:data:GenAIRR.reference_models.TRIM_KEYS_VJ); the D-end
keys are skipped silently because the cartridge has no D
segment to trim.
rearrangements accepts the same shapes as
:meth:estimate_allele_usage:
- a list of dicts (each row is one AIRR record),
- a path-like to an AIRR TSV (parsed via
  csv.DictReader with tab delimiter), or
- an open text file handle pointing at AIRR TSV.
AIRR columns consumed (per audit §1.3):
- VJ: v_trim_3, j_trim_5.
- VDJ: v_trim_3, d_trim_5, d_trim_3, j_trim_5.
Columns deliberately ignored: v_trim_5 / j_trim_3
(no engine pass — hard-zero in projection), and the
observation-stage end_loss_5_length / end_loss_3_length
(separate corruption surface — see
docs/primer_trim_end_loss_audit.md).
Per-row validation is field-local: a malformed or
missing value in one trim column drops that field's
contribution only — the row still feeds its other
well-formed fields. Per-row structured entries land in
report.rejected with reasons
"missing_required_column" / "malformed_trim_value"
/ "negative_trim_value", each carrying the AIRR
column name.
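Field-local validation means one bad column never discards the whole row. A minimal sketch of that behaviour follows; collect_trims and the shape of the rejection entries are assumptions, while the reason strings and column names are the ones listed above.

```python
TRIM_COLUMNS = {
    "V_3": "v_trim_3",
    "D_5": "d_trim_5",
    "D_3": "d_trim_3",
    "J_5": "j_trim_5",
}


def collect_trims(rows, keys=("V_3", "D_5", "D_3", "J_5")):
    """Gather valid trim values per key; reject bad fields individually."""
    observed = {key: [] for key in keys}
    rejected = []
    for index, row in enumerate(rows):
        for key in keys:
            column = TRIM_COLUMNS[key]
            raw = row.get(column)
            if raw is None or str(raw).strip() == "":
                reason = "missing_required_column"
            else:
                try:
                    value = int(raw)
                except (TypeError, ValueError):
                    reason = "malformed_trim_value"
                else:
                    reason = "negative_trim_value" if value < 0 else None
            if reason is None:
                observed[key].append(value)
            else:
                # Only this field is dropped; the loop moves on to the next key.
                rejected.append({"row": index, "column": column, "reason": reason})
    return observed, rejected


rows = [
    {"v_trim_3": "2", "d_trim_5": "x", "d_trim_3": "-1", "j_trim_5": "4"},
    {"v_trim_3": "3", "d_trim_5": "0", "d_trim_3": "1"},
]
observed, rejected = collect_trims(rows)
```

A VJ cartridge would pass keys=("V_3", "J_5") so the D-end columns are never consulted.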
min_count (int, default 1) drops trim values whose
observed integer count is strictly below the threshold
before normalisation.
pseudocount (float, default 0.0) adds a uniform prior
to every observed trim value before normalisation
(no support expansion; values never observed stay
unobserved). Applied AFTER the min_count filter so
the filter looks at raw observations.
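The order of operations (filter on raw counts, then smooth, then normalise) can be written out as a small sketch. to_distribution is a hypothetical helper, and the list-of-pairs output is an assumed representation rather than the actual EmpiricalDistributionSpec layout.

```python
from collections import Counter


def to_distribution(values, min_count=1, pseudocount=0.0):
    """Turn observed integers into sorted (value, weight) pairs."""
    counts = Counter(values)
    # 1. min_count looks at raw observation counts.
    kept = {v: c for v, c in counts.items() if c >= min_count}
    # 2. pseudocount is added only to values that survived the filter,
    #    so the support is never expanded.
    smoothed = {v: c + pseudocount for v, c in kept.items()}
    # 3. Normalise so the weights sum to 1.0.
    total = sum(smoothed.values())
    if not total:
        return []
    return sorted((v, w / total) for v, w in smoothed.items())


trims = [0, 0, 0, 1, 2, 2]
plain = to_distribution(trims, min_count=2)
smoothed = to_distribution(trims, min_count=2, pseudocount=1.0)
```

With min_count=2 the single observation of 1 is dropped in both runs; the pseudocount then pulls the two surviving weights closer together without reintroducing it.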
replace (default True) controls idempotency. When
True, calling this method twice overwrites the
previous specs and writes replaced=True on the new
stage entry. When False AND a prior typed-plane
trims is already attached to self._reference_models,
the call raises :class:ValueError before consuming any
records.
estimate_np_length_distributions(rearrangements, *, min_count=1, pseudocount=0.0, replace=True)
Estimate per-key NP length distributions from observed AIRR rearrangement records.
Writes EmpiricalDistributionSpec instances into
self._reference_models.np_lengths keyed by
"NP1" and (on VDJ cartridges) "NP2" (see
:data:GenAIRR.reference_models.NP_KEYS). VJ
cartridges produce only the NP1 key; the
cartridge has no NP2 region. The boundary is
enforced at estimation time because the typed-plane
validator does NOT chain-type-reject NP2 on
VJ at attach time (see audit §2.2 of
docs/np_length_estimation_design.md).
AIRR columns consumed (per audit §1.2):
- np1_length (always).
- np2_length (VDJ only; VJ rows with a non-zero np2_length raise a one-time warning and the contribution is skipped).
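The chain-type boundary for NP2 might look like the sketch below. collect_np_lengths is a hypothetical name, the warning text is invented, and per-field validation is left out to keep the focus on the one-time warning.

```python
def collect_np_lengths(rows, has_d):
    """Collect NP1 (and, with a D segment, NP2) lengths from AIRR rows."""
    lengths = {"NP1": [], "NP2": []} if has_d else {"NP1": []}
    stage_warnings = []
    for row in rows:
        lengths["NP1"].append(int(row["np1_length"]))
        if has_d:
            lengths["NP2"].append(int(row["np2_length"]))
        elif int(row.get("np2_length") or 0) != 0 and not stage_warnings:
            # Warn once, then keep skipping NP2 silently for later rows.
            stage_warnings.append("non-zero np2_length on a VJ cartridge; NP2 skipped")
    return lengths, stage_warnings


rows = [
    {"np1_length": "3", "np2_length": "2"},
    {"np1_length": "5", "np2_length": "4"},
    {"np1_length": "1"},
]
vj_lengths, vj_warnings = collect_np_lengths(rows, has_d=False)
vdj_lengths, vdj_warnings = collect_np_lengths(rows[:2], has_d=True)
```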
Columns deliberately ignored: np1 / np2
(sequence-derived length is sensitive to
post-claim reabsorption — see audit §5.3),
p_v_3_length / p_d_5_length /
p_d_3_length / p_j_5_length (separate
P-nucleotide biology — see
docs/p_nucleotide_design.md),
and junction_length (aggregate arithmetic too
fragile across simulators — audit §5.2).
Per-row validation is field-local: a malformed
or missing value in one NP column drops only that
key's contribution; the row's other column still
feeds. Per-row structured entries land in
report.rejected with reasons
"missing_required_column" /
"malformed_length_value" /
"negative_length_value", each carrying the
AIRR column name.
min_count (int, default 1) drops length values
whose observed integer count is strictly below
the threshold before normalisation.
pseudocount (float, default 0.0) adds a uniform
prior to every observed length value before
normalisation (no support expansion). Applied
AFTER the min_count filter.
replace (default True) controls idempotency.
When False AND a prior typed-plane
np_lengths is attached to
self._reference_models, the call raises
:class:ValueError before consuming any records.
estimate_np_base_model(rearrangements, *, kind='markov', min_count=1, pseudocount=0.0, replace=True)
Estimate per-key NP base sampling model from observed AIRR rearrangement records.
Writes NpBaseModelSpec instances into
self._reference_models.np_bases keyed by
"NP1" (always) and (on VDJ cartridges)
"NP2". VJ cartridges produce only the NP1
key.
kind selects the model family:
- "empirical_first_base": estimates a single categorical over A/C/G/T from the full base composition of every observed NP string (every position, not just position 0). The engine samples every NP position independently from this distribution. The name is preserved for API stability with existing cartridges; the biologically correct estimate is the full-base composition.
- "markov" (default): estimates a first-base row from position 0 of each NP string plus a 4×4 transition matrix from every observed (prev, next) pair.
AIRR columns consumed (per audit §1.2):
- np1 (always).
- np2 (VDJ only; VJ rows with a non-empty np2 raise a one-time warning and the contribution is skipped).
Columns deliberately ignored: junction (audit
§1.2), p_v_3_length / p_d_5_length /
p_d_3_length / p_j_5_length (separate
P-nucleotide biology), np1_length /
np2_length (length-only — owned by
:meth:estimate_np_length_distributions).
Per-row validation is field-local: a malformed
or missing value in one NP column drops only that
key's contribution. Per-row structured entries
land in report.rejected with reasons
"missing_required_column" (empty / missing
string) or "noncanonical_base" (any character
outside {A,C,G,T} after uppercasing).
min_count (int, default 1) drops first-base
categories whose observed count is strictly below
the threshold before normalisation. For markov,
min_count applies to the first-base row
only: dropping transition cells could leave a
from-base row with no positive weights, which the
:class:NpBaseModelSpec validator rejects. v1
keeps transition rows intact and surfaces the
first-base drops in below_min_count.first_base.
pseudocount (float, default 0.0) adds a uniform
prior:
- empirical_first_base: added to every A/C/G/T base category before normalisation.
- markov: added to every A/C/G/T first-base category AND to every cell of the 4×4 transition matrix.
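The two kind families and their pseudocount handling can be sketched for one NP key as below. This is an illustration, not the builder's estimator: the function name, the returned dict layout, and the assumption that inputs are already validated are all hypothetical, and the min_count filter is omitted.

```python
from collections import Counter

BASES = "ACGT"


def _normalise(counts, pseudocount):
    """Add the uniform prior to every base, then scale to sum to 1.0."""
    smoothed = {b: counts[b] + pseudocount for b in BASES}
    total = sum(smoothed.values())
    return {b: (w / total if total else 0.0) for b, w in smoothed.items()}


def estimate_base_model(np_strings, kind="markov", pseudocount=0.0):
    seqs = [s.upper() for s in np_strings if s]
    if kind == "empirical_first_base":
        # Full base composition: every position of every string.
        return {"first_base": _normalise(Counter("".join(seqs)), pseudocount)}
    if kind != "markov":
        raise ValueError(f"unknown kind: {kind!r}")
    first = Counter(s[0] for s in seqs)  # position 0 only
    pairs = {prev: Counter() for prev in BASES}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            pairs[prev][nxt] += 1  # every observed (prev, next) pair
    return {
        "first_base": _normalise(first, pseudocount),
        "transitions": {p: _normalise(pairs[p], pseudocount) for p in BASES},
    }


observed = ["AC", "AG", "CA"]
empirical = estimate_base_model(observed, kind="empirical_first_base")
markov = estimate_base_model(observed)
markov_smoothed = estimate_base_model(observed, pseudocount=1.0)
```

Without a pseudocount, a from-base that never appears as a prev (G and T here) produces an all-zero transition row, which is the situation the validator is described as rejecting; a positive pseudocount turns such rows uniform.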
replace (default True) controls idempotency.
When False AND a prior typed-plane
np_bases is already attached, the call raises
:class:ValueError before consuming any records.
estimate_p_nucleotide_lengths(rearrangements, *, min_count=1, pseudocount=0.0, replace=True)
Estimate per-end P-nucleotide length distributions from observed AIRR rearrangement records.
Writes EmpiricalDistributionSpec instances into
self._reference_models.p_nucleotide_lengths
keyed by "V_3" and "J_5" (always) plus
"D_5" and "D_3" (VDJ only). VJ cartridges
produce only the V_3 and J_5 keys; the typed-plane
validator chain-type-rejects D-end keys on VJ at
attach time.
Provenance warning. This estimator requires
AIRR-like records that ALREADY carry GenAIRR's
P-length fields (p_v_3_length /
p_d_5_length / p_d_3_length /
p_j_5_length). It does NOT infer P-lengths
from generic AIRR junction sequences, NP strings,
or trim arithmetic — that inference problem is
out of scope for v1. External AIRR tools (IgBLAST,
MiXCR, …) do not model P-nucleotide additions, so
their output will either omit these columns
entirely (rejection storm) or populate them as
zero (degenerate [(0, 1.0)] distribution).
The estimator emits a stage-level warning per key
when ≥ 95% of contributing rows reported zero —
diagnostic of P-naïve input. See
docs/p_nucleotide_length_estimation_design.md
§5 for the realistic-input enumeration.
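The 95% zero check amounts to a per-key fraction test. A minimal sketch follows; p_naive_keys is a hypothetical name, and skipping keys with no contributing rows is an assumption.

```python
def p_naive_keys(lengths_by_key, threshold=0.95):
    """Return the keys whose contributing rows are >= threshold zeros."""
    flagged = []
    for key, values in lengths_by_key.items():
        if not values:
            continue  # nothing contributed, so there is nothing to diagnose
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction >= threshold:
            flagged.append(key)
    return flagged


flagged = p_naive_keys(
    {
        "V_3": [0] * 19 + [1],     # exactly 95% zeros: flagged
        "J_5": [0] * 18 + [1, 2],  # 90% zeros: not flagged
        "D_5": [],
    }
)
```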
AIRR columns consumed:
- p_v_3_length (always).
- p_j_5_length (always).
- p_d_5_length (VDJ only; VJ rows with a non-zero value raise a one-time warning per column and the contribution is skipped).
- p_d_3_length (same).
Columns deliberately ignored: every other AIRR column. v1 does NOT derive P-lengths from junction-arithmetic, NP strings, trim fields, or end-loss fields.
Per-row validation is field-local: a malformed
or missing value in one P-length column drops only
that key's contribution. Per-row structured
entries land in report.rejected with reasons
"missing_required_column" /
"malformed_length_value" /
"negative_length_value".
min_count (int, default 1) drops length values
whose observed count is strictly below the
threshold before normalisation.
pseudocount (float, default 0.0) adds a uniform
prior to every observed length value before
normalisation (no support expansion). Applied
AFTER the min_count filter.
replace (default True) controls idempotency.
When False AND a prior typed-plane
p_nucleotide_lengths is already attached,
the call raises :class:ValueError before
consuming any records.
build()
Assemble and return a validated :class:DataConfig.
The cartridge is finalised with:
- allele dicts populated from FASTA.
- metadata from :meth:infer_identity (or a built-in stub when :meth:infer_identity was never called).
- reference_rules / reference_models when those stages ran.
- build_report populated with the audit trail + a manifest snapshot + the post-build checksum.
The final step calls :meth:DataConfig.verify_integrity so a
malformed cartridge surfaces at build time, not at
:func:Experiment.on time. The build report is attached
before the integrity check so :attr:build_report is
consistent regardless of whether the check passes.
report()
Return the accumulated :class:CartridgeBuildReport.