ReferenceCartridgeBuilder

A fluent builder for custom reference cartridges from raw FASTA. Every call returns the same builder extended by one stage; at .build() the engine validates the typed planes and stamps a schema_sha256 on the resulting DataConfig. Every stage's inputs, warnings, and rejection details are captured in builder.report() — the cartridge ships with an auditable trail.

For the broader workflow, see Build a reference cartridge; for the estimator deep dive, see Estimate cartridge models from real data.

Common methods

| Method | Purpose |
| --- | --- |
| .from_fasta(...) | Class-level constructor; parses V/D/J FASTA into the catalogue |
| .infer_identity(...) | Resolve species / chain / locus / reference-set / name |
| .infer_v_subregions() | Annotate V alleles with FWR/CDR intervals (required for v_subregion_rates) |
| .with_rules(...) | Attach an explicit ReferenceRulesSpec |
| .with_models(...) | Attach an explicit ReferenceEmpiricalModels bundle |
| .estimate_allele_usage(records) | Estimate per-segment allele weights |
| .estimate_trim_distributions(records) | Estimate per-key trim distributions |
| .estimate_np_length_distributions(records) | Estimate NP1 / NP2 length distributions |
| .estimate_np_base_model(records, kind=...) | Estimate NP-base model (Markov or empirical first-base) |
| .estimate_p_nucleotide_lengths(records) | Estimate palindromic-insertion length distributions |
| .build() | Finalise, stamp checksum, return the DataConfig |
| .report() | Get the CartridgeBuildReport with stages / warnings / rejections |

Class reference

GenAIRR.cartridge_builder.ReferenceCartridgeBuilder

Stage-based builder for a :class:~GenAIRR.DataConfig.

See :mod:GenAIRR.cartridge_builder module docs and docs/reference_cartridge_authoring_audit.md for the design.

Usage::

builder = ReferenceCartridgeBuilder.from_fasta(
    v_fasta="igh_v.fasta",
    j_fasta="igh_j.fasta",
    d_fasta="igh_d.fasta",
    chain_type="BCR_HEAVY",
)
builder.infer_identity(species="HUMAN", reference_set="MY_REF")
builder.infer_v_subregions()
builder.with_rules(my_rules_spec)
builder.with_models(my_empirical_models)
cfg = builder.build()
report = builder.report()

Each stage method mutates the builder in place and returns self so calls can be chained.

from_fasta(*, v_fasta, j_fasta, d_fasta=None, c_fasta=None, chain_type) classmethod

Build a builder from FASTA inputs.

v_fasta and j_fasta are required; d_fasta is required only when the chain type has a D segment (BCR_HEAVY / TCR_BETA / TCR_DELTA). c_fasta is accepted, but the parsed C alleles are not yet wired into the resulting cartridge: v1 is limited to V/D/J authoring. Records dropped during parsing land in :attr:CartridgeBuildReport.rejected.

Each *_fasta argument can be a path-like, an open text file, or a raw FASTA string (a string containing a newline and starting with > is parsed directly).
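The three accepted input shapes can be told apart with a small dispatch. The sketch below is an illustration only: looks_like_raw_fasta and open_fasta are hypothetical helper names, not part of GenAIRR, and the builder's real parsing code may differ.

```python
import io
import os
from typing import TextIO, Union

FastaInput = Union[str, os.PathLike, TextIO]


def looks_like_raw_fasta(value: object) -> bool:
    """Heuristic from the text: a str that starts with '>' and contains a newline."""
    return isinstance(value, str) and value.startswith(">") and "\n" in value


def open_fasta(value: FastaInput) -> TextIO:
    """Return a readable text handle for any of the three accepted shapes."""
    if looks_like_raw_fasta(value):
        return io.StringIO(value)      # raw FASTA text, parsed directly
    if hasattr(value, "read"):
        return value                   # already an open text file
    # Anything else is treated as a filesystem path.
    return open(os.fspath(value), "r", encoding="utf-8")
```

Under this heuristic a plain filename such as "igh_v.fasta" is never mistaken for FASTA text, because it neither starts with > nor contains a newline.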

infer_identity(*, species=None, locus=None, reference_set=None, name=None, source='ReferenceCartridgeBuilder')

Populate the cartridge's identity / metadata plane.

species / locus / reference_set are user-supplied — none of these can be reliably inferred from FASTA alone, so the audit (§7) requires explicit input. When locus is omitted the conventional label is derived from the chain type (BCR_HEAVY → "IGH", etc.).

source is recorded on every parsed allele's .source field so downstream consumers can identify cartridges produced by this builder vs the bundled OGRDB / IMGT loaders.

infer_v_subregions()

Derive per-V-allele IMGT subregion intervals.

Walks every V allele with a non-empty gapped_seq and runs :func:compute_v_region_boundaries, attaching the resulting {label: (start, end)} dict to allele.subregions. Alleles without gapped_seq (or with a derivation that raises) are left unset and counted in the per-stage report.

Idempotent — calling twice re-derives and overwrites the previous output, with a replaced=True flag on the report entry.

with_rules(reference_rules)

Attach an explicit :class:ReferenceRulesSpec.

The user-supplied spec is validated immediately so an author's bug surfaces at build-stage time, not at compile time. The report records the rule fields the spec defines.

with_models(reference_models)

Attach an explicit :class:ReferenceEmpiricalModels.

The supplied bundle is validated against the chain type (D-end keys rejected on VJ chains, etc.). The report records which typed planes are populated.

estimate_allele_usage(rearrangements, *, min_count=1.0, ambiguous='fractional', replace=True)

Estimate per-segment allele-usage weights from observed AIRR rearrangement records.

rearrangements accepts:

  • a list of dicts (each row is one AIRR record),
  • a path-like (filesystem path to an AIRR TSV — parsed via csv.DictReader with tab delimiter), or
  • an open text file handle pointing at AIRR TSV.
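As a rough sketch of how those three shapes could be normalised into a list of row dicts, the helper below uses only csv.DictReader with a tab delimiter, as the list above describes. load_airr_rows is a hypothetical name chosen for illustration; it is not a GenAIRR function.

```python
import csv
import io
import os
from typing import Dict, List


def load_airr_rows(rearrangements) -> List[Dict[str, str]]:
    """Normalise a list of dicts, an open handle, or a path into row dicts."""
    if isinstance(rearrangements, list):
        return rearrangements                      # already one dict per record
    if hasattr(rearrangements, "read"):
        return list(csv.DictReader(rearrangements, delimiter="\t"))
    with open(os.fspath(rearrangements), newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Example with an in-memory handle standing in for an AIRR TSV file.
demo = io.StringIO("v_call\tj_call\nIGHV1*01\tIGHJ4*02\n")
demo_rows = load_airr_rows(demo)
```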

ambiguous selects the tie-set policy (column values like "IGHV1*01,IGHV2*01"):

  • "fractional" (default): split one row's credit 1.0 evenly across all known alleles in the tie set. Unknown allele names in the tie set are excluded; if NO names in the tie set are known to the cartridge, the row is recorded as unknown_allele and skipped.
  • "truth_first": credit only the first comma-separated allele in the tie set. Matches the existing :func:GenAIRR._mcp_summary convention.
  • "reject": drop ambiguous (multi-call) rows entirely and record them in report.rejected.

min_count drops alleles whose final per-segment count is strictly below the threshold. The filter runs before normalisation; the per-segment weights that remain after the drop are then renormalised to sum to 1.0.
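The "fractional" policy and the min_count filter can be sketched for a single segment as follows. This is a minimal illustration under the rules stated above, not the builder's implementation; fractional_usage is a hypothetical name, and the real method also records skipped rows in the report.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def fractional_usage(
    calls: Iterable[str], known: Set[str], min_count: float = 1.0
) -> Dict[str, float]:
    """Split each row's credit of 1.0 evenly across the known alleles it names."""
    counts: Dict[str, float] = defaultdict(float)
    for call in calls:
        tie_set = [name.strip() for name in call.split(",") if name.strip()]
        hits = [name for name in tie_set if name in known]
        if not hits:
            continue  # the builder would record this row as unknown_allele
        share = 1.0 / len(hits)
        for name in hits:
            counts[name] += share
    # Drop alleles strictly below min_count, then renormalise the survivors.
    kept = {name: c for name, c in counts.items() if c >= min_count}
    total = sum(kept.values())
    return {name: c / total for name, c in kept.items()} if total else {}
```

With calls ["IGHV1*01", "IGHV1*01,IGHV2*01", "IGHV1*01,IGHV9*99", "IGHV7*01"] and only IGHV1*01 and IGHV2*01 known, the raw counts are 2.5 and 0.5. The default min_count=1.0 then drops IGHV2*01 and leaves IGHV1*01 with weight 1.0.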

replace (default True) controls idempotency. When True, calling :meth:estimate_allele_usage twice overwrites the previous spec and writes replaced=True on the new stage entry. When False, the second call raises :class:ValueError.

Updates self._reference_models.allele_usage so a downstream :meth:build carries the estimated spec into the cartridge.

estimate_trim_distributions(rearrangements, *, min_count=1, pseudocount=0.0, replace=True)

Estimate per-key trim distributions from observed AIRR rearrangement records.

Writes EmpiricalDistributionSpec instances into self._reference_models.trims keyed by "V_3" / "D_5" / "D_3" / "J_5" (see :data:GenAIRR.reference_models.TRIM_KEYS). VJ cartridges only produce the V_3 and J_5 keys (see :data:GenAIRR.reference_models.TRIM_KEYS_VJ); the D-end keys are skipped silently because the cartridge has no D segment to trim.

rearrangements accepts the same shapes as :meth:estimate_allele_usage:

  • a list of dicts (each row is one AIRR record),
  • a path-like to an AIRR TSV (parsed via csv.DictReader with tab delimiter), or
  • an open text file handle pointing at AIRR TSV.

AIRR columns consumed (per audit §1.3):

  • VJ: v_trim_3, j_trim_5.
  • VDJ: v_trim_3, d_trim_5, d_trim_3, j_trim_5.

Columns deliberately ignored: v_trim_5 / j_trim_3 (no engine pass — hard-zero in projection), and the observation-stage end_loss_5_length / end_loss_3_length (separate corruption surface — see docs/primer_trim_end_loss_audit.md).

Per-row validation is field-local: a malformed or missing value in one trim column drops that field's contribution only — the row still feeds its other well-formed fields. Per-row structured entries land in report.rejected with reasons "missing_required_column" / "malformed_trim_value" / "negative_trim_value", each carrying the AIRR column name.
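Field-local validation means one bad column never discards the whole row. A minimal sketch of that behaviour is shown below; parse_trim_fields and the TRIM_COLUMNS mapping are hypothetical names for illustration, and the assumption that values must parse with int() is mine, not something the source specifies.

```python
from typing import Dict, List, Mapping, Tuple

# Assumed pairing of typed-plane keys with the AIRR columns listed above.
TRIM_COLUMNS = {"V_3": "v_trim_3", "D_5": "d_trim_5", "D_3": "d_trim_3", "J_5": "j_trim_5"}


def parse_trim_fields(
    row: Mapping[str, object], columns: Mapping[str, str] = TRIM_COLUMNS
) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Return (accepted values by key, rejected (column, reason) pairs)."""
    accepted: Dict[str, int] = {}
    rejected: List[Tuple[str, str]] = []
    for key, column in columns.items():
        raw = row.get(column)
        if raw is None or str(raw).strip() == "":
            rejected.append((column, "missing_required_column"))
            continue
        try:
            value = int(str(raw).strip())
        except ValueError:
            rejected.append((column, "malformed_trim_value"))
            continue
        if value < 0:
            rejected.append((column, "negative_trim_value"))
            continue
        accepted[key] = value  # this field contributes even if others failed
    return accepted, rejected
```

A VJ cartridge would pass only the V_3 and J_5 entries as columns, so the D-end columns are never inspected.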

min_count (int, default 1) drops trim values whose observed integer count is strictly below the threshold before normalisation.

pseudocount (float, default 0.0) adds a uniform prior to every observed trim value before normalisation (no support expansion; values never observed stay unobserved). Applied AFTER the min_count filter so the filter looks at raw observations.
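The ordering of the two parameters matters, so here is a small sketch of the pipeline as described: count, filter on raw counts, add the pseudocount to surviving values only, then normalise. empirical_distribution is a hypothetical helper, not the builder's code, and it returns a plain dict rather than an EmpiricalDistributionSpec.

```python
from collections import Counter
from typing import Dict, Iterable


def empirical_distribution(
    values: Iterable[int], min_count: int = 1, pseudocount: float = 0.0
) -> Dict[int, float]:
    """Normalised weights over observed values; unobserved values stay absent."""
    raw = Counter(values)
    # 1. Filter on raw integer counts (strictly below min_count is dropped).
    kept = {value: n for value, n in raw.items() if n >= min_count}
    # 2. Add the uniform prior to surviving values only (no support expansion).
    smoothed = {value: n + pseudocount for value, n in kept.items()}
    # 3. Normalise so the weights sum to 1.0.
    total = sum(smoothed.values())
    return {value: w / total for value, w in sorted(smoothed.items())} if total else {}
```

For observed trims [0, 0, 0, 1, 1, 5] with min_count=2 and pseudocount=1.0, the value 5 is dropped on its raw count of 1, and the remaining counts 3 and 2 become 4 and 3, giving weights 4/7 and 3/7.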

replace (default True) controls idempotency. When True, calling this method twice overwrites the previous specs and writes replaced=True on the new stage entry. When False AND a prior typed-plane trims is already attached to self._reference_models, the call raises :class:ValueError before consuming any records.

estimate_np_length_distributions(rearrangements, *, min_count=1, pseudocount=0.0, replace=True)

Estimate per-key NP length distributions from observed AIRR rearrangement records.

Writes EmpiricalDistributionSpec instances into self._reference_models.np_lengths keyed by "NP1" and (on VDJ cartridges) "NP2" (see :data:GenAIRR.reference_models.NP_KEYS). VJ cartridges produce only the NP1 key; the cartridge has no NP2 region. The boundary is enforced at estimation time because the typed-plane validator does NOT chain-type-reject NP2 on VJ at attach time (see audit §2.2 of docs/np_length_estimation_design.md).

AIRR columns consumed (per audit §1.2):

  • np1_length (always).
  • np2_length (VDJ only — VJ rows with a non-zero np2_length raise a one-time warning and the contribution is skipped).

Columns deliberately ignored: np1 / np2 (sequence-derived length is sensitive to post-claim reabsorption — see audit §5.3), p_v_3_length / p_d_5_length / p_d_3_length / p_j_5_length (separate P-nucleotide biology — see docs/p_nucleotide_design.md), and junction_length (aggregate arithmetic too fragile across simulators — audit §5.2).

Per-row validation is field-local: a malformed or missing value in one NP column drops only that key's contribution; the row's other column still feeds. Per-row structured entries land in report.rejected with reasons "missing_required_column" / "malformed_length_value" / "negative_length_value", each carrying the AIRR column name.

min_count (int, default 1) drops length values whose observed integer count is strictly below the threshold before normalisation.

pseudocount (float, default 0.0) adds a uniform prior to every observed length value before normalisation (no support expansion). Applied AFTER the min_count filter.

replace (default True) controls idempotency. When False AND a prior typed-plane np_lengths is attached to self._reference_models, the call raises :class:ValueError before consuming any records.

estimate_np_base_model(rearrangements, *, kind='markov', min_count=1, pseudocount=0.0, replace=True)

Estimate per-key NP base sampling model from observed AIRR rearrangement records.

Writes NpBaseModelSpec instances into self._reference_models.np_bases keyed by "NP1" (always) and (on VDJ cartridges) "NP2". VJ cartridges produce only the NP1 key.

kind selects the model family:

  • "empirical_first_base": estimates a single categorical over A/C/G/T from the full base composition of every observed NP string (every position, not just position 0). The engine samples every NP position independently from this distribution — the name is preserved for API stability with existing cartridges; the biologically correct estimate is the full-base composition.
  • "markov" (default): estimates a first-base row from position 0 of each NP string plus a 4×4 transition matrix from every observed (prev, next) pair.

AIRR columns consumed (per audit §1.2):

  • np1 (always).
  • np2 (VDJ only — VJ rows with a non-empty np2 raise a one-time warning and the contribution is skipped).

Columns deliberately ignored: junction (audit §1.2), p_v_3_length / p_d_5_length / p_d_3_length / p_j_5_length (separate P-nucleotide biology), np1_length / np2_length (length-only — owned by :meth:estimate_np_length_distributions).

Per-row validation is field-local: a malformed or missing value in one NP column drops only that key's contribution. Per-row structured entries land in report.rejected with reasons "missing_required_column" (empty / missing string) or "noncanonical_base" (any character outside {A,C,G,T} after uppercasing).

min_count (int, default 1) drops first-base categories whose observed count is strictly below the threshold before normalisation. For markov, min_count applies to the first-base row only: dropping transition cells could leave a from-base row with no positive weights, which the :class:NpBaseModelSpec validator rejects. v1 keeps transition rows intact and surfaces the first-base drops in below_min_count.first_base.

pseudocount (float, default 0.0) adds a uniform prior:

  • empirical_first_base: added to every A/C/G/T base category before normalisation.
  • markov: added to every A/C/G/T first-base category AND to every cell of the 4×4 transition matrix.
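To make the two model families concrete, the sketch below estimates both from a list of NP strings. It is an illustration under the descriptions above: estimate_markov and estimate_composition are hypothetical names, min_count handling is omitted, invalid strings are silently skipped rather than reported, and an all-zero row is returned unnormalised (the text notes that the real validator would reject such a row).

```python
from typing import Dict, Iterable, Tuple

BASES = "ACGT"
Row = Dict[str, float]


def _normalise(row: Row) -> Row:
    total = sum(row.values())
    return {b: w / total for b, w in row.items()} if total else dict(row)


def _valid(np_strings: Iterable[str]) -> Iterable[str]:
    for raw in np_strings:
        seq = raw.upper()
        if seq and all(ch in BASES for ch in seq):
            yield seq  # empty / noncanonical strings are skipped in this sketch


def estimate_markov(
    np_strings: Iterable[str], pseudocount: float = 0.0
) -> Tuple[Row, Dict[str, Row]]:
    """First-base row from position 0, transitions from every (prev, next) pair."""
    first = {b: pseudocount for b in BASES}
    trans = {p: {n: pseudocount for n in BASES} for p in BASES}
    for seq in _valid(np_strings):
        first[seq[0]] += 1.0
        for prev, nxt in zip(seq, seq[1:]):
            trans[prev][nxt] += 1.0
    return _normalise(first), {p: _normalise(r) for p, r in trans.items()}


def estimate_composition(np_strings: Iterable[str], pseudocount: float = 0.0) -> Row:
    """Single categorical over A/C/G/T from every position of every string."""
    counts = {b: pseudocount for b in BASES}
    for seq in _valid(np_strings):
        for ch in seq:
            counts[ch] += 1.0
    return _normalise(counts)
```

With pseudocount=0.0 and inputs that never contain a T, the T transition row has no positive weight, which illustrates why a non-zero pseudocount can be useful for sparse markov estimates.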

replace (default True) controls idempotency. When False AND a prior typed-plane np_bases is already attached, the call raises :class:ValueError before consuming any records.

estimate_p_nucleotide_lengths(rearrangements, *, min_count=1, pseudocount=0.0, replace=True)

Estimate per-end P-nucleotide length distributions from observed AIRR rearrangement records.

Writes EmpiricalDistributionSpec instances into self._reference_models.p_nucleotide_lengths keyed by "V_3" and "J_5" (always) plus "D_5" and "D_3" (VDJ only). VJ cartridges produce only the V_3 and J_5 keys; the typed-plane validator chain-type-rejects D-end keys on VJ at attach time.

Provenance warning. This estimator requires AIRR-like records that ALREADY carry GenAIRR's P-length fields (p_v_3_length / p_d_5_length / p_d_3_length / p_j_5_length). It does NOT infer P-lengths from generic AIRR junction sequences, NP strings, or trim arithmetic — that inference problem is out of scope for v1. External AIRR tools (IgBLAST, MiXCR, …) do not model P-nucleotide additions, so their output will either omit these columns entirely (rejection storm) or populate them as zero (degenerate [(0, 1.0)] distribution). The estimator emits a stage-level warning per key when ≥ 95% of contributing rows reported zero — diagnostic of P-naïve input. See docs/p_nucleotide_length_estimation_design.md §5 for the realistic-input enumeration.
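The zero-fraction diagnostic can be approximated as below. This is a sketch only: zero_heavy_warning is a hypothetical helper, the message wording is invented for illustration, and only the 95% threshold comes from the text above.

```python
from typing import Iterable, Optional

ZERO_FRACTION_THRESHOLD = 0.95  # threshold stated in the provenance warning


def zero_heavy_warning(key: str, lengths: Iterable[int]) -> Optional[str]:
    """Return a warning string when at least 95% of contributing rows are zero."""
    observed = list(lengths)
    if not observed:
        return None  # no contributing rows, nothing to diagnose
    zero_fraction = sum(1 for n in observed if n == 0) / len(observed)
    if zero_fraction >= ZERO_FRACTION_THRESHOLD:
        return f"{key}: {zero_fraction:.0%} of contributing rows report zero P-length"
    return None
```

Running a check like this per key before calling the estimator on external data is one way to spot P-naive input early.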

AIRR columns consumed:

  • p_v_3_length (always).
  • p_j_5_length (always).
  • p_d_5_length (VDJ only — VJ rows with a non-zero value raise a one-time warning per column and the contribution is skipped).
  • p_d_3_length (same).

Columns deliberately ignored: every other AIRR column. v1 does NOT derive P-lengths from junction-arithmetic, NP strings, trim fields, or end-loss fields.

Per-row validation is field-local: a malformed or missing value in one P-length column drops only that key's contribution. Per-row structured entries land in report.rejected with reasons "missing_required_column" / "malformed_length_value" / "negative_length_value".

min_count (int, default 1) drops length values whose observed count is strictly below the threshold before normalisation.

pseudocount (float, default 0.0) adds a uniform prior to every observed length value before normalisation (no support expansion). Applied AFTER the min_count filter.

replace (default True) controls idempotency. When False AND a prior typed-plane p_nucleotide_lengths is already attached, the call raises :class:ValueError before consuming any records.

build()

Assemble and return a validated :class:DataConfig.

The cartridge is finalised with:

  • allele dicts populated from FASTA.
  • metadata from :meth:infer_identity (or a built-in stub when :meth:infer_identity was never called).
  • reference_rules / reference_models when those stages ran.
  • build_report populated with the audit trail + a manifest snapshot + the post-build checksum.

The final step calls :meth:DataConfig.verify_integrity so a malformed cartridge surfaces at build time, not at :func:Experiment.on time. The build report is attached before the integrity check so that :attr:build_report is consistent regardless of whether the check passes.

report()

Return the accumulated :class:CartridgeBuildReport.