Estimate cartridge models from real data¶
A reference cartridge ships with biology baked in — but the empirical models inside (allele usage, trim distributions, NP lengths, NP base composition, P-nucleotide lengths) come from data. GenAIRR's estimator surface lets you turn a real AIRR rearrangement table into a cartridge whose simulator output matches the empirical distributions in that table. This guide is the focused workflow for that loop.
What estimators do¶
An estimator reads a stream of AIRR-shaped rearrangement records
and writes one typed empirical model onto the cartridge's
reference_models plane:
The output is still a normal DataConfig. The cartridge's
identity (cfg.identity), allele catalogue (cfg.alleles),
rules (cfg.rules), and any models you didn't estimate stay
exactly as they were. Estimators are typed-plane mutators:
they touch one plane each, leave the rest alone, and stamp an
auditable trail on the build report.
Which estimators ship today¶
Five estimators are available in v1:
| Estimator | Plane it populates | Required columns | Optional columns |
|---|---|---|---|
estimate_allele_usage |
reference_models.allele_usage |
v_call, j_call |
d_call (required on VDJ) |
estimate_trim_distributions |
reference_models.trims |
v_trim_3, j_trim_5 |
d_trim_5, d_trim_3 (required on VDJ) |
estimate_np_length_distributions |
reference_models.np_lengths |
np1_length |
np2_length (required on VDJ) |
estimate_np_base_model |
reference_models.np_bases |
np1 |
np2 (required on VDJ) |
estimate_p_nucleotide_lengths |
reference_models.p_nucleotide_lengths |
p_v_3_length, p_j_5_length |
p_d_5_length, p_d_3_length (required on VDJ) |
estimate_shm_rates is deferred
SHM-model estimation is not in v1. Calling
builder.estimate_shm_rates(...) raises AttributeError. SHM
kernels still ship from the cartridge author's chosen S5F
table; for now you can swap them at build time but not estimate
them from records.
Input record requirements¶
Estimators consume one of three input shapes:
- List of dicts:
[{"v_call": "IGHV3-23*01", ...}, ...] - AIRR TSV path:
"my_repertoire.tsv"(tab-delimited) - Open text file handle: an iterable yielding TSV lines
The expected column names by estimator:
| Estimator | Columns it reads |
|---|---|
estimate_allele_usage |
v_call, d_call, j_call |
estimate_trim_distributions |
v_trim_3, d_trim_5, d_trim_3, j_trim_5 |
estimate_np_length_distributions |
np1_length, np2_length |
estimate_np_base_model |
np1, np2 |
estimate_p_nucleotide_lengths |
p_v_3_length, p_d_5_length, p_d_3_length, p_j_5_length |
Columns are read per-field, not per-row: a row missing one column for one estimator is rejected only for that estimator's stage, not for the whole pipeline. Run all five estimators in sequence against the same record set — each takes what it needs and reports independently on the rest.
VJ-chain cartridges (kappa, lambda) silently ignore the D-segment columns; VDJ-chain cartridges require them.
One complete estimator chain¶
The full builder workflow, with every estimator wired up:
import GenAIRR as ga
records = [...] # list[dict] of AIRR rearrangements
builder = (
ga.ReferenceCartridgeBuilder
.from_fasta(
v_fasta="v.fasta",
d_fasta="d.fasta",
j_fasta="j.fasta",
chain_type="BCR_HEAVY",
)
.infer_identity(species="HUMAN", locus="IGH")
.estimate_allele_usage(records)
.estimate_trim_distributions(records)
.estimate_np_length_distributions(records)
.estimate_np_base_model(records, kind="markov", pseudocount=1.0)
.estimate_p_nucleotide_lengths(records)
)
cfg = builder.build()
report = builder.report()
Every call returns the same ReferenceCartridgeBuilder so the
chain stays fluent. Order doesn't matter for correctness — each
estimator writes its own plane — but the build report records the
calls in the order they ran, so put them in the order a future
reader will find easiest to follow.
build() runs cartridge finalisation (alphabet, identity, rules
defaulting), stamps a schema_sha256, and returns the
DataConfig. After build(), the builder is frozen; further
estimator calls raise.
Ambiguity and skipped rows¶
Estimators don't crash on imperfect data. Every malformed,
ambiguous, or unknown-allele row is captured in
report.rejected with the reason it was dropped, so you can
audit data quality without losing the rest of the corpus.
The recognised rejection reasons:
| Reason | What triggers it |
|---|---|
missing_required_column |
The field was empty / missing on this row |
unknown_allele |
The allele name isn't in the cartridge's catalogue |
missing_d_call_on_vdj |
VDJ chain row with no d_call |
malformed_trim_value |
Trim column couldn't parse as integer |
malformed_length_value |
NP / P length column couldn't parse as integer |
negative_trim_value |
Trim value parsed but was negative |
negative_length_value |
NP / P length parsed but was negative |
noncanonical_base |
NP sequence contained characters outside {A,C,G,T} |
Each rejection entry is a dict carrying at minimum stage,
row_index, and reason. Stage-specific entries add fields
like column, segment, allele_name, value, or
unknown_chars. Inspect them post-build:
report = builder.report()
for entry in report.rejected:
if entry["reason"] == "unknown_allele":
print(entry["stage"], entry["segment"], entry["allele_name"])
estimate_allele_usage ambiguity policy¶
The ambiguous keyword chooses how comma-separated tie sets
("IGHV3-23*01,IGHV3-23*04") are handled:
| Policy | Behaviour |
|---|---|
"fractional" (default) |
Splits credit evenly across the known alleles in the tie set. Unknown alleles inside a tie set surface as unknown_allele rejection entries, one per segment per row, without dropping the row if at least one allele in the set is known. |
"truth_first" |
Collapses to the first allele in the tie set before resolution — closest to "treat the call as a single allele". |
"reject" |
Drops the entire row if any segment has multiple alleles. Useful when you want to estimate from confident calls only. |
P-nucleotide naive input detector¶
estimate_p_nucleotide_lengths emits a stage-level warning when
≥95 % of contributing rows reported zero for a given key. That
threshold is the P-naive detection signal: most external AIRR
tools don't report palindromic insertions, so a tsv from MiXCR /
IMGT / AIRR-tools that doesn't have those columns will trip the
warning and you'll know the estimated distribution is essentially
"always zero" rather than meaningful biology.
How estimated models affect simulation¶
The empirical models you write to the cartridge feed the engine's recombination passes in the obvious way:
| Plane | What it drives |
|---|---|
allele_usage |
Recombination's V / D / J segment-pick weights |
trims |
Per-segment trim distributions at recombine time |
np_lengths |
NP1 / NP2 insertion length draws |
np_bases |
NP1 / NP2 base composition (first-base + Markov transitions on kind="markov") |
p_nucleotide_lengths |
Palindromic-insertion length draws |
Two important composition rules:
- Explicit
Experimentkwargs always override the cartridge. When you pass.trim(v_3=..., d_5=..., d_3=..., j_5=...)to an experiment, those overrides win — the estimated cartridge distribution is the default, not a floor. - The plan signature folds these distributions where applicable,
so replaying a trace against a cartridge whose estimated
distributions have drifted will fail cleanly with a plan-signature
mismatch. The one documented soft gap is in allele-usage
weights — see
docs/replay_protocol.mdfor the precise envelope.
The cartridge itself doesn't expose run-time sampling code; the
engine consumes the typed planes through dataconfig_to_refdata
and translates them into compiled recombination draws at
Experiment.compile() time.
Inspecting the result¶
Two readable handles after build():
The build report¶
report = builder.report()
report.stages # list[dict] — one entry per builder call
report.warnings # list[str] — build-finalisation warnings
report.rejected # list[dict] — per-row + per-allele drops
report.manifest_snapshot # dict | None — cfg.cartridge_manifest() at build time
report.checksum_at_build_time # str | None — schema_sha256 stamped on cfg
report.to_dict() # JSON-clean dict for CI artifacts
stages carries an inputs and inferred block per stage. For
estimator stages you'll find counts of records_seen,
records_used, replaced (idempotency marker), and per-key
diagnostic dicts.
The cartridge manifest¶
manifest = cfg.cartridge_manifest()
manifest["models"]
# {
# "has_reference_models": True,
# "allele_usage": {"v": 320, "d": 30, "j": 6, ...},
# "trim_models": {"v_trim_3": {"support_size": 18, ...}, ...},
# "np_length_models": {"np1": {...}, "np2": {...}},
# "np_base_models": {"first_base": {...}, "transition": {...}},
# "p_nucleotide_models": {"p_v_3_length": {...}, ...},
# "shm": {"v_subregion_support": {"available": True, ...}, ...},
# "trim_keys": ["v_trim_3", "d_trim_5", ...],
# "np_length_keys": ["np1", "np2"],
# "legacy_trim_dicts_present": False,
# "legacy_np_lengths_present": False,
# }
manifest["models"] is the canonical place to verify what got
estimated. Each top-level key is None when its plane is
empty, so manifest["models"]["allele_usage"] is None after a
build that skipped estimate_allele_usage is normal.
Common mistakes¶
A handful of issues that show up repeatedly with the estimator surface.
Expecting legacy single-dict fields to auto-lift. The old
cfg.trim_dicts-style fields don't drive the engine anymore.
The estimators write to cfg.reference_models.<plane>. If
you carry forward a hand-authored cartridge with legacy fields,
manifest["models"]["legacy_trim_dicts_present"] flags it.
Calling estimate_p_nucleotide_lengths on a generic AIRR
table. External tools usually don't populate p_*_length
columns at all. The estimator's 95 % zero-fraction warning is
the signal that you're estimating off P-naive input — at that
point the model is "always-zero", not biology. Either source the
records from a P-aware tool or skip this estimator.
Expecting the SHM-rate estimator today. It's deferred to a
follow-up slice. Calling builder.estimate_shm_rates(...) raises
AttributeError. Use the cartridge's bundled S5F kernel and tune
the rate via the Experiment.mutate(rate=..., model=...) knob —
see SHM and mutation targeting.
Forgetting replace=False when you want single-call
discipline. All estimators default to replace=True: calling
the same estimator twice silently overwrites the first result
and stamps replaced=True on the stage entry. That's the
ergonomic default for iterative work. When you want a build
script that asserts you only ran each estimator once — for CI
hygiene, or to catch a duplicated method call in a longer chain —
pass replace=False on every estimator call. The second call
then raises ValueError instead of overwriting.
Markov NP-base estimation with pseudocount=0 on tiny
inputs. If a from-base never appears in the observed sequences,
the engine can't form a transition row for it, and ValueError
fires before spec construction. Either use the default
kind="markov", pseudocount=1.0, or drop to
kind="empirical_first_base" if the from-base coverage isn't
representative.
Where to go next¶
- Build a reference cartridge — the broader builder workflow these estimators plug into.
- Reference cartridge — the four-plane conceptual model and why the empirical-models plane is typed.
- Junction N/P additions — how the NP-base and P-length models affect what the engine samples.
- Validation hub — replay and plan-signature guarantees that ride on top of an estimated cartridge.