Validation & reproducibility¶

GenAIRR's outputs are rich - 50+ AIRR fields per record, full event ledgers per outcome, recoverable traces per seed. Validation answers one question across all of it: does what the record reports agree with what the engine actually did? This page is the hub for the three user-facing validation layers, the replay surface that makes runs durable, and the recommended workflows for development and CI.

Your learning path

You're at the hub of the "I want reproducible / validated output" path. Two focused deep dives plug in here: validate_records is the per-record output-correctness gate, and Trace, replay, reproducibility is the durable-replay surface. Start with this page for orientation, then pick the deep dive that matches your task. See all paths →

Why validation exists¶

The engine drives every AIRR field from internal state - the persistent IR, the event ledger, and the per-draw trace. The projection that maps that state into the record runs outside the engine, so a bug in the projection (or in any code path that feeds the projection - live-call caching, V-subregion attribution, counter aggregation) could silently produce a record whose fields don't match the outcome that produced them. Validation closes that loop: independently re-derive every reported field from the upstream source, and surface any divergence as a structured report.

If you ever ask "does this record's v_call match the allele the engine actually sampled?", "do n_v + n_d + n_j + n_np add up to n_mutations?", "is productive consistent with the actual junction translation?" - validation is the call that answers, without you writing the cross-check yourself.

The three validation layers¶

GenAIRR exposes three user-facing validation surfaces. All of them are opt-in - production loops that have qualified their pipeline don't pay the per-record overhead by default.

Record validation¶

report = result.validate_records(refdata)
assert report, report.summary()

The point-wise validator. Re-derives every field on every record from the upstream source and reports per-record divergences in a ValidationReport. This is the gate every release-tier CI run should carry.

→ See the dedicated guide for the issue catalogue, the ValidationReport API, and the canonical CI one-liner.

Family validation¶

For workloads that stamp clone_id (clonal_lineage(...), clonal_repertoire(...), or legacy expand_clones(...)), family validation checks clone-level record consistency:

family_report = result.validate_families()
assert family_report, family_report.summary()

# Stronger gate - runs per-record validation on every parent first.
full_report = result.validate_families_with_parents(refdata)
assert full_report, full_report.summary()

validate_families is records-only. It checks that a clonal batch is not mixed with non-clonal records and, when truth_v_call, truth_d_call, and truth_j_call are present, that those recombination-time truth calls are invariant within each clone_id group. No refdata required.

validate_families_with_parents(refdata) is the stronger legacy-star diagnostic for expand_clones(...) outputs, where result.parents exists. It compares descendant records against the actual parent Outcome. Modern clonal_repertoire and clonal_lineage do not expose result.parents; validate their records with validate_records, use validate_families for clone_id grouping, and validate lineage trees with tree.validate().

Runtime opt-in¶

For workloads where you want the validator to run on every batch without an explicit second call:

result = exp.run_records(n=100, seed=1, validate_records=True)

The validator runs inline after each batch; the result is the same as calling result.validate_records(refdata) explicitly. Off by default - opt in during development; leave off in production hot loops.

What each layer catches¶

Issue category	Caught by
Sequence + coordinate consistency (`sequence` matches pool; `*_start/end` well-ordered)	`validate_records`
CIGAR ops (canonical M/I/D/S/N/P/X/=) + query span	`validate_records`
Counter provenance (`n_mutations`, per-segment partition, `n_pcr_errors`, indel counts)	`validate_records`
Allele calls (`v_call` / `d_call` / `j_call` matches an independent walker)	`validate_records`
Junction + productive triad (junction content, `vj_in_frame`, `stop_codon`, `productive` predicate identification)	`validate_records`
Paired-end geometry (R1/R2 windows, R2 reverse-complement, `insert_size`)	`validate_records` (when `paired_end()` is in the pipeline)
Clonal batch is consistently stamped with `clone_id`; truth V/D/J calls are invariant within clone when truth columns are present	`validate_families`
Legacy `expand_clones` descendants agree with their parent `Outcome`	`validate_families_with_parents`

The three layers are independent - validate_records doesn't require a clonal structure; validate_families doesn't require a refdata. They compose for the strongest possible gate:

report = result.validate_records(refdata)
assert report, report.summary()
families = result.validate_families_with_parents(refdata)
assert families, families.summary()

Trace and replay¶

Full guide

See Trace, replay, and reproducibility for the deep dive: seed vs trace, replay vs rerun, every failure mode, strict-mode interactions, and recommended workflows for debugging, regression tests, and reproducible examples. The summary below covers the essentials.

GenAIRR's reproducibility model rests on two facts:

Same seed + same plan + same cartridge → byte-identical output. Across runs, machines, and platforms. The seed= argument on run_records(...) is the canonical surface; n records use seeds [seed, seed+1, …, seed+n-1], so batches stitch together if you offset the starting seed.
Every random draw is recorded on the outcome's trace. Each draw lives at a stable hierarchical address (e.g. "sample_allele.v", "np.np1.length", "np.np1.bases[3]"). You can inspect the trace, dump it to disk, and replay it later.

For most users, the seed argument is the whole reproducibility story. The trace+replay surface matters when:

You want a durable replay artifact that captures the cartridge identity, engine version, and DSL signature - not just the seed.
You want to verify, weeks later, that a recorded run still reproduces against the current code + cartridge.
You're filing a bug and want a self-contained artifact a maintainer can reproduce against.

Saving a trace file¶

A TraceFile bundles a single outcome's recorded trace together with the plan signature, refdata signature, refdata content hash, engine version, and producing seed:

compiled = exp.compile()
outcome = compiled.simulator.run(seed=42)

trace_file = compiled.simulator.trace_file_from(outcome, seed=42)
trace_file.write_to("run-42.trace.json")

Read it back later with:

from GenAIRR._engine import TraceFile

trace_file = TraceFile.read_from("run-42.trace.json")
trace_file.to_json()           # round-trip the JSON
trace_file.seed                # 42
trace_file.engine_version      # "X.Y.Z"
trace_file.schema_version      # int

Replaying a trace file¶

Two complementary replay paths on the simulator:

replay_from_trace_file(trace_file) - consumes the recorded values verbatim at every sampling slot. The trace becomes the source of randomness, not the RNG. This is the strongest reproducibility gate: byte-identical output even if the seed's RNG sequence drifts in a future version.
rerun_from_trace_file(trace_file) - re-runs the sampler from the trace's recorded seed. The trace acts as a signature bundle (plan + refdata gates) rather than as a value source. Useful when you want a fresh draw against the same configured pipeline.

compiled = exp.compile()
outcome = compiled.simulator.replay_from_trace_file(trace_file)

Mismatch errors¶

Replay is gated on three signatures, each of which fires a ValueError if it disagrees with the trace:

Gate	When it fires	Message prefix
Plan signature	DSL chain changed (different passes / different rates / different kwargs that fold into the signature)	`"pass plan signature mismatch"`
Refdata signature	Cartridge structure changed (different catalogue / rules)	`"refdata signature mismatch"`
Refdata content hash	Cartridge bytes changed (curation, V-subregion annotation, allele content)	`"refdata content hash mismatch"` (replay only)

These fail loudly, before any choices are consumed. If your replay errors with "refdata content hash mismatch", the trace was produced against a different cartridge (rules / identity / curation may differ); load the original cartridge to replay.

Strict vs permissive¶

Every run_records (and run, stream, stream_records) takes a strict=False keyword that controls what happens when a sampler runs out of admissible candidates at sample time. This is rare, but it's the canonical failure mode of an unsatisfiable plan (e.g. a productive constraint that admits no junction under your NP-length distribution).

# Default - permissive
result = exp.run_records(n=100, seed=0)

# Strict - fail loud on empty admissible support
result = exp.run_records(n=100, seed=0, strict=True)

The two modes diverge only when admissible support is empty:

Mode	When admissible support is empty	What you see
`strict=False` (default)	Falls back to a documented sentinel value - indel site `-1`, NP length `0`, NP base `N`, trim `0`; SHM substitution skips the slot.	Execution continues; the record may end up non-productive at that site.
`strict=True`	Raises `StrictSamplingError` immediately.	The call fails with `(pass_name, address, reason)` args naming the failing site.

StrictSamplingError is not a ValueError subclass - except ValueError will not catch it. Catch it by name from the top-level import:

import GenAIRR as ga

try:
    result = exp.run_records(n=10, seed=42, strict=True)
except ga.StrictSamplingError as e:
    pass_name, address, reason = e.args
    print(f"{pass_name} couldn't satisfy the contract at {address}: {reason}")

The reason field is a stable lowercase code - common values: "empty_admissible_support", "support_unavailable", "missing_allele.V.42", "contract_violation.NoStopCodonInJunction".

The recommended posture: leave strict off in production, where the sentinel fallback keeps your batch flowing; turn strict on during cartridge / DSL development, where you want the unsatisfiable plan to surface immediately rather than silently.

Note that strict mode applies only to fresh sampling - replay consumes recorded values verbatim, so it never re-evaluates contract admissibility. To force strict-fresh semantics on a recorded trace, call simulator.run(seed=<original_seed>, strict=True) instead of replay_from_trace_file.

Recommended workflows¶

The three layers + the two reproducibility surfaces compose into three workflow patterns covering the common cases.

Development¶

Catch problems immediately. Runtime validation surfaces drift between your DSL changes and the engine's expectations; strict mode surfaces unsatisfiable plans:

result = exp.run_records(
    n=100,
    seed=1,
    validate_records=True,
    strict=True,
)

If the run succeeds and the inline validation passes, the pipeline is qualified.

Continuous integration¶

Run the validator as an explicit gate, surfacing a structured report you can keep as a build artifact:

result = exp.run_records(n=1000, seed=0)

report = result.validate_records(refdata)
assert report, report.summary()

A failing report dumps the per-record issues to the test runner; the failures list is JSON-serialisable for downstream tooling.

Clonal output¶

Stack record + family validation when the pipeline ships clonal records:

result = (
    exp
    .recombine()
    .clonal_repertoire(n_clones=50, max_size=100)
    .sequencing_errors(rate=0.001)
    .run_records(seed=42, expose_provenance=True)
)

assert result.validate_records(refdata), "AIRR record divergence"
assert result.validate_families(), "Family invariant divergence"

For legacy expand_clones(...), add result.validate_families_with_parents(refdata). For clonal_lineage(...), also call tree.validate() on each result.lineage_trees entry when you need topology checks.

When validation is not enough¶

GenAIRR's validation answers is this record internally consistent with how the engine claims it was produced - not is this output biologically realistic. A pipeline can validate clean and still produce simulations that don't match the biology you're targeting. Three places the validators don't help:

Are my cartridge's empirical distributions right? The cartridge's cartridge_manifest() block is the canonical provenance source for cartridge content. The build report attached to a ReferenceCartridgeBuilder-produced cartridge captures every estimator's inputs and inferred distributions.
Are my output distributions calibrated to a reference dataset? Use the distribution-invariant test suite or compare your simulated marginals against the dataset you're benchmarking against. The audit-realism workflow pattern is the natural starting point.
Does my simulation match a specific aligner's expectations? GenAIRR's calls come from an independent walker; an aligner with different scoring rules will produce different calls on the same sequence. That's not a validator problem; it's an aligner-comparison problem.

For input-model provenance specifically:

manifest = cfg.cartridge_manifest()
print(manifest["models"]["allele_usage"])     # what's authored on the cartridge
print(manifest["hashes"]["data_config_checksum"])  # canonical content hash

if cfg.build_report is not None:
    for stage in cfg.build_report.stages:
        print(stage["stage"], stage["inputs"])  # what every estimator saw

Where to go next¶

validate_records - the full guide to the per-record validator: API, the five issue categories, reading a ValidationReport.
Your first AIRR record the field catalogue the validator checks against.
Reference cartridge - the cartridge model the validator gates against and the manifest that documents input-model provenance.
The Experiment builder - how the pipeline composes and where validate_records=True and strict=True sit in the call surface.