Recipe D · 03 · ~25 min · intermediate

Reproduce a published dataset.

GenAIRR results are bit-identical across Linux, macOS, and Windows for the same seed and config. That guarantee turns "the figure-2 dataset from the paper" into a reproducible artifact: pull the seeds and configs, run the recipe, get the same bytes. This recipe walks through what you need to capture, what to publish alongside your figures, and how a future collaborator regenerates the data.

01 Capture inputs: seed · config name · package version
02 Publish recipe: repro.py · pinned dependencies
03 Verify hash: byte-identical · plot identical

PART 01

What you need to capture for full reproduction.

Five pieces of information uniquely determine a GenAIRR dataset. Capture all five and a collaborator on any OS can produce byte-identical sequences. Miss one and the reproduction drifts — sometimes silently.

The five inputs
  • seed: the integer passed to run() / stream_records()
  • config: "human_igh" or your custom RefDataConfig serialized to JSON
  • pipeline: the exact chain of .recombine(), .mutate(...), .corrupt_*(...) calls
  • contracts: the respect= bundle and the strict= flag (or its absence)
  • GenAIRR version: the exact package version from PyPI (pin it)
What you don't need

Hardware. OS. Python version (within a minor range). The Rust kernel is deterministic across all platforms GenAIRR ships wheels for. NumPy seed state, system time, RNG entropy — none of those play a role.
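Since only those five inputs matter, they fit in a tiny JSON manifest committed next to the repro script. A stdlib-only sketch; the file name repro_manifest.json and the field names are this recipe's convention, not a GenAIRR format:

```python
import json

# The five inputs that uniquely determine the dataset, captured as a manifest.
# Field names are illustrative, not part of any GenAIRR schema.
manifest = {
    "seed": 42,
    "config": "human_igh",
    "pipeline": [
        "recombine",
        "mutate(model='s5f', count=(0, 40))",
        "corrupt_5prime_loss(length=(5, 60))",
        "corrupt_3prime_loss(length=(0, 30))",
    ],
    "contracts": {"respect": "productive", "strict": False},
    "genairr_version": "1.1.0",
}

# sort_keys gives a stable on-disk form, so the manifest itself diffs cleanly
with open("repro_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2, sort_keys=True)
```

A reviewer can now diff two manifests instead of two scripts to see which input changed.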

PART 02

The repro script — what to publish alongside the paper.

A single Python file your readers run. genairr==X.Y.Z in a requirements.txt next to it. The script generates the exact dataset that produced your figure — no other machine state matters.

repro.py — alongside requirements.txt with pinned version
"""
Figure 2 dataset for "Title of My Paper" (Author et al., 2026).

Run: python repro.py
Pin: pip install -r requirements.txt  (GenAIRR == 1.1.0)
"""
import GenAIRR as ga

# five inputs that uniquely determine the dataset
SEED       = 42
CONFIG     = "human_igh"
N_RECORDS  = 50_000
STRICT     = False

result = (
    ga.Experiment.on(CONFIG)
       .recombine()
       .mutate(model="s5f", count=(0, 40))
       .corrupt_5prime_loss(length=(5, 60))
       .corrupt_3prime_loss(length=(0, 30))
       .run_records(
           n=N_RECORDS,
           seed=SEED,
           respect=ga.productive(),
           strict=STRICT,
           expose_provenance=True,
       )
)

result.to_csv("figure2.tsv", sep="\t")
print(f"wrote {len(result)} records")

PART 03

Verify the reproduction.

A bit-identical guarantee is only useful if you check it. Hash the TSV; commit the hash to your repo alongside the recipe. Any future run that produces a different hash signals a version mismatch, an environment problem, or — rare but important — a real upstream change in the engine.

Hash + assert
import hashlib, pathlib

EXPECTED = "a7c3f9e2..."  # commit alongside repro.py

actual = hashlib.sha256(
    pathlib.Path("figure2.tsv").read_bytes()
).hexdigest()

assert actual == EXPECTED, (
    f"hash mismatch — expected {EXPECTED}, got {actual}"
)
print("reproduction verified")
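The hash-and-assert pattern extends naturally to a record-then-verify helper: the first run writes the hash file you commit, and every later run checks against it. Pure stdlib; verify_or_record and the sidecar hash file are conventions of this sketch, not a GenAIRR API:

```python
import hashlib
import pathlib


def verify_or_record(data_path: str, hash_path: str) -> str:
    """Record the dataset's SHA-256 on first run; assert it matches thereafter."""
    actual = hashlib.sha256(pathlib.Path(data_path).read_bytes()).hexdigest()
    ref = pathlib.Path(hash_path)
    if not ref.exists():
        # first run: write the hash, then commit this file next to repro.py
        ref.write_text(actual + "\n")
        return "recorded"
    expected = ref.read_text().strip()
    assert actual == expected, f"hash mismatch: expected {expected}, got {actual}"
    return "verified"
```

Wire it into the end of the repro script so the check runs every time the dataset is regenerated.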

PART 04

Common reproduction failures.

Three patterns account for the vast majority of reproduction failures. Catch them early.

1 · Unpinned GenAIRR version

A user runs pip install GenAIRR without a version pin and gets a newer engine. Sampling order may have changed for legitimate reasons; output drifts. Always publish a requirements.txt with genairr==X.Y.Z.
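A guard at the top of repro.py can fail fast before any simulation runs. check_pin is this recipe's own helper, not a GenAIRR API, and the commented lookup assumes the distribution is named genairr on PyPI:

```python
from importlib import metadata

PINNED = "1.1.0"  # keep in sync with requirements.txt


def check_pin(installed: str, pinned: str = PINNED) -> None:
    """Raise immediately if the installed engine differs from the published pin."""
    if installed != pinned:
        raise RuntimeError(
            f"GenAIRR {installed} installed, but the paper pins {pinned}; "
            f"run: pip install genairr=={pinned}"
        )


# at the top of repro.py, before building the pipeline:
# check_pin(metadata.version("genairr"))
```

Failing here turns a silent drift into an explicit error with the fix in the message.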

2 · Custom RefDataConfig not serialized

If you built a custom config in Python (RefDataConfig.vdj() + add_v_allele calls), the allele set is in the script — but if you loaded one from a private file, ship the file. Otherwise the reader can't reach the same V/D/J pool.
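One way to ship a code-built config is to dump it as canonical JSON and hash that too, so readers can confirm they loaded the same allele pool. The dictionary shape below is purely illustrative; if RefDataConfig offers its own JSON export, serialize with that instead:

```python
import hashlib
import json

# Illustrative allele pool standing in for a custom RefDataConfig.
config = {
    "v_alleles": ["IGHV1-2*02", "IGHV3-23*01"],
    "d_alleles": ["IGHD3-10*01"],
    "j_alleles": ["IGHJ4*02"],
}

# canonical form: sorted keys, no whitespace variation, so the hash is stable
canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
with open("custom_config.json", "w") as fh:
    fh.write(canonical)

config_hash = hashlib.sha256(canonical.encode()).hexdigest()
print(f"config sha256: {config_hash}")
```

Ship custom_config.json with the paper and commit the hash the same way as the dataset hash.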

3 · Different number of records

Records at n=50_000 are not the first 50k of n=100_000 unless the per-record seed math happens to line up. A run with seed=42 and a fixed n yields the same record at the same index every time, but only for that exact n: change the record count and the draws may no longer line up. Publish n alongside the other inputs and reproduce with it unchanged.

PART 05

What an audit trail buys you.

Beyond reproduction, every record carries the addressed trace — the full history of random draws that produced it. If a future reviewer asks "why does record 4732 have a 24-mer NP1?", you can answer with the exact draws, not a hand-wave.

Inspecting a single record's trace
# trace lookup for the record at position 4732
out = (
    ga.Experiment.on("human_igh")
       .recombine()
       .run(n=5000, seed=42)
)
trace = out[4732].trace()

print(trace.find("sample_allele.v").value)
print(trace.find("np.np1.length").value)
print(trace.prefix_count("mutate.s5f"))

Related recipes

Where to next.

D · 02 · Stream millions without OOM →

For paper-sized datasets, switch the repro script to the streaming API.

Concept · The AIRR Record →

All ~70 fields the repro will reproduce, including the truth columns that anchor the audit.