
GenAIRR Demo

Generate annotated AIRR records with ground truth, replayable traces, and clonal families from one fluent Python DSL.

From DSL to AIRR record: the part other simulators usually hide

1. DSL builder: Experiment.on(...).recombine().productive_only().mutate(). Declares the pipeline; no record is sampled yet.
2. compile(): a typed, ordered, signed pass plan (contract set, plan signature, live-call hooks).
3. Executor: passes walk the persistent IR.
   - trace: address + value for each draw
   - events: committed IR changes
   - hooks: committed V/D/J live calls
4. Projection: AIRR dict / DataFrame row.
   - observed: v_call, junction_aa
   - truth: truth_v_call, productive
   - counters: n_mutations, n_v_mutations
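The "declare first, sample later" idea in stage 1 can be illustrated with a minimal sketch of a fluent builder. `ToyExperiment` and `ToyStep` are hypothetical stand-ins written for this illustration, not GenAIRR classes, and the sketch omits typing contracts and the plan signature entirely.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyStep:
    """One declared pipeline step: a name plus its keyword parameters."""
    name: str
    params: tuple = ()


@dataclass
class ToyExperiment:
    """Hypothetical stand-in for a fluent builder; not the GenAIRR class."""
    reference: str
    steps: list = field(default_factory=list)

    def _add(self, name, **params):
        # Declaring a step only appends to the list; nothing is sampled here.
        self.steps.append(ToyStep(name, tuple(sorted(params.items()))))
        return self  # returning self is what makes the chain fluent

    def recombine(self):
        return self._add("recombine")

    def productive_only(self):
        return self._add("productive_only")

    def mutate(self, **params):
        return self._add("mutate", **params)

    def compile(self):
        # Freeze the declared steps into an immutable, ordered plan.
        return tuple(self.steps)


plan = (
    ToyExperiment("HUMAN_IGH_OGRDB")
    .recombine()
    .productive_only()
    .mutate(model="s5f", rate=0.03)
    .compile()
)
print([step.name for step in plan])  # ['recombine', 'productive_only', 'mutate']
```

The point of the pattern is that every method before `compile()` is pure bookkeeping, so the plan can be inspected or compared before any random draw happens.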

1. Annotated AIRR records

One fluent chain creates records, truth fields, mutation counters, and a pandas-ready table.

Code
import GenAIRR as ga

result = (
    ga.Experiment.on("HUMAN_IGH_OGRDB")
      .recombine()
      .productive_only()
      .mutate(model="s5f", rate=0.03)
      .run_records(n=1000, seed=42, expose_provenance=True)
)
df = result.to_dataframe()
Output
len(result) = 1000
df.shape    = (1000, 102)
elapsed     = 2.0s
What one generated AIRR row contains: observed call, committed truth, and counters travel together

Row 0 sequence: CVKDDGNRGYCSGDSCYGHCCALDYWYFDLW

Observed fields
v_call: IGHVF10-G38*04
d_call: IGHD2-15*01
j_call: IGHJ2*01

Ground truth
truth_v_call: IGHVF10-G38*04
truth_d_call: IGHD2-15*01
truth_j_call: IGHJ2*01

Record state
productive: True
n_mutations: 10
columns: 102
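Because observed and truth calls sit in the same row, agreement between them can be measured without a join. The sketch below assumes rows are plain dicts with the field names shown above, and that an observed call may list several comma-separated alleles (as the AIRR format allows). The second row is invented purely to show a mismatch, and `call_agreement` is a hypothetical helper, not part of GenAIRR.

```python
# Stand-in rows. Row 0 uses the values shown above; row 1 is invented
# for illustration and carries a deliberately wrong observed V call.
rows = [
    {"v_call": "IGHVF10-G38*04", "truth_v_call": "IGHVF10-G38*04",
     "d_call": "IGHD2-15*01", "truth_d_call": "IGHD2-15*01",
     "j_call": "IGHJ2*01", "truth_j_call": "IGHJ2*01"},
    {"v_call": "TOY_V*01", "truth_v_call": "TOY_V*02",
     "d_call": "TOY_D*01,TOY_D*02", "truth_d_call": "TOY_D*02",
     "j_call": "TOY_J*01", "truth_j_call": "TOY_J*01"},
]


def call_agreement(rows, segment):
    """Fraction of rows whose observed call set contains the truth call."""
    hits = 0
    for row in rows:
        # Assumption: ambiguous observed calls are comma-separated.
        observed = {c.strip() for c in row[f"{segment}_call"].split(",")}
        if row[f"truth_{segment}_call"] in observed:
            hits += 1
    return hits / len(rows)


for segment in ("v", "d", "j"):
    print(segment, call_agreement(rows, segment))
```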

2. Productive by construction

productive_only() changes the sampling support before each draw.

Code
unconstrained = (
    ga.Experiment.on("HUMAN_IGH_OGRDB")
      .recombine()
      .run_records(n=1000, seed=42)
)
constrained = (
    ga.Experiment.on("HUMAN_IGH_OGRDB")
      .recombine()
      .productive_only()
      .run_records(n=1000, seed=42)
)

unconstrained_productive = sum(
    1 for r in unconstrained if r["productive"]
) / len(unconstrained)
constrained_productive = sum(
    1 for r in constrained if r["productive"]
) / len(constrained)

print(f"Without productive_only(): {unconstrained_productive:.1%}")
print(f"With    productive_only(): {constrained_productive:.1%}")
Output
Without productive_only(): 18.5%
With    productive_only(): 100.0%
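To make "changes the sampling support" concrete, here is a toy sketch that is not GenAIRR's mechanism. It reduces productivity to a single in-frame check on length and compares drawing an insertion length from the full range with drawing only from the lengths that keep the total in frame. All names and the base length of 40 are assumptions chosen for the illustration.

```python
import random


def is_in_frame(total_length):
    """Toy productivity test: total length must be a multiple of 3."""
    return total_length % 3 == 0


def draw_unconstrained(rng, base_length, max_insert=9):
    # Any insertion length in [0, max_insert] is allowed.
    return base_length + rng.randint(0, max_insert)


def draw_constrained(rng, base_length, max_insert=9):
    # Restrict the support *before* drawing, so every draw is in frame.
    support = [n for n in range(max_insert + 1)
               if is_in_frame(base_length + n)]
    return base_length + rng.choice(support)


rng = random.Random(42)
free = [draw_unconstrained(rng, 40) for _ in range(1000)]
kept = [draw_constrained(rng, 40) for _ in range(1000)]

free_rate = sum(map(is_in_frame, free)) / len(free)
kept_rate = sum(map(is_in_frame, kept)) / len(kept)
print(f"unconstrained in-frame: {free_rate:.1%}")
print(f"constrained in-frame:   {kept_rate:.1%}")
```

In the constrained version no draw is ever thrown away, which is the difference from sampling freely and filtering afterwards. A real productivity check also has to consider stop codons and conserved residues, which this toy ignores.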

3. Auditable replay

One trace file replays the same record byte for byte, but replay refuses to run against a changed pipeline.

Replay is gated, not just seeded: same trace, two outcomes depending on signatures

Run: seed=42
Trace file: 12,996 bytes
Replay: replay_from_trace_file()
- same pipeline + cartridge: byte-identical Outcome
- SHM rate changed: ValueError before output
Code
exp = (
    ga.Experiment.on("HUMAN_IGH_OGRDB")
      .recombine().productive_only()
      .mutate(model="s5f", rate=0.03)
)
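# compile() turns the declared pipeline into a signed pass plan.
compiled = exp.compile()

# Run once, then capture that run as a trace file on disk.
outcome = compiled.simulator.run(seed=42)
trace_file = compiled.simulator.trace_file_from(outcome, seed=42)
trace_file.write_to("demo.trace.json")

# Read the trace back and replay it against the same compiled pipeline.
from GenAIRR._engine import TraceFile
tf = TraceFile.read_from("demo.trace.json")
replayed = compiled.simulator.replay_from_trace_file(tf, strict=False)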
Output
trace size = 12,996 bytes
byte-identical fields = sequence, V/D/J calls, junction_aa, n_mutations
Changed code
exp_modified = (
    ga.Experiment.on("HUMAN_IGH_OGRDB")
      .recombine().productive_only()
      .mutate(model="s5f", rate=0.05)
)
compiled_modified = exp_modified.compile()
compiled_modified.simulator.replay_from_trace_file(tf, strict=False)
Output
ValueError: replay_from_trace_file: pass plan signature mismatch.
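The gating behaviour above can be sketched in a few lines. This is a minimal illustration that assumes a signature is a hash over a canonical serialization of the plan. How GenAIRR actually computes its pass plan signature is not shown on this page, and the function names and draw addresses below are hypothetical.

```python
import hashlib
import json


def plan_signature(plan):
    """Hash a canonical JSON form of the plan (assumed scheme, for illustration)."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_trace(plan, draws):
    # A trace stores the signature of the plan that produced the draws.
    return {"plan_signature": plan_signature(plan), "draws": draws}


def replay(plan, trace):
    # Check the signature first, so a changed plan fails before any output.
    if trace["plan_signature"] != plan_signature(plan):
        raise ValueError("pass plan signature mismatch")
    return list(trace["draws"])


plan = [["recombine", {}], ["productive_only", {}],
        ["mutate", {"model": "s5f", "rate": 0.03}]]
# Invented draw addresses and values, standing in for a recorded trace.
trace = make_trace(plan, draws=[["v_allele", 17], ["d_allele", 4]])

print(replay(plan, trace))

changed = [["recombine", {}], ["productive_only", {}],
           ["mutate", {"model": "s5f", "rate": 0.05}]]
try:
    replay(changed, trace)
except ValueError as err:
    print(f"refused: {err}")  # refused: pass plan signature mismatch
```

Putting the check ahead of the replay loop is the design choice that matters: a seed alone would happily produce some record under the changed rate, just not the one the trace describes.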

4. Legacy fixed-size clonal families

One parent recombination can fork into many independently mutated descendants. This demo uses legacy expand_clones for the simple fixed-size star shape; for new clone benchmarks, see clonal_lineage for BCR trees and clonal_repertoire for TCR / abundance repertoires.

Code
result = (
    ga.Experiment.on("HUMAN_IGH_OGRDB")
      .recombine().productive_only()
      .expand_clones(n_clones=10, per_clone=50)
      .mutate(model="s5f", rate=0.02)
      .run_records(seed=42, expose_provenance=True)
)
Output
records = 500
parents = 10
per clone = 50
clone 0 SHM counts = 4, 14, 8, 7, 5, ...
validate_families() = ok
What the clonal result looks like: 10 parent outcomes, 500 descendant AIRR records

Clone 0 parent: IGHVF10-G38*04 / IGHD2-15*01 / IGHJ2*01. result.parents[0] holds the pre-fork IR + trace.
clone0_desc0: 4 SHM
clone0_desc1: 14 SHM
clone0_desc2: 8 SHM
clone0_desc3: 7 SHM
clone0_desc4: 5 SHM
+45 more descendants

Every row: clone_id + parent_id
Within clone: shared V/D/J truth
After fork: independent SHM
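The family invariants listed above (one parent per clone, shared V/D/J truth within a clone) can be checked directly on dict-like rows. The sketch below is a hypothetical check written for illustration, not the implementation of validate_families(). It assumes the field names clone_id, parent_id, and truth_*_call, and uses invented identifier values.

```python
from collections import defaultdict

TRUTH_FIELDS = ("truth_v_call", "truth_d_call", "truth_j_call")


def check_families(rows):
    """Return the clone_ids whose members disagree on parent or truth V/D/J."""
    seen = defaultdict(set)
    for row in rows:
        key = (row["parent_id"],) + tuple(row[f] for f in TRUTH_FIELDS)
        seen[row["clone_id"]].add(key)
    # A consistent clone collapses to exactly one (parent, V, D, J) key.
    return sorted(cid for cid, keys in seen.items() if len(keys) > 1)


# Stand-in rows: the calls are the clone 0 parent shown above, while the
# clone_id and parent_id values are invented for this example.
base = {"clone_id": 0, "parent_id": 0,
        "truth_v_call": "IGHVF10-G38*04",
        "truth_d_call": "IGHD2-15*01",
        "truth_j_call": "IGHJ2*01"}
good = [dict(base, n_mutations=n) for n in (4, 14, 8, 7, 5)]
broken = good + [dict(base, truth_j_call="TOY_J*01")]

print(check_families(good))
print(check_families(broken))
```

Mutation counts are deliberately left out of the key, since descendants are expected to differ in SHM after the fork.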

Next steps