Lesson 05 · ~3 min

The ground-truth payoff

Same molecule. Two annotation paths. One guesses. One knows. The whole reason GenAIRR exists is the truth record.


The intuition

When you benchmark an aligner against real sequencing data, you're benchmarking it against another aligner — the one that produced your "ground truth." That's a circular reference. GenAIRR breaks the loop: simulate, corrupt, then ask any aligner what it thinks. The truth is right there, by construction.

Side by side

Below: the same simulated heavy-chain sequence after corruption (5' loss of 12 bases, indels, N-bases). First, what an off-the-shelf aligner reports; then, what GenAIRR knows for certain.

Aligner output (inferred)
v_call            IGHV3-23*01 (or *04?)
d_call            IGHD3-10 / IGHD2-21 (ambiguous)
j_call            IGHJ4*02
v_sequence_start  ~0   (lost leader; can't tell)
v_sequence_end    38
np1_length        unknown (N's hide it)
d_sequence_start  ~43
junction_length   39
n_mutations       ~5   (confounded by N's and indels)
productive        unsure (frame uncertain)

GenAIRR record (truth)
truth_v_call      IGHV3-23*04
truth_d_call      IGHD3-10*01
truth_j_call      IGHJ4*02
v_sequence_start  0
v_sequence_end    38
np1_length        5
d_sequence_start  43
junction_length   39
n_mutations       7    (exact; N's distinguished)
productive        True

Two call fields, by design

Look closely at the record. There are actually two V-call columns: v_call is what GenAIRR's own evidence-based caller would report from the corrupted sequence (and may carry a comma-separated tie-set when the surviving bases don't distinguish close paralogs). truth_v_call is the allele that was sampled at recombination — exact, unambiguous, captured before any corruption could touch it. The same split exists for D and J. When you benchmark an external aligner, you compare it against truth_v_call: that's the only field that's guaranteed correct regardless of how heavy the corruption is.
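The tie-set convention matters when you score calls. A minimal sketch of lenient matching, assuming the comma-separated format described above; the helper name is ours, not part of the GenAIRR API:

```python
def call_matches(reported: str, truth: str) -> bool:
    """Lenient hit: the truth allele appears in the (possibly tied) reported call."""
    return truth in {c.strip() for c in reported.split(",")}

# A tie-set that still contains the truth counts as a hit:
print(call_matches("IGHV3-23*01,IGHV3-23*04", "IGHV3-23*04"))  # True
print(call_matches("IGHV3-23*01", "IGHV3-23*04"))              # False
```

Whether a tie counts as a hit or a miss is a scoring policy, not a property of the data; decide it up front and report it with your benchmark.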

Why this matters for benchmarking

Run any aligner on the GenAIRR sequence. Compare its output to the truth columns. The diff is the error rate; there is no other reference frame to argue about. That property is what makes GenAIRR a sound substrate for aligner benchmarking.
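The "diff against truth" idea extends to every field, not just the V call. A sketch with toy data standing in for a truth/prediction frame already joined on sequence_id; column names follow the record shown above:

```python
import pandas as pd

# Toy rows: row 0 gets the V allele wrong, both rows get J right.
merged = pd.DataFrame({
    "truth_v_call": ["IGHV3-23*04", "IGHV1-2*02"],
    "v_call":       ["IGHV3-23*01", "IGHV1-2*02"],
    "truth_j_call": ["IGHJ4*02", "IGHJ6*01"],
    "j_call":       ["IGHJ4*02", "IGHJ6*01"],
})

# Per-field error rate = disagreement rate against the truth column.
error_rates = {
    field: (merged[f"truth_{field}"] != merged[field]).mean()
    for field in ("v_call", "j_call")
}
print(error_rates)  # {'v_call': 0.5, 'j_call': 0.0}
```

The same pattern works for boundary fields (v_sequence_end, junction_length) since the truth values are exact integers.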

What you've built

You now have intuition for the five phases, the S5F model, the corruption pipeline, and what makes GenAIRR's metadata trustworthy. The rest of the docs are recipes — go pick one.

benchmark_pattern.py
import GenAIRR as ga
import pandas as pd

experiment = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(count=(5, 25))
       .corrupt_5prime_loss(length=(0, 20))
       .corrupt_indels(count=(0, 2))
)

# expose_provenance=True adds truth_v/d/j_call columns to every record
result = experiment.run_records(n=10_000, seed=42, expose_provenance=True)
result.to_fasta("corrupted.fasta")        # feed to aligner X
truth = result.to_dataframe()             # keep the truth columns

# run aligner X on corrupted.fasta → predictions.tsv, then join on sequence_id
predictions = pd.read_table("predictions.tsv")
merged = truth.merge(predictions, on="sequence_id", suffixes=("_truth", "_aligner"))
v_recall = (merged.truth_v_call == merged.v_call_aligner).mean()
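Allele-level recall is the strictest reading of that comparison. A common, more forgiving variant scores at the gene level by stripping the "*NN" allele suffix, so confusions between close alleles of the same gene don't count as errors. A sketch with toy call columns standing in for the merged truth/prediction frame:

```python
import pandas as pd

# Row 0: right gene, wrong allele. Rows 1-2: exact allele match.
truth_v = pd.Series(["IGHV3-23*04", "IGHV1-2*02", "IGHV4-34*01"])
pred_v  = pd.Series(["IGHV3-23*01", "IGHV1-2*02", "IGHV4-34*01"])

def gene(calls: pd.Series) -> pd.Series:
    """Drop the allele suffix: 'IGHV3-23*04' -> 'IGHV3-23'."""
    return calls.str.split("*").str[0]

allele_recall = (truth_v == pred_v).mean()              # 2 of 3 exact
gene_recall   = (gene(truth_v) == gene(pred_v)).mean()  # 3 of 3 same gene
```

Report both: the gap between them tells you how much of an aligner's error budget is spent on near-identical paralogs rather than outright misassignment.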