Sequencing artifacts, dialed in
Real reads are noisy. 5'/3' loss eats segments. Indels shift frames. Base-callers fail. GenAIRR can dial each one independently — and the truth fields stay correct.
The intuition
A simulator that only models clean sequences is benchmarking your aligner's best case. The interesting case — the one your benchmark actually needs to cover — is when the molecule is half-eaten and the base-caller is guessing. Dial in each artifact and see how each one bites.
Live corruption
The top line is the clean post-mutation sequence. The bottom line is what your aligner would actually receive. Move the sliders.
Even with all artifacts cranked, the V/D/J coloring tells you exactly which base belongs to which segment. That mapping is the truth GenAIRR preserves through every transformation.
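To make the idea of a per-base segment mapping concrete, here is a toy sketch in plain Python. It is not GenAIRR's implementation: the function `corrupt_with_labels`, its parameters, and the label string are hypothetical and exist only for illustration. The sketch keeps a label (V, D, or J) next to every base and applies each corruption to the sequence and the labels together, so the two never drift apart.

```python
import random

def corrupt_with_labels(seq, labels, loss_5=0, loss_3=0, n_count=0, seed=0):
    """Toy model: trim both ends and mask bases with N, keeping labels aligned.

    `labels` holds one character per base ("V", "D" or "J").
    Hypothetical helper for illustration only, not part of GenAIRR.
    """
    assert len(seq) == len(labels), "need exactly one label per base"
    rng = random.Random(seed)

    # 5'/3' loss: cut the same positions from the sequence and the labels.
    end = len(seq) - loss_3
    bases = list(seq[loss_5:end])
    kept_labels = labels[loss_5:end]

    # N masking: the base becomes unknown, but its segment label is unchanged.
    for i in rng.sample(range(len(bases)), k=min(n_count, len(bases))):
        bases[i] = "N"

    return "".join(bases), kept_labels

clean_seq = "ACGTACGTACGT"
clean_labels = "VVVVVVDDDJJJ"
noisy_seq, noisy_labels = corrupt_with_labels(
    clean_seq, clean_labels, loss_5=2, loss_3=1, n_count=2, seed=42
)
print(noisy_seq)
print(noisy_labels)
```

Indels would follow the same pattern: a deletion removes a base and its label together, and an insertion adds a base with its own marker, so every surviving base can still be traced to its segment.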
The corresponding code
Each artifact is its own pass in the pipeline. The length= ranges set how many bases 5'/3' loss removes; the count= ranges set how many indels and N bases are introduced. Drop a pass to skip that artifact entirely.
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(model="s5f", count=(5, 25))
    .corrupt_5prime_loss(length=(0, 20))
    .corrupt_3prime_loss(length=(0, 20))
    .corrupt_indels(count=(0, 2), insertion_prob=0.5)
    .corrupt_ns(count=(0, 5))
    .run_records(n=1000, seed=42, expose_provenance=True)
)
Run the snippet above, then run it again without the two corrupt_*_loss calls. Compare the mean sequence_length across the two runs: the run with loss enabled should be measurably shorter. The corruption changed the sequence, but every truth_v_call, truth_d_call, and truth_j_call stays correct in both runs.
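One way to do the comparison is sketched below. It assumes each record can be read like a mapping with a sequence_length key. The actual return type of run_records may differ, so adapt the field access if needed. The helper `mean_sequence_length` is hypothetical, and the two small lists are hand-written placeholders standing in for the two runs, not GenAIRR output.

```python
from statistics import mean

def mean_sequence_length(records):
    """Average the sequence_length field over an iterable of records.

    Assumes mapping-style access (rec["sequence_length"]); this is an
    assumption about the record type, not confirmed GenAIRR behavior.
    """
    return mean(rec["sequence_length"] for rec in records)

# Placeholder records standing in for the two runs (values are made up).
with_loss = [{"sequence_length": 340}, {"sequence_length": 352}]
without_loss = [{"sequence_length": 365}, {"sequence_length": 371}]

shortening = mean_sequence_length(without_loss) - mean_sequence_length(with_loss)
print(f"Mean length drops by {shortening} bases when loss is enabled")
```

With real data, pass in the output of the two run_records calls in place of the placeholder lists.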
Take the snippet above and crank corruption up: corrupt_5prime_loss(length=(20, 60)), corrupt_3prime_loss(length=(5, 25)), corrupt_ns(count=(0, 8)). Generate 1,000 records and confirm truth_v_call, truth_d_call, truth_j_call are non-empty in 100% of them. That's the ground-truth guarantee: the alleles were captured at construction time, and no amount of downstream chewing can erase them.
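The check itself can be sketched as a small helper. As before, this assumes records behave like mappings keyed by the truth field names used in this section. The helper `missing_truth_calls` is hypothetical, and the sample records are hand-written placeholders (one deliberately broken) rather than simulator output.

```python
TRUTH_FIELDS = ("truth_v_call", "truth_d_call", "truth_j_call")

def missing_truth_calls(records):
    """Return the indices of records where any truth call is absent or empty.

    Assumes mapping-style records; adapt the access if run_records
    returns objects with attributes instead.
    """
    return [
        i
        for i, rec in enumerate(records)
        if not all(rec.get(field) for field in TRUTH_FIELDS)
    ]

# Placeholder records; the second one is deliberately missing its D call.
sample = [
    {"truth_v_call": "IGHV1-2*02", "truth_d_call": "IGHD3-10*01", "truth_j_call": "IGHJ4*02"},
    {"truth_v_call": "IGHV3-23*01", "truth_d_call": "", "truth_j_call": "IGHJ6*02"},
]

bad = missing_truth_calls(sample)
print(f"{len(bad)} of {len(sample)} records lack a truth call: {bad}")
```

On the 1,000 heavily corrupted records, the returned list should be empty if the guarantee holds.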