Learn Lesson 04
Lesson 04 · ~3 min

Sequencing artifacts, dialed in

Real reads are noisy. 5'/3' loss eats segments. Indels shift frames. Base-callers fail. GenAIRR can dial each one independently — and the truth fields stay correct.

04 / 05

The intuition

A simulator that only models clean sequences is benchmarking your aligner's best case. The interesting case — the one your benchmark actually needs to cover — is when the molecule is half-eaten and the base-caller is guessing. Dial in each artifact and see how each one bites.

Live corruption

The top line is the clean post-mutation sequence. The bottom line is what your aligner would actually receive. Move the sliders.

CLEAN (truth)
CORRUPTED
5' LOSS
3' LOSS
INDEL %
N-BASE %

Even with all artifacts cranked, the V/D/J coloring tells you exactly which base belongs to which segment. That mapping is the truth GenAIRR preserves through every transformation.

The corresponding code

Each artifact is its own pass on the pipeline. length= ranges feed 5'/3' loss; count= ranges feed indels and N-bases. Drop a pass to skip that artifact entirely.

lesson_4.py
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .corrupt_5prime_loss(length=(0, 20))
       .corrupt_3prime_loss(length=(0, 20))
       .corrupt_indels(count=(0, 2), insertion_prob=0.5)
       .corrupt_ns(count=(0, 5))
       .run_records(n=1000, seed=42, expose_provenance=True)
)
Try it now

Run the snippet above, then run it again without the two corrupt_*_loss calls. Compare the mean sequence_length across the two runs — the loss-on version is measurably shorter. The corruption changed the sequence, but every truth_v_call, truth_d_call, and truth_j_call stays correct in both runs.

Exercise

Take the snippet above and crank corruption up: corrupt_5prime_loss(length=(20, 60)), corrupt_3prime_loss(length=(5, 25)), corrupt_ns(count=(0, 8)). Generate 1,000 records and confirm truth_v_call, truth_d_call, truth_j_call are non-empty in 100% of them. That's the ground-truth guarantee: the alleles were captured at construction time, and no amount of downstream chewing can erase them.