Corruption and sequencing artefacts¶
Real sequencing data doesn't ship clean. PCR amplifies errors, sequencers miscall bases, polymerases insert and delete, adapter loss clips the ends, and library prep flips the strand half the time. GenAIRR models each of these as a distinct observation-stage pass — separate from biological SHM, separately counted, and individually tunable to match the artefact profile of the sequencer you're benchmarking against.
What counts as observation-stage corruption¶
Seven passes belong in the observation-stage block. Each models a distinct artefact mechanism, each has its own AIRR provenance, and each is independently tunable:
| Pass | What it does | Counter / field |
|---|---|---|
.pcr_amplify(...) |
PCR-cycle substitution errors | n_pcr_errors |
.sequencing_errors(...) |
Sequencer base-call errors (lowercases bytes) | n_quality_errors |
.ambiguous_base_calls(...) |
N-base injection (replaces bytes with N) |
(no dedicated counter — see below) |
.polymerase_indels(...) |
Polymerase-stage insertions and deletions | n_indels, n_v_indels, n_d_indels, n_j_indels |
.end_loss_5prime(...) |
5' adapter / primer loss | end_loss_5_length |
.end_loss_3prime(...) |
3' adapter / primer loss | end_loss_3_length |
.random_strand_orientation(...) |
Strand flip during library prep | rev_comp |
Plus .paired_end(...), which isn't corruption — it's a read-layout
projection — but it sits in the same descendant-phase block of the
pipeline and is usually configured alongside the corruption passes.
See Paired-end reads and FASTQ.
Biological mutation vs artefact¶
The single most important distinction: .mutate(...) is the
only pass that increments n_mutations. Every corruption pass
above is observation-stage; their counts surface on dedicated
fields. A downstream pipeline that conflates them will misattribute
real biology to library prep and vice versa.
| Mechanism | Pipeline stage | Counter |
|---|---|---|
| Somatic hypermutation | Biology — .mutate(...) |
n_mutations (plus per-segment n_v_mutations etc.) |
| PCR substitution | Library — .pcr_amplify(...) |
n_pcr_errors |
| Sequencer base-call error | Sequencer — .sequencing_errors(...) |
n_quality_errors |
| N-base injection | Sequencer / quality flagging — .ambiguous_base_calls(...) |
(none on the record) |
| Polymerase indel | Library — .polymerase_indels(...) |
n_indels + per-segment partition |
| Adapter / primer loss | Library — .end_loss_*prime(...) |
end_loss_*_length |
| Strand flip | Library — .random_strand_orientation(...) |
rev_comp |
n_mutations is biology, every other field is artefact. If you
want a "total perturbation budget" you sum them downstream — they
don't aggregate automatically.
See SHM and mutation targeting for the biology side of the boundary in detail.
PCR and sequencing errors¶
Two substitution-style corruption passes — same shape, different biology.
PCR errors¶
result = (
ga.Experiment.on("human_igh")
.recombine()
.pcr_amplify(count=(0, 3))
.run_records(n=100, seed=1)
)
rec = result[0]
rec["n_pcr_errors"] # int — number of PCR substitutions on this record
Each PCR error is a single-base substitution drawn uniformly across
the molecule. The count argument follows the standard
int / (low, high) / [(value, weight), ...] pattern.
Sequencing errors¶
result = (
ga.Experiment.on("human_igh")
.recombine()
.sequencing_errors(count=(0, 5))
.run_records(n=100, seed=1)
)
rec["n_quality_errors"] # int — number of sequencer base-call errors
sequencing_errors models sequencer-quality miscalls. The pass
lowercases the affected bytes in the assembled sequence as a
visible quality marker. So sequence may end up with mixed-case
bytes: uppercase = clean, lowercase = sequencer-flagged
low-quality position. Downstream pipelines that want to mask
quality-flagged bases can do
sequence.upper() to normalise.
N corruption / ambiguous bases¶
result = (
ga.Experiment.on("human_igh")
.recombine()
.ambiguous_base_calls(count=(0, 2))
.run_records(n=100, seed=1)
)
rec["sequence"] # may contain 'N' characters at corrupted positions
N-base injection replaces individual bytes with N. Useful when
the downstream tool you're benchmarking handles ambiguity codes
specifically (allele-call walkers, junction translators).
There is no n_ns counter on the AIRR record. The pass leaves
no provenance field; the only signal is the N characters in
sequence. If you need a count, derive it: rec["sequence"].count("N").
A few facts about how N corruption composes with downstream
analysis:
- Allele-call walkers handle
Nas a wildcard. The GenAIRR walker classifiesNas matching any canonical reference base (per the standard convention), so an N-injected record can still produce a meaningfulv_call. Tie sets may widen. - Junction translation produces
Xcodons. AnN-containing codon translates toX;junction_aaandsequence_aamay carryXcharacters. Theproductivepredicate still evaluates correctly —Xcodons don't satisfy the stop-codon check but also don't satisfy the V-Cys or J-W/F anchor preservation checks if they land at the anchor positions.
Polymerase indels¶
result = (
ga.Experiment.on("human_igh")
.recombine()
.polymerase_indels(count=(0, 2), insertion_prob=0.5)
.run_records(n=100, seed=1)
)
rec["n_indels"] # total polymerase indel events
rec["n_v_indels"] # partitioned by carried segment
rec["n_d_indels"]
rec["n_j_indels"]
Each event is either an insertion (random base inserted at a
random position) or a deletion (single base removed), drawn per
insertion_prob. The n_indels total partitions exactly into the
per-segment counters: n_v + n_d + n_j == n_indels for events
that land in V / D / J. Events in NP1 / NP2 roll into n_indels
only (they're not allocated to V/D/J).
Indels vs end-loss¶
A common confusion. Both shorten / extend the sequence in some way; the biology is completely different:
| Mechanism | What it does | AIRR fields |
|---|---|---|
| Polymerase indel | Random single-base insertions and deletions inside the molecule | n_indels, per-segment partition |
| End-loss | Block deletion from the molecule's 5' or 3' end | end_loss_5_length, end_loss_3_length |
If your downstream pipeline conflates them, the per-segment
identity and coordinate calculations will be wrong. Use
n_indels for "did the polymerase mis-step in the middle of the
molecule"; use end_loss_*_length for "how much got chewed off
the ends".
End-loss¶
result = (
ga.Experiment.on("human_igh")
.recombine()
.end_loss_5prime(length=(0, 8))
.end_loss_3prime(length=(0, 4))
.run_records(n=100, seed=1)
)
rec["end_loss_5_length"] # bases removed from 5' end
rec["end_loss_3_length"] # bases removed from 3' end
End-loss models adapter / primer loss during library prep —
the read starts a few bases in from the molecule's actual 5' end
and / or terminates a few bases short of the 3' end. The
length argument follows the standard
int / (low, high) / [(value, weight), ...] shape.
primer_trim_*prime is a backwards-compatible alias¶
# These two are exactly equivalent:
.end_loss_5prime(length=(0, 8))
.primer_trim_5prime(length=(0, 8)) # legacy alias — same pass, same fields
The legacy primer_trim_*prime names predate the engine's
distinction between recombination-stage exonuclease trim
(TrimPass, the v_trim_3 family) and observation-stage end
loss (EndLossPass, the end_loss_*_length family). New code
should use end_loss_*prime for clarity.
End-loss vs recombination trim¶
Another common confusion. Recombination trims and end-loss are two separate passes with two separate sets of fields:
| Mechanism | Engine pass | AIRR fields | Biology stage |
|---|---|---|---|
| Recombination trim | TrimPass |
v_trim_3, d_trim_5, d_trim_3, j_trim_5 |
Recombination (germline exonuclease) |
| End loss | EndLossPass |
end_loss_5_length, end_loss_3_length |
Observation (library prep / sequencing) |
Don't conflate them — they capture different biology and a downstream pipeline that adds them together gets the wrong total. See the Recombination + junction biology guide for the detailed comparison.
Random strand orientation¶
result = (
ga.Experiment.on("human_igh")
.recombine()
.random_strand_orientation(prob=0.5)
.run_records(n=100, seed=1)
)
rec["rev_comp"] # bool — True if this record was flipped
At prob=0.5, half the records get reverse-complemented in place.
The rev_comp flag tells you which way the strand went;
sequence and sequence_aa reflect the post-flip orientation;
junction, np1, np2 similarly reflect the post-flip bytes.
A few facts to know:
- The flip is projection-level, not biology. Real biology isn't being inverted; the read just happens to come off the sequencer in the other orientation. Allele truth and all the per-segment provenance fields stay correct; coordinates remap to the flipped sequence frame.
- Paired-end windows are sliced AFTER the strand flip. If
.paired_end(...)follows.random_strand_orientation(...), R1 and R2 are sliced from the post-flip molecule, so they reflect the strand decision automatically. Don't apply a second flip downstream.
Ordering with clonal workflows¶
All seven corruption passes are descendant-phase. They model per-read artefacts — every emitted clone member or observed lineage cell gets independent PCR errors, independent indel events, independent end-loss draws, etc.
result = (
ga.Experiment.on("human_igh")
.recombine()
.clonal_repertoire(n_clones=10, max_size=50)
.mutate(model="s5f", rate=0.03) # biology — descendant-phase
.pcr_amplify(count=(0, 3)) # corruption — descendant-phase
.polymerase_indels(count=(0, 2))
.ambiguous_base_calls(count=(0, 2))
.sequencing_errors(count=(0, 5))
.end_loss_5prime(length=(0, 8))
.end_loss_3prime(length=(0, 4))
.random_strand_orientation(prob=0.5)
.paired_end(r1_length=150, insert_size=300)
.run_records(seed=1)
)
Calling corruption before a flat fork (clonal_repertoire or legacy
expand_clones) raises ValueError at chain time because the artefact would be
shared by the whole clone. Corruption may also follow clonal_lineage; in that
case it is applied independently to each observed sampled cell. See
Clonal simulation overview for the phase discipline in
full.
Per-platform calibrated profiles¶
Different sequencing platforms break sequences in characteristic ways: MiSeq adds quality noise at the 3′ end, NovaSeq drops bases uniformly with a slight 5′ bias, PacBio HiFi inserts homopolymer indels. Three starting profiles for the platforms researchers ask about most often:
# Counts are per-record ranges. For a ~350bp read, count=(0, 2) Ns
# is roughly 0.1% N-density.
PROFILES = {
"miseq_300pe": dict(
end_loss_5=(5, 25),
end_loss_3=(8, 45), # 3′ decay heavier than 5′
ambig_ns=(0, 2),
indels=(0, 1),
),
"novaseq_150pe": dict(
end_loss_5=(3, 15),
end_loss_3=(3, 15),
ambig_ns=(0, 1),
indels=(0, 1),
),
"pacbio_hifi": dict(
end_loss_5=(0, 5),
end_loss_3=(0, 5),
ambig_ns=(0, 1),
indels=(0, 3), # HiFi: low subs, higher indels
),
}
def apply_profile(exp, name):
p = PROFILES[name]
return (
exp.end_loss_5prime(length=p["end_loss_5"])
.end_loss_3prime(length=p["end_loss_3"])
.ambiguous_base_calls(count=p["ambig_ns"])
.polymerase_indels(count=p["indels"])
)
result = apply_profile(
ga.Experiment.on("human_igh").recombine().mutate(count=(5, 20)),
"miseq_300pe",
).run_records(n=5000, seed=42)
The set of passes you call is the profile — these are
starting points, not exhaustive: add pcr_amplify or
sequencing_errors rates when the platform calls for them.
Calibrating from your own reads¶
Don't trust a preset blindly. Run a small panel under the profile and match two distributions to a sample of real reads from your sequencer:
Match the N-density.
n_rate_real = (real_reads.str.count("N") / real_reads.str.len()).mean()
n_rate_sim = (sim["sequence"].str.count("N") / sim["sequence_length"]).mean()
# Adjust the ambiguous_base_calls count range until these match
Match the length distribution.
# End-loss is the dominant driver of read-length variance
sim_lens = sim["sequence_length"]
real_lens = real_reads.str.len()
# Match means + variances by widening or tightening the (min, max)
# ranges on end_loss_5prime / end_loss_3prime
Validation and replay¶
validate_records re-derives every corruption counter from the
event ledger and trace addresses, then compares against the
record's reported values. Issue kinds the catalogue covers:
| Counter | Issue kind on mismatch |
|---|---|
n_pcr_errors |
NPcrErrorsMismatch |
n_quality_errors |
NQualityErrorsMismatch |
n_indels |
NIndelsMismatch |
n_v_indels / n_d_indels / n_j_indels |
per-segment indel-count mismatches |
end_loss_5_length / end_loss_3_length |
EndLossLengthMismatch |
The rev_comp flag is derived from the IR; mismatch surfaces
similarly. Every corruption mechanism records its choices on
stable trace addresses (corrupt.pcr.count, corrupt.quality.count,
corrupt.indel.count, corrupt.end_loss.5, corrupt.end_loss.3,
etc.), so trace replay reproduces the same artefacts deterministically.
Paired-end FASTQ export writes the projected r1_sequence /
r2_sequence fields verbatim — by AIRR-projection time the
corruption + strand-flip + paired-end pipeline has already
finished, so the exported reads carry all the artefacts the
record's counters report. See Validation hub
for the validator's full scope.
Common mistakes¶
A handful of issues that show up repeatedly with corruption.
Expecting PCR or sequencing errors to increment n_mutations.
They never do. n_mutations is biology only. PCR substitution
surfaces on n_pcr_errors; sequencer miscalls surface on
n_quality_errors. If your downstream pipeline reads
n_mutations as a "total perturbation budget", it's leaving the
artefact counters off.
Confusing end-loss with recombination trim. Recombination
trim is biology (germline exonuclease, fields v_trim_3 etc.);
end-loss is library-prep artefact (end_loss_*_length). They're
separate passes, separate fields, separate biology stages. The
Recombination + junction biology
guide has the side-by-side comparison.
Putting corruption before a clonal fork. All seven corruption
passes are descendant-phase — they're per-read artefacts that need to
vary within a clonal family. The DSL rejects this before
clonal_repertoire() or expand_clones() with a message naming the
offending method and telling you to move it after the fork.
Reverse-complementing R2 again after random strand orientation.
When .random_strand_orientation(...) is in the pipeline,
sequence (and r1_sequence / r2_sequence if paired-end is
on) already reflect the post-flip orientation. R2 is also
already reverse-complemented at projection time (that's the
paired-end-output convention). Applying any further flips
downstream produces wrong reads. The validator's
PairedEndWindowMismatch check enforces the per-record R2 RC
state.
Looking for an n_ns counter. ambiguous_base_calls doesn't
ship one. The only signal is the N characters in sequence —
count them yourself with rec["sequence"].count("N").
Where to go next¶
- SHM and mutation targeting — the biology side of the boundary; the per-segment / per-V-subregion partition.
- Paired-end reads and FASTQ — the read-layout projection that usually sits alongside corruption.
- Clonal simulation overview — the clonal model chooser and phase discipline that gates every corruption pass.
- Recombination + junction biology — the trim vs end-loss boundary in detail.
- Validation & reproducibility — the validator's issue catalogue covering every corruption counter.
- Your first AIRR record — the field-by-field catalogue including every counter named on this page.