Corruption and sequencing artefacts¶

Real sequencing data doesn't ship clean. PCR amplifies errors, sequencers miscall bases, polymerases insert and delete, adapter loss clips the ends, and library prep flips the strand half the time. GenAIRR models each of these as a distinct observation-stage pass - separate from biological SHM, separately counted, and individually tunable to match the artefact profile of the sequencer you're benchmarking against.

What counts as observation-stage corruption¶

Seven passes belong in the observation-stage block. Each models a distinct artefact mechanism, each has its own AIRR provenance, and each is independently tunable:

Pass	What it does	Counter / field
`.pcr_amplify(...)`	PCR-cycle substitution errors	`n_pcr_errors`
`.sequencing_errors(...)`	Sequencer base-call errors (lowercases bytes)	`n_quality_errors`
`.ambiguous_base_calls(...)`	N-base injection (replaces bytes with `N`)	(no dedicated counter - see below)
`.polymerase_indels(...)`	Polymerase-stage insertions and deletions	`n_indels`, `n_v_indels`, `n_d_indels`, `n_j_indels`
`.end_loss_5prime(...)`	5' adapter / primer loss	`end_loss_5_length`
`.end_loss_3prime(...)`	3' adapter / primer loss	`end_loss_3_length`
`.random_strand_orientation(...)`	Strand flip during library prep	`rev_comp`

Plus .paired_end(...), which isn't corruption - it's a read-layout projection - but it sits in the same descendant-phase block of the pipeline and is usually configured alongside the corruption passes. See Paired-end reads and FASTQ.

Biological mutation vs artefact¶

The single most important distinction: .mutate(...) is the only pass that increments n_mutations. Every corruption pass above is observation-stage; their counts surface on dedicated fields. A downstream pipeline that conflates them will misattribute real biology to library prep and vice versa.

Mechanism	Pipeline stage	Counter
Somatic hypermutation	Biology - `.mutate(...)`	`n_mutations` (plus per-segment `n_v_mutations` etc.)
PCR substitution	Library - `.pcr_amplify(...)`	`n_pcr_errors`
Sequencer base-call error	Sequencer - `.sequencing_errors(...)`	`n_quality_errors`
N-base injection	Sequencer / quality flagging - `.ambiguous_base_calls(...)`	(none on the record)
Polymerase indel	Library - `.polymerase_indels(...)`	`n_indels` + per-segment partition
Adapter / primer loss	Library - `.end_loss_*prime(...)`	`end_loss_*_length`
Strand flip	Library - `.random_strand_orientation(...)`	`rev_comp`

n_mutations is biology, every other field is artefact. If you want a "total perturbation budget" you sum them downstream - they don't aggregate automatically.

See SHM and mutation targeting for the biology side of the boundary in detail.

PCR and sequencing errors¶

Two substitution-style corruption passes - same shape, different biology.

PCR errors¶

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .pcr_amplify(count=(0, 3))
      .run_records(n=100, seed=1)
)

rec = result[0]
rec["n_pcr_errors"]    # int - number of PCR substitutions on this record

Each PCR error is a single-base substitution drawn uniformly across the molecule. The count argument follows the standard int / (low, high) / [(value, weight), ...] pattern.

Sequencing errors¶

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .sequencing_errors(count=(0, 5))
      .run_records(n=100, seed=1)
)

rec["n_quality_errors"]    # int - number of sequencer base-call errors

sequencing_errors models sequencer-quality miscalls. The pass lowercases the affected bytes in the assembled sequence as a visible quality marker. So sequence may end up with mixed-case bytes: uppercase = clean, lowercase = sequencer-flagged low-quality position. Downstream pipelines that want to mask quality-flagged bases can do sequence.upper() to normalise.

N corruption / ambiguous bases¶

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .ambiguous_base_calls(count=(0, 2))
      .run_records(n=100, seed=1)
)

rec["sequence"]    # may contain 'N' characters at corrupted positions

N-base injection replaces individual bytes with N. Useful when the downstream tool you're benchmarking handles ambiguity codes specifically (allele-call walkers, junction translators).

There is no n_ns counter on the AIRR record. The pass leaves no provenance field; the only signal is the N characters in sequence. If you need a count, derive it: rec["sequence"].count("N").

A few facts about how N corruption composes with downstream analysis:

Allele-call walkers handle N as a wildcard. The GenAIRR walker classifies N as matching any canonical reference base (per the standard convention), so an N-injected record can still produce a meaningful v_call. Tie sets may widen.
Junction translation produces X codons. An N-containing codon translates to X; junction_aa and sequence_aa may carry X characters. The productive predicate still evaluates correctly - X codons don't satisfy the stop-codon check but also don't satisfy the V-Cys or J-W/F anchor preservation checks if they land at the anchor positions.

Polymerase indels¶

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .polymerase_indels(count=(0, 2), insertion_prob=0.5)
      .run_records(n=100, seed=1)
)

rec["n_indels"]      # total polymerase indel events
rec["n_v_indels"]    # partitioned by carried segment
rec["n_d_indels"]
rec["n_j_indels"]

Each event is either an insertion (random base inserted at a random position) or a deletion (single base removed), drawn per insertion_prob. The n_indels total partitions exactly into the per-segment counters: n_v + n_d + n_j == n_indels for events that land in V / D / J. Events in NP1 / NP2 roll into n_indels only (they're not allocated to V/D/J).

Indels vs end-loss¶

A common confusion. Both shorten / extend the sequence in some way; the biology is completely different:

Mechanism	What it does	AIRR fields
Polymerase indel	Random single-base insertions and deletions inside the molecule	`n_indels`, per-segment partition
End-loss	Block deletion from the molecule's 5' or 3' end	`end_loss_5_length`, `end_loss_3_length`

If your downstream pipeline conflates them, the per-segment identity and coordinate calculations will be wrong. Use n_indels for "did the polymerase mis-step in the middle of the molecule"; use end_loss_*_length for "how much got chewed off the ends".

End-loss¶

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .end_loss_5prime(length=(0, 8))
      .end_loss_3prime(length=(0, 4))
      .run_records(n=100, seed=1)
)

rec["end_loss_5_length"]    # bases removed from 5' end
rec["end_loss_3_length"]    # bases removed from 3' end

End-loss models adapter / primer loss during library prep - the read starts a few bases in from the molecule's actual 5' end and / or terminates a few bases short of the 3' end. The length argument follows the standard int / (low, high) / [(value, weight), ...] shape.

`primer_trim_*prime` is a backwards-compatible alias¶

# These two are exactly equivalent:
.end_loss_5prime(length=(0, 8))
.primer_trim_5prime(length=(0, 8))    # legacy alias - same pass, same fields

The legacy primer_trim_*prime names predate the engine's distinction between recombination-stage exonuclease trim (TrimPass, the v_trim_3 family) and observation-stage end loss (EndLossPass, the end_loss_*_length family). New code should use end_loss_*prime for clarity.

End-loss vs recombination trim¶

Another common confusion. Recombination trims and end-loss are two separate passes with two separate sets of fields:

Mechanism	Engine pass	AIRR fields	Biology stage
Recombination trim	`TrimPass`	`v_trim_3`, `d_trim_5`, `d_trim_3`, `j_trim_5`	Recombination (germline exonuclease)
End loss	`EndLossPass`	`end_loss_5_length`, `end_loss_3_length`	Observation (library prep / sequencing)

Don't conflate them - they capture different biology and a downstream pipeline that adds them together gets the wrong total. See the Recombination + junction biology guide for the detailed comparison.

Random strand orientation¶

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .random_strand_orientation(prob=0.5)
      .run_records(n=100, seed=1)
)

rec["rev_comp"]    # bool - True if this record was flipped

At prob=0.5, half the records get reverse-complemented in place. The rev_comp flag tells you which way the strand went; sequence and sequence_aa reflect the post-flip orientation; junction, np1, np2 similarly reflect the post-flip bytes.

A few facts to know:

The flip is projection-level, not biology. Real biology isn't being inverted; the read just happens to come off the sequencer in the other orientation. Allele truth and all the per-segment provenance fields stay correct; coordinates remap to the flipped sequence frame.
Paired-end windows are sliced AFTER the strand flip. If .paired_end(...) follows .random_strand_orientation(...), R1 and R2 are sliced from the post-flip molecule, so they reflect the strand decision automatically. Don't apply a second flip downstream.

Ordering with clonal workflows¶

All seven corruption passes are descendant-phase. They model per-read artefacts - every emitted clone member or observed lineage cell gets independent PCR errors, independent indel events, independent end-loss draws, etc.

result = (
    ga.Experiment.on("human_igh")
      .recombine()
      .clonal_repertoire(n_clones=10, max_size=50)
      .mutate(model="s5f", rate=0.03)        # biology - descendant-phase
      .pcr_amplify(count=(0, 3))             # corruption - descendant-phase
      .polymerase_indels(count=(0, 2))
      .ambiguous_base_calls(count=(0, 2))
      .sequencing_errors(count=(0, 5))
      .end_loss_5prime(length=(0, 8))
      .end_loss_3prime(length=(0, 4))
      .random_strand_orientation(prob=0.5)
      .paired_end(r1_length=150, insert_size=300)
      .run_records(seed=1)
)

Calling corruption before a flat fork (clonal_repertoire or legacy expand_clones) raises ValueError at chain time because the artefact would be shared by the whole clone. Corruption may also follow clonal_lineage; in that case it is applied independently to each observed sampled cell. See Clonal simulation overview for the phase discipline in full.

Per-platform calibrated profiles¶

Different sequencing platforms break sequences in characteristic ways: MiSeq adds quality noise at the 3′ end, NovaSeq drops bases uniformly with a slight 5′ bias, PacBio HiFi inserts homopolymer indels. Three starting profiles for the platforms researchers ask about most often:

# Counts are per-record ranges. For a ~350bp read, count=(0, 2) Ns
# is roughly 0.1% N-density.
PROFILES = {
    "miseq_300pe": dict(
        end_loss_5=(5, 25),
        end_loss_3=(8, 45),       # 3′ decay heavier than 5′
        ambig_ns=(0, 2),
        indels=(0, 1),
    ),
    "novaseq_150pe": dict(
        end_loss_5=(3, 15),
        end_loss_3=(3, 15),
        ambig_ns=(0, 1),
        indels=(0, 1),
    ),
    "pacbio_hifi": dict(
        end_loss_5=(0, 5),
        end_loss_3=(0, 5),
        ambig_ns=(0, 1),
        indels=(0, 3),            # HiFi: low subs, higher indels
    ),
}

def apply_profile(exp, name):
    p = PROFILES[name]
    return (
        exp.end_loss_5prime(length=p["end_loss_5"])
           .end_loss_3prime(length=p["end_loss_3"])
           .ambiguous_base_calls(count=p["ambig_ns"])
           .polymerase_indels(count=p["indels"])
    )

result = apply_profile(
    ga.Experiment.on("human_igh").recombine().mutate(count=(5, 20)),
    "miseq_300pe",
).run_records(n=5000, seed=42)

The set of passes you call is the profile - these are starting points, not exhaustive: add pcr_amplify or sequencing_errors rates when the platform calls for them.

Calibrating from your own reads¶

Don't trust a preset blindly. Run a small panel under the profile and match two distributions to a sample of real reads from your sequencer:

Match the N-density.

n_rate_real = (real_reads.str.count("N") / real_reads.str.len()).mean()
n_rate_sim  = (sim["sequence"].str.count("N") / sim["sequence_length"]).mean()
# Adjust the ambiguous_base_calls count range until these match

Match the length distribution.

# End-loss is the dominant driver of read-length variance
sim_lens  = sim["sequence_length"]
real_lens = real_reads.str.len()
# Match means + variances by widening or tightening the (min, max)
# ranges on end_loss_5prime / end_loss_3prime

Validation and replay¶

validate_records re-derives every corruption counter from the event ledger and trace addresses, then compares against the record's reported values. Issue kinds the catalogue covers:

Counter	Issue kind on mismatch
`n_pcr_errors`	`NPcrErrorsMismatch`
`n_quality_errors`	`NQualityErrorsMismatch`
`n_indels`	`NIndelsMismatch`
`n_v_indels` / `n_d_indels` / `n_j_indels`	per-segment indel-count mismatches
`end_loss_5_length` / `end_loss_3_length`	`EndLossLengthMismatch`

The rev_comp flag is derived from the IR; mismatch surfaces similarly. Every corruption mechanism records its choices on stable trace addresses (corrupt.pcr.count, corrupt.quality.count, corrupt.indel.count, corrupt.end_loss.5, corrupt.end_loss.3, etc.), so trace replay reproduces the same artefacts deterministically.

Paired-end FASTQ export writes the projected r1_sequence / r2_sequence fields verbatim - by AIRR-projection time the corruption + strand-flip + paired-end pipeline has already finished, so the exported reads carry all the artefacts the record's counters report. See Validation hub for the validator's full scope.

Common mistakes¶

A handful of issues that show up repeatedly with corruption.

Expecting PCR or sequencing errors to increment n_mutations. They never do. n_mutations is biology only. PCR substitution surfaces on n_pcr_errors; sequencer miscalls surface on n_quality_errors. If your downstream pipeline reads n_mutations as a "total perturbation budget", it's leaving the artefact counters off.

Confusing end-loss with recombination trim. Recombination trim is biology (germline exonuclease, fields v_trim_3 etc.); end-loss is library-prep artefact (end_loss_*_length). They're separate passes, separate fields, separate biology stages. The Recombination + junction biology guide has the side-by-side comparison.

Putting corruption before a clonal fork. All seven corruption passes are descendant-phase - they're per-read artefacts that need to vary within a clonal family. The DSL rejects this before clonal_repertoire() or expand_clones() with a message naming the offending method and telling you to move it after the fork.

Reverse-complementing R2 again after random strand orientation. When .random_strand_orientation(...) is in the pipeline, sequence (and r1_sequence / r2_sequence if paired-end is on) already reflect the post-flip orientation. R2 is also already reverse-complemented at projection time (that's the paired-end-output convention). Applying any further flips downstream produces wrong reads. The validator's PairedEndWindowMismatch check enforces the per-record R2 RC state.

Looking for an n_ns counter. ambiguous_base_calls doesn't ship one. The only signal is the N characters in sequence - count them yourself with rec["sequence"].count("N").

Where to go next¶

SHM and mutation targeting - the biology side of the boundary; the per-segment / per-V-subregion partition.
Paired-end reads and FASTQ - the read-layout projection that usually sits alongside corruption.
Clonal simulation overview - the clonal model chooser and phase discipline that gates every corruption pass.
Recombination + junction biology the trim vs end-loss boundary in detail.
Validation & reproducibility - the validator's issue catalogue covering every corruption counter.
Your first AIRR record the field-by-field catalogue including every counter named on this page.