Recipe C · 01 · ~20 min · intermediate

Tune corruption rates to match your sequencer.

Different sequencing platforms break sequences in different ways: MiSeq adds quality noise at the 3′ end, NovaSeq drops bases uniformly with a 5′ bias, PacBio HiFi inserts homopolymer indels. Tune the corrupt_* passes to match a Q-score profile from your platform and your sim will look like real wet-lab data — same artefact distribution your aligner sees in production.

01 Pick a profile MiSeq · NovaSeq · HiFi · custom
02 Map to rates Q-score → corruption rate
03 Calibrate verify n_count + indel distributions match
PART 01

What each corruption pass actually models.

The corruption phase is composed of independent passes. You pick the ones whose mechanisms match your platform and skip the rest. The set you choose is the profile.

5′ / 3′ end loss

corrupt_5prime_loss(length=(min, max)) · corrupt_3prime_loss(...) — models library-prep truncation. Independent draws per record, uniform within range. Set both for amplicon assays.

N-corruption

corrupt_ns(prob=...) — models low-quality base calls, written as N. Per-base Bernoulli. Use a rate that matches the per-base N-density in your empirical reads.

Indels

corrupt_indels(prob=...) — models homopolymer drift and library prep artefacts. Per-record Bernoulli with random length and position.

PCR + quality

corrupt_pcr(...) for substitution drift from amplification cycles; corrupt_quality(...) for the platform-shaped Q-score decay. Lowercase bases in the output mark quality-degraded positions.

PART 02

Three calibrated profiles.

Starting points for the three platforms researchers most often ask about. Each profile is a small dictionary you spread into Experiment via the corrupt_* methods.

MiSeq · NovaSeq · HiFi profiles
PROFILES = {
    "miseq_300pe": {
        "loss_5": (5, 25),
        "loss_3": (8, 45),    # 3' decay heavier than 5'
        "ns": 0.001,
        "indels": 0.0005,
    },
    "novaseq_150pe": {
        "loss_5": (3, 15),
        "loss_3": (3, 15),
        "ns": 0.0003,
        "indels": 0.0001,
    },
    "pacbio_hifi": {
        "loss_5": (0, 5),
        "loss_3": (0, 5),
        "ns": 0.00005,
        "indels": 0.002,    # HiFi: low subs, higher indels
    },
}

def profile(name):
    p = PROFILES[name]
    return lambda exp: (
        exp.corrupt_5prime_loss(length=p["loss_5"])
           .corrupt_3prime_loss(length=p["loss_3"])
           .corrupt_ns(prob=p["ns"])
           .corrupt_indels(prob=p["indels"])
    )

result = profile("miseq_300pe")(
    ga.Experiment.on("human_igh").recombine().mutate(count=(5, 20))
).run(n=5000, seed=42)
PART 03

Calibrate from your own data.

Don't trust a preset blindly. Run a small panel with the profile and compare two distributions against a sample of real reads from your sequencer.

Match the N-density
n_rate_real = (
    real_reads.str.count("N") /
    real_reads.str.len()
).mean()

n_rate_sim = (
    sim["n_count"] /
    sim["sequence_length"]
).mean()
# adjust corrupt_ns prob until these match
Match the length distribution
# end-loss is the dominant driver of read-length variance
sim_lens  = sim["sequence_length"]
real_lens = real_reads.str.len()

# match means and variances by widening
# or tightening the (min, max) ranges
Related recipes

Where to next.

B · 03 · Audit for biological realism →

Run the marginal audit against your real reads to confirm the corruption profile holds up.

B · 01 · Benchmark an aligner →

Use the calibrated profile to benchmark aligners under realistic platform noise.