Different sequencing platforms break sequences in different ways: MiSeq adds quality noise
at the 3′ end, NovaSeq drops bases uniformly with a 5′ bias, PacBio HiFi inserts homopolymer
indels. Tune the corrupt_* passes to match a Q-score profile from your platform
and your sim will look like real wet-lab data — same artefact distribution your aligner sees
in production.
The corruption phase is composed of independent passes. You pick the ones whose mechanisms match your platform and skip the rest. The set you choose is the profile.
corrupt_5prime_loss(length=(min, max)) · corrupt_3prime_loss(...)
— models library-prep truncation. Independent draws per record, uniform within range.
Set both for amplicon assays.
corrupt_ns(prob=...) — models low-quality base calls, written as
N. Per-base Bernoulli. Use a rate that matches the per-base
N-density in your empirical reads.
corrupt_indels(prob=...) — models homopolymer drift and library prep
artefacts. Per-record Bernoulli with random length and position.
corrupt_pcr(...) for substitution drift from amplification cycles;
corrupt_quality(...) for the platform-shaped Q-score decay. Lowercase bases
in the output mark quality-degraded positions.
Starting points for the three platforms researchers most often ask about. Each profile
is a small dictionary you spread into Experiment via the
corrupt_* methods.
PROFILES = {
"miseq_300pe": {
"loss_5": (5, 25),
"loss_3": (8, 45), # 3' decay heavier than 5'
"ns": 0.001,
"indels": 0.0005,
},
"novaseq_150pe": {
"loss_5": (3, 15),
"loss_3": (3, 15),
"ns": 0.0003,
"indels": 0.0001,
},
"pacbio_hifi": {
"loss_5": (0, 5),
"loss_3": (0, 5),
"ns": 0.00005,
"indels": 0.002, # HiFi: low subs, higher indels
},
}
def profile(name):
p = PROFILES[name]
return lambda exp: (
exp.corrupt_5prime_loss(length=p["loss_5"])
.corrupt_3prime_loss(length=p["loss_3"])
.corrupt_ns(prob=p["ns"])
.corrupt_indels(prob=p["indels"])
)
result = profile("miseq_300pe")(
ga.Experiment.on("human_igh").recombine().mutate(count=(5, 20))
).run(n=5000, seed=42)
Don't trust a preset blindly. Run a small panel with the profile and compare two distributions against a sample of real reads from your sequencer.
n_rate_real = (
real_reads.str.count("N") /
real_reads.str.len()
).mean()
n_rate_sim = (
sim["n_count"] /
sim["sequence_length"]
).mean()
# adjust corrupt_ns prob until these match
# end-loss is the dominant driver of read-length variance
sim_lens = sim["sequence_length"]
real_lens = real_reads.str.len()
# match means and variances by widening
# or tightening the (min, max) ranges