Recipe D · 02 · ~15 min · intermediate

Stream millions of sequences without OOM.

At 10M records, holding the whole result in memory is no longer free. The streaming API yields records one at a time: you process or write each as it arrives, and only the current record (plus your batch buffer) ever sits in RAM. Combine it with chunked TSV writes and you can cap memory at whatever ceiling you choose, on any machine.

01 Use stream_records() iterator API · one at a time
02 Buffer and flush chunk_size = N · incremental writes
03 Measure memory profile · throughput
PART 01

When to switch from run() to stream_records().

Rule of thumb: under 100k records, run() is fine. Above that, switch to stream_records() and stop holding the whole result in memory. The CPU cost is the same — you're paying the simulation time either way — but the memory profile flips from O(n) to O(1).

run() — eager
  • memory: O(n) — all records live in a single list
  • ergonomics: indexing, slicing, len(), to_dataframe()
  • use when: n ≤ 100k, doing interactive exploration
stream_records() — lazy
  • memory: O(1) — the generator yields one record at a time
  • ergonomics: for-loop only; no indexing or random access
  • use when: n ≥ 100k, or writing to disk as you go
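
A minimal sketch of the switch, assuming run() takes the same n and seed keywords as stream_records() (its signature isn't shown in this recipe):

import GenAIRR as ga

exp = ga.Experiment.on("human_igh").recombine()  # minimal pipeline for illustration

# Eager: run() holds everything in one indexable result, O(n) memory
small = exp.run(n=50_000, seed=42)  # assumed signature, mirroring stream_records()
print(len(small), small[0]["productive"])

# Lazy: stream_records() returns a generator, O(1) memory
for rec in exp.stream_records(n=5_000_000, seed=42):
    ...  # process rec, then let it go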
PART 02

The streaming loop.

Build your Experiment as you normally would, then call stream_records() instead of run(). The return value is a generator — iterate it with a for-loop and process each record before discarding it.

10M records, constant memory
import GenAIRR as ga

# Build the pipeline as usual; nothing is simulated yet.
exp = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .corrupt_5prime_loss(length=(5, 30))
)

# stream_records() returns a generator: one record in memory at a time.
n_productive = 0
for rec in exp.stream_records(n=10_000_000, seed=42):
    if rec["productive"]:
        n_productive += 1
print(n_productive)
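
Because stream_records() returns a plain generator, you can also stop early and pay only for what you consume. A sketch using itertools.islice from the standard library:

from itertools import islice

# Inspect the first 5 records of a 10M-record stream; only 5 are ever generated.
for rec in islice(exp.stream_records(n=10_000_000, seed=42), 5):
    print(rec["productive"])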
PART 03

Chunked TSV writes.

Process records one at a time, write to disk in chunks. A 100k-record buffer keeps memory tiny while amortizing the TSV-formatting cost across many records.

Write 10M records to a single TSV — bounded memory
import csv

CHUNK = 100_000  # records buffered between disk writes

with open("panel_10m.tsv", "w", newline="") as fh:
    writer = None
    buf = []
    for rec in exp.stream_records(n=10_000_000, seed=42):
        if writer is None:
            # The first record fixes the column order and writes the header.
            writer = csv.DictWriter(fh, fieldnames=list(rec.keys()),
                                    delimiter="\t")
            writer.writeheader()
        buf.append(rec)
        if len(buf) >= CHUNK:
            writer.writerows(buf)
            buf.clear()
    if buf:  # flush the final partial chunk
        writer.writerows(buf)
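
The buffer-and-flush pattern generalizes to any sink. A sketch with a hypothetical chunked() helper (plain standard-library batching, not part of GenAIRR) that keeps the loop body flat:

import csv
from itertools import islice

def chunked(iterable, size):
    # Yield lists of up to `size` items drawn from any iterator.
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

with open("panel_10m.tsv", "w", newline="") as fh:
    writer = None
    for batch in chunked(exp.stream_records(n=10_000_000, seed=42), 100_000):
        if writer is None:
            writer = csv.DictWriter(fh, fieldnames=list(batch[0].keys()),
                                    delimiter="\t")
            writer.writeheader()
        writer.writerows(batch)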
PART 04

Memory profile — what to expect.

With a 100k-record buffer, the peak resident set size is roughly the buffer itself plus the engine's working state (a few MB). The number doesn't grow with n.

100k records
  • stream + chunk=100k: ~110 MB peak — dominated by the buffer
  • run(): ~95 MB — comparable; streaming overhead is small
1M records
  • stream + chunk=100k: ~110 MB peak — same as the 100k case
  • run(): ~950 MB — linear in n
10M records
  • stream + chunk=100k: ~110 MB peak — flat
  • run(): won't fit on most machines
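
To check these numbers on your own machine, a minimal sketch using the standard library's resource module (Unix only; note that ru_maxrss is kilobytes on Linux but bytes on macOS):

import resource
import sys

for _ in exp.stream_records(n=1_000_000, seed=42):
    pass  # process and discard; nothing accumulates

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
scale = 1 if sys.platform == "darwin" else 1024  # macOS reports bytes, Linux KB
print(f"peak RSS ≈ {peak * scale / 2**20:.0f} MB")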
PART 05

Throughput.

Single-threaded throughput is between 10k and 30k records per second depending on the complexity of the pipeline (more passes = more time per record). Streaming adds negligible overhead — the simulation itself is the bottleneck, not the iteration.

Quick benchmark
import time

start = time.monotonic()
n = 0
for _ in exp.stream_records(n=100_000, seed=42):  # drain the stream; the simulation is the cost
    n += 1
dt = time.monotonic() - start

print(f"{n / dt:.0f} records/sec")
Related recipes

Where to next.

D · 01 · Export to AIRR TSV, Parquet, FASTA →

The three writers when your dataset does fit in memory.

D · 03 · Reproduce a published dataset →

Streaming is the workflow that scales reproduction to paper-sized datasets.