Recipe D · 02 · ~15 min · intermediate

Stream millions of sequences without OOM.

At 10M records, holding the whole result in memory is no longer free. The streaming API yields records one at a time: you process or write each as it arrives, and only the current record (plus your batch buffer) ever sits in RAM. Combine it with chunked TSV writes and you can cap memory at whatever ceiling you choose, on any machine.

01 Use stream_records() iterator API · one at a time
02 Buffer and flush chunk_size = N · incremental writes
03 Measure memory profile · throughput
PART 01

When to switch from run() to stream_records().

Rule of thumb: under 100k records, run() is fine. Above that, switch to stream_records() and stop holding the whole result in memory. The CPU cost is the same — you're paying the simulation time either way — but the memory profile flips from O(n) to O(1).

run() — eager
  • memory: O(n) — all records live in a single list
  • ergonomics: indexing, slicing, len(), to_dataframe()
  • use when: n ≤ 100k, doing interactive exploration
stream_records() — lazy
  • memory: O(1) — the generator yields one record at a time
  • ergonomics: for-loop only; no indexing or random access
  • use when: n ≥ 100k, or writing to disk as you go
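
A minimal sketch of the switch, assuming run() takes the same n and seed keywords as stream_records() (its signature isn't shown in this recipe):

import GenAIRR as ga

exp = ga.Experiment.on("human_igh").recombine()  # minimal pipeline for illustration

# Eager: run() holds everything in one indexable result, O(n) memory
small = exp.run(n=50_000, seed=42)  # assumed signature, mirroring stream_records()
print(len(small), small[0]["productive"])

# Lazy: stream_records() returns a generator, O(1) memory
for rec in exp.stream_records(n=5_000_000, seed=42):
    ...  # process rec, then let it go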
PART 02

The streaming loop.

Build your Experiment as you normally would, then call stream_records() instead of run(). The return value is a generator — iterate it with a for-loop and process each record before discarding it.

10M records, constant memory
import GenAIRR as ga

# Build the pipeline as usual; nothing is simulated yet.
exp = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .corrupt_5prime_loss(length=(5, 30))
)

# stream_records() returns a generator: one record in memory at a time.
n_productive = 0
for rec in exp.stream_records(n=10_000_000, seed=42):
    if rec["productive"]:
        n_productive += 1
print(n_productive)
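
Because stream_records() returns a plain generator, you can also stop early and pay only for what you consume. A sketch using itertools.islice from the standard library:

from itertools import islice

# Inspect the first 5 records of a 10M-record stream; only 5 are ever generated.
for rec in islice(exp.stream_records(n=10_000_000, seed=42), 5):
    print(rec["productive"])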
PART 03

Chunked TSV writes.

Process records one at a time, write to disk in chunks. A 100k-record buffer keeps memory tiny while amortizing the TSV-formatting cost across many records.

Write 10M records to a single TSV — bounded memory
import csv

CHUNK = 100_000  # records buffered between disk writes

with open("panel_10m.tsv", "w", newline="") as fh:
    writer = None
    buf = []
    for rec in exp.stream_records(n=10_000_000, seed=42):
        if writer is None:
            # The first record fixes the column order and writes the header.
            writer = csv.DictWriter(fh, fieldnames=list(rec.keys()),
                                    delimiter="\t")
            writer.writeheader()
        buf.append(rec)
        if len(buf) >= CHUNK:
            writer.writerows(buf)
            buf.clear()
    if buf:  # flush the final partial chunk
        writer.writerows(buf)
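
The buffer-and-flush pattern generalizes to any sink. A sketch with a hypothetical chunked() helper (plain standard-library batching, not part of GenAIRR) that keeps the loop body flat:

import csv
from itertools import islice

def chunked(iterable, size):
    # Yield lists of up to `size` items drawn from any iterator.
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

with open("panel_10m.tsv", "w", newline="") as fh:
    writer = None
    for batch in chunked(exp.stream_records(n=10_000_000, seed=42), 100_000):
        if writer is None:
            writer = csv.DictWriter(fh, fieldnames=list(batch[0].keys()),
                                    delimiter="\t")
            writer.writeheader()
        writer.writerows(batch)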
PART 04

Memory profile — what to expect.

With a 100k-record buffer, the peak resident set size is roughly the buffer itself plus the engine's working state (a few MB). The number doesn't grow with n.

100k records
  • stream + chunk=100k: ~110 MB peak — dominated by the buffer
  • run(): ~95 MB — comparable; streaming overhead is small
1M records
  • stream + chunk=100k: ~110 MB peak — same as the 100k case
  • run(): ~950 MB — linear in n
10M records
  • stream + chunk=100k: ~110 MB peak — flat
  • run(): won't fit on most machines
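
To check these numbers on your own machine, a minimal sketch using the standard library's resource module (Unix only; note that ru_maxrss is kilobytes on Linux but bytes on macOS):

import resource
import sys

for _ in exp.stream_records(n=1_000_000, seed=42):
    pass  # process and discard; nothing accumulates

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
scale = 1 if sys.platform == "darwin" else 1024  # macOS reports bytes, Linux KB
print(f"peak RSS ≈ {peak * scale / 2**20:.0f} MB")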
PART 05

Throughput.

Single-threaded throughput is between 10k and 30k records per second depending on the complexity of the pipeline (more passes = more time per record). Streaming adds negligible overhead — the simulation itself is the bottleneck, not the iteration.

Quick benchmark
import time

start = time.monotonic()
n = 0
for _ in exp.stream_records(n=100_000, seed=42):  # drain the stream; the simulation is the cost
    n += 1
dt = time.monotonic() - start

print(f"{n / dt:.0f} records/sec")
Related recipes

Where to next.

D · 01 · Export to AIRR TSV, Parquet, FASTA →

The three writers when your dataset does fit in memory.

D · 03 · Reproduce a published dataset →

Streaming is the workflow that scales reproduction to paper-sized datasets.