At 10M records, holding the whole result in memory is no longer free. The streaming API yields records one at a time — you process or write each as it arrives, and only the current record (plus your batch buffer) ever sits in RAM. Combine with chunked TSV writes and you can pin memory at any size you like, on any machine.
From run() to stream_records()
Rule of thumb: under 100k records, run() is fine. Above that, switch to
stream_records() and stop holding the whole result in memory. The CPU cost
is the same — you're paying the simulation time either way — but the memory profile flips
from O(n) to O(1).
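For contrast, here is a minimal sketch of the two call styles. It assumes run()'s signature mirrors stream_records() and that it materializes the complete result as an in-memory collection; exp is the Experiment built in the next example, and handle() is a placeholder for your own per-record processing.

# In-memory: run() builds the complete result before returning, so memory
# grows with n (the exact return type is an assumption here).
records = exp.run(n=100_000, seed=42)

# Streaming: stream_records() yields one record at a time, so memory stays
# flat no matter how large n is.
for rec in exp.stream_records(n=100_000, seed=42):
    handle(rec)  # placeholder for your own per-record processing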
Build your Experiment as you normally would, then call
stream_records() instead of run(). The return value is a
generator — iterate it with a for-loop and process each record before discarding it.
import GenAIRR as ga

# Build the Experiment as usual; streaming changes only how records are drawn.
exp = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(model="s5f", count=(5, 25))
    .corrupt_5prime_loss(length=(5, 30))
)

# stream_records() returns a generator: each record is handled and then
# discarded, so only the current record sits in memory.
n_productive = 0
for rec in exp.stream_records(n=10_000_000, seed=42):
    if rec["productive"]:
        n_productive += 1
print(n_productive)
Process records one at a time and write to disk in chunks. A 100k-record buffer keeps memory tiny while amortizing the TSV-formatting cost across many records.
import csv

CHUNK = 100_000  # records per flush; larger chunks trade memory for fewer writes

with open("panel_10m.tsv", "w", newline="") as fh:
    writer = None
    buf = []
    for rec in exp.stream_records(n=10_000_000, seed=42):
        # Create the writer lazily so the header matches the record's fields.
        if writer is None:
            writer = csv.DictWriter(fh, fieldnames=list(rec.keys()),
                                    delimiter="\t")
            writer.writeheader()
        buf.append(rec)
        if len(buf) >= CHUNK:
            writer.writerows(buf)
            buf.clear()
    # Flush whatever is left once the stream is exhausted.
    if buf:
        writer.writerows(buf)
With a 100k-record buffer, the total resident set is roughly the buffer itself plus the engine's working state (a few MB); neither grows with n.
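If you want to verify the flat memory profile yourself, one way (a sketch assuming a Unix-like system) is to check the process's peak resident set after the run with the standard-library resource module:

import resource

# Peak resident set size of this process so far.
# ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak} (kilobytes on Linux, bytes on macOS)")

Run it once after a small stream and once after a large one; the peak should barely move.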
Single-threaded throughput is between 10k and 30k records per second depending on the complexity of the pipeline (more passes = more time per record). Streaming adds negligible overhead — the simulation itself is the bottleneck, not the iteration.
import time

# Time a short streaming run to estimate records-per-second on this machine.
start = time.monotonic()
n = 0
for _ in exp.stream_records(n=100_000, seed=42):
    n += 1
dt = time.monotonic() - start
print(f"{n / dt:.0f} records/sec")