Once run() returns a result, getting it out of GenAIRR is a single method call.
Three formats cover almost every downstream use case: AIRR TSV for tool
compatibility, FASTA for blinded handoff to aligners or BLAST, and pandas for
interactive analysis. Each comes with one important detail — whether the
truth_* columns survive the trip.
Each format trades something away. TSV is verbose but tool-native. FASTA is minimal but strips everything except the header and the sequence. A DataFrame holds everything in memory, which is what lets you filter, group, and join in seconds.
result.to_csv("out.tsv", sep="\t"). AIRR-schema column names, all ~70 fields
(including truth_* if you ran with expose_provenance=True).
Set airr_strict=True to convert coordinates to 1-based inclusive for
IMGT-style downstream tools.
result.to_fasta("out.fasta"). Just the headers and sequence. Use this when
the next stage is an aligner that should not have access to the truth fields — e.g. a
benchmark, see B · 01.
df = result.to_dataframe(). Full pandas DataFrame; one row per record, one
column per AIRR field. Ideal for in-notebook exploration or feeding into the audit
workflows in B · 03.
The same result can be dumped to all three. Pick what each downstream step
actually needs.
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .mutate(model="s5f", count=(5, 25))
    .run_records(n=10_000, seed=42,
                 expose_provenance=True)
)

# 1. AIRR TSV — coordinates already 0-based
result.to_csv("panel.tsv", sep="\t")
# 2. AIRR TSV — 1-based coords for IMGT-shaped tools
result.to_csv("panel_imgt.tsv", sep="\t", airr_strict=True)
# 3. FASTA — blinded; only sequence_id + sequence
result.to_fasta("panel.fasta")
# 4. DataFrame — for inline analysis
df = result.to_dataframe()
df["v_call"].value_counts().head()
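The difference between the two TSV variants above comes down to index arithmetic. The sketch below assumes the default coordinates are 0-based half-open intervals (start included, end excluded), which this page does not state explicitly. The helper name `to_one_based_inclusive` is hypothetical and is not part of GenAIRR; it only illustrates the kind of conversion that airr_strict=True is described as performing.

```python
def to_one_based_inclusive(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) to 1-based inclusive.

    Assumption: the 0-based form excludes its end position.
    """
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # The start shifts by one; the end keeps its value because
    # "exclusive, 0-based" and "inclusive, 1-based" name the same base.
    return start + 1, end


sequence = "ACGTACGTAC"
start0, end0 = 2, 6                                   # 0-based, half-open
start1, end1 = to_one_based_inclusive(start0, end0)   # 1-based, inclusive

# Both conventions select the same bases.
segment_0 = sequence[start0:end0]
segment_1 = sequence[start1 - 1:end1]
```

When comparing against a tool's output, it is worth confirming which convention that tool reports before treating off-by-one differences as errors.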
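Stripping truth_* from a TSV.
For a truly blinded benchmark you want the AIRR TSV columns (sequence, sequence_id, maybe a few metadata columns) without the provenance. There are two options: drop the truth_* columns from an existing TSV, or regenerate the panel without expose_provenance=True.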
# Option 1: drop the truth_* columns from an existing TSV
import pandas as pd

df = pd.read_csv("panel.tsv", sep="\t")
truth_cols = [c for c in df.columns if c.startswith("truth_")]
df = df.drop(columns=truth_cols)
df.to_csv("panel_blinded.tsv", sep="\t", index=False)

# Option 2: run without expose_provenance=True
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
    .recombine()
    .run_records(n=10_000, seed=42)
)
result.to_csv("panel.tsv", sep="\t")
# no truth_* columns at all: clean blinded panel
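Whichever route is taken, it can be worth checking the file before handing it off. The following is a minimal sketch of such a check using only the standard library, so it can run in an environment without pandas. The function name `find_truth_columns` is hypothetical, and `truth_v_call` is an illustrative column name rather than a confirmed GenAIRR field; the in-memory strings stand in for the real TSV files.

```python
import csv
import io


def find_truth_columns(handle) -> list[str]:
    """Return header names starting with 'truth_' from a tab-separated file handle."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in header if name.startswith("truth_")]


# Toy in-memory stand-ins for an unblinded and a blinded panel.
with_truth = io.StringIO(
    "sequence_id\tsequence\ttruth_v_call\n"
    "seq0\tACGT\tIGHV1-2*02\n"
)
blinded = io.StringIO(
    "sequence_id\tsequence\n"
    "seq0\tACGT\n"
)

leaked = find_truth_columns(with_truth)   # names that would leak provenance
clean = find_truth_columns(blinded)       # empty list when the panel is blinded
```

For a file on disk, the same function can be called inside `with open("panel_blinded.tsv", newline="") as fh:`; only the header row is read, so the check stays cheap on large panels.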
By default FASTA headers are the sequence_id — typically just seq0,
seq1, etc. If you need richer headers (e.g. embedding metadata for a tool
that doesn't read the TSV), generate them yourself from the DataFrame.
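# df is the DataFrame returned by result.to_dataframe()
with open("panel_rich.fasta", "w") as fh:
    for _, row in df.iterrows():
        # Pack the metadata into the header as |-separated key=value pairs
        meta = f"locus={row['locus']}|prod={row['productive']}"
        fh.write(f">{row['sequence_id']}|{meta}\n")
        fh.write(row["sequence"] + "\n")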
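A tool that consumes these rich headers needs to split them back apart. The sketch below shows one way to do that for the `id|key=value|key=value` layout written above. The function name `parse_rich_header` is hypothetical, and the approach assumes that neither the sequence_id nor any metadata value contains a `|` character.

```python
def parse_rich_header(line: str) -> dict[str, str]:
    """Split a header like '>seq0|locus=IGH|prod=True' into id plus metadata."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # The first field is the sequence_id; the rest are key=value pairs.
    seq_id, *pairs = line[1:].strip().split("|")
    record = {"sequence_id": seq_id}
    for pair in pairs:
        key, _, value = pair.partition("=")
        record[key] = value
    return record


parsed = parse_rich_header(">seq0|locus=IGH|prod=True\n")
```

All values come back as strings (for example `"True"` rather than a boolean), so any typed field needs an explicit conversion on the reading side.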