Recipe D · 01 · ~5 min · beginner

Export to AIRR TSV, Parquet, or FASTA.

Once a run returns a result, getting it out of GenAIRR is a single method call. Three outputs cover almost every downstream use case: AIRR TSV for tool compatibility, FASTA for blinded handoff to aligners or BLAST, and a pandas DataFrame for interactive analysis. Each comes with one important detail: whether the truth_* columns survive the trip.

01 Pick the format TSV · FASTA · DataFrame
02 Truth or blinded expose_provenance · airr_strict
03 Write one method call
PART 01

Which writer for which job.

Each format trades off something. TSV is verbose but tool-native. FASTA is minimal: it keeps only the header and the sequence. A DataFrame holds everything in memory, which costs RAM but lets you filter, group, and join in seconds.

TSV — for tools

result.to_csv("out.tsv", sep="\t"). AIRR-schema column names, all ~70 fields (including truth_* if you ran with expose_provenance=True). Set airr_strict=True to convert coordinates to 1-based inclusive for IMGT-style downstream tools.

FASTA — for blinded handoff

result.to_fasta("out.fasta"). Just the headers and sequence. Use this when the next stage is an aligner that should not have access to the truth fields — e.g. a benchmark, see B · 01.

DataFrame — for analysis

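df = result.to_dataframe(). Full pandas DataFrame; one row per record, one column per AIRR field. Ideal for in-notebook exploration or feeding into the audit workflows in B · 03. Because it is a regular pandas DataFrame, df.to_parquet("out.parquet") also gets you Parquet, provided pyarrow or fastparquet is installed.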
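As a rough illustration of the filter-and-group style of analysis, here is a minimal sketch on a hand-built stand-in DataFrame. The column names (sequence_id, locus, v_call, productive) appear elsewhere in this recipe; the rows and allele names are invented for the example and are not GenAIRR output.

```python
import pandas as pd

# Stand-in for result.to_dataframe(); values are invented for illustration.
df = pd.DataFrame(
    {
        "sequence_id": ["seq0", "seq1", "seq2", "seq3"],
        "locus": ["IGH", "IGH", "IGH", "IGH"],
        "v_call": ["IGHV1-2*02", "IGHV3-23*01", "IGHV3-23*01", "IGHV1-2*02"],
        "productive": [True, True, False, True],
    }
)

# Filter: keep productive records only.
productive = df[df["productive"]]

# Group: count productive records per V call.
v_counts = productive.groupby("v_call").size()

print(v_counts)
```

The same two steps apply unchanged to a real 10,000-row panel, as long as it fits in memory.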

PART 02

One call, three writers.

The same result can be dumped to all three. Pick what each downstream step actually needs.

All three writers from one run
import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .run_records(n=10_000, seed=42,
                    expose_provenance=True)
)

# 1. AIRR TSV: coordinates are 0-based by default
result.to_csv("panel.tsv", sep="\t")

# 2. AIRR TSV: 1-based inclusive coordinates for IMGT-style tools
result.to_csv("panel_imgt.tsv", sep="\t", airr_strict=True)

# 3. FASTA: blinded, only sequence_id + sequence
result.to_fasta("panel.fasta")

# 4. DataFrame: for inline analysis
df = result.to_dataframe()
print(df["v_call"].value_counts().head())  # most frequent V calls
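Writers 1 and 2 differ only in coordinate convention. The sketch below works through the arithmetic on a single span. It assumes the default 0-based coordinates are half-open (Python-slice style); that assumption and the helper name to_one_based_inclusive are for illustration only and are not taken from GenAIRR.

```python
def to_one_based_inclusive(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open span [start, end) to 1-based inclusive.

    Assumption: the 0-based span is half-open, as in Python slicing.
    """
    # The start shifts by one; the exclusive end already equals the
    # 1-based index of the last included position.
    return start + 1, end


sequence = "ACGTACGTAC"
start0, end0 = 2, 6  # 0-based half-open: covers sequence[2:6]
start1, end1 = to_one_based_inclusive(start0, end0)

# Both conventions must select the same bases.
segment_0 = sequence[start0:end0]
segment_1 = sequence[start1 - 1:end1]
print(start1, end1, segment_0, segment_1)
```

If a downstream tool reports coordinates that are off by one against the truth columns, a mismatch between these two conventions is the first thing to check.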
PART 03

Strip truth_* from a TSV.

For a truly blinded benchmark you want the AIRR TSV columns (sequence, sequence_id, maybe a few metadata columns) without the provenance. Two options.

Drop the columns post-hoc
import pandas as pd

# Load the truth-bearing TSV, then drop every provenance column.
df = pd.read_csv("panel.tsv", sep="\t")
truth_cols = [c for c in df.columns if c.startswith("truth_")]
df = df.drop(columns=truth_cols)
df.to_csv("panel_blinded.tsv", sep="\t", index=False)
Or just don't expose them
import GenAIRR as ga

# run without expose_provenance=True
result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .run_records(n=10_000, seed=42)
)
result.to_csv("panel.tsv", sep="\t")
# no truth_* columns at all — clean blinded panel
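A third variant, useful when the truth-bearing TSV is large: skip the provenance columns at read time instead of loading and then dropping them. This is a minimal sketch using the callable form of pandas' usecols; the in-memory TSV and its truth_v_call column are stand-ins, not real GenAIRR output.

```python
import io

import pandas as pd

# Stand-in for panel.tsv; the truth_* column name is hypothetical.
tsv = io.StringIO(
    "sequence_id\tsequence\ttruth_v_call\n"
    "seq0\tACGT\tIGHV1-2*02\n"
    "seq1\tTTGA\tIGHV3-23*01\n"
)

# usecols accepts a callable that is evaluated against each column name;
# columns for which it returns False are never loaded.
blinded = pd.read_csv(
    tsv,
    sep="\t",
    usecols=lambda name: not name.startswith("truth_"),
)

print(list(blinded.columns))
```

With a real file, replace the StringIO object with the path to the TSV.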
PART 04

FASTA header shape.

By default FASTA headers are the sequence_id — typically just seq0, seq1, etc. If you need richer headers (e.g. embedding metadata for a tool that doesn't read the TSV), generate them yourself from the DataFrame.

Custom FASTA with metadata-embedded headers
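# df is the DataFrame returned by result.to_dataframe() above
with open("panel_rich.fasta", "w") as fh:
    for _, row in df.iterrows():
        # Pipe-separated key=value pairs keep the header free of whitespace
        meta = f"locus={row['locus']}|prod={row['productive']}"
        fh.write(f">{row['sequence_id']}|{meta}\n")
        fh.write(f"{row['sequence']}\n")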
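On the consuming side, a header written this way splits back into its parts with plain string operations. A minimal sketch, assuming the pipe-separated key=value layout from the snippet above; parse_header is a hypothetical helper, not part of GenAIRR.

```python
def parse_header(line: str) -> tuple[str, dict[str, str]]:
    """Split '>seq0|locus=IGH|prod=True' into ('seq0', {...})."""
    # Drop the leading '>' and any trailing newline, then split on pipes.
    seq_id, *pairs = line.lstrip(">").strip().split("|")
    # Each remaining field is key=value; split on the first '=' only.
    meta = dict(pair.split("=", 1) for pair in pairs)
    return seq_id, meta


seq_id, meta = parse_header(">seq0|locus=IGH|prod=True\n")
print(seq_id, meta)
```

All values come back as strings, so a flag such as prod needs an explicit comparison (meta["prod"] == "True") rather than a truthiness check.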
Related recipes

Where to next.

D · 02 · Stream millions without OOM →

For datasets that don't fit in memory, write to disk as the engine produces records.

B · 01 · Benchmark an aligner →

The classic export pattern: TSV for truth, FASTA for the aligner, join later.