Recipe D · 01 · ~5 min · beginner

Export to AIRR TSV, FASTA, or a DataFrame.

Once a run returns a result, getting it out of GenAIRR is a single method call. Three formats cover almost every downstream use case: AIRR TSV for tool compatibility, FASTA for blinded handoff to aligners or BLAST, and a pandas DataFrame for interactive analysis. For each one, the important detail is whether the truth_* columns survive the trip.

01 Pick the format TSV · FASTA · DataFrame

02 Truth or blinded expose_provenance · airr_strict

03 Write one method call

PART 01

Which writer for which job.

Each format trades something off. TSV is verbose but tool-native. FASTA is minimal: it keeps only the header and the sequence. A DataFrame holds every record in memory, and in exchange lets you filter, group, and join in seconds.

TSV — for tools

result.to_csv("out.tsv", sep="\t"). AIRR-schema column names, all ~70 fields (including truth_* if you ran with expose_provenance=True). Coordinates are 0-based by default; set airr_strict=True to convert them to 1-based inclusive for IMGT-style downstream tools.
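To make the airr_strict conversion concrete, here is a minimal sketch of the arithmetic. It assumes the default coordinates are 0-based half-open spans, as in Python slicing; the recipe only says "0-based", so the half-open part is an assumption. The helper name to_one_based_inclusive is hypothetical and not part of GenAIRR.

```python
def to_one_based_inclusive(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based, half-open [start, end) span to 1-based inclusive.

    Assumption: the input is half-open, as in Python slicing.
    """
    if start < 0 or end < start:
        raise ValueError(f"invalid span: [{start}, {end})")
    # The start shifts by one; the exclusive 0-based end is already
    # numerically equal to the inclusive 1-based end.
    return start + 1, end


sequence = "ACGTACGTAC"
start0, end0 = 2, 6  # Python slice sequence[2:6]
start1, end1 = to_one_based_inclusive(start0, end0)

# Both conventions must select the same bases.
assert sequence[start0:end0] == sequence[start1 - 1:end1]
print(start1, end1)  # prints: 3 6
```

If the default spans turn out to be 0-based inclusive instead, both ends shift by one and the helper would need adjusting.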

FASTA — for blinded handoff

result.to_fasta("out.fasta"). Just the headers and sequence. Use this when the next stage is an aligner that should not have access to the truth fields — e.g. a benchmark, see B · 01.

DataFrame — for analysis

df = result.to_dataframe(). Full pandas DataFrame; one row per record, one column per AIRR field. Ideal for in-notebook exploration or feeding into the audit workflows in B · 03.
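As a sketch of the filter-and-group work this enables, the snippet below builds a small stand-in DataFrame by hand and counts productive records per V call. The rows are made up for illustration and are not GenAIRR output, and the snippet assumes the productive column holds booleans. With a real run you would start from result.to_dataframe() instead.

```python
import pandas as pd

# Hand-written stand-in for result.to_dataframe(); values are illustrative only.
df = pd.DataFrame(
    {
        "sequence_id": ["seq0", "seq1", "seq2", "seq3"],
        "v_call": ["IGHV1-2*02", "IGHV3-23*01", "IGHV3-23*01", "IGHV1-2*02"],
        "productive": [True, True, False, True],
    }
)

# Filter: keep productive rearrangements only.
productive = df[df["productive"]]

# Group: number of productive records per V call, most frequent first.
per_v = productive.groupby("v_call").size().sort_values(ascending=False)
print(per_v)
```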

PART 02

One call, three writers.

The same result can be dumped to all three. Pick what each downstream step actually needs.

All three writers from one run

import GenAIRR as ga

result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .mutate(model="s5f", count=(5, 25))
       .run_records(n=10_000, seed=42,
                    expose_provenance=True)
)

# 1. AIRR TSV — coordinates already 0-based
result.to_csv("panel.tsv", sep="\t")

# 2. AIRR TSV — 1-based coords for IMGT-shaped tools
result.to_csv("panel_imgt.tsv", sep="\t", airr_strict=True)

# 3. FASTA — blinded; only sequence_id + sequence
result.to_fasta("panel.fasta")

# 4. DataFrame — for inline analysis
df = result.to_dataframe()
df["v_call"].value_counts().head()
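Because the TSV and the FASTA are meant to be joined again later, it can be worth confirming that both files list the same records in the same order. The following is a minimal standard-library sketch; fasta_ids and tsv_ids are hypothetical helpers, and the check assumes each FASTA header is exactly the sequence_id. The in-memory strings stand in for panel.fasta and panel.tsv.

```python
import csv
import io
from typing import TextIO


def fasta_ids(handle: TextIO) -> list[str]:
    """Return the header text (without '>') of every FASTA record, in order."""
    return [line[1:].strip() for line in handle if line.startswith(">")]


def tsv_ids(handle: TextIO, column: str = "sequence_id") -> list[str]:
    """Return the values of one column from a tab-separated file, in order."""
    return [row[column] for row in csv.DictReader(handle, delimiter="\t")]


# In-memory stand-ins for panel.fasta and panel.tsv (contents are illustrative).
fasta = io.StringIO(">seq0\nACGT\n>seq1\nGGCC\n")
tsv = io.StringIO("sequence_id\tsequence\nseq0\tACGT\nseq1\tGGCC\n")

same_ids = fasta_ids(fasta) == tsv_ids(tsv)
print(same_ids)  # prints: True
```

On real files, pass handles from open("panel.fasta") and open("panel.tsv", newline="") in place of the StringIO objects.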

PART 03

Strip `truth_*` from a TSV.

For a truly blinded benchmark you want the AIRR TSV columns (sequence, sequence_id, maybe a few metadata columns) without the provenance. Two options.

Drop the columns post-hoc

import pandas as pd

df = pd.read_csv("panel.tsv", sep="\t")
df.drop(columns=[c for c in df.columns if c.startswith("truth_")],
        inplace=True)
df.to_csv("panel_blinded.tsv", sep="\t", index=False)
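If the TSV is too large to load with pandas, the same column drop can be done row by row. The following is a minimal sketch using only the standard library. strip_truth_columns is a hypothetical helper, not a GenAIRR function, and it assumes the provenance columns are exactly those whose names start with truth_. The column name truth_v_call in the sample data is illustrative.

```python
import csv
import io
from typing import TextIO


def strip_truth_columns(src: TextIO, dst: TextIO, prefix: str = "truth_") -> list[str]:
    """Copy a TSV from src to dst, omitting columns whose name starts with prefix.

    Returns the list of column names that were kept.
    """
    reader = csv.DictReader(src, delimiter="\t")
    kept = [name for name in (reader.fieldnames or []) if not name.startswith(prefix)]
    writer = csv.DictWriter(
        dst,
        fieldnames=kept,
        delimiter="\t",
        extrasaction="ignore",  # silently skip the dropped columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:  # only one row is held in memory at a time
        writer.writerow(row)
    return kept


# Tiny in-memory example standing in for panel.tsv.
raw = "sequence_id\tsequence\ttruth_v_call\nseq0\tACGT\tIGHV1-2*02\n"
out = io.StringIO()
kept = strip_truth_columns(io.StringIO(raw), out)
print(out.getvalue())
```

With real files, open both handles with newline="" so the csv module controls line endings.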

Or just don't expose them

import GenAIRR as ga

# Run without expose_provenance=True
result = (
    ga.Experiment.on("human_igh")
       .recombine()
       .run_records(n=10_000, seed=42)
)
result.to_csv("panel.tsv", sep="\t")
# no truth_* columns at all — clean blinded panel

PART 04

FASTA header shape.

By default FASTA headers are the sequence_id — typically just seq0, seq1, etc. If you need richer headers (e.g. embedding metadata for a tool that doesn't read the TSV), generate them yourself from the DataFrame.

Custom FASTA with metadata-embedded headers

# df is the DataFrame returned by result.to_dataframe()
with open("panel_rich.fasta", "w") as fh:
    for _, row in df.iterrows():
        meta = f"locus={row['locus']}|prod={row['productive']}"
        fh.write(f">{row['sequence_id']}|{meta}\n")
        fh.write(row["sequence"] + "\n")
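A tool that consumes these headers has to split them apart again. The sketch below parses the id|key=value|key=value layout written above. parse_rich_header is a hypothetical helper, and it assumes that neither the sequence_id nor any value contains a | or = character. All values come back as strings, so a field such as prod needs converting by the caller.

```python
def parse_rich_header(line: str) -> tuple[str, dict[str, str]]:
    """Split '>seq0|locus=IGH|prod=True' into ('seq0', {'locus': 'IGH', 'prod': 'True'})."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # First field is the sequence_id; the rest are key=value pairs.
    seq_id, *pairs = line[1:].strip().split("|")
    meta = dict(pair.split("=", 1) for pair in pairs)
    return seq_id, meta


seq_id, meta = parse_rich_header(">seq0|locus=IGH|prod=True\n")
print(seq_id, meta)  # prints: seq0 {'locus': 'IGH', 'prod': 'True'}
```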

Related recipes

Where to next.

D · 02 · Stream millions without OOM →

For datasets that don't fit in memory, write to disk as the engine produces records.

B · 01 · Benchmark an aligner →

The classic export pattern: TSV for truth, FASTA for the aligner, join later.