Prepare Your Data¶
Learn how to format and clean your TCR/BCR repertoire data for use with LZGraphs.
Quick Reference¶
from LZGraphs import LZGraph
# 1. Plain list of sequences (simplest)
sequences = ["CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF", ...]
graph = LZGraph(sequences, variant='aap')
# 2. With abundance weighting
counts = [150, 42, 10, ...]
graph = LZGraph(sequences, abundances=counts, variant='aap')
# 3. With V/J gene annotations
v_genes = ["TRBV16-1*01", "TRBV1-1*01", ...]
j_genes = ["TRBJ1-2*01", "TRBJ1-5*01", ...]
graph = LZGraph(sequences, v_genes=v_genes, j_genes=j_genes, variant='aap')
Accepted Input Format¶
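The LZGraph constructor expects parallel lists for all input data: every optional list must have the same length as sequences, and the entries at a given index must describe the same record.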
| Argument | Type | Description |
|---|---|---|
| sequences | list[str] | Required: Amino acid or nucleotide strings |
| abundances | list[int] | Optional: Frequency counts per sequence |
| v_genes | list[str] | Optional: V gene names |
| j_genes | list[str] | Optional: J gene names |
No pandas required
LZGraphs has no pandas dependency. All inputs are plain Python lists. Use Python's built-in csv module to load data from files.
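Because the inputs are parallel lists, a length mismatch is an easy mistake to make when columns are loaded or filtered separately. A minimal sketch of a pre-flight check follows; `check_parallel` is a hypothetical helper written for this page, not part of LZGraphs.

```python
def check_parallel(sequences, **columns):
    """Raise ValueError if any provided column differs in length from sequences.

    Hypothetical helper (not an LZGraphs function). Columns passed as
    None are treated as "not provided" and skipped.
    """
    n = len(sequences)
    for name, values in columns.items():
        if values is not None and len(values) != n:
            raise ValueError(
                f"{name} has {len(values)} entries but sequences has {n}"
            )
    return n


sequences = ["CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF"]
counts = [150, 42]
v_genes = ["TRBV16-1*01", "TRBV1-1*01"]

# j_genes is omitted here, so it is passed as None and ignored.
n_records = check_parallel(
    sequences, abundances=counts, v_genes=v_genes, j_genes=None
)
print(f"{n_records} aligned records")
```

Running a check like this just before constructing the graph surfaces misaligned columns early, with a message that names the offending list.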
Loading from Common Formats¶
From CSV¶
import csv
from LZGraphs import LZGraph
seqs, v_genes, j_genes = [], [], []
with open("repertoire.csv", newline="") as f:
    for row in csv.DictReader(f):
        seqs.append(row['cdr3_amino_acid'])
        v_genes.append(row['v_call'])
        j_genes.append(row['j_call'])
graph = LZGraph(seqs, v_genes=v_genes, j_genes=j_genes, variant='aap')
From TSV (AIRR-Standard Files)¶
If your data is in the AIRR TSV format:
import csv
from LZGraphs import LZGraph
seqs, v_genes, j_genes = [], [], []
with open("repertoire.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter='\t'):
        seqs.append(row['junction_aa'])
        v_genes.append(row['v_call'])
        j_genes.append(row['j_call'])
graph = LZGraph(seqs, v_genes=v_genes, j_genes=j_genes, variant='aap')
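AIRR rearrangement files often also carry a `duplicate_count` column, which can serve as the abundance list. The sketch below reads it alongside the gene calls and skips rows without a `junction_aa` value. The helper name `read_airr_rows`, the choice to default a missing count to 1, and the in-memory demo data are all assumptions made for illustration.

```python
import csv
import io


def read_airr_rows(handle):
    """Collect junction_aa, v_call, j_call and duplicate_count as parallel lists.

    Hypothetical helper (not an LZGraphs function). Rows with an empty
    junction_aa are skipped; a missing or empty duplicate_count is
    treated as 1, which is an assumption rather than an AIRR rule.
    """
    seqs, v_genes, j_genes, counts = [], [], [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        junction = (row.get("junction_aa") or "").strip()
        if not junction:
            continue  # no CDR3 call for this rearrangement
        raw_count = (row.get("duplicate_count") or "").strip()
        seqs.append(junction)
        v_genes.append(row["v_call"])
        j_genes.append(row["j_call"])
        counts.append(int(raw_count) if raw_count else 1)
    return seqs, v_genes, j_genes, counts


# Small in-memory stand-in for a real file; with a file on disk, pass
# open("repertoire.tsv", newline="") instead.
demo = io.StringIO(
    "junction_aa\tv_call\tj_call\tduplicate_count\n"
    "CASSLEPSGGTDTQYF\tTRBV16-1*01\tTRBJ1-2*01\t150\n"
    "\tTRBV1-1*01\tTRBJ1-5*01\t3\n"
    "CASSDTSGGTDTQYF\tTRBV1-1*01\tTRBJ1-5*01\t\n"
)
seqs, v_genes, j_genes, counts = read_airr_rows(demo)
print(seqs, counts)
```

The four returned lists stay aligned because every row either contributes to all of them or to none. They can then be passed as sequences, v_genes, j_genes and abundances, assuming the constructor accepts these arguments together.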
Cleaning and Filtering¶
Invalid Characters¶
LZGraphs expects valid amino acid (standard 20 residues) or nucleotide (A, C, G, T) strings. Filter out entries with missing data or non-standard characters like * or X:
import re
# Standard 20 amino acids
valid_aa = re.compile(r'[ACDEFGHIKLMNPQRSTVWY]+')
# Keep only non-empty sequences made entirely of standard residues
clean_seqs = [s for s in sequences if s and valid_aa.fullmatch(s)]
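Filtering the sequence list on its own leaves any accompanying v_genes, j_genes or abundances lists out of alignment. One way to keep them in step is to filter whole records at once, as in this sketch; `filter_records` is a hypothetical helper, not part of LZGraphs.

```python
import re

# Standard 20 amino acids
VALID_AA = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")


def filter_records(sequences, v_genes, j_genes):
    """Drop records whose sequence is missing or has non-standard characters.

    Hypothetical helper: the three lists are zipped so that a rejected
    sequence removes its V and J calls as well.
    """
    kept = [
        (s, v, j)
        for s, v, j in zip(sequences, v_genes, j_genes)
        if s and VALID_AA.fullmatch(s)
    ]
    if not kept:
        return [], [], []
    seqs, vs, js = zip(*kept)
    return list(seqs), list(vs), list(js)


sequences = ["CASSLEPSGGTDTQYF", "CASS*DTQYF", "", "CASSXTQYF", "CASSDTSGGTDTQYF"]
v_genes = ["TRBV16-1*01", "TRBV1-1*01", "TRBV1-1*01", "TRBV16-1*01", "TRBV1-1*01"]
j_genes = ["TRBJ1-2*01", "TRBJ1-5*01", "TRBJ1-2*01", "TRBJ1-5*01", "TRBJ1-5*01"]

clean_seqs, clean_v, clean_j = filter_records(sequences, v_genes, j_genes)
print(len(clean_seqs), "of", len(sequences), "records kept")
```

The same pattern extends to an abundances list by adding it to the `zip` call.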
Length Filtering¶
Extremely short or long sequences are often artifacts of sequencing or pipeline errors:
# Keep typical CDR3 lengths (8 to 25 residues)
filtered = [s for s in sequences if 8 <= len(s) <= 25]
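Before settling on cutoffs, it can help to look at the length distribution of your own data. The sketch below tallies lengths with `collections.Counter` and reports how many sequences a given range would discard. `length_report` is a hypothetical helper, and the 8 to 25 defaults simply mirror the example above.

```python
from collections import Counter


def length_report(sequences, min_len=8, max_len=25):
    """Return (length -> count, number of sequences outside the range).

    Hypothetical helper (not an LZGraphs function).
    """
    lengths = Counter(len(s) for s in sequences)
    outside = sum(
        count
        for length, count in lengths.items()
        if not min_len <= length <= max_len
    )
    return lengths, outside


sequences = ["CASSF", "CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF", "C" * 30]
lengths, outside = length_report(sequences)

for length in sorted(lengths):
    print(f"length {length:>2}: {lengths[length]} sequence(s)")
print(f"{outside} sequence(s) fall outside the 8 to 25 range")
```

If a large share of sequences falls outside the range, that may point to an upstream problem worth investigating before filtering.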
Saturation Analysis¶
How do you know if you have enough data to build a representative graph? Use the saturation_curve function to see how the number of nodes and edges grows as you add more sequences.
from LZGraphs import saturation_curve
# Compute saturation stats
# Returns a list of dicts: [{'n_sequences': 100, 'n_nodes': 450, 'n_edges': 600}, ...]
stats = saturation_curve(sequences, variant='aap', log_every=500)
for s in stats[:5]:
    print(f"Seqs: {s['n_sequences']:>6} | Nodes: {s['n_nodes']:>6}")
If the n_nodes curve starts to flatten, each additional batch of sequences is contributing few new nodes, which suggests you have enough data to capture most of the structural diversity of that repertoire.
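Flattening can also be checked numerically by computing how many new nodes each checkpoint adds per sequence. The sketch below assumes the list-of-dicts shape shown in the comment above (`n_sequences`, `n_nodes`, `n_edges`). The helper `node_growth` is hypothetical and the numbers are invented purely to illustrate the calculation.

```python
def node_growth(stats):
    """New nodes gained per added sequence between consecutive checkpoints.

    Hypothetical helper (not an LZGraphs function).
    """
    rates = []
    for prev, curr in zip(stats, stats[1:]):
        added = curr["n_sequences"] - prev["n_sequences"]
        if added > 0:
            rates.append((curr["n_nodes"] - prev["n_nodes"]) / added)
    return rates


# Illustrative values only, not real saturation_curve output.
stats = [
    {"n_sequences": 500, "n_nodes": 900, "n_edges": 1500},
    {"n_sequences": 1000, "n_nodes": 1300, "n_edges": 2600},
    {"n_sequences": 1500, "n_nodes": 1450, "n_edges": 3300},
    {"n_sequences": 2000, "n_nodes": 1500, "n_edges": 3700},
]

rates = node_growth(stats)
for s, rate in zip(stats[1:], rates):
    print(f"up to {s['n_sequences']:>5} seqs: {rate:.2f} new nodes per sequence")
```

A rate that keeps falling toward zero corresponds to the flattening curve described above. What counts as "flat enough" depends on the repertoire and the analysis, so any threshold is a judgment call.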
Next Steps¶
- Graph Construction tutorial — Detailed construction options
- Choosing Graph Variants — Choose between AAP, NDP, and Naive
- Quick Start — Build your first graph