Quick Start¶
Build your first LZGraph in 5 minutes. This guide shows the essential workflow for TCR repertoire analysis.
Step 1: Import and Load Data¶
from LZGraphs import LZGraph
# Option 1: Plain list of sequences
sequences = ["CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF", "CASSLEPQTFTDTFFF"]
# Option 2: Load from CSV
import csv
with open("your_repertoire.csv") as f:
sequences = [row['cdr3_amino_acid'] for row in csv.DictReader(f)]
Input Format
LZGraphs expects a list of strings for sequences. If you have gene annotations or abundances, they should also be provided as parallel lists.
Step 2: Build a Graph¶
from LZGraphs import LZGraph
sequences = ["CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF", "CASSLEPQTFTDTFFF"]
graph = LZGraph(sequences, variant='aap') # (1)
variant='aap'uses amino acid positional encoding, the best choice for CDR3 amino acid sequences. See Graph Variants for alternatives.
Step 3: Explore the Graph¶
# Basic graph statistics
print(f"Nodes: {graph.n_nodes}")
print(f"Edges: {graph.n_edges}")
print(f"Total sequences: {graph.n_sequences}")
# View length distribution {length: count}
print(graph.length_distribution)
Step 4: Sequence Generation Probability (pgen)¶
Every sequence has a generation probability under the LZ-constrained model:
# Calculate log probability
log_p = graph.pgen("CASSLEPSGGTDTQYF")
print(f"log P(gen) = {log_p:.2f}")
# Multiple sequences at once (returns numpy array)
log_ps = graph.pgen(["CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF"])
Step 5: Simulate New Sequences¶
Generate sequences that follow the statistical patterns of your repertoire:
# Generate 1000 sequences
results = graph.simulate(1000, seed=42)
# results is iterable and contains the generated strings
for seq in results[:5]:
print(seq)
# If the graph has gene data, you can constrain the simulation
custom_results = graph.simulate(10, v_gene="TRBV16-1*01", j_gene="TRBJ1-2*01")
Step 6: Diversity & Richness¶
# Hill diversity number D(1) (Effective Diversity)
# Note: effective_diversity() is a shortcut for hill_number(1)
print(f"Effective Diversity: {graph.effective_diversity():.1f}")
# Hill order 2 (Inverse Simpson)
print(f"Hill D(2): {graph.hill_number(2):.1f}")
# Predicted richness at depth 100,000
print(f"Predicted richness: {graph.predicted_richness(100000):.1f}")
Complete Example¶
from LZGraphs import LZGraph
# Sample data
sequences = ["CASSLEPSGGTDTQYF", "CASSDTSGGTDTQYF", "CASSLEPQTFTDTFFF"]
v_genes = ["TRBV16-1*01", "TRBV1-1*01", "TRBV16-1*01"]
j_genes = ["TRBJ1-2*01", "TRBJ1-5*01", "TRBJ2-7*01"]
# 1. Build graph
graph = LZGraph(sequences, v_genes=v_genes, j_genes=j_genes, variant='aap')
# 2. Probability
seq = "CASSLEPSGGTDTQYF"
print(f"{seq} log P: {graph.pgen(seq):.2f}")
# 3. Simulate
simulated = graph.simulate(5, seed=42)
print(f"Simulated: {list(simulated)}")
# 4. Diversity
print(f"D(1): {graph.effective_diversity():.1f}")
print(f"D(2): {graph.hill_number(2.0):.1f}")
FlashBackGraph
If you need exact, closed-form diversity metrics without stochastic estimation, or if you are working with very large repertoires in limited memory, consider using FlashBackGraph. It follows the same basic API for pgen() and simulate().
Which LZGraph variant?¶
LZGraph has three variants. Pick by your input:
- Amino acid CDR3s ->
variant='aap'(the default, most common) - Nucleotide CDR3s ->
variant='ndp' - Motif or position-free analysis ->
variant='naive'
For the full comparison, see Graph Variants.
What's Next?¶
- Tutorial: Graph Construction - Go deeper with the LZGraph track
- Core Ideas - The concepts behind the graph
- Which Graph Should I Use? - Pick a family
- How-To Guides - Task-specific recipes
- Concepts - Theory behind LZ76 graphs