Graph Types¶
LZGraphs provides three graph types, each optimized for different analysis scenarios. This guide helps you choose the right one.
Overview Comparison¶
| Feature | AAPLZGraph | NDPLZGraph | NaiveLZGraph |
|---|---|---|---|
| Input | Amino acids | Nucleotides | Any strings |
| Column | cdr3_amino_acid | cdr3_rearrangement | List of strings |
| Position encoding | Single (end) | Double (start, end) | None |
| V/J gene support | Yes | Yes | No |
| Alphabet size | 20 AA | 4 NT | Configurable |
| Graph density | Medium | Low | Configurable |
| Memory usage | Medium | High | Configurable |
| Best for | Most TCR analysis | Nucleotide-level | ML features |
AAPLZGraph¶
Amino Acid Positional LZGraph
When to Use¶
- Analyzing amino acid CDR3 sequences
- Need V/J gene annotations
- Standard repertoire analysis
- Moderate-sized repertoires
Node Format¶
Each node label is the pattern followed by the position where it ends.
Example: SL_5 means pattern "SL" ending at position 5.
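To see where a label such as SL_5 comes from, the sketch below splits a sequence into sub-patterns and appends each pattern's end position. It is a simplified illustration, not the library's implementation: `lz_decompose` and `positional_labels` are hypothetical helpers, and the decomposition rule (take the shortest sub-pattern not yet seen in the sequence) is an assumption about how the sequence is tokenized.

```python
def lz_decompose(sequence):
    """Split a sequence into the shortest sub-patterns not seen earlier in it."""
    seen, patterns, start = set(), [], 0
    while start < len(sequence):
        end = start + 1
        # Grow the candidate until it is new or the sequence runs out
        while end < len(sequence) and sequence[start:end] in seen:
            end += 1
        pattern = sequence[start:end]
        seen.add(pattern)
        patterns.append(pattern)
        start = end
    return patterns


def positional_labels(sequence):
    """Label each sub-pattern with its 1-based end position: pattern_end."""
    labels, position = [], 0
    for pattern in lz_decompose(sequence):
        position += len(pattern)
        labels.append(f"{pattern}_{position}")
    return labels


labels = positional_labels("CASSLEPSGGTDTQYF")
print(labels)
```

Under these assumptions, the fourth sub-pattern of CASSLEPSGGTDTQYF is "SL", which ends at position 5, so SL_5 appears among the labels.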
Example¶
from LZGraphs import AAPLZGraph
import pandas as pd
data = pd.DataFrame({
    'cdr3_amino_acid': ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF'],
    'V': ['TRBV16-1*01', 'TRBV1-1*01'],
    'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})
graph = AAPLZGraph(data, verbose=True)
Key Features¶
- Position-aware: Same pattern at different positions = different nodes
- Gene-aware: Edges carry V/J gene annotations
- Compact: 20-letter alphabet keeps graph manageable
NDPLZGraph¶
Nucleotide Double Positional LZGraph
When to Use¶
- Analyzing nucleotide sequences
- Need fine-grained positional information
- Studying codon usage or frame
- Have memory for larger graphs
Node Format¶
Each node label combines the pattern with its start and end positions.
Example: TG_3_4 means pattern "TG" from position 3 to 4.
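A double-positional label can be unpacked into its parts as sketched below. The pattern_start_end layout is inferred only from the TG_3_4 example, and `DoublePositionNode` and `parse_double_position` are hypothetical names, so treat this as an illustration rather than the library's own parser.

```python
from typing import NamedTuple


class DoublePositionNode(NamedTuple):
    pattern: str
    start: int
    end: int


def parse_double_position(label):
    """Split a pattern_start_end label into its three components."""
    # rsplit from the right so only the last two underscores are treated
    # as separators
    pattern, start, end = label.rsplit("_", 2)
    return DoublePositionNode(pattern, int(start), int(end))


node = parse_double_position("TG_3_4")
print(node.pattern, node.start, node.end)
```

In the example, the positions are inclusive: a two-letter pattern spans positions 3 and 4.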
Example¶
from LZGraphs import NDPLZGraph
import pandas as pd
data = pd.DataFrame({
    'cdr3_rearrangement': [
        'TGTGCCAGCAGTTTAGAGCCCAGCGGGGGG...',
        'TGTGCCAGCAGTGACACTTCAGGGGGGACT...'
    ],
    'V': ['TRBV16-1*01', 'TRBV1-1*01'],
    'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})
graph = NDPLZGraph(data, verbose=True)
Key Features¶
- Double position: Captures exact pattern boundaries
- Higher resolution: Better for sequence-level analysis
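- Larger graphs: Despite the 4-letter alphabet, nucleotide sequences are three times longer than their amino acid translations, so patterns occupy many more positions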
NaiveLZGraph¶
Naive LZGraph (No positional encoding)
When to Use¶
- Cross-repertoire comparisons
- Feature extraction for machine learning
- Fixed-dimension representations
- Memory-constrained environments
Node Format¶
Just the raw pattern, no position information.
Example¶
from LZGraphs import NaiveLZGraph
from LZGraphs.utilities import generate_kmer_dictionary
# Create shared dictionary
dictionary = generate_kmer_dictionary(6) # All patterns up to length 6
# Build graphs with consistent node sets
# (data1 and data2 are two repertoire DataFrames, each with a cdr3_rearrangement column)
sequences1 = data1['cdr3_rearrangement'].tolist()
sequences2 = data2['cdr3_rearrangement'].tolist()
graph1 = NaiveLZGraph(sequences1, dictionary)
graph2 = NaiveLZGraph(sequences2, dictionary)
# Feature vectors have same dimensions
features1 = graph1.eigenvector_centrality()
features2 = graph2.eigenvector_centrality()
Key Features¶
- Fixed dictionary: Same nodes across all repertoires
- Consistent dimensions: Ideal for ML pipelines
- No position info: Simpler but less detailed
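The fixed-dictionary idea can be sketched without the library. The example below assumes the shared dictionary is every nucleotide string up to a maximum length; `all_kmers` and `pattern_counts` are hypothetical helpers, and the feature here is a plain pattern count rather than the eigenvector centrality used above.

```python
from itertools import product


def all_kmers(max_length, alphabet="ACGT"):
    """Enumerate every string over the alphabet with length 1..max_length."""
    return [
        "".join(letters)
        for k in range(1, max_length + 1)
        for letters in product(alphabet, repeat=k)
    ]


def pattern_counts(patterns, dictionary):
    """Count sub-patterns into a vector with one slot per dictionary entry."""
    index = {kmer: i for i, kmer in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for pattern in patterns:
        vector[index[pattern]] += 1
    return vector


dictionary = all_kmers(6)
# Illustrative sub-pattern lists standing in for two different repertoires
vec_a = pattern_counts(["T", "G", "TG", "C"], dictionary)
vec_b = pattern_counts(["A", "AC"], dictionary)
print(len(dictionary), len(vec_a), len(vec_b))
```

Because both vectors are indexed by the same dictionary, they have the same length regardless of which patterns each repertoire contains, which is what makes them directly comparable.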
Choosing the Right Type¶
flowchart TD
A[Start] --> B{Sequence type?}
B -->|Amino acids| C{Need ML features?}
B -->|Nucleotides| D{Need ML features?}
C -->|Yes| E[NaiveLZGraph]
C -->|No| F[AAPLZGraph]
D -->|Yes| E
D -->|No| G{Need high resolution?}
G -->|Yes| H[NDPLZGraph]
G -->|No| I[NaiveLZGraph or NDPLZGraph]
Quick Decision Guide¶
- "I want standard TCR analysis" → AAPLZGraph
- "I want nucleotide-level detail" → NDPLZGraph
- "I want to compare multiple repertoires with ML" → NaiveLZGraph
- "I want consistent feature vectors" → NaiveLZGraph with shared dictionary
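The same decision logic can be written as a small helper, which may be handy in pipeline configuration code. `choose_graph_type` is a hypothetical function that only mirrors the flowchart in this guide; it is not part of LZGraphs.

```python
def choose_graph_type(sequence_type, need_ml_features=False,
                      need_high_resolution=True):
    """Return the graph type name suggested by this guide's flowchart."""
    if sequence_type not in ("amino_acid", "nucleotide"):
        raise ValueError("sequence_type must be 'amino_acid' or 'nucleotide'")
    # Fixed-dimension ML features point to NaiveLZGraph for either alphabet
    if need_ml_features:
        return "NaiveLZGraph"
    if sequence_type == "amino_acid":
        return "AAPLZGraph"
    # Nucleotides without ML features: resolution decides
    if need_high_resolution:
        return "NDPLZGraph"
    return "NaiveLZGraph or NDPLZGraph"


print(choose_graph_type("amino_acid"))
print(choose_graph_type("nucleotide", need_ml_features=True))
```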
Memory and Performance¶
Graph Size Estimates¶
For a repertoire of N sequences with average length L:
| Graph Type | Approximate Nodes | Approximate Edges |
|---|---|---|
| AAPLZGraph | O(20 × L) | O(N × L) |
| NDPLZGraph | O(4 × L²) | O(N × L) |
| NaiveLZGraph | O(dictionary size) | O(unique patterns) |
Practical Recommendations¶
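# Check your graph size
print(f"Nodes: {graph.graph.number_of_nodes()}")
print(f"Edges: {graph.graph.number_of_edges()}")
# Estimate memory (rough). sys.getsizeof(graph) would report only the shallow
# size of the wrapper object, so use the pickled size as a proxy instead
# (assumes the graph object can be pickled).
import pickle
size_mb = len(pickle.dumps(graph)) / (1024 * 1024)
print(f"Approximate size: {size_mb:.1f} MB")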
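Note that `sys.getsizeof` reports only the shallow size of the object it is given, not the nodes, edges, and attribute dictionaries it references. A rough in-memory estimate has to follow those references. The sketch below does so for plain Python containers and objects with a `__dict__`; `deep_getsizeof` is a hypothetical helper, and it ignores `__slots__` and memory held by C extensions, so the result is a lower bound.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and what it references."""
    seen = set() if seen is None else seen
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(key, seen) + deep_getsizeof(value, seen)
                    for key, value in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif hasattr(obj, "__dict__"):
        size += deep_getsizeof(vars(obj), seen)
    return size


# A toy nested adjacency dictionary standing in for a graph
adjacency = {"C_1": {"A_2": {"weight": 0.5}}, "A_2": {"S_3": {"weight": 1.0}}}
print(sys.getsizeof(adjacency), deep_getsizeof(adjacency))
```

Even for this toy structure, the deep size is larger than the shallow size of the outer dictionary.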
Converting Between Types¶
You cannot directly convert between graph types, but you can:
Re-encode Sequences¶
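from LZGraphs import AAPLZGraph, NaiveLZGraph
from LZGraphs.utilities import generate_kmer_dictionary
# Get sequences from one format
aa_sequences = data['cdr3_amino_acid'].tolist()
# Build different graph types
aa_graph = AAPLZGraph(data)
# For NaiveLZGraph, use the sequences directly.
# The dictionary must cover the alphabet of the sequences you pass in.
naive_graph = NaiveLZGraph(
    aa_sequences,
    generate_kmer_dictionary(6)
)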
Compare Analyses¶
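# Same repertoire, different representations
# Assumption: lempel_ziv_decomposition is importable from LZGraphs.utilities;
# adjust the import if your LZGraphs version exposes it elsewhere
from LZGraphs.utilities import lempel_ziv_decomposition
sequence = 'CASSLEPSGGTDTQYF'  # any CDR3 amino acid sequence from the repertoire
pgen_aa = aa_graph.walk_probability(
    AAPLZGraph.encode_sequence(sequence)
)
pgen_naive = naive_graph.walk_probability(
    lempel_ziv_decomposition(sequence)
)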
print(f"AAPLZGraph Pgen: {pgen_aa:.2e}")
print(f"NaiveLZGraph Pgen: {pgen_naive:.2e}")
Next Steps¶
- Probability Model - How graphs calculate probabilities
- LZ76 Algorithm - Understand the encoding
- API Reference - Detailed class documentation