Graph Types¶

LZGraphs provides three graph types, each optimized for different analysis scenarios. This guide helps you choose the right one.

Overview Comparison¶

Feature	AAPLZGraph	NDPLZGraph	NaiveLZGraph
Input	Amino acids	Nucleotides	Any strings
Column	`cdr3_amino_acid`	`cdr3_rearrangement`	List of strings
Position encoding	Single (end)	Double (start, end)	None
V/J gene support	Yes	Yes	No
Alphabet size	20 AA	4 NT	Configurable
Graph density	Medium	Low	Configurable
Memory usage	Medium	High	Configurable
Best for	Most TCR analysis	Nucleotide-level	ML features

AAPLZGraph¶

Amino Acid Positional LZGraph

When to Use¶

Analyzing amino acid CDR3 sequences
Need V/J gene annotations
Standard repertoire analysis
Moderate-sized repertoires

Node Format¶

<pattern>_<end_position>

Example: SL_5 means pattern "SL" ending at position 5.

Example¶

from LZGraphs import AAPLZGraph
import pandas as pd

data = pd.DataFrame({
    'cdr3_amino_acid': ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF'],
    'V': ['TRBV16-1*01', 'TRBV1-1*01'],
    'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})

graph = AAPLZGraph(data, verbose=True)

Key Features¶

Position-aware: Same pattern at different positions = different nodes
Gene-aware: Edges carry V/J gene annotations
Compact: 20-letter alphabet keeps graph manageable

NDPLZGraph¶

Nucleotide Double Positional LZGraph

When to Use¶

Analyzing nucleotide sequences
Need fine-grained positional information
Studying codon usage or frame
Have memory for larger graphs

Node Format¶

<pattern>_<start_position>_<end_position>

Example: TG_3_4 means pattern "TG" from position 3 to 4.

Example¶

from LZGraphs import NDPLZGraph
import pandas as pd

data = pd.DataFrame({
    'cdr3_rearrangement': [
        'TGTGCCAGCAGTTTAGAGCCCAGCGGGGGG...',
        'TGTGCCAGCAGTGACACTTCAGGGGGGACT...'
    ],
    'V': ['TRBV16-1*01', 'TRBV1-1*01'],
    'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})

graph = NDPLZGraph(data, verbose=True)

Key Features¶

Double position: Captures exact pattern boundaries
Higher resolution: Better for sequence-level analysis
Larger graphs: 4-letter alphabet but more positions

NaiveLZGraph¶

Naive LZGraph (No positional encoding)

When to Use¶

Cross-repertoire comparisons
Feature extraction for machine learning
Fixed-dimension representations
Memory-constrained environments

Node Format¶

<pattern>

Just the raw pattern, no position information.

Example¶

from LZGraphs import NaiveLZGraph
from LZGraphs.utilities import generate_kmer_dictionary

# Create shared dictionary
dictionary = generate_kmer_dictionary(6)  # All patterns up to length 6

# Build graphs with consistent node sets
sequences1 = data1['cdr3_rearrangement'].tolist()
sequences2 = data2['cdr3_rearrangement'].tolist()

graph1 = NaiveLZGraph(sequences1, dictionary)
graph2 = NaiveLZGraph(sequences2, dictionary)

# Feature vectors have same dimensions
features1 = graph1.eigenvector_centrality()
features2 = graph2.eigenvector_centrality()

Key Features¶

Fixed dictionary: Same nodes across all repertoires
Consistent dimensions: Ideal for ML pipelines
No position info: Simpler but less detailed

Choosing the Right Type¶

flowchart TD
    A[Start] --> B{Sequence type?}
    B -->|Amino acids| C{Need ML features?}
    B -->|Nucleotides| D{Need ML features?}

    C -->|Yes| E[NaiveLZGraph]
    C -->|No| F[AAPLZGraph]

    D -->|Yes| E
    D -->|No| G{Need high resolution?}

    G -->|Yes| H[NDPLZGraph]
    G -->|No| I[NaiveLZGraph or NDPLZGraph]

Quick Decision Guide¶

"I want standard TCR analysis" → AAPLZGraph
"I want nucleotide-level detail" → NDPLZGraph
"I want to compare multiple repertoires with ML" → NaiveLZGraph
"I want consistent feature vectors" → NaiveLZGraph with shared dictionary

Memory and Performance¶

Graph Size Estimates¶

For a repertoire of N sequences with average length L:

Graph Type	Approximate Nodes	Approximate Edges
AAPLZGraph	O(20 × L)	O(N × L)
NDPLZGraph	O(4 × L²)	O(N × L)
NaiveLZGraph	O(dictionary size)	O(unique patterns)

Practical Recommendations¶

# Check your graph size
print(f"Nodes: {graph.graph.number_of_nodes()}")
print(f"Edges: {graph.graph.number_of_edges()}")

# Estimate memory (rough)
import sys
size_mb = sys.getsizeof(graph) / (1024 * 1024)
print(f"Approximate size: {size_mb:.1f} MB")

Converting Between Types¶

You cannot directly convert between graph types, but you can:

Re-encode Sequences¶

# Get sequences from one format
aa_sequences = data['cdr3_amino_acid'].tolist()

# Build different graph types
aa_graph = AAPLZGraph(data)

# For NaiveLZGraph, use the sequences directly
naive_graph = NaiveLZGraph(
    aa_sequences,
    generate_kmer_dictionary(6)
)

Compare Analyses¶

# Same repertoire, different representations
pgen_aa = aa_graph.walk_probability(
    AAPLZGraph.encode_sequence(sequence)
)
pgen_naive = naive_graph.walk_probability(
    lempel_ziv_decomposition(sequence)
)

print(f"AAPLZGraph Pgen:   {pgen_aa:.2e}")
print(f"NaiveLZGraph Pgen: {pgen_naive:.2e}")

Next Steps¶

Probability Model - How graphs calculate probabilities
LZ76 Algorithm - Understand the encoding
API Reference - Detailed class documentation