LZ76 Algorithm¶
The Lempel-Ziv 1976 (LZ76) algorithm is the foundation of LZGraphs' sequence encoding. Understanding how it works helps you interpret graph structure and encoding results.
What is LZ76?¶
LZ76 is a lossless compression algorithm that decomposes a sequence into a series of unique subpatterns. Each new pattern extends a previously seen pattern by one character.
How It Works¶
Step-by-Step Decomposition¶
Let's decompose the sequence CASSLE:
Input: C A S S L E
│ │ │ │ │ │
Step 1: C → Dictionary: {C}
Step 2: A → Dictionary: {C, A}
Step 3: S → Dictionary: {C, A, S}
Step 4: S L → "SL" (S + new char L) → Dictionary: {C, A, S, SL}
Step 5: E → Dictionary: {C, A, S, SL, E}
Result: [C, A, S, SL, E]
Key Properties¶
- Unique patterns: Each pattern in the decomposition is unique
- Incremental building: New patterns extend previous ones
- Lossless: Original sequence can be reconstructed
- Deterministic: Same input always produces same output
Python Implementation¶
LZGraphs uses this algorithm internally:
from LZGraphs.utilities import lempel_ziv_decomposition
# Basic decomposition
sequence = "CASSLEPSGGTDTQYF"
patterns = lempel_ziv_decomposition(sequence)
print(patterns)
# ['C', 'A', 'S', 'SL', 'E', 'P', 'SG', 'G', 'T', 'D', 'TQ', 'Y', 'F']
Verify Reconstruction¶
# Patterns concatenate to original
reconstructed = ''.join(patterns)
print(f"Original: {sequence}")
print(f"Reconstructed: {reconstructed}")
print(f"Match: {sequence == reconstructed}")
Why LZ76 for Sequences?¶
1. Captures Structure¶
Unlike k-mer approaches, LZ76 adapts to the sequence:
# Repetitive sequence
rep_seq = "AAAAAA"
print(lempel_ziv_decomposition(rep_seq))
# ['A', 'AA', 'AAA'] - Captures repetition
# Diverse sequence
div_seq = "ABCDEF"
print(lempel_ziv_decomposition(div_seq))
# ['A', 'B', 'C', 'D', 'E', 'F'] - All unique
2. Variable-Length Patterns¶
LZ76 naturally finds patterns of different lengths:
sequence = "CASSLEPSGGTDTQYF"
patterns = lempel_ziv_decomposition(sequence)
# Pattern lengths vary based on repetition
lengths = [len(p) for p in patterns]
print(f"Patterns: {patterns}")
print(f"Lengths: {lengths}")
3. Compression Reflects Complexity¶
# More repetitive = fewer patterns
simple = "AAAAAAAAAAAA"
complex = "ABCDEFGHIJKL"
simple_patterns = lempel_ziv_decomposition(simple)
complex_patterns = lempel_ziv_decomposition(complex)
print(f"Simple ({len(simple)} chars): {len(simple_patterns)} patterns")
print(f"Complex ({len(complex)} chars): {len(complex_patterns)} patterns")
Positional Encoding¶
LZGraphs extends LZ76 with position information:
AAPLZGraph Encoding¶
from LZGraphs import AAPLZGraph
sequence = "CASSLE"
encoded = AAPLZGraph.encode_sequence(sequence)
print(encoded)
# ['C_1', 'A_2', 'S_3', 'SL_5', 'E_6']
The _N suffix indicates the ending position of each pattern:
| Pattern | Start | End | Encoded |
|---|---|---|---|
| C | 1 | 1 | C_1 |
| A | 2 | 2 | A_2 |
| S | 3 | 3 | S_3 |
| SL | 4 | 5 | SL_5 |
| E | 6 | 6 | E_6 |
NDPLZGraph Encoding¶
NDPLZGraph uses double positions (start and end):
from LZGraphs import NDPLZGraph
sequence = "TGTGCC"
encoded = NDPLZGraph.encode_sequence(sequence)
print(encoded)
# ['T_1_1', 'G_2_2', 'T_3_3', 'G_4_4', 'C_5_5', 'C_6_6']
Why Position Matters¶
In CDR3 sequences, position carries biological meaning:
- Early positions: Often V-gene derived
- Middle positions: Junction diversity
- Late positions: Often J-gene derived
Positional encoding allows the graph to distinguish:
# Same pattern, different positions
seq1 = "CASSG" # G at position 5
seq2 = "GCASS" # G at position 1
enc1 = AAPLZGraph.encode_sequence(seq1)
enc2 = AAPLZGraph.encode_sequence(seq2)
print(f"seq1: {enc1}") # [..., 'G_5']
print(f"seq2: {enc2}") # ['G_1', ...]
Mathematical Properties¶
Compression Ratio¶
The number of LZ76 patterns relative to sequence length indicates complexity:
def complexity_ratio(sequence):
patterns = lempel_ziv_decomposition(sequence)
return len(patterns) / len(sequence)
# Compare sequences
sequences = [
"AAAAAAAAAA", # Very simple
"CASSLEPSGGTDTQYF", # Typical CDR3
"ABCDEFGHIJ" # Maximum complexity
]
for seq in sequences:
ratio = complexity_ratio(seq)
print(f"{seq[:15]:15s} ratio: {ratio:.2f}")
Pattern Growth¶
For a sequence of length n:
- Maximum patterns: n (every character is new)
- Minimum patterns: O(log n) (highly repetitive)
- Typical CDR3: Between these extremes
Connection to Diversity¶
LZ76 complexity connects to repertoire diversity:
- K-diversity: Counts unique patterns across samples
- Graph nodes: Each pattern becomes a potential node
- Graph edges: Pattern transitions become edges
This creates a natural bridge from sequences to graph metrics.
Next Steps¶
- Graph Types - How encoding affects graph structure
- Probability Model - Using patterns for probability
- Tutorials: Diversity Metrics - Apply to real data