Lesson 1: Getting Started with LZGraph¶
Welcome to the LZGraph track. This is the first of a short series of notebooks
that takes you from zero to a working understanding of the LZGraph class.
By the end of this lesson you will be able to:
- Explain what an
LZGraphis and what problem it solves. - Tokenize a CDR3 sequence with LZ76 and read the resulting tokens.
- Build an
LZGraphfrom a list of sequences (with or without V/J gene calls). - Score the probability of a sequence under the graph using
pgen(). - Generate new sequences by sampling from the graph.
- Sanity-check a graph (does it define a proper probability distribution?).
- Save and reload a graph in the native
.lzgbinary format.
This notebook does not cover diversity metrics, set algebra, posteriors, or
the FlashBackGraph variant. Diversity and analytics are Lesson 2; set
operations, posteriors and ML features are Lesson 3. For the FlashBackGraph
class, see the parallel notebooks under examples/flashback/.
About the data. The cells below load
data/ExampleData3.csv, which ships with the package and contains 5,000 amino-acid CDR3 sequences with V/J gene calls. The same file is used across all three LZGraph notebooks.
from LZGraphs import LZGraph, lz76_decompose
import numpy as np
import csv
1. The LZ76 tokenizer¶
LZGraph is built on the Lempel-Ziv 1976 parsing of a string. LZ76
walks left-to-right and emits a new token every time it encounters a
prefix that has not been seen before. Each new token is exactly one
character longer than the longest previously-seen prefix that matches
the current position. The result is a sequence of tokens that grows in
length as repeated structure appears.
This is the same algorithm used by gzip (with refinements), but here
we use it as a structural fingerprint of a CDR3 sequence: tokens carry
both the literal characters and the position at which they first
appeared in the parse.
seq = 'CASSLEPSGGTDTQYF'
tokens = lz76_decompose(seq)
print(f'Sequence: {seq}')
print(f'Tokens: {tokens}')
print(f'Count: {len(tokens)}')
# Sanity check: concatenating the tokens reproduces the input
assert ''.join(tokens) == seq
Sequence: CASSLEPSGGTDTQYF Tokens: ['C', 'A', 'S', 'SL', 'E', 'P', 'SG', 'G', 'T', 'D', 'TQ', 'Y', 'F'] Count: 13
Read the tokens above left-to-right and notice that each one extends the longest previously-seen prefix by exactly one character. Short, repetitive sequences produce few tokens; long, novel sequences produce many. The token count is closely related to the information content of the sequence.
2. From a sequence to a graph¶
Imagine you ran LZ76 on every CDR3 in a repertoire and recorded every token transition you saw. The result is a directed graph where:
- Nodes are tokens (with a positional suffix in the
'aap'variant, e.g.'C_2'means "C as the second token"). - Edges record how often one token follows another.
- Two special sentinel nodes wrap every walk:
@at the start and$at the end.
A repertoire becomes a single graph, and the LZ76 grammar guarantees the graph is a directed acyclic graph (DAG) for the AAP variant. We'll build one on a tiny toy repertoire first so you can see the structure end-to-end.
tiny = ['CASSLGIRRT', 'CASSLGYEQYF', 'CASSQETQYF']
tiny_graph = LZGraph(tiny, variant='aap')
print(tiny_graph)
print(f'Nodes: {tiny_graph.n_nodes} Edges: {tiny_graph.n_edges}')
print(f'Initial states: {tiny_graph.n_initial}')
print(f'Terminal states: {tiny_graph.n_terminal}')
LZGraph(variant='aap', nodes=21, edges=21) Nodes: 21 Edges: 21 Initial states: 1 Terminal states: 2
Here is what that tiny graph looks like rendered. The green @ node
is where every walk begins; the red $ nodes are where walks end. The
blue nodes are the LZ76 tokens (with their position suffixes).
The graph encodes the structure of the input repertoire: any of the
three input sequences corresponds to a walk from @ to a $. The graph
also encodes structure the inputs did not exhibit but is consistent
with the LZ76 grammar (e.g. cross-paths between sequences), which is
exactly what lets us generate and score new sequences.
3. Building a real graph¶
Time for a real repertoire. Load ExampleData3.csv (5,000 AA CDR3s
with V/J gene calls) and build the graph with full gene annotation.
sequences, v_genes, j_genes = [], [], []
with open('../data/ExampleData3.csv') as f:
for row in csv.DictReader(f):
sequences.append(row['cdr3_amino_acid'])
v_genes.append(row['V'])
j_genes.append(row['J'])
print(f'{len(sequences):,d} sequences loaded')
print(f'First three: {sequences[:3]}')
print(f'Genes attached: V={v_genes[0]}, J={j_genes[0]}')
5,000 sequences loaded First three: ['CASSGLAGSRSYNEQFF', 'CASSPTGGVYEQYF', 'CASSQTGESNQPQHF'] Genes attached: V=TRBV2-1*01, J=TRBJ2-1*01
graph = LZGraph(
sequences,
variant='aap',
v_genes=v_genes,
j_genes=j_genes,
)
print(graph)
LZGraph(variant='aap', nodes=1721, edges=9644)
4. Graph anatomy¶
Once a graph is built, it exposes a small set of properties for the
basic structural inspection you'll do all the time. Use these to
sanity-check size, confirm gene data is present, and get a feel for
how "rich" the graph is (path count is a useful measure: it is the
total number of distinct walks from @ to any $).
print(f'n_nodes: {graph.n_nodes:,d}')
print(f'n_edges: {graph.n_edges:,d}')
print(f'variant: {graph.variant}')
print(f'has_gene_data: {graph.has_gene_data}')
print(f'is_dag: {graph.is_dag}')
print(f'path_count: {graph.path_count:,.0f}')
n_nodes: 1,721 n_edges: 9,644 variant: aap has_gene_data: True is_dag: True path_count: 855,650,120,261
# A structural summary in dict form. Useful for logging or
# comparing two graphs side-by-side.
graph.summary()
{'n_nodes': 1721,
'n_edges': 9644,
'n_initial': 1,
'n_terminal': 28,
'max_out_degree': 109,
'max_in_degree': 81,
'n_isolates': 0,
'is_dag': True}
5. Scoring a sequence with pgen()¶
The most important method on an LZGraph is pgen(). Given a sequence,
it returns the probability that you would observe that sequence under
the graph's distribution.
What does that mean concretely? Internally pgen():
- Re-tokenizes the input sequence with LZ76.
- Traces the unique walk through the graph that those tokens describe.
- Multiplies the renormalized edge weights along that walk.
- Returns the result (as a log probability by default; pass
log=Falsefor the raw probability).
Probabilities are very small numbers, so working in log space is the default and is what you want most of the time.
# Score a single training sequence
seq = sequences[0]
lp = graph.pgen(seq)
p = graph.pgen(seq, log=False)
print(f'{seq}')
print(f' log P(gen) = {lp:.4f}')
print(f' P(gen) = {p:.3e}')
CASSGLAGSRSYNEQFF log P(gen) = -22.3202 P(gen) = 2.025e-10
# Batch scoring: pass a list, get a numpy array
batch = sequences[:8]
log_probs = graph.pgen(batch)
for s, lp in zip(batch, log_probs):
print(f'{s:<25} log P = {lp:.4f}')
CASSGLAGSRSYNEQFF log P = -22.3202 CASSPTGGVYEQYF log P = -14.8578 CASSQTGESNQPQHF log P = -15.6642 CASSKTDISSPLHF log P = -25.0422 CASSLAGHSGGAQRGNEQFF log P = -22.3090 CASSPQDRPNYGYTF log P = -19.0571 CASSSQDRVTQYF log P = -17.5585 CASSSSGGATEQYF log P = -15.9175
# Membership: does this sequence have a valid walk in the graph?
print(f'In-vocabulary: "CASSLGYEQYF" in graph = {"CASSLGYEQYF" in graph}')
print(f'In-graph: "{sequences[0]}" in graph = {sequences[0] in graph}')
print(f'Out-of-vocab: "ZZZZZZZZZZ" in graph = {"ZZZZZZZZZZ" in graph}')
In-vocabulary: "CASSLGYEQYF" in graph = True In-graph: "CASSGLAGSRSYNEQFF" in graph = True Out-of-vocab: "ZZZZZZZZZZ" in graph = False
What if a sequence is impossible under the graph? The graph still
returns a (very small) log probability, floored at LOG_EPS ≈ -690
(the natural log of 1e-300). This avoids -inf propagating through
downstream calculations. The matching __contains__ check (seq in graph) returns True only for sequences that are strictly on a
valid walk, so use that if you need a clean yes/no.
6. Sampling new sequences¶
Now the reverse direction: instead of asking "how likely is this
sequence?" we ask "give me some sequences from the same distribution".
simulate(n) runs n random walks from @ to a terminal $, using
the graph's edge weights as transition probabilities.
sim = graph.simulate(10, seed=42)
for seq in sim:
print(seq)
CAISLSGGGTEKGQYFF CASSPPSGGADGYTQF CASSPPPGTTRAGNEQYF CASSLQGGSYNEAFF CASSFGLAGAATQYF CSVGLPRTEQYGYTF CASSEASQYNEKLF CASQFMKTAANEAFF CASSLGGRGNYTF CASSLQRWYNEQYF
# Simulation returns rich metadata, not just strings
print(f'n simulated: {len(sim)}')
print(f'log_probs: {sim.log_probs[:5]} ...')
print(f'tokens per walk: {sim.n_tokens[:5]} ...')
print(f'mean log P: {sim.log_probs.mean():.4f}')
print(f'mean #tokens: {sim.n_tokens.mean():.2f}')
n simulated: 10 log_probs: [-22.92335593 -17.83176041 -17.85461193 -14.61490197 -17.01575667] ... tokens per walk: [14 13 15 13 13] ... mean log P: -18.5834 mean #tokens: 13.20
Gene-conditioned sampling¶
Because we built the graph with V/J gene calls, we can also ask the
simulator to condition on a particular V, J, or both. The returned
SimulationResult includes the V and J gene that was used for each
walk.
# Condition on a specific V and J gene
result = graph.simulate(5, seed=42,
v_gene='TRBV5-1*01',
j_gene='TRBJ2-7*01')
for s, v, j in zip(result.sequences, result.v_genes, result.j_genes):
print(f'{s:<25} V={v} J={j}')
CASSVGRDTYEQYF V=TRBV5-1*01 J=TRBJ2-7*01 CASSPRQGLNEQYF V=TRBV5-1*01 J=TRBJ2-7*01 CASSPRTGLNEQYF V=TRBV5-1*01 J=TRBJ2-7*01 CASSLLGGGEQYF V=TRBV5-1*01 J=TRBJ2-7*01 CASSQGLAATYF V=TRBV5-1*01 J=TRBJ2-7*01
# Sample genes too (joint VJ distribution learned from the data)
result = graph.simulate(5, seed=42, sample_genes=True)
for s, v, j in zip(result.sequences, result.v_genes, result.j_genes):
print(f'{s:<25} V={v} J={j}')
CASRTGPQEYNEQF V=TRBV5-1*01 J=TRBJ2-1*01 CASSENGVSGTF V=TRBV6-5*01 J=TRBJ1-2*01 CASTSGTGQYYGYTF V=TRBV7-8*01 J=TRBJ1-2*01 CASSLGRNTEAFFF V=TRBV5-6*01 J=TRBJ1-1*01 CASSTSGGGGNSPLHF V=TRBV5-1*01 J=TRBJ1-6*01
7. Is the model proper?¶
A valid probability distribution sums to 1.0. For an LZGraph this
means: if you integrate the probability of every possible walk from
@ to any $, you should get 1.0 (up to floating-point noise).
pgen_diagnostics() checks this directly by running the forward DP
once and reporting the total absorbed mass at terminal states and any
leaked mass at non-terminal dead ends.
Two flavors of result you'll see in practice:
- Strictly proper (
total_leaked ≈ 0,is_proper = True): every walk consistent with the LZ76 grammar terminates at a$. You see this on small or carefully curated repertoires. - Slightly leaky (
total_leakeda few percent,is_proper = False): perfectly normal on real repertoires, including the 5,000-sequence example here. Some token paths that the LZ76 grammar could take don't appear at any terminal in the training data, so the forward DP loses mass at those dead ends. This is a property of the data, not a bug. As long astotal_absorbedis close to 1.0 (say > 0.95),pgen()and the downstream diversity/occupancy metrics behave well.
What you should never see: initial_prob_sum != 1.0, or absorbed
mass below ~0.5. Either of those means something is wrong with the
build or the graph was loaded from a corrupted file.
d = graph.pgen_diagnostics()
print(f'total_absorbed: {d["total_absorbed"]:.10f}')
print(f'total_leaked: {d["total_leaked"]:.3e}')
print(f'initial prob sum: {d["initial_prob_sum"]:.10f}')
print(f'is_proper: {d["is_proper"]}')
total_absorbed: 0.9743652344 total_leaked: 2.563e-02 initial prob sum: 1.0000000000 is_proper: False
8. Save and reload¶
Graphs are saved in a custom .lzg binary format with CRC-32C
integrity checks. Loading is roughly 100x faster than rebuilding from
sequences (no LZ76 re-parsing, no edge counting, no normalization).
The format is portable across platforms and preserves everything:
edge weights, gene data, sentinel structure, and learned distributions.
A save -> load -> save round-trip is byte-identical.
graph.save('my_graph.lzg')
loaded = LZGraph.load('my_graph.lzg')
print(f'Loaded: {loaded}')
print(f'Gene data preserved: {loaded.has_gene_data}')
print(f'Match: nodes={loaded.n_nodes == graph.n_nodes}, '
f'edges={loaded.n_edges == graph.n_edges}')
import os
os.remove('my_graph.lzg')
Loaded: LZGraph(variant='aap', nodes=1721, edges=9644) Gene data preserved: True Match: nodes=True, edges=True
What you learned¶
You now know how to:
- Tokenize a CDR3 sequence with LZ76 (
lz76_decompose) - Build an
LZGraphfrom raw sequences, optionally with V/J gene calls - Inspect the resulting graph's anatomy
- Score sequences with
pgen()(single and batch, log and linear) - Test membership with
seq in graph - Simulate new sequences and condition on V/J genes
- Confirm the graph defines a proper probability distribution
- Persist and reload graphs via
.lzg
Next: Lesson 2 (02_Analytics_and_Diversity.ipynb) will introduce
diversity metrics (Hill numbers, effective diversity), the analytical
PGEN distribution, occupancy-based predictions for unseen sequencing
depths, and repertoire comparison via Jensen-Shannon divergence.
