Lesson 1: Getting Started with `LZGraph`¶

Welcome to the LZGraph track. This is the first of a short series of notebooks that takes you from zero to a working understanding of the LZGraph class.

By the end of this lesson you will be able to:

Explain what an LZGraph is and what problem it solves.
Tokenize a CDR3 sequence with LZ76 and read the resulting tokens.
Build an LZGraph from a list of sequences (with or without V/J gene calls).
Score the probability of a sequence under the graph using pgen().
Generate new sequences by sampling from the graph.
Sanity-check a graph (does it define a proper probability distribution?).
Save and reload a graph in the native .lzg binary format.

This notebook does not cover diversity metrics, set algebra, posteriors, or the FlashBackGraph variant. Diversity and analytics are Lesson 2; set operations, posteriors and ML features are Lesson 3. For the FlashBackGraph class, see the parallel notebooks under examples/flashback/.

About the data. The cells below load data/ExampleData3.csv, which ships with the package and contains 5,000 amino-acid CDR3 sequences with V/J gene calls. The same file is used across all three LZGraph notebooks.

In [1]:

Copied!

from LZGraphs import LZGraph, lz76_decompose
import numpy as np
import csv
from LZGraphs import LZGraph, lz76_decompose
import numpy as np
import csv

1. The LZ76 tokenizer¶

LZGraph is built on the Lempel-Ziv 1976 parsing of a string. LZ76 walks left-to-right and emits a new token every time it encounters a prefix that has not been seen before. Each new token is exactly one character longer than the longest previously-seen prefix that matches the current position. The result is a sequence of tokens that grows in length as repeated structure appears.

This is the same algorithm used by gzip (with refinements), but here we use it as a structural fingerprint of a CDR3 sequence: tokens carry both the literal characters and the position at which they first appeared in the parse.

In [2]:

Copied!





seq = 'CASSLEPSGGTDTQYF'
tokens = lz76_decompose(seq)

print(f'Sequence: {seq}')
print(f'Tokens:   {tokens}')
print(f'Count:    {len(tokens)}')

# Sanity check: concatenating the tokens reproduces the input
assert ''.join(tokens) == seq
seq = 'CASSLEPSGGTDTQYF'
tokens = lz76_decompose(seq)

print(f'Sequence: {seq}')
print(f'Tokens:   {tokens}')
print(f'Count:    {len(tokens)}')

# Sanity check: concatenating the tokens reproduces the input
assert ''.join(tokens) == seq

Sequence: CASSLEPSGGTDTQYF
Tokens:   ['C', 'A', 'S', 'SL', 'E', 'P', 'SG', 'G', 'T', 'D', 'TQ', 'Y', 'F']
Count:    13

Read the tokens above left-to-right and notice that each one extends the longest previously-seen prefix by exactly one character. Short, repetitive sequences produce few tokens; long, novel sequences produce many. The token count is closely related to the information content of the sequence.

2. From a sequence to a graph¶

Imagine you ran LZ76 on every CDR3 in a repertoire and recorded every token transition you saw. The result is a directed graph where:

Nodes are tokens (with a positional suffix in the 'aap' variant, e.g. 'C_2' means "C as the second token").
Edges record how often one token follows another.
Two special sentinel nodes wrap every walk: @ at the start and $ at the end.

A repertoire becomes a single graph, and the LZ76 grammar guarantees the graph is a directed acyclic graph (DAG) for the AAP variant. We'll build one on a tiny toy repertoire first so you can see the structure end-to-end.

In [3]:

Copied!





tiny = ['CASSLGIRRT', 'CASSLGYEQYF', 'CASSQETQYF']
tiny_graph = LZGraph(tiny, variant='aap')
print(tiny_graph)
print(f'Nodes: {tiny_graph.n_nodes}  Edges: {tiny_graph.n_edges}')
print(f'Initial states: {tiny_graph.n_initial}')
print(f'Terminal states: {tiny_graph.n_terminal}')
tiny = ['CASSLGIRRT', 'CASSLGYEQYF', 'CASSQETQYF']
tiny_graph = LZGraph(tiny, variant='aap')
print(tiny_graph)
print(f'Nodes: {tiny_graph.n_nodes}  Edges: {tiny_graph.n_edges}')
print(f'Initial states: {tiny_graph.n_initial}')
print(f'Terminal states: {tiny_graph.n_terminal}')

LZGraph(variant='aap', nodes=21, edges=21)
Nodes: 21  Edges: 21
Initial states: 1
Terminal states: 2

Here is what that tiny graph looks like rendered. The green @ node is where every walk begins; the red $ nodes are where walks end. The blue nodes are the LZ76 tokens (with their position suffixes).

The graph encodes the structure of the input repertoire: any of the three input sequences corresponds to a walk from @ to a $. The graph also encodes structure the inputs did not exhibit but is consistent with the LZ76 grammar (e.g. cross-paths between sequences), which is exactly what lets us generate and score new sequences.

3. Building a real graph¶

Time for a real repertoire. Load ExampleData3.csv (5,000 AA CDR3s with V/J gene calls) and build the graph with full gene annotation.

In [4]:

Copied!





sequences, v_genes, j_genes = [], [], []
with open('../data/ExampleData3.csv') as f:
    for row in csv.DictReader(f):
        sequences.append(row['cdr3_amino_acid'])
        v_genes.append(row['V'])
        j_genes.append(row['J'])

print(f'{len(sequences):,d} sequences loaded')
print(f'First three: {sequences[:3]}')
print(f'Genes attached: V={v_genes[0]}, J={j_genes[0]}')
sequences, v_genes, j_genes = [], [], []
with open('../data/ExampleData3.csv') as f:
    for row in csv.DictReader(f):
        sequences.append(row['cdr3_amino_acid'])
        v_genes.append(row['V'])
        j_genes.append(row['J'])

print(f'{len(sequences):,d} sequences loaded')
print(f'First three: {sequences[:3]}')
print(f'Genes attached: V={v_genes[0]}, J={j_genes[0]}')

5,000 sequences loaded
First three: ['CASSGLAGSRSYNEQFF', 'CASSPTGGVYEQYF', 'CASSQTGESNQPQHF']
Genes attached: V=TRBV2-1*01, J=TRBJ2-1*01

In [5]:

Copied!





graph = LZGraph(
    sequences,
    variant='aap',
    v_genes=v_genes,
    j_genes=j_genes,
)
print(graph)
graph = LZGraph(
    sequences,
    variant='aap',
    v_genes=v_genes,
    j_genes=j_genes,
)
print(graph)

LZGraph(variant='aap', nodes=1721, edges=9644)

4. Graph anatomy¶

Once a graph is built, it exposes a small set of properties for the basic structural inspection you'll do all the time. Use these to sanity-check size, confirm gene data is present, and get a feel for how "rich" the graph is (path count is a useful measure: it is the total number of distinct walks from @ to any $).

In [6]:

Copied!





print(f'n_nodes:       {graph.n_nodes:,d}')
print(f'n_edges:       {graph.n_edges:,d}')
print(f'variant:       {graph.variant}')
print(f'has_gene_data: {graph.has_gene_data}')
print(f'is_dag:        {graph.is_dag}')
print(f'path_count:    {graph.path_count:,.0f}')
print(f'n_nodes:       {graph.n_nodes:,d}')
print(f'n_edges:       {graph.n_edges:,d}')
print(f'variant:       {graph.variant}')
print(f'has_gene_data: {graph.has_gene_data}')
print(f'is_dag:        {graph.is_dag}')
print(f'path_count:    {graph.path_count:,.0f}')

n_nodes:       1,721
n_edges:       9,644
variant:       aap
has_gene_data: True
is_dag:        True
path_count:    855,650,120,261

In [7]:

Copied!

# A structural summary in dict form. Useful for logging or
# comparing two graphs side-by-side.
graph.summary()
# A structural summary in dict form. Useful for logging or
# comparing two graphs side-by-side.
graph.summary()

Out[7]:

{'n_nodes': 1721,
 'n_edges': 9644,
 'n_initial': 1,
 'n_terminal': 28,
 'max_out_degree': 109,
 'max_in_degree': 81,
 'n_isolates': 0,
 'is_dag': True}

5. Scoring a sequence with `pgen()`¶

The most important method on an LZGraph is pgen(). Given a sequence, it returns the probability that you would observe that sequence under the graph's distribution.

What does that mean concretely? Internally pgen():

Re-tokenizes the input sequence with LZ76.
Traces the unique walk through the graph that those tokens describe.
Multiplies the renormalized edge weights along that walk.
Returns the result (as a log probability by default; pass log=False for the raw probability).

Probabilities are very small numbers, so working in log space is the default and is what you want most of the time.

In [8]:

Copied!





# Score a single training sequence
seq = sequences[0]
lp = graph.pgen(seq)
p  = graph.pgen(seq, log=False)
print(f'{seq}')
print(f'  log P(gen) = {lp:.4f}')
print(f'  P(gen)     = {p:.3e}')
# Score a single training sequence
seq = sequences[0]
lp = graph.pgen(seq)
p  = graph.pgen(seq, log=False)
print(f'{seq}')
print(f'  log P(gen) = {lp:.4f}')
print(f'  P(gen)     = {p:.3e}')

CASSGLAGSRSYNEQFF
  log P(gen) = -22.3202
  P(gen)     = 2.025e-10

In [9]:

Copied!





# Batch scoring: pass a list, get a numpy array
batch = sequences[:8]
log_probs = graph.pgen(batch)
for s, lp in zip(batch, log_probs):
    print(f'{s:<25}  log P = {lp:.4f}')
# Batch scoring: pass a list, get a numpy array
batch = sequences[:8]
log_probs = graph.pgen(batch)
for s, lp in zip(batch, log_probs):
    print(f'{s:<25}  log P = {lp:.4f}')

CASSGLAGSRSYNEQFF          log P = -22.3202
CASSPTGGVYEQYF             log P = -14.8578
CASSQTGESNQPQHF            log P = -15.6642
CASSKTDISSPLHF             log P = -25.0422
CASSLAGHSGGAQRGNEQFF       log P = -22.3090
CASSPQDRPNYGYTF            log P = -19.0571
CASSSQDRVTQYF              log P = -17.5585
CASSSSGGATEQYF             log P = -15.9175

In [10]:

Copied!





# Membership: does this sequence have a valid walk in the graph?
print(f'In-vocabulary:  "CASSLGYEQYF"     in graph = {"CASSLGYEQYF" in graph}')
print(f'In-graph:       "{sequences[0]}" in graph = {sequences[0] in graph}')
print(f'Out-of-vocab:   "ZZZZZZZZZZ"      in graph = {"ZZZZZZZZZZ" in graph}')
# Membership: does this sequence have a valid walk in the graph?
print(f'In-vocabulary:  "CASSLGYEQYF"     in graph = {"CASSLGYEQYF" in graph}')
print(f'In-graph:       "{sequences[0]}" in graph = {sequences[0] in graph}')
print(f'Out-of-vocab:   "ZZZZZZZZZZ"      in graph = {"ZZZZZZZZZZ" in graph}')

In-vocabulary:  "CASSLGYEQYF"     in graph = True
In-graph:       "CASSGLAGSRSYNEQFF" in graph = True
Out-of-vocab:   "ZZZZZZZZZZ"      in graph = False

What if a sequence is impossible under the graph? The graph still returns a (very small) log probability, floored at LOG_EPS ≈ -690 (the natural log of 1e-300). This avoids -inf propagating through downstream calculations. The matching __contains__ check (seq in graph) returns True only for sequences that are strictly on a valid walk, so use that if you need a clean yes/no.

6. Sampling new sequences¶

Now the reverse direction: instead of asking "how likely is this sequence?" we ask "give me some sequences from the same distribution". simulate(n) runs n random walks from @ to a terminal $, using the graph's edge weights as transition probabilities.

In [11]:

Copied!

sim = graph.simulate(10, seed=42)
for seq in sim:
    print(seq)
sim = graph.simulate(10, seed=42)
for seq in sim:
    print(seq)

CAISLSGGGTEKGQYFF
CASSPPSGGADGYTQF
CASSPPPGTTRAGNEQYF
CASSLQGGSYNEAFF
CASSFGLAGAATQYF
CSVGLPRTEQYGYTF
CASSEASQYNEKLF
CASQFMKTAANEAFF
CASSLGGRGNYTF
CASSLQRWYNEQYF

In [12]:

Copied!





# Simulation returns rich metadata, not just strings
print(f'n simulated:      {len(sim)}')
print(f'log_probs:        {sim.log_probs[:5]} ...')
print(f'tokens per walk:  {sim.n_tokens[:5]} ...')
print(f'mean log P:       {sim.log_probs.mean():.4f}')
print(f'mean #tokens:     {sim.n_tokens.mean():.2f}')
# Simulation returns rich metadata, not just strings
print(f'n simulated:      {len(sim)}')
print(f'log_probs:        {sim.log_probs[:5]} ...')
print(f'tokens per walk:  {sim.n_tokens[:5]} ...')
print(f'mean log P:       {sim.log_probs.mean():.4f}')
print(f'mean #tokens:     {sim.n_tokens.mean():.2f}')

n simulated:      10
log_probs:        [-22.92335593 -17.83176041 -17.85461193 -14.61490197 -17.01575667] ...
tokens per walk:  [14 13 15 13 13] ...
mean log P:       -18.5834
mean #tokens:     13.20

Gene-conditioned sampling¶

Because we built the graph with V/J gene calls, we can also ask the simulator to condition on a particular V, J, or both. The returned SimulationResult includes the V and J gene that was used for each walk.

In [13]:

Copied!





# Condition on a specific V and J gene
result = graph.simulate(5, seed=42,
                        v_gene='TRBV5-1*01',
                        j_gene='TRBJ2-7*01')
for s, v, j in zip(result.sequences, result.v_genes, result.j_genes):
    print(f'{s:<25}  V={v}  J={j}')
# Condition on a specific V and J gene
result = graph.simulate(5, seed=42,
                        v_gene='TRBV5-1*01',
                        j_gene='TRBJ2-7*01')
for s, v, j in zip(result.sequences, result.v_genes, result.j_genes):
    print(f'{s:<25}  V={v}  J={j}')

CASSVGRDTYEQYF             V=TRBV5-1*01  J=TRBJ2-7*01
CASSPRQGLNEQYF             V=TRBV5-1*01  J=TRBJ2-7*01
CASSPRTGLNEQYF             V=TRBV5-1*01  J=TRBJ2-7*01
CASSLLGGGEQYF              V=TRBV5-1*01  J=TRBJ2-7*01
CASSQGLAATYF               V=TRBV5-1*01  J=TRBJ2-7*01

In [14]:

Copied!





# Sample genes too (joint VJ distribution learned from the data)
result = graph.simulate(5, seed=42, sample_genes=True)
for s, v, j in zip(result.sequences, result.v_genes, result.j_genes):
    print(f'{s:<25}  V={v}  J={j}')
# Sample genes too (joint VJ distribution learned from the data)
result = graph.simulate(5, seed=42, sample_genes=True)
for s, v, j in zip(result.sequences, result.v_genes, result.j_genes):
    print(f'{s:<25}  V={v}  J={j}')

CASRTGPQEYNEQF             V=TRBV5-1*01  J=TRBJ2-1*01
CASSENGVSGTF               V=TRBV6-5*01  J=TRBJ1-2*01
CASTSGTGQYYGYTF            V=TRBV7-8*01  J=TRBJ1-2*01
CASSLGRNTEAFFF             V=TRBV5-6*01  J=TRBJ1-1*01
CASSTSGGGGNSPLHF           V=TRBV5-1*01  J=TRBJ1-6*01

7. Is the model proper?¶

A valid probability distribution sums to 1.0. For an LZGraph this means: if you integrate the probability of every possible walk from @ to any $, you should get 1.0 (up to floating-point noise). pgen_diagnostics() checks this directly by running the forward DP once and reporting the total absorbed mass at terminal states and any leaked mass at non-terminal dead ends.

Two flavors of result you'll see in practice:

Strictly proper (total_leaked ≈ 0, is_proper = True): every walk consistent with the LZ76 grammar terminates at a $. You see this on small or carefully curated repertoires.
Slightly leaky (total_leaked a few percent, is_proper = False): perfectly normal on real repertoires, including the 5,000-sequence example here. Some token paths that the LZ76 grammar could take don't appear at any terminal in the training data, so the forward DP loses mass at those dead ends. This is a property of the data, not a bug. As long as total_absorbed is close to 1.0 (say > 0.95), pgen() and the downstream diversity/occupancy metrics behave well.

What you should never see: initial_prob_sum != 1.0, or absorbed mass below ~0.5. Either of those means something is wrong with the build or the graph was loaded from a corrupted file.

In [15]:

Copied!





d = graph.pgen_diagnostics()
print(f'total_absorbed:    {d["total_absorbed"]:.10f}')
print(f'total_leaked:      {d["total_leaked"]:.3e}')
print(f'initial prob sum:  {d["initial_prob_sum"]:.10f}')
print(f'is_proper:         {d["is_proper"]}')
d = graph.pgen_diagnostics()
print(f'total_absorbed:    {d["total_absorbed"]:.10f}')
print(f'total_leaked:      {d["total_leaked"]:.3e}')
print(f'initial prob sum:  {d["initial_prob_sum"]:.10f}')
print(f'is_proper:         {d["is_proper"]}')

total_absorbed:    0.9743652344
total_leaked:      2.563e-02
initial prob sum:  1.0000000000
is_proper:         False

8. Save and reload¶

Graphs are saved in a custom .lzg binary format with CRC-32C integrity checks. Loading is roughly 100x faster than rebuilding from sequences (no LZ76 re-parsing, no edge counting, no normalization).

The format is portable across platforms and preserves everything: edge weights, gene data, sentinel structure, and learned distributions. A save -> load -> save round-trip is byte-identical.

In [16]:

Copied!





graph.save('my_graph.lzg')

loaded = LZGraph.load('my_graph.lzg')
print(f'Loaded: {loaded}')
print(f'Gene data preserved: {loaded.has_gene_data}')
print(f'Match: nodes={loaded.n_nodes == graph.n_nodes}, '
      f'edges={loaded.n_edges == graph.n_edges}')

import os
os.remove('my_graph.lzg')
graph.save('my_graph.lzg')

loaded = LZGraph.load('my_graph.lzg')
print(f'Loaded: {loaded}')
print(f'Gene data preserved: {loaded.has_gene_data}')
print(f'Match: nodes={loaded.n_nodes == graph.n_nodes}, '
      f'edges={loaded.n_edges == graph.n_edges}')

import os
os.remove('my_graph.lzg')

Loaded: LZGraph(variant='aap', nodes=1721, edges=9644)
Gene data preserved: True
Match: nodes=True, edges=True

What you learned¶

You now know how to:

Tokenize a CDR3 sequence with LZ76 (lz76_decompose)
Build an LZGraph from raw sequences, optionally with V/J gene calls
Inspect the resulting graph's anatomy
Score sequences with pgen() (single and batch, log and linear)
Test membership with seq in graph
Simulate new sequences and condition on V/J genes
Confirm the graph defines a proper probability distribution
Persist and reload graphs via .lzg

Next: Lesson 2 (02_Analytics_and_Diversity.ipynb) will introduce diversity metrics (Hill numbers, effective diversity), the analytical PGEN distribution, occupancy-based predictions for unseen sequencing depths, and repertoire comparison via Jensen-Shannon divergence.

Lesson 1: Getting Started with LZGraph¶