Sequence Analysis

In this tutorial you'll learn how to use an LZGraph as a probabilistic model — scoring sequences, generating new ones, and measuring how well the model fits your data. We'll build intuition for what the numbers mean and when to use each technique.

Prerequisites

This tutorial assumes you've completed Graph Construction and have a basic understanding of what an LZGraph represents.


Setup: building a working graph

We'll use a moderately sized graph throughout this tutorial so the results are realistic:

import csv
from LZGraphs import LZGraph

# Load 5,000 CDR3 sequences from the example dataset
seqs = []
with open("examples/ExampleData3.csv") as f:
    for row in csv.DictReader(f):
        seqs.append(row['cdr3_amino_acid'])

graph = LZGraph(seqs, variant='aap')
print(graph)
# LZGraph(variant='aap', nodes=1721, edges=9644)

Scoring sequences: what is LZPGEN?

Every LZGraph defines a probability distribution over sequences. For any string you give it, the graph can tell you: how likely is this sequence to be generated by the repertoire?

This probability is called LZPGEN (LZ-generation probability). It's computed by decomposing the sequence into LZ76 subpatterns, tracing the corresponding path through the graph, and multiplying the edge transition probabilities along the way.
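Concretely, LZPGEN is a chain of edge probabilities: if the walk induced by the LZ76 decomposition crosses edges with transition probabilities \(p_1, p_2, \ldots, p_k\), the generation probability is their product, and the log-probability is the sum of their logs. A minimal stdlib-only sketch with made-up edge probabilities (the values below are illustrative, not from any real graph):

```python
import math

# Hypothetical transition probabilities along one walk through the graph.
# In a real LZGraph these come from the edges visited by the LZ76 decomposition.
edge_probs = [0.31, 0.12, 0.08, 0.22, 0.05]

# LZPGEN is the product of the transition probabilities...
p = math.prod(edge_probs)

# ...which in log space is the sum of the individual logs.
log_p = sum(math.log(q) for q in edge_probs)

print(f"P(gen)     = {p:.3e}")
print(f"log P(gen) = {log_p:.2f}")
assert math.isclose(math.log(p), log_p)
```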

A concrete example

seq = "CASSLEPSGGTDTQYF"  # a sequence from the training data

log_p = graph.lzpgen(seq) # (1)
print(f"log P(gen) = {log_p:.2f}")
  1. Returns the natural log of the probability by default. The raw probability would be \(e^{\text{log\_p}}\), but log-space avoids floating-point underflow for long sequences. Pass log=False to get the raw probability.

Output:

log P(gen) = -17.42

What does -17.42 mean? This is the natural log of the probability. The actual probability is \(e^{-17.42} \approx 2.7 \times 10^{-8}\) — a very small number, but that's normal! With an astronomical number of possible sequences, no single sequence can have a high probability. What matters is relative comparison: which sequences are more or less likely than others?

Comparing sequences

common_seq = "CASSLEPSGGTDTQYF"  # common in training data
rare_seq   = "CASSYTGQENVLHF"    # less common
random_seq = "CASSXYZXYZXYZ"     # probably not in graph

scores = graph.lzpgen([common_seq, rare_seq, random_seq])

for seq, score in zip([common_seq, rare_seq, random_seq], scores):
    print(f"  {seq:25s}  log P = {score:.2f}")

Output:

  CASSLEPSGGTDTQYF           log P = -17.42
  CASSYTGQENVLHF             log P = -21.87
  CASSXYZXYZXYZ              log P = -690.78

Three important observations:

  1. Higher (less negative) = more typical. The first sequence is the most likely under this repertoire's model.
  2. The rare sequence scores lower but is still far from zero — it's unusual but plausible.
  3. The random sequence scores effectively \(-\infty\) (reported as -690.78, a finite floor standing in for negative infinity). The graph assigns this sequence zero probability — some transition in its LZ76 decomposition doesn't exist as an edge, because the graph has never seen XYZ patterns.

Why log-probabilities?

Raw probabilities of CDR3 sequences are incredibly small — often \(10^{-8}\) to \(10^{-35}\). Working directly with such tiny numbers causes floating-point underflow (the computer rounds them to exactly 0.0). Log-probabilities avoid this:

Representation            | Typical range                | When to use
Log-probability (default) | -5 to -80                    | Ranking, comparison, model evaluation — almost always
Raw probability           | \(10^{-3}\) to \(10^{-35}\)  | Only for very short sequences; underflows for long ones

# Default: log-probability
log_p = graph.lzpgen("CASSLGIRRT")

# Raw probability (use with caution)
p = graph.lzpgen("CASSLGIRRT", log=False)
print(f"log P = {log_p:.4f}  →  P = {p:.2e}")
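To see the underflow problem directly, here is a quick stdlib-only demonstration with made-up per-step probabilities (not from any real graph):

```python
import math

# 400 hypothetical per-step transition probabilities of 0.1 each.
step_probs = [0.1] * 400

# Multiplying raw probabilities underflows: the true value is 1e-400,
# smaller than the smallest representable float, so it collapses to 0.0.
raw = math.prod(step_probs)
print(raw)            # 0.0

# Summing logs stays comfortably in range.
log_p = sum(math.log(q) for q in step_probs)
print(f"{log_p:.1f}") # -921.0
```

All information about the sequence's likelihood is lost in the raw product, while the log-sum preserves it exactly.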

Batch scoring

When you need to score hundreds or thousands of sequences, pass them as a list. This is much faster than calling lzpgen in a loop because the C backend processes them in batch:

Batch scoring 5,000 sequences
import numpy as np

# Score all training sequences
all_scores = graph.lzpgen(seqs) # (1)

print(f"Scored {len(all_scores)} sequences")
print(f"Mean log P:   {np.mean(all_scores):.2f}")
print(f"Std log P:    {np.std(all_scores):.2f}")
print(f"Min log P:    {np.min(all_scores):.2f}")
print(f"Max log P:    {np.max(all_scores):.2f}")
  1. Pass a list[str] to score all sequences in a single call. Returns a np.ndarray of the same length. ~5,000 sequences/sec on typical hardware.
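Once you have a score array, a common follow-up is flagging outliers — sequences whose log-probability falls far below the repertoire mean. A hedged sketch using a small synthetic score list (substitute the array returned by lzpgen; the 2-standard-deviation cutoff is just one reasonable choice):

```python
import statistics

# Synthetic log-probability scores standing in for graph.lzpgen(seqs).
scores = [-17.4, -21.9, -19.2, -18.0, -55.3, -20.1, -16.8, -22.4]

mu = statistics.mean(scores)
sigma = statistics.stdev(scores)

# Flag sequences scoring more than 2 standard deviations below the mean.
outliers = [i for i, s in enumerate(scores) if s < mu - 2 * sigma]
print(f"mean = {mu:.2f}, std = {sigma:.2f}, outlier indices = {outliers}")
```

Here only index 4 (the -55.3 score) is flagged — a sequence far less typical of the repertoire than the rest.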

Checking membership

The in operator provides a quick way to check if a sequence is reachable in the graph (has non-zero probability):

print("CASSLGIRRT" in graph)       # True  — this sequence was in the training data
print("CASSXYZXYZ" not in graph)   # True  — the graph can't produce this

This is roughly equivalent to checking that graph.lzpgen(seq) is above the -690.78 floor, but reads more naturally in conditionals.


Simulating new sequences

The graph isn't just a scoring model — it's also a generative model. You can draw new sequences from it by performing random walks through the graph, following the transition probabilities at each node.

Basic simulation

result = graph.simulate(10, seed=42)

for i, seq in enumerate(result):
    print(f"  {i+1}. {seq}")

Output:

  1. CASSLEGARVGELFF
  2. CASSPTSGNTIYF
  3. CASRGAINEQFF
  4. CASSLDSSYNEQFF
  ...

These are novel sequences that the graph generated — they weren't in the training data, but they follow the same statistical patterns. They have the right amino acid composition, length distribution, and positional structure because they were produced by the same transition probabilities.

What comes back: SimulationResult

simulate() returns a SimulationResult object that carries the generated sequences along with metadata:

result = graph.simulate(1000, seed=42)

# It's iterable like a list of strings
print(f"Count: {len(result)}")
print(f"First: {result[0]}")

# But also carries metadata
print(f"Log-probs shape: {result.log_probs.shape}")
print(f"Mean log P: {result.log_probs.mean():.2f}")
print(f"Mean tokens: {result.n_tokens.mean():.1f}")

Each simulated sequence comes with its exact log-probability — not an approximation, but the precise product of the transition probabilities along the walk. Most generative models can't report this exactly.
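One thing exact log-probabilities buy you is an unbiased Monte Carlo estimate of the model's entropy: draw samples from the model itself and average -log P over the draws. A self-contained sketch with a toy four-outcome distribution standing in for the graph (for a real graph you would average result.log_probs):

```python
import math
import random

random.seed(42)

# A toy distribution standing in for the graph's distribution over sequences.
outcomes = ["A", "B", "C", "D"]
probs = [0.5, 0.25, 0.15, 0.10]
log_p = {o: math.log(p) for o, p in zip(outcomes, probs)}

# Sample from the model and average -log P of each draw.
samples = random.choices(outcomes, weights=probs, k=100_000)
entropy_estimate = -sum(log_p[s] for s in samples) / len(samples)

# Compare to the exact entropy of the toy distribution.
true_entropy = -sum(p * math.log(p) for p in probs)
print(f"estimate = {entropy_estimate:.3f} nats, exact = {true_entropy:.3f} nats")
```

With 100,000 draws the estimate lands within a few thousandths of a nat of the exact value — the same trick works for a graph, where the exact entropy is otherwise intractable to enumerate.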

Reproducibility

Pass a seed to get identical results every time:

a = graph.simulate(100, seed=42)
b = graph.simulate(100, seed=42)
assert a.sequences == b.sequences  # identical

Omit the seed for different results each run.


Gene-constrained simulation

When your graph was built with V/J gene annotations, you can generate sequences that are consistent with specific gene usage.

Sample genes from the observed distribution

This is the most common use case — generate sequences whose V/J genes follow the repertoire's marginal distribution:

# Requires the graph to have gene data
result = graph.simulate(100, seed=42, sample_genes=True)

print(f"First sequence: {result[0]}")
print(f"  V gene: {result.v_genes[0]}")
print(f"  J gene: {result.j_genes[0]}")

Force specific genes

Generate sequences using a particular V-J combination:

result = graph.simulate(50, v_gene="TRBV5-1*01", j_gene="TRBJ2-7*01", seed=42)

This only generates sequences whose edge-level gene annotations are consistent with the specified V and J genes. Useful for studying the diversity within a specific V-J pairing.

Gene data required

Gene-constrained simulation raises NoGeneDataError if the graph was built without v_genes and j_genes. Check graph.has_gene_data before calling.


Perplexity: how well does the model fit?

Perplexity measures how "surprised" the model is by a sequence. Think of it as the average number of equally-likely choices the model faces at each step. Lower perplexity means the model predicts the data well; higher perplexity means the data is surprising.

Single sequence perplexity

common = "CASSLEPSGGTDTQYF"
unusual = "CASRGGTVYEQYF"

print(f"Common seq perplexity:  {graph.sequence_perplexity(common):.2f}")
print(f"Unusual seq perplexity: {graph.sequence_perplexity(unusual):.2f}")

A perplexity of 5 means "at each step, the model was as uncertain as if choosing uniformly among 5 options." The common sequence should have lower perplexity because its transitions are well-represented in the graph.
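The relationship between log-probability and perplexity is direct: perplexity is the exponentiated average negative log-probability per step. A hedged sketch (the log-probability and step count below are made up; in LZGraphs the steps are the transitions of the LZ76 walk):

```python
import math

log_p = -17.42   # hypothetical log-probability of one sequence
n_steps = 10     # hypothetical number of transitions in its walk

# Perplexity = exp of the average negative log-probability per step.
perplexity = math.exp(-log_p / n_steps)
print(f"perplexity = {perplexity:.2f}")  # 5.71
```

So a sequence at log P = -17.42 over 10 steps looks, on average, like choosing among about 5.7 equally-likely options at each step.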

Repertoire-wide perplexity

Score an entire set of sequences at once:

held_out = seqs[4000:]  # last 1000 as a test set
ppl = graph.repertoire_perplexity(held_out)
print(f"Repertoire perplexity: {ppl:.2f}")

This is useful for model comparison — build graphs from different data or with different variants, score the same held-out set against each, and compare. Lower repertoire perplexity = better fit.

Entropy rate

The entropy rate gives the average information content per transition, in bits:

rate = graph.path_entropy_rate(held_out)
print(f"Entropy rate: {rate:.3f} bits/step")

If the entropy rate is 2.5 bits/step, that means each transition carries about 2.5 bits of uncertainty — equivalent to choosing among \(2^{2.5} \approx 5.7\) equally-likely options.
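Entropy rate (in bits) and perplexity are two views of the same quantity: perplexity is 2 raised to the entropy rate, and the entropy rate is the base-2 log of perplexity. A quick check of the conversion with the 2.5 bits/step figure from above:

```python
import math

entropy_rate = 2.5                  # bits per step (hypothetical)
perplexity = 2 ** entropy_rate

print(f"{perplexity:.2f}")          # 5.66 equally-likely options per step
assert math.isclose(math.log2(perplexity), entropy_rate)
```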


Exploring graph structure

Beyond scoring and simulation, you can inspect the graph's internal topology to understand repertoire structure.

Finding what follows a subpattern

# What subpatterns can follow the node "S_4" (just past the conserved "CAS" prefix)?
for label, weight, count in graph.successors("S_4")[:5]:
    print(f"  S_4 → {label:8s}  P={weight:.3f}  ({count}x observed)")

This reveals the branching structure of the repertoire at a specific position. High-weight successors are the dominant continuation patterns; low-weight ones are rare alternatives.

Edge weights as biological signal

Edge weights encode transition probabilities that reflect the biological process of V(D)J recombination. For example:

  • High-weight edges near the start (positions 1-4) reflect germline V-gene contributions — these are highly constrained.
  • Variable-weight edges in the middle (positions 5-12) reflect the junctional diversity region — this is where N-insertions and D-gene contributions create the most uncertainty.
  • High-weight edges near the end reflect J-gene contributions — again more constrained.

You can inspect this pattern:

# Find the most "uncertain" nodes (highest out-degree)
import numpy as np

nodes = graph.all_nodes
out_deg = graph.out_degrees

# Top 5 branch points
top_idx = np.argsort(out_deg)[-5:][::-1]
for i in top_idx:
    print(f"  {nodes[i]:10s}  out-degree = {out_deg[i]}")

Practical example: finding public sequences

A common task is identifying "public" sequences — CDR3s that appear across multiple individuals. You can use LZPGEN to find sequences that are highly probable under a population-level model:

# Build a graph from a large population sample
population_graph = LZGraph(population_sequences, variant='aap')

# Score an individual's repertoire against the population model
individual_seqs = [...]  # one person's CDR3 sequences
scores = population_graph.lzpgen(individual_seqs)

# Sequences with high LZPGEN are "public-like" —
# they're expected under the population model
ranked = sorted(zip(individual_seqs, scores), key=lambda x: -x[1])

print("Top 5 most public-like sequences:")
for seq, log_p in ranked[:5]:
    print(f"  {seq:25s}  log P = {log_p:.2f}")

Sequences that score high under the population model are statistically expected to recur across individuals — making them candidates for public clonotypes.


What we learned

  • LZPGEN is the probability of a sequence under the graph's model — higher (less negative) means more typical
  • Always work with log-probabilities to avoid underflow
  • Simulation generates novel but statistically faithful sequences
  • Gene-constrained simulation respects V/J gene annotations
  • Perplexity measures model fit — lower is better
  • The graph's edge weights encode positionally-varying transition probabilities that reflect biological V(D)J recombination patterns

Next steps