# Performance Benchmarks
All benchmarks were run on a single core (Intel/AMD x86_64 Linux, Python 3.12, LZGraphs 3.0.2) using a dataset of 5,000 CDR3 amino acid sequences (mean length 14.7 characters). The resulting graph has 1,721 nodes and 9,644 edges.
Times are wall-clock averages across multiple runs. Your results will vary depending on hardware, dataset size, and sequence diversity.
## Graph Construction
Building a graph from raw sequences includes LZ76 decomposition, CSR packing, edge-weight normalization, and a topological sort.
| Sequences | Nodes | Edges | Time |
|---|---|---|---|
| 100 | 342 | 654 | 0.5 ms |
| 500 | 753 | 2,269 | 6 ms |
| 1,000 | 997 | 3,585 | 10 ms |
| 2,000 | 1,262 | 5,529 | 24 ms |
| 5,000 | 1,721 | 9,644 | 82 ms |
Construction scales roughly linearly with the number of input sequences. A graph from 5,000 sequences builds in under 100 ms.
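The decomposition step can be illustrated with a minimal LZ76-style incremental parse: scan the sequence, extend the current subpattern until it is new, emit it, and start over. This is a sketch for intuition only, not the library's implementation (the actual node and edge construction in LZGraphs may differ).

```python
def lz76_decompose(seq: str) -> list[str]:
    """Split a sequence into LZ76 subpatterns (illustrative sketch only).

    Each emitted subpattern corresponds to a graph node; consecutive
    subpatterns would form the edges of the walk.
    """
    patterns = []
    seen = set()
    current = ""
    for ch in seq:
        current += ch
        if current not in seen:   # first occurrence: emit and reset
            seen.add(current)
            patterns.append(current)
            current = ""
    if current:                   # trailing remainder (already seen)
        patterns.append(current)
    return patterns

print(lz76_decompose("CASSLGQAYEQYF"))
```

Each sequence contributes one subpattern per emitted token, which is why construction time grows roughly with total input length.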
## LZPGEN Scoring
Computing the exact log-generation-probability of sequences against a graph (5,000-sequence graph).
| Sequences scored | Time | Throughput |
|---|---|---|
| 1 | 0.2 ms | ~5,000 /sec |
| 10 | 2 ms | ~5,000 /sec |
| 100 | 22 ms | ~4,600 /sec |
| 1,000 | 205 ms | ~4,900 /sec |
| 5,000 | 1.0 s | ~4,800 /sec |
Throughput holds roughly constant at ~4,600–5,000 sequences/second regardless of batch size. Each score requires tracing the full LZ76 walk through the graph with per-step dictionary constraint checking.
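Conceptually, the score is the sum of log transition probabilities along the sequence's LZ76 walk. A minimal sketch, assuming a plain dict of edge probabilities rather than the library's CSR representation:

```python
import math

def log_pgen(walk, edge_prob):
    """Sum log transition probabilities along an LZ76 walk.

    `walk` is the list of subpatterns for one sequence; `edge_prob` maps
    (source, target) -> transition probability. Hypothetical data
    structures for illustration; not the library's internals.
    """
    total = 0.0
    for src, dst in zip(walk, walk[1:]):
        p = edge_prob.get((src, dst))
        if not p:
            return float("-inf")  # walk leaves the graph: zero probability
        total += math.log(p)
    return total

probs = {("C", "A"): 0.5, ("A", "S"): 0.25}
print(log_pgen(["C", "A", "S"], probs))  # log(0.5) + log(0.25)
```

Because the work per sequence is one dictionary lookup and one addition per step, throughput depends on sequence length rather than batch size, matching the flat numbers in the table.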
## Simulation
Generating new sequences via LZ-constrained random walks (5,000-sequence graph).
| Sequences | Time | Throughput |
|---|---|---|
| 100 | 21 ms | ~4,800 /sec |
| 1,000 | 208 ms | ~4,800 /sec |
| 10,000 | 2.1 s | ~4,800 /sec |
| 100,000 | 20.6 s | ~4,800 /sec |
Simulation throughput is constant at ~4,800 sequences/second. Each walk maintains a per-walk LZ dictionary and includes backtracking for dead-end recovery.
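The walk-with-backtracking idea can be sketched as a weighted random walk that steps back out of dead ends. This is a simplified illustration under an assumed adjacency structure; the library additionally enforces per-walk LZ dictionary constraints, which are omitted here.

```python
import random

def simulate_walk(adj, start, end, rng, max_backtracks=100):
    """Weighted random walk from `start` to `end` with dead-end recovery.

    `adj[node]` is a list of (next_node, probability) pairs. Hypothetical
    structure for illustration only.
    """
    path = [start]
    backtracks = 0
    while path[-1] != end:
        choices = adj.get(path[-1], [])
        if not choices:                          # dead end: step back
            if len(path) == 1 or backtracks >= max_backtracks:
                return None
            path.pop()
            backtracks += 1
            continue
        nodes, weights = zip(*choices)
        path.append(rng.choices(nodes, weights=weights)[0])
    return path
```

Backtracking keeps the per-walk cost bounded, which is why simulation throughput stays flat as the batch grows.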
## Analytics
Diversity and distribution metrics on the 5,000-sequence graph.
| Operation | Time | Notes |
|---|---|---|
| `effective_diversity()` | 2.1 s | Monte Carlo with 10K walks |
| `hill_numbers([0,1,2,5])` | 2.1 s | Single MC run, all orders computed together |
| `pgen_moments()` | 0.2 ms | Forward DP; no simulation needed |
| `pgen_distribution()` | 2.1 s | Simulation-based Gaussian mixture fitting |
| `predicted_richness(1M)` | 106 s | Occupancy model with series acceleration |
| `pgen_diagnostics()` | ~200 ms | 1K validation walks |
| `diversity_profile()` | 2.1 s | Same MC as `effective_diversity()` |
**Why are Hill numbers fast but predicted richness slow?**
Hill numbers use Monte Carlo with 10K walks — each walk provides its exact probability, enabling unbiased importance-sampling estimation in about 2 seconds.
Predicted richness uses the Poisson occupancy formula \(F(d) = \sum_i (1 - (1-p_i)^d)\), which requires evaluating a series over the full PGEN distribution with Gauss-Hermite quadrature and Wynn epsilon acceleration. At very large depths (1M+), the series converges slowly. For moderate depths (up to ~10K), it's much faster.
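The occupancy formula itself is cheap when the generation probabilities are given explicitly; the cost in the library comes from integrating it over the full PGEN distribution. A direct sketch of the formula on a toy discrete distribution:

```python
def expected_richness(probs, depth):
    """Occupancy estimate F(d) = sum_i (1 - (1 - p_i)^d): the expected
    number of distinct sequences observed after `depth` i.i.d. draws.

    Naive direct sum for illustration; the library evaluates the same
    quantity over a continuous PGEN distribution with quadrature and
    series acceleration, which is where the runtime goes.
    """
    return sum(1.0 - (1.0 - p) ** depth for p in probs)

# Uniform toy distribution over four sequences.
probs = [0.25, 0.25, 0.25, 0.25]
print(expected_richness(probs, 1))    # one draw yields exactly one sequence
print(expected_richness(probs, 100))  # deep sampling approaches all four
```

With `depth` much larger than `1/p_i` every term saturates at 1, so the estimate approaches the true richness of the distribution.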
## Comparison
| Operation | Time |
|---|---|
| `jensen_shannon_divergence(g1, g2)` | 0.1 ms |
JSD is computed directly from the edge weight vectors — no simulation needed.
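The computation reduces to one pass over two aligned probability vectors. A minimal stdlib sketch (base-2 logs, so the result lies in [0, 1]); the library operates on its edge-weight vectors, which are assumed here to be plain lists:

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two aligned probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([1.0, 0.0], [0.0, 1.0]))  # disjoint distributions: 1.0
```

Since each comparison is linear in the number of edges with no simulation, pairwise all-vs-all comparisons across many graphs stay cheap.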
## Graph Operations
Set-algebraic operations on two graphs (~1,200 nodes each).
| Operation | Time |
|---|---|
| `union` | 102 ms |
| `intersection` | 91 ms |
| `difference` | 69 ms |
These operations rebuild a new CSR graph from the merged edge data, including re-normalization and topological sort.
## Serialization (IO)
Save/load the 5,000-sequence graph (285 KB on disk).
| Operation | Time |
|---|---|
| `save()` | 1.2 ms |
| `load()` | 0.7 ms |
The .lzg binary format is extremely fast — loading a graph is ~100x faster than rebuilding it from sequences.
## Feature Extraction
| Operation | Dimension | Time |
|---|---|---|
| `feature_aligned(query)` | 1,721 | 0.4 ms |
| `feature_stats()` | 15 | 8.4 s |
| `feature_mass_profile()` | 31 | 2.1 s |
**`feature_stats()` is slow because it computes Hill numbers internally.**
The 15-element statistics vector includes D(0), D(0.5), D(1), D(2), D(5), entropy, and dynamic range — each requiring Monte Carlo simulation. If you only need the feature_aligned() vector for ML, that's sub-millisecond.
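For intuition, the Hill numbers themselves are a simple formula once the probabilities are known. A direct sketch on an explicit discrete distribution (the library must instead estimate these by Monte Carlo, because the PGEN distribution is only implicit in the graph):

```python
import math

def hill_number(probs, q):
    """Hill diversity of order q: D(q) = (sum_i p_i^q)^(1 / (1 - q))."""
    if q == 1:
        # q -> 1 limit: exponential of Shannon entropy
        return math.exp(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(p ** q for p in probs) ** (1 / (1 - q))

# For a uniform distribution over 4 outcomes, every order gives D(q) = 4.
uniform = [0.25] * 4
print([round(hill_number(uniform, q), 6) for q in (0, 1, 2, 5)])
```

Higher orders down-weight rare sequences, which is why the full profile in `feature_stats()` needs the same Monte Carlo machinery as `effective_diversity()`.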
## Scaling characteristics
| Operation | Complexity | Scales with |
|---|---|---|
| Graph construction | \(O(n \cdot L)\) | n = sequences, L = mean length |
| LZPGEN scoring | \(O(L)\) per sequence | L = sequence length |
| Simulation | \(O(L)\) per sequence | L = generated length |
| Hill numbers (MC) | \(O(N \cdot L)\) | N = MC samples (10K default) |
| JSD | \(O(\|E\|)\) | E = edges in larger graph |
| Save/Load | \(O(\|V\| + \|E\|)\) | V = nodes, E = edges |
| `feature_aligned()` | \(O(\|V\|)\) | V = reference nodes |
## Tips for performance
- **Build once, reuse.** Graph construction is the most expensive step per sequence. Save to `.lzg` and load for repeated analysis.
- **Batch LZPGEN calls.** `graph.lzpgen(list_of_seqs)` is no faster per sequence than a loop, but it avoids Python call overhead.
- **Use `pgen_moments()` for quick summaries.** It runs in microseconds via forward DP, while `pgen_distribution()` takes seconds.
- **`feature_aligned()` is the fastest ML feature.** Sub-millisecond per sample, versus seconds for `feature_stats()`.
- **JSD is nearly free.** At 0.1 ms, you can compute thousands of pairwise comparisons in seconds.