
LZGraphs

Python 3.9+ · License: MIT

LZGraphs is a high-performance Python library for analyzing immune receptor repertoires using Lempel-Ziv 76 compression graphs. Built on a C core, it transforms CDR3 sequences into probabilistic directed graphs that support exact probability computation, constrained sequence generation, and analytical diversity measurement — all without alignment or reference genotypes.

Example LZGraph
An LZGraph built from three CDR3 sequences. Shared prefixes form a single path; divergent suffixes branch. Edge weights encode transition probabilities.
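
To make the decomposition step concrete, here is a minimal sketch of an LZ76-style (exhaustive-history) parse in plain Python. The function name `lz76_phrases` is hypothetical and not part of the LZGraphs API, and the library's own tokenization and node labelling (for example under `variant='aap'`) may differ in detail.

```python
def lz76_phrases(seq: str) -> list[str]:
    """Split seq into LZ76-style phrases.

    Each phrase is the shortest substring starting at the current position
    that does not already occur in the text preceding its last character.
    Illustrative only; not the LZGraphs implementation.
    """
    phrases = []
    i, n = 0, len(seq)
    while i < n:
        length = 1
        # Grow the candidate while it can still be copied from earlier text.
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        # Slicing past the end is safe: the final phrase is just the remainder.
        phrases.append(seq[i:i + length])
        i += length
    return phrases


print(lz76_phrases("CASSLEPSGGTDTQYF"))
# ['C', 'A', 'S', 'SL', 'E', 'P', 'SG', 'GT', 'D', 'TQ', 'Y', 'F']
```

Sequences that share a prefix produce the same leading phrases, which is why they would share a path in a graph built over such phrases.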

Quick Start

from LZGraphs import LZGraph

graph = LZGraph(['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF'],
                variant='aap')

graph.lzpgen('CASSLEPSGGTDTQYF')          # log generation probability
graph.simulate(1000, seed=42)              # generate new sequences
graph.hill_number(2)                       # inverse Simpson diversity
graph.predicted_richness(100_000)          # richness at sequencing depth

From the command line:

lzg build repertoire.tsv -o rep.lzg       # build a graph from a repertoire file
lzg diversity rep.lzg                      # diversity report
lzg simulate rep.lzg -n 10000 > synth.txt # generate sequences



What LZGraphs does

Score sequences

Compute the exact generation probability of any CDR3 under the repertoire model with lzpgen().
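
The underlying idea can be sketched with a toy graph: if edge weights are transition probabilities, the log-probability of a walk is the sum of the log edge weights along it. The graph `toy_edges` and the helper `path_log_prob` below are hypothetical illustrations and not the LZGraphs API; the actual `lzpgen()` computation may involve additional handling (for example of the LZ constraints) that this sketch ignores.

```python
import math

# Hypothetical toy graph: node -> {successor: transition probability}.
toy_edges = {
    "START": {"C": 1.0},
    "C": {"A": 1.0},
    "A": {"S": 1.0},
    "S": {"SL": 0.5, "SD": 0.5},
}


def path_log_prob(path, edges):
    """Sum log transition probabilities along a walk.

    Returns -inf if any step uses an edge that is absent from the graph.
    """
    total = 0.0
    for src, dst in zip(path, path[1:]):
        p = edges.get(src, {}).get(dst, 0.0)
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


log_p = path_log_prob(["START", "C", "A", "S", "SL"], toy_edges)
print(log_p == math.log(0.5))  # True: only the last step is uncertain
```

Working in log space keeps the score numerically stable for long sequences, where the raw product of probabilities would underflow.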

Generate sequences

Simulate novel sequences via LZ-constrained random walks — with optional V/J gene constraints.

Measure diversity

Hill numbers, Shannon entropy, predicted richness, sample overlap, and sharing spectra — analytically from the graph.
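
For reference, the Hill number of order q over a probability distribution can be written out directly. The sketch below computes it from an explicit list of probabilities; the function name `hill_number_from_probs` is hypothetical, and LZGraphs is described as deriving these quantities analytically from the graph rather than from an enumerated list like this.

```python
import math


def hill_number_from_probs(probs, q):
    """Hill number of order q for a discrete distribution.

    q = 0 gives richness, q = 1 the exponential of Shannon entropy,
    q = 2 the inverse Simpson index.
    """
    probs = [p for p in probs if p > 0]
    if q == 1:
        # Limit case: exp(Shannon entropy in nats).
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))


freqs = [0.5, 0.25, 0.25]
for order in (0, 1, 2):
    print(order, hill_number_from_probs(freqs, order))
```

For a uniform distribution over n items, every order returns n, which is a quick sanity check on any implementation.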

Compare repertoires

Jensen-Shannon divergence, cross-scoring, and graph set operations (union, intersection, difference).
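
As a reminder of what the divergence measures, here is a small self-contained sketch of Jensen-Shannon divergence between two discrete distributions. The function `js_divergence` and the use of base-2 logarithms are assumptions made for illustration; which distributions LZGraphs compares (for example node or edge distributions) and which log base it reports are not specified here.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two dicts key -> probability."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with a[k] == 0 contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


rep_a = {"SL": 0.5, "SD": 0.5}
rep_b = {"SL": 0.9, "SD": 0.1}
print(js_divergence(rep_a, rep_b))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), which makes it convenient for comparing repertoires of different sizes.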

Extract ML features

Project repertoires into fixed-size feature vectors for classification, clustering, and regression.

Personalize models

Bayesian posterior updates to adapt a population graph to an individual patient.
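
One common way to perform this kind of update is a Dirichlet-multinomial scheme, where the population transition probabilities act as a prior and the patient's observed transition counts pull the estimate toward the individual. The sketch below illustrates that idea only; the function `posterior_transition_probs`, the `prior_strength` parameter, and the choice of a Dirichlet prior are all assumptions, not a description of how LZGraphs implements personalization.

```python
def posterior_transition_probs(prior_probs, patient_counts, prior_strength=10.0):
    """Posterior mean of outgoing transition probabilities for one node.

    prior_probs:    population probabilities, dict successor -> probability
    patient_counts: observed patient transitions, dict successor -> count
    prior_strength: pseudo-count weight given to the population prior (assumed)
    """
    keys = set(prior_probs) | set(patient_counts)
    total = prior_strength + sum(patient_counts.values())
    return {
        k: (prior_strength * prior_probs.get(k, 0.0) + patient_counts.get(k, 0)) / total
        for k in keys
    }


population = {"SL": 0.5, "SD": 0.5}
patient = {"SL": 10}
print(posterior_transition_probs(population, patient))
```

A larger `prior_strength` keeps the personalized graph closer to the population model, while a smaller value lets a modest amount of patient data dominate.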


Documentation


C Performance

Build graphs from 5,000 sequences in 80 ms. Simulate at ~5,000 seqs/sec. Save/load in < 1 ms.

LZ76 Constraints

Every simulated sequence is a valid LZ76 decomposition. No biologically impossible outputs.


If you use LZGraphs in your research, please cite our paper.