
Comparison with IGoR, OLGA, and SONIA

LZGraphs takes a fundamentally different approach to repertoire analysis than the established V(D)J recombination tools. This page presents an empirical comparison — how well do the models agree on real data? — along with a conceptual overview of when to use which tool.


Two paradigms

The immune repertoire modeling ecosystem has two distinct approaches:

Mechanistic (IGoR → OLGA → SONIA): Explicitly model the V(D)J recombination process — gene segment choices, deletions, N-insertions, selection pressures. These tools require germline reference databases and (for IGoR) sequence alignment. They answer why a sequence was generated: which gene segments, what insertion/deletion profiles.

Information-theoretic (LZGraphs): Compress sequences into a directed graph using LZ76, capturing statistical structure without assuming any generative mechanism. Requires only a list of CDR3 strings — no germline databases, no alignment, no pre-fitted models. Answers how likely, how diverse, how similar.
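
The core LZ76 idea can be illustrated in a few lines: a CDR3 string is greedily split into the shortest substrings not seen before, and those subpatterns become the nodes of the graph. A minimal sketch of the parsing principle (an illustration only, not the LZGraphs implementation, which additionally annotates subpatterns with positions):

```python
def lz76_factorize(seq: str) -> list[str]:
    """Greedy LZ76-style parse: emit each shortest previously unseen substring."""
    factors, seen, i = [], set(), 0
    while i < len(seq):
        j = i + 1
        # extend the window until the substring is novel (or the string ends)
        while seq[i:j] in seen and j <= len(seq):
            j += 1
        factors.append(seq[i:j])
        seen.add(seq[i:j])
        i = j
    return factors

# Repeated motifs get folded into longer subpatterns:
print(lz76_factorize("CASSLGQAYEQYF"))
# ['C', 'A', 'S', 'SL', 'G', 'Q', 'AY', 'E', 'QY', 'F']
```

Each factor becomes a node, and consecutive factors define a directed edge; transition probabilities on those edges are what the rest of this page calls the LZGraphs model.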

Neither approach is "better" — they answer different questions and are often complementary.


Empirical comparison: generation probabilities

The most direct comparison is on generation probability (Pgen): given a CDR3 sequence, how likely is it under each model?

We compared LZGraphs against OLGA (the gold-standard V(D)J Pgen tool) using two approaches:

  1. OLGA-generated sequences — 300 sequences sampled from OLGA's default human TRB recombination model, scored by both OLGA and LZGraphs
  2. Real CDR3 sequences — 300 sequences from a real TCR repertoire, scored by both models

Pgen rank correlation

Pgen Correlation
Generation probability correlation between LZGraphs and OLGA. (A) 300 sequences generated by OLGA's V(D)J model (Spearman ρ = 0.81). (B) 300 real CDR3 sequences from a human TRB repertoire (ρ = 0.71). Each dot is one sequence; the dashed line is the identity. Despite completely different mathematical foundations, the two models largely agree on which sequences are common and which are rare.

Key finding: The two models show strong rank correlation (ρ = 0.81 on synthetic data, ρ = 0.71 on real data), meaning they largely agree on the ordering of sequences from most to least probable. This is notable because:

  • OLGA computes Pgen by marginalizing over all V(D)J recombination scenarios — a mechanistic model with dozens of learned parameters
  • LZGraphs computes Pgen from LZ76-constrained transition probabilities — a data-driven compression model with no biological assumptions
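
Rank agreement of this kind takes only a few lines of NumPy to check. The sketch below computes Spearman's ρ by correlating ranks (equivalent to `scipy.stats.spearmanr` when there are no ties); the two log-Pgen arrays are synthetic placeholders, not the data behind the figure:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    (No tie correction; fine for continuous log-Pgen scores.)"""
    rx = np.argsort(np.argsort(x))  # 0-based ranks
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Toy stand-ins for per-sequence log-Pgen from the two models
rng = np.random.default_rng(0)
olga_logp = rng.normal(-21.7, 3.0, size=300)
lz_logp = 0.8 * olga_logp + rng.normal(0.0, 2.0, size=300)

print(round(spearman_rho(olga_logp, lz_logp), 2))
```

Because ρ depends only on ordering, it is insensitive to the systematic shift between the two models' probability scales discussed below.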

The correlation is not perfect (and shouldn't be): OLGA's probabilities reflect the universal recombination process, while LZGraphs' probabilities reflect the specific repertoire's statistical patterns, including selection effects.

Pgen distribution shape

Pgen Distributions
Distribution of log-Pgen scores on real CDR3 sequences. OLGA (purple) and LZGraphs (cyan) both produce bell-shaped distributions. LZGraphs assigns slightly higher probabilities on average (mean −19.4 vs OLGA's −21.7) because it captures repertoire-specific patterns beyond the universal V(D)J model.

Both models produce a roughly Gaussian distribution of log-probabilities, spanning about 30 orders of magnitude. The rightward shift of LZGraphs is expected: it's trained directly on the data, so it assigns higher probability to patterns that are enriched in this particular repertoire (including selection effects), while OLGA reflects only the pre-selection recombination process.


Simulation quality

Beyond scoring, both models can generate new sequences. How realistic are the generated sequences?

Length distribution

Length Comparison
CDR3 length distributions: real sequences (gray) vs OLGA-simulated (purple) vs LZGraphs-simulated (cyan). LZGraphs closely matches the training data's length profile because it's built from that data. OLGA generates from a universal V(D)J model, producing a broader distribution with more short sequences (10-12 aa) that are underrepresented in this particular repertoire.

LZGraphs simulations match the training data's length distribution almost exactly — this is by design, since the graph encodes the observed length frequencies. OLGA's universal model produces a slightly different shape because it doesn't know about this specific repertoire's V-gene usage bias or selection pressures.
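
A length-profile comparison like the one in the figure can be quantified as the shared probability mass between two normalized length histograms (1.0 means identical profiles). A self-contained sketch with toy CDR3 strings standing in for the real and simulated sets:

```python
from collections import Counter

def length_profile(seqs):
    """Normalized CDR3 length distribution."""
    counts = Counter(len(s) for s in seqs)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def histogram_overlap(p, q):
    """Shared probability mass between two length profiles (1.0 = identical)."""
    lengths = set(p) | set(q)
    return sum(min(p.get(l, 0.0), q.get(l, 0.0)) for l in lengths)

# Toy stand-ins for real vs simulated repertoires
real = ["CASSLGQAYEQYF", "CASSPGTEAFF", "CASSLAPGATNEKLFF", "CASRDRGYTF"]
sim = ["CASSLGGNEQFF", "CASSQDRVYEQYF", "CASSPTSGGYNEQFF", "CASSLQETQYF"]

print(histogram_overlap(length_profile(real), length_profile(sim)))
```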

3-mer motif frequencies

3-mer Comparison
3-mer amino acid frequencies in simulated vs real sequences. (A) OLGA simulations vs real (cosine similarity 0.952). (B) LZGraphs simulations vs real (cosine similarity 0.985). Each dot is one of ~8,000 observed 3-mers; the dashed line is the identity. Both models capture the amino acid composition, with LZGraphs achieving higher fidelity because it's trained on the specific data.

Both models reproduce real 3-mer frequencies well, but LZGraphs achieves higher cosine similarity (0.985 vs 0.952). This advantage comes from learning data-specific motifs — for instance, if the repertoire is enriched for certain V genes that produce particular amino acid patterns, LZGraphs captures this while OLGA uses a universal model.
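
The cosine-similarity comparison above reduces each repertoire to a vector of 3-mer counts. A minimal sketch of that computation (toy sequences in place of the 5,000 used in the figure):

```python
import math
from collections import Counter

def kmer_counts(seqs, k=3):
    """Count overlapping k-mers across a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

real = kmer_counts(["CASSLGQAYEQYF", "CASSPGQGYEQYF"])
sim = kmer_counts(["CASSLGQGYEQYF"])
print(round(cosine_similarity(real, sim), 3))
```

With 20 amino acids there are 20³ = 8,000 possible 3-mers, which is why the figure reports ~8,000 dots.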


When to use which tool

| Question | Best tool | Why |
|----------|-----------|-----|
| What is the V(D)J recombination probability of a CDR3? | OLGA | Marginalizes over all recombination scenarios |
| How is my repertoire shaped by selection? | SONIA | Learns selection factors relative to a pre-selection baseline |
| What V/D/J genes, deletions, and insertions produced this sequence? | IGoR | Full mechanistic model of recombination |
| How diverse is this repertoire? | LZGraphs | Hill numbers, entropy, occupancy predictions |
| How similar are two repertoires? | LZGraphs | JSD, graph algebra, cross-scoring |
| How many unique sequences at depth X? | LZGraphs | Poisson occupancy model |
| What fixed-size features for a classifier? | LZGraphs | Reference-aligned vectors, mass profiles, stats |
| Bayesian personalization of a population model? | LZGraphs | Dirichlet posterior updates |
| Generate sequences matching this specific repertoire? | LZGraphs | Trained directly on the data |
| Generate sequences from the universal recombination process? | OLGA | V(D)J mechanistic model |
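
The "Poisson occupancy model" row refers to predicting how many distinct sequences appear at a given sampling depth. A generic sketch of that idea (illustrative, not the LZGraphs implementation): under independent Poisson sampling at depth N from a repertoire with sequence probabilities p_i, the expected number of unique sequences observed is the sum over i of 1 − exp(−N·p_i).

```python
import math

def expected_unique(probs, depth):
    """Expected number of distinct sequences seen in a sample of `depth`
    reads, assuming independent Poisson sampling of each sequence."""
    return sum(1.0 - math.exp(-depth * p) for p in probs)

# Toy repertoire: 1,000 equiprobable sequences
probs = [1 / 1000] * 1000
print(round(expected_unique(probs, 500)))    # shallow sampling misses many
print(round(expected_unique(probs, 10000)))  # deep sampling approaches full richness
```

For the toy repertoire, 500 reads recover roughly 393 of the 1,000 sequences (1,000 × (1 − e^(−0.5))), while 10,000 reads recover essentially all of them.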

Using them together

These tools are complementary. A typical combined workflow:

  1. IGoR — learn the recombination model from reference data
  2. OLGA — compute mechanistic Pgen for your sequences
  3. LZGraphs — build a graph for diversity, comparison, ML features, and occupancy predictions
  4. SONIA — infer selection if comparing naive vs. experienced repertoires

LZGraphs excels at repertoire-level questions (how diverse? how similar? how many at depth X?) that the V(D)J tools don't address. The V(D)J tools excel at mechanism-level questions (what genes? what selection?) that LZGraphs doesn't address.


Technical comparison

| | IGoR | OLGA | SONIA | LZGraphs |
|---|---|---|---|---|
| Approach | V(D)J Bayesian network | DP over V(D)J model | Max-entropy selection | LZ76 compression graph |
| Learns from data | Yes (EM) | No (uses IGoR) | Partial (selection only) | Yes (direct from CDR3s) |
| Requires germline DB | Yes | Yes | Yes | No |
| Requires alignment | Yes | No | No | No |
| Computes Pgen | Yes | Yes | Yes (post-selection) | Yes |
| Simulates sequences | Yes | Yes | Yes | Yes |
| Diversity metrics | No | No | No | Hill numbers, entropy, perplexity |
| Richness prediction | No | No | No | Poisson occupancy |
| Repertoire comparison | No | No | No | JSD, graph algebra |
| ML features | No | No | No | 3 strategies |
| Bayesian posteriors | No | No | No | Dirichlet updates |
| Selection inference | No | No | Yes | No |
| V(D)J decomposition | Yes | Partial | Partial | No |
| Language | C/C++ | Python | Python + TF | C + Python |
| Dependencies | autotools, POSIX | numba (opt.) | TensorFlow | numpy only |
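
The "Dirichlet updates" row refers to conjugate Bayesian updating of per-node transition probabilities: with a Dirichlet prior over a node's outgoing edges, observing personal transition counts yields a Dirichlet posterior whose mean is simply (prior pseudo-counts + observed counts), normalized. A minimal sketch of that update (illustrative of the math, not the LZGraphs API; edge labels are hypothetical):

```python
def dirichlet_posterior_mean(prior_pseudocounts, observed_counts):
    """Posterior mean edge probabilities for one graph node:
    Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + n)."""
    post = {edge: prior_pseudocounts.get(edge, 0.0) + observed_counts.get(edge, 0)
            for edge in set(prior_pseudocounts) | set(observed_counts)}
    total = sum(post.values())
    return {edge: v / total for edge, v in post.items()}

# Population prior: pseudo-counts from a reference graph; personal data: 8 observed steps
prior = {"A": 6.0, "B": 3.0, "C": 1.0}
personal = {"B": 7, "C": 1}
print(dirichlet_posterior_mean(prior, personal))
```

The posterior mean pulls the population-level edge probabilities toward the individual's observed usage, with the prior's total pseudo-count controlling how strongly the population model resists personal evidence.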

Methodology

The comparisons above were produced using:

  • OLGA v1.3.0 with the default human_T_beta model (pre-fitted on IGoR default parameters)
  • LZGraphs v3.0.2 trained on 5,000 CDR3 amino acid sequences from the same repertoire
  • Real data: 5,000 TRB CDR3 amino acid sequences with V/J gene annotations
  • Pgen scored on 300 sequences; 3-mer analysis on the full 5,000
  • Spearman rank correlation for Pgen comparison; cosine similarity for 3-mer comparison

References

  • IGoR: Marcou, Q., Mora, T., & Walczak, A. M. (2018). High-throughput immune repertoire analysis with IGoR. Nature Communications, 9(1), 561. doi:10.1038/s41467-018-02832-w
  • OLGA: Sethna, Z., Elhanati, Y., Callan, C. G., Walczak, A. M., & Mora, T. (2019). OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs. Bioinformatics, 35(17), 2974–2981. doi:10.1093/bioinformatics/btz035
  • SONIA: Sethna, Z., Isacchini, G., Dupic, T., Mora, T., Walczak, A. M., & Elhanati, Y. (2020). Population variability in the generation and selection of T-cell repertoires. PLOS Computational Biology, 16(12), e1008394. doi:10.1371/journal.pcbi.1008394
  • soNNia: Isacchini, G., Walczak, A. M., Mora, T., & Nourmohammad, A. (2021). Deep generative selection models of T and B cell receptor repertoires with soNNia. PNAS, 118(14), e2023141118. doi:10.1073/pnas.2023141118
  • LZGraphs: Konstantinovsky, T. & Yaari, G. (2023). A novel approach to T-cell receptor beta chain (TCRB) repertoire encoding using lossless string compression. Bioinformatics, 39(7), btad426. doi:10.1093/bioinformatics/btad426
