Glossary

Key terms used throughout LZGraphs documentation.


CDR3 (Complementarity-Determining Region 3)
The hypervariable region of a T-cell or B-cell receptor that primarily determines antigen specificity. LZGraphs models the statistical structure of CDR3 sequences.
DAG (Directed Acyclic Graph)
A graph with directed edges and no cycles. LZGraphs with positional encoding (AAP and NDP variants) are inherently DAGs because node positions increase monotonically along any path.
Dynamic Range
The ratio between the most and least probable sequences in the graph, typically expressed in orders of magnitude. Immune repertoire graphs commonly span 100–300 orders of magnitude.
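As a rough illustration of the ratio described above, the sketch below computes a dynamic range from natural-log probabilities. The function name `dynamic_range_orders` and the sample values are hypothetical and are not part of the LZGraphs API.

```python
import math

def dynamic_range_orders(log_probs):
    """Dynamic range in orders of magnitude (base 10), given natural-log probabilities."""
    # log10(p_max / p_min) = (ln p_max - ln p_min) / ln 10
    return (max(log_probs) - min(log_probs)) / math.log(10)

# Hypothetical log-probabilities for three sequences
log_probs = [math.log(1e-8), math.log(1e-20), math.log(1e-12)]
print(dynamic_range_orders(log_probs))
```

Working in log space avoids underflow, which matters when probabilities differ by hundreds of orders of magnitude.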
Edge Weight
The conditional transition probability from one node to another. In LZGraphs 3.0+, edge weights are renormalized at each step to strictly enforce LZ76 dictionary constraints.
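The renormalization idea can be sketched as follows: drop the successors that are not allowed at the current step, then rescale the remaining weights so they sum to 1. This is only an assumption-laden illustration. The helper `renormalize`, the node labels, and the `allowed` set are hypothetical, and how LZGraphs actually decides which successors are valid is not shown here.

```python
def renormalize(weights, allowed):
    """Restrict transition weights to allowed successors and rescale them to sum to 1."""
    kept = {node: w for node, w in weights.items() if node in allowed}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no allowed successor has positive weight")
    return {node: w / total for node, w in kept.items()}

# Hypothetical raw transition weights out of one node
raw = {"S_3": 0.5, "SL_4": 0.3, "A_3": 0.2}
# Hypothetical set of successors permitted at this step
allowed = {"S_3", "A_3"}

step_weights = renormalize(raw, allowed)
print(step_weights)
```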
Effective Diversity (Neff)
The number of equiprobable sequences that would produce the same Shannon entropy as the graph's actual distribution. Computed as exp(H), where H is the entropy measured in nats (natural logarithm).
Entropy (Shannon entropy)
A measure of uncertainty in the generative distribution: H = −Σ P(x) log P(x). Higher entropy means a more uniform (less concentrated) distribution.
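A minimal sketch of the entropy formula and the corresponding effective diversity exp(H), assuming a small distribution that can be enumerated explicitly. The function names are illustrative and are not LZGraphs methods.

```python
import math

def shannon_entropy(probs):
    """H = -sum P(x) log P(x), in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def effective_diversity(probs):
    """Number of equiprobable outcomes with the same entropy: exp(H)."""
    return math.exp(shannon_entropy(probs))

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(effective_diversity(uniform))  # close to 4: all four outcomes count equally
print(effective_diversity(skewed))   # below 4: mass is concentrated on one outcome
```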
Hill Number (D(α))
A family of diversity measures parameterized by order α. D(0) = richness (total producible sequences), D(1) = effective diversity, D(2) = inverse Simpson index (collision diversity). Higher α gives more weight to common sequences.
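The standard Hill number formula is D(α) = (Σ P(x)^α)^(1/(1−α)), with D(1) defined as the limit exp(H). The sketch below evaluates it on a toy distribution; `hill_number` is a hypothetical helper, not the library's implementation.

```python
import math

def hill_number(probs, alpha):
    """Hill number of order alpha for an explicit probability list."""
    probs = [p for p in probs if p > 0]
    if alpha == 1:
        # Limit as alpha -> 1: exponential of the Shannon entropy
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** alpha for p in probs) ** (1.0 / (1.0 - alpha))

probs = [0.5, 0.25, 0.25]
d0 = hill_number(probs, 0)  # richness: number of outcomes with nonzero probability
d1 = hill_number(probs, 1)  # effective diversity
d2 = hill_number(probs, 2)  # inverse Simpson index
print(d0, d1, d2)
```

The values are non-increasing in α, which reflects the growing weight placed on common sequences.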
LZ76 Decomposition (Lempel-Ziv 1976)
An algorithm that parses a string into the shortest sequence of subpatterns where each new subpattern extends a previously seen one by exactly one character. For example, CASSLE → [C, A, S, SL, E].
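The parsing rule in this definition can be sketched with a dictionary of previously seen subpatterns. This follows the glossary's wording rather than the library's source. The function name `lz_decompose` is hypothetical, and the handling of a trailing subpattern that is already in the dictionary is an assumption.

```python
def lz_decompose(sequence):
    """Split a string into subpatterns, each the shortest one not yet seen."""
    seen = set()
    tokens = []
    current = ""
    for ch in sequence:
        current += ch
        if current not in seen:
            # New subpattern: a previously seen one extended by one character
            seen.add(current)
            tokens.append(current)
            current = ""
    if current:
        # Assumption: keep a leftover subpattern that was already seen
        tokens.append(current)
    return tokens

print(lz_decompose("CASSLE"))  # ['C', 'A', 'S', 'SL', 'E']
```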
LZPGEN (Probability of GENeration)
The probability that the graph's generative model produces a specific sequence. Computed using an LZ-constrained random walk model starting from the root node (@).
Node
A vertex in the LZGraph representing a subpattern at a specific position. Node format depends on the graph variant:
- AAP: {subpattern}_{position} (e.g., SL_6)
- NDP: {subpattern}{reading_frame}_{position} (e.g., TG2_4)
- Naive: just the subpattern (e.g., SL)
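The label formats above can be split back into their parts with a few lines of string handling. The helpers `parse_aap` and `parse_ndp` are hypothetical, assume labels exactly match the patterns shown, and do not handle sentinel nodes.

```python
import re

def parse_aap(label):
    """Split an AAP label such as 'SL_6' into (subpattern, position)."""
    subpattern, position = label.rsplit("_", 1)
    return subpattern, int(position)

# Assumed NDP shape: letters, one reading-frame digit (0-2), underscore, position
_NDP_LABEL = re.compile(r"^([A-Za-z]+)([012])_(\d+)$")

def parse_ndp(label):
    """Split an NDP label such as 'TG2_4' into (subpattern, reading_frame, position)."""
    match = _NDP_LABEL.match(label)
    if match is None:
        raise ValueError(f"not an NDP-style label: {label!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))

print(parse_aap("SL_6"))   # ('SL', 6)
print(parse_ndp("TG2_4"))  # ('TG', 2, 4)
```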
Occupancy Model
A statistical model used to predict repertoire richness and overlap at arbitrary sequencing depths. LZGraphs uses a Poisson occupancy model with Taylor expansion and Wynn acceleration for high precision.
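Under a Poisson occupancy model, a sequence with probability p is observed at least once in a sample of depth d with probability 1 − exp(−d·p), so the expected richness is the sum of that quantity over all sequences. The sketch below sums directly over a toy distribution. It does not reproduce the Taylor expansion or Wynn acceleration mentioned above, and `expected_richness` is a hypothetical name.

```python
import math

def expected_richness(probs, depth):
    """Expected number of distinct sequences seen at a given sampling depth."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x
    return sum(-math.expm1(-depth * p) for p in probs)

# Hypothetical generation probabilities for three sequences
probs = [0.5, 0.3, 0.2]
for depth in (1, 10, 100):
    print(depth, expected_richness(probs, depth))
```

The curve rises with depth and saturates at the number of sequences with nonzero probability.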
Path Count
The total number of unique sequences that can be generated by the graph. Because repertoire space is massive, this is usually an astronomically large number.
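For a DAG, root-to-terminal paths can be counted by memoized recursion without enumerating them. The sketch below uses a hypothetical toy adjacency dictionary and ignores any LZ76 dictionary constraints, so it illustrates the counting idea rather than the library's own computation.

```python
from functools import lru_cache

def count_paths(edges, source, terminals):
    """Count paths in a DAG that start at `source` and stop at a terminal node."""
    @lru_cache(maxsize=None)
    def count(node):
        # A walk may stop here if the node is terminal
        total = 1 if node in terminals else 0
        for successor in edges.get(node, ()):
            total += count(successor)
        return total
    return count(source)

# Hypothetical toy graph: "@" is the root, "end" is the only terminal node
edges = {
    "@": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": ["end"],
    "d": ["end"],
    "end": [],
}
n_paths = count_paths(edges, "@", {"end"})
print(n_paths)  # 3
```

Python integers have arbitrary precision, so the same approach does not overflow when counts become very large.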
Posterior Graph
A graph created by the posterior() method that blends a population-level prior with an individual's observed sequences.
Reading Frame
In the NDP variant, the codon position (0, 1, or 2) of each nucleotide subpattern. This encodes the relationship between the nucleotide sequence and the underlying amino acid sequence.
Repertoire
The full set of T-cell or B-cell receptor sequences in a sample. LZGraphs models the statistical structure of repertoires.
Sharing Spectrum
The distribution describing how many distinct sequences are shared by exactly k donors in a cohort. Predicted by predict_sharing().
Simulation
The process of generating new sequences by performing random walks on the graph. In LZGraphs, all simulated sequences are guaranteed to be valid LZ76 decompositions.
Terminal Node
A node that contains the terminal sentinel $. These nodes represent valid endings for a sequence.
V gene / J gene
Variable and Joining gene segments. When provided during graph construction, these enable gene-aware analysis and constrained sequence generation.