Glossary
Key terms used throughout LZGraphs documentation.
- CDR3 (Complementarity-Determining Region 3)
- The hypervariable region of a T-cell or B-cell receptor that primarily determines antigen specificity. LZGraphs models the statistical structure of CDR3 sequences.
- DAG (Directed Acyclic Graph)
- A graph with directed edges and no cycles. LZGraphs with positional encoding (AAP and NDP variants) are inherently DAGs because node positions increase monotonically along any path.
- Dynamic Range
- The ratio between the most and least probable sequences in the graph, typically expressed in orders of magnitude. Immune repertoire graphs commonly span 100–300 orders of magnitude.
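As a rough illustration of the dynamic range definition, the sketch below converts a spread of log-probabilities into orders of magnitude. It assumes the probabilities are available as natural logarithms; the function name and the sample values are hypothetical and are not part of the LZGraphs API.

```python
import math

def dynamic_range_orders(log_probs):
    """Orders of magnitude between the most and least probable sequences.

    Assumes `log_probs` holds natural-log probabilities (an assumption of
    this sketch, not a statement about how LZGraphs stores them).
    """
    # A ratio of probabilities is a difference of logs; dividing by ln(10)
    # converts natural-log units to base-10 orders of magnitude.
    return (max(log_probs) - min(log_probs)) / math.log(10)

# Hypothetical natural-log probabilities for three sequences.
example_log_probs = [-20.0, -45.5, -250.0]
print(round(dynamic_range_orders(example_log_probs), 1))
```

Working in log space matters here: a probability several hundred orders of magnitude below 1 sits at or beyond the limit of what a double-precision float can represent directly.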
- Edge Weight
- The conditional transition probability from one node to another. In LZGraphs 3.0+, edge weights are renormalized at each step to strictly enforce LZ76 dictionary constraints.
- Effective Diversity (Neff)
- The number of equiprobable sequences that would produce the same Shannon entropy as the graph's actual distribution. Computed as exp(H), where H is the entropy.
- Entropy (Shannon entropy)
- A measure of uncertainty in the generative distribution: H = −Σ P(x) log P(x). Higher entropy means a more uniform (less concentrated) distribution.
- Hill Number (D(α))
- A family of diversity measures parameterized by order α. D(0) = richness (total producible sequences), D(1) = effective diversity, D(2) = inverse Simpson index (collision diversity). Higher α gives more weight to common sequences.
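The entropy, effective diversity, and Hill number entries are closely related, and a small worked example may help. The sketch below evaluates them on an explicit, hypothetical four-sequence distribution. A real graph's distribution covers far too many sequences to enumerate this way, so this only illustrates the formulas and says nothing about how LZGraphs computes them; the function names are illustrative.

```python
import math

def shannon_entropy(probs):
    """H = -sum p * ln(p), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hill_number(probs, alpha):
    """Hill number D(alpha) of a normalized probability vector."""
    support = [p for p in probs if p > 0]
    if alpha == 1:
        # The alpha -> 1 limit is exp(H), i.e. the effective diversity.
        return math.exp(shannon_entropy(support))
    return sum(p ** alpha for p in support) ** (1.0 / (1.0 - alpha))

# Hypothetical distribution over four sequences.
probs = [0.5, 0.25, 0.125, 0.125]

print(hill_number(probs, 0))  # richness: number of sequences with p > 0
print(hill_number(probs, 1))  # effective diversity, exp(H)
print(hill_number(probs, 2))  # inverse Simpson index
```

For a uniform distribution all orders coincide with the number of sequences. For a skewed one like this, D(α) shrinks as α grows, because common sequences receive more weight.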
- LZ76 Decomposition (Lempel-Ziv 1976)
- An algorithm that parses a string from left to right into subpatterns, where each new subpattern is the shortest one not seen before, so it extends a previously seen subpattern by exactly one character. For example, CASSLE → [C, A, S, SL, E].
- LZPGEN (Probability of GENeration)
- The probability that the graph's generative model produces a specific sequence. Computed using an LZ-constrained random walk model starting from the root node (@).
- Node
- A vertex in the LZGraph representing a subpattern at a specific position. Node format depends on the graph variant:
  - AAP: {subpattern}_{position} (e.g., SL_6)
  - NDP: {subpattern}{reading_frame}_{position} (e.g., TG2_4)
  - Naive: just the subpattern (e.g., SL)
- Occupancy Model
- A statistical model used to predict repertoire richness and overlap at arbitrary sequencing depths. LZGraphs uses a Poisson occupancy model with Taylor expansion and Wynn acceleration for high precision.
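To make the idea of a Poisson occupancy model concrete, here is a minimal sketch of the textbook formula for expected richness. Under the Poisson approximation, a sequence with generation probability p is seen at least once in a sample of d sequences with probability 1 − exp(−d·p). The explicit probability list and the function name are assumptions for illustration. The Taylor expansion and Wynn acceleration mentioned above are not reproduced here.

```python
import math

def expected_richness(probs, depth):
    """Expected number of distinct sequences observed at a given depth.

    Sums 1 - exp(-depth * p) over an explicit list of sequence
    probabilities (a toy stand-in for the graph's distribution).
    """
    # -expm1(-x) computes 1 - exp(-x) accurately when x is tiny.
    return sum(-math.expm1(-depth * p) for p in probs)

# Hypothetical distribution over four sequences.
probs = [0.5, 0.25, 0.125, 0.125]

for depth in (1, 10, 100):
    print(depth, round(expected_richness(probs, depth), 3))
```

The predicted richness grows with depth and saturates at the number of producible sequences, which is the behaviour a richness curve is expected to show.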
- Path Count
- The total number of unique sequences that can be generated by the graph. Because repertoire space is massive, this is usually an astronomically large number.
- Posterior Graph
- A graph created by the posterior() method that blends a population-level prior with an individual's observed sequences.
- Reading Frame
- In the NDP variant, the codon position (0, 1, or 2) of each nucleotide subpattern. This encodes the relationship between the nucleotide sequence and the underlying amino acid sequence.
- Repertoire
- The full set of T-cell or B-cell receptor sequences in a sample. LZGraphs models the statistical structure of repertoires.
- Sharing Spectrum
- The distribution describing how many distinct sequences are shared by exactly k donors in a cohort. Predicted by predict_sharing().
- Simulation
- The process of generating new sequences by performing random walks on the graph. In LZGraphs, all simulated sequences are guaranteed to be valid LZ76 decompositions.
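To show what a valid LZ76 decomposition looks like in practice, the sketch below implements the parsing rule described under LZ76 Decomposition as a plain dictionary-growth loop. It is a standalone illustration, not the LZGraphs implementation: the function name is hypothetical, and the handling of a trailing piece that repeats an earlier subpattern is an assumption of this sketch.

```python
def lz_decompose(sequence):
    """Split a string into subpatterns, each new one being the shortest
    substring not yet in the dictionary of previously seen subpatterns."""
    seen = set()
    parts = []
    current = ""
    for ch in sequence:
        current += ch
        if current not in seen:
            # First time this subpattern appears: emit it and start over.
            seen.add(current)
            parts.append(current)
            current = ""
    if current:
        # Assumption: a leftover piece that repeats an earlier subpattern
        # is kept as the final element.
        parts.append(current)
    return parts

print(lz_decompose("CASSLE"))  # ['C', 'A', 'S', 'SL', 'E']
```

Concatenating the returned subpatterns always reproduces the input string, which makes for a quick sanity check when experimenting with the rule.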
- Terminal Node
- A node that contains the terminal sentinel $. These nodes represent valid endings for a sequence.
- V gene / J gene
- Variable and Joining gene segments. When provided during graph construction, these enable gene-aware analysis and constrained sequence generation.