NDPLZGraph
Nucleotide Double Positional LZGraph for analyzing nucleotide CDR3 sequences.
Quick Example
from LZGraphs import NDPLZGraph
import pandas as pd
# Build graph
data = pd.read_csv("repertoire.csv")
graph = NDPLZGraph(data, verbose=True)
# Calculate probability
sequence = "TGTGCCAGCAGT"
encoded = NDPLZGraph.encode_sequence(sequence)
pgen = graph.walk_probability(encoded)
Class Reference
NDPLZGraph
NDPLZGraph(data: DataFrame, verbose: bool = True, calculate_trainset_pgen: bool = False, smoothing_alpha: float = 0.0, initial_state_threshold: int = 5)
Bases: LZGraphBase
This class implements the "Nucleotide Double Positional" version of the LZGraph, suitable for analyzing nucleotide sequences. Each node has the format: {LZ_subpattern}{reading_frame_start}_{start_position_in_sequence}, for example: "ATG0_3" might mean the subpattern "ATG", reading frame 0, starting at position 3 in the overall sequence.
The class inherits from LZGraphBase and thus uses the same random-walk logic, gene annotation logic (if present), and so on.
Constructor for NDPLZGraph.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| data | DataFrame | Must include at least a column 'cdr3_rearrangement' with the nucleotide sequences. If 'V' and 'J' columns exist, we embed gene information (self.genetic=True). |
| verbose | bool | Whether to log progress info. |
| calculate_trainset_pgen | bool | If True, compute the walk_probability for each sequence in the dataset, storing results in self.train_pgen. |
| smoothing_alpha | float | Laplace smoothing parameter for edge weights. 0.0 means no smoothing (default). |
| initial_state_threshold | int | Minimum observation count for initial states. States observed fewer times than this are excluded. Default is 5. |
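
The smoothing_alpha and initial_state_threshold parameters can be illustrated with a small, self-contained sketch. This is not the library's implementation: the function names are hypothetical, and it assumes that the smoothing mass is spread only over the successors already observed for a node, which the description above does not confirm.

```python
def smoothed_probabilities(edge_counts, alpha=0.0):
    """Additive (Laplace) smoothing over the outgoing edges of one node.

    edge_counts maps successor node -> observed count.
    Assumption: alpha is added once per observed successor.
    """
    total = sum(edge_counts.values()) + alpha * len(edge_counts)
    return {node: (count + alpha) / total for node, count in edge_counts.items()}


def filter_initial_states(initial_counts, threshold=5):
    """Keep only initial states observed at least `threshold` times."""
    return {state: n for state, n in initial_counts.items() if n >= threshold}


# alpha = 0.0 leaves the empirical frequencies unchanged.
unsmoothed = smoothed_probabilities({"A": 3, "B": 1}, alpha=0.0)

# alpha = 1.0 pulls the distribution toward uniform.
smoothed = smoothed_probabilities({"A": 3, "B": 1}, alpha=1.0)

# States seen fewer than 5 times are dropped.
kept = filter_initial_states({"T0_1": 10, "G0_1": 4, "C0_1": 5}, threshold=5)
```

With alpha = 0.0 the result is the plain relative frequency of each edge; larger values of alpha move rare edges away from zero at the cost of flattening the distribution.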
encode_sequence (static)
Encode a nucleotide sequence (cdr3) into the NDPLZGraph format: {lz_subpattern}{reading_frame}_{start_position}
If cdr3="ATGCG", the function might yield subpatterns:
- "ATG" (frame=0, pos=0)
- "CG" (frame=0, pos=3)

resulting in nodes "ATG0_3", "CG0_5", etc.
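
The labeling step in the example above can be sketched as follows. The helper name is hypothetical, and the convention is inferred from the example rather than taken from the library source: the frame is assumed to be the subpattern's start offset modulo 3, and the number after the underscore is assumed to be the cumulative end position. The sketch takes the subpatterns as given and does not perform the LZ decomposition itself.

```python
def label_subpatterns(subpatterns):
    """Attach a reading frame and a position to each subpattern.

    Assumptions (inferred from the documented example only):
      - frame = start offset of the subpattern modulo 3
      - trailing number = cumulative end position of the subpattern
    """
    nodes = []
    start = 0
    for sub in subpatterns:
        end = start + len(sub)
        nodes.append(f"{sub}{start % 3}_{end}")
        start = end
    return nodes


nodes = label_subpatterns(["ATG", "CG"])
print(nodes)  # ['ATG0_3', 'CG0_5']
```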
clean_node (static)
Given a sub-pattern that looks like "ATG0_3", extract only the nucleotides ("ATG").
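
As an illustration of what this extraction involves, here is a minimal sketch that parses a node label with a regular expression. It is not the library's clean_node: the function names and the pattern are assumptions, based only on the "ATG0_3" label shape described above (nucleotides, one frame digit, an underscore, then a position).

```python
import re

# Hypothetical pattern: nucleotides, a frame digit (0-2), "_", a position.
NODE_PATTERN = re.compile(r"^([ACGTN]+)([0-2])_(\d+)$")


def split_node(node):
    """Split a label such as 'ATG0_3' into (subpattern, frame, position)."""
    match = NODE_PATTERN.match(node)
    if match is None:
        raise ValueError(f"not a recognised node label: {node!r}")
    subpattern, frame, position = match.groups()
    return subpattern, int(frame), int(position)


def clean_node_sketch(node):
    """Return only the nucleotide part of a node label."""
    return split_node(node)[0]


print(clean_node_sketch("ATG0_3"))  # ATG
```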
Constructor
Parameters
| Parameter | Type | Description |
|---|---|---|
| data | pd.DataFrame | DataFrame with cdr3_rearrangement column |
| verbose | bool | Print progress messages (default: True) |
Required Columns
- cdr3_rearrangement - Nucleotide CDR3 sequences
Optional Columns
- V - V gene/allele annotations
- J - J gene/allele annotations
Node Format
NDPLZGraph uses double positional encoding:
Example:
encoded = NDPLZGraph.encode_sequence("TGTGCC")
# ['T_1_1', 'G_2_2', 'T_3_3', 'G_4_4', 'C_5_5', 'C_6_6']
Key Methods
walk_probability
encoded = NDPLZGraph.encode_sequence("TGTGCCAGCAGT")
pgen = graph.walk_probability(encoded, use_log=True)
print(f"log P(gen) = {pgen:.2f}")
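
To show why a log-scale result is useful, the sketch below scores a walk by summing log transition probabilities instead of multiplying raw probabilities, which avoids floating-point underflow on long sequences. It is a standalone illustration, not the library's walk_probability: the function name, the dictionary layout, and the choice to include an initial-state term are all assumptions.

```python
import math


def walk_log_probability(walk, initial_probs, edge_probs):
    """Log-probability of a walk under a first-order transition model.

    walk          : list of node labels
    initial_probs : dict node -> probability of starting there (assumed)
    edge_probs    : dict (source, target) -> transition probability (assumed)
    """
    if not walk:
        raise ValueError("walk must contain at least one node")
    log_p = math.log(initial_probs[walk[0]])
    for source, target in zip(walk, walk[1:]):
        log_p += math.log(edge_probs[(source, target)])
    return log_p


# Toy model with made-up probabilities, for illustration only.
initial = {"A": 0.5}
edges = {("A", "B"): 0.5, ("B", "C"): 0.25}
log_pgen = walk_log_probability(["A", "B", "C"], initial, edges)
print(f"log P(gen) = {log_pgen:.2f}")
```

Exponentiating the result recovers the plain probability when it is large enough to be representable.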
encode_sequence (static)
encoded = NDPLZGraph.encode_sequence("TGTGCC")
# Returns: ['T_1_1', 'G_2_2', 'T_3_3', 'G_4_4', 'C_5_5', 'C_6_6']
clean_node (static)
Comparison with AAPLZGraph
| Feature | NDPLZGraph | AAPLZGraph |
|---|---|---|
| Sequence type | Nucleotides | Amino acids |
| Position encoding | Double (start, end) | Single (end) |
| Alphabet size | 4 | 20 |
| Graph size | Larger | Smaller |
| Resolution | Higher | Lower |
See Also
- AAPLZGraph - Amino acid version
- NaiveLZGraph - Non-positional version
- Concepts: Graph Types