NDPLZGraph¶

Nucleotide Double Positional LZGraph for analyzing nucleotide CDR3 sequences.

Quick Example¶

from LZGraphs import NDPLZGraph
import pandas as pd

# Build graph
data = pd.read_csv("repertoire.csv")
graph = NDPLZGraph(data, verbose=True)

# Calculate probability
sequence = "TGTGCCAGCAGT"
encoded = NDPLZGraph.encode_sequence(sequence)
pgen = graph.walk_probability(encoded)

Class Reference¶

NDPLZGraph ¶

NDPLZGraph(data: DataFrame, verbose: bool = True, calculate_trainset_pgen: bool = False, smoothing_alpha: float = 0.0, initial_state_threshold: int = 5)

Bases: LZGraphBase

This class implements the "Nucleotide Double Positional" version of the LZGraph, suitable for analyzing nucleotide sequences. Each node has the format: {LZ_subpattern}{reading_frame_start}_{start_position_in_sequence}, for example: "ATG0_3" might mean the subpattern "ATG", reading frame 0, starting at position 3 in the overall sequence.

The class inherits from LZGraphBase and thus uses the same random-walk logic, gene annotation logic (if present), and so on.

Constructor for NDPLZGraph.

PARAMETER	TYPE	DEFAULT	DESCRIPTION
`data`	`DataFrame`	required	Must include at least a column 'cdr3_rearrangement' with the nucleotide sequences. If 'V' and 'J' columns exist, we embed gene information (self.genetic=True).
`verbose`	`bool`	`True`	Whether to log progress info.
`calculate_trainset_pgen`	`bool`	`False`	If True, compute the walk_probability for each sequence in the dataset, storing results in self.train_pgen.
`smoothing_alpha`	`float`	`0.0`	Laplace smoothing parameter for edge weights. 0.0 means no smoothing.
`initial_state_threshold`	`int`	`5`	Minimum observation count for initial states. States observed fewer times than this are excluded.
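
To make `smoothing_alpha` and `initial_state_threshold` more concrete, here is a minimal sketch of additive (Laplace) smoothing and threshold filtering on toy counts. The function names and the choice to smooth only over observed successors are assumptions made for illustration; the library's internal implementation may differ.

```python
def smoothed_transition_probs(edge_counts, alpha=0.0):
    """Turn raw outgoing-edge counts for one node into probabilities.

    Hypothetical helper: with alpha == 0.0 this is plain normalization;
    with alpha > 0 each listed successor gets alpha pseudo-counts.
    """
    k = len(edge_counts)
    total = sum(edge_counts.values()) + alpha * k
    return {node: (count + alpha) / total for node, count in edge_counts.items()}


def filter_initial_states(initial_counts, threshold=5):
    """Hypothetical helper: drop initial states seen fewer than `threshold` times."""
    return {state: n for state, n in initial_counts.items() if n >= threshold}


counts = {"G1_2": 3, "A1_2": 1}           # toy outgoing-edge counts
print(smoothed_transition_probs(counts))              # no smoothing
print(smoothed_transition_probs(counts, alpha=1.0))   # rare edge gets more mass
print(filter_initial_states({"T0_1": 40, "A0_1": 2}))
```

With `alpha=1.0` the rarely observed edge rises from 0.25 to about 0.33, which is the usual effect of additive smoothing: probability mass shifts toward low-count transitions.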

encode_sequence `staticmethod` ¶

encode_sequence(cdr3: str) -> List[str]

Encode a nucleotide sequence (cdr3) into the NDPLZGraph format: {lz_subpattern}{reading_frame}_{start_position}

If cdr3="ATGCG", the function might yield subpatterns:

"ATG" (frame=0, pos=0)
"CG" (frame=0, pos=3) resulting in nodes "ATG0_3", "CG0_5", etc.

clean_node `staticmethod` ¶

clean_node(base: str) -> str

Given a sub-pattern that looks like "ATG0_3", extract only the nucleotides ("ATG").
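
As a rough illustration of what stripping the positional suffix involves, the sketch below splits a node label of the form described above with a regular expression. `split_node` and `clean_node_sketch` are hypothetical helpers written for this example; they are not part of the library, whose own `clean_node` may be implemented differently.

```python
import re

# <nucleotides><reading frame 0-2>_<position>, e.g. "ATG0_3"
_NODE_RE = re.compile(r"^([ACGT]+)([0-2])_(\d+)$")


def split_node(node):
    """Return (pattern, reading_frame, position) for a label like 'ATG0_3'."""
    match = _NODE_RE.match(node)
    if match is None:
        raise ValueError(f"not a valid node label: {node!r}")
    pattern, frame, position = match.groups()
    return pattern, int(frame), int(position)


def clean_node_sketch(node):
    """Keep only the nucleotide pattern of a node label."""
    return split_node(node)[0]


print(split_node("ATG0_3"))         # ('ATG', 0, 3)
print(clean_node_sketch("CG0_5"))   # CG
```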

Constructor¶

Parameters¶

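Parameter	Type	Description
`data`	`pd.DataFrame`	DataFrame with a `cdr3_rearrangement` column
`verbose`	`bool`	Log progress messages (default: `True`)
`calculate_trainset_pgen`	`bool`	Compute `walk_probability` for every training sequence and store the results in `self.train_pgen` (default: `False`)
`smoothing_alpha`	`float`	Laplace smoothing parameter for edge weights; `0.0` disables smoothing (default: `0.0`)
`initial_state_threshold`	`int`	Minimum observation count for an initial state to be kept (default: `5`)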

Required Columns¶

cdr3_rearrangement - Nucleotide CDR3 sequences

Optional Columns¶

V - V gene/allele annotations
J - J gene/allele annotations
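
Before building a graph it can help to check the input columns up front. The following is a small sketch of such a check, assuming (per the class reference) that gene information is used only when both `V` and `J` are present. `check_columns` is a hypothetical helper, not a library function.

```python
REQUIRED_COLUMNS = {"cdr3_rearrangement"}
GENE_COLUMNS = {"V", "J"}


def check_columns(columns):
    """Validate column names and report whether gene annotations are usable.

    Raises ValueError if a required column is missing; returns True only
    when both 'V' and 'J' are present.
    """
    cols = set(columns)
    missing = REQUIRED_COLUMNS - cols
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    return GENE_COLUMNS <= cols


print(check_columns(["cdr3_rearrangement", "V", "J"]))  # True
print(check_columns(["cdr3_rearrangement", "V"]))       # False
```

With a pandas DataFrame, the same check would be called as `check_columns(data.columns)`.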

Node Format¶

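NDPLZGraph uses double positional encoding: each node label carries the LZ subpattern, its reading frame, and its position in the sequence, as described in the class reference above:

<pattern><reading_frame>_<position>

Example:

encoded = NDPLZGraph.encode_sequence("TGTGCC")
# A list of node labels, one per LZ subpattern, each formatted as shown above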

Key Methods¶

walk_probability¶

encoded = NDPLZGraph.encode_sequence("TGTGCCAGCAGT")
pgen = graph.walk_probability(encoded, use_log=True)
print(f"log P(gen) = {pgen:.2f}")
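
Conceptually, the probability of a walk is a product of per-step probabilities, which is why the log form is convenient for long sequences. The sketch below shows that computation on a toy graph. The dictionaries, the node labels, and the choice to include an initial-state probability are assumptions for illustration and do not reflect the library's internal data structures.

```python
import math


def walk_log_probability(walk, initial_probs, transition_probs):
    """Sum log P(first node) and log P(next | current) along the walk.

    Returns -inf if any required state or transition is missing.
    """
    if not walk:
        raise ValueError("walk must contain at least one node")
    p = initial_probs.get(walk[0], 0.0)
    if p == 0.0:
        return float("-inf")
    log_p = math.log(p)
    for current, nxt in zip(walk, walk[1:]):
        p = transition_probs.get(current, {}).get(nxt, 0.0)
        if p == 0.0:
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Toy probabilities, made up for this example
initial = {"T0_1": 0.5}
transitions = {"T0_1": {"G1_2": 0.5}, "G1_2": {"TG2_4": 0.25}}

log_p = walk_log_probability(["T0_1", "G1_2", "TG2_4"], initial, transitions)
print(f"log P = {log_p:.4f}, P = {math.exp(log_p):.4f}")
```

Summing logs avoids the floating-point underflow that a direct product of many small probabilities would eventually hit.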

encode_sequence (static)¶

encoded = NDPLZGraph.encode_sequence("TGTGCC")
# Returns a list of node labels, each of the form <pattern><reading_frame>_<position>

clean_node (static)¶

pattern = NDPLZGraph.clean_node("ATG0_3")
# Returns: "ATG"

Comparison with AAPLZGraph¶

Feature	NDPLZGraph	AAPLZGraph
Sequence type	Nucleotides	Amino acids
Position encoding	Double (reading frame, position)	Single (position)
Alphabet size	4	20
Graph size	Larger	Smaller
Resolution	Higher	Lower