
NDPLZGraph

Nucleotide Double Positional LZGraph for analyzing nucleotide CDR3 sequences.

Quick Example

from LZGraphs import NDPLZGraph
import pandas as pd

# Build the graph; the CSV must contain a 'cdr3_rearrangement' column
data = pd.read_csv("repertoire.csv")
graph = NDPLZGraph(data, verbose=True)

# Encode a sequence into graph nodes, then compute its generation probability
sequence = "TGTGCCAGCAGT"
encoded = NDPLZGraph.encode_sequence(sequence)
pgen = graph.walk_probability(encoded)

Class Reference

NDPLZGraph

NDPLZGraph(data: DataFrame, verbose: bool = True, calculate_trainset_pgen: bool = False, smoothing_alpha: float = 0.0, initial_state_threshold: int = 5)

Bases: LZGraphBase

This class implements the "Nucleotide Double Positional" variant of the LZGraph, intended for nucleotide sequences. Each node label has the format {LZ_subpattern}{reading_frame_start}_{start_position_in_sequence}. For example, "ATG0_3" denotes the subpattern "ATG" in reading frame 0 at position 3 of the sequence.

The class inherits from LZGraphBase and thus uses the same random-walk logic, gene annotation logic (if present), and so on.
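To make the node format concrete, here is a minimal sketch that splits a label into its three parts. The helper `parse_node` and the regular expression are illustrative and not part of LZGraphs; the sketch assumes labels contain only the letters A, C, G, T followed by the reading frame, an underscore, and the position.

```python
import re

# Assumed label layout, taken from the description above:
# {LZ_subpattern}{reading_frame}_{position}, e.g. "ATG0_3".
NODE_PATTERN = re.compile(r"^([ACGT]+)(\d+)_(\d+)$")


def parse_node(label):
    """Split a node label into (subpattern, reading_frame, position).

    Hypothetical helper for illustration only.
    """
    match = NODE_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a valid node label: {label!r}")
    subpattern, frame, position = match.groups()
    return subpattern, int(frame), int(position)


print(parse_node("ATG0_3"))  # ('ATG', 0, 3)
```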

Constructor for NDPLZGraph.

PARAMETER DESCRIPTION
data

Must include at least a column 'cdr3_rearrangement' with the nucleotide sequences. If 'V' and 'J' columns exist, we embed gene information (self.genetic=True).

TYPE: DataFrame

verbose

Whether to log progress info.

TYPE: bool DEFAULT: True

calculate_trainset_pgen

If True, compute the walk_probability for each sequence in the dataset, storing results in self.train_pgen.

TYPE: bool DEFAULT: False

smoothing_alpha

Laplace smoothing parameter for edge weights. 0.0 means no smoothing (default).

TYPE: float DEFAULT: 0.0

initial_state_threshold

Minimum observation count for initial states. States observed fewer times than this are excluded. Default is 5.

TYPE: int DEFAULT: 5
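The last two parameters can be illustrated with a small, library-independent sketch. It assumes that `initial_state_threshold` keeps states whose count is at least the threshold, and that `smoothing_alpha` applies standard additive (Laplace) smoothing over the outgoing edges already observed for a node. The exact formula LZGraphs uses is not stated here, so the function names, the counts, and the normalisation below are all assumptions.

```python
def filter_initial_states(counts, threshold=5):
    """Keep only initial states observed at least `threshold` times."""
    return {state: n for state, n in counts.items() if n >= threshold}


def smoothed_probabilities(edge_counts, alpha=0.0):
    """Additive smoothing over the observed outgoing edges of one node.

    Assumed formula: (count + alpha) / (total + alpha * number_of_edges).
    With alpha = 0.0 this reduces to plain relative frequencies.
    """
    total = sum(edge_counts.values()) + alpha * len(edge_counts)
    return {edge: (n + alpha) / total for edge, n in edge_counts.items()}


# Hypothetical counts, for illustration only
initial_counts = {"T0_1": 120, "G0_1": 3}
print(filter_initial_states(initial_counts))  # {'T0_1': 120}

outgoing = {"G1_2": 3, "C1_2": 1}
print(smoothed_probabilities(outgoing, alpha=0.0))
print(smoothed_probabilities(outgoing, alpha=1.0))
```

Under these assumptions, a larger alpha pulls the edge probabilities toward a uniform distribution, while the default of 0.0 leaves the observed frequencies unchanged.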

encode_sequence staticmethod

encode_sequence(cdr3: str) -> List[str]

Encode a nucleotide sequence (cdr3) into the NDPLZGraph format: {lz_subpattern}{reading_frame}_{start_position}

If cdr3="ATGCG", the function might yield the subpatterns:
  • "ATG" (frame=0, pos=0)
  • "CG" (frame=0, pos=3)

resulting in the nodes "ATG0_3" and "CG0_5".
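The labelling step in the example above can be sketched as follows. This is not the library's implementation: `label_subpatterns` is a hypothetical helper, and the two conventions it uses (reading frame = start offset modulo 3, position = offset just past the subpattern) are inferred only from the "ATG0_3" / "CG0_5" example. The Lempel-Ziv decomposition itself is taken as given.

```python
def label_subpatterns(subpatterns):
    """Attach a reading frame and a position to each subpattern.

    Assumptions (inferred from the documented example, not from the source):
    - reading frame = start offset of the subpattern modulo 3
    - position      = offset just past the subpattern's last nucleotide
    """
    labels = []
    start = 0
    for pattern in subpatterns:
        end = start + len(pattern)
        labels.append(f"{pattern}{start % 3}_{end}")
        start = end
    return labels


print(label_subpatterns(["ATG", "CG"]))  # ['ATG0_3', 'CG0_5']
```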

clean_node staticmethod

clean_node(base: str) -> str

Given a sub-pattern that looks like "ATG0_3", extract only the nucleotides ("ATG").
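One use of this kind of cleaning is turning a walk (a list of node labels) back into a nucleotide string. The sketch below reimplements the stripping step with a regular expression instead of calling the library, so `strip_label` and `walk_to_sequence` are hypothetical names and the behaviour of the real clean_node may differ in edge cases.

```python
import re


def strip_label(node):
    """Return the leading nucleotides of a node label such as "ATG0_3"."""
    return re.match(r"[ACGT]*", node).group()


def walk_to_sequence(walk):
    """Concatenate the subpatterns of a walk into one nucleotide string."""
    return "".join(strip_label(node) for node in walk)


print(walk_to_sequence(["ATG0_3", "CG0_5"]))  # ATGCG
```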

Constructor

Parameters

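| Parameter | Type | Description |
|---|---|---|
| data | pd.DataFrame | DataFrame with a cdr3_rearrangement column |
| verbose | bool | Print progress messages (default: True) |
| calculate_trainset_pgen | bool | Compute walk_probability for every sequence in the dataset and store the results in self.train_pgen (default: False) |
| smoothing_alpha | float | Laplace smoothing parameter for edge weights; 0.0 means no smoothing (default: 0.0) |
| initial_state_threshold | int | Minimum observation count for initial states (default: 5) |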

Required Columns

  • cdr3_rearrangement - Nucleotide CDR3 sequences

Optional Columns

  • V - V gene/allele annotations
  • J - J gene/allele annotations
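Before building a graph it can help to check the input columns up front. The following sketch mirrors the rules stated above: cdr3_rearrangement is required, and gene information is used only when both V and J are present. `check_columns` is a hypothetical helper, not an LZGraphs function.

```python
REQUIRED_COLUMNS = {"cdr3_rearrangement"}
GENE_COLUMNS = {"V", "J"}


def check_columns(columns):
    """Validate column names and report whether gene annotation is available.

    Raises ValueError if a required column is missing; returns True when
    both 'V' and 'J' are present, False otherwise.
    """
    columns = set(columns)
    missing = REQUIRED_COLUMNS - columns
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    return GENE_COLUMNS <= columns


print(check_columns(["cdr3_rearrangement", "V", "J"]))  # True
print(check_columns(["cdr3_rearrangement"]))            # False
```

With a pandas DataFrame, the same check would be called as `check_columns(data.columns)`.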

Node Format

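NDPLZGraph uses double positional encoding: each node label carries the LZ subpattern together with its reading frame and its position in the sequence, as described in the class reference:

<pattern><reading_frame>_<position>

Example:

encoded = NDPLZGraph.encode_sequence("TGTGCC")
# One label per LZ subpattern, each of the form <pattern><reading_frame>_<position>
print(encoded)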

Key Methods

walk_probability

encoded = NDPLZGraph.encode_sequence("TGTGCCAGCAGT")
pgen = graph.walk_probability(encoded, use_log=True)
print(f"log P(gen) = {pgen:.2f}")
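As a rough mental model of what a log-probability of a walk means, the sketch below sums the log of an initial-state probability and the logs of the transition probabilities along the walk. It is a simplified illustration under assumed inputs: the dictionaries, their values, and the function `walk_log_probability` are hypothetical, and the library's walk_probability may treat unseen edges, terminal states, or smoothing differently.

```python
import math


def walk_log_probability(walk, initial_probs, edge_probs):
    """Sum log P(first node) and log P(next | current) along the walk."""
    log_p = math.log(initial_probs[walk[0]])
    for src, dst in zip(walk, walk[1:]):
        log_p += math.log(edge_probs[(src, dst)])
    return log_p


# Hypothetical probabilities, for illustration only
initial_probs = {"T0_1": 0.5}
edge_probs = {("T0_1", "G1_2"): 0.2}

log_p = walk_log_probability(["T0_1", "G1_2"], initial_probs, edge_probs)
print(round(math.exp(log_p), 3))  # 0.1
```

Working in log space avoids numerical underflow when many small transition probabilities are multiplied together.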

encode_sequence (static)

encoded = NDPLZGraph.encode_sequence("TGTGCC")
# Returns a list of node labels, one per LZ subpattern, each in the
# {lz_subpattern}{reading_frame}_{start_position} format from the class reference

clean_node (static)

pattern = NDPLZGraph.clean_node("ATG0_3")
# Returns: "ATG"

Comparison with AAPLZGraph

| Feature | NDPLZGraph | AAPLZGraph |
|---|---|---|
| Sequence type | Nucleotides | Amino acids |
| Position encoding | Double (reading frame, position) | Single (position) |
| Alphabet size | 4 | 20 |
| Graph size | Larger | Smaller |
| Resolution | Higher | Lower |

See Also