AAPLZGraph

Amino Acid Positional LZGraph for analyzing amino acid CDR3 sequences.

Quick Example

from LZGraphs import AAPLZGraph
import pandas as pd

# Build graph (the CSV must contain a "cdr3_amino_acid" column)
data = pd.read_csv("repertoire.csv")
graph = AAPLZGraph(data, verbose=True)

# Calculate probability
sequence = "CASSLEPSGGTDTQYF"
encoded = AAPLZGraph.encode_sequence(sequence)
pgen = graph.walk_probability(encoded)

Class Reference

AAPLZGraph

AAPLZGraph(data: DataFrame, verbose: bool = True, calculate_trainset_pgen: bool = False, validate_sequences: bool = True, smoothing_alpha: float = 0.0, initial_state_threshold: int = 5)

Bases: LZGraphBase

Implements the "Amino Acid Positional" version of the LZGraph for analyzing amino-acid sequences, especially for immunological data.

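Each node is labeled as

{LZ_subpattern}_{start_position_in_sequence}

Despite the placeholder name, the encode_sequence example on this page labels the sub-pattern "SL" in "CASSLE" as 'SL_5', so the number is the 1-based position at which the sub-pattern ends.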

Create an amino-acid-positional LZGraph from a DataFrame.

The DataFrame must contain at least a column "cdr3_amino_acid". Optionally, columns "V" and "J" may also be provided to embed gene information. If these columns are present, self.genetic is set to True.

PARAMETER DESCRIPTION
data

Input data for constructing the graph. Must contain a "cdr3_amino_acid" column; optionally "V" and "J" columns.

TYPE: DataFrame

verbose

Whether to log progress information.

TYPE: bool DEFAULT: True

calculate_trainset_pgen

If True, compute PGEN for each sequence in data.

TYPE: bool DEFAULT: False

validate_sequences

If True, validate that sequences contain only standard amino acids. Set to False to skip validation for performance.

TYPE: bool DEFAULT: True

smoothing_alpha

Laplace smoothing parameter for edge weights. 0.0 means no smoothing (default).

TYPE: float DEFAULT: 0.0

initial_state_threshold

Minimum observation count for initial states. States observed fewer times than this are excluded. Default is 5.

TYPE: int DEFAULT: 5
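
To make the last two parameters more concrete, the sketch below applies additive (Laplace) smoothing to the raw counts of one node's outgoing edges and filters initial states by observation count. The helper names `smooth_edge_weights` and `filter_initial_states` are hypothetical, and the smoothing formula (applied over observed edges only) is an assumption for illustration; the library may compute these quantities differently.

```python
from collections import Counter


def smooth_edge_weights(edge_counts, alpha=0.0):
    """Normalize one node's outgoing edge counts with additive smoothing.

    Hypothetical helper: (count + alpha) / (total + alpha * n_edges).
    With alpha = 0.0 this reduces to plain relative frequencies.
    """
    total = sum(edge_counts.values())
    denom = total + alpha * len(edge_counts)
    return {target: (count + alpha) / denom for target, count in edge_counts.items()}


def filter_initial_states(initial_counts, threshold=5):
    """Keep only initial states observed at least `threshold` times."""
    return {state: n for state, n in initial_counts.items() if n >= threshold}


# Toy counts for the edges leaving a single node (made-up numbers).
counts = {"S_3": 8, "T_3": 2}
print(smooth_edge_weights(counts))             # plain relative frequencies
print(smooth_edge_weights(counts, alpha=1.0))  # the rarer edge gains some mass

# Toy initial-state counts (made-up numbers).
initial = Counter({"C_1": 120, "A_1": 3})
print(filter_initial_states(initial, threshold=5))
```

With `alpha=1.0` the weights move from 0.8/0.2 toward 0.75/0.25, and the initial state seen only 3 times is dropped.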

RAISES DESCRIPTION
TypeError

If data is not a pandas DataFrame.

ValueError

If required columns are missing or sequences are invalid.
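
The checks behind the ValueError cases might look roughly like the following. This is a sketch only: `check_columns` and `check_sequence` are hypothetical names, and the 20-letter alphabet is an assumption about what "standard amino acids" means here. The DataFrame type check is omitted so that the example needs no third-party imports.

```python
# Assumed alphabet: the 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def check_columns(columns):
    """Raise ValueError if the required column is absent.

    Returns True when both optional gene columns are present.
    """
    if "cdr3_amino_acid" not in columns:
        raise ValueError("missing required column 'cdr3_amino_acid'")
    return "V" in columns and "J" in columns


def check_sequence(sequence):
    """Raise ValueError for empty sequences or characters outside the alphabet."""
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if not sequence or invalid:
        raise ValueError(f"invalid sequence {sequence!r}: {invalid}")


has_genes = check_columns(["cdr3_amino_acid", "V", "J"])
print(has_genes)

check_sequence("CASSLEPSGGTDTQYF")  # passes silently
try:
    check_sequence("CASSXYZRARESEQ")  # contains X and Z
except ValueError as err:
    print(err)
```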

encode_sequence staticmethod

encode_sequence(amino_acid: str) -> List[str]

Convert an amino acid string into LZ sub-patterns with positions. Each sub-pattern has the format: '{LZ_subpattern}_{position}'.
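
For intuition, here is a minimal pure-Python sketch of an LZ78-style decomposition with positional labels. The function names are hypothetical and this is not the library's implementation; it only reproduces the labels documented on this page for "CASSLE", and it may differ from the library on edge cases.

```python
def lz78_subpatterns(sequence):
    """Split a string into LZ78-style sub-patterns.

    Each sub-pattern is the shortest prefix of the remaining text that has
    not been emitted before; a leftover tail is emitted as-is.
    """
    seen = set()
    patterns = []
    current = ""
    for char in sequence:
        current += char
        if current not in seen:
            seen.add(current)
            patterns.append(current)
            current = ""
    if current:  # tail that repeats an earlier sub-pattern
        patterns.append(current)
    return patterns


def encode_with_positions(sequence):
    """Label each sub-pattern with the 1-based position where it ends."""
    labels = []
    end = 0
    for pattern in lz78_subpatterns(sequence):
        end += len(pattern)
        labels.append(f"{pattern}_{end}")
    return labels


print(encode_with_positions("CASSLE"))
# ['C_1', 'A_2', 'S_3', 'SL_5', 'E_6']
```

The second "S" has already been seen, so it is extended to "SL" before being emitted, which is why the walk has five nodes rather than six.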

clean_node staticmethod

clean_node(base: str) -> str

Given a sub-pattern that might look like "ABC_10", extract only the amino acids ("ABC").

Constructor

Parameters

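Parameter                Type          Description
data                     pd.DataFrame  DataFrame with a cdr3_amino_acid column
verbose                  bool          Print progress messages (default: True)
calculate_trainset_pgen  bool          Compute PGEN for each sequence in data (default: False)
validate_sequences       bool          Check that sequences contain only standard amino acids (default: True)
smoothing_alpha          float         Laplace smoothing parameter for edge weights (default: 0.0)
initial_state_threshold  int           Minimum observation count for initial states (default: 5)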

Required Columns

  • cdr3_amino_acid - Amino acid CDR3 sequences

Optional Columns

  • V - V gene/allele annotations
  • J - J gene/allele annotations

Key Methods

walk_probability

Calculate the generation probability of a sequence. The sequence must first be converted to a walk with encode_sequence.

encoded = AAPLZGraph.encode_sequence("CASSLEPSGGTDTQYF")
pgen = graph.walk_probability(encoded)
print(f"P(gen) = {pgen:.2e}")

# Use log probability for numerical stability
log_pgen = graph.walk_probability(encoded, use_log=True)
print(f"log P(gen) = {log_pgen:.2f}")
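
The sketch below shows one way such a probability can be computed on a toy graph, and why the log form is more stable. It assumes a first-order formulation (probability of the first node times the probability of each edge); the function name, the toy graph, and the formulation itself are illustrative assumptions, and the library may include further terms.

```python
import math

# Toy graph with made-up probabilities (not derived from real data).
initial_probs = {"C_1": 1.0}
edge_probs = {
    "C_1": {"A_2": 1.0},
    "A_2": {"S_3": 0.75, "T_3": 0.25},
}


def walk_log_probability(walk, initial_probs, edge_probs):
    """Sum log P(first node) and log P(next | current) along the walk.

    Returns -inf when the walk uses a node or edge absent from the toy graph.
    """
    if not walk:
        raise ValueError("walk must contain at least one node")
    p_start = initial_probs.get(walk[0], 0.0)
    if p_start == 0.0:
        return -math.inf
    log_p = math.log(p_start)
    for current, nxt in zip(walk, walk[1:]):
        p_edge = edge_probs.get(current, {}).get(nxt, 0.0)
        if p_edge == 0.0:
            return -math.inf
        log_p += math.log(p_edge)
    return log_p


log_p = walk_log_probability(["C_1", "A_2", "S_3"], initial_probs, edge_probs)
print(math.exp(log_p))  # probability of the toy walk

# A long product of small probabilities underflows to 0.0,
# while the equivalent sum of logs stays finite.
print(math.prod([0.01] * 400), 400 * math.log(0.01))
```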

random_walk

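Generate a random walk through the graph by following edge probabilities. The result is a list of node labels; apply clean_node to each label to recover the amino-acid sequence.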

walk = graph.random_walk()
sequence = ''.join([AAPLZGraph.clean_node(n) for n in walk])
print(sequence)
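
Conceptually, a walk of this kind can be sampled as in the sketch below, which uses a toy graph held in plain dictionaries. The function name, the toy probabilities, and the stopping rule (stop at the first terminal node) are assumptions for illustration and may not match the library's behavior.

```python
import random

# Toy graph with made-up probabilities (not derived from real data).
initial_probs = {"C_1": 1.0}
edge_probs = {
    "C_1": {"A_2": 1.0},
    "A_2": {"S_3": 0.75, "TF_4": 0.25},
    "S_3": {"F_4": 1.0},
}
terminal_states = {"F_4", "TF_4"}


def toy_random_walk(initial_probs, edge_probs, terminal_states, rng):
    """Sample a start node, then follow weighted edges until a terminal node."""
    node = rng.choices(list(initial_probs), weights=list(initial_probs.values()))[0]
    walk = [node]
    while node not in terminal_states:
        targets = edge_probs[node]
        node = rng.choices(list(targets), weights=list(targets.values()))[0]
        walk.append(node)
    return walk


rng = random.Random(0)  # seeded so that repeated runs give the same walk
walk = toy_random_walk(initial_probs, edge_probs, terminal_states, rng)
sequence = "".join(node.split("_", 1)[0] for node in walk)
print(walk, sequence)
```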

genomic_random_walk

Generate a sequence consistent with V/J gene usage. This requires a graph built from data that includes the V and J columns.

walk, v_gene, j_gene = graph.genomic_random_walk()
sequence = ''.join([AAPLZGraph.clean_node(n) for n in walk])
print(f"{sequence} ({v_gene}, {j_gene})")

encode_sequence (static)

Convert a sequence to graph walk format.

encoded = AAPLZGraph.encode_sequence("CASSLE")
# Returns: ['C_1', 'A_2', 'S_3', 'SL_5', 'E_6']

clean_node (static)

Extract the pattern from a node name.

pattern = AAPLZGraph.clean_node("SL_5")
# Returns: "SL"
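
The two static methods are inverses in the sense that stripping the positions from an encoded walk and concatenating the pieces gives back the original sequence. A minimal stand-alone sketch of that round trip, using hypothetical helper names rather than the library's code:

```python
def strip_position(node):
    """Return the amino-acid part of a '{pattern}_{position}' label."""
    return node.split("_", 1)[0]


def walk_to_sequence(walk):
    """Concatenate the sub-patterns of a walk back into a sequence."""
    return "".join(strip_position(node) for node in walk)


walk = ["C_1", "A_2", "S_3", "SL_5", "E_6"]
print(strip_position("SL_5"))  # SL
print(walk_to_sequence(walk))  # CASSLE
```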

Attributes

Attribute                          Type          Description
graph                              nx.DiGraph    NetworkX directed graph
nodes                              NodeView      All nodes in the graph
edges                              EdgeView      All edges in the graph
lengths                            dict          Sequence length distribution
initial_states                     pd.Series     Initial state counts
terminal_states                    pd.Series     Terminal state counts
marginal_vgenes                    pd.Series     V gene probabilities
marginal_jgenes                    pd.Series     J gene probabilities
subpattern_individual_probability  pd.DataFrame  Pattern probabilities

Examples

Building with Gene Annotation

data = pd.DataFrame({
    'cdr3_amino_acid': ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF'],
    'V': ['TRBV16-1*01', 'TRBV1-1*01'],
    'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})

graph = AAPLZGraph(data, verbose=True)
print(graph.marginal_vgenes)
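
As a sanity check on what to expect from the printed values, the sketch below computes marginal gene frequencies as normalized counts for the two V annotations above. That marginal_vgenes is computed exactly this way is an assumption based on its description as "V gene probabilities"; `marginal_frequencies` is a hypothetical helper.

```python
from collections import Counter


def marginal_frequencies(labels):
    """Return each label's share of the total number of observations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


v_genes = ["TRBV16-1*01", "TRBV1-1*01"]
print(marginal_frequencies(v_genes))
# {'TRBV16-1*01': 0.5, 'TRBV1-1*01': 0.5}
```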

Batch Probability Calculation

sequences = ['CASSLEPSGGTDTQYF', 'CASSLGQGSTEAFF', 'CASSXYZRARESEQ']

for seq in sequences:
    try:
        encoded = AAPLZGraph.encode_sequence(seq)
        log_p = graph.walk_probability(encoded, use_log=True)
        print(f"{seq}: {log_p:.2f}")
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f"{seq}: Not in graph")

See Also