AAPLZGraph¶
Amino Acid Positional LZGraph for analyzing amino acid CDR3 sequences.
Quick Example¶
from LZGraphs import AAPLZGraph
import pandas as pd
# Build graph
data = pd.read_csv("repertoire.csv")
graph = AAPLZGraph(data, verbose=True)
# Calculate probability
sequence = "CASSLEPSGGTDTQYF"
encoded = AAPLZGraph.encode_sequence(sequence)
pgen = graph.walk_probability(encoded)
Class Reference¶
AAPLZGraph¶
AAPLZGraph(data: DataFrame, verbose: bool = True, calculate_trainset_pgen: bool = False, validate_sequences: bool = True, smoothing_alpha: float = 0.0, initial_state_threshold: int = 5)
Bases: LZGraphBase
Implements the "Amino Acid Positional" version of the LZGraph for analyzing amino-acid sequences, especially for immunological data.
Each node is labeled as
{LZ_subpattern}_{start_position_in_sequence}
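To make the node naming concrete, here is a minimal sketch of an LZ78-style decomposition that attaches a position to each subpattern. It is an illustration only: the helpers `lz78_subpatterns` and `positional_labels` are hypothetical and not part of LZGraphs, and the 1-based start-position convention is an assumption read off the label format above. `AAPLZGraph.encode_sequence` remains the authoritative encoder and may index positions differently.

```python
def lz78_subpatterns(sequence):
    """Split a sequence into LZ78-style subpatterns (hypothetical helper)."""
    seen = set()
    patterns = []
    current = ""
    for residue in sequence:
        current += residue
        # Emit the shortest prefix that has not been seen before.
        if current not in seen:
            seen.add(current)
            patterns.append(current)
            current = ""
    # A trailing fragment that was already seen is emitted as-is.
    if current:
        patterns.append(current)
    return patterns


def positional_labels(sequence):
    """Label each subpattern with its assumed 1-based start position."""
    labels = []
    position = 1
    for pattern in lz78_subpatterns(sequence):
        labels.append(f"{pattern}_{position}")
        position += len(pattern)
    return labels


labels = positional_labels("CASSLEPSGGTDTQYF")
print(labels)
```

Under this assumed convention, repeated residues such as the second `S` are absorbed into a longer subpattern (`SL`), which is what keeps the node vocabulary compact.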
Create an amino-acid-positional LZGraph from a DataFrame.
The DataFrame must contain at least a column "cdr3_amino_acid". Optionally, columns "V" and "J" may also be provided to embed gene information. If these columns are present, self.genetic is set to True.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| data | DataFrame | Input data for constructing the graph. Must contain a "cdr3_amino_acid" column; optionally "V" and "J" columns. |
| verbose | bool | Whether to log progress information. |
| calculate_trainset_pgen | bool | If True, compute PGEN for each sequence in |
| validate_sequences | bool | If True, validate that sequences contain only standard amino acids. Set to False to skip validation for performance. |
| smoothing_alpha | float | Laplace smoothing parameter for edge weights. 0.0 means no smoothing (default). |
| initial_state_threshold | int | Minimum observation count for initial states. States observed fewer times than this are excluded. Default is 5. |
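As a rough illustration of what a Laplace smoothing parameter usually does to edge weights, the sketch below normalizes the outgoing edge counts of one node with an additive constant. The function `smooth_edge_weights` and the toy node names are hypothetical, and smoothing only over observed edges is an assumption; how `smoothing_alpha` is applied inside the library is not specified here.

```python
def smooth_edge_weights(edge_counts, alpha=0.0):
    """Normalize outgoing edge counts with additive (Laplace) smoothing.

    Hypothetical helper: each observed edge gets (count + alpha), and the
    results are rescaled to sum to 1.
    """
    total = sum(edge_counts.values()) + alpha * len(edge_counts)
    return {target: (count + alpha) / total for target, count in edge_counts.items()}


# Toy outgoing edge counts for a single node (made-up values).
counts = {"S_3": 8, "T_3": 2}

unsmoothed = smooth_edge_weights(counts)            # alpha = 0.0: plain relative frequencies
smoothed = smooth_edge_weights(counts, alpha=1.0)   # probability mass shifts toward the rarer edge

print(unsmoothed)
print(smoothed)
```

With `alpha = 0.0` the weights are the raw relative frequencies, which matches the documented default of no smoothing.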
| RAISES | DESCRIPTION |
|---|---|
| TypeError | If data is not a pandas DataFrame. |
| ValueError | If required columns are missing or sequences are invalid. |
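Because invalid sequences lead to a ValueError, it can help to screen input before building the graph. The sketch below is one way to do that, assuming "standard amino acids" means the 20 canonical one-letter codes; the exact alphabet the library accepts is an assumption, and `find_invalid_sequences` is a hypothetical helper, not part of LZGraphs. Treating an empty string as invalid is also a choice made for this sketch.

```python
# Assumed alphabet: the 20 canonical amino acid one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_sequences(sequences):
    """Return sequences that are empty or contain non-standard characters."""
    return [
        seq for seq in sequences
        if not seq or not set(seq) <= STANDARD_AMINO_ACIDS
    ]


candidates = ["CASSLEPSGGTDTQYF", "CASSLGQGSTEAFF", "CASSXYZRARESEQ"]
invalid = find_invalid_sequences(candidates)
print(invalid)
```

Filtering such rows out of the DataFrame beforehand keeps `validate_sequences=True` from aborting graph construction on a single bad entry.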
Constructor¶
Parameters¶
| Parameter | Type | Description |
|---|---|---|
| data | pd.DataFrame | DataFrame with cdr3_amino_acid column |
| verbose | bool | Print progress messages (default: True) |
Required Columns¶
- cdr3_amino_acid - Amino acid CDR3 sequences
Optional Columns¶
- V - V gene/allele annotations
- J - J gene/allele annotations
Key Methods¶
walk_probability¶
Calculate the generation probability of a sequence.
encoded = AAPLZGraph.encode_sequence("CASSLEPSGGTDTQYF")
pgen = graph.walk_probability(encoded)
print(f"P(gen) = {pgen:.2e}")
# Use log probability for numerical stability
log_pgen = graph.walk_probability(encoded, use_log=True)
print(f"log P(gen) = {log_pgen:.2f}")
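The reason log probabilities are more stable can be shown without the library. The sketch below multiplies a long chain of small, made-up edge probabilities and compares the result with the sum of their logarithms; the values are hypothetical and chosen only to trigger floating-point underflow.

```python
import math

# Hypothetical per-edge probabilities along one long walk.
edge_probabilities = [1e-3] * 120

# Direct product: the true value (1e-360) is below the smallest positive double.
product = 1.0
for p in edge_probabilities:
    product *= p

# Sum of logs: the same quantity on a log scale, with no underflow.
log_sum = sum(math.log(p) for p in edge_probabilities)

print(product)   # underflows to 0.0
print(log_sum)
```

A product that has collapsed to zero can no longer be ranked against other sequences, whereas the log-scale value still can.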
random_walk¶
Generate a random sequence following edge probabilities.
walk = graph.random_walk()
sequence = ''.join([AAPLZGraph.clean_node(n) for n in walk])
print(sequence)
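"Following edge probabilities" can be pictured as repeated weighted sampling of the next node. The sketch below does this on a hand-made toy graph stored as a plain dictionary; the graph, its weights, and `toy_random_walk` are all hypothetical, and the stopping rule (stop at a node with no outgoing edges) is an assumption that may differ from how the library decides when a walk ends.

```python
import random

# Toy graph: node -> {successor: transition probability} (made-up values).
toy_graph = {
    "C_1": {"A_2": 1.0},
    "A_2": {"S_3": 0.7, "T_3": 0.3},
    "S_3": {"F_4": 1.0},
    "T_3": {"F_4": 1.0},
    "F_4": {},
}


def toy_random_walk(graph, start, rng):
    """Sample successors by edge weight until a node has no outgoing edges."""
    walk = [start]
    while graph[walk[-1]]:
        successors = graph[walk[-1]]
        targets = list(successors)
        weights = list(successors.values())
        walk.append(rng.choices(targets, weights=weights, k=1)[0])
    return walk


rng = random.Random(0)  # seeded so repeated runs give the same walk
toy_walk = toy_random_walk(toy_graph, "C_1", rng)
print(toy_walk)
```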
genomic_random_walk¶
Generate a sequence consistent with V/J gene usage.
walk, v_gene, j_gene = graph.genomic_random_walk()
sequence = ''.join([AAPLZGraph.clean_node(n) for n in walk])
print(f"{sequence} ({v_gene}, {j_gene})")
encode_sequence (static)¶
Convert a sequence to graph walk format.
clean_node (static)¶
Extract the pattern from a node name.
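Given the node label format `{LZ_subpattern}_{start_position_in_sequence}`, extracting the pattern amounts to dropping the positional suffix. The sketch below shows that idea with a hypothetical stand-in named `strip_position`; it is not the library's implementation of `clean_node`, and it assumes the position is everything after the last underscore.

```python
def strip_position(node):
    """Return the subpattern part of a '{pattern}_{position}' label.

    Hypothetical stand-in for illustration; labels without an underscore
    are returned unchanged.
    """
    return node.rsplit("_", 1)[0]


nodes = ["C_1", "A_2", "SL_4"]
decoded = "".join(strip_position(n) for n in nodes)
print(decoded)
```

Joining the stripped patterns in walk order recovers the amino acid string, which is the pattern used in the random walk examples above.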
Attributes¶
| Attribute | Type | Description |
|---|---|---|
| graph | nx.DiGraph | NetworkX directed graph |
| nodes | NodeView | All nodes in the graph |
| edges | EdgeView | All edges in the graph |
| lengths | dict | Sequence length distribution |
| initial_states | pd.Series | Initial state counts |
| terminal_states | pd.Series | Terminal state counts |
| marginal_vgenes | pd.Series | V gene probabilities |
| marginal_jgenes | pd.Series | J gene probabilities |
| subpattern_individual_probability | pd.DataFrame | Pattern probabilities |
Examples¶
Building with Gene Annotation¶
data = pd.DataFrame({
'cdr3_amino_acid': ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF'],
'V': ['TRBV16-1*01', 'TRBV1-1*01'],
'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})
graph = AAPLZGraph(data, verbose=True)
print(graph.marginal_vgenes)
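For intuition about what "V gene probabilities" means on this two-row input, the sketch below computes normalized V gene usage by hand. It assumes the marginal is the relative frequency of each V call in the input; whether `marginal_vgenes` is computed exactly this way is an assumption, and the variable names are illustrative.

```python
from collections import Counter

# The V calls from the example DataFrame above.
v_calls = ["TRBV16-1*01", "TRBV1-1*01"]

counts = Counter(v_calls)
total = sum(counts.values())

# Relative frequency of each V gene/allele in the input.
marginal_v = {gene: n / total for gene, n in counts.items()}
print(marginal_v)
```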
Batch Probability Calculation¶
sequences = ['CASSLEPSGGTDTQYF', 'CASSLGQGSTEAFF', 'CASSXYZRARESEQ']
for seq in sequences:
    try:
        encoded = AAPLZGraph.encode_sequence(seq)
        log_p = graph.walk_probability(encoded, use_log=True)
        print(f"{seq}: {log_p:.2f}")
    except Exception:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit still propagate
        print(f"{seq}: Not in graph")
See Also¶
- NDPLZGraph - Nucleotide version
- NaiveLZGraph - Non-positional version
- Tutorials: Graph Construction