Sequence Analysis¶
This tutorial covers analyzing sequences using LZGraphs, including probability calculation, encoding, and sequence generation.
Prerequisites¶
Build a graph first:
import pandas as pd
from LZGraphs import AAPLZGraph
data = pd.read_csv("Examples/ExampleData1.csv")
graph = AAPLZGraph(data, verbose=True)
Sequence Encoding¶
Before analyzing a sequence, you must encode it into the graph's format.
Encoding with AAPLZGraph¶
sequence = "CASRGERGDNEQFF"
# Encode the sequence
encoded = AAPLZGraph.encode_sequence(sequence)
print(encoded)
Output:
Each node has the format <subpattern>_<position>.
Decoding Back to Sequence¶
# Clean each node to get the original subpattern
clean_nodes = [AAPLZGraph.clean_node(node) for node in encoded]
print(clean_nodes)
# Reconstruct the sequence
reconstructed = ''.join(clean_nodes)
print(f"Original: {sequence}")
print(f"Reconstructed: {reconstructed}")
Calculating Sequence Probability (Pgen)¶
The generation probability quantifies how likely a sequence is given the repertoire:
Basic Probability¶
sequence = "CASRGERGDNEQFF"
encoded = AAPLZGraph.encode_sequence(sequence)
pgen = graph.walk_probability(encoded)
print(f"{sequence}: P(gen) = {pgen:.2e}")
Output:
Log Probability¶
For very small probabilities, use log-space to avoid numerical underflow:
When to use log probability
Use use_log=True when:
- Comparing many sequences
- Working with very rare sequences
- Performing numerical operations on probabilities
Handling Unknown Sequences¶
If a sequence contains patterns not in the graph:
unknown_seq = "CASSXYZABC" # XYZ unlikely in real repertoire
encoded = AAPLZGraph.encode_sequence(unknown_seq)
pgen = graph.walk_probability(encoded, verbose=True)
print(f"P(gen) = {pgen}") # Returns 0 if path doesn't exist
Generating New Sequences¶
Generate sequences that follow the statistical patterns of your repertoire.
Unsupervised Random Walk¶
Generate without gene constraints:
walk = graph.random_walk()
sequence = ''.join([AAPLZGraph.clean_node(node) for node in walk])
print(f"Generated: {sequence}")
Gene-Constrained Generation¶
Generate sequences consistent with specific V/J genes:
# Generate with random V/J selection
walk, v_gene, j_gene = graph.genomic_random_walk()
sequence = ''.join([AAPLZGraph.clean_node(node) for node in walk])
print(f"Sequence: {sequence}")
print(f"V gene: {v_gene}")
print(f"J gene: {j_gene}")
Output:
Generating Multiple Sequences¶
generated = []
for _ in range(100):
walk, v, j = graph.genomic_random_walk()
seq = ''.join([AAPLZGraph.clean_node(n) for n in walk])
generated.append({'sequence': seq, 'v_gene': v, 'j_gene': j})
df = pd.DataFrame(generated)
print(df.head())
Exploring Graph Structure¶
Node Properties¶
# Check if a node exists
node = "C_1"
exists = graph.graph.has_node(node)
print(f"Node {node} exists: {exists}")
# Get node successors (outgoing edges)
successors = list(graph.graph.successors("C_1"))
print(f"C_1 can transition to: {successors[:5]}...")
Edge Properties¶
# Check edge weight
if graph.graph.has_edge("C_1", "A_2"):
weight = graph.graph["C_1"]["A_2"]["weight"]
print(f"Edge C_1 -> A_2 weight: {weight:.4f}")
Subpattern Probabilities¶
# Get probability of a subpattern at a position
proba_df = graph.subpattern_individual_probability
print(proba_df.head())
Working with Terminal States¶
Terminal states are the final subpatterns of sequences:
# All terminal states with counts
print(graph.terminal_states.head(10))
# Check if a node is terminal
node = "F_15"
is_terminal = node in graph.terminal_states.index
print(f"{node} is terminal: {is_terminal}")
Terminal State Map¶
Get terminal states reachable from any position:
# Get terminal states for position 12
terminals_at_12 = graph.terminal_states_map.get(12, {})
print(f"Terminal states at position 12: {list(terminals_at_12.keys())[:5]}")
Practical Example: Sequence Comparison¶
Compare sequences by their generation probability:
sequences = [
"CASSLEPSGGTDTQYF", # Likely common
"CASSLGQGSTEAFF", # Also common
"CASSXYZABCDEFGH", # Likely rare/impossible
]
results = []
for seq in sequences:
try:
encoded = AAPLZGraph.encode_sequence(seq)
pgen = graph.walk_probability(encoded, use_log=True)
results.append({'sequence': seq, 'log_pgen': pgen})
except Exception as e:
results.append({'sequence': seq, 'log_pgen': float('-inf')})
df = pd.DataFrame(results)
df = df.sort_values('log_pgen', ascending=False)
print(df)
Next Steps¶
- Diversity Metrics Tutorial - Measure repertoire diversity
- Concepts: Probability Model - Understand how Pgen works
- How-To: Generate Sequences - Advanced generation techniques