Skip to content

First Steps

This guide helps you understand the fundamentals of LZGraphs and choose the right approach for your analysis.

Understanding Your Data

LZGraphs works with CDR3 sequences from T-cell receptor repertoires. Your data should be in a pandas DataFrame with at minimum a sequence column.

Required Columns

Graph Type Sequence Column Description
AAPLZGraph cdr3_amino_acid Amino acid sequences
NDPLZGraph cdr3_rearrangement Nucleotide sequences
NaiveLZGraph Any List of strings

Optional Columns

Column Purpose
V V gene/allele annotation (e.g., TRBV16-1*01)
J J gene/allele annotation (e.g., TRBJ1-2*01)

Choosing the Right Graph Type

LZGraphs provides three graph types, each optimized for different use cases:

flowchart TD
    A[What type of sequences?] --> B{Amino Acids?}
    B -->|Yes| C{Need gene info?}
    B -->|No| D{Nucleotides?}
    C -->|Yes| E[AAPLZGraph]
    C -->|No| F[AAPLZGraph or NaiveLZGraph]
    D -->|Yes| G{Need gene info?}
    D -->|No| H[NaiveLZGraph]
    G -->|Yes| I[NDPLZGraph]
    G -->|No| J[NDPLZGraph or NaiveLZGraph]

AAPLZGraph (Amino Acid Positional)

Best for: Amino acid CDR3 sequences with positional information

from LZGraphs import AAPLZGraph

graph = AAPLZGraph(data)  # data has 'cdr3_amino_acid' column

Features: - Position-aware encoding (e.g., C_1, A_2, S_3) - V/J gene annotation support - Compact graphs for amino acid alphabets - Ideal for most TCR analysis tasks

NDPLZGraph (Nucleotide Double Positional)

Best for: Nucleotide CDR3 sequences with fine-grained positional information

from LZGraphs import NDPLZGraph

graph = NDPLZGraph(data)  # data has 'cdr3_rearrangement' column

Features: - Double position encoding for higher resolution - V/J gene annotation support - Better for sequence-level analysis - Larger graphs than AAPLZGraph

NaiveLZGraph

Best for: Custom dictionaries, cross-repertoire comparisons, or machine learning features

from LZGraphs import NaiveLZGraph
from LZGraphs.utilities import generate_kmer_dictionary

# Create a fixed dictionary for consistent feature vectors
dictionary = generate_kmer_dictionary(6)

# Build graph with fixed dictionary
graph = NaiveLZGraph(sequences, dictionary)

Features: - Fixed dictionary across all repertoires - Consistent feature dimensions for ML - No positional encoding (simpler graphs) - Useful for eigenvector centrality features

Quick Comparison

Feature AAPLZGraph NDPLZGraph NaiveLZGraph
Input Amino acids Nucleotides Any strings
Position encoding Single Double None
V/J gene support Yes Yes No
Graph size Medium Large Configurable
Best for Most TCR analysis Nucleotide-level ML features

Input Data Format

Example: AAPLZGraph Data

import pandas as pd

data = pd.DataFrame({
    'cdr3_amino_acid': [
        'CASSLEPSGGTDTQYF',
        'CASSDTSGGTDTQYF',
        'CASSLEPQTFTDTFFF',
        'CASSLGQGSTEAFF'
    ],
    'V': [
        'TRBV16-1*01',
        'TRBV1-1*01',
        'TRBV16-1*01',
        'TRBV5-1*01'
    ],
    'J': [
        'TRBJ1-2*01',
        'TRBJ1-5*01',
        'TRBJ2-7*01',
        'TRBJ1-1*01'
    ]
})

Example: NDPLZGraph Data

data = pd.DataFrame({
    'cdr3_rearrangement': [
        'TGTGCCAGCAGTTTAGAGCCCAGCGGGGGG...',
        'TGTGCCAGCAGTGACACTTCAGGGGGGACT...',
    ],
    'V': ['TRBV16-1*01', 'TRBV1-1*01'],
    'J': ['TRBJ1-2*01', 'TRBJ1-5*01']
})

Understanding Graph Nodes

Each graph type encodes sequences differently:

AAPLZGraph Encoding

from LZGraphs import AAPLZGraph

sequence = "CASSLGQ"
encoded = AAPLZGraph.encode_sequence(sequence)
print(encoded)
# ['C_1', 'A_2', 'S_3', 'SL_5', 'G_6', 'Q_7']

The _N suffix indicates the position in the sequence.

NDPLZGraph Encoding

from LZGraphs import NDPLZGraph

sequence = "TGTGCC"
encoded = NDPLZGraph.encode_sequence(sequence)
print(encoded)
# ['T_1_1', 'G_2_2', 'T_3_3', 'G_4_4', 'C_5_5', 'C_6_6']

Double position encoding provides finer resolution.

NaiveLZGraph Encoding

from LZGraphs.utilities import lempel_ziv_decomposition

sequence = "TGTGCC"
encoded = lempel_ziv_decomposition(sequence)
print(encoded)
# ['T', 'G', 'TG', 'C', 'C']

No positional information, pure LZ76 decomposition.

Next Steps

Now that you understand the basics:

  1. Graph Construction Tutorial - Build graphs with different options
  2. Concepts: LZ76 Algorithm - Understand how encoding works
  3. Concepts: Graph Types - Deep dive into graph differences