Diversity Metrics¶

This tutorial shows how to measure repertoire diversity with the metrics LZGraphs provides: K-diversity, LZCentrality, entropy, perplexity, and tools for comparing repertoires.

Overview¶

LZGraphs provides several approaches to quantify repertoire diversity:

Metric	What it Measures	Use Case
K1000	Unique patterns in 1000 sequences	Overall diversity
LZCentrality	Sequence position in repertoire	Sequence rarity
Entropy	Information content	Graph complexity
Perplexity	Prediction uncertainty	Model quality

K-Diversity Metrics¶

K-diversity measures how many unique LZ76 patterns appear when sampling K sequences.
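
To make the idea concrete, here is a minimal sketch of the counting step. It assumes a simple Lempel-Ziv style dictionary parse and plain sampling without replacement; the helper names `lz_subpatterns` and `unique_patterns_in_sample` are hypothetical and are not part of LZGraphs, whose encoders (for example `AAPLZGraph.encode_sequence`) may attach extra information such as positions to each subpattern.

```python
import random


def lz_subpatterns(sequence):
    """Split a sequence with a simple Lempel-Ziv style dictionary parse.

    Illustrative only: each emitted piece is the shortest run of characters
    not yet in this sequence's dictionary. This is not the LZGraphs encoder.
    """
    seen = set()
    patterns = []
    current = ""
    for char in sequence:
        current += char
        if current not in seen:
            seen.add(current)
            patterns.append(current)
            current = ""
    if current:  # trailing piece that was already in the dictionary
        patterns.append(current)
    return patterns


def unique_patterns_in_sample(sequences, k, rng):
    """Draw k sequences without replacement and count distinct subpatterns."""
    unique = set()
    for seq in rng.sample(sequences, k):
        unique.update(lz_subpatterns(seq))
    return len(unique)


# Toy repertoire (hypothetical sequences, for illustration only)
toy_repertoire = ["CASSLEPSGGTDTQYF", "CASSLGQGSTEAFF", "CASSLGETQYF", "CASSPGQGAYEQYF"]

print(lz_subpatterns("CASSLEPSGGTDTQYF"))
# ['C', 'A', 'S', 'SL', 'E', 'P', 'SG', 'G', 'T', 'D', 'TQ', 'Y', 'F']
print(unique_patterns_in_sample(toy_repertoire, k=2, rng=random.Random(0)))
```

Repeating the sampling step several times and averaging the counts gives a resampled estimate in the spirit of the `draws` parameter used below.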

K1000 Diversity¶

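K1000 is the most commonly used K-diversity measure. It counts the unique patterns found in samples of 1,000 sequences, averaged over several resampling rounds: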

from LZGraphs import K1000_Diversity, AAPLZGraph
import pandas as pd

# Load data
data = pd.read_csv("Examples/ExampleData1.csv")
sequences = data['cdr3_amino_acid'].tolist()

# Calculate K1000
k1000 = K1000_Diversity(
    sequences,
    encoding_function=AAPLZGraph.encode_sequence,
    draws=30  # Number of resampling rounds
)
print(f"K1000 Diversity: {k1000:.1f}")

Interpretation

Higher K1000 values indicate more diverse repertoires. A repertoire with many unique patterns will have a higher K1000 than one dominated by repeated motifs.

Other K-Diversity Variants¶

from LZGraphs import K100_Diversity, K500_Diversity, K5000_Diversity

# For smaller repertoires
k100 = K100_Diversity(sequences, AAPLZGraph.encode_sequence)

# For larger repertoires
k5000 = K5000_Diversity(sequences, AAPLZGraph.encode_sequence)

print(f"K100:  {k100:.1f}")
print(f"K5000: {k5000:.1f}")

Adaptive K-Diversity¶

Automatically select K based on repertoire size:

from LZGraphs import adaptive_K_Diversity

k_adaptive = adaptive_K_Diversity(sequences, AAPLZGraph.encode_sequence)
print(f"Adaptive K-Diversity: {k_adaptive:.1f}")

K-Diversity with Statistics¶

Pass return_stats=True to get the mean, standard deviation, and confidence interval across draws instead of a single value:

from LZGraphs import K_Diversity

result = K_Diversity(
    sequences,
    encoding_function=AAPLZGraph.encode_sequence,
    sample_size=1000,
    draws=100,
    return_stats=True
)

print(f"Mean: {result['mean']:.1f}")
print(f"Std:  {result['std']:.2f}")
print(f"CI:   [{result['ci_low']:.1f}, {result['ci_high']:.1f}]")
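
As a rough illustration of where such summary statistics can come from, the sketch below summarizes a list of per-draw counts. It assumes a normal-approximation interval for the mean; how LZGraphs actually computes its interval is not stated here, and `summarize_draws` is a hypothetical helper whose dictionary keys simply mirror the ones used above. The per-draw values are synthetic.

```python
import math
import random
import statistics


def summarize_draws(values, z=1.96):
    """Mean, sample std, and a normal-approximation CI for the mean.

    Assumption: z=1.96 gives an approximate 95% interval; the library
    may use a different method (for example, percentiles).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(len(values))
    return {
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


# Synthetic stand-in for 100 per-draw unique-pattern counts
rng = random.Random(42)
draw_counts = [rng.gauss(1500, 25) for _ in range(100)]

stats = summarize_draws(draw_counts)
print(f"Mean: {stats['mean']:.1f}")
print(f"Std:  {stats['std']:.2f}")
print(f"CI:   [{stats['ci_low']:.1f}, {stats['ci_high']:.1f}]")
```

More draws narrow the interval around the mean but do not change the spread of individual draws.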

LZCentrality¶

LZCentrality measures how central a sequence is within the repertoire structure.

from LZGraphs import LZCentrality, AAPLZGraph
import pandas as pd

# Build graph
data = pd.read_csv("Examples/ExampleData1.csv")
graph = AAPLZGraph(data, verbose=False)

# Calculate centrality for a sequence
sequence = "CASSLEPSGGTDTQYF"
centrality = LZCentrality(graph, sequence)
print(f"LZCentrality of {sequence}: {centrality:.4f}")

Comparing Sequence Centrality¶

sequences = [
    "CASSLEPSGGTDTQYF",
    "CASSLGQGSTEAFF",
    "CASSXYZRARESEQ"
]

for seq in sequences:
    try:
        cent = LZCentrality(graph, seq)
        print(f"{seq}: {cent:.4f}")
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f"{seq}: Not in graph")

LZCentrality interpretation

Higher values = more central (common patterns)
Lower values = more peripheral (rare patterns)
Zero = sequence not representable in graph

Entropy Metrics¶

Entropy quantifies the information content of your graph.

Node Entropy¶

Measures uncertainty in node (subpattern) distribution:

from LZGraphs import node_entropy

h_nodes = node_entropy(graph)
print(f"Node entropy: {h_nodes:.2f} bits")
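
For intuition about what a value in bits means, the following sketch computes Shannon entropy from a table of counts. It is a generic textbook calculation, not the LZGraphs implementation; `shannon_entropy_bits` and the toy node counts are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of the distribution implied by raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


# Toy node usage counts (hypothetical subpatterns)
node_counts = Counter(["C", "A", "S", "SL", "C", "A", "S", "C"])
print(f"Entropy: {shannon_entropy_bits(node_counts.values()):.2f} bits")

# Four equally likely nodes carry exactly 2 bits of uncertainty
print(shannon_entropy_bits([1, 1, 1, 1]))  # 2.0
```

Entropy is highest when all nodes are used equally often and drops toward zero as a few nodes dominate.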

Edge Entropy¶

Measures uncertainty in transitions:

from LZGraphs import edge_entropy

h_edges = edge_entropy(graph)
print(f"Edge entropy: {h_edges:.2f} bits")

Graph Entropy¶

Combined measure of graph complexity:

from LZGraphs import graph_entropy, normalized_graph_entropy

h_graph = graph_entropy(graph)
h_norm = normalized_graph_entropy(graph)

print(f"Graph entropy: {h_graph:.2f} bits")
print(f"Normalized:    {h_norm:.4f}")

Perplexity¶

Perplexity measures how "surprised" the model is by sequences.

Sequence Perplexity¶

from LZGraphs import sequence_perplexity

sequence = "CASSLEPSGGTDTQYF"
perp = sequence_perplexity(graph, sequence)
print(f"Perplexity of {sequence}: {perp:.2f}")

Perplexity interpretation

Lower perplexity = sequence fits model well
Higher perplexity = sequence is unexpected
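
The standard definition of perplexity can be sketched as follows, assuming a sequence is scored as a chain of per-step transition probabilities. The function `perplexity_from_probs` is hypothetical, and whether `sequence_perplexity` uses exactly this normalization is an assumption.

```python
import math


def perplexity_from_probs(step_probabilities):
    """Perplexity = 2 ** (-(1/N) * sum(log2 p_i)) over N step probabilities."""
    probs = list(step_probabilities)
    if not probs:
        raise ValueError("need at least one step probability")
    if any(p <= 0 for p in probs):
        return math.inf  # an impossible step makes the sequence maximally surprising
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)


# Every step is a coin flip: the model is as uncertain as a 2-way choice
print(perplexity_from_probs([0.5, 0.5, 0.5]))  # 2.0

# Every step is certain: no surprise at all
print(perplexity_from_probs([1.0, 1.0]))  # 1.0
```

A perplexity of N can be read as the model being, on average, as uncertain as if it were choosing uniformly among N options at each step.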

Repertoire Perplexity¶

Average perplexity across all sequences:

from LZGraphs import repertoire_perplexity

sequences = data['cdr3_amino_acid'].tolist()
avg_perp = repertoire_perplexity(graph, sequences)
print(f"Average repertoire perplexity: {avg_perp:.2f}")

Comparing Repertoires¶

Jensen-Shannon Divergence¶

Measure similarity between two repertoires:

from LZGraphs import AAPLZGraph, jensen_shannon_divergence
import pandas as pd

# Build two graphs from different repertoires
data1 = pd.read_csv("Examples/ExampleData1.csv")
data2 = pd.read_csv("Examples/ExampleData2.csv")

graph1 = AAPLZGraph(data1, verbose=False)
graph2 = AAPLZGraph(data2, verbose=False)

# Calculate JS divergence
jsd = jensen_shannon_divergence(graph1, graph2)
print(f"JS Divergence: {jsd:.4f}")

JS Divergence interpretation

0 = identical distributions
1 = completely different (the upper bound when base-2 logarithms are used)
Symmetric: JSD(A,B) = JSD(B,A)
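
These three properties can be checked on small probability vectors. The sketch below is the textbook Jensen-Shannon divergence with base-2 logarithms; `kl_divergence_bits` and `jensen_shannon_bits` are hypothetical helpers, and how LZGraphs derives the distributions from two graphs is not shown here.

```python
import math


def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_bits(p, q):
    """Jensen-Shannon divergence between two aligned probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence_bits(p, m) + 0.5 * kl_divergence_bits(q, m)


same = [0.5, 0.5]
print(jensen_shannon_bits(same, same))              # 0.0 (identical)
print(jensen_shannon_bits([1.0, 0.0], [0.0, 1.0]))  # 1.0 (no overlap)

a, b = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
print(math.isclose(jensen_shannon_bits(a, b), jensen_shannon_bits(b, a)))  # True
```

Because the mixture `m` is nonzero wherever `p` or `q` is nonzero, the divergence stays finite even when the two distributions share no support.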

Mutual Information for Genes¶

Measure association between genes and subpatterns:

from LZGraphs import mutual_information_genes

mi_v = mutual_information_genes(graph, gene_type='V')
mi_j = mutual_information_genes(graph, gene_type='J')

print(f"MI (V genes): {mi_v:.4f}")
print(f"MI (J genes): {mi_j:.4f}")
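
As a reference point for reading these values, the sketch below computes mutual information in bits from a list of (gene, subpattern) observations. It is the generic definition, not the LZGraphs implementation; `mutual_information_bits` and the toy pairs are hypothetical.

```python
import math
from collections import Counter


def mutual_information_bits(pairs):
    """Mutual information I(X; Y) in bits from observed (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((left[x] / n) * (right[y] / n)))
    return mi


# Each gene always co-occurs with one subpattern: knowing one reveals the other
dependent = [("TRBV5-1", "SL"), ("TRBV7-2", "SG")] * 10
print(mutual_information_bits(dependent))  # 1.0

# Every gene pairs with every subpattern equally often: no association
independent = [("TRBV5-1", "SL"), ("TRBV5-1", "SG"),
               ("TRBV7-2", "SL"), ("TRBV7-2", "SG")] * 5
print(mutual_information_bits(independent))  # 0.0
```

Values near zero suggest gene usage tells you little about which subpatterns appear; larger values indicate a stronger association.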

Saturation Analysis¶

Saturation analysis tracks how the number of distinct nodes and edges grows as more sequences are added.

from LZGraphs import NodeEdgeSaturationProbe

# Create saturation curve
probe = NodeEdgeSaturationProbe()
curve = probe.saturation_curve(
    sequences,
    encoding_function=AAPLZGraph.encode_sequence,
    steps=50
)

print(curve.head())

Key Saturation Metrics¶

# Half-saturation point (sequences needed for 50% of nodes)
half_sat = probe.half_saturation_point(curve)
print(f"Half-saturation: {half_sat} sequences")

# Area under saturation curve
ausc = probe.area_under_curve(curve)
print(f"AUSC: {ausc:.2f}")
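
The quantities above can be illustrated with a small self-contained sketch. It assumes the curve records the cumulative number of distinct tokens after each sequence, uses overlapping 2-mers as a stand-in for the encoding function, and integrates with the trapezoid rule. All function names here are hypothetical and do not reflect how `NodeEdgeSaturationProbe` is implemented.

```python
def two_mers(sequence):
    """Stand-in tokenizer: the set of overlapping 2-character substrings."""
    return {sequence[i:i + 2] for i in range(len(sequence) - 1)}


def toy_saturation_curve(sequences, tokenize):
    """List of (sequences seen, distinct tokens seen so far)."""
    seen = set()
    curve = []
    for n, seq in enumerate(sequences, start=1):
        seen.update(tokenize(seq))
        curve.append((n, len(seen)))
    return curve


def toy_half_saturation(curve):
    """Smallest number of sequences that reaches half of the final token count."""
    final = curve[-1][1]
    for n, count in curve:
        if count >= final / 2:
            return n


def toy_area_under_curve(curve):
    """Trapezoid-rule area under the saturation curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


toy_curve = toy_saturation_curve(["AAB", "ABC", "BCD", "AAB"], two_mers)
print(toy_curve)                        # [(1, 2), (2, 3), (3, 4), (4, 4)]
print(toy_half_saturation(toy_curve))   # 1
print(toy_area_under_curve(toy_curve))  # 10.0
```

A curve that flattens early, as the last step does here, suggests that additional sequences contribute few new patterns.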

Complete Example¶

Compare diversity between two repertoires:

from LZGraphs import (
    AAPLZGraph, K1000_Diversity,
    node_entropy, jensen_shannon_divergence
)
import pandas as pd

# Load two repertoires
rep1 = pd.read_csv("Examples/ExampleData1.csv")
rep2 = pd.read_csv("Examples/ExampleData2.csv")

# Build graphs
g1 = AAPLZGraph(rep1, verbose=False)
g2 = AAPLZGraph(rep2, verbose=False)

# Extract CDR3 amino acid sequences
seq1 = rep1['cdr3_amino_acid'].tolist()
seq2 = rep2['cdr3_amino_acid'].tolist()

# Compare metrics
print("Repertoire 1 vs Repertoire 2")
print("-" * 40)
print(f"K1000:    {K1000_Diversity(seq1, AAPLZGraph.encode_sequence):.0f} vs "
      f"{K1000_Diversity(seq2, AAPLZGraph.encode_sequence):.0f}")
print(f"Entropy:  {node_entropy(g1):.2f} vs {node_entropy(g2):.2f}")
print(f"JS Div:   {jensen_shannon_divergence(g1, g2):.4f}")

Next Steps¶

Visualization Tutorial - Plot diversity metrics
Concepts: Diversity Indices - Theory behind metrics
How-To: Compare Repertoires - Detailed comparison workflows