Analyze the Generative Distribution
Learn how to validate, characterize, and use the probability distribution encoded in an LZGraph.
Quick Reference
from LZGraphs import LZGraph
# `graph` below is assumed to be an LZGraph instance you have already built
# 1. Validate the distribution
diag = graph.pgen_diagnostics()
# 2. Count producible sequences (combinatorial size)
n = graph.path_count
# 3. Effective diversity and entropy
div = graph.effective_diversity()
# 4. Hill diversity numbers (D0, D1, D2, ...)
hills = graph.hill_numbers([0, 1, 2])
# 5. Predict richness at arbitrary depth
richness = graph.predicted_richness(1_000_000)
Validate the Distribution
Check whether the graph's generation probabilities form a proper probability distribution:
diag = graph.pgen_diagnostics()
print(f"Is proper distribution: {diag['is_proper']}")
print(f"Total absorbed mass: {diag['total_absorbed']:.4f}")
print(f"Total leaked mass: {diag['total_leaked']:.4f}")
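A "proper" distribution is one in which every random walk terminates at a valid sequence with probability 1, so the probabilities of all sequences sum to 1.0. In terms of the diagnostics above, the absorbed mass should then be close to 1 and the leaked mass close to 0.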
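To make the absorbed versus leaked distinction concrete, here is a minimal sketch on a toy transition table. It does not call LZGraphs: the node names, the `transitions` table, and the `mass_report` helper are all hypothetical, and the library's own diagnostics may be computed differently.

```python
# Toy walk: each node maps to (next_node, probability) pairs.
# "END" is a valid stopping point. Any other node without outgoing
# edges is a dead end, and mass that reaches it counts as leaked.
transitions = {
    "START": [("A", 0.6), ("B", 0.4)],
    "A": [("END", 1.0)],
    "B": [("END", 0.5), ("DEAD", 0.5)],
    "DEAD": [],
}

def mass_report(transitions, start="START", end="END"):
    """Propagate probability mass through an acyclic toy graph."""
    absorbed = 0.0
    leaked = 0.0
    stack = [(start, 1.0)]
    while stack:
        node, mass = stack.pop()
        if node == end:
            absorbed += mass          # walk finished at a valid sequence
        elif not transitions.get(node):
            leaked += mass            # walk got stuck in a dead end
        else:
            for nxt, p in transitions[node]:
                stack.append((nxt, mass * p))
    return {
        "total_absorbed": absorbed,
        "total_leaked": leaked,
        "is_proper": abs(absorbed - 1.0) < 1e-9,
    }

report = mass_report(transitions)
print(report)
```

In this toy graph one fifth of the mass ends in the dead end, so the distribution is not proper; removing the `DEAD` edge and routing `B` entirely to `END` would make it proper.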
Measure Diversity and Complexity
Combinatorial Size (Path Count)
The path_count property gives the total number of distinct sequences the graph can generate:
n = graph.path_count
print(f"Producible sequences: {n}")
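As an illustration of what a path count measures, the sketch below counts start-to-end paths in a small directed acyclic graph with memoized recursion. The `edges` table and `count_paths` function are hypothetical, and this is not the library's implementation, which may apply additional constraints on which paths are valid.

```python
from functools import lru_cache

# Hypothetical toy DAG: each node lists its successors.
# Reaching "END" completes one sequence.
edges = {
    "START": ["A", "B"],
    "A": ["C", "END"],
    "B": ["C"],
    "C": ["END"],
}

@lru_cache(maxsize=None)
def count_paths(node):
    """Number of distinct paths from `node` to "END"."""
    if node == "END":
        return 1
    # Each successor contributes all of its own onward paths.
    return sum(count_paths(nxt) for nxt in edges.get(node, []))

n_paths = count_paths("START")
print(n_paths)
```

Memoization keeps the work linear in the number of edges even when the number of paths grows exponentially, which is why a count can be obtained without enumerating the sequences themselves.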
Effective Diversity
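While the combinatorial space is huge, most sequences have negligible probability. Effective Diversity (\(e^H\), where \(H\) is the Shannon entropy in nats, or equivalently \(2^H\) with \(H\) in bits) is the number of equally likely sequences that would produce the same entropy, so it tells you how many sequences the distribution effectively uses: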
# Single number: exp(Shannon entropy)
d_eff = graph.effective_diversity()
print(f"Effective Diversity: {d_eff:.2e}")
# Full profile with entropy and uniformity
profile = graph.diversity_profile()
print(f"Shannon Entropy: {profile['entropy_bits']:.1f} bits")
print(f"Effective Diversity: {profile['effective_diversity']:.2e}")
print(f"Uniformity: {profile['uniformity']:.4f}")
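The relationship between entropy and effective diversity can be checked by hand on a small distribution. The following sketch uses a made-up four-sequence distribution (`probs` is hypothetical) and only the standard library; it does not reproduce the library's `uniformity` value, whose definition is not given here.

```python
import math

# Hypothetical toy distribution over four sequences (sums to 1).
probs = [0.5, 0.25, 0.125, 0.125]

# Shannon entropy in nats, then converted to bits.
entropy_nats = -sum(p * math.log(p) for p in probs if p > 0)
entropy_bits = entropy_nats / math.log(2)

# Effective diversity: exp(H in nats), the same value as 2 ** (H in bits).
effective_diversity = math.exp(entropy_nats)

print(f"Shannon entropy: {entropy_bits:.2f} bits")
print(f"Effective diversity: {effective_diversity:.3f} of {len(probs)} sequences")
```

Although four sequences are possible, the skew toward the first one means the distribution behaves like one with fewer than four equally likely sequences.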
Hill Diversity Profile
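Hill numbers give a unified view of diversity across scales of rarity: low orders weight rare sequences almost as heavily as common ones, while higher orders emphasize the dominant sequences. \(D(0)\) is the richness, \(D(1)\) is the exponential of the Shannon entropy (the effective diversity), and \(D(2)\) is the inverse Simpson index: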
orders = [0, 1, 2]
hills = graph.hill_numbers(orders)
print(f"D(0) Richness: {hills[0]:.2e}")
print(f"D(1) Shannon: {hills[1]:.2e}")
print(f"D(2) Simpson: {hills[2]:.2e}") # Key for collision prediction
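For reference, the standard definition of the Hill number of order \(q\) is \(D(q) = (\sum_i p_i^q)^{1/(1-q)}\), with \(D(1)\) taken as the limit \(\exp(-\sum_i p_i \ln p_i)\). A minimal sketch on a toy distribution follows; `hill_number` and `probs` are hypothetical names, and the code illustrates the definition rather than how the library evaluates it on a graph.

```python
import math

def hill_number(probs, q):
    """Hill number of order q for a discrete probability distribution."""
    probs = [p for p in probs if p > 0]
    if q == 1:
        # The general formula is undefined at q = 1; use the limit,
        # which is the exponential of the Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in probs))
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))

# Hypothetical toy distribution over four sequences.
probs = [0.5, 0.25, 0.125, 0.125]
for q in (0, 1, 2):
    print(f"D({q}) = {hill_number(probs, q):.3f}")
```

The values never increase with the order, so a large gap between \(D(0)\) and \(D(2)\) indicates a distribution dominated by a few high-probability sequences.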
Occupancy Predictions
LZGraphs can predict analytically how many sequences you would observe at a given sequencing depth, without running any simulations.
Predicted Richness
Predict the number of distinct sequences you would see if you sampled \(d\) reads from the distribution:
# How many unique sequences at 100,000 reads?
richness_100k = graph.predicted_richness(100_000)
print(f"Expected unique sequences: {richness_100k:,.0f}")
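The idea behind a richness prediction can be sketched with the textbook occupancy formula: if sequence \(i\) has probability \(p_i\), it appears at least once in \(d\) independent draws with probability \(1 - (1 - p_i)^d\), and summing over sequences gives the expected number of distinct ones. The code below applies this to a toy distribution; `expected_richness` and `probs` are hypothetical, and the library may use a different or approximate formulation internally.

```python
def expected_richness(probs, depth):
    """Expected number of distinct outcomes in `depth` independent draws."""
    # 1 - (1 - p) ** depth is the chance an outcome is seen at least once.
    return sum(1.0 - (1.0 - p) ** depth for p in probs)

# Hypothetical toy distribution over four sequences.
probs = [0.5, 0.25, 0.125, 0.125]
for depth in (1, 4, 100):
    print(depth, round(expected_richness(probs, depth), 3))
```

Richness starts at exactly one for a single read and saturates toward the number of producible sequences as depth grows.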
Predicted Overlap
Predict how many sequences two samples of given depths would share:
# Overlap between Sample A (10k reads) and Sample B (50k reads)
overlap = graph.predicted_overlap(10_000, 50_000)
print(f"Expected shared sequences: {overlap:,.1f}")
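Under the same occupancy reasoning, a sequence is shared when it appears in both samples, so for two independent samples the expected overlap is the sum over sequences of the product of the two presence probabilities. This is a sketch on a toy distribution with hypothetical names (`expected_overlap`, `probs`), not the library's algorithm.

```python
def expected_overlap(probs, depth_a, depth_b):
    """Expected number of outcomes present in both of two independent samples."""
    total = 0.0
    for p in probs:
        in_a = 1.0 - (1.0 - p) ** depth_a   # seen at least once in sample A
        in_b = 1.0 - (1.0 - p) ** depth_b   # seen at least once in sample B
        total += in_a * in_b
    return total

# Hypothetical toy distribution over four sequences.
probs = [0.5, 0.25, 0.125, 0.125]
print(round(expected_overlap(probs, 2, 5), 3))
```

With one read per sample the expression reduces to \(\sum_i p_i^2\), the reciprocal of the Hill number \(D(2)\), which is why \(D(2)\) matters for collision prediction.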
Analytical Distribution
For advanced statistical tasks, you can construct an analytical Gaussian mixture model of the PGEN distribution:
dist = graph.pgen_distribution()
# Evaluate the density at a chosen point, e.g. for thresholding or plotting the PDF
density_at_threshold = dist.pdf(-25.0)
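For readers unfamiliar with mixture densities, the sketch below evaluates a one-dimensional Gaussian mixture by hand. The component weights, means, and standard deviations are invented for illustration, and treating the horizontal axis as a log-probability scale is an assumption suggested by the value -25.0 above; the object returned by the library may be parameterized differently.

```python
import math

# Hypothetical mixture components: (weight, mean, standard deviation).
# Weights sum to 1.
components = [(0.7, -30.0, 3.0), (0.3, -22.0, 2.0)]

def mixture_pdf(x, components):
    """Density of a one-dimensional Gaussian mixture at x."""
    total = 0.0
    for weight, mean, std in components:
        z = (x - mean) / std
        total += weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return total

density = mixture_pdf(-25.0, components)
print(f"{density:.4f}")
```

A density value is not itself a probability; probabilities come from integrating the density over an interval.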
Predict Cohort Sharing
Predict the "sharing spectrum" (how many sequences are shared by exactly \(k\) donors) for an entire cohort:
# List of read counts for 10 donors
draw_counts = [1000, 5000, 10000, 10000, 20000, 25000, 30000, 40000, 50000, 100000]
result = graph.predict_sharing(draw_counts)
spectrum = result['spectrum']
for k, count in enumerate(spectrum[:5]):
    print(f"Shared by {k+1} donors: {count:,.0f} sequences")
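One way to reason about a sharing spectrum is to treat each donor as an independent sample: a sequence is present in a donor with probability \(1 - (1 - p_i)^{n_j}\), and the number of donors containing it follows a Poisson binomial distribution. The sketch below builds that distribution with a small dynamic program on a toy example. `sharing_spectrum` and `probs` are hypothetical, and this is an illustration of the concept rather than the library's method.

```python
def sharing_spectrum(probs, draw_counts):
    """Expected number of outcomes seen in exactly k donors, k = 0..len(draw_counts)."""
    spectrum = [0.0] * (len(draw_counts) + 1)
    for p in probs:
        # dist[k] = probability that exactly k of the donors processed
        # so far contain this outcome.
        dist = [1.0]
        for depth in draw_counts:
            present = 1.0 - (1.0 - p) ** depth
            new = [0.0] * (len(dist) + 1)
            for k, mass in enumerate(dist):
                new[k] += mass * (1.0 - present)   # absent from this donor
                new[k + 1] += mass * present       # present in this donor
            dist = new
        for k, mass in enumerate(dist):
            spectrum[k] += mass
    return spectrum

# Hypothetical toy distribution and three donors with small read counts.
probs = [0.5, 0.25, 0.125, 0.125]
spectrum = sharing_spectrum(probs, draw_counts=[2, 5, 10])
for k, count in enumerate(spectrum):
    print(f"Seen in exactly {k} donors: {count:.3f} sequences")
```

In this toy version index 0 counts sequences seen in no donor, and the entries sum to the total number of producible sequences.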
Next Steps
- Concepts: Distribution Analytics — Mathematical foundations
- How-To: Repertoire Comparison — Use these metrics to compare samples
- API Reference — Detailed method documentation