Analyze the Generative Distribution

Learn how to validate, characterize, and use the probability distribution encoded in an LZGraph.

Quick Reference

from LZGraphs import LZGraph

# The snippets on this page assume `graph` is an LZGraph instance you have already built.

# 1. Validate the distribution
diag = graph.pgen_diagnostics()

# 2. Count producible sequences (combinatorial size)
n = graph.path_count

# 3. Effective diversity and entropy
div = graph.effective_diversity()

# 4. Hill diversity numbers (D0, D1, D2, ...)
hills = graph.hill_numbers([0, 1, 2])

# 5. Predict richness at arbitrary depth
richness = graph.predicted_richness(1_000_000)

Validate the Distribution

Check whether the graph's generation probabilities form a proper probability distribution:

diag = graph.pgen_diagnostics()

print(f"Is proper distribution: {diag['is_proper']}")
print(f"Total absorbed mass:    {diag['total_absorbed']:.4f}")
print(f"Total leaked mass:      {diag['total_leaked']:.4f}")

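A "proper" distribution is one in which every random walk eventually terminates at a valid sequence, so the probabilities of all sequences sum to 1.0. Going by the field names, absorbed mass is the probability that reaches a valid end and leaked mass is the remainder, so a proper distribution should report values close to 1.0 and 0.0 respectively.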
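
The idea of absorbed versus leaked mass can be illustrated on a toy graph. The sketch below is not the library's implementation: the `edges` dictionary, the node names, and the `absorbed_and_leaked` helper are all hypothetical, and the graph is assumed to be acyclic. It pushes probability mass from the start node and records how much reaches the valid terminal and how much gets stuck elsewhere.

```python
# Hypothetical toy graph: node -> list of (next_node, transition_probability).
# "END" is the only valid terminal. "dead" has no outgoing edges and is not a
# valid end, so any mass that reaches it counts as leaked.
edges = {
    "start": [("A", 0.6), ("B", 0.4)],
    "A": [("END", 0.9), ("dead", 0.1)],
    "B": [("END", 1.0)],
}


def absorbed_and_leaked(edges, source="start", terminal="END"):
    """Propagate probability mass from `source` through an acyclic graph."""
    absorbed = 0.0
    leaked = 0.0
    stack = [(source, 1.0)]
    while stack:
        node, mass = stack.pop()
        if node == terminal:
            absorbed += mass
        elif node not in edges:
            leaked += mass  # stuck at a non-terminal dead end
        else:
            out = edges[node]
            for nxt, p in out:
                stack.append((nxt, mass * p))
            # Mass also leaks if the outgoing probabilities sum to less than 1.
            leaked += mass * (1.0 - sum(p for _, p in out))
    return absorbed, leaked


absorbed, leaked = absorbed_and_leaked(edges)
is_proper = abs(absorbed - 1.0) < 1e-9
print(f"absorbed={absorbed:.2f} leaked={leaked:.2f} proper={is_proper}")
```

In this toy case 6% of the mass ends at the dead end, so the distribution is not proper.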

Measure Diversity and Complexity

Combinatorial Size (Path Count)

The path_count property tells you the total number of unique sequences that could be generated by the graph:

print(f"Unique producible sequences: {graph.path_count:.2e}")
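
To see why this number can be obtained without enumerating sequences, consider counting source-to-terminal paths in a small acyclic graph by dynamic programming. This is only an illustration on a hypothetical `edges` dictionary; an actual LZGraph may apply additional constraints when deciding which walks are valid.

```python
from functools import lru_cache

# Hypothetical toy graph: node -> list of successor nodes.
edges = {
    "start": ["A", "B"],
    "A": ["C", "END"],
    "B": ["C"],
    "C": ["END"],
}


@lru_cache(maxsize=None)
def count_paths(node):
    """Number of distinct walks from `node` to "END" in the acyclic toy graph."""
    if node == "END":
        return 1
    return sum(count_paths(nxt) for nxt in edges.get(node, []))


n_paths = count_paths("start")
print(f"Paths from start to END: {n_paths}")
```

Each node's count is computed once and reused, so the cost grows with the number of edges rather than with the number of paths.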

Effective Diversity

The combinatorial space can be huge, but most sequences in it have negligible probability. Effective diversity (\(e^H\), with the Shannon entropy \(H\) measured in nats) is the number of equally likely sequences that would produce the same entropy:

# Single number: exp(Shannon entropy)
d_eff = graph.effective_diversity()
print(f"Effective Diversity:  {d_eff:.2e}")

# Full profile with entropy and uniformity
profile = graph.diversity_profile()
print(f"Shannon Entropy:      {profile['entropy_bits']:.1f} bits")
print(f"Effective Diversity:  {profile['effective_diversity']:.2e}")
print(f"Uniformity:           {profile['uniformity']:.4f}")
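
As a sanity check on the definitions, the same quantities can be computed by hand for a small explicit distribution. The probabilities in `p` are made up for illustration; the sketch shows that \(e^H\) with \(H\) in nats equals \(2^H\) with \(H\) in bits.

```python
import math

# Hypothetical distribution over four sequences.
p = [0.5, 0.25, 0.125, 0.125]

# Shannon entropy in nats, then converted to bits.
entropy_nats = -sum(x * math.log(x) for x in p if x > 0)
entropy_bits = entropy_nats / math.log(2)

# Effective diversity: exp(H) in nats, equivalently 2 ** H in bits.
effective_diversity = math.exp(entropy_nats)

print(f"Entropy: {entropy_bits:.2f} bits")
print(f"Effective diversity: {effective_diversity:.2f} of {len(p)} sequences")
```

Here four sequences are possible, but the skew toward the first one brings the effective count down to about 3.4.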

Hill Diversity Profile

Hill numbers \(D(q)\) give a unified view of diversity: the higher the order \(q\), the more weight goes to common sequences and the less to rare ones:

orders = [0, 1, 2]
hills = graph.hill_numbers(orders)

print(f"D(0) Richness:  {hills[0]:.2e}")
print(f"D(1) Shannon:   {hills[1]:.2e}")
print(f"D(2) Simpson:   {hills[2]:.2e}") # Key for collision prediction
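
For reference, the standard Hill number of order \(q\) is \(D(q) = \left(\sum_i p_i^q\right)^{1/(1-q)}\), with \(D(1) = e^H\) as the limiting case. The sketch below evaluates this on a made-up distribution; `hill_number` is a hypothetical helper, not the library method, which works on the graph rather than on an explicit list of probabilities.

```python
import math


def hill_number(p, q):
    """Hill number of order q for an explicit probability vector p."""
    p = [x for x in p if x > 0]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy (in nats).
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


# Hypothetical distribution over four sequences.
p = [0.5, 0.25, 0.125, 0.125]
d0, d1, d2 = (hill_number(p, q) for q in (0, 1, 2))

# 1 / D(2) is the probability that two independent draws are the same sequence.
collision_probability = 1.0 / d2

print(f"D(0)={d0:.2f}  D(1)={d1:.2f}  D(2)={d2:.2f}")
print(f"Collision probability: {collision_probability:.4f}")
```

The values never increase with \(q\), which is why D(2) is the most conservative of the three and the one tied to collisions.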

Occupancy Predictions

LZGraphs can predict how many sequences you would observe at different sequencing depths without needing to perform a single simulation.

Predicted Richness

Predict the number of distinct sequences you would see if you sampled \(d\) reads from the distribution:

# How many unique sequences at 100,000 reads?
richness_100k = graph.predicted_richness(100_000)
print(f"Expected unique sequences: {richness_100k:,.0f}")
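
The quantity being predicted has a simple closed form when the distribution is small enough to list: a sequence with probability \(p_i\) is seen at least once in \(d\) independent reads with probability \(1 - (1 - p_i)^d\), and the expected richness is the sum of these terms. The sketch below assumes independent draws and uses a made-up distribution; how the library evaluates this over a graph too large to enumerate is not shown here.

```python
def expected_richness(p, depth):
    """Expected number of distinct sequences among `depth` independent draws."""
    return sum(1.0 - (1.0 - x) ** depth for x in p)


# Hypothetical distribution over four sequences.
p = [0.5, 0.25, 0.125, 0.125]

for depth in (1, 5, 20, 1000):
    print(f"{depth:>5} reads: {expected_richness(p, depth):.3f} expected distinct")
```

The curve starts at one distinct sequence for a single read and saturates at the total number of sequences with nonzero probability.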

Predicted Overlap

Predict how many shared sequences would appear between two samples of different depths:

# Overlap between Sample A (10k reads) and Sample B (50k reads)
overlap = graph.predicted_overlap(10_000, 50_000)
print(f"Expected shared sequences: {overlap:,.1f}")
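
Under the assumption that both samples are drawn independently from the same distribution, the expected overlap also has a closed form: a sequence is shared when it appears at least once in each sample, so the expectation is \(\sum_i \left(1 - (1 - p_i)^{d_A}\right)\left(1 - (1 - p_i)^{d_B}\right)\). The helper below is a hypothetical illustration on an explicit toy distribution, not the library's method.

```python
def expected_overlap(p, depth_a, depth_b):
    """Expected number of sequences present in both of two independent samples."""
    return sum(
        (1.0 - (1.0 - x) ** depth_a) * (1.0 - (1.0 - x) ** depth_b)
        for x in p
    )


# Hypothetical distribution: two equally likely sequences.
p = [0.5, 0.5]

# One read per sample: the samples share a sequence only if both reads match.
print(f"Overlap at depths (1, 1): {expected_overlap(p, 1, 1):.2f}")
print(f"Overlap at depths (10, 50): {expected_overlap(p, 10, 50):.2f}")
```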

Analytical Distribution

For advanced statistical tasks, you can construct an analytical Gaussian mixture model of the PGEN distribution:

dist = graph.pgen_distribution()

# Evaluate the probability density at a chosen point, e.g. for a threshold or to plot the PDF
density_at_threshold = dist.pdf(-25.0)
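
To make the idea of a Gaussian mixture density concrete, here is a minimal standalone sketch. The component weights, means, and standard deviations are invented, and treating the x-axis as a log generation probability is an assumption based on the negative argument used above; the object returned by the library may expose a different interface beyond `pdf`.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a Gaussian mixture given (weight, mean, std) triples."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


# Hypothetical mixture on an assumed log-probability axis; weights sum to 1.
components = [(0.7, -22.0, 2.0), (0.3, -30.0, 3.0)]

density = mixture_pdf(-25.0, components)
print(f"Mixture density at -25.0: {density:.4f}")
```

Note that the value is a density, not a probability; probabilities come from integrating it over an interval.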

Predicted Sharing Spectrum

Predict the "sharing spectrum" (how many sequences are shared by exactly \(k\) donors) for an entire cohort:

# List of read counts for 10 donors
draw_counts = [1000, 5000, 10000, 10000, 20000, 25000, 30000, 40000, 50000, 100000]

result = graph.predict_sharing(draw_counts)
spectrum = result['spectrum']

for k, count in enumerate(spectrum[:5], start=1):
    print(f"Shared by {k} donors: {count:,.0f} sequences")
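
For a distribution small enough to enumerate, a sharing spectrum can be computed exactly. Assuming donors sample independently from the same distribution, sequence \(i\) is present in donor \(j\) with probability \(1 - (1 - p_i)^{n_j}\), and the number of donors carrying it follows a Poisson binomial distribution. The sketch below is a hypothetical helper, not the library's algorithm, and unlike the loop above its index 0 counts sequences seen in no donor at all.

```python
def sharing_spectrum(p, draw_counts):
    """Expected number of sequences seen in exactly k donors, for k = 0..len(draw_counts)."""
    spectrum = [0.0] * (len(draw_counts) + 1)
    for x in p:
        # Probability that this sequence appears at least once in each donor.
        presence = [1.0 - (1.0 - x) ** n for n in draw_counts]
        # dist[k] = P(sequence seen in exactly k of the donors processed so far).
        dist = [1.0]
        for q in presence:
            new = [0.0] * (len(dist) + 1)
            for k, mass in enumerate(dist):
                new[k] += mass * (1.0 - q)
                new[k + 1] += mass * q
            dist = new
        for k, mass in enumerate(dist):
            spectrum[k] += mass
    return spectrum


# Hypothetical case: two equally likely sequences, two donors with one read each.
toy_spectrum = sharing_spectrum([0.5, 0.5], [1, 1])
for k, count in enumerate(toy_spectrum):
    print(f"Seen in exactly {k} donors: {count:.2f} sequences")
```

The entries always sum to the number of sequences in the distribution, which is a convenient check on any implementation.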

Next Steps

Concepts: Distribution Analytics — Mathematical foundations
How-To: Repertoire Comparison — Use these metrics to compare samples
API Reference — Detailed method documentation