FlashBackGraph¶

A strictly Markovian graph built from the FlashBack decomposition, where nodes are FlashBack tokens and edges are transitions between consecutive tokens. Because the graph is a Markovian DAG, every analytic (path count, diversity, entropy, Hill numbers, PGEN) is computed exactly via forward dynamic programming, with no Monte Carlo approximation.

For the conceptual distinction between FlashBackGraph and the LZGraph family (aap / ndp / naive), see Graph Variants.

Constructor¶

FlashBackGraph(
    sequences,
    *,
    abundances=None,
    smoothing=0.0
)

Parameters¶

Parameter	Type	Description
`sequences`	`list[str]`	Sequences to build from (must be non-empty)
`abundances`	`list[int]`	Optional per-sequence counts (default: all 1)
`smoothing`	`float`	Laplace smoothing alpha for edge weights (default: 0.0)

from_file (classmethod)¶

FlashBackGraph.from_file(path, *, smoothing=0.0)

Build directly from a plain-text file with constant memory (streaming). Accepts one sequence per line, or sequence<TAB>abundance.

Core Methods¶

pgen¶

pgen(sequence, *, log=True)

Exact generation probability of one or more sequences under the FlashBack model.

Parameters: sequence (str or list[str]), log (bool)
Returns: float (single) or np.ndarray (list)

simulate¶

simulate(n, *, seed=None)

Generate n sequences by Markov random walk.

Returns: SimulationResult

top_k_sequences¶

top_k_sequences(k=100, *, most_probable=True)

Find the K most (or least) probable sequences exactly, via forward DP over the DAG's topological order, with no simulation. Set most_probable=False for the lowest-probability tail.

Returns: SimulationResult

posterior¶

posterior(sequences, *, abundances=None, kappa=1.0)

Bayesian posterior graph using this graph as a Dirichlet-Multinomial prior. Keeps the prior topology and reweights edges toward the new individual. kappa=0 → pure individual, kappa→∞ → pure prior. Edges absent from the prior are ignored.

Returns: A new FlashBackGraph instance.

without¶

without(sequences, *, abundances=None)

Return a new graph with the contribution of sequences removed: each walk's edge counts are decremented (clamped at zero, zero-count edges pruned, weights renormalised). Enables leave-donor-out construction from an existing graph in seconds rather than rebuilding from source.

Returns: A new FlashBackGraph instance.

Exact Diversity & Analytics¶

All values below are computed exactly by forward DP over the DAG.

effective_diversity¶

effective_diversity()

Exact $e^H$ (Shannon entropy), equivalent to Hill number $D(1)$.

diversity_profile¶

diversity_profile()

Full Shannon diversity breakdown (cached per instance).

hill_number¶

hill_number(alpha)

Exact Hill diversity number $D(\alpha)$.

hill_numbers¶

hill_numbers(orders)

Exact Hill numbers for multiple orders. Returns an np.ndarray.

hill_curve¶

hill_curve(orders=None)

Returns a dict with orders and values for plotting a diversity profile (default orders span 0–10).

power_sum¶

power_sum(alpha)

Exact power sum $M(\alpha)$.

pgen_diagnostics¶

pgen_diagnostics(atol=1e-6)

Exact absorbed vs leaked probability mass.

pgen_dynamic_range¶

pgen_dynamic_range()

Exact dynamic range of generation probabilities, in orders of magnitude. Use pgen_dynamic_range_detail() for the full breakdown.

pgen_moments¶

pgen_moments()

Moments (mean, variance, std, skewness, kurtosis) of the forward-DP log-PGEN distribution.

pgen_distribution¶

pgen_distribution()

Analytical Gaussian-mixture approximation of the log-PGEN distribution. Returns a PgenDistribution.

Anomaly Scoring (SCALE)¶

SCALE is the recommended anomaly/error score: a self-calibrated, length-invariant transform of -log pgen. You calibrate once against the graph, then score sequences (higher = more anomalous). See the anomaly-detection tutorial.

calibrate_scale¶

calibrate_scale(*, n_sim=200_000, seed=None, min_count=50)

Self-calibrate SCALE: simulate n_sim sequences from the graph and record the per-length median and IQR of -log pgen. Returns a ScaleCalibration (the reusable cache, documented below).

Parameters: n_sim (int), seed (int or None), min_count (int, minimum simulated sequences at a length to get its own median/IQR; sparser lengths fall back to global).
Returns: ScaleCalibration.

scale_score¶

scale_score(sequence, calibration)

Length-calibrated anomaly score: (-log pgen(s) - median[len]) / IQR[len], using a ScaleCalibration. Higher means more anomalous.

Parameters: sequence (str or list[str]), calibration (ScaleCalibration).
Returns: float (single) or np.ndarray (list).

ScaleCalibration¶

The calibration cache returned by calibrate_scale. Holds median_by_length, iqr_by_length, global_median, global_iqr, n_sim, and seed. Persist and reuse it:

calibration.save('scale_calibration.json')
from LZGraphs import ScaleCalibration
calibration = ScaleCalibration.load('scale_calibration.json')

Graph Algebra¶

Operation	Method	Result
Union	`a.union(b)`	Sum edge counts
Intersection	`a.intersection(b)`	Shared structure, min counts
Difference	`a.difference(b)`	Subtract edge counts
Weighted Merge	`a.weighted_merge(b, α, β)`	Linear combination $ \alpha A + \beta B $

All operations return a new FlashBackGraph.

Features & Adjacency¶

feature_stats¶

feature_stats()

15-element statistical vector describing the graph, for ML pipelines.

feature_mass_profile¶

feature_mass_profile(max_pos=30)

Position-based mass distribution profile.

adjacency_csr¶

adjacency_csr()

CSR (Compressed Sparse Row) adjacency as a dict of numpy arrays (row_offsets, col_indices, weights, counts).

IO¶

save¶

save(path)

Save to .lzg binary format.

load (classmethod)¶

FlashBackGraph.load(path)

Load from a .lzg binary file.

Attributes¶

Basic¶

Attribute	Type	Description
`n_nodes`	`int`	Total number of nodes (including @ and $ sentinels)
`n_edges`	`int`	Total number of directed edges
`n_sequences`	`int`	Number of input sequences (abundance-weighted)
`variant`	`str`	Always `'flashback'`
`is_dag`	`bool`	Always `True`
`path_count`	`float`	Exact number of distinct walks (via DP)

Structure¶

Attribute	Type	Description
`nodes`	`list[str]`	Node labels (excluding sentinels)
`all_nodes`	`list[str]`	All node labels (including @ and $)
`edges`	`list[tuple]`	`(src, dst, weight, count)` tuples (no sentinels)
`all_edges`	`list[tuple]`	All `(src, dst, weight, count)` tuples
`n_initial`	`int`	Number of initial states
`n_terminal`	`int`	Number of terminal nodes
`max_out_degree`	`int`	Maximum out-degree
`max_in_degree`	`int`	Maximum in-degree
`density`	`float`	Graph density (0 to 1)
`out_degrees`	`np.ndarray`	Out-degree of each node
`in_degrees`	`np.ndarray`	In-degree of each node
`length_distribution`	`dict`	`{length: count}` mapping

Streaming Builder: FlashBackStream¶

For sequences that arrive incrementally (a generator, a network stream, an open-ended simulator), FlashBackStream accumulates them and lets you monitor running node/edge counts before deciding to stop. The finalized graph is byte-identical to FlashBackGraph(all_sequences).

from LZGraphs import FlashBackStream

with FlashBackStream(smoothing=0.0) as stream:
    for batch in source:
        stream.add_sequences(batch)            # appends; cheap
        print(stream.n_nodes, stream.n_edges)  # peek; instant
        if some_stop_condition:
            break
    graph = stream.finalize()                  # one-time CSR build
    graph.save('foundation.lzg')

Member	Description
`FlashBackStream(smoothing=0.0)`	Open a new streaming builder.
`add_sequences(sequences, abundances=None)`	Append sequences; raises `RuntimeError` after finalize/abort.
`n_nodes` / `n_edges`	Current running counts (instant peek).
`peek()`	`{'n_nodes': int, 'n_edges': int}` of the running accumulator.
`snapshot()`	Build a finalized graph from the current state without consuming the stream (for checkpointing).
`finalize()`	Convert the accumulator into a `FlashBackGraph`; consumes the stream.
`abort()`	Discard the partial build without paying the finalize cost (idempotent).

Used as a context manager, the stream auto-aborts on exit if it was never finalized.