FlashBackGrammar

A weighted probabilistic context-free grammar (PCFG) over FlashBack decomposition trees. Where FlashBackGraph models a Markov chain on linearised tokens, FlashBackGrammar models the decomposition tree directly, with non-terminals indexed by the boundary characters of the current middle. Rules are shared across recursion depths that share a boundary pair, giving cross-depth motif reuse while still guaranteeing every generated tree is a canonical FlashBack decomposition.

All analytics (path-count series, entropy, Hill numbers, dynamic range) are computed exactly via small linear systems over the non-terminal space.

Constructor

FlashBackGrammar(
    sequences,
    *,
    abundances=None,
    abundance_mode='linear',
    smoothing=0.0,
    backoff='none'
)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| sequences | list[str] | Training CDR3 strings |
| abundances | list[int] | Per-sequence observed counts (default: all 1) |
| abundance_mode | str | How abundances become rule-count contributions: 'none', 'linear' (default), or 'log' |
| smoothing | float | Laplace alpha over observed rules at each non-terminal (default: 0.0) |
| backoff | str | Unseen-rule handling for pgen: 'none' (strict MLE) or 'gt' (Good-Turing). Analytics always use pure MLE; backoff only affects pgen. |
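
To make the abundance_mode and smoothing parameters concrete, here is a minimal standalone sketch. It does not use FlashBackGrammar itself. The helper names are hypothetical, and the 'log' formula (log1p of the count) is an assumption, since the exact transform is not stated above. The smoothing helper applies additive (Laplace) smoothing over only the rules already observed at one non-terminal, as the table describes.

```python
import math

def abundance_weight(count, mode="linear"):
    """Weight one sequence contributes to rule counts (hypothetical helper)."""
    if mode == "none":
        return 1.0                  # every sequence counts once
    if mode == "linear":
        return float(count)         # weight equals the observed abundance
    if mode == "log":
        return math.log1p(count)    # ASSUMPTION: exact log transform is not documented
    raise ValueError(f"unknown abundance_mode: {mode!r}")

def smoothed_rule_probs(rule_counts, alpha=0.0):
    """Additive smoothing over the rules observed at a single non-terminal."""
    total = sum(rule_counts.values()) + alpha * len(rule_counts)
    return {rule: (count + alpha) / total for rule, count in rule_counts.items()}

counts = {"rule_a": 3.0, "rule_b": 1.0}
mle = smoothed_rule_probs(counts, alpha=0.0)       # plain relative frequencies
smoothed = smoothed_rule_probs(counts, alpha=1.0)  # pulled toward uniform
```

With alpha=0.0 the result is the maximum-likelihood estimate; larger alpha values flatten the distribution across the observed rules without assigning mass to unseen ones.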

from_file (classmethod)

FlashBackGrammar.from_file(path, ...)
Build directly from a plain-text file (one sequence per line, or sequence<TAB>abundance).
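
The input format described above can be illustrated with a small standalone parser. This is a sketch of the file layout only, not the library's reader; read_sequences is a hypothetical name, and treating a missing abundance column as 1 is an assumption that mirrors the constructor default.

```python
import io

def read_sequences(handle):
    """Parse lines of 'sequence' or 'sequence<TAB>abundance' (hypothetical helper)."""
    sequences, abundances = [], []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue                                   # skip blank lines
        seq, sep, count = line.partition("\t")
        sequences.append(seq.strip())
        abundances.append(int(count) if sep else 1)    # ASSUMPTION: default abundance is 1
    return sequences, abundances

sample = io.StringIO("CASSLGQETQYF\t3\nCASSPDRGYEQYF\n")
seqs, counts = read_sequences(sample)
```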

Core Methods

pgen

pgen(sequence, *, log=True)
Generation probability of one or more sequences under the grammar. Honours the backoff mode chosen at construction.

  • Parameters: sequence (str or list[str]), log (keyword-only bool)
  • Returns: float (single) or np.ndarray (list)
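
For intuition about the 'gt' backoff, the classical Good-Turing estimate of the total probability of unseen events is N1 / N, the fraction of observations that belong to events seen exactly once. The sketch below computes that quantity for one non-terminal's rule counts. It is an illustration of the general technique with hypothetical names; how FlashBackGrammar distributes this mass among unseen rules is not described here.

```python
def good_turing_unseen_mass(rule_counts):
    """Good-Turing estimate of the probability mass of unseen rules: N1 / N."""
    total = sum(rule_counts.values())
    singletons = sum(1 for c in rule_counts.values() if c == 1)
    return singletons / total if total else 0.0

counts = {"rule_a": 3, "rule_b": 1, "rule_c": 1}
unseen_mass = good_turing_unseen_mass(counts)   # two singletons out of five observations
```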

pgen_mle

pgen_mle(sequence, *, log=True)
Strict maximum-likelihood probability, ignoring any backoff. Useful for comparison against pgen.

simulate

simulate(n, *, seed=None)
Generate n sequences by sampling decomposition trees from the grammar.
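
Sampling from a PCFG amounts to expanding the start symbol recursively, choosing one rule at each non-terminal according to its probability. The sketch below does this for a toy grammar; TOY_RULES, expand, and this simulate function are hypothetical stand-ins and do not reproduce FlashBack's boundary-indexed non-terminals.

```python
import random

# Toy PCFG (hypothetical): keys are non-terminals, any other symbol is a terminal.
TOY_RULES = {
    "S": [(0.5, ["x", "T"]), (0.5, ["y"])],
    "T": [(0.3, ["S", "S"]), (0.7, ["z"])],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a terminal string."""
    if symbol not in TOY_RULES:
        return symbol                                   # terminal: emit as-is
    weights = [p for p, _ in TOY_RULES[symbol]]
    bodies = [rhs for _, rhs in TOY_RULES[symbol]]
    rhs = rng.choices(bodies, weights=weights, k=1)[0]  # sample one rule
    return "".join(expand(s, rng) for s in rhs)

def simulate(n, seed=None):
    """Draw n strings; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [expand("S", rng) for _ in range(n)]

samples = simulate(5, seed=0)
```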

top_k_sequences

top_k_sequences(k=100, *, most_probable=True)
Return the k highest-probability sequences under the grammar, or the k lowest-probability sequences when most_probable=False.

posterior

posterior(sequences, *, abundances=None, kappa=1.0)
Bayesian posterior grammar using this grammar as a prior. Returns a new FlashBackGrammar.
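
One common way to use a fitted model as a prior is to treat its rule probabilities as pseudo-counts and add the new observations on top. The sketch below shows that idea for a single non-terminal. It is an assumption about what a prior-strength parameter such as kappa typically does, not a description of the library's update; posterior_rule_probs is a hypothetical name.

```python
def posterior_rule_probs(prior_probs, new_counts, kappa=1.0):
    """Pseudo-count update: kappa * prior probability + newly observed count."""
    rules = set(prior_probs) | set(new_counts)
    pseudo = {
        r: kappa * prior_probs.get(r, 0.0) + new_counts.get(r, 0.0)
        for r in rules
    }
    total = sum(pseudo.values())
    return {r: v / total for r, v in pseudo.items()}

prior = {"rule_a": 0.5, "rule_b": 0.5}
updated = posterior_rule_probs(prior, {"rule_a": 3.0}, kappa=1.0)
```

Under this reading, a larger kappa keeps the result closer to the prior, and a smaller kappa lets the new data dominate.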

without

without(sequences, *, abundances=None)
Return a new grammar with the contribution of sequences removed.
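
Conceptually, removing sequences from a count-based model means subtracting their rule-count contributions and renormalising. A minimal sketch for one non-terminal, assuming the removed counts were part of the original training data (the helper name is hypothetical and this is not the library's code):

```python
from collections import Counter

def remove_contribution(rule_counts, removed_counts):
    """Subtract counts, drop rules that reach zero, and renormalise."""
    remaining = Counter(rule_counts)
    remaining.subtract(removed_counts)
    kept = {r: c for r, c in remaining.items() if c > 0}
    total = sum(kept.values())
    return {r: c / total for r, c in kept.items()}

probs = remove_contribution({"rule_a": 3, "rule_b": 1}, {"rule_b": 1})
```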

Exact Analytics

All values are computed exactly via linear systems over the non-terminal space.
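
As an illustration of how a linear system over non-terminals can yield an exact quantity, the sketch below computes the derivational entropy of a toy PCFG. For each non-terminal A, H(A) = h(A) + sum over B of M[A][B] * H(B), where h(A) is the entropy of the rule choice at A and M[A][B] is the expected number of B symbols produced by one rule application at A. Solving (I - M) H = h gives H without enumerating any trees. The toy grammar and all names are hypothetical; this is not FlashBackGrammar's implementation.

```python
import math

# Toy PCFG (hypothetical): keys are non-terminals, any other symbol is a terminal.
RULES = {
    "S": [(0.5, ["x", "T"]), (0.5, ["y"])],
    "T": [(0.3, ["S", "S"]), (0.7, ["z"])],
}

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def derivational_entropy(rules):
    """Entropy (in nats) of the tree distribution rooted at each non-terminal."""
    nts = list(rules)
    n = len(nts)
    # Local entropy of the rule choice at each non-terminal.
    h = [-sum(p * math.log(p) for p, _ in rules[nt] if p > 0) for nt in nts]
    # Expected number of each non-terminal produced by one rule application.
    m = [[sum(p * rhs.count(other) for p, rhs in rules[nt]) for other in nts]
         for nt in nts]
    a = [[(1.0 if i == j else 0.0) - m[i][j] for j in range(n)] for i in range(n)]
    return dict(zip(nts, solve(a, h)))

entropy_by_nt = derivational_entropy(RULES)
```

The system has a meaningful solution only when the grammar is consistent, which is why the spectral radius of M matters.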

| Method | Description |
| --- | --- |
| path_count(max_length) | Number of distinct decomposition trees for sequences of length up to max_length. |
| path_count_series(max_length) | Path counts per length as an np.ndarray. |
| length_pmf(max_length) | Probability mass over sequence lengths. |
| diversity_profile() | Full Shannon diversity breakdown. |
| effective_diversity() | Exact exp(H), equivalent to Hill D(1). |
| entropy() | Exact Shannon entropy of the grammar. |
| hill_number(alpha) | Exact Hill number D(alpha). |
| hill_numbers(orders) | Exact Hill numbers for multiple orders. |
| hill_curve(orders=None) | Dict of orders and values for plotting. |
| power_sum(alpha) | Exact power sum M(alpha). |
| pgen_dynamic_range(max_length=30) | Dynamic range of probabilities in orders of magnitude. |
| pgen_dynamic_range_detail(max_length=30) | Full dynamic-range breakdown. |
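
The diversity quantities in the table are related by standard definitions: M(alpha) is the sum of p^alpha over outcomes, D(alpha) = M(alpha)^(1 / (1 - alpha)) for alpha != 1, and D(1) = exp(H). The sketch below evaluates them on a small explicit distribution to show those relationships. The grammar methods compute the same quantities without enumerating sequences; here the enumeration is explicit, and the log10(max / min) reading of "dynamic range" is an assumption.

```python
import math

def power_sum(probs, alpha):
    """M(alpha): sum of p**alpha over outcomes with non-zero probability."""
    return sum(p ** alpha for p in probs if p > 0)

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hill_number(probs, alpha):
    """D(alpha); the alpha = 1 case is the limit exp(H)."""
    if abs(alpha - 1.0) < 1e-12:
        return math.exp(entropy(probs))
    return power_sum(probs, alpha) ** (1.0 / (1.0 - alpha))

def dynamic_range(probs):
    """ASSUMPTION: orders of magnitude between largest and smallest probability."""
    positive = [p for p in probs if p > 0]
    return math.log10(max(positive) / min(positive))

probs = [0.5, 0.25, 0.125, 0.125]
richness = hill_number(probs, 0)    # number of outcomes with non-zero probability
effective = hill_number(probs, 1)   # exp(H)
simpson = hill_number(probs, 2)     # inverse Simpson index
```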

Grammar Inspection

| Member | Description |
| --- | --- |
| nonterminals | List of (a, z, is_leaf) non-terminal descriptors. |
| rules_at(a, z) | Rules at the non-terminal indexed by boundary characters a, z. |
| rules_at_start() | Rules at the start non-terminal. |
| top_rules(k=20, by='weight') | The k highest-weight (or highest-count) rules. |
| sentinel_rule_weights | Weights of the sentinel rules. |

IO

save

save(path)
Save to .lzg binary format.

load (classmethod)

FlashBackGrammar.load(path)
Load from a .lzg binary file.

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| variant | str | Always 'flashback_grammar' |
| n_nonterminals | int | Number of non-terminals |
| n_rules | int | Total number of rules |
| n_internal_rules | int | Internal (branching) rules |
| n_leaf_rules | int | Leaf (terminal) rules |
| alphabet | list[str] | Characters seen in training |
| spectral_radius | float | Spectral radius of the rule system (consistency indicator) |
| is_consistent | bool | Whether the grammar defines a proper distribution |
| smoothing | float | Laplace alpha used |
| backoff_mode | str | 'none' or 'gt' |
| abundance_mode | str | 'none', 'linear', or 'log' |
| max_length | int | Longest training sequence length |
| length_counts | dict | {length: count} mapping |
| n_sequences | int | Number of training sequences (abundance-weighted) |
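
For background on spectral_radius and is_consistent: a standard result for PCFGs is that the grammar defines a proper distribution over finite trees when the spectral radius of its expectation matrix M is below 1, where M[A][B] is the expected number of B symbols produced by one rule application at A. The sketch below estimates that radius for a toy 2x2 matrix. It runs power iteration on M + I, which for a non-negative M has spectral radius rho(M) + 1 and avoids the oscillation plain power iteration can show. The matrix and function name are hypothetical, and whether the library uses this exact criterion or method is an assumption.

```python
def spectral_radius(m, iterations=500):
    """Estimate the spectral radius of a non-negative square matrix."""
    n = len(m)
    # Shift by the identity so the dominant eigenvalue is unique in modulus.
    shifted = [[m[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = [1.0] * n
    estimate = 1.0
    for _ in range(iterations):
        w = [sum(shifted[i][j] * v[j] for j in range(n)) for i in range(n)]
        estimate = max(w)                 # converges to rho(M) + 1
        v = [x / estimate for x in w]     # renormalise to keep values bounded
    return estimate - 1.0

# Expectation matrix of a toy grammar: S -> x T (0.5) | y (0.5); T -> S S (0.3) | z (0.7)
M = [[0.0, 0.5],
     [0.6, 0.0]]
rho = spectral_radius(M)
is_consistent = rho < 1.0
```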

Decomposition utilities

The module-level functions flashback_decompose and flashback_reverse convert between a sequence and its FlashBack token decomposition. See Module Functions.