FlashBackGrammar
A weighted probabilistic context-free grammar (PCFG) over FlashBack decomposition trees. Where FlashBackGraph models a Markov chain on linearised tokens, FlashBackGrammar models the decomposition tree directly, with non-terminals indexed by the boundary characters of the current middle. Rules are shared across recursion depths that share a boundary pair, giving cross-depth motif reuse while still guaranteeing every generated tree is a canonical FlashBack decomposition.
All analytics (path-count series, entropy, Hill numbers, dynamic range) are computed exactly via small linear systems over the non-terminal space.
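
As a rough illustration of what "exact via a small linear system" can look like, the sketch below computes the derivational entropy of a toy PCFG. It uses the standard result that the per-non-terminal entropies H satisfy (I - M) H = h, where M is the expected-offspring matrix and h holds the local rule entropies. The toy grammar, the `solve` helper, and the variable names are all hypothetical; the source does not state that FlashBackGrammar uses exactly this formulation.

```python
import math

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def local_entropy(probs):
    """Shannon entropy (nats) of the rule choice at one non-terminal."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical toy grammar with non-terminals [S, A]:
#   S -> A A (0.3) | x (0.7)
#   A -> S   (0.5) | y (0.5)
# M[i][j] = expected number of copies of non-terminal j produced by rewriting i.
M = [[0.0, 0.6],
     [0.5, 0.0]]
h = [local_entropy([0.3, 0.7]), local_entropy([0.5, 0.5])]

# Solve (I - M) H = h for the entropy of the tree distribution rooted at each non-terminal.
I_minus_M = [[(1.0 if i == j else 0.0) - M[i][j] for j in range(2)] for i in range(2)]
H = solve(I_minus_M, h)
print(f"entropy from S: {H[0]:.6f} nats")
```

The system has one unknown per non-terminal, so its size depends on the number of non-terminals and not on the number of trees the grammar can generate.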
Constructor
FlashBackGrammar(
    sequences,
    *,
    abundances=None,
    abundance_mode='linear',
    smoothing=0.0,
    backoff='none'
)
Parameters
| Parameter | Type | Description |
|---|---|---|
| sequences | list[str] | Training CDR3 strings |
| abundances | list[int] | Per-sequence observed counts (default: all 1) |
| abundance_mode | str | How abundances become rule-count contributions: 'none', 'linear' (default), or 'log' |
| smoothing | float | Laplace alpha over observed rules at each non-terminal (default: 0.0) |
| backoff | str | Unseen-rule handling for pgen: 'none' (strict MLE) or 'gt' (Good-Turing). Analytics always use pure MLE; backoff only affects pgen. |
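
The interaction between abundance_mode and smoothing can be pictured with the following minimal sketch. It is not the library's implementation: the function names are hypothetical, and the 'log' transform is assumed to be log(1 + count), which the source does not specify. The smoothing step adds alpha only to rules already observed at a non-terminal, matching the description above.

```python
import math

def abundance_weight(count, mode="linear"):
    """Map an observed abundance to a rule-count contribution (assumed forms)."""
    if mode == "none":
        return 1.0                 # every sequence counts once
    if mode == "linear":
        return float(count)        # contribution proportional to abundance
    if mode == "log":
        return math.log1p(count)   # ASSUMPTION: log(1 + count); the real transform may differ
    raise ValueError(f"unknown abundance_mode: {mode!r}")

def smoothed_rule_probs(rule_counts, alpha=0.0):
    """Laplace-smooth the counts of the rules observed at one non-terminal."""
    total = sum(rule_counts.values()) + alpha * len(rule_counts)
    return {rule: (c + alpha) / total for rule, c in rule_counts.items()}

# Hypothetical accumulated counts for two rules at a single non-terminal.
counts = {"rule_a": 3.0, "rule_b": 1.0}
mle = smoothed_rule_probs(counts, alpha=0.0)      # strict maximum likelihood
smooth = smoothed_rule_probs(counts, alpha=1.0)   # add-one over observed rules
print(mle, smooth)
```

With alpha=0.0 the probabilities are plain relative frequencies; larger alpha pulls the observed rules at a non-terminal toward a uniform distribution without assigning mass to unseen rules.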
from_file (classmethod)
Build directly from a plain-text file (one sequence per line, or sequence<TAB>abundance).
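
The accepted file layout can be sketched with a small standalone parser. The helper name `parse_sequence_lines` is hypothetical, and details such as skipping blank lines are assumptions; the default abundance of 1 follows the constructor's documented default.

```python
def parse_sequence_lines(lines):
    """Split lines of 'sequence' or 'sequence<TAB>abundance' into two parallel lists."""
    sequences, abundances = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ASSUMPTION: blank lines are ignored
        seq, sep, count = line.partition("\t")
        sequences.append(seq.strip())
        # A missing or empty abundance column falls back to a count of 1.
        abundances.append(int(count) if sep and count.strip() else 1)
    return sequences, abundances

example = ["CASSLGQYF\t3\n", "CASSPGTEAFF\n", "\n"]
seqs, counts = parse_sequence_lines(example)
print(seqs, counts)
```

The two returned lists line up with the sequences and abundances constructor arguments.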
Core Methods
pgen
Generation probability of one or more sequences under the grammar. Honours the backoff mode chosen at construction.
- Parameters: sequence (str or list[str]), log (keyword-only bool)
- Returns: float (single) or np.ndarray (list)
pgen_mle
Strict maximum-likelihood probability, ignoring any backoff. Useful for comparison against pgen.
simulate
Generate n sequences by sampling decomposition trees from the grammar.
- Returns: SimulationResult
top_k_sequences
Highest- (or lowest-) probability sequences under the grammar.
- Returns: SimulationResult
posterior
Bayesian posterior grammar using this grammar as a prior. Returns a new FlashBackGrammar.
without
Return a new grammar with the contribution of sequences removed.
Exact Analytics
All values are computed exactly via linear systems over the non-terminal space.
| Method | Description |
|---|---|
| path_count(max_length) | Number of distinct trees up to max_length. |
| path_count_series(max_length) | Path counts per length as an np.ndarray. |
| length_pmf(max_length) | Probability mass over sequence lengths. |
| diversity_profile() | Full Shannon diversity breakdown. |
| effective_diversity() | Exact exp(H), equivalent to Hill D(1). |
| entropy() | Exact Shannon entropy of the grammar. |
| hill_number(alpha) | Exact Hill number D(alpha). |
| hill_numbers(orders) | Exact Hill numbers for multiple orders. |
| hill_curve(orders=None) | Dict of orders and values for plotting. |
| power_sum(alpha) | Exact power sum M(alpha). |
| pgen_dynamic_range(max_length=30) | Dynamic range of probabilities in orders of magnitude. |
| pgen_dynamic_range_detail(max_length=30) | Full dynamic-range breakdown. |
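
For orientation, the quantities in this table have simple definitions on an explicit probability vector, sketched below. The grammar does not enumerate sequences as this sketch does; the code only pins down the definitions assumed here: entropy in nats (consistent with exp(H) equalling D(1)), M(alpha) as the sum of p^alpha, D(alpha) = M(alpha)^(1/(1-alpha)), and dynamic range as log10(max p / min p). The last definition is an assumption about what "orders of magnitude" means; the function names mirror the method names for readability only.

```python
import math

def power_sum(probs, alpha):
    """M(alpha) = sum of p**alpha over the support."""
    return sum(p ** alpha for p in probs if p > 0)

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hill_number(probs, alpha):
    """Hill number D(alpha); the alpha -> 1 limit is exp(H)."""
    if abs(alpha - 1.0) < 1e-12:
        return math.exp(entropy(probs))
    return power_sum(probs, alpha) ** (1.0 / (1.0 - alpha))

def dynamic_range(probs):
    """ASSUMED definition: log10 of the ratio between largest and smallest probability."""
    positive = [p for p in probs if p > 0]
    return math.log10(max(positive) / min(positive))

# Hypothetical four-outcome distribution.
probs = [0.5, 0.25, 0.125, 0.125]
for order in (0, 1, 2):
    print(order, hill_number(probs, order))
print("dynamic range:", dynamic_range(probs))
```

D(0) counts the support, D(1) is the effective diversity exp(H), and D(2) is the inverse Simpson index, so a Hill curve over increasing orders is non-increasing.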
Grammar Inspection
| Member | Description |
|---|---|
| nonterminals | List of (a, z, is_leaf) non-terminal descriptors. |
| rules_at(a, z) | Rules at the non-terminal indexed by boundary characters a, z. |
| rules_at_start() | Rules at the start non-terminal. |
| top_rules(k=20, by='weight') | The k highest-weight (or highest-count) rules. |
| sentinel_rule_weights | Weights of the sentinel rules. |
IO
save
Save to .lzg binary format.
load (classmethod)
Load from a .lzg binary file.
Attributes
| Attribute | Type | Description |
|---|---|---|
| variant | str | Always 'flashback_grammar' |
| n_nonterminals | int | Number of non-terminals |
| n_rules | int | Total number of rules |
| n_internal_rules | int | Internal (branching) rules |
| n_leaf_rules | int | Leaf (terminal) rules |
| alphabet | list[str] | Characters seen in training |
| spectral_radius | float | Spectral radius of the rule system (consistency indicator) |
| is_consistent | bool | Whether the grammar defines a proper distribution |
| smoothing | float | Laplace alpha used |
| backoff_mode | str | 'none' or 'gt' |
| abundance_mode | str | 'none', 'linear', or 'log' |
| max_length | int | Longest training sequence length |
| length_counts | dict | {length: count} mapping |
| n_sequences | int | Number of training sequences (abundance-weighted) |
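
The link between spectral_radius and is_consistent can be illustrated with the standard PCFG criterion: build the expected-offspring matrix M, where M[i][j] is the expected number of copies of non-terminal j produced by rewriting non-terminal i, and check that its spectral radius is below 1. The sketch below is an assumption about what "rule system" refers to, not the library's code; the toy grammar and function names are hypothetical. It estimates the radius with power iteration on M + I, which avoids oscillation for periodic matrices.

```python
def mean_matrix(rules, nonterminals):
    """rules: {lhs: [(prob, rhs_symbols), ...]} -> expected-offspring matrix M."""
    index = {nt: i for i, nt in enumerate(nonterminals)}
    n = len(nonterminals)
    M = [[0.0] * n for _ in range(n)]
    for lhs, alternatives in rules.items():
        for prob, rhs in alternatives:
            for sym in rhs:
                if sym in index:          # terminals contribute nothing
                    M[index[lhs]][index[sym]] += prob
    return M

def spectral_radius(M, iters=500):
    """Power iteration on (M + I) for a non-negative matrix; returns rho(M)."""
    n = len(M)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [v[i] + sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)                      # v is scaled so that max(v) == 1
        v = [x / lam for x in w]
    return lam - 1.0                      # rho(M + I) = rho(M) + 1 for M >= 0

# Hypothetical toy grammar: S -> A A (0.3) | x (0.7);  A -> S (0.5) | y (0.5)
rules = {
    "S": [(0.3, ["A", "A"]), (0.7, ["x"])],
    "A": [(0.5, ["S"]), (0.5, ["y"])],
}
M = mean_matrix(rules, ["S", "A"])
rho = spectral_radius(M)
is_consistent = rho < 1.0
print(f"spectral radius = {rho:.6f}, consistent = {is_consistent}")
```

A radius below 1 means derivations terminate with probability 1, so the probabilities of all finite trees sum to 1.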
Decomposition utilities
The module-level functions flashback_decompose and flashback_reverse convert between a sequence and its FlashBack token decomposition. See Module Functions.