Lesson 1: Getting Started with FlashBackGraph¶
Welcome to the FlashBack track. This is the first of a short series of
notebooks that takes you from zero to a working understanding of the
FlashBackGraph class, the second graph family that ships with the
LZGraphs package.
By the end of this lesson you will be able to:
- Explain what a
FlashBackGraphis, and how it differs fromLZGraph. - Tokenize a CDR3 sequence with the FlashBack algorithm and read the tokens.
- Build a
FlashBackGraphfrom a list of sequences, or from a file. - Score the probability of a sequence under the graph using
pgen(). - Generate new sequences by sampling from the graph.
- Detect anomalous sequences using SCALE (
calibrate_scale+scale_score). - Save and reload a graph in the
.lzgbinary format.
If you have already worked through lzgraph/01_Getting_Started.ipynb,
the structure of this notebook will feel familiar. We deliberately
mirror the LZGraph lesson section-by-section so you can see exactly
where the two graph families behave the same and where they diverge.
About the data. The cells below load
data/ExampleData3.csv, the same 5,000 amino-acid CDR3 file used in the LZGraph lessons.FlashBackGraphdoes not carry V/J gene annotations, so we load the sequences only.
Why FlashBackGraph exists, in one paragraph¶
LZGraph encodes a repertoire using LZ76 tokens, which produces a DAG whose structure depends on each sequence's longest-prior-prefix relationships. That's powerful but the tokens are long, the graph has many nodes, and Hill numbers / entropy require Monte Carlo on arbitrarily large graphs.
FlashBack is a different tokenization, optimized for the regime where
you want exact, closed-form analytics: Hill numbers, entropy and
path counts that are computed directly via forward DP with no sampling.
The resulting graph is also Markovian (every transition depends only
on the current node), which keeps both the graph and the analytics
cheap. The trade-offs: FlashBack does not support V/J gene annotations,
and the per-sequence pgen() values have a different absolute scale
than LZGraph's. Use FlashBackGraph when you need fast and exact
distributional summaries over large repertoires.
from LZGraphs import FlashBackGraph, flashback_decompose, flashback_reverse
import numpy as np
import csv
1. The FlashBack tokenizer¶
LZ76 tokenizes a string from the left, one prefix at a time. FlashBack
takes a different path: it wraps the sequence with @ and $
sentinels (@CASSLEPSGGTDTQYF$) and then recursively peels matching
runs from both ends simultaneously. Each peel emits one token. The
algorithm stops when the middle has been fully consumed.
Each FlashBack token has the form <characters>_<flag>{<position>}:
<characters>is the run that was peeled<flag>is0or1depending on which side or pair the peel came from<position>is the recursion depth at which the peel happened
You usually don't need to read individual token internals to work with the package, but seeing the token stream once helps build intuition for why the resulting graph is Markovian: each token cleanly transitions into the next.
seq = 'CASSLEPSGGTDTQYF'
tokens = flashback_decompose(seq)
print(f'Sequence: {seq}')
print(f'Tokens ({len(tokens)}):')
for t in tokens:
print(f' {t}')
# Sanity check: FlashBack is invertible
roundtrip = flashback_reverse(tokens)
print(f'\nRound-trip: {roundtrip}')
assert roundtrip == seq, 'FlashBack tokenization must be invertible'
Sequence: CASSLEPSGGTDTQYF
Tokens (8):
@$_1{0}
CF_1{1}
AY_1{2}
SSQ_2{3}
LT_1{4}
ED_1{5}
PT_1{6}
SGG_0{7}
Round-trip: CASSLEPSGGTDTQYF
Now compare the same sequence under LZ76 (which produces 13 tokens
for 'CASSLEPSGGTDTQYF') with FlashBack (which produces 8). FlashBack
is more aggressive at folding repetitive end-structure into single
tokens. That's how it gets such a compact graph.
2. Building a tiny graph¶
Just like with LZGraph, the simplest way to understand the graph
structure is to build one from a handful of toy sequences. The graph
is Markovian: every node is a FlashBack token (plus the start/end
sentinels) and every directed edge carries the empirical transition
count seen in the training data.
tiny = ['CASSLGIRRT', 'CASSLGYEQYF', 'CASSQETQYF']
tiny_graph = FlashBackGraph(tiny)
print(tiny_graph)
print(f'Nodes: {tiny_graph.n_nodes} Edges: {tiny_graph.n_edges}')
print(f'Initial states: {tiny_graph.n_initial}')
print(f'Terminal states: {tiny_graph.n_terminal}')
print(f'Is DAG: {tiny_graph.is_dag}')
FlashBackGraph(nodes=12, edges=11) Nodes: 12 Edges: 11 Initial states: 1 Terminal states: 3 Is DAG: True
For these three sequences the FlashBack graph is slightly smaller than the equivalent LZGraph (12 nodes vs 21 on the same input). Once you build at scale, the relationship reverses (more on this in section 3): FlashBack tokens are richer per-node, so the node count grows faster but the number of distinct walks through the graph grows much more slowly. That second property is the real win.
3. Building a real graph¶
Now let's load the same 5,000-sequence repertoire we used in the
LZGraph lesson. FlashBackGraph takes a list of strings; abundances
are optional. There are no v_genes / j_genes parameters.
sequences = []
with open('../data/ExampleData3.csv') as f:
for row in csv.DictReader(f):
sequences.append(row['cdr3_amino_acid'])
print(f'{len(sequences):,d} sequences loaded')
print(f'First three: {sequences[:3]}')
5,000 sequences loaded First three: ['CASSGLAGSRSYNEQFF', 'CASSPTGGVYEQYF', 'CASSQTGESNQPQHF']
graph = FlashBackGraph(sequences)
print(graph)
FlashBackGraph(nodes=2826, edges=11899)
4. Graph anatomy¶
FlashBackGraph exposes the same basic shape properties as LZGraph,
plus a few that are specific to it.
The headline number to look at is path_count, which is the exact
number of distinct walks from @ to a terminal sentinel. The
equivalent number on an LZGraph built from this same repertoire is in
the hundreds of billions; on FlashBack it's a few tens of millions.
That gap matters for downstream analytics: closed-form Hill numbers,
entropy, and Shannon-style diversity reduce to forward DP over those
paths, and the forward DP cost scales with the graph's edge count
times the path enumeration. Smaller path_count means faster, more
numerically stable analytics.
Note that the graph itself has more nodes than the LZGraph at this scale (2,826 vs 1,721), because FlashBack tokens encode richer contextual information per token. The structural compression FlashBack buys you is in the topology, not the raw node count.
print(f'n_nodes: {graph.n_nodes:,d}')
print(f'n_edges: {graph.n_edges:,d}')
print(f'variant: {graph.variant}')
print(f'is_dag: {graph.is_dag}')
print(f'n_initial: {graph.n_initial}')
print(f'n_terminal: {graph.n_terminal}')
print(f'path_count: {graph.path_count:,.0f}')
n_nodes: 2,826 n_edges: 11,899 variant: flashback is_dag: True n_initial: 1 n_terminal: 965 path_count: 93,288,322
# A structural summary in dict form
graph.summary()
{'n_nodes': 2826,
'n_edges': 11899,
'n_initial': 1,
'n_terminal': 965,
'max_out_degree': 159,
'max_in_degree': 133,
'n_isolates': 0,
'is_dag': True}
5. Scoring a sequence with pgen()¶
pgen() on a FlashBackGraph is the analogue of pgen() on an
LZGraph: give it a sequence (or a list of sequences) and get back
the probability that the graph assigns to it, in log space by default.
Internally it works very differently. FlashBack tokenization is
deterministic and unique, so for a given sequence there is exactly one
walk through the graph. pgen() traces that walk and multiplies the
renormalized edge weights along it. No simulation, no approximation.
# Single sequence
seq = sequences[0]
lp = graph.pgen(seq)
p = graph.pgen(seq, log=False)
print(f'{seq}')
print(f' log P(gen) = {lp:.4f}')
print(f' P(gen) = {p:.3e}')
CASSGLAGSRSYNEQFF log P(gen) = -15.9716 P(gen) = 1.158e-07
# Batch scoring: pass a list, get a numpy array back
batch = sequences[:8]
log_probs = graph.pgen(batch)
for s, lp in zip(batch, log_probs):
print(f'{s:<25} log P = {lp:.4f}')
CASSGLAGSRSYNEQFF log P = -15.9716 CASSPTGGVYEQYF log P = -12.1356 CASSQTGESNQPQHF log P = -16.4738 CASSKTDISSPLHF log P = -10.8886 CASSLAGHSGGAQRGNEQFF log P = -15.7519 CASSPQDRPNYGYTF log P = -14.1788 CASSSQDRVTQYF log P = -13.8211 CASSSSGGATEQYF log P = -11.3128
Note: The absolute scale of
pgen()on a FlashBackGraph is different frompgen()on an LZGraph built from the same data. FlashBack produces fewer, larger tokens per sequence, so each walk has fewer multiplicative steps. Use these probabilities for ranking sequences against each other within the same graph, or for direct distributional analysis. Do not compare log-pgen values across different graph families.
6. Sampling new sequences¶
simulate(n) runs n independent random walks from @ to a terminal
$, sampling each transition by the renormalized edge weights. Because
the graph is Markovian, this is a single forward sweep per walk.
sim = graph.simulate(10, seed=42)
for s in sim:
print(s)
CAWSADRGNTIYF CAWSVQGESPQYF CASSPRDRPYEQHF CASSFREQGNQPLHF CASSFGVYGYTF CASRWGRAGYTF CASSRDKSTDTQYF CASSERDGVYGYTF CASRPHLGGSPQYF CARGGGGNYGYTF
# The returned object is a SimulationResult; same shape as for LZGraph
print(f'n simulated: {len(sim)}')
print(f'log_probs[:5]: {sim.log_probs[:5]}')
print(f'tokens per walk: {sim.n_tokens[:5]}')
print(f'mean log P: {sim.log_probs.mean():.4f}')
print(f'mean #tokens: {sim.n_tokens.mean():.2f}')
n simulated: 10 log_probs[:5]: [-14.49633766 -11.5858477 -13.7537104 -14.7077029 -12.10459348] tokens per walk: [8 8 8 8 7] mean log P: -12.5479 mean #tokens: 7.60
7. SCALE: the self-calibrated anomaly score¶
SCALE scores how unexpected a sequence is under the graph. Raw -log pgen
already flags improbable sequences, but it grows with sequence length, so a
fixed cutoff is biased toward long sequences. SCALE removes that dependence by
calibrating against the graph's own simulated output.
Two steps:
calibrate_scale()simulates sequences from the graph and records, per length, the median and IQR of-log pgen. This is the reusable calibration cache (aScaleCalibration, which you cansave()/load()).scale_score(seq, calibration)returns the length-invariant score(-log pgen(seq) - median[len]) / IQR[len]. Higher means more anomalous.
# Calibrate once against the graph (simulate, then per-length median/IQR of -log pgen)
calibration = graph.calibrate_scale(n_sim=50_000, seed=42)
# In-distribution sequence: low SCALE score
in_dist = sequences[0]
print(f'In-distribution: {in_dist}')
print(f'SCALE = {graph.scale_score(in_dist, calibration):.3f}')
In-distribution: CASSGLAGSRSYNEQFF SCALE = -0.031
# Out-of-distribution sequence: high SCALE score
out_dist = 'WWWWWWWWWWWWWW'
print(f'Out-of-distribution: {out_dist}')
print(f'SCALE = {graph.scale_score(out_dist, calibration):.3f}')
Out-of-distribution: WWWWWWWWWWWWWW SCALE = 238.365
# Batch: pass a list, get a numpy array of scores
scores = graph.scale_score(sequences[:5], calibration)
for s, sc in zip(sequences[:5], scores):
print(f'{s:<25} SCALE={sc:.4f}')
CASSGLAGSRSYNEQFF SCALE=-0.0309 CASSPTGGVYEQYF SCALE=-0.5103 CASSQTGESNQPQHF SCALE=0.7375 CASSKTDISSPLHF SCALE=-0.9492 CASSLAGHSGGAQRGNEQFF SCALE=-0.5452
8. Building from a file¶
For repertoires that don't fit comfortably in memory as a Python list,
FlashBackGraph.from_file reads sequences directly from disk in a
streaming fashion. The file format is one sequence per line, or
sequence<TAB>count for abundance-weighted builds (gzipped inputs are
detected by extension).
# Write a tiny example file so the cell is self-contained
tmp_path = 'fb_example.tsv'
with open(tmp_path, 'w') as f:
for s in sequences[:1000]:
f.write(s + '\n')
g_file = FlashBackGraph.from_file(tmp_path)
print(g_file)
print(f'nodes={g_file.n_nodes} edges={g_file.n_edges}')
import os; os.remove(tmp_path)
FlashBackGraph(nodes=1300, edges=3089) nodes=1300 edges=3089
For truly huge inputs, an even more flexible API is FlashBackStream,
which lets you push sequences in batches, take periodic snapshots, and
abort cleanly if you decide partway through that you don't need the full
graph. We cover that in Lesson 3 (Advanced Usage).
9. Save and reload¶
Both graph families share the same .lzg binary format on disk. A
FlashBackGraph saved with .save() can only be loaded back with
FlashBackGraph.load(); the format records the variant so that loading
under the wrong class would fail rather than silently produce nonsense.
graph.save('my_fbg.lzg')
loaded = FlashBackGraph.load('my_fbg.lzg')
print(f'Loaded: {loaded}')
print(f'Match: nodes={loaded.n_nodes == graph.n_nodes}, '
f'edges={loaded.n_edges == graph.n_edges}')
import os; os.remove('my_fbg.lzg')
Loaded: FlashBackGraph(nodes=2826, edges=11899) Match: nodes=True, edges=True
What you learned¶
You now know how to:
- Tokenize a sequence with FlashBack and reverse the tokenization
- Build a
FlashBackGraphfrom a list of sequences, or from a file - Inspect the graph's anatomy (
n_nodes,n_edges,path_count, etc.) - Score sequences with
pgen()(single and batch, log and linear) - Sample new sequences with
simulate() - Detect anomalous sequences with SCALE (
calibrate_scale+scale_score) - Persist and reload graphs via
.lzg
You also know the key trade-offs versus LZGraph: FlashBack gives you
exact, closed-form analytics and a more compact graph, but no V/J gene
support and a different absolute scale for pgen().
Next: Lesson 2 (02_Analytics_and_Diversity.ipynb) introduces
diversity metrics on FlashBackGraph, which are computed in closed
form and run noticeably faster than on LZGraph. We'll also cover
pgen_moments, the exact moments of the log-PGEN distribution under
the Markovian model.