Lesson 1: Getting Started with `FlashBackGraph`¶

Welcome to the FlashBack track. This is the first of a short series of notebooks that takes you from zero to a working understanding of the FlashBackGraph class, the second graph family that ships with the LZGraphs package.

By the end of this lesson you will be able to:

Explain what a FlashBackGraph is, and how it differs from LZGraph.
Tokenize a CDR3 sequence with the FlashBack algorithm and read the tokens.
Build a FlashBackGraph from a list of sequences, or from a file.
Score the probability of a sequence under the graph using pgen().
Generate new sequences by sampling from the graph.
Detect anomalous sequences using SCALE (calibrate_scale + scale_score).
Save and reload a graph in the .lzg binary format.

If you have already worked through lzgraph/01_Getting_Started.ipynb, the structure of this notebook will feel familiar. We deliberately mirror the LZGraph lesson section-by-section so you can see exactly where the two graph families behave the same and where they diverge.

About the data. The cells below load data/ExampleData3.csv, the same 5,000 amino-acid CDR3 file used in the LZGraph lessons. FlashBackGraph does not carry V/J gene annotations, so we load the sequences only.

Why `FlashBackGraph` exists, in one paragraph¶

LZGraph encodes a repertoire using LZ76 tokens, which produces a DAG whose structure depends on each sequence's longest-prior-prefix relationships. That's powerful but the tokens are long, the graph has many nodes, and Hill numbers / entropy require Monte Carlo on arbitrarily large graphs.

FlashBack is a different tokenization, optimized for the regime where you want exact, closed-form analytics: Hill numbers, entropy and path counts that are computed directly via forward DP with no sampling. The resulting graph is also Markovian (every transition depends only on the current node), which keeps both the graph and the analytics cheap. The trade-offs: FlashBack does not support V/J gene annotations, and the per-sequence pgen() values have a different absolute scale than LZGraph's. Use FlashBackGraph when you need fast and exact distributional summaries over large repertoires.

In [1]:

Copied!

from LZGraphs import FlashBackGraph, flashback_decompose, flashback_reverse
import numpy as np
import csv
from LZGraphs import FlashBackGraph, flashback_decompose, flashback_reverse
import numpy as np
import csv

1. The FlashBack tokenizer¶

LZ76 tokenizes a string from the left, one prefix at a time. FlashBack takes a different path: it wraps the sequence with @ and $ sentinels (@CASSLEPSGGTDTQYF$) and then recursively peels matching runs from both ends simultaneously. Each peel emits one token. The algorithm stops when the middle has been fully consumed.

Each FlashBack token has the form <characters>_<flag>{<position>}:

<characters> is the run that was peeled
<flag> is 0 or 1 depending on which side or pair the peel came from
<position> is the recursion depth at which the peel happened

You usually don't need to read individual token internals to work with the package, but seeing the token stream once helps build intuition for why the resulting graph is Markovian: each token cleanly transitions into the next.

In [2]:

Copied!





seq = 'CASSLEPSGGTDTQYF'
tokens = flashback_decompose(seq)

print(f'Sequence: {seq}')
print(f'Tokens ({len(tokens)}):')
for t in tokens:
    print(f'  {t}')

# Sanity check: FlashBack is invertible
roundtrip = flashback_reverse(tokens)
print(f'\nRound-trip: {roundtrip}')
assert roundtrip == seq, 'FlashBack tokenization must be invertible'
seq = 'CASSLEPSGGTDTQYF'
tokens = flashback_decompose(seq)

print(f'Sequence: {seq}')
print(f'Tokens ({len(tokens)}):')
for t in tokens:
    print(f'  {t}')

# Sanity check: FlashBack is invertible
roundtrip = flashback_reverse(tokens)
print(f'\nRound-trip: {roundtrip}')
assert roundtrip == seq, 'FlashBack tokenization must be invertible'

Sequence: CASSLEPSGGTDTQYF
Tokens (8):
  @$_1{0}
  CF_1{1}
  AY_1{2}
  SSQ_2{3}
  LT_1{4}
  ED_1{5}
  PT_1{6}
  SGG_0{7}

Round-trip: CASSLEPSGGTDTQYF

Now compare the same sequence under LZ76 (which produces 13 tokens for 'CASSLEPSGGTDTQYF') with FlashBack (which produces 8). FlashBack is more aggressive at folding repetitive end-structure into single tokens. That's how it gets such a compact graph.

2. Building a tiny graph¶

Just like with LZGraph, the simplest way to understand the graph structure is to build one from a handful of toy sequences. The graph is Markovian: every node is a FlashBack token (plus the start/end sentinels) and every directed edge carries the empirical transition count seen in the training data.

In [3]:

Copied!





tiny = ['CASSLGIRRT', 'CASSLGYEQYF', 'CASSQETQYF']
tiny_graph = FlashBackGraph(tiny)
print(tiny_graph)
print(f'Nodes: {tiny_graph.n_nodes}  Edges: {tiny_graph.n_edges}')
print(f'Initial states: {tiny_graph.n_initial}')
print(f'Terminal states: {tiny_graph.n_terminal}')
print(f'Is DAG: {tiny_graph.is_dag}')
tiny = ['CASSLGIRRT', 'CASSLGYEQYF', 'CASSQETQYF']
tiny_graph = FlashBackGraph(tiny)
print(tiny_graph)
print(f'Nodes: {tiny_graph.n_nodes}  Edges: {tiny_graph.n_edges}')
print(f'Initial states: {tiny_graph.n_initial}')
print(f'Terminal states: {tiny_graph.n_terminal}')
print(f'Is DAG: {tiny_graph.is_dag}')

FlashBackGraph(nodes=12, edges=11)
Nodes: 12  Edges: 11
Initial states: 1
Terminal states: 3
Is DAG: True

For these three sequences the FlashBack graph is slightly smaller than the equivalent LZGraph (12 nodes vs 21 on the same input). Once you build at scale, the relationship reverses (more on this in section 3): FlashBack tokens are richer per-node, so the node count grows faster but the number of distinct walks through the graph grows much more slowly. That second property is the real win.

3. Building a real graph¶

Now let's load the same 5,000-sequence repertoire we used in the LZGraph lesson. FlashBackGraph takes a list of strings; abundances are optional. There are no v_genes / j_genes parameters.

In [4]:

Copied!





sequences = []
with open('../data/ExampleData3.csv') as f:
    for row in csv.DictReader(f):
        sequences.append(row['cdr3_amino_acid'])

print(f'{len(sequences):,d} sequences loaded')
print(f'First three: {sequences[:3]}')
sequences = []
with open('../data/ExampleData3.csv') as f:
    for row in csv.DictReader(f):
        sequences.append(row['cdr3_amino_acid'])

print(f'{len(sequences):,d} sequences loaded')
print(f'First three: {sequences[:3]}')

5,000 sequences loaded
First three: ['CASSGLAGSRSYNEQFF', 'CASSPTGGVYEQYF', 'CASSQTGESNQPQHF']

In [5]:

Copied!

graph = FlashBackGraph(sequences)
print(graph)
graph = FlashBackGraph(sequences)
print(graph)

FlashBackGraph(nodes=2826, edges=11899)

4. Graph anatomy¶

FlashBackGraph exposes the same basic shape properties as LZGraph, plus a few that are specific to it.

The headline number to look at is path_count, which is the exact number of distinct walks from @ to a terminal sentinel. The equivalent number on an LZGraph built from this same repertoire is in the hundreds of billions; on FlashBack it's a few tens of millions. That gap matters for downstream analytics: closed-form Hill numbers, entropy, and Shannon-style diversity reduce to forward DP over those paths, and the forward DP cost scales with the graph's edge count times the path enumeration. Smaller path_count means faster, more numerically stable analytics.

Note that the graph itself has more nodes than the LZGraph at this scale (2,826 vs 1,721), because FlashBack tokens encode richer contextual information per token. The structural compression FlashBack buys you is in the topology, not the raw node count.

In [6]:

Copied!





print(f'n_nodes:        {graph.n_nodes:,d}')
print(f'n_edges:        {graph.n_edges:,d}')
print(f'variant:        {graph.variant}')
print(f'is_dag:         {graph.is_dag}')
print(f'n_initial:      {graph.n_initial}')
print(f'n_terminal:     {graph.n_terminal}')
print(f'path_count:     {graph.path_count:,.0f}')
print(f'n_nodes:        {graph.n_nodes:,d}')
print(f'n_edges:        {graph.n_edges:,d}')
print(f'variant:        {graph.variant}')
print(f'is_dag:         {graph.is_dag}')
print(f'n_initial:      {graph.n_initial}')
print(f'n_terminal:     {graph.n_terminal}')
print(f'path_count:     {graph.path_count:,.0f}')

n_nodes:        2,826
n_edges:        11,899
variant:        flashback
is_dag:         True
n_initial:      1
n_terminal:     965
path_count:     93,288,322

In [7]:

Copied!

# A structural summary in dict form
graph.summary()
# A structural summary in dict form
graph.summary()

Out[7]:

{'n_nodes': 2826,
 'n_edges': 11899,
 'n_initial': 1,
 'n_terminal': 965,
 'max_out_degree': 159,
 'max_in_degree': 133,
 'n_isolates': 0,
 'is_dag': True}

5. Scoring a sequence with `pgen()`¶

pgen() on a FlashBackGraph is the analogue of pgen() on an LZGraph: give it a sequence (or a list of sequences) and get back the probability that the graph assigns to it, in log space by default.

Internally it works very differently. FlashBack tokenization is deterministic and unique, so for a given sequence there is exactly one walk through the graph. pgen() traces that walk and multiplies the renormalized edge weights along it. No simulation, no approximation.

In [8]:

Copied!





# Single sequence
seq = sequences[0]
lp = graph.pgen(seq)
p  = graph.pgen(seq, log=False)
print(f'{seq}')
print(f'  log P(gen) = {lp:.4f}')
print(f'  P(gen)     = {p:.3e}')
# Single sequence
seq = sequences[0]
lp = graph.pgen(seq)
p  = graph.pgen(seq, log=False)
print(f'{seq}')
print(f'  log P(gen) = {lp:.4f}')
print(f'  P(gen)     = {p:.3e}')

CASSGLAGSRSYNEQFF
  log P(gen) = -15.9716
  P(gen)     = 1.158e-07

In [9]:

Copied!





# Batch scoring: pass a list, get a numpy array back
batch = sequences[:8]
log_probs = graph.pgen(batch)
for s, lp in zip(batch, log_probs):
    print(f'{s:<25}  log P = {lp:.4f}')
# Batch scoring: pass a list, get a numpy array back
batch = sequences[:8]
log_probs = graph.pgen(batch)
for s, lp in zip(batch, log_probs):
    print(f'{s:<25}  log P = {lp:.4f}')

CASSGLAGSRSYNEQFF          log P = -15.9716
CASSPTGGVYEQYF             log P = -12.1356
CASSQTGESNQPQHF            log P = -16.4738
CASSKTDISSPLHF             log P = -10.8886
CASSLAGHSGGAQRGNEQFF       log P = -15.7519
CASSPQDRPNYGYTF            log P = -14.1788
CASSSQDRVTQYF              log P = -13.8211
CASSSSGGATEQYF             log P = -11.3128

Note: The absolute scale of pgen() on a FlashBackGraph is different from pgen() on an LZGraph built from the same data. FlashBack produces fewer, larger tokens per sequence, so each walk has fewer multiplicative steps. Use these probabilities for ranking sequences against each other within the same graph, or for direct distributional analysis. Do not compare log-pgen values across different graph families.

6. Sampling new sequences¶

simulate(n) runs n independent random walks from @ to a terminal $, sampling each transition by the renormalized edge weights. Because the graph is Markovian, this is a single forward sweep per walk.

In [10]:

Copied!

sim = graph.simulate(10, seed=42)
for s in sim:
    print(s)
sim = graph.simulate(10, seed=42)
for s in sim:
    print(s)

CAWSADRGNTIYF
CAWSVQGESPQYF
CASSPRDRPYEQHF
CASSFREQGNQPLHF
CASSFGVYGYTF
CASRWGRAGYTF
CASSRDKSTDTQYF
CASSERDGVYGYTF
CASRPHLGGSPQYF
CARGGGGNYGYTF

In [11]:

Copied!





# The returned object is a SimulationResult; same shape as for LZGraph
print(f'n simulated:      {len(sim)}')
print(f'log_probs[:5]:    {sim.log_probs[:5]}')
print(f'tokens per walk:  {sim.n_tokens[:5]}')
print(f'mean log P:       {sim.log_probs.mean():.4f}')
print(f'mean #tokens:     {sim.n_tokens.mean():.2f}')
# The returned object is a SimulationResult; same shape as for LZGraph
print(f'n simulated:      {len(sim)}')
print(f'log_probs[:5]:    {sim.log_probs[:5]}')
print(f'tokens per walk:  {sim.n_tokens[:5]}')
print(f'mean log P:       {sim.log_probs.mean():.4f}')
print(f'mean #tokens:     {sim.n_tokens.mean():.2f}')

n simulated:      10
log_probs[:5]:    [-14.49633766 -11.5858477  -13.7537104  -14.7077029  -12.10459348]
tokens per walk:  [8 8 8 8 7]
mean log P:       -12.5479
mean #tokens:     7.60

7. SCALE: the self-calibrated anomaly score¶

SCALE scores how unexpected a sequence is under the graph. Raw -log pgen already flags improbable sequences, but it grows with sequence length, so a fixed cutoff is biased toward long sequences. SCALE removes that dependence by calibrating against the graph's own simulated output.

Two steps:

calibrate_scale() simulates sequences from the graph and records, per length, the median and IQR of -log pgen. This is the reusable calibration cache (a ScaleCalibration, which you can save() / load()).
scale_score(seq, calibration) returns the length-invariant score (-log pgen(seq) - median[len]) / IQR[len]. Higher means more anomalous.

In [12]:

Copied!





# Calibrate once against the graph (simulate, then per-length median/IQR of -log pgen)
calibration = graph.calibrate_scale(n_sim=50_000, seed=42)

# In-distribution sequence: low SCALE score
in_dist = sequences[0]
print(f'In-distribution: {in_dist}')
print(f'SCALE = {graph.scale_score(in_dist, calibration):.3f}')
# Calibrate once against the graph (simulate, then per-length median/IQR of -log pgen)
calibration = graph.calibrate_scale(n_sim=50_000, seed=42)

# In-distribution sequence: low SCALE score
in_dist = sequences[0]
print(f'In-distribution: {in_dist}')
print(f'SCALE = {graph.scale_score(in_dist, calibration):.3f}')

In-distribution: CASSGLAGSRSYNEQFF
SCALE = -0.031

In [13]:

Copied!





# Out-of-distribution sequence: high SCALE score
out_dist = 'WWWWWWWWWWWWWW'
print(f'Out-of-distribution: {out_dist}')
print(f'SCALE = {graph.scale_score(out_dist, calibration):.3f}')
# Out-of-distribution sequence: high SCALE score
out_dist = 'WWWWWWWWWWWWWW'
print(f'Out-of-distribution: {out_dist}')
print(f'SCALE = {graph.scale_score(out_dist, calibration):.3f}')

Out-of-distribution: WWWWWWWWWWWWWW
SCALE = 238.365

In [14]:

Copied!





# Batch: pass a list, get a numpy array of scores
scores = graph.scale_score(sequences[:5], calibration)
for s, sc in zip(sequences[:5], scores):
    print(f'{s:<25}  SCALE={sc:.4f}')
# Batch: pass a list, get a numpy array of scores
scores = graph.scale_score(sequences[:5], calibration)
for s, sc in zip(sequences[:5], scores):
    print(f'{s:<25}  SCALE={sc:.4f}')

CASSGLAGSRSYNEQFF          SCALE=-0.0309
CASSPTGGVYEQYF             SCALE=-0.5103
CASSQTGESNQPQHF            SCALE=0.7375
CASSKTDISSPLHF             SCALE=-0.9492
CASSLAGHSGGAQRGNEQFF       SCALE=-0.5452

8. Building from a file¶

For repertoires that don't fit comfortably in memory as a Python list, FlashBackGraph.from_file reads sequences directly from disk in a streaming fashion. The file format is one sequence per line, or sequence<TAB>count for abundance-weighted builds (gzipped inputs are detected by extension).

In [15]:

Copied!





# Write a tiny example file so the cell is self-contained
tmp_path = 'fb_example.tsv'
with open(tmp_path, 'w') as f:
    for s in sequences[:1000]:
        f.write(s + '\n')

g_file = FlashBackGraph.from_file(tmp_path)
print(g_file)
print(f'nodes={g_file.n_nodes}  edges={g_file.n_edges}')

import os; os.remove(tmp_path)
# Write a tiny example file so the cell is self-contained
tmp_path = 'fb_example.tsv'
with open(tmp_path, 'w') as f:
    for s in sequences[:1000]:
        f.write(s + '\n')

g_file = FlashBackGraph.from_file(tmp_path)
print(g_file)
print(f'nodes={g_file.n_nodes}  edges={g_file.n_edges}')

import os; os.remove(tmp_path)

FlashBackGraph(nodes=1300, edges=3089)
nodes=1300  edges=3089

For truly huge inputs, an even more flexible API is FlashBackStream, which lets you push sequences in batches, take periodic snapshots, and abort cleanly if you decide partway through that you don't need the full graph. We cover that in Lesson 3 (Advanced Usage).

9. Save and reload¶

Both graph families share the same .lzg binary format on disk. A FlashBackGraph saved with .save() can only be loaded back with FlashBackGraph.load(); the format records the variant so that loading under the wrong class would fail rather than silently produce nonsense.

In [16]:

Copied!





graph.save('my_fbg.lzg')

loaded = FlashBackGraph.load('my_fbg.lzg')
print(f'Loaded: {loaded}')
print(f'Match: nodes={loaded.n_nodes == graph.n_nodes}, '
      f'edges={loaded.n_edges == graph.n_edges}')

import os; os.remove('my_fbg.lzg')
graph.save('my_fbg.lzg')

loaded = FlashBackGraph.load('my_fbg.lzg')
print(f'Loaded: {loaded}')
print(f'Match: nodes={loaded.n_nodes == graph.n_nodes}, '
      f'edges={loaded.n_edges == graph.n_edges}')

import os; os.remove('my_fbg.lzg')

Loaded: FlashBackGraph(nodes=2826, edges=11899)
Match: nodes=True, edges=True

What you learned¶

You now know how to:

Tokenize a sequence with FlashBack and reverse the tokenization
Build a FlashBackGraph from a list of sequences, or from a file
Inspect the graph's anatomy (n_nodes, n_edges, path_count, etc.)
Score sequences with pgen() (single and batch, log and linear)
Sample new sequences with simulate()
Detect anomalous sequences with SCALE (calibrate_scale + scale_score)
Persist and reload graphs via .lzg

You also know the key trade-offs versus LZGraph: FlashBack gives you exact, closed-form analytics and a more compact graph, but no V/J gene support and a different absolute scale for pgen().

Next: Lesson 2 (02_Analytics_and_Diversity.ipynb) introduces diversity metrics on FlashBackGraph, which are computed in closed form and run noticeably faster than on LZGraph. We'll also cover pgen_moments, the exact moments of the log-PGEN distribution under the Markovian model.

Lesson 1: Getting Started with FlashBackGraph¶

Why FlashBackGraph exists, in one paragraph¶