Skip to content

CLI Reference

The lzg command-line tool provides fast, scriptable access to every major LZGraphs operation: building graphs, scoring sequences, simulating repertoires, measuring diversity, and more. Every command reads from files (or stdin) and writes tab-separated text to stdout, so it slots naturally into Unix pipelines.


Installation and verification

After installing LZGraphs (pip install LZGraphs), the lzg entry point is available system-wide.

# Confirm it is installed
lzg --version
lzg 3.0.2
# See all available commands
lzg --help
usage: lzg [-h] [--version] [-q] {build,info,score,simulate,diversity,compare,decompose,saturation,predict,posterior} ...

LZGraphs — LZ76 compression graphs for immune repertoire analysis

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -q, --quiet           suppress progress

commands:
  {build,info,score,simulate,diversity,compare,decompose,saturation,predict,posterior}
    build               Build a graph from sequences
    info                Inspect a saved graph
    score               Compute LZPGEN for sequences
    simulate            Generate sequences from a graph
    diversity           Diversity metrics
    compare             Compare two repertoires
    decompose           LZ76-decompose sequences
    saturation          Node/edge saturation curve
    predict             Occupancy predictions
    posterior           Bayesian posterior update

Global flags

Every subcommand accepts the global -q / --quiet flag, which suppresses the informational messages printed to stderr. Useful when piping stdout into another program.


Input file formats

lzg auto-detects the file format from the first line. Three formats are supported:

Plain text -- one sequence per line

sequences.txt
CASSLAPGATNEKLFF
CASSLVGGPYEQYF
CASSQEAGGTDTQYF

Plain text with abundances

Tab-separate each sequence from its integer count:

sequences_abd.txt
CASSLAPGATNEKLFF    5
CASSLVGGPYEQYF  12
CASSQEAGGTDTQYF 1

TSV / CSV with header

Any .tsv, .csv, or .gz-compressed tabular file with a header row. Column names are auto-detected per variant:

Variant Auto-detected sequence columns (first match wins)
aap junction_aa, cdr3_amino_acid, cdr3_aa, aminoacid
ndp junction, cdr3_rearrangement, cdr3_nt, nucleotide
naive junction_aa, cdr3_amino_acid, junction, cdr3_rearrangement

Fallback columns tried for all variants: sequence, cdr3, seq.

Gene columns default to v_call and j_call (AIRR standard). Abundance defaults to duplicate_count.

repertoire.tsv
junction_aa v_call  j_call  duplicate_count
CASSLAPGATNEKLFF    TRBV5-1*01  TRBJ1-4*01  5
CASSLVGGPYEQYF  TRBV28*01   TRBJ2-7*01  12
CASSQEAGGTDTQYF TRBV4-2*01  TRBJ2-3*01  1

Reading from stdin

Commands that accept an input positional argument default to - (stdin) when omitted, so you can pipe data directly:

cat sequences.txt | lzg decompose

Gzip support

Any input file ending in .gz is transparently decompressed.


Commands

build -- Build a graph from sequences

Read sequences (and optionally V/J genes and abundances) from a tabular or plain-text file, construct an LZ76 compression graph, and save it to a compact .lzg binary.

Usage

lzg build INPUT -o OUTPUT [options]

Arguments and options

Flag Description
INPUT Input file (.txt, .tsv, .csv, .gz, or - for stdin)
-o, --output (required) Output .lzg file path
-V, --variant Graph variant: aap, ndp, or naive (default: aap)
-s, --seq-column Sequence column name (default: auto-detect)
--v-column V gene column name (default: v_call)
--j-column J gene column name (default: j_call)
-a, --abundance-column Abundance / count column name
--no-genes Ignore gene columns even if present
--smoothing Laplace smoothing constant (default: 0.0)

Example

lzg build repertoire.tsv -o repertoire.lzg -V aap
[build] 48312 sequences read (14 V genes, 13 J genes) (0.31s)
[build] 9842 nodes, 27531 edges (1.47s)
[build] saved repertoire.lzg (2104.3 KB)

Choosing a variant

Use aap for amino-acid CDR3 sequences (the most common case), ndp for nucleotide sequences with positional encoding, and naive for raw character-level decomposition of any string.


info -- Inspect a saved graph

Print a structured summary of a .lzg file: graph size, diversity profile, generation probability statistics, and optionally V/J gene marginals.

Usage

lzg info GRAPH [options]

Arguments and options

Flag Description
GRAPH Path to a .lzg file
--genes Include V/J gene marginal probabilities
--all Print everything (genes + full Hill curve)
--json Output as JSON instead of tagged text

Example

lzg info repertoire.lzg
# lzg info v3.0.2 — repertoire.lzg
GR  variant aap
GR  nodes   9842
GR  edges   27531
GR  initial_states  387
GR  terminal_states 214
GR  is_dag  yes
GR  has_gene_data   yes
GR  path_count  3.18204e+08
DV  effective_diversity 4021.3312
DV  entropy_nats    8.2993
DV  entropy_bits    11.9729
DV  uniformity  0.9032
PR  pgen_mean   -18.4210
PR  pgen_std    3.2714
PR  dynamic_range_decades   12.6831
PR  is_proper   yes
# Gene marginals included
lzg info repertoire.lzg --genes

Machine-readable output

The tagged text format (PREFIX<tab>KEY<tab>VALUE) is easy to parse with awk or cut. For structured consumption, use --json:

lzg info repertoire.lzg --json | python3 -m json.tool

score -- Compute LZPGEN for sequences

Score one or more sequences against a graph, producing the log-probability (LZPGEN) of each sequence under the graph's generative model.

Usage

lzg score GRAPH [INPUT] [options]

Arguments and options

Flag Description
GRAPH Path to a .lzg file
INPUT Sequence file (default: stdin)
-s, --seq-column Sequence column name (default: auto-detect)
-o, --output Output file (default: stdout)
--prob Output raw probability instead of log-probability
--append Pass through input columns (not just sequence)
--json JSON output

Example

lzg score repertoire.lzg query_sequences.txt
sequence    lzpgen
CASSLAPGATNEKLFF    -14.831204
CASSLVGGPYEQYF  -16.229417
CASSQEAGGTDTQYF -19.003851
[score] scored 3 sequences
# Pipe from stdin, output probabilities
echo "CASSLAPGATNEKLFF" | lzg score repertoire.lzg --prob
sequence    pgen
CASSLAPGATNEKLFF    0.000000
[score] scored 1 sequences

Note

The --prob flag exponentiates the log-probability. For very rare sequences the probability will be indistinguishable from zero in fixed-precision output -- prefer log-probabilities for downstream analysis.


simulate -- Generate sequences from a graph

Sample new sequences from the graph's learned transition model using a fast C extension (or optimized Python fallback). Optionally constrain by V/J gene.

Usage

lzg simulate GRAPH -n COUNT [options]

Arguments and options

Flag Description
GRAPH Path to a .lzg file
-n, --count (required) Number of sequences to generate
-o, --output Output file (default: stdout)
--seed RNG seed for reproducibility
--v-gene Constrain to a specific V gene
--j-gene Constrain to a specific J gene
--sample-genes Sample V/J genes from the joint distribution
--with-details Include lzpgen and n_tokens columns
--json JSON output

Example

lzg simulate repertoire.lzg -n 5 --seed 42
CASSLGQAYEQYF
CASSPAGGTEAFF
CASSQDRANYGYTF
CASSFRGGNTIYF
CASSLEETQYF
[simulate] generated 5 sequences
# With generation probability and token count
lzg simulate repertoire.lzg -n 3 --seed 42 --with-details
sequence    lzpgen  n_tokens
CASSLGQAYEQYF   -15.203419  5
CASSPAGGTEAFF   -17.841002  6
CASSQDRANYGYTF  -16.558134  6
[simulate] generated 3 sequences

Gene-constrained generation

If the graph was built with gene data, you can condition on specific genes:

lzg simulate repertoire.lzg -n 1000 --v-gene "TRBV5-1*01" --seed 7

Or sample V/J pairs from the joint distribution with --sample-genes, which attaches gene labels to each simulated sequence.


diversity -- Diversity and structural statistics

Compute Hill diversity numbers, effective diversity, Shannon entropy, uniformity, and generation-probability moments for a graph.

Usage

lzg diversity GRAPH [options]

Arguments and options

Flag Description
GRAPH Path to a .lzg file
--hill Comma-separated Hill orders (default: 0,1,2,5,inf)
--json JSON output

Example

lzg diversity repertoire.lzg
# lzg diversity v3.0.2
HL  0   318204000.0000
HL  1   4021.3312
HL  2   1847.5590
HL  5   623.4102
HL  inf 42.1837
DV  effective_diversity 4021.3312
DV  entropy_nats    8.2993
DV  entropy_bits    11.9729
DV  uniformity  0.9032
DR  dynamic_range_decades   12.6831
DR  pgen_mean   -18.4210
DR  pgen_std    3.2714

Hill number interpretation

  • Order 0: Total richness (number of distinct achievable sequences).
  • Order 1: Exponential of Shannon entropy -- the "effective" number of equally-likely sequences.
  • Order 2: Inverse Simpson concentration.
  • Order inf: Inverse of the maximum probability -- dominated by the single most likely sequence.

compare -- Jensen-Shannon divergence between two graphs

Measure the distributional distance between two repertoire graphs using Jensen-Shannon divergence, along with structural overlap statistics (shared nodes/edges, Jaccard indices).

Usage

lzg compare GRAPH_A GRAPH_B [options]

Arguments and options

Flag Description
GRAPH_A First .lzg file
GRAPH_B Second .lzg file
--json JSON output

Example

lzg compare healthy.lzg disease.lzg
# lzg compare v3.0.2 — healthy.lzg vs disease.lzg
CP  jsd 0.142837
CP  nodes_a 9842
CP  nodes_b 11204
CP  nodes_shared    6318
CP  edges_a 27531
CP  edges_b 30819
CP  edges_shared    14207
CP  jaccard_nodes   0.4295
CP  jaccard_edges   0.3217

Note

JSD is symmetric and bounded in [0, 1]. A value near 0 means the two graphs encode nearly identical generation-probability distributions.


decompose -- LZ76-decompose sequences

Print the Lempel-Ziv 76 decomposition of each input sequence. Useful for understanding how the algorithm tokenizes a sequence before graph construction.

Usage

lzg decompose [INPUT] [options]

Arguments and options

Flag Description
INPUT Sequence file (default: stdin)
-s, --seq-column Sequence column name (default: auto-detect)
-o, --output Output file (default: stdout)
-d, --delimiter Token delimiter in output (default: \|)
--json JSON output

Example

echo "CASSLAPGATNEKLFF" | lzg decompose
sequence    tokens  n_tokens
CASSLAPGATNEKLFF    C|A|S|SL|AP|G|AT|NE|KL|FF   10
# Use a different delimiter
echo "CASSLAPGATNEKLFF" | lzg decompose -d " "
sequence    tokens  n_tokens
CASSLAPGATNEKLFF    C A S SL AP G AT NE KL FF   10

Tip

Pipe a file of sequences through decompose to get a quick sense of complexity. Sequences with fewer tokens (relative to their length) are more repetitive and will contribute fewer unique nodes to the graph.


saturation -- Node/edge saturation curve

Track how the number of unique nodes and edges grows as sequences are incrementally added. This helps assess whether a repertoire has been sequenced deeply enough to capture its structural diversity.

Usage

lzg saturation INPUT [options]

Arguments and options

Flag Description
INPUT Sequence file
-V, --variant Graph variant: aap, ndp, or naive (default: aap)
-s, --seq-column Sequence column name (default: auto-detect)
-o, --output Output file (default: stdout)
--log-every Record a data point every N sequences (default: 100)
--json JSON output

Example

lzg saturation repertoire.tsv -V aap --log-every 500 -o saturation.tsv
[saturation] 48312 sequences, variant=aap
head -5 saturation.tsv
n_sequences n_nodes n_edges
500 1847    3214
1000    3102    5891
1500    4018    8012
2000    4729    9847

Tip

Plot the output to visually check for plateau. If the curve is still climbing steeply at the end, you likely need deeper sequencing.


predict -- Occupancy predictions

Predict ecological properties of the repertoire at arbitrary sequencing depths. Three subcommands are available: richness, overlap, and sharing.

predict richness

Estimate the number of distinct sequences (species richness) that would be observed at given sampling depths.

Usage

lzg predict richness GRAPH --depths DEPTHS [options]
Flag Description
GRAPH .lzg file
--depths (required) Comma-separated depths or START:END:N for log-spaced
-o, --output Output file (default: stdout)
--json JSON output

Example

lzg predict richness repertoire.lzg --depths 1000,10000,100000,1000000
depth   predicted_richness
1000    987.4210
10000   8241.3019
100000  42819.5531
1000000 118402.7743
# Log-spaced depths: 10 points from 1000 to 1000000
lzg predict richness repertoire.lzg --depths 1000:1000000:10

predict overlap

Predict the expected number of sequences shared between two independent samples of sizes d_i and d_j drawn from the same repertoire.

Usage

lzg predict overlap GRAPH --di D_I --dj D_J [options]
Flag Description
GRAPH .lzg file
--di (required) Depth of sample i
--dj (required) Depth of sample j
-o, --output Output file (default: stdout)
--json JSON output

Example

lzg predict overlap repertoire.lzg --di 50000 --dj 50000
PO  d_i 50000
PO  d_j 50000
PO  predicted_overlap   12847.3019

predict sharing

Predict the sharing spectrum: how many sequences are expected to appear in exactly k out of N donors, given draw sizes for each donor.

Usage

lzg predict sharing GRAPH --draws D1,D2,... [options]
Flag Description
GRAPH .lzg file
--draws (required) Comma-separated draw sizes, one per donor
--max-k Maximum sharing degree (default: number of donors)
-o, --output Output file (default: stdout)
--json JSON output

Example

lzg predict sharing repertoire.lzg --draws 10000,10000,10000
k   expected_count
1   18421.201832
2   3814.490217
3   412.083104

posterior -- Bayesian posterior update

Update a prior graph with new observations, producing a posterior graph. This is the Bayesian mechanism for incorporating new sequencing data into an existing model without rebuilding from scratch.

Usage

lzg posterior PRIOR NEW_DATA -o OUTPUT [options]

Arguments and options

Flag Description
PRIOR Prior .lzg graph file
NEW_DATA File with new observations
-o, --output (required) Output .lzg file for the posterior graph
-s, --seq-column Sequence column name (default: auto-detect)
-a, --abundance-column Abundance column name
--kappa Prior strength / concentration parameter (default: 1.0)

Example

lzg posterior day0.lzg day30_repertoire.tsv -o day30_posterior.lzg --kappa 0.5
[posterior] 21847 new sequences, kappa=0.5
[posterior] saved day30_posterior.lzg

Choosing kappa

kappa controls how much weight the prior receives relative to the new data. A value of 1.0 treats prior and data equally. Values below 1 let the new data dominate; values above 1 make the posterior more conservative.


Common workflows

These examples show how lzg commands chain together for typical analysis tasks.

Build and inspect

# Build the graph
lzg build repertoire.tsv -o rep.lzg

# Quick summary
lzg info rep.lzg

# Full details as JSON
lzg info rep.lzg --all --json > rep_info.json

Build, simulate, and score

Generate synthetic sequences from a repertoire and then score them to verify the generation probability distribution.

# Build
lzg build repertoire.tsv -o rep.lzg

# Simulate 10k sequences with details
lzg simulate rep.lzg -n 10000 --seed 42 --with-details -o synthetic.tsv

# Score an independent set against the same graph
lzg score rep.lzg test_sequences.txt -o scored.tsv

Compare two repertoires

lzg build healthy.tsv -o healthy.lzg
lzg build disease.tsv -o disease.lzg

# Distributional distance + structural overlap
lzg compare healthy.lzg disease.lzg

# Side-by-side diversity profiles (JSON for scripting)
lzg diversity healthy.lzg --json > healthy_div.json
lzg diversity disease.lzg --json > disease_div.json

Longitudinal tracking with posterior updates

# Day 0 baseline
lzg build day0.tsv -o day0.lzg

# Day 30: update the prior with new data
lzg posterior day0.lzg day30.tsv -o day30.lzg --kappa 1.0

# Day 60: chain another update
lzg posterior day30.lzg day60.tsv -o day60.lzg --kappa 1.0

# Compare baseline to final
lzg compare day0.lzg day60.lzg

Saturation check before building

# Is 50k sequences enough?
lzg saturation repertoire.tsv --log-every 1000 -o sat.tsv

# If saturated, build the graph
lzg build repertoire.tsv -o rep.lzg

Richness extrapolation

lzg build repertoire.tsv -o rep.lzg

# Predict how many unique sequences you'd see at 1M depth
lzg predict richness rep.lzg --depths 1000:1000000:20 -o richness.tsv

Unix pipeline: decompose and count tokens

# Average number of LZ76 tokens per sequence
lzg decompose repertoire.tsv | tail -n +2 | awk -F'\t' '{sum+=$3; n++} END {print sum/n}'
8.417