Feature Extraction for Machine Learning¶

LZGraphs provides three strategies for extracting fixed-size numerical feature vectors from repertoire graphs. These vectors can be fed directly into scikit-learn classifiers, neural networks, or any ML pipeline.

The challenge¶

Immune repertoires have variable size — one sample might have 1,000 sequences, another 50,000. Standard ML algorithms need fixed-dimensional input. LZGraphs solves this by projecting repertoires into consistent feature spaces derived from the graph structure.

Three strategies at a glance¶

Strategy	Method	Dimension	Best for
A. Reference-aligned	`ref.feature_aligned(query)`	`ref.n_nodes`	Cross-repertoire classification
B. Mass profile	`graph.feature_mass_profile()`	`max_pos + 1` (~31)	Length/position-based phenotyping
C. Statistics	`graph.feature_stats()`	15	Quick screening, low-dimensional

Strategy A: Reference-aligned features¶

This is the most powerful approach. You choose a reference graph (typically built from a large population or cohort), and then project every sample into the reference graph's node space.

How it works¶

For each node in the reference graph, look up whether the same node exists in the query graph. If it does, record its frequency (outgoing traffic proportion). If not, record 0. The result is a vector whose dimension equals the reference graph's node count.

from LZGraphs import LZGraph

# 1. Build a reference graph from a large cohort
ref = LZGraph(cohort_sequences, variant='aap')
print(f"Reference: {ref.n_nodes} nodes → {ref.n_nodes}-dim feature space")

# 2. Project individual samples into this space
features_patient_1 = ref.feature_aligned(LZGraph(patient_1_seqs, variant='aap'))
features_patient_2 = ref.feature_aligned(LZGraph(patient_2_seqs, variant='aap'))

print(f"Feature vector shape: {features_patient_1.shape}")
print(f"Non-zero features: {(features_patient_1 > 0).sum()}")

Building a classification pipeline¶

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Build reference from all training data combined
all_train_seqs = healthy_seqs + disease_seqs
ref = LZGraph(all_train_seqs, variant='aap')

# Extract features for each sample
X = []
y = []
for seqs, label in [(healthy_samples, 0), (disease_samples, 1)]:
    for sample_seqs in seqs:
        vec = ref.feature_aligned(LZGraph(sample_seqs, variant='aap'))
        X.append(vec)
        y.append(label)

X = np.array(X)
y = np.array(y)

# Train a classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Accuracy: {scores.mean():.1%} +/- {scores.std():.1%}")

Choosing the reference graph

The reference graph defines the feature space. Using a large, diverse cohort as the reference ensures that most subpattern-position pairs are represented, giving you a rich feature space. If you use a small reference, many query nodes will map to zero.

Properties of aligned features¶

Consistent dimension across all samples (= reference node count)
Interpretable: each feature corresponds to a specific subpattern at a specific position
Sparse: most samples use only a fraction of the reference nodes
Normalized: values sum to approximately 1.0 (frequency distribution)

Strategy B: Mass profile¶

The mass profile distributes the graph's probability mass across sequence positions. It answers: how much probability mass terminates at each position?

graph = LZGraph(sequences, variant='aap')

profile = graph.feature_mass_profile(max_pos=30)
print(f"Shape: {profile.shape}")    # (31,)
print(f"Sum:   {profile.sum():.4f}")  # ~1.0

The result is a 31-element vector (positions 0 through 30) that forms a probability distribution. Peak positions correspond to the most common sequence lengths.

Visualizing the mass profile¶

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(len(profile)), profile, color='#42A5F5')
ax.set_xlabel("Sequence position")
ax.set_ylabel("Probability mass")
ax.set_title("Mass Profile")
plt.tight_layout()
plt.show()

Mass profile for a 5,000-sequence repertoire. Each bar shows how much probability mass terminates at that position. The peak at position 13 reflects the most common CDR3 length. This 31-element vector is a compact, fixed-size fingerprint of the repertoire's length-position structure.

Use case: comparing length-position distributions¶

The mass profile captures differences in CDR3 length distribution and how probability mass is distributed across positions. Two repertoires with different V-gene usage will have different mass profiles even if their overall diversity is similar.

profile_a = graph_a.feature_mass_profile()
profile_b = graph_b.feature_mass_profile()

# Cosine similarity between profiles
from numpy.linalg import norm
cos_sim = np.dot(profile_a, profile_b) / (norm(profile_a) * norm(profile_b))
print(f"Profile similarity: {cos_sim:.4f}")

Strategy C: Graph statistics vector¶

The fastest and simplest approach — a 15-element vector of scalar statistics computed from the graph:

graph = LZGraph(sequences, variant='aap')
stats = graph.feature_stats()
print(f"Shape: {stats.shape}")  # (15,)

What's in the vector¶

Index	Feature	Description
0	`n_nodes`	Total nodes in the graph
1	`n_edges`	Total edges
2	`n_root`	Number of root nodes (always 1)
3	`n_sinks`	Number of terminal ($) nodes
4	`D(0)`	Hill number at order 0 (richness)
5	`D(0.5)`	Hill number at order 0.5
6	`D(1)`	Hill number at order 1 (effective diversity)
7	`D(2)`	Hill number at order 2 (inverse Simpson)
8	`D(5)`	Hill number at order 5
9	`H`	Shannon entropy (nats)
10	`DR`	Dynamic range (orders of magnitude)
11	`n_sinks`	Terminal node count (same as index 3; reserved for future use)
12	`sink_frac`	Fraction of nodes that are terminals
13	`max_out`	Maximum out-degree
14	`uniformity`	Entropy uniformity (0 to 1)

Use case: quick cohort screening¶

When you have many samples and need a fast initial comparison:

import numpy as np

# Extract stats for all samples
all_stats = np.array([
    LZGraph(sample_seqs, variant='aap').feature_stats()
    for sample_seqs in all_samples
])

print(f"Feature matrix: {all_stats.shape}")  # (n_samples, 15)

# PCA for visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
coords = pca.fit_transform(all_stats)

# The 15 features capture diversity, complexity, and structural properties
# in a single low-dimensional representation

Combining strategies¶

For maximum discriminative power, you can concatenate features from multiple strategies:

import numpy as np

ref = LZGraph(cohort_seqs, variant='aap')

def extract_features(seqs):
    g = LZGraph(seqs, variant='aap')
    aligned = ref.feature_aligned(g)      # ~1500 features
    mass = g.feature_mass_profile()         # 31 features
    stats = g.feature_stats()               # 15 features
    return np.concatenate([aligned, mass, stats])

# Build feature matrix
X = np.array([extract_features(s) for s in all_samples])
print(f"Combined feature matrix: {X.shape}")

Integration with scipy sparse¶

For very large reference graphs, the aligned feature vectors are sparse (most entries are zero). You can export the graph as a CSR matrix for efficient computation:

from scipy.sparse import csr_matrix

csr = graph.adjacency_csr()
A = csr_matrix(
    (csr['weights'], csr['col_indices'], csr['row_offsets']),
    shape=(graph.n_nodes, graph.n_nodes),
)
print(f"Sparse adjacency: {A.shape}, nnz={A.nnz}")