How-to: Clustering Validation

A recipe for measuring the biological utility of generated scRNA-seq data via clustering quality.

Goal

Report the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and macro-F1 by comparing Leiden clusters of the generated data with ground-truth cell type labels.
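
For reference, all three scores can be computed from two label vectors with only the standard library. The sketch below is an illustration, not the code in data_validation.py. The function names are hypothetical, NMI is normalised by the arithmetic mean of the two entropies, and macro-F1 assumes each cluster is first mapped to its majority cell type (this guide does not say which mapping the tool uses).

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalisation (an assumption, see above)."""
    n = len(labels_true)
    rows, cols = Counter(labels_true), Counter(labels_pred)
    cells = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * log(c * n / (rows[t] * cols[p]))
             for (t, p), c in cells.items())
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    denom = (h_true + h_pred) / 2
    return mi / denom if denom > 0 else 1.0


def macro_f1_majority_vote(labels_true, labels_pred):
    """Macro-F1 after mapping each cluster to its most common true label."""
    votes = {}
    for t, p in zip(labels_true, labels_pred):
        votes.setdefault(p, Counter())[t] += 1
    mapping = {p: counts.most_common(1)[0][0] for p, counts in votes.items()}
    mapped = [mapping[p] for p in labels_pred]
    scores = []
    for cls in sorted(set(labels_true)):
        tp = sum(t == cls and m == cls for t, m in zip(labels_true, mapped))
        fp = sum(t != cls and m == cls for t, m in zip(labels_true, mapped))
        fn = sum(t == cls and m != cls for t, m in zip(labels_true, mapped))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In practice, scikit-learn's adjusted_rand_score and normalized_mutual_info_score compute the same two quantities; the hand-written versions are only meant to show what is being measured.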

Prerequisites

Generated data CSV.
Feature selection completed (see ./feature_selection).

Steps

1. Run the complete validation

python -m data_validation.data_validation \
    --dataset muraro \
    --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv \
    --method cv2 \
    --n_genes 100 \
    --plot_umap

This performs feature selection, PCA, KNN graph construction, Leiden clustering (with a resolution sweep), and ARI/NMI/F1 reporting in a single command.
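
To make the feature selection stage more concrete, here is a minimal sketch that assumes `--method cv2` ranks genes by their squared coefficient of variation (variance divided by squared mean). That reading of "cv2" is an assumption, and the function name select_top_cv2_genes is hypothetical; the project's own implementation may fit a mean-dependent trend or differ in other ways.

```python
import numpy as np


def select_top_cv2_genes(expr, gene_names, n_genes=100):
    """Return the n_genes gene names with the highest CV^2 = var / mean^2.

    expr is a (cells x genes) array of expression values.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    var = expr.var(axis=0)
    # Genes with zero mean carry no signal here, so they get a score of 0.
    cv2 = np.divide(var, mean ** 2, out=np.zeros_like(mean), where=mean > 0)
    order = np.argsort(cv2)[::-1][:n_genes]
    return [gene_names[i] for i in order]


# Toy example: gene "b" varies across cells, "a" is constant, "c" is silent.
toy = [[1, 10, 0],
       [1, 0, 0],
       [1, 10, 0],
       [1, 0, 0]]
print(select_top_cv2_genes(toy, ["a", "b", "c"], n_genes=1))
```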

2. Examine the UMAP plots

The --plot_umap flag saves UMAP visualisations to results/:

results/<dataset>_<method>_umap_real.pdf — real data coloured by Leiden cluster.
results/<dataset>_<method>_umap_gen.pdf — generated data coloured by Leiden cluster.

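Compare these qualitatively: clusters in the generated plot should mirror clusters in the real plot in number, separation, and rough relative arrangement. UMAP is stochastic, so exact shapes and absolute positions can differ between runs; treat the plots as a sanity check rather than a metric.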

3. Tune the resolution sweep

The default resolution ranges are in data_validation/data_validation.py:

RESOLUTION_RANGES = {
    "yan": np.arange(0.80, 1.61, 0.01),
    "pollen": np.arange(0.10, 3.01, 0.01),
    "cbmc": np.arange(0.20, 0.81, 0.01),
    "muraro": np.arange(0.10, 3.01, 0.01),
}

For your own dataset, experiment with this range: if the best resolution lands at either end of the sweep, widen the range in that direction. The best resolution is the one that maximises ARI against ground truth.
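
The selection rule can be sketched as a small loop. The names best_resolution, cluster_fn, and score_fn are hypothetical placeholders rather than functions from data_validation.py: cluster_fn stands in for whatever runs Leiden at a given resolution, and score_fn for an ARI implementation. The sketch also builds a range from integer steps, which sidesteps the floating-point drift that can affect the endpoints of np.arange with a fractional step.

```python
import numpy as np


def best_resolution(resolutions, cluster_fn, score_fn, labels_true):
    """Return (resolution, score) for the highest-scoring resolution."""
    best_res, best_score = None, -np.inf
    for res in resolutions:
        score = score_fn(labels_true, cluster_fn(res))
        if score > best_score:  # strict '>' keeps the lowest resolution on ties
            best_res, best_score = float(res), score
    return best_res, best_score


# Integer steps divided by 100: 0.10, 0.11, ..., 3.00 (exactly 291 values).
my_range = np.arange(10, 301) / 100
```

If the returned resolution equals my_range[0] or my_range[-1], the optimum may lie outside the sweep, which is the signal to widen it.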

4. Interpret the results

cat results/muraro_cv2_ari_nmi_f1.csv

Typical columns, with an example row:

dataset,method,n_genes,best_resolution,ari,nmi,f1
muraro,cv2,100,1.42,0.6734,0.7519,0.6108

Interpretation thresholds (scRNA-seq data):

ARI Range	Interpretation
> 0.70	Excellent — the generated data’s clusters match ground truth nearly as well as real data.
0.50–0.70	Good — meaningful biological signal captured.
0.30–0.50	Fair — some signal, but cluster boundaries are fuzzy.
< 0.30	Poor — generated data lacks the structure to resolve cell types.
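
The thresholds above can be applied to the results file programmatically. This is a minimal sketch: interpret_ari is a hypothetical helper, and because the table leaves the boundary values ambiguous, it assumes that exactly 0.70 counts as Good, 0.50 as Good, and 0.30 as Fair. The sample text is the example row shown earlier, parsed from a string so the snippet runs without the results file.

```python
import csv
import io


def interpret_ari(ari):
    """Map an ARI value to the interpretation bands in the table above."""
    if ari > 0.70:
        return "Excellent"
    if ari >= 0.50:
        return "Good"
    if ari >= 0.30:
        return "Fair"
    return "Poor"


sample = (
    "dataset,method,n_genes,best_resolution,ari,nmi,f1\n"
    "muraro,cv2,100,1.42,0.6734,0.7519,0.6108\n"
)
for row in csv.DictReader(io.StringIO(sample)):
    print(row["dataset"], row["method"], interpret_ari(float(row["ari"])))
# prints: muraro cv2 Good
```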

Using the Python API

from data_validation.data_validation import validate_generated_data
import pandas as pd

gen_df = pd.read_csv("data/gen_data/muraro_data_mixdata_iter3_top_426.csv")
metrics = validate_generated_data(
    dataset="muraro",
    gen_df=gen_df,
    method="cv2",
    n_genes=100,
    plot_umap=True
)
print(metrics)  # {"ari": 0.6734, "nmi": 0.7519, "f1": 0.6108, "resolution": 1.42}

Troubleshooting

Symptom	Cause	Fix
ARI = 0.0	Feature selection failed to pick informative genes.	Try `--method pca` or increase `--n_genes`.
Leiden returns 1 cluster	Resolution too low.	Raise the upper bound of the sweep so that higher resolutions are tried; higher resolution produces more clusters.
UMAP is a blob	Generated data lacks structure.	Re-train GARAGE with higher `leakage_fraction`.
NMI > ARI	NMI is not corrected for chance and is less sensitive to cluster size imbalance, so it often exceeds ARI on datasets with rare types.	None needed; this is expected.
macro-F1 much lower than ARI	Rare cell types are poorly represented in the generated data.	Increase `priority_weight` in `config.py`.
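
To check the last symptom before changing any configuration, it can help to compare cell type proportions in the real and generated labels. The sketch below is a hypothetical diagnostic, not part of the data_validation package: underrepresented_types and the min_ratio cut-off of 0.5 are illustrative choices, and the toy labels are made up for the example.

```python
from collections import Counter


def underrepresented_types(real_labels, gen_labels, min_ratio=0.5):
    """Flag cell types whose share of the generated data is below
    min_ratio times their share of the real data.

    Returns {cell_type: (real_share, gen_share)} for flagged types.
    """
    if not real_labels or not gen_labels:
        raise ValueError("both label lists must be non-empty")
    real, gen = Counter(real_labels), Counter(gen_labels)
    flagged = {}
    for cell_type, count in real.items():
        real_share = count / len(real_labels)
        gen_share = gen.get(cell_type, 0) / len(gen_labels)
        if gen_share < min_ratio * real_share:
            flagged[cell_type] = (real_share, gen_share)
    return flagged


# Toy labels: "epsilon" makes up 20% of the real cells but 5% of the generated ones.
real = ["alpha"] * 8 + ["epsilon"] * 2
gen = ["alpha"] * 19 + ["epsilon"] * 1
print(underrepresented_types(real, gen))
```

Types flagged here are the ones most likely to drag macro-F1 down, since macro averaging weights every cell type equally regardless of its size.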