How-to: Clustering Validation
A recipe for measuring the biological utility of generated scRNA-seq data via clustering quality.
Goal
Report ARI, NMI, and macro-F1 scores by comparing Leiden clusters of the generated data to ground-truth cell type labels.
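For orientation, ARI can be computed from the contingency table between cluster assignments and ground-truth labels. The following is a minimal pure-Python sketch of the standard formula. The function name `adjusted_rand_index` is illustrative, and the validation script may well rely on a library implementation instead (that is an assumption, not something this page confirms).

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts: number of cells in each (true label, cluster) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs that are grouped together under each view.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total = comb(n, 2)

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_true * sum_pred / total if total else 0.0
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. both labelings consist of a single group.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

Cluster names do not need to match label names: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0, while putting every cell in one cluster scores 0.0 against any labeling with more than one type.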
Prerequisites
Generated data CSV.
Feature selection completed (see ./feature_selection).
Steps
1. Run the complete validation
python -m data_validation.data_validation \
    --dataset muraro \
    --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv \
    --method cv2 \
    --n_genes 100 \
    --plot_umap
This performs feature selection, PCA, KNN graph construction, Leiden clustering (resolution sweep), and ARI/NMI/F1 reporting — all in one command.
2. Examine the UMAP plots
The --plot_umap flag saves UMAP visualisations to results/:
results/<dataset>_<method>_umap_real.pdf: real data coloured by Leiden cluster.
results/<dataset>_<method>_umap_gen.pdf: generated data coloured by Leiden cluster.
Compare these qualitatively: clusters in the generated plot should mirror clusters in the real plot in shape, separation, and relative position.
3. Tune the resolution sweep
The default resolution ranges are in data_validation/data_validation.py:
RESOLUTION_RANGES = {
    "yan": np.arange(0.80, 1.61, 0.01),
    "pollen": np.arange(0.10, 3.01, 0.01),
    "cbmc": np.arange(0.20, 0.81, 0.01),
    "muraro": np.arange(0.10, 3.01, 0.01),
}
For your own dataset, experiment with this range. The best resolution is the one that maximises ARI against ground truth.
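The selection rule can be sketched independently of the clustering backend. In the sketch below, `resolution_grid`, `best_resolution`, `cluster_at`, and `score` are all hypothetical names introduced for illustration: `cluster_at` stands in for "run Leiden at this resolution" and `score` for "ARI against ground truth". The grid is built from integer step counts because repeated floating-point addition of 0.01 drifts, and unlike `np.arange` its upper bound is inclusive.

```python
def resolution_grid(start, stop, step=0.01):
    """Inclusive grid of resolutions, rounded to avoid float drift."""
    n_steps = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def best_resolution(resolutions, cluster_at, score):
    """Return (resolution, score) for the highest-scoring resolution.

    cluster_at(resolution) -> cluster assignments (hypothetical callable)
    score(assignments)     -> float, higher is better (hypothetical callable)
    Ties keep the lowest resolution, since only strict improvements replace
    the current best.
    """
    best_res, best_score = None, float("-inf")
    for res in resolutions:
        current = score(cluster_at(res))
        if current > best_score:
            best_res, best_score = res, current
    return best_res, best_score
```

Because the resolution is chosen by maximising ARI against the same labels used for reporting, the resulting ARI is best read as a best-case value for that sweep range.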
4. Interpret the results
cat results/muraro_cv2_ari_nmi_f1.csv
Typical columns:
dataset,method,n_genes,best_resolution,ari,nmi,f1
muraro,cv2,100,1.42,0.6734,0.7519,0.6108
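If you want the metrics in a script rather than on the terminal, the file can be read with the standard library. This is a small sketch that assumes the column names match the sample above. `load_results` is a hypothetical helper, not part of the package, and the example parses the sample row from an in-memory string rather than a real results file.

```python
import csv
import io


def load_results(handle):
    """Parse a results CSV into a list of dicts with numeric fields converted."""
    rows = []
    for row in csv.DictReader(handle):
        row["n_genes"] = int(row["n_genes"])
        for key in ("best_resolution", "ari", "nmi", "f1"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Example using the sample row from this page instead of a file on disk.
sample = io.StringIO(
    "dataset,method,n_genes,best_resolution,ari,nmi,f1\n"
    "muraro,cv2,100,1.42,0.6734,0.7519,0.6108\n"
)
results = load_results(sample)
print(results[0]["dataset"], results[0]["ari"])
```

To read an actual file, pass `open(path, newline="")` as the handle.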
Interpretation thresholds (scRNA-seq data):
| ARI Range | Interpretation |
|---|---|
| > 0.70 | Excellent: the generated data’s clusters match ground truth nearly as well as real data. |
| 0.50–0.70 | Good: meaningful biological signal captured. |
| 0.30–0.50 | Fair: some signal, but cluster boundaries are fuzzy. |
| < 0.30 | Poor: generated data lacks the structure to resolve cell types. |
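The thresholds can be applied programmatically when summarising many runs. The sketch below is illustrative only: `interpret_ari` is a hypothetical helper, and because the bands share their endpoints, it assumes that a value on a shared boundary (0.50) goes to the higher band, while exactly 0.70 counts as "Good" since "Excellent" requires a value above 0.70.

```python
def interpret_ari(ari):
    """Map an ARI value to the qualitative band used on this page."""
    if ari > 0.70:
        return "Excellent"
    if ari >= 0.50:  # assumption: 0.50 itself is treated as "Good"
        return "Good"
    if ari >= 0.30:  # "Poor" is defined as strictly below 0.30
        return "Fair"
    return "Poor"
```

For the sample row above, `interpret_ari(0.6734)` returns `"Good"`.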
Using the Python API
from data_validation.data_validation import validate_generated_data
import pandas as pd
gen_df = pd.read_csv("data/gen_data/muraro_data_mixdata_iter3_top_426.csv")
metrics = validate_generated_data(
    dataset="muraro",
    gen_df=gen_df,
    method="cv2",
    n_genes=100,
    plot_umap=True,
)
print(metrics) # {"ari": 0.6734, "nmi": 0.7519, "f1": 0.6108, "resolution": 1.42}
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| ARI = 0.0 | Feature selection failed to pick informative genes. | Try |
| Leiden returns 1 cluster | Resolution too low. | Widen the sweep range (e.g., start at 0.01). |
| UMAP is a blob | Generated data lacks structure. | Re-train GARAGE with higher |
| NMI > ARI | NMI is less sensitive to cluster size imbalance; this is normal for datasets with rare types. | Expected. |
| macro-F1 much lower than ARI | Rare cell types are poorly represented in the generated data. | Increase |
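To see why a missing rare cell type hurts macro-F1 so much, it helps to compute the metric by hand. This page does not say how clusters are mapped to cell types before F1 is computed, so the sketch below assumes a simple majority vote per cluster. Both function names are hypothetical and the labels are toy data, not results.

```python
from collections import Counter, defaultdict


def majority_vote_labels(labels_true, clusters):
    """Assign each cluster the most common true label among its cells."""
    votes = defaultdict(Counter)
    for label, cluster in zip(labels_true, clusters):
        votes[cluster][label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [mapping[c] for c in clusters]


def macro_f1(labels_true, labels_pred):
    """Unweighted mean of per-class F1 over the classes in labels_true."""
    pairs = list(zip(labels_true, labels_pred))
    scores = []
    for cls in sorted(set(labels_true)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two abundant types and one rare type absorbed into a big cluster.
truth = ["alpha"] * 8 + ["beta"] * 8 + ["epsilon"] * 2
clusters = [0] * 8 + [1] * 10
predicted = majority_vote_labels(truth, clusters)
print(round(macro_f1(truth, predicted), 4))
```

In this toy case 16 of 18 cells are labelled correctly, yet the rare type contributes an F1 of 0 and counts as much as each abundant type in the unweighted mean.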