How-to: Interpret Outputs
A guide to reading GARAGE’s multi-metric evaluation and diagnosing issues.
Goal
Understand what each metric tells you, how to combine them into a coherent story, and how to diagnose common failure modes.
The Metric Pyramid
Interpret the metrics in this order, from top to bottom:
┌──────────────┐
│ UMAP visual  │ ← Quick sanity check
├──────────────┤
│ WD, MMD      │ ← Distributional fidelity
├──────────────┤
│ ARI, NMI     │ ← Clustering quality
├──────────────┤
│ macro-F1     │ ← Rare-cell preservation check
├──────────────┤
│ Bio. valid.  │ ← Biological interpretability
└──────────────┘
Scenario 1: Everything Looks Good
WD = 0.007 (excellent)
ARI = 0.67 (good)
NMI = 0.75 (good)
F1 = 0.61 (reasonable, on par with real data)
UMAP shows separated clusters matching real data
Conclusion: GARAGE is working well. Proceed to biological validation and rare-cell experiments.
Scenario 2: Good WD, Poor Clustering
WD = 0.008 (excellent)
ARI = 0.21 (poor)
NMI = 0.30 (poor)
F1 = 0.15 (poor)
UMAP is a single blob
Diagnosis: The generator produces data that is distributionally similar to the real data (low WD) but lacks the structure to form distinct clusters. This is common when:
The leakage fraction is too high (generator copies real data → WD looks good, but no novelty).
The generated data has the right gene expression means but wrong covariance structure.
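The second cause can be illustrated with a toy NumPy sketch. It assumes WD is evaluated gene by gene (how GARAGE aggregates WD is not stated here), and all data and names below are made up for illustration. Shuffling each gene independently keeps every per-gene distribution intact, so a per-gene WD is zero, yet the gene-gene correlation that clustering relies on is gone.

```python
import numpy as np

def wd_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-sized samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)

# Toy "real" data: 500 cells, 2 strongly correlated genes.
g0 = rng.normal(size=500)
real = np.column_stack([g0, g0 + 0.1 * rng.normal(size=500)])

# Toy "generated" data: the same per-gene values, shuffled independently
# for each gene, so marginals match but the joint structure is destroyed.
fake = np.column_stack([rng.permutation(real[:, j]) for j in range(2)])

per_gene_wd = [wd_1d(real[:, j], fake[:, j]) for j in range(2)]
corr_real = float(np.corrcoef(real, rowvar=False)[0, 1])
corr_fake = float(np.corrcoef(fake, rowvar=False)[0, 1])

print("per-gene WD:", per_gene_wd)
print("gene-gene correlation, real vs generated:", corr_real, corr_fake)
```

If a comparison like this shows matching marginals but very different correlations, a low WD alone should not be read as evidence of good structure.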
Fix:
Reduce leakage_fraction (0.2 → 0.15 or 0.1).
Try a different feature selection method (for example, --method pca).
Check that the GAN converged (D_loss should plateau at about 0.69).
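The convergence check can be automated with a small helper. This is a minimal sketch, not part of GARAGE: it assumes D_loss is a mean binary cross-entropy, for which a discriminator that outputs 0.5 everywhere scores ln 2 ≈ 0.693, and that you have already parsed the logged values into a list. The function name, `window`, and `tol` are hypothetical.

```python
import math

def discriminator_plateaued(d_losses, window=100, target=math.log(2), tol=0.05):
    """Return True if the last `window` D_loss values average within `tol`
    of `target` and their max-min spread is at most 2 * tol."""
    if len(d_losses) < window:
        return False  # not enough history to judge
    tail = d_losses[-window:]
    mean = sum(tail) / window
    spread = max(tail) - min(tail)
    return abs(mean - target) <= tol and spread <= 2 * tol

# Synthetic examples: a loss hovering near 0.69 and a loss collapsing to 0.
hovering = [0.69 + 0.01 * math.sin(i) for i in range(200)]
collapsing = [0.69 * 0.97 ** i for i in range(200)]
print(discriminator_plateaued(hovering), discriminator_plateaued(collapsing))
```

If your D_loss is defined as the sum of the real and fake terms rather than their mean, the expected plateau differs and `target` should be adjusted accordingly.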
Scenario 3: Poor WD, Good Clustering
WD = 0.34 (poor)
ARI = 0.58 (good)
NMI = 0.65 (good)
F1 = 0.49 (moderate)
UMAP shows decent clusters but shifted relative to real data
Diagnosis: The generator produces data that clusters nicely but is shifted in gene-expression space. This can happen with:
Under-normalisation (generated and real data on different scales).
Batch-effect-like translation.
Fix:
Check that the real data and generated data are on the same scale.
Try re-normalising both to z-scores before computing WD.
WD is computed on the raw normalised data; try computing it on PC-reduced data instead.
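The re-normalisation check can be sketched as follows. The helper names and toy data are hypothetical, and the per-gene average below is only one way to summarise WD; GARAGE's own WD computation may differ. The idea is to compare WD before and after z-scoring each dataset: a large drop suggests the poor WD comes from scale or offset rather than from the shape of the distribution.

```python
import numpy as np

def zscore(x):
    """Standardise each gene (column) to zero mean and unit variance."""
    sd = x.std(axis=0)
    return (x - x.mean(axis=0)) / np.where(sd > 0, sd, 1.0)

def mean_gene_wd(a, b):
    """Average per-gene 1-D Wasserstein-1 distance (equal cell counts assumed)."""
    return float(np.mean(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0))))

rng = np.random.default_rng(1)
real = rng.normal(loc=0.0, scale=1.0, size=(400, 5))
# Toy generated cells: same distribution shape, but shifted and rescaled.
generated = 3.0 * rng.normal(size=(400, 5)) + 2.0

wd_raw = mean_gene_wd(real, generated)
wd_z = mean_gene_wd(zscore(real), zscore(generated))
print("WD on raw values:", wd_raw)
print("WD after z-scoring:", wd_z)
```

Because z-scoring each dataset separately removes any genuine offset as well, treat this as a diagnostic rather than as a replacement for the reported WD.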
Scenario 4: Good ARI, Low macro-F1
ARI = 0.65 (good)
NMI = 0.72 (good)
F1 = 0.28 (poor)
Diagnosis: The generated data has good overall cluster structure but fails to capture rare cell types. The macro-F1 averages per-type F1 with equal weight, so a low macro-F1 means at least one rare type is being lost.
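A small worked example shows why one lost rare type is enough to pull macro-F1 down while overall accuracy stays high. The labels below are invented, and the helper is a plain-Python sketch that averages over the types present in the true labels; library implementations may choose the label set differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-type F1 over the types present in y_true.
    Returns (macro score, dict of per-type scores)."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return sum(scores.values()) / len(scores), scores

# 90 common "A" cells, 8 "B" cells, 2 rare "C" cells.
# Every rare C cell is mislabelled as A; everything else is correct.
y_true = ["A"] * 90 + ["B"] * 8 + ["C"] * 2
y_pred = ["A"] * 90 + ["B"] * 8 + ["A"] * 2

score, per_type = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro-F1:", round(score, 3), "per type:", per_type)
```

Here 98% of cells are labelled correctly, yet the zero F1 for type C caps the macro average at roughly two thirds. Inspecting the per-type scores identifies which type is being lost.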
Fix:
Increase priority_weight (2.0 → 4.0).
Increase leakage_fraction slightly.
Check that rare_threshold is set correctly; the rare type may not be flagged as rare.
Scenario 5: Everything Is Bad
WD = 0.89 (terrible)
ARI = 0.05 (essentially random)
NMI = 0.10
F1 = 0.03
UMAP is noise
Diagnosis: The generator has failed to converge. Likely causes:
GAN training diverged (check training logs: G_loss → ∞ or oscillating).
The dataset is too large for the model capacity (increase hidden dimensions).
The learning rates are poorly tuned.
Fix:
Check training logs first — identify which stage failed.
If GAT loss > 0.5 after 7500 epochs → increase gat_epochs; the model is underfitting.
If D_loss → 0 while G_loss → ∞ → the discriminator is too strong; reduce nd_steps or d_lr.
Try training on a smaller dataset (Yan) first to verify the pipeline works.
Reset config.py to GARAGE_DEFAULTS and try again.
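The "discriminator too strong" pattern can be flagged from the logged losses with a simple heuristic. This sketch assumes you have already extracted D_loss and G_loss into two lists; the log format is not covered here. The function name, `window`, and `d_floor` are hypothetical choices, not GARAGE settings.

```python
from statistics import fmean

def discriminator_dominates(d_losses, g_losses, window=50, d_floor=0.05):
    """Heuristic: True when the recent mean D_loss is below `d_floor`
    while the recent mean G_loss exceeds the mean of the window before it."""
    if min(len(d_losses), len(g_losses)) < 2 * window:
        return False  # not enough history to compare two windows
    d_recent = fmean(d_losses[-window:])
    g_recent = fmean(g_losses[-window:])
    g_before = fmean(g_losses[-2 * window:-window])
    return d_recent < d_floor and g_recent > g_before

# Synthetic logs: D_loss decaying to 0 while G_loss keeps climbing.
d_bad = [0.69 * 0.95 ** i for i in range(200)]
g_bad = [0.69 + 0.05 * i for i in range(200)]
print(discriminator_dominates(d_bad, g_bad))
```

A True result points to the nd_steps or d_lr adjustment above; a False result with poor metrics suggests looking at the other causes first.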
Quick Diagnosis Table
| WD | ARI | macro-F1 | Likely Issue | Action |
|---|---|---|---|---|
| ✓ | ✓ | ✓ | Everything fine | Proceed. |
| ✓ | ✗ | ✗ | No cluster structure | Reduce leakage. |
| ✗ | ✓ | ✓ | Distributional shift | Re-normalise. |
| ✓ | ✓ | ✗ | Rare cells lost | Increase priority_weight. |
| ✗ | ✗ | ✗ | Training failure | Check losses, re-tune. |
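The table can also be encoded as a lookup for use in scripts. This is a sketch under stated assumptions: the pass/fail cut-offs (`wd_max`, `ari_min`, `f1_min`) are hypothetical values chosen only so that the five example scenarios in this guide land in their rows, and should be replaced with thresholds appropriate to your dataset.

```python
DIAGNOSIS = {
    # (WD ok, ARI ok, macro-F1 ok): (likely issue, action)
    (True, True, True): ("Everything fine", "Proceed."),
    (True, False, False): ("No cluster structure", "Reduce leakage."),
    (False, True, True): ("Distributional shift", "Re-normalise."),
    (True, True, False): ("Rare cells lost", "Increase priority_weight."),
    (False, False, False): ("Training failure", "Check losses, re-tune."),
}

def diagnose(wd, ari, macro_f1, wd_max=0.05, ari_min=0.5, f1_min=0.4):
    """Map three metric values to a table row; cut-offs are assumptions."""
    key = (wd <= wd_max, ari >= ari_min, macro_f1 >= f1_min)
    fallback = ("Not covered by the table", "Inspect each metric individually.")
    return DIAGNOSIS.get(key, fallback)

# Scenario 2 from this guide: good WD, poor clustering.
print(diagnose(wd=0.008, ari=0.21, macro_f1=0.15))
```

Combinations that do not appear in the table return the fallback entry rather than a guess.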