GARAGE

Getting Started

  • Summary
    • Workflow
    • Key Features
    • Repository Structure
    • Citation
    • License
  • Installation
    • Prerequisites
      • Hardware Notes
    • Step 1: Clone the Repository
    • Step 2: Set Up the Environment
    • Step 3: Install Dependencies
      • Core pipeline (required)
      • Benchmarking dependencies (optional)
      • Validation dependencies (recommended)
    • Step 4: Verify Installation
    • Step 5: Run the Pipeline
  • Quickstart
    • Prerequisites
    • 1. Activate the Environment
    • 2. Verify Your Setup
    • 3. Run GARAGE on a Small Dataset (Yan)
    • 4. Validate the Generated Data
    • 5. Check the Wasserstein Distance
    • 6. View the Results
    • Next Steps
    • Common First-Time Issues
  • Preparing Your Data
    • Required Input Files
    • Expression Matrix Format
      • Option A: No header (cells × genes, no gene names)
      • Option B: With header (rows = cells, columns = gene names)
    • Cell-Type Labels Format
    • Registering Your Dataset
    • Setting the Rare Threshold
    • Placing Your Files
    • Running GARAGE on Your Data
    • Troubleshooting
    • Example: Adding the CBMC Dataset from Scratch

Tutorials

  • Tutorial: End-to-End Guide
    • Step 1: Understand Your Dataset
    • Step 2: Run GARAGE (GAT + GAN)
      • What happens under the hood
      • Console output (abridged)
    • Step 3: Compute Wasserstein Distance
    • Step 4: Feature Selection and Clustering Validation
      • What happens
      • Output (approximate)
      • Interpreting the Scores
    • Step 5: Biological Validation
    • Step 6: Benchmark Against SOTA Models
    • Step 7: Run Ablation Studies
    • Complete Workflow Script
    • What’s Next
  • Tutorial: Biological Validation
    • What Biological Validation Measures
    • Step 1: Run the Main Biological Validation
      • What happens
    • Step 2: Interpret the Results
      • Output files
      • Console output (interpretation guide)
    • Step 3: Marker Gene Clustering
    • Step 4: Held-out Rare Cell Utility
    • Biological Interpretation for Papers
    • Customising for Your Own Dataset
  • Tutorial: Rare Cell Experiment
    • Experimental Design
    • Step 1: Run the Experiment
    • Step 2: Interpret the Results
      • Console output (representative)
      • What to look for
    • Step 3: Statistical Rigour
      • Output files
    • Customising the Experiment
      • Change the rare-cell type
      • Change the augmentation ratio
      • Add a new model
    • Related Pages
  • Tutorial: Benchmark Against SOTA
    • Why Benchmark?
    • Step 1: Generate Data with All Models
      • GARAGE
      • General-purpose SOTA baselines
      • scRNA-seq-specific baselines
    • Step 2: Compute Distributional Metrics
    • Step 3: Compute Clustering Metrics
    • Step 4: Build Summary Tables
    • Step 5: Generate Publication-Quality Figures
    • Step 6: Interpret the Results
      • Output files
      • How to compare
      • Key comparison dimensions
    • Reproducing the Paper’s Figures
    • Adding a New Baseline
    • Related Pages

How-to Guides

  • How-to: Run GARAGE
    • Goal
    • Prerequisites
    • Steps
      • 1. Basic invocation
      • 2. Customise hyper-parameters
      • 3. Monitor training
      • 4. Running on multiple datasets
    • Output
    • Troubleshooting
    • Related
  • How-to: Feature Selection
    • Goal
    • Prerequisites
    • Steps
      • 1. CV² selection (recommended default)
      • 2. Fano index selection
      • 3. PCA loading selection
      • 4. Using the Python API directly
    • Choosing the right method
    • Output
    • Troubleshooting
    • Related
  • How-to: Clustering Validation
    • Goal
    • Prerequisites
    • Steps
      • 1. Run the complete validation
      • 2. Examine the UMAP plots
      • 3. Tune the resolution sweep
      • 4. Interpret the results
    • Using the Python API
    • Troubleshooting
    • Related
  • How-to: Compute Wasserstein Distance
    • Goal
    • Prerequisites
    • Steps
      • 1. Single dataset
      • 2. Batch across multiple datasets
      • 3. Using the Python API
    • Interpreting the Output
    • Related
  • How-to: Biological Validation
    • Goal
    • Prerequisites
    • Steps
      • 1. Run the main validation
      • 2. Run marker gene clustering
      • 3. Examine the output files
      • 4. Key metrics to verify
    • Related
  • How-to: Benchmarking
    • Goal
    • Prerequisites
    • Steps
      • 1. Generate data from all models
      • 2. Compute metrics for all
      • 3. Build tables
      • 4. Check the output
    • Related
  • How-to: Run Ablation Studies
    • Goal
    • Prerequisites
    • Steps
      • 1. Leakage fraction ablation
      • 2. Multi-seed synthesis
      • 3. Generate the WD vs. leakage figure
    • Interpreting the Results
    • Related
  • How-to: Interpret Outputs
    • Goal
    • The Metric Pyramid
    • Scenario 1: Everything Looks Good
    • Scenario 2: Good WD, Poor Clustering
    • Scenario 3: Poor WD, Good Clustering
    • Scenario 4: Good ARI, Low macro-F1
    • Scenario 5: Everything Is Bad
    • Quick Diagnosis Table
    • Related

Theoretical Background

  • Motivation
    • Why Generate Synthetic scRNA-seq Data?
    • The GARAGE Approach
    • Where GARAGE Fits in the Literature
    • Who Should Use GARAGE?
    • Next Steps
  • scRNA-seq Data Challenges
    • The Structure of scRNA-seq Data
      • Key Properties
    • Rare Cell Collapse in Generative Models
      • The Fano Index as a Diagnostic
    • Dimensionality Reduction for scRNA-seq
    • Common Pitfalls
    • Further Reading
  • GARAGE Architecture
    • Overview
    • Stage 1: GAT-Based Cell Selection
      • Input
      • Step-by-Step
      • Mathematical Formulation
    • Stage 2: GAN Generation with Attention-Guided Seeding
      • Input
      • Hybrid Input Batch
      • Generator Architecture
      • Discriminator Architecture
      • Training Loop (per iteration)
      • Label Smoothing
    • Why the Two-Stage Architecture Works
    • Hyper-parameter Reference
    • Datasets
  • Generative Adversarial Networks in GARAGE
    • The GAN Framework
      • Generator (\(G\))
      • Discriminator (\(D\))
      • Loss Functions
    • How GARAGE Uses GANs
      • The Hybrid Input Batch
      • Architecture Details
      • Training Loop
    • Advantages of GANs for scRNA-seq Data
    • Common Challenges and GARAGE’s Countermeasures
    • Variants Used in Benchmarking
  • Graph Attention Networks in GARAGE
    • GAT Overview
      • Attention Mechanism
    • GCN vs. GAT
    • GAT in GARAGE: The Cell Selection Stage
      • Step 1: Build the KNN Graph
      • Step 2: GAT Classifier with Priority Weighting
      • Step 3: Extract Seed Cells
    • Full Implementation
    • References
  • Evaluation Metrics
    • Distributional Metrics
      • Wasserstein Distance (Earth Mover’s Distance)
      • Maximum Mean Discrepancy (MMD)
      • Sliced Wasserstein Distance (SWD)
    • Clustering-Based Metrics
      • Adjusted Rand Index (ARI)
      • Normalised Mutual Information (NMI)
      • Macro-F1 Score
    • Visualisation
      • UMAP (Uniform Manifold Approximation and Projection)
      • PCA (Principal Component Analysis)
    • Feature Selection Methods
      • 1. CV² (Coefficient of Variation Squared)
      • 2. Fano Index
      • 3. PCA Loading
    • Leiden Clustering
    • Recommended Evaluation Workflow
  • Single Cell Clustering
    • What is single-cell clustering?
    • Key Steps
    • Applications
    • scRNA-seq vs. Bulk RNA-seq
  • Wasserstein Distance
    • Overview
      • Advantages
      • Limitations
    • Wasserstein Distance in GARAGE
      • Running the Computation
      • Implementation Reference
      • What Good Scores Look Like
      • Improving Wasserstein Distance
    • Related Metrics

API Reference

  • config module
    • Constants
      • DATASET_CONFIG
      • GARAGE_DEFAULTS
  • data_generation module
    • GARAGE — Graph-Attentive Rare-cell-Aware single-cell data GEneration.
      • Usage
      • Citation
    • Discriminator
      • Discriminator.forward()
    • GATClassifier
      • GATClassifier.forward()
    • Generator
      • Generator.forward()
    • gat_subsample()
    • generate_data()
    • load_dataset()
    • main()
    • run_garage()
    • sample_Z()
    • train_gan()
    • Wasserstein distance between real and generated scRNA-seq distributions.
      • Datasets
      • Usage
    • load_generated()
    • load_real()
    • main()
    • wasserstein_distance()
    • Core Functions
  • data_validation module
    • Data validation for GARAGE-generated scRNA-seq data.
      • Usage
    • cluster_and_evaluate()
    • cv2_selection()
    • load_generated()
    • load_labels()
    • load_real()
    • main()
    • plot_umap()
    • sweep_resolution()
    • Feature selection for scRNA-seq data (Python port of feature_selection.R).
      • Strategy
      • Usage
    • cv2_selection()
    • fano_selection()
    • main()
    • pca_loading_selection()
    • run_feature_selection()
    • Core Functions
    • Resolution Sweep
    • Reference Notebook
  • biological_analysis module
    • Biological validation of attention-prioritised cells.
    • enrichment_analysis()
    • load_cbmc()
    • main()
    • marker_expression_analysis()
    • per_celltype_positive_rate_analysis()
    • print_summary_interpretation()
    • run_gat_and_get_attention()
    • Held-out rare-cell utility experiment.
    • GATClassifier
      • GATClassifier.forward()
    • GarageDiscriminator
      • GarageDiscriminator.forward()
    • GarageGenerator
      • GarageGenerator.forward()
    • LSHDiscriminator
      • LSHDiscriminator.forward()
    • LSHGenerator
      • LSHGenerator.forward()
    • VanillaDiscriminator
      • VanillaDiscriminator.forward()
    • VanillaGenerator
      • VanillaGenerator.forward()
    • evaluate_classifier()
    • evaluate_classifier_downgraded()
    • garage_gat_seeds()
    • generate_garage()
    • generate_lsh_gan()
    • generate_vanilla_gan()
    • knn_subsample()
    • load_data()
    • main()
    • sample_Z()
    • split_rare()
    • train_garage_gan()
    • train_lsh_gan()
    • train_vanilla_gan()
    • Fixed marker-gene clustering evaluation.
    • compute_ari_nmi()
    • evaluate_baseline()
    • evaluate_clustering()
    • evaluate_real_reference()
    • evaluate_sweep()
    • get_pseudo_labels()
    • load_garage_data()
    • load_gen_data()
    • load_real_data()
    • main()
    • select_markers_cbmc()
    • select_markers_muraro()
    • select_markers_pollen()
    • select_markers_yan()
    • Modules Overview
  • ablation_study module
    • GAN training stability ablation study.
    • Discriminator
      • Discriminator.forward()
    • Generator
      • Generator.forward()
    • gat_subsample()
    • load_cbmc()
    • load_muraro()
    • load_pollen()
    • load_yan()
    • main()
    • sample_Z()
    • train_gan_with_leakage()
    • Critic
      • Critic.forward()
    • FDiscriminator
      • FDiscriminator.forward()
    • FGenerator
      • FGenerator.forward()
    • GATG_Discriminator
      • GATG_Discriminator.forward()
    • GATG_Generator
      • GATG_Generator.forward()
    • LSHDiscriminator
      • LSHDiscriminator.forward()
    • LSHGenerator
      • LSHGenerator.forward()
    • VanillaDiscriminator
      • VanillaDiscriminator.forward()
    • VanillaGenerator
      • VanillaGenerator.forward()
    • WGenerator
      • WGenerator.forward()
    • fisher_ratio()
    • gat_subsample()
    • knn_subsample()
    • load_labels()
    • load_real()
    • main()
    • sample_Z()
    • train_and_generate_fgan()
    • train_and_generate_gan()
    • train_and_generate_gatgan()
    • train_and_generate_lshgan()
    • train_and_generate_wgan()
    • Modules Overview
  • analysis module
    • load_real()
    • load_synthetic()
    • main()
    • mmd_rbf()
    • sliced_wasserstein()
    • cluster_and_evaluate()
    • cv2()
    • evaluate_baseline()
    • evaluate_sweep()
    • load_gen()
    • load_labels()
    • load_real()
    • main()
    • fmt_mean_std()
    • main()
    • evaluate_ari()
    • fano_selection()
    • load_real()
    • main()
    • Fixed marker-gene clustering evaluation (grid sweep).
    • compute_ari_nmi()
    • evaluate_clustering()
    • evaluate_real_reference()
    • evaluate_sweep()
    • get_pseudo_labels()
    • load_garage_data()
    • load_gen_data()
    • load_real_data()
    • main()
    • select_markers_cbmc()
    • select_markers_muraro()
    • select_markers_pollen()
    • select_markers_yan()
    • load_gan_data()
    • load_garage_data()
    • load_lsh_gan_data()
    • load_real_data()
    • main()
    • mmd_rbf()
    • rbf_kernel()
    • load_gan_data()
    • load_garage_data()
    • load_lsh_gan_data()
    • load_real_data()
    • main()
    • sliced_wasserstein()
    • Modules Overview

Appendix

  • Glossary
    • A
    • B
    • C
    • D
    • E
    • F
    • G
    • H
    • L
    • M
    • N
    • P
    • R
    • S
    • U
    • V
    • W
  • FAQ
    • General
      • What is GARAGE?
      • What Python version do I need?
      • Do I need a GPU?
      • Can I use GARAGE on my own dataset?
    • Generation
      • How long does training take?
      • My GAN losses are oscillating. Is that normal?
      • What does the leakage fraction (\(\lambda\)) do?
    • Validation
      • My ARI is very low. What should I do?
      • What’s a “good” ARI for scRNA-seq data?
      • Why is my macro-F1 much lower than ARI?
      • Can I compare different models?
    • Troubleshooting
      • CUDA out of memory
      • “No module named ‘torch_geometric’”
      • Generated CSV is empty or all zeros
      • Feature selection returns 0 genes
    • Citing GARAGE
    • Getting Help
  • Troubleshooting
    • Installation issues
      • PyTorch Geometric installation fails
      • NumPy/SciPy version mismatch
    • Runtime errors
      • CUDA out of memory
      • File not found
      • Data shape mismatch
      • Leiden returns 1 cluster
      • Leiden returns too many clusters (1 per cell)
    • Training problems
      • GAT loss doesn’t decrease
      • GAN generator loss diverges
      • Mode collapse (generator produces only one cell type)
      • NaN losses
    • Validation problems
      • All metrics are zero
      • UMAPs look identical
    • Build and docs
      • Sphinx build fails with “WARNING: document isn’t included in any toctree”
      • ReadTheDocs build fails on external images
    • Still stuck?
  • Citation
    • BibTeX
    • DOI
    • Citation File Format (CFF)
    • APA
    • License
  • Changelog
    • [1.0.0] — 2025-09-28 (bioRxiv preprint)
      • Added


© Copyright 2025, GARAGE.
