Summary

GARAGE (Graph-Attentive RAre-cell aware single-cell data GEneration) is a deep learning framework for generating high-fidelity synthetic single-cell RNA-seq (scRNA-seq) data.

Traditional Generative Adversarial Networks (GANs) often struggle with the high-dimensional and sparse nature of scRNA-seq data, leading to training instability and a failure to reproduce rare but biologically important cell populations. GARAGE addresses these problems with a two-stage architecture: a Graph Attention Network (GAT) first selects high-importance cells, and those cells are then used to seed the GAN's generator.

Workflow

The GARAGE framework uses a two-stage process to generate realistic synthetic cells, with a special focus on preserving rare cell types.

Figure: A high-level overview of the GARAGE framework.

Stage 1: GAT-based Cell Selection: A Graph Attention Network (GAT) is trained on a cell-cell KNN graph. By leveraging its attention mechanism, the GAT identifies a core set of “archetypal” or high-importance cells that are most influential in defining the data’s structure and cell-type identities.
Stage 2: GAT-Seeded GAN Generation: Instead of receiving only random noise, the GAN’s generator is fed a hybrid input batch — a mixture of random vectors and the high-priority “seed” cells selected by the GAT. This “attention-guided leakage” anchors the generator to known, biologically realistic states, stabilising training and ensuring all cell types are represented.
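
The two stages above can be illustrated with a minimal NumPy sketch. It is not the code in garage.py. Several parts are assumptions made for illustration: the importance score is a simple placeholder (how often a cell appears in other cells' neighbour lists) standing in for learned GAT attention, the noise vectors are assumed to have the same dimensionality as an expression profile so they can be stacked with seed cells, and the names knn_graph, placeholder_importance, select_seed_cells, hybrid_batch and the leakage parameter are hypothetical.

```python
import numpy as np


def knn_graph(X, k):
    """Return an (n_cells, k) array of nearest-neighbour indices (Euclidean)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]


def placeholder_importance(neighbours, n_cells):
    """Stand-in for GAT attention (assumption): count how often each cell
    is chosen as a neighbour by other cells."""
    return np.bincount(neighbours.ravel(), minlength=n_cells).astype(float)


def select_seed_cells(importance, n_seeds):
    """Indices of the n_seeds cells with the highest importance score."""
    return np.argsort(importance)[::-1][:n_seeds]


def hybrid_batch(X, seed_idx, batch_size, leakage, rng):
    """Stack random noise rows with real seed cells.

    `leakage` is the fraction of the batch filled with seed cells.
    """
    n_seed = int(round(leakage * batch_size))
    n_noise = batch_size - n_seed
    noise = rng.standard_normal((n_noise, X.shape[1]))
    picked = rng.choice(seed_idx, size=n_seed, replace=True)
    batch = np.vstack([noise, X[picked]])
    is_seed = np.arange(batch_size) >= n_noise  # seed rows sit at the end
    return batch, is_seed


rng = np.random.default_rng(0)
X = rng.random((200, 16))  # toy matrix: 200 cells x 16 genes
neighbours = knn_graph(X, k=10)
importance = placeholder_importance(neighbours, X.shape[0])
seed_idx = select_seed_cells(importance, n_seeds=20)
batch, is_seed = hybrid_batch(X, seed_idx, batch_size=64, leakage=0.25, rng=rng)
print(batch.shape, int(is_seed.sum()))
```

In a real pipeline the placeholder score would be replaced by attention weights from a trained GAT, and the resulting batch would be passed to the generator. Setting leakage to 0.0 reduces the batch to pure noise, which corresponds to an unseeded GAN.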

Key Features

GAT-Informed Seeding: Moves beyond random sampling to intelligently select the most representative cells to guide generation.
Enhanced Rare Cell Generation: The framework is explicitly designed to better capture and generate samples for rare and underrepresented cell populations.
Improved Stability & Convergence: The seeded-generation process significantly stabilises GAN training, reduces mode collapse, and accelerates convergence.
High-Fidelity Synthetic Data: Produces synthetic datasets ideal for data augmentation, methods benchmarking, and privacy-preserving data sharing.

Repository Structure

.
├── data_generation/                Core GARAGE pipeline
│   ├── garage.py                   Full GARAGE: GAT subsampling + GAN generation
│   └── wasserstein_distance.py     Wasserstein distance (real vs. generated)
│
├── data_validation/                Quality evaluation
│   ├── feature_selection.py        Python implementation (CV², Fano, PCA loading)
│   ├── feature_selection.R         Original R implementation (reference)
│   ├── data_validation.py          Clustering (Leiden) → ARI / NMI / macro-F1 / UMAP
│   └── data_vaidation_garage.ipynb Original validation notebook (reference)
│
├── benchmarking/
│   ├── sota/                       General-purpose baselines (PyTorch)
│   │   ├── gan.py, wgan.py, fgan.py, vae.py, lsh_gan.py
│   │   └── *_tf1.py               Original TF1.11 reference implementations
│   └── scrna_seq_specific/        scRNA-seq-specific baselines (PyTorch)
│       ├── scgan.py, scvae.py, scdiffusion.py
│       └── gan_ros.py, vae_ros.py
│
├── biological_analysis/            Rare-cell biology experiments
│   ├── biological_validation.py    GAT attention ↔ marker-gene enrichment
│   ├── rare_cell_utility.py        Held-out rare-cell classification utility
│   └── marker_gene_clustering.py   Marker-gene-based clustering evaluation
│
├── ablation_study/                 Sensitivity experiments
│   ├── leakage_ablation.py         GAT leakage fraction 0.0–0.3
│   └── multi_seed_synthesis.py     Multi-seed generation (5 seeds × 4 datasets)
│
├── analysis/                       Post-processing and plotting
│   ├── distribution_metrics.py     MMD + Sliced Wasserstein Distance
│   ├── clustering_evaluation.py    Feature-selection + clustering across seeds
│   ├── aggregate_losses.py         Aggregate GAN loss records
│   ├── build_summary_tables.py     Mean±std summary tables
│   └── plot_wasserstein_vs_leakage.py  WD vs leakage figure
│
├── data/                           Input data directory
│   ├── cell_types/                 Cell-type label files (*.csv)
│   └── expression_matrix/          Gene expression matrices (*.csv)
│
├── results/                        Output directory (generated CSVs, figures)
├── docs/                           ReadTheDocs documentation source
├── img/                            Images used in README and docs
├── config.py                       Shared paths and hyper-parameter constants
├── CITATION.cff                    Citation metadata
├── LICENSE                         MIT License
├── requirements_garage.txt         Core dependencies (Python 3.12.5)
└── requirements_benchmarking.txt   Benchmarking dependencies (Python 3.7.12)

Note: The GARAGE pipeline requires Python 3.12.5, while the benchmarking reference baselines use Python 3.7.12.
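
The tree above lists scripts that compare real and generated data with the Wasserstein distance (wasserstein_distance.py, distribution_metrics.py). As a rough illustration of the metric, the sketch below computes the exact 1-D Wasserstein-1 distance between two samples and averages it over genes. The per-gene averaging and the function names wasserstein_1d and mean_per_gene_wasserstein are assumptions for this example; the repository scripts may aggregate differently.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    grid = np.sort(np.concatenate([u, v]))  # all points where a CDF can jump
    widths = np.diff(grid)
    u_cdf = np.searchsorted(u, grid[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, grid[:-1], side="right") / v.size
    return float(np.sum(np.abs(u_cdf - v_cdf) * widths))


def mean_per_gene_wasserstein(real, generated):
    """Average the 1-D distance over genes (columns) of two cells x genes matrices."""
    n_genes = real.shape[1]
    return float(np.mean([wasserstein_1d(real[:, g], generated[:, g])
                          for g in range(n_genes)]))


rng = np.random.default_rng(0)
real = rng.random((300, 5))   # toy "real" matrix: 300 cells x 5 genes
generated = real + 0.5        # toy "generated" matrix: every gene shifted by 0.5
print(mean_per_gene_wasserstein(real, generated))
```

A lower value means the generated marginal distributions sit closer to the real ones. A per-gene average ignores gene-gene correlations, which is one reason to pair it with multivariate metrics such as MMD or sliced Wasserstein distance.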

Citation

If you use GARAGE in your research, please cite:

Ganguly, R., et al. “GARAGE: A Graph-Attentive GAN for Rare Cell-Aware Single-Cell RNA-seq Data Generation.” bioRxiv, 2025.

@software{garage2025,
  author    = {Ganguly, Ritwik and others},
  title     = {GARAGE: Graph-Attentive Rare-cell-Aware single-cell RNA-seq Data Generation},
  year      = {2025},
  publisher = {bioRxiv},
  doi       = {10.1101/2025.09.28.679012},
  url       = {https://github.com/RitwikGanguly/GARAGE}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.