Quickstart
This guide gets you from zero to your first GARAGE-generated synthetic dataset in under 5 minutes.
Prerequisites
Python 3.12.5 installed with Conda.
The repository cloned and dependencies installed (requirements_garage.txt). See Installation if you haven’t done this yet.
1. Activate the Environment
conda activate venv_garage
cd GARAGE
2. Verify Your Setup
python -c "import torch; print('PyTorch', torch.__version__); print('CUDA:', torch.cuda.is_available())"
python -c "from config import DATASET_CONFIG, GARAGE_DEFAULTS; print(list(DATASET_CONFIG))"
Expected output (roughly):
PyTorch 2.4.0
CUDA: True
['yan', 'pollen', 'cbmc', 'muraro']
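If one of the one-liners above fails with an import error, it can help to check which modules are visible to the active environment without importing them. The following is a minimal sketch, not part of GARAGE: `check_modules` is a hypothetical helper, and the list of module names is an assumption (`torch` and `config` appear in the commands above; `torch_geometric` is the import name of PyG).

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} indicating whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed list of modules; adjust it to match requirements_garage.txt.
    for name, found in check_modules(["torch", "torch_geometric", "config"]).items():
        print(f"{name:16s} {'found' if found else 'MISSING'}")
```

Run it from the GARAGE directory so that the local `config` module is on the import path.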
3. Run GARAGE on a Small Dataset (Yan)
Yan is the smallest dataset (124 cells, 6 types) and trains in about 2 minutes on CPU:
python -m data_generation.garage --dataset yan
What this does:
Loads the Yan expression matrix and cell-type labels.
Stage 1: Trains a GAT classifier on a KNN cell-cell graph and identifies seed cells.
Stage 2: Trains a GAN with hybrid noise+seed input batches (20,001 iterations).
Saves the generated data to data/gen_data/ (e.g., yan_data_mixdata_iter3_top_426.csv).
The console output will show loss values for both stages. Watch for the GAT loss decreasing and the GAN generator/discriminator losses converging.
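To make the "KNN cell-cell graph" in Stage 1 concrete, here is a small sketch of how such a graph can be built from expression profiles. It is an illustration under stated assumptions, not GARAGE's implementation: the function name `knn_edges`, the Euclidean metric, and the toy data are all hypothetical, and the source does not state which distance or value of k GARAGE uses.

```python
import math


def knn_edges(cells, k):
    """Return directed edges (i, j) linking each cell i to its k nearest neighbours j."""
    edges = []
    for i, ci in enumerate(cells):
        # Distance from cell i to every other cell (Euclidean is an assumption).
        dists = sorted((math.dist(ci, cj), j) for j, cj in enumerate(cells) if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Toy data: 4 cells x 2 genes, forming two tight pairs.
cells = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
print(knn_edges(cells, k=1))  # [(0, 1), (1, 0), (2, 3), (3, 2)]
```

Each cell is linked to its closest neighbour, so the two pairs of similar cells end up connected to each other. A graph classifier such as a GAT then passes information along these edges.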
4. Validate the Generated Data
python -m data_validation.data_validation \
--dataset yan \
--gen_csv data/gen_data/yan_data_mixdata_iter3_top_426.csv \
--method cv2 \
--plot_umap
What this does:
Loads the generated and real data.
Applies CV² feature selection (100 genes).
Runs Leiden clustering over a resolution sweep.
Reports ARI, NMI, and macro-F1 scores.
Optionally saves UMAP plots to results/.
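The CV² feature selection step can be sketched as follows. CV² is the squared coefficient of variation of a gene (variance divided by squared mean), and genes with the highest values are kept. This is a simplified illustration: the helper names are hypothetical, the use of population variance is an assumption, and the source does not say whether GARAGE applies any further normalization or trend fitting.

```python
from statistics import mean, pvariance


def cv2_scores(matrix):
    """matrix: list of cells, each a list of per-gene values. Returns CV^2 for each gene."""
    scores = []
    for gene_values in zip(*matrix):  # iterate over columns, i.e. genes
        m = mean(gene_values)
        # Genes with zero mean get a score of 0 to avoid dividing by zero.
        scores.append(pvariance(gene_values) / (m * m) if m else 0.0)
    return scores


def top_genes(matrix, n):
    """Return the indices of the n genes with the highest CV^2."""
    scores = cv2_scores(matrix)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:n]


# Toy data: 4 cells x 3 genes. Gene 0 is constant, gene 1 is highly variable.
matrix = [
    [10.0, 0.0, 5.0],
    [10.0, 8.0, 5.0],
    [10.0, 0.0, 6.0],
    [10.0, 8.0, 4.0],
]
print(top_genes(matrix, 2))  # [1, 2]
```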
Expected output (approximate):
Dataset: yan | Feature selection: cv2 | top_genes: 100
Best resolution: 1.05
ARI: 0.7243
NMI: 0.8011
F1: 0.6938
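For intuition about the ARI value reported above: the Adjusted Rand Index compares two labelings by counting pairs of cells that are grouped together in both, corrected for chance agreement. The sketch below implements the standard formula in plain Python. It is for illustration only and is not claimed to be the code path used by data_validation; the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI from the contingency counts of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treated as perfect agreement here
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case (e.g. a single cluster on both sides)
    return (sum_ij - expected) / (max_index - expected)


# Cluster IDs differ but the grouping is identical, so the score is perfect.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

A score of 1 means the clustering recovers the cell types exactly, values near 0 mean agreement no better than chance, and negative values mean worse than chance.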
5. Check the Wasserstein Distance
python -m data_generation.wasserstein_distance \
--dataset yan \
--gen_csv data/gen_data/yan_data_mixdata_iter3_top_426.csv
A good Wasserstein distance for Yan is \(< 0.01\).
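As a rough illustration of what this metric measures, the 1-Wasserstein distance between two equal-sized one-dimensional samples is the mean absolute difference between their sorted values. The sketch below shows that special case only. How the wasserstein_distance module aggregates over genes or handles unequal sample sizes is not stated in this guide, so the function name and the toy values are assumptions.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-sized 1D samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    # Optimal transport in 1D pairs the i-th smallest value of xs with the i-th smallest of ys.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


real = [0.0, 1.0, 2.0, 3.0]
gen = [3.5, 0.5, 2.5, 1.5]  # same shape as `real`, shifted by 0.5 and shuffled
print(wasserstein_1d(real, gen))  # 0.5
```

Smaller values mean the generated distribution sits closer to the real one; identical samples give 0.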
6. View the Results
ls results/
Typical output:
yan_cv2_ari_nmi_f1.csv
yan_cv2_umap_real.pdf
yan_cv2_umap_gen.pdf
yan_wasserstein_distance.csv
Next Steps
Now that you’ve seen the basic workflow:
Run on larger datasets: Replace --dataset yan with pollen, cbmc, or muraro.
Plug in your own data: See Preparing Your Data.
Full end-to-end tutorial: Tutorial: End-to-End Guide.
Biological validation: Tutorial: Biological Validation.
Benchmark against SOTA: Tutorial: Benchmark Against SOTA.
Common First-Time Issues
| Problem | Solution |
|---|---|
| | Install PyG: |
| | Reduce |
| | Check the filename pattern in the console output; it depends on the iteration and dataset. |
| | Increase the resolution sweep range in |
| | Try a different feature selection method ( |
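Because the generated filename depends on the iteration and dataset, a small helper can locate the newest CSV instead of hard-coding the name passed to --gen_csv. This is a hypothetical convenience sketch, not part of GARAGE. It assumes generated files live in data/gen_data/ and start with the dataset name followed by an underscore, as in the example filename used in this guide.

```python
from pathlib import Path


def latest_generated_csv(dataset, gen_dir="data/gen_data"):
    """Return the most recently modified '<dataset>_*.csv' in gen_dir, or None if none exist."""
    candidates = sorted(
        Path(gen_dir).glob(f"{dataset}_*.csv"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    print(latest_generated_csv("yan"))
```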