Preparing Your Data

GARAGE can generate synthetic cells from any scRNA-seq dataset — not just the four built-in ones. This guide explains how to format and register your own data.

Required Input Files

You need exactly two CSV/TSV files:

File	Format	Example
Expression matrix	Rows = cells, columns = genes. Values = log-normalised or scaled counts.	`my_data_counts.csv`
Cell-type labels	One column with the cell type for each cell. Same row order as the expression matrix.	`my_data_labels.csv`

Expression Matrix Format

The expression matrix should be a cells × genes CSV file (a genes × cells file also works if you set transpose to True in its config entry). Values are typically one of:

Log-normalised counts (e.g., the natural-log output of Seurat's NormalizeData, or of Scanpy's sc.pp.normalize_total followed by sc.pp.log1p).
Scaled counts (e.g., from sc.pp.scale).
Raw normalised counts (the GAN will re-scale via Sigmoid).
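
If you are starting from raw counts, the following is a minimal sketch of log-normalisation with NumPy. The function name log_normalise and the default scale of 10,000 are illustrative assumptions, not part of GARAGE; pass scale=1e6 for CPM, and divide the result by np.log(2) if you want log2 values.

```python
import numpy as np

def log_normalise(counts, scale=1e4):
    """Scale each cell (row) to `scale` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # keep all-zero cells at zero instead of dividing by zero
    return np.log1p(counts / totals * scale)

# 2 cells x 3 genes of raw counts (made-up values for illustration)
raw = np.array([[1, 0, 3],
                [0, 5, 5]])
normalised = log_normalise(raw)
print(normalised.shape)
```

Zero counts stay at zero after this transform, so the sparsity pattern of the matrix is preserved.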

Option A: No header (cells × genes, no gene names)

0.0,1.2,0.0,3.4,0.0,...
0.0,0.0,5.6,0.0,2.1,...
...

The Yan and Pollen files also have no header, but they are stored genes × cells, so their config entries set transpose to True.

Option B: With header (rows = cells, columns = gene names)

cell_id,ENSG000001,RPL13A,GAPDH,...
cell_1,0.0,1.2,0.0,...
cell_2,0.0,0.0,5.6,...
...

The CBMC and Muraro datasets use this header-plus-ID-column layout.

The critical settings in config.py are:

Parameter	`yan`/`pollen`	`cbmc`/`muraro`
`header`	`None`	`0`
`transpose`	`True` (files are genes × cells)	`True` for `cbmc` (file is genes × cells); `False` when the file is already cells × genes
`index_col`	—	`0` (first column is cell ID)
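
To see how these three settings interact, here is a minimal sketch of a loader built on pandas.read_csv. The function load_expression is hypothetical and is not GARAGE's actual loading code; it only assumes that header and index_col are passed through to pandas and that transpose flips a genes × cells file into cells × genes.

```python
import io
import pandas as pd

def load_expression(source, header=None, index_col=None, transpose=False):
    """Hypothetical loader: always returns a cells x genes DataFrame."""
    df = pd.read_csv(source, header=header, index_col=index_col)
    if transpose:  # the file was stored genes x cells
        df = df.T
    return df

# Option A style: no header, no ID column, already cells x genes
option_a = io.StringIO("0.0,1.2,0.0\n0.0,0.0,5.6\n")
cells_a = load_expression(option_a, header=None, index_col=None)

# Option B style: header row plus a cell-ID column
option_b = io.StringIO("cell_id,RPL13A,GAPDH\ncell_1,0.0,1.2\ncell_2,5.6,0.0\n")
cells_b = load_expression(option_b, header=0, index_col=0)

print(cells_a.shape, cells_b.shape)
```

Loading your own file this way and checking that the number of rows equals your number of cells is a quick test of whether transpose is set correctly.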

Cell-Type Labels Format

A single column of cell-type identifiers, one per cell:

cell_type
B cell
T cell CD4+
NK cell
...

Important: The row order must match the expression matrix. If your expression matrix has 300 rows (cells), your labels file must have 300 entries.
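
A quick way to catch a count mismatch before running GARAGE is sketched below. The helper check_alignment is an illustrative name, not a GARAGE function. It can only compare counts; if your labels file carries no cell IDs, the row order itself cannot be verified programmatically.

```python
import pandas as pd

def check_alignment(expression, labels):
    """Raise if the number of cells and the number of labels differ."""
    n_cells, n_labels = len(expression), len(labels)
    if n_cells != n_labels:
        raise ValueError(
            f"{n_cells} cells in the expression matrix but {n_labels} labels"
        )
    return n_cells

# Toy data: 3 cells x 2 genes and 3 matching labels
expression = pd.DataFrame([[0.0, 1.2], [0.0, 0.0], [3.4, 0.0]])
labels = pd.Series(["B cell", "NK cell", "B cell"], name="cell_type")
print(check_alignment(expression, labels))
```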

Registering Your Dataset

Add a new entry to DATASET_CONFIG in config.py:

DATASET_CONFIG = {
    # ... existing entries ...

    "my_dataset": {
        "expression_file": "my_data_counts.csv",
        "label_file": "my_data_labels.csv",
        "header": 0,              # or None if no header
        "index_col": 0,           # or None
        "transpose": False,       # True if the file is genes × cells
        "label_header": 0,        # or None if labels file has no header
        "label_col": "cell_type", # or 0 if using column index
        "rare_threshold": 50,     # types with < 50 cells are "rare"
        "iter_map": [0, 1, 2, 3, 4, 5],
    },
}
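
Before running anything, you may want to confirm that your entry has every key shown in the template. The sketch below does that with a plain set difference. Treating all nine template keys as required is an assumption (some may be optional in practice), and missing_keys is a hypothetical helper.

```python
# Keys taken from the template entry; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "expression_file", "label_file", "header", "index_col", "transpose",
    "label_header", "label_col", "rare_threshold", "iter_map",
}

def missing_keys(entry):
    """Return the template keys that are absent from a DATASET_CONFIG entry."""
    return sorted(REQUIRED_KEYS - set(entry))

# A deliberately incomplete entry for illustration
my_entry = {
    "expression_file": "my_data_counts.csv",
    "label_file": "my_data_labels.csv",
    "header": 0,
    "index_col": 0,
    "transpose": False,
}
print(missing_keys(my_entry))
```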

Setting the Rare Threshold

The rare_threshold parameter is critical: it defines which cell types GARAGE treats as “rare” and therefore prioritises. A good rule of thumb:

Total cells (\(n\))	Recommended threshold
< 500	\(n / 10\)
500 – 5,000	50 – 200
> 5,000	200 – 500
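
The rule of thumb can be written as a small helper that returns a (low, high) range for a given cell count. This is only a sketch of the table above: the function name is made up, and rounding n / 10 down to an integer is an assumption.

```python
def suggested_rare_threshold(n_cells):
    """Return a (low, high) range for rare_threshold based on the rule of thumb."""
    if n_cells < 500:
        t = n_cells // 10  # n / 10, rounded down (assumption)
        return (t, t)
    if n_cells <= 5000:
        return (50, 200)
    return (200, 500)

for n in (300, 2000, 8000):
    print(n, suggested_rare_threshold(n))
```

For example, a 300-cell dataset gets a single suggested value of 30, while larger datasets get a range to choose from.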

Run this quick check in Python to see which types would be flagged:

import pandas as pd
labels = pd.read_csv("my_data_labels.csv", header=0)
counts = labels["cell_type"].value_counts()
print(counts[counts < 50])  # types with fewer than 50 cells

Placing Your Files

Put the expression matrix and labels files in:

GARAGE/
├── data/
│   ├── cell_types/
│   │   └── my_data_labels.csv       ← your labels file
│   └── expression_matrix/
│       └── my_data_counts.csv       ← your expression matrix

The paths are relative to config.py’s DATA_DIR:

DATA_DIR = os.path.join(REPO_ROOT, "data")
CELL_TYPES_DIR = os.path.join(DATA_DIR, "cell_types")
EXPRESSION_DIR = os.path.join(DATA_DIR, "expression_matrix")
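
To confirm that your files sit where these paths point, you can rebuild the paths and test for their existence. The sketch below mirrors the directory layout shown above; expected_paths and missing_files are hypothetical helpers, and passing the repository root as an argument stands in for config.py's REPO_ROOT.

```python
import os

def expected_paths(repo_root, entry):
    """Build the paths where the two files are expected, following the layout above."""
    data_dir = os.path.join(repo_root, "data")
    return {
        "expression_file": os.path.join(data_dir, "expression_matrix", entry["expression_file"]),
        "label_file": os.path.join(data_dir, "cell_types", entry["label_file"]),
    }

def missing_files(repo_root, entry):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_paths(repo_root, entry).values() if not os.path.isfile(p)]

entry = {"expression_file": "my_data_counts.csv", "label_file": "my_data_labels.csv"}
for path in missing_files(".", entry):  # "." assumes you run this from the repository root
    print("missing:", path)
```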

Running GARAGE on Your Data

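Once your dataset is registered and the files are in place, generate synthetic cells, then validate the generated CSV (the --gen_csv filename below is an example; use the file your run produced):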

python -m data_generation.garage --dataset my_dataset
python -m data_validation.data_validation \
    --dataset my_dataset \
    --gen_csv data/gen_data/my_dataset_data_mixdata_iter3_top_426.csv \
    --method cv2

Troubleshooting

Problem	Likely Cause	Fix
`FileNotFoundError`	Expression or labels file not found.	Check that files exist in the expected subdirectory.
Shape mismatch	Expression matrix and labels have different row counts.	Verify that both files correspond to the same cells in the same order.
`ValueError` on `transpose`	Wrong transpose setting.	Try toggling `transpose` in `DATASET_CONFIG`.
GAT loss NaN	Zero-expression cells or genes.	Filter out cells/genes with zero total expression.
ARI = 0	Feature selection selected uninformative genes.	Try `--method pca` or increase `n_genes`.
CUDA OOM	Dataset too large for GPU.	Reduce `leakage_fraction` or run on CPU (set `DEVICE = "cpu"` in `config.py`).
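
For the zero-expression fix, the sketch below drops all-zero cells and genes with pandas while keeping the labels aligned. The helper drop_empty is illustrative and not part of GARAGE. It tests for "any non-zero value" rather than "sum equals zero", because scaled data can contain negative values that cancel out.

```python
import pandas as pd

def drop_empty(expression, labels):
    """Remove all-zero cells (rows) and genes (columns), keeping labels aligned."""
    keep_cells = (expression != 0).any(axis=1).to_numpy()
    keep_genes = (expression != 0).any(axis=0).to_numpy()
    filtered = expression.loc[keep_cells, keep_genes]
    return filtered, labels[keep_cells].reset_index(drop=True)

# Toy data: the second cell and the third gene are entirely zero
expression = pd.DataFrame(
    [[0.0, 1.2, 0.0], [0.0, 0.0, 0.0], [3.4, 0.0, 0.0]],
    columns=["g1", "g2", "g3"],
)
labels = pd.Series(["B cell", "NK cell", "T cell"])
filtered, kept_labels = drop_empty(expression, labels)
print(filtered.shape, list(kept_labels))
```

Save both filtered objects back to CSV together, so the expression matrix and labels file stay in the same row order.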

Example: Adding the CBMC Dataset from Scratch

For reference, here’s how the CBMC dataset entry looks in config.py:

"cbmc": {
    "expression_file": "cbmc_rna_scaled.csv",
    "label_file": "cell_type_cbmc.csv",
    "header": 0,
    "index_col": 0,
    "transpose": True,
    "label_header": 0,
    "label_col": "x",
    "rare_threshold": 200,
    "iter_map": [0, 1, 2, 3, 4, 5],
},

This entry describes a genes × cells matrix (hence transpose: True) with a header row and an index column, plus a cell-type labels file that has a header row and a label column named "x".