Troubleshooting
Common problems and their solutions.
Installation issues
PyTorch Geometric installation fails
ImportError: No module named 'torch_geometric'
Fix: Install PyG matching your PyTorch/CUDA version:
# PyTorch 2.4 with CUDA 12.1
pip install torch_geometric
pip install torch_scatter torch_sparse torch_cluster -f https://data.pyg.org/whl/torch-2.4.0+cu121.html
Alternatively, use Conda:
conda install pyg -c pyg
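The wheel index URL in the pip command above depends on the installed PyTorch and CUDA versions. As a minimal sketch, the hypothetical helper below (not part of GARAGE) builds that URL from version strings. It assumes the data.pyg.org naming scheme used above, that wheels are published per minor release with the patch version set to 0, and that CPU-only builds use a `cpu` suffix.

```python
def pyg_wheel_url(torch_version, cuda_version=None):
    """Build the PyG wheel index URL for a PyTorch/CUDA pair.

    torch_version: e.g. "2.4.0" or "2.4.1+cu121" (as in torch.__version__).
    cuda_version:  e.g. "12.1" (as in torch.version.cuda), or None for CPU.
    """
    # Drop any local build tag such as "+cu121".
    base = torch_version.split("+")[0]
    major, minor = base.split(".")[:2]
    # Assumption: wheels are indexed per minor release, as "X.Y.0".
    torch_tag = f"{major}.{minor}.0"
    if cuda_version is None:
        cuda_tag = "cpu"
    else:
        cuda_tag = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_tag}+{cuda_tag}.html"


print(pyg_wheel_url("2.4.0", "12.1"))
```

With PyTorch already installed, `torch.__version__` and `torch.version.cuda` can be passed in directly.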
NumPy/SciPy version mismatch
UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required
Fix: This warning appears when the installed NumPy version falls outside the range that the installed SciPy supports. It is generally harmless for GARAGE’s usage. To remove it, re-create the environment:
conda create --name venv_garage python=3.12.5
conda activate venv_garage
pip install -r requirements_garage.txt
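Before rebuilding the environment, it can help to confirm which NumPy version is actually installed. The sketch below uses only the standard library; the helper names are hypothetical, and the bounds are copied from the example warning above, so they should be replaced with the values from your own warning.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.24.3' into (1, 24, 3) using the first three numeric fields."""
    return tuple(int(field) for field in re.findall(r"\d+", text)[:3])


def in_supported_range(installed, minimum, upper):
    """True if minimum <= installed < upper."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(upper)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


numpy_version = installed_version("numpy")
if numpy_version is not None:
    ok = in_supported_range(numpy_version, "1.17.3", "1.25.0")
    status = "within" if ok else "outside"
    print(f"numpy {numpy_version} is {status} the range named in the warning")
```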
Runtime errors
CUDA out of memory
RuntimeError: CUDA out of memory. Tried to allocate ...
Fixes (in order of preference):
Reduce leakage_fraction in config.py (0.2 → 0.1).
Reduce the GAN batch size (edit garage.py).
Run on CPU by modifying garage.py:
DEVICE = torch.device("cpu")
For CBMC specifically, the 7,895 × 2,000 matrix with KNN graph can be memory-intensive. Consider using a CPU-only run.
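Rather than hard-coding the device, one option is to fall back to the CPU automatically. This is a sketch only: `select_device` is a hypothetical helper, not something garage.py is known to provide, and the import guard exists solely so the snippet also runs where PyTorch is absent.

```python
try:
    import torch
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None


def select_device(force_cpu=False):
    """Return "cuda" only when a GPU is usable and not overridden."""
    if force_cpu or torch is None or not torch.cuda.is_available():
        return "cpu"
    return "cuda"


print(select_device(force_cpu=True))  # prints "cpu"
```

In garage.py this could be used as `DEVICE = torch.device(select_device(force_cpu=True))` for a CPU-only CBMC run.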
File not found
FileNotFoundError: [Errno 2] No such file or directory: 'data/cell_types/...'
Fix: Ensure you’re running from the repository root:
cd /path/to/GARAGE
python -m data_generation.garage --dataset yan
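A quick way to confirm the working directory is to check for the directories that the command and error message above refer to. The helper below is a hypothetical sketch; it assumes that data/ and data_generation/ both sit directly under the repository root.

```python
from pathlib import Path


def looks_like_repo_root(path="."):
    """True if both directories named in the commands above are present."""
    root = Path(path)
    return (root / "data").is_dir() and (root / "data_generation").is_dir()


if not looks_like_repo_root():
    print(f"{Path.cwd()} does not look like the repository root")
```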
Data shape mismatch
ValueError: shapes (124,10564) and (150,100) not aligned
Fix: The generated file may be from a different dataset than the one specified. Check:
The --dataset flag matches the input the generator was trained on.
The --gen_csv file corresponds to the correct dataset.
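To see which file has the unexpected dimensions, compare the raw shapes of the real and generated CSV files. The sketch below counts every line and field, so a header row or index column (if your files have one, which is not specified here) is included in the totals. `csv_shape` is a hypothetical helper.

```python
import csv


def csv_shape(path):
    """Return (n_rows, n_columns) of a CSV file, counting every line."""
    n_rows, n_cols = 0, 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            n_rows += 1
            n_cols = max(n_cols, len(row))
    return n_rows, n_cols
```

Calling `csv_shape` on both the real input and the file passed to --gen_csv shows at a glance whether the gene and cell counts belong to the same dataset.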
Leiden returns 1 cluster
Best resolution: 0.10 | ARI: 0.00 (only 1 cluster found)
Fix: The resolution sweep range is too low for your data. Edit the range in data_validation/data_validation.py:
RESOLUTION_RANGES = {
    "my_dataset": np.arange(0.01, 5.01, 0.05),  # Wider range
}
Leiden returns too many clusters (1 per cell)
Best resolution: 3.00 | ARI: 0.00 (n_clusters = n_cells)
Fix: The resolution is too high. Narrow the range:
RESOLUTION_RANGES = {
    "my_dataset": np.arange(0.05, 1.01, 0.05),  # Narrower range
}
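The two Leiden failure cases above can be told apart from the cluster count alone. The function below is an illustrative sketch with a hypothetical name; it is not part of data_validation.py.

```python
def diagnose_clustering(n_clusters, n_cells):
    """Map a degenerate cluster count to the adjustment described above."""
    if n_clusters <= 1:
        return "resolution too low: widen the sweep upwards"
    if n_clusters >= n_cells:
        return "resolution too high: narrow the sweep"
    return "ok"


print(diagnose_clustering(1, 124))
```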
Training problems
GAT loss doesn’t decrease
epoch 0: loss=2.302, acc=0.140
epoch 1000: loss=2.301, acc=0.142
epoch 7000: loss=2.298, acc=0.145
Fix: The GAT is not learning. Check:
The expression matrix may be on a scale that destabilises training, such as large raw counts. Try scaling the values to \([0, 1]\).
The cell-type labels might not be parseable. Check DATASET_CONFIG['your_dataset']['label_col'].
Reduce the learning rate in the GAT’s Adam optimiser.
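One way to recognise this failure is to compare the loss with the chance-level value. Assuming the GAT is trained with a cross-entropy loss using natural logarithms (an assumption, not confirmed here), uniform predictions over K classes give a loss of ln K. The helper names below are hypothetical.

```python
import math


def chance_level_loss(n_classes):
    """Cross-entropy of a uniform prediction over n_classes."""
    return math.log(n_classes)


def is_stuck_at_chance(loss, n_classes, tolerance=0.05):
    """True if the loss is within tolerance of the chance-level value."""
    return abs(loss - chance_level_loss(n_classes)) < tolerance


print(round(chance_level_loss(10), 3))  # 2.303
print(is_stuck_at_chance(2.298, 10))    # True
```

A loss that stays near 2.30 for thousands of epochs, as in the log above, is what a 10-class model that has learned nothing would report.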
GAN generator loss diverges
iter 0: D_loss=1.231, G_loss=0.823
iter 1000: D_loss=0.035, G_loss=8.421
iter 2000: D_loss=0.001, G_loss=12.743
Fix: The discriminator is too strong relative to the generator. Solutions:
Reduce discriminator learning rate (d_lr: 0.0004 → 0.0001).
Increase ng_steps (more generator updates per discriminator step: 2 → 3).
Decrease nd_steps (fewer discriminator updates: 5 → 2).
Remove label smoothing (set label_smooth_real = 1.0, label_smooth_fake = 0.0).
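The divergence pattern above can be detected automatically from the training log. The sketch below assumes log lines in exactly the format shown above; the thresholds `d_floor` and `g_ratio` are arbitrary illustrative values, not GARAGE settings.

```python
import re

LOG_PATTERN = re.compile(r"iter\s+(\d+):\s*D_loss=([\d.]+),\s*G_loss=([\d.]+)")


def parse_losses(lines):
    """Extract (iteration, D_loss, G_loss) tuples from log lines."""
    records = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            records.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return records


def discriminator_dominates(records, d_floor=0.05, g_ratio=5.0):
    """Flag runs where D_loss has collapsed while G_loss has grown."""
    if len(records) < 2:
        return False
    first_g = records[0][2]
    last_d, last_g = records[-1][1], records[-1][2]
    return last_d < d_floor and last_g > g_ratio * first_g


log = [
    "iter 0: D_loss=1.231, G_loss=0.823",
    "iter 1000: D_loss=0.035, G_loss=8.421",
    "iter 2000: D_loss=0.001, G_loss=12.743",
]
print(discriminator_dominates(parse_losses(log)))  # True
```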
Mode collapse (generator produces only one cell type)
Check with UMAP:
python -m data_validation.data_validation \
--dataset muraro \
--gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv \
--method cv2 --plot_umap
If the UMAP shows only 1–2 clusters while the real data shows 10+, the generator has mode-collapsed.
Fix:
Increase leakage_fraction (0.2 → 0.3).
Increase priority_weight (2.0 → 4.0).
Train for fewer iterations (20,001 → 10,000); longer training sometimes worsens mode collapse.
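The visual UMAP check can be backed by a simple count. The sketch below assumes you already have one cluster label per cell for both the real and the generated data (for example from a clustering run); the function name and the 0.5 threshold are hypothetical.

```python
def looks_mode_collapsed(real_labels, generated_labels, min_ratio=0.5):
    """True if generated data covers under min_ratio of the real cluster count."""
    n_real = len(set(real_labels))
    n_generated = len(set(generated_labels))
    return n_generated < min_ratio * n_real


real = list(range(10)) * 3      # 10 clusters in the real data
generated = [0, 0, 1, 0, 1, 0]  # only 2 clusters in the generated data
print(looks_mode_collapsed(real, generated))  # True
```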
NaN losses
iter 500: D_loss=nan, G_loss=nan
Fix: Numerical instability. Solutions:
Reduce learning rates.
Add gradient clipping before each optimiser step (for example with torch.nn.utils.clip_grad_norm_).
Check that the input data doesn’t contain NaN values.
Increase the epsilon in Adam (torch.optim.Adam(..., eps=1e-4)).
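Checking the input for NaN values can be done before any training starts. The sketch below works on a plain list of rows and is meant as an illustration; `find_non_finite` is a hypothetical helper, and for a full expression matrix a vectorised check would be faster.

```python
import math


def find_non_finite(rows):
    """Return (row, column) positions of NaN or infinite entries."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isfinite(float(value)):
                bad.append((i, j))
    return bad


matrix = [[0.0, 1.5], [float("nan"), 2.0]]
print(find_non_finite(matrix))  # [(1, 0)]
```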
Validation problems
All metrics are zero
ARI: 0.0000, NMI: 0.0000, F1: 0.0000
Fix: Nothing was learned. Check:
The generated CSV contains variation (not all zeros/ones).
Feature selection picked genes that exist in both real and generated data.
The --dataset flag matches the data the model was trained on.
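The first two checks can be scripted. The sketch below assumes you can obtain the gene names of both files as lists and one column of generated values as a list; the helper names and gene names are hypothetical.

```python
def shared_genes(real_genes, generated_genes):
    """Genes present in both files, in the order of the real data."""
    generated = set(generated_genes)
    return [gene for gene in real_genes if gene in generated]


def has_variation(values):
    """False when every value in the column is identical."""
    return len(set(values)) > 1


print(shared_genes(["GeneA", "GeneB", "GeneC"], ["GeneC", "GeneA"]))
print(has_variation([0.0, 0.0, 0.0]))
```

An empty overlap or a column without variation would each be enough to drive every clustering metric to zero.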
UMAPs look identical
If the real and generated UMAPs are near-identical, the leakage fraction may be too high, so the generator is simply copying real data. Reduce the leakage fraction \(\lambda\) (leakage_fraction in config.py).
Build and docs
Sphinx build fails with “WARNING: document isn’t included in any toctree”
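This warning only fails the build when warnings are treated as errors (sphinx-build -W). Every .md and .rst file in the docs/ directory must either appear in a toctree (for example in index.rst) or be listed in exclude_patterns in conf.py.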
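As an illustration, excluding files in conf.py might look like the fragment below. The `drafts/*.md` pattern is a hypothetical example; patterns are glob-style and relative to the documentation source directory.

```python
# docs/conf.py (fragment)
# "_build" keeps build output out of the source scan; "drafts/*.md" is a
# hypothetical pattern for files that should not be built.
exclude_patterns = ["_build", "drafts/*.md"]
```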
ReadTheDocs build fails on external images
ReadTheDocs builds may fail to fetch images from external URLs. Host all images locally in docs/images/ and reference them with relative paths.
Still stuck?
Check the FAQ for common questions.
Open a GitHub Issue with:
Full error traceback.
Your config.py settings.
Operating system and Python version.
Whether you’re using CPU or GPU.
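The environment details for an issue report can be collected with a short script. This is a sketch using the standard library, with an optional PyTorch section; `environment_report` is a hypothetical helper, not a GARAGE command.

```python
import platform
import sys


def environment_report():
    """Collect OS, Python and (if installed) PyTorch/CUDA details."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch
        report["torch"] = torch.__version__
        report["cuda_available"] = str(torch.cuda.is_available())
    except ImportError:
        report["torch"] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```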