scRNA-seq Data Challenges

The Structure of scRNA-seq Data

Let \(X \in \mathbb{R}^{n \times g}\) be a gene expression matrix with \(n\) cells and \(g\) genes. Each entry \(X_{ij}\) represents the expression level (often log-normalised counts) of gene \(j\) in cell \(i\).

Key Properties

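High-dimensionality: \(g\) ranges from \(2{,}000\) to \(>30{,}000\) genes. For small datasets \(g \gg n\), but this does not always hold (the CBMC matrix mentioned under Common Pitfalls is \(7{,}895 \times 2{,}000\)).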
Sparsity: Most entries are zero (the “zero-inflation” problem). Typical dropout rates are 50–90%.
Heteroscedastic noise: Variance depends on mean expression level. Highly expressed genes have larger absolute variance, while relative noise, measured by the squared coefficient of variation (\(\text{CV}^2 = \sigma^2 / \mu^2\)), is largest for lowly expressed genes.
Batch effects: Technical variation between sequencing runs, labs, or protocols can mimic or mask biological signal.
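
The sparsity and mean-variance properties above can be checked directly on an expression matrix. The following is a minimal sketch, assuming a dense NumPy array of raw counts with cells in rows and genes in columns; the function name `sparsity_and_cv2` and the Poisson toy data are illustrative assumptions, not part of GARAGE.

```python
import numpy as np

def sparsity_and_cv2(X):
    """Return the fraction of zero entries and the per-gene CV^2 of X (cells x genes)."""
    X = np.asarray(X, dtype=float)
    zero_fraction = float(np.mean(X == 0))
    mu = X.mean(axis=0)   # per-gene mean across cells
    var = X.var(axis=0)   # per-gene (population) variance across cells
    # CV^2 is undefined for genes that are never expressed; mark them as NaN.
    cv2 = np.full(mu.shape, np.nan)
    expressed = mu > 0
    cv2[expressed] = var[expressed] / mu[expressed] ** 2
    return zero_fraction, cv2

# Toy counts (assumption): three genes with low, medium and high mean expression.
rng = np.random.default_rng(0)
X_toy = rng.poisson(lam=[0.1, 1.0, 10.0], size=(500, 3))
zero_fraction, cv2 = sparsity_and_cv2(X_toy)
print(f"zero fraction: {zero_fraction:.2f}")
print("CV^2 per gene:", np.round(cv2, 2))
```

For Poisson-like counts the absolute variance grows with the mean while \(\text{CV}^2\) shrinks, so the lowly expressed gene in the toy data has the largest \(\text{CV}^2\).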

Rare Cell Collapse in Generative Models

Consider a dataset with \(K\) cell types where type \(k\) has \(n_k\) cells. If one type is rare (e.g., \(n_{\text{rare}} = 10\) while \(n_{\text{abundant}} = 5{,}000\)), standard generative models produce output that is overwhelmingly from the abundant types.

Why this happens:

The GAN generator’s loss gradient is dominated by the abundant types.
The discriminator rarely sees real rare cells, so it cannot penalise the generator for failing to produce them.
Mode collapse reinforces this — once the generator settles on producing abundant cell types, it has no incentive to explore rare ones.
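
To make the imbalance concrete, the sketch below estimates how often a training minibatch contains any rare cell at all. It assumes only two cell types (10 rare and 5,000 abundant cells), uniform sampling, and a batch size of 64; the batch size and the function name `rare_cell_batch_stats` are assumptions chosen for illustration. Sampling is treated as with replacement, which is a close approximation when the batch is much smaller than the dataset.

```python
def rare_cell_batch_stats(n_rare, n_total, batch_size):
    """Expected rare cells per batch and probability that a batch has none.

    Assumes uniform sampling with replacement (an approximation).
    """
    p = n_rare / n_total                 # chance that one sampled cell is rare
    expected = batch_size * p            # expected rare cells per batch
    p_none = (1.0 - p) ** batch_size     # probability of a batch with no rare cell
    return expected, p_none

# 10 rare + 5,000 abundant cells, batch size 64 (assumed).
expected, p_none = rare_cell_batch_stats(n_rare=10, n_total=5010, batch_size=64)
print(f"expected rare cells per batch: {expected:.3f}")
print(f"probability of a batch with no rare cell: {p_none:.3f}")
```

Under these assumptions a batch holds about 0.13 rare cells on average and roughly 88% of batches contain none, so most discriminator updates never see the rare type.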

The Fano Index as a Diagnostic

The Fano index (variance-to-mean ratio) quantifies gene expression variability:

\[\text{Fano}(j) = \frac{\sigma_j^2}{\mu_j}\]

Here \(\mu_j\) and \(\sigma_j^2\) are the mean and variance of gene \(j\) across all cells.

Genes with high Fano indices vary more across cells than their mean expression alone would predict, and they often define rare cell type identities. A generator that fails to preserve high-Fano genes is likely dropping rare cell types.
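
As a rough illustration of this diagnostic, the sketch below computes per-gene Fano indices for a "real" and a "generated" count matrix and compares them. The data are synthetic: the real matrix contains 10 rare cells with a strongly expressed marker gene, and the generated matrix lacks them. The function name `fano_index` and all numbers are assumptions for the example, not GARAGE code.

```python
import numpy as np

def fano_index(X):
    """Per-gene variance-to-mean ratio of a count matrix X (cells x genes)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    # The ratio is undefined for genes with zero mean; mark them as NaN.
    fano = np.full(mu.shape, np.nan)
    expressed = mu > 0
    fano[expressed] = var[expressed] / mu[expressed]
    return fano

rng = np.random.default_rng(0)
# "Generated" data: 1,000 cells, 5 genes, no rare population.
X_gen = rng.poisson(2.0, size=(1000, 5))
# "Real" data: same background, but 10 rare cells express gene 0 strongly.
X_real = rng.poisson(2.0, size=(1000, 5))
X_real[:10, 0] = rng.poisson(50.0, size=10)

fano_real = fano_index(X_real)
fano_gen = fano_index(X_gen)
print("real:     ", np.round(fano_real, 2))
print("generated:", np.round(fano_gen, 2))
```

Pure Poisson counts have a Fano index near 1, so the marker gene stands out in the real data and falls back to the background level in the generated data, which is the signature of a dropped rare type.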

Dimensionality Reduction for scRNA-seq

scRNA-seq data requires aggressive dimensionality reduction before clustering or GAN training:

Method	Use in GARAGE	When to use
PCA	Pre-step before UMAP	Reduce from \(g\) to 50–100 components.
CV² filtering	Select top-\(k\) genes (default: 100)	Retain the most variable genes.
Fano index	Alternative to CV²	Useful for count-based data.
PCA loading	Rank genes by loading on PC1–PC3	Focus on genes that define major axes of variation.
UMAP	2D visualisation	Qualitative sanity check.
Leiden clustering	Community detection for ARI/NMI/F1	Quantitative comparison of real vs. generated.

Dependency chain between these methods: feature selection \(\to\) PCA \(\to\) KNN graph \(\to\) Leiden clustering \(\to\) ARI/NMI/F1/UMAP.
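
The first three stages of this chain can be sketched with NumPy alone. This is a simplified illustration, not the GARAGE implementation: the function names (`select_top_cv2_genes`, `pca_scores`, `knn_indices`) and the toy data are assumptions, PCA is done by a plain SVD, and the KNN search builds a dense \(n \times n\) distance matrix, which is only practical for small \(n\). Leiden clustering and the metrics are omitted because they need an external graph-clustering library.

```python
import numpy as np

def select_top_cv2_genes(X, n_genes=100):
    """Indices of the n_genes genes with the highest CV^2 (variance / mean^2)."""
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    # Genes with zero mean get -inf so they are never selected.
    cv2 = np.where(mu > 0, var / np.maximum(mu, 1e-12) ** 2, -np.inf)
    return np.argsort(cv2)[::-1][:n_genes]

def pca_scores(X, n_components=50):
    """Project cells onto the leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    n_components = min(n_components, S.shape[0])
    return U[:, :n_components] * S[:n_components]

def knn_indices(Z, k=15):
    """Indices of the k nearest neighbours of each cell (Euclidean, brute force)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)  # squared distances, n x n
    np.fill_diagonal(d2, np.inf)                      # exclude self-matches
    return np.argsort(d2, axis=1)[:, :k]

# Toy data (assumption): 200 cells x 500 genes of Poisson counts.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 500)).astype(float)
genes = select_top_cv2_genes(X, n_genes=100)
Z = pca_scores(X[:, genes], n_components=20)
neighbours = knn_indices(Z, k=10)
print(genes.shape, Z.shape, neighbours.shape)
```

Because each stage consumes the output of the previous one, a poor choice early on (for example too few selected genes) propagates into the graph and therefore into every downstream clustering metric.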

Common Pitfalls

Pitfall	Symptom	Fix
Leiden resolution too low	Only 1–2 clusters found.	Increase resolution (range: 0.1–3.0 depending on dataset).
Leiden resolution too high	Every cell is its own cluster.	Decrease resolution.
Feature selection too aggressive	ARI ~ 0 on all methods.	Increase `n_genes` or use PCA loading instead of CV².
GAT attention collapse	All cells get equal attention.	Increase `priority_weight` or reduce `gat_epochs`.
CUDA OOM on CBMC	Out-of-memory error on the 7,895 × 2,000 matrix with its KNN graph.	Reduce `k` in KNN or use CPU fallback.
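
Both Leiden resolution pitfalls lead to the same ARI symptom, which the sketch below illustrates. It implements the adjusted Rand index from a contingency table in NumPy and evaluates it on a toy labelling with two equally sized cell types; the function name `adjusted_rand_index` and the labels are assumptions for the example, and a library implementation would normally be used in practice.

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    a_ids, a_inv = np.unique(labels_a, return_inverse=True)
    b_ids, b_inv = np.unique(labels_b, return_inverse=True)
    # Contingency table: how many cells fall in cluster i of A and cluster j of B.
    contingency = np.zeros((a_ids.size, b_ids.size), dtype=np.int64)
    np.add.at(contingency, (a_inv, b_inv), 1)

    def pairs(x):
        return x * (x - 1) / 2.0  # number of unordered pairs among x items

    sum_cells = pairs(contingency).sum()
    sum_a = pairs(contingency.sum(axis=1)).sum()
    sum_b = pairs(contingency.sum(axis=0)).sum()
    expected = sum_a * sum_b / pairs(len(a_inv))
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # both labelings are trivial and identical
    return (sum_cells - expected) / (max_index - expected)

true_labels = np.array([0] * 50 + [1] * 50)
one_cluster = np.zeros(100, dtype=int)   # resolution too low
singletons = np.arange(100)              # resolution too high

print(adjusted_rand_index(true_labels, one_cluster))
print(adjusted_rand_index(true_labels, singletons))
print(adjusted_rand_index(true_labels, true_labels))
```

A single cluster and all-singleton clusters both score an ARI of 0 against the true labels, so a near-zero ARI alone does not reveal which direction the resolution should move; checking the number of clusters found does.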