scRNA-seq Data Challenges
The Structure of scRNA-seq Data
Let \(X \in \mathbb{R}^{n \times g}\) be a gene expression matrix with \(n\) cells and \(g\) genes. Each entry \(X_{ij}\) represents the expression level (often log-normalised counts) of gene \(j\) in cell \(i\).
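As a rough illustration of what "log-normalised counts" can mean, the sketch below scales each cell to a fixed total and applies \(\log(1 + x)\). The function name `log_normalise` and the target of 10,000 counts per cell are assumptions chosen for the example, not values taken from this document.

```python
import numpy as np

def log_normalise(counts, target_sum=1e4):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)   # library size per cell
    totals[totals == 0] = 1.0                    # avoid dividing by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# Toy matrix: n = 3 cells (rows), g = 4 genes (columns).
counts = np.array([[0, 2, 0, 8],
                   [1, 0, 0, 1],
                   [0, 0, 5, 5]])
X = log_normalise(counts)
print(X.shape)  # (3, 4)
```

Zero counts stay exactly zero after this transform, so the sparsity of the raw matrix carries over to \(X\).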
Key Properties
High-dimensionality: \(g\) ranges from \(2{,}000\) to \(>30{,}000\) genes. In small datasets \(g \gg n\), although large atlases can contain more cells than genes.
Sparsity: Most entries are zero (the “zero-inflation” problem). Typical dropout rates are 50–90%.
Heteroscedastic noise: Variance depends on mean expression level. Highly expressed genes show larger absolute variance, while lowly expressed genes show larger noise relative to their mean. This relationship is summarised by the squared coefficient of variation (\(\text{CV}^2 = \sigma^2 / \mu^2\)).
Batch effects: Technical variation between sequencing runs, labs, or protocols can mimic or mask biological signal.
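The sparsity and mean-variance properties above can be checked directly on an expression matrix. The following is a minimal sketch on synthetic Poisson counts. The Poisson model and all sizes are assumptions for illustration; real scRNA-seq counts are usually more overdispersed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic counts: 200 cells x 50 genes, each gene with its own Poisson mean.
gene_means = rng.uniform(0.05, 5.0, size=50)
X = rng.poisson(gene_means, size=(200, 50)).astype(float)

sparsity = np.mean(X == 0)            # fraction of zero entries in the matrix

mu = X.mean(axis=0)                   # per-gene mean across cells
var = X.var(axis=0, ddof=1)           # per-gene sample variance across cells
expressed = mu > 0
cv2 = np.full(mu.shape, np.nan)       # CV^2 is undefined for unexpressed genes
cv2[expressed] = var[expressed] / mu[expressed] ** 2

print(f"sparsity: {sparsity:.2f}")
print("corr(mean, variance):", np.corrcoef(mu, var)[0, 1])
```

In this toy setting the variance grows with the mean while \(\text{CV}^2\) shrinks, which is why ranking genes by raw variance and ranking them by \(\text{CV}^2\) can pick quite different genes.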
Rare Cell Collapse in Generative Models
Consider a dataset with \(K\) cell types where type \(k\) has \(n_k\) cells. If one type is rare (e.g., \(n_{\text{rare}} = 10\) while \(n_{\text{abundant}} = 5{,}000\)), standard generative models produce output that is overwhelmingly from the abundant types.
Why this happens:
The GAN generator’s loss gradient is dominated by the abundant types.
The discriminator rarely sees real rare cells, so it cannot penalise the generator for failing to produce them.
Mode collapse reinforces this — once the generator settles on producing abundant cell types, it has no incentive to explore rare ones.
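A back-of-the-envelope calculation shows how little the discriminator sees of the rare type. The sketch below assumes, purely for illustration, that the dataset contains only the two types from the example (10 rare and 5,000 abundant cells), that minibatches are drawn uniformly with replacement, and that the batch size is 64. None of these training details come from this document.

```python
n_rare, n_abundant = 10, 5_000      # counts from the example above
batch_size = 64                     # hypothetical minibatch size

p_rare = n_rare / (n_rare + n_abundant)          # chance a sampled cell is rare
expected_rare_per_batch = batch_size * p_rare    # mean number of rare cells per batch
# Probability that a batch sampled with replacement contains no rare cell at all.
p_batch_without_rare = (1 - p_rare) ** batch_size

print(f"P(rare cell)          = {p_rare:.4f}")
print(f"E[rare cells / batch] = {expected_rare_per_batch:.3f}")
print(f"P(batch has no rare)  = {p_batch_without_rare:.3f}")
```

Under these assumptions a batch holds about 0.13 rare cells on average, and close to 88% of batches contain none, so most gradient steps carry no signal about the rare type.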
The Fano Index as a Diagnostic
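The Fano index (variance-to-mean ratio) quantifies gene expression variability:
\[ \text{Fano} = \frac{\sigma^2}{\mu} \]
where \(\mu\) and \(\sigma^2\) are a gene's mean and variance across cells.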
Genes with high Fano indices have high biological variability and often define rare cell type identities. A generator that fails to reproduce the variability of high-Fano genes is likely dropping rare cell types.
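One way to use this diagnostic is to compute the per-gene Fano index on real and generated matrices and compare the genes that rank highest in the real data. The sketch below uses synthetic data. The helper `fano_index`, the single marker gene, and the "collapsed generator" stand-in are all hypothetical and serve only to show the pattern.

```python
import numpy as np

def fano_index(X, eps=1e-12):
    """Per-gene variance-to-mean ratio for a cells x genes matrix."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    return var / (mu + eps)           # eps guards against genes with zero mean

rng = np.random.default_rng(0)
n_cells, n_genes = 1_000, 20
real = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
# Hypothetical marker gene (column 0): silent except in 10 "rare" cells.
real[:, 0] = 0.0
real[:10, 0] = rng.poisson(50.0, size=10)

# Stand-in for a collapsed generator: it never produces the rare cells.
generated = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
generated[:, 0] = 0.0

fano_real = fano_index(real)
fano_gen = fano_index(generated)
top_real = np.argsort(fano_real)[::-1][:3]   # highest-Fano genes in the real data

for j in top_real:
    print(f"gene {j}: real Fano = {fano_real[j]:.2f}, generated Fano = {fano_gen[j]:.2f}")
```

The marker gene dominates the real ranking but has a Fano index of zero in the generated data, which is the signature of a dropped rare population.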
Dimensionality Reduction for scRNA-seq
scRNA-seq data requires aggressive dimensionality reduction before clustering or GAN training:
| Method | Use in GARAGE | When to use |
|---|---|---|
| PCA | Pre-step before UMAP | Reduce from \(g\) to 50–100 components. |
| CV² filtering | Select top-\(k\) genes (default: 100) | Retain the most variable genes. |
| Fano index | Alternative to CV² | Useful for count-based data. |
| PCA loading | Rank genes by loading on PC1–PC3 | Focus on genes that define major axes of variation. |
| UMAP | 2D visualisation | Qualitative sanity check. |
| Leiden clustering | Community detection for ARI/NMI/F1 | Quantitative comparison of real vs. generated. |
These methods depend on one another and run as a pipeline: feature selection \(\to\) PCA \(\to\) KNN graph \(\to\) Leiden clustering \(\to\) ARI/NMI/F1/UMAP. A poor choice at an early step propagates to every later one.
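The first three stages of this pipeline can be sketched with NumPy alone. This is not GARAGE's implementation: the function names, the brute-force neighbour search, the 15 neighbours, and the toy data are assumptions for illustration. The Leiden and metric steps are left out because they need dedicated libraries (for example `leidenalg` and `scikit-learn`).

```python
import numpy as np

def select_top_cv2(X, k):
    """Indices of the k genes with the largest squared coefficient of variation."""
    mu = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)
    cv2 = np.where(mu > 0, var / np.maximum(mu, 1e-12) ** 2, 0.0)
    return np.argsort(cv2)[::-1][:k]

def pca(X, n_components):
    """Project centred data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def knn_indices(Z, n_neighbors):
    """Brute-force k-nearest neighbours (Euclidean); suitable for small n only."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                     # a cell is not its own neighbour
    return np.argsort(d2, axis=1)[:, :n_neighbors]

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(300, 500)).astype(float)  # toy data: 300 cells x 500 genes

genes = select_top_cv2(X, k=100)              # feature selection
Z = pca(X[:, genes], n_components=50)         # PCA on the selected genes
neighbours = knn_indices(Z, n_neighbors=15)   # KNN graph as an index array
print(Z.shape, neighbours.shape)              # (300, 50) (300, 15)
```

Because each stage consumes the output of the previous one, changing \(k\) in the feature selection step changes the PCA embedding, the neighbour graph, and therefore any clustering built on top of it.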
Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Leiden resolution too low | Only 1–2 clusters found. | Increase resolution (range: 0.1–3.0 depending on dataset). |
| Leiden resolution too high | Every cell is its own cluster. | Decrease resolution. |
| Feature selection too aggressive | ARI ~ 0 on all methods. | Increase |
| GAT attention collapse | All cells get equal attention. | Increase |
| CUDA OOM on CBMC | 7,895 × 2,000 matrix with KNN graph. | Reduce |
Further Reading
Hicks, S.C., et al. “Missing data and technical variability in single-cell RNA-sequencing experiments.” Biostatistics, 2018.
Luecken, M.D. and Theis, F.J. “Current best practices in single-cell RNA-seq analysis: a tutorial.” Molecular Systems Biology, 2019.