How-to: Feature Selection

A recipe for selecting the most informative genes from your generated data.

Goal

Identify the top \(k\) genes that capture the most biological variability in generated scRNA-seq data, for downstream clustering evaluation.

Prerequisites

Generated data CSV from GARAGE or a baseline model.
The corresponding real expression matrix and cell-type labels (via config.py).

Steps

1. CV² selection (recommended default)

python -m data_validation.data_validation \
    --dataset muraro \
    --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv \
    --method cv2 \
    --n_genes 100

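CV² = \(\sigma^2 / \mu^2\), the squared coefficient of variation, where \(\sigma^2\) and \(\mu\) are a gene's variance and mean across cells. This method selects the genes with the highest relative variability.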
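To make the ranking concrete, here is a minimal NumPy sketch of CV² selection. It assumes a cells × genes matrix. The function name `cv2_top_genes` and the `eps` guard are hypothetical and are not taken from the project's `cv2_selection`, whose internals may differ.

```python
import numpy as np


def cv2_top_genes(expr, n_genes=100, eps=1e-12):
    """Indices of the n_genes columns with the largest CV² (illustrative sketch)."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)  # per-gene mean across cells
    var = expr.var(axis=0)    # per-gene (population) variance across cells
    # Assumption: genes whose mean is ~0 get a score of 0 instead of inf/NaN.
    safe_mean_sq = np.maximum(mean ** 2, eps)
    cv2 = np.where(mean ** 2 > eps, var / safe_mean_sq, 0.0)
    # Sort descending and keep the first n_genes indices.
    return np.argsort(cv2)[::-1][:n_genes]


# Toy matrix: 3 cells x 3 genes. Gene 0 is constant, genes 1 and 2 share
# the same mean (10) but gene 2 varies far more.
demo = np.array([
    [5.0,  9.0,  2.0],
    [5.0, 10.0, 10.0],
    [5.0, 11.0, 18.0],
])
top = cv2_top_genes(demo, n_genes=2)
print(top)  # gene 2 first, then gene 1
```

The constant gene scores zero and is never selected ahead of a gene that varies.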

2. Fano index selection

python -m data_validation.data_validation \
    --dataset muraro \
    --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv \
    --method fano \
    --n_genes 100

Fano = \(\sigma^2 / \mu\), the Fano factor. It is similar to CV² but divides the variance by \(\mu\) rather than \(\mu^2\), so it penalises high-mean genes less.
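The difference in mean scaling is easiest to see on a toy example. The sketch below is an assumption-laden illustration, not the project's `fano_selection`: `fano_top_genes` is a hypothetical name, the input is assumed to be a non-negative cells × genes matrix, and genes with a near-zero mean are given a score of 0.

```python
import numpy as np


def fano_top_genes(expr, n_genes=100, eps=1e-12):
    """Indices of the n_genes columns with the largest Fano factor (illustrative sketch)."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)  # per-gene mean across cells
    var = expr.var(axis=0)    # per-gene (population) variance across cells
    # Assumption: values are non-negative; near-zero means score 0.
    fano = np.where(mean > eps, var / np.maximum(mean, eps), 0.0)
    return np.argsort(fano)[::-1][:n_genes]


# Toy matrix: 2 cells x 3 genes.
# Gene 0: mean 2,  variance 1   -> CV² = 0.25, Fano = 0.5
# Gene 1: mean 20, variance 100 -> CV² = 0.25, Fano = 5.0
# Gene 2: constant              -> both scores are 0
demo = np.array([
    [1.0, 10.0, 4.0],
    [3.0, 30.0, 4.0],
])
top = fano_top_genes(demo, n_genes=2)
print(top)  # gene 1 ranks above gene 0
```

Genes 0 and 1 are tied under CV², but Fano places the higher-mean gene first.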

3. PCA loading selection

python -m data_validation.data_validation \
    --dataset muraro \
    --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv \
    --method pca \
    --n_genes 100

Ranks genes by their aggregate loading on the first 3 principal components.
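One way to read "aggregate loading" is the sum of absolute loadings over the leading components. The sketch below uses that reading with a plain SVD. The aggregation rule and the name `pca_loading_top_genes` are assumptions, and the project's `pca_loading_selection` may aggregate differently (for example, by weighting each component by its explained variance).

```python
import numpy as np


def pca_loading_top_genes(expr, n_genes=100, n_components=3):
    """Rank genes by summed absolute loading on the leading PCs (illustrative sketch)."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)
    # Rows of vt are principal axes; vt[i, j] is the loading of gene j on PC i.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    # Assumption: "aggregate" means the sum of absolute loadings over k PCs.
    score = np.abs(vt[:k]).sum(axis=0)
    return np.argsort(score)[::-1][:n_genes]


# Synthetic data: 200 cells x 6 genes of faint noise, with strong
# independent signal added to genes 0, 2 and 5.
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 0.01, size=(200, 6))
for gene, sd in ((0, 5.0), (2, 3.0), (5, 2.0)):
    expr[:, gene] += rng.normal(0.0, sd, size=200)

top = pca_loading_top_genes(expr, n_genes=3)
print(sorted(top.tolist()))
```

Because the first three components span the three high-variance genes, those genes receive the largest aggregate loadings.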

4. Using the Python API directly

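import pandas as pd

from data_validation.data_validation import cv2_selection, fano_selection, pca_loading_selection

# Load the generated expression matrix as a NumPy array.
gen_data = pd.read_csv("data/gen_data/muraro_data_mixdata_iter3_top_426.csv").to_numpy()

# Select the top 100 genes by CV². fano_selection and pca_loading_selection
# are assumed to take the same arguments.
top_genes = cv2_selection(gen_data, n_genes=100)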

Choosing the right method

Method	Best for	Weakness
CV²	General purpose; good for log-normalised data.	Sensitive to low-mean genes (division by small μ).
Fano	Count-based or raw-normalised data.	Favours high-mean genes.
PCA loading	Captures global variance structure.	Sensitive to batch effects and outliers.
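The weaknesses in the table follow from how each statistic scales with the mean. As an illustration, assume that a gene's counts follow a negative binomial distribution with mean \(\mu\) and dispersion \(\phi\), so that \(\sigma^2 = \mu + \phi\mu^2\). This is a modelling assumption made for the example, not something the pipeline enforces. Then

\[
\mathrm{CV}^2 = \frac{\sigma^2}{\mu^2} = \frac{1}{\mu} + \phi,
\qquad
\mathrm{Fano} = \frac{\sigma^2}{\mu} = 1 + \phi\mu .
\]

Under this model, CV² grows without bound as \(\mu \to 0\), so lowly expressed genes can dominate the ranking. Fano grows linearly with \(\mu\), so at equal dispersion the more highly expressed gene ranks higher.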

Rule of thumb: Start with CV² and 100 genes. If the ARI is unexpectedly low, fall back to PCA loading.
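To compare the methods side by side, the three commands from the steps above can be assembled in a loop. This is a hedged sketch: it uses only the flags shown in this guide, `build_command` and `run_all` are hypothetical helper names, and it defaults to a dry run that prints the commands rather than executing them.

```python
import shlex
import subprocess
import sys

DATASET = "muraro"
GEN_CSV = "data/gen_data/muraro_data_mixdata_iter3_top_426.csv"
METHODS = ("cv2", "fano", "pca")


def build_command(method, n_genes=100):
    """Build the argv list for one feature-selection run."""
    return [
        sys.executable, "-m", "data_validation.data_validation",
        "--dataset", DATASET,
        "--gen_csv", GEN_CSV,
        "--method", method,
        "--n_genes", str(n_genes),
    ]


def run_all(dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for method in METHODS:
        cmd = build_command(method)
        print(shlex.join(cmd))
        if not dry_run:
            # Requires the data_validation package and the input files.
            subprocess.run(cmd, check=True)


run_all(dry_run=True)
```

Call `run_all(dry_run=False)` from the project root to execute the runs in sequence.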

Output

Each command above prints the selected gene indices and the clustering metrics (ARI, NMI, F1). It also saves the results to results/<dataset>_<method>_ari_nmi_f1.csv.

Troubleshooting

Symptom	Fix
CV² returns all zeros	Every gene in the generated data has zero variance. Check that the GAN produced diverse output.
PCA loading selects the same 3 genes	The first 3 PCs explain nearly all variance (common for simple datasets). Try Fano.
All methods give similar ARI ≈ 0.5	This may be the best the model can do; see Interpret Outputs.
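The first two symptoms can be checked directly on the generated matrix before rerunning selection. The sketch below is a hypothetical diagnostic, not part of the project. It assumes a cells × genes array and reports zero-variance genes, distinct cells, and the share of variance explained by the leading principal components.

```python
import numpy as np


def summarize_variability(expr, tol=1e-12):
    """Count zero-variance genes and distinct cells (illustrative diagnostic)."""
    expr = np.asarray(expr, dtype=float)
    var = expr.var(axis=0)
    return {
        "n_genes": int(expr.shape[1]),
        "n_zero_variance_genes": int((var <= tol).sum()),
        # Few distinct rows suggests the generator is repeating itself.
        "n_unique_cells": int(np.unique(expr, axis=0).shape[0]),
    }


def leading_variance_share(expr, n_components=3):
    """Fraction of total variance captured by the first n_components PCs."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    total = float((s ** 2).sum())
    if total == 0.0:
        return 0.0  # no variance at all
    return float((s[:n_components] ** 2).sum() / total)


# Toy matrix: 3 cells x 3 genes, only gene 1 varies and two cells are identical.
demo = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 5.0, 3.0],
])
summary = summarize_variability(demo)
share = leading_variance_share(demo, n_components=3)
print(summary, share)
```

A variance share close to 1 for three components corresponds to the situation in the second row of the table, where PCA loading keeps returning the same few genes.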