data_validation module

Data validation for GARAGE‑generated scRNA‑seq data.

Computes clustering quality metrics (ARI, NMI, macro-F1) by:

1. Loading generated data (GARAGE or baselines).
2. Loading the corresponding real data and ground-truth labels.
3. Applying feature selection (CV², Fano, PCA loading) to the generated data, then filtering the real data to the selected features.
4. Clustering the filtered real data with Leiden (resolution sweep).
5. Reporting ARI, NMI, and macro-F1 against ground truth.
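The reporting step can be illustrated with a minimal NumPy sketch of two of the metrics. It is not the module's implementation. ARI is computed from the contingency table of true classes against clusters. For macro-F1 the sketch assumes each cluster is mapped to its majority true class before scoring; this page does not state which matching rule the module uses. The function names below are hypothetical.

```python
import numpy as np


def contingency(true_labels, pred_labels):
    """Count matrix: rows are true classes, columns are predicted clusters."""
    _, t = np.unique(np.asarray(true_labels), return_inverse=True)
    _, p = np.unique(np.asarray(pred_labels), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)  # accumulate one count per cell
    return table


def _pairs(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2.0


def adjusted_rand_index(true_labels, pred_labels):
    """ARI from the contingency table (assumes at least two cells)."""
    table = contingency(true_labels, pred_labels)
    index = _pairs(table).sum()
    row_pairs = _pairs(table.sum(axis=1)).sum()
    col_pairs = _pairs(table.sum(axis=0)).sum()
    expected = row_pairs * col_pairs / _pairs(table.sum())
    max_index = (row_pairs + col_pairs) / 2.0
    if max_index == expected:  # degenerate case: a single class and cluster
        return 1.0
    return float((index - expected) / (max_index - expected))


def macro_f1_majority(true_labels, pred_labels):
    """Macro-F1 after mapping each cluster to its majority class (assumption)."""
    table = contingency(true_labels, pred_labels)
    n_classes = table.shape[0]
    conf = np.zeros((n_classes, n_classes))
    for cluster, cls in enumerate(table.argmax(axis=0)):
        conf[:, cls] += table[:, cluster]  # relabel the cluster as class cls
    tp = np.diag(conf)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(f1.mean())
```

Cluster identifiers are arbitrary, so a clustering that reproduces the true partition under different names still scores 1.0 on both metrics.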

Uses Scanpy for PCA, neighbourhood graph, Leiden clustering, and UMAP. Matches the workflow in data_vaidation_garage.ipynb (original notebook).

Usage

python -m data_validation.data_validation --dataset muraro --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv --method cv2 --plot_umap

data_validation.data_validation.cluster_and_evaluate(real_filt, true_labels, resolution, n_pcs=20, n_neighbors=30)

data_validation.data_validation.cv2_selection(data, n_genes=100)

data_validation.data_validation.load_generated(gen_csv, n_features)

data_validation.data_validation.load_labels(dataset_name)

data_validation.data_validation.load_real(dataset_name)

data_validation.data_validation.main()

data_validation.data_validation.plot_umap(adata, title, save_path, dpi=300)

data_validation.data_validation.sweep_resolution(real_filt, true_labels, res_range, n_pcs=20, n_neighbors=30)

Feature selection for scRNA-seq data (Python port of feature_selection.R).

Three methods are provided, mirroring the original R implementation:

Fano factor (fano_selection): selects genes with the lowest variance-to-mean ratio (Fano factor). R equivalent: Fano_ind() at feature_selection.R:12-19.

PCA loading (pca_loading_selection): selects the top-k genes by absolute loading on PC1–PC3. R equivalent: PCA_loading() at feature_selection.R:10-21.

CV² (normalised coefficient-of-variation squared) (cv2_selection): computes per-gene dispersion (variance/mean), bins genes by mean expression, normalises by bin median/MAD, and returns the top-k genes by normalised dispersion. R equivalent: CV2() at feature_selection.R:21-55.

Strategy

1. Feature selection from generated data, applied to real data.
2. Feature selection from real data, applied to real data.
3. Feature selection from combined (gen + real) data, applied to real data.
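The three strategies differ only in which matrix the features are chosen from; the real data is always the matrix that gets filtered. The sketch below makes two assumptions that this page does not confirm: the generated and real matrices are cells x genes with identical gene columns, and "combined" means stacking them row-wise. `top_variance` is a hypothetical stand-in selector, not one of the module's three methods. The dictionary keys reuse the output file names listed for run_feature_selection.

```python
import numpy as np


def top_variance(data, n_genes):
    """Hypothetical stand-in selector: indices of the most variable columns."""
    return np.argsort(-data.var(axis=0), kind="stable")[:n_genes]


def filter_real_three_ways(gen, real, select, n_genes):
    """Filter the real matrix with features chosen from gen, real, and gen+real."""
    gen = np.asarray(gen, dtype=float)
    real = np.asarray(real, dtype=float)
    combined = np.vstack([gen, real])  # assumption: row-wise concatenation
    return {
        "datafilt1": real[:, select(gen, n_genes)],
        "datafilt2": real[:, select(real, n_genes)],
        "datafilt_combined": real[:, select(combined, n_genes)],
    }


# Toy matrices: 2 cells x 3 genes each.
gen = [[0, 0, 0], [0, 4, 0]]
real = [[0, 0, 1], [2, 0, 1]]
filtered = filter_real_three_ways(gen, real, top_variance, n_genes=1)
```

Any selector with the same `(data, n_genes)` signature could be passed in place of `top_variance`.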

Usage

python -m data_validation.feature_selection --method cv2 --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv --real_csv data/muraro_expression_matrix.csv --transpose False --header 0

data_validation.feature_selection.cv2_selection(data, n_genes=100)

Select top n_genes by normalised CV² dispersion.

Procedure:

1. For each gene, compute dispersion = variance / mean.
2. Partition genes into bins by mean-expression quantiles.
3. Normalise each gene’s dispersion by its bin’s median and MAD.
4. Return the n_genes with the largest normalised dispersion.

R equivalent: CV2() in feature_selection.R.
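The procedure above can be sketched in NumPy as follows. This is an illustrative reading, not the module's code. It assumes a cells x genes matrix, a hypothetical `n_bins` parameter, normalisation as (dispersion - bin median) / bin MAD, and that genes with zero mean are never selected. The actual port may differ in any of these details.

```python
import numpy as np


def cv2_selection_sketch(data, n_genes=100, n_bins=20):
    """Return column indices of the genes with the largest normalised dispersion."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    var = data.var(axis=0, ddof=1)

    # Step 1: dispersion = variance / mean, defined only for expressed genes.
    expressed = mean > 0
    dispersion = np.zeros_like(mean)
    dispersion[expressed] = var[expressed] / mean[expressed]

    # Step 2: assign genes to bins by quantiles of mean expression.
    edges = np.quantile(mean[expressed], np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(mean, edges[1:-1])  # bin ids 0 .. n_bins - 1

    # Step 3: normalise within each bin by median and MAD.
    normalised = np.full(mean.shape, -np.inf)  # unexpressed genes sort last
    for b in np.unique(bins[expressed]):
        members = expressed & (bins == b)
        median = np.median(dispersion[members])
        mad = np.median(np.abs(dispersion[members] - median))
        if mad > 0:
            normalised[members] = (dispersion[members] - median) / mad
        else:
            normalised[members] = 0.0  # bin has no spread to normalise by

    # Step 4: largest normalised dispersion first.
    return np.argsort(-normalised, kind="stable")[:n_genes]


# Toy matrix: 4 cells x 6 genes, three low-mean and three high-mean genes.
toy = np.array([
    [1, 0, 0, 10, 10, 0],
    [1, 2, 0, 10, 10, 0],
    [1, 2, 6, 10, 12, 24],
    [1, 4, 6, 10, 12, 24],
])
top_two = cv2_selection_sketch(toy, n_genes=2, n_bins=2)
```

Because the normalisation is done per bin, a gene is compared only against genes of similar mean expression, which is why the most dispersed gene of each bin leads the ranking in the toy example.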

data_validation.feature_selection.fano_selection(data, n_genes=100)

Select n_genes with the LOWEST Fano factor (variance / mean).

R equivalent: Fano_ind() in feature_selection.R.
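A minimal sketch of this selection rule, assuming a cells x genes matrix and assuming that genes with zero mean (where the ratio is undefined) are ranked last. How the real function handles that case is not stated here.

```python
import numpy as np


def fano_selection_sketch(data, n_genes=100):
    """Return column indices of the n_genes with the lowest variance / mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    var = data.var(axis=0, ddof=1)
    fano = np.full(mean.shape, np.inf)  # undefined ratios sort last
    np.divide(var, mean, out=fano, where=mean > 0)
    return np.argsort(fano, kind="stable")[:n_genes]


# Toy matrix: 4 cells x 6 genes.
toy = np.array([
    [1, 0, 0, 10, 10, 0],
    [1, 2, 0, 10, 10, 0],
    [1, 2, 6, 10, 12, 24],
    [1, 4, 6, 10, 12, 24],
])
lowest_three = fano_selection_sketch(toy, n_genes=3)
```

In this sketch, constant non-zero genes have a Fano factor of 0 and therefore rank first.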

data_validation.feature_selection.main()

data_validation.feature_selection.pca_loading_selection(data, n_genes=100, n_components=3)

Rank genes by the maximum absolute loading across the first n_components principal components.

R equivalent: PCA_loading() in feature_selection.R.
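The ranking rule can be sketched with a plain SVD. The sketch assumes a cells x genes matrix and centring without scaling; whether the original R code or the Python port also scales genes to unit variance is not stated on this page.

```python
import numpy as np


def pca_loading_selection_sketch(data, n_genes=100, n_components=3):
    """Rank genes by their maximum absolute loading on the leading components."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal axes; each entry is one gene's loading.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    score = np.abs(vt[:n_components]).max(axis=0)
    return np.argsort(-score, kind="stable")[:n_genes]


# Toy matrix: 4 cells x 3 genes; the third gene carries most of the variance.
toy = np.array([
    [0, 0, 10],
    [0, 1, -10],
    [1, 0, 10],
    [1, 1, -10],
])
top_gene = pca_loading_selection_sketch(toy, n_genes=1, n_components=1)
```

Taking the maximum across components means a gene needs a large loading on only one of the leading components to be selected.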

data_validation.feature_selection.run_feature_selection(gen_csv, real_csv, out_dir, method='cv2', header=None, transpose=False, index_col=None, label_csv=None, n_genes=100)

Feature-selection pipeline.

Reads generated and real CSV files, applies method to each, then writes three filtered versions of the real data to out_dir:

datafilt1.csv: real data filtered to features selected from gen data
datafilt2.csv: real data filtered to features selected from real data
datafilt_combined.csv: real data filtered to features selected from gen+real data

Parameters:

gen_csv (str): Path to the generated-data CSV file.
real_csv (str): Path to the real-data CSV file.
out_dir (str): Output directory.
method (str): One of {"fano", "pca", "cv2"}.
header, transpose, index_col: Loading options for the real data; header and index_col are passed to pd.read_csv.
label_csv (str or None): If the real CSV includes a label column, separate it.
n_genes (int): Number of features to select (default 100).

Core Functions

The data_validation module contains:

validate_generated_data(): end-to-end validation. Loads data, runs feature selection, performs Leiden clustering over a resolution sweep, and reports ARI/NMI/macro-F1.
cv2_selection(): selects the top n_genes by normalised CV² dispersion.
fano_selection(): selects the n_genes with the lowest Fano factor (variance / mean).
pca_loading_selection(): selects the top n_genes by maximum absolute PCA loading across the first 3 components (the default n_components).

The feature_selection module provides a standalone Python port of the original feature_selection.R script (Fano, PCA loading, CV²).

Resolution Sweep

The Leiden resolution sweep is controlled by data_validation.data_validation.RESOLUTION_RANGES, a dictionary mapping dataset names to numpy arrays of resolution values.
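As an illustration of that structure, the sketch below builds a dictionary of the same shape and loops over one entry. The resolution values, the name `RESOLUTION_RANGES_EXAMPLE`, and the `evaluate` callable are hypothetical. Picking the resolution with the highest score is also an assumption, since this page only says that metrics are reported across the sweep.

```python
import numpy as np

# Hypothetical values; the real RESOLUTION_RANGES entries are not shown here.
RESOLUTION_RANGES_EXAMPLE = {
    "muraro": np.linspace(0.1, 1.0, 10),
}


def sweep_sketch(res_range, evaluate):
    """Score every resolution and return (best, all_results).

    evaluate is a hypothetical callable mapping a resolution to one score,
    for example the ARI of a Leiden clustering run at that resolution.
    """
    results = [(float(r), float(evaluate(float(r)))) for r in res_range]
    best = max(results, key=lambda pair: pair[1])
    return best, results


# Stand-in scoring function that peaks at resolution 0.5.
best, results = sweep_sketch(
    RESOLUTION_RANGES_EXAMPLE["muraro"],
    evaluate=lambda r: -(r - 0.5) ** 2,
)
```

Keeping the ranges in a per-dataset dictionary lets each dataset use a grid suited to its expected number of clusters without changing the sweep code.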

Reference Notebook

The original validation notebook is preserved at data_validation/data_vaidation_garage.ipynb for reference and reproducibility.