data_validation module
Data validation for GARAGE‑generated scRNA‑seq data.
- Computes clustering quality metrics (ARI, NMI, macro-F1) by:
Loading generated data (GARAGE or baselines).
Loading the corresponding real data and ground-truth labels.
Applying feature selection (CV², Fano, PCA loading) to the generated data, then filtering the real data to the selected features.
Clustering the filtered real data with Leiden (resolution sweep).
Reporting ARI, NMI, and macro-F1 against ground truth.
Uses Scanpy for PCA, neighbourhood graph, Leiden clustering, and UMAP. Matches the workflow in data_vaidation_garage.ipynb (original notebook).
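The reporting step described above can be sketched as follows. This is a minimal illustration, not the module's code: `score_clustering` is a hypothetical helper, and the way cluster IDs are mapped to ground-truth classes before computing macro-F1 (one-to-one Hungarian matching, with a majority-class fallback for unmatched clusters) is an assumption, since the mapping is not specified here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (
    adjusted_rand_score,
    f1_score,
    normalized_mutual_info_score,
)


def score_clustering(true_labels, cluster_ids):
    """Return ARI, NMI and macro-F1 for one clustering (hypothetical helper)."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)

    classes, true_idx = np.unique(true_labels, return_inverse=True)
    clusters, clus_idx = np.unique(cluster_ids, return_inverse=True)

    # Contingency table: rows are clusters, columns are true classes.
    table = np.zeros((clusters.size, classes.size), dtype=int)
    np.add.at(table, (clus_idx, true_idx), 1)

    # Fallback: every cluster starts out mapped to its majority class.
    assigned = classes[table.argmax(axis=1)]

    # Assumption: one-to-one matches that maximise total overlap
    # override the fallback for the matched clusters.
    rows, cols = linear_sum_assignment(-table)
    assigned[rows] = classes[cols]

    predicted = assigned[clus_idx]
    return {
        "ari": adjusted_rand_score(true_labels, cluster_ids),
        "nmi": normalized_mutual_info_score(true_labels, cluster_ids),
        "macro_f1": f1_score(true_labels, predicted, average="macro"),
    }
```

ARI and NMI are invariant to how clusters are named, so only macro-F1 depends on the mapping step.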
Usage
python -m data_validation.data_validation --dataset muraro --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv --method cv2 --plot_umap
- data_validation.data_validation.cluster_and_evaluate(real_filt, true_labels, resolution, n_pcs=20, n_neighbors=30)
- data_validation.data_validation.cv2_selection(data, n_genes=100)
- data_validation.data_validation.load_generated(gen_csv, n_features)
- data_validation.data_validation.load_labels(dataset_name)
- data_validation.data_validation.load_real(dataset_name)
- data_validation.data_validation.main()
- data_validation.data_validation.plot_umap(adata, title, save_path, dpi=300)
- data_validation.data_validation.sweep_resolution(real_filt, true_labels, res_range, n_pcs=20, n_neighbors=30)
Feature selection for scRNA-seq data (Python port of feature_selection.R).
Three methods are provided, mirroring the original R implementation:
Fano factor (fano_selection): selects the genes with the lowest variance-to-mean ratio (Fano factor). R equivalent: Fano_ind() at feature_selection.R:12-19.
PCA loading (pca_loading_selection): selects the top-k genes by absolute loading on PC1–PC3. R equivalent: PCA_loading() at feature_selection.R:10-21.
CV² (normalised coefficient-of-variation squared) (cv2_selection): computes per-gene dispersion (variance/mean), bins genes by mean expression, normalises by bin median/MAD, and returns the top-k genes by normalised dispersion. R equivalent: CV2() at feature_selection.R:21-55.
Strategy
Feature selection from generated data, applied to real data.
Feature selection from real data, applied to real data.
Feature selection from combined (gen + real) data, applied to real data.
Usage
python -m data_validation.feature_selection --method cv2 --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv --real_csv data/muraro_expression_matrix.csv --transpose False --header 0
- data_validation.feature_selection.cv2_selection(data, n_genes=100)
Select top n_genes by normalised CV² dispersion.
- Procedure:
For each gene, compute dispersion = variance / mean.
Partition genes into bins by mean-expression quantiles.
Normalise each gene’s dispersion by its bin’s median and MAD.
Return the n_genes with the largest normalised dispersion.
R equivalent: CV2() in feature_selection.R.
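The four-step procedure above can be sketched with NumPy as follows. This is an illustrative approximation rather than the module's code: the function name `cv2_selection_sketch`, the `n_bins` parameter, the cells-by-genes layout, and the handling of zero-mean genes and zero-MAD bins are all assumptions.

```python
import numpy as np


def cv2_selection_sketch(X, n_genes=100, n_bins=20):
    """Return column indices of the top n_genes by normalised dispersion."""
    X = np.asarray(X, dtype=float)  # assumed layout: rows = cells, columns = genes
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)

    # Step 1: dispersion = variance / mean (zero-mean genes become NaN).
    with np.errstate(divide="ignore", invalid="ignore"):
        disp = np.where(mean > 0, var / mean, np.nan)

    # Step 2: assign each gene to a bin by quantiles of mean expression.
    edges = np.quantile(mean, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_id = np.searchsorted(edges, mean, side="right")

    # Step 3: normalise each gene's dispersion by its bin's median and MAD.
    norm = np.full(disp.shape, np.nan)
    for b in np.unique(bin_id):
        idx = (bin_id == b) & np.isfinite(disp)
        if not idx.any():
            continue
        med = np.median(disp[idx])
        mad = np.median(np.abs(disp[idx] - med))
        if mad > 0:  # bins with zero MAD are left as NaN and ranked last
            norm[idx] = (disp[idx] - med) / mad

    # Step 4: keep the genes with the largest normalised dispersion.
    order = np.argsort(-np.nan_to_num(norm, nan=-np.inf), kind="stable")
    return order[:n_genes]
```

Normalising within bins means a gene is compared only with genes of similar mean expression, so highly expressed genes do not dominate the ranking.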
- data_validation.feature_selection.fano_selection(data, n_genes=100)
Select n_genes with the LOWEST Fano factor (variance / mean).
R equivalent: Fano_ind() in feature_selection.R.
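A minimal sketch of this ranking, assuming a cells-by-genes array. The name `fano_selection_sketch` is hypothetical, and ranking genes with zero mean last is an assumption.

```python
import numpy as np


def fano_selection_sketch(X, n_genes=100):
    """Return column indices of the n_genes with the lowest Fano factor."""
    X = np.asarray(X, dtype=float)  # assumed layout: rows = cells, columns = genes
    mean = X.mean(axis=0)
    var = X.var(axis=0, ddof=1)

    # Fano factor = variance / mean; zero-mean genes get +inf so they rank last.
    with np.errstate(divide="ignore", invalid="ignore"):
        fano = np.where(mean > 0, var / mean, np.inf)

    # Ascending sort: lowest Fano factor first.
    return np.argsort(fano, kind="stable")[:n_genes]
```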
- data_validation.feature_selection.main()
- data_validation.feature_selection.pca_loading_selection(data, n_genes=100, n_components=3)
Rank genes by the maximum absolute loading across the first n_components principal components.
R equivalent: PCA_loading() in feature_selection.R.
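The ranking rule can be sketched with a plain SVD. This is an illustration under assumptions: the input is a cells-by-genes array, genes are centred but not scaled, and loadings are taken as the unit-length principal axes. The name `pca_loading_selection_sketch` is hypothetical.

```python
import numpy as np


def pca_loading_selection_sketch(X, n_genes=100, n_components=3):
    """Return column indices ranked by max absolute loading on the first PCs."""
    X = np.asarray(X, dtype=float)  # assumed layout: rows = cells, columns = genes
    centred = X - X.mean(axis=0)

    # Rows of vt are the principal axes; each entry is one gene's loading.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)

    # Score each gene by its largest absolute loading over the first components.
    score = np.abs(vt[:n_components]).max(axis=0)
    return np.argsort(-score, kind="stable")[:n_genes]
```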
- data_validation.feature_selection.run_feature_selection(gen_csv, real_csv, out_dir, method='cv2', header=None, transpose=False, index_col=None, label_csv=None, n_genes=100)
Feature-selection pipeline.
Reads the generated and real CSV files, applies method to the generated, real, and combined (gen + real) data, then writes three filtered versions of the real data to out_dir:
datafilt1.csv: real data filtered to features selected from the generated data.
datafilt2.csv: real data filtered to features selected from the real data.
datafilt_combined.csv: real data filtered to features selected from the combined (gen + real) data.
- Parameters:
gen_csv (str): Path to the generated-data CSV file.
real_csv (str): Path to the real-data CSV file.
out_dir (str): Output directory.
method (str): One of {"fano", "pca", "cv2"}.
header, transpose, index_col: Options controlling how the real-data CSV is read (header and index_col are pd.read_csv arguments).
label_csv (str or None): If the real CSV includes a label column, separate it.
n_genes (int): Number of features to select (default 100).
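The three outputs can be illustrated with a short pandas sketch. It is not the pipeline's implementation: `filter_real_three_ways`, `write_filtered`, and the `select` callable are hypothetical, the frames are assumed to be cells-by-genes with gene names as columns, and "combined" is assumed to mean row-wise concatenation over the genes shared by both inputs.

```python
from pathlib import Path

import pandas as pd


def filter_real_three_ways(gen, real, select, n_genes=100):
    """Filter `real` to features chosen from gen, real, and gen+real data.

    `select` is a hypothetical callable: (DataFrame, n_genes) -> column names.
    """
    # Only genes present in both frames can be used to filter the real data.
    shared = real.columns.intersection(gen.columns)

    # Assumption: the combined data stacks gen and real cells row-wise.
    combined = pd.concat([gen[shared], real[shared]], axis=0, ignore_index=True)

    picks = {
        "datafilt1.csv": select(gen[shared], n_genes),
        "datafilt2.csv": select(real, n_genes),
        "datafilt_combined.csv": select(combined, n_genes),
    }
    return {name: real.loc[:, list(cols)] for name, cols in picks.items()}


def write_filtered(frames, out_dir):
    """Write each filtered frame to out_dir under its file name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, frame in frames.items():
        frame.to_csv(out / name)
```

Any of the three selection methods could be passed as `select`, provided it returns column names rather than positional indices.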
Core Functions
The data_validation module contains:
validate_generated_data(): end-to-end validation that loads data, runs feature selection, performs Leiden clustering over a resolution sweep, and reports ARI/NMI/macro-F1.
cv2_selection(): selects the top n_genes by normalised coefficient of variation squared (CV²).
fano_selection(): selects the n_genes with the lowest Fano factor.
pca_loading_selection(): selects the top n_genes by PCA loading on the first 3 components.
The feature_selection module provides a standalone Python port of the original
feature_selection.R script (Fano, PCA loading, CV²).
Resolution Sweep
The Leiden resolution sweep is controlled by data_validation.data_validation.RESOLUTION_RANGES, a dictionary mapping dataset names to numpy arrays of resolution values.
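As a rough illustration of how such a dictionary might drive a sweep, consider the sketch below. The resolution values are placeholders rather than the module's actual ranges, `sweep` and `evaluate` are hypothetical names, and picking the best resolution by ARI is an assumption.

```python
import numpy as np

# Placeholder values for illustration only; the real ranges are defined in
# data_validation.data_validation.RESOLUTION_RANGES.
RESOLUTION_RANGES = {
    "muraro": np.round(np.arange(0.1, 1.01, 0.1), 2),
}


def sweep(dataset_name, evaluate, ranges=RESOLUTION_RANGES):
    """Evaluate every resolution for a dataset and return the best by ARI.

    `evaluate` is a hypothetical callable: resolution -> dict with an "ari" key.
    """
    results = []
    for res in ranges[dataset_name]:
        metrics = evaluate(float(res))
        results.append({"resolution": float(res), **metrics})

    # Assumption: the resolution with the highest ARI is reported as best.
    best = max(results, key=lambda r: r["ari"])
    return best, results
```

Keeping the ranges per dataset lets each dataset be swept over values suited to its expected number of cell types.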
Reference Notebook
The original validation notebook is preserved at
data_validation/data_vaidation_garage.ipynb for reference and reproducibility.