data_validation module

Data validation for GARAGE-generated scRNA-seq data.

Computes clustering quality metrics (ARI, NMI, macro-F1) by:
  1. Loading generated data (GARAGE or baselines).

  2. Loading the corresponding real data and ground-truth labels.

  3. Applying feature selection (CV², Fano, or PCA loading) to the generated data, then filtering the real data to the selected features.

  4. Clustering the filtered real data with Leiden (resolution sweep).

  5. Reporting ARI, NMI, and macro-F1 against ground truth.

Uses Scanpy for PCA, neighbourhood graph, Leiden clustering, and UMAP. Matches the workflow in data_vaidation_garage.ipynb (original notebook).
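The metrics reported in step 5 can be illustrated with a small pure-Python sketch. It is not the module's implementation: the function names are hypothetical, and the macro-F1 here assumes that each cluster is first relabelled with its most common ground-truth label, which the text above does not specify.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true, pred):
    """ARI computed from the contingency counts of two labelings."""
    n = len(true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:  # degenerate case, e.g. both labelings are one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(true, pred):
    """NMI normalised by the arithmetic mean of the two entropies."""
    n = len(true)
    count_true, count_pred = Counter(true), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in Counter(zip(true, pred)).items()
    )
    h_true = -sum((c / n) * log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * log(c / n) for c in count_pred.values())
    mean_entropy = 0.5 * (h_true + h_pred)
    return 1.0 if mean_entropy == 0 else mi / mean_entropy


def macro_f1_majority(true, pred):
    """Macro-F1 after mapping each cluster to its majority true label (assumption)."""
    votes = {}
    for t, p in zip(true, pred):
        votes.setdefault(p, Counter())[t] += 1
    mapped = [votes[p].most_common(1)[0][0] for p in pred]
    scores = []
    for label in set(true):
        tp = sum(t == label and m == label for t, m in zip(true, mapped))
        fp = sum(t != label and m == label for t, m in zip(true, mapped))
        fn = sum(t == label and m != label for t, m in zip(true, mapped))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

ARI and NMI are invariant to how cluster IDs are named, whereas F1 is not, which is why some cluster-to-label mapping is needed before macro-F1 can be computed.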

Usage

python -m data_validation.data_validation --dataset muraro --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv --method cv2 --plot_umap

data_validation.data_validation.cluster_and_evaluate(real_filt, true_labels, resolution, n_pcs=20, n_neighbors=30)
data_validation.data_validation.cv2_selection(data, n_genes=100)
data_validation.data_validation.load_generated(gen_csv, n_features)
data_validation.data_validation.load_labels(dataset_name)
data_validation.data_validation.load_real(dataset_name)
data_validation.data_validation.main()
data_validation.data_validation.plot_umap(adata, title, save_path, dpi=300)
data_validation.data_validation.sweep_resolution(real_filt, true_labels, res_range, n_pcs=20, n_neighbors=30)

Feature selection for scRNA-seq data (Python port of feature_selection.R).

Three methods are provided, mirroring the original R implementation:

  1. Fano factor (fano_selection): selects genes with the lowest variance-to-mean ratio (Fano factor). R equivalent: Fano_ind() at feature_selection.R:12-19

  2. PCA loading (pca_loading_selection): selects the top-k genes by absolute loading on PC1–PC3. R equivalent: PCA_loading() at feature_selection.R:10-21

  3. CV² (normalised coefficient of variation squared) (cv2_selection): computes per-gene dispersion (variance/mean), bins genes by mean expression, normalises by bin median/MAD, and returns the top-k genes by normalised dispersion. R equivalent: CV2() at feature_selection.R:21-55

Strategy

  1. Feature selection from generated data, applied to real data.

  2. Feature selection from real data, applied to real data.

  3. Feature selection from combined (gen + real) data, applied to real data.
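The three strategies differ only in which matrix the features are selected from; the real data is always the matrix that gets filtered. A minimal sketch, assuming cells-by-genes NumPy arrays with matching gene columns, and assuming that "combined" means stacking generated and real cells row-wise. Both select_by_variance and filter_real are hypothetical names used only for illustration.

```python
import numpy as np


def select_by_variance(data, n_genes):
    """Hypothetical stand-in selector: indices of the n_genes most variable columns."""
    return np.argsort(data.var(axis=0))[::-1][:n_genes]


def filter_real(gen, real, n_genes, selector=select_by_variance):
    """Return the real matrix filtered under each of the three strategies."""
    idx_gen = selector(gen, n_genes)                      # features chosen on generated data
    idx_real = selector(real, n_genes)                    # features chosen on real data
    idx_comb = selector(np.vstack([gen, real]), n_genes)  # features chosen on gen + real
    return real[:, idx_gen], real[:, idx_real], real[:, idx_comb]
```

Any of the three selection methods could be passed as selector, provided it returns column indices.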

Usage

python -m data_validation.feature_selection --method cv2 --gen_csv data/gen_data/muraro_data_mixdata_iter3_top_426.csv --real_csv data/muraro_expression_matrix.csv --transpose False --header 0

data_validation.feature_selection.cv2_selection(data, n_genes=100)

Select top n_genes by normalised CV² dispersion.

Procedure:
  1. For each gene, compute dispersion = variance / mean.

  2. Partition genes into bins by mean-expression quantiles.

  3. Normalise each gene’s dispersion by its bin’s median and MAD.

  4. Return the n_genes with the largest normalised dispersion.

R equivalent: CV2() in feature_selection.R.
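The procedure above can be sketched with NumPy as follows. This is an illustration rather than the module's code: the function name, the n_bins parameter and its default, the use of the sample variance, and the handling of zero-mean genes and zero-MAD bins are all assumptions.

```python
import numpy as np


def cv2_selection_sketch(data, n_genes=100, n_bins=20):
    """data: cells x genes array. Returns column indices of the selected genes."""
    mean = data.mean(axis=0)
    var = data.var(axis=0, ddof=1)
    # Step 1: dispersion = variance / mean (left at 0 for genes with zero mean).
    dispersion = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)

    # Step 2: assign each gene to a bin delimited by quantiles of mean expression.
    edges = np.quantile(mean, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, mean, side="right") - 1, 0, n_bins - 1)

    # Step 3: normalise dispersion within each bin by the bin median and MAD.
    normalised = np.zeros_like(dispersion)
    for b in np.unique(bin_ids):
        members = bin_ids == b
        median = np.median(dispersion[members])
        mad = np.median(np.abs(dispersion[members] - median))
        if mad > 0:  # bins with zero MAD keep a normalised value of 0
            normalised[members] = (dispersion[members] - median) / mad

    # Step 4: indices of the n_genes with the largest normalised dispersion.
    return np.argsort(normalised)[::-1][:n_genes]
```

Binning by mean expression keeps highly expressed genes from dominating the ranking, since each gene is compared only with genes of similar abundance.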

data_validation.feature_selection.fano_selection(data, n_genes=100)

Select n_genes with the LOWEST Fano factor (variance / mean).

R equivalent: Fano_ind() in feature_selection.R.
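A minimal NumPy sketch of this selection, assuming a cells-by-genes array. The function name is hypothetical, and treating zero-mean genes as having an infinite Fano factor (so they are selected last) is an assumption, not documented behaviour.

```python
import numpy as np


def fano_selection_sketch(data, n_genes=100):
    """data: cells x genes array. Returns indices of the lowest-Fano genes."""
    mean = data.mean(axis=0)
    var = data.var(axis=0, ddof=1)
    # Fano factor = variance / mean; genes with zero mean stay at +inf.
    fano = np.full(mean.shape, np.inf)
    np.divide(var, mean, out=fano, where=mean > 0)
    # Ascending sort, so the lowest Fano factors come first.
    return np.argsort(fano)[:n_genes]
```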

data_validation.feature_selection.main()
data_validation.feature_selection.pca_loading_selection(data, n_genes=100, n_components=3)

Rank genes by the maximum absolute loading across the first n_components principal components.

R equivalent: PCA_loading() in feature_selection.R.
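The ranking can be sketched with a singular value decomposition of the centred matrix. The function name is hypothetical, and using unit-length principal axes as loadings (rather than axes scaled by the singular values) is an assumption about the original implementation.

```python
import numpy as np


def pca_loading_selection_sketch(data, n_genes=100, n_components=3):
    """data: cells x genes array. Returns indices of the top-ranked genes."""
    centred = data - data.mean(axis=0)
    # Rows of vt are the principal axes expressed in gene space.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    loadings = np.abs(vt[:n_components])
    # Score each gene by its largest absolute loading on any retained component.
    score = loadings.max(axis=0)
    return np.argsort(score)[::-1][:n_genes]
```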

data_validation.feature_selection.run_feature_selection(gen_csv, real_csv, out_dir, method='cv2', header=None, transpose=False, index_col=None, label_csv=None, n_genes=100)

Feature-selection pipeline.

Reads generated and real CSV files, applies method to each, then writes three filtered versions of the real data to out_dir:

  • datafilt1.csv: real data filtered to features selected from the generated data

  • datafilt2.csv: real data filtered to features selected from the real data

  • datafilt_combined.csv: real data filtered to features selected from gen + real data

Parameters:
  • gen_csv, real_csv (str): Paths to the CSV files.

  • out_dir (str): Output directory.

  • method (str): One of {"fano", "pca", "cv2"}.

  • header, transpose, index_col: Passed to pd.read_csv for the real data.

  • label_csv (str or None): If the real CSV includes a label column, separate it.

  • n_genes (int): Number of features to select (default 100).

Core Functions

The data_validation module contains:

  • validate_generated_data(): end-to-end validation that loads data, runs feature selection, performs Leiden clustering over a resolution sweep, and reports ARI/NMI/macro-F1.

  • cv2_selection(): selects the top n_genes by normalised CV² dispersion.

  • fano_selection(): selects the n_genes with the lowest Fano factor (variance / mean).

  • pca_loading_selection(): selects the top n_genes by maximum absolute loading across the first 3 principal components.

The feature_selection module provides a standalone Python port of the original feature_selection.R script (Fano, PCA loading, CV²).

Resolution Sweep

The Leiden resolution sweep is controlled by data_validation.data_validation.RESOLUTION_RANGES, a dictionary mapping dataset names to numpy arrays of resolution values.
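As an illustration of that structure, the sketch below builds a dictionary of the same shape and sweeps over one entry. The resolution values are hypothetical, not the ones stored in RESOLUTION_RANGES, and best_resolution and score_fn are stand-ins: score_fn represents "cluster at this resolution and return a metric such as ARI", and picking the maximum score is an assumption about how the sweep result is used.

```python
import numpy as np

# Hypothetical values; the real ranges are defined in RESOLUTION_RANGES.
RESOLUTION_RANGES_EXAMPLE = {
    "muraro": np.linspace(0.1, 1.0, 10),
}


def best_resolution(res_range, score_fn):
    """Evaluate score_fn at every resolution and return (resolution, score) of the best."""
    scores = [(float(r), score_fn(float(r))) for r in res_range]
    return max(scores, key=lambda pair: pair[1])
```

Keeping the ranges per dataset allows the sweep to cover different granularities for datasets with different numbers of cell types.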

Reference Notebook

The original validation notebook is preserved at data_validation/data_vaidation_garage.ipynb for reference and reproducibility.