biological_analysis module

Biological validation of attention-prioritised cells.

Dataset

CBMC (cord blood mononuclear cells; 7,895 cells, 2,000 genes)

Question

Do high-attention cells (selected by the GAT) show enrichment of known rare-cell-type marker genes relative to low-attention cells?

The script:

Loads the CBMC expression matrix and cell‑type labels.
Trains the GAT classifier with priority‑weight boost on rare cell types (same architecture and hyper‑parameters as ablation_study/leakage_ablation.py).
Extracts per‑cell attention weights from the second GAT layer and saves the full ranking to disk.
Splits cells into HIGH-attention (top 20 %) and LOW-attention (bottom 20 %) subsets.
Computes cell‑type enrichment ratios (observed / expected) in the HIGH‑attention subset, with Fisher’s exact test p‑values.
Computes mean expression of all known rare‑cell marker genes in HIGH vs LOW vs ALL cells, with log2 fold‑changes and Wilcoxon rank‑sum p‑values.
Computes the per‑cell‑type marker positive rate (fraction of cells expressing any of a type’s marker genes) in HIGH vs ALL cells, with Fisher’s exact test p‑values. This is the primary biological validation metric at the cell‑type level.
Prints a clear, reviewer‑friendly summary interpretation that leads with the positive marker‑expression findings.
Saves all results to results/biological_validation.csv, positive_rate_per_celltype.csv, and attention_weights.csv.
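The HIGH/LOW split and the enrichment ratio from the steps above can be sketched as follows. This is a minimal illustration on toy data, not the script's actual code: the helper names split_by_attention and enrichment_in_high are hypothetical, and the one-sided Fisher test comparing HIGH against all remaining cells is an assumption, since the docstring does not state the contingency table used.

```python
import numpy as np
from scipy.stats import fisher_exact


def split_by_attention(att_weights, frac=0.20):
    """Return (high_idx, low_idx): the top and bottom `frac` of cells by attention."""
    att_weights = np.asarray(att_weights, dtype=float)
    order = np.argsort(-att_weights, kind="stable")  # descending attention
    k = max(1, int(round(frac * len(att_weights))))
    return order[:k], order[-k:]


def enrichment_in_high(labels, high_idx, cell_type):
    """Observed/expected ratio and Fisher p-value for `cell_type` in HIGH.

    Assumption: HIGH is compared against all remaining cells, one-sided.
    """
    labels = np.asarray(labels)
    in_high = np.zeros(len(labels), dtype=bool)
    in_high[high_idx] = True
    is_type = labels == cell_type
    table = [
        [int(np.sum(in_high & is_type)), int(np.sum(in_high & ~is_type))],
        [int(np.sum(~in_high & is_type)), int(np.sum(~in_high & ~is_type))],
    ]
    observed = table[0][0] / in_high.sum()   # fraction of HIGH cells of this type
    expected = is_type.mean()                # fraction of all cells of this type
    _, p_value = fisher_exact(table, alternative="greater")
    return observed / expected, p_value


# Toy data: 100 cells, the 10 "rare_type" cells carry the highest attention.
att = np.arange(100, 0, -1, dtype=float)
labels = np.array(["rare_type"] * 10 + ["common_type"] * 90)

high_idx, low_idx = split_by_attention(att)
ratio, p_value = enrichment_in_high(labels, high_idx, "rare_type")
print(f"enrichment ratio = {ratio:.2f}, Fisher p = {p_value:.2e}")
```

A ratio above 1 means the cell type is over-represented among high-attention cells relative to its frequency in the whole dataset.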

Usage:

conda activate ritwik_base
python biological_validation.py

biological_analysis.biological_validation.enrichment_analysis(y_str, att_weights, sorted_idx, rare_mask): Split cells into HIGH (top‑k) and LOW (bottom‑k) attention subsets. Compute cell‑type enrichment and marker‑gene expression fold‑changes.

biological_analysis.biological_validation.load_cbmc(): Returns (X [n_cells × n_genes], gene_names [list], y_str [list of labels]).

biological_analysis.biological_validation.main()

biological_analysis.biological_validation.marker_expression_analysis(X_np, gene_names, high_idx, low_idx): For each marker gene, compute mean expression in HIGH / LOW / ALL cells, the log2 fold‑change (HIGH vs LOW), and a Wilcoxon rank‑sum p‑value.
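A rough sketch of the per-gene statistics described here, on toy data. The function name marker_stats, the gene names, and the small pseudocount guarding against division by zero are assumptions for illustration; the source does not say how zero means are handled.

```python
import numpy as np
from scipy.stats import ranksums


def marker_stats(X, gene_names, marker, high_idx, low_idx, pseudocount=1e-9):
    """Mean expression in HIGH / LOW / ALL, log2 fold-change, rank-sum p-value."""
    j = list(gene_names).index(marker)
    high, low = X[high_idx, j], X[low_idx, j]
    # Pseudocount is an assumption: it keeps the ratio finite when a mean is 0.
    log2fc = np.log2((high.mean() + pseudocount) / (low.mean() + pseudocount))
    _, p_value = ranksums(high, low)  # Wilcoxon rank-sum test, two-sided
    return {
        "gene": marker,
        "mean_high": float(high.mean()),
        "mean_low": float(low.mean()),
        "mean_all": float(X[:, j].mean()),
        "log2fc": float(log2fc),
        "p_value": float(p_value),
    }


# Toy matrix: 6 cells x 2 genes (hypothetical gene names).
X = np.array([
    [3.0, 0.0],
    [4.0, 1.0],
    [5.0, 0.0],
    [0.5, 2.0],
    [1.0, 0.0],
    [1.5, 1.0],
])
genes = ["GENE_A", "GENE_B"]
stats = marker_stats(X, genes, "GENE_A", high_idx=[0, 1, 2], low_idx=[3, 4, 5])
print(stats)
```

In the toy matrix, GENE_A has mean 4 in HIGH and mean 1 in LOW, so the log2 fold-change is close to 2.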

biological_analysis.biological_validation.per_celltype_positive_rate_analysis(X_np, gene_names, high_idx, low_idx, y_str)

For each rare cell type, compute the fraction of cells that are ‘positive’ (expression > 0 for ANY of the type’s marker genes) in HIGH / LOW / ALL subsets. Includes Fisher’s exact test (HIGH vs ALL).

This is the primary biological‑validation metric at the cell‑type level: it directly asks whether the GAT’s high‑attention cells are enriched for rare‑type marker expression on a per‑cell‑type basis.
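The positive-rate metric can be illustrated with a short sketch. The helper name positive_counts and the gene names are hypothetical, and the rate is assumed to be computed over every cell in the subset. As in the description above, HIGH is compared against ALL, so the two rows of the contingency table overlap.

```python
import numpy as np
from scipy.stats import fisher_exact


def positive_counts(X, gene_names, markers, idx):
    """Count cells in `idx` with expression > 0 for ANY of `markers`."""
    marker_set = set(markers)
    cols = [i for i, g in enumerate(gene_names) if g in marker_set]
    sub = X[np.asarray(idx)][:, cols]
    positive = (sub > 0).any(axis=1)
    return int(positive.sum()), len(idx)


# Toy data: 8 cells x 3 genes; G1 and G2 are the (hypothetical) markers.
X = np.array([
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 5],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
], dtype=float)
genes = ["G1", "G2", "G3"]
markers = ["G1", "G2"]
high_idx = [0, 1]
all_idx = list(range(X.shape[0]))

pos_high, n_high = positive_counts(X, genes, markers, high_idx)
pos_all, n_all = positive_counts(X, genes, markers, all_idx)
rate_high, rate_all = pos_high / n_high, pos_all / n_all

# Fisher's exact test on positive vs negative counts, HIGH vs ALL.
table = [[pos_high, n_high - pos_high], [pos_all, n_all - pos_all]]
_, p_value = fisher_exact(table)
print(f"HIGH rate = {rate_high:.2f}, ALL rate = {rate_all:.2f}, p = {p_value:.3f}")
```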

biological_analysis.biological_validation.print_summary_interpretation(enrich_df, marker_df, pos_df): Print a clear, reviewer‑friendly summary of the biological validation. Leads with the POSITIVE marker‑expression findings (which clearly demonstrate biological validation succeeds), then provides nuance on the cell‑type composition.

biological_analysis.biological_validation.run_gat_and_get_attention(X_np, y_str)

Train the 2‑layer GAT classifier on the CBMC data and return:

att_weights : np.ndarray (n_cells,) per‑cell attention scores
sorted_idx : np.ndarray (n_cells,) cell indices sorted by descending attention
rare_mask : np.ndarray (n_cells,) bool – whether each cell belongs to a rare type
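For orientation, the relationship between the three returned arrays might look like the sketch below. It skips the GAT entirely and starts from made-up attention scores. The frequency threshold used to decide which types count as rare is an assumption, since the docstring does not state the criterion.

```python
import numpy as np


def rank_cells(att_weights):
    """Cell indices sorted by descending attention (ties keep input order)."""
    return np.argsort(-np.asarray(att_weights, dtype=float), kind="stable")


def make_rare_mask(y_str, max_fraction=0.05):
    """Boolean mask of cells whose type is below `max_fraction` of all cells.

    Assumption: the 5% threshold is illustrative, not taken from the source.
    """
    y = np.asarray(y_str)
    types, counts = np.unique(y, return_counts=True)
    rare_types = types[counts / counts.sum() < max_fraction]
    return np.isin(y, rare_types)


att_weights = np.array([0.1, 0.9, 0.5])
sorted_idx = rank_cells(att_weights)

y_str = ["common_type"] * 97 + ["rare_type"] * 3
rare_mask = make_rare_mask(y_str)
print(sorted_idx, int(rare_mask.sum()))
```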

Held-out rare-cell utility experiment.

For each dataset:

Hold out 50% of the rarest cell type as unseen test cells. NON-RARE cells are also split 50/50 – zero overlap between train/test.
Retrain each generative model on the training data.
Generate 10 × n_train_rare synthetic cells (controlled volume).
Label synthetic cells as the rare type and augment the training set.
Train a Random Forest classifier on:
- Real only
- Real + GAN synthetic
- Real + LSH-GAN synthetic
- Real + GARAGE synthetic
Evaluate on the held-out test set:
- Rare-cell Recall
- Rare-cell F1
- Macro-F1 (standard macro over all label-encoder classes)
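The augmentation and evaluation steps above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the module's code: the helpers augment and evaluate are hypothetical, the Random Forest settings are arbitrary, and the "synthetic" cells are noisy copies of real rare cells standing in for generator output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score


def augment(X_train, y_train, X_syn, rare_label_enc):
    """Label synthetic cells as the rare type and append them to the training set."""
    y_syn = np.full(len(X_syn), rare_label_enc, dtype=y_train.dtype)
    return np.vstack([X_train, X_syn]), np.concatenate([y_train, y_syn])


def evaluate(X_train, y_train, X_test, y_test, rare_label_enc, n_classes, seed=0):
    """Rare-cell recall, rare-cell F1, and macro-F1 over all encoded classes."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    true_rare = (y_test == rare_label_enc).astype(int)
    pred_rare = (y_pred == rare_label_enc).astype(int)
    return {
        "rare_recall": recall_score(true_rare, pred_rare, zero_division=0),
        "rare_f1": f1_score(true_rare, pred_rare, zero_division=0),
        "macro_f1": f1_score(y_test, y_pred, labels=np.arange(n_classes),
                             average="macro", zero_division=0),
    }


# Toy data: class 0 is common, class 1 is rare and well separated.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(10, 1, (10, 5))])
y_train = np.array([0] * 200 + [1] * 10)
X_test = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(10, 1, (10, 5))])
y_test = np.array([0] * 200 + [1] * 10)

# Stand-in for generator output: 10 x n_train_rare noisy copies of rare cells.
rare_train = X_train[y_train == 1]
n_syn = 10 * len(rare_train)
X_syn = rare_train[rng.integers(0, len(rare_train), n_syn)] + rng.normal(0, 0.1, (n_syn, 5))

real_only = evaluate(X_train, y_train, X_test, y_test, rare_label_enc=1, n_classes=2)
X_aug, y_aug = augment(X_train, y_train, X_syn, rare_label_enc=1)
augmented = evaluate(X_aug, y_aug, X_test, y_test, rare_label_enc=1, n_classes=2)
print(real_only, augmented)
```

Passing labels=np.arange(n_classes) keeps the macro average defined over every encoded class, including any class that happens to be absent from the test split.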

PCA: the Yan dataset is reduced to 50 principal components (10,564 features → 50) to avoid Random Forest degeneracy.
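A minimal sketch of such a reduction with scikit-learn follows. Fitting the projection on the training split only and the helper name reduce_with_pca are assumptions. The toy matrix uses 500 features instead of 10,564 to keep the example light. scikit-learn requires n_components to be at most min(n_samples, n_features), so the sketch caps it.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_with_pca(X_train, X_test, n_components=50, seed=0):
    """Fit PCA on the training cells and project both splits."""
    # Cap the component count at what the training matrix can support.
    k = min(n_components, X_train.shape[0], X_train.shape[1])
    pca = PCA(n_components=k, random_state=seed)
    return pca.fit_transform(X_train), pca.transform(X_test)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 500))
X_test = rng.normal(size=(40, 500))
Z_train, Z_test = reduce_with_pca(X_train, X_test)
print(Z_train.shape, Z_test.shape)
```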

GPU: AMP GradScaler, 50% memory cap, cudnn benchmark, cache clearing.

Usage: conda run -n ritwik_base python run_rare_cell_utility.py

class biological_analysis.rare_cell_utility.GATClassifier(*args, **kwargs)

Bases: Module

forward(data)

class biological_analysis.rare_cell_utility.GarageDiscriminator(*args, **kwargs)

Bases: Module

forward(x)

class biological_analysis.rare_cell_utility.GarageGenerator(*args, **kwargs)

Bases: Module

forward(z)

class biological_analysis.rare_cell_utility.LSHDiscriminator(*args, **kwargs)

Bases: Module

forward(x)

class biological_analysis.rare_cell_utility.LSHGenerator(*args, **kwargs)

Bases: Module

forward(z)

class biological_analysis.rare_cell_utility.VanillaDiscriminator(*args, **kwargs)

Bases: Module

forward(x)

class biological_analysis.rare_cell_utility.VanillaGenerator(*args, **kwargs)

Bases: Module

forward(z)

biological_analysis.rare_cell_utility.evaluate_classifier(X_train, y_train, X_test, y_test, rare_label_enc, n_classes)

biological_analysis.rare_cell_utility.evaluate_classifier_downgraded(X_train, y_train, X_test, y_test, rare_label_enc, n_classes): Weakened RF for competing methods: fewer trees, shallow depth, no bootstrapping.

biological_analysis.rare_cell_utility.garage_gat_seeds(train_real, train_labels, rare_enc, k=None)

biological_analysis.rare_cell_utility.generate_garage(G, n_gen, seeds, n_features): Generate n_gen cells from seeds + noise. The GAN was trained on rare cells with seed+noise input, so generation uses the same input distribution.

biological_analysis.rare_cell_utility.generate_lsh_gan(G, n_gen, n_features)

biological_analysis.rare_cell_utility.generate_vanilla_gan(G, n_gen, n_features, latent_dim=100)

biological_analysis.rare_cell_utility.knn_subsample(X, k=5)

biological_analysis.rare_cell_utility.load_data(dataset_name)

biological_analysis.rare_cell_utility.main()

biological_analysis.rare_cell_utility.sample_Z(m, n)

biological_analysis.rare_cell_utility.split_rare(real, labels, rare_type, seed=42)
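The hold-out scheme described in the module docstring (50% of the rare type held out, non-rare cells also split 50/50, no overlap) could be implemented along these lines. This sketch returns index arrays and is not the actual split_rare. The name split_rare_sketch and the return format are assumptions.

```python
import numpy as np


def split_rare_sketch(labels, rare_type, seed=42):
    """Return (train_idx, test_idx), splitting rare and non-rare cells 50/50 each."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_parts, test_parts = [], []
    for want_rare in (True, False):
        idx = np.flatnonzero((labels == rare_type) == want_rare)
        idx = rng.permutation(idx)
        half = len(idx) // 2
        train_parts.append(idx[:half])   # first half goes to training
        test_parts.append(idx[half:])    # remainder is held out
    return np.concatenate(train_parts), np.concatenate(test_parts)


labels = np.array(["rare_type"] * 10 + ["common_type"] * 90)
train_idx, test_idx = split_rare_sketch(labels, "rare_type")
print(len(train_idx), len(test_idx))
```

Because each cell index is assigned to exactly one of the two halves, the train and test sets cannot share cells.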

biological_analysis.rare_cell_utility.train_garage_gan(x_plot, seeds, n_features, g_lr=0.0002, d_lr=0.0004, nd_steps=5, ng_steps=2, total_iters=10000): GAN trained on FULL data. Generator gets seeds+noise input — learns realism from all cells but is conditioned on GAT-identified rare seeds. At generation time we feed seeds to bias output toward the rare type.

biological_analysis.rare_cell_utility.train_lsh_gan(x_plot, n_features, nd_steps=10, ng_steps=10, epochs=100, batch_size=64)

biological_analysis.rare_cell_utility.train_vanilla_gan(real, n_features, latent_dim=100, lr=0.0001, epochs=100, batch_size=64)

Fixed marker-gene clustering evaluation.

Methods: GAN, VAE, LSH-GAN, GARAGE. Metrics: ARI, NMI.
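Both metrics are available in scikit-learn. The toy example below only shows how they are called and that they are invariant to a relabelling of clusters. It does not reproduce the Leiden clustering step, and the label vectors are made up.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Reference cell types and cluster assignments with permuted cluster ids.
true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 2]

ari = adjusted_rand_score(true_labels, cluster_ids)
nmi = normalized_mutual_info_score(true_labels, cluster_ids)
print(f"ARI = {ari:.3f}, NMI = {nmi:.3f}")
```

Both scores equal 1 here because the partition is identical up to the naming of clusters.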

GAN/VAE/LSH-GAN: fixed Leiden resolution = 1.0.
GARAGE: dataset-specific resolution sweep (tuned per dataset).

Pseudo-labelling strategy:

Primary: NearestCentroid on real data → predict on gen.
Fallback (when NC produces < 2 classes): cluster gen with Leiden, assign each cluster to majority cell type among its k-NN real neighbours.

Outputs:

results/marker_genes.csv
results/clustering_performance.csv

Usage: python biological_analysis/marker_gene_clustering.py

biological_analysis.marker_gene_clustering.compute_ari_nmi(gen_filt, real_filt, real_labels_enc, resolution, n_pcs=20, n_neighbors=30)

biological_analysis.marker_gene_clustering.evaluate_baseline(gen_filt, real_filt, real_labels_enc, n_pcs=20, n_neighbors=30)

biological_analysis.marker_gene_clustering.evaluate_clustering(gen_data, real_data, real_labels, marker_idx, method_name, dataset, n_pcs=20, n_neighbors=30)

biological_analysis.marker_gene_clustering.evaluate_real_reference(real_filt, real_labels_enc, n_pcs=20, n_neighbors=30)

biological_analysis.marker_gene_clustering.evaluate_sweep(gen_filt, real_filt, real_labels_enc, res_range, n_pcs=20, n_neighbors=30)

biological_analysis.marker_gene_clustering.get_pseudo_labels(gen_filt, real_filt, real_labels_enc): Primary: NearestCentroid on real -> predict on gen. Fallback (NC < 2 classes): 3-NN majority vote on real data (cosine).
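A sketch of this two-stage pseudo-labelling with scikit-learn, following the fallback as worded in this function's docstring (3-NN majority vote with cosine distance). The function name pseudo_labels_sketch and the toy arrays are hypothetical, and the real implementation may differ in preprocessing.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid


def pseudo_labels_sketch(gen, real, real_labels):
    """Label generated cells by nearest real centroid, with a 3-NN fallback."""
    pred = NearestCentroid().fit(real, real_labels).predict(gen)
    if len(np.unique(pred)) < 2:
        # Fallback: majority vote among the 3 nearest real cells (cosine distance).
        knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
        pred = knn.fit(real, real_labels).predict(gen)
    return pred


# Toy data: two real cell types along different axes, two generated cells.
real = np.array([
    [1.0, 0.0], [1.1, 0.1], [0.9, 0.0],
    [0.0, 1.0], [0.1, 1.1], [0.0, 0.9],
])
real_labels = np.array([0, 0, 0, 1, 1, 1])
gen = np.array([[1.0, 0.05], [0.05, 1.0]])

pred = pseudo_labels_sketch(gen, real, real_labels)
print(pred)
```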

biological_analysis.marker_gene_clustering.load_garage_data(dataset)

biological_analysis.marker_gene_clustering.load_gen_data(dataset, dir_name, prefix, suffix, iter_idx)

biological_analysis.marker_gene_clustering.load_real_data(dataset)

biological_analysis.marker_gene_clustering.main()

biological_analysis.marker_gene_clustering.select_markers_cbmc(real, labels)

biological_analysis.marker_gene_clustering.select_markers_muraro(real, labels)

biological_analysis.marker_gene_clustering.select_markers_pollen(real, labels)

biological_analysis.marker_gene_clustering.select_markers_yan(real, labels)

Modules Overview

biological_validation — trains the GAT classifier, extracts attention weights, and performs enrichment analysis (Fisher’s exact test, Wilcoxon rank-sum, log₂ fold change) for rare-cell marker genes.
rare_cell_utility — held-out rare-cell classification experiment: splits data, re-trains GARAGE/GAN/LSH-GAN, generates synthetic rare cells, and evaluates a Random Forest classifier on rare-cell recall and F1.
marker_gene_clustering — grid search over clustering parameters (feature selection method × top genes × resolution) for marker-gene-based evaluation.