Tutorial: Biological Validation
This tutorial demonstrates how to validate that GARAGE-generated synthetic data preserves biologically meaningful properties — specifically, that rare cell type marker genes are enriched in the cells the GAT selects as “important.”
Dataset: CBMC (cord blood mononuclear cells; 7,895 cells, 13 cell types).
Prerequisite: Run GARAGE on CBMC first (python -m data_generation.garage --dataset cbmc).
What Biological Validation Measures
GARAGE’s biological validation asks two questions:
Are GAT-selected cells enriched for rare cell types?
If the priority-weight mechanism works, high-attention cells should contain a disproportionately large share of rare cell types.
Do high-attention cells express known marker genes for rare cell types?
This confirms that the GAT’s attention mechanism captures biologically meaningful signal, not just statistical artefacts.
Step 1: Run the Main Biological Validation
python biological_analysis/biological_validation.py
What happens
Loads CBMC data — expression matrix (7,895 × 2,000) and cell-type labels (13 types).
Trains the GAT classifier, using the same architecture as garage.py:
2 GATConv layers (heads: 8, 1).
Priority weight = 2.0 on rare cell types.
7,501 training iterations.
Extracts attention weights from the second GAT layer:
Per-cell attention scores are saved to results/attention_weights.csv.
Cells are split into HIGH (top 20 %) and LOW (bottom 20 %) attention groups.
Cell-type enrichment analysis:
For each cell type, computes the observed/expected ratio in the HIGH-attention group.
Fisher’s exact test: is enrichment significantly non-random?
Marker-gene expression analysis:
Defines a curated list of rare-cell marker genes (e.g., CD34 for haematopoietic stem cells).
Computes mean expression of each marker in HIGH vs. LOW vs. ALL cells.
Reports \(\log_2\) fold change and Wilcoxon rank-sum p-value.
Positive-rate analysis (cell-type level):
For each cell type and its marker genes, computes the fraction of cells expressing any marker.
Compares this positive rate in HIGH vs. ALL cells.
Fisher’s exact test for significance.
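The HIGH/LOW split and the cell-type enrichment step can be sketched roughly as follows. This is a minimal illustration on toy data, not the code in biological_validation.py: the function names are hypothetical, and the one-sided Fisher test that contrasts the HIGH group with all remaining cells is an assumption about how the contingency table is built.

```python
import numpy as np
from scipy.stats import fisher_exact


def split_by_attention(scores, fraction=0.2):
    """Return boolean masks for the top and bottom `fraction` of cells."""
    scores = np.asarray(scores, dtype=float)
    n_group = max(1, int(round(fraction * scores.size)))
    order = np.argsort(scores)  # ascending attention
    high = np.zeros(scores.size, dtype=bool)
    low = np.zeros(scores.size, dtype=bool)
    high[order[-n_group:]] = True
    low[order[:n_group]] = True
    return high, low


def cell_type_enrichment(labels, high_mask, cell_type):
    """Observed/expected ratio of `cell_type` in HIGH, plus a Fisher p-value."""
    is_type = np.asarray(labels) == cell_type
    observed = is_type[high_mask].mean()  # share of the type within HIGH
    expected = is_type.mean()             # share of the type in all cells
    ratio = float(observed / expected) if expected > 0 else float("nan")
    # Rows: HIGH vs. remaining cells; columns: this type vs. other types.
    # Assumption: a one-sided test for over-representation in HIGH.
    table = [
        [int(np.sum(is_type & high_mask)), int(np.sum(~is_type & high_mask))],
        [int(np.sum(is_type & ~high_mask)), int(np.sum(~is_type & ~high_mask))],
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    return ratio, float(p_value)


# Toy data: 100 cells, the 10 highest-attention cells are all "rare".
toy_scores = np.arange(100)
toy_labels = np.array(["common"] * 90 + ["rare"] * 10)
high_mask, low_mask = split_by_attention(toy_scores, fraction=0.2)
ratio, p_value = cell_type_enrichment(toy_labels, high_mask, "rare")
```

In this toy case the rare type makes up half of the HIGH group but only a tenth of all cells, so the observed/expected ratio is 5.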
Step 2: Interpret the Results
Output files
| File | Contents |
|---|---|
| results/attention_weights.csv | Per-cell attention scores (sorted highest to lowest). |
| | Expression of all marker genes in HIGH vs. LOW vs. ALL, with log₂FC and p-values. |
| | Per-cell-type marker positive rate in HIGH vs. ALL, with Fisher p-values. |
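The log₂FC, rank-sum and positive-rate columns in the marker-level outputs could be computed along these lines. This is a hedged sketch on toy arrays: the function names, the pseudocount, the expression threshold of 0 and the HIGH-versus-remaining-cells contingency table are all assumptions, and the actual script may normalise expression or build the test differently.

```python
import numpy as np
from scipy.stats import fisher_exact, ranksums


def marker_expression_stats(expr_high, expr_low, pseudocount=1e-9):
    """Mean expression, log2 fold change and rank-sum p-value for one gene."""
    mean_high = float(np.mean(expr_high))
    mean_low = float(np.mean(expr_low))
    # The pseudocount (an assumption) avoids division by zero for silent genes.
    log2fc = float(np.log2((mean_high + pseudocount) / (mean_low + pseudocount)))
    _, p_value = ranksums(expr_high, expr_low)
    return mean_high, mean_low, log2fc, float(p_value)


def positive_rate_stats(expr, marker_idx, high_mask, threshold=0.0):
    """Marker positive rate in HIGH vs. ALL cells, with a one-sided Fisher test."""
    # A cell is positive if it expresses any listed marker above the threshold.
    positive = (np.asarray(expr)[:, marker_idx] > threshold).any(axis=1)
    rate_high = float(positive[high_mask].mean())
    rate_all = float(positive.mean())
    # Assumption: the 2x2 table contrasts HIGH with the remaining cells,
    # because HIGH is a subset of ALL and the two groups are not independent.
    table = [
        [int(np.sum(positive & high_mask)), int(np.sum(~positive & high_mask))],
        [int(np.sum(positive & ~high_mask)), int(np.sum(~positive & ~high_mask))],
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    return rate_high, rate_all, float(p_value)


# Toy expression of one marker in 60 HIGH and 60 LOW cells (means 4 and 1).
toy_high = np.tile([2.0, 4.0, 6.0], 20)
toy_low = np.tile([0.0, 1.0, 2.0], 20)
mean_high, mean_low, log2fc, p_expr = marker_expression_stats(toy_high, toy_low)

# Toy matrix: 10 cells x 3 genes; genes 0 and 1 are the markers.
toy_expr = np.zeros((10, 3))
toy_expr[:4, 0] = 2.0
toy_expr[4, 1] = 1.0
toy_high_mask = np.arange(10) < 4
rate_high, rate_all, p_rate = positive_rate_stats(toy_expr, [0, 1], toy_high_mask)
```

Here the toy marker has a log₂ fold change of about 2 (a fourfold difference in means), and all four HIGH cells are marker-positive compared with half of all cells.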
Console output (interpretation guide)
The script prints a self-interpreting summary. Key patterns to look for:
Good signs:
✓ HIGH-attention group is enriched for CD34+ HSCs (observed/expected = 3.2, p < 1e-6)
✓ CD34 expression: 2.4-fold higher in HIGH vs. LOW (Wilcoxon p < 1e-10)
✓ 68 % of HIGH cells are positive for their type's marker genes vs. 32 % in ALL
Warning signs:
✗ No significant enrichment of rare cell types in the HIGH group
✗ Marker gene expression is not different between HIGH and LOW (p > 0.05)
If you see warning signs, try:
Increasing priority_weight in config.py (e.g., from 2.0 to 4.0).
Increasing leakage_fraction (e.g., from 0.2 to 0.3).
Checking that rare_threshold in DATASET_CONFIG['cbmc'] is set correctly (should capture the types you want to study).
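One way to sanity-check a rare_threshold value is to list which cell types it would flag. The helper below is a hypothetical sketch that assumes rare_threshold is a fraction of the total cell count and that a type is rare when its share is strictly below it; check config.py for the actual definition before relying on this.

```python
from collections import Counter


def rare_cell_types(labels, rare_threshold):
    """Return the cell types whose share of all cells is below the threshold."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Assumption: the threshold is a fraction of cells, compared strictly.
    return sorted(t for t, c in counts.items() if c / total < rare_threshold)


# Toy labels: 90 % T cells, 6 % HSCs, 4 % erythroblasts.
toy_labels = ["T cell"] * 90 + ["HSC"] * 6 + ["Erythroblast"] * 4
strict = rare_cell_types(toy_labels, rare_threshold=0.05)
loose = rare_cell_types(toy_labels, rare_threshold=0.10)
```

With these toy labels, a threshold of 0.05 flags only the erythroblasts, while 0.10 also captures the HSCs.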
Step 3: Marker Gene Clustering
python biological_analysis/marker_gene_clustering.py
This script runs a grid search over clustering parameters (feature selection method × number of genes × resolution) and reports ARI/NMI/F1 for each combination, focusing on whether marker genes for rare cell types improve clustering quality.
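The shape of such a grid search can be illustrated with scikit-learn. The sketch below is not the script's implementation: it swaps in top-variance gene selection and KMeans (with a cluster count in place of a resolution parameter) as stand-ins, uses hypothetical function names, and omits F1, which requires matching clusters to labels first.

```python
from itertools import product

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def top_variance_genes(expr, n_genes):
    """Indices of the `n_genes` columns with the highest variance."""
    return np.argsort(expr.var(axis=0))[::-1][:n_genes]


def grid_search(expr, labels, gene_counts, cluster_counts, seed=0):
    """Score every (n_genes, n_clusters) combination by ARI and NMI."""
    results = []
    for n_genes, n_clusters in product(gene_counts, cluster_counts):
        genes = top_variance_genes(expr, n_genes)
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        pred = model.fit_predict(expr[:, genes])
        results.append({
            "n_genes": n_genes,
            "n_clusters": n_clusters,
            "ari": adjusted_rand_score(labels, pred),
            "nmi": normalized_mutual_info_score(labels, pred),
        })
    # Best combination first, ranked by ARI.
    return sorted(results, key=lambda r: r["ari"], reverse=True)


# Toy data: 60 cells x 20 genes, two groups separated on the first 5 genes.
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(60, 20))
toy_expr[:30, :5] += 10.0
toy_labels = [0] * 30 + [1] * 30
results = grid_search(toy_expr, toy_labels, gene_counts=(5, 10), cluster_counts=(2, 3))
best = results[0]
```

Sorting by ARI puts the best-scoring combination in results[0]; the same loop extends to further grid dimensions by adding them to the product.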
Step 4: Held-out Rare Cell Utility
python biological_analysis/rare_cell_utility.py
This experiment quantifies how much GARAGE-generated data helps a downstream classifier detect rare cell types. See Tutorial: Rare Cell Experiment for a detailed walkthrough.
Biological Interpretation for Papers
When writing up results, the following narrative is typical:
“Biological validation confirmed that cells selected by GARAGE’s GAT attention mechanism are significantly enriched for rare haematopoietic cell types (CD34+ HSCs: 3.2× enrichment, Fisher p < 1e-6; erythroblasts: 2.1× enrichment, p < 1e-4). Known marker genes for these rare populations — including KLF1, GATA1, and HBG2 — were expressed at 2–5× higher levels in the high-attention group compared to the low-attention group. The per-cell-type marker positive rate was 68 % in the high-attention subset vs. 32 % in the full dataset (p < 1e-4), confirming that the GAT’s priority-weight mechanism selects cells with genuine biological relevance.”
Customising for Your Own Dataset
To run biological validation on your own data:
Edit biological_analysis/biological_validation.py:
Change the dataset loading to use your expression matrix and labels.
Define marker genes for your rare cell types.
Edit biological_analysis/marker_gene_clustering.py similarly.
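When defining marker genes for a new dataset, it can help to check that each gene is actually present in the expression matrix. The sketch below assumes a plain dictionary from cell type to gene symbols; the variable and function names are hypothetical, the marker lists are illustrative only, and the structure the script expects may differ.

```python
def resolve_markers(marker_genes, gene_names):
    """Map marker symbols to column indices and report symbols not found."""
    index = {gene: i for i, gene in enumerate(gene_names)}
    resolved, missing = {}, {}
    for cell_type, genes in marker_genes.items():
        resolved[cell_type] = [index[g] for g in genes if g in index]
        missing[cell_type] = [g for g in genes if g not in index]
    return resolved, missing


# Illustrative marker lists; replace with curated markers for your cell types.
MARKER_GENES = {
    "HSC": ["CD34"],
    "Erythroblast": ["KLF1", "GATA1", "HBG2"],
}

# Toy gene list standing in for the columns of your expression matrix.
toy_gene_names = ["CD3E", "CD34", "GATA1", "KLF1"]
resolved, missing = resolve_markers(MARKER_GENES, toy_gene_names)
```

Markers missing from the matrix (HBG2 in this toy gene list) are reported rather than silently dropped, which makes it easier to spot genes removed during feature selection.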
A future version of GARAGE will accept a --marker_genes JSON or CSV argument for ad-hoc marker gene lists.