Tutorial: Biological Validation

This tutorial demonstrates how to validate that GARAGE-generated synthetic data preserves biologically meaningful properties: specifically, that marker genes of rare cell types are enriched in the cells that the graph attention network (GAT) selects as “important.”

Dataset: CBMC (cord blood mononuclear cells, 7,895 cells, 13 cell types).

Prerequisite: Run GARAGE on CBMC first (python -m data_generation.garage --dataset cbmc).

What Biological Validation Measures

GARAGE’s biological validation asks two questions:

Are GAT-selected cells enriched for rare cell types?
If the priority-weight mechanism works, the high-attention group should contain a disproportionately large share of cells from rare cell types.
Do high-attention cells express known marker genes for rare cell types?
This confirms that the GAT’s attention mechanism captures biologically meaningful signal, not just statistical artefacts.

Step 1: Run the Main Biological Validation

python biological_analysis/biological_validation.py

What happens

Loads CBMC data — expression matrix (7,895 × 2,000) and cell-type labels (13 types).
Trains the GAT classifier — same architecture as garage.py:
- Two GATConv layers (8 attention heads in the first, 1 in the second).
- Priority weight = 2.0 on rare cell types.
- 7,501 training iterations.
Extracts attention weights from the second GAT layer:
- Per-cell attention scores are saved to results/attention_weights.csv.
- Cells are split into HIGH (top 20 %) and LOW (bottom 20 %) attention groups.
Cell-type enrichment analysis:
- For each cell type, computes the observed/expected ratio in the HIGH-attention group.
- Applies Fisher’s exact test to check whether each enrichment is statistically significant.
Marker-gene expression analysis:
- Defines a curated list of rare-cell marker genes (e.g., CD34 for haematopoietic stem cells).
- Computes mean expression of each marker in HIGH vs. LOW vs. ALL cells.
- Reports log₂ fold change and Wilcoxon rank-sum p-value.
Positive-rate analysis (cell-type level):
- For each cell type and its marker genes, computes the fraction of cells expressing any marker.
- Compares this positive rate in HIGH vs. ALL cells.
- Fisher’s exact test for significance.
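
The enrichment step above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in biological_validation.py: the function names split_by_attention and enrichment are hypothetical, and the sketch assumes a one-sided Fisher’s exact test (enrichment only), computed as a hypergeometric tail probability. The script itself may use a two-sided test or a library routine.

```python
from math import comb


def split_by_attention(scores, fraction=0.2):
    """Return (high, low): indices of the top and bottom `fraction` of cells."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(len(scores) * fraction))
    return order[:k], order[-k:]


def enrichment(labels, high_idx, cell_type):
    """Observed/expected ratio and one-sided Fisher p-value for one cell type."""
    n_total, n_high = len(labels), len(high_idx)
    n_type = sum(1 for lab in labels if lab == cell_type)
    observed = sum(1 for i in high_idx if labels[i] == cell_type)
    # Expected count if the HIGH cells were drawn at random from all cells.
    expected = n_high * n_type / n_total
    ratio = observed / expected if expected > 0 else float("nan")
    # Hypergeometric tail: P(at least `observed` cells of this type in HIGH).
    tail = sum(
        comb(n_type, k) * comb(n_total - n_type, n_high - k)
        for k in range(observed, min(n_high, n_type) + 1)
    )
    return ratio, tail / comb(n_total, n_high)


# Toy data (hypothetical): 4 rare cells with the highest scores, 16 common cells.
labels = ["rare"] * 4 + ["common"] * 16
scores = [0.9, 0.8, 0.7, 0.6] + [0.5 - 0.01 * i for i in range(16)]
high, low = split_by_attention(scores)
ratio, p_value = enrichment(labels, high, "rare")
```

In this toy case all 4 rare cells land in the top 20 %, so the ratio is 5.0 (4 observed against 0.8 expected) and the p-value is 1/4845.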

Step 2: Interpret the Results

Output files

File	Contents
`results/attention_weights.csv`	Per-cell attention scores (sorted highest to lowest).
`results/biological_validation.csv`	Expression of all marker genes in HIGH vs. LOW vs. ALL, with log₂FC and p-values.
`results/positive_rate_per_celltype.csv`	Per-cell-type marker positive rate in HIGH vs. ALL, with Fisher p-values.
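
To make the log₂FC and positive-rate columns concrete, here is a small sketch of how such values could be computed. It is an illustration under stated assumptions, not the script’s implementation: the function names are hypothetical, a cell counts as “positive” when any marker has expression above zero, and a small pseudocount guards against division by zero. The script may use different conventions.

```python
from math import log2
from statistics import mean


def log2_fold_change(high_values, low_values, pseudocount=1e-9):
    """log2 of mean(HIGH) / mean(LOW); the pseudocount avoids division by zero."""
    return log2((mean(high_values) + pseudocount) / (mean(low_values) + pseudocount))


def positive_rate(cells, markers, threshold=0.0):
    """Fraction of cells expressing at least one marker above `threshold`."""
    if not cells:
        return float("nan")
    positive = sum(
        1 for cell in cells if any(cell.get(g, 0.0) > threshold for g in markers)
    )
    return positive / len(cells)


# Toy data (hypothetical values): each cell maps gene name -> expression.
high_cells = [{"CD34": 4.0, "KLF1": 0.0}, {"CD34": 4.0, "KLF1": 2.0}]
low_cells = [{"CD34": 1.0, "KLF1": 0.0}, {"CD34": 1.0, "KLF1": 0.0}]
silent_cells = [{"CD34": 0.0, "KLF1": 0.0} for _ in range(4)]
all_cells = high_cells + low_cells + silent_cells

markers = ["CD34", "KLF1"]
cd34_lfc = log2_fold_change(
    [c["CD34"] for c in high_cells], [c["CD34"] for c in low_cells]
)
rate_high = positive_rate(high_cells, markers)
rate_all = positive_rate(all_cells, markers)
```

Here CD34 has a log₂FC of about 2 (a 4-fold difference), and the positive rate is 1.0 in HIGH against 0.5 in ALL. For the accompanying rank-sum p-value, scipy.stats.ranksums is one readily available option.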

Console output (interpretation guide)

The script prints a summary that flags each result with ✓ or ✗. Key patterns to look for:

Good signs:

✓ HIGH-attention group is enriched for CD34+ HSCs (observed/expected = 3.2, p < 1e-6)
✓ CD34 expression: 2.4-fold higher in HIGH vs. LOW (Wilcoxon p < 1e-10)
✓ 68 % of HIGH cells are positive for their type's marker genes vs. 32 % in ALL

Warning signs:

✗ No significant enrichment of rare cell types in the HIGH group
✗ Marker gene expression is not different between HIGH and LOW (p > 0.05)

If you see warning signs, try:

Increasing priority_weight in config.py (e.g., from 2.0 to 4.0).
Increasing leakage_fraction (e.g., from 0.2 to 0.3).
Checking that rare_threshold in DATASET_CONFIG['cbmc'] is set correctly (should capture the types you want to study).
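
A quick way to sanity-check the last point is to list which cell types fall under a given threshold. The sketch below assumes that rare_threshold is a maximum fraction of all cells; that interpretation is an assumption, so check how config.py actually defines it. The helper name rare_cell_types and the labels are hypothetical.

```python
from collections import Counter


def rare_cell_types(labels, rare_threshold):
    """Cell types whose share of all cells is below `rare_threshold`.

    Assumes the threshold is a fraction in (0, 1); GARAGE's own
    definition may differ (for example, an absolute cell count).
    """
    counts = Counter(labels)
    total = len(labels)
    return sorted(t for t, n in counts.items() if n / total < rare_threshold)


# Toy labels (hypothetical): 90 T cells, 6 erythroblasts, 4 HSCs.
labels = ["T cell"] * 90 + ["Erythroblast"] * 6 + ["HSC"] * 4
strict = rare_cell_types(labels, 0.05)
loose = rare_cell_types(labels, 0.10)
```

With these labels, a threshold of 0.05 captures only the HSCs, while 0.10 also captures the erythroblasts.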

Step 3: Marker Gene Clustering

python biological_analysis/marker_gene_clustering.py

This script runs a grid search over clustering parameters (feature selection method × number of genes × resolution) and reports the adjusted Rand index (ARI), normalized mutual information (NMI), and F1 for each combination. The aim is to test whether marker genes for rare cell types improve clustering quality.
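
As a rough picture of what one grid entry involves, the sketch below enumerates a parameter grid and scores a clustering against reference labels with the adjusted Rand index. The grid values and names are hypothetical placeholders, not the ones in marker_gene_clustering.py, and the ARI is written out from its standard pair-counting definition so that the example needs only the standard library.

```python
from collections import Counter
from itertools import product
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index from the pair-counting contingency table."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    # Pairs of cells that share both a true label and a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical grid: feature selection method x number of genes x resolution.
parameter_grid = list(product(["hvg", "markers"], [500, 1000, 2000], [0.5, 1.0]))

# Scoring one (toy) clustering result against reference labels.
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
ari_relabelled = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The ARI ignores cluster names, so the relabelled clustering scores 1.0, while splitting one true cluster in two lowers the score to 4/7.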

Step 4: Held-out Rare Cell Utility

python biological_analysis/rare_cell_utility.py

This experiment quantifies how much GARAGE-generated data helps a downstream classifier detect rare cell types. See Tutorial: Rare Cell Experiment for a detailed walkthrough.

Biological Interpretation for Papers

When writing up results, a narrative along the following lines is typical. Replace the figures with the values from your own run:

“Biological validation confirmed that cells selected by GARAGE’s GAT attention mechanism are significantly enriched for rare haematopoietic cell types (CD34+ HSCs: 3.2× enrichment, Fisher p < 1e-6; erythroblasts: 2.1× enrichment, p < 1e-4). Known marker genes for these rare populations — including KLF1, GATA1, and HBG2 — were expressed at 2–5× higher levels in the high-attention group compared to the low-attention group. The per-cell-type marker positive rate was 68 % in the high-attention subset vs. 32 % in the full dataset (p < 1e-4), confirming that the GAT’s priority-weight mechanism selects cells with genuine biological relevance.”

Customising for Your Own Dataset

To run biological validation on your own data:

Edit biological_analysis/biological_validation.py:
- Change the dataset loading to use your expression matrix and labels.
- Define marker genes for your rare cell types.
Edit biological_analysis/marker_gene_clustering.py similarly.
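
When defining marker genes for a new dataset, it helps to confirm that every marker is actually present in the expression matrix, since a matrix restricted to a gene subset (2,000 genes in the CBMC case) may not contain all of them. The sketch below shows one way to do this; the dictionary layout, the cell-type names, and the helper missing_markers are hypothetical and need not match the structure used inside biological_validation.py.

```python
def missing_markers(marker_genes, gene_names):
    """Map each cell type to the markers that are absent from `gene_names`."""
    available = set(gene_names)
    missing = {}
    for cell_type, genes in marker_genes.items():
        absent = [g for g in genes if g not in available]
        if absent:
            missing[cell_type] = absent
    return missing


# Hypothetical marker definition: cell type -> list of marker genes.
MARKER_GENES = {
    "HSC": ["CD34"],
    "Erythroblast": ["KLF1", "GATA1", "HBG2"],
}

# Toy gene list standing in for the columns of your expression matrix.
gene_names = ["CD34", "KLF1", "GATA1", "CD3E"]
problems = missing_markers(MARKER_GENES, gene_names)
```

In this toy case HBG2 is reported as missing for the erythroblast entry; an empty result means every marker can be looked up.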

A future version of GARAGE will accept a --marker_genes JSON or CSV argument for ad-hoc marker gene lists.