IVD Single-Cell Atlas

Goal: Identify the cell types and cell states present in the human intervertebral disc (IVD) and determine how these change with aging and degeneration.

Approach

This project uses a human-gated agentic pipeline to analyze publicly available single-cell RNA-seq datasets of human IVD tissue. An AI agent (Claude) executes well-defined computational steps, while a human PI reviews results and makes scientific decisions at defined checkpoints.

The pipeline is driven by a loop:

while :; do cat PROMPT.md | claude; done

Each iteration, the agent reads the current state from analysis_plan.md, executes the next task as defined in the module specs, runs validation, and updates the plan. The loop halts at human checkpoints — the agent prepares review materials and stops until the human advances the plan.
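
The halting behaviour described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline's actual logic: the checkbox syntax and the `[CHECKPOINT]` tag are hypothetical conventions for analysis_plan.md, and `next_action` is an invented helper name.

```python
import re

# Assumed plan format (hypothetical): one task per line as a checkbox item,
# with human checkpoints tagged "[CHECKPOINT]".
TASK_RE = re.compile(r"^\s*- \[(?P<done>[ xX])\] (?P<title>.+)$")


def next_action(plan_text):
    """Return ("run", title), ("halt", title), or ("done", None)."""
    for line in plan_text.splitlines():
        match = TASK_RE.match(line)
        if match is None or match.group("done") != " ":
            continue  # not a task line, or already completed
        title = match.group("title").strip()
        if "[CHECKPOINT]" in title:
            # The first open item is a human gate: stop and wait for the PI.
            return ("halt", title)
        return ("run", title)
    return ("done", None)


example_plan = """\
- [x] 03 Preprocessing: per-dataset QC
- [ ] [CHECKPOINT] Review QC thresholds
- [ ] 04 Coarse Annotation
"""
print(next_action(example_plan))
```

In this sketch the agent would keep returning "halt" on every loop iteration until a human ticks the checkpoint item, which is one simple way to make the outer `while` loop safe to leave running.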

Pipeline Modules

#	Module	Description
01	Dataset Discovery	Systematic search for all human IVD scRNA-seq datasets
02	Metadata Harmonization	Standardize condition labels, demographics, covariates
03	Preprocessing	Per-dataset QC, normalization, clustering
04	Coarse Annotation	Coarse cell classification for scANVI integration anchors
05	Integration	Cross-study tiered scANVI integration
06	Clustering	Resolution-optimized Leiden clustering of integrated objects
07	Post-Integration Annotation	De novo cell type annotation from integrated, clustered data
08	Differential Analysis	Cell composition changes and pseudobulk DE (DESeq2) between conditions
09	Biological Interpretation	Pathway enrichment, GRNs, pain-associated gene analysis
10	Trajectory & Dynamics	Pseudotime, RNA velocity for the cell state continuum
11	Cell-Cell Communication	Ligand-receptor interactions between IVD cell populations
12	Reporting	Final report, figures, reproducibility documentation

Key Files

PROMPT.md — Fed to the agent on each loop iteration
AGENT.md — Execution rules and environment instructions
analysis_plan.md — Living document tracking progress, decisions, and revisions
specs/ — Module specifications defining inputs, outputs, methods, validation, and checkpoints

Compute Requirements

Most modules run on a standard workstation (32 GB RAM). Two modules benefit from HPC:

Module 05 (Integration): scVI/scANVI training is significantly faster on a GPU, with ~200k cells across all studies.
Module 09 (Biological Interpretation): SCENIC gene regulatory network inference is RAM-intensive (64 GB+).

Key Design Decisions

Human-gated, not fully autonomous. The agent executes computational steps but stops at decision points for human review. The automated validation checks are regression safeguards, not proof of correctness.

Per-dataset first, then integrate. Each dataset is preprocessed and annotated independently before cross-study integration. This avoids the known problem where batch correction erases the subtle cell state variation in the chondrocyte/fibroblast continuum.

Tiered integration. Non-resident cells (immune, endothelial) integrate easily and are handled with standard methods. Resident IVD cells require conservative integration or alternative approaches (label transfer, metacells) to preserve the biological continuum.
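
The routing step behind tiered integration might look like the following sketch. The label strings, tier names, and helper functions are all hypothetical; the real label set would come from the Module 04 coarse annotation, and the integration itself (scANVI, label transfer, metacells) is not shown.

```python
# Hypothetical coarse labels treated as non-resident; everything else is
# assumed to belong to the resident IVD continuum.
NON_RESIDENT = {"immune", "endothelial"}


def assign_tier(coarse_label):
    """Map a coarse cell label to an integration tier."""
    return "standard" if coarse_label.lower() in NON_RESIDENT else "conservative"


def split_by_tier(cell_labels):
    """Group cell IDs by tier. `cell_labels` maps cell ID -> coarse label."""
    tiers = {"standard": [], "conservative": []}
    for cell_id, label in cell_labels.items():
        tiers[assign_tier(label)].append(cell_id)
    return tiers


tiers = split_by_tier(
    {"c1": "immune", "c2": "chondrocyte", "c3": "endothelial", "c4": "fibroblast"}
)
print(tiers)
```

Defaulting unknown labels to the conservative tier is a deliberate choice in this sketch: a mislabelled resident cell then keeps its biological variation rather than being over-corrected.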

The plan is revisable. Every human checkpoint includes an evaluation of whether the downstream plan still makes sense given what's been learned. The analysis may loop back to earlier steps with revised parameters.

Pseudobulk DE, not single-cell DE. Differential expression uses pseudobulk aggregation (DESeq2/edgeR) to avoid inflated statistics from treating cells as independent observations.
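
As an illustration of the aggregation step only, the sketch below sums raw counts over all cells that share a sample and cell type. It uses plain Python lists as a stand-in for the real count matrix; the function name, donor IDs, and cell type labels are invented for the example, and the DESeq2/edgeR model fit is not shown.

```python
from collections import defaultdict


def pseudobulk(counts, sample_ids, cell_types):
    """Sum raw counts over all cells sharing a (sample, cell type) pair.

    counts: list of per-cell count vectors (one int per gene)
    sample_ids, cell_types: per-cell labels, same length as counts
    """
    if not (len(counts) == len(sample_ids) == len(cell_types)):
        raise ValueError("counts, sample_ids and cell_types must align")
    n_genes = len(counts[0]) if counts else 0
    profiles = defaultdict(lambda: [0] * n_genes)
    for row, sample, cell_type in zip(counts, sample_ids, cell_types):
        profile = profiles[(sample, cell_type)]
        for gene_idx, value in enumerate(row):
            profile[gene_idx] += value
    return dict(profiles)


# Toy data: 4 cells x 2 genes from two hypothetical donors.
counts = [[1, 0], [2, 3], [0, 5], [4, 1]]
samples = ["donor_A", "donor_A", "donor_B", "donor_B"]
types = ["type_1", "type_1", "type_1", "type_2"]
bulk = pseudobulk(counts, samples, types)
print(bulk)
```

Each resulting profile is one observation for the downstream test, so the unit of replication becomes the sample rather than the cell.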

Directory Structure

ivd-analysis/
├── PROMPT.md               # Agent loop prompt
├── AGENT.md                # Agent execution rules
├── README.md               # This file
├── analysis_plan.md        # Living plan document
├── specs/                  # Module specifications
│   ├── 00_PROJECT.md
│   ├── 01_DATASET_DISCOVERY.md
│   ├── 02_METADATA.md
│   ├── 03_PREPROCESSING.md
│   ├── 04_ANNOTATION.md
│   ├── 05_INTEGRATION.md
│   ├── 06_CLUSTERING.md
│   ├── 07_POST_ANNOTATION.md
│   ├── 08_DIFFERENTIAL.md
│   ├── 09_INTERPRETATION.md
│   ├── 10_TRAJECTORY.md
│   ├── 11_COMMUNICATION.md
│   └── 12_REPORTING.md
├── data/
│   ├── raw/                # Downloaded datasets
│   ├── processed/          # Per-dataset h5ad files
│   └── integrated/         # Cross-study integrated objects
├── metadata/               # Dataset registry, sample metadata
├── results/                # All analysis outputs
├── scripts/                # Compute scripts (run by agent or on HPC)
└── notebooks/              # Jupyter notebooks (visualization, figures, checkpoint review)
    ├── 01_datasets.ipynb       → Table 1
    ├── 02_metadata.ipynb       → Table 1 (cont.)
    ├── 03_qc.ipynb             → Fig S1
    ├── 04_classification.ipynb → Fig S2
    ├── 05_integration.ipynb    → Fig S3
    ├── 06_clustering.ipynb     → Fig S3 (cont.)
    ├── 07_annotation.ipynb     → Fig 1
    ├── 08_differential.ipynb   → Fig 2-3, Table 2
    ├── 09_interpretation.ipynb → Fig 4-5, Fig S4
    ├── 10_trajectory.ipynb     → Fig 6
    └── 11_communication.ipynb  → Fig 7

Scripts vs. Notebooks

Each pipeline module produces two outputs:

A script in scripts/ that does the heavy computation (can run headlessly on HPC)
A notebook in notebooks/ that loads saved results and produces figures and interpretation

Notebooks are independent of scripts — they read from data/ and results/, not from in-memory objects. This means a reviewer can run a notebook without re-executing the full compute pipeline. Notebooks also serve as draft manuscript figures: each maps to specific figures and tables in the planned publication (see arrows in directory listing above).
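
The script-to-notebook handoff can be sketched as a save on one side and a load on the other. JSON is used here purely as a lightweight stand-in for the project's real outputs (such as h5ad files), and the function names, file name, and cluster sizes are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def save_result(results_dir, name, payload):
    """Script side: persist a result so it outlives the compute process."""
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path


def load_result(results_dir, name):
    """Notebook side: read the saved result; nothing is recomputed."""
    return json.loads((results_dir / f"{name}.json").read_text())


# Demo in a temporary directory standing in for results/.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    save_result(results, "cluster_sizes", {"cluster_0": 120, "cluster_1": 85})
    loaded = load_result(results, "cluster_sizes")

print(loaded)
```

Because the two sides share only files on disk, the heavy step can run headlessly on HPC while the notebook is opened later on a workstation.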

Citation

If this analysis contributes to a publication, cite:

The original publications for each included dataset
The tools used (scanpy, scvi-tools, DESeq2, etc.)
This pipeline methodology as appropriate
