Goal: Identify the cell types and cell states present in the human intervertebral disc (IVD) and determine how these change with aging and degeneration.
This project uses a human-gated agentic pipeline to analyze publicly available single-cell RNA-seq datasets of human IVD tissue. An AI agent (Claude) executes well-defined computational steps, while a human PI reviews results and makes scientific decisions at defined checkpoints.
The pipeline is driven by a loop:
    while :; do cat PROMPT.md | claude; done
Each iteration, the agent reads the current state from analysis_plan.md, executes the next task as defined in the module specs, runs validation, and updates the plan. The loop halts at human checkpoints: the agent prepares review materials and stops until the human advances the plan.
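The halting rule can be illustrated with a minimal sketch. It assumes a plan format the project may not actually use: one markdown checkbox per task, with human checkpoints tagged `[CHECKPOINT]`. The function name `next_action`, the tag, and the example plan are hypothetical and only show how an unchecked checkpoint blocks all later tasks.

```python
import re

# Hypothetical plan format: one markdown checkbox per task. A task containing
# "[CHECKPOINT]" needs human sign-off before anything after it may run.
TASK_RE = re.compile(r"^- \[( |x)\] (.+)$")
CHECKPOINT_TAG = "[CHECKPOINT]"  # assumed marker, not taken from the project


def next_action(plan_text):
    """Return ("run", task), ("halt", task), or ("done", None)."""
    for line in plan_text.splitlines():
        match = TASK_RE.match(line.strip())
        if match is None:
            continue  # not a task line
        done, task = match.group(1) == "x", match.group(2)
        if done:
            continue  # finished task, or a checkpoint the human has ticked
        if CHECKPOINT_TAG in task:
            return ("halt", task)  # agent prepares review materials and stops
        return ("run", task)
    return ("done", None)


example_plan = """\
- [x] 03 Preprocessing: per-dataset QC
- [ ] [CHECKPOINT] Review QC thresholds
- [ ] 04 Coarse Annotation
"""

print(next_action(example_plan))
```

In this sketch the agent never ticks a checkpoint box itself. Only a human edit to the plan lets the next iteration move on to the following task.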
| # | Module | Description |
|---|---|---|
| 01 | Dataset Discovery | Systematic search for all human IVD scRNA-seq datasets |
| 02 | Metadata Harmonization | Standardize condition labels, demographics, covariates |
| 03 | Preprocessing | Per-dataset QC, normalization, clustering |
| 04 | Coarse Annotation | Coarse cell classification for scANVI integration anchors |
| 05 | Integration | Cross-study tiered scANVI integration |
| 06 | Clustering | Resolution-optimized Leiden clustering of integrated objects |
| 07 | Post-Integration Annotation | De novo cell type annotation from integrated, clustered data |
| 08 | Differential Analysis | Cell composition changes and pseudobulk DE (DESeq2) between conditions |
| 09 | Biological Interpretation | Pathway enrichment, GRNs, pain-associated gene analysis |
| 10 | Trajectory & Dynamics | Pseudotime, RNA velocity for the cell state continuum |
| 11 | Cell-Cell Communication | Ligand-receptor interactions between IVD cell populations |
| 12 | Reporting | Final report, figures, reproducibility documentation |
- PROMPT.md: Fed to the agent on each loop iteration
- AGENT.md: Execution rules and environment instructions
- analysis_plan.md: Living document tracking progress, decisions, and revisions
- specs/: Module specifications defining inputs, outputs, methods, validation, and checkpoints
Most modules run on a standard workstation (32GB RAM). Two modules benefit from HPC:
- Module 05 (Integration): scVI/scANVI training is significantly faster with GPU. ~200k cells across all studies.
- Module 09 (Biological Interpretation, SCENIC): Gene regulatory network inference is RAM-intensive (64GB+).
Human-gated, not fully autonomous. The agent executes computational steps but stops at decision points for human review. The automated validation checks are regression safeguards, not proof of correctness.
Per-dataset first, then integrate. Each dataset is preprocessed and annotated independently before cross-study integration. This avoids the known problem where batch correction erases the subtle cell state variation in the chondrocyte/fibroblast continuum.
Tiered integration. Non-resident cells (immune, endothelial) integrate easily and are handled with standard methods. Resident IVD cells require conservative integration or alternative approaches (label transfer, metacells) to preserve the biological continuum.
The plan is revisable. Every human checkpoint includes an evaluation of whether the downstream plan still makes sense given what's been learned. The analysis may loop back to earlier steps with revised parameters.
Pseudobulk DE, not single-cell DE. Differential expression uses pseudobulk aggregation (DESeq2/edgeR) to avoid inflated statistics from treating cells as independent observations.
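As a rough illustration of the aggregation step, the sketch below sums raw counts over all cells that share a sample and a cell type, so each sample contributes one profile per cell type instead of thousands of correlated cells. The function name, gene names, sample names, and counts are invented for the example. The actual pipeline presumably aggregates from the processed h5ad objects before handing the count matrix to DESeq2/edgeR.

```python
from collections import defaultdict


def pseudobulk(cells, n_genes):
    """Sum raw counts over all cells sharing the same (sample, cell_type)."""
    totals = defaultdict(lambda: [0] * n_genes)
    for cell in cells:
        key = (cell["sample"], cell["cell_type"])
        for i, count in enumerate(cell["counts"]):
            totals[key][i] += count
    return dict(totals)


# Toy data (made up): two genes, two samples, two cell types.
genes = ["geneA", "geneB"]
cells = [
    {"sample": "s1", "cell_type": "immune", "counts": [5, 0]},
    {"sample": "s1", "cell_type": "immune", "counts": [3, 1]},
    {"sample": "s2", "cell_type": "immune", "counts": [0, 4]},
    {"sample": "s2", "cell_type": "endothelial", "counts": [1, 7]},
]

profiles = pseudobulk(cells, len(genes))
for key, summed in sorted(profiles.items()):
    print(key, dict(zip(genes, summed)))
```

Four cells collapse into three pseudobulk profiles here. The differential test then compares samples within a cell type, which makes the sample the unit of replication rather than the cell.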
ivd-analysis/
├── PROMPT.md # Agent loop prompt
├── AGENT.md # Agent execution rules
├── README.md # This file
├── analysis_plan.md # Living plan document
├── specs/ # Module specifications
│ ├── 00_PROJECT.md
│ ├── 01_DATASET_DISCOVERY.md
│ ├── 02_METADATA.md
│ ├── 03_PREPROCESSING.md
│ ├── 04_ANNOTATION.md
│ ├── 05_INTEGRATION.md
│ ├── 06_CLUSTERING.md
│ ├── 07_POST_ANNOTATION.md
│ ├── 08_DIFFERENTIAL.md
│ ├── 09_INTERPRETATION.md
│ ├── 10_TRAJECTORY.md
│ ├── 11_COMMUNICATION.md
│ └── 12_REPORTING.md
├── data/
│ ├── raw/ # Downloaded datasets
│ ├── processed/ # Per-dataset h5ad files
│ └── integrated/ # Cross-study integrated objects
├── metadata/ # Dataset registry, sample metadata
├── results/ # All analysis outputs
├── scripts/ # Compute scripts (run by agent or on HPC)
└── notebooks/ # Jupyter notebooks (visualization, figures, checkpoint review)
    ├── 01_datasets.ipynb → Table 1
    ├── 02_metadata.ipynb → Table 1 (cont.)
    ├── 03_qc.ipynb → Fig S1
    ├── 04_classification.ipynb → Fig S2
    ├── 05_integration.ipynb → Fig S3
    ├── 06_clustering.ipynb → Fig S3 (cont.)
    ├── 07_annotation.ipynb → Fig 1
    ├── 08_differential.ipynb → Fig 2-3, Table 2
    ├── 09_interpretation.ipynb → Fig 4-5, Fig S4
    ├── 10_trajectory.ipynb → Fig 6
    └── 11_communication.ipynb → Fig 7
Each pipeline module produces two outputs:
- A script in scripts/ that does the heavy computation (can run headlessly on HPC)
- A notebook in notebooks/ that loads saved results and produces figures and interpretation
Notebooks are independent of scripts — they read from data/ and results/, not from in-memory objects. This means a reviewer can run a notebook without re-executing the full compute pipeline. Notebooks also serve as draft manuscript figures: each maps to specific figures and tables in the planned publication (see arrows in directory listing above).
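The script/notebook split can be sketched as two functions that communicate only through files on disk. This is a minimal illustration under assumptions: the file name `cluster_sizes.csv`, the function names, and the values are hypothetical, and a temporary directory stands in for results/.

```python
import csv
import tempfile
from pathlib import Path


def script_step(results_dir):
    """Compute side: write a result table to disk and return its path."""
    out = Path(results_dir) / "cluster_sizes.csv"  # hypothetical file name
    rows = [("cluster", "n_cells"), ("0", 120), ("1", 80)]  # made-up values
    with out.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return out


def notebook_step(results_dir):
    """Notebook side: rebuild the table from disk only, no shared objects."""
    path = Path(results_dir) / "cluster_sizes.csv"
    with path.open(newline="") as handle:
        return {row["cluster"]: int(row["n_cells"]) for row in csv.DictReader(handle)}


# A temporary directory stands in for results/ in this sketch.
with tempfile.TemporaryDirectory() as results_dir:
    script_step(results_dir)
    sizes = notebook_step(results_dir)

print(sizes)
```

Because `notebook_step` depends only on the saved file, it can be rerun on a laptop after the compute step has finished on HPC.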
If this analysis contributes to a publication, cite:
- The original publications for each included dataset
- The tools used (scanpy, scvi-tools, DESeq2, etc.)
- This pipeline methodology as appropriate