Programmable Agent for Retrieving Contextualized Experiments
PARCE is an agentic workflow that fetches public omics data and produces structured JSON narratives describing how the data was obtained. Each narrative interleaves human-readable descriptions with URI references to raw data files, making it suitable for training multimodal autoregressive embedding models.
User / CLI
│
▼
main.py ──► agent/curator.py ──► AzureAIAgentClient
│
┌─────────┴─────────┐
│ Azure AI Foundry │
│ (any model) │
└─────────┬─────────┘
│
┌───────────┼───────────┐
▼ │ ▼
tool_call: │ structured output:
geo_fetcher.py │ ExperimentNarrative
(tools/) │ (models/)
│ │
▼ │
metadata JSON ─────┘
The agent uses Azure AI Foundry as its model gateway. Any model deployed in your Foundry project -- Mistral, GPT-4o, DeepSeek, Llama, etc. -- can be used by changing a single environment variable (AZURE_AI_MODEL_DEPLOYMENT_NAME). Tools, prompts, and Pydantic output schemas remain unchanged.
parce/
├── pyproject.toml # Dependencies and build config
├── .env.example # Template for required env vars
├── src/
│ └── parce/
│ ├── main.py # Entry point
│ ├── agent/
│ │ ├── curator.py # Agent factory (AzureAIAgentClient)
│ │ └── prompts.py # System prompts
│ ├── tools/
│ │ └── geo_fetcher.py # GEO metadata fetcher (stub)
│ ├── models/
│ │ └── narrative.py # Pydantic output schemas
│ └── config/
│ └── settings.py # Env-based configuration
├── data_pipelines/ # Future: Spark / ADLS scripts
├── data/ # Local test data (gitignored)
└── tests/
└── test_models.py
Key design decisions:
- src-layout prevents accidental imports from the project root.
- agent/ is decoupled from tools/ so new data sources (e.g. TCellAtlas, SRA) are added as new tool files without touching orchestration logic.
- models/ schemas serve double duty: structured output for the agent (
response_format) and validation for downstream consumers. - data_pipelines/ lives at the repo root because Spark jobs are submitted independently from the Python package.
- Python 3.11+
- Azure CLI installed
- An Azure AI Foundry project with at least one model deployed (e.g. Mistral Large, GPT-4o, DeepSeek-R1)
az loginThis lets AzureCliCredential obtain tokens for your Foundry project without managing API keys.
cp .env.example .envEdit .env with your Foundry project endpoint and deployment name:
AZURE_AI_PROJECT_ENDPOINT=https://<your-resource>.services.ai.azure.com/api/projects/<project-id>
AZURE_AI_MODEL_DEPLOYMENT_NAME=mistral-large
The project endpoint is found in your Azure AI Foundry project settings page.
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"Note: The
agent-frameworkpackages are currently in pre-release. If installation fails, run:pip install agent-framework agent-framework-azure-ai --pre pip install -e ".[dev]"
parce
# or
python -m parce.mainThe agent will fetch mock metadata for GSE164378 and return a structured ExperimentNarrative JSON.
pytestBecause PARCE uses Azure AI Foundry as a model gateway, swapping the underlying LLM is a one-line change in .env:
# Mistral
AZURE_AI_MODEL_DEPLOYMENT_NAME=mistral-large
# OpenAI
AZURE_AI_MODEL_DEPLOYMENT_NAME=gpt-4o
# DeepSeek
AZURE_AI_MODEL_DEPLOYMENT_NAME=DeepSeek-R1
# Meta Llama
AZURE_AI_MODEL_DEPLOYMENT_NAME=Meta-Llama-3-70BThe deployment name must match a model you have deployed in your Foundry project's model catalog. No code changes are required -- all providers produce a standard Agent with the same interface.
For models not in the Azure AI Foundry catalog (e.g. Anthropic Claude), the Microsoft Agent Framework provides dedicated providers (AnthropicChatClient) with an identical agent interface.
- FASTQ-to-Parquet conversion using PySpark on Azure Databricks
- Azure Data Lake Storage Gen2 integration for persisting narratives and raw genomic data at scale
- Batch orchestration for processing large T-cell transcriptomics datasets
- TCellAtlas fetcher tool
- SRA direct metadata fetcher
- CellxGene Census integration
The structured narratives produced by PARCE will serve as training data for a multimodal autoregressive embedding model that jointly learns from experimental metadata and raw omics signals.