AnnData HuggingFace Datasets Pipeline

A comprehensive pipeline for processing single-cell RNA-seq data and creating HuggingFace datasets for machine learning. The pipeline covers every stage, from raw data download through preprocessing and embedding generation to final dataset creation and publication.

Overview

This pipeline transforms raw single-cell RNA-seq data into ready-to-use HuggingFace datasets through a series of automated steps:

1. Download → 2. Preprocessing → 3. Embedding Prep → 4. CPU Embedding → 5. GPU Embedding → 6. Dataset Creation
   (optional)                                                                                    ↓
                                                                                    HuggingFace Hub Publication

Key Features:

🔄 Automated workflow orchestration with SLURM or local execution
🧬 Memory-efficient processing of large-scale single-cell datasets
🎯 Multiple embedding methods (PCA, scVI, HVG, Geneformer)
📊 Quality control with automatic plots and metrics
🤗 HuggingFace integration with rich dataset cards
🔧 Highly configurable with dataset-centric YAML configs

Table of Contents

Installation
Test Run
HuggingFace Hub Integration
Configuration
- Dataset Configuration
- Workflow Orchestrator Configuration
Quick Start
- Local Execution
- SLURM Cluster Execution
Nextcloud Integration
Zenodo Integration
Pipeline Steps
Advanced Usage
Adding a New Embedding Method
Documentation
Troubleshooting

Installation

Prerequisites

Python 3.10-3.13 (see Python version requirements)
Git (submodules only needed if installing Geneformer support or running the full pipeline)

Installation

Option 1: Using pip (For Library Usage)

If you only want to use the package as a library in another project (without running the full pipeline), you can install it directly:

pip install git+https://github.com/mengerj/adata_hf_datasets.git

Note: The pipeline scripts and configuration files are not included in the pip package. If you need to run the full pipeline workflows, see Option 2: Clone Repository below.

Example: Using as a library in your project

from adata_hf_datasets import InitialEmbedder, AnnDataSetConstructor
from adata_hf_datasets.pp import preprocess_adata
import anndata as ad

# Use the embedders and preprocessing functions
embedder = InitialEmbedder(method="gs10k")
embeddings = embedder.embed(adata=your_adata)

# Or use preprocessing
processed_adata = preprocess_adata(your_raw_adata)

# Or create a hf_dataset from an adata object
constr = AnnDataSetConstructor(dataset_format="multiplets")
# See the docs of the method itself for details
constr.add_anndata(
    adata=your_adata,
    caption_key="your_caption_key_in_adata.obs",
    sentence_keys=["sample_idx"],
    adata_link="local_path or remote_share_link to h5ad or zarr file of this adata object",
)
# You can add multiple AnnData objects before creating the dataset
ds = constr.get_dataset()

Note: Some embedding methods (scVI, Geneformer) require additional packages. If you try to use them, you'll see helpful error messages with installation instructions.

Option 2: Clone Repository (For Pipeline/Workflow Usage)

For pipeline/workflow usage, you need to clone the repository. The pipeline scripts and configuration files are not included in the pip package.

Clone the repository:

git clone https://github.com/mengerj/adata_hf_datasets.git
cd adata_hf_datasets

Install the package:

You can use either uv (recommended, faster) or pip:

Option A: Using uv (Recommended)

uv is a fast Python package installer. Install it first if you don't have it:

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

Then install the package:

uv sync
source .venv/bin/activate

Option B: Using pip

# Create and activate a virtual environment with your method of choice first,
# then install the package in editable mode
pip install -e .

Optional: Install optional dependencies

Some embedding methods require additional packages that are not installed by default:

scVI embeddings: Install scvi-tools if you want to use scVI embeddings:
```
pip install scvi-tools
```
Geneformer embeddings: Requires cloning the Geneformer submodule and installing it:
```
git submodule update --init --recursive
pip install external/Geneformer
```
Note: Geneformer can only be installed on Linux machines with CUDA support. It also requires git-lfs to be installed on the machine.

If you try to use these methods without the required packages, you'll see helpful error messages with installation instructions.

Python Version Requirements

The package requires Python >=3.10,<3.14, that is 3.10, 3.11, 3.12, or 3.13. Python 3.9 and earlier and Python 3.14 and later are not supported.
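
To check an interpreter against this range before installing, a minimal sketch could look like the following. The function name `is_supported` is hypothetical and not part of the package; only the version bounds come from the requirement above.

```python
from __future__ import annotations

import sys

# Bounds taken from the stated requirement ">=3.10,<3.14".
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 14)


def is_supported(version: tuple[int, ...]) -> bool:
    """Return True if (major, minor, ...) lies inside the supported range."""
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


# Report on the interpreter that runs this snippet.
current = sys.version_info[:3]
status = "supported" if is_supported(current) else "not supported"
print(f"Python {'.'.join(map(str, current))} is {status}")
```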

What Gets Installed

The base installation includes:

Core dependencies (anndata, scanpy, datasets, huggingface-hub)
Embedding tools (PCA, HVG, gene selection)
Workflow orchestration tools (Hydra)
All required dependencies

Optional dependencies (install separately if needed):

scvi-tools: Required for scVI embeddings (pip install scvi-tools)
geneformer: Required for Geneformer embeddings (requires submodule initialization, see above)


Test Run

Before attempting to add your own dataset, try running the workflow with the example data. This will download a .h5ad, preprocess it, run several embedders and create a HuggingFace dataset.

Note: This test run will NOT use HuggingFace Hub or Nextcloud (both are disabled in the example config for simplicity).

Run the Test Workflow

# Activate the virtual environment
source .venv/bin/activate

# Run workflow in foreground (recommended for first runs)
python scripts/workflow/submit_workflow.py \
    --config dataset_config_example \
    --foreground

# Or run in background (detached)
python scripts/workflow/submit_workflow.py \
    --config dataset_config_example

Monitor Progress

Check the main log to see the progress:

# View workflow summary (replace date/run_id with your actual run)
tail -f outputs/2025-*/workflow_local_*/logs/workflow_summary.log

Specific logs for each step are in their respective subfolders:

outputs/{date}/workflow_{run_id}/preprocessing/
outputs/{date}/workflow_{run_id}/embedding/
outputs/{date}/workflow_{run_id}/dataset_creation/

Load the Created Dataset

Once the workflow completes successfully, find the dataset location in the logs:

# Check the dataset creation output (it will show the final location)
cat outputs/*/workflow_local_*/dataset_creation/job_local_*/create_ds_0.out

Load and inspect the dataset:

from datasets import load_from_disk

# Replace with the actual path from the logs; the wildcards below are placeholders
# and are not expanded by load_from_disk()
dataset_path = "outputs/2025-*/workflow_local_*_*/dataset_creation/job_local_*/job_0/demo_dataset"

# Load the dataset
ds = load_from_disk(dataset_path)

# Inspect it
print(ds)
# DatasetDict({
#     train: Dataset({...})
#     validation: Dataset({...})
# })

# Check a sample
print(ds['train'][0])

Important: Use load_from_disk() to load locally saved datasets. The load_dataset() function is for loading from the HuggingFace Hub.
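
Because the output path contains a date and a run ID, it can be convenient to resolve it with a wildcard pattern before loading. The helper below is a sketch under that assumption; `find_latest_dataset_dir` is a hypothetical name, and it simply picks the most recently modified directory that matches.

```python
from __future__ import annotations

import glob
import os


def find_latest_dataset_dir(pattern: str) -> str:
    """Expand a wildcard pattern and return the most recently modified match."""
    matches = [path for path in glob.glob(pattern) if os.path.isdir(path)]
    if not matches:
        raise FileNotFoundError(f"No dataset directory matches: {pattern}")
    # If several runs exist, prefer the newest one.
    return max(matches, key=os.path.getmtime)
```

The returned path can then be passed to load_from_disk(), for example with the pattern "outputs/*/workflow_local_*/dataset_creation/job_local_*/job_0/demo_dataset".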

HuggingFace Hub Integration

The pipeline can automatically publish datasets to the HuggingFace Hub for easy sharing and distribution.

Setup

Create a HuggingFace account:
- Visit https://huggingface.co/join
- Create an account (free)
Get your access token:
- Go to https://huggingface.co/settings/tokens
- Create a new token with write permissions
- Copy the token
Configure authentication:

Create a .env file in the project root:

# In the project root directory
cat > .env << 'EOF'
# HuggingFace Hub authentication
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx  # Your token here
EOF

Or log in via CLI:

huggingface-cli login
# Paste your token when prompted

Configure dataset publication:

In your dataset config (or conf/dataset_default.yaml):

dataset_creation:
  enabled: true
  push_to_hub: true # Enable HuggingFace Hub upload

Usage

When the workflow completes, your dataset will be published to:

https://huggingface.co/datasets/{your-username}/{dataset-name}

Example: https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_10k

The dataset format and caption key information are documented in the dataset card metadata.

The dataset will include:

✅ All splits (train/validation or test)
✅ Automatically generated README with dataset card
✅ Schema and feature descriptions
✅ Download statistics and usage examples

Private vs Public Datasets

By default, datasets are uploaded as private. You can control this in your dataset configuration:

dataset_creation:
  push_to_hub: true
  private: true # true = private (default), false = public

To make a dataset public, set private: false in your config, or update the repository settings on HuggingFace after upload.

Configuration

The pipeline uses two main configuration files:

Dataset Configuration

Dataset configurations define what data to process and how to process it. Each dataset has its own YAML file in the conf/ directory.

Example: conf/dataset_config_example.yaml

Template with all parameters: conf/dataset_default.yaml

Key Sections

Dataset Metadata:

dataset:
  name: "human_pancreas"
  description: "Human pancreas dataset"
  download_url: "https://example.com/data.h5ad"

Common Keys (used across all steps):

batch_key: "batch_id" # Batch/dataset identifier
annotation_key: "cell_type" # Cell type annotations
caption_key: "natural_language_annotation" # Natural language descriptions

Step Configuration (enable/disable and configure each step):

download:
  enabled: false
  subset_size: 10000

preprocessing:
  enabled: true
  chunk_size: 10000
  split_dataset: true

embedding_preparation:
  enabled: true
  methods: ["geneformer"]

embedding_cpu:
  enabled: true
  methods: ["pca", "scvi_fm", "hvg"]

embedding_gpu:
  enabled: true
  methods: ["geneformer"]

dataset_creation:
  enabled: true
  dataset_format: "multiplets"
  negatives_per_sample: 2

All available parameters are documented in dataset_default.yaml.

Workflow Orchestrator Configuration

The workflow orchestrator configuration defines where and how to run the pipeline.

Configuration file: conf/workflow_orchestrator.yaml

Key Concept: Unified Configuration

You can configure both local and SLURM paths simultaneously in the same config file. Simply change the execution_mode to switch between them - no need to manually edit paths or use different scripts!

Execution Mode

Choose between local execution or SLURM cluster by setting execution_mode:

workflow:
  execution_mode: "local" # or "slurm"

The submission script automatically uses the appropriate paths based on this setting.

Complete Configuration Example

Here's a complete configuration with both local and SLURM settings:

workflow:
  # Switch between "local" and "slurm" to change execution mode
  execution_mode: "local"

  # Local execution settings
  local_output_directory: "../outputs" # Output directory for local runs
  local_project_directory: "." # Project directory for local runs (relative to script location)
  local_base_file_path: "./data/RNA" # Base data directory for local runs
  local_max_workers: 2 # Number of parallel workers for local execution
  local_enable_gpu: false # Enable GPU embedding locally (requires CUDA)

  # SLURM execution settings
  cpu_login:
    host: "cpu_cluster" # SSH host (must be in ~/.ssh/config)
    user: "username"
  gpu_login:
    host: "gpu_cluster" # SSH host (must be in ~/.ssh/config)
    user: "username"
  cpu_partition: "slurm" # CPU partition name (check with `sinfo`)
  gpu_partition: "gpu" # GPU partition name (check with `sinfo`)
  slurm_output_directory: "/home/username/outputs" # Output directory on cluster
  slurm_project_directory: "/home/username/adata_hf_datasets" # Project directory on cluster
  slurm_base_file_path: "/scratch/global/username/data/RNA" # Base data directory (must be accessible by both clusters!)

  # Shared settings
  venv_path: ".venv" # Virtual environment path (relative to project_directory)
  enable_transfers: false # Use shared filesystem (recommended)
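
How the mode switch selects paths can be pictured with a short sketch. This is an illustration of the idea only, not the submission script's actual code; `select_paths` is a hypothetical helper that relies on the `local_` and `slurm_` key prefixes shown in the configuration above.

```python
from __future__ import annotations


def select_paths(workflow: dict) -> dict:
    """Pick the output, project, and data paths that belong to the active mode."""
    mode = workflow["execution_mode"]
    if mode not in ("local", "slurm"):
        raise ValueError(f"execution_mode must be 'local' or 'slurm', got {mode!r}")
    # Keys follow the pattern <mode>_<name>, e.g. local_output_directory.
    names = ("output_directory", "project_directory", "base_file_path")
    return {name: workflow[f"{mode}_{name}"] for name in names}


workflow_cfg = {
    "execution_mode": "local",
    "local_output_directory": "../outputs",
    "local_project_directory": ".",
    "local_base_file_path": "./data/RNA",
    "slurm_output_directory": "/home/username/outputs",
    "slurm_project_directory": "/home/username/adata_hf_datasets",
    "slurm_base_file_path": "/scratch/global/username/data/RNA",
}
print(select_paths(workflow_cfg))
```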

Important Notes for SLURM

SSH Configuration:

The SLURM mode requires passwordless SSH access to the clusters
Set up SSH keys so that ssh cpu_cluster and ssh gpu_cluster work without password prompts
Configure hosts in ~/.ssh/config if needed (including ProxyJump if required)
Keep the local and cluster copies of the repository in sync, for example after changing a configuration file

Shared Storage:

slurm_base_file_path must be accessible by both CPU and GPU clusters
Typically a global scratch filesystem (e.g., /scratch/global/username/)
Data is written by one cluster and read by another during the workflow
Local execution is usually much faster, because I/O on a global filesystem tends to be slow and the pipeline reads data into memory at several steps. If you don't need a GPU, or have one locally, running locally is recommended. If you don't mind waiting and just want to submit a batch of datasets, the cluster is better suited.

Cluster-Specific Settings:

cpu_partition and gpu_partition names are cluster-specific
Check your cluster's SLURM configuration for the correct partition names
Use sinfo on your cluster to list available partitions

Quick Start

Local Execution (macOS/Linux)

For running the complete pipeline on your local machine:

Configure for local execution:

Edit conf/workflow_orchestrator.yaml and set execution_mode: "local":

workflow:
  execution_mode: "local" # Switch to local mode

  # Local paths (already configured above)
  local_output_directory: "../outputs"
  local_project_directory: "."
  local_base_file_path: "./data/RNA"
  local_max_workers: 2
  local_enable_gpu: false # Set to true if you have CUDA-capable GPU

Configure your dataset: Take a close look at the example dataset config and the default config

Edit or create a dataset config in conf/, for example conf/my_dataset.yaml:

defaults:
  - dataset_default.yaml
  - _self_

dataset:
  name: "my_dataset"
  description: "My single-cell dataset"
  download_url: "https://..."
  full_name: "my_dataset_full" # Required if subsetting the dataset

# Enable/disable steps as needed
preprocessing:
  enabled: true
embedding_cpu:
  enabled: true
# ... etc

Run the workflow:

# Activate virtual environment
source .venv/bin/activate

# Run workflow in foreground (recommended for first runs)
# You can use either a config name or a path:
python scripts/workflow/submit_workflow.py \
    --config my_dataset \
    --foreground

# Or use a relative path:
python scripts/workflow/submit_workflow.py \
    --config conf/my_dataset.yaml \
    --foreground

# Or use an absolute path:
python scripts/workflow/submit_workflow.py \
    --config /absolute/path/to/my_dataset.yaml \
    --foreground

Note: The --config argument accepts any of the following:

A config name (e.g., my_dataset) - looks for conf/my_dataset.yaml
A relative path (e.g., conf/my_dataset.yaml) - relative to project root
An absolute path (e.g., /path/to/config.yaml) - full file path
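
The three accepted forms can be told apart with a few path checks. The sketch below shows one plausible way to do it; `resolve_config_path` is a hypothetical function, and the rule used to distinguish a bare name from a relative path (a YAML suffix or a directory component) is an assumption rather than the script's documented behaviour.

```python
from __future__ import annotations

from pathlib import Path


def resolve_config_path(arg: str, project_root: Path) -> Path:
    """Map a --config value to a YAML file path."""
    candidate = Path(arg)
    if candidate.is_absolute():
        # Absolute path: use as given.
        return candidate
    if candidate.suffix in {".yaml", ".yml"} or len(candidate.parts) > 1:
        # Relative path: interpret relative to the project root.
        return project_root / candidate
    # Bare config name: look it up in conf/.
    return project_root / "conf" / f"{arg}.yaml"
```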

What happens:

The workflow runs each step sequentially
Steps are executed based on the enabled flags in your dataset config
Logs are written to {local_output_directory}/{date}/workflow_{timestamp}/
Data files are written to the local_base_file_path directory

SLURM Cluster Execution

For running on SLURM clusters with SSH orchestration:

Set up SSH keys:

# Generate SSH key if you don't have one
ssh-keygen -t ed25519

# Copy to clusters
ssh-copy-id username@cpu_cluster
ssh-copy-id username@gpu_cluster

# Test passwordless access
ssh cpu_cluster "hostname"
ssh gpu_cluster "hostname"

Note: Depending on your cluster, you might need to set up a ProxyJump. Edit the ~/.ssh/config file on your local machine.

Configure for SLURM:

Edit conf/workflow_orchestrator.yaml and set execution_mode: "slurm":

workflow:
  execution_mode: "slurm" # Switch to SLURM mode

  # SLURM paths (already configured above)
  cpu_login:
    host: "cpu_cluster" # Your CPU cluster SSH alias
    user: "username"
  gpu_login:
    host: "gpu_cluster" # Your GPU cluster SSH alias
    user: "username"
  cpu_partition: "slurm" # Check with `sinfo` on your cluster
  gpu_partition: "gpu" # Check with `sinfo` on your cluster
  slurm_output_directory: "/home/username/outputs"
  slurm_project_directory: "/home/username/adata_hf_datasets"
  slurm_base_file_path: "/scratch/global/username/data/RNA" # Must be accessible by both clusters!

Note: Before attempting to run on SLURM, make sure that the repository is installed on the cluster. Follow the same installation steps as for the local setup. uv can be installed without sudo rights.

Ensure the repository is synced on the cluster:

# On your local machine, push to git
git push

# SSH to the cluster and pull
ssh cpu_cluster
cd /home/username/adata_hf_datasets
git pull
git submodule update --init --recursive
uv sync --all-extras
exit

Submit the workflow:

# From your local machine (same script as local mode!)
# You can use either a config name or a path:
python scripts/workflow/submit_workflow.py \
    --config my_dataset

# Or use a path:
python scripts/workflow/submit_workflow.py \
    --config conf/my_dataset.yaml

Note: The --config argument accepts either a config name or a file path (relative or absolute), just like in local mode.

What happens:

A master SLURM job is submitted to the CPU cluster
The master job orchestrates all subsequent steps
Steps run on appropriate clusters (CPU vs GPU)
Job dependencies are automatically managed by SLURM
You can monitor progress with ssh cpu_cluster "squeue -u username"
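
SLURM expresses job dependencies through the `--dependency` option of `sbatch`. The sketch below shows how a step could be chained after earlier jobs; it is an illustration, not the orchestrator's actual submission code, and the function name and script name are hypothetical.

```python
from __future__ import annotations


def build_sbatch_command(
    script: str,
    partition: str,
    depends_on: list[int] | None = None,
) -> list[str]:
    """Build an sbatch command line, optionally waiting for earlier jobs."""
    # --parsable makes sbatch print only the job ID, which is easy to capture.
    command = ["sbatch", "--parsable", f"--partition={partition}"]
    if depends_on:
        # afterok: start only once all listed jobs have finished successfully.
        job_ids = ":".join(str(job_id) for job_id in depends_on)
        command.append(f"--dependency=afterok:{job_ids}")
    command.append(script)
    return command


print(" ".join(build_sbatch_command("run_embed_gpu.slurm", "gpu", depends_on=[1234])))
```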

Output location:

Logs: {slurm_output_directory}/{date}/workflow_{job_id}/
Data: {slurm_base_file_path}/ (organized into raw/, processed/, processed_with_emb/)

Nextcloud Integration

Nextcloud integration allows you to store large AnnData files remotely, making your HuggingFace datasets truly autonomous and shareable without local file dependencies.

Why Nextcloud?

HuggingFace datasets store only metadata (cell sentences, captions, negative indices). The actual AnnData files with expression matrices and embeddings are stored separately. Nextcloud provides:

☁️ Remote storage for large AnnData files
🔗 Share links embedded in the dataset for downstream access
🌐 Independence from local file systems
🤝 Easy sharing - anyone with the HF dataset can access the data

Setup

Get Nextcloud access:
- Obtain a Nextcloud account (institutional, self-hosted, or cloud provider)
- You need: URL, username, and password
Configure credentials:

Create or edit the .env file in the project root:

# In the project root directory
cat >> .env << 'EOF'

# Nextcloud authentication
NEXTCLOUD_URL=https://cloud.example.com  # Your Nextcloud instance URL (what you type in browser)
NEXTCLOUD_USER=your-username              # Your Nextcloud username
NEXTCLOUD_PASSWORD=your-password          # Your Nextcloud password
EOF

Security note: The .env file is in .gitignore and will not be committed to git.

Enable Nextcloud in dataset config:

dataset_creation:
  enabled: true
  use_nextcloud: true # Enable Nextcloud upload

  nextcloud_config:
    url: "NEXTCLOUD_URL" # Will be read from .env
    username: "NEXTCLOUD_USER" # Will be read from .env
    password: "NEXTCLOUD_PASSWORD" # Will be read from .env
    remote_path: "" # Automatically set based on dataset

The environment variables will be automatically resolved at runtime.
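
One way to picture this resolution step: each config value that names a set environment variable is swapped for that variable's value, and everything else is left untouched. This is a sketch of the idea, not the pipeline's own implementation; `resolve_env_names` is a hypothetical helper.

```python
from __future__ import annotations

import os
from typing import Mapping


def resolve_env_names(
    config: Mapping[str, str],
    environ: Mapping[str, str] | None = None,
) -> dict[str, str]:
    """Replace values that name a set environment variable with its value."""
    env = os.environ if environ is None else environ
    # Values that are not variable names (such as an empty remote_path) pass through.
    return {key: env.get(value, value) for key, value in config.items()}


nextcloud_config = {
    "url": "NEXTCLOUD_URL",
    "username": "NEXTCLOUD_USER",
    "password": "NEXTCLOUD_PASSWORD",
    "remote_path": "",
}
fake_env = {
    "NEXTCLOUD_URL": "https://cloud.example.com",
    "NEXTCLOUD_USER": "your-username",
    "NEXTCLOUD_PASSWORD": "your-password",
}
resolved = resolve_env_names(nextcloud_config, fake_env)
print(resolved["url"])
```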

How It Works

During dataset creation:
- Processed AnnData files are uploaded to Nextcloud
- Share links are generated for each file
- Links are embedded in the HuggingFace dataset
When using the dataset:
- The adata_link column contains Nextcloud share URLs
- Downstream models can download files on-demand
- No local file dependencies needed

Nextcloud Directory Structure

Files are organized in Nextcloud as:

{remote_path}/
└── {dataset_name}/
    ├── train/
    │   ├── chunk_0.zarr.zip
    │   └── chunk_1.zarr.zip
    └── validation/
        └── chunk_0.zarr.zip

Zenodo Integration

Zenodo integration provides persistent, citable storage for your AnnData files with DOI assignment, making your datasets suitable for academic publication and long-term archival.

Why Zenodo?

Zenodo is a research data repository hosted by CERN that provides:

📚 Academic publishing - Get a DOI for your dataset
🔒 Long-term archival - CERN-backed preservation guarantees
🆓 Free storage - Up to 50GB per dataset
🌍 Public accessibility - Open science friendly
📝 Versioning - Built-in support for dataset versions
🧪 Sandbox testing - Test your uploads before going to production

Setup

Get a Zenodo account:
- Create an account at zenodo.org (production) or sandbox.zenodo.org (testing)
- These are separate accounts - sandbox is recommended for testing first
Create an access token:

For production:

Go to https://zenodo.org/account/settings/applications/
Click "New token"
Select scopes: deposit:write and deposit:actions
Copy the generated token

For sandbox (testing):

Go to https://sandbox.zenodo.org/account/settings/applications/
Create a token with the same scopes as above
Copy the generated token (this is a different token from production)

Configure credentials:

Create or edit the .env file in the project root:

# For production Zenodo
ZENODO_TOKEN=your-production-token-here

# For sandbox Zenodo (separate token required)
ZENODO_SANDBOX_TOKEN=your-sandbox-token-here

Security note: The .env file is in .gitignore and will not be committed to git.

Enable Zenodo in dataset config:

For production:

dataset_creation:
  enabled: true
  use_zenodo: true # Enable Zenodo upload

  zenodo_config:
    sandbox: false # Use production Zenodo

For sandbox (testing):

dataset_creation:
  enabled: true
  use_zenodo: true # Enable Zenodo upload

  zenodo_config:
    sandbox: true # Use sandbox Zenodo for testing

The appropriate environment variable (ZENODO_TOKEN or ZENODO_SANDBOX_TOKEN) will be automatically used based on the sandbox setting.
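
The selection logic amounts to a small switch on the sandbox flag. The sketch below illustrates it; `zenodo_settings` is a hypothetical function, and pairing each token with its API base URL is an assumption about how a client would use it.

```python
from __future__ import annotations

import os
from typing import Mapping


def zenodo_settings(
    sandbox: bool,
    environ: Mapping[str, str] | None = None,
) -> tuple[str, str]:
    """Return (api_base_url, token) for sandbox or production Zenodo."""
    env = os.environ if environ is None else environ
    variable = "ZENODO_SANDBOX_TOKEN" if sandbox else "ZENODO_TOKEN"
    token = env.get(variable)
    if not token:
        raise RuntimeError(f"{variable} is not set; add it to your .env file")
    base_url = "https://sandbox.zenodo.org/api" if sandbox else "https://zenodo.org/api"
    return base_url, token
```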

How It Works

During dataset creation:
- Processed AnnData files are packaged as ZIP archives
- A single Zenodo deposit (draft) is created for the entire dataset
- All files (train/validation splits) are uploaded to this deposit
- Download URLs are generated and embedded in the HuggingFace dataset
- The deposit remains in draft state - you can publish it manually on Zenodo
When using the dataset:
- The adata_link column contains Zenodo download URLs
- Files can be downloaded on-demand using the Zenodo API
- No authentication required for published deposits
Deposit management:
- Deposit information is saved in zenodo_share_map.json in your data directory
- Re-running the pipeline reuses the same deposit (no duplicates)
- You can edit metadata and publish the deposit on the Zenodo website
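
Deposit reuse can be understood as a lookup in a small JSON file before anything new is created. The sketch below illustrates that pattern; the JSON layout (a `deposit_id` entry per dataset name) and the `create_deposit` callback are assumptions made for illustration and may not match the real zenodo_share_map.json.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def get_or_create_deposit_id(
    map_path: Path,
    dataset_name: str,
    create_deposit: Callable[[], int],
) -> int:
    """Reuse a stored deposit ID for the dataset, or create and record one."""
    share_map = json.loads(map_path.read_text()) if map_path.exists() else {}
    if dataset_name in share_map:
        # A deposit already exists for this dataset: no duplicate is created.
        return share_map[dataset_name]["deposit_id"]
    deposit_id = create_deposit()
    share_map[dataset_name] = {"deposit_id": deposit_id}
    map_path.write_text(json.dumps(share_map, indent=2))
    return deposit_id
```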

Production vs Sandbox

Sandbox (sandbox.zenodo.org):

✅ Safe testing environment
✅ Same API as production
✅ Can be deleted/reset without consequences
❌ Not persistent (may be wiped periodically)
❌ No real DOIs

Production (zenodo.org):

✅ Permanent storage with real DOIs
✅ Suitable for publication
⚠️ Deposits cannot be deleted once published
⚠️ Use with care

Workflow: Always test with sandbox first, then switch to production when ready.

Other Storage Backends

Currently supported: Nextcloud, Zenodo

Want other backends? If you need support for other cloud storage providers (AWS S3, Google Drive, Figshare, etc.), please open an issue describing your use case. We're interested in adding compatibility for additional storage backends!

For developers: The storage interface is in src/adata_hf_datasets/file_utils.py. Contributions for new backends are welcome!

Pipeline Steps

The pipeline consists of six steps, each with detailed documentation:

1. Download (Optional)

Downloads and optionally subsets raw data from a URL.

Documentation: scripts/download/README.md

Key Features:

Download from URLs or file paths
Stratified subsetting with preserved proportions
Validation of downloaded files

Configuration:

download:
  enabled: true
  subset_size: 10000
  stratify_keys: ["cell_type", "tissue"]
  preserve_proportions: true
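
Proportion-preserving subsetting can be sketched in plain Python: each stratum receives a share of the subset proportional to its size, with rounding handled by the largest-remainder rule. This is an illustration of the technique, not the download script's code; `stratified_subset` is a hypothetical name. With several stratify keys, the label of a cell could be a tuple such as (cell_type, tissue).

```python
from __future__ import annotations

import random
from collections import Counter
from typing import Hashable, Sequence


def stratified_subset(
    labels: Sequence[Hashable], subset_size: int, seed: int = 0
) -> list[int]:
    """Return sorted indices of a subset whose label proportions mirror the input."""
    total = len(labels)
    if subset_size >= total:
        return list(range(total))
    counts = Counter(labels)
    # Ideal fractional quota per stratum, rounded down first.
    quotas = {label: count * subset_size / total for label, count in counts.items()}
    allocation = {label: int(quota) for label, quota in quotas.items()}
    # Hand the remaining slots to the strata with the largest remainders.
    leftover = subset_size - sum(allocation.values())
    by_remainder = sorted(
        quotas, key=lambda label: quotas[label] - allocation[label], reverse=True
    )
    for label in by_remainder[:leftover]:
        allocation[label] += 1
    # Group indices by label and sample the allotted number from each group.
    indices_by_label: dict[Hashable, list[int]] = {}
    for index, label in enumerate(labels):
        indices_by_label.setdefault(label, []).append(index)
    rng = random.Random(seed)
    chosen: list[int] = []
    for label, indices in indices_by_label.items():
        chosen.extend(rng.sample(indices, allocation[label]))
    return sorted(chosen)
```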

2. Preprocessing

Cleans, filters, and normalizes raw count data.

Documentation: scripts/preprocessing/README.md

Key Features:

Quality control with MAD-based outlier detection
Gene/cell filtering
Normalization and log-transformation
Highly variable gene selection
Optional train/val split
SRA metadata enrichment

Configuration:

preprocessing:
  enabled: true
  min_cells: 20
  min_genes: 200
  n_top_genes: 5000
  chunk_size: 200000
  split_dataset: true
  train_split: 0.9
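
MAD-based outlier detection flags cells whose QC metric lies more than a chosen number of median absolute deviations from the median. The sketch below shows the rule on a plain list of values; the function name and the default of 5 MADs are assumptions for illustration, not the pipeline's settings.

```python
from __future__ import annotations

from statistics import median
from typing import Sequence


def mad_outliers(values: Sequence[float], n_mads: float = 5.0) -> list[bool]:
    """Flag values further than n_mads median absolute deviations from the median."""
    center = median(values)
    # MAD: the median of the absolute deviations from the median.
    mad = median(abs(value - center) for value in values)
    # Note: if mad is 0, every value that differs from the median is flagged.
    return [abs(value - center) > n_mads * mad for value in values]


# Example: total counts per cell, with one extreme cell.
total_counts = [10, 11, 9, 10, 12, 10, 100]
print(mad_outliers(total_counts))
```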

3. Embedding Preparation (CPU)

Performs CPU-intensive preparation for GPU embedding methods (e.g., Geneformer tokenization).

Documentation: scripts/embed/README.md

Key Features:

Separates CPU-intensive prep from GPU computation
Tokenization for Geneformer
Cached preparation results

Configuration:

embedding_preparation:
  enabled: true
  methods: ["geneformer"] # Methods that need preparation

4. CPU Embedding

Generates embeddings using CPU-based methods.

Documentation: scripts/embed/README.md

Key Features:

PCA: Linear dimensionality reduction
scVI: Deep learning foundation model
Memory-efficient streaming to disk

Configuration:

embedding_cpu:
  enabled: true
  methods: ["pca", "scvi_fm", "gs10k"]
  embedding_dim_map:
    pca: 50
    scvi_fm: 50
    gs10k: 10000

5. GPU Embedding

Generates embeddings using GPU-based methods.

Documentation: scripts/embed/README.md

Key Features:

Geneformer: Transformer-based embeddings (requires a CUDA device)
Automatic retry on GPU errors
Uses preparation results from step 3

Configuration:

embedding_gpu:
  enabled: true
  methods: ["geneformer"]
  embedding_dim_map:
    geneformer: 768
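
Automatic retry on GPU errors generally follows a simple pattern: run the step, and on a failure of the expected type wait briefly and try again up to a limit. The sketch below is generic and not the pipeline's retry code; `run_with_retry` is a hypothetical helper, and `RuntimeError` is used as a stand-in for whatever error type the GPU step raises.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
    retry_on: tuple[type[BaseException], ...] = (RuntimeError,),
) -> T:
    """Call fn, retrying on the given error types up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```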

6. Dataset Creation

Creates HuggingFace datasets with contrastive learning pairs/multiplets.

Documentation: scripts/dataset_creation/README.md

Key Features:

Multiple dataset formats (multiplets, pairs, single)
Cell sentence generation
Intelligent negative sampling
HuggingFace Hub publication
Optional Nextcloud or Zenodo integration

Configuration:

dataset_creation:
  enabled: true
  dataset_format: "multiplets"
  sentence_keys: ["sample_id_og"]
  negatives_per_sample: 2
  required_obsm_keys: ["X_pca", "X_scvi_fm", "X_geneformer"]
  push_to_hub: true
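
To give an intuition for `negatives_per_sample`, the sketch below draws, for each sample, indices of other samples whose caption differs. It is a deliberately naive version (quadratic in the number of samples) and does not reproduce the pipeline's own negative sampling strategy; `sample_negatives` is a hypothetical name.

```python
from __future__ import annotations

import random
from typing import Sequence


def sample_negatives(
    captions: Sequence[str], negatives_per_sample: int = 2, seed: int = 0
) -> list[list[int]]:
    """For each sample, pick indices of samples with a different caption."""
    rng = random.Random(seed)
    negatives: list[list[int]] = []
    for caption in captions:
        # Candidates are all samples that do not share this caption.
        candidates = [j for j, other in enumerate(captions) if other != caption]
        count = min(negatives_per_sample, len(candidates))
        negatives.append(rng.sample(candidates, count))
    return negatives


captions = ["T cell", "B cell", "T cell", "NK cell"]
print(sample_negatives(captions))
```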

Advanced Usage

Running Individual Steps

While the workflow orchestrator runs all enabled steps automatically, you can run individual steps manually:

# Activate environment
source .venv/bin/activate

# Run preprocessing only
python scripts/preprocessing/preprocess.py --config-name my_dataset

# Run CPU embedding only
python scripts/embed/embed_core.py \
    --config my_dataset \
    ++embedding_config_section=embedding_cpu

# Run dataset creation only
python scripts/dataset_creation/create_dataset.py --config-name my_dataset

See individual step documentation for more details.

Configuration Overrides

Override any configuration parameter via command line:

python scripts/workflow/submit_workflow.py \
    --config my_dataset \
    ++preprocessing.chunk_size=50000 \
    ++embedding_cpu.methods='["pca"]' \
    ++dataset_creation.push_to_hub=false
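
The effect of such dotted overrides on the nested configuration can be illustrated with a small sketch. The real parsing is done by the configuration framework; the helper below only mimics the semantics, uses JSON to interpret values as a simplification, and its names are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any


def parse_value(raw: str) -> Any:
    """Interpret numbers, booleans, and lists; fall back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_override(config: dict, override: str) -> None:
    """Apply one '++a.b=value' style override to a nested dict in place."""
    key_path, separator, raw_value = override.lstrip("+").partition("=")
    if not separator:
        raise ValueError(f"Override must look like key=value, got {override!r}")
    *parents, leaf = key_path.split(".")
    node = config
    for part in parents:
        # Create intermediate sections if they do not exist yet.
        node = node.setdefault(part, {})
    node[leaf] = parse_value(raw_value)


config = {"preprocessing": {"chunk_size": 200000}}
for item in [
    "++preprocessing.chunk_size=50000",
    '++embedding_cpu.methods=["pca"]',
    "++dataset_creation.push_to_hub=false",
]:
    apply_override(config, item)
print(config)
```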

Monitoring Progress

Local execution:

# Check logs in real-time
tail -f outputs/{date}/workflow_{timestamp}/logs/workflow_master.out

# View step-specific logs
tail -f outputs/{date}/workflow_{timestamp}/preprocessing/job_*/preprocessing.out

SLURM execution:

# Check job queue
ssh cpu_cluster "squeue -u username"

# View master job logs
ssh cpu_cluster "cat /home/username/outputs/{date}/workflow_{job_id}/logs/workflow_master.out"

# View step-specific logs
ssh cpu_cluster "cat /home/username/outputs/{date}/workflow_{job_id}/preprocessing/job_*/preprocessing.out"

Skipping Steps

To skip steps (e.g., if already completed):

# In your dataset config
preprocessing:
  enabled: false # Skip preprocessing

embedding_preparation:
  enabled: false # Skip embedding preparation

Or via command line:

python scripts/workflow/submit_workflow.py \
    --config my_dataset \
    ++preprocessing.enabled=false \
    ++embedding_preparation.enabled=false

Adding a New Embedding Method

This section explains how to add a custom embedding method to the pipeline by creating a new embedder class.

Overview

All embedders inherit from the BaseEmbedder class and implement three core methods:

__init__: Initialize the embedder with configuration parameters
prepare: Prepare the embedder (e.g., load models, tokenize data)
embed: Generate embeddings from the data

Step-by-Step Guide

1. Create Your Embedder Class

Create a new class that inherits from BaseEmbedder in src/adata_hf_datasets/embed/initial_embedder.py:

from adata_hf_datasets.embed.initial_embedder import BaseEmbedder, _check_load_adata
from importlib.util import find_spec
import numpy as np
import anndata as ad
import logging

logger = logging.getLogger(__name__)


class MyCustomEmbedder(BaseEmbedder):
    """
    Custom embedder that generates embeddings using MyCustomMethod.
    """

    def __init__(self, embedding_dim: int = 64, **kwargs):
        """
        Initialize the custom embedder.

        Parameters
        ----------
        embedding_dim : int
            Dimensionality of the output embedding.
        **kwargs
            Additional keyword arguments for the embedder.
        """
        # Check for required packages
        if find_spec("my_custom_package") is None:
            raise ImportError(
                "my_custom_package is required to use MyCustomEmbedder. "
                "Please install it with: pip install my-custom-package"
            )

        super().__init__(embedding_dim=embedding_dim)
        self.model = None
        self.init_kwargs = kwargs

    def prepare(
        self,
        adata: ad.AnnData | None = None,
        adata_path: str | None = None,
        **kwargs,
    ) -> None:
        """
        Prepare the embedder (e.g., load model, preprocess data).

        Parameters
        ----------
        adata : anndata.AnnData, optional
            Single-cell dataset in memory.
        adata_path : str, optional
            Path to the AnnData file (.h5ad or .zarr).
        **kwargs
            Additional keyword arguments for preparation.
        """
        # Use helper function to load adata if path is provided
        adata = _check_load_adata(adata, adata_path)

        logger.info("Preparing MyCustomEmbedder...")
        # Your preparation logic here
        # For example: load a pre-trained model, tokenize data, etc.
        # `load_my_custom_model` is a placeholder for your own loading function
        self.model = load_my_custom_model(**self.init_kwargs)

    def embed(
        self,
        adata: ad.AnnData | None = None,
        adata_path: str | None = None,
        obsm_key: str = "X_my_custom",
        **kwargs,
    ) -> np.ndarray:
        """
        Generate embeddings from the data.

        Parameters
        ----------
        adata : anndata.AnnData, optional
            Single-cell dataset in memory.
        adata_path : str, optional
            Path to the AnnData file (.h5ad or .zarr).
        obsm_key : str
            Key in `adata.obsm` to store the embedding.
        **kwargs
            Additional keyword arguments for embedding.

        Returns
        -------
        np.ndarray
            Embedding matrix of shape (n_cells, embedding_dim).
            Must be in the same order as adata.obs.index.
        """
        # Load adata if path is provided
        adata = _check_load_adata(adata, adata_path)

        logger.info("Generating embeddings with MyCustomEmbedder...")

        # Generate embeddings
        # IMPORTANT: Ensure output order matches adata.obs.index
        embedding_matrix = self.model.embed(adata.X)

        # Ensure correct shape and dtype
        embedding_matrix = embedding_matrix.astype(np.float32)

        # Store in adata.obsm if adata object was provided
        if adata is not None:
            adata.obsm[obsm_key] = embedding_matrix
            logger.info(
                f"Stored embeddings in adata.obsm['{obsm_key}'], shape: {embedding_matrix.shape}"
            )

        return embedding_matrix

2. Key Requirements

Handle Both Input Types:

Always use the _check_load_adata() helper to handle both in-memory AnnData objects and file paths
The helper automatically handles .h5ad and .zarr formats

Package Checking:

Use find_spec() from importlib.util to check if required packages are installed
Provide helpful error messages with installation instructions if packages are missing
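
The package check described above can be sketched with the standard library alone. The helper names `is_installed` and `require_package`, and the wording of the error message, are illustrative assumptions and not part of the library:

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    # find_spec returns None when a top-level module cannot be found.
    return find_spec(module_name) is not None


def require_package(module_name: str, install_hint: str) -> None:
    """Raise ImportError with an install hint if `module_name` is missing.

    Hypothetical helper for illustration; not part of adata_hf_datasets.
    """
    if not is_installed(module_name):
        raise ImportError(
            f"'{module_name}' is required for this embedder but is not installed. "
            f"Install it with: {install_hint}"
        )


# Example: call this at the top of prepare(), before importing the heavy dependency.
require_package("json", "pip install <package>")  # stdlib module, so this passes
```

Checking with find_spec() avoids importing a large dependency just to discover that it is missing, and lets the embedder fail early with an actionable message.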

Output Requirements:

Return a np.ndarray of shape (n_cells, embedding_dim)
Critical: The embedding matrix must be in the same order as adata.obs.index
Use np.float32 dtype for consistency
If an adata object is provided, store embeddings in adata.obsm[obsm_key]
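
If a model returns rows in its own order (for example after batching or sorting by sequence length), the ordering requirement can be met by computing a permutation back to adata.obs.index. The following is a minimal sketch; `alignment_indices` is a hypothetical helper, and it assumes the model reports which cell ID each output row belongs to:

```python
from collections.abc import Sequence


def alignment_indices(
    produced_ids: Sequence[str], target_ids: Sequence[str]
) -> list[int]:
    """Return row indices that reorder model output to match `target_ids`.

    `produced_ids` is the cell order the model emitted; `target_ids` would be
    list(adata.obs.index). Hypothetical helper for illustration.
    """
    if len(set(produced_ids)) != len(produced_ids):
        raise ValueError("produced_ids contains duplicate cell IDs")
    # Map each cell ID to the row it occupies in the model output.
    position = {cell_id: i for i, cell_id in enumerate(produced_ids)}
    missing = [cell_id for cell_id in target_ids if cell_id not in position]
    if missing:
        raise ValueError(
            f"{len(missing)} cells have no embedding, e.g. {missing[0]!r}"
        )
    return [position[cell_id] for cell_id in target_ids]


# The model returned rows in a different order than adata.obs.index.
idx = alignment_indices(["c2", "c0", "c1"], ["c0", "c1", "c2"])
# With a NumPy array: embedding_matrix = embedding_matrix[idx].astype(np.float32)
```

Raising on duplicate or missing IDs keeps a silent misalignment from reaching adata.obsm, where it would be hard to detect later.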

Memory Efficiency:

When using file paths, you can read only necessary data from .h5ad or .zarr stores
For .zarr stores, you can write embeddings directly to adata.obsm without loading the entire object
See GeneformerEmbedder for an example of efficient file-based operations

3. Register Your Embedder

Add your embedder class to the embedder_classes dictionary in the InitialEmbedder class:

# In InitialEmbedder.__init__ method
embedder_classes = {
    "scvi_fm": SCVIEmbedderFM,
    "geneformer": GeneformerEmbedder,
    "geneformer-v1": GeneformerV1Embedder,
    "pca": PCAEmbedder,
    "hvg": HighlyVariableGenesEmbedder,
    "gs": GeneSelectEmbedder,
    "gs10k": GeneSelectEmbedder10k,
    "my_custom": MyCustomEmbedder,  # Add your embedder here
}

4. Add to Configuration (For Pipeline Usage)

If you want to use your embedder in the pipeline, add it to conf/dataset_default.yaml:

embedding_cpu: # or embedding_gpu, depending on your method
  embedding_dim_map:
    scvi_fm: 50
    geneformer: 768
    pca: 50
    hvg: 512
    gs: 3936
    gs10k: 10000
    geneformer-v1: 512
    my_custom: 64 # Add your embedding dimension here

Add the same entry to both embedding_cpu and embedding_gpu sections if applicable.

5. Usage

Once registered, you can use your embedder:

from adata_hf_datasets import InitialEmbedder

# Initialize embedder
embedder = InitialEmbedder(
    method="my_custom",
    embedding_dim=64,
    # Your custom init_kwargs here
)

# Prepare (if needed)
embedder.prepare(adata_path="path/to/data.h5ad")

# Generate embeddings
embeddings = embedder.embed(
    adata_path="path/to/data.h5ad",
    obsm_key="X_my_custom"
)

Or use it in the pipeline by adding "my_custom" to the methods list in your dataset config:

embedding_cpu:
  enabled: true
  methods: ["pca", "my_custom"]
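
Before launching a run, it can help to confirm that every method listed under methods also has an entry in embedding_dim_map. The sketch below operates on a plain dictionary that mirrors the YAML above; `missing_dimensions` is a hypothetical helper, and it assumes both keys end up in the same section once the dataset config is merged with conf/dataset_default.yaml:

```python
def missing_dimensions(
    methods: list[str], embedding_dim_map: dict[str, int]
) -> list[str]:
    """Return the methods that have no entry in `embedding_dim_map`."""
    return [method for method in methods if method not in embedding_dim_map]


# Dictionary mirroring the parsed YAML (assumed layout after config merging).
config = {
    "embedding_cpu": {
        "enabled": True,
        "methods": ["pca", "my_custom"],
        "embedding_dim_map": {"pca": 50, "my_custom": 64},
    }
}

section = config["embedding_cpu"]
unknown = missing_dimensions(section["methods"], section["embedding_dim_map"])
if unknown:
    raise ValueError(f"No embedding dimension configured for: {unknown}")
```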

Best Practices

Error Handling: Provide clear error messages when required data/attributes are missing
Logging: Use the logger to inform users about what's happening
Documentation: Add comprehensive docstrings explaining parameters and behavior
Testing: Test with both .h5ad and .zarr formats, and with both in-memory and file-based inputs
Order Preservation: Always ensure embeddings match the order of adata.obs.index

Documentation

Detailed documentation for each component:

Pipeline Steps

Download - Data acquisition and subsetting
Preprocessing - QC, filtering, normalization
Embedding - PCA, scVI, HVG, Geneformer embeddings
Dataset Creation - HuggingFace dataset generation

Configuration

Dataset Configuration Template - All available parameters
Dataset Configuration Example - Working example
Workflow Orchestrator Config - Workflow settings

Source Code

src/adata_hf_datasets/ - Core library code
scripts/ - Executable scripts for each step

Troubleshooting

Installation Issues

Problem: uv sync fails with missing dependencies

Solution:

# Try with verbose output
uv sync --all-extras -v

Problem: Python version error: "requires a different Python: X.X.X not in '<3.14,>=3.10'"

Solution: The package requires Python 3.10, 3.11, 3.12, or 3.13. If you see this error, you're likely using Python 3.14+ or an older version (3.9 or earlier). Check your Python version:

python --version

If you need to use a different Python version, consider using pyenv or a virtual environment with the correct version.
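
The version constraint from the error message ('<3.14,>=3.10') can also be checked from within Python. This is a small sketch; the function name `is_supported` is illustrative:

```python
import sys

MIN_VERSION = (3, 10)    # inclusive lower bound, from '>=3.10'
MAX_EXCLUSIVE = (3, 14)  # exclusive upper bound, from '<3.14'


def is_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) satisfies '>=3.10,<3.14'."""
    return MIN_VERSION <= tuple(version) < MAX_EXCLUSIVE


current = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {current} supported: {is_supported()}")
```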

Problem: Pip installation is very slow or fails

Solution: Pip installation can be slow due to dependency resolution. Consider:

Using uv instead (much faster): uv pip install .
Using pre-built wheels: pip install --only-binary :all: . (may not work for local packages)
Installing in a clean virtual environment

Problem: Git submodule issues/maintenance

Solutions:

# Initialize submodules (only needed for Geneformer)
git submodule update --init --recursive

# Clone without submodules (git clone does not fetch them unless --recurse-submodules is passed)
git clone https://github.com/mengerj/adata_hf_datasets.git

# Deinitialize submodules if they were already initialized
git submodule deinit --all -f

Note: Submodules are optional and only needed for Geneformer support. You can install and use the package without them.

Problem: Geneformer or scVI embedder fails with ImportError

Solution: These are optional dependencies. Install them separately:

For scVI: pip install scvi-tools
For Geneformer:
1. Initialize submodules: git submodule update --init --recursive
2. Install from local path: pip install external/Geneformer

Configuration Issues

Problem: "Config file not found"

Solution:

# Ensure you're in the project root
cd /path/to/adata_hf_datasets

# Check config file exists
ls conf/my_dataset.yaml

# Use config name without .yaml extension
python scripts/workflow/submit_workflow_local.py --config my_dataset

Problem: "base_file_path is not set"

Solution: Ensure workflow_orchestrator.yaml has the correct path:

workflow:
  local_base_file_path: "/absolute/path/to/data/RNA" # For local
  slurm_base_file_path: "/absolute/path/to/data/RNA" # For SLURM

Execution Issues

Problem: SSH timeout when submitting to SLURM

Solution:

# Test SSH connection
ssh cpu_cluster "hostname"

# Check SSH config
cat ~/.ssh/config

# Ensure SSH agent is running
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

Problem: SLURM job fails immediately

Solution:

# Check SLURM logs on cluster
ssh cpu_cluster "cat /home/username/outputs/{date}/workflow_{job_id}/logs/workflow_master.err"

# Verify partition names
ssh cpu_cluster "sinfo"

# Check virtual environment exists on cluster
ssh cpu_cluster "ls /home/username/adata_hf_datasets/.venv/bin/python"

Problem: "File not found" errors during workflow

Solution:

Verify base_file_path is accessible and has correct permissions
For SLURM: Ensure base_file_path is accessible from both CPU and GPU clusters
Check that previous steps completed successfully
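
The first check in the list above can be scripted. The sketch below reports common problems with a base path; `check_base_path` is a hypothetical helper, and the path in the example is the placeholder from the configuration snippet, not a real location:

```python
import os
from pathlib import Path


def check_base_path(base_file_path: str) -> list[str]:
    """Return human-readable problems; an empty list means the path looks usable."""
    path = Path(base_file_path)
    problems = []
    if not path.is_absolute():
        problems.append(f"{path} is not an absolute path")
    if not path.is_dir():
        problems.append(f"{path} does not exist or is not a directory")
        return problems  # permission checks are meaningless without a directory
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable by the current user")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


for problem in check_base_path("/absolute/path/to/data/RNA"):
    print(problem)
```

On a SLURM setup, running the same check on each cluster (for example through ssh) shows whether the path is visible from both the CPU and GPU nodes.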

Dataset Loading Issues

Problem: ValueError: Couldn't infer the same data file format for all splits

Solution: Use load_from_disk() instead of load_dataset() for locally saved datasets:

# ✅ Correct - for locally saved datasets
from datasets import load_from_disk
ds = load_from_disk("/path/to/dataset")

# ❌ Wrong - this is for HuggingFace Hub
from datasets import load_dataset
ds = load_dataset("/path/to/dataset")  # Will fail!

load_dataset() loads datasets from the HuggingFace Hub or builds them from raw data files (CSV, JSON, Parquet, and similar). It does not read the directory layout written by save_to_disk(); for those datasets, always use load_from_disk().
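
When it is unclear how a local directory was produced, a quick look at its contents can tell the two cases apart. The sketch below relies on marker files that save_to_disk() writes in current versions of the datasets library (dataset_dict.json for a DatasetDict, state.json for a single Dataset); treat these file names as an assumption that may change between releases. `looks_like_saved_dataset` is a hypothetical helper:

```python
from pathlib import Path


def looks_like_saved_dataset(path: str) -> bool:
    """Heuristically detect a directory written by save_to_disk().

    Assumes the marker files dataset_dict.json (DatasetDict) or
    state.json (single Dataset) are present at the top level.
    """
    root = Path(path)
    return (root / "dataset_dict.json").is_file() or (root / "state.json").is_file()


# Usage (requires the `datasets` package):
#   from datasets import load_from_disk
#   if looks_like_saved_dataset("/path/to/dataset"):
#       ds = load_from_disk("/path/to/dataset")
```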

Getting Help

For more help:

Check the step-specific README files linked above
Review log files in outputs/{date}/workflow_{timestamp}/
Check the .err files for error messages
Review the dataset config for any misconfigurations
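
Scanning a run directory for non-empty .err files is a quick way to find the step that failed. The following sketch uses only the standard library; the helper names are illustrative, and the "outputs" directory in the example follows the log location mentioned above:

```python
from pathlib import Path


def nonempty_err_files(run_dir: str) -> list[Path]:
    """Return all non-empty .err files below `run_dir`, sorted by path."""
    root = Path(run_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.err") if p.stat().st_size > 0)


def tail(path: Path, n: int = 20) -> str:
    """Return the last `n` lines of a text file."""
    lines = Path(path).read_text(errors="replace").splitlines()
    return "\n".join(lines[-n:])


# Print the end of every non-empty error log below outputs/.
for err_file in nonempty_err_files("outputs"):
    print(f"--- {err_file} ---")
    print(tail(err_file))
```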

Citation

If you use this pipeline in your research, please cite:

@software{adata_hf_datasets,
  title = {AnnData HuggingFace Datasets Pipeline},
  author = {Jonatan Menger},
  year = {2025},
  url = {https://github.com/mengerj/adata_hf_datasets}
}

License

See the LICENCE file in the repository root.

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

Acknowledgments

This pipeline builds on:

AnnData for handling data matrices
Scanpy for single-cell analysis
scVI for probabilistic models
Geneformer for transformer embeddings
Hugging Face Datasets for dataset management
