A comprehensive pipeline for processing single-cell RNA-seq data and creating HuggingFace datasets for machine learning. The pipeline handles everything from raw data download through preprocessing, embedding generation, to final dataset creation and publication.
This pipeline transforms raw single-cell RNA-seq data into ready-to-use HuggingFace datasets through a series of automated steps:
1. Download → 2. Preprocessing → 3. Embedding Prep → 4. CPU Embedding → 5. GPU Embedding → 6. Dataset Creation
(optional) ↓
HuggingFace Hub Publication
Key Features:
- 🔄 Automated workflow orchestration with SLURM or local execution
- 🧬 Memory-efficient processing of large-scale single-cell datasets
- 🎯 Multiple embedding methods (PCA, scVI, HVG, Geneformer)
- 📊 Quality control with automatic plots and metrics
- 🤗 HuggingFace integration with rich dataset cards
- 🔧 Highly configurable with dataset-centric YAML configs
- Installation
- Test Run
- HuggingFace Hub Integration
- Configuration
- Quick Start
- Nextcloud Integration
- Zenodo Integration
- Pipeline Steps
- Advanced Usage
- Adding a New Embedding Method
- Documentation
- Troubleshooting
- Python 3.10-3.13 (see Python version requirements)
- Git (submodules only needed if installing Geneformer support or running the full pipeline)
If you only want to use the package as a library in another project (without running the full pipeline), you can install it directly:
pip install git+https://github.com/mengerj/adata_hf_datasets.git
Note: The pipeline scripts and configuration files are not included in the pip package. If you need to run the full pipeline workflows, see Option 2: Clone Repository below.
Example: Using as a library in your project
from adata_hf_datasets import InitialEmbedder, AnnDataSetConstructor
from adata_hf_datasets.pp import preprocess_adata
import anndata as ad
# Use the embedders and preprocessing functions
embedder = InitialEmbedder(method="gs10k")
embeddings = embedder.embed(adata=your_adata)
# Or use preprocessing
processed_adata = preprocess_adata(your_raw_adata)
# Or create a hf_dataset from an adata object
constr = AnnDataSetConstructor(dataset_format="multiplets")
# See the docs of the method itself for details
constr.add_anndata(
    adata=your_adata,
    caption_key="your_caption_key_in_adata.obs",
    sentence_keys=["sample_idx"],
    adata_link="local_path or remote_share_link to h5ad or zarr file of this adata object",
)
# You can add multiple AnnData objects before creating the dataset
ds = constr.get_dataset()
Note: Some embedding methods (scVI, Geneformer) require additional packages. If you try to use them, you'll see helpful error messages with installation instructions.
For pipeline/workflow usage, you need to clone the repository. The pipeline scripts and configuration files are not included in the pip package.
- Clone the repository:
git clone https://github.com/mengerj/adata_hf_datasets.git
cd adata_hf_datasets
- Install the package:
You can use either uv (recommended, faster) or pip:
Option A: Using uv (Recommended)
uv is a fast Python package installer. Install it first if you don't have it:
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
Then install the package:
uv sync
source .venv/bin/activate
Option B: Using pip
# Create and activate a virtual environment with your method of choice, then install:
pip install -e .
- Optional: Install optional dependencies
Some embedding methods require additional packages that are not installed by default:
- scVI embeddings: Install scvi-tools if you want to use scVI embeddings:
pip install scvi-tools
- Geneformer embeddings: Requires cloning the Geneformer submodule and installing it:
git submodule update --init --recursive
pip install external/Geneformer
Note: Geneformer can only be installed on Linux machines with CUDA support. It also requires git-lfs to be installed on the machine.
If you try to use these methods without the required packages, you'll see helpful error messages with installation instructions.
The package requires Python 3.10, 3.11, 3.12, or 3.13. The current requirement is >=3.10,<3.14. Python 3.9 and earlier, or Python 3.14+ are not supported.
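To check an interpreter before installing, the `>=3.10,<3.14` constraint can be sketched as follows. The helper name `is_supported_python` is made up for illustration and is not part of the package.

```python
from __future__ import annotations

import sys

# Supported range taken from the requirement above: >=3.10,<3.14
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 14)


def is_supported_python(version: tuple[int, int] | None = None) -> bool:
    """Return True if (major, minor) lies within the supported range."""
    if version is None:
        version = (sys.version_info.major, sys.version_info.minor)
    return MIN_VERSION <= tuple(version) < MAX_EXCLUSIVE


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    status = "supported" if is_supported_python() else "NOT supported"
    print(f"Python {current} is {status}")
```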
The base installation includes:
- Core dependencies (anndata, scanpy, datasets, huggingface-hub)
- Embedding tools (PCA, HVG, gene selection)
- Workflow orchestration tools (Hydra)
- All required dependencies
Optional dependencies (install separately if needed):
- scvi-tools: Required for scVI embeddings (pip install scvi-tools)
- geneformer: Required for Geneformer embeddings (requires submodule initialization, see above)
Before attempting to add your own dataset, try running the workflow with the example data. This will download a .h5ad, preprocess it, run several embedders and create a HuggingFace dataset.
Note: This test run will NOT use HuggingFace Hub or Nextcloud (both are disabled in the example config for simplicity).
# Activate the virtual environment
source .venv/bin/activate
# Run workflow in foreground (recommended for first runs)
python scripts/workflow/submit_workflow.py \
--config dataset_config_example \
--foreground
# Or run in background (detached)
python scripts/workflow/submit_workflow.py \
--config dataset_config_example
Check the main log to see the progress:
# View workflow summary (replace date/run_id with your actual run)
tail -f outputs/2025-*/workflow_local_*/logs/workflow_summary.log
Specific logs for each step are in their respective subfolders:
- outputs/{date}/workflow_{run_id}/preprocessing/
- outputs/{date}/workflow_{run_id}/embedding/
- outputs/{date}/workflow_{run_id}/dataset_creation/
Once the workflow completes successfully, find the dataset location in the logs:
# Check the dataset creation output (it will show the final location)
cat outputs/*/workflow_local_*/dataset_creation/job_local_*/create_ds_0.out
Load and inspect the dataset:
from datasets import load_from_disk
# Replace with your actual path from the logs
dataset_path = "outputs/2025-*/workflow_local_*_*/dataset_creation/job_local_*/job_0/demo_dataset"
# Load the dataset
ds = load_from_disk(dataset_path)
# Inspect it
print(ds)
# DatasetDict({
#     train: Dataset({...})
#     validation: Dataset({...})
# })
# Check a sample
print(ds['train'][0])
Important: Use load_from_disk() to load locally saved datasets. The load_dataset() function is for loading from the HuggingFace Hub.
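The wildcard path in the example above is a placeholder: load_from_disk() expects one concrete directory. A minimal sketch for expanding such a pattern first, using only the standard library (the helper `resolve_single_path` is hypothetical, not part of this package):

```python
from glob import glob


def resolve_single_path(pattern: str) -> str:
    """Expand a glob pattern and return the single path it matches."""
    matches = sorted(glob(pattern))
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one match for {pattern!r}, found {len(matches)}: {matches}"
        )
    return matches[0]


# Usage (requires the `datasets` package and a finished workflow run):
# from datasets import load_from_disk
# pattern = "outputs/2025-*/workflow_local_*/dataset_creation/job_local_*/job_0/demo_dataset"
# ds = load_from_disk(resolve_single_path(pattern))
```

Raising on zero or multiple matches avoids silently loading the dataset of an older run.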
The pipeline can automatically publish datasets to the HuggingFace Hub for easy sharing and distribution.
- Create a HuggingFace account:
  - Visit https://huggingface.co/join
  - Create an account (free)
- Get your access token:
  - Go to https://huggingface.co/settings/tokens
  - Create a new token with write permissions
  - Copy the token
- Configure authentication:
Create a .env file in the project root:
# In the project root directory
cat > .env << 'EOF'
# HuggingFace Hub authentication
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx # Your token here
EOF
Or log in via CLI:
huggingface-cli login
# Paste your token when prompted
- Configure dataset publication:
In your dataset config (or conf/dataset_default.yaml):
dataset_creation:
  enabled: true
  push_to_hub: true # Enable HuggingFace Hub upload
When the workflow completes, your dataset will be published to:
https://huggingface.co/datasets/{your-username}/{dataset-name}
Example: https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_10k
The dataset format and caption key information are documented in the dataset card metadata.
The dataset will include:
- ✅ All splits (train/validation or test)
- ✅ Automatically generated README with dataset card
- ✅ Schema and feature descriptions
- ✅ Download statistics and usage examples
By default, datasets are uploaded as private. You can control this in your dataset configuration:
dataset_creation:
  push_to_hub: true
  private: true # true = private (default), false = public
To make a dataset public, set private: false in your config, or update the repository settings on HuggingFace after upload.
The pipeline uses two kinds of configuration files: dataset configurations and the workflow orchestrator configuration.
Dataset configurations define what data to process and how to process it. Each dataset has its own YAML file in the conf/ directory.
Example: conf/dataset_config_example.yaml
Template with all parameters: conf/dataset_default.yaml
- Dataset Metadata:
dataset:
  name: "human_pancreas"
  description: "Human pancreas dataset"
  download_url: "https://example.com/data.h5ad"
- Common Keys (used across all steps):
batch_key: "batch_id" # Batch/dataset identifier
annotation_key: "cell_type" # Cell type annotations
caption_key: "natural_language_annotation" # Natural language descriptions
- Step Configuration (enable/disable and configure each step):
download:
  enabled: false
  subset_size: 10000
preprocessing:
  enabled: true
  chunk_size: 10000
  split_dataset: true
embedding_preparation:
  enabled: true
  methods: ["geneformer"]
embedding_cpu:
  enabled: true
  methods: ["pca", "scvi_fm", "hvg"]
embedding_gpu:
  enabled: true
  methods: ["geneformer"]
dataset_creation:
  enabled: true
  dataset_format: "multiplets"
  negatives_per_sample: 2
All available parameters are documented in dataset_default.yaml.
The workflow orchestrator configuration defines where and how to run the pipeline.
Configuration file: conf/workflow_orchestrator.yaml
You can configure both local and SLURM paths simultaneously in the same config file. Simply change the execution_mode to switch between them - no need to manually edit paths or use different scripts!
Choose between local execution or SLURM cluster by setting execution_mode:
workflow:
  execution_mode: "local" # or "slurm"
The submission script automatically uses the appropriate paths based on this setting.
Here's a complete configuration with both local and SLURM settings:
workflow:
  # Switch between "local" and "slurm" to change execution mode
  execution_mode: "local"
  # Local execution settings
  local_output_directory: "../outputs" # Output directory for local runs
  local_project_directory: "." # Project directory for local runs (relative to script location)
  local_base_file_path: "./data/RNA" # Base data directory for local runs
  local_max_workers: 2 # Number of parallel workers for local execution
  local_enable_gpu: false # Enable GPU embedding locally (requires CUDA)
  # SLURM execution settings
  cpu_login:
    host: "cpu_cluster" # SSH host (must be in ~/.ssh/config)
    user: "username"
  gpu_login:
    host: "gpu_cluster" # SSH host (must be in ~/.ssh/config)
    user: "username"
  cpu_partition: "slurm" # CPU partition name (check with `sinfo`)
  gpu_partition: "gpu" # GPU partition name (check with `sinfo`)
  slurm_output_directory: "/home/username/outputs" # Output directory on cluster
  slurm_project_directory: "/home/username/adata_hf_datasets" # Project directory on cluster
  slurm_base_file_path: "/scratch/global/username/data/RNA" # Base data directory (must be accessible by both clusters!)
  # Shared settings
  venv_path: ".venv" # Virtual environment path (relative to project_directory)
  enable_transfers: false # Use shared filesystem (recommended)
SSH Configuration:
- The SLURM mode requires passwordless SSH access to the clusters
- Set up SSH keys so that ssh cpu_cluster and ssh gpu_cluster work without password prompts
- Configure hosts in ~/.ssh/config if needed (including ProxyJump if required)
- Keep the local and cluster copies of the repository in sync, for example after changing a configuration file
Shared Storage:
- slurm_base_file_path must be accessible by both CPU and GPU clusters
- Typically a global scratch filesystem (e.g., /scratch/global/username/)
- Data is written by one cluster and read by another during the workflow
- Local execution is usually much faster, because I/O on a global filesystem tends to be slow and the pipeline reads data into memory at several steps. If you don't need a GPU, or have one locally, working locally is recommended. If you don't mind waiting and want to submit many datasets at once, the cluster is better suited.
Cluster-Specific Settings:
- cpu_partition and gpu_partition names are cluster-specific
- Check your cluster's SLURM configuration for the correct partition names
- Use sinfo on your cluster to list available partitions
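The effect of the execution_mode switch can be pictured as a prefix lookup on the path keys from the configuration above. This is only an illustrative sketch; `select_paths` is a hypothetical helper, and the submission script's actual logic may differ.

```python
def select_paths(workflow_cfg: dict) -> dict:
    """Pick the local_* or slurm_* path settings based on execution_mode."""
    mode = workflow_cfg["execution_mode"]
    if mode not in ("local", "slurm"):
        raise ValueError(f"execution_mode must be 'local' or 'slurm', got {mode!r}")
    keys = ("output_directory", "project_directory", "base_file_path")
    # e.g. "local" + "output_directory" -> "local_output_directory"
    return {key: workflow_cfg[f"{mode}_{key}"] for key in keys}


workflow_cfg = {
    "execution_mode": "local",
    "local_output_directory": "../outputs",
    "local_project_directory": ".",
    "local_base_file_path": "./data/RNA",
    "slurm_output_directory": "/home/username/outputs",
    "slurm_project_directory": "/home/username/adata_hf_datasets",
    "slurm_base_file_path": "/scratch/global/username/data/RNA",
}

print(select_paths(workflow_cfg))
```

Because both sets of keys live in one file, switching modes only changes which prefix is read.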
For running the complete pipeline on your local machine:
- Configure for local execution:
Edit conf/workflow_orchestrator.yaml and set execution_mode: "local":
workflow:
  execution_mode: "local" # Switch to local mode
  # Local paths (already configured above)
  local_output_directory: "../outputs"
  local_project_directory: "."
  local_base_file_path: "./data/RNA"
  local_max_workers: 2
  local_enable_gpu: false # Set to true if you have a CUDA-capable GPU
- Configure your dataset: Take a close look at the example dataset config and the default config.
Edit or create a dataset config in conf/, for example conf/my_dataset.yaml:
defaults:
  - dataset_default.yaml
  - _self_
dataset:
  name: "my_dataset"
  description: "My single-cell dataset"
  download_url: "https://..."
  full_name: "my_dataset_full" # Required if subsetting the dataset
# Enable/disable steps as needed
preprocessing:
  enabled: true
embedding_cpu:
  enabled: true
# ... etc
- Run the workflow:
# Activate virtual environment
source .venv/bin/activate
# Run workflow in foreground (recommended for first runs)
# You can use either a config name or a path:
python scripts/workflow/submit_workflow.py \
--config my_dataset \
--foreground
# Or use a relative path:
python scripts/workflow/submit_workflow.py \
--config conf/my_dataset.yaml \
--foreground
# Or use an absolute path:
python scripts/workflow/submit_workflow.py \
--config /absolute/path/to/my_dataset.yaml \
--foreground
Note: The --config argument accepts either:
- A config name (e.g., my_dataset), which looks for conf/my_dataset.yaml
- A relative path (e.g., conf/my_dataset.yaml), relative to the project root
- An absolute path (e.g., /path/to/config.yaml), the full file path
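The three accepted forms of --config could be resolved roughly as follows. This is a sketch of the rules listed above, not the script's actual code; `resolve_config_arg` and its heuristics (file suffix, number of path parts) are assumptions.

```python
from pathlib import Path


def resolve_config_arg(value: str, project_root: Path) -> Path:
    """Map a --config value to a YAML file path."""
    candidate = Path(value)
    if candidate.is_absolute():
        # Absolute path: use as given
        return candidate
    if candidate.suffix in (".yaml", ".yml") or len(candidate.parts) > 1:
        # Relative path: interpret relative to the project root
        return project_root / candidate
    # Bare config name: look in conf/
    return project_root / "conf" / f"{value}.yaml"


root = Path("/home/username/adata_hf_datasets")
for arg in ("my_dataset", "conf/my_dataset.yaml", "/absolute/path/to/my_dataset.yaml"):
    print(arg, "->", resolve_config_arg(arg, root))
```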
What happens:
- The workflow runs each step sequentially
- Steps are executed based on the enabled flags in your dataset config
- Logs are written to {local_output_directory}/{date}/workflow_{timestamp}/
- Data files are written to the local_base_file_path directory
For running on SLURM clusters with SSH orchestration:
- Set up SSH keys:
# Generate SSH key if you don't have one
ssh-keygen -t ed25519
# Copy to clusters
ssh-copy-id username@cpu_cluster
ssh-copy-id username@gpu_cluster
# Test passwordless access
ssh cpu_cluster "hostname"
ssh gpu_cluster "hostname"
Note: Depending on your cluster, you might need to set up a ProxyJump. Edit the ~/.ssh/config file on your machine.
- Configure for SLURM:
Edit conf/workflow_orchestrator.yaml and set execution_mode: "slurm":
workflow:
  execution_mode: "slurm" # Switch to SLURM mode
  # SLURM paths (already configured above)
  cpu_login:
    host: "cpu_cluster" # Your CPU cluster SSH alias
    user: "username"
  gpu_login:
    host: "gpu_cluster" # Your GPU cluster SSH alias
    user: "username"
  cpu_partition: "slurm" # Check with `sinfo` on your cluster
  gpu_partition: "gpu" # Check with `sinfo` on your cluster
  slurm_output_directory: "/home/username/outputs"
  slurm_project_directory: "/home/username/adata_hf_datasets"
  slurm_base_file_path: "/scratch/global/username/data/RNA" # Must be accessible by both clusters!
Note: Before attempting to run on SLURM, make sure that the repository is installed on the cluster. Follow the same installation steps as on your local machine. uv can be installed without sudo rights.
- Ensure the repository is synced on the cluster:
# On your local machine, push to git
git push
# SSH to the cluster and pull
ssh cpu_cluster
cd /home/username/adata_hf_datasets
git pull
git submodule update --init --recursive
uv sync --all-extras
exit
- Submit the workflow:
# From your local machine (same script as local mode!)
# You can use either a config name or a path:
python scripts/workflow/submit_workflow.py \
--config my_dataset
# Or use a path:
python scripts/workflow/submit_workflow.py \
--config conf/my_dataset.yaml
Note: The --config argument accepts either a config name or a file path (relative or absolute), just like in local mode.
What happens:
- A master SLURM job is submitted to the CPU cluster
- The master job orchestrates all subsequent steps
- Steps run on appropriate clusters (CPU vs GPU)
- Job dependencies are automatically managed by SLURM
- You can monitor progress with ssh cpu_cluster "squeue -u username"
Output location:
- Logs: {slurm_output_directory}/{date}/workflow_{job_id}/
- Data: {slurm_base_file_path}/ (organized into raw/, processed/, processed_with_emb/)
Nextcloud integration allows you to store large AnnData files remotely, making your HuggingFace datasets truly autonomous and shareable without local file dependencies.
HuggingFace datasets store only metadata (cell sentences, captions, negative indices). The actual AnnData files with expression matrices and embeddings are stored separately. Nextcloud provides:
- ☁️ Remote storage for large AnnData files
- 🔗 Share links embedded in the dataset for downstream access
- 🌐 Independence from local file systems
- 🤝 Easy sharing - anyone with the HF dataset can access the data
- Get Nextcloud access:
  - Obtain a Nextcloud account (institutional, self-hosted, or cloud provider)
  - You need: URL, username, and password
- Configure credentials:
Create or edit the .env file in the project root:
# In the project root directory
cat >> .env << 'EOF'
# Nextcloud authentication
NEXTCLOUD_URL=https://cloud.example.com # Your Nextcloud instance URL (what you type in browser)
NEXTCLOUD_USER=your-username # Your Nextcloud username
NEXTCLOUD_PASSWORD=your-password # Your Nextcloud password
EOF
Security note: The .env file is in .gitignore and will not be committed to git.
- Enable Nextcloud in dataset config:
dataset_creation:
  enabled: true
  use_nextcloud: true # Enable Nextcloud upload
  nextcloud_config:
    url: "NEXTCLOUD_URL" # Will be read from .env
    username: "NEXTCLOUD_USER" # Will be read from .env
    password: "NEXTCLOUD_PASSWORD" # Will be read from .env
    remote_path: "" # Automatically set based on dataset
The environment variables will be automatically resolved at runtime.
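In the config above, the values are the names of environment variables rather than the secrets themselves. One way to picture the runtime resolution is the sketch below. `resolve_env_names` is a hypothetical helper; how the package resolves these values internally is not shown here.

```python
import os


def resolve_env_names(config: dict, environ=None) -> dict:
    """Replace values that name an environment variable with that variable's value."""
    env = os.environ if environ is None else environ
    resolved = {}
    for key, value in config.items():
        if isinstance(value, str) and value in env:
            resolved[key] = env[value]
        else:
            # Leave literal values (e.g. an empty remote_path) untouched
            resolved[key] = value
    return resolved


nextcloud_config = {
    "url": "NEXTCLOUD_URL",
    "username": "NEXTCLOUD_USER",
    "password": "NEXTCLOUD_PASSWORD",
    "remote_path": "",
}
# Example environment, standing in for values loaded from .env
example_env = {
    "NEXTCLOUD_URL": "https://cloud.example.com",
    "NEXTCLOUD_USER": "your-username",
    "NEXTCLOUD_PASSWORD": "your-password",
}
print(resolve_env_names(nextcloud_config, example_env)["url"])
```

Keeping only variable names in the YAML means the config file can be committed without exposing credentials.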
- During dataset creation:
  - Processed AnnData files are uploaded to Nextcloud
  - Share links are generated for each file
  - Links are embedded in the HuggingFace dataset
- When using the dataset:
  - The adata_link column contains Nextcloud share URLs
  - Downstream models can download files on-demand
  - No local file dependencies needed
Files are organized in Nextcloud as:
{remote_path}/
└── {dataset_name}/
    ├── train/
    │   ├── chunk_0.zarr.zip
    │   └── chunk_1.zarr.zip
    └── validation/
        └── chunk_0.zarr.zip
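The layout above maps directly to a path template. A small sketch follows; the function name is hypothetical, and the chunk naming is taken from the tree shown above.

```python
import posixpath


def remote_chunk_path(remote_path: str, dataset_name: str, split: str, chunk_index: int) -> str:
    """Build the remote path of one chunk following the layout shown above."""
    filename = f"chunk_{chunk_index}.zarr.zip"
    # posixpath keeps forward slashes regardless of the local operating system
    return posixpath.join(remote_path, dataset_name, split, filename)


print(remote_chunk_path("datasets", "my_dataset", "train", 0))
```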
Zenodo integration provides persistent, citable storage for your AnnData files with DOI assignment, making your datasets suitable for academic publication and long-term archival.
Zenodo is a research data repository hosted by CERN that provides:
- 📚 Academic publishing - Get a DOI for your dataset
- 🔒 Long-term archival - CERN-backed preservation guarantees
- 🆓 Free storage - Up to 50GB per dataset
- 🌍 Public accessibility - Open science friendly
- 📝 Versioning - Built-in support for dataset versions
- 🧪 Sandbox testing - Test your uploads before going to production
- Get a Zenodo account:
  - Create an account at zenodo.org (production) or sandbox.zenodo.org (testing)
  - These are separate accounts - sandbox is recommended for testing first
- Create an access token:
For production:
  - Go to https://zenodo.org/account/settings/applications/
  - Click "New token"
  - Select scopes: deposit:write and deposit:actions
  - Copy the generated token
For sandbox (testing):
  - Go to https://sandbox.zenodo.org/account/settings/applications/
  - Create a token with the same scopes as above
  - Copy the generated token (this is a different token from production)
- Configure credentials:
Create or edit the .env file in the project root:
# For production Zenodo
ZENODO_TOKEN=your-production-token-here
# For sandbox Zenodo (separate token required)
ZENODO_SANDBOX_TOKEN=your-sandbox-token-here
Security note: The .env file is in .gitignore and will not be committed to git.
- Enable Zenodo in dataset config:
For production:
dataset_creation:
  enabled: true
  use_zenodo: true # Enable Zenodo upload
  zenodo_config:
    sandbox: false # Use production Zenodo
For sandbox (testing):
dataset_creation:
  enabled: true
  use_zenodo: true # Enable Zenodo upload
  zenodo_config:
    sandbox: true # Use sandbox Zenodo for testing
The appropriate environment variable (ZENODO_TOKEN or ZENODO_SANDBOX_TOKEN) will be automatically used based on the sandbox setting.
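The token selection described above amounts to a simple switch on the sandbox flag. A minimal sketch follows, with hypothetical helper names; the package's own implementation may differ.

```python
import os


def zenodo_token_variable(sandbox: bool) -> str:
    """Name of the environment variable that holds the token for this mode."""
    return "ZENODO_SANDBOX_TOKEN" if sandbox else "ZENODO_TOKEN"


def get_zenodo_token(sandbox: bool, environ=None) -> str:
    """Read the token for the chosen mode, failing early if it is missing."""
    env = os.environ if environ is None else environ
    name = zenodo_token_variable(sandbox)
    token = env.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return token


print(zenodo_token_variable(sandbox=True))
```

Failing early on a missing token avoids preparing large uploads before discovering that authentication cannot succeed.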
- During dataset creation:
  - Processed AnnData files are packaged as ZIP archives
  - A single Zenodo deposit (draft) is created for the entire dataset
  - All files (train/validation splits) are uploaded to this deposit
  - Download URLs are generated and embedded in the HuggingFace dataset
  - The deposit remains in draft state - you can publish it manually on Zenodo
- When using the dataset:
  - The adata_link column contains Zenodo download URLs
  - Files can be downloaded on-demand using the Zenodo API
  - No authentication required for published deposits
- Deposit management:
  - Deposit information is saved in zenodo_share_map.json in your data directory
  - Re-running the pipeline reuses the same deposit (no duplicates)
  - You can edit metadata and publish the deposit on the Zenodo website
Sandbox (sandbox.zenodo.org):
- ✅ Safe testing environment
- ✅ Same API as production
- ✅ Can be deleted/reset without consequences
- ❌ Not persistent (may be wiped periodically)
- ❌ No real DOIs
Production (zenodo.org):
- ✅ Permanent storage with real DOIs
- ✅ Suitable for publication
- ⚠️ Deposits cannot be deleted once published
- ⚠️ Use with care
Workflow: Always test with sandbox first, then switch to production when ready.
Currently supported: Nextcloud, Zenodo
Want other backends? If you need support for other cloud storage providers (AWS S3, Google Drive, Figshare, etc.), please open an issue describing your use case. We're interested in adding compatibility for additional storage backends!
For developers: The storage interface is in src/adata_hf_datasets/file_utils.py. Contributions for new backends are welcome!
The pipeline consists of six steps, each with detailed documentation:
Downloads and optionally subsets raw data from a URL.
Documentation: scripts/download/README.md
Key Features:
- Download from URLs or file paths
- Stratified subsetting with preserved proportions
- Validation of downloaded files
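To illustrate what "stratified subsetting with preserved proportions" means, here is a standard-library sketch that samples from each label group in proportion to its size. It is not the pipeline's implementation; `stratified_subset` is a hypothetical helper, and per-group rounding means the result can differ from subset_size by a few cells.

```python
import random
from collections import defaultdict


def stratified_subset(labels, subset_size, seed=0):
    """Return sorted indices sampled so each label keeps its share of the data."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    total = len(labels)
    chosen = []
    for members in groups.values():
        # Each group's quota is proportional to its share of all cells
        n_take = min(len(members), round(subset_size * len(members) / total))
        chosen.extend(rng.sample(members, n_take))
    return sorted(chosen)


cell_types = ["T cell"] * 80 + ["B cell"] * 20
subset = stratified_subset(cell_types, subset_size=10)
print(len(subset), sum(cell_types[i] == "T cell" for i in subset))
```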
Configuration:
download:
  enabled: true
  subset_size: 10000
  stratify_keys: ["cell_type", "tissue"]
  preserve_proportions: true
Cleans, filters, and normalizes raw count data.
Documentation: scripts/preprocessing/README.md
Key Features:
- Quality control with MAD-based outlier detection
- Gene/cell filtering
- Normalization and log-transformation
- Highly variable gene selection
- Optional train/val split
- SRA metadata enrichment
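MAD-based outlier detection flags values that lie more than a chosen number of median absolute deviations from the median. The sketch below shows the general technique on a plain list. The threshold of 5 MADs and the helper name are illustrative assumptions; the metrics and thresholds the pipeline actually applies are documented in the preprocessing README.

```python
from statistics import median


def mad_outliers(values, n_mads=5.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    lower = med - n_mads * mad
    upper = med + n_mads * mad
    # Note: if mad is 0, every value different from the median is flagged
    return [v < lower or v > upper for v in values]


# Hypothetical per-cell QC metric (e.g. percent mitochondrial counts)
metric = [10, 11, 9, 10, 12, 10, 100]
print(mad_outliers(metric))
```

Unlike a fixed cutoff, the bounds adapt to each dataset's own distribution, and the median is not pulled around by the outliers themselves.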
Configuration:
preprocessing:
  enabled: true
  min_cells: 20
  min_genes: 200
  n_top_genes: 5000
  chunk_size: 200000
  split_dataset: true
  train_split: 0.9
Performs CPU-intensive preparation for GPU embedding methods (e.g., Geneformer tokenization).
Documentation: scripts/embed/README.md
Key Features:
- Separates CPU-intensive prep from GPU computation
- Tokenization for Geneformer
- Cached preparation results
Configuration:
embedding_preparation:
  enabled: true
  methods: ["geneformer"] # Methods that need preparation
Generates embeddings using CPU-based methods.
Documentation: scripts/embed/README.md
Key Features:
- PCA: Linear dimensionality reduction
- scVI: Deep learning foundation model
- Memory-efficient streaming to disk
Configuration:
embedding_cpu:
  enabled: true
  methods: ["pca", "scvi_fm", "gs10k"]
  embedding_dim_map:
    pca: 50
    scvi_fm: 50
    gs10k: 10000
Generates embeddings using GPU-based methods.
Documentation: scripts/embed/README.md
Key Features:
- Geneformer: Transformer-based embeddings (requires a CUDA device)
- Automatic retry on GPU errors
- Uses preparation results from step 3
Configuration:
embedding_gpu:
  enabled: true
  methods: ["geneformer"]
  embedding_dim_map:
    geneformer: 768
Creates HuggingFace datasets with contrastive learning pairs/multiplets.
Documentation: scripts/dataset_creation/README.md
Key Features:
- Multiple dataset formats (multiplets, pairs, single)
- Cell sentence generation
- Intelligent negative sampling
- HuggingFace Hub publication
- Optional Nextcloud integration
Configuration:
dataset_creation:
  enabled: true
  dataset_format: "multiplets"
  sentence_keys: ["sample_id_og"]
  negatives_per_sample: 2
  required_obsm_keys: ["X_pca", "X_scvi_fm", "X_geneformer"]
  push_to_hub: true
While the workflow orchestrator runs all enabled steps automatically, you can run individual steps manually:
# Activate environment
source .venv/bin/activate
# Run preprocessing only
python scripts/preprocessing/preprocess.py --config-name my_dataset
# Run CPU embedding only
python scripts/embed/embed_core.py \
--config my_dataset \
++embedding_config_section=embedding_cpu
# Run dataset creation only
python scripts/dataset_creation/create_dataset.py --config-name my_dataset
See individual step documentation for more details.
Override any configuration parameter via command line:
python scripts/workflow/submit_workflow.py \
--config my_dataset \
++preprocessing.chunk_size=50000 \
++embedding_cpu.methods='["pca"]' \
++dataset_creation.push_to_hub=false
Local execution:
# Check logs in real-time
tail -f outputs/{date}/workflow_{timestamp}/logs/workflow_master.out
# View step-specific logs
tail -f outputs/{date}/workflow_{timestamp}/preprocessing/job_*/preprocessing.out
SLURM execution:
# Check job queue
ssh cpu_cluster "squeue -u username"
# View master job logs
ssh cpu_cluster "cat /home/username/outputs/{date}/workflow_{job_id}/logs/workflow_master.out"
# View step-specific logs
ssh cpu_cluster "cat /home/username/outputs/{date}/workflow_{job_id}/preprocessing/job_*/preprocessing.out"
To skip steps (e.g., if already completed):
# In your dataset config
preprocessing:
  enabled: false # Skip preprocessing
embedding_preparation:
  enabled: false # Skip embedding preparation
Or via command line:
python scripts/workflow/submit_workflow.py \
--config my_dataset \
++preprocessing.enabled=false \
++embedding_preparation.enabled=false
This section explains how to add a custom embedding method to the pipeline by creating a new embedder class.
All embedders inherit from the BaseEmbedder class and implement three core methods:
- __init__: Initialize the embedder with configuration parameters
- prepare: Prepare the embedder (e.g., load models, tokenize data)
- embed: Generate embeddings from the data
Create a new class that inherits from BaseEmbedder in src/adata_hf_datasets/embed/initial_embedder.py:
from adata_hf_datasets.embed.initial_embedder import BaseEmbedder, _check_load_adata
from importlib.util import find_spec
import numpy as np
import anndata as ad
import logging
logger = logging.getLogger(__name__)


class MyCustomEmbedder(BaseEmbedder):
    """
    Custom embedder that generates embeddings using MyCustomMethod.
    """

    def __init__(self, embedding_dim: int = 64, **kwargs):
        """
        Initialize the custom embedder.

        Parameters
        ----------
        embedding_dim : int
            Dimensionality of the output embedding.
        **kwargs
            Additional keyword arguments for the embedder.
        """
        # Check for required packages
        if find_spec("my_custom_package") is None:
            raise ImportError(
                "my_custom_package is required to use MyCustomEmbedder. "
                "Please install it with: pip install my-custom-package"
            )
        super().__init__(embedding_dim=embedding_dim)
        self.model = None
        self.init_kwargs = kwargs

    def prepare(
        self,
        adata: ad.AnnData | None = None,
        adata_path: str | None = None,
        **kwargs,
    ) -> None:
        """
        Prepare the embedder (e.g., load model, preprocess data).

        Parameters
        ----------
        adata : anndata.AnnData, optional
            Single-cell dataset in memory.
        adata_path : str, optional
            Path to the AnnData file (.h5ad or .zarr).
        **kwargs
            Additional keyword arguments for preparation.
        """
        # Use helper function to load adata if path is provided
        adata = _check_load_adata(adata, adata_path)
        logger.info("Preparing MyCustomEmbedder...")
        # Your preparation logic here
        # For example: load a pre-trained model, tokenize data, etc.
        # load_my_custom_model is a placeholder for your own loading code
        self.model = load_my_custom_model(**self.init_kwargs)

    def embed(
        self,
        adata: ad.AnnData | None = None,
        adata_path: str | None = None,
        obsm_key: str = "X_my_custom",
        **kwargs,
    ) -> np.ndarray:
        """
        Generate embeddings from the data.

        Parameters
        ----------
        adata : anndata.AnnData, optional
            Single-cell dataset in memory.
        adata_path : str, optional
            Path to the AnnData file (.h5ad or .zarr).
        obsm_key : str
            Key in `adata.obsm` to store the embedding.
        **kwargs
            Additional keyword arguments for embedding.

        Returns
        -------
        np.ndarray
            Embedding matrix of shape (n_cells, embedding_dim).
            Must be in the same order as adata.obs.index.
        """
        # Load adata if path is provided
        adata = _check_load_adata(adata, adata_path)
        logger.info("Generating embeddings with MyCustomEmbedder...")
        # Generate embeddings
        # IMPORTANT: Ensure output order matches adata.obs.index
        embedding_matrix = self.model.embed(adata.X)
        # Ensure correct shape and dtype
        embedding_matrix = embedding_matrix.astype(np.float32)
        # Store in adata.obsm if adata object was provided
        if adata is not None:
            adata.obsm[obsm_key] = embedding_matrix
            logger.info(
                f"Stored embeddings in adata.obsm['{obsm_key}'], shape: {embedding_matrix.shape}"
            )
        return embedding_matrix
Handle Both Input Types:
- Always use the _check_load_adata() helper function to handle both in-memory AnnData objects and file paths
- The helper automatically handles .h5ad and .zarr formats
Package Checking:
- Use find_spec() from importlib.util to check if required packages are installed
- Provide helpful error messages with installation instructions if packages are missing
Output Requirements:
- Return a np.ndarray of shape (n_cells, embedding_dim)
- Critical: The embedding matrix must be in the same order as adata.obs.index
- Use np.float32 dtype for consistency
- If an adata object is provided, store embeddings in adata.obsm[obsm_key]
Memory Efficiency:
- When using file paths, you can read only necessary data from .h5ad or .zarr stores
- For .zarr stores, you can write embeddings directly to adata.obsm without loading the entire object
- See GeneformerEmbedder for an example of efficient file-based operations
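The ordering requirement matters when a model returns rows in a different order than the input, for example after batching or sorting by sequence length. Below is a minimal sketch of realigning rows to adata.obs.index, written with plain lists so it runs without extra packages. `align_to_obs_index` is a hypothetical helper, not part of the package.

```python
def align_to_obs_index(obs_index, cell_ids, embeddings):
    """Reorder embedding rows so that row i belongs to obs_index[i]."""
    if len(cell_ids) != len(embeddings):
        raise ValueError("cell_ids and embeddings must have the same length")
    row_of = {cell_id: row for row, cell_id in enumerate(cell_ids)}
    missing = [cell_id for cell_id in obs_index if cell_id not in row_of]
    if missing:
        raise KeyError(f"No embedding returned for cells: {missing}")
    return [embeddings[row_of[cell_id]] for cell_id in obs_index]


obs_index = ["cell_1", "cell_2", "cell_3"]   # order of adata.obs.index
returned_ids = ["cell_3", "cell_1", "cell_2"]  # order the model returned
returned_emb = [[3.0, 3.0], [1.0, 1.0], [2.0, 2.0]]
print(align_to_obs_index(obs_index, returned_ids, returned_emb))
```

With a NumPy matrix, the same mapping can be applied as a row index, followed by the float32 cast described above.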
Add your embedder class to the embedder_classes dictionary in the InitialEmbedder class:
# In InitialEmbedder.__init__ method
embedder_classes = {
    "scvi_fm": SCVIEmbedderFM,
    "geneformer": GeneformerEmbedder,
    "geneformer-v1": GeneformerV1Embedder,
    "pca": PCAEmbedder,
    "hvg": HighlyVariableGenesEmbedder,
    "gs": GeneSelectEmbedder,
    "gs10k": GeneSelectEmbedder10k,
    "my_custom": MyCustomEmbedder,  # Add your embedder here
}
If you want to use your embedder in the pipeline, add it to conf/dataset_default.yaml:
embedding_cpu: # or embedding_gpu, depending on your method
  embedding_dim_map:
    scvi_fm: 50
    geneformer: 768
    pca: 50
    hvg: 512
    gs: 3936
    gs10k: 10000
    geneformer-v1: 512
    my_custom: 64 # Add your embedding dimension here
Add the same entry to both embedding_cpu and embedding_gpu sections if applicable.
Once registered, you can use your embedder:
from adata_hf_datasets import InitialEmbedder
# Initialize embedder
embedder = InitialEmbedder(
    method="my_custom",
    embedding_dim=64,
    # Your custom init_kwargs here
)
# Prepare (if needed)
embedder.prepare(adata_path="path/to/data.h5ad")
# Generate embeddings
embeddings = embedder.embed(
    adata_path="path/to/data.h5ad",
    obsm_key="X_my_custom",
)
Or use it in the pipeline by adding "my_custom" to the methods list in your dataset config:
embedding_cpu:
enabled: true
methods: ["pca", "my_custom"]

- Error Handling: Provide clear error messages when required data/attributes are missing
- Logging: Use the logger to inform users about what's happening
- Documentation: Add comprehensive docstrings explaining parameters and behavior
- Testing: Test with both .h5ad and .zarr formats, and with both in-memory and file-based inputs
- Order Preservation: Always ensure embeddings match the order of adata.obs.index
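The error-handling, logging, and order-preservation points can be sketched together as one small helper. This example is deliberately library-free and rests on an assumption: the model returns embeddings keyed by cell ID, in arbitrary order. `align_embeddings` is a hypothetical helper and not part of the package.

```python
import logging
from typing import Dict, List, Sequence

logger = logging.getLogger(__name__)


def align_embeddings(
    obs_index: Sequence[str],
    embeddings_by_cell: Dict[str, List[float]],
) -> List[List[float]]:
    """Return embedding rows ordered exactly like obs_index."""
    missing = [cell for cell in obs_index if cell not in embeddings_by_cell]
    if missing:
        # Clear error: report how many cells are missing and show a few of them.
        raise KeyError(
            f"{len(missing)} cell(s) have no embedding, e.g. {missing[:3]}"
        )
    logger.info("Aligning %d embeddings to the observation index", len(obs_index))
    return [embeddings_by_cell[cell] for cell in obs_index]


# The model output arrives in a different order than the observation index.
obs_index = ["cell_a", "cell_b", "cell_c"]
model_output = {"cell_c": [0.3], "cell_a": [0.1], "cell_b": [0.2]}
aligned = align_embeddings(obs_index, model_output)
print(aligned)  # [[0.1], [0.2], [0.3]]
```

In a real embedder the same check would typically operate on arrays and on `adata.obs.index`, but the principle is unchanged: reorder explicitly and refuse to continue when cells are missing.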
Detailed documentation for each component:
- Download - Data acquisition and subsetting
- Preprocessing - QC, filtering, normalization
- Embedding - PCA, scVI, HVG, Geneformer embeddings
- Dataset Creation - HuggingFace dataset generation
- Dataset Configuration Template - All available parameters
- Dataset Configuration Example - Working example
- Workflow Orchestrator Config - Workflow settings
- src/adata_hf_datasets/ - Core library code
- scripts/ - Executable scripts for each step
Problem: uv sync fails with missing dependencies
Solution:
# Try with verbose output
uv sync --all-extras -v

Problem: Python version error: "requires a different Python: X.X.X not in '<3.14,>=3.10'"
Solution: The package requires Python 3.10, 3.11, 3.12, or 3.13. If you see this error, you're likely using Python 3.14+ or an older version (3.9 or earlier). Check your Python version:
python --version

If you need to use a different Python version, consider using pyenv or a virtual environment with the correct version.
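The same check can be run from inside the interpreter, which helps when several Python installations are present. This is a minimal sketch based on the `<3.14,>=3.10` constraint quoted in the error message; `is_supported` is a hypothetical helper and not part of the package.

```python
import sys
from typing import Tuple

MIN_VERSION = (3, 10)  # inclusive lower bound
MAX_VERSION = (3, 14)  # exclusive upper bound


def is_supported(version: Tuple[int, int]) -> bool:
    """Return True if (major, minor) satisfies >=3.10,<3.14."""
    return MIN_VERSION <= version < MAX_VERSION


current = (sys.version_info.major, sys.version_info.minor)
if is_supported(current):
    print(f"Python {current[0]}.{current[1]} is supported")
else:
    print(f"Python {current[0]}.{current[1]} is outside the supported range 3.10-3.13")
```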
Problem: Pip installation is very slow or fails
Solution: Pip installation can be slow due to dependency resolution. Consider:
- Using uv instead (much faster): uv pip install .
- Using pre-built wheels: pip install --only-binary :all: . (may not work for local packages)
- Installing in a clean virtual environment
Problem: Git submodule issues/maintenance
Solutions:
# Initialize submodules (only needed for Geneformer)
git submodule update --init --recursive
# Skip submodules when cloning (a plain clone does not fetch submodules)
git clone https://github.com/mengerj/adata_hf_datasets.git
# Remove submodule initialization if already cloned
git submodule deinit --all -f

Note: Submodules are optional and only needed for Geneformer support. You can install and use the package without them.
Problem: Geneformer or scVI embedder fails with ImportError
Solution: These are optional dependencies. Install them separately:
- For scVI: pip install scvi-tools
- For Geneformer:
  - Initialize submodules: git submodule update --init --recursive
  - Install from local path: pip install external/Geneformer
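Before starting a long run, it can help to check which optional dependencies are importable. The sketch below uses only the standard library. The import names `scvi` and `geneformer` are assumptions about how the two packages are imported, and `find_missing` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Dict, Iterable, List

# Assumed import name -> install hint taken from the steps above.
OPTIONAL_MODULES: Dict[str, str] = {
    "scvi": "pip install scvi-tools",
    "geneformer": "pip install external/Geneformer",
}


def find_missing(modules: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in modules if find_spec(name) is None]


for name in find_missing(OPTIONAL_MODULES):
    print(f"Optional module '{name}' is not installed; try: {OPTIONAL_MODULES[name]}")
```

`find_spec` only locates the module without importing it, so the check stays fast even for heavy packages.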
Problem: "Config file not found"
Solution:
# Ensure you're in the project root
cd /path/to/adata_hf_datasets
# Check config file exists
ls conf/my_dataset.yaml
# Use config name without .yaml extension
python scripts/workflow/submit_workflow_local.py --config my_dataset

Problem: "base_file_path is not set"
Solution: Ensure workflow_orchestrator.yaml has the correct path:
workflow:
local_base_file_path: "/absolute/path/to/data/RNA" # For local
slurm_base_file_path: "/absolute/path/to/data/RNA" # For SLURM

Problem: SSH timeout when submitting to SLURM
Solution:
# Test SSH connection
ssh cpu_cluster "hostname"
# Check SSH config
cat ~/.ssh/config
# Ensure SSH agent is running
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

Problem: SLURM job fails immediately
Solution:
# Check SLURM logs on cluster
ssh cpu_cluster "cat /home/username/outputs/{date}/workflow_{job_id}/logs/workflow_master.err"
# Verify partition names
ssh cpu_cluster "sinfo"
# Check virtual environment exists on cluster
ssh cpu_cluster "ls /home/username/adata_hf_datasets/.venv/bin/python"

Problem: "File not found" errors during workflow
Solution:
- Verify base_file_path is accessible and has correct permissions
- For SLURM: Ensure base_file_path is accessible from both CPU and GPU clusters
- Check that previous steps completed successfully
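The first check in the list above can be scripted and run on each machine that needs access. This is a minimal sketch assuming the workflow needs read, traverse, and write permission on the directory; `check_base_path` is a hypothetical helper and not part of the package.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def check_base_path(base_file_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    path = Path(base_file_path)
    if not path.is_absolute():
        problems.append(f"{path} is not an absolute path")
    if not path.is_dir():
        problems.append(f"{path} does not exist or is not a directory")
        return problems  # Permission checks are meaningless without a directory.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable or not traversable")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable")
    return problems


# Demonstrated on the system temp directory; substitute your own base_file_path.
for problem in check_base_path(tempfile.gettempdir()):
    print("Problem:", problem)
```

On SLURM setups, running the same script on both the CPU and the GPU cluster shows whether the path is visible from each.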
Problem: ValueError: Couldn't infer the same data file format for all splits
Solution: Use load_from_disk() instead of load_dataset() for locally saved datasets:
# ✅ Correct - for locally saved datasets
from datasets import load_from_disk
ds = load_from_disk("/path/to/dataset")
# ❌ Wrong - this is for HuggingFace Hub
from datasets import load_dataset
ds = load_dataset("/path/to/dataset") # Will fail!

load_dataset() is only for loading datasets from the HuggingFace Hub or inferring formats from raw files. For datasets saved with save_to_disk(), always use load_from_disk().
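If you are unsure how a local directory was produced, its contents usually give it away. The sketch below assumes that `save_to_disk()` writes `state.json` plus `dataset_info.json` for a single dataset and `dataset_dict.json` for a dataset dictionary; that layout is an assumption about current `datasets` releases, so verify it against your installed version. `looks_like_saved_dataset` and `suggest_loader` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def looks_like_saved_dataset(path: str) -> bool:
    """Heuristic: does the directory look like save_to_disk() output?"""
    root = Path(path)
    is_dataset = (root / "state.json").is_file() and (
        root / "dataset_info.json"
    ).is_file()
    is_dataset_dict = (root / "dataset_dict.json").is_file()
    return is_dataset or is_dataset_dict


def suggest_loader(path: str) -> str:
    """Name the datasets function that most likely matches the directory."""
    return "load_from_disk" if looks_like_saved_dataset(path) else "load_dataset"


# Demonstration on a throwaway directory that mimics a saved DatasetDict.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dataset_dict.json").write_text('{"splits": ["train"]}')
    print(suggest_loader(tmp))
```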
For more help:
- Check the step-specific README files linked above
- Review log files in outputs/{date}/workflow_{timestamp}/
- Check the .err files for error messages
- Review the dataset config for any misconfigurations
If you use this pipeline in your research, please cite:
@software{adata_hf_datasets,
title = {AnnData HuggingFace Datasets Pipeline},
author = {Jonatan Menger},
year = {2025},
url = {https://github.com/mengerj/adata_hf_datasets}
}

[Specify your license here]
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This pipeline builds on:
- AnnData for handling data matrices
- Scanpy for single-cell analysis
- scVI for probabilistic models
- Geneformer for transformer embeddings
- Hugging Face Datasets for dataset management