baskargroup/HTC_EP

HTC_EP — High-Throughput EnergyPlus on HPC

Run large batches of EnergyPlus building-energy simulations in parallel on SLURM-based HPC clusters using containerized EnergyPlus (Apptainer/Singularity).

Tested on Iowa State Nova, TACC Frontera, and TACC Stampede3.


Who is this for?

Researchers and engineers who need to run hundreds or thousands of EnergyPlus simulations (e.g., parametric studies, sensitivity analyses, benchmarking) on an HPC cluster with SLURM scheduling.


Prerequisites

| Requirement | Notes |
|---|---|
| SLURM workload manager | Available on most HPC clusters |
| Apptainer / Singularity | For running the EnergyPlus container |
| EnergyPlus .sif container image | See Getting EnergyPlus onto your HPC below |
| EnergyPlus .epw weather file | Example provided in Files/ |
| Python 3.9+ | For summary.py, Compile_Incomplete_list.py, and the dataset pipeline |
| PyLauncher (optional) | Frontera/Stampede3 PyLauncher workflows only (module load pylauncher) |

Repository Layout

HTC_EP_internal/
├── Files/
│   ├── 1ZoneUncontrolled.idf               # Minimal 1-zone example model
│   ├── 5ZoneAirCooled.idf                  # Larger 5-zone example model
│   └── USA_CO_Golden-NREL.724666_TMY3.epw  # Example TMY3 weather file
├── Jobs/
│   ├── app-alloc.sh                        # SLURM array job — Nova / generic OpenHPC
│   ├── frontera_jobs/
│   │   ├── jobscript.sh                    # SLURM + PyLauncher — TACC Frontera
│   │   └── launcher.py                     # PyLauncher entry point
│   ├── stampede3_jobs/
│   │   ├── job-alloc.sh                    # SLURM array job — TACC Stampede3
│   │   └── stampede3_pylauncher/
│   │       ├── jobscript_paramiko.sh       # SLURM + PyLauncher — TACC Stampede3
│   │       └── launcher.py                 # PyLauncher entry point
│   ├── dataset_creation/                   # AI-ready parametric dataset pipeline
│   │   ├── dataset_config.py               # Master config — all settings in one place
│   │   ├── generate_variants.py            # Step 2: parametric IDF generation (text/regex)
│   │   ├── generate_tasks.py               # Step 3: cross variants × EPW → tasks.txt
│   │   ├── dataset_pylauncher_job.sh       # Step 4a: Frontera / PyLauncher job script
│   │   ├── dataset_array_job.sh            # Step 4b: Nova / SLURM array alternative
│   │   ├── postprocess_to_parquet.py       # Step 5: CSV + MTR → Parquet + schema.json
│   │   ├── upload_to_huggingface.py        # Step 6: push dataset to Hugging Face Hub
│   │   └── setup_idd.sh                    # Optional: extract Energy+.idd from container
│   └── azure_batch/
│       ├── config.py                       # Placeholder config template (commit this)
│       ├── config_secret.py                # Real credentials — gitignored, never commit
│       ├── launch_az_pool.py               # Create Azure Batch VM pool
│       ├── EP_HiTP_doe_prototype.py        # Submit job + tasks
│       ├── azjobstatusscaler.py            # Monitor job, download logs, scale pool down
│       ├── TaskErrorOutputDownloader.py    # Download stdout/stderr for all tasks
│       ├── Task_Summary.py                 # Compute timing stats → CSV
│       └── rerun_audit.py                  # Identify retried tasks → CSV
├── docs/
│   ├── quick-start-nova.md                 # Full step-by-step guide for Nova
│   ├── pylauncher-workflow.md              # Full guide for Frontera / Stampede3 PyLauncher
│   ├── stampede3-slurm-array.md            # Stampede3 SLURM array guide (ibrun, tacc-apptainer)
│   ├── azure_batch.md                      # Azure Batch cloud workflow guide
│   └── dataset_workflow.md                 # AI-ready dataset pipeline — end-to-end guide
├── config.env.example                      # Template for local paths/credentials (safe to commit)
├── config.env                              # Your real paths/credentials — gitignored, never commit
├── copyfp.sh                               # Create N identical IDF copies for scaling benchmarks
├── findidf_listconfig.sh                   # Find IDFs and build Config_IDFlist.txt
├── make_tasks.sh                           # Generate tasks.txt for PyLauncher workflows
├── job_timestats.sh                        # Extract timing stats from SLURM jobs
├── summary.py                              # Parse SLURM output → summary.csv
└── Compile_Incomplete_list.py              # Find IDFs that did not complete

Note on copyfp.sh and copy_N/ directories: The basic workflow uses copyfp.sh to create N identical copies of one IDF. This is a scaling benchmark — all copies run the same model in parallel so you can measure throughput and wall-time across different node/core configurations. For real parametric studies (sweeping different parameter values), use the Jobs/dataset_creation/ pipeline, which generates unique IDF variants via generate_variants.py.
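For orientation, here is a minimal bash sketch of the copy step described above. It is not the repository's copyfp.sh; the function name make_copies and the copy_1/ through copy_N/ layout are assumptions based on the copy_N/ directories mentioned in this README.

```bash
#!/usr/bin/env bash
# Hypothetical sketch (not copyfp.sh itself): place N identical copies of one
# IDF into copy_1/ ... copy_N/ under the current directory.
make_copies() {
  local idf="$1" n="$2" i
  if [[ ! -f "$idf" ]]; then
    echo "IDF not found: $idf" >&2
    return 1
  fi
  if ! [[ "$n" =~ ^[1-9][0-9]*$ ]]; then
    echo "N must be a positive integer, got: $n" >&2
    return 1
  fi
  for ((i = 1; i <= n; i++)); do
    mkdir -p "copy_${i}"        # one directory per case keeps outputs apart
    cp "$idf" "copy_${i}/"      # identical model in every directory
  done
}

# Usage (from run_directory):
#   make_copies ../Files/1ZoneUncontrolled.idf 2
```

Giving each case its own directory means EnergyPlus output files with default names never collide between parallel runs.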


Minimal Runnable Example (Nova)

This runs 2 copies of the included 1-zone model on 2 cores using EnergyPlus 23.1.0. After step 3, all commands run from run_directory, which keeps job outputs separate from the repository files.

# 1. Clone the repo
git clone <repo-url>
cd HTC_EP_internal   # use the directory name git clone actually created

# 2. Pull the EnergyPlus container (note the full path to the .sif produced)
module spider apptainer && module load apptainer
apptainer pull docker://nrel/energyplus:23.1.0

# 3. Create run_directory and enter it
mkdir run_directory && cd run_directory

# 4. Copy job script, create IDF copies, build config list
cp ../Jobs/app-alloc.sh .
bash ../copyfp.sh ../Files/1ZoneUncontrolled.idf 2
bash ../findidf_listconfig.sh

# 5. Edit app-alloc.sh  (nano: arrow keys to move, Ctrl+O save, Ctrl+X exit)
nano ./app-alloc.sh
#   Set APPTAINER_IMAGE, EPW, Ncases=2, Ncores=2
#   Set --ntasks-per-node=2, --array=1-2, --mail-user

# 6. Submit and monitor
sbatch ./app-alloc.sh          # note the job ID printed
squeue -j <job_id>             # monitor status

# 7. Collect results
module spider python && module load python
python ../summary.py && cat summary.csv
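Step 4 relies on findidf_listconfig.sh to write Config_IDFlist.txt. The sketch below shows one way such an indexed list could be built. The "<index> <path>" line format and the function name build_idf_list are assumptions, so compare against the file the real script produces. The printed line count is the number you would use for Ncases.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of an indexed IDF list: every *.idf below the current
# directory, in natural order (copy_2 before copy_10), one
# "<index> <path>" pair per line. The real script's format may differ.
build_idf_list() {
  local out="${1:-Config_IDFlist.txt}"
  find . -name '*.idf' -type f | sort -V | awk '{ print NR, $0 }' > "$out"
  # Number of cases found; use this value for Ncases.
  wc -l < "$out"
}

# Usage (from run_directory, after creating the copies):
#   build_idf_list Config_IDFlist.txt
```

Version sort (sort -V, available in GNU coreutils) keeps numbered directories in numeric order, so index k lines up with copy_k/.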

For the full step-by-step walkthrough with explanations → docs/quick-start-nova.md

For Frontera / Stampede3 PyLauncher → docs/pylauncher-workflow.md

For Stampede3 SLURM array (ibrun, tacc-apptainer) → docs/stampede3-slurm-array.md

For Azure Batch (cloud, on-demand VMs) → docs/azure_batch.md

For the AI-ready parametric dataset pipeline → docs/dataset_workflow.md


Getting EnergyPlus onto your HPC

Option 1 — Pull directly from Docker (recommended)

apptainer pull docker://nrel/energyplus:23.1.0
# Produces: energyplus_23.1.0.sif in your current directory

Best practice on TACC clusters: Do not pull on the login node. Use idev first:

idev -t 0:30:00
module spider tacc-apptainer && module load tacc-apptainer/1.3.3
apptainer pull docker://nrel/energyplus:23.1.0
exit

On Nova, start an interactive allocation with salloc -N 1 -n 1 -t 0:30:00 before pulling if your site discourages heavy work on login nodes.

Option 2 — Transfer via Globus (when direct pull is unavailable)

Use Globus if your cluster's compute nodes cannot reach the internet, or if you want to reuse a .sif image already built locally or on another cluster.

  1. Install Globus Connect Personal on your local machine (or use an existing Globus endpoint at your institution).
  2. Log in at globus.org and open the File Manager.
  3. Set one endpoint to your local machine (or source cluster) and navigate to the .sif file.
  4. Set the other endpoint to your target HPC system. Common endpoints:
    • TACC Frontera — search TACC Frontera in the Globus catalog
    • TACC Stampede3 — search TACC Stampede3
    • Iowa State University HPC — search Iowa State
  5. Select the file(s) and click Start. Transfer to a persistent work directory (e.g., /work2/$USER/ on Frontera, /work/mech-ai/$USER/ on Nova).

The same approach works for transferring custom .idf and .epw files when scp/rsync is inconvenient or too slow for large file sets.

Option 3 — Build a customised EnergyPlus from source

Use this if you need to modify the EnergyPlus source before running.

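# a. Clone source
git clone --branch v23.1.0 --single-branch https://github.com/NREL/EnergyPlus.git

# b. Start an interactive node and open a shell inside the container.
#    The image must provide CMake and C/C++/Fortran compilers; a runtime-only image may not.
#    Keep the source under a path Apptainer binds by default (e.g. $HOME or the current directory).
salloc -N 1 -n 4 -t 1:00:00
module load apptainer
apptainer exec /path/to/energyplus_23.1.0.sif sh

# c. Build inside the container. The .sif filesystem is read-only, so install to a
#    writable prefix (here a hypothetical $HOME/energyplus-custom) instead of /usr/local.
mkdir -p /tmp/eplus-build && cd /tmp/eplus-build
cmake -DBUILD_FORTRAN=ON -DCMAKE_INSTALL_PREFIX=$HOME/energyplus-custom /path/to/EnergyPlus
make -j 4 && make install
exit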

Bringing simulation files to HPC via Globus

Use Globus to transfer .idf model files, .epw weather files, and large result directories between your laptop, institutional storage, and HPC clusters. Globus is especially useful when:

  • File sets are large (hundreds of IDFs, multi-GB result archives).
  • scp/rsync is blocked or rate-limited by your institution's firewall.
  • You need to move data between two HPC systems (e.g., Frontera → Nova).

Quick steps:

  1. Go to globus.org and open the File Manager.
  2. Source endpoint: your local machine (Globus Connect Personal) or a cluster's Globus endpoint.
  3. Destination endpoint: the target HPC. Common endpoints:
    • TACC Frontera — search TACC Frontera
    • TACC Stampede3 — search TACC Stampede3
    • Iowa State University HPC — search Iowa State
  4. Navigate to the target directory (e.g., your $WORK or $SCRATCH) and click Start.

After transfer, set APPTAINER_IMAGE, EPW, and any IDF paths in your job scripts or config.env to the destination paths.
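A quick pre-submission check can catch paths that were not updated after a transfer. This is a hypothetical helper, not part of the repository; it only tests that APPTAINER_IMAGE and EPW point at readable files in the current shell.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that the image and weather paths exist before
# submitting. Returns non-zero and prints a message for each problem.
check_inputs() {
  local status=0 var path
  for var in APPTAINER_IMAGE EPW; do
    path="${!var:-}"            # indirect lookup of the variable named by $var
    if [[ -z "$path" ]]; then
      echo "MISSING: $var is not set" >&2
      status=1
    elif [[ ! -f "$path" || ! -r "$path" ]]; then
      echo "MISSING: $var=$path is not a readable file" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (after setting the paths, e.g. by sourcing config.env if it defines them):
#   check_inputs && sbatch ./app-alloc.sh
```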


Loading Required Modules

Module names vary by cluster. Use module spider to find the right one:

module spider python       # find available Python modules
module spider apptainer    # find available Apptainer/Singularity modules

Job scripts load their own modules automatically at runtime. You only need to load modules manually for interactive tasks (apptainer pull, summary.py, etc.).

On TACC clusters, Apptainer is named tacc-apptainer:

module spider tacc-apptainer
module load tacc-apptainer/1.3.3

Editing Scripts — Reference

New to terminal editors? Use nano: arrow keys to move, type to edit, Ctrl+O to save, Ctrl+X to exit.

Each job script has a USER CONFIGURATION block near the top. Copy the script into run_directory before editing.

Nova (app-alloc.sh):

#SBATCH --mail-user=your@institution.edu
#SBATCH --ntasks-per-node=2   # must equal Ncores
#SBATCH --array=1-2           # upper limit must equal Ncores
#SBATCH --time=01:00:00
# (sbatch stops reading #SBATCH lines after the first executable command)

APPTAINER_IMAGE="/path/to/energyplus.sif"
EPW="/path/to/weather.epw"
CONFIG_FILE="Config_IDFlist.txt"  # relative path; works as-is when sbatch is run from run_directory
Ncases=4   # total IDF cases
Ncores=2   # must match --ntasks-per-node and the --array upper limit
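Because a mismatch between Ncores, --ntasks-per-node, and the --array upper limit is easy to introduce while editing, a small check before sbatch can help. This is a sketch, not a repository script, and it assumes the directives are written in the exact "#SBATCH --flag=value" form shown above.

```bash
#!/usr/bin/env bash
# Sketch: read Ncores, --ntasks-per-node and the --array upper limit from an
# edited job script and confirm they agree. Returns 2 if a value is missing,
# 1 on mismatch, 0 (and prints "OK: ...") when consistent.
check_array_config() {
  local script="$1" ncores ntasks array_max
  ncores=$(sed -n 's/^Ncores=\([0-9]*\).*/\1/p' "$script" | head -n 1)
  ntasks=$(sed -n 's/^#SBATCH --ntasks-per-node=\([0-9]*\).*/\1/p' "$script" | head -n 1)
  array_max=$(sed -n 's/^#SBATCH --array=[0-9]*-\([0-9]*\).*/\1/p' "$script" | head -n 1)
  if [[ -z "$ncores" || -z "$ntasks" || -z "$array_max" ]]; then
    echo "Could not find Ncores, --ntasks-per-node or --array in $script" >&2
    return 2
  fi
  if [[ "$ncores" != "$ntasks" || "$ncores" != "$array_max" ]]; then
    echo "Mismatch: Ncores=$ncores ntasks-per-node=$ntasks array-max=$array_max" >&2
    return 1
  fi
  echo "OK: Ncores=$ncores"
}

# Usage: check_array_config ./app-alloc.sh && sbatch ./app-alloc.sh
```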

TACC clusters — also set:

#SBATCH -A YOUR_ALLOCATION_ID   # https://tacc.utexas.edu/portal/projects

Queue selection:

| Cluster | Testing | Production |
|---|---|---|
| Frontera | development (max 2 nodes, 30 min) | normal |
| Stampede3 | skx-dev (max 2 nodes, 2 hrs) | skx |

Inputs and Outputs
| Type | Item | Description |
|---|---|---|
| Input | .idf | EnergyPlus model files |
| Input | .epw | Weather file |
| Input | .sif | EnergyPlus Apptainer container |
| Output | copy_N/ | EnergyPlus results, one directory per case |
| Output | slurm-<jobid>_<taskid>.out | Per-task SLURM logs |
| Output | Config_IDFlist.txt | Indexed list of all IDF paths |
| Output | tasks.txt | Task list for PyLauncher workflows |
| Output | summary.csv | Runtime for each completed IDF |
| Output | IncompleteIDF_list*.txt | IDFs that did not finish |
| Output | job_<id>_time_log.txt | Per-job elapsed time from sacct |
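Compile_Incomplete_list.py produces IncompleteIDF_list*.txt; its exact logic is not described here. As an independent sketch, a case can be flagged as incomplete when EnergyPlus's end-of-run summary file is missing or does not report success. Assumption: each case writes the default eplusout.end into its copy_N/ directory (the name differs if the run uses an output prefix).

```bash
#!/usr/bin/env bash
# Sketch (not Compile_Incomplete_list.py): print every case directory whose
# eplusout.end is missing or does not contain EnergyPlus's success message.
list_incomplete() {
  local dir end
  for dir in "$@"; do
    end="$dir/eplusout.end"
    if [[ ! -f "$end" ]] || ! grep -q "EnergyPlus Completed Successfully" "$end"; then
      echo "$dir"
    fi
  done
}

# Usage (from run_directory):
#   list_incomplete copy_*/ > incomplete_dirs.txt
```

A missing file usually means the task never started or was killed at the wall-time limit, while a file reporting a fatal error points to a problem in the model itself.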

Cluster-Specific Notes

Iowa State Nova

  • Script: Jobs/app-alloc.sh — copy into run_directory.
  • Uses mpirun -n 1 + Apptainer in a SLURM array job.
  • Modules: intel/20.1, apptainer/1.3.6-py311-nvfjdsj.
  • No allocation ID (-A) needed.
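How an array task chooses which IDFs to run is not spelled out in this README. A hypothetical strided scheme, in which array task t runs cases t, t+Ncores, t+2*Ncores, and so on up to Ncases, is sketched below in dry-run form: it prints the per-case commands instead of running them. The "<index> <path>" format of Config_IDFlist.txt and the presence of energyplus on the container's PATH are assumptions, and the real app-alloc.sh logic may differ.

```bash
#!/usr/bin/env bash
# Hypothetical strided assignment: array task $1 of $2 takes case indices
# task, task+ncores, task+2*ncores, ... up to ncases.
cases_for_task() {
  local task="$1" ncores="$2" ncases="$3" i
  for ((i = task; i <= ncases; i += ncores)); do
    echo "$i"
  done
}

# Dry run: print the command each assigned case would execute.
# Assumes "<index> <path>" lines and IDF paths without spaces.
print_commands() {
  local task="$1" ncores="$2" ncases="$3" list="$4" i idf
  for i in $(cases_for_task "$task" "$ncores" "$ncases"); do
    idf=$(awk -v k="$i" '$1 == k { print $2 }' "$list")
    echo "mpirun -n 1 apptainer exec \"\$APPTAINER_IMAGE\" energyplus -w \"\$EPW\" -d $(dirname "$idf") $idf"
  done
}

# Inside a job: print_commands "$SLURM_ARRAY_TASK_ID" "$Ncores" "$Ncases" Config_IDFlist.txt
```

A strided split spreads cases evenly across array tasks even when Ncases is not a multiple of Ncores.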

TACC Frontera

  • Script: Jobs/frontera_jobs/jobscript.sh + launcher.py — copy both into run_directory.
  • Uses PyLauncher to distribute tasks across all requested cores.
  • Modules: python3/3.9.2, pylauncher, tacc-apptainer/1.3.3.
  • Queue: -p development for testing, -p normal for production.
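PyLauncher's ClassicLauncher reads a plain text file with one shell command per line and hands each line to a free core. The sketch below builds such a file in the spirit of make_tasks.sh; the function name and exact command layout are assumptions. Absolute paths for the image and weather file avoid depending on the directory each task starts in.

```bash
#!/usr/bin/env bash
# Sketch of a PyLauncher task file: one self-contained EnergyPlus command per
# IDF, writing results next to the IDF. %q quotes each path for the shell.
make_task_file() {
  local image="$1" epw="$2" out="$3" idf
  shift 3
  : > "$out"   # start with an empty task file
  for idf in "$@"; do
    printf 'apptainer exec %q energyplus -w %q -d %q %q\n' \
      "$image" "$epw" "$(dirname "$idf")" "$idf" >> "$out"
  done
}

# Usage (from run_directory):
#   make_task_file "$APPTAINER_IMAGE" "$EPW" tasks.txt copy_*/*.idf
```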

TACC Stampede3

  • SLURM array: Jobs/stampede3_jobs/job-alloc.sh (copy into run_directory). Uses ibrun -n 1 per task. Full guide → docs/stampede3-slurm-array.md
  • PyLauncher: Jobs/stampede3_jobs/stampede3_pylauncher/jobscript_paramiko.sh + launcher.py (copy both into run_directory). Full guide → docs/pylauncher-workflow.md
  • Modules: intel, tacc-apptainer/1.3.3 (array); python/3.9.18, pylauncher, tacc-apptainer/1.3.3 (PyLauncher).
  • Queue: -p skx-dev for testing, -p skx for production (Skylake nodes).
  • summary.py works on Stampede3 without modification — it auto-detects the IDF path offset.

License

MIT — see LICENSE.

Contact

Code: Vishal Muralidharan (vishalm@iastate.edu) and Baskar Ganapathysubramanian (baskarg@iastate.edu)

Citation

If you use this framework in your research, please cite our work:

APA

Muralidharan, V., Passe, U., & Ganapathysubramanian, B. (2026). A High Throughput Framework for
Large Scale Building Energy Simulation: From Real-Time Alerts to AI-Ready Surrogates. Proceedings
of SimBuild Conference 2026, 12, 527–537. https://doi.org/10.26868/30680611.2026.1334

BibTeX

@inproceedings{muralidharan2026htc,
  title     = {A High Throughput Framework for Large Scale Building Energy Simulation:
               From Real-Time Alerts to {AI}-Ready Surrogates},
  author    = {Muralidharan, Vishal and Passe, Ulrike and Ganapathysubramanian, Baskar},
  booktitle = {Proceedings of SimBuild Conference 2026},
  series    = {IBPSA-USA Building Simulation Conference},
  volume    = {12},
  pages     = {527--537},
  year      = {2026},
  publisher = {IBPSA-USA},
  address   = {Minneapolis, Minnesota},
  doi       = {10.26868/30680611.2026.1334},
  url       = {https://publications.ibpsa.org/conference/paper/?id=simbuild2026_1334},
  isbn      = {978-1-964372-10-5}
}

Paper: IBPSA Publications, https://publications.ibpsa.org/conference/paper/?id=simbuild2026_1334
