5 changes: 4 additions & 1 deletion .claude/CLAUDE.md
@@ -6,11 +6,14 @@
uv venv && uv pip install -r requirements-docs.txt # setup
uv pip install "git+https://github.com/logicalclocks/hopsworks-api.git@main#subdirectory=python" # install Python API (needed for API docs section)
touch docs/javadoc; uv run mkdocs build -s; rm docs/javadoc # build (strict)
uv run mkdocs serve # preview with live reload
uv run mike deploy <version> latest --update-alias # build a versioned bundle to the gh-pages worktree (use repo's current version, e.g. 4.4); first time only: `uv run mike set-default latest`
uv run mike serve # serve the gh-pages worktree locally (preview); does NOT live-reload from source — re-run `mike deploy` after edits
npx markdownlint-cli2 "**/*.md" # lint Markdown (requires Node.js)
uv tool install md-snakeoil && snakeoil --line-length 88 --rules "E,F,B,C4,ISC,PIE,PYI,Q,RSE,RET,SIM,TC,I,W,D2,D3,D4,INP,UP,FA" docs # lint Python code blocks
```

`uv run mkdocs serve` is available too, but its livereload watcher does not fire rebuilds in this repo's plugin combination on macOS — `mike serve` is the canonical preview tool per the repo README.

## Rules

- One sentence per line in all Markdown prose
6 changes: 5 additions & 1 deletion .claude/docs/README.md
@@ -8,12 +8,16 @@ There is no application code — all work is writing Markdown under `docs/` and
```bash
uv venv && uv pip install -r requirements-docs.txt # setup
uv pip install "git+https://github.com/logicalclocks/hopsworks-api.git@main#subdirectory=python" # needed for Python API section
touch docs/javadoc; uv run mkdocs serve; rm docs/javadoc # preview with live reload
touch docs/javadoc; uv run mkdocs build -s; rm docs/javadoc # build in strict mode
uv run mike deploy <version> latest --update-alias # build a versioned bundle (use repo's current version, e.g. 4.4); first time only: `uv run mike set-default latest`
uv run mike serve # serve the versioned bundle locally (canonical preview, per repo README); no source live-reload — re-run `mike deploy` after edits
npx markdownlint-cli2 "**/*.md" # lint Markdown (requires Node.js)
uv tool install md-snakeoil && snakeoil --line-length 88 --rules "E,F,B,C4,ISC,PIE,PYI,Q,RSE,RET,SIM,TC,I,W,D2,D3,D4,INP,UP,FA" docs # lint Python code blocks
```

`uv run mkdocs serve` is available as a static dev server, but its livereload watcher does not fire rebuilds in this repo's plugin combination on macOS.
Use `mike serve` for previews.

`docs/javadoc` is a directory generated by CI from the `hopsworks-api` Java source.
Locally, an empty stub file created with `touch docs/javadoc` is enough for the build to pass.

78 changes: 78 additions & 0 deletions docs/setup_installation/admin/compute_resources.md
@@ -0,0 +1,78 @@
---
description: Cluster configuration for the per-project Compute Resources Usage view
---

# Configure the Compute Resources Usage view

## Introduction

This page is for cluster administrators.
It explains how the per-project node list in the **Compute Resources Usage** card is derived, and the RBAC required for that derivation to work.
For the end-user view of the same card, see [Compute Resources Usage][compute-resources-usage].

When Kueue is installed, the card limits the node list to nodes a given project can actually schedule on.
The mapping is driven entirely by standard Kueue objects: **LocalQueue → ClusterQueue → ResourceFlavor**.
There is no Hopsworks-specific configuration on top.

## How project → node visibility is derived

For each project, Hopsworks walks the queue hierarchy bound to the project's Kubernetes namespace.

- Start from every `LocalQueue` in the project's namespace.
- Follow each `LocalQueue.spec.clusterQueue` to its `ClusterQueue`.
- For each `ClusterQueue`, collect the `ResourceFlavor`s named in `spec.resourceGroups[].flavors[].name`.

The resulting set of `ResourceFlavor`s is the project's "reachable flavors".
A node is included in the project's Node Resources view only if it matches at least one reachable flavor.

The per-queue node filter in the UI is built from the same walk, but kept keyed by `LocalQueue` name so users can narrow the view to a single queue.
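
The walk described above can be sketched in Python over plain dictionaries.
This is a minimal illustration, not the Hopsworks implementation.
The function name `reachable_flavors`, the input shapes, and the sample queue and flavor names are assumptions; only the field paths `spec.clusterQueue` and `spec.resourceGroups[].flavors[].name` come from the steps above.

```python
def reachable_flavors(local_queues, cluster_queues):
    """Map each LocalQueue name to the ResourceFlavor names it can reach.

    Hypothetical helper: ``local_queues`` is the list of LocalQueue objects in
    one project namespace, ``cluster_queues`` maps ClusterQueue name to object.
    """
    per_queue = {}
    for lq in local_queues:
        cq = cluster_queues.get(lq["spec"]["clusterQueue"])
        flavors = set()
        if cq is not None:  # a dangling reference contributes no flavors
            for group in cq["spec"].get("resourceGroups", []):
                for flavor in group.get("flavors", []):
                    flavors.add(flavor["name"])
        per_queue[lq["metadata"]["name"]] = flavors
    return per_queue


# Sample objects, trimmed to the fields the walk reads (names are made up).
local_queues = [
    {"metadata": {"name": "pool-a"}, "spec": {"clusterQueue": "cq-a"}},
    {"metadata": {"name": "pool-b"}, "spec": {"clusterQueue": "cq-b"}},
]
cluster_queues = {
    "cq-a": {"spec": {"resourceGroups": [{"flavors": [{"name": "cpu"}]}]}},
    "cq-b": {
        "spec": {
            "resourceGroups": [{"flavors": [{"name": "cpu"}, {"name": "gpu"}]}]
        }
    },
}

per_queue = reachable_flavors(local_queues, cluster_queues)
project_flavors = set().union(*per_queue.values())
```

Here `per_queue` corresponds to the mapping kept keyed by `LocalQueue` name for the per-queue filter, and `project_flavors` to the project's reachable flavors.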

## How a ResourceFlavor matches a node

A node matches a `ResourceFlavor` when both of these hold.

- **Labels:** every key/value in `ResourceFlavor.spec.nodeLabels` is present on the node with the same value.
Extra labels on the node are fine — the flavor's label set must be a subset of the node's labels.
- **Taints:** every taint on the node with effect `NoSchedule` or `NoExecute` is covered, either by a matching entry in `ResourceFlavor.spec.nodeTaints` or by a matching entry in `ResourceFlavor.spec.tolerations`.
Taints with effect `PreferNoSchedule` are soft and do not block matching.

Both rules mirror Kueue's own admission logic, so the view reflects exactly which nodes Kueue would dispatch work to for that flavor.

Cordoned nodes (`spec.unschedulable: true`) and nodes the metrics server cannot report on are dropped from the view regardless of flavor matching, because no useful capacity figure can be produced for them.
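
The label and taint rules can be sketched as a small Python predicate.
This is an illustration under stated assumptions, not the code Hopsworks or Kueue runs.
It assumes that a "matching entry" in `nodeTaints` means equal key, value, and effect, and that tolerations follow the standard Kubernetes semantics (`Exists` ignores the value, an empty effect matches any effect, an empty key with `Exists` matches every taint).
All function names and the sample flavor and nodes are hypothetical.

```python
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def labels_match(flavor_labels, node_labels):
    """Return True when every flavor label is on the node with the same value."""
    return all(node_labels.get(k) == v for k, v in flavor_labels.items())


def tolerates(toleration, taint):
    """Apply Kubernetes toleration semantics (assumed here) to one taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    same_key = toleration.get("key") == taint["key"]
    return same_key and toleration.get("value", "") == taint.get("value", "")


def taints_covered(flavor_spec, node_taints):
    """Return True when every hard node taint is covered by the flavor."""
    flavor_taints = flavor_spec.get("nodeTaints", [])
    tolerations = flavor_spec.get("tolerations", [])
    for taint in node_taints:
        if taint["effect"] not in HARD_EFFECTS:
            continue  # PreferNoSchedule is soft and never blocks matching
        listed = any(
            (t["key"], t.get("value", ""), t["effect"])
            == (taint["key"], taint.get("value", ""), taint["effect"])
            for t in flavor_taints
        )
        if not listed and not any(tolerates(t, taint) for t in tolerations):
            return False
    return True


def flavor_matches_node(flavor, node):
    """Combine the label rule and the taint rule for one flavor and one node."""
    spec = flavor["spec"]
    node_labels = node["metadata"].get("labels", {})
    node_taints = node["spec"].get("taints", [])
    return labels_match(spec.get("nodeLabels", {}), node_labels) and taints_covered(
        spec, node_taints
    )


flavor = {
    "spec": {
        "nodeLabels": {"pool": "gpu"},
        "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
    }
}
gpu_node = {
    "metadata": {"labels": {"pool": "gpu", "zone": "a"}},
    "spec": {"taints": [{"key": "nvidia.com/gpu", "effect": "NoSchedule"}]},
}
etl_node = {
    "metadata": {"labels": {"pool": "gpu"}},
    "spec": {"taints": [{"key": "dedicated", "value": "etl", "effect": "NoExecute"}]},
}

print(flavor_matches_node(flavor, gpu_node))  # True: extra label is fine
print(flavor_matches_node(flavor, etl_node))  # False: NoExecute taint uncovered
```

The sketch covers flavor matching only; the cordoned-node and metrics-server exclusions are separate checks applied regardless of the result.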

## Required RBAC

Hopsworks needs read access to the Kueue CRDs in order to walk the queue hierarchy.
The Hopsworks Helm chart ships a `ClusterRole` and binding that grant these permissions, so a default install needs no extra action.

If you are managing RBAC manually (e.g. an externally provisioned `hopsworks` service account), grant at least the following:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hopsworks-kueue-reader
rules:
  - apiGroups: ["kueue.x-k8s.io"]
    resources: ["localqueues", "clusterqueues", "resourceflavors"]
    verbs: ["get", "list"]
```

Bind this role to the service account Hopsworks runs as.
The walk uses `get` and `list` only; no `watch`, `create`, `update`, or `delete` is needed.

## Troubleshooting

The view surfaces several distinct situations.
Use the table below to map a symptom to a likely cause.

| Symptom | Likely cause |
| --- | --- |
| Node Resources sub-section is empty and the access notice says *"None of the queues available in this project currently match any nodes in the cluster."* | The project's LocalQueues resolve to flavors that don't match any node — check `ResourceFlavor.spec.nodeLabels` and `nodeTaints`/`tolerations` against the actual node labels and taints. |
| Node Resources lists every schedulable node and there is no Queue filter or Queue Resources sub-section | Kueue is not installed, the project namespace has no LocalQueues, or Hopsworks lacks the Kueue RBAC above. All three cases fall through to the legacy non-Kueue path, with no access notice. To distinguish: `kubectl get crd resourceflavors.kueue.x-k8s.io` (absent means Kueue isn't installed), then `kubectl auth can-i list localqueues.kueue.x-k8s.io -n <project-ns> --as=system:serviceaccount:<hopsworks-ns>:<hopsworks-sa>` (`no` means apply the `ClusterRole` and binding above). The `-n` flag is required because `LocalQueue` is namespaced; `--as=` requires the caller to have ServiceAccount impersonation rights (granted by `cluster-admin`). |
| A node you expect to see is missing | The node is cordoned, missing from the metrics server, or not matched by any reachable flavor. Check `kubectl describe node` for `Unschedulable: true` and confirm that the node's labels and taints satisfy the flavor rules above. |

## See also

- [Compute Resources Usage][compute-resources-usage] — the end-user view this configuration drives.
- [Kueue][kueue-details] — overview of the Kueue abstractions referenced above.
95 changes: 95 additions & 0 deletions docs/user_guides/projects/scheduling/compute_resources.md
@@ -0,0 +1,95 @@
---
description: Reading and filtering the Compute Resources Usage view
---

# Compute Resources Usage

## Introduction

The **Compute Resources Usage** card shows you how much capacity is currently available to your project on the cluster.
It is meant as a planning aid before submitting work that will consume cluster resources.
Numbers refresh automatically and reflect the live state of the nodes your project can schedule on.

The same card appears at the top of three pages, so you see it wherever you launch work:

- **Jobs** — above the job list.
- **Jupyter** — on the Jupyter overview, above the server controls.
- **Model Deployments** — above the deployments list.

Expand it to see a breakdown of resources per node, namespace, and queue.

![Compute Resources Usage view, fully expanded](../../../assets/images/guides/project/scheduler/compute_resources_usage.png)

## Reading the summary

The collapsed header shows three totals across all the nodes your project can reach: **Memory free**, **CPU free**, and **GPU free**.
"Free" on each node is its allocatable capacity minus the maximum of utilized and requested resources, and the header is the **sum** of those per-node figures.

These totals give you a sense of the cluster-wide capacity available to your project, but they do not tell you the size of the largest job you can launch.
A job runs on exactly one node, so the biggest job that will fit is bounded by the single node with the most free resources — not by the sum.
Always cross-check the **Node Resources** sub-section before sizing a heavy job: a header that reads *100 GB free* can hide the fact that no individual node has more than, say, 30 GB free, in which case a 50 GB job will not start anywhere.
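
The difference between the header total and the largest job that fits can be worked through in a few lines of Python.
The node names and memory figures below are made up to reproduce the 100 GB and 30 GB example; only the formula (allocatable minus the maximum of utilized and requested) comes from the description above.

```python
def node_free(allocatable, utilized, requested):
    """Return allocatable capacity minus the larger of utilized and requested."""
    return allocatable - max(utilized, requested)


# Hypothetical memory figures in GB: (allocatable, utilized, requested).
nodes = {
    "node-1": (64, 30, 34),
    "node-2": (64, 34, 20),
    "node-3": (48, 10, 23),
    "node-4": (32, 17, 12),
}

free = {name: node_free(*figures) for name, figures in nodes.items()}
header_total = sum(free.values())  # what the collapsed header shows: 100
largest_single_node = max(free.values())  # upper bound for one job: 30

job_gb = 50
fits = any(node_gb >= job_gb for node_gb in free.values())  # False
```

The header reads 100 GB free, yet the 50 GB job has nowhere to start because no single node offers more than 30 GB.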

Expanding the card reveals three sub-sections:

- **Node Resources** — per-node breakdown of free Memory, CPU, and GPU.
- **Namespace Resources** — quotas applied at the project's Kubernetes namespace level.
- **Queue Resources** — per-queue nominal and borrowable capacity from the Kueue queues you have access to.

## Filter nodes

Two filters sit above the node list: **Queue:** on the left, **Labels:** on the right.
By default both are inactive — Queue is set to *any* and Labels is empty — so the node list shows the **union** of every node reachable through any of your project's queues.

Use either filter on its own, or both together.
When both are active, a node is shown only if it passes *both* filters (intersection).

### Queue filter

Choose a queue from the **Queue:** dropdown to narrow the node list to just the nodes reachable through that queue.

![Queue dropdown listing the project's LocalQueues](../../../assets/images/guides/project/scheduler/compute_resources_usage_queue_dropdown.png)

The options are:

- **any** (default) — every node reachable through *any* of your queues.
- The name of each queue your project has access to — only the nodes reachable through that one queue.

Picking a specific queue shrinks the node list to just the nodes Kueue would actually dispatch to for jobs submitted to that queue.

![Node Resources filtered to the other queue](../../../assets/images/guides/project/scheduler/compute_resources_usage_filtered.png)

The Queue Resources sub-section below is unaffected by this filter — it always lists every queue you have access to.

### Labels filter

Pick one or more labels in the **Labels:** dropdown to narrow the node list to nodes that carry every selected label.
The dropdown is populated from the labels your project administrator has made available; if no labels are configured for the project, the list is empty.

The Queue and Labels filters compose: with Queue set to *pool-a* and Labels set to `tier:workload`, the view shows only nodes that pool-a can reach *and* that carry `tier:workload`.
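
The way the two filters compose can be sketched as set operations in Python.
This is an illustration only, not the UI's code: the function `visible_nodes`, the data shapes, and the sample node and queue names are assumptions, and labels are written as `key:value` strings to match the example above.

```python
def visible_nodes(nodes, queue_nodes, queue=None, labels=()):
    """Return the sorted node names that pass both filters.

    ``nodes`` maps node name to its set of ``key:value`` labels.
    ``queue_nodes`` maps queue name to the node names reachable through it.
    ``queue=None`` stands for the *any* option.
    """
    if queue is None:
        # Default view: union of every node reachable through any queue.
        reachable = set().union(*queue_nodes.values())
    else:
        reachable = queue_nodes.get(queue, set())
    wanted = set(labels)
    return sorted(
        name
        for name, node_labels in nodes.items()
        if name in reachable and wanted <= node_labels  # intersection of filters
    )


nodes = {
    "node-1": {"tier:workload", "zone:a"},
    "node-2": {"tier:system"},
    "node-3": {"tier:workload"},
}
queue_nodes = {"pool-a": {"node-1", "node-2"}, "pool-b": {"node-3"}}

everything = visible_nodes(nodes, queue_nodes)
narrowed = visible_nodes(nodes, queue_nodes, queue="pool-a", labels=["tier:workload"])
```

With both filters inactive, `everything` holds all three nodes; with Queue set to *pool-a* and Labels set to `tier:workload`, `narrowed` holds only `node-1`.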

## The access notice

When Kueue is configured and your project has at least one LocalQueue, an info icon appears next to **Node Resources**.
Hover it to see one of two messages.

- **"Reachable through the queues available in this project.
See Queue Resources below for the list."**
This is the normal case — the listed nodes are the ones your queues route work to.
The Queue Resources sub-section names each queue, so you can cross-check which queue claims which capacity.

- **"None of the queues available in this project currently match any nodes in the cluster."**
Your project has queues, but none of them currently resolve to any nodes in the cluster.
This typically means the queue's underlying configuration (resource flavor) is looking for nodes that don't exist, or all matching nodes are unschedulable.
Ask your administrator to review the queue configuration.

## When Kueue is not in use

If the cluster is not running Kueue, or your project has no LocalQueues at all, the Node Resources sub-section lists every schedulable node in the cluster instead.
There is no Queue filter, no access notice, and no Queue Resources sub-section in that case.
Jobs run through the standard Kubernetes scheduler rather than a queue.

## See also

- Administrators: see [Configure the Compute Resources Usage view][configure-the-compute-resources-usage-view] for the underlying queue → node mapping and the cluster role permissions required for this view to work.
- [Kueue][kueue-details] — overview of Kueue's abstractions (ResourceFlavor, ClusterQueue, LocalQueue) used by Hopsworks.

2 changes: 1 addition & 1 deletion docs/user_guides/projects/scheduling/kueue_details.md
@@ -2,7 +2,7 @@
description: Kueue abstractions
---

# Kueue
# Kueue { #kueue-details }

## Introduction

2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -177,6 +177,7 @@ nav:
- Kubernetes Scheduling:
- Base: user_guides/projects/scheduling/kube_scheduler.md
- Kueue: user_guides/projects/scheduling/kueue_details.md
- Compute Resources Usage: user_guides/projects/scheduling/compute_resources.md

- Airflow: user_guides/projects/airflow/airflow.md
- OpenSearch:
@@ -257,6 +258,7 @@ nav:
- Configure Alerts: setup_installation/admin/alert.md
- IAM Role Chaining: setup_installation/admin/roleChaining.md
- Configure Project Mapping: setup_installation/admin/configure-project-mapping.md
- Configure Compute Resources Usage View: setup_installation/admin/compute_resources.md
- Monitoring:
- Services Dashboards: setup_installation/admin/monitoring/grafana.md
- Export metrics: setup_installation/admin/monitoring/export-metrics.md