diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 8bbf2dbc3a..7992df94d4 100644
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -6,11 +6,14 @@
 uv venv && uv pip install -r requirements-docs.txt # setup
 uv pip install "git+https://github.com/logicalclocks/hopsworks-api.git@main#subdirectory=python" # install Python API (needed for API docs section)
 touch docs/javadoc; uv run mkdocs build -s; rm docs/javadoc # build (strict)
-uv run mkdocs serve # preview with live reload
+uv run mike deploy latest --update-alias # build a versioned bundle to the gh-pages worktree (use repo's current version, e.g. 4.4); first time only: `uv run mike set-default latest`
+uv run mike serve # serve the gh-pages worktree locally (preview); does NOT live-reload from source — re-run `mike deploy` after edits
 npx markdownlint-cli2 "**/*.md" # lint Markdown (requires Node.js)
 uv tool install md-snakeoil && snakeoil --line-length 88 --rules "E,F,B,C4,ISC,PIE,PYI,Q,RSE,RET,SIM,TC,I,W,D2,D3,D4,INP,UP,FA" docs # lint Python code blocks
 ```
 
+`uv run mkdocs serve` is available too, but its livereload watcher does not fire rebuilds in this repo's plugin combination on macOS — `mike serve` is the canonical preview tool per the repo README.
+
 ## Rules
 
 - One sentence per line in all Markdown prose
diff --git a/.claude/docs/README.md b/.claude/docs/README.md
index 64e4a9f278..e60a1223f7 100644
--- a/.claude/docs/README.md
+++ b/.claude/docs/README.md
@@ -8,12 +8,16 @@ There is no application code — all work is writing Markdown under `docs/` and
 ```bash
 uv venv && uv pip install -r requirements-docs.txt # setup
 uv pip install "git+https://github.com/logicalclocks/hopsworks-api.git@main#subdirectory=python" # needed for Python API section
-touch docs/javadoc; uv run mkdocs serve; rm docs/javadoc # preview with live reload
 touch docs/javadoc; uv run mkdocs build -s; rm docs/javadoc # build in strict mode
+uv run mike deploy latest --update-alias # build a versioned bundle (use repo's current version, e.g. 4.4); first time only: `uv run mike set-default latest`
+uv run mike serve # serve the versioned bundle locally (canonical preview, per repo README); no source live-reload — re-run `mike deploy` after edits
 npx markdownlint-cli2 "**/*.md" # lint Markdown (requires Node.js)
 uv tool install md-snakeoil && snakeoil --line-length 88 --rules "E,F,B,C4,ISC,PIE,PYI,Q,RSE,RET,SIM,TC,I,W,D2,D3,D4,INP,UP,FA" docs # lint Python code blocks
 ```
 
+`uv run mkdocs serve` is available as a static dev server, but its livereload watcher does not fire rebuilds in this repo's plugin combination on macOS.
+Use `mike serve` for previews.
+
 `docs/javadoc` is a directory generated by CI from the `hopsworks-api` Java source.
 Locally it must exist as a stub (`touch`) for the build to pass.
diff --git a/docs/assets/images/guides/project/scheduler/compute_resources_usage.png b/docs/assets/images/guides/project/scheduler/compute_resources_usage.png
new file mode 100644
index 0000000000..081157962b
Binary files /dev/null and b/docs/assets/images/guides/project/scheduler/compute_resources_usage.png differ
diff --git a/docs/assets/images/guides/project/scheduler/compute_resources_usage_filtered.png b/docs/assets/images/guides/project/scheduler/compute_resources_usage_filtered.png
new file mode 100644
index 0000000000..7ae56560e8
Binary files /dev/null and b/docs/assets/images/guides/project/scheduler/compute_resources_usage_filtered.png differ
diff --git a/docs/assets/images/guides/project/scheduler/compute_resources_usage_queue_dropdown.png b/docs/assets/images/guides/project/scheduler/compute_resources_usage_queue_dropdown.png
new file mode 100644
index 0000000000..9153050538
Binary files /dev/null and b/docs/assets/images/guides/project/scheduler/compute_resources_usage_queue_dropdown.png differ
diff --git a/docs/setup_installation/admin/compute_resources.md b/docs/setup_installation/admin/compute_resources.md
new file mode 100644
index 0000000000..c0d6ba43bf
--- /dev/null
+++ b/docs/setup_installation/admin/compute_resources.md
@@ -0,0 +1,78 @@
+---
+description: Cluster configuration for the per-project Compute Resources Usage view
+---
+
+# Configure the Compute Resources Usage view
+
+## Introduction
+
+This page is for cluster administrators.
+It explains how the per-project node list in the **Compute Resources Usage** card is derived, and the RBAC required for that derivation to work.
+For the end-user view of the same card, see [Compute Resources Usage][compute-resources-usage].
+
+When Kueue is installed, the card limits the node list to nodes a given project can actually schedule on.
+The mapping is driven entirely by standard Kueue objects: **LocalQueue → ClusterQueue → ResourceFlavor**.
+There is no Hopsworks-specific configuration on top.
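The LocalQueue → ClusterQueue → ResourceFlavor chain can be illustrated with a small sketch.
It is not Hopsworks' implementation: plain dictionaries stand in for the Kueue objects, and the function name `reachable_flavors` and every queue and flavor name are hypothetical.
Only the field paths `spec.clusterQueue` and `spec.resourceGroups[].flavors[].name` are taken from Kueue.

```python
# Illustrative sketch only: plain dicts stand in for Kueue objects.
# The function name and all queue/flavor names are hypothetical.


def reachable_flavors(local_queues, cluster_queues):
    """Map each LocalQueue name to the ResourceFlavor names it can reach."""
    by_queue = {}
    for lq in local_queues:
        cq = cluster_queues.get(lq["spec"]["clusterQueue"])
        flavors = set()
        # Assumption: a LocalQueue pointing at a missing ClusterQueue
        # simply reaches no flavors.
        if cq is not None:
            for group in cq["spec"].get("resourceGroups", []):
                for flavor in group.get("flavors", []):
                    flavors.add(flavor["name"])
        by_queue[lq["metadata"]["name"]] = flavors
    return by_queue


local_queues = [
    {"metadata": {"name": "pool-a"}, "spec": {"clusterQueue": "cq-cpu"}},
    {"metadata": {"name": "pool-b"}, "spec": {"clusterQueue": "cq-gpu"}},
]
cluster_queues = {
    "cq-cpu": {"spec": {"resourceGroups": [{"flavors": [{"name": "cpu-small"}]}]}},
    "cq-gpu": {
        "spec": {
            "resourceGroups": [
                {"flavors": [{"name": "gpu-a100"}, {"name": "cpu-small"}]}
            ]
        }
    },
}

# Per-queue sets back a per-queue filter; their union is the project-wide set.
per_queue = reachable_flavors(local_queues, cluster_queues)
project_flavors = set().union(*per_queue.values())
print(per_queue)
print(sorted(project_flavors))
```

Keeping the result keyed by LocalQueue name and taking the union afterwards means one walk can serve both a project-wide node list and a single-queue view.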
+
+## How project → node visibility is derived
+
+For each project, Hopsworks walks the queue hierarchy bound to the project's Kubernetes namespace.
+
+- Start from every `LocalQueue` in the project's namespace.
+- Follow each `LocalQueue.spec.clusterQueue` to its `ClusterQueue`.
+- For each `ClusterQueue`, collect the `ResourceFlavor`s named in `spec.resourceGroups[].flavors[].name`.
+
+The resulting set of `ResourceFlavor`s is the project's "reachable flavors".
+A node is included in the project's Node Resources view only if it matches at least one reachable flavor.
+
+The per-queue node filter in the UI is built from the same walk, but kept keyed by `LocalQueue` name so users can narrow the view to a single queue.
+
+## How a ResourceFlavor matches a node
+
+A node matches a `ResourceFlavor` when both of these hold.
+
+- **Labels:** every key/value in `ResourceFlavor.spec.nodeLabels` is present on the node with the same value.
+  Extra labels on the node are fine — the flavor's label set must be a subset of the node's labels.
+- **Taints:** every taint on the node with effect `NoSchedule` or `NoExecute` is covered, either by a matching entry in `ResourceFlavor.spec.nodeTaints` or by a matching entry in `ResourceFlavor.spec.tolerations`.
+  Taints with effect `PreferNoSchedule` are soft and do not block matching.
+
+Both rules mirror Kueue's own admission logic, so the view reflects exactly which nodes Kueue would dispatch work to for that flavor.
+
+Cordoned nodes (`spec.unschedulable: true`) and nodes the metrics server cannot report on are dropped from the view regardless of flavor matching, because no useful capacity figure can be produced for them.
+
+## Required RBAC
+
+Hopsworks needs read access to the Kueue CRDs in order to walk the queue hierarchy.
+The Hopsworks Helm chart ships a `ClusterRole` and binding that grant these permissions, so a default install needs no extra action.
+
+If you are managing RBAC manually (e.g. an externally provisioned `hopsworks` service account), grant at least the following:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: hopsworks-kueue-reader
+rules:
+  - apiGroups: ["kueue.x-k8s.io"]
+    resources: ["localqueues", "clusterqueues", "resourceflavors"]
+    verbs: ["get", "list"]
+```
+
+Bind this role to the service account Hopsworks runs as.
+The walk uses `get` and `list` only; no `watch`, `create`, `update`, or `delete` is needed.
+
+## Troubleshooting
+
+The view surfaces several distinct situations.
+Use the table below to map a symptom to a likely cause.
+
+| Symptom | Likely cause |
+| --- | --- |
+| Node Resources sub-section is empty and the access notice says *"None of the queues available in this project currently match any nodes in the cluster."* | The project's LocalQueues resolve to flavors that don't match any node — check `ResourceFlavor.spec.nodeLabels` and `nodeTaints`/`tolerations` against the actual node labels and taints. |
+| Node Resources lists every schedulable node and there is no Queue filter or Queue Resources sub-section | Kueue is not installed, the project namespace has no LocalQueues, or Hopsworks lacks the Kueue RBAC above. All three cases fall through to the legacy non-Kueue path, with no access notice. To distinguish: `kubectl get crd resourceflavors.kueue.x-k8s.io` (absent means Kueue isn't installed), then `kubectl auth can-i list localqueues.kueue.x-k8s.io -n --as=system:serviceaccount::` (`no` means apply the `ClusterRole` and binding above). The `-n` flag is required because `LocalQueue` is namespaced; `--as=` requires the caller to have ServiceAccount impersonation rights (granted by `cluster-admin`). |
+| A node you expect to see is missing | The node is either cordoned, missing from the metrics server, or not matched by any reachable flavor — check `kubectl describe node` for `Unschedulable: true` and confirm node labels/taints satisfy the flavor rules above. |
+
+## See also
+
+- [Compute Resources Usage][compute-resources-usage] — the end-user view this configuration drives.
+- [Kueue][kueue-details] — overview of the Kueue abstractions referenced above.
diff --git a/docs/user_guides/projects/scheduling/compute_resources.md b/docs/user_guides/projects/scheduling/compute_resources.md
new file mode 100644
index 0000000000..956a26da28
--- /dev/null
+++ b/docs/user_guides/projects/scheduling/compute_resources.md
@@ -0,0 +1,95 @@
+---
+description: Reading and filtering the Compute Resources Usage view
+---
+
+# Compute Resources Usage
+
+## Introduction
+
+The **Compute Resources Usage** card shows you how much capacity is currently available to your project on the cluster.
+It is meant as a planning aid before submitting work that will consume cluster resources.
+Numbers refresh automatically and reflect the live state of the nodes your project can schedule on.
+
+The same card appears at the top of three pages, so you see it wherever you launch work:
+
+- **Jobs** — above the job list.
+- **Jupyter** — on the Jupyter overview, above the server controls.
+- **Model Deployments** — above the deployments list.
+
+Expand it to see a breakdown of resources per node, namespace, and queue.
+
+![Compute Resources Usage view, fully expanded](../../../assets/images/guides/project/scheduler/compute_resources_usage.png)
+
+## Reading the summary
+
+The collapsed header shows three totals across all the nodes your project can reach: **Memory free**, **CPU free**, and **GPU free**.
+"Free" on each node is its allocatable capacity minus the maximum of utilized and requested resources, and the header is the **sum** of those per-node figures.
+
+These totals give you a sense of the cluster-wide capacity available to your project, but they do not tell you the size of the largest job you can launch.
+A job runs on exactly one node, so the biggest job that will fit is bounded by the single node with the most free resources — not by the sum.
+Always cross-check the **Node Resources** sub-section before sizing a heavy job: a header that reads *100 GB free* can hide the fact that no individual node has more than, say, 30 GB free, in which case a 50 GB job will not start anywhere.
+
+Expanding the card reveals three sub-sections:
+
+- **Node Resources** — per-node breakdown of free Memory, CPU, and GPU.
+- **Namespace Resources** — quotas applied at the project's Kubernetes namespace level.
+- **Queue Resources** — per-queue nominal and borrowable capacity from the Kueue queues you have access to.
+
+## Filter nodes
+
+Two filters sit above the node list: **Queue:** on the left, **Labels:** on the right.
+By default both are inactive — Queue is set to *any* and Labels is empty — so the node list shows the **union** of every node reachable through any of your project's queues.
+
+Use either filter on its own, or both together.
+When both are active, a node is shown only if it passes *both* filters (intersection).
+
+### Queue filter
+
+Choose a queue from the **Queue:** dropdown to narrow the node list to just the nodes reachable through that queue.
+
+![Queue dropdown listing the project's LocalQueues](../../../assets/images/guides/project/scheduler/compute_resources_usage_queue_dropdown.png)
+
+The options are:
+
+- **any** (default) — every node reachable through *any* of your queues.
+- The name of each queue your project has access to — only the nodes reachable through that one queue.
+
+Picking a specific queue shrinks the node list to just the nodes Kueue would actually dispatch to for jobs submitted to that queue.
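The filter semantics and the sum-versus-largest-node caveat can be sketched together.
This is an illustration under stated assumptions, not the UI's code: the `Node` class, the `visible_nodes` function, and all node, queue, and label values are hypothetical.
The input list is assumed to already be the set of nodes the project can reach.

```python
# Illustrative sketch only; every name and number here is hypothetical.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_gb: float  # free memory reported for this node
    queues: set = field(default_factory=set)  # queues that reach the node
    labels: set = field(default_factory=set)  # e.g. {"tier:workload"}


def visible_nodes(nodes, queue="any", labels=()):
    """Apply the Queue and Labels filters; a node must pass both."""
    wanted = set(labels)
    return [
        n
        for n in nodes
        if (queue == "any" or queue in n.queues) and wanted <= n.labels
    ]


nodes = [
    Node("node-1", 30.0, {"pool-a"}, {"tier:workload"}),
    Node("node-2", 30.0, {"pool-a", "pool-b"}, set()),
    Node("node-3", 40.0, {"pool-b"}, {"tier:workload"}),
]

everything = visible_nodes(nodes)  # both filters inactive
pool_a = visible_nodes(nodes, queue="pool-a")
both = visible_nodes(nodes, queue="pool-a", labels=["tier:workload"])

header_total = sum(n.free_gb for n in everything)  # header-style sum
largest_fit = max(n.free_gb for n in everything)  # bound for a single job
print(header_total, largest_fit)
print([n.name for n in pool_a], [n.name for n in both])
```

With these made-up figures the summed free memory is 100 GB while the largest single node offers 40 GB, so a 50 GB job would not fit on any node even though the total looks sufficient.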
+
+![Node Resources filtered to the other queue](../../../assets/images/guides/project/scheduler/compute_resources_usage_filtered.png)
+
+The Queue Resources sub-section below is unaffected by this filter — it always lists every queue you have access to.
+
+### Labels filter
+
+Pick one or more labels in the **Labels:** dropdown to narrow the node list to nodes that carry every selected label.
+The dropdown is populated from the labels your project administrator has made available; if no labels are configured for the project, the list is empty.
+
+The Queue and Labels filters compose: with Queue set to *pool-a* and Labels set to `tier:workload`, the view shows only nodes that pool-a can reach *and* that carry `tier:workload`.
+
+## The access notice
+
+When Kueue is configured and your project has at least one LocalQueue, an info icon appears next to **Node Resources**.
+Hover it to see one of two messages.
+
+- **"Reachable through the queues available in this project.
+  See Queue Resources below for the list."**
+  This is the normal case — the listed nodes are the ones your queues route work to.
+  The Queue Resources sub-section names each queue, so you can cross-check which queue claims which capacity.
+
+- **"None of the queues available in this project currently match any nodes in the cluster."**
+  Your project has queues, but none of them currently resolve to any nodes in the cluster.
+  This typically means the queue's underlying configuration (resource flavor) is looking for nodes that don't exist, or all matching nodes are unschedulable.
+  Ask your administrator to review the queue configuration.
+
+## When Kueue is not in use
+
+If the cluster is not running Kueue, or your project has no LocalQueues at all, the Node Resources sub-section lists every schedulable node in the cluster instead.
+There is no Queue filter, no access notice, and no Queue Resources sub-section in that case.
+Jobs run through the standard Kubernetes scheduler rather than a queue.
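The three situations described above (normal, no matching nodes, Kueue not in use) amount to a small decision rule.
The sketch below restates it; the function name `node_resources_view` and its parameters are hypothetical and do not describe the actual Hopsworks code, while the notice strings are quoted from this page.

```python
# Illustrative sketch only; function and parameter names are hypothetical.
REACHABLE = (
    "Reachable through the queues available in this project. "
    "See Queue Resources below for the list."
)
NO_MATCH = (
    "None of the queues available in this project currently match "
    "any nodes in the cluster."
)


def node_resources_view(kueue_running, local_queue_count, matched, schedulable):
    """Return (nodes_to_list, access_notice) for the Node Resources view."""
    if not kueue_running or local_queue_count == 0:
        # Kueue not in use: every schedulable node, and no access notice.
        return list(schedulable), None
    if matched:
        return list(matched), REACHABLE
    return [], NO_MATCH


all_nodes = ["node-1", "node-2", "node-3"]
print(node_resources_view(True, 2, ["node-1"], all_nodes))
print(node_resources_view(True, 2, [], all_nodes))
print(node_resources_view(False, 0, [], all_nodes))
```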
+
+## See also
+
+- Administrators: see [Configure the Compute Resources Usage view][configure-the-compute-resources-usage-view] for the underlying queue → node mapping and the cluster role permissions required for this view to work.
+- [Kueue][kueue-details] — overview of Kueue's abstractions (ResourceFlavor, ClusterQueue, LocalQueue) used by Hopsworks.
+
diff --git a/docs/user_guides/projects/scheduling/kueue_details.md b/docs/user_guides/projects/scheduling/kueue_details.md
index c4965ee906..6076fb85ab 100644
--- a/docs/user_guides/projects/scheduling/kueue_details.md
+++ b/docs/user_guides/projects/scheduling/kueue_details.md
@@ -2,7 +2,7 @@
 description: Kueue abstractions
 ---
 
-# Kueue
+# Kueue { #kueue-details }
 
 ## Introduction
 
diff --git a/mkdocs.yml b/mkdocs.yml
index eba092bf49..91b9780a7c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -177,6 +177,7 @@ nav:
 - Kubernetes Scheduling:
   - Base: user_guides/projects/scheduling/kube_scheduler.md
   - Kueue: user_guides/projects/scheduling/kueue_details.md
+  - Compute Resources Usage: user_guides/projects/scheduling/compute_resources.md
 - Airflow: user_guides/projects/airflow/airflow.md
 - OpenSearch:
@@ -257,6 +258,7 @@ nav:
 - Configure Alerts: setup_installation/admin/alert.md
 - IAM Role Chaining: setup_installation/admin/roleChaining.md
 - Configure Project Mapping: setup_installation/admin/configure-project-mapping.md
+- Configure Compute Resources Usage View: setup_installation/admin/compute_resources.md
 - Monitoring:
   - Services Dashboards: setup_installation/admin/monitoring/grafana.md
   - Export metrics: setup_installation/admin/monitoring/export-metrics.md