
chore(k8s): configure prod overlay for quantms-ddalfq deployment #19

Merged
t0mdavid-m merged 2 commits into main from setup_deployment
Apr 27, 2026

Conversation


t0mdavid-m (Member) commented Apr 26, 2026

Wire the prod Kustomize overlay to the OpenMS Traefik cluster:

  • namePrefix/commonLabels: quantms-ddalfq
  • image: ghcr.io/openms/quantms-web:main-full
  • IngressRoute: opendda.webapps.openms.{de,org}
  • Redis URL pointed at quantms-ddalfq-redis
  • memory-tier-high component (heavy DIA/LFQ workloads)
  • Workspace PVC sized to 3Ti
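
For orientation, a minimal sketch of what an overlay wired this way could look like (a sketch only: patch shapes, the base image name, and base resource names are assumptions; only the values listed above come from this PR):

```yaml
# k8s/overlays/prod/kustomization.yaml -- illustrative sketch, not the merged file
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/memory-tier-high   # heavy DIA/LFQ sizing
namePrefix: quantms-ddalfq-
commonLabels:
  app: quantms-ddalfq
images:
  - name: openms-streamlit              # base image name assumed
    newName: ghcr.io/openms/quantms-web
    newTag: main-full
patches:
  - target:
      kind: IngressRoute
      name: streamlit                   # base resource name assumed
    patch: |
      - op: replace
        path: /spec/routes/0/match
        value: (Host(`opendda.webapps.openms.de`) || Host(`opendda.webapps.openms.org`)) && PathPrefix(`/`)
```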

Summary by CodeRabbit

Release Notes

  • Chores
    • Increased workspace storage capacity to 3 terabytes
    • Updated production environment configuration with new service names and hostnames
    • Modified Redis connection settings for production deployments
    • Enhanced production memory tier configuration

Wire the prod Kustomize overlay to the OpenMS Traefik cluster:
- namePrefix/commonLabels: quantms-ddalfq
- image: ghcr.io/openms/quantms-web:main-full
- IngressRoute: opendda.webapps.openms.{de,org}
- Redis URL pointed at quantms-ddalfq-redis
- memory-tier-high component (heavy DIA/LFQ workloads)
- Workspace PVC sized to 3Ti

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@t0mdavid-m has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 56 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 18 minutes and 56 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 40d63c69-74ca-4ef7-a072-4e22849899f6

📥 Commits

Reviewing files that changed from the base of the PR and between 7ca8dbe and cb1bd20.

📒 Files selected for processing (1)
  • .github/workflows/build-and-test.yml
📝 Walkthrough

Updates Kubernetes manifests to increase workspace storage capacity from 500Gi to 3Ti and reconfigure the production overlay environment with new application naming, memory tier settings, image references, and service connections.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Kubernetes Base Configuration<br>`k8s/base/workspace-pvc.yaml` | Increases PersistentVolumeClaim storage capacity from 500Gi to 3Ti. |
| Production Deployment Configuration<br>`k8s/overlays/prod/kustomization.yaml` | Migrates the prod overlay from template-app to quantms-ddalfq naming, upgrades the memory tier to high, updates the image reference to ghcr.io/openms/quantms-web, changes IngressRoute hostnames from template.webapps.openms.* to opendda.webapps.openms.*, and updates Redis service connections in the streamlit and rq-worker deployments. |

Poem

🐰 From template's days to quantms new,
Storage swells to three tiers through,
Redis redirects with proper care,
Production infrastructure, debonair!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and concisely summarizes the main change: configuring the production Kustomize overlay for the quantms-ddalfq deployment, which aligns with the primary objectives of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |
| Linked Issues Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
k8s/overlays/prod/kustomization.yaml (3)

31-44: Positional env/0 JSON patch is brittle — prefer a strategic merge by env name.

Both Redis patches target /spec/template/spec/containers/0/env/0/value, which silently depends on REDIS_URL remaining the first entry in the base deployments' env list (k8s/base/streamlit-deployment.yaml and k8s/base/rq-worker-deployment.yaml). If anyone reorders env vars in base, these patches will quietly overwrite the wrong variable instead of failing. A strategic merge patch keyed by name: REDIS_URL is safer and self-documenting.

♻️ Proposed refactor
-  - target:
-      kind: Deployment
-      name: streamlit
-    patch: |
-      - op: replace
-        path: /spec/template/spec/containers/0/env/0/value
-        value: "redis://quantms-ddalfq-redis:6379/0"
-  - target:
-      kind: Deployment
-      name: rq-worker
-    patch: |
-      - op: replace
-        path: /spec/template/spec/containers/0/env/0/value
-        value: "redis://quantms-ddalfq-redis:6379/0"
+  - patch: |-
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: streamlit
+      spec:
+        template:
+          spec:
+            containers:
+              - name: streamlit
+                env:
+                  - name: REDIS_URL
+                    value: "redis://quantms-ddalfq-redis:6379/0"
+  - patch: |-
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: rq-worker
+      spec:
+        template:
+          spec:
+            containers:
+              - name: rq-worker
+                env:
+                  - name: REDIS_URL
+                    value: "redis://quantms-ddalfq-redis:6379/0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 31 - 44, The JSON patches
currently replace /spec/template/spec/containers/0/env/0/value which is brittle;
update the two patches for the Deployment targets named streamlit and rq-worker
to use a strategic merge patch (or patchStrategicMerge) that matches the env
entry by name "REDIS_URL" and sets its value to
"redis://quantms-ddalfq-redis:6379/0" instead of using positional env/0; locate
the patches referencing those Deployment names (streamlit, rq-worker) and
replace the positional JSON patch with a strategic merge fragment that contains
the container env entry { name: REDIS_URL, value:
"redis://quantms-ddalfq-redis:6379/0" } so the patch is resilient to env
reordering.

12-13: Migrate commonLabels to the labels field for full Kustomize compatibility.

commonLabels is deprecated in recent Kustomize releases (v5.3+) and generates warnings. While still functional, migrating to the labels field with explicit control is recommended. Use kustomize edit fix to automate this migration, or manually update as shown below.

♻️ Suggested migration (to preserve current behavior)
-commonLabels:
-  app: quantms-ddalfq
+labels:
+  - pairs:
+      app: quantms-ddalfq
+    includeSelectors: true
+    includeTemplates: true

Set both includeSelectors: true and includeTemplates: true to maintain full parity with the original commonLabels behavior (labels applied to metadata, selectors, and templates). Omit includeTemplates or set includeSelectors: false if you only want labels added to metadata without selector injection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 12 - 13, Replace the
top-level commonLabels mapping with a new labels block: remove the commonLabels:
app: quantms-ddalfq entry and create labels containing includeSelectors: true
and includeTemplates: true plus the label mapping (app: quantms-ddalfq) so the
label is applied to metadata, selectors and templates; use the symbols
commonLabels, labels, includeSelectors and includeTemplates to find and update
the kustomization YAML.

21-30: Hardcoded quantms-ddalfq-streamlit couples the IngressRoute patch to namePrefix.

Since Traefik's IngressRoute is a CRD, Kustomize's default nameReference transformer doesn't rewrite the services[].name field, so the explicit value here is necessary. However, it now has to be kept in sync manually with namePrefix (Line 10)—changing one without the other will silently route to a non-existent service.

Consider either:

  1. Adding a short comment noting the coupling, or
  2. Registering a configurations: entry with a nameReference for IngressRoute.spec.routes[].services[].name so Kustomize rewrites it automatically (the standard pattern for custom resources without builtin support).
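
A sketch of option 2, assuming the standard transformer-config layout (the file name is hypothetical):

```yaml
# k8s/overlays/prod/ingressroute-namereference.yaml (hypothetical file)
nameReference:
  - kind: Service
    version: v1
    fieldSpecs:
      - kind: IngressRoute
        path: spec/routes/services/name
```

registered in the overlay with:

```yaml
configurations:
  - ingressroute-namereference.yaml
```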
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 21 - 30, The IngressRoute
patch hardcodes the service name ("quantms-ddalfq-streamlit") which must stay in
sync with the kustomize namePrefix; replace this brittle approach by registering
a nameReference in kustomize configurations so Kustomize rewrites
IngressRoute.spec.routes[].services[].name automatically (or at minimum add a
clear inline comment noting the coupling). Update the kustomization to include a
configurations: entry that maps the custom resource kind IngressRoute and the
field spec.routes[].services[].name as a nameReference, or add a short comment
next to the patch mentioning it must match namePrefix whenever namePrefix
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e7abc86-f080-482d-87a1-21efd0028c6b

📥 Commits

Reviewing files that changed from the base of the PR and between 076351b and 7ca8dbe.

📒 Files selected for processing (2)
  • k8s/base/workspace-pvc.yaml
  • k8s/overlays/prod/kustomization.yaml

k8s/base/workspace-pvc.yaml:

   resources:
     requests:
-      storage: 500Gi
+      storage: 3Ti

⚠️ Potential issue | 🟠 Major

Scope drift: this base PVC change applies to all overlays, not just prod.

Line 11 raises the base request to 3Ti, which affects every environment inheriting k8s/base. The PR objective is prod-only sizing, so this should be done as an overlay patch in production to avoid unintended cost/capacity impact elsewhere.

Suggested direction
# k8s/base/workspace-pvc.yaml
-      storage: 3Ti
+      storage: 500Gi
# k8s/overlays/prod/workspace-pvc-patch.yaml (new)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspaces-pvc
spec:
  resources:
    requests:
      storage: 3Ti
# k8s/overlays/prod/kustomization.yaml
patchesStrategicMerge:
  - workspace-pvc-patch.yaml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/base/workspace-pvc.yaml` at line 11, The base PersistentVolumeClaim
"workspaces-pvc" was changed to request 3Ti at spec.resources.requests.storage,
which will affect all environments; instead revert that change in the base and
add a prod-only strategic merge patch that updates "workspaces-pvc" to storage:
3Ti in the prod overlay. Create a new overlay patch (e.g.,
workspace-pvc-patch.yaml) containing the PersistentVolumeClaim metadata name:
workspaces-pvc and spec.resources.requests.storage: 3Ti, then reference it from
the prod kustomization under patchesStrategicMerge so only the prod overlay
receives the increased size.

The kind-cluster integration jobs (test-nginx, test-traefik) still
referenced the template's slug and Traefik hostnames, so kubectl wait
selectors and curl Host headers no longer matched after the prod
overlay was renamed. Replace template-app -> quantms-ddalfq and
template.webapps.openms.{de,org} -> opendda.webapps.openms.{de,org}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
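
A sketch of the kind of assertion the commit touches (step layout, label selector, and port-forward address are assumptions; the slug and hostnames come from the message above):

```yaml
- name: Wait for the renamed deployments and curl both hosts
  run: |
    kubectl -n openms wait --for=condition=available --timeout=300s \
      deploy -l app=quantms-ddalfq
    curl -fsS -H "Host: opendda.webapps.openms.de"  http://127.0.0.1:8080/ > /dev/null
    curl -fsS -H "Host: opendda.webapps.openms.org" http://127.0.0.1:8080/ > /dev/null
```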
t0mdavid-m merged commit 93ab8ec into main Apr 27, 2026
9 checks passed
t0mdavid-m added a commit that referenced this pull request Apr 27, 2026
* Add Matomo analytics integration with GDPR consent support (#341)

* Add Matomo Tag Manager as third analytics tracking mode

Adds Matomo Tag Manager support alongside existing Google Analytics and
Piwik Pro integrations. Includes settings.json configuration (url + tag),
build-time script injection via hook-analytics.py, Klaro GDPR consent
banner integration, and runtime consent granting via MTM data layer API.

https://claude.ai/code/session_0165AXHkmRZ6bx23n7Tbyz8h

* Fix Matomo Tag Manager snippet to match official docs

- Accept full container JS URL instead of separate url + tag fields,
  supporting both self-hosted and Matomo Cloud URL patterns
- Match the official snippet: var _mtm alias, _mtm.push shorthand
- Remove redundant type="text/javascript" attribute
- Remove unused "tag" field from settings.json

https://claude.ai/code/session_0165AXHkmRZ6bx23n7Tbyz8h

* Split Matomo config into base url + tag fields

Separate the Matomo setting into `url` (base URL, e.g.
https://cdn.matomo.cloud/openms.matomo.cloud) and `tag` (container ID,
e.g. yDGK8bfY), consistent with how other providers use a tag field.
The script constructs the full path: {url}/container_{tag}.js

https://claude.ai/code/session_0165AXHkmRZ6bx23n7Tbyz8h

* install matomo tag

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Remove duplicate `address` key in `.streamlit/config.toml` (#346)

* Initial plan

* fix: remove duplicate address entry in config.toml

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Fix integration test failures caused by sys.modules pollution and shutil.SameFileError (#349)

* Initial plan

* Fix integration test failures: restore sys.modules mocks, handle SameFileError, update CI workflow

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Remove unnecessary pyopenms mock from test_topp_workflow_parameter.py, simplify test_parameter_presets.py

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Fix Windows build: correct site-packages path in cleanup step

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Remove server address from bundled config.toml for Windows installer (#351)

On Windows, 0.0.0.0 is not a valid connect address — the browser fails
to open http://0.0.0.0:8501. By removing the address entry from the
bundled .streamlit/config.toml, Streamlit defaults to localhost, which
works correctly for local deployments. Docker deployments are unaffected
as they pass --server.address 0.0.0.0 on the command line.

https://claude.ai/code/session_016amsLCZeFogTksmtk1geb5

Co-authored-by: Claude <noreply@anthropic.com>

* reenable cross origin protection

* Add CLAUDE.md and Claude Code skills for MS webapp development (#357)

* Add CLAUDE.md and Claude Code skills for webapp development

Adds project documentation (CLAUDE.md) and 6 skills to help developers
scaffold and extend OpenMS web applications built from this template:
- /create-page: add a new Streamlit page with proper registration
- /create-workflow: scaffold a full TOPP workflow (class + 4 pages)
- /add-python-tool: add a custom Python analysis script with auto-UI
- /add-presets: add parameter presets for workflows
- /configure-deployment: set up Docker and CI/CD for a new app
- /add-visualization: add pyopenms-viz or OpenMS-Insight visualizations

https://claude.ai/code/session_01WYotmLfqRtB8WJXj1Eosiz

* Strengthen MS domain context in CLAUDE.md and skills

Make it clear to Claude that this is THE framework for building mass
spectrometry web applications for proteomics and metabolomics research.
Add domain-specific context about MS data types, TOPP tool pipelines,
and scientific visualization needs.

https://claude.ai/code/session_01WYotmLfqRtB8WJXj1Eosiz

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Add Kubernetes manifests and CI/CD workflows for deployment (#347)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
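
A fragment showing the rough shape of that rule (deployment context trimmed; only the label pair and the required-scheduling term come from the message):

```yaml
spec:
  template:
    metadata:
      labels:
        volume-group: workspaces
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  volume-group: workspaces
              topologyKey: kubernetes.io/hostname
```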

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
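
The retry pattern described above, as a workflow-step sketch (attempt count from the message; step name, overlay path, and backoff timing are illustrative):

```yaml
- name: Deploy kustomize overlay (retry for webhook readiness)
  run: |
    for i in 1 2 3 4 5; do
      if kubectl apply -k k8s/overlays/prod; then exit 0; fi
      echo "apply attempt $i failed; retrying in $((i * 5))s"
      sleep $((i * 5))
    done
    exit 1
```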

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Claude/kubernetes migration plan kq jw d (#358)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
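
For reference, the rough shape of such an IngressRoute (the apiVersion, port, and cookie name here are assumptions; a sticky cookie named stroute is mentioned later in this thread):

```yaml
apiVersion: traefik.io/v1alpha1   # assumed; older clusters use traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: streamlit
spec:
  entryPoints:
    - web
  routes:
    - match: PathPrefix(`/`)
      kind: Rule
      services:
        - name: streamlit
          port: 8501              # Streamlit's default port; assumed
          sticky:
            cookie:
              name: stroute       # cookie name taken from later commits in this thread
```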

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
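
A sketch of the merge-at-startup idea under those constraints (paths, file names, and the wrapper command are assumptions; the jq merge and set -euo pipefail come from the two messages above):

```yaml
# container command fragment -- illustrative only
command: ["/bin/bash", "-c"]
args:
  - |
    set -euo pipefail
    # right-hand file (the ConfigMap overrides) wins on conflicting keys
    jq -s '.[0] * .[1]' /app/settings.json /config/settings-overrides.json > /tmp/settings.json
    mv /tmp/settings.json /app/settings.json
    exec streamlit run app.py
```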

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Claude/kubernetes migration plan kq jw d (#359)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: change imagePullPolicy to Always for mutable main tag

With IfNotPresent, rollout restarts reuse the cached image even when a
new version has been pushed with the same tag. Always ensures Kubernetes
pulls the latest image on every pod start.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: build full Dockerfile instead of Dockerfile_simple

Switch CI to build the full Docker image with OpenMS and TOPP tools,
not the lightweight pyOpenMS-only image.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Claude/fix mzml files validation y zfla (#361)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: change imagePullPolicy to Always for mutable main tag

With IfNotPresent, rollout restarts reuse the cached image even when a
new version has been pushed with the same tag. Always ensures Kubernetes
pulls the latest image on every pod start.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: build full Dockerfile instead of Dockerfile_simple

Switch CI to build the full Docker image with OpenMS and TOPP tools,
not the lightweight pyOpenMS-only image.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Scope IngressRoute to hostname and drop unused nginx Ingress

Traefik is the only ingress controller on the cluster; the nginx Ingress
in k8s/base/ingress.yaml was orphaned (no nginx class available) and the
overlay was patching it instead of the active Traefik IngressRoute.

- Add Host() match to the base IngressRoute (placeholder filled by overlays)
- template-app overlay patches the IngressRoute with template.webapps.openms.de
- Remove ingress.yaml from the base kustomization resources list (file kept
  in the repo for nginx-based consumers)

https://claude.ai/code/session_01YNDYJTx1eSKaL9vQe1GQzV

* fix: use PVC mount for workspaces in online mode

In online mode, src/common/common.py hard-coded workspaces_dir to the
literal ".." which, from WORKDIR /app, resolved to /. Workspace UUID
directories were therefore created on each pod's ephemeral local
filesystem instead of the shared PVC mounted at
/workspaces-streamlit-template, so the Streamlit pod and the RQ worker
each saw their own disconnected copy. The worker's params.json load in
tasks.py then hit an empty dict, producing `KeyError: 'mzML-files'` as
soon as Workflow.execution() ran.

- common.py: in the online branch, use WORKSPACES_DIR env var (default
  /workspaces-streamlit-template) so Streamlit, the RQ worker, and the
  cleanup cronjob (which already reads WORKSPACES_DIR) all agree on one
  location.
- k8s streamlit & rq-worker deployments: set WORKSPACES_DIR explicitly so
  the env is overridable and visible at deploy time.
- WorkflowManager.start_workflow: call save_parameters() before dispatch
  so the latest session state is flushed to disk, closing a small race
  where a fragment rerun could leave params.json stale when the worker
  picked up the job.

https://claude.ai/code/session_01TsxtENPpuCZ1Ap3mX2ZpHr
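
The deployment-side half of that fix is a plain env entry (fragment; the path comes from the message above):

```yaml
env:
  - name: WORKSPACES_DIR
    value: /workspaces-streamlit-template
```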

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Fix contrib tag (#360)

* fix(ci): pin OpenMS contrib download to matching release tag

The Windows build step downloaded contrib_build-Windows.tar.gz from
OpenMS/contrib without a --tag, always pulling the latest release.
When the GH Actions cache (7-day eviction) expired, a newer contrib
got pulled that was incompatible with the pinned OpenMS release/3.5.0
source tree, breaking MSVC compilation in DIAPrescoring.cpp.

Pin the download to release/${OPENMS_VERSION} and tie the cache key
to the OpenMS version so contrib stays in lockstep with the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): pass release tag as positional arg to gh release download

`gh release download` takes the tag as a positional argument, not a
`--tag` flag. Silently failed to match on Windows with the system error
"The system cannot find the file specified".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: allow contrib version override via OPENMS_CONTRIB_VERSION

Adds OPENMS_CONTRIB_VERSION env var that falls back to OPENMS_VERSION
when empty. Lets us point OPENMS_VERSION at a non-release branch (e.g.
develop) while keeping the Windows contrib download pinned to a known
release tag, so CI doesn't fail on a missing contrib release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
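
Putting the three commits together, the Windows contrib step plausibly ends up shaped like this (step layout and cache-tag naming are assumptions; the positional tag, the version fallback, and the version-keyed cache come from the messages):

```yaml
- name: Cache contrib
  uses: actions/cache@v4
  with:
    path: contrib
    key: contrib-windows-${{ env.OPENMS_CONTRIB_VERSION || env.OPENMS_VERSION }}
- name: Download contrib pinned to the matching release
  run: |
    CONTRIB_TAG="release/${OPENMS_CONTRIB_VERSION:-$OPENMS_VERSION}"
    gh release download "$CONTRIB_TAG" --repo OpenMS/contrib \
      --pattern contrib_build-Windows.tar.gz
```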

* chore: ignore docs/superpowers/ (local design notes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add Kubernetes deployment docs and refactor Claude skills (#362)

* Remove stale patches from template-app overlay

The Deployment/streamlit patch with Ingress-shaped path /spec/rules/0/host
never applied and produced a silent no-op. The duplicate IngressRoute
service-name patch was redundant with the first IngressRoute patch block.
This brings the on-disk overlay in line with the production cluster's
running version.

* Rename configure-deployment skill to configure-docker-compose-deployment

First step of splitting the skill into three focused skills
(configure-app-settings, configure-docker-compose-deployment,
configure-k8s-deployment). Rename is in its own commit so
git log --follow traces the docker-compose content cleanly.

* Scope docker-compose skill to docker-compose-only

Removes app-level content (settings.json, Dockerfile choice, production
app examples) that will live in configure-app-settings. Adds a
prerequisite note pointing to configure-app-settings.

* Add configure-app-settings skill

Covers app-level configuration (settings.json, Dockerfile choice,
README, dependencies) shared by every deployment mode. Prerequisite
for configure-docker-compose-deployment and configure-k8s-deployment.

* Fix settings.json key-field list inconsistency

The Key fields prose listed max_threads (not in the JSON sample) and
omitted enable_workspaces (which is in the sample). Align the prose
with the sample and describe max_threads separately since it is a
nested object rather than a flat field.

* Add configure-k8s-deployment skill

New skill walking through Kustomize overlay creation and kubectl apply
for deploying a forked app to Kubernetes. Patch list reflects the
three-patch canonical shape (IngressRoute match + service, streamlit
Redis URL, rq-worker Redis URL).

* Fix inline-code rendering in k8s skill

The Host(`...`) escape syntax produced literal backslashes that
broke the inline-code span when rendered by markdown parsers. Rewrite
as Host(...) without nested backticks so the span renders cleanly.

* Add K8s deployment doc — overview and architecture sections

* Add K8s deployment doc — manifest reference section

* Add K8s deployment doc — fork-and-deploy guide

* Add K8s deployment doc — CI/CD pipeline section

* Clarify PR-blocking behavior depends on branch protection

The workflow does not block merges directly — it produces a check
status that a branch-protection rule can gate on. Make the
preconditions explicit.

* Register Kubernetes Deployment page in Streamlit documentation

* Cross-link docs/deployment.md to Kubernetes deployment page

Adds a preamble listing both deployment paths and introduces a
## Docker Compose heading above the existing content. The existing
docker-compose content is preserved verbatim.

* Add smoke test for Kubernetes Deployment documentation page

Extends the parametrized test_documentation cases to cover the new
Documentation page added by this branch, closing the gap where it
was the only selectbox entry without test coverage.

* ci: add ghcr-cleanup workflow (scheduled disabled, dry-run default)

* ci: scaffold build-and-test workflow with lint-manifests job

* ci: add build job skeleton with matrix, buildx, ghcr login

* ci: add metadata extraction, build-push, and registry cache

* ci: add kind integration steps to build job

* ci: lowercase image name for OCI cache refs

github.repository preserves the original casing (OpenMS/streamlit-template).
Docker OCI references require lowercase, so cache-from/cache-to fail with
'invalid reference format'. docker/metadata-action handles this internally
for tags, but the cache refs bypass it. Compute IMAGE_NAME_LC once and use
it in both cache refs.
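
A sketch of that computation (bash parameter expansion; the buildcache tag name is assumed):

```yaml
- name: Compute lowercase image name
  run: echo "IMAGE_NAME_LC=${GITHUB_REPOSITORY,,}" >> "$GITHUB_ENV"
- name: Build and push
  uses: docker/build-push-action@v6
  with:
    cache-from: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME_LC }}:buildcache
    cache-to: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME_LC }}:buildcache,mode=max
```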

* ci: don't pass unprefixed local tag to buildx push

With push: true, docker/build-push-action pushes every tag in its tags
input. A bare name like 'openms-streamlit:simple-test' (no registry
prefix) gets resolved to Docker Hub and fails with 401 unauthorized,
because the workflow's GHCR token has no rights on docker.io.

The local tag was only needed for the kind retag step. Since load: true
already loads the image into the runner's docker daemon, we can create
the stable local alias with a plain 'docker tag' step after build,
picking any tag from docker/metadata-action's output.

* ci: delete old docker workflows now superseded by build-and-test

* k8s: pin overlay image tag to main-full (new CI scheme)

* docs(skill): update k8s deploy skill for unified CI workflow

* docs(k8s): update deployment doc for unified CI workflow

* ci: pin container-retention-policy to v3.0.1

The @v3 floating tag does not exist on snok/container-retention-policy
(v2 is the latest floating major tag; v3 only has v3.0.0 and v3.0.1
as exact version tags). The workflow fails to resolve the action with
'unable to find version v3'. Pin to v3.0.1 (latest v3 release).

* fix(docker): stop cache-busting on GITHUB_TOKEN

The ENV GH_TOKEN=${GITHUB_TOKEN} at the top baked the per-run token
into an early layer, so every workflow run rebuilt from scratch.
Moved the ARG next to the one RUN that uses it (gh release download)
so earlier layers stay cacheable.

* docs: fix typo (Gihub -> GitHub) in Dockerfile comments

* ci: enable scheduled GHCR cleanup (weekly Sun 03:00 UTC)

* k8s: serve template app on both .de and .org TLDs

Updates the Traefik IngressRoute match in the template-app overlay to
accept both Host() values, and mirrors the same dual-host pattern in
the nginx Ingress fallback (two rules entries, same backend).

Outer parentheses on the || group are required for correct precedence
against PathPrefix.
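
As an overlay-patch fragment, the dual-host match described above looks roughly like this (patch path assumed):

```yaml
- op: replace
  path: /spec/routes/0/match
  value: (Host(`template.webapps.openms.de`) || Host(`template.webapps.openms.org`)) && PathPrefix(`/`)
```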

* ci: integration-test both .de and .org hosts on nginx and traefik

Adds a dual-host curl assertion to the existing nginx kind integration
and a new traefik-integration job that brings up Traefik via Helm,
deploys the full overlay (no IngressRoute filter), and curls both
hostnames through the IngressRoute.

The traefik-integration job runs once on Dockerfile_simple — ingress
routing is image-agnostic, and adding the full image variant would
double the runtime without catching new regressions.

* ci: enable kind to bind workspace PVC and clean up port-forwards

The cinder-csi storage class isn't available in kind clusters. Patch
it to 'standard' (kind's default local-path-provisioner) at apply
time, alongside the existing imagePullPolicy substitution. Without
this, the workspace PVC stays unbound, streamlit and rq-worker pods
stay Pending, and the new dual-host curl assertions fail with 503.

The existing 'Verify all deployments are available' step had been
masking this with '|| true' since the integration test was added.

Also wire up a trap-based EXIT cleanup for the kubectl port-forward
processes; the previous trailing 'kill' line was unreachable under
set -e if any curl assertion failed.
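
A sketch of the apply-time substitutions plus the trap cleanup (sed patterns, the Traefik service name/namespace, and step layout are assumptions; the standard storage class, imagePullPolicy substitution, and trap-on-EXIT come from the message):

```yaml
- name: Deploy overlay to kind (patched for local storage)
  run: |
    kubectl kustomize k8s/overlays/prod \
      | sed -e 's/storageClassName: cinder-csi/storageClassName: standard/' \
            -e 's/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/' \
      | kubectl apply -f -
- name: Curl both hosts through a port-forward
  run: |
    kubectl -n traefik port-forward svc/traefik 8080:80 &   # service name/namespace assumed
    trap 'kill %1 2>/dev/null || true' EXIT
    sleep 5
    curl -fsS -H "Host: template.webapps.openms.de"  http://127.0.0.1:8080/ > /dev/null
    curl -fsS -H "Host: template.webapps.openms.org" http://127.0.0.1:8080/ > /dev/null
```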

* skill(configure-k8s-deployment): document dual-host overlay edit

Updates the overlay-edit step to require editing both Host() values
(.de and .org) plus the parallel nginx Ingress two-rules pattern.
Updates the verification checklist accordingly.

* skill(configure-k8s-deployment): fix markdown rendering and clarify nginx patch

CommonMark code spans don't process backslash escapes for backticks,
so `Host(\`…\`)` rendered as broken fragments. Wrap with double
backticks instead — the inner backticks are then literal.

Also clarify the nginx fallback note: 'patch both rules[].host
entries' could be misread as directly editing the shared base file;
'add an overlay patch for both rules[].host entries' is unambiguous.

* docs(kubernetes-deployment): document dual-host serving

Updates the architecture diagram, manifest reference, customization
table, and CI/CD section to describe the dual-host (.de + .org)
default. Adds a short subsection on the per-host stroute cookie and
why cross-TLD switches are harmless.

* docs(kubernetes-deployment): fix stale job count and missing kind patches in Job 3

Two factual errors caught in review:

- "both jobs run on pull requests" was true with 2 jobs, but there
  are now 3 (lint-manifests, build, traefik-integration). All three
  run on PRs.
- Job 3's description omitted that the deploy step still patches
  imagePullPolicy and storageClassName for kind compatibility, even
  though it doesn't filter the IngressRoute. Job 2's description
  already mentions both patches; Job 3 should be parallel.

* ci: use nginx Ingress hostnames for nginx-job curl assertions

The nginx Ingress is unpatched by the overlay, so it retains its base
hostnames (streamlit.openms.example.de / .org) from k8s/base/ingress.yaml.
The previous curl step used the Traefik IngressRoute hostnames
(template.webapps.openms.*), which the nginx ingress controller does
not match — every request 404'd.

Traefik's curl step is unchanged: the IngressRoute IS patched to the
template.webapps.openms.* hostnames, so those are correct there.

* k8s: mount admin password from streamlit-secrets Secret

The Save-as-Demo feature reads the admin password from st.secrets (i.e.
.streamlit/secrets.toml), but the Streamlit pod never had that file
mounted, so the feature was always disabled in cluster deployments.

Mount an optional Secret named streamlit-secrets as
/app/.streamlit/secrets.toml, add a reference .example manifest (not
included in kustomization -- the Secret is created out-of-band so no
password lands in git), gitignore any filled-in copy, and document the
imperative kubectl-create flow alongside the manifest alternative.

https://claude.ai/code/session_01LAJZ5EWBJkznj7vQnKt8vV

* fix errors

* ci: bump pyopenms to 3.5.0 and pin python 3.10 to match Dockerfile

The committed Dockerfile builds OpenMS from release/3.5.0 on python 3.10,
but requirements.txt pinned pyopenms==3.3.0 and ci.yml ran on python 3.11,
causing test_gui.py to fail with AttributeError on MSExperiment.to_df()
(the API was renamed from get_df to to_df in 3.5).

* fix(view): use pyopenms 3.5 get_df API instead of unreleased to_df

`MSExperiment.to_df()` exists only on the OpenMS develop branch and is
not in the published pyopenms 3.5.0 wheel that CI installs from PyPI,
causing AttributeError in the raw data viewer. Switch to `get_df()` and
`get_df(long=True)` — both return the same column names that the
existing rename logic expects (rt/ms_level/mz_array/intensity_array
for the wide form, rt/mz/intensity for the long form).

* fix(k8s): mount streamlit-secrets as directory so optional: true works

CI pods crashlooped because a `subPath: secrets.toml` file mount cannot
resolve when the optional Secret is absent. Mount the Secret as a
directory at /app/admin-secrets/ instead, and register that path via
[secrets].files in .streamlit/config.toml so st.secrets picks it up
without shadowing the baked-in config.toml / credentials.toml.
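
A minimal sketch of the directory mount; paths follow the commit text, while the volume name and surrounding Deployment layout are assumptions:

```yaml
volumes:
  - name: admin-secrets
    secret:
      secretName: streamlit-secrets
      optional: true   # resolves cleanly even when the Secret is absent
containers:
  - name: streamlit
    volumeMounts:
      - name: admin-secrets
        mountPath: /app/admin-secrets
        readOnly: true
```

Streamlit then picks up `/app/admin-secrets/secrets.toml` because that path is listed under the `secrets.files` option in `.streamlit/config.toml`, alongside the default secrets locations.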

* docs(k8s): add streamlit-secrets example to template-app overlay

Mirrors the base example with overlay-specific guidance: `namePrefix`
only rewrites Kustomize-managed resources, so imperative Secrets must
still use the literal name `streamlit-secrets`.

* k8s: two-tier scheduling via Kustomize components + LimitRange

Factor node placement and memory sizing out of the base manifests into
reusable Kustomize components (memory-tier-low / memory-tier-high), so
each fork picks its tier with a single line in its overlay.

- base: remove per-pod `resources` from streamlit and rq-worker
  Deployments; sizing now comes from the tier component
- base: promote redis to Guaranteed QoS (requests == limits for both
  cpu and memory) so it bottoms the kernel OOM list
- base: add LimitRange so containers without explicit resources inherit
  safe defaults (512Mi/250m request, 2Gi/2 limit, 64Gi/16 max)
- components/memory-tier-low: nodeSelector=low, streamlit 512Mi/2Gi,
  rq-worker 1Gi/16Gi (Burstable)
- components/memory-tier-high: nodeSelector=high, streamlit 512Mi/4Gi,
  rq-worker 2Gi/180Gi (Burstable — uniform across heavy workers so a
  single active app can burst into the shared pool)
- overlays: rename template-app/ to prod/ (one overlay per repo; the
  repo itself identifies the app) and pull in memory-tier-low
- docs & skill: document the new overlays/prod/ path and the one-line
  tier selector; update CI to kustomize the renamed overlay
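
A sketch of the one-line tier pick and the LimitRange defaults listed above; these live in two separate files, and paths and metadata names here are assumptions:

```yaml
# overlays/prod/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/memory-tier-low   # the one-line tier selector
---
# base LimitRange (sketch) with the defaults described above
apiVersion: v1
kind: LimitRange
metadata:
  name: workspace-defaults
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 512Mi
        cpu: 250m
      default:
        memory: 2Gi
        cpu: "2"
      max:
        memory: 64Gi
        cpu: "16"
```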

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* ci(k8s): label kind node to match the overlay's memory tier

The memory-tier-low component adds nodeSelector
openms.de/memory-tier=low to every Deployment. kind clusters have no
such label, so after the rename to overlays/prod all pods stayed
Pending and 'Wait for Redis to be ready' timed out.

Label --all kind nodes in both the nginx and Traefik integration jobs
before deploying so the nodeSelector matches.
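
The fix amounts to a single labeling step in each job; the step name and placement are illustrative:

```yaml
- name: Label kind nodes with the memory tier
  run: kubectl label nodes --all openms.de/memory-tier=low --overwrite
```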

Also raise the LimitRange max.memory from 64Gi to 200Gi. The original
cap was written before memory-tier-high settled on a 180Gi rq-worker
limit; without the bump, a high-tier fork (e.g. OpenDIAKiosk) would be
rejected by admission when deployed into the shared openms namespace
after the template's LimitRange is applied.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* k8s: move streamlit-secrets.yaml.example into overlays/prod/

Completes the overlay rename started in 6c61365 now that the branch
has merged main, which added the example file under the old path.

Also rewrite two remaining docs references to overlays/<your-app-name>/
and the CI description to the new prod overlay.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* ci(k8s): two-node kind cluster with both tier labels

Spin up a 2-node kind cluster (control-plane labeled memory-tier=low
+ ingress-ready, worker labeled memory-tier=high) so the Build-and-Test
job passes regardless of which memory-tier component a fork's overlay
pulls in. Previously we labeled --all nodes with a single tier after
creation, which broke as soon as a fork flipped memory-tier-low to
memory-tier-high.

- .github/kind-config.yaml: 2-node topology with per-node labels.
- .github/workflows/build-and-test.yml: point both helm/kind-action
  invocations (nginx build + traefik-integration) at the config and
  drop the now-redundant dynamic label step.
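
A sketch of the two-node topology; kind supports per-node labels in recent releases, and the exact file contents are assumptions:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    labels:
      openms.de/memory-tier: "low"
      ingress-ready: "true"
  - role: worker
    labels:
      openms.de/memory-tier: "high"
```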

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* ci(k8s): clear control-plane NoSchedule taint in two-node kind config

Previous run (2f28ed9) showed build + traefik-integration jobs still
timing out on 'Wait for Redis'. Root cause: multi-node kind clusters
apply the node-role.kubernetes.io/control-plane:NoSchedule taint to
the control-plane, so app pods without a matching toleration can't
land there even though the nodeSelector matches. The single-node kind
cluster used previously had no such taint, which is why CI worked
until we added a second node.

Add a kubeadmConfigPatches stanza setting nodeRegistration.taints to
the empty list so the control-plane is schedulable. Labels and
cluster shape (1 control-plane + 1 worker) stay the same.
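
The stanza, roughly, using the standard kind pattern for clearing the taint:

```yaml
nodes:
  - role: control-plane
    # clear nodeRegistration taints so app pods can schedule here
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          taints: []
```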

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* k8s: store demo workspaces on the workspaces PVC

Adds a seed-demos initContainer to the Streamlit Deployment that merges
image-shipped demos into /workspaces-streamlit-template/.demos/ with
cp -rn, so new demos in an image appear after redeploy while admin-saved
demos and edits persist across redeploys.

- Point demo_workspaces.source_dirs at the PV path via the ConfigMap
  override (both streamlit and rq-worker pick this up through the jq
  settings merge at startup).
- Make get_demo_target_dir() settings-driven so "Save as Demo" writes
  to the PV, with backwards-compatible fallbacks for the legacy
  source_dir string and for environments without settings (tests).
- Skip hidden top-level dirs in clean-up-workspaces.py so the nightly
  cron does not garbage-collect .demos/.
- Document the .demos/ layout and the re-seed flow.
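
A rough shape of the initContainer; the image ref, the image-side demo path, and the volume name are placeholders, and `|| true` allows for newer coreutils where `cp -n` exits nonzero on skipped files:

```yaml
initContainers:
  - name: seed-demos
    image: app-image:tag   # placeholder; same image as the streamlit container
    command:
      - sh
      - -c
      - |
        mkdir -p /workspaces-streamlit-template/.demos
        # -n never overwrites, so admin-saved demos and edits survive;
        # demos new to the image are still copied in.
        cp -rn /app/demos/. /workspaces-streamlit-template/.demos/ || true
    volumeMounts:
      - name: workspaces-volume   # placeholder; the workspaces PVC volume
        mountPath: /workspaces-streamlit-template
```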

https://claude.ai/code/session_01Y87aULHSdyBobPdaD4L6tW

* k8s: ship streamlit-secrets by default, hide admin UI when empty

The Secret used to be an out-of-band copy-the-example step, so forgetting
the resources-list edit left the pod booting with an empty admin-secrets
mount and a user-facing "Admin not configured" error for a feature that
was never wired up in the first place.

Now the Secret is committed to the base with an empty admin password and
included in k8s/base/kustomization.yaml, so kubectl apply -k always
creates it. The "Save as Demo" expander is gated on a non-empty password
and is hidden entirely (no error box) when not configured. Operators
enable the feature by patching the live Secret or by editing the file
locally with git update-index --skip-worktree, both documented.
Exception handling in is_admin_configured() is tightened to also catch
StreamlitSecretNotFoundError so a missing secrets file never raises.
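
In essence the committed Secret looks like this; the key name follows the directory mount described earlier and is an assumption:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: streamlit-secrets
stringData:
  secrets.toml: |
    admin_password = ""
```

Operators then set a real password on the live object (e.g. via `kubectl edit secret streamlit-secrets`) or edit the file locally with `git update-index --skip-worktree`, as described above.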

https://claude.ai/code/session_01V1noocAR7uXWjWsC9oLGhz

* ci: reuse built docker images across ingress tests

Split the build+test flow into three stages so the traefik ingress
test no longer rebuilds Dockerfile_simple from scratch:

  build (matrix: full, simple)
    -> uploads each image as a workflow artifact
  test-nginx (matrix: full, simple)
    -> downloads artifact, kind loads, tests nginx ingress
  test-traefik (simple only)
    -> downloads simple artifact, kind loads, tests traefik ingress

Artifacts (not GHCR) are used because the build job only pushes on
non-PR events and fork PRs cannot auth to GHCR at all, so registry
sharing would not work for every PR path.
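
The hand-off shape, roughly; action versions, image names, and artifact names are assumptions:

```yaml
# build job
- run: docker save -o image.tar openms-streamlit:${{ matrix.variant }}
- uses: actions/upload-artifact@v4
  with:
    name: image-${{ matrix.variant }}
    path: image.tar

# test job
- uses: actions/download-artifact@v4
  with:
    name: image-${{ matrix.variant }}
- run: kind load image-archive image.tar
```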

* ci: run test-traefik against both image variants

Mirror the build/test-nginx matrix so the traefik ingress test also
covers the full and simple variants instead of just simple.

* ci: harden ingress-test wait/curl flow for slow simple deployments

test-traefik (simple) failed in the combined "Wait for Redis and
deployments to be ready" step because the deployment took longer than
120s to become available, and unlike the test-nginx wait the failure
was not soft. Align test-traefik with test-nginx:

- Split Redis wait (hard, 60s) from deployment wait (soft, `|| true`).
- Bump deployment timeout 120s -> 180s in both jobs.
- Widen the curl warm-up loop from 5x2s to 30x2s in both jobs so a
  marginally late deployment is tolerated; a real failure still
  surfaces via the trailing unconditional curl.
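
In workflow terms the split looks roughly like this, with `$HOST` standing in for the asserted hostname and the selectors illustrative:

```yaml
- name: Wait for Redis (hard failure)
  run: kubectl wait --for=condition=ready pod -l app=redis --timeout=60s
- name: Wait for deployments (soft failure)
  run: kubectl wait --for=condition=available deployment --all --timeout=180s || true
- name: Warm-up curls, then one hard assertion
  run: |
    for i in $(seq 1 30); do
      curl -sf -H "Host: $HOST" http://127.0.0.1/ > /dev/null && break
      sleep 2
    done
    curl -sf -H "Host: $HOST" http://127.0.0.1/ > /dev/null
```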

* Rework configure-k8s-deployment skill as an interview

The previous skill was a manual find-and-replace checklist that assumed
Claude could run kubectl against the cluster. Restructure it as an
interview-driven file-editing guide with a clear handoff to a human
operator (or CI) for cluster apply.

- Drop kubectl, kubectl kustomize, and rollout-verification steps that
  Claude can't actually execute.
- Drop nginx ingress fallback; production is Traefik-only.
- Add a Step 1 recon over a fixed set of base/overlay/CI files so
  defaults are derived from the repo, and the skill bails on layouts
  it doesn't recognize.
- Replace the manual checklist with six interview questions, each
  paired with what it controls in the running deployment, the proposed
  default, and the reasoning. Slug, GHCR ref, image tag, ingress
  subdomain, memory tier, workspace storage size.
- Make storage a single 1-line edit to k8s/base/workspace-pvc.yaml when
  the user picks a non-default size; keep the PVC base name unchanged
  (namePrefix scopes it per-fork, no collisions).
- Pin the default storage size to 500 Gi to match the stock base, so
  the default needs zero file edits.
- Explain that images[0].name is a Kustomize match key and must not
  change.
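
The storage question then reduces to a single value in k8s/base/workspace-pvc.yaml:

```yaml
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi   # stock default; a non-default answer edits only this line
```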

* k8s: drop cross-fork pod-affinity, rely on RWO PVC for co-location

The shared volume-group: workspaces label and required pod-affinity
attracted every fork's workspace pods onto a single node per memory
tier and deadlocked the first replica of any fork landing on an
otherwise-empty tier (no peer pod for the required affinity to match).

Per-fork RWO PVCs (<slug>-workspaces-pvc) already constrain all of
a fork's workspace-using pods to the node the volume is attached to
via the scheduler's VolumeBinding plugin, so the explicit affinity
adds nothing on top. Removing it scopes co-location naturally to one
fork and lets a fresh tier bootstrap without manual affinity-strip.

NodeSelector continues to pick the memory tier; the RWO mount picks
the specific node within that tier.
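
For the record, the removed constraint had this shape; the topologyKey is an assumption:

```yaml
# Removed: required co-scheduling with any pod labeled volume-group=workspaces
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            volume-group: workspaces
        topologyKey: kubernetes.io/hostname
```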

* ci: derive slug + Traefik hosts from overlay so forks stay green

The kind integration jobs in build-and-test.yml hardcoded `template-app`
as the slug label and `template.webapps.openms.{de,org}` as the Traefik
hostnames. The configure-k8s-deployment skill rewrites those values when
a fork customizes its overlay, after which `kubectl wait -l app=...`
returns "no matching resources found" and Traefik curl tests hit the
wrong Host header. This broke OpenMS/quantms-web PR #19 on its first
overlay PR (run 24964475081).

Have test-nginx and test-traefik discover SLUG (from `commonLabels.app`)
and TRAEFIK_HOSTS (parsed from the rendered IngressRoute match) right
after deploy, and substitute them into the wait/curl steps. The nginx
hostnames stay hardcoded — they come from `k8s/base/ingress.yaml`, which
the skill never edits and Kustomize doesn't rewrite.

Update the configure-k8s-deployment skill to (a) check during recon that
the workflow uses dynamic discovery, (b) flag forks still on the old
hardcoded shape so the skill applies the patch before editing the
overlay, and (c) note in the handoff that no fork-specific workflow
edits are needed.
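
A sketch of what the discovery could look like; the parsing commands are illustrative, not the workflow's actual implementation:

```yaml
- name: Discover slug and Traefik hosts from the rendered overlay
  run: |
    rendered=$(kubectl kustomize k8s/overlays/prod)
    slug=$(printf '%s\n' "$rendered" | awk '/^    app:/ {print $2; exit}')
    hosts=$(printf '%s\n' "$rendered" \
      | grep -o 'Host(`[^`]*`)' | sed 's/Host(`\(.*\)`)/\1/' | tr '\n' ' ')
    echo "SLUG=$slug" >> "$GITHUB_ENV"
    echo "TRAEFIK_HOSTS=$hosts" >> "$GITHUB_ENV"
```

Later steps then use `kubectl wait -l "app=$SLUG"` and loop the curl assertions over `$TRAEFIK_HOSTS` instead of hardcoded values.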

* refix ci

* refix admin panel

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>