
chore(k8s): configure prod overlay for quantms-ddalfq deployment #19

Merged
t0mdavid-m merged 2 commits into main from setup_deployment
Apr 27, 2026

Conversation


t0mdavid-m (Member) commented Apr 26, 2026

Wire the prod Kustomize overlay to the OpenMS Traefik cluster:

  • namePrefix/commonLabels: quantms-ddalfq
  • image: ghcr.io/openms/quantms-web:main-full
  • IngressRoute: opendda.webapps.openms.{de,org}
  • Redis URL pointed at quantms-ddalfq-redis
  • memory-tier-high component (heavy DIA/LFQ workloads)
  • Workspace PVC sized to 3Ti
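
For orientation, a minimal sketch of what an overlay wired this way could look like (a sketch only: patch shapes, the base image name, and base resource names are assumptions; only the values listed above come from this PR):

```yaml
# k8s/overlays/prod/kustomization.yaml -- illustrative sketch, not the merged file
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/memory-tier-high   # heavy DIA/LFQ sizing
namePrefix: quantms-ddalfq-
commonLabels:
  app: quantms-ddalfq
images:
  - name: openms-streamlit              # base image name assumed
    newName: ghcr.io/openms/quantms-web
    newTag: main-full
patches:
  - target:
      kind: IngressRoute
      name: streamlit                   # base resource name assumed
    patch: |
      - op: replace
        path: /spec/routes/0/match
        value: (Host(`opendda.webapps.openms.de`) || Host(`opendda.webapps.openms.org`)) && PathPrefix(`/`)
```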

Summary by CodeRabbit

Release Notes

  • Chores
    • Increased workspace storage capacity to 3 terabytes
    • Updated production environment configuration with new service names and hostnames
    • Modified Redis connection settings for production deployments
    • Enhanced production memory tier configuration

Wire the prod Kustomize overlay to the OpenMS Traefik cluster:
- namePrefix/commonLabels: quantms-ddalfq
- image: ghcr.io/openms/quantms-web:main-full
- IngressRoute: opendda.webapps.openms.{de,org}
- Redis URL pointed at quantms-ddalfq-redis
- memory-tier-high component (heavy DIA/LFQ workloads)
- Workspace PVC sized to 3Ti

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 26, 2026

Warning

Rate limit exceeded

@t0mdavid-m has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 56 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 18 minutes and 56 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 40d63c69-74ca-4ef7-a072-4e22849899f6

📥 Commits

Reviewing files that changed from the base of the PR and between 7ca8dbe and cb1bd20.

📒 Files selected for processing (1)
  • .github/workflows/build-and-test.yml
📝 Walkthrough

Updates Kubernetes manifests to increase workspace storage capacity from 500Gi to 3Ti and reconfigure the production overlay environment with new application naming, memory tier settings, image references, and service connections.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Kubernetes Base Configuration<br>`k8s/base/workspace-pvc.yaml` | Increases PersistentVolumeClaim storage capacity from 500Gi to 3Ti. |
| Production Deployment Configuration<br>`k8s/overlays/prod/kustomization.yaml` | Migrates the prod overlay from template-app to quantms-ddalfq naming, upgrades the memory tier to high, updates the image reference to ghcr.io/openms/quantms-web, changes IngressRoute hostnames from template.webapps.openms.* to opendda.webapps.openms.*, and updates Redis service connections in the streamlit and rq-worker deployments. |

Poem

🐰 From template's days to quantms new,
Storage swells to three tiers through,
Redis redirects with proper care,
Production infrastructure, debonair!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and concisely summarizes the main change: configuring the production Kustomize overlay for the quantms-ddalfq deployment, which aligns with the primary objectives of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |
| Linked Issues Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes Check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
k8s/overlays/prod/kustomization.yaml (3)

31-44: Positional env/0 JSON patch is brittle — prefer a strategic merge by env name.

Both Redis patches target /spec/template/spec/containers/0/env/0/value, which silently depends on REDIS_URL remaining the first entry in the base deployments' env list (k8s/base/streamlit-deployment.yaml and k8s/base/rq-worker-deployment.yaml). If anyone reorders env vars in base, these patches will quietly overwrite the wrong variable instead of failing. A strategic merge patch keyed by name: REDIS_URL is safer and self-documenting.

♻️ Proposed refactor
-  - target:
-      kind: Deployment
-      name: streamlit
-    patch: |
-      - op: replace
-        path: /spec/template/spec/containers/0/env/0/value
-        value: "redis://quantms-ddalfq-redis:6379/0"
-  - target:
-      kind: Deployment
-      name: rq-worker
-    patch: |
-      - op: replace
-        path: /spec/template/spec/containers/0/env/0/value
-        value: "redis://quantms-ddalfq-redis:6379/0"
+  - patch: |-
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: streamlit
+      spec:
+        template:
+          spec:
+            containers:
+              - name: streamlit
+                env:
+                  - name: REDIS_URL
+                    value: "redis://quantms-ddalfq-redis:6379/0"
+  - patch: |-
+      apiVersion: apps/v1
+      kind: Deployment
+      metadata:
+        name: rq-worker
+      spec:
+        template:
+          spec:
+            containers:
+              - name: rq-worker
+                env:
+                  - name: REDIS_URL
+                    value: "redis://quantms-ddalfq-redis:6379/0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 31 - 44, The JSON patches
currently replace /spec/template/spec/containers/0/env/0/value which is brittle;
update the two patches for the Deployment targets named streamlit and rq-worker
to use a strategic merge patch (or patchStrategicMerge) that matches the env
entry by name "REDIS_URL" and sets its value to
"redis://quantms-ddalfq-redis:6379/0" instead of using positional env/0; locate
the patches referencing those Deployment names (streamlit, rq-worker) and
replace the positional JSON patch with a strategic merge fragment that contains
the container env entry { name: REDIS_URL, value:
"redis://quantms-ddalfq-redis:6379/0" } so the patch is resilient to env
reordering.

12-13: Migrate commonLabels to the labels field for full Kustomize compatibility.

commonLabels is deprecated in recent Kustomize releases (v5.3+) and generates warnings. While still functional, migrating to the labels field with explicit control is recommended. Use kustomize edit fix to automate this migration, or manually update as shown below.

♻️ Suggested migration (to preserve current behavior)
-commonLabels:
-  app: quantms-ddalfq
+labels:
+  - pairs:
+      app: quantms-ddalfq
+    includeSelectors: true
+    includeTemplates: true

Set both includeSelectors: true and includeTemplates: true to maintain full parity with the original commonLabels behavior (labels applied to metadata, selectors, and templates). Omit includeTemplates or set includeSelectors: false if you only want labels added to metadata without selector injection.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 12 - 13, Replace the
top-level commonLabels mapping with a new labels block: remove the commonLabels:
app: quantms-ddalfq entry and create labels containing includeSelectors: true
and includeTemplates: true plus the label mapping (app: quantms-ddalfq) so the
label is applied to metadata, selectors and templates; use the symbols
commonLabels, labels, includeSelectors and includeTemplates to find and update
the kustomization YAML.

21-30: Hardcoded quantms-ddalfq-streamlit couples the IngressRoute patch to namePrefix.

Since Traefik's IngressRoute is a CRD, Kustomize's default nameReference transformer doesn't rewrite the services[].name field, so the explicit value here is necessary. However, it now has to be kept in sync manually with namePrefix (Line 10)—changing one without the other will silently route to a non-existent service.

Consider either:

  1. Adding a short comment noting the coupling, or
  2. Registering a configurations: entry with a nameReference for IngressRoute.spec.routes[].services[].name so Kustomize rewrites it automatically (the standard pattern for custom resources without builtin support).
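
A sketch of option 2, assuming the standard transformer-config layout (the file name is hypothetical):

```yaml
# k8s/overlays/prod/ingressroute-namereference.yaml (hypothetical file)
nameReference:
  - kind: Service
    version: v1
    fieldSpecs:
      - kind: IngressRoute
        path: spec/routes/services/name
```

registered in the overlay with:

```yaml
configurations:
  - ingressroute-namereference.yaml
```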
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/overlays/prod/kustomization.yaml` around lines 21 - 30, The IngressRoute
patch hardcodes the service name ("quantms-ddalfq-streamlit") which must stay in
sync with the kustomize namePrefix; replace this brittle approach by registering
a nameReference in kustomize configurations so Kustomize rewrites
IngressRoute.spec.routes[].services[].name automatically (or at minimum add a
clear inline comment noting the coupling). Update the kustomization to include a
configurations: entry that maps the custom resource kind IngressRoute and the
field spec.routes[].services[].name as a nameReference, or add a short comment
next to the patch mentioning it must match namePrefix whenever namePrefix
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e7abc86-f080-482d-87a1-21efd0028c6b

📥 Commits

Reviewing files that changed from the base of the PR and between 076351b and 7ca8dbe.

📒 Files selected for processing (2)
  • k8s/base/workspace-pvc.yaml
  • k8s/overlays/prod/kustomization.yaml

k8s/base/workspace-pvc.yaml:

   resources:
     requests:
-      storage: 500Gi
+      storage: 3Ti

⚠️ Potential issue | 🟠 Major

Scope drift: this base PVC change applies to all overlays, not just prod.

Line 11 raises the base request to 3Ti, which affects every environment inheriting k8s/base. The PR objective is prod-only sizing, so this should be done as an overlay patch in production to avoid unintended cost/capacity impact elsewhere.

Suggested direction
# k8s/base/workspace-pvc.yaml
-      storage: 3Ti
+      storage: 500Gi
# k8s/overlays/prod/workspace-pvc-patch.yaml (new)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspaces-pvc
spec:
  resources:
    requests:
      storage: 3Ti
# k8s/overlays/prod/kustomization.yaml
patchesStrategicMerge:
  - workspace-pvc-patch.yaml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/base/workspace-pvc.yaml` at line 11, The base PersistentVolumeClaim
"workspaces-pvc" was changed to request 3Ti at spec.resources.requests.storage,
which will affect all environments; instead revert that change in the base and
add a prod-only strategic merge patch that updates "workspaces-pvc" to storage:
3Ti in the prod overlay. Create a new overlay patch (e.g.,
workspace-pvc-patch.yaml) containing the PersistentVolumeClaim metadata name:
workspaces-pvc and spec.resources.requests.storage: 3Ti, then reference it from
the prod kustomization under patchesStrategicMerge so only the prod overlay
receives the increased size.

The kind-cluster integration jobs (test-nginx, test-traefik) still
referenced the template's slug and Traefik hostnames, so kubectl wait
selectors and curl Host headers no longer matched after the prod
overlay was renamed. Replace template-app -> quantms-ddalfq and
template.webapps.openms.{de,org} -> opendda.webapps.openms.{de,org}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
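
A sketch of the kind of assertion the commit touches (step layout, label selector, and port-forward address are assumptions; the slug and hostnames come from the message above):

```yaml
- name: Wait for the renamed deployments and curl both hosts
  run: |
    kubectl -n openms wait --for=condition=available --timeout=300s \
      deploy -l app=quantms-ddalfq
    curl -fsS -H "Host: opendda.webapps.openms.de"  http://127.0.0.1:8080/ > /dev/null
    curl -fsS -H "Host: opendda.webapps.openms.org" http://127.0.0.1:8080/ > /dev/null
```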
t0mdavid-m merged commit 93ab8ec into main Apr 27, 2026
9 checks passed
t0mdavid-m added a commit that referenced this pull request Apr 27, 2026
* Add Matomo analytics integration with GDPR consent support (#341)

* Add Matomo Tag Manager as third analytics tracking mode

Adds Matomo Tag Manager support alongside existing Google Analytics and
Piwik Pro integrations. Includes settings.json configuration (url + tag),
build-time script injection via hook-analytics.py, Klaro GDPR consent
banner integration, and runtime consent granting via MTM data layer API.

https://claude.ai/code/session_0165AXHkmRZ6bx23n7Tbyz8h

* Fix Matomo Tag Manager snippet to match official docs

- Accept full container JS URL instead of separate url + tag fields,
  supporting both self-hosted and Matomo Cloud URL patterns
- Match the official snippet: var _mtm alias, _mtm.push shorthand
- Remove redundant type="text/javascript" attribute
- Remove unused "tag" field from settings.json

https://claude.ai/code/session_0165AXHkmRZ6bx23n7Tbyz8h

* Split Matomo config into base url + tag fields

Separate the Matomo setting into `url` (base URL, e.g.
https://cdn.matomo.cloud/openms.matomo.cloud) and `tag` (container ID,
e.g. yDGK8bfY), consistent with how other providers use a tag field.
The script constructs the full path: {url}/container_{tag}.js

https://claude.ai/code/session_0165AXHkmRZ6bx23n7Tbyz8h

* install matomo tag

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Remove duplicate `address` key in `.streamlit/config.toml` (#346)

* Initial plan

* fix: remove duplicate address entry in config.toml

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Fix integration test failures caused by sys.modules pollution and shutil.SameFileError (#349)

* Initial plan

* Fix integration test failures: restore sys.modules mocks, handle SameFileError, update CI workflow

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Remove unnecessary pyopenms mock from test_topp_workflow_parameter.py, simplify test_parameter_presets.py

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Fix Windows build: correct site-packages path in cleanup step

Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: t0mdavid-m <57191390+t0mdavid-m@users.noreply.github.com>

* Remove server address from bundled config.toml for Windows installer (#351)

On Windows, 0.0.0.0 is not a valid connect address — the browser fails
to open http://0.0.0.0:8501. By removing the address entry from the
bundled .streamlit/config.toml, Streamlit defaults to localhost, which
works correctly for local deployments. Docker deployments are unaffected
as they pass --server.address 0.0.0.0 on the command line.

https://claude.ai/code/session_016amsLCZeFogTksmtk1geb5

Co-authored-by: Claude <noreply@anthropic.com>

* reenable cross origin protection

* Add CLAUDE.md and Claude Code skills for MS webapp development (#357)

* Add CLAUDE.md and Claude Code skills for webapp development

Adds project documentation (CLAUDE.md) and 6 skills to help developers
scaffold and extend OpenMS web applications built from this template:
- /create-page: add a new Streamlit page with proper registration
- /create-workflow: scaffold a full TOPP workflow (class + 4 pages)
- /add-python-tool: add a custom Python analysis script with auto-UI
- /add-presets: add parameter presets for workflows
- /configure-deployment: set up Docker and CI/CD for a new app
- /add-visualization: add pyopenms-viz or OpenMS-Insight visualizations

https://claude.ai/code/session_01WYotmLfqRtB8WJXj1Eosiz

* Strengthen MS domain context in CLAUDE.md and skills

Make it clear to Claude that this is THE framework for building mass
spectrometry web applications for proteomics and metabolomics research.
Add domain-specific context about MS data types, TOPP tool pipelines,
and scientific visualization needs.

https://claude.ai/code/session_01WYotmLfqRtB8WJXj1Eosiz

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Add Kubernetes manifests and CI/CD workflows for deployment (#347)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
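
A fragment showing the rough shape of that rule (deployment context trimmed; only the label pair and the required-scheduling term come from the message):

```yaml
spec:
  template:
    metadata:
      labels:
        volume-group: workspaces
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  volume-group: workspaces
              topologyKey: kubernetes.io/hostname
```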

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
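
The retry pattern described above, as a workflow-step sketch (attempt count from the message; step name, overlay path, and backoff timing are illustrative):

```yaml
- name: Deploy kustomize overlay (retry for webhook readiness)
  run: |
    for i in 1 2 3 4 5; do
      if kubectl apply -k k8s/overlays/prod; then exit 0; fi
      echo "apply attempt $i failed; retrying in $((i * 5))s"
      sleep $((i * 5))
    done
    exit 1
```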

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Claude/kubernetes migration plan kq jw d (#358)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
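
For reference, the rough shape of such an IngressRoute (the apiVersion, port, and cookie name here are assumptions; a sticky cookie named stroute is mentioned later in this thread):

```yaml
apiVersion: traefik.io/v1alpha1   # assumed; older clusters use traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: streamlit
spec:
  entryPoints:
    - web
  routes:
    - match: PathPrefix(`/`)
      kind: Rule
      services:
        - name: streamlit
          port: 8501              # Streamlit's default port; assumed
          sticky:
            cookie:
              name: stroute       # cookie name taken from later commits in this thread
```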

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
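
A sketch of the merge-at-startup idea under those constraints (paths, file names, and the wrapper command are assumptions; the jq merge and set -euo pipefail come from the two messages above):

```yaml
# container command fragment -- illustrative only
command: ["/bin/bash", "-c"]
args:
  - |
    set -euo pipefail
    # right-hand file (the ConfigMap overrides) wins on conflicting keys
    jq -s '.[0] * .[1]' /app/settings.json /config/settings-overrides.json > /tmp/settings.json
    mv /tmp/settings.json /app/settings.json
    exec streamlit run app.py
```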

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Claude/kubernetes migration plan kq jw d (#359)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: change imagePullPolicy to Always for mutable main tag

With IfNotPresent, rollout restarts reuse the cached image even when a
new version has been pushed with the same tag. Always ensures Kubernetes
pulls the latest image on every pod start.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: build full Dockerfile instead of Dockerfile_simple

Switch CI to build the full Docker image with OpenMS and TOPP tools,
not the lightweight pyOpenMS-only image.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Claude/fix mzml files validation y zfla (#361)

* Add Kubernetes manifests and CI workflows for de.NBI migration

Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

kustomization.yaml is a Kustomize config file, not a standard K8s resource,
so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

The integration-test job now uses a matrix with Dockerfile_simple and
Dockerfile. Each matrix entry checks if its Dockerfile exists before
running — all steps are guarded with an `if` condition so they skip
gracefully when a Dockerfile is absent. This allows downstream forks
that only have one Dockerfile to pass CI without errors.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with
  cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU)
  so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which
requires all pods mounting it to run on the same node. Without explicit
affinity rules, the scheduler was failing silently, leaving pods in
Pending state with no events.

Adds a `volume-group: workspaces` label and podAffinity with
requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment,
rq-worker deployment, and cleanup cronjob. This ensures the scheduler
explicitly co-locates all workspace-consuming pods on the same node.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

The controller pod being Ready doesn't guarantee the admission webhook
service is accepting connections. Add a polling loop that waits for the
webhook endpoint to have an IP assigned before applying the Ingress
resource, preventing "connection refused" errors during kustomize apply.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

The kustomize overlay deploys into the openms namespace, but the
verification steps (Redis wait, Redis ping, deployment checks) were
querying the default namespace, causing "no matching resources found".

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

Replace the unreliable endpoint-IP polling with a retry loop on
kubectl apply (up to 5 attempts with backoff). This handles the race
where the ingress-nginx admission webhook has an endpoint IP but isn't
yet accepting TCP connections.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

Kustomize namePrefix renames the Redis service to template-app-redis,
but the REDIS_URL env var in streamlit and rq-worker deployments still
referenced the unprefixed name "redis", causing the rq-worker to
CrashLoopBackOff with "Name or service not known".

Add JSON patches in the overlay to set the correct prefixed hostname.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

The cluster uses Traefik, not nginx, so the nginx Ingress annotations
are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all
routing and sticky session cookie for Streamlit session affinity.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

kubeconform doesn't know the Traefik IngressRoute CRD schema, and the
kind cluster in integration tests doesn't have Traefik installed. Skip
the IngressRoute in kubeconform validation and filter it out with yq
before applying to the kind cluster.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

Kustomize namePrefix doesn't rewrite service references inside CRDs,
so the IngressRoute was pointing to 'streamlit' instead of
'template-app-streamlit', causing Traefik to return 404.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

The ConfigMap was replacing the entire settings.json, losing keys like
"version" and "repository-name" that the app expects (causing KeyError).
Now the ConfigMap only contains deployment-specific overrides, which are
merged into the Docker image's base settings.json at container startup
using jq.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

Addresses CodeRabbit review: if jq merge fails, the container should
not start with unmerged settings.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: change imagePullPolicy to Always for mutable main tag

With IfNotPresent, rollout restarts reuse the cached image even when a
new version has been pushed with the same tag. Always ensures Kubernetes
pulls the latest image on every pod start.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: build full Dockerfile instead of Dockerfile_simple

Switch CI to build the full Docker image with OpenMS and TOPP tools,
not the lightweight pyOpenMS-only image.

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Scope IngressRoute to hostname and drop unused nginx Ingress

Traefik is the only ingress controller on the cluster; the nginx Ingress
in k8s/base/ingress.yaml was orphaned (no nginx class available) and the
overlay was patching it instead of the active Traefik IngressRoute.

- Add Host() match to the base IngressRoute (placeholder filled by overlays)
- template-app overlay patches the IngressRoute with template.webapps.openms.de
- Remove ingress.yaml from the base kustomization resources list (file kept
  in the repo for nginx-based consumers)

https://claude.ai/code/session_01YNDYJTx1eSKaL9vQe1GQzV

* fix: use PVC mount for workspaces in online mode

In online mode, src/common/common.py hard-coded workspaces_dir to the
literal ".." which, from WORKDIR /app, resolved to /. Workspace UUID
directories were therefore created on each pod's ephemeral local
filesystem instead of the shared PVC mounted at
/workspaces-streamlit-template, so the Streamlit pod and the RQ worker
each saw their own disconnected copy. The worker's params.json load in
tasks.py then hit an empty dict, producing `KeyError: 'mzML-files'` as
soon as Workflow.execution() ran.

- common.py: in the online branch, use WORKSPACES_DIR env var (default
  /workspaces-streamlit-template) so Streamlit, the RQ worker, and the
  cleanup cronjob (which already reads WORKSPACES_DIR) all agree on one
  location.
- k8s streamlit & rq-worker deployments: set WORKSPACES_DIR explicitly so
  the env is overridable and visible at deploy time.
- WorkflowManager.start_workflow: call save_parameters() before dispatch
  so the latest session state is flushed to disk, closing a small race
  where a fragment rerun could leave params.json stale when the worker
  picked up the job.

https://claude.ai/code/session_01TsxtENPpuCZ1Ap3mX2ZpHr
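
The deployment-side half of that fix is a plain env entry (fragment; the path comes from the message above):

```yaml
env:
  - name: WORKSPACES_DIR
    value: /workspaces-streamlit-template
```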

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Fix contrib tag (#360)

* fix(ci): pin OpenMS contrib download to matching release tag

The Windows build step downloaded contrib_build-Windows.tar.gz from
OpenMS/contrib without a --tag, always pulling the latest release.
When the GH Actions cache (7-day eviction) expired, a newer contrib
got pulled that was incompatible with the pinned OpenMS release/3.5.0
source tree, breaking MSVC compilation in DIAPrescoring.cpp.

Pin the download to release/${OPENMS_VERSION} and tie the cache key
to the OpenMS version so contrib stays in lockstep with the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): pass release tag as positional arg to gh release download

`gh release download` takes the tag as a positional argument, not a
`--tag` flag. Silently failed to match on Windows with the system error
"The system cannot find the file specified".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: allow contrib version override via OPENMS_CONTRIB_VERSION

Adds OPENMS_CONTRIB_VERSION env var that falls back to OPENMS_VERSION
when empty. Lets us point OPENMS_VERSION at a non-release branch (e.g.
develop) while keeping the Windows contrib download pinned to a known
release tag, so CI doesn't fail on a missing contrib release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
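
Putting the three commits together, the Windows contrib step plausibly ends up shaped like this (step layout and cache-tag naming are assumptions; the positional tag, the version fallback, and the version-keyed cache come from the messages):

```yaml
- name: Cache contrib
  uses: actions/cache@v4
  with:
    path: contrib
    key: contrib-windows-${{ env.OPENMS_CONTRIB_VERSION || env.OPENMS_VERSION }}
- name: Download contrib pinned to the matching release
  run: |
    CONTRIB_TAG="release/${OPENMS_CONTRIB_VERSION:-$OPENMS_VERSION}"
    gh release download "$CONTRIB_TAG" --repo OpenMS/contrib \
      --pattern contrib_build-Windows.tar.gz
```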

* chore: ignore docs/superpowers/ (local design notes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add Kubernetes deployment docs and refactor Claude skills (#362)

* Remove stale patches from template-app overlay

The Deployment/streamlit patch with Ingress-shaped path /spec/rules/0/host
never applied and produced a silent no-op. The duplicate IngressRoute
service-name patch was redundant with the first IngressRoute patch block.
This brings the on-disk overlay in line with the production cluster's
running version.

* Rename configure-deployment skill to configure-docker-compose-deployment

First step of splitting the skill into three focused skills
(configure-app-settings, configure-docker-compose-deployment,
configure-k8s-deployment). Rename is in its own commit so
git log --follow traces the docker-compose content cleanly.

* Scope docker-compose skill to docker-compose-only

Removes app-level content (settings.json, Dockerfile choice, production
app examples) that will live in configure-app-settings. Adds a
prerequisite note pointing to configure-app-settings.

* Add configure-app-settings skill

Covers app-level configuration (settings.json, Dockerfile choice,
README, dependencies) shared by every deployment mode. Prerequisite
for configure-docker-compose-deployment and configure-k8s-deployment.

* Fix settings.json key-field list inconsistency

The Key fields prose listed max_threads (not in the JSON sample) and
omitted enable_workspaces (which is in the sample). Align the prose
with the sample and describe max_threads separately since it is a
nested object rather than a flat field.

* Add configure-k8s-deployment skill

New skill walking through Kustomize overlay creation and kubectl apply
for deploying a forked app to Kubernetes. Patch list reflects the
three-patch canonical shape (IngressRoute match + service, streamlit
Redis URL, rq-worker Redis URL).

* Fix inline-code rendering in k8s skill

The Host(`...`) escape syntax produced literal backslashes that
broke the inline-code span when rendered by markdown parsers. Rewrite
as Host(...) without nested backticks so the span renders cleanly.

* Add K8s deployment doc — overview and architecture sections

* Add K8s deployment doc — manifest reference section

* Add K8s deployment doc — fork-and-deploy guide

* Add K8s deployment doc — CI/CD pipeline section

* Clarify PR-blocking behavior depends on branch protection

The workflow does not block merges directly — it produces a check
status that a branch-protection rule can gate on. Make the
preconditions explicit.

* Register Kubernetes Deployment page in Streamlit documentation

* Cross-link docs/deployment.md to Kubernetes deployment page

Adds a preamble listing both deployment paths and introduces a
## Docker Compose heading above the existing content. The existing
docker-compose content is preserved verbatim.

* Add smoke test for Kubernetes Deployment documentation page

Extends the parametrized test_documentation cases to cover the new
Documentation page added by this branch, closing the gap where it
was the only selectbox entry without test coverage.

* ci: add ghcr-cleanup workflow (scheduled disabled, dry-run default)

* ci: scaffold build-and-test workflow with lint-manifests job

* ci: add build job skeleton with matrix, buildx, ghcr login

* ci: add metadata extraction, build-push, and registry cache

* ci: add kind integration steps to build job

* ci: lowercase image name for OCI cache refs

github.repository preserves the original casing (OpenMS/streamlit-template).
Docker OCI references require lowercase, so cache-from/cache-to fail with
'invalid reference format'. docker/metadata-action handles this internally
for tags, but the cache refs bypass it. Compute IMAGE_NAME_LC once and use
it in both cache refs.
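
A sketch of that computation (bash parameter expansion; the buildcache tag name is assumed):

```yaml
- name: Compute lowercase image name
  run: echo "IMAGE_NAME_LC=${GITHUB_REPOSITORY,,}" >> "$GITHUB_ENV"
- name: Build and push
  uses: docker/build-push-action@v6
  with:
    cache-from: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME_LC }}:buildcache
    cache-to: type=registry,ref=ghcr.io/${{ env.IMAGE_NAME_LC }}:buildcache,mode=max
```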

* ci: don't pass unprefixed local tag to buildx push

With push: true, docker/build-push-action pushes every tag in its tags
input. A bare name like 'openms-streamlit:simple-test' (no registry
prefix) gets resolved to Docker Hub and fails with 401 unauthorized,
because the workflow's GHCR token has no rights on docker.io.

The local tag was only needed for the kind retag step. Since load: true
already loads the image into the runner's docker daemon, we can create
the stable local alias with a plain 'docker tag' step after build,
picking any tag from docker/metadata-action's output.

* ci: delete old docker workflows now superseded by build-and-test

* k8s: pin overlay image tag to main-full (new CI scheme)

* docs(skill): update k8s deploy skill for unified CI workflow

* docs(k8s): update deployment doc for unified CI workflow

* ci: pin container-retention-policy to v3.0.1

The @v3 floating tag does not exist on snok/container-retention-policy
(v2 is the latest floating major tag; v3 only has v3.0.0 and v3.0.1
as exact version tags). The workflow fails to resolve the action with
'unable to find version v3'. Pin to v3.0.1 (latest v3 release).

* fix(docker): stop cache-busting on GITHUB_TOKEN

The ENV GH_TOKEN=${GITHUB_TOKEN} at the top baked the per-run token
into an early layer, so every workflow run rebuilt from scratch.
Moved the ARG next to the one RUN that uses it (gh release download)
so earlier layers stay cacheable.

* docs: fix typo (Gihub -> GitHub) in Dockerfile comments

* ci: enable scheduled GHCR cleanup (weekly Sun 03:00 UTC)

* k8s: serve template app on both .de and .org TLDs

Updates the Traefik IngressRoute match in the template-app overlay to
accept both Host() values, and mirrors the same dual-host pattern in
the nginx Ingress fallback (two rules entries, same backend).

Outer parentheses on the || group are required for correct precedence
against PathPrefix.
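
As an overlay-patch fragment, the dual-host match described above looks roughly like this (patch path assumed):

```yaml
- op: replace
  path: /spec/routes/0/match
  value: (Host(`template.webapps.openms.de`) || Host(`template.webapps.openms.org`)) && PathPrefix(`/`)
```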

* ci: integration-test both .de and .org hosts on nginx and traefik

Adds a dual-host curl assertion to the existing nginx kind integration
and a new traefik-integration job that brings up Traefik via Helm,
deploys the full overlay (no IngressRoute filter), and curls both
hostnames through the IngressRoute.

The traefik-integration job runs once on Dockerfile_simple — ingress
routing is image-agnostic, and adding the full image variant would
double the runtime without catching new regressions.

* ci: enable kind to bind workspace PVC and clean up port-forwards

The cinder-csi storage class isn't available in kind clusters. Patch
it to 'standard' (kind's default local-path-provisioner) at apply
time, alongside the existing imagePullPolicy substitution. Without
this, the workspace PVC stays unbound, streamlit and rq-worker pods
stay Pending, and the new dual-host curl assertions fail with 503.

The existing 'Verify all deployments are available' step had been
masking this with '|| true' since the integration test was added.

Also wire up a trap-based EXIT cleanup for the kubectl port-forward
processes; the previous trailing 'kill' line was unreachable under
set -e if any curl assertion failed.
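
A sketch of the apply-time substitutions plus the trap cleanup (sed patterns, the Traefik service name/namespace, and step layout are assumptions; the standard storage class, imagePullPolicy substitution, and trap-on-EXIT come from the message):

```yaml
- name: Deploy overlay to kind (patched for local storage)
  run: |
    kubectl kustomize k8s/overlays/prod \
      | sed -e 's/storageClassName: cinder-csi/storageClassName: standard/' \
            -e 's/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/' \
      | kubectl apply -f -
- name: Curl both hosts through a port-forward
  run: |
    kubectl -n traefik port-forward svc/traefik 8080:80 &   # service name/namespace assumed
    trap 'kill %1 2>/dev/null || true' EXIT
    sleep 5
    curl -fsS -H "Host: template.webapps.openms.de"  http://127.0.0.1:8080/ > /dev/null
    curl -fsS -H "Host: template.webapps.openms.org" http://127.0.0.1:8080/ > /dev/null
```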

* skill(configure-k8s-deployment): document dual-host overlay edit

Updates the overlay-edit step to require editing both Host() values
(.de and .org) plus the parallel nginx Ingress two-rules pattern.
Updates the verification checklist accordingly.

* skill(configure-k8s-deployment): fix markdown rendering and clarify nginx patch

CommonMark code spans don't process backslash escapes for backticks,
so `Host(\`…\`)` rendered as broken fragments. Wrap with double
backticks instead — the inner backticks are then literal.

Also clarify the nginx fallback note: 'patch both rules[].host
entries' could be misread as directly editing the shared base file;
'add an overlay patch for both rules[].host entries' is unambiguous.

* docs(kubernetes-deployment): document dual-host serving

Updates the architecture diagram, manifest reference, customization
table, and CI/CD section to describe the dual-host (.de + .org)
default. Adds a short subsection on the per-host stroute cookie and
why cross-TLD switches are harmless.

* docs(kubernetes-deployment): fix stale job count and missing kind patches in Job 3

Two factual errors caught in review:

- "both jobs run on pull requests" was true with 2 jobs, but there
  are now 3 (lint-manifests, build, traefik-integration). All three
  run on PRs.
- Job 3's description omitted that the deploy step still patches
  imagePullPolicy and storageClassName for kind compatibility, even
  though it doesn't filter the IngressRoute. Job 2's description
  already mentions both patches; Job 3 should be parallel.

* ci: use nginx Ingress hostnames for nginx-job curl assertions

The nginx Ingress is unpatched by the overlay, so it retains its base
hostnames (streamlit.openms.example.de / .org) from k8s/base/ingress.yaml.
The previous curl step used the Traefik IngressRoute hostnames
(template.webapps.openms.*), which the nginx ingress controller does
not match — every request 404'd.

Traefik's curl step is unchanged: the IngressRoute IS patched to the
template.webapps.openms.* hostnames, so those are correct there.

* k8s: mount admin password from streamlit-secrets Secret

The Save-as-Demo feature reads the admin password from st.secrets (i.e.
.streamlit/secrets.toml), but the Streamlit pod never had that file
mounted, so the feature was always disabled in cluster deployments.

Mount an optional Secret named streamlit-secrets as
/app/.streamlit/secrets.toml, add a reference .example manifest (not
included in kustomization -- the Secret is created out-of-band so no
password lands in git), gitignore any filled-in copy, and document the
imperative kubectl-create flow alongside the manifest alternative.

https://claude.ai/code/session_01LAJZ5EWBJkznj7vQnKt8vV

* fix errors

* ci: bump pyopenms to 3.5.0 and pin python 3.10 to match Dockerfile

The committed Dockerfile builds OpenMS from release/3.5.0 on python 3.10,
but requirements.txt pinned pyopenms==3.3.0 and ci.yml ran on python 3.11,
causing test_gui.py to fail with AttributeError on MSExperiment.to_df()
(the API was renamed from get_df to to_df in 3.5).

* fix(view): use pyopenms 3.5 get_df API instead of unreleased to_df

`MSExperiment.to_df()` exists only on the OpenMS develop branch and is
not in the published pyopenms 3.5.0 wheel that CI installs from PyPI,
causing AttributeError in the raw data viewer. Switch to `get_df()` and
`get_df(long=True)` — both return the same column names that the
existing rename logic expects (rt/ms_level/mz_array/intensity_array
for the wide form, rt/mz/intensity for the long form).

* fix(k8s): mount streamlit-secrets as directory so optional: true works

CI pods crashlooped because a `subPath: secrets.toml` file mount cannot
resolve when the optional Secret is absent. Mount the Secret as a
directory at /app/admin-secrets/ instead, and register that path via
[secrets].files in .streamlit/config.toml so st.secrets picks it up
without shadowing the baked-in config.toml / credentials.toml.
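
A minimal sketch of the directory mount; paths follow the commit text, while the volume name and surrounding Deployment layout are assumptions:

```yaml
volumes:
  - name: admin-secrets
    secret:
      secretName: streamlit-secrets
      optional: true   # resolves cleanly even when the Secret is absent
containers:
  - name: streamlit
    volumeMounts:
      - name: admin-secrets
        mountPath: /app/admin-secrets
        readOnly: true
```

Streamlit then picks up `/app/admin-secrets/secrets.toml` because that path is listed under the `secrets.files` option in `.streamlit/config.toml`, alongside the default secrets locations.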

* docs(k8s): add streamlit-secrets example to template-app overlay

Mirrors the base example with overlay-specific guidance: `namePrefix`
only rewrites Kustomize-managed resources, so imperative Secrets must
still use the literal name `streamlit-secrets`.

* k8s: two-tier scheduling via Kustomize components + LimitRange

Factor node placement and memory sizing out of the base manifests into
reusable Kustomize components (memory-tier-low / memory-tier-high), so
each fork picks its tier with a single line in its overlay.

- base: remove per-pod `resources` from streamlit and rq-worker
  Deployments; sizing now comes from the tier component
- base: promote redis to Guaranteed QoS (requests == limits for both
  cpu and memory) so it bottoms the kernel OOM list
- base: add LimitRange so containers without explicit resources inherit
  safe defaults (512Mi/250m request, 2Gi/2 limit, 64Gi/16 max)
- components/memory-tier-low: nodeSelector=low, streamlit 512Mi/2Gi,
  rq-worker 1Gi/16Gi (Burstable)
- components/memory-tier-high: nodeSelector=high, streamlit 512Mi/4Gi,
  rq-worker 2Gi/180Gi (Burstable — uniform across heavy workers so a
  single active app can burst into the shared pool)
- overlays: rename template-app/ to prod/ (one overlay per repo; the
  repo itself identifies the app) and pull in memory-tier-low
- docs & skill: document the new overlays/prod/ path and the one-line
  tier selector; update CI to kustomize the renamed overlay
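
A sketch of the one-line tier pick and the LimitRange defaults listed above; these live in two separate files, and paths and metadata names here are assumptions:

```yaml
# overlays/prod/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
components:
  - ../../components/memory-tier-low   # the one-line tier selector
---
# base LimitRange (sketch) with the defaults described above
apiVersion: v1
kind: LimitRange
metadata:
  name: workspace-defaults
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 512Mi
        cpu: 250m
      default:
        memory: 2Gi
        cpu: "2"
      max:
        memory: 64Gi
        cpu: "16"
```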

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* ci(k8s): label kind node to match the overlay's memory tier

The memory-tier-low component adds nodeSelector
openms.de/memory-tier=low to every Deployment. kind clusters have no
such label, so after the rename to overlays/prod all pods stayed
Pending and 'Wait for Redis to be ready' timed out.

Label --all kind nodes in both the nginx and Traefik integration jobs
before deploying so the nodeSelector matches.
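
The fix amounts to a single labeling step in each job; the step name and placement are illustrative:

```yaml
- name: Label kind nodes with the memory tier
  run: kubectl label nodes --all openms.de/memory-tier=low --overwrite
```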

Also raise the LimitRange max.memory from 64Gi to 200Gi. The original
cap was written before memory-tier-high settled on a 180Gi rq-worker
limit; without the bump, a high-tier fork (e.g. OpenDIAKiosk) would be
rejected by admission when deployed into the shared openms namespace
after the template's LimitRange is applied.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* k8s: move streamlit-secrets.yaml.example into overlays/prod/

Completes the overlay rename started in 6c61365 now that the branch
has merged main, which added the example file under the old path.

Also rewrite two remaining docs references to overlays/<your-app-name>/
and the CI description to the new prod overlay.

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* ci(k8s): two-node kind cluster with both tier labels

Spin up a 2-node kind cluster (control-plane labeled memory-tier=low
+ ingress-ready, worker labeled memory-tier=high) so the Build-and-Test
job passes regardless of which memory-tier component a fork's overlay
pulls in. Previously we labeled --all nodes with a single tier after
creation, which broke as soon as a fork flipped memory-tier-low to
memory-tier-high.

- .github/kind-config.yaml: 2-node topology with per-node labels.
- .github/workflows/build-and-test.yml: point both helm/kind-action
  invocations (nginx build + traefik-integration) at the config and
  drop the now-redundant dynamic label step.
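
A sketch of the two-node topology; kind supports per-node labels in recent releases, and the exact file contents are assumptions:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    labels:
      openms.de/memory-tier: "low"
      ingress-ready: "true"
  - role: worker
    labels:
      openms.de/memory-tier: "high"
```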

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* ci(k8s): clear control-plane NoSchedule taint in two-node kind config

Previous run (2f28ed9) showed build + traefik-integration jobs still
timing out on 'Wait for Redis'. Root cause: multi-node kind clusters
apply the node-role.kubernetes.io/control-plane:NoSchedule taint to
the control-plane, so app pods without a matching toleration can't
land there even though the nodeSelector matches. The single-node kind
cluster used previously had no such taint, which is why CI worked
until we added a second node.

Add a kubeadmConfigPatches stanza setting nodeRegistration.taints to
the empty list so the control-plane is schedulable. Labels and
cluster shape (1 control-plane + 1 worker) stay the same.
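
The stanza, roughly, using the standard kind pattern for clearing the taint:

```yaml
nodes:
  - role: control-plane
    # clear nodeRegistration taints so app pods can schedule here
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          taints: []
```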

https://claude.ai/code/session_01LW4iBWt5YftuqFGc3jM5ZP

* k8s: store demo workspaces on the workspaces PVC

Adds a seed-demos initContainer to the Streamlit Deployment that merges
image-shipped demos into /workspaces-streamlit-template/.demos/ with
cp -rn, so new demos in an image appear after redeploy while admin-saved
demos and edits persist across redeploys.

- Point demo_workspaces.source_dirs at the PV path via the ConfigMap
  override (both streamlit and rq-worker pick this up through the jq
  settings merge at startup).
- Make get_demo_target_dir() settings-driven so "Save as Demo" writes
  to the PV, with backwards-compatible fallbacks for the legacy
  source_dir string and for environments without settings (tests).
- Skip hidden top-level dirs in clean-up-workspaces.py so the nightly
  cron does not garbage-collect .demos/.
- Document the .demos/ layout and the re-seed flow.
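
A rough shape of the initContainer; the image ref, the image-side demo path, and the volume name are placeholders, and `|| true` allows for newer coreutils where `cp -n` exits nonzero on skipped files:

```yaml
initContainers:
  - name: seed-demos
    image: app-image:tag   # placeholder; same image as the streamlit container
    command:
      - sh
      - -c
      - |
        mkdir -p /workspaces-streamlit-template/.demos
        # -n never overwrites, so admin-saved demos and edits survive;
        # demos new to the image are still copied in.
        cp -rn /app/demos/. /workspaces-streamlit-template/.demos/ || true
    volumeMounts:
      - name: workspaces-volume   # placeholder; the workspaces PVC volume
        mountPath: /workspaces-streamlit-template
```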

https://claude.ai/code/session_01Y87aULHSdyBobPdaD4L6tW

* k8s: ship streamlit-secrets by default, hide admin UI when empty

The Secret used to be an out-of-band copy-the-example step, so forgetting
the resources-list edit left the pod booting with an empty admin-secrets
mount and a user-facing "Admin not configured" error for a feature that
was never wired up in the first place.

Now the Secret is committed to the base with an empty admin password and
included in k8s/base/kustomization.yaml, so kubectl apply -k always
creates it. The "Save as Demo" expander is gated on a non-empty password
and is hidden entirely (no error box) when not configured. Operators
enable the feature by patching the live Secret or by editing the file
locally with git update-index --skip-worktree, both documented.
Exception handling in is_admin_configured() is tightened to also catch
StreamlitSecretNotFoundError so a missing secrets file never raises.
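
In essence the committed Secret looks like this; the key name follows the directory mount described earlier and is an assumption:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: streamlit-secrets
stringData:
  secrets.toml: |
    admin_password = ""
```

Operators then set a real password on the live object (e.g. via `kubectl edit secret streamlit-secrets`) or edit the file locally with `git update-index --skip-worktree`, as described above.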

https://claude.ai/code/session_01V1noocAR7uXWjWsC9oLGhz

* ci: reuse built docker images across ingress tests

Split the build+test flow into three stages so the traefik ingress
test no longer rebuilds Dockerfile_simple from scratch:

  build (matrix: full, simple)
    -> uploads each image as a workflow artifact
  test-nginx (matrix: full, simple)
    -> downloads artifact, kind loads, tests nginx ingress
  test-traefik (simple only)
    -> downloads simple artifact, kind loads, tests traefik ingress

Artifacts (not GHCR) are used because the build job only pushes on
non-PR events and fork PRs cannot auth to GHCR at all, so registry
sharing would not work for every PR path.
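
The hand-off shape, roughly; action versions, image names, and artifact names are assumptions:

```yaml
# build job
- run: docker save -o image.tar openms-streamlit:${{ matrix.variant }}
- uses: actions/upload-artifact@v4
  with:
    name: image-${{ matrix.variant }}
    path: image.tar

# test job
- uses: actions/download-artifact@v4
  with:
    name: image-${{ matrix.variant }}
- run: kind load image-archive image.tar
```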

* ci: run test-traefik against both image variants

Mirror the build/test-nginx matrix so the traefik ingress test also
covers the full and simple variants instead of just simple.

* ci: harden ingress-test wait/curl flow for slow simple deployments

test-traefik (simple) failed in the combined "Wait for Redis and
deployments to be ready" step because the deployment took longer than
120s to become available, and unlike the test-nginx wait the failure
was not soft. Align test-traefik with test-nginx:

- Split Redis wait (hard, 60s) from deployment wait (soft, `|| true`).
- Bump deployment timeout 120s -> 180s in both jobs.
- Widen the curl warm-up loop from 5x2s to 30x2s in both jobs so a
  marginally late deployment is tolerated; a real failure still
  surfaces via the trailing unconditional curl.
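
In workflow terms the split looks roughly like this, with `$HOST` standing in for the asserted hostname and the selectors illustrative:

```yaml
- name: Wait for Redis (hard failure)
  run: kubectl wait --for=condition=ready pod -l app=redis --timeout=60s
- name: Wait for deployments (soft failure)
  run: kubectl wait --for=condition=available deployment --all --timeout=180s || true
- name: Warm-up curls, then one hard assertion
  run: |
    for i in $(seq 1 30); do
      curl -sf -H "Host: $HOST" http://127.0.0.1/ > /dev/null && break
      sleep 2
    done
    curl -sf -H "Host: $HOST" http://127.0.0.1/ > /dev/null
```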

* Rework configure-k8s-deployment skill as an interview

The previous skill was a manual find-and-replace checklist that assumed
Claude could run kubectl against the cluster. Restructure it as an
interview-driven file-editing guide with a clear handoff to a human
operator (or CI) for cluster apply.

- Drop kubectl, kubectl kustomize, and rollout-verification steps that
  Claude can't actually execute.
- Drop nginx ingress fallback; production is Traefik-only.
- Add a Step 1 recon over a fixed set of base/overlay/CI files so
  defaults are derived from the repo, and the skill bails on layouts
  it doesn't recognize.
- Replace the manual checklist with six interview questions, each
  paired with what it controls in the running deployment, the proposed
  default, and the reasoning. Slug, GHCR ref, image tag, ingress
  subdomain, memory tier, workspace storage size.
- Make storage a single 1-line edit to k8s/base/workspace-pvc.yaml when
  the user picks a non-default size; keep the PVC base name unchanged
  (namePrefix scopes it per-fork, no collisions).
- Pin the default storage size to 500 Gi to match the stock base, so
  the default needs zero file edits.
- Explain that images[0].name is a Kustomize match key and must not
  change.
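
The storage question then reduces to a single value in k8s/base/workspace-pvc.yaml:

```yaml
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi   # stock default; a non-default answer edits only this line
```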

* k8s: drop cross-fork pod-affinity, rely on RWO PVC for co-location

The shared volume-group: workspaces label and required pod-affinity
attracted every fork's workspace pods onto a single node per memory
tier and deadlocked the first replica of any fork landing on an
otherwise-empty tier (no peer pod for the required affinity to match).

Per-fork RWO PVCs (<slug>-workspaces-pvc) already constrain all of
a fork's workspace-using pods to the node the volume is attached to
via the scheduler's VolumeBinding plugin, so the explicit affinity
adds nothing on top. Removing it scopes co-location naturally to one
fork and lets a fresh tier bootstrap without manual affinity-strip.

NodeSelector continues to pick the memory tier; the RWO mount picks
the specific node within that tier.
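
For the record, the removed constraint had this shape; the topologyKey is an assumption:

```yaml
# Removed: required co-scheduling with any pod labeled volume-group=workspaces
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            volume-group: workspaces
        topologyKey: kubernetes.io/hostname
```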

* ci: derive slug + Traefik hosts from overlay so forks stay green

The kind integration jobs in build-and-test.yml hardcoded `template-app`
as the slug label and `template.webapps.openms.{de,org}` as the Traefik
hostnames. The configure-k8s-deployment skill rewrites those values when
a fork customizes its overlay, after which `kubectl wait -l app=...`
returns "no matching resources found" and Traefik curl tests hit the
wrong Host header. This broke OpenMS/quantms-web PR #19 on its first
overlay PR (run 24964475081).

Have test-nginx and test-traefik discover SLUG (from `commonLabels.app`)
and TRAEFIK_HOSTS (parsed from the rendered IngressRoute match) right
after deploy, and substitute them into the wait/curl steps. The nginx
hostnames stay hardcoded — they come from `k8s/base/ingress.yaml`, which
the skill never edits and Kustomize doesn't rewrite.

Update the configure-k8s-deployment skill to (a) check during recon that
the workflow uses dynamic discovery, (b) flag forks still on the old
hardcoded shape so the skill applies the patch before editing the
overlay, and (c) note in the handoff that no fork-specific workflow
edits are needed.
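
A sketch of what the discovery could look like; the parsing commands are illustrative, not the workflow's actual implementation:

```yaml
- name: Discover slug and Traefik hosts from the rendered overlay
  run: |
    rendered=$(kubectl kustomize k8s/overlays/prod)
    slug=$(printf '%s\n' "$rendered" | awk '/^    app:/ {print $2; exit}')
    hosts=$(printf '%s\n' "$rendered" \
      | grep -o 'Host(`[^`]*`)' | sed 's/Host(`\(.*\)`)/\1/' | tr '\n' ' ')
    echo "SLUG=$slug" >> "$GITHUB_ENV"
    echo "TRAEFIK_HOSTS=$hosts" >> "$GITHUB_ENV"
```

Later steps then use `kubectl wait -l "app=$SLUG"` and loop the curl assertions over `$TRAEFIK_HOSTS` instead of hardcoded values.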

* refix ci

* refix admin panel

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>