Skip to content

feat: add experimental support for promql via CH timeseries table#2349

Merged
knudtty merged 16 commits into
mainfrom
aaron/promql-app-oss
May 27, 2026
Merged

feat: add experimental support for promql via CH timeseries table#2349
knudtty merged 16 commits into
mainfrom
aaron/promql-app-oss

Conversation

@knudtty
Copy link
Copy Markdown
Contributor

@knudtty knudtty commented May 27, 2026

Summary

Adds extremely experimental support for PromQL in HyperDX via the ClickHouse Timeseries Engine.

Screenshots or video

image

References

  • Linear Issue: Closes HDX-1238

knudtty added 7 commits May 27, 2026 10:37
(cherry picked from commit d2527afeb21b25ba785a6a62088402f2d6489ca0)
supported

(cherry picked from commit 502d0081a4832bb48d3cee815d18ef247b034fd0)
(cherry picked from commit 44ca1efb55a3f63216813d5057ad4de4d17efcbe)
(cherry picked from commit 4e21660ea09e6ec4323fa02dea78841d5584f281)
- prometheus router: use new ClickhouseClient() directly (OSS pattern)
  instead of EE-only createClickHouseClient helper
- renderChartConfig: scope useTextIndexForImplicitColumn type access
  to BuilderChartConfigWithDateRange so the new PromqlChartConfigEx
  union variant doesn't break the indexed access
- renderChartConfig test: drop unused 4th sourceFilters arg
- clickhouse/index.ts: remove unused SourceFilters import
- useChartConfig.tsx: group prometheusApi import with siblings
@vercel
Copy link
Copy Markdown

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 27, 2026 8:33pm

Request Review

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 27, 2026

🦋 Changeset detected

Latest commit: 5a58388

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 5 packages
Name Type
@hyperdx/otel-collector Minor
@hyperdx/common-utils Minor
@hyperdx/api Minor
@hyperdx/app Minor
@hyperdx/cli Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (11):
    • docker/clickhouse/local/config.xml
    • docker/clickhouse/local/users.xml
    • docker/otel-collector/config.standalone.yaml
    • docker/otel-collector/schema/seed/00008_otel_metrics_timeseries.sql
    • packages/api/src/config.ts
    • packages/api/src/routers/external-api/v2/sources.ts
    • packages/api/src/routers/external-api/v2/utils/dashboards.ts
    • packages/api/src/tasks/checkAlerts/index.ts
    • packages/otel-collector/builder-config.yaml
    • packages/otel-collector/cmd/migrate/main.go
    • packages/otel-collector/cmd/migrate/main_test.go
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 56
  • Production lines changed: 1520 (+ 525 in test files, excluded from tier calculation)
  • Branch: aaron/promql-app-oss
  • Author: knudtty

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

E2E Test Results

All tests passed • 193 passed • 3 skipped • 1244s

Status Count
✅ Passed 193
❌ Failed 0
⚠️ Flaky 2
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Deep Review

Scope: 63 files, +2028/-61 against 55926e5c. Adds experimental PromQL via ClickHouse TimeSeries — new /v1/prometheus/{query,query_range,label/:name/values} endpoints, new PromqlSource kind, new chart config type, OPAMP prometheusremotewrite exporter, frontend editor.

🔴 P0/P1 — must fix

  • packages/api/src/routers/api/prometheus.ts:133 — Any authenticated team member can set Connection.prometheusEndpoint to an arbitrary URL (Zod only validates z.string().url()), and proxyToPrometheus then fetches it server-side and returns the response body — yielding SSRF to cloud-metadata IPs (169.254.169.254), loopback, RFC1918, and other internal services.
    • Fix: Restrict the scheme to http/https in ConnectionSchema, reject loopback / link-local / RFC1918 / IPv6 ULA after DNS resolution (re-resolve immediately before fetch to defeat TOCTOU), disable redirect-following on the fetch, and gate prometheusEndpoint writes to an admin role or an env-driven host allowlist.
    • security, adversarial

🟡 P2 — recommended

  • packages/api/src/routers/api/prometheus.ts:268 — Caught errors are surfaced verbatim to the client via e instanceof Error ? e.message : String(e), exposing ClickHouse query text, internal host strings, and library stack hints to the browser at all three handlers (lines 264, 359, 442).
    • Fix: Log the full error server-side with a request id, return a generic message keyed by error class to the client.
    • reliability, security
  • packages/api/src/api-app.ts:112 — Endpoints mount at /v1/prometheus/* but real Prometheus exposes /api/v1/*; a Grafana datasource pointed at this HyperDX URL will 404 because the path layout doesn't match the spec the PR claims compatibility with.
    • Fix: Either mount the router so paths resolve as /v1/prometheus/api/v1/query_range (alias mount) or document explicitly that the endpoint is HyperDX-internal and not drop-in Prometheus-compatible.
    • api-contract
  • packages/api/src/routers/api/prometheus.ts:265 — All caught errors return HTTP 400 with errorType: 'bad_data' regardless of cause; ClickHouse outages, upstream Prometheus 5xx, and timeouts are indistinguishable from genuine bad input, defeating Grafana's retry/error classification and SLO monitoring.
    • Fix: Map error classes to bad_data (400), execution (422), timeout (503), unavailable (503), and internal (500) per the Prometheus HTTP API spec.
    • reliability, api-contract
  • packages/api/src/routers/api/prometheus.ts:216query_range resolution guard (end - start) / step > MAX_RESOLUTION silently lets NaN-comparing inputs through (any NaN comparison is false), so missing/garbage start/end/step bypasses the guard and a backwards or unbounded query proceeds to ClickHouse.
    • Fix: Require start/end/step be present, then assert Number.isFinite(start) && Number.isFinite(end) && start < end && step > 0 before computing the ratio.
    • correctness, reliability, adversarial
  • packages/api/src/routers/api/prometheus.ts:18parseTimestamp('') silently returns 0 because Number('') is 0; a missing params.start (when caller sends ?start=) yields an epoch-0 query that bypasses validation rather than erroring.
    • Fix: After Number(value), reject empty/whitespace-only strings explicitly: if (typeof value === 'string' && value.trim() === '') throw new Error('Invalid timestamp').
    • correctness, testing
  • packages/api/src/routers/api/prometheus.ts:335const evalMs = time ? Math.floor(time * 1000) : Date.now(); treats a legitimate time=0 (Unix epoch) as falsy and silently substitutes "now"; a caller asking for an instant at epoch 0 gets a different instant.
    • Fix: Compare against undefined explicitly: time !== undefined ? Math.floor(time * 1000) : Date.now().
    • correctness
  • packages/app/src/hooks/useChartConfig.tsx:308 — Inline granToSec map omits '2 day', '7 day', '30 day' (all present in the project's granularity presets) and falls back to ?? 60, so any PromQL chart at multi-day resolution silently sends step=60s and trips the backend's 11k-point guard with a 400.
    • Fix: Replace the hand-rolled map with a call to the existing convertGranularityToSeconds utility in @hyperdx/common-utils, which parses N unit strings rather than enumerating presets.
    • correctness, maintainability
  • packages/api/src/mcp/tools/query/helpers.ts:139 — Flipping !isRawSqlSavedChartConfig to isBuilderSavedChartConfig leaves PromQL configs falling through to the raw-SQL branch, which calls renderChartConfig whose PromQL no-op returns {sql:'', params:{}}; the MCP tile runner sends an empty SQL string to ClickHouse and returns no error and no rows.
    • Fix: Add an isPromqlSavedChartConfig guard before the existing branch that returns a clear 'PromQL tiles are not yet supported via MCP' error, or wire PromQL through the Prometheus API path.
    • correctness, testing, maintainability, api-contract, agent-native
  • packages/common-utils/src/core/renderChartConfig.ts:1774 — The PromQL early-return {sql: '', params: {}} lets any caller of renderChartConfig (sessions, useServiceMap, hdxMTViews, useOffsetPaginatedQuery, MCP) silently send an empty query string to ClickHouse instead of failing loudly.
    • Fix: Throw new Error('renderChartConfig: PromQL configs are not SQL-renderable; route through prometheusApi') so misuse surfaces immediately.
    • maintainability
  • packages/app/src/components/ConnectionForm.tsxConnectionSchema adds prometheusEndpoint and the API persists it, but the connection form has no input and no defaultValues entry for it, so the field is reachable only via direct API or hand-edited Mongo.
    • Fix: Add an Advanced-Settings field bound to prometheusEndpoint, or document that the field is API-only and gate it accordingly.
    • api-contract
  • packages/common-utils/src/types.ts:1202z.string().url().optional() rejects "", so any future form that submits an empty string for "no endpoint" (rather than undefined) will 400 with a non-obvious Zod error.
    • Fix: Use z.string().url().or(z.literal('')).optional().transform(v => v || undefined) to normalize empty input.
    • api-contract
  • packages/api/src/routers/external-api/v2/sources.tsformatExternalSource now emits SourceKind.Promql but the OpenAPI Source oneOf at lines 671–683 still lists only log|trace|metric|session; external consumers generated from the published schema will fail to deserialize PromQL sources.
    • Fix: Either add PromqlSource to the OpenAPI components and discriminator mapping, or filter PromQL sources from the v2 listing the way PromQL tiles are filtered in convertTileToExternalChart.
    • api-contract
  • packages/api/src/routers/external-api/v2/utils/dashboards.ts:382convertTileToExternalChart returns undefined for PromQL configs, so external API consumers and hyperdx_query_tile see PromQL tiles silently disappear; the tile-by-id lookup reports Tile not found for IDs the user can see in the UI.
    • Fix: Extend the external tile schema with a PromQL variant or return a structured "unsupported" placeholder so the tile is visible but explicitly not executable through this surface.
    • api-contract, agent-native
  • packages/api/src/opamp/controllers/opampController.ts:289 — The exporter URL is serialized as the literal 'http://${env:CLICKHOUSE_PROMETHEUS_METRICS_ENDPOINT}/write' and resolved agent-side; if the env var is unset on the collector pod the resulting URL is http:///write and the entire metrics/promql pipeline (or the whole remote config) fails to apply with no surface in the server logs.
    • Fix: Validate CLICKHOUSE_PROMETHEUS_METRICS_ENDPOINT at server startup when IS_PROMQL_ENABLED=true, or pass the resolved endpoint through the OPAMP config instead of relying on ${env} indirection.
    • reliability, adversarial
  • packages/api/src/opamp/controllers/opampController.ts:288 — The new prometheusremotewrite exporter has no timeout, retry_on_failure, or sending_queue settings, while the sibling clickhouse exporter does; on a downstream outage the collector falls back to defaults (in-memory queue, indefinite retries) and grows memory until the pipeline stalls.
    • Fix: Mirror the bounded retry_on_failure and sending_queue settings from the clickhouse exporter.
    • reliability
  • packages/api/src/routers/api/prometheus.ts:425SELECT DISTINCT all_tags[label] with mapContains cannot use a primary-key index and has no LIMIT; for high-cardinality labels (trace_id, pod, instance) the query scans the full TimeSeries tags projection on every editor keystroke (60s tanstack cache) and can OOM the autocomplete dropdown.
    • Fix: Add LIMIT 10000 before SETTINGS, and consider server-side caching of label → distinct values with TTL.
    • performance
  • packages/api/src/routers/api/prometheus.ts:224ClickhouseClient is constructed per-request at every handler (lines 224, 328, 415); under concurrent load this incurs TLS handshake overhead per call and risks pool exhaustion.
    • Fix: Cache clients keyed by ${host}|${username} in an LRU with a TTL bounded by credential rotation expectations, or confirm the upstream client already keep-alives sockets.
    • reliability, performance
  • packages/api/src/routers/api/prometheus.ts:162 — All three handlers duplicate auth lookup → connection fetch → proxy-or-CH dispatch in 30+ lines apiece; with three call sites the duplication is now an earning extraction rather than premature.
    • Fix: Extract resolveConnectionOrRespond(req, res) returning {teamId, params, connection} | null and a runOrProxy({connection, params, proxyPath, chQuery, formatter}) helper so each handler shrinks to its query-specific logic.
    • maintainability
  • packages/app/src/hooks/useChartConfig.tsx:538useAliasMapFromChartConfig uses 'configType' in config && (config as { configType: string }).configType === 'promql' because the parameter is typed BuilderChartConfigWithOptDateRange — the comment claims TS "may not include PromQL here", but that's a design choice to widen, not a type-system constraint to work around.
    • Fix: Widen the parameter type to ChartConfigWithOptDateRange | undefined and replace the runtime cast with isPromqlChartConfig(config); if the branch is genuinely unreachable, delete it.
    • maintainability, kieran-typescript
  • packages/api/src/routers/api/prometheus.ts:56getParams is heavily any-typed: req.query as Record<string, string> and req.body as Record<string, string> both narrow ParsedQs / any to a shape that runtime can violate (nested arrays, objects), and the rest of the file leans on resp.json<any>() and (r: any) => r.val.
    • Fix: Define a typed ParsedPromqlParams validated at the boundary (coerce non-string entries to strings or drop them), and type proxyToPrometheus and ClickHouse responses with concrete shapes.
    • kieran-typescript
  • packages/app/src/api.ts:626prometheusApi.labelValues calls server.get(...).json() directly, bypassing the prometheusFetch wrapper used by queryRange; HTTP errors surface as opaque ky HTTPError instead of the unwrapped server message, so the two endpoints have asymmetric error behavior.
    • Fix: Route labelValues through prometheusFetch (or extract the body-unwrap helper) so both endpoints surface server-formatted error messages uniformly.
    • kieran-typescript, reliability
  • packages/api/src/routers/api/prometheus.ts:82formatMatrixResponse parameter type declares time_series: [string, number][] but the body branches on typeof timestamp === 'string' | number and the test fixture passes [string | number, number][], requiring as any casts in the test.
    • Fix: Widen the parameter type to [string | number, number][] so the runtime guard isn't lying about its input.
    • kieran-typescript
  • .changeset/clever-suns-nail.md — The changeset frontmatter includes @hyperdx/cli: minor, but the diff touches nothing under packages/cli/, forcing a published-but-unchanged CLI version bump.
    • Fix: Remove the @hyperdx/cli line from the changeset frontmatter.
    • project-standards
  • packages/api/src/mcp/tools/query/index.ts — No MCP tool wraps /v1/prometheus/{query,query_range,label/:name/values}, so an agent cannot execute a PromQL query even though the UI can; combined with runConfigTile no-opping PromQL tiles, agents have zero path to the new capability.
    • Fix: Register hyperdx_promql_query / hyperdx_promql_label_values tools that proxy the same code path as the prometheus router.
    • agent-native
  • packages/api/src/api-app.ts:111 — The router is mounted behind isUserAuthenticated (session cookies) only; the bearer-token v2 external API has no equivalent route, so headless agents / CI tooling cannot reach PromQL endpoints.
    • Fix: Mirror the routes under /api/v2/prometheus behind validateUserAccessKey.
    • agent-native

Testing gaps (P2):

  • The ClickHouse path of /query_range, /query, and /label/:name/values has zero end-to-end coverage — the integration test only verifies the proxy branch and the "fetch must not be called" assertion. Mock ClickhouseClient.query and assert the full request → response envelope for the CH branch (the primary feature of the PR).
  • useChartConfig's PromQL branch (granularity → step map, distinguishing-labels legend transform, fallback ?? 60) is entirely untested.
  • The MCP runConfigTile PromQL fall-through behavior is untested — a regression test would have caught the silent empty-SQL no-op.
  • proxyToPrometheus error branches (TimeoutError, non-OK with body, non-JSON 200) are unexercised.
  • parseTimestamp and parseDuration have no coverage for '', whitespace, '0', negatives, scientific notation, uppercase units, or missing units.
  • isPromqlChartConfig / isPromqlSavedChartConfig guards have no unit tests; PromqlSource discriminator has no persistence/round-trip test.
🔵 P3 nitpicks (18)
  • packages/api/src/routers/api/prometheus.ts:122 — Magic numbers MAX_RESOLUTION=11_000, PROXY_TIMEOUT_MS=30_000, MAX_EXECUTION_SEC=30 have no comment on provenance.
    • Fix: Add a comment citing Prometheus / Grafana defaults, or surface as env-tunable constants.
  • packages/api/src/routers/api/prometheus.ts:56getParams merges body over query without documentation; precedence is undefined to readers and not aligned with Prometheus's own GET-vs-POST convention.
    • Fix: Either JSDoc the precedence or pick the source by req.method.
  • packages/api/src/routers/api/prometheus.ts:133new URL(path, prometheusEndpoint) discards a non-root pathname on the endpoint (e.g. https://prom.example.com/prefix/prefix dropped).
    • Fix: Concatenate trimmed pathnames: new URL(prometheusEndpoint.replace(/\/$/, '') + path).
  • packages/api/src/routers/api/prometheus.ts:155 — Upstream text/plain 200 responses (misconfigured proxies) crash resp.json() and surface as a confusing SyntaxError.
    • Fix: Check resp.headers.get('content-type') includes JSON, or wrap .json() and rethrow a clearer message.
  • packages/api/src/routers/api/prometheus.ts:406labelName from req.params.name is interpolated into the proxy path; Express already restricts :name to non-slash but no character-class guard for :, ?, #, %2E%2E.
    • Fix: Validate against ^(__name__|[A-Za-z_][A-Za-z0-9_]*)$ before forwarding.
  • packages/api/src/routers/api/prometheus.ts:86formatMatrix/VectorResponse iterate row.tags / row.time_series without null guards.
    • Fix: Default with row.tags ?? [] and row.time_series ?? [].
  • packages/api/src/routers/api/prometheus.ts:152resp.text() on the non-OK path has no body-size or read-timeout cap.
    • Fix: Wrap with a separate AbortController or content-length check.
  • packages/api/src/routers/api/prometheus.ts:264logger.error(e, ...) may include credential-bearing fields on some ClickHouse client error shapes.
    • Fix: Scrub e (host, auth, config) before logging.
  • packages/api/src/opamp/controllers/opampController.ts:290prometheusremotewrite is configured with tls.insecure: true and http:// scheme; acceptable in-VPC but should be flagged in deployment docs.
    • Fix: Derive the scheme from env or document the trust assumption.
  • packages/app/src/hooks/useChartConfig.tsx:369parseFloat(val) doesn't handle Prometheus's "NaN", "+Inf", "-Inf" value strings; NaN/Infinity can degrade Recharts rendering.
    • Fix: Map special strings to null (gaps) or explicit Infinity / -Infinity.
  • packages/app/src/hooks/useChartConfig.tsx:358 — Distinguishing-labels legend collapses series when all labels happen to have a single distinct value across the result set.
    • Fix: When distinguishingKeys is empty but there are multiple series, fall back to the full label set.
  • packages/app/src/hooks/usePromqlMetadata.ts:10 — TanStack default 3-retry behavior on a failed __name__ query can thrash an already-stressed ClickHouse.
    • Fix: Set retry: 1 or add a backoff.
  • packages/app/src/components/Sources/SourceForm.tsx:1912setValue('timestampValueExpression' as any, 'timestamp') casts away the form state typing; PromqlTableModelForm exists only to fire a useEffect and return null.
    • Fix: Seed the default in form defaultValues (or the Zod transform) and delete the empty component, removing the as any cast.
  • packages/api/src/models/source.ts:264Source.discriminator(SourceKind.Promql, new Schema({})) is an empty discriminator with no comment on intent.
    • Fix: Add a one-line comment explaining the empty schema is intentional, or remove the discriminator if the base schema suffices.
  • packages/common-utils/src/types.ts:895PromqlChartConfigSchema.step is persisted but never consumed; the only reader builds the step from granularity.
    • Fix: Thread config.step through (preferred) or drop the field.
  • packages/app/src/api.ts:590prometheusFetch JSON-parse fallback uses a fragile parseErr.message !== e.message equality test that can swallow the real error.
    • Fix: Always try json() then text(), and throw a constructed Error with the HTTP status + body snippet.
  • packages/api/src/routers/external-api/v2/utils/__tests__/dashboards.test.ts:395 — New PromQL-tile fixture uses } as any to satisfy the SavedChartConfig union.
    • Fix: Type the literal as PromqlSavedChartConfig from common-utils/types.
  • Hardcoded 'default' (db) and 'otel_metrics_ts' (table) repeated at prometheus.ts:214,326,413 and useChartConfig.tsx:332 — easy to drift.
    • Fix: Extract DEFAULT_PROMQL_DATABASE / DEFAULT_PROMQL_TABLE constants shared with the seed migration.

Pre-existing (separated, not counted in verdict):

  • ConnectionSchema.host is bare z.string() with no URL/scheme validation; this PR amplifies the impact by adding three new fetch-based handlers that consume it (packages/common-utils/src/types.ts:1194).

Reviewers (11): correctness, security, testing, maintainability, project-standards, reliability, api-contract, performance, adversarial, kieran-typescript, agent-native.

Coverage: ce-learnings-researcher (always-on) was not dispatched in this run. Untracked file scope (ce-plugin/) excluded from review per Stage 1. The base: argument resolved to a pre-merge SHA, so the diff includes a recent main-merge commit; analysis prioritized PromQL-specific paths in packages/api/src/routers/api/prometheus.ts, packages/app/src/hooks/useChartConfig.tsx, packages/common-utils/src/{types,guards,core/renderChartConfig}.ts, the OPAMP exporter, and PromQL source/discriminator wiring.

Testing gaps:

  • ClickHouse query path of /v1/prometheus/* is uncovered.
  • useChartConfig PromQL branch (granularity map, legend transform, fallback) is uncovered.
  • MCP runConfigTile PromQL fall-through is uncovered.
  • parseTimestamp / parseDuration edge cases are uncovered.
  • proxyToPrometheus failure modes (timeout, non-OK, non-JSON) are uncovered.

@knudtty knudtty merged commit 3123db5 into main May 27, 2026
28 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants