Skip to content

feat(mcp): improve MCP tool quality — error hints, shared helpers, better messages#2337

Merged
kodiakhq[bot] merged 7 commits into
mainfrom
brandon/mcp-tool-quality-improvements
May 27, 2026
Merged

feat(mcp): improve MCP tool quality — error hints, shared helpers, better messages#2337
kodiakhq[bot] merged 7 commits into
mainfrom
brandon/mcp-tool-quality-improvements

Conversation

@brandon-pereira
Copy link
Copy Markdown
Member

@brandon-pereira brandon-pereira commented May 25, 2026

What

Targeted improvements to MCP tool quality, identified and validated through performance evals against real agent workflows. These changes reduce tool-call failures, improve agent self-correction, and make tool outputs more actionable.

Query tools: top-level where + orderBy resolution

Table & Timeseries now accept a top-level where / whereLanguage that scopes the entire query. A shared mergeWhereIntoSelectItems() helper injects it into each select item's aggCondition (ANDed with any per-item where). No more duplicated merge logic.

Table gets an orderBy resolver: agents frequently write orderBy: "count", which generates ORDER BY count ASC — but ClickHouse can't resolve bare count since it's a function. The resolver maps aggFn names to aliases or synthesized expressions (count(), avg(Duration), etc.).

Timeseries now detects single-bucket collapse (1 row returned with explicit granularity) and adds a hint to adjust the time range or granularity.

Event patterns: topN, trendBuckets, restructured output

  • topN (1-100, default 20) — control how many patterns are returned
  • trendBuckets (0-60, default 24) — set to 0 to disable trend sparklines (smaller response)
  • Output restructured: top-level summary + patterns array + usage guide
  • Each pattern now includes rank and shareOfTotal for easier analysis

Schema & description improvements

  • Richer tool descriptions with concrete examples, common mistakes, and expression tips
  • MCP_AGG_FN_OPTIONS extracted as shared constant (single source of truth for aggFn names)
  • whereLanguage description now includes syntax matching guidance and common mistakes
  • groupBy documents multi-column comma-delimited input and arbitrary ClickHouse expressions
  • valueExpression description shows useful patterns (duration conversion, attribute casting, boolean→numeric)

Infrastructure

  • Error hints: Duplicated ClickHouse catch blocks extracted into clickHouseErrorResult() with pattern-matched hints for DateTime64 casting, AS alias quoting, and response size overflow
  • trimToolResponse: Now returns { data, isTrimmed } so callers detect trimming without re-serializing
  • Dashboard schemas: granularity added to line/stacked_bar external dashboard zod schemas
  • describeSource: Switched to compact JSON output (saves context tokens)
  • Better not-found messages: Source/connection not-found errors suggest calling hyperdx_list_sources
  • Regex fix: errorHint AS-alias regex uses \bAS\b word boundary to avoid false matches on words like DATABASE, ALIAS

Why

These changes came out of running MCP performance evals against real agent tasks. The combined effect:

  • Error hints cut agent confusion loops on ClickHouse-specific errors
  • orderBy resolution eliminates the most common ClickHouse ORDER BY failure mode
  • Top-level where reduces multi-step filter workarounds
  • Restructured event pattern output makes pattern analysis workflows significantly more reliable

Coming soon: We're building out a structured eval framework for MCP tool performance. Stay tuned — we'll be sharing the harness, benchmark suite, and results publicly.

Files changed (12)

File Change
helpers.ts mergeWhereIntoSelectItems(), clickHouseErrorResult(), error hints, better not-found messages
table.ts Top-level where, orderBy resolver, updated descriptions
timeseries.ts Top-level where, single-bucket collapse detection
schemas.ts MCP_AGG_FN_OPTIONS, richer descriptions for all fields
eventPatterns.ts topN, trendBuckets parameters
runEventPatterns.ts Restructured output format, shareOfTotal, rank
describeSource.ts Compact JSON, trimToolResponse API update
dashboards.ts granularity in convertToInternalTileConfig
trimToolResponse.ts Returns { data, isTrimmed }
trimToolResponse.test.ts Updated tests for new return shape
queryTool.test.ts Test for topN property
zod.ts SQLIntervalSchema + granularity on time chart schemas

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 25, 2026

🦋 Changeset detected

Latest commit: cc1279a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 27, 2026 10:40pm

Request Review

@github-actions github-actions Bot added the review/tier-2 Low risk — AI review + quick human skim label May 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 25, 2026

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 642 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 9
  • Production lines changed: 642 (+ 352 in test files, excluded from tier calculation)
  • Branch: brandon/mcp-tool-quality-improvements
  • Author: brandon-pereira

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 25, 2026

E2E Test Results

All tests passed • 188 passed • 3 skipped • 1336s

Status Count
✅ Passed 188
❌ Failed 0
⚠️ Flaky 7
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@brandon-pereira brandon-pereira force-pushed the brandon/mcp-tool-quality-improvements branch from 0e02cd2 to ae01aef Compare May 25, 2026 16:09
@github-actions github-actions Bot added review/tier-4 Critical — deep review + domain expert sign-off and removed review/tier-2 Low risk — AI review + quick human skim labels May 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 25, 2026

Deep Review

✅ No critical issues found. Several P2 items below are worth addressing before merge, and a number of P3 nits and testing gaps round out the report.

🟡 P2 -- recommended

  • packages/api/src/mcp/tools/query/eventDeltas.ts:434 -- the catch block wrapping the sample-row ClickHouse query still emits a bespoke Failed to sample rows: message instead of routing through the new clickHouseErrorResult helper, so DateTime64/AS/V8-string hints never reach agents on this failure path.
    • Fix: Replace the hand-rolled error wrapping at this catch site with return clickHouseErrorResult(e); after the abort call.
    • reliability
  • packages/api/src/mcp/tools/query/table.ts:111 -- resolveOrderBy returns matched select-item aliases verbatim, so an alias containing whitespace or special characters (e.g. Total Count) generates ORDER BY Total Count which ClickHouse rejects, even though renderChartConfig quotes the matching alias on the SELECT side.
    • Fix: Wrap the returned alias in double quotes when it contains whitespace or non-identifier characters (or always quote, mirroring the SELECT-side rendering).
    • adversarial
  • packages/api/src/mcp/tools/query/table.ts:107 -- the alias lookup is case-insensitive and runs before the aggFn-name synthesis, so orderBy: "count" against a select list like [{aggFn:'avg', alias:'count'}, {aggFn:'count'}] resolves to the bare string count, which ClickHouse parses as the function name rather than an identifier.
    • Fix: Either prefer aggFn-name synthesis over alias collisions on reserved function tokens, or require alias matches that collide with MCP_AGG_FN_OPTIONS to round-trip through a quoted identifier.
    • adversarial
  • packages/api/src/mcp/tools/query/runEventPatterns.ts:303 -- the auto-suggested whereSnippet is built as ${bodyExpression}:"phrase", but SAFE_BODY_EXPR_CHARS allows bracket-accessed map keys like SpanAttributes['http.url']; an agent that follows the usage instruction to feed whereSnippet into hyperdx_search (Lucene) will hit a parse error the PR's own description warns about.
    • Fix: Detect bracket access in bodyExpression and either emit a SQL-form whereSnippet with a whereLanguage hint, normalize to dot notation, or omit the snippet for unsupported shapes.
    • adversarial
  • packages/api/src/mcp/tools/query/runEventPatterns.ts:332 -- the response shape moves totalCount/sampledRows/etc. from result.* to a new top-level summary, renames sampledRows to sampledCount, and gates the per-pattern trend field behind trendBuckets > 0; any prompt or downstream caller learned against the prior shape will silently get undefined, and the changeset entry does not flag the contract break.
    • Fix: Either bump the tool name (e.g. _v2), ship temporary backwards-compat aliases on the response, or upgrade the changeset to call out the shape change as a breaking output contract.
    • api-contract
  • packages/api/src/mcp/tools/query/table.ts:225 -- when runConfigTile returns an error result, content[0].text is plain text, the warning-injection JSON.parse throws, and the bare catch {} silently drops the mergeWarnings (including the critical "top-level where was NOT applied" notice); the same pattern repeats in timeseries.ts:126.
    • Fix: When the result is isError or JSON parsing fails, append warnings as an extra text content part instead of swallowing them, and extract the parse-mutate-restringify pattern into a shared helper to avoid drifting between the two tools.
    • correctness, reliability, maintainability, kieran-typescript
  • packages/api/src/mcp/tools/query/runEventPatterns.ts:312 -- the new trendBuckets > 0 ? ... : undefined ternary changes the response contract (the per-pattern trend field disappears) but no test covers trendBuckets: 0, so the disabled-trend branch and its impact on the usage string drift unverified.
    • Fix: Add a unit/integration test that calls runEventPatterns with trendBuckets: 0 and asserts the trend field is absent on every pattern and the usage string is consistent.
    • testing
  • packages/api/src/mcp/tools/query/timeseries.ts:142 -- the single-bucket collapse detection (parse JSON content, inspect parsed.result.data.length === 1, inject a hint) has no direct test coverage on either the positive case or any of the early-return guards.
    • Fix: Add an integration test that triggers a one-row timeseries response with explicit granularity and asserts the rendered hint appears; cover the negative case (no granularity, multi-row data) as well.
    • testing
  • packages/api/src/mcp/__tests__/queryTool.test.ts -- the new mergeWarnings end-to-end injection in table.ts/timeseries.ts (parse, append warnings array, restringify with null, 2) is exercised only at the helper level; no test asserts the warnings reach the final response JSON.
    • Fix: Add tests that trigger a language mismatch (top-level whereLanguage:'lucene' plus an item with whereLanguage:'sql') and assert the parsed content[0].text contains the expected warnings array.
    • testing
🔵 P3 nitpicks (13)
  • packages/api/src/mcp/tools/query/helpers.ts:236 -- the builder-config Connection not found branch was missed when the other not-found messages were upgraded to suggest hyperdx_list_sources.
    • Fix: Append the same hyperdx_list_sources suggestion to this branch (or extract a shared notFoundError(kind, id) helper).
    • correctness, maintainability
  • packages/api/src/mcp/tools/sources/describeSource.ts:325 -- the final JSON.stringify(finalOutput) dropped the prior null, 2 indent arguments, diverging from every other MCP tool output in this PR.
    • Fix: Restore JSON.stringify(finalOutput, null, 2) to keep formatting consistent across tools.
    • maintainability, api-contract, project-standards
  • packages/api/src/mcp/tools/query/helpers.ts:406 -- the AS-alias regex omits the /i flag while the other two hint patterns include it, so lowercase as in any ClickHouse syntax-error context never triggers the hint.
    • Fix: Either add /i with a tighter context anchor (e.g. require a preceding unexpected/near), or document the intentional case-sensitivity inline.
    • correctness, reliability
  • packages/api/src/mcp/tools/query/runEventPatterns.ts:317 -- the new field name shareOfTotal is computed against sampled rows, not the full population, which is the opposite of how the name will read to an agent.
    • Fix: Rename to shareOfSample or expand the usage string to make the sampling caveat unmissable.
    • correctness
  • packages/api/src/mcp/tools/query/table.ts:91 -- when two select items share a case-insensitive alias, resolveOrderBy silently picks the first match.
    • Fix: Detect duplicate case-insensitive alias matches and emit a mergeWarnings-style note so the agent knows which item won.
    • correctness
  • packages/api/src/mcp/tools/query/table.ts:127 -- an orderBy: 'count_distinct' with a select item that omits valueExpression falls through both synthesis branches and returns the bare identifier, producing an invalid ClickHouse query.
    • Fix: Either require valueExpression for count_distinct at the schema level or emit a typed error/hint when the aggregate has no expression to synthesize from.
    • adversarial, kieran-typescript
  • packages/api/src/mcp/tools/query/runEventPatterns.ts:230 -- with trendBuckets: 0 the pattern miner still runs trend computation; Math.ceil(diffSeconds / 0) yields Infinity, falls through to the default granularity, and produces a one-bucket trend that the formatting layer then discards.
    • Fix: Short-circuit trend computation inside minePatterns when trendBuckets <= 0 so the wasted work and the divide-by-zero path are both avoided.
    • correctness
  • packages/api/src/mcp/tools/query/helpers.ts:415 -- the V8-String overflow hint pattern targets length exceeds the maximum allowed size of V8 String; verify that this matches the actual error string the ClickHouse client or Node emits, otherwise the hint silently never fires.
    • Fix: Reproduce the overflow once against the real client/runtime and either confirm the existing pattern or broaden it to also match Cannot create a string longer than / Invalid string length.
    • reliability
  • packages/api/src/utils/trimToolResponse.ts:32 -- the trimmed-branch return uses as T even though trimObject may inject a __hdx_trimmed sentinel and truncate string values, so the spread-into-output pattern in eventDeltas.ts, describeSource.ts, and runEventPatterns.ts leaks that sentinel into the agent-visible response.
    • Fix: Either omit the sentinel before returning, type the trimmed branch as Partial<T> & { __hdx_trimmed?: true }, or destructure the sentinel out before spreading at each call site.
    • kieran-typescript, adversarial
  • packages/api/src/mcp/tools/query/table.ts:86 -- resolveOrderBy lives in table.ts but is imported from timeseries.ts, creating a sibling-tool dependency that obscures the module graph and pins shared logic to a tool registration file.
    • Fix: Move resolveOrderBy and AGG_FN_NAMES into helpers.ts (or a dedicated orderBy.ts).
    • maintainability, kieran-typescript
  • packages/api/src/mcp/tools/query/schemas.ts:165 -- the orderBy zod description still says only "Column to sort results by"; agents won't discover the new aggFn-name resolution unless it's surfaced in the schema text.
    • Fix: Append a one-line example showing that bare aggFn names like count or avg are auto-resolved.
    • maintainability
  • packages/api/src/mcp/tools/query/helpers.ts:423 -- the file now sits at ~423 lines, over the project's documented 300-line file-size guideline.
    • Fix: Extract the new error-hint helpers (clickHouseErrorResult, errorHint) into a sibling errors.ts and consider lifting mergeWhereIntoSelectItems into its own module.
    • project-standards
  • packages/api/src/mcp/tools/query/runEventPatterns.ts:291 -- the new default topN: 20 silently truncates clusterCount patterns; only attentive agents that compare summary.clusterCount vs summary.patternsReturned will notice.
    • Fix: Add an explicit truncated: true flag on the response when truncation occurs and call this out in the changeset.
    • api-contract

Reviewers (11): correctness, testing, maintainability, project-standards, kieran-typescript, api-contract, reliability, performance, adversarial, agent-native, learnings.

Testing gaps:

  • No coverage for the trendBuckets: 0 output-shape branch in runEventPatterns.ts.
  • No coverage for the single-bucket collapse hint in timeseries.ts (positive or guard cases).
  • No coverage for mergeWarnings reaching the agent end-to-end (or being dropped when the underlying query errors).
  • No coverage for the isTrimmed: true note-injection branches in eventDeltas.ts, describeSource.ts, or runEventPatterns.ts after the {data, isTrimmed} migration.
  • No coverage for resolveOrderBy with aliases containing spaces, with case-insensitive collisions, or with count_distinct/quantile shapes missing valueExpression.
  • The agent-friendly-cli-principles.md learning under ce-plugin/docs/solutions/ describes related ground (bounded output, actionable error messages) and is worth a glance, though it's CLI-framed.

Changes identified and validated through MCP performance evals:

Query tool improvements:
- Add top-level 'where' + 'whereLanguage' to table and timeseries tools
  with mergeWhereIntoSelectItems() shared helper (no inline duplication)
- Add orderBy resolution for bare aggFn names (e.g. 'count' → 'count()')
- Add single-bucket collapse detection for timeseries with granularity hint
- Pass post-merge selectItems to resolveOrderBy (consistency fix)

Event patterns improvements:
- Add topN parameter (1-100, default 20) to control pattern count
- Add trendBuckets parameter (0-60, default 24) with disable option
- Restructure output: top-level 'summary' + 'patterns' + 'usage' guide
- Add rank, shareOfTotal to each pattern for easier analysis

Schema/description improvements:
- Richer tool descriptions with examples, common mistakes, expression tips
- Extract MCP_AGG_FN_OPTIONS as shared constant
- Expanded whereLanguage description with syntax matching guidance
- groupBy now documents multi-column and expression support

Infrastructure:
- Extract duplicated ClickHouse catch blocks into clickHouseErrorResult()
  with pattern-matched error hints (DateTime64, AS alias, response size)
- trimToolResponse now returns { data, isTrimmed } for cheaper detection
- Add granularity to external dashboard zod schemas (line/stacked_bar)
- describeSource uses compact JSON output
- Improve source/connection-not-found messages with recovery suggestions
P1: mergeWhereIntoSelectItems now returns { items, warnings } instead of
a bare array. When a top-level where is silently dropped due to language
mismatch, a warning is emitted and surfaced in the response JSON so the
agent can recover.

P2: Add 21 focused unit tests covering every branch of:
  - mergeWhereIntoSelectItems (empty, inject, AND-combine, mismatch warning, mixed)
  - errorHint (DateTime64, AS with word boundary, V8 overflow, no-match)
  - resolveOrderBy (undefined, passthrough, alias match, count/avg/quantile
    synthesis, case insensitivity, 'none' passthrough)

P3 nitpicks:
  - Filter 'none' out of AGG_FN_NAMES so resolveOrderBy doesn't synthesize
    an invalid none(expr) ORDER BY clause
  - Use null, 2 in the single-bucket hint JSON.stringify for consistent
    pretty-print formatting with formatQueryResult
  - Narrow eslint-disable comments in trimToolResponse.ts to cover only the
    as T assertion, not the entire return statement
P0 fixes:
- queryTool.test.ts: update event_patterns assertions from old
  output.result.{patterns,totalCount,sampledRows} shape to new
  output.{summary,patterns} envelope with sampledCount
- dashboards.ts: add granularity to raw-SQL convertToInternalTileConfig
  pick() list (was silently dropped on PUT)
- dashboards.ts: add granularity to convertToExternalTileChartConfig for
  Line/StackedBar (both builder and raw-SQL) so GET→PUT round-trips
  preserve the field

P2 fixes:
- timeseries.ts: apply resolveOrderBy so 'orderBy: count' works on
  timeseries too, not just table
- resolveOrderBy: parse and re-append trailing ASC/DESC direction
  (e.g. 'count DESC' → 'count() DESC') instead of falling through
- resolveOrderBy: skip synthesis for quantile without level to avoid
  generating invalid quantile(Duration) SQL

P3 fixes:
- runEventPatterns.ts: set sampleMultiplier: 1 in empty-data branch
  (semantically correct for no-data case)

Tests: 2 new unit tests for ASC/DESC handling and quantile-without-level
(30 total passing)
The granularity round-tripping on the external API is a separate
concern from MCP tool quality — split it out to keep this PR focused
and avoid touching the critical-path dashboards.ts file.
@github-actions github-actions Bot added review/tier-3 Standard — full human review required and removed review/tier-4 Critical — deep review + domain expert sign-off labels May 25, 2026
… hints, trim note

- resolveOrderBy: synthesize count(DISTINCT expr) for count_distinct aggFn
- resolveOrderBy: return canonical alias case on case-insensitive match
- runEventPatterns: use shared clickHouseErrorResult for error hints
- runEventPatterns: add 'Call hyperdx_list_sources' hint to source/connection errors
- errorHint: drop /i flag on AS regex to avoid false positives on lowercase 'as'
- describeSource: surface isTrimmed note when response is trimmed
- Add unit tests for all new behaviors
@brandon-pereira brandon-pereira requested review from a team and wrn14897 and removed request for a team May 25, 2026 20:28
wrn14897
wrn14897 previously approved these changes May 27, 2026
… shape

The event_deltas tool was merged before trimToolResponse was refactored
to return { data, isTrimmed } instead of the raw value. Without this
fix, every successful response would serialize the wrapper object
({data: ..., isTrimmed: false}) instead of the inner payload, breaking
the documented output contract.
@kodiakhq kodiakhq Bot merged commit a4b9fa8 into main May 27, 2026
19 checks passed
@kodiakhq kodiakhq Bot deleted the brandon/mcp-tool-quality-improvements branch May 27, 2026 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-3 Standard — full human review required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants