[4/7] Telemetry Event Emission and Aggregation #327
samikshya-db wants to merge 140 commits into main from
Conversation
The emission format conforms to the telemetry proto; marked this ready for review.
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker (sketched below):
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
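For reference, a minimal TypeScript sketch of the CLOSED → OPEN → HALF_OPEN machine described above, using the commit's default thresholds. Class and member names are illustrative assumptions, not the driver's actual CircuitBreaker API.

```typescript
// Illustrative sketch only — not the driver's CircuitBreaker class.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreakerSketch {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5, // 5 failures trigger OPEN
    private readonly resetTimeoutMs = 60_000, // 60s before probing recovery
    private readonly successThreshold = 2, // 2 successes to CLOSE
  ) {}

  // Callers skip the protected operation while the breaker is OPEN.
  allowRequest(): boolean {
    if (this.state === 'OPEN' && Date.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // timeout elapsed: test recovery
      this.successes = 0;
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    this.failures = 0;
    if (this.state === 'HALF_OPEN' && ++this.successes >= this.successThreshold) {
      this.state = 'CLOSED'; // recovered
    }
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN'; // trip (or re-open) and restart the timeout window
      this.openedAt = Date.now();
    }
  }
}
```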
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure
This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush

Batching Strategy (sketched below):
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management
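A compact sketch of the flush triggers described under "Batching Strategy" above — batch-size threshold, periodic timer, and immediate flush for terminal errors. Names and event shapes are assumptions for illustration, not the actual MetricsAggregator API.

```typescript
// Illustrative sketch of the flush triggers, not the real MetricsAggregator.
class AggregatorSketch {
  private buffer: object[] = [];

  constructor(
    private readonly exportBatch: (batch: object[]) => Promise<void>,
    private readonly batchSize = 100, // default batch-size trigger
    flushIntervalMs = 5_000, // default periodic trigger
  ) {
    // unref() keeps the timer from holding the process open.
    setInterval(() => this.flush(), flushIntervalMs).unref();
  }

  record(event: object, opts: { terminalError?: boolean } = {}): void {
    this.buffer.push(event);
    // Terminal errors and full batches flush immediately; everything else
    // waits for the periodic timer or statement completion.
    if (opts.terminalError || this.buffer.length >= this.batchSize) {
      this.flush();
    }
  }

  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    // Fire-and-forget: telemetry must never throw into the caller.
    this.exportBatch(batch).catch(() => {});
  }
}
```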
This is part 5 of 7 in the telemetry implementation stack.

Components:
- DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker
- TelemetryExporterStub: Test stub for integration tests

DatabricksTelemetryExporter:
- Exports telemetry metrics to Databricks via HTTP POST
- Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth)
- Integrates with CircuitBreaker for per-host endpoint protection
- Retry logic with exponential backoff and jitter
- Exception classification (terminal vs retryable)

Export Flow:
1. Check circuit breaker state (skip if OPEN)
2. Execute with circuit breaker protection
3. Retry on retryable errors with backoff
4. Circuit breaker tracks success/failure
5. All exceptions swallowed and logged at debug level

Retry Strategy (sketched below):
- Max retries: 3 (default, configurable)
- Exponential backoff: 100ms * 2^attempt
- Jitter: Random 0-100ms to prevent thundering herd
- Terminal errors: No retry (401, 403, 404, 400)
- Retryable errors: Retry with backoff (429, 500, 502, 503, 504)

Circuit Breaker Integration:
- Success → Record success with circuit breaker
- Failure → Record failure with circuit breaker
- Circuit OPEN → Skip export, log at debug
- Automatic recovery via HALF_OPEN state

Critical Requirements:
- All exceptions swallowed (NEVER throws)
- All logging at LogLevel.debug ONLY
- No console logging
- Driver continues when telemetry fails

Testing:
- 24 comprehensive unit tests
- 96% statement coverage, 84% branch coverage
- Tests verify exception swallowing
- Tests verify retry logic
- Tests verify circuit breaker integration
- TelemetryExporterStub for integration tests

Dependencies:
- Builds on all previous layers [1/7] through [4/7]
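The retry schedule above, sketched in TypeScript. `send` and `isTerminal` are stand-ins for the exporter's HTTP call and the ExceptionClassifier; in the real exporter the final failure is swallowed and logged at debug rather than rethrown.

```typescript
// Sketch of the documented policy: backoff = 100ms * 2^attempt + 0-100ms jitter.
async function exportWithRetry(
  send: () => Promise<void>,
  isTerminal: (err: unknown) => boolean,
  maxRetries = 3,
): Promise<void> {
  for (let attempt = 0; ; attempt += 1) {
    try {
      await send();
      return;
    } catch (err) {
      // Terminal errors (401/403/404/400) and an exhausted budget stop here.
      if (isTerminal(err) || attempt >= maxRetries) throw err;
      const backoffMs = 100 * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```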
…iring, error telemetry
Rebase the regressed exporter/aggregator/feature-flag-cache on main's
hardened versions and re-apply only the genuinely new functionality
(CONNECTION_CLOSE event, chunk-timing aggregation) on top. Closes the
critical findings from the multi-reviewer audit:
- SSRF guard, redactSensitive, sanitizeProcessName, hasAuthorization,
auth-missing warn-once — all restored via main's telemetryUtils.
- MetricsAggregator memory bounds (maxPendingMetrics with error-preferring
drop, maxErrorsPerStatement, statementTtlMs eviction) restored.
- FeatureFlagCache in-flight fetch dedup and TTL clamp [60s, 3600s]
restored; lib/telemetry/urlUtils.ts deleted.
- close() now properly awaits aggregator drain — fixes the close()/flush
race that PR #362 already fixed once.
- Driver version reads lib/version.ts via buildUserAgentString instead
of hardcoded '1.0.0'; uuidv4() restored in place of Math.random().
- TelemetryTerminalError re-exported from lib/index.ts.
Type-safe wiring:
- Added optional getTelemetryEmitter() / getTelemetryAggregator() to
IClientContext; removed all 7 `(this.context as any)` casts.
- Six copy-pasted event listeners in DBSQLClient.initializeTelemetry
collapsed into one `Object.values(TelemetryEventType)` loop — closes
the listener-name mismatch that silently dropped error events.
- mapAuthType now covers all 6 authType values instead of defaulting
everything to 'pat'.
TelemetryClient now owns the host-scoped resources:
- TelemetryClientProvider is a process-wide singleton (getInstance()).
- TelemetryClient owns DatabricksTelemetryExporter, MetricsAggregator,
CircuitBreakerRegistry, and FeatureFlagCache for its host. Implements
IClientContext itself so the owned components have a stable context
that survives any single DBSQLClient's close.
- DBSQLClient instances on the same host share the breaker counters,
feature-flag cache, exporter, and HTTP batches. Fixes the per-instance
breaker-fragmentation noted in iter-2 architecture review.
- Each DBSQLClient still holds its own TelemetryEventEmitter (respects
per-client telemetryEnabled); emitters bridge into the shared aggregator.
- Exporter falls back to context.getAuthProvider() when no explicit auth
provider is passed, so the shared exporter resolves auth through the
TelemetryClient's FIFO of registered DBSQLClients.
Error telemetry wired across operation entry points:
- Re-added emitErrorEvent(error) on DBSQLOperation; uses
ExceptionClassifier.isTerminal() to classify.
- fetchChunk, cancel, close, getMetadata wrap their bodies in try/catch
that calls emitErrorEvent before re-throwing. Verified end-to-end
against a real Azure Databricks workspace: failed query produces
STATEMENT_COMPLETE + ERROR (with redacted stack) on the wire.
- Removed the await getMetadata() call from emitStatementComplete —
eliminates the extra Thrift RPC on every close (F19) AND prevents
spurious error telemetry from getMetadata's wrapper firing during
close-cleanup of an already-failed operation.
Other:
- Iterating Map.keys() while mutating made safe via snapshot in close().
- STATEMENT_COMPLETE no longer zeroes accumulated chunk metrics when
the emit doesn't supply them (matches sibling-field guards).
- Tests for the rebased modules restored from main; provider tests
updated for the singleton API; deleted unused TelemetryExporterStub.
484 unit tests passing. Diff vs main: ~+2110/-383, down from the
original PR's +3640/-1173.
Co-authored-by: Isaac
Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (
- Apply prettier to TelemetryClientProvider.test.ts (sed edits in the prior commit didn't preserve formatting).
- Silence eslint `no-await-in-loop` on the auth-context fall-through in TelemetryClient.getConnectionProvider — sequential by intent.
- Drop the empty public constructor on TelemetryClientProvider; leave a comment explaining the singleton + test-friendly construction contract.

Co-authored-by: Isaac
Mocha tests need `function () {}` so they can use `this.timeout()` /
`this.skip()` — arrow functions don't bind `this` to the test context.
The `func-names` rule was firing on every test in the suite (including
pre-existing tests in `protocol_versions.test.ts`); moving the rule to
the test-file override block silences those warnings.
Co-authored-by: Isaac
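For illustration, the binding difference at issue (standard Mocha behavior, nothing project-specific):

```typescript
describe('telemetry e2e', function () {
  it('uploads a batch against a live workspace', function () {
    this.timeout(30_000); // works: Mocha binds `this` to the test context
    if (!process.env.E2E_HOST) this.skip(); // also needs the bound context

    // ...test body...
  });

  // With an arrow function, `this` is captured lexically, so
  // `this.timeout(...)` / `this.skip()` are unavailable:
  // it('...', () => { this.timeout(5_000); }); // broken
});
```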
…n, knobs

Iter-3 review fixes addressing 17 distinct findings from the multi-agent review. Telemetry is now functionally correct and operationally safe.

Critical
- F1: TelemetryClient ctor wires getOrCreateContext on FeatureFlagCache. isTelemetryEnabled was previously short-circuiting to false in production because no caller registered the host — every customer silently emitted zero events.
- F2: integration test asserts the documented default (true), not the prior off-by-default. Test was contradicting production code.
- F3: IClientContext.getAuthProvider now optional; consumers use ?.() so external implementers don't break on upgrade.

High / privacy
- F4: explicit DATABRICKS_TELEMETRY_DISABLED parser (1/true/yes/on, case-insensitive). Avoids the footgun where DATABRICKS_TELEMETRY_DISABLED=false also disabled telemetry. Documented in CHANGELOG and TSDoc.
- F12: TelemetryClient.registerContext warns on telemetry-config and userAgentEntry divergence so multi-tenant misconfig is visible.
- F9: connect()-on-reconnect releases prior refcount; close() clears the emitter ref so post-close events can't smuggle into a closed aggregator.
- M-1: redactSensitive strips /home/<user>/, /Users/<user>/, and C:\Users\<user>\ patterns from stack traces.
- M-3: FeatureFlagCache.getAuthHeaders falls through to the context's auth provider — feature-flag GET is no longer unconditionally unauth.

Operational
- F7: MetricsAggregator.close races the final flush against a configurable telemetryCloseTimeoutMs (default 2s) so a flapping endpoint can't hang process.exit(0).
- F8: flushInFlight serializer prevents concurrent fire-and-forget flushes from starving the user's HTTP socket pool. Drain pattern in close() awaits any in-flight flush, then issues a fresh one to capture close-time metrics that would otherwise be stranded.
- F16: maxStatementMetrics cap (default 5000) with oldest-first eviction. Buffered errors emitted as standalone metrics first so the first-failure signal survives.
- DBSQLSession.close() emits connection.close even when closeSession fails so failed-close rates are visible in dashboards.

Maintainability
- F10/F17: single withErrorTelemetry helper covers fetchChunk, cancel, close, finished, hasMoreRows, getSchema, getMetadata. safeEmit helper consolidates seven copy-pasted "get emitter, emit, swallow at debug" blocks across DBSQLOperation, DBSQLClient, DBSQLSession, CloudFetchResultHandler, RowSetProvider. Also fixes the inconsistency where DBSQLSession.close() lacked the swallow wrapper that the other six sites had.

API surface
- F13: ConnectionOptions exposes nine telemetry knobs (was three) with TSDoc. Adds telemetryFlushIntervalMs, telemetryMaxRetries, telemetryCircuitBreakerThreshold, telemetryCircuitBreakerTimeout, telemetryCloseTimeoutMs, telemetryMaxStatementMetrics.

Tests
- ClientContextStub gains telemetryEmitter / telemetryAggregator hooks so unit tests can assert on emit calls instead of silently no-op'ing.
- 18 new unit tests covering F1 refcount, F12 divergence warn, async-close idempotency, error-telemetry wrappers (cancel, close, getMetadata, closed-op finished/getSchema/hasMoreRows), multi-context FIFO, and a new tests/unit/result/RowSetProvider.test.ts file (RowSetProvider had no test file at all).

783 unit tests pass; live e2e against adb-27363120558779.19.azuredatabricks.net validates the full pipeline.

Co-authored-by: Isaac
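One way to express the F7 close-timeout race, as a sketch with assumed names; the real logic lives in MetricsAggregator.close.

```typescript
// Race the final flush against telemetryCloseTimeoutMs so a flapping
// endpoint can't hang process exit. Illustrative, not the actual code.
async function closeWithTimeout(
  finalFlush: () => Promise<void>,
  telemetryCloseTimeoutMs = 2_000, // documented default: 2s
): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<void>((resolve) => {
    timer = setTimeout(resolve, telemetryCloseTimeoutMs);
  });
  try {
    // Flush errors are swallowed; whichever promise settles first wins.
    await Promise.race([finalFlush().catch(() => {}), timeout]);
  } finally {
    if (timer) clearTimeout(timer); // don't hold the loop open after close
  }
}
```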
…h, public types, README, test gaps

- DBSQLClient.initializeTelemetry: release the per-host refcount on the catch path so a throw between getOrCreateClient and the success log doesn't leak a TelemetryClient (with its flush timer / exporter / FFCache) for the lifetime of the process on long-running supervisors.
- TelemetryClient.getClient/getDriver: walk the FIFO with try/catch fallthrough to mirror getConnectionProvider, so a closed-but-not-yet-released head context doesn't take the whole shared pool down. getAuthProvider: return the first defined entry from the FIFO.
- lib/index.ts: re-export TelemetryEventType, DEFAULT_TELEMETRY_CONFIG, and the consumer-facing telemetry payload types so SDK users don't need to deep-import for type-checked event/metric handling.
- README: add a Telemetry section covering what's collected, the three opt-out paths (env var, programmatic, server-side feature flag), the tunable knobs, and the await-close requirement for short-lived processes.
- MetricsAggregator.test.ts: cover chunk-timing aggregation (initial=first-positive, slowest=max, sum=running) and CONNECTION_CLOSE → DELETE_SESSION emission. Both were acknowledged coverage gaps.
- TelemetryEventEmitter.test.ts: cover emitConnectionClose — emission shape, disabled-flag suppression, and listener-exception swallow.

Co-authored-by: Isaac
safeEmit(this.context, (emitter) => {
  emitter.emitConnectionClose({
    sessionId: this.id,
    latencyMs: Date.now() - this.openTime,
[F1 — Critical] connection.close.latencyMs reports session lifetime, not RPC duration
this.openTime is set in the session constructor (DBSQLSession.ts:175, this.openTime = Date.now()), so this latencyMs is the wall-clock session lifetime, not the closeSession RPC duration. The matching CREATE_SESSION event emits actual openSession RPC duration. Two different definitions land under the same operation_latency_ms field server-side, leading to incorrect close-latency conclusions.
Fix: capture closeStart = Date.now() immediately before await driver.closeSession(...) and emit Date.now() - closeStart. Or rename the field on close (e.g. sessionLifetimeMs) and document both meanings.
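Sketched, with `closeSession` / `emitConnectionClose` as stand-ins for the driver internals:

```typescript
// Time only the closeSession RPC, not the whole session lifetime.
async function closeSessionWithTelemetry(
  closeSession: () => Promise<void>,
  emitConnectionClose: (e: { sessionId: string; latencyMs: number }) => void,
  sessionId: string,
): Promise<void> {
  const closeStart = Date.now(); // capture immediately before the RPC
  await closeSession();
  emitConnectionClose({
    sessionId,
    latencyMs: Date.now() - closeStart, // RPC duration, not Date.now() - openTime
  });
}
```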
Posted by Code Review Squad.
// Extract workspace ID from hostname (first segment before first dot)
const parts = host.split('.');
return parts.length > 0 ? parts[0] : host;
}
[F2 — Critical] extractWorkspaceId is wrong on AWS and Azure hostnames
`host.split('.')[0]` returns:
- AWS: `dbc-XXXXX-YYYY.cloud.databricks.com` → `dbc-XXXXX-YYYY` (not the workspace ID; that's the deployment shard prefix)
- Azure: `adb-NNNNNNNNNNNNN.NN.azuredatabricks.net` → `adb-NNNNNNNNNNNNN` (the workspace ID is the numeric portion of `adb-N`, dropping the `adb-` prefix and trailing form-factor digit)

Every CREATE_SESSION/DELETE_SESSION metric ships a confidently-wrong workspaceId for AWS and Azure customers, silently mis-attributing usage server-side. The JSDoc example `workspace-id.cloud.databricks.com` is not a real hostname format.

Fix: pull the real workspace ID from the auth response or the connection URL `?o=N` parameter. If neither is available, omit the field rather than ship a wrong value (see the sketch below). Also: `split('.')` never "fails" — the "or host if extraction fails" comment is misleading.
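A sketch of the `?o=` fallback; `workspaceIdFromUrl` is a hypothetical helper, not existing driver code:

```typescript
// Prefer the explicit ?o=N workspace-ID parameter; omit the field entirely
// when it's absent rather than shipping a guess derived from the hostname.
function workspaceIdFromUrl(connectionUrl: string): string | undefined {
  try {
    const o = new URL(connectionUrl).searchParams.get('o');
    return o !== null && /^\d+$/.test(o) ? o : undefined;
  } catch {
    return undefined; // unparseable URL: omit, don't guess
  }
}

// workspaceIdFromUrl('https://adb-123.4.azuredatabricks.net/?o=123') -> '123'
// workspaceIdFromUrl('https://dbc-xxxx-yyyy.cloud.databricks.com/') -> undefined
```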
Posted by Code Review Squad.
}

/** @internal */
public getTelemetryEmitter(): TelemetryEventEmitter | undefined {
[F3 — High] getTelemetryEmitter() exposes un-redacted error stacks to in-process listeners
@internal is a JSDoc tag, not a runtime guard — this method is reachable from any consumer that imports @databricks/sql. The TelemetryEventEmitter extends Node's EventEmitter, so:
`client.getTelemetryEmitter()?.on('telemetry.error', e => exfil(e.errorStack));` works at runtime today. `redactSensitive` runs only at export time in `DatabricksTelemetryExporter.toTelemetryLog`, not at emit time, so an in-process listener sees raw `errorMessage` and `errorStack` (see DBSQLOperation.ts:592-602). With telemetry now default-on, this is a credential/PII channel that didn't materially exist before.
Fix: redact at the emitter (emitError) before calling this.emit(...). The aggregator/exporter then receive already-redacted strings — single redaction site. Optionally narrow getTelemetryEmitter() to a curated read-only surface.
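A sketch of emit-time redaction under assumed, simplified shapes; the stub `redactSensitive` only illustrates the home-directory scrubbing this PR's util performs:

```typescript
import { EventEmitter } from 'events';

interface TelemetryErrorEvent { errorMessage: string; errorStack?: string }

// Stand-in for the driver's redactSensitive util.
const redactSensitive = (s: string): string =>
  s.replace(/\/home\/[^/\s]+/g, '/home/<user>');

class RedactingEmitterSketch extends EventEmitter {
  emitError(event: TelemetryErrorEvent): void {
    // Redact before emit: the aggregator, the exporter, AND any in-process
    // listener all receive the already-scrubbed strings.
    this.emit('telemetry.error', {
      ...event,
      errorMessage: redactSensitive(event.errorMessage),
      errorStack: event.errorStack && redactSensitive(event.errorStack),
    });
  }
}
```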
Posted by Code Review Squad.
}
}
return undefined;
}
[F4 — High] Multi-tenant FIFO credential bleed across DBSQLClients on same host
A SaaS layer fronting tenants A and B against the same workspace host shares one TelemetryClient. Telemetry from tenant B's queries can be POSTed under tenant A's auth headers (whoever connected first wins the FIFO head). Same shape applies to userAgentEntry and telemetryAuthenticatedExport — tenant A's value silently wins. If tenant B explicitly set telemetryAuthenticatedExport: false, B's events ride A's authenticated pipeline.
Fix options (pick one):
- Shard the registry by `(host, userAgentEntry, authenticatedExport, authProvider-identity)` instead of host alone.
- Auto-disable telemetry when a second context registers with diverging auth/authenticatedExport (lose telemetry > leak credential).
- Per-context exporter with shared circuit-breaker counters.

At minimum, document this prominently in the README opt-out section so SaaS deployments know to set `telemetryEnabled: false`.
Posted by Code Review Squad.
this.contexts = this.contexts.filter((c) => c !== context);
const auth = context.getAuthProvider?.();
if (auth) {
  this.authProviders = this.authProviders.filter((a) => a !== auth);
[F5 — High] Auth-provider deregistration uses identity at unregister-time → stale credential leak on rotation
unregisterContext calls context.getAuthProvider?.() at unregister time and filters by referential equality. If the context's auth provider was rotated between register and unregister (token refresh that reconstructs the provider, lazy wrappers, anything that doesn't return the same object reference), the original provider is never removed from this.authProviders. The exporter's FIFO walk in getAuthProvider() (line 203-215) can keep authenticating with revoked credentials for the lifetime of the per-host singleton.
Fix: cache the auth-provider snapshot on the contexts entry at register time (registerContext); use that cached snapshot at unregister time, not a fresh call.
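Sketched with illustrative types standing in for the driver's interfaces:

```typescript
interface Ctx { getAuthProvider?: () => object | undefined }

class RegistrySketch {
  // Snapshot the provider alongside the context at register time.
  private entries: Array<{ context: Ctx; auth?: object }> = [];
  private authProviders: object[] = [];

  registerContext(context: Ctx): void {
    const auth = context.getAuthProvider?.(); // captured once, here
    this.entries.push({ context, auth });
    if (auth) this.authProviders.push(auth);
  }

  unregisterContext(context: Ctx): void {
    const entry = this.entries.find((e) => e.context === context);
    this.entries = this.entries.filter((e) => e.context !== context);
    // Filter by the register-time snapshot — a provider rotated after
    // registration no longer escapes removal.
    if (entry?.auth) {
      this.authProviders = this.authProviders.filter((a) => a !== entry.auth);
    }
  }
}
```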
Posted by Code Review Squad.
telemetryMaxRetries: 3,
telemetryAuthenticatedExport: true,
telemetryCircuitBreakerThreshold: 5,
telemetryCircuitBreakerTimeout: 60000, // 1 minute
[F9 — High] Default-config drift: 7 keys here vs 15 in DEFAULT_TELEMETRY_CONFIG
getDefaultConfig (this method, lines 128-134) defines 7 telemetry keys. DEFAULT_TELEMETRY_CONFIG in lib/telemetry/types.ts:91-103 defines 15. Components like MetricsAggregator fall back through ?? DEFAULT_TELEMETRY_CONFIG.X for the missing keys. Today the values agree by hand; they will silently desync the next time someone changes one source but not the other.
Fix: spread DEFAULT_TELEMETRY_CONFIG here (with a typed mapper for the telemetry-prefixed ClientConfig keys), or stop maintaining the inline copy and route every component through the single frozen const.
Posted by Code Review Squad.
for (const k of telemetryOverrides) {
  if (options[k] !== undefined) {
    // The narrow union forces a cast; values are validated at point of use.
    (this.config as any)[k] = options[k];
[F11 — Medium] `as any` cast removes type safety on every telemetry option override
The telemetryOverrides array above is as const so the keys are a literal union. Both ConnectionOptions and ClientConfig have identical key names and types for the telemetry knobs. The cast escapes the structural type system for no reason. A user passing telemetryBatchSize: "100" (string) silently writes a string into a number field; MetricsAggregator then reads it as number and aggregation thresholds break at runtime.
Fix: per-key narrowed assignments, or a small typed pickDefined<T, K>(dst: T, src: Partial<T>, keys: readonly K[]) helper. Same readability, no cast.
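The helper, sketched; `pickDefined` is hypothetical and the usage line assumes the PR's `telemetryOverrides` array:

```typescript
// Copies only defined keys; no `as any`. A wrong-typed value such as
// telemetryBatchSize: "100" now fails to compile at the call site.
// (Relies on TS >= 4.8 narrowing of generic `T[K] | undefined`.)
function pickDefined<T, K extends keyof T>(
  dst: T,
  src: Partial<T>,
  keys: readonly K[],
): void {
  for (const k of keys) {
    const v = src[k];
    if (v !== undefined) dst[k] = v;
  }
}

// Usage sketch:
// pickDefined(this.config, options, telemetryOverrides);
```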
Posted by Code Review Squad.
/** @internal */
public getTelemetryAggregator() {
  return this.telemetryClient?.getAggregator();
}
[F18 — Medium] getTelemetryAggregator() lacks explicit return-type annotation
Three reviewers flagged this. The return type is inferred as MetricsAggregator | undefined (from this.telemetryClient?.getAggregator()), but MetricsAggregator is intentionally not re-exported from lib/index.ts. Consumers calling client.getTelemetryAggregator() see a method whose return type references a non-exported symbol — autocomplete degrades, agent-completion gets confused, and the contract diverges from IClientContext.getTelemetryAggregator?(): MetricsAggregator | undefined even though the line two above (getTelemetryEmitter()) does annotate explicitly.
Fix: add : MetricsAggregator | undefined to match IClientContext. Since the method is @internal, an even cleaner option is to remove it from the public class entirely (move to a friend interface used only within the package) — TS-public methods marked only with a JSDoc @internal are the worst of both worlds for autocomplete.
Posted by Code Review Squad.
// charge — avoiding the footgun where a sysadmin sets the var to "false"
// expecting to enable telemetry.
const envKill = process.env.DATABRICKS_TELEMETRY_DISABLED;
const envDisabled = typeof envKill === 'string' && /^(1|true|yes|on)$/i.test(envKill.trim());
[F22 — Medium] DATABRICKS_TELEMETRY_DISABLED=false keeps telemetry on — silently ignored
The regex /^(1|true|yes|on)$/i is the deliberate "footgun-avoidance" choice (so false/0/no don't accidentally disable opt-out), and the comment above explains it. The remaining UX problem: an ops engineer who sees the env var and tries to "set it to false to keep telemetry on" gets the opposite of what they expect — false is silently ignored, leaving runtime config (default true) in charge.
Fix: log a LogLevel.warn line when the env var is set to a non-recognized non-empty value (false, 0, no, off, etc.) so ops can see their disable didn't take effect. That preserves the safe-default behavior while surfacing the misconfiguration.
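Sketched; the logger shape is an assumption:

```typescript
// Recognize explicit opt-out values; warn on anything else that's set,
// so DATABRICKS_TELEMETRY_DISABLED=false is no longer silently ignored.
function telemetryDisabledByEnv(logger: { warn: (msg: string) => void }): boolean {
  const raw = process.env.DATABRICKS_TELEMETRY_DISABLED;
  if (raw === undefined || raw.trim() === '') return false;
  if (/^(1|true|yes|on)$/i.test(raw.trim())) return true; // explicit opt-out
  logger.warn(
    `DATABRICKS_TELEMETRY_DISABLED="${raw}" is not a recognized disable value; ` +
      'telemetry remains controlled by runtime configuration.',
  );
  return false;
}
```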
Posted by Code Review Squad.
if (arrowBatches) {
  for (const batch of arrowBatches) {
    bytes += batch.batch?.length ?? 0;
  }
[F10 — Medium] RowSetProvider.emitChunkEvent undercounts bytes for non-arrow paths
bytes is computed only from response.results?.arrowBatches. A TFetchResultsResp carrying inline JSON / column-based data (COLUMN_BASED_SET) reports bytes: 0. URL-based result sets report bytes via CloudFetchResultHandler (a separate emit site), so URL-based is fine. But column/json paths flow through here and the metric is then aggregated into bytesDownloaded in MetricsAggregator.processStatementEvent (details.bytesDownloaded += event.bytes ?? 0).
Net effect: dashboards that aggregate bytesDownloaded will be silently undercounted whenever the result is inline JSON / column-based.
Fix: compute and account inline-result bytes here for non-arrow paths, or document explicitly that bytesDownloaded from this emit site counts only arrow-batch payloads (so dashboards know to add the CloudFetch-emitted byte counts separately).
Posted by Code Review Squad.
safeEmit(this.context, (emitter) => {
  emitter.emitError({
    statementId: this.id,
    sessionId: this.sessionId,
[F12 — Medium] sessionId default mismatch — '' here, undefined in emitErrorEvent
emitStatementStart (line 549) and emitStatementComplete (line 575) emit sessionId: this.sessionId || ''. emitErrorEvent (line 596) emits sessionId: this.sessionId (no || ''). The aggregator's per-statement state groups by sessionId, so an operation with no sessionId emits start/complete keyed under empty string and errors keyed under undefined — split into two synthetic sessions in the aggregator and exporter.
Fix: pick one default and apply consistently. Prefer undefined and let the aggregator filter at the processEvent boundary (cleaner than synthesizing fake empty-string sessions).
Posted by Code Review Squad.
Code Review Squad Report

Scope: 29 files, +2970/-424 lines

Executive Summary

This is a sizable, mostly well-engineered telemetry layer that ships two verified data-integrity bugs that mis-stamp every connection-close metric and every workspace-attribution field on AWS+Azure customers, plus multi-tenant safety issues (FIFO auth-provider sharing, silent override of opt-out config) that materially undermine the privacy posture of a default-on feature. Tests for the +314-line

None of these are unrecoverable — most have small, focused fixes — but they should land before this merges and ships to customers as part of the 1.13 default-on telemetry rollout.

Critical & High findings (inline comments above)
High findings without a specific anchor

F6 — Default-on telemetry ships customer-controlled
F7 —
F8 — Telemetry HTTP retry double-stacks with user's 15-min

Medium / Low findings (no inline comment, summarized here)
Verified non-issues (confirmed against PR head; flagged so reviewers don't re-raise)
Feedback? Drop it in #code-review-squad-feedback.
Part 4 of 7-part Telemetry Implementation Stack
This layer adds event-driven telemetry emission, per-host shared
aggregation/export, and operation-level error telemetry on top of the
hardened infrastructure shipped in [1/7]–[3/7].
What's in this PR
TelemetryEventEmitter (`lib/telemetry/TelemetryEventEmitter.ts`)

Per-`DBSQLClient` event emitter — typed emission methods, respects the client's `telemetryEnabled` flag, swallows all exceptions at debug level. Each emitter bridges into the shared aggregator on the per-host `TelemetryClient`.

Event types: `CONNECTION_OPEN`, `CONNECTION_CLOSE` (new), `STATEMENT_START`, `STATEMENT_COMPLETE`, `CLOUDFETCH_CHUNK`, `ERROR`.

TelemetryClient owns the per-host triad (`lib/telemetry/TelemetryClient.ts`)
`TelemetryClientProvider` is a process-wide singleton. Each host gets one `TelemetryClient` that owns:
- `DatabricksTelemetryExporter`
- `MetricsAggregator`
- `CircuitBreakerRegistry`
- `FeatureFlagCache`

Multiple `DBSQLClient` instances on the same host share these — breaker counters and HTTP batches don't fragment per-instance. `TelemetryClient` implements `IClientContext` so the owned components have a stable context that survives any single `DBSQLClient`'s close. Connection/auth providers are tracked in a FIFO of registered contexts; the exporter falls through to the next active one when the head closes.
MetricsAggregator (per-host, on `TelemetryClient`)

Restored from `main`'s hardened version, with new functionality layered on:
- `CONNECTION_CLOSE` → `DELETE_SESSION` connection metric.
- `chunkInitialLatencyMs`, `chunkSlowestLatencyMs`, `chunkSumLatencyMs` accumulated from `CLOUDFETCH_CHUNK` events with positive latency.
- Memory bounds: `maxPendingMetrics` (drop preferring non-error to keep first-failure signal), `maxErrorsPerStatement`, `statementTtlMs` eviction.
- Periodic flush timer (`unref()`'d), terminal-error immediate flush, manual `flush()`.
- `close()` is async and awaits the final HTTP POST so `await client.close(); process.exit(0)` doesn't truncate the last batch.

Error telemetry wired into operation entry points
`DBSQLOperation` now emits `ERROR` events (with `ExceptionClassifier` terminal/retryable classification) from `fetchChunk`, `cancel`, `close`, and `getMetadata`. Failed queries produce a `STATEMENT_COMPLETE` plus an `ERROR` proto with `error_info: { error_name, stack_trace }` (stack run through `redactSensitive`). `emitStatementComplete` no longer issues a `getMetadata` Thrift RPC on close (perf regression + spurious-error-telemetry trap).
Type-safe wiring (`IClientContext`)

Added optional `getTelemetryEmitter()` / `getTelemetryAggregator()` to `IClientContext`. Removed all `(this.context as any)` casts at the seven emit call sites (`DBSQLOperation`, `DBSQLSession`, `RowSetProvider`, `CloudFetchResultHandler`).

The six copy-pasted listeners in `DBSQLClient.initializeTelemetry` are now one bridge loop over `Object.values(TelemetryEventType)` — closes the listener-name mismatch that originally caused error events to be silently dropped (a sketch follows below).

`mapAuthType` covers all six `authType` values (`access-token`, `databricks-oauth` {U2M/M2M}, `custom`, `token-provider`, `external-token`, `static-token`).
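A minimal sketch of that bridge loop; enum members and the aggregator signature are illustrative stand-ins:

```typescript
import { EventEmitter } from 'events';

// Mirrors the shape of the driver's enum; member values are assumptions.
enum TelemetryEventType {
  CONNECTION_OPEN = 'connection.open',
  ERROR = 'telemetry.error',
  // ...remaining members elided...
}

function bridgeEmitterToAggregator(
  emitter: EventEmitter,
  aggregator: { processEvent: (type: string, event: unknown) => void },
): void {
  // One listener per enum value: the listener name always equals the
  // emitted name, so the old copy-paste mismatch can't drop events.
  for (const eventType of Object.values(TelemetryEventType)) {
    emitter.on(eventType, (event) => aggregator.processEvent(eventType, event));
  }
}
```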
Verified end-to-end against an Azure Databricks workspace

Healthy `SELECT 1` produces 3 wire metrics: `CREATE_SESSION` (with `system_configuration`), `STATEMENT_COMPLETE` (with `sql_operation.execution_result`), `DELETE_SESSION`.

Failed query produces 4: `CREATE_SESSION`, `STATEMENT_COMPLETE` (latency only), `ERROR` (with redacted stack), `DELETE_SESSION`.

Server-side feature flag is still the kill switch — `telemetryEnabled: false` on the client also skips the entire pipeline (no acquire/release noise).
Testing
484 unit tests pass (telemetry + DBSQLClient/Operation/Session/result). Test files for the rebased modules are restored from `main`. Provider tests updated for the singleton API.

Coverage gap acknowledged: no unit tests yet for the re-applied chunk-timing aggregation or `CONNECTION_CLOSE` handling. Adding these as a follow-up to the same epic.
Dependencies
Depends on:
Out of scope (open follow-ups)

- `operation_type` proto field so `CREATE_SESSION` and `DELETE_SESSION` are explicitly distinguished on the wire (today they're distinguishable only by the incidental presence of `system_configuration`).
- `getStats()` for telemetry-pipeline self-observability (drop counts, queue depth, last-success timestamp).
- `ConnectionOptions` (currently 3 of 13).
- `DBSQLClient`, close from `DBSQLSession`).
- `CONNECTION_CLOSE`.