Add first-class logical UUID type support#18140

Open
xiangfu0 wants to merge 15 commits intoapache:masterfrom
xiangfu0:codex/uuid-v1-support

@xiangfu0 xiangfu0 commented Apr 9, 2026

Summary

  • add a first-class logical UUID type to Pinot v1, backed by the existing fixed-width 16-byte BYTES representation
  • support UUID semantics across schema, ingest, planner/type resolution, casts, filters, grouping, distinct, ordering, joins, dictionaries, bloom filters, raw inverted indexes, and result rendering
  • keep v1 backward-compatible at the storage and wire level: no segment format bump and no data-table format bump
  • add unit coverage plus offline and realtime integration coverage extending CustomDataQueryClusterIntegrationTest
  • document usage, limitations, migration constraints, and performance/design rationale in the README and PR description

Why

Today Pinot users typically model UUIDs as either STRING or BYTES.

That works for storage, but it loses type semantics:

  • BYTES results render as hex instead of UUID text
  • schema and query planning cannot distinguish UUIDs from arbitrary bytes
  • ingest and query paths do not validate UUID values consistently
  • joins, grouping, and distinct work physically, but not as a native logical type

This PR adds a native logical UUID type while intentionally reusing Pinot's current BYTES storage and wire encodings in v1.

V1 design contract

| Aspect | Behavior |
| --- | --- |
| Logical type | UUID |
| Stored type | BYTES |
| Internal width | fixed 16 bytes |
| External text form | canonical lowercase RFC 4122 (8-4-4-4-12) |
| Result rendering | canonical UUID string for UUID columns |
| Plain BYTES behavior | unchanged, still rendered as hex |
| Multi-value support | not supported in v1 |
| Segment/data-table format | unchanged in v1 |

Benchmark results

Pinot storage and query benchmark

Local benchmark setup:

  • 4 segments
  • 250,000 rows per segment
  • 1,000,000 total rows
  • 100,000 distinct UUIDs
  • equality filter and IN (10) filter queries

Results from this setup:

| Representation | Total size | Bytes/row | Equality filter | IN (10) filter |
| --- | --- | --- | --- | --- |
| STRING_DICT | 31,329,596 B | 31.330 | 0.407 ms/op | 0.355 ms/op |
| UUID_DICT | 23,329,684 B | 23.330 | 0.217 ms/op | 0.301 ms/op |
| STRING_RAW | 26,811,612 B | 26.812 | 97.822 ms/op | 85.702 ms/op |
| UUID_RAW | 37,464,100 B | 37.464 | 39.118 ms/op | 46.679 ms/op |

Takeaways from these measurements:

  • dictionary-encoded UUID performs better than dictionary-encoded UUID-as-string on both storage size and lookup/filter latency
  • raw UUID performs better than raw string for equality and IN filtering in this setup
  • raw UUID is larger on disk in this setup because it always stores a fixed 16-byte payload, while raw string can be smaller depending on the source representation and compression characteristics

Note: the Pinot query benchmark had to run with JMH -f 0 because the benchmark harness inherits an executor lifecycle issue from the query test base. The relative deltas are more trustworthy than the absolute latency numbers.
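
For readers who want to re-derive the relative deltas from the table, a minimal sketch of the arithmetic follows. The class and method names are hypothetical, and the inputs are only the totals and latencies reported above; nothing here is re-measured.

```java
/** Hypothetical helper that recomputes ratios from the reported benchmark numbers. */
final class BenchmarkDeltaSketch {
  private BenchmarkDeltaSketch() {
  }

  /** Average on-disk bytes per row for a given total size. */
  static double bytesPerRow(long totalBytes, long rows) {
    return (double) totalBytes / rows;
  }

  /** How many times larger (or slower) the baseline is than the candidate. */
  static double ratio(double baseline, double candidate) {
    return baseline / candidate;
  }

  public static void main(String[] args) {
    long rows = 1_000_000L; // 4 segments x 250,000 rows
    System.out.printf("STRING_DICT bytes/row: %.3f%n", bytesPerRow(31_329_596L, rows));
    System.out.printf("UUID_DICT bytes/row:   %.3f%n", bytesPerRow(23_329_684L, rows));
    System.out.printf("dict equality, STRING vs UUID: %.2fx%n", ratio(0.407, 0.217));
    System.out.printf("raw equality, STRING vs UUID:  %.2fx%n", ratio(97.822, 39.118));
    System.out.printf("raw size, UUID vs STRING:      %.2fx%n", ratio(37_464_100.0, 26_811_612.0));
  }
}
```

Because the absolute latencies come from a `-f 0` run, ratios like these are the safer quantity to quote.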

In-memory hot-path benchmark

Local JMH benchmark setup:

  • scanSize=262144
  • setSize=65536
  • probeCount=8192
  • representations compared: byte[16], java.util.UUID, and long[2]

Results:

| Representation | Hash lookup | Equality scan |
| --- | --- | --- |
| byte[16] | 91.523 us/op | 3067.267 us/op |
| java.util.UUID | 128.960 us/op | 265.074 us/op |
| long[2] | 83.208 us/op | 65.619 us/op |

Takeaways:

  • long[2] is the fastest representation in both measurements, which makes it the best fit for hot-path compare/hash/ordering work
  • java.util.UUID is acceptable at API boundaries but is slower than long[2] for compute-heavy engine work
  • raw byte[16] is a good storage/wire format but is by far the slowest option for repeated in-memory equality scans; its hash lookup falls between long[2] and java.util.UUID

Design choice: long[2] in engine internals

This PR keeps the external and persisted UUID contract unchanged:

  • segment storage stays as fixed-width 16-byte BYTES
  • data-table / wire format stays unchanged in v1
  • plain BYTES remains distinct from logical UUID

For engine internals and hot-path computation, the benchmark results above support using two longs conceptually as the working representation when the code path already knows it is dealing with a logical UUID.

That is the design choice behind the optimization work in this PR:

  • use 16-byte BYTES at storage and API boundaries
  • use long[2]-style compare/hash logic in UUID-aware hot paths
  • do not rewrite generic BYTES paths that do not carry logical UUID type information

Why not use java.util.UUID internally for hot paths:

  • it is object-based and introduces extra indirection/allocation pressure relative to primitive longs
  • it is slower than long[2] in the microbenchmarks above

Why not keep byte[16] as the hot-path compute representation:

  • repeated byte-wise compare/hash is slower than decoding to two longs and comparing those directly
  • the gap is especially visible for equality/ordering-heavy scans

So the v1 design is intentionally split:

  • persisted form: byte[16]
  • hot-path compute form: two longs where the logical type is known
  • user-facing form: canonical lowercase UUID string
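
As an illustration of the split above, here is a minimal sketch of decoding the persisted byte[16] into two longs and doing equality and ordering on the decoded form. The class and method names are hypothetical, and the big-endian layout (most significant long first) is an assumption of this sketch, not something this description specifies.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration; not the PR's actual long-pair utilities. */
final class UuidLongPairSketch {
  private UuidLongPairSketch() {
  }

  /** High 8 bytes of a 16-byte UUID payload, read big-endian (assumed layout). */
  static long msb(byte[] uuidBytes) {
    return ByteBuffer.wrap(uuidBytes).getLong(0);
  }

  /** Low 8 bytes of a 16-byte UUID payload, read big-endian (assumed layout). */
  static long lsb(byte[] uuidBytes) {
    return ByteBuffer.wrap(uuidBytes).getLong(8);
  }

  /** Equality on the decoded form: two primitive comparisons. */
  static boolean equal(long msbA, long lsbA, long msbB, long lsbB) {
    return msbA == msbB && lsbA == lsbB;
  }

  /** Ordering on the decoded form; unsigned compares preserve byte-wise order. */
  static int compare(long msbA, long lsbA, long msbB, long lsbB) {
    int high = Long.compareUnsigned(msbA, msbB);
    return high != 0 ? high : Long.compareUnsigned(lsbA, lsbB);
  }

  /** Baseline for comparison: byte-by-byte unsigned ordering of two 16-byte payloads. */
  static int compareBytes(byte[] a, byte[] b) {
    for (int i = 0; i < 16; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return 0;
  }
}
```

If byte-wise unsigned order is the intended ordering, the unsigned compare matters: java.util.UUID.compareTo compares the two longs as signed values, so it orders values with the high bit set differently.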

User guide

1. Schema definition

Single-value UUID columns can be declared directly in the Pinot schema.

{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {
      "name": "eventId",
      "dataType": "UUID"
    },
    {
      "name": "traceId",
      "dataType": "UUID"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

If a schema declares a multi-value UUID column, validation now fails clearly.

2. Table behavior

UUID columns work with the same table features as other single-value dimension columns in v1, including:

  • dictionary and no-dictionary columns
  • bloom filters
  • raw-value inverted indexes for no-dictionary UUID columns
  • offline and realtime tables
  • upsert primary-key use cases backed by UUID columns

Example table config fragment:

{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "events"
  },
  "fieldConfigList": [
    {
      "name": "eventId",
      "encodingType": "RAW",
      "indexes": {
        "bloom": {},
        "inverted": {}
      }
    }
  ]
}

3. Ingest behavior

Pinot normalizes UUID inputs to a 16-byte internal representation.

Accepted v1 inputs:

  • canonical lowercase UUID strings, for example 550e8400-e29b-41d4-a716-446655440000
  • java.util.UUID
  • 16-byte byte[]

Supported ingest paths covered in this PR:

  • record-reader and segment-local normalization
  • offline ingest
  • realtime ingest
  • Avro interop, including Avro logicalType: "uuid"

Invalid UUID values fail with explicit validation/conversion errors instead of silently degrading.
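
A minimal sketch of what this normalization could look like for the three accepted input shapes. This is not the PR's UuidUtils: the class name, the exception type, and the big-endian byte layout are assumptions. The sketch is deliberately strict about lowercase text because that is the only string form listed above; whether uppercase input is tolerated is not stated here.

```java
import java.nio.ByteBuffer;
import java.util.UUID;
import java.util.regex.Pattern;

/** Hypothetical ingest-side normalizer; not the PR's UuidUtils. */
final class UuidIngestSketch {
  // Canonical lowercase 8-4-4-4-12 text form, as listed in the accepted inputs.
  private static final Pattern CANONICAL = Pattern.compile(
      "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}");

  private UuidIngestSketch() {
  }

  /** Returns the 16-byte form, or throws IllegalArgumentException for bad input. */
  static byte[] normalize(Object value) {
    if (value instanceof UUID) {
      return toBytes((UUID) value);
    }
    if (value instanceof byte[]) {
      byte[] bytes = (byte[]) value;
      if (bytes.length != 16) {
        throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
      }
      return bytes.clone();
    }
    if (value instanceof String) {
      String text = (String) value;
      // UUID.fromString alone is lenient about group lengths, so check the shape first.
      if (!CANONICAL.matcher(text).matches()) {
        throw new IllegalArgumentException("Not a canonical UUID string: " + text);
      }
      return toBytes(UUID.fromString(text));
    }
    throw new IllegalArgumentException("Unsupported UUID input type: "
        + (value == null ? "null" : value.getClass().getName()));
  }

  // Big-endian layout (most significant long first) is an assumption of this sketch.
  private static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }
}
```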

4. Query examples

Selection

SELECT eventId, traceId
FROM events
ORDER BY eventId
LIMIT 10

UUID columns are returned as canonical UUID strings.

CAST to UUID

SELECT eventId
FROM events
WHERE eventId = CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID)

IN predicate

SELECT eventId
FROM events
WHERE eventId IN (
  CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID),
  CAST('550e8400-e29b-41d4-a716-446655440001' AS UUID)
)

GROUP BY

SELECT eventId, COUNT(*)
FROM events
GROUP BY eventId
ORDER BY eventId

DISTINCT

SELECT DISTINCT eventId
FROM events
ORDER BY eventId

ORDER BY

SELECT eventId, ts
FROM events
ORDER BY eventId, ts

Equality join

SELECT a.eventId, a.ts, b.ts
FROM events_offline a
JOIN events_realtime b
  ON a.eventId = b.eventId
WHERE a.eventId = CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID)

These query patterns are covered for both SSE and MSE where applicable.

5. Result behavior

For columns declared as UUID:

  • SELECT uuidCol FROM t returns canonical lowercase UUID strings
  • GROUP BY, DISTINCT, and join outputs also render UUID strings
  • Arrow and JSON broker response encoders both preserve UUID string rendering

For columns declared as BYTES:

  • behavior stays unchanged
  • results still render as hex
  • no existing BYTES semantics are changed by this PR
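
The rendering difference can be pictured with a small sketch: the same 16 stored bytes produce different text depending on the declared column type. The enum, class, and method names below are hypothetical and only illustrate the behavior described above; they are not Pinot's DataSchema API.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical renderer showing type-dependent output for the same stored bytes. */
final class ResultRenderSketch {
  /** Stand-in for the declared column type; not Pinot's ColumnDataType. */
  enum ColumnType { BYTES, UUID }

  private ResultRenderSketch() {
  }

  static String render(ColumnType type, byte[] value) {
    if (type == ColumnType.UUID) {
      if (value.length != 16) {
        throw new IllegalArgumentException("UUID payload must be 16 bytes");
      }
      // Assumes big-endian layout: most significant long first.
      ByteBuffer buffer = ByteBuffer.wrap(value);
      return new UUID(buffer.getLong(), buffer.getLong()).toString();
    }
    // Plain BYTES: lowercase hex, two characters per byte.
    StringBuilder hex = new StringBuilder(value.length * 2);
    for (byte b : value) {
      hex.append(Character.forDigit((b >> 4) & 0xF, 16));
      hex.append(Character.forDigit(b & 0xF, 16));
    }
    return hex.toString();
  }
}
```

With the payload for 550e8400-e29b-41d4-a716-446655440000, the UUID branch returns the dashed form and the BYTES branch returns the same 32 hex digits without dashes.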

6. Existing UUID byte/string helpers

This PR keeps the existing helper functions usable:

  • toUUIDBytes(...)
  • fromUUIDBytes(...)

The new logical UUID type is additive and does not replace plain BYTES workflows.

Migration notes

Schema evolution constraint

Pinot does not support changing the data type of an existing column in place. This PR adds a new logical type, but it does not change that schema-evolution rule.

From STRING

If a column currently stores canonical UUID strings and should behave as a typed UUID column, the practical path is to create a new UUID column or a new table/schema and reingest or backfill the data.

From BYTES

If a column already stores 16-byte UUID payloads, declaring a new column as UUID gives UUID-aware query behavior and rendering, but existing BYTES columns remain BYTES columns. Adopting the logical UUID type still requires a new column or table rebuild/reingest rather than an in-place type mutation.

Important distinction

A column must be declared as UUID to get UUID result rendering. Declaring the column as BYTES continues to render hex, even if the underlying bytes happen to represent UUID values.

Format compatibility

The UUID type itself does not require a segment or data-table wire format bump in v1. That compatibility statement is about representation, not about in-place schema migration.

Scope exclusions in v1

This PR intentionally does not include:

  • multi-value UUID columns
  • segment format changes
  • data-table wire format changes
  • in-place schema type mutation from existing STRING or BYTES columns to UUID
  • a broader reinterpretation of plain BYTES columns as UUIDs
  • UUID range predicate support as part of the v1 contract

Validation

Targeted local validation run for this branch includes:

  • ./mvnw -pl pinot-spi -Dtest=FieldSpecTest,SchemaTest test -Dcheckstyle.skip=true
  • ./mvnw -pl pinot-common,pinot-core -am -Dtest=JsonResponseEncoderTest,ArrowResponseEncoderTest,BytesDistinctTableTest,SelectionOperatorUtilsTest,CastTransformFunctionTest,ScalarTransformFunctionWrapperTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
  • ./mvnw -pl pinot-segment-local,pinot-segment-spi -am -Dtest=MutableDictionaryTest,DefaultNullValueVirtualColumnProviderTest,RawValueBitmapInvertedIndexTest,BloomFilterCreatorTest,BloomFilterSegmentPrunerTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
  • ./mvnw -pl pinot-plugins/pinot-input-format/pinot-avro-base -am -Dtest=AvroUtilsTest,AvroSchemaUtilTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
  • ./mvnw -pl pinot-integration-tests -am -Dtest=UuidTypeTest,UuidTypeRealtimeTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
  • ./mvnw -pl pinot-integration-tests -am -Ppinot-fastdev -Dtest=UuidTypeTest,UuidTypeRealtimeTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
  • ./mvnw -pl pinot-common checkstyle:check license:check -DskipTests

GitHub Actions on this PR continue to run the full Pinot CI matrix.

codecov-commenter commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (31eac83) to head (590c1c4).

Additional details and impacted files
@@              Coverage Diff               @@
##             master    #18140       +/-   ##
==============================================
+ Coverage     63.29%   100.00%   +36.70%     
+ Complexity     1627         6     -1621     
==============================================
  Files          3226         3     -3223     
  Lines        196636         6   -196630     
  Branches      30401         0    -30401     
==============================================
- Hits         124466         6   -124460     
+ Misses        62192         0    -62192     
+ Partials       9978         0     -9978     
| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | ? |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 0.00% <ø> (-63.27%) ⬇️ |
| java-21 | 100.00% <ø> (+36.74%) ⬆️ |
| temurin | 100.00% <ø> (+36.70%) ⬆️ |
| unittests | ? |
| unittests1 | ? |
| unittests2 | ? |

Copilot AI left a comment

Pull request overview

Adds a first-class logical UUID type to Pinot (backed by 16-byte BYTES in v1), and wires UUID semantics through schema/type handling, indexing, query planning/execution, and result formatting, with unit + integration coverage and user-facing documentation.

Changes:

  • Introduce UUID as a logical type (FieldSpec.DataType.UUID / DataSchema.ColumnDataType.UUID) with canonical RFC 4122 lowercase string rendering via UuidUtils.
  • Propagate UUID-aware behavior through dictionaries, bloom filters, raw-value inverted indexes, casts/literals, predicates, distinct/grouping, and query planner/runtime type mapping + (de)serialization.
  • Add targeted unit tests plus offline/realtime integration tests and README documentation.

Reviewed changes

Copilot reviewed 76 out of 76 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| README.md | Document UUID logical type usage, casting, and migration notes. |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/SchemaTest.java | Add schema validation tests for UUID SV-only and default handling. |
| pinot-spi/src/test/java/org/apache/pinot/spi/data/FieldSpecTest.java | Add UUID DataType storedType/size and conversion/default-null tests. |
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/UuidUtils.java | New UUID conversion utilities (string/UUID/bytes/ByteArray). |
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java | Add UUID null placeholders. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/Schema.java | Allow UUID in schema validation; enforce UUID SV-only. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/FieldSpec.java | Add UUID DataType and UUID-aware conversions/formatting/default handling. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/creator/BloomFilterCreator.java | Add UUID string rendering when inserting into bloom filters. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/inv/RawValueBitmapInvertedIndexTest.java | Extend raw inverted index tests to UUID + generic API path. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/BloomFilterCreatorTest.java | Add bloom filter creator test for UUID values stored as bytes. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/column/DefaultNullValueVirtualColumnProviderTest.java | Add UUID coverage for virtual column default-null dictionary/metadata. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/impl/dictionary/MutableDictionaryTest.java | Add UUID canonical string lookup tests for mutable bytes dictionaries. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/RawValueBitmapInvertedIndexReader.java | Make bytes dictionary logical-type aware; add getDocIdsForBytes. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapBytesDictionary.java | Add logical type to parse/format BYTES vs UUID. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBytesDictionary.java | Add logical type to parse/format BYTES vs UUID. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BytesDictionary.java | Add logical type to parse/format BYTES vs UUID. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/InvertedIndexHandler.java | Use stored type; enable raw inverted index creation for UUID (stored as BYTES). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/columnminmaxvalue/ColumnMinMaxValueGenerator.java | Pass logical type into bytes dictionary for min/max generation. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/bloomfilter/BloomFilterHandler.java | Use DataType-aware string formatting for bloom filter population (UUID vs BYTES). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java | Plumb logical type into bytes dictionary and mutable dictionary creation. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/column/DefaultNullValueVirtualColumnProvider.java | Build bytes dictionary with logical type to format UUID defaults correctly. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/inv/RawValueBitmapInvertedIndexCreator.java | Fix raw inverted index dictionary temp-file handling; use ByteArray keys for BYTES/UUID. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/BaseSegmentCreator.java | Allow inverted index without dictionary for UUID via raw-value inverted index creator. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/MutableDictionaryFactory.java | Create bytes dictionaries with logical type (UUID vs BYTES) based on stored type. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOnHeapMutableDictionary.java | Add logical type parsing/formatting for UUID vs BYTES. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOffHeapMutableDictionary.java | Add logical type parsing/formatting for UUID vs BYTES. |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/plan/server/ServerPlanRequestUtils.java | Emit UUID IN operands as canonical strings instead of raw bytes literals. |
| pinot-query-planner/src/test/java/org/apache/pinot/query/type/TypeFactoryTest.java | Add UUID type conversion tests and skip UUID array tests. |
| pinot-query-planner/src/test/java/org/apache/pinot/query/planner/serde/RexExpressionSerDeTest.java | Add UUID literal SerDe test and supported type list. |
| pinot-query-planner/src/test/java/org/apache/pinot/query/planner/logical/RelToPlanNodeConverterTest.java | Add UUID column type conversion tests and reject UUID arrays. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/type/TypeFactory.java | Map Pinot UUID to Calcite SqlTypeName.UUID. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/planner/serde/RexExpressionToProtoExpression.java | Map UUID column type to proto enum. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/planner/serde/ProtoExpressionToRexExpression.java | Map proto UUID enum back to planner column type. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/planner/physical/v2/PRelToPlanNodeConverter.java | Convert Calcite UUID to Pinot UUID; reject UUID arrays. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/planner/logical/RexExpressionUtils.java | Add UUID literal conversion to/from Rex values. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/planner/logical/RelToPlanNodeConverter.java | Convert Calcite UUID to Pinot UUID; reject UUID arrays. |
| pinot-query-planner/src/main/java/org/apache/pinot/query/parser/CalciteRexExpressionParser.java | Serialize UUID literals as canonical strings for SQL/parsing paths. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroUtilsTest.java | Test Avro<->Pinot schema mapping for UUID logical type. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroSchemaUtilTest.java | Test Avro schema JSON object generation for UUID fields. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java | Enable Avro UUID logical type conversion + UUID schema handling. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroSchemaUtil.java | Map Avro logicalType: uuid to Pinot UUID + emit uuid logical type in Avro schema JSON. |
| pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroIngestionSchemaValidator.java | Fix mismatch message to use extracted Pinot type name. |
| pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/UuidTypeTest.java | Offline integration coverage for select/filter/group/distinct/order/join with UUID. |
| pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/UuidTypeRealtimeTest.java | Realtime integration coverage via subclassed UUID test. |
| pinot-integration-test-base/src/test/java/org/apache/pinot/integration/tests/ClusterTest.java | Treat UUID like STRING/BYTES when extracting JSON response values in tests. |
| pinot-core/src/test/java/org/apache/pinot/core/query/selection/SelectionOperatorUtilsTest.java | Verify result formatting distinguishes UUID (canonical) vs BYTES (hex). |
| pinot-core/src/test/java/org/apache/pinot/core/query/pruner/BloomFilterSegmentPrunerTest.java | Add UUID bloom filter pruning test; allow mocking with arbitrary DataType. |
| pinot-core/src/test/java/org/apache/pinot/core/query/distinct/table/BytesDistinctTableTest.java | Test UUID vs BYTES formatting in bytes distinct table (with/without ORDER BY). |
| pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/CastTransformFunctionTest.java | Add UUID cast tests, invalid literal rejection, and MV-source rejection. |
| pinot-core/src/main/java/org/apache/pinot/core/query/reduce/GroupByDataTableReducer.java | Treat UUID like BYTES in group key extraction (raw bytes). |
| pinot-core/src/main/java/org/apache/pinot/core/query/reduce/filter/PredicateRowMatcher.java | Convert UUID row values to bytes before applying predicate evaluator. |
| pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ValueBasedSegmentPruner.java | Hash bloom filter values using DataType-aware string formatting (UUID vs BYTES). |
| pinot-core/src/main/java/org/apache/pinot/core/query/distinct/table/BytesDistinctTable.java | Preserve internal ByteArray and format at the end via schema type (UUID vs BYTES). |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/InTransformFunction.java | Parse IN-list literals as UUID bytes when main function type is UUID. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/IdentifierTransformFunction.java | Provide UUID string rendering for UUID columns (from underlying bytes). |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/CastTransformFunction.java | Add CAST(... AS UUID) support (STRING/BYTES -> UUID) and string rendering. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/BaseTransformFunction.java | Add UUID metadata and UUID->STRING rendering; prevent generic UUID-as-bytes fallback. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/RawValueInvertedIndexFilterOperator.java | Support raw inverted index filtering for BYTES and UUID literals. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateUtils.java | Add UUID IN-predicate dictionary id set computation. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java | Add UUID raw predicate evaluator support (bytes-set with UUID type). |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotEqualsPredicateEvaluatorFactory.java | Add UUID equals/neq evaluator support for dict and raw paths. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/InPredicateEvaluatorFactory.java | Add UUID IN evaluator support (bytes-set with UUID type). |
| pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/EqualsPredicateEvaluatorFactory.java | Add UUID equals evaluator support for dict and raw paths. |
| pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java | Add UUID conversions and type inference tests. |
| pinot-common/src/test/java/org/apache/pinot/common/utils/DataSchemaTest.java | Add UUID column type coverage (compat, formatting, conversion). |
| pinot-common/src/test/java/org/apache/pinot/common/response/encoder/JsonResponseEncoderTest.java | Add UUID round-trip encoding/decoding test for result tables. |
| pinot-common/src/test/java/org/apache/pinot/common/request/context/RequestContextUtilsTest.java | Test filter conversion for UUID cast literals on RHS. |
| pinot-common/src/test/java/org/apache/pinot/common/function/FunctionUtilsTest.java | Test UUID Java type mappings to Pinot types and Calcite rel types. |
| pinot-common/src/main/proto/expressions.proto | Add UUID to proto ColumnDataType enum. |
| pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java | Add UUID PinotDataType and conversions/toInternal handling. |
| pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java | Add UUID ColumnDataType, internal/external conversions, formatting and rel type mapping. |
| pinot-common/src/main/java/org/apache/pinot/common/response/encoder/JsonResponseEncoder.java | Treat UUID like STRING/BYTES when extracting JSON-encoded row values. |
| pinot-common/src/main/java/org/apache/pinot/common/request/context/RequestContextUtils.java | Add literal-only CAST evaluation on predicate RHS; support UUID cast literals. |
| pinot-common/src/main/java/org/apache/pinot/common/request/context/predicate/BaseInPredicate.java | Add UUID value parsing/cache for IN predicates. |
| pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java | Reuse UuidUtils for UUID bytes conversions. |
| pinot-common/src/main/java/org/apache/pinot/common/function/FunctionUtils.java | Add UUID Java type mappings and Calcite rel type mapping. |

@xiangfu0 xiangfu0 force-pushed the codex/uuid-v1-support branch from 271dd9c to 64febd9 Compare April 10, 2026 10:48
@xiangfu0 xiangfu0 changed the title [codex] Add first-class UUID support Add first-class UUID support Apr 10, 2026
@xiangfu0 xiangfu0 marked this pull request as ready for review April 10, 2026 12:23
@xiangfu0 xiangfu0 changed the title Add first-class UUID support Add first-class logical UUID type support Apr 10, 2026
@xiangfu0 xiangfu0 added the following labels Apr 10, 2026: feature (New functionality), schema (Related to table schema definitions or changes), ingestion (Related to data ingestion pipeline), query (Related to query processing), multi-stage (Related to the multi-stage query engine), integration, ready-for-review (PR is ready for maintainer review)

Copilot AI left a comment

Pull request overview

Copilot reviewed 85 out of 85 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 97 out of 97 changed files in this pull request and generated 4 comments.

@ankitsultana
Contributor

@xiangfu0 could you break this down into smaller PRs? Graphite would be perfect for this.

On a design note: I think using the existing bytes type would add a meaningful performance penalty at the very lowest layers because most operations will require a lookup on the number of bytes.

@xiangfu0
Contributor Author

> @xiangfu0 could you break this down into smaller PRs? Graphite would be perfect for this.
>
> On a design note: I think using the existing bytes type would add a meaningful performance penalty at the very lowest layers because most operations will require a lookup on the number of bytes.

This is a good point; I will try to benchmark the perf impact.

On the storage side, I feel 16 bytes is already good enough.

Do you have any suggestions for the query side?

@ankitsultana
Contributor

Storage

On Disk Size

I'd imagine that whenever we store bytes, we have to store an offsets header to mark where the i-th value begins. This would happen both in the dictionary and in the raw forward index, unless we auto-switch to FixedByteReaderWriter after detecting that all values are the same size. IIRC we used to do that, so we should be good here?
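
To make the storage point concrete, here is a rough sketch of the two addressing schemes. The class name, the 4-byte offset width, and the layouts are assumptions for illustration only; this is not Pinot's actual dictionary or forward-index format.

```java
import java.nio.ByteBuffer;

/** Hypothetical readers contrasting fixed-width and offset-based value lookup. */
final class FixedWidthReadSketch {
  static final int UUID_WIDTH = 16;

  private FixedWidthReadSketch() {
  }

  /** Fixed-width layout: value i starts at i * 16, so no offsets header is needed. */
  static byte[] readFixed(ByteBuffer values, int index) {
    byte[] out = new byte[UUID_WIDTH];
    ByteBuffer view = values.duplicate(); // leave the caller's position untouched
    view.position(index * UUID_WIDTH);
    view.get(out);
    return out;
  }

  /** Variable-length layout: offsets[i] and offsets[i + 1] delimit value i. */
  static byte[] readVariable(ByteBuffer values, int[] offsets, int index) {
    int start = offsets[index];
    byte[] out = new byte[offsets[index + 1] - start];
    ByteBuffer view = values.duplicate();
    view.position(start);
    view.get(out);
    return out;
  }

  /** Bytes spent on an offsets header, assuming one 4-byte int per boundary. */
  static long offsetsOverheadBytes(long numValues) {
    return 4L * (numValues + 1);
  }
}
```

Under these assumptions, the header costs roughly 4 extra bytes per value on top of the 16-byte payload, which is the overhead a fixed-width writer avoids.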

Scan Performance

I think even if we get rid of the storage overhead mentioned above, FixedByteReaderWriter still ends up using readUnpaddedBytes, which relies on the SWAR zero-in-word bit hack. While that's faster than a naive scan, IIRC it was still at least 20% slower than an approach that simply assumes each value has a given fixed width. I filed this issue about it last year: #16618 (comment)
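
For context, the zero-in-word trick mentioned here is the well-known SWAR test for a 0x00 byte inside a 64-bit word. The sketch below shows the general technique with hypothetical names; it is not a copy of Pinot's readUnpaddedBytes, and it assumes values are padded with trailing 0x00 bytes.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of the SWAR zero-in-word check. */
final class SwarZeroByteSketch {
  private static final long ONES = 0x0101010101010101L;
  private static final long HIGH_BITS = 0x8080808080808080L;

  private SwarZeroByteSketch() {
  }

  /** True if any of the 8 bytes packed into the word is 0x00. */
  static boolean hasZeroByte(long word) {
    return ((word - ONES) & ~word & HIGH_BITS) != 0;
  }

  /** Index of the first 0x00 byte (the unpadded length), checking 8 bytes at a time. */
  static int unpaddedLength(byte[] padded) {
    ByteBuffer buffer = ByteBuffer.wrap(padded);
    int i = 0;
    // Skip whole 8-byte words that contain no zero byte.
    while (i + 8 <= padded.length && !hasZeroByte(buffer.getLong(i))) {
      i += 8;
    }
    // Finish byte by byte inside the word (or tail) that holds the first zero.
    while (i < padded.length && padded[i] != 0) {
      i++;
    }
    return i;
  }
}
```

A reader that knows every value is exactly 16 bytes can skip this scan and copy the bytes directly, which is the simpler fixed-width path being contrasted.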

Query

I think the most important operations for UUIDs are Group Bys and hashtable lookups. In both of these, the performance of equals and hashCode could differ significantly between an approach that uses class { long msb; long lsb; } (or similar) and a ByteArray-based approach (byte[] bytes).

To that end, for benchmarking I think we can just test the two approaches using microbenchmarks that test the group id generators for both V1 and V2 engines.
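
As a rough picture of the two key shapes being compared, here is a minimal sketch with hypothetical class names. It is neither Pinot's ByteArray nor the group id generators themselves, and it is not a benchmark; a JMH harness around something like countByKey would be needed for real numbers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical key shapes for illustration only. */
final class UuidKeySketch {
  private UuidKeySketch() {
  }

  /** Key holding the UUID as two primitive longs. */
  static final class LongPairKey {
    final long msb;
    final long lsb;

    LongPairKey(long msb, long lsb) {
      this.msb = msb;
      this.lsb = lsb;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof LongPairKey)) {
        return false;
      }
      LongPairKey that = (LongPairKey) other;
      return msb == that.msb && lsb == that.lsb; // two primitive compares
    }

    @Override
    public int hashCode() {
      long mixed = msb ^ lsb;
      return (int) (mixed ^ (mixed >>> 32)); // fold 128 bits down to 32
    }
  }

  /** Key wrapping the raw 16 bytes; equals and hashCode walk the array. */
  static final class BytesKey {
    final byte[] bytes;

    BytesKey(byte[] bytes) {
      this.bytes = bytes;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) other).bytes);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(bytes);
    }
  }

  /** Counts rows per key, the core step of a group-by or hash-table probe. */
  static <K> Map<K, Integer> countByKey(Iterable<K> keys) {
    Map<K, Integer> counts = new HashMap<>();
    for (K key : keys) {
      counts.merge(key, 1, Integer::sum);
    }
    return counts;
  }
}
```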

@xiangfu0: we can also sync up on Slack to expedite this. I can help by sharing a PR with some benchmarks too.

xiangfu0 and others added 14 commits April 13, 2026 12:15
Add long-pair UUID helpers and adopt them across UUID comparison, group-by, join, and segment-local paths.

Preserve composite primary-key UUID hashing after UUID values are normalized to ByteArray and update the benchmark to exercise the production UUID key implementations.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Jackie-Jiang requested that BIG_DECIMAL get the same single-value-only
restriction that was added for UUID in Schema.validate(FieldSpec).
BIG_DECIMAL is SV-only by implementation (no MV forward-index or
dictionary exists for it). Also updated Javadoc to document both
restrictions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@xiangfu0 xiangfu0 force-pushed the codex/uuid-v1-support branch from 8388860 to 590c1c4 Compare April 13, 2026 19:35