Add first-class logical UUID type support by xiangfu0 · Pull Request #18140 · apache/pinot

xiangfu0 · 2026-04-09T00:48:44Z

Summary

add a first-class logical UUID type to Pinot v1, backed by the existing fixed-width 16-byte BYTES representation
support UUID semantics across schema, ingest, planner/type resolution, casts, filters, grouping, distinct, ordering, joins, dictionaries, bloom filters, raw inverted indexes, and result rendering
keep v1 backward-compatible at the storage and wire level: no segment format bump and no data-table format bump
add unit coverage plus offline and realtime integration coverage extending CustomDataQueryClusterIntegrationTest
document usage, limitations, migration constraints, and performance/design rationale in the README and PR description

Why

Today Pinot users typically model UUIDs as either STRING or BYTES.

That works for storage, but it loses type semantics:

BYTES results render as hex instead of UUID text
schema and query planning cannot distinguish UUIDs from arbitrary bytes
ingest and query paths do not validate UUID values consistently
joins, grouping, and distinct work physically, but not as a native logical type

This PR adds a native logical UUID type while intentionally reusing Pinot's current BYTES storage and wire encodings in v1.

V1 design contract

Aspect	Behavior
Logical type	`UUID`
Stored type	`BYTES`
Internal width	fixed 16 bytes
External text form	canonical lowercase RFC 4122 (`8-4-4-4-12`)
Result rendering	canonical UUID string for `UUID` columns
Plain `BYTES` behavior	unchanged, still rendered as hex
Multi-value support	not supported in v1
Segment/data-table format	unchanged in v1
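
To illustrate the contract above, here is a minimal sketch that maps a canonical UUID string to a 16-byte payload and back. It assumes a big-endian layout (most significant 64 bits first), which this description does not spell out, and the names `UuidBytesSketch`, `toBytes`, and `fromBytes` are hypothetical, not the `UuidUtils` API added by this PR.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper for illustration only; not Pinot's UuidUtils.
final class UuidBytesSketch {
  static final int UUID_NUM_BYTES = 16;

  // Encodes a UUID as 16 bytes, most significant long first (assumed big-endian layout).
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(UUID_NUM_BYTES)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  // Decodes a 16-byte payload and rejects any other width.
  static UUID fromBytes(byte[] bytes) {
    if (bytes.length != UUID_NUM_BYTES) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    return new UUID(buffer.getLong(), buffer.getLong());
  }

  public static void main(String[] args) {
    UUID id = UUID.fromString("550e8400-e29b-41d4-a716-446655440000");
    byte[] stored = toBytes(id);
    System.out.println(stored.length);      // fixed internal width
    System.out.println(fromBytes(stored));  // java.util.UUID prints lowercase 8-4-4-4-12 text
  }
}
```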

Benchmark results

Pinot storage and query benchmark

Local benchmark setup:

4 segments
250,000 rows per segment
1,000,000 total rows
100,000 distinct UUIDs
equality filter and IN (10) filter queries

Results from this setup:

Representation	Total size	Bytes/row	Equality filter	`IN (10)` filter
`STRING_DICT`	31,329,596 B	31.330	0.407 ms/op	0.355 ms/op
`UUID_DICT`	23,329,684 B	23.330	0.217 ms/op	0.301 ms/op
`STRING_RAW`	26,811,612 B	26.812	97.822 ms/op	85.702 ms/op
`UUID_RAW`	37,464,100 B	37.464	39.118 ms/op	46.679 ms/op

Takeaways from these measurements:

dictionary-encoded UUID performs better than dictionary-encoded UUID-as-string on both storage size and lookup/filter latency
raw UUID performs better than raw string for equality and IN filtering in this setup
raw UUID is larger on disk in this setup because it always stores a fixed 16-byte payload, while raw string can be smaller depending on the source representation and compression characteristics
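
The ratios behind these takeaways can be recomputed from the table. The sketch below does only that arithmetic; the class and method names are hypothetical and the inputs are copied from the results above.

```java
// Hypothetical helper that recomputes derived numbers from the benchmark table.
final class BenchmarkRatiosSketch {
  // Average on-disk bytes per row.
  static double bytesPerRow(long totalBytes, long rows) {
    return (double) totalBytes / rows;
  }

  // How many times faster the candidate is than the baseline.
  static double speedup(double baselineMsPerOp, double candidateMsPerOp) {
    return baselineMsPerOp / candidateMsPerOp;
  }

  public static void main(String[] args) {
    long rows = 4L * 250_000L; // 4 segments x 250,000 rows
    System.out.printf("UUID_DICT bytes/row: %.3f%n", bytesPerRow(23_329_684L, rows));
    System.out.printf("Dictionary equality speedup: %.2fx%n", speedup(0.407, 0.217));
    System.out.printf("Raw equality speedup: %.2fx%n", speedup(97.822, 39.118));
    System.out.printf("Raw UUID size vs raw STRING: %.2fx%n", 37_464_100.0 / 26_811_612.0);
  }
}
```

With the numbers above, this works out to roughly 1.9x for dictionary equality, 2.5x for raw equality, and raw UUID being about 1.4x the size of raw string.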

Note: the Pinot query benchmark had to run with JMH -f 0 (no forked JVM) because the benchmark harness inherits an executor lifecycle issue from the query test base. Because the run is unforked, the relative deltas are more trustworthy than the absolute latency numbers.

In-memory hot-path benchmark

Local JMH benchmark setup:

scanSize=262144
setSize=65536
probeCount=8192
representations compared: byte[16], java.util.UUID, and long[2]

Results:

Representation	Hash lookup	Equality scan
`byte[16]`	91.523 us/op	3067.267 us/op
`java.util.UUID`	128.960 us/op	265.074 us/op
`long[2]`	83.208 us/op	65.619 us/op

Takeaways:

long[2] is the best representation for hot-path compare/hash/ordering work
java.util.UUID is acceptable at API boundaries but is slower than long[2] for compute-heavy engine work
raw byte[16] is a good storage/wire format but is the weakest option for repeated in-memory compare/hash operations

Design choice: `long[2]` in engine internals

This PR keeps the external and persisted UUID contract unchanged:

segment storage stays as fixed-width 16-byte BYTES
data-table / wire format stays unchanged in v1
plain BYTES remains distinct from logical UUID

For engine internals and hot-path computation, the benchmark results above support using two longs as the working representation wherever the code path already knows it is dealing with a logical UUID.

That is the design choice behind the optimization work in this PR:

use 16-byte BYTES at storage and API boundaries
use long[2]-style compare/hash logic in UUID-aware hot paths
do not rewrite generic BYTES paths that do not carry logical UUID type information

Why not use java.util.UUID internally for hot paths:

it is object-based and introduces extra indirection/allocation pressure relative to primitive longs
it is slower than long[2] in the microbenchmarks above

Why not keep byte[16] as the hot-path compute representation:

repeated byte-wise compare/hash is slower than decoding to two longs and comparing those directly
the gap is especially visible for equality/ordering-heavy scans

So the v1 design is intentionally split:

persisted form: byte[16]
hot-path compute form: two longs where the logical type is known
user-facing form: canonical lowercase UUID string
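
A minimal sketch of what "long[2]-style" compare and hash logic could look like when the persisted form is a 16-byte array. This is an illustration under stated assumptions, not the PR's implementation: the class and method names are hypothetical, the byte layout is assumed to be big-endian, and ordering is assumed to be unsigned so that it matches byte-wise lexicographic order. `Arrays.compareUnsigned` requires Java 9 or later.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical long-pair helpers; names and ordering semantics are assumptions.
final class UuidLongPairSketch {
  // Most significant 64 bits of a 16-byte big-endian UUID payload.
  static long msb(byte[] bytes) {
    return ByteBuffer.wrap(bytes, 0, 8).getLong();
  }

  // Least significant 64 bits of a 16-byte big-endian UUID payload.
  static long lsb(byte[] bytes) {
    return ByteBuffer.wrap(bytes, 8, 8).getLong();
  }

  // Equality is two primitive comparisons instead of a 16-iteration byte loop.
  static boolean equalsUuid(long msb1, long lsb1, long msb2, long lsb2) {
    return msb1 == msb2 && lsb1 == lsb2;
  }

  // Unsigned comparison keeps the same order as unsigned byte-wise comparison.
  static int compareUuid(long msb1, long lsb1, long msb2, long lsb2) {
    int result = Long.compareUnsigned(msb1, msb2);
    return result != 0 ? result : Long.compareUnsigned(lsb1, lsb2);
  }

  // Folds both halves into 32 bits; yields the same value as java.util.UUID.hashCode().
  static int hashUuid(long msb, long lsb) {
    return Long.hashCode(msb ^ lsb);
  }

  public static void main(String[] args) {
    byte[] a = ByteBuffer.allocate(16).putLong(0x550e8400e29b41d4L).putLong(0xa716446655440000L).array();
    byte[] b = ByteBuffer.allocate(16).putLong(0x550e8400e29b41d4L).putLong(0xa716446655440001L).array();
    int viaLongs = compareUuid(msb(a), lsb(a), msb(b), lsb(b));
    int viaBytes = Arrays.compareUnsigned(a, b);
    // Both comparisons agree on the sign of the result.
    System.out.println(Integer.signum(viaLongs) == Integer.signum(viaBytes));
    System.out.println(equalsUuid(msb(a), lsb(a), msb(b), lsb(b)));
  }
}
```

Decoding once and then working on primitives is the point of the split: the bytes stay the storage contract, while the comparison loop touches only two longs per value.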

User guide

1. Schema definition

Single-value UUID columns can be declared directly in the Pinot schema.

{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {
      "name": "eventId",
      "dataType": "UUID"
    },
    {
      "name": "traceId",
      "dataType": "UUID"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

If a schema declares a multi-value UUID column, validation now fails clearly.

2. Table behavior

UUID columns work with the same table features as other single-value dimension columns in v1, including:

dictionary and no-dictionary columns
bloom filters
raw-value inverted indexes for no-dictionary UUID columns
offline and realtime tables
upsert primary-key use cases backed by UUID columns

Example table config fragment:

{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "events"
  },
  "fieldConfigList": [
    {
      "name": "eventId",
      "encodingType": "RAW",
      "indexes": {
        "bloom": {},
        "inverted": {}
      }
    }
  ]
}

3. Ingest behavior

Pinot normalizes UUID inputs to a 16-byte internal representation.

Accepted v1 inputs:

canonical lowercase UUID strings, for example 550e8400-e29b-41d4-a716-446655440000
java.util.UUID
16-byte byte[]

Supported ingest paths covered in this PR:

record-reader and segment-local normalization
offline ingest
realtime ingest
Avro interop, including Avro logicalType: "uuid"

Invalid UUID values fail with explicit validation/conversion errors instead of silently degrading.
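
The normalization described above can be sketched roughly as follows. This is a hypothetical helper, not the PR's ingest code: it takes the accepted-input list literally (so uppercase text is rejected, which is an assumption about strictness), and it checks the text form with a regular expression because `java.util.UUID.fromString` can accept some non-canonical inputs such as shortened groups.

```java
import java.nio.ByteBuffer;
import java.util.UUID;
import java.util.regex.Pattern;

// Hypothetical ingest-side normalizer; class name and strictness rules are assumptions.
final class UuidIngestSketch {
  private static final Pattern CANONICAL_LOWERCASE =
      Pattern.compile("[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}");

  // Converts any accepted input form to the 16-byte internal representation.
  static byte[] normalize(Object value) {
    if (value instanceof byte[]) {
      byte[] bytes = (byte[]) value;
      if (bytes.length != 16) {
        throw new IllegalArgumentException("Expected 16 bytes for UUID, got " + bytes.length);
      }
      return bytes.clone();
    }
    if (value instanceof UUID) {
      return toBytes((UUID) value);
    }
    if (value instanceof String) {
      String text = (String) value;
      if (!CANONICAL_LOWERCASE.matcher(text).matches()) {
        throw new IllegalArgumentException("Not a canonical lowercase UUID: " + text);
      }
      return toBytes(UUID.fromString(text));
    }
    throw new IllegalArgumentException(
        "Unsupported UUID input type: " + (value == null ? "null" : value.getClass().getName()));
  }

  // Big-endian encoding, most significant long first (assumed layout).
  private static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  public static void main(String[] args) {
    System.out.println(normalize("550e8400-e29b-41d4-a716-446655440000").length);
    try {
      normalize("not-a-uuid");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```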

4. Query examples

Selection

SELECT eventId, traceId
FROM events
ORDER BY eventId
LIMIT 10

UUID columns are returned as canonical UUID strings.

CAST to UUID

SELECT eventId
FROM events
WHERE eventId = CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID)

IN predicate

SELECT eventId
FROM events
WHERE eventId IN (
  CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID),
  CAST('550e8400-e29b-41d4-a716-446655440001' AS UUID)
)
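
When the IN list comes from application code, the predicate above can be assembled from `java.util.UUID` values. A minimal sketch, assuming the column name is trusted input; the class and method names are hypothetical and this is not part of any Pinot client API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

// Hypothetical query-building helper for illustration only.
final class UuidInPredicateSketch {
  // Builds "<column> IN (CAST('<uuid>' AS UUID), ...)" from typed UUID values.
  static String inPredicate(String column, List<UUID> ids) {
    if (ids.isEmpty()) {
      throw new IllegalArgumentException("IN list must not be empty");
    }
    // UUID.toString() emits only lowercase hex digits and dashes, so the literal needs no escaping.
    return ids.stream()
        .map(id -> "CAST('" + id + "' AS UUID)")
        .collect(Collectors.joining(", ", column + " IN (", ")"));
  }

  public static void main(String[] args) {
    List<UUID> ids = Arrays.asList(
        UUID.fromString("550e8400-e29b-41d4-a716-446655440000"),
        UUID.fromString("550e8400-e29b-41d4-a716-446655440001"));
    System.out.println("SELECT eventId FROM events WHERE " + inPredicate("eventId", ids));
  }
}
```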

GROUP BY

SELECT eventId, COUNT(*)
FROM events
GROUP BY eventId
ORDER BY eventId

DISTINCT

SELECT DISTINCT eventId
FROM events
ORDER BY eventId

ORDER BY

SELECT eventId, ts
FROM events
ORDER BY eventId, ts

Equality join

SELECT a.eventId, a.ts, b.ts
FROM events_offline a
JOIN events_realtime b
  ON a.eventId = b.eventId
WHERE a.eventId = CAST('550e8400-e29b-41d4-a716-446655440000' AS UUID)

These query patterns are covered for both SSE and MSE where applicable.

5. Result behavior

For columns declared as UUID:

SELECT uuidCol FROM t returns canonical lowercase UUID strings
GROUP BY, DISTINCT, and join outputs also render UUID strings
Arrow and JSON broker response encoders both preserve UUID string rendering

For columns declared as BYTES:

behavior stays unchanged
results still render as hex
no existing BYTES semantics are changed by this PR
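
The difference between the two renderings can be shown on the same 16-byte payload. The sketch below is illustrative only: the class and method names are hypothetical, the hex form is assumed to be lowercase, and the byte layout is assumed to be big-endian.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical renderers contrasting BYTES and UUID output for identical bytes.
final class UuidRenderSketch {
  // Plain BYTES rendering: two lowercase hex digits per byte, no separators.
  static String renderAsBytes(byte[] value) {
    StringBuilder sb = new StringBuilder(value.length * 2);
    for (byte b : value) {
      sb.append(Character.forDigit((b >> 4) & 0xF, 16));
      sb.append(Character.forDigit(b & 0xF, 16));
    }
    return sb.toString();
  }

  // Logical UUID rendering: canonical lowercase 8-4-4-4-12 text.
  static String renderAsUuid(byte[] value) {
    if (value.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + value.length);
    }
    ByteBuffer buffer = ByteBuffer.wrap(value);
    return new UUID(buffer.getLong(), buffer.getLong()).toString();
  }

  public static void main(String[] args) {
    byte[] payload = ByteBuffer.allocate(16)
        .putLong(0x550e8400e29b41d4L)
        .putLong(0xa716446655440000L)
        .array();
    System.out.println(renderAsBytes(payload)); // what a BYTES column would show
    System.out.println(renderAsUuid(payload));  // what a UUID column would show
  }
}
```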

6. Existing UUID byte/string helpers

This PR keeps the existing helper functions usable:

toUUIDBytes(...)
fromUUIDBytes(...)

The new logical UUID type is additive and does not replace plain BYTES workflows.

Migration notes

Schema evolution constraint

Pinot does not support changing the data type of an existing column in place. This PR adds a new logical type, but it does not change that schema-evolution rule.

From `STRING`

If a column currently stores canonical UUID strings and should behave as a typed UUID column, the practical path is to create a new UUID column or a new table/schema and reingest or backfill the data.

From `BYTES`

If a column already stores 16-byte UUID payloads, declaring a new column as UUID gives UUID-aware query behavior and rendering, but existing BYTES columns remain BYTES columns. Adopting the logical UUID type still requires a new column or table rebuild/reingest rather than an in-place type mutation.

Important distinction

A column must be declared as UUID to get UUID result rendering. Declaring the column as BYTES continues to render hex, even if the underlying bytes happen to represent UUID values.

Format compatibility

The UUID type itself does not require a segment or data-table wire format bump in v1. That compatibility statement is about representation, not about in-place schema migration.

Scope exclusions in v1

This PR intentionally does not include:

multi-value UUID columns
segment format changes
data-table wire format changes
in-place schema type mutation from existing STRING or BYTES columns to UUID
a broader reinterpretation of plain BYTES columns as UUIDs
UUID range predicate support as part of the v1 contract

Validation

Targeted local validation run for this branch includes:

./mvnw -pl pinot-spi -Dtest=FieldSpecTest,SchemaTest test -Dcheckstyle.skip=true
./mvnw -pl pinot-common,pinot-core -am -Dtest=JsonResponseEncoderTest,ArrowResponseEncoderTest,BytesDistinctTableTest,SelectionOperatorUtilsTest,CastTransformFunctionTest,ScalarTransformFunctionWrapperTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
./mvnw -pl pinot-segment-local,pinot-segment-spi -am -Dtest=MutableDictionaryTest,DefaultNullValueVirtualColumnProviderTest,RawValueBitmapInvertedIndexTest,BloomFilterCreatorTest,BloomFilterSegmentPrunerTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
./mvnw -pl pinot-plugins/pinot-input-format/pinot-avro-base -am -Dtest=AvroUtilsTest,AvroSchemaUtilTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
./mvnw -pl pinot-integration-tests -am -Dtest=UuidTypeTest,UuidTypeRealtimeTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
./mvnw -pl pinot-integration-tests -am -Ppinot-fastdev -Dtest=UuidTypeTest,UuidTypeRealtimeTest -Dsurefire.failIfNoSpecifiedTests=false -Dcheckstyle.skip=true test
./mvnw -pl pinot-common checkstyle:check license:check -DskipTests

GitHub Actions on this PR continue to run the full Pinot CI matrix.

codecov-commenter · 2026-04-09T01:44:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (31eac83) to head (590c1c4).

Additional details and impacted files

@@              Coverage Diff               @@
##             master    #18140       +/-   ##
==============================================
+ Coverage     63.29%   100.00%   +36.70%     
+ Complexity     1627         6     -1621     
==============================================
  Files          3226         3     -3223     
  Lines        196636         6   -196630     
  Branches      30401         0    -30401     
==============================================
- Hits         124466         6   -124460     
+ Misses        62192         0    -62192     
+ Partials       9978         0     -9978

Flag	Coverage Δ
custom-integration1	`?`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`0.00% <ø> (-63.27%)`	⬇️
java-21	`100.00% <ø> (+36.74%)`	⬆️
temurin	`100.00% <ø> (+36.70%)`	⬆️
unittests	`?`
unittests1	`?`
unittests2	`?`


Copilot

Pull request overview

Adds a first-class logical UUID type to Pinot (backed by 16-byte BYTES in v1), and wires UUID semantics through schema/type handling, indexing, query planning/execution, and result formatting, with unit + integration coverage and user-facing documentation.

Changes:

Introduce UUID as a logical type (FieldSpec.DataType.UUID / DataSchema.ColumnDataType.UUID) with canonical RFC 4122 lowercase string rendering via UuidUtils.
Propagate UUID-aware behavior through dictionaries, bloom filters, raw-value inverted indexes, casts/literals, predicates, distinct/grouping, and query planner/runtime type mapping + (de)serialization.
Add targeted unit tests plus offline/realtime integration tests and README documentation.

Reviewed changes

Copilot reviewed 76 out of 76 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
README.md	Document UUID logical type usage, casting, and migration notes.
pinot-spi/src/test/java/org/apache/pinot/spi/data/SchemaTest.java	Add schema validation tests for UUID SV-only and default handling.
pinot-spi/src/test/java/org/apache/pinot/spi/data/FieldSpecTest.java	Add UUID DataType storedType/size and conversion/default-null tests.
pinot-spi/src/main/java/org/apache/pinot/spi/utils/UuidUtils.java	New UUID conversion utilities (string/UUID/bytes/ByteArray).
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java	Add UUID null placeholders.
pinot-spi/src/main/java/org/apache/pinot/spi/data/Schema.java	Allow UUID in schema validation; enforce UUID SV-only.
pinot-spi/src/main/java/org/apache/pinot/spi/data/FieldSpec.java	Add UUID DataType and UUID-aware conversions/formatting/default handling.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/creator/BloomFilterCreator.java	Add UUID string rendering when inserting into bloom filters.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/inv/RawValueBitmapInvertedIndexTest.java	Extend raw inverted index tests to UUID + generic API path.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/BloomFilterCreatorTest.java	Add bloom filter creator test for UUID values stored as bytes.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/column/DefaultNullValueVirtualColumnProviderTest.java	Add UUID coverage for virtual column default-null dictionary/metadata.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/impl/dictionary/MutableDictionaryTest.java	Add UUID canonical string lookup tests for mutable bytes dictionaries.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/RawValueBitmapInvertedIndexReader.java	Make bytes dictionary logical-type aware; add `getDocIdsForBytes`.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapBytesDictionary.java	Add logical type to parse/format BYTES vs UUID.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBytesDictionary.java	Add logical type to parse/format BYTES vs UUID.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BytesDictionary.java	Add logical type to parse/format BYTES vs UUID.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/invertedindex/InvertedIndexHandler.java	Use stored type; enable raw inverted index creation for UUID (stored as BYTES).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/columnminmaxvalue/ColumnMinMaxValueGenerator.java	Pass logical type into bytes dictionary for min/max generation.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/bloomfilter/BloomFilterHandler.java	Use DataType-aware string formatting for bloom filter population (UUID vs BYTES).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java	Plumb logical type into bytes dictionary and mutable dictionary creation.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/column/DefaultNullValueVirtualColumnProvider.java	Build bytes dictionary with logical type to format UUID defaults correctly.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/inv/RawValueBitmapInvertedIndexCreator.java	Fix raw inverted index dictionary temp-file handling; use ByteArray keys for BYTES/UUID.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/BaseSegmentCreator.java	Allow inverted index without dictionary for UUID via raw-value inverted index creator.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/MutableDictionaryFactory.java	Create bytes dictionaries with logical type (UUID vs BYTES) based on stored type.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOnHeapMutableDictionary.java	Add logical type parsing/formatting for UUID vs BYTES.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOffHeapMutableDictionary.java	Add logical type parsing/formatting for UUID vs BYTES.
pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/plan/server/ServerPlanRequestUtils.java	Emit UUID IN operands as canonical strings instead of raw bytes literals.
pinot-query-planner/src/test/java/org/apache/pinot/query/type/TypeFactoryTest.java	Add UUID type conversion tests and skip UUID array tests.
pinot-query-planner/src/test/java/org/apache/pinot/query/planner/serde/RexExpressionSerDeTest.java	Add UUID literal SerDe test and supported type list.
pinot-query-planner/src/test/java/org/apache/pinot/query/planner/logical/RelToPlanNodeConverterTest.java	Add UUID column type conversion tests and reject UUID arrays.
pinot-query-planner/src/main/java/org/apache/pinot/query/type/TypeFactory.java	Map Pinot UUID to Calcite `SqlTypeName.UUID`.
pinot-query-planner/src/main/java/org/apache/pinot/query/planner/serde/RexExpressionToProtoExpression.java	Map UUID column type to proto enum.
pinot-query-planner/src/main/java/org/apache/pinot/query/planner/serde/ProtoExpressionToRexExpression.java	Map proto UUID enum back to planner column type.
pinot-query-planner/src/main/java/org/apache/pinot/query/planner/physical/v2/PRelToPlanNodeConverter.java	Convert Calcite UUID to Pinot UUID; reject UUID arrays.
pinot-query-planner/src/main/java/org/apache/pinot/query/planner/logical/RexExpressionUtils.java	Add UUID literal conversion to/from Rex values.
pinot-query-planner/src/main/java/org/apache/pinot/query/planner/logical/RelToPlanNodeConverter.java	Convert Calcite UUID to Pinot UUID; reject UUID arrays.
pinot-query-planner/src/main/java/org/apache/pinot/query/parser/CalciteRexExpressionParser.java	Serialize UUID literals as canonical strings for SQL/parsing paths.
pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroUtilsTest.java	Test Avro<->Pinot schema mapping for UUID logical type.
pinot-plugins/pinot-input-format/pinot-avro-base/src/test/java/org/apache/pinot/plugin/inputformat/avro/AvroSchemaUtilTest.java	Test Avro schema JSON object generation for UUID fields.
pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java	Enable Avro UUID logical type conversion + UUID schema handling.
pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroSchemaUtil.java	Map Avro `logicalType: uuid` to Pinot UUID + emit uuid logical type in Avro schema JSON.
pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroIngestionSchemaValidator.java	Fix mismatch message to use extracted Pinot type name.
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/UuidTypeTest.java	Offline integration coverage for select/filter/group/distinct/order/join with UUID.
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/UuidTypeRealtimeTest.java	Realtime integration coverage via subclassed UUID test.
pinot-integration-test-base/src/test/java/org/apache/pinot/integration/tests/ClusterTest.java	Treat UUID like STRING/BYTES when extracting JSON response values in tests.
pinot-core/src/test/java/org/apache/pinot/core/query/selection/SelectionOperatorUtilsTest.java	Verify result formatting distinguishes UUID (canonical) vs BYTES (hex).
pinot-core/src/test/java/org/apache/pinot/core/query/pruner/BloomFilterSegmentPrunerTest.java	Add UUID bloom filter pruning test; allow mocking with arbitrary DataType.
pinot-core/src/test/java/org/apache/pinot/core/query/distinct/table/BytesDistinctTableTest.java	Test UUID vs BYTES formatting in bytes distinct table (with/without ORDER BY).
pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/CastTransformFunctionTest.java	Add UUID cast tests, invalid literal rejection, and MV-source rejection.
pinot-core/src/main/java/org/apache/pinot/core/query/reduce/GroupByDataTableReducer.java	Treat UUID like BYTES in group key extraction (raw bytes).
pinot-core/src/main/java/org/apache/pinot/core/query/reduce/filter/PredicateRowMatcher.java	Convert UUID row values to bytes before applying predicate evaluator.
pinot-core/src/main/java/org/apache/pinot/core/query/pruner/ValueBasedSegmentPruner.java	Hash bloom filter values using DataType-aware string formatting (UUID vs BYTES).
pinot-core/src/main/java/org/apache/pinot/core/query/distinct/table/BytesDistinctTable.java	Preserve internal ByteArray and format at the end via schema type (UUID vs BYTES).
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/InTransformFunction.java	Parse IN-list literals as UUID bytes when main function type is UUID.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/IdentifierTransformFunction.java	Provide UUID string rendering for UUID columns (from underlying bytes).
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/CastTransformFunction.java	Add `CAST(... AS UUID)` support (STRING/BYTES -> UUID) and string rendering.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/BaseTransformFunction.java	Add UUID metadata and UUID->STRING rendering; prevent generic UUID-as-bytes fallback.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/RawValueInvertedIndexFilterOperator.java	Support raw inverted index filtering for BYTES and UUID literals.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateUtils.java	Add UUID IN-predicate dictionary id set computation.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotInPredicateEvaluatorFactory.java	Add UUID raw predicate evaluator support (bytes-set with UUID type).
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/NotEqualsPredicateEvaluatorFactory.java	Add UUID equals/neq evaluator support for dict and raw paths.
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/InPredicateEvaluatorFactory.java	Add UUID IN evaluator support (bytes-set with UUID type).
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/EqualsPredicateEvaluatorFactory.java	Add UUID equals evaluator support for dict and raw paths.
pinot-common/src/test/java/org/apache/pinot/common/utils/PinotDataTypeTest.java	Add UUID conversions and type inference tests.
pinot-common/src/test/java/org/apache/pinot/common/utils/DataSchemaTest.java	Add UUID column type coverage (compat, formatting, conversion).
pinot-common/src/test/java/org/apache/pinot/common/response/encoder/JsonResponseEncoderTest.java	Add UUID round-trip encoding/decoding test for result tables.
pinot-common/src/test/java/org/apache/pinot/common/request/context/RequestContextUtilsTest.java	Test filter conversion for UUID cast literals on RHS.
pinot-common/src/test/java/org/apache/pinot/common/function/FunctionUtilsTest.java	Test UUID Java type mappings to Pinot types and Calcite rel types.
pinot-common/src/main/proto/expressions.proto	Add UUID to proto `ColumnDataType` enum.
pinot-common/src/main/java/org/apache/pinot/common/utils/PinotDataType.java	Add UUID PinotDataType and conversions/toInternal handling.
pinot-common/src/main/java/org/apache/pinot/common/utils/DataSchema.java	Add UUID ColumnDataType, internal/external conversions, formatting and rel type mapping.
pinot-common/src/main/java/org/apache/pinot/common/response/encoder/JsonResponseEncoder.java	Treat UUID like STRING/BYTES when extracting JSON-encoded row values.
pinot-common/src/main/java/org/apache/pinot/common/request/context/RequestContextUtils.java	Add literal-only CAST evaluation on predicate RHS; support UUID cast literals.
pinot-common/src/main/java/org/apache/pinot/common/request/context/predicate/BaseInPredicate.java	Add UUID value parsing/cache for IN predicates.
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java	Reuse `UuidUtils` for UUID bytes conversions.
pinot-common/src/main/java/org/apache/pinot/common/function/FunctionUtils.java	Add UUID Java type mappings and Calcite rel type mapping.

...format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroUtils.java

pinot-spi/src/main/java/org/apache/pinot/spi/utils/UuidUtils.java

pinot-spi/src/main/java/org/apache/pinot/spi/data/Schema.java

pinot-spi/src/main/java/org/apache/pinot/spi/utils/UuidUtils.java

...ocal/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BytesDictionary.java

.../org/apache/pinot/segment/local/segment/index/loader/invertedindex/InvertedIndexHandler.java

...java/org/apache/pinot/segment/local/segment/index/loader/bloomfilter/BloomFilterHandler.java

...n/java/org/apache/pinot/segment/local/realtime/impl/dictionary/MutableDictionaryFactory.java

Copilot

Pull request overview

Copilot reviewed 85 out of 85 changed files in this pull request and generated 3 comments.

pinot-spi/src/main/java/org/apache/pinot/spi/utils/UuidUtils.java

pinot-common/src/main/java/org/apache/pinot/common/request/context/RequestContextUtils.java

Copilot

Pull request overview

Copilot reviewed 97 out of 97 changed files in this pull request and generated 4 comments.

pinot-spi/src/main/java/org/apache/pinot/spi/utils/UuidUtils.java

pinot-spi/src/main/java/org/apache/pinot/spi/data/FieldSpec.java

...rg/apache/pinot/segment/local/segment/index/creator/inv/RawValueBitmapInvertedIndexTest.java

ankitsultana · 2026-04-12T06:46:10Z

@xiangfu0 could you break this down into smaller PRs? Graphite would be perfect for this.

On a design note: I think using the existing bytes type would add a meaningful performance penalty at the very lowest layers because most operations will require a lookup on the number of bytes.

xiangfu0 · 2026-04-12T09:19:47Z

> @xiangfu0 could you break this down into smaller PRs? Graphite would be perfect for this.
>
> On a design note: I think using the existing bytes type would add a meaningful performance penalty at the very lowest layers because most operations will require a lookup on the number of bytes.

This is a good point; I will try to benchmark the perf impact.

I feel that on the storage side, 16 bytes is already good enough.

Do you have any suggestions on the query side?

ankitsultana · 2026-04-12T19:53:19Z

Storage

On Disk Size

I'd imagine that whenever we store bytes, we have to store an offsets header to mark where the i-th value begins. This would happen both in the dictionary and in the raw forward index, unless we auto-switch to FixedByteReaderWriter after detecting that all values are the same size. IIRC we used to do that, so we should be good here?

Scan Performance

I think even if we get rid of the storage overhead mentioned above, FixedByteReaderWriter still ends up using readUnpaddedBytes, which relies on the SWAR zero-in-word bit hack. While that is faster than a naive scan, IIRC it was still at least 20% slower than a simpler approach that just assumes each value has a given fixed width. I filed an issue about this last year: #16618 (comment)

Query

I think the most important operations for UUIDs are Group Bys and hash-table lookups. In both of these, the performance of equals and hashCode could differ significantly between an approach that uses class { long msb; long lsb; } (or similar) and a ByteArray-based approach (byte[] bytes).

To that end, for benchmarking I think we can just test the two approaches using microbenchmarks that test the group id generators for both V1 and V2 engines.

@xiangfu0 : we can also sync up on slack to expedite this. I can help share a PR with some benchmarks too.

This reverts commit dca827a.

Add long-pair UUID helpers and adopt them across UUID comparison, group-by, join, and segment-local paths. Preserve composite primary-key UUID hashing after UUID values are normalized to ByteArray and update the benchmark to exercise the production UUID key implementations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Jackie-Jiang requested that BIG_DECIMAL get the same single-value-only restriction that was added for UUID in Schema.validate(FieldSpec). BIG_DECIMAL is SV-only by implementation (no MV forward-index or dictionary exists for it). Also updated Javadoc to document both restrictions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

xiangfu0 force-pushed the codex/uuid-v1-support branch from 94e9c8b to be53c71 April 9, 2026 03:23

xiangfu0 requested a review from Copilot April 9, 2026 04:12

Copilot started reviewing on behalf of xiangfu0 April 9, 2026 04:13

Copilot AI reviewed Apr 9, 2026

xiangfu0 force-pushed the codex/uuid-v1-support branch from 271dd9c to 64febd9 April 10, 2026 10:48

xiangfu0 changed the title ~~[codex] Add first-class UUID support~~ Add first-class UUID support Apr 10, 2026

xiangfu0 marked this pull request as ready for review April 10, 2026 12:23

xiangfu0 changed the title ~~Add first-class UUID support~~ Add first-class logical UUID type support Apr 10, 2026

Jackie-Jiang reviewed Apr 11, 2026


xiangfu0 requested a review from Copilot April 11, 2026 04:47

Copilot started reviewing on behalf of xiangfu0 April 11, 2026 04:47

Copilot AI reviewed Apr 11, 2026

xiangfu0 requested review from Jackie-Jiang and Copilot April 11, 2026 05:57

Copilot started reviewing on behalf of xiangfu0 April 11, 2026 05:57

Copilot AI reviewed Apr 11, 2026

xiangfu0 force-pushed the codex/uuid-v1-support branch from 68cfdd9 to 92be21a April 11, 2026 07:33

ankitsultana mentioned this pull request Apr 12, 2026

[feature-request] Add a Native UUID Type #16619

Open

Add UUID logical type support

3eb65f9

xiangfu0 and others added 14 commits April 13, 2026 12:15

Complete UUID v1 coverage and fix CI regressions

2ffa74f

Fix planner spotless import order

bb649d1

Fix UUID test scaffolding for CI

22dee90

Address UUID review feedback

bba6a2c

Fix UUID Arrow encoding and legacy UDF behavior

e9c69c2

Clarify UUID migration constraints

ceebd8a

Add UUID conversion support and harden review feedback

0ee6e88

Address UUID review feedback and stabilize CI

1a38e08

Stabilize UUID upsert realtime test readiness

47aacca

Use HTTP broker path for UUID upsert test

e8d16d5

Revert "Use HTTP broker path for UUID upsert test"

9d29d60

This reverts commit dca827a.

Fix indentation of throw in RequestContextUtils default case

3cec515

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

xiangfu0 force-pushed the codex/uuid-v1-support branch from 8388860 to 590c1c4 April 13, 2026 19:35
