feat: arrow_upper/array_lower UDFs for datafusion 54#362
Open
sunng87 wants to merge 5 commits into
Open
Conversation
DataFusion ships `array_length` but not `array_upper`/`array_lower`, so
Postgres client queries using them only "worked" via hardcoded blacklist
token substitution -- any non-byte-matching variant failed at planning
with `Invalid function 'array_upper'`.
Implement both as scalar UDFs (`array_bounds_udf.rs`) with Postgres
semantics:
* arrow lists are always 1-based, so array_lower(arr,1) -> 1 and
array_upper(arr,1) -> array_length(arr,1)
* NULL for null array, out-of-range dim, or empty array
* accepts List/LargeList/FixedSizeList + Int32/Int64 dims, returns Int32
Register them in `setup_pg_catalog`.
The dbeaver startup test went from FAILED -> PASS (18/19 queries now run
for real). The two dbeaver queries that still can't run are blacklisted:
one needs dynamic `generate_series` bounds (DF only takes literal Int64),
the other needs Postgres regnamespace/oid-alias string resolution.
Neither is fixable by adding the UDF; both get type-compatible NULL stubs.
Includes unit + end-to-end integration tests.
These are all Postgres `oid` (int4) alias types: casting a *name* to one
yields the matching catalog oid. DataFusion has no such types, so a bare
`'trigger'::regtype` was stripped to the string `'trigger'` and then
crashed an oid comparison with "Cannot cast string 'trigger' to Int32".
Generalize the existing regclass-only rewrite rule into
`RewriteRegCastToSubquery`, with a registry of name->oid lookup queries:
* regclass -> pg_class.oid (schema-qualified relation name)
* regnamespace -> pg_namespace.oid (schema name)
* regtype -> pg_type.oid (schema-qualified type name)
* regproc -> pg_proc.oid (schema-qualified function name)
Direction guard: only the *forward* direction (name -> oid) is rewritten,
detected by a string-literal or `$1` placeholder operand. The *reverse*
direction (`<oid-column>::regtype`, e.g. `prorettype::regtype::text` for
display, or the left side of `prorettype::regtype != 'trigger'::regtype`)
has a column operand and is left untouched; it is stripped to the bare
oid column by RemoveUnsupportedTypes, which is correct since the column
already is an oid. `regclass` is re-added to the unsupported-types set so
its reverse direction is stripped too (it was removed earlier for the
forward case, now handled by the rewrite).
Results -- client startup tests, base branch -> this branch:
* pgcli : FAILED -> PASS (the `'trigger'::regtype` crash is fixed)
* dbeaver: PASS (already green from the array_upper work)
* psql : 3 failures -> 1 failure; the two regclass/regtype-related
queries now pass. The remaining failure is a DataFusion
internal "ScalarSubqueryExpr evaluated before the subquery was
executed" bug in `IN (subquery UNION ALL VALUES(...))`, which
is blacklisted here as it can't be fixed at the SQL-rewrite layer.
The full workspace test suite is green.
The previous commit added 3 BLACKLIST_SQL_MAPPING entries to keep client
startup tests green. Two of them can be replaced with general, correct
rewrite rules; only the one that genuinely needs schema awareness remains.
1. dbeaver type lookup (REMOVED):
`generate_series(1, array_upper(current_schemas(false), 1))` failed
because our array_upper UDF returns Int32 (correct Postgres int4) but
DataFusion's generate_series wants Int64. Postgres implicitly coerces
int4->int8 here; DF does not. New rule `CastArrayBoundsForGenerateSeries`
wraps array_upper/array_lower args of generate_series in ::bigint, the
way Postgres would.
2. psql foreign-key \d lookup (REMOVED):
`VALUES ('16417'::pg_catalog.regclass)` broke DF's subquery
decorrelator because the regclass rewrite embedded a (SELECT ...) inside
VALUES. Fix it at the source: a *numeric* string operand like '16417'
should resolve directly to that oid (Postgres behavior), not do a name
lookup. RewriteRegCastToSubquery now emits a literal oid for numeric
operands, which both is more correct and avoids the nested subquery.
3. dbeaver relation-size lookup (KEPT):
`WHERE c.relnamespace='public'` compares an oid column to a bare string,
needing Postgres' implicit string->oid coercion. The explicit
`'public'::regnamespace` form is already handled, but detecting that a
bare string needs coercing requires knowing the column is an oid column
-- schema awareness the AST-rewrite layer lacks. Hardcoding pg_catalog
column names would risk regressing user tables with same-named columns,
so this single entry stays, with an accurate comment.
Net: BLACKLIST_SQL_MAPPING goes from +3 (previous commit) to +1 vs the
base branch, and that one is the genuinely hard case. Full test suite +
clippy are green.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.