Skip to content

feat: arrow_upper/array_lower UDFs for datafusion 54#362

Open
sunng87 wants to merge 5 commits into
dependabot/cargo/datafusion-54.0.0from
feat/array-upper-lower-udf
Open

feat: arrow_upper/array_lower UDFs for datafusion 54#362
sunng87 wants to merge 5 commits into
dependabot/cargo/datafusion-54.0.0from
feat/array-upper-lower-udf

Conversation

@sunng87

@sunng87 sunng87 commented Jun 22, 2026

Copy link
Copy Markdown
Member

No description provided.

sunng87 added 4 commits June 20, 2026 11:31
DataFusion ships `array_length` but not `array_upper`/`array_lower`, so
Postgres client queries using them only "worked" via hardcoded blacklist
token substitution -- any non-byte-matching variant failed at planning
with `Invalid function 'array_upper'`.

Implement both as scalar UDFs (`array_bounds_udf.rs`) with Postgres
semantics:
  * arrow lists are always 1-based, so array_lower(arr,1) -> 1 and
    array_upper(arr,1) -> array_length(arr,1)
  * NULL for null array, out-of-range dim, or empty array
  * accepts List/LargeList/FixedSizeList + Int32/Int64 dims, returns Int32

Register them in `setup_pg_catalog`.

The dbeaver startup test went from FAILED -> PASS (18/19 queries now run
for real). The two dbeaver queries that still can't run are blacklisted:
one needs dynamic `generate_series` bounds (DF only takes literal Int64),
the other needs Postgres regnamespace/oid-alias string resolution.
Neither is fixable by adding the UDF; both get type-compatible NULL stubs.

Includes unit + end-to-end integration tests.
These are all Postgres `oid` (int4) alias types: casting a *name* to one
yields the matching catalog oid. DataFusion has no such types, so a bare
`'trigger'::regtype` was stripped to the string `'trigger'` and then
crashed an oid comparison with "Cannot cast string 'trigger' to Int32".

Generalize the existing regclass-only rewrite rule into
`RewriteRegCastToSubquery`, with a registry of name->oid lookup queries:
  * regclass      -> pg_class.oid      (schema-qualified relation name)
  * regnamespace  -> pg_namespace.oid  (schema name)
  * regtype       -> pg_type.oid       (schema-qualified type name)
  * regproc       -> pg_proc.oid       (schema-qualified function name)

Direction guard: only the *forward* direction (name -> oid) is rewritten,
detected by a string-literal or `$1` placeholder operand. The *reverse*
direction (`<oid-column>::regtype`, e.g. `prorettype::regtype::text` for
display, or the left side of `prorettype::regtype != 'trigger'::regtype`)
has a column operand and is left untouched; it is stripped to the bare
oid column by RemoveUnsupportedTypes, which is correct since the column
already is an oid. `regclass` is re-added to the unsupported-types set so
its reverse direction is stripped too (it was removed earlier for the
forward case, now handled by the rewrite).

Results -- client startup tests, base branch -> this branch:
  * pgcli : FAILED -> PASS (the `'trigger'::regtype` crash is fixed)
  * dbeaver: PASS (already green from the array_upper work)
  * psql  : 3 failures -> 1 failure; the two regclass/regtype-related
            queries now pass. The remaining failure is a DataFusion
            internal "ScalarSubqueryExpr evaluated before the subquery was
            executed" bug in `IN (subquery UNION ALL VALUES(...))`, which
            is blacklisted here as it can't be fixed at the SQL-rewrite layer.

The full workspace test suite is green.
The previous commit added 3 BLACKLIST_SQL_MAPPING entries to keep client
startup tests green. Two of them can be replaced with general, correct
rewrite rules; only the one that genuinely needs schema awareness remains.

1. dbeaver type lookup (REMOVED):
   `generate_series(1, array_upper(current_schemas(false), 1))` failed
   because our array_upper UDF returns Int32 (correct Postgres int4) but
   DataFusion's generate_series wants Int64. Postgres implicitly coerces
   int4->int8 here; DF does not. New rule `CastArrayBoundsForGenerateSeries`
   wraps array_upper/array_lower args of generate_series in ::bigint, the
   way Postgres would.

2. psql foreign-key \d lookup (REMOVED):
   `VALUES ('16417'::pg_catalog.regclass)` broke DF's subquery
   decorrelator because the regclass rewrite embedded a (SELECT ...) inside
   VALUES. Fix it at the source: a *numeric* string operand like '16417'
   should resolve directly to that oid (Postgres behavior), not do a name
   lookup. RewriteRegCastToSubquery now emits a literal oid for numeric
   operands, which both is more correct and avoids the nested subquery.

3. dbeaver relation-size lookup (KEPT):
   `WHERE c.relnamespace='public'` compares an oid column to a bare string,
   needing Postgres' implicit string->oid coercion. The explicit
   `'public'::regnamespace` form is already handled, but detecting that a
   bare string needs coercing requires knowing the column is an oid column
   -- schema awareness the AST-rewrite layer lacks. Hardcoding pg_catalog
   column names would risk regressing user tables with same-named columns,
   so this single entry stays, with an accurate comment.

Net: BLACKLIST_SQL_MAPPING goes from +3 (previous commit) to +1 vs the
base branch, and that one is the genuinely hard case. Full test suite +
clippy are green.
@sunng87 sunng87 changed the title feat: arrow_upper/array_lower UDFs feat: arrow_upper/array_lower UDFs for datafusion 54 Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant