Skip to content

Unexpected NULL-Left Rows with RIGHT OUTER JOIN using Cluster Table Functions with global join #1664

@alsugiliazova

Description

@alsugiliazova

Summary

A RIGHT OUTER JOIN between s3Cluster and icebergS3Cluster table functions with object_storage_cluster_join_mode='global' returns 6 rows instead of the expected 2. The 2 matched rows appear correctly, but 4 additional spurious rows with NULL on the left side are emitted — as if those right-side rows had no match on the left, even though every right-side row does match a left-side row.


Test Parameters

Parameter Value
Left table s3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data3/data/**.parquet', ...)
Right table icebergS3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data1', ...)
Join clause RIGHT OUTER JOIN
Join condition t1.boolean_col = t2.boolean_col AND t1.string_col = t2.string_col
object_storage_cluster_join_mode global
object_storage_cluster replicated_cluster
Format Values
Order by tuple(*)

All tables have same two rows:

SELECT *
FROM merge_tree_table_d435968e_3cab_11f1_a333_de7b9eea348f
Query id: 4ed08192-c8f0-4221-82f3-2058f558586c

   ┌─boolean_col─┬─long_col─┬─double_col─┬─string_col─┬──────────────timestamp_col─┬───date_col─┬────time_col─┬────────────timestamptz_col─┬─integer_col─┬─float_col─┬─decimal_col─┐
1. │ true        │     1000 │     456.78 │ Alice      │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │        1000 │    456.78 │      456.78 │
2. │ false       │     2000 │     456.78 │ Bob        │ 2024-01-01 13:00:00.000000 │ 2024-01-01 │ 43200000000 │ 2024-01-01 12:00:00.000000 │        2000 │    456.78 │      456.78 │
   └─────────────┴──────────┴────────────┴────────────┴────────────────────────────┴────────────┴─────────────┴────────────────────────────┴─────────────┴───────────┴─────────────┘

2 rows in set. Elapsed: 0.010 sec. 

What Happened

Step 1 — Baseline Query (MergeTree tables, passed)

The test first materialized both data sources into local MergeTree tables via
CREATE TABLE ... ENGINE = MergeTree ORDER BY tuple() AS SELECT * FROM <cluster_function>(...)
and ran the same RIGHT OUTER JOIN to establish the expected result. This produced 2 rows — one matched row for Bob and one matched row for Alice, with no NULL-left rows (every right-side row has a match on the left).

Expected output (2 rows):

(false,2000,456.78,'Bob','2024-01-01 12:00:00.000000','2024-01-01','1970-01-01 12:00:00.000000','2024-01-01 12:00:00.000000',2000,456.78,456.78,false,2000,456.78,'Bob','2024-01-01 13:00:00.000000','2024-01-01',43200000000,'2024-01-01 12:00:00.000000',2000,456.78,456.78),
(true,1000,456.78,'Alice','2024-01-01 12:00:00.000000','2024-01-01','1970-01-01 12:00:00.000000','2024-01-01 12:00:00.000000',1000,456.78,456.78,true,1000,456.78,'Alice','2024-01-01 13:00:00.000000','2024-01-01',43200000000,'2024-01-01 12:00:00.000000',1000,456.78,456.78)

Step 2 — Actual Query (cluster table functions, failed)

The same join was then executed using the s3Cluster and icebergS3Cluster table functions directly, with settings:

  • object_storage_cluster_join_mode='global'
  • object_storage_cluster='replicated_cluster'

This produced 6 rows — the 2 correct matched rows, plus 4 extra rows with NULL values in all left-side columns (2 spurious (NULL, ..., Bob) rows and 2 spurious (NULL, ..., Alice) rows).

Actual output (6 rows):

(false,2000,...,'Bob',...,false,2000,...,'Bob',...),
(true,1000,...,'Alice',...,true,1000,...,'Alice',...),
(NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,false,2000,...,'Bob',...),
(NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,false,2000,...,'Bob',...),
(NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,true,1000,...,'Alice',...),
(NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,true,1000,...,'Alice',...)

Assertion Failure

assert result.output == expected_result.output

The diff shows the first two rows match, followed by 4 extra NULL-left rows that should not exist.


Root Cause Analysis

The RIGHT OUTER JOIN contract guarantees that every right-side row appears exactly once in the result — matched against its corresponding left-side row(s) where matches exist, or padded with NULLs on the left where no match exists globally. The MergeTree baseline confirms this behavior: every right-side row matches a left-side row, so there are no NULL-left rows.

However, when the query is distributed across the cluster via s3Cluster/icebergS3Cluster with object_storage_cluster_join_mode='global', the "outer" side of the join is not evaluated globally:

  • Each node processes a partition of the data.
  • When a given right-side row does not find a match in the local left-side partition, the node emits a NULL-left row, treating it as "unmatched".
  • Those right-side rows do have matches on other nodes, but the distributed join executor does not deduplicate or cross-check matches across nodes before emitting the NULL-padded rows.
  • The final result therefore contains both the correctly matched rows and spurious NULL-left rows produced by nodes where the match did not happen locally.

This is a ClickHouse bug in the global join mode for cluster table functions: outer join semantics are applied per-node rather than globally, resulting in spurious NULL-padded rows for right-side rows that are globally matched but locally unmatched.

Why the Test's Non-Stable Handling Did Not Catch This

The test has logic (lines 335–468 in joins.py) that relaxes the output assertion for certain "non-stable" table-type combinations. However, all those exemptions check for object_storage_cluster_join_mode == "allow". This failure occurs with object_storage_cluster_join_mode == "global", which is not covered by the exemptions. The global mode exhibits the same non-stable behavior as allow for this table-type combination, but the test was not updated to account for it.


Relationship to Existing Issue

This bug is closely related to Altinity/ClickHouse#1661, which reports the same category of problem (distributed cluster joins producing incorrect row counts with object_storage_cluster_join_mode='global'), but for INNER ANY JOIN — where matched rows are duplicated. This failure is a different manifestation of the same underlying issue: the distributed join executor does not enforce global join semantics across cluster nodes.

Issue Join Type Symptom
#1661 INNER ANY JOIN Matched rows duplicated (6 rows instead of 2)
This failure RIGHT OUTER JOIN Matched rows correct, but 4 spurious NULL-left rows added (6 rows instead of 2)

Affected Table-Type Combination

Left Table Type Right Table Type Join Mode Join Clause Status
s3Cluster_table_function icebergS3Cluster_table_function allow any non-stable Handled (relaxed assertion)
s3Cluster_table_function icebergS3Cluster_table_function global RIGHT OUTER JOIN NOT handled — this failure

Possible Resolutions

Option A — ClickHouse Fix (Preferred)

File a bug (or extend #1661) against ClickHouse for RIGHT OUTER JOIN emitting spurious NULL-left rows when using cluster table functions with object_storage_cluster_join_mode='global'. The outer-join semantics must be enforced globally, not per-node — a right-side row should only produce a NULL-left row if it has no match on any node.

Option B — Test Workaround

Extend the non-stable result handling in check_join() to also cover object_storage_cluster_join_mode == "global" for the s3Cluster + icebergS3Cluster combination (and likely other cluster function pairs), relaxing the assertion to only check exitcode == 0.

Option C — Combined

File the ClickHouse bug and add the test relaxation with an xfail or similar marker referencing the upstream issue, so that the test suite is not blocked while waiting for a fix.


Reproduction

Input data (both s3Cluster and icebergS3Cluster return 8 rows each; every right-side row has a matching left-side row on (boolean_col, string_col)).

SELECT *
FROM s3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data3/data/**.parquet', '<user>', '<password>') AS t1
RIGHT OUTER JOIN icebergS3Cluster(replicated_cluster, 'http://minio:9000/warehouse/data1', '<user>', '<password>') AS t2
  ON t1.boolean_col = t2.boolean_col AND t1.string_col = t2.string_col
ORDER BY tuple(*)
SETTINGS object_storage_cluster_join_mode='global', object_storage_cluster='replicated_cluster'
FORMAT Values

Compare against the same query against MergeTree copies of the same data — the MergeTree version returns 2 rows (all matched), while the cluster version returns 6 rows (2 matched + 4 spurious NULL-left rows).

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions