Skip to content

feat: Add RecordBatchLogReader for bounded log reading#446

Open
charlesdong1991 wants to merge 4 commits intoapache:mainfrom
charlesdong1991:arrow-batch-reader
Open

feat: Add RecordBatchLogReader for bounded log reading#446
charlesdong1991 wants to merge 4 commits intoapache:mainfrom
charlesdong1991:arrow-batch-reader

Conversation

@charlesdong1991
Copy link
Copy Markdown
Contributor

Purpose

Move query_latest_offsets and poll-until-offsets logic from Python binding into Rust core as RecordBatchLogReader.

This enables both Python and C++ bindings to share the same bounded-read implementation.

Linked issue: close #406

Tests

Tests are passed locally

API and Format

Documentation

arrow_schema: SchemaRef,
/// Serializes overlapping `poll` / `poll_batches` across clones sharing this `Arc`.
///
/// TODO: Consider an API that consumes
Copy link
Copy Markdown
Contributor Author

@charlesdong1991 charlesdong1991 Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is cheap to clone for this record batch log scanner, but all clones will share one Arc , so two overlapping poll is not supported under current usage model, i add a client-side guard with poll_session so overlapping calls can fail fast.

Not sure what you think, i am happy to create a new issue and do a follow-up on that, or if you prefer i can have a stricter API in this PR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it properly in this PR. The reader should take ownership of the scanner (move, not clone). That way the compiler prevents concurrent polls - no mutex needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Copy Markdown
Contributor

@fresh-borzoni fresh-borzoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesdong1991 Ty for the PR. Left comments, PTAL

Comment thread crates/fluss/src/client/table/reader.rs Outdated
/// The projected row type to use for record-based scanning
projected_row_type: fcore::metadata::RowType,
/// Cache for partition_id -> partition_name mapping (avoids repeated list_partition_infos calls)
partition_name_cache: std::sync::RwLock<Option<HashMap<i64, String>>>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have we removed this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it had no remaining caller after offset/poll loop moved to rust core, wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it had no remaining caller after offset/poll loop moved to rust core, wdyt?

Comment thread bindings/python/src/lib.rs Outdated
Comment thread bindings/python/src/table.rs Outdated
Comment thread crates/fluss/src/client/table/reader.rs
Comment thread crates/fluss/src/client/table/reader.rs
arrow_schema: SchemaRef,
/// Serializes overlapping `poll` / `poll_batches` across clones sharing this `Arc`.
///
/// TODO: Consider an API that consumes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it properly in this PR. The reader should take ownership of the scanner (move, not clone). That way the compiler prevents concurrent polls - no mutex needed.

@charlesdong1991
Copy link
Copy Markdown
Contributor Author

Hi @fresh-borzoni Sorry for late response, thanks for reviews. As i have been travelling without my laptop, i will come back to this in 2 weeks.
In the meantime, i will convert this to draft to avoid confusion. 🙏

@charlesdong1991 charlesdong1991 marked this pull request as draft March 30, 2026 18:10
@charlesdong1991 charlesdong1991 marked this pull request as ready for review April 18, 2026 15:17
@charlesdong1991
Copy link
Copy Markdown
Contributor Author

Thanks for your reviews, did some refactoring, PTAL @fresh-borzoni 🙏

@fresh-borzoni
Copy link
Copy Markdown
Contributor

fresh-borzoni commented Apr 19, 2026

@charlesdong1991 Ty for the PR, I looked briefly, looks good now, but let's wait until we decide if we want to move to fully async api for python polls, and then if it's the case - merge it first, rebase/resolve conflicts herr and I'll review one more time.

WDYT?

@charlesdong1991
Copy link
Copy Markdown
Contributor Author

let's wait until we decide if we want to move to fully async api for python polls, and then if it's the case - merge it first

oh, that's good to hear, let me take a look too to get some understanding!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NO to_arrow_batch_reader support in python binding

2 participants