feat: Add RecordBatchLogReader for bounded log reading#446
feat: Add RecordBatchLogReader for bounded log reading#446charlesdong1991 wants to merge 4 commits intoapache:mainfrom
Conversation
| arrow_schema: SchemaRef, | ||
| /// Serializes overlapping `poll` / `poll_batches` across clones sharing this `Arc`. | ||
| /// | ||
| /// TODO: Consider an API that consumes |
There was a problem hiding this comment.
it is cheap to clone for this record batch log scanner, but all clones will share one Arc , so two overlapping poll is not supported under current usage model, i add a client-side guard with poll_session so overlapping calls can fail fast.
Not sure what you think, i am happy to create a new issue and do a follow-up on that, or if you prefer i can have a stricter API in this PR?
There was a problem hiding this comment.
Let's do it properly in this PR. The reader should take ownership of the scanner (move, not clone). That way the compiler prevents concurrent polls - no mutex needed.
fresh-borzoni
left a comment
There was a problem hiding this comment.
@charlesdong1991 Ty for the PR. Left comments, PTAL
| /// The projected row type to use for record-based scanning | ||
| projected_row_type: fcore::metadata::RowType, | ||
| /// Cache for partition_id -> partition_name mapping (avoids repeated list_partition_infos calls) | ||
| partition_name_cache: std::sync::RwLock<Option<HashMap<i64, String>>>, |
There was a problem hiding this comment.
Why have we removed this?
There was a problem hiding this comment.
since it had no remaining caller after offset/poll loop moved to rust core, wdyt?
There was a problem hiding this comment.
since it had no remaining caller after offset/poll loop moved to rust core, wdyt?
| arrow_schema: SchemaRef, | ||
| /// Serializes overlapping `poll` / `poll_batches` across clones sharing this `Arc`. | ||
| /// | ||
| /// TODO: Consider an API that consumes |
There was a problem hiding this comment.
Let's do it properly in this PR. The reader should take ownership of the scanner (move, not clone). That way the compiler prevents concurrent polls - no mutex needed.
|
Hi @fresh-borzoni Sorry for late response, thanks for reviews. As i have been travelling without my laptop, i will come back to this in 2 weeks. |
1502702 to
23f666d
Compare
|
Thanks for your reviews, did some refactoring, PTAL @fresh-borzoni 🙏 |
|
@charlesdong1991 Ty for the PR, I looked briefly, looks good now, but let's wait until we decide if we want to move to fully async api for python polls, and then if it's the case - merge it first, rebase/resolve conflicts herr and I'll review one more time. WDYT? |
oh, that's good to hear, let me take a look too to get some understanding! |
Purpose
Move query_latest_offsets and poll-until-offsets logic from Python binding into Rust core as RecordBatchLogReader.
This enables both Python and C++ bindings to share the same bounded-read implementation.
Linked issue: close #406
Tests
Tests are passed locally
API and Format
Documentation