[fix](csv reader) fix silent data corruption when enclose char is a prefix of multi-char column separator#62580
Open
sollhui wants to merge 1 commit intoapache:masterfrom
Open
[fix](csv reader) fix silent data corruption when enclose char is a prefix of multi-char column separator#62580sollhui wants to merge 1 commit intoapache:masterfrom
sollhui wants to merge 1 commit intoapache:masterfrom
Conversation
Contributor
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Contributor
Author
|
run buildall |
utafrali
approved these changes
Apr 17, 2026
utafrali
left a comment
There was a problem hiding this comment.
The fix is correct: by peeking ahead _column_sep_len bytes when _quote_escape=true and the current char equals _enclose, the parser correctly distinguishes a closing-enclose-then-separator sequence from a double-quote escape, addressing the silent data corruption. The logic is sound, the inline comments are clear, and the regression test covers the exact scenario from the bug report.
Contributor
|
PR approved by anyone and no changes requested. |
…r column separator
e996641 to
c9ae863
Compare
Contributor
Author
|
run buildall |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When
encloseis a single character that is also the first character of a multi-charcolumn_separator(e.g.enclose=':',column_sep='::'), the CSV parser silently produces wrong data: it finds fewer column separators than actually exist and fills trailing columns withNULL.Reproduction
Properties:
column_separator = '::',enclose = ':'File content:
Expected:
id | name | remark -- | -- | -- 1 | alpha::beta | ok 2 | plain | tailRoot Cause
EncloseCsvLineReaderCtx::_on_pre_match_encloseuses a double-quote-escape heuristic: two consecutive enclose chars (::) inside an enclosed field are treated as an escaped enclose and scanning continues. This is correct in isolation, but breaks when the separator itself starts with the enclose char.For the line
1:::alpha::beta:::ok, the closing sequence at positions 15–17 is:The state machine consumed positions 15–16 as a double-quote escape, then set
_quote_escape=trueat 17, and finally triggered MATCH_ENCLOSE at 18 ('o') — producing a bogus field boundary. Only one separator was recorded instead of two, so column 3 becameNULL.Fix
In
_on_pre_match_enclose, when_quote_escape=trueand the current character is again the enclose char, peek aheadcolumn_sep_lenbytes before deciding:MATCH_ENCLOSEso the separator is recorded correctly.The check is guarded by
_column_sep_len > 1 && _column_sep[0] == _enclose, so it only activates for this specific combination. All other formats (standard CSVsep=','/enclose='"', multi-char separators that don't start with the enclose char, etc.) are completely unaffected.Limitation
When
encloseis a prefix ofsep, fields whose content ends with the separator string itself (e.g. storingalpha::withenclose=':',sep='::') are inherently ambiguous at the byte level — no algorithm can distinguish the two interpretations from the same byte sequence. This edge case was already broken before this fix and remains best-effort.Tests
Added
EncloseIsPrefixOfSeparatorinbe/test/format/file_reader/new_plain_text_line_reader_test.cpp:1:::alpha::beta:::okwithsep='::',enclose=':'→ sep positions[1, 16]✓2:::plain:::tail→ sep positions[1, 10]✓