Skip to content

[fix](csv reader) fix silent data corruption when enclose char is a prefix of multi-char column separator#62580

Open
sollhui wants to merge 1 commit intoapache:masterfrom
sollhui:fix_csv_reader_parse_error
Open

[fix](csv reader) fix silent data corruption when enclose char is a prefix of multi-char column separator#62580
sollhui wants to merge 1 commit intoapache:masterfrom
sollhui:fix_csv_reader_parse_error

Conversation

@sollhui
Copy link
Copy Markdown
Contributor

@sollhui sollhui commented Apr 17, 2026

Problem

When enclose is a single character that is also the first character of a multi-char column_separator (e.g. enclose=':', column_sep='::'), the CSV parser silently produces wrong data: it finds fewer column separators than actually exist and fills trailing columns with NULL.

Reproduction

Properties: column_separator = '::', enclose = ':'

File content:

id::name::remark
1:::alpha::beta:::ok
2:::plain:::tail

Expected:

id | name | remark -- | -- | -- 1 | alpha::beta | ok 2 | plain | tail

Root Cause

EncloseCsvLineReaderCtx::_on_pre_match_enclose uses a double-quote-escape heuristic: two consecutive enclose chars (::) inside an enclosed field are treated as an escaped enclose and scanning continues. This is correct in isolation, but breaks when the separator itself starts with the enclose char.

For the line 1:::alpha::beta:::ok, the closing sequence at positions 15–17 is:

pos:  15  16  17  18
char:  :   :   :   o
       ↑   ↑───┘
       │   separator '::'
       closing enclose

The state machine consumed positions 15–16 as a double-quote escape, then set _quote_escape=true at 17, and finally triggered MATCH_ENCLOSE at 18 ('o') — producing a bogus field boundary. Only one separator was recorded instead of two, so column 3 became NULL.

Fix

In _on_pre_match_enclose, when _quote_escape=true and the current character is again the enclose char, peek ahead column_sep_len bytes before deciding:

  • If the bytes at the current position equal the full column separator → the previous enclose char was the closing enclose; transition to MATCH_ENCLOSE so the separator is recorded correctly.
  • Otherwise → fall through to the existing double-quote escape path.

The check is guarded by _column_sep_len > 1 && _column_sep[0] == _enclose, so it only activates for this specific combination. All other formats (standard CSV sep=',' / enclose='"', multi-char separators that don't start with the enclose char, etc.) are completely unaffected.

Limitation

When enclose is a prefix of sep, fields whose content ends with the separator string itself (e.g. storing alpha:: with enclose=':', sep='::') are inherently ambiguous at the byte level — no algorithm can distinguish the two interpretations from the same byte sequence. This edge case was already broken before this fix and remains best-effort.

Tests

Added EncloseIsPrefixOfSeparator in be/test/format/file_reader/new_plain_text_line_reader_test.cpp:

  • 1:::alpha::beta:::ok with sep='::', enclose=':' → sep positions [1, 16] 
  • 2:::plain:::tail → sep positions [1, 10] 
  • Both rows together ✓

@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 17, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Apr 17, 2026

run buildall

Copy link
Copy Markdown

@utafrali utafrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is correct: by peeking ahead _column_sep_len bytes when _quote_escape=true and the current char equals _enclose, the parser correctly distinguishes a closing-enclose-then-separator sequence from a double-quote escape, addressing the silent data corruption. The logic is sound, the inline comments are clear, and the regression test covers the exact scenario from the bug report.

Comment thread be/src/format/file_reader/new_plain_text_line_reader.cpp Outdated
Comment thread be/test/format/file_reader/new_plain_text_line_reader_test.cpp
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

@sollhui sollhui force-pushed the fix_csv_reader_parse_error branch from e996641 to c9ae863 Compare April 17, 2026 07:35
@sollhui
Copy link
Copy Markdown
Contributor Author

sollhui commented Apr 17, 2026

run buildall

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants