
Align vision VQA eval: dynamic choice detection, configurable max_length, robust error handling #2499

Merged
jiafatom merged 7 commits into main from jiafa/align-vision-eval-1based on Jun 5, 2026

@jiafatom (Contributor) commented Jun 4, 2026

Description

Aligns the Olive JSON-based vision evaluation (vision_vqa_pre_process) with the standalone eval.py scripts in olive-recipes, and adds robustness improvements.

Problem

The JSON eval was reporting ~66% accuracy on AI2D while eval.py reported ~76% on the same CUDA model. The gap was caused by:

  1. 0-based vs 1-based option numbering: options were presented as 0. opt, 1. opt, ... but VLMs are typically prompted with 1-based numbering (1. opt, 2. opt, ...)
  2. Overly strict output parsing: re.match(r"^(\d+)", pred) only matched a leading digit, missing valid responses like "The answer is 2"
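
The two causes above can be reproduced in isolation. The following is a minimal sketch, not the Olive code: `format_options` is a hypothetical helper, and the option strings and prediction are made-up illustration data.

```python
import re


def format_options(options, one_based=True):
    """Render options as numbered lines (hypothetical helper for illustration)."""
    start = 1 if one_based else 0
    return "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=start))


options = ["roots", "stem", "leaf", "flower"]  # made-up example options

# The dataset stores a 0-based index; the label shown to the model is 1-based.
gt_index_zero_based = 2
gt_label = str(gt_index_zero_based + 1)

prompt_options = format_options(options)  # "1. roots\n2. stem\n..."

pred = "The answer is 3"
strict = re.match(r"^(\d+)", pred)         # old behaviour: leading digit only
lenient = re.search(r"\b([1-4])\b", pred)  # search anywhere in the response
```

With the strict pattern the response is treated as unparseable even though it names the correct option; the lenient pattern recovers the digit.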

Changes

  • olive/data/component/pre_process_data.py:
    • Use 1-based numbering for options and convert 0-based ground truth index to 1-based
    • Pass num_choices (actual count of options) instead of a boolean flag
    • Add configurable max_length parameter (default 4096), passable from JSON data config
  • olive/evaluator/olive_evaluator.py:
    • Build regex dynamically as r"\b([1-{num_choices}])\b" instead of hardcoded [1-4]
    • Only enable digit extraction when num_choices is 1-9 (single-digit range)
    • Read max_length from data config, falling back to 4096 default
    • Wrap entire per-image block in try/except so corrupt images log a warning with sample index instead of aborting the run
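
As a rough sketch of the dynamic pattern described in the evaluator changes: `build_choice_pattern` and `extract_choice` are hypothetical names, and returning the stripped raw text when nothing matches is an assumption, since the PR only says there is a fallback.

```python
import re


def build_choice_pattern(num_choices):
    """Return a pattern for a standalone digit in 1..num_choices, else None."""
    # A character class such as [1-4] only works for single-digit ranges.
    if not isinstance(num_choices, int) or not 1 <= num_choices <= 9:
        return None
    return re.compile(rf"\b([1-{num_choices}])\b")


def extract_choice(pred, num_choices):
    """Extract the chosen option number from a free-form model response."""
    pattern = build_choice_pattern(num_choices)
    if pattern is None:
        return pred.strip()  # digit extraction disabled
    match = pattern.search(pred)
    # Assumed fallback: keep the raw text when no valid digit is found.
    return match.group(1) if match else pred.strip()
```

The word boundaries keep a multi-digit number such as "12" from being read as option 1 or 2, and a digit outside the valid range is not extracted.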

Testing

Validated on AI2D (3088 samples) with Qwen2.5-VL-3B-Instruct CUDA model — results now align with eval.py (~76% accuracy).

Copilot AI review requested due to automatic review settings June 4, 2026 18:10

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread olive/data/component/pre_process_data.py Fixed
@jiafatom jiafatom force-pushed the jiafa/align-vision-eval-1based branch from 1719469 to 96fc4c3 Compare June 4, 2026 18:16
Comment thread olive/data/component/pre_process_data.py Fixed
@jiafatom jiafatom changed the title from "Align vision VQA eval with 1-based option numbering and lenient parsing" to "Align vision VQA eval: dynamic choice detection, configurable max_length, robust error handling" Jun 5, 2026
jiafatom and others added 6 commits June 5, 2026 16:31
- Change option numbering from 0-based to 1-based (1, 2, 3, 4) in
  vision_vqa_pre_process to match how VLMs are typically prompted
- Convert ground-truth answer index from 0-based to 1-based accordingly
- Update extract_number regex from leading-only (^\d+) to search-anywhere
  (\b[1-4]\b) with fallback, matching eval.py's lenient parsing behavior

This aligns the Olive JSON-based evaluation with the standalone eval.py
script, fixing a ~10pp accuracy gap caused by 0-based numbering confusion
and overly strict output parsing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Pass num_choices (len of options) from pre_process instead of boolean extract_number
- Evaluator builds regex pattern dynamically: r'\b([1-{num_choices}])\b'
- Only enables digit extraction when num_choices is 1-9 (single-digit range)
- Add explanatory comment to empty except clause (CodeQL fix)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add max_length parameter to vision_vqa_pre_process (default 4096)
- Pass max_length through VisionVQADataset to evaluator via input_dict
- Evaluator reads per-sample max_length, falling back to 4096 default

Users can now override max_length in their JSON data config:
  "pre_process_args": {"max_length": 8192, ...}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
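
A minimal sketch of the per-sample fallback described in this commit, assuming the value arrives under a `max_length` key in the sample's input dict. `resolve_max_length` and `DEFAULT_MAX_LENGTH` are hypothetical names, not Olive identifiers.

```python
DEFAULT_MAX_LENGTH = 4096  # default stated in the PR description


def resolve_max_length(input_dict):
    """Return the per-sample max_length, falling back to the default."""
    value = input_dict.get("max_length")
    return int(value) if value is not None else DEFAULT_MAX_LENGTH


# Shape of the override in a data config, mirroring the fragment in the commit message.
data_config_fragment = {"pre_process_args": {"max_length": 8192}}

overridden = resolve_max_length(data_config_fragment["pre_process_args"])
defaulted = resolve_max_length({})
```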
Corrupt or broken images could crash during Image.open, pil_image.save,
og.Images.open, or processor() — not just during generation. Widen the
try/except to catch any Exception so a single bad sample logs a warning
and continues evaluation instead of aborting the entire run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
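
The widened try/except might look roughly like the loop below. This is a sketch under assumptions: `evaluate_samples` and `run_one` are hypothetical stand-ins for the evaluator's per-image block, and excluding skipped samples from the accuracy denominator is an illustrative choice the PR does not specify.

```python
import logging

logger = logging.getLogger(__name__)


def evaluate_samples(samples, run_one):
    """Score samples one by one, skipping any sample that raises."""
    correct = 0
    evaluated = 0
    skipped = []
    for index, sample in enumerate(samples):
        try:
            # run_one stands in for image loading, preprocessing and generation.
            prediction, target = run_one(sample)
        except Exception as exc:  # broad on purpose: any failure skips the sample
            logger.warning("Skipping sample %d: %s", index, exc)
            skipped.append(index)
            continue
        evaluated += 1
        correct += int(prediction == target)
    # Assumption: skipped samples are excluded from the denominator.
    accuracy = correct / evaluated if evaluated else 0.0
    return accuracy, skipped
```

Logging the sample index makes it possible to locate the bad image afterwards without rerunning the whole evaluation.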
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute num_choices directly where options are validated, removing the
separate has_options flag and the potentially-uninitialized 'options'
reference that CodeQL flagged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
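
One way to picture this refactor: a single count replaces the flag plus a later reference to `options`. This is a sketch only; `count_choices` and the `options` key are assumptions for illustration, not the actual Olive code.

```python
def count_choices(sample):
    """Return the number of options, or 0 when the sample has no usable options."""
    options = sample.get("options")
    if isinstance(options, (list, tuple)) and len(options) > 0:
        return len(options)
    return 0
```

Because the count is computed in the same place the options are validated, downstream code never needs to touch a variable that may not have been assigned.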
@jiafatom jiafatom force-pushed the jiafa/align-vision-eval-1based branch from a4d4cf3 to 3bd8d05 Compare June 5, 2026 16:31
Comment thread olive/evaluator/olive_evaluator.py Fixed
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom jiafatom merged commit c10e5bc into main Jun 5, 2026
13 checks passed
@jiafatom jiafatom deleted the jiafa/align-vision-eval-1based branch June 5, 2026 19:59