Align vision VQA eval: dynamic choice detection, configurable max_length, robust error handling by jiafatom · Pull Request #2499 · microsoft/Olive

jiafatom · 2026-06-04T18:10:29Z

Description

Aligns the Olive JSON-based vision evaluation (vision_vqa_pre_process) with the standalone eval.py scripts in olive-recipes, and adds robustness improvements.

Problem

The JSON eval was reporting ~66% accuracy on AI2D while eval.py reported ~76% on the same CUDA model. The gap was caused by:

- 0-based vs 1-based option numbering: options were presented as 0. opt, 1. opt... but VLMs prefer 1-based (1. opt, 2. opt...)
- Overly strict output parsing: re.match(r"^(\d+)", pred) only matched a leading digit, missing valid responses like "The answer is 2"
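To make the parsing gap concrete, here is a minimal sketch comparing the leading-digit match with a search-anywhere match. The function names `parse_strict` and `parse_lenient` are hypothetical and only illustrate the two regex styles described above; they are not the actual Olive or eval.py code.

```python
import re


def parse_strict(pred: str):
    """Old behavior (sketch): accept only a digit at the very start."""
    m = re.match(r"^(\d+)", pred.strip())
    return int(m.group(1)) if m else None


def parse_lenient(pred: str):
    """Lenient behavior (sketch): find a standalone digit 1-4 anywhere."""
    m = re.search(r"\b([1-4])\b", pred)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    for text in ["2", "2. cat", "The answer is 2"]:
        print(repr(text), parse_strict(text), parse_lenient(text))
```

With the strict parser, a response such as "The answer is 2" yields no prediction and is scored as wrong even though the model chose a valid option.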

Changes

olive/data/component/pre_process_data.py:
- Use 1-based numbering for options and convert 0-based ground truth index to 1-based
- Pass num_choices (actual count of options) instead of a boolean flag
- Add configurable max_length parameter (default 4096), passable from JSON data config
olive/evaluator/olive_evaluator.py:
- Build regex dynamically as r"\b([1-{num_choices}])\b" instead of hardcoded [1-4]
- Only enable digit extraction when num_choices is 1-9 (single-digit range)
- Read max_length from data config, falling back to 4096 default
- Wrap entire per-image block in try/except so corrupt images log a warning with sample index instead of aborting the run
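The pre-processing change can be sketched roughly as follows. This is an illustration only: `build_vqa_prompt` and its return shape are assumptions, not the real signature of vision_vqa_pre_process.

```python
def build_vqa_prompt(question: str, options: list, answer_index: int):
    """Hypothetical helper: number options from 1 and shift the label to match.

    answer_index is the dataset's 0-based ground-truth index.
    Returns (prompt, one_based_label, num_choices).
    """
    # enumerate(start=1) gives "1. opt", "2. opt", ... instead of "0. opt".
    option_lines = [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    prompt = question + "\n" + "\n".join(option_lines)
    # The label must move with the numbering, otherwise every answer is off by one.
    one_based_label = answer_index + 1
    # Pass the real option count downstream rather than a boolean flag.
    return prompt, one_based_label, len(options)


if __name__ == "__main__":
    prompt, label, n = build_vqa_prompt(
        "Which organism is a producer?", ["fox", "grass", "hawk"], 1
    )
    print(prompt)
    print("label:", label, "num_choices:", n)
```

Returning the option count lets the evaluator decide for itself whether single-digit extraction applies.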

Testing

Validated on AI2D (3088 samples) with Qwen2.5-VL-3B-Instruct CUDA model — results now align with eval.py (~76% accuracy).

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

- Change option numbering from 0-based to 1-based (1, 2, 3, 4) in vision_vqa_pre_process to match how VLMs are typically prompted
- Convert ground-truth answer index from 0-based to 1-based accordingly
- Update extract_number regex from leading-only (^\d+) to search-anywhere (\b[1-4]\b) with fallback, matching eval.py's lenient parsing behavior

This aligns the Olive JSON-based evaluation with the standalone eval.py script, fixing a ~10pp accuracy gap caused by 0-based numbering confusion and overly strict output parsing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Pass num_choices (len of options) from pre_process instead of boolean extract_number
- Evaluator builds regex pattern dynamically: r'\b([1-{num_choices}])\b'
- Only enables digit extraction when num_choices is 1-9 (single-digit range)
- Add explanatory comment to empty except clause (CodeQL fix)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
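A minimal sketch of the dynamic pattern described in this commit is shown below. The helper names are hypothetical, and the fallback of returning the stripped raw prediction when nothing matches is an assumption; the commit message does not specify what the evaluator does in that case.

```python
import re


def build_choice_pattern(num_choices: int):
    """Return a compiled pattern for digits 1..num_choices, or None.

    A character class like [1-N] only works for single digits, so
    extraction is disabled outside the 1-9 range.
    """
    if not 1 <= num_choices <= 9:
        return None
    return re.compile(rf"\b([1-{num_choices}])\b")


def extract_choice(pred: str, num_choices: int) -> str:
    """Pull the first standalone in-range digit out of a model response."""
    pattern = build_choice_pattern(num_choices)
    if pattern is None:
        return pred.strip()
    m = pattern.search(pred)
    # Assumed fallback: keep the raw text so the comparison simply fails.
    return m.group(1) if m else pred.strip()


if __name__ == "__main__":
    print(extract_choice("The answer is 2", 4))
    print(extract_choice("I pick 5", 4))
    print(build_choice_pattern(12))
```

The word boundaries keep a digit inside a longer number, such as the 1 in "12", from being read as a choice.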

- Add max_length parameter to vision_vqa_pre_process (default 4096)
- Pass max_length through VisionVQADataset to evaluator via input_dict
- Evaluator reads per-sample max_length, falling back to 4096 default

Users can now override max_length in their JSON data config: "pre_process_args": {"max_length": 8192, ...}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
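The fallback described here might look something like the sketch below. `resolve_max_length` is a hypothetical helper, and the plain dicts stand in for whatever per-sample structure the dataset actually passes to the evaluator; only the max_length key and the 4096 default come from the commit message.

```python
DEFAULT_MAX_LENGTH = 4096  # default stated in the commit message


def resolve_max_length(input_dict: dict) -> int:
    """Read a per-sample max_length, falling back to the default."""
    value = input_dict.get("max_length")
    return DEFAULT_MAX_LENGTH if value is None else int(value)


if __name__ == "__main__":
    # Sample produced with "pre_process_args": {"max_length": 8192}
    print(resolve_max_length({"prompt": "...", "max_length": 8192}))
    # Sample from a config that never set max_length
    print(resolve_max_length({"prompt": "..."}))
```

Keeping the default on the evaluator side means existing data configs continue to work unchanged.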

Corrupt or broken images could crash during Image.open, pil_image.save, og.Images.open, or processor(), not just during generation. Widen the try/except to catch any Exception so a single bad sample logs a warning and continues evaluation instead of aborting the entire run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
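The skip-and-continue pattern can be sketched in isolation as follows. `evaluate_samples` and `process_sample` are hypothetical stand-ins; in the real evaluator the body of the try block would cover image loading, processing, and generation.

```python
import logging

logger = logging.getLogger(__name__)


def evaluate_samples(samples, process_sample):
    """Run process_sample on each item, skipping any that raise.

    Returns (results, skipped_indices) so callers can see what was dropped.
    """
    results = []
    skipped = []
    for idx, sample in enumerate(samples):
        try:
            # Everything that touches the image belongs inside the try,
            # since decoding can fail well before generation starts.
            results.append(process_sample(sample))
        except Exception as exc:  # deliberately broad: one bad sample must not abort the run
            logger.warning("Skipping sample %d: %s", idx, exc)
            skipped.append(idx)
    return results, skipped


if __name__ == "__main__":
    def fake_process(sample):
        if sample == "corrupt":
            raise OSError("cannot identify image file")
        return sample.upper()

    print(evaluate_samples(["a", "corrupt", "b"], fake_process))
```

Logging the sample index makes it possible to locate and inspect the offending image afterwards.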


Compute num_choices directly where options are validated, removing the separate has_options flag and the potentially-uninitialized 'options' reference that CodeQL flagged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Copilot AI review requested due to automatic review settings June 4, 2026 18:10

Copilot started reviewing on behalf of jiafatom June 4, 2026 18:10

Copilot AI reviewed Jun 4, 2026


github-advanced-security AI found potential problems Jun 4, 2026


Comment thread olive/data/component/pre_process_data.py Fixed

jiafatom force-pushed the jiafa/align-vision-eval-1based branch from 1719469 to 96fc4c3 Compare June 4, 2026 18:16

github-advanced-security AI found potential problems Jun 5, 2026


Comment thread olive/data/component/pre_process_data.py Fixed

jiafatom changed the title ~~Align vision VQA eval with 1-based option numbering and lenient parsing~~ Align vision VQA eval: dynamic choice detection, configurable max_length, robust error handling Jun 5, 2026

jiafatom and others added 6 commits June 5, 2026 16:31

Log sample index when image processing fails

6e3aae1

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom force-pushed the jiafa/align-vision-eval-1based branch from a4d4cf3 to 3bd8d05 Compare June 5, 2026 16:31

github-advanced-security AI found potential problems Jun 5, 2026


Comment thread olive/evaluator/olive_evaluator.py Fixed

Fix lint: format __init__ args and use set comprehension

d42c77c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

xiaoyu-work approved these changes Jun 5, 2026


jiafatom merged commit c10e5bc into main Jun 5, 2026
13 checks passed

jiafatom deleted the jiafa/align-vision-eval-1based branch June 5, 2026 19:59

jiafatom merged 7 commits into main from jiafa/align-vision-eval-1based
