
test: Accumulate CPU half-precision sums in float32 #3488

Open
sofinvalery wants to merge 1 commit into ml-explore:main from sofinvalery:sum-float-accum

Conversation

@sofinvalery (Contributor)

Proposed changes

Accumulate CPU float16 and bfloat16 sum reductions in float32, while preserving the output dtype. This fixes precision loss in ops that use sum().

Fixes #3326.
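
For context, here is a minimal sketch of the precision loss in question (illustrative; the exact pre-fix result depends on the backend's reduction order):

```python
import mlx.core as mx

# float16 has a 10-bit mantissa: once a running float16 sum reaches
# 2048, its spacing (ULP) is 2.0, so adding 1.0 no longer changes it.
# A strictly sequential float16 accumulator would stall at 2048 here;
# accumulating in float32 recovers the exact answer.
x = mx.ones(4096, dtype=mx.float16)
print(x.sum())  # expected 4096, with the output dtype still float16
```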

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@sofinvalery (Contributor, Author)

Made it inline. Feels cleaner.

@zcbenz (Collaborator)


Allocating a new array would carry a heavy performance penalty; the correct way would be to refactor strided_reduce/contiguous_reduce to accumulate in float32 rather than in the output type.

@sofinvalery (Contributor, Author)

Refactored strided_reduce/contiguous_reduce to support accumulation in a separate type.
float16/bfloat16 sums now accumulate in float32 without allocating a temporary array.
Also added tests covering reductions over the full array, contiguous axes, and strided axes.
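
Illustratively, the three reduction paths those tests target (hypothetical shapes; the mapping of axes to kernel paths is my assumption, not taken from the diff):

```python
import mlx.core as mx

x = mx.random.uniform(shape=(7, 33, 129)).astype(mx.float16)

full = x.sum()          # reduce over every element
inner = x.sum(axis=2)   # innermost axis, contiguous in memory
outer = x.sum(axis=0)   # outer axis, strided access pattern
```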

Review comment threads on mlx/backend/cpu/reduce.cpp (two marked outdated)
@angeloskath (Member)

I don't think we should merge this, to be honest. It is expected for floats that you get the accumulation precision of the dtype you are working in.

I am not against choosing higher-precision temporaries internally where possible, but given the current implementation it is actually faster to cast to float32 and then sum (I haven't tried it, but it almost certainly will be).
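
At the Python level, the cast-then-sum route described here would look like the following (a sketch; note it materializes the float32 temporary that zcbenz flagged earlier):

```python
import mlx.core as mx

x = mx.random.uniform(shape=(1 << 20,)).astype(mx.float16)

# Cast up, reduce in float32, then cast the scalar result back down.
y = x.astype(mx.float32).sum().astype(mx.float16)
```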

For the test, by the way, the correct fix is to simply upcast the random numbers before computing their average; we want to test random, after all, not sum.
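
A sketch of that test fix, under assumed names and tolerances (the actual test in the MLX suite may differ):

```python
import mlx.core as mx

key = mx.random.key(0)
samples = mx.random.uniform(shape=(100_000,), dtype=mx.float16, key=key)

# Upcast before averaging so the assertion exercises the random
# number generator, not the precision of a half-precision sum.
mean = samples.astype(mx.float32).mean().item()
assert abs(mean - 0.5) < 0.01
```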

@sofinvalery (Contributor, Author)

Updated the test only. Thanks for the feedback! @zcbenz @angeloskath

@zcbenz (Collaborator)


Thanks for fixing the test! Can you also update .github/actions/test-windows/action.yml to enable the test in CI?

@zcbenz changed the title from "Accumulate CPU half-precision sums in float32" to "test: Accumulate CPU half-precision sums in float32" on May 9, 2026


Development

Successfully merging this pull request may close these issues:

  • "test random uniform" fails for CPU backend

3 participants