Skip to content

tphakala/simd

Repository files navigation

simd

Go Reference Go Report Card License: MIT

A high-performance SIMD (Single Instruction, Multiple Data) library for Go providing vectorized operations on float64, float32, float16, int32, complex128, and complex64 slices.

Features

  • Pure Go assembly - Native Go assembler, simple cross-compilation
  • Runtime CPU detection - Automatically selects optimal implementation (AVX-512, AVX+FMA, AVX without FMA, SSE4.1, NEON, NEON+FP16, or pure Go)
  • Zero allocations - All operations work on pre-allocated slices
  • 80+ operations - Arithmetic, reduction, statistical, vector, signal processing, activation functions, and complex number operations
  • Multi-architecture - AMD64 (AVX-512/AVX+FMA/SSE4.1) and ARM64 (NEON/NEON+FP16) with pure Go fallback
  • Half-precision support - Native FP16 SIMD on ARM64 with FP16 extension (Apple Silicon, Cortex-A55+)
  • Thread-safe - All functions are safe for concurrent use

Installation

go get github.com/tphakala/simd

Requires Go 1.25+

Quick Start

package main

import (
    "fmt"
    "github.com/tphakala/simd/cpu"
    "github.com/tphakala/simd/f64"
)

func main() {
    fmt.Println("SIMD:", cpu.Info())

    // Vector operations
    a := []float64{1, 2, 3, 4, 5, 6, 7, 8}
    b := []float64{8, 7, 6, 5, 4, 3, 2, 1}

    // Dot product
    dot := f64.DotProduct(a, b)
    fmt.Println("Dot product:", dot) // 120

    // Element-wise operations
    dst := make([]float64, len(a))
    f64.Add(dst, a, b)
    fmt.Println("Sum:", dst) // [9, 9, 9, 9, 9, 9, 9, 9]

    // Statistical operations
    mean := f64.Mean(a)
    stddev := f64.StdDev(a)
    fmt.Printf("Mean: %.2f, StdDev: %.2f\n", mean, stddev)

    // Vector operations
    f64.Normalize(dst, a)
    fmt.Println("Normalized:", dst)

    // Distance calculation
    dist := f64.EuclideanDistance(a, b)
    fmt.Println("Distance:", dist)
}

Packages

cpu - CPU Feature Detection

import "github.com/tphakala/simd/cpu"

fmt.Println(cpu.Info())        // "AMD64 AVX-512", "AMD64 AVX+FMA", "AMD64 AVX", "AMD64 SSE2", "AMD64 (scalar)", "ARM64 NEON+FP16", or "ARM64 NEON"
fmt.Println(cpu.HasAVX())      // true/false
fmt.Println(cpu.HasAVX2())     // true/false
fmt.Println(cpu.HasFMA())      // true/false
fmt.Println(cpu.HasAVX512VL()) // true/false (AVX-512 F+VL)
fmt.Println(cpu.HasNEON())     // true/false
fmt.Println(cpu.HasFP16())     // true/false (ARM64 half-precision SIMD)

f64 - float64 Operations

Category Function Description SIMD Width
Arithmetic Add(dst, a, b) Element-wise addition 8x (AVX-512) / 4x (AVX) / 2x (NEON)
Sub(dst, a, b) Element-wise subtraction 8x / 4x / 2x
Mul(dst, a, b) Element-wise multiplication 8x / 4x / 2x
Div(dst, a, b) Element-wise division 8x / 4x / 2x
Scale(dst, a, s) Multiply by scalar 8x / 4x / 2x
AddScalar(dst, a, s) Add scalar 8x / 4x / 2x
SubFromScalar(dst, a, s) Scalar minus vector 8x / 4x / 2x (composed SIMD)
FMA(dst, a, b, c) Fused multiply-add: a*b+c 8x / 4x / 2x
AddScaled(dst, alpha, s) dst += alpha*s (axpy) 8x / 4x / 2x
Unary Abs(dst, a) Absolute value 8x / 4x / 2x
Neg(dst, a) Negation 8x / 4x / 2x
Sqrt(dst, a) Square root 8x / 4x / 2x
Reciprocal(dst, a) Reciprocal (1/x) 8x / 4x / 2x
Round(dst, src) Round half away from zero 4x (AVX) / 2x (NEON) / Go fallback
Reduction DotProduct(a, b) Dot product 8x / 4x / 2x
WeightedSum(w, src) Weighted sum Σ(wᵢ·srcᵢ) 8x / 4x / 2x
SumOfSquares(src) Sum of squares Σ(srcᵢ²) 8x / 4x / 2x
Sum(a) Sum of elements 8x / 4x / 2x
Min(a) Minimum value 8x / 4x / 2x
Max(a) Maximum value 8x / 4x / 2x
MinIdx(a) Index of minimum value Pure Go
MaxIdx(a) Index of maximum value Pure Go
Statistical Mean(a) Arithmetic mean 8x / 4x / 2x
Variance(a) Population variance 8x / 4x / 2x
StdDev(a) Standard deviation 8x / 4x / 2x
Vector EuclideanDistance(a, b) L2 distance 8x / 4x / 2x
Normalize(dst, a) Unit vector normalization 8x / 4x / 2x
CumulativeSum(dst, a) Running sum Sequential
Range Clamp(dst, a, min, max) Clamp to range 8x / 4x / 2x
Activation Sigmoid(dst, src) Sigmoid: 1/(1+e^-x) 4x (AVX2) / 2x (NEON)
ReLU(dst, src) Rectified Linear Unit 4x (AVX) / 2x (NEON)
Tanh(dst, src) Hyperbolic tangent 4x (AVX2) / 2x (NEON)
Exp(dst, src) Exponential e^x 4x (AVX2) / 2x (NEON)
ClampScale(dst, src, min, max, s) Fused clamp and scale 4x (AVX) / 2x (NEON)
Batch DotProductBatch(r, rows, v) Multiple dot products 8x / 4x / 2x
Signal ConvolveValid(dst, sig, k) FIR filter / convolution 8x / 4x / 2x
ConvolveValidMulti(dsts, sig, ks) Multi-kernel convolution 8x / 4x / 2x
ConvolveDecimate(dst,sig,k,f,p) Strided FIR downsample (decimate) 8x / 4x / 2x
AccumulateAdd(dst, src, off) Overlap-add: dst[off:] += src 8x / 4x / 2x
Audio Interleave2(dst, a, b) Pack stereo: [L,R,L,R,...] 4x / 2x
Deinterleave2(a, b, src) Unpack stereo to channels 4x / 2x
InterleaveN(dst, srcs) Pack N planar streams (any N; N-stream Interleave2) N=2,4,8 AVX, N=3,6 AVX2 / N=2,3,4 NEON; else Go
DeinterleaveN(dsts, src) Unpack N interleaved streams (any N) N=2,4,8 AVX, N=3,6 AVX2 / N=2,3,4 NEON; else Go
CubicInterpDot(hist,a,b,c,d,x) Fused cubic interp dot product 4x / 2x
Int32ToFloat32Scale(dst,src,s) PCM int32 to normalized float 8x / 4x
Int16ToFloat32Scale(dst,src,s) PCM int16 to normalized float 8x (AVX2) / 4x (NEON)
Float32ToInt16Scale(dst,src,s) Normalized float to PCM int16 8x (AVX2) / 4x (NEON)

DotProductBatch scores its [][]float64 rows in groups of four, keeping the query vector resident in registers across each group via a fused 4-row kernel on AMD64 (AVX-512 and AVX+FMA) instead of re-loading it per row. Short, ragged, or sub-SIMD-width rows fall back to the per-row dot product, with identical results.

f32 - float32 Operations

Same API as f64 but for float32 with wider SIMD:

Architecture SIMD Width
AMD64 (AVX-512) 16x float32
AMD64 (AVX+FMA) 8x float32
AMD64 (SSE2) 4x float32
ARM64 (NEON) 4x float32

InterleaveN/DeinterleaveN add an 8-stream AVX path (8x8 register transpose) and a 3-stream AVX2 path (per-stream VPERMPS gathers merged with VPBLENDD, since 3 streams do not map onto a clean register transpose) on top of the shared N=2/4 (AVX) and N=2/3/4 (NEON) kernels; all other stream counts use the allocation-free generic path. The N=3 case is the 16k -> 48k upsample hot path: the AVX2 gather/blend kernel runs roughly 2.8x (interleave) and 3.2x (deinterleave) over the generic loop on AVX2. The 6-stream AVX2 path (the 8k -> 48k upsample) zips stream pairs into three double-wide pair streams, then reuses the f64 N=3 interleave on those pairs, so it needs no index tables; it runs roughly 2x (interleave and deinterleave) on AVX2. f64 adds N=3 and N=6 (AVX2) plus N=8 (AVX), processing 4 frames per block (a YMM holds 4 doubles): N=3 uses immediate VPERMPD gathers merged with VBLENDPD, N=6 zips pairs at 128-bit-lane granularity with VPERM2F128 (roughly 4x interleave, 1.5x deinterleave), and N=8 runs two stacked 4x4 transposes (streams 0-3 fill each frame's low YMM, streams 4-7 the high YMM).

Row-major batch dot products (for flat vector stores):

Function Description
DotProductIndexed(dst, base, query, rowIDs, dims) bool Scores selected row-major rows by uint32 row ID without building [][]float32; returns whether an optimized SIMD batch kernel handled at least one batch.
DotProductStrided(dst, base, query, rowCount, dims, stride) bool Scores contiguous or fixed-stride row-major rows; returns whether an optimized SIMD batch kernel handled at least one batch.

Both APIs are allocation-free. The batched SIMD kernel covers AMD64 (AVX-512 / AVX+FMA) and ARM64 (NEON); unsupported CPUs, tiny shapes, tails, and ragged inputs use the per-row fallback.

DotProductBatch scores its [][]float32 rows in groups of four, keeping the query vector resident in registers across each group instead of re-loading it for every row. The fused 4-row kernel runs on AVX-512, AVX+FMA, and ARM64 NEON; short, ragged, or sub-SIMD-width rows fall back to the per-row dot product. Results are identical to the per-row path either way.

Additional split-format complex operations (for FFT pipelines with separate real/imag arrays):

Category Function Description SIMD Width
Complex MulComplex(dstRe,dstIm,aRe,aIm,bRe,bIm) Split-format complex multiply 8x (AVX) / 4x (NEON)
MulConjComplex(dstRe,dstIm,aRe,aIm,bRe,bIm) Multiply by conjugate 8x / 4x
AbsSqComplex(dst,aRe,aIm) Magnitude squared 8x / 4x
ButterflyComplex(uRe,uIm,lRe,lIm,twRe,twIm) FFT butterfly with twiddle 8x / 4x
RealFFTUnpack(outRe,outIm,zRe,zIm,twRe,twIm) Real FFT unpack step 8x / 4x
Utility Reverse(dst, src) Reverse slice order 8x / 4x
AddSub(sum, diff, a, b) Fused sum and difference 8x / 4x

f16 - float16 (Half-Precision) Operations

IEEE 754 half-precision floating-point operations, optimized for ML inference, audio DSP, and memory-bandwidth-bound workloads.

import "github.com/tphakala/simd/f16"

// Convert between float32 and float16
h := f16.FromFloat32(3.14)
f := f16.ToFloat32(h)

// Vector operations (same API as f32/f64)
a := make([]f16.Float16, 1024)
b := make([]f16.Float16, 1024)
dst := make([]f16.Float16, 1024)

f16.Add(dst, a, b)           // Element-wise addition
dot := f16.DotProduct(a, b)  // Dot product (returns float32)
f16.ReLU(dst, a)             // Activation functions
Category Function Description SIMD Width
Conversion ToFloat32(h) FP16 → float32 Scalar
FromFloat32(f) float32 → FP16 Scalar
ToFloat32Slice(dst, src) Batch FP16 → float32 8x (NEON+FP16)
FromFloat32Slice(dst, src) Batch float32 → FP16 8x (NEON+FP16)
Arithmetic Add(dst, a, b) Element-wise addition 8x (NEON+FP16)
Sub(dst, a, b) Element-wise subtraction 8x (NEON+FP16)
Mul(dst, a, b) Element-wise multiplication 8x (NEON+FP16)
Div(dst, a, b) Element-wise division 8x (NEON+FP16)
Scale(dst, a, s) Multiply by scalar 8x (NEON+FP16)
AddScalar(dst, a, s) Add scalar 8x (NEON+FP16)
FMA(dst, a, b, c) Fused multiply-add: a*b+c 8x (NEON+FP16)
AddScaled(dst, alpha, s) dst += alpha*s (AXPY) 8x (NEON+FP16)
Unary Abs(dst, a) Absolute value 8x (NEON+FP16)
Neg(dst, a) Negation 8x (NEON+FP16)
Sqrt(dst, a) Square root 8x (NEON+FP16)
Reciprocal(dst, a) Reciprocal (1/x) 8x (NEON+FP16)
Reduction DotProduct(a, b) → float32 Dot product 8x (NEON+FP16)
DotProductF32(a, b) → float32 Dot product (FP32 widen) 8x (NEON)
Sum(a) → float32 Sum of elements 8x (NEON+FP16)
Min(a) Minimum value 8x (NEON+FP16)
Max(a) Maximum value 8x (NEON+FP16)
MinIdx(a) Index of minimum Pure Go
MaxIdx(a) Index of maximum Pure Go
Statistical Mean(a) → float32 Arithmetic mean 8x (NEON+FP16)
Variance(a) → float32 Population variance 8x (NEON)
StdDev(a) → float32 Standard deviation 8x (NEON)
Vector EuclideanDistance(a, b) → float32 L2 distance 8x (NEON)
Normalize(dst, a) Unit vector normalization 8x (NEON+FP16)
CumulativeSum(dst, a) Running sum Sequential
Range Clamp(dst, a, min, max) Clamp to range 8x (NEON+FP16)
ClampScale(dst, src, min, max, s) Fused clamp and scale 8x (NEON)
Activation ReLU(dst, src) Rectified Linear Unit 8x (NEON+FP16)
Sigmoid(dst, src) Sigmoid: 1/(1+e^-x) Pure Go
Tanh(dst, src) Hyperbolic tangent Pure Go
Exp(dst, src) Exponential e^x Pure Go
Batch DotProductBatch(r, rows, v) Multiple dot products 8x (NEON+FP16)
Signal ConvolveValid(dst, sig, k) FIR filter / convolution Pure Go
AccumulateAdd(dst, src, off) Overlap-add: dst[off:] += src 8x (NEON+FP16)
Audio Interleave2(dst, a, b) Pack stereo: [L,R,L,R,...] 8x (NEON)
Deinterleave2(a, b, src) Unpack stereo to channels 8x (NEON)

Key characteristics:

  • Storage: IEEE 754 half-precision (1 sign, 5 exponent, 10 mantissa bits)
  • Precision: ~3.3 decimal digits, range ~6×10⁻⁸ to 65504
  • Reductions: Accumulate in float32 for numerical stability
  • Memory efficiency: 2x bandwidth vs float32 (8 elements per 128-bit NEON vector)
  • DotProduct saturation: On ARM64 with FP16 SIMD, DotProduct computes per-element products in FP16 and saturates to ±Inf when |a[i] * b[i]| > 65504. Use DotProductF32 (FP32 widening before multiply, ~1.5-2x slower) for audio DSP or raw-signal inputs that can produce out-of-range products.
  • FP32-widened ops: DotProductF32, EuclideanDistance, Variance, StdDev, and ClampScale widen each FP16 lane to FP32 before arithmetic, so they match the pure-Go reference and never saturate. They use only base-NEON instructions (the FCVTL/FCVTN conversions are ARMv8.0-A, not the FEAT_FP16 extension), so they run on any ARM64 NEON core, including non-FP16 parts (Cortex-A72/A53). Interleave2/Deinterleave2 are likewise bit-exact 16-bit lane permutes (ZIP/UZP) that run on any ARM64 NEON core.

Benchmark (1024 elements, Raspberry Pi 5 / Cortex-A76, zero allocations):

Operation SIMD Pure Go Speedup
EuclideanDistance 481 ns 5995 ns 12.5x
Variance 506 ns 8901 ns 17.6x
Interleave2 178 ns 2163 ns 12.2x
Deinterleave2 178 ns 2167 ns 12.2x
ClampScale 531 ns 12211 ns 23.0x

Hardware requirements:

  • Native FP16 SIMD: ARM64 with FEAT_FP16 (ARMv8.2-A+)
    • Apple Silicon (M1/M2/M3/M4) ✅
    • Cortex-A55, A75, A76, A77, A78, X1, X2, X3 ✅
    • Raspberry Pi 5 (Cortex-A76) ✅
  • Pure Go fallback: All other platforms
    • Raspberry Pi 3/4 (Cortex-A53/A72 - ARMv8.0) - works but no SIMD acceleration
    • AMD64 - works but no SIMD acceleration

c128 - complex128 Operations

SIMD-accelerated complex number operations for FFT-based signal processing:

Category Function Description SIMD Width
Arithmetic Mul(dst, a, b) Complex multiplication 4x (AVX-512) / 2x (AVX)
MulConj(dst, a, b) Multiply by conjugate: a × conj(b) 4x / 2x
Scale(dst, a, s) Scale by complex scalar 4x / 2x
Add(dst, a, b) Complex addition 4x / 2x
Sub(dst, a, b) Complex subtraction 4x / 2x
Unary Abs(dst, a) Complex magnitude |a + bi| 4x (AVX-512) / 2x (AVX)
AbsSq(dst, a) Magnitude squared |a + bi|² 4x / 2x
Conj(dst, a) Complex conjugate: a - bi 4x / 2x

These operations are designed for FFT-based signal processing pipelines:

import "github.com/tphakala/simd/c128"

// Frequency-domain multiplication (FFT convolution)
signalFFT := make([]complex128, n)
kernelFFT := make([]complex128, n)
result := make([]complex128, n)
magnitude := make([]float64, n)

// Frequency-domain filtering
c128.Mul(result, signalFFT, kernelFFT)          // Complex multiply
c128.MulConj(result, signalFFT, kernelFFT)      // Cross-correlation

// Spectrogram and magnitude analysis
c128.Abs(magnitude, signalFFT)                  // Extract magnitude for display

Use Cases:

  • Abs/AbsSq: Spectrograms, power spectral density, frequency analysis
  • Conj: Cross-correlation, frequency-domain filtering
  • Mul/MulConj: FFT-based convolution, filtering, correlation

Benchmark (1024 elements, Intel Xeon Platinum 8362 AVX-512):

Operation SIMD Pure Go Speedup
Mul 239 ns 1021 ns 4.3x
MulConj 314 ns 1434 ns 4.6x
Scale 180 ns 959 ns 5.3x
Add 255 ns 733 ns 2.9x
Abs 918 ns 3453 ns 3.8x
AbsSq 237 ns 594 ns 2.5x
Conj 163 ns 594 ns 3.7x

c64 - complex64 Operations

SIMD-accelerated single-precision complex number operations:

Category Function Description SIMD Width
Arithmetic Mul(dst, a, b) Complex multiplication 8x (AVX-512) / 4x (AVX) / 2x (NEON)
MulConj(dst, a, b) Multiply by conjugate: a × conj(b) 8x / 4x / 2x
Scale(dst, a, s) Scale by complex scalar 8x / 4x / 2x
Add(dst, a, b) Complex addition 8x / 4x / 2x
Sub(dst, a, b) Complex subtraction 8x / 4x / 2x
Unary Abs(dst, a) Complex magnitude |a + bi| 8x / 4x / 2x
AbsSq(dst, a) Magnitude squared |a + bi|² 8x / 4x / 2x
Conj(dst, a) Complex conjugate: a - bi 8x / 4x / 2x
Conversion FromReal(dst, src) Real to complex: src → src+0i 8x / 4x / 2x

Same API as c128 but for complex64 with 2x wider SIMD (8 bytes vs 16 bytes per element):

import "github.com/tphakala/simd/c64"

// Single-precision FFT processing
signalFFT := make([]complex64, n)
kernelFFT := make([]complex64, n)
result := make([]complex64, n)
magnitude := make([]float32, n)

c64.Mul(result, signalFFT, kernelFFT)     // Complex multiply
c64.Abs(magnitude, signalFFT)              // Extract magnitude

i32 - int32 Operations

SIMD-accelerated integer-domain operations for codec and integer-DSP hot loops (for example a pure-Go FLAC encoder/decoder), where the per-sample work is integer arithmetic and channel (de)interleaving rather than floating-point math:

Category Function Description SIMD Width
Interleave Interleave2(dst, a, b) Pack two channels into interleaved stereo 8x (AVX) / 4x (NEON)
Deinterleave2(a, b, src) Split interleaved stereo into two channels 8x (AVX) / 4x (NEON)
Decorrelation Add(dst, a, b) Element-wise add (RIGHT_SIDE decode) 8x (AVX2) / 4x (NEON)
Sub(dst, a, b) Element-wise subtract (side channel / LEFT_SIDE decode) 8x (AVX2) / 4x (NEON)
MidSideEncode(mid, side, left, right) Mid/side forward decorrelation 8x (AVX2) / 4x (NEON)
MidSideDecode(left, right, mid, side) Mid/side inverse (parity-bit reconstruction) 8x (AVX2) / 4x (NEON)
Fixed predictor Diff1..Diff4(dst, src) Order 1-4 fixed-predictor encode residual (forward diff) 8x (AVX2) / 4x (NEON)
Restore1..Restore4(dst, src) Order 1-4 fixed-predictor decode (inverse; prefix sum) 8x (AVX2) / 4x (NEON)
import "github.com/tphakala/simd/i32"

left := make([]int32, n)
right := make([]int32, n)
stereo := make([]int32, n*2)

i32.Interleave2(stereo, left, right)   // [l0, r0, l1, r1, ...]
i32.Deinterleave2(left, right, stereo) // inverse: split back to channels

res := make([]int32, n)
i32.Diff2(res, left)     // order-2 fixed-predictor encode residual
i32.Restore2(left, res)  // exact inverse: reconstruct the samples

Interleaving is pure 32-bit-lane movement, so those kernels reuse the proven f32 shuffle/permute encodings (AVX VUNPCKLPS/VPERM2F128, NEON ZIP/UZP on .4S); the bit pattern of each lane is irrelevant, so negative values and the type extremes round-trip exactly. The decorrelation and fixed-predictor kernels do integer-ALU work on 256-bit (AVX2) / 128-bit (NEON) lanes. RestoreK is the exact inverse of DiffK: since the order-K fixed predictor is the K-th forward difference, restoration is K cumulative-sum passes built on one SIMD prefix-sum kernel (11x at order 1 down to ~5x at order 4 on AVX2; 6x to 3x on NEON, all zero-allocation). The remaining integer surface (LPC FIR encode/decode, Rice cost search) is tracked in the project issues.

Performance

AMD64 (Intel Core i7-1260P, AVX+FMA)

float64 Operations - SIMD vs Pure Go (1024 elements)

Category Operation SIMD (ns) Go (ns) Speedup
Arithmetic Add 84 446 5.3x
Sub 84 335 4.0x
Mul 86 436 5.1x
Div 441 941 2.1x
Scale 68 272 4.0x
AddScalar 68 286 4.2x
FMA 110 557 5.0x
Unary Abs 66 365 5.6x
Neg 66 306 4.6x
Sqrt 658 1323 2.0x
Reciprocal 447 920 2.1x
Reduction DotProduct 162 859 5.3x
Sum 82 184 2.3x
Min 157 340 2.2x
Max 154 352 2.3x
Statistical Mean 82 184 2.3x
Variance* 820 902 1.1x
StdDev* 825 905 1.1x
Vector EuclideanDistance 216 1071 5.0x
Normalize 220 1080 4.9x
CumulativeSum 428 425 1.0x
Range Clamp 81 640 7.9x

*Variance/StdDev benchmarked at 4096 elements (SIMD benefits at larger sizes)

float32 Operations - SIMD vs Pure Go (1024 elements)

Category Operation SIMD (ns) Go (ns) Speedup
Arithmetic Add 47 441 9.4x
Sub 49 339 6.9x
Mul 49 436 8.9x
Div 138 655 4.8x
Scale 40 299 7.4x
AddScalar 39 272 7.0x
FMA 64 444 6.9x
Unary Abs 37 656 17.6x
Neg 40 273 6.9x
Reduction DotProduct 71 424 5.9x
Sum 41 123 3.0x
Min 65 340 5.2x
Max 66 352 5.3x
Range Clamp 47 701 14.8x

Activation Functions - SIMD vs Pure Go

float32 (1024 elements):

Function SIMD (ns) Go (ns) Speedup SIMD Throughput
Sigmoid 138 5906 43x 59.3 GB/s
ReLU 39 662 17x 211 GB/s
Tanh 138 28116 204x 59.5 GB/s
Exp 312 5555 18x 26.3 GB/s

float64 (1024 elements):

Function SIMD (ns) Go (ns) Speedup SIMD Throughput
Sigmoid 745 5640 7.6x 22.0 GB/s
ReLU 68 646 9.5x 240 GB/s
Tanh 836 6529 7.8x 19.6 GB/s
Exp 606 4698 7.8x 27.0 GB/s

Key Characteristics:

  • Tanh: 200x+ speedup for f32 - fast approximation with saturation vs math.Tanh
  • ReLU: Highest throughput (211-240 GB/s) - simple max(0, x) operation
  • Sigmoid: 43x speedup for f32 - fast approximation with exponential
  • Exp: 18x speedup for f32 (12x on ARM64 NEON) via range reduction plus a degree-5 polynomial; max relative error ~7e-6 (f32), ~3e-6 (f64)

Batch & Signal Processing (varied sizes)

Operation Config SIMD Go Speedup
DotProductBatch (f64) 256 vec × 100 rows 3.2 µs 20.5 µs 6.4x
DotProductBatch (f32) 256 vec × 100 rows 1.5 µs 9.8 µs 6.7x
ConvolveValid (f64) 4096 sig × 64 ker 26.6 µs 169 µs 6.3x
ConvolveValid (f32) 4096 sig × 64 ker 17.9 µs 80 µs 4.5x
ConvolveValidMulti (f64) 1000 sig × 64 ker × 2 13.4 µs - -
CubicInterpDot (f64) 241 taps 47 ns 88 ns 1.9x
CubicInterpDot (f32) 241 taps 21 ns 66 ns 3.1x
Int32ToFloat32Scale 1024 elements 50 ns 405 ns 8.1x
Int32ToFloat32Scale 4096 elements 157 ns 1586 ns 10.1x
Int16ToFloat32Scale 1024 elements 58 ns 483 ns 8.3x
Int16ToFloat32Scale 4096 elements 200 ns 1914 ns 9.6x
Float32ToInt16Scale 1024 elements 92 ns 1360 ns 14.8x
Float32ToInt16Scale 4096 elements 365 ns 5420 ns 14.8x
Interleave2 (f64) 1000 pairs 216 ns - -
Deinterleave2 (f64) 1000 pairs 216 ns - -
Interleave2 (f32) 1000 pairs 109 ns - -
Deinterleave2 (f32) 1000 pairs 216 ns - -

ConvolveDecimate (fused strided convolution)

ConvolveDecimate fuses an FIR downsample loop into one call. The relevant baseline is what a consumer writes today: a Go loop calling DotProductUnsafe at each strided window (the inner dot is already SIMD). Both compute identical results; the fused kernel removes the per-output call, dispatch and slice-header overhead and keeps the kernel pointer resident, so the win is largest for short kernels. Signal length 4096, allocation-free. Measured (AVX2 on x86-64, NEON on a Raspberry Pi 5):

Config f32 x86 f64 x86 f32 NEON f64 NEON
20 taps, 2x decimate 2.0x 2.3x 1.7x 1.9x
32 taps, 2x decimate 2.3x 1.7x 1.9x 1.7x
64 taps, 2x decimate 2.0x 1.8x 1.7x 1.3x
241 taps, 2x decimate 1.5x 1.3x 1.2x 1.1x
241 taps, 4x decimate 1.4x 1.2x 1.2x 1.1x

Performance Summary

Package Average Speedup Best Operations
f32 6.5x 21.8x (Abs) 35 functions
f64 3.2x 7.9x (Clamp) 32 functions
c128 2.7x 3.4x (Abs) 8 functions
c64 ~2x ~3x (Mul) 9 functions

ARM64 (Raspberry Pi 5, NEON)

float64 Operations

Operation Size Time Throughput
DotProduct 277 151 ns 29 GB/s
DotProduct 1000 513 ns 31 GB/s
Add 1000 775 ns 31 GB/s
Mul 1000 727 ns 33 GB/s
FMA 1000 890 ns 36 GB/s
Sum 1000 635 ns 13 GB/s
Mean 1000 677 ns 12 GB/s

float32 Operations

Operation Size Time Throughput
DotProduct 100 37 ns 21 GB/s
DotProduct 1000 263 ns 30 GB/s
DotProduct 10000 2.78 µs 29 GB/s
Add 1000 389 ns 31 GB/s
Mul 1000 390 ns 31 GB/s
FMA 1000 479 ns 33 GB/s

Comparison vs Pure Go

Operation Size SIMD Pure Go Speedup
DotProduct (f32) 100 37 ns 137 ns 3.7x
DotProduct (f32) 1000 262 ns 1350 ns 5.2x
DotProduct (f64) 100 62 ns 138 ns 2.2x
DotProduct (f64) 1000 513 ns 1353 ns 2.6x
Add (f32) 1000 389 ns 2015 ns 5.2x
Sum (f32) 1000 343 ns 1327 ns 3.9x

int32 (i32) - SIMD vs Pure Go (1000 elements)

Operation AMD64 (AVX/AVX2) ARM64 (NEON, Pi 5)
Interleave2 121 ns vs 487 ns (4.0x) 322 ns vs 1679 ns (5.2x)
Deinterleave2 228 ns vs 482 ns (2.1x) 322 ns vs 1682 ns (5.2x)
Restore1 181 ns vs 2019 ns (11.2x) 605 ns vs 3763 ns (6.2x)
Restore2 320 ns vs 2296 ns (7.2x) 1101 ns vs 4592 ns (4.2x)
Restore3 462 ns vs 2493 ns (5.4x) 1608 ns vs 5421 ns (3.4x)
Restore4 575 ns vs 2778 ns (4.8x) 2093 ns vs 6245 ns (3.0x)

All int32 kernels are zero-allocation and bit-exact against the pure-Go reference (verified across the sign and high bits with negative values and the type extremes). The Restore baseline is the pure-Go decode recurrence; RestoreK reconstructs samples as K SIMD prefix-sum passes (the inverse of the order-K forward difference).

Performance Notes

  • AMD64: Explicit SIMD provides 5x speedups for most operations compared to pure Go, with consistent high throughput across all vector sizes.

  • ARM64: NEON SIMD provides substantial speedups over pure Go across all operations:

    • float32: 3.7x - 5.2x faster (4 elements per 128-bit vector)
    • float64: 2.2x - 2.6x faster (2 elements per 128-bit vector)
  • CumulativeSum is inherently sequential (each element depends on the previous) and uses pure Go on all platforms.

Known Limitations

Small Slice Fallback for Min/Max (AMD64)

On AMD64, the Min and Max functions fall back to pure Go for small slices:

  • float64: slices with fewer than 4 elements
  • float32: slices with fewer than 8 elements

This is because AVX assembly loads multiple elements at once (4 float64s or 8 float32s), which would cause out-of-bounds memory access on smaller slices.

The Go fallback for small slices is intentional and likely optimal - SIMD setup overhead (register loading, masking, horizontal reduction) would exceed the cost of a simple 2-3 element comparison loop.

Architecture Support

Architecture Instruction Set f64/f32/c128/c64 f16
AMD64 AVX-512 Full SIMD support Pure Go fallback
AMD64 AVX + FMA Full SIMD support Pure Go fallback
AMD64 SSE4.1 Full SIMD support Pure Go fallback
ARM64 NEON + FP16 Full SIMD support Full SIMD support
ARM64 NEON only Full SIMD support Pure Go fallback
Other - Pure Go fallback Pure Go fallback

ARM64 FP16 support by device:

Device / SoC Core(s) Architecture FP16 SIMD
Apple Silicon (M1-M4) Firestorm+ ARMv8.4-A ✅ Yes
Raspberry Pi 5 Cortex-A76 ARMv8.2-A ✅ Yes
Raspberry Pi 4 Cortex-A72 ARMv8.0-A ❌ No
Raspberry Pi 3 Cortex-A53 ARMv8.0-A ❌ No
AWS Graviton 2/3 Neoverse N1/V1 ARMv8.2-A+ ✅ Yes
Ampere Altra Neoverse N1 ARMv8.2-A ✅ Yes

Design Principles

  1. Pure Go assembly - Native Go assembler for maximum portability and easy cross-compilation
  2. Runtime dispatch - CPU features detected once at init time, zero runtime overhead
  3. Zero allocations - No heap allocations in hot paths
  4. Safe defaults - Gracefully falls back to pure Go on unsupported CPUs
  5. Boundary safe - Handles any slice length, not just SIMD-aligned sizes

Testing

The library includes comprehensive tests with pure Go reference implementations for validation:

# Run all tests
go test ./...

# Run tests with verbose output
task test

# Run benchmarks
task bench

# Compare SIMD vs pure Go performance
task bench:compare

# Show CPU SIMD capabilities
task cpu

See Taskfile.yml for all available tasks.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

High-performance SIMD library for Go with AVX/NEON support

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Contributors