WASM SIMD & Performance Optimization

The RSP is a SIMD processor. WASM has SIMD. Nobody connects them. We do.


The N64 RSP Vector Unit

The Reality Signal Processor is the N64's secret weapon. It handles:

  • Vertex transforms (3D math)
  • Lighting calculations
  • Audio mixing and resampling
  • Texture coordinate generation

It's a SIMD processor with 32 vector registers, each containing 8 x 16-bit elements:

Register layout (from mupen64plus-rsp-cxd4/vu/vu.h):

VR[0..31] = 8 x int16 = 128 bits per register

┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ e[0] │ e[1] │ e[2] │ e[3] │ e[4] │ e[5] │ e[6] │ e[7] │
│ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
|__________________ 128 bits ____________________________|

WASM SIMD v128

WASM SIMD provides a v128 type — exactly 128 bits:

v128 (i16x8 interpretation):

┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│lane 0│lane 1│lane 2│lane 3│lane 4│lane 5│lane 6│lane 7│
│ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
|__________________ 128 bits ____________________________|

The width and lane layout match exactly: one WASM SIMD instruction can operate on all eight lanes of an RSP vector register at once.
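As a minimal sketch of that mapping, here is VAND (the simplest case) written against a hypothetical `rsp_vreg` struct, which is an illustration and not the cxd4 type. When the compiler targets WASM SIMD (clang defines `__wasm_simd128__` under `-msimd128`), the whole register goes through a single `v128.and`. Otherwise the same function does eight scalar operations.

```c
#include <stdint.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

/* Hypothetical register type for illustration: 8 lanes of int16 = 128 bits. */
typedef struct {
    int16_t e[8];
} rsp_vreg;

_Static_assert(sizeof(rsp_vreg) == 16, "an RSP vector register is 128 bits");

/* VAND-style lane-wise AND: vd = vs & vt. */
static void rsp_vand(rsp_vreg *vd, const rsp_vreg *vs, const rsp_vreg *vt)
{
#if defined(__wasm_simd128__)
    /* One load per operand, one v128.and, one store. */
    v128_t a = wasm_v128_load(vs->e);
    v128_t b = wasm_v128_load(vt->e);
    wasm_v128_store(vd->e, wasm_v128_and(a, b));
#else
    /* Scalar fallback: the same work, one lane at a time. */
    for (int i = 0; i < 8; i++) {
        vd->e[i] = (int16_t)(vs->e[i] & vt->e[i]);
    }
#endif
}
```

Both paths read and write the same 16 bytes, so the surrounding emulator code does not need to know which one was compiled in.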


The SSE2 → WASM SIMD Pipeline

The RSP code in N64Wasm already has SSE2 implementations. Emscripten can compile those SSE2 intrinsics to WASM SIMD instructions through its SSE compatibility headers:

Actual Code (from the N64Wasm source):

// mupen64plus-rsp-cxd4/vu/multiply.h
#ifdef ARCH_MIN_SSE2

static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    __m128i vs = _mm_load_si128((__m128i*)VS);
    __m128i vt = _mm_load_si128((__m128i*)VT);
    __m128i lo = _mm_mullo_epi16(vs, vt);
    __m128i hi = _mm_mulhi_epi16(vs, vt);
    __m128i sign = _mm_srai_epi16(lo, 15);
    __m128i prod = _mm_add_epi16(hi, hi);
    prod = _mm_sub_epi16(prod, sign);
    // ... accumulator update ...
    _mm_store_si128((__m128i*)VD, result);
}

#else
// SCALAR FALLBACK (what N64Wasm currently uses!)
static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)VS[i] * (int32_t)VT[i];
        // ... 8 iterations of scalar math ...
        VD[i] = result;
    }
}
#endif

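N64Wasm builds with -DNOSSE, forcing the scalar fallback. Our build removes that flag, defines ARCH_MIN_SSE2 to select the SSE2 path, and adds -msimd128 so Emscripten can lower the intrinsics to WASM SIMD.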
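For reference when comparing the two paths, the following is a self-contained scalar model of what VMULF computes per lane. It is a sketch under the commonly documented semantics (signed Q15 multiply, doubled, rounded with 0x8000, result clamped), not the cxd4 code; the names `vmulf_ref` and `clamp_s16` and the use of a 64-bit value to stand in for the RSP accumulator are assumptions made for illustration.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the signed 16-bit range. */
static int16_t clamp_s16(int64_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/*
 * Assumed VMULF semantics (signed Q15 fractional multiply):
 *   acc    = vs * vt * 2 + 0x8000   (kept per lane, wider than 32 bits)
 *   result = clamp_s16(acc >> 16)
 */
static void vmulf_ref(int16_t vd[8], int64_t acc[8],
                      const int16_t vs[8], const int16_t vt[8])
{
    for (int i = 0; i < 8; i++) {
        acc[i] = (int64_t)vs[i] * (int64_t)vt[i] * 2 + 0x8000;
        /* Relies on arithmetic right shift of negative values,
           which mainstream compilers provide. */
        vd[i] = clamp_s16(acc[i] >> 16);
    }
}
```

A model like this is useful as a test oracle: feed the same inputs to the SIMD build and check that every lane agrees, including the one overflow case (-32768 x -32768), which clamps to 32767.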


Build Flag Changes

# N64Wasm Makefile diff
- EMCC_FLAGS += -DNOSSE
- EMCC_FLAGS += -DNO_ASM
+ EMCC_FLAGS += -msimd128
+ EMCC_FLAGS += -mrelaxed-simd
+ EMCC_FLAGS += -DARCH_MIN_SSE2

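That's it: two flags removed and three added, for an expected 2-4x speedup in RSP vector work. One caveat: Emscripten's documentation pairs -msimd128 with -msse2 when compiling SSE2 intrinsics, so that flag may also be needed.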
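A small compile-time probe can confirm that the flag changes actually selected the intended path. The sketch below assumes the macro names used in this document (`NOSSE`, `ARCH_MIN_SSE2`) plus clang's `__wasm_simd128__`, which is defined under `-msimd128`. The function name `rsp_vector_path` and the way the macros are combined are hypothetical; the real headers may gate the code differently.

```c
/* Report which RSP vector path this translation unit was compiled for. */
static const char *rsp_vector_path(void)
{
#if defined(ARCH_MIN_SSE2) && !defined(NOSSE)
#  if defined(__wasm_simd128__)
    return "sse2-on-wasm-simd128";   /* SSE2 intrinsics lowered to WASM SIMD */
#  else
    return "sse2-native";            /* ordinary x86 build */
#  endif
#else
    return "scalar";                 /* what a -DNOSSE build ends up with */
#endif
}
```

Logging this string once at startup makes it obvious when a stray flag has silently put the build back on the scalar fallback.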


RSP Operations Affected

Most RSP vector instructions collapse from eight or more scalar operations into a handful of SIMD operations:

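| Instruction   | Operation              | Scalar Ops                 | SIMD Ops     | Speedup |
|---------------|------------------------|----------------------------|--------------|---------|
| VMULF         | Vector multiply (frac) | 8 muls + 8 shifts + 8 adds | 3 SIMD ops   | ~8x     |
| VADD          | Vector add             | 8 adds + 8 clamps          | 2 SIMD ops   | ~4x     |
| VSUB          | Vector subtract        | 8 subs + 8 clamps          | 2 SIMD ops   | ~4x     |
| VMACF         | Multiply-accumulate    | 8 muls + 16 adds           | 4 SIMD ops   | ~6x     |
| VAND/VOR/VXOR | Logical ops            | 8 ops                      | 1 SIMD op    | ~8x     |
| VCH/VCL/VCR   | Compare/clip           | 8 compares                 | 1-2 SIMD ops | ~4x     |
| VMRG          | Merge                  | 8 selects                  | 1 blend op   | ~8x     |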
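To make the VADD row concrete, the sketch below shows the clamped add both ways. It is simplified: it ignores the RSP carry flags that the real instruction also consumes, which is an assumption made to keep the example short, and `vadd_clamped` is a hypothetical name. On the WASM path the add and the clamp fold into one saturating add, `wasm_i16x8_add_sat` from `wasm_simd128.h`.

```c
#include <stdint.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

/* Clamped lane-wise add, ignoring the RSP carry flags for brevity. */
static void vadd_clamped(int16_t vd[8], const int16_t vs[8], const int16_t vt[8])
{
#if defined(__wasm_simd128__)
    /* One saturating add covers all eight lanes, clamp included. */
    v128_t sum = wasm_i16x8_add_sat(wasm_v128_load(vs), wasm_v128_load(vt));
    wasm_v128_store(vd, sum);
#else
    /* Scalar path: widen, add, clamp, narrow, eight times. */
    for (int i = 0; i < 8; i++) {
        int32_t s = (int32_t)vs[i] + (int32_t)vt[i];
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        vd[i] = (int16_t)s;
    }
#endif
}
```

The scalar loop is where the "8 adds + 8 clamps" in the table comes from; the extra SIMD op in the table would account for work such as the carry-in that this sketch leaves out.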

Relaxed SIMD

The -mrelaxed-simd flag enables additional optimizations:

// Standard SIMD: must handle NaN deterministically
// Relaxed SIMD: can use native hardware behavior for NaN

// This matters for RSP floating-point-like operations
// where exact NaN semantics don't affect game behavior

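Relaxed SIMD lets the browser's WASM engine pick the fastest native instruction for an operation instead of guaranteeing one result for every edge case. The RSP vector unit works on 16-bit fixed-point values rather than IEEE-754 floats, so the edge cases in play are mostly integer ones (such as overflow handling) where the exact result does not affect game behavior. Relaxed SIMD is a separate feature from baseline WASM SIMD, so a build that uses it needs its own feature check and fallback.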
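On the integer side, the relaxed operation closest to the RSP's Q15 multiplies is `i16x8.relaxed_q15mulr_s`. The sketch below is a scalar model of one lane of its strict counterpart, `i16x8.q15mulr_sat_s`, plus a helper that flags the single input pair where the relaxed form is allowed to return a different value. Both function names are hypothetical, and whether the N64Wasm build ends up emitting this instruction is not something this document establishes.

```c
#include <stdint.h>

/* Scalar model of one lane of i16x8.q15mulr_sat_s:
   saturate((a * b + 0x4000) >> 15). */
static int16_t q15mulr_sat_lane(int16_t a, int16_t b)
{
    int32_t r = ((int32_t)a * (int32_t)b + 0x4000) >> 15;
    if (r > INT16_MAX) r = INT16_MAX;
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}

/* The relaxed variant may return either INT16_MAX or INT16_MIN when both
   inputs are INT16_MIN; for every other input it matches the strict form. */
static int q15mulr_relaxed_may_differ(int16_t a, int16_t b)
{
    return a == INT16_MIN && b == INT16_MIN;
}
```

If a game never multiplies -1.0 by -1.0 in Q15, the two forms are indistinguishable, which is the sense in which the relaxed version is free performance.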


Threading + SIMD Combined

The real power comes from combining SIMD with the threading model.


Expected Performance Budget

For a typical N64 game frame (16.67ms budget at 60fps):

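| Component       | Current (scalar, 1 thread) | Ours (SIMD, multi-thread)                  |
|-----------------|----------------------------|--------------------------------------------|
| CPU emulation   | 6ms                        | 6ms (no change)                            |
| RSP vector      | 5ms                        | 1.5ms (SIMD)                               |
| RSP scalar      | 1ms                        | 1ms (no change)                            |
| RDP/rendering   | 3ms                        | 3ms (separate thread)                      |
| Audio           | 1ms                        | 0ms (separate thread)                      |
| Input/UI        | 0.5ms                      | 0ms (main thread)                          |
| Total per frame | 16.5ms (barely 60fps)      | ~8.5ms on the emulation thread (headroom!) |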
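The totals can be checked with a few lines of arithmetic. The sketch below uses the per-component figures from the table and assumes that, in the multi-threaded build, only CPU emulation and the two RSP components stay on the emulation thread; that reading gives about 8.5 ms, in line with the table's total of roughly 8 ms. The function names are hypothetical.

```c
/* Frame time available at a given frame rate, in milliseconds. */
static double frame_budget_ms(double fps)
{
    return 1000.0 / fps;
}

/* Current build: everything runs on one thread. */
static double current_total_ms(void)
{
    /* CPU + RSP vector + RSP scalar + RDP + audio + input/UI */
    return 6.0 + 5.0 + 1.0 + 3.0 + 1.0 + 0.5;
}

/* SIMD + threads: only the work left on the emulation thread counts. */
static double simd_threaded_total_ms(void)
{
    /* CPU + RSP vector (SIMD) + RSP scalar */
    return 6.0 + 1.5 + 1.0;
}

/* Time left over in the frame after the given amount of work. */
static double headroom_ms(double fps, double used_ms)
{
    return frame_budget_ms(fps) - used_ms;
}
```

At 60fps this leaves under 0.2 ms of slack for the current build and a little over 8 ms for the SIMD, multi-threaded one.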

With headroom, we can add post-processing shaders, recording, and other features without dropping frames.


Browser Support

WASM SIMD is available in browsers that account for 94.21% of global usage:

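| Browser          | Version | Since      |
|------------------|---------|------------|
| Chrome           | 91+     | May 2021   |
| Firefox          | 89+     | June 2021  |
| Safari           | 16.4+   | March 2023 |
| Edge             | 91+     | May 2021   |
| Samsung Internet | 16.0+   | 2022       |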

The remaining ~6% are IE, Opera Mini, and pre-2021 mobile browsers. We provide a scalar fallback for these (same as current N64Wasm — just slower).