WASM SIMD & Performance Optimization
The RSP is a SIMD processor. WASM has SIMD. Nobody connects them. We do.
The N64 RSP Vector Unit
The Reality Signal Processor is the N64's secret weapon. It handles:
- Vertex transforms (3D math)
- Lighting calculations
- Audio mixing and resampling
- Texture coordinate generation
It's a SIMD processor with 32 vector registers, each containing 8 x 16-bit elements:
Register layout (from mupen64plus-rsp-cxd4/vu/vu.h):
VR[0..31] = 8 x int16 = 128 bits per register
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ e[0] │ e[1] │ e[2] │ e[3] │ e[4] │ e[5] │ e[6] │ e[7] │
│ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
|______________________ 128 bits _______________________|
WASM SIMD v128
WASM SIMD provides a v128 type — exactly 128 bits:
v128 (i16x8 interpretation):
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│lane 0│lane 1│lane 2│lane 3│lane 4│lane 5│lane 6│lane 7│
│ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
|______________________ 128 bits _______________________|
The layouts match lane for lane: both hold 8 x 16-bit values in 128 bits. A single WASM SIMD instruction can therefore operate on an entire RSP vector register.
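To make the lane correspondence concrete, here is a minimal portable C sketch. The `rsp_vreg` type and both function names are hypothetical and are not taken from the cxd4 source. It models one register as 8 x int16 and shows that a lane-by-lane AND gives the same bytes as an AND over the whole 128-bit value, which is the work a single v128 instruction would do.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one RSP vector register: 8 x int16 = 128 bits. */
typedef struct {
    int16_t e[8];
} rsp_vreg;

_Static_assert(sizeof(rsp_vreg) == 16, "an RSP vector register must be 128 bits");

/* Lane-by-lane AND: eight separate 16-bit operations. */
void vand_lanes(rsp_vreg *vd, const rsp_vreg *vs, const rsp_vreg *vt) {
    for (int i = 0; i < 8; i++) {
        vd->e[i] = (int16_t)(vs->e[i] & vt->e[i]);
    }
}

/* The same AND applied to the full 128-bit value, written here as two
 * 64-bit halves so it stays portable. A v128 engine needs one instruction. */
void vand_wide(rsp_vreg *vd, const rsp_vreg *vs, const rsp_vreg *vt) {
    uint64_t a[2], b[2], r[2];
    memcpy(a, vs, sizeof a);
    memcpy(b, vt, sizeof b);
    r[0] = a[0] & b[0];
    r[1] = a[1] & b[1];
    memcpy(vd, r, sizeof r);
}
```

Both functions produce identical register contents for any input, because a bitwise operation never crosses lane boundaries.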
The SSE2 → WASM SIMD Pipeline
The RSP code in N64Wasm already has SSE2 implementations, and Emscripten can translate SSE2 intrinsics into WASM SIMD instructions:
Actual Code (from the N64Wasm source):
// mupen64plus-rsp-cxd4/vu/multiply.h
#ifdef ARCH_MIN_SSE2
static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    __m128i vs = _mm_load_si128((__m128i*)VS);
    __m128i vt = _mm_load_si128((__m128i*)VT);
    __m128i lo = _mm_mullo_epi16(vs, vt);
    __m128i hi = _mm_mulhi_epi16(vs, vt);
    __m128i sign = _mm_srai_epi16(lo, 15);
    __m128i prod = _mm_add_epi16(hi, hi);
    prod = _mm_sub_epi16(prod, sign);
    // ... accumulator update ...
    _mm_store_si128((__m128i*)VD, result);
}
#else
// SCALAR FALLBACK (what N64Wasm currently uses!)
static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)VS[i] * (int32_t)VT[i];
        // ... 8 iterations of scalar math ...
        VD[i] = result;
    }
}
#endif
N64Wasm builds with -DNOSSE, forcing the scalar fallback. Our build removes that flag and adds -msimd128.
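For reference, the per-lane arithmetic that both paths have to reproduce can be written out in full. The sketch below assumes the commonly documented VMULF semantics (signed fractional multiply: accumulator = 2 * vs * vt + 0x8000, result = accumulator bits 31..16, clamped to int16). It is not the cxd4 implementation, the function names are hypothetical, and the accumulator state that the real instruction also updates is left out.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
int16_t clamp_s16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* One lane of VMULF under the assumed semantics above.
 * The right shift relies on arithmetic shifting of negative values,
 * which GCC and Clang provide. */
int16_t vmulf_lane(int16_t vs, int16_t vt) {
    int64_t acc = 2 * (int64_t)vs * (int64_t)vt + 0x8000;
    return clamp_s16((int32_t)(acc >> 16));
}

/* Scalar form over a whole register: eight independent lanes. */
void vmulf_scalar(int16_t *vd, const int16_t *vs, const int16_t *vt) {
    for (int i = 0; i < 8; i++) {
        vd[i] = vmulf_lane(vs[i], vt[i]);
    }
}
```

The only input pair that needs the clamp is -32768 * -32768, where the rounded result would be +32768. A vectorized version has to handle that case the same way.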
Build Flag Changes
# N64Wasm Makefile diff
- EMCC_FLAGS += -DNOSSE
- EMCC_FLAGS += -DNO_ASM
+ EMCC_FLAGS += -msimd128
+ EMCC_FLAGS += -mrelaxed-simd
+ EMCC_FLAGS += -DARCH_MIN_SSE2
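That's it: two defines removed and three flags added. One caveat: Emscripten only provides the SSE2 intrinsics headers when -msse2 is passed together with -msimd128, so the build likely needs that flag as well. We expect these changes to make RSP vector work roughly 2-4x faster.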
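A quick way to confirm the flags took effect is to check the compiler's predefined macros. The sketch below is a hypothetical helper, not part of N64Wasm. It relies on Clang defining `__wasm_simd128__` under -msimd128, and it assumes Emscripten defines `__SSE2__` when -msse2 is passed. The `#error` guard turns a silently broken configuration into a build failure.

```c
/* Fail the build early if the SSE2 code path was requested without
 * the compiler actually providing SSE2 intrinsics. */
#if defined(ARCH_MIN_SSE2) && !defined(__SSE2__)
#error "ARCH_MIN_SSE2 is defined but SSE2 intrinsics are unavailable"
#endif

/* Hypothetical helper: reports which vector path this build selected. */
const char *rsp_vector_path(void) {
#if defined(__wasm_simd128__) && defined(__SSE2__)
    return "SSE2 intrinsics lowered to WASM SIMD";
#elif defined(__SSE2__)
    return "native SSE2";
#else
    return "scalar fallback";
#endif
}
```

Logging this string once at startup makes it obvious whether a given build is running the SIMD path or the scalar one.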
RSP Operations Affected
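Most RSP vector instructions map onto a handful of SIMD operations. The speedups below are rough per-instruction estimates based on operation counts, not measurements: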
| Instruction | Operation | Scalar Ops | SIMD Ops | Speedup |
|---|---|---|---|---|
| VMULF | Vector multiply (frac) | 8 muls + 8 shifts + 8 adds | 3 SIMD ops | ~8x |
| VADD | Vector add | 8 adds + 8 clamps | 2 SIMD ops | ~4x |
| VSUB | Vector subtract | 8 subs + 8 clamps | 2 SIMD ops | ~4x |
| VMACF | Multiply-accumulate | 8 muls + 16 adds | 4 SIMD ops | ~6x |
| VAND/VOR/VXOR | Logical ops | 8 ops | 1 SIMD op | ~8x |
| VCH/VCL/VCR | Compare/clip | 8 compares | 1-2 SIMD ops | ~4x |
| VMRG | Merge | 8 selects | 1 blend op | ~8x |
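As an illustration of the "8 adds + 8 clamps" entry for VADD, here is a minimal scalar sketch. It is simplified on purpose: the real instruction also adds a carry-in bit per lane, which is omitted here, and the function name is hypothetical. On the SIMD side, the add and the clamp collapse into one saturating add (`_mm_adds_epi16` in SSE2, `i16x8.add_sat_s` in WASM SIMD).

```c
#include <stdint.h>

/* Scalar VADD sketch: 8 adds + 8 clamps.
 * Assumption: the per-lane carry-in is ignored for brevity. */
void vadd_scalar(int16_t *vd, const int16_t *vs, const int16_t *vt) {
    for (int i = 0; i < 8; i++) {
        int32_t sum = (int32_t)vs[i] + (int32_t)vt[i];
        if (sum > INT16_MAX) sum = INT16_MAX;   /* clamp high */
        if (sum < INT16_MIN) sum = INT16_MIN;   /* clamp low  */
        vd[i] = (int16_t)sum;
    }
}
```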
Relaxed SIMD
The -mrelaxed-simd flag enables additional optimizations:
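// Standard SIMD: every instruction returns the same result on all hardware
// Relaxed SIMD: some edge cases (NaN handling, out-of-range conversions,
// Q15 multiply overflow) may follow native hardware behavior
// This only matters for RSP operations where the emulator
// depends on those exact edge-case results
Relaxed SIMD lets the browser's WASM engine use the fastest native instruction without guaranteeing identical edge-case results on every machine. For emulation paths that don't need that strictness, this is free performance. Note that Relaxed SIMD is a separate, newer proposal than baseline WASM SIMD, so its browser support is narrower.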
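One integer example of such an edge case is the Q15 rounding multiply. The strict instruction `i16x8.q15mulr_sat_s` saturates, while `i16x8.relaxed_q15mulr_s` leaves the result for INT16_MIN * INT16_MIN to the implementation. The portable C model below is a hypothetical sketch of both outcomes, not engine code. It shows that the two variants differ for exactly that one input pair.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a Q15 rounding multiply on one lane: (a * b + 0x4000) >> 15.
 * saturate = true  models the strict, saturating instruction.
 * saturate = false models a relaxed implementation that wraps instead.
 * The shift relies on arithmetic right shift of negative values. */
int16_t q15mulr_model(int16_t a, int16_t b, bool saturate) {
    int32_t r = ((int32_t)a * (int32_t)b + 0x4000) >> 15;
    if (r > INT16_MAX) {
        /* Only reachable for a == b == INT16_MIN, where r == 32768. */
        return saturate ? INT16_MAX : INT16_MIN;
    }
    return (int16_t)r;
}
```

If an emulated instruction can receive that input pair, it needs an explicit fix-up after the relaxed instruction or should use the strict one.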
Threading + SIMD Combined
The real power comes from combining SIMD with the threading model.
Expected Performance Budget
For a typical N64 game frame (16.67ms budget at 60fps):
| Component | Current (scalar, 1 thread) | Ours (SIMD, multi-thread) |
|---|---|---|
| CPU emulation | 6ms | 6ms (no change) |
| RSP vector | 5ms | 1.5ms (SIMD) |
| RSP scalar | 1ms | 1ms (no change) |
| RDP/rendering | 3ms | 3ms (separate thread, runs in parallel) |
| Audio | 1ms | 0ms (separate thread) |
| Input/UI | 0.5ms | 0ms (main thread) |
| Total per frame | 16.5ms (barely 60fps) | ~8.5ms on the emulation thread (headroom!) |
With headroom, we can add post-processing shaders, recording, and other features without dropping frames.
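The budget arithmetic can be checked with a few lines of C. This is a sketch using the estimated figures from the table (they are projections, not measurements), and the struct and function names are hypothetical. It assumes that in the multi-threaded design only CPU emulation and the two RSP components stay on the emulation thread.

```c
/* Per-frame costs in milliseconds (estimates, not measurements). */
typedef struct {
    double cpu, rsp_vector, rsp_scalar, rdp, audio, input_ui;
} frame_cost_ms;

/* Single-threaded design: every component is on the critical path. */
double single_thread_total(frame_cost_ms c) {
    return c.cpu + c.rsp_vector + c.rsp_scalar + c.rdp + c.audio + c.input_ui;
}

/* Multi-threaded design (assumption): RDP, audio and input/UI run on
 * other threads, so only CPU and RSP work count toward the frame time. */
double emulation_thread_total(frame_cost_ms c) {
    return c.cpu + c.rsp_vector + c.rsp_scalar;
}

/* Time left in a 60 fps frame after `used_ms` of work. */
double headroom_ms(double used_ms) {
    return 1000.0 / 60.0 - used_ms;
}
```

With the table's numbers, the single-threaded total is 16.5 ms, and the emulation thread in the SIMD build needs 6 + 1.5 + 1 = 8.5 ms, leaving roughly 8 ms of headroom per frame.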
Browser Support
Baseline WASM SIMD is supported by 94.21% of browsers in use globally at the time of writing:
| Browser | Version | Since |
|---|---|---|
| Chrome | 91+ | May 2021 |
| Firefox | 89+ | June 2021 |
| Safari | 16.4+ | March 2023 |
| Edge | 91+ | May 2021 |
| Samsung Internet | 16.0+ | 2022 |
The remaining ~6% are IE, Opera Mini, and pre-2021 mobile browsers. For these we provide a scalar fallback, the same code path N64Wasm uses today, just slower.