WASM SIMD & Performance Optimization

The RSP is a SIMD processor. WASM has SIMD. Nobody connects them. We do.

The N64 RSP Vector Unit

The Reality Signal Processor is the N64's secret weapon. It handles:

Vertex transforms (3D math)
Lighting calculations
Audio mixing and resampling
Texture coordinate generation

It's a SIMD processor with 32 vector registers, each containing 8 x 16-bit elements:

Register layout (from mupen64plus-rsp-cxd4/vu/vu.h):

VR[0..31] = 8 x int16 = 128 bits per register

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │ e[0] │ e[1] │ e[2] │ e[3] │ e[4] │ e[5] │ e[6] │ e[7] │
   │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
   |__________________ 128 bits ____________________________|

WASM SIMD v128

WASM SIMD provides a v128 type — exactly 128 bits:

v128 (i16x8 interpretation):

   ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
   │lane 0│lane 1│lane 2│lane 3│lane 4│lane 5│lane 6│lane 7│
   │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │ i16  │
   └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
   |__________________ 128 bits ____________________________|

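The register layouts match exactly: one WASM SIMD instruction operates on all eight lanes of an RSP vector register at once. Only the RSP's per-lane accumulators, which are wider than 16 bits, need separate handling.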
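To make the correspondence concrete, here is a minimal C sketch of the register model. The names `rsp_vreg_t`, `RSP_VREG_BITS`, and `vreg_add_wrap` are hypothetical and not taken from the cxd4 source; the loop only illustrates the lane-wise work that a single 128-bit i16x8 add would cover.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one RSP vector register: 8 lanes of 16 bits. */
typedef struct {
    int16_t e[8];
} rsp_vreg_t;

/* Register width in bits; 128 is also the width of a WASM v128 value. */
enum { RSP_VREG_BITS = (int)(sizeof(rsp_vreg_t) * 8) };

/* Lane-wise wrapping add: the work one i16x8 add performs on a v128. */
static rsp_vreg_t vreg_add_wrap(rsp_vreg_t a, rsp_vreg_t b) {
    rsp_vreg_t r;
    for (size_t i = 0; i < 8; i++) {
        /* Add as unsigned so overflow wraps instead of being undefined. */
        r.e[i] = (int16_t)(uint16_t)((uint16_t)a.e[i] + (uint16_t)b.e[i]);
    }
    return r;
}
```

The scalar loop touches each lane in turn; a SIMD build would replace the whole loop body with one instruction.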

The SSE2 → WASM SIMD Pipeline

The RSP code in N64Wasm already has SSE2 implementations. Emscripten can compile these SSE2 intrinsics to WASM SIMD instructions:

Actual Code (from the N64Wasm source):

// mupen64plus-rsp-cxd4/vu/multiply.h
#ifdef ARCH_MIN_SSE2

static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    __m128i vs    = _mm_load_si128((__m128i*)VS);
    __m128i vt    = _mm_load_si128((__m128i*)VT);
    __m128i lo    = _mm_mullo_epi16(vs, vt);
    __m128i hi    = _mm_mulhi_epi16(vs, vt);
    __m128i sign  = _mm_srai_epi16(lo, 15);
    __m128i prod  = _mm_add_epi16(hi, hi);
    prod = _mm_sub_epi16(prod, sign);
    // ... accumulator update ...
    _mm_store_si128((__m128i*)VD, result);
}

#else
// SCALAR FALLBACK (what N64Wasm currently uses!)
static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)VS[i] * (int32_t)VT[i];
        // ... 8 iterations of scalar math ...
        VD[i] = result;
    }
}
#endif
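The excerpt above elides the accumulator update, so the following is a standalone scalar reference rather than a completion of it. It assumes the commonly documented VMULF semantics (accumulator = vs * vt * 2 + 0x8000, result = bits 16 and up, clamped to signed 16-bit). The names `clamp_s16`, `vmulf_lane`, and `vmulf_ref` are hypothetical. A reference like this can be useful for checking a SIMD path lane by lane.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturate a wide intermediate to the signed 16-bit range. */
static int16_t clamp_s16(int64_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* One lane of VMULF under the assumed semantics:
 *   acc = vs * vt * 2 + 0x8000, result = clamp(acc >> 16).
 * acc_out (optional) receives the modeled per-lane accumulator value. */
static int16_t vmulf_lane(int16_t vs, int16_t vt, int64_t *acc_out) {
    int64_t acc = (int64_t)vs * (int64_t)vt * 2 + 0x8000;
    if (acc_out != NULL) {
        *acc_out = acc;
    }
    /* Assumes >> on a negative value is an arithmetic shift, as on
     * mainstream compilers (the C standard leaves it implementation-defined). */
    return clamp_s16(acc >> 16);
}

/* All eight lanes, mirroring the shape of the scalar fallback. */
static void vmulf_ref(int16_t vd[8], const int16_t vs[8], const int16_t vt[8]) {
    for (int i = 0; i < 8; i++) {
        vd[i] = vmulf_lane(vs[i], vt[i], NULL);
    }
}
```

The only input pair that reaches the clamp is -32768 * -32768, which is the edge case any SIMD replacement has to reproduce.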

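N64Wasm builds with -DNOSSE, forcing the scalar fallback. Our build removes that flag, defines ARCH_MIN_SSE2, and enables WASM SIMD. Emscripten only compiles SSE2 intrinsics when -msse2 is passed together with -msimd128, so both are required.

Build Flag Changes

# N64Wasm Makefile diff
- EMCC_FLAGS += -DNOSSE
- EMCC_FLAGS += -DNO_ASM
+ EMCC_FLAGS += -msimd128
+ EMCC_FLAGS += -msse2
+ EMCC_FLAGS += -mrelaxed-simd
+ EMCC_FLAGS += -DARCH_MIN_SSE2

That's it. We expect this handful of flag changes to unlock 2-4x RSP performance.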
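Because a wrong flag silently selects the scalar path, a small build-time check can help confirm which path was compiled. The sketch below is hypothetical: the function names are illustrative, and it assumes the source gates on `ARCH_MIN_SSE2` and `NOSSE` as described above. `__wasm_simd128__` is the macro Clang defines when -msimd128 is active.

```c
/* Hypothetical sanity check: reports which RSP vector path the
 * preprocessor selects for the current set of build flags. */
static const char *rsp_vector_path(void) {
#if defined(ARCH_MIN_SSE2) && !defined(NOSSE)
    return "sse2";   /* SSE2 intrinsics, lowered to WASM SIMD by Emscripten */
#else
    return "scalar"; /* per-lane loop fallback */
#endif
}

/* Returns 1 when the compiler is targeting WASM SIMD, else 0. */
static int built_with_wasm_simd(void) {
#if defined(__wasm_simd128__)
    return 1;
#else
    return 0;
#endif
}
```

Logging both values once at startup makes it obvious when a build has fallen back to scalar code.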

RSP Operations Affected

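Most RSP vector instructions map onto a handful of SIMD operations. The speedups below are per-instruction estimates based on operation counts, not measurements; the overall RSP gain is lower because loads, stores, and flag handling remain:

Instruction	Operation	Scalar Ops	SIMD Ops	Est. Speedup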
VMULF	Vector multiply (frac)	8 muls + 8 shifts + 8 adds	3 SIMD ops	~8x
VADD	Vector add	8 adds + 8 clamps	2 SIMD ops	~4x
VSUB	Vector subtract	8 subs + 8 clamps	2 SIMD ops	~4x
VMACF	Multiply-accumulate	8 muls + 16 adds	4 SIMD ops	~6x
VAND/VOR/VXOR	Logical ops	8 ops	1 SIMD op	~8x
VCH/VCL/VCR	Compare/clip	8 compares	1-2 SIMD ops	~4x
VMRG	Merge	8 selects	1 blend op	~8x
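As an illustration of the VADD row, the sketch below pairs a scalar saturating add with an SSE2 version built on `_mm_adds_epi16`, which Emscripten lowers to a saturating i16x8 add. It is simplified: the real VADD also adds a carry-in bit per lane from the carry flags, which is omitted here. The function names are hypothetical, and the SSE2 path is compiled only where `__SSE2__` is defined.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar path: 8 adds + 8 clamps, one lane at a time. */
static void vadd_scalar(int16_t vd[8], const int16_t vs[8], const int16_t vt[8]) {
    for (int i = 0; i < 8; i++) {
        int32_t sum = (int32_t)vs[i] + (int32_t)vt[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        vd[i] = (int16_t)sum;
    }
}

/* SIMD path: one saturating add covers all 8 lanes.
 * Falls back to the scalar loop when SSE2 is unavailable. */
static void vadd_simd(int16_t vd[8], const int16_t vs[8], const int16_t vt[8]) {
#if defined(__SSE2__)
    /* Unaligned loads/stores so callers need no 16-byte alignment. */
    __m128i a = _mm_loadu_si128((const __m128i *)vs);
    __m128i b = _mm_loadu_si128((const __m128i *)vt);
    _mm_storeu_si128((__m128i *)vd, _mm_adds_epi16(a, b));
#else
    vadd_scalar(vd, vs, vt);
#endif
}
```

Comparing the two functions over the same inputs is a cheap regression test for each instruction that gets a SIMD path.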

Relaxed SIMD

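The -mrelaxed-simd flag enables the relaxed SIMD instructions, whose results may differ between CPUs in a few narrow edge cases so that the engine can emit the fastest native instruction:

// Standard SIMD: every lane result is fully specified on every CPU
// Relaxed SIMD: a few edge cases are implementation-defined
//   (NaN handling and fused multiply-add rounding for floats,
//    out-of-range indices for swizzles, INT16_MIN * INT16_MIN for q15mulr)

// The RSP vector unit is 16-bit fixed-point, so the float cases never arise;
// only the integer cases can affect emulated results

Relaxed SIMD is free performance only where those edge cases cannot change what the game sees, so each relaxed instruction should be checked against RSP behavior before it is used. Browser support for relaxed SIMD is also narrower than for baseline WASM SIMD, so it needs its own feature check and fallback.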
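One integer case is worth spelling out. The strict instruction i16x8.q15mulr_sat_s computes (a * b + 0x4000) >> 15 with saturation, while its relaxed counterpart may return either INT16_MIN or INT16_MAX when both inputs are INT16_MIN. The sketch below models the strict form in scalar C and compares it with the VMULF result, assuming the commonly documented VMULF semantics (vs * vt * 2 + 0x8000, upper bits clamped). All function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of one lane of i16x8.q15mulr_sat_s.
 * Assumes >> on negative values is an arithmetic shift. */
static int16_t q15mulr_sat(int16_t a, int16_t b) {
    int32_t r = ((int32_t)a * (int32_t)b + 0x4000) >> 15;
    /* Only INT16_MIN * INT16_MIN overflows, giving 32768. */
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* One lane of the VMULF result under the semantics assumed here. */
static int16_t vmulf_result(int16_t vs, int16_t vt) {
    int64_t acc = (int64_t)vs * (int64_t)vt * 2 + 0x8000;
    int64_t r = acc >> 16;
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Counts disagreements over a strided sample of input pairs. */
static long count_mismatches(int stride) {
    long bad = 0;
    for (int a = INT16_MIN; a <= INT16_MAX; a += stride) {
        for (int b = INT16_MIN; b <= INT16_MAX; b += stride) {
            if (q15mulr_sat((int16_t)a, (int16_t)b) !=
                vmulf_result((int16_t)a, (int16_t)b)) {
                bad++;
            }
        }
    }
    return bad;
}
```

Under these assumptions the strict instruction agrees with VMULF on every sampled pair, including the saturating one. The relaxed variant cannot guarantee that single pair, so it is only safe if that input is ruled out or handled separately.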

Threading + SIMD Combined

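The real power is combining SIMD with the threading model: SIMD shortens the RSP vector work, while moving rendering and audio to separate threads takes them off the emulation thread's critical path.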

Expected Performance Budget

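For a typical N64 game frame (16.67ms budget at 60fps). These figures are estimates, not measurements, and work moved to another thread no longer counts against the emulation thread's total: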

Component	Current (scalar, 1 thread)	Ours (SIMD, multi-thread)
CPU emulation	6ms	6ms (no change)
RSP vector	5ms	1.5ms (SIMD)
RSP scalar	1ms	1ms (no change)
RDP/rendering	3ms	3ms (separate thread)
Audio	1ms	0ms (separate thread)
Input/UI	0.5ms	0ms (main thread)
Total per frame	16.5ms (barely 60fps)	~8.5ms on the emulation thread (headroom!)

With headroom, we can add post-processing shaders, recording, and other features without dropping frames.
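The arithmetic behind the multi-thread column can be sketched as follows. The per-component figures are the estimates from the table. It is an assumption of this sketch that CPU emulation and RSP work share one emulation thread, and that audio and input still cost their original 1ms and 0.5ms on their own threads. All names are hypothetical.

```c
#include <stddef.h>

/* One budget line: estimated cost and whether it stays on the
 * emulation thread (1) or runs on another thread (0). */
typedef struct {
    const char *name;
    double ms;
    int on_emulation_thread;
} budget_item_t;

static const budget_item_t kSimdMultiThread[] = {
    {"CPU emulation", 6.0, 1},
    {"RSP vector",    1.5, 1},
    {"RSP scalar",    1.0, 1},
    {"RDP/rendering", 3.0, 0},
    {"Audio",         1.0, 0},
    {"Input/UI",      0.5, 0},
};

/* Sum of the work that still runs serially on the emulation thread. */
static double critical_path_ms(const budget_item_t *items, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (items[i].on_emulation_thread) {
            total += items[i].ms;
        }
    }
    return total;
}

/* Time left in a 60fps frame after the critical path. */
static double headroom_ms(double critical_ms) {
    return 1000.0 / 60.0 - critical_ms;
}
```

With these inputs the emulation thread carries 8.5ms per frame, leaving a little over 8ms of the 16.67ms budget, provided no other thread exceeds the frame time on its own.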

Browser Support

WASM SIMD is supported by browsers covering 94.21% of global usage:

Browser	Version	Since
Chrome	91+	May 2021
Firefox	89+	June 2021
Safari	16.4+	March 2023
Edge	91+	May 2021
Samsung Internet	16.0+	2022

The remaining ~6% are IE, Opera Mini, and pre-2021 mobile browsers. We provide a scalar fallback for these (same as current N64Wasm — just slower).
