# WASM Core Engine
The engine under the hood — N64Wasm rebuilt with every modern optimization enabled.
## The Baseline Problem
N64Wasm (our fork base) compiles with these flags:
```makefile
# Current N64Wasm Makefile (2021)
EMCC_FLAGS = -O3 -flto \
    -DNOSSE \            # <-- DISABLES ALL SIMD!
    -DNO_ASM \           # <-- DISABLES ASM OPTIMIZATIONS!
    -s TOTAL_MEMORY=536870912 \
    -s ASSERTIONS=0
```
This means the entire RSP vector unit — the N64's SIMD coprocessor that handles vertex transforms, lighting, and audio mixing — is being emulated with scalar C code. The RSP processes 8 x 16-bit values in parallel. WASM SIMD provides exactly 128-bit vectors (8 x i16). The match is nearly 1:1.
## Our Build Configuration
```makefile
# N64.wasm optimized build
EMCC_FLAGS = -O3 -flto \
    -msimd128 \               # Enable WASM SIMD (128-bit)
    -mrelaxed-simd \          # Relaxed SIMD for extra perf
    -pthread \                # Enable Web Worker threading
    -sPTHREAD_POOL_SIZE=4 \   # Pre-create worker pool
    -sPROXY_TO_PTHREAD \      # Move main() to a worker (frees the main thread)
    -sMALLOC=mimalloc \       # Thread-safe allocator
    -sALLOW_MEMORY_GROWTH=0 \ # Fixed memory (faster than growable)
    -sTOTAL_MEMORY=536870912 \
    -sASSERTIONS=0 \
    -sENVIRONMENT=web,worker \
    -fno-exceptions \
    --pre-js=pre.js           # AudioWorklet + OffscreenCanvas setup
```
## SIMD: The Biggest Win

### How the N64 RSP Works

The Reality Signal Processor has 32 vector registers, each containing 8 x 16-bit elements:
```
VR[0] = [ elem0 | elem1 | elem2 | elem3 | elem4 | elem5 | elem6 | elem7 ]
         16-bit  16-bit  16-bit  16-bit  16-bit  16-bit  16-bit  16-bit
        |___________________________128 bits____________________________|
```
### How WASM SIMD Maps To It

```
WASM v128 = [ i16x8 lane0 | lane1 | lane2 | lane3 | lane4 | lane5 | lane6 | lane7 ]
            |______________________________128 bits______________________________|
```

It's the same width. Almost every RSP vector operation maps directly onto a WASM SIMD instruction; the main exceptions are ops that touch the RSP's 48-bit per-lane accumulators, which take a few instructions each (see the table below).
### The Code Already Exists
The mupen64plus-rsp-cxd4 source (inside N64Wasm) already has SSE2 implementations of every RSP vector operation:
```c
// vu/multiply.h (already in N64Wasm source)
#ifdef ARCH_MIN_SSE2
#include <emmintrin.h>

static INLINE void do_vmulf(short* VD, short* VS, short* VT)
{
    __m128i vs = _mm_load_si128((__m128i *)VS);
    __m128i vt = _mm_load_si128((__m128i *)VT);
    __m128i lo = _mm_mullo_epi16(vs, vt);
    __m128i hi = _mm_mulhi_epi16(vs, vt);
    __m128i result;
    // ... accumulator logic with SIMD computes `result` from hi/lo
    _mm_store_si128((__m128i *)VD, result);
}
#endif
```
Emscripten compiles SSE2 intrinsics directly to WASM SIMD — we remove -DNOSSE and build with -msimd128 plus -msse2 (the latter enables the `<emmintrin.h>` compatibility headers). The RSP code compiles essentially unchanged.
### Expected Speedup
| Component | Scalar (current) | SIMD (ours) | Speedup |
|---|---|---|---|
| Vector multiply (VMULF) | 8 ops | 1 SIMD op | ~8x theoretical |
| Vector add (VADD) | 8 ops | 1 SIMD op | ~8x theoretical |
| Accumulator update | 24 ops | 3 SIMD ops | ~8x theoretical |
| Practical RSP speedup | — | — | 2-4x |
| Overall game speedup | — | — | 30-60% |
The practical speedup is lower than the theoretical one because not all RSP time is vector math — there are also scalar operations, memory accesses, and pipeline management.
## Interpreter vs JIT

### The JIT Problem in WASM
Traditional N64 emulators (Project64, mupen64plus native) use dynamic recompilation — translating MIPS machine code to native x86/ARM at runtime. This provides 5-20x speedup over interpretation.
WASM cannot do this. WASM code and data live in separate address spaces (a Harvard-style architecture), so a module cannot generate executable code at runtime and jump into it.
### Our Approach: Fast Cached Interpreter + SIMD Compensation
The key insight: SIMD compensates for the JIT loss on vector-heavy workloads. Most N64 games spend 30-50% of their time in RSP vector operations. Getting 2-4x speedup on that component significantly closes the gap with native JIT emulators.
### Future: Late-Linking JIT
There's a technique (documented by Andy Wingo at wingolog.org) that enables a form of JIT in WASM:
- Generate a new WASM module at runtime containing translated MIPS blocks
- Instantiate it via `WebAssembly.instantiate()`, importing the shared memory
- New functions become callable via `call_indirect` through a shared function table
This has much higher overhead than native JIT (async compilation, module instantiation cost), but could provide 2-5x speedup on CPU-heavy games. This is a Phase 4+ optimization.
## Memory Layout
```
SharedArrayBuffer (512MB total)
├── [0x00000000 - 0x007FFFFF] RDRAM (8MB)
├── [0x00800000 - 0x03FFFFFF] ROM Space (up to 56MB)
├── [0x04000000 - 0x04000FFF] RSP DMEM (4KB)
├── [0x04001000 - 0x04001FFF] RSP IMEM (4KB)
├── [0x10000000 - 0x100FFFFF] Frame Buffer (1MB, double-buffered)
├── [0x10100000 - 0x10103FFF] Audio Ring Buffer (16KB)
├── [0x10104000 - 0x1010403F] Input State (64 bytes)
├── [0x10104040 - 0x1012403F] Save RAM (128KB)
├── [0x10124040 - 0x1FFFFFFF] WASM Heap (emulator internal state)
└── [0x20000000 - ...] Stack + Globals
```
## Emscripten Version Upgrade Impact
Upgrading from Emscripten 2.0.7 (2021) to 3.x (2026):
| Improvement | Impact |
|---|---|
| LLVM backend upgrades | Better codegen, 10-15% faster |
| SIMD support matured | More optimizations available |
| pthread improvements | Lower sync overhead |
| Binaryen wasm-opt | Better dead code elimination |
| Source maps | Debuggable WASM |
| Smaller JS glue | Faster load |
| WASM Exception Handling | Faster TLB miss paths |
## Browser Compatibility
| Feature | Chrome | Firefox | Safari | Edge |
|---|---|---|---|---|
| WASM SIMD | 91+ | 89+ | 16.4+ | 91+ |
| SharedArrayBuffer | 68+ | 79+ | 15.2+ | 79+ |
| OffscreenCanvas | 69+ | 105+ | 16.4+ | 79+ |
| AudioWorklet | 66+ | 76+ | 14.1+ | 79+ |
| WebGL2 | 56+ | 51+ | 15+ | 79+ |
| All combined | 91+ | 105+ | 16.4+ | 91+ |
:::info GLOBAL SUPPORT
~94% of all browsers support every feature we need. The remaining 6% are IE, Opera Mini, and very old mobile browsers.
:::