
WASM Core Engine

The engine under the hood — N64Wasm rebuilt with every modern optimization enabled.


The Baseline Problem

N64Wasm (our fork base) compiles with these flags:

# Current N64Wasm Makefile (2021)
# -DNOSSE disables all SIMD paths; -DNO_ASM disables the assembly optimizations.
EMCC_FLAGS = -O3 -flto \
  -DNOSSE \
  -DNO_ASM \
  -s TOTAL_MEMORY=536870912 \
  -s ASSERTIONS=0

This means the entire RSP vector unit — the N64's SIMD coprocessor that handles vertex transforms, lighting, and audio mixing — is being emulated with scalar C code. The RSP processes 8 x 16-bit values in parallel. WASM SIMD provides exactly 128-bit vectors (8 x i16). The match is nearly 1:1.


Our Build Configuration

# N64.wasm optimized build
EMCC_FLAGS  = -O3 -flto
EMCC_FLAGS += -msimd128                # Enable WASM SIMD (128-bit)
EMCC_FLAGS += -msse2                   # Needed so Emscripten accepts the SSE2 intrinsics in <emmintrin.h>
EMCC_FLAGS += -mrelaxed-simd           # Relaxed SIMD for extra perf
EMCC_FLAGS += -pthread                 # Enable Web Worker threading
EMCC_FLAGS += -sPTHREAD_POOL_SIZE=4    # Pre-create worker pool
EMCC_FLAGS += -sPROXY_TO_PTHREAD       # Move main() to a worker (frees the main thread)
EMCC_FLAGS += -sMALLOC=mimalloc        # Thread-safe allocator
EMCC_FLAGS += -sALLOW_MEMORY_GROWTH=0  # Fixed memory (faster than growable)
EMCC_FLAGS += -sTOTAL_MEMORY=536870912
EMCC_FLAGS += -sASSERTIONS=0
EMCC_FLAGS += -sENVIRONMENT=web,worker
EMCC_FLAGS += -fno-exceptions
EMCC_FLAGS += --pre-js=pre.js          # AudioWorklet + OffscreenCanvas setup

SIMD: The Biggest Win

How the N64 RSP Works

The Reality Signal Processor has 32 vector registers, each containing 8 x 16-bit elements:

VR[0] = [ elem0  | elem1  | elem2  | elem3  | elem4  | elem5  | elem6  | elem7  ]
          16-bit   16-bit   16-bit   16-bit   16-bit   16-bit   16-bit   16-bit
        |_______________________________128 bits________________________________|

How WASM SIMD Maps To It

WASM v128 = [ i16x8 lane0 | lane1 | lane2 | lane3 | lane4 | lane5 | lane6 | lane7 ]
|_________________________128 bits___________________________|

It's the same width, so each RSP vector register fits in a single v128. Most RSP vector operations then map to one WASM SIMD instruction or to a short sequence of them.
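To make the lane correspondence concrete, here is a minimal sketch in C. The names `rsp_vreg`, `add_lanes_scalar`, and `add_lanes_vector` are hypothetical and not taken from the N64Wasm source. It uses the GCC/Clang `vector_size` extension, which Clang should lower to a single `i16x8.add` when `-msimd128` is enabled. The add is a plain wrapping add, not the full RSP VADD (which also handles carry and clamping).

```c
#include <assert.h>
#include <stdint.h>

/* One RSP-sized vector: eight 16-bit lanes, 128 bits in total.
   vector_size is a GCC/Clang extension; lanes are kept unsigned here so
   that lane-wise addition wraps with fully defined behavior. */
typedef uint16_t rsp_vreg __attribute__((vector_size(16)));

static_assert(sizeof(rsp_vreg) == 16, "an RSP vector register is 128 bits");

/* Scalar form: eight separate 16-bit wrapping adds. */
static void add_lanes_scalar(uint16_t out[8], const uint16_t a[8],
                             const uint16_t b[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = (uint16_t)(a[i] + b[i]);
    }
}

/* Vector form: one lane-wise add over the whole register. */
static rsp_vreg add_lanes_vector(rsp_vreg a, rsp_vreg b) {
    return a + b;
}
```

Both functions produce the same eight lanes; the difference is that the second expresses the work as one operation the compiler can map to a single SIMD instruction.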

The Code Already Exists

The mupen64plus-rsp-cxd4 source (inside N64Wasm) already has SSE2 implementations of every RSP vector operation:

// vu/multiply.h (already in N64Wasm source)
#ifdef ARCH_MIN_SSE2
#include <emmintrin.h>

static INLINE void do_vmulf(short* VD, short* VS, short* VT) {
    __m128i vs = _mm_load_si128((__m128i*)VS);  // 8 x 16-bit source lanes
    __m128i vt = _mm_load_si128((__m128i*)VT);  // 8 x 16-bit target lanes
    __m128i lo = _mm_mullo_epi16(vs, vt);       // low 16 bits of each product
    __m128i hi = _mm_mulhi_epi16(vs, vt);       // high 16 bits of each signed product
    // ... accumulator logic with SIMD
    _mm_store_si128((__m128i*)VD, result);
}
#endif

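Emscripten can compile SSE2 intrinsics to WASM SIMD: we remove -DNOSSE and build with -msimd128 -msse2. Most intrinsics lower to a single WASM SIMD instruction and the rest are emulated with short instruction sequences, so the RSP code compiles unchanged.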
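For comparison, the scalar path that the SIMD code replaces can be sketched as below. This is an illustrative reference, not the cxd4 implementation: `clamp_s16`, `vmulf_lane`, and `vmulf_scalar` are hypothetical names, the semantics follow the commonly documented VMULF behavior (signed fractional multiply, doubled, rounded, then clamped), and the sketch returns only the destination lanes without modelling the 48-bit accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a wide intermediate value to the signed 16-bit range. */
static int16_t clamp_s16(int64_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One lane of VMULF: (s * t * 2 + 0x8000), then take bits 16 and up.
   The right shift of a negative value is assumed to be arithmetic,
   which holds for GCC, Clang, and Emscripten. */
static int16_t vmulf_lane(int16_t s, int16_t t) {
    int64_t acc = (int64_t)s * (int64_t)t * 2 + 0x8000;
    return clamp_s16(acc >> 16);
}

/* Scalar VMULF over a full register: eight independent lane operations. */
static void vmulf_scalar(int16_t vd[8], const int16_t vs[8],
                         const int16_t vt[8]) {
    assert(vd && vs && vt);
    for (int i = 0; i < 8; i++) {
        vd[i] = vmulf_lane(vs[i], vt[i]);
    }
}
```

The only input pair that needs the clamp is -32768 x -32768, whose doubled product does not fit in 16 bits after the shift.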

Expected Speedup

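| Component               | Scalar (current) | SIMD (ours) | Speedup         |
|-------------------------|------------------|-------------|-----------------|
| Vector multiply (VMULF) | 8 ops            | 1 SIMD op   | ~8x theoretical |
| Vector add (VADD)       | 8 ops            | 1 SIMD op   | ~8x theoretical |
| Accumulator update      | 24 ops           | 3 SIMD ops  | ~8x theoretical |
| Practical RSP speedup   |                  |             | 2-4x            |
| Overall game speedup    |                  |             | 30-60%          |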

The practical speedup is lower than the theoretical one because not all RSP time is vector math: there are also scalar operations, memory accesses, and pipeline management.
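The relationship between the component speedup and the overall speedup can be estimated with Amdahl's law. The worked numbers below are a rough check, assuming the 2-4x figure applies to the 30-50% share of frame time that this page attributes to RSP vector operations; $p$ and $s$ are symbols introduced here for illustration.

```latex
% p = fraction of total time spent in the accelerated component
% s = speedup of that component
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}

% Pessimistic end: p = 0.3, s = 2
S_{\text{overall}} = \frac{1}{0.7 + 0.15} = \frac{1}{0.85} \approx 1.18

% Optimistic end: p = 0.5, s = 4
S_{\text{overall}} = \frac{1}{0.5 + 0.125} = \frac{1}{0.625} = 1.60
```

Under these assumptions the overall gain lands between roughly 18% and 60%, so the top of the 30-60% range in the table corresponds to the most favorable combination of both inputs.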


Interpreter vs JIT

The JIT Problem in WASM

Traditional N64 emulators (Project64, mupen64plus native) use dynamic recompilation: they translate MIPS machine code to native x86/ARM at runtime, which provides a 5-20x speedup over interpretation.

WASM cannot do this directly. Code and data are separated (a Harvard-style architecture), so a running module cannot write instructions into its own memory and jump to them.

Our Approach: Fast Cached Interpreter + SIMD Compensation

The key insight: SIMD compensates for the JIT loss on vector-heavy workloads. Most N64 games spend 30-50% of their time in RSP vector operations, so a 2-4x speedup on that component closes a meaningful part of the gap with native JIT emulators.
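The "cached" part of a cached interpreter can be sketched as follows: each instruction word is decoded once into a handler pointer plus pre-extracted operands, and later visits to the same address skip decoding. This is a toy model under stated assumptions, not the N64.wasm interpreter. All names (`cpu_t`, `decoded_t`, `icache_t`, `run`) are hypothetical, only ADDIU and ADDU are handled, any other word halts, and cache invalidation on self-modifying code is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t gpr[32]; uint32_t pc; int halted; } cpu_t;

typedef struct decoded decoded_t;
typedef void (*handler_fn)(cpu_t *, const decoded_t *);
struct decoded { handler_fn fn; uint8_t rs, rt, rd; int16_t imm; };

/* ADDIU rt, rs, imm: add a sign-extended immediate (register 0 stays zero). */
static void op_addiu(cpu_t *c, const decoded_t *d) {
    if (d->rt != 0) c->gpr[d->rt] = c->gpr[d->rs] + (uint32_t)(int32_t)d->imm;
    c->pc += 4;
}

/* ADDU rd, rs, rt: wrapping 32-bit register add. */
static void op_addu(cpu_t *c, const decoded_t *d) {
    if (d->rd != 0) c->gpr[d->rd] = c->gpr[d->rs] + c->gpr[d->rt];
    c->pc += 4;
}

/* Anything this sketch does not understand stops the loop. */
static void op_halt(cpu_t *c, const decoded_t *d) { (void)d; c->halted = 1; }

/* Decode one MIPS word into a handler and its operand fields. */
static decoded_t decode(uint32_t word) {
    decoded_t d;
    uint32_t opcode = word >> 26, funct = word & 0x3Fu;
    d.rs = (uint8_t)((word >> 21) & 31u);
    d.rt = (uint8_t)((word >> 16) & 31u);
    d.rd = (uint8_t)((word >> 11) & 31u);
    d.imm = (int16_t)(word & 0xFFFFu);
    if (opcode == 0x09u) d.fn = op_addiu;
    else if (opcode == 0u && funct == 0x21u) d.fn = op_addu;
    else d.fn = op_halt;
    return d;
}

#define CACHE_WORDS 64
typedef struct { decoded_t slot[CACHE_WORDS]; uint8_t valid[CACHE_WORDS]; } icache_t;

/* Dispatch loop: decode on first visit, reuse the cached entry afterwards. */
static void run(cpu_t *c, const uint32_t *code, size_t n_words, icache_t *ic) {
    while (!c->halted) {
        size_t idx = c->pc / 4;
        assert(idx < n_words && idx < CACHE_WORDS);
        if (!ic->valid[idx]) {
            ic->slot[idx] = decode(code[idx]);
            ic->valid[idx] = 1;
        }
        ic->slot[idx].fn(c, &ic->slot[idx]);
    }
}
```

The saving comes from loops: a block that executes thousands of times pays the field extraction and opcode switch only once per instruction.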

Future: Late-Linking JIT

There's a technique (documented by Andy Wingo at wingolog.org) that enables a form of JIT in WASM:

  1. Generate a new WASM module at runtime containing translated MIPS blocks
  2. Instantiate it via WebAssembly.instantiate(), importing shared memory
  3. New functions become callable via call_indirect

This has much higher overhead than native JIT (async compilation, module instantiation cost), but could provide 2-5x speedup on CPU-heavy games. This is a Phase 4+ optimization.
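Step 1 above amounts to writing a WebAssembly binary into a byte buffer. The sketch below is a toy illustration of that idea, not the planned implementation: it emits a complete, minimal module whose single exported function `blk` takes one i32 (standing in for a MIPS source register) and returns it plus an immediate, the way a translated ADDIU would. The names `put_sleb128`, `MODULE_PREFIX`, and `emit_addiu_block` are hypothetical, and the shared-memory import from step 2 is left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signed LEB128, the encoding WebAssembly uses for i32.const immediates.
   Assumes arithmetic right shift of negative values (GCC/Clang behavior). */
static size_t put_sleb128(uint8_t *out, int32_t v) {
    size_t n = 0;
    for (;;) {
        uint8_t byte = (uint8_t)(v & 0x7F);
        v >>= 7;
        int done = (v == 0 && !(byte & 0x40)) || (v == -1 && (byte & 0x40));
        out[n++] = done ? byte : (uint8_t)(byte | 0x80);
        if (done) return n;
    }
}

/* Everything before the code section is constant for this block shape. */
static const uint8_t MODULE_PREFIX[] = {
    0x00, 0x61, 0x73, 0x6D,                            /* magic "\0asm"        */
    0x01, 0x00, 0x00, 0x00,                            /* binary version 1     */
    0x01, 0x06, 0x01, 0x60, 0x01, 0x7F, 0x01, 0x7F,    /* type 0: (i32) -> i32 */
    0x03, 0x02, 0x01, 0x00,                            /* function 0 : type 0  */
    0x07, 0x07, 0x01, 0x03, 'b', 'l', 'k', 0x00, 0x00, /* export "blk" = func 0 */
};

/* Largest possible output: a 16-bit immediate needs at most 3 LEB bytes. */
#define MAX_MODULE_BYTES (sizeof MODULE_PREFIX + 13)

/* Emit a module computing param0 + imm; returns the number of bytes written. */
static size_t emit_addiu_block(uint8_t *out, size_t cap, int16_t imm) {
    uint8_t leb[5];
    size_t n = put_sleb128(leb, imm);
    size_t body = 1 + 2 + 1 + n + 1 + 1; /* locals, local.get, const, add, end */
    size_t total = sizeof MODULE_PREFIX + 4 + body;
    uint8_t *p = out;
    assert(cap >= total);
    memcpy(p, MODULE_PREFIX, sizeof MODULE_PREFIX);
    p += sizeof MODULE_PREFIX;
    *p++ = 0x0A;                /* code section id                         */
    *p++ = (uint8_t)(2 + body); /* section size: count + body size + body  */
    *p++ = 0x01;                /* one function body                       */
    *p++ = (uint8_t)body;       /* size of that body                       */
    *p++ = 0x00;                /* no local declarations                   */
    *p++ = 0x20; *p++ = 0x00;   /* local.get 0                             */
    *p++ = 0x41;                /* i32.const <imm>                         */
    memcpy(p, leb, n);
    p += n;
    *p++ = 0x6A;                /* i32.add                                 */
    *p++ = 0x0B;                /* end                                     */
    return (size_t)(p - out);
}
```

The resulting bytes are what would be handed to WebAssembly.instantiate() on the JavaScript side in step 2. A real translator would emit many blocks per module to amortize the instantiation cost mentioned below.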


Memory Layout

SharedArrayBuffer (512MB total)
├── [0x00000000 - 0x007FFFFF] RDRAM (8MB)
├── [0x00800000 - 0x03FFFFFF] ROM Space (up to 56MB)
├── [0x04000000 - 0x04000FFF] RSP DMEM (4KB)
├── [0x04001000 - 0x04001FFF] RSP IMEM (4KB)
├── [0x10000000 - 0x100FFFFF] Frame Buffer (1MB, double-buffered)
├── [0x10100000 - 0x10103FFF] Audio Ring Buffer (16KB)
├── [0x10104000 - 0x1010403F] Input State (64 bytes)
├── [0x10104040 - 0x1012403F] Save RAM (128KB)
├── [0x10124040 - 0x1FFFFFFF] WASM Heap (emulator internal state)
└── [0x20000000 - ...] Stack + Globals
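One way to keep a layout like this honest is to encode it as a table and check it mechanically. The sketch below is illustrative only: `region_t`, `LAYOUT`, `layout_is_consistent`, and `find_region` are hypothetical names, and the final "Stack + Globals" entry is left out because the diagram gives no end address for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { const char *name; uint32_t first; uint32_t last; } region_t;

#define BUFFER_BYTES 0x20000000u /* 512 MB SharedArrayBuffer */
static_assert(BUFFER_BYTES == 512u * 1024u * 1024u, "512 MB total");

/* Regions copied from the diagram, in ascending address order. */
static const region_t LAYOUT[] = {
    {"RDRAM",             0x00000000u, 0x007FFFFFu},
    {"ROM space",         0x00800000u, 0x03FFFFFFu},
    {"RSP DMEM",          0x04000000u, 0x04000FFFu},
    {"RSP IMEM",          0x04001000u, 0x04001FFFu},
    {"Frame buffer",      0x10000000u, 0x100FFFFFu},
    {"Audio ring buffer", 0x10100000u, 0x10103FFFu},
    {"Input state",       0x10104000u, 0x1010403Fu},
    {"Save RAM",          0x10104040u, 0x1012403Fu},
    {"WASM heap",         0x10124040u, 0x1FFFFFFFu},
};
#define LAYOUT_COUNT (sizeof LAYOUT / sizeof LAYOUT[0])

/* Size of a region in bytes (both bounds are inclusive). */
static uint32_t region_size(const region_t *r) {
    return r->last - r->first + 1u;
}

/* Returns 1 if regions are well-formed, sorted, non-overlapping,
   and fully inside the buffer; 0 otherwise. */
static int layout_is_consistent(void) {
    for (size_t i = 0; i < LAYOUT_COUNT; i++) {
        if (LAYOUT[i].last < LAYOUT[i].first) return 0;
        if (LAYOUT[i].last >= BUFFER_BYTES) return 0;
        if (i > 0 && LAYOUT[i].first <= LAYOUT[i - 1].last) return 0;
    }
    return 1;
}

/* Linear lookup of the region containing addr, or NULL for unmapped gaps. */
static const region_t *find_region(uint32_t addr) {
    for (size_t i = 0; i < LAYOUT_COUNT; i++) {
        if (addr >= LAYOUT[i].first && addr <= LAYOUT[i].last) return &LAYOUT[i];
    }
    return NULL;
}
```

A check like `layout_is_consistent()` can run once at startup or in a unit test, so that moving one region cannot silently overlap another.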

Emscripten Version Upgrade Impact

Upgrading from Emscripten 2.0.7 (2021) to 3.x (2026):

| Improvement             | Impact                       |
|-------------------------|------------------------------|
| LLVM backend upgrades   | Better codegen, 10-15% faster |
| SIMD support matured    | More optimizations available |
| pthread improvements    | Lower sync overhead          |
| Binaryen wasm-opt       | Better dead code elimination |
| Source maps             | Debuggable WASM              |
| Smaller JS glue         | Faster load                  |
| WASM Exception Handling | Faster TLB miss paths        |

Browser Compatibility

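| Feature           | Chrome | Firefox | Safari | Edge |
|-------------------|--------|---------|--------|------|
| WASM SIMD         | 91+    | 89+     | 16.4+  | 91+  |
| SharedArrayBuffer | 68+    | 79+     | 15.2+  | 79+  |
| OffscreenCanvas   | 69+    | 105+    | 16.4+  | 79+  |
| AudioWorklet      | 66+    | 76+     | 14.1+  | 79+  |
| WebGL2            | 56+    | 51+     | 15+    | 79+  |
| All combined      | 91+    | 105+    | 16.4+  | 91+  |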

:::info GLOBAL SUPPORT
~94% of all browsers support every feature we need. The remaining ~6% are IE, Opera Mini, and very old mobile browsers.
:::