# WASM Core Engine
The engine under the hood — N64Wasm rebuilt with every modern optimization enabled.
## The Baseline Problem
N64Wasm (our fork base) compiles with these flags:
```makefile
# Current N64Wasm Makefile (2021)
EMCC_FLAGS = -O3 -flto \
    -DNOSSE \            # <-- DISABLES ALL SIMD!
    -DNO_ASM \           # <-- DISABLES ASM OPTIMIZATIONS!
    -s TOTAL_MEMORY=536870912 \
    -s ASSERTIONS=0
```
This means the entire RSP vector unit — the N64's SIMD coprocessor that handles vertex transforms, lighting, and audio mixing — is being emulated with scalar C code. The RSP processes 8 x 16-bit values in parallel. WASM SIMD provides exactly 128-bit vectors (8 x i16). The match is nearly 1:1.
## Our Build Configuration
```makefile
# N64.wasm optimized build
EMCC_FLAGS = -O3 -flto \
    -msimd128 \               # Enable WASM SIMD (128-bit)
    -mrelaxed-simd \          # Relaxed SIMD for extra perf
    -pthread \                # Enable Web Worker threading
    -sPTHREAD_POOL_SIZE=4 \   # Pre-create worker pool
    -sPROXY_TO_PTHREAD \      # Move main() to a worker (frees the main thread)
    -sMALLOC=mimalloc \       # Thread-safe allocator
    -sALLOW_MEMORY_GROWTH=0 \ # Fixed memory (faster than growable)
    -sTOTAL_MEMORY=536870912 \
    -sASSERTIONS=0 \
    -sENVIRONMENT=web,worker \
    -fno-exceptions \
    --pre-js=pre.js           # AudioWorklet + OffscreenCanvas setup
```
## SIMD: The Biggest Win

### How the N64 RSP Works

The Reality Signal Processor has 32 vector registers, each containing 8 x 16-bit elements:
```
VR[0] = [ elem0 | elem1 | elem2 | elem3 | elem4 | elem5 | elem6 | elem7 ]
         16-bit  16-bit  16-bit  16-bit  16-bit  16-bit  16-bit  16-bit
        |___________________________128 bits____________________________|
```
### How WASM SIMD Maps To It

```
WASM v128 = [ i16x8 lane0 | lane1 | lane2 | lane3 | lane4 | lane5 | lane6 | lane7 ]
            |______________________________128 bits______________________________|
```

It's the same width. Almost every RSP vector operation maps directly onto a WASM SIMD instruction; the main exceptions are ops that touch the RSP's 48-bit per-lane accumulators, which take a few instructions each (see the table below).
### The Code Already Exists
The mupen64plus-rsp-cxd4 source (inside N64Wasm) already has SSE2 implementations of every RSP vector operation:
```c
// vu/multiply.h (already in N64Wasm source)
#ifdef ARCH_MIN_SSE2
#include <emmintrin.h>

static INLINE void do_vmulf(short* VD, short* VS, short* VT)
{
    __m128i vs = _mm_load_si128((__m128i *)VS);
    __m128i vt = _mm_load_si128((__m128i *)VT);
    __m128i lo = _mm_mullo_epi16(vs, vt);
    __m128i hi = _mm_mulhi_epi16(vs, vt);
    __m128i result;
    // ... accumulator logic with SIMD computes `result` from hi/lo
    _mm_store_si128((__m128i *)VD, result);
}
#endif
```
Emscripten compiles SSE2 intrinsics directly to WASM SIMD — we remove -DNOSSE and build with -msimd128 plus -msse2 (the latter enables the `<emmintrin.h>` compatibility headers). The RSP code compiles essentially unchanged.
### Expected Speedup
| Component | Scalar (current) | SIMD (ours) | Speedup |
|---|---|---|---|
| Vector multiply (VMULF) | 8 ops | 1 SIMD op | ~8x theoretical |
| Vector add (VADD) | 8 ops | 1 SIMD op | ~8x theoretical |
| Accumulator update | 24 ops | 3 SIMD ops | ~8x theoretical |
| Practical RSP speedup | — | — | 2-4x |
| Overall game speedup | — | — | 30-60% |
The practical speedup is lower than the theoretical one because not all RSP time is vector math — there are also scalar operations, memory accesses, and pipeline management.
## Interpreter vs JIT

### The JIT Problem in WASM
Traditional N64 emulators (Project64, mupen64plus native) use dynamic recompilation — translating MIPS machine code to native x86/ARM at runtime. This provides 5-20x speedup over interpretation.
WASM cannot do this. WASM code and data live in separate address spaces (a Harvard-style architecture), so a module cannot generate executable code at runtime and jump into it.
### Our Approach: Fast Cached Interpreter + SIMD Compensation
The key insight: SIMD compensates for the JIT loss on vector-heavy workloads. Most N64 games spend 30-50% of their time in RSP vector operations. Getting 2-4x speedup on that component significantly closes the gap with native JIT emulators.
### Future: Late-Linking JIT
There's a technique (documented by Andy Wingo at wingolog.org) that enables a form of JIT in WASM:
- Generate a new WASM module at runtime containing translated MIPS blocks
- Instantiate it via `WebAssembly.instantiate()`, importing the shared memory
- New functions become callable via `call_indirect` through a shared function table
This has much higher overhead than native JIT (async compilation, module instantiation cost), but could provide 2-5x speedup on CPU-heavy games. This is a Phase 4+ optimization.
## Memory Layout
```
SharedArrayBuffer (512MB total)
├── [0x00000000 - 0x007FFFFF] RDRAM (8MB)
├── [0x00800000 - 0x03FFFFFF] ROM Space (up to 56MB)
├── [0x04000000 - 0x04000FFF] RSP DMEM (4KB)
├── [0x04001000 - 0x04001FFF] RSP IMEM (4KB)
├── [0x10000000 - 0x100FFFFF] Frame Buffer (1MB, double-buffered)
├── [0x10100000 - 0x10103FFF] Audio Ring Buffer (16KB)
├── [0x10104000 - 0x1010403F] Input State (64 bytes)
├── [0x10104040 - 0x1012403F] Save RAM (128KB)
├── [0x10124040 - 0x1FFFFFFF] WASM Heap (emulator internal state)
└── [0x20000000 - ...] Stack + Globals
```
## Emscripten Version Upgrade Impact
Upgrading from Emscripten 2.0.7 (2021) to 3.x (2026):
| Improvement | Impact |
|---|---|
| LLVM backend upgrades | Better codegen, 10-15% faster |
| SIMD support matured | More optimizations available |
| pthread improvements | Lower sync overhead |
| Binaryen wasm-opt | Better dead code elimination |
| Source maps | Debuggable WASM |
| Smaller JS glue | Faster load |
| WASM Exception Handling | Faster TLB miss paths |
## Browser Compatibility
| Feature | Chrome | Firefox | Safari | Edge |
|---|---|---|---|---|
| WASM SIMD | 91+ | 89+ | 16.4+ | 91+ |
| SharedArrayBuffer | 68+ | 79+ | 15.2+ | 79+ |
| OffscreenCanvas | 69+ | 105+ | 16.4+ | 79+ |
| AudioWorklet | 66+ | 76+ | 14.1+ | 79+ |
| WebGL2 | 56+ | 51+ | 15+ | 79+ |
| All combined | 91+ | 105+ | 16.4+ | 91+ |
:::info GLOBAL SUPPORT
~94% of all browsers support every feature we need. The remaining 6% are IE, Opera Mini, and very old mobile browsers.
:::