Benchmarking with WasmNow

Copy-and-patch Compilation is a fascinating way of constructing a baseline JIT by copying pre-compiled stencils of code fragments and patching the stencils to change embedded constants or addresses as needed. It offers a way of engineering native code generation that is only slightly more difficult to write and maintain than an interpreter, but offers potentially significant speedups by removing the interpretation overhead.

The percent improvement that one will see thus depends greatly upon the interpretation overhead, and thus the design of the virtual machine and how complex its instructions are. The more work packed into one VM instruction, the less time there is spent in the interpreter itself as overhead. Python 3.13 includes a JIT based on copy-and-patch, where each of the interpreter’s opcode handlers are compiled into a stencil. They see a 2%-9% speedup from the initial implementation, as a language with relatively complex instructions. Oppositely, alongside the Copy-and-Patch paper, WasmNow was published as a Copy-and-Patch based WASM engine. WASM has very simple instructions, and thus a large portion of execution time would be spent in an interpreter as overhead, and thus any form of native compilation can be highly effective.

The Copy-and-Patch paper included a comparison against V8’s Liftoff, and handwritten baseline JIT, and saw it being 4.9x-6.5x faster in compiling, and its code executing 39%-63% faster than Liftoff’s as well. This seemed surprising, as WASM instructions are very simple, so there’s not much an ahead-of-time compiler can do on optimization, and Copy-and-Patch is much more limited on what code it can generate per instruction. I’ve re-run this comparison (recreating Fig. 21 from the paper):

And thus we can clearly see: Liftoff isn’t slower in execution. It’s uniformly better. To be incredibly clear, this is not an accusation that the Copy-and-Patch paper was deceitful, it’s that we’re comparing two different things! I’m benchmarking Liftoff in 2025, and the paper looked at Liftoff in 2020. The V8 team has made great progress!

But these new numbers change some of how one should interpret results from the Copy-and-Patch paper. Copy-and-Patch forms a new point on the Pareto frontier of JIT design in compilation time versus execution time, not a direct improvement over handwritten baseline JITs. I’ve experimented with a number of WASM implementations, and using the following engines as representing their class of compilers:

Interpreter: Wizard Copy-and-Patch JIT: WasmNow Baseline JIT: Liftoff Optimizing JIT: TurboFan LLVM JIT: WAVM Native Compilation: wasm2c

Compilation Differences

We’re going to be pulling apart the generated code for one webassembly program. That is:

* Does a call
* Does a divide
* Does a branch

WasmNow’s Stack Safety

The prologue of every function Liftoff process includes a stack check. The prologue for WasmNow doesn’t have one. We can illustrate this with a simple recursive program:

recursive.wat
(module
  (func $main
    (call $main)
  )
  (export "_start" (func $main))
)

Liftoff’s Tier-Up Checks

Liftoff’s Inlining Stats

For Liftoff’s translation of (call $func), we see:

0x27dedc25d859    19  488b45e8             REX.W movq rax,[rbp-0x18]
0x27dedc25d85d    1d  83400702             addl [rax+0x7],0x2
0x27dedc25d861    21  e89af7ffff           call 0x27dedc25d000  (jump table)

What’s being loaded from rbp-0x18? Why is it being incremented? These instructions are generated in liftoff-compiler.cc as part of LiftoffCompiler::CallDirect where the code for a call is being generated:

// Update call counts for inlining.
if (v8_flags.wasm_inlining) {
  CODE_COMMENT("call wasm inlining");
  LiftoffRegister vector = asm_.GetUnusedRegister(kGpReg, {});
  asm_.Fill(vector, WasmLiftoffFrameConstants::kFeedbackVectorOffset,
          kIntPtrKind);
  asm_.IncrementSmi(vector,
                  wasm::ObjectAccess::ElementOffsetInTaggedFixedArray(
                      static_cast<int>(vector_slot)));
  // Warning: {vector} may be clobbered by {IncrementSmi}!
}

So, it’s loading and incrementing a counter from a feedback vector as part of gathering information from execution to feed into Turbofan’s inlining decisions. Do we need this? No! We’re not using Turbofan! Thus, we must pass --no-wasm-inlining --no-wasm-inlining-call-indirect to d8. Doing so removes these two instructions from every call yields the following changes in benchmark performance:

WasmNow’s Div Safety

WasmNow and Liftoff differ in terms of what code they generate for divide operations. We can see this by just having a very simple program implementing quotient = divident / divisor.

safediv.wat
(module
  (func $main
    (local $dividend i32)
    (local $divisor i32)
    (local $quotient i32)

    (local.set $dividend (i32.const 10))
    (local.set $divisor (i32.const 0))

    (local.get $dividend)
    (local.get $divisor)
    (i32.div_u)
    (local.set $quotient)
  )
  (export "_start" (func $main))
)

Running this program under WasmNow and Liftoff produces two notably different effects:

WasmNow
535409 Floating point exception(core dumped)
Liftoff
wasm-function[0]:0x31: RuntimeError: divide by zero
RuntimeError: divide by zero
    at wasm://wasm/0b909e92:wasm-function[0]:0x31
    at safediv.js:25:16

The reasoning for this is apparent from the assembly: Liftoff inserts a check that the divisor isn’t zero, and bails if it is. WasmNow divides unconditionally, and allows the program to crash.

WasmNow Liftoff

Prologue

0x40  xorl r12,r12
0x43  call 0x3aa76c01b170  (jump table)
0x48  REX.W subq rsp,0x8
0x4f  REX.W cmpq rsp,[r13-0x60]
0x53  jna 0x7a  <+0x3a>
0x59  xorl rax,rax
(local.set $dividend
  (i32.const 10))
0x00  mov ebp,0xa
0x05  mov [r13+0x8],ebp
0x5b  movl rcx,0xa
(local.set $divisor
  (i32.const 0))
0x0C  mov ebp,0x0
0x11  mov [r13+0x10],ebp
0x60  movl r10,rax
0x63  testl r10,r10
0x66  jz 0x85  <+0x45>
(local.get $dividend)
0x18  mov ebp,[r13+0x8]
(local.get $divisor)
0x1F  mov r12d,[r13+0x10]
(i32.div_u)
0x26  mov rax,rbp
0x29  xor edx,edx
0x2B  div r12d
0x6c  movl rax,rcx
0x6e  xorl rdx,rdx
0x70  divl r10
(local.set $quotient)
0x2E  mov rbp,rax
0x31  mov [r13+0x18],ebp
0x73  movl rcx,rax

Epilogue

0x38  ret
0x75  REX.W movq rsp,rbp
0x78  pop rbp
0x79  retl

Trailing

    OOL: WasmStackGuard
0x7a  call 0x3aa76c01b310  (jump table)
0x7f  REX.W movq rsi,[rbp-0x10]
0x83  jmp 0x59  <+0x19>
    OOL: ThrowWasmTrapDivByZero
0x85  call 0x3aa76c01b070  (jump table)
0x8a  nop

Who is correct? Unsurprisingly, not the program crash one.

Where the underlying operators are partial, the corresponding instruction will trap when the result is not defined.

idiv_u (i1, i2)

  • If i2 is 0, then the result is undefined.

  • Else, return the result of dividing i1 by i2, truncated toward zero

So WasmNow is benefitting in execution speed slightly by not emitting a test and branch on every divide. Let’s fix that.

--- a/fastinterp/wasm_int_binary_ops.cpp
+++ b/fastinterp/wasm_int_binary_ops.cpp
@@ -106,6 +106,11 @@ struct FIIntBinaryOpsImpl
         else if constexpr(operatorType == WasmIntBinaryOps::Div)
         {
             // TODO: signed overflow?
+            if (rhs == 0) {
+              // No need to save registers around this call.
+              typedef void(*ClobberRegsFunc)(void) [[clang::preserve_all]];
+              reinterpret_cast<ClobberRegsFunc>(1)();
+            }
             result = lhs / rhs;
         }
         else if constexpr(operatorType == WasmIntBinaryOps::Rem)

Now there’s a proper check and a call to some "trap" routine, which still crashes, but we’re looking at performance and not safety here. With this change, WasmNow instead generates the code:

0x00  mov ebp,0xa
0x05  mov [r13+0x8],ebp
0x0C  mov ebp,0x0
0x11  mov [r13+0x10],ebp
0x18  mov ebp,[r13+0x8]
0x1F  mov r12d,[r13+0x10]
0x26  mov rax,rbp
 ;; if divisor==0
0x29  test r12d,r12d
0x2C  jz 0x3b
 ;; else
0x2E  xor edx,edx
0x30  div r12d
0x33  mov rbp,rax
0x36  jmp 0x4f
 ;; then
0x3B  push rax
0x3C  mov ecx,0x1
0x41  call rcx
0x43  add rsp,byte +0x8
0x47  xor edx,edx
0x49  div r12d
0x4C  mov rbp,rax
 ;; endif
0x4F  mov [r13+0x18],ebp
0x56  ret

For some reason, clang wishes to duplicate the div instruction into both the then and else branches, but I’m not clear on if there’s any actual benefit to doing so here. We’re reliant on clang for codegen

This has the following impact on the PolyBenchC tests:

Graph go here

WasmNow Quirks

Being a research project, it got to the point of "I can run the specific benchmarks I’m targeting", and then stopped, so just to document the oddities if anyone else tries to play around with it in the future…​

Incorrect Ifs

It turns out that WasmNow doesn’t codegen the wasm (if …​) correctly.

(module
  (func $main
    (local $a i32)
    (i32.const 1)
    (if (then (local.set $a (i32.const 1)))
        (else (local.set $a (i32.const 2))))
  )
  (export "main" (func $main))
)
00000000  BD01000000        mov ebp,0x1
00000005  85ED              test ebp,ebp
00000007  0F840C000000      jz near 0x19 ;; if zero jump to else body
0000000D  BD01000000        mov ebp,0x1
00000012  4189AD08000000    mov [r13+0x8],ebp
                            ;; there should be a jmp over the else body here
00000019  BD02000000        mov ebp,0x2
0000001E  4189AD08000000    mov [r13+0x8],ebp
00000025  C3                ret

Why was this not noticed? Because PolyBenchC and Coremark only use br_if. So, we must thus also only use br_if.

Comparison

(module
  (func $collatz
    (local $n i32)
    (local $count i32)

    ;; Compute the number of steps required for 100 to converge.
    (local.set $n (i32.const 100))

    (loop $loop
      ;; If n is 1, return count
      (i32.eq (local.get $n) (i32.const 1))
      (if (then
        (return)
      ))

      ;; WasmNow appears to have a miscompilation where it omits the
      ;; jmp from the bottom of the then clause to skip the else, and
      ;; therefore infinite loops. Repeating the if twice works around it.
      ;; If n is even
      (i32.and (local.get $n) (i32.const 1))
      (if ;; n = n / 2
        (then (local.set $n (i32.div_u (local.get $n) (i32.const 2)))))
        ;; If n is odd, n = 3n + 1
      (i32.xor (i32.and (local.get $n) (i32.const 1)) (i32.const 1))
      (if (then (local.set $n (i32.add
                             (i32.mul (local.get $n) (i32.const 3))
                             (i32.const 1)))))

      ;; Increment count
      (local.set $count (i32.add (local.get $count) (i32.const 1)))

      ;; Repeat loop
      (br $loop)
    )
    (return)
  )
  (export "main" (func $collatz))
)

V8 Liftoff

d8 --print-code --liftoff --no-tier-up testcase.js
0x37c0dd747840  xorl r12,r12
0x37c0dd747843  call 0x37c0dd747170  (jump table)
0x37c0dd747848  REX.W subq rsp,0x10
0x37c0dd74784f  REX.W cmpq rsp,[r13-0x60]
0x37c0dd747853  jna 0x37c0dd7478f4  <+0xb4>
0x37c0dd747859  movl [rbp-0x24],0x64
0x37c0dd747860  movl [rbp-0x28],0x0
0x37c0dd747867  movl rax,[rbp-0x24]
0x37c0dd74786a  cmpl rax,0x1
0x37c0dd74786d  jnz 0x37c0dd747886  <+0x46>
0x37c0dd747873  REX.W movq r10,[rsi+0x57]
0x37c0dd747877  subl [r10],0x6f
0x37c0dd74787b  js 0x37c0dd747902  <+0xc2>
0x37c0dd747881  REX.W movq rsp,rbp
0x37c0dd747884  pop rbp
0x37c0dd747885  retl
0x37c0dd747886  movl rax,[rbp-0x24]
0x37c0dd747889  andl rax,0x1
0x37c0dd74788c  testl rax,rax
0x37c0dd74788e  jz 0x37c0dd7478ad  <+0x6d>
0x37c0dd747894  movl rax,[rbp-0x24]
0x37c0dd747897  movl rcx,0x2
0x37c0dd74789c  testl rcx,rcx
0x37c0dd74789e  jz 0x37c0dd747910  <+0xd0>
0x37c0dd7478a4  xorl rdx,rdx
0x37c0dd7478a6  divl rcx
0x37c0dd7478a8  jmp 0x37c0dd7478b0  <+0x70>
0x37c0dd7478ad  movl rax,[rbp-0x24]
0x37c0dd7478b0  movl rcx,rax
0x37c0dd7478b2  andl rcx,0x1
0x37c0dd7478b5  xorl rcx,0x1
0x37c0dd7478b8  testl rcx,rcx
0x37c0dd7478ba  jz 0x37c0dd7478d0  <+0x90>
0x37c0dd7478c0  movl rcx,0x3
0x37c0dd7478c5  imull rcx,rax
0x37c0dd7478c8  addl rcx,0x1
0x37c0dd7478cb  jmp 0x37c0dd7478d2  <+0x92>
0x37c0dd7478d0  movl rcx,rax
0x37c0dd7478d2  movl rax,[rbp-0x28]
0x37c0dd7478d5  addl rax,0x1
0x37c0dd7478d8  REX.W movq r10,[rsi+0x57]
0x37c0dd7478dc  subl [r10],0x85
0x37c0dd7478e3  js 0x37c0dd747915  <+0xd5>
0x37c0dd7478e9  movl [rbp-0x24],rcx
0x37c0dd7478ec  movl [rbp-0x28],rax
0x37c0dd7478ef  jmp 0x37c0dd747867  <+0x27>
0x37c0dd7478f4  call 0x37c0dd747310  (jump table)
0x37c0dd7478f9  REX.W movq rsi,[rbp-0x10]
0x37c0dd7478fd  jmp 0x37c0dd747859  <+0x19>
0x37c0dd747902  call 0x37c0dd747160  (jump table)
0x37c0dd747907  REX.W movq rsi,[rbp-0x10]
0x37c0dd74790b  jmp 0x37c0dd747881  <+0x41>
0x37c0dd747910  call 0x37c0dd747070  (jump table)
0x37c0dd747915  push rax
0x37c0dd747916  push rcx
0x37c0dd747917  call 0x37c0dd747160  (jump table)
0x37c0dd74791c  pop rcx
0x37c0dd74791d  pop rax
0x37c0dd74791e  REX.W movq rsi,[rbp-0x10]
0x37c0dd747922  jmp 0x37c0dd7478e9  <+0xa9>

Copy-and-Patch

ndisasm -b64 0.bin

00000000  mov ebp,0x64
00000005  mov [r13+0x8],ebp
0000000C  nop dword [rax+0x0]
00000013  mov ebp,[r13+0x8]
0000001A  mov r12d,0x1
00000020  xor eax,eax
00000022  cmp ebp,r12d
00000025  setz al
00000028  mov rbp,rax
0000002B  test ebp,ebp
0000002D  jz near 0x34
00000033  ret
00000034  mov ebp,[r13+0x8]
0000003B  mov r12d,0x1
00000041  and ebp,r12d
00000044  test ebp,ebp
00000046  jz near 0x6b
0000004C  mov ebp,[r13+0x8]
00000053  mov r12d,0x2
00000059  mov rax,rbp
0000005C  xor edx,edx
0000005E  div r12d
00000061  mov rbp,rax
00000064  mov [r13+0x8],ebp
0000006B  mov ebp,[r13+0x8]
00000072  mov r12d,0x1
00000078  and ebp,r12d
0000007B  mov r12d,0x1
00000081  xor ebp,r12d
00000084  test ebp,ebp
00000086  jz near 0xad
0000008C  mov ebp,[r13+0x8]
00000093  mov r12d,0x3
00000099  imul ebp,r12d
0000009D  mov r12d,0x1
000000A3  add ebp,r12d
000000A6  mov [r13+0x8],ebp
000000AD  mov ebp,[r13+0x10]
000000B4  mov r12d,0x1
000000BA  add ebp,r12d
000000BD  mov [r13+0x10],ebp
000000C4  jmp 0x13
000000C9  ret

Side-by-Side

Liftoff Copy-and-Patch

Setup

Appendix

V8 Setup

git clone depot_tools
export PATH=$(pwd)/depot_tools:$PATH
fetch --nohistory v8

Go install bazelisk if you don’t hae it already.

And now within v8/, I needed to turn off -Werror:

diff --git a/bazel/defs.bzl b/bazel/defs.bzl
index fbd942ba..0eb339bd 100644
--- a/bazel/defs.bzl
+++ b/bazel/defs.bzl
@@ -106,7 +106,6 @@ def _default_args():
             "@v8//bazel/config:is_posix": [
                 "-fPIC",
                 "-fno-strict-aliasing",
-                "-Werror",
                 "-Wextra",
                 "-Wno-unneeded-internal-declaration",
                 "-Wno-unknown-warning-option", # b/330781959

And then build d8 with the disassembler enabled:

bazel build //:noicu/d8 --//:v8_enable_disassembler=true --//:v8_enable_object_print=true --//:v8_code_comments=true

Now wait like 3-4 hours. V8 Team, please publish precompiled d8 binaries.

bazel-bin/noicu/d8 will now be your d8 binary. It has no dynamically linked dependencies on any of the V8 build, so you can copy it elsewhere (I dropped it in /usr/local/bin to get it on $PATH easily).

WasmNow Setup

Clone the repo:

git clone https://github.com/sillycross/WasmNow.git

The build script pochivm-build expects to be able to copy /lib/x86_64-linux-gnu/libtinfo.so.5 from your host system. Fedora dropped that in version 37, so we have to just hack it out of the script:

diff --git a/pochivm-build b/pochivm-build
index f0aec5f..f4c207c 100755
--- a/pochivm-build
+++ b/pochivm-build
@@ -63,7 +63,7 @@ def BuildOrUpdateDockerImage():

     CreateDirIfNotExist(os.path.join(base_dir, 'shared_libs'))
     all_shared_libs = [
-        '/lib/x86_64-linux-gnu/libtinfo.so.5'
+        #'/lib/x86_64-linux-gnu/libtinfo.so.5'
     ]
     for shared_lib in all_shared_libs:
         cmd = 'docker run -v%s:/home/u/PochiVM pochivm-build:latest cp %s /home/u/PochiVM/shared_libs' % (base_dir, shared_lib)

It also turns out that making files shared across a host system and a container is hard, especially when you’re running on fedora using rootless podman by default, so I also had to patch in:

diff --git a/pochivm-build b/pochivm-build
index f0aec5f..f4c207c 100755
--- a/pochivm-build
+++ b/pochivm-build
@@ -111,7 +111,7 @@ if (op == 'cmake'):
     CreateDirIfNotExist(GetGeneratedDirFlavor(target))
     CreateDirIfNotExist(os.path.join(GetGeneratedDirFlavor(target), 'generated'))

-    cmd = "docker run -v %s:/home/u/PochiVM pochivm-build:latest bash -c 'cd PochiVM/build/%s && cmake ../../ -DBUILD_FLAVOR=%s -GNinja'" % (base_dir, target, target.upper())
+    cmd = "docker run --user root -v %s:/home/u/PochiVM:z pochivm-build:latest bash -c 'cd PochiVM/build/%s && cmake ../../ -DBUILD_FLAVOR=%s -GNinja'" % (base_dir, target, target.upper())
     r = os.system(cmd)
     sys.exit(r)

@@ -146,7 +146,7 @@ if (op == 'make'):
             if (num_cpus > 4):
                 parallelism = num_cpus - 2
         option = ("-j%s" % str(parallelism))
-    cmd = "docker run -v %s:/home/u/PochiVM pochivm-build:latest bash -c 'cd PochiVM/build/%s && ninja %s'" % (base_dir, target, option)
+    cmd = "docker run --user root -v %s:/home/u/PochiVM:z pochivm-build:latest bash -c 'cd PochiVM/build/%s && ninja %s'" % (base_dir, target, option)
     r = os.system(cmd)
     if (r != 0):
         sys.exit(r)

And now you should be able to

DevEx Setup

run_d8_wasm
#!/bin/bash
WATFILE=$1
WASMFILE="${1%.wat}.wasm"
JSFILE="${1%.wat}.js"

wat2wasm $WATFILE -o $WASMFILE

cat >$JSFILE <<END
const bytes = new Uint8Array(
END
cat $WASMFILE | node -e "process.stdin.on('data', (data) => console.log([...data]));" >> $JSFILE
cat >>$JSFILE <<END
);

const module = new WebAssembly.Module(bytes);
const instance = new WebAssembly.Instance(module);
console.log(instance.exports.main());
END
d8 --liftoff --no-wasm-tier-up --print-code --code-comments $JSFILE
#!/bin/bash
WATFILE=$1
WASMFILE="${1%.wat}.wasm"
wat2wasm $WATFILE -o $WASMFILE
WASM_TEST_FILE=$WASMFILE ./main --gtest_filter=WasmExecution.from_env
ndisasm -b64 0.bin

Spec Test Complaints

I’d like to take a moment here to just rant about how difficult it is to apply the official wasm spec’s test suite to a new wasm engine.

The spec’s tests are defined in a wast syntax, which is wat that allows multiple modules per file plus some extra scripting support to allow defining tests:

Supporting these functions directly in an engine just isn’t overly tractable. assert_return is probably JIT-able, but the "runtime" function lookup from (invoke) is annoying, except it’s always a string constant so it’s technically doable at compile time. assert_trap means you must implement trapping gracefully, as a valid return value. assert_invalid is applying validation at runtime, which isn’t how the rest of a wasm engine works. assert_malformed is just right out.

And thus, this ends up turning into that implementations have to write their own spectest parser. Sorry, I mean their own spectest test generator. No, wait, I mean . If every consumer of your test suite has to devote hours to transforming your testsuite into something they can actually use, there’s probably something to fix.

You might think that wast2json would be a great help by exploding the wast tests out into a collection of wasm files and a json file telling you what all the tests are. However, that leaves you to emit the assertions as WASM, and link in the corresponding module definitions that were emitted. Linking with What happens when you try to use wast2json --relocatable on the spectests to be able to do that?

/usr/include/c++/14/string_view:256: constexpr const std::basic_string_view<_CharT, _Traits>::value_type& std::basic_string_view<_CharT, _Traits>::operator[](size_type) const [with _CharT = char; _Traits = std::char_traits<char>; const_reference = const char&; size_type = long unsigned int]: Assertion '__pos < this->_M_len' failed.

And before you say "that’s just a bug, report it." I did. And the overall sentiment was that (1) it’s maybe better to just remove --relocatable, and (2) it’s not really a good solution to this problem anyway. (Ladybird’s LibWasm uses it well though from a context where modules are loadable from javascript, and a python file emits javascript loading the module and checking it according to the test the json says should be applied.)

I’m very happy to see that there’s an official suite of tests to check compliance with the specification, so please don’t take this as me not being appreciative of the labor that went into building out the test suite, but I’m very disappointed in how difficult they are to actually apply to a minimal WASM implementation. Hence it’s not surprising that spectests weren’t applied to WasmNow during its development. But that’s literally what they exist for!

What would be useful is if each assert was its own file. I started writing a script to try and track the module and definition dependencies and programmatically generate one minimal file for each assertion test from the .wast definition, but such work became quickly apparent that it’d take more than the couple days I was willing to spend on it. Each assertion being its own test means that it’s fine if the WASM runtime exits with a return code indicating the type of error that it detected, and then there doesn’t need to be any first class support for parsing and validating new code at runtime or having first class traps. One can just point the WASM engine at the entirely self-contained file, and the harness can assert that a malformed wasm file yields a malformed wasm file error return code, and the same for traps, invalid, or successful execution.