
Is WebAssembly magic performance pixie dust?

Add WebAssembly, get performance. Is that how it really works?

The incredibly unsatisfying answer is: It depends. It depends on oh-so-many factors, and I’ll be touching on some of them here.

Why am I doing this? (You can skip this)

I really like AssemblyScript (full disclosure: I am one of their backers). It’s a very young language with a small but passionate team that built a custom compiler for a TypeScript-like language targeting WebAssembly. The reason I like AssemblyScript (or ASC for short) is that it allows the average web developer to make use of WebAssembly without having to learn a potentially new language like C++ or Rust. It’s important to note that the language is TypeScript-like. Don’t expect your existing TypeScript code to just compile out of the box. That being said, the language intentionally mirrors the behaviors and semantics of TypeScript (and therefore JavaScript), which means that “porting” TypeScript to AssemblyScript is often mostly cosmetic, usually just a matter of adding type annotations.

I always wondered if there is anything to gain from taking a piece of JavaScript, turning it into AssemblyScript and compiling it to WebAssembly. When my colleague Ingvar sent me a piece of JavaScript code to blur images, I thought that this would be a perfect case study. I ran a quick experiment to see if it’s worth doing a deeper exploration into porting JavaScript to AssemblyScript. And oh boy was it worth it. This article is that deeper exploration.

If you want to know more about AssemblyScript, go to the website, join the Discord or, if you fancy, watch the AssemblyScript intro video I made with my podcast husband Jake.

Advantages of WebAssembly

I think it is fair to say that the most mature use-case for WebAssembly is tapping into the ecosystems of other languages. In Squoosh, for example, we use libraries from the C/C++ and Rust ecosystems to process images. These libraries were not written with the web in mind, but through WebAssembly, they can run there anyway.

WebAssembly, in my perception, is also strongly associated with performance by a lot of people. It was designed to be fast, and it’s compiled, so it’s gotta be fast, right? Well, for the longest time I have been vocal that WebAssembly and JavaScript have the same peak performance, and I still stand behind that. Given ideal conditions, they both compile to machine code and end up being equally fast. But there’s obviously more nuance here, and when have conditions ever been ideal on the web‽ Instead, I think it would be better if we thought about WebAssembly as a way to get more reliable performance.

However, it’s also important to realize that WebAssembly has recently been getting access to performance primitives (like SIMD or shared-memory threads) that JavaScript cannot utilize, giving WebAssembly an increased chance to out-perform JavaScript. There are also some other qualities of WebAssembly that might make it better suited in specific situations than JavaScript:

No warmup

To execute JavaScript, V8 first gives the code to the interpreter “Ignition”. Ignition is optimized to get code running as soon as possible. Afterwards, “Sparkplug” takes Ignition’s output (the infamous “bytecode”) and turns it into non-optimized machine code, yielding better performance at the cost of an increased memory footprint. While your code is executing, it is closely observed by V8 to gather data on object shapes (think of them like types). Once sufficient data has been collected, V8’s optimizing compiler “TurboFan” kicks in and generates low-level machine code that is optimized for those types. This gives another significant speed boost.

It’s tradeoffs all the way down: If you want to learn more about the exact tradeoffs JavaScript engines have to make, I can recommend this article by Benedikt and Mathias.

WebAssembly, on the other hand, is strongly typed, so it can be turned into machine code straight away. V8 has a streaming Wasm compiler called “Liftoff” which, like Ignition, is geared to get your code running quickly, at the cost of potentially suboptimal machine code. The moment Liftoff is done, TurboFan kicks in and generates optimized machine code that runs faster than what Liftoff produced, but takes longer to generate. The big difference to JavaScript is that TurboFan can do its work without having to observe your Wasm first.

No tierdown

The machine code that TurboFan generates for JavaScript is only usable for as long as the assumptions about types hold. If TurboFan generated machine code for a function f with a number as a parameter, and now all of a sudden that function f gets called with an object, the engine has to fall back to Ignition or Sparkplug. That’s called a “deoptimization” (or “deopt” for short). Again, because WebAssembly is strongly typed, the types can’t change. Not only that, but the types that WebAssembly supports are designed to map well to machine code. Deopts can’t happen with WebAssembly.
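
To make this concrete, here is a minimal JavaScript sketch of the kind of situation that triggers a deopt (the function and the call counts are made up for illustration):

function f(a, b) {
  return a + b;
}

// After enough calls with numbers, TurboFan emits machine code
// for f that is specialized for number arithmetic.
for (let i = 0; i < 100000; i++) f(i, 1);

// This call breaks the "both parameters are numbers" assumption,
// so the engine discards the optimized code and falls back to a
// lower tier: a deopt.
f({ value: 1 }, { value: 2 });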

Binary size

Now this one is a bit elusive. According to webassembly.org, “the wasm stack machine is designed to be encoded in a size- and load-time-efficient binary format.” And yet, WebAssembly is currently somewhat notorious for generating big binary blobs, at least by what is considered “big” on the web. WebAssembly compresses very well (via gzip or brotli), which can undo a lot of the bloat.

It is easy to forget that JavaScript comes with a lot of batteries included (despite the claim that it doesn’t have a standard library). For example: You can handle arrays and objects, iterate over keys and values, split strings, filter, map, rely on prototypal inheritance, and so on and so forth. All of that is built into the JavaScript engine. WebAssembly comes with nothing except arithmetic. Whenever you use any of these higher-level concepts in a language that compiles to WebAssembly, the underpinning code has to get bundled into your binary, which is one of the causes of big WebAssembly binaries. Of course, those functions only have to be included once, so bigger projects amortize this overhead better than small modules do.

Not all of these advantages are equally available or important in every scenario. However, AssemblyScript is known to generate rather small WebAssembly binaries, and I was curious how it holds up against directly comparable JavaScript in terms of speed and size.

Porting to AssemblyScript

As mentioned, AssemblyScript mimics TypeScript’s semantics and Web Platform APIs as much as possible, which means porting a piece of JS to ASC is mostly a matter of adding type annotations to the code. As a first example, I took glur, a JavaScript library that blurs images.

Adding types

ASC’s built-in types mirror the types of the WebAssembly VM. While numeric values in TypeScript are just number (a 64-bit IEEE 754 float according to the spec), AssemblyScript has u8, u16, u32, i8, i16, i32, f32 and f64 as its primitive types. Its small-but-sufficiently-powerful standard library adds higher-level data structures like string, Array<T>, ArrayBuffer, Uint8Array etc. The only ASC-specific data structure, found in neither JavaScript nor the Web Platform, is StaticArray, which I will talk about a bit later.

As an example, here is a function from the glur library and its AssemblyScript’ified counterpart:

function gaussCoef(sigma) {
  if (sigma < 0.5)
    sigma = 0.5;

  var a = Math.exp(0.726 * 0.726) / sigma;
  /* ... more math ... */

  return new Float32Array([
    a0, a1, a2, a3,
    b1, b2,
    left_corner, right_corner
  ]);
}
And the AssemblyScript version:

function gaussCoef(sigma: f32): Float32Array {
  if (sigma < 0.5)
    sigma = 0.5;

  let a: f32 = Mathf.exp(0.726 * 0.726) / sigma;
  /* ... more math ... */

  const r = new Float32Array(8);
  const v = [
    a0, a1, a2, a3,
    b1, b2,
    left_corner, right_corner
  ];
  for (let i = 0; i < v.length; i++) {
    r[i] = v[i];
  }
  return r;
}

The explicit loop at the end to populate the array is there because of a current shortcoming of AssemblyScript: Function overloading isn’t supported yet, so there is exactly one constructor for Float32Array in ASC, which takes an i32 parameter for the length of the TypedArray. Callbacks are supported in ASC, but closures are not, so I can’t use .forEach() to fill in the values either. This is certainly inconvenient, but not prohibitively so.

Mathf: You might have noticed Mathf instead of Math. Mathf is specifically for 32-bit floats, while Math is for 64-bit floats. I could have used Math and cast the result, but the 64-bit operations are ever-so-slightly slower due to the increased precision required. Either way, gaussCoef is not on the hot path, so it really doesn’t make a difference here.
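
For illustration, the two spellings look like this (the second one round-trips through 64 bits):

// Mathf operates on 32-bit floats directly:
let a: f32 = Mathf.exp(0.726);
// Math operates on 64-bit floats, so the result needs a cast:
let b: f32 = <f32>Math.exp(0.726);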

Side note: Mind the signs

Something that took me an embarrassingly long time to figure out is that, uh, types matter. Blurring an image involves convolution, and that means a whole bunch of for-loops iterating over all the pixels. Naïvely I thought that because all pixel indices are positive, the loop counters would be as well and decided to choose u32 for those loop variables. That’ll bite you with a lovely infinite loop if any of those loops happen to iterate backwards, like the following one:

let j: u32;
// ... many many lines of code ...
for (j = width - 1; j >= 0; j--) {
  // ...
}
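
Since j is declared as u32, the condition j >= 0 is always true: once j reaches 0, j-- wraps around to 2³²−1 and the loop never terminates. Switching the counter to a signed type fixes it (a sketch, assuming width fits into an i32):

let j: i32;
// ... many many lines of code ...
for (j = <i32>width - 1; j >= 0; j--) {
  // j can now actually become negative, terminating the loop
}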

Apart from that, the act of porting JS to ASC was a pretty mechanical task.

Benchmarking using d8

Now that we have a JS file and an ASC file, we can compile the ASC to WebAssembly and run a little benchmark to compare the runtime performance.

d-What?: d8 is a minimal CLI wrapper around V8, exposing fine-grained control over all kinds of engine features for both Wasm and JS. You can think of it like Node, but with no standard library whatsoever. Just vanilla ECMAScript. Unless you have compiled V8 locally (which you can do by following the guide on v8.dev), you probably won’t have d8 available. jsvu is a tool that can install pre-compiled binaries for many JavaScript engines, including V8.

However, since this section has the word “Benchmarking” in the title, I think it’s important to put a disclaimer here: The numbers I am listing here are specific to the code that I wrote in a language I chose, ran on my machine (a 2020 M1 MacBook Air) using a benchmark script that I made. The results are coarse indicators at best and it would be ill-advised to derive quantitative conclusions about the general performance of AssemblyScript, WebAssembly or JavaScript from this.

Some might be wondering why I’m using d8 instead of running this in the browser or even Node. Both Node and the browser have… other stuff going on that may or may not skew the results. d8 is the most sterile environment I can get, and as a cherry on top it allows me to control the tier-up behavior: I can limit execution to Ignition, Sparkplug or Liftoff only, ensuring that performance characteristics don’t change in the middle of a benchmark.

Methodology

As described above, it is important to “warm up” JavaScript when benchmarking, giving V8 a chance to optimize it. If you don’t do that, you may very well end up measuring a mixture of the performance characteristics of interpreted JS and optimized machine code. To that end, I run the blur program 5 times before I start measuring, then do 50 timed runs, ignoring the 5 fastest and the 5 slowest runs to remove potential outliers.
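
In case you want to replicate the setup, here is a sketch of that measurement loop in vanilla ECMAScript (so it runs in d8 as-is; function and constant names are mine):

function measure(run) {
  const WARMUP = 5, SAMPLES = 50, TRIM = 5;
  // Warm-up runs, giving the engine a chance to tier up.
  for (let i = 0; i < WARMUP; i++) run();
  const times = [];
  for (let i = 0; i < SAMPLES; i++) {
    const start = Date.now();
    run();
    times.push(Date.now() - start);
  }
  // Drop the TRIM fastest and the TRIM slowest runs as outliers.
  times.sort((a, b) => a - b);
  const kept = times.slice(TRIM, times.length - TRIM);
  return kept.reduce((sum, t) => sum + t, 0) / kept.length;
}

Here’s what I got: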

| Language       | Engine    | Average   | vs JS |
|----------------|-----------|-----------|-------|
| JavaScript     | Ignition  | 4434.15ms | 42.5x |
| JavaScript     | Sparkplug | 2476.60ms | 23.7x |
| AssemblyScript | Liftoff   | 380.52ms  | 3.6x  |
| AssemblyScript | TurboFan  | 298.13ms  | 2.9x  |
| JavaScript     | TurboFan  | 104.35ms  | 1.0x  |

On the one hand, I was happy to see that Liftoff’s output was faster than what Ignition or Sparkplug could squeeze out of JavaScript. At the same time, it didn’t sit well with me that the optimized WebAssembly module takes about 3 times as long as JavaScript.

To be fair, this is a David vs Goliath scenario: V8 is a long-standing JavaScript engine with a huge team of engineers implementing optimizations and other clever stuff, while AssemblyScript is a relatively young project with a small team around it. ASC’s compiler is single-pass and defers all optimization efforts to Binaryen (see also: wasm-opt). This means that optimization is done at the Wasm VM bytecode level, after most of the high-level semantics have been compiled away. V8 has a clear edge here. However, the blur code is so simple — just doing arithmetic with values from memory — that I was really expecting it to be closer. What’s going on here?

Digging in

After quickly consulting with some folks from the V8 team and some folks from the AssemblyScript team (thanks Daniel and Max!), it turns out that one big difference here is “bounds checks” — or rather, the lack thereof.

V8 has the luxury of having access to your original JavaScript code and knowledge about the semantics of the language. It can use that information to apply additional optimizations. For example: It can tell that you are not just randomly reading values from memory, but iterating over an ArrayBuffer using a for ... of loop. What’s the difference? Well, with a for ... of loop, the language semantics guarantee that you will never try to read values outside of the ArrayBuffer. You will never end up accidentally reading byte 11 when the buffer is only 10 bytes long. In other words: You never go out of bounds. This means TurboFan does not need to emit bounds checks, which you can think of as if statements making sure you are not accessing memory you are not supposed to. This kind of information is lost once the code is compiled to WebAssembly, and since ASC’s optimization only happens at the WebAssembly VM level, it can’t necessarily apply the same optimization.
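
What does such a bounds check amount to? Roughly this (a sketch of the idea, not ASC’s actual codegen):

function get(coeff: Float32Array, i: i32): f32 {
  // The check TurboFan can elide for JavaScript, but that the
  // compiled WebAssembly has to carry along on every access:
  if (<u32>i >= <u32>coeff.length) abort();
  // The raw memory read that remains:
  return load<f32>(coeff.dataStart + ((<usize>i) << 2));
}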

Luckily, AssemblyScript provides a magic unchecked() annotation to indicate that we are taking responsibility for staying in-bounds.

- prev_prev_out_r = prev_src_r * coeff[6];
- line[line_index] = prev_out_r;
+ prev_prev_out_r = prev_src_r * unchecked(coeff[6]);
+ unchecked(line[line_index] = prev_out_r);

But there’s more: The Typed Arrays in AssemblyScript (Uint8Array, Float32Array, ...) offer the same API as they do on the platform, meaning they are merely a view onto an underlying ArrayBuffer. This is good in that the API design is familiar and battle-tested, but due to the lack of high-level optimizations, this means that every access to a field on the Typed Array (like myFloatArray[23]) needs to access memory twice: Once to load the pointer to the underlying ArrayBuffer of this specific array, and another to load the value at the right offset. V8, as it can tell that you are accessing the Typed Array but never the underlying buffer, is most likely able to optimize the entire data structure so that you can read values with a single memory access.

For that reason, AssemblyScript provides StaticArray<T>, which is mostly equivalent to an Array<T> except that it can’t grow. With a fixed length, there is no need to keep the Array entity separate from the memory the values are stored in, removing that indirection.
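
Switching a scratch buffer over is a one-line change (a sketch; width stands in for whatever length the blur code needs):

// Before: a view plus a separate backing buffer, i.e. two loads per access.
let line = new Float32Array(width);
// After: the values live directly behind the array header, one load per access.
let lineStatic = new StaticArray<f32>(width);
lineStatic[0] = 1.0; // same indexing syntax as before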

I applied both these optimizations to my “naïve port” and measured again:

| Language       | Variant   | Average  | vs JS |
|----------------|-----------|----------|-------|
| JavaScript     |           | 104.35ms | 1.0x  |
| AssemblyScript | optimized | 162.63ms | 1.6x  |
| AssemblyScript | naive     | 298.13ms | 2.9x  |

A lot better! While the AssemblyScript version is still slower than the JavaScript one, we got significantly closer. Is this the best we can do?

Sneaky defaults

Another thing the AssemblyScript folks pointed out to me is that the --optimize flag is equivalent to -O3s, which aggressively optimizes for speed but makes tradeoffs to reduce binary size. -O3 optimizes for speed and speed only. Having -O3s as the default is good in spirit — binary size matters on the web — but is it worth it? At least in this specific example, the answer is no: -O3s ends up trading the laughable amount of ~30 bytes for a huge performance penalty:

| Language       | Variant   | Optimizer | Average  | vs JS |
|----------------|-----------|-----------|----------|-------|
| AssemblyScript | optimized | O3        | 89.60ms  | 0.9x  |
| JavaScript     |           |           | 104.35ms | 1.0x  |
| AssemblyScript | optimized | O3s       | 162.63ms | 1.6x  |
| AssemblyScript | naive     | O3s       | 298.13ms | 2.9x  |

One single optimizer flag makes a night-and-day difference, letting AssemblyScript overtake JavaScript (on this specific test case!). We made AssemblyScript faster than JavaScript!

O3: From here on out, I will only be using -O3 in this article.

Bubblesort

To gain some confidence that the image blur example is not just a fluke, I thought I should try this again with a second program. Rather uncreatively, I took a bubblesort implementation off of StackOverflow and ran it through the same process: Add types. Run benchmark. Optimize. Run benchmark. (The creation and population of the array that’s to be bubble-sorted is not part of the benchmarked code path.)
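
For reference, here is roughly what the hot loop of my optimized port boils down to (a sketch, not the verbatim StackOverflow code):

function bubbleSort(arr: StaticArray<f32>): void {
  for (let i = 0; i < arr.length; i++) {
    for (let j = 0; j < arr.length - i - 1; j++) {
      if (unchecked(arr[j]) > unchecked(arr[j + 1])) {
        // Swap without re-running the implicit bounds checks.
        const tmp = unchecked(arr[j]);
        unchecked(arr[j] = arr[j + 1]);
        unchecked(arr[j + 1] = tmp);
      }
    }
  }
}

The results: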

| Language       | Engine    | Variant   | Average   | vs JS |
|----------------|-----------|-----------|-----------|-------|
| AssemblyScript | TurboFan  | optimized | 65.42ms   | 0.6x  |
| JavaScript     | TurboFan  |           | 103.85ms  | 1.0x  |
| AssemblyScript | Liftoff   | optimized | 256.98ms  | 2.5x  |
| AssemblyScript | TurboFan  | naive     | 434.05ms  | 4.2x  |
| AssemblyScript | Liftoff   | naive     | 714.70ms  | 6.9x  |
| JavaScript     | Sparkplug |           | 1616.00ms | 15.6x |
| JavaScript     | Ignition  |           | 2340.07ms | 22.5x |

We did it again! This time with an even bigger discrepancy: The optimized AssemblyScript is almost twice as fast as JavaScript. But do me a favor: Don’t stop reading now.

Allocations

Some of you may have noticed that both these examples have very few or no allocations. V8 takes care of all memory management (and garbage collection) in JavaScript for you and I won’t pretend that I know much about it. In WebAssembly, on the other hand, you get a chunk of linear memory and you have to decide how to use it (or rather: the language does). How much do these rankings change if we make heavy use of dynamic memory?

To measure this, I chose to benchmark an implementation of a binary heap. The benchmark fills the binary heap with 1 million random numbers (courtesy of Math.random()) and pop()s them all back out, checking that the numbers come out in increasing order. The process remained the same as above: Make a naïve port of the JS code to ASC, run the benchmark, optimize, benchmark again.
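
Sketched out, the benchmarked path looks roughly like this (BinaryHeap being my ported heap class, not a stdlib type):

const N = 1000000;
const heap = new BinaryHeap<f32>();
for (let i = 0; i < N; i++) {
  heap.push(Mathf.random());
}
let last: f32 = heap.pop();
for (let i = 1; i < N; i++) {
  const next: f32 = heap.pop();
  // A min-heap must hand the values back in increasing order.
  if (next < last) throw new Error("out of order");
  last = next;
}

The results: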

| Language       | Engine    | Variant   | Average    | vs JS |
|----------------|-----------|-----------|------------|-------|
| JavaScript     | TurboFan  |           | 233.68ms   | 1.0x  |
| JavaScript     | Sparkplug |           | 1712.58ms  | 7.3x  |
| JavaScript     | Ignition  |           | 3157.97ms  | 13.5x |
| AssemblyScript | TurboFan  | optimized | 18758.50ms | 80.3x |
| AssemblyScript | Liftoff   | optimized | 18867.08ms | 80.7x |
| AssemblyScript | TurboFan  | naive     | 19031.60ms | 81.4x |
| AssemblyScript | Liftoff   | naive     | 19409.10ms | 83.1x |

80x slower than JavaScript?! Even slower than Ignition? Surely, there is something else going wrong here.

Runtimes

All data that we create in AssemblyScript needs to be stored in memory. To make sure we don’t overwrite anything else that is already in memory, there is memory management. As AssemblyScript aims to provide a familiar environment, mirroring the behavior of JavaScript, it adds a fully managed garbage collector to your WebAssembly module so that you don’t have to worry about when to allocate and when to free up memory.

By default, AssemblyScript ships with a Two-Level Segregated Fit memory allocator and an Incremental Tri-Color Mark & Sweep (ITCMS) garbage collector. It’s not actually relevant for this article what kind of allocator and garbage collector they use; I just found it interesting that you can go look at them.

This default runtime, called incremental, is surprisingly small, adding only about 2KB of gzip’d WebAssembly to your module. AssemblyScript also offers alternative runtimes, namely minimal and stub, which can be chosen using the --runtime flag. minimal uses the same allocator, but a more lightweight GC that does not run automatically and must be invoked manually instead. This can be interesting for high-performance use-cases like games, where you want to control when the GC pauses your program. stub is extremely small (~400B gzip’d) and fast, as it’s just a bump allocator.
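
Invoking the GC manually happens from the host side. A sketch of what that can look like in JavaScript, assuming the module was compiled with --exportRuntime (wasmBuffer and doWork are placeholders):

const { instance } = await WebAssembly.instantiate(wasmBuffer, {
  env: { abort: () => { throw new Error("abort called"); } },
});
instance.exports.doWork();
// With the minimal runtime, garbage is only collected when
// the embedder explicitly asks for it:
instance.exports.__collect();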

My lovely memory bump: Bump allocators are extremely fast, but lack the ability to free memory. While that sounds stupid, it can be extremely useful for single-purpose modules where, instead of freeing memory, you throw away the entire WebAssembly instance and create a new one. If you are curious, I actually wrote a bump allocator in my article Compiling C to WebAssembly without Emscripten.

How much faster does that make our binary heap experiment? Quite significantly!

| Language       | Variant   | Runtime     | Average    | vs JS |
|----------------|-----------|-------------|------------|-------|
| JavaScript     |           |             | 233.68ms   | 1.0x  |
| AssemblyScript | optimized | stub        | 345.35ms   | 1.5x  |
| AssemblyScript | optimized | minimal     | 354.60ms   | 1.5x  |
| AssemblyScript | optimized | incremental | 18758.50ms | 80.3x |

Both minimal and stub get us significantly closer to JavaScript’s performance. But why are these two so much faster? As mentioned above, minimal and incremental share the same allocator, so that can’t be it. Both also have a garbage collector, but minimal doesn’t run it unless explicitly invoked (and we ain’t invoking it). That means the differentiating quality is that incremental runs garbage collection, while minimal and stub do not. But I don’t see why the garbage collector should make this big of a difference, considering it only has to keep track of one array.

Growth

After profiling with d8 on a build with debugging symbols (--debug), I found that a lot of time is spent in libsystem_platform.dylib, a system library containing OS-level primitives for threading and memory management. Calls into this library are made from __new and __renew, which in turn are called from Array<f32>#push:

[Bottom up (heavy) profile]:
  ticks parent  name
  18670   96.1%  /usr/lib/system/libsystem_platform.dylib
  13530   72.5%    Function: *~lib/rt/itcms/__renew
  13530  100.0%      Function: *~lib/array/ensureSize
  13530  100.0%        Function: *~lib/array/Array<f32>#push
  13530  100.0%          Function: *binaryheap_optimized/BinaryHeap<f32>#push
  13530  100.0%            Function: *binaryheap_optimized/push
   5119   27.4%    Function: *~lib/rt/itcms/__new
   5119  100.0%      Function: *~lib/rt/itcms/__renew
   5119  100.0%        Function: *~lib/array/ensureSize
   5119  100.0%          Function: *~lib/array/Array<f32>#push
   5119  100.0%            Function: *binaryheap_optimized/BinaryHeap<f32>#push

Clearly, we have a problem with allocations here. But JavaScript somehow manages to make an ever-growing array fast, so why can’t AssemblyScript? Luckily, the standard library of AssemblyScript is rather small and approachable, so let’s go and take a look at this ominous push() function of the Array<T> class:

export class Array<T> {
  // ...
  push(value: T): i32 {
    var length = this.length_;
    var newLength = length + 1;
    ensureSize(changetype<usize>(this), newLength, alignof<T>());
    // ...
    return newLength;
  }
  // ...
}

The push() function correctly determines that the new length of the array is the current length plus 1 and then calls ensureSize(), to make sure that the underlying buffer has enough room (“capacity”) to grow to this length.

function ensureSize(array: usize, minSize: usize, alignLog2: u32): void {
  // ...
  if (minSize > <usize>oldCapacity >>> alignLog2) {
    // ...
    let newCapacity = minSize << alignLog2;
    let newData = __renew(oldData, newCapacity);
    // ...
  }
}

ensureSize(), in turn, checks if the capacity is smaller than the new minSize, and if so, allocates a new buffer of exactly that size using __renew, which entails copying all the data from the old buffer to the new one. Because the capacity only ever grows to exactly the length that was requested, our benchmark, which pushes one million values into the array one by one, triggers a reallocation on every single push: pushing n values this way copies 1 + 2 + … + (n − 1) ≈ n²/2 elements in total, causing a lot of allocation work and creating a lot of garbage.

In other languages, like Rust’s std::vec or Go’s slices, the new buffer has double the old buffer’s capacity, which amortizes the allocation work over time. I am working to fix this in ASC, but in the meantime we can create our own CustomArray<T> that has the desired behavior, sketched below. Lo and behold, we made things faster!
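
The core of CustomArray<T> is just a geometrically growing backing store (a sketch of the idea, not my exact implementation):

class CustomArray<T> {
  private buffer: StaticArray<T> = new StaticArray<T>(8);
  private length_: i32 = 0;

  get length(): i32 {
    return this.length_;
  }

  push(value: T): i32 {
    if (this.length_ == this.buffer.length) {
      // Double the capacity so the copying work amortizes to
      // O(1) per push instead of a full copy on every push.
      const bigger = new StaticArray<T>(this.buffer.length * 2);
      for (let i = 0; i < this.length_; i++) {
        unchecked(bigger[i] = this.buffer[i]);
      }
      this.buffer = bigger;
    }
    unchecked(this.buffer[this.length_] = value);
    this.length_ += 1;
    return this.length_;
  }

  at(index: i32): T {
    return unchecked(this.buffer[index]);
  }
}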

Fixed: The ASC team has since fixed this in v0.18.31.

| Language       | Variant     | Runtime     | Average    | vs JS |
|----------------|-------------|-------------|------------|-------|
| JavaScript     |             |             | 233.68ms   | 1.0x  |
| AssemblyScript | customarray | minimal     | 329.23ms   | 1.4x  |
| AssemblyScript | customarray | stub        | 329.43ms   | 1.4x  |
| AssemblyScript | customarray | incremental | 335.35ms   | 1.4x  |
| AssemblyScript | optimized   | stub        | 345.35ms   | 1.5x  |
| AssemblyScript | optimized   | minimal     | 354.60ms   | 1.5x  |
| AssemblyScript | optimized   | incremental | 18758.50ms | 80.3x |

With this change, incremental is as fast as stub and minimal, but none of them are as fast as JavaScript in this test case. There are probably more optimizations I could do, but we are already pretty deep in the weeds here, and this is not supposed to be an article about how to optimize AssemblyScript.

There are also a lot of simple optimizations I wish AssemblyScript’s compiler would do for me. To that end, they are working on an IR called “AIR”. Will that make things faster out-of-the-box without having to hand-optimize every array access? Very likely. Will it be faster than JavaScript? Hard to say. But I did wonder what the more “mature” languages with “very smart” compiler toolchains can achieve.

Rust & C++

I re-rewrote the code in Rust, staying as idiomatic as possible, and compiled it to WebAssembly. While it was faster than the naive port to AssemblyScript, it was slower than our optimized AssemblyScript with CustomArray<T>. So I had to do the same as I did in AssemblyScript: avoid bounds checks by sprinkling some unsafe here and there. With that optimization in place, Rust’s WebAssembly module is faster than our optimized AssemblyScript, but still not faster than JavaScript.

I took the same approach with C++, using Emscripten to compile it to WebAssembly. To my surprise, my first attempt came out performing just as well as JavaScript.

| Language       | Variant     | Average  | vs JS |
|----------------|-------------|----------|-------|
| C++            | idiomatic   | 226.18ms | 1.0x  |
| JavaScript     |             | 233.68ms | 1.0x  |
| Rust           | optimized   | 284.60ms | 1.2x  |
| AssemblyScript | customarray | 335.35ms | 1.4x  |
| Rust           | idiomatic   | 442.88ms | 1.9x  |

Lrn2code: The versions labelled “idiomatic” are still very closely inspired by the original JS code. I tried to make use of my knowledge of the idioms of the target language, but in the end it’s still a port. I am sure an implementation from scratch by someone with more experience in those languages would look different.

I’m fairly certain that Rust and C++ could be made even faster, but I don’t have sufficiently deep knowledge of either language to squeeze out those last couple optimizations.

Gzip’d file sizes

It is worth noting that file size is a strength of AssemblyScript. Comparing the gzip’d file sizes, we get:

| Language       | Runtime     | .wasm | .js   | Total  |
|----------------|-------------|-------|-------|--------|
| JavaScript     |             | 0 B   | 449 B | 449 B  |
| AssemblyScript | stub        | 959 B | 0 B   | 959 B  |
| AssemblyScript | incremental | 2.6KB | 0 B   | 2.6KB  |
| Rust           |             | 7.2KB | 0 B   | 7.2KB  |
| C++            |             | 6.6KB | 6.2KB | 12.8KB |

Conclusion

I want to be very clear: Any generalized, quantitative take-away from this article would be ill-advised. For example, Rust is not 1.2x slower than JavaScript. These numbers are very much specific to the code that I wrote, the optimizations that I applied and the machine I used. However, I think there are some general guidelines we can extract to help you make more informed decisions in the future:

- JavaScript and WebAssembly can reach similar peak performance, but JavaScript only gets there after warmup, and only for as long as the engine’s type assumptions hold.
- WebAssembly’s lowest tier is considerably faster than JavaScript’s, which makes its performance more reliable, if not necessarily higher.
- Defaults are not always in your favor: check what your compiler’s optimization flags (like ASC’s -O3s) and runtime settings actually do.
- Language conveniences have a cost: bounds checks, growable arrays, and garbage collection all showed up prominently in these measurements.

If you don’t trust me (and you shouldn’t!) and want to dig into the benchmark code yourself, take a look at the gist.