1- # WebAssembly String Processing Plan
1+ # WebAssembly UTF-8 String Processing

## Background

@@ -14,321 +14,44 @@ The main issues were:

### What Changed in 2025

17- **js-string-builtins** (WebAssembly 3.0, September 2025) fundamentally changes the equation:
17+ **js-string-builtins** (WebAssembly 3.0) fundamentally changes the equation:

19- - Direct import of JS string operations (`length`, `charCodeAt`, `substring`, etc.) from `wasm:js-string`
19+ - Direct import of JS string operations from `wasm:js-string`
- No glue code overhead - operations can be inlined by the engine
21- - No memory copying at boundaries when consuming JS strings
22- - Strings stay in JS representation (UTF-16) - no UTF-8/UTF-16 conversion
21+ - Uses WASM GC arrays with `intoCharCodeArray`/`fromCharCodeArray` for bulk operations

24- Browser/runtime support:
25- - Chrome 131+ (enabled by default)
26- - Firefox 134+
27- - Safari: TBD (expressed openness)
28- - Node.js 24+ (V8 13.6+, enabled by default)
29- - Node.js 22-23: `--experimental-wasm-imported-strings` flag required
23+ ## Building

31- ## Proposal: Hand-written WAT
32-
33- ### Why WAT over Rust/wasm-bindgen
34-
35- | Aspect | Hand-written WAT | Rust + wasm-bindgen |
36- | --------| ------------------| ---------------------|
37- | Overhead | Zero - direct builtins | Glue code overhead |
38- | Binary size | Minimal (~1-2KB) | Larger (~10KB+) |
39- | Dependencies | None (just wat2wasm) | Rust toolchain, wasm-pack |
40- | Complexity | Simple for small scope | Overkill for 3 functions |
41- | js-string-builtins | Direct imports | Indirect, still evolving |
42- | Contributor barrier | Low (WAT is simple) | Higher (Rust knowledge) |
43-
44- For our limited scope (UTF-8 encode/decode), hand-written WAT is ideal.
45-
46- ### What to Implement in Wasm
47-
48- 1. **UTF-8 byte length counting** (`utf8Count`)
49- - Iterate string via `charCodeAt`, calculate byte length
50-
51- 2. **UTF-8 encoding** (`utf8Encode`)
52- - Read chars via `charCodeAt`, write UTF-8 bytes to linear memory
53-
54- 3. **UTF-8 decoding** (`utf8Decode`)
55- - Read UTF-8 bytes from memory, build string via `fromCharCode`/`fromCodePoint`
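
As a reference point for the counting rule in item 1, the per-code-unit logic can be sketched in plain TypeScript. This is only an illustration: the name `utf8CountSketch` is hypothetical and not part of the library, and counting an unpaired surrogate as 3 bytes is an assumption chosen to match what `TextEncoder` produces (it substitutes U+FFFD).

```typescript
// Hypothetical reference helper (not the library's implementation):
// counts the UTF-8 byte length of a JS string by walking UTF-16 code units.
function utf8CountSketch(str: string): number {
  let byteLength = 0;
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code < 0x80) {
      byteLength += 1; // ASCII
    } else if (code < 0x800) {
      byteLength += 2;
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        byteLength += 4; // valid surrogate pair: one 4-byte sequence
        i++; // skip the low surrogate
      } else {
        byteLength += 3; // assumption: unpaired high surrogate counts as U+FFFD
      }
    } else {
      byteLength += 3; // rest of the BMP, including unpaired surrogates
    }
  }
  return byteLength;
}
```

For well-formed and ill-formed input alike, the result should agree with `new TextEncoder().encode(str).length`, which makes that call a convenient oracle when testing a Wasm version.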
56-
57- ### Available js-string-builtins
58-
59- From `wasm:js-string`:
60- - `length` - get string length
61- - `charCodeAt` - get UTF-16 code unit at index
62- - `codePointAt` - get Unicode code point at index (handles surrogates)
63- - `fromCharCode` - create single-char string from code unit
64- - `fromCodePoint` - create single-char string from code point
65- - `concat` - concatenate strings
66- - `substring` - extract substring
67- - `equals` - compare strings
68-
69- ## Implementation Plan
70-
71- ### Phase 1: Project Setup
72-
73- ```
74- msgpack-javascript/
75- ├── wasm/
76- │ ├── utf8.wat # hand-written WAT source
77- │ └── build.sh # wat2wasm + base64 generation
78- ├── src/
79- │ └── utils/
80- │ ├── utf8.ts # existing pure JS
81- │ ├── utf8-wasm.ts # wasm loader + integration
82- │ └── utf8-wasm-binary.ts # auto-generated base64 wasm
83- ```
84-
85- ### Phase 2: WAT Implementation
86-
87- ``` wat
88- ;; wasm/utf8.wat
89- (module
90- ;; Import js-string builtins
91- ;; Note: string parameters use externref, string returns use (ref extern)
92- (import "wasm:js-string" "length"
93- (func $str_length (param externref) (result i32)))
94- (import "wasm:js-string" "charCodeAt"
95- (func $str_charCodeAt (param externref i32) (result i32)))
96- (import "wasm:js-string" "fromCharCode"
97- (func $str_fromCharCode (param i32) (result (ref extern))))
98- (import "wasm:js-string" "concat"
99- (func $str_concat (param externref externref) (result (ref extern))))
100-
101- ;; Linear memory for UTF-8 bytes (exported for JS access)
102- (memory (export "memory") 1)
103-
104- ;; Count UTF-8 byte length of a JS string
105- (func (export "utf8Count") (param $str externref) (result i32)
106- (local $i i32)
107- (local $len i32)
108- (local $byteLen i32)
109- (local $code i32)
110-
111- (local.set $len (call $str_length (local.get $str)))
112-
113- (block $break
114- (loop $continue
115- (br_if $break (i32.ge_u (local.get $i) (local.get $len)))
116-
117- (local.set $code
118- (call $str_charCodeAt (local.get $str) (local.get $i)))
119-
120- ;; Count bytes based on code point range
121- (if (i32.lt_u (local.get $code) (i32.const 0x80))
122- (then
123- (local.set $byteLen (i32.add (local.get $byteLen) (i32.const 1))))
124- (else (if (i32.lt_u (local.get $code) (i32.const 0x800))
125- (then
126- (local.set $byteLen (i32.add (local.get $byteLen) (i32.const 2))))
127- (else (if (i32.and
128- (i32.ge_u (local.get $code) (i32.const 0xD800))
129- (i32.le_u (local.get $code) (i32.const 0xDBFF)))
130- ;; High surrogate - 4 bytes total, skip low surrogate
131- (then
132- (local.set $byteLen (i32.add (local.get $byteLen) (i32.const 4)))
133- (local.set $i (i32.add (local.get $i) (i32.const 1))))
134- (else
135- (local.set $byteLen (i32.add (local.get $byteLen) (i32.const 3)))))))))
136-
137- (local.set $i (i32.add (local.get $i) (i32.const 1)))
138- (br $continue)))
139-
140- (local.get $byteLen))
141-
142- ;; Encode JS string to UTF-8 bytes at offset, returns bytes written
143- (func (export "utf8Encode") (param $str externref) (param $offset i32) (result i32)
144- ;; Similar loop: charCodeAt -> encode -> store to memory
145- (local $i i32)
146- (local $len i32)
147- (local $pos i32)
148- (local $code i32)
149-
150- (local.set $len (call $str_length (local.get $str)))
151- (local.set $pos (local.get $offset))
152-
153- (block $break
154- (loop $continue
155- (br_if $break (i32.ge_u (local.get $i) (local.get $len)))
156-
157- (local.set $code (call $str_charCodeAt (local.get $str) (local.get $i)))
158-
159- ;; 1-byte (ASCII)
160- (if (i32.lt_u (local.get $code) (i32.const 0x80))
161- (then
162- (i32.store8 (local.get $pos) (local.get $code))
163- (local.set $pos (i32.add (local.get $pos) (i32.const 1))))
164- (else (if (i32.lt_u (local.get $code) (i32.const 0x800))
165- ;; 2-byte
166- (then
167- (i32.store8 (local.get $pos)
168- (i32.or (i32.shr_u (local.get $code) (i32.const 6)) (i32.const 0xC0)))
169- (i32.store8 (i32.add (local.get $pos) (i32.const 1))
170- (i32.or (i32.and (local.get $code) (i32.const 0x3F)) (i32.const 0x80)))
171- (local.set $pos (i32.add (local.get $pos) (i32.const 2))))
172- ;; 3-byte or 4-byte (surrogate pair)
173- (else
174- ;; TODO: handle surrogates for 4-byte
175- (i32.store8 (local.get $pos)
176- (i32.or (i32.shr_u (local.get $code) (i32.const 12)) (i32.const 0xE0)))
177- (i32.store8 (i32.add (local.get $pos) (i32.const 1))
178- (i32.or (i32.and (i32.shr_u (local.get $code) (i32.const 6)) (i32.const 0x3F)) (i32.const 0x80)))
179- (i32.store8 (i32.add (local.get $pos) (i32.const 2))
180- (i32.or (i32.and (local.get $code) (i32.const 0x3F)) (i32.const 0x80)))
181- (local.set $pos (i32.add (local.get $pos) (i32.const 3)))))))
182-
183- (local.set $i (i32.add (local.get $i) (i32.const 1)))
184- (br $continue)))
185-
186- (i32.sub (local.get $pos) (local.get $offset)))
187-
188- ;; Decode UTF-8 bytes from memory to JS string
189- (func (export "utf8Decode") (param $offset i32) (param $length i32) (result externref)
190- ;; Build string by reading bytes, decoding, calling fromCharCode + concat
191- ;; ... implementation
192- (call $str_fromCharCode (i32.const 0))) ;; placeholder
193- )
194- ```
195-
196- ### Phase 3: Build Script
25+ Requires [Binaryen](https://github.com/WebAssembly/binaryen) (`brew install binaryen`):

```bash
199- #!/bin/bash
200- # wasm/build.sh
201- # Requires: binaryen (brew install binaryen)
202-
203- wasm-as utf8.wat -o utf8.wasm --enable-reference-types --enable-gc
204-
205- # Generate base64-encoded TypeScript module
206- echo "// Auto-generated - do not edit" > ../src/utils/utf8-wasm-binary.ts
207- echo "export const wasmBinary = \"$(base64 -i utf8.wasm)\";" >> ../src/utils/utf8-wasm-binary.ts
28+ ./build.sh
```

210- ### Phase 4: TypeScript Integration
211-
212- ```typescript
213- // src/utils/utf8-wasm-binary.ts (auto-generated)
214- export const wasmBinary = "AGFzbQEAAAA..."; // base64-encoded wasm
215-
216- // src/utils/utf8-wasm.ts
217- import { utf8Count as utf8CountJs } from "./utf8.js";
218- import { wasmBinary } from "./utf8-wasm-binary.js";
219-
220- interface WasmExports {
221-   memory: WebAssembly.Memory;
222-   utf8Count(str: string): number;
223-   utf8Encode(str: string, offset: number): number;
224-   utf8Decode(offset: number, length: number): string;
225- }
226-
227- let wasm: WasmExports | null = null;
228-
229- function base64ToBytes(base64: string): Uint8Array {
230-   const binary = atob(base64);
231-   const bytes = new Uint8Array(binary.length);
232-   for (let i = 0; i < binary.length; i++) {
233-     bytes[i] = binary.charCodeAt(i);
234-   }
235-   return bytes;
236- }
237-
238- // Polyfill for js-string-builtins (used when native builtins unavailable)
239- const jsStringPolyfill = {
240-   "wasm:js-string": {
241-     length: (s: string) => s.length,
242-     charCodeAt: (s: string, i: number) => s.charCodeAt(i),
243-     codePointAt: (s: string, i: number) => s.codePointAt(i),
244-     fromCharCode: (code: number) => String.fromCharCode(code),
245-     fromCodePoint: (code: number) => String.fromCodePoint(code),
246-     concat: (a: string, b: string) => a + b,
247-     substring: (s: string, start: number, end: number) => s.substring(start, end),
248-     equals: (a: string, b: string) => a === b,
249-   },
250- };
251-
252- // Synchronous initialization
253- function initWasm(): boolean {
254-   if (wasm) return true;
255-
256-   try {
257-     const bytes = base64ToBytes(wasmBinary);
258-     // Try with builtins first (native support)
259-     // If builtins not supported, option is ignored and polyfill is used
260-     const module = new WebAssembly.Module(bytes, { builtins: ["js-string"] });
261-     const instance = new WebAssembly.Instance(module, jsStringPolyfill);
262-     wasm = instance.exports as WasmExports;
263-     return true;
264-   } catch {
265-     return false; // Fallback to pure JS (utf8CountJs, etc.)
266-   }
267- }
268-
269- // Try init at module load
270- const wasmAvailable = initWasm();
271-
272- export function utf8Count(str: string): number {
273-   return wasm ? wasm.utf8Count(str) : utf8CountJs(str);
274- }
275- ```
276-
277- **Progressive enhancement:**
278- - Native builtins → engine ignores import object, uses optimized builtins
279- - No native builtins → engine uses polyfill from import object
280- - Wasm fails entirely → falls back to pure JS implementation
281-
282- **Benefits of base64 inline:**
283- - No async initialization needed - sync `new WebAssembly.Module()`
284- - No fetch/network request - works in all environments
285- - Single file distribution - no separate .wasm asset
286- - Bundle size: ~1.3x wasm size (base64 overhead), but gzip compresses well
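
The ~1.3x figure follows from base64 mapping every 3 input bytes to 4 output characters (with padding on the final group). A small sketch of that arithmetic, using a hypothetical helper name `base64Length`:

```typescript
// Illustrative helper: number of characters in the padded base64 encoding
// of a binary that is `byteLength` bytes long.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// A 2048-byte (2KB) wasm binary becomes 2732 base64 characters, roughly 2.7KB.
const inlineSize = base64Length(2048);
```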
287-
288- ## Compatibility Matrix
289-
290- | Environment | Native builtins | Wasm + polyfill | Pure JS fallback |
291- | -------------| -----------------| -----------------| ------------------|
292- | Chrome 131+ | Yes | - | - |
293- | Firefox 134+ | Yes | - | - |
294- | Safari 18+ | TBD | Yes | - |
295- | Node.js 24+ | Yes (V8 13.6+) | - | - |
296- | Node.js 22-23 | Flag required | Yes | - |
297- | Deno | TBD | Yes | - |
298- | Older browsers | No | Yes | - |
299- | No Wasm support | - | - | Yes |
300-
301- Three-tier fallback:
302- 1. **Native builtins** - best performance (engine-optimized)
303- 2. **Wasm + polyfill** - good performance (wasm logic, JS string ops)
304- 3. **Pure JS** - baseline (current implementation)
305-
306- ## Benchmarking Strategy
307-
308- 1. Reuse existing benchmarks:
309- - `benchmark/encode-string.ts`
310- - `benchmark/decode-string.ts`
31+ This compiles `utf8.wat` and generates `src/utils/utf8-wasm-binary.ts` with the base64-encoded binary.

312- 2. Add Wasm variants and compare across string sizes:
313- - Short strings (< 50 bytes): likely JS faster due to call overhead
314- - Medium strings (50-1000 bytes): Wasm should win
315- - Large strings (> 1000 bytes): TextEncoder/TextDecoder still optimal
33+ ## Runtime Requirements

317- ## Success Criteria
35+ | Environment | Support |
36+ | -------------| ---------|
37+ | Node.js 24+ | Native (V8 13.6+) |
38+ | Node.js 22-23 | `--experimental-wasm-imported-strings` flag |
39+ | Chrome 131+ | Native |
40+ | Firefox 134+ | Native |
41+ | Safari | TBD |
42+ | Older/unsupported | Falls back to pure JS |

319- 1. **Performance**: >= 1.5x speedup for medium strings (50-1000 bytes)
320- 2. **Bundle size**: Wasm binary < 2KB (~2.7KB as base64, compresses well with gzip)
321- 3. **Compatibility**: Zero breakage with fallback to pure JS
322- 4. **Maintainability**: Simple WAT, easy to understand
44+ ## Architecture

324- ## Decisions
46+ Three-tier dispatch based on string/byte length:

326- - **Node.js**: js-string-builtins enabled by default in Node.js 24+ (V8 13.6+). For Node.js 22-23, use `--experimental-wasm-imported-strings` flag.
48+ | Length | Method | Reason |
49+ | --------| --------| --------|
50+ | ≤ 50 | Pure JS | Lowest call overhead |
51+ | 51-1000 | WASM | Optimal for medium strings |
52+ | > 1000 | TextEncoder/TextDecoder | SIMD-optimized for bulk |
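
The thresholds in the table can be read as a small selector function. The sketch below is an assumption about how such a dispatch might look, not the code in this change: the names `selectTier`, `JS_MAX_LENGTH`, and `WASM_MAX_LENGTH` are hypothetical, and routing medium inputs to pure JS when Wasm is unavailable is inferred from the "Falls back to pure JS" row under Runtime Requirements.

```typescript
// Thresholds copied from the table; the constant names are hypothetical.
const JS_MAX_LENGTH = 50;
const WASM_MAX_LENGTH = 1000;

type Tier = "js" | "wasm" | "text-encoder";

// Pick an implementation tier for an input of the given length
// (string length when encoding, byte length when decoding).
function selectTier(length: number, wasmAvailable: boolean): Tier {
  if (length <= JS_MAX_LENGTH) return "js";
  if (length <= WASM_MAX_LENGTH) {
    // Assumption: without a usable Wasm instance, medium inputs use pure JS.
    return wasmAvailable ? "wasm" : "js";
  }
  return "text-encoder";
}
```

Keeping the selection separate from the encoders makes the boundary cases (50, 51, 1000, 1001) easy to unit-test and to retune after benchmarking.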

## References

- [js-string-builtins proposal](https://github.com/WebAssembly/js-string-builtins)
- [MDN: WebAssembly JavaScript builtins](https://developer.mozilla.org/en-US/docs/WebAssembly/Guides/JavaScript_builtins)
332- - [WebAssembly 3.0 announcement](https://webassembly.org/news/2025-09-17-wasm-3.0/)
333- - [Previous PR #26](https://github.com/msgpack/msgpack-javascript/pull/26)
334- - [Removal PR #95](https://github.com/msgpack/msgpack-javascript/pull/95)