Commit 5ff999d

committed
update doc
1 parent feba4e2 commit 5ff999d

1 file changed

+24
-301
lines changed

wasm/README.md

Lines changed: 24 additions & 301 deletions
@@ -1,4 +1,4 @@
1-
# WebAssembly String Processing Plan
1+
# WebAssembly UTF-8 String Processing
22

33
## Background
44

@@ -14,321 +14,44 @@ The main issues were:
1414

1515
### What Changed in 2025
1616

17-
**js-string-builtins** (WebAssembly 3.0, September 2025) fundamentally changes the equation:
17+
**js-string-builtins** (WebAssembly 3.0) fundamentally changes the equation:
1818

19-
- Direct import of JS string operations (`length`, `charCodeAt`, `substring`, etc.) from `wasm:js-string`
19+
- Direct import of JS string operations from `wasm:js-string`
2020
- No glue code overhead - operations can be inlined by the engine
21-
- No memory copying at boundaries when consuming JS strings
22-
- Strings stay in JS representation (UTF-16) - no UTF-8/UTF-16 conversion
21+
- Uses WASM GC arrays with `intoCharCodeArray`/`fromCharCodeArray` for bulk operations
2322

24-
Browser/runtime support:
25-
- Chrome 131+ (enabled by default)
26-
- Firefox 134+
27-
- Safari: TBD (expressed openness)
28-
- Node.js 24+ (V8 13.6+, enabled by default)
29-
- Node.js 22-23: `--experimental-wasm-imported-strings` flag required
23+
## Building
3024

31-
## Proposal: Hand-written WAT
32-
33-
### Why WAT over Rust/wasm-bindgen
34-
35-
| Aspect | Hand-written WAT | Rust + wasm-bindgen |
36-
|--------|------------------|---------------------|
37-
| Overhead | Zero - direct builtins | Glue code overhead |
38-
| Binary size | Minimal (~1-2KB) | Larger (~10KB+) |
39-
| Dependencies | None (just wat2wasm) | Rust toolchain, wasm-pack |
40-
| Complexity | Simple for small scope | Overkill for 3 functions |
41-
| js-string-builtins | Direct imports | Indirect, still evolving |
42-
| Contributor barrier | Low (WAT is simple) | Higher (Rust knowledge) |
43-
44-
For our limited scope (UTF-8 encode/decode), hand-written WAT is ideal.
45-
46-
### What to Implement in Wasm
47-
48-
1. **UTF-8 byte length counting** (`utf8Count`)
49-
- Iterate string via `charCodeAt`, calculate byte length
50-
51-
2. **UTF-8 encoding** (`utf8Encode`)
52-
- Read chars via `charCodeAt`, write UTF-8 bytes to linear memory
53-
54-
3. **UTF-8 decoding** (`utf8Decode`)
55-
- Read UTF-8 bytes from memory, build string via `fromCharCode`/`fromCodePoint`
56-
57-
### Available js-string-builtins
58-
59-
From `wasm:js-string`:
60-
- `length` - get string length
61-
- `charCodeAt` - get UTF-16 code unit at index
62-
- `codePointAt` - get Unicode code point at index (handles surrogates)
63-
- `fromCharCode` - create single-char string from code unit
64-
- `fromCodePoint` - create single-char string from code point
65-
- `concat` - concatenate strings
66-
- `substring` - extract substring
67-
- `equals` - compare strings
68-
69-
## Implementation Plan
70-
71-
### Phase 1: Project Setup
72-
73-
```
74-
msgpack-javascript/
75-
├── wasm/
76-
│ ├── utf8.wat # hand-written WAT source
77-
│ └── build.sh # wat2wasm + base64 generation
78-
├── src/
79-
│ └── utils/
80-
│ ├── utf8.ts # existing pure JS
81-
│ ├── utf8-wasm.ts # wasm loader + integration
82-
│ └── utf8-wasm-binary.ts # auto-generated base64 wasm
83-
```
84-
85-
### Phase 2: WAT Implementation
86-
87-
```wat
88-
;; wasm/utf8.wat
89-
(module
90-
;; Import js-string builtins
91-
;; Note: string parameters use externref, string returns use (ref extern)
92-
(import "wasm:js-string" "length"
93-
(func $str_length (param externref) (result i32)))
94-
(import "wasm:js-string" "charCodeAt"
95-
(func $str_charCodeAt (param externref i32) (result i32)))
96-
(import "wasm:js-string" "fromCharCode"
97-
(func $str_fromCharCode (param i32) (result (ref extern))))
98-
(import "wasm:js-string" "concat"
99-
(func $str_concat (param externref externref) (result (ref extern))))
100-
101-
;; Linear memory for UTF-8 bytes (exported for JS access)
102-
(memory (export "memory") 1)
103-
104-
;; Count UTF-8 byte length of a JS string
105-
(func (export "utf8Count") (param $str externref) (result i32)
106-
(local $i i32)
107-
(local $len i32)
108-
(local $byteLen i32)
109-
(local $code i32)
110-
111-
(local.set $len (call $str_length (local.get $str)))
112-
113-
(block $break
114-
(loop $continue
115-
(br_if $break (i32.ge_u (local.get $i) (local.get $len)))
116-
117-
(local.set $code
118-
(call $str_charCodeAt (local.get $str) (local.get $i)))
119-
120-
;; Count bytes based on code point range
121-
(if (i32.lt_u (local.get $code) (i32.const 0x80))
122-
(then
123-
(local.set $byteLen (i32.add (local.get $byteLen) (i32.const 1))))
124-
(else (if (i32.lt_u (local.get $code) (i32.const 0x800))
125-
(then
126-
(local.set $byteLen (i32.add (local.get $byteLen) (i32.const 2))))
127-
(else (if (i32.and
128-
(i32.ge_u (local.get $code) (i32.const 0xD800))
129-
(i32.le_u (local.get $code) (i32.const 0xDBFF)))
130-
;; High surrogate - 4 bytes total, skip low surrogate
131-
(then
132-
(local.set $byteLen (i32.add (local.get $byteLen) (i32.const 4)))
133-
(local.set $i (i32.add (local.get $i) (i32.const 1))))
134-
(else
135-
(local.set $byteLen (i32.add (local.get $byteLen) (i32.const 3)))))))))
136-
137-
(local.set $i (i32.add (local.get $i) (i32.const 1)))
138-
(br $continue)))
139-
140-
(local.get $byteLen))
141-
142-
;; Encode JS string to UTF-8 bytes at offset, returns bytes written
143-
(func (export "utf8Encode") (param $str externref) (param $offset i32) (result i32)
144-
;; Similar loop: charCodeAt -> encode -> store to memory
145-
(local $i i32)
146-
(local $len i32)
147-
(local $pos i32)
148-
(local $code i32)
149-
150-
(local.set $len (call $str_length (local.get $str)))
151-
(local.set $pos (local.get $offset))
152-
153-
(block $break
154-
(loop $continue
155-
(br_if $break (i32.ge_u (local.get $i) (local.get $len)))
156-
157-
(local.set $code (call $str_charCodeAt (local.get $str) (local.get $i)))
158-
159-
;; 1-byte (ASCII)
160-
(if (i32.lt_u (local.get $code) (i32.const 0x80))
161-
(then
162-
(i32.store8 (local.get $pos) (local.get $code))
163-
(local.set $pos (i32.add (local.get $pos) (i32.const 1))))
164-
(else (if (i32.lt_u (local.get $code) (i32.const 0x800))
165-
;; 2-byte
166-
(then
167-
(i32.store8 (local.get $pos)
168-
(i32.or (i32.shr_u (local.get $code) (i32.const 6)) (i32.const 0xC0)))
169-
(i32.store8 (i32.add (local.get $pos) (i32.const 1))
170-
(i32.or (i32.and (local.get $code) (i32.const 0x3F)) (i32.const 0x80)))
171-
(local.set $pos (i32.add (local.get $pos) (i32.const 2))))
172-
;; 3-byte or 4-byte (surrogate pair)
173-
(else
174-
;; TODO: handle surrogates for 4-byte
175-
(i32.store8 (local.get $pos)
176-
(i32.or (i32.shr_u (local.get $code) (i32.const 12)) (i32.const 0xE0)))
177-
(i32.store8 (i32.add (local.get $pos) (i32.const 1))
178-
(i32.or (i32.and (i32.shr_u (local.get $code) (i32.const 6)) (i32.const 0x3F)) (i32.const 0x80)))
179-
(i32.store8 (i32.add (local.get $pos) (i32.const 2))
180-
(i32.or (i32.and (local.get $code) (i32.const 0x3F)) (i32.const 0x80)))
181-
(local.set $pos (i32.add (local.get $pos) (i32.const 3)))))))
182-
183-
(local.set $i (i32.add (local.get $i) (i32.const 1)))
184-
(br $continue)))
185-
186-
(i32.sub (local.get $pos) (local.get $offset)))
187-
188-
;; Decode UTF-8 bytes from memory to JS string
189-
(func (export "utf8Decode") (param $offset i32) (param $length i32) (result externref)
190-
;; Build string by reading bytes, decoding, calling fromCharCode + concat
191-
;; ... implementation
192-
(call $str_fromCharCode (i32.const 0))) ;; placeholder
193-
)
194-
```
195-
196-
### Phase 3: Build Script
25+
Requires [Binaryen](https://github.com/WebAssembly/binaryen) (`brew install binaryen`):
19726

19827
```bash
199-
#!/bin/bash
200-
# wasm/build.sh
201-
# Requires: binaryen (brew install binaryen)
202-
203-
wasm-as utf8.wat -o utf8.wasm --enable-reference-types --enable-gc
204-
205-
# Generate base64-encoded TypeScript module
206-
echo "// Auto-generated - do not edit" > ../src/utils/utf8-wasm-binary.ts
207-
echo "export const wasmBinary = \"$(base64 -i utf8.wasm)\";" >> ../src/utils/utf8-wasm-binary.ts
28+
./build.sh
20829
```
20930

210-
### Phase 4: TypeScript Integration
211-
212-
```typescript
213-
// src/utils/utf8-wasm-binary.ts (auto-generated)
214-
export const wasmBinary = "AGFzbQEAAAA..."; // base64-encoded wasm
215-
216-
// src/utils/utf8-wasm.ts
217-
import { utf8Count as utf8CountJs } from "./utf8.js";
218-
import { wasmBinary } from "./utf8-wasm-binary.js";
219-
220-
interface WasmExports {
221-
memory: WebAssembly.Memory;
222-
utf8Count(str: string): number;
223-
utf8Encode(str: string, offset: number): number;
224-
utf8Decode(offset: number, length: number): string;
225-
}
226-
227-
let wasm: WasmExports | null = null;
228-
229-
function base64ToBytes(base64: string): Uint8Array {
230-
const binary = atob(base64);
231-
const bytes = new Uint8Array(binary.length);
232-
for (let i = 0; i < binary.length; i++) {
233-
bytes[i] = binary.charCodeAt(i);
234-
}
235-
return bytes;
236-
}
237-
238-
// Polyfill for js-string-builtins (used when native builtins unavailable)
239-
const jsStringPolyfill = {
240-
"wasm:js-string": {
241-
length: (s: string) => s.length,
242-
charCodeAt: (s: string, i: number) => s.charCodeAt(i),
243-
codePointAt: (s: string, i: number) => s.codePointAt(i),
244-
fromCharCode: (code: number) => String.fromCharCode(code),
245-
fromCodePoint: (code: number) => String.fromCodePoint(code),
246-
concat: (a: string, b: string) => a + b,
247-
substring: (s: string, start: number, end: number) => s.substring(start, end),
248-
equals: (a: string, b: string) => a === b,
249-
},
250-
};
251-
252-
// Synchronous initialization
253-
function initWasm(): boolean {
254-
if (wasm) return true;
255-
256-
try {
257-
const bytes = base64ToBytes(wasmBinary);
258-
// Try with builtins first (native support)
259-
// If builtins not supported, option is ignored and polyfill is used
260-
const module = new WebAssembly.Module(bytes, { builtins: ["js-string"] });
261-
const instance = new WebAssembly.Instance(module, jsStringPolyfill);
262-
wasm = instance.exports as WasmExports;
263-
return true;
264-
} catch {
265-
return false; // Fallback to pure JS (utf8CountJs, etc.)
266-
}
267-
}
268-
269-
// Try init at module load
270-
const wasmAvailable = initWasm();
271-
272-
export function utf8Count(str: string): number {
273-
return wasm ? wasm.utf8Count(str) : utf8CountJs(str);
274-
}
275-
```
276-
277-
**Progressive enhancement:**
278-
- Native builtins → engine ignores import object, uses optimized builtins
279-
- No native builtins → engine uses polyfill from import object
280-
- Wasm fails entirely → falls back to pure JS implementation
281-
282-
**Benefits of base64 inline:**
283-
- No async initialization needed - sync `new WebAssembly.Module()`
284-
- No fetch/network request - works in all environments
285-
- Single file distribution - no separate .wasm asset
286-
- Bundle size: ~1.3x wasm size (base64 overhead), but gzip compresses well
287-
288-
## Compatibility Matrix
289-
290-
| Environment | Native builtins | Wasm + polyfill | Pure JS fallback |
291-
|-------------|-----------------|-----------------|------------------|
292-
| Chrome 131+ | Yes | - | - |
293-
| Firefox 134+ | Yes | - | - |
294-
| Safari 18+ | TBD | Yes | - |
295-
| Node.js 24+ | Yes (V8 13.6+) | - | - |
296-
| Node.js 22-23 | Flag required | Yes | - |
297-
| Deno | TBD | Yes | - |
298-
| Older browsers | No | Yes | - |
299-
| No Wasm support | - | - | Yes |
300-
301-
Three-tier fallback:
302-
1. **Native builtins** - best performance (engine-optimized)
303-
2. **Wasm + polyfill** - good performance (wasm logic, JS string ops)
304-
3. **Pure JS** - baseline (current implementation)
305-
306-
## Benchmarking Strategy
307-
308-
1. Reuse existing benchmarks:
309-
- `benchmark/encode-string.ts`
310-
- `benchmark/decode-string.ts`
31+
This compiles `utf8.wat` and generates `src/utils/utf8-wasm-binary.ts` with the base64-encoded binary.
31132

312-
2. Add Wasm variants and compare across string sizes:
313-
- Short strings (< 50 bytes): likely JS faster due to call overhead
314-
- Medium strings (50-1000 bytes): Wasm should win
315-
- Large strings (> 1000 bytes): TextEncoder/TextDecoder still optimal
33+
## Runtime Requirements
31634

317-
## Success Criteria
35+
| Environment | Support |
36+
|-------------|---------|
37+
| Node.js 24+ | Native (V8 13.6+) |
38+
| Node.js 22-23 | `--experimental-wasm-imported-strings` flag |
39+
| Chrome 131+ | Native |
40+
| Firefox 134+ | Native |
41+
| Safari | TBD |
42+
| Older/unsupported | Falls back to pure JS |
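The "falls back to pure JS" row can be illustrated with a small byte-length counter. This is only a sketch: the name `utf8CountJs` is borrowed from the import alias in the removed TypeScript above, and the body is an assumption, not the repository's actual `src/utils/utf8.ts`. It assumes an unpaired surrogate is counted as 3 bytes, matching the size of the U+FFFD replacement that `TextEncoder` would emit.

```typescript
// Hypothetical pure-JS fallback: number of UTF-8 bytes needed for a JS string.
// Walks UTF-16 code units; a valid surrogate pair becomes one 4-byte sequence.
function utf8CountJs(str: string): number {
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1; // ASCII
    } else if (code < 0x800) {
      bytes += 2; // two-byte sequence
    } else if (code >= 0xd800 && code <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // surrogate pair encodes one supplementary code point
        i++; // skip the low surrogate
      } else {
        bytes += 3; // unpaired high surrogate (assumed to count as U+FFFD)
      }
    } else {
      bytes += 3; // rest of the BMP, including unpaired surrogates
    }
  }
  return bytes;
}
```

Unlike the removed WAT `utf8Count`, this sketch checks that a high surrogate is actually followed by a low surrogate before counting 4 bytes.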
31843

319-
1. **Performance**: >= 1.5x speedup for medium strings (50-1000 bytes)
320-
2. **Bundle size**: Wasm binary < 2KB (~2.7KB as base64, compresses well with gzip)
321-
3. **Compatibility**: Zero breakage with fallback to pure JS
322-
4. **Maintainability**: Simple WAT, easy to understand
44+
## Architecture
32345

324-
## Decisions
46+
Three-tier dispatch based on string/byte length:
32547

326-
- **Node.js**: js-string-builtins enabled by default in Node.js 24+ (V8 13.6+). For Node.js 22-23, use `--experimental-wasm-imported-strings` flag.
48+
| Length | Method | Reason |
49+
|--------|--------|--------|
50+
| ≤ 50 | Pure JS | Lowest call overhead |
51+
| 51-1000 | WASM | Optimal for medium strings |
52+
| > 1000 | TextEncoder/TextDecoder | SIMD-optimized for bulk |
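A minimal sketch of how the three-tier dispatch in this table might look for byte counting. The names `makeUtf8Count`, `SHORT_MAX`, and `MEDIUM_MAX` are hypothetical, and the thresholds are assumed to apply to `str.length` (UTF-16 code units); the README does not say which unit the encode path measures. The pure-JS and Wasm counters are passed in, so the sketch makes no claim about how the real module wires them together.

```typescript
type CountFn = (str: string) => number;

// Thresholds taken from the table above; the unit (code units) is an assumption.
const SHORT_MAX = 50;
const MEDIUM_MAX = 1000;

const encoder = new TextEncoder();

// Builds a counter that picks a tier by string length.
// `wasmCount` may be null when Wasm or js-string-builtins are unavailable,
// in which case medium strings also use the pure-JS path.
function makeUtf8Count(jsCount: CountFn, wasmCount: CountFn | null): CountFn {
  return (str: string): number => {
    if (str.length > MEDIUM_MAX) {
      // Large strings: let the engine's native encoder do the bulk work.
      return encoder.encode(str).byteLength;
    }
    if (str.length > SHORT_MAX && wasmCount !== null) {
      return wasmCount(str); // medium strings: Wasm tier
    }
    return jsCount(str); // short strings, or Wasm unavailable
  };
}
```

Injecting the two counters keeps the dispatch logic testable on its own and makes the "Wasm unavailable" fallback an explicit branch rather than a side effect of a failed import.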
32753

32854
## References
32955

33056
- [js-string-builtins proposal](https://github.com/WebAssembly/js-string-builtins)
33157
- [MDN: WebAssembly JavaScript builtins](https://developer.mozilla.org/en-US/docs/WebAssembly/Guides/JavaScript_builtins)
332-
- [WebAssembly 3.0 announcement](https://webassembly.org/news/2025-09-17-wasm-3.0/)
333-
- [Previous PR #26](https://github.com/msgpack/msgpack-javascript/pull/26)
334-
- [Removal PR #95](https://github.com/msgpack/msgpack-javascript/pull/95)
