spark-wasm browser demo

A real, runnable single-page app that runs Spark-dialect SQL entirely in the browser via the spark-wasm crate -- no server, no backend, data never leaves the tab. This is the spark-rust analogue of the DuckDB-WASM SQL console.

wasm sandbox, Arrow-backed, 391 registered scalar functions.

What it shows

Type Spark SQL, hit Run, see results -- parsing, planning, and execution all happen in a .wasm module compiled from the same Rust engine the server uses.
A live Spark-dialect to ANSI preview (preprocess_sql).
The registered scalar-function count (udf_count, 391 = DataFusion built-ins + datafusion_spark + our spark_udfs) as an engine smoke-check.
Four sample workloads: filter/project, aggregate, Spark date functions, and a table-less scalar query showcasing Spark-native semantics (element_at(array(10,20,30), 2) -> 20 via Spark's 1-based array indexing, concat_ws, regexp_replace).
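
To make the 1-based indexing in the last sample concrete, here is a small JavaScript model of how Spark's element_at treats array indices. It illustrates the semantics only and is not the engine's code. The name elementAt is made up for this sketch, and returning null for out-of-range lookups assumes Spark's non-ANSI mode; whether this engine runs in that mode is not stated here.

```javascript
// Illustrative model of Spark's element_at on arrays (not the engine's code).
function elementAt(arr, index) {
  if (!Number.isInteger(index)) {
    throw new TypeError("index must be an integer");
  }
  if (index === 0) {
    // Spark rejects index 0 because SQL array indices start at 1.
    throw new RangeError("SQL array indices start at 1");
  }
  // Positive indices count from the front (1-based), negative from the back.
  const pos = index > 0 ? index - 1 : arr.length + index;
  // Assumption: non-ANSI behaviour, where out-of-range lookups give NULL.
  return pos >= 0 && pos < arr.length ? arr[pos] : null;
}

console.log(elementAt([10, 20, 30], 2));  // 20, matching the sample query
console.log(elementAt([10, 20, 30], -1)); // 30, counting from the end
```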

The demo uses the friendly JSON path run_sql_json(sql, rowsJson): you hand it rows as a JSON array of objects (registered as the table `data`) and get a JSON array back -- no apache-arrow JS library needed. (The zero-copy Arrow-IPC export run_sql(sql, Uint8Array) -> Uint8Array is also available for callers that already hold Arrow buffers.)
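
A minimal sketch of calling the JSON path from JavaScript might look like the following. The helper name runQuery and its { ok, rows, error } result shape are made up for illustration. It also assumes run_sql_json returns JSON text (directly or via a Promise) and throws on failure, none of which is confirmed above. The engine is passed in as a parameter so the helper can be exercised without the wasm module.

```javascript
// `engine` stands in for the module exported by pkg/spark_wasm.js.
async function runQuery(engine, sql, rows) {
  try {
    // `await` works whether the export returns a value or a Promise
    // (assumption: which of the two it is has not been verified).
    const out = await engine.run_sql_json(sql, JSON.stringify(rows));
    // Assumption: the result is JSON text; pass it through if already parsed.
    const parsed = typeof out === "string" ? JSON.parse(out) : out;
    return { ok: true, rows: parsed };
  } catch (err) {
    // Surface engine errors (for example an unknown function) to the caller.
    return { ok: false, error: String(err) };
  }
}

// Usage in the page, with the real module in place of `engine`:
//   const res = await runQuery(engine, "SELECT name FROM data WHERE age > 30",
//                              [{ name: "ada", age: 36 }, { name: "bob", age: 25 }]);
```

Returning a result object instead of throwing keeps the UI code to a single branch: render res.rows or show res.error.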

Run it

# one-time prereqs
rustup target add wasm32-unknown-unknown
cargo install wasm-bindgen-cli --version 0.2.122   # must match Cargo.lock

# build the wasm + JS bindings into ./pkg, then serve
bash build.sh
bash serve.sh        # -> http://localhost:8000/

Open http://localhost:8000/ and query away. Edit the SQL or the JSON rows and press Run (or Ctrl/Cmd+Enter).
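
The Ctrl/Cmd+Enter shortcut can be detected with a small predicate over the keydown event. This is a sketch, not necessarily how app.js does it; the function name isRunShortcut and the #run element id are hypothetical.

```javascript
// True when the key event is Ctrl+Enter (Windows/Linux) or Cmd+Enter (macOS).
function isRunShortcut(event) {
  return event.key === "Enter" && Boolean(event.ctrlKey || event.metaKey);
}

// Browser-only wiring; skipped when no DOM is present (for example in Node).
if (typeof document !== "undefined") {
  document.addEventListener("keydown", (event) => {
    if (isRunShortcut(event)) {
      event.preventDefault();
      // "#run" is a hypothetical id for the Run button.
      document.querySelector("#run")?.click();
    }
  });
}
```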

build.sh sets PROTOC and the getrandom_backend="wasm_js" RUSTFLAG for you. The optional wasm-opt -Oz size pass runs only if wasm-opt is on PATH; without it the demo still runs, the binary is just larger.

Measured artifact sizes (this engine, element_at fix included): the debug wasm is ~1.2 GB, the release wasm is 85 MB, and wasm-bindgen --target web emits pkg/spark_wasm_bg.wasm at 80 MB plus pkg/spark_wasm.js at ~18 KB. A production bundle should always run wasm-opt -Oz on top (for comparison, DuckDB-WASM's shipped artifact is ~30 MB). The size reflects the full 391-function Spark surface compiled in; tree-shaking down to the functions a given app uses is the documented follow-up.

How it fits together

  index.html --loads--> app.js --import--> pkg/spark_wasm.js  (wasm-bindgen glue)
                                              |
                                              v
                                     pkg/spark_wasm_bg.wasm   (the engine)
                                              |
   run_sql_json(sql, rowsJson) ---------------+
     = preprocess_spark_sql            (Spark dialect -> ANSI)
     + register rows as MemTable `data`
     + DataFusion parse / plan / execute
     + datafusion_spark + spark_udfs::register_all   (391 scalar fns)
     -> JSON rows out

The network is touched once, to fetch spark_wasm_bg.wasm from this origin. Everything after that -- every byte of your data -- stays inside the wasm sandbox in the tab. That sandbox is the privacy guarantee: the module has no filesystem and no socket capability handed to it.
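
The startup sequence implied by the diagram (fetch and instantiate the module once, then smoke-check it) can be sketched as below. The helper name bootEngine is made up. The sketch assumes the usual wasm-bindgen --target web layout, where the glue's default export is the initializer, and it assumes udf_count takes no arguments and returns a number.

```javascript
// `init` and `engine` are passed in so the sketch does not depend on ./pkg.
async function bootEngine(init, engine, expectedUdfs = 391) {
  // The one network touch: the initializer fetches and instantiates the .wasm.
  await init();
  // Smoke-check: the registered scalar-function count should match.
  const count = Number(await engine.udf_count());
  if (count !== expectedUdfs) {
    throw new Error(
      `engine smoke-check failed: ${count} scalar functions, expected ${expectedUdfs}`
    );
  }
  return count;
}

// Usage in the page (assumed import shape for the wasm-bindgen glue):
//   import init, * as engine from "./pkg/spark_wasm.js";
//   const count = await bootEngine(init, engine);
```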

Files

File	Role
`index.html`	UI shell + styles.
`app.js`	Loads the wasm module and calls `init` / `udf_count` / `preprocess_sql` / `run_sql_json`.
`build.sh`	`cargo build --target wasm32-unknown-unknown` + `wasm-bindgen --target web` to `pkg/`.
`serve.sh`	Static file server (`python3 -m http.server`).
`pkg/`	Generated bindings (git-ignored; produced by `build.sh`).

Notes / limits

MIME type: wasm must be served as application/wasm for streaming compilation. python3 -m http.server (3.11+) does this; if you use another server, configure it.
file:// won't work: ES-module loading and the wasm fetch both require http(s)://.
Engine completeness: the UDF surface is datafusion_spark + the locally-shipped spark_udfs. Functions outside that surface (some higher-order / sketch / geo registries) error cleanly at plan time rather than mis-answering -- extracting those to wasm is the documented follow-up (see the crate README).
Threads: single-threaded by default. Multi-core DuckDB-style threading needs SharedArrayBuffer, which needs COOP+COEP cross-origin isolation headers on whatever serves the page in production.
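
If you replace serve.sh with your own server, the MIME-type and cross-origin-isolation notes above come down to a few response headers. The sketch below computes them as a plain object that any Node or edge handler could apply. The names MIME_TYPES and responseHeaders are made up, and the isolate flag is only needed once a threaded build exists.

```javascript
// Content types for the files this demo serves.
const MIME_TYPES = {
  ".html": "text/html; charset=utf-8",
  ".js": "text/javascript",
  ".wasm": "application/wasm", // required for streaming compilation
};

function responseHeaders(path, { isolate = false } = {}) {
  const dot = path.lastIndexOf(".");
  const ext = dot >= 0 ? path.slice(dot).toLowerCase() : "";
  const headers = {
    "Content-Type": MIME_TYPES[ext] ?? "application/octet-stream",
  };
  if (isolate) {
    // COOP + COEP make the page cross-origin isolated, which browsers
    // require before exposing SharedArrayBuffer.
    headers["Cross-Origin-Opener-Policy"] = "same-origin";
    headers["Cross-Origin-Embedder-Policy"] = "require-corp";
  }
  return headers;
}

console.log(responseHeaders("/pkg/spark_wasm_bg.wasm", { isolate: true }));
```

In the browser, the global crossOriginIsolated reports whether the isolation headers took effect.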
