spark-wasm browser demo
A real, runnable single-page app that runs Spark-dialect SQL entirely in the browser via the spark-wasm crate -- no server, no backend, data never leaves the tab. This is the spark-rust analogue of the DuckDB-WASM SQL console.
wasm sandbox, Arrow-backed, 391 registered scalar functions.
What it shows
- Type Spark SQL, hit Run, see results -- the parse/plan/execute happens in a `.wasm` module compiled from the same Rust engine the server uses.
- A live Spark-dialect to ANSI preview (`preprocess_sql`).
- The registered scalar-function count (`udf_count`, 391 = DataFusion built-ins + `datafusion_spark` + our `spark_udfs`) as an engine smoke-check.
- Four sample workloads: filter/project, aggregate, Spark date functions, and a table-less scalar query showcasing Spark-native semantics (`element_at(array(10,20,30), 2)` -> `20` via Spark's 1-based array indexing, `concat_ws`, `regexp_replace`).
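To make the 1-based indexing concrete, here is a minimal plain-JavaScript sketch of how Spark's `element_at` treats array positions. The `elementAt` helper is hypothetical and not part of spark-wasm. It returns `null` for out-of-range positions, which is Spark's behavior with ANSI mode off; which mode this engine follows is not stated here.

```javascript
// Hypothetical helper, for illustration only: mimics Spark's element_at
// on arrays. It is not an export of spark-wasm.
function elementAt(arr, index) {
  if (!Number.isInteger(index) || index === 0) {
    // Spark rejects index 0 because array positions start at 1.
    throw new RangeError("element_at index must be a non-zero integer");
  }
  // Positive indexes count from the front (1 = first element);
  // negative indexes count from the back (-1 = last element).
  const pos = index > 0 ? index - 1 : arr.length + index;
  // Assumption: out-of-range yields null (Spark with ANSI mode off).
  return pos >= 0 && pos < arr.length ? arr[pos] : null;
}

console.log(elementAt([10, 20, 30], 2));  // 20
console.log(elementAt([10, 20, 30], -1)); // 30
console.log(elementAt([10, 20, 30], 5));  // null
```

By contrast, the zero-based JavaScript expression `[10, 20, 30][2]` is `30`, which is why this query makes a quick check that Spark semantics are in effect.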
The demo uses the friendly JSON path `run_sql_json(sql, rowsJson)`: you hand it rows as a JSON array of objects (registered as the table `data`) and get a JSON array back -- no apache-arrow JS library needed. (The zero-copy Arrow-IPC export `run_sql(sql, Uint8Array) -> Uint8Array` is also available for callers that already hold Arrow buffers.)
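The JSON round trip might be wrapped roughly as follows. This is a sketch under assumptions: `encodeRows`, `queryRows`, and `stubEngine` are hypothetical names, and the sketch assumes `run_sql_json` returns a JSON string, possibly wrapped in a Promise. The engine is passed in as a parameter so the snippet runs against a stub without a wasm build.

```javascript
// Serialize rows for run_sql_json, which expects a JSON array of objects.
function encodeRows(rows) {
  const ok = Array.isArray(rows) &&
    rows.every((r) => r !== null && typeof r === "object" && !Array.isArray(r));
  if (!ok) throw new TypeError("rows must be an array of plain objects");
  return JSON.stringify(rows);
}

// Run a query through any object exposing run_sql_json(sql, rowsJson).
// Assumption: the result is a JSON string; await also accepts a plain value.
async function queryRows(engine, sql, rows) {
  const out = await engine.run_sql_json(sql, encodeRows(rows));
  return typeof out === "string" ? JSON.parse(out) : out;
}

// Stub standing in for the real module: it ignores the SQL and echoes the rows.
const stubEngine = { run_sql_json: async (_sql, rowsJson) => rowsJson };

queryRows(stubEngine, "SELECT * FROM data", [{ id: 1, city: "Oslo" }])
  .then((result) => console.log(result)); // logs the echoed rows
```

In the real page the `engine` argument would be the module imported from `pkg/spark_wasm.js` after initialization.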
Run it
# one-time prereqs
rustup target add wasm32-unknown-unknown
cargo install wasm-bindgen-cli --version 0.2.122 # must match Cargo.lock
# build the wasm + JS bindings into ./pkg, then serve
bash build.sh
bash serve.sh # -> http://localhost:8000/
Open http://localhost:8000/ and query away. Edit the SQL or the JSON rows and press Run (or Ctrl/Cmd+Enter).
`build.sh` sets `PROTOC` and the `getrandom_backend="wasm_js"` RUSTFLAG for you. The optional `wasm-opt -Oz` size pass runs only if `wasm-opt` is on `PATH`; without it the demo still runs, the binary is just larger.

Measured artifact sizes (this engine, `element_at` fix included): the debug wasm is ~1.2 GB, the release wasm is 85 MB, and `wasm-bindgen --target web` emits `pkg/spark_wasm_bg.wasm` at 80 MB + `pkg/spark_wasm.js` at ~18 KB. A production bundle should always run `wasm-opt -Oz` on top (DuckDB-WASM's shipped artifact is ~30 MB for comparison). The size reflects the full 391-function Spark surface compiled in -- tree-shaking to the functions a given app uses is the documented follow-up.
How it fits together
index.html --loads--> app.js --import--> pkg/spark_wasm.js (wasm-bindgen glue)
|
v
pkg/spark_wasm_bg.wasm (the engine)
|
run_sql_json(sql, rowsJson) ---------------+
= preprocess_spark_sql (Spark dialect -> ANSI)
+ register rows as MemTable `data`
+ DataFusion parse / plan / execute
+ datafusion_spark + spark_udfs::register_all (391 scalar fns)
-> JSON rows out
The network is touched once, to fetch spark_wasm_bg.wasm from this origin. Everything after that -- every byte of your data -- stays inside the wasm sandbox in the tab. That sandbox is the privacy guarantee: the module has no filesystem and no socket capability handed to it.
Files
| File | Role |
|---|---|
| `index.html` | UI shell + styles. |
| `app.js` | Loads the wasm module and calls `init` / `udf_count` / `preprocess_sql` / `run_sql_json`. |
| `build.sh` | `cargo build --target wasm32-unknown-unknown` + `wasm-bindgen --target web` into `pkg/`. |
| `serve.sh` | Static file server (`python3 -m http.server`). |
| `pkg/` | Generated bindings (git-ignored; produced by `build.sh`). |
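The startup sequence that `app.js` performs could look roughly like the sketch below. It is not the demo's actual code: `boot` and `stubMod` are hypothetical names, and the sketch assumes `preprocess_sql` takes Spark SQL text and returns the rewritten text. It relies on the `wasm-bindgen --target web` convention that the generated module's default export is the async init function.

```javascript
// `mod` is the namespace object from `await import("./pkg/spark_wasm.js")`.
const EXPECTED_UDFS = 391; // the function count quoted in this README

async function boot(mod) {
  // init: fetches and instantiates pkg/spark_wasm_bg.wasm.
  await mod.default();
  // Smoke-check: Number() covers a number or BigInt return value.
  const count = Number(mod.udf_count());
  if (count !== EXPECTED_UDFS) {
    console.warn(`udf_count is ${count}, expected ${EXPECTED_UDFS}`);
  }
  return {
    count,
    // Assumption: preprocess_sql(sql) returns the ANSI-rewritten SQL text.
    preview: (sql) => mod.preprocess_sql(sql),
  };
}

// Stub module so the sketch runs without a build; in the browser, pass the
// real imported module instead.
const stubMod = {
  default: async () => {},
  udf_count: () => 391,
  preprocess_sql: (sql) => sql,
};

boot(stubMod).then((engine) => console.log(engine.count)); // 391
```

Passing the module in as a parameter keeps the wiring testable outside a browser.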
Notes / limits
- MIME type: wasm must be served as `application/wasm` for streaming compilation. `python3 -m http.server` (3.11+) does this; if you use another server, configure it.
- `file://` won't work: ES-module + wasm fetch requires `http(s)://`.
- Engine completeness: the UDF surface is `datafusion_spark` + the locally-shipped `spark_udfs`. Functions outside that surface (some higher-order / sketch / geo registries) error cleanly at plan time rather than mis-answering -- extracting those to wasm is the documented follow-up (see the crate README).
- Threads: single-threaded by default. Multi-core DuckDB-style threading needs `SharedArrayBuffer`, which needs COOP+COEP cross-origin isolation headers on whatever serves the page in production.
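The MIME-type and threading notes both come down to response headers. The sketch below shows one way a custom static server could compute them; `MIME` and `headersFor` are hypothetical names, not part of this demo. The two isolation header values are the standard ones browsers require before exposing `SharedArrayBuffer`.

```javascript
// Content-Type values for the file types this demo serves.
const MIME = {
  ".html": "text/html; charset=utf-8",
  ".js": "text/javascript; charset=utf-8",
  ".wasm": "application/wasm", // needed for streaming compilation
};

// Build response headers for a request path. With isolate=true, add the
// COOP + COEP pair that enables cross-origin isolation.
function headersFor(path, isolate = false) {
  const dot = path.lastIndexOf(".");
  const ext = dot >= 0 ? path.slice(dot).toLowerCase() : "";
  const headers = { "Content-Type": MIME[ext] || "application/octet-stream" };
  if (isolate) {
    headers["Cross-Origin-Opener-Policy"] = "same-origin";
    headers["Cross-Origin-Embedder-Policy"] = "require-corp";
  }
  return headers;
}

console.log(headersFor("/pkg/spark_wasm_bg.wasm", true));
```

In the browser, the global `crossOriginIsolated` reads `true` once both headers are in effect, which is a quick way to confirm the setup before attempting a threaded build.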