Unlock the power of in-browser LLM inference with our WebAssembly binding for llama.cpp. No backend or GPU is needed: integrate via TypeScript and enjoy high-performance inference through a user-friendly API. Whether you’re managing embeddings or completions, our library runs inference without blocking your UI, paving the way for advanced web applications.
wllama is an innovative WebAssembly binding for llama.cpp that facilitates in-browser large language model (LLM) inference, ensuring a seamless experience for developers and users alike.
Key Features
- TypeScript Support: Enjoy an enhanced developer experience with TypeScript.
- In-Browser Inference: Leverage WebAssembly SIMD to run model inference directly in the browser without needing a backend or GPU.
- No Runtime Dependencies: A lightweight solution that keeps your project dependencies minimal (see package.json).
- High-Level and Low-Level APIs: Access both high-level functionality (completions and embeddings) and low-level operations, including (de)tokenization and KV cache control (see the sketch after this list).
- Parallel Model Loading: Split your model into smaller files for faster, more efficient loading.
- Dynamic Thread Switching: Automatically adapts to single-thread or multi-thread builds based on browser capabilities.
- Background Processing: Model inference occurs in a worker thread to prevent UI blocking.
- Pre-Built npm Package: Easily integrate with your projects via the pre-built npm package @wllama/wllama.
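To make the two API levels concrete, here is a minimal sketch; the method names (createCompletion, tokenize) follow the published wllama API but may differ between versions, so treat them as assumptions and confirm against the Documentation below.

```ts
import { Wllama } from '@wllama/wllama';

// Sketch only: contrasts the high-level and low-level APIs on an
// already-loaded Wllama instance (see Get Started below for loading).
async function demo(wllama: Wllama): Promise<void> {
  // High-level: one call returns the generated text
  const completion = await wllama.createCompletion('Once upon a time,', {
    nPredict: 64,
  });

  // Low-level: tokenize the same prompt and inspect the raw token IDs
  const tokens = await wllama.tokenize('Once upon a time,');

  console.log(completion, tokens.length);
}
```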
Limitations
- Multi-thread Requirements: To utilize multi-threading, the Cross-Origin-Embedder-Policy and Cross-Origin-Opener-Policy headers must be configured on the server (a sketch follows this list). Find out more in this discussion.
- WebGPU Support: Currently not available, but may be introduced in future updates.
- 2GB File Size Limit: Due to the ArrayBuffer size restriction in browsers. For models exceeding this, refer to the Splitting Your Model section below.
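How you set those two headers depends on your hosting setup; as one illustrative example (assuming a Vite dev server, which is not a requirement of wllama), they can be added like this:

```ts
// vite.config.ts — illustrative only; send the same two headers from whatever
// server or CDN actually serves your app. They enable SharedArrayBuffer,
// which the multi-threaded build depends on.
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
});
```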
Documentation and Demos
Explore our comprehensive Documentation for in-depth guidance and code examples:
- Basic usage with completions and embeddings: Basic Example
- Advanced example utilizing low-level API: Advanced Example
- Embedding and cosine distance demo: Embedding Example
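At its core, the embedding demo compares two embedding vectors with cosine similarity. A minimal sketch of that computation, assuming embeddings come back as plain number arrays (check the Embedding Example for the exact API):

```ts
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical usage with a loaded wllama instance:
// const e1 = await wllama.createEmbedding('I love cats');
// const e2 = await wllama.createEmbedding('I love dogs');
// console.log(cosineSimilarity(e1, e2));
```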
Get Started with wllama
To integrate wllama into your React TypeScript projects, simply install the package:
npm i @wllama/wllama
For detailed implementation, check out the full code in examples/reactjs. This basic example focuses on completions, while you can explore embeddings in examples/embeddings/index.html.
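For orientation, a rough end-to-end sketch of the completion flow is shown below; the WASM asset paths and sampling values are placeholders, and examples/reactjs shows the bundler-specific way to resolve the real paths.

```ts
import { Wllama } from '@wllama/wllama';

// Placeholder WASM asset paths — resolve these through your bundler
// (e.g. `?url` imports) as done in examples/reactjs.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/wllama/multi-thread/wllama.wasm',
};

async function main(): Promise<void> {
  const wllama = new Wllama(CONFIG_PATHS);
  await wllama.loadModelFromUrl(
    'https://huggingface.co/ngxson/tinyllama_split_test/resolve/main/stories15M-q8_0-00001-of-00003.gguf',
  );
  const output = await wllama.createCompletion('Once upon a time,', {
    nPredict: 50,
    sampling: { temp: 0.5, top_k: 40, top_p: 0.9 }, // example values only
  });
  console.log(output);
}

main();
```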
Model Preparation
To optimize performance, we recommend splitting models into chunks of at most 512MB for faster downloads and fewer memory issues. Use quantized Q4, Q5, or Q6 formats for the best performance-to-quality balance.
Splitting Your Model
For models exceeding 2GB, or to optimize the download process, split your model using llama-gguf-split:
./llama-gguf-split --split-max-size 512M ./my_model.gguf ./my_model
Load the URL of the first chunk in loadModelFromUrl, and the remaining parts will load automatically:
await wllama.loadModelFromUrl(
'https://huggingface.co/ngxson/tinyllama_split_test/resolve/main/stories15M-q8_0-00001-of-00003.gguf',
{
parallelDownloads: 5, // Optional: controls max parallel downloads (default: 3)
},
);
Custom Logging
wllama allows custom logging configurations to tailor debug outputs to your preference. You can suppress debug messages or add emoji prefixes to log entries for a more engaging development experience.
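As a sketch of both options (assuming, as in the upstream wllama documentation, that the Wllama constructor accepts a logger in its second argument and that the package exports a LoggerWithoutDebug helper — verify these names against the current release):

```ts
import { Wllama, LoggerWithoutDebug } from '@wllama/wllama';

// WASM asset paths, resolved as in the Get Started sketch above (placeholders).
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/wllama/multi-thread/wllama.wasm',
};

// Suppress debug messages with the bundled no-debug logger.
const quietWllama = new Wllama(CONFIG_PATHS, { logger: LoggerWithoutDebug });

// Or supply a custom logger that prefixes every entry with an emoji.
const emojiWllama = new Wllama(CONFIG_PATHS, {
  logger: {
    debug: (...args: unknown[]) => console.debug('🔧', ...args),
    log: (...args: unknown[]) => console.log('ℹ️', ...args),
    warn: (...args: unknown[]) => console.warn('⚠️', ...args),
    error: (...args: unknown[]) => console.error('☠️', ...args),
  },
});
```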
Explore our project to unleash the potential of in-browser LLM inference with wllama, and push the boundaries of what's possible in web applications!