wllama
Run LLM inference directly in your browser with seamless efficiency.
Pitch

Unlock the power of in-browser LLM inference with our WebAssembly binding for llama.cpp. No backend or GPU is needed: simply integrate with TypeScript and enjoy high-performance inference through a user-friendly API. Whether you're generating completions or computing embeddings, the library runs inference without blocking your UI, paving the way for advanced web applications.

Description

wllama is an innovative WebAssembly binding for llama.cpp that facilitates in-browser large language model (LLM) inference, ensuring a seamless experience for developers and users alike.

Key Features

  • TypeScript Support: Enjoy an enhanced developer experience with TypeScript.
  • In-Browser Inference: Leverage WebAssembly SIMD to run model inference directly in the browser without needing a backend or GPU.
  • No Runtime Dependencies: A lightweight solution that keeps your project dependencies minimal (see package.json).
  • High-Level and Low-Level APIs: Access both high-level functionality (completions and embeddings) and low-level operations, including (de)tokenization and KV cache control (a short sketch of the low-level calls follows this list).
  • Parallel Model Loading: Split your model into smaller files for faster, more efficient loading.
  • Dynamic Thread Switching: Automatically adapts to single-thread or multi-thread builds based on browser capabilities.
  • Background Processing: Model inference occurs in a worker thread to prevent UI blocking.
  • Pre-Built npm Package: Easily integrate with your projects via the pre-built npm package @wllama/wllama.
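
As an example of the low-level layer, here is a minimal sketch of (de)tokenization. It assumes a Wllama instance with a model already loaded (see Get Started below), that the method names match the project documentation (tokenize/detokenize), and that detokenize returns raw UTF-8 bytes; exact signatures may differ between versions.

// Assumes `wllama` is an initialized Wllama instance with a model loaded.
const tokens = await wllama.tokenize('Hello, browser!');
console.log(tokens); // array of token IDs

const bytes = await wllama.detokenize(tokens);
console.log(new TextDecoder().decode(bytes)); // back to text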

Limitations

  • Multi-thread Requirements: To utilize multi-threading, headers such as Cross-Origin-Embedder-Policy and Cross-Origin-Opener-Policy must be configured on the server (a minimal sketch follows this list). Find out more in this discussion.
  • WebGPU Support: Currently not available but may be introduced in future updates.
  • 2GB File Size Limit: Browsers cap ArrayBuffer size at 2GB, which limits single-file model size. For larger models, refer to the Splitting Your Model section below.
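
For the multi-threading requirement, here is a minimal dev-server sketch (Node + TypeScript) that serves your app with the cross-origin isolation headers browsers require for SharedArrayBuffer and therefore for multi-threaded wasm. This is an illustrative setup, not part of wllama itself; in practice you would configure these headers in your existing server or hosting platform.

import { createServer } from 'node:http';
import { readFile } from 'node:fs/promises';

createServer(async (req, res) => {
  // These two headers enable cross-origin isolation.
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');

  // Naive static file handling, for demonstration only.
  const path = '.' + (!req.url || req.url === '/' ? '/index.html' : req.url);
  try {
    res.end(await readFile(path));
  } catch {
    res.statusCode = 404;
    res.end('Not found');
  }
}).listen(8080);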

Documentation and Demos

Explore our comprehensive Documentation for in-depth guidance and code examples.

Get Started with wllama

To integrate wllama into your React TypeScript projects, simply install the package:

npm i @wllama/wllama

For a detailed implementation, check out the full code in examples/reactjs. That example focuses on completions; for embeddings, see examples/embeddings/index.html.
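
As a framework-agnostic starting point, here is a minimal sketch of loading a model and generating a completion. The wasm paths and model URL are placeholders, the CONFIG_PATHS keys depend on the package version and your bundler, and the createCompletion options shown (nPredict, sampling) follow the project documentation; treat exact signatures as version-dependent.

import { Wllama } from '@wllama/wllama';

// Map the wasm builds shipped with the package to URLs your bundler can serve.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './node_modules/@wllama/wllama/esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './node_modules/@wllama/wllama/esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS);

// Placeholder URL; use any GGUF model small enough for the browser.
await wllama.loadModelFromUrl('https://example.com/my_model.gguf');

const output = await wllama.createCompletion('Tell me a short story.', {
  nPredict: 50,
  sampling: { temp: 0.7, top_p: 0.9 },
});
console.log(output);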

Model Preparation

To optimize performance, we recommend splitting models into chunks of at most 512MB for faster downloads and fewer memory issues. Use quantized Q4, Q5, or Q6 formats for the best performance-to-quality balance.
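
If your model is not yet quantized, llama.cpp's llama-quantize tool can convert it to one of these formats (file names below are placeholders):

./llama-quantize ./my_model-f16.gguf ./my_model-Q5_K_M.gguf Q5_K_M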

Splitting Your Model

For models exceeding 2GB, or simply to speed up downloads, split your model using llama-gguf-split:

./llama-gguf-split --split-max-size 512M ./my_model.gguf ./my_model

Pass the URL of the first chunk to loadModelFromUrl; the remaining parts will be loaded automatically:

await wllama.loadModelFromUrl(
  'https://huggingface.co/ngxson/tinyllama_split_test/resolve/main/stories15M-q8_0-00001-of-00003.gguf',
  {
    parallelDownloads: 5, // Optional: controls max parallel downloads (default: 3)
  },
);

Custom Logging

wllama allows custom logging configurations to tailor debug outputs to your preference. You can suppress debug messages or add emoji prefixes to log entries for a more engaging development experience.
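
For instance, a custom logger can be passed when constructing the Wllama instance. This sketch assumes the logger option and the LoggerWithoutDebug helper exported by the package, as described in the project documentation; treat exact names as version-dependent.

import { Wllama, LoggerWithoutDebug } from '@wllama/wllama';

// Reuse the same wasm path map as in the Get Started sketch above.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './node_modules/@wllama/wllama/esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './node_modules/@wllama/wllama/esm/multi-thread/wllama.wasm',
};

// Option 1: suppress debug messages entirely.
const quietWllama = new Wllama(CONFIG_PATHS, {
  logger: LoggerWithoutDebug,
});

// Option 2: prefix log entries with emojis.
const emojiWllama = new Wllama(CONFIG_PATHS, {
  logger: {
    debug: (...args: any[]) => console.debug('🐞', ...args),
    log: (...args: any[]) => console.log('ℹ️', ...args),
    warn: (...args: any[]) => console.warn('⚠️', ...args),
    error: (...args: any[]) => console.error('🔥', ...args),
  },
});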

Explore our project to unleash the potential of in-browser LLM inference with wllama, and push the boundaries of what's possible in web applications!