@privane/engine
@privane/engine is a headless TypeScript SDK for running large language models (LLMs) locally: in the browser via WebGPU, or in Node.js environments.
Installation
```bash
npm install @privane/engine
```
Browser Usage (WebGPU)
Privane's main strength is running local AI entirely within the user's browser: inference happens on the device, so there is no server round trip between a prompt and the output shown in the UI.
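Because the browser path depends on WebGPU, and not every browser or GPU exposes it, it can help to check for support before constructing an `Engine`. The sketch below uses only the standard WebGPU entry points `navigator.gpu` and `requestAdapter()`. `hasWebGPU`, `getNavigatorGpu`, and `GpuLike` are hypothetical helper names, not part of @privane/engine; the small structural type stands in for the `@webgpu/types` definitions so the snippet compiles on its own.

```typescript
// Hypothetical minimal structural type for the one WebGPU method used here,
// so this compiles without the @webgpu/types package.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

// Returns navigator.gpu when the current runtime exposes it, else undefined.
function getNavigatorGpu(): GpuLike | undefined {
  const g = globalThis as unknown as { navigator?: { gpu?: GpuLike } };
  return g.navigator?.gpu;
}

// Resolves true only if a WebGPU adapter can actually be obtained.
// navigator.gpu may exist while requestAdapter() still resolves to null
// (for example, when the GPU or driver is blocklisted).
async function hasWebGPU(gpu: GpuLike | undefined = getNavigatorGpu()): Promise<boolean> {
  if (!gpu) return false;
  try {
    return (await gpu.requestAdapter()) !== null;
  } catch {
    return false;
  }
}
```

You might await `hasWebGPU()` before calling `new Engine({ backend: 'webgpu' })` and show a fallback message when it resolves to `false`; how the engine itself behaves when WebGPU is unavailable is not documented here.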
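```typescript
// Top-level await requires an ES module context (e.g. <script type="module">).
import { Engine } from '@privane/engine';

// Create an engine that runs inference on the GPU through WebGPU.
const engine = new Engine({
  backend: 'webgpu',
});

// Load a supported model by its identifier.
await engine.load('gemma-2b');

// Start generation; the returned stream is consumed as an async iterable.
const stream = await engine.generate({
  prompt: 'Write a short poem about coding:',
  maxTokens: 100, // upper bound on the number of generated tokens
});

// Handle each chunk as soon as it arrives instead of waiting for the full completion.
for await (const chunk of stream) {
  console.log(chunk);
}
```
Supported Models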
The WebGPU backend currently supports optimized, quantized variants of:
- Google Gemma (2B)
- Llama 3 (8B)
- Mistral (7B)
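To get a feel for why quantization matters for the models above, a back-of-the-envelope estimate of weight storage is parameter count × bits per weight ÷ 8. The sketch below is only that arithmetic: `estimateWeightBytes` is a hypothetical helper, and the result is a lower bound that ignores quantization scales, tensors kept at higher precision, the KV cache, activations, and runtime overhead. It does not describe the exact size of the engine's model files.

```typescript
// Lower-bound estimate of weight storage for a quantized model:
// bytes ≈ parameterCount × bitsPerWeight / 8.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  if (parameterCount <= 0 || bitsPerWeight <= 0) {
    throw new RangeError('parameterCount and bitsPerWeight must be positive');
  }
  return (parameterCount * bitsPerWeight) / 8;
}

// Compare a nominal 8-billion-parameter model at several bit widths,
// with 16-bit as the unquantized baseline.
const GB = 1e9;
for (const bits of [2, 4, 8, 16]) {
  const gb = estimateWeightBytes(8e9, bits) / GB;
  console.log(`8B params at ${bits}-bit: about ${gb.toFixed(1)} GB`);
}
```

For a nominal 8-billion-parameter model such as Llama 3 (8B), this gives about 16 GB at 16-bit versus about 4 GB at 4-bit. Block-wise GGUF formats store a scale (and often a minimum) alongside each block of weights, so real 4-bit files come out somewhat larger than this estimate.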
Optimized for Local Inference
The @privane/engine runtime is designed for high throughput and low resource overhead during local execution:
- WebGPU Acceleration: Runs models on local graphics hardware through the WebGPU API instead of on CPU or WebAssembly (WASM) threads, in any browser that supports WebGPU.
- Quantized GGUF Pipelines: Loads 2-bit, 4-bit, and 8-bit quantized weights in the GGUF format, shrinking each model's memory footprint enough for consumer hardware while retaining much of the original model's output quality.
- Streaming Token Generation: Tokens are delivered through an async iterator as soon as they are computed, so the UI can show output once the first token arrives (time-to-first-token, TTFT) rather than after the whole completion, which improves perceived speed.
- KV-Cache Optimization: Dynamic context recycling and state management keep key/value cache memory from growing without bound, reducing the risk of out-of-memory crashes in browser tabs and native runtimes.
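The streaming behavior described above can be observed by timing the first chunk separately from the whole completion. The sketch below accepts any `AsyncIterable`, which is how the stream returned by `engine.generate` is consumed in the browser example. `measureStream`, `MeasureOptions`, and `StreamTiming` are hypothetical names, not engine APIs. Because the engine may do some of its work before the stream starts yielding (the README does not say where), recording `startedAt` just before calling `generate` gives a more complete TTFT figure.

```typescript
interface StreamTiming<T> {
  chunks: T[];
  ttftMs: number | null; // null when the stream yielded nothing
  totalMs: number;
}

interface MeasureOptions<T> {
  startedAt?: number;            // timestamp taken before calling generate(), if available
  now?: () => number;            // millisecond clock; defaults to Date.now
  onChunk?: (chunk: T) => void;  // called for every chunk, e.g. to update the UI
}

// Consumes an async iterable, recording time to the first chunk and total duration.
async function measureStream<T>(
  stream: AsyncIterable<T>,
  opts: MeasureOptions<T> = {},
): Promise<StreamTiming<T>> {
  const now = opts.now ?? (() => Date.now());
  const start = opts.startedAt ?? now();
  const chunks: T[] = [];
  let ttftMs: number | null = null;

  for await (const chunk of stream) {
    // Only the first chunk sets the time-to-first-token.
    if (ttftMs === null) ttftMs = now() - start;
    chunks.push(chunk);
    opts.onChunk?.(chunk);
  }

  return { chunks, ttftMs, totalMs: now() - start };
}
```

In the browser, passing `now: () => performance.now()` gives finer-grained timestamps, and `onChunk` is where a UI would append each chunk to the page.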
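To see why KV-cache management matters, the size of a standard transformer KV cache is 2 (keys and values) × layers × key/value heads × head dimension × tokens × bytes per element, so it grows linearly with context length. The sketch below applies that formula; it is a generic estimate, not a description of how @privane/engine lays out or recycles its cache. `kvCacheBytes`, `maxTokensForBudget`, and the fp16 (2-byte) element size are assumptions, and a runtime that quantizes its cache would use a smaller element size. The Llama 3 8B values (32 layers, 8 key/value heads under grouped-query attention, head dimension 128) come from Meta's published model configuration.

```typescript
interface KvCacheShape {
  layers: number;
  kvHeads: number;         // key/value heads (fewer than query heads under grouped-query attention)
  headDim: number;
  bytesPerElement: number; // 2 for fp16 (assumed)
}

// Bytes needed to cache keys and values for `tokens` positions.
function kvCacheBytes(shape: KvCacheShape, tokens: number): number {
  const { layers, kvHeads, headDim, bytesPerElement } = shape;
  return 2 * layers * kvHeads * headDim * tokens * bytesPerElement;
}

// Largest context that fits in a fixed byte budget; capping the context this
// way keeps cache memory flat no matter how long a session runs.
function maxTokensForBudget(shape: KvCacheShape, budgetBytes: number): number {
  return Math.floor(budgetBytes / kvCacheBytes(shape, 1));
}

const llama3_8b: KvCacheShape = { layers: 32, kvHeads: 8, headDim: 128, bytesPerElement: 2 };

console.log(`per token: ${kvCacheBytes(llama3_8b, 1) / 1024} KiB`);
console.log(`8192 tokens: ${kvCacheBytes(llama3_8b, 8192) / 1024 ** 3} GiB`);
console.log(`tokens in 512 MiB: ${maxTokensForBudget(llama3_8b, 512 * 1024 ** 2)}`);
```

At fp16 this comes to 128 KiB per token for Llama 3 8B, or 1 GiB for an 8,192-token context, which is why bounding or recycling the context matters inside a browser tab.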