`@privane/engine`

@privane/engine is a headless TypeScript SDK for running LLMs locally, either in the browser via WebGPU or in Node.js.

Installation

npm install @privane/engine

Browser Usage (WebGPU)

Privane's main use case is running models entirely inside the user's browser. Because no request leaves the device, UI features avoid network round trips, although inference time still depends on the local hardware.

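import { Engine } from '@privane/engine';

// Create an engine that runs on the WebGPU backend.
const engine = new Engine({
  backend: 'webgpu'
});

// Load the model before generating (top-level await requires an ES module).
await engine.load('gemma-2b');

// generate() resolves to an async iterable of output chunks.
const stream = await engine.generate({
  prompt: 'Write a short poem about coding:',
  maxTokens: 100
});

// Print each chunk as soon as it arrives.
for await (const chunk of stream) {
  console.log(chunk);
}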
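
WebGPU is not available in every browser, so it can help to check for it before constructing an engine with backend: 'webgpu'. The following is a minimal sketch and not part of the @privane/engine API: hasWebGPU is a hypothetical helper that relies only on the standard navigator.gpu.requestAdapter() call. What to do when it returns false (for example, showing a notice to the user) is left to the application.

```typescript
// Minimal shape of the standard WebGPU entry point used by this check.
// Declared locally so the sketch compiles without WebGPU type definitions.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Hypothetical helper (not provided by @privane/engine).
// Returns true only when the WebGPU API exists AND an adapter can be obtained.
async function hasWebGPU(nav: unknown): Promise<boolean> {
  const gpu = (nav as { gpu?: GpuLike } | null | undefined)?.gpu;
  if (!gpu || typeof gpu.requestAdapter !== 'function') {
    return false; // API not exposed by this browser or runtime
  }
  try {
    // requestAdapter() resolves to null when no suitable GPU is available.
    const adapter = await gpu.requestAdapter();
    return adapter != null;
  } catch {
    return false; // treat a rejected request as "unavailable"
  }
}
```

In a browser, this would be called as `await hasWebGPU(navigator)` before running `new Engine({ backend: 'webgpu' })`.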

Supported Models

The WebGPU backend currently supports quantized variants of:

Google Gemma (2B)
Llama 3 (8B)
Mistral (7B)
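
For a rough sense of what these quantized variants cost in memory, weight storage is approximately parameters × bits per weight / 8 bytes. The sketch below applies that arithmetic to the nominal parameter counts in the list above. It is an estimate under stated assumptions: the counts are the rounded figures from the model names, and the result ignores quantization metadata, the KV cache, and activations. estimateWeightBytes is an illustrative helper, not part of @privane/engine.

```typescript
// Approximate bytes needed for the weights alone: parameters × bits / 8.
// Assumption: every weight is stored at the same bit width.
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

const BYTES_PER_GB = 1e9; // decimal gigabytes

// Nominal (rounded) parameter counts taken from the model names.
const nominalModels = [
  { name: 'Google Gemma (2B)', params: 2e9 },
  { name: 'Mistral (7B)', params: 7e9 },
  { name: 'Llama 3 (8B)', params: 8e9 },
];

for (const { name, params } of nominalModels) {
  const row = [2, 4, 8]
    .map((bits) => `${bits}-bit ~ ${(estimateWeightBytes(params, bits) / BYTES_PER_GB).toFixed(2)} GB`)
    .join(', ');
  console.log(`${name}: ${row}`);
}
```

By this estimate, a nominal 2B-parameter model at 4 bits needs about 1 GB for weights, while an 8B-parameter model at 8 bits needs about 8 GB.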

Optimized for Local Inference

The @privane/engine runtime is designed for high throughput and low resource overhead during local execution:

WebGPU Acceleration: Runs model computation on the local GPU through the standard WebGPU API instead of CPU or WASM threads, in any browser that supports WebGPU.
Quantized GGUF Pipelines: Loads compact 2-bit, 4-bit, and 8-bit model weights, reducing memory use enough for local hardware. Lower bit widths save more memory at some cost in output quality.
Streaming Token Generation: Tokens are delivered asynchronously as soon as each one is computed, so the UI can render the first token without waiting for the full completion. This keeps time-to-first-token (TTFT) low and improves perceived speed.
KV-Cache Optimization: Context recycling and state management limit memory growth, keeping browser tabs and native runtimes stable.
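
To see what streaming means for time-to-first-token in practice, the sketch below times any async iterable of chunks. It assumes only that the stream can be consumed with for await, as in the browser usage example; consumeWithTiming and demoStream are hypothetical names, and demoStream is a stand-in generator rather than the engine. Pass a start time captured before the generate call if the measurement should include request setup.

```typescript
interface StreamTiming<T> {
  chunks: T[];
  timeToFirstChunkMs: number | null; // null if the stream produced nothing
  totalMs: number;
}

// Consumes a stream and records when the first chunk and the last chunk arrived.
async function consumeWithTiming<T>(
  stream: AsyncIterable<T>,
  startedAtMs: number = Date.now(),
): Promise<StreamTiming<T>> {
  let firstChunkAtMs: number | null = null;
  const chunks: T[] = [];
  for await (const chunk of stream) {
    if (firstChunkAtMs === null) firstChunkAtMs = Date.now();
    chunks.push(chunk);
  }
  return {
    chunks,
    timeToFirstChunkMs: firstChunkAtMs === null ? null : firstChunkAtMs - startedAtMs,
    totalMs: Date.now() - startedAtMs,
  };
}

// Stand-in stream for demonstration only; it is NOT the engine.
async function* demoStream(): AsyncGenerator<string> {
  for (const token of ['Hello', ',', ' world']) {
    await new Promise<void>((resolve) => setTimeout(resolve, 20));
    yield token;
  }
}

consumeWithTiming(demoStream()).then((timing) => {
  console.log(
    `first chunk after ${timing.timeToFirstChunkMs} ms, ` +
      `all ${timing.chunks.length} chunks after ${timing.totalMs} ms`,
  );
});
```

With a streamed response, the first-chunk time is typically a fraction of the total time, which is the gap a user would otherwise spend waiting for a non-streamed completion.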
