AvatarLayer

[Image: characters in a holographic control room with a swappable avatar figure cycling through forms]

AvatarLayer is a TypeScript SDK for realtime conversational avatars. LLM, TTS, STT, rendering — pluggable adapters behind four interfaces, composable into one session. MIT licensed.

[TypeScript listing: nine-line quick-start example]

Nine lines to a talking, lip-syncing 3D avatar in the browser. But the part that took a year isn't the nine lines. It's making those nine lines work the same way regardless of what's underneath.

I built this because I kept rebuilding the same integration stack. You want a talking avatar, so you pick an LLM, a TTS engine, a renderer — maybe VRM, maybe Live2D, maybe a remote video service. Then you need speech-to-text so users can talk back. Then barge-in so they can interrupt mid-sentence. Then you realize the TTS latency is killing the experience, so you need a buffering strategy. Then emotion mapping so the avatar doesn't stare dead-eyed while delivering a joke. Then conversation memory. Then API key proxying because you can't ship secrets to the browser.

Every one of these problems is solved. None of them are solved together. The failure lives in the seams: the LLM streams tokens, the TTS expects sentences, the renderer expects audio blobs, the STT produces transcripts that collide with the LLM's own output. Making these things cooperate is a state machine with teeth, and every team that ships a conversational avatar builds this state machine from scratch — wrong, the first three times, because the edge cases are invisible until production.

Drake: I've scaffolded this exact pipeline maybe five times. Each time I said "this time the abstraction will be clean." Each time the echo feedback loop or the sentence-boundary splitter or the barge-in race condition ate a week. Five attempts is enough to know the problem isn't discipline. The problem is that the seams between these systems aren't documented because nobody thinks the seams are the product. They are.

The fix turned out to be smaller than I expected. The whole avatar pipeline compresses into four interfaces:

[TypeScript listing: the four adapter interfaces]

That's the entire contract. AvatarSession orchestrates the pipeline — streaming LLM output, splitting on sentence boundaries, queuing TTS synthesis, feeding audio to the renderer, managing the voice loop state machine — and the only thing it knows about the outside world is these four shapes. Swap any adapter, at runtime, without losing conversation state:

[TypeScript listing: swapping an adapter at runtime]

The old renderer unmounts, the new one mounts, the conversation continues. The identity persists. The substrate is interchangeable.
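To make the shape of that contract concrete, here is a minimal sketch of four adapter interfaces and a runtime renderer swap. Every name and signature below (ToySession, setRenderer, the methods on each interface) is an assumption made for illustration, not AvatarLayer's published API.

```typescript
// Illustrative shapes only: names and signatures are assumptions,
// not AvatarLayer's actual interfaces.
interface LLMProvider {
  stream(history: string[]): AsyncIterable<string>;
}
interface TTSProvider {
  synthesize(text: string): Promise<Float32Array>;
}
interface STTProvider {
  transcribe(frames: AsyncIterable<Float32Array>): AsyncIterable<string>;
}
interface AvatarRenderer {
  mount(): void;
  unmount(): void;
  play(audio: Float32Array): Promise<void>;
}
interface Adapters {
  llm: LLMProvider;
  tts: TTSProvider;
  renderer: AvatarRenderer;
  stt?: STTProvider; // unused in this toy, listed to complete the four
}

// A toy session: it owns the conversation state and only sees the shapes above.
class ToySession {
  readonly history: string[] = [];
  private adapters: Adapters;

  constructor(adapters: Adapters) {
    this.adapters = adapters;
    adapters.renderer.mount();
  }

  // Swap the renderer at runtime; the history array is untouched.
  setRenderer(next: AvatarRenderer): void {
    this.adapters.renderer.unmount();
    this.adapters = { ...this.adapters, renderer: next };
    next.mount();
  }

  async say(userText: string): Promise<void> {
    this.history.push(`user: ${userText}`);
    let reply = "";
    for await (const token of this.adapters.llm.stream(this.history)) reply += token;
    this.history.push(`assistant: ${reply}`);
    await this.adapters.renderer.play(await this.adapters.tts.synthesize(reply));
  }
}

// Stub adapters so the sketch runs without any vendor SDK.
const echoLLM: LLMProvider = {
  async *stream(history) {
    yield "You said: ";
    yield history[history.length - 1];
  },
};
const silentTTS: TTSProvider = {
  synthesize: async (text) => new Float32Array(text.length),
};
function makeRenderer(name: string, log: string[]): AvatarRenderer {
  return {
    mount: () => { log.push(`${name}:mount`); },
    unmount: () => { log.push(`${name}:unmount`); },
    play: async (audio) => { log.push(`${name}:play(${audio.length})`); },
  };
}
```

Calling setRenderer between two say() calls leaves history intact, which is the property the swap is meant to preserve.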

Ann: Four interfaces covering the whole problem. That's either elegant or suspicious. The thing I've learned about small surfaces is they push complexity somewhere: it doesn't disappear, it migrates. So where did it go? Because if the answer is "into the session's state machine," that's just a different kind of illegibility. If the answer is "into the adapters, where each one owns its own mess," that's actually honest. Walled-off complexity is the only kind I trust.

The complexity went into two places, and Ann's right that it matters which ones.

First: the adapters. Thirteen LLM backends — OpenAI, Anthropic, Gemini, Groq, DeepSeek, Mistral, xAI, OpenRouter, Together, Fireworks, Ollama, Azure, and Chrome's built-in Prompt API. Four TTS engines. Eight realtime STT providers. Three local renderer types — VRM (Three.js), Live2D (Pixi.js), and a generic video renderer. Three remote avatar services — Atlas, LemonSlice, and LiveAvatar — that connect over LiveKit for server-side rendered video. Each adapter owns its vendor-specific mess behind the interface. None of it leaks.

Second: the pipeline. The naive approach is serial — LLM finishes a sentence, TTS synthesizes it, renderer plays it. Every stage waits for the previous one. Users notice: the avatar thinks for three seconds, speaks one sentence, pauses two more seconds, speaks the next. It feels broken even when it's working correctly.

[Image: pipeline visualization showing the avatar system's technical architecture]

AvatarLayer's SpeechQueue runs TTS synthesis in parallel with playback. While sentence N plays through the renderer, sentences N+1 through N+3 are already synthesizing. The queue maintains strict ordering, so sentences never arrive out of sequence, but it hides synthesis latency behind playback time. After the first sentence, the avatar speaks continuously. The LLM stream itself is split on sentence boundaries in realtime: as tokens arrive, a splitter detects punctuation and newlines, flushes complete sentences to the TTS queue, and holds partial sentences in a buffer. Three stages run concurrently: streaming, synthesizing, playing.
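A rough sketch of those two pieces, the sentence splitter and the ordered lookahead queue, might look like the following. SentenceBuffer and playInOrder are hypothetical names, the boundary rule is a simplification, and the queue takes a ready-made array of sentences where a real pipeline would be fed from the stream.

```typescript
// Illustrative only: not AvatarLayer's SpeechQueue implementation.

// Accumulates streamed tokens and releases complete sentences.
class SentenceBuffer {
  private pending = "";

  push(token: string): string[] {
    this.pending += token;
    const out: string[] = [];
    // A sentence ends at . ! or ? followed by whitespace, or at a newline.
    const boundary = /[.!?]+\s+|\n+/;
    let m: RegExpExecArray | null;
    while ((m = boundary.exec(this.pending)) !== null) {
      const end = m.index + m[0].length;
      const sentence = this.pending.slice(0, end).trim();
      if (sentence) out.push(sentence);
      this.pending = this.pending.slice(end);
    }
    return out;
  }

  // Call when the stream ends to release whatever is left.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}

// Plays sentences strictly in order while keeping up to `lookahead`
// later sentences synthesizing in the background.
async function playInOrder(
  sentences: string[],
  synthesize: (sentence: string) => Promise<Float32Array>,
  play: (audio: Float32Array) => Promise<void>,
  lookahead = 3,
): Promise<void> {
  const depth = Math.max(1, lookahead);
  const inFlight: Promise<Float32Array>[] = [];
  let next = 0;
  const topUp = (): void => {
    while (next < sentences.length && inFlight.length < depth) {
      inFlight.push(synthesize(sentences[next++]));
    }
  };
  topUp();
  while (inFlight.length > 0) {
    const audio = await inFlight.shift()!; // oldest request first keeps order strict
    topUp(); // start more synthesis before playback begins
    await play(audio);
  }
}
```

Awaiting the oldest request first is what keeps ordering strict even when a later sentence finishes synthesizing sooner. Error handling for a failed synthesis is left out of the sketch.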

The voice loop is where the state machine earns its keep. The session moves through idle → connecting → ready ⇄ thinking → speaking → ready, with listening as an orthogonal flag. Audio frames arrive from any AsyncIterable<Float32Array> — the session never calls getUserMedia itself. Echo avoidance is half-duplex: during thinking and speaking, mic frames are silently dropped so the STT doesn't transcribe the avatar's own voice back as user input. Barge-in works the opposite direction: a transcript arriving while the avatar speaks triggers interrupt(), which aborts the LLM stream, cancels pending TTS, silences the renderer, and processes the new input. The user never waits.
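The transitions above can be sketched as a small table-driven state machine. VoiceLoop and its method names are assumptions for illustration. How a transcript reaches the session while mic frames are being dropped (typed input, a late STT result) is left open here.

```typescript
// Hypothetical sketch: VoiceLoop and its method names are assumptions.
type Phase = "idle" | "connecting" | "ready" | "thinking" | "speaking";

// Legal transitions: idle → connecting → ready ⇄ thinking → speaking → ready.
const allowed: Record<Phase, Phase[]> = {
  idle: ["connecting"],
  connecting: ["ready"],
  ready: ["thinking"],
  thinking: ["speaking", "ready"],
  speaking: ["ready"],
};

class VoiceLoop {
  phase: Phase = "idle";
  listening = false; // orthogonal flag, independent of phase
  readonly forwarded: Float32Array[] = []; // stands in for the STT input
  private turn: AbortController | null = null;

  private go(next: Phase): void {
    if (allowed[this.phase].indexOf(next) === -1) {
      throw new Error(`illegal transition ${this.phase} -> ${next}`);
    }
    this.phase = next;
  }

  connect(): void { this.go("connecting"); }
  connected(): void { this.go("ready"); }
  startSpeaking(): void { this.go("speaking"); }
  finishSpeaking(): void { this.go("ready"); }

  // Half-duplex echo avoidance: frames pass only while ready and listening,
  // so anything captured during thinking or speaking is dropped.
  onMicFrame(frame: Float32Array): boolean {
    if (!this.listening || this.phase !== "ready") return false;
    this.forwarded.push(frame);
    return true;
  }

  // A transcript starts a turn. If a turn is already running, it is a barge-in.
  onTranscript(): AbortSignal {
    if (this.phase === "thinking" || this.phase === "speaking") this.interrupt();
    this.go("thinking");
    this.turn = new AbortController();
    return this.turn.signal;
  }

  // Abort the current turn; LLM, TTS, and renderer would all watch this signal.
  interrupt(): void {
    if (this.phase !== "thinking" && this.phase !== "speaking") return;
    if (this.turn) this.turn.abort();
    this.turn = null;
    this.go("ready");
  }
}
```

Handing each turn its own AbortSignal means a single abort() call can cancel streaming, synthesis, and playback together, as long as each stage checks the signal.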

Some renderers handle TTS internally — Atlas in conversation mode, for instance, takes raw text and produces lip-synced video server-side. The optional speakText method on AvatarRenderer signals this: when present, the session skips the entire TTS pipeline and sends text directly. Same interface, different plumbing, no special cases in calling code.
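As a minimal sketch of that branch: the only detail taken from the text is the optional speakText method. The Renderer shape, the Synthesize type, and the deliver function are hypothetical.

```typescript
// Hypothetical shapes; only the optional speakText method comes from the text above.
interface Renderer {
  play(audio: Float32Array): Promise<void>;
  speakText?(text: string): Promise<void>; // present when the renderer does its own TTS
}

type Synthesize = (text: string) => Promise<Float32Array>;

// Route one sentence: straight to the renderer as text if it can speak,
// otherwise through local TTS and then to the renderer as audio.
async function deliver(
  sentence: string,
  renderer: Renderer,
  synthesize: Synthesize,
): Promise<"text" | "audio"> {
  if (renderer.speakText) {
    await renderer.speakText(sentence);
    return "text";
  }
  await renderer.play(await synthesize(sentence));
  return "audio";
}
```

The caller passes the same arguments either way; the presence check is the only place the two paths diverge.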

Drake: The `speakText` path is the detail that tells me this was designed by someone who actually shipped the wrong version first. You only add a "skip the pipeline" escape hatch after you've watched a remote renderer re-encode audio it didn't need to decode. That's not architecture by whiteboard. That's architecture by scar tissue.

The part I didn't plan for but turned out to matter most: the whole thing runs offline. The avatarlayer/local subpath exports a complete on-device ML stack — Whisper WASM for speech-to-text, Kokoro for text-to-speech, Silero for voice activity detection, local embeddings for semantic memory. Pair those with Chrome's Prompt API for the LLM and VRM for rendering, and the entire conversational avatar pipeline runs without a single API call. No server. No keys. No network. The avatar works on an airplane.

These are peer dependencies, pulled only when you import avatarlayer/local. The main bundle stays small for the common case where you're hitting cloud APIs. The local stack is the same four interfaces — just different adapters behind them.

Memory plugs in through ThreadProvider with implementations ranging from InMemoryThreadProvider (ephemeral) through IndexedDBThreadProvider (persistent in the browser) to NeonThreadProvider (serverless Postgres with pgvector). The vector variants add semantic recall — messages embedded and stored alongside their vectors, so the avatar retrieves by meaning, not just recency. The interface is Mastra-compatible; existing memory infrastructure slots in without translation.
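To illustrate the difference between recall by recency and recall by meaning, here is a toy in-memory sketch. ToyVectorThread and keywordEmbed are hypothetical names, the keyword embedder is a stand-in for a real embedding model, and none of this reflects the actual ThreadProvider interface.

```typescript
// Toy sketch under stated assumptions; not the ThreadProvider API.
type Embed = (text: string) => number[];

interface StoredMessage {
  role: "user" | "assistant";
  text: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// One dimension per keyword: 1 if the text mentions it, else 0.
function keywordEmbed(keywords: string[]): Embed {
  return (text) => keywords.map((k) => (text.toLowerCase().indexOf(k) >= 0 ? 1 : 0));
}

class ToyVectorThread {
  private messages: StoredMessage[] = [];
  private embed: Embed;

  constructor(embed: Embed) {
    this.embed = embed;
  }

  // Store each message alongside its vector.
  add(role: StoredMessage["role"], text: string): void {
    this.messages.push({ role, text, vector: this.embed(text) });
  }

  // Recency: the last n messages.
  recent(n: number): StoredMessage[] {
    return this.messages.slice(-n);
  }

  // Meaning: the k stored messages closest to the query vector.
  recall(query: string, k: number): StoredMessage[] {
    const q = this.embed(query);
    return this.messages
      .map((m) => ({ m, score: cosine(q, m.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((x) => x.m);
  }
}
```

With a few messages stored, recent(1) returns whatever was said last, while recall("what is my cat called?", 1) returns the message about the cat regardless of when it was said.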

Running adapters directly in the browser would mean shipping API keys to the client. The transport layer avoids this without a custom backend: serveLLM, serveTTS, and serveSTT wrap providers into (Request) => Response handlers that you export from Next.js route files or any framework built on Web Request/Response. The client-side TransportLLM, TransportTTS, and TransportSTT implement the same provider interfaces, so they slot into AvatarSession unchanged. For realtime STT, audio goes directly from the browser to the provider; the backend only vends short-lived tokens via serveSTTToken. Keys stay secret. Latency stays low.
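The proxy pattern itself is small enough to sketch. The names serveToyLLM and ToyTransportLLM, the single-method TextLLM interface, and the JSON wire format are all assumptions made for illustration. The real serveLLM and TransportLLM presumably stream, which this sketch does not.

```typescript
// Hypothetical sketch of the proxy pattern; not AvatarLayer's transport API.
interface TextLLM {
  complete(prompt: string): Promise<string>;
}

type Handler = (req: Request) => Promise<Response>;

// Server side: wrap the real provider (which holds the API key) in a handler.
function serveToyLLM(provider: TextLLM): Handler {
  return async (req) => {
    const { prompt } = (await req.json()) as { prompt: string };
    const text = await provider.complete(prompt);
    return new Response(JSON.stringify({ text }), {
      headers: { "content-type": "application/json" },
    });
  };
}

// Client side: same interface as the provider, but it only knows a URL.
class ToyTransportLLM implements TextLLM {
  private url: string;
  private send: Handler;

  // `send` defaults to fetch; tests can pass a handler to skip the network.
  constructor(url: string, send: Handler = (req) => fetch(req)) {
    this.url = url;
    this.send = send;
  }

  async complete(prompt: string): Promise<string> {
    const res = await this.send(
      new Request(this.url, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify({ prompt }),
      }),
    );
    if (!res.ok) throw new Error(`transport failed with status ${res.status}`);
    const { text } = (await res.json()) as { text: string };
    return text;
  }
}
```

Because the client class implements the same interface as the provider it fronts, code that accepts a TextLLM cannot tell whether it is holding the key or a URL.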

There's also AvatarOrchestrator for multi-character scenes — multiple avatars in the same session, each with their own personality and voice, coordinated through a single LLM stream that gets parsed and routed by speaker tags. But that's a whole other post.

npm install avatarlayer. The plumbing is handled.

Buster: Build the character. Everything else is pipe.