Local Agent

Neil Haddley • June 14, 2026

A conversational AI assistant for this blog using WebLLM (in-browser) and Ollama (local server) as interchangeable backends

Tags: AI, webllm, webgpu, qwen, react, agents, ollama

I've added a conversational AI assistant to this blog: the 💬 button in the bottom-right corner of every page. It runs entirely locally, with no hosted backend and no API fees, using one of two model backends: WebLLM (in-browser, no setup) or Ollama (local server, larger models).

The chat button appears on every page. Click it to open the assistant panel

Choosing a Backend

| | WebLLM | Ollama |
| --- | --- | --- |
| Setup | None, loads in the browser | Install Ollama, pull a model, run the site locally |
| Browser support | Chrome / Edge with WebGPU | Any browser |
| Model sizes | Up to 7B (browser VRAM limits) | Up to 27B (Qwen3.5) |
| Inference speed | Depends on GPU via WebGPU | Native, generally faster |
| Works for visitors | Yes | No, only visible when running the site locally |
| Model storage | Browser cache (per device) | Local disk, shared across apps |

WebLLM is the right choice for anyone visiting the public site: it just works. Ollama is better for local development, giving access to larger, faster models without the browser download.

WebLLM

WebLLM runs a quantized Qwen2.5 model directly in the browser using WebGPU. The model is downloaded once and cached, so subsequent loads skip the download. WebGPU is required, so it works in Chrome and Edge on GPU-enabled devices.
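Since WebGPU is a hard requirement, a page might check for it before offering the WebLLM models. The following is a minimal sketch of such a check, not the blog's actual code; `hasWebGPU`, `GpuLike`, and `NavigatorLike` are names invented here, while `navigator.gpu.requestAdapter()` is the standard WebGPU entry point.

```typescript
// Minimal structural types so the sketch compiles without WebGPU type packages.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Returns true only if the WebGPU API exists and an adapter can be obtained.
async function hasWebGPU(nav: NavigatorLike): Promise<boolean> {
  if (!nav.gpu) return false; // API not exposed (unsupported or disabled)
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false; // treat any failure as "no usable GPU"
  }
}
```

In the browser this would be called as `hasWebGPU(navigator)`. Taking the navigator as a parameter keeps the function easy to exercise outside a browser.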

Three model sizes are available, all quantized to 4-bit weights:

| Model | Download | Note |
| --- | --- | --- |
| Qwen2.5-7B-Instruct-q4f16_1-MLC | ~4 GB | Best quality · WebLLM |
| Qwen2.5-3B-Instruct-q4f16_1-MLC | ~2 GB | Balanced · WebLLM |
| Qwen2.5-1.5B-Instruct-q4f16_1-MLC | ~1 GB | Fast · WebLLM |

The 1.5B is the default: a fast first download and a reasonable starting point. Larger models give better reasoning and more reliable multi-step tool use.

The model selector on the public site: only the three WebLLM options appear, since using Ollama requires the site to be hosted on localhost

Loading the model for the first time: the progress bar fills as the weights download to the browser cache

Why Quantization?

A standard Qwen2.5-7B model in 16-bit precision weighs around 14 GB. Most consumer GPUs don't have that much VRAM, and browsers impose their own caps on top of that. 4-bit quantization brings it down to a manageable size:

| Model | FP16 | q4f16_1 |
| --- | --- | --- |
| 7B | ~14 GB | ~4 GB |
| 3B | ~6 GB | ~2 GB |
| 1.5B | ~3 GB | ~1 GB |
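The FP16 column can be reproduced with back-of-envelope arithmetic: parameter count times bytes per weight. The sketch below assumes the nominal parameter counts (7, 3, and 1.5 billion); `estimateWeightsGB` is a helper made up for illustration.

```typescript
// Rough weight size: parameter count times bits per weight, converted to bytes.
// Uses nominal parameter counts (an assumption), so results are approximate.
function estimateWeightsGB(paramsBillions: number, bitsPerWeight: number): number {
  const bytesPerWeight = bitsPerWeight / 8;
  return paramsBillions * bytesPerWeight; // billions of bytes, i.e. roughly GB
}

for (const size of [7, 3, 1.5]) {
  const fp16 = estimateWeightsGB(size, 16);
  const q4 = estimateWeightsGB(size, 4);
  console.log(`${size}B: FP16 ~${fp16} GB, 4-bit ~${q4} GB`);
}
```

The raw 4-bit estimates (3.5, 1.5, and 0.75 GB) come out somewhat below the download sizes in the table. A plausible explanation is that quantized formats also store per-group scale factors and keep some tensors at higher precision, but that is not something this estimate models.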

WebLLM only supports its own pre-compiled MLC model variants, because the MLC compilation step converts the model to run on WebGPU and bakes in the quantization. The quality tradeoff is minimal: benchmark scores drop by around 1–2% at q4f16_1, which is unnoticeable for a blog assistant.

Model Quality

Smaller models trade reasoning quality for speed. I ran the same query, "Any Java related posts?", against the 1.5B and 3B to see the difference.

The 1.5B called tools redundantly, hit the round limit, and returned an empty response:

```text
round 0 - search_posts {"query": "Java"}
round 1 - get_posts_by_category {"category": "Java"}  (already had the data)
round 2 - get_posts_by_category {"category": "Java"}  → skipping duplicate
round 3 - get_posts_by_category {"category": "Java"}  → skipping duplicate
loop exhausted - final nudge → (empty)
```

The 3B called one tool and answered cleanly on the next round:

```text
round 0 - search_posts {"query": "Java related"}
round 1 - text: "Here are the Java related posts: …"
```

The 3B handles multi-step tool use reliably. The 1.5B is faster to load but may struggle on follow-up questions.
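The "skipping duplicate" lines in the 1.5B trace suggest the loop remembers which tool calls it has already made. One way that could work is sketched below. This is a guess at the idea, not the component's real code; `ToolCall`, `callKey`, and `ToolCallTracker` are hypothetical names.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Builds a stable key from the tool name and its top-level arguments.
// Keys are sorted so {"a":1,"b":2} and {"b":2,"a":1} count as the same call.
function callKey(call: ToolCall): string {
  const sortedArgs = Object.keys(call.arguments)
    .sort()
    .map((k) => [k, call.arguments[k]]);
  return `${call.name}:${JSON.stringify(sortedArgs)}`;
}

class ToolCallTracker {
  private seen = new Set<string>();

  // Returns true the first time a call is seen, false for repeats.
  shouldRun(call: ToolCall): boolean {
    const key = callKey(call);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

A tracker like this would be created once per user question, so a repeated `get_posts_by_category {"category": "Java"}` is skipped within one conversation turn but allowed again on the next.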

Ollama

Ollama runs as a local background process and exposes an OpenAI-compatible API at http://localhost:11434. Instead of downloading weights into the browser, the model runs natively on the machine, which is generally faster and allows larger models.

Installing Ollama
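```bash
# Install Ollama (macOS, via Homebrew)
brew install ollama
# Start the server; it keeps running, so use a second terminal for the pull
ollama serve
# Download the model weights
ollama pull qwen3.5:4b
```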
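Once the server is running, one way to confirm it is reachable and see which models are installed is Ollama's `GET /api/tags` endpoint. The sketch below is illustrative only; `listOllamaModels`, `fetchTags`, and the injectable `fetcher` parameter are names introduced here, and only the `name` field of each model entry is used.

```typescript
// Shape of the part of the /api/tags response this sketch reads.
interface OllamaTagsResponse {
  models: { name: string }[];
}

type FetchJson = (url: string) => Promise<OllamaTagsResponse>;

// Default fetcher for browsers and Node 18+; replaceable for testing.
const fetchTags: FetchJson = async (url) => {
  const r = await fetch(url);
  if (!r.ok) throw new Error(`Ollama responded with HTTP ${r.status}`);
  return (await r.json()) as OllamaTagsResponse;
};

// Resolves to installed model names, or an empty list if the server
// is unreachable or returns an error.
async function listOllamaModels(
  baseUrl = 'http://localhost:11434',
  fetcher: FetchJson = fetchTags,
): Promise<string[]> {
  try {
    const data = await fetcher(`${baseUrl}/api/tags`);
    return data.models.map((m) => m.name);
  } catch {
    return [];
  }
}
```

An empty result means either that the server is not running or that no models have been pulled yet. A real UI would probably want to tell those two cases apart.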

Five Qwen3.5 sizes are available in the agent:

| Model | Note |
| --- | --- |
| qwen3.5:27b | Best quality · Ollama |
| qwen3.5:9b | Good quality · Ollama |
| qwen3.5:4b | Balanced · Ollama |
| qwen3.5:2b | Fast · Ollama |
| qwen3.5:0.8b | Fastest · Ollama |

Local Dev Only

The Ollama option only appears when the site is running locally. Chrome and Edge enforce a Private Network Access policy that blocks requests from public HTTPS pages to localhost, and there is no workaround for the public URL.
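Hiding the Ollama entries on the public site can be as simple as checking the page's hostname. The following is a sketch under assumptions: `ModelOption`, `isLocalHost`, `visibleModels`, and the list of hostnames treated as local are all invented for illustration.

```typescript
interface ModelOption {
  id: string;
  backend: 'webllm' | 'ollama';
}

// Hostnames treated as "running locally" (an assumption for this sketch).
const LOCAL_HOSTNAMES = new Set(['localhost', '127.0.0.1', '[::1]']);

function isLocalHost(hostname: string): boolean {
  return LOCAL_HOSTNAMES.has(hostname);
}

// Keeps Ollama entries only when the page itself is served from localhost.
function visibleModels(all: ModelOption[], hostname: string): ModelOption[] {
  return isLocalHost(hostname) ? all : all.filter((m) => m.backend === 'webllm');
}
```

In the component this might be called as `visibleModels(allModels, window.location.hostname)`, where `allModels` is the full list of eight options.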

To use Ollama models, run the site locally:

```bash
npm run dev
# then open http://localhost:3000
```

On localhost, the model selector shows all eight options: three WebLLM and five Ollama

Qwen3.5 4B selected and connected: the header shows "local Ollama"

I asked "What AI posts are on the blog?"; Qwen3.5 4B called get_posts_by_category and returned a full list with links

On the Java category page I prompted "summarize all posts in this category": DevTools shows Qwen3.5 9B calling get_posts_by_category, then get_post_content for each of the six posts

After seven rounds of tool use the agent produced a formatted Java Category Summary with links to all six Spring Boot posts

How It Works

The agent is a React component (BlogAgent.tsx) mounted in the Next.js layout, so it appears on every page. Post metadata is pre-built at deploy time into agent-data.json, which the component fetches when the panel first opens.
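Fetching the metadata only when the panel first opens, and only once, can be done by caching the in-flight promise. The sketch below shows that pattern. The `PostMeta` fields, the `/agent-data.json` path, and the `once` helper are assumptions for illustration, and the real file may be shaped differently.

```typescript
// Hypothetical shape; the real agent-data.json may differ.
interface PostMeta {
  slug: string;
  title: string;
  description: string;
  tags: string[];
  category: string;
}

type Loader = () => Promise<PostMeta[]>;

// Wraps a loader so the underlying request runs at most once,
// even if the panel is opened several times.
function once(loader: Loader): Loader {
  let cached: Promise<PostMeta[]> | null = null;
  return () => {
    if (!cached) {
      cached = loader().catch((err) => {
        cached = null; // allow a retry after a failed load
        throw err;
      });
    }
    return cached;
  };
}

// In the browser, the loader might look like this (path is an assumption).
const loadAgentData = once(async () => {
  const r = await fetch('/agent-data.json');
  if (!r.ok) throw new Error(`HTTP ${r.status}`);
  return (await r.json()) as PostMeta[];
});
```

The component would call `loadAgentData()` from the handler that opens the panel. Later opens reuse the same resolved promise.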

Both backends implement the same interface so the agent loop runs identically regardless of which is active. For WebLLM:

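```typescript
// Dynamic import: the WebLLM library is fetched only when this code runs.
const { CreateMLCEngine } = await import('@mlc-ai/web-llm');
const engine = await CreateMLCEngine(
  selectedModel,
  // Progress reports drive the loading bar while weights download.
  { initProgressCallback: ({ progress, text }) => setLoadState(...) },
);
```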
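The shared interface itself is not shown in the post, but judging from the calls both backends support, it might look something like the sketch below. `ChatEngine`, `ChatMessage`, `ChatCompletion`, and `ask` are names assumed here, and the response type lists only the fields a loop would need to read.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Subset of the OpenAI-style response that the loop needs to read.
interface ChatCompletion {
  choices: { message: { role: string; content: string | null } }[];
}

// The minimal surface both backends would expose to the agent loop.
interface ChatEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[] }): Promise<ChatCompletion>;
    };
  };
  interruptGenerate(): void;
}

// Backend-agnostic helper: works with any object matching ChatEngine.
async function ask(engine: ChatEngine, messages: ChatMessage[]): Promise<string> {
  const res = await engine.chat.completions.create({ messages });
  return res.choices[0]?.message.content ?? '';
}
```

Because the shape mirrors the OpenAI chat-completions response, code written against it does not need to know whether the reply came from the browser or from a local server.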

For Ollama, a thin fetch wrapper is created at load time:

```typescript
type ChatMessage = { role: string; content: string };

let controller: AbortController | null = null;
const engine = {
  chat: {
    completions: {
      create: async ({ messages }: { messages: ChatMessage[] }) => {
        // A fresh controller per request lets interruptGenerate cancel it.
        controller = new AbortController();
        const r = await fetch('http://localhost:11434/v1/chat/completions', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ model: modelName, messages, stream: false }),
          signal: controller.signal,
        });
        // Surface HTTP errors instead of parsing an error body as a completion.
        if (!r.ok) throw new Error(`Ollama request failed: HTTP ${r.status}`);
        return r.json();
      },
    },
  },
  interruptGenerate: () => controller?.abort(),
};
```

Tools

The agent has six tools:

| Tool | What it does |
| --- | --- |
| search_posts | Keyword search across titles, descriptions, and tags |
| get_posts_by_category | All posts in a named category |
| list_categories | All categories ranked by post count |
| get_post_content | Full content of a specific post |
| navigate_to_post | Push the browser to a post via the Next.js router |
| web_search | Live web search via Jina AI, for topics not covered by the blog |

I asked "Any Java related posts?" and the agent called get_posts_by_category

The agent returned links to all six Java Spring Boot posts

I followed up asking the difference between Java and JavaScript, and the agent used web_search

The agent answered using the web search results
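To make the tool table more concrete, here is how a tool like search_posts could be implemented over the pre-built metadata. This is only a sketch: the `PostMeta` fields are assumed, and matching on any query term (rather than all of them) is a design choice made here so that a query such as "Java related" still finds Java posts.

```typescript
// Hypothetical metadata shape; the real data may carry other fields.
interface PostMeta {
  slug: string;
  title: string;
  description: string;
  tags: string[];
}

// Case-insensitive keyword search across title, description, and tags.
// A post matches if it contains at least one of the query terms.
function searchPosts(posts: PostMeta[], query: string): PostMeta[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];
  return posts.filter((p) => {
    const haystack = [p.title, p.description, ...p.tags].join(' ').toLowerCase();
    return terms.some((t) => haystack.includes(t));
  });
}
```

Substring matching is crude: a search for "java" would also match a post that only mentions "JavaScript". A stricter version could split the haystack into words and compare whole tokens.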

The Agent Loop

Each turn, the model replies either with a plain-text answer (done) or a <tool_call> block naming a function to run. The component parses the block, executes the tool, and feeds the result back as a <tool_response> user message. This repeats until the model produces a text answer with no tool calls.
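The loop described above can be sketched in a few dozen lines. This is a simplified illustration rather than the code in BlogAgent.tsx: `runAgent`, `parseToolCall`, the `Complete` and `RunTool` callback types, and the default round limit of 8 are all assumptions.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

type Complete = (messages: ChatMessage[]) => Promise<string>;
type RunTool = (call: ToolCall) => Promise<unknown>;

const TOOL_CALL_RE = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/;

// Extracts the first <tool_call> block, or returns null if the reply is
// plain text or the JSON inside the block is malformed.
function parseToolCall(reply: string): ToolCall | null {
  const match = TOOL_CALL_RE.exec(reply);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[1]);
    if (typeof parsed.name !== 'string') return null;
    return { name: parsed.name, arguments: parsed.arguments ?? {} };
  } catch {
    return null;
  }
}

async function runAgent(
  complete: Complete,
  runTool: RunTool,
  messages: ChatMessage[],
  maxRounds = 8, // arbitrary limit chosen for this sketch
): Promise<string> {
  for (let round = 0; round < maxRounds; round++) {
    const reply = await complete(messages);
    const call = parseToolCall(reply);
    if (!call) return reply; // no tool call: treat the reply as the answer
    const result = await runTool(call);
    // Record the model's request and feed the tool output back as a user turn.
    messages.push({ role: 'assistant', content: reply });
    messages.push({
      role: 'user',
      content: `<tool_response>${JSON.stringify(result)}</tool_response>`,
    });
  }
  return ''; // round limit reached without a text answer
}
```

Passing `complete` and `runTool` as functions keeps the loop independent of the backend and of the tool implementations, which also makes it easy to drive with scripted replies.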

Because WebLLM's native tools API only supports a fixed set of Hermes models, I implemented function calling via prompt engineering: tool definitions are injected as JSON in the system message, and the model outputs structured <tool_call> blocks rather than using a native API.
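A system message built this way might be assembled as in the sketch below. The instruction wording, the `ToolDefinition` shape, and `buildSystemPrompt` are illustrative guesses, not the prompt the blog actually uses.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, { type: string; description: string }>;
}

// Builds a system prompt that lists the tools as JSON and tells the model
// how to request one. The wording here is a guess at a workable prompt.
function buildSystemPrompt(tools: ToolDefinition[]): string {
  return [
    'You are an assistant for this blog.',
    'You may call one of the following tools:',
    JSON.stringify(tools, null, 2),
    'To call a tool, reply with only:',
    '<tool_call>{"name": "<tool name>", "arguments": {...}}</tool_call>',
    'Otherwise, answer in plain text.',
  ].join('\n');
}

// Example definition for one of the six tools.
const exampleTools: ToolDefinition[] = [
  {
    name: 'search_posts',
    description: 'Keyword search across titles, descriptions, and tags',
    parameters: { query: { type: 'string', description: 'Search keywords' } },
  },
];
const systemMessage = { role: 'system', content: buildSystemPrompt(exampleTools) };
```

Smaller models tend to be sensitive to the exact phrasing of such instructions, so the format line is worth keeping short and literal.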

On a post page I asked the agent to summarise, and it called get_post_content with the current slug

The agent summarised the post content

I asked "summarise all Phaser posts" from the home page: DevTools shows Qwen3.5 9B calling search_posts, then get_post_content for each result

The agent produced a formatted summary of all Phaser posts with links
