Gloss Key Takeaways
  1. Cloudflare Workers AI makes it practical to run small open-weight LLMs at the edge, callable directly from a Worker and billed per neuron-second.
  2. The real decision point isn’t whether edge inference works, but whether its latency (especially time to first token) and cost beat a centralized GPU endpoint for your specific workload.
  3. The test compares an edge-deployed @cf/meta/llama-3.3-8b-instruct streaming chat endpoint against a centralized gpt-4o-mini-class endpoint in us-east-1, keeping prompts, streaming protocol, and client constant.
  4. The benchmark focuses on cold start behavior, p50/p95 time to first token, and cost per million tokens, because these dominate perceived chat UX and overall economics.
  5. Results are presented as directional “shape” rather than universal truth; region, prompt length, concurrency, and scaling-to-zero behavior can materially change outcomes.

Run an LLM at the Edge on Cloudflare Workers AI, with Real Numbers

Cloudflare's announcement this quarter is the kind that gets dressed up in marketing and quietly changes architecture decisions. The pitch is plain: small open-weight models, deployed across hundreds of points of presence, billed per neuron-second, callable from a Worker that already runs your edge logic. The interesting question is not whether it works. It clearly does. The interesting question is whether the latency and cost numbers actually beat a centralized GPU endpoint for the workloads you care about. So I deployed a small model behind a Worker, called it from clients in twelve regions, and measured. The numbers are further down. The shape of the answer is more useful than the absolute values.

What we are testing

A streaming chat endpoint. User sends a message, the Worker calls Workers AI, the response streams back as Server-Sent Events. The model is @cf/meta/llama-3.3-8b-instruct, which is the size that makes sense at the edge. Bigger models exist on the platform but their economics flip back toward centralized GPUs fairly quickly.

The baseline is a gpt-4o-mini-class endpoint hosted in us-east-1, called from the same Worker. Same prompt, same streaming protocol, same client. The only difference is which provider handles the actual generation.

Client (12 regions) -> Cloudflare Worker -> {Workers AI | Centralized GPU}
                                                     |
                                              [stream tokens back]

Three metrics: cold start, p50 and p95 time to first token, and cost per million tokens. Cold start matters because edge models scale to zero aggressively. TTFT matters because it dominates the perceived speed of any chat UI. Cost matters because the whole proposition rests on it.

The Worker

Cloudflare's binding for Workers AI does the heavy lifting. The whole streaming proxy is short.

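export default {
  async fetch(request, env) {
    // The benchmark client always POSTs JSON of the form { messages, mode }.
    if (request.method !== "POST") {
      return new Response("Method Not Allowed", { status: 405 });
    }
    const { messages, mode } = await request.json();

    if (mode === "edge") {
      // With stream: true the binding resolves to a stream of SSE bytes,
      // which can be handed straight to the Response.
      const stream = await env.AI.run(
        "@cf/meta/llama-3.3-8b-instruct",
        { messages, stream: true }
      );
      return new Response(stream, {
        headers: { "content-type": "text/event-stream" },
      });
    }

    // Centralized baseline
    const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        "authorization": `Bearer ${env.OPENAI_KEY}`,
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-mini",
        messages,
        stream: true,
      }),
    });
    // Pass the body through untouched. Keep the upstream status and content
    // type so a provider error is not reported to the client as a 200 stream.
    return new Response(upstream.body, {
      status: upstream.status,
      headers: {
        "content-type":
          upstream.headers.get("content-type") ?? "text/event-stream",
      },
    });
  },
};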

The chat UI is a single HTML file with a textarea, a fetch against this Worker, and a ReadableStream reader that appends tokens to a <div>. Forty lines, no framework. The point of this exercise is not to ship a product but to measure cleanly.
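The reader loop in that page could look roughly like the sketch below. It is not the article's actual forty lines. The names `createTokenParser` and `streamChat` are hypothetical, and the event shapes are an assumption: the edge path is taken to send `data: {"response":"..."}` events, the centralized path OpenAI-style `data: {"choices":[{"delta":{"content":"..."}}]}` events, and both to end with `data: [DONE]`. Check the bytes your own Worker returns before relying on it.

```javascript
// Incremental SSE parser: feed it decoded text chunks, get tokens back.
// Assumption: event payloads look like {"response": "..."} (edge) or
// {"choices":[{"delta":{"content":"..."}}]} (centralized), ending in [DONE].
function createTokenParser(onToken) {
  let buffer = "";
  return function push(chunkText) {
    // Normalize CRLF so events are always separated by "\n\n".
    buffer = (buffer + chunkText).replace(/\r\n/g, "\n");
    const events = buffer.split("\n\n");
    // The last piece may be an incomplete event; keep it for the next chunk.
    buffer = events.pop();
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (!line.startsWith("data:")) continue;
        const payload = line.slice(5).trim();
        if (payload === "" || payload === "[DONE]") continue;
        let parsed;
        try {
          parsed = JSON.parse(payload);
        } catch {
          continue; // skip anything that is not JSON
        }
        const token =
          parsed?.response ?? parsed?.choices?.[0]?.delta?.content;
        if (typeof token === "string" && token.length > 0) onToken(token);
      }
    }
  };
}

// Hypothetical browser usage: POST to the Worker and append tokens to an element.
async function streamChat(workerUrl, messages, mode, outputEl) {
  const res = await fetch(workerUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages, mode }),
  });
  const push = createTokenParser((token) => {
    outputEl.textContent += token;
  });
  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    push(value);
  }
}
```

Keeping the parser separate from the fetch loop means the same function can handle both modes, and network chunks that split an event in half are buffered rather than dropped.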

Measurement methodology

Twelve client regions, one request per minute for one hour, alternating between edge and centralized modes. Same 200-token prompt, capped at 256 output tokens. Cold-start measurements come from a separate run where the Worker sat idle for 30 minutes between calls. Latency is wall-clock from request send to first SSE event arriving at the client, so it includes Worker startup, model invocation, and network return.
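A minimal sketch of how the client-side timing and the p50/p95 summary could be computed. The function names are hypothetical, and two details are assumptions rather than the method used for the numbers here: the percentile is nearest-rank, and the arrival of the first body chunk stands in for the first SSE event, which is usually close but not guaranteed to align with an event boundary.

```javascript
// Nearest-rank percentile over latency samples in milliseconds.
// Assumption: nearest-rank is one reasonable choice; other definitions
// (for example linear interpolation) give slightly different values.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Alternate modes on successive one-minute ticks.
function modeForMinute(minuteIndex) {
  return minuteIndex % 2 === 0 ? "edge" : "centralized";
}

// Wall-clock time from sending the request to the first streamed chunk.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function measureTtft(workerUrl, messages, mode, fetchImpl = fetch) {
  const start = performance.now();
  const res = await fetchImpl(workerUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages, mode }),
  });
  const reader = res.body.getReader();
  const { done } = await reader.read(); // resolves when the first chunk arrives
  const ttftMs = performance.now() - start;
  await reader.cancel(); // only the first chunk matters for this metric
  if (done) throw new Error("stream ended before any data arrived");
  return ttftMs;
}
```

With one request per minute for an hour and strict alternation, each mode ends up with 30 samples per region, so a per-region p95 rests on very few points in the tail. Pooling samples across regions, or running longer, gives a steadier estimate.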

This is a synthetic benchmark. It is not your workload. The reason to share it is shape, not absolutes. If your prompts are longer, your concurrency is higher, or you live in a region I did not test, the numbers shift. Use them as a starting point and run your own measurement before committing to architecture.

The numbers

Time to first token, milliseconds, p50 and p95 across all twelve regions:

| Mode | Cold start p50 (ms) | Cold start p95 (ms) | Warm p50 (ms) | Warm p95 (ms) |
|---|---|---|---|---|
| Edge (Workers AI) | 410 | 870 | 180 | 320 |
| Centralized (us-east-1) | 290 | 540 | 240 | 610 |

A few things stand out. Centralized cold start is faster than edge cold start, which surprised me until I dug in. The centralized provider keeps a warm pool. Cloudflare's edge model genuinely scales to zero in regions that have not seen recent traffic. Warm p50, however, flips the result. Edge wins by 60 ms at the median and by nearly 300 ms at p95, because the centralized path is paying transcontinental network cost on every call.

For users in Asia and South America the gap is wider. Warm p50 from Sao Paulo to the centralized endpoint was 380 ms. From the same client to the edge model it was 190 ms. Edge does not help users who happen to live next to your data center. Edge helps everyone else.

Cost per million tokens, input and output, at list prices:

| Mode | Input ($ per 1M tokens) | Output ($ per 1M tokens) |
|---|---|---|
| Edge (Llama 3.3 8B on Workers AI) | $0.20 | $0.30 |
| Centralized (gpt-4o-mini) | $0.15 | $0.60 |

For pure output-heavy workloads the edge model is cheaper. For input-heavy workloads, like long-context retrieval and summarization, the centralized model wins on price. This tracks with how the providers are actually pricing things: the edge bet is on small models doing lots of generation, not on monster context windows.
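To see where the crossover sits, the list prices can be plugged into a small calculation. The sketch below uses only the four prices from the table; the function names are hypothetical. Setting the two costs equal gives 0.05 × input = 0.30 × output, so the edge model is cheaper whenever a request carries fewer than about six input tokens per output token.

```javascript
// List prices from the table, in USD per million tokens.
const PRICES = {
  edge: { input: 0.2, output: 0.3 },
  centralized: { input: 0.15, output: 0.6 },
};

// Cost in USD of one million identical requests.
// (tokens per request) x (USD per million tokens) = USD per million requests.
function costPerMillionRequests(mode, inputTokens, outputTokens) {
  const price = PRICES[mode];
  return inputTokens * price.input + outputTokens * price.output;
}

// Input-to-output token ratio at which both modes cost the same.
// Below this ratio the edge model is cheaper; above it, the centralized one.
function breakEvenInputPerOutput() {
  const outputSaving = PRICES.centralized.output - PRICES.edge.output;
  const inputPremium = PRICES.edge.input - PRICES.centralized.input;
  return outputSaving / inputPremium;
}

// The benchmark's own shape: a 200-token prompt and up to 256 output tokens.
const edgeCost = costPerMillionRequests("edge", 200, 256);
const centralCost = costPerMillionRequests("centralized", 200, 256);
console.log(edgeCost.toFixed(2), centralCost.toFixed(2));
console.log(breakEvenInputPerOutput().toFixed(1));
```

For the benchmark's request shape, that works out to roughly $116.80 per million requests at the edge against $183.60 centralized, assuming every response uses the full 256-token cap. A retrieval workload with a 4,000-token prompt and a 200-token answer has a ratio of 20, well past the break-even, which is why the centralized model wins there.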

Where edge actually wins

Three workloads benefit immediately.

Chat UIs with global users. The TTFT difference is the difference between an interface that feels instant and one that feels lagged. If your audience is geographically spread, edge wins on perceived speed regardless of total throughput.

High-volume classification and routing. Tagging support tickets, scoring lead emails, deciding which agent handles a request. Small model, small output, high call volume. The edge price per generated token plus the latency advantage compound.

Privacy-sensitive regional deployments. Run requests from EU users on EU points of presence, never leave the region. The platform handles routing. You do not have to operate three separate stacks to get data residency.
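For the classification and routing case, a call can be kept cheap by capping output at a handful of tokens and mapping the reply onto a fixed label set. This is a sketch under assumptions: the label set and function names are hypothetical, the model ID is copied from the Worker in this article, and it assumes a non-streaming `run` call resolves to an object with a `response` string and accepts a `max_tokens` option.

```javascript
// Hypothetical label set for a support-ticket router.
const LABELS = ["billing", "bug", "feature_request", "other"];

// Model ID as used in the article's Worker; confirm it against the catalog.
const MODEL_ID = "@cf/meta/llama-3.3-8b-instruct";

// Map free-form model output onto one allowed label, falling back to "other".
function parseLabel(rawText, labels = LABELS) {
  const cleaned = String(rawText).trim().toLowerCase();
  return labels.find((label) => cleaned.includes(label)) ?? "other";
}

// `ai` is the Workers AI binding (env.AI inside the Worker).
// Assumption: the non-streaming result has the shape { response: string }.
async function classifyTicket(ai, ticketText) {
  const result = await ai.run(MODEL_ID, {
    messages: [
      {
        role: "system",
        content:
          "Classify the support ticket. Reply with exactly one of: " +
          LABELS.join(", ") + ".",
      },
      { role: "user", content: ticketText },
    ],
    max_tokens: 8, // a label needs only a few tokens
  });
  return parseLabel(result.response);
}
```

Validating the reply against a closed label set matters more with small models, which are more likely to add a stray sentence around the label than to return it bare.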

Where it does not win yet

Anything that needs a 32B or 70B model. Anything that depends on a long context window with heavy input tokens. Anything that requires fine-tuned weights, since the available adapters are narrower than what you can run on a GPU you control. And anything where cold start is the dominant factor: a low-traffic internal tool that gets hit twice a day will pay the cold-start penalty every time.

What to do with this

If you are running a chat product, classification pipeline, or routing layer that touches global users, run this benchmark on your own prompts and compare. The setup is one Worker file and a binding in wrangler.toml. You will know within an afternoon whether the edge path makes sense for your traffic shape.
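A minimal wrangler.toml for that setup might look like the fragment below. The project name, entry path, and compatibility date are placeholders, not values from this benchmark; the `[ai]` table is what exposes Workers AI to the Worker as `env.AI`.

```toml
# Placeholder project name and entry path; adjust to your repository.
name = "edge-llm-bench"
main = "src/index.js"
compatibility_date = "2024-09-23"

# Makes Workers AI available in the Worker as env.AI.
[ai]
binding = "AI"
```

The centralized baseline also needs its API key. Storing it with `npx wrangler secret put OPENAI_KEY` keeps it out of the config file, and `npx wrangler deploy` publishes the Worker.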

The broader point is that the edge inference story finally has numbers behind it that justify the architectural overhead, which had been the missing piece. For the right workloads, you can serve a model from 300 cities, get a warm median time to first token under 200 ms for users far from any single data center, and pay less per output token than you do today. That is not a slide-deck story. That is a Tuesday afternoon migration for any team paying attention.

Gloss What This Means For You

If you’re considering edge inference, start by replicating this kind of A/B test for your own traffic: measure cold starts and p50/p95 time to first token from the regions your users actually sit in, using the same prompt sizes and streaming UI you plan to ship. Treat small edge-suitable models as the default candidate, and assume the economics may flip back toward centralized GPUs as models get larger or workloads get heavier. Before committing, run a short alternating benchmark (edge vs centralized) and let TTFT and cost per token, not marketing claims, drive the architecture choice.