Gloss Key Takeaways

Most LLM FastAPI examples ignore the production-critical “boring” infrastructure (retries, quotas, logging, middleware, replayable tests) that keeps endpoints stable under real traffic.
A small, opinionated template can cover the essentials in under ~800 lines with minimal dependencies, adding heavier components (queues, vector stores, celery) only when needed.
Commit to async end-to-end: keep routes thin, make all I/O awaitable, and push provider SDK calls, retries, and logging into a dedicated LLM wrapper.
Use structured retries with exponential backoff plus jitter to handle transient errors, rate limits, and provider 5xx storms without creating thundering-herd retry spikes.
Cap retry attempts (e.g., four) so users get a fast, clear failure during real provider outages instead of long hangs.

A FastAPI Starter Kit for Shipping LLM Features in Production

FastAPI keeps showing up as the default backend for AI products and it is easy to see why. Async by default, type hints that double as documentation, fast enough that the network and the model are always your bottleneck, simple enough that a single developer can hold the whole thing in their head. Every tutorial uses it. Almost none of them cover the parts that actually matter once your endpoint is in front of real users. This article fills that gap with an opinionated reference template you can fork today.

Most LLM tutorials stop at "here is a route that calls the API and returns the response." That route will fall over the first time your provider has a bad five minutes, the first time a user spams a streaming endpoint, the first time you need to debug why a specific prompt produced a specific response three days ago. The boring infrastructure around the model call is the entire job. Here is what that infrastructure looks like.

The shape of the template

Eight files do most of the work.

app/
  main.py            # FastAPI app, routers, startup
  llm.py             # client wrapper, retries, streaming
  schemas.py         # pydantic in/out models
  quotas.py          # per-user request quotas
  logging.py         # prompt/response logging
  middleware.py      # request id, timing, error trap
  fixtures.py        # frozen replay harness for tests
  config.py          # settings via pydantic-settings
tests/
  test_replay.py

Fewer than 800 lines of Python, with dependencies kept tight: fastapi, httpx, tenacity, pydantic-settings, structlog, and your provider SDK of choice. No queue. No vector store. No Celery. Add those when you actually need them, not because the template forces them on you.

Async everywhere or async nowhere

FastAPI lets you mix sync and async route handlers. Do not. Pick async, commit to async, and call sync libraries from a thread pool with asyncio.to_thread only when you have to. Every LLM call, every database call, every external HTTP call should be awaitable. The moment one slow sync call sneaks into an async hot path, it blocks the event loop for the duration and every other request on that worker stalls behind it.

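@app.post("/chat")  # path is illustrative
async def chat(request: ChatRequest, user: User = Depends(current_user)) -> StreamingResponse:
    # Reject over-quota users before any provider call is made.
    await quotas.check(user.id, "chat")

    async def stream():
        # llm.stream owns the SDK call, retries, and logging.
        async for chunk in llm.stream(request.messages, request_id=request.id):
            # Pydantic v2: model_dump_json() replaces the deprecated .json().
            yield f"data: {chunk.model_dump_json()}\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")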

Notice what this route does and does not do. It checks the quota. It streams. It returns. The actual SDK calls, retries, and logging are inside llm.stream. Routes stay thin so they remain readable when you come back six months from now and the provider has changed twice.
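
As a minimal sketch of the thread-pool escape hatch mentioned earlier in this section, the following wraps a blocking function with asyncio.to_thread. The function legacy_tokenize is a hypothetical stand-in for any sync-only library call; it is not part of the template.

```python
import asyncio
import time


def legacy_tokenize(text: str) -> list[str]:
    # Hypothetical blocking call standing in for a sync-only library.
    time.sleep(0.05)
    return text.split()


async def tokenize(text: str) -> list[str]:
    # Runs the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(legacy_tokenize, text)


async def main() -> list[list[str]]:
    # The two calls overlap instead of running back to back.
    return await asyncio.gather(tokenize("hello world"), tokenize("fast api"))
```

Calling asyncio.run(main()) returns the two token lists in order. The wrapper keeps the awaitable interface at the call site, so the route code does not need to know the underlying library is synchronous.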

Retries with backoff

Every production LLM call needs to handle three failures: transient network errors, provider rate limits, and provider 5xx storms. tenacity covers all three with a decorator that takes about 90 seconds to configure correctly.

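import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter

def _is_retryable(exc: BaseException) -> bool:
    # Retry network failures, rate limits (429), and provider 5xx; never other 4xx.
    if isinstance(exc, httpx.TransportError):
        return True
    if isinstance(exc, httpx.HTTPStatusError):
        status = exc.response.status_code
        return status == 429 or status >= 500
    return False

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=0.5, max=8),
    retry=retry_if_exception(_is_retryable),
    reraise=True,
)
async def _call(payload: dict) -> dict:
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post(PROVIDER_URL, json=payload, headers=AUTH)
        r.raise_for_status()
        return r.json()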

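Two non-obvious choices. Cap retries at four, not ten. If four attempts with exponential backoff fail, the provider is having a real outage and your user is better served by a fast, clear error than a 90-second hang. Keep in mind that the 60-second timeout applies to each attempt, so lower it if you need a tight bound on total latency. Add jitter on top of the exponential backoff, rather than using pure exponential delays, to avoid thundering-herd retries from a hundred pods that all hit the same rate-limit window.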

For streaming endpoints, retries get tricky. You cannot retry midstream without restarting the entire response. The pattern that works: retry only while establishing the connection, and once the first byte has arrived, stop retrying and surface any error to the client. Most streaming SDKs implement this for you, but check, because the default in some clients is to silently swallow stream errors.

Structured outputs

Give up on regex parsing. Use the structured-output mode your provider exposes, whether that is JSON mode, tool calls, or a Pydantic-aware response format. The pattern is much the same across providers: define a schema, pass it to the model, get back a validated object.

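from pydantic import BaseModel, EmailStr  # EmailStr needs the email-validator package

class ExtractedFields(BaseModel):
    name: str
    email: EmailStr
    company: str | None = None

async def extract(text: str) -> ExtractedFields:
    # The schema is passed to the provider and checked again on our side.
    raw = await llm.respond(
        prompt=text,
        response_format=ExtractedFields,
    )
    # Raises pydantic.ValidationError if the provider drifted from the schema.
    return ExtractedFields.model_validate(raw)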

The interesting part is what to do when validation fails, because it will. The provider returned something close to your schema but not quite. Two strategies. Either retry once with the validation error appended to the prompt, which works for small drift, or return a structured error to the client and log the offending response for later analysis. Do not silently coerce. Coercion hides bugs that bite you in production at 3 a.m.
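
The first strategy can be sketched as follows. To stay dependency-free, this version uses a hand-rolled validate_fields and a SchemaError class as stand-ins for ExtractedFields.model_validate and pydantic.ValidationError; both names, and the respond callable, are assumptions for illustration.

```python
import json
from collections.abc import Awaitable, Callable


class SchemaError(ValueError):
    """Stand-in for pydantic.ValidationError in this dependency-free sketch."""


def validate_fields(raw: str) -> dict:
    # Hand-rolled check standing in for ExtractedFields.model_validate.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    missing = [k for k in ("name", "email") if k not in data]
    if missing:
        raise SchemaError(f"missing required fields: {missing}")
    return data


async def extract_with_repair(text: str, respond: Callable[[str], Awaitable[str]]) -> dict:
    raw = await respond(text)
    try:
        return validate_fields(raw)
    except SchemaError as first_error:
        # One repair attempt: show the model exactly what failed validation.
        repair_prompt = f"{text}\n\nYour previous answer was rejected: {first_error}"
        raw = await respond(repair_prompt)
        # A second failure propagates so the caller can return a structured error.
        return validate_fields(raw)
```

Limiting the repair to a single attempt keeps latency bounded; anything that fails twice is more likely a prompt or schema problem than small drift.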

Streaming that does not lie

A streaming endpoint that returns 200 and then errors midstream is worse than one that returns 500 immediately. Browsers and SSE clients handle the latter cleanly. They handle the former by displaying half a response and going silent.

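import json

async def safe_stream(generator):
    try:
        async for chunk in generator:
            yield chunk
    except Exception as e:
        # Emit the failure as an SSE event; StreamingResponse needs str or bytes, not a dict.
        payload = json.dumps({"error": str(e), "type": e.__class__.__name__})
        yield f"event: error\ndata: {payload}\n\n"
    else:
        # Terminal event on success so the client never has to time out.
        yield "event: done\ndata: {}\n\n"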

Always send a terminal event, success or failure. The client should never have to time out to learn the stream is over. Most teams discover this only after their first user complains about a frozen UI.

Quotas

Per-user quotas are the cheapest insurance you will ever buy. Without them, one curious user with a script can drain your monthly budget on a Saturday. Redis is overkill for this if you are running a single region. A Postgres table of quota events with user_id, bucket, and ts columns works fine for tens of thousands of users.

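from datetime import datetime, timedelta, timezone

async def check(user_id: str, bucket: str, limit: int = 100, window_sec: int = 3600):
    # Timezone-aware cutoff (datetime.utcnow() is naive and deprecated); assumes ts is timestamptz.
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=window_sec)
    async with db.transaction():
        # Serialize checks per user and bucket so concurrent requests cannot both pass the count.
        await db.execute(
            "SELECT pg_advisory_xact_lock(hashtext($1::text || ':' || $2::text))", user_id, bucket
        )
        count = await db.fetchval("""
            SELECT count(*) FROM quota_events
            WHERE user_id = $1 AND bucket = $2 AND ts > $3
        """, user_id, bucket, cutoff)
        if count >= limit:
            raise QuotaExceeded(bucket=bucket, retry_after=window_sec)
        await db.execute("""
            INSERT INTO quota_events (user_id, bucket, ts) VALUES ($1, $2, now())
        """, user_id, bucket)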

Two buckets per user. A short-window quota for abuse prevention, like 100 calls per hour. A long-window quota for cost control, like 10,000 calls per month. Different buckets, different responses. The short-window quota returns a 429 with a Retry-After header. The long-window quota returns a friendlier 402-style "you have hit your plan limit" response with an upgrade link.
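
One way to express the two response shapes is a small mapping function. This is a sketch with assumed names: the "chat-monthly" bucket, the COST_BUCKETS set, the /billing/upgrade URL, and the error codes are all hypothetical. In a FastAPI app the returned tuple would typically be turned into a JSON response inside an exception handler.

```python
class QuotaExceeded(Exception):
    def __init__(self, bucket: str, retry_after: int):
        super().__init__(f"quota exceeded for {bucket}")
        self.bucket = bucket
        self.retry_after = retry_after


# Hypothetical bucket names; the article's route only uses "chat".
COST_BUCKETS = {"chat-monthly"}


def quota_response(exc: QuotaExceeded, upgrade_url: str = "/billing/upgrade"):
    """Return (status_code, headers, body) for a quota failure."""
    if exc.bucket in COST_BUCKETS:
        # Plan limit reached: waiting will not help, so point at the upgrade path.
        body = {"error": "plan_limit_reached", "upgrade_url": upgrade_url}
        return 402, {}, body
    # Abuse-prevention bucket: tell the client when it may try again.
    headers = {"Retry-After": str(exc.retry_after)}
    return 429, headers, {"error": "rate_limited", "bucket": exc.bucket}
```

Keeping the mapping in one function means the route code only ever raises QuotaExceeded, and the decision about status codes lives in a single place.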

Prompt and response logging

Log every prompt and every response, redacted, with a request ID. The logs are how you debug, how you fine-tune, how you answer the support ticket that says "the bot lied to me yesterday." Use structlog for JSON output, ship to whatever log platform you already pay for.

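import hashlib
import time

log.info(
    "llm.completion",
    request_id=request_id,
    user_id=user.id,
    model=model,
    prompt_tokens=resp.usage.prompt_tokens,
    completion_tokens=resp.usage.completion_tokens,
    latency_ms=round((time.monotonic() - t0) * 1000),
    # hashlib.sha256 takes bytes and returns a hash object; log the hex digest.
    prompt_hash=hashlib.sha256(prompt.encode()).hexdigest(),
    response_hash=hashlib.sha256(response.encode()).hexdigest(),
)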

Hash the prompt and response. Store the raw text only if you have a clear retention policy and the user has consented. The hash gets you grouping and dedup without keeping the raw text, which removes most of the regulatory headache.
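
A minimal sketch of the redact-then-hash step, assuming two hypothetical helpers. The email pattern is illustrative only and real redaction needs more than one regex; the whitespace normalisation before hashing is a design choice of this sketch, not something the template prescribes.

```python
import hashlib
import re

# Deliberately simple pattern; it will not catch every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    # Masks email addresses before the text goes anywhere near a log line.
    return EMAIL_RE.sub("[EMAIL]", text)


def fingerprint(text: str) -> str:
    # Collapse whitespace so trivially different prompts hash the same.
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()
```

Normalising before hashing makes the fingerprint more useful for grouping, at the cost of treating prompts that differ only in spacing as identical.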

The replay test harness

The single highest-leverage piece of the template, and the part most LLM apps skip. Capture a few hundred real prompts and responses from staging, freeze them as fixtures, and test against the frozen set. Your tests do not call the model. They call your code with a fake llm client that returns the recorded response.

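import pytest

@pytest.fixture
def replay_llm(fixtures_dir):
    # Recorded responses keyed by a hash of the request that produced them.
    responses = load_fixtures(fixtures_dir)
    async def _stub(prompt, **kwargs):
        # Takes the same prompt keyword that extract() passes to llm.respond.
        key = hash_messages(prompt)
        return responses[key]
    return _stub

@pytest.mark.asyncio  # needs pytest-asyncio unless asyncio_mode = "auto" is set
async def test_extract_handles_partial_response(replay_llm, monkeypatch):
    monkeypatch.setattr("app.llm.respond", replay_llm)
    result = await extract("Acme Corp, contact: jane@acme.com")
    assert result.email == "jane@acme.com"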

The point is not to test the model. The model is not your code. The point is to test the code that sits around the model, the parsing, the retries, the validation, the error paths, and to do it without burning $50 on every CI run. Refresh the fixtures monthly, or whenever the prompt changes meaningfully.
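
The helpers load_fixtures and hash_messages are not shown in the article. One possible shape, as a sketch: it assumes each fixture is a JSON file with "request" and "response" keys, which is my assumption about the layout rather than the template's actual format.

```python
import hashlib
import json
from pathlib import Path


def hash_messages(messages) -> str:
    # Canonical JSON so key order and spacing do not change the hash.
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_fixtures(fixtures_dir) -> dict:
    """Map request hash -> recorded response for every *.json file in the directory."""
    responses = {}
    for path in sorted(Path(fixtures_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        responses[hash_messages(record["request"])] = record["response"]
    return responses
```

With this layout, a request that has no recorded fixture raises KeyError in the stub, which is a useful signal that the prompt changed and the fixtures need refreshing.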

What this gets you

A FastAPI service that handles outages, throttles abusers, fails loudly when it should, fails quietly when it should, logs enough to debug a year-old issue, and ships with tests that run in two seconds and never call the model. None of it is exciting. All of it is the difference between a demo and a product. Fork the template, swap in your provider, and start with the boring parts already in place. The fun parts are easier when the foundation is solid.

Gloss What This Means For You

If you’re shipping LLM features with FastAPI, start from a template that bakes in quotas, request IDs/timing middleware, prompt/response logging, and a replay harness so you can debug and test behavior days later. Keep your route handlers minimal and async, and centralize provider calls in a wrapper that handles streaming and tenacity-based retries with jitter. Set conservative retry limits so outages fail quickly and predictably, and only introduce heavier infrastructure like queues or background workers once you’ve proven you need them.