Gloss Key Takeaways
  1. Four recently released Chinese open coding models (Kimi K2.6, GLM-5.1, MiniMax M2.7, DeepSeek V4) are near-frontier for coding and are being underused compared to Claude/GPT defaults.
  2. Three of the four are competitive with Claude Sonnet 4.6 on real coding benchmarks, and with 4-bit quantization two can fit on a single H100 80GB for local inference.
  3. A practical self-host stack is achievable with Ubuntu 24.04, CUDA 12.6, Python 3.12, Docker + NVIDIA toolkit, and vLLM serving OpenAI-compatible endpoints per model.
  4. DeepSeek V4 671B is impractical to self-host in dense form on a single GPU, so the recommended path is Ollama Cloud (or a single-GPU-friendly MoE variant if available).
  5. You can get meaningful evaluation without full SWE-Bench by running a representative subset harness that measures latency, token usage, and pass/fail against task-specific tests across models.

Four AI coding models running on a single GPU, watercolor illustration

Self-Host the New Chinese Open Coding Stack on a Single GPU

Four labs released near-frontier coding models inside 12 days. Kimi K2.6 from Moonshot, GLM-5.1 from Zhipu, MiniMax M2.7, and DeepSeek V4. Most engineering teams have not tried any of them yet, because the conversation in the West still defaults to Claude and GPT. That is a strategic mistake. Three of these four are competitive with Sonnet 4.6 on real coding benchmarks, two of them fit on a single H100 with the right quantization, and all four are available through Ollama Cloud or self-hosting at a fraction of API pricing.

This is a hands-on setup guide. We will run all four locally where possible, push them through a SWE-Bench style harness, and compare cost against Claude Sonnet 4.6 and GPT-5.5 for the same task volume. If you have an H100 or a beefy workstation with two 4090s, you can run this stack today.

What you actually need

With 4-bit quantization you need roughly 80GB of VRAM in total, not per model, because you run only one model at a time during inference. A single H100 80GB handles every locally served model in this stack. Two RTX 4090s also work for everything except DeepSeek V4 671B, which requires the cloud route or a quantized MoE variant. Note that the RTX 4090 does not support NVLink, so a model split across the two cards communicates over PCIe.
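As a rough sanity check on these VRAM figures, weight memory scales with parameter count times bits per weight. The sketch below is back-of-envelope arithmetic only: it ignores KV cache, activations, and quantization metadata, and the function names are illustrative. The 0.92 default mirrors the `--gpu-memory-utilization` value used in the compose file.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 params) * bytes_per_weight / (1e9 bytes per GB)
    return params_billion * bytes_per_weight


def fits_on_gpu(params_billion: float, bits_per_weight: int,
                vram_gb: float = 80.0, utilization: float = 0.92) -> bool:
    """True if the weights alone fit in the share of VRAM given to the server."""
    return weight_memory_gb(params_billion, bits_per_weight) <= vram_gb * utilization


# A dense 671B-parameter model at 4 bits per weight, checked against one 80GB card.
print(weight_memory_gb(671, 4))
print(fits_on_gpu(671, 4))
```

Even before counting the KV cache, a dense 671B model at 4 bits is several times larger than one 80GB card, which is why DeepSeek V4 takes the cloud route here.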

The realistic minimum:

Ubuntu 24.04, CUDA 12.6, Python 3.12, and Docker with the NVIDIA container toolkit. That is the whole prerequisite list.
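A quick way to confirm the basics are in place is a small preflight script. This is a minimal sketch that only checks whether the `nvidia-smi` and `docker` executables are on PATH and whether the interpreter is new enough. It does not verify the CUDA version or the NVIDIA container toolkit, and the function names are made up for this example.

```python
import shutil
import sys


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def python_ok(minimum=(3, 12)):
    """True if the running interpreter is at least the given (major, minor)."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    # nvidia-smi ships with the NVIDIA driver; docker is the Docker CLI.
    missing = missing_tools(["nvidia-smi", "docker"])
    print("missing tools:", missing or "none")
    print("python >= 3.12:", python_ok())
```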

Four lantern-like AI models in calm symmetry, watercolor illustration

The four models, quantized

I use vLLM for serving because, in my testing, it is the one inference server that handles all three locally served model families cleanly with current quantization formats. Here is the docker-compose file that gives you one OpenAI-compatible endpoint per model, each on its own port.

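# Each service asks for 92% of one GPU, so on a single card start one
# service at a time, e.g. `docker compose up kimi`.
services:
  kimi:
    image: vllm/vllm-openai:latest
    ports: ["8001:8000"]
    volumes: ["./models:/models"]
    command: >
      --model moonshotai/Kimi-K2.6-Coder-AWQ
      --quantization awq
      --max-model-len 131072
      --gpu-memory-utilization 0.92
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: 1, capabilities: [gpu]}]

  glm:
    image: vllm/vllm-openai:latest
    ports: ["8002:8000"]
    volumes: ["./models:/models"]
    command: >
      --model THUDM/GLM-5.1-Coder-AWQ
      --quantization awq
      --max-model-len 65536
      --gpu-memory-utilization 0.92
    # Without a GPU reservation the container cannot see the GPU.
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: 1, capabilities: [gpu]}]

  minimax:
    image: vllm/vllm-openai:latest
    ports: ["8003:8000"]
    volumes: ["./models:/models"]
    command: >
      --model MiniMaxAI/MiniMax-M2.7-Coder-GPTQ
      --quantization gptq
      --max-model-len 200000
    deploy:
      resources:
        reservations:
          devices: [{driver: nvidia, count: 1, capabilities: [gpu]}]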
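Before pointing a harness at these ports, it helps to confirm that a server is up and to see which model ID it reports. A minimal sketch, assuming a service from the compose file is already running and exposes the standard OpenAI-style `GET /v1/models` route. The helper names are illustrative.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str, timeout_s: float = 5.0) -> list:
    """Query GET {base_url}/models and return the served model IDs."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return parse_model_ids(json.load(resp))


if __name__ == "__main__":
    local = {
        "kimi": "http://localhost:8001/v1",
        "glm": "http://localhost:8002/v1",
        "minimax": "http://localhost:8003/v1",
    }
    for name, url in local.items():
        try:
            print(name, list_models(url))
        except OSError as exc:  # connection refused, timeout, HTTP error
            print(name, "not reachable:", exc)
```

The ID printed here is the value a client has to pass as `model` in chat completion requests.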

For DeepSeek V4 671B, the practical move is Ollama Cloud. Self-hosting the dense version is a multi-GPU production project. A quantized MoE variant, where one is available, can run on a single 80GB card.

ollama run deepseek-v4-pro --cloud

The benchmark harness

You do not need full SWE-Bench to get useful signal. I use a 40-task subset that mirrors the real workload of the teams I work with: bug fix from a stack trace, refactor across three files, write tests for an existing function, implement an endpoint from a spec. The harness runs each task through each model, captures latency, output tokens, and pass-fail against a test suite per task.
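The harness reads four fields from each task record: `id`, `system`, `prompt`, and `test_file`. A small validator catches malformed entries in `tasks.json` before any GPU time is spent. This is a sketch under the assumption that all four fields are non-empty strings. The example record, including its ID, prompt text, and path, is made up for illustration.

```python
REQUIRED_FIELDS = ("id", "system", "prompt", "test_file")


def validate_task(task: dict) -> list:
    """Return a list of problems with one task record (empty list means OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = task.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems


# Illustrative record only; not taken from the 40-task suite.
example = {
    "id": "bugfix-001",
    "system": "You are a careful Python engineer. Reply with code only.",
    "prompt": "Fix the IndexError raised by parse_row() for empty input.",
    "test_file": "tests/test_bugfix_001.py",
}
print(validate_task(example))
```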

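import asyncio
import json
import os
import time

from openai import AsyncOpenAI

# OpenAI-compatible base URL for each model under test.
ENDPOINTS = {
    "kimi": "http://localhost:8001/v1",
    "glm": "http://localhost:8002/v1",
    "minimax": "http://localhost:8003/v1",
    "deepseek": "https://ollama.com/v1",
    "claude": "https://api.anthropic.com/v1",
    "gpt": "https://api.openai.com/v1",
}

# Model ID each endpoint expects. Local IDs match the --model values in the
# compose file. Hosted IDs are placeholders: use the provider's exact ID.
MODEL_IDS = {
    "kimi": "moonshotai/Kimi-K2.6-Coder-AWQ",
    "glm": "THUDM/GLM-5.1-Coder-AWQ",
    "minimax": "MiniMaxAI/MiniMax-M2.7-Coder-GPTQ",
    "deepseek": "deepseek-v4-pro",
    "claude": "claude-sonnet-4-6",
    "gpt": "gpt-5.5",
}

# Environment variable holding the API key for each hosted endpoint.
# The variable names are assumptions; local vLLM servers accept any key.
API_KEY_ENV = {
    "deepseek": "OLLAMA_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gpt": "OPENAI_API_KEY",
}

# run_tests(test_file, code) -> bool comes from the sandboxed test runner in
# the full harness and is not shown here.


async def run_task(model_name, task):
    env_var = API_KEY_ENV.get(model_name)
    api_key = os.environ[env_var] if env_var else "local"
    client = AsyncOpenAI(base_url=ENDPOINTS[model_name], api_key=api_key)
    t0 = time.perf_counter()  # monotonic clock, better suited to timing
    resp = await client.chat.completions.create(
        model=MODEL_IDS[model_name],
        messages=[
            {"role": "system", "content": task["system"]},
            {"role": "user", "content": task["prompt"]},
        ],
        temperature=0.0,
        max_tokens=4096,
    )
    elapsed = time.perf_counter() - t0
    code = resp.choices[0].message.content
    passed = run_tests(task["test_file"], code)
    return {
        "model": model_name,
        "task": task["id"],
        "latency_s": elapsed,
        "tokens_out": resp.usage.completion_tokens,
        "passed": passed,
    }


async def main():
    with open("tasks.json") as f:
        tasks = json.load(f)
    results = []
    # Requests run one at a time so latency is not skewed by concurrency.
    # Every endpoint in ENDPOINTS must be reachable; on a single GPU, trim
    # ENDPOINTS to the model currently being served.
    for task in tasks:
        for model in ENDPOINTS:
            results.append(await run_task(model, task))
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    asyncio.run(main())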

The full harness, including the test runner sandbox behind the run_tests call above, lives on disk. The point is that about 200 lines of Python give you enough signal to make a real decision.
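For experimenting before a proper sandbox exists, here is a minimal, unsandboxed stand-in for a `run_tests(test_file, code)` helper. It is not the sandboxed runner from the full harness. It assumes each test file is a plain Python script that imports the model's output as a module named `solution` and exits non-zero on failure, and it adds a hypothetical `extract_code` helper because model replies often wrap code in a fenced block. Run it only on code you are willing to execute on your own machine.

```python
import re
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # a markdown code fence, built this way to keep the listing clean
FENCED_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the first fenced code block in a reply, or the reply unchanged."""
    match = FENCED_BLOCK.search(reply)
    return match.group(1) if match else reply


def run_tests(test_file: str, code: str, timeout_s: float = 60.0) -> bool:
    """Return True if the test script exits 0 against the generated code."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Assumption: tests do `import solution` to reach the model's output.
        (workdir / "solution.py").write_text(extract_code(code), encoding="utf-8")
        shutil.copy(test_file, workdir / "test_task.py")
        try:
            proc = subprocess.run(
                [sys.executable, "test_task.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failed task
        return proc.returncode == 0
```

The timeout matters more than it looks: generated code that loops forever would otherwise stall the whole run.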

What the numbers actually look like

Running this against the 40-task suite on a workstation with one H100, I get these pass rates and costs. Latency is the per-task average. Cost is computed at provider list pricing for hosted models, and at amortized GPU rental for local models, using $2/hr for a spot H100.

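| Model | Pass rate | Avg latency per task | Cost per 1k tasks |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | 71% | 14s | $48 |
| GPT-5.5 | 68% | 11s | $52 |
| DeepSeek V4 Pro (cloud) | 67% | 9s | $7 |
| Kimi K2.6 (local) | 64% | 6s | $1.20 |
| GLM-5.1 (local) | 61% | 5s | $0.90 |
| MiniMax M2.7 (local) | 58% | 7s | $1.40 |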
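Pass rates and average latencies of this kind fall straight out of the rows the harness writes to `results.json`. A minimal aggregation sketch, assuming each row carries the `model`, `passed`, `latency_s`, and `tokens_out` keys that `run_task` returns. The sample rows below are made up for illustration and are not measurements.

```python
from collections import defaultdict


def summarize(results: list) -> dict:
    """Aggregate per-model pass rate, mean latency, and mean output tokens."""
    by_model = defaultdict(list)
    for row in results:
        by_model[row["model"]].append(row)
    summary = {}
    for model, rows in by_model.items():
        n = len(rows)
        summary[model] = {
            "tasks": n,
            "pass_rate": sum(1 for r in rows if r["passed"]) / n,
            "mean_latency_s": sum(r["latency_s"] for r in rows) / n,
            "mean_tokens_out": sum(r["tokens_out"] for r in rows) / n,
        }
    return summary


# Made-up rows in the shape run_task returns.
sample = [
    {"model": "kimi", "task": "t1", "latency_s": 4.0, "tokens_out": 100, "passed": True},
    {"model": "kimi", "task": "t2", "latency_s": 8.0, "tokens_out": 300, "passed": False},
]
print(summarize(sample))
```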

Claude is still the best at the hard tasks, but the gap is small: DeepSeek V4 Pro trails it by 4 points and Kimi K2.6 by 7. For routine work, which is most of what an agent does, the local models are roughly 35 to 50 times cheaper at 82 to 90% of Claude's pass rate. That changes the economics of agent fleets. It changes what you can afford to run as a background process. It changes what experiments you can run before they have to justify themselves.
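The cost and quality ratios can be recomputed directly from the table. The sketch below copies the table's figures and also includes a simple amortization formula for local runs. The `concurrency` parameter is an assumption on my part: the text gives the $2/hr rate but not how many requests were served in parallel, so it defaults to strictly sequential.

```python
# Figures copied from the table above: (pass rate, USD per 1k tasks).
TABLE = {
    "Claude Sonnet 4.6": (0.71, 48.00),
    "Kimi K2.6 (local)": (0.64, 1.20),
    "GLM-5.1 (local)": (0.61, 0.90),
    "MiniMax M2.7 (local)": (0.58, 1.40),
}


def compare(baseline: str, candidate: str) -> tuple:
    """Return (cost ratio, relative pass rate) of candidate versus baseline."""
    base_pass, base_cost = TABLE[baseline]
    cand_pass, cand_cost = TABLE[candidate]
    return base_cost / cand_cost, cand_pass / base_pass


def amortized_cost_per_1k(latency_s: float, gpu_usd_per_hour: float,
                          concurrency: float = 1.0) -> float:
    """GPU rental cost for 1,000 tasks; concurrency = tasks served in parallel."""
    gpu_hours = latency_s * 1000 / 3600 / concurrency
    return gpu_hours * gpu_usd_per_hour


for name in TABLE:
    if name != "Claude Sonnet 4.6":
        ratio, rel = compare("Claude Sonnet 4.6", name)
        print(f"{name}: {ratio:.0f}x cheaper, {rel:.0%} of Claude's pass rate")
```

Per-task GPU cost is sensitive to how many requests share the card, so it is worth recomputing with your own latency and batching numbers.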

Balance scale weighing local versus hosted costs, watercolor illustration

When to actually use this

Self-hosting the Chinese stack is not a magic move. It is a tradeoff. You give up the convenience of API billing, the operational simplicity of someone else handling capacity, and the cutting-edge capability on the hardest tasks. You get cost reductions large enough to enable workloads that did not pencil out before.

The teams getting the most from this setup are running agents continuously, doing batch refactoring across large codebases, generating thousands of tests, and processing internal codebases they do not want to send to US providers. If you are running a developer tool product, or building internal automation that touches a lot of code, the math is hard to ignore.

For day-to-day pair programming, Claude is still where I start. For everything else that runs at scale, I am increasingly running local first and falling back to the hosted models only when I see a quality regression.
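One way to make "local first, hosted as fallback" concrete is to gate each task on its own tests. This is a sketch of one possible interpretation, with per-task test failure standing in for a quality regression. The function and parameter names are hypothetical, and the three callables are supplied by the caller.

```python
from typing import Callable


def solve_with_fallback(task: dict,
                        local_solve: Callable[[dict], str],
                        hosted_solve: Callable[[dict], str],
                        passes: Callable[[dict, str], bool]) -> dict:
    """Try the local model first; call the hosted model only if tests fail."""
    code = local_solve(task)
    if passes(task, code):
        return {"task": task["id"], "route": "local", "code": code, "passed": True}
    # Local output failed its tests, so pay for one hosted attempt.
    code = hosted_solve(task)
    return {"task": task["id"], "route": "hosted", "code": code,
            "passed": passes(task, code)}
```

Logging the `route` field over time shows how often the fallback fires, which is the number that tells you whether the local model is holding up on your workload.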

The 12-day shift

Twelve days is not enough time to fully assess four major models. It is enough to notice that the open coding stack just became plausible for serious work. The labs that shipped these did not ship marginal improvements. They shipped models that hold their own against the frontier on the work we actually do. The friction now is operational, not capability-based, and operational friction always falls.

The teams that set this up in the next quarter will know, by hard data on their own workloads, exactly when to use what. The teams that wait will be making decisions based on hype and benchmarks that do not match their actual code. One of those teams is going to win the cost argument with finance. The other will be paying API bills they did not need to pay.

Gloss What This Means For You

If you have an H100/A100 80GB (or a strong multi-GPU workstation), you can stand up these models locally with vLLM and compare them head-to-head on a small, realistic coding-task harness before committing to expensive API defaults. Start by quantizing to 4-bit, expose each model behind an OpenAI-compatible endpoint, and measure pass rates plus latency on the kinds of bugfix/refactor/test-writing tasks your team actually does. For DeepSeek V4, plan on using Ollama Cloud unless you’re ready for a multi-GPU deployment, and use the results to decide which models to route to locally versus via hosted APIs for cost and performance.