Current Local LLM Favorites

June 05, 2026

An AI-generated engineering sketch of a gaming PC running a local AI agent.

On June 1st, GitHub moved to usage-based billing for its Copilot services.

This comes on the heels of Anthropic banning third-party clients, Microsoft cancelling internal Claude subscriptions, and providers migrating to more opaque pricing schemes:

*The usage limits for local messages and cloud tasks share a five-hour window. Additional weekly limits may apply. For Enterprise/Edu users, there are no fixed rate limits - usage scales with credits. Enterprise and Edu plans without flexible pricing have the same per-seat usage limits as Plus for most features.

The Time for the Local Model is Now

I prefer to run local models for most of my workloads and reach for remote models only where necessary. I already encourage technology democratization and sovereignty wherever possible, and the FinOps benefits of local AI tooling are growing rapidly relative to the high-end SOTA models.

Based on my current electricity rate of $0.19 per kWh, I can generate as many tokens as my hardware can process for $1.48 a day, assuming an uninterrupted daily workload at 100% utilization. (I can guarantee this doesn't happen.) This conveniently ignores the price of hardware today, but if you have some capable hardware on hand, what are you waiting for?

At my highest monthly utilization, I managed to push ~$0.74/day. Not bad, and far cheaper than Claude Opus.
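The arithmetic behind those two figures can be sketched in a few lines. Note that the article never states a wattage: the ~325 W average draw below is an assumption, back-calculated from $1.48/day at $0.19/kWh, and the function names are purely illustrative.

```javascript
// Rate taken from the article; the wattage is NOT from the article.
const RATE_USD_PER_KWH = 0.19;
// Assumption: average wall draw implied by $1.48/day at the rate above.
const ASSUMED_DRAW_WATTS = 325;

// Daily electricity cost for a machine drawing `watts` for a fraction
// (`utilization`, 0..1) of a 24-hour day.
function dailyCostUsd(watts, utilization, ratePerKwh) {
  const kwhPerDay = (watts / 1000) * 24 * utilization;
  return kwhPerDay * ratePerKwh;
}

// Inverse: the average draw implied by a given daily cost.
function impliedWatts(costPerDay, ratePerKwh) {
  return (costPerDay / ratePerKwh / 24) * 1000;
}

console.log(dailyCostUsd(ASSUMED_DRAW_WATTS, 1.0, RATE_USD_PER_KWH).toFixed(2)); // 1.48
console.log(dailyCostUsd(ASSUMED_DRAW_WATTS, 0.5, RATE_USD_PER_KWH).toFixed(2)); // 0.74
console.log(Math.round(impliedWatts(1.48, RATE_USD_PER_KWH))); // 325
```

Under that assumed draw, ~$0.74/day corresponds to roughly 50% utilization. Swap in your own rate and measured wattage to get your own ceiling.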

Tempering Expectations

Any model worth running on the edge comes with compromises. Estimates place ChatGPT 5.5 at ~1.5 trillion parameters and most likely at bf16 or greater precision. You're not going to run that at home. (Or if you are, contact me so I can be gifted your hand-me-down parts.)

The models below come in at ~30 billion parameters or fewer and run at Q4 (4-bit) quantization, which is typical for the local model scene.

One-shot solutions will be fewer and further between. You may have to prompt the model to continue working more often, you may have to correct it more often, and context management is a must. Even so, you can still be productive with local models.
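To see why parameter count and precision matter so much, here is a rough sketch of the memory needed just to hold a model's weights. It assumes a flat bits-per-weight figure; real Q4 formats store extra scaling metadata and land somewhat above 4 bits per weight, and the context cache and runtime overhead come on top. The function name is illustrative.

```javascript
// Approximate size of the weights alone, in decimal gigabytes.
// Assumption: every weight uses exactly `bitsPerWeight` bits.
function weightGigabytes(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9;
}

console.log(weightGigabytes(30, 4));    // 15   (a ~30b model at 4-bit)
console.log(weightGigabytes(30, 16));   // 60   (the same model at bf16)
console.log(weightGigabytes(1500, 16)); // 3000 (a ~1.5 trillion parameter model at bf16)
```

By this estimate, a ~30b model at 4-bit fits within reach of a well-equipped desktop, while a 1.5 trillion parameter model at bf16 would need on the order of 3 TB for weights alone.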

Current Favorites

Let's meet the contenders.

Model	Type	Intelligence Index	Release Date
GLM 4.7 Flash 30b a3b	MoE	30	2026-01-19
Qwen3.6 27b	Dense	46	2026-04-22
Qwen3.6 35b a3b	MoE	43	2026-04-22
Gemma 4 26b a4b	MoE	31	2026-04-16
Gemma 4 31b	Dense	39	2026-04-16
Gemma 4 12b	Dense	N/A	2026-06-03
Gemma 4 12b QAT	Dense	N/A	2026-06-05

(Gemma 4 12b was still being evaluated at the time of writing, hence the N/A scores.)

The intelligence index above is the score from the Artificial Analysis Intelligence Index, which is a composite score of multiple evaluations. Claude Opus 4.8 with adaptive reasoning and max effort currently holds the highest score at 61.

Local models can reach roughly half to three-quarters of that score, which is still surprisingly capable - and far more economical.

Most large language models in use are standard dense models, where each token must pass through the full model before being output. MoE (mixture of experts) models are different - each token only activates a subset of the full network of parameters. This is useful for systems with limited VRAM, where inference can fall back to the CPU and slow performance significantly. The fewer active parameters a model has (denoted by the a*b nomenclature above, e.g. a3b for 3 billion active), the faster it will perform.
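As a small illustration of the naming scheme, the sketch below pulls the total and active parameter counts out of model names written the way the table above writes them. The parser and its name are hypothetical helpers for this post, not part of any tool, and they assume the "NNb aNb" pattern holds.

```javascript
// Hypothetical helper: reads "30b" as total parameters and "a3b" as active
// parameters (both in billions). Dense models have no "aNb" suffix, so all
// of their parameters count as active.
function parseModelSize(name) {
  const total = name.match(/\b(\d+(?:\.\d+)?)b\b/i);
  const active = name.match(/\ba(\d+(?:\.\d+)?)b\b/i);
  if (!total) return null;
  const totalB = Number(total[1]);
  const activeB = active ? Number(active[1]) : totalB;
  return { totalB, activeB, activeFraction: activeB / totalB };
}

for (const name of ["GLM 4.7 Flash 30b a3b", "Qwen3.6 27b", "Gemma 4 26b a4b"]) {
  const size = parseModelSize(name);
  const percent = (size.activeFraction * 100).toFixed(0);
  console.log(`${name}: ${size.activeB}b of ${size.totalB}b active per token (${percent}%)`);
}
```

The MoE entries touch only a fraction of their weights per token, which is where their speed advantage comes from, while a dense model touches all of them.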

Notes on Context

Before running these models locally, please understand how context windows work and how to set your maximum context window tuned to your hardware.

The higher the window, the more memory will be consumed.

Ollama: The default context window is 4096 tokens. This can be altered by setting the OLLAMA_CONTEXT_LENGTH environment variable.
Llama.cpp: The default context window is 512 tokens. This can be altered by passing the -c/--ctx-size parameter on the command line.
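Ollama's REST API also accepts a context size per request through options.num_ctx, which can be handy when different tasks need different windows. A minimal sketch, assuming Ollama is listening on its default local port and using a placeholder model tag that you would replace with one you have pulled:

```javascript
const OLLAMA_URL = "http://localhost:11434/api/generate"; // default local endpoint
const MODEL_TAG = "your-model-tag"; // placeholder, not a real model name

// Builds the JSON body for a non-streaming generate call with an explicit
// context window.
function buildGenerateRequest(prompt, contextTokens) {
  return {
    model: MODEL_TAG,
    prompt,
    stream: false,
    options: { num_ctx: contextTokens },
  };
}

// Sends the request. Needs a running Ollama server, so it is defined here
// but not called. Relies on the global fetch available in Node.js 18+.
async function generate(prompt, contextTokens) {
  const response = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildGenerateRequest(prompt, contextTokens)),
  });
  if (!response.ok) throw new Error(`Ollama returned HTTP ${response.status}`);
  const data = await response.json();
  return data.response;
}

console.log(JSON.stringify(buildGenerateRequest("Summarize this function.", 16384), null, 2));
```

Only the request body is printed here; call generate() yourself once a server and model are available.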

I found the following context windows suitable for these use cases:

Context Size	Use Case
4096	Editing or creating multiple functions or short classes.
16384	Editing or creating large classes. Basic agentic flows are feasible.
32768	Refactoring tasks become attainable. More useful agentic flows feasible.
65536	Larger refactoring tasks become attainable. Complex agentic flows are feasible.
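To put rough numbers on how memory grows with the window, the sketch below estimates the key/value cache size at each context size from the table. The architecture values are hypothetical, chosen only for illustration, and the formula assumes plain attention with an unquantized fp16 cache. Models that use sliding windows or cache quantization will need less.

```javascript
// Hypothetical architecture values, NOT taken from any model in this post.
// Read the real ones from your model's metadata.
const ASSUMED_MODEL = {
  layers: 48,
  kvHeads: 8,
  headDim: 128,
  bytesPerElement: 2, // fp16 cache entries
};

// The cache stores one key and one value vector per layer, per KV head,
// for every token in the window, so it grows linearly with context size.
function kvCacheBytes(model, contextTokens) {
  const perToken =
    2 * model.layers * model.kvHeads * model.headDim * model.bytesPerElement;
  return perToken * contextTokens;
}

for (const ctx of [4096, 16384, 32768, 65536]) {
  const gib = kvCacheBytes(ASSUMED_MODEL, ctx) / 2 ** 30;
  console.log(`${ctx} tokens -> ${gib.toFixed(2)} GiB of KV cache`);
}
// 4096 tokens -> 0.75 GiB of KV cache
// 16384 tokens -> 3.00 GiB of KV cache
// 32768 tokens -> 6.00 GiB of KV cache
// 65536 tokens -> 12.00 GiB of KV cache
```

Under these assumptions the cache alone goes from under 1 GiB to 12 GiB across the table, on top of the model weights.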

Subjective Observations

Subjective observations can be quite...well, subjective. There are as many use cases as there are people, and my observations may not align with everyone else's. In other words, your mileage may vary.

GLM 4.7 Flash

GLM 4.7 Flash has become my daily workhorse. Its training for agentic workflows allows it to chain 50 or more tool calls without interruption. It can handle large-context tasks: I used it to successfully refactor a monolithic architectural service into several smaller single-responsibility services with little guidance, and to refactor the service container to use them.

Unfortunately, it does tend to struggle with complex data structures, such as ones requiring recursion. Fortunately, most of the business software I work on day-to-day doesn't require that level of complexity.

Qwen3.6 Series

The Qwen3.6 series is considered about as good as it gets for local software development today - the 27b parameter model especially. MoE models vastly outperform their dense siblings in tokens per second on equivalent hardware, but they do lose some points in intelligence as seen above.

Qwen3.6 has succeeded with more complex data structures and object-oriented patterns where GLM 4.7 Flash has fallen short, and it is surprisingly intuitive. It tends to make fewer sequential tool calls, however.

This model series also supports vision input, which can be useful in additional contexts.

Gemma 4 Series

I was excited when Google finally released the Gemma 4 series. Gemma 3 seemed like a promising series but wasn't trained with tool-calling capabilities. The Gemma 4 series rectifies this.

I had mixed results with this series. While these models perform similarly to Qwen3.6 in most tasks, occasionally they end up in an infinite reasoning loop. Supposedly this can be fixed, but even after I applied the recommended parameter adjustments, I was still able to reliably get it looping.

Newer distributions of the model have been released since I last used it heavily, so this may no longer be a concern.

The 12b variation in this series is newer and contains a very interesting unified architecture for audio, images, and text (which all models in this series can process). I enjoyed giving it screenshots of websites and having it recreate them with no additional context. This 12b variant is designed to run on a single consumer GPU (~12-16 GB of VRAM), and Google promises similar intelligence to Gemma 4 26b.

On June 5th, a QAT (quantization-aware training) version of Gemma 4 12b was released. QAT is a process that further improves the quality of highly quantized models.

Objective Results

Let's try some objective benchmarks for these models.

Project Euler: Problem #1

This is a short, practical demonstration. I won't publish the correct value, but I will share whether the solution functions correctly or not.

Multiples of 3 or 5

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
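One way to check a model's script without trusting it blindly is to compare it against an independent method. The sketch below uses the arithmetic-series formula with inclusion-exclusion (multiples of 15 are counted under both 3 and 5, so they are subtracted once) and cross-checks it against a plain loop. This is my own verification sketch, not something any of the models produced, and it deliberately prints only the worked example's value.

```javascript
// Sum of the positive multiples of k strictly below limit, using
// k * (1 + 2 + ... + n) where n = floor((limit - 1) / k).
function sumOfMultiplesBelow(k, limit) {
  const n = Math.floor((limit - 1) / k);
  return (k * n * (n + 1)) / 2;
}

// Inclusion-exclusion over multiples of 3, 5, and 15.
function closedForm(limit) {
  return (
    sumOfMultiplesBelow(3, limit) +
    sumOfMultiplesBelow(5, limit) -
    sumOfMultiplesBelow(15, limit)
  );
}

// The straightforward loop most of the models below arrive at.
function bruteForce(limit) {
  let sum = 0;
  for (let i = 1; i < limit; i++) {
    if (i % 3 === 0 || i % 5 === 0) sum += i;
  }
  return sum;
}

console.log(closedForm(10)); // 23, matching the worked example
console.log(closedForm(1000) === bruteForce(1000)); // true
```

If the two methods agree at several limits, including the worked example, a model's loop-based answer for 1000 can be trusted with reasonable confidence.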

Prompt

Solve https://projecteuler.net/problem=1 in JavaScript and run it using Node.js.

Yes, this means each model must:

Access the web page using an agentic harness
Locate the problem and properly interpret it within the HTML returned
Create the script
Execute it using the installed node binary
And calculate the correct result

GLM 4.7 Flash 30b a3b

Iterations Until Correct	Time (Cold Start)
1	26.7s

Output

function sumMultiplesOf3Or5(max) {
  let sum = 0;
  for (let i = 1; i < max; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  return sum;
}

const result = sumMultiplesOf3Or5(1000);
console.log(result);

Qwen3.6 27b

Iterations Until Correct	Time (Cold Start)
1	1m 27s

Output

let sum = 0;
for (let i = 1; i < 1000; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}
console.log(sum);

Qwen3.6 35b a3b

Iterations Until Correct	Time (Cold Start)
1	1m 00s

Output

// Project Euler Problem 1: Sum of all multiples of 3 or 5 below 1000
const LIMIT = 1000;
let sum = 0;

for (let i = 1; i < LIMIT; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}

console.log(sum);

Gemma 4 26b a4b

Iterations Until Correct	Time (Cold Start)
1	1m 04s

Output

const sumMultiples = (limit) => {
  let sum = 0;
  for (let i = 1; i < limit; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  return sum;
};

const limit = 1000;
console.log(sumMultiples(limit));

Gemma 4 31b

Iterations Until Correct	Time (Cold Start)
1	1m 39s

Output

const limit = 1000;
let sum = 0;

for (let i = 1; i < limit; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}

console.log(sum);

Gemma 4 12b

Iterations Until Correct	Time (Cold Start)
1	24.1s

Output

function solve() {
  let sum = 0;
  for (let i = 1; i < 1000; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  console.log(sum);
}

solve();

Gemma 4 12b QAT

Iterations Until Correct	Time (Cold Start)
1	20.8s

Output

function solve() {
  let sum = 0;
  for (let i = 1; i < 1000; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  console.log(sum);
}

solve();