
Current Local LLM Favorites

June 05, 2026

An AI-generated engineering sketch of a gaming PC running a local AI agent.

On June 1st, GitHub moved to usage-based billing for its Copilot services.

This comes on the heels of Anthropic banning third-party clients, Microsoft cancelling internal Claude subscriptions, and providers migrating to more opaque pricing schemes:

"The usage limits for local messages and cloud tasks share a five-hour window. Additional weekly limits may apply. For Enterprise/Edu users, there are no fixed rate limits - usage scales with credits. Enterprise and Edu plans without flexible pricing have the same per-seat usage limits as Plus for most features."

The Time for the Local Model is Now

I prefer to use local models for most of my workloads and build up to remote models where necessary. I already encourage technology democratization and sovereignty where possible, and the FinOps benefits of local AI tooling are growing rapidly relative to high-end SOTA models.

Based on my current electric billing at $0.19 per kWh, I can get as many tokens as I can process for $1.48 a day, assuming an uninterrupted daily workload at 100% utilization. (I can guarantee this doesn't happen.) This conveniently ignores the price of hardware today, but if you have some capable hardware on hand, what are you waiting for?

At my highest monthly utilization, I managed to push ~$0.74/day. Not bad, and far cheaper than Claude Opus.
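The arithmetic behind these figures can be sketched as follows. The average power draw is not stated above; the 325 W used here is an assumption back-computed from $1.48 per day at $0.19 per kWh, and the function name is hypothetical.

```javascript
// Daily electricity cost for a machine running local inference.
// ASSUMPTION: 325 W is back-computed from the figures above
// ($1.48/day at $0.19/kWh is roughly 7.8 kWh/day); it is not a measured value.
function dailyCostUSD(averageWatts, pricePerKWh, utilization = 1.0) {
  const kWhPerDay = (averageWatts / 1000) * 24 * utilization;
  return kWhPerDay * pricePerKWh;
}

const PRICE_PER_KWH = 0.19; // electric rate quoted above
const ASSUMED_WATTS = 325;  // assumption, see the comment above

console.log(dailyCostUSD(ASSUMED_WATTS, PRICE_PER_KWH).toFixed(2));      // 1.48
console.log(dailyCostUSD(ASSUMED_WATTS, PRICE_PER_KWH, 0.5).toFixed(2)); // 0.74
```

Under that assumed draw, the ~$0.74/day peak corresponds to roughly 50% utilization averaged over the month.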

Tempering Expectations

Any model worth running on the edge comes with compromises. Estimates place ChatGPT 5.5 at ~1.5 trillion parameters and most likely at bf16 or greater precision. You're not going to run that at home. (Or if you are, contact me so I can be gifted your hand-me-down parts.)

The models below are distilled to ~30 billion parameters or fewer at Q4 (4-bit) quantization, which is typical for the local model scene.

One-shot solutions will be fewer and farther between. You may have to prompt the model to continue working more often, correct it more often, and manage context carefully, but you can still be productive with local models.

Current Favorites

Let's meet the contenders.

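| Model | Type | Intelligence Index | Release Date |
| --- | --- | --- | --- |
| GLM 4.7 Flash 30b a3b | MoE | 30 | 2026-01-19 |
| Qwen3.6 27b | Dense | 46 | 2026-04-22 |
| Qwen3.6 35b a3b | MoE | 43 | 2026-04-22 |
| Gemma 4 26b a4b | MoE | 31 | 2026-04-16 |
| Gemma 4 31b | Dense | 39 | 2026-04-16 |
| Gemma 4 12b | Dense | N/A | 2026-06-03 |
| Gemma 4 12b QAT | Dense | N/A | 2026-06-05 |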

(Gemma 4 12b is still being evaluated at the time of writing.)

The intelligence index above is the score from the Artificial Analysis Intelligence Index, which is a composite score of multiple evaluations. Claude Opus 4.8 with adaptive reasoning and max effort currently holds the highest score at 61.

Local models reach roughly half to three-quarters of that score, which is still surprisingly capable and far more economical.

Most large language models in use are standard dense models, where every token passes through the full set of parameters before being output. MoE (mixture of experts) models are different: each token activates only a subset of the network's parameters. This is useful on systems with limited VRAM, where inference can fall back to the CPU and slow down significantly. The fewer active parameters (denoted by the a*b nomenclature above), the faster the model will perform.
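As a rough illustration of "only a subset activates," here is a toy sketch of top-k expert routing. It is not the architecture of any model listed above; every name and number in it is hypothetical, and real MoE layers route inside each transformer block rather than over scalar functions.

```javascript
// Toy mixture-of-experts routing: a router scores every expert for a token,
// but only the top-k experts actually run.
// ASSUMPTION: all names and values are illustrative only.
function topKExperts(routerScores, k) {
  return routerScores
    .map((score, index) => ({ index, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.index);
}

function moeForward(token, experts, routerScores, k) {
  const active = topKExperts(routerScores, k);
  // Softmax over the selected experts' scores only.
  const exps = active.map((i) => Math.exp(routerScores[i]));
  const total = exps.reduce((a, b) => a + b, 0);
  // Weighted sum of the active experts' outputs; inactive experts never run.
  const output = active.reduce(
    (acc, i, n) => acc + (exps[n] / total) * experts[i](token),
    0
  );
  return { output, active };
}

// Eight tiny "experts"; each is just a scalar function here.
const experts = Array.from({ length: 8 }, (_, i) => (x) => x * (i + 1));
const routerScores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2];

const result = moeForward(1.0, experts, routerScores, 2);
console.log(result.active); // [ 1, 4 ]
console.log(result.output);
```

With 2 of 8 experts active, only a quarter of the expert weights are touched per token. That is the intuition behind a name like 30b a3b: about 30 billion total parameters, about 3 billion active.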

Notes on Context

Before running these models locally, please understand how context windows work and how to set your maximum context window tuned to your hardware.

The higher the window, the more memory will be consumed.

  • Ollama: The default context window is 4096 tokens. This can be altered by setting OLLAMA_CONTEXT_LENGTH.
  • Llama.cpp: The default context window is 512 tokens. This can be altered by setting the -c/--ctx-size parameter at the command line.
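To make the memory cost of a larger window concrete, here is a back-of-the-envelope estimate of key/value cache size for a standard full-attention transformer. The layer count, KV head count, and head dimension are hypothetical placeholders, not the specifications of any model discussed here, and models with sliding-window attention or a quantized cache will use less.

```javascript
// Rough KV cache size: keys and values (2x) are stored for every layer,
// every KV head, and every token in the context window.
// ASSUMPTION: the hyperparameters below are made up for illustration.
function kvCacheBytes({ layers, kvHeads, headDim, contextTokens, bytesPerElement = 2 }) {
  return 2 * layers * kvHeads * headDim * contextTokens * bytesPerElement;
}

const hypotheticalModel = { layers: 48, kvHeads: 8, headDim: 128 }; // fp16 cache by default

for (const contextTokens of [4096, 16384, 32768, 65536]) {
  const gib = kvCacheBytes({ ...hypotheticalModel, contextTokens }) / 2 ** 30;
  console.log(`${contextTokens} tokens -> ${gib.toFixed(2)} GiB`);
}
// 4096 tokens -> 0.75 GiB
// 16384 tokens -> 3.00 GiB
// 32768 tokens -> 6.00 GiB
// 65536 tokens -> 12.00 GiB
```

The cache grows linearly with the window, and it comes on top of the model weights themselves.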

I found the following context windows suitable for these use cases:

| Context Size | Use Case |
| --- | --- |
| 4096 | Editing or creating multiple functions or short classes. |
| 16384 | Editing or creating large classes. Basic agentic flows are feasible. |
| 32768 | Refactoring tasks become attainable. More useful agentic flows are feasible. |
| 65536 | Larger refactoring tasks become attainable. Complex agentic flows are feasible. |
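One way to apply these sizes per request instead of globally is the `num_ctx` option in Ollama's HTTP API. The sketch below only builds the request body; the task labels, the helper name, and the model tag are hypothetical, and the commented-out call assumes an Ollama server on its default local port.

```javascript
// Context sizes from the table above, keyed by hypothetical task labels.
const CONTEXT_BY_TASK = {
  "small-edit": 4096,
  "large-class": 16384,
  "refactor": 32768,
  "large-refactor": 65536,
};

// Builds a request body for Ollama's /api/chat endpoint.
// ASSUMPTION: "example-model:latest" is a placeholder tag, not a real model name.
function buildChatRequest(model, task, prompt) {
  const numCtx = CONTEXT_BY_TASK[task];
  if (numCtx === undefined) throw new Error(`Unknown task: ${task}`);
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    stream: false,
    options: { num_ctx: numCtx }, // per-request context window
  };
}

const body = buildChatRequest(
  "example-model:latest",
  "refactor",
  "Split this service into smaller single-responsibility services."
);
console.log(JSON.stringify(body, null, 2));

// Sending it would look roughly like this:
// await fetch("http://localhost:11434/api/chat", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```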

Subjective Observations

Subjective observations can be quite...well, subjective. There are as many use cases as there are people, and my observations may not align with everyone else's. In other words, your mileage may vary.

GLM 4.7 Flash

GLM 4.7 Flash has become my daily workhorse. Its training for agentic workflows allows it to chain 50 or more tool calls without interruption. It can handle large-context tasks: with little guidance, it successfully refactored a monolithic architectural service into several smaller single-responsibility services and then updated the service container to use them.

Unfortunately, it does tend to struggle with complex data structures, such as ones requiring recursion. Fortunately, most of the business software I work on day-to-day doesn't require that level of complexity.

Qwen3.6 Series

The Qwen3.6 series is considered about as good as it gets for local software development today - the 27b parameter model especially. MoE models vastly outperform their dense siblings in tokens per second on equivalent hardware, but they do lose some points in intelligence as seen above.

Qwen3.6 has handled more complex data structures and object-oriented patterns where GLM 4.7 Flash fell short, and it is surprisingly intuitive. It tends to chain fewer tool calls in sequence, however.

This model series also supports vision input, which can be useful in additional contexts.

Gemma 4 Series

I was excited when Google finally released the Gemma 4 series. Gemma 3 seemed like a promising series but wasn't trained with tool-calling capabilities. The Gemma 4 series rectifies this.

I had mixed results with this series. While these models perform similarly to Qwen3.6 in most tasks, occasionally they end up in an infinite reasoning loop. Supposedly this can be fixed, but even after I applied the recommended parameter adjustments, I was still able to reliably get it looping.

Newer distributions of the model have been released since I last used it heavily, so this may no longer be a concern.

The 12b variation in this series is newer and contains a very interesting unified architecture for audio, images, and text (which all models in this series can process). I enjoyed giving it screenshots of web sites and having it recreate them with no additional context. This 12b variant is designed to run on a single consumer GPU (~12-16 GB of VRAM), and Google promises similar intelligence to Gemma 4 26b.

On June 5th, a QAT (quantization-aware training) version of Gemma 4 12b was released. QAT is a process that further improves the quality of highly quantized models.

Objective Results

Let's try some objective benchmarks for these models. Gemma 4 12b QAT isn't included at this time due to limitations in my inference environment.

Project Euler: Problem #1

This is a short, practical demonstration. I won't publish the correct value, but I will share whether the solution functions correctly or not.

Multiples of 3 or 5

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Prompt

Solve https://projecteuler.net/problem=1 in JavaScript and run it using Node.js.

Yes, this means each model must:

  1. Access the web page using an agentic harness
  2. Locate the problem and properly interpret it within the HTML returned
  3. Create the script
  4. Execute it using the installed node binary
  5. And calculate the correct result
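For checking step 5 independently of any model's loop, a closed-form version based on arithmetic series and inclusion-exclusion can serve as a reference. This is a sketch with hypothetical function names, not one of the model outputs; it is verified here only against the worked example from the problem statement.

```javascript
// Sum of the multiples of k strictly below limit, using the arithmetic series
// k * (1 + 2 + ... + n), where n = floor((limit - 1) / k).
function sumOfMultiplesBelow(k, limit) {
  const n = Math.floor((limit - 1) / k);
  return (k * n * (n + 1)) / 2;
}

// Inclusion-exclusion: multiples of 15 are counted under both 3 and 5,
// so they are subtracted once.
function sumMultiplesOf3Or5Below(limit) {
  return (
    sumOfMultiplesBelow(3, limit) +
    sumOfMultiplesBelow(5, limit) -
    sumOfMultiplesBelow(15, limit)
  );
}

console.log(sumMultiplesOf3Or5Below(10)); // 23, matching the worked example
```

A model's brute-force result for 1000 can then be compared against `sumMultiplesOf3Or5Below(1000)` without publishing the value.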

GLM 4.7 Flash a3b

| Iterations Until Correct | Time (Cold Start) |
| --- | --- |
| 1 | 26.7s |
Output
function sumMultiplesOf3Or5(max) {
  let sum = 0;
  for (let i = 1; i < max; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  return sum;
}

const result = sumMultiplesOf3Or5(1000);
console.log(result);

Qwen3.6 27b

| Iterations Until Correct | Time (Cold Start) |
| --- | --- |
| 1 | 1m 27s |
Output
let sum = 0;
for (let i = 1; i < 1000; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}
console.log(sum);

Qwen3.6 35b a3b

| Iterations Until Correct | Time (Cold Start) |
| --- | --- |
| 1 | 1m 00s |
Output
// Project Euler Problem 1: Sum of all multiples of 3 or 5 below 1000
const LIMIT = 1000;
let sum = 0;

for (let i = 1; i < LIMIT; i++) {
 if (i % 3 === 0 || i % 5 === 0) {
  sum += i;
 }
}

console.log(sum);

Gemma 4 26b a4b

| Iterations Until Correct | Time (Cold Start) |
| --- | --- |
| 1 | 1m 04s |
Output
const sumMultiples = (limit) => {
  let sum = 0;
  for (let i = 1; i < limit; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  return sum;
};

const limit = 1000;
console.log(sumMultiples(limit));

Gemma 4 31b

| Iterations Until Correct | Time (Cold Start) |
| --- | --- |
| 1 | 1m 39s |
Output
const limit = 1000;
let sum = 0;

for (let i = 1; i < limit; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}

console.log(sum);

Gemma 4 12b

| Iterations Until Correct | Time (Cold Start) |
| --- | --- |
| 1 | 24.1s |
Output
function solve() {
  let sum = 0;
  for (let i = 1; i < 1000; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  console.log(sum);
}

solve();

A Pelican Riding a Bicycle

Let's close out with a fun one.

Yes, this is a real benchmark.

Results are my subjective judgment of the best of three attempts per model.

Prompt

Draw three separate SVG files of a pelican riding a bicycle.

GLM 4.7 Flash a3b

Note: The best of the three was also the only one of the three that worked.

Qwen3.6 27b

Note: Qwen3.6 27b promised three images with distinct styles, but they were effectively recolors of each other.

Qwen3.6 35b a3b

Note: Qwen3.6 35b attempted to show the same pelican from different angles. Only the side angle was coherent.

Gemma 4 26b a4b

Note: Gemma 4 26b refused twice to actually create the files using the agentic harness.

Gemma 4 31b

Note: Debatably the other two options were more detailed yet distorted, but this one has a stylistic charm.

Gemma 4 12b

Note: On the first attempt, none of the SVG files were valid. The second attempt yielded three results, but this is as good as it got.

A Balanced Perspective

Today's hardware costs are brutal, but for the developer who already owns capable hardware, local models are trailing just behind SOTA models and already surpass former frontier models.

The price of capability with technology is always a little bit of knowledge and time, but if the alternative is a monthly $200 bill or more, can you blame developers for looking toward localhost?

Those without recent hardware or an extra GPU or two (for splitting LLM layers across) may have to wait just a little longer for the investment to pay off.


This article was proofread by Qwen3.6 27b.