On June 1st, GitHub moved to usage-based billing for its Copilot services.
This comes on the heels of Anthropic banning third-party clients, Microsoft cancelling internal Claude subscriptions, and providers migrating to more opaque pricing schemes:
*The usage limits for local messages and cloud tasks share a five-hour window. Additional weekly limits may apply. For Enterprise/Edu users, there are no fixed rate limits—usage scales with credits. Enterprise and Edu plans without flexible pricing have the same per-seat usage limits as Plus for most features
The Time for the Local Model is Now
I prefer to use local models for most of my workloads and build up to remote models where necessary. I already encourage technology democratization and sovereignty where possible, and the FinOps benefits of local AI tooling are growing rapidly relative to high-end SOTA models.
Based on my current electric billing at $0.19 per kWh, I can get as many tokens as I can process for $1.48 a day, assuming an uninterrupted daily workload at 100% utilization. (I can guarantee this doesn't happen.) This conveniently ignores the price of hardware today, but if you have some capable hardware on hand, what are you waiting for?
At my highest monthly utilization, I managed to push ~$0.74/day. Not bad, and far cheaper than Claude Opus.
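The arithmetic behind those two figures can be sketched as follows. The article does not state the machine's power draw, so the 0.325 kW value is an assumption back-derived from $1.48/day at $0.19/kWh, and treating the $0.74/day month as 50% utilization is likewise an assumption that ignores idle draw.

```javascript
// Estimate the daily electricity cost of running local inference.
// ASSUMPTION: averagePowerKw is not given in the article; 0.325 kW is the
// draw implied by $1.48/day at $0.19/kWh over 24 hours.
function dailyEnergyCost(averagePowerKw, ratePerKwh, utilization = 1.0) {
  const hoursPerDay = 24;
  return averagePowerKw * hoursPerDay * utilization * ratePerKwh;
}

const ratePerKwh = 0.19;      // from the article
const averagePowerKw = 0.325; // assumed, back-derived from the article's figures

const fullLoad = dailyEnergyCost(averagePowerKw, ratePerKwh);
const halfLoad = dailyEnergyCost(averagePowerKw, ratePerKwh, 0.5);

console.log(`100% utilization: $${fullLoad.toFixed(2)}/day`); // $1.48/day
console.log(`50% utilization: $${halfLoad.toFixed(2)}/day`);  // $0.74/day
```

Swapping in your own rate and measured wall power gives a quick upper bound to compare against a subscription or per-token bill.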
Tempering Expectations
Any model worth running on the edge comes with compromises. Estimates place ChatGPT 5.5 at ~1.5 trillion parameters and most likely at bf16 or greater precision. You're not going to run that at home. (Or if you are, contact me so I can be gifted your hand-me-down parts.)
The models below are distilled to ~30 billion parameters and below at Q4 (4-bit) quantization, which is typical for the local model scene.
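To see why precision matters as much as parameter count, here is a rough back-of-the-envelope sketch of weight size. It assumes a flat number of bits per weight. Real Q4 formats also store scaling data, so actual files are somewhat larger, and the figure excludes the context cache and runtime overhead.

```javascript
// Rough size of model weights at a given precision, in decimal gigabytes.
// ASSUMPTION: a flat bits-per-weight figure; real quantized files carry
// extra scale/metadata and will be somewhat larger than this estimate.
function weightSizeGB(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9;
}

console.log(weightSizeGB(30, 4));    // ~30B parameters at 4-bit
console.log(weightSizeGB(30, 16));   // the same model at bf16
console.log(weightSizeGB(1500, 16)); // the ~1.5T estimate above at bf16
```

Under these assumptions a 30b model drops from roughly 60 GB to roughly 15 GB of weights when quantized to 4 bits, which is what brings it within reach of consumer hardware.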
One-shot solutions will be fewer and further between. You may have to prompt the model to continue working more often and correct it more often, and context management is a must - but you can still be productive with local models.
Current Favorites
Let's meet the contenders.
| Model | Type | Intelligence Index | Release Date |
|---|---|---|---|
| GLM 4.7 Flash 30b a3b | MoE | 30 | 2026-01-19 |
| Qwen3.6 27b | Dense | 46 | 2026-04-22 |
| Qwen3.6 35b a3b | MoE | 43 | 2026-04-22 |
| Gemma 4 26b a4b | MoE | 31 | 2026-04-16 |
| Gemma 4 31b | Dense | 39 | 2026-04-16 |
| Gemma 4 12b | Dense | N/A | 2026-06-03 |
| Gemma 4 12b QAT | Dense | N/A | 2026-06-05 |
(Gemma 4 12b was still being evaluated at the time of writing.)
The intelligence index above is the score from the Artificial Analysis Intelligence Index, which is a composite score of multiple evaluations. Claude Opus 4.8 with adaptive reasoning and max effort currently holds the highest score at 61.
Local models reach roughly half to three-quarters of that score, which is still surprisingly capable - and far more economical.
Most large language models in use are standard dense models, where each token must pass through the full model before being output. MoE (mixture of experts) models are different - each token activates only a subset of the network's parameters. This is useful on systems with limited VRAM, where inference can fall back to the CPU and slow down significantly. The fewer active parameters (denoted by the a*b nomenclature above), the faster the model will run.
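A toy sketch of that routing idea, not any particular model's implementation: a gate scores every expert, but only the top-k experts run for a given token. The experts, gate scores, and k below are all made up for illustration.

```javascript
// Toy mixture-of-experts routing: score all experts, execute only the top k.
// ASSUMPTION: experts, gate scores, and k are hypothetical illustration values.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function routeToken(token, gateScores, experts, k) {
  // Pick the k experts with the highest gate scores.
  const topK = gateScores
    .map((score, index) => ({ score, index }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  // Renormalize the gate weights over the selected experts only.
  const weights = softmax(topK.map((e) => e.score));

  // Only k experts execute; the rest are skipped entirely.
  const output = topK.reduce(
    (sum, e, i) => sum + weights[i] * experts[e.index](token),
    0
  );
  return { output, activeExperts: topK.map((e) => e.index) };
}

const experts = [(x) => x + 1, (x) => x * 2, (x) => x * x, (x) => -x];
const gateScores = [0.1, 2.0, 1.5, -1.0]; // hypothetical router output for one token

const routed = routeToken(3, gateScores, experts, 2);
console.log(routed.activeExperts); // [ 1, 2 ]
console.log(routed.output);
```

With four experts and k = 2, half of the expert computation is skipped per token. An "a3b" model applies the same idea at scale, touching about 3 billion of its parameters for each token.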
Notes on Context
Before running these models locally, please understand how context windows work and how to set your maximum context window tuned to your hardware.
The higher the window, the more memory will be consumed.
- Ollama: The default context window is 4096 tokens. This can be altered by setting OLLAMA_CONTEXT_LENGTH.
- Llama.cpp: The default context window is 512 tokens. This can be altered by setting the -c/--ctx-size parameter at the command line.
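Besides the server-wide settings above, Ollama's HTTP API also accepts a per-request context size through the num_ctx option. The sketch below builds such a request. The model tag is a placeholder, the host and port are Ollama's usual defaults, and the global fetch call assumes Node.js 18 or newer.

```javascript
// Build a request body for Ollama's /api/generate endpoint with a
// per-request context window.
// ASSUMPTIONS: "example-model:latest" is a placeholder tag, and the base URL
// is the default local Ollama address.
function buildGenerateRequest(model, prompt, contextTokens) {
  return {
    model,
    prompt,
    stream: false,
    options: { num_ctx: contextTokens },
  };
}

async function generate(body, baseUrl = "http://localhost:11434") {
  const response = await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!response.ok) throw new Error(`Ollama returned HTTP ${response.status}`);
  const data = await response.json();
  return data.response;
}

const requestBody = buildGenerateRequest("example-model:latest", "Say hello.", 16384);
console.log(JSON.stringify(requestBody, null, 2));
// generate(requestBody).then(console.log); // uncomment with a local Ollama server running
```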
I found the following context windows suitable for these use cases:
| Context Size | Use Case |
|---|---|
| 4096 | Editing or creating multiple functions or short classes. |
| 16384 | Editing or creating large classes. Basic agentic flows are feasible. |
| 32768 | Refactoring tasks become attainable. More useful agentic flows feasible. |
| 65536 | Larger refactoring tasks become attainable. Complex agentic flows are feasible. |
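To get a feel for how memory grows across these window sizes, here is a rough estimate of the key/value cache for a transformer with standard or grouped-query attention. The layer and head counts are hypothetical, not the configuration of any model in this article, and runtimes that quantize the cache or use other attention variants will differ.

```javascript
// Approximate KV-cache size: keys and values are stored for every layer,
// KV head, and token in the context window.
// ASSUMPTION: the architecture numbers below are hypothetical.
function kvCacheBytes({ layers, kvHeads, headDim, contextTokens, bytesPerValue }) {
  const keysAndValues = 2;
  return keysAndValues * layers * kvHeads * headDim * contextTokens * bytesPerValue;
}

const hypotheticalModel = { layers: 48, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

for (const contextTokens of [4096, 16384, 32768, 65536]) {
  const gib = kvCacheBytes({ ...hypotheticalModel, contextTokens }) / 2 ** 30;
  console.log(`${contextTokens} tokens -> ${gib.toFixed(2)} GiB`);
}
// 4096 tokens -> 0.75 GiB
// 16384 tokens -> 3.00 GiB
// 32768 tokens -> 6.00 GiB
// 65536 tokens -> 12.00 GiB
```

The cache grows linearly with the window, and it sits on top of the model weights, which is why the largest windows can be out of reach even when the model itself fits.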
Subjective Observations
Subjective observations can be quite...well, subjective. There are as many use cases as there are people, and my observations may not align with everyone else's. In other words, your mileage may vary.
GLM 4.7 Flash
GLM 4.7 Flash has become my daily workhorse. Its training for agentic workflows allows it to chain 50 or more tool calls without interruption. It can handle large-context tasks, and I used it to successfully refactor a monolithic architectural service into several smaller single-responsibility services with little guidance - and refactored the service container to use them.
Unfortunately, it does tend to struggle with complex data structures, such as ones requiring recursion. Fortunately, most of the business software I work on day-to-day doesn't require that level of complexity.
Qwen3.6 Series
The Qwen3.6 series is considered about as good as it gets for local software development today - the 27b parameter model especially. MoE models vastly outperform their dense siblings in tokens per second on equivalent hardware, but they do lose some points in intelligence as seen above.
Qwen3.6 is surprisingly intuitive and has succeeded with more complex data structures and object-oriented patterns where GLM 4.7 Flash has fallen short. It tends to make fewer sequential tool calls, however.
This model series also supports vision input, which can be useful in additional contexts.
Gemma 4 Series
I was excited when Google finally released the Gemma 4 series. Gemma 3 seemed like a promising series but wasn't trained with tool-calling capabilities. The Gemma 4 series rectifies this.
I had mixed results with this series. While these models perform similarly to Qwen3.6 in most tasks, occasionally they end up in an infinite reasoning loop. Supposedly this can be fixed, but even after I applied the recommended parameter adjustments, I was still able to reliably get it looping.
Newer distributions of the model have been released since I last used it heavily, so this may no longer be a concern.
The 12b variation in this series is newer and contains a very interesting unified architecture for audio, images, and text (which all models in this series can process). I enjoyed giving it screenshots of web sites and having it recreate them with no additional context. This 12b variant is designed to run on a single consumer GPU (~12-16 GB of VRAM), and Google promises similar intelligence to Gemma 4 26b.
On June 5th, a QAT (quantization-aware training) version of Gemma 4 12b was released. QAT is a process that further improves the quality of highly quantized models.
Objective Results
Let's try some objective benchmarks for these models. Gemma 4 12b QAT isn't included at this time due to limitations in my inferencing environment.
Project Euler: Problem #1
This is a short, practical demonstration. I won't publish the correct value, but I will share whether the solution functions correctly or not.
Multiples of 3 or 5
If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6, and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.
Prompt
Solve https://projecteuler.net/problem=1 in JavaScript and run it using Node.js.
Yes, this means each model must:
- Access the web page using an agentic harness
- Locate the problem and properly interpret it within the HTML returned
- Create the script
- Execute it using the installed node binary
- And calculate the correct result
GLM 4.7 Flash a3b
| Iterations Until Correct | Time (Cold Start) |
|---|---|
| 1 | 26.7s |
Output
function sumMultiplesOf3Or5(max) {
  let sum = 0;
  for (let i = 1; i < max; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  return sum;
}
const result = sumMultiplesOf3Or5(1000);
console.log(result);
Qwen3.6 27b
| Iterations Until Correct | Time (Cold Start) |
|---|---|
| 1 | 1m 27s |
Output
let sum = 0;
for (let i = 1; i < 1000; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}
console.log(sum);
Qwen3.6 35b a3b
| Iterations Until Correct | Time (Cold Start) |
|---|---|
| 1 | 1m 00s |
Output
// Project Euler Problem 1: Sum of all multiples of 3 or 5 below 1000
const LIMIT = 1000;
let sum = 0;
for (let i = 1; i < LIMIT; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}
console.log(sum);
Gemma 4 26b a4b
| Iterations Until Correct | Time (Cold Start) |
|---|---|
| 1 | 1m 04s |
Output
const sumMultiples = (limit) => {
  let sum = 0;
  for (let i = 1; i < limit; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  return sum;
};
const limit = 1000;
console.log(sumMultiples(limit));
Gemma 4 31b
| Iterations Until Correct | Time (Cold Start) |
|---|---|
| 1 | 1m 39s |
Output
const limit = 1000;
let sum = 0;
for (let i = 1; i < limit; i++) {
  if (i % 3 === 0 || i % 5 === 0) {
    sum += i;
  }
}
console.log(sum);
Gemma 4 12b
| Iterations Until Correct | Time (Cold Start) |
|---|---|
| 1 | 24.1s |
Output
function solve() {
  let sum = 0;
  for (let i = 1; i < 1000; i++) {
    if (i % 3 === 0 || i % 5 === 0) {
      sum += i;
    }
  }
  console.log(sum);
}
solve();
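All six models settled on the same brute-force loop. As an independent cross-check that does not come from any of the models, the sum can also be computed in closed form with inclusion-exclusion: add the multiples of 3 and of 5, then subtract the multiples of 15 that were counted twice. The function names here are my own, and the script deliberately prints only the small worked example and a comparison, not the answer itself.

```javascript
// Sum of all positive multiples of k strictly below limit, via the
// arithmetic-series formula k * (1 + 2 + ... + count).
function sumOfMultiplesBelow(k, limit) {
  const count = Math.floor((limit - 1) / k);
  return (k * count * (count + 1)) / 2;
}

// Inclusion-exclusion: multiples of 15 are counted under both 3 and 5.
function closedFormSum(limit) {
  return (
    sumOfMultiplesBelow(3, limit) +
    sumOfMultiplesBelow(5, limit) -
    sumOfMultiplesBelow(15, limit)
  );
}

// The same loop the models produced, kept here only for comparison.
function bruteForceSum(limit) {
  let sum = 0;
  for (let i = 1; i < limit; i++) {
    if (i % 3 === 0 || i % 5 === 0) sum += i;
  }
  return sum;
}

console.log(closedFormSum(10)); // 23, matching the worked example in the problem
console.log(closedFormSum(1000) === bruteForceSum(1000)); // true
```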
A Pelican Riding a Bicycle
Let's close out with a fun one.
Yes, this is a real benchmark.
Results are my subjective judgment of the best of three attempts per model.
Prompt
Draw three separate SVG files of a pelican riding a bicycle.
GLM 4.7 Flash a3b
Note: The best of the three was also the only one of the three that worked.
Qwen3.6 27b
Note: Qwen3.6 27b promised three images with distinct styles, but they were effectively recolors of each other.
Qwen3.6 35b a3b
Note: Qwen3.6 35b attempted to show the same pelican from different angles. Only the side angle was coherent.
Gemma 4 26b a4b
Note: Gemma 4 26b refused twice to actually create the files using the agentic harness.
Gemma 4 31b
Note: Debatably the other two options were more detailed yet distorted, but this one has a stylistic charm.
Gemma 4 12b
Note: On the first attempt, none of the SVG files were valid. The second attempt yielded three results, but this is as good as it got.
A Balanced Perspective
Today's hardware costs are brutal, but for the developer who already owns capable hardware, local models are trailing just behind SOTA models and already surpass former frontier models.
The price of capability with technology is always a little bit of knowledge and time, but if the alternative is a monthly $200 bill or more, can you blame developers for looking toward localhost?
Those without recent hardware or an extra GPU or two (to split LLM layers across) may have to wait just a little longer for the investment to pay off.
This article was proofread by Qwen3.6 27b.
