
Running your own AI agent for coding – reality check

2026-04-08

Written by Niklas Siltakorpi, Senior DevOps Specialist


Together with our partners, we at NorthCode set out to understand what is possible with on-prem models. The results were surprisingly good!

Agentic AI coding has revolutionized software development in just a year. We used to ask ChatGPT for code, copy it into the editor, check whether it even ran, and go back and forth multiple times until we ended up with mediocre code for mediocre tasks.

Then came models that were able to use tools—and it has been a joy ever since. The missing pieces earlier were that LLMs were unable to check things themselves (via the internet or by reading code for better context) and unable to test the code they produced.

There are countless ways to solve a software development problem—therefore, there are also countless ways to do it wrong. But now that LLMs can approach problems by actually running and testing their own code—either through test suites or ad-hoc exploratory testing—coding automation has become viable in software development.

All of this, of course, requires a lot of computing power. You want a smart model, not a dumb one. Smart models require many parameters, and those parameters, among other things, consume a lot of GPU memory—which is very expensive at the moment. With cloud models like OpenAI Codex or Anthropic Claude, this is not a problem: the model is provided as a service and shared across many users, which spreads out the cost.
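To get a feel for why GPU memory dominates the hardware bill, a back-of-envelope sketch helps. The function and the 70B example model size below are illustrative, not figures from our tests, and weights are only part of the story—the KV cache for a long context window adds substantially on top:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only GPU memory footprint in GiB.

    bytes_per_param=2 assumes FP16/BF16 weights; quantized models use less,
    and KV cache plus activations come on top of this.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# A hypothetical 70B-parameter model in FP16 needs roughly 130 GiB
# for the weights alone—more than a single 80 GB GPU can hold.
print(round(weight_memory_gb(70), 1))
```

This is why serious models end up sharded across several GPUs rather than squeezed onto one card.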

But what if you can't use cloud models? This becomes especially relevant if your working environment requires air-gapped hardware or the work is sensitive enough that cloud usage is not an option.

At NorthCode, together with our partners, we recognized this need for running AI agent models on-prem and tested multiple models with sufficient hardware. The test setup consisted of:

- vLLM as the runtime environment, orchestrated by OpenShift AI
- Kubernetes to orchestrate the actual workload
- 4× A100 80GB GPUs on top of an NVIDIA DGX platform
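One practical upside of this stack is that vLLM exposes an OpenAI-compatible HTTP API, so an agent client can talk to the on-prem endpoint much as it would to a hosted model. A minimal sketch—the in-cluster URL and model name below are placeholders, not our actual configuration:

```python
import json
from urllib import request

# Placeholder endpoint; in a setup like ours this would be the in-cluster
# vLLM service exposed via OpenShift routing.
VLLM_URL = "http://vllm.internal:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "minimax-2.5") -> request.Request:
    """Build an OpenAI-style chat completion request for a vLLM endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a unit test for the config parser")
# request.urlopen(req) would send this to the on-prem server.
```

Because the API surface matches the hosted providers, switching an agent between cloud and on-prem is mostly a matter of changing the base URL.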

We tested the MiniMax 2.5 model with a ~200k context window. The client used was a modified OpenCode—we had to make some adjustments to the tool calling to get it working reliably.
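We won't detail the exact OpenCode changes here, but as an illustration of the kind of tool-calling adjustment involved, a client-side repair step that tolerates a common glitch—arguments emitted as a JSON-encoded string instead of an object—might look like this (hypothetical helper, not our actual patch):

```python
import json

def normalize_tool_call(call: dict) -> dict:
    """Coerce a model-emitted tool call into the shape the client expects.

    Hypothetical example: some models serialize `arguments` as a JSON
    string rather than an object, or pad it with stray whitespace.
    """
    args = call.get("arguments", {})
    if isinstance(args, str):
        # Unwrap string-encoded arguments into a proper object.
        args = json.loads(args.strip())
    return {"name": call["name"], "arguments": args}

# A call whose arguments arrived as a string is unwrapped transparently:
normalize_tool_call({"name": "read_file", "arguments": '{"path": "src/app.py"}'})
```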

Instead of running standardized tests like SWE-bench, we used a few real-life use cases of our own:

- Must pass our own AI Ready Engineer training tasks, consisting of:
  - Creating a Twitter clone (called Yapster) with proper Robot Framework tests
  - Deploying Yapster to Azure using provided credentials and GitHub Actions for CI/CD
  - Adding additional features to Yapster

- Must be able to handle complex code refactoring tasks as planned by our AI Colleague concept, a tool that automatically pays off technical debt. Our testing revealed that in some cases, MiniMax was even better than some older frontier models.
- Must be able to perform complex UI changes in a React application while following the app's existing styling. This was achieved without issues.

We found the public SWE-bench results to be in line with our own testing: MiniMax 2.5 can work as a coding agent both in legacy codebases and when creating new systems. The only notable issue we encountered was that while Codex ships with a fairly long instruction prompt that makes it cautious with destructive commands, MiniMax lacks one. We observed this when MiniMax cleaned my `~/Downloads` directory without asking while trying to relieve disk pressure. We mitigated the issue by writing stricter user instructions for the MiniMax model.
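Prompt instructions are the mitigation we used, but a belt-and-braces option is a hard filter in the agent client itself. A minimal sketch—the pattern list is illustrative, not exhaustive, and not part of our actual setup:

```python
import re

# Hypothetical client-side guard: shell commands matching these patterns
# are held for explicit user confirmation instead of being run directly.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b",  # rm -rf / rm -fr
    r"\bgit\s+(reset\s+--hard|clean\s+-[a-zA-Z]*f)",            # history/file wipes
    r"\bmkfs\b|\bdd\s+if=",                                     # disk-level damage
]

def needs_confirmation(command: str) -> bool:
    """Return True if the shell command matches a destructive pattern."""
    return any(re.search(p, command) for p in DESTRUCTIVE_PATTERNS)
```

A guard like this would have paused the `~/Downloads` cleanup for a human decision rather than relying on the model's own judgement.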

Partners of our ecosystem

Kipinä, Luoto Company, Asteroid, Heroe, Lakeview, Trail Openers, Vuolu