I will start by saying that I am not a software engineer. The things I say here should be tempered by that fact, and there's a good deal of AI slop in here that I'm not smart enough to vet. It is what it is. I'm more of a hacker, plinker, tinkerer. I do have experience building electronics, embedded systems, and industrial machines, and at one time I was a pretty decent electrical engineer. Above all, though, I am a pragmatist. I like to build things, get under the hood, and see how things work. A reader might ask: "why not just buy a Mac Studio and be done with it?" Besides that being wholly unsatisfying, and most of all expensive, you already know the answer: I'm just not wired that way.
At LLM Garage, we are making a conscious choice to eschew the aesthetic in favor of the utilitarian: from "it just works" to "why does it work?" This blog is a running monologue of my journey through building LLM inference systems from off-the-shelf components. It won't be pretty, but it's mine.
The LLM Journey
I started my LLM journey like most of you, I expect. I got access to GPT-3.5, used the API to generate some things, and made a shitty RAG pipeline. Back then (all of 24-36 months ago), text2sql didn't really work reliably; the models weren't good enough to just figure things out. Next came copy-pasting output from Claude into my own scripts. Then came Cursor, which was thoroughly good. I had also set up an Ollama server and built my own machine with an AMD Ryzen 7 5800X, 64 GB of RAM, and two RTX 3090s for local inference. It was exciting to run a 70B-parameter model locally, but the atomic, transactional way of interacting with models that I was used to left this system sitting idle, mostly as a novelty or an expensive toy.
The Claude Code Miracle
Then, of course, Claude Code happened. I discovered it like many of you, over Christmas break 2025. The connective tissue was finally there. The fact that it could write directly to the file system, do web searches, and debug zillions of errors and recover meant that my lazy fingers were no longer on the critical path. Something like a miracle.
Discovering Opencode
With new enthusiasm for the possibilities, I looked at my humble rig and decided: there's got to be a way to put these GPUs to use. Long story short, an hour with Grok and Claude Code and I had discovered Opencode. I served Opencode with Qwen3-coder via an Ollama server. It was slow as shit. Another hour burned and my eyes were opened to the world of inference frameworks: tokens per second were solid, and I had my first taste of self-hosted Opencode. The rest is, as they say, "history."
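If you want to sanity-check a self-hosted endpoint before wiring it into a coding agent, a minimal sketch like the one below works against any OpenAI-compatible server (Ollama exposes one out of the box, and most inference frameworks do too). The base URL, API key, and model tag are placeholders, not something from my actual config; substitute whatever your server reports.

```python
# Minimal sanity check against a self-hosted, OpenAI-compatible endpoint.
# The base_url, api_key, and model tag are placeholders -- adjust for your server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default OpenAI-compatible endpoint
    api_key="not-needed-locally",          # local servers generally ignore the key
)

response = client.chat.completions.create(
    model="qwen3-coder",  # whatever tag your server registered the model under
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```

If this round trip crawls, the bottleneck is the serving stack rather than the client, which is exactly what swapping Ollama for a dedicated inference framework fixed for me.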
The Engineering Journal
So this is my engineering journal. We're starting with the scrappy rig I built a few years ago, but I have already bought new hardware, and this system will quickly evolve toward larger and more capable models. Right now we are context-limited: Qwen3-32B (dense) can only fit about 42k tokens of context. While I haven't seen significant benefits from KV cache quantization so far, I have high hopes for larger MoE models paired with a significant amount of system RAM to offload the inactive expert weights, thereby freeing up VRAM for longer contexts.
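To make the context ceiling concrete, here's a back-of-envelope sketch of KV cache memory. The layer and head counts below are my assumptions for a Qwen3-32B-class model, not numbers pulled from its config, so check the model's config.json before trusting them.

```python
# Back-of-envelope KV cache sizing: why context length, not weights alone,
# eats the VRAM. Architecture numbers are assumptions for a 32B-class model.
num_layers   = 64    # assumed transformer layers
num_kv_heads = 8     # assumed grouped-query KV heads
head_dim     = 128   # assumed dimension per head
bytes_per_el = 2     # fp16/bf16 cache; roughly 1 with an 8-bit KV cache

# Keys and values are each cached per layer, per KV head, per token.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el

for tokens in (32_000, 42_000, 128_000):
    gib = tokens * bytes_per_token / 1024**3
    print(f"{tokens:>7} tokens -> ~{gib:.1f} GiB of KV cache")
```

With numbers in that ballpark, 42k tokens already costs on the order of 10 GiB of cache on top of the weights, which is why pushing inactive MoE expert weights out to system RAM (or quantizing the cache) is the lever for longer contexts.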
Rather than theorizing from our armchairs about the way the world could be, we will run headlong into this space (s/o Leeroy Jenkins) and just see what works. I know of no other way to learn.
What's Next
There's a lot to do, but I feel a tremendous fire to figure this stuff out. I'm more than happy to share it with you, and if nothing else, with myself, as a trail of breadcrumbs to explain what the hell I did here.
Thanks for coming with me on this journey, and cheers.