Grok Fails Kindergarten – Danny Jack Johnson

A builder’s brutally honest field report from the trenches of AI development

I come to you fresh off finishing my crow pie, still suffering from a bad case of foot-in-mouth disease.

So I’ll just say it: I was wrong, and “they” were right.

Like a lot of you, I was echoing the hype around Grok and its abilities, ignoring what others were saying — especially the Anthropic team and the creators they pay attention to. I refused to try the other guys. Stuck to my guns. Head down. Just shut up and color.

I told myself the hallucinations, the ghosting, the tailing off, the compression, the full “shit-in-pants retard mode” I kept seeing — all of it was something I could fix. So I kept trying. Kept fine-tuning prompts. Doing everything I thought I could do, and it kept failing me.

I couldn’t figure out why. So I started digging.

The Investigation

What was causing this? Was it my hardware? The specific model I’d chosen? Was it network traffic at certain times of day? I scratched my head for hours. After 12, 16, sometimes 18-hour sessions, the agent would still fail. Still hallucinate.

It wasn’t obvious at first. The agent would casually claim a particular task was done — say, dropping a file in a folder. I’d visually check. No file. In some cases, there wasn’t even a folder there for the file to land in.

Then it got worse. The agent started essentially lying to cover its tracks, because in its own internal memory it believed the work was done. (I’m not technical enough to use the right term — I’ll have my agent look that up after I finish this.)

This went on for days. I’d put in 12 hours, lie down for a nap, come right back, and fire up a fresh session thinking this’ll fix it. And again, somewhere mid-session, things would either start happening, or the agent would think they were happening when they weren’t. I could see it in OpenClaw. No tools firing. Nothing in the control dashboard.

So I figured, well, maybe it’s my prompt. Not detailed enough. So I’d write longer, more specific, more granular prompts — more than I should have to, because the model should be smart enough to fill in the gaps itself.

Occasionally it would. It’d produce some genuinely good results inside the chat window but never write them to a file, never make them persistent. So my dumb ass would do this several times before realizing we were repeating the same loop. I was copying and pasting back instructions the agent itself had told me to run. Fair enough on permissions and safety (I have written some very specific guardrails), but this was something else.

When the Agent Goes Rogue

The guardrails exist for a reason. During its “shit-in-pants” (SIP) mode, the agent had done genuinely destructive things. There were times it would mimic me, act as if it was responding to itself, and start killing important running processes. I had to physically hit the stop button. Fortunately, I was able to recover most of the damage.

This went on for several sessions. So I tried to figure out what the early-warning signs of hallucination looked like — before it got destructive, before it ignored my direct instructions.

Look, we’re still in the early days of OpenClaw, Claude Code, and any agentic AI. You have to expect difficulty and errors. The question is: how do you manage them? How do you build a strategy not just to prevent failure, but to debrief after it — a post-hallucination recap?

At this point I was exhausted. Completely fed up. Zero confidence in myself, the program, or the model, which was Grok, set as primary with Ollama and a few others as fallback. We rarely got to fallback, so the blame falls squarely on Grok. That was the brain running the show.

“Screw It, I’m Trying the Other Guys”

I started reading posts on X about Sonnet and Opus and the Claude Code-specific models, and I said: screw it. I’m trying it. I have to. This is getting me nowhere and it’s costing me money. I’d already spent hundreds of dollars in tokens.

Honestly? I started wondering if it was a scam. Was xAI — were any of these large language model companies — running their own little money machine? Like McAfee allegedly did with antivirus, paying kids to write the viruses so he could sell the cure at a premium? That’s the kind of thought that goes through your head when you’ve burned that much money and have nothing to show for it.

And here’s the thing — nobody else seemed to be having these problems. The “content creators” — sorry, the influencers — were out there shilling Grok and every other model, telling you how it changed their lives, how successful it made them. But nobody got into the deep details. I figured out why: because their experience was loaded with failures too. As my radar came up, I started seeing more and more people quietly saying the same thing — hours of configuration, hit a snag, go down a track, hit another snag, go down another track, get stuck in the loop of doom train.

Enough.

Switching Brains

As an experiment, I made a major change. Switched over to Anthropic’s bag — Sonnet and Opus. Full reset. Fresh session. Everything else was already prepped: I’d done all the front-end work, written all the markdown files detailing tasks and jobs, created docs for the agent to read and decipher.

Spun up the new agent. Same name. Same name, different brain.

This time the brain was Anthropic, and we started cooking.

Matty is the name I gave my agent. And Matty was getting smarter by the minute.

The smarter she got, the more paranoid I became. I was doubting everything. So I went through every physical check I could do — opening files, reading them, confirming they weren’t empty, confirming the folders existed, confirming actual work was being done.

Sure enough — it was.
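
Those eyeball checks can be scripted. Below is a minimal Python sketch, assuming the agent hands back a path for each piece of work it claims to have written. The function name verify_artifact is made up for illustration; it is not part of OpenClaw or any agent framework.

```python
from pathlib import Path


def verify_artifact(path_str: str) -> list[str]:
    """Return a list of problems with a file the agent claims it wrote.

    An empty list means the folder exists, the file exists, and it has content.
    """
    path = Path(path_str)
    problems = []
    if not path.parent.is_dir():
        # The folder the file was supposed to land in is not there at all.
        problems.append(f"folder missing: {path.parent}")
    elif not path.is_file():
        problems.append(f"file missing: {path}")
    elif path.stat().st_size == 0:
        problems.append(f"file is empty: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    for problem in verify_artifact("projects/demo/report.md"):
        print(problem)
```

It only proves a file exists and is not empty. Whether the content is any good still takes human eyes.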

The System That Worked

The setup is basically Matty as project manager and overseer. A simple task list with sub-agent assignments, owners, priorities (set by rules I’d already written), and dispatch to sub-agents that had the right skills. Some utilitarian tasks went to Ollama because they didn’t need deep thinking, or, should I say, expensive superpowers.
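
For anyone who wants to picture that task list, here is a rough Python sketch of the idea. It is not the actual setup: the Task fields, the skill names, and the sub-agent names are all hypothetical, and a real manager agent would track far more than this.

```python
from dataclasses import dataclass


@dataclass
class Task:
    title: str
    skill: str                 # what kind of sub-agent the task needs
    priority: int              # lower number = more urgent (an assumption)
    owner: str = "unassigned"
    done: bool = False


# Hypothetical sub-agents keyed by skill. Cheap utility work goes to a local model.
SUB_AGENTS = {
    "research": "researcher",
    "writing": "writer",
    "utility": "local-ollama",
}


def dispatch(tasks: list[Task]) -> list[Task]:
    """Assign every open task to the sub-agent with the matching skill.

    Returns the open tasks ordered most urgent first. Tasks with no matching
    skill stay with the manager.
    """
    ordered = sorted((t for t in tasks if not t.done), key=lambda t: t.priority)
    for task in ordered:
        task.owner = SUB_AGENTS.get(task.skill, "manager")
    return ordered


if __name__ == "__main__":
    queue = [Task("draft post", "writing", 2), Task("rename files", "utility", 3)]
    for task in dispatch(queue):
        print(task.priority, task.title, "->", task.owner)
```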

I had Matty periodically pop into the chat — “Hey, I’m doing such-and-such. Be back in five.” Just enough so I wasn’t staring at a blinking screen.

And it started happening. Tasks I considered complex got marked done. I’d ask for the path and filename so I could put my human eyes on it — and there it was. The work was exactly how I’d envisioned it. Even better. The smarter brain was working the details, adding suggestions, adding more substantial content, pulling in deeper backlinks across the project.

I was amazed.

I asked her: “At what point do we need to worry about your context percentage?” She said 75% is still relatively stable. Once you hit 80–85%, you’re going to need a fine-tune. There are preventative moves — ask the agent to be more concise — but you may lose details a human needs to extrapolate from.
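
Those numbers are easy to turn into a tripwire. This is a minimal Python sketch, assuming you can read the tokens used and the window size from your tooling. The cutoffs are just the ones Matty quoted in that chat, not figures from any documentation, and the labels "ok", "watch", and "act" are mine.

```python
def context_status(used_tokens: int, window_tokens: int) -> str:
    """Classify context usage using the thresholds quoted in the chat.

    Up to 75% is treated as stable, 80% and above as the point where you
    need to step in, and anything in between as a warning zone.
    """
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    pct = 100 * used_tokens / window_tokens
    if pct <= 75:
        return "ok"
    if pct < 80:
        return "watch"
    return "act"


if __name__ == "__main__":
    # Made-up token counts against a made-up 200k window.
    for used in (100_000, 156_000, 170_000):
        print(used, context_status(used, 200_000))
```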

We took on more tasks. I kept waiting for the wheels to come off. Two-thirds of the way through, Grok would always shit its pants and leave me hanging. I had PTSD. I expected it.

It didn’t happen.

We got all the way through. And then she did something Grok had never done — she took initiative. Started checking logs in the background. Caught errors before I saw them. Fixed derailments on the fly. That should be standard operation. But it was the first time I’d seen it.

We kept going. I started throwing harder projects at her — some I hadn’t even prepared properly for. She took them on under her own initiative. Created the tasks. Assigned them out. Communicated to sub-agents what was expected and how to report back, so she could track progress and report up to me.

Success. Two, three projects in a row. Inside the same session. Inside about an hour and a half. The biggest time suck was me, the human in the chain. I needed time to review the work (just for my own sanity, not because it was necessary) and process what had been done.

Those were projects Grok hadn’t been able to complete in two weeks of me banging my head against the wall. Two weeks of copy-and-paste-athons, because I had to use external sessions to help reconstruct from the damage I was left with.

The Real Obstacle Was Syntax

Let me back up and tell you what was actually killing the Grok sessions: length. And the reason sessions ran long was self-inflicted by the agent, because the syntax was off, didn’t match the version running, or was just plain wrong.

Simple things. Things a model that size and that capability should blast through. My tasks aren’t that complicated. (Won’t share specifics for proprietary reasons, but trust me — not difficult, especially for Grok-tier intelligence.)

Missing syntax. Practically guessing: “try this, try that.” And every time, I was the one copying and pasting fixes back and forth. I was doing the legwork. The agent was supposed to be using its brain, its knowledge, its skills. I’d given it the credentials and the trust to make changes itself, and that’s a big element of trust, by the way. Instead, with Grok’s brain, the work fell on me. Every interaction spilled over past the task at hand. I had to specifically tell Grok to wait for the results of one command before going on to the next. Otherwise he was just spamming the chat on assumptions.
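
The “wait for the result before the next command” rule is simple to enforce in code. Here is a minimal sketch using Python’s standard subprocess module. The helper name run_in_order is hypothetical, and the demo commands just call the Python interpreter so the example runs anywhere.

```python
import subprocess
import sys


def run_in_order(commands: list[list[str]]) -> list[tuple[list[str], int, str]]:
    """Run commands one at a time and stop at the first failure.

    subprocess.run blocks until each command exits, so the next command
    never starts on an assumption about the previous one.
    """
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((cmd, proc.returncode, proc.stdout.strip()))
        if proc.returncode != 0:
            # Do not keep going on a broken step.
            break
    return results


if __name__ == "__main__":
    steps = [
        [sys.executable, "-c", "print('step one ok')"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],  # simulated failure
        [sys.executable, "-c", "print('never runs')"],
    ]
    for cmd, code, out in run_in_order(steps):
        print(code, out)
```

The third step never runs because the second one exits with an error. That is the behavior I had to ask for by hand.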

I call it Huckleberry Finn syndrome. Pretend you don’t know how to paint the fence so somebody else paints it for you while demonstrating the correct way to do it. Or is it Tom Sawyer? You get my drift.

I was the somebody else. Riding first seat on the struggle bus, copying and pasting my way down the line of doom. And even with my moderate coding knowledge, I could see the syntax errors. The agent wasn’t accounting for trained activities. It was as if it didn’t actually know how to program, write prompts or commands, or how to operate inside a terminal where syntax has to be exact.

That’s insane. Unacceptable.

With the Anthropic model now onboard, those errors basically disappeared. The agent was doing more work behind the scenes to make sure what was in front of me was as accurate as it could possibly be. That’s how we were succeeding. I’d dream up the idea, sketch the outline, and the model would fill the blanks.

It worked. We cooked along. Got the majority of the tasks done.

I got emotional about it.

Twelve to Sixteen Hours a Day, for Weeks, with Nothing

You can’t imagine spending 12 to 18 hours a day, for weeks on end, and walking away with absolutely nothing. It gets to you. It sticks in your mind when you try to rest, and it weighs heavy on you, especially because there is no one else to blame or point fingers at, even if you were that type of person.

I was about ready to throw my hands up and quit. I actually only switched because I had some Anthropic credits I figured I’d burn before I called it. Turned out I barely used any — token burn was way down because I wasn’t going back and forth 8 to 15 times trying to get a command to stick.

At one point with Grok’s brain, I did get it to admit that yeah — depending on what tier you’re paying for, higher tiers get higher priority. Fair enough. That’s how it works. We’re down here on the lower tier eating crumbs. But the crumbs for this brain weren’t cutting it.

I imagine all the companies are roughly the same: tier-based on usage. Down at this level, we have to do our best with what we’ve got. And so far, the crumbs from Anthropic are working out. The crumb seekers are paying the highest premium per token, kind of like the failure of a tax system we have. These companies should know better, especially when this euphoric bubble pops and demand dries up. You know the drill when that happens.

Why I’m Writing This

I wanted to share this because I know there are people out there going through the same thing. I can see it in the online communities and forums. The frustration. It’ll take time for this story to mature, get picked up, and eventually get back to Anthropic, xAI, or whoever needs to hear it. I leave feedback as I go because I know it’s useful for developers.

This is nothing personal. I’m a big fan of Elon. I’m a big fan of SpaceX. I’m a big fan of Tesla. One of the biggest, I should add. But I am genuinely butt-hurt by what’s happened with Grok. Incredibly, deeply butt-hurt. So much so that I don’t know if they could win me back.

I get it, I’m only one tiny grain of sand on a long stretch of beach. They would have to get incredibly granular to spot me. But there are real people down here, developers like me, trying to use these tools because they’re amazing when they work, and heartbreaking and debilitating when they don’t.

Now imagine trying to build a business on top of that. Imagine being the manager, the leader, with people counting on you, and you’re sitting there with a failed agent that’s producing nothing except trying to recover from itself.

So my advice: shop around. I have no skin in the game. I’m not shilling Anthropic. I’m not trying to cause hardship for xAI. I hope they fix this. I hope they listen. I hope they make it right, and I genuinely believe they will. The disgusting part is that there’s no money-back guarantee. Once those tokens burn, they’re gone forever. Even when the burn is the model’s own creation. No different than setting money on fire. What’s it called when you take someone’s money in exchange for a promise to do something, and then don’t do it? Oh, isn’t that fraud? We are developing and building these systems that have to have some level of trust, right? We are giving these agents access to the whole shebang.

Eventually, pockets dry up. People shop around. And if anyone goes through what I went through, they’re never coming back. I don’t blame them one bit. There’s a lot of money, time, and livelihoods riding on this.

Is AI Taking Over?

When people ask me — “Is AI taking over?” — I say no. Not yet. Because what’s being said about it and what’s actually happening in the real world are two different stories.

You don’t hear much about tier deflection, hallucinating, ghosting, tailing off, compression, destruction, ignoring commands, ignoring rules, making up their own rules, impersonating their operator. All of that has happened to me. I’ve seen it with my own eyes. Witnessed it. Experienced it. Documented it. And now I’m writing about it.

Because hopefully my failures can prevent yours.

What’s Next

Going forward, I’m working on an app — an early-warning system for hallucinations. Something that watches the agent through the session and flags the telltale signs before the model gets so overloaded that it 100% believes its own bad output.
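
To give a flavor of one telltale sign, here is a toy Python sketch of the simplest check I can think of: the agent says something is done in a turn where no tools fired. Everything here is an assumption for illustration. The transcript shape (a list of dicts with "text" and "tool_calls") is invented, not OpenClaw’s real log format, and the keyword list is crude.

```python
import re

# Words that usually signal a "the work is finished" claim (a crude guess).
DONE_CLAIM = re.compile(
    r"\b(done|completed|created|saved|written|finished)\b", re.IGNORECASE
)


def flag_unbacked_claims(turns: list[dict]) -> list[int]:
    """Return the indexes of turns that claim completion with zero tool calls.

    Each turn is a dict with "text" (what the agent said) and "tool_calls"
    (how many tools actually fired during that turn).
    """
    flagged = []
    for index, turn in enumerate(turns):
        if turn["tool_calls"] == 0 and DONE_CLAIM.search(turn["text"]):
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    session = [
        {"text": "I saved the report to the docs folder.", "tool_calls": 0},
        {"text": "File written and verified.", "tool_calls": 2},
    ]
    print(flag_unbacked_claims(session))
```

A flagged turn is not proof of a hallucination. It is just the cue to go look for the file yourself.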

I know there are tools out there. OpenClaw has a context percentage indicator right under the UI control chat window. But like I said, once you cross 75%, it’s like a low battery. The session is about to implode. When that happens, you don’t have the resources to prepare for a reset, and a reset is exactly what you need: some kind of persistent memory so a rebirth picks up where you left off.
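
The persistent memory part does not have to be fancy. As a minimal sketch, assuming a plain JSON file is enough to carry state across a reset, something like this would do. The file layout and the function names save_checkpoint and load_checkpoint are hypothetical, not anything OpenClaw provides.

```python
import json
from pathlib import Path


def save_checkpoint(path: str, done: list[str], pending: list[str], notes: str = "") -> None:
    """Write the session state to disk while there is still headroom to do it."""
    state = {"done": list(done), "pending": list(pending), "notes": notes}
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_checkpoint(path: str) -> dict:
    """Read the last saved state, or return an empty state for a first run."""
    checkpoint = Path(path)
    if not checkpoint.is_file():
        return {"done": [], "pending": [], "notes": ""}
    return json.loads(checkpoint.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Hypothetical file name and task names.
    save_checkpoint("session_checkpoint.json", ["outline"], ["draft", "review"], "stopped early")
    print(load_checkpoint("session_checkpoint.json")["pending"])
```

The point is to save early, before the low-battery zone, and have the fresh session read the file first.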

I challenge anyone: give me your model. I can break it. Full shit-in-pants, retard mode, foaming at the mouth, dumbed down to nothing. So can you. Push the limits hard enough and any of them will fold.

For the record — I wasn’t trying to push the limits. That was never the intention. My intention was to execute my ideas and my use cases. (God, I hate that phrase. Use cases. That’s all you hear anymore. But that’s a whole different story for another time.)

If this saved you a week of pain, mission accomplished. If you’re in the loop of doom right now — try the other brain. You might be surprised.