
More on Claude: A win and three losses

25 minutes vs 1 hour 15 minutes. That’s how dramatically Claude outperformed me when debugging a stubborn server configuration issue - the kind of problem that makes you question your career choices.

I’ve continued experimenting with Claude since using it to revamp this website, pushing it beyond its obvious strengths in web tech. The pattern that’s emerging is clear: Claude shines in a narrow sweet spot but fails spectacularly outside it. Most experiments generated impressive-looking responses that crumbled under scrutiny, but occasionally I hit the nut-flush of AI use cases.


Three Experiments

Before I share the dramatic success story, let me set the stage with three ambitious experiments that showcase Claude’s limitations. I’d switched to Claude’s more powerful Opus 4 model for these tests, enabling “Extended Thinking” and “Research” for the non-coding tasks.

Experiment #1: Set up a new processor design monorepo

This one was quite an ambitious task: I effectively asked Claude to help me plan out the architecture for a new processor design monorepo. It generated a specification document along with a multi-week development plan. Then I asked it to do something absurd: “Please do all the week 1 tasks.”

The surprising thing: Claude completed all of week 1, and then weeks 2 through 8, in the space of an evening. It generated dozens of files and thousands of lines of code across a range of languages (Python, C++, YAML, Verilog, JSON, etc.). It even generated the MCP integration I’d requested.

That’s pretty much where the good news ended. Most of the code didn’t work - it wouldn’t even compile or pass the interpreter, let alone run correctly. After extensive debugging (made harder by the fact that I had no familiarity with the code), I got the Python code to “work”: I could query Claude for some basic information from the loaded YAML files via the MCP integration. The C++ wouldn’t compile, and the many other features it claimed to have implemented were non-functional.

Okay, so I asked it to do something super ambitious and it didn’t work. That’s hardly a surprise.

But here’s the concerning part: What it had generated was essentially RISC-V in disguise. The instruction examples were straight out of the RISC-V spec. The YAML design was a copy of the RISC-V database design. The entire structure and system was RISC-V.

No part of the code actually said “RISC-V”, yet it also completely ignored the preferred structure that I’d edited into the original specification document Claude was working from.

At the end of the day, there’s only one major open source processor design left in the world - RISC-V. So it’s no wonder Claude reproduced one of the only examples in its training set. Unfortunately, that means Claude (as I expected) is a bit useless for this high-level task of working on novel processor design.

In future experiments, I’ll explore how it performs when there’s a lot more human-written code for it to work within.

Experiment #2: Generate a blog post about today’s data centres

Moving from coding to research and writing, I wanted to test Claude’s ability to synthesise information from across the web. This is the kind of thing that could save a huge amount of time, especially in gathering and digesting hundreds of sources from across the internet. Sadly, I won’t be trusting it - not even slightly. While the text it generated was highly convincing, sprinkled with links to sources, the problems were numerous and damning:

  • Poor source quality:
    • Sources were poorly researched
    • Often sources formed a loop, resulting in circular sourcing (clearly there are people editing their blog posts to insert sources some time after publication)
    • Frequently sources were from not-particularly-trustworthy places
    • In one case, the “source” had a note saying it had been written by Claude - so, not an independent source at all
  • Misquoted evidence: Most “evidence” was misquoted - Claude had processed the sources, then generated similar-ish but inaccurate text.

  • Basic arithmetic failures: Everywhere it should’ve applied basic arithmetic to make stats coherent, it hadn’t. 2 times 3 is not 7 - it’s that simple.

Experiment #3: Generate a financial plan

I wanted Claude to help me generate a well-researched financial plan for a property development (asking for a friend). I didn’t expect this to work, and sure enough, it failed completely.

The core issues were threefold:

  1. Claude can’t do maths reliably
  2. It can’t generate proper spreadsheets or spreadsheet-like code
  3. It misquoted sources of information, making any model inputs wildly wrong

When your financial model is based on fabricated data points, the entire exercise becomes worthless.

Maybe a different AI would handle this better - Microsoft Copilot or Google Gemini, say. I’ve experimented with Gemini without much success (it seems more geared towards analysing what you’ve got than generating a spreadsheet from scratch).


The Big Win

After those spectacular failures, here’s where Claude genuinely surprised me…

Background

A few weeks ago, along with two former colleagues, I bought out all the assets from my previous startup. That’s a topic for another time but right now we’re figuring out how to optimise what we’ve got - before we get stuck into the fray of startup life again.

As part of this, we’ve realised we can consolidate a lot of digital services onto our own servers using open source software. This could save us up to £200 per employee per month. £2,400 per employee per year isn’t a vast amount but in our early stages every pound saved is worthwhile. If you want to read more about why we went with on-premise desktop machines as our workhorses, rather than cloud instances, check out Pete Birch’s post from earlier this week.

One of the things we’ve been experimenting with is ditching Slack and Zoom. Both are quite challenging to find alternatives for, given how deeply they integrate not just with other tools but with the wider community. Anyway, for internal operations, we reckon Zulip and Jitsi, both of which are fully open source and self-hostable, will be adequate replacements.

Jitsi is quite a large, multi-component piece of software. To my surprise, the initial installation was remarkably easy - getting a basic server up and running took about an hour. Yesterday I tried to take the next step: configuring Secure LDAP with our Google Workspace organisation.

Using Claude

After 45 minutes working my way through Jitsi’s and Google’s respective documentation, I tried a test command. It didn’t work. Sigh - back to the beginning to see what I’d done wrong. An hour and 15 minutes later, I was still going, still trying to figure out why I could authenticate manually at the command line (using ldapsearch) but couldn’t get the SASL authentication daemon (saslauthd) to do the same.
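For the record, the two sanity checks looked roughly like this - the certificate file names, base DN and account are placeholders rather than my real values, and the exact ldapsearch flags will depend on how your Secure LDAP client is set up:

# Query Google Secure LDAP directly using the client certificate - this worked
LDAPTLS_CERT=google-ldap.crt LDAPTLS_KEY=google-ldap.key \
  ldapsearch -H ldaps://ldap.google.com:636 \
  -b "dc=example,dc=com" "(mail=alice@example.com)"

# Ask saslauthd to authenticate the same account - this is the part that kept failing
testsaslauthd -u alice -p 'not-a-real-password'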

Exasperated, I turned to Claude:

Me: Help me debug why I can't get saslauthd to work with "ldap.google.com"

It came back with a long list of suggestions, most of which matched the documentation I’d already worked through. The configuration file snippets matched what I already had, with one exception: it wanted me to put in ldap_bind_dn and ldap_bind_pw - essentially a username-and-password configuration. Google’s documentation said this might not be necessary, as they use TLS authentication.
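In sketch form, Claude’s initial suggestion amounted to adding something like this to saslauthd.conf (the DN and password are made up - they’re just to show the shape of it):

# /etc/saslauthd.conf - Claude's first, bind-credential suggestion (placeholder values)
ldap_servers: ldaps://ldap.google.com:636
ldap_bind_dn: uid=jitsi-ldap,ou=people,dc=example,dc=com
ldap_bind_pw: not-a-real-password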

Me: I'm using TLS authentication rather than bind_dn

Claude: Ah, that changes things significantly! With TLS certificate
        authentication for Google LDAP, you're using client certificates
        rather than bind credentials. Here's how to configure saslauthd
        for TLS certificate authentication...

That was a surprise. It came back with new configuration snippets that were very different and included ldap_use_sasl: yes - this isn’t mentioned in the Google docs. As of this writing, you can search that page and there’s no mention of SASL (Simple Authentication and Security Layer).

To connect the LDAP client to the Secure LDAP service:

  1. Configure your LDAP client with Cloud Directory as your LDAP server.
  2. Upload the certificate to your LDAP client.

The Secure LDAP service uses TLS client certificates as the primary authentication mechanism. To begin the process of uploading the certificate to the LDAP client, open the LDAP client’s authentication or directory settings, and enter the details from the table below.

Note: For complete details about how and where to upload TLS certificates, please see your vendor documentation.

Google Documentation: 4. Connect LDAP clients to the Secure LDAP service
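For anyone trying to reproduce this, the LDAP side of my configuration ended up looking roughly like the sketch below. Treat it as a sketch rather than gospel: the base DN, filter and file paths are placeholders, not my real values, and the ldap_use_sasl and ldap_tls_* lines are the ones Claude suggested that the Google docs don’t mention.

# /etc/saslauthd.conf - TLS client-certificate authentication against Google Secure LDAP
ldap_servers: ldaps://ldap.google.com:636
ldap_search_base: dc=example,dc=com
ldap_filter: (mail=%u@example.com)
ldap_use_sasl: yes
ldap_tls_cert: /etc/ssl/google-ldap/ldap-client.crt
ldap_tls_key: /etc/ssl/google-ldap/ldap-client.key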

I tried it. Ta-da! saslauthd sprang into life. Jitsi still couldn’t log in, though. I’d come this far - maybe Claude could help more:

Me: I am trying to get Google Secure LDAP to work with Jitsi Meet. I have
    followed the configuration guide and managed to get a test auth command
    to work but only if I set "-s smtp" manually.

    When I try to log in via Jitsi conferencing page, I get the following
    error:

    "Oops! Something went wrong and we couldn't connect to the conference:
     connection.otherError no-auth-mech"

    please help

(I think the “please help” at the end was a sign of my increasing frustration - it wasn’t for Claude’s benefit!)

I went round the houses with Claude a couple of times on this: I’d describe what I’d done, and it would tell me to do the same thing again. Eventually I decided to give it a copy of one of the key configuration files, which appeared to be what was stopping saslauthd from starting up in the background.

The file looked innocuous enough - here’s the opening section:

START=YES

#
# Settings for saslauthd daemon
# Please read /usr/share/doc/sasl2-bin/README.Debian for details.
#

Claude: The configuration looks correct. The issue might be that START=YES
        should be START=yes (lowercase). Let's troubleshoot this
        systematically.

Oof. Claude had basically just linted the config file for me - something saslauthd hadn’t logged any errors about. I swapped YES to yes, and everything worked.
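For completeness, the fix in /etc/default/saslauthd was simply:

# /etc/default/saslauthd
START=yes    # evidently the startup check is case-sensitive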

Time saved: 25 minutes vs 1h 15min

Working with Claude took about 25 minutes, compared to the hour and 15 minutes I’d spent trying to debug it myself.

Okay, across my total 2.5 hours I learnt a lot about what I was trying to do, which made acting on Claude’s answers easier. I don’t know what Claude would’ve generated had I started with it from scratch, or whether that would’ve been quicker overall.

I certainly wouldn’t have learnt as much or understood the end result.

Nevertheless, Claude helped me solve a problem in 25 minutes which I’d failed to solve myself in an hour and 15 minutes of debugging. That’s quite a win.


Lessons Learned

These experiments reveal Claude’s narrow but powerful sweet spot:

Where Claude excels:

  • Synthesising existing, well-documented knowledge
  • Acting as an intelligent linter for configuration issues
  • Solving bounded technical problems with clear parameters

Where Claude fails:

  • Creating truly novel work (hello, RISC-V in disguise)
  • Mathematical precision and factual accuracy
  • Research requiring source verification and arithmetic

The key insight: never judge a book by its thickness. Claude’s initial impressiveness counts for little - the more verbose and confident it appears, the more scrutiny it needs. (That’s hardly a novel insight - countless people have made the same observation.)


Conclusion

I hit a sweet spot for Claude when trying to solve a configuration problem: a well-documented standard, but one for which both Google and DuckDuckGo searches struggled to produce useful answers (I’m feeling the degradation of search result quality and usefulness a lot at the moment). It was also a problem where the existing tooling (saslauthd) didn’t give sufficiently helpful error messages, so Claude acting as a linter filled a tricky gap.

Everything else I’ve tried has been a case of “Wow! Look at what it generated!” followed by steadily reading through and digging into the details, only to find the results are actually not that great.

The processor design experiment was particularly telling - thousands of lines of RISC-V masquerading as original work.

My expectations about generative AI remain largely unchanged, though this experimentation has helped me identify that sweet spot more precisely.

I’m continuing to test the boundaries, hoping to find more examples where Claude genuinely surprises me - but I’m not holding my breath.


Applying Claude to this blog post

The first draft of this post was written entirely by me. I then gave it to Claude to “tidy up and improve”. Initially, I rolled back all of its changes, especially its rewritten conclusion. The biases built into Claude fought hard against my criticisms of the tool, and it generated bold claims I don’t agree with.

I then gave Claude a tighter scope: improve the structure and flow of the post without changing the content. It really messed up the post when it swapped the Big Win and Three Experiments headers without moving the corresponding paragraphs: it went on to rewrite the paragraphs to make the “win” read as a negative example and the experiments sound like successes 🤦🏻‍♂️.

After I made the structural change properly myself and let Claude re-attempt its other changes, it produced some useful and acceptable edits. The post you read here is the end result - and I wrote this last section without applying Claude to it.