2026, May 27

Giving Claude Eyes

AI Software Engineering Testing

Claude kept saying "all done" on broken code. Screenshots from end-to-end tests fixed it.

I'm building Skaffa, a grocery-list app, mostly hands-off with Claude Code. It's a side project, so Claude has to get far without my supervision.

I started with Dioxus and a hexagonal architecture. Hexagonal is overkill for a todo list. I wanted to learn it anyway, and the strict layering gives Claude guardrails: the ports are traits, and a missing match arm on a domain event is a compile error before it's a runtime bug. Rust's type system already does more of that work than TypeScript or JavaScript would. The project had unit tests and integration tests against a real Postgres. I thought it was solid. And yet Claude kept declaring victory on features that didn't work, or had broken something else.
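To make that guardrail concrete, here is a minimal sketch. Every name in it (ListEvent, Notifier, RecordingNotifier, handle) is hypothetical and is not Skaffa's actual code. The port is a trait, and the event handler has no wildcard arm, so adding a variant without handling it fails to compile.

```rust
// Hypothetical names throughout; a sketch of the pattern, not Skaffa's code.

/// A domain event. Adding a variant here breaks every exhaustive `match`
/// on `ListEvent` until the new case is handled.
#[derive(Debug, Clone, PartialEq)]
pub enum ListEvent {
    ItemAdded { name: String },
    ItemChecked { name: String },
    InviteSent { email: String },
}

/// A port: the domain depends on this trait, not on a concrete adapter.
pub trait Notifier {
    fn notify(&mut self, message: String);
}

/// An in-memory adapter that records messages, handy in unit tests.
#[derive(Default)]
pub struct RecordingNotifier {
    pub sent: Vec<String>,
}

impl Notifier for RecordingNotifier {
    fn notify(&mut self, message: String) {
        self.sent.push(message);
    }
}

/// No `_ =>` arm on purpose: the compiler rejects this function as soon as
/// a new `ListEvent` variant exists without a matching arm.
pub fn handle(event: &ListEvent, notifier: &mut dyn Notifier) {
    match event {
        ListEvent::ItemAdded { name } => notifier.notify(format!("added {name}")),
        ListEvent::ItemChecked { name } => notifier.notify(format!("checked {name}")),
        ListEvent::InviteSent { email } => notifier.notify(format!("invited {email}")),
    }
}
```

The catch is that this only proves every case is handled, not that the handling is right, which is the gap the rest of this post is about.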

The problem: Claude couldn't verify its work end-to-end. As developers, we don't just write code, run the tests, and ship. We poke at the app, see if it works, and iterate. Claude couldn't, and I didn't want to be its QA.

I tried --chrome, but it was slow, buggy at the time, and burned through tokens. It didn't guard against regressions either, since I didn't trust Claude to remember to re-test every pre-existing flow.

End-to-end tests with playwright-rs

I added Playwright tests that exercise every flow and take screenshots of each view, so Claude could see. The false "all done"s stopped. The UI got better too, because Claude could iterate on what it saw.

I started in Python, but every data-model change meant updating mirrored Python types, and Claude kept forgetting. No guardrails. Meanwhile, someone was building playwright-rs at agentic speed: new releases every few days, reaching full feature parity within a few weeks. It drives the real Playwright server through a Rust client. I migrated, and tests now look like this:

#[tokio::test]
async fn test_owner_sees_and_cancels_pending_invite() {
    let env = helpers::new_env().await;
    let seed = env.fresh_seed().await;
    let invitee = env.disposable_email("pending-cancel");

    let page = env.user1_page().await;
    page_objects::navigate_to_list(&page, &seed.groceries().id).await;
    page_objects::open_share_modal(&page).await;

    browser::fill(&page, "invite-email-input", &invitee).await;
    browser::click(&page, "invite-button").await;
    browser::wait_for(&page, "[data-testid='badge-pending']", DEFAULT_TIMEOUT).await;
    browser::screenshot(&page, "owner_sees_invite").await;

    browser::click(&page, "cancel-invite-button").await;
    browser::wait_for_text_gone(&page, &invitee, DEFAULT_TIMEOUT).await;
    browser::screenshot(&page, "owner_cancelled").await;
}

It exercises one slice of sharing: a pending invite shown to the owner, then cancelled.

It also had to be runnable in parallel, so Claude could iterate on one branch without tripping over a test running on another. Each worktree gets its own dev server, port, and Postgres database, with isolation coming from per-workspace scopes rather than a container per branch.
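One way to get that kind of isolation is to derive the port and database name deterministically from the worktree path. The sketch below is an assumption about how it could be done, not a description of Skaffa's setup: env_for_worktree, WorktreeEnv, the skaffa_e2e_ prefix, and the 20000 to 29999 port range are all made up for illustration. It uses a hand-rolled FNV-1a hash so the result stays stable across Rust versions.

```rust
/// Hypothetical per-worktree settings; not Skaffa's actual types.
#[derive(Debug, Clone, PartialEq)]
pub struct WorktreeEnv {
    pub port: u16,
    pub database: String,
}

/// 64-bit FNV-1a. Written out by hand so the value does not depend on the
/// standard library's hasher, whose algorithm is not guaranteed to be stable.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Maps a worktree path to a dev-server port and a Postgres database name.
/// The same path always yields the same pair.
pub fn env_for_worktree(path: &str) -> WorktreeEnv {
    let hash = fnv1a(path.as_bytes());
    // Assumed port range 20000..=29999. Two worktrees can still collide on a
    // port, so a real setup would need to check that the port is free.
    let port = 20_000 + (hash % 10_000) as u16;
    // 11-char prefix + 16 hex digits, well under Postgres's 63-byte limit.
    let database = format!("skaffa_e2e_{:016x}", hash);
    WorktreeEnv { port, database }
}
```

Because nothing is stored, Claude can start a test run from any worktree and land on the same server and database every time, without coordinating with runs on other branches.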

It took a few iterations to land, but now it's the single biggest productivity multiplier in the project.

Mailpit

One gap remained: anything touching email, like magic-link auth or the share-a-list invite, kept breaking under Claude. Mailpit fixed it. It's a fake SMTP server that runs in Docker, catches every outbound email, and exposes an HTTP API for reading them back. Tests fetch the invite from the inbox, pull the magic link out of the body, and click through, all from Playwright.
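The "pull the magic link out of the body" step can be as small as the sketch below. It assumes the message body has already been fetched from Mailpit as a string and that links start with the app's base URL. The function name first_link and the URLs in the usage note are hypothetical.

```rust
/// Returns the first link in `body` that starts with `base_url`.
/// The link ends at the first whitespace, quote, or angle bracket, so this
/// works on both plain-text and HTML bodies. Hypothetical helper.
pub fn first_link<'a>(body: &'a str, base_url: &str) -> Option<&'a str> {
    let start = body.find(base_url)?;
    let rest = &body[start..];
    let end = rest
        .find(|c: char| c.is_whitespace() || matches!(c, '"' | '\'' | '<' | '>'))
        .unwrap_or(rest.len());
    Some(&rest[..end])
}
```

For a body containing "Sign in: http://localhost:3000/auth/verify?token=abc123" and a base URL of "http://localhost:3000", it returns the full link including the token, which the test then opens in the Playwright page. It does not strip trailing punctuation, so a link followed directly by a period would need extra handling.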

Zero false "all done"s on email flows.

Where the eyes stop working

You feel the leverage most where it's missing.

Making Skaffa installable as a PWA looked easy: manifest, service worker, Add to Home Screen. Claude wrote it. Playwright confirmed the manifest was served, the service worker registered, and the install criteria were met. "All done."

Then I installed it on my iPhone and the app was unusable. The notch clipped the navbar's back button. The home indicator sat on top of the add-item button. Anything sized with 100vh overflowed the visible viewport in iOS Safari as the toolbar grew and shrank (the fix is 100dvh, but Claude couldn't tell that from the headless screenshots).

The eyes only see what they can render. Headless Chromium can't reproduce iOS standalone mode. The entire failure surface was invisible, and Claude declared victory anyway.

There's no real fix here. Claude still can't see iOS standalone. What I did was tighten the human loop: I added a --tunnel flag to the dev command that exposes the local server over the network and prints a QR code in the terminal. I point my phone at it, reload after each change, and tell Claude what's wrong. It doesn't replace the missing eyes, but it shortens the iteration enough to make PWA work bearable.

Takeaway

If you want more out of your agents, and you're tired of half-baked "all done"s, give them a way to verify their work. The system has to be runnable by the agent, so it can poke at it like you would.

An agent that can't verify its work isn't done, it's guessing. Give it eyes.