Visual QA as a CI Pipeline Stage

2026-02-06 · 6 min read

I merged a PR last month. The code review looked good. Tests passed. Then I opened the site on my phone and the sidebar was completely broken.

The fix was trivial—a missing media query. The bug was obvious once you actually looked at the mobile view. Nobody did.

So I added a pipeline stage that looks.

I open a GitHub issue that says:

Implement the Manual Entry option for adding clients. When clicked, opens a wide slide-over drawer with a form to create a new client.

One commit later, the PR lands with 30+ screenshots proving every state works at every viewport. Zero manual testing. The only effort was writing the feature description.

The Concept

On every push to a PR, a GitHub Actions workflow:

Boots a Docker Compose stack (Django + Postgres) from the PR branch
Runs migrations and seeds a test user
Hands the local URL to Claude Code with Playwright MCP (headless Chromium)
Exercises interactive elements based on the PR diff
Screenshots every state at three viewports
Posts the results as a PR comment

Claude Code controls the browser. Docker Compose provides the app. GitHub Actions ties it together. Reviewers see proof of functionality without running anything locally.

%%{init: {"flowchart": {"subGraphTitleMargin": {"top": 3, "bottom": 10}}} }%%
graph TD
    A["PR Push"] --> B

    subgraph GHA["GitHub Actions Runner"]
        B["Checkout + Docker Compose"] --> C["Migrations + Seed Data"]
        C --> Agent

        subgraph Agent["Claude Code + Playwright"]
            direction LR
            D["Read PR Diff"] --> E["Map Files → URLs"]
            E --> F["Login + Navigate"]
            F --> G["Screenshot ×3 Viewports"]
        end
    end

    Agent --> H["PR Comment"]

    style GHA fill:transparent,stroke:#888
    style Agent fill:transparent,stroke:#888

The workflow itself is straightforward:

name: UI Screenshots
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  screenshots:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      # ... Docker Compose setup, migrations, test user creation ...

      - name: Start services
        run: |
          docker compose -f docker-compose.local.yml up -d --wait

      - name: Install Playwright
        run: npx playwright install --with-deps chromium

      - name: Run Claude for Screenshots
        uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          claude_args: >-
            --allowed-tools
            "mcp__playwright__*,
            Bash(gh pr diff:*),
            Bash(gh pr comment:*),
            Read,Glob,Grep"
          prompt: |
            You are a visual QA agent. Your job is to visually
            verify a UI deployment.

            PR: ${{ github.event.pull_request.number }}
            The Django app is running at http://localhost:8001

            1. Run `gh pr diff ${{ github.event.pull_request.number }}` to find changed templates/views
            2. Map changed files to URLs — only test pages
               affected by this push
            3. Login, then screenshot each page at three viewports:
               1440x900, 1024x768, 640x1136
            4. Save screenshots with descriptive filenames
            5. Post a PR comment with:
               - Each test performed with pass/fail
               - Any visual issues found
               - Total screenshots captured
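
For comparison, step 3 of that prompt corresponds roughly to the scripted loop below. This is a minimal sketch, not what the agent runs: the agent drives the browser through Playwright MCP tools, while this calls methods from Playwright's Python sync API directly. The function name `capture_viewports`, the `screenshots` directory, and the filename pattern are my own assumptions; the viewport sizes come from the prompt.

```python
from pathlib import Path

# Viewport sizes taken from the prompt above.
VIEWPORTS = {
    "desktop": (1440, 900),
    "tablet": (1024, 768),
    "mobile": (640, 1136),
}


def capture_viewports(page, url, name, out_dir="screenshots"):
    """Screenshot `url` once per viewport and return the saved paths.

    `page` is expected to behave like a Playwright sync-API Page,
    i.e. to provide set_viewport_size, goto, and screenshot.
    The function name and filename pattern are hypothetical.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for label, (width, height) in VIEWPORTS.items():
        # Resize first so the page lays out at the target width.
        page.set_viewport_size({"width": width, "height": height})
        page.goto(url)
        path = out / f"{name}-{label}-{width}x{height}.png"
        page.screenshot(path=str(path), full_page=True)
        saved.append(path)
    return saved
```

Given a real Playwright `Page`, the call would look like `capture_viewports(page, "http://localhost:8001/", "home")`. A script like this only covers static page loads; deciding which pages and states matter is the part left to the agent.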

The Viewport Matrix

Device	Size	Why
Desktop	1440×900	Standard laptop
Tablet	1024×768	iPad landscape
Mobile	640×1136	iPhone SE (1st gen) in physical pixels; its CSS viewport is 320×568

Every interactive state gets captured at all three. A sidebar component generates 12+ screenshots:

Page load (closed) — 3 viewports
Trigger click — 3 viewports
Fully open — 3 viewports
Close animation — 3 viewports

Multiply by light/dark themes and you’re looking at 24 screenshots for one component. That’s the point. Exhaustive visual proof as a CI artifact.
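
The arithmetic is just a cross product of states, viewports, and themes. Here is a small sketch that enumerates it for the sidebar example. The state labels and the filename pattern are illustrative assumptions, not the names the agent actually produces.

```python
from itertools import product

# Illustrative labels for the four sidebar states listed above.
STATES = ["closed", "trigger-click", "open", "closing"]
VIEWPORTS = ["1440x900", "1024x768", "640x1136"]
THEMES = ["light", "dark"]


def screenshot_names(component, states=STATES, viewports=VIEWPORTS, themes=THEMES):
    """Return one descriptive filename per state/viewport/theme combination."""
    return [
        f"{component}_{state}_{viewport}_{theme}.png"
        for state, viewport, theme in product(states, viewports, themes)
    ]


names = screenshot_names("sidebar")
print(len(names))  # 4 states x 3 viewports x 2 themes = 24
print(names[0])    # sidebar_closed_1440x900_light.png
```

Adding one more state or one more viewport grows the count multiplicatively, which is why this only makes sense as an automated artifact.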

What It Actually Outputs

Here’s real output from a PR that added a manual entry page. The agent posts a PR comment with a structured summary, screenshots at every viewport, and a test checklist:

GitHub issue created by the visual QA agent

Below that summary, the agent appends screenshots of every state it exercised. Here’s the desktop view—clients list with data, plus the Add Client drawer open:

Desktop view: clients list and Add Client drawer at 1440x900

The agent tested:

Empty form state
Form with validation errors (missing required name field)
Successful form submission
Client list rendering with data
Avatar fallbacks when no image
Drawer open/close transitions
All three viewports

I skipped individual field focus states for this PR—diminishing returns on a simple form. The agent can do it, but 50 screenshots of text inputs getting focus isn’t useful signal.

The PR comment becomes a visual changelog. Every test run ends with a checklist:

Testing Performed checklist with all items passing

Six months from now, I can look at any PR and see exactly what the UI looked like when it shipped.
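
The comment layout is something the agent writes freehand from the prompt, but its shape is simple enough to sketch as a function. This is a hypothetical rendering, assuming a `CheckResult` record and section headings of my own choosing; the sample results are made up for illustration and are not output from a real run.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str                  # test performed
    passed: bool
    screenshot_count: int = 0
    issue: str = ""            # visual issue found, if any


def render_comment(results):
    """Build a Markdown PR comment body from a list of CheckResult."""
    lines = ["## Visual QA", "", "### Testing Performed"]
    for r in results:
        box = "x" if r.passed else " "
        status = "pass" if r.passed else "fail"
        lines.append(f"- [{box}] {r.name}: {status}")
    issues = [r for r in results if r.issue]
    if issues:
        lines += ["", "### Visual issues"]
        lines += [f"- {r.name}: {r.issue}" for r in issues]
    total = sum(r.screenshot_count for r in results)
    lines += ["", f"Total screenshots captured: {total}"]
    return "\n".join(lines)


# Made-up sample data, for illustration only.
results = [
    CheckResult("Empty form state", True, 3),
    CheckResult("Validation error (missing name)", True, 3),
    CheckResult("Drawer at 640x1136", False, 3, "Submit button clipped"),
]
print(render_comment(results))
```

The three sections mirror what the prompt asks for: each test with pass/fail, any visual issues, and the total screenshot count.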

The Test Manifest

The prompt in the workflow above is generic — it works for any PR. The specificity comes from the diff itself. The agent reads gh pr diff, figures out which templates and views changed, maps those to URLs, and decides what to test.

For a form component, that means it will exercise:

component: ClientForm
states:
  - empty form
  - filled form (valid data)
  - validation error (submit without name)
  - submitting (loading state)
  - success (confirmation)
interactions:
  - submit empty form (trigger validation)
  - fill all fields, submit
  - verify client appears in list after submit

No manual test plan needed. The agent infers the test surface from the code changes.
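
To make the "map files to URLs" step concrete, here is a deterministic stand-in for what the agent does by reading the code. It is a sketch under assumptions: the `ROUTES` table, the file paths, and the `/clients/` and `/dashboard/` routes are hypothetical, and the agent itself uses no such static table. The input is shaped like the output of `gh pr diff --name-only`.

```python
from fnmatch import fnmatch

# Hypothetical mapping from file patterns to routes. A real project
# would derive this from its own urls.py and template layout.
ROUTES = [
    ("templates/clients/*.html", "/clients/"),
    ("clients/views.py", "/clients/"),
    ("clients/forms.py", "/clients/"),
    ("templates/dashboard/*.html", "/dashboard/"),
]


def urls_for_changes(changed_files, base_url="http://localhost:8001"):
    """Return the de-duplicated, ordered URLs affected by the changed files."""
    urls = []
    for path in changed_files:
        for pattern, route in ROUTES:
            if fnmatch(path, pattern):
                url = base_url + route
                if url not in urls:
                    urls.append(url)
    return urls


# Sample input shaped like the output of `gh pr diff --name-only`.
changed = """templates/clients/drawer.html
clients/forms.py
README.md
""".splitlines()

print(urls_for_changes(changed))
# ['http://localhost:8001/clients/']
```

A static table like this goes stale whenever routes change, which is the argument for letting the agent infer the mapping from the diff instead.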

Why Screenshots in PR Comments

Inline screenshots in the PR comment have a few advantages:

Visible — reviewers see them without clicking through to artifacts
Contextual — right next to the diff they’re reviewing
Comparable — each push updates the comment, so you see before/after
Auditable — the comment history shows what was tested at each commit

What This Catches

Real bugs caught in the first week:

Dropdown clipped by overflow: hidden parent (only visible on tablet)
Button text invisible in dark mode (wrong CSS variable)
Form labels misaligned on iPhone (flexbox gap issue)
Hover state stuck after touch (needed @media (hover: hover))
Modal backdrop not covering full viewport on landscape tablet

Every one of these passed code review. Every one was obvious in the screenshots.

The Cost

Running Claude with Playwright MCP in CI takes 2-4 minutes when a PR touches many states. For a typical PR touching one component, it’s about 90 seconds.

Compare to: deploying to production and finding out from users that the mobile layout is broken. Priceless.

The Shift

Visual QA has always been the bottleneck. You can automate unit tests, integration tests, even end-to-end flows — but someone still has to look at the UI. That’s been true for decades.

It’s not true anymore. Agents with browsers don’t just run scripts. They interpret, navigate, interact, and judge. The test surface isn’t hardcoded — it’s inferred from the change. Every PR gets exhaustive visual coverage that no manual process could match.

This changes quality engineering (QE) from a gate to a guarantee. Not “did someone check it” but “the pipeline checked it, here’s proof.” Every PR, every push, every viewport. The QE role doesn’t disappear. It moves upstream. Instead of executing test plans, you’re defining what the agent should care about. Instead of catching bugs, you’re designing the system that catches them.