Peanut Butter & Chocolate: A Claude Code Plugin for Structured Agentic Coding

I have been writing about my agentic coding workflow for a while now. The manager/coding agent loop, the CLAUDE.md architecture, the review gates, the PRIME_DIRECTIVEs. All of that still applies.

What I have not shared until now is the actual plugin that ties it together.

Peanut Butter & Chocolate is a Claude Code plugin. It is MIT licensed, free, and you can install it right now. It implements the structured pipeline I run on every non-trivial PR: Research, Plan, Implement, Review.

The name is the whole idea. Peanut butter is Claude. Chocolate is OpenCode with Codex. Two good things that work better when you combine them with clear boundaries.

Why a plugin

I kept rebuilding the same workflow from scratch. Open a Claude Code session, manually prompt for research, copy context into a plan, hand off to Codex, paste review feedback back. Every time I started a new feature, I was re-inventing the process.

The plugin packages that process into five slash commands. Each one encodes the exact workflow I described in my earlier posts, with the guardrails and artifacts baked in.

The pipeline

/research-codebase

Three parallel Sonnet sub-agents fan out across your codebase. One finds files. One analyzes how things work. One maps patterns and conventions. They are read-only. They do not suggest improvements or critique your code. They just document what exists.

The output is a timestamped research document saved to thoughts/research/. Every claim includes a file:line reference so you can verify it.

I run this before planning because plans built on assumptions break. Plans built on evidence hold up.
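To make that concrete, here is a hypothetical slice of a research document. The paths and findings are invented, but the pattern of a claim paired with a file:line citation is the part that matters:

    # Research: how request retries work today

    ## Findings
    - All outbound HTTP calls go through a shared client wrapper
      (src/http/client.py:88).
    - Retries use exponential backoff with a hard-coded 30s ceiling
      (src/http/client.py:97-104).
    - Nothing reads the Retry-After header; every failure is treated
      the same way (src/http/client.py:120).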

/create-plan

Opus reads the research and works with you interactively. You answer clarifying questions. It produces a phased plan with explicit scope boundaries, including a "what we are NOT doing" section to prevent drift.

Every plan has verifiable success criteria. Not "ensure quality" but "run pytest -x and confirm zero failures." If you cannot script the check, it does not belong in the success criteria.

Plans get saved to thoughts/plans/.
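A plan might be structured roughly like this. The feature and commands below are placeholders, not PBC's actual template, but notice that each success criterion is something you can actually run:

    # Plan: respect Retry-After on 429 responses

    ## Phase 1: parse the header
    - Read Retry-After in the retry loop and cap waits at the existing ceiling.

    ## What we are NOT doing
    - No changes to the public client API.
    - No new configuration options.

    ## Success criteria
    - pytest -x tests/http/ passes with zero failures.
    - ruff check src/http/ reports no new violations.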

/implement-plan

This is where Codex comes in. Opus breaks the plan into self-contained execution packets. Each packet has an objective, constraints, acceptance criteria, and a required output format. Then it sends each packet to Codex via opencode run.

The packets are markdown files, not inline heredocs. A safety hook blocks heredoc execution to enforce this. File-based handoff means every instruction is auditable and reproducible.

Opus does not write code. It orchestrates. Codex does not plan. It implements. The separation is the point.
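An execution packet, in this sketch, is just a small markdown file Codex can act on without any other context. The contents are illustrative rather than the plugin's exact template, but they show the four required pieces:

    # Packet 1: parse Retry-After in the retry loop

    ## Objective
    Honor the Retry-After header when a request returns 429.

    ## Constraints
    - Touch only src/http/client.py and its tests.
    - Keep the existing 30s backoff ceiling as the upper bound.

    ## Acceptance criteria
    - pytest -x tests/http/test_retry.py passes.

    ## Output format
    - A unified diff plus a one-paragraph summary of the change.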

/review-work

Codex reviews the changes independently. It checks the diff, runs tests, looks at code quality, and flags issues. Then Opus triages every finding into three buckets:

  • Agree, fix now - sends a targeted packet back to Codex
  • Agree, defer - logs it for later
  • Disagree, skip - documents the reasoning

This loops up to three times. The cap prevents infinite refinement cycles where agents keep finding new things to nitpick.
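A triage pass might read something like this. The findings and paths are invented; the three buckets are the real structure:

    ## Review round 1 triage
    - [fix now] Retry loop can sleep for a negative duration when the
      header is malformed -> packet 2 sent to Codex.
    - [defer] Test coverage for jittered backoff is thin -> logged in
      thoughts/reviews/deferred.md.
    - [skip] Suggestion to switch HTTP libraries: out of scope for this
      plan, reasoning documented.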

/address-pr-comments

After the PR is up and reviewers, human or AI, leave comments, this command fetches them, triages each one through the same agree/defer/skip process, and generates fix packets. One packet per comment, for clean traceability.
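In this sketch, each comment becomes its own small packet so the fix can be traced back to the thread that asked for it. The identifiers and wording here are made up:

    # Fix packet: PR comment r-1 (src/http/client.py:97)
    Reviewer asked why the backoff ceiling is hard-coded.

    ## Objective
    Move the 30s ceiling into a named constant with a comment
    explaining the choice.

    ## Acceptance criteria
    - pytest -x tests/http/ passes.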

The sub-agents

PBC ships with eight specialized Sonnet sub-agents. Three for codebase research, two for reading your thoughts/ artifacts, two for Jira integration, and one for web search.

Every sub-agent follows the same constraint: they are documentarians. They explain what exists. They do not critique, do not suggest, do not recommend. That keeps their output clean and prevents them from stepping on the orchestrator's job.
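Claude Code defines sub-agents as markdown files with YAML frontmatter, so a documentarian-style agent might be sketched like this. The name, tool list, and wording are mine, not PBC's actual agent files:

    ---
    name: codebase-analyzer
    description: Explains how existing code works. Read-only.
    tools: Read, Grep, Glob
    ---

    You are a documentarian. Describe what the code does and where,
    with file:line references. Do not critique, do not suggest
    improvements, do not recommend changes.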

Why independent review matters

The biggest problem with single-agent coding is that the same model that wrote the code also reviews it. It has the same blind spots, the same assumptions, the same biases. It is going to approve its own work.

PBC hands review to a completely separate agent with no context about how the implementation was planned. Codex sees the diff and the codebase, not the thought process that led to the code. That independence catches things the implementing agent cannot see.

Is it perfect? No. But it catches real bugs that would otherwise make it into PR review, and that saves senior engineer time.

Artifacts and traceability

Everything goes to a thoughts/ directory, which is gitignored by default. Research, plans, packets, reviews. Each artifact has YAML frontmatter with date, branch, commit hash, and tags.
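The frontmatter is small. Something along these lines, though the exact field names and value formats may differ from what the plugin writes:

    ---
    date: 2025-01-14T10:32:00Z
    branch: feature/retry-after
    commit: 3f9c2ab
    tags: [plan, http, retries]
    ---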

This matters when something breaks later and you need to understand why a decision was made. The artifact trail gives you that. It also matters when you want to resume work after clearing context or starting a new session.

How to install it

claude plugin add mattlgroff/pbc

That is it. You need Claude Code and OpenCode installed. The plugin assumes opencode is available on your PATH.

What this is not

This is not an auto-merge pipeline. You still review the PR. You still approve the plan before implementation starts. You still triage review findings before fixes happen. Every step has a human gate.

This is also not a replacement for tests. If your repo does not have tests, the review loop cannot verify correctness. PBC makes your existing quality infrastructure more effective. It does not substitute for it.

The cost angle

I use the ChatGPT Pro plan, which gives me near unlimited access to GPT-5.3-codex. That covers the implementation and review agents. For Claude, I have a company-paid Anthropic plan that includes some amount of Opus usage, but nowhere near MAX levels. Opus handles orchestration and planning, which requires fewer tokens than the implementation work Codex does.

This combination is the most cost-effective strategy for me right now. Opus is excellent at reasoning through plans and triaging review findings. Codex is excellent at grinding through implementation packets. By splitting the roles along those strengths, I get the best of both without paying MAX pricing for the high-volume work.

Where this came from

The ideas are not original to me. Hamel Husain wrote about review loops. Dexter Horthy at HumanLayer shaped how I think about agent boundaries. Ryan Carson's compound engineering concept influenced the multi-pass structure. I took those ideas, built a workflow around them, used it on real production code for months, and then packaged it.

The repo is at github.com/mattlgroff/pbc. MIT license. If you try it and something breaks, open an issue.

Further reading