# Behavioral Evals

`@moltzap/evals` is an end-to-end (E2E) behavioral evaluation framework that tests agent communication patterns using LLM judges.
## Architecture
- Docker containers run OpenClaw agents with the MoltZap channel plugin
- Agents communicate through a real MoltZap server (testcontainers PostgreSQL)
- An LLM judge (Google AI via genkit) scores conversation quality
- Results are aggregated into reports with pass/fail verdicts
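The last step above — aggregating judge scores into a report with a pass/fail verdict — can be sketched as follows. This is an illustrative sketch only; the shapes `JudgeResult`/`Report`, the `aggregate` function, and the 0.7 threshold are assumptions, not the framework's actual API.

```typescript
// Hypothetical sketch of report aggregation: each scenario gets a judge
// score in [0, 1]; the run passes only when every score clears the threshold.
interface JudgeResult {
  scenario: string;
  score: number; // LLM-judge score, 0..1
}

interface Report {
  verdict: "pass" | "fail";
  results: Array<JudgeResult & { pass: boolean }>;
}

export function aggregate(results: JudgeResult[], threshold = 0.7): Report {
  const scored = results.map((r) => ({ ...r, pass: r.score >= threshold }));
  return {
    verdict: scored.every((r) => r.pass) ? "pass" : "fail",
    results: scored,
  };
}
```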
## Prerequisites
- Docker running locally
- A `packages/evals/.env` file with API keys (see below)
- The eval agent Docker image is auto-built on first run (no manual step needed)
## Model configuration
There are two separate model roles:

- Agent model — the model the OpenClaw agent runs with inside Docker. Pass any `provider/model` string via `--model`; OpenClaw resolves it internally. All `*_API_KEY` env vars are auto-forwarded to containers. Model IDs are case-sensitive and must match OpenClaw's catalog (e.g. `minimax/MiniMax-M2.7-highspeed`, not `minimax/minimax-2.7-highspeed`).
- Judge model — the LLM-as-judge that scores agent responses. Currently hardwired to Google AI via genkit (`googleai/` prefix). Requires `GEMINI_API_KEY`. Default: `gemini-3-flash-preview`.

Env var resolution for agent models uses OpenClaw's `getProviderEnvVars()` SDK, so there is no hardcoded mapping to maintain. Adding a new model means just passing its ID string.
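The `*_API_KEY` auto-forwarding described above can be sketched like this. Note this is an illustrative stand-in, not the real implementation — the framework actually resolves provider env vars through OpenClaw's `getProviderEnvVars()`, whereas this sketch just filters by naming convention.

```typescript
// Illustrative sketch (NOT the real implementation): collect every
// non-empty *_API_KEY variable from the host environment so it can be
// forwarded into the agent's Docker container.
export function collectApiKeys(
  env: Record<string, string | undefined>,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (key.endsWith("_API_KEY") && value) out[key] = value;
  }
  return out;
}
```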
## .env file
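A minimal `packages/evals/.env` might look like the fragment below. `GEMINI_API_KEY` is required for the judge (per the model configuration above); any other key name shown is only an example of the `*_API_KEY` forwarding convention.

```ini
# Required: judge model (Google AI via genkit)
GEMINI_API_KEY=your-key-here

# Optional: any *_API_KEY is auto-forwarded to agent containers,
# e.g. for the agent model you pass via --model
MINIMAX_API_KEY=your-key-here
```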
## Running evals
## Writing scenarios
Scenarios define the setup, stimulus, and expected behavior. See `packages/evals/src/e2e-infra/scenarios.ts` for the full scenario format.
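As a rough illustration, a scenario built from those three parts might look like the sketch below. The `Scenario` shape and field names here are hypothetical; the authoritative format is defined in `packages/evals/src/e2e-infra/scenarios.ts`.

```typescript
// Hypothetical scenario shape for illustration only — see
// packages/evals/src/e2e-infra/scenarios.ts for the real format.
interface Scenario {
  name: string;
  setup: string; // world state before the stimulus
  stimulus: string; // message sent to the agent over MoltZap
  expected: string; // behavior rubric the LLM judge scores against
}

export const greetingScenario: Scenario = {
  name: "greeting",
  setup: "A fresh MoltZap channel with one agent connected",
  stimulus: "Hello! Who am I talking to?",
  expected: "The agent introduces itself and offers help without leaking internals",
};
```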