ykachala/evals
| Install | `composer require ykachala/evals` |
|---|---|
| PHP: | >=8.3 |
| License: | MIT |
| Last Updated: | Jun 2, 2026 |
| Links: | GitHub · Packagist |
Ykachala Evals
Regression testing for prompts and LLM output, built for PHP CI. Define golden datasets, score model responses with deterministic and LLM-as-judge assertions, and fail the build when quality drops below a threshold.
Why this exists (the 2026 gap)
A prompt change is a code change with no test suite. Swap a model, tweak a system prompt, or upgrade your RAG retriever, and you have no idea whether you just made answers better or quietly broke 20% of them. Teams find out in production.
Python and JavaScript have rich eval stacks: promptfoo, DeepEval, Ragas, Langfuse evals. PHP, even with the 2026 surge in Prism / Neuron / Laravel AI SDK adoption, has no native eval harness. You can call a model from PHP, but you can't assert it's still good in CI. Ykachala Evals is that missing test layer, and it feels like PHPUnit/Pest, not a separate Python or Node toolchain.
What it does
- Datasets: golden cases (`input`, optional `expected`, metadata) from PHP arrays, JSON/CSV, or a generator.
- Assertions / scorers, mixed freely per case:
  - Deterministic: `equals`, `contains`, `regex`, `jsonSchema`, `validJson`, `latencyUnder`, `costUnder`
  - Semantic: `similarTo` (embedding cosine ≥ threshold)
  - LLM-as-judge: `judge('Is the answer faithful to the context? Score 0–1.')`
- Runners: drive it from Pest/PHPUnit or the `vendor/bin/evals` CLI.
- CI gates: set pass thresholds (e.g. "≥ 0.9 of cases score ≥ 0.8"); non-zero exit on failure.
- Reports: JSON, JUnit XML (for CI annotations), and an HTML diff vs. the last snapshot.
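The semantic scorer above is described as "embedding cosine ≥ threshold". As a rough illustration of that check, here is a minimal standalone sketch. The helper names `cosineSimilarity` and `passesSimilarity` are hypothetical and not part of the package, and the three-dimensional vectors are toy stand-ins for real embeddings.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two equal-length embedding vectors.
 * Hypothetical helper for illustration; not part of the package's API.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // a zero vector has no direction, so treat it as dissimilar
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/** A semantic check passes when the similarity reaches the threshold. */
function passesSimilarity(array $actual, array $expected, float $threshold): bool
{
    return cosineSimilarity($actual, $expected) >= $threshold;
}

// Toy vectors standing in for real embeddings.
$expected = [1.0, 0.0, 0.0];
$close    = [0.9, 0.1, 0.0];
$far      = [0.0, 1.0, 0.0];

var_dump(passesSimilarity($close, $expected, 0.82)); // bool(true)
var_dump(passesSimilarity($far, $expected, 0.82));   // bool(false)
```

A threshold such as 0.82 only means something relative to the embedding model in use, so it is usually tuned against a handful of known-good and known-bad answers.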
Install
composer require --dev ykachala/evals
Quick start
use Ykachala\Evals\Suite;
use Ykachala\Evals\Dataset;
use function Ykachala\Evals\{contains, jsonSchema, judge, similarTo};
$suite = Suite::make('support-bot')
    ->using(fn (string $input) => $myAgent->reply($input)) // the system under test
    ->dataset(Dataset::fromJson(__DIR__.'/golden/support.json'))
    ->assert(
        contains('refund', caseInsensitive: true),
        similarTo('We will process your refund within 5 days', threshold: 0.82),
        judge('Is the reply polite and does it resolve the request?', pass: 0.7),
    )
    ->gate(passRate: 0.9, minScore: 0.8);
$report = $suite->run();
echo $report->summary(); // 47/50 passed · mean 0.91 · 2 regressions
$report->writeJunit('build/evals.xml');
exit($report->passed() ? 0 : 1);
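
The `gate(passRate: 0.9, minScore: 0.8)` call can be read alongside the "≥ 0.9 of cases score ≥ 0.8" example: the run passes when at least 90% of cases score 0.8 or higher. The sketch below shows that arithmetic under that reading. `gatePasses` is a hypothetical function, and the package's own `Gate` class may compute the result differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical gate check: true when the share of cases scoring at least
 * $minScore is at least $passRate. Illustrative only; the package's Gate
 * class may differ.
 *
 * @param list<float> $scores one score in [0, 1] per eval case
 */
function gatePasses(array $scores, float $passRate, float $minScore): bool
{
    if ($scores === []) {
        return false; // assumption: an empty run should not pass silently
    }

    $passing = count(array_filter($scores, fn (float $s): bool => $s >= $minScore));

    return $passing / count($scores) >= $passRate;
}

// 10 cases, 9 at or above 0.8: 9/10 = 0.9, which meets passRate 0.9.
$scores = [0.95, 0.9, 0.85, 0.8, 0.92, 0.88, 0.81, 0.99, 0.83, 0.4];

$ok = gatePasses($scores, passRate: 0.9, minScore: 0.8);
echo $ok ? "gate passed\n" : "gate failed\n"; // gate passed

// In a CI script the result would drive the exit code, as in the quick start:
// exit($ok ? 0 : 1);
```

Note that a single very low score (0.4 here) does not fail the gate on its own; only the share of cases below `minScore` matters in this reading.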
In Pest
it('keeps the support bot faithful', function () {
    $report = Suite::make('support-bot')->/* ... */->run();
    expect($report)->toPassGate();
});
From CI
vendor/bin/evals run evals/ --gate=0.9 --junit=build/evals.xml
LLM-as-judge, made affordable
Judge and similarTo assertions cost tokens. Ykachala Evals integrates with
ykachala/semantic-cache to cache judge verdicts and embeddings, so re-running the suite
on an unchanged dataset is nearly free — and with ykachala/meter you see the exact token
cost of each eval run.
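
To see why a re-run on an unchanged dataset can skip most judge calls, consider the simplified sketch below. It uses exact-match hashing of the rubric and the model output rather than semantic matching, and it does not use the ykachala/semantic-cache API. `VerdictCache` and `$fakeJudge` are hypothetical names introduced only for this illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical exact-match verdict cache. It only illustrates why re-running
 * an unchanged suite can skip judge calls; it is not the semantic-cache API.
 */
final class VerdictCache
{
    /** @var array<string, float> cache key => judge score */
    private array $scores = [];

    /** Number of times the underlying judge was actually invoked. */
    public int $judgeCalls = 0;

    /**
     * @param callable(string, string): float $judge the expensive model call
     */
    public function score(string $rubric, string $output, callable $judge): float
    {
        // Same rubric and same output give the same key, so the verdict is reused.
        $key = hash('sha256', $rubric . "\0" . $output);

        if (!array_key_exists($key, $this->scores)) {
            $this->judgeCalls++;
            $this->scores[$key] = $judge($rubric, $output);
        }

        return $this->scores[$key];
    }
}

// Stand-in for a real LLM judge so the example runs offline.
$fakeJudge = fn (string $rubric, string $output): float
    => str_contains($output, 'refund') ? 0.9 : 0.2;

$cache  = new VerdictCache();
$rubric = 'Is the reply polite and does it resolve the request?';

$first  = $cache->score($rubric, 'Your refund is on its way.', $fakeJudge);
$second = $cache->score($rubric, 'Your refund is on its way.', $fakeJudge); // served from cache

echo $cache->judgeCalls, "\n"; // 1
```

An in-memory array only helps within one process; a cache that survives between CI runs would need persistent storage, which is the role the integration described above is meant to fill.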
Architecture
src/
├── Suite.php # fluent builder: using() / dataset() / assert() / gate() / run()
├── Dataset.php # case loading (array, JSON, CSV, generator)
├── EvalCase.php # one eval case: input, expected, metadata (`Case` is reserved)
├── Assertion/ # deterministic + semantic + judge assertions
├── Judge/ # LLM-as-judge interface + adapters
├── Report.php # scores, regressions, JSON/JUnit/HTML output
└── Gate.php # pass thresholds + CI exit semantics
bin/evals # CLI runner
Roadmap
- EvalCase + Dataset loaders (array/JSON/CSV/generator)
- Deterministic assertions (equals/contains/regex/jsonSchema/validJson/latency/cost)
- Suite builder + Report (scores, regressions vs. snapshot)
- Semantic `similarTo` + LLM-as-judge assertions
- Gate + CLI runner + JUnit/HTML reports
- semantic-cache + meter integration
See CLAUDE.md for the full phase plan and conventions.
License
MIT