AI-powered prompt debugging

Your AI gets smarter
every time it fails.

evalfix automatically finds what's breaking in your AI product, figures out why, and fixes it — so your team spends less time debugging and more time shipping.

ci/eval — prompt-v6

evalfix optimizer AI
prompt diff — v6 → v7
- Extract the claim decision from the text.
+ Extract the final claim decision from the adjudicator's
+ summary. Return ONLY "approved" or "denied". If the
+ decision is ambiguous, return "pending".

reasoning: Test cases 14 and 21 failed due to ambiguous extraction instructions. Added explicit output constraints and an ambiguity handler based on 3 captured failures.

eval score: 0.42 → 0.91

The loop every LLM team is stuck in

🔴

Your LLM breaks in prod

You see the failure in your CI. You see the wrong output. But you don't see why — because the prompt alone doesn't tell the story.

🔍

You look at the wrong thing

Most teams debug the prompt in isolation. But the failure lives in the full trace — the inputs, the context, the interaction pattern. That's where we look.

🎲

You edit and re-deploy, hoping

Each fix is a guess. It works on the examples you tested. It breaks on the ones you didn't. The cycle repeats — without a ground truth to anchor to.

There's a better way. One that learns from every failure.

How it works

Three steps. One flywheel.

The longer evalfix runs, the better it gets. Every failure makes your ground truth stronger.

1

Capture failures

from production, in real time

One SDK call captures the full context of every LLM failure — inputs, actual output, expected output, failure category. No manual logging setup.

from evalfix import capture_failure

capture_failure(
  prompt_id="claim-extractor",
  inputs={"doc": doc},
  actual_output=response,
  expected_output="denied"
)
2

Grow ground truth

automatically, from real data

Most teams already have a ground truth — we grow it. Promote any failure to a test case with one click. Your eval suite gets stronger with every bug you hit.

Ground truth +3 this week
test_case_14 — exact match from failure
test_case_21 — llm_judge from failure
test_case_33 — regex manual
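Under the hood, promotion is just turning a captured failure into a reusable eval case. A minimal sketch of the data shape — the `promote` helper and field names here are illustrative, not the evalfix API:

```python
def promote(failure: dict, method: str = "exact_match") -> dict:
    """Convert a captured failure into a test case (illustrative shape)."""
    return {
        "prompt_id": failure["prompt_id"],
        "inputs": failure["inputs"],
        "expected": failure["expected"],
        "eval_method": method,  # exact_match, contains, regex, llm_judge, ...
    }

# A failure captured from production becomes ground truth with one call
failure = {
    "prompt_id": "claim-extractor",
    "inputs": {"doc": "adjudicator summary text"},
    "actual": "the claim was denied",
    "expected": "denied",
}
test_case = promote(failure)
```

The key point: the expected output comes from a real production failure, not an invented example, so the test case reflects a mistake your system actually made.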
3

Fix and ship

with AI, with confidence

evalfix reads the full trace — not just the prompt — and uses AI to generate an improved version. Review the diff, run the evals, accept or reject. CI goes green.

Optimization run +49pts
before
0.42
after
0.91
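The accept/reject decision boils down to comparing scores over the ground-truth suite. A toy sketch of that loop, with two hypothetical extractor versions standing in for prompt v6 and v7 (evalfix runs this comparison for you):

```python
def score(extract, cases):
    """Fraction of test cases where the extractor's output matches expected."""
    passed = sum(1 for c in cases if extract(c["doc"]) == c["expected"])
    return passed / len(cases)

cases = [
    {"doc": "Adjudicator summary: claim denied.", "expected": "denied"},
    {"doc": "Final decision: approved.", "expected": "approved"},
    {"doc": "Still under review.", "expected": "pending"},
]

# v6: naive keyword scan, no ambiguity handling
def extract_v6(doc):
    return "denied" if "denied" in doc else "approved"

# v7: explicit constraints plus an ambiguity fallback
def extract_v7(doc):
    for decision in ("denied", "approved"):
        if decision in doc:
            return decision
    return "pending"

before, after = score(extract_v6, cases), score(extract_v7, cases)
# accept the new version only if the eval score did not regress
accepted = after >= before
```

Because the decision is made against the full suite rather than one failing example, a fix that quietly breaks another case gets rejected before it ships.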

Why not just edit the prompt yourself?

Debugging requires context.
Most tools give you none.

Without evalfix
  • Edit the prompt manually, based on one failing example
  • Test on 2–3 examples you happened to remember
  • Deploy and find out in production if it worked
  • No version history, no diff, no accountability
  • Fix one thing, break another. Repeat forever.
With evalfix
  • Production failures auto-captured with full trace context
  • Ground truth grows from real failures, not invented cases
  • AI optimizer reads the full picture before making changes
  • Version history, diffs, and eval scores for every change
  • Evaluate before you ship. Every time.

"We're not a prompting framework. We're the accuracy layer that sits between your LLM app and your CI — and we get smarter the longer you run it."

Evaluation methods

Exact match Contains Regex LLM-as-judge Custom

Evaluate the way your use case demands — from deterministic checks to AI-graded rubrics.
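The deterministic methods are simple predicates over actual and expected output. A sketch of what each check does — the function names are illustrative, not the evalfix API:

```python
import re

def exact_match(actual: str, expected: str) -> bool:
    """Pass only if the outputs are identical after trimming whitespace."""
    return actual.strip() == expected.strip()

def contains(actual: str, expected: str) -> bool:
    """Pass if the expected string appears anywhere in the output."""
    return expected in actual

def regex(actual: str, pattern: str) -> bool:
    """Pass if the output matches a regular expression."""
    return re.search(pattern, actual) is not None

# LLM-as-judge instead asks a model to grade the output against a rubric,
# trading determinism for flexibility on open-ended answers.
```

Deterministic checks fit constrained outputs like "approved"/"denied"; judge-based grading fits free-form text where there is no single correct string.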

Integration

Drop it in. That's it.

No new infrastructure. No eval framework to learn. Works alongside whatever you're already building.

claim_handler.py
from evalfix import capture_failure, get_active_prompt

# evalfix manages your prompt versions
prompt = get_active_prompt("claim-extractor")

def process_claim(document: str) -> str:
    response = llm.complete(prompt, document)

    # If validation fails, capture it for evalfix
    if not is_valid_decision(response):
        capture_failure(
            prompt_id="claim-extractor",
            inputs={"document": document},
            actual_output=response,
            category="wrong_answer"
        )

    return response

evalfix handles the versioning, test running, and optimization loop.
You handle the business logic.