Your agents are leaking money.

Their logs already know where. We hand you the map for free, no SDK, no integration. Then one line of code closes the loop, and the engine optimizes every decision your agents make, learning from its own impact.

FREE AUDIT · ZERO INTEGRATION · ONE SDK CALL · PROPOSE / RECORD · THE LOOP THAT LEARNS · OPTIMIZE EVERY DECISION · COST IS THE FLOOR
Platform

Close the loop.

[ 01 ]

Audit before you integrate

Stream the logs you already have and get back a short, dollar-ranked list of changes, measured on your own runs, with zero code changes.

[ 02 ]

Close the loop with one call

Instrument one decision with propose and record. The upper bound becomes a confirmed number, and every decision after learns from its own impact.
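As a rough illustration of what a propose/record loop can look like, here is a minimal sketch built on a plain epsilon-greedy bandit. The class `DecisionLoop`, its method signatures, the option names, and the costs are all hypothetical; this is not the Polir SDK, only one way the pattern could be wired up.

```python
import random
from collections import defaultdict


class DecisionLoop:
    """Hypothetical propose/record loop (epsilon-greedy). Not the Polir SDK."""

    def __init__(self, options, epsilon=0.1, seed=0):
        self.options = list(options)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.total = defaultdict(float)  # summed reward per option
        self.count = defaultdict(int)    # number of recorded outcomes per option

    def mean(self, option):
        n = self.count[option]
        return self.total[option] / n if n else 0.0

    def propose(self):
        """Pick an option and return it with the probability it was picked."""
        best = max(self.options, key=self.mean)
        explore = self.rng.random() < self.epsilon
        choice = self.rng.choice(self.options) if explore else best
        # Logging the propensity is what makes later off-policy evaluation possible.
        propensity = self.epsilon / len(self.options)
        if choice == best:
            propensity += 1.0 - self.epsilon
        return choice, propensity

    def record(self, option, reward):
        """Feed the observed outcome back so the next proposal can use it."""
        self.total[option] += reward
        self.count[option] += 1


# Made-up example: reward is negative cost, so the cheaper option should win.
loop = DecisionLoop(["small-model", "premium-model"], epsilon=0.2)
for _ in range(200):
    option, p = loop.propose()
    cost = 1.0 if option == "small-model" else 3.0
    loop.record(option, reward=-cost)
```

The point of the sketch is the shape of the loop: every proposal is logged with its propensity, and every recorded outcome shifts the next proposal.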

[ 03 ]

Total visibility over any metric in an agentic system.

Cost is just the first objective. Watch every run, tenant, and decision, then point the same loop at the metric that matters next: reliability, latency, or revenue per run.

The Problem

Agents fail and overspend.
No one can name the decision.

Silent failure 001

70 to 95% never deliver.

Most agents fail in production. Teams see that it happened, never which decision caused it.

Invisible spend 002

99.5% of all tokens are context.

Context is re-read every turn. It is the single biggest line item on the bill, and it shows up on no dashboard. The model writes back 0.4%.

Evals miss it 003

Evals grade how it sounds.

Not the decisions that set the cost, or whether the run actually worked. Looking right is not the same as costing less.

The obvious fix is wrong 004

Routing is only the third lever.

The model swap everyone reaches for first moves less than capping runaway runs and shrinking context. Eyeball the logs and you ship the wrong fix.

Introducing Polir

Logs in.
Savings out.

Observability tells you what happened. We replay every run under thousands of alternative decisions and price each one, free, on the logs you already have. Then one SDK call turns those findings into a loop that never stops optimizing.

[Diagram: Integration vs. Polir Audit]
Your Logs (Traces · Spans) → POLIR ✓ Off-Policy Eval (IPS · DR · ESS) → Interventions (Ranked · Measured)
Integration: Live Agent, Prod Risk
Polir Audit: Read-Only, ✓ Off-Policy Estimate, No Prod Risk
Phase 1, the audit

Start read-only

See the savings on your own traffic before you touch a line of production code.

Zero Integration

No SDK to install

Point the traces you already collect (LangSmith, Braintrust, OpenTelemetry) at us. Nothing in your code changes.

Upper Bounds

Dollars from your own logs

Every finding is a dollar figure computed from your own runs, not from a benchmark or a vendor demo, and clearly tagged as an upper bound.

Ranked

A punch list, not a dashboard

The handful of changes that cut the bill the most: route here, trim context there, abort a doomed run, ordered by dollars saved.

Our Approach

Real numbers, not vibes.

A closed estimator stack (propensity, off-policy evaluation, confidence intervals) built for one job: telling you what a different decision would have cost, before it touches production.
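For readers who want to see the idea in miniature, the sketch below shows textbook inverse propensity scoring (IPS), its self-normalized variant (SNIPS), and a percentile bootstrap interval. It is an illustration under assumptions, not the contents of offpolicy.py: the function names and the four-run toy log are made up for this example.

```python
import random


def ips(rewards, logged_p, target_p):
    """IPS: mean of w_i * r_i, where w_i = target_p_i / logged_p_i."""
    weights = [t / l for t, l in zip(target_p, logged_p)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips(rewards, logged_p, target_p):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    weights = [t / l for t, l in zip(target_p, logged_p)]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


def bootstrap_ci(rewards, logged_p, target_p, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval around the SNIPS estimate."""
    rng = random.Random(seed)
    n = len(rewards)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample runs with replacement
        estimates.append(snips([rewards[i] for i in idx],
                               [logged_p[i] for i in idx],
                               [target_p[i] for i in idx]))
    estimates.sort()
    lo = estimates[int(n_boot * alpha / 2)]
    hi = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Toy log (made up): reward is negative cost. The logging policy picked each
# model with probability 0.5; the target policy picks the cheap model 90% of the time.
rewards = [-1.0, -3.0, -1.0, -3.0]
logged_p = [0.5, 0.5, 0.5, 0.5]
target_p = [0.9, 0.1, 0.9, 0.1]

ips_estimate = ips(rewards, logged_p, target_p)
snips_estimate = snips(rewards, logged_p, target_p)
ci_low, ci_high = bootstrap_ci(rewards, logged_p, target_p)
```

Both estimators reweight the logged outcomes by how much more (or less) often the target policy would have made the same choice, which is why the logged propensities have to be known.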

Where the money goes

Share of a real agent bill

Context (re-read)  ██████████████░░░░░░  68%
Reasoning          ████░░░░░░░░░░░░░░░░  21%
Output             ██░░░░░░░░░░░░░░░░░░  11%

99.5% of all tokens are context, re-read every turn.

The model writes back: 0.4%.

Capabilities

Every lever, measured.

[ 01 ]

Route off the premium models.

At matched difficulty the premium third of spend commits no better and runs 2 to 3.6 times slower. Routing alone recovered about 25%, quality-neutral.

[ 02 ]

Shrink the context.

99.5% of all tokens are context, re-read every turn. Caching only makes re-reading cheaper. The untouched lever is context size.
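A small back-of-the-envelope sketch shows why re-read context dominates token volume. It assumes a simple agent loop that re-sends its full history on every turn; the function name and every number below are made up for illustration and are not measurements.

```python
def token_totals(base_context, output_per_turn, turns):
    """Return (input_tokens, output_tokens) for a loop that re-sends its history each turn."""
    input_total = 0
    history = base_context
    for _ in range(turns):
        input_total += history      # the whole history is read again as input
        history += output_per_turn  # the model's reply joins the history
    return input_total, output_per_turn * turns


# Hypothetical run: 20,000 tokens of starting context, 200 output tokens per turn, 30 turns.
inp, out = token_totals(base_context=20_000, output_per_turn=200, turns=30)
context_share = inp / (inp + out)

# Same run with the starting context cut in half.
inp_trimmed, _ = token_totals(base_context=10_000, output_per_turn=200, turns=30)
```

In this toy setup the input side accounts for more than 99% of all tokens, and halving the starting context removes 300,000 input tokens, while caching would leave the volume unchanged and only lower the price per token.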

[ 03 ]

Abort the doomed runs.

An early-warning model flags the riskiest 20% of runs from the first 8 steps and catches two-thirds of the failures before they finish burning tokens.
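The "riskiest 20% catches two-thirds of failures" figure is a recall-at-top-fraction measurement. The sketch below shows how such a number can be computed from risk scores and outcomes; the function name and the ten-run toy data are hypothetical, and the scoring model itself is out of scope here.

```python
def recall_at_top_fraction(risk_scores, failed, fraction=0.2):
    """Share of all failures that land in the top `fraction` of runs ranked by risk."""
    n_flag = max(1, int(len(risk_scores) * fraction))
    ranked = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    total_failures = sum(failed)
    caught = sum(1 for i in flagged if failed[i])
    return caught / total_failures if total_failures else 0.0


# Toy data (made up): ten runs scored after their first few steps, three of which failed.
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.7, 0.05, 0.15, 0.25, 0.35]
failed = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

recall = recall_at_top_fraction(scores, failed, fraction=0.2)
```

Here the two highest-risk runs are flagged and two of the three failures fall among them, so recall is two-thirds.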

[ 04 ]

Compose, do not add.

We score every combination of levers at the run level. The obvious fix is rarely the right one. The powerset finds the ~40% that actually moves the bill.
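To see why lever savings compose rather than add, consider the sketch below. It enumerates the powerset of levers and scores each combination per run, assuming (as a simplification made up for this example) that each lever removes a fixed fraction of whatever cost remains on a run. The lever names, run data, and helper functions are hypothetical.

```python
from itertools import combinations


def powerset(levers):
    """Yield every combination of levers, from the empty set to all of them."""
    for k in range(len(levers) + 1):
        yield from combinations(levers, k)


def composed_saving(runs, combo):
    """Fraction of total spend saved when `combo` is applied run by run."""
    before = sum(r["cost"] for r in runs)
    after = 0.0
    for r in runs:
        remaining = r["cost"]
        for lever in combo:
            # Each lever acts on what is left of the run's cost, not on the original bill.
            remaining *= 1.0 - r["saving"].get(lever, 0.0)
        after += remaining
    return (before - after) / before


# Toy runs (made up): per-run fractional saving for each lever that applies.
runs = [
    {"cost": 10.0, "saving": {"route": 0.3, "trim": 0.2}},
    {"cost": 5.0, "saving": {"abort": 0.8}},
]
levers = ["route", "trim", "abort"]

best = max(powerset(levers), key=lambda c: composed_saving(runs, c))
best_saving = composed_saving(runs, best)
naive_sum = sum(composed_saving(runs, (lever,)) for lever in levers)
```

In this toy case the three levers together save about 56% of spend, while adding up their individual savings would suggest 60%; the gap comes from routing and trimming overlapping on the same run.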

[ 05 ]

Any reward, not just cost.

Cost is the floor because the logs prove it cold. Point the same loop at reliability, latency, or revenue per run.

[ 06 ]

Per tenant, per decision.

Across 53 tenants, some run lean and some leak three times more. The loop optimizes each decision on its own surface.

Recoverable
~40%

of LLM spend

Failures caught
66%

from the first 8 steps

Token volume
99.5%

is context

Integration
Zero

logs, not SDKs

Under The Hood

Adaptive system

Off-Policy Eval
IPS · SNIPS · DR · Switch-DR
De-confounded
38-feature context vector
Shapley attribution
TreeSHAP, what drives each reward
polir://audit · read-only
λ polir init
  ██████╗   ██████╗  ██╗      ██╗ ██████╗
  ██╔══██╗ ██╔═══██╗ ██║      ██║ ██╔══██╗
  ██████╔╝ ██║   ██║ ██║      ██║ ██████╔╝
  ██╔═══╝  ██║   ██║ ██║      ██║ ██╔══██╗
  ██║      ╚██████╔╝ ███████╗ ██║ ██║  ██║
  ╚═╝       ╚═════╝  ╚══════╝ ╚═╝ ╚═╝  ╚═╝
Validated
SNIPS within ~1% of truth
Reproduced
RouterBench: 28%, close to our 25%
offpolicy.py
Open-source estimator library
Off-Policy Engine · offpolicy.py · Confirmed Lift
Early Access

See your number.

Send a sample of your agent logs. We hand back the leak map and the dollar-ranked fixes, measured on your own runs.

FAQ

Frequently asked questions.

An optimization engine for agent decisions. Send the logs you already have and we return a short, dollar-ranked list of the changes that cut the bill without losing quality, then close the loop so every decision learns from its own impact.

Observability tells you what happened; evals grade how the output sounds. We replay every run under thousands of alternative decisions and price each one: what would have happened, and which version to ship, before it touches production.

Not to start. Phase one is read-only on the logs you already have: point your existing traces (LangSmith, Braintrust, OpenTelemetry) at us and nothing in your code changes. Phase two is a single SDK call.

Off-policy evaluation, de-confounded against a 38-feature context vector so routing is isolated cleanly, not confused with task difficulty. Findings are upper bounds, clearly tagged. Public benchmarks back the method: SNIPS recovers the projection within about 1% of ground truth, and RouterBench reproduced the saving independently.

On 6,330 production runs we measured about 25% from routing alone, quality-neutral, and about 40% composing routing, context, and aborting doomed runs. The fix everyone reaches for first, swapping the model, was only the third-biggest lever.

Routing is de-confounded and quality-neutral: at matched difficulty the premium models commit no better, and run 2 to 3.6 times slower. We compose levers at the run level so the number is honest, and a pilot confirms it on a safe slice.

Your logs, your tenant, your tokens, your harness. Cross-customer learning happens on policy patterns, never your raw traces, and the estimator math is open source in offpolicy.py.

The audit runs on your existing logs in days and hands back a cost router and a failure predictor: code you can run Monday, not a slide deck.

Polir.ai

Stop guessing.
Start measuring.

Get your audit

No integration. Your logs, your tokens, your harness.