Polir, an optimization engine for agent decisions

⚙Platform

Close the loop.

[ 01 ]

Audit before you integrate

Stream the logs you already have and get back a short, dollar-ranked list of changes, measured on your own runs, with zero code changes.

[ 02 ]

Close the loop with one call

Instrument one decision with propose and record. The upper bound becomes a confirmed number, and every decision after learns from its own impact.

[ 03 ]

Total visibility over any metric in an agentic system.

Cost is just the first objective. Watch every run, tenant, and decision, then point the same loop at the metric that matters next, reliability, latency, or revenue per run.

✕The Problem

Agents fail and overspend.
No one can name the decision.

Monitoring Active

Telemetry

Silent failure001

70 to 95% never deliver.

Most agents fail in production. Teams see that it happened, never which decision caused it.

Invisible spend002

99.5% of every token is context.

Re-read every turn, the single biggest line item on the bill, and it shows up on no dashboard. The model writes back 0.4%.

Evals miss it003

Evals grade how it sounds.

Not the decisions that set the cost, or whether the run actually worked. Looking right is not the same as costing less.

The obvious fix is wrong004

Routing is only the third lever.

The model swap everyone reaches for first moves less than capping runaway runs and shrinking context. Eyeball the logs and you ship the wrong fix.

✓Introducing Polir

Logs in.
Savings out.

Observability tells you what happened. We replay every run under thousands of alternative decisions and price each one, free, on the logs you already have. Then one SDK call turns those findings into a loop that never stops optimizing.

Integration

Polir Audit

Your LogsTraces \u00b7 Spans

POLIR✓ Off-Policy Eval

IPSDRESS

InterventionsRanked \u00b7 Measured

Prod Risk

Live Agent

✕Read-Only

Polir Audit✓ Off-Policy Estimate

No Prod Risk

Phase 1, the audit

Start read-only

See the savings on your own traffic before you touch a line of production code.

Zero Integration

No SDK to install

Point the traces you already collect, LangSmith, Braintrust, OpenTelemetry, at us. Nothing in your code changes.

Upper Bounds

Dollars from your own logs

Every finding is a dollar figure computed from your own runs, and clearly tagged as an upper bound, not a benchmark or a vendor demo.

Ranked

A punch list, not a dashboard

The handful of changes that cut the bill the most: route here, trim context there, abort a doomed run, ordered by dollars saved.

◯Our Approach

Real numbers, not vibes.

A closed estimator stack, propensity, off-policy evaluation, confidence intervals, built for one job: telling you what a different decision would have cost, before it touches production.

Where the money goes

Share of a real agent bill

Context (re-read)██████████████░░░░░░68%

Reasoning████░░░░░░░░░░░░░░░░21%

Output██░░░░░░░░░░░░░░░░░░11%

99.5% of every token is context, re-read every turn.

The model writes back: 0.4%.

⚙Capabilities

Every lever, measured.

[ 01 ]

Route off the premium models.

At matched difficulty the premium third of spend commits no better and runs 2 to 3.6 times slower. Routing alone recovered about 25%, quality-neutral.

[ 02 ]

Shrink the context.

99.5% of every token is context, re-read every turn. Caching only makes re-reading cheaper. The untouched lever is context size.

[ 03 ]

Abort the doomed runs.

An early-warning model flags the riskiest 20% of runs from the first 8 steps and catches two-thirds of the failures before they finish burning tokens.

[ 04 ]

Compose, do not add.

We score every combination of levers at the run level. The obvious fix is rarely the right one. The powerset finds the ~40% that actually moves the bill.

[ 05 ]

Any reward, not just cost.

Cost is the floor because the logs prove it cold. Point the same loop at reliability, latency, or revenue per run.

[ 06 ]

Per tenant, per decision.

Across 53 tenants, some run lean and some leak three times more. The loop optimizes each decision on its own surface.

Recoverable

~40%

of LLM spend

Failures caught

66%

from the first 8 steps

Token volume

99.5%

is context

Integration

Zero

logs, not SDKs

☷Under The Hood

Adaptive system

Off-Policy Eval

IPS · SNIPS · DR · Switch-DR

De-confounded

38-feature context vector

Shapley attribution

TreeSHAP, what drives each reward

polir://audit · read-only

λ polir init

  ██████╗   ██████╗  ██╗      ██╗ ██████╗
  ██╔══██╗ ██╔═══██╗ ██║      ██║ ██╔══██╗
  ██████╔╝ ██║   ██║ ██║      ██║ ██████╔╝
  ██╔═══╝  ██║   ██║ ██║      ██║ ██╔══██╗
  ██║      ╚██████╔╝ ███████╗ ██║ ██║  ██║
  ╚═╝       ╚═════╝  ╚══════╝ ╚═╝ ╚═╝  ╚═╝

Validated

SNIPS within ~1% of truth

Reproduced

RouterBench, 28% close to our 25%

offpolicy.py

Open-source estimator library

Off-Policy Engineoffpolicy.pyConfirmed Lift

✉Early Access

See your number.

Send a sample of your agent logs. We hand back the leak map and the dollar-ranked fixes, measured on your own runs.

?FAQ

Frequently asked questions.

An optimization engine for agent decisions. Send the logs you already have and we return a short, dollar-ranked list of the changes that cut the bill without losing quality, then close the loop so every decision learns from its own impact.

Observability tells you what happened; evals grade how the output sounds. We replay every run under thousands of alternative decisions and price each one: what would have happened, and which version to ship, before it touches production.

Not to start. Phase one is read-only on the logs you already have: point your existing traces, LangSmith, Braintrust, OpenTelemetry, at us and nothing in your code changes. Phase two is a single SDK call.

Off-policy evaluation, de-confounded against a 38-feature context vector so routing is isolated cleanly, not confused with task difficulty. Findings are upper bounds, clearly tagged. Public benchmarks back the method: SNIPS recovers the projection within about 1% of ground truth, and RouterBench reproduced the saving independently.

On 6,330 production runs we measured about 25% from routing alone, quality-neutral, and about 40% composing routing, context, and aborting doomed runs. The fix everyone reaches for first, swapping the model, was only the third-biggest lever.

Routing is de-confounded and quality-neutral: at matched difficulty the premium models commit no better, and run 2 to 3.6 times slower. We compose levers at the run level so the number is honest, and a pilot confirms it on a safe slice.

Your logs, your tenant, your tokens, your harness. Cross-customer learning happens on policy patterns, never your raw traces, and the estimator math is open source in offpolicy.py.

The audit runs on your existing logs in days, and hands back a cost router and a failure predictor, code you can run Monday, not a slide deck.

Your agents are leaking money.

Close the loop.

Audit before you integrate

Close the loop with one call

Total visibility over any metric in an agentic system.

Agents fail and overspend.
No one can name the decision.

70 to 95% never deliver.

99.5% of every token is context.

Evals grade how it sounds.

Routing is only the third lever.

Logs in.
Savings out.

Start read-only

No SDK to install

Dollars from your own logs

A punch list, not a dashboard

Real numbers, not vibes.

Model the missing propensity.

Price the counterfactual.

Close the loop.

Every lever, measured.

Route off the premium models.

Shrink the context.

Abort the doomed runs.

Compose, do not add.

Any reward, not just cost.

Per tenant, per decision.

Adaptive system

See your number.

Frequently asked questions.

Stop guessing.
Start measuring.

Your agents are leaking money.

Close the loop.

Audit before you integrate

Close the loop with one call

Total visibility over any metric in an agentic system.

Agents fail and overspend.No one can name the decision.

70 to 95% never deliver.

99.5% of every token is context.

Evals grade how it sounds.

Routing is only the third lever.

Logs in.Savings out.

Start read-only

No SDK to install

Dollars from your own logs

A punch list, not a dashboard

Real numbers, not vibes.

Model the missing propensity.

Price the counterfactual.

Close the loop.

Every lever, measured.

Route off the premium models.

Shrink the context.

Abort the doomed runs.

Compose, do not add.

Any reward, not just cost.

Per tenant, per decision.

Adaptive system

See your number.

Frequently asked questions.

Stop guessing.Start measuring.

Agents fail and overspend.
No one can name the decision.

Logs in.
Savings out.

Stop guessing.
Start measuring.