Last reviewed June 2026

How We Measure What We Claim

Accuracy, latency, and cost-comparison numbers are only useful if you can see how they were produced. This page documents exactly how VerifyAI measures accuracy and latency, and how our competitor cost comparisons are sourced — including which figures are estimates, which assumptions the ROI calculator uses, and how to reproduce or request the underlying numbers.

Tied to a version

Every number references a specific model bundle and gold eval dataset. No floating claims.

Per-policy, not aggregate

Results are reported per policy and per condition, because a blended average hides the cases that matter.

Estimates labeled as estimates

Competitor pricing is sourced from public information and labeled directional — never presented as fact we can’t verify.

Accuracy

How we measure accuracy

Accuracy is measured against held-out gold eval sets, reported as precision and recall per policy, and sliced by the field conditions that actually stress the models. We publish both error directions and the confusion matrix — not a single feel-good percentage.

Gold eval sets per policy

Accuracy is never measured in aggregate. Each policy (scooter_parking, bike_parking, forest_green_bay, forest_designated_bay) has its own held-out gold eval set: real captures, hand-labeled by two reviewers, with a third resolving disagreements. Gold sets are versioned in verify_ai_ml_gold_eval and frozen at the moment a model is promoted, so a score is always tied to a specific dataset and model bundle.
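The labeling and freezing rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions and is not taken from verify_ai_ml_gold_eval. `adjudicate`, `fingerprint`, and `EvalProvenance` are hypothetical names, and hashing the frozen set is one assumed way to make "frozen" checkable: if a single label changes, the fingerprint changes.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def adjudicate(label_a: str, label_b: str, tiebreaker: Optional[str] = None) -> str:
    """Gold label for one capture: two independent reviewers, a third breaks ties."""
    if label_a == label_b:
        return label_a
    if tiebreaker is None:
        raise ValueError("reviewers disagree and no third-reviewer label was supplied")
    return tiebreaker


def fingerprint(gold_set: dict[str, str]) -> str:
    """SHA-256 over sorted (capture_id, label) pairs, so insertion order does not matter."""
    canonical = json.dumps(sorted(gold_set.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EvalProvenance:
    """Ties a published score to the exact gold set and model bundle it was measured on."""
    policy: str
    model_bundle: str
    gold_set_version: str
    gold_set_sha256: str


# Illustrative data only: capture IDs, labels, and version strings are hypothetical.
gold = {
    "cap_001": adjudicate("compliant", "compliant"),
    "cap_002": adjudicate("compliant", "non_compliant", tiebreaker="non_compliant"),
}
provenance = EvalProvenance(
    policy="scooter_parking",
    model_bundle="bundle-example",
    gold_set_version="v-example",
    gold_set_sha256=fingerprint(gold),
)
```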

Precision, recall, and F1 — not just "accuracy"

A single accuracy percentage hides the trade-off that matters operationally. We report precision (of the captures we flagged compliant, how many actually were) and recall (of the truly compliant captures, how many we caught) per policy, plus F1 as the balanced summary. For parking verification, false-accepts and false-rejects have very different costs, so we publish both error directions rather than collapsing them.
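A minimal sketch of the per-policy arithmetic, assuming "compliant" is the positive class as in the definitions above. The function and field names are hypothetical, and the example data is made up to show how a model can score perfectly on precision while catching only half of the compliant captures.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PolicyMetrics:
    precision: float
    recall: float
    f1: float
    false_accepts: int  # flagged compliant, actually non-compliant (false positives)
    false_rejects: int  # actually compliant, flagged non-compliant (false negatives)


def policy_metrics(y_true: list[bool], y_pred: list[bool]) -> PolicyMetrics:
    """Precision, recall, and F1 with compliant (True) as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return PolicyMetrics(precision, recall, f1, false_accepts=fp, false_rejects=fn)


# Made-up labels: four truly compliant captures, two non-compliant.
y_true = [True, True, True, True, False, False]
y_pred = [True, True, False, False, False, False]
metrics = policy_metrics(y_true, y_pred)
```

Here precision is 1.0 (no false accepts) but recall is 0.5 (two false rejects), so F1 lands at about 0.67. A single "accuracy" figure would blur exactly this distinction.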

Per-policy, per-condition slicing

Headline numbers are sliced by the conditions that actually break models in the field: low light, rain and glare, partial occlusion, unusual angles, and edge geographies. A model that scores 99% in daylight and 78% at dusk is not a 99% model. Slices are part of the same gold eval run, not a separate after-the-fact report.
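The daylight-versus-dusk point can be illustrated with a small sketch that computes accuracy per condition from the same set of eval records and compares it to the blended figure. The record layout and function names are assumptions for illustration, and the counts are invented.

```python
from __future__ import annotations

from collections import defaultdict


def accuracy_by_condition(records) -> dict[str, float]:
    """Accuracy computed separately for each field condition in one eval run.

    records: iterable of (condition, truly_compliant, predicted_compliant).
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for condition, truth, predicted in records:
        total[condition] += 1
        correct[condition] += int(truth == predicted)
    return {condition: correct[condition] / total[condition] for condition in total}


def blended_accuracy(records) -> float:
    """The single number a blended average would report."""
    records = list(records)
    return sum(truth == predicted for _, truth, predicted in records) / len(records)


# Invented counts: 90 daylight captures all correct, 10 dusk captures with 6 correct.
records = (
    [("daylight", True, True)] * 90
    + [("dusk", True, True)] * 6
    + [("dusk", True, False)] * 4
)
by_condition = accuracy_by_condition(records)
overall = blended_accuracy(records)
```

The blended figure comes out at 96% while dusk sits at 60%, which is the gap the slicing is meant to expose.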

Confusion matrix and threshold transparency

Every gold eval run produces a full confusion matrix at the production confidence threshold (the same threshold that opens an exception in the live triage queue). We show where errors cluster — which violation category gets confused with which — instead of only the top-line rate.
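A minimal sketch of a multi-category confusion matrix at a single confidence threshold. The category names, confidences, and the 0.80 threshold are hypothetical. So is the `exception` column: it assumes a below-threshold prediction is counted as routed to triage rather than as a label.

```python
from __future__ import annotations

from collections import Counter

EXCEPTION = "exception"  # hypothetical column for results routed to the triage queue


def confusion_matrix(rows, threshold: float) -> Counter:
    """Count (true_label, column) pairs at one confidence threshold.

    rows: iterable of (true_label, predicted_label, confidence).
    Assumption: a prediction below the threshold is counted in the EXCEPTION column.
    """
    counts: Counter = Counter()
    for true_label, predicted_label, confidence in rows:
        column = predicted_label if confidence >= threshold else EXCEPTION
        counts[(true_label, column)] += 1
    return counts


def top_confusions(counts: Counter, n: int = 3):
    """Largest off-diagonal cells: which category is mistaken for which."""
    off_diagonal = [
        (pair, count)
        for pair, count in counts.items()
        if pair[0] != pair[1] and pair[1] != EXCEPTION
    ]
    return sorted(off_diagonal, key=lambda item: item[1], reverse=True)[:n]


# Hypothetical categories and confidences, for illustration only.
rows = [
    ("compliant", "compliant", 0.97),
    ("outside_bay", "compliant", 0.91),
    ("outside_bay", "compliant", 0.88),
    ("blocking_sidewalk", "outside_bay", 0.86),
    ("blocking_sidewalk", "blocking_sidewalk", 0.52),
]
matrix = confusion_matrix(rows, threshold=0.80)
worst = top_confusions(matrix)
```

In this toy run the largest cluster is `outside_bay` captures accepted as compliant, which is the kind of pattern a top-line error rate would not reveal.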

Latency

How we measure latency

VerifyAI has two pipelines — on-device ML and cloud VLM — with different latency profiles. We measure them separately, report percentiles instead of averages, and state exactly where the clock starts and stops on every number.

Two pipelines, measured separately

VerifyAI runs an on-device ML pipeline and a cloud VLM pipeline. They have fundamentally different latency profiles, so we never blend them into one number. On-device inference is timed from frame-ready to result on representative consumer hardware; cloud verification is timed server-side from request receipt to response written, excluding the client's own network round-trip.

Percentiles, not averages

Averages hide tail latency, and the tail is what users feel. We report p50, p95, and p99 for each pipeline. The <200ms figure you'll see in marketing refers to the on-device pipeline's typical (p50) inference time on target hardware — not an average across all requests, and not the cloud path.
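To see why the tail matters, here is a small sketch using the nearest-rank percentile definition. That is one common choice; this page does not say which percentile method VerifyAI uses. The samples are synthetic and do not represent measured VerifyAI latency.

```python
from __future__ import annotations

import math


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms) -> dict[str, float]:
    """p50/p95/p99 alongside the mean, so the two can be compared."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "mean": sum(samples_ms) / len(samples_ms),
    }


# Synthetic samples in milliseconds: 94 fast inferences and a slow tail of 6.
samples = [120.0] * 94 + [900.0] * 6
summary = latency_summary(samples)
```

Here the mean (166.8 ms) looks comfortable, while p95 and p99 are both 900 ms: the tail an average hides.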

Where the clock starts and stops

On-device: from the moment a frame is handed to the model to the moment a compliant/non-compliant result is available, excluding camera capture and UI render. Cloud: from API request receipt to response serialization, measured at the edge, excluding the device's upload time (which depends on the customer's image size and connection). We state the boundary on every number so it can be compared apples-to-apples.
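The on-device boundary can be made explicit by wrapping only the model call. In this minimal sketch `capture_frame`, `model`, and `render` are hypothetical stand-ins for app code. The injectable `clock` (defaulting to Python's `time.perf_counter`) is an assumption added so the timing logic can be checked without real hardware. The cloud boundary would follow the same pattern, with the clock wrapped around request handling on the server.

```python
from __future__ import annotations

import time
from typing import Any, Callable, Tuple


def timed_inference(
    model: Callable[[Any], Any],
    frame: Any,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, float]:
    """Time only the model call: start when the frame is handed over, stop when the result exists."""
    start = clock()
    result = model(frame)
    elapsed_ms = (clock() - start) * 1000.0
    return result, elapsed_ms


# Hypothetical stand-ins for the surrounding app code.
def capture_frame() -> bytes:
    time.sleep(0.05)  # simulated camera capture; runs before the clock starts
    return b"frame-bytes"


def model(frame: bytes) -> str:
    return "compliant"


def render(result: str) -> None:
    pass  # UI render; runs after the clock stops


frame = capture_frame()
result, inference_ms = timed_inference(model, frame)
render(result)
```

Because capture and render sit outside `timed_inference`, the simulated 50 ms capture delay never appears in `inference_ms`.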

Representative hardware and load

On-device numbers are measured on mid-tier devices, not flagship-only — because field operators run on whatever phone the rider or driver has. Cloud numbers are measured under realistic concurrent load, not single-request best case.

Cost comparisons

How the cost comparisons are built

Our own price — $0.008 per verification, no contract, no minimums — is a fact we control. Competitor figures are estimates assembled from public sources and customer-reported ranges, last reviewed June 2026. Here is exactly how those estimates become the comparisons you see.

Our price is a published fact

VerifyAI standard pricing is $0.008 per verification — no annual contract, no minimums, no per-seat fees. That figure is ours to state directly because we control it. Everything below is about how we source the numbers we don't control: competitor pricing.

Competitor figures are estimates from public sources

Vendors like Captur, Luna, Ravin, and Tractable do not publish per-verification pricing. The figures we show — for example Captur at roughly $40k/year base plus overage — are estimates assembled from public sources, vendor case studies, procurement disclosures, and customer-reported ranges. We label them as estimates wherever they appear. They are directional, not invoices.

How the "$0.008 vs ~$40k/yr" framing is built

The comparison reframes two pricing models onto a common axis: cost per verification at a given volume. A reported annual platform fee (e.g. ~$40k/yr base) is divided by an assumed verification volume to produce an effective per-verification cost, which is then compared to our flat $0.008. Because the vendor fee is largely fixed, their effective per-verification cost falls as volume rises and climbs sharply at low volume — which is exactly why we show it across a volume range in the calculator rather than as a single number.
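The reframing reduces to one division. This sketch uses the ~$40k/yr base figure, which is itself an estimate, and deliberately leaves out overage. The names are hypothetical.

```python
from __future__ import annotations

VERIFYAI_RATE = 0.008  # published flat price per verification, USD
ESTIMATED_ANNUAL_BASE_FEE = 40_000.0  # ESTIMATE: reported ~$40k/yr base; overage not included


def effective_cost_per_verification(annual_fee: float, monthly_volume: int) -> float:
    """Spread a fixed annual fee across twelve months of verifications at this volume."""
    if monthly_volume <= 0:
        raise ValueError("monthly_volume must be positive")
    return annual_fee / (monthly_volume * 12)


for volume in (10_000, 100_000, 1_000_000):
    vendor = effective_cost_per_verification(ESTIMATED_ANNUAL_BASE_FEE, volume)
    print(f"{volume:>9,} per month: vendor base fee ~${vendor:.4f} vs VerifyAI ${VERIFYAI_RATE:.3f}")
```

At these illustrative volumes the base fee alone works out to roughly $0.33, $0.033, and $0.0033 per verification. Overage, which this sketch omits, would be added on top at higher volumes, which is why the calculator treats it as a separate input.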

Assumptions used in the ROI calculator

The cost calculator makes its assumptions explicit: the competitor base fee (an estimate), whether overage is included, monthly verification volume (you set it), and our flat $0.008 rate. It does not assume hidden integration or services fees on either side. When an input is an estimate rather than a known figure, the calculator says so inline.
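A minimal sketch of how a calculator could keep these assumptions explicit and flag estimates inline. Every name here is hypothetical. The overage model (a per-verification rate above an included monthly volume) is an assumption for illustration: the page only says that whether overage is included is an input. As on the page, no integration or services fees are added on either side.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Input:
    value: float
    is_estimate: bool  # estimated inputs get an inline note


@dataclass(frozen=True)
class Assumptions:
    competitor_annual_base_fee: Input
    competitor_overage_rate: Input      # USD per verification above the included volume (assumed model)
    competitor_included_monthly: int    # verifications covered by the base fee (assumed model)
    include_overage: bool
    monthly_volume: int                 # set by the user
    verifyai_rate: float = 0.008        # flat published rate; no contract or minimums


def monthly_costs(a: Assumptions) -> dict[str, float]:
    """Monthly spend on each side; no integration or services fees on either."""
    competitor = a.competitor_annual_base_fee.value / 12
    if a.include_overage:
        over = max(0, a.monthly_volume - a.competitor_included_monthly)
        competitor += over * a.competitor_overage_rate.value
    return {"competitor": competitor, "verifyai": a.monthly_volume * a.verifyai_rate}


def inline_notes(a: Assumptions) -> list[str]:
    """Labels shown next to any input that is an estimate rather than a known figure."""
    notes = []
    if a.competitor_annual_base_fee.is_estimate:
        notes.append("Competitor base fee is an estimate from public sources.")
    if a.include_overage and a.competitor_overage_rate.is_estimate:
        notes.append("Competitor overage rate is an estimate.")
    return notes


def break_even_monthly_volume(annual_base_fee: float, rate: float = 0.008) -> float:
    """Monthly volume at which the flat rate equals the base fee alone (overage ignored)."""
    return annual_base_fee / 12 / rate


# Illustrative inputs: the overage rate and included volume are made up.
example = Assumptions(
    competitor_annual_base_fee=Input(40_000.0, is_estimate=True),
    competitor_overage_rate=Input(0.05, is_estimate=True),
    competitor_included_monthly=50_000,
    include_overage=True,
    monthly_volume=80_000,
)
costs = monthly_costs(example)
notes = inline_notes(example)
```

With an estimated $40k/yr base and overage ignored, the break-even point is about 416,667 verifications per month (roughly 5M per year). Below that volume, the flat rate costs less than the base fee alone.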

Competitor pricing disclosure

Estimates

Figures for Captur (~$40k/yr base plus overage), Luna, Ravin, and Tractable are estimates compiled from public sources, vendor case studies, procurement disclosures, and customer-reported ranges. They are directional and may be out of date. Last reviewed June 2026. Have a current quote that differs? Send it to us and we will review and correct.

Reproduce / request

How to reproduce or request the numbers

None of this is meant to be taken on faith. You can run the cost comparison on your own volume, request the underlying per-policy eval results, or bring your own captures for a blind evaluation before you commit to anything.

Run the calculator with your own volume

The cost calculator is interactive. Plug in your actual monthly verification volume and the competitor base fee you've been quoted, and it recomputes the effective per-verification comparison for your situation rather than ours.

Request the underlying eval numbers

We'll share the per-policy gold eval results — precision, recall, F1, confusion matrix, and condition slices — for the policies relevant to your deployment, along with the model bundle and dataset versions they were measured against. Technical buyers and procurement teams can request these under NDA via the contact form.

Bring your own gold set

The most honest benchmark is your own data. We can run a blind evaluation on a held-out set of your captures so you see precision and recall on your environment, not ours. This is the recommended path before any production rollout.

Tell us when a figure is stale

Competitor pricing moves. If you have a current quote that contradicts an estimate on our site, send it and we'll review and correct it. The last-reviewed date on this page (June 2026) tells you how fresh our public-source pass is.

What the headline figures actually mean

When you see <200ms, 99.2%, or 50M+ elsewhere on this site: the latency figure refers to typical (p50) on-device inference on target hardware, the accuracy figure is a per-policy result against a versioned gold eval set, and the volume figure reflects cumulative verifications processed. The slices, percentiles, and dataset versions behind each are available on the benchmarks page or on request.

Dig into the numbers

The benchmarks, the calculator, the head-to-head comparison, and a direct line to the underlying data.

Benchmarks

Per-policy accuracy and latency results measured against our gold eval sets.

Captur vs VerifyAI cost calculator

Interactive: enter your volume and see the effective per-verification comparison.

VerifyAI vs Captur

Side-by-side on pricing model, deployment, and micromobility fit.

Request the raw numbers

Ask for the underlying eval data, or run a blind eval on your own captures.

Contact sales

Evaluate it on your own data

Start a free sandbox with a $5 credit — no card required — and run your own captures, or ask us to run a blind eval against your gold set before you switch.

Get in Touch

Questions about pricing, integrations, or custom deployments? We'd love to hear from you.