How We Measure What We Claim
Accuracy, latency, and cost-comparison numbers are only useful if you can see how they were produced. This page documents exactly how VerifyAI measures accuracy and latency, and how our competitor cost comparisons are sourced — including which figures are estimates, which assumptions the ROI calculator uses, and how to reproduce or request the underlying numbers.
Tied to a version
Every number references a specific model bundle and gold eval dataset. No floating claims.
Per-policy, not aggregate
Results are reported per policy and per condition, because a blended average hides the cases that matter.
Estimates labeled as estimates
Competitor pricing is sourced from public information and labeled as directional. We never present a figure we can't verify as fact.
How we measure accuracy
Accuracy is measured against held-out gold eval sets, reported as precision and recall per policy, and sliced by the field conditions that actually stress the models. We publish both error directions and the confusion matrix — not a single feel-good percentage.
Gold eval sets per policy
Accuracy is never measured in aggregate. Each policy (scooter_parking, bike_parking, forest_green_bay, forest_designated_bay) has its own held-out gold eval set: real captures, hand-labeled by two reviewers, with a third resolving disagreements. Gold sets are versioned in verify_ai_ml_gold_eval and frozen at the moment a model is promoted, so a score is always tied to a specific dataset and model bundle.
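The two-reviewer-plus-tiebreak rule can be sketched in a few lines. This is an illustration only: the function name and the label strings are hypothetical, not taken from our labeling tooling.

```python
from typing import Optional


def adjudicate(label_a: str, label_b: str, tiebreak: Optional[str] = None) -> str:
    """Return the gold label for one capture.

    If the two reviewers agree, their shared label is final.
    If they disagree, a third reviewer's label is required and wins.
    """
    if label_a == label_b:
        return label_a
    if tiebreak is None:
        raise ValueError("reviewers disagree; a third label is required")
    return tiebreak


# Agreement needs no third reviewer; disagreement does.
print(adjudicate("compliant", "compliant"))
print(adjudicate("compliant", "non_compliant", tiebreak="non_compliant"))
```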
Precision, recall, and F1 — not just "accuracy"
A single accuracy percentage hides the trade-off that matters operationally. We report precision (of the captures we flagged compliant, how many actually were) and recall (of the truly compliant captures, how many we caught) per policy, plus F1 as the balanced summary. For parking verification, false-accepts and false-rejects have very different costs, so we publish both error directions rather than collapsing them.
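For reference, here is how the three metrics follow from raw counts when "compliant" is treated as the positive class. The counts in the example are invented for illustration and are not VerifyAI results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 with 'compliant' as the positive class.

    tp: flagged compliant and truly compliant
    fp: flagged compliant but actually non-compliant (a false accept)
    fn: truly compliant but flagged non-compliant (a false reject)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts: 90 correct accepts, 10 false accepts, 30 false rejects.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these invented counts, precision is 0.90 and recall is 0.75, while F1 lands at roughly 0.818. The two error directions stay visible instead of being folded into one number.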
Per-policy, per-condition slicing
Headline numbers are sliced by the conditions that actually break models in the field: low light, rain and glare, partial occlusion, unusual angles, and edge geographies. A model that scores 99% in daylight and 78% at dusk is not a 99% model. Slices are part of the same gold eval run, not a separate after-the-fact report.
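A minimal sketch of per-condition slicing, assuming each gold eval record carries a condition tag alongside the gold and predicted labels. The record fields (condition, gold_compliant, pred_compliant) are hypothetical names chosen for this example.

```python
from collections import defaultdict


def slice_metrics(records):
    """Group eval records by condition and compute precision/recall per slice."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for rec in records:
        gold, pred = rec["gold_compliant"], rec["pred_compliant"]
        bucket = counts[rec["condition"]]
        if pred and gold:
            bucket["tp"] += 1
        elif pred and not gold:
            bucket["fp"] += 1
        elif gold:
            bucket["fn"] += 1
        else:
            bucket["tn"] += 1

    report = {}
    for condition, c in counts.items():
        flagged = c["tp"] + c["fp"]  # everything the model called compliant
        actual = c["tp"] + c["fn"]   # everything that truly was compliant
        report[condition] = {
            "n": sum(c.values()),
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual if actual else None,
        }
    return report


# Tiny made-up sample: the low-light slice is visibly weaker than daylight.
sample = [
    {"condition": "daylight", "gold_compliant": True, "pred_compliant": True},
    {"condition": "daylight", "gold_compliant": False, "pred_compliant": False},
    {"condition": "low_light", "gold_compliant": True, "pred_compliant": False},
    {"condition": "low_light", "gold_compliant": True, "pred_compliant": True},
    {"condition": "low_light", "gold_compliant": False, "pred_compliant": True},
]
for condition, metrics in slice_metrics(sample).items():
    print(condition, metrics)
```

Because every slice comes out of the same pass over the same records, the per-condition numbers cannot drift from the headline run.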
Confusion matrix and threshold transparency
Every gold eval run produces a full confusion matrix at the production confidence threshold (the same threshold that opens an exception in the live triage queue). We show where errors cluster — which violation category gets confused with which — instead of only the top-line rate.
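To make "confusion matrix at a threshold" concrete, the sketch below counts (gold, predicted) pairs and applies a confidence cutoff. Two things here are assumptions for illustration only: that a prediction below the threshold is routed to a needs_review bucket, and the category names themselves. Neither is a description of the production pipeline.

```python
from collections import Counter


def confusion_matrix(samples, threshold):
    """Count (gold, predicted) pairs at a given confidence threshold.

    samples: iterable of (gold_label, predicted_label, confidence).
    Predictions below the threshold are counted under 'needs_review'
    (an assumption made for this sketch).
    """
    matrix = Counter()
    for gold, predicted, confidence in samples:
        column = predicted if confidence >= threshold else "needs_review"
        matrix[(gold, column)] += 1
    return matrix


# Hypothetical categories and confidences.
samples = [
    ("compliant", "compliant", 0.95),
    ("blocking_path", "outside_bay", 0.90),  # one category confused with another
    ("outside_bay", "outside_bay", 0.60),    # correct, but below the threshold
]
for (gold, column), count in sorted(confusion_matrix(samples, threshold=0.8).items()):
    print(f"gold={gold:<14} predicted={column:<13} count={count}")
```

Off-diagonal cells such as (blocking_path, outside_bay) are where the clustering described above shows up.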
How we measure latency
VerifyAI has two pipelines — on-device ML and cloud VLM — with different latency profiles. We measure them separately, report percentiles instead of averages, and state exactly where the clock starts and stops on every number.
Two pipelines, measured separately
VerifyAI runs an on-device ML pipeline and a cloud VLM pipeline. They have fundamentally different latency profiles, so we never blend them into one number. On-device inference is timed from frame-ready to result on representative consumer hardware; cloud verification is timed server-side from request receipt to response written, excluding the client's own network round-trip.
Percentiles, not averages
Averages hide tail latency, and the tail is what users feel. We report p50, p95, and p99 for each pipeline. The <200ms figure you'll see in marketing refers to the on-device pipeline's typical (p50) inference time on target hardware — not an average across all requests, and not the cloud path.
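The gap between an average and a tail percentile is easy to see with a small example. The sketch below uses the nearest-rank percentile definition, which is an assumption on our part for illustration (other interpolation methods give slightly different values), and the latency samples are invented.

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it.

    p is an integer from 1 to 100.
    """
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Invented sample: 90 fast requests, 8 slower ones, 2 very slow outliers.
latencies_ms = [120] * 90 + [180] * 8 + [900] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

For this invented sample the mean is 140.4 ms and p50 is 120 ms, yet p99 is 900 ms. The average alone would never reveal the slow tail.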
Where the clock starts and stops
On-device: from the moment a frame is handed to the model to the moment a compliant/non-compliant result is available, excluding camera capture and UI render. Cloud: from API request receipt to response serialization, measured at the edge, excluding the device's upload time (which depends on the customer's image size and connection). We state the boundary on every number so it can be compared apples-to-apples.
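As a rough illustration of the on-device boundary, the clock wraps only the model call. The model argument below is a hypothetical callable standing in for the on-device model, and the stub is fake; frame capture and UI rendering happen outside the timed region.

```python
import time


def timed_inference(model, frame):
    """Time only the inference step: frame handed to the model -> result available."""
    start = time.perf_counter()           # clock starts when the frame is handed over
    result = model(frame)
    elapsed_ms = (time.perf_counter() - start) * 1000.0  # clock stops at the result
    return result, elapsed_ms


def fake_model(frame):
    """Stand-in for a real model; always returns the same label."""
    return "compliant"


# Capture would happen before this call and UI rendering after it,
# so neither contributes to elapsed_ms.
result, elapsed_ms = timed_inference(fake_model, frame=b"\x00" * 16)
print(result, f"{elapsed_ms:.3f} ms")
```

Using a monotonic clock such as time.perf_counter avoids skew from wall-clock adjustments during the measurement.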
Representative hardware and load
On-device numbers are measured on mid-tier devices, not flagship-only — because field operators run on whatever phone the rider or driver has. Cloud numbers are measured under realistic concurrent load, not single-request best case.
How the cost comparisons are built
Our own price — $0.008 per verification, no contract, no minimums — is a fact we control. Competitor figures are estimates assembled from public sources and customer-reported ranges, last reviewed June 2026. Here is exactly how those estimates become the comparisons you see.
Our price is a published fact
VerifyAI standard pricing is $0.008 per verification — no annual contract, no minimums, no per-seat fees. That figure is ours to state directly because we control it. Everything below is about how we source the numbers we don't control: competitor pricing.
Competitor figures are estimates from public sources
Vendors like Captur, Luna, Ravin, and Tractable do not publish per-verification pricing. The figures we show — for example Captur at roughly $40k/year base plus overage — are estimates assembled from public sources, vendor case studies, procurement disclosures, and customer-reported ranges. We label them as estimates wherever they appear. They are directional, not invoices.
How the "$0.008 vs ~$40k/yr" framing is built
The comparison reframes two pricing models onto a common axis: cost per verification at a given volume. A reported annual platform fee (e.g. ~$40k/yr base) is divided by an assumed verification volume to produce an effective per-verification cost, which is then compared to our flat $0.008. Because the vendor fee is largely fixed, their effective per-verification cost falls as volume rises and climbs sharply at low volume — which is exactly why we show it across a volume range in the calculator rather than as a single number.
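The arithmetic behind the comparison is small enough to write out. The sketch below uses the ~$40k/yr estimate and our $0.008 rate from this page; it ignores overage, and the function names are ours for illustration, not the calculator's actual code.

```python
VERIFYAI_RATE = 0.008            # USD per verification (published price)
ESTIMATED_ANNUAL_FEE = 40_000.0  # USD per year (an estimate, not a quote)


def effective_cost_per_verification(annual_fee: float, monthly_volume: int) -> float:
    """Spread a fixed annual fee over a year of verifications (overage ignored)."""
    if monthly_volume <= 0:
        raise ValueError("monthly_volume must be positive")
    return annual_fee / (monthly_volume * 12)


def break_even_monthly_volume(annual_fee: float, flat_rate: float) -> float:
    """Monthly volume at which the fixed fee and the flat rate cost the same."""
    return annual_fee / flat_rate / 12


for monthly_volume in (10_000, 100_000, 1_000_000):
    cost = effective_cost_per_verification(ESTIMATED_ANNUAL_FEE, monthly_volume)
    print(f"{monthly_volume:>9,} / month -> ${cost:.4f} per verification vs ${VERIFYAI_RATE}")

print(f"break-even: {break_even_monthly_volume(ESTIMATED_ANNUAL_FEE, VERIFYAI_RATE):,.0f} / month")
```

Under these assumptions, 100,000 verifications a month works out to roughly $0.033 each against the fixed fee, and the two pricing models cross at about 416,667 verifications a month (5 million a year). Below that volume the flat rate is cheaper. Above it the fixed fee is cheaper, before any overage is added.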
Assumptions used in the ROI calculator
The cost calculator makes its assumptions explicit: the competitor base fee (an estimate), whether overage is included, monthly verification volume (you set it), and our flat $0.008 rate. It does not assume hidden integration or services fees on either side. When an input is an estimate rather than a known figure, the calculator says so inline.
Competitor pricing disclosure
Estimates
Figures for Captur (~$40k/yr base plus overage), Luna, Ravin, and Tractable are estimates compiled from public sources, vendor case studies, procurement disclosures, and customer-reported ranges. They are directional and may be out of date. Last reviewed June 2026. Have a current quote that differs? Send it to us and we will review and correct.
How to reproduce or request the numbers
None of this is meant to be taken on faith. You can run the cost comparison on your own volume, request the underlying per-policy eval results, or bring your own captures for a blind evaluation before you commit to anything.
Run the calculator with your own volume
The cost calculator is interactive. Plug in your actual monthly verification volume and the competitor base fee you've been quoted, and it recomputes the effective per-verification comparison for your situation rather than ours.
Request the underlying eval numbers
We'll share the per-policy gold eval results — precision, recall, F1, confusion matrix, and condition slices — for the policies relevant to your deployment, along with the model bundle and dataset versions they were measured against. Technical buyers and procurement teams can request these under NDA via the contact form.
Bring your own gold set
The most honest benchmark is your own data. We can run a blind evaluation on a held-out set of your captures so you see precision and recall on your environment, not ours. This is the recommended path before any production rollout.
Tell us when a figure is stale
Competitor pricing moves. If you have a current quote that contradicts an estimate on our site, send it and we'll review and correct the figure. The last-reviewed date in the competitor pricing disclosure tells you how fresh our public-source pass is.
What the headline figures actually mean
When you see <200ms, 99.2%, or 50M+ elsewhere on this site: the latency figure refers to typical (p50) on-device inference on target hardware, the accuracy figure is a per-policy result against a versioned gold eval set, and the volume figure reflects cumulative verifications processed. The slices, percentiles, and dataset versions behind each are available on the benchmarks page or on request.
Dig into the numbers
The benchmarks, the calculator, the head-to-head comparison, and a direct line to the underlying data.
Captur vs VerifyAI cost calculator
Interactive: enter your volume and see the effective per-verification comparison.
Request the raw numbers
Ask for the underlying eval data, or run a blind eval on your own captures.
Evaluate it on your own data
Start a free sandbox with a $5 credit — no card required — and run your own captures, or ask us to run a blind eval against your gold set before you switch.