Detecting model drift in production
A model that scored well on the test set can still misbehave six weeks later — new bike racks, new lighting, new phones, new partner cities. Drift detection is how we catch that before partners do.
We use the classic Population Stability Index (PSI) over a fixed reference snapshot captured at promotion time. Every hour a cron job re-bins the last hour of inference feature vectors and compares them to the reference. If the score crosses an alert threshold, Sentry warns us and the bundle becomes a candidate for rollback.
PSI in one paragraph
PSI compares two distributions over the same bins. For each bin i:
psi_i = (live_p_i - ref_p_i) × ln(live_p_i / ref_p_i)
PSI = Σ over every bin of every feature, summed across feature and result histograms
Each histogram is normalised to proportions before being summed. Both
proportions are clamped into [1e-6, 1 - 1e-6] before the log to keep
it defined when a bin is empty on one side. The reported drift_score
is the sum of the feature-stats PSI and the result-stats PSI for
that bundle.
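To make the formula concrete, here is a minimal sketch of the per-histogram calculation, including the normalisation and the [1e-6, 1 - 1e-6] clamp. It is not the code in lib/verify-ai/ml/drift.ts: the names psiForHistogram and psiForHistograms, and the representation of a histogram as a map from bin label to raw count, are assumptions made for illustration.

```typescript
// Illustrative only: helper names and the histogram shape are assumptions,
// not the real computePSI() from lib/verify-ai/ml/drift.ts.
type Histogram = Record<string, number>; // bin label -> raw count

const EPS = 1e-6;

// Clamp a proportion into [EPS, 1 - EPS] so the log stays defined for empty bins.
function clampProportion(p: number): number {
  return Math.min(1 - EPS, Math.max(EPS, p));
}

// Normalise raw counts to proportions over a shared, ordered list of bins.
function toProportions(hist: Histogram, bins: string[]): number[] {
  const total = bins.reduce((sum, bin) => sum + (hist[bin] ?? 0), 0);
  return bins.map((bin) => (total === 0 ? 0 : (hist[bin] ?? 0) / total));
}

// PSI for one histogram pair: sum over bins of (live - ref) * ln(live / ref).
function psiForHistogram(ref: Histogram, live: Histogram): number {
  const bins = Array.from(
    new Set(Object.keys(ref).concat(Object.keys(live))),
  ).sort();
  const refP = toProportions(ref, bins).map(clampProportion);
  const liveP = toProportions(live, bins).map(clampProportion);
  let psi = 0;
  for (let i = 0; i < bins.length; i++) {
    psi += (liveP[i] - refP[i]) * Math.log(liveP[i] / refP[i]);
  }
  return psi;
}

// PSI summed across a set of named histograms (for example, all feature histograms).
function psiForHistograms(
  ref: Record<string, Histogram>,
  live: Record<string, Histogram>,
): number {
  return Object.keys(ref).reduce(
    (sum, name) => sum + psiForHistogram(ref[name], live[name] ?? {}),
    0,
  );
}
```

As a worked example, a two-bin histogram that is 50/50 in the reference and 80/20 live gives 0.3 × ln(1.6) + (-0.3) × ln(0.4) ≈ 0.416. Identical histograms give exactly 0.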
We use the interpretation that is standard across the industry:
| PSI | Verdict |
| ----------- | ------------------------------------------------------- |
| < 0.10 | Stable — distributions match. |
| 0.10–0.25 | Minor drift — keep watching. |
| > 0.25 | Alert — investigate before this bundle ships wider. |
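The table above can be expressed as a small classifier. This is a sketch only: classifyPsi, DriftVerdict and MINOR_DRIFT_THRESHOLD are hypothetical names, and the treatment of the exact boundary values (0.10 counts as minor drift, 0.25 does not alert) is our reading of the table rather than documented behaviour.

```typescript
// Hypothetical helper that mirrors the verdict table; not part of drift.ts.
type DriftVerdict = 'stable' | 'minor' | 'alert';

const MINOR_DRIFT_THRESHOLD = 0.1; // assumed constant name
const ALERT_THRESHOLD = 0.25; // same value as DRIFT_ALERT_THRESHOLD

function classifyPsi(psi: number): DriftVerdict {
  if (psi > ALERT_THRESHOLD) return 'alert'; // strictly greater than 0.25
  if (psi >= MINOR_DRIFT_THRESHOLD) return 'minor'; // 0.10 to 0.25 inclusive
  return 'stable'; // below 0.10
}
```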
The math lives in lib/verify-ai/ml/drift.ts as computePSI() and
computeStats(). computeStats() takes a batch of feature vectors
plus their compliance outcomes and bins them into a fixed set of
histograms: detection_count, primary_confidence, image_quality,
has_vehicle on the feature side; compliance and primary_class on
the result side. The alert threshold is exported as
DRIFT_ALERT_THRESHOLD = 0.25.
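For readers who want a feel for the binning step, the sketch below shows one way a batch could be turned into the six histograms named above. Everything specific here is an assumption: the field types, the five equal-width score bins, the "3+" cap on detection counts, and the function name sketchComputeStats. The real bin edges live in computeStats().

```typescript
// Assumed field types and bin edges, for illustration only.
interface FeatureVector {
  detection_count: number;
  primary_confidence: number; // assumed to lie in [0, 1]
  image_quality: number; // assumed to lie in [0, 1]
  has_vehicle: boolean;
}
interface VerificationResult {
  is_compliant: boolean;
  primary_class: string;
}
type BinCounts = Record<string, number>;

function increment(hist: BinCounts, bin: string): void {
  hist[bin] = (hist[bin] ?? 0) + 1;
}

// Bucket a [0, 1] score into equal-width bins labelled by their lower edge.
function scoreBin(value: number, binCount = 5): string {
  const clamped = Math.min(1, Math.max(0, value));
  const index = Math.min(binCount - 1, Math.floor(clamped * binCount));
  return (index / binCount).toFixed(1);
}

function sketchComputeStats(
  vectors: FeatureVector[],
  results: VerificationResult[],
) {
  const featureHistograms = {
    detection_count: {} as BinCounts,
    primary_confidence: {} as BinCounts,
    image_quality: {} as BinCounts,
    has_vehicle: {} as BinCounts,
  };
  const resultHistograms = {
    compliance: {} as BinCounts,
    primary_class: {} as BinCounts,
  };
  for (const v of vectors) {
    // Cap large counts at "3+" so the set of bins stays fixed.
    const countBin = v.detection_count >= 3 ? '3+' : String(v.detection_count);
    increment(featureHistograms.detection_count, countBin);
    increment(featureHistograms.primary_confidence, scoreBin(v.primary_confidence));
    increment(featureHistograms.image_quality, scoreBin(v.image_quality));
    increment(featureHistograms.has_vehicle, String(v.has_vehicle));
  }
  for (const r of results) {
    increment(resultHistograms.compliance, r.is_compliant ? 'compliant' : 'non_compliant');
    increment(resultHistograms.primary_class, r.primary_class);
  }
  return {
    sample_count: vectors.length,
    feature_stats: { histograms: featureHistograms },
    result_stats: { histograms: resultHistograms },
  };
}
```

Keeping the bins fixed matters because PSI is only meaningful when the reference and live histograms share the same bins.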
Reference snapshots
When a bundle is promoted to bundle_tier = 'production' we freeze a
reference snapshot into verify_ai_model_reference_stats. The
snapshot stores:
- bundle_id: the bundle this snapshot baselines.
- snapshot_type: 'promotion' (auto) or 'manual'.
- feature_stats (jsonb): the feature_stats.histograms payload produced by computeStats().
- result_stats (jsonb): same shape, for the result histograms.
- sample_count: so we can detect snapshots with too little data.
No snapshot, no PSI. The hourly cron skips bundles that don't have a
reference row (it just records the live stats with drift_score = null).
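A sketch of how a caller might decide whether a reference is usable, based on the columns listed above. The ReferenceSnapshot type is a simplified guess at the row shape, checkReference is a hypothetical helper, and MIN_REFERENCE_SAMPLES = 500 is an invented placeholder: this page does not state what counts as "too little data".

```typescript
// Simplified, assumed row shape for verify_ai_model_reference_stats.
interface ReferenceSnapshot {
  bundle_id: string;
  snapshot_type: 'promotion' | 'manual';
  feature_stats: Record<string, Record<string, number>>;
  result_stats: Record<string, Record<string, number>>;
  sample_count: number;
}

// Placeholder value; the real minimum (if any) is not documented here.
const MIN_REFERENCE_SAMPLES = 500;

type ReferenceCheck = 'missing' | 'too_small' | 'ok';

function checkReference(
  snapshot: ReferenceSnapshot | null,
  minSamples: number = MIN_REFERENCE_SAMPLES,
): ReferenceCheck {
  if (snapshot === null) return 'missing'; // no snapshot, no PSI
  if (snapshot.sample_count < minSamples) return 'too_small';
  return 'ok';
}
```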
Live inferences
Every on-device verification posts back a feature vector. Rows land
in verify_ai_model_recent_inferences with:
- bundle_id, so we can compare per-bundle, not per-policy.
- policy_id, for convenience filtering.
- feature_vector (jsonb), the post-detector feature vector.
- result (jsonb), the verification outcome including is_compliant.
- created_at, for the cron's 1-hour window.
The table is intentionally short-lived. The migration ships a
verify_ai_recent_inferences_cleanup() function that deletes rows
older than 7 days; the same hourly drift cron calls it on each run
(supabase.rpc('verify_ai_recent_inferences_cleanup')). We don't need
long history here — for that we have verify_ai_verifications itself.
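The two time bounds in play here, the 1-hour comparison window and the 7-day retention cutoff, can be sketched as follows. The helper names are hypothetical, and treating the window as a rolling hour ending at the cron's start time is an assumption; the real job may align to the clock hour instead.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const RETENTION_DAYS = 7; // matches the 7-day cleanup described above

// Assumed: a rolling one-hour window that ends when the cron tick starts.
function driftWindow(now: Date): { windowStart: Date; windowEnd: Date } {
  return { windowStart: new Date(now.getTime() - HOUR_MS), windowEnd: now };
}

// Rows created before this instant are eligible for cleanup.
function retentionCutoff(now: Date): Date {
  return new Date(now.getTime() - RETENTION_DAYS * 24 * HOUR_MS);
}

// Assumed half-open interval: start inclusive, end exclusive.
function isInWindow(createdAt: Date, windowStart: Date, windowEnd: Date): boolean {
  const t = createdAt.getTime();
  return t >= windowStart.getTime() && t < windowEnd.getTime();
}
```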
Hourly cron
app/api/cron/ml-drift/route.ts runs every hour (registered in
vercel.json as "0 * * * *") and is authenticated via a Bearer ${CRON_SECRET} header. For each is_active bundle, the loop does the following:
for (const bundle of activeBundles) {
  const rows = await loadRecentInferences(bundle.id, windowStart, windowEnd);
  if (rows.length === 0) continue;
  // Split each row into its feature vector and its verification result.
  const featureVectors = rows.map((row) => row.feature_vector);
  const results = rows.map((row) => row.result);
  const liveStats = computeStats(featureVectors, results);
  // Without a reference snapshot, live stats are still recorded with drift_score = null.
  const ref = await loadLatestReferenceSnapshot(bundle.id);
  let driftScore: number | null = null;
  if (ref) {
    driftScore =
      computePSI(ref.feature_stats, liveStats.feature_stats) +
      computePSI(ref.result_stats, liveStats.result_stats);
  }
  const driftAlert =
    driftScore !== null && driftScore > DRIFT_ALERT_THRESHOLD; // 0.25
  await insertRuntimeStats(bundle.id, {
    window_start: windowStart,
    window_end: windowEnd,
    sample_count: liveStats.sample_count,
    feature_stats: liveStats.feature_stats,
    result_stats: liveStats.result_stats,
    reference_snapshot_id: ref?.id ?? null,
    drift_score: driftScore,
    drift_alert: driftAlert,
  });
  // Re-check for null so TypeScript narrows driftScore before toFixed().
  if (driftAlert && driftScore !== null) {
    Sentry.captureMessage(
      `ML drift alert: bundle ${bundle.id} (policy=${bundle.policy_id}) PSI=${driftScore.toFixed(3)}`,
      { level: 'warning', tags: { cron: 'ml-drift', policy_id: bundle.policy_id } },
    );
  }
}
// TTL cleanup runs at the end of every cron tick.
await supabase.rpc('verify_ai_recent_inferences_cleanup');
Per-bundle results are written to verify_ai_model_runtime_stats so
the admin dashboard can chart them and so promotion / rollback
runbooks have something to cite.
Sentry is configured with "warning" severity rather than "error" —
a single PSI spike doesn't mean a bad model, just one worth a human
glance. We pair the alert with a link to the bundle row and the most
recent verifications.
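The Bearer check on the cron route mentioned at the start of this section can be sketched as a pure function. isAuthorizedCronRequest is a hypothetical name, and failing closed when the secret is unset is our assumption about sensible behaviour, not a description of the actual route handler.

```typescript
// Hypothetical helper: true only when the header is exactly "Bearer <secret>".
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string | undefined,
): boolean {
  // Fail closed if the secret is missing or empty, so a misconfigured
  // environment never accepts "Bearer undefined" or "Bearer ".
  if (!cronSecret) return false;
  return authorizationHeader === `Bearer ${cronSecret}`;
}
```

In a route handler this would typically be called with the request's Authorization header and the CRON_SECRET environment variable, returning a 401 response when it yields false.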
Inspecting drift by hand
When triaging, you can pull the latest reference snapshot and the recorded drift scores from a SQL client:
-- Latest reference snapshot for a bundle.
select
  rs.bundle_id,
  rs.snapshot_type,
  rs.sample_count,
  rs.feature_stats,
  rs.result_stats
from verify_ai_model_reference_stats rs
where rs.bundle_id = '...'
order by rs.created_at desc
limit 1;
-- Most recent 24 recorded drift windows for the same bundle.
select
  bundle_id,
  window_start,
  window_end,
  drift_score,
  drift_alert,
  sample_count
from verify_ai_model_runtime_stats
where bundle_id = '...'
order by window_end desc
limit 24;
Preview: shadow-tier evaluation loop
Today drift detection compares one production bundle to its own
reference. The v2 plan is a shadow loop: when a bundle is in
bundle_tier = 'shadow', the SDK runs both production and shadow
inferences on the same frame and posts both feature vectors. The cron
job then computes:
- Pairwise PSI between live production and live shadow on identical inputs.
- Agreement rate on the final policy decision.
This lets us answer "would this candidate model have agreed with
production?" without ever exposing it to users. The migration leaves
room for it (the runtime stats table is keyed by bundle_id, not
"production_bundle_id"), but the cron logic is not yet wired up.
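Since the shadow loop is not wired up yet, the following is only a sketch of what the agreement-rate half could look like. The PairedDecision shape and the agreementRate name are hypothetical, and using is_compliant as "the final policy decision" is an assumption.

```typescript
// Hypothetical pairing of production and shadow outcomes for the same frame.
interface PairedDecision {
  production_compliant: boolean;
  shadow_compliant: boolean;
}

// Fraction of frames where both bundles reached the same decision,
// or null when there is nothing to compare.
function agreementRate(pairs: PairedDecision[]): number | null {
  if (pairs.length === 0) return null;
  const agreed = pairs.filter(
    (p) => p.production_compliant === p.shadow_compliant,
  ).length;
  return agreed / pairs.length;
}
```

Returning null for an empty window mirrors how the current cron records drift_score = null when there is no reference to compare against.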
What's next
- Model dispatch and targeting — how canary cohorts get selected in the first place.
- Training custom models — promotion gate requires a reference snapshot.