Detecting model drift in production

A model that scored well on the test set can still misbehave six weeks later — new bike racks, new lighting, new phones, new partner cities. Drift detection is how we catch that before partners do.

We use the classic Population Stability Index (PSI) over a fixed reference snapshot captured at promotion time. Every hour a cron job re-bins the last hour of inference feature vectors and compares them to the reference. If the score crosses an alert threshold, Sentry warns us and the bundle becomes a candidate for rollback.

PSI in one paragraph

PSI compares two distributions over the same bins. For each bin i:

psi_i = (live_p_i - ref_p_i) × ln(live_p_i / ref_p_i)
PSI   = Σ over every bin of every feature, summed across feature and
        result histograms

Each histogram is normalised to proportions before being summed. Both proportions are clamped into [1e-6, 1 - 1e-6] before the log to keep it defined when a bin is empty on one side. The reported drift_score is the sum of the feature-stats PSI and the result-stats PSI for that bundle.
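The normalise, clamp, and sum steps described above can be sketched in a few lines. This is an illustration only, not the contents of lib/verify-ai/ml/drift.ts: the `Histogram` and `HistogramSet` shapes and the names `psiForHistogram` and `psiSketch` are assumptions made for the example.

```typescript
// Hypothetical shapes; the real types live in lib/verify-ai/ml/drift.ts.
type Histogram = number[];                     // raw counts per bin
type HistogramSet = Record<string, Histogram>; // keyed by histogram name

const EPS = 1e-6;

// Clamp a proportion into [EPS, 1 - EPS] so the log stays defined.
function clamp(p: number): number {
  return Math.min(1 - EPS, Math.max(EPS, p));
}

// Convert raw counts to proportions; an all-zero histogram maps to zeros.
function normalise(counts: Histogram): number[] {
  const total = counts.reduce((sum, c) => sum + c, 0);
  return counts.map((c) => (total === 0 ? 0 : c / total));
}

// PSI for one pair of histograms that share the same bin layout.
function psiForHistogram(ref: Histogram, live: Histogram): number {
  if (ref.length !== live.length) {
    throw new Error('Histograms must share the same bins');
  }
  const refP = normalise(ref).map(clamp);
  const liveP = normalise(live).map(clamp);
  return refP.reduce(
    (sum, r, i) => sum + (liveP[i] - r) * Math.log(liveP[i] / r),
    0,
  );
}

// Sum PSI over every histogram named in the reference set.
function psiSketch(ref: HistogramSet, live: HistogramSet): number {
  return Object.keys(ref).reduce((sum, name) => {
    const liveHist = live[name];
    return liveHist ? sum + psiForHistogram(ref[name], liveHist) : sum;
  }, 0);
}
```

As a sanity check, a two-bin histogram that moves from a 50/50 split to a 90/10 split scores roughly 0.88 under this sketch, well above the alert band, while identical proportions score exactly 0.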

We read the score using the bands that are conventional across the industry:

| PSI | Verdict |
| ----------- | ------------------------------------------------------- |
| < 0.10 | Stable: distributions match. |
| 0.10–0.25 | Minor drift: keep watching. |
| > 0.25 | Alert: investigate before this bundle ships wider. |
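The bands above map onto a small classifier. The sketch below is an assumption-laden illustration: `classifyPsi`, `DriftVerdict`, and `MINOR_DRIFT_THRESHOLD` are hypothetical names, and a score of exactly 0.25 is treated as minor drift because the alert band is strictly greater than 0.25.

```typescript
type DriftVerdict = 'stable' | 'minor' | 'alert';

// Band edges taken from the table; the constant names are illustrative.
const MINOR_DRIFT_THRESHOLD = 0.1;
const ALERT_THRESHOLD = 0.25;

// Map a PSI score onto one of the three verdict bands.
function classifyPsi(psi: number): DriftVerdict {
  if (psi > ALERT_THRESHOLD) return 'alert';
  if (psi >= MINOR_DRIFT_THRESHOLD) return 'minor';
  return 'stable';
}
```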

The math lives in lib/verify-ai/ml/drift.ts as computePSI() and computeStats(). computeStats() takes a batch of feature vectors plus their compliance outcomes and bins them into a fixed set of histograms: detection_count, primary_confidence, image_quality, has_vehicle on the feature side; compliance and primary_class on the result side. The alert threshold is exported as DRIFT_ALERT_THRESHOLD = 0.25.
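To make the binning step concrete, here is a minimal sketch of how a fixed-bin histogram could be built for a numeric feature such as primary_confidence and for a boolean feature such as has_vehicle. The helper names, the equal-width bins, and the clamping of out-of-range values are assumptions for illustration; the actual bin layout used by computeStats() is not specified here.

```typescript
// Count values into equal-width bins over [min, max]. Out-of-range values
// are clamped into the first or last bin; non-finite values are skipped.
function fixedBinHistogram(
  values: number[],
  min: number,
  max: number,
  bins: number,
): number[] {
  const counts = new Array<number>(bins).fill(0);
  const width = (max - min) / bins;
  for (const v of values) {
    if (!Number.isFinite(v)) continue;
    const raw = Math.floor((v - min) / width);
    const idx = Math.min(bins - 1, Math.max(0, raw));
    counts[idx] += 1;
  }
  return counts;
}

// Boolean features fit in two bins, ordered [false, true].
function booleanHistogram(values: boolean[]): number[] {
  const trues = values.filter(Boolean).length;
  return [values.length - trues, trues];
}
```

Because the bin edges never change between the reference snapshot and the live window, the two histograms can be compared bin by bin without any re-alignment.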

Reference snapshots

When a bundle is promoted to bundle_tier = 'production' we freeze a reference snapshot into verify_ai_model_reference_stats. The snapshot stores:

bundle_id — the bundle this snapshot baselines.
snapshot_type — 'promotion' (auto) or 'manual'.
feature_stats (jsonb) — the feature_stats.histograms payload produced by computeStats().
result_stats (jsonb) — same shape, for the result histograms.
sample_count — so we can detect snapshots with too little data.
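A typed view of the snapshot row, with a guard on sample_count, might look like the sketch below. The interface mirrors the columns listed above, but `ReferenceSnapshot`, `isSnapshotUsable`, and especially the `MIN_REFERENCE_SAMPLES` value are assumptions; this document does not state what minimum the team actually uses.

```typescript
// Hypothetical row shape mirroring the columns listed above.
interface ReferenceSnapshot {
  bundle_id: string;
  snapshot_type: 'promotion' | 'manual';
  feature_stats: unknown; // jsonb payload produced by computeStats()
  result_stats: unknown;  // jsonb payload, same shape as feature_stats
  sample_count: number;
}

// Assumed floor for illustration only; not a documented value.
const MIN_REFERENCE_SAMPLES = 500;

// A snapshot is usable when it exists and was built from enough samples.
function isSnapshotUsable(snapshot: ReferenceSnapshot | null): boolean {
  return snapshot !== null && snapshot.sample_count >= MIN_REFERENCE_SAMPLES;
}
```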

No snapshot, no PSI. The hourly cron skips bundles that don't have a reference row (it just records the live stats with drift_score = null).

Live inferences

Every on-device verification posts back a feature vector. Rows land in verify_ai_model_recent_inferences with:

bundle_id so we can compare per-bundle, not per-policy.
policy_id for convenience filtering.
feature_vector (jsonb, the post-detector feature vector).
result (jsonb, the verification outcome including is_compliant).
created_at for the cron's 1-hour window.
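The per-bundle, one-hour selection can be illustrated in memory as below. In practice the filter would more likely run in the database query; the `RecentInference` type, the function name, and the half-open window convention (start inclusive, end exclusive) are assumptions made for this sketch.

```typescript
// Hypothetical row shape mirroring the columns listed above.
interface RecentInference {
  bundle_id: string;
  policy_id: string;
  feature_vector: unknown;
  result: { is_compliant: boolean };
  created_at: string; // ISO 8601 timestamp
}

const HOUR_MS = 60 * 60 * 1000;

// Rows for one bundle inside the half-open window [windowEnd - 1h, windowEnd).
function rowsInLastHour(
  rows: RecentInference[],
  bundleId: string,
  windowEnd: Date,
): RecentInference[] {
  const end = windowEnd.getTime();
  const start = end - HOUR_MS;
  return rows.filter((row) => {
    const t = Date.parse(row.created_at);
    return row.bundle_id === bundleId && t >= start && t < end;
  });
}
```

A half-open window means a row stamped exactly on the hour is counted in one window only, never in two consecutive ones.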

The table is intentionally short-lived. The migration ships a verify_ai_recent_inferences_cleanup() function that deletes rows older than 7 days; the same hourly Mac Mini drift job calls it on each run (supabase.rpc('verify_ai_recent_inferences_cleanup')). We don't need long history here — for that we have verify_ai_verifications itself.

Hourly drift job

app/api/cron/ml-drift/route.ts is invoked every hour by the Mac Mini wrapper scripts/run-ml-drift-monitor.sh. The request is authenticated via a Bearer ${CRON_SECRET} header. For each is_active bundle, the loop does the following:

for (const bundle of activeBundles) {
  const rows = await loadRecentInferences(bundle.id, windowStart, windowEnd);
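  if (rows.length === 0) continue;

  // Split the rows into the two inputs computeStats() expects.
  const featureVectors = rows.map((row) => row.feature_vector);
  const results = rows.map((row) => row.result);
  const liveStats = computeStats(featureVectors, results);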
  const ref = await loadLatestReferenceSnapshot(bundle.id);
 
  let driftScore: number | null = null;
  if (ref) {
    driftScore =
      computePSI(ref.feature_stats, liveStats.feature_stats) +
      computePSI(ref.result_stats, liveStats.result_stats);
  }
 
  const driftAlert =
    driftScore !== null && driftScore > DRIFT_ALERT_THRESHOLD; // 0.25
 
  await insertRuntimeStats(bundle.id, {
    window_start: windowStart,
    window_end: windowEnd,
    sample_count: liveStats.sample_count,
    feature_stats: liveStats.feature_stats,
    result_stats: liveStats.result_stats,
    reference_snapshot_id: ref?.id ?? null,
    drift_score: driftScore,
    drift_alert: driftAlert,
  });
 
  // Re-check for null so TypeScript narrows driftScore before toFixed().
  if (driftAlert && driftScore !== null) {
    Sentry.captureMessage(
      `ML drift alert: bundle ${bundle.id} (policy=${bundle.policy_id}) PSI=${driftScore.toFixed(3)}`,
      { level: 'warning', tags: { cron: 'ml-drift', policy_id: bundle.policy_id } },
    );
  }
}
 
// TTL cleanup runs at the end of every cron tick.
await supabase.rpc('verify_ai_recent_inferences_cleanup');

Per-bundle results are written to verify_ai_model_runtime_stats so the admin dashboard can chart them and so promotion / rollback runbooks have something to cite.
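The Bearer check that guards the cron route could be sketched as follows. This assumes the token arrives in the standard Authorization header, which the text above does not state explicitly; `HeaderReader` and `isAuthorizedCronRequest` are hypothetical names, and the minimal interface stands in for the request's `Headers` object so the example stays self-contained.

```typescript
// Minimal stand-in for the Headers object of an incoming request.
interface HeaderReader {
  get(name: string): string | null;
}

// True only when the Authorization header carries the expected bearer token.
function isAuthorizedCronRequest(
  headers: HeaderReader,
  secret: string | undefined,
): boolean {
  if (!secret) return false; // refuse everything if the secret is unset
  return headers.get('authorization') === `Bearer ${secret}`;
}
```

Rejecting every request when the secret is missing avoids the failure mode where an unset environment variable makes the literal string "Bearer undefined" a valid credential.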

Sentry is configured with "warning" severity rather than "error" — a single PSI spike doesn't mean a bad model, just one worth a human glance. We pair the alert with a link to the bundle row and the most recent verifications.

Inspecting drift by hand

You can run the same comparison from a SQL client when triaging:

select
  rs.bundle_id,
  rs.snapshot_type,
  rs.sample_count,
  rs.feature_stats,
  rs.result_stats
from verify_ai_model_reference_stats rs
where rs.bundle_id = '...'
order by rs.created_at desc
limit 1;
 
select
  bundle_id,
  window_start,
  window_end,
  drift_score,
  drift_alert,
  sample_count
from verify_ai_model_runtime_stats
where bundle_id = '...'
order by window_end desc
limit 24;
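Once the last 24 windows are in hand, a quick summary helps during triage. The sketch below works on rows shaped like the second query's result; `RuntimeWindow`, `WindowSummary`, and `summariseWindows` are hypothetical names and not part of the codebase described here.

```typescript
// Hypothetical shape of one row returned by the runtime stats query.
interface RuntimeWindow {
  window_end: string;
  drift_score: number | null; // null when no reference snapshot existed
  drift_alert: boolean;
  sample_count: number;
}

interface WindowSummary {
  windows: number;
  scored: number; // windows that produced a drift_score
  alerts: number;
  maxScore: number | null;
}

// Collapse a list of hourly windows into counts and the peak score.
function summariseWindows(rows: RuntimeWindow[]): WindowSummary {
  const scores = rows
    .map((r) => r.drift_score)
    .filter((s): s is number => s !== null);
  return {
    windows: rows.length,
    scored: scores.length,
    alerts: rows.filter((r) => r.drift_alert).length,
    maxScore: scores.length > 0 ? Math.max(...scores) : null,
  };
}
```

A large gap between `windows` and `scored` suggests the bundle ran for a while without a reference snapshot, which is worth checking before reading anything into the scores.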

Preview: shadow-tier evaluation loop

Today drift detection compares one production bundle to its own reference. The v2 plan is a shadow loop: when a bundle is in bundle_tier = 'shadow', the SDK runs both production and shadow inferences on the same frame and posts both feature vectors. The cron job then computes:

Pairwise PSI between live production and live shadow on identical inputs.
Agreement rate on the final policy decision.

This lets us answer "would this candidate model have agreed with production?" without ever exposing it to users. The migration leaves room for it (the runtime stats table is keyed by bundle_id, not "production_bundle_id"), but the cron logic is not yet wired up.
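Since the shadow loop is not wired up yet, the following is purely a sketch of what the agreement-rate half could look like. The `PairedDecision` shape and the function name are assumptions; nothing here reflects an existing schema or the eventual cron logic.

```typescript
// Hypothetical pairing of both bundles' decisions on the same frame.
interface PairedDecision {
  production_compliant: boolean;
  shadow_compliant: boolean;
}

// Fraction of frames where both bundles reached the same decision.
// Returns null for an empty window rather than dividing by zero.
function agreementRate(pairs: PairedDecision[]): number | null {
  if (pairs.length === 0) return null;
  const agreed = pairs.filter(
    (p) => p.production_compliant === p.shadow_compliant,
  ).length;
  return agreed / pairs.length;
}
```

Because both inferences run on the same frame, any disagreement here reflects a difference between the models rather than a difference between their inputs.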

What's next

Model dispatch and targeting — how canary cohorts get selected in the first place.
Training custom models — promotion gate requires a reference snapshot.
