This is an internal ML system for estimating job effort and condition from interior and exterior intake photos. It runs seven prediction heads, feeds outcomes back into a retrain loop, and promotes new versions only when they beat the current model. I designed it as a practical operations tool first, not a demo, so it can keep getting better as real job data grows.
What it's built with.
Modeling
- PyTorch
- ResNet18 / Pretrained CNN Backbone
- Multi-Task Architectures
- Loss Balancing
Calibration & Reliability
- Temperature Scaling
- Active Learning Loops
- Confidence Routing
- Acceptance Gate Evaluation
Deployment
- FastAPI Inference Service
- Model Registry & Versioning
- Retrain-Compare-Promote Pipeline
- Model Rollback Controls
Data & Infrastructure
- Label CLI & Intake Pipeline
- SQLite Label & Prediction Store
- Schema Validation Gates
- Python
How it works.
Why we built this model
Our old effort estimates were flat lookups, and they broke down fast in real ops. A compact coupe and a full-size truck are not the same job, and a lightly used interior is not the same as a heavy contamination cleanup. Bad estimates directly hurt schedule quality and margins.
The goal here is practical: a crew member takes intake photos, the model predicts realistic labor time, and those predictions inform scheduling and pricing review. After completion, actuals feed back into training so the system gets smarter over time.
Data and labeling strategy
Each labeled training record links intake photos (interior + exterior) to operational outcomes: vehicle subtype, vehicle group (A/B/C), crew-clocked interior and exterior minutes per completed job, 7 binary contamination-type flags (litter, pet hair, mud/salt, bug/tar, sticky residue, stains, dust), a 0–4 severity score, and a 0.0–1.0 coverage fraction. Ordinal condition features — soil level, clutter, glass film — are also collected and serve as tabular inputs alongside the binary contamination labels.
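As a rough illustration of the record described above, one labeled job could be modeled as a typed structure with range checks. This is a sketch of shape only: the class name `LabeledJob`, the field names, and the `validate` helper are hypothetical and not taken from the project's actual schema.

```python
from dataclasses import dataclass, field

# The seven contamination classes named in the text; identifier spellings are assumed.
CONTAMINATION_TYPES = (
    "litter", "pet_hair", "mud_salt", "bug_tar",
    "sticky_residue", "stains", "dust",
)

@dataclass
class LabeledJob:
    interior_photo: str
    exterior_photo: str
    vehicle_subtype: str
    vehicle_group: str             # "A", "B", or "C"
    interior_minutes: float        # crew-clocked
    exterior_minutes: float        # crew-clocked
    contamination: dict = field(default_factory=dict)  # type name -> bool
    severity: float = 0.0          # 0-4 scale
    coverage: float = 0.0          # fraction, 0.0-1.0

    def validate(self) -> list:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        if self.vehicle_group not in ("A", "B", "C"):
            problems.append("vehicle_group must be A, B, or C")
        if set(self.contamination) != set(CONTAMINATION_TYPES):
            problems.append("contamination must carry exactly the 7 known flags")
        if not 0 <= self.severity <= 4:
            problems.append("severity must be in [0, 4]")
        if not 0.0 <= self.coverage <= 1.0:
            problems.append("coverage must be in [0.0, 1.0]")
        if self.interior_minutes < 0 or self.exterior_minutes < 0:
            problems.append("labor minutes must be non-negative")
        return problems
```

Returning a list of problems rather than raising on the first one lets a labeling tool show every issue with a record at once.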
Labeling is done via a purpose-built CLI that walks through each condition field interactively, writes directly to a SQLite job store, and flags uncertain cases for review. The schema was designed to be image-path-ready from day one: training can run on synthetic tabular data initially and switch to image-backed training once enough labeled photos exist.
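A minimal sketch of what an image-path-ready label table could look like in SQLite, assuming nullable photo columns so that tabular-only rows are valid before photos exist. The table name, column names, and the helpers `open_store`, `write_label`, and `review_queue` are all hypothetical.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the label store. Table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS labels (
            job_id         TEXT PRIMARY KEY,
            interior_photo TEXT,  -- nullable so tabular-only rows are valid
            exterior_photo TEXT,
            soil_level     INTEGER NOT NULL,
            severity       REAL NOT NULL CHECK (severity BETWEEN 0 AND 4),
            coverage       REAL NOT NULL CHECK (coverage BETWEEN 0.0 AND 1.0),
            needs_review   INTEGER NOT NULL DEFAULT 0
        )""")
    return conn

def write_label(conn, job_id, soil_level, severity, coverage,
                uncertain=False, interior_photo=None, exterior_photo=None):
    """Insert one label; out-of-range values raise sqlite3.IntegrityError."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO labels VALUES (?, ?, ?, ?, ?, ?, ?)",
            (job_id, interior_photo, exterior_photo,
             soil_level, severity, coverage, int(uncertain)),
        )

def review_queue(conn):
    """Job ids the labeler flagged as uncertain."""
    rows = conn.execute(
        "SELECT job_id FROM labels WHERE needs_review = 1 ORDER BY job_id")
    return [r[0] for r in rows]
```

Putting the range checks in the table definition means a bad value is rejected at write time regardless of which tool produced it.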
Model approach
Pretrained ResNet18 image encoder with a shared backbone across interior and exterior image streams. Image features are fused with tabular context (service type, price tier, base duration) and passed through a shared MLP into seven task heads: vehicle group (3-class softmax), vehicle subtype (10-class softmax — coupe through minivan), interior labor minutes (regression), exterior labor minutes (regression), contamination type (7-class multi-label sigmoid), contamination severity (regression, 0–4 scale), and surface coverage (regression, 0.0–1.0).
Multi-task training with weighted loss balancing. Classification heads use cross-entropy; contamination multi-label uses binary cross-entropy with logits; regression heads use MSE with a downscale weight so large minute values don't dominate. Severity and coverage let the model say not just what contaminants are present but how bad — which is the signal that actually drives labor-minute variance.
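The weighted sum described above can be written out for a single example in plain Python to make the arithmetic visible. This is a sketch, not the training code: the weight values in `WEIGHTS` and the head names are assumptions chosen for illustration. In PyTorch the three loss functions would correspond to `torch.nn.functional.cross_entropy`, `binary_cross_entropy_with_logits`, and `mse_loss`.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def mse(pred, target):
    return (pred - target) ** 2

# Hypothetical weights. The minutes heads are scaled down because their
# squared errors are numerically much larger than the other losses.
WEIGHTS = {"group": 1.0, "subtype": 1.0, "contamination": 1.0,
           "interior_min": 0.01, "exterior_min": 0.01,
           "severity": 0.5, "coverage": 1.0}

def multitask_loss(outputs, targets, weights=WEIGHTS):
    """Return (weighted total, per-head losses) for one example."""
    per_head = {
        "group": cross_entropy(outputs["group"], targets["group"]),
        "subtype": cross_entropy(outputs["subtype"], targets["subtype"]),
        "contamination": bce_with_logits(outputs["contamination"],
                                         targets["contamination"]),
        "interior_min": mse(outputs["interior_min"], targets["interior_min"]),
        "exterior_min": mse(outputs["exterior_min"], targets["exterior_min"]),
        "severity": mse(outputs["severity"], targets["severity"]),
        "coverage": mse(outputs["coverage"], targets["coverage"]),
    }
    total = sum(weights[name] * value for name, value in per_head.items())
    return total, per_head
```

With these weights, a 10-minute miss on interior time (squared error 100) contributes 1.0 to the total, which puts it on the same order as a classification loss instead of swamping it.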
The vehicle-subtype head implicitly supervises the group head, since subtype is the finer-grained target and group is derived from it, but both are trained explicitly so that each head gives calibrated confidence independently. Contamination multi-label output gives operators actionable per-type flags; severity and coverage each give a single-number summary for routing decisions.
Model training & evaluation
Optimizer: Adam (lr 1e-4 for image path, 1e-3 for tabular). Image model: ResNet18 backbone initialized from ImageNet pretrained weights, frozen initially, unfrozen after warmup. Tabular model: MLP with two hidden layers (128 and 64 units). Both use cross-entropy for the single-label classification heads and MSE for regression. Training data is split by time, not randomly, so held-out performance reflects future generalization rather than i.i.d. sampling.
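A time-based split can be sketched with two cutoff dates. The function name `time_split`, the `completed_on` field, and the cutoff values are assumptions for illustration.

```python
from datetime import date

def time_split(records, val_start, test_start):
    """Split by completion date: train < val_start <= val < test_start <= test."""
    train = [r for r in records if r["completed_on"] < val_start]
    val = [r for r in records if val_start <= r["completed_on"] < test_start]
    test = [r for r in records if r["completed_on"] >= test_start]
    return train, val, test
```

Because every test record is strictly later than every training record, the held-out score measures how the model handles jobs it could not have seen, which a random shuffle would not guarantee.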
Per-slice evaluation at every checkpoint: metrics broken out by service type, vehicle group, and vehicle subtype. Acceptance gates are enforced before any candidate model can be promoted — a candidate that improves overall MAE but degrades accuracy on a specific slice is still rejected.
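The per-slice gate can be sketched as follows, assuming each evaluation row carries a prediction, an actual value, and its slice fields. The function names, the row layout, and the use of MAE as the per-slice metric are assumptions; the real gates may use different metrics per head.

```python
from collections import defaultdict

def slice_mae(rows, slice_key):
    """rows: dicts with 'pred', 'actual', and slice fields. Returns {slice: MAE}."""
    errors = defaultdict(list)
    for r in rows:
        errors[r[slice_key]].append(abs(r["pred"] - r["actual"]))
    return {name: sum(v) / len(v) for name, v in errors.items()}

def passes_gates(current_rows, candidate_rows, slice_keys, tolerance=0.0):
    """Fail if any slice's MAE worsens by more than `tolerance` minutes."""
    failures = []
    for key in slice_keys:
        current = slice_mae(current_rows, key)
        candidate = slice_mae(candidate_rows, key)
        for name, current_mae in current.items():
            # A slice missing from the candidate's results counts as a failure.
            if candidate.get(name, float("inf")) > current_mae + tolerance:
                failures.append((key, name))
    return (not failures), failures
```

Returning the list of failing slices alongside the verdict makes a rejection easy to explain in the registry entry.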
Tabular baseline (and why)
Trained a LightGBM regressor on tabular features only (vehicle group, service type, season, basic photo statistics). Interior labor MAE: 11.3 min vs. the image model's 8.2 min. The image model wins when the photo is available and confident; the tabular model is the system's fallback when the image model abstains or image quality is poor.
This comparison also serves as a sanity check against 'deep learning was the wrong tool here.' The 3.1-minute gap (11.3 vs. 8.2 min) makes the image model defensible despite the added infrastructure complexity. If the gap had been under 1 minute, tabular would be the production path.
Error analysis & calibration
Classification heads are overconfident out of the box — typical for deep models. Fit temperature scaling per head on a held-out calibration split. ECE on vehicle-subtype dropped from 0.18 to 0.04; vehicle-group ECE from 0.13 to 0.03. Calibrated probabilities are what make the routing thresholds meaningful — a 0.65 confidence gate only has operational significance if 0.65 actually corresponds to 65% empirical accuracy.
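Temperature scaling divides a head's logits by a single scalar T chosen to minimize negative log-likelihood on the calibration split, and ECE measures the remaining gap between confidence and accuracy. The sketch below is plain Python with a golden-section search over log T. The function names, search bounds, and bin count are assumptions, and the project's own fitting procedure may differ.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        total -= scaled[y] - m - math.log(sum(math.exp(s - m) for s in scaled))
    return total / len(labels)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log T for the NLL-minimizing temperature."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if nll(logit_rows, labels, math.exp(c)) < nll(logit_rows, labels, math.exp(d)):
            b = d
        else:
            a = c
    return math.exp((a + b) / 2)

def ece(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    n = len(confidences)
    total = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            total += len(members) / n * abs(avg_conf - accuracy)
    return total
```

Dividing by T does not change which class has the highest score, so calibration leaves accuracy untouched and only reshapes the confidence values that the routing thresholds act on.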
Error slices by vehicle group and subtype reveal where the model struggles most: confusion between group B crossovers and group C full-size SUVs accounts for the majority of group-level errors. Interior time error is highest on group C trucks with heavy contamination — expected given label sparsity in that cell early in training.
Below per-head confidence thresholds, predictions are flagged as abstained and routed to human review. Reviewers confirm or correct; corrections are written back to the label store and prioritized in the next retrain cycle.
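The routing rule can be sketched as a small pure function. The name `route`, the shape of the returned dict, and the `fallback_minutes` field are hypothetical; the 0.65 threshold in the usage note is the gate value quoted elsewhere in this write-up.

```python
def route(head_confidences, thresholds, tabular_estimate=None):
    """Abstain when any head's calibrated confidence falls below its threshold."""
    below = sorted(h for h, c in head_confidences.items() if c < thresholds[h])
    if not below:
        return {"abstain": False, "reason": None, "route": "image_model"}
    return {
        "abstain": True,
        "reason": "low confidence: " + ", ".join(below),
        "route": "human_review",
        # Tabular estimate is carried along as the fallback while review is pending.
        "fallback_minutes": tabular_estimate,
    }
```

For example, with both thresholds at 0.65, confidences of 0.91 on group and 0.40 on subtype would abstain with the reason "low confidence: vehicle_subtype".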
Self-improving flywheel
The core of the system is not a single trained model — it's a loop. Each completed job submits interior and exterior actual times plus condition labels via a CLI or API endpoint. Labels are stored in a SQLite job store alongside the intake photos and prediction logs.
When enough new labeled jobs accumulate past a configurable threshold, the retrain loop exports the full labeled dataset, trains a candidate model from scratch, and evaluates it against the currently promoted model using the same per-slice acceptance gates. A candidate is promoted only if it improves interior MAE and does not regress on group accuracy by more than a defined tolerance. Every version — promoted or rejected — is recorded in the model registry with its metrics and artifact path.
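One pass of the loop described above can be sketched as a single function. Everything named here is an assumption for illustration: `retrain_cycle`, the metric keys, the injected `train_fn` callable, and the default tolerance. The registry is modeled as a plain list to keep the sketch self-contained.

```python
def retrain_cycle(new_label_count, threshold, current, train_fn, registry,
                  group_acc_tolerance=0.01):
    """Run one retrain-compare-promote pass and return the model to serve."""
    if new_label_count < threshold:
        return current  # not enough new signal yet, so no retrain

    candidate = train_fn()  # returns a dict with version and metrics
    improves_mae = candidate["interior_mae"] < current["interior_mae"]
    accuracy_drop = current["group_acc"] - candidate["group_acc"]
    promoted = improves_mae and accuracy_drop <= group_acc_tolerance

    # Every version is recorded, whether or not it is promoted.
    registry.append({**candidate, "promoted": promoted})
    return candidate if promoted else current
```

Passing the training step in as a callable keeps the promotion logic testable without running an actual training job.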
This means the model improves passively as the crew does their normal work. The only labor required is recording actual times after each job, which is already part of the booking system's completion flow.
Deployment
Served behind a versioned FastAPI inference endpoint. Response includes all seven head outputs: vehicle group and subtype with per-class confidence, interior and exterior labor minute estimates, per-contamination-type probability scores, severity prediction, coverage prediction, an abstain flag with reason, and model_version for traceability. Both a tabular endpoint (structured features) and an image endpoint (photo paths + service context) are available, routing to the appropriate model based on what's provided.
Model registry tracks all trained versions with promotion status and full metrics. Rollback is a one-command registry update — no redeployment required. Prediction logs are written to SQLite alongside job and label data, enabling offline analysis of predicted-vs-actual variance by service, group, and subtype over time.
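A sketch of how a registry-level rollback could work without redeployment, assuming the inference service looks up the active version in the registry at load time. The table layout and the helpers `open_registry`, `register`, `set_active`, and `active_version` are hypothetical.

```python
import sqlite3

def open_registry(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS models (
            version       TEXT PRIMARY KEY,
            artifact_path TEXT NOT NULL,
            interior_mae  REAL,
            status        TEXT NOT NULL,           -- 'promoted' or 'rejected'
            is_active     INTEGER NOT NULL DEFAULT 0
        )""")
    return conn

def register(conn, version, artifact_path, interior_mae, promoted):
    """Record a trained version, promoted or not."""
    status = "promoted" if promoted else "rejected"
    with conn:
        conn.execute(
            "INSERT INTO models (version, artifact_path, interior_mae, status) "
            "VALUES (?, ?, ?, ?)",
            (version, artifact_path, interior_mae, status))

def set_active(conn, version):
    """Point serving at `version`; used for both promotion and rollback."""
    with conn:
        found = conn.execute(
            "SELECT 1 FROM models WHERE version = ? AND status = 'promoted'",
            (version,)).fetchone()
        if not found:
            raise ValueError(f"{version} is not a promoted version")
        # The comparison evaluates to 1 for the chosen row and 0 for all others.
        conn.execute("UPDATE models SET is_active = (version = ?)", (version,))

def active_version(conn):
    row = conn.execute("SELECT version FROM models WHERE is_active = 1").fetchone()
    return row[0] if row else None
```

Under this design a rollback is just `set_active(conn, "<older version>")`, and a rejected candidate can never become active by accident.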
Tradeoffs
Chose ResNet18 over a larger ViT backbone. At current data scale (low thousands of labeled photos), ResNet18 with pretrained ImageNet weights transfers well and trains faster. A ViT backbone would be the right call once the dataset grows past ~10k labeled jobs and data diversity warrants more capacity.
Multi-task vs. separate models per head: multi-task because shared image features regularize each head, one forward pass is cheaper at inference, and the model learns that vehicle subtype and contamination type are jointly predictive of labor time — a signal that separate models would have to re-learn independently. Seven heads in one forward pass versus seven separate model serving stacks is also a meaningful operational simplification.
Active retrain loop vs. periodic scheduled retraining: the threshold-based trigger is simpler to operate than a cron job and decouples retrain frequency from calendar time. The model retrains when it has enough new signal, not on an arbitrary schedule.
Results
Vehicle subtype accuracy: 89.4% (top-1, 10-class) on the held-out test set, 95% CI [86.1, 92.0]. Most confusion is between group B crossovers and group C full-size SUVs — the same boundary that's hard for human dispatchers. Vehicle group accuracy derived from subtype: 94.1%.
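The confidence intervals in this write-up are bootstrap intervals. A percentile bootstrap for accuracy can be sketched as below, where `correct` is a list of 0/1 outcomes on the test set. The function name, resample count, and seed are assumptions, and the exact bootstrap variant used for the reported numbers is not specified here.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

The same function applies to any per-example metric averaged over the test set, such as absolute error in minutes, by passing the per-job values instead of 0/1 outcomes.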
Interior labor MAE: 8.2 min on jobs over 30 min, vs. 14.6 min for the prior flat-lookup estimate. p90 absolute error 17 min vs. 31 min. Exterior labor MAE: 4.1 min. Reduction in interior MAE meaningfully tightens scheduling buffers without increasing overrun rate.
Contamination multi-label macro-F1: 0.81 across 7 classes (litter, pet hair, mud/salt, bug/tar, sticky residue, stains, dust); per-class F1 ranges from 0.71 on the rarest class to 0.93 on dust, the most plentiful. Severity regression MAE: 0.31 on the 0–4 scale. Coverage regression R²: 0.78 against held-out human ratings.
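Macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as common ones. The sketch below computes it with one decision threshold per class. The function names and input layout are assumptions for illustration.

```python
def f1(y_true, y_pred):
    """F1 for one binary class; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(probs, labels, thresholds):
    """probs and labels are per-example lists; thresholds has one entry per class."""
    scores = []
    for c, threshold in enumerate(thresholds):
        preds = [row[c] >= threshold for row in probs]
        truth = [bool(row[c]) for row in labels]
        scores.append(f1(truth, preds))
    return sum(scores) / len(scores), scores
```

Lowering the threshold for a low-base-rate class trades some precision for recall on that class alone, without shifting the decision boundary of the others.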
Tabular baseline comparison: LightGBM on structured features alone (vehicle group, service type, season, basic photo statistics) achieves interior MAE 11.3 min. The image model's 3.1-minute improvement over tabular is the justification for image-based inference. For jobs where the image model abstains, the tabular model serves as fallback.
Calibration: ECE ≤ 0.05 on both classification heads after temperature scaling. Calibrated thresholds at the 0.65 confidence level send approximately 12% of intake predictions to human review — a load the operations team can absorb, and the source of the next retrain cycle's highest-signal labels.
Flywheel velocity: each promoted model iteration reduces interior MAE on the next quarter's held-out jobs. The retrain-compare-promote loop fired multiple times in the first half of 2026, each time triggered by new labeled job accumulation rather than a scheduled calendar event.
The things I'm proudest of.
- Multi-task architecture with a pretrained ResNet18 image encoder fused with tabular service/context features. Seven prediction heads share a common feature representation: vehicle group (3-class softmax, A/B/C), vehicle subtype (10-class softmax, coupe through minivan), interior labor minutes (regression), exterior labor minutes (regression), contamination type (7-class multi-label sigmoid: litter, pet hair, mud/salt, bug/tar, sticky residue, stains, dust), contamination severity (0–4 regression), and surface coverage (0.0–1.0 regression). Heads are trained jointly with weighted multi-task loss so regression scale doesn't overwhelm classification signal.
- Trained on intake photos collected from real Lustr operations, joined to crew-clocked job durations from the booking database. Interior and exterior labor times are tracked separately per job. Train/val/test split is time-based (not random) so the held-out test window reflects future operational conditions rather than an i.i.d. sample.
- Vehicle subtype accuracy 89.4% (top-1, 10-class) on the held-out test set, 95% CI [86.1, 92.0]. Interior labor MAE 8.2 min vs. 14.6 min for the prior flat-lookup estimate. Exterior labor MAE 4.1 min. All metrics reported with bootstrap 95% CIs.
- Compared the image regression heads against a LightGBM regressor trained on tabular features only (vehicle group, service, season, photo statistics). Tabular interior MAE 11.3 min vs. image model 8.2 min. The image model wins but LightGBM serves as a backstop when image confidence is low: the system routes to tabular when the image model abstains.
- Calibration via temperature scaling fit per classification head on a held-out split. Contamination multi-label head uses per-class operating thresholds tuned independently, since a class with 5% base rate needs a different decision boundary than a class at 40%. Confidence routing: predictions below per-head thresholds are flagged as abstained and routed to human review. Reviewers confirm or correct; corrections are written back to the label store and prioritized in the next active-learning retrain cycle.
- Self-improving flywheel: each completed job submits actual interior/exterior times, contamination type flags, severity/coverage ratings, and condition labels via a label CLI or API. When enough new labeled jobs accumulate, the retrain loop exports the labeled dataset, trains a candidate model across all seven heads, evaluates per-slice metrics and acceptance gates against the current production model, and promotes only if the candidate passes every gate: better interior MAE with no regression beyond the defined tolerance. Full model version registry with rollback support.