Applied Compute DoorDash Engagement

Backlink: [[2026-05-21]]

# Applied Compute DoorDash Engagement

What it looks like in practice is basically an **enterprise AI field-engineering engagement**, not “send us your data and we fine-tune a model.”

Using the DoorDash case, the workflow likely looked roughly like this:

## 1. Applied Compute embeds with the customer team

Applied Compute says they worked onsite at DoorDash’s Sunnyvale office. The people involved were:

- DoorDash ML / product experts
- DoorDash internal menu QA experts
- Applied Compute engineers / applied researchers

The goal of the first phase is not training. It is understanding:

- What does “good menu output” mean?
- What are the real failure modes?
- Which mistakes actually matter?
- What does DoorDash already measure?
- What does the current AI system output?
- What do human reviewers correct?
- Where do reviewers disagree?
- Which edge cases are common enough to care about?

So the Applied Compute engineer is probably sitting with DoorDash people looking at actual menus, AI-generated structured outputs, QA labels, reviewer notes, and production examples.

## 2. They study the existing human workflow

For DoorDash, the human workflow is something like:

- Merchant submits a restaurant menu as photo / PDF / text.
- DoorDash’s system converts it into structured menu data.
- Human QA reviewers check whether the menu is correct.
- Reviewers flag errors:
  - wrong item name
  - missing modifier
  - wrong price
  - bad category
  - duplicate item
  - incorrect hierarchy
  - side dish represented incorrectly
  - option groups modeled incorrectly
  - combo meal split incorrectly
- Reviewers decide what counts as critical.

The important insight from the case study: **generation was hard to specify, but verification was easier.**

Meaning: humans may disagree on the perfect SOP for creating a menu, but if you show them a candidate structured menu and the ground truth, they can more consistently say whether it is right or wrong.
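To make the "verification is easier" point concrete, here is a minimal sketch of checking a candidate structured menu against a ground-truth menu. Everything in it is an assumption for illustration: the `MenuItem` fields, the `find_errors` helper, the error labels, and the prices are hypothetical and are not DoorDash's actual schema or QA rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MenuItem:
    # Hypothetical, heavily simplified schema (no modifiers or hierarchy).
    name: str
    price_cents: int
    category: str


def find_errors(candidate: list[MenuItem], truth: list[MenuItem]) -> list[str]:
    """Compare a candidate menu to the ground truth and list the mismatches."""
    errors: list[str] = []
    truth_by_name = {item.name: item for item in truth}
    seen: set[str] = set()
    for item in candidate:
        if item.name in seen:
            errors.append(f"duplicate item: {item.name}")
            continue
        seen.add(item.name)
        expected = truth_by_name.get(item.name)
        if expected is None:
            errors.append(f"unexpected item: {item.name}")
            continue
        if item.price_cents != expected.price_cents:
            errors.append(f"wrong price: {item.name}")
        if item.category != expected.category:
            errors.append(f"bad category: {item.name}")
    # Anything in the ground truth that the candidate never produced is missing.
    for name in truth_by_name:
        if name not in seen:
            errors.append(f"missing item: {name}")
    return errors


# Made-up example data, not taken from any real menu.
truth = [MenuItem("Tacos", 900, "Tacos"), MenuItem("Taco Dinner", 1300, "Dinners")]
candidate = [MenuItem("Tacos", 950, "Tacos"), MenuItem("Tacos", 950, "Tacos")]

print(find_errors(candidate, truth))
# ['wrong price: Tacos', 'duplicate item: Tacos', 'missing item: Taco Dinner']
```

Writing a rule that generates the right menu from a messy PDF is hard; checking a candidate against a corrected menu, as above, is a much more mechanical comparison, which is why it is easier to get consistent judgments on.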
So Applied Compute turns the problem from:

> “Teach the model how to create menus perfectly.”

into:

> “Build a grader that can judge whether a generated menu matches DoorDash’s quality standard.”

## 3. They convert DoorDash’s implicit judgment into an eval / grader

This is the core work. Applied Compute takes:

- historical menus
- structured menu outputs
- QA labels
- human reviewer decisions
- DoorDash style guidelines
- known production error types
- edge cases

and builds an **automated grader**.

The grader needs to answer:

- Is this menu item correctly represented?
- Are categories correct?
- Are variants / modifiers correct?
- Are prices correct?
- Are combo meals modeled correctly?
- Are critical errors present?
- Which type of error is this?
- How severe is it?
- Would DoorDash’s human reviewers pass or fail this?

Then they calibrate it against human experts.

A very simple version:

```text
Input:
- Original restaurant menu PDF/photo/text
- Current AI-generated structured menu
- DoorDash ground-truth / corrected menu

Grader output:
- missing_items: 2
- wrong_prices: 1
- bad_hierarchy: 3
- critical_error: yes
- quality_score: 0.71
- pass_fail: fail
```

But in reality it is much more nuanced because menu structure involves judgment.

## 4. They run disagreement analysis

This is probably a big part of the onsite work. They ask:

- Where does the automated grader disagree with humans?
- Where do humans disagree with each other?
- Which failures are acceptable?
- Which failures are business-critical?
- Which errors hurt customers / merchants most?
- Which errors are rare but catastrophic?
- Which labels are noisy?

Then they refine the grader.

This maps directly to the talk: they are “paranoid” about QA. They likely sample traces, read examples manually, inspect weird failures, and keep tightening the eval.

## 5. The grader becomes the reward function

Once the grader is trusted, it becomes the RL reward.
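A minimal sketch of that handoff, assuming the grader emits the fields from the simple example in section 3. The `GraderOutput` type, the 0.8 pass threshold, the agreement check, and the rule that a critical error zeroes the reward are all hypothetical choices for illustration, not Applied Compute's actual design.

```python
from dataclasses import dataclass


@dataclass
class GraderOutput:
    """Fields mirror the illustrative grader output in section 3."""
    missing_items: int
    wrong_prices: int
    bad_hierarchy: int
    critical_error: bool
    quality_score: float  # assumed to lie in [0.0, 1.0]


def passes(out: GraderOutput, threshold: float = 0.8) -> bool:
    """Hypothetical pass rule: no critical error and score at or above threshold."""
    return (not out.critical_error) and out.quality_score >= threshold


def agreement_rate(grader_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of examples where grader and human pass/fail verdicts match."""
    if len(grader_verdicts) != len(human_verdicts) or not human_verdicts:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(g == h for g, h in zip(grader_verdicts, human_verdicts))
    return matches / len(human_verdicts)


def reward(out: GraderOutput) -> float:
    """Assumed reward shaping: a critical error zeroes the reward, else use the score."""
    return 0.0 if out.critical_error else out.quality_score


example = GraderOutput(missing_items=2, wrong_prices=1, bad_hierarchy=3,
                       critical_error=True, quality_score=0.71)
print(passes(example), reward(example))  # False 0.0

# Made-up verdicts: the grader and the human reviewers disagree on one of four menus.
print(agreement_rate([True, False, True, True], [True, False, False, True]))  # 0.75
```

In this framing, "trusted" would mean the agreement rate on held-out, human-reviewed menus is high enough that optimizing `reward` is a reasonable proxy for optimizing human judgment.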
Now the model can attempt the task many times:

```text
Given:
- messy restaurant menu input
- existing structured menu output

Model task:
- produce corrected structured menu

Reward:
- score from DoorDash-calibrated grader
```

The model is no longer being optimized for generic helpfulness. It is being optimized for:

> “Does this output satisfy DoorDash’s internal menu quality standard?”

That is the key difference.

## 6. They train against real DoorDash-style cases

Training probably uses examples from DoorDash’s real distribution:

- simple menus
- messy scans
- menus with weird combo structures
- bilingual menus
- inconsistent prices
- add-ons and modifiers
- breakfast / lunch / dinner categories
- nested choices
- edge cases that caused past production issues

The point is not to make a generally smarter model. The point is to make a model that is very good at **DoorDash menu correction**.

Applied Compute’s case study gives a tiny example of the kind of reasoning the model learns:

> One item is a “Taco Dinner” with rice, beans, salad; another is just “Tacos” without sides. Therefore they should be separate categories/items rather than merged incorrectly.

That is exactly the kind of judgment that lives in human menu QA experience.

## 7. They instrument everything

Applied Compute’s value seems to be not only model training, but the tooling around it:

- experiment tracking
- rollout tracing
- grader calibration
- automated failure detection
- dashboards by error type
- latency/cost monitoring
- comparison across model candidates
- human review queues for suspicious examples

So a DoorDash person can see:

```text
Model version: menu-corrector-r48
Critical menu error rate: down 30%
Bad hierarchy errors: down 22%
Modifier errors: down 35%
Wrong price errors: unchanged
Latency: within production budget
Cost per menu: acceptable
Human reviewer agreement: high
```

This is the “observability” part he emphasized in the talk.

## 8. They validate offline with humans

Before shipping, DoorDash does stricter offline validation.

Likely process:

- Take a large held-out sample of production menus.
- Run baseline system.
- Run new Applied Compute-trained model.
- Hide which is which from human graders.
- DoorDash reviewers score both.
- Compare error rates.

The case study says human reviewers confirmed the gains were real.

## 9. They run a production A/B test

Then DoorDash runs an A/B test:

- Some menu traffic gets the old system.
- Some gets the new error-correction model.
- Human reviewers evaluate outputs.
- They measure whether low-quality menus decrease.
- They check latency, cost, robustness, and failure modes.

The result: roughly **30% relative reduction in low-quality / critical-error menus**.

## 10. Applied Compute hands over a production-ready library

The case study says Applied Compute delivered a production-ready library that integrated directly into DoorDash’s codebase.

So the final artifact is not just a model checkpoint. It is likely something like:

- a model or model endpoint
- an inference wrapper
- validation logic
- logging / monitoring hooks
- integration code
- eval harness
- retraining pathway

DoorDash can then call it inside its menu onboarding pipeline.

## 11. Continuous improvement loop

After production, the loop continues:

- Production corrections become new training/eval data.
- Human QA identifies new failure modes.
- Grader gets updated.
- Model gets retrained.
- New A/B tests confirm improvements.

This is what Applied Compute means by a system that “improves with every interaction.”

## The real work, in one sentence

Applied Compute’s engineer goes onsite and helps turn this:

```text
“Our best DoorDash menu QA people know what a correct menu looks like.”
```

into this:

```text
A calibrated automated grader + RL training environment + production model that encodes DoorDash’s menu quality judgment.
```

## Concrete before/after

Before:

```text
Merchant uploads messy PDF menu.
Generic AI extracts menu.
Human QA fixes long-tail mistakes.
Some critical errors remain.
Improvement is slow because the knowledge lives in human reviewers’ heads.
```

After:

```text
Merchant uploads messy PDF menu.
AI extracts menu.
Specialized DoorDash-trained model corrects the structured menu.
Automated grader enforces DoorDash-specific standards.
Human QA handles fewer critical errors.
Production feedback continuously improves the system.
```

So visually, the engagement is less like a SaaS install and more like:

> Applied Compute sends a small, technical team into the customer’s actual workflow, watches how the work is done, captures the judgment behind it, builds evals/rewards around that judgment, trains a specialized agent, validates it with humans, then integrates it into production.
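A note on reading the A/B result in section 9: a "relative reduction" is the drop in the error rate divided by the baseline error rate, not a difference in percentage points. The sketch below shows the arithmetic. The counts are made up purely to illustrate the formula and are not DoorDash's numbers; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline_rate: float, treatment_rate: float) -> float:
    """Relative reduction = (baseline - treatment) / baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - treatment_rate) / baseline_rate


# Hypothetical counts: low-quality menus out of all menus reviewed in each arm.
baseline_bad, baseline_total = 100, 1000    # old system: 10% low-quality
treatment_bad, treatment_total = 70, 1000   # new model: 7% low-quality

baseline_rate = baseline_bad / baseline_total
treatment_rate = treatment_bad / treatment_total

result = relative_reduction(baseline_rate, treatment_rate)
print(f"{result:.0%}")  # 30%
```

With these illustrative counts the rate falls by 3 percentage points, from 10% to 7%, which is a 30% relative reduction.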