Backlink: [[2026-05-21]]
# Applied Compute DoorDash Engagement
In practice, this looks like an **enterprise AI field-engineering engagement**, not “send us your data and we fine-tune a model.”
Using the DoorDash case, the workflow likely looked roughly like this:
## 1. Applied Compute embeds with the customer team
Applied Compute says they worked onsite at DoorDash’s Sunnyvale office.
The people involved were:
- DoorDash ML / product experts
- DoorDash internal menu QA experts
- Applied Compute engineers / applied researchers
The goal of the first phase is not training. It is understanding:
- What does “good menu output” mean?
- What are the real failure modes?
- Which mistakes actually matter?
- What does DoorDash already measure?
- What does the current AI system output?
- What do human reviewers correct?
- Where do reviewers disagree?
- Which edge cases are common enough to care about?
So the Applied Compute engineer is probably sitting with DoorDash people looking at actual menus, AI-generated structured outputs, QA labels, reviewer notes, and production examples.
## 2. They study the existing human workflow
For DoorDash, the human workflow is something like:
- Merchant submits a restaurant menu as photo / PDF / text.
- DoorDash’s system converts it into structured menu data.
- Human QA reviewers check whether the menu is correct.
- Reviewers flag errors:
  - wrong item name
  - missing modifier
  - wrong price
  - bad category
  - duplicate item
  - incorrect hierarchy
  - side dish represented incorrectly
  - option groups modeled incorrectly
  - combo meal split incorrectly
- Reviewers decide what counts as critical.
The important insight from the case study: **generation was hard to specify, but verification was easier.**
Meaning: humans may disagree on the perfect SOP for creating a menu, but if you show them a candidate structured menu and the ground truth, they can more consistently say whether it is right or wrong.
So Applied Compute turns the problem from:
> “Teach the model how to create menus perfectly.”
into:
> “Build a grader that can judge whether a generated menu matches DoorDash’s quality standard.”
## 3. They convert DoorDash’s implicit judgment into an eval / grader
This is the core work.
Applied Compute takes:
- historical menus
- structured menu outputs
- QA labels
- human reviewer decisions
- DoorDash style guidelines
- known production error types
- edge cases
and builds an **automated grader**.
The grader needs to answer:
- Is this menu item correctly represented?
- Are categories correct?
- Are variants / modifiers correct?
- Are prices correct?
- Are combo meals modeled correctly?
- Are critical errors present?
- Which type of error is this?
- How severe is it?
- Would DoorDash’s human reviewers pass or fail this?
Then they calibrate it against human experts.
A very simple version:
```text
Input:
- Original restaurant menu PDF/photo/text
- Current AI-generated structured menu
- DoorDash ground-truth / corrected menu
Grader output:
- missing_items: 2
- wrong_prices: 1
- bad_hierarchy: 3
- critical_error: yes
- quality_score: 0.71
- pass_fail: fail
```
But in reality it is much more nuanced because menu structure involves judgment.
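As a rough illustration of the simple version, here is a minimal Python sketch of a grader that compares a candidate menu to ground truth item by item. Everything in it is an assumption made for illustration: the `MenuItem` and `GradeReport` types, the matching-by-name rule, the definition of a critical error, and the `pass_threshold` default are hypothetical and are not taken from DoorDash or Applied Compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MenuItem:
    # Hypothetical, heavily simplified menu item.
    name: str
    price_cents: int
    category: str


@dataclass(frozen=True)
class GradeReport:
    missing_items: int
    wrong_prices: int
    bad_category: int
    critical_error: bool
    quality_score: float
    passed: bool


def grade_menu(candidate, truth, pass_threshold=0.9):
    """Grade a candidate menu against the corrected (ground-truth) menu."""
    # Assumption: items are matched by case-insensitive name.
    candidate_by_name = {item.name.lower(): item for item in candidate}
    missing = wrong_prices = bad_category = fully_correct = 0

    for expected in truth:
        got = candidate_by_name.get(expected.name.lower())
        if got is None:
            missing += 1
            continue
        price_ok = got.price_cents == expected.price_cents
        category_ok = got.category == expected.category
        if not price_ok:
            wrong_prices += 1
        if not category_ok:
            bad_category += 1
        if price_ok and category_ok:
            fully_correct += 1

    # Score = fraction of ground-truth items reproduced exactly.
    quality = fully_correct / len(truth) if truth else 1.0
    # Assumption: missing items and wrong prices count as critical.
    critical = missing > 0 or wrong_prices > 0
    passed = (not critical) and quality >= pass_threshold
    return GradeReport(missing, wrong_prices, bad_category, critical, quality, passed)


truth_menu = [
    MenuItem("Tacos", 899, "Entrees"),
    MenuItem("Taco Dinner", 1299, "Dinners"),
    MenuItem("Horchata", 350, "Drinks"),
]
candidate_menu = [
    MenuItem("Tacos", 899, "Entrees"),
    MenuItem("Taco Dinner", 1199, "Dinners"),  # wrong price
]
report = grade_menu(candidate_menu, truth_menu)
```

A real grader would also need to handle modifiers, nested option groups, and fuzzy name matching, which is where most of the judgment lives.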
## 4. They run disagreement analysis
This is probably a big part of the onsite work.
They ask:
- Where does the automated grader disagree with humans?
- Where do humans disagree with each other?
- Which failures are acceptable?
- Which failures are business-critical?
- Which errors hurt customers / merchants most?
- Which errors are rare but catastrophic?
- Which labels are noisy?
Then they refine the grader.
This maps directly to the talk: they are “paranoid” about QA. They likely sample traces, read examples manually, inspect weird failures, and keep tightening the eval.
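One way to make the grader-versus-human comparison concrete is to measure chance-corrected agreement and pull out the disagreeing examples for manual reading. The sketch below uses Cohen's kappa on pass/fail labels. The choice of kappa is my assumption, since the case study does not say which agreement metric was used, and the function names and sample labels are hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of boolean pass/fail labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pass_rate_a = sum(labels_a) / n
    pass_rate_b = sum(labels_b) / n
    # Agreement expected by chance if the two raters were independent.
    expected = pass_rate_a * pass_rate_b + (1 - pass_rate_a) * (1 - pass_rate_b)
    if expected == 1.0:
        return 1.0  # both raters always give the same single label
    return (observed - expected) / (1 - expected)


def disagreement_indices(labels_a, labels_b):
    """Indices of examples worth reading by hand."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical labels: True means "menu passes".
human_labels = [True, True, False, False]
grader_labels = [True, False, False, False]

kappa = cohens_kappa(human_labels, grader_labels)
to_review = disagreement_indices(human_labels, grader_labels)
```

Running the same function on two human reviewers gives a rough ceiling: a grader cannot be expected to agree with humans more than humans agree with each other.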
## 5. The grader becomes the reward function
Once the grader is trusted, it becomes the RL reward.
Now the model can attempt the task many times:
```text
Given:
- messy restaurant menu input
- existing structured menu output
Model task:
- produce corrected structured menu
Reward:
- score from DoorDash-calibrated grader
```
The model is no longer being optimized for generic helpfulness. It is being optimized for:
> “Does this output satisfy DoorDash’s internal menu quality standard?”
That is the key difference.
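To show how a grader score could turn into a training signal, here is a minimal sketch that converts grader output into a scalar reward and then mean-centers the rewards across several attempts at the same menu. The critical-error penalty and the mean-centering baseline are common choices that I am assuming for illustration. The case study does not specify the reward shape or the RL algorithm, and all names here are hypothetical.

```python
from statistics import mean


def reward_from_grade(quality_score, critical_error, critical_penalty=1.0):
    """Scalar reward: grader score, minus a penalty if a critical error is present."""
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be in [0, 1]")
    return quality_score - (critical_penalty if critical_error else 0.0)


def centered_advantages(rewards):
    """Reward of each attempt relative to the average attempt on the same input."""
    if not rewards:
        raise ValueError("need at least one reward")
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical grader results for three attempts at one menu:
# (quality_score, critical_error)
attempts = [(0.9, False), (0.7, False), (0.8, True)]
rewards = [reward_from_grade(score, critical) for score, critical in attempts]
advantages = centered_advantages(rewards)
```

The third attempt has a decent score but is pushed below the others by the critical-error penalty, which reflects the idea that some mistakes matter far more than others.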
## 6. They train against real DoorDash-style cases
Training probably uses examples from DoorDash’s real distribution:
- simple menus
- messy scans
- menus with weird combo structures
- bilingual menus
- inconsistent prices
- add-ons and modifiers
- breakfast / lunch / dinner categories
- nested choices
- edge cases that caused past production issues
The point is not to make a generally smarter model. The point is to make a model that is very good at **DoorDash menu correction**.
Applied Compute’s case study gives a tiny example of the kind of reasoning the model learns:
> One item is a “Taco Dinner” with rice, beans, salad; another is just “Tacos” without sides. Therefore they should be separate categories/items rather than merged incorrectly.
That is exactly the kind of judgment that lives in human menu QA experience.
## 7. They instrument everything
Applied Compute’s value seems to be not only model training, but the tooling around it:
- experiment tracking
- rollout tracing
- grader calibration
- automated failure detection
- dashboards by error type
- latency/cost monitoring
- comparison across model candidates
- human review queues for suspicious examples
So a DoorDash person might see a summary along these lines:
```text
Model version: menu-corrector-r48
Critical menu error rate: down 30%
Bad hierarchy errors: down 22%
Modifier errors: down 35%
Wrong price errors: unchanged
Latency: within production budget
Cost per menu: acceptable
Human reviewer agreement: high
```
This is the “observability” part emphasized in the talk.
## 8. They validate offline with humans
Before shipping, DoorDash does stricter offline validation.
Likely process:
- Take a large held-out sample of production menus.
- Run baseline system.
- Run new Applied Compute-trained model.
- Hide which is which from human graders.
- DoorDash reviewers score both.
- Compare error rates.
The case study says human reviewers confirmed the gains were real.
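The blinding step in the likely process above can be sketched as follows: each menu's two outputs are shown to reviewers under the neutral labels "A" and "B", with the assignment randomized per menu and the key kept aside until scoring is done. This is an assumed procedure, not a description of DoorDash's actual tooling, and the function names and data shapes are hypothetical.

```python
import random


def make_blinding_key(menu_ids, seed=0):
    """Randomly decide, per menu, which system is shown as 'A' and which as 'B'."""
    rng = random.Random(seed)
    key = {}
    for menu_id in menu_ids:
        if rng.random() < 0.5:
            key[menu_id] = {"A": "baseline", "B": "candidate"}
        else:
            key[menu_id] = {"A": "candidate", "B": "baseline"}
    return key


def unblind_error_rates(key, verdicts):
    """verdicts maps menu_id -> {'A': has_critical_error, 'B': has_critical_error}."""
    if not verdicts:
        raise ValueError("no reviewer verdicts to score")
    errors = {"baseline": 0, "candidate": 0}
    for menu_id, by_label in verdicts.items():
        for label, has_error in by_label.items():
            system = key[menu_id][label]
            errors[system] += int(has_error)
    return {system: count / len(verdicts) for system, count in errors.items()}
```

Reviewers only ever see the "A"/"B" labels. The key is applied afterwards, so their scores cannot be influenced by knowing which output came from the new model.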
## 9. They run a production A/B test
Then DoorDash runs an A/B test:
- Some menu traffic gets the old system.
- Some gets the new error-correction model.
- Human reviewers evaluate outputs.
- They measure whether low-quality menus decrease.
- They check latency, cost, robustness, and failure modes.
The result: roughly a **30% relative reduction in low-quality / critical-error menus**.
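For clarity on what "relative reduction" means, the sketch below computes it from two error rates and adds a standard two-proportion z-statistic as one way to check that a difference is not noise. The counts are made-up numbers chosen to produce a 30% relative reduction. They are not DoorDash's actual traffic or results, and the case study does not say which statistical test was used.

```python
import math


def relative_reduction(control_rate, treatment_rate):
    """Fractional drop in the error rate relative to the control arm."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (control_rate - treatment_rate) / control_rate


def two_proportion_z(bad_control, n_control, bad_treatment, n_treatment):
    """Pooled two-proportion z-statistic for the difference in error rates."""
    p_control = bad_control / n_control
    p_treatment = bad_treatment / n_treatment
    pooled = (bad_control + bad_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        raise ValueError("no variation in outcomes; z is undefined")
    return (p_control - p_treatment) / std_err


# Hypothetical counts: low-quality menus out of menus reviewed in each arm.
bad_control, n_control = 200, 2000      # 10% low-quality
bad_treatment, n_treatment = 140, 2000  # 7% low-quality

reduction = relative_reduction(bad_control / n_control, bad_treatment / n_treatment)
z_stat = two_proportion_z(bad_control, n_control, bad_treatment, n_treatment)
```

Note that a 30% relative reduction here is a drop of only 3 percentage points in absolute terms (10% to 7%), which is why the baseline rate matters when reading the headline number.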
## 10. Applied Compute hands over a production-ready library
The case study says Applied Compute delivered a production-ready library that integrated directly into DoorDash’s codebase.
So the final artifact is not just a model checkpoint. It is likely something like:
- a model or model endpoint
- an inference wrapper
- validation logic
- logging / monitoring hooks
- integration code
- eval harness
- retraining pathway
DoorDash can then call it inside its menu onboarding pipeline.
## 11. Continuous improvement loop
After production, the loop continues:
- Production corrections become new training/eval data.
- Human QA identifies new failure modes.
- Grader gets updated.
- Model gets retrained.
- New A/B tests confirm improvements.
This is what Applied Compute means by a system that “improves with every interaction.”
## The real work, in one sentence
Applied Compute’s engineer goes onsite and helps turn this:
```text
“Our best DoorDash menu QA people know what a correct menu looks like.”
```
into this:
```text
A calibrated automated grader + RL training environment + production model that encodes DoorDash’s menu quality judgment.
```
## Concrete before/after
Before:
```text
Merchant uploads messy PDF menu.
Generic AI extracts menu.
Human QA fixes long-tail mistakes.
Some critical errors remain.
Improvement is slow because the knowledge lives in human reviewers’ heads.
```
After:
```text
Merchant uploads messy PDF menu.
AI extracts menu.
Specialized DoorDash-trained model corrects the structured menu.
Automated grader enforces DoorDash-specific standards.
Human QA handles fewer critical errors.
Production feedback continuously improves the system.
```
So the engagement is less like a SaaS install and more like:
> Applied Compute sends a small, technical team into the customer’s actual workflow, watches how the work is done, captures the judgment behind it, builds evals/rewards around that judgment, trains a specialized agent, validates it with humans, then integrates it into production.