Disclaimer: The details in this post have been derived from the official documentation shared online by the DoorDash Engineering Team. All credit for the technical details goes to the DoorDash Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We've attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When we order food online, the last thing we want is an out-of-date or inaccurate menu. For delivery platforms, however, keeping menus fresh is a never-ending challenge. Restaurants constantly update items, prices, and specials, and doing all of this manually at scale is costly and slow.

DoorDash tackled this problem by applying large language models (LLMs) to automate the process of turning restaurant menu photos into structured, usable data. The technical goal was clear: transcribe menu photos into structured menu data accurately, while keeping latency and cost low enough for production at scale.

On the surface, the idea is straightforward: take a photo, run it through AI, and get back a clean digital menu. In practice, though, the messy reality of real-world images (cropped photos, poor lighting, cluttered layouts) quickly exposes the limitations of LLMs on their own. The key insight was that LLMs, while strong at summarization and organization, break down when faced with noisy or incomplete inputs. To overcome this, DoorDash designed a system with guardrails: mechanisms that decide when automation is reliable enough to use and when a human needs to step in. In this article, we will look at how DoorDash designed such a system and the challenges the team faced.

Baseline MVP

The first step was to prove whether menus could be digitized in an automated way at all. The engineering team started with a simple pipeline: OCR to LLM. The OCR system extracted raw text from menu photos, and a large language model then converted that text into a structured schema of categories, items, and attributes.
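To make the baseline concrete, here is a minimal sketch of such an OCR-to-LLM pipeline. It assumes pytesseract as the OCR engine; the prompt and the call_llm helper are hypothetical stand-ins, not DoorDash's actual stack.

```python
# Minimal sketch of the baseline OCR -> LLM pipeline.
# pytesseract is an assumed OCR engine; call_llm is a hypothetical placeholder.
import json

import pytesseract
from PIL import Image

PROMPT = (
    "Convert this raw menu text into JSON with the shape "
    '{"categories": [{"name": ..., "items": [{"name": ..., "price": ...}]}]}\n\n'
)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for whichever LLM completion API is used."""
    raise NotImplementedError

def transcribe_menu(image_path: str) -> dict:
    raw_text = pytesseract.image_to_string(Image.open(image_path))  # step 1: OCR the photo
    response = call_llm(PROMPT + raw_text)                          # step 2: LLM structures the text
    return json.loads(response)  # assumes the model returns valid JSON
```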
This approach worked well enough as a prototype. It showed that a machine could, in principle, take a photo of a menu and output something resembling a digital menu. But once the system was tested at scale, cracks began to appear. Accuracy suffered in ways that were too consistent to ignore. The failures fell into three main buckets:

- Inconsistent menu structure: cluttered or unconventional layouts confused the mapping from raw text to categories and items.
- Incomplete menus: cropped or partial photos left out sections, so items went missing from the transcription.
- Poor photographic quality: blur and bad lighting degraded the OCR output, and those errors cascaded into the structured result.
Through human evaluation, the team found that nearly all transcription failures could be traced back to one of these three buckets.

LLM Guardrail Model

To solve the accuracy problem, the engineering team introduced what they call a guardrail model. At its core, this is a classifier that predicts whether the transcription produced from a given menu photo will meet the accuracy bar required for production. The logic is straightforward:

- If the predicted accuracy clears the bar, the transcription can be used automatically.
- If it falls short, the menu is routed to a human for review.
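In code, that gate reduces to a thresholded classifier score. The threshold value and the sklearn-style predict_proba interface below are illustrative assumptions, not DoorDash's published configuration.

```python
# Sketch of the guardrail decision rule; threshold and interface are assumed.
ACCURACY_THRESHOLD = 0.9  # assumed bar, tuned on labeled evaluation menus

def route(guardrail_model, feature_vector, transcription):
    # Predicted probability that this transcription meets the accuracy bar.
    p_accurate = guardrail_model.predict_proba([feature_vector])[0][1]
    if p_accurate >= ACCURACY_THRESHOLD:
        return "auto_publish", transcription  # high confidence: ship it
    return "human_review", transcription      # low confidence: escalate
```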
Building the guardrail meant more than just looking at the image. The team realized the model needed to understand how the photo, the OCR system, and the LLM all interacted with each other. So they engineered features from three different sources:

- The image itself, capturing photographic quality.
- The OCR process, capturing how reliably text was extracted.
- The LLM output, capturing how well the transcription fits the expected menu structure.
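A sketch of what assembling such features might look like follows. Every individual feature name here is an illustrative assumption; the post does not enumerate the exact features used.

```python
# Illustrative feature assembly from the three signal sources.
# All feature names are assumptions about the kind of signal, not a confirmed list.
def build_guardrail_features(image_stats: dict, ocr_result: dict, llm_output: dict) -> dict:
    categories = llm_output.get("categories", [])
    return {
        # Image signals: photographic quality of the menu photo.
        "image_width": image_stats["width"],
        "image_blur_score": image_stats["blur_score"],
        # OCR signals: how cleanly text was extracted.
        "ocr_mean_confidence": ocr_result["mean_confidence"],
        "ocr_num_lines": ocr_result["num_lines"],
        # LLM signals: how well the output fits the menu schema.
        "llm_num_categories": len(categories),
        "llm_items_missing_price": sum(
            1
            for cat in categories
            for item in cat.get("items", [])
            if not item.get("price")
        ),
    }
```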
This multi-view approach directly targets the three failure modes identified earlier: inconsistent menu structure, incomplete menus, and poor photographic quality. By combining signals from the image, the OCR process, and the LLM itself, the guardrail learns to separate high-confidence transcriptions from those that are likely to go wrong.

Guardrail Model Training and Performance

Designing the guardrail model opened up the question of which architecture would actually work best in practice. The team experimented with a three-component neural network design:

- An image encoder (ResNet-based and ViT-based variants were tried) that embeds the menu photo.
- Dense layers that encode the tabular OCR and LLM features.
- A final classifier head that merges the two representations and predicts whether the transcription meets the accuracy bar.
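In PyTorch terms, the fusion design might look like the sketch below. The ResNet-18 backbone and all layer sizes are illustrative assumptions, not the team's actual configuration.

```python
# Sketch of the three-component fusion network; backbone and sizes are assumed.
import torch
import torch.nn as nn
import torchvision.models as models

class GuardrailFusionNet(nn.Module):
    def __init__(self, num_tabular_features: int):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Component 1: image encoder (ResNet without its classification head).
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Component 2: dense layers over the tabular OCR/LLM features.
        self.tabular_encoder = nn.Sequential(
            nn.Linear(num_tabular_features, 64), nn.ReLU(),
        )
        # Component 3: fusion head producing P(transcription is accurate).
        self.classifier = nn.Sequential(
            nn.Linear(512 + 64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, image: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        img_vec = self.image_encoder(image).flatten(1)  # (batch, 512)
        tab_vec = self.tabular_encoder(tabular)         # (batch, 64)
        fused = torch.cat([img_vec, tab_vec], dim=1)
        return torch.sigmoid(self.classifier(fused))
```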
The diagram below illustrates this design: an image model on one side, OCR/LLM tabular features on the other, both feeding into dense layers and then merging into a final classifier. It is a standard multimodal fusion approach designed to capture signals from multiple sources simultaneously.

The results, however, were surprising. Despite the sophistication of the neural network, the simplest model, LightGBM (a gradient-boosted decision tree), outperformed all the deep learning variants. LightGBM not only achieved higher accuracy but also ran faster, which made it far more suitable for production deployment. Among the neural network variants, ResNet-based encoding came closest, while ViT-based models performed worst. The main reason was data: limited labeled samples made it difficult for the more complex architectures to shine.
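A minimal sketch of training such a LightGBM guardrail with the scikit-learn wrapper is shown below. The hyperparameters and labeling scheme are assumptions for illustration.

```python
# Sketch of training the LightGBM guardrail; hyperparameters are illustrative.
import lightgbm as lgb

# X_*: tabular features from the image, OCR, and LLM signals.
# y_*: 1 if the transcription met the accuracy bar in human evaluation, else 0.
def train_guardrail(X_train, y_train, X_val, y_val):
    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )
    return model
```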
Production Pipeline

Once the guardrail model was in place, the team built a full production pipeline that balanced automation with human review. It works step by step:

- Menu photos enter the pipeline and are transcribed by the model.
- The guardrail evaluates each transcription and predicts whether it meets the accuracy bar.
- High-confidence transcriptions flow directly into the system.
- Uncertain cases branch off to human operators for review and correction.

The diagram below shows this pipeline as a flow: menu photos enter, pass through the transcription model, and are then evaluated by the guardrail. From there, accurate cases flow directly into the system, while uncertain ones branch off toward human operators. This setup immediately raised efficiency. Machines handled the straightforward cases quickly, while humans focused their effort on the difficult menus. The result was a balanced process: automation sped up operations and cut costs without lowering the quality of the final menu data.

Rapid Evolution with Multimodal GenAI

The pace of AI research did not stand still. In the months after the first guardrail model went live, multimodal LLMs (models that can process both images and text directly) became practical enough to try in production. Instead of relying only on OCR to extract text, these models could look at the raw photo and infer structure directly. The DoorDash engineering team integrated these multimodal models alongside the existing OCR + LLM pipeline. Each approach had clear strengths and weaknesses, and the guardrail arbitrates between them, as sketched after this list:

- The OCR + LLM pipeline benefits from explicit text extraction, but it inherits every OCR error on noisy photos.
- Multimodal models read structure directly from the raw image, sidestepping OCR, but they bring their own failure modes on difficult inputs.
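Here is an illustrative sketch of such arbitration, assuming hypothetical transcribe_ocr_llm, transcribe_multimodal, and guardrail_score helpers; the actual arbitration logic is not published.

```python
# Illustrative arbitration between the two pipelines under one guardrail.
# All helpers and the threshold are hypothetical assumptions.
THRESHOLD = 0.9  # assumed accuracy bar

def transcribe_ocr_llm(photo):
    """Hypothetical: the OCR + LLM pipeline from earlier."""
    raise NotImplementedError

def transcribe_multimodal(photo):
    """Hypothetical: a multimodal LLM transcribing the raw photo."""
    raise NotImplementedError

def guardrail_score(photo, transcription) -> float:
    """Hypothetical: the guardrail model's predicted accuracy."""
    raise NotImplementedError

def transcribe_with_arbitration(photo):
    candidates = [transcribe_ocr_llm(photo), transcribe_multimodal(photo)]
    # Score every candidate transcription with the same guardrail model.
    scored = [(guardrail_score(photo, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score >= THRESHOLD:
        return "auto_publish", best   # a confident transcription ships
    return "human_review", best       # neither pipeline cleared the bar
```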
The diagram below shows how the two pipelines now coexist under the same guardrail system. Both models attempt transcription, and their outputs are evaluated. The guardrail then decides which transcriptions meet the accuracy bar and which need human review. This hybrid setup delivered the best of both worlds: by letting the guardrail arbitrate quality between multimodal and OCR-based models, the system boosted automation rates while still preserving the high accuracy required for production.

Conclusion

Automating the transcription of restaurant menus from photos is a deceptively complex problem. What began as a simple OCR-to-LLM pipeline quickly revealed its limits when confronted with messy, real-world inputs: inconsistent structures, incomplete menus, and poor image quality.

The engineering team's solution was not just to push harder on the models themselves, but to rethink the system architecture. The introduction of a guardrail classifier allowed automation to scale responsibly, ensuring that customers and restaurants always saw accurate menus while machines handled the simpler cases.

As the field of generative AI evolved, the system evolved with it. By combining OCR and LLM models with newer multimodal approaches under the same guardrail framework, DoorDash was able to harness the strengths of both families of models without being trapped by their weaknesses. Looking ahead, several opportunities remain open.
References: