AI has gone big, and so have AI models. 10-billion-parameter universal models are crushing 50-million-parameter task-specific models, delivering superior performance on many tasks from a single model.
AI models are also becoming multi-modal. New vision models like Microsoft’s Florence 2 and OpenAI’s GPT-4V are expanding the applications of these models to incorporate images, video, and sound, bringing the power of large language models (LLMs) to millions of new use cases.
As bigger has proven to be better in the world of model engineering, every application has undergone a similar progression:
- One task, one domain: A simple model for a specific use case—object detectors for roads, depth segmentation models for indoor scenes, image captioning models, chatbots for web applications, etc.
- One task, every domain: Expanding the application of that simple model to many use cases—object detectors for everywhere (YOLO, DINO, etc.), depth segmentation for everything (MobileNet), chat plugins for multiple products.
- Every task, every domain: Large models that can do everything, a paradigm shift made possible by new LLMs—e.g., Florence, GPT-4V, ChatGPT.
- Every task, one domain: Optimizing large models for one domain, enabling real-time applications and higher reliability—e.g., GPT-3.5-Turbo for interactive searching, Harvey.ai for researching and drafting legal docs, DriveGPT for autonomous driving.
Autonomous driving on small models
Autonomous driving still runs on small models. And while a combination of many single-task models, specialized sensors, and precise mapping has delivered an impressive prototype, today’s recipe does not yet deliver the safety or scale necessary to support everyday drivers.
Here is what is still holding us back:
- Zero-shot generalization. Existing models often fail in scenarios they have never seen before, commonly called “the long tail” of driving. If not sufficiently trained, models have no ability to reason from first principles about what to do next. The solution to date has been to build yet another special-purpose model. Dynamic scenarios that are tough to map are a key weakness of most autonomous products.
- Interpreting driver and actor intent. Existing models fail to understand the subtleties of human interaction and intent, with respect to both the driver inside the vehicle and road actors outside the vehicle.
- Mapping the entire world, accurately. While well-mapped areas are mostly drivable, accurate HD mapping has proven difficult to scale. And without accurate maps, map-based driving does not work well.
- Scaling vehicles. Today’s small fleets of robotaxis rely on specialized sensors, expensive compute, and combinations of many special-purpose models—a complex and expensive recipe that has yet to scale to everyday drivers.
LLMs and the long tail problem
Across all applications, model engineers are using LLMs as superpowered development tools to improve nearly every aspect of the model engineering process. LLMs have proven extremely useful for developing and improving simulation environments, for sorting, understanding, and labeling massive data sets, and for interpreting and debugging the “black boxes” that are neural networks.
Perhaps one of the biggest advantages of LLMs in the development process is the ability to express complex, multi-step logic in natural language, speeding up development by bypassing the need for expert code. This has already proven quite useful in complex problem areas such as text summarization or code completion with complex dependencies across the code base.
All of these engineering tools stand to improve development efforts broadly, including autonomy, but the most interesting and impactful application of LLMs is directly on the driving task itself: reasoning about complex scenarios and planning the safest path forward.
Autonomous driving is an especially challenging problem because certain edge cases require complex, human-like reasoning that goes far beyond legacy algorithms and models. LLMs have shown promise in going beyond pure correlations to demonstrating a real “understanding of the world.” This new level of understanding extends to the driving task, enabling planners to navigate complex scenarios with safe and natural maneuvers without requiring explicit training.
Where existing models might be confused by construction workers in an intersection or by the need to route around an accident scene, LLMs have shown the ability to reason about the right route and speed with remarkable proficiency. LLMs offer a new path to solving “the long tail,” i.e., the ability to handle situations never seen before. The long tail has been the fundamental challenge of autonomous driving over the past two decades.
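To make this concrete, a planner might hand the LLM a structured description of the scene and ask it to reason before committing to a maneuver. The sketch below assumes the OpenAI Python client; the scene encoding and the JSON output schema are hypothetical illustrations, not a production planner interface.

```python
# A minimal sketch, assuming the OpenAI Python client (openai>=1.0).
# The scene encoding and output schema below are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

scene = {
    "ego_speed_mph": 18,
    "signal_state": "green",
    "detections": [
        {"type": "person", "location": "center of intersection",
         "attributes": ["high-vis vest", "hand raised"]},
        {"type": "traffic_cone", "count": 6, "location": "right lane"},
    ],
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "You are a driving planner. Reason step by step about the scene, "
            'then answer with JSON: {"maneuver": ..., "target_speed_mph": ..., '
            '"rationale": ...}'
        )},
        {"role": "user", "content": json.dumps(scene)},
    ],
)
print(response.choices[0].message.content)
# A capable model infers the person is likely a flagger or crossing guard,
# so the green light alone should not determine the plan.
```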
Limitations of LLMs for autonomous tasks
Large language models today still have real limitations for autonomous applications. Put simply, LLMs will need to become much more reliable and much faster. But solutions exist, and this is where the hard work is being done.
Latency and real-time constraints
Safety-critical driving decisions must be made in less than one second. The latest LLMs running in data centers can take 10 seconds or more.
One solution to this problem is hybrid-cloud architectures that supplement in-car compute with data center processing. Another is purpose-built LLMs that compress large models into form factors small enough and fast enough to fit in the car. Already we are seeing dramatic improvements in optimizing large models. Mistral 7B and Llama 2 7B have demonstrated performance rivaling GPT-3.5 with an order of magnitude fewer parameters (7 billion vs. 175 billion). Moore’s Law and continued optimizations should rapidly shift more of these models to the edge.
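As a rough illustration of the hybrid approach, the sketch below races a fast in-car model against a slower data-center model under a hard latency budget, falling back to the onboard answer when the cloud misses the deadline. The function names, timings, and maneuver labels are all hypothetical stand-ins.

```python
# Hedged sketch of a hybrid edge/cloud planner with a hard latency budget.
# onboard_plan and cloud_plan stand in for a small in-car model and a
# large data-center model; the sleep() timings are purely illustrative.
import asyncio

LATENCY_BUDGET_S = 0.5  # safety-critical decisions need sub-second answers

async def onboard_plan(scene: dict) -> str:
    await asyncio.sleep(0.05)   # small distilled model: fast, always available
    return "slow_and_yield"

async def cloud_plan(scene: dict) -> str:
    await asyncio.sleep(2.0)    # large remote model: more capable, but slow
    return "reroute_around_construction"

async def plan(scene: dict) -> str:
    onboard = asyncio.create_task(onboard_plan(scene))  # run both in parallel
    cloud = asyncio.create_task(cloud_plan(scene))
    try:
        # Use the cloud answer only if it beats the deadline.
        answer = await asyncio.wait_for(cloud, timeout=LATENCY_BUDGET_S)
        onboard.cancel()
        return answer
    except asyncio.TimeoutError:
        return await onboard    # fall back to the guaranteed edge answer

print(asyncio.run(plan({"hazard": "construction"})))  # -> "slow_and_yield"
```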
Hallucinations
Large language models reason based on correlations, but not all correlations are valid in particular scenarios. For example, a person standing in the intersection could mean stop (pedestrian), go (crossing guard), or slow down (construction worker). Positive correlations do not always deliver the correct answer. When the model produces an output that does not reflect reality, we refer to that outcome as a “hallucination.”
Reinforcement learning from human feedback (RLHF) offers a potential solution by aligning the model with human feedback so it learns to handle complex driving scenarios like these. With better data quality, smaller models like Llama 2 70B are performing on par with GPT-4 with roughly 20x fewer parameters (70 billion vs. a reported 1.7 trillion).
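For intuition, here is a minimal sketch of the reward-modeling step that underpins RLHF, written in PyTorch. Given two candidate plans for the same scene, a human marks one as safer, and the reward model is trained to score the preferred plan higher via a pairwise (Bradley-Terry) loss. The features, shapes, and data are placeholders; a real system would encode full scene-plan pairs.

```python
# Sketch of RLHF's reward-modeling step, assuming PyTorch.
# Inputs are placeholder feature vectors for (preferred, rejected) plan pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)  # scalar score per plan

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Placeholder features for plan pairs labeled by human raters.
preferred = torch.randn(32, 128)
rejected = torch.randn(32, 128)

# Pairwise loss: push the preferred plan's score above the rejected one's.
loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
# The trained reward model then steers the policy (e.g., via PPO) toward
# maneuvers humans judged safer.
```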
Research projects are also making high-quality data easier to scale. For example, the OpenChat framework takes advantage of new techniques like reinforcement learning fine-tuning (RLFT) that advance performance while avoiding costly human preference labeling.
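To make that intuition concrete, here is a loose PyTorch sketch of quality-weighted fine-tuning, replacing per-example human preference labels with coarse source-quality weights. It illustrates the general RLFT idea only, not OpenChat's published recipe; the model, tokens, and weights are synthetic placeholders.

```python
# Loose illustration (not OpenChat's exact recipe) of RLFT-style training:
# each sequence carries a coarse source-quality weight, and the fine-tuning
# loss is scaled by that weight. PyTorch; all data here is synthetic.
import torch
import torch.nn.functional as F

vocab, dim = 1000, 64
lm = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                         torch.nn.Linear(dim, vocab))  # toy stand-in for an LLM
opt = torch.optim.Adam(lm.parameters(), lr=1e-4)

tokens = torch.randint(0, vocab, (8, 33))  # synthetic token sequences
# Coarse quality labels per sequence (e.g., expert vs. scraped source).
quality = torch.tensor([1.0, 1.0, 0.3, 0.3, 1.0, 0.3, 1.0, 0.3])

inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = lm(inputs)  # (batch, seq, vocab)
per_token = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                            reduction="none").view(8, -1)
loss = (quality.unsqueeze(1) * per_token).mean()  # upweight better sources
opt.zero_grad()
loss.backward()
opt.step()
```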
The new long tail
Language models have “everything” encoded into them, but still may not have every driving-specific concept covered, such as the ability to navigate a busy intersection under construction. One potential solution here is exposing the model to long sequences of proprietary driving data that can embed these more detailed concepts in the model. As an example, Replit has used proprietary coding data from its user base to continuously improve its code generation tools with fine-tuning, outperforming larger models like Code Llama 7B.
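Here is a hedged sketch of what that could look like in practice: continued fine-tuning of a pretrained causal language model on serialized driving episodes, using Hugging Face Transformers. The checkpoint, the text serialization of scenes, and the episodes themselves are stand-ins for proprietary logs.

```python
# Minimal sketch of continued fine-tuning on proprietary domain sequences,
# assuming Hugging Face Transformers. The "scene | act" text format is a
# hypothetical serialization; any causal LM checkpoint works the same way.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical driving episodes serialized as text, one per line.
episodes = [
    "scene: 4-way intersection, cones in right lane, flagger waving | act: creep forward",
    "scene: double-parked truck, oncoming lane clear | act: borrow oncoming lane",
]
enc = tok(episodes, truncation=True, padding=True, return_tensors="pt")

class LogDataset(torch.utils.data.Dataset):
    def __len__(self):
        return enc["input_ids"].size(0)
    def __getitem__(self, i):
        ids = enc["input_ids"][i]
        # Standard causal-LM objective: the labels are the inputs themselves.
        return {"input_ids": ids, "attention_mask": enc["attention_mask"][i],
                "labels": ids.clone()}

Trainer(model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1,
                               per_device_train_batch_size=2, report_to=[]),
        train_dataset=LogDataset()).train()
```

For brevity the sketch does not mask pad tokens out of the loss; a production pipeline would, and would stream far longer episode sequences than two toy lines.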
A new future for autonomous driving
Autonomous driving has yet to reach the mainstream, with only a handful of vehicles today tackling the most complex urban environments. Large models are transforming how we develop autonomous driving models, and ultimately they will transform autonomous driving—providing the safety and scale necessary to finally deliver the technology to everyday drivers.
Prannay Khosla leads model engineering at Ghost Autonomy, a provider of autonomous driving software.
—
Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.