There is No God Tier Video Model
Last year, it seemed like every week a new frontier video model would one-up its peers and push our expectations. We learned to simply expect relentless improvement on measurable benchmarks: longer video output, sustained stories, better physics, character consistency, movement cohesion, and more. But this year, if you watch the benchmarks closely, you might be getting a little restless. We now expect every video model to generate 10-15 seconds of footage with synced audio and speech and an all-around sense of realism. That's a great achievement, but not the runaway model progress to which we'd been conditioned.
Instead, we're entering a more exciting time: the product era of video models, where "abundance" isn't about benchmark performance - it's about variety. We're beginning to see models specialize across specific dimensions: there is no "God Model" that's great at everything. Startups are finding new opportunities across two main dimensions: video models that excel at one key thing (Physics! Anime! Multiple shots!) and products that abstract away arduous workflows.
Hot take: even if video model progress stopped entirely, founders would be playing years of catch-up building products around current model capabilities. This is great for startups: it creates space for verticals and wedges that can become their own massive companies.
For the past few years, frontier diffusion model labs have consistently released models that outperform prior generations on major benchmarks. This progress instilled an assumption among researchers, creators, and (yes) VCs like myself: a "god-tier" model that is great at everything would eventually emerge and become the default provider for a multitude of video use-cases. But this year, that assumption was challenged: Sora 2 was released last month and appeared below Veo 3 on benchmarks like LMArena.
There's a sense that progress might be slowing at the diffusion model level, and that the concept of "SOTA", at least when it comes to video models, may not actually exist. Of course, this assumption (and the subsequent confrontation with reality) is not unique to diffusion models. Frontier lab LLMs also followed a pattern of step-function improvements from 2023-2025, then saw performance begin to level off across a variety of benchmarks.
When this happened with LLMs, we saw major foundation labs begin to specialize in specific verticals, and the overall AI product layer take off; in fact, we wrote about this trend earlier this year. Upon reflection, it makes sense that progress on video models is beginning to plateau, at least on public benchmarks. Models have come an extremely long way in realism and aesthetic quality, and their outputs are now so lifelike that they may be hitting an asymptote.
Once you convincingly look like real life, how much more "real" can you get? A good analogy is realist painting in the 17th and 18th centuries: at a certain point, the old masters got so good at rendering portraits, still lifes, and pastoral scenes that preferring one artist over another came down to aesthetic taste, rather than any objective measure of which looked more "real".
So what happens when realism stops being the differentiator for video models? We get more variety and specialization. Welcome to the abundance era. Before we explore the increased specialization we're seeing at the model layer, it's worth taking a quick stroll (or more appropriately, scroll) down memory lane.
Back in early 2024, I wrote a piece taking stock of the state of video model outputs. At the time, it was a struggle to generate videos longer than 3-4 seconds. Physics was a joke: people might randomly melt into the ground; basketballs might rebound off the backboard, ricochet toward the ground, then miraculously end up in the hoop in the final frame.
At the time, I remarked that it would be a while before we saw AI generate a Pixar-level short film. Google hadn't yet released Veo, or any public model for that matter; they would just tease the timeline with research papers. How quickly things change. Google now dominates LMArena and other leaderboards with its Veo models.
OpenAI is producing a $30 million animated feature with a team of 30 people on a timeline of nine months (admittedly, they're not one-shotting the feature from a single prompt, but this is still staggering progress!). As stated in the introduction, we can now expect longer video outputs and higher degrees of physics realism: the basketball rebounds off the backboard and hits the ground now.
But even as models collectively are getting better, we're beginning to see them specialize. Why? Because no one model is good at everything. Some teams are optimizing for price and speed of outputs. Others are going all-in on post-training, with an apparent focus on specific verticals.
Specialization in Video Models
Here are a few examples of what I'm seeing: Veo 3 has the strongest handle on physics; anime-focused models like Aiko are getting better at stylized animation and motion; other models specialize in realistic simulations of elements like water and fire.
Veo 3's edge in physics shows up most clearly in collisions: objects make contact and rebound in ways that read as plausible, which enables scenarios that were previously out of reach for AI video generation. The model's creators have noted that the biggest challenge was ensuring consistency across different types of objects and surfaces, and that effort shows in the results.
Meanwhile, Aiko, the anime-focused model, is getting better at nuanced character expressions and interactions. It can now produce more convincing facial animation and body language, which opens up new possibilities for storytelling in anime-style videos.
These examples illustrate how video models are specializing to excel in specific areas, rather than trying to be a jack of all trades and master of none. By focusing on one area at a time, developers can create more sophisticated and realistic simulations that are better suited to their intended use cases.
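To make the product-layer implication concrete, here's a minimal sketch of what routing a request to a specialist could look like. Everything here is hypothetical: the request shape, the pick_model helper, and the model identifiers (beyond Veo 3 and Aiko, mentioned above) are illustrative, not real endpoints.

```python
from dataclasses import dataclass

# Hypothetical router: instead of assuming one god-tier model, a product can
# send each request to whichever specialist fits the job. Model identifiers
# and this request shape are illustrative, not real APIs.

@dataclass
class VideoRequest:
    prompt: str
    style: str            # e.g. "realistic", "anime"
    needs_physics: bool   # collisions, fluids, rigid bodies, etc.

def pick_model(req: VideoRequest) -> str:
    """Route to a specialized model rather than a single default."""
    if req.style == "anime":
        return "aiko"            # anime-focused motion and expressions
    if req.needs_physics:
        return "veo-3"           # strongest handle on physical interactions
    return "fast-cheap-model"    # otherwise optimize for price and speed

req = VideoRequest(
    prompt="a basketball rebounds off the backboard and hits the ground",
    style="realistic",
    needs_physics=True,
)
print(pick_model(req))  # -> veo-3
```

The interesting design question for products is who decides the routing: the user, via explicit controls, or the tool itself, by classifying the prompt.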
The Benefits of Specialization
So what benefits come from this specialization in video models? For one, precision: a model tuned for a specific domain can be post-trained to achieve exceptional performance there, rather than splitting its capacity across competing goals.
Another benefit is efficiency. When developers have the right model for the job, they iterate faster and waste fewer generations, which means projects get completed sooner and at higher quality.
Finally, specialization opens up new room for innovation and creativity. A focused model can pursue techniques and aesthetics that a general-purpose model, forced to balance everything at once, can't afford to prioritize.
The Future of Video Models
Looking ahead, it's clear that video models will continue to evolve and improve. We'll see more specialized models that excel in specific domains, as well as continued advances in physics, animation, and simulation.
One promising direction for future research is more realistic simulation of human behavior. Models that better capture the subtleties of human emotion and social interaction would enable more nuanced and engaging storytelling in AI-generated videos.
In addition to simulation and animation, dimensions like lighting, camera work, and music composition will also become increasingly important as video models continue to evolve. By giving creators control over these aspects, developers can deliver even more immersive and engaging experiences for viewers.
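As an illustration, here's a sketch of what a generation request might look like once those dimensions become first-class controls. The schema is entirely hypothetical and doesn't reflect any real model's API.

```python
# Hypothetical request schema: lighting, camera work, and audio exposed as
# first-class parameters rather than buried in a free-form prompt.
generation_config = {
    "prompt": "two friends argue on a rooftop at dusk",
    "duration_seconds": 12,
    "camera": {"movement": "slow dolly-in", "lens": "35mm"},
    "lighting": {"key": "golden hour", "mood": "warm"},
    "audio": {"dialogue": True, "score": "sparse piano"},
}
```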
Next Steps
As we move forward, it's essential that researchers, developers, and investors stay focused on the benefits of specialization in video models. By prioritizing specific areas and delivering high-quality results, we can unlock new possibilities for innovation and creativity in AI-generated videos.
I invite you to join me on this journey as we explore the future of video models and their potential applications. Stay tuned for updates on the latest developments and breakthroughs in this rapidly evolving field.