DeepMind & Apple Veterans Raise $50M to Kill the "Stitched" AI Model
The "Frankenstein" era of AI is over. Former Google and Apple researchers raise $50M for Elorian to build native AI that sees, hears, and reads in real-time.

Elorian co-founder Andrew Dai, a key architect of Google's Gemini pre-training.
The era of "gluing" vision models onto chatbots is ending. Elorian, a new AI startup founded by veterans from Google DeepMind and Apple, has secured a massive $50 million seed round to build the Holy Grail of generative AI: a truly native multimodal model.
Leading the round is Striker Venture Partners, a new firm founded by ex-CRV General Partner Max Gazor. Striker’s aggressive entry into the market signals a high-stakes bet that the current industry standard, stitching together separate text and image models, is a dead end. Elorian’s thesis is simple but expensive: to achieve true reasoning, an AI must see, hear, and read within a single neural architecture, not simulate understanding through a patchworked system.
The "Frankenstein" Flaw Explained
To understand Elorian's pitch, you have to understand the dirty secret of today's "multimodal" AI. Most commercial systems in 2024 and 2025 were built using a "Late Fusion" architecture, effectively a Frankenstein's monster.
In these systems, a vision encoder (like the "eyes") processes an image and converts it into mathematical tokens. It then hands those tokens to a Large Language Model (the "brain"), which tries to interpret them as if they were a foreign language. This hand-off is lossy. It creates latency (slow reaction times) and "blind spots" where the model misses subtle visual cues, like the texture of a fabric or the speed of a car, because it is trying to "read" the image rather than "experience" it.
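To make that hand-off concrete, here is a minimal late-fusion sketch in PyTorch. It is purely illustrative, not Elorian's or any vendor's actual code: the tiny transformer stacks below stand in for a pretrained vision encoder and a pretrained text-only LLM.

```python
import torch
import torch.nn as nn

class LateFusionVLM(nn.Module):
    """Generic late-fusion sketch: a vision encoder hands projected tokens to
    a text-only language model. Both stacks are tiny stand-ins for pretrained
    models, kept small so the example runs anywhere."""

    def __init__(self, vision_dim=768, text_dim=1024, vocab_size=32000):
        super().__init__()
        # The "eyes": stand-in for a pretrained image encoder (e.g. a ViT).
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, nhead=8, batch_first=True),
            num_layers=2)
        # The lossy hand-off: image features squeezed into "word-like" tokens.
        self.projector = nn.Linear(vision_dim, text_dim)
        # The "brain": stand-in for a pretrained text-only LLM.
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(text_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        # The LLM never sees pixels, only a prefix of projected vision tokens
        # it must "read" like a foreign language.
        vision_tokens = self.projector(self.vision_encoder(image_patches))
        sequence = torch.cat([vision_tokens, self.embed(text_ids)], dim=1)
        return self.lm_head(self.llm(sequence))

model = LateFusionVLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000]): 16 image + 8 text positions
```

The `projector` line is the whole story: every visual detail the model will ever "know" has to survive that single squeeze into word-like tokens, which is where the texture-and-motion blind spots come from.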
Elorian is building what researchers call "Early Fusion" or "Native" architecture. In this design, there is no hand-off. The neural network is trained from day one on raw pixels and raw audio waveforms simultaneously with text.
The result? An AI that doesn't just describe a video of a breaking glass; it "hears" the shatter and "sees" the shards in the same millisecond, allowing for the kind of split-second reasoning needed for robotics and real-time voice agents.
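For contrast, here is an equally hedged sketch of the early-fusion pattern the article describes: one shared embedding space, one backbone, and no pretrained stacks to stitch together. Again, every module here is a toy stand-in, not Elorian's design.

```python
import torch
import torch.nn as nn

class EarlyFusionModel(nn.Module):
    """Early-fusion sketch: every modality is mapped into one shared embedding
    space and a single backbone is trained on the joint stream from step one."""

    def __init__(self, d_model=512, vocab_size=32000, patch_dim=768, audio_dim=320):
        super().__init__()
        # Lightweight per-modality tokenizers -- no separate pretrained stacks.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.pixel_embed = nn.Linear(patch_dim, d_model)   # raw image patches
        self.audio_embed = nn.Linear(audio_dim, d_model)   # raw waveform frames
        # One backbone sees all modalities jointly; there is no hand-off.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, text_ids, image_patches, audio_frames):
        # Text, pixels, and audio attend to each other in the same layers,
        # so "hearing the shatter" and "seeing the shards" share one pass.
        stream = torch.cat([
            self.text_embed(text_ids),
            self.pixel_embed(image_patches),
            self.audio_embed(audio_frames),
        ], dim=1)
        return self.backbone(stream)

model = EarlyFusionModel()
out = model(torch.randint(0, 32000, (1, 8)),   # text tokens
            torch.randn(1, 16, 768),           # image patches
            torch.randn(1, 32, 320))           # audio frames
print(out.shape)  # torch.Size([1, 56, 512])
```

The design choice that matters is that the backbone's attention layers see all three modalities at once, so cross-modal cues are learned inside the network rather than translated at its boundary.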
Why Investors Are Betting $50M
The size of the bet is driven by the founders' specific, hard-to-replicate resumes; they have arguably written the playbook for this exact technology.
- Andrew Dai (CEO): A 14-year veteran of Google, Dai was a critical figure in the development of the Gemini series. His work focused on the massive pre-training data pipelines that allowed Google's models to move beyond text. Investors are betting that he knows exactly where Google's architecture hit its ceiling, and how to build the next version without the bureaucratic overhead.
- Yinfei Yang (Co-Founder): Yang brings the "Apple DNA" of efficiency. At Apple, he co-authored a seminal paper on "Scaling Laws for Native Multimodal Models". His research showed that "early fusion" models are not just smarter but can also be significantly smaller and faster than their "stitched" counterparts (see the scaling-law sketch after this list). This suggests Elorian isn't just building a smarter model, but one efficient enough to run on devices, the Holy Grail for Apple and roboticists.
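For readers who want the math behind that claim: the paper's exact fitted values are not reproduced here, but scaling-law studies of this kind typically fit a Chinchilla-style curve, sketched below with placeholder symbols.

```latex
% Chinchilla-style form that such scaling studies typically fit
% (placeholder symbols, not the paper's actual fitted values):
%   N = parameter count, D = training tokens (here, mixed-modality tokens),
%   E = irreducible loss, and A, B, alpha, beta come from the fit.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The practical reading: if native training buys a lower irreducible term or friendlier exponents, a smaller parameter count N can match a larger stitched model, which is exactly the on-device argument.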
The Giants: Who Are They Fighting?
Elorian is walking into a gladiator arena occupied by trillion-dollar titans, but its pitch is that even the giants are mostly faking it. While the marketing departments of Big Tech claim "multimodality," the engineering reality is often far messier, leaving a specific opening for Elorian’s "World Model" approach.
Google (The Native King): The most direct technical rival is Google. With the release of Gemini 1.5 Pro and the more recent Gemini 2.0, Google has already deployed a natively trained architecture that handles video and text in a single stream. However, Elorian’s bet is on specialization. While Gemini is optimized to be a generalist "Knowledge Engine" for Search and Workspace (summarizing emails or finding facts), Elorian is building a "Physical Engine." Andrew Dai’s departure suggests a belief that a model optimized for physics and robotics, one that understands gravity, friction, and spatial depth, cannot be built inside a company obsessed with text-based search results.
OpenAI (The Omni Threat): OpenAI is the other true "native" player. The "o" in GPT-4o stands for "omni," marking their shift to end-to-end training across audio, vision, and text. They are the primary competitor, but their dominance creates Elorian’s market opportunity. Just as Android exists because the world needed an alternative to iOS, the robotics and hardware industry is desperate for a rival to OpenAI. Hardware companies do not want to send their camera feeds to a competitor who might build their own robot. Elorian positions itself as the neutral "Switzerland" of visual intelligence, the high-performance brain that powers everyone else’s hardware.
Meta & The Open Source Lag: This is where Elorian sees its biggest technical edge. Despite the hype, most open-source models are still using the "Frankenstein" method. Meta’s Llama 3.2 Vision models, while impressive, rely on a "Vision Adapter" architecture, effectively bolting a separate image encoder onto a text-based brain. This works for static images but fails at the high-speed, frame-by-frame reasoning needed for real-time agents. Elorian’s native architecture aims to make this "adapter" approach look as obsolete as dial-up internet, offering a model that is arguably generations ahead of what is currently available to the open-source community.
The Future: Beyond The Chatbot
If Elorian succeeds, the user experience of AI changes from "Chat" to "Interaction."
We are currently trapped in the "turn-taking" era of AI. You speak, wait for the cloud to think, and then it replies. Elorian’s "native" architecture promises real-time perception, removing the latency that currently makes it impossible for robots to catch a ball or for voice agents to feel human.
This technical shift opens three massive markets that "Frankenstein" models cannot touch:
True "Agentic" Voice Interfaces Current voice assistants are polite but deaf to nuance. Because they transcribe audio to text before processing it, they lose the rich data of how you said something. Elorian’s native audio processing enables "full duplex" communication, meaning an AI that can hear you interrupt it mid-sentence, detect sarcasm in your tone, and respond instantly. This is the difference between a frustrating call center bot and a real-time translator that feels like a human interpreter.
The "Physical Intelligence" of Robots The biggest blocker for robotics isn't hardware; it is the brain. A robot in a factory cannot afford a 500-millisecond delay while a vision model translates a camera feed into text descriptions. It needs to understand physics, depth, and friction instantly. By processing raw pixels natively, Elorian is positioning itself to be the operating system for the $23 billion Embodied AI market projected by 2030. This aligns with the "Spatial Intelligence" thesis recently championed by World Labs’ Fei-Fei Li, moving AI from simply naming objects to understanding how they move in 3D space.
Autonomous Economy
McKinsey predicts that AI agents working alongside humans could unlock $2.9 trillion in economic value by 2030. But these agents cannot be text-only; they must be able to "watch" your screen, navigate complex software interfaces, and "see" the results of their actions. Elorian’s technology provides the visual cortex required for these digital employees to function without constant human supervision.