Standard Intelligence FDM-1 computer use AI 2026

6 People, $75M, and an AI That Learned to Use Computers by Watching 11 Million Hours of Video

FDM-1 is Standard Intelligence’s computer use AI model, trained on 11 million hours of video showing humans operating computers. This is the story of how a six-person team, with $75 million in backing from investors including Sequoia Capital, is challenging the biggest names in AI.

A Six-Person Startup With $75M and Sequoia Backing

In an industry where billion-dollar AI labs employ thousands of researchers, Standard Intelligence is doing something almost absurdly contrarian: building a foundation model for computer use with a team of six people and a $75 million war chest from Sequoia and Spark Capital.

The startup’s angel investors include Andrej Karpathy — the former Tesla AI director and OpenAI founding member who is widely regarded as one of the most influential voices in modern AI. When Karpathy puts his own money behind a six-person team, the AI community pays attention.

Standard Intelligence emerged in April 2026 with a thesis that challenges how every other company approaches AI-driven computer automation. Instead of wrapping large language models in tool-calling scaffolding or fine-tuning vision models on annotated screenshots, they trained their foundation model — FDM-1 — directly on raw video of humans using computers. And the scale of their dataset is staggering.

FDM-1: The Computer Use Foundation Model

FDM-1 stands for Foundation Desktop Model, and it represents a fundamentally different approach to computer use AI. Where most companies start with a language model and bolt on vision capabilities, Standard Intelligence built FDM-1 from the ground up as a video-native model that understands computer interaction by watching it happen.

The model processes screen recordings and learns the mapping between visual states (what the screen looks like) and actions (what the user did next). This means FDM-1 doesn’t need explicit instructions about how to click buttons, fill forms, or navigate menus — it learned these patterns by observing millions of examples of humans doing exactly those things.

The result is a model that can interact with virtually any desktop application through its graphical interface, without requiring API integrations, browser extensions, or application-specific tooling. If a human can do it by looking at a screen and using a mouse and keyboard, FDM-1 can theoretically learn to do it too.

11 Million Hours of Video: How They Built the Dataset

The most impressive technical achievement behind FDM-1 isn’t the model architecture — it’s the 11 million hours of training data. To put that number in perspective, if you watched videos continuously, it would take you over 1,200 years to get through the entire corpus. It’s orders of magnitude larger than any open-source alternative for computer use training data.
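
The "over 1,200 years" figure is easy to check. The short Python snippet below assumes nonstop viewing, 24 hours a day and 365 days a year, and uses only the corpus size quoted above.

```python
# Sanity check of the viewing-time comparison.
HOURS_OF_VIDEO = 11_000_000   # corpus size quoted in the article
HOURS_PER_YEAR = 24 * 365     # nonstop viewing, ignoring leap days

years = HOURS_OF_VIDEO / HOURS_PER_YEAR
print(f"{years:,.0f} years of nonstop viewing")  # prints "1,256 years of nonstop viewing"
```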

Building a dataset this large required solving a fundamental labeling problem. Manually annotating millions of hours of screen recordings — identifying every click, keystroke, scroll, and drag — would be prohibitively expensive and slow. Standard Intelligence’s solution was to train an inverse dynamics model that automatically labels what actions were taken between video frames.

The inverse dynamics model watches two consecutive frames and predicts what user action caused the transition from frame A to frame B. By running this model across the entire video corpus, Standard Intelligence automatically generated action labels for billions of interaction events — creating a massive supervised learning dataset without hiring an army of human annotators.
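
To make the labeling idea concrete, here is a toy Python sketch. It is not Standard Intelligence's model: the real inverse dynamics model is learned from data, while this stand-in uses a hand-written rule on tiny grid "frames" where a single pixel marks the cursor. All names (`toy_inverse_dynamics`, `label_video`, `make_frame`) are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Frame = List[List[int]]        # toy "screen": 1 marks the cursor, 0 is background
Action = Tuple[str, int, int]  # (kind, dx, dy)

def find_cursor(frame: Frame) -> Tuple[int, int]:
    """Return (x, y) of the single cursor pixel in a toy frame."""
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value == 1:
                return x, y
    raise ValueError("no cursor pixel in frame")

def toy_inverse_dynamics(before: Frame, after: Frame) -> Action:
    """Hand-written stand-in for a learned model: infer the action between two frames."""
    x0, y0 = find_cursor(before)
    x1, y1 = find_cursor(after)
    if (x0, y0) == (x1, y1):
        return ("noop", 0, 0)
    return ("mouse_move", x1 - x0, y1 - y0)

def label_video(frames: List[Frame]) -> List[Tuple[Frame, Action]]:
    """Pair each frame with the action inferred to lead to the next frame."""
    return [
        (before, toy_inverse_dynamics(before, after))
        for before, after in zip(frames, frames[1:])
    ]

def make_frame(x: int, y: int, width: int = 4, height: int = 3) -> Frame:
    """Build a blank toy frame with the cursor at (x, y)."""
    return [[1 if (cx, cy) == (x, y) else 0 for cx in range(width)]
            for cy in range(height)]

# A three-frame "recording": the cursor moves once, then stays put.
video = [make_frame(0, 0), make_frame(2, 1), make_frame(2, 1)]
labels = [action for _, action in label_video(video)]
print(labels)  # [('mouse_move', 2, 1), ('noop', 0, 0)]
```

The point of the sketch is the shape of the pipeline: unlabeled frames go in, and (frame, action) training pairs come out without a human annotator in the loop.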

This data engineering approach is arguably the company’s most significant technical moat. Competitors can potentially match their model architecture, but replicating an 11-million-hour labeled dataset requires both the raw video data and the inverse dynamics labeling infrastructure.

100x More Efficient Than OpenAI

According to Standard Intelligence, FDM-1’s video encoder is 100 times more efficient than OpenAI’s alternative approach to computer use. This efficiency claim, if accurate, has massive practical implications.

Computer use AI needs to process screenshots or video frames to understand what’s on screen. The more efficiently it can encode visual information, the faster it can respond and the less compute it needs per interaction. A 100x efficiency advantage means Standard Intelligence could potentially run computer use tasks at a fraction of the cost of competitors — or achieve much higher quality at the same cost.
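
A back-of-the-envelope Python sketch shows why encoder efficiency matters. The article does not say how the 100x figure is measured, so this assumes it means 100x fewer tokens per encoded frame. The baseline tokens-per-frame and the token budget are made-up illustrative numbers, not published figures from either company.

```python
# Illustrative arithmetic only: every constant except the 100x claim is hypothetical.
EFFICIENCY_FACTOR = 100            # the claim reported in the article
BASELINE_TOKENS_PER_FRAME = 1_000  # hypothetical baseline encoder cost
CONTEXT_BUDGET_TOKENS = 100_000    # hypothetical per-task token budget

def frames_within_budget(tokens_per_frame: float, budget: int) -> int:
    """Number of whole frames that fit inside a fixed token budget."""
    return int(budget // tokens_per_frame)

efficient_tokens_per_frame = BASELINE_TOKENS_PER_FRAME / EFFICIENCY_FACTOR
baseline_frames = frames_within_budget(BASELINE_TOKENS_PER_FRAME, CONTEXT_BUDGET_TOKENS)
efficient_frames = frames_within_budget(efficient_tokens_per_frame, CONTEXT_BUDGET_TOKENS)

print(f"baseline encoder:  {baseline_frames:,} frames per budget")
print(f"efficient encoder: {efficient_frames:,} frames per budget")
```

Under these assumptions, the same budget covers 100 frames with the baseline encoder and 10,000 with the efficient one. That gap could be spent on lower cost per task or on longer stretches of screen history.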

The efficiency comes from training on video natively rather than adapting a general-purpose vision model. When you train a model specifically on desktop application screenshots and screen recordings, it learns to efficiently encode the visual patterns that actually matter for computer interaction — text, buttons, menus, cursors, and UI elements — while ignoring the visual features that matter for general image understanding but are irrelevant to computer use.

What FDM-1 Can Actually Do

Standard Intelligence claims FDM-1 can perform a remarkably broad range of computer tasks. The capabilities described include scanning software for security vulnerabilities by navigating the application’s interface and testing inputs, using computer-aided design (CAD) programs to create and modify designs, navigating complex enterprise software workflows, and performing multi-step data entry and processing tasks.

The security vulnerability scanning capability is particularly interesting because it implies the model can interact with applications in adversarial ways — testing edge cases, submitting unexpected inputs, and observing how the application responds. This kind of exploratory, adaptive interaction is significantly harder than simple form-filling or menu navigation.

If these capabilities work as described, FDM-1 could automate a significant portion of the repetitive computer work that currently requires human operators — data entry, quality assurance testing, compliance checking, and routine administrative tasks across any desktop application.

Why Standard Intelligence FDM-1’s Approach Is Different

The dominant approach to computer use AI in 2026 involves taking a large language model, adding a vision component, and wrapping it in a framework that translates between the model’s text-based reasoning and actual computer actions. This is how Anthropic’s Claude computer use works, and it’s broadly how OpenAI and Google approach the problem.

The LLM-based approach has obvious advantages: it inherits the language model’s reasoning abilities, can follow natural language instructions, and benefits from the massive investment in scaling language models. But it also has fundamental limitations. The vision component is typically bolted on rather than natively integrated, creating latency and information loss. And the model needs to translate between visual perception and text-based reasoning at every step.

Standard Intelligence’s video-native approach sidesteps these limitations by training the model end-to-end on visual input and action output. There’s no intermediate text representation — the model goes directly from pixels to actions. This should theoretically produce faster, more accurate interactions, especially for tasks that are more visual than linguistic.

The Investors: Sequoia, Spark, and Karpathy

Sequoia Capital leading a $75M round for a six-person startup is extraordinary by any measure. Sequoia’s AI portfolio includes some of the most important companies in the space, and their willingness to invest at this stage suggests they see something in Standard Intelligence’s technical approach that justifies early conviction.

Spark Capital, the other lead investor, brings experience from backing companies like Twitter, Slack, and Coinbase — platforms where user interaction and computer use are central to the product experience.

But the most telling signal may be Andrej Karpathy’s personal investment. Karpathy has built AI systems at both OpenAI and Tesla, giving him deep firsthand knowledge of what works and what doesn’t in AI development. His backing suggests that FDM-1’s video-native training approach represents a genuine technical insight rather than incremental improvement.

Standard Intelligence in the Computer Use AI Race

Standard Intelligence enters a competitive landscape that includes some of the world’s most well-resourced AI labs. Anthropic offers computer use capabilities through Claude. OpenAI is building computer use into its agent products. Google DeepMind has demonstrated computer interaction through Gemini.

Standard Intelligence is betting that a purpose-built model will outperform general-purpose models adapted for computer use, even if those general-purpose models are far larger. The analogy is a specialized tool versus a Swiss Army knife: the specialized tool may be smaller, but it does its specific job better.

The $75M gives them runway to prove this thesis, but they’re competing against organizations with billions in resources. The efficiency advantage needs to be dramatic enough to overcome the scale disadvantage — and the 100x efficiency claim, if it holds up in real-world benchmarks, might be exactly that dramatic.

Final Thoughts on Standard Intelligence and FDM-1

Standard Intelligence is one of the most interesting hidden gems in the current AI landscape. Six people, $75 million, a novel training approach, an 11-million-hour dataset, and backing from Sequoia and Karpathy — it’s the kind of story that either becomes a case study in ambitious failure or the origin story of a category-defining company.

The video-native training approach is the key bet. If learning from raw video of human computer use produces better computer use AI than adapting language models, Standard Intelligence has found a genuine competitive advantage. If it doesn’t — if the LLM-based approach proves superior despite its limitations — then the $75M buys an interesting research result but not a business.

Either way, the fact that this approach attracted Sequoia, Spark Capital, and Andrej Karpathy’s personal conviction suggests it’s worth paying attention to. In AI, the companies that challenge conventional approaches sometimes turn out to be exactly right.

How Standard Intelligence FDM-1 Fits Into the Computer Use AI Landscape

Standard Intelligence’s FDM-1 enters a rapidly growing field of computer use AI agents. Anthropic’s Claude computer use was among the first commercial models to demonstrate autonomous computer interaction, using screenshot analysis and action planning to navigate desktop applications. OpenAI’s Operator agent takes a similar approach, using GPT-4o’s vision capabilities to interact with web browsers autonomously.

What distinguishes FDM-1 is its training methodology. While competitors rely on language model reasoning to interpret screenshots and decide actions, Standard Intelligence trained FDM-1 on 11 million hours of actual human computer usage recordings. This behavioral cloning approach means FDM-1 doesn’t reason about what to click — it pattern-matches against millions of examples of what humans actually clicked in similar situations.
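
Behavioral cloning in its simplest form can be sketched in a few lines of Python. The example below is a deliberately tiny stand-in: a nearest-neighbor lookup replaces the trained neural policy, and three-number feature vectors replace encoded screens. The feature meanings and action names are hypothetical, and nothing here reflects FDM-1's actual architecture.

```python
from typing import List, Tuple

# Toy state: (submit_button_visible, text_field_focused, content_below_fold)
State = Tuple[float, float, float]
Demo = Tuple[State, str]  # (state, action a human took in that state)

def squared_distance(a: State, b: State) -> float:
    """Squared Euclidean distance between two toy states."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cloned_policy(demos: List[Demo], state: State) -> str:
    """Imitate the demonstrations: return the action from the most similar state."""
    _, action = min(demos, key=lambda demo: squared_distance(demo[0], state))
    return action

# Hypothetical demonstrations, as an automatic labeler might produce them.
demos: List[Demo] = [
    ((1.0, 0.0, 0.0), "click_submit"),
    ((0.0, 1.0, 0.0), "type_text"),
    ((0.0, 0.0, 1.0), "scroll_down"),
]

print(cloned_policy(demos, (0.0, 1.0, 0.0)))  # exact match: type_text
print(cloned_policy(demos, (0.9, 0.2, 0.0)))  # closest to the first demo: click_submit
```

The policy never reasons about why a button should be clicked. It reproduces what was done in the most similar recorded situation, which is the trade-off the paragraph above describes.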

The implications for enterprise automation are significant. According to McKinsey’s AI productivity research, knowledge workers spend roughly 60% of their time on repetitive computer tasks that could theoretically be automated. If FDM-1 can reliably handle even a fraction of these tasks, the productivity gains would be enormous.

The $75M raise by a team of just six also challenges the conventional wisdom that AI companies need massive teams. Standard Intelligence is betting that a small, focused team with the right data and architecture can compete with well-funded labs employing hundreds of researchers. As Andreessen Horowitz’s AI research has noted, the democratization of AI training infrastructure is enabling exactly this kind of lean, high-impact startup to emerge.

Whether FDM-1’s behavioral cloning approach will scale to handle the full complexity of real-world computer tasks remains to be seen. But with $75M in fresh capital and a model trained on 11 million hours of human computer use, Standard Intelligence has earned its place as one of the most interesting computer use AI startups to watch in 2026.
