
2024: The Year of Multimodal + Agents + Reasoning + Open-Weights Catch-Up

Table of Contents
2024: The Year of Multimodal + Agents + Reasoning + Open-Weights Catch-Up #
As December 31, 2024 draws to a close, the AI industry has delivered a year of structural shifts that builders will be unpacking for years to come. Four interconnected themes defined this year: the multimodal breakthrough that turned AI from text machines into sensory systems; the emergence of agentic capabilities that let models actually do things; the reasoning revolution that taught models to think before answering; and perhaps most consequentially—the open-weights catch-up that democratized frontier-level AI.
This retrospective captures the essential timeline, the strategic inflection points, and what it all means for teams building with AI right now.
The Four Themes That Defined 2024 #
Four major shifts converged this year to fundamentally reshape what AI systems can do and who can build them. Understanding these themes is essential context for every product decision you'll make in 2025.
Theme 1: Multimodal—From Text to Senses #
The defining breakthrough of 2024 was AI that processes text, images, audio, and video natively—trained end-to-end rather than stitched together with pipelines. GPT-4o's real-time audio capabilities arrived in May with response latencies matching human conversation speed (~232-320ms), and crucially, it preserved tone, emotion, and background context that earlier systems stripped away. This wasn't just faster speech-to-text—it was the first widely deployed system that truly understood spoken conversation.
Gemini 1.5 Pro's 1-million-token context window (and later demonstrations of 10M token capability) enabled another multimodal dimension: processing hours of video or audio in a single pass. A developer could feed an entire 107-hour audio archive or 10.5 hours of video and query across it with near-perfect recall. This changed what's possible for content analysis, research workflows, and media processing.
By year's end, Claude 3.5 Sonnet (New) shipped with computer use capabilities—giving the model the ability to see your screen, move your mouse, and interact with desktop applications. Multimodal had evolved from input-only to full sensory-motor integration.
Theme 2: Agents—From Answerers to Actors #
2024 was the year AI stopped being merely conversational and started becoming operational. The shift from chatbots to agents happened across multiple fronts:
Anthropic's Computer Use API (October) let Claude control desktop applications through screenshots and mouse/keyboard actions—the first production-grade primitive for autonomous computer operation.
Salesforce's Agentforce launch at Dreamforce '24 signaled that enterprise software had internalized the agent paradigm, embedding autonomous agents directly into CRM and workflow platforms.
OpenAI's o1 and the subsequent agent SDK releases demonstrated that reasoning capabilities and tool use could combine to execute multi-step tasks with planning.
MCP (Model Context Protocol) landed in November from Anthropic, standardizing how agents connect to data sources, tools, and external systems—essentially becoming the "USB-C for AI agents."
By December, n8n workflows with Claude 3.5 Sonnet had become a standard production stack for teams building custom automations, with thousands of builders sharing agent templates across the ecosystem.
Theme 3: Reasoning—The Chain-of-Thought Revolution #
The most significant architectural advance of 2024 was teaching models to think through problems step-by-step rather than generating immediate responses. OpenAI's o1-preview and o1-mini (September) introduced this paradigm at scale, followed by the full o1 release in December.
The results were dramatic on benchmarks requiring sustained reasoning:
| Benchmark | GPT-4o | o1 (Dec 2024) | Improvement |
|---|---|---|---|
| MATH (pass@1) | ~55% | 96.4% | +75% relative |
| AIME 2024 | ~12% | 79.2% | +560% relative |
| GPQA Diamond | ~40% | 75.7% | +89% relative |
| SWE-bench Verified | ~33% | 48.9% | +48% relative |
The "reasoning_effort" parameter became a new dial for developers—trading latency and cost for accuracy on complex tasks. This wasn't an incremental improvement; it was a categorical shift in what AI could tackle, from PhD-level science problems to sophisticated software engineering tasks.
Theme 4: Open-Weights—The Frontier Democratized #
Perhaps the most consequential long-term development of 2024 was the narrowing gap between proprietary and open-weight models. At the start of the year, GPT-4 stood largely unchallenged. By year's end, a builder could download and run GPT-4-class models on consumer hardware.
The timeline tells the story:
| Date | Release | Significance |
|---|---|---|
| Apr 2024 | Llama 3 8B/70B | First frontier-grade open weights from a US lab |
| Jul 2024 | Llama 3.1 405B | GPT-4-class performance at zero inference cost |
| Aug 2024 | Mistral Large 2 123B | European open-weights reaching frontier levels |
| Sep 2024 | Qwen2-VL | Open multimodal from China matching proprietary systems |
| Oct 2024 | Pixtral Large | First frontier-class multimodal from Mistral |
| Dec 2024 | DeepSeek V3 671B | Trained for ~$5.6M, beats Claude 3.5 Sonnet |
DeepSeek V3's December release was particularly significant—a 671B parameter MoE model trained for reportedly $5.6 million that benchmarked above Claude 3.5 Sonnet on multiple tasks. The open-weights movement had not just caught up; it had demonstrated that efficient training could challenge the scaling assumptions of closed labs.
The 2024 AI Timeline: Month by Month #
The pace of 2024 made it difficult to track even major releases in real-time. Here's the essential timeline of what shipped, when, and why it mattered.
Q1: Foundation Setting #
January–March established the baseline. GPT-4 remained the frontier standard. Google shipped Gemini 1.5 Pro in February with its breakthrough context window. The open-weights conversation centered on Mistral and the anticipation for Llama 3.
Key releases:
- February: Gemini 1.5 Pro (1M context window)
- March: Claude 3 family (Haiku, Sonnet, Opus) set new benchmarks
- March: Stability AI's Stable Diffusion 3 announcement
Q2: The Assistant Wars Begin #
April–June saw the competitive landscape intensify dramatically. Llama 3's arrival in April proved open weights could compete on quality. GPT-4o in May introduced native multimodal at scale. Apple's WWDC announcement in June signaled mainstream AI had arrived.
| Date | Release | Impact |
|---|---|---|
| Apr 18 | Llama 3 8B/70B | First credible open-weights challenger to GPT-4 |
| May 13 | GPT-4o | Real-time multimodal, free tier access to frontier |
| Jun 10 | Apple Intelligence | Consumer AI goes mainstream, on-device focus |
| Jun 20 | Claude 3.5 Sonnet | Anthropic beats GPT-4o on key benchmarks |
| Jun 28 | LangGraph 0.1 | Pre-MCP agent framework landscape emerging |
The Claude 3.5 Sonnet release on June 20 was a genuine surprise—Anthropic had leapfrogged OpenAI's flagship on coding benchmarks while delivering twice the speed of its predecessor. Combined with Artifacts (the split-pane workspace), it established Anthropic as the serious alternative for builders.
Q3: Video, Voice, and Cost Collapse #
July–September brought production capabilities to what had been demos, while costs collapsed dramatically. Runway Gen-3, Luma Dream Machine, and other video models turned text-to-video from a novelty into a tool. GPT-4o Mini replaced GPT-3.5 Turbo at 60% lower cost. And the o1 reasoning preview changed the fundamental architecture of how models process complex tasks.
| Date | Release | Impact |
|---|---|---|
| Jul 18 | GPT-4o Mini | GPT-4-class quality at GPT-3.5 prices |
| Jul 23 | Llama 3.1 405B | Frontier performance becomes free |
| Aug 1 | FLUX.1 | Best open image model ships from Black Forest Labs |
| Aug 12 | Grok-2 | xAI enters the serious model conversation |
| Sep 12 | o1-preview/o1-mini | Reasoning models debut |
| Sep 25 | Llama 3.2 | Multimodal Llama, on-device focus |
| Sep 25 | Meta Orion AR | Ambient AI hardware teased |
The September o1 release marked the most significant architectural shift of the year. For the first time, a major lab shipped a model trained with reinforcement learning to think through problems rather than predict the next token. The "chain of thought" approach produced dramatic improvements on math, science, and coding benchmarks.
Q4: Agents, Sora, and the China Shock #
October–December delivered the year's most intense concentration of releases. Computer use capabilities, Sora's long-awaited launch, ChatGPT Pro at $200/month, and DeepSeek V3's cost-efficient frontier model all landed within weeks of each other.
| Date | Release | Impact |
|---|---|---|
| Oct 22 | Claude Computer Use | First production desktop control primitive |
| Oct 28 | iOS 18.1 / Apple Intelligence | Consumer AI ships to 1.5B devices |
| Oct 31 | ChatGPT Search | Google challenger goes live |
| Nov 5 | US Election | Policy uncertainty shifts |
| Nov 18 | Mistral Pixtral Large | Frontier multimodal from open weights |
| Nov 25 | MCP Launch | Agent protocol standardization |
| Dec 9 | Sora | Text-to-video finally ships |
| Dec 11 | Gemini 2.0 Flash | Google's agent-ready stack |
| Dec 18 | o1 Full Release | 34% fewer errors, vision capabilities |
| Dec 20 | o3 Announcement | ARC-AGI 87.5%—shockingly close to threshold |
| Dec 26 | DeepSeek V3 671B | $5.6M training cost, beats Claude 3.5 Sonnet |
The December o3 announcement—hitting 87.5% on ARC-AGI, near the human threshold—came as a genuine shock to most observers. Whether it represents genuine progress toward AGI or benchmark optimization remains debated, but it closed the year with a clear signal that reasoning capabilities were advancing faster than most had anticipated.
Enterprise Adoption: By the Numbers #
2024 wasn't just a year of model releases—it was the year AI crossed into mainstream enterprise deployment. The adoption statistics tell a story of rapid normalization.
| Metric | 2023 | 2024 | Change |
|---|---|---|---|
| Orgs with GenAI in production | 18% | 30% | +67% relative |
| Enterprise AI spending | $2.3B | $13.8B | +500% |
| ChatGPT message volume growth | baseline | 8x YoY | +700% |
| OpenAI business customers | ~100K | 1M+ | +900% |
Per a Gartner survey from mid-2024, generative AI became the most frequently deployed AI solution type in organizations, with 29% of enterprises reporting active use. The Menlo Ventures State of Generative AI report showed 92% of surveyed organizations reported increased GenAI use over the previous 12 months.
The interesting wrinkle: only 19% of enterprises rely primarily on standalone tools like ChatGPT. 34% use GenAI embedded in existing applications (Microsoft 365 Copilot, Adobe Firefly, etc.), with the remainder deploying custom solutions. The integration path—baking AI into existing workflows rather than adding new interfaces—appears to be winning in enterprise contexts.
The Funding Landscape: Capital Concentration #
2024 saw unprecedented capital flows into AI, with clear winners and strategic positioning. OpenAI's October raise represented the largest venture round in history.
| Company | 2024 Raise | Post-Money Valuation | Total Raised |
|---|---|---|---|
| OpenAI | $6.6B | $157B | $17.9B |
| Anthropic | ~$4B* | ~$18B | ~$9.7B |
| xAI | $6B+ | $24B | $12B+ |
| Mistral | ~$640M | ~$6B | ~$1B |
*Includes Amazon's $4B investment commitment announced in 2024.
The $6.6B OpenAI round led by Thrive Capital (with Microsoft, Nvidia, SoftBank, and others) reflected a bet that scale advantages in frontier AI would compound. OpenAI reportedly asked investors to avoid backing rivals—a controversial move that signaled competitive pressure.
Meanwhile, Anthropic's Amazon partnership ($4B total commitment) and Google's continued investment positioned the Claude family as the clear alternative for enterprises wanting options. By December, Claude was available on AWS Bedrock, Google Cloud, and directly—true multi-cloud availability that mattered for enterprise buyers.
Cross-Links: Related 2024 Coverage #
This retrospective builds on detailed coverage throughout the year:
- June 2024: Claude 3.5 Sonnet + Artifacts — The moment Anthropic pulled ahead on coding benchmarks
- May 2024: GPT-4o Launch Day — How OpenAI's omni model changed the free tier forever
- September 2024: OpenAI o1-Preview + o1-Mini — The chain-of-thought reasoning era begins
- October 2024: Claude Computer Use API — The first desktop control primitive ships
- November 2024: MCP Launch — The protocol that won the agent stack
- December 2024: DeepSeek V3 — How a $5.6M training run challenged the frontier
- December 2024: OpenAI Sora Launch — Text-to-video finally ships
What This Means for Builders #
If you're building with AI as 2024 ends, four strategic realities have emerged:
1. The Closed/Closed-Weight Gap Has Collapsed #
You no longer need OpenAI or Anthropic APIs to access frontier capabilities. Llama 3.3 70B, DeepSeek V3, and Mistral Large offer comparable performance for many tasks at zero marginal cost. The strategic calculus for product architecture has shifted—self-hosting is now viable for performance, not just privacy.
2. Reasoning is a New Capability Dimension #
o1-class reasoning models aren't just "better"—they're categorically different. For tasks requiring multi-step logic (complex analysis, code generation, research), the reasoning overhead pays for itself. For simpler tasks, the latency and cost penalties make standard models the right choice. You now need a routing strategy, not a default model.
3. Agents Require Infrastructure, Not Just Models #
Computer use, MCP servers, and agent frameworks have made the agent stack concrete. But production agents need eval systems, retry logic, human-in-the-loop fallbacks, and observability. The model is maybe 30% of a production agent system. The other 70% is engineering discipline most teams are still building.
4. Multimodal is Now Table Stakes #
If your product only processes text, you're behind. Users expect image understanding, document parsing, and audio processing. The technical barrier has collapsed—Gemini 1.5 Pro's context window and Claude's computer use make previously hard multimodal tasks straightforward. The remaining challenge is UX design, not capability access.
Looking Ahead: The 2025 Landscape #
As 2024 closes, the setup for 2025 is clear:
The open-weights vs. proprietary competition will intensify. DeepSeek's efficiency breakthroughs suggest training cost curves may be steeper than pure scaling laws predicted. Meta's Llama 4 is anticipated. The possibility of frontier-class models trained for single-digit millions could reshape the economics of the entire industry.
Reasoning capabilities will expand—o3's ARC-AGI performance hints at genuinely new capabilities emerging. Whether this represents progress toward AGI or simply better optimization for specific benchmark types will be the defining debate of early 2025.
Agent infrastructure will mature. MCP adoption is accelerating. The first wave of production agent deployments is generating real feedback on what works. 2025 will be the year agent stacks stabilize and best practices emerge.
And multimodal agents—systems that can see, hear, and act—will move from demos to production. The pieces are all in place. The integration work remains.
Frequently Asked Questions #
What was the most important AI release of 2024? #
Claude 3.5 Sonnet in June changed the competitive landscape—Anthropic proved it could not just match but exceed OpenAI's flagship on coding benchmarks. Combined with Artifacts, it established a genuine alternative for builders and shifted the default assumption that OpenAI leads on all fronts.
How did open-weights models catch up to GPT-4 in 2024? #
A combination of better training data, more efficient architectures (MoE), and concentrated engineering effort. Llama 3 proved that compute-efficient training could challenge scale-heavy approaches. DeepSeek V3 demonstrated that a $5.6M training budget could produce a model competitive with systems trained on budgets 10-50x larger. The open-weights ecosystem also benefited from community fine-tuning and rapid iteration that closed labs can't match.
What's the difference between reasoning models like o1 and standard LLMs? #
Standard LLMs predict the next token in a single forward pass. Reasoning models like o1 use reinforcement learning to generate a "chain of thought"—working through the problem step-by-step before producing a final answer. This adds latency and cost but dramatically improves performance on complex tasks requiring sustained logic. The "reasoning_effort" parameter lets developers trade accuracy for speed.
Are AI agents actually production-ready now? #
Anthropic's Computer Use API (October) and the MCP protocol (November) made agents concrete, but production readiness depends on use case. Simple automation tasks with clear failure modes work today. Complex autonomous agents requiring multi-step planning in unstructured environments still need significant engineering investment in evals, fallbacks, and observability. The infrastructure is there; the implementation discipline is still developing.
What was the biggest surprise of 2024? #
DeepSeek V3's December release—a 671B parameter model trained for ~$5.6M that outperformed Claude 3.5 Sonnet on multiple benchmarks. The efficiency gains called into question assumptions about the necessity of massive compute budgets for frontier performance. It suggested that algorithmic innovation may matter as much as raw scaling, with profound implications for the competitive landscape.
How did enterprise AI adoption change in 2024? #
From pilot to production. Enterprise AI spending grew from $2.3B to $13.8B—more than 6x year-over-year. 30% of businesses now run generative AI in production (up from 18% in 2023). The dominant pattern is embedding AI into existing applications rather than deploying standalone tools. Integration, not replacement, is the winning enterprise strategy.
What happened with video generation in 2024? #
Text-to-video crossed from novelty to tool. Runway Gen-3 Alpha (June), Luma Dream Machine (June), and OpenAI Sora (December) all shipped production-capable systems. By year's end, video generation was being used in advertising, content workflows, and creative production—not just demos. The quality gap with professional video production narrowed dramatically.
Will reasoning models replace standard LLMs? #
No—they'll coexist as complementary capabilities. Reasoning models excel at complex analysis, coding, and research tasks where accuracy matters more than latency. Standard LLMs remain optimal for conversational applications, creative writing, and simple queries where immediate response matters. The skill is routing tasks to the right capability tier.
What is MCP and why does it matter? #
MCP (Model Context Protocol) is Anthropic's open standard for connecting AI agents to data sources, tools, and external systems. Think of it as "USB-C for AI agents"—a standardized interface that lets any agent connect to any data source or tool that speaks the protocol. Released in November 2024, it rapidly gained adoption across the agent ecosystem and solved the fragmentation problem that was slowing agent deployment.
What's the state of AI safety and regulation as 2024 ends? #
Mixed progress. California's SB-1047 safety bill was vetoed in September after intense industry opposition. The US AI Safety Institute secured voluntary MoUs with OpenAI and Anthropic. The EU AI Act began phased implementation. Notably, OpenAI's o3 announcement and its ARC-AGI performance reignited discussions about whether existing safety frameworks are adequate for rapidly advancing capabilities. The policy conversation is racing to catch up to technical reality.
Closing: The Year AI Became Infrastructure #
2024 will be remembered as the year AI stopped being a novelty and became infrastructure.
The multimodal breakthrough turned AI into sensory systems. The reasoning revolution added genuine cognition. Agentic capabilities made AI operational. And the open-weights catch-up democratized access to all of it.
The builders who thrived this year were those who treated AI not as a magic solution but as a new layer of infrastructure to be engineered with—combining the right models, the right orchestration, and the right evaluation frameworks for their specific problems.
As 2025 begins, the tools are better, the costs are lower, and the capabilities are broader than seemed possible twelve months ago. The challenge now isn't accessing AI—it's engineering with it intelligently.
If you're building agentic systems, custom automations, or AI-powered growth infrastructure, book an AI automation strategy call to discuss what's possible with the 2024 stack.
If you need a flagship website or immersive digital experience that integrates these capabilities, start a custom website project. The intersection of AI agents and premium web experiences is where the most interesting work is happening right now.
This retrospective draws on coverage published throughout 2024 on the William Spurlock blog. For detailed analysis of specific releases, see the cross-linked posts above.
Related Posts

Google I/O 2026 Action List: How I Prompted Gemini 3.5 Flash and Antigravity Workflows
Google I/O 2026 just reset the AI tooling landscape. Here's the 9-action checklist for builders who want to ship this week, not just watch the keynote.

Anthropic vs. OpenAI vs. Google: The State of the Frontier in May 2026
A head-to-head breakdown of the three AI giants in May 2026: Claude Opus 4.6, GPT-5.3 and 5.4, Gemini 3.1 Pro. Real specs, real pricing, and what actually matters for builders.

Kimi K2 Open Weights: How I Prompted Moonshot's Frontier Model for Agentic Tool Use
How I direct Kimi K2 by Moonshot AI for agentic workflows, long-context tool calling, and workflow automation. A 1 trillion parameter MoE model with competitive benchmarks at 5-17x lower cost than GPT-5 and Claude.




