
Elon Musk’s xAI has launched Grok 4, the latest and most advanced version of its language model, and it is already drawing attention across the AI world. Touted as a major leap forward, Grok 4 has earned praise for outscoring some of the industry’s most established models, including OpenAI’s GPT-4 and Google’s Gemini 2.5 Pro, on several competitive benchmarks.
A New Benchmark Leader
During a recent launch event, Elon Musk described Grok 4 as “smarter than most graduate students across disciplines.” Backing up that bold claim is Grok 4’s strong showing on rigorous benchmark tests. The model achieved a standout score of 26.9% on “Humanity’s Last Exam”—a notoriously difficult test evaluating academic-level knowledge in areas like physics, biology, and computer science. That score places Grok 4 ahead of Gemini 2.5 Pro (21.6%) and GPT-4 (around 20%).
Even more impressive was the Grok 4 Heavy variant. Using a multi-agent architecture, in which multiple AI agents work on the same problem and then compare and refine their answers, it scored 50.7% on the same exam when given tools such as code interpreters and web access. This collective approach enables better reasoning, planning, and decision-making, and it offers a glimpse of how future AI models might work in teams.
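xAI has not published how Grok 4 Heavy combines its agents, so the following is only a minimal sketch of one simple way several agents could agree on an answer: a majority vote. The `Agent` type, the `majority_answer` function, and the toy agents are hypothetical names introduced for illustration, not part of any xAI API.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical agent type: any callable that maps a question to an answer string.
Agent = Callable[[str], str]


def majority_answer(question: str, agents: Sequence[Agent]) -> str:
    """Ask every agent independently and return the most common answer.

    Ties are broken in favor of the answer that was proposed first.
    """
    if not agents:
        raise ValueError("at least one agent is required")
    answers = [agent(question) for agent in agents]
    counts = Counter(answers)
    best = max(counts.values())
    # Walk the answers in order so that ties resolve deterministically.
    return next(a for a in answers if counts[a] == best)


# Toy stand-ins for model instances (assumptions, not real model calls).
def agent_a(question: str) -> str:
    return "42"


def agent_b(question: str) -> str:
    return "41"


def agent_c(question: str) -> str:
    return "42"


print(majority_answer("What is 6 * 7?", [agent_a, agent_b, agent_c]))
```

A real system would likely go further, for example by letting agents read and critique each other's drafts, but voting shows the basic intuition: independent attempts can cancel out individual mistakes.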
Smarter Training: Reinforcement Learning with Verifiable Rewards (RLVR)
What sets Grok 4 apart isn’t just raw power, but how it’s trained. Using a technique called Reinforcement Learning with Verifiable Rewards (RLVR), the model learns from tasks whose answers can be checked automatically as right or wrong, such as math problems with a known solution. This approach is particularly effective in building logical reasoning and problem-solving abilities.
Engineers from xAI even noted that they were running out of standard training problems, suggesting that future models might learn directly from real-world environments for endless feedback.
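To make the idea of a verifiable reward concrete, here is a minimal sketch that assumes exact-match grading against a known reference answer. xAI has not released its training code, so the function names and the normalization step are illustrative assumptions rather than a description of the actual pipeline.

```python
from statistics import mean
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 when the answer matches the known reference, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


def mean_reward(sampled_answers: Sequence[str], reference_answer: str) -> float:
    """Average reward over several sampled answers to the same problem."""
    return mean(verifiable_reward(a, reference_answer) for a in sampled_answers)


# Four hypothetical model samples for one problem whose reference answer is "x = 4".
samples = ["x = 4", "X  =  4", "x = 5", "x = 4"]
print(mean_reward(samples, "x = 4"))
```

Because the reward comes from a mechanical check rather than a human rating, it can be computed for any number of problems, which is also why a finite pool of checkable problems becomes the bottleneck.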
ARC-AGI and the Fluid Intelligence Leap
Another benchmark where Grok 4 shines is ARC-AGI-2, which tests abstract reasoning and pattern recognition, challenges that many AI models still struggle with. Grok 4 scored 15.9%, nearly doubling the previous best score of 8% by Claude Opus 4.
According to Greg Kamradt, president of the ARC Prize Foundation, that result shows Grok 4 is beginning to exhibit signs of fluid intelligence, something previously unseen at this level in AI models.
Real-World Tasks: From Math to Game Design
Beyond academic benchmarks, Grok 4 impressed with practical tasks too:
Math Arena: Scored 96.7%, the highest so far.
Graduate-Level QA (GPQA): 88.9%
USA Math Olympiad: 79.4%
AI & ML 2025 Challenge: Perfect 100%
Grok 4 also topped the Vending-Bench simulation, which evaluates long-term decision-making in a constrained environment. The model earned a virtual net worth of $4,700, easily surpassing GPT-3.5 ($1,800) and even outperforming human test subjects ($844). The result suggests Grok 4 can manage resources, forecast outcomes, and adapt to unpredictable inputs over a long horizon.
Can It Code and Create?
Yes, and fast. One engineer from xAI used Grok 4 to build a fully functional first-person shooter game in just four hours. The model handled everything from asset generation to gameplay logic and visuals, significantly reducing development time.
These capabilities support Musk’s earlier claim that future AI will build entire AAA games. Grok 4 is not there yet, but the demo is certainly a step in that direction.
Multimodal Reasoning and Web Access
Grok 4 supports real-time internet access, multimodal inputs, and a 256K context window, allowing it to manage longer conversations and complex reasoning chains. It also includes enterprise-level security, making it suitable for businesses as well as researchers.
Pricing is straightforward:
Grok 4: $30/month
Grok 4 Heavy: $300/month or $3,000/year
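As a quick check on the two Grok 4 Heavy billing options listed above, the short sketch below compares twelve monthly payments with the annual price. The prices are the ones quoted in this article; the function name is a hypothetical helper.

```python
# Grok 4 Heavy prices in USD, as quoted in the article.
MONTHLY_PRICE = 300
ANNUAL_PRICE = 3000


def annual_savings(monthly_price: int, annual_price: int) -> int:
    """Return how much cheaper the annual plan is than twelve monthly payments."""
    return monthly_price * 12 - annual_price


savings = annual_savings(MONTHLY_PRICE, ANNUAL_PRICE)
effective_monthly = ANNUAL_PRICE / 12

print(f"Annual plan saves ${savings} per year")
print(f"Effective monthly cost on the annual plan: ${effective_monthly:.2f}")
```

At these prices the annual plan works out to the cost of ten monthly payments.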
What About AGI?
Despite the buzz, Grok 4 isn’t Artificial General Intelligence (AGI). While it excels in structured tasks like math, logic, and code, it still lacks core AGI traits—such as self-directed learning, awareness, and goal-setting.
Critics also pointed out that Grok 4 struggles with visual understanding and spatial reasoning, and like other LLMs, it is prone to hallucinations—confidently generating inaccurate responses.
Even Elon Musk clarified that his “graduate student intelligence” comment referred specifically to Grok 4’s test scores, not real-world decision-making or autonomy.
What’s Next for xAI?
According to Musk, xAI is already training its next-generation model, Foundation v7. Upcoming plans include:
A code-specialized model in August
A multimodal agent in September
A video generation model in October
Final Thoughts
Grok 4 has firmly established xAI as a serious competitor in the global AI race. With high scores across multiple domains, an innovative training approach, and powerful multi-agent teamwork in Grok 4 Heavy, the model proves that xAI is capable of developing cutting-edge AI.
That said, Grok 4 is not AGI, though it is one of the clearest signs yet that the field is moving closer to that frontier.