Ep 662: Opus 4.5: New king of the AI hill or just a niche model for coders?

Our hands-on with Claude Opus 4.5, OpenAI teases app store, Gemini introduces interactive images, Google escalates its AI chip war with Nvidia and more.

Sup y’all 👋

Thankful for all your support over the years for this wild AI ride called Everyday AI. 

Appreciate all the messages of support like what Cecilia dropped in today’s livestream. 

Our small team will be taking the next two days off to celebrate the Thanksgiving holiday and spend time with family. 

See ya Monday and I’m thankful for all of you. 

✌️

Jordan 

Outsmart The Future

Today in Everyday AI
8 minute read

🎙 Daily Podcast Episode: Another “best AI model” just dropped. This time it’s Claude Opus 4.5. We gave it the live treatment for AI at Work on Wednesdays. Find out more in today’s show and give it a watch/listen.

🕵️‍♂️ Fresh Finds: ChatGPT’s big live update, Interactive Images on Gemini, Perplexity Shopping Assistant upgrades and more. Read on for Fresh Finds.

🗞 Byte Sized Daily AI News: OpenAI's new App Store leaks, Nvidia chips vs. Google chips, OpenAI hardware prototypes and more. Read on for Byte Sized News.

🧠 Learn & Leveraging AI: Anthropic says Opus 4.5 is the best model in the world. We tested it live—and it did not go well. Yikes. Keep reading for that!

↩️ Don’t miss out: How to fight AI sprawl, Anthropic surprises with Claude Opus 4.5, Meta and Google partner on AI chips, White House's shocking AI announcement and more. Check it here!

Ep 662: Opus 4.5: New king of the AI hill or just a niche model for coders?

"... best model in the world..." 🤔

Wait, again? 

Days after Gemini 3 Pro splashed onto the scene, Anthropic snuck in a low-key drop of Claude Opus 4.5.

And Anthropic pulled no punches, calling its new model the "best model in the world for coding, agents and computer use."

So, should you be hot-swapping your Gemini or ChatGPT use for the new Opus 4.5? Or is this model more of a niche play for software devs?

Tune in, as we put AI to Work on Wednesday! 

Opus 4.5: New king of the AI hill or just a niche model for coders?

P.S.... we're out for Thanksgiving. So after this show, we'll see ya Monday!


Also on the pod today:

• Opus 4.5: coder’s new favorite? 👨‍💻
• Anthropic’s massive API price cut 💸
• Visualization: Excel & Slides improved 📊

 It’ll be worth your 46 minutes:

Listen on our site:

Click to listen

Subscribe and listen on your favorite podcast platform


Here are our favorite AI finds from across the web:

New AI Tool Spotlight – Questas builds choose-your-own-adventure stories with AI-generated images and videos, where every choice leads to a new path. FireCut is a lightning-fast AI video editor. AskCodi helps you build your own coding models on top of any LLM.

AI Music — Warner strikes licensed AI deal with Suno, letting opted-in artists’ voices be generated.

AI Tech — Nota AI powers Samsung’s Exynos 2500 to run generative AI fully on-device.

ChatGPT Live Chat — Voice mode now streams replies inline, so you see responses typed out as you speak.

AI Thanksgiving Talk — AI joins politics and football at the Thanksgiving argument table — what to know.

Gemini Interactive Images — Tap diagrams to unlock interactive explanations for deeper, visual learning.

Perplexity Shopping Assistant — Perplexity launches PayPal-powered AI shopping with memory-driven recommendations.

Gemini Image Reimagine — Turn any photo into new scenes — Google Messages' Remix feature uses Gemini's Nano Banana.

ChatGPT Image Generator — ChatGPT may be rolling out a faster, style-driven ImageGenV2 for richer, more consistent edits.

1. ChatGPT readies App Store-like platform for apps and workflows ⚒️

OpenAI is preparing a timed rollout of developer features that would let creators publish Agent Builder workflows directly into ChatGPT, add image-generation as a tool inside workflow building, and launch an Apps dashboard to manage and integrate apps with ChatGPT.

The moves speed up how developers bring custom agents and visual tools into ChatGPT, reducing friction for discovery and distribution and hinting at a broader December product push. Key details remain undecided, such as whether workflows will be shareable and when features will reach all users, with early access now limited to select partners.

2. Nvidia pushes back as Google TPU momentum grows 📈

Nvidia responded Tuesday to Wall Street concerns after its shares dipped following reports Meta may use Google’s TPUs, saying its Blackwell GPUs remain a generation ahead and run every AI model everywhere computing happens. The company argued its GPUs offer greater flexibility and performance than Google’s ASIC-style TPUs, while noting Google remains a customer and that Gemini can run on Nvidia hardware.

Google says demand is rising for both its custom TPUs and Nvidia GPUs as it supports customers via Google Cloud, and recent advances like Gemini 3 have boosted attention on TPU-based training.

3. OpenAI unveils first hardware prototypes, says Sam Altman 🤖

Sam Altman revealed at Sun Valley that OpenAI has completed its first hardware prototypes, calling the work “jaw dropping” and signaling a major step toward new AI devices following its $6.4 billion acquisition of Jony Ive’s io.

He offered no product details but said the goal is a calmer, long-duration assistant that filters information and “knows everything you’ve ever thought about, read, said,” positioning the device as a quieter alternative to today’s attention-grabbing smartphones.

4. North Korea’s AI advances pose new surveillance and cyber risks 🧑‍💻

A new South Korean analysis says recent 2025 North Korean research papers reveal rapid AI improvements that could bolster surveillance, voice impersonation and automated cybercrime, making the findings timely given Pyongyang’s push to field AI-equipped unmanned systems.

The Institute for National Security Strategy warns facial recognition and multi-object tracking research could strengthen monitoring at military sites and borders, while lightweight speech synthesis enables near-real-time voice impersonation for psychological operations. The report also flags AI-driven automation of reconnaissance, phishing and money-laundering as a force multiplier for cryptocurrency theft and social engineering.

5. OpenAI says ChatGPT subscriptions could swell to 220 million by 2030 💸

According to The Information, OpenAI now projects roughly 8.5% of its future weekly user base — about 220 million people by 2030 — will pay for ChatGPT, a sharp increase from about 35 million paid users in July. The report also says OpenAI’s annualized revenue run rate could hit $20 billion this year even as the company burns cash on R&D and operations.

Management plans to diversify revenue with shopping and ad-driven features, including a new personal shopping assistant that could unlock commissions and advertising. If these forecasts hold, ChatGPT would rank among the world’s largest subscription businesses while still navigating significant costs and profit pressure.

Anthropic just claimed they built the "best model in the world" with their new Opus 4.5 and dropped the mic.

But when we actually put Opus 4.5 to the test live on air, the mic didn't just drop—it broke.

Yikes.

In our testing on today’s show, the model hallucinated tools that don't exist, failed basic instruction following, and crashed its own context window repeatedly.

We walked through Anthropic’s benchmarks, outside leaderboards, and the company’s kinda quiet pivot and three sneaky good features they rolled out this week.

Here’s the exec‑level version: how Anthropic’s vertical pivot and price cut change your stack, where Opus 4.5 is actually dangerous for competitors, and when Gemini or GPT should still be your default. You’ll get Monday‑ready moves, not another hype recap.

Time to capitalize. Let’s get it.

1. Kill the “Best Model” Myth 🔥

We just watched a familiar playbook: a vendor cherry-picks benchmarks, calls its model the top option, and hopes nobody reads the fine print. Anthropic framed Opus 4.5 as the leading choice for coding, agents, and computer use.

But when you step back to aggregated benchmarks, Gemini 3 Pro leads on overall coding and intelligence scores, and Opus 4.5 is effectively tied with GPT‑5.1 High, with GPT‑5.1 Pro not even in the race yet from an API testing perspective. 

The real advantage goes to teams that treat models like interchangeable parts instead of religion. If your stack still assumes one model for everything, you’re handing speed and margin to competitors who swap tools per workflow. Benchmarks still matter, but only as directional signals, not gospel. And those league tables are shifting weekly as new variants keep dropping.

Try This

Hook up your three main models – Gemini 3 Pro, Claude Opus 4.5, and whatever GPT‑5.1 tier you have – into one internal playground.

Take five core workflows: weekly reporting, customer replies, code review, spreadsheet analysis, and slide outlines. Run the same prompt through each model and log accuracy, latency, and cleanup time.

Rank them per task and set a routing rule: “If X, use Y model.” Lock this in as a quarterly calibration ritual, not a one‑off bake‑off. Store results in a dashboard.
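If you want to script that bake-off instead of eyeballing it, here's a minimal Python sketch. It assumes the official openai and anthropic SDKs; the model IDs, prompts, and workflow names are placeholders, and you'd wire in Gemini the same way via its own SDK.

```python
# Minimal bake-off harness. Model IDs below are placeholders; use the
# tiers your team actually pays for, and add Gemini via its own SDK.
import csv
import time

import anthropic
from openai import OpenAI

openai_client = OpenAI()               # reads OPENAI_API_KEY from env
claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def ask_gpt(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-5.1",  # placeholder ID
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_claude(prompt: str) -> str:
    msg = claude_client.messages.create(
        model="claude-opus-4-5",  # placeholder ID
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

MODELS = {"gpt-5.1": ask_gpt, "opus-4.5": ask_claude}

WORKFLOWS = {  # same prompt per task, run through every model
    "weekly_report": "Draft a one-page status report from these bullets: ...",
    "customer_reply": "Write a friendly reply to this support ticket: ...",
}

with open("bakeoff.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["workflow", "model", "latency_s", "output"])
    for task, prompt in WORKFLOWS.items():
        for name, ask in MODELS.items():
            start = time.perf_counter()
            output = ask(prompt)
            writer.writerow([task, name, round(time.perf_counter() - start, 2), output])
```

Latency is the only column a script fills in for free; accuracy and cleanup time still need a human score next to each row.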

2. Exploit Claude’s Vertical Pivot ⚡

Under the noise, Anthropic is making a very specific bet: stop trying to be the everything‑app and dominate a couple of high‑value verticals. 

The clues are everywhere.

Opus 4.5 is tuned for software engineering, agentic workflows, and heavy data/finance work. Claude for Excel is now generally available for higher‑tier users, built to chew through thousands‑row spreadsheets. File creation quietly turns prompts into polished decks and sheets.

Pair that with a two‑thirds API price cut and cheaper compute from Amazon and Google, and you see the shape of the play: become the default “work AI” inside codebases and spreadsheets while others chase consumer dazzle. If your business lives in IDEs and Excel, ignoring this shift is leaving productivity on the table.

Try This

Pick one domain: engineering or spreadsheets.

For engineering, plug Claude into a sandbox repo and feed it a week of “code copilot” work—bug triage, tests, and small refactors. For spreadsheets, pilot Claude for Excel on one messy reporting workbook with thousands of rows.

In each case, document what it nails and where it fails. Then redraw one team’s RACI: humans own design and judgment; Claude owns boilerplate, data cleanup, and first‑pass analysis. End the week with a 30‑minute go/no‑go decision.

3. Make Reliability Your Real Benchmark 🚀

Brutal. 

In our very limited live demos, Opus 4.5 still behaved like an overconfident intern. Long, complex tasks blew past its context window, and it dropped every plate it tried to spin.

It ignored a very specific URL instruction and grabbed the wrong content. A multi‑step slide‑deck refresh died mid‑flow despite careful scoping. An artifacts‑powered dashboard never rendered while Gemini 3 Pro produced a slick, SaaS‑level analytics app on the same brief.

That gap really matters for execution. 

Again – take our lil demos with a huge grain of salt. Even though Opus 4.5 kinda bombed the limited tests we gave it live, the model still has an insanely high ceiling. (When it works, that is.)

Know that the winning teams won’t be the ones with the fanciest agentic marketing slide. They’ll be the ones who ruthlessly test where models fall apart, read the traces, and design workflows around those failure modes. 

Treat every “extended context” claim as a hypothesis to break, not a promise to trust. When in doubt, break work into smaller hops and orchestrate the agents instead of worshiping a single giant prompt.
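Here's what "smaller hops" can look like in code, a rough sketch where ask_model is a stand-in for whichever provider call you've standardized on, not a real library function:

```python
# "Smaller hops" sketch: chunk the work, run each hop on its own, then
# merge, instead of betting one giant prompt on the context window.
# ask_model is a stand-in for your provider call of choice.
from typing import Callable

def run_in_hops(doc: str, ask_model: Callable[[str], str],
                chunk_chars: int = 8000) -> str:
    # Hops 1..n: summarize each chunk independently, so no single call
    # has to hold the whole document in context.
    chunks = [doc[i:i + chunk_chars] for i in range(0, len(doc), chunk_chars)]
    partials = [ask_model(f"Summarize this section:\n{chunk}") for chunk in chunks]
    # Final hop: merge the partial summaries into one answer.
    return ask_model("Merge these section summaries into one brief:\n"
                     + "\n---\n".join(partials))
```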

Try This

Pick one ugly workflow that already stalls your team: updating decks, merging docs, or turning CSVs into dashboards.

Run it end-to-end in three variants: Gemini 3 Pro, Opus 4.5, and your current default. Don't just judge outputs; log every failure: context blowups, ignored instructions, tool-use errors, timeouts.

Turn that into a one‑page playbook: which model to use, max input size, when to split tasks, and where humans must stay in the loop. Re‑run the test whenever a “best model” headline drops.
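One way to keep that playbook alive is to make it executable. The routing rules below are illustrative guesses, not our recommendations; fill them in from your own bake-off logs:

```python
# The one-page playbook as code. Every rule here is a made-up example;
# replace the models and caps with results from your own testing.
PLAYBOOK = {
    "code_review":   {"model": "opus-4.5",     "max_input_tokens": 150_000},
    "deck_refresh":  {"model": "gemini-3-pro", "max_input_tokens": 200_000},
    "weekly_report": {"model": "gpt-5.1",      "max_input_tokens": 100_000},
}

def route(task: str, input_tokens: int) -> str:
    """Return the model for a task, or fail loudly when the job should be split."""
    rule = PLAYBOOK[task]
    if input_tokens > rule["max_input_tokens"]:
        raise ValueError(f"{task}: input too big, split it into smaller hops first")
    return rule["model"]
```

When a new "best model" headline drops, re-running the test just means updating this dict.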

 
