
Ep 628: What’s the best LLM for your team? 7 Steps to evaluate and create ROI for AI

Our guide: 7 Steps to ROI with AI, Ex-Google CEO says AI could be deadly, Gemini to Challenge OpenAI, early signs of AI bubble and more.

Sup y’all 👋

Today’s show is a culmination of hundreds of hours I’ve spent consulting companies on getting an ROI with GenAI.

I'm curious how you/your company measures ROI on GenAI.

How does your company measure ROI on GenAI?

🗳️ VOTE to see live results, and leave feedback after voting on how your company is measuring ROI on GenAI 🗳️


✌️

Jordan

PS. Info on how to get another BANGER bonus guide at the end of today’s newsletter.

Outsmart The Future

Today in Everyday AI
8 minute read

🎙 Daily Podcast Episode: We've talked with hundreds of businesses about AI implementation. We give away our ROI on AI template on today's show. Give it a watch/read/listen.

🕵️‍♂️ Fresh Finds: 2025 State of AI Report, why Taylor Swift's fans are mad at AI, a small language model that outperforms giants and more. Read on for Fresh Finds.

🗞 Byte Sized Daily AI News: Ex-Google CEO says AI could be deadly, Gemini to Challenge OpenAI, early signs of AI bubble and more. Read on for Byte Sized News.

💪 Leverage AI: We cut it straight. Here are the 7 steps to evaluate LLMs and create ROI with AI. Keep reading for that!

↩️ Don’t miss out: Did you miss our last newsletter? We covered: Inside NotebookLM's recent updates, NVIDIA to invest $2 billion in xAI, OpenAI’s trillion dollar compute lineup, Google’s new model that uses the computer, and more. Check it here!

Ep 627: NotebookLM: New features, what’s next and complete walkthrough

How can you measure ROI on GenAI for your team? 🤔

Internal evaluations and intentionality. 

We've helped thousands of orgs put LLMs to work and ACTUALLY save time. On today's show, we're dishing the 7 steps you need to follow. 

Also on the pod today:

Front-end AI replacing APIs? 🤖
Hidden ROI pitfalls in GenAI 💸
Training gaps killing AI success 📉

It’ll be worth your 39 minutes:

Listen on our site:

Click to listen

Subscribe and listen on your favorite podcast platform


Here are our favorite AI finds from across the web:

New AI Tool Spotlight – Tight Studio uses AI for beautiful screen recordings, ElevenLabs launched an open source component library for AI audio and voice agents, Glue is like Slack, but AI native.

State of AI Report — The popular report just dropped. You can read it here.

AI in Pop Culture — Taylor Swift’s latest album promo sparked backlash from fans who say the videos used AI. Here’s why they’re upset.

AI Native — How can you become AI native? Well, you can learn from these startups.

Small Models – A tiny 7M-parameter “Tiny Recursion Model” beats or matches far larger LLMs on tough grid reasoning tasks by iteratively refining its own answers. Here’s how.

AI and Politics – Former UK PM Rishi Sunak is now a part-time advisor to Microsoft, Anthropic, and Goldman Sachs, donating the pay. Why?

AI Chips – The U.S. OK’ed billions in NVIDIA AI chip exports to the UAE as part of a deal linking shipments to Emirati investments.

 

1. Ex-Google CEO warns hacked AI could be deadly ⚠️

At London’s Sifted Summit, Eric Schmidt cautioned that AI models can be jailbroken to shed safety guardrails and potentially be trained to kill, citing evidence of reverse-engineering and past exploits like ChatGPT’s “DAN,” according to CNBC.

The warning, framed against a question comparing AI risk to nuclear weapons, underscores the tech industry’s lack of a non-proliferation regime to prevent powerful models from being misused. Schmidt, who also flagged risks like loneliness from AI “perfect girlfriends,” still predicts AI capabilities will far outpace humans over time.

2. Figure AI’s home robot gets a splashy update, but it’s not moving in yet 🚶‍♂️

Figure AI unveiled Figure 03 today, a softer, fabric-covered humanoid with articulated, camera-equipped hands that can fold laundry and load dishwashers, and it just landed on Time’s Best Innovations of 2025 cover.

The company says it’s powered by vision-language-action AI using OpenAI models and Nvidia robotics stacks, yet there’s no timeline or pricing for home use, and factory deployments are likely to come first. The demo shows careful, slow movement designed for safety, hinting at real progress but also reinforcing how far home-ready humanoids still have to go amid competition from Tesla Optimus, Unitree G1, and Boston Dynamics Atlas.

3. Google fires back at Microsoft and OpenAI with Gemini Enterprise 🥊

Google launched Gemini Enterprise to challenge ChatGPT Enterprise and Microsoft’s 365 Copilot, with pricing starting at $21 per user per month and climbing to $30 for higher tiers.

The platform plugs into Google Workspace, Microsoft 365, Salesforce, and SAP, bundles a workbench to coordinate AI agents plus a prebuilt taskforce for research, and promises low latency on Google Cloud’s GPU and TPU stack. This move signals a full-court press to monetize Google’s AI, giving companies a single front door to chat with their data, audit agents, and automate tasks in ways that could streamline customer service and boost team productivity.

4. Google Cloud rolls out Gemini agent subscriptions at Next 2025

Google Cloud launched Gemini Enterprise and Gemini Business subscriptions in Las Vegas, pitching AI agents that handle tasks across Box, Microsoft and Salesforce with pricing at $30 and $21 per user per month.

The bundles include premade agents for software development, data science and customer engagement, plus Model Armor for built-in security and governance, and they fold in Agentspace upgrades at no extra cost for current clients. Google’s timing tightens the race after OpenAI’s third‑party app access in ChatGPT and amid Amazon’s Quick Suite reveal, signaling a push to make no‑code agent building standard across enterprise software.

5. Seaport analyst flags early-stage AI bubble 🫧

According to Yahoo Finance, Seaport Research Partners’ Jay Goldberg says the AI market is in the early phase of a bubble, fueled by aggressive spending from Amazon, Google, Meta, Microsoft, OpenAI and Oracle.

He notes Oracle has started taking on debt to ramp capacity and OpenAI, despite negative free cash flow, is driving frenetic deal-making and plans for a staggering 16 gigawatts of compute, which is prompting rivals to spend heavily to avoid falling behind. Goldberg warns that while the tech giants are unlikely to face credit stress, smaller “neocloud” players and financing vehicles underneath them are loading up on debt and could stumble, potentially rippling upward.

🦾 How You Can Leverage:

What’s that… you’re gonna run a 12-month AI pilot to determine ROI on a set of tasks that you’ve never created a human baseline for?

Spoiler alert: you’ve failed before you started. 

By month 3 you're evaluating Stone Age AI while already missing features your competitors are using to leapfrog you. 

Companies are burning budgets testing yesterday's capabilities while their competitors have moved workflows into front-end AI platforms that now function as full operating systems.

Not chatbots. Operating systems.

ChatGPT, Claude, and Gemini aren't simple AI chatbots anymore, y’all.

Suddenly, front-end AI chatbots agentically research, write and render code, and connect dynamically to your business data.

Moving your daily business processes inside a front-end LLM is the same inflection point companies faced choosing Windows versus Mac versus Linux in the 90s.

Except nobody's treating it that way.

Let’s dive in. 

1 – Front-End AI Is Your New Operating System 🚀

You open ChatGPT and see 12 model options staring back at you.

Which one do you pick?

Here's a better question: why did OpenAI CEO Sam Altman reveal that only 7% of paid users actually use thinking models? Because 93% of people paying premium prices are actively choosing inferior capabilities for their work.

That's not a user problem.

That's your training crisis.

ChatGPT isn't a chatbot anymore. Neither is Claude or Gemini. They're operating systems combining models with modes that create entirely new ways of working.

Canvas for iterative editing. Deep research for autonomous investigation. Connectors creating mini RAG pipelines with your live company data. OpenAI just announced apps this week at DevDay, meaning entire external user interfaces now function inside ChatGPT itself.

Your team isn't prompting a model.

They're orchestrating an entire ecosystem that updates constantly without warning. When GPT-5 launched, it somehow made everything more complicated rather than simpler. More model choices, more modes, more features buried without documentation.

Zero training for the people expected to extract value from this chaos.

Try This: Ask three random employees which ChatGPT model they use for their common tasks and why. You'll discover most people default to whatever the interface selected initially, never exploring options that might work 10x better for their actual work.

2 – Your Missing Human Baseline Is Killing ROI 🔥

Here's the conversation that happens at every company:

"This project takes 8 hours with AI now."

"Great! What did it take before?"

"Uh……. khakis?"

No human baseline equals no measurable ROI on GenAI. Full stop sonny.

Just expensive guessing dressed up as AI innovation.

But it gets worse. Generative AI generates different outputs every single run, which means your one spectacular demo might be pure luck rather than reliable capability. Companies celebrate single successful results without testing whether quality stays consistent.

Then there's the change management failure. People get ChatGPT Teams access, watch maybe one onboarding video, get told to be productive. Zero systematic training on which models suit which tasks. Zero understanding of when to use thinking mode versus instant responses.

And the shiny object syndrome never ends. Google ships something, your team pivots. OpenAI responds, priorities shift. Anthropic launches a feature, direction changes again.

Implementation never actually finishes.

Meanwhile your competitors finished their two-week evaluation sprint three months ago and are already capturing compound advantages while you're still "exploring use cases."

Try This: Pick your most common AI task right now and time three employees completing it manually with zero AI. Track every step including research, drafts, and revisions. Calculate fully loaded hourly costs. This becomes your only honest comparison when leadership asks whether your AI spend actually pays for itself.
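To make that math concrete, here's a minimal sketch in Python. Every number in it is a hypothetical placeholder (the three timed runs, the $65/hour fully loaded rate), not a figure from the show; swap in your own data.

```python
# Hypothetical human baseline: three employees timed on the same task, no AI.
manual_runs_hours = [7.5, 9.0, 8.25]  # research + drafts + revisions, per person
fully_loaded_rate = 65.00             # assumed $/hour: salary + benefits + overhead

baseline_hours = sum(manual_runs_hours) / len(manual_runs_hours)
baseline_cost = baseline_hours * fully_loaded_rate

print(f"Human baseline: {baseline_hours:.2f} hours, about ${baseline_cost:,.2f} per task")
```

That single number becomes the honest comparison point for every AI timing you collect later.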

3 – The Seven-Step Evaluation Sprint ⚡

Stop overthinking this. Two to four weeks maximum per workflow evaluation.

Step 1: Define Success Before Testing
Write exactly what makes this test pass or fail before anyone touches AI. No vague goals. Explicit outcomes, constraints, allowed tools, forbidden shortcuts.

Step 2: Measure Your Human Baseline First
Multiple employees, identical workflow, no AI, everything timed and documented. Errors counted. Revisions tracked. This data becomes sacred.

Step 3: Build Realistic Test Datasets
Twenty to forty actual work examples with real messiness. Renamed files, dead links, deliberate traps. Both humans and AI navigate identical chaos.

Step 4: Configure Production Environments Exactly
Use the exact models, modes, and permissions your team will actually have. Not free versions. Not idealized setups. The real thing or your results mean nothing.

Step 5: Run Everything Three Times Minimum
Separate chats, memory disabled, different days. One good result proves nothing. Require working citations and proof for every accepted answer. Calculate reliability scores (sketched below, after Step 7).

Step 6: Calculate Real ROI With Blind Grading
Graders can't know what came from humans versus AI. Convert time savings to actual dollars using fully loaded rates. Subtract subscription costs. Report net ROI alongside cost, latency, accuracy, stability, safety, integration, compliance (see the sketch after Step 7).

Step 7: Retest Monthly Because Everything Changes
GPT-5 Thinking updated multiple times since August. Canvas mode refreshes constantly. GPT-5 Auto shipped changes this week. Track trends against three-month averages. Investigate immediately when accuracy or savings drop.
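Here's a minimal sketch of the Step 5 and Step 6 math in Python. Every input is a hypothetical placeholder (the grades, the hours, the $65 rate, the $30 seat), and discounting savings by reliability is our simplifying assumption, not a formula from the episode:

```python
# Step 5 (sketch): grade three independent runs blind; 1 = accepted, 0 = rejected.
run_grades = [1, 1, 0]            # hypothetical blind grades from three runs
reliability = sum(run_grades) / len(run_grades)

# Step 6 (sketch): convert time savings to dollars, then subtract tool costs.
baseline_hours = 8.25             # from your Step 2 human baseline
ai_hours = 2.0                    # assumed average of the timed AI-assisted runs
fully_loaded_rate = 65.00         # assumed $/hour, fully loaded
tasks_per_month = 20              # how often this workflow actually runs
subscription_cost = 30.00         # e.g., one enterprise seat per month (assumed)

gross_savings = (baseline_hours - ai_hours) * fully_loaded_rate * tasks_per_month
# Simplifying assumption: discount savings by reliability, since rejected runs
# still cost review and rework time.
net_monthly_roi = gross_savings * reliability - subscription_cost

print(f"Reliability: {reliability:.0%}")
print(f"Net monthly ROI: ${net_monthly_roi:,.2f}")
```

Report that net number alongside the other columns (cost, latency, accuracy, stability, safety, integration, compliance), not instead of them.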

Start with public benchmarks like LM Arena, LiveBench, Epoch AI, and Scale's LLM leaderboard to narrow which platforms deserve your time. Then run this framework on your simplest workflow first.

One workflow. Two to four weeks. Frozen model choice ignoring every shiny announcement during testing.

Try This: Choose your simplest AI use case today and commit to this two-week sprint. Document everything obsessively because this first rigorous evaluation becomes your template for every other workflow where AI might create measurable value instead of expensive theater.
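And if "document everything obsessively" feels vague, here's one hypothetical shape for a per-run log entry; the field names and values are ours for illustration, not a template from the show:

```python
# Hypothetical log entry for one evaluation run. Capturing the same fields
# every run keeps the monthly retests (Step 7) comparable over time.
run_log_entry = {
    "workflow": "weekly competitor brief",          # assumed example workflow
    "date": "2025-10-09",
    "model_and_mode": "GPT-5 Thinking + deep research",
    "run_number": 2,                                # of the three-plus runs
    "minutes_elapsed": 118,
    "blind_grade_passed": True,
    "citations_verified": True,
    "notes": "one dead link flagged; draft needed light edits",
}
```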

 🎁 Bonus Content 🎁

Anotha one?

Yup.

Go repost today’s LinkedIn livestream and we’ll send you this (freakishly detailed) guide on evaluating front-end LLMs for ROI.

Pretty sure you could prolly resell this thing and people would pay TBH.
