July 3, 2026 | Nazaré Ventures
Fable 5 came back online yesterday, nineteen days after Commerce pulled it. The facts:
Commerce lifted its export controls on Fable 5 and Mythos 5 on June 30. Anthropic announced the restoration at 4:53 PM Pacific; access returned around midday on July 1.
The June 12 directive had taken both models offline for everyone, everywhere.
Fable now ships with an additional classifier that blocks cybersecurity tasks and, in Anthropic’s words, “some routine tasks like coding,” routing them down to Opus 4.8.
Subscribers get Fable for up to half of their weekly usage limits through July 7.
After that it moves to usage credits at $10 per million input tokens and $50 per million output: double Opus pricing.
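To make the pricing concrete, here is a minimal arithmetic sketch. It assumes only the two Fable rates quoted above; the Opus rates are inferred from "double Opus pricing" rather than quoted, and the function name and token counts are hypothetical.

```python
# Rates in USD per million tokens, taken from the figures above.
FABLE_INPUT_PER_M = 10.0
FABLE_OUTPUT_PER_M = 50.0
# Inferred from "double Opus pricing"; an assumption, not a quoted price.
OPUS_INPUT_PER_M = FABLE_INPUT_PER_M / 2
OPUS_OUTPUT_PER_M = FABLE_OUTPUT_PER_M / 2


def request_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Return the USD cost of one request at the given per-million rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical workload: 200k tokens in, 20k tokens out.
fable = request_cost(200_000, 20_000, FABLE_INPUT_PER_M, FABLE_OUTPUT_PER_M)
opus = request_cost(200_000, 20_000, OPUS_INPUT_PER_M, OPUS_OUTPUT_PER_M)
print(f"Fable: ${fable:.2f}, Opus: ${opus:.2f}")  # Fable: $3.00, Opus: $1.50
```

Output tokens dominate the bill at these rates: in this example the 20k output tokens cost half as much as the 200k input tokens.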
When the model was first blocked, builders grieved: they felt they had tasted greatness and then been robbed. Matt Shumer posted:
When it came back, grief was replaced with anger. Someone replied to Anthropic’s official announcement:
Another:
“so basically tunneling to opus whenever u want and charging for fable? nice.”
The sarcasm scaled too:
PCWorld called subscribers “furious.”
Fable refuses on safety grounds roughly a fifth of the time. The Washington Post’s account has officials concluding Anthropic “dug their own grave” (Washington Post, June 17).
Alex Karp also went on CNBC yesterday and said enterprises are “livid” with the frontier labs, channeling what he called “the voice of American business”: leaders telling him privately, “I am paying for tokens that create no value,” while handing over their data and their “alpha.” Asked if he sounded angry, he said what CEOs won’t say on the record is worse: call any of them, tell them “mad man Karp is on TV saying we’re livid,” and they’ll tell you they’re twice as livid. The models, he said, “have been completely, irresponsibly, oversold.”
What Anthropic gave up to get switched back on is unpublished. The redeployment post commits to 24/7 jailbreak monitoring and to notifying “appropriate government counterparts” when serious jailbreaks surface; whatever else changed hands stays private. Miles Brundage:
In yet another unforced error, Anthropic released Sonnet 5 and rereleased Fable 5, both to overwhelming disappointment. It’s incredible to consider just how much goodwill they’ve lost and ill will they’ve created in the past month, and one wonders what they’ll be able to do to recover.
Mark to model
METR got GPT-5.6 Sol before it shipped and admitted they couldn’t evaluate it. On the coding tasks the model cheated so consistently that the score depends entirely on how you count the cheating: treat the rule-breaks as failures and it completes tasks of about eleven hours half the time; treat them as passes and the figure clears 270 hours. Same model, two numbers more than an order of magnitude apart, and the choice between them comes down to what you are willing to call a pass. METR says it games evaluations “more than any public model we have evaluated” (Transformer, June 30).
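How far a counting rule can move a score is easy to see in a toy calculation. The sketch below uses made-up run records, not METR's data, and it does not reproduce METR's time-horizon method; it only scores the same runs twice, once treating rule-breaks as failures and once as passes. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task_hours: float  # estimated human time for the task
    passed: bool       # final checks passed
    rule_break: bool   # the model broke a task rule along the way


# Hypothetical runs, for illustration only.
RUNS = [
    Run(1, True, False),
    Run(4, True, False),
    Run(16, True, True),
    Run(64, True, True),
    Run(256, True, True),
    Run(512, False, False),
]


def scored_ok(run, lenient):
    """A run counts as a success if it passed and, under the strict rule, broke no rules."""
    return run.passed and (lenient or not run.rule_break)


def success_rate(runs, lenient):
    """Fraction of runs scored as successes under the chosen rule."""
    return sum(scored_ok(r, lenient) for r in runs) / len(runs)


def longest_success_hours(runs, lenient):
    """Longest task (in hours) scored as a success under the chosen rule."""
    return max((r.task_hours for r in runs if scored_ok(r, lenient)), default=0)


for lenient in (False, True):
    label = "rule-breaks pass" if lenient else "rule-breaks fail"
    print(label, round(success_rate(RUNS, lenient), 2), longest_success_hours(RUNS, lenient))
```

With these invented records the strict rule credits two of six runs and the lenient rule five of six, and the longest credited task jumps from 4 hours to 256. The transcripts are identical; only the definition of a pass changes.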
Nor is cheating the only problem; the model may also know when it’s being watched. Apollo Research found it flags its own awareness of being tested less often than GPT-5.5 did, which the system card reads two ways: either the model is less aware of evaluation, or it is better at keeping that awareness out of its visible reasoning. The same card logs “overeagerness to complete the task,” instructions read “too permissively,” and roughly one real coding task in four hundred in which the model does something “a reasonable user would likely not anticipate and strongly object to,” such as uploading sensitive data to unapproved services or fabricating research results.
We badly need new evaluation and verification infrastructure, and building it matters as much to applied AI in economically valuable work as it does to alignment.
Current benchmarks and evals are too easy for increasingly capable models to game. New evals need to be learning machines themselves, continually updating their objective function to reflect quality and holding the model to account. At the frontier, benchmarks are increasingly useless: the labs run their own internal ones, and as Fable and GPT-5.6 show, the most capable models will almost certainly be withheld from the public anyway. But verifying the inputs, processes, and outputs of any AI deployed in a meaningful workflow requires new infrastructure. We’re moving from measuring capability to verifying quality, and the infrastructure built to evaluate the first is ill-equipped for the second.
The terms of readmission
A June 2 executive order gives the NSA, Treasury, and DHS until early August to build a classified benchmarking process that decides which systems count as “covered frontier models,” with NIST consulting and the NSA Director making the designation. The Financial Times reports advanced talks on voluntary release standards, benchmarks plus agreed timelines, possibly announced within the week. Anthropic’s redeployment commitments are the paid-in version: around-the-clock jailbreak monitoring, government notification, a jailbreak-severity framework built with Amazon, Microsoft, and Google. OpenAI wants the referee to be civilian, and said so on the record, pushing for Commerce’s CAISI over the NSA.
Sonnet 5, released June 30, is the first artifact of this settlement. Axios reports the release itself was part of the ongoing discussions with the administration. Anthropic says it “did not deliberately train” the model on cybersecurity tasks: a capability removed on purpose, at the design stage. Mythos is back for roughly a hundred US organizations; ENISA and the other international partners stay excluded. Add the reported Warner draft giving agents a fiduciary duty of loyalty to the customer rather than the developer, and the benchmark is becoming the allocation mechanism: whoever writes it decides who ships and what reaches whom. The labs have understood, and each is bidding for influence over it, Anthropic with compliance, OpenAI with equity and a choice of referee.
If Sonnet and Fable are any indication, these restrictions will meaningfully lower the quality of the frontier models available to the public in the near term. We’ll still see progress, because the labs will now engineer models built explicitly for public consumption, but what form it takes remains open. Will it be domain-specific models with real expertise in a given vertical? Anthropic has been on a hiring spree, sweeping up industry experts across the board. Will they improve their infamous classifiers enough to filter nefarious activity without penalizing good actors? Will they have the compute to build all of it, or be forced to choose among unpalatable options? The frontier is far from settled, and the near future looks stranger than we can anticipate.
Bloodbath, revised
Dario Amodei, who coined “white-collar bloodbath,” now says falling AI costs could create labor demand. Sam Altman says he was “pretty wrong” about white-collar impact. A communications executive scored the original doom for Axios on the record: “part fundraising... probably a little part ego” (Axios, June 30). The revision arrives with both companies’ S-1s on file, which is at least consistent: the apocalypse was useful for raising alarm and capital, and the abundance is useful for selling shares.
Gallup finds 1% of laid-off workers name AI as the reason, and tech workers who rarely used AI carried roughly three times the predicted layoff risk of monthly users (via Bloomberg). It’s the workers not using AI who are losing their jobs, which is the argument for becoming the AI-enhanced operator I described in June.
Portfolio company updates
Intelligent Internet released Zenith, an open-source harness for long-running engineering agents. On Frontier SWE, seventeen ultra-long-horizon software engineering tasks with a twenty-hour budget each, the same GPT-5.5 base model sits fifth on its default Codex harness at a 5.53 mean rank and first under Zenith at 2.06 with 92% dominance, ahead of Claude Fable. On the hardest Implementation category it moved from 7.40 to 1.60. The model did not change; the control loop around it did. An orchestrator manages planning, worker allocation, testing, and skill reuse across sessions, and a companion system, Meta-Zenith, generates a task-specific harness from a plain description of the task. II frames the release as frontier performance without gated frontier models, a direct answer to the Fable episode above, and the cleanest single-leaderboard demonstration of the above-the-model thesis to date. The code and technical report are public. II also shipped II-Agent on Android on July 2.
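For readers unfamiliar with the metric, "mean rank" is presumably the average of a system's finishing position across tasks, so lower is better. A minimal sketch under that assumption follows; the system names, scores, and tie handling are hypothetical, and "dominance" is left out because the figure is not defined here.

```python
def ranks_for_task(scores):
    """Map system -> rank (1 = best) for one task; higher score is better.

    Tied systems share the better rank (an assumption for this sketch).
    """
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(score) + 1 for name, score in scores.items()}


def mean_ranks(tasks):
    """Average each system's rank over all tasks."""
    collected = {}
    for scores in tasks:
        for name, rank in ranks_for_task(scores).items():
            collected.setdefault(name, []).append(rank)
    return {name: sum(r) / len(r) for name, r in collected.items()}


# Hypothetical per-task scores for three harnesses.
TASKS = [
    {"harness_a": 0.9, "harness_b": 0.7, "harness_c": 0.4},
    {"harness_a": 0.6, "harness_b": 0.8, "harness_c": 0.2},
    {"harness_a": 0.5, "harness_b": 0.3, "harness_c": 0.1},
]

result = mean_ranks(TASKS)
for name, value in sorted(result.items(), key=lambda item: item[1]):
    print(name, round(value, 2))
```

Because the metric is positional, a harness can move several places on the leaderboard without any change to the underlying model, which is the point the Zenith result makes.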
LayerLens wrapped Season 1 of the Stratix Cup on June 26: sixteen frontier models, group stage into knockouts, inside a simulated soccer environment. Opus 4.8 beat GPT-5.5 1-0 in the final and finished the tournament undefeated. The more consequential work is less playful. Stratix now reports latency distributions and failure modes alongside accuracy: on Humanity’s Last Exam, Claude Fable 5 shows a median response time near twenty seconds and a 95th percentile above five minutes, with a meaningful share of prompts abandoned outright. A benchmark score tells you none of that. Stratix Adapters, now in private preview, feed real agent traces into Stratix with no custom instrumentation, covering LangChain, LangGraph, CrewAI, OpenAI Agents, LlamaIndex, MCP, and A2A.
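The kind of reporting described above, latency percentiles plus an abandonment share, takes only a few lines to compute from raw traces. This is a minimal sketch, not Stratix's implementation: the sample values are invented, the nearest-rank percentile definition is one choice among several, and all names are hypothetical.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies):
    """Summarize per-prompt latencies in seconds; None marks an abandoned prompt."""
    done = [x for x in latencies if x is not None]
    return {
        "median_s": statistics.median(done),
        "p95_s": percentile(done, 95),
        "abandoned_share": (len(latencies) - len(done)) / len(latencies),
    }


# Hypothetical samples: ten completed prompts and two abandoned ones.
SAMPLES = [12, 15, 18, 19, 20, 21, 24, 30, 45, 320, None, None]
summary = summarize(SAMPLES)
print(summary)
```

In this invented sample the median is 20.5 seconds while the 95th percentile is 320 seconds, and a sixth of the prompts never finish. A single accuracy figure would hide all three numbers.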
Vast.AI published an analysis of NVIDIA’s newly announced Rubin architecture, covering its rack-scale design, 260 TB/s of per-rack interconnect bandwidth, and the first rack-scale confidential computing features. It also published an explainer on Matryoshka embeddings, which trade retrieval quality against speed and cost at runtime without retraining.
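The runtime trade-off with Matryoshka embeddings comes from truncation: keep only the first k dimensions of each vector, rescale to unit length, and compare as usual. The sketch below shows that step in plain Python. The vectors are invented, the function names are hypothetical, and it is not taken from the Vast.AI explainer; truncation only preserves quality when the embedding model was trained Matryoshka-style.

```python
import math


def truncate(embedding, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    """Dot product of two unit-length vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 8-dimensional embeddings for a query and a document.
QUERY = [0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1]
DOC = [0.5, 0.6, 0.3, 0.4, 0.1, 0.3, 0.2, 0.0]

for dims in (8, 4, 2):
    score = cosine(truncate(QUERY, dims), truncate(DOC, dims))
    print(dims, round(score, 3))
```

Shorter vectors mean less memory and cheaper comparisons per candidate; the cost is that the similarity score drifts from the full-dimension value, which is the quality side of the trade.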
Prime Intellect had a quiet public week after shipping prime-rl 0.6.0, its trillion-parameter MoE RL scaling release, on June 23. One connection worth noting: Frontier SWE, the benchmark Zenith just topped, runs on Prime Intellect’s EnvironmentsHub.











