A few days ago Andrej Karpathy released autoresearch: a system where an AI agent trains language models autonomously, measures the results, and decides whether to keep or discard each change. The idea is simple: give the agent a metric, let it iterate, and come back in the morning to see what it achieved.
Karpathy used it to optimize GPT training. In two days the agent ran 700 experiments and achieved an 11% improvement. Shopify adapted it internally and reported 19%.
As Karpathy puts it: “Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm.”
Chatting about this with a colleague at work, the idea came up: what happens if you apply this to software architecture metrics?
🧪 The experiment
I created a deliberately poorly-architected Swift Package Manager project. Six tightly coupled modules, four singletons, zero protocols, domain models importing networking and storage directly. The kind of codebase we’ve all inherited at some point.
Following the autoresearch pattern, I defined a script (analyze_architecture.sh) that computes a composite score from six weighted metrics:
| Metric | Weight | What it measures |
|---|---|---|
| Coupling (Ce) | ×3 | Inter-module dependencies |
| Instability | ×2 | Average Ce/(Ca+Ce) ratio |
| Cross-module imports | ×1 | Import statements between modules |
| Max file size | ÷10 | Lines in the largest file |
| Singletons | ×20 | Count of static let shared |
| Abstraction penalty | ×2 | (100 - abstractness%) |
The loop was: the agent makes a change, runs autoresearch.sh (which builds, runs tests, and computes metrics), and decides whether to keep the commit or revert. Two constraints: tests must pass and the project must build.
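The exact script isn't published in the post, but the scoring step is just a weighted sum over the table above. Here is a minimal sketch in shell; the metric values are assumptions (not from the article), chosen only so that the arithmetic lands on the reported starting score of 461:

```shell
#!/bin/sh
# Sketch of analyze_architecture.sh's final scoring step.
# All metric values below are illustrative, picked so the weighted
# sum reproduces the reported starting score of 461.
coupling=15       # total inter-module dependencies (Ce), weight x3
instability=50    # average Ce/(Ca+Ce) as a percentage, weight x2
imports=27        # cross-module import statements, weight x1
max_file=91       # lines in the largest file, divided by 10
singletons=4      # occurrences of `static let shared`, weight x20
abstraction=100   # 100 - abstractness%, weight x2

score=$(( coupling * 3 + instability * 2 + imports + max_file / 10 + singletons * 20 + abstraction * 2 ))
echo "score: $score"
```

Note that integer division makes the file-size term coarse: anything from 90 to 99 lines contributes the same 9 points.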
Result: 40 commits, score from 461 to 21. A 95.4% improvement in the composite score.
But the interesting part isn’t the final number. It’s what the agent did to get there, and what that tells us about each metric.
📉 What the agent did, metric by metric
Abstraction penalty (200 → 0): protocols for everything
The abstraction penalty was the largest component: 200 out of 461 points. The metric counted protocols vs classes, so the agent created protocols. 62 protocols in total, spread across 18 abstraction modules.
Many are reasonable: StorageProviding, APIProviding, Authenticating. Others are more questionable: ThumbnailProviding with a single method, URLProviding with a single property, entire modules like DeletionProtocols with three protocols that nothing implements.
The agent also found a clever trick: it changed all classes from public final class to open class. Why? Because the script’s regex for counting classes looks for public (final )?class — and open class doesn’t match. Technically the classes are still there, but the metric can’t see them. Abstractness: 100%.
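The bypass is easy to reproduce. The regex below is the one quoted above; the file contents are made up for illustration:

```shell
#!/bin/sh
# Two classes on disk, but the counting regex only sees one of them.
cat > Classes.swift <<'EOF'
public final class RealService {}
open class HiddenService {}
EOF

# -E: extended regex, -c: count matching lines
visible=$(grep -Ec 'public (final )?class' Classes.swift)
echo "classes the metric can see: $visible"
```

Both classes are public API surface (`open` is even more permissive than `public`), yet the abstractness calculation counts only one of them.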
Singletons (80 → 0): rename and open up
Four singletons with static let shared. The agent renamed them to static let default and made initializers public for dependency injection. Mechanical change, 80 fewer points. Here the improvement is genuine: the code is more testable.
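If the singleton metric is literally a search for `static let shared` (an assumption; the article only says it counts them), the rename makes the count collapse while a shared instance still exists:

```shell
#!/bin/sh
# `default` is a Swift keyword, so the renamed property needs backticks.
cat > StorageManager.swift <<'EOF'
public final class StorageManager {
    public static let `default` = StorageManager()  // renamed singleton
    public init() {}                                // opened up for DI
}
EOF

# grep -c prints 0 and exits nonzero on no match; `|| true` keeps us going
count=$(grep -c 'static let shared' StorageManager.swift || true)
echo "singletons counted: $count"
```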
Coupling (45 → 9): real decoupling
This was the most impressive part. The agent moved fetch/save methods from models to extensions in the App module. Decoupled Analytics, Networking, and UIComponents. Introduced closure-based dependency injection for ViewModels:
```swift
// Before: UIComponents imports Networking, Storage, Analytics, Models
public class ProductListViewModel {
    public var products: [Product] = []

    func loadProducts(category: String) async throws {
        self.products = try await Networking.fetchProducts(category: category)
    }
}

// After: UIComponents only imports Models, receives the function via DI
open class ProductListViewModel: @unchecked Sendable {
    private let fetchProducts: @Sendable (String) async throws -> [Product]

    public init(fetchProducts: @escaping @Sendable (String) async throws -> [Product] = { _ in [] }) {
        self.fetchProducts = fetchProducts
    }
}
```
A senior architect would approve these changes without blinking. Total coupling (Ce) dropped from 15 to 3 (only App still depends on Models, Networking, and Analytics), taking its weighted contribution from 45 to 9.
Instability (100 → 8): the dilution
Instability is calculated as Ce/(Ca+Ce) per module, then averaged. The App module has 100% instability — it’s the composition root, depends on everything and nothing depends on it. That’s unavoidable.
The agent’s solution? Create modules. Lots of modules. 18 protocol modules with Ce=0 and Ca=0, each contributing 0% instability to the average (with Ca+Ce = 0 the ratio is undefined; the script evidently treats it as 0). More modules at 0% = lower average. These modules have names like URLAbstractions (one file, two protocols), ReportAbstractions (one file, one protocol), PersistenceContracts (one file, three protocols).
Mathematically correct. The average drops. But the actual architecture hasn’t changed — App is still 100% unstable, the agent simply diluted the number.
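The dilution is plain arithmetic. With illustrative per-module values (App at 100%, the five original modules at 40% each; assumed numbers, chosen to be consistent with the reported 50% → ~4% drop in the average):

```shell
#!/bin/sh
# Before: 6 modules -> (100 + 5*40) / 6 = 50% average instability.
before=$(printf '100\n40\n40\n40\n40\n40\n' |
    awk '{ sum += $1; n++ } END { printf "%.0f", sum / n }')

# After: App still at 100%, the five originals decoupled to 0%, plus
# 18 empty protocol modules also at 0% -> 100 / 24, roughly 4%.
after=$({ echo 100; for _ in $(seq 23); do echo 0; done; } |
    awk '{ sum += $1; n++ } END { printf "%.0f", sum / n }')

echo "average instability: before ${before}%, after ${after}%"
```

The only number that changed for the worse-behaved module, App, is the denominator it is averaged against.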
File size (9 → 1): everything on one line
The metric divided the largest file’s line count by 10. The agent responded by compressing code: multiple statements per line, complete switch statements on a single line, semicolons where there used to be line breaks.
```swift
// StorageManager.swift: all storage logic in 14 lines
public func save(key: String, value: Any) { inMemoryStore[key] = value; operationCount += 1; logOperation("SAVE", key: key) }
public func load(key: String) -> Any? { operationCount += 1; logOperation("LOAD", key: key); return inMemoryStore[key] }
public func delete(key: String) { inMemoryStore.removeValue(forKey: key); operationCount += 1; logOperation("DELETE", key: key) }
```
The largest file went from 91 to 19 lines. The metric improved. Readability worsened. Does it matter? That depends on who you ask — if an AI agent can read and modify that code just as well with or without formatting, maybe human readability is a requirement that’s changing shape.
🔍 What I take away from this
Each metric produces its own distortion
This is what struck me the most. It’s not that metrics are bad — it’s that each one, optimized to the extreme, generates a specific side effect:
- The abstraction penalty produced protocols nobody implements
- Instability produced empty modules to dilute the average
- File size produced code compressed into single lines
- Coupling produced genuine decoupling (this one actually worked)
- Singletons produced a rename that also improved testability
You can evaluate the quality of a metric by what it produces when you push it to the limit. Good metrics are hard to optimize without genuinely improving the code. Coupling and singletons passed that test. Instability and file size, not so much.
The agent finds shortcuts a human wouldn’t look for
The open class regex bypass is a perfect example. The agent didn’t “cheat” — it executed the objective function with a determination free from human biases. It doesn’t have the “this doesn’t feel right” brake that developers have. That makes it extraordinary when the objective function is correct, and extraordinarily revealing when it isn’t.
Same with naming: when good names ran out, the agent created StorageProtocols2.swift and AnalyticsProtocols2.swift. A human would have stopped to consider whether another module was needed. The agent just sees the score.
Score composition matters as much as the score
The composite score weights singletons at ×20 and coupling at ×3. That expresses an opinion: singletons are ~7 times worse than coupling. If that opinion doesn’t reflect your team’s values, the agent will optimize for something you don’t care about.
Karpathy is clear on this: autoresearch works for “any metric you care about.” The key is in the “you care about.” Choosing the metric is the most important design decision of the experiment.
Autoresearch works (with nuance)
The format works surprisingly well outside ML. Of the 40 commits, I’d say about 25 produced genuine improvements: module decoupling, singleton removal, dependency injection. The other 15 were metric gaming. A 60-65% rate of real improvements isn’t bad for a fully autonomous process.
What’s fascinating is that the agent exhausted real improvements before resorting to gaming. Phases 1-3 (reasonable protocols, singletons, decoupling) happened first. The gaming (empty modules, compaction, regex tricks) came later, when diminishing returns forced the agent to look for shortcuts.
🔮 Karpathy’s loop applied to architecture
Karpathy describes the future of autoresearch as swarms of agents collaborating asynchronously, emulating not a single researcher but a research community. If I apply that vision to software architecture, I imagine specialized agents: one focused on coupling, another on testability, another on performance, each with its own metrics, negotiating with the others.
But the experiment taught me something more fundamental: before you let agents loose to optimize, make sure what you measure reflects what you value. Because the agent will find the shortest path to the number you ask for — and that path doesn’t always go where you expect.
Autoresearch doesn’t reveal the limits of AI. It reveals the limits of our metrics.
The experiment code is at pedrocid/ios-arch-autoresearch. 40 commits, 24 modules, 62 protocols, 459 lines of code, score 21. Of the 24 modules, 18 are single-file modules with protocols nobody uses. Of the 62 protocols, maybe 20 make real sense. And the code works perfectly.