I Built an Army of AI Pundits to Beat a Prediction Market. They Lost to a Single Number.
Part 1 · I built an 8-agent AI swarm to forecast Polymarket. It cost 27 cents a run and lost to the market price.
There’s a project on GitHub called MiroFish. I found it the way I find most things that get me in trouble: late, half-asleep, one tab too deep.
The pitch is irresistible if your brain is wired a certain way. You hand it a question about the future. Instead of asking one AI for an answer, it builds a tiny society of them. Eight little personas, each with a name and a backstory and opinions, set loose on a simulated Twitter and Reddit. They post. They argue. They react to each other, talk themselves into things, change their minds. After thirty rounds of this synthetic shouting match, you read the room, and out comes a number: the probability of the thing happening.
I read the README twice and thought the obvious thought.
Could that thing trade?
Because there’s a place to test exactly that. Polymarket is a prediction market where real people bet real money on real questions. Will there be a ceasefire by the end of the month. Will this candidate win. Each market has a price between 0 and 1, and that price is just the crowd’s probability with skin in the game. If my little AI mob could systematically see the future more clearly than that crowd, even a tiny bit, that edge is worth money.
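To make "edge" concrete, here is a minimal sketch of the arithmetic. It assumes a simplified market where a YES share costs the current price and pays 1 if the event happens, with no fees, spread, or slippage. The function name and the example numbers are illustrative, not anything Polymarket provides.

```python
def expected_profit_per_share(true_prob: float, price: float) -> float:
    """Expected profit of buying one YES share at `price`.

    Simplified model: the share pays 1 if the event happens and 0 otherwise.
    Fees, spread, and slippage are ignored.
    """
    if not (0.0 <= true_prob <= 1.0 and 0.0 <= price <= 1.0):
        raise ValueError("true_prob and price must both be in [0, 1]")
    # Win (1 - price) with probability true_prob, lose price otherwise.
    return true_prob * (1.0 - price) - (1.0 - true_prob) * price


# Hypothetical example: the market says 40%, and the real chance is 45%.
edge = expected_profit_per_share(true_prob=0.45, price=0.40)
print(f"{edge:.2f}")  # 0.05
```

The expression simplifies to `true_prob - price`, so under these assumptions a bet only has positive expected value when your probability is genuinely better than the crowd's.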
So I decided to find out. Not with money — this stayed on paper the whole way. With a notebook, a rented server, and a couple of weeks I will not be getting back.
Setting up the swarm
I won’t bore you with the plumbing, partly because some of it is mine to keep. The short version. I got MiroFish running on a small server I rent, wired it up to an LLM, and built a harness around it so it would wake up every morning, pick a few fresh Polymarket questions, run the whole simulation, and write down its prediction before the outcome was known. That last part matters. Anyone can be a genius about yesterday.
For each question it would also do two other things, so I had something to compare against. It asked a single AI, cold, for its best guess. That’s the “baseline” — the dumb version, one model, no theater. And it wrote down the market price at that moment. That’s the crowd.
Three numbers per question. The swarm. The lone model. The market. Then we wait and see who was right.
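A record like that might look something like the sketch below. Every field name here is hypothetical and the values are made up for illustration; it is not the actual harness, just one way to keep the three numbers together with the moment they were written down.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ForecastRecord:
    """One row per question. All field names are illustrative."""

    question_id: str
    recorded_at: datetime          # when the forecast was written down
    swarm_prob: Optional[float]    # None when the swarm run crashed
    baseline_prob: float           # single model, asked cold
    market_price: float            # crowd probability at recording time
    outcome: Optional[int] = None  # 1 = yes, 0 = no, None = unresolved

    def is_scorable(self) -> bool:
        """Scorable once the market resolved and the swarm produced a number."""
        return self.outcome is not None and self.swarm_prob is not None


# Made-up example values.
record = ForecastRecord(
    question_id="example-market",
    recorded_at=datetime.now(timezone.utc),
    swarm_prob=0.70,
    baseline_prob=0.55,
    market_price=0.40,
)
print(record.is_scorable())  # False until an outcome is filled in
```

Storing the timestamp next to the forecast is the part that matters: it is what lets you show the prediction came before the outcome.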
The first surprise was the cost. I’d budgeted somewhere between one and three euros per run, because eight agents arguing for thirty rounds and then writing a report sounds expensive. It came out to about 27 cents. Somewhere between a quarter and a tenth of what I’d guessed. An entire simulated civilization deliberating the fate of the Middle East, for less than a vending-machine coffee. I felt clever for about a day.
The second surprise was less flattering. Almost a third of the runs — 29 percent, nine out of thirty-one — just failed. Crashed somewhere in the middle, usually while building the knowledge graph the agents argue on top of. So before I get to whether the swarm was smart, file this away: it wasn’t even reliable. Roughly one morning in three, my synthetic civilization simply didn’t show up to work.
The night it resolved at 23:43
Then the predictions that did survive started coming in, and I started watching markets resolve.
One night a market closed at 23:43 and my phone buzzed with the result. The swarm had been confident. Loud, even. It had looked at a tense geopolitical question, watched its eight agents work themselves into a frenzy about escalation, and come back with a high probability that something dramatic would happen.
The dramatic thing did not happen. Most dramatic things don’t. That’s the whole trick of the world, and it’s the whole trick of these markets, and it took me an embarrassingly long time to feel it in my gut instead of just knowing it on paper.
There was one beautiful exception. A market about a possible peace deal — the swarm went one way, hard, against a falling market price, and it nailed it. On paper that single call was worth +81.74 €.
I want to be honest about that number, because it’s the kind of number people build entire Medium articles around. “How my AI made me 80 euros overnight.” Here’s the part those articles skip.
Across everything the swarm bet on, the total paper profit was +61.70 €. Which means if you remove that one lucky peace-deal call, the rest of the swarm’s confident, expensive, beautifully simulated opinions added up to about minus twenty euros.
One good night carried the whole thing. That’s not a strategy. That’s a slot machine that happened to pay out while I was watching.
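The subtraction is worth doing explicitly. This tiny sketch uses only the two figures quoted above, with `Decimal` to avoid floating-point noise on money; the variable names are mine.

```python
from decimal import Decimal

total_pnl = Decimal("61.70")   # total paper profit across all swarm bets, in euros
best_trade = Decimal("81.74")  # the single peace-deal call, in euros

# Paper profit of everything except the one big winner.
rest = total_pnl - best_trade
print(rest)  # -20.04
```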
The number that actually mattered
Profit is a liar over small samples. It’s mostly luck wearing a suit. To know whether the swarm could actually see anything, I needed a cleaner measure, and the cleanest one is almost boring: how close were the probabilities to reality?
The standard tool here is the Brier score. You don’t need the formula. Just the direction. Lower is better, and it punishes confident wrongness harder than honest uncertainty. Say something will happen with 95% certainty and watch it not happen, and your Brier score takes a beating. Hedge at 60% and you barely get scratched.
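For anyone who does want the formula: for a yes/no question, the Brier score is the average squared gap between the forecast probability and what actually happened (1 for yes, 0 for no). A minimal sketch, with a function name of my own choosing:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need equally long, non-empty forecasts and outcomes")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# The two cases from the text, both for an event that did NOT happen (outcome 0).
print(round(brier_score([0.95], [0]), 4))  # 0.9025
print(round(brier_score([0.60], [0]), 4))  # 0.36
```

Because the error is squared, being wrong at 95% costs about two and a half times as much as being wrong at 60%.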
Here’s how the three contenders did, over the markets that actually resolved:
- The lone AI, asked cold: 0.42
- My eight-agent swarm, with its graphs and personas and thirty rounds of debate: 0.38
- The market price — the thing I can read for free, in one second, with no server at all: 0.30
Read that again. The swarm beat the dumb single-model baseline. Two weeks of work bought me that, four hundredths of a point, the gap between mediocre and slightly-less-mediocre. But the market crushed both of them, and it wasn’t close.
I had built an elaborate machine to be reliably worse than just looking at the price.
Why it lost: the drama problem
Once I stopped sulking, the failure got interesting.
The swarm wasn’t randomly bad. It was bad in a direction. It systematically over-predicted dramatic, escalatory, news-cycle outcomes. Give it a tense question and the eight agents would feed off each other, each one’s anxiety becoming the next one’s evidence, until the simulated room was certain the worst was about to happen.
I ran a deliberately dull control to check this. I pointed the swarm at boring, almost-certainly-nothing markets — the kind of low-probability questions that resolve “no” 95 times out of 100. The swarm overshot them by a factor of three to seven. It could not look at a quiet room without imagining a fire.
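One way to put a number on "overshot" is the ratio of the average forecast to how often the thing actually happened. That definition is my assumption for this sketch, and the data below is invented to show the mechanics; it is not the real control set.

```python
from statistics import mean


def overshoot_factor(forecasts, outcomes):
    """Ratio of the mean forecast probability to the observed 'yes' frequency."""
    base_rate = mean(outcomes)
    if base_rate == 0:
        raise ValueError("no 'yes' outcomes in the sample; the ratio is undefined")
    return mean(forecasts) / base_rate


# Hypothetical control set: 20 quiet markets, 1 of which resolves yes (5%),
# while the forecaster says 20% on every one of them.
outcomes = [1] + [0] * 19
forecasts = [0.20] * 20
print(round(overshoot_factor(forecasts, outcomes), 2))  # 4.0
```

A well-calibrated forecaster would land near 1.0 on a set like this. With few resolved markets the ratio is noisy, so treat it as a smell test rather than a measurement.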
Which, if you think about it, is exactly what you’d build if you trained a model on the internet and then asked eight copies of it to talk to each other. The internet is a machine for manufacturing the feeling that something huge is always about to happen. My swarm had inherited the single most expensive bias in human judgment, amplified it eight times, and charged me 27 cents for the privilege.
The market doesn’t do this. The market has people on the other side of every dramatic bet, quietly taking the boring side and usually winning. That’s what the 0.30 is. It’s not magic. It’s a lot of people, many with real information, none of them able to monetize panic the way a Reddit thread can.
What I’m doing about being wrong
The trap with a result like this is that I picked it apart after seeing it. That’s how people fool themselves. Run enough analyses on yesterday’s data and something will always look like genius. So before I close this chapter, I did the one honest thing available.
I wrote down nine predictions the swarm is making right now, on markets that haven’t resolved yet, and committed them to a timestamped file I can’t quietly edit later. One clean test, made in public, before the outcome exists. They settle at the end of the month. I told the file, in writing, that I expect the swarm to lose, because the escalation wave it’s betting on has already passed, and the drama problem is the drama problem.
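If you want to do the same thing, one common way to make a file tamper-evident is to publish a hash of it. The sketch below is an illustration of that idea, not a description of my own setup; the function names and the example market keys are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _canonical_bytes(predictions: dict) -> bytes:
    """Serialize predictions deterministically so the same content gives the same hash."""
    return json.dumps(predictions, sort_keys=True, separators=(",", ":")).encode("utf-8")


def commit_predictions(predictions: dict) -> dict:
    """Return a SHA-256 digest of the predictions plus a UTC timestamp."""
    return {
        "sha256": hashlib.sha256(_canonical_bytes(predictions)).hexdigest(),
        "committed_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(predictions: dict, commitment: dict) -> bool:
    """Check that the predictions still match the committed digest."""
    return hashlib.sha256(_canonical_bytes(predictions)).hexdigest() == commitment["sha256"]


# Hypothetical predictions keyed by made-up market names.
predictions = {"example-market-a": 0.72, "example-market-b": 0.15}
commitment = commit_predictions(predictions)
print(verify(predictions, commitment))                              # True
print(verify({**predictions, "example-market-a": 0.50}, commitment))  # False
```

The hash only proves the content hasn't changed. The timestamp is your own clock's word, so the digest still has to be posted somewhere you don't control before the markets resolve.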
If it wins anyway, that’s the first real argument against everything I just told you, and I’ll report it honestly.
I don’t think it will.
The takeaway, before you go build your own
Here’s what two weeks and a few euros of paper bets bought me. A genuine, slightly deflating respect for the number on the screen.
A prediction market price isn’t a starting point you can out-think with enough cleverness. It’s already the aggregated, money-weighted, continuously updated opinion of a crowd that includes people who know more than you and are betting accordingly. Beating it isn’t a matter of bringing a fancier model. The fancy model was worse. It also broke a third of the time, which felt about right.
But “the simple version of this idea didn’t work” has never once stopped me, and it shouldn’t stop you from reading on. The AI swarm was only the first thing I tried.
Next, I went after the market the way a quant would. No AI, no theater — just the oldest, most famous edges in betting, tested one by one against ten months of real Polymarket history. Longshot bias. Momentum. The thin, ignored corners where mispricing is supposed to hide.
Six classic strategies. I’ll show you the numbers on every one.
Part 2: Six Ways to Beat Polymarket That All Don’t Work.