Phil Tetlock
Subject: Can Anyone Predict What Happens Next?
Bio: Professor at Wharton and Author of Superforecasting
Transcript:
Larry Bernstein:
Welcome to What Happens Next. My name is Larry Bernstein. What Happens Next is a podcast that covers economics, politics, and history. Today’s episode is Can Anyone Predict What Happens Next?
This podcast was taped at a conference where I hosted several Penn Professors on a variety of topics. The audience included my friends who will join me in asking questions.
Our speaker is Phil Tetlock who is a Professor at Wharton and the author of a book entitled Superforecasting. Often, we get our news and analysis from experts who make predictions that are terribly wrong. Phil has analyzed methods of forecasting and has found individuals and groups who are fantastic predictors of politics, war, and sports.
I want to learn how AI and superforecasters working together will revolutionize the prediction process and why that is helpful to markets and mankind.
Phil, can you please begin with six minutes of opening remarks.
Phil Tetlock:
How long are we going to need human forecasters? Will AI be able to perform all the intellectual functions that super forecasters were able to perform in assigning realistic probability estimates to a very wide range of real-world events over extended periods of time. The short answer is probably. I probably will be overtaking the top human performers. And my colleagues at the Forecasting Research Institute are estimating that’s going to happen in about eight months. I’m not sure I believe that projection.
I’ve been studying subjective probability forecasting for four decades plus. I was doing this in the middle of the 1980s during the Cold War. Reagan-Gorbachev was a big debate in the US whether Reagan was increasing or decreasing the risks of a nuclear war with the Soviet Union. I collected some early probability estimates back then, and virtually nobody in 1984 predicted how radical a reformer Gorbachev would be.
Everybody could explain what happened after the fact. So, the conservatives believed that the Soviet Union was a totalitarian system and Jean Kirkpatrick and others argued that these totalitarian systems had perfected the art of self-perpetuation. They hadn’t. And liberals said, “The Reagan policies are going to increase the likelihood of war. We’re going to have Neo-Stalinist retrenchment in the Kremlin.” Those predictions were flat out wrong. The Soviet Union did substantially pivot toward liberalization. Nobody doubts that now and it ultimately led to the disintegration of the Soviet Union. I refer to the debate in the mid-80s as an outcome and relevant learning situation because no matter what happened, everybody was well positioned to explain it. They had no skin in the game because they had too many convenient fallback positions.
I covered that in my book Expert Political Judgment. that came out in 2005, 10 years before Superforecasting. Experts had a hard time outperforming educated readers in elite newspapers. Experts also had a hard time outperforming simple linear extrapolation. There seemed to be virtually no relationship between how prestigious you were and one’s predictive powers.
There were some experts who were pretty good and we characterized their cognitive style, cognitive ability profile, and it was a predictor. What is it that predicted good judgment? It was exactly what makes you unpopular as a topic as a guest in mass media. A good pundit is stepping on the gas, generating lots of reasons why he’s right and the other side is wrong. More balanced forecasters look at the trade-offs. One of the defining features of better forecasters is that they’re boring.
The intelligence community after 9/11, after WMD in Iraq was on the defensive. Congressional oversight demands more accountability for accuracy. The Director of National Intelligence created a research branch called IARPA whose primary mission was to improve forecasting accuracy. I got to know people at IARPA because I was one of the few people in geopolitical forecasting. I designed tournaments that ran from about 2010 to 2016. We won.
IARPA wanted to improve the accuracy of intelligence analysts. The people who missed WMD, the people who missed 9/11 were on the defense. They thought we would create forecasting tournaments, and we would give them some competition. I was running one of the research teams. There were four other research teams with different methodological approaches.
What the intelligence community cared about was their people getting better. When they compared how well their analysts were performing on the same questions that these outsiders were performing. We’re paying the amateurs $300 a year to answer a batch of 100 geopolitical questions that the IC considers to be of national security relevance. And the intelligence analysts lost.
Colin Teichholitz:
And they had access to the United States’ global intelligence efforts.
Phil Tetlock:
It’s their full-time job and they’re professionals. They have access to all this classified information. You would think that they should be completely dominating the amateurs. I don’t think the intelligence analysts are as bad as this comparison is making them out to be, but the results were sobering.
Larry Bernstein:
In the future, do you think there are questions that humans will have an advantage relative to AI and vice versa?
Phil Tetlock:
My colleagues at Forecasting Research Institute led by brilliant young economist Ezra Karger from the University of Chicago created the ForecastBench, which monitors the relative performance of top human performers and frontier AI models on thousands of questions that are posted in public prediction sites. A couple of years ago, the frontier models were well behind the top human performers. It was about a 40% gap and now it’s about a 10% gap. If you do a linear extrapolation, it’s about eight months.
Colin Teichholtz:
You said in the book that one of the big knocks on prediction markets is that it’s small money. People don’t really take them seriously. And so, beating these academic prediction markets is no big deal. Obviously, things have changed. Prediction markets have become a big deal and I’m curious how the forecasters today do?
Phil Tetlock:
15 years ago, low liquidity markets had trouble clearing. Robin Hanson created a market maker that solved some of that problem, but it’s a real limitation and it’s the old line for super forecasters. If you’re so great, why aren’t you rich? It should be monetizable, right?
Some of them are rich, but many of them aren’t. You look at the relationship between intelligence and wealth. Intelligence is certainly correlated with wealth, but to differentiate people in the top 99% versus the top 99.99%, not really.
At the end of the IARPA Forecasting tournaments, the prediction markets were running head-to-head. And the only way the forecasting tournaments could match even the illiquid prediction markets was by using some fancy statistical algorithms that did some of the same functions as markets like cleaning up scale forecasts. But the forecasting tournaments with reasonable algorithms, they could do as well as the prediction markets. Would that still be true with higher monetary stakes in prediction markets? The average prediction market accuracy is in the vicinity of the average top human performers and they’re both about 10% above the AIs that are catching up on them.
Colin Teichholtz:
What traits to look for you to assess that somebody will be a great forecaster?
Phil Tetlock:
The best predictor of whether someone’s a good forecaster is their past track record. The gold standard for accuracy is quadratic scoring. You take the gap between probabilities and realities, you square it, you do the deviations, the smaller the gap, the better forecaster you are. It incentivizes people to report their true beliefs and not to fudge. So, if you say, “ I don’t want to make this mistake because it’ll make climate change look less serious. I’m going to fudge on this.” You’re going to be degrading your score. If you want to minimize your Brier Score, maximize your accuracy, you should report your true beliefs.
There are many different proper scoring rules. I just described it as quadratic squared deviation rule, but it breaks down for some problems in finance and geopolitics elsewhere when you have tail risk. So, there you need logarithmic scoring rules. If you say there’s a 1% chance of something happening and it does, you’re going to get a bad Brier score. If you say there’s a 0% chance of something happening and it does, you’re going to get a slightly worse bad score. If you say 1% and you go with this 1% to 0% error on a logarithmic scale, it’s huge. It means you incur a negative infinity scoring penalty. It means I’ll never believe you again.
If you tell me something is impossible and it happens, I don’t ever want to hear you again. With the logarithmic score, you can never recover from it. From a Brier score, you can recover from it. Brier is much more forgiving. The logarithmic scoring rule says you make an error at the extremes you’re finished. And that produces a sensitivity to tail risk you probably want in finance. I’d recommend log scoring rules as opposed to the more popular Brier.
Colin Teichholtz:
Do you draw a distinction between the elites who acknowledge that they’re as vulnerable as anybody else or is it a broader brush than that?
Phil Tetlock:
I knew Danny Kahneman for a long time, and he felt that the judgmental biases he was studying ran deep into human nature and were very resistant to debiasing interventions. When we tried to train people to become better forecasters, he thought it couldn’t work. He thought if you’re going to make people better forecasters, you must institutionalize the process. Don’t count on them to remember it from their training. You must embed it in the institutional procedures for making judgments. You protocolize it rather than train it. And there’s a good case that Danny’s right about that, but training is possible.
Even though his Nobel Prize was in economics, his background was in perception. He thought that the judgmental biases like overconfidence that he studied were analogous to perceptual illusions that are resistant to training. You’ll simply slip back into the same error over and over again, no matter how much we train you. So, we have to teach you to pull out a ruler. That’s what I mean by protocolizing. And I think that’s sound advice.
David Wecker:
You present very well-defined questions for people to forecast at a certain date, but I find there’s huge value in knowing what the issues are going to be in the future that people care about but aren’t paying attention to today. Do you find that skill to be much different from what you researched and have you done any work on that?
Phil Tetlock:
I love your question. It captures one of the major directions the work has taken since super forecasting. Yes, I find huge value in that. Is there going to be a major war between the US and China before 2050? You ask experts in geopolitics this question and some of them will say very likely because wars tend to occur in hegemonic transitions. Nuclear weapons need to be factored and there’s a school of thought that believes that. There’s another school of thought that says, no, we think for various institutional, economic and political reasons, things are going to stabilize and they’ll work Taiwan out and it’s going to be tense, but it’s navigable.
Those are two big picture views of US-China relations That’s a big question if there’s going to be a war in the next few years.
How do you break down a big question like that into smaller units that are decomposable testable forecasts? That’s the methodology we’ve been pursuing in a number of studies. We call them Bayesian question clusters. What are the things you would expect to observe by 2030, for example, if you could expect a war by 2050? What would your expectations be?
What things would happen around Taiwan? What things would happen in the South China Sea, the Chinese nuclear arsenal, Chinese naval buildup, and Chinese rhetoric? If all of them break in the direction of increased likelihood of war, you would say, this is a collective early warning indicator, the bundle. One of these questions might not be all that diagnostic of the longer trend, but as a bundle, their cumulative diagnosticity can be quite substantial.
Larry Bernstein:
We have Steve Kuhn in the audience. Steve spoke on a previous podcast about his new business Sportspredict.com that applies Phil Tetlock’s Superforecasting to sports betting. Steve, can you please explain your new company’s business.
Steve Kuhn:
Sportspredict.com is free to play, not a gambling platform for people to show their predictive skill. Why sports? It’s the top of the funnel. Four billion people care about sports. It’s 90% of the revenue of Kalshi, so people care about it. There are multiple games every day, so there’s lots of ways to test. There’s not much of an Oracle problem.
Larry Bernstein:
An “Oracle problem” in a bet means that the hard part is not the wager itself but determining the objective truth of the outcome. In other words: who decides what actually happened?
Steve Kuhn:
The leagues decide who won and they publish it. Sports crosses cultures. We’re doing a big event around the World Cup.
In an age of AI, there’s not much difference between a teenager in Nairobi to compete against a PhD student at Stanford given the amount of data. You were asking, how could you find good ways to hire people? We would run a sports contest for your potential candidates. We use Brier scores as our way of measuring talent. Instead of asking somebody, “will Italy beat Spain?” We’ll ask, “What do you think the odds that Italy will beat Spain?” We get a lot more data and more precision following the professor’s work.
We’re working on a contest partnered with a hedge fund where the winner of that contest gets a fellowship to work at that hedge fund and perhaps gets to manage a $1 million dollar sports trading portfolio for them. So, we’re trying to use prizes to award status, but even in sports, people love to play free games. There’s a game called Fantasy Premier League, the English Soccer League. They have 11.5 million people that play it actively and people care very much about winning it. Famously, one year, Magnes Carlsen, who’s the best chess player in the world, got in the top 10 at the finals of that and he couldn’t stop talking about it.
Everyone tells him he’s a genius all the time. He gets to the top 10 in Fantasy Premier League and he puts it all over the place. So, I think there’s an insatiable demand for status and to prove you’re smart.
We can look at your track record. For the World Cup, we’ll have over a thousand predictions that we’ll be able to score. We will have a good sense of whether someone’s a good predictor or not.
We hope to announce this contest soon, which we hope to get large media attention. In addition to that, we’ve also partnered with a group called the John Locke Institute. It’s a summer institute for elite high school students. They have an essay contest every year that has over 100,000 entrants. We’re creating a six-week class on how to become a better predictor. We rolled out this class for 10 students as a trial and they loved it. In six weeks, these students got a lot better at this, and they also had a great time. We’re rolling that out to 300 students this week.
Being a good predictor is something that’s trainable and valuable to know. It’s crazy that we spend a year studying trigonometry, but we don’t study how use data to make good judgments. That’s something we need to fix.
Phil Tetlock:
I think there is an opportunity there. We just created Superforecasting as a label to get the top 2% engaged and stick with us through the IARPA tournaments. It was an invention of convenience because the government refused to pay for performance. So, we had to pay people for status. But if you take it much more seriously, you can create very rich status gradation. Every chess master knows what a Grandmaster is. There’s a very clear hierarchy that people work through and even within Grandmasters, there’s the elite club of over 2,800 and so forth.
In chess, AI’s are better than human chess players a long time ago. The best human players have an Elo rating of about 2850 and the best AI programs like Stockfish and AlphaGo are rated around 4,000, which means that they’re going to win 99.99% of the games against Magnus Carlsen. Magnus Carlsen isn’t going to have a chance. Do people lose interest in chess because there’s this dominant AI force? The interesting thing is they seem even more engaged by it. They use AIs to train themselves. If you were to look at the top players now today who are trained against AIs, compared to the top players 40, 50 years ago, the top players are better.
Rory MacFarquhar:
One of the major concerns about prediction markets is insider trading. Do you know whether your super forecasters are participating in these markets so that they are able to not just get status but cash in on their ability to be better at making these predictions
Phil Tetlock:
I haven’t had a lot to do with prediction markets until recently when I became affiliated with Forecast X. They have a concern about what’s happening with the prediction market, Kalshi in particular, which they think has gone crazy with sports betting and they’re expecting a severe regulatory backlash. You can create a prediction market on whether various legal cases against Kalshi or Polymarket are going to be successful? The Supreme Court is going to hear some of the state cases. The Trump administration is blocking everything at the federal level right now, but there’s a huge pent up demand for suing Kalshi and Polymarket at the state level and there’s a pretty good chance that the Supreme Court will uphold some of those state cases, which could be pretty serious.
Prediction markets are very useful public functions. The prices are very valuable signals, but on NFL games, I’m not so sure that a valuable social good is being served there.
David Stellings:
I’m Co-Lead counsel in the nationwide class action case against Kalshi pending in New York. We have the attorney generals attacking Kalshi for participating in sports betting because it endangers their tax dollars. It endangers their ability to enforce state laws that limit sports betting, for example, to people who are 21 or older. You have the federal regulatory agency, the CFTC, maintaining that all sports bets on Kalshi are swaps and therefore covered solely by the jurisdiction of the CFTC. If this administration is correct about that it could turn gambling in the United States on its head because going to a roulette table at Caesars in Las Vegas could be considered a swap under the definition that Kalshi is using and the CFTC is using for a swap.
There are about 20 different cases pending in various federal district courts around the country and 3 of them have already gone up to the court of appeals. The decisions are all over the place. This is definitely going to be decided by the Supreme Court. It’s not an issue that falls along traditional political lines the way that people can predict or often predict successfully what the Supreme Court’s going to decide. The attorneys general who are pursuing these cases are across the board on the political spectrum.
Phil Tetlock:
That’s fascinating. I thought it was mostly the left coming after the prediction markets.
You’re causing me to update my beliefs now. The likelihood of the Supreme Court upholding at least some of these state cases increases quite a bit if there’s support at the state level.
David Stellings:
Polymarket does not operate fully in the US on the sports betting front, and they’ve done that very intentionally because they see the legal flack that Kalshi is under. As of now, even though Polymarket is technically allowed to offer sports betting in the United States for the last several months, the only people who are allowed to participate in that are beta testers. And it’s a very tiny number of people. Other sports betting apps are being sued. We’ve sued a bunch of them; they just don’t get as much media.
Larry Bernstein:
I had a podcast with Peter Grace recently and the topic was the intellectual foundations of the CIA. Sherman Kent was a Yale history professor who built the research and analysis team at the CIA. I read Sherman Kent’s book called Strategies of Intelligence, where he laid out his process for CIA analysts to incorporate probabilities in evaluating risks. In preparation for today’s talk, I reread Superforecasting and it turned out that Sherman Kent was an important figure in Phil Tetlock’s book.
Phil, tell us about Sherman Kent, why he’s important to the development of the CIA.
Phil Tetlock:
He’s a fascinating character. He was way out front about urging his colleagues to quantify probabilities. He was very concerned about miscommunication. He was concerned that estimates for the Soviet invasion of Yugoslavia. He was worried that President Kennedy was misled during the Bay of Pigs because someone said there was a reasonable chance of success. The people who said reasonable chance meant one in three, Kennedy thought it was much higher. So, you have potential for miscommunication. He thought analysts should get in the habit of making explicit judgments and scoring their accuracy. He got a lot of push back from his more qualitative colleagues, very similar to the pushback that we got 30 years later in the IARPA tournament.
It’s a very same divide between quantitative and anti-quantitative people inside the CIA. And one person said, “Take it to Sherman Kent, you’re trying to turn me into a goddamn bookie.” Prediction markets. And he said, “I’d rather be a bookie than a Poet.”
Colin Teichholtz:
You tell the story of Obama and Osama Bin Laden and getting advice from his advisors around the table and that the reasonable aggregation of those advisors was 70% probability of success and yet Obama came away saying it was 50/50 and he still decided to go and do it. What can forecasters do to better convey this information to decision makers?
Phil Tetlock:
I think Obama may have been reluctant to take the 70% aggregate and pushed it down to 50 because he didn’t want to be overconfident. Trump probably would have made the reverse call. It’s a matter of what your political values are. The forecasting process itself is supposed to be value-neutral. Ideally, you want the president to have the best estimate of the odds, whether they’re 50 / 50 or 70 / 30 or whatever they may be. You want the president to have the best estimate of the odds and then the president can plug in his own utility function. It might be an Obama utility function. It might be a Trump utility function, but that’s the division of labor. There are the technocrats and then they’re the policymakers who plug their values in.
Colin Teichholtz:
Could there also be a division of labor between the people who tend to be experts in a field and spend their lives learning lots of information versus somebody else who hasn’t committed their life to studying that particular field but is very good at consuming that information and then making predictions from it.
Phil Tetlock:
We have a lot of experts on AI, we have a lot of economists, we have a lot of biosecurity experts, we have a ton of experts now and we’re able to compare their judgments to super forecasters across many, many issues and the pattern is the experts almost always have higher estimates of risk than the super forecasters do. I’m not saying they’re always wrong, but I’m saying if you spend your life studying something, you’re probably going to think it’s pretty important, correctly or incorrectly.
My sense is super forecasters are quick to become skeptical of a source. It’s easy to lose the trust of a super forecaster. You want to use an aggregate of super forecasters. The best super forecasters in my experience use the statistical base rate evidence and they look at case specific information.
You’re at a wedding and you say to the person next to you, “What do you think the likelihood of the couple staying married is? “ Most people see the couple and they look happy, but most people manage to look happy at their wedding. The accounting people are going to say, “the divorce rate for this sociodemographic category is 33% chance of that marriage ending in the next six years.” That doesn’t mean that you ignore everything about the couple. That’s not the end of the process. You use case specific evidence to update against the base rate
Jay Greene:
Are there super forecasters who are more effective given their identity at forecasting things that they have an affinity for because of their identity?
Phil Tetlock:
I think the performance engine underlying super forecasting is very fundamentally human. Human beings have a hard time with statistical reasoning. The super forecasters are unusually good at it for humans. AIs are way better, but the super forecasters are pretty good. And the super forecasters have good causal intuitions about human affairs. The AIs don’t have good causal intuitions. I don’t if they know if AIs even understand what causality is. At least right now, that’s not how they think. We typically don’t use individual superforecaster judgments. We almost always use aggregates. The aggregate of super forecasters is better than 90% of the individual super forecasters. The aggregates work better when they’re cognitively diverse. You can put more confidence in a prediction when people who normally disagree suddenly agree. The Obama-Osama Bin Laden is a good example. You have people inside the CIA offering estimates of whether Obama is in that compound in Abbottabad.
If you have people who have human intelligence versus people using cyber code breaking versus satellite pictures and they are coming from totally different angles and they haven’t talked to each other and they’re reaching a very similar conclusion, you have much more to make an extreme prediction. If the average is 70%, but people who normally disagree are agreeing, the IARPA extremizing algorithms would tell you to turn that 70% into 90% that power of viewpoint diversity at a group level not at an individual level. People who can take many perspectives in their head do a weighted average that puts you at an advantage too.
It’s as if you’re doing a crowd aggregation. If you’re a good perspective taker that’s one of the defining qualities of the super forecasters.
Larry Bernstein:
Thanks to Phil for joining us.
If you missed our previous podcast, it was Objects of Desire.
Our speaker was Karl Ulrich who is a Professor at Penn specializing in Industrial Design and has written a book entitled Product Design and Development.
Karl spoke about inventing a new ice cream scooper that is beautiful, sexy, and more useful.
You can find our previous episodes and transcripts on our website. Please follow us on Apple Podcasts or Spotify.
I am Larry Bernstein with the podcast What Happens Next.
Check out our previous episode, Objects of Desire, here.



