What’s Wrong with Social Science and How to Fix It: Reflections After 2578 Papers


Starting in August 2019 I took part in the Replication Markets project, a part of DARPA’s SCORE program whose goal is to predict which social science papers will successfully replicate. I have previously written about my views on the replication crisis after reading 2500+ papers; in this post I will explain the details of forecasting, trading, and optimizing my strategy within the rules of the game.

3000 papers were split up into 10 rounds of ~300 papers each. Every round began with one week of surveys, followed by two weeks of market trading, and then a one-week break. The studies were sourced from all social science disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication-crisis era).

Only a subset of the papers will be replicated: ~100 papers were selected for a full replication, and another ~150 for a “data replication” in which the same methodology is applied to a different (but pre-existing) dataset.1 Out of the target 250 replications, only about 100 were completed by the time the prizes were paid out.


The surveys included a link to the paper, a brief summary of the claim selected for replication, the methodology, and a few statistical values (sample size, effect size, test statistic values, p-value). We then had to answer three questions:

  1. What is the probability of the paper replicating?
  2. What proportion of other forecasters do you think will answer >50% to the first question?
  3. How plausible is the claim in general?

The papers were split up into batches of 10, and the top 4 scorers in each batch won awards of $80, $40, $20, and $20 for a total of $4,800 per survey round.

The exact scoring method was not revealed in order to prevent gaming the system, but after the competition ended the organizers wrote a technical blog post explaining the “surrogate scoring rule” approach. Since the replications were not completed yet, scoring predictions had to be done without reference to the “ground truth”; instead they generated a “surrogate outcome” based on all the survey answers and used that to score the predictions.2


Every user started each round with 1 point per claim (so typically 300).3 These points were the currency used to buy “shares” for every claim. Long share positions pay out if the paper replicates successfully and short positions pay out if it does not. Like a normal stock market, if you bought shares at a low price and the price went up, you could sell those shares for a profit.

The starting price of each claim was based on its p-value:

  • p<.05: 30%
  • p<.01: 40%
  • p<.001: 80%

The market did not operate like a typical stock market (ie a continuous double auction); instead, they used Robin Hanson’s Logarithmic Market Scoring Rule which allows users to trade without a counterparty.4 Effectively it works as an automated market maker, making it costlier to trade the more extreme the price: taking a claim from 50% to 51% was cheap, while taking it from 98% to 99% was very expensive. Without any order book depth, prices could be rather volatile as it didn’t take much for a single person to significantly shift the price on a claim; this also created profitable trading opportunities.
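To make the "costlier at the extremes" point concrete, here is a minimal sketch of the standard two-outcome LMSR cost function. The liquidity parameter `B` is a made-up value (the post does not say what Replication Markets used), and the function names are illustrative only.

```python
import math

def lmsr_cost(q_yes: float, q_no: float, b: float) -> float:
    """LMSR cost function C(q) = b * ln(exp(q_yes/b) + exp(q_no/b))."""
    m = max(q_yes, q_no)  # subtract the max for numerical stability
    return m + b * math.log(math.exp((q_yes - m) / b) + math.exp((q_no - m) / b))

def lmsr_price(q_yes: float, q_no: float, b: float) -> float:
    """Instantaneous price of a long share, i.e. the implied probability."""
    return 1.0 / (1.0 + math.exp((q_no - q_yes) / b))

def cost_to_raise_price(p_from: float, p_to: float, b: float) -> float:
    """Points needed to push the price from p_from up to p_to by buying long shares."""
    # With q_no fixed at 0, a price p corresponds to q_yes = b * logit(p).
    q_from = b * math.log(p_from / (1.0 - p_from))
    q_to = b * math.log(p_to / (1.0 - p_to))
    return lmsr_cost(q_to, 0.0, b) - lmsr_cost(q_from, 0.0, b)

B = 100.0  # hypothetical liquidity parameter, not the value used by RM
cheap = cost_to_raise_price(0.50, 0.51, B)
expensive = cost_to_raise_price(0.98, 0.99, B)
print(f"50% -> 51%: {cheap:.2f} points")
print(f"98% -> 99%: {expensive:.2f} points")
```

Under these assumptions the second move costs about 34 times as much as the first, and that ratio does not depend on the choice of `B`.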

The payout for the markets was about $14k per round, awarded in proportion to winning shares in the papers selected for replication. Given the target of 250 replications, that means about 8% of the claims would actually resolve. The small number of actually completed replications, however, caused some issues: round 9, for example, only had 2 (out of the target 25) replications actually pay out.

I didn’t take the first round very seriously, and I had a horrible flu during the second round, so I only really started playing in round 3. I remembered Tetlock writing that “it is impossible to find any domain in which humans clearly outperformed crude extrapolation algorithms, less still sophisticated statistical ones”, so I decided to start with a statistical model to help me out.

This felt like a perfect occasion for a centaur approach (combining human judgment with a model), as there was plenty of quantitative data, but also lots of qualitative factors that are hard to model. For example, some papers with high p-values were nevertheless obviously going to replicate, due to how plausible the hypothesis was a priori.5

Luckily someone had already collected the relevant data and built a model.6 Altmejd et al. (2019) combine results from four different replication projects covering 131 replications (which they helpfully posted on OSF). Here are the features they used ranked by importance:

Their approach was fairly complex, however, and I wanted something simpler. On top of that I wanted to limit the number of variables I would have to collect for every paper, as I had to do 300 of them in a week—any factors that would be cumbersome to look up (eg the job title of each author) were discarded. I also transformed a bunch of the variables, for example replacing raw citation counts with log citations per year.

I ended up going with a logistic ridge regression (shrinkage tends to help with out-of-sample predictions). The Altmejd sample was limited in terms of the fields covered (they only had social/cognitive/econ), so I just pulled some parameter values out of my ass for the other fields—in retrospect they were not very good guesses.7




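library(glmnet)
# Cross-validated logistic ridge regression (alpha = 0 selects the ridge penalty)
cv.ridge <- cv.glmnet(as.matrix(mydata), y_class, alpha = 0, family = "binomial")

# Coefficients at the lambda that minimizes the cross-validated error
coef(cv.ridge, s = cv.ridge$lambda.min)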

Parameter                         Value
intercept                          0.40
log # of pages                    -0.26
p value                          -25.07
log # of authors                  -0.67
% male authors                     0.90
dummy for interaction effects     -0.77
log citations per year             0.37
discipline: economics              0.27
discipline: social psychology     -0.77
discipline: education             -0.40
discipline: political science      0.10
discipline: sociology             -0.40
discipline: marketing              0.10
discipline: orgbeh                 0.1
discipline: criminology           -0.2
discipline: other psychology      -0.2
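As a rough illustration of how these coefficients turn into a forecast, the sketch below plugs them into the logistic function. Several things here are assumptions rather than facts from the post: natural logarithms, "% male authors" entered as a fraction between 0 and 1, a strictly positive citation rate, and a discipline effect of zero for anything not listed. The function and key names are hypothetical.

```python
import math

# Coefficients copied from the table above.
COEF = {
    "intercept": 0.40,
    "log_pages": -0.26,
    "p_value": -25.07,
    "log_authors": -0.67,
    "frac_male": 0.90,
    "interaction": -0.77,
    "log_cites_per_year": 0.37,
}
DISCIPLINE = {
    "economics": 0.27, "social psychology": -0.77, "education": -0.40,
    "political science": 0.10, "sociology": -0.40, "marketing": 0.10,
    "orgbeh": 0.1, "criminology": -0.2, "other psychology": -0.2,
}

def replication_probability(pages, p_value, authors, frac_male,
                            interaction, cites_per_year, discipline):
    """Logistic prediction; assumes natural logs and cites_per_year > 0."""
    z = (COEF["intercept"]
         + COEF["log_pages"] * math.log(pages)
         + COEF["p_value"] * p_value
         + COEF["log_authors"] * math.log(authors)
         + COEF["frac_male"] * frac_male
         + COEF["interaction"] * (1.0 if interaction else 0.0)
         + COEF["log_cites_per_year"] * math.log(cites_per_year)
         + DISCIPLINE.get(discipline, 0.0))  # unlisted disciplines: assumed 0
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical paper: 20 pages, 2 authors, half male, 5 citations per year.
strong = replication_probability(20, 0.001, 2, 0.5, False, 5, "economics")
weak = replication_probability(20, 0.040, 2, 0.5, False, 5, "economics")
print(f"p = .001: {strong:.2f}   p = .04: {weak:.2f}")
```

The only thing that changes between the two calls is the p-value, which shows how heavily the p-value term drives the prediction.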

This model was then implemented in a spreadsheet, so all I had to do was enter the data, and the prediction popped up:

[Spreadsheet formula fragment: if(D18="Other Psychology",1,0)*otherpsych+]


While my model had significant coefficients on # of authors, ratio male, and # of pages, these variables were not predictive of market prices in RM. Even the relation of citations to market prices was very weak. I think the market simply ignored any data it was not given directly, even if it was important. This gave me a bit of an edge, but also made evaluating the performance of the model more difficult as the market was systematically wrong in some ways.

Collecting the additional data needed for the model was fairly cumbersome: completing the surveys took ~140 seconds per paper when I was just doing it in my head, and ~210 seconds with the extra work of data entry. It also made the process significantly more boring.

I will give a quick overview of the forecasting approach here; a full analysis will come in a future post, including a great new dataset I’m preparing that covers the methodology of replicated papers.

At the broadest level it comes down to: the prior, the probability of a false negative, and the probability of a false positive.8 One must consider these factors for both the original and the replication.9

What does that look like in practice? I started by reading the summary of the study on the RM website (which included the abstract, a description of the selected claim, sample size, p-value, and effect size). After that I skimmed the paper itself. If I didn’t understand the methodology I checked the methods and/or conclusions, but the vast majority of papers were just straight regressions, ANOVAs, or SEMs. The most important information was almost always in the table with the main statistical results.

The factors I took into account, in rough order of importance:

  • p-value. Try to find the actual p-value, they are often not reported. Many papers will just give stars for <.05 and <.01, but sometimes <.01 means 0.0000001! There's a shocking number of papers that only report coefficients and asterisks—no SEs, no CIs, no t-stats.
  • Power. Ideally you’ll do a proper power analysis, but I just eyeballed it.
  • Plausibility. This is the most subjective part of the judgment and it can make an enormous difference. Some broad guidelines:
    • People respond to incentives.
    • Good things tend to be correlated with good things and negatively correlated with bad things.
    • Subtle interventions do not have huge effects.
  • Pre-registration. Huge plus. Ideally you want to check if the plan was actually followed.
  • Interaction effect. They tend to be especially underpowered.
  • Other research on the same/similar questions, tests, scales, methodologies—this can be difficult for non-specialists, but the track record of a theory or methodology is important. Beware publication bias.
  • Methodology – RCT/RDD/DID good. IV depends, many are crap. Various natural-/quasi-experiments: some good, some bad (often hard to replicate). Lab experiments, neutral. Approaches that don’t deal with causal identification depend heavily on prior plausibility.
  • Robustness checks: how does the claim hold up across specifications, samples, experiments, etc.
  • Signs of a fishing expedition/researcher degrees of freedom. If you see a gazillion potential outcome variables and that they picked the one that happened to have p<0.05, that's what we in the business call a "red flag". Look out for stuff like ad hoc quadratic terms.
  • Suspiciously transformed variables. Continuous variables put into arbitrary bins are a classic p-hacking technique.
  • General propensity for error/inconsistency in measurements. Fluffy variables or experiments involving wrangling 9 month old babies, for example.

Things that don’t matter for replication but matter very much in the real world:

  • Causal identification! The plausibility of a paper’s causal identification strategy is generally orthogonal to its chances of replicating.
  • Generalizability. Lab experiments are replicated in other labs.

Some papers were completely outside my understanding, and I didn’t spend any time trying to understand them. Jargon-heavy cognitive science papers often fell into this category. I just gave a forecast close to the default and marked them as “low confidence” in my notes, then avoided trading them during the market round. On the other hand, sometimes I got the feeling that the jargon was just there to cover up bullshit (leadership studies, I’m looking at you) in which case I docked points for stuff I didn’t understand. The epistemological problem of how to determine which jargon is legit or not is left as an exercise to the reader.

For example

The data from Replication Markets are still embargoed, so I can’t give you any real examples. Instead, I have selected a couple of papers that were not part of the project but are similar enough.

Ex. 1: Criminology

My first example is a criminology paper which purports to investigate the effect of parenting styles on criminal offending. Despite using causal language throughout, the paper has no causal identification strategy whatsoever. If criminologists had better GRE scores this nonsense would never have been published. The most relevant bits of the abstract:

The present study used path analyses and prospective, longitudinal data from a sample of 318 African American men to examine the effects of eight parenting styles on adult crime. Furthermore, we investigated the extent to which significant parenting effects are mediated by criminogenic schemas, negative emotions, peer affiliations, adult transitions, and involvement with the criminal justice system. Consonant with the study hypotheses, the results indicated that […] parenting styles low on demandingness but high on responsiveness or corporal punishment were associated with a robust increase in risk for adult crime.

The selected claim is the effect of abusive parenting (the “abusive” parenting style involves “high corporal punishment” but low “demandingness” and “responsiveness”) on offending; I have highlighted the outcome in the main regression table below. While the asterisks only say p<.01, the text below indicates that the p-value is actually <.001.

Make your own guess about the probability of replication and then scroll down to mine below.

I’d give this claim 78%. The results are obviously confounded, but they’re confounded in a way that is fairly intuitive, and we would expect the replication to be confounded in the exact same way. Abusive parents are clearly more likely to have kids who become criminals. Although they don’t give us the exact t-stat, the p-value is very low. On the negative side the sample size (318 people spread over 8 different parenting styles) isn’t that big, I’m a bit worried about variance in the classification of parenting styles, and there’s a chance that the (non-causal) relation between abusive parenting and offending could be lost in the controls.

This is a classic example of “just because it replicates doesn’t mean it’s good”, and also a prime example of why the entire field of criminology should be scrapped.

Ex. 2: Environmental Psychology

My second example is an “environmental psychology” paper about collective guilt and how people act in response to global warming.

The present research examines whether collective guilt for an ingroup’s collective greenhouse gas emissions mediates the effects of beliefs about the causes and effects of global warming on willingness to engage in mitigation behavior.

N=72 people responded to a survey after a manipulation, on a) the causes and b) the importance of the effects of climate change. The selected claim is that “participants in the human cause-minor effect condition reported more collective guilt than did participants in the other three conditions (b* = .50, p <.05)”. Again, make your own guess before scrolling down.

I’d go with 23% on this one. Large p-value, interaction effect, relatively small sample, and a result that does not seem all that plausible a priori. The lack of significance on the Cause/Effect parameters alone is also suspicious, as is the lack of significance on mitigation intentions. Lots of opportunities to find some significant effect here!

The worst part of Replication Markets was the user interface: it did not offer any way to keep track of one’s survey answers, so in order to effectively navigate the market rounds I had to manually keep track of all the predictions. There was also no way to track changes in the value of one’s shares, so again that had to be done manually in order to exit successful trades and find new opportunities. The initial solution was giant spreadsheets:

Since the initial prices were set depending on the claim’s p-value, I knew ahead of time which claims would be most mispriced at the start of trading (and that’s where the greatest opportunities were). So a second spreadsheet was used to track the best initial trades.11 The final column tracks how those trades worked out by the end of the market round; not all of them were successful (including some significant “overshoots”), but in general I had a good hit rate. There were far more “longs” than “shorts” at the start: these were mostly results that were highly plausible a priori but had failed to get a p-value below 0.001.

[“Final” is my estimate, “default” is the starting price, “mkt” is the final market price]
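The "best initial trades" list can be sketched in a few lines: map each claim's p-value to its default starting price, then rank claims by the gap between that default and one's own estimate. The bucket boundaries below are my reading of the three starting prices listed earlier, and the claim IDs and forecasts are invented for illustration.

```python
def starting_price(p_value: float) -> float:
    """Default opening price by p-value bucket (boundaries are an assumption)."""
    if p_value < 0.001:
        return 0.80
    if p_value < 0.01:
        return 0.40
    if p_value < 0.05:
        return 0.30
    raise ValueError("the post lists no starting price for p >= .05")

def initial_trades(forecasts):
    """forecasts: dict of claim_id -> (p_value, my_estimate).

    Returns rows (claim_id, default, estimate, side, gap), most mispriced first.
    """
    rows = []
    for claim_id, (p_value, mine) in forecasts.items():
        default = starting_price(p_value)
        gap = mine - default
        side = "long" if gap > 0 else "short"
        rows.append((claim_id, default, mine, side, abs(gap)))
    return sorted(rows, key=lambda r: r[4], reverse=True)

# Invented example forecasts, not data from the project.
toy = {"c1": (0.03, 0.70), "c2": (0.0005, 0.75), "c3": (0.004, 0.20)}
ranked = initial_trades(toy)
for row in ranked:
    print(row)
```

In this toy data the top trade is a long on a plausible claim with a p-value just under .05, which matches the pattern of longs dominating at the start.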

Finally, a third spreadsheet was used to track live trading during the market rounds. There was no clean way of getting the prices from the RM website to my sheet, so I copy/pasted everything, parsed it, and then inserted the values into the sheet. I usually did that a few times per day (more often at the start, since that was where most trading activity was concentrated). The claims were then ranked by the difference between my own estimate and the market. My current share positions were listed next to them so I knew what I needed to trade. The “Change” column listed the change in price since the last update, so I could easily spot big changes (which usually meant new trading opportunities).

[“Live” is the current market price, “My” is my estimate, “Shares” is the current position]

After the third round I took a look at the data to evaluate the model and there were two main problems:

  • My own errors (prediction minus market price) were very similar to the errors of the model:

  • The model failed badly at high-probability claims, and failed to improve overall performance. Here’s the root mean square error vs market prices, grouped by p-value:

Of course what the model was actually trying to predict was replication, not the market price. But market prices were the only guide I had to go by (we didn’t even get feedback on survey performance), and I believed the market was right and the model was wrong when it came to low-p-value claims.
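For reference, the grouped RMSE comparison mentioned above could be computed along these lines. This is only a sketch: the group labels and numbers are placeholders, not the embargoed data.

```python
import math
from collections import defaultdict

def rmse_by_group(rows):
    """rows: iterable of (group, prediction, market_price) tuples.

    Returns a dict of group -> root mean square error against the market.
    """
    squared = defaultdict(list)
    for group, prediction, market in rows:
        squared[group].append((prediction - market) ** 2)
    return {g: math.sqrt(sum(v) / len(v)) for g, v in squared.items()}

# Placeholder values for illustration only.
toy_rows = [
    ("p<.05", 0.30, 0.40),
    ("p<.05", 0.50, 0.40),
    ("p<.001", 0.60, 0.90),
]
result = rmse_by_group(toy_rows)
for group, err in result.items():
    print(f"{group}: RMSE = {err:.2f}")
```

Running the same function once with the model's predictions and once with one's own guesses gives the side-by-side comparison by p-value bucket.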

What would happen if everyone tried to optimize for predicting market prices? I imagine we could have gotten into weird feedback loops, causing serious disconnects between market prices and actual replication probability. In practice I don’t think that was an issue though.

If I had kept going with the model, I had some improvements in mind:

  • Add some sort of non-linear p-value term (or go with z-scores instead).
  • Quantify my subjective judgment of “plausibility” and add it as another variable in the model.
  • Use the round 3 market data of 300 papers (possibly with extremized prices) to estimate a new model, which would more than triple my N from the original 131 papers. But I wasn’t sure how to combine categorical data from the previous replications and probabilities from the prices in a single model.12

At this point it didn’t seem worth the effort, especially given all the extra data collection work involved. So, from round 4 onward I abandoned the model completely and relied only on my own guesses.

Two basic facts dictated the trading strategy:

  1. Only a small % of claims will actually be replicated and pay out.
  2. Most claims are approximately correctly priced.

It follows that smart traders make many trades, move the price by a small amount (the larger your trade the larger the price impact), and have a diversified portfolio. The inverse of this rule can be used to identify bad traders: anyone moving the price by a huge amount and concentrating their portfolio in a small number of bets is almost certainly a bad trader, and one can profitably fade their trades.

Another source of profitable trades was the start of the round. Many claims were highly mispriced, but making a profit depended on getting to them first, which was not always easy since everyone more or less wanted to make the same trades. Beyond that, I focused on simply allocating most of my points toward the most-mispriced claims.

I split the trading rounds into two phases:

  1. Trading based on the expected price movement.

  2. At the very end of the round, trading based on my actual estimate of replication probability.

Usually these two aspects would coincide, but there were certain types of claims that I believed were systematically mispriced by other market participants.13 Trading those in the hope of making profits during the market round didn’t work out, so I only allocated points toward them at the end.

Another factor to take into consideration was that not all claims were equally likely to be selected for replication. In some cases it was pretty obvious that a paper would be difficult or impossible to replicate directly. I was happy to trade them, but by the end of the round I excluded them from the portfolio.14

Buying the most mispriced items also means you’re stuck with a somewhat contrarian portfolio, which can be dangerous if you’re wrong. Given the flat payout structure of the market, following the herd was not necessarily a bad idea. Sometimes if a claim traded strongly against my own forecast, I would lower the weight assigned to it or even avoid it completely. Suppose you think a study has a 30% chance of replicating, and a liquid market insists it has a 70% chance—how do you revise your forecast?

After every round I generated a bunch of graphs that were designed to help me understand the market and improve my own forecasts. This was complicated by the fact that there were no replication results—all I had to go by were the market prices, and they could be misleading.

Among other things, I compared means, standard deviations, and quartiles of my own predictions vs the market; looked at my means and RMSE grouped by p-value and discipline; plotted the distribution of forecasts, and error vs market price; etc.

One standard pattern of prediction markets is that extremizing the market prediction makes it better. Simplistically, you can think of the market price being determined by informed traders and uninformed/noise traders. The latter pull the price toward the middle, so the best prediction is going to be (on average) more extreme than the market’s. This is made worse in the case of Replication Markets because of the LMSR algorithm which makes shares much more expensive the closer you get to 0 or 100%. So you can often improve on things by just extremizing the market forecast, and I always checked to see if my predictions were on the extremizing side vs the market.
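One common way to extremize a probability is to scale it in log-odds space, which is equivalent to the power transform below. This is a generic technique, not necessarily what I or anyone at RM used, and the exponent `a` is a free parameter that would have to be tuned.

```python
def extremize(p: float, a: float = 1.5) -> float:
    """Push p away from 50% by multiplying its log-odds by a (a > 1 extremizes)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p ** a / (p ** a + (1.0 - p) ** a)

for market_price in (0.30, 0.50, 0.70, 0.90):
    print(f"{market_price:.2f} -> {extremize(market_price, a=2.0):.3f}")
```

Prices at exactly 50% stay put, while everything else moves toward 0 or 100%, so checking whether a forecast is "on the extremizing side" amounts to checking whether it lies between the market price and the nearer extreme.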

Here you can see the density plots of my own vs the market forecasts, split up by p-value category. (The vertical line is the default starting price for each group.)

And here’s the same data in scatterplot form:

My predictions vs the market.

Difference between my forecasts and the market, by discipline. The market was more confident in results from economics, at least in round 3.

Over time my own predictions converged with the market. I’m not entirely sure how to interpret this trend. Perhaps I was influenced by the market and subtly changed my predictions based on what I saw. Did that make me more accurate or less? It’s unclear, and based on the limited number of actual replication results it’s impossible to tell. Another possibility is that the changing composition of forecasters over time made the market more similar to me?

I think a lot of my success was due to putting in more effort than others were willing to. And by “putting in effort” I mean automating it so I don’t have to put in any effort. In round 6 the trading API was introduced; at that point I dropped the spreadsheets and quickly threw together a desktop application (using C# & WPF) that utilized the API and included both automated and manual trading.15 Automating things also made more frequent data updates possible: instead of copy-pasting a giant webpage a few times a day, now everything updated automatically once every 15 minutes.

The main area on the left is the current state of the market and my portfolio, with papers sorted by how mispriced they are. Mkt is the current market price, My is my forecast, Position is the number of shares owned, Liq. Value is the number of points I could get by exiting this position, WF is a weight factor for the portfolio optimization, and Hist shows the price history of that claim.

On the right we have pending orders, a list of the latest orders executed on the market, plus logging on the bottom.

I used a simple weighting algorithm with a few heuristics sprinkled on top. Below you can see the settings for the weighting, plus a graph of the portfolio weights allocated by claim (the most-mispriced claims are on the left).

To start with I simply generated weights proportional to the square of the difference between the current market price and my target price (Exponent). Then,

  • multiplied that by a per-study weight factor (WF in the main screen),
  • multiplied that by ExtremeValueMultiplier for claims with extreme prices (<8% or >96%),
  • removed any claims with a difference smaller than the CutOff,
  • removed any claims with weight below MinThreshold,
  • limited the maximum weight to MaxPosition,
  • and disallowed any trading for claims that were already close to their target weight (NoWeightChangeBandwidth).
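The steps above can be sketched roughly as follows. This is a reconstruction in Python, not the original C# code: the default parameter values are invented, the order of normalization and filtering is a guess, and whether ExtremeValueMultiplier raised or lowered weights is not stated, so it is left as a plain multiplier.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    name: str
    market: float                # current market price
    target: float                # my forecast
    wf: float = 1.0              # per-study weight factor
    current_weight: float = 0.0  # share of the portfolio already allocated

def target_weights(claims, exponent=2.0, extreme_mult=0.5, cutoff=0.05,
                   min_threshold=0.01, max_position=0.10, bandwidth=0.01):
    """Portfolio weights from mispricing; all defaults are hypothetical."""
    raw = {}
    for c in claims:
        diff = abs(c.target - c.market)
        if diff < cutoff:                        # CutOff: ignore small mispricings
            continue
        w = diff ** exponent * c.wf              # Exponent and WF
        if c.market < 0.08 or c.market > 0.96:   # ExtremeValueMultiplier
            w *= extreme_mult
        raw[c.name] = w
    total = sum(raw.values())
    if total == 0:
        return {}
    weights = {}
    for c in claims:
        if c.name not in raw:
            continue
        w = raw[c.name] / total                  # normalize to portfolio shares
        if w < min_threshold:                    # MinThreshold
            continue
        w = min(w, max_position)                 # MaxPosition
        if abs(w - c.current_weight) < bandwidth:
            w = c.current_weight                 # NoWeightChangeBandwidth: no trade
        weights[c.name] = w
    return weights

claims = [Claim("A", 0.50, 0.80), Claim("B", 0.50, 0.52), Claim("C", 0.30, 0.50)]
weights = target_weights(claims, max_position=0.5)
print(weights)
```

Squaring the difference concentrates points in the most-mispriced claims, while the cap and the bandwidth keep the portfolio diversified and avoid churning positions that are already close to target.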

There was also another factor to take into consideration: the RM organizers ran some bots of their own. One simply traded randomly, while the other systematically moved prices back toward their default values. This created a predictable price pressure which had to be taken into account and potentially exploited: the DefDiffPenalizationFactor lowered the weight of claims that were expected to have adverse movements due to the bots.

Finally, turning the weighting algorithm into trades was fairly simple. Fading large price movements was automated, and I kept a certain amount of free points available so that I could take advantage of them quickly. If the free points fell below a threshold, the bot would automatically sell some shares. Most trades did not warrant a reaction, however, and I had a semi-automated system for bringing the portfolio in line with the generated weights, which involved hitting a button to generate the orders and then firing them off.

When there are a) obviously profitable trades to be made and b) multiple people competing for them, it’s very easy to get into a competitive spiral that pushes speeds down to the minimum allowed by the available technology. That’s how a replication prediction market ended up being all about shaving milliseconds off of trading algos.

By round 9 another player (named CPM) had also automated his trades, and he was faster than me, so he took all my profits by reacting to profitable opportunities before I could get my orders in. We were now locked in an HFT latency race. There was only one round left so I didn’t want to spend too much time on it, but I did a small rewrite of my trading app so it could run on Linux (thanks, .NET Core), which involved splitting it into a client (with the UI) and a server (with the trading logic), and patching in some networking so I could control it remotely.16 Then I threw it up on my VPS, which had lower ping to the RM servers.

When I first ran my autotrader, I polled the API for new trades once every 15 minutes.17 Now it was a fight for milliseconds. Unfortunately, placing the autotrader on the VPS wasn’t enough: the latency was still fairly high and CPM crushed me again, though by a smaller margin this time. Sometimes I got lucky and snagged an opportunity before he could get to it, though.

In money terms, I made $6,640 from the surveys and $4,020 from the markets for a total of $10,660 (out of a total prize pool of about $190k).

In terms of the actual replication results, the detailed outcomes are still embargoed, so we’ll have to wait until next summer (at least) to get a look at them. Some broad stats can be shared however: the market predicted a 54% chance of replication on average—and 54% of the replications succeeded (the market isn’t that good, it got lucky).

Of 107 claims that resolved, I have data on 31 which I made money on. For the rest I either had no shares, or had shares in the incorrect direction. Since I only have data on the successes, there’s no way to judge my performance right now.

Survey vs Market Payouts

The survey round payout scheme was top-heavy, and small variations in performance resulted in large differences in winnings. The market payout on the other hand was more or less communistic. Everyone gets the same number of points; and it was difficult to either gain or lose too many of them in the two weeks of trading. As a result, the final distribution of prizes is rather flat. At best a good forecaster might increase earnings by ~10% by exploiting mispricings, plus a bit more through intelligent trading. The Gini coefficient of the survey payouts was 0.76, while the Gini of the market payouts was 0.63 (this is confounded by different participation levels, but you get the point).

This was backwards. I think one of the most important aspects of “ideal” prediction markets is that informed traders can compound their winnings, while uninformed traders go broke. The market mechanism works well because the feedback loop weeds out those who are consistently wrong. This element was completely missing in the RM project. I think the market payout scheme should have been top-heavy, and should have allowed for compounding across rounds, while the survey round should have been flatter in order to incentivize broader participation.
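For readers who want to check payout inequality themselves once the data are released, here is a minimal Gini calculation. How the figures quoted above were actually computed is not stated in this post, so treat this as the textbook formula rather than the exact method; the example payouts are invented.

```python
def gini(values):
    """Gini coefficient of non-negative payouts (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive payout")
    # Rank-weighted sum over the ascending-sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

flat = gini([10, 10, 10, 10])        # everyone gets the same
top_heavy = gini([0, 0, 0, 100])     # winner takes all among four players
print(f"flat: {flat:.2f}, top-heavy: {top_heavy:.2f}")
```

A winner-takes-all split among four players gives 0.75 and an even split gives 0, which brackets the kind of values quoted for the survey and market payouts.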

If the market had kept going, my next step would have been to use other people’s trades to update my estimates. The idea was to look at their past trades to determine how good they were (based on the price movement following their trade), then use the magnitude of their trades to weigh their confidence in each trade, and finally incorporate that info in my own forecast. Overall it’s fascinating how even a relatively simple market like this has tons of little nuances, exploitable regularities, and huge potential for modeling and trading strategies of all sorts.

In the end, are subsidized markets necessary for predicting replication? Probably not. The predictions will(?) be used to train our AI replacements, and I believe SCORE’s other replication prediction project, repliCATS, successfully used (cheaper) discussion groups. It will be interesting to see how the two approaches compare. Tetlock’s research shows that working as part of a team increases the accuracy of forecasters, so it wouldn’t surprise me if repliCATS comes out ahead. A combination of teams (aided by ML) and markets would be the best, but at some point the marginal accuracy gains aren’t really worth the extra effort and money.

I strongly believe that identifying reliable research is not the main problem in social science today. The real issue is making sure unreliable research is not produced in the first place, and if it is produced, to make sure it does not receive money and citations. And for that you have to change The Incentives.

PS. Shoot me an email if you’re doing anything interesting and/or lucrative in forecasting.

PPS. CPM, rm_user, BradleyJBaker, or any other RM participant who wants to chat, hit me up!


Economic History

1. Was the Industrial Revolution The Industrial Revolution? A fascinating look at the industrial revolution in the UK, including some explanations of slow/zero growth in various periods before WWII.

From 1760 to 1800, the contribution of the steam engine was .004 percent per year to capital deepening and .005 percent to TFP growth. Not until after 1850 had the high-pressure engine become widespread and efficient enough to be deployed in factories and on rail engines that these numbers each rose to .2 percent. A century passed between James Watt’s patent—the first revolutionary “general purpose technology”—and its maximum realization in TFP growth.

Soaring population growth in the late eighteenth and early nineteenth centuries threatened the island with a Malthusian demographic catastrophe. […] Without an Industrial Revolution, Mokyr reasons that GDP per capita in Britain could have been twenty percent lower in 1830 than in 1760.

Britain became modern, and then it got rich.

2. Leo Aschenbrenner on his favorite Chad Jones papers. “Most of all, Chad’s papers showed me what beautiful economic theory looks like. Simple models that capture a few essential forces, guided by broad empirical trends. These can often reveal insights that totally non-obvious ex ante—but are strikingly intuitive and powerful once found.”

3. The Ant and the Grasshopper: Seasonality and the Invention of Agriculture A fascinating (and speculative) paper from 2019 which argues that agriculture was invented because changes in the earth’s orbit caused an increase in seasonality!


4. Evidence of Fraud in an Influential Field Experiment About Dishonesty. Fairly brazen data fabrication, though it’s still not clear whether it was Ariely or the company that was in charge of collecting the data.

5. Some evidence suggesting that the Sputnik vaccine paper used fake data. I’d note that real-world data shows the vaccine working pretty well regardless of whether there was fraud in the trial.

6. Predicting and reasoning about replicability using structured groups: predicting replicability using the IDEA protocol (‘Investigate’, ‘Discuss’, ‘Estimate’ and ‘Aggregate’) for generating and combining predictions seems to work very well, achieving 84% classification accuracy in this sample. Still waiting on the SCORE results.

7. The Effect of Replications on Citation Patterns: Evidence From a Large-Scale Reproducibility Project

successful replications led to an increase in yearly citations of around 5% and that unsuccessful replications led to a decrease in yearly citations of around 4%. For the average article in my sample, which has roughly eight citations per year, this would imply a change of ±1 citation every 2 to 3 years.

As I was saying, replications don’t really matter, so it’s better to go for forward-looking reforms instead of trying to fix the past.
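A quick sanity check of the arithmetic in that quote, as a minimal sketch: the baseline of eight citations per year and the 5% / 4% effects come from the quoted passage, while the function name is just an illustrative choice.

```python
# Back-of-the-envelope check of the quoted citation arithmetic.
# The baseline and the percentage effects are taken from the quote above;
# the helper name is hypothetical.
BASELINE_CITATIONS_PER_YEAR = 8  # "roughly eight citations per year"


def years_per_citation(relative_change: float,
                       baseline: float = BASELINE_CITATIONS_PER_YEAR) -> float:
    """Years it takes for a relative change in yearly citations to add up to one citation."""
    return 1 / (abs(relative_change) * baseline)


gain_years = years_per_citation(0.05)   # successful replication: +5% per year
loss_years = years_per_citation(-0.04)  # failed replication: -4% per year

print(f"+5%: one extra citation every {gain_years:.2f} years")
print(f"-4%: one fewer citation every {loss_years:.2f} years")
```

Both figures land inside the quoted "every 2 to 3 years" range (roughly), which is the point: the effect of a replication on citations is tiny in absolute terms.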

8. A survey on questionable research practices from the Netherlands. ~4% fabrication, ~50% frequently engage in QRPs.

9. A clever paper uses the shutdown of a journal (due to an “exogenous shock” in economese) to measure the prevalence of strategic citations. Citations drop by about 20% after discontinuation.

10. Is the Value Premium Smaller Than We Thought? A look at the various decisions that go into constructing a risk factor, and how they affect the end result. “The results suggest that the original value premium estimate is upward biased because of a chance result in the original research decisions.”

11. Text-generating models are sometimes used to plagiarize papers by back-and-forth translation, or to generate new (nonsensical) papers. This study looks for “tortured phrases” like “profound neural organization” (ie deep neural network) and “haze figuring” (ie cloud computing), and finds many published papers that appear to have been computer-generated.

12. Arcadia Science is a for-profit research institute, with a biology lab opening in Berkeley next month. “No work produced or funded by Arcadia will be published in journals.”


13. Simpson’s paradox and Israeli vaccine data. On stratification by age and calculating vaccine effectiveness.
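To see how stratification by age can change the picture, here is a minimal sketch with entirely made-up numbers (not the Israeli data): two hypothetical age groups, each with 90% effectiveness, where the pooled estimate looks far worse because the high-risk older group is much more heavily vaccinated.

```python
from dataclasses import dataclass


@dataclass
class Group:
    population: int
    cases: int

    def rate(self) -> float:
        return self.cases / self.population


def effectiveness(vaccinated: Group, unvaccinated: Group) -> float:
    """Vaccine effectiveness as 1 minus the ratio of case rates."""
    return 1 - vaccinated.rate() / unvaccinated.rate()


# HYPOTHETICAL numbers chosen only to illustrate the paradox:
# (vaccinated, unvaccinated) per age stratum.
strata = {
    "under 50": (Group(1_000_000, 2), Group(1_000_000, 20)),
    "50 and over": (Group(1_900_000, 190), Group(100_000, 100)),
}

by_stratum = {age: effectiveness(v, u) for age, (v, u) in strata.items()}

# Pool everyone together, ignoring age.
pooled_vax = Group(sum(v.population for v, _ in strata.values()),
                   sum(v.cases for v, _ in strata.values()))
pooled_unvax = Group(sum(u.population for _, u in strata.values()),
                     sum(u.cases for _, u in strata.values()))
pooled = effectiveness(pooled_vax, pooled_unvax)

for age, ve in by_stratum.items():
    print(f"{age}: {ve:.0%}")
print(f"pooled (no age adjustment): {pooled:.0%}")
```

With these invented inputs the vaccine is equally effective in both strata, yet the unadjusted figure is less than half of that, purely because of who got vaccinated.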

14. Tamiflu for covid? Looks pretty good, ~50% decrease in risk of hospitalization. Costs $700 though.


15. Alignment Problems With Current Forecasting Platforms. A look at some issues with GJO/CSET/Metaculus. It’s not easy to incentivize people to provide their true forecasts at all times, share information, etc.

16. Facebook’s new forecasting platform lasted about a month.

17. Hedgehog, blockchain prediction market from “Futarchy Research Limited”.

Book Reviews

18. Razib Khan has a relatively positive review of Harden’s The Genetic Lottery, but the Steve Sailer review is a lot more entertaining. It’s amusing that the BBEG for these people is still Charles Murray rather than, say, David Reich who has said much worse things.

The Rest

19. George Church is bringing back the woolly mammoth.

20. ADS: Become a Billionaire.

Surveying the top Y Combinator companies, I find that around the top 50 are valued at over $1,000,000,000. They won’t all exit successfully, and the founders won’t all own enough equity to emerge with tres commas to their net worth, but this already gets us to a much more practical and optimistic heuristic to life:

  1. Try very hard to get into YC
  2. Conditional on acceptance, try very hard to become a billionaire

The odds really aren’t that bad. Also from ADS, Does Moral Philosophy Drive Moral Progress?

21. You’ve probably already seen SMTM’s fantastic series on the causes behind the rise in obesity. Some interesting pushback from RCA and a literal banana.

22. Felix Stocker on Will MacAskill’s longtermist plans: Reflecting on the Long Reflection. “I’m struggling to see the Long Reflection as anything other than impossible and pointless: impossible in that we cannot solve all x-risks before any s-risks, or avoid race dynamics; pointless in that I don’t believe that there is a great Answer for it to discover.”

23. Alexey Guzey on Bloom et al’s Are Ideas Getting Harder to Find? The paper has a bunch of problems, but the more general section on TFP is the most interesting:

France’s TFP in 2001 was higher than in 2019. Italy’s TFP in 1970 was higher than in 2019. Japan’s TFP in 1990 was higher than in 2009. Spain’s TFP in 1984 was higher than in 2019. Sweden’s TFP in 1973 was higher than in 1993. Switzerland’s TFP in 1974 was higher than in 1996. United Kingdom’s TFP in 2003 was higher than in 2019.

24. ACX on the FDA: Adumbrations Of Aducanumab. The Moldbuggian aspects of this are still underappreciated. Bureaucracy and bureaucrats are isolated from the consequences of their actions; the idea of equality before the law is a complete joke in the modern regulatory state, and the incentive vectors point in exactly the wrong direction. Scott ultimately blames it on the incentives of the politicians—the people seem to accept infinite costs to prevent certain bad things from happening; but if we take the people as a given, isn’t ultimately the system of governance at fault? Plus ACX on missing school: Kids Can Recover From Missing Even Quite A Lot Of School.

25. Herding, Warfare, and a Culture of Honor: Global Evidence. “The culture of pre-industrial societies that relied on animal herding emphasizes violence, punishment, and revenge-taking”. Highly speculative (the approach of extracting culture of honor from folklore seems doubtful for various reasons) and those scatter plots are not entirely convincing, but also intuitively appealing.

26. Exploiting an exogenous shock in birth control prices, The Children of the Missed Pill looks at the causal impact of the pill: “As children reached school age, we find lower school enrollment rates and higher participation in special education programs.” The eugenic effect of abortion/contraception is both underrated and understudied.

27. A primer on olivine weathering as a cheap method of carbon capture; looks like it could sequester a tonne of CO2 for less than $20. Geoengineering is very cheap compared to most proposed “green” solutions. The OECD has 120 euros per tonne as its “central estimate” of carbon costs in 2030, implying an extremely high ROI for geoengineering.

28. DeepMind: Generally capable agents emerge from open-ended play. “We find the agent exhibits general, heuristic behaviours such as experimentation, behaviours that are widely applicable to many tasks rather than specialised to an individual task. This new approach marks an important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments.”

29. Unintentionally hilarious paper about AI spotting race in chest x-rays: “Our findings that AI can trivially predict self-reported race – even from corrupted, cropped, and noised medical images – in a setting where clinical experts cannot, creates an enormous risk for all model deployments in medical imaging: if an AI model secretly used its knowledge of self-reported race to misclassify all Black patients, radiologists would not be able to tell using the same data the model has access to.”

30. “Pain Reprocessing Therapy” “centered on changing patients’ beliefs about the causes and threat value of pain” more effective than usual care for back pain, at least if you think you can trust people’s responses in surveys.

31. Yet another piece of evidence against the efficacy of advertising: TV Advertising Effectiveness and Profitability: Generalizable Results From 288 Brands. “…negative ROIs at the margin for more than 80% of brands, implying over-investment in advertising by most firms. Further, the overall ROI of the observed advertising schedule is only positive for one third of all brands.”

32. Brain surgery causes man to need 3 hours less sleep per day.

33. Matt Lakeman travels to Peru and Panama.

34. Poemage is a visualization system for exploring the sonic topology of a poem.


35. An animated explainer of Robin Hanson’s grabby aliens model: Humanity was born way ahead of its time. The reason is grabby aliens.

36. Did you know that Milla Jovovich released an album in 1994 and it’s…not bad at all? Sounds like Kate Bush and Peter Gabriel. Check out Clocks. [NSFW cover art]

37. Plus some great krautrock: Et Cetera – Kabul.

What I’ve Been Reading


  • The History of the Peloponnesian War, by Thucydides. Re-read. What was that Coleridge quip? “All men are born Herodotians or Thucydideans”? Something like that. Anyway, I was definitely born a Herodotian. Thucydides is a historian with the soul of an accountant. Still, there are things to appreciate in that attitude: while most ancient historians never saw an army smaller than 400,000, he’s happy to tell you about engagements with 60 hoplites and 20 archers. And keeping track of a myriad engagements, covering Asia Minor, Greece, and Italy, over the span of multiple decades is extremely impressive.

    How prescriptive is Thuc’s realpolitik? I’m not entirely sure, it certainly didn’t do the Athenians any good. He’s obviously a skeptic when it comes to the supernatural, and there’s very little room for morality in his history; is this an artifact of the lack of morality in the way the Athenians went about their affairs, or is this something Thuc projects onto them? One interesting point is that his story draws on the structure of tragedy: the hubris of the Sicilian expedition is ultimately punished; the players seem to lack any ability to change course. Perhaps morality plays no role in this history because Thuc views the path taken by each polis as deterministic. (This applies both to the “Thucydides trap” specifically, and also more generally).

    On the question of direct democracy as a system of government things are a bit clearer as Thuc doesn’t hide his views. He’s short on alternatives though; the traditional polis obviously can’t cope with the environment of the 4th century, but Thuc can’t really see beyond it.

    There are apparently some people who think Thucydides influenced the Neoconservatives, and I find that utterly absurd. Thuc is extraordinarily cynical when it comes to “spreading freedom”-style justifications for war, and if there’s any realpolitik involved in spending trillions so that Afghan women can get gender studies degrees for 20 years before the Taliban come back, I’m not seeing it.

    One of the things that stand out is how bad the Greeks are at war. Reading Thuc, you’re constantly thinking “well of course these guys got rolled by the Romans”. How did they beat the Persians so hard? Sieges seem to be a sticking point (something Philip II turned out to be quite good at), so perhaps the open battles against the Persians played into their hands, or perhaps it was simply a matter of mismatched unit compositions. On the other hand the Athenians were extraordinarily persistent; even after the plagu


