Some Notes on the Problems with Social Sciences
Taking some notes on the problems with the social sciences...
What's Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers
Over the past year, I have skimmed through 2578 social science papers, spending about 2.5 minutes on each one. This was due to my participation in Replication Markets, a part of DARPA's SCORE program, whose goal is to evaluate the reliability of social science research.
The studies were sourced from all social science disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication crisis era).
But actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers.
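To make the p=0.049 point concrete, here's a minimal sketch of a caliper-style check (the p-values are invented for illustration): if reported p-values just below .05 vastly outnumber those just above it, that asymmetry is a hint of selective reporting.

```python
from scipy.stats import binomtest

# Invented p-values pulled from a hypothetical batch of papers
reported_p = [0.049, 0.048, 0.047, 0.051, 0.046, 0.049, 0.044, 0.053, 0.048, 0.041]

below = sum(0.04 <= p < 0.05 for p in reported_p)   # just below the threshold
above = sum(0.05 <= p < 0.06 for p in reported_p)   # just above the threshold

# With no selective reporting, a p-value landing in [.04, .06) should be
# (roughly) equally likely to fall on either side of .05.
result = binomtest(below, below + above, p=0.5, alternative="greater")
print(below, above, round(result.pvalue, 3))
```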
You might hypothesize that non-replicating papers attract mostly negative citations, but negative citations are extremely rare. One study puts the rate at 2.4%. Astonishingly, even after retraction, the vast majority of citations are positive, and those positive citations continue for decades after the retraction.
As in all affairs of man, it once again comes down to Hanlon's Razor. Either:
- Malice: they know which results are likely false but cite them anyway.
- or, Stupidity: they can't tell which papers will replicate even though it's quite easy.
This is bad not only for the direct effect of basing further research on false results, but also because it distorts the incentives scientists face. If nobody cited weak studies, we wouldn't have so many of them. Rewarding impact without regard for the truth inevitably leads to disaster.
As with citations, journal status is a poor proxy for quality: there is no association between statistical power and impact factor, and journals with higher impact factors have more papers with erroneous p-values.
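As an aside on what "erroneous p-values" means in practice: a reported p-value can be recomputed from the test statistic the paper itself reports. A minimal sketch, in the spirit of tools like statcheck (the numbers below are made up):

```python
from scipy import stats

def check_reported_p(t_stat, df, reported_p, tol=0.005):
    """Return True if the reported two-sided p-value matches the recomputed one."""
    recomputed_p = 2 * stats.t.sf(abs(t_stat), df)
    return abs(recomputed_p - reported_p) <= tol

# A paper reporting "t(48) = 2.01, p = .04": the recomputed p is about .050,
# so the reported value does not check out.
print(check_reported_p(t_stat=2.01, df=48, reported_p=0.04))  # False
```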
Even the crème de la crème of economics journals barely manage a ⅔ expected replication rate. One in five articles in the QJE has a predicted replication probability below 50%, and this is a journal that accepts just 1 out of every 30 submissions. Perhaps this (partially) explains why scientists are undiscerning: journal reputation acts as a cloak for bad research.
As far as I can tell, for most journals the question of whether the results in a paper are true is a matter of secondary importance. If we model journals as wanting to maximize "impact", then this is hardly surprising: as we saw above, citation counts are unrelated to truth.
Just Because a Paper Replicates Doesn't Mean It's Good
While blatant activism is rare, there is a more subtle background ideological influence which affects the assumptions scientists make, the types of questions they ask, and how they go about testing them. It's difficult to say how things would be different under the counterfactual of a more politically balanced professoriate, though.
The replication crisis did not begin in 2010, it began in the 1950s. All the things I've written above have been written before, by respected and influential scientists. They made no difference whatsoever. Let's take a stroll through the museum of metascience.
The reason nothing has been done since the 50s, despite everyone knowing about the problems, is simple: bad incentives. The best cases for government intervention are collective action problems: situations where the incentives for each actor cause suboptimal outcomes for the group as a whole, and it's difficult to coordinate bottom-up solutions. In this case the negative effects are not confined to academia, but overflow to society as a whole when these false results are used to inform business and policy.
Nobody actually benefits from the present state of affairs, but you can't ask isolated individuals to sacrifice their careers for the "greater good": the only viable solutions are top-down, which means either the granting agencies or Congress.
Are Replication Rates the Same Across Academic Fields? Community Forecasts from the DARPA SCORE Programme
The Defense Advanced Research Projects Agency (DARPA) programme 'Systematizing Confidence in Open Research and Evidence' (SCORE) aims to generate confidence scores for a large number of research claims from empirical studies in the social and behavioural sciences. The confidence scores will provide a quantitative assessment of how likely it is that a claim will hold up in an independent replication.
Replication has long been established as a key practice in scientific research [1,2]. It plays a critical role in controlling the impact of sampling error, questionable research practices, publication bias and fraud. An increase in the effort to replicate studies has been argued to help establish credibility within a field [3]. Moreover, replications allow us to test whether results generalize to a different or larger population and help to verify the underlying theory [2,4,5] and its scope.
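For a sense of how such forecasts get evaluated once replications come in, here's a minimal sketch using the Brier score, one standard way to score probabilistic predictions (the forecasts and outcomes are invented; this is not a description of SCORE's actual scoring):

```python
# Invented forecasts (probability each claim replicates) and outcomes (1 = replicated)
forecasts = [0.85, 0.30, 0.60, 0.45]
outcomes = [1, 0, 1, 0]

# Brier score: mean squared difference between forecast and outcome.
# Lower is better; 0 is perfect, 0.25 is what you'd get by always saying 50%.
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(f"Brier score: {brier:.3f}")
```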
Systematizing Confidence in Open Research and Evidence (SCORE)
Research progress could be accelerated if there were rapid, scalable, accurate credibility indicators to guide attention and resource allocation for further assessment. The SCORE program is creating and validating algorithms to provide confidence scores for research claims at scale.
A primary activity of science is evaluating the credibility of claims: assertions reported as findings from the evaluation of evidence. Researchers create evidence and make claims about what that evidence means. Others assess those claims to determine their credibility, including their reliability, validity, generalizability, and applicability.
Assessing confidence in research claims is important and resource-intensive
The “Systematizing Confidence in Open Research and Evidence” (SCORE) program has an aspirational objective to develop and validate methods to assess the credibility of research claims at scale with much greater speed and much lower cost than is possible at present.