Macroscience
The Macroscience Podcast
Metascience 101 - EP5: "How and Why to Run an Experiment"

IN THIS EPISODE: Professor Heidi Williams, Professor Paul Niehaus, Emily Oehlsen, and Jim Savage dive into a practical “how-to” for experimentation and evaluation in metascience. They discuss how to keep metascience experimentation and evaluation relevant to policymakers.

“Metascience 101” is a nine-episode set of interviews that doubles as a crash course in the debates, issues, and ideas driving the modern metascience movement. We investigate why building a genuine “science of science” matters, and how research in metascience is translating into real-world policy changes. 


Episode Transcript

(Note: Episode transcripts have been lightly edited for clarity)

Caleb Watney: Welcome to this episode of the Metascience 101 podcast series. Professor Heidi Williams, Professor Paul Niehaus, Emily Oehlsen, and Jim Savage discuss “How and Why to Run an Experiment.” 

Emily is the Managing Director of the Global Health portfolio at Open Philanthropy. Jim Savage talks about his experience as the Director of Data Science at Schmidt Futures, another science funder.

Together, we’ll zoom in for a practical “how-to” on experimentation and evaluation in metascience with a special eye for relevance for policymakers.

Heidi Williams: We wanted to do an episode talking about how to do research on how we do research. Research can mean a lot of different things to a lot of people: qualitative research and interviews, novel sources of data collection, trying to understand something that's happened in the world and going back to evaluate what we learned from it, or prospectively designing a randomized experiment. 

I want to emphasize that by research, I don't mean just narrowly research papers that are primarily intended to be published in prestigious journals. What we are talking about today is research in a very broad sense: a way of learning about what's working that can inform making things better.

In particular, today we'll talk about cases of research done within organizations to improve how they were accomplishing their goals – how organizations use research to try to better accomplish their goals. Sometimes this research results in traditional, published academic papers. But to make clear at the beginning, the intention of our conversation is that we're not talking about research only for the basis of publication, but rather trying to accomplish some other goal in the world.

When thinking about the economy, we do that in a lot of different settings. When we have a new candidate drug compound and we want to know whether that saves patients' lives, we do a very intentional series of tiered research investments. 

You start with preclinical work, usually done in animal models, and learn something from a safety perspective. Then you move on to phase one and phase two trials in humans, which are more expensive. If something looks promising in phase two, then you move on to phase three trials, where you're looking at efficacy in a larger human population, which costs more.

In generating that tiered set of evidence, the hope is that we're going to take an idea and move it toward something that could have social impact at scale. I think that this framework of piloting an idea and moving it through a funnel to get a serious evidence base, where we're comfortable making scaled organizational decisions, is one that we will come back to at various points. It has much wider potential than how it is currently used in the science space.

The other thing I wanted to preview upfront is: even very thoughtful people are often extremely skeptical about whether research on how we do research is even a feasible possibility. Oftentimes we think, “Well, what we want is to fund high-quality science, but people can't even agree on what it means to measure high-quality science.” 

I think the right place to start off here is with a thoughtful example around measurements. I tend to be an optimist because I feel there are actually a lot of opportunities to make progress in a very concrete way. Rather than talking in the abstract, I wanted to have Jim start by talking about talent identification, which is one of the areas that people think of as very hard to measure. 

How do you find talented people? Jim, you worked on leading an innovative program, the RISE program. An exciting thing about your work was taking seriously that it is a hard thing to do, but tackling it as a question that you could invest in research to learn about how to do it better. Could you talk a little bit about that?

Jim Savage: Sure thing, Heidi. Let me just start with a little bit about RISE and why it was an important thing for us to spend some time trying to learn about. 

RISE is the cornerstone of a billion dollar commitment by our co-founders, Eric and Wendy Schmidt, to support talent for good. It is a global talent search for brilliant teens aged 15 to 17, whom we try to find, support, and challenge as they go and create public goods at scale. We make a huge amount of support available to them via needs-based scholarships, summer camps where they get to spend a bunch of time with all the people who are like them, other forms of support, and then career support as they go and hopefully do good for other people.

Now, we kicked this program off about three years ago. It's a very large program. We have tens of thousands of applicants applying for this program, and only 100 winners every year. When we started it along with our partners at Rhodes Trust, our principals, Eric and Wendy, and our CEO Eric Braverman gave us this challenge. The challenge was that we needed to come up with a way of finding brilliant youngsters that was open to anyone in the world who could apply. We should have lots of different pathways to apply. 

Because most people will miss out, it should also be a program that benefits people in the very act of applying. So although most people miss out, they have still gained something from having applied, which is a bit different from many university or scholarship applications. We were given this challenge of finding a scalable way of measuring talent and identifying people who are brilliant and empathetic, and have high integrity, perseverance, and some calling. How do we find those people at scale in such a way that it benefits them? It's a really hard problem.

We went and read the research. Our team interviewed dozens and dozens of scientists on this – people who'd done studies over many decades. We read all the papers. We worked with a really great team that spun out of Khan Academy that did some interesting design work on how we might measure talent at scale. Then we had a product review with Eric Schmidt. Now, a product review – I’m sitting with a bunch of economists here – has the vibe of an economics seminar. It's where your principals will challenge you and really test how much about this you understand.

Let's just say, we didn't last very long in this product review. After a few minutes, Eric stops us and he's like, “You know, you've obviously done a lot of work, but this is a real investment. We need to understand whether we can identify talent at scale using this method. You haven't shown me the experiments that you've run. You haven't shown me whether what you're proposing is a good way of identifying talent.”

He called us up afterwards and said, “Okay, I'm giving you air cover here to go and do the trials. Go and do the experiments so that you can show us whether this works.” We pulled together a team. Now, I've never run a human trial study before, so this was very new to all of us. We worked out a couple of different models for how we might test this: how do you go and measure integrity in teens at scale? 

And we came up with an interesting design. Imagine we could take a population of brilliant youngsters where we know that there are pre-existing outliers, which we know from very expensive-to-collect data. Then, we have that population of youngsters go through a mock application process. A good application process should identify those people who you already knew to be outliers.

What we did is we recruited 16 classrooms from eight different countries around the world: United States, Hong Kong, South Africa, Zimbabwe, and a few others. We sat down with these gifted and talented classroom teachers who had spent at least a year with their students. We asked them to roughly score all their students against intelligence, empathy, integrity, and those sorts of things. The hunch here was that those traits might be observable by people after having extended exposure to each other in rich context.

Then, we sat down with many of the students in each of those classes and had them nominate the top three most empathetic peers in the class and the top three most persevering kids in the class. It turns out there's a lot of agreement between people. People are talking about real constructs here. If you construct a Gini index where zero is everyone guessing at random and one is everyone naming the same top three, we're talking like 0.4 to 0.6. It is a fair degree of agreement.
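To make that agreement measure concrete, here is a minimal sketch of a Gini-style concentration index over peer nominations. The exact construction RISE used isn't spelled out here, and raw nomination counts from random guessing won't land exactly at zero in a small class, so treat the 0-to-1 anchors as approximate.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of nomination counts: 0 when nominations are spread
    perfectly evenly, approaching 1 when they all concentrate on the same
    few students."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Example: a class of 20 students, each nominating their top 3 most empathetic peers.
rng = np.random.default_rng(0)
random_votes = np.bincount(rng.integers(0, 20, size=20 * 3), minlength=20)
unanimous_votes = np.bincount([0, 1, 2] * 20, minlength=20)  # everyone names the same three

print(round(gini(random_votes), 2))     # lower: little agreement
print(round(gini(unanimous_votes), 2))  # higher: strong agreement
```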

This data is very costly to gather. There's no way that we could get this at scale. With this data, we then had the question: are there interview questions, tasks, exams or other things that would identify the people in this group? 

Now, these kids were already in gifted and talented programs. They're already pretty sharp to start with, and we know their classmates and teachers could identify outliers. Can we identify those people? So we tested dozens of interview questions. We had 23 wonderful volunteers, mostly PhD students at Oxford from all around the world, who volunteered as interviewers to have interview panels with these youngsters. They had 45-minute structured interviews where they tried to get the sense of whether the person demonstrated evidence of having high integrity or empathy. We gave people questions from the LSAT and other aptitude tests. We gave people divergent reasoning tests like, ‘How many uses for a T-shirt can you come up with?’.

We had the youngsters record selfie videos, and we recruited some volunteers from that same age group to watch these selfie videos and grade on whether they thought they exhibited high intelligence, high empathy, or integrity. After all of this, we learned a couple of really shocking things that changed how we built RISE. 

The first was that many of these questions that I had been using in interviews for years don't work very well. At least they didn't identify the outliers that we knew about. It was a big hit to me – I haven't used any of my old interview questions since.

Second, we learned that there was very little relationship between the structured interview panels and the very costly data that we gathered from classmates and teachers. When we decomposed that error, it was an error that systematically favored the mock candidates from richer backgrounds and systematically penalized candidates from poorer backgrounds. That was really something.

We went back to all those interviewers and told them this, and they said, “That's really interesting. Who were the people that we were making mistakes with?” We found one candidate who was nominated by 80% of his class as being the smartest kid in the class, and yet when the interviewers interviewed him, they rated him as the second worst on that measure. Why is this happening? The three interviewers agreed – they had very high inter-rater reliability. We went to the interviewers and they said, “Well, he didn't answer any of the interview questions. He just did very, very poorly in the interview and didn't give us any evidence.” We made a lot of use of that finding.

We still do use a small amount of interviews at the end of the application process for RISE, but it is not a hurdle. You can have a fairly weak interview, and as long as the rest of your application for RISE is really strong, you can still get through.

We now spend a lot of time preparing people to make sure they really are able to put their best foot forward when interviewing. We also only use questions that we know have been validated using this sort of mechanism. Now, the delightful thing about this was that we found certain questions to be very strong predictors of whether the classmates and teachers thought highly of candidates. When we rolled out the live application, the candidates who did well on those questions had much higher rates of completing the application, which involves working on a project for seven weeks. In live data, we saw it validated that we could get some data very, very cheaply that was predictive of real world behavior.

Heidi Williams: One thing I love about this example is that oftentimes when people think about research, they're like very narrowly focused on impact evaluation as opposed to validation of measures. 

Paul, it's interesting because it is very similar to how you talk about some of your data and measurement validation work. I know that came up a lot in your work on GiveDirectly and other things too. But I want to transition and ask if you could kind of talk about research that's more traditional, like impact evaluation. How do you think about what the key steps are that you need to bring together for that to be a meaningful, high potential investment?

Paul Niehaus: Yeah. I always tell people that there are three hard things with impact evaluation. One is to be clear conceptually about what you're trying to achieve. A second is to think about good metrics for that, which I think is what Jim has just shared a great example of. Those two are obviously very interrelated and are things that organizations typically need to do anyway for many other purposes. Sometimes running an experiment can be a good forcing function to get you to do that if you haven't already.

Then the third key thing is counterfactual reasoning. The essential thing about impact is how the world is different as a result of the thing I did compared to the way it would have been if I had done something else. People will sometimes say in a sort of loose way, “Oh, we did this thing and you could really see the impact of it.” But if you take the definition seriously, that's not true. There's no sense in which you can ever literally see the impact of something you've done because you can never see how that alternative world where you did something else looks. 

The really exciting and challenging thing about impact evaluation is finding good ways to make inferences about that counterfactual, which we cannot see. That's what experiments are all about and why I think they're very powerful.

With an experiment, we take a group of people – kids, perhaps, who want to enroll in Jim's program – and assign a group of them to get in and a group of them not to get in. Then, we look at how their lives evolve after that. When we compare those outcomes, if they were assigned randomly to those two groups, we can be really confident that the kids who didn't get it are giving us a pretty good counterfactual for what life would've looked like for the kids that did get it, if they had not. That's the power of the method and the experiment.
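A minimal simulation of that logic, with entirely made-up numbers: because assignment is random, the simple difference in mean outcomes between the two groups recovers the program's average effect.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical applicant pool with varying baseline potential.
n = 2000
baseline = rng.normal(50, 10, size=n)

# Randomly assign half to the program ("treatment") and half to control.
treated = rng.permutation(n) < n // 2

# Simulate outcomes: everyone drifts a bit; the program adds +5 on average.
true_effect = 5.0
outcome = baseline + rng.normal(0, 10, size=n) + true_effect * treated

# Because assignment was random, the difference in means is an unbiased
# estimate of the program's average effect.
estimate = outcome[treated].mean() - outcome[~treated].mean()
se = np.sqrt(outcome[treated].var(ddof=1) / treated.sum()
             + outcome[~treated].var(ddof=1) / (~treated).sum())
print(f"estimated effect: {estimate:.2f} (SE {se:.2f}), true effect: {true_effect}")
```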

Why is it so important? There are lots of other ways that we can go about trying to measure these things that seem appealing or intuitively right, but they can turn out to backfire or not to work in the way we expected. A very common way to look at how things are going is comparing people before versus after they get help. You'll often see situations where people opt into getting help at times when they need it and when things are going badly, and then afterwards things get better. We're tempted to say, “Ah, things got better, because the thing that we did to help them is working.” Whereas in fact, some of that is just because when things are really bad, there's nowhere to go but up. Things tend to get better after that. That's been a common issue in a lot of program evaluation. 

For example, when looking at ways to help people who are unemployed, people who are having a hard time finding a job opt into some sort of help finding a job, and then lo and behold, they do find a job. But we don't know how much of that is just because they would have found a job anyway.

That's the power of experiments. There are also other ways of trying to draw these counterfactual inferences that can be useful – times when you can do something that's very close to an experiment, even if it's not exactly an experiment, within the parameters of the decision-making structures you already have in place.

A common thing that we do in economics is we might look at a system where there's a cutoff. Maybe like Jim's program, if they're above some threshold, people get into the program. We can say, “Well, let's look at the people that are just above that and compare those to the people that are just below that threshold.” They're slightly different, but those differences are pretty slight. So we'd feel pretty confident saying we can attribute different outcomes for those groups largely to the impact of the program, as opposed to other factors.
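A minimal sketch of that cutoff comparison (what economists call a regression discontinuity design), again with made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical applicants with a score; those at or above 70 get the program.
n = 10_000
score = rng.uniform(0, 100, size=n)
admitted = score >= 70

# Outcomes improve smoothly with score; the program adds a jump of +3 at the cutoff.
outcome = 0.2 * score + 3.0 * admitted + rng.normal(0, 2, size=n)

# Compare people just above vs. just below the threshold (a narrow bandwidth),
# where the two groups are nearly identical apart from program access.
bandwidth = 2.0
just_below = (score >= 70 - bandwidth) & (score < 70)
just_above = (score >= 70) & (score < 70 + bandwidth)

# This simple comparison picks up a little of the underlying slope as well;
# in practice you'd fit a local regression on each side to remove that trend.
rd_estimate = outcome[just_above].mean() - outcome[just_below].mean()
print(f"estimated jump at the cutoff: {rd_estimate:.2f} (true effect: 3.0)")
```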

So experiments are not the only way of drawing these kinds of counterfactual inferences, but they are very powerful. They force us at least to think hard about that question of “how am I confident that I can see what the world would've been like, if I hadn't done this thing that I've done?”

Heidi Williams: Yeah. I want to kind of transition to talk more about experiments directly. Before we do that, Emily, I would love for you to say a little bit about how Open Philanthropy thinks about using impact evaluation and counterfactual evidence in your decision-making – just to give a lay of the land before we get into kind of more specifics on experiments.

Emily Oehlsen: Yeah, absolutely. By way of quick background, Open Philanthropy is a philanthropic organization that gives away a couple hundred million dollars a year, and we aim to maximize our impact. We think about that in a pretty evidence-based and explicitly expected-value-maximizing way. There are two sides of our organization: one that focuses on potential catastrophes that we might encounter over the next couple decades, and the other focuses on ways that we can make the world better in the near term – often in much more concrete and legible ways. 

A key distinguishing feature of that second piece is that we are often trying to compare outcomes – not only within causes but also across them – to try to optimize our overall portfolio, which we take to mean equalizing marginal returns across all of the different areas that we could be working in.

Heidi Williams: To be concrete in thinking about this, how do you put health investments and education investments in a similar unit?

Emily Oehlsen: So some of the areas that we work in are scientific research and global health R&D. We do some work on the health impacts of air quality. We do some work on farm animal welfare which makes the comparisons quite difficult because you have to think about the suffering of animals and people. We do work on global aid advocacy, and a few other areas. 

There are lots of things that we care about, but as a simplifying principle, we often try to think about the health impacts of the work that we do and the way that they affect people's consumption. So far, we have thought as hard as we can about how to compare those two units and use that as a disciplining force to think about the marginal thing that we could do in each of these areas.

I really liked Paul's taxonomy. We try to think hard about what we care about and the metrics that we can use to measure those things in the world. There's tons of complexity. Even just taking health impacts, we rely a lot on the IHME and the WHO to think about the life years lost to different health conditions. And there is tons and tons of complexity embedded into that. We are avid consumers of experimental evidence, as we try to evaluate different opportunities that we could pursue.

I’m particularly excited about the work that we do thinking about places where we can innovate how we use experiments in social sciences and in public health. 

One example from Open Phil today is that our science team is exploring the possibility of funding a controlled human infection model (CHIM) for hepatitis C. Hepatitis C is a particularly good candidate for a CHIM because there's slow natural progression of the disease and a relatively low intensity of transmission – even among high risk groups – which make classic field efficacy trials extremely slow and difficult to conduct and make the possibility of a human challenge trial more exciting. I don't know where that will go, but I think it's interesting to push the frontiers of places where we can use experimentation.

Heidi Williams: Yeah. That's a great example. Like Paul brought up, what is the experiment you would run with RISE? 

If you were going to fund 100 kids, let's choose 200 kids that you would most like to fund, and you randomize funding for 100 of them and not for the other 100. Then, you want to track how their life is different. That sounds like something where you're going to structure this 20 year study, where what you care about is their earnings when they're older. So, these studies come across as feeling very infeasible. 

You mention an interesting example of how we can learn more, and more quickly – innovating on the research side of that. 

Paul, I was curious if you could say a little bit about when people understandably say, “Isn't that too expensive and going to take too long?” What are some of the options you bring to people who want to use experiments for more real-time decision-making?

Paul Niehaus: Experiments really run the gamut from extremely fast to very long-term, from extremely cheap or free to very expensive. Concretely, at GiveDirectly, which is an NGO that I started, we've run around 20 studies. They've ranged from one that took five years from initiation to results and cost hundreds of thousands of dollars, to one that took four weeks from the beginning to having useful data back and cost nothing to run. That gives some sense of the range of possibilities.

What drives that? Randomization per se is not expensive. I mean, if we just want to randomize something, we can do that right now in a Google spreadsheet and it costs nothing at all. Picking things using a lottery is free. The thing that is typically expensive and possibly slow is the outcome measurement.

At GiveDirectly, for example, the expensive and slow trial that I mentioned was where we wanted to see the impact on local economies if we bring in a huge amount of money. To do that, we have to do this very extensive measurement of what's going on with households, what's going on with firms, what's going on in markets with pricing, and what's happening at the local government to get this comprehensive picture of how an economy reacts when there's a big influx of money. That takes time and it takes a lot of resources to go measure all those outcomes and then analyze the data. That is to some extent intrinsic to the thing that we want to look at.

At the other end of the spectrum, the very quick and cheap study I mentioned looked at whether a little bit of liquidity before people decide how they want to structure their transfer changes their decisions. The beauty here is that this is an administrative outcome which we're already collecting anyway. We have people that are already asking, “How would you like to structure your transfer? If I give you a thousand dollars, would you like it all at once? Or would you like it in 12 tranches?” If we want to see what happens when they have a little bit more cash in hand when they make that decision, we get the data back for free already, so it only takes a few weeks to do that and it is very cheap to do. It’s largely a question of the thing that you want to look at.

In terms of the longer term, sometimes we really do care about how the world will be different in 10 years. There's a part of the answer here that – whether you're doing an experiment or measuring impact in some other way – if you want to know what things will look like in 10 years, you just have to wait 10 years. That's not a feature of experiments, that's just a feature of life. 

But I would also say that there's an interesting frontier in statistical analysis looking at surrogates, essentially things that we can observe now that we think are good predictors of what the world might look like in 10 years. They can at least give us some leading indicators of whether we're seeing the kinds of changes that are indicative of the longer term as well. I think there are ways to be smart about that.

The last thing on the cost of experiments is that sometimes there is a risk of being penny wise and pound foolish when sizing them. The issue here is that you want to design an experiment that's big enough to give you the degree of confidence in a statistical sense that you need to be able to make decisions. You want to be thoughtful about that. I have participated in things where later I think, “Actually, we should have done this with twice the sample to have more confidence in the result.” There's a whole art and science around sizing those experiments.

That's the one place where you want to be careful not to cut corners. Part of that is because we are in this bad habit as social scientists of saying, “If we can't be 95% sure that something happened, then we're going to treat it as if it didn't.” There's this pathology in how we interpret so-called “null results” that makes it even more problematic and makes me err on the side of having a larger experiment to make sure it doesn't get misinterpreted in a way that things often get misinterpreted.
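As an illustration of the sizing arithmetic, here is a minimal sketch of the textbook sample-size calculation for comparing two group means. The effect sizes are illustrative, but the pattern – halving the detectable effect roughly quadruples the required sample – is the general one.

```python
from scipy.stats import norm

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided test comparing two means,
    where effect_size is the difference in standardized (SD) units."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# Halving the detectable effect roughly quadruples the required sample.
for d in (0.5, 0.25, 0.1):
    print(f"effect of {d} SD -> about {n_per_arm(d):.0f} people per arm")
```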

Heidi Williams: Emily, a different concern that people often bring up with experiments other than feasibility is external validity. 

Say you do a study in one setting. GiveDirectly does an experiment in one country like Kenya. What is the external validity consideration? What should we learn about that if GiveDirectly was going to expand its giving in India, for example? 

At Open Phil, you seem to use experiments a lot internally. Open Phil has a really great practice of often publishing the reasoning behind their investments, so one can get a lot of insight into how they used research in making their decisions. It seems like you think a lot about that. I was wondering if you could give a few examples of where you've seen that work well.

Emily Oehlsen: Yeah, definitely. Two responses come to mind. So one, Heidi, you talked earlier about how in the biomedical sciences we have a clinical trial process with different stages that have different costs associated with them. We're willing to invest as we get more and more information that a particular drug, diagnostic, or therapeutic is potentially effective and looks promising to get widely distributed. There's an equivalent in the social sciences too. 

The main example that comes to mind for me is Development Innovation Ventures, called DIV for short. Within USAID, they're a special division that makes smaller investments often in riskier and earlier stage projects, where there's the potential for high impact. They have a similarly staged process. I think it's stages one, two and three where the dollar amounts scale up. There's an initial pilot phase where you might run a small experiment to get some preliminary data on a particular intervention. As you become more and more confident that that intervention is effective, you can run larger and larger experiments to see how it scales before ultimately thinking about broader deployment. I think that that's like a really effective model.

Another observation that comes to mind is that sometimes a single paper or a single experiment is not the right unit. One example here is there were a number of RCTs that were run around water quality, but they were all individually underpowered to look at mortality because mortality is rare. Michael Kremer – who recently won the Nobel Prize – did a meta-analysis and found a big, statistically significant effect from these water quality interventions on mortality. That meta-analysis played a significant role in GiveWell's decision to scale up Evidence Action’s Dispensers for Safe Water program. Using this as one example to say that sometimes an individual experiment isn't enough in and of itself to be decisive, but it can be coupled with other types of evidence that can then lead to a bigger decision.
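A minimal sketch of that pooling logic with made-up study results (not the actual water-quality estimates): an inverse-variance, fixed-effect meta-analysis can be decisive even when no single study is.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-study estimates of a mortality effect (e.g., log risk ratios)
# and their standard errors; each study alone is individually underpowered.
estimates = np.array([-0.20, -0.35, -0.10, -0.30, -0.25])
std_errs  = np.array([ 0.20,  0.25,  0.18,  0.22,  0.21])

# Inverse-variance weighting: more precise studies get more weight.
weights = 1 / std_errs**2
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
z = pooled / pooled_se
p = 2 * norm.sf(abs(z))

for est, se in zip(estimates, std_errs):
    print(f"single study: {est:+.2f} (z = {est/se:.2f})")   # none reaches |z| > 1.96
print(f"pooled:       {pooled:+.2f} (z = {z:.2f}, p = {p:.3f})")
```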

Paul Niehaus: Can I just add that I completely agree. By the way, the Michael Kremer story is a good example of this pathology I mentioned where we sort of interpret things that don't reject a null hypothesis as not being informative. Michael basically showed that they are individually somewhat informative and collectively very informative. I think that's a great example.

The other thing I wanted to say is there is a very common misperception that external validity – which I don't even like the term, but I mean whatever – is more of an issue for experimental methods than it is for non-experimental methods. Personally I think that it's actually often the exact opposite. When you use non-experimental methods, the results are not representative of the population you care about in ways that are very opaque and hard to understand. This is opposed to experiments where it's pretty clear what population the results are representative for and where you should therefore be careful in extrapolation or scaling up, as Emily mentioned. We don't need to get into the details of the statistics of that, but if anything I would say that cuts the other way.

Emily Oehlsen: Thinking about the topic of external validity, I think it does raise – and this has been sort of woven into our conversation so far – two challenges that come up when we think about experimental evidence and how to use it. 

One, it is often the case that the importance of an experiment is not evident right at the moment of discovery. We do a lot of grant making at Open Phil that's particularly directed towards trying to improve health outcomes for people living in low-income countries. There, it's fairly clear what we're aiming for and the significance of the potential results that we could get from any particular experiment. We also do a lot of grantmaking that is more basic science in nature. This relates to Paul's first question of trying to decide what matters to you. A lot of times when we're doing that work, it is quite difficult to articulate what a good result means or how it's going to ultimately flow to impact downstream. That is a challenge that we always have to grapple with.

Another is how to think about effectively using experimentation. At Open Phil, we think about a lot of our grant making as hits based. This is the idea that you are willing to pursue low probability strategies because of the potential upside. Oftentimes with those opportunities, the work is riskier, there are fewer feedback loops, causal attribution is harder, and oftentimes the outcomes aren't observable until like 30 years down the road and you can't maintain a control group for that long. Some of our corporate advocacy work in farm animal welfare has this quality, as well as some other areas of grant making. 

I think this is a pretty common observation in the metascience world. In science, you might think that the distribution is pretty fat-tailed and we should be focused on some of these outlier opportunities. And so how to bring experimental evidence to bear productively on those questions is a challenge.

Separate from the dimension of learning things faster, there are also types of experiments or experimental evidence that we're particularly excited to fund because we think they're under-provided in some way by the ecosystem. One example in this broader bucket is the replication of really promising work.

Paul and Heidi will know far more about this than I do. But I think there are some incentives within the academic world to under-provide replication, because it doesn't have the same novelty or it doesn't contribute to your potential career prospects in the same way. But oftentimes, when you're evaluating a particular intervention and you see one piece of evidence that seems like an outlier compared to your prior or other things that you've seen, sometimes the most powerful thing that you could then observe is a replication of that work in some capacity. And so that's one thing we're really excited about.

To your timeline point: as Jim was saying, it's really hard to set up experiments, to create that infrastructure and act on it. And sometimes there are particular moments where it is really valuable to spin up something quickly – I think COVID is the one that comes to mind. So we're also excited about creating more flexibility in the system so people can jump on opportunities as they arise.

Heidi Williams: Yeah, and that's a bit of what we touched on at the beginning. When people think of experiments for science, they often think of this as: what's the best way of getting the best science? And that's where I think you get into these ideas: "Well, how would you even measure that? And isn't that 10 years out, with long and variable lags before these incredibly tail-heavy outcomes that have all the social value?" When I talk with people who do science funding as their job, I try to anchor them a little bit on the concrete challenges they have that are not tied to these more existential questions.

So one example is that the National Institutes of Health is very concerned that the average age at which you get your first NIH grant has been going up and up over time. They're very concerned that their grant structure for some reason is not doing a good job of identifying young talent. So for them, you can ask: can we design a research approach, or an experiment, or whatever you would like to do, that investigates whether different grant mechanisms do a better job of identifying talented young scientists who for some reason might be getting missed by the default system? That's something where you observe the outcome right away.

Emily Oehlsen: Yeah.

Heidi Williams: You could say, “Well, maybe the young scientists aren’t currently as good as the older ones, but everybody agrees that we need a way of onboarding people into the system.” I feel like examples like that can get you out of this “how would we know good science when we see it” issue when it comes up.

But, Jim, I wanted to come back to talk a little bit about the organizational dynamics of how this can kind of work in practice. So oftentimes, organizations understandably see experiments as pretty risky to conduct because: what if we show that our program doesn't work, and what does that mean for people's jobs, and what does that mean for me personally as being the person that ran this?

And Ben Jones, who's an economist at Northwestern, often makes a distinction between what he calls existential experiments and operational experiments. The existential experiment is, "Should my organization exist?" And the operational experiment is: "We would like to find talented youth, and we have two ideas on how to do that – which one is better?" And so I think there are some structural ways in which the research questions that you pick can make this less threatening within organizations. But I'm just curious if you could comment on your work with RISE and how the internal organizational dynamics played out.

Jim Savage: I personally have not experienced that sort of existential threat of whether you have to close down a program or something because it doesn't have an impact. Which is not to say that I've never had any pushback against doing experimentation.

Heidi Williams: Yeah.

Jim Savage: Almost all the time, that pushback has been because doing things is really hard. Setting up a big program, especially something on the scale of RISE – where you're coordinating hundreds of volunteers, you've got zillions of candidates, you've got different types of software, you've got paper applications coming in and chatbot applications and all these sorts of things – is an incredibly complex initiative. And each time you add complexity to a program, it just becomes exponentially more difficult to operate. So especially when you're setting up an organization or an initiative, it can be really challenging to just add more complexity, and you should have a bias towards simplicity in what you do.

Heidi Williams: Yeah.

Jim Savage: Which is not to say that that's not also a really good time to do an experiment, because that's when you work out what to do. There is always going to be a tension. I just think that, especially for these more operational or formative evaluation questions, as the evaluation people would call them, it's not a fear that you're going to have to shut the program down – it's just really hard to do experiments. Now, you have seen some fields adopt experimentation that are not full of macroeconomists. So where are these fields? You go looking for them and it's like: if I'm on MailChimp and I want to send an email to a zillion people, I've got an option where I can just run a randomized controlled trial. They're called A/B tests, and I think that term is used because it's non-threatening – randomization is kind of a scary word. I can now have different copy and see which version has better click-through rates, or which subject lines get opened more.

You know, if you log onto various news websites, you will see different headlines as they A/B test, or even use multi-armed bandit approaches, to work out which are the highest-performing variants of headlines to serve to you. And it's not because they've got a whole bunch of macroeconomists who've been pushing people to adopt an experimental method in science.

It's because the software makes it really easy to do these sorts of experiments. If we are to be able to do more experimentation internally, a lot of it comes down to, how can we reduce the cost of doing experimentation?
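To make the A/B-test idea concrete, here is a minimal sketch of the kind of comparison such tools automate – a two-proportion test on open rates between two subject lines. The counts are hypothetical, and the exact statistics any particular tool uses are not specified here.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: each subject line was sent to 5,000 subscribers.
opens_a, sent_a = 1100, 5000   # subject line A
opens_b, sent_b = 1210, 5000   # subject line B

p_a, p_b = opens_a / sent_a, opens_b / sent_b
p_pool = (opens_a + opens_b) / (sent_a + sent_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))

print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```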

Heidi Williams: Mm-hmm.

Paul Niehaus: I think that's got to be partly because Jim Savage has selected into really good, high-performing learning organizations. There are definitely examples out there of high-profile, important efforts where people have resisted encouragement to test – Millennium Villages, for example. We would all love to know what the impact of that was, and they refused to do it. There are also pretty well-known examples of RCTs at organizations that got done and got buried because people didn't like the results. There are some places that are great about it and some places that really do resist.

Jim Savage: When I talk to people, I don't really hear that, especially if you're talking to public servants. People admit there is a lot of complexity. They would love to do it, and I think we should be making it easier to do experiments.

Emily Oehlsen: Do you think there's an organizational feature here? If you're an organization that does one thing, and the experimental result shows that that thing is not as effective as you thought it was, would that feel more existential to you than if you're in an organization where you have many different programs going on at once? Where you can take more of a disinterested perspective? Do you think that's a factor?

Jim Savage: There's definitely a fear of "evaluation." I think evaluation has this very threatening tone: "Oh, we're looking at the sum of impacts of your program or your organization." That does have some existential threat, but I don't think we're really talking about that. 

I'm not talking about that. I'm talking about the idea that you can get program managers to run experiments internally, if it's easy enough. And I think that most people are willing to do that.

Heidi Williams: Yeah. There's obviously a continuum. Measuring teacher value-added was something that I think felt very threatening to individuals: “I'm getting ranked and scored,” right? I do want to come back to Emily's point, because there's an interesting example with No Lean Season that both Emily and Paul could probably offer perspectives on – an experiment that ended up shaping the organization in important ways.

Emily Oehlsen: I observed this from afar, but it’s an example that I've always found pretty inspiring. Evidence Action was founded in 2013 to scale evidence-based, cost-effective programs. They had their core programs around deworming and scaling free, reliable access to safe water. But then they also had this program called No Lean Season, where I think the original experimental evidence was from Bangladesh. It involved giving people both information and small loans so that they could migrate to other parts of the country when seasonal work was scarce where they lived. The original RCT evidence showed that this was a pretty promising intervention for poverty alleviation, and so Evidence Action started to scale it up. Then they ran two more RCTs as it scaled that showed it was less effective than they had expected, and they ended up shutting down the program.

I found that decision quite impressive, to be able to take a step back and say, “Okay, this is not as promising as we thought it was. There are other ways we could deploy this money that we think would help more people and help them more deeply. And so as an organization, we're going to pivot.” That was a really impressive example.

Paul Niehaus: Especially when you say there are other ways to deploy the money, a lot of that money isn't in your pocket. Will funders actually respect this choice and listen to us when we say we think you should fund this other thing instead, or will they just walk away entirely? I think there was courage in it. But then also as we've talked about, the fact that they had other things that people could move to makes it less of an existential evaluation and more of an operational one, right?

Emily Oehlsen: Yeah.

Jim Savage: One question with these sorts of evaluations is measurability of outcomes. Many of the most impressive investments might be on things that result in some cultural shift, or some change in the zeitgeist, or some demonstration effects that have a lot of people change how they go and do their work. I cheekily use the example of the Koch philanthropies. I think they have pursued many investments that would be almost impossible to study in an evaluation framework that we would be happy with. But nobody accuses the Koch philanthropies of having had no impact. I think people often do the opposite. If you are pursuing some types of things, people might be legitimately afraid of being evaluated when the evaluator will never be able to observe the rich outcomes that they're actually trying to affect.

Heidi Williams: Yeah. I think of GiveDirectly as kind of a nice example of this. GiveDirectly rolled out a lot of experiments that were targeting these more narrow questions. But if I were going to introspect on GiveDirectly's impact, it seems like it was mostly shifting the narrative around cash transfers.

Paul Niehaus: Yeah.

Heidi Williams: Could you say a little bit about that for people that might not be familiar with the broader context?

Paul Niehaus: I totally agree with what Jim said for this reason, that on the one hand, GiveDirectly had this very evidence-centric strategy, and so even the very first transfers that we delivered were part of a program evaluation which went on to get very well published. And I was totally wrong about that, by the way, thinking, “This is going to be a boring paper. Nobody wants to read yet another paper on cash transfers.” They did. 

It was really powerful, and we said to people, "We're an evidence-based organization, and we're going to begin…” And so we've gone on to do lots and lots of these. But the most valuable thing that we've really contributed to the world is that the narrative around cash transfers has changed dramatically, and we've played some role in that.

When we started, the very first people we'd go to for funding would say, “This is crazy. This is nuts.” You know, the first time we had a New York Times story, the headline was something like, “Is it nuts to give money to the poor with no strings attached?” We have now come to a place where most people don't think that it's nuts at all.

They think it's obviously something we should be doing to some degree, and the debates are all just about how much, when and when not, and things like that. I think that's exactly right, that the rigor of the experimental method helped contribute to the credibility of it, and to drive this change in narrative and in perception. That change in narrative and perception was the hardest part, but also the most important part of the impact of it all.

Jim Savage: One of the most helpful things that both GiveDirectly and Open Phil have done within the broader funding community is give the rest of us a really good benchmark for when we are talking about an intervention that is directed at shifting the zeitgeist or culture or some set of incentives or demonstration effects.

You've got this shadow price in your head “Oh yeah, it costs three and a half to five thousand dollars to save a human life. You can buy this many utils with cash transfers.” Those are really important things to have in your mind when you're spending philanthropic capital or money for science or something – that there is this opportunity cost out there.

Heidi Williams: Do you think that happens mostly within a cause area that is already a focus for the funder? Because as Emily was saying, at Open Philanthropy, they're partly using this to prioritize across areas. A lot of philanthropies come in and already know the area that they want to be in.

Do you have an example in mind that you could give around that? Is it really prioritizing across interventions for that cause, finding where there is the highest impact?

Jim Savage: No, I think it's simply that you need to have in your head the knowledge that you can do a lot with this money. It forces you to be creative and thoughtful and think through what you're trying to do more carefully. That might be a very unmeasurable impact of both Open Phil and GiveDirectly in the long run.

Heidi Williams: One thing about this No Lean Season example: there was one very high-impact study, but it was also just a very intuitively comfortable idea for people – that there was this spatial mismatch in work opportunities because of the seasonality of labor in these countries. I think that really resonated with people as a very plausible case. The scaled experimentation did a real service by showing that the thing we find intuitive, and that one study suggested, actually might not be right at scale.

Paul Niehaus: There's this old trope in development: give a man a fish and you feed him for today, teach a man to fish and you feed him for a lifetime. To me, the closest analog in terms of things we actually do to try to help people living in poverty is teaching a man to fish: active labor market interventions where we try to train people, help make them more employable, and help them get jobs. That's an area where the evidence has generally been really, really negative. We've tried those things a lot. I don't think we're very good at teaching people how to fish.

So that's a great example of something that at a very loose abstract level seems intuitive: “Of course I want to feed somebody for a lifetime, not for one day, like that seems obvious, right?” But then when you actually get into the data, it turns out we're just not that great at teaching people how to fish. We should think about whether we can get better at that or other things we could be doing. That's always been a good example.

Heidi Williams: Emily, you talked about the case for thinking about more organizational or philanthropic investment experimentation as a methodology. Paul brought up one example, which is this idea around surrogates. To spell it out for people that aren't very familiar with the details of experiments, oftentimes for drug development, the default is that we need to know whether this drug improves survival. 

There are some very specific cases where the regulator, say the Food and Drug Administration, will be willing to accept some substitute outcome that we observe much sooner than improvements in mortality – where, if that surrogate changes, we know it reliably predicts that mortality will change later. Those surrogate endpoints enable much shorter trials and a faster opportunity to learn about drug effectiveness than if we always needed to wait 20 years.

That structure has started to be interesting in the social sciences. A group of economists were interested in whether there was some equivalent of that linking school test scores and wages. How do we expand the idea of surrogates beyond this very medical context to a broader set of frameworks? I do think surrogates themselves have really high potential, but there's a more general interest also in how we invest in the statistical methodology of learning things more quickly. You mentioned one example of a novel way of doing clinical trials that you guys were looking into. Another example would be human challenge trials. Is there one particular one that you want to talk about more?
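A minimal sketch of that surrogate idea in a social-science setting, with entirely made-up data: fit the relationship between the surrogate (test scores) and the long-run outcome (earnings) in historical data where both are observed, then use it to translate an experiment's short-run effect into a predicted long-run effect. This follows the general "surrogate index" logic; the numbers and functional form are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Historical data where both the surrogate (test score) and the long-run
# outcome (earnings) are observed.
n_hist = 5000
score_hist = rng.normal(0, 1, n_hist)
earnings_hist = 30_000 + 4_000 * score_hist + rng.normal(0, 8_000, n_hist)

# Fit the surrogate-to-outcome relationship (a simple linear fit here).
slope, intercept = np.polyfit(score_hist, earnings_hist, 1)

# New experiment: we only observe short-run test scores, which the
# intervention raised by about 0.2 SD in the treatment group.
n_exp = 1000
score_control = rng.normal(0.0, 1, n_exp)
score_treated = rng.normal(0.2, 1, n_exp)

predicted_effect = ((intercept + slope * score_treated).mean()
                    - (intercept + slope * score_control).mean())
print(f"predicted long-run earnings effect: ${predicted_effect:,.0f}")
# Roughly 0.2 SD * $4,000 per SD, available without waiting 20 years.
```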

Jim Savage: An approach from the marketing and online experimentation world that I find kind of compelling is the multi-armed bandit approach. One of the things we get with surrogates is very rapid feedback on whether something is working. We ultimately care about long-run impacts, but we can learn more quickly in the short run. Now, I don't really care about whether we learn what works so much as I care about whether we are doing the thing that works best – which might be a different priority from yours.

Paul Niehaus: You're not asking how good is the best thing, but which is the best thing.

Jim Savage: Yeah, exactly. Multi-armed bandits: imagine you've got a row of poker or slot machines, and you know that one of them has better odds than the others. How do you go and discover that?

The best strategy is to go and start putting a quarter in all of them, pull the handles, and keep doing that until one of them pays out; then you can update your posterior about which machine has the better likelihood of paying out, based on that observation. It’s a finite sample. And you don't just now sit down at that poker machine and put everything into it – you still explore, but you start to put more of your money into the machines that seem to be paying out more.

We can do something similar within organizations when deciding which programs to scale up, once we've got better surrogates. By seeding many programs and slowly doubling down on those that seem to be gaining traction against near-term surrogates, we are hopefully achieving the same objective of doing the right thing, even if we never learn how good that right thing is relative to some counterfactual.
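A minimal sketch of that explore-and-double-down dynamic, using Thompson sampling with Beta posteriors over made-up payout rates:

```python
import numpy as np

rng = np.random.default_rng(3)

true_payout = np.array([0.04, 0.05, 0.08])   # unknown to the gambler
wins = np.ones(3)     # Beta prior parameters: start at Beta(1, 1), i.e. uniform
losses = np.ones(3)
pulls = np.zeros(3, dtype=int)

for _ in range(5000):
    # Sample a plausible payout rate for each machine from its posterior,
    # then play the machine that looks best under that draw.
    sampled = rng.beta(wins, losses)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_payout[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward
    pulls[arm] += 1

print("pulls per machine:", pulls)   # play concentrates on the best machine over time
print("posterior means:  ", np.round(wins / (wins + losses), 3))
```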

Emily Oehlsen: To add one example: a recent, promising example of this was the RECOVERY trial in the UK during COVID. It was a multi-arm adaptive platform trial, where they were able to investigate many things at once. I think it was quite successful.

Heidi Williams: One thing that's often struck me about GiveDirectly is that there's a lot of self-reinforcing goodwill that happens when an organization isn't doing experimentation and learning and holding that commitment in isolation, but is growing up alongside other institutions that provide support – institutions that say, “We value the work that we're doing and we're also learning from it,” or, “We see the social value that you're creating in taking a more evidence-based approach, and we're going to support you through funding or other meaningful ways.”

But Paul, I'm curious if you could say a little bit about how that played out for GiveDirectly.

Paul Niehaus: We're doing these podcasts now because there is this interesting moment where there's a nascent ecosystem building effort underway to support a science-based approach to doing science, which is super exciting. 

That parallels what happened for us at GiveDirectly. We started GiveDirectly and decided we wanted to take this very evidence-based approach to what we do, at the same time that a lot of other people were creating parallel efforts to do philanthropy and global development in a more evidence-based way. So GiveWell and Open Philanthropy were getting set up, and Founders Pledge, Google.org, and Jacquelline Fuller were taking this approach.

Organizations like J-PAL and IPA were building out research infrastructure to make it easier for people to do experimental trials in the countries where we were working. There were people trying to take a more evidence-based approach to thinking about where to work, like 80,000 Hours. That helped us attract talent to what we're doing, because people recognize these approaches are evidence-based.

So the fact that all of those things were happening at the same time was super important for us and created an environment where we could say, “We have this idea that does sound crazy because you've always been told, ‘Don't just give money to people living in poverty. That's not going to help,’ but look, there are all these other people that are taking this evidence-based approach to the way they think about where to give. They're supporting it and validating it.” That was enormously important for us. Also in terms of a morale level, it's good to feel like you're not alone in that.

So I also want to highlight, because we're also thinking about science and federal support for science, that there were important things happening in governments that were part of that.

So for example, one of the really important, early, influential evaluations of cash transfers was done in Mexico with the Progresa program. That was done because there was a set-aside in the Mexican government’s budget for program evaluation. That ended up being a very influential evaluation that changed a lot of people's thinking. That was critical. We grew up as part of this ecosystem that was trying to move attention away from places where the founder had great oratorical ability and towards things where there's good evidence to back it up.

Jim Savage: And Heidi, you’ve been really a part of this. I talk about the Heidi vortex: you're great at finding all these people in different organizations who are doing this and bringing them together, so thanks.

Heidi Williams: With government employees especially, I feel like oftentimes there are people within organizations that don't have a public-facing founder. The employee is on staff as part of a huge organization, but they themselves have gone out on a limb to do something that was not the norm. They really want to figure out whether the program they were running was working.

We just had a conference in March where we brought a lot of those government employees together. Daniel Handel came from USAID and was the key person at USAID who really made cash transfer experiments happen. Paul was involved with that and Google.org and others were supporting it.

Another great example is Peter-Anthony Pappas who was out at the Patent Office. He wasn’t convinced by someone with an economics PhD that he should do an experiment. He was tasked with designing a program that was meant to accomplish a goal, and he thought, “Well, how would I know whether it was working?” He ended up designing a randomized experiment without even knowing what a randomized experiment looked like. 

You can find these people who are bringing research into the process of trying to improve their organization's effectiveness, not, again, because some PhD told them that they should, but with a real intentionality of wanting the work they are doing to be more effective.

The more that we can showcase examples of that, the more it brings a very different meaning to the value of doing research on research. This in turn makes it easier for organizations to justify the additional bandwidth, like Jim was saying, required to start up a program. 

People are doing a ton of work, and this is an additional thing that you're asking them to do. You might be able to bring in talent to help them. But at the end of the day, people have bandwidth capacities and there's only so much they can do. The more we highlight good examples of where this has provided value to organizations, changed their decision-making, and really helped them accomplish their goals, the more momentum this work will have.

Jim Savage: I should say for listeners, if you know anyone who is running experiments in large organizations on how to do science funding or funding better, you should have that person send Heidi an email.

Heidi Williams: We’re trying to do a lot of matchmaking for organizations that have very particular constraints on what they can and can't do, or that need more people. We’ll do what we can to try to get you matched with somebody who can help you on that. That's a natural point to wrap up, so I'll just say thanks.

Emily Oehlsen: Thanks, Heidi.

Paul Niehaus: That was great.

Jim Savage: Thank you.

Caleb Watney: Thank you for tuning in to this episode of the Metascience 101 podcast series. Next episode we’ll talk about whether scientific progress has downsides, and if so, how we can accelerate science safely.
