Steering clear of experimental economics hazards
Posted: 2026-01-02 · Last updated: 2026-01-21 · Permalink
This page is a perpetual work-in-progress that presents my personal views on experiments in economics and what, to me, makes a good experiment, analysis, and paper.
Table of contents
- Treatments matter only to the extent of their mutual differences
- Don’t sacrifice verisimilitude for shiny objects
- The Golden Mean does not apply to experiments
- Avoid nonparametric tests like the plague
- Small design changes can vastly improve statistical inference
- Carefully design comparisons in factorial designs (e.g., 2×2)
- Null effects can in fact be quantified
- Know what you are doing, measuring and estimating
- Radical openness is likely costless or a free lunch (if you start early)
Treatments matter only to the extent of their mutual differences
A core idea in experimental design is that only contrasts matter. This is one reason why the term “Control” is so misleading: the control is simply one treatment among others, and it matters only through its differences from the other treatments. (Whenever I speak of “two treatments,” I am referring to the classic design with one baseline/control condition and one additional condition, often referred to as the treatment.)
More generally, I frequently see the following pattern: the baseline condition contains some established standard treatment, and the “treatment” is about a novel mechanism, approach, or joie de vivre. In many cases, the treatment differs in more than just the mechanism, approach, or joie de vivre. For example, if the treatment truly is novel, participants may simply be more experienced with or accustomed to the baseline. Or, if instructions differ (for example, the baseline has one fewer page of instructions), that is another difference. The (whole) contrast between treatments defines the interpretation of a treatment effect.
The ideal experiment changes exactly one thing. How can we get as close as possible to that ideal?
One approach is to explain more than is strictly necessary. For example, when testing a classical mechanism against a novel one, the instructions can explain both mechanisms to all participants, who are then transparently randomized into one of the two conditions. That approach is not always sensible, but it does eliminate particular kinds of confounds.
Another problem I see is that the comparisons implied by some treatments simply do not matter or are difficult to interpret. For example, in 2×2 designs (more on them below), it may well be that the comparisons relating to the interaction are irrelevant, or, in the case of binary outcomes, that we do not know how to reason about them. In that case, it can be proper to drop the fourth treatment and keep only the three treatments in the upper left of the 2×2 matrix. If so, it can be sensible to increase the sample size for the north-western treatment, since both remaining comparisons use it as the baseline.
That said, the fourth treatment can often be used productively, though typically only in terms of simple effects (i.e., against the off-diagonal treatments). More generally, treatments as such do not matter; only treatment differences do, and you should focus on engineering those properly!
Use active controls in information provision experiments
Information provision experiments have a special hazard. Consider a simple design where you inform some participants about $X$ and leave others uninformed. What are you identifying? You identify the change from pre-existing beliefs to informed beliefs. But here is the problem: you have no control over the pre-existing belief. Some participants may believe $X$ is higher than the truth. Others may believe $X$ is lower than the truth. Your treatment effect is inherently heterogeneous and depends on what participants happened to believe before your intervention.
This is bad. You are identifying a mixture of positive and negative belief shocks. The direction of your treatment is not controlled by you; it is controlled by participants’ prior beliefs. You can adjust for prior beliefs econometrically, but such an adjustment has no causal interpretation.
The solution is to use an active control treatment. Instead of comparing “informed” versus “uninformed,” in most cases you should compare “informed high” versus “informed low.” Tell one group that $X$ is high. Tell another group that $X$ is low. Now the pre-existing belief is marginalized out. Both groups receive information. The treatment difference isolates only the treatment-induced difference in beliefs. This is the right way to run information experiments.
Needless to say, there must be some genuine uncertainty about the true state. For example, it can help to provide participants with credible forecasts or estimates. Participants should believe the information (and ideally believe it equally in both conditions). Also, there are cases where active control treatments are not good design choices; I have an example of this, and the JEL paper mentioned above discusses such issues.
An additional benefit: the active control also cancels out the mere effect of providing any information at all. Receiving information may have psychological effects independent of content. The active control differences out this confound as well.
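The logic can be seen in a two-line simulation. This is a minimal sketch under assumed numbers (a truth of 0.5, uniform priors, a linear effect of beliefs on the outcome, and full adoption of the provided signal): the informed-versus-uninformed contrast mixes belief shocks of both signs, while the high-versus-low contrast induces the same belief difference for everyone.
set.seed(123)
n <- 10000
prior <- runif(n)            # heterogeneous prior beliefs about X (assumed truth: 0.5)
beta <- 2                    # assumed marginal effect of beliefs on the outcome
# Passive design: "informed" participants move from their prior to the truth
shock_uninformed_vs_informed <- 0.5 - prior
summary(beta * shock_uninformed_vs_informed)   # effects of both signs, prior-dependent
# Active design: everyone is informed; signals of 0.7 ("high") vs. 0.3 ("low")
shock_high_vs_low <- 0.7 - 0.3                 # induced belief difference, same for all
beta * shock_high_vs_low                       # one clean contrast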
Don’t sacrifice verisimilitude for shiny objects
A common issue I observe is how quickly verisimilitude, that is, the appearance of truth, is sacrificed for golden calves.
The fundamental conflict is easy to understand. As experimentalists, we study human behavior. As economists, we study markets. Nonetheless, many experimental economists are interested in “uneconomic” topics (as am I, though I call these topics studies in Non-Market Decision Making). Here’s an example: you might want to study how an outgroup member’s viewpoints on apolitical topics affect people’s “warmth” towards the outgroup.
Since “warmth” is not a rigorous economic concept, economist experimenters tend to practice an understandable bait and switch. They replace “warmth” by “giving in a dictator game” or “trust in the trust game” or “reciprocity in the trust game.” The golden calf of incentive compatibility must not be sacrificed!
But what about just asking people how warm they feel? Needless to say, and as fairly represented on that Wikipedia page, the feeling thermometer is not the most world-historically rigorous measure and it has certain important issues. However that may be, it also has a core advantage: without doubt it relates to the concept of “warmth,” even if that “warmth” is imprecisely measured.
So, while the feeling thermometer is not very valid as an economic construct, it has high conceptual validity. Conversely, economic games like the dictator or trust game are highly valid economically, but they are weak conceptual representations of “warmth.” There are various other advantages and disadvantages to each approach. An important one in favor of measures like the feeling thermometer is simplicity: it truly is trivial to understand. The same holds for other “less rigorous” measures, such as Big 5 questionnaires. Interpersonal comparability seems to be mostly a theoretical concern; many such measures are far more valid than the average experimental economist believes, even though economists do not and cannot understand “why” they work. The feeling thermometer in particular is predictive of real-world behavior; it correlates with real-world outcomes (voting behavior, policy support).
As so often, econometrics comes to the rescue. Is there a way to combine the conceptual richness of the feeling thermometer with the golden calf that is incentive compatibility? Yes! Just elicit both and combine them into an index, assuming that the measures are noisy signals of a common latent variable. A solid approach is to use an average of z-scores, but there are other methods such as principal component analysis or factor analysis. The latter is valuable if many measures have been collected. Needless to say, any such approach must be preregistered, but it is highly valid.
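As a minimal sketch of the index approach, with hypothetical variable names and under the stated assumption that both measures are noisy signals of the same latent “warmth”:
set.seed(123)
# Hypothetical data: a feeling thermometer (0-100) and dictator-game giving (0-10)
df <- data.frame(thermometer = runif(200, 0, 100), giving = sample(0:10, 200, TRUE))
z <- function (x) (x - mean(x)) / sd(x)
df$warmth_index <- (z(df$thermometer) + z(df$giving)) / 2   # average of z-scores
# With many measures, the first principal component is an alternative:
# df$warmth_pc1 <- prcomp(cbind(z(df$thermometer), z(df$giving)))$x[, 1]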
The principle of verisimilitude can be used to turn many old-school lab experiments into modern survey experiments that have better external validity and are far, far cheaper and simpler to reason about.
Finally, experimental economists should understand that the study of human behavior is not fundamentally about “money.” Surely, financial incentives are one way of making a design incentive compatible (and they do matter, especially when dealing with beliefs or perhaps with experimenter demand effects). However, human action can be revealed in a multitude of non-financial ways: participants choosing to read more about a topic? That’s revealed. Choosing to wear an “I voted” button? That’s revealed. Choosing to interact with the outgroup? That’s revealed. How we interpret this revealed action is, of course, another question; this is where theory really comes in. But money should not be the be-all and end-all of experimentalists’ attention.
The Golden Mean does not apply to experiments
An important insight that I learned far too late is that your treatments must always be as strong as humanly possible, under two crucial constraints: (i) no deception and (ii) no demand effects (unless you purposely want to test for them).
Do not implement “intermediate” treatments. It is wasteful: statistically speaking, intermediate treatments have smaller effect sizes, decreasing power and/or increasing required sample sizes.
In my experience, people sometimes feel bad about strong treatments. It is worth pondering why. Is it because a maximally strong treatment veers into demand territory? Then make a sharp left turn before that happens.
Is it because an extreme treatment might cause another, non-demand-related, psychological phenomenon that counteracts or artificially inflates a “true” effect? This kind of introspection is actually invaluable for theory-building! Perhaps you have found a new truth about human behavior simply by exaggerating a stimulus. In other words, if you are reluctant about stronger treatments, that may be because your theory of human behavior is too fragile, and needs a stronger foundation.
My paper Knowledge and Freedom started out with a very complicated design where draws were made from a lottery, the results were then shown, and participants had to imagine someone else seeing any kind of lottery outcome for a given number of draws, etc., etc. The insight that extreme treatments matter not only significantly improved the analysis, but also allowed me to get rid of all these complicated aspects of the design! Now the whole design is simply predicated on whether the other person knows everything (basically, an infinite number of “draws”) or nothing (0 “draws”). That’s it. Much easier, much stronger, much better.
Summing up the previous two sections: measure richly but manipulate utterly.
Avoid nonparametric tests like the plague
Nonparametric tests are one of the most dangerous infohazards of our field. I want to show four things in this section: (i) nonparametric tests are often recommended based on severe econometric misconceptions; (ii) nonparametric tests test for things that don’t matter, generally speaking; (iii) nonparametric tests, even if used appropriately, have inferior statistical properties; (iv) parametric methods, and especially a particular kind of linear regression analysis with heteroskedasticity-consistent standard errors, are actually really good. In sum, nonparametric tests should not be used at all.
The following nonparametric tests are commonly used in experimental economics:
| Test | Synonyms | Use case | Implementations |
|---|---|---|---|
| Mann-Whitney U test | Wilcoxon rank-sum test, Mann-Whitney-Wilcoxon test | Two independent samples | R: wilcox.test(); Stata: ranksum |
| Wilcoxon signed-rank test | Paired Wilcoxon test | Paired samples | R: wilcox.test(paired=T); Stata: signrank |
| Kruskal-Wallis test | Kruskal-Wallis $H$ test | Multiple independent groups | R: kruskal.test(); Stata: kwallis |
| Kolmogorov-Smirnov test | K-S test | Comparing distributions | R: ks.test(); Stata: ksmirnov |
| Fisher’s exact test | Fisher-Irwin test | Binary outcomes | R: fisher.test(); Stata: tabi ..., exact |
| Spearman’s rank correlation | Spearman’s rho | Monotonic association | R: cor.test(method="spearman"); Stata: spearman |
Below, I will focus on Wilcoxon-type tests (this includes Kruskal-Wallis), as these are the epitome of testing in old-school experimental economics. The other tests have their own issues, but also some strengths; the K-S test is especially useful, though probably not for what you would expect.
The use of nonparametric tests relies on common misconceptions
Two misconceptions lead people, in my experience, to use or recommend nonparametric tests in our field. The first is that it is supposedly great and important not to make distributional assumptions. The second is that the thing being tested is somehow more relevant than with common parametric tests. The second claim is examined below; for now, let’s focus on the first.
A common way the first claim is made is as follows: “Real-life data are not normally distributed, hence we should use nonparametric tests, BAZINGA!” In fact, these were the exact words used when I first learned about nonparametric tests. Claims like that are virtually always made to show nonparametric tests’ superiority over one particular alternative test: the t-test.
Whenever I refer to “the t-test,” I am here specifically referring to Welch’s t-test, the default in R’s t.test. Never, ever assume equal variances. If your software assumes so by default, your software is bad.
This argument is not even wrong. It is simply irrelevant whether the underlying data are normally distributed. Rather, for the t-test to work, the mean of the data should be approximately normally distributed. It is of course true that if your data are normally distributed, then the mean itself is normally distributed. However, a remarkable result in statistics, the Central Limit Theorem (CLT), shows that under very weak conditions, the distribution of the mean of any data with finite mean and variance converges to a normal distribution.
It gets even better. In all experiments known to me, data are inherently bounded. It simply is never possible to report arbitrary numbers. Just think of the public goods game: you cannot give less than zero or more than your endowment. The same principle applies to all experimental data. It is easy to prove that a variable bounded from above and below satisfies all conditions of the CLT, so the mean of the variable converges to a normal distribution. Simply put, the elicitation of bounded data kills pathological behavior of statistical tests.
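To illustrate, here is a small simulation sketch with an assumed, heavily skewed but bounded data-generating process (think of contributions between 0 and 20): even at $n = 30$, the distribution of the sample mean already looks approximately normal.
set.seed(123)
n <- 30
# Bounded and heavily skewed "contributions" between 0 and 20
means <- replicate(10000, mean(20 * rbeta(n, 0.3, 2)))
hist(means, breaks = 50)      # approximately bell-shaped
qqnorm(means); qqline(means)  # close to the normal reference line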
Still, a commonly held misconception is that you need $n > 30$, or similar, for the CLT to apply. (This is an oversimplification: how large $n$ must be depends heavily on the underlying distribution’s skewness.) However that may be, this does not mean that the t-test fails below that threshold. Two cases should be distinguished here: Type I errors and Type II errors.
First, with respect to Type I errors: for small $n$ and roughly equal group sizes, both the degrees of freedom, $\nu$, and the estimate of the standard deviation, $\sigma$, lead to such an unbelievably conservative distribution of t that I am confident in saying that your test will not be oversized.
REWARD! If you are the first one to send me any R function at the indicated position subject to the conditions in the code, with the script running without error, the test at the end being rejected at $\alpha = 10^{-12}$, then I will admit defeat, update this post accordingly, and send a charity of your choice US$100:
library(pbapply)
set.seed(123)
my_dgp <- function () {
# YOUR CODE HERE!
# CONDITIONS:
# 1. Your function MUST NOT use any global variable.
# 2. Your function MUST NOT use any kind of global state.
# 3. Your function MUST NOT manually change the seed.
# 4. Your function MUST NOT in any way circumvent the independence of draws.
# EXAMPLE:
runif(1)
}
draw_independently <- function () {
replicate(10, my_dgp())
}
one_run <- function () {
x0 <- draw_independently()
x1 <- draw_independently()
x0 <- pmax(-10, pmin(10, x0))
x1 <- pmax(-10, pmin(10, x1))
stopifnot(length(x0) == 10)
stopifnot(length(x1) == 10)
t.test(x0, x1)$p.value < 0.05
}
reject <- pbreplicate(1000000, one_run())
final.test <- prop.test(x = sum(reject), n = length(reject), p = 0.05, alternative = "greater")
This offer was posted on 2026-01-02 and is still valid.
Second, with respect to Type II errors: if you ever have $n \lt 30$, that’s a You problem. Increase your sample size or improve your design more generally. That’s part of why power analyses are useful.
Nonparametric tests’ null hypotheses are poorly understood
A common claim about Wilcoxon-type tests is that they are tests on the median. In general, that is false. See also this. And this. If you are happy with making the additional assumptions required to reframe Wilcoxon-type tests as tests on the median, then why not simply make the minimal assumptions for the t-test (see above) and actually test something real?
Generally speaking, Wilcoxon-type tests are tests on stochastic dominance. While in some very specific instances, stochastic dominance may be what matters (see below), in general it is not.
By the way: even if nonparametric tests tested for the median—or if you’re willing to make the necessary assumptions—that would only indicate the difference in medians between your conditions, not the movement of a “baseline median” person to a different outcome given treatment. The latter interpretation requires additional assumptions (such as rank invariance, constant additive treatment effects, or other restrictions on the joint distribution of potential outcomes). More broadly, your theory would have to justify why medians matter at all.
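A concrete sketch with an assumed data-generating process: the two populations below have identical medians, yet the Wilcoxon rank-sum test rejects decisively, because it is sensitive to $P(X > Y) \neq \frac{1}{2}$ rather than to a difference in medians.
set.seed(123)
n <- 100000
x <- rexp(n, rate = 1)                  # population median = log(2)
y <- rnorm(n, mean = log(2), sd = 2)    # population median = log(2) as well
c(median(x), median(y))                 # both approximately 0.69
mean(x > y)                             # approximately 0.55, not 0.5
wilcox.test(x, y)$p.value               # essentially zero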
We know what t-tests do and that they have good properties
t-tests are tests on the mean. Means (or, rather, ATEs) matter a whole lot over a broad bandwidth of economic theory. And that’s that.
That t-tests have good properties was previously accomplished (warning: loud).
Linear regression with HC3 standard errors (“OLS-HC3”) is even better than t-tests
OLS is essentially an extension of the t-test. The t-test is just OLS with a binary treatment indicator. OLS generalizes this to multiple treatments, continuous controls, and factorial designs. It should be the natural framework for experimental analysis.
But what about heteroskedasticity? That problem is basically solved. Just use HC3 standard errors. They have excellent finite-sample properties. And yes, they are better than HC0, HC1, or HC2. Just use HC3. In R: lmtest::coeftest(model, vcov = sandwich::vcovHC(model, type = "HC3")). In Stata: reg y t, vce(hc3). Note that Stata by default uses HC1 if you just specify robust, so don’t do that.
OLS has many incredible benefits beyond robustness. You can easily include additional control variables to improve precision (see the ANCOVA section below). You can cluster standard errors by session, group, or any other unit. You can estimate multiple treatment effects simultaneously by using saturated regressions. Coefficients are directly interpretable. There are no convergence issues. Results are transparent and reproducible. OLS-HC3 should be the default workhorse for experimental economics. It just works.
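For example, a sketch of that workflow with hypothetical variable names (y, t, session, age); clustering by session uses sandwich::vcovCL, which I assume matches your design’s dependence structure:
library(sandwich)
library(lmtest)
set.seed(123)
# Hypothetical data: outcome y, treatment t, session identifier, one covariate
df <- data.frame(y = rnorm(200), t = rbinom(200, 1, 0.5),
                 session = rep(1:20, each = 10), age = sample(18:70, 200, TRUE))
m <- lm(y ~ t + age, data = df)
coeftest(m, vcov = vcovHC(m, type = "HC3"))            # HC3 standard errors
coeftest(m, vcov = vcovCL(m, cluster = df$session))    # clustered by session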
Note: there is a critical chapter on “robust” standard errors, including HC3, in Mostly Harmless Econometrics (Section 8.1). Their simulation results rely on very small sample sizes, unequal between treatments (literally $N_1 = 0.1 \cdot 30 = 3$ in one of the treatments), and should be viewed as extreme and purely pedagogical. Also, interestingly, the newer HC4 standard errors (but not HC5!) mitigate the issue. I attach R code below that replicates the results in their final column for the cases of (i) no and (ii) lots of heteroskedasticity. If you change FACTOR or TREATED, you can see how well HC3 in fact performs under even slightly more realistic scenarios despite low sample sizes.
library(sandwich)
library(lmtest)
FACTOR <- 1     # scales the total sample size (their simulation uses N = 30)
TREATED <- 0.1  # share of observations assigned to treatment
N <- FACTOR * 30
N0 <- round((1 - TREATED) * N)
N1 <- N - N0
nsim <- 25000
D <- c(rep(0, N0), rep(1, N1))  # treatment indicator
simulate <- function(sigma) {
set.seed(123)
reject <- matrix(0, nsim, 2 + 5)
colnames(reject) <- c("classical", "HC0", "HC1", "HC2", "HC3", "HC4", "HC5")
for (i in 1:nsim) {
e <- ifelse(D == 0, rnorm(N, 0, sigma), rnorm(N, 0, 1))
Y <- e # beta0 = 0, beta1 = 0 under null
m <- lm(Y ~ D)
t_class <- coeftest(m)[2, 3]
t_hc <- sapply(paste0("HC", 0:5), function(h) coeftest(m, vcov = vcovHC(m, h))[2, 3])
reject[i, ] <- abs(c(t_class, t_hc)) > qt(0.975, N - 2)
}
colMeans(reject)
}
cat("No heteroskedasticity (sigma=1):\n")
print(simulate(1))
cat("\nLots of heteroskedasticity (sigma=0.5):\n")
print(simulate(0.5))
Even the analysis of binary outcomes using OLS-HC3 is probably fine (with simple designs)
A common concern is that when your outcome variable is binary (0 or 1), you “should” use logit or probit instead of OLS. This concern is largely misplaced for experimental work with simple designs.
The linear probability model (LPM, just OLS with a binary dependent variable) has many advantages: (i) coefficients are directly interpretable as percentage point changes in probability; (ii) there are no convergence issues; (iii) it does not impose functional form assumptions about how treatment effects vary across the probability distribution; (iv) with HC3 standard errors, it is robust to heteroskedasticity (which is inherent with binary outcomes).
The main remaining criticism of the LPM is that fitted values can fall outside $[0, 1]$. I have never seen out-of-range fitted values matter in experimental economics. And if you need to extrapolate beyond the observed data, you must choose an appropriate data-generating process anyway.
For simple treatment comparisons, the LPM with HC3 standard errors is an excellent default choice. It is transparent, robust, and interpretable. It just works. Logit and probit models should be reserved for cases where (i) you have strong theoretical reasons to impose a specific functional form (such as from utility theory), or (ii) you are working with observational data where extrapolation matters.
One final note: if you do use logit or probit, report marginal effects, not raw coefficients. Raw coefficients from nonlinear models are nearly impossible to interpret and compare across studies. Marginal effects at the mean (or average marginal effects) restore interpretability.
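For illustration, a sketch with hypothetical data; the average marginal effect is computed by hand from predicted probabilities, so it does not rely on any particular marginal-effects package:
set.seed(123)
n <- 1000
t <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.5 + 0.6 * t))       # binary outcome
lpm <- lm(y ~ t)
coef(lpm)["t"]                                  # percentage-point effect, directly
logit <- glm(y ~ t, family = binomial)
coef(logit)["t"]                                # log-odds scale: hard to interpret
ame <- predict(logit, data.frame(t = 1), type = "response") -
  predict(logit, data.frame(t = 0), type = "response")
unname(ame)                                     # equals the LPM coefficient here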
Interactions (and why you probably shouldn’t care about them)
As I mention below in the section on factorial designs, interactions are fundamentally model-dependent. An interaction that is significant in OLS may be insignificant in logit, or vice versa.
Moreover, interactions are almost always underpowered ex ante.
In factorial designs, as discussed further below, you should include the interaction term in your regression to avoid functional form misspecification, but you should typically focus on main effects or simple effects for interpretation. The interaction coefficient itself is rarely of interest. Feel free to eliminate conditions where multiple treatments are active if you do not need the resulting comparisons.
Small design changes can vastly improve statistical inference
One of the highest-return design modifications you can make is to collect a baseline measure of your outcome variable before randomizing participants into treatment. Then, include this baseline measure as a control variable in your analysis. This approach is called ANCOVA (analysis of covariance) and it can dramatically improve statistical power.
Consider the standard regression for a randomized experiment:
\begin{equation} y_i = \beta_0 + \tau T_i + \gamma \pmb{X}_i + \varepsilon_i \end{equation}

Here, $y_i$ is your outcome, $T_i$ is the treatment indicator, and $\pmb{X}_i$ represents any additional control variables (demographics, etc.). Now suppose you collect a baseline measure $y_i^0$ of the outcome before treatment assignment. You can then estimate:

\begin{equation} y_i = \beta_0 + \beta_1 y_i^0 + \tau T_i + \gamma \pmb{X}_i + \varepsilon_i \end{equation}

This helps because $y_i^0$ absorbs individual-level variation in the outcome. If people differ substantially in their baseline levels, the inclusion of $y_i^0$ reduces the residual variance $\text{Var}(\varepsilon_i)$. This directly shrinks the standard error of $\hat{\tau}$, increasing your statistical power without requiring a larger sample!
The gains can be enormous. But timing matters: the baseline measure must be collected before randomization. And even if you cannot elicit the outcome itself at baseline, any pre-treatment variable that is correlated with the outcome will help. For example, if your outcome is post-treatment donations, a baseline measure of past donations or general prosociality will still improve precision.
Measurement error in $y_i^0$ attenuates $\beta_1$ but does not bias $\hat{\tau}$, because $T_i$ is randomized and thus by construction uncorrelated with $y_i^0$ and its measurement error.
Summing up, if your experiment allows for it, always collect a baseline measure. There is essentially no downside. Preregister the inclusion of $y_i^0$ and always include it in your main specification.
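Here is a small simulation sketch of that precision gain, under an assumed data-generating process in which the baseline measure explains about half of the outcome variance:
set.seed(123)
one_se <- function () {
  n <- 200
  y0 <- rnorm(n)                    # baseline measure of the outcome
  t <- rbinom(n, 1, 0.5)            # random assignment
  y <- 0.2 * t + y0 + rnorm(n)      # outcome with a true effect of 0.2
  c(raw = summary(lm(y ~ t))$coefficients["t", 2],
    ancova = summary(lm(y ~ t + y0))$coefficients["t", 2])
}
rowMeans(replicate(1000, one_se())) # the ANCOVA standard error is much smaller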
Carefully design comparisons in factorial designs (e.g., 2×2)
set.seed(123)
meanwhere <- function (data, f1, f2) {
mean(data[data$g == f1 & data$h == f2, "y"])
}
n <- 100
data <- data.frame(g = numeric(n), h = numeric(n), y = numeric(n))
data$g <- rbinom(n, 1, 0.5)
data$h <- rbinom(n, 1, 0.5)
data$y <- rnorm(n)
tab <- xtabs(~ g + h, data)
# Benchmark
m00 <- meanwhere(data, 0, 0)
m01 <- meanwhere(data, 0, 1)
m10 <- meanwhere(data, 1, 0)
m11 <- meanwhere(data, 1, 1)
## True main effects
eg <- mean(c(m10, m11)) - mean(c(m00, m01))
eh <- mean(c(m01, m11)) - mean(c(m00, m10))
# Models
mwrongnointeract <- lm(y ~ g + h, data)
mwrong <- lm(y ~ g + h + g:h, data)
## Using effect/deviation coding
data$g <- data$g - 0.5
data$h <- data$h - 0.5
mcorrect <- lm(y ~ g + h + g:h, data)
Suppose you have two factors, $g$ and $h$. Both are dummy variables (1 if “turned on” and 0 otherwise). Participants are independently randomly assigned to $g$ and $h$. As is clear, this is a 2×2 between-subjects design. Moreover, you have your outcome, $y$. What can you do?
There are three standard effects in 2×2 designs: main effects, simple effects, and interaction effects.
Economists are inherently (and rightly) suspicious of interaction effects, and thus I will not be covering them. A core challenge with interpreting interactions is that they work only for certain outcome variables and models. With binary outcome variables, for example, an interaction that is not significant in the linear probability model/with OLS may be significant in a logit model, a probit model, both, or neither. It is a huge unresolved and likely unresolvable mess. Never rely on the interaction in a factorial design unless you fully understand the model to be used on the outcome.
Simple effects are simple indeed; they refer to the effect of one factor at a fixed level of the other. In R, just run lm(y ~ h, data = df[df$g == 1, ]) (with optional control variables, see above) and go get lunch. Your day is complete.
Main effects are not so simple. The main effect of, say, $g$, requires you to average over $h$ with some weights. If these weights can be derived from a policy question or your theoretical framework, great! Otherwise, just use equal weights.
Denote by $\mu_{g,h}$ the population mean in some factorial treatment $g, h$. Then, the main effects of $g$ and $h$ are defined as follows:
\begin{align} \tau_g^{\text{M}} &= \left[ w_1 \mu_{1,0} + (1-w_1) \mu_{1,1} \right] - \left[ w_1 \mu_{0,0} + (1-w_1) \mu_{0,1} \right] \\ \tau_h^{\text{M}} &= \left[ w_2 \mu_{0,1} + (1-w_2) \mu_{1,1} \right] - \left[ w_2 \mu_{0,0} + (1-w_2) \mu_{1,0} \right] \end{align}

In the following, I assume $w_1 = w_2 = \frac{1}{2}$. As I argued above, linear regressions are an excellent method to analyze your experiment. How, then, can we get $\tau_g^{\text{M}}$ and $\tau_h^{\text{M}}$ from a neat linear regression?
Unfortunately, this linear model is simply wrong:
\begin{equation} y_i = \beta_0 + \beta_1 g_i + \beta_2 h_i + \beta_3 g_i h_i + \varepsilon_i \end{equation}

Crucially, with main effects, it is not correct to just throw your (binary) dummy treatment indicators into the linear model as-is. That would only be correct if you did not use an interaction term in your linear model. Given my skepticism about interaction terms, you may ask why I would ever propose using one of these! The reason is simple: including the interaction term makes your model saturated. It can exactly recover all four cell means without imposing any functional form restrictions. If you omit the interaction term, you assume additivity; if this assumption is wrong, your main effect estimates will be biased. Including the interaction term protects against this misspecification. Therefore, you should always include the interaction term but (probably) ignore the coefficient on it.
In general, $\beta_1 \neq \tau_g^{\text{M}}$ and $\beta_2 \neq \tau_h^{\text{M}}$. Why is that? Simple:
\begin{equation} \frac{\partial y_i}{\partial g_i} = \beta_1 + \beta_3 h_i \label{marginal} \end{equation}

This marginal effect depends on the value of $h_i$! Only at $h_i = \frac{1}{2}$ does the marginal effect equal the main effect. So, we must transform $g_i$ and $h_i$ as follows:

\begin{align} g'_i &= g_i - \frac{1}{2} \\ h'_i &= h_i - \frac{1}{2} \end{align}

This effect coding or deviation coding ensures that the marginal effect is equal to the main effect, since we can now ignore the final term in the equivalent of Equation \eqref{marginal}. The following linear model recovers the main effects:

\begin{equation} y_i = \beta_0 + \tau_g^{\text{M}} g'_i + \tau_h^{\text{M}} h'_i + \Xi g'_i h'_i + \varepsilon_i \end{equation}

Needless to say, you can include further control variables as usual (and should ignore $\Xi$). However, transforming the dummies is crucial; as long as you do that, your coefficients are meaningful! See here and here for excellent references.
set.seed(123)
meanwhere <- function (data, f1, f2, f3) {
mean(data[data$g == f1 & data$h == f2 & data$j == f3, "y"])
}
n <- 100
data <- data.frame(g = numeric(n), h = numeric(n), j = numeric(n), y = numeric(n))
data$g <- rbinom(n, 1, 0.5)
data$h <- rbinom(n, 1, 0.5)
data$j <- rbinom(n, 1, 0.5)
data$y <- rnorm(n)
tab <- xtabs(~ g + h + j, data)
# Benchmark
m000 <- meanwhere(data, 0, 0, 0)
m001 <- meanwhere(data, 0, 0, 1)
m010 <- meanwhere(data, 0, 1, 0)
m011 <- meanwhere(data, 0, 1, 1)
m100 <- meanwhere(data, 1, 0, 0)
m101 <- meanwhere(data, 1, 0, 1)
m110 <- meanwhere(data, 1, 1, 0)
m111 <- meanwhere(data, 1, 1, 1)
## True main effects
eg <- mean(c(m100, m101, m110, m111)) - mean(c(m000, m001, m010, m011))
eh <- mean(c(m010, m011, m110, m111)) - mean(c(m000, m001, m100, m101))
ej <- mean(c(m001, m011, m101, m111)) - mean(c(m000, m010, m100, m110))
# Models
mwrongnointeract <- lm(y ~ g + h + j, data)
mwrong <- lm(y ~ g + h + j + g:h + g:j + h:j + g:h:j, data)
## Using effect/deviation coding
data$g <- data$g - 0.5
data$h <- data$h - 0.5
data$j <- data$j - 0.5
mcorrect <- lm(y ~ g + h + j + g:h + g:j + h:j + g:h:j, data)
Null effects can in fact be quantified
A common reaction to a non-significant result is utter despair. That reaction is often wrong, or at least premature. A p-value above 0.05 tells you that you failed to reject the null hypothesis. It does not tell you that the effect is zero, small, or negligible; it says nothing about effect size.
A critical distinction must be made between underpowered null results (which are in general not so valuable) and well-powered, tightly bounded null results (which are valuable). If your study had 20% power to detect a small effect and you find $p = 0.5$, you have learned essentially nothing: your confidence interval will be wide, spanning large positive and negative effects. Conversely, if your study has 90% power and you find $p = 0.5$ with a confidence interval of $[-0.05, 0.15]$ in standardized units, you have learned something important: the effect, if it exists at all, is small. Let’s not conflate these two cases!
Null results can be valuable, but only if they are informative. An informative null result rules out effects larger than some threshold. To claim an informative null in a frequentist framework, you must demonstrate that your confidence interval is sufficiently narrow.
There are two main approaches to quantifying null effects: equivalence testing (TOST) and Bayesian methods. Both allow you to make positive claims about the absence or negligibility of an effect, rather than merely failing to reject the null.
Using TOST
# (1) git clone https://github.com/mrpg/heizg.git
# (2) Install dependencies (see README)
# (3) setwd/change working directory to stats/ within the repository
source("tables.R")
data2 <- filter(data, treat == "control" | treat == "full")
# Get estimates
m1 <- lm(heizg_good ~ treat, data2)
robust <- coeftest(m1, vcov = vcovHC(m1, "HC3"))
tcrit <- qnorm(0.95)
ci90 <- robust[2, 1] + c(-1, 1) * tcrit * robust[2, 2]
# Get pooled SD (for Cohen's d threshold)
stats <- data2 %>%
group_by(treat) %>%
summarise(n = n(), var = var(heizg_good))
sd_pooled <- sqrt(sum((stats$n - 1) * stats$var) / (sum(stats$n) - 2))
ci90_d <- ci90 / sd_pooled
The two one-sided tests (TOST) procedure tests whether your effect lies within a pre-specified equivalence region. Instead of testing $H_0: \tau = 0$, you test whether $\tau$ is practically equivalent to zero by checking if the confidence interval for $\tau$ lies entirely within some interval $[-\Delta, \Delta]$, where $\Delta$ is the smallest effect size you consider meaningful.
The conventional choice is $\Delta = 0.2$ in standardized units (Cohen’s $d$), though you should justify this threshold based on your research context. A narrower equivalence region (say, $\Delta = 0.1$) makes a stronger claim but requires more statistical power. The TOST procedure uses a 90% confidence interval, which corresponds to two one-sided tests at $\alpha = 0.05$.
Here is an example (from the code above):
To assess whether the effect is practically negligible, we report the 90% confidence interval, which corresponds to a two one-sided tests (TOST) equivalence procedure at $\alpha = 0.05$ (Lakens et al., 2018). The estimated coefficient is $-0.032$ ($\text{SE} = 0.072$), yielding a 90% CI of $[-0.150, 0.086]$. Given the pooled outcome standard deviation of $1.329$, this corresponds to $[-0.113, 0.065]$ in standardized units (Cohen’s $d$). Using the conventional threshold of $|d| = 0.2$ as the smallest effect size of interest, the 90% CI lies entirely within the equivalence region $[−0.2, 0.2]$, supporting the conclusion of practical equivalence.
Using Bayesian t-tests
# (1) git clone https://github.com/mrpg/heizg.git
# (2) Install dependencies (see README)
# (3) setwd/change working directory to stats/ within the repository
source("tables.R")
set.seed(123)
library(BayesFactor)
# Conventional
power_high <- t.test(data$heizg_good[data$treat == "control"], data$heizg_good[data$treat == "full"])
power_low <- t.test(sample(data$heizg_good[data$treat == "control"], 50),
sample(data$heizg_good[data$treat == "full"], 50))
# Bayesian t-test
power_high_bayes <- ttestBF(data$heizg_good[data$treat == "control"], data$heizg_good[data$treat == "full"])
power_low_bayes <- ttestBF(sample(data$heizg_good[data$treat == "control"], 50),
sample(data$heizg_good[data$treat == "full"], 50))
My personal intuitionist view of evidence is as follows: any study should yield one of three conclusions: evidence for X, evidence for Y, or indeterminate. Bayesian tests deliver on all three possibilities. A Bayes Factor (BF) can provide evidence for the null, evidence for the alternative, or remain inconclusive ($\text{BF} \approx 1$). Frequentist methods, by contrast, do not distinguish between evidence for $H_0$ and poor data. This is why I prefer Bayesian approaches for quantifying null effects, though TOST serves a similar purpose when properly powered. Moreover, Bayesian methods come with batteries included. As I like to say: “Bayesian methods solve all problems, for free.” It’s true. Really!
Simply put, Bayesian methods provide an alternative approach by computing a Bayes Factor, which quantifies the relative evidence for the null hypothesis versus the alternative. A $\text{BF}_{01} > 3$ suggests moderate evidence for the null, while $\text{BF}_{01} > 10$ suggests strong evidence. Conversely, $\text{BF}_{01} \lt \frac{1}{3}$ (equivalently, $\text{BF}_{10} > 3$) suggests at least moderate evidence for an effect. Read more here.
Bayesian t-tests (e.g., as implemented in the BayesFactor package in R) specify a prior distribution on effect sizes and compute how much the data update your beliefs. The advantage of this approach is that you can actually claim evidence for the null, not just failure to reject it. The disadvantage is that results depend on your choice of prior. Therefore, always use the now-standard default prior of $\text{Cauchy}\left(0, \frac{\sqrt{2}}{2}\right)$ and conduct robustness analyses.
Here is an example (from the code above, same data as with the TOST example):
We conducted a Bayesian independent samples t-test (Rouder et al., 2009) using a default prior with scale parameter $r = \frac{\sqrt{2}}{2} \approx 0.707$ (Cauchy distribution centered at zero). The Bayes Factor strongly favored the null hypothesis, with $\text{BF}_{01} = 15.0$ (equivalently, $\text{BF}_{10} = 0.067$), indicating that the observed data are 15 times more likely under the null hypothesis of no effect than under the alternative hypothesis. This constitutes strong evidence for the absence of an effect. To assess robustness to prior specification, we conducted sensitivity analyses with alternative scale parameters: $r = 0.5$ yielded $\text{BF}_{10} = 0.094$ ($\text{BF}_{01} = 10.7$); $r = 1.0$ yielded $\text{BF}_{10} = 0.047$ ($\text{BF}_{01} = 21.1$); and $r = 2.0$ yielded $\text{BF}_{10} = 0.024$ ($\text{BF}_{01} = 42.2$). Across all specifications, we obtained consistent strong evidence favoring the null hypothesis ($\text{BF}_{01} > 10$ in all cases), demonstrating that our conclusion is robust to reasonable prior choices.
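A sketch of how such a sensitivity analysis can be produced with BayesFactor (hypothetical data with no true effect; rscale is the Cauchy prior scale argument of ttestBF):
library(BayesFactor)
set.seed(123)
x0 <- rnorm(300)                        # hypothetical outcomes, control
x1 <- rnorm(300)                        # hypothetical outcomes, treatment (no true effect)
bf10 <- sapply(c(0.5, sqrt(2) / 2, 1, 2),
               function (r) extractBF(ttestBF(x0, x1, rscale = r))$bf)
round(1 / bf10, 1)                      # BF01 across prior scales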
In practice, TOST and Bayesian methods often agree qualitatively. If you have a well-powered study with tight confidence intervals, TOST will show equivalence and the Bayes Factor will favor the null (as in the example above). If your study is underpowered, neither method will save you: TOST will fail to show equivalence and the Bayes Factor will be inconclusive (close to 1). The lesson is simple: power matters for null results just as much as for positive results. Design your study properly.
Know what you are doing, measuring and estimating
Any experiment starts with a theory of human behavior. This theory may not be mathematical, but it is always a statement about counterfactual behavior. The section title promises three requirements: know what you are doing (treatment design), know what you are measuring (outcome elicitation), and know what you are estimating (estimands). All three must align with your theory’s objects.
Know what you are doing
For example, suppose that you want to study workers’ effort when provided with information about coworkers’ effort. Now this is clearly an experiment about workers’ beliefs. These beliefs are changed through information, and then those beliefs are (probably) thought to induce a response in effort. I have talked more above about some recommendations for such experiments, but for now let us just focus on treatment design.
Putting aside questions of deception, a simple experiment could tell some participants (i) that their group of coworkers had an average effort level of $e_1$ and some other participants (ii) that their group of coworkers had an average effort level of $e_2$, where $e_2 \lt e_1$. Do you see a problem with this design?
This design is built on the assumption that workers’ effort level is shaped by beliefs about the average effort of others. But it could well be that their effort responds to the minimum effort, the maximum effort, or any other statistic of the distribution.
If your theory implies that only beliefs about averages should matter, then such a design is wholly appropriate. And even if people’s effort is in fact a function of the minimum, the minimum could still be correlated with the average, so the design may “work” accordingly. Still, beliefs are extraordinarily rich objects, and any particular experimental configuration of beliefs must grapple with that complexity. As Einstein put it: “theory decides what we can observe.” So true!
Formal mathematical theory has the great feature that it forces you to make explicit assumptions. Formal theory constrains an experimenter’s degrees of freedom. This is why theory is invaluable even when not strictly necessary: it disciplines experimental design.
Know what you are measuring
Your outcome measures must be as close as possible to theoretically relevant objects. The same principle applies here as with treatment design: theory dictates what is relevant.
In the absence of formal theory, or if theory makes no predictions for a particular object, your planned analyses can serve as a reduced-form guide to what you need to measure. A key principle: comparisons are only valid when the objects being compared are of the same kind. The null hypothesis must be a sensible benchmark.
Suppose you want to compare first-order beliefs (what I believe about others) with second-order beliefs (what I believe others believe about others). You elicit the first-order belief by asking: “What is the average effort of others?”
Now, how should you elicit the second-order belief? An invalid approach would be to ask: “What is the most common belief about the average effort of others?” This compares a mean (first-order) with a mode (second-order). These are two different statistics that need not coincide even under rationality.
A valid approach could ask: “What is the average belief about the average effort of others?” Now both elicitations concern means. Under common priors and rational expectations, the first-order and second-order beliefs should coincide. Any divergence could reveal false consensus, projection bias, or asymmetric updating. Crucially, the comparison is valid in principle because the null hypothesis (equality under rationality) is a meaningful benchmark.
Know what you are estimating
We have discussed estimands in greater detail above. For now, keep the following principle in mind, which I call the Fundamental Law of Experimental Economics:
If theory predicts a monotonic individual response, a zero average treatment effect (ATE) implies zero individual effects.
This law holds because the mean cannot hide offsetting responses when all individual effects have the same sign. Under random assignment, the ATE therefore directly tests the theory. This is a precise statement: the ATE is an expectation (an integral over the distribution of individual types) that identifies the causal effect of a stimulus. Random assignment “marginalizes out” unobservable theoretical parameters, leaving only the predicted effect.
The Fundamental Law immediately suggests how to design good experiments: isolate behavioral factors that, under theory, predict a monotonic change. Such factors deliver clean comparative statics. Conversely, if theory permits heterogeneous signs (some individuals respond positively, others negatively), a zero ATE is uninformative. It could reflect no effects, or it could reflect large offsetting effects. Without further restrictions, you cannot distinguish these cases.
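To see the last point concretely, a trivial sketch with purely illustrative numbers: a world with no individual effects and a world with large offsetting effects produce the same (near-)zero ATE.
set.seed(123)
tau_none <- rep(0, 10000)                        # no individual effects at all
tau_offsetting <- sample(c(-1, 1), 10000, TRUE)  # large effects of both signs
c(mean(tau_none), mean(tau_offsetting))          # both ATEs are approximately zero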
The upshot is that experimental design and statistical analysis must be in concordance. The estimands of interest must directly test the predicted monotonic relationships that your theory delivers. This is what it means to know what you are estimating.
Radical openness is likely costless or a free lunch (if you start early)
Transparency is free if you do things properly from the beginning. Do good theory. Write clean analysis code. Organize your files sensibly. When the time comes to publish a replication package, you will have virtually no extra work. The cost is zero. Actually, the cost is negative, because good practices save you time during the project itself!
Better yet: just upload sanitized data immediately after your experimental sessions to GitHub (if permitted). Make your analysis scripts public from day one. There is little reason to wait. If your IRB or data agreements prohibit immediate sharing, fine. Otherwise, default to radical openness. And no, you will not be “scooped.” Your project is, in all likelihood, not that interesting.
Preanalysis plans and preregistrations are also valuable. They protect you from accusations of p-hacking. They force you to think clearly about your design and analysis before seeing the data. They improve the credibility of your findings and make our whole science better. Also, by committing to analyses, sample sizes, and the like beforehand, you remove the need for many later decisions. Just do it.