<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <atom:link href="https://max.pm/posts/rss.xml" rel="self" type="application/rss+xml" />
        <title>Max R. P. Grossmann</title>
        <link>https://max.pm/posts</link>
        <description>Mostly facts and logic. And, sometimes, unbridled emotion.</description>
        <language>en</language>


        <item>
<title>“Interest first” is not a conspiracy (it’s a mathematical necessity)</title>
<link>https://max.pm/posts/loans/</link>
<guid>https://max.pm/posts/loans/</guid>
<description><![CDATA[



    
    


A common class of memes complains about banks’ “practice” of having homeowners and other borrowers pay interest first and principal only later.

These memes suggest that there must be some sort of conspiracy between bankers to subjugate borrowers. After all, why can’t I pay the principal first, thereby reducing the balance?

A common response to the meme is that homeowners entered this arrangement willingly. While that is generally true, it misses the point entirely. The payment schedule, with larger interest payments first, is mathematically determined and necessary. For a level-payment, fixed-rate, fixed-term, fully amortizing loan, the payment schedule is dictated by facts and logic! So, not only is there no conspiracy, and not only did borrowers voluntarily agree to the terms of a loan, but banks simply have no choice about the payment schedule!

Simply put, for a standard fixed-rate, fixed-term, level-payment fully amortizing loan, the interest portion is larger at the start because interest is computed on the outstanding balance, which is largest at the start.

We can prove this mathematically. Let’s consider a simple loan of $B_0$ in your favorite currency. The loan has a fixed periodic interest rate $r$ and term of $N$ periods. (None of these assumptions are crucial, but they simplify the exposition tremendously.)

Let $B_n$ reflect the balance after $n$ payments, where $n \in \{0, 1, \ldots, N\}$. Initially, the balance is $B_0$; and after all $N$ payments we want the loan paid off, so $B_N = 0$. How does the balance change from one period to the next? Simple. Interest accrues on the current balance, and then the payment $P$ is subtracted. Mathematically,

\begin{equation}
B_{n+1} = (1 + r)B_n - P.
\label{loanrate}
\end{equation}

Equation \eqref{loanrate} is what mathematicians call a first-order linear difference equation. Its general solution is $B_n = c(1+r)^n + \frac{P}{r}$ for some constant $c$, and the initial balance pins down $c = B_0 - \frac{P}{r}$. Since we also require $B_N = 0$, it follows after some simple calculations that $P$ must, mathematically, have the following value:

\begin{equation}
P = \frac{rB_0(1+r)^N}{(1+r)^N - 1}.
\label{payment}
\end{equation}
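
For readers who want the “simple calculations” spelled out, here is a sketch of the steps. It uses only the symbols defined above and assumes $r > 0$:

\begin{align*}
B_0 = c + \frac{P}{r} &\implies c = B_0 - \frac{P}{r},\\
B_N = \left(B_0 - \frac{P}{r}\right)(1+r)^N + \frac{P}{r} = 0 &\implies \frac{P}{r}\left[(1+r)^N - 1\right] = B_0(1+r)^N\\
&\implies P = \frac{rB_0(1+r)^N}{(1+r)^N - 1}.
\end{align*}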

Under this setup, this is the single possible payment! It is literally dictated by the laws of mathematics. Let’s inspect it further. In period $n$, interest owed is $I_n = rB_{n-1}$ and principal paid is $\Pi_n = P - I_n$. Substituting the solution for $B_{n-1}$:

\begin{align*}
I_n &= (rB_0 - P)(1+r)^{n-1} + P\\
\Pi_n &= (P - rB_0)(1+r)^{n-1}
\end{align*}

If we look at $I_{n+1} - I_n = r(rB_0 - P)(1+r)^{n-1}$, we find that this expression is actually negative. (This is because $P > rB_0$ to amortize the loan over time, so that the loan is actually paid off at $N$.) The implication is that the unique possible payment schedule indeed starts off with a "high" interest portion that subsequently trails off.

However, $\Pi_{n+1} - \Pi_n = r(P - rB_0)(1+r)^{n-1}$ is positive! Thus, it is correct that, over time, the portion paid to the principal increases.
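
To see the schedule emerge from the recursion alone, here is a minimal Python sketch. The loan figures (300,000 at 0.5% per month over 360 months) and the function names are hypothetical and chosen purely for illustration.

```python
def level_payment(b0, r, n_periods):
    """Payment P from the closed-form expression above (assumes r > 0)."""
    growth = (1 + r) ** n_periods
    return r * b0 * growth / (growth - 1)


def schedule(b0, r, n_periods):
    """Return (payment, rows); each row is (n, interest, principal, balance)."""
    payment = level_payment(b0, r, n_periods)
    balance = b0
    rows = []
    for n in range(1, n_periods + 1):
        interest = r * balance           # I_n = r * B_{n-1}
        principal = payment - interest   # Pi_n = P - I_n
        balance = balance - principal    # same as (1 + r) * B_{n-1} - P
        rows.append((n, interest, principal, balance))
    return payment, rows


# Hypothetical loan: 300,000 at 0.5% per month over 360 months.
payment, rows = schedule(300_000.0, 0.005, 360)
for n, interest, principal, balance in (rows[0], rows[179], rows[359]):
    print(f"n={n:3d}  interest={interest:9.2f}  principal={principal:9.2f}  balance={balance:12.2f}")
```

Nothing in the loop decides to put interest first. The split falls out of charging $r$ on whatever balance is still outstanding.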

So the next time someone complains about banks’ “practice” of having you “pay interest first”: There is no practice. There is no policy. There is no choice. The payment schedule is not a decision anyone made. It is a mathematical necessity, as unavoidable as $2 + 2 = 4$.

What if you tried to “pay principal first”?

Suppose you demanded that your bank let you pay principal first, meaning you wanted $\Pi_n$ to start high and decrease over time, rather than the other way around. What would happen?

For $\Pi_n = (P - rB_0)(1+r)^{n-1}$ to be decreasing, we would need $P - rB_0 \lt 0$, i.e., $P \lt rB_0$. But look at what this implies for the balance. After the first period, $B_1 - B_0 = rB_0 - P > 0.$

The balance is increasing. Your payment does not even cover the interest! This is called negative amortization, where the debt grows rather than shrinks. Far from paying the loan off faster, you’re falling further behind!

Worse still, with $P \le rB_0$, the loan cannot be paid off in finite time. The boundary condition $B_N = 0$ becomes impossible to satisfy for any finite $N$. You would have to either increase your payment (back above $rB_0$, restoring the “interest-first” structure), extend the term to infinity (possible only for $P = rB_0$), or default. No bueno. And once again, this is simply mathematics. No politician can save you from plain mathematics.
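
The three cases can be checked by iterating the balance equation directly. This is a small sketch with made-up numbers (a balance of 100,000 at 1% per period, so the first period’s interest is 1,000):

```python
def balances(b0, r, payment, n_periods):
    """Iterate B_{n+1} = (1 + r) * B_n - P and return [B_0, ..., B_N]."""
    path = [b0]
    for _ in range(n_periods):
        path.append((1 + r) * path[-1] - payment)
    return path


b0, r = 100_000.0, 0.01                        # hypothetical loan
too_low = balances(b0, r, 900.0, 12)           # P < r * B_0: the balance grows
interest_only = balances(b0, r, 1_000.0, 12)   # P = r * B_0: the balance never moves
amortizing = balances(b0, r, 1_100.0, 12)      # P > r * B_0: the balance shrinks

for label, path in (("P < rB0", too_low), ("P = rB0", interest_only), ("P > rB0", amortizing)):
    print(f"{label}: balance after 12 periods = {path[-1]:.2f}")
```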

In other words, “pay principal first” is not an alternative payment schedule. The assumptions that define a standard amortizing loan (a fixed rate, fixed term, fixed payment, balance paid off at maturity) mathematically require the interest-heavy-first structure. You cannot violate it without violating one of those assumptions.

It is true that borrowers are often allowed to pay extra principal early (prepayments/curtailments). That changes the balance path and reduces total interest, which may of course be sensible. But it is not mathematically possible to merely increase the percentage of $P$ being spent on $\Pi_n$ without violating important assumptions of the model. Similarly, borrowers can often opt for interest-only payments to temporarily reduce payments. Once again, there is no free lunch: such arrangements require either higher payments later, an extended term, or a balloon payment at maturity. Other loan structures can violate other assumptions of the model, which may be individually acceptable, but the fundamental equations dictated by mathematics do not and cannot change.
]]></description>
</item>
<item>
<title>LEDs on consumer devices shouldn’t double as aircraft beacons</title>
<link>https://max.pm/posts/bright-leds/</link>
<guid>https://max.pm/posts/bright-leds/</guid>
<description><![CDATA[
I recently purchased an Alogic ULCGE-SGR USB-C to Ethernet Adapter.

The device itself works OK-ish, but I believe the link light along with some nice headphones could be used to land planes.

Here’s an idea: for the vast majority of people, Ethernet adapters only need a light when there’s no link. Just go dark when everything works as intended.

It’s time for manufacturers of electric devices to stop contributing to light pollution.
]]></description>
</item>
<item>
<title>Steering clear of experimental economics hazards</title>
<link>https://max.pm/posts/experimental-hazards/</link>
<guid>https://max.pm/posts/experimental-hazards/</guid>
<description><![CDATA[


This page is a perpetual work-in-progress that presents my personal views on experiments in economics and what, to me, makes a good experiment, analysis, and paper.



    


Treatments matter only to the extent of their mutual differences

A core idea in experimental design is that only contrasts matter. This is one reason why the term “Control” is so misleading. “Control” is simply one more treatment. It does not matter except for its differences from other treatments. (Whenever I speak of “two treatments”, I am referring to the classic design where you have one baseline/control and one extra treatment, often referred to as the treatment.)

More generally, I frequently see the following pattern: the baseline condition contains some established standard treatment, and the “treatment” is about a novel mechanism, approach, or joie de vivre. In many cases, the treatment differs in more than just the mechanism, approach, or joie de vivre. For example, if the treatment truly is novel, participants may simply be more experienced with or accustomed to the baseline. Or, if instructions differ (for example, the baseline has one fewer page of instructions), that is another difference. The (whole) contrast between treatments defines the interpretation of a treatment effect.

The ideal experiment changes exactly one thing. How can we get as close as possible to that ideal?

One approach is to explain more than is necessary. For example, when testing a classical mechanism against a novel mechanism, instructions may simply explain both mechanisms, and subsequently transparently randomize participants into either condition. That approach is not always sensible, but it does eliminate particular kinds of confounds.

Another problem that I see is that sometimes the comparisons implied by treatments simply do not matter or are difficult to interpret. For example, in 2×2 designs (more on them below), the comparisons relating to the interaction may well be irrelevant, or, in the case of binary outcomes, we may not know how to reason about them. In that case, it can be proper to eliminate that fourth treatment, and keep only the three treatments in the upper left of the 2×2 matrix. If so, it can be sensible to increase the sample size for the north-western treatment if both remaining comparisons use it as baseline.

That said, the fourth treatment can often be used productively, though typically only in terms of simple effects (i.e., against the off-diagonal treatments). More generally, treatments just do not matter; only treatment differences do, and you should focus on engineering those properly!

Use active controls in information provision experiments

Information provision experiments have a special hazard. Consider a simple design where you inform some participants about $X$ and leave others uninformed. What are you identifying? You identify the change from pre-existing beliefs to informed beliefs. But here is the problem: you have no control over the pre-existing belief. Some participants may believe $X$ is higher than the truth. Others may believe $X$ is lower than the truth. Your treatment effect is inherently heterogeneous and depends on what participants happened to believe before your intervention.

This is bad. You are identifying a mixture of positive and negative belief shocks. The direction of your treatment is not controlled by you. It is controlled by participants’ prior beliefs. You can adjust for those econometrically, but that control has no causal nature.

The solution is to use an active control treatment. Instead of comparing “informed” versus “uninformed,” in most cases you should compare “informed high” versus “informed low.” Tell one group that $X$ is high. Tell another group that $X$ is low. Now the pre-existing belief is marginalized out. Both groups receive information. The treatment difference isolates only the treatment-induced difference in beliefs. This is the right way to run information experiments.
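
A toy simulation may make the contrast concrete. Everything in it is an assumption made for illustration: the values of `TRUTH`, `LOW`, and `HIGH`, the uniform distribution of priors, and the simple weighted-average updating rule with weight `WEIGHT`.

```python
import random
from statistics import mean

rng = random.Random(1)
TRUTH, LOW, HIGH = 50.0, 40.0, 60.0   # hypothetical true value of X and two credible signals
WEIGHT = 0.8                          # hypothetical weight participants put on a signal


def posterior(prior, signal):
    """Belief after seeing a signal: a weighted average of prior and signal."""
    return (1 - WEIGHT) * prior + WEIGHT * signal


priors = [rng.uniform(20, 80) for _ in range(4000)]   # heterogeneous priors around TRUTH
first_half, second_half = priors[:2000], priors[2000:]

# Passive control: informed vs. uninformed. Individual belief shocks go both ways.
shocks = [posterior(p, TRUTH) - p for p in first_half]
share_revised_up = sum(s > 0 for s in shocks) / len(shocks)
passive_contrast = mean(posterior(p, TRUTH) for p in first_half) - mean(second_half)

# Active control: "informed high" vs. "informed low". Priors difference out.
active_contrast = (mean(posterior(p, HIGH) for p in first_half)
                   - mean(posterior(p, LOW) for p in second_half))

print(f"share revising upward under passive design: {share_revised_up:.2f}")
print(f"passive contrast in mean beliefs: {passive_contrast:.2f}")
print(f"active contrast in mean beliefs:  {active_contrast:.2f}")
```

Under these assumptions, the passive design mixes upward and downward revisions, so its average contrast depends on where priors happen to sit. The active contrast is pinned down by the gap between the two signals.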

Needless to say, there must be some genuine uncertainty about the true state. For example, it can help to provide participants with credible forecasts or estimates. Participants should believe the information (and ideally believe it equally in both conditions). Also, there are cases where active control treatments are not good design choices. I have an example. The above JEL paper discusses such issues.

An additional benefit: the active control also cancels out the mere effect of providing any information at all. Receiving information may have psychological effects independent of content. The active control differences out this confound as well.

Don’t sacrifice verisimilitude for shiny objects

A common issue I observe is how quickly verisimilitude, that is, the appearance of truth, is sacrificed for golden calves.

The fundamental conflict is easy to understand. As experimentalists, we study human behavior. As economists, we study markets. Nonetheless, many experimental economists are interested in “uneconomic” topics (as am I, though I call these topics studies in Non-Market Decision Making). Here’s an example. You might want to study how an outgroup member’s viewpoints on apolitical topics affect people’s “warmth” towards the outgroup.

Since “warmth” is not a rigorous economic concept, economist experimenters tend to practice an understandable bait and switch. They replace “warmth” by “giving in a dictator game” or “trust in the trust game” or “reciprocity in the trust game.” The golden calf of incentive compatibility must not be sacrificed!

But what about just asking people how warm they feel? Needless to say, and as fairly represented on that Wikipedia page, the feeling thermometer is not the most world-historically rigorous measure and it has certain important issues. However that may be, it also has a core advantage: without doubt it relates to the concept of “warmth,” even if that “warmth” is imprecisely measured.

So, while the feeling thermometer is not very valid as an economic construct, it is of high conceptual validity. Contrarily, economic games like the dictator or trust games are highly valid economically, but they are weak as conceptual representations of the concept of “warmth.” There are various other advantages and disadvantages of each approach. An important one in favor of measures like the feeling thermometer is simplicity: it truly is trivial to understand. The same holds true for other “less rigorous” measures, such as Big 5 questionnaires. Interpersonal comparability seems to be mostly a theoretical concern; many such measures are far more valid than the average economic experimentalist believes, though economists do not and cannot understand “why” they work. The feeling thermometer in particular is predictive of real-world behavior; it correlates with real-world outcomes (voting behavior, policy support).

As so often, econometrics comes to the rescue. Is there a way to combine the conceptual richness of the feeling thermometer with the golden calf that is incentive compatibility? Yes! Just elicit both and combine them into an index, assuming that the measures are noisy signals of a common latent variable. A solid approach is to use an average of z-scores, but there are other methods such as principal component analysis or factor analysis. The latter is valuable if many measures have been collected. Needless to say, any such approach must be preregistered, but it is highly valid.
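
As a rough sketch of the z-score average, consider the following. The two measures and their values are hypothetical, and both are assumed to be coded so that higher numbers mean more “warmth.”

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize a measure to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def zscore_index(*measures):
    """Average the z-scores of several measures, participant by participant."""
    standardized = [zscores(m) for m in measures]
    return [mean(row) for row in zip(*standardized)]


# Hypothetical data for five participants.
thermometer = [20, 45, 50, 70, 90]            # feeling thermometer, 0-100
dictator_giving = [0.0, 2.0, 1.0, 4.0, 5.0]   # amount given out of 10
index = zscore_index(thermometer, dictator_giving)
print([round(v, 2) for v in index])
```

The resulting index can then serve as the outcome variable in the usual treatment regression.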

The principle of verisimilitude can be used to turn many old-school lab experiments into modern survey experiments that have better external validity and are far, far cheaper and simpler to reason about.

Finally, experimental economists should understand that the study of human behavior is not fundamentally about “money.” Surely financial incentives are one way of making a design incentive compatible (and it does matter, especially when dealing with beliefs or perhaps with experimenter demand effects). However, human action can be revealed in a multitude of ways that are non-financial: If participants choose to read more about a topic? That’s revealed. If participants choose to wear an “I voted” button? That’s revealed. If participants choose to interact with the outgroup? That’s revealed. How we understand this revealed action is, of course, another question. This is where theory really comes in. But money should not really be the be-all and end-all of experimentalists’ attention.

The Golden Mean does not apply to experiments

An important insight that I learned far too late is that your treatments must always be as strong as humanly possible, under two crucial constraints: (i) no deception and (ii) no demand effects (unless you purposely want to test for them).

Do not implement “intermediate” treatments. It is wasteful. Statistically speaking, intermediate treatments have smaller effect sizes, which reduces power for a given sample size or requires a larger sample for a given power.
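
The cost of a weaker treatment can be sketched with the textbook normal-approximation formula for a two-sample comparison of means. The effect sizes below are hypothetical, and the formula assumes equal arm sizes and a common standard deviation.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Approximate participants per arm for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


strong = n_per_arm(effect=0.50, sd=1.0)         # hypothetical "extreme" treatment
intermediate = n_per_arm(effect=0.25, sd=1.0)   # hypothetical treatment with half the effect
print(strong, intermediate)
```

Because the required sample scales with the inverse square of the effect, halving the effect roughly quadruples the number of participants you need.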

In my experience, people sometimes feel bad about strong treatments. It is worth pondering why. Is it because a maximally strong treatment veers into demand territory? Then make a sharp left turn before that happens.
Is it because an extreme treatment might cause another, non-demand-related, psychological phenomenon that counteracts or artificially inflates a “true” effect? This kind of introspection is actually invaluable for theory-building! Perhaps you have found a new truth about human behavior simply by exaggerating a stimulus. In other words, if you are reluctant about stronger treatments, that may be because your theory of human behavior is too fragile, and needs a stronger foundation.


My paper Knowledge and Freedom started out with a very complicated design where draws were made from a lottery, and then the results were shown, and participants had to imagine someone else seeing any kind of lottery outcome for a given number of draws, etc., etc. The insight that extreme treatments matter not only significantly improved the analysis, but also allowed me to get rid of all these complicated aspects of the design! Now the whole design is just predicated on whether the other person knows everything (basically, an infinite number of “draws”) or nothing (0 “draws”). That’s it. Much easier, much stronger, much better.

Summing up the previous two sections: measure richly but manipulate utterly.

Avoid nonparametric tests like the plague

Nonparametric tests are one of the most dangerous infohazards of our field. I want to show four things in this section: (i) nonparametric tests are often recommended based on severe econometric misconceptions; (ii) nonparametric tests test for things that don’t matter, generally speaking; (iii) nonparametric tests, even if used appropriately, have inferior statistical properties; (iv) parametric methods, and especially a particular kind of linear regression analysis with heteroskedasticity-consistent standard errors, are actually really good. In sum, nonparametric tests should not be used at all.

The following nonparametric tests are commonly used in experimental economics:


    
        
| Test | Synonyms | Use case | Implementations |
|---|---|---|---|
| Mann-Whitney U test | Wilcoxon rank-sum test, Mann-Whitney-Wilcoxon test | Two independent samples | R: wilcox.test(); Stata: ranksum |
| Wilcoxon signed-rank test | Paired Wilcoxon test | Paired samples | R: wilcox.test(paired = TRUE); Stata: signrank |
| Kruskal-Wallis test | Kruskal-Wallis $H$ test | Multiple independent groups | R: kruskal.test(); Stata: kwallis |
| Kolmogorov-Smirnov test | K-S test | Comparing distributions | R: ks.test(); Stata: ksmirnov |
| Fisher’s exact test | Fisher-Irwin test | Binary outcomes | R: fisher.test(); Stata: tabi ..., exact |
| Spearman’s rank correlation | Spearman’s rho | Monotonic association | R: cor.test(method="spearman"); Stata: spearman |
            
        
    


Below, I will focus on Wilcoxon-type tests (this includes Kruskal-Wallis), as these are the epitome of testing in old-school experimental economics. The other tests have their own issues, but also some strengths; the K-S test is especially useful, though probably not for what you would expect.

One nonparametric method that deserves a positive mention is the permutation test (also called a randomization test). Unlike Wilcoxon-type tests, permutation tests can directly test for differences in means, which is the estimand we typically care about. They inherit appealing distribution-free properties while remaining powerful and interpretable. That said, for the standard experimental settings discussed here, OLS-HC3 remains simpler and equally valid.
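
For concreteness, here is a minimal Monte Carlo permutation test for a difference in means. The data are invented, and the number of reshuffles (`n_perm`) and the seed are arbitrary choices, not recommendations.

```python
import random
from statistics import mean


def permutation_test(treated, control, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)   # re-randomize treatment labels
        diff = mean(pooled[:n_t]) - mean(pooled[n_t:])
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Adding 1 to numerator and denominator counts the observed assignment itself.
    return observed, (extreme + 1) / (n_perm + 1)


# Hypothetical contributions in a public goods game (endowment 20).
treated = [12, 15, 9, 20, 14, 17, 11, 16]
control = [8, 10, 5, 13, 7, 12, 6, 9]
observed, p_value = permutation_test(treated, control)
print(f"difference in means = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

The test statistic here is the difference in means itself, so the null being probed lines up with the estimand.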

The use of nonparametric tests relies on common misconceptions

In my experience, two misconceptions lead people to use or recommend nonparametric tests in our field. The first is how great and important it is not to make distributional assumptions. The second is that the thing being tested is somehow more relevant than with common parametric tests. This second claim will be examined below. Let’s for now focus on just the first one.

A common way the first claim is made is as follows: “Real-life data are not normally distributed, hence we should use nonparametric tests, BAZINGA!” In fact, these were the exact words used when I first learned about nonparametric tests. Claims like that are virtually always made to show nonparametric tests’ superiority over one particular alternative test: the t-test.

Whenever I refer to “the t-test,” I am here specifically referring to Welch’s t-test, the default in R’s t.test. Never, ever assume equal variances. If your software assumes so by default, your software is bad.
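
For reference, Welch’s statistic and its Welch-Satterthwaite degrees of freedom can be computed by hand as sketched below. The helper name `welch_t` and the data are hypothetical. The sketch stops short of a p-value because Python’s standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)   # per-group variance of the mean
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical samples with clearly unequal variances.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10, 12]
t_stat, df = welch_t(x, y)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

Note that the degrees of freedom fall below the pooled value of `len(x) + len(y) - 2` whenever the variances differ, which is what keeps the test honest.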

This argument is not even wrong. It is simply irrelevant whether the underlying data are normally distributed. Rather, for the t-test to work, the sample mean should be approximately normally distributed. Now, it is of course accurate that if your data are normally distributed, then the mean itself is normally distributed. However, a remarkable result in statistics called the “Central Limit Theorem” (CLT) shows that under very weak conditions, the distribution of the suitably standardized sample mean of independent draws with finite mean and variance converges to a normal distribution.

The CLT gets even more remarkable. In all experiments known to me, data are inherently bounded. It simply is never possible to report arbitrary numbers. Just think of the public goods game. You simply cannot give less than zero or more than your endowment. The same principle applies to all experimental data. It is easy to prove that if a variable is bounded from above and below, all conditions of the CLT are satisfied and the mean of the variable will converge to a normal distribution. Simply put, the elicitation of bounded data kills pathological behavior of statistical tests.
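
A quick simulation illustrates the point for a bounded but clearly non-normal outcome. The data-generating process (a contribution equal to the endowment times the cube of a uniform draw) is a hypothetical choice made only because its mean and variance are known exactly.

```python
import random
from math import sqrt

rng = random.Random(42)
ENDOWMENT, N, SIMS = 20.0, 50, 5000

# Hypothetical skewed but bounded outcome: contribution = ENDOWMENT * U**3, U ~ Uniform(0, 1).
TRUE_MEAN = ENDOWMENT / 4                    # E[U^3] = 1/4
TRUE_SD = ENDOWMENT * sqrt(1 / 7 - 1 / 16)   # Var[U^3] = E[U^6] - E[U^3]^2

inside = 0
for _ in range(SIMS):
    sample_mean = sum(ENDOWMENT * rng.random() ** 3 for _ in range(N)) / N
    z = (sample_mean - TRUE_MEAN) / (TRUE_SD / sqrt(N))   # standardized sample mean
    inside += abs(z) < 1.96

coverage = inside / SIMS
print(f"share of standardized means within +/-1.96: {coverage:.3f}")
```

If the normal approximation is good, this share should sit close to 0.95, even though individual contributions are heavily right-skewed.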

Still, a commonly held misconception is that you need $n > 30$, or similar, for the CLT to apply. (This is an oversimplification. It depends heavily on the underlying distribution’s skewness.) However that may be, this does not mean that the t-test would fail. There are two cases that should be distinguished here: Type I errors and Type II errors.

For small $n$ and roughly equal group sizes, both the degrees of freedom, $\nu$, and the estimate of the standard deviation, $\hat{\sigma}$, lead to such an unbelievably conservative distribution of $t$ that I am confident in saying that your test will not be oversized.

REWARD! If you are the first one to send me any R function at the indicated position subject to the conditions in the code, with the script running without error, the test at the end being rejected at $\alpha = 10^{-12}$, then I will admit defeat, update this post accordingly, and send a charity of your choice US$100:

    
        
        

        
            
        
    

    This offer was posted on 2026-01-02 and is still valid.


Good luck.

Second, with respect to Type II errors: if you ever have $n \lt 30$, that’s a You problem. Increase your sample size or improve your design more generally. That’s part of why power analyses are useful.

Nonparametric tests’ null hypotheses are poorly understood

A common claim about Wilcoxon-type tests is that they are tests on the median. In general, that is false. See also this. And this. If you are happy with making the additional assumptions required to reframe Wilcoxon-type tests as tests on the median, then why not simply make the minimal assumptions for the t-test (see above) and actually test something real?

Generally speaking, Wilcoxon-type tests are tests on stochastic dominance. While in some very specific instances, stochastic dominance may be what matters (see below), in general it is not.
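
One way to see what rank-sum tests respond to is to compute the share of cross-group pairs in which one observation beats the other, which is the Mann-Whitney $U$ statistic rescaled by the number of pairs. The two samples below are deliberately contrived (and hypothetical) so that means and medians coincide.

```python
from statistics import mean, median


def prob_superiority(x, y):
    """Share of (x_i, y_j) pairs with x_i > y_j, counting ties as one half.

    This equals the Mann-Whitney U statistic for x divided by len(x) * len(y).
    """
    wins = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return wins / (len(x) * len(y))


# Hypothetical samples: same mean (2), same median (2), different shapes.
x = [0, 0, 2, 2, 6]
y = [2, 2, 2, 2, 2]
print(mean(x), mean(y), median(x), median(y), prob_superiority(x, y))
```

Here neither the mean nor the median differs between groups, yet the pairwise-comparison share is away from one half. That quantity, not the mean or median, is what drives the rank-sum statistic.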

By the way: even if nonparametric tests tested for the median—or if you’re willing to make the necessary assumptions—that would only indicate the difference in medians between your conditions, not the movement of a “baseline median” person to a different outcome given treatment. The latter interpretation requires additional assumptions (such as rank invariance, constant additive treatment effects, or other restrictions on the joint distribution of potential outcomes). More broadly, your theory would have to justify why medians matter at all.

We know what t-tests do and that they have good properties

t-tests are tests on the mean. Means (or, rather, ATEs) matter a whole lot over a broad bandwidth of economic theory. And that’s that.

That t-tests have good properties has already been established (warning: loud).

Linear regression with HC3 standard errors (“OLS-HC3”) is even better than t-tests

OLS is essentially an extension of the t-test. The t-test is just OLS with a binary treatment indicator. OLS generalizes this to multiple treatments, continuous controls, and factorial designs. It should be the natural framework for experimental analysis.

But what about heteroskedasticity? That problem is basically solved. Just use HC3 standard errors. They have excellent finite-sample properties. And yes, they are better than HC0, HC1, or HC2. Just use HC3. In R: lmtest::coeftest(model, vcov = sandwich::vcovHC(model, type = "HC3")). In Stata: reg y t, vce(hc3). Note that Stata by default uses HC1 if you just specify robust, so don’t do that.
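
To demystify what HC3 does, here is a hand-rolled sketch for the simplest case, a regression of the outcome on a constant and one regressor. It is written for transparency, not as a replacement for the R and Stata commands above, and the function name `ols_hc3` and the data are hypothetical.

```python
def ols_hc3(y, x):
    """OLS of y on a constant and x; returns (intercept, slope, HC3 standard error of slope)."""
    n = len(y)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    # Inverse of X'X for the two-column design matrix (constant, x).
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for yi, xi in zip(y, x):
        row = (1.0, xi)
        # Leverage h_i = x_i' (X'X)^{-1} x_i
        h = sum(row[a] * inv[a][b] * row[b] for a in range(2) for b in range(2))
        w = ((yi - b0 - b1 * xi) / (1 - h)) ** 2   # HC3: squared residual scaled by 1 / (1 - h_i)^2
        for a in range(2):
            for b in range(2):
                meat[a][b] += w * row[a] * row[b]
    # Sandwich (X'X)^{-1} * meat * (X'X)^{-1}; keep the slope's diagonal element.
    var_b1 = sum(inv[1][a] * meat[a][b] * inv[b][1] for a in range(2) for b in range(2))
    return b0, b1, var_b1 ** 0.5


# Hypothetical experiment: five control observations (t = 0), six treated (t = 1).
y = [1, 2, 3, 4, 5, 2, 4, 6, 8, 10, 12]
t = [0] * 5 + [1] * 6
intercept, effect, se = ols_hc3(y, t)
print(f"control mean = {intercept:.2f}, treatment effect = {effect:.2f}, HC3 s.e. = {se:.3f}")
```

With a single treatment dummy, the slope is exactly the difference in group means. In this special case the HC2 variant reproduces the Welch standard error, and HC3 is slightly larger, i.e., a bit more conservative.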

OLS has many incredible benefits beyond robustness. You can easily include additional control variables to improve precision (see the ANCOVA section below). You can cluster standard errors by session, group, or any other unit. You can estimate multiple treatment effects simultaneously by using saturated regressions. Coefficients are directly interpretable. There are no convergence issues. Results are transparent and reproducible. OLS-HC3 should be the default workhorse for experimental economics. It just works.

Note: there is a critical chapter on “robust” standard errors, including HC3, in Mostly Harmless Econometrics, Section 8.1. Their simulation results rely on very small sample sizes, unequal between treatments (literally $N_1 = 0.1 \cdot 30 = 3$ in one of the treatments) and should be viewed as extreme and purely pedagogical. Also, interestingly, the newer HC4 standard errors (but not HC5!) mitigate the issue. I attach R code below to replicate the results in their final column for the cases of (i) no and (ii) lots of heterogeneity. If you change FACTOR or TREATED, you can see how well HC3 in fact performs under even slightly more realistic scenarios despite low sample sizes.


    
    

    
        
    


Even the analysis of binary outcomes using OLS-HC3 is probably fine (with simple designs)

A common concern is that when your outcome variable is binary (0 or 1), you “should” use logit or probit instead of OLS. This concern is largely misplaced for experimental work with simple designs.

The linear probability model (LPM, just OLS with a binary dependent variable) has many advantages: (i) coefficients are directly interpretable as percentage point changes in probability; (ii) there are no convergence issues; (iii) it does not impose functional form assumptions about how treatment effects vary across the probability distribution; (iv) with HC3 standard errors, it is robust to heteroskedasticity (which is inherent with binary outcomes).

The main remaining criticism of the LPM is that fitted values can fall outside $[0, 1]$. I have never seen fitted values matter in experimental economics. If you extrapolate, you must choose an appropriate data-generating process.

For simple treatment comparisons, the LPM with HC3 standard errors is an excellent default choice. It is transparent, robust, and interpretable. It just works. More precisely, the LPM is valid when all right-hand-side variables are categorical and fully interacted. Once continuous covariates enter the model, functional form issues can arise. But for testing differences between treatments, where the key regressor is just a dummy variable, this is not a concern. Logit and probit models should be reserved for cases where (i) you have strong theoretical reasons to impose a specific functional form (such as from utility theory), or (ii) you are working with observational data where extrapolation matters.

One final note: if you do use logit or probit, report marginal effects, not raw coefficients. Raw coefficients from nonlinear models are nearly impossible to interpret and compare across studies. Marginal effects at the mean (or average marginal effects) restore interpretability.

Interactions (and why you probably shouldn’t care about them)

As I mention below in the section on factorial designs, interactions are fundamentally model-dependent. An interaction that is significant in OLS may be insignificant in logit, or vice versa.

Moreover, interactions are almost always underpowered ex ante.

In factorial designs, as discussed above, you should include the interaction term in your regression to avoid functional form misspecification, but you should typically focus on main effects or simple effects for interpretation. The interaction coefficient itself is rarely of interest. Feel free to just eliminate conditions where multiple treatments are active if you don’t need the resulting comparisons.

Small design changes can vastly improve statistical inference

One of the highest-return design modifications you can make is to collect a baseline measure of your outcome variable before randomizing participants into treatment. Then, include this baseline measure as a control variable in your analysis. This approach is called ANCOVA (analysis of covariance) and it can dramatically improve statistical power.

Consider the standard regression for a randomized experiment:

\begin{equation}
    y_i = \beta_0 + \tau T_i + \pmb{\gamma}^\top \pmb{X}_i + \varepsilon_i
\end{equation}

Here, $y_i$ is your outcome, $T_i$ is the treatment indicator, and $\pmb{X}_i$ is a vector of any additional control variables (demographics, etc.) with coefficient vector $\pmb{\gamma}$. Now suppose you collect a baseline measure $y_i^0$ of the outcome before treatment assignment. You can then estimate:

\begin{equation}
    y_i = \beta_0 + \beta_1 y_i^0 + \tau T_i + \pmb{\gamma}^\top \pmb{X}_i + \varepsilon_i
\end{equation}

This helps because $y_i^0$ absorbs individual-level variation in the outcome. If people differ substantially in their baseline levels, the inclusion of $y_i^0$ reduces the residual variance $\text{Var}(\varepsilon_i)$. This directly shrinks the standard error of $\hat{\tau}$, increasing your statistical power without requiring a larger sample!
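
A small Monte Carlo sketch shows the size of the gain under one hypothetical data-generating process (a baseline that explains roughly two thirds of the outcome variance, a true effect of 0.3, and no further controls). The ANCOVA coefficient is computed via the Frisch-Waugh-Lovell theorem, by partialling the baseline out of both the outcome and the treatment indicator.

```python
import random
from statistics import mean, stdev


def residualize(v, z):
    """Residuals from an OLS regression of v on a constant and z."""
    mv, mz = mean(v), mean(z)
    slope = sum((a - mv) * (b - mz) for a, b in zip(v, z)) / sum((b - mz) ** 2 for b in z)
    return [(a - mv) - slope * (b - mz) for a, b in zip(v, z)]


def simulate(rng, n=200, tau=0.3):
    """One hypothetical experiment; returns (estimate without baseline, ANCOVA estimate)."""
    baseline = [rng.gauss(0, 1) for _ in range(n)]   # y_i^0, measured before treatment
    treat = [rng.randint(0, 1) for _ in range(n)]    # randomized T_i
    y = [0.8 * b + tau * t + rng.gauss(0, 0.6) for b, t in zip(baseline, treat)]
    plain = mean(v for v, t in zip(y, treat) if t) - mean(v for v, t in zip(y, treat) if not t)
    ry, rt = residualize(y, baseline), residualize(treat, baseline)
    ancova = sum(a * b for a, b in zip(ry, rt)) / sum(b * b for b in rt)
    return plain, ancova


rng = random.Random(7)
draws = [simulate(rng) for _ in range(500)]
sd_plain = stdev(d[0] for d in draws)
sd_ancova = stdev(d[1] for d in draws)
print(f"sampling s.d. without baseline: {sd_plain:.3f}; with baseline: {sd_ancova:.3f}")
```

Both estimators are centered on the true effect. The one that conditions on the baseline simply varies less from experiment to experiment.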

The gains can be enormous. But: timing matters. The baseline measure must be collected before randomization. However, any pre-treatment variable that is correlated with the outcome will help. For example, if your outcome is post-treatment donations, a baseline measure of past donations or general prosociality will still improve precision.

Measurement error in $y_i^0$ attenuates $\beta_1$ but does not bias $\hat{\tau}$, because $T_i$ is randomized and thus by construction uncorrelated with $\varepsilon_i$.

Summing up, if your experiment allows for it, always collect a baseline measure. There is essentially no downside. Preregister the inclusion of $y_i^0$ and always include it in your main specification.

Carefully design comparisons in factorial designs (e.g., 2×2)


    
    

    
        
    


Suppose you have two factors, $g$ and $h$. Both are dummy variables (1 if “turned on” and 0 otherwise). Participants are randomly and independently assigned to a level of $g$ and to a level of $h$, which yields a 2×2 between-subjects design. You also observe an outcome, $y$. What can you do?

There are three standard effects in 2×2 designs: main effects, simple effects, and interaction effects.

Economists are inherently (and rightly) suspicious of interaction effects, and thus I will not be covering them. A core challenge with interpreting interactions is that they work only for certain outcome variables and models. With binary outcome variables, for example, an interaction that is not significant in the linear probability model/with OLS may be significant in a logit model, a probit model, both, or neither. It is a huge unresolved and likely unresolvable mess. Never rely on the interaction in a factorial design unless you fully understand the model to be used on the outcome.
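
A small numerical sketch makes the scale dependence concrete. The four cell probabilities below are hypothetical and were chosen so that the interaction contrast is exactly zero on the probability scale, which is the quantity a saturated linear probability model estimates. The same cells show a clearly non-zero interaction on the log-odds scale, which is what a saturated logit estimates.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


# Hypothetical success probabilities p[(g, h)] for a binary outcome.
p = {(0, 0): 0.125, (1, 0): 0.25, (0, 1): 0.5, (1, 1): 0.625}

# Interaction contrast on the probability scale (saturated LPM).
interaction_lpm = (p[1, 1] - p[0, 1]) - (p[1, 0] - p[0, 0])

# The same contrast on the log-odds scale (saturated logit).
interaction_logit = (logit(p[1, 1]) - logit(p[0, 1])) - (
    logit(p[1, 0]) - logit(p[0, 0])
)

print(f"LPM interaction:   {interaction_lpm:.3f}")
print(f"logit interaction: {interaction_logit:.3f}")
```

The data are identical in both lines; only the scale on which “no interaction” is defined differs.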

Simple effects are simple indeed; they refer to the effect of one factor at a fixed level of the other. In R, just do lm(y ~ h, data = df[df$g == 1, ]) (with optional control variables, see above) and get lunch. Your day is completed.

Main effects are not so simple. The main effect of, say, $g$, requires you to average over $h$ with some weights. If these weights can be derived from a policy question or your theoretical framework, great! Otherwise, just use equal weights.

Denote by $\mu_{g,h}$ the population mean in some factorial treatment $g, h$. Then, the main effects of $g$ and $h$ are defined as follows:

\begin{align}
\tau_g^{\text{M}} = \left[ w_1 \mu_{1,0} + (1-w_1) \mu_{1,1} \right] - \left[ w_1 \mu_{0,0} + (1-w_1) \mu_{0,1} \right]\\
\tau_h^{\text{M}} = \left[ w_2 \mu_{0,1} + (1-w_2) \mu_{1,1} \right] - \left[ w_2 \mu_{0,0} + (1-w_2) \mu_{1,0} \right]
\end{align}

In the following, I assume $w_1 = w_2 = \frac{1}{2}$. As I argued above, linear regressions are an excellent method to analyze your experiment. How, then, can we get $\tau_g^{\text{M}}, \tau_h^{\text{M}}$ from a neat linear regression?

Unfortunately, the coefficients of the obvious linear model with raw dummies are not the main effects:

\begin{equation}
y_i = \beta_0 + \beta_1 g_i + \beta_2 h_i + \beta_3 g_i h_i + \varepsilon_i
\end{equation}

Crucially, with main effects, it is not correct to just throw your (binary) dummy treatment indicators into the linear model as-is. That would only be correct if you did not use an interaction term in your linear model. Given my skepticism about interaction terms, you may ask why I would ever propose using one of these! The reason is simple: including the interaction term makes your model saturated. It can exactly recover all four cell means without imposing any functional form restrictions. If you omit the interaction term, you assume additivity; if this assumption is wrong, your main effect estimates will be biased. Including the interaction term protects against this misspecification. Therefore, you should always include the interaction term but (probably) ignore the coefficient on it.

In general, $\beta_1 \neq \tau_g^{\text{M}}$ and $\beta_2 \neq \tau_h^{\text{M}}$. Why is that? Simple:

\begin{equation}
\frac{\partial y_i}{\partial g_i} = \beta_1 + \beta_3 h_i
\label{marginal}
\end{equation}

This marginal effect depends on the value of $h_i$! Only at $h_i = \frac{1}{2}$ does the marginal effect equal the main effect. So, we must transform $g_i, h_i$ as follows:

\begin{align}
g'_i = g_i - \frac{1}{2}\\
h'_i = h_i - \frac{1}{2}
\end{align}

This effect coding or deviation coding ensures that the marginal effect is equal to the main effect: in the equivalent of Equation \eqref{marginal}, the final term is now multiplied by $h'_i = \pm\frac{1}{2}$ and therefore cancels when averaging over the two levels of $h$ with equal weights. The following linear model recovers the main effects:

\begin{equation}
y_i = \beta_0 + \tau_g^{\text{M}} g'_i + \tau_h^{\text{M}} h'_i + \Xi g'_i h'_i + \varepsilon_i
\end{equation}

Needless to say, you can include further control variables as usual (and should ignore $\Xi$). However, transforming dummies is crucial—as long as you do that, your coefficients are meaningful! See here and here for excellent references.
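
The difference between the two codings can be checked by hand. Because the model is saturated, its coefficients are an exact function of the four cell means, so no data are needed. The sketch below (Python, standard library only) solves the four equations for hypothetical, deliberately non-additive cell means, once with raw dummies and once with the dummies shifted by $\frac{1}{2}$. The cell means and the helper names `solve` and `coefficients` are assumptions for illustration.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[-1] for row in M]


# Hypothetical cell means mu[(g, h)]; the effect of g is 2 at h = 0 and 6 at h = 1.
mu = {(0, 0): 1, (1, 0): 3, (0, 1): 2, (1, 1): 8}
cells = sorted(mu)


def coefficients(shift):
    """[intercept, g, h, g*h] coefficients after subtracting `shift` from both dummies."""
    rows = [[1, g - shift, h - shift, (g - shift) * (h - shift)] for g, h in cells]
    return solve(rows, [mu[c] for c in cells])


dummy = coefficients(Fraction(0))       # raw 0/1 dummies
effect = coefficients(Fraction(1, 2))   # effect (deviation) coding

# Equal-weight main effects, computed directly from the definition.
main_g = Fraction(1, 2) * ((mu[1, 0] - mu[0, 0]) + (mu[1, 1] - mu[0, 1]))
main_h = Fraction(1, 2) * ((mu[0, 1] - mu[0, 0]) + (mu[1, 1] - mu[1, 0]))

print("dummy coding: ", [str(c) for c in dummy])
print("effect coding:", [str(c) for c in effect])
print("main effects: ", str(main_g), str(main_h))
```

With raw dummies, the coefficient on $g$ is the simple effect of $g$ at $h = 0$. With effect coding, it coincides with the equal-weight main effect, and the interaction coefficient is the same in both codings.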


    
    

    
        
    


Null effects can in fact be quantified

A common reaction to a non-significant result is utter despair. That reaction is often wrong, or at least premature. A p-value above 0.05 tells you that you failed to reject the null hypothesis. It does not tell you that the effect is zero, small, or negligible. By itself, it says nothing about effect size.

A critical distinction must be made between underpowered null results (which are in general not so valuable) and well-powered, tightly bounded null results (which are valuable). If your study had 20% power to detect a small effect and you find $p = 0.5$, you have learned essentially nothing. Your confidence interval will be wide, spanning large positive and negative effects. By contrast, if your study had 90% power and you find $p = 0.5$ with a confidence interval of $[-0.05, 0.15]$ in standardized units, you have learned something important: the effect, if it exists at all, is small. Let’s not conflate these two cases!

Null results can be valuable, but only if they are informative. An informative null result rules out effects larger than some threshold. To claim an informative null in a frequentist framework, you must demonstrate that your confidence interval is sufficiently narrow.

There are two main approaches to quantifying null effects: equivalence testing (TOST) and Bayesian methods. Both allow you to make positive claims about the absence or negligibility of an effect, rather than merely failing to reject the null.

Using TOST


    
    

    
        
    


The two one-sided tests (TOST) procedure tests whether your effect lies within a pre-specified equivalence region. Instead of testing $H_0: \tau = 0$, you test whether $\tau$ is practically equivalent to zero by checking if the confidence interval for $\tau$ lies entirely within some interval $[-\Delta, \Delta]$, where $\Delta$ is the smallest effect size you consider meaningful.

The conventional choice is $\Delta = 0.2$ in standardized units (Cohen’s $d$), though you should justify this threshold based on your research context. A narrower equivalence region (say, $\Delta = 0.1$) makes a stronger claim but requires more statistical power. The TOST procedure uses a 90% confidence interval, which corresponds to two one-sided tests at $\alpha = 0.05$.

Here is an example (from the code above):


    To assess whether the effect is practically negligible, we report the 90% confidence interval, which corresponds to a two one-sided tests (TOST) equivalence procedure at $\alpha = 0.05$ (Lakens et al., 2018). The estimated coefficient is $-0.032$ ($\text{SE} = 0.072$), yielding a 90% CI of $[-0.150, 0.086]$. Given the pooled outcome standard deviation of $1.329$, this corresponds to $[-0.113, 0.065]$ in standardized units (Cohen’s $d$). Using the conventional threshold of $|d| = 0.2$ as the smallest effect size of interest, the 90% CI lies entirely within the equivalence region $[-0.2, 0.2]$, supporting the conclusion of practical equivalence.
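
The arithmetic behind such a paragraph is short enough to check by hand. The following Python sketch (standard library only) starts from the three reported numbers and reproduces the intervals. It is not the code that generated the example: it assumes a normal approximation to the sampling distribution, whereas a regression would normally use a $t$ distribution, which makes little difference in large samples.

```python
from statistics import NormalDist

estimate, se = -0.032, 0.072   # coefficient and standard error from the example
outcome_sd = 1.329             # pooled outcome standard deviation from the example
delta_d = 0.2                  # equivalence bound in standardized units
alpha = 0.05

# 90% confidence interval (normal approximation, an assumption of this sketch).
z = NormalDist().inv_cdf(1 - alpha)
ci_low, ci_high = estimate - z * se, estimate + z * se
d_low, d_high = ci_low / outcome_sd, ci_high / outcome_sd

# Two one-sided z-tests against the bounds -delta and +delta in raw units.
delta = delta_d * outcome_sd
p_lower = 1 - NormalDist().cdf((estimate + delta) / se)  # H0: tau <= -delta
p_upper = NormalDist().cdf((estimate - delta) / se)      # H0: tau >= +delta
p_tost = max(p_lower, p_upper)

equivalent = -delta_d < d_low and d_high < delta_d
print(f"90% CI (raw):          [{ci_low:.3f}, {ci_high:.3f}]")
print(f"90% CI (standardized): [{d_low:.3f}, {d_high:.3f}]")
print(f"TOST p-value: {p_tost:.4f}, equivalent: {equivalent}")
```

Checking whether the 90% CI lies inside the equivalence region and checking whether the larger of the two one-sided p-values is below $\alpha$ are the same decision rule, which is why TOST results are usually reported via the interval.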


Using Bayesian t-tests


    
    

    
        
    



My personal intuitionist view of evidence is as follows: any study should yield one of three conclusions: evidence for X, evidence for Y, or indeterminate. Bayesian tests deliver on all three possibilities. A Bayes Factor (BF) can provide evidence for the null, evidence for the alternative, or remain inconclusive ($\text{BF} \approx 1$). A non-significant frequentist test, by contrast, does not distinguish between evidence for $H_0$ and poor data. This is why I prefer Bayesian approaches for quantifying null effects, though TOST serves a similar purpose when properly powered. Bayesian methods, however, come with batteries included. As I like to say: “Bayesian methods solve all problems, for free.” It’s true. Really!

Simply put, Bayesian methods provide an alternative approach by computing a Bayes Factor, which quantifies the relative evidence for the null hypothesis versus the alternative. A $\text{BF}_{01} > 3$ suggests moderate evidence for the null, while $\text{BF}_{01} > 10$ suggests strong evidence. Conversely, $\text{BF}_{01} \lt 1/3$ (equivalently, $\text{BF}_{10} = 1 / \text{BF}_{01} > 3$) suggests at least moderate evidence for an effect. Read more here.

Bayesian t-tests (e.g., as implemented in the BayesFactor package in R) specify a prior distribution on effect sizes and compute how much the data update your beliefs. The advantage of this approach is that you can actually claim evidence for the null, not just failure to reject it. The disadvantage is that results depend on your choice of prior. Therefore, always use the now-standard default prior of $\text{Cauchy}\left(0, \frac{\sqrt{2}}{2}\right)$ and conduct robustness analyses.

Here is an example (from the code above, same data as with the TOST example):


    We conducted a Bayesian independent samples t-test (Rouder et al., 2009) using a default prior with scale parameter $r = \frac{\sqrt{2}}{2} \approx 0.707$ (Cauchy distribution centered at zero). The Bayes Factor strongly favored the null hypothesis, with $\text{BF}_{01} = 15.0$ (equivalently, $\text{BF}_{10} = 0.067$), indicating that the observed data are 15 times more likely under the null hypothesis of no effect than under the alternative hypothesis. This constitutes strong evidence for the absence of an effect. To assess robustness to prior specification, we conducted sensitivity analyses with alternative scale parameters: $r = 0.5$ yielded $\text{BF}_{10} = 0.094$ ($\text{BF}_{01} = 10.7$); $r = 1.0$ yielded $\text{BF}_{10} = 0.047$ ($\text{BF}_{01} = 21.1$); and $r = 2.0$ yielded $\text{BF}_{10} = 0.024$ ($\text{BF}_{01} = 42.2$). Across all specifications, we obtained consistent strong evidence favoring the null hypothesis ($\text{BF}_{01} > 10$ in all cases), demonstrating that our conclusion is robust to reasonable prior choices.


In practice, TOST and Bayesian methods often agree qualitatively. If you have a well-powered study with tight confidence intervals, TOST will show equivalence and the Bayes Factor will favor the null (as in the example above). If your study is underpowered, neither method will save you: TOST will fail to show equivalence and the Bayes Factor will be inconclusive (close to 1). The lesson is simple: power matters for null results just as much as for positive results. Design your study properly.

Know what you are doing, measuring and estimating

Any experiment starts with a theory of human behavior. This theory may not be mathematical, but it is always a statement about counterfactual behavior. The section title promises three requirements: know what you are doing (treatment design), know what you are measuring (outcome elicitation), and know what you are estimating (estimands). All three must align with your theory’s objects.

Know what you are doing

For example, suppose that you want to study workers’ effort when provided with information about coworkers’ effort. Now this is clearly an experiment about workers’ beliefs. These beliefs are changed through information, and then those beliefs are (probably) thought to induce a response in effort. I have talked more above about some recommendations for such experiments, but for now let us just focus on treatment design.

Putting aside questions of deception, a simple experiment could tell some participants (i) that their group of coworkers had an average effort level of $e_1$ and some other participants (ii) that their group of coworkers had an average effort level of $e_2$, where $e_2 \lt e_1$. Do you see a problem with this design?


    
    

    
        This design is built on the assumption that workers’ effort level is shaped by beliefs about the average effort of others. But it could well be that their effort responds to the minimum effort, the maximum effort, or any other statistic of the distribution.

        If your theory proves that only beliefs about averages should matter, then such a design is wholly appropriate. And even if people’s effort is in fact a function of the minimum, the minimum could still be correlated with the average, so the design may “work” accordingly. Still, beliefs are extraordinarily rich objects, and any particular experimental configuration of beliefs must grapple with that complexity. As Einstein put it: “theory decides what we can observe.” So true!
    


Formal mathematical theory has the great feature that it forces you to make explicit assumptions. Formal theory constrains an experimenter’s degrees of freedom. This is why theory is invaluable even when not strictly necessary: it disciplines experimental design.

Know what you are measuring

Your outcome measures must be as close as possible to theoretically relevant objects. There are important exceptions: sometimes the theoretically relevant object (such as a preference parameter) cannot be directly elicited as an outcome measure. Simple tasks like binary choices, coupled with a structural model, can nonetheless be highly informative about it. Theory still dictates what is relevant. It just also dictates the mapping from observable outcomes to theoretical objects.

In the absence of formal theory, or if theory makes no predictions for a particular object, your planned analyses can serve as a reduced-form guide to what you need to measure. A key principle: comparisons are only valid when the objects being compared are of the same kind. The null hypothesis must be a sensible benchmark.

Suppose you want to compare first-order beliefs (what I believe about others) with second-order beliefs (what I believe others believe about others). You elicit the first-order belief by asking: “What is the average effort of others?”

Now, how should you elicit the second-order belief? An invalid approach would be to ask: “What is the most common belief about the average effort of others?” This compares a mean (first-order) with a mode (second-order). These are two different statistics that need not coincide even under rationality.

A valid approach could ask: “What is the average belief about the average effort of others?” Now both elicitations concern means. Under common priors and rational expectations, the first-order and second-order beliefs should coincide. Any divergence could reveal false consensus, projection bias, or asymmetric updating. Crucially, the comparison is valid in principle because the null hypothesis (equality under rationality) is a meaningful benchmark.

Know what you are estimating

We have discussed estimands in greater detail above. For now, keep the following principle in mind, which I call the Fundamental Law of Experimental Economics:

If theory predicts a monotonic individual response, a zero average treatment effect (ATE) implies zero individual effects.

This law holds because the mean cannot hide offsetting responses when all individual effects have the same sign. Under random assignment, the ATE therefore directly tests the theory. This is a precise statement: the ATE is an expectation (an integral over the distribution of individual types) that identifies the causal effect of a stimulus. Random assignment “marginalizes out” unobservable theoretical parameters, leaving only the predicted effect.
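
The reasoning step can be made explicit. Write $\tau_i$ for the individual treatment effect of participant $i$ (notation introduced here only for illustration) and suppose that theory implies $\tau_i \geq 0$ for everyone. Markov’s inequality then gives, for any $\epsilon > 0$,

\begin{equation*}
P(\tau_i \geq \epsilon) \leq \frac{E[\tau_i]}{\epsilon} = \frac{\text{ATE}}{\epsilon},
\end{equation*}

so an ATE of zero forces $P(\tau_i > 0) = 0$. If signs may differ, no such bound exists: effects of $+1$ for one half of the population and $-1$ for the other half also average to zero.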

The Fundamental Law immediately suggests how to design good experiments: isolate behavioral factors that, under theory, predict a monotonic change. Such factors deliver clean comparative statics. By contrast, if theory permits heterogeneous signs (some individuals respond positively, others negatively), a zero ATE is uninformative. It could reflect no effects, or it could reflect large offsetting effects. Without further restrictions, you cannot distinguish these cases.

The upshot is that experimental design and statistical analysis must be in concordance. The estimands of interest must directly test the predicted monotonic relationships that your theory delivers. This is what it means to know what you are estimating.

Radical openness is likely costless or a free lunch (if you start early)

Transparency is free if you do things properly from the beginning. Do good theory. Write clean analysis code. Organize your files sensibly. When the time comes to publish a replication package, you will have virtually no extra work. The cost is zero. Actually, the cost is negative, because good practices save you time during the project itself!

Better yet: just upload sanitized data immediately after your experimental sessions to GitHub (if permitted). Make your analysis scripts public from day one. There is little reason to wait. If your IRB or data agreements prohibit immediate sharing, fine. Otherwise, default to radical openness. And no, you will not be “scooped.” Your project is, in all likelihood, not that interesting.

Preanalysis plans and preregistrations are also valuable. They protect you from accusations of p-hacking. They force you to think clearly about your design and analysis before seeing the data. They improve the credibility of your findings and they make our whole science better. Also, by committing to analyses, sample sizes, and the like beforehand you can remove the need for later decisions. Just do it.
]]></description>
</item>
<item>
<title>A brief overview of Bayes Factors</title>
<link>https://max.pm/posts/bayes-factors/</link>
<guid>https://max.pm/posts/bayes-factors/</guid>
<description><![CDATA[



    Bayes' theorem expresses the following relationship between various probabilities:


\begin{equation}
P(H \,\vert\, D) = \frac{P(H) P(D \,\vert\, H)}{P(D)}
\label{bayes}
\end{equation}


    $P(H \,\vert\, D)$ is the posterior (as in, post having data, given the data).
    $P(H)$ is the prior.
    $P(D \,\vert\, H)$ is the likelihood of the data (given the hypothesis).
    $P(D)$ is the marginal likelihood of the data.


Proving Bayes' theorem

\begin{align*}
P(H \,\cap\, D) &= P(H \,\vert\, D) P(D)\\
P(D \,\cap\, H) &= P(D \,\vert\, H) P(H)
\end{align*}

Since $P(H \,\cap\, D) = P(D \,\cap\, H)$, Equation \eqref{bayes} follows.

Working in odds space (deriving a Bayes Factor)

Suppose now we are working with two hypotheses, $H_1 , H_2$. We can use Equation \eqref{bayes} twice to obtain the following representation of posterior odds:

\begin{align*}
\frac{P(H_1 \,\vert\, D)}{P(H_2 \,\vert\, D)} &= \frac{\frac{P(H_1) P(D \,\vert\, H_1)}{P(D)}}{\frac{P(H_2) P(D \,\vert\, H_2)}{P(D)}}\\
&= \frac{P(H_1) P(D \,\vert\, H_1)}{P(H_2) P(D \,\vert\, H_2)}\\
&= \underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{Prior Odds}} \cdot \underbrace{\frac{P(D \,\vert\, H_1)}{P(D \,\vert\, H_2)}}_{\text{Bayes Factor}}
\end{align*}

In other words, Posterior&nbsp;Odds = Prior&nbsp;Odds · Bayes&nbsp;Factor.

More precisely, we use a subscript and define

\begin{equation}
BF_{12} = \frac{P(D \,\vert\, H_1)}{P(D \,\vert\, H_2)}.
\label{bf}
\end{equation}

Note that $BF_{12}$ is a ratio of likelihoods, and that the subscript reads “top to bottom”: the first index belongs to the numerator and the second to the denominator of the Bayes Factor.

Crucially,

\begin{equation}
BF_{21} = \frac{1}{BF_{12}}.
\label{reversed}
\end{equation}

Calculating the posterior probability from a Bayes Factor

Suppose now that exactly one of $H_1$ and $H_2$ is true, and that there are no other possibilities. That is, $H_1 \cap H_2 = \emptyset$ and $H_1 \cup H_2 = \Omega$ with $P(\Omega) = 1$, so $H_1, H_2$ partition all states of the world. Then, $P(H_2 \,\vert\, D) = 1 - P(H_1 \,\vert\, D)$.

This means that the posterior probability can be calculated as follows:

\begin{align*}
\frac{P(H_1 \,\vert\, D)}{1-P(H_1 \,\vert\, D)} &= \frac{P(H_1)}{P(H_2)} \cdot \frac{P(D \,\vert\, H_1)}{P(D \,\vert\, H_2)}\\
\\
\Longleftrightarrow\\
P(H_1 \,\vert\, D) &= \frac{\frac{P(H_1)}{P(H_2)} \cdot BF_{12}}{1 + \frac{P(H_1)}{P(H_2)} \cdot BF_{12}}
\end{align*}

In the common case where prior odds are equal to 1, the posterior probability has a convenient expression:

\begin{equation}
P(H_1 \,\vert\, D) = \frac{BF_{12}}{1 + BF_{12}}
\label{postprob}
\end{equation}
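
As a quick numerical check, the conversion from a Bayes Factor to a posterior probability fits in a few lines. This is a minimal Python sketch; the function name `posterior_probability` and the prior odds of 1:4 in the last line are made up for illustration.

```python
def posterior_probability(bf_12, prior_odds=1.0):
    """P(H1 | D) from the Bayes Factor BF_12 and the prior odds P(H1) / P(H2)."""
    posterior_odds = prior_odds * bf_12
    return posterior_odds / (1 + posterior_odds)


# Equal prior odds: the convenient special case.
for bf in (1, 3, 10, 30, 100):
    print(f"BF_12 = {bf:>3}: P(H1 | D) = {posterior_probability(bf):.2f}")

# Skeptical prior odds of 1:4 against H1, combined with BF_12 = 10.
print(f"{posterior_probability(10, prior_odds=0.25):.2f}")
```

With equal prior odds, Bayes Factors of 1, 3, 10, 30 and 100 correspond to posterior probabilities of about 0.50, 0.75, 0.91, 0.97 and 0.99. With the skeptical prior, the same $BF_{12} = 10$ yields only about 0.71, which is why the prior odds should always be stated.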

Bayes Factors and strength of evidence

If $BF_{12} \geq 1$, use the table below; otherwise, use Equation \eqref{reversed} to calculate $BF_{21}$ and read the table in favor of $H_2$. The second column assumes prior odds equal to 1.


    
        
    Bayes Factor    | Posterior probability | Interpretation [A]
    $1$             | $0.5$                 | No evidence [B]
    $1 \dots 3$     | $0.5 \dots 0.75$      | Anecdotal evidence [B]
    $3 \dots 10$    | $0.75 \dots 0.91$     | Moderate evidence [B]
    $10 \dots 30$   | $0.91 \dots 0.97$     | Strong evidence [B]
    $30 \dots 100$  | $0.97 \dots 0.99$     | Very strong evidence [B]
    $> 100$         | $> 0.99$              | Extremely strong evidence [B]

    [A] As per Jeffreys (tradition has it).
    [B] In favor of the hypothesis under consideration.
    


Relationship to likelihood ratio tests

The classical likelihood ratio test statistic is

\begin{equation}
\Lambda = \frac{\sup_{\theta \in \Theta_1} P(D \,\vert\, \theta)}{\sup_{\theta \in \Theta_0} P(D \,\vert\, \theta)}
\end{equation}

where $\Theta_0, \Theta_1$ are the parameter spaces under the null and alternative. This ratio compares the best-fitting parameter values.

The Bayes Factor instead compares average likelihoods:

\begin{equation}
BF_{10} = \frac{\int P(D \,\vert\, \theta) \, p(\theta \,\vert\, H_1) \, d\theta}{\int P(D \,\vert\, \theta) \, p(\theta \,\vert\, H_0) \, d\theta}
\end{equation}

The averaging penalizes models with diffuse priors over large parameter spaces: probability mass wasted on poor-fitting parameter values drags down the marginal likelihood. This provides an automatic Occam's razor absent from $\Lambda$.
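
A toy example shows how far the two quantities can diverge. The sketch below (Python, standard library only) uses hypothetical data, 60 successes in 100 Bernoulli trials, a point null $\theta = 0.5$, and, as an assumed diffuse prior under $H_1$, $\theta \sim \text{Uniform}(0, 1)$. The integral in the numerator of $BF_{10}$ is approximated by a midpoint Riemann sum.

```python
from math import comb

n, k = 100, 60  # hypothetical data: 60 successes in 100 trials


def likelihood(theta):
    """Binomial likelihood P(D | theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# H0: theta = 0.5 (point null). H1: theta ~ Uniform(0, 1) (assumed prior).
lik_h0 = likelihood(0.5)

# Likelihood ratio: best-fitting theta under H1 (the MLE k / n) versus H0.
lr = likelihood(k / n) / lik_h0

# Bayes Factor: likelihood averaged over the uniform prior, divided by H0's.
grid = 10_000
marginal_h1 = sum(likelihood((i + 0.5) / grid) for i in range(grid)) / grid
bf_10 = marginal_h1 / lik_h0

print(f"Likelihood ratio: {lr:.2f}")
print(f"BF_10:            {bf_10:.2f}")
```

For these numbers the likelihood ratio is roughly 7.5 in favor of the alternative, while the Bayes Factor sits slightly below 1. The uniform prior spreads most of its mass over values of $\theta$ that fit the data poorly, and the averaging charges $H_1$ for that.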

Decomposing Bayes Factors

In some cases, there is no single data set $D$ that can be used to evaluate the hypotheses, but rather multiple pieces of evidence $E_1 , E_2, \dots, E_m$ such that $E \equiv \bigcap_{i=1}^{m} E_{i}$ is the totality of evidence.

Rewriting Equation \eqref{bayes} in terms of evidence, we get

\begin{equation}
P(H \,\vert\, E) = \frac{P(H) P(E \,\vert\, H)}{P(E)}.
\end{equation}

Using the chain rule, for $m=2$ it holds that

\begin{equation}
P(E \,\vert\, z) = P(E_1 \,\vert\, z) P(E_2 \,\vert\, E_1, z)
\end{equation}

for any conditioning variable, $z$. More generally,

\begin{equation}
P(E \,\vert\, z) = P(E_1 \,\vert\, z) \prod_{i=2}^{m} P(E_i \,\vert\, E_1, \dots, E_{i-1}, z).
\end{equation}

The Bayes Factor can thus be decomposed as

\begin{align*}
\frac{P(E \,\vert\, H_1)}{P(E \,\vert\, H_2)} &= \frac{P(E_1 \,\vert\, H_1) \prod_{i=2}^{m} P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_1)}{P(E_1 \,\vert\, H_2) \prod_{i=2}^{m} P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_2)}\\
&= \frac{P(E_1 \,\vert\, H_1)}{P(E_1 \,\vert\, H_2)} \prod_{i=2}^{m} \frac{P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_1)}{P(E_i \,\vert\, E_1, \dots, E_{i-1}, H_2)}\\
&= BF_{12}^{E_1} \cdot BF_{12}^{E_2 \, \vert \, E_1} \cdot \dots \cdot BF_{12}^{E_m \, \vert \, E_1, \dots, E_{m-1}}
\end{align*}

This decomposition is useful when evidence arrives sequentially or when different pieces of evidence have qualitatively different sources. Instead of computing a single likelihood ratio over all evidence at once, you can update beliefs incrementally. Each factor $BF_{12}^{E_i \, \vert \, E_1, \dots, E_{i-1}}$ measures how much the $i$-th piece of evidence favors $H_1$ over $H_2$, given what was already known.

In practice, this matters when some evidence is easier to evaluate than others, or when you want to diagnose which pieces of evidence drive the overall conclusion. A large overall Bayes factor might be dominated by a single $E_i$, or it might accumulate from many modest contributions. The decomposition makes this transparent. It also helps when combining evidence from heterogeneous sources (e.g., experimental data and observational data) where assuming independence would be wrong, but the conditional structure is tractable.

When the pieces of evidence are conditionally independent given the hypothesis, that is, when $P(E_i \,\vert\, E_1, \dots, E_{i-1}, H) = P(E_i \,\vert\, H)$ for all $i$, the decomposition simplifies to a product of unconditional Bayes factors:

\begin{equation}
\frac{P(E \,\vert\, H_1)}{P(E \,\vert\, H_2)} = \prod_{i=1}^{m} \frac{P(E_i \,\vert\, H_1)}{P(E_i \,\vert\, H_2)} = \prod_{i=1}^{m} BF_{12}^{E_i}
\end{equation}

This is convenient but rarely justified in practice. Evidence from the same domain or measurement process is typically correlated, and treating dependent evidence as independent can inflate confidence.
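
The inflation is easy to reproduce with two dependent pieces of evidence. All probabilities in the following Python sketch are hypothetical: under either hypothesis, $E_2$ is very likely once $E_1$ has occurred, so $E_2$ adds little beyond $E_1$. The chain-rule product accounts for that dependence, while the naive product of unconditional Bayes factors does not.

```python
from fractions import Fraction as F

# Hypothetical probabilities for two pieces of evidence under each hypothesis.
# p_e1[h] = P(E1 | h); p_e2_given[h][e1] = P(E2 | E1 occurred or not, h)
p_e1 = {"H1": F(8, 10), "H2": F(4, 10)}
p_e2_given = {
    "H1": {True: F(95, 100), False: F(20, 100)},
    "H2": {True: F(90, 100), False: F(10, 100)},
}


def p_e2(h):
    """Marginal P(E2 | h), averaging over whether E1 occurred."""
    return p_e1[h] * p_e2_given[h][True] + (1 - p_e1[h]) * p_e2_given[h][False]


# Correct decomposition: BF for E1 times BF for E2 conditional on E1.
bf_e1 = p_e1["H1"] / p_e1["H2"]
bf_e2_given_e1 = p_e2_given["H1"][True] / p_e2_given["H2"][True]
bf_chain = bf_e1 * bf_e2_given_e1

# Naive version: wrongly treats E1 and E2 as conditionally independent.
bf_naive = bf_e1 * p_e2("H1") / p_e2("H2")

print(f"chain rule: {float(bf_chain):.2f}")
print(f"naive:      {float(bf_naive):.2f}")
```

Here the correct Bayes factor is $19/9 \approx 2.1$, while the naive product is $80/21 \approx 3.8$, so ignoring the dependence overstates the evidence by almost a factor of two.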
]]></description>
</item>
<item>
<title>Your R code may not be yours</title>
<link>https://max.pm/posts/r_not_yours/</link>
<guid>https://max.pm/posts/r_not_yours/</guid>
<description><![CDATA[

    This is not legal advice.


tl;dr: Your R analysis code likely has to be licensed under the GNU GPL. My r-snippets README tells you how to proceed for maximum legal compliance.



I have a confession: Until recently, I believed my research code was mine to license however I pleased. Then I discovered a significant legal gray area that should concern anyone sharing R code publicly.

While preparing a replication package, I stumbled onto a contested question in open source licensing: If your R code loads certain popular GPL-licensed packages, can you legally license it however you want? The answer is uncertain, but the safer and more logical interpretation suggests you cannot, strictly speaking.



Many of the most popular R packages used in econometrics and empirical research are licensed under the GPL (GNU General Public License). The sandwich package is perhaps the most important case. It’s utterly ubiquitous for robust standard errors. But it’s far from alone: lmtest, plm, fixest, AER, ivreg, and MASS are all GPL-licensed. These aren’t obscure packages. These are veritable workhorses of empirical economics.

Now, if these packages were licensed under the LGPL (Lesser GPL), there would be no controversy. The LGPL explicitly permits linking to libraries without viral copyleft effects. But they aren’t. They’re licensed under the GPL, and this is where things get complicated.

According to the Free Software Foundation’s interpretation of the GPL, code that loads and uses GPL libraries is a “combined” or derived work and must itself be licensed under the GPL. Under this reading, that library(sandwich) at the top of your analysis script makes your entire file a GPL work. Your carefully crafted LICENSE file declaring your replication code as CC0 or MIT is potentially invalid. You may have inadvertently contravened copyright law.



There is no settled case law on whether loading a GPL library in an interpreted language creates a derivative work under copyright law. This matters enormously, and there are two competing theories:

The FSF’s position: The GPL FAQ states that when an interpreter provides “bindings” to GPL facilities, the interpreted program is “effectively linked” to those facilities. Under this view, library(sandwich) creates a combined work that must be distributed under GPL terms. This interpretation has some logical force: if there were no difference between GPL and LGPL for interpreted languages, why would LGPL exist? Under the idea of “software freedom” espoused by the FSF, it simply cannot make a difference whether the language is interpreted or not, as the GPL is supposed to protect users’ rights. These rights are paradoxically best-protected by a maximally infectious copyleft license.

The R community’s position: The well-known “R Packages” book by Hadley Wickham and Jennifer Bryan states it’s their “personal opinion that the license of your package doesn’t need to be compatible with the licenses of R packages that you merely use by calling their exported R functions.” The R Foundation clarified in 2009 that R code doesn’t need to be GPL-licensed just because it uses R. Thousands of CRAN packages use MIT licenses despite likely depending on GPL packages, suggesting widespread acceptance of this interpretation. It is certainly correct that the mere use of R the programming language itself does not impose a particular license on R code, but interfacing too closely with R APIs may (see below).

Which interpretation is correct? Legally, we don’t know. There’s been no court case. But here’s the critical point: the FSF’s interpretation is safer from a compliance perspective. If you want to be conservative about license compliance, treating your code as GPL when it loads GPL packages is the less risky choice. (Indeed, the risk is zero, but it is sad that we cannot be even more liberal.) The FSF’s position is also logically more convincing:

Perhaps I can use an economic analogy: if the use of a GPL-licensed work is highly substitutable, the GPL does not apply to your use. But if the use is not highly substitutable, the GPL applies. In other words, if your code is so narrow that it relies on a particular (GPL-licensed) implementation, it is a derivative. However, copyright determinations involve many factors (creativity, expression vs idea, transformative use), so this is more of a heuristic. Nonetheless, as analysis code often works around particular, highly specialized implementations, it is likely to constitute a derivative.

There is an important exception: if you’re only using your code privately or within your organization, the GPL doesn’t restrict you. The copyleft provisions only apply when you distribute the code to others. But when you submit a replication package to a journal, post your code on GitHub, or share your analysis with another scientist, you’re distributin’.



I suspect there are thousands of replication packages sitting in journal data repositories right now with licensing terms that might be problematic under the stricter interpretation. Researchers who carefully chose licenses for their code, possibly unaware that this legal question even exists. Code released as “public domain” or CC0 that might need to be GPL.

Are these researchers violating copyright law? Under the FSF’s interpretation, possibly. Under the R community’s interpretation, no. The legal ambiguity around interpreted languages means there’s room for disagreement, even among sophisticated users of open source software. That uncertainty itself is a problem when we want clear terms for code reuse.



When I discovered this issue, I had to make a choice for my r-snippets repository. These snippets use (excellent) GPL packages. I could have relied on the permissive interpretation, but I opted for the conservative approach: I updated the README to license them under GPL.

For my own replication package, we’re taking the same conservative approach, even though I tend to prefer more permissive licenses. (It depends on context, though.)



Let me be very clear about one thing: If you’re writing an R package meant to be used as a library by others, strongly consider LGPL instead of GPL.

The LGPL (GNU Lesser General Public License) was designed precisely to avoid this ambiguity. It allows your library to remain free and open source while explicitly permitting users to link to it without their code becoming a derived work. No legal uncertainty, no competing interpretations. This works as long as your library doesn’t bind too closely to R-internal APIs. If your library uses standard features of R the language, LGPL is fine.

I’ve done exactly this with some of my own software (unrelated to R). uproot is licensed under the LGPL specifically to avoid creating this problem for users. If you’re writing a library that’s meant to be called by other people’s code (which is basically the definition of an R package), LGPL removes all ambiguity. It is fundamentally a question about where to draw the line. The LGPL draws the line a bit closer than the GPL.



The one piece of good news: the GPL applies to code, not to data or to output generated by that code. You can still license your datasets under CC0 or CC-BY or whatever terms are appropriate. The GPL doesn’t “infect” your data, only your code. My r-snippets README contains further information.



Licensing really matters. And open science depends on clarity about reuse rights. The current situation creates uncertainty: If I find your replication package with a LICENSE file saying MIT but your code loads GPL packages, what am I supposed to conclude? That you’re following the permissive interpretation? That you’re unaware of the question? That you researched it and made a conscious choice?

The GPL is not a bad license. It has served the free software community well for decades, and many people prefer its strong copyleft provisions. I do too, in some cases. But in the context of interpreted languages and package dependencies, its requirements are unclear. If you’re going to use GPL code (and if you’re doing econometrics in R, you almost certainly are), you should at least understand that this legal uncertainty exists and decide which interpretation you’re comfortable with.



So check your code. Check your licenses. If you’re loading GPL packages, you face a choice: follow the conservative FSF interpretation and license your code as GPL, or rely on the R community’s permissive interpretation. Given the uncertainty, the GPL approach is arguably safer. But whichever you choose, you should make that choice consciously.

Your R code might not be yours to license freely. The law is unclear, but now you know the question exists.
]]></description>
</item>
<item>
<title>Using the Framework Laptop 13 with Debian 13 (trixie)</title>
<link>https://max.pm/posts/framework_13_debian/</link>
<guid>https://max.pm/posts/framework_13_debian/</guid>
<description><![CDATA[
I recently ordered and obtained a brand-new Framework Laptop 13 with AMD Ryzen™ AI 300 Series - Ryzen™ AI 7 350.

Changes to this document


    2025-10-01
    Post created.

    2025-10-04
    Added microphone configuration.

    2025-10-06
    Added touchpad notes.

    2026-01-24
    Added NVMe crash investigation and fix.

    2026-02-04
    Added SSD disappearance issue. Switched to 60W charger.

    2026-03-16 (current version)
    Updated SSD disappearance issue: the problem has recurred repeatedly.




Assembly

Needless to say, I ordered the DIY edition. The official guide seems basically correct, though I was unable to identify a “white line” (Step 8).

Linux kernel

Linux 6.12.48 works very well. However, I took the liberty to add Backports and install Linux 6.16.3 (apt install linux-image-amd64/trixie-backports). The issues described below applied to both versions.

So, despite what others write on the Internet, it literally does not matter one bit. Linux 6.12.48, which comes with Debian trixie, is recent enough. Wi-Fi works perfectly well, and so does Bluetooth and everything else I can think of. I am writing these lines on kernel 6.12.48.

Always just use Debian. You'll never need to worry about anything. I recommend completely ignoring other Linux distros, especially Ubuntu and similar slop.

I also recommend using btrfs, but that too is not specific to a Framework laptop.

Graphics issue

I am a user of dwm and I encountered the following issue: after I had played a video, the screen essentially froze in place and refused to redraw those parts on the screen that had changed. Whenever I switched to another dwm “tag” (a workspace, if you will), the screen was redrawn, however.

In any case, that was fixed by putting amdgpu.dcdebugmask=0x12 into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running sudo update-grub. This disables PSR (Panel Self-Refresh).

By the way, screen brightness can be controlled by means of teeing to /sys/devices/pci0000*/*/*0/drm/card0/card0-eDP-1/amdgpu_bl0/brightness.
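
For those who prefer a percentage over a raw value, here is a minimal Python sketch of the same idea. It assumes the backlight directory exposes the standard max_brightness file next to brightness; the function name set_brightness is made up for illustration, and writing to sysfs requires root.

```python
from pathlib import Path


def set_brightness(backlight_dir, percent):
    """Write a brightness level given as a percentage of max_brightness."""
    backlight = Path(backlight_dir)
    # max_brightness holds the largest raw value the panel accepts.
    max_raw = int((backlight / "max_brightness").read_text().strip())
    # Clamp the request to the 0..100 range before scaling.
    clamped = max(0.0, min(100.0, float(percent)))
    raw = round(max_raw * clamped / 100)
    (backlight / "brightness").write_text(f"{raw}\n")
    return raw


# Usage (as root), resolving the wildcard path quoted above:
#   import glob
#   directory = glob.glob(
#       "/sys/devices/pci0000*/*/*0/drm/card0/card0-eDP-1/amdgpu_bl0")[0]
#   set_brightness(directory, 40)
```

Working in percentages keeps the call independent of the panel's raw scale, which differs between devices.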

Bluetooth/audio issue

Regrettably Debian has decided to go full pipewire. Never go full pipewire. I like pulseaudio especially because it has always worked flawlessly (as has systemd, by the way—another irrelevant controversy to the detriment of users). One of the issues with pipewire is that I simply could not get it to work with Bluetooth devices. Blueman just showed No audio endpoints registered and refused to connect with my headphones. Audio did play, but pavucontrol (another excellent piece of software) was no help in getting it all to work.

That one was resolved with systemctl --user disable --now pipewire pipewire-pulse wireplumber pipewire.socket pipewire-pulse.socket &amp;&amp; sudo systemctl --global mask pipewire pipewire-pulse wireplumber pipewire.socket pipewire-pulse.socket and then systemctl --user enable --now pulseaudio.service pulseaudio.socket. Remember to install pulseaudio-module-bluetooth, which automatically pulls in pulseaudio.

To get the built-in microphone (the “Internal Stereo Microphone”) to work, you need to use the “Play HiFi quality Music (Mic1, Mic2, Speaker)” profile of your Family 17h/19h/1ah HD Audio Controller. See the Configuration tab in pavucontrol.

NVMe stability

On January 20 and January 24, I encountered random crashes in the middle of the night. The screen started flickering and the device was completely inoperable. Further investigations revealed that these crashes occurred under heavy I/O, namely during my nightly btrfs scrub start / cronjob. The issue could be reproduced by running the same plus glmark2 --run-forever plus having some spiky CPU benchmarks. These crashes were presaged by audit messages in dmesg and concluded with nvme nvme0: controller is down; will reset. The fix recommended in these messages and on the internet (adding kernel parameters) was not quite sufficient, but read on.

After a tremendous amount of research, I arrived at a theory: an insufficient power supply triggers buggy NVMe power state management. At home, I use a 30W USB-C charger, not the 60W charger I purchased with the laptop. Under heavy load, 30W is not sufficient (even 60W can be borderline), and the power controller will draw some current from the battery to make up for the difference. That in turn causes the battery to enter the discharging state (as may be verified with upower). This kicks in a feature called PCIe Dynamic Link Power Management. As anyone knows, such features are inherently problematic on Linux. BAZINGA!

I was able to resolve this issue by disabling that feature in the BIOS. I am for now also keeping the suggested kernel parameters (nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off plus amdgpu.dcdebugmask=0x12, see above). I have not yet tried removing these kernel parameters. Disabling PCIe Dynamic Link Power Management appears to be the key causal factor (but there may be interactions, and for now I am happy to throw these parameters at the kernel). Despite substantial effort, I have been unable to reproduce the crashes under this configuration. I will update this post should that change. Note that disabling these power management features increases power consumption.
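
To check after a reboot that these parameters actually reached the kernel, a small sketch along the following lines may help. The helper name missing_parameters is hypothetical; the parameter list is copied from the paragraph above, and /proc/cmdline is where Linux exposes the command line of the running kernel.

```python
# Kernel parameters named in the post.
REQUIRED_PARAMETERS = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
    "amdgpu.dcdebugmask=0x12",
)


def missing_parameters(cmdline, required=REQUIRED_PARAMETERS):
    """Return the required parameters that do not appear in cmdline."""
    # The kernel command line is a whitespace-separated list of tokens.
    present = set(cmdline.split())
    return [parameter for parameter in required if parameter not in present]


# Usage on the running system:
#   with open("/proc/cmdline") as handle:
#       print(missing_parameters(handle.read()) or "all parameters present")
```

An empty list means every parameter was picked up. Anything else suggests that update-grub was not run or that the wrong boot entry was used.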

The issue has not recurred since.

SSD disappearance

Separately, I once awoke to a screen saying that no operating system could be found. Once again, this happened during my nightly btrfs scrub start / run. After a reboot, everything once again looked normal.

While WD BLACK SN850 is known to suffer from “sudden death,” the SN850X that I have is not immune either. Reports of BSODs, drive disconnections, and ASPM-related crashes are not hard to find. However, all SMART data looked entirely normal. Despite tremendous effort, I was unable to figure out the core issue, or to reproduce the crash. I thus followed the ubiquitous recommendation to “reseat” the SSD (though I had already put it in very tightly and there was no discernible slack), and switched to the official 60W charger.

Unfortunately, the SSD disappearance has recurred again and again—roughly once every two weeks, always during the nightly btrfs scrub, and even under the appropriate 60W charger. This is frustrating. The issue remains unresolved.

Overall assessment

I am impressed by the Framework Laptop 13. It's a high-performing laptop, aesthetically pleasing, and pretty compatible with Debian GNU/Linux. The keyboard is good, too (especially because it is very quiet).

The build quality in general is very good. Right-clicking works through the bottom right corner of the touchpad. Compared to the keyboard, clicking in general is kind of noisy, which is a downside.

One issue that could limit my enjoyment is that it has only four slots for expansion cards, with one of the four taken up by the power supply. So far this has remained only a theoretical concern. Also, I would like to get a Mini-DisplayPort expansion card.

After 4 months of usage, I am still happy with my Framework 13.

Benchmarks

Here is some further information about my previous laptop (a ThinkPad T440s) and my new Framework, including hardinfo2 benchmarks.


]]></description>
</item>
<item>
<title>oTree Room Label Generator</title>
<link>https://max.pm/posts/label_generator/</link>
<guid>https://max.pm/posts/label_generator/</guid>
<description><![CDATA[
This web app generates short, unique, and optically unambiguous oTree room labels.

Open web app
]]></description>
</item>
<item>
<title>Know the Formats</title>
<link>https://max.pm/posts/formats/</link>
<guid>https://max.pm/posts/formats/</guid>
<description><![CDATA[
How can we harness groups to make effective, good decisions? We can think of many group decisions as being differentiated on two dimensions: how complex the decision is and how aligned the interests of group members are.

To say anything meaningful at all about this topic, it is necessary to stay quite general and vague. For example, what does it even mean for a decision to be “effective” or “good”? You are free to replace these words with the long-term satisfaction of all group members. I will come to examples below.

Decisions that involve fewer moving parts or that are simpler or easier to verify obviously require less time to think. When it comes to groups, there is also less need for discussions or what is sometimes called “deliberation.” On the other hand, complex topics need to be thoroughly talked through, subdivided, inspected, and finally once again put together.





The second dimension also has very clear implications. If group members tend to share the same goal, there will be more trust within the group. If group members have distinct objectives, suspicion and doubt will often affect which decisions are made. These issues can impede free discussions, and thus cause suboptimal decision-making and decrease satisfaction. This dimension, too, is almost obvious: a multitude of objectives decreases the likelihood that any joint decision can satisfy every member's goal. If we define trust as an expectation about the latter, it becomes clear that unaligned interests do not promote trust.





These two dimensions have important implications for the question of how to structure group decision-making. Various mechanisms, or formats, have different properties that can strengthen the quality of group decision-making in some circumstances and weaken it in others. It is foolish to believe that one mechanism solves all problems. It is important that groups can find ways to implement diverse decision-making formats for diverse problems.





One example is the new revolt against technocracy and “experts” (quadrant I). Experts interface well with highly complex problems under the assumption that objectives are clearly defined and experts can help satisfy these objectives. This example highlights the vertical axis as having a normative character. As I have written before, policy cannot work without values. If people have an objective that conflicts with the stated or implied goal that experts are tasked with reaching (for example: the ascertainment of scientific truth), dissatisfaction is inevitable. If values don't matter, however, delegation to experts or expert bodies can be highly effective. But values often really, really matter, no matter how strongly many try to ignore it.

Another example concerns majority voting. This mechanism is well-known for its simplicity, cost-effectiveness, and seeming universal applicability. Nonetheless, the story of the 51 percent oppressing the 49 percent is well-known. Raw majority voting is unable to handle complex scenarios, as evidenced by smart hacks such as logrolling. Moreover, voting works without further assumptions only between two options. If preferences are single-peaked, it also works with more options, but that is unlikely with multidimensional questions. Democracy works best if all citizens share at least some objectives and decisions are simple (quadrant II).
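
A small Python sketch may make the point about more than two options concrete. With three voters and three options, pairwise majority voting can cycle, so no option beats all others. With single-peaked preferences, the cycle disappears. The two profiles and the function names below are illustrative assumptions, not taken from any particular source.

```python
def pairwise_winner(profile, x, y):
    """Return the option a strict majority ranks higher, or None on a tie."""
    # Each ranking lists options from most to least preferred.
    x_votes = sum(1 for ranking in profile if ranking.index(y) > ranking.index(x))
    y_votes = len(profile) - x_votes
    if x_votes == y_votes:
        return None
    return x if x_votes > y_votes else y


def condorcet_winner(profile):
    """Return the option that beats every other option pairwise, if any."""
    options = profile[0]
    for candidate in options:
        rivals = [other for other in options if other != candidate]
        if all(pairwise_winner(profile, candidate, rival) == candidate for rival in rivals):
            return candidate
    return None


# A beats B, B beats C, and C beats A: majority rule runs in a circle.
cyclic = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

# Single-peaked along the line A - B - C: each voter likes options less
# the further they lie from that voter's favorite.
single_peaked = [("A", "B", "C"), ("B", "A", "C"), ("C", "B", "A")]

print(condorcet_winner(cyclic))         # None
print(condorcet_winner(single_peaked))  # B
```

In the single-peaked profile, the winner is the favorite of the voter in the middle, which is why the one-dimensional case is so much better behaved than the multidimensional one.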

Unanimity can overcome some of these challenges (quadrant III), but it is procedurally tough: it might require substantial “side payments” between players, and is in general highly complex even for simple problems. The great—and obvious—thing about unanimity rules is that they produce a lot of satisfaction (Pareto efficiency) because they avoid the tyranny of the majority. Unfortunately, it is too tedious—where the costs of dissatisfaction are not too high (where there is more alignment of interests), majority rule works better.

The common law is a good example of a mechanism that happens to overcome unaligned interests at high complexity (quadrant IV). In any court of law, interests are inherently opposed. While many matters are simple, there is no limit to the complexity that the common law can handle. The common law is able to do this because it relies on abstract rules that categorically provide solutions for certain classes of problems. One great example of this is the reasonable person standard in negligence law, which provides a consistent framework for determining liability across countless different scenarios by asking what a hypothetical reasonable person would have done in similar circumstances. Since everyone can form their expectations around these solutions, and act accordingly, the common law prevents many issues in the first place, or provides litigants with quasi-dictatorial decision-makers (judge and jury) to settle issues once and for all. While a common law court can in theory settle all problems, it's quite an expensive process, and also probably not fully legitimate from first principles. The former issue is so severe that litigation is often forgone in favor of settlements, i.e., unanimity.



On another note, I have decided to cut back on conference attendance. While I believe that scientific conferences are invaluable for more junior scientists to form connections, I am currently focusing more on improving my research. The problem is that conference audiences often do not have aligned interests, yet are dealing with highly complex issues (quadrant IV). It follows from the discussion above that research simply cannot be improved by fifteen-minute slots with five minutes reserved for “discussion.” To be clear: I cannot really blame the other attendees, as there is just far too much material. (I do blame the organizers, however. Why not leave the afternoons free for scientists to interact?) This is in contrast to more highly self-selected seminars and reading groups, where the material gains focus and there is more space for ideas.

While conferences can be useful in promoting ideas and networking (tasks that have low complexity), they are not made for depth. Ivan Illich was a social critic who frequently met with friends and other discussants. He threw out a provocative idea and then discussed it to the end in all its nuance. Crucially, his audience were people who shared his goal of brutally and utterly dissecting modern institutions. Although I have many differences with Illich, he was able through in-depth discussions to truly get to the bottom of issues. Ivan knew the formats. The goal of research is to find truth, and I no longer believe that conferences are the right tool for that.

For the time being, I will spend my time mostly elsewhere: with other scientists who care, or deeply immersed in my work and that of others.
]]></description>
</item>
<item>
<title>My review of the Daylight DC-1 tablet</title>
<link>https://max.pm/posts/daylight_dc1/</link>
<guid>https://max.pm/posts/daylight_dc1/</guid>
<description><![CDATA[

    The Daylight Computer is a tablet that promises to be a more caring computer, offering a distraction-free experience focused on reading while simultaneously granting access to the rich Android ecosystem. I have now spent some time with my DC-1. This is a review.



    This post is not sponsored, nor was I provided with the product for free. I will be updating this post as necessary.




Getting the DC-1


    By far the most difficult aspect in this whole episode was actually obtaining the device and navigating customer service. I made a pre-order on September 14, 2024, after reading about Daylight on Marginal Revolution. I was subsequently invited to place a full order on October 7, which I did. At that point, I was told that the device would arrive by the end of Q4 2024. On December 16, I was informed that the device ought to ship by the end of January 2025.



    This did not suit me, as I had a long vacation planned for January and February. I thus contacted the Daylight offices via email and a text message, but to no avail. Ultimately, I reached them via Twitter and email, and I was promised that my order would not ship before February 24.



    Thus imagine my surprise when on January 29, I received a shipping confirmation. Almost needless to say, this is a massive boo-boo and screw-up on their part. Now Daylight Computer Co. is a startup. Also, I try to be a simple customer and in this instance I was not. I will spare you the tremendous effort and luck that was required for me to eventually obtain the device near the end of Q1 2025, and I will just note that the organizational aspects of the experience still need tremendous improvement. The same is true for a VAT refund they had to perform.




What I got and overall value for money

I obtained the DC-1 including the travel pouch and a stylus. The total cost, including the pre-order and taking into account the VAT refund, comes to about €709.

Now this may seem quite expensive for an Android tablet—and it sort of is. However, the best Android tablets for stylus usage have similar price tags. I will come to some hardware disappointments below, but in general I would say that the DC-1 has the best stylus support in any Android tablet I have used so far. The latency is minimal; pressure is detected well; palm rejection works flawlessly. However, it should be noted that the standard stylus is “simple,” but to me that is perfect. (I am very basic.)

I do find it disappointing that a hardware keyboard was not included. Some early customers appear to have gotten the Apple Magic Keyboard, but I do not know whether that was included in a price similar to what I paid.

The bottom line is: the pricing is barely acceptable, but still acceptable.

Hardware disappointments

As a device, the DC-1 is slightly paradoxical. It aims to straddle the narrow road between the Kindle and full-fledged tablets. This makes trade-offs inevitable.

The DC-1 does not have a mobile broadband modem. It does not have a camera (but it does have a microphone and a loudspeaker). It does not have a GPS module. The latter is especially unfortunate for me. By itself, the DC-1 would be perfect for usage with OsmAnd—maps are snappy and work well. But without GPS, it is mostly for naught.

Moreover, Bluetooth appears to be slightly flaky. I have encountered repeated disconnects when playing music or videos on the DC-1 and having it connected to my WH-1000XM4 headphones.

The most important issue, however, is with the screen. First, some websites claim that the DC-1 has a density of 190 DPI, which is just too little. Even my Kindle Paperwhite has 300 DPI!
Second, the contrast is not great. The black is just not black enough.
While both of these aspects are still acceptable in purely practical terms, they could reasonably be expected to be better.


The DC-1 appears not to allow the stylus to magnetically attach to the tablet (or I was too stupid to figure out how it works).

Positive hardware aspects

The device is very fast and they rightly decided on having a high refresh rate for the display. While DPI and contrast are not great, the device performance itself is at the upper echelons of Android tablets. Apps can be installed very quickly.

The battery is another plus. I have not made any formal tests, but even with Wi-Fi on, the DC-1 uses very little power in standby mode.

The yolk-colored backlight works well. As expected, the display also performs well in the sunlight (but not wholly free of any reflection). The build quality in general is good (but the DC-1 is quite heavy). I do wonder about repairability and recyclability, though.



My usage patterns

My original impetus for purchasing the DC-1 comes from the fact that reading PDFs on the Kindle is just a massive pain. The Kindle's PDF viewer is slow, buggy, highly limited in functionality and especially problematic when it comes to smaller font sizes or documents with multiple columns (thanks, Elsevier!). Moreover, it is somewhat awkward to send PDFs to the Kindle via email (though this also has its advantages, and Send to Kindle does exist). To add to the disaster that are PDFs on the Kindle, reading even the simplest among them involves massive battery usage. No bueno!

Once I obtained my DC-1, I immediately installed and set up F-Droid and Syncthing (the Syncthing-Fork app, to be precise). I can now copy PDFs to a folder on my other devices and they appear on the DC-1 straight away without any cloud involvement! This highlights that the DC-1 truly is a genuine Android tablet. You can thus choose to go with cloud-based, proprietary services, but you are also free to use open-source tools. I also purchased Notewise Unlimited, an app that appears to work well. It allows me to open my PDFs in the app, add various annotations if I so desire, and export the result to another PDF. As I said: the stylus support is excellent, and I have signed many a PDF now.

The DC-1 also includes a PDF reader that allows some basic annotations. The Niagara Launcher comes preinstalled. The Google Play Store is available.

Browsing the ol' interwebs is no problem. Thanks to the DC-1's performance, all websites pretty much work without disappointment. Two other apps I like using are Wikipedia and FairEmail. I often use Android's “Split Screen” mode to open two apps, typically FairEmail and Notewise.

Other than the Niagara Launcher, the Android experience on the DC-1 is quite “stock.” To me, that is great. But if you are used to Xiaomi devices or similarly highly tuned Android experiences, you may find the DC-1's OS a bit boring. But that boring nature appears to be a great part of the appeal of the DC-1.

I have also successfully Zoomed using my DC-1. While the lack of a camera appears to limit the usefulness of the device for videoconferencing, it also provides a convenient and wholly truthful justification for not turning on one's camera for the people one has to interact with. Thus, it is a good backup device. I expect to use the DC-1 in my lectures as well (to annotate slides on the fly).

Expectations and reality

It turns out that I use the DC-1 vastly more often and more consistently than I anticipated. So I guess my revealed preference goes in its favour. I have also installed the official Kindle app, but so far I have not used it much as my Kindle is still sufficient for mere e-books.

My opinion on the DC-1 is nuanced. It is, in principle, a good device, but it cuts some corners that ought not to have been cut: GPS would have been really neat, and more display contrast too. It may appear pricey, but it turns out not to be that pricey when considering the elevated prices of good Android tablets. The stylus and its support by software and hardware are utterly excellent. If I had to give the DC-1 a grade, I would give it a B. Having actually received and used the device, the marketing is a bit much to me. They also have to work on the customer support (although I agree that difficult customers like me are annoying!).

I recommend the DC-1 for those who need to work with PDFs or other possibly complex documents and who seek to reduce their dependence on noisy social media apps.

In general, I hope they succeed—their technological efforts certainly are laudable. I hope they commit to long-term support for the device or at least liberating the device (e.g., for use with LineageOS) should they at some point no longer wish to support it.

Update after almost one year (2026-01-14)

I barely use my DC-1 anymore. I have thought a bit about why, and it comes down to three issues that had already been mentioned elsewhere in this blog post. Here they are ordered by importance:

First, the device is relatively clunky and heavy. One thing I noticed after initially publishing this post is that my posture while reading and working on documents tends to gravitate towards the floor. This is a natural result of the DC-1 being heavy. But it also leads directly to the second problem:

Second, the resolution is not high enough. I had mentioned the remarkably low density of 190 DPI above, but I can now clearly say that it’s not enough. The display is not sharp enough. Combined with the first issue, if your DC-1 descends, this means you will squint and reading will get less comfortable.

However, I have to note that this also depends on the actual apps that you use. Regrettably, Notewise has a renderer that exacerbates this issue. Not crisp enough! I do not wholly recommend Notewise anymore (and I cannot for the life of me understand why it is so highly praised, except for how feature-rich it is, which is true!). The preinstalled PDF reader that comes with the DC-1 is better on this dimension, but it is simply too limited feature-wise. Generally speaking, I could still recommend the DC-1 were it not for the third issue:

Third, the contrast is not large enough. Once again, I had mentioned this above (“The black is just not black enough.”), but it really becomes noticeable over time. My Kindle Paperwhite (fourth iteration), a now dated and by no means particularly excellent device, has a better contrast! When using the amber backlight with the DC-1, black shows as a dark-ish brown (not even that dark—more like chocolate). The lack of crispness both in color space and with respect to resolution are serious problems.

Daylight Computer has, to this day, not committed to liberating the DC-1 for usage with LineageOS or other Android-based operating systems. It is in 2026 still running a rusty Android 13. I also do not know how much there really is to the idea that colors distract. I guess it varies from person to person.

So, overall, I am now more disappointed than I was at the beginning. It is not as attractive as I thought.

Changes to this document


    2025-03-15
    Post created.

    2026-01-14
    Added table of contents.

    2026-01-14 (current version)
    Added update and struck out recommendation.

]]></description>
</item>
<item>
<title>My election model for the 2024 United States presidential election</title>
<link>https://max.pm/posts/potus2024/</link>
<guid>https://max.pm/posts/potus2024/</guid>
<description><![CDATA[

    So this was quite wrong! In any case, running Monte Carlo simulations was very useful to me during the night of the election. I put the states in Harris's and Trump's categories as the results came in, and once Pennsylvania was called for Trump, P(Trump) went from 0.995 to about 1.000. I guess I'll do more simulations from now on (and more fact-finding missions!).




My FINAL election model is available here. Based on this simulation, I currently¹ give Donald John Trump a 44.6% chance of becoming the next president. In this simulation, the expected number of electoral college votes is 262.6. The median is 264. The mode is 270. The skewness is &minus;0.27.

You can use these files to build your own model. Unless otherwise noted, both files are licensed under the CC0 1.0 Universal license, or (at your option) any later version.

Plots




Key assumptions


    States are independent.
    If both candidates obtain exactly 269 electoral votes, Donald John Trump wins.
    Electors vote as pledged.
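
The kind of simulation these assumptions describe can be sketched in a few lines of Python. This is not the model linked above. To stay short, it lumps all states outside seven swing states into two fixed blocs (219 electoral votes for Trump and 226 for Harris, a simplification the linked model does not make) and takes the final swing-state probabilities from the changelog below. Its output will therefore not match the 44.6% quoted above, and all names in the code are illustrative.

```python
import random

# Assumption for brevity: everything outside the swing states is treated as certain.
SAFE_TRUMP_EV = 219
SAFE_HARRIS_EV = 226

# state: (electoral votes, probability that Trump wins the state)
SWING_STATES = {
    "Arizona": (11, 0.65),
    "Georgia": (16, 0.60),
    "Michigan": (15, 0.55),
    "Nevada": (6, 0.55),
    "North Carolina": (16, 0.70),
    "Pennsylvania": (19, 0.65),
    "Wisconsin": (10, 0.65),
}


def draw_trump_votes(rng):
    """Simulate one election; states are drawn independently (assumption 1)."""
    votes = SAFE_TRUMP_EV
    for electoral_votes, p_trump in SWING_STATES.values():
        if p_trump > rng.random():
            votes += electoral_votes
    return votes


def simulate(n_draws=100_000, seed=2024):
    """Return (share of draws Trump wins, mean electoral votes for Trump)."""
    rng = random.Random(seed)
    draws = [draw_trump_votes(rng) for _ in range(n_draws)]
    # A 269-269 tie counts as a Trump win (assumption 2).
    wins = sum(1 for votes in draws if votes >= 269)
    return wins / n_draws, sum(draws) / n_draws


if __name__ == "__main__":
    p_trump, mean_votes = simulate()
    print(f"P(Trump) = {p_trump:.3f}, mean electoral votes = {mean_votes:.1f}")
```

Updating on called states, as on election night, amounts to setting a state's probability to 0 or 1 and rerunning the simulation.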


Changelog


    2024-10-13
    Alabama 0.9&nearr;0.99, Arizona 0.6&nearr;0.65, Arkansas 0.95&nearr;0.99, Florida 0.8&nearr;0.85, Idaho 0.95&nearr;0.99, Iowa 0.95&nearr;0.99, Kansas 0.95&nearr;0.99, Kentucky 0.95&nearr;0.99, Mississippi 0.95&nearr;0.99, Missouri 0.95&nearr;0.99, Montana 0.95&nearr;0.99, North Carolina 0.7&nearr;0.8, North Dakota 0.95&nearr;0.99, Oklahoma 0.95&nearr;0.99, South Dakota 0.95&nearr;0.99, Tennessee 0.95&nearr;0.99, Utah 0.95&nearr;0.99, West Virginia 0.95&nearr;0.99, Wyoming 0.95&nearr;0.99 &#8614; Trump 0.321&nearr;0.424

    2024-10-14
    Nevada 0.4&nearr;0.55 &#8614; Trump 0.424&nearr;0.437

    2024-10-17
    Nevada 0.55&searr;0.5, Arizona 0.65&searr;0.6, Georgia 0.6&searr;0.55, Michigan 0.5&nearr;0.55 &#8614; Trump 0.437&searr;0.424

    2024-10-19
    Wisconsin 0.5&nearr;0.6, Michigan 0.55&nearr;0.65 &#8614; Trump 0.424&nearr;0.46

    2024-10-20
    Arizona 0.6&nearr;0.65, Minnesota 0.1&nearr;0.2 &#8614; Trump 0.46&nearr;0.482

    2024-10-27
    Georgia 0.55&nearr;0.6, Nevada 0.5&nearr;0.55, Wisconsin 0.6&nearr;0.65 &#8614; Trump 0.482&nearr;0.505 (+ added formerly implicit assumption 3)

    2024-11-04 (final change)
    Florida 0.85&searr;0.7, Iowa 0.99&searr;0.7, Michigan 0.65&searr;0.55, New Hampshire 0.2&nearr;0.25, North Carolina 0.8&searr;0.7, Ohio 0.8&nearr;0.9, Pennsylvania 0.55&nearr;0.65, Virginia 0.15&nearr;0.25 &#8614; Trump 0.505&searr;0.446




¹ As of 2024-11-04 21:21:13 +0100.



]]></description>
</item>
    </channel>
</rss>
