It is possible that our universe is infinite in both time
and space. We might therefore reasonably consider the following question: given
some sequences $u = (u_1, u_2,\dots)$ and $u' = (u_1', u_2',\dots)$ (where each $u_t$
represents the welfare of persons living at time $t$), how can we tell if $u$
is morally preferable to $u'$?

It has been demonstrated that there is no “reasonable”
ethical algorithm which can compare any two such sequences. Therefore, we want to
look for subsets of sequences which can be compared, and (perhaps retro-justified)
arguments for why these subsets are the only ones which practically matter.

Adam Jonsson has published a preprint of what seems to me to
be the first legitimate such ethical system. He considers the following: suppose
at any time $t$ we are choosing between a finite set of options. We have an
infinite number of times in which we make a choice (giving us an infinite
sequence), but at each time step we have only finitely many choices. (Formally,
he considers Markov Decision Processes.) He has shown that an ethical algorithm he
calls “limit-discounted utilitarianism” (LDU) can compare any two such sequences,
and moreover the outcome of LDU agrees with our ethical intuitions.

To my knowledge, this is the first time we have had some justification for thinking that a certain algorithm is all we will "practically" need when comparing infinite utility streams.

Limit-discounted Utilitarianism (LDU)

Given $u = (u_1, u_2,\dots)$ and $u' = (u_1', u_2',\dots)$ it seems
reasonable to say $u\geq u'$ if

$$\sum_{t = 0} ^ {\infty} (u_t - u_t') \geq 0$$

Of course, the problem is that this series may not converge
and then it’s unclear which sequence is preferable. A classic example is the
choice between $(0, 1, 0, 1,\dots)$ and $(1, 0, 1, 0,\dots)$. (See the example below.)

LDU handles this by using Abel summation. Here is a rough explanation of how that works.

Intuitively, we might consider adding a discount factor $0<
\delta< 1$ like this:

$$\sum_{t = 0} ^ {\infty} \delta ^ t (u_t - u_t') \geq 0$$

This modified series may converge even though the original
one doesn’t. Of course, this convergence is at the cost of us caring more about
people who are born earlier, which might not endear us to our children.
Abel summation removes this bias by taking the limit of the discounted sum as
$\delta\to 1$ from below.

LDU has a number of desirable properties, which are
summarized on page 7 of this
paper by Jonsson and Voorneveld. I won’t go into them much here other than
to say that LDU generally extends our intuitions about what should happen in
the finite case to the infinite one.

Example

Suppose we want to compare $u = (1, 0, 1, 0,\dots)$ and $u' = (0, 1, 0, 1,\dots)$. Let's take the standard series:
$$\begin{align}
\sum_{i = 0} ^\infty (u_i - u_i') & = (1-0) + (0-1) + (1-0) + (0-1) +\dots\\
& = 1-1+1-1+\dots\\
& =\sum_{i = 0} ^\infty(-1) ^ i
\end{align}$$
This is Grandi’s
series, which famously does not converge under the usual definitions of convergence.

LDU, though, inserts a discount term $\delta$ to get:

$$\sum_{i = 0} ^\infty (-1) ^ i\delta ^ i =\sum_{i = 0}
^\infty (-\delta) ^ i $$

It is clear that this is simply a geometric series, and
we can find its value using the standard formula for geometric series:

$$\sum_{i = 0} ^\infty (-\delta) ^ i =\frac{1}{1 +\delta}$$

As $\delta\to 1$, this tends to $1/2$.

Therefore, the Abel sum of this series is one half, and, since $1/2 > 0$, we have determined that $(1, 0, 1, 0,\dots)$ is better than (morally preferable to) $(0, 1, 0, 1,\dots)$.

This seems kind of intuitive: as you add more and more
terms, the partial sums of the series oscillate between one and zero, so in some
sense the limit of the series is one half.
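To make the limiting behaviour concrete, here is a minimal numerical sketch. It is only an illustration, not Jonsson's formal definition: the function name `discounted_sum` and the truncation at `n_terms` are my own assumptions, and a finite truncation can only approximate the infinite discounted sum.

```python
def discounted_sum(u, u_prime, delta, n_terms=50_000):
    """Truncated version of sum_t delta^t * (u_t - u'_t), with t starting at 0."""
    total = 0.0
    weight = 1.0  # holds delta**t
    for t in range(n_terms):
        total += weight * (u(t) - u_prime(t))
        weight *= delta
    return total

# The two streams from the example, indexed from t = 0.
u = lambda t: 1 if t % 2 == 0 else 0        # (1, 0, 1, 0, ...)
u_prime = lambda t: 0 if t % 2 == 0 else 1  # (0, 1, 0, 1, ...)

# As delta approaches 1, the discounted sum approaches 1/2.
for delta in (0.9, 0.99, 0.999):
    print(delta, discounted_sum(u, u_prime, delta), 1 / (1 + delta))
```

Each printed pair should agree closely, matching the closed form $1/(1+\delta)$ for this particular pair of streams.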

Markov Decision Processes (MDP)

Markov Decision Processes, according to Wikipedia,
are:

At each time step, the process is in some state $s$, and the decision maker may choose any action $a$ that is available in state $s$. The process responds at the next time step by randomly moving into a new state $s'$, and giving the decision maker a corresponding reward $R_a(s,s')$.

The probability that the process moves into its new state $s'$ is influenced by the chosen action. Specifically, it is given by the state transition function $P_a(s,s')$. Thus, the next state $s'$ depends on the current state $s$ and the decision maker's action $a$.

At each time step the decision-maker chooses between a
finite number of options, which causes the universe to (probabilistically) move
into one of a finite number of states, giving the decision-maker a (finite)
payoff. By repeating this process an infinite number of times, we can construct
a sequence $u_1, u_2,\dots$ where $u_t$ is the payoff at time $t$.

Jonsson considers the set of all sequences generated by a decision-maker who
follows a single, time-independent (i.e. stationary) policy.
Crucially, he shows that LDU is able to compare any two streams generated by a stationary Markov decision process. [1]
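As a rough illustration of what "a payoff stream generated by a stationary policy" means, here is a toy sketch. The two-state process, its probabilities, and its rewards are all made up for the example; nothing here comes from Jonsson's paper.

```python
import random

# Hypothetical MDP: TRANSITIONS[state][action] lists (probability, next_state, reward).
TRANSITIONS = {
    "good": {"stay": [(0.9, "good", 1.0), (0.1, "bad", 0.0)],
             "risk": [(0.5, "good", 2.0), (0.5, "bad", 0.0)]},
    "bad":  {"stay": [(0.2, "good", 0.0), (0.8, "bad", 0.0)],
             "risk": [(0.6, "good", 1.0), (0.4, "bad", -1.0)]},
}

def payoff_stream(policy, start, n_steps, seed=0):
    """Follow a stationary policy (a fixed map state -> action) and record payoffs."""
    rng = random.Random(seed)
    state = start
    stream = []
    for _ in range(n_steps):
        outcomes = TRANSITIONS[state][policy[state]]
        weights = [p for p, _, _ in outcomes]
        # Sample one (probability, next_state, reward) triple.
        _, state, reward = rng.choices(outcomes, weights=weights)[0]
        stream.append(reward)
    return stream

cautious = {"good": "stay", "bad": "stay"}
print(payoff_stream(cautious, "good", 10))
```

The policy is stationary because `policy[state]` does not depend on the time step; a finite prefix like this is of course only a glimpse of the infinite stream $u_1, u_2,\dots$ discussed above.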

Why This Matters

My immediate objection upon reading this paper was “of
course if you limit us to only finitely many choices then the problem is soluble
– the entire problem only occurs because we want to examine infinite things!”

After having thought about it more though, I think this is
an important step forward, and MDPs represent an important and large class of
decision processes.

Even though the universe may be infinite in time and space,
in any time interval there are plausibly only finitely many states I could be in,
e.g. perhaps because there are only finitely many neurons in my brain.

(Someone who knows more about physics than me might be able
to comment on a stronger argument: if locality holds,
then perhaps it is a law of nature that only finitely many things can affect us
within a finite time window?)

Sequences generated by MDPs are therefore plausibly the
only set of sequences a decision-maker may need to practically consider.

Outstanding Issues

My biggest outstanding concern with modeling our decisions
with an MDP is that the payoffs have to remain constant. It seems likely that,
as we learn more, we will discover that certain states are more or less
valuable than we had previously thought. E.g. we may learn that insects are more
conscious than previously expected, and therefore insect suffering affects our
payoffs more highly than we had originally thought. It seems like maybe one
could have a “meta-MDP” which somehow models this, but I’m not familiar enough
with the area to say for sure.

A more theoretical question is: what sequences can be
generated via MDPs? My hope is that one day someone will show that LDU (or a
similarly intuitive algorithm) can compare any two computable sequences, but I
don’t think that this is that proof.

Lastly, we have the standard problems of infinitarian fanaticism
and paralysis. E.g. even if our current best model of the universe predicted
that MDP was exactly correct, there would still be some positive probability
that it was wrong and then our “meta-decision procedure” is unclear.

Conclusion

Overall, I don't think that this completely solves the problem of comparing infinite utility streams, but it's a large step forward. Previous algorithms like the overtaking criterion had fairly "obvious" incomparable streams, with no real justification for why those streams would not be encountered by a decision-maker. LDU is not complete, but we at least have some reason to think that it may be all we "practically" need.

I would like to thank Adam Jonsson for discussing this with me. I have done my best to represent LDU, but any errors in the above are mine. Notably, the justification for why MDPs are all we need to consider is entirely mine, and I'm not sure what Adam thinks about it.

1. This is not explicitly stated in Jonsson's paper, but it follows from the proof of theorem 1. Jonsson confirmed this in email discussions with me.

There is a standard result that if a "rational" agent is uncertain about what the outcome of events will be, (i.e. they have to choose between two "lotteries") then they should maximize the expectation of some utility function. Formally, if we define a lottery as $L=\sum_i p_i O_i$ where $\{O_i\}$ are the outcomes and $\{p_i\}$ their associated probabilities, then for any "rational" preference ordering $\preceq$ there is a utility function $u$ such that
$$E\left[u(L)\right]\leq E\left[u(L')\right] \leftrightarrow L \preceq L' \tag{1}$$
Traditionally, this is used when people aren't certain about what the outcomes of their actions will be. However, I recently attended an interesting presentation by Brian Hedden where he discussed using this in cases of normative uncertainty, i.e. in cases when we know what the outcome of our actions will be, but we just don't know what the correct thing to value is.
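For readers who want the lottery comparison spelled out, here is a small sketch. The two lotteries and the square-root utility function are hypothetical choices of mine, picked only to show the mechanics of comparing expected utilities.

```python
def expected_utility(lottery, utility):
    """lottery is a list of (probability, outcome) pairs whose probabilities sum to 1."""
    return sum(p * utility(outcome) for p, outcome in lottery)

# Hypothetical example: a coin flip between 0 and 100, versus a sure 40.
gamble = [(0.5, 0), (0.5, 100)]
sure_thing = [(1.0, 40)]
utility = lambda x: x ** 0.5  # an assumed, risk-averse utility function

print(expected_utility(gamble, utility))      # 0.5 * 0 + 0.5 * 10
print(expected_utility(sure_thing, utility))  # sqrt(40)
```

Under this assumed utility function the sure 40 has the higher expectation, so an agent with these preferences would pick it even though the gamble has the higher expected payout.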

An analog to equation (1) in this case is to introduce ethical theories $T_1,\dots,T_n$ to which we might subscribe and $u_i(o)$ the value of an outcome $o$ under theory $T_i$ and then ask whether there is a utility function $u$ such that for $M(o) = \sum_i p(T_i)u_i(o)$ we have:
$$M(o)\leq M(o') \leftrightarrow o \preceq o' \tag{2}$$
Brian referred to this "meta-" theory as Maximize InterTheoretical Expectation or MITE. He believes that

There are moral theories which it can be rational to take seriously, such that if you do take them seriously, MITE cannot say anything about what you super-subjectively ought to do, given your normative uncertainty.

I show here that:

Contrary to Brian's argument, a MITE function always exists.

Furthermore, the output of this function is always just a vector of real numbers.

Groups

The basis of this post is the fact that we can generalize the above equation (2) to an arbitrary ordered group $G=(\Omega,+,\leq)$. Rather than bore the reader with a recitation of the group axioms, I will just point the reader to Wikipedia and point out that the possibly questionable assumption here is the existence of inverses (i.e. the claim that for any lottery $L$ there is a lottery $L'$ such that the agent is indifferent between participating in both lotteries and neither).^{1}
There are probably prettier ways of doing this, but here's a simple way of defining a group which is guaranteed to work. Let's say that:

Each theory $T_i$ has some set of possible values $V_i$ and that we can find the (intratheoretic) value of an outcome via $u_i:\mathcal{O}\to V_i$. Crucially, we are not claiming that these values are in any way comparable to each other. ($u_i$ is guaranteed to exist because it could just be the identity function.)

$\Omega_i = \mathbb R \times V_i$ is a tuple which joins the probability of an outcome with its value.

$\Omega =\prod_i \Omega_i$ and that $\pi_i:\Omega_i\hookrightarrow \Omega$ is the canonical embedding (i.e. $\pi_i(\omega)$ is zero everywhere except it puts $\omega$ into the $i$th position).

$G=(\Omega, +)$ with addition being defined elementwise.

Theorem 1: For any partial order $\preceq\subseteq \Omega\times \Omega$, $G$ satisfies (2).

Proof: It's clear that
$$M(o)=\sum_i \pi_i \left(p(T_i), u_i(o)\right)$$
will just embed the information into G, which can easily inherit the order. Of course if we are really dedicated to the notation in (2) we can define $x\cdot y = \pi(x,y)$ and then get
$$M(o)=\sum_i p(T_i) \cdot u_i(o)$$
$\square$

So what?

So far we've managed to show that you can redefine addition to mean whatever you want, and therefore utility functions will basically always exist. But it will turn out that we are actually dealing with some pretty standard groups here.

First, a little commentary on terms. One of the major objections Brian raises is the notion of "options", i.e. the fact that in certain moral theories we have "optional" things and "required" things. For example we might say that donating to charities is optional but not murdering people is required. Furthermore, these types of goods bear a non-Archimedean relationship to each other – that is, no amount of donating to charity can offset a murder.

For any ordered group $G$ there is a chain of subgroups $C_1\subset C_2\subset\dots\subset G$ such that each $C_i$ is "convex". Convex subgroups represent this notion of "optionality": $C_1$ represents all the "optional" things, $C_2$ is everything that is either required or optional, etc. Note that I am not assuming anything new here; it is a standard result that the set of all convex subgroups forms a chain in any ordered group (see Glass, Lemma 3.2.1).

Theorem 2: Our above group can be order-embedded into a subset of $\mathbb R ^n$ ordered lexically, i.e. we are just dealing with a set of vectors where each component of the vector is a real number. Furthermore, the number of components in the vector is identical to the number of "degrees" of optionality. Proof: This is the Hahn embedding theorem. $\square$

Corollary: if (and only if!) none of the theories that we give credence to has "optionality", then we are just dealing with the real numbers.

Example

The above was really abstract, so it's reasonable to ask for an example. But before I do that I would like to give a standard math joke:

(Prof. finishes proving Liouville's theorem that any bounded entire function is constant.)
Student: I'm not sure I really understand. Could you give an example?
Prof.: Sure. 7. (Prof. goes back to writing on the blackboard.)

The joke here is that $f(x)=7$ is "obviously" a constant function whereas the student somehow wanted a more exotic example. But the professor had just proven that no such examples exist!

So I will give some examples which the astute reader will point out are "obviously" instances of lexically ordered vectors of real numbers. This is because I have just proven that there are no other examples. Hopefully it will still be useful.

First, let's discuss how just satisficing consequentialism by itself is a lexically ordered vector. Consider the decision criterion that $(x_1,x_2)\leq (y_1,y_2)$ if and only if $x_2< y_2$ or both $(x_2 = y_2)$ and $(x_1\leq y_1)$ (i.e. it is lexically ordered from the right). So we could for example represent giving a thousand dollars to charity as $(1000,0)$ and murdering someone as $(0,-10000)$; this gives us our desired result that no amount of donations can offset a murder (i.e. $(x,-10000)\prec(0,0)$ for all $x$). And of course this is a vector of real numbers which is lexically ordered, in accordance with our theorem.

Now let's contrast this with standard utilitarianism, which would say that murdering someone could be offset by donating enough money to charity to prevent someone from dying. Let's call that amount $\$$10,000 (i.e. murdering someone has -10,000 utils). There are no "optional" things in standard utilitarianism, so we can write this as $(0,u)$ where $u$ is the utility of the outcome. In this case we have that $(0,x-10,000)\succeq (0,0)$ if $x\geq 10,000$, i.e. donations of at least $\$$10,000 offset a murder.

Now let's ask about the inter-theoretic uncertainty case. We have to choose between either doing nothing or murdering someone and donating $\$$15,000 to charity. We believe in satisficing consequentialism with probability $p$ and in standard utilitarianism with probability $1-p$. Therefore we have
$$\begin{align*}
p(15000,-10000) + (1-p)(0, 5000) & = (15000p,-10000p + 5000(1-p)) \\
& = (15000p,5000-15000p)
\end{align*}
$$
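This is strictly preferred to the $(0,0)$ option if $p< 1/3$, since the second component is then positive. If $p=1/3$ exactly, the second components tie at zero and the option is preferred only on the strength of the lower-priority first component ($15000p = 5000 > 0$).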
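The comparison can be checked mechanically. The sketch below uses helper names of my own (`lex_leq`, `mixture`) and simply encodes the right-to-left lexical order defined earlier, together with the probability-weighted sum from the displayed calculation.

```python
def lex_leq(x, y):
    """x <= y in the order that is lexical from the right (last component first)."""
    # Python compares tuples lexically from the left, so reverse both vectors.
    return tuple(reversed(x)) <= tuple(reversed(y))

def mixture(p, a, b):
    """Componentwise p * a + (1 - p) * b."""
    return tuple(p * ai + (1 - p) * bi for ai, bi in zip(a, b))

satisficing = (15000, -10000)  # murder plus a $15,000 donation, satisficing view
utilitarian = (0, 5000)        # the same act under standard utilitarianism
nothing = (0, 0)

for p in (0.2, 0.5):
    m = mixture(p, satisficing, utilitarian)
    print(p, m, "at least as good as doing nothing:", lex_leq(nothing, m))
```

With $p = 0.2$ the second component is positive and the act beats doing nothing; with $p = 0.5$ it is negative and the act loses, no matter how large the first component is.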

This isn't the only way we can make inter-theoretic comparisons. I actually don't even think it's the best way. But it is one example where we're using a lexically ordered vector of real numbers, and all other examples will be similar.

A Counterexample

It may be useful to construct a decision criterion which can't be represented using a MITE formula. (Obviously, it will have to disobey one of the ordered-group axioms due to theorem 1.)

Here's one example:

Let's say we represent an outcome having deontological value $d$ and utility $u$ as $(d,u)$ and we believe in deontology with probability $p$. Then $(d_1,u_1)\preceq (d_2,u_2)$ if and only if $p(u_1\bmod d_1)\leq p(u_2\bmod d_2)$.

This is not order-preserving because sometimes increasing utility is good but other times increasing utility is bad. So it doesn't make up an ordered group.
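A quick numerical check of the non-monotonicity, assuming for simplicity that $d$ is a positive integer and that $p = 0.5$ (both are my assumptions for the illustration; the function name `rank` is also mine):

```python
def rank(d, u, p=0.5):
    """The quantity p * (u mod d) used to order outcomes in the counterexample."""
    return p * (u % d)

d = 5
for u in range(3, 8):
    print(u, rank(d, u))
```

Moving from $u=3$ to $u=4$ raises the rank, but moving from $u=4$ to $u=5$ drops it to zero, so adding utility does not reliably make an outcome better.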

Commentary

Brian took as his definition of "rational" the standard von Neumann-Morgenstern axioms. This is of course a perfectly reasonable thing to do in general, but as he points out many individual moral theories fail these axioms. (Insert joke here about utilitarianism being the only "rational" moral system.)

I personally find the idea of optionality pretty stupid and think it causes all sorts of problems even without needing to compare it to other theories. But if you do want to give it some credence, then a MITE formula will work fine for you.

Footnotes

1. Note that this also requires "modding out" by an indifference relation.

There is a scene in Gulliver's Travels where the protagonist calls up the ghosts of all the philosophers since Aristotle, and the ghosts all admit that Aristotle was way better than them at everything. Especially Descartes – Jonathan Swift wants to make very clear that Aristotle is a way better philosopher than Descartes, and that all of Descartes's ideas are stupid. (I think this was supposed to prove a point in some long-forgotten religious dispute.)

If I ever become a prominent philosopher and we develop the technology to call up ghosts in order to win points in literary holy wars (I will let the reader decide which of those two conditions is more likely), please reincarnate me to talk ethics with Aristotle. Basically all the problems I'm worried about deal with mathematical concepts which weren't developed until around a century ago, and I'm excited to hear whether a virtuous person would accept Zorn's Lemma.

Today I want to share two mathematical assumptions which are so esoteric that even most mathematicians don't bother worrying about them. Despite that, they actually critically influence what we think about ethics.

The Axiom of Choice

The Axiom of Choice is everyone's favorite example of something which seems like an innocuous assumption but isn't. (The Axiom of Choice is the axiom of choice for such situations, if you will.) Here's Wikipedia's informal description:

The axiom of choice says that given any collection of bins, each containing at least one object, it is possible to make a selection of exactly one object from each bin.

Seems pretty reasonable, right? Unfortunately, it leads to a series of paradoxes, such as the result that any ball can be doubled into two balls, each of which has the same size as the first.

In many cases, a weaker assumption known as the "axiom of dependent choice" suffices and has the advantage of not leading to any (known) paradoxes. Sadly, this doesn't work for ethics.

Consider the two following reasonable assumptions:

Weak Pareto: if we can make someone better off and no one worse off, we should.

Intergenerational Equality: we should value the welfare of every generation equally.

Theorem (proven by Zame): we cannot prove the existence of an ethical system which satisfies both Weak Pareto and Intergenerational Equality without using the axiom of choice (i.e. the axiom of dependent choice doesn't work).

Sorry grandma, but unless you can make that ball double in size we're gonna have to start means-testing Medicare

Hyperreal numbers

The observant reader will note that the previous theorem showed only that we could prove the existence of a "good" ethical system if we use the axiom of choice; it didn't say anything about us actually being able to find it. To get that we have to enter the exciting world of hyperreal numbers!

The founding fathers weren't as impressed with Thomas Jefferson's original nonconstructive proof that the Bill of Rights could, in theory, be created

I recently asked my girlfriend whether she would prefer:

Having one unit of happiness every day, for the rest of eternity, or

Having two units of happiness every day, for the rest of eternity

She told me that the answer was obvious: she's a total utilitarian and in the first circumstance she would have one unit of happiness for an infinite amount of time, i.e. one infinity's worth of happiness. But in the second case she would have two units for an infinite amount of time, i.e. two infinities of happiness. And clearly two infinities are bigger than one.

My guess is that how reasonable you think this statement is will depend in a U-shaped way on how much math you've learned:

To the average Joe, it's incredibly obvious that two infinities are bigger than one. More advanced readers will note that the above utility series don't converge, so it's not even meaningful to talk about one series being bigger than another. But those who've dealt with the bizarre world of nonstandard analysis know that notions like "convergence" and "limit" are conspiracies propagated by high school calculus teachers to hide the truth about infinitesimals. In fact, there is a perfectly well-defined sense in which two infinities are bigger than one, and the number system which this gives rise to is known as the "hyperreal numbers."

From an ethical standpoint, here are the relevant things you need to know:

Theorem (proven by Basu and Mitra): if we use only our normal "real" numbers, then we can't construct an ethical system which obeys the above Weak Pareto and Intergenerational Equality assumptions.

Theorem (proven by Pivato): we can find such a system if we use the hyperreal numbers.

To any TV producers reading this: the success of the hyperreal approach over the "standard calculus" approach would make me an excellent soft-news-show guest. While most stations can drum up some old crotchety guy complaining about how schools are corrupting the minds of today's youths, only I can actually prove that calculus teaches kids to be unethical.

Conclusion / Apologies / Further Reading

As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality. - Einstein

It goes without saying that I've heavily simplified the arguments I've cited, and any mistakes are mine. If you are interested in using logical reasoning to improve the world, then you should check out Effective Altruism. If you are more of a "nonconstructive altruist" then you can do a Google scholar search for "sustainable development" or read the papers cited below to learn more.

And most importantly: if you are a student who is being punished for misbehaving in a calculus class, please 1) tell your teacher the Basu-Mitra-Pivato result about how calculus causes people to disrespect their elders and 2) film their reaction and put it on YouTube. (Now that's effective altruism!)

Basu, Kaushik, and Tapan Mitra. "Aggregating infinite utility streams with intergenerational equity: the impossibility of being Paretian." Econometrica 71.5 (2003): 1557-1563.

When you look online for advice about entrepreneurship, you will see a lot of "just do it":

The best way to get experience... is to start a startup. So, paradoxically, if you're too inexperienced to start a startup, what you should do is start one. That's a way more efficient cure for inexperience than a normal job. - Paul Graham, Why to Not Not Start a Startup

There is very little you will learn in your current job as a {consultant, lawyer, business person, economist, programmer} that will make you better at starting your own startup. Even if you work at someone else’s startup right now, the rate at which you are learning useful things is way lower than if you were just starting your own. - David Albert, When should you start a startup?

This advice almost never comes with citations to research or quantitative data, from which I have concluded:

The sort of person who jumps in and gives advice to the masses without doing a lot of research first generally believes that you should jump in and do things without doing a lot of research first.

As readers of this blog know, I don't believe in doing anything without doing a ton of research first, and have therefore come to the surprising conclusion that the best way to start a startup is by doing a lot of background research first.

Specifically, I would make two claims:

It's unclear whether the average person learns anything from a startup.

It is clear that the average person learns something working in direct employment, and that they almost certainly will make more money working in direct employment (which can fund their later ventures).

I think these two theoretical claims lead to one empirical one:

If you want to start a successful startup, you should work in direct employment first.

Evidence

Rather than boring you with a narrative, I will just present some choice quotes:

"We found that among the 24 possible success factors identified in the literature, 8 are homogeneous significant success factors for NTVs [New technology ventures]: ... (6) founders' marketing experience; (7) founders' industry experience... 5 [other factors] were not significant: ... (2) founders' experience with start-ups" Success Factors in New Ventures: A Meta-analysis

"Our most important finding is that the reward to the entrepreneurs who provide the ideas and long hours of hard work in these startups is zero in almost three quarters of [startups], and small on average once idiosyncratic risk is taken into consideration"- The Burden of the Nondiversifiable Risk of Entrepreneurship

Even a stopped clock is right twice a day

It's interesting to think about what exactly the "people don't learn anything from a startup" hypothesis would look like. If we take the above-cited figure of everyone having a 20% chance of succeeding in a given startup, then, even if each attempt is independent, most people will have succeeded at least once by their fourth venture.
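The arithmetic behind "by their fourth venture" is just the complement of failing every time. Here is a small sketch that takes the 20% figure from the text as given and assumes independent attempts; the function name is mine.

```python
def p_at_least_one_success(p_success, n_ventures):
    """Chance of at least one success in n independent attempts."""
    return 1 - (1 - p_success) ** n_ventures

for n in range(1, 6):
    print(n, round(p_at_least_one_success(0.2, n), 3))
```

The probability first passes one half at the fourth attempt ($1 - 0.8^4 \approx 0.59$), which is where the "most people" claim comes from.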

So the underlying message from many in the startup community, "if you keep at it long enough, eventually you will succeed", is still completely true. I just think you could succeed more quickly if you go work for someone else first.

But… Anecdata!

I am sure that there are a lot of people who sucked on their first startup, learned a ton, and then crushed it on their second startup. But those people probably also would've sucked at their first year of direct employment, learned a ton, and then crushed it even more when they did start a company.

There are probably people who learn better in a startup environment and you may be one of them, but the odds are against it.

Attribution errors

So if entrepreneurs don't learn anything in their startups, why do very smart people with a ton of experience like Paul Graham think they do? One explanation which has been advanced is the "Fundamental Attribution Error", which refers to "people's tendency to place an undue emphasis on internal characteristics to explain someone else's behavior in a given situation, rather than considering external factors." Wikipedia gives this example:

Subjects read essays for and against Fidel Castro, and were asked to rate the pro-Castro attitudes of the writers. When the subjects believed that the writers freely chose the positions they took (for or against Castro), they naturally rated the people who spoke in favor of Castro as having a more positive attitude towards Castro. However, contradicting Jones and Harris' initial hypothesis, when the subjects were told that the writer's positions were determined by a coin toss, they still rated writers who spoke in favor of Castro as having, on average, a more positive attitude towards Castro than those who spoke against him. In other words, the subjects were unable to properly see the influence of the situational constraints placed upon the writers; they could not refrain from attributing sincere belief to the writers.

Even in the extreme circumstance where people are explicitly told that an actor's performance is solely due to luck, they still believe that there must've been some internal characteristic involved. In the noisy world of startups where great ideas fail and bad ideas succeed it's no surprise that people greatly overestimate the effect of "skill". Baum and Silverman found that:

And if venture capitalists, whose sole job consists of figuring out which startups will succeed, regularly make these errors, then imagine how much worse it must be for the rest of us.

(It also doesn't bode well for this essay – I'm sure that even after reading all the evidence I cited most readers will still attribute their startup heroes' success to said heroes' skill, intelligence and perseverance.)

Conclusion

I wrote this because I've become annoyed with the "just do it" mentality of so many entrepreneurs who spout some perversion of Lean Startup methods at me. Yes, doing experiments is awesome but learning from people who have already done those experiments is usually far more efficient. (Academics joke that "a month in the lab can save you an hour in the library.")

If you just think a startup will be fun then by all means go ahead and start something from your dorm room. But if you really want to be successful then consider apprenticing yourself to someone else for a couple years first.

(NB: I am the founder of a company which I started after eight years of direct employment.)

Works cited

Baum, Joel AC, and Brian S. Silverman. "Picking winners or building them? Alliance, intellectual, and human capital as selection criteria in venture financing and performance of biotechnology startups." Journal of business venturing 19.3 (2004): 411-436.

Gompers, Paul, et al. Skill vs. luck in entrepreneurship and venture capital: Evidence from serial entrepreneurs. No. w12592. National Bureau of Economic Research, 2006.

Kaiser, Ulrich, and Nikolaj Malchow-Møller. "Is self-employment really a bad experience?: The effects of previous self-employment on subsequent wage-employment wages." Journal of Business Venturing 26.5 (2011): 572-588.

Song, M., Podoynitsyna, K., Van Der Bij, H. and Halman, J. I. M. (2008), Success Factors in New Ventures: A Meta-analysis. Journal of Product Innovation Management, 25: 7–27. doi: 10.1111/j.1540-5885.2007.00280.x

Population Ethics is the branch of philosophy which deals with questions involving - you guessed it - populations. Most of the problems that are solved by population ethics are things involving tradeoffs between quantity and quality of life. In bumper-sticker form, the question investigated in this post is:

Should we make more happy people, or more people happy?^{1}

When a disaster occurs, most of us have the intuition that we should help improve the lives of survivors. But very few of us feel an obligation to have more children to offset the population loss. (i.e. our intuitions line up with making "more people happy" instead of "more happy people".) This is a surprisingly difficult position to defend, but it reminds me of Brian Tomasik's joke:

Bob: "Ouch, my stomach hurts."

Classical total utilitarian: "Don't worry! Wait while I create more happy people to make up for it."

Average utilitarian: "Never fear! Let me create more people with only mild stomach aches to improve the average."

Egalitarian: "I'm sorry to hear that. Here, let me give everyone else awful stomach aches too."

...

Negative total utilitarian: "Here, take this medicine to make your stomach feel better."

Limiting theorems

It turns out that population ethics has, to a certain extent, been "solved". This is a technical result, so uninterested readers can skip to the next section, but basically the various questions I discuss in this blog post are the only questions remaining.
Specifically:

Let $\mathbf u = \left(u_1,u_2,\dots\right)$ be the utilities of people $1,2,\dots$ and similarly let $\mathbf u' = \left(u_1',u_2',\dots\right)$ be the utilities of a different population. Further, suppose we have a "reasonable" way of defining which of two populations is better. Then there is a "value function" $V$ such that population $\mathbf u$ is preferable to population $\mathbf u'$ if and only if $V(\mathbf u) > V(\mathbf u')$. Furthermore, $V$ has the form:
$$V(\mathbf u)=f(n)\sum_{i=1}^{n}\left[ g(u_i)-g(c)\right]$$

The three sections of the blog post concern:

The concavity of $g$, which moderates our inequality aversion

The value of $c$, which is known as the "critical level"

And the form of $f$, which is the "number dampening"

I hope to write a post soon on why these are the only three remaining questions, but interested readers can see (Blackorby, Bossert and Donaldson, 2000) in the mean time.^{2}
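
The general form of $V$ above can be written as a short function. This is only a minimal sketch: the name `value` and the default choices for `f`, `g` and `c` are assumptions made for illustration, since the theorem itself leaves all three open.

```python
import math
from typing import Callable, Sequence


def value(
    utilities: Sequence[float],
    g: Callable[[float], float] = lambda x: x,   # identity: no inequality aversion
    c: float = 0.0,                              # critical level
    f: Callable[[int], float] = lambda n: 1.0,   # constant: no number dampening
) -> float:
    """Compute V(u) = f(n) * sum_i [g(u_i) - g(c)]."""
    n = len(utilities)
    return f(n) * sum(g(u) - g(c) for u in utilities)


# With the defaults this is plain total utilitarianism.
total = value([75, 25])

# A concave g (here the square root) prefers the more equal population.
equal = value([50, 50], g=math.sqrt)
unequal = value([100, 0], g=math.sqrt)
```

Each of the three sections below amounts to changing one argument of this function while holding the other two at their defaults.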

Inequality

In the wake of the financial crisis, movements like Occupy Wall Street raised wealth inequality as a major political issue.

Wealth inequality in the US

An intuition that underlies these concerns is that the worse off people are, the more important it is to help them. We might donate to a charity to help starving people eat, but not one which helps rich yuppies eat even fancier food. The formal way to model this is to state that one person's utility has diminishing returns to society's overall well-being (i.e. additional utility to that person benefits society less and less as they become better off).

$g(x)=\sqrt{x}$

(As in the rest of this post, you can use the slider to modify the function and see how changing $g$ affects our ethical choices.)

One way of visualizing the impact this has on our decisions about populations is to use an indifference curve. In the chart below, the x-axis represents the utility of person X and the y-axis the utility of person Y. Each line on the chart indicates a set of points for which we are indifferent - for example, the blue line includes the point (50,50) and the point (100,0) since if we don't believe that utility has diminishing returns we don't care about how utility is divided up between the populace. (50 + 50 = 100 + 0).

$g(x)=\sqrt{x}$

You can see that the stronger we think returns diminish, the more inequality-averse we become. For example, if $g(x)=\sqrt{x}$ we are roughly indifferent between $(47,10)$ and $(100,0)$ since $\sqrt{47} + \sqrt{10}\approx 10.02 \approx \sqrt{100} + \sqrt{0}$, meaning that a 53-point increase in person X's welfare is needed to offset the 10-point loss in person Y's welfare, since Y's welfare is so low.
This is an important point, so I'll call it out:

Inequality aversion is a conclusion of population ethics, not an assumption^{3}

Interlude - The Representation of Populations

We've just shown a very non-trivial result: if $g$ is concave (meaning that increasing utility has diminishing returns), then we are inequality-averse. (Conversely, if $g$ were convex then we would be inequality-seeking, but I don't know of anyone who has argued this.)
One problem we're going to run into soon is that there are too many variables to easily visualize. So I want to bring up a certain fact about population ethics:

For any population $u$, there is a population $u'$ such that:

The number of people in $u$ and $u'$ is the same

Everyone in $u'$ has the same utility as each other (i.e. $u'$ is "perfectly equitable")

And we are indifferent between $u$ and $u'$

For example, if we believed utility did not have diminishing returns, we would be indifferent between $(75,25)$ and $(50,50)$ because the total utility is the same. This means that:

Any time we want to compare populations $p$ and $q$, we can instead compare $p'$ and $q'$ where both $p'$ and $q'$ are perfectly equitable (i.e. every person in $p'$ has the same utility as each other, and similarly for $q'$).

A perfectly equitable population can be parameterized by exactly two variables: the number of people in the population, and the average utility.
While there are theoretical implications of this, the most relevant fact for us is that it means we can keep using two-dimensional graphs.
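
One way to find the perfectly equitable stand-in for a population is to solve $n\,g(e)=\sum_i g(u_i)$ for the common utility level $e$. The sketch below does this under the assumptions that the value function is $\sum_i g(u_i)$ (zero critical level, no number dampening) and that $g$ has a known inverse; the function name `equitable_equivalent` is made up for this example.

```python
import math
from typing import Callable, Sequence


def equitable_equivalent(
    utilities: Sequence[float],
    g: Callable[[float], float],
    g_inverse: Callable[[float], float],
) -> float:
    """Utility level e such that n people at e are as good as `utilities`.

    Solves n * g(e) = sum_i g(u_i), i.e. e = g_inverse(mean of g(u_i)).
    """
    mean_g = sum(g(u) for u in utilities) / len(utilities)
    return g_inverse(mean_g)


# No diminishing returns: (75, 25) is equivalent to (50, 50).
linear = equitable_equivalent([75, 25], lambda x: x, lambda y: y)

# Square-root returns: the same population is worth less than (50, 50).
concave = equitable_equivalent([75, 25], math.sqrt, lambda y: y * y)
```

With a concave $g$ the equitable equivalent falls below the plain average, which is another way of seeing the inequality aversion from the previous section.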

Critical Levels

Back to the topic at hand. The following assumption sounds very strange, but it's made quite frequently in the literature:

Even if your life is worth living to you and you don't influence anyone else, that doesn't mean the population as a whole benefits from your existence. Specifically, your welfare must be greater than a certain amount, known as the "critical level", before your existence benefits society.^{4}

More formally:

Value to society = utility - critical level

Or
$$V(\mathbf u)=\sum_{i=1}^{n} \left(u_i - c\right)$$
where $c$ is the critical level. (Note that $c$ is a constant, and independent of $\mathbf u$.) I think this is best illustrated with an example.
Suppose we have a constant amount of utility, and we're wondering how many people to divide it up between. (As mentioned earlier, this is a perfectly equitable population, so everyone gets an equal share.) Here's how changing the critical level changes our opinion of the optimal population size:

c=10

The impact of critical levels can be summarized as:

Positive critical levels give a "penalty" for every person who's alive, whereas negative critical levels give a "bonus"

This is clear since $$V(\mathbf u)=\sum_{i=1}^{n} \left(u_i - c\right)=\left(\sum_{i=1}^{n} u_i\right)-nc$$
Here are indifference curves for different critical levels:

c=10

As the critical level gets lower, we are increasingly willing to decrease average utility in exchange for increasing the population size. The major motivation for having a positive critical level is that it avoids the mere addition paradox (sometimes known as the "Repugnant Conclusion"):

For any possible population of at least ten billion people, all with a very high quality of life, there must be some much larger imaginable population whose existence, if other things are equal, would be better even though its members have lives that are barely worth living.^{5}

In tabular form:

| Population | Size | Average Utility | Total Value (c=0) | Total Value (c=10) |
|---|---|---|---|---|
| A | 1,000 | 100 | 100,000 | 90,000 |
| B | 10,000,000 | 0.1 | 1,000,000 | -99,000,000 |
| C | 1,000 | -4 | -4,000 | -14,000 |
| D | 100 | -1 | -100 | -1,100 |

Many people have the intuition that A is preferable to B. We can see that only by having a positive critical level can we make this intuition hold.
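
As a quick check of the arithmetic, the sketch below recomputes both total-value columns for populations A through D. The critical level for the second column is taken to be $c=10$, an assumption based on the c=10 label shown with the charts above; the names `populations` and `total_value` are illustrative.

```python
# (size, average utility) for each population in the table.
populations = {
    "A": (1_000, 100.0),
    "B": (10_000_000, 0.1),
    "C": (1_000, -4.0),
    "D": (100, -1.0),
}


def total_value(size: int, average: float, c: float) -> float:
    """V = sum_i (u_i - c) for a perfectly equitable population."""
    return size * (average - c)


for name, (size, average) in populations.items():
    print(name, total_value(size, average, 0.0), total_value(size, average, 10.0))
```

With $c=0$, B comes out ahead of A, which is the Repugnant Conclusion. With $c=10$, A beats B, but C also beats B.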

Unfortunately, we can also see that having a positive value of c results in what Arrhenius has called the "sadistic conclusion": We prefer population C to population B, even though everyone in C is suffering and the people in B have positive lives. And if c is negative we have another sort of sadistic conclusion: We prefer C to D even though there are fewer people suffering in D and no one is better off in C than they are in D.

Some people will bite the bullet and prefer the Sadistic Conclusion to the Repugnant one. But it's hard to make a case for this being the less counterintuitive of the two, meaning we must have a critical level of zero.

Number Dampening

Canadian philosopher Thomas Hurka has argued for the two following points:

For small populations, we should care about total welfare

For large populations, we should care about average welfare

Independent of the question about whether people should care more about average welfare for large populations, it seems clear that in practice we do (as I've discussed before).

The way to formalize this is to introduce a function $f$:

$$V(\mathbf u)=f(n)\sum_{i=1}^{n}u_i$$
where
$$f(n) = \left\{
\begin{array}{lr}
1 & : n \leq n_0 \\
n_0/n & : n > n_0
\end{array}
\right.$$
If we have fewer than $n_0$ people (i.e. if the population is "small") then this is equivalent to total utilitarianism. If we have more (i.e. the population is "large") then it's equivalent to average utilitarianism. Graphically:

n_{0}=50

The non-differentiability at $n=n_0$ is pretty ridiculous though, so instead of a strict cutoff we could claim that there are diminishing returns to population size, just like we claimed that there are diminishing returns to utility in the first section. For example, we could state that $$V(\mathbf u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}u_i$$
This gives us a graph like:

Even with this modification though, it still seems pretty implausible that population size has diminishing returns. The relevant fact is that $\sqrt{x+y}\not=\sqrt{x}+\sqrt{y}$, so we can't just break populations apart.^{6} Therefore, we have to consider every single person who has ever lived (and who ever will live) before we can make ethical decisions. As an example of the odd behavior this "holistic" reasoning implies:

Some researchers are on the verge of discovering a cure for cancer. Just before completing their research, they learn that the population of humans 50,000 years ago was smaller than they thought. As a result, they drop their research to focus instead on having more children.

An example will explain why this is the correct behavior if you believe in number-dampening. Say we're using the value function
$$V(\mathbf u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}u_i$$
and we can either move everyone alive from having 10 utils up to 10.1 (discovering cancer cure) or else add a new person with utility 100 (have a child). Which option is best depends on the population size:

Having a child is better if the population size is 500, but worse if the population size is 5,000.
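
The two cases can be checked directly. This sketch assumes the square-root-dampened value function $V(\mathbf u)=\frac{1}{\sqrt{n}}\sum_i u_i$ from earlier in this section; the helper names `dampened_value` and `compare` are made up for the example.

```python
import math


def dampened_value(utilities: list[float]) -> float:
    """V(u) = (1 / sqrt(n)) * sum_i u_i."""
    return sum(utilities) / math.sqrt(len(utilities))


def compare(population_size: int) -> str:
    """Return the better option for a population where everyone starts at 10 utils."""
    cure = dampened_value([10.1] * population_size)
    child = dampened_value([10.0] * population_size + [100.0])
    return "child" if child > cure else "cure"


for size in (500, 5_000):
    print(size, compare(size))
```

The answer flips between the two sizes because the cure's benefit grows with the number of people it helps, while the extra person adds a fixed 100 utils and slightly increases the dampening applied to everyone.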

It goes against our intuition that the population size in the distant past should affect our decisions about what to do today. One simple way around this is to just declare that "population size" is the number of people currently alive, not the people who have ever lived. Nick Beckstead's thesis has an interesting response:

The Separated Worlds: There are only two planets with life. These planets are outside
of each other’s light cones. On each planet, people live good lives. Relative to each of
these planets’ reference frames, the planets exist at the same time. But relative to the
reference frame of some comet traveling at a great speed (relative to the reference frame
of the planets), one planet is created and destroyed before the other is created.

To make this exact, let's say each planet has 1,000 people each with utility level 100. Then we have:

| Dampening Amount | Value on both planets | Value on comet |
|---|---|---|
| None | 200,000 | 200,000 |

How valuable a population is shouldn't change if you split it into arbitrary sub-populations, so it's hard to make the case for number dampening.
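
To see how the two reference frames can disagree, here is a rough calculation under two assumptions that go beyond the text: dampening takes the $1/\sqrt{n}$ form used earlier, and "population size" counts only people who exist at the same time. The function name `dampened_value` is illustrative.

```python
import math


def dampened_value(n: int, utility: float) -> float:
    """Value of n people at the same utility under 1/sqrt(n) dampening."""
    return n * utility / math.sqrt(n)


# Planets' frame: both populations exist together, so they count as one.
together = dampened_value(2_000, 100.0)

# Comet's frame: the planets never coexist, so each is valued on its own.
separately = dampened_value(1_000, 100.0) + dampened_value(1_000, 100.0)

# Without dampening the split makes no difference.
undampened_together = 2_000 * 100.0
undampened_separately = 1_000 * 100.0 + 1_000 * 100.0

print(together, separately)
print(undampened_together, undampened_separately)
```

Under these assumptions the comet's frame assigns the same two planets a higher total than the planets' own frame does, even though nothing about anyone's life differs.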

Conclusion

I started off by claiming (without proof) that for any "reasonable" way of determining which population is better, we could equivalently use a value function $V$ such that population $\mathbf u$ is better than population $\mathbf u'$ if and only if $V(\mathbf u) > V(\mathbf u')$. Furthermore, I claimed $V$ must have the form: $$V(\mathbf u)=f(n)\sum_{i=1}^n\left[g(u_i)-g(c)\right]$$
In this post, we investigated modifying $f,g$ and $c$. However, we saw that having $c$ be anything but zero leads to a "sadistic conclusion", and having $f$ be non-constant leads to the "Separated Worlds" problem, meaning that we conclude $V$ must be of the form $$V(\mathbf u) = \sum_{i=1}^n g(u_i)$$
Where $g$ is a continuous, monotonically increasing function. This is basically classical (or total) utilitarianism, with perhaps some inequality aversion.

It's common to view ethicists as people who just talk all day without making any progress on the issues, and to some extent this reputation is deserved. But in the area of population ethics, I hope I've convinced you that philosophers have made tremendous progress, to the point that one major question (the form of the value function) has been almost completely solved.

Footnotes

I'm sure I didn't come up with this phrase, but I can't find who originally said it. I'd be much obliged to any commenters who can let me know.

The obvious objection I'm ignoring here is the "person-affecting view", or "the slogan." I'm pretty skeptical of it, but it's worth pointing out that not all philosophers agree that population ethics must be of this form.

Of course, if we came to the conclusion that inequality is good, we might start questioning our assumptions, so this is perhaps not completely true.

If the critical level is negative, then the converse holds (your life can suck but you'll still be a benefit to society). This is rarely argued.

This isn't just a problem with the square root - if $f(x+y)=f(x)+f(y)$ with $x,y\in\mathbb R$ then $f(x)=cx$ if $f$ is non-"pathological". (This is known as Cauchy's functional equation.)

Summary: It’s been suggested that improving decision making is an important thing for altruists to focus on, and there are a wide variety of computer programs which aim to improve clinician decision making ability. Since I earn to give as a programmer making healthcare software, you might naively assume that some of the good I do is through improving clinician decision making. You would be wrong. I give an overview of the problem, and suggest that the problems which make improving medical decision making hard are general, and might suggest low-hanging fruit is rare in the field of decision support.

Against stupidity the gods themselves contend in vain. - Friedrich Schiller

In 1966, the Massachusetts General Hospital Utility Multi-Programming System (MUMPS) was created as one of the first healthcare information technology platforms. Running on the “cheap” ($70,000) PDP-7, it spread to become one of the most common pieces of infrastructure in healthcare - to this day, if you walk into your doctor’s office there’s a good chance some part of what you see has MUMPS in its stack.

A few years later, researchers at Stanford using a computer with the approximate power of today’s wristwatches created MYCIN, a program capable of outperforming human physicians in diagnosing bacterial infections. Unlike MUMPS, such programs are still far from use in everyday care today: when I go to the doctor’s office I’m not diagnosed by computerized super-doctors but instead by the time-honored combination of human gut, skill and the occasional glance at a reference volume. Even “low-skill” jobs like calling patients to remind them about their appointments are still usually done by receptionists or temps with a printed call list; a process essentially indistinguishable from 50 years ago.

If people are better at making decisions, then we will be better at a whole range of things, making decision-support technology an important priority for altruists. It was listed as one of 80,000 Hours' top priorities, for example. I haven’t seen many empirical examinations of how decision-making technology (fails to) improve our abilities, so I offer healthcare IT as a case study.

Different, not fewer, problems

Clinicians sometimes order the wrong thing. Perhaps they forget the dosing and accidentally order 200 milligrams instead of 200 micrograms, or they order penicillin because they forgot that the patient’s allergic.

It’s relatively easy to program a computer to warn the user when their prescription is off by an order of magnitude or conflicts with a recorded allergy, but it turns out that doctors are actually pretty good at what they do most of the time. If they order an unusually high dose, it’s probably because the patient has an unusually severe case. If they order a med that the patient is allergic to, it’s probably because they decided the benefits outweigh the risks. As a result, these warnings are almost always noise without a signal.
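
To show how little code such a check needs, here is a minimal sketch. Everything in it is hypothetical: the `Order` and `Patient` types, the `TYPICAL_DOSE_MG` table and its single made-up entry, and the tenfold threshold are illustrations only, not the logic of any real clinical system.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    drug: str
    dose_mg: float


@dataclass
class Patient:
    allergies: set[str] = field(default_factory=set)


# Hypothetical reference dose for a made-up drug, for illustration only.
TYPICAL_DOSE_MG = {"example_drug": 0.2}


def warnings_for(order: Order, patient: Patient) -> list[str]:
    """Return naive alerts for an order: 10x dose deviations and allergies."""
    alerts = []
    typical = TYPICAL_DOSE_MG.get(order.drug)
    if typical is not None:
        ratio = order.dose_mg / typical
        if ratio >= 10 or ratio <= 0.1:
            alerts.append(f"{order.drug}: dose differs from typical by 10x or more")
    if order.drug in patient.allergies:
        alerts.append(f"{order.drug}: patient has a recorded allergy")
    return alerts


# 200 mg entered where 200 micrograms (0.2 mg) was intended.
print(warnings_for(Order("example_drug", 200.0), Patient()))
print(warnings_for(Order("penicillin", 500.0), Patient({"penicillin"})))
```

Note that the check fires whether or not the clinician had a good reason for the order, which is exactly why most of its alerts end up being noise.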

The result is familiar to anyone who used the version of Microsoft Office with Clippy: clinicians slam on the keyboard to close all message boxes without bothering to read the warnings, completely negating any possible benefits. This “alert fatigue” (as it is politely termed) sometimes stems from organizations’ fears of lawsuits keeping extraneous alerts around (Tiwari et al. 2013), but even in trials which are done specifically to improve health and are judged successful enough to publish, fewer than a fourth have any impact on patient outcomes (Hemens et al. 2011).

GIGO

Anyone who’s done computer learning is aware of the maxim “garbage-in, garbage-out”. Even the most amazing prediction algorithm will give bad results if you give it bad input, and current medical algorithms are far from perfect.

Medical records are written of, by and for humans, and there is a large resistance to change. If your program requires someone with MD-equivalent skills to translate the patient’s free-text chart into a discrete dataset that the software could analyse, then why would you use it? You might as well just hire the doctor to do the diagnosis herself.

This problem is largely what’s held back programs like MYCIN. While they work great if your research grant provides for a grad student sweatshop to code data into your specialized format, they don’t work so well in the real world.

Doctor-Hardness

To summarize these two problems: people had originally thought they could slice off just a tiny piece of clinicians’ jobs and improve that without worrying about the rest. But it turned out that in order to do well in this tiny slice they needed to essentially replicate all of what a doctor does - in computer science terms, these problems are “doctor-hard”.

Cost

What have we spent to get these minimal benefits?

The NIH’s Biomedical Information Science and Technology initiative has funded about $350 million worth of research (not all of it in clinical decision support), but this amount pales in comparison to what governments have spent in getting IT into the hands of front-line physicians.

The HITECH Act (part of the 2009 US stimulus bill) is expected to spend about $35 billion on increasing the adoption of electronic medical records. On the other side of the pond, the NHS’ troubled IT program ended up costing around £20 billion, up a mere order of magnitude from the original £2.3 billion estimate.

An explicit cost-benefit analysis of decision support research would require a lot more careful analysis of these expenditures, but my goal is just to point out that the lack of results is not due to lack of trying. Decades of work and billions of dollars have been spent in this area.

Efficiency

In retrospect, I think one argument we could have used to predict the non-cost-effectiveness of these interventions is to ask why they haven’t already been invented. The pre-computer medical world is filled with checklists, and so if there was an easy way to detect mistyped prescriptions or diagnose bacterial infections, it would probably already be used.

This is to make a sort of “efficiency” argument - if there is some easy way to improve decision making, it’s probably already been implemented. So when we’re examining proposed decision support techniques, we might want to ask why it hasn’t already been done. If we can’t pin it on a new disruptive technology or something similar, we might want to be skeptical that the problem is really so easy to solve.

Acknowledgements

Brian Tomasik proofread an earlier version of this post.

Works Cited

Ash, Joan S., Marc Berg, and Enrico Coiera. "Some unintended consequences of information technology in health care: the nature of patient care information system-related errors." Journal of the American Medical Informatics Association 11.2 (2004): 104-112. http://171.67.114.118/content/11/2/104.full

Hemens, Brian J., et al. "Computerized clinical decision support systems for drug prescribing and management: a decision-maker-researcher partnership systematic review." Implementation Science 6.1 (2011): 89. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3179735/

Reckmann, Margaret H., et al. "Does computerized provider order entry reduce prescribing errors for hospital inpatients? A systematic review." Journal of the American Medical Informatics Association 16.5 (2009): 613-623.

Tiwari, Ruchi, et al. "Enhancements in healthcare information technology systems: customizing vendor-supplied clinical decision support for a high-risk patient population." Journal of the American Medical Informatics Association 20.2 (2013): 377-380. http://171.67.114.118/content/20/2/377.abstract

Williams, D. J. P. "Medication errors." Journal of the Royal College of Physicians of Edinburgh 37.4 (2007): 343. http://www.rcpe.ac.uk/journal/issue/journal_37_4/Williams.pdf

Summary: Brian has recently argued that because "flow-through" (second-order) effects are so uncertain, charities don't (on expectation) differ in their effectiveness by more than a couple orders of magnitude. I give some arguments here about why that might be wrong.

1. Why does anything differ by many orders of magnitude?

Some cities are very big. Some are very small. This fact has probably never bothered you before. But when you look at how city sizes stack up, it looks somewhat peculiar:

The X-axis is the size of the city, in (natural) logarithmic scale. The Y-axis corresponds to the density (fraction) of cities with that population. The peak is around the mark of 8 on the X-axis, which corresponds to $e^8\approx 3,000$ people.

You can see that the logarithms of the empirical city sizes almost perfectly match a normal ("bell curve") distribution. What's the explanation for this? Is mayoral talent distributed exponentially? When deciding to move to a new city do people first take the log of the new city's size and then roll some normally-distributed dice?

It turns out that this is solely due to dumb luck and mathematical inevitability. Suppose every city grows by a random amount each year. One year, it will grow 10%, the next 5%, the year after it will shrink by 2%. After these three years, the total change in population is
$$1.10\cdot 1.05\cdot 0.98$$
As in the above graph, we take the log
$$\log\left(1.10\cdot 1.05\cdot 0.98\right)$$
A property of logarithms you may remember is that $\log(a\cdot b)=\log a + \log b$. Rewriting the expression above with this property gives
$$\log 1.10+ \log 1.05+\log 0.98$$
The central limit theorem tells us that when you add a bunch of independent random things together, the sum ends up approximately normally distributed. We're clearly adding a bunch of random things together here, so the log of a city's size ends up following the bell curve we see above.
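
The argument is easy to try out numerically. The sketch below is a toy model, not a fit to real data: the starting size, the growth range of -10% to +12% per year, the 100-year horizon and the number of cities are all arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_city(years: int, start: float = 1_000.0) -> float:
    """Grow a city by an independent random percentage each year."""
    size = start
    for _ in range(years):
        size *= 1 + random.uniform(-0.10, 0.12)
    return size


sizes = [simulate_city(years=100) for _ in range(2_000)]
log_sizes = [math.log(s) for s in sizes]

# For a symmetric bell curve the mean and median roughly coincide.
print("sizes:     mean", statistics.mean(sizes), "median", statistics.median(sizes))
print("log sizes: mean", statistics.mean(log_sizes), "median", statistics.median(log_sizes))
```

The raw sizes are skewed, with a mean well above the median, while the log sizes are close to symmetric, as the central limit theorem predicts for a sum of many independent terms.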

2. Why charities might differ by many orders of magnitude

Some of Brian's points are about how even if a charity is good in one dimension, it's not necessarily good in others (performance is "independent"). The point of the above is to demonstrate that we don't need dependence to have widely varying impacts. We just need a structure where people's talents are randomly distributed, but critically their talents have a multiplicative effect.

There are some talents which obviously cause a multiplier. A charity's ability to handle logistics ("reduce overhead") will multiply the effectiveness of everything else they do. Their ability to increase the "denominator" of their intervention (number of bednets distributed, number of leaflets handed out, etc.) is another. PR skills, fundraising etc. all plausibly have a multiplicative impact.

More controversially, some proxies for flow-through effects might have a multiplicative impact. Scientific output is probably more valuable in times of peace than in times of war. GDP increases are probably better when there's a fair and just government, instead of the new wealth going to a few plutocrats.

Here's a simulation of charities' effectiveness with 10 dimensions, each uniformly drawn from the range [0,10].

The red line corresponds to Brian's scenario (where each dimension is independent) and as he describes effectiveness is very closely clustered around 50. But as the dimensions have more interactions, the effectiveness spreads out, until the purely multiplicative model (purple line) where charities differ by many orders of magnitude.
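
A stripped-down version of this simulation is sketched below. It covers only the two extremes (purely additive and purely multiplicative), not the intermediate curves, and the number of simulated charities and the spread measure `orders_of_magnitude` are assumptions made for this example.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

DIMENSIONS = 10
CHARITIES = 5_000


def draw_charity() -> list[float]:
    """One charity: ten skill levels, each uniform on [0, 10]."""
    return [random.uniform(0, 10) for _ in range(DIMENSIONS)]


def orders_of_magnitude(values: list[float]) -> float:
    """Base-10 log of the ratio between the best and worst value."""
    return math.log10(max(values) / min(values))


charities = [draw_charity() for _ in range(CHARITIES)]
additive = [sum(c) for c in charities]
multiplicative = [math.prod(c) for c in charities]

print("additive spread:      ", orders_of_magnitude(additive))
print("multiplicative spread:", orders_of_magnitude(multiplicative))
```

The same ten draws per charity produce a narrow range when summed and a range spanning many orders of magnitude when multiplied.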

3. Picking winners

Say that impact is the product of measurable, direct impacts and unmeasurable flow-through effects. Algebraically: $I=DF$. If $D$ and $F$ are independent, the expectation of the product factors:
$$E[I]=E[DF]=E[D]E[F]$$
So if two charities differ by a factor of say 1,000 in their direct impact, and we expect the same flow-through multiplier from each, then their total impact would (on expectation) differ by 1,000 as well.

This isn't a perfect model. But I do think that it's not always correct to model impacts as a sum of iid variables, and there is a plausible case to be made that not only do charities differ "astronomically" but we can expect those differences even with our limited knowledge.

Acknowledgements

This post was obviously inspired by Brian, and I talked about it with Gina extensively. The log-normal proof is known as Gibrat's Law and is not due to me.