Discrete Choice

Many important economic decisions are fundamentally discrete: whether to participate in a welfare program, what occupation to work in, which car to buy, where to live. In each case, an individual selects one option from a finite set of alternatives. This chapter introduces the econometric framework for modeling these decisions, based on the foundational model of Random Utility Maximization (RUM) of McFadden (1974), building toward flexible specifications that can approximate arbitrary patterns of substitution across alternatives.

Motivating Examples

Before introducing the general framework, let’s consider two examples that illustrate the range of discrete choice problems.

Example 1: Welfare Program Participation and Discrete Labor Supply

Consider an individual deciding whether to participate in a welfare program and how much to work. The individual chooses among six combinations of program participation \((P\in\{0,1\})\) and hours (\(H\in\{0,1,2\}\), corresponding to no work, part-time work, and full-time work). Let \(j = 3P + H\) index the six combinations of the discrete choice.

Let \(W\) denote the wage, \(b\) the welfare benefit, and \(c\) the cost of program participation (stigma, compliance). The utility from each option depends on consumption, the disutility of labor, and a preference shock:

\[ U_{j} = u(Y+bP_{j} + 20 W H_{j}) - \alpha_0 P_{j} - \alpha_1 H_{j} + \varepsilon_{j} \]

where \(Y\) is non-labor income and \((\varepsilon_0,\ldots,\varepsilon_5)\) capture unobserved preference heterogeneity. The individual selects the option that maximizes utility. This is a multinomial choice problem with alternatives that differ in attributes determined partly by the individual’s characteristics (wages, non-labor income).
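
To make the setup concrete, here is a minimal sketch that constructs the representative utilities \(V_j\) for the six alternatives under log consumption utility, \(u(c)=\log c\). The parameter values (wage, benefit, disutility terms) are illustrative, not taken from any particular application.

# Representative utilities for the six (P, H) combinations (illustrative values)
Y, W, b = 100.0, 10.0, 50.0     # non-labor income, hourly wage, welfare benefit
α0, α1 = 0.5, 0.8               # disutility of program participation and of work
u(c) = log(c)                   # consumption utility

V = [u(Y + b*P + 20*W*H) - α0*P - α1*H for P in 0:1 for H in 0:2]
# V[1 + 3*P + H] is the representative utility of alternative j = 3P + H (Julia arrays are 1-based)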

Example 2: Automobile Choice

A consumer chooses one car from \(J\) models available on the market. Each model \(j\) has a price \(p_j\), fuel efficiency \(x_{j1}\), horsepower \(x_{j2}\), and other attributes. Consumer \(n\)’s utility from car \(j\) is:

\[U_{nj} = \alpha_n(y_n - p_j) + \mathbf{x}_{j}\boldsymbol{\beta}_n + \varepsilon_{nj}\]

where \(y_n\) is income and \((\alpha_n,\boldsymbol{\beta}_n)\) are taste parameters that may vary across consumers. The consumer chooses the car \(j\) that maximizes \(U_{nj}\).

This is a differentiated products demand problem. Unlike the labor supply example, the alternatives are defined by product attributes rather than individual characteristics. The number of alternatives \(J\) can be large (hundreds of car models), and the key econometric challenge is modeling realistic substitution patterns: when the price of a Honda Civic increases, consumers are more likely to substitute toward other compact cars than toward a luxury SUV.

These two examples illustrate a common structure: individuals choose the option that maximizes utility, where utility depends on observed attributes, individual characteristics, and unobservable preference shocks. The Random Utility Model provides a unifying framework.

The Random Utility Model

McFadden (1974) formalized the connection between individual utility maximization and the probabilistic models used to analyze discrete choice data. The key insight is that from the econometrician’s perspective, utility contains an unobserved component, so the choice outcome is random even though the individual is choosing deterministically.

Definition 1 (Random Utility Model) A Random Utility Model (RUM) specifies that individual \(n\) faces a choice set \(\mathcal{J}_n = \{1,\ldots,J\}\) and assigns utility:

\[U_{nj} = V_{nj} + \varepsilon_{nj},\qquad j\in\mathcal{J}_n\]

where \(V_{nj}\) is the representative utility (a function of observed attributes and parameters) and \(\varepsilon_{nj}\) is a random term reflecting unobserved factors. The individual chooses alternative \(j\) if:

\[U_{nj} > U_{nk}\qquad\text{for all } k\neq j\]

The choice probability for alternative \(j\) is:

\[P_{nj} = \Pr(U_{nj} > U_{nk}\ \forall\ k\neq j) = \Pr(\varepsilon_{nk} - \varepsilon_{nj} < V_{nj} - V_{nk}\ \forall\ k\neq j)\]

This is a \((J-1)\)-dimensional integral over the distribution of the error differences. Different assumptions on the joint distribution of \((\varepsilon_{n1},\ldots,\varepsilon_{nJ})\) lead to different choice models with different substitution patterns.

Normalization

Notice that observed choices are invariant to any changes in the joint location and scale of the random vector \(U_{n}\). In practice, we must normalize both: fix the representative utility of one alternative to pin down the location (e.g. \(V_{n1}=0\)), and fix the variance of the errors to pin down the scale (e.g. \(\text{Var}(\varepsilon_{nj})=\pi^2/6\) in the logit, or \(\text{Var}(\varepsilon_{nj})=1\) in the probit).

Before going deeper, let’s cover two of the most standard models of discrete choice, generated by alternative assumptions on the distribution of \(\varepsilon\).

Multinomial Logit

The multinomial logit (MNL) model arises when the errors \(\varepsilon_{nj}\) are iid draws from a Type I Extreme Value (Gumbel) distribution with CDF \(F(\varepsilon) = e^{-e^{-\varepsilon}}\). Extreme value distributions are a class of distributions constructed so that the maximum of independent draws remains tractable. As a result, this assumption yields choice probabilities in closed form:

\[P_{nj} = \frac{e^{V_{nj}}}{\sum_{k=1}^{J}e^{V_{nk}}}\]

The MNL model is elegant and easy to estimate, but it has a well-known limitation: the independence of irrelevant alternatives (IIA) property.

Independence of Irrelevant Alternatives

The MNL model implies that the ratio of choice probabilities (the odds) between any two alternatives is independent of all other alternatives:

\[\frac{P_{nj}}{P_{nk}} = \frac{e^{V_{nj}}}{e^{V_{nk}}} = e^{V_{nj}-V_{nk}}\]

This ratio depends only on alternatives \(j\) and \(k\) — it does not change when other alternatives are added, removed, or modified. This means that if a new alternative enters the choice set, it draws proportionally from all existing alternatives. In the automobile example, this would imply that a new compact car draws equally (in proportional terms) from other compact cars and from luxury SUVs — a clearly unrealistic restriction.

The classic illustration is the “red bus / blue bus” problem. Suppose commuters split evenly between a car and a red bus, so \(P(\text{car}) = P(\text{red bus}) = 1/2\). Adding a blue bus that is identical to the red bus should not affect the probability of choosing the car (the blue bus should draw only from the red bus), but the MNL predicts that \(P(\text{car})\) falls from \(1/2\) to \(1/3\).
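
The arithmetic takes two lines. This is a minimal check, assuming equal representative utilities for all modes:

mnl_probs(V) = exp.(V) ./ sum(exp.(V))

mnl_probs([0.0, 0.0])        # car vs. red bus: [0.5, 0.5]
mnl_probs([0.0, 0.0, 0.0])   # add an identical blue bus: [1/3, 1/3, 1/3], so P(car) drops to 1/3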

The IIA property is a direct consequence of the iid assumption on the errors. It is also the primary motivation for the richer GEV models below.

Multinomial Probit

An alternative to the extreme value distribution is to assume that the errors are jointly normal: \((\varepsilon_{n1},\ldots,\varepsilon_{nJ})\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})\). This is the multinomial probit (MNP) model. The choice probabilities are:

\[P_{nj} = \Pr(\varepsilon_{nk}-\varepsilon_{nj} < V_{nj}-V_{nk}\ \forall\ k\neq j) = \int_{\mathcal{R}_j}\phi_{J-1}(\boldsymbol{\eta};\tilde{\boldsymbol{\Sigma}})d\boldsymbol{\eta}\]

where \(\boldsymbol{\eta}\) is the \((J-1)\)-vector of utility differences, \(\tilde{\boldsymbol{\Sigma}}\) is the covariance matrix of these differences, and \(\phi_{J-1}\) is the \((J-1)\)-dimensional normal density.

The key advantage of the MNP over the MNL is flexibility: by specifying the covariance matrix \(\boldsymbol{\Sigma}\), the analyst can encode arbitrary patterns of correlation among alternatives. Two alternatives whose error terms are positively correlated will be closer substitutes, exactly the kind of structure that the IIA property of the MNL rules out. The key disadvantage is that the choice probabilities require evaluating a \((J-1)\)-dimensional normal integral, which has no closed form and quickly becomes intractable with standard numerical quadrature as \(J\) grows.

The GHK Simulator

For maximum simulated likelihood estimation of the MNP, we need a simulator for the multivariate normal rectangle probability \(P_{nj}\). The GHK simulator (Geweke, Hajivassiliou, Keane) provides an efficient solution by exploiting the Cholesky decomposition of \(\tilde{\boldsymbol{\Sigma}}\).

Write \(\tilde{\boldsymbol{\Sigma}} = LL'\) where \(L\) is lower triangular. The idea is to decompose the \((J-1)\)-dimensional integral into a product of one-dimensional conditional probabilities using the recursive structure of the Cholesky factorization. Defining \(\eta_k = \sum_{\ell\leq k}L_{k\ell}\nu_\ell\) where \(\nu_\ell\sim\mathcal{N}(0,1)\) independently:

\[P_{nj} = \prod_{k=1}^{J-1}\Pr\left(\nu_k < \frac{b_k - \sum_{\ell<k}L_{k\ell}\nu_\ell}{L_{kk}}\ \Big|\ \nu_1,\ldots,\nu_{k-1}\right)\]

where \(b_k\) is the upper bound for the \(k\)-th utility difference. Each factor is a univariate normal CDF, which is smooth and easy to compute. To simulate, we sequentially draw \(\nu_k\) from the appropriately truncated normals, compute the product of acceptance probabilities for each replication \(r\), and average over \(R\) replications:

\[\hat{P}_{nj} = \frac{1}{R}\sum_{r=1}^{R}\prod_{k=1}^{J-1}\Phi\!\left(\frac{b_k - \sum_{\ell<k}L_{k\ell}\nu_\ell^{(r)}}{L_{kk}}\right)\]

The GHK simulator has two crucial properties compared to a crude frequency simulator (which would simply draw \(\boldsymbol{\varepsilon}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})\) and check if \(j\) is the utility-maximizing choice):

  1. Smoothness. The GHK probability \(\hat{P}_{nj}\) is a product of normal CDFs, which is a smooth function of the parameters. This avoids the non-differentiability problem that plagues crude frequency simulators (see the simulation chapter), enabling gradient-based optimization of the simulated log-likelihood.
  2. Efficiency. By conditioning sequentially, the GHK avoids “wasting” draws in regions of the distribution that are inconsistent with the observed choice, dramatically reducing variance.
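
The following is a minimal sketch of the GHK recursion in Julia. It assumes the bounds \(b\) (the differences \(V_{nj}-V_{nk}\)) and the covariance matrix \(\tilde{\boldsymbol{\Sigma}}\) of the error differences have already been constructed for the chosen alternative; the function name and interface are illustrative rather than part of any package.

using Distributions, LinearAlgebra

# GHK simulator for Pr(η ≤ b) with η ~ N(0, Σ): average over R replications of the
# product of sequential acceptance probabilities (each a smooth normal CDF).
function ghk_probability(b::Vector{Float64}, Σ::Matrix{Float64}, R::Int)
    L = cholesky(Symmetric(Σ)).L            # lower-triangular Cholesky factor
    d = length(b)
    P = 0.0
    for r in 1:R
        ν = zeros(d)
        prob_r = 1.0
        for k in 1:d
            # conditional upper bound for ν_k given ν_1, ..., ν_{k-1}
            ub = (b[k] - dot(L[k, 1:k-1], ν[1:k-1])) / L[k, k]
            Φub = cdf(Normal(), ub)
            prob_r *= Φub                   # acceptance probability at step k
            # draw ν_k from N(0,1) truncated above at ub via the inverse CDF
            ν[k] = quantile(Normal(), rand() * Φub)
        end
        P += prob_r
    end
    return P / R
end

# Example: a 3-alternative probit, so the error differences are 2-dimensional
ghk_probability([0.4, 0.8], [1.0 0.5; 0.5 1.0], 1_000)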

The Generalized Extreme Value (GEV) Family

The two benchmarks covered so far force an uncomfortable trade-off: the MNL is computationally trivial but imposes IIA, while the MNP is flexible but requires high-dimensional simulation.

McFadden (1978) introduced a powerful class of models — the Generalized Extreme Value (GEV) family — that generate RUM-consistent choice probabilities in closed form and allow for much greater flexibility in substitution patterns. The key is a generating function that encodes the correlation structure among alternatives.

Definition 2 (GEV Generating Function) A function \(G(y_1,\ldots,y_J)\) is a GEV generating function if it satisfies:

  1. \(G\geq 0\)
  2. \(G\) is homogeneous of degree 1: \(G(\alpha y_1,\ldots,\alpha y_J) = \alpha G(y_1,\ldots,y_J)\) for \(\alpha > 0\)
  3. \(G\rightarrow\infty\) as \(y_j\rightarrow\infty\) for any \(j\)
  4. The mixed partial derivatives satisfy: \(\frac{\partial^{k}G}{\partial y_{j_1}\cdots\partial y_{j_k}}\geq 0\) if \(k\) is odd and \(\leq 0\) if \(k\) is even, for any distinct \(j_1,\ldots,j_k\)

Given a GEV generating function \(G\), the choice probabilities take the form:

\[P_{nj} = \frac{y_j G_j(y_1,\ldots,y_J)}{G(y_1,\ldots,y_J)}\]

where \(y_j = e^{V_{nj}}\) and \(G_j = \partial G / \partial y_j\). McFadden (1978) showed that these probabilities are consistent with a RUM where the errors \((\varepsilon_{n1},\ldots,\varepsilon_{nJ})\) follow a multivariate extreme value distribution. Importantly, each choice of \(G\) defines a different correlation structure and therefore different substitution patterns.

Moreover, the expected maximum utility — a key object for welfare analysis — takes a remarkably simple form in terms of \(G\):

\[\mathbb{E}\left[\max_{j\in\mathcal{J}} U_{nj}\right] = \log G(e^{V_{n1}},\ldots,e^{V_{nJ}}) + C\]

where \(C\) is Euler’s constant (\(\approx 0.5772\)). This result, due to Williams (1977) and Daly and Zachary (1978), is discussed further in the section on the Williams-Daly-Zachary theorem below.

Let us work through the most important special cases.

Multinomial Logit

The simplest GEV model sets \(G(y_1,\ldots,y_J) = \sum_{j=1}^{J}y_j\). Since \(G_j = 1\), the choice probabilities are:

\[P_{nj} = \frac{e^{V_{nj}}}{\sum_{k=1}^{J}e^{V_{nk}}}\]

and we arrive back at the MNL model.

Nested Logit

The nested logit model relaxes IIA by grouping alternatives into nests that share a common unobserved component. Alternatives within the same nest are closer substitutes than alternatives in different nests.

Suppose the \(J\) alternatives are partitioned into \(M\) nests \(B_1,\ldots,B_M\). The GEV generating function is:

\[G(y_1,\ldots,y_J) = \sum_{m=1}^{M}\left(\sum_{j\in B_m}y_j^{1/\lambda_m}\right)^{\lambda_m}\]

where \(\lambda_m\in(0,1]\) is a dissimilarity parameter that governs the degree of correlation within nest \(m\). When \(\lambda_m = 1\) for all \(m\), the model reduces to the MNL. As \(\lambda_m\rightarrow 0\), alternatives within nest \(m\) become perfect substitutes.

The choice probabilities have a convenient sequential structure. Define:

\[I_m = \log\sum_{j\in B_m}e^{V_{nj}/\lambda_m}\]

This is the inclusive value (or log-sum) of nest \(m\). The probability of choosing alternative \(j\) in nest \(m\) can be decomposed as:

\[P_{nj} = P(j|m)\cdot P(m)\]

where:

\[\begin{align} P(j|m) &= \frac{e^{V_{nj}/\lambda_m}}{\sum_{k\in B_m}e^{V_{nk}/\lambda_m}} \\ P(m) &= \frac{e^{\lambda_m I_m}}{\sum_{\ell=1}^{M}e^{\lambda_\ell I_\ell}} \end{align}\]

The first term is a standard logit within the nest; the second is a logit over nests using the inclusive values. IIA holds within each nest but not across nests: when an alternative becomes more attractive, it draws disproportionately from alternatives in the same nest.
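
To see how this decomposition follows from the general GEV formula, differentiate the generating function. For \(j\in B_m\):

\[G_j = y_j^{1/\lambda_m - 1}\left(\sum_{k\in B_m}y_k^{1/\lambda_m}\right)^{\lambda_m - 1}\]

so that, evaluating at \(y_k = e^{V_{nk}}\) and using \(e^{I_m} = \sum_{k\in B_m}e^{V_{nk}/\lambda_m}\):

\[P_{nj} = \frac{y_j G_j}{G} = \frac{e^{V_{nj}/\lambda_m}\,e^{(\lambda_m-1)I_m}}{\sum_{\ell=1}^{M}e^{\lambda_\ell I_\ell}} = \underbrace{\frac{e^{V_{nj}/\lambda_m}}{e^{I_m}}}_{P(j|m)}\cdot\underbrace{\frac{e^{\lambda_m I_m}}{\sum_{\ell}e^{\lambda_\ell I_\ell}}}_{P(m)}\]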

Example 1 Consider the automobile choice problem from Example 2. Suppose there are six models available, which we group into three nests by body type: compact (\(B_1 = \{\text{Civic}, \text{Corolla}\}\)), midsize (\(B_2 = \{\text{Camry}, \text{Accord}\}\)), and luxury (\(B_3 = \{\text{BMW 3}, \text{Audi A4}\}\)).

Under the MNL, if the price of the Civic increases, consumers substitute toward all other cars proportionally. A Corolla and a BMW gain the same proportional increase in market share — clearly unrealistic. The nested logit corrects this: with dissimilarity parameters \(\lambda_m < 1\) within each nest, a price increase for the Civic causes substitution primarily toward the Corolla (its nest-mate in the compact segment). The extent of within-nest substitution is governed by \(\lambda_m\): smaller values imply that models within a segment are closer substitutes.

To see this concretely, consider the cross-derivatives of the choice probabilities. In the MNL, \(\partial P_k/\partial V_j = -P_k P_j\) for every \(k\neq j\), so the proportional effect \(\partial\log P_k/\partial V_j = -P_j\) is identical across all other alternatives. In the nested logit, if \(j\) and \(k\) are in the same nest \(m\):

\[\frac{\partial P_k}{\partial V_j} = -P_k\left(\left(\frac{1}{\lambda_m}-1\right)P(j|m) + P_j\right)\]

while if \(k\) is in a different nest, the cross-derivative is simply \(-P_k P_j\), just as in the MNL. Since \(\lambda_m \leq 1\), the extra within-nest term \((1/\lambda_m - 1)P(j|m)\geq 0\) makes within-nest cross-effects larger in magnitude: a price change for the Civic has a bigger effect on the Corolla than on the BMW.
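
A short simulation makes these substitution patterns visible. The sketch below computes nested logit probabilities for the six-car example using the inclusive values defined above; the representative utilities and dissimilarity parameters are made up for illustration.

# Nested logit probabilities for the six-car example (illustrative V and λ values)
models = ["Civic", "Corolla", "Camry", "Accord", "BMW 3", "Audi A4"]
nests  = [1, 1, 2, 2, 3, 3]                 # compact, midsize, luxury
λ      = [0.4, 0.4, 0.4]                    # dissimilarity parameters

function nested_logit_probs(V, nests, λ)
    M = maximum(nests)
    I = [log(sum(exp(V[j] / λ[m]) for j in eachindex(V) if nests[j] == m)) for m in 1:M]
    D = sum(exp(λ[m] * I[m]) for m in 1:M)
    P = similar(V)
    for j in eachindex(V)
        m = nests[j]
        P[j] = (exp(V[j] / λ[m]) / exp(I[m])) * (exp(λ[m] * I[m]) / D)   # P(j|m) · P(m)
    end
    return P
end

V0 = [0.5, 0.4, 0.3, 0.3, 0.1, 0.1]
V1 = copy(V0); V1[1] -= 0.2                 # the Civic becomes less attractive (price increase)

P0 = nested_logit_probs(V0, nests, λ)
P1 = nested_logit_probs(V1, nests, λ)
round.((P1 .- P0) ./ P0, digits=3)          # proportional gains are largest for the Corolla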

Cross-Nested Logit

The nested logit requires each alternative to belong to exactly one nest. The cross-nested logit (Vovsha (1997)) relaxes this by allowing alternatives to belong to multiple nests with different degrees of membership. The generating function is:

\[G(y_1,\ldots,y_J) = \sum_{m=1}^{M}\left(\sum_{j=1}^{J}\alpha_{jm}y_j^{1/\lambda_m}\right)^{\lambda_m}\]

where \(\alpha_{jm}\geq 0\) is the allocation parameter indicating the degree to which alternative \(j\) belongs to nest \(m\), and \(\lambda_m \in (0,1]\) is the dissimilarity parameter for nest \(m\). For each \(j\), we require \(\sum_{m}\alpha_{jm} > 0\) (every alternative belongs to at least one nest). The nested logit is the special case where each \(\alpha_{jm}\in\{0,1\}\) and each alternative has exactly one non-zero allocation.

Applying the GEV choice probability formula \(P_j = \frac{e^{V_j} G_j(e^{V_1},\ldots,e^{V_J})}{G(e^{V_1},\ldots,e^{V_J})}\) to this generating function yields the cross-nested logit choice probabilities. Differentiating \(G\) with respect to \(y_j\):

\[G_j \equiv \frac{\partial G}{\partial y_j} = \sum_{m=1}^{M} \alpha_{jm}\, y_j^{1/\lambda_m - 1} \left(\sum_{k=1}^{J}\alpha_{km}y_k^{1/\lambda_m}\right)^{\lambda_m - 1}\]

Evaluating at \(y_k = e^{V_k}\) and defining the inclusive value of nest \(m\) as \(G_m \equiv \sum_{k=1}^{J}\alpha_{km}e^{V_k/\lambda_m}\), the choice probability becomes:

\[\boxed{P_j = \frac{\displaystyle\sum_{m=1}^{M}\alpha_{jm}\, e^{V_j/\lambda_m}\, G_m^{\lambda_m - 1}}{\displaystyle\sum_{m'=1}^{M} G_{m'}^{\lambda_{m'}}}} \tag{1}\]

Defining the within-nest conditional share \(s_{j|m} = \alpha_{jm}e^{V_j/\lambda_m}/G_m\) and the nest probability \(\pi_m = G_m^{\lambda_m}/D\) where \(D = \sum_{m'}G_{m'}^{\lambda_{m'}}\), this can be decomposed as:

\[P_j = \sum_{m=1}^{M} \pi_m \, s_{j|m}\]

This is useful when alternatives share attributes along multiple dimensions. For example, a hybrid SUV shares features with both the “SUV” nest (body type) and the “hybrid” nest (fuel type).

Example 2 Returning to the welfare program participation example, recall that there are six alternatives: all combinations of program participation (\(P\in\{0,1\}\)) and hours (\(H\in\{0,1,2\}\)). The nested logit forces each alternative into a single nest, but the natural substitution patterns here run along multiple dimensions: program participation, the extensive margin of labor supply, and the intensive margin. The cross-nested logit handles this naturally.

Define four nests capturing these dimensions:

Alternative (allocation \(\alpha_{jm}\))   Nest 1: No program   Nest 2: Program   Nest 3: Work   Nest 4: Full-time
\(j=0\): \((H=0,P=0)\)                     1                    0                 0              0
\(j=1\): \((H=1,P=0)\)                     1                    0                 1              0
\(j=2\): \((H=2,P=0)\)                     1                    0                 0              1
\(j=3\): \((H=0,P=1)\)                     0                    1                 0              0
\(j=4\): \((H=1,P=1)\)                     0                    1                 1              0
\(j=5\): \((H=2,P=1)\)                     0                    1                 0              1

The key feature of this structure is that the working alternatives (\(j=1,4\)) belong to both their participation nest and the work nest, while full-time alternatives (\(j=2,5\)) belong to both their participation nest and the full-time nest. The not-working alternatives (\(j=0,3\)) each belong only to their participation nest. This allows the model to capture substitution along the participation dimension (between \(j=0,1,2\) and \(j=3,4,5\)), the extensive margin (between \(H=0\) and \(H>0\)), and the intensive margin (between \(H=1\) and \(H=2\)), each governed by a separate dissimilarity parameter.

Write the dissimilarity parameters as \(\lambda_P\) for the two participation nests (shared for symmetry), \(\lambda_E\) for the work (extensive margin) nest, and \(\lambda_I\) for the full-time (intensive margin) nest. The four inclusive values are: \[\begin{align} G_1 &= e^{V_0/\lambda_P} + e^{V_1/\lambda_P} + e^{V_2/\lambda_P} \\ G_2 &= e^{V_3/\lambda_P} + e^{V_4/\lambda_P} + e^{V_5/\lambda_P} \\ G_3 &= e^{V_1/\lambda_E} + e^{V_4/\lambda_E} \\ G_4 &= e^{V_2/\lambda_I} + e^{V_5/\lambda_I} \end{align}\]

and the denominator is \(D = G_1^{\lambda_P} + G_2^{\lambda_P} + G_3^{\lambda_E} + G_4^{\lambda_I}\).

To derive own-derivatives \(\partial P_j / \partial V_j\), differentiate Equation 1. The numerator is \(N_j = \sum_m \alpha_{jm} e^{V_j/\lambda_m} G_m^{\lambda_m - 1}\), so: \[\frac{\partial N_j}{\partial V_j} = \sum_m \frac{\alpha_{jm}\,e^{V_j/\lambda_m}\,G_m^{\lambda_m - 1}}{\lambda_m}\left(1 + (\lambda_m - 1)s_{j|m}\right)\] Since \(\partial D / \partial V_j = N_j = P_j D\), the quotient rule gives: \[\frac{\partial P_j}{\partial V_j} = \sum_m \frac{\pi_m\, s_{j|m}}{\lambda_m}\left(1 + (\lambda_m - 1)s_{j|m}\right) - P_j^2 \tag{2}\]

Now apply this to the three alternatives with \(P=1\).

\(j=3\): participate, don’t work (\(\alpha_{3,2}=1\) only). Since \(P_3 = \pi_2 s_{3|2}\): \[\frac{\partial P_3}{\partial V_3} = \frac{\pi_2\, s_{3|2}}{\lambda_P}\left(1 + (\lambda_P - 1)s_{3|2}\right) - P_3^2 = \frac{P_3}{\lambda_P}\left(1 + (\lambda_P - 1)s_{3|2}\right) - P_3^2\]

The responsiveness of the “participate but don’t work” alternative is governed entirely by \(\lambda_P\), since this alternative belongs only to the program nest.

\(j=4\): participate, work part-time (\(\alpha_{4,2}=1\), \(\alpha_{4,3}=1\)). Since \(P_4 = \pi_2 s_{4|2} + \pi_3 s_{4|3}\): \[\frac{\partial P_4}{\partial V_4} = \frac{\pi_2\, s_{4|2}}{\lambda_P}\left(1 + (\lambda_P - 1)s_{4|2}\right) + \frac{\pi_3\, s_{4|3}}{\lambda_E}\left(1 + (\lambda_E - 1)s_{4|3}\right) - P_4^2\]

The own-derivative has two channels: substitution within the program nest (governed by \(\lambda_P\)) and substitution within the work nest (governed by \(\lambda_E\)). A lower \(\lambda_E\) amplifies the responsiveness of \(P_4\) to changes in \(V_4\) through the extensive margin channel, while \(\lambda_P\) controls the participation channel.

\(j=5\): participate, work full-time (\(\alpha_{5,2}=1\), \(\alpha_{5,4}=1\)): \[\frac{\partial P_5}{\partial V_5} = \frac{\pi_2\, s_{5|2}}{\lambda_P}\left(1 + (\lambda_P - 1)s_{5|2}\right) + \frac{\pi_4\, s_{5|4}}{\lambda_I}\left(1 + (\lambda_I - 1)s_{5|4}\right) - P_5^2\]

The structure is the same as for \(j=4\), but the intensive margin dissimilarity \(\lambda_I\) replaces \(\lambda_E\). If \(\lambda_I < \lambda_E\), then, holding nest shares fixed, the full-time/part-time margin is more responsive than the extensive margin: the unobservables of the full-time alternatives are more strongly correlated than those of the part-time alternatives. This allows the model to accommodate, for instance, settings where it is relatively easy to switch between part-time and full-time (low \(\lambda_I\)), harder to enter or exit the labor force (moderate \(\lambda_E\)), and hardest to change program participation status (high \(\lambda_P\)).
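
A numerical check of these formulas is straightforward. The sketch below evaluates the cross-nested logit probabilities of Equation 1 for this nesting structure and verifies the own-derivative formula in Equation 2 against automatic differentiation; the representative utilities and dissimilarity parameters are illustrative.

using ForwardDiff

# Allocation matrix α (6 alternatives × 4 nests) from the table above
α = [1 0 0 0;
     1 0 1 0;
     1 0 0 1;
     0 1 0 0;
     0 1 1 0;
     0 1 0 1]
λ = [0.6, 0.6, 0.4, 0.3]                    # (λ_P, λ_P, λ_E, λ_I)
V = [0.0, 0.3, 0.5, 0.4, 0.6, 0.7]          # illustrative representative utilities

# Cross-nested logit probabilities (Equation 1)
function cnl_probs(V, α, λ)
    M = length(λ)
    G = [sum(α[i, m] * exp(V[i] / λ[m]) for i in eachindex(V)) for m in 1:M]
    D = sum(G[m]^λ[m] for m in 1:M)
    return [sum(α[j, m] * exp(V[j] / λ[m]) * G[m]^(λ[m] - 1) for m in 1:M) / D for j in eachindex(V)]
end

# Own-derivative ∂P_j/∂V_j from Equation 2
function cnl_own_derivative(V, α, λ, j)
    M  = length(λ)
    G  = [sum(α[i, m] * exp(V[i] / λ[m]) for i in eachindex(V)) for m in 1:M]
    D  = sum(G[m]^λ[m] for m in 1:M)
    πm = [G[m]^λ[m] / D for m in 1:M]                       # nest probabilities
    s  = [α[j, m] * exp(V[j] / λ[m]) / G[m] for m in 1:M]   # within-nest shares s_{j|m}
    Pj = sum(πm[m] * s[m] for m in 1:M)
    return sum(πm[m] * s[m] / λ[m] * (1 + (λ[m] - 1) * s[m]) for m in 1:M) - Pj^2
end

j = 6                                        # alternative j = 5: participate, work full-time
cnl_own_derivative(V, α, λ, j) ≈ ForwardDiff.gradient(v -> cnl_probs(v, α, λ)[j], V)[j]   # true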

Paired Combinatorial Logit

The paired combinatorial logit (Koppelman and Wen (2000)) takes a different approach: instead of grouping alternatives into nests, it defines a “nest” for every pair of alternatives. The generating function is:

\[G(y_1,\ldots,y_J) = \sum_{j=1}^{J}\sum_{k=j+1}^{J}\left(y_j^{1/\lambda_{jk}} + y_k^{1/\lambda_{jk}}\right)^{\lambda_{jk}}\]

where \(\lambda_{jk}\in(0,1]\) governs the correlation between alternatives \(j\) and \(k\). This model allows for a rich, pairwise correlation structure without requiring the analyst to specify a nesting structure a priori. The cost is a large number of parameters: \(J(J-1)/2\) dissimilarity parameters, which may need to be restricted for tractability.
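
Applying the GEV probability formula to this generating function (the same computation as for the cross-nested logit) gives the paired combinatorial logit probabilities:

\[P_{nj} = \frac{\displaystyle\sum_{k\neq j} e^{V_{nj}/\lambda_{jk}}\left(e^{V_{nj}/\lambda_{jk}} + e^{V_{nk}/\lambda_{jk}}\right)^{\lambda_{jk}-1}}{\displaystyle\sum_{\ell=1}^{J-1}\sum_{m=\ell+1}^{J}\left(e^{V_{n\ell}/\lambda_{\ell m}} + e^{V_{nm}/\lambda_{\ell m}}\right)^{\lambda_{\ell m}}}\]

so each pair \((j,k)\) behaves like a two-alternative nest with its own dissimilarity parameter \(\lambda_{jk}\).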

The Williams-Daly-Zachary Theorem

We have already noted that the expected maximum utility in a GEV model takes a simple form in terms of the generating function \(G\). The Williams-Daly-Zachary (WDZ) theorem (Williams (1977); Daly and Zachary (1978)) establishes a deeper result: for any random utility model, the expected maximum utility is a convex function of the representative utilities, and the choice probabilities are its gradients. The GEV formula is a special case.

Theorem 1 (Williams-Daly-Zachary Theorem) Let \(U_{nj} = V_{nj} + \varepsilon_{nj}\) for \(j\in\mathcal{J}\) define a random utility model with continuous joint density for \((\varepsilon_{n1},\ldots,\varepsilon_{nJ})\). Define the social surplus function:

\[W(V_{n1},\ldots,V_{nJ}) = \mathbb{E}\left[\max_{j\in\mathcal{J}} U_{nj}\right]\]

Then:

  1. \(W\) is convex in \((V_{n1},\ldots,V_{nJ})\).
  2. The choice probabilities are the gradients of \(W\): \[P_{nj} = \frac{\partial W}{\partial V_{nj}}\]

This result says that the expected maximum utility plays the role of an expenditure function (or its dual) in the discrete choice setting, with the choice probabilities recovered by differentiation — exactly as Roy’s identity recovers demand functions from the indirect utility function in continuous consumer theory. The analogy is precise: in continuous demand, \(x_j = -\frac{\partial V/\partial p_j}{\partial V/\partial y}\); in discrete choice, \(P_j = \frac{\partial W}{\partial V_j}\). The social surplus function \(W\) is the discrete choice analogue of the indirect utility function.
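
As a quick check of the theorem in the simplest case, take the MNL, for which \(W(V_{n1},\ldots,V_{nJ}) = \log\sum_{k}e^{V_{nk}} + C\). Differentiating:

\[\frac{\partial W}{\partial V_{nj}} = \frac{e^{V_{nj}}}{\sum_{k=1}^{J}e^{V_{nk}}} = P_{nj}\]

which recovers the logit choice probability, and convexity of the log-sum-exp function delivers part 1.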

Mixed Multinomial Logit

The GEV models provide closed-form choice probabilities with richer substitution patterns than the basic MNL. But they still impose a specific parametric structure on the correlation among unobservables. The Mixed Multinomial Logit (MMNL) model — also known as the random coefficients logit — takes a different and more flexible approach: it allows the parameters of the utility function to vary randomly across individuals.

Specification

In the MMNL, individual \(n\)’s utility for alternative \(j\) is:

\[U_{nj} = V_{nj}(\boldsymbol{\beta}_n) + \varepsilon_{nj}\]

where \(\varepsilon_{nj}\) is iid Type I Extreme Value (as in the standard logit) and \(\boldsymbol{\beta}_n\) is a vector of individual-specific taste parameters drawn from a mixing distribution \(f(\boldsymbol{\beta}|\boldsymbol{\theta})\). Conditional on \(\boldsymbol{\beta}_n\), the choice probabilities are standard logit:

\[L_{nj}(\boldsymbol{\beta}) = \frac{e^{V_{nj}(\boldsymbol{\beta})}}{\sum_{k}e^{V_{nk}(\boldsymbol{\beta})}}\]

The unconditional choice probability is obtained by integrating over the mixing distribution:

\[P_{nj} = \int L_{nj}(\boldsymbol{\beta})f(\boldsymbol{\beta}|\boldsymbol{\theta})d\boldsymbol{\beta}\]

This integral generally does not have a closed-form solution, which is why simulation methods (from the previous chapter) are essential.

Why Mixed Logit Is So Flexible

McFadden and Train (2000) proved a useful result: any RUM-consistent choice probabilities can be approximated arbitrarily closely by a MMNL model. To state this precisely, we need a slightly more formal definition of the random utility model.

Consider a general RUM where the indirect utility of consumer \(n\) for alternative \(j\) is:

\[U_{nj}^* = U^*(\mathbf{z}_j, \mathbf{s}_n, \varepsilon_j, \nu_n)\]

Here \(\mathbf{z}_j\) are observed choice characteristics (e.g. price, product attributes), \(\mathbf{s}_n\) are observed consumer characteristics (e.g. income, demographics), \(\varepsilon_j\) are unobserved product-level factors, and \(\nu_n\) are unobserved consumer-level factors. The choice probabilities are obtained by integrating over the unobservables:

\[P_{nj} = \Pr\left(U_{nj}^* \geq U_{nk}^*\ \forall\ k\neq j\ \Big|\ \mathbf{z}, \mathbf{s}_n\right) = \int \mathbf{1}\{U_{nj}^* \geq U_{nk}^*\ \forall\ k\neq j\}\ dF(\varepsilon,\nu)\]

McFadden and Train (2000) prove that an MMNL can approximate these choice probabilities arbitrarily well. Their construction is based on a polynomial approximation to \(U^*_{nj}\), leading to an MMNL with representative utility of the form:

\[V_{nj}(\boldsymbol{\beta}) = \mathbf{x}_{nj}'\boldsymbol{\beta}\]

where \(\mathbf{x}_{nj}\) is a vector of polynomial basis functions in \((\mathbf{z}_j, \mathbf{s}_n)\) and \(\boldsymbol{\beta}\) is the random coefficient vector. The term \(\mathbf{x}_{nj}'\boldsymbol{\beta}\) can be understood as being proportional to the underlying polynomial approximation to the true indirect utility \(U^*\).

Theorem 2 (McFadden-Train Approximation Theorem) Let \(\{P_{nj}^*\}\) be any set of choice probabilities derived from a random utility model \(U^*(\mathbf{z}_j, \mathbf{s}_n, \varepsilon_j, \nu_n)\) that satisfies mild regularity conditions. For any \(\delta > 0\), there exists a MMNL specification — a choice of basis functions \(\mathbf{x}_{nj}(\mathbf{z}_j, \mathbf{s}_n)\) and mixing distribution \(f(\boldsymbol{\beta}|\boldsymbol{\theta})\) — such that the MMNL choice probabilities satisfy:

\[\sup_{n,j}|P_{nj} - P_{nj}^*| < \delta\]

The mixing distribution then generates the required correlation structure among utilities. This is analogous to the universal approximation theorems for neural networks, and the connection is made precise by the latent class version discussed below.

This result makes the MMNL an attractive “default” specification when the analyst does not have strong prior information about the correlation structure. In practice, common choices for \(f\) include:

  • Normal: \(\boldsymbol{\beta}_n\sim\mathcal{N}(\mathbf{b},\mathbf{\Sigma})\), where \(\mathbf{b}\) and \(\mathbf{\Sigma}\) are the parameters to be estimated.
  • Log-normal: \(\log\boldsymbol{\beta}_n\sim\mathcal{N}(\mathbf{b},\mathbf{\Sigma})\), useful when coefficients should have a particular sign (e.g. price coefficients should be negative).
  • Discrete (latent class): \(\boldsymbol{\beta}_n\) takes one of \(K\) values \(\boldsymbol{\beta}^{1},\ldots,\boldsymbol{\beta}^{K}\) with probabilities \(\pi_1,\ldots,\pi_K\). See the next section.

Latent Class Models

An important special case of the MMNL arises when the mixing distribution is discrete. Suppose there are \(K\) latent classes (or types), each with its own parameter vector \(\boldsymbol{\beta}^k\) and class probability \(\pi_k\):

\[P_{nj} = \sum_{k=1}^{K}\pi_k L_{nj}(\boldsymbol{\beta}^k)\]

where \(\pi_k > 0\) and \(\sum_k\pi_k = 1\). This is a finite mixture of logits. The parameters to estimate are \(\{(\pi_k,\boldsymbol{\beta}^k)\}_{k=1}^K\), and the number of classes \(K\) is typically selected by information criteria (BIC) or cross-validation.

The latent class model has an interesting interpretation. For each observation, the model computes \(K\) different sets of logit probabilities (one per class) and then averages them with weights \(\pi_k\). This is formally identical to a single-layer feedforward neural network: the “hidden layer” consists of the \(K\) class-specific logit models, and the “output layer” combines them with the mixing weights. The universal approximation result of McFadden and Train (2000) can be understood through this lens — just as neural networks with enough hidden units can approximate any function, latent class logit models with enough classes can approximate any set of RUM-consistent choice probabilities.
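
To make the mixture structure concrete, the sketch below computes latent class choice probabilities for one individual in a hypothetical two-class example; all parameter values are made up for illustration.

# Latent class logit: P_nj = Σ_k π_k · L_nj(β^k)
softmax(v) = exp.(v .- maximum(v)) ./ sum(exp.(v .- maximum(v)))

function latent_class_probs(Xn, β_classes, shares)
    # Xn: J × K attribute matrix for one individual; β_classes: one coefficient vector per class
    P = zeros(size(Xn, 1))
    for (k, βk) in enumerate(β_classes)
        P .+= shares[k] .* softmax(Xn * βk)   # class-k logit probabilities, weighted by π_k
    end
    return P
end

Xn = randn(4, 2)                              # 4 alternatives, 2 attributes
β_classes = [[1.0, -0.5], [-0.2, 0.8]]        # two latent classes with different tastes
shares = [0.6, 0.4]                           # class probabilities π_k
latent_class_probs(Xn, β_classes, shares)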

Estimation

Maximum Likelihood for the MNL

For the multinomial logit, the log-likelihood takes a simple form. If individual \(n\) chooses alternative \(d_n\), the contribution to the log-likelihood is:

\[\ell_n(\theta) = \log P_{n,d_n} = V_{n,d_n} - \log\sum_{j=1}^{J}e^{V_{nj}}\]

The MNL log-likelihood is globally concave, so estimation is straightforward:

\[\hat{\theta}_{MLE} = \arg\max_{\theta}\sum_{n=1}^{N}\ell_n(\theta)\]

Example 3 Let’s simulate data from a simple MNL model and estimate the parameters by maximum likelihood.

using Distributions, Random, Optim, LinearAlgebra, ForwardDiff

Random.seed!(321)

# True parameters
J = 4           # number of alternatives
N = 2000        # number of individuals
K = 2           # number of covariates

β_true = [0.5, -0.3]       # coefficients on covariates
α_true = [0.0, 1.0, -0.5]  # ASCs for alternatives 1-3 (alternative 4 normalized to 0)

# Simulate data
X = randn(N, J, K)   # alternative-specific covariates
V = zeros(N, J)
for n in 1:N, j in 1:J
    V[n, j] = (j <= 3 ? α_true[j] : 0.0) + sum(X[n, j, :] .* β_true)
end

# Draw choices
ε = -log.(-log.(rand(N, J)))  # Type I EV draws
U = V .+ ε
choices = [argmax(U[n, :]) for n in 1:N]

# Log-likelihood
function log_likelihood(θ, X, choices)
    J = size(X, 2)
    K = size(X, 3)
    α = vcat(θ[1:J-1], 0.0)  # normalize last ASC to 0
    β = θ[J:J+K-1]
    ll = 0.0
    for n in axes(X, 1)
        V = [α[j] + sum(X[n, j, :] .* β) for j in 1:J]
        logsum = log(sum(exp.(V .- maximum(V)))) + maximum(V)
        ll += V[choices[n]] - logsum
    end
    return ll
end

# Estimate
θ0 = zeros(J - 1 + K)
res = optimize(θ -> -log_likelihood(θ, X, choices), θ0, Newton(), autodiff=:forward)
θ_hat = res.minimizer

println("True vs Estimated Parameters:")
println("  ASC:  true = $(α_true)  est = $(round.(θ_hat[1:J-1], digits=3))")
println("  β:    true = $(β_true)   est = $(round.(θ_hat[J:end], digits=3))")
True vs Estimated Parameters:
  ASC:  true = [0.0, 1.0, -0.5]  est = [0.111, 1.066, -0.345]
  β:    true = [0.5, -0.3]   est = [0.504, -0.305]

Simulated Maximum Likelihood for Mixed Logit

For the MMNL, the choice probability involves an integral that does not have a closed form:

\[P_{nj} = \int L_{nj}(\boldsymbol{\beta})f(\boldsymbol{\beta}|\boldsymbol{\theta})d\boldsymbol{\beta}\]

We approximate this by simulation. Draw \(R\) values \(\boldsymbol{\beta}^1,\ldots,\boldsymbol{\beta}^R\) from \(f(\cdot|\boldsymbol{\theta})\) and compute:

\[\hat{P}_{nj} = \frac{1}{R}\sum_{r=1}^{R}L_{nj}(\boldsymbol{\beta}^r)\]

The simulated log-likelihood is:

\[\hat{\ell}(\boldsymbol{\theta}) = \sum_{n=1}^{N}\log\hat{P}_{n,d_n}\]

This is exactly the maximum simulated likelihood (MSL) estimator from the simulation chapter. Recall the key subtlety: because \(\log\) is concave, \(\mathbb{E}[\log\hat{P}]\leq\log P\), so the simulated log-likelihood is biased downward for any fixed \(R\). Consistency requires \(R\rightarrow\infty\), and for the simulation bias to be asymptotically negligible (so that MSL is equivalent to maximum likelihood), \(R\) must grow faster than \(\sqrt{N}\).

An important practical advantage of the MMNL is that the integrand \(L_{nj}(\boldsymbol{\beta})\) is a smooth function of \(\boldsymbol{\beta}\), which means the simulated probability \(\hat{P}_{nj}\) is a smooth function of the parameters \(\boldsymbol{\theta}\). This is in contrast to the non-differentiability problems that arise when simulating models with discrete outcomes (as discussed in the simulation chapter). The MMNL is therefore well-suited to gradient-based optimization.

Example 4 Let’s simulate data from a mixed logit (with normally distributed random coefficients) and estimate the parameters by simulated maximum likelihood.

Random.seed!(42)

# True parameters
N = 1000     # individuals
T_obs = 1    # choice occasions per individual
J = 3
β_mean_true = [1.0, -0.5]
β_sd_true = [0.5, 0.3]      # standard deviations of random coefficients

# Each individual has T choice occasions with different X but same β_n
X = randn(N, T_obs, J, 2)
β_ind = hcat([β_mean_true .+ β_sd_true .* randn(2) for _ in 1:N]...)'

choices = zeros(Int, N, T_obs)
for n in 1:N, t in 1:T_obs
    V = [sum(X[n, t, j, :] .* β_ind[n, :]) for j in 1:J]
    ε = -log.(-log.(rand(J)))
    choices[n, t] = argmax(V .+ ε)
end

We pre-draw and fix the simulation draws outside the likelihood function. This ensures that the simulated likelihood is a smooth, deterministic function of \(\boldsymbol{\theta}\) — there is no simulation noise across evaluations — and allows gradient-based optimization.

R = 200
sim_draws = randn(N, R, 2)   # fixed draws: N individuals × R simulations × K coefficients

function sim_log_likelihood_panel(θ, X, choices, draws)
    N, T, J, K = size(X)
    b = θ[1:K]
    s = exp.(θ[K+1:2K])  # ensure positive std devs

    ll = zero(eltype(θ))
    R = size(draws, 2)
    for n in 1:N
        P_sim = zero(eltype(θ))
        for r in 1:R
            β_r = b .+ s .* @view(draws[n, r, :])
            logP = zero(eltype(θ))
            for t in 1:T
                V = [sum(X[n, t, j, :] .* β_r) for j in 1:J]
                maxV = maximum(V)
                logP += V[choices[n, t]] - maxV - log(sum(exp.(V .- maxV)))
            end
            P_sim += exp(logP)
        end
        ll += log(P_sim / R)
    end
    return ll
end

# First, get good starting values from a standard (homogeneous) MNL
function mnl_ll_panel(b, X, choices)
    N, T, J, K = size(X)
    ll = zero(eltype(b))
    for n in 1:N, t in 1:T
        V = [sum(X[n, t, j, :] .* b) for j in 1:J]
        maxV = maximum(V)
        ll += V[choices[n, t]] - maxV - log(sum(exp.(V .- maxV)))
    end
    return ll
end
res_mnl = optimize(b -> -mnl_ll_panel(b, X, choices), zeros(2), Newton(), autodiff=:forward)
b_start = res_mnl.minimizer

# Initialize: means at MNL estimates, std devs at a small positive value
θ0 = [b_start; log(0.1); log(0.1)]

res = optimize(θ -> -sim_log_likelihood_panel(θ, X, choices, sim_draws),
               θ0, LBFGS(), autodiff=:forward,
               Optim.Options(iterations=500))
θ_hat_mmnl = res.minimizer

println("True vs Estimated Parameters:")
println("  β_mean: true = $(β_mean_true)  est = $(round.(θ_hat_mmnl[1:2], digits=3))")
println("  β_sd:   true = $(β_sd_true)   est = $(round.(exp.(θ_hat_mmnl[3:4]), digits=3))")
True vs Estimated Parameters:
  β_mean: true = [1.0, -0.5]  est = [0.901, -0.46]
  β_sd:   true = [0.5, 0.3]   est = [0.472, 0.643]

Several details matter here. First, the simulation draws are pre-drawn and held fixed, making the objective a smooth deterministic function of the parameters. Second, we use L-BFGS with automatic differentiation rather than Nelder-Mead: because the logit probability is a smooth function of \(\boldsymbol{\beta}\) and hence of \(\boldsymbol{\theta}\), the simulated likelihood is differentiable and gradient-based methods are both faster and more reliable. Third, we initialize the means at the MNL estimates and the standard deviations at a small positive value, starting from the homogeneous model and “growing” the heterogeneity. Finally, with panel data (several choice occasions per individual), the key identifying variation comes from observing the same individual make different choices across occasions: the pattern of choices reveals each individual’s coefficients, and the distribution of these patterns across individuals identifies the mixing distribution. With a single occasion per individual, as in the simulation above (T_obs = 1), the standard deviations are identified only from the shape of the cross-sectional choice probabilities and are estimated far less precisely, which is why the second standard deviation is recovered so poorly here.

Discussion: Choosing the Mixing Distribution

The choice of mixing distribution \(f(\boldsymbol{\beta}|\boldsymbol{\theta})\) involves important trade-offs:

  1. Continuous vs. discrete mixing. Continuous distributions (normal, log-normal) are parsimonious and can capture smooth heterogeneity. Discrete distributions (latent class) are nonparametric in the sense that they do not impose a functional form on the heterogeneity distribution, but they require choosing \(K\).
  2. Support restrictions. If economic theory implies that a coefficient must have a particular sign (e.g. negative price coefficient), a log-normal distribution enforces this while a normal distribution does not.
  3. Flexibility vs. parsimony. A full covariance matrix \(\mathbf{\Sigma}\) for normally distributed coefficients has \(K(K+1)/2\) parameters, which can be large. In practice, many applications assume diagonal \(\mathbf{\Sigma}\) (independent random coefficients) and only allow a few key coefficients to be random.

The McFadden and Train (2000) approximation theorem guarantees that some mixing distribution works. It does not tell us how to find it. In practice, the choice is guided by a combination of theory, diagnostic checks, and computational tractability. See Train (2009) for an extensive practical discussion.

Summary

This chapter introduced the foundational framework for discrete choice econometrics:

  1. The Random Utility Model provides the behavioral foundation: individuals maximize utility, and randomness in choices arises from the econometrician’s incomplete information.
  2. The multinomial probit allows flexible substitution patterns through correlated normal errors, with estimation via the GHK simulator.
  3. The GEV family generates closed-form, RUM-consistent choice probabilities. The multinomial logit (iid errors, IIA) is the simplest case; the nested logit, cross-nested logit, and paired combinatorial logit allow richer substitution patterns through structured error correlations.
  4. The Williams-Daly-Zachary theorem connects choice probabilities to the expected maximum utility — the discrete choice analogue of Roy’s identity — providing a foundation for welfare analysis.
  5. The Mixed Multinomial Logit allows arbitrary substitution patterns by introducing random coefficients. The McFadden and Train (2000) theorem shows that it can approximate any RUM. Estimation requires simulation methods from the previous chapter.

In the next chapter, we extend these ideas to dynamic settings where individuals make sequences of discrete choices over time, introducing the additional challenge of modeling expectations about the future.