-
Notifications
You must be signed in to change notification settings - Fork 28
/
Copy pathesl_gumbel.tex
615 lines (505 loc) · 23.6 KB
/
esl_gumbel.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
Gumbel statistics are often used to estimate the statistical
significance of local alignment scores.
The Gumbel distribution is the so-called Type I extreme value
distribution (EVD). It occurs so frequently in sequence analysis
applications, compared to the type II (Fr\'{e}chet) and type III
(Weibull) extreme value distributions, that ``Gumbel'' and ``EVD'' are
often used interchangeably in bioinformatics. Easel has a separate
module, the \eslmod{gev} module, that implements the generalized
extreme value distribution.
Karlin/Altschul statistics are a special case of the Gumbel
distribution that apply to the scores of ungapped local alignments
between infinitely long random sequences. Empirically, Karlin/Altschul
statistics also apply reasonably well to the more useful case of
gapped alignment of finite-length sequences. Karlin/Altschul
statistics predict how the Gumbel's two parameters depend on the
length of the query and target sequences. In the case of ungapped
alignments, Karlin/Altschul statistics allow the Gumbel parameters to
be estimated directly, without the need for a compute-intensive
simulation.
\subsection{The gumbel API}
The \eslmod{gumbel} API consists of the following functions:
\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
\multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gumbel\_pdf()} & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gumbel\_logpdf()} & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gumbel\_cdf()} & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gumbel\_logcdf()} & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gumbel\_surv()} & Returns right tail mass, 1-cdf, $P(S > x)$\\
\ccode{esl\_gumbel\_logsurv()} & Returns log of 1-cdf, $\log P(S > x)$.\\
\multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gumbel\_Sample()} & Returns a Gumbel-distributed random sample.\\
\multicolumn{2}{c}{\textbf{maximum a posteriori parameter fitting:}}\\
\ccode{esl\_gumbel\_FitComplete()} & Estimates $\mu,\lambda$ from complete data.\\
\ccode{esl\_gumbel\_FitCompleteLoc()} & Estimates $\mu$ when $\lambda$ is known.\\
\ccode{esl\_gumbel\_FitCensored()} & Estimates $\mu,\lambda$ from censored data.\\
\ccode{esl\_gumbel\_FitCensoredLoc()} & Estimates $\mu$ when $\lambda$ is known.\\
\ccode{esl\_gumbel\_FitTruncated()}& Estimates $\mu,\lambda$ from truncated data.\\\hline
\end{tabular}
\end{center}
\vspace{0.5em}
The Gumbel distribution depends on two parameters, $\mu$ and
$\lambda$. When $\mu$ and $\lambda$ are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gumbel\_surv()}. The E-value for obtaining that
score or better in searching a database of $N$ sequences is just
$NP(S>x)$.
When $\mu$ and $\lambda$ are unknown, they are estimated from scores
obtained from comparisons of simulated random data. (Analytical
solutions for $\mu$ and $\lambda$ are only available in the case of
ungapped sequence alignments.) The \ccode{esl\_evd\_Fit*()} functions
provide maximum likelihood parameter fitting routines for different
types of data.
\subsection{Example of using the gumbel API}
An example that samples 10,000 data points from a Gumbel distribution
with $\mu=-20$, $\lambda=0.4$; reports the min and max samples, and
the probability mass to the left of the min and to the right of the
max (both of which should be about $\frac{1}{10000}$, since we took
10,000 samples); and then fits those simulated data to a Gumbel and
reports the fitted $\mu$ and $\lambda$:
\input{cexcerpts/gumbel_example}
\subsection{Gumbel densities}
The probability density function (pdf) and the cumulative distribution
function (cdf) of the extreme value distribution are:
\begin{equation}
P(x) = \lambda \exp \left[ -\lambda (x - \mu) - e^{- \lambda (x - \mu)} \right]
\label{eqn:gumbel_density}
\end{equation}
\begin{equation}
P(S < x) = \exp \left[ -e^{-\lambda(x - \mu)} \right]
\label{eqn:gumbel_distribution}
\end{equation}
The extreme value density and distribution functions for $\mu = 0$ and
$\lambda = 1.0$ are shown below.
\begin{center}
\includegraphics[width=3in]{figures/evd_basic}
\end{center}
The $\mu$ and $\lambda$ parameters are {\em location} and {\em scale}
parameters, respectively:
\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/evd_location}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/evd_scale}
\end{minipage}
}
For more details, a classic reference is \citep{Lawless82}. Gumbel
distributions can have their long tail to the right or to the
left. The form given here is for the long tail to the right. This is
the form that arises when the extreme value is a maximum, such as when
our score is the maximum over the individual scores of all possible
alignments. The equations in \citep{Lawless82} are for extremal
minima; use $(x - u) = -(x - \mu)$ and $b = 1 / \lambda$ to convert
Lawless' notation to the notation used here.
\subsection{Fitting Gumbel distributions to observed data}
Given a set of $n$ observed samples $\mathbf{x}$, we may want to
estimate the $\mu$ and $\lambda$ parameters.
One might try to use linear regression to fit to a $\log \log$
transformation of the $P(S < x)$ histogram, which gives a straight
line with slope $-\lambda$ and $x$ intercept $\mu$:
\begin{equation}
\log \left[ -\log P(S<x) \right] = -\lambda x + \lambda \mu
\end{equation}
However, the linear regression method is undesirable because it is
sensitive to outliers. The following table shows the \% error for
estimating $\hat{\mu}$ and $\hat{\lambda}$ from 500 simulated complete
datasets, sampled from a Gumbel with $\mu = -20.0$ and $\lambda =
0.4$, for four different dataset sizes:
\begin{center}
\begin{tabular}{lrrrr} \hline
& \multicolumn{4}{c}{\# of samples}\\
& 100 & 1000 & 10,000 & 100,000 \\
\% error in $\hat{\mu}$ & 2\%& 1\% & 0.9\% & 0.9\% \\
max error in $\hat{\mu}$ & 24\%& 13\% & 10\% & 10\% \\
\% error in $\hat{\lambda}$ & 12\%& 7\% & 5\% & 3\% \\
max error in $\hat{\lambda}$ & 49\%& 33\% & 25\% & 20\% \\ \hline
\end{tabular}
\end{center}
A better rough estimate of $\hat{\mu}$ and $\hat{\lambda}$ can be
obtained from the sample mean $m$ and variance $s^2$ of the observed
data \citep{Evans00}:\footnote{All simulation data are generated by
the \eslmod{evd} module's stats driver. The only exception is the
linear regression fit data, which come from an old version of HMMER.}
\begin{eqnarray*}
\hat{\lambda} & = & \frac{\pi}{\sqrt{6s^2}}\\
\hat{\mu} & = & m - \frac{0.57722}{\hat{\lambda}}
\end{eqnarray*}
The mean/variance method is more accurate than linear regression, as
shown by the following simulation results:
\begin{center}
\begin{tabular}{lrrrr} \hline
& \multicolumn{4}{c}{\# of samples}\\
& 100 & 1000 & 10,000 & 100,000 \\
\% error in $\hat{\mu}$ & 1\%& 0.3\% & 0.1\% & 0.03\% \\
max error in $\hat{\mu}$ & 5\%& 1\% & 0.4\% & 0.1\% \\
\% error in $\hat{\lambda}$ & 9\%& 3\% & 0.8\% & 0.3\% \\
max error in $\hat{\lambda}$ & 40\%& 12\% & 3\% & 0.9\% \\ \hline
\end{tabular}
\end{center}
Still, the mean/variance method is not as accurate as a maximum
likelihood estimation (especially for $\lambda$). Also, it requires
complete data, whereas we also need to solve problems where we fit to
\emph{truncated} or \emph{censored} data. Easel's main estimation
methods are therefore maximum likelihood methods.
\subsubsection{Maximum likelihood estimation, complete data}
Given $n$ samples $x_1..x_n$ from some distribution that depends on
parameters $\theta$, we want to estimate maximum likelihood parameter
estimates $\hat{\theta}$ that maximize the log likelihood:
\[
\hat{\theta} = \argmax_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)
\]
These are also \emph{maximum a posteriori} parameter estimates, if we
assume a uniform prior $P(\theta)$.
Specifically, for samples $x_i$ drawn from an extreme value
distribution, the log likelihood to optimize is:
\begin{equation}
\log L(\lambda, \mu) = n \log \lambda - \sum_{i=1}^{n} \lambda(x_i -
\mu) - \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
\label{eqn:gumbel_logL}
\end{equation}
This objective function is differentiable with respect to $\mu$ and
$\lambda$:
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda - \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}\\%
\\%
\label{eqn:mupartial}
\frac{\partial \log L}{\partial \lambda} & = &
\frac{n}{\lambda} - \sum_{i=1}^{n} (x_i - \mu) +
\sum_{i=1}^{n} (x_i - \mu) e^{-\lambda (x_i - \mu)}
\label{eqn:lambdapartial}
\end{eqnarray}
The maximum likelihood estimates $\hat{\lambda}$ and $\hat{\mu}$ are
the solutions to $\frac{\partial \log L}{\partial \mu} = 0$ and
$\frac{\partial \log L}{\partial \lambda} = 0$. Lawless
\citep{Lawless82} gives a useful trick here that lets us solve both of
these simultaneously. When (\ref{eqn:mupartial}) is set to zero, it
can be used to get $\hat{\mu}$ in terms of $\hat{\lambda}$:
\begin{eqnarray}
e^{-\hat{\lambda} \hat{\mu}} & = & \frac{1}{n} \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}
\label{eqn:substitute}\\
\hat{\mu} & = & - \frac{1}{\hat{\lambda}}
\log \left[ \frac{1}{n} \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right]
\label{eqn:solvemu}
\end{eqnarray}
Substituting (\ref{eqn:substitute}) into (\ref{eqn:lambdapartial}),
gives us an equation for solving $\hat{\lambda}$ in terms of the
$x_i$'s:
\begin{eqnarray}
\frac{1}{\hat{\lambda}} - \frac{1}{n} \sum_{i=1}^{n} x_i +
\frac{\sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}}
{\sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
& = & 0
\label{eqn:newtontarget}
\end{eqnarray}
This is our target function. We could solve it readily enough (by
bisection search, for example) and obtain $\hat{\lambda}$. We can
solve it even faster using the Newton/Raphson algorithm, because it is
differentiable with respect to lambda:
\begin{equation}
\frac{d}{d\hat{\lambda}} =
\frac{\left( \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i} \right)^2 }
{\left( \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right)^2 }
-
\frac{\sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda} x_i}}
{\sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
-
\frac{1}{\hat{\lambda}^2}
\label{eqn:newtonderivative}
\end{equation}
Now, the key equations are (\ref{eqn:solvemu}),
(\ref{eqn:newtontarget}), and (\ref{eqn:newtonderivative}). In
summary, the inference procedure is the following:
\begin{itemize}
\item Guess an initial $\hat{\lambda}$ (using the mean/variance
method, for example, but any reasonable guess works).
\item Use Newton/Raphson iterations to find the $\hat{\lambda}$ that satisfies
(\ref{eqn:newtontarget}):
\begin{itemize}
\item calculate the target function $f$ and
its first derivative $f'$ at $\hat{\lambda}$, using
(\ref{eqn:newtontarget}) to calculate $f$ and
(\ref{eqn:newtonderivative}) to calculate $f'$.
\item If $f$ is within some absolute tolerance of zero
(e.g., $10^{-6}$), stop; we have found $\hat{\lambda}$.
\item Else, estimate a new $\hat{\lambda} = \hat{\lambda} - \frac{f}{f'}$,
and do another iteration.
\end{itemize}
\item Plug $\hat{\lambda}$ into (\ref{eqn:solvemu}) to get $\hat{\mu}$.
\end{itemize}
This algorithm is implemented in \ccode{esl\_evd\_FitComplete()}. An
auxiliary function, \ccode{lawless416()}, calculates the target
function and its derivative (equations (\ref{eqn:newtontarget}) and
(\ref{eqn:newtonderivative})) given the current estimate of
$\hat{\lambda}$. The name comes from Lawless' equation 4.1.6, the
target function \citep{Lawless82}.
The accuracy of fitting to simulated data (generated with $\mu=-20$
and $\lambda=0.4$), collated over 500 simulations, is shown in the
following table:
\begin{center}
\begin{tabular}{lrrrr} \hline
& \multicolumn{4}{c}{\# of samples}\\
& 100 & 1000 & 10,000 & 100,000 \\
\% error in $\hat{\mu}$ & 1\%& 0.3\% & 0.1\% & 0.03\% \\
max error in $\hat{\mu}$ & 4\%& 2\% & 0.5\% & 0.1\% \\
\% error in $\hat{\lambda}$ & 6\%& 2\% & 0.6\% & 0.2\% \\
max error in $\hat{\lambda}$ & 36\%& 9\% & 2\% & 0.8\% \\ \hline
\end{tabular}
\end{center}
This is in accord with theoretical expectation. The distribution of
$\frac{\lambda}{\hat{\lambda}}$ is approximately normal with mean 1 and
standard error $\frac{0.78}{\sqrt{N}}$ \citep{Lawless82,Altschul01}.
% Altschul says \frac{\hat{\lambda}}{\lambda}, actually, but I believe
% that's wrong. xref J1/46.
\subsubsection{Maximum likelihood fitting to censored data}
A \emph{censored} data problem is when we have $N$ samples, but we
only observe the values of a subset of $n$ samples $x_1..x_n$ that are
greater or equal to some cutoff $\phi$. The remaining $z = N-n$
samples are \emph{censored}, and for these we only know that $x <
\phi$. $x_i..x_n$, $n$, $\phi$, and $z$ are all known in a censored
data problem.
To estimate maximum likelihood parameters $\hat{\theta}$ for some
distribution from censored data \citep{Gelman95}, the log likelihood
to maximize is:
\[
\hat{\theta} = \argmax_{\theta} z \log P(x<\phi \mid \theta)
+ \sum_{i=1}^n \log P(x_i \mid \theta)
\]
Specifically, when fitting a Gumbel distribution, the log likelihood
to optimize is:
\begin{equation}
\log L(\lambda, \mu) =
n \log \lambda
- z e^{-\lambda(\phi - \mu)}
- \sum_{i=1}^{n} \lambda(x_i - \mu)
- \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
\label{eqn:censor_logL}
\end{equation}
To optimize this, we follow a similar procedure as used for complete
data \citep{Lawless82}. The log likelihood is differentiable with
respect to $\lambda$ and $\mu$:
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda
- z \lambda e^{-\lambda (\phi - \mu)}
- \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
\label{eqn:censor_dmu}
\\%
\frac{\partial \log L}{\partial \lambda} & = &
\frac{n}{\lambda}
+ z (\phi - \mu) e^{-\lambda (\phi - \mu)}
- \sum_{i=1}^{n} (x_i - \mu)
+ \sum_{i=1}^{n} (x_i - \mu) e^{-\lambda (x_i - \mu)}
\label{eqn:censor_dlambda}
\end{eqnarray}
Setting (\ref{eqn:censor_dmu}) to zero and solving for $\hat{\mu}$ in
terms of $\hat{\lambda}$ gives:
\begin{equation}
\hat{\mu} = - \frac{1}{\hat{\lambda}}
\log \left[ \frac{1}{n}
\left( z e^{-\hat{\lambda} \phi}
+ \sum_{i=1}^{n} e^{-\hat{\lambda} x_i} \right)
\right]
\label{eqn:censor_solvemu}
\end{equation}
Substituting (\ref{eqn:censor_solvemu}) into
(\ref{eqn:censor_dlambda}) gives the target equation:
\begin{equation}
\frac{1}{\hat{\lambda}}
- \frac{1}{n} \sum_{i=1}^{n} x_i +
\frac{z \phi e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}}
{z e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
= 0
\label{eqn:censor_newtontarget}
\end{equation}
To use Newton-Raphson root finding (instead of a slower bisection
search) we also need the first derivative of this target equation with
respect to $\lambda$:
\begin{equation}
\frac{d}{d\hat{\lambda}} =
\frac{\left(
z \phi e^{-\hat{\lambda} \phi}
+ \sum_{i=1}^{n} x_i e^{-\hat{\lambda} x_i}
\right)^2 }
{\left(
z e^{-\hat{\lambda} \phi}
+ \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}
\right)^2 }
-
\frac{z \phi^2 e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} x_i^2 e^{-\hat{\lambda} x_i}}
{z e^{-\hat{\lambda} \phi} + \sum_{i=1}^{n} e^{-\hat{\lambda} x_i}}
-
\frac{1}{\hat{\lambda}^2}
\label{eqn:censor_newtonderiv}
\end{equation}
In summary: given $n$ observed samples $x_1..x_n$ from a total sample
of $N$ samples, $z = N-n$ of which were censored because they have
values $< \phi$, we solve for maximum likelihood estimates
$\hat{\lambda}$ and $\hat{\mu}$ using the same procedure we used for
complete data, by using equations (\ref{eqn:censor_solvemu}),
(\ref{eqn:censor_newtontarget}), and (\ref{eqn:censor_newtonderiv}) in
place of equations (\ref{eqn:solvemu}), (\ref{eqn:newtontarget}), and
(\ref{eqn:newtonderivative}). Easel implements this procedure in
\ccode{esl\_evd\_FitCensored()}. The target function
(\ref{eqn:censor_newtontarget}) and its derivative
(\ref{eqn:censor_newtonderiv}) are implemented in the auxiliary
function \ccode{lawless422()} \citep{Lawless82}.
Results on 500 simulated datasets with $\mu = -20, \lambda = 0.4$,
censored at $\phi = -20$ -- the expected peak of the histogram; that
is, a censored fit only to the right tail, which contains about 63\%
of the samples:
\begin{center}
\begin{tabular}{lrrrr} \hline
& \multicolumn{4}{c}{\# samples in EVD histogram}\\
& 100 & 1000 & 10,000 & 100,000 \\
\% error in $\mu$ & 1\%& 0.4\% & 0.1\% & 0.04\% \\
max error in $\mu$ & 5\%& 2\% & 0.5\% & 0.2\% \\
\% error in $\lambda$ & 9\%& 3\% & 0.9\% & 0.3\% \\
max error in $\lambda$ & 33\%& 11\% & 3\% & 1\% \\ \hline
\end{tabular}
\end{center}
\subsubsection{Maximum likelihood fitting to truncated data}
A \emph{truncated} dataset is when we only observe $n$ samples $x_i$,
and an \emph{unknown} number $z$ of samples less than some threshold
$\phi$ were unobserved. Thus, only the right tail of $n$ samples $x_i
\geq \phi$ as observed. In a truncated dataset, $x_1..x_n$, $n$, and
$\phi$ are known, but $z$ is unknown.
Solving a truncated data problem motivates a Bayesian approach,
because we need to integrate out (marginalize) the nuisance $z$
parameter, and to do this, we have to specify a prior distribution for
$P(z)$. Gelman \emph{et al.} describe a general Bayesian framework for
thinking about various types of missing data problems, including
censored and truncated data \citep{Gelman95}.
In short, to obtain maximum likelihood parameters $\hat{\theta}$ for
some distribution, given truncated data, the log likelihood we wish to
maximize is:
\begin{equation}
\hat{\theta} = \argmax_\theta -n \log P(x \geq \phi \mid \theta)
+ \sum_{i=1}^n \log P(x_i \mid \theta).
\label{eqn:truncated_objective}
\end{equation}
\textbf{Detour: derivation of the truncated data likelihood}
The derivation of the above equation may not be immediately obvious.
The presence of the $n P(x \geq \phi \mid \theta)$ term may be
counterintuitive, as opposed to the more intuitive $z P(x < \phi \mid
\theta)$ term that accounts for the missing data in a censored data
problem. Gelman \emph{et al.} actually don't even show the equation in
their book; I obtained it from an exercise solution on their web site.
To convince you (and to remind me) of its correctness, a sketch of the
derivation follows.
We start with the same likelihood equation that arises in censored
data for a \emph{known} total number of samples $N$ (where $N=n+z$),
but since $N$ is unknown, we need to integrate over all possible $N$
from $n$ to $\infty$:
\begin{eqnarray*}
P(\mathbf{x} \mid \theta, \phi) & = &
\sum_{N=n}^{\infty} P(\mathbf{x} \mid \theta, \phi, N) P(N)\\
& = &
\prod_{i=1}^n P(x_i \mid \theta)
\left[
\sum_{N=n}^\infty {N \choose n} P(x < \phi \mid \theta)^{N-n} P(N)
\right]\\
\end{eqnarray*}
The $\prod_{i=1}^n P(x_i \mid \theta)$ is straightforward; that sum is
our problem. The trick is to rearrange it so we can treat it as a
convergent negative binomial series:
\[
(1-p)^{-a} = 1 + ap + \frac{a(a+1)}{2!} p^2 +
\frac{a(a+1)(a+2)}{3!} p^3...
\]
To get the sum into the form of this series, Gelman \emph{et al.}
suggest using an informative prior $P(N) = \frac{1}{N}$, an apparently
unmotivated choice that happens to make the sum collapse nicely:
\begin{eqnarray*}
&=& P(N=n)
+ (n+1) P(x < \phi \mid \theta) P(N=n+1)
+ \frac{(n+1)(n+2)}{2!} P(x < \phi \mid \theta)^2 P(N=n+2) ...\\
&= & \frac{1}{n} \left[
1
+ n P(x < \phi \mid \theta)
+ \frac{n(n+1)}{2!} P(x < \phi \mid \theta)^2
+ \frac{n(n+1)(n+2)}{3!} P(x < \phi \mid \theta)^3 \right]\\
&=& \frac{1}{n} (1 - P(x < \phi \mid \theta))^{-n}\\
&=& \frac{1}{n} P(x \geq \phi \mid \theta)^{-n}\\
\end{eqnarray*}
The $\frac{1}{n}$ is a constant, so we drop it from the likelihood
equation we'll maximize. Putting this term back together with the
probability of the observed data and taking the log, we obtain the log
likelihood in equation (\ref{eqn:truncated_objective}).
Alternatively, we might choose an uninformative improper uniform prior
$P(N) \propto 1$. This gives a log likelihood that only differs by a
term of $n+1$ versus $n$:
\begin{equation}
\hat{\theta} = \argmax_\theta -(n+1) \log P(x \geq \phi \mid \theta)
+ \sum_{i=1}^n \log P(x_i \mid \theta).
\end{equation}
However, empirically, this form is ineffective, at least for fitting
Gumbels. The $\frac{1}{N}$ prior performs much better, probably
because it constrains the solutions to favor smaller, finite, more
reasonable choices of $N$.
\textbf{Back to fitting a truncated Gumbel}
For the specific case of fitting a truncated Gumbel, the log
likelihood (\ref{eqn:truncated_objective}) to optimize is:
\[
\log L(\lambda, \mu) =
n \log \lambda
- \sum_{i=1}^{n} \lambda(x_i - \mu)
- \sum_{i=1}^{n} e^{-\lambda(x_i - \mu)}
- n \log (1 - \exp(-e^{-\lambda(\phi - \mu)}))
\label{eqn:truncated_logL}
\]
This is differentiable with respect to $\lambda$ and $\mu$, but it's
not going to reduce to the clean root-finding problem that we obtained
for complete data or censored data. Instead we're going to be left
with a numerical optimization problem. We can use standard numerical
optimization code, such as steepest descent or conjugate gradient
descent. There's just one hitch. These algorithms assume unconstrained
parameters, but $\lambda$ is constrained to values $>0$. We do a
change of variables, and use the transformation $\lambda = e^w$ so we
can optimize the unconstrained parameter $w = \log \lambda$ instead of
optimizing $\lambda$ directly. The necessary partial derivatives are
then:
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
n \lambda
- \lambda \sum_{i=1}^{n} e^{-\lambda (x_i - \mu)}
- \frac{n \lambda \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]}
{1 - \exp(-e^{-\lambda(\phi - \mu)})}
\label{eqn:truncated_dmu}
\\%
\frac{\partial \log L}{\partial w} & = &
n
- \sum_{i=1}^{n} \lambda(x_i - \mu)
+ \sum_{i=1}^{n} \lambda(x_i - \mu) e^{-\lambda (x_i - \mu)}
+ \frac{n\lambda (\phi-\mu) \exp \left[ -\lambda (\phi - \mu) - e^{- \lambda (\phi - \mu)} \right]}
{1 - \exp(-e^{-\lambda(\phi - \mu)})}
\label{eqn:truncated_dw}
\end{eqnarray}
This optimization is carried out by \ccode{esl\_evd\_FitTruncated()}.
The likelihood (\ref{eqn:truncated_logL}) is implemented in
\ccode{tevd\_func()}, and the derivatives (\ref{eqn:truncated_dmu}) and
(\ref{eqn:truncated_dw}) are implemented in \ccode{tevd\_dfunc()}.
\ccode{esl\_evd\_FitTruncated()} simply sets up the problem and passes
it all off to a conjugate gradient descent optimizer.
Results on 500 simulated datasets with $\mu = -20, \lambda = 0.4$,
truncated at $\phi = -20$ (leaving the right tail, containing about
63\% of the samples):
\begin{center}
\begin{tabular}{lrrrr} \hline
& \multicolumn{4}{c}{\# samples}\\
& 100 & 1000 & 10,000 & 100,000 \\
\% error in $\hat{\mu}$ & 13\%& 2\% & 0.8\% & 0.3\% \\
max error in $\hat{\mu}$ &260\%& 42\% & 3\% & 1\% \\
\% error in $\hat{\lambda}$ & 15\%& 5\% & 2\% & 0.6\% \\
max error in $\hat{\lambda}$ & 68\%& 18\% & 6\% & 2\% \\ \hline
\end{tabular}
\end{center}
Fitting truncated Gumbel distributions is difficult, requiring much
more data than fitting complete or censored data. The problem is that
the right tail becomes a scale-free exponential when $\phi >> \mu$,
and $\mu$ becomes undetermined. Fits become very inaccurate as $\phi$
gets larger than $\mu$, and for sufficiently large $\phi$, the
numerical optimizer will completely fail.