-
Notifications
You must be signed in to change notification settings - Fork 28
/
Copy pathesl_gev.tex
197 lines (160 loc) · 7.84 KB
/
esl_gev.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
The generalized extreme value distribution (GEV) includes all three
types of extreme value distributions: Type I (Gumbel), type II
(Fr\'{e}chet), and type III (Weibull). Empirically, the scores of some
sequence alignment algorithms appear to follow GEV distributions. The
\eslmod{gev} module is used in estimating the statistical significance
of such scores.
Most local sequence alignment scores follow the Gumbel distribution.
Easel's \eslmod{gumbel} module applies specifically to the Gumbel. The
\eslmod{gev} module is used for Type II or III extreme value
distributions, or for determining which of the three types of
distribution that a dataset best fits.
\subsection{The gev API}
The \eslmod{gev} API consists of the following functions:
\vspace{0.5em}
\begin{center}
\begin{tabular}{ll}\hline
\multicolumn{2}{c}{\textbf{evaluating densities and distributions:}}\\
\ccode{esl\_gev\_pdf()} & Returns the probability density, $P(S=x)$.\\
\ccode{esl\_gev\_logpdf()} & Returns the log of the pdf, $\log P(S=x)$.\\
\ccode{esl\_gev\_cdf()} & Returns the cumulative probability distribution, $P(S \leq x)$.\\
\ccode{esl\_gev\_logcdf()} & Returns the log of the cdf, $\log P(S \leq x)$.\\
\ccode{esl\_gev\_surv()} & Returns right tail mass, 1-cdf, $P(S > x)$\\
\ccode{esl\_gev\_logsurv()} & Returns log of 1-cdf, $\log P(S > x)$.\\
\multicolumn{2}{c}{\textbf{sampling:}}\\
\ccode{esl\_gev\_Sample()} & Returns a GEV-distributed random sample.\\
\multicolumn{2}{c}{\textbf{maximum likelihood parameter fitting:}}\\
\ccode{esl\_gev\_FitComplete()} & Estimates GEV parameters from complete data.\\
\end{tabular}
\end{center}
\vspace{0.5em}
The Gumbel distribution depends on three parameters, $\mu$, $\lambda$,
and $\alpha$. When these parameters are known, the statistical
significance (P-value) of a single score $x$ is $P(S>x)$, obtained by
a call to \ccode{esl\_gev\_surv()}. The E-value for obtaining that
score or better in searching a database of $N$ sequences is just
$NP(S>x)$.
When the parameters are unknown, they can be estimated from scores
obtained from comparisons of simulated random data. The
\ccode{esl\_gev\_FitComplete()} function performs maximum likelihood
parameter fitting \citep{Coles01}.
\subsection{Example of using the gev API}
Below is a code example that samples 10,000 data points from a
Fr\'{e}chet distribution with $\mu=-20$, $\lambda=0.4$, $\alpha=0.1$;
reports the min and max samples, and the probability mass to the left
of the min and to the right of the max (both of which should be about
$\frac{1}{10000}$, since we took 10,000 samples); and then fits those
simulated data to a Gumbel and reports the fitted $\mu$ and $\lambda$:
\input{cexcerpts/gev_example}
\subsection{GEV densities}
The probability density function (pdf) and the cumulative distribution
function (cdf) of the generalized extreme value distribution are
\citep{Coles01}:
\begin{eqnarray}
P(X=x) & = & \lambda \left[ 1 + \alpha \lambda (x - \mu) \right]^{-\frac{\alpha+1}{\alpha}}
\exp \left\{ - \left[ 1 + \alpha \lambda (x - \mu)
\right]^{-\frac{1}{\alpha}} \right\}
\\%
\label{eqn:gev_density}
P(X \geq x) & = & \exp \left\{ - \left[ 1 +
\alpha\lambda(x-\mu) \right]^{-\frac{1}{\alpha}} \right\}
\\%
\label{eqn:gev_distribution}
\end{eqnarray}
The parameters $\mu$, $\lambda$, and $\alpha$ are location, scale, and
shape parameters, respectively, with $-\infty < \mu < \infty$, $0 <
\lambda < \infty$, and $-\infty < \alpha < \infty$.
The Type II (Fr\'{e}chet) distribution corresponds to $\alpha > 0$,
and the Type III (Weibull) distribution corresponds to $\alpha < 0$.
The Type I (Gumbel) distribution arises in the limit $\alpha
\rightarrow 0$. At values $\alpha \simeq 0$, Easel's GEV functions
revert to the Gumbel limit case, as opposed to dividing by zero and
failing.
Technically the GEV is only defined for values of $x$ such that $1 +
\alpha \lambda (x - \mu) > 0$. However, Easel's functions return
sensible values outside this domain, such as 0 for nonexistent
densities.
Generalized extreme value densities for $\mu = 0$ and $\lambda = 1$
are shown below (left) for three settings of $\alpha$; $\alpha = 0$
(Gumbel), $\alpha = 0.1$ (Fr\'{e}chet), and $\alpha = -0.1$
(Weibull). The figure on the right shows the log densities, which more
clearly show how, relative to the exponential right tail of the
Gumbel, the Fr\'{e}chet's tail is longer, and the Weibull's tail is
shorter.
\centerline{
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_density}
\end{minipage}
\begin{minipage}{3in}
\includegraphics[width=2.8in]{figures/gev_logdensity}
\end{minipage}
}
For more details, see the excellent description in \citep{Coles01}.
Easel's $\{ \mu, \lambda, \alpha \}$ notation differs from the $\{
\mu, \sigma, \xi \}$ parameterization used by Coles. Use $\lambda =
\frac{1}{\sigma}$ and $\alpha = \xi$ to translate.
\subsection{Fitting GEV distributions to observed data}
Easel fits GEVs by maximum likelihood estimation by numerically
optimizing the log likelihood function, using first derivative
information and conjugate gradient descent. See the \eslmod{gumbel}
chapter for a more general introduction to maximum likelihood fitting.
\subsubsection{Maximum likelihood estimation, complete data}
The function \ccode{esl\_gev\_FitComplete()} uses gradient information
to find parameters that optimize the likelihood function, using the
conjugate gradient descent code in the \eslmod{minimizer} module.
Given $n$ samples $x_1..x_n$, we want to estimate maximum likelihood
parameter estimates $\{ \hat{\mu}, \hat{\lambda}, \hat{\alpha} \}$
that maximize the log likelihood:
\begin{equation}
\log L(\lambda, \mu, \alpha) = n \log \lambda
- \frac{\alpha+1}{\alpha}
\sum_{i=1}^{n} \log\left[1+ \alpha\lambda(x_i - \mu) \right]
- \sum_{i=1}^{n} \left[ 1 + \alpha\lambda (x_i - \mu) \right]^{\frac{1}{\alpha}}
\label{eqn:gev_logL}
\end{equation}
The $\left[ 1 + \alpha\lambda (x_i - \mu) \right]^{\frac{1}{\alpha}}$
term can be rewritten in a more conveniently differentiable form as
$\exp \left\{ \frac{1}{\alpha} \log \left[ 1 + \alpha\lambda (x_i - \mu)
\right] \right\}$.
Since the $\lambda$ parameter is constrained to $\lambda > 0$ but the
numerical optimizer expects unconstrained parameters, we use a change
of variables $\lambda = e^w$ and optimize an unconstrained value $w$.
The gradient of the log likelihood with respect to $\mu$, $w$, and
$\alpha$ is:
%% xref: STL9/118-120
\begin{eqnarray}
\frac{\partial \log L}{\partial \mu} & = &
\sum_{i=1}^n \frac{\lambda (\alpha+1)}{1+\alpha\lambda(x_i-\mu)}
-\sum_{i=1}^n \lambda \exp
\left\{ -\frac{\alpha+1}{\alpha} \log
\left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\\%
\label{eqn:gev_mupartial}
\frac{\partial \log L}{\partial w} & = &
n - \sum_{i=1}^{n} \frac{\lambda (\alpha+1) (x_i - \mu)}
{1 + \alpha \lambda (x_i - \mu)}
+ \sum_{i=1}^n \lambda (x_i - \mu)
\exp \left\{ -\frac{\alpha+1}{\alpha} \log
\left[1+\alpha\lambda(x_i-\mu)\right] \right\}
\\%
\label{eqn:gev_wpartial}
\frac{\partial \log L}{\partial \alpha} & = &
\sum_{i=1}^n \left\{
\begin{array}{l}
- \frac{\alpha+1}{\alpha} \frac{\lambda(x_i-\mu)}
{1 +\alpha\lambda(x_i-\mu)}\\
+ \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]\\
+ \frac{1}{\alpha} \frac{\lambda(x_i-\mu)}
{1 +\alpha\lambda(x_i-\mu)}
e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu) \right]}\\
- \frac{1}{\alpha^2} \log \left[ 1 + \alpha\lambda(x_i - \mu) \right]
e^{-\frac{1}{\alpha} \log\left[ 1 + \alpha\lambda(x_i - \mu)
\right]}
\end{array}
\right.
\\%
\label{eqn:gev_alphapartial}
\end{eqnarray}
When $|\alpha\lambda(x_i - \mu)|$ approaches $0$, the GEV approximates
a Gumbel distribution and these equations can be simplified using the
approximation $\log(1+a) \simeq a$.