-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathevaluation.tex
138 lines (122 loc) · 10.7 KB
/
evaluation.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
\begin{table}[t]
\centering
%\setlength\tabcolsep{3.5pt} % default value: 6pt
\begin{center}
\begin{tabular}{p{2.2cm}p{0.7cm}ddd}
\toprule
\multicolumn{1}{C{2.2cm}}{\bfseries Benchmark} & \multicolumn{1}{C{0.7cm}}{\bfseries Data Type} & \multicolumn{1}{C{1.1cm}}{\bfseries Serial (ms)} & \multicolumn{1}{C{1.1cm}}{\bfseries \code{rfactor} (ms)} & \multicolumn{1}{C{1cm}}{\bfseries Speed-up} \\
% Benchmark & Serial (ms) & \code{rfactor} (ms) & Speed-up \\
\midrule
Maximum & int32 & 5.54 & 1.22 & 4.5 \\
2D histogram & int32 & 8.80 & 1.71 & 5.1 \\
4D \code{argmin} & int8 & 28.52 & 1.07 & 26.6 \\
Complex & int32 & 28.53 & 2.47 & 11.5 \\
multiplication & & \\
Dot product & float & 25.9 & 2.66 \dagger & 9.7 \\
Kitchen sink & int32 & 30.13 & 1.91 & 15.7 \\
\bottomrule
\end{tabular}
\end{center}
\caption{Benchmark results: serial reductions vs. parallel reductions using \code{rfactor}. $\dagger$ To give the numbers some context, Intel's MKL~\cite{MKL} takes 2.8ms on the dot product task.}
\label{tab:table}
\end{table}
In this section we discuss the speed-ups one can expect by using \code{rfactor} to vectorize and parallelize serial reductions. These numbers are unsurprising -- they are equivalent to the speed-ups one can attain by manually factoring Halide reductions. We benchmark the feature using a suite of reductions of varying complexity\footnote{https://github.com/halide/rfactor\_benchmarks}. Some operations, like large matrix multiplication or convolution, reduce along some axes and are data parallel along others. \code{rfactor} provides little benefit in these cases, as they are already straight-forward to vectorize and parallelize, so we do not include such cases. Our benchmarks are:
\begin{itemize}
\item Maximum: The maximum integer in a list.
\item Dot product: The dot product of two vectors.
\item 4D \code{argmin}: The coordinates of the minimum value in a four-dimensional volume.
\item 2D histogram: A histogram of values present in an 8-bit image.
\item Complex multiplication: The product of a list of complex numbers.
\item Kitchen sink: An 8-tuple-element reduction that simultaneously computes the sum, product, minimum, maximum, \code{argmin}, \code{argmax}, sum-of-squares, and a count of the number of even values in a list of integers. This tests exists to demonstrate we can decompose multi-valued reductions into primitive ones.
\end{itemize}
All benchmarks run on an inputs of size $2^{24}$, which is a typical number of pixels for an input image to a Halide program. The input image data types are specified in Table~\ref{tab:table}. We run the benchmarks on 8 cores of a Xeon E5-2690. In each case we use the same Halide \emph{algorithm} code, and compare the performance attainable with \code{rfactor} to the performance attainable without it. For each benchmark, we took the minimum across 10 trials where each trial is the average of 10 iterations. We measure the execution time of the pipeline, not including the compilation time. Table~\ref{tab:table} shows the results. Without \code{rfactor}, each algorithm would require almost twice as much algorithm code to reach the same performance (see Table~\ref{tab:code_reduction}). Most importantly, \code{rfactor} provides a much less error-prone way of factoring a reduction as more of the logic is in the schedule instead of the algorithm. We also measured the increase in compile times due to the call to \code{rfactor}, and found it to be consistently under three milliseconds. The time taken to search the table for a matching operator is shown in Table~\ref{tab:search_time}. The search is fast because the table is split into subtables by the root IR node type (the largest subtable has $\sim 2800$ entries), and the most common operations are simple, and so they are close to the top of the tables. For a non-associative operation (which ultimately returns a compile error) \code{rfactor} must search to the end of the table. This takes around 1.2 ms.
Benchmarks generally fall into two categories. Either we benefit from from vectorization \emph{and} multi-core parallelism, or we benefit from multi-core parallelism alone. Histogram and maximum, fall into the second category. The histogram benchmark cannot be cleanly vectorized, because the bulk of the work involves scattering to data-dependent locations. The maximum benchmark does vectorize cleanly, but underneath Halide LLVM\footnote{Trunk LLVM as of Sept 9, 2016} auto-vectorizes the reference code without \code{rfactor}, so we only see a speed-up from multi-core parallelism. Dot product, 4D \code{argmin}, complex multiplication, and the kitchen sink test all benefit from parallelism and vectorization. Complex multiplication, dot product, and maximum all hit the memory bandwidth limit, limiting the possible benefit from parallelism.
Note that we did not add any patterns to the tables manually; the generated fragments were sufficient for the benchmarks and the applications in the Halide open source repository. The generated fragments were also sufficient for the reductions present in the HDR+ pipeline~\cite{HDRPlus}, which is the largest Halide pipeline of which we are aware. We could, however, manually add operators to the table if necessary in the future (for example for quaternion multiplication).
\begin{table}[t]
\centering
%\setlength\tabcolsep{3.5pt} % default value: 6pt
\begin{center}
\begin{tabular}{p{1in}ddd}
\toprule
\multicolumn{1}{C{1.5cm}}{\bfseries Benchmark} & \multicolumn{1}{C{1.2cm}}{\bfseries Serial (lines)} & \multicolumn{1}{C{1.3cm}}{\bfseries \code{rfactor} (lines)} & \multicolumn{1}{C{1.5cm}}{\bfseries Reduction (\%)} \\
% Benchmark & Serial (ms) & \code{rfactor} (ms) & Speed-up \\
\midrule
Maximum & 9 & 5 & 44.4 \\
2D histogram & 6 & 4 & 33.3 \\
4D \code{argmin} & 24 & 13 & 45.8 \\
Complex & 12 & 7 & 41.7 \\
multiplication & & \\
Dot product & 9 & 5 & 44.4 \\
Kitchen sink & 45 & 17 & 62.2 \\
\bottomrule
\end{tabular}
\end{center}
\caption{Using \code{rfactor} reduces the lines of code in the benchmarks by 45\% on average. Only the lines of code required to define the reduction functions and \code{rfactor} calls are included in the calculation.}
\label{tab:code_reduction}
\end{table}
\begin{table}[t]
\centering
%\setlength\tabcolsep{3.5pt} % default value: 6pt
\begin{center}
\begin{tabular}{p{1in}dd}
\toprule
\multicolumn{1}{C{1.5cm}}{\bfseries Benchmark} & \multicolumn{1}{C{1.5cm}}{\bfseries Search time (ms)} & \multicolumn{1}{C{2cm}}{\bfseries Total compilation time (ms)} \\
% Benchmark & Serial (ms) & \code{rfactor} (ms) & Speed-up \\
\midrule
Maximum & 0.08 & 127.5 \\
2D histogram & 0.09 & 220.9 \\
4D \code{argmin} & 0.57 & 196.2 \\
Complex & 0.21 & 150.4 \\
multiplication & & \\
Dot product & 0.12 & 131.2 \\
Kitchen sink & 0.55 & 187.1 \\
\bottomrule
\end{tabular}
\end{center}
\caption{The time taken to search the table to find a matching operator is relatively small with respect to the total compilation time.}
\label{tab:search_time}
\vspace*{-10pt}
\end{table}
%We ran the binary associative operator generator for 1.5 days for both the single- and two- dimensional tuples of signed 32-bit integers and unsigned 32-bit integers. We found around 10,000 associative operators in total for signed 32-bit integer single-dimensional tuple and around 6,000 for unsigned 32-bit integer single-dimensional tuple. For the two-dimensional tuple, we found around 300 operators for both signed 32-bit integer and unsigned 32-bit integer (see Table \ref{tab:int32_ops} and \ref{tab:uint32_ops}). Note that there are quite a few associative operators for the two-dimensional tuple we rejected during generation as they are decomposable into two one-dimensional associative operator.
%\begin{table*}[h!]
%\caption{Precomputed table size for single- and two- dimensional tuples of signed 32-bit integers generated in 1.5 days.}
%\label{tab:int32_ops}
%\centering
%\setlength\tabcolsep{3.5pt} % default value: 6pt
%\begin{center}
%\begin{tabular}{p{10cm}d}
%\toprule
%\multicolumn{1}{C{10cm}}{Tuple Size} & \multicolumn{1}{C{3cm}}{Size}\\
%\midrule
%Single-dimensional (with constant, no \code{select}, max of 8 leaves) & 7737 \\
%Single-dimensional (with constant, \code{select} only, max of 3 leaves for conditional, 4 leaves for the true/false) & 3162 \\
%Two-dimensional (with constant, no \code{select}, max of 7 leaves) & 327 \\
%\bottomrule
%\end{tabular}
%\end{center}
%\label{default}
%\end{table*}
%\begin{table*}[h!]
%\caption{Precomputed table size for single- and two- dimensional tuples of unsigned 32-bit integers generated in 1.5 days.}
%\label{tab:uint32_ops}
%\centering
%\setlength\tabcolsep{3.5pt} % default value: 6pt
%\begin{center}
%\begin{tabular}{p{10cm}d}
%\toprule
%\multicolumn{1}{C{10cm}}{Tuple Size} & \multicolumn{1}{C{3cm}}{Size}\\
%\midrule
%Single-dimensional (with constant, no \code{select}, max of 8 leaves) & 5813 \\
%Single-dimensional (with constant, \code{select} only, max of 3 leaves for conditional, 4 leaves for the true/false) & 518 \\
%Two-dimensional (with constant, no \code{select}, max of 7 leaves) & 348 \\
%\bottomrule
%\end{tabular}
%\end{center}
%\label{default}
%\end{table*}
%Discussion
%Synthetic functions (also to show limitations): approximating 128-bit add with 2 64-bit integers -> z3 cannot prove that it is associative, although the max/min forms are provable with z3. \\
%Limitations: we need an identity, symmetric intermediate and merge functions: they should be of the same form, constrained by the look-up table (we can only match to whatever are in the table -> whatever z3 can prove to be associative) + whatever we thought to generate. Some associative ops (e.g. 4x4 matrix multiply) are just expensive to generate. Technically it's doable, provided we limit the ops to addition and multiplication and restricting the variables involved in the expr to be unique (no repeats) during lookup table generation. Other ops may not be covered in our search (e.g. one that exploits a special property unique to some large constant). \\
%Real-world stuff: same code that can be simplified (reducing number of lines of code, etc) when using rfactor \\
%Performance of generation/search/synthesis. <TODO: Not sure what this is about? time taken when doing the matching? how many associative operations can be generated within some hours? > \\
%Case study of importance of "code reduction"? <TODO: Not sure what this is about? SK: I think the idea is that without the approach in this paper, you have to write more code (= more bugs) but also that you have to write arch-specific code (if you want to do parallel reduction on one arch and not the other) > \\