CPiH.tex

% -*- mode: LaTeX; compile-command: "rubber -d --unsafe CPiH.tex" -*-

\documentclass{book}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Packages
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\usepackage{hyperref}
\usepackage{url}
\usepackage{xcolor}

\hypersetup{
    colorlinks,
    linkcolor={green!50!black},
    citecolor={blue!50!black},
    urlcolor={blue!80!black}
}

\usepackage{graphicx}
\graphicspath{{images/}}

\usepackage{etoolbox}
\usepackage{mdframed}

\usepackage{prettyref}

\usepackage{minted}

\usepackage{todonotes}

\usepackage[style=authoryear]{biblatex}
\addbibresource{references.bib}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% prettyref
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newrefformat{fig}{Figure~\ref{#1}}
\newrefformat{sec}{section~\ref{#1}}
\newrefformat{chap}{Chapter~\ref{#1}}
\newrefformat{eq}{equation~\eqref{#1}}
\newrefformat{prob}{Problem~\ref{#1}}
\newrefformat{tab}{Table~\ref{#1}}
\newrefformat{thm}{Theorem~\ref{#1}}
\newrefformat{lem}{Lemma~\ref{#1}}
\newrefformat{prop}{Proposition~\ref{#1}}
\newrefformat{defn}{Definition~\ref{#1}}
\newrefformat{cor}{Corollary~\ref{#1}}
\newrefformat{ex}{Exercise~\ref{#1}}
\newrefformat{alg}{Algorithm~\ref{#1}}

\newcommand{\pref}[1]{\prettyref{#1}}

% \Pref is just like \pref but it uppercases the first letter; for use
% at the beginning of a sentence.
\newcommand{\Pref}[1]{%
  \expandafter\ifx\csname r@@#1\endcsname\relax {\scriptsize[ref]}
    \else
    \edef\reftext{\prettyref{#1}}\expandafter\MakeUppercase\reftext
    \fi
}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Misc semantic markup
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\module}[1]{\emph{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}
\newcommand{\term}[1]{\emph{#1}}

\newcommand{\h}[1]{\mintinline{haskell}{#1}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Kattis
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\kattislogo}{\raisebox{-0.2em}{\includegraphics[height=0.9\baselineskip]{Kattis}}}

\newcommand{\kattislist}[1]{%
  \def\nextitem{\def\nextitem{, }}
  \renewcommand*{\do}[1]{\nextitem\kattislink{##1}}
  \docsvlist{#1}
}

\newcommand{\kattis}[1]{%
  \begin{mdframed}[frametitle={Practice problems}]
    \kattislogo
    \kattislist{#1}
  \end{mdframed}
}

\newcommand{\kattislink}[1]{\href{https://open.kattis.com/problems/#1}{\texttt{#1}}}

\newcommand{\inlinekattis}[1]{\kattislogo\!\!\kattislist{#1}\!}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Title page
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{Competitive Programming in Haskell}
\author{Brent A. Yorgey}

\begin{document}

\maketitle

\tableofcontents

\chapter{Introduction}
\label{chap:intro}

\todo{code style: stylish-haskell? fourmolu?}

\chapter{Getting Started}
\label{chap:getting-started}

As a basic running example, we will consider the same example problem
that Open Kattis uses in its \href{https://open.kattis.com/help}{help
  section}, A Different Problem \mbox{(\inlinekattis{different})}.  In
this problem, we are told that the input will consist of a number of
pairs of integers between $0$ and $10^{15}$, one pair per line, and we
should output the absolute value of the difference between each
pair. The given example is that if the input looks like this:
\begin{verbatim}
10 12
71293781758123 72784
1 12345677654321
\end{verbatim}
then our program should produce output that looks like this:
\begin{verbatim}
2
71293781685339
12345677654320
\end{verbatim}

This is of course an extremely simple problem, so we can just focus on
the mechanics of solving a problem in Haskell.

\section{Competitive programming is functional}
\label{sec:cp-functional}

You can see that this problem specifies an \emph{input} in a
particular format (which will be provided via our program's standard
input), and requires us to produce a specified \emph{output} (via
standard output).  Most competitive programming problems are of this
form.\footnote{There certainly can exist problems requiring
  interaction beyond standard input and output, such as reading from a
  certain file, doing network I/O, and so on, but we will not consider
  such problems in this book.} An imperative approach to such a
problem involves doing a sequence of input commands, some computation,
and a sequence of output commands---possibly interleaved with one
another---and we might immediately think to start using functions like
\h{getLine} and \h{putStrLn} to do the required I/O in Haskell.  However,
there is a much more fruitful functional perspective: we are simply
being asked to implement a particular (partial) function of type
\h{String -> String}.\footnote{If you are worried about the use of
  \h{String}, fear not: in \pref{chap:parsing}, once we develop a simple parsing
  framework, we will switch to \h{ByteString} for efficiency.}  The fact
that the function's input and output should be hooked up to the
program's standard input and output is just an implementation detail.
Competitive programming is functional at heart.

It turns out that Haskell's standard library already has the perfect
built-in function for this scenario:
\begin{minted}{haskell}
interact :: (String -> String) -> IO ()
\end{minted}
\h{interact} takes a pure \h{String -> String} function and turns it into
an \h{IO} action which reads from standard input, passes the input to
the given function, and prints the result to standard output.  It even
does this using \emph{lazy} I/O, which can be strange and problematic
in some scenarios, but is perfect for this situation: the input is
read lazily, as demanded by the function, so that the output and input
can be automatically interleaved depending on which parts of the
output depend on which parts of the input. In particular, this means
that that the entire input need not be stored in memory at once. If
the inputs can be processed into outputs in a streaming fashion---as
is the case in the example problem we are currently
considering---then the input and output will be interleaved.

Thus, the \h{interact} function lets us immediately pass to a functional
view of a problem, worrying only about the essential details of
transforming the given input into the requested output.  This is the
last time \h{IO} will appear in this book! \todo{Check this---maybe I
  want to write about interactive problems.}

\section{A basic solution pipeline}
\label{sec:pipeline}

So now we need to write a pure function which transforms the input
into the output.  Of course, in true Haskell fashion, we will do this
by constructing a chained pipeline of functions to do the job
incrementally.  The general plan of attack (for any problem) is
as follows:
\begin{enumerate}
\item First, parse the input, that is, transform the raw input
  into some more semantically meaningful representation.
\item Next, solve the problem, turning a semantically meaningful
  representation of the input into a semantically meaningful
  representation of the output.
\item Finally, format the output into a \h{String}.
\end{enumerate}
Figure~\ref{fig:skeleton-different} has a simple skeleton solution along
these lines. There are a few things to point out.

\begin{figure}
  \centering
  \inputminted{haskell}{code/intro/Skeleton.hs}
  \caption{A skeleton solution for A Different Problem}
  \label{fig:skeleton-different}
\end{figure}

\begin{itemize}
\item Notice the use of the backwards function composition operator
  \h{(>>>)} from \h{Control.Arrow} (essentially, \h{f >>> g = g . f},
  although in actuality \h{(>>>)} is a bit more general than that).
  When solving programming problems, I like to be able to type
  function pipelines from left to right as I think about data flowing
  through the pipeline from beginning to end.  This is of course a
  personal preference, and one could also write %
  \h{format . solve . parse} instead of %
  \h{parse >>> solve >>> format}.
\item If the machine on which our solution will run has a 64-bit
  architecture (this is always true for Open Kattis, but not
  necessarily so for other platforms such as Codeforces), for this
  problem we could technically get away with using \h{Int} instead of
  \h{Integer}. On a 64-bit architecture, \h{maxBound :: Int} is $2^{63} -
  1$, which is a bit more than $9 \times
  10^{18}$, plenty big enough for this problem, with inputs only up to
  $10^{15}$. For more computationally intensive problems, using \h{Int}
  instead of \h{Integer} can be an important optimization; however, for
  simple problems, using \h{Integer} is preferred since it eliminates
  the potential for bugs due to overflow.

\item In simple problems such as this, the \h{solve} function could
  probably be inlined (it can be a fun challenge to solve easy
  problems in a single line of code!).  However, in general, I prefer
  to split it out explicitly in order to specify its type, which both
  prevents problems with \h{read}/\h{show} ambiguity, and also serves as a
  sanity check on the parsing and formatting code.

\item And one last thing: I said we were going to parse the input into
  a ``semantically meaningful representation'', but I lied a teensy
  bit: the problem says we are going to get a \emph{pair} of integers,
  but the type of \h{solve} says that it takes a \emph{list} of
  integers.  We will discuss and justify this choice in more detail in
  \pref{sec:partial}.
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Parsing and formatting}
\label{sec:parse-format}

The outer shell of our basic solution pipeline deals with converting
between the raw strings used as input and output, and more
semantically meaningful representations.  In many cases, it suffices
to use basic list-processing functions from \module{Prelude} or
\module{Data.List}.

\begin{itemize}
\item For parsing input, some of the most frequently useful functions
  include \h{lines}, \h{words}, \h{drop}, and \h{read}.  The \h{split} function
  from the \h{Data.List.Split} module (from the \pkg{split}
  package) can also be helpful.
\item For formatting output, some of the most frequently useful
  functions include \h{unlines}, \h{unwords}, \h{show}, \h{concat},
  \h{intersperse}, and \h{intercalate}.
\end{itemize}

If any of these functions are unfamiliar to you, it's worth spending
some time reading their documentation and trying them on some
examples.  In \pref{chap:parsing} we will consider more sophisticated
tools for parsing in particular, but until then we will stick to
problems that only require these basic tools.  Fortunately, there are
many such problems.

Returning to our running example solving A Different Problem, the
given input consists of a number of lines, each containing two
numbers.  This is easily parsed using a combination of \h{lines},
\h{words}, \h{read}, and \h{map} (\pref{fig:parse-different}).

\begin{figure}
  \centering
  \inputminted[firstline=5, lastline=6]{haskell}{code/intro/Different.hs}
  \caption{\h{parse} function for A Different Problem}
  \label{fig:parse-different}
\end{figure}

As another example, consider the problem Pot
\mbox{(\inlinekattis{pot})}.  The input description for this problem says

\begin{quote}
  The first line of input contains the integer $N (1 \leq N \leq 10)$, the number
  of the addends from the task. Each of the following $N$ lines contains
  the integer $P_i$ ($10 \leq P_i \leq 9999$, $i = 1, \dots, N$) from the task.
\end{quote}

At first glance, it looks like we need a more sophisticated parsing
framework for this problem: the first line is a number which tells us
\emph{how many} additional lines of input we need to parse.  The need
for intermediate values to be able to affect later parsing is exactly
what necessitates the use of a \emph{monadic} parsing framework.
However, in this case, we can do something much simpler: just ignore
the number entirely, and process all the lines besides the first!  We
don't actually need the number to tell us how many lines to expect
since the end of the input is delimited by EOF.

\begin{minted}{haskell}
parse :: String -> [Integer]
parse = lines >>> drop 1 >>> map read
\end{minted}

This approach often works. For example, even if we have a variable
number of lines, each of which has a variable number of items (with a
number at the beginning saying how many), we can ignore all the
counts: just split into lines, drop the first line, and then split
each line into words, dropping the first.  Besides the lines being
delimited by EOF overall, each individual line is delimited by the end
of line character, so we don't need any of the counts to tell us how
far to read.

However, this doesn't work when we really have no intrinsic way to
tell when a certain section ends and another starts.  For example,
some problems consist of a number of test cases, where each test case
input consists of a variable numbers of lines. In that case we would
need to use provided information about the length of each test case.
We will discuss such parsing in \pref{chap:parsing}.

\todo{include full solution for A Different Problem?}

\section{Using partial functions}
\label{sec:partial}

You might wonder about using a list as the output of \h{parse} and the
input of \h{solve}.  Shouldn't we use something like \h{(Int,Int)}, since
we know we will be given exactly two integers? \todo{paste in blog post
about this.  Note you can skip if you're not interested in the
philosophy.}

The fact is that I almost never use actual Haskell tuples in my
solutions, because they are too awkward and inconvenient. Representing
homogeneous tuples as Haskell lists of a certain known length allows
us to read and process ``tuples'' using standard functions like
\h{words} and \h{map}, to combine them using \h{zipWith}, and so on.
And since we get to assume that the input always precisely follows the
specification---which will never change---this is one of the few
situations where, in my opinion, we are fully justified in writing
partial functions like this if it makes the code easier to write.  So
I often represent homogeneous tuples as lists and just pattern match
on lists of the appropriate (known) length.  (If I need heterogeneous
tuples, on the other hand, I create an appropriate \h{data} type.)

\section{Explicit recursion}
\label{sec:explicit-recursion}

Some styles of Haskell programming eschew explicit recursion in favor
of folds and other recursion schemes.  And certainly, one should use
simple higher-order functions like \h{map}, \h{filter}, and so on when
appropriate.  However, it's often not worth trying to come up with
some clever fold when a simple recursive function will be faster to
write and easier to debug.

\todo{Example}

\section{Practice problems}
\label{sec:gs-practice}

Below are a few simple problems for you to practice on.  This is just
a very small sample; there are many, many problems on Open Kattis that
can be solved with not much more than the techniques explained in this
chapter, and I encourage you to just solve a bunch of them!  To start,
look for problems with a difficulty rating less than 2.0.  To find
such problems easily, you can go to the list of all problems
(\url{http://open.kattis.com/problems/}) and sort by difficulty.

\kattis{jobexpenses, judgingmoose, quickestimate}

\chapter{Wholemeal Programming}
\label{chap:wholemeal}

As we have seen in the previous chapter, functional programming
encourages us to think in terms of assembling functions into pipelines
that incrementally transform an input into an output.  \term{Wholemeal
  programming} is a closely related concept, explained best by Ralf
Hinze \parencite*{hinze2009tour}:
\begin{quote}
  Functional languages excel at wholemeal programming, a term coined
  by Geraint Jones. Wholemeal programming means to think big: work
  with an entire list, rather than a sequence of elements; develop a
  solution space, rather than an individual solution; imagine a graph,
  rather than a single path.
\end{quote}
This is often a very fruitful approach to solving competitive
programming problems.  It helps us avoid getting lost---or making
mistakes!---dealing with trivial low-level details.

\section{Example: processing a list of strings}

As a simple example, consider the problem of writing a function which
takes a list of strings as input, and adds up the lengths of all
strings that do not contain the letter \h{'e'}.  For example, given
the input
\begin{minted}{haskell}
["dog", "sheep", "horse", "capybara", "lemur"]
\end{minted}
the function should return 11 (the combined length of \h{"dog"} and
\h{"capybara"}).

A standard imperative approach to this problem would be to loop over
the list of words, updating an accumulator variable every time we see
a string that does not contain the letter e.  An inexperienced
Haskeller might produce something similar:
\inputminted{haskell}{code/wholemeal/LengthWithoutE.hs}
This code works, but there is a better way.  Instead of thinking in
terms of processing the list one item at a time, an experienced
Haskeller will think in terms of incrementally transforming the input
into the desired output: first, filter out the strings we don't want;
next, turn each string into its length; and finally sum the lengths.
\inputminted{haskell}{code/wholemeal/LengthWithoutEWholemeal.hs}
This is better in almost every way: it is shorter, easier to read and
understand, easier to refactor, harder to get wrong, and easier to
prove correct.

Experienced programmers might worry about efficiency: doesn't this do
three passes over the list instead of just one?  Even worse, doesn't
it construct several intermediate lists from scratch?  Actually,
Haskell's laziness, combined with optimization technology like list
fusion, make this very efficient in practice.  And in any case, we
usually don't need to worry much about constant factors as long as we
have the right asymptotic complexity.

\section{Basic data structures}

Wholemeal programming in Haskell usually revolves around the use of
several basic data types provided in Haskell standard libraries, such
as lists, \h{Set}, \h{Map}, \h{Tree}, and \h{Array}.  We will touch on
each of these, highlighting their important features and suggesting
some practice problems to solve.

\subsection{Lists}
\label{sec:lists}

Lists (including \h{String}s) are ubiquitous in Haskell, and the
standard library provides a large number of useful list-processing
functions.  Trying to come up with a specific catalog of list
functions which are of use in competitive programming would be a
fool's errand, since almost all of the standard \h{Data.List}
functions
(\url{https://hackage.haskell.org/package/base-4.18.0.0/docs/Data-List.html})
can be of use.  If you're not already familiar with all the standard
list functions, you should spend time looking through the
documentation and practicing their use, especially if you want to be
able to solve easy problems quickly.

One thing worth mentioning is efficiency: lists have quite a bit of
extra overhead compared to arrays. To avoid this becoming an issue, we
should aim to use lists as \emph{control structures} rather than
\emph{data structures}. That is, in a situation where you would write
a \texttt{for} loop in an imperative language, you can probably safely
use a list in Haskell.  It is perfectly fine to have a long list and
then iterate over it multiple times with standard functions like
\h{map}, \h{filter}, and so on.  The multiple traversals may fuse
away, and even if they don't the extra overhead is unlikely to make a
big difference.  What you definitely don't want to do is use many
small lists as data structures.  That is, for example, having two
lists with $10^5$ elements each is probably fine; having $10^5$ lists
with two elements each is probably going to be too slow.

\todo{pick two or three representative sample problems and discuss
  their solutions?  One using scanl?}

\kattis{foo, bar}

\subsection{Sets}
\label{sec:sets}

The standard \h{Data.Set} type stores elements in a balanced tree
structure, so most operations take time logarithmic in the number of
items stored.  This is not as fast as a hash table, but quite fast in
practice, and provides a much richer API than a hash-based data
structure. (The \pkg{unordered-containers} package---which is
available in the Open Kattis test environment---does provide
\h{Data.HashSet}, but it's rarely needed.)

It is worth having a good working knowledge of standard \h{Set}
functions like \h{empty}, \h{singleton}, \h{fromList}, \h{toList},
\h{insert}, \h{delete}, \h{union}/\h{unions}, \h{intersection},
\h{difference}, \h{map}, \h{filter}, \h{member}, \h{null}, and
\h{size}.

There are many others which are occasionally useful and can be looked
up in the documentation when needed, such as (but not limited to)
\h{split}, \h{takeWhile}, \h{lookupIndex}, \h{lookupGT}, \h{elemAt},
and \h{findMin}/\h{findMax}.

\subsubsection{Example: Booking a Room}

As an example, consider \inlinekattis{bookingaroom}.  We are given as
input integers $r$ and $n$, and then list of $n$ integers which are
between $1$ and $r$ inclusive, representing hotel rooms which are
booked.  We must output the number of some available room (or an error
message if all rooms are booked).

There are many ways to solve this.  One simple way might be to just
read the given numbers into a list, and then check every number from
$1$ to $r$ to see if it is a member of the list or not (using
\h{elem}), and print the first one we find which is not.  This
solution would take $O(nr)$ time.  Since $n$ and $r$ are guaranteed to
be at most $100$, it would work, but we can do better---not only in an
asymptotic sense, but in terms of code size!  Checking every number
from $1$ to $r$ has a very ``imperative'' feel; we are looping through
some items and doing something with each one individually.  Instead,
we can think in terms of computing the \emph{entire} set of available
rooms at once, by taking the difference of the set of all rooms
$\{1 \dots r\}$ and the set of booked rooms.  Picking one arbitrary
element from a \h{Set} can then be accomplished either by converting
to a list and picking the first element (\emph{i.e.}  \h{listToMaybe
  . toList}), or, even better, we can use the built-in
\h{Data.Set.lookupMin :: Ord a => Set a -> Maybe
  a}. \pref{fig:bookingaroom} demonstrates a complete solution.

\begin{figure}
  \centering
  \inputminted{haskell}{code/wholemeal/BookingARoom.hs}
  \caption{Booking a Room solution}
  \label{fig:bookingaroom}
\end{figure}

\kattis{foo, bar}

\subsection{Maps}
\label{sec:maps}

The standard \h{Data.Map} type stores key-value mappings in a balanced
tree structure, so, like \h{Data.Set}, most operations take time
logarithmic in the number of items stored.  A hash table-based map
such as \h{Data.HashMap} would be faster in theory, but \h{Data.Map}
is quite fast in practice.  There are other map variants as well, such
as \h{Data.Map.Strict} or \h{Data.IntMap}. In my experience it usually
doesn't make a whole lot of difference, and I mostly stick to
\h{Data.Map} (\pref{chap:dp} will explain one particular application
of lazy, as opposed to strict, maps).  The major exceptions seem to
be:

\begin{itemize}
\item If you want to make a large map with string (\h{ByteString})
  keys, a \h{HashMap} can be considerably faster than \h{Map}
  (\inlinekattis{pakethanterare}).
\item \todo{When is IntMap faster?}
\end{itemize}

It is worth having a good working knowledge of standard \h{Map}
functions like \h{empty}, \h{null}, \h{size}, \h{singleton},
\h{fromList}, \h{fromListWith}, \h{assocs}, \h{elems}, \h{keys},
\h{keysSet}, \h{insert}, \h{insertWith}, \h{delete}, \h{alter},
\h{lookup}, \h{member}, \h{union}, \h{unionWith}, \h{intersection},
\h{intersectionWith}, and \h{map}.  \h{Map k} is also an instance of
\h{Functor}, \h{Foldable}, and \h{Traversable}, allowing the use of
standard functions like \h{fmap}, \h{foldMap}, \h{sum}, \h{maximum},
\h{minimum}, and \h{traverse}/\h{mapM}. There are many, many
other functions which are occasionally useful and can be looked up in
the documentation when needed, such as (but not limited to)
\h{compose}, \h{split}, \h{restrictKeys}, \h{partition}, \h{mapMaybe},
\h{updateAt}, \h{take}, \h{drop}, and \h{findMin}/\h{findMax}.

\kattis{pakethanterare}

\subsection{Arrays}
\label{sec:arrays}

\subsection{Trees}
\label{sec:trees}

\section{Additional practice problems}
\label{sec:wholemeal-practice}

\kattis{foo,bar}

\chapter{Input Parsing}
\label{chap:parsing}

So far, all the example problems we have considered have input in a
format that we were able to parse using simple standard functions like
\h{lines} and \h{words}.  There is another common class of problems,
however, which follow this pattern:

\begin{quote}
The first line of the input consists of an integer $T$.  Each of the
next $T$ lines consists of \dots
\end{quote}

That is, the input contains integers which are not input data per se,
but just tell you how many things are to follow. This is really easy
to process in an imperative language like Java or C++.  For example,
in Java we might write code like this:

\begin{minted}{java}
Scanner in = new Scanner(System.in);
int T = in.nextInt();
for (int i = 0; i < T; i++) {
   // process each line
}
\end{minted}

As we already saw in \pref{chap:getting-started}, we can often get
away with completely ignoring this extra information.  For example, if
the input consists of a number $T$ followed by $T$ lines, each of
which contains a number $n$ followed by a list of $n$ numbers, we can
simply write

\begin{minted}{haskell}
main = interact $
  lines >>> drop 1 >>> map (words >>> drop 1 >>> map read) >>> ...
\end{minted}
%$

That is, we can ignore the first line containing $T$ since the
end-of-file will tell us how many lines there are; and we can ignore
the $n$ at the beginning of each line, since the newline character
tells us when the list on that line is done.

Sometimes, however, this isn't possible, especially when there are
multiple test cases, or when a single test case has multiple parts,
each of which can have a variable length.  For example, consider
Popular Vote (\inlinekattis{vote}), which describes
its input as follows:

\begin{quote}
The first line of input contains a single positive integer $T \leq 500$
indicating the number of test cases. The first line of each test case
also contains a single positive integer $n$ indicating the number of
candidates in the election. This is followed by $n$ lines, with the $i$th
line containing a single nonnegative integer indicating the number of
votes candidate $i$ received.
\end{quote}

How would we parse this?  We could still ignore $T$---just keep reading
until the end of the file---but there's no way we can ignore the $n$
values. Since the values for each test case are all on separate lines
instead of on one line, there's otherwise no way to know when one test
case ends and the next begins.

One possible approach would be to use \h{splitAt} and explicit
recursion, like so:

\begin{minted}{haskell}
type Election = [Int]

readInput :: String -> [Election]
readInput = lines >>> drop 1 {- ignore T -} >>> map read >>> go
  where
    go :: [Int] -> [Election]
    go []     = []
    go (n:xs) = votes : go rest
      where (votes,rest) = splitAt n xs
\end{minted}

However, this is really annoying to write and easy to get wrong.
There are way too many variable names to keep track of (\h{n}, \h{xs},
\h{votes}, \h{rest}, \h{go}) and for more complex inputs it becomes
simply unmanageable.  You might think we should switch to using a real
parser combinator library---\h{parsec} is indeed installed in the
environment Kattis uses to run Haskell solutions---and although
sometimes a full-blown parser combinator library is needed
(\pref{chap:parser-combs}), in this case it's quite a bit more
heavyweight than we would like.

\section{Scanner}
\label{sec:scanner}

The heart of the issue is that we want to be able to specify a
high-level description of the sequence of things we expect to see in
the input, without worrying about managing the stream of tokens
explicitly.  Another key insight is that 99\% of the time, we
\emph{don't} need the ability to deal with parse failure or the
ability to parse multiple alternatives.  With these insights in mind,
we can create a very simple \h{Scanner} abstraction, which is just a
\h{State}ful computation over a list of tokens:

\begin{minted}{haskell}
type Scanner = State [String]

runScanner :: Scanner a -> String -> a
runScanner s = evalState s . words
\end{minted}

To run a scanner, we just feed it the entire input as a \h{String},
which gets chopped into tokens using \h{words}.  (Of course in some
scenarios we might want to use \h{lines} instead of \h{words}, or even do
more complex tokenization.)

Note since \h{Scanner} is just a type synonym for \h{State [String]}, it
is automatically an instance of \h{Functor}, \h{Applicative}, and
\h{Monad} (but not \h{Alternative}).

So let's develop a little \h{Scanner} DSL.  The most fundamental thing
we can do is read the next token.
\begin{minted}{haskell}
str :: Scanner String
str = get >>= \case { s:ss -> put ss >> return s }
\end{minted}
(This uses the \h{LambdaCase} extension\todo{reference}, though we
could easily rewrite it without.)  \h{str} gets the current list of
tokens, puts it back without the first token, and returns the first
token.  Note that there is no case for the empty list.  You might
think we want to include a case for the empty token list and have it
return the empty string or something like that.  But since the input
will always be properly formatted, if \h{str} is ever called when
there are no tokens left it means our program has a bug---for example,
perhaps we misunderstood the description of the input format.  In this
scenario we \emph{want} it to crash loudly, as soon as possible,
rather than continuing on with some bogus data.

We can now add some scanners for reading specific token types other
than \h{String}, simply by mapping the \h{read} function over the output
of \h{str}:
\begin{minted}{haskell}
int :: Scanner Int
int = read <$> str

integer :: Scanner Integer
integer = read <$> str

double :: Scanner Double
double = read <$> str
\end{minted}
%$
Again, these will crash if they see a token in an unexpected format,
and that is a very deliberate choice.

Now, as explained earlier, a very common pattern is to have an
integer $n$ followed by $n$ copies of something.  So let's make a
combinator to encapsulate that pattern:
\begin{minted}{haskell}
numberOf :: Scanner a -> Scanner [a]
numberOf s = int >>= flip replicateM s
\end{minted}
\h{numberOf s} expects to first see an \h{Int} value $n$, and then it runs
the provided scanner $n$ times, returning a list of the results.

It's also sometimes useful to have a way to repeat a \h{Scanner} some
unknown number of times until encountering EOF (for example, the input
for some problems doesn't specify the number of test cases up front
the way that \inlinekattis{vote} does).  This is similar to the \h{many}
combinator from \h{Alternative}, except that it only stops when the
token stream runs out, not when the given \h{Scanner} fails (remember
that we have no way to handle \h{Scanner} failure other than crashing).
\begin{minted}{haskell}
many :: Scanner a -> Scanner [a]
many s = get >>= \case { [] -> return []; _ -> (:) <$> s <*> many s }
\end{minted}
%$
\h{many s} repeats the scanner \h{s} as many times as it can, returning a
list of the results.  In particular it first peeks at the current
token list to see if it is empty.  If so, it returns the empty list of
results; if there are more tokens, it runs \h{s} once and then
recursively calls \h{many s}, consing the results together.

Finally, it's quite common to want to parse a specific small number of
something, for example, two double values representing a 2D coordinate pair.
We could just write \h{replicateM 2 double}, but this is common enough
that I find it helpful to define dedicated combinators with short
names:
\begin{minted}{haskell}
two, three, four :: Scanner a -> Scanner [a]
[two, three, four] = map replicateM [2..4]
\end{minted}

\todo{Also mention times / \h{><} and pair}

\todo{Show complete code listing? Or put it in Appendix?  In any case,
  link to complete code on GitHub.}

\section{Examples}

So what have we gained?  Writing the parser for \kattisinline{vote} is now
almost trivial:
\begin{minted}{haskell}
type Election = [Int]

main = interact $ runScanner elections >>> ...

elections :: Scanner [Election]
elections = numberOf (numberOf int)
\end{minted}
%$

In practice I would probably just inline the definition of \h{elections}
directly: \h{interact $ runScanner (numberOf (numberOf int)) >>> ...}
%$

As a slightly more involved example, chosen almost at random, consider
Board Wrapping (\inlinekattis{wrapping}):
\begin{quote}
On the first line of input there is one integer, $N \leq 50$, giving
the number of test cases (moulds) in the input. After this line, $N$
test cases follow. Each test case starts with a line containing one
integer $n, 1 \leq n \leq 600$, which is the number of boards in the
mould. Then $n$ lines follow, each with five floating point numbers
$x,y,w,h,v$ where $0 \leq x,y,w,h \leq 10000$ and $-90^{\circ} < v
\leq 90^{\circ}$. The $x$ and $y$ are the coordinates of the center of
the board and $w$ and $h$ are the width and height of the board,
respectively. $v$ is the angle between the height axis of the board to
the $y$-axis in degrees, positive clockwise.
\end{quote}

Here's how I might set up the input, using \h{Scanner} and a custom
data type to represent boards (in practice I would actually use a
custom data type for 2D points/vectors defined in my geometry library
rather than the type \h{V} defined below; see \pref{chap:geometry}).

\begin{minted}{haskell}
import Scanner

type V = [Double]     -- 2D vectors/points
newtype A = A Double  -- angle (radians)
                      -- newtype helps avoid conversion errors

fromDeg :: Double -> A
fromDeg d = A (d * pi / 180)

data Board = Board { boardLoc :: !V, boardDims :: !V, boardAngle :: !A }

board :: Scanner Board
board = Board
  <$> two double
  <*> two double
  <*> ((fromDeg . negate) <$> double)

main = interact $
  runScanner (numberOf (numberOf board)) >>> ...
\end{minted}
%$

\section{ByteString}

Using the \h{String} type for input can be convenient, since we get to
use all the standard library list functions, but sometimes \h{String}
isn't good enough.

A good example is the problem Army Strength (Hard)
(\inlinekattis{armystrengthhard}).  There are a number of separate
test cases; each test case consists of two lines of positive integers
which record the strengths of monsters in two different armies.
Supposedly the armies will have a sequence of battles, where the
weakest monster dies each time, with some complex-sounding rules about
how to break ties. It sounds way more complicated than it really is,
though; a bit of thought reveals that to find out who wins we really
just need to see which army's maximum-strength monster is strongest.

So our strategy for each test case is to read in the two lists of
integers, find the maximum of each list, and compare.  Seems pretty
straightforward, right?  Something like this:

\begin{minted}{haskell}
import Control.Arrow
import Data.List.Split

main = interact $
  lines >>> drop 1 >>> chunksOf 4 >>>
  map (drop 2 >>> map (words >>> map read) >>> solve) >>>
  unlines

solve :: [[Int]] -> String
solve [gz, mgz] = case compare (maximum gz) (maximum mgz) of
  LT -> "MechaGodzilla"
  _  -> "Godzilla"
\end{minted}
%$

Note we didn't actually use the \h{Scanner} abstraction here, though
we could have; in this case it's actually easier to just ignore the
numbers telling us how many test cases there are and the length of
each line, and just split up the input by lines and go from there.

This seems straightforward enough, but sadly, it results in a Time
Limit Exceeded (TLE) error on the third of three test cases.
Apparently this program takes longer than the allowed 1 second.
What's going on?

If we look carefully at the limits for the problem, we see that there
could be up to $50$ test cases, each test case could have two lists of
length $10^5$, and the numbers in the lists can be up to $10^9$.  If
all those are maxed out (as they probably are in the third, secret
test case), we are looking at an input file many megabytes in size.
At that point the time to simply read the input is a big factor.
Reading the input as a \h{String} has a lot of overhead: each
character gets its own cons cell, which form a linked list in memory;
breaking the input into lines and words requires traversing over these
cons cells one by one.  We need a representation with less overhead.

Now, if this were a \emph{real} application, we would reach for
\h{Text}, which is made for representing textual information and can
correctly handle Unicode encodings and all that good stuff.  However,
this isn't a real application: competitive programming problems
\emph{always} limit the input and output strictly to ASCII, so
characters are synonymous with bytes.  Therefore we will commit a
``double no-no'': not only are we going to use \h{ByteString} to
represent text, we're going to use \h{Data.ByteString.Lazy.Char8}
which simply assumes that each 8 bits is one character. As explained
in \pref{chap:XYZ}, however, I think this is one of those things that
is usually a no-no but is completely justified in this context.

Let's start by just replacing some of our string manipulation with
corresponding \h{ByteString} versions:
\begin{minted}{haskell}
import Control.Arrow
import Data.ByteString.Lazy.Char8 qualified as BS
import Data.List.Split

main = BS.interact $
  BS.lines >>> drop 1 >>> chunksOf 4 >>>
  map (
    drop 2 >>>
    map (BS.words >>> map (BS.unpack >>> read)) >>>
    solve
  ) >>>
  BS.unlines

solve :: [[Int]] -> BS.ByteString
solve [gz, mgz] = case compare (maximum gz) (maximum mgz) of
  LT -> BS.pack "MechaGodzilla"
  _  -> BS.pack "Godzilla"
\end{minted}
%$
This already helps a lot: this version is actually accepted, taking
0.66 seconds.

But we can do even better: it turns out that \h{read} also has a lot
of overhead, and if we are specifically reading \h{Int} values we can
do it much faster.  The \h{ByteString} module comes with a function
\begin{minted}{haskell}
BS.readInt :: BS.ByteString -> Maybe (Int, BS.ByteString)
\end{minted}
Since, in this context, we know we will always get an integer with
nothing left over, we can replace \h{BS.unpack >>> read} with \h{BS.readInt
>>> fromJust >>> fst}.  While we're at it, we can also enable the
\h{OverloadedStrings} language extension so that we don't have to wrap
string literals in \h{BS.pack}:

\begin{minted}{haskell}
{-# LANGUAGE OverloadedStrings #-}

import Control.Arrow
import Data.ByteString.Lazy.Char8 qualified as BS
import Data.List.Split
import Data.Maybe (fromJust)

main = BS.interact $
  BS.lines >>> drop 1 >>> chunksOf 4 >>>
  map (drop 2 >>> map (BS.words >>> map readInt) >>> solve) >>>
  BS.unlines

 where
  readInt = BS.readInt >>> fromJust >>> fst

solve :: [[Int]] -> BS.ByteString
solve [gz, mgz] = case compare (maximum gz) (maximum mgz) of
  LT -> "MechaGodzilla"  -- automatically wrapped in BS.pack
  _  -> "Godzilla"
\end{minted}
Now we're talking---this version completes in a blazing 0.04 seconds!

We can take these principles and use them to make a variant of the
\h{Scanner} module which uses \h{ByteString} instead of
\h{String}, including the use of the \h{readInt} functions to read
\h{Int} values quickly.  You can find it
at \url{https://github.com/byorgey/comprog-hs/blob/master/ScannerBS.hs}.

\chapter{Monoids}
\label{chap:monoids}

\chapter{Trees}
\label{chap:trees}

\chapter{Data Structures}
\label{chap:data-structures}

\chapter{Dynamic Programming}
\label{chap:dp}

\chapter{Number Theory}
\label{chap:number-theory}

\chapter{Combinatorics}
\label{chap:combinatorics}

\chapter{Geometry}
\label{chap:geometry}

\chapter{Graphs}
\label{chap:graphs}

\chapter{Strings}
\label{chap:strings}

\chapter{Parser Combinators}
\label{chap:parser-comb}

\chapter{Mutability}
\label{chap:mutability}

\chapter{Miscellaneous Topics}
\label{chap:misc}

\printbibliography

\end{document}