A workflow for handwritten text recognition (HTR) of weather records. The workflow provides a set of end-to-end instructions to automatically transcribe your weather records using machine learning.
Caution
This Python package consolidates previously dispersed proof-of-concept script components written over several years. Although consolidated, gaps in documentation and usability remain. For example, paths are not always validated and arguments might not have defaults.
Read the documentation below carefully before proceeding. The software is provided AS IS and no responsibility is taken in the event of data loss.
Vast amounts of historical weather records remain archived and unexplored. Despite advances in the recognition of (handwritten) text in general, the case of tabulated weather data remains challenging. The data is exact and provides limited contextual clues, and errors propagate because table layouts are easily misinterpreted due to messy formats, deterioration of paper and other factors. Retaining context is therefore key to quality assurance of the data retained within these weather records.
Although machine learning (ML) layout detection is a feasible part of a transcription workflow, a template matching approach is given priority. Due to its unsupervised nature, layout matching with good accuracy (say 90%) on vast numbers of records could still corrupt large amounts of data, which would then need to be filtered (detected) post hoc.
The template matching method requires more up-front work, but flags poor template matches earlier in the processing chain, making it easier to detect errors and screen for quality. Faulty layout detection can be partial, where only part of the table is found, making absolute classification of data as faulty or correct hard.
Furthermore, the law of large numbers means that even small error rates affect a significant amount of data. Partial matches and shifts in either columns or rows of a table also propagate throughout the chronological order of the remaining data. Unlike text data, where contextual clues can allow for meaningful re-ordering of sentences, the nature of numeric data is such that the encoding of the position within a table must be absolute. Therefore, preference is given to a semi-supervised template matching and table detection workflow.
Transcription models can be deployed flexibly. The default model is trained on thousands of handwritten table entries from the COBECORE project, using either the Transformer-based TrOCR model or Tesseract.
Clone the package to your local directory.
git clone https://github.com/bluegreen-labs/weaHTR_workflow.git
The included Dockerfile provides a (GPU) torch setup. You can build this Docker image using the command below. This will download the NVIDIA CUDA drivers for GPU support, the tidyverse, the RStudio IDE and the quarto publishing environment. Note that this setup will take some time to build given the large downloads involved. Once built locally, no further downloads will be required.
docker build -f Dockerfile -t weahtr .
Note
Repeatedly building a Docker image can create a large build cache, easily 10x the size of the image (which is itself multiple GB in size).
If you unexpectedly run out of storage space, check the Docker build cache and prune it.
docker buildx prune -f
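To check how much space the build cache takes up before pruning, inspect Docker's disk usage:

docker system df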
When relying on different Docker base images, make sure the interfacing libraries are installed:
apt-get install nvidia-container-runtime
To spin up a GPU docker image and drop into the command prompt use in the project directory:
docker run -it --rm --gpus all -v $(pwd):/data weahtr bash
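If no GPU is available, omit the --gpus all flag; the same image then runs on the CPU:

docker run -it --rm -v $(pwd):/data weahtr bash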
For independent installs using conda
conda env create -f environment.yml
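Then activate the environment. The environment name below is an assumption; check the name field in environment.yml:

conda activate weahtr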
For now, no pip-based install is supported. You can install the package by working in editor mode (if under development), or through a linked pip install. This works both in a Docker and in a normal install, given that the requirements are met. Editor mode also allows you to dynamically change the code.
For editor mode use:
pip install -e /path/to/pkg
For a static local install use:
pip install mypackage --no-index --find-links /path/to/pkg
Note
When using a Docker image you will have to install the package in every new session. A final Docker image with the library installed will be available when the package matures.
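For example, with the project directory mounted at /data (as in the docker run command above), a per-session editor-mode install inside the container could look like this, assuming the repository root holds the package setup files:

pip install -e /data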
The workflow consists of six steps:
- Sort your data into unique tabular formats
- Create empty reference template images
- Mark rows and columns on the template images
- Edit the configuration YAML file
- Template match the data to find the tables (three methods are provided), as shown below
# import libraries
import glob
from weahtr import template

# list files to process
images = glob.glob("./demo_input/format_1/images/*.jpg", recursive=True)

# initiate the setup
t = template.template(
    images = images,
    template = "./demo_input/format_1.jpg",
    config = "./demo_input/format_1/format_1.yml"
)

# match all templates, write homography data to file
t.match(method = "features")
The data is processed and stored in an output directory, keeping track of each image (table). Preview options are provided to generate an image to screen the template matching before proceeding to the final transcription step (6).
All operations are non-destructive on the original data, and relatively few additional image files are created. For example, any image registration is stored as a homography file (a translation matrix), which is a small text file. This homography file is used, when available (i.e. pre-calculated), to find the table with data values.
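To illustrate what such a homography file encodes, the sketch below (not part of the package API; the file names are hypothetical) loads a stored 3x3 homography matrix with NumPy and warps the source image onto the template geometry with OpenCV:

# illustrative only: apply a pre-calculated homography
import cv2
import numpy as np

# hypothetical output file written during template matching
H = np.loadtxt("output/image_0001_homography.txt")
image = cv2.imread("demo_input/format_1/images/image_0001.jpg")

# warp the source image into the template reference frame
h, w = image.shape[:2]
registered = cv2.warpPerspective(image, H, (w, h))
cv2.imwrite("image_0001_registered.jpg", registered)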
- Run the OCR/HTR transcription model on the matched data
# using the tesseract setup
t.label(method = "tesseract")
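To run the TrOCR model instead, the method string is assumed to mirror the model name; check the package documentation:

# using the TrOCR setup (method string assumed)
t.label(method = "trocr")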
The ability to store the homography files allows you to pre-calculate all table template matching / image registration up front, so this step does not need to be repeated when running the table through different OCR/HTR transcription models. This should allow you to iterate quickly over various ML models once the image registration is completed and saved to file. Finally, after every run it is advised to screen the logs and visual output for quality control.
# import libraries
import glob, os
import weahtr
import pandas as pd

# file with two columns:
# 'file_name' and 'text'
# the file name only lists the
# image file name (not the absolute long form path)
df = pd.read_csv("../data/character_training_data/labels.csv")

# setup model training
m = weahtr.model(
    model = "trocr", # model to use
    config = "./demo_input/format_1/format_1.yml", # config file
    labels = df, # data frame with the labels
    images = "image/path" # path with the images listed in the data frame
)

# initiate a training run
m.train()
To increase the representation of handwritten text, you can use synthetic data compiled from handwritten text databases. The package includes a function generate() in the generator class, which generates random numbers and a matching table of labels in a CSV file. The generator uses MNIST and UNIPEN data formatted by Sueiras et al. (2016) to compile these random images. The UNIPEN data only includes decimals (.,) and signs (-) to be used as additional modifiers, while MNIST data is dynamically loaded from the Torchvision library to include all handwritten MNIST numbers.
Warning
Paths are not validated for now. Make sure paths exist.
# import libraries
from weahtr.generator import generator

# initiate the setup
# listing the UNIPEN path
# generating numbers up to 3 digits long
# with a decimal separator and random sign
g = generator(
    data_path = "input/data",
    background = "background_image.png",
    values = 3,
    decimal = True,
    sign = True
)

# generate 10 random samples and put output
# in the designated path
g.generate(
    samples = 10,
    path = "~/output/path/"
)
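The synthetic samples can then feed the training step described above. A minimal sketch, assuming the generator writes a labels.csv with the same 'file_name' and 'text' columns the trainer expects:

# illustrative: train on the generated synthetic samples
import pandas as pd
import weahtr

# labels file assumed to be written by the generator
df = pd.read_csv("~/output/path/labels.csv")

m = weahtr.model(
    model = "trocr",
    config = "./demo_input/format_1/format_1.yml",
    labels = df,
    images = "~/output/path/"
)
m.train()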
The package includes three methods to align the source image with a known template or guides: the table, fft and features methods.
The table recognition method is based on polygon dilation and erosion to calculate the four corners of the largest table in an image. This assumes that the tables are well defined (bounded by a well defined grid). In addition, one needs a JSON guides file covering the full extent of the table. This therefore also requires you to set the rows to skip in the processing of the table content, should you use the template for the further division of the table into rows and columns. The latter is not required, as there is an automatic row and column detection routine which can be used.
Generally, this method is fast but can be unreliable on very noisy data. It also requires the tables to be relatively well constrained. For example, the routine will fail or return faulty registrations on tables with only row dividers.
The fft method uses translation- and rotation-invariant mapping in the frequency domain to align two images: the source image and an empty template of the table you want to reference.
This method is relatively slow, but rather robust to noisy input. In short, reference templates which are approximately the same as the source (content) you want to reference will work to some extent.
In this case, a guides file with the location of cells in the table is only required for the transcription processing of the cell content, not the table matching.
The features method uses the correspondence between key image features (i.e. recognizable patterns in the image) to align both images correctly. This method is computationally expensive but, when it works well, also the most accurate way of referencing data.
As with the fft method, a guides file with the location of cells in the table is only required for the transcription processing of the cell content, not the table matching.
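Whichever method you use, it is selected through the method argument of match(). Using the template object t from the earlier example, and assuming the method strings mirror the names above:

# table-based registration
t.match(method = "table")
# frequency-domain (fft) registration
t.match(method = "fft")
# feature-based registration, as in the earlier example
t.match(method = "features")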
In the analysis folder of the project you will find a number of demonstration scripts which should work within the provided Docker environment. These scripts allow you to experiment with the workflow, so you can familiarize yourself with the process.
These scripts rely on small demo datasets included in the GitHub repository, and are real-life examples of climate data recovery efforts. They include three data formats, from recovery efforts within COBECORE (formats 1 and 6) and the VUB Hydrology department (Yangambi). The included tables are in both portrait and landscape mode. For each table format, a configuration file (*.yml) and an empty template are provided. Experimenting with different table detection / registration methods will show that these methods perform better or worse depending on the type and quality of the image/table provided.
J. Sueiras et al., "Using Synthetic Character Database for Training Deep Learning Models Applied to Offline Handwritten Character Recognition", Proc. Intl. Conf. on Intelligent Systems Design and Applications (ISDA), Springer, 2016.