Handwriting Recognition Using Deep Learning

A UTMIST Project by Justin Tran, Fernando Assad, Kiara Chong, Armaan Lalani, and Eeshan Narula.

Introduction

Recognizing handwritten text is a long-studied problem in machine learning, with one of the most well-known datasets being the MNIST dataset [1] of handwritten digits. While recognizing individual digits is a solved problem, researchers have been looking for ways to recognize a full body of text at once, since this makes digitizing documents much easier. We present a solution to this problem of Handwritten Text Recognition (HTR), together with an overview of current advances in the field.

Motivation and Goal

Recognizing written text is key to many applications that depend on digitizing documents, including healthcare, insurance, and banking. The biggest challenge arises from the high variation of styles in the way people write, especially in cursive. While many software applications already implement HTR (e.g., the Photos app on iPhones), these are far from perfect, and research in the field is still very active.

The project aims to develop an algorithm that can read a corpus of text by segmenting it into lines and applying a transformer-based implementation of HTR [2]. This method was published in 2021 and was shown to outperform the then state-of-the-art models, notably CRNN-based implementations.

Related Work

HTR is a sub-problem of Optical Character Recognition (OCR) — converting typed or handwritten text from images into machine-encoded text. OCR systems are divided into two modules: a detection module and a recognition module. The detection module aims to localize blocks of text within the image via an object detection model, while the recognition module aims to understand and transcribe the detected text.

Typically, text recognition is achieved through a combined CNN and RNN (CRNN) architecture, where the CNN handles image understanding and the RNN handles text generation. However, more recent studies have shown that the Transformer architecture brings significant improvements to text recognition models. This has led to the development of hybrid architectures in which a CNN still serves as the backbone of the model.
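As a point of reference, the sketch below shows the general shape of such a CNN + RNN recognizer in PyTorch. The layer sizes and the fixed input height of 32 pixels are illustrative assumptions, not the configuration of any particular published model.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Illustrative CNN + RNN recognizer: a small convolutional backbone extracts
    a feature sequence, a bidirectional LSTM models it, and a linear layer
    predicts per-timestep character logits (e.g. for CTC decoding)."""
    def __init__(self, num_chars, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * 8, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_chars)

    def forward(self, x):                              # x: (batch, 1, 32, width)
        f = self.cnn(x)                                # (batch, 128, 8, width/4)
        f = f.permute(0, 3, 1, 2).flatten(2)           # (batch, width/4, 128*8)
        out, _ = self.rnn(f)                           # sequence modelling
        return self.fc(out)                            # per-timestep character logits
```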

To further improve text recognition, recent studies have investigated the ability of Transformers to replace these CNN backbones entirely. This has resulted in higher-accuracy models such as TrOCR, an end-to-end Transformer-based architecture that combines a pre-trained image Transformer with a pre-trained text Transformer.
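For illustration, TrOCR is available through the Hugging Face transformers library. The sketch below uses one of the publicly released handwritten-text checkpoints, which is not necessarily the exact configuration discussed later in this article.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Publicly released handwritten-text checkpoint (any TrOCR checkpoint with the
# same processor/model classes is used the same way).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line.png").convert("RGB")          # one segmented text line
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```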

Dataset

We used the IAM dataset [3] to train and test our model. It includes 13,353 images of handwritten text lines, which serve as the inputs to our HTR algorithm (annotations at the word and document levels are also provided). For each document, bounding boxes for the text lines are provided as well, and we used them when developing the segmentation step of our model. The data are freely available.

Network structure (Segmentation)

The algorithm used to perform word segmentation is based on the paper ‘Scale Space Technique for Word Segmentation in Handwritten Documents’ [5]. We initially attempted a ResNet architecture for this task, but issues with training forced us to pivot to an alternative method. The algorithm proposed in the paper considers the scale-space behavior of blobs in images containing lines of text. Its underlying idea is to use Gaussian filters to generate an image’s scale space, i.e., a family of progressively smoothed images from which fine details are gradually removed.
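As a rough illustration of the scale-space idea, a grayscale image can be smoothed with Gaussians of increasing standard deviation; the specific sigma values below are arbitrary.

```python
from scipy.ndimage import gaussian_filter

def scale_space(gray, sigmas=(1, 2, 4, 8)):
    """Return a family of progressively smoothed versions of `gray`
    (a 2-D image array); finer details disappear as sigma grows."""
    gray = gray.astype(float)
    return [gaussian_filter(gray, sigma=s) for s in sigmas]
```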

To accurately segment the lines within a document, projection profile techniques are used. The projection profile is obtained by summing the pixel intensity values along each row of the image, giving one value per row. Lines are identified as local peaks in this profile after applying Gaussian smoothing to suppress false positives.
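A minimal sketch of this step, assuming a grayscale image in which ink is dark (so the inverted intensities are summed per row); the smoothing sigma and minimum peak distance are placeholder values.

```python
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def line_positions(gray, sigma=5, min_gap=20):
    """Estimate the row indices of text lines from the projection profile."""
    profile = (255.0 - gray).sum(axis=1)             # one value per row (ink mass)
    smoothed = gaussian_filter1d(profile, sigma)     # suppress noisy local peaks
    peaks, _ = find_peaks(smoothed, distance=min_gap)
    return peaks
```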

After line segmentation, second-order differential Gaussian filters are applied along both axes of the image to form a blob (connected region) representation. Convolving the image with these second-order differential Gaussians creates a scale-space representation in which the blobs appear brighter or darker than the background. By adjusting the scale parameters of the filters, the blobs transition from covering individual characters to covering whole words.
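The blob step can be approximated as below with an anisotropic second-order Gaussian derivative filter, followed by thresholding and connected-component labeling. The sigmas and threshold are illustrative and would be tuned so that blobs grow from characters to words.

```python
from scipy.ndimage import gaussian_filter, label, find_objects

def word_blobs(gray, sigma_x=8.0, sigma_y=3.0, rel_thresh=0.3):
    """Return bounding-box slices of candidate word blobs in a grayscale line image."""
    ink = (255.0 - gray) / 255.0                       # dark ink -> high values
    # Anisotropic second-order Gaussian derivative along y (order=(2, 0)),
    # stretched along x so neighbouring characters merge into one blob.
    response = -gaussian_filter(ink, sigma=(sigma_y, sigma_x), order=(2, 0))
    mask = response > rel_thresh * response.max()      # keep the strongest responses
    blobs, _ = label(mask)                             # connected-component labeling
    return find_objects(blobs)
```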

Network Structure (Classification)

The architecture we used is based on the paper “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models” by Li et al. [2]. We used this architecture for the classification stage that follows segmentation. It is composed of an encoder and a decoder, which are described below.

The image is first resized and broken into a batch of smaller 16x16 patches, which are then flattened and linearly projected to form what we call patch embeddings. Each patch is also given a position embedding based on its location in the image. These inputs are then passed into a stack of encoder blocks, each composed of a multi-head self-attention module and a feed-forward network. This process is shown in the figure below.
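A minimal sketch of this patch-embedding and encoder stage in PyTorch; the hidden size, head count, and layer count are placeholders, not the values of the pre-trained TrOCR encoder.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Split a 384x384 image into 16x16 patches, project them to embeddings,
    add learned position embeddings, and run a Transformer encoder stack."""
    def __init__(self, dim=256, heads=4, layers=4, img=384, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)      # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))   # position embedding
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, x):                                   # x: (batch, 3, 384, 384)
        patches = self.proj(x).flatten(2).transpose(1, 2)   # (batch, 576, dim)
        return self.encoder(patches + self.pos)
```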

Image retrieved from [2] explaining the classification network.

The outputs of the encoder are then fed into a decoder, whose architecture is almost the same as that of the encoder. The difference is an additional “encoder-decoder” attention module inserted between the two original components (self-attention and feed-forward). The decoder embeddings are then projected to the size of the vocabulary, and after applying the softmax function and beam search, we obtain the final output text.
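A sketch of this decoding stage follows; for brevity it uses greedy decoding in place of beam search, and the special-token ids and model sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Transformer decoder with encoder-decoder attention, projected to the vocabulary."""
    def __init__(self, vocab, dim=256, heads=4, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, layers)
        self.to_vocab = nn.Linear(dim, vocab)

    def forward(self, tokens, memory):                       # memory: encoder outputs
        tgt = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)       # causal self-attn + cross-attn
        return self.to_vocab(out)                            # (batch, seq, vocab) logits

def greedy_decode(decoder, memory, bos=1, eos=2, max_len=128):
    """Greedy stand-in for beam search: repeatedly pick the most likely next token."""
    tokens = torch.full((memory.size(0), 1), bos, dtype=torch.long)
    for _ in range(max_len):
        logits = decoder(tokens, memory)
        next_tok = logits[:, -1].softmax(-1).argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos).all():
            break
    return tokens
```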

Training

Because the segmentation step does not use machine learning, only the classification network needed to be trained. Input text lines were first resized to 384x384 resolution so they could be divided into 16x16 patches. We split the IAM dataset into 90% training data and 10% validation data. For the test set, 5 full-page handwritten documents were fed to the model starting at the segmentation algorithm, so that the pipeline is evaluated in its entirety, and the output of the classification network was scored using the Levenshtein distance [4] divided by the length of the ground-truth label. We refer to this metric as the character error.

Recursive implementation of the Levenshtein distance of two strings a and b.
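A minimal sketch of that recursion, together with the character-error metric described above (the naive recursion is exponential; in practice one would memoize it or use a dynamic-programming table).

```python
def levenshtein(a: str, b: str) -> int:
    """Recursive Levenshtein distance between strings a and b."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    if a[0] == b[0]:
        return levenshtein(a[1:], b[1:])
    return 1 + min(
        levenshtein(a[1:], b),       # deletion
        levenshtein(a, b[1:]),       # insertion
        levenshtein(a[1:], b[1:]),   # substitution
    )

def character_error(prediction: str, label: str) -> float:
    """Levenshtein distance normalized by the ground-truth length."""
    return levenshtein(prediction, label) / len(label)

# e.g. character_error("Hyundai", "Honda") == 0.6, as in the Results section
```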

The network was trained using cross-entropy loss and the Adam optimizer, with a learning rate of 0.00001 and a weight decay of 0.0001. An inverse square root schedule was used as the learning rate scheduler, with an initial warmup learning rate of 1e-8 and 500 warmup updates. The remaining hyperparameters, the number of epochs and the batch size, were set to 50 and 2, respectively. Due to financial constraints, we could not conduct a hyperparameter search.
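A sketch of this optimization setup in PyTorch, using the usual warmup-then-inverse-square-root form of the schedule; the model below is a stand-in for the classification network.

```python
import math
import torch

base_lr, warmup_init_lr, warmup_steps = 1e-5, 1e-8, 500

def inverse_sqrt(step):
    """Multiplier on base_lr: linear warmup from warmup_init_lr, then ~1/sqrt(step) decay."""
    if step < warmup_steps:
        lr = warmup_init_lr + (base_lr - warmup_init_lr) * step / warmup_steps
    else:
        lr = base_lr * math.sqrt(warmup_steps / step)
    return lr / base_lr

model = torch.nn.Linear(10, 10)  # placeholder for the classification network
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
# call optimizer.step() then scheduler.step() once per training update
```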

Results

Overall, our model was able to predict the text of the test documents fairly well: the test documents were read with an average character error of 0.17, compared with 1.16 when we tried a CRNN approach. For reference, if the model output the word “Hyundai” when the label was “Honda,” the character error would be 0.6. During classification training, the final average training cross-entropy loss was 0.118, and the final average validation loss was 0.386.

However, it should be noted that the data used were either pre-segmented text lines (for classification training) or documents whose text lines were reasonably level, i.e., parallel to the page edges and not slanted. The model does not perform as well when the text lines are not level, due to the limitations of the segmentation algorithm.

Plot of the training and validation loss over epochs

Conclusion

The goal of our project was to create a model that takes in a handwritten document and outputs its text in machine-encoded form so that the original handwritten content is discernible. The low character error on the test set, as well as the low validation and training losses, demonstrate that our model accomplished this goal, with some limitations.

One limitation of our model is that it cannot segment text lines that are not level with the page. Another major limitation is its inability to classify text lines that contain too much noise, such as a non-white background. For future work, we would like to improve the segmentation stage, perhaps with a state-of-the-art segmentation neural network, and employ filtering techniques to remove excessive noise.

References

[1] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), 141–142.

[2] Li, M., Lv, T., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., & Wei, F. (2021). TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models. arXiv preprint arXiv:2109.10282.

[3] Marti, U., & Bunke, H. (2002). The IAM-database: An English Sentence Database for Off-line Handwriting Recognition. International Journal on Document Analysis and Recognition, 5, 39–46.

[4] Gooskens, C., & Heeringa, W. (2004). Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change, 16, 189–207. doi:10.1017/S0954394504163023.

[5] Manmatha, R., & Srimal, N. (n.d.). Scale Space Technique for Word Segmentation in Handwritten Documents. University of Massachusetts.

University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify