PROC. OF THE IEEE, NOVEMBER 1998
Gradient-Based Learning Applied to Document Recognition
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner
Abstract
Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique. Given an appropriate network architecture, Gradient-Based Learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional Neural Networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques.
Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called Graph Transformer Networks (GTN), allows such multi-module systems to be trained globally using Gradient-Based methods so as to minimize an overall performance measure.
Two systems for on-line handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of Graph Transformer Networks.
A Graph Transformer Network for reading bank checks is also described. It uses Convolutional Neural Network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Keywords
Neural Networks, OCR, Document Recognition, Machine Learning, Gradient-Based Learning, Convolutional Neural Networks, Graph Transformer Networks, Finite State Transducers.
Nomenclature
GT Graph transformer.
GTN Graph transformer network.
HMM Hidden Markov model.
HOS Heuristic oversegmentation.
K-NN K-nearest neighbor.
NN Neural network.
OCR Optical character recognition.
PCA Principal component analysis.
RBF Radial basis function.
RS-SVM Reduced-set support vector method.
SDNN Space displacement neural network.
SVM Support vector method.
TDNN Time delay neural network.
V-SVM Virtual support vector method.
The authors are with the Speech and Image Processing Services Research Laboratory, AT&T Labs-Research, 100 Schulz Drive, Red Bank, NJ 07701. E-mail: {yann,leonb,yoshua,haffner}@research.att.com. Yoshua Bengio is also with the Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, C.P. 6128 Succ. Centre-Ville, 2920 Chemin de la Tour, Montréal, Québec, Canada H3C 3J7.
I. Introduction
Over the last several years, machine learning techniques, particularly when applied to neural networks, have played an increasingly important role in the design of pattern recognition systems. In fact, it could be argued that the availability of learning techniques has been a crucial factor in the recent success of pattern recognition applications such as continuous speech recognition and handwriting recognition.
The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning, and less on hand-designed heuristics. This is made possible by recent progress in machine learning and computer technology. Using character recognition as a case study, we show that hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images. Using document understanding as a case study, we show that the traditional way of building recognition systems by manually integrating individually designed modules can be replaced by a unified and well-principled design paradigm, called Graph Transformer Networks, that allows training all the modules to optimize a global performance criterion.
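The idea of training all modules against a single global criterion can be illustrated with a minimal sketch. This is not the paper's GTN machinery, and far simpler: two differentiable modules, a toy "feature extractor" and a toy "classifier", are composed end to end, and the gradient of one global loss is backpropagated through both. All sizes, names, and the squared-error criterion here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Module 1: a trainable "feature extractor" (here just a linear map + tanh).
W1 = rng.normal(scale=0.1, size=(8, 4))
# Module 2: a trainable "classifier" (a linear map to one output score).
W2 = rng.normal(scale=0.1, size=(4, 1))

x = rng.normal(size=(16, 8))   # a batch of raw input patterns
y = rng.normal(size=(16, 1))   # targets for this toy demonstration

def forward(x):
    h = np.tanh(x @ W1)        # features produced by module 1
    out = h @ W2               # prediction produced by module 2
    return h, out

def loss(out):
    # One overall performance measure for the whole system.
    return float(np.mean((out - y) ** 2))

h, out = forward(x)
before = loss(out)

# Backpropagation: the chain rule carries the gradient of the global
# loss through module 2 and then into module 1, so both modules are
# adjusted jointly rather than designed or trained in isolation.
d_out = 2.0 * (out - y) / len(x)
dW2 = h.T @ d_out
d_h = d_out @ W2.T
dW1 = x.T @ (d_h * (1.0 - h ** 2))   # tanh'(z) = 1 - tanh(z)^2

lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2

_, out2 = forward(x)
after = loss(out2)
```

Because every module is differentiable, the same chain-rule step that trains the classifier also adjusts the feature extractor, which is the essence of global gradient-based training of a multi-module system.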
Since the early days of pattern recognition it has been known that the variability and richness of natural data, be it speech, glyphs, or other types of patterns, make it almost impossible to build an accurate recognition system entirely by hand. Consequently, most pattern recognition systems are built using a combination of automatic learning techniques and hand-crafted algorithms. The usual method of recognizing individual patterns consists in dividing the system into two main modules shown in figure 1. The first module, called the feature extractor, transforms the input patterns so that they can be represented by low-dimensional vectors or short strings of symbols that (a) can be easily matched or compared, and (b) are relatively invariant with respect to transformations and distortions of the input patterns that do not change their nature. The feature extractor contains most of the prior knowledge and is rather specific to the task. It is also the focus of most of the design effort, because it is often entirely hand-crafted. The classifier, on the other hand, is often general-purpose and trainable. One of the main problems with this approach is that the recognition accuracy is largely determined by the ability of the designer to come up with an appropriate set of features. This turns out to be a daunting task which, unfortunately, must be redone for each new problem. A large amount of the pattern recognition literature is devoted to describing and comparing the relative