EMNIST: an extension of MNIST to handwritten letters
Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik
The MARCS Institute for Brain, Behaviour and Development
Western Sydney University
Penrith, Australia 2751
Email: g.cohen@westernsydney.edu.au
Abstract—The MNIST dataset has become a standard benchmark for learning, classification and computer vision systems. Contributing to its widespread adoption are the understandable and intuitive nature of the task, its relatively small size and storage requirements, and the accessibility and ease-of-use of the database itself. The MNIST database was derived from a larger dataset known as the NIST Special Database 19, which contains digits and uppercase and lowercase handwritten letters. This paper introduces a variant of the full NIST dataset, which we have called Extended MNIST (EMNIST), which follows the same conversion paradigm used to create the MNIST dataset. The result is a set of datasets that constitute more challenging classification tasks involving letters and digits, and that share the same image structure and parameters as the original MNIST task, allowing for direct compatibility with all existing classifiers and systems. Benchmark results are presented along with a validation of the conversion process through the comparison of the classification results on converted NIST digits and the MNIST digits.
I. INTRODUCTION
The importance of good benchmarks and standardized problems cannot be overstated, especially in competitive and fast-paced fields such as machine learning and computer vision. Such tasks provide a quick, quantitative and fair means of analyzing and comparing different learning approaches and techniques. This allows researchers to quickly gain insight into the performance and peculiarities of methods and algorithms, especially when the task is an intuitive and conceptually simple one.
As a single dataset may only cover a specific task, the existence of a varied suite of benchmark tasks is important in allowing a more holistic approach to assessing and characterizing the performance of an algorithm or system. In the machine learning community, there are several standardized datasets that are widely used and have become highly competitive. These include the MNIST dataset [1], the CIFAR-10 and CIFAR-100 datasets [2], the STL-10 dataset [3], and the Street View House Numbers (SVHN) dataset [4].
Comprising a 10-class handwritten digit classification task and first introduced in 1998, the MNIST dataset remains the most widely known and used dataset in the computer vision and neural networks community. However, a good dataset needs to represent a sufficiently challenging problem both to remain useful and to ensure its longevity [5]. This is perhaps where MNIST has suffered in the face of the increasingly high accuracies achieved using deep learning and convolutional neural networks. Multiple research groups have published accuracies above 99.7% [6]–[10], a classification accuracy at which the dataset labeling can be called into question. Thus, it has become more of a means to test and validate a classification system than a meaningful or challenging benchmark.
The accessibility of the MNIST dataset has almost certainly contributed to its widespread use. The entire dataset is relatively small (by comparison to more recent benchmarking datasets), free to access and use, and is encoded and stored in an entirely straightforward manner. The encoding does not make use of complex storage structures, compression, or proprietary data formats. For this reason, it is remarkably easy to access and include the dataset from any platform or through any programming language.
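This straightforward encoding (the IDX format, shared by MNIST and EMNIST) can be read with a few lines of standard-library code. The sketch below is illustrative rather than an official loader; the file path and the optional gzip wrapping are assumptions about how the files were downloaded.

```python
import gzip
import struct

def read_idx_images(path):
    """Read an IDX image file as used by MNIST/EMNIST.

    Layout (big-endian): a magic number (2051 for image files),
    the item count, rows, cols, then one unsigned byte per pixel.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        data = f.read(n * rows * cols)
    # Return a list of flat per-image pixel buffers; real code
    # would typically reshape these into a numpy array instead.
    size = rows * cols
    return [data[i * size:(i + 1) * size] for i in range(n)]
```

The complete absence of compression schemes beyond optional gzip, and of any container format, is what makes the dataset equally easy to read from C, MATLAB, or a shell script.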
The MNIST database is a subset of a much larger dataset
known as the NIST Special Database 19 [11]. This dataset
contains both handwritten numerals and letters and represents
a much larger and more extensive classification task, along
with the possibility of adding more complex tasks such as
writer identification, transcription tasks and case detection.
The NIST dataset, by contrast to MNIST, has remained
difficult to access and use. Driven by the higher cost and
availability of storage when it was collected, the NIST dataset
was originally stored in a remarkably efficient and compact
manner. Although source code to access the data is provided,
it remains challenging to use on modern computing platforms.
For this reason, the NIST recently released a second edition
of the NIST dataset [12]. The second edition of the dataset
is easier to access, but the structure of the dataset, and the
images contained within, differ from that of MNIST and are
not directly compatible.
The NIST dataset has been used occasionally in neural
network systems. Many classifiers make use of only the digit
classes [13], [14], whilst others tackle the letter classes as
well [15]–[18]. Each paper tackles the task of formulating the
classification tasks in a slightly different manner, varying such
fundamental aspects as the number of classes to include, the
training and testing splits, and the preprocessing of the images.
In order to bolster the use of this dataset, there is a clear
need to create a suite of well-defined datasets that thoroughly
specify the nature of the classification task and the structure of
the dataset, thereby allowing for easy and direct comparisons
between sets of results.
arXiv:1702.05373v1 [cs.CV] 17 Feb 2017
This paper introduces such a suite of datasets, known as Extended Modified NIST (EMNIST). Derived from the NIST Special Database 19, these datasets are intended to represent a more challenging classification task for neural networks and learning systems. By directly matching the image specifications, dataset organization and file formats found in the original MNIST dataset, these datasets are designed as drop-in replacements for existing networks and systems.
This paper introduces these datasets, documents the conversion process used to create the images, and presents a set of benchmark results for the dataset. These results are then used to further characterize and validate the datasets.
A. The MNIST and NIST Dataset
The NIST Special Database 19 [11] contains handwritten digits and characters collected from over 500 writers. The dataset contains binary scans of the handwriting sample collection forms, and individually segmented and labeled characters which were extracted from the forms. The characters include numerical digits and both uppercase and lowercase letters. The database was published as a complete collection in 1995 [11], and then re-released using a more modern file format in September 2016 [12]. The dataset itself contains, and supersedes, a number of previously released NIST handwriting datasets, such as the Special Databases 1, 3 and 7.
The MNIST dataset is derived from a small subset of the numerical digits contained within the NIST Special Databases 1 and 3, which were converted using the method outlined in [1]. The NIST Special Database 19, which represents the final collection of handwritten characters in that series of datasets, contains additional handwritten digits and an extensive collection of uppercase and lowercase handwritten letters.
The authors and collators of the NIST dataset also suggest
that the data contained in Special Database 7 (which is
included in Special Database 19) be used exclusively as a
testing set as the samples were collected from high school
students and pose a more challenging problem.
The NIST dataset was intended to provide multiple optical character recognition tasks and therefore presents the character data in five separate organizations, referred to as data hierarchies. These are as follows:
• By Page: This hierarchy contains the unprocessed full-page binary scans of handwriting sample forms. The character data used in the other hierarchies was collected through a standardized set of forms which the writers were asked to complete. 3699 forms were completed.
• By Author: This hierarchy contains individually segmented handwritten character images organized by writer. It allows for such tasks as writer identification but offers little in the way of classification benefit as each grouping contains digits from multiple classes.
• By Field: This organization contains the digits and characters sorted by the field on the collection form in which they appear. This is primarily useful for segmenting the digit classes as they appear in their own isolated fields.
TABLE I
BREAKDOWN OF THE NUMBER OF AVAILABLE TRAINING AND TESTING SAMPLES IN THE NIST SPECIAL DATABASE 19 USING THE ORIGINAL TRAINING AND TESTING SPLITS.

            Type        No. Classes   Training   Testing     Total
By Class    Digits               10    344,307    58,646   402,953
            Uppercase            26    208,363    11,941   220,304
            Lowercase            26    178,998    12,000   190,998
            Total                62    731,668    82,587   814,255
By Merge    Digits               10    344,307    58,646   402,953
            Letters              37    387,361    23,941   411,302
            Total                47    731,668    82,587   814,255
MNIST [1]   Digits               10     60,000    10,000    70,000
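The row and column totals in Table I are internally consistent, which is easy to confirm with a short check (the numbers below are copied directly from the table's By Class block):

```python
# (type, classes, training, testing) rows from Table I, By Class hierarchy.
rows = [
    ("Digits",    10, 344_307, 58_646),
    ("Uppercase", 26, 208_363, 11_941),
    ("Lowercase", 26, 178_998, 12_000),
]

# Each row's total is training + testing samples.
totals = [tr + te for _, _, tr, te in rows]
assert totals == [402_953, 220_304, 190_998]

# The hierarchy totals match the table's Total row.
assert sum(tr for _, _, tr, _ in rows) == 731_668
assert sum(te for _, _, _, te in rows) == 82_587
assert sum(totals) == 814_255
```

The same arithmetic holds for the By Merge block, since it repartitions the identical 814,255 samples into 47 classes.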
• By Class: This represents the most useful organization
from a classification perspective as it contains the seg-
mented digits and characters arranged by class. There are
62 classes comprising [0-9], [a-z] and [A-Z]. The data is
also split into a suggested training and testing set.
• By Merge: This data hierarchy addresses an interesting
problem in the classification of handwritten digits, which
is the similarity between certain uppercase and lowercase
letters. Indeed, these effects are often plainly visible when
examining the confusion matrix resulting from the full
classification task on the By Class dataset. This variant
on the dataset merges certain classes, creating a 47-class
classification task. The merged classes, as suggested by
the NIST, are for the letters C, I, J, K, L, M, O, P, S, U,
V, W, X, Y and Z.
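The 47-class By Merge labeling can be reproduced from the 62 By Class labels and the fifteen merged letters listed above. The sketch below is illustrative only; the character labels are for readability and are not the dataset's on-disk label encoding.

```python
import string

# The 62 By Class labels: digits, uppercase letters, lowercase letters.
CLASSES = list(string.digits) + list(string.ascii_uppercase) + list(string.ascii_lowercase)

# Letters whose uppercase and lowercase forms are merged in By Merge
# (C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z, per the text above).
MERGED = set("cijklmopsuvwxyz")

def build_merge_map():
    """Map each of the 62 By Class labels to its By Merge label."""
    merge = {}
    for c in CLASSES:
        if c in MERGED:
            merge[c] = c.upper()  # fold the lowercase form into its uppercase class
        else:
            merge[c] = c
    return merge

mapping = build_merge_map()
# 62 classes minus 15 merged lowercase forms leaves 47 target classes.
assert len(set(mapping.values())) == 47
```

Folding these visually ambiguous pairs removes a source of irreducible label confusion, which is why the By Merge confusion matrix is cleaner than the By Class one.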
The conversion process described in this paper and the provided code is applicable to all hierarchies with the exception of the By Page hierarchy, as it contains fundamentally different images. However, the primary focus of this work rests with the By Class and By Merge organizations as they encompass classification tasks that are directly compatible with the standard MNIST dataset classification task.
Table I shows the breakdown of the original training and testing sets specified in the releases of the NIST Special Database 19. Both the By Class and By Merge hierarchies contain 814,255 handwritten characters, consisting of a suggested 731,668 training samples and 82,587 testing samples. It should be noted, however, that almost half of the total samples are handwritten digits.
The By Author hierarchy represents an interesting opportunity to formulate fundamentally new classification tasks, such as writer identification from handwriting samples, but this is beyond the scope of this work.
II. METHODOLOGY
This paper introduces the EMNIST datasets and then applies an OPIUM-based classifier to the classification tasks they define. The purpose of the classifier is to provide a means of validating and characterizing the datasets whilst also providing benchmark classification results. The nature and