EMNIST: an extension of MNIST to handwritten letters
Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik
The MARCS Institute for Brain, Behaviour and Development
Western Sydney University
Penrith, Australia 2751
Email: g.cohen@westernsydney.edu.au
Abstract—The MNIST dataset has become a standard benchmark for learning, classification and computer vision systems. Contributing to its widespread adoption are the understandable and intuitive nature of the task, its relatively small size and storage requirements, and the accessibility and ease-of-use of the database itself. The MNIST database was derived from a larger dataset known as the NIST Special Database 19, which contains digits and uppercase and lowercase handwritten letters. This paper introduces a variant of the full NIST dataset, which we have called Extended MNIST (EMNIST), which follows the same conversion paradigm used to create the MNIST dataset. The result is a set of datasets that constitute more challenging classification tasks involving letters and digits, and that share the same image structure and parameters as the original MNIST task, allowing for direct compatibility with all existing classifiers and systems. Benchmark results are presented along with a validation of the conversion process through the comparison of the classification results on converted NIST digits and the MNIST digits.
I. INTRODUCTION
The importance of good benchmarks and standardized problems cannot be overstated, especially in competitive and fast-paced fields such as machine learning and computer vision. Such tasks provide a quick, quantitative and fair means of analyzing and comparing different learning approaches and techniques. This allows researchers to quickly gain insight into the performance and peculiarities of methods and algorithms, especially when the task is an intuitive and conceptually simple one.
As a single dataset may only cover a specific task, the existence of a varied suite of benchmark tasks is important in allowing a more holistic approach to assessing and characterizing the performance of an algorithm or system. In the machine learning community, there are several standardized datasets that are widely used and have become highly competitive. These include the MNIST dataset [1], the CIFAR-10 and CIFAR-100 datasets [2], the STL-10 dataset [3], and the Street View House Numbers (SVHN) dataset [4].
Comprising a 10-class handwritten digit classification task and first introduced in 1998, the MNIST dataset remains the most widely known and used dataset in the computer vision and neural networks community. However, a good dataset needs to represent a sufficiently challenging problem both to remain useful and to ensure its longevity [5]. This is perhaps where MNIST has suffered in the face of the increasingly high accuracies achieved using deep learning and convolutional neural networks. Multiple research groups have published accuracies above 99.7% [6]–[10], a classification accuracy at which the dataset labeling can be called into question. Thus, it has become more of a means to test and validate a classification system than a meaningful or challenging benchmark.
The accessibility of the MNIST dataset has almost certainly contributed to its widespread use. The entire dataset is relatively small (by comparison to more recent benchmarking datasets), free to access and use, and is encoded and stored in an entirely straightforward manner. The encoding does not make use of complex storage structures, compression, or proprietary data formats. For this reason, it is remarkably easy to access and include the dataset from any platform or through any programming language.
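This straightforward encoding (the IDX format, shared by MNIST and EMNIST) can be read with a few lines of standard-library code. The sketch below is illustrative rather than an official loader; the file path and the optional gzip wrapping are assumptions about how the files were downloaded.

```python
import gzip
import struct

def read_idx_images(path):
    """Read an IDX image file as used by MNIST/EMNIST.

    Layout (big-endian): a magic number (2051 for image files),
    the item count, rows, cols, then one unsigned byte per pixel.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        data = f.read(n * rows * cols)
    # Return a list of flat per-image pixel buffers; real code
    # would typically reshape these into a numpy array instead.
    size = rows * cols
    return [data[i * size:(i + 1) * size] for i in range(n)]
```

The complete absence of compression schemes beyond optional gzip, and of any container format, is what makes the dataset equally easy to read from C, MATLAB, or a shell script.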
The MNIST database is a subset of a much larger dataset
known as the NIST Special Database 19 [11]. This dataset
contains both handwritten numerals and letters and represents
a much larger and more extensive classification task, along
with the possibility of adding more complex tasks such as
writer identification, transcription tasks and case detection.
The NIST dataset, by contrast to MNIST, has remained
difficult to access and use. Driven by the higher cost and
availability of storage when it was collected, the NIST dataset
was originally stored in a remarkably efficient and compact
manner. Although source code to access the data is provided,
it remains challenging to use on modern computing platforms.
For this reason, the NIST recently released a second edition
of the NIST dataset [12]. The second edition of the dataset
is easier to access, but the structure of the dataset, and the
images contained within, differ from that of MNIST and are
not directly compatible.
The NIST dataset has been used occasionally in neural
network systems. Many classifiers make use of only the digit
classes [13], [14], whilst others tackle the letter classes as
well [15]–[18]. Each paper tackles the task of formulating the
classification tasks in a slightly different manner, varying such
fundamental aspects as the number of classes to include, the
training and testing splits, and the preprocessing of the images.
In order to bolster the use of this dataset, there is a clear
need to create a suite of well-defined datasets that thoroughly
specify the nature of the classification task and the structure of
the dataset, thereby allowing for easy and direct comparisons
between sets of results.
arXiv:1702.05373v1 [cs.CV] 17 Feb 2017
This paper introduces such a suite of datasets, known as Extended Modified NIST (EMNIST). Derived from the NIST Special Database 19, these datasets are intended to represent a more challenging classification task for neural networks and learning systems. By directly matching the image specifications, dataset organization and file formats found in the original MNIST dataset, these datasets are designed as drop-in replacements for existing networks and systems.
This paper introduces these datasets, documents the conversion process used to create the images, and presents a set of benchmark results for the dataset. These results are then used to further characterize and validate the datasets.
A. The MNIST and NIST Dataset
The NIST Special Database 19 [11] contains handwritten digits and characters collected from over 500 writers. The dataset contains binary scans of the handwriting sample collection forms, and individually segmented and labeled characters which were extracted from the forms. The characters include numerical digits and both uppercase and lowercase letters. The database was published as a complete collection in 1995 [11], and then re-released using a more modern file format in September 2016 [12]. The dataset itself contains, and supersedes, a number of previously released NIST handwriting datasets, such as the Special Databases 1, 3 and 7.
The MNIST dataset is derived from a small subset of the numerical digits contained within the NIST Special Databases 1 and 3, which were converted using the method outlined in [1]. The NIST Special Database 19, which represents the final collection of handwritten characters in that series of datasets, contains additional handwritten digits and an extensive collection of uppercase and lowercase handwritten letters.
The authors and collators of the NIST dataset also suggest
that the data contained in Special Database 7 (which is
included in Special Database 19) be used exclusively as a
testing set as the samples were collected from high school
students and pose a more challenging problem.
The NIST dataset was intended to provide multiple optical character recognition tasks and therefore presents the character data in five separate organizations, referred to as data hierarchies. These are as follows:
• By Page: This hierarchy contains the unprocessed full-page binary scans of handwriting sample forms. The character data used in the other hierarchies was collected through a standardized set of forms which the writers were asked to complete. 3699 forms were completed.
• By Author: This hierarchy contains individually segmented handwritten character images organized by writer. It allows for such tasks as writer identification but offers little in the way of classification benefit as each grouping contains digits from multiple classes.
• By Field: This organization contains the digits and characters sorted by the field on the collection form in which they appear. This is primarily useful for segmenting the digit classes as they appear in their own isolated fields.
TABLE I
BREAKDOWN OF THE NUMBER OF AVAILABLE TRAINING AND TESTING SAMPLES IN THE NIST SPECIAL DATABASE 19 USING THE ORIGINAL TRAINING AND TESTING SPLITS.

            Type        No. Classes   Training   Testing     Total
By Class    Digits               10    344,307    58,646   402,953
            Uppercase            26    208,363    11,941   220,304
            Lowercase            26    178,998    12,000   190,998
            Total                62    731,668    82,587   814,255
By Merge    Digits               10    344,307    58,646   402,953
            Letters              37    387,361    23,941   411,302
            Total                47    731,668    82,587   814,255
MNIST [1]   Digits               10     60,000    10,000    70,000
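The row and column totals in Table I are internally consistent, which is easy to confirm with a short check (the numbers below are copied directly from the table's By Class block):

```python
# (type, classes, training, testing) rows from Table I, By Class hierarchy.
rows = [
    ("Digits",    10, 344_307, 58_646),
    ("Uppercase", 26, 208_363, 11_941),
    ("Lowercase", 26, 178_998, 12_000),
]

# Each row's total is training + testing samples.
totals = [tr + te for _, _, tr, te in rows]
assert totals == [402_953, 220_304, 190_998]

# The hierarchy totals match the table's Total row.
assert sum(tr for _, _, tr, _ in rows) == 731_668
assert sum(te for _, _, _, te in rows) == 82_587
assert sum(totals) == 814_255
```

The same arithmetic holds for the By Merge block, since it repartitions the identical 814,255 samples into 47 classes.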
• By Class: This represents the most useful organization
from a classification perspective as it contains the seg-
mented digits and characters arranged by class. There are
62 classes comprising [0-9], [a-z] and [A-Z]. The data is
also split into a suggested training and testing set.
• By Merge: This data hierarchy addresses an interesting
problem in the classification of handwritten digits, which
is the similarity between certain uppercase and lowercase
letters. Indeed, these effects are often plainly visible when
examining the confusion matrix resulting from the full
classification task on the By Class dataset. This variant
on the dataset merges certain classes, creating a 47-class
classification task. The merged classes, as suggested by
the NIST, are for the letters C, I, J, K, L, M, O, P, S, U,
V, W, X, Y and Z.
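The 47-class By Merge labeling can be reproduced from the 62 By Class labels and the fifteen merged letters listed above. The sketch below is illustrative only; the character labels are for readability and are not the dataset's on-disk label encoding.

```python
import string

# The 62 By Class labels: digits, uppercase letters, lowercase letters.
CLASSES = list(string.digits) + list(string.ascii_uppercase) + list(string.ascii_lowercase)

# Letters whose uppercase and lowercase forms are merged in By Merge
# (C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z, per the text above).
MERGED = set("cijklmopsuvwxyz")

def build_merge_map():
    """Map each of the 62 By Class labels to its By Merge label."""
    merge = {}
    for c in CLASSES:
        if c in MERGED:
            merge[c] = c.upper()  # fold the lowercase form into its uppercase class
        else:
            merge[c] = c
    return merge

mapping = build_merge_map()
# 62 classes minus 15 merged lowercase forms leaves 47 target classes.
assert len(set(mapping.values())) == 47
```

Folding these visually ambiguous pairs removes a source of irreducible label confusion, which is why the By Merge confusion matrix is cleaner than the By Class one.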
The conversion process described in this paper and the provided code is applicable to all hierarchies with the exception of the By Page hierarchy, as it contains fundamentally different images. However, the primary focus of this work rests with the By Class and By Merge organizations as they encompass classification tasks that are directly compatible with the standard MNIST dataset classification task.
Table I shows the breakdown of the original training and testing sets specified in the releases of the NIST Special Database 19. Both the By Class and By Merge hierarchies contain 814,255 handwritten characters, consisting of a suggested 731,668 training samples and 82,587 testing samples. It should be noted, however, that almost half of the total samples are handwritten digits.
The By Author hierarchy represents an interesting opportunity to formulate fundamentally new classification tasks, such as writer identification from handwriting samples, but this is beyond the scope of this work.
II. METHODOLOGY
This paper introduces the EMNIST datasets and then applies an OPIUM-based classifier to the classification tasks they define. The purpose of the classifier is to provide a means of validating and characterizing the datasets whilst also providing benchmark classification results. The nature and