Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT

Naseem, Usman; Dunn, Adam G.; Khushi, Matloob; Kim, Jinman

arXiv:2107.04374 (cs)

[Submitted on 9 Jul 2021]

Abstract: The availability of biomedical text data and advances in natural language processing (NLP) have made new applications in biomedical NLP possible. Language models trained or fine-tuned using domain-specific corpora can outperform general models, but work to date in biomedical NLP has been limited in terms of corpora and tasks. We present BioALBERT, a domain-specific adaptation of A Lite Bidirectional Encoder Representations from Transformers (ALBERT), trained on biomedical (PubMed and PubMed Central) and clinical (MIMIC-III) corpora and fine-tuned for 6 different tasks across 20 benchmark datasets. Experiments show that BioALBERT outperforms the state of the art on named entity recognition (+11.09% BLURB score improvement), relation extraction (+0.80% BLURB score), sentence similarity (+1.05% BLURB score), document classification (+0.62% F1-score), and question answering (+2.83% BLURB score). It represents a new state of the art in 17 out of 20 benchmark datasets. By making BioALBERT models and data available, our aim is to help the biomedical NLP community avoid the computational costs of training and establish a new set of baselines for future efforts across a broad range of biomedical NLP tasks.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2107.04374 [cs.CL]
	(or arXiv:2107.04374v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2107.04374

Submission history

From: Usman Naseem
[v1] Fri, 9 Jul 2021 11:47:13 UTC (4,347 KB)
