Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users
LAURENT VALENTIN JOSPIN, University of Western Australia
WRAY BUNTINE, Monash University
FARID BOUSSAID, University of Western Australia
HAMID LAGA, Murdoch University
MOHAMMED BENNAMOUN, University of Western Australia
Modern deep learning methods have equipped researchers and engineers with incredibly powerful tools to
tackle problems that previously seemed impossible. However, since deep learning methods operate as black
boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics
offer a formalism to understand and quantify the uncertainty associated with deep neural network predictions.
This paper provides a tutorial for researchers and scientists who are using machine learning, especially deep
learning, with an overview of the relevant literature and a complete toolset to design, implement, train, use
and evaluate Bayesian neural networks.
CCS Concepts: • Mathematics of computing → Probability and statistics; • Computing methodologies
→ Neural networks; Bayesian network models; Ensemble methods; Regularization.
Additional Key Words and Phrases: Bayesian methods, Bayesian Deep Learning, Approximate Bayesian
methods
ACM Reference Format:
Laurent Valentin Jospin, Wray Buntine, Farid Boussaid, Hamid Laga, and Mohammed Bennamoun. 2020.
Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users. ACM Comput. Surv. 1, 1 (July 2020),
35 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Deep learning has led to a revolution in machine learning, providing solutions to tackle more and
more complex and challenging real-life problems. However, deep learning models are prone to
overfitting, which adversely affects their generalization capabilities. Deep learning models also
tend to be overconfident about their predictions (when they do provide a confidence interval). All
of this is problematic for applications such as self-driving cars [74], medical diagnostics [38] or
trading and finance [11], where silent failure can lead to dramatic outcomes. Consequently, many
approaches have been proposed to mitigate this risk, especially via the use of stochastic neural
networks to estimate the uncertainty in the model prediction. The Bayesian paradigm provides a
Authors’ addresses: Laurent Valentin Jospin, laurent.jospin@research.uwa.edu.au, University of Western Australia, 35
Stirling Hwy, Crawley, Western Australia, 6009; Wray Buntine, wray.buntine@monash.edu, Monash University, Wellington
Rd, Monash, Victoria, 3800; Farid Boussaid, farid.boussaid@uwa.edu.au, University of Western Australia, 35 Stirling Hwy,
Crawley, Western Australia, 6009; Hamid Laga, h.laga@murdoch.edu.au, Murdoch University, 90 South St, Murdoch, Western
Australia, 6150; Mohammed Bennamoun, mohammed.bennamoun@uwa.edu.au, University of Western Australia, 35 Stirling
Hwy, Crawley, Western Australia, 6009.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2020 Association for Computing Machinery.
0360-0300/2020/7-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
ACM Comput. Surv., Vol. 1, No. 1, Article . Publication date: July 2020.
arXiv:2007.06823v1 [cs.LG] 14 Jul 2020