New Release
We released adversarial training for both LM pre-training and fine-tuning, as well as f-divergence regularization.
Large-scale Adversarial training for LMs: ALUM code.
Hybrid Neural Network Model for Commonsense Reasoning: HNN code
If you want to use the old version, please use the following command to clone the code:
git clone -b v0.1 https://github.com/namisan/mt-dnn.git
Update
We regret to inform you that, due to a policy change, public storage for model sharing is no longer provided. We are working hard to find a solution.
This PyTorch package implements the Multi-Task Deep Neural Networks (MT-DNN) for Natural Language Understanding, as described in:
Xiaodong Liu*, Pengcheng He*, Weizhu Chen and Jianfeng Gao
Multi-Task Deep Neural Networks for Natural Language Understanding
ACL 2019
*: Equal contribution
Xiaodong Liu, Pengcheng He, Weizhu Chen and Jianfeng Gao
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
arXiv version
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Jiawei Han
On the Variance of the Adaptive Learning Rate and Beyond
ICLR 2020
Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Tuo Zhao
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
ACL 2020
Xiaodong Liu, Yu Wang, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao, Jianfeng Gao
The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding
ACL 2020
Hao Cheng and Xiaodong Liu and Lis Pereira and Yaoliang Yu and Jianfeng Gao
Posterior Differential Regularization with f-divergence for Improving Model Robustness
NAACL 2021
- python3.6
  Reference to download and install: https://www.python.org/downloads/release/python-360/
- install requirements
  > pip install -r requirements.txt
- Pull docker
  > docker pull allenlao/pytorch-mt-dnn:v1.3
- Run docker
  > docker run -it --rm --runtime nvidia allenlao/pytorch-mt-dnn:v1.3 bash
  Please refer to the following link if you are new to docker: https://docs.docker.com/
- Download data
  > sh download.sh
  For details on downloading the GLUE dataset, please refer to: https://gluebenchmark.com/
- Preprocess data
  > sh experiments/glue/prepro.sh
- Training
  > python train.py
  Note that we ran experiments on 4 V100 GPUs for base MT-DNN models. You may need to reduce batch size for other GPUs.
- MTL refinement: refine MT-DNN (shared layers), initialized with the pre-trained BERT model, via MTL using all GLUE tasks excluding WNLI to learn a new shared representation.
  Note that we ran this experiment on 8 V100 GPUs (32G) with a batch size of 32.
  - Preprocess GLUE data via the aforementioned script
  - Training:
    > scripts/run_mt_dnn.sh
- Finetuning: finetune MT-DNN to each of the GLUE tasks to get task-specific models.
  Here, we provide two examples, STS-B and RTE. You can use similar scripts to finetune all the GLUE tasks.
  - Finetune on the STS-B task
    > scripts/run_stsb.sh
    You should get about 90.5/90.4 on STS-B dev in terms of Pearson/Spearman correlation.
  - Finetune on the RTE task
    > scripts/run_rte.sh
    You should get about 83.8 on RTE dev in terms of accuracy.
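The STS-B result above is reported as a pair of correlations between predicted and gold similarity scores. As a minimal, illustrative sketch (not the repository's evaluation code), the two metrics could be computed as follows; the function names and the toy scores are assumptions made up for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical gold and predicted similarity scores (STS-B uses a 0-5 scale).
gold = [0.0, 1.5, 2.0, 3.5, 5.0]
pred = [0.2, 1.0, 2.4, 3.0, 4.9]
print(f"Pearson:  {pearson(gold, pred):.4f}")
print(f"Spearman: {spearman(gold, pred):.4f}")
```

Spearman only depends on the ordering of the scores, while Pearson also reflects how linearly the predictions track the gold values, which is why the two numbers are reported together.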
- Domain Adaptation on SciTail
  > scripts/scitail_domain_adaptation_bash.sh
- Domain Adaptation on SNLI
  > scripts/snli_domain_adaptation_bash.sh
- Preprocess data
  a) Download NER data to data/ner including: {train/valid/test}.txt
  b) Convert NER data to the canonical format:
     > python experiments/ner/prepro.py --data data/ner --output_dir data/canonical_data
  c) Preprocess the canonical data to the MT-DNN format:
     > python prepro_std.py --root_dir data/canonical_data --task_def experiments/ner/ner_task_def.yml --model bert-base-uncased
- Training
  > python train.py --data_dir <data-path> --init_checkpoint <bert-model> --train_dataset ner --test_dataset ner --task_def experiments/ner/ner_task_def.yml
- Preprocess data
  a) Download SQuAD data to data/squad including: {train/valid}.txt, and then rename the files to: {squad_train/squad_dev}.json
  b) Convert data to the MT-DNN format:
     > python experiments/squad/squad_prepro.py --root_dir data/canonical_data --task_def experiments/squad/squad_task_def.yml --model bert-base-uncased
- Training
  > python train.py --data_dir <data-path> --init_checkpoint <bert-base-uncased> --train_dataset squad,squad-v2 --test_dataset squad,squad-v2 --task_def experiments/squad/squad_task_def.yml
Adversarial training at the fine-tuning stage:
> python train.py --data_dir <data-path> --init_checkpoint <bert/mt-dnn-model> --train_dataset mnli --test_dataset mnli_matched,mnli_mismatched --task_def experiments/glue/glue_task_def.yml --adv_train --adv_opt 1
- Extracting embeddings of a pair text example
  > python extractor.py --do_lower_case --finput input_examples/pair-input.txt --foutput input_examples/pair-output.json --bert_model bert-base-uncased --checkpoint mt_dnn_models/mt_dnn_base.pt
  Note that the two texts of a pair are separated by the special token |||. You may refer to input_examples/pair-output.json as an example.
- Extracting embeddings of a single sentence example
  > python extractor.py --do_lower_case --finput input_examples/single-input.txt --foutput input_examples/single-output.json --bert_model bert-base-uncased --checkpoint mt_dnn_models/mt_dnn_base.pt
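As a minimal sketch of the pair-input convention mentioned above, the snippet below splits one line of text on the ||| separator. The helper name `split_pair` and the sample sentences are hypothetical; this is not the parsing code used by extractor.py.

```python
SEPARATOR = "|||"  # special token that separates the two texts of a pair


def split_pair(line):
    """Split one input line into (first_text, second_text).

    Raises ValueError if the line does not contain exactly one separator.
    """
    parts = line.split(SEPARATOR)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one {SEPARATOR!r} in: {line!r}")
    first, second = (part.strip() for part in parts)
    return first, second


# Hypothetical example line in the pair format.
example = "A man is playing a guitar. ||| Someone is making music."
premise, hypothesis = split_pair(example)
print(premise)
print(hypothesis)
```

Rejecting lines with a missing or repeated separator is a design choice of this sketch; it makes malformed input fail early instead of silently producing a wrong pair.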
- Gradient Accumulation
  If you have small GPUs, you may need to use gradient accumulation to keep training stable.
  For example, if you use the flag --grad_accumulation_step 4 during training, the actual batch size will be batch_size * 4.
- FP16
  The current version of MT-DNN also supports FP16 training; please install apex to use it.
  You just need to turn on the flag --fp16 during training.
  Please refer to the script: scripts/run_mt_dnn_gc_fp16.sh
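To make the gradient accumulation arithmetic above concrete, here is a small, self-contained sketch in plain Python. It uses a toy one-parameter model and assumes each micro-batch gradient is scaled by 1 / grad_accumulation_step before being summed; whether train.py scales in exactly this way is an assumption, and the data and function names are made up for illustration.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, batch_size, grad_accumulation_step):
    """Average the gradients of several micro-batches before one optimizer step."""
    total = 0.0
    for step in range(grad_accumulation_step):
        micro_batch = data[step * batch_size:(step + 1) * batch_size]
        # Scale each micro-batch gradient so the sum is an average, not a sum.
        total += mse_gradient(w, micro_batch) / grad_accumulation_step
    return total


# Hypothetical toy data: 8 (x, y) pairs, processed as 4 micro-batches of size 2.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5
batch_size = 2
grad_accumulation_step = 4

accumulated = accumulated_gradient(w, data, batch_size, grad_accumulation_step)
full_batch = mse_gradient(w, data)  # one batch of size batch_size * 4 = 8

print(f"effective batch size: {batch_size * grad_accumulation_step}")
print(f"accumulated gradient: {accumulated:.6f}")
print(f"full-batch gradient:  {full_batch:.6f}")
```

Under these assumptions the accumulated gradient matches the gradient of a single batch four times larger (up to floating-point rounding), while only one micro-batch has to fit in memory at a time.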
Here, we go through how to convert a Chinese TensorFlow BERT model into the MT-DNN format.
- Download the BERT model from the Google BERT repository: https://github.com/google-research/bert
- Run the following script to convert it to the MT-DNN format
  > python scripts/convert_tf_to_pt.py --tf_checkpoint_root chinese_L-12_H-768_A-12/ --pytorch_checkpoint_path chinese_L-12_H-768_A-12/bert_base_chinese.pt
- Publish pretrained Tensorflow checkpoints.
Yes, we released the pretrained shared embeddings via MTL, which are aligned to the BERT base/large models: mt_dnn_base.pt and mt_dnn_large.pt.
To obtain similar models:
- Run
  > sh scripts/run_mt_dnn.sh
  and then pick the best checkpoint based on the average dev performance of MNLI/RTE.
- Strip the task-specific layers via scripts/strip_model.py.
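The checkpoint selection step above can be sketched as follows. This is only an illustration: it assumes "average dev performance" means the plain mean of the MNLI and RTE dev scores, and the checkpoint names and numbers are hypothetical.

```python
def pick_best_checkpoint(dev_scores):
    """Return the checkpoint name with the highest mean of its MNLI and RTE dev scores."""
    def average(name):
        scores = dev_scores[name]
        return (scores["mnli"] + scores["rte"]) / 2
    return max(dev_scores, key=average)


# Hypothetical checkpoint names and dev accuracies, for illustration only.
dev_scores = {
    "model_0.pt": {"mnli": 83.1, "rte": 74.0},
    "model_1.pt": {"mnli": 84.0, "rte": 78.5},
    "model_2.pt": {"mnli": 84.2, "rte": 77.1},
}
best = pick_best_checkpoint(dev_scores)
print(best)
```

Note that the checkpoint with the highest MNLI score alone is not necessarily the one selected, since both tasks contribute equally to the average in this sketch.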
For the SciTail/SNLI tasks, the purpose is to test the generalization of the learned embedding and how easily it adapts to a new domain, rather than to use complicated model structures, for a direct comparison with BERT. Thus, we use a linear projection in all domain adaptation settings.
The difference is in the QNLI dataset. Please refer to the official GLUE homepage for more details. If you want to formulate QNLI as a pair-wise ranking task, as in our paper, make sure that you use the old QNLI data.
Then run the prepro script with the flag: > sh experiments/glue/prepro.sh --old_glue
If you have trouble accessing the old version of the data, please contact the GLUE team.
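To illustrate what a pair-wise ranking formulation looks like, the sketch below takes relevance scores for several candidate answers to one question, normalizes them with a softmax, and returns the negative log-likelihood of the correct candidate. The function name, the `gamma` scaling factor and its default, and the scores are illustrative assumptions, not the repository's code.

```python
import math


def ranking_loss(scores, positive_index, gamma=1.0):
    """Negative log-likelihood of the positive candidate under a softmax over all candidates."""
    scaled = [gamma * s for s in scores]
    max_score = max(scaled)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scaled))
    return log_norm - scaled[positive_index]


# Hypothetical relevance scores for the candidates of one question.
scores = [2.0, 0.5, -1.0]
print(f"loss if the top-scored candidate is correct:    {ranking_loss(scores, 0):.4f}")
print(f"loss if the lowest-scored candidate is correct: {ranking_loss(scores, 2):.4f}")
```

The loss is small when the correct candidate receives the highest score and grows as it is ranked lower, which is what distinguishes this formulation from classifying each question-answer pair independently.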
We can use the multi-task refinement model to run prediction and produce a reasonable result, but achieving a better result requires fine-tuning on each task. It is worth noting that the paper on arXiv is a little outdated and is based on the old GLUE dataset. We will update the paper as mentioned below.
BERT pytorch is from: https://github.com/huggingface/pytorch-pretrained-BERT
BERT: https://github.com/google-research/bert
We also used some code from: https://github.com/kevinduh/san_mrc
- Pretrained UniLM: https://github.com/microsoft/unilm
- Pretrained Response Generation Model: https://github.com/microsoft/DialoGPT
- Internal MT-DNN repo: https://github.com/microsoft/mt-dnn
@inproceedings{liu2019mt-dnn,
title = "Multi-Task Deep Neural Networks for Natural Language Understanding",
author = "Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P19-1441",
pages = "4487--4496"
}
@article{liu2019mt-dnn-kd,
title={Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding},
author={Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng},
journal={arXiv preprint arXiv:1904.09482},
year={2019}
}
@inproceedings{liu2019radam,
title={On the Variance of the Adaptive Learning Rate and Beyond},
author={Liu, Liyuan and Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Han, Jiawei},
booktitle={International Conference on Learning Representations},
year={2020}
}
@inproceedings{jiang2019smart,
title={SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization},
author={Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Zhao, Tuo},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
year={2020}
}
@article{liu2020mtmtdnn,
title={The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding},
author={Liu, Xiaodong and Wang, Yu and Ji, Jianshu and Cheng, Hao and Zhu, Xueyun and Awa, Emmanuel and He, Pengcheng and Chen, Weizhu and Poon, Hoifung and Cao, Guihong and Gao, Jianfeng},
journal={arXiv preprint arXiv:2002.07972},
year={2020}
}
@inproceedings{liu2020mtmtdnn,
title = "The {M}icrosoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding",
author={Liu, Xiaodong and Wang, Yu and Ji, Jianshu and Cheng, Hao and Zhu, Xueyun and Awa, Emmanuel and He, Pengcheng and Chen, Weizhu and Poon, Hoifung and Cao, Guihong and Gao, Jianfeng},
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-demos.16",
year = "2020"
}
@inproceedings{cheng2020posterior,
title={Posterior Differential Regularization with f-divergence for Improving Model Robustness},
author={Cheng, Hao and Liu, Xiaodong and Pereira, Lis and Yu, Yaoliang and Gao, Jianfeng},
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
year = "2021",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.85",
doi = "10.18653/v1/2021.naacl-main.85",
}
For help or issues using MT-DNN, please submit a GitHub issue.
For personal communication related to this package, please contact Xiaodong Liu (xiaodl@microsoft.com), Yu Wang (yuwan@microsoft.com), Pengcheng He (penhe@microsoft.com), Weizhu Chen (wzchen@microsoft.com), Jianshu Ji (jianshuj@microsoft.com), Hao Cheng (chehao@microsoft.com) or Jianfeng Gao (jfgao@microsoft.com).