Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence (NAACL 2019).zip

22 files in total

py: 16

json: 3

xml: 2


Tags: bert, sentiment analysis, sentiment mining

163 views · Uploaded 2023-12-01 14:52:58 · 455KB ZIP

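The title, "Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence (NAACL 2019)", refers to a natural language processing study presented at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Its core technology is BERT (Bidirectional Encoder Representations from Transformers), a pre-trained deep learning model released by Google in 2018 that delivered substantial gains on many natural language understanding tasks.

1. **BERT model**: BERT is a bidirectional encoder built on the Transformer architecture. Its main innovation is a pair of pre-training tasks, masked language modeling and next sentence prediction, which let the model learn representations conditioned on both the left and the right context. This gives it stronger semantic understanding than conventional unidirectional models (for example, left-to-right LSTM or GRU language models).
2. **Sentiment analysis**: Sentiment analysis is a major application of natural language processing. Its goal is to automatically identify and extract subjective information from text, including sentiment polarity (positive, negative, neutral), sentiment intensity, and the sentiment target (aspect-based sentiment analysis). In aspect-based sentiment analysis, the task is not only to judge the sentiment of the text as a whole but also to identify the sentiment attached to specific aspects, such as product features or service.
3. **Sentiment mining**: Sentiment mining extends sentiment analysis. Beyond recognizing sentiment, it extracts deeper information such as sentiment topics, trigger words, and causes.
4. **Sentiment prediction**: A trained model can predict the sentiment of unseen text, which is widely used in product recommendation, public opinion analysis, and market research.
5. **run_classifier_TABSA.py**: A Python script, most likely used to fine-tune and evaluate BERT on aspect-based sentiment analysis (its docstring reads "BERT finetuning runner"). "TABSA" most likely stands for "Targeted Aspect-Based Sentiment Analysis", in which a sentiment is predicted for each (target, aspect) pair.
6. **modeling.py, processor.py, evaluation.py, tokenization.py, optimization.py**: These files cover the BERT model implementation, data processing, model evaluation, tokenization, and optimization, respectively. `modeling.py` likely contains the BERT model definition, `processor.py` converts raw data into the form the model consumes, `evaluation.py` computes performance metrics, `tokenization.py` handles tokenization, and `optimization.py` implements the optimizer.
7. **convert_tf_checkpoint_to_pytorch.py**: This script most likely converts model weights saved by TensorFlow into a format PyTorch can load. BERT was originally released for TensorFlow, whereas this project is written in PyTorch.
8. **data**: This folder holds the datasets used for training and evaluation. According to the file list, it contains a `sentihood` subfolder with three JSON files (train, dev, test) and a `semeval2014` subfolder with two XML files of restaurant reviews.
9. **generate**: This folder contains the `generate_*.py` scripts plus `make.sh`. Judging by the file names, they generate the task-specific input data for each variant (`BERT_single`, `NLI_M`, `QA_M`, `NLI_B`, `QA_B`) from the raw datasets, rather than producing predictions from a trained model.

In short, the archive contains a complete project that applies BERT to aspect-based sentiment analysis, covering data preprocessing, model training, evaluation, and checkpoint conversion. Studying these files shows how a pre-trained BERT model can be fine-tuned for a specific task.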
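The "auxiliary sentence" idea in the title can be illustrated with a minimal sketch: each (target, aspect) pair is turned into a second sentence, so the task becomes sentence-pair classification, which BERT handles natively. The function name `build_auxiliary_sentence` and the exact template wording below are illustrative assumptions, loosely modeled on the `QA_M`, `NLI_M`, `QA_B`, and `NLI_B` variants named in the file list; the text actually produced by the scripts in `generate` may differ.

```python
def build_auxiliary_sentence(target, aspect, variant, polarity=None):
    """Return a hypothetical auxiliary sentence for one (target, aspect) pair.

    variant:  "QA_M", "NLI_M", "QA_B", or "NLI_B" (names taken from the file list).
    polarity: needed only for the binary (_B) variants, which pair the review
              with one candidate polarity and ask for a yes/no decision.
    """
    if variant == "QA_M":
        # Question form; the classifier then picks one of several polarities.
        return "what do you think of the {} of {} ?".format(aspect, target)
    if variant == "NLI_M":
        # Pseudo-sentence form: just the target and the aspect.
        return "{} - {}".format(target, aspect)
    if variant in ("QA_B", "NLI_B"):
        if polarity is None:
            raise ValueError("binary variants need a candidate polarity")
        if variant == "QA_B":
            return "the polarity of the aspect {} of {} is {}".format(
                aspect, target, polarity)
        return "{} - {} - {}".format(target, aspect, polarity)
    raise ValueError("unknown variant: {}".format(variant))


# Made-up review sentence, paired with its auxiliary sentence (text_a, text_b).
review = "location - 1 is a safe place to live"
for variant in ("QA_M", "NLI_M"):
    aux = build_auxiliary_sentence("location - 1", "safety", variant)
    print(variant, "->", (review, aux))
```

With the `_M` variants, one sentence pair is built per (target, aspect) and the label is the polarity itself; with the `_B` variants, one pair is built per candidate polarity and the label is a yes/no answer.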


Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence (NAACL 2019).zip (22 files)

- evaluation.py (16KB)
- generate/
  - generate_sentihood_BERT_single.py (5KB)
  - generate_semeval_NLI_B_QA_B.py (2KB)
  - generate_sentihood_NLI_M.py (4KB)
  - data_utils_sentihood.py (3KB)
  - generate_semeval_QA_M.py (5KB)
  - generate_semeval_BERT_single.py (5KB)
  - generate_sentihood_NLI_B_QA_B.py (8KB)
  - generate_semeval_NLI_M.py (5KB)
  - generate_sentihood_QA_M.py (4KB)
  - make.sh (151B)
- data/
  - sentihood/
    - sentihood-train.json (990KB)
    - sentihood-test.json (493KB)
    - sentihood-dev.json (247KB)
  - semeval2014/
    - Restaurants_Test_Gold.xml (351KB)
    - Restaurants_Train.xml (1.18MB)
- modeling.py (20KB)
- optimization.py (7KB)
- tokenization.py (9KB)
- run_classifier_TABSA.py (20KB)
- convert_tf_checkpoint_to_pytorch.py (3KB)
- processor.py (18KB)
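As a rough idea of what the raw files under `data/semeval2014` look like to a preprocessing script, here is a minimal sketch that reads aspect categories and polarities from SemEval-2014-style XML. The element and attribute names (`sentence`, `text`, `aspectCategory`, `category`, `polarity`) are assumed from the public SemEval-2014 Task 4 restaurant format, the sample review is made up, and `read_aspect_categories` is a hypothetical helper rather than a function from this archive.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed SemEval-2014 restaurant layout.
SAMPLE_XML = """<sentences>
  <sentence id="1">
    <text>The food was great but the service was slow.</text>
    <aspectCategories>
      <aspectCategory category="food" polarity="positive"/>
      <aspectCategory category="service" polarity="negative"/>
    </aspectCategories>
  </sentence>
</sentences>"""


def read_aspect_categories(xml_text):
    """Return one (text, category, polarity) tuple per annotated category."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence in root.iter("sentence"):
        text = sentence.findtext("text", default="")
        for cat in sentence.iter("aspectCategory"):
            rows.append((text, cat.get("category"), cat.get("polarity")))
    return rows


rows = read_aspect_categories(SAMPLE_XML)
for row in rows:
    print(row)
```

To read a file on disk instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.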

```python
# coding=utf-8

"""BERT finetuning runner."""

from __future__ import absolute_import, division, print_function

import argparse
import collections
import logging
import os
import random

import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler, SequentialSampler
from tqdm import tqdm, trange

import tokenization
from modeling import BertConfig, BertForSequenceClassification
from optimization import BERTAdam
from processor import (Semeval_NLI_B_Processor, Semeval_NLI_M_Processor,
                       Semeval_QA_B_Processor, Semeval_QA_M_Processor,
                       Semeval_single_Processor, Sentihood_NLI_B_Processor,
                       Sentihood_NLI_M_Processor, Sentihood_QA_B_Processor,
                       Sentihood_QA_M_Processor, Sentihood_single_Processor)

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer):
    """Loads a data file into a list of `InputBatch`s."""

    # Map each label string to an integer id.
    label_map = {}
    for (i, label) in enumerate(label_list):
        label_map[label] = i

    features = []
    for (ex_index, example) in enumerate(tqdm(examples)):
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)

        if tokens_b:
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[0:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append("[SEP]")
        segment_ids.append(0)

        if tokens_b:
            for token in tokens_b:
                tokens.append(token)
                segment_ids.append(1)
            tokens.append("[SEP]")
            segment_ids.append(1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length

        label_id = label_map[example.label]

        features.append(
            InputFeatures(
                input_ids=input_ids,
                input_mask=input_mask,
                segment_ids=segment_ids,
                label_id=label_id))
    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument("--task_name",
                        default=None,
                        type=str,
                        required=True,
                        choices=["sentihood_single", "sentihood_NLI_M", "sentihood_QA_M",
                                 "sentihood_NLI_B", "sentihood_QA_B", "semeval_single",
                                 "semeval_NLI_M", "semeval_QA_M", "semeval_NLI_B", "semeval_QA_B"],
                        help="The name of the task to train.")
    parser.add_argument("--data_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--vocab_file",
                        default=None,
                        type=str,
                        required=True,
                        help="The vocabulary file that the BERT model was trained on.")
    parser.add_argument("--bert_config_file",
                        default=None,
                        type=str,
                        required=True,
                        help="The config json file corresponding to the pre-trained BERT model. \n"
                             "This specifies the model architecture.")
    parser.add_argument("--output_dir",
                        default=None,
                        type=str,
                        required=True,
                        help="The output directory where the model checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--init_checkpoint",
                        default=None,
                        type=str,
                        help="Initial checkpoint (usually from a pre-trained BERT model).")
    parser.add_argument("--init_eval_checkpoint",
                        default=None,
                        type=str,
                        help="Initial checkpoint (usually from a pre-trained BERT model + classifier).")
    parser.add_argument("--do_save_model",
                        default=False,
                        action='store_true',
                        help="Whether to save model.")
    parser.add_argument("--eval_test",
                        default=False,
                        action='store_true',
                        help="Whether to run eval on the test set.")
    # (The preview ends here; the remainder of main() is not shown.)
```
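The feature-building logic in the preview can be tried out without BERT weights or a vocabulary file. The sketch below is a simplified stand-in, not the project's code: it uses a whitespace tokenizer instead of WordPiece, pads with literal `[PAD]` tokens instead of id 0 so the result stays readable, and the names `truncate_seq_pair` and `build_pair_features` are hypothetical. It shows how a review and its auxiliary sentence are truncated, joined with `[CLS]`/`[SEP]`, and given segment ids and an attention mask.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Pop tokens from the longer list (in place) until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def build_pair_features(text_a, text_b, max_seq_length):
    """Return (tokens, segment_ids, input_mask), each of length max_seq_length."""
    # Assumption: whitespace splitting stands in for the WordPiece tokenizer.
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()

    # Reserve three positions for [CLS], [SEP], [SEP].
    truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + sentence A + first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)

    # Pad all three lists to the fixed length; padding is masked out.
    padding = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return tokens, segment_ids, input_mask


tokens, segment_ids, input_mask = build_pair_features(
    "location - 1 is a safe place to live",  # made-up review sentence
    "location - 1 - safety",                 # made-up auxiliary sentence
    max_seq_length=16)
for name, values in (("tokens", tokens),
                     ("segment_ids", segment_ids),
                     ("input_mask", input_mask)):
    print(name, values)
```

Because the heuristic always trims the longer sequence, the short auxiliary sentence usually survives intact and the review loses its tail, which is the behavior the comment in `_truncate_seq_pair` argues for.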
