Research on English-Chinese Machine Translation Based on Sentence Grouping

doi:10.3969/j.issn.1671-1122.2021.07.008

Abstract:

Although training neural machine translation models on larger data sets improves their performance, information about the content categories and structures of the sentences in those data sets has not been fully exploited, which leaves room for improvement. This paper proposes a neural machine translation architecture based on sentence grouping, which adds an attention-based discriminator after the encoder. It also proposes a method for computing a structural information vector for each sentence; these vectors can be used to obtain group labels with an unsupervised method. Before training, the sentences in the data set are grouped by content category and sentence structure to obtain group labels. The model is then trained jointly on these labels and the parallel corpus, which helps it identify the group each sentence belongs to and make fuller use of the information in the data set. Extensive comparative experiments support the grouping idea: Transformer models trained with the grouping architecture produce better translations, improving the BLEU score by up to 1.2 over the vanilla Transformer.

Key words: machine translation, sentence grouping, structural information

CLC number: TP309

ZHAO Yuran, MENG Kui. Research on English-Chinese Machine Translation Based on Sentence Grouping[J]. Netinfo Security, 2021, 21(7): 63-71.

Dataset source	Sentence pairs	Domain
CWMT	3,050,000	News
UN	15,886,041	Politics
AI Challenger 2018	3,262,499	Spoken language
Medical dataset	369,984	Medical

Dataset group	Training set (sentences)	Tokens	Validation set (sentences)	Test set (sentences)
News	354,500	7,795,353	1,250	1,250
Politics	312,500	7,792,509	1,250	1,250
Spoken language	637,500	7,796,759	1,250	1,250
Medical	365,000	7,798,532	1,250	1,250

Sentence length	Training set (sentences)	Validation set (sentences)	Test set (sentences)
[1, 10]	322,382	1,368	1,443
[11, 17]	366,472	1,462	1,590
[18, 24]	368,691	944	851
[25, ∞)	402,455	1,226	1,126
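
The length ranges in the table above can be turned into group labels with a simple lookup. The following is a minimal sketch, not the authors' implementation: the bucket boundaries come from the table, while the function name `length_bucket` and the choice to measure length in whitespace-separated tokens are assumptions made for illustration (the table does not state the unit of length).

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three ranges, taken from the table;
# anything longer falls into the open-ended last range.
UPPER_BOUNDS = [10, 17, 24]
BUCKET_NAMES = ["[1, 10]", "[11, 17]", "[18, 24]", "[25, inf)"]


def length_bucket(sentence: str) -> str:
    """Return the length-range label for a sentence.

    Assumption: length is counted in whitespace-separated tokens.
    """
    n_tokens = len(sentence.split())
    if n_tokens < 1:
        raise ValueError("empty sentence has no length bucket")
    # bisect_left returns the index of the first bound >= n_tokens,
    # so a sentence of exactly 10 tokens still maps to the first range.
    return BUCKET_NAMES[bisect_left(UPPER_BOUNDS, n_tokens)]


if __name__ == "__main__":
    print(length_bucket("the model is trained with group labels"))
```

Keeping the bounds in a sorted list means the ranges can be changed without touching the lookup logic.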

Dataset group	Training set (sentences)	Validation set (sentences)	Test set (sentences)
Group 1	722,172	3,036	2,534
Group 2	109,725	372	468
Group 3	330,716	818	1,136
Group 4	297,387	774	862
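
The abstract notes that group labels can be obtained from structural information vectors by an unsupervised method. The sketch below illustrates one such method, plain k-means, on toy two-dimensional vectors. The choice of k-means, the function name `kmeans`, and the toy data are assumptions for illustration only; this page does not state how the four groups in the table above were produced, and the paper's structural vectors are not reproduced here.

```python
import math
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Cluster `vectors` into `k` groups with plain k-means.

    Returns (labels, centroids). Illustrative only: real structural
    information vectors would replace the toy inputs used below.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]

    def assign():
        # Label each vector with the index of its nearest centroid.
        return [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(), centroids


if __name__ == "__main__":
    toy_vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, _ = kmeans(toy_vectors, k=2)
    print(labels)  # the first three vectors share one label, the last three the other
```

The resulting integer labels could then be attached to each sentence pair as its group label before training.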

Group	Spoken language	News	Politics	Medical
Spoken language	-	1.01	1.81	1.98
News	1.01	-	1.48	1.88
Politics	1.81	1.48	-	1.98
Medical	1.98	1.88	1.98	-