Status | Published |
UM-PCorpus: A Large Portuguese-Chinese Parallel Corpus | |
Lidia S. Chao1; Derek F. Wong1; Chi Hong Ao1; Ana Luísa Leal2 | |
2018 | |
Conference Name | Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) |
Source Publication | LREC 2018 Conference Proceedings |
Conference Date | May 7-12, 2018 |
Conference Place | Miyazaki, Japan |
Publication Place | Paris, France |
Publisher | ELRA – European Language Resources Association |
Abstract | This paper describes the creation of a high-quality parallel corpus for Portuguese and Chinese, extracted from parallel and comparable documents. The corpus is constructed using an on-line alignment platform, UM-pAligner. UM-pAligner consists of two main alignment components, parallel sentence identification and a classification model, which acquire parallel sentences from either parallel or comparable texts in a semi-automatic manner. The extracted parallel sentences are manually verified. The resulting corpus comprises around 6 million parallel sentences covering newswire, legal, subtitle, technical, and general on-line publication texts. About 1 million parallel sentences have been compiled and made available for download at the NLP2CT website. |
Keyword | Portuguese-Chinese; Alignment Platform; UM-pAligner; Parallel Corpus; Machine Translation |
Language | English |
Document Type | Conference paper |
Collection | DEPARTMENT OF PORTUGUESE |
Affiliation | 1.NLP2CT Lab / Department of Computer and Information Science University of Macau, Macau SAR, China 2.Department of Portuguese, University of Macau, Macau SAR, China |
First Author Affiliation | University of Macau |
Recommended Citation GB/T 7714 | Lidia S. Chao, Derek F. Wong, Chi Hong Ao, et al. UM-PCorpus: A Large Portuguese-Chinese Parallel Corpus[C], Paris, France: ELRA – European Language Resources Association, 2018. |
APA | Lidia S. Chao, Derek F. Wong, Chi Hong Ao, & Ana Luísa Leal (2018). UM-PCorpus: A Large Portuguese-Chinese Parallel Corpus. LREC 2018 Conference Proceedings. |
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.