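In early 21st-century linguistics, a corpus is defined as a large, structured language resource, and "structured" is the key word. The English name comes from the Latin corpus[6], which roughly means "body", so text corpus can be read literally as "a body of text". In practice, however, a corpus has to do more than store language data: linguists and other workers (for example, people training AI systems to process language) need to be able to use it. Simply collecting the data is therefore not enough, and the people who build a corpus also have to organize that data systematically[1].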
Input: San Pedro is a town on the southern part of the island of Ambergris Caye in the Belize District of the nation of Belize, in Central America. According to 2015 mid-year estimates, the town has a population of about 16,444. It is the second-largest town in the Belize District and largest in the Belize Rural South constituency.
Output:
1. San Pedro is a town on the southern part of the island of Ambergris Caye in the Belize District of the nation of Belize, in Central America.
2. According to 2015 mid-year estimates, the town has a population of about 16,444.
3. It is the second-largest town in the Belize District and largest in the Belize Rural South constituency.
(The text has been split into separate sentences.)
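As a rough illustration of this kind of sentence splitting, the sketch below applies one naive rule to the first two sentences of the example: a sentence is assumed to end at ".", "!" or "?" when that mark is followed by whitespace and an uppercase letter. The rule and the names `SENTENCE_BOUNDARY` and `split_sentences` are assumptions made for this example, not the method of any particular corpus tool.

```python
import re

# Assumed boundary rule (illustrative only): end-of-sentence punctuation,
# then whitespace, then an uppercase letter starting the next sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


text = (
    "San Pedro is a town on the southern part of the island of Ambergris Caye "
    "in the Belize District of the nation of Belize, in Central America. "
    "According to 2015 mid-year estimates, the town has a population of "
    "about 16,444."
)

sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(f"{number}. {sentence}")
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason practical segmenters rely on richer rules or trained models.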
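Lemmatization: teaching the computer the different inflected forms that a single word can take, so that it treats all of those forms as the same word during analysis. For example, the English words wolf and wolves both refer to the same animal, the latter being the plural; lemmatization tells the computer that wolf and wolves are different forms of one word[11].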
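The idea can be sketched with a tiny lookup table that maps inflected forms to a shared base form. The table `LEMMA_TABLE` and the function `lemmatize` are hypothetical and cover only the wolf/wolves example; real lemmatizers use large dictionaries or learned models.

```python
from collections import Counter

# Hypothetical toy table: inflected form -> lemma (base form).
LEMMA_TABLE = {
    "wolf": "wolf",
    "wolves": "wolf",
}


def lemmatize(token):
    """Return the lemma for a token, or the lowercased token if it is unknown."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


tokens = "The wolf saw two wolves".split()

# Counting by lemma groups "wolf" and "wolves" under a single entry.
lemma_counts = Counter(lemmatize(token) for token in tokens)
print(lemma_counts["wolf"])
```

Without the lemmatization step, a frequency count over the same tokens would list wolf and wolves as two unrelated words.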
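Parse tree: a tree-shaped data structure that is used very often in linguistic analysis. Put simply, it lays out the syntactic structure of a string of words according to the grammar; a simple English sentence such as John hit the ball, for example, can be drawn as a parse tree[12].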
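One way to hold such a tree in a program is as nested tuples, each made of a label followed by its children. The sketch below encodes John hit the ball under one common textbook analysis (S → N VP, VP → V NP, NP → D N); the choice of labels and the helper names `leaves` and `bracketed` are assumptions made for illustration.

```python
# Each node is (label, child, child, ...); a child that is a plain string is a word.
tree = (
    "S",
    ("N", "John"),
    ("VP",
     ("V", "hit"),
     ("NP", ("D", "the"), ("N", "ball"))),
)


def leaves(node):
    """Collect the words of the tree from left to right."""
    _label, *children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in bracket notation, e.g. (NP (D the) (N ball))."""
    label, *children = node
    parts = [child if isinstance(child, str) else bracketed(child) for child in children]
    return "(" + " ".join([label] + parts) + ")"


print(" ".join(leaves(tree)))
print(bracketed(tree))
```

Reading the leaves from left to right recovers the original sentence, while the bracket notation is a compact text form in which parsed sentences are often stored.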
Liu, V., & Curran, J. R. (2006, April). Web text corpus for natural language processing. In 11th Conference of the European Chapter of the Association for Computational Linguistics (pp. 233-240).
Sinclair, J. (1992). The automatic analysis of corpora. In J. Svartvik (Ed.), Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82). Berlin: Mouton de Gruyter.
Neunerdt, M., Trevisan, B., Reyer, M., & Mathar, R. (2013). Part-of-speech tagging for social media texts. In Language Processing and Knowledge in the Web (pp. 139-150). Springer, Berlin, Heidelberg.
Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00) (pp. 26-33).
Müller, T., Cotterell, R., Fraser, A., & Schütze, H. (2015). Joint lemmatization and morphological tagging with LEMMING. In 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2268-2274). Lisbon: Association for Computational Linguistics.
Hanke, T., Schulder, M., Konrad, R., & Jahn, E. (2020, May). Extending the Public DGS Corpus in size and depth. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives (pp. 75-82).
Renouf, A., & Kehoe, A. (2013). Filling the gaps: Using the WebCorp Linguist's Search Engine to supplement existing text resources. International Journal of Corpus Linguistics, 18(2), 167-198.