Text corpus

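A text corpus (plural: text corpora; Cantonese: 語料庫, jyu5 liu2 fu3) is a large, structured language resource. A corpus for a given language contains a large quantity of recordings and written text in that language. In the early twentieth century and before, this material was usually kept as physical documents; by the early twenty-first century it was mostly stored in digital form^[1]^[2].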

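For example, the International Corpus of English (ICE) is a well-known English corpus. ICE covers more than 20 countries or regions around the world where English is one of the official languages (Hong Kong included). For each country or region it holds recordings of local people speaking English, together with a range of written material in English by local people, such as essays, letters, academic writing and news reports. As of 2018, ICE had at least 1,500,000 words of material for each country or region it covers (hence "large")^[3]^[4]^{[註 1]}.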

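Corpus linguistics refers broadly to linguistic research that relies on corpora^[5]. Linguistics is by definition the study of language, and studying anything requires a large sample of the thing being studied. A corpus provides exactly that: a large body of language material. With a corpus for a language in hand, linguists can start analysing properties of that language such as its grammar, which makes corpora very important to modern linguistic research.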

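In early twenty-first-century linguistics, a corpus is defined as a large, structured language resource, and "structured" is the key word. The English name comes from the Latin corpus^[6], which roughly means "body", so text corpus can be read literally as "a body of text". In practice, however, a corpus must do more than store language material: linguists and other practitioners (for example, people working on teaching AI to process language) have to be able to use it. Merely collecting the material is therefore not enough; corpus builders also have to organise it in an orderly way^[1].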

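For example, corpora of the early twenty-first century almost always carry part-of-speech tagging: every word in the material is labelled with its part of speech, such as noun, verb or adjective^[7]^[8]. The result looks something like this^[9]: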

Original sentence: She sells seashells on the seashore.

After part-of-speech tagging it becomes:

She (pronoun) sells (verb) seashells (noun) on (preposition) the (determiner) seashore (noun).

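Imagine a linguist who wants to study the parts of speech of a language. If the corpus is already tagged, the linguist need not label words by hand and can, for example, write a program that automatically counts how many times words of each part of speech occur, which saves a great deal of time and effort^[9]. Corpora of the early twenty-first century do much work of this kind, part-of-speech tagging included, so that the material is "structured" and easy to use^[1].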
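The counting program mentioned above can be sketched in a few lines of Python. This is only an illustration: the `(word, tag)` pair layout, the tag labels and the name `count_tags` are assumptions made for the example, not the format of any particular corpus.

```python
from collections import Counter

# Hypothetical tagged sentence, written as (word, tag) pairs.
# Real corpora use their own tag sets and file formats.
tagged_sentence = [
    ("She", "pronoun"),
    ("sells", "verb"),
    ("seashells", "noun"),
    ("on", "preposition"),
    ("the", "determiner"),
    ("seashore", "noun"),
]


def count_tags(tagged_words):
    """Count how many words carry each part-of-speech tag."""
    return Counter(tag for _word, tag in tagged_words)


tag_counts = count_tags(tagged_sentence)
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

Because the tags are already in the data, the program never has to guess a word's part of speech; it only tallies labels that the corpus builders supplied.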

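Being "structured" means that the data has already been processed before users work with it. Besides part-of-speech tagging, common preprocessing steps include: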

Text segmentation: dividing the text to be processed into meaningful units to make further analysis easier. Common cases are splitting a passage into sentences or into individual words. For example^[10]:
Input: San Pedro is a town on the southern part of the island of Ambergris Caye in the Belize District of the nation of Belize, in Central America. According to 2015 mid-year estimates, the town has a population of about 16,444. It is the second-largest town in the Belize District and largest in the Belize Rural South constituency.

Output:
1. San Pedro is a town on the southern part of the island of Ambergris Caye in the Belize District of the nation of Belize, in Central America.
2. According to 2015 mid-year estimates, the town has a population of about 16,444.
3. It is the second-largest town in the Belize District and largest in the Belize Rural South constituency.
(split into separate sentences)
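Lemmatization: teaching the computer the different inflected forms a word can take, so that it treats those forms as the same word during analysis. For example, English wolf and wolves both refer to the same animal, the latter being the plural; lemmatization tells the computer that wolf and wolves are different forms of one word^[11].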
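Parse tree: a tree-shaped data structure widely used in linguistic analysis. Put simply, it represents the syntactic structure of a string of words according to the grammar; a simple English sentence such as John hit the ball can be drawn as a parse tree^[12].

... and so on.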
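The three preprocessing steps listed above can be sketched together in Python. Everything here is a simplified assumption for illustration: the regular-expression splitter, the tiny lookup table `LEMMAS` and the nested-tuple tree with its node labels are not how any particular corpus toolkit works, and real segmenters and lemmatizers handle many cases this sketch ignores.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


# Hypothetical lookup table from inflected forms to their base form.
LEMMAS = {"wolves": "wolf", "sells": "sell", "seashells": "seashell"}


def lemmatize(word):
    """Return the base form if the word is in the table, else the lower-cased word."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


# One common textbook analysis of "John hit the ball", written as nested tuples:
# (label, child, child, ...) for phrases and (label, word) for leaves.
tree = (
    "S",
    ("NP", ("N", "John")),
    ("VP", ("V", "hit"), ("NP", ("Det", "the"), ("N", "ball"))),
)


def leaves(node):
    """Collect the words at the bottom of the tree, left to right."""
    _label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def to_brackets(node):
    """Render the tree in labelled-bracket notation."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return f"[{label} {children[0]}]"
    return f"[{label} " + " ".join(to_brackets(c) for c in children) + "]"


text = (
    "San Pedro is a town on the southern part of the island of Ambergris Caye "
    "in the Belize District of the nation of Belize, in Central America. "
    "According to 2015 mid-year estimates, the town has a population of about 16,444. "
    "It is the second-largest town in the Belize District and largest in the "
    "Belize Rural South constituency."
)
sentences = split_sentences(text)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
print(lemmatize("wolves"))
print(to_brackets(tree))
```

The labelled-bracket string printed at the end is a compact text form of the tree, which is convenient when a corpus stores parses alongside the sentences.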

A good corpus is generally considered to have the following characteristics:

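Well preprocessed (see above).
Large: a corpus essentially supplies language samples for linguistics and related research, and larger samples are usually better. The size of a corpus is normally measured by the total word count of its material; the higher the count, the larger the corpus, and (other factors held constant) the larger the corpus, the more useful it is^[13]. For example, as of 2020 the Oxford English Corpus (a corpus describing the English of the early twenty-first century^[14]) contained as many as 2.1 billion words^[15]. The International Corpus of English (ICE), by contrast, has at least 1,500,000 words of material for each country or region it covers and has still been criticised as too small^[4].
Recent enough: language changes over time, as phenomena such as slang show. If the data in a corpus is not recent enough, researchers cannot obtain up-to-date information. Imagine an artificial intelligence researcher who wants to write a chatbot that converses with people but whose corpus has no data from the past 10 years: the chatbot would not know the slang of the past 10 years and could not hold a natural conversation^[16]. In practice, corpora also label each item with the year it comes from, so that users can judge whether the item suits their purpose.
Broad enough: it is generally held that a corpus should ideally contain different kinds of material. Both the Oxford English Corpus and the International Corpus of English, for example, include written English as well as recordings of people speaking. Beyond that, writing and speech differ across settings: news reports, prose fiction and academic writing differ considerably from one another, and people speak very differently in formal settings and in everyday talk among family. Corpora such as the International Corpus of English take this into account, so they draw on varied sources when gathering material, including news reports, fiction and academic writing^[17].
Metadata: metadata is "data that describes data" and is an indispensable part of a corpus. Suppose a corpus item is an essay: the words of the essay are the data itself, while the metadata describes that data, including the essay's author, word count, year of publication and source, and so on. Metadata is useful in many analyses; for example, metadata recording the year in which each item appeared helps in analysing how a language changes over time^[18].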
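The criteria above (size by word count, recency, breadth and metadata) can be made concrete with a small Python sketch. The `CorpusItem` class, its field names and the three sample items are all hypothetical, invented only to show how metadata supports these checks; counting words by splitting on whitespace is likewise a simplification.

```python
from dataclasses import dataclass


@dataclass
class CorpusItem:
    """One piece of corpus material plus the metadata that describes it."""
    text: str    # the data itself
    author: str  # metadata: who wrote it
    year: int    # metadata: when it appeared
    genre: str   # metadata: kind of material
    source: str  # metadata: where it came from

    @property
    def word_count(self) -> int:
        # Simplification: treat whitespace-separated chunks as words.
        return len(self.text.split())


# Hypothetical sample items; authors and sources are placeholders.
corpus = [
    CorpusItem("The harbour was quiet at dawn.", "Author A", 2004, "essay", "placeholder-1"),
    CorpusItem("Markets rallied after the announcement.", "Author B", 2016, "news", "placeholder-2"),
    CorpusItem("We propose a simple tagging model.", "Author C", 2019, "academic", "placeholder-3"),
]


def corpus_size(items):
    """Size of the corpus measured as total word count."""
    return sum(item.word_count for item in items)


def items_since(items, year):
    """Keep only the items that appeared in the given year or later."""
    return [item for item in items if item.year >= year]


def genres(items):
    """List the distinct kinds of material, as a rough check on breadth."""
    return sorted({item.genre for item in items})


print(corpus_size(corpus))
print(len(items_since(corpus, 2015)))
print(genres(corpus))
```

Filtering by `year` is the kind of operation that becomes possible only because each item carries its year of origin as metadata.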

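The function of a corpus is to supply language material as samples for research and for analysing language, so any application that needs language data as samples will use a corpus, including: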

... and so on.

Multilingual

English

Chinese

Cantonese

[註 1]
Some have nevertheless pointed out that, by the standards of early twenty-first-century corpora, 1,500,000 words is on the small side.

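Low-resource language: many of the world's languages lack sufficient material because, among other reasons, "people are willing to speak them but not to write them"; early twenty-first-century Cantonese, for instance, has been described as being in this situation.
Speech corpus
Web search engine
Information system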

[1]
Liu, V., & Curran, J. R. (2006, April). Web text corpus for natural language processing. In 11th Conference of the European Chapter of the Association for Computational Linguistics (pp. 233-240).
[2]
Language Corpora. University of Queensland.
[3]
Greenbaum, S., & Nelson, G. (1996). The international corpus of English (ICE) project. World Englishes, 15(1), 3-15.
[4]
Kirk, J., & Nelson, G. (2018). The International Corpus of English project: A progress report (PDF). World Englishes, 37(4), 697-716.
[5]
Sinclair, J. 'The automatic analysis of corpora', in Svartvik, J. (ed.) Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82). Berlin: Mouton de Gruyter. (1992).
[6]
Corpus Linguistics. Archived at the Internet Archive on 24 January 2022.
[7]
Neunerdt, M., Trevisan, B., Reyer, M., & Mathar, R. (2013). Part-of-speech tagging for social media texts. In Language Processing and Knowledge in the Web (pp. 139-150). Springer, Berlin, Heidelberg.
[8]
Kupiec, J. (1992). Robust part-of-speech tagging using a hidden Markov model. Computer speech & language, 6(3), 225-242.
[9]
NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields. Medium.
[10]
Freddy Y. Y. Choi (2000). "Advances in domain independent linear text segmentation". Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-00). pp. 26–33.
[11]
Müller, Thomas; Cotterell, Ryan; Fraser, Alexander; Schütze, Hinrich (2015). Joint Lemmatization and Morphological Tagging with LEMMING. 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics. pp. 2268-2274.
[12]
Noam Chomsky (26 December 2014). Aspects of the Theory of Syntax. MIT Press.
[13]
Hanke, T., Schulder, M., Konrad, R., & Jahn, E. (2020, May). Extending the Public DGS Corpus in size and depth. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives (pp. 75-82).
[14]
Culpeper, J. (2009). The metalanguage of impoliteness: using Sketch Engine to explore the Oxford English Corpus.
[15]
"The Oxford English Corpus". Sketch Engine. Lexical Computing CZ s.r.o. Retrieved 27 October 2016.
[16]
Renouf, A., & Kehoe, A. (2013). Filling the gaps: Using the WebCorp Linguist's Search Engine to supplement existing text resources. International Journal of Corpus Linguistics, 18(2), 167-198.
[17]
International Corpus of English (ICE) Homepage.
[18]
What is metadata and why do you need it?. Developing Linguistic Corpora: a Guide to Good Practice.

Text Corpus for NLP. Developedia.
Unsupervised outlier detection in text corpus using Deep Learning. Medium.
What is metadata and why do you need it?. Developing Linguistic Corpora: a Guide to Good Practice.
