Corpus

corpus linguistics

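A corpus is a database of linguistic data. What sets it apart from a general-purpose database is that a corpus usually carries linguistic annotation. The annotation may cover morphosyntax (part-of-speech tags, POS), word sense, semantic class, discourse markers, sentiment, and so on, depending on the goal of the research or application.
In general text mining research, by contrast, a corpus is simply a text collection.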
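
The difference between an annotated corpus and a plain text collection can be illustrated with a minimal sketch in base R. The sentence, the tag labels, and the variable names `annotated` and `text_collection` are all hypothetical and chosen only for illustration; real corpora use a documented tag set and a dedicated format.

```r
# Hypothetical toy data: one sentence, tokenized and tagged by hand.
annotated <- data.frame(
  token = c("The", "cat", "sat"),
  pos   = c("DET", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# A plain text collection holds only the raw strings, with no annotation.
text_collection <- c(doc1 = "The cat sat")

# Annotation makes linguistic queries possible, e.g. selecting all nouns.
nouns <- annotated$token[annotated$pos == "NOUN"]
print(nouns)
```

With only `text_collection`, the same query would first require running a tagger over the raw text.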

Building a (small) corpus with `tm`
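
As a minimal sketch, assuming the `tm` package is installed, a small in-memory corpus could be built from a character vector roughly as follows. The two example sentences and the variable names are made up for illustration, and the clean-up steps shown are one possible choice rather than a required pipeline.

```r
library(tm)

# Hypothetical example documents (not taken from any real data set).
texts <- c(
  "Corpus linguistics studies language through text samples.",
  "Text mining treats a corpus as a plain collection of texts."
)

# VectorSource treats each element of the vector as one document;
# VCorpus keeps the whole corpus in memory.
corp <- VCorpus(VectorSource(texts))

# Basic clean-up: lower-case the text and strip punctuation.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# Document-term matrix: rows are documents, columns are terms.
dtm <- DocumentTermMatrix(corp)
inspect(dtm)
```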

Text formats

tm (Feinerer and Hornik, 2014) supports several input formats, including plain text, PDF, Microsoft Word, and XML.
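
One way to check which formats an installed version of `tm` can read is to list its reader functions. The sketch below also shows how a folder of plain-text files might be loaded. The directory `path/to/txt` is a placeholder, so that part only runs if the folder exists. Some readers, such as those for PDF and Word files, may depend on external tools or extra packages.

```r
library(tm)

# List the reader functions shipped with tm,
# e.g. readPlain, readPDF, readDOC, and readXML.
readers <- getReaders()
print(readers)

# Placeholder directory: replace with a real folder of .txt files.
txt_dir <- "path/to/txt"
if (dir.exists(txt_dir)) {
  corp <- VCorpus(
    DirSource(txt_dir, pattern = "\\.txt$", encoding = "UTF-8"),
    readerControl = list(reader = readPlain, language = "en")
  )
  print(length(corp))  # number of documents read
}
```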

Quanteda
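
For comparison, a minimal sketch of the same kind of small corpus in `quanteda`, assuming the package is installed. The example sentences and the document variable `source` are hypothetical and serve only to show the basic workflow of corpus, tokens, and document-feature matrix.

```r
library(quanteda)

# Hypothetical example documents; the names become document ids.
texts <- c(
  doc1 = "Corpus linguistics studies language through text samples.",
  doc2 = "Text mining treats a corpus as a plain collection of texts."
)

# Build the corpus and attach a document-level variable.
corp <- corpus(texts)
docvars(corp, "source") <- c("linguistics", "text mining")

# Tokenize, dropping punctuation, then build a document-feature matrix.
# dfm() lower-cases features by default.
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)

print(ndoc(corp))
print(dfmat)
```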

