Corpus Collection

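A corpus is a self-contained linguistic object.
Any material in which language is realized, such as speech, spoken language, written text, or data from neuropsycholinguistic experiments, can be regarded as linguistic data.
A corpus is a database of linguistic data that has been digitized and annotated with linguistic information, either automatically or by hand.
Statistically speaking, a corpus is a sample of a language.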

Open-source corpora can now be obtained from many places, for example:

http://www.helsinki.fi/varieng/CoRD/
http://linghub.org/
http://datahub.io/

How to collect corpus data

Manual, ad hoc (broad) collection
Automated collection with programs
- wget
- Off-the-shelf crawler tools
  - import.io: Instantly Turn Web Pages into Data.
  - Kimono: Turn websites into structured APIs from your browser in seconds
  - Blockspring: Access web services from spreadsheets.
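
As a rough illustration of the "automated collection" route, the sketch below downloads one page and keeps only its visible text, using nothing but the Python standard library. The names `TextExtractor`, `html_to_text`, and `fetch_text` are made up for this example and do not come from any of the tools listed above; real pages usually need far more cleaning than this.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url, timeout=10):
    """Download one page and return its visible text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Calling `fetch_text` on a list of URLs and writing each result to its own file would give a very small raw corpus; check each site's terms of use and robots.txt before crawling it.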

Packages written by linguists

Standalone version

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

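The developers are computational linguists, so the tool is designed with linguistic analysis in mind. The website also provides detailed step-by-step instructions and front-end interfaces for each platform.
The BootCaT toolkit implements an iterative procedure to bootstrap specialized corpora and terms from the web, requiring only a list of 'seed terms' (terms that are expected to be typical of the domain of interest) as input.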
It is modular in that one can easily run a subset of the programs, look at intermediate output files, add new tools to the suite, or change one program without having to worry about the others.
The BootCaT front-end is a graphical interface for the BootCaT toolkit; it is basically a wizard that guides you through the process of creating a simple web corpus. (Advanced users comfortable with text UIs should consider using the Perl scripts instead of the front-end.)

Installation screen

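BootCaT currently uses Bing to search the web. A free account allows 5,000 queries per month (each tuple counts as one query).
The retrieved texts have already been cleaned to some extent, but they still contain a lot of noise.
The command-line version is the better choice. Its README notes that you can go on to do POS tagging (for English) and CWB indexing.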

THE BASIC PROCEDURE
1) Choose seeds/keywords
2) Build n-tuples from seeds
3) Use n-tuples as search engine queries and retrieve urls
4) Fetch corresponding pages and build corpus

(The following steps can only be carried out from the command line.)

(Begin of optional iteration phase)
5) Identify typical words in corpus by frequency comparison with general corpus
6) Using typical words as new seeds, re-start from 2)
(End of optional iteration phase)

7) POS-tag the corpus
8) Index with CWB (Corpus Workbench)
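
Step 5, the frequency comparison with a general corpus, can be illustrated with a simple ratio of smoothed relative frequencies. This measure is chosen here only for clarity and is an assumption of this sketch; BootCaT's scripts may use a different statistic. The function name `typical_words` and the toy token lists are invented for the example.

```python
from collections import Counter


def typical_words(domain_tokens, general_tokens, top_n=10, min_count=2):
    """Rank words that are relatively more frequent in the domain corpus."""
    domain = Counter(domain_tokens)
    general = Counter(general_tokens)
    domain_total = sum(domain.values())
    general_total = sum(general.values())
    vocab_size = len(set(domain) | set(general))

    scores = {}
    for word, count in domain.items():
        if count < min_count:
            continue  # ignore words too rare to judge
        # Add-one smoothing keeps the ratio finite for words
        # that never occur in the general corpus.
        p_domain = (count + 1) / (domain_total + vocab_size)
        p_general = (general[word] + 1) / (general_total + vocab_size)
        scores[word] = p_domain / p_general

    # Highest ratio first; ties broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


# Toy data: a tiny "domain" corpus and a tiny "general" corpus.
domain_tokens = "the corpus the tagger corpus tagger corpus the".split()
general_tokens = "the cat the dog the house a the of the".split()
print(typical_words(domain_tokens, general_tokens, top_n=2))
```

The top-ranked words can then serve as the new seeds for step 6, restarting the procedure from step 2.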