Topic Models


  • Topic modeling refers to a family of techniques for extracting hidden thematic structures ("topics") from a collection of documents. In many applications we want to automatically extract the central ideas expressed in an article or a passage.

  • The earliest model was pLSI (probabilistic latent semantic indexing); the later LDA (Latent Dirichlet Allocation) model and its extensions have since become the most widely used. The LDA topic model involves somewhat deeper mathematics, including the Dirichlet distribution, the multinomial distribution, the EM algorithm, and Gibbs sampling. LDA is an unsupervised machine learning technique that has been widely applied to uncover the latent topic structure of large document collections or corpora.

  • Topic models usually represent a text as a bag of words (Harris, 1954): the whole text is treated as a set of words, and syntax and word order are ignored, as illustrated in the sketch below.
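
As a minimal sketch of the bag-of-words idea (toy sentences, base R only; no packages assumed), each document is reduced to word counts and the original word order is thrown away:

# Two toy documents
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog chased the cat")

# Tokenize on whitespace and count word frequencies per document
tokens <- lapply(strsplit(docs, "\\s+"), table)

# Build a document-term matrix: rows = documents, columns = vocabulary
vocab <- sort(unique(unlist(lapply(tokens, names))))
dtm <- t(sapply(tokens, function(tb) {
  counts <- as.integer(tb[vocab])
  counts[is.na(counts)] <- 0
  counts
}))
colnames(dtm) <- vocab
dtm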

LDA

(Blei et al. 2003)

LDA is a Bayesian probabilistic model. Both pLSI and LDA follow the general formula below:

P(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)
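
As a toy numeric illustration of this mixture (made-up numbers: two topics, a three-word vocabulary), p(w|d) is just a weighted average of the topic-word distributions:

# Topic-word distributions p(w|z): one row per topic, one column per word
p_w_given_z <- rbind(topic1 = c(apple = 0.7, ball = 0.2, cat = 0.1),
                     topic2 = c(apple = 0.1, ball = 0.3, cat = 0.6))

# Document-topic distribution p(z|d) for a single document
p_z_given_d <- c(topic1 = 0.4, topic2 = 0.6)

# p(w|d) = sum over z of p(w|z) * p(z|d)
p_w_given_d <- colSums(p_w_given_z * p_z_given_d)
p_w_given_d   # apple 0.34, ball 0.26, cat 0.40 -- sums to 1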

The model assumes that:

  • There are K topics.

  • Each document is represented as a probability distribution over latent topics, and each topic is in turn characterized by a probability distribution over words.

There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k. There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d. Each word in a document was generated by first randomly picking a topic (from the document's distribution of topics) and then randomly picking a word (from the topic's distribution of words).
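
A minimal sketch of this generative story in base R (toy vocabulary and hyper-parameters chosen only for illustration; the Dirichlet sampler is built from gamma draws so no extra packages are needed):

set.seed(1)

# Draw one sample from a Dirichlet distribution with parameter vector alpha
rdirichlet <- function(alpha) { x <- rgamma(length(alpha), shape = alpha); x / sum(x) }

vocab   <- c("gene", "cell", "ball", "team", "vote", "party")
K       <- 2      # assumed number of topics
n_words <- 10     # length of the toy document

# Each topic is a probability distribution over the vocabulary: p(w|z)
beta_kw <- t(replicate(K, rdirichlet(rep(0.5, length(vocab)))))

# Each document is a probability distribution over topics: p(z|d)
theta <- rdirichlet(rep(0.1, K))

# Generate a document: pick a topic for each word slot, then a word from that topic
z <- sample(K, n_words, replace = TRUE, prob = theta)
w <- sapply(z, function(k) sample(vocab, 1, prob = beta_kw[k, ]))
w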

Because the components of a Dirichlet-distributed random vector are only weakly correlated (the residual correlation exists only because the components must sum to 1), the latent topics we posit are also almost uncorrelated with one another. This does not fit many real-world problems and remains an open issue for LDA (the Correlated Topic Model mentioned below is one response to it).
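
A quick numerical check of this weak (negative) correlation, assuming a symmetric Dirichlet over K = 50 topics and simulating it via normalized gamma draws:

set.seed(2)
K <- 50
# 10,000 draws from Dirichlet(1, ..., 1): normalize independent gamma variates
g <- matrix(rgamma(10000 * K, shape = 1), ncol = K)
theta <- g / rowSums(g)

# Average pairwise correlation between components is about -1/(K-1), i.e. close to 0
C <- cor(theta)
mean(C[upper.tri(C)])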

Topic model with R

  • mallet: an R wrapper around the MALLET Java topic-modeling toolkit.

  • topicmodels: an interface to the C code developed by David M. Blei for topic modeling (Latent Dirichlet Allocation (LDA) and Correlated Topics Models (CTM)).

  • lda: collapsed Gibbs sampling methods for LDA and related topic models.

  • LDAvis: interactive visualization of topic models.

  • RTextTools: a text-classification toolkit whose create_matrix() helper is used below to build the document-term matrix.

library(RTextTools)
library(topicmodels)

# loading the data (the bundled NYTimes dataset contains headlines from front-page NYTimes articles)
# randomly sample 1,000 articles
data(NYTimes)
data <- NYTimes[sample(1:3100, size=1000, replace=FALSE),]

# Create a DocumentTermMatrix
# the dtm built by create_matrix() can be used as input to topicmodels' LDA()
matrix <- create_matrix(cbind(as.vector(data$Title),
            as.vector(data$Subject)), 
            language="english", 
            removeNumbers=TRUE, 
            stemWords=TRUE, 
            weighting=weightTf)

# Perform Latent Dirichlet Allocation


## First we want to determine the number of topics in our data. 
# In the case of the NYTimes dataset, the data have already been classified as a training set for supervised learning algorithms. 
# Therefore, we can use the unique() function to determine the number of unique topic categories (k) in our data.
# Next, we use our matrix and this k value to generate the LDA model.

k <- length(unique(data$Topic.Code))
lda <- LDA(matrix, k)

# View the Results
## view the results by most likely term per topic, or most likely topic per document.

terms(lda)
topics(lda)
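
Beyond the single most likely term or topic, topicmodels also exposes the full posterior distributions; a short follow-up using the same lda object fitted above:

# Per-topic word distributions and per-document topic distributions
post <- posterior(lda)

# Top 5 terms for each topic (rows of post$terms are topics, columns are terms)
apply(post$terms, 1, function(p) names(sort(p, decreasing = TRUE))[1:5])

# Topic proportions for the first document
post$topics[1, ]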

Deep learning

  • The probability distributions produced by LDA capture the statistical relationships of word occurrences rather than the real semantic information embedded in words, topics, and documents.

  • LDA also assigns high probabilities to high-frequency words, so low-probability words are rarely chosen as representatives of a topic. In practice, however, low-probability words sometimes distinguish topics better: LDA will assign higher probability to, and choose, “food” rather than “cheeseburger”, “drug” rather than “aricept”, and “technology” rather than “smartphone”.

  • Recently, embedded (distributed) representations have proven more effective than LDA-style representations in many tasks.

The idea is to embed the “topics” themselves into a semantic vector space.
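
A very simple way to connect the two views (a sketch only, not the Topic2Vec or Topical Word Embeddings models cited below; the toy word vectors and the p(w|z) weights are made up): represent a topic as the probability-weighted average of the embeddings of its words.

# Toy pre-trained word embeddings: one row per word, 2 dimensions
embeddings <- rbind(apple = c(0.9, 0.1),
                    ball  = c(0.2, 0.8),
                    cat   = c(0.5, 0.5))

# A topic's word distribution p(w|z), e.g. taken from a fitted LDA model
p_w_given_z <- c(apple = 0.7, ball = 0.2, cat = 0.1)

# Topic vector = probability-weighted average of its word vectors
topic_vector <- colSums(embeddings[names(p_w_given_z), ] * p_w_given_z)
topic_vector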

(2015). Topic2Vec: Learning Distributed Representations of Topics
(2015). Topical Word Embeddings.