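In text mining, it is important to merge the frequencies of inflected word forms (for example, counting "markets" together with "market"). The step is simple, but it is often overlooked.
The code below compares the results with and without merging: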
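library("tm")
library("wordcloud")

# Load the sample Reuters corpus, then strip punctuation and stopwords
data(crude)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, function(x) removeWords(x, stopwords()))

# Build a word/frequency table, sorted by descending frequency
tdm <- TermDocumentMatrix(crude)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)

# Word cloud before merging
cls <- c("gray50", brewer.pal(8, "Dark2"), "orangeRed")
wordcloud(d$word, d$freq, scale = c(6, 0.5), color = cls, random.order = FALSE)

# Suffixes whose inflected forms are folded into the base word
subfix <- c("s", "es", "ed", "ing", "y", "ive", "ic", "al", "ous", "ious",
            "ish", "able", "ible", "ize", "ise")
del <- 0  # number of inflected forms merged away
for (ss in subfix) {
  w1 <- d$word
  # Row of each word's suffixed form (word + suffix), or NA if it is absent
  idx <- match(paste0(w1, ss), w1)
  sel <- which(!is.na(idx))  # base words whose suffixed form is present
  pls <- idx[sel]            # the suffixed forms, paired one-to-one with sel
  if (length(pls) > 0) {
    # Add each suffixed form's count to its own base word, then drop it
    d$freq[sel] <- d$freq[sel] + d$freq[pls]
    d <- d[-pls, ]
    del <- del + length(pls)
  }
}
del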
## [1] 104
wordcloud(d$word, d$freq, scale = c(6, 0.5), color = cls, random.order = FALSE)
After merging, words such as opec, market, and Kuwait become noticeably more prominent in the word cloud.
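To make the merging step easier to check by hand, here is a minimal sketch of the same suffix-folding idea on a toy table. The helper name merge_suffix and the toy words and counts are hypothetical and not part of the original post. The sketch uses match() so that each base word is paired explicitly with its own suffixed form.

```r
# Hypothetical helper: fold each "word + suffix" row into its base-word row.
merge_suffix <- function(d, suffixes) {
  for (ss in suffixes) {
    words <- as.character(d$word)
    idx <- match(paste0(words, ss), words)  # row of the suffixed form, or NA
    base <- which(!is.na(idx))              # rows whose suffixed form exists
    if (length(base) > 0) {
      inflected <- idx[base]
      # Add the inflected form's count to its base word, then drop that row
      d$freq[base] <- d$freq[base] + d$freq[inflected]
      d <- d[-inflected, , drop = FALSE]
    }
  }
  d
}

# Toy data (made up for illustration only)
toy <- data.frame(
  word = c("market", "price", "markets", "prices", "oil"),
  freq = c(10, 7, 4, 3, 9),
  stringsAsFactors = FALSE
)

merged <- merge_suffix(toy, c("s", "es", "ed", "ing"))
print(merged)
```

Here "markets" and "prices" are folded into "market" and "price", so their counts become 14 and 10, while "oil" keeps its count of 9.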
With the SnowballC package installed, you can also merge inflected forms using tm_map(x, stemDocument), but the results are quite poor. Try it and compare.