Extracting keywords with TF-IDF in Java

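I have been studying NLP recently, and the first step in NLP is word segmentation. Among the open-source tools, the Java options include NLPIR (the segmenter from the Chinese Academy of Sciences), the word segmenter, and ansj_seg; on the Python side, jieba is the popular choice. ansj_seg has provided a keyword-extraction method since version 5.x, and jieba provides one as well.
The most commonly used keyword-extraction algorithms are TF-IDF and TextRank. TF-IDF combines term frequency with inverse document frequency, while TextRank is based on the PageRank idea. The keyword-extraction methods in these two tools each have their strengths and weaknesses.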
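For reference, the textbook form of TF-IDF scores a term by its frequency in one document multiplied by the log of how rare it is across a corpus. The following is only a minimal sketch of that textbook form, assuming the documents are already segmented into word lists; the class name `ClassicTfIdf` and the toy corpus are made up for illustration and are not part of ansj_seg or jieba.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class ClassicTfIdf {

    /** tf-idf of every distinct term in docs.get(docIndex), relative to the whole corpus. */
    static Map<String, Double> score(List<List<String>> docs, int docIndex) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Raw term counts inside the target document.
        List<String> target = docs.get(docIndex);
        Map<String, Integer> counts = new HashMap<>();
        for (String term : target) {
            counts.merge(term, 1, Integer::sum);
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / target.size();
            double idf = Math.log((double) docs.size() / df.get(e.getKey()));
            scores.put(e.getKey(), tf * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy corpus (hypothetical): three already-segmented documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("actor", "wedding", "actor", "news"),
                Arrays.asList("football", "news"),
                Arrays.asList("weather", "news"));
        System.out.println(score(docs, 0));
    }
}
```

In this toy corpus "news" occurs in every document, so its idf is ln(3/3) = 0 and it drops out, while "actor" (twice in one document, absent elsewhere) scores highest.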
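First, a news story is made up of who, when, where, and what happened. Person names, technical terms, and place names matter most; ordinary adjectives and verbs are generally less important and less informative than nouns.
Words that appear in the title are also more important than words that appear only in the body, so they should receive extra weight. Word length deserves some weight as well.
A word's part of speech should be weighted too. The TF-IDF implementation that follows is built on these points.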
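Before the full listing, the weighting idea on its own can be sketched as a single multiplier applied to a word's base score. This is an illustrative sketch rather than the author's code: the class and method names are hypothetical, the boost values are the ones that appear in the listing, and it assumes each part-of-speech tag is meant to receive only its own boost.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class KeywordBoost {
    // Part-of-speech boosts, keyed by the segmenter's tag string.
    private static final Map<String, Double> NATURE_BOOST = new HashMap<>();
    static {
        for (String tag : Arrays.asList("nr", "nrf", "nt")) {
            NATURE_BOOST.put(tag, 6.0);
        }
        for (String tag : Arrays.asList("en", "nw", "nz", "kw", "ns")) {
            NATURE_BOOST.put(tag, 3.0);
        }
    }

    /** Multiplier for a word's base score (hypothetical helper). */
    static double multiplier(String word, String nature, Set<String> titleWords, int maxWordLen) {
        double weight = 1.0;
        if (titleWords.contains(word)) {
            weight += 3.0;                                   // title boost
        }
        weight += (double) word.length() / maxWordLen;       // length boost, at most 1.0
        weight += NATURE_BOOST.getOrDefault(nature, 1.0);    // part-of-speech boost
        return weight;
    }

    public static void main(String[] args) {
        Set<String> titleWords = new HashSet<>(Arrays.asList("Sarah"));
        System.out.println(multiplier("Sarah", "en", titleWords, 5)); // 1 + 3 + 1 + 3 = 8.0
        System.out.println(multiplier("love", "v", titleWords, 8));   // 1 + 0 + 0.5 + 1 = 2.5
    }
}
```

Keeping the boosts in a lookup table makes it easy to tune a single tag without touching control flow.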

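// Imports the method relies on. The ansj package paths are assumed from ansj_seg 5.0.x
// and may differ in other versions. stopWords, MapUtil.sortByValue and StringUtils.join
// are defined elsewhere and are not shown in this post.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.ansj.domain.Term;
import org.ansj.recognition.impl.FilterRecognition;
import org.ansj.splitWord.analysis.NlpAnalysis;

public static String TFIDF (String title,String content, int topK){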

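        // Filter out stop words and unwanted parts of speech before counting
        FilterRecognition filterRecognition = new FilterRecognition();
        filterRecognition.insertStopWords(stopWords);
        filterRecognition.insertStopWord("事兒", "有沒有", "前有", "後有", "更多");
        // Dropped tags include adverbs (d), prepositions (p), numerals (m), pronouns (r),
        // punctuation (w) and adjectives (a)
        filterRecognition.insertStopNatures("d", "p", "m", "r", "w", "a", "j", "l","null","num");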

        List<Term> terms = NlpAnalysis.parse(content).recognition(filterRecognition).getTerms();
        // Total number of terms left after filtering
        int totalWords= terms.size();
        Map<String, Integer> wordsCount = new HashMap<String, Integer>();
        // Longest word length, used later for length-based weighting
        int maxWordLen = 0;

        for(Term term:terms){
            Integer count = wordsCount.get(term.getName());
            count = count == null ? 0 : count;
            wordsCount.put(term.getName(), count+1);
            if(maxWordLen<term.getName().length()){
                maxWordLen = term.getName().length();
            }
        }

        // Compute tf: occurrences of the word divided by (total terms + 1)
        Map<String, Double> tf = new HashMap<String, Double>();
        for(String word:wordsCount.keySet()){
            tf.put(word, (double)wordsCount.get(word)/(totalWords+1));
        }

        // Collect the distinct word lengths that occur
        Set<Integer> perWordLen = new HashSet<Integer>();
        // Length weight of each word: its length relative to the longest word
        Map<String, Double> lenWeight = new HashMap<String, Double>();
        for( String key:tf.keySet()){
            lenWeight.put(key, (double)key.length()/maxWordLen);
            perWordLen.add(key.length());
        }

        // Words that appear in the title, mapped to their part-of-speech tag
        List<Term> titleTerms = NlpAnalysis.parse(title).recognition(filterRecognition).getTerms();
        Map<String, String> titleWords = new HashMap<String, String>();
        for(Term term:titleTerms){
            titleWords.put(term.getName(), term.getNatureStr());
        }
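        // Compute a stand-in for idf. No reference corpus is used here: each word's count is
        // compared with the total count of all words of the same length in this document,
        // i.e. idf(w) = ln(sumOfCountsWithSameLength / count(w) + 1)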
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        for(int len:perWordLen){
            int sum = 0;
            for(String w:wordsCount.keySet()){
                if(w.length()==len){
                    sum += wordsCount.get(w);
                }
            }
            map.put(len, sum);
        }
        Map<String, Double> idf = new HashMap<String, Double>();
        for(String w:wordsCount.keySet()){
            Integer integer = wordsCount.get(w);
            int len = w.length();
            Integer totalSim = map.get(len);
            idf.put(w, Math.log(((double)totalSim/integer)+1));
        }
        // Compute each word's final weight in the article (words shorter than 2 characters are skipped)

        Map<String, Double> wordWeight = new HashMap<String, Double>();

        for(Term term:terms){
            String word = term.getName();
            String nature = term.getNatureStr();
            if(word.length()<2){
                continue;
            }
            if(wordWeight.get(word)!=null){
                continue;
            }
            Double aDouble = tf.get(word);
            Double aDouble1 = idf.get(word);
            double weight = 1.0;
            // Words that also appear in the title get a boost
            if(titleWords.containsKey(word)){
                weight += 3.0;
            }
            // Longer words get a boost proportional to their length
            weight += (double)word.length()/maxWordLen;
            // Part-of-speech boost. Each group ends with break; without it Java falls
            // through and also adds every weight listed below the matching case.
            switch (nature){
                case "nr":
                case "nrf":
                case "nt":
                    weight += 6.0;
                    break;
                case "en":
                case "nw":
                case "nz":
                case "kw":
                case "ns":
                    weight += 3.0;
                    break;
                default:
                    weight += 1.0;
            }

            wordWeight.put(word,aDouble*aDouble1*weight);
        }

        Map<String, Double> stringDoubleMap = MapUtil.sortByValue(wordWeight);

        List<String> topKSet = new ArrayList<String>();

        int i = 0;
        for(String word:stringDoubleMap.keySet()){
            if(i >= topK){
                break;
            }
            topKSet.add(word+" "+stringDoubleMap.get(word));
            i++;
        }
        return StringUtils.join(topKSet, "\t");
    }
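The listing calls `MapUtil.sortByValue`, which is not shown in the post. Since the top-K loop takes the first entries of the returned map, the helper presumably returns the entries ordered by value from largest to smallest. A minimal sketch of such a helper, under that assumption and with a hypothetical class name, could look like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapUtilSketch {

    /** Returns a new map whose iteration order follows the values, largest first. */
    static <K, V extends Comparable<? super V>> Map<K, V> sortByValueDesc(Map<K, V> input) {
        List<Map.Entry<K, V>> entries = new ArrayList<>(input.entrySet());
        entries.sort(Map.Entry.<K, V>comparingByValue().reversed());

        // LinkedHashMap preserves insertion order, so the sorted order survives.
        Map<K, V> sorted = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("a", 0.2);
        weights.put("b", 1.5);
        weights.put("c", 0.7);
        System.out.println(sortByValueDesc(weights).keySet()); // [b, c, a]
    }
}
```

A plain `HashMap` would lose the ordering again, which is why the result is copied into a `LinkedHashMap`.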

int topK = 10;

    String title = "余文樂本來習慣一個人王棠雲2個優點抓住男神心";

    String content = "余文樂夫婦據臺灣媒體報道，36歲香港男星余文樂宣佈和王棠雲（Sarah）結婚，兩人認愛1年感情修成正果，婚紗照曝光讓大批粉絲涌入祝福。他在2016年的聖誕節被目擊在紐約求婚率最高的法國餐廳用餐，戀情因此曝光，隨後由低調逐漸轉爲高調，年初宣傳電影時，曾鬆口提到女友的2個優點，成爲兩人願意相守一生的關鍵。\n" +
            "余文樂發文余文樂2月在香港參加電影活動時，被問到和女友王棠雲相處的過程，坦言雖然工作很忙，但是仍會抽時間陪伴女友，“其實我自己也很不習慣，不習慣有另外一半，本來以前一個人習慣了儘快將工作完成，但是現在要分配時間給對方，所以要好好分配時間，但不敢保證可以分配好。”他工作忙碌，女友卻沒有半句抱怨，提到最欣賞對方什麼地方，靦腆稱讚“簡單、單純的女孩”，兩人相處非常舒服。\n" +
            "走紅兩岸三地的余文樂非常顧家，他高中被挖角出道，一肩扛起家中經濟重任，拍戲17年來從沒停下拍戲腳步，“家人的開心健康對我來說比什麼都重要。”他認愛後更直言婚後不希望另一半工作，“未來我和太太（的生活），也希望是我負責（開銷），不希望她來工作，除非她很想工作，我會尊重她。”小兩口交往1年後結婚，再度公開示愛：“感恩妳把人生的餘下日子交到我手上，我一定會把幸福帶給妳，我一定會好好的照顧妳！I love you。";

    // Run the extraction and print each keyword with its score
    System.out.println(TFIDF(title, content, topK));

Output (each keyword followed by its score):

余文樂 1.8134162231616129 認愛 1.5255924833176655 王棠雲 1.1672971055953232 挖角 0.89414671294365 相守 0.5643384991529595 love 0.5615120226399731 求婚率 0.481328374683345

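The results look reasonable, although nothing has been tuned yet. Later I will implement TextRank and another TF-IDF variant for keyword extraction.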
