Implementing Extractive Text Summarization

1. Introduction

 

    1. This implementation of automatic text summarization is based on word-frequency statistics.

    2. An article is made up of sentences, and all of the article's information is carried by those sentences; some sentences carry more information, others less.

    3. A sentence's information content is measured by its "keywords": the more keywords a sentence contains, the more important it is.

    4. "Automatic summarization" therefore means finding the sentences that carry the most information, i.e. the sentences containing the most keywords.

    5. Concretely, we count keyword frequencies, rank the resulting frequency list, score every sentence in the document against that list, and pick out the highest-scoring sentences; those sentences form the summary.

2. Implementation Steps

    1. Load the stop word list.

    2. Split the document into a list of sentences.

    3. Tokenize each sentence in the list.

    4. Count word frequencies and take the 100 highest-frequency words as keywords.

    5. Score each sentence in the list against those word frequencies.

    6. Return the 5 highest-scoring sentences.
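
    Condensed into a single function, the pipeline looks roughly like the sketch below. This is a minimal illustration only: it scores a sentence by a simple keyword count rather than the cluster scoring used by the full implementations later, and the file name stopword.txt is an assumption.

import codecs
import re

import jieba
import nltk

def summarize(text, stopword_path='stopword.txt', n_keywords=100, n_sentences=5):
    # 1. load the stop word list
    stopwords = set(line.strip() for line in codecs.open(stopword_path, encoding='utf8'))
    # 2. split the document into sentences, keeping the trailing punctuation
    sentences = [s for s in re.findall(u'[^.!?。!?]+[.!?。!?]?', text) if s.strip()]
    # 3. tokenize, dropping stop words and single-character tokens
    words = [w for s in sentences for w in jieba.cut(s)
             if w not in stopwords and len(w) > 1]
    # 4. count word frequencies and keep the most frequent words as keywords
    topn_words = set(w for w, _ in nltk.FreqDist(words).most_common(n_keywords))
    # 5. score each sentence by the number of keywords it contains
    scored = [(sum(1 for w in jieba.cut(s) if w in topn_words), i)
              for i, s in enumerate(sentences)]
    # 6. return the highest-scoring sentences, restored to document order
    top = sorted(sorted(scored, reverse=True)[:n_sentences], key=lambda t: t[1])
    return [sentences[i] for _, i in top]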

3. Principle

    This method dates back to the 1958 paper "The Automatic Creation of Literature Abstracts" by H. P. Luhn, a scientist at IBM. Luhn proposed representing concentrations of keywords as "clusters": a "cluster" is a sentence fragment that contains several keywords.

    [Figure omitted: an illustration from Luhn's original paper; the boxed portion is one "cluster".] Keywords are considered to belong to the same cluster as long as the distance between them is below a threshold; Luhn suggested a threshold of 4 or 5.

    In other words, if two keywords are separated by more than five other words, the two keywords are assigned to different clusters.

    The importance score of a cluster is computed as:

    score = (number of keywords in the cluster)^2 / (total number of words in the cluster)

    Taking the cluster in Luhn's figure as an example: it spans 7 words in total, 4 of which are keywords, so its importance score is (4 x 4) / 7 ≈ 2.3.

    Finally, take the sentences containing the highest-scoring clusters (for example, the top 5) and join them together; the result is the article's automatic summary. A small worked example of the cluster scoring follows.
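
    As a sanity check of the formula, the sketch below (with hypothetical keyword positions, not taken from Luhn's paper) groups keyword positions into clusters and scores them exactly the way the implementations in the next section do:

CLUSTER_THRESHOLD = 5  # Luhn's suggested distance threshold

def best_cluster_score(word_idx):
    """Group sorted keyword positions into clusters; return the best cluster score."""
    clusters, cluster = [], [word_idx[0]]
    for prev, cur in zip(word_idx, word_idx[1:]):
        if cur - prev < CLUSTER_THRESHOLD:
            cluster.append(cur)  # within the distance threshold: same cluster
        else:
            clusters.append(cluster)  # gap too large: start a new cluster
            cluster = [cur]
    clusters.append(cluster)
    # score = (keywords in cluster)^2 / total words spanned; keep the best cluster
    return max(len(c) ** 2 / float(c[-1] - c[0] + 1) for c in clusters)

# a cluster of 4 keywords spanning 7 words, as in the example above:
print(best_cluster_score([10, 11, 14, 16]))  # (4 x 4) / 7 = 2.2857... ~ 2.3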

4. Code

Python implementation:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Jia ShiLin

import nltk
import numpy
import jieba
import codecs
import os

class SummaryTxt:
    def __init__(self, stopwordspath):
        # number of top keywords to keep
        self.N = 100
        # maximum distance (in words) between keywords within one cluster
        self.CLUSTER_THRESHOLD = 5
        # number of top sentences to return
        self.TOP_SENTENCES = 5
        self.stopwords = {}
        # load the stop word list
        if os.path.exists(stopwordspath):
            stoplist = [line.strip() for line in codecs.open(stopwordspath, 'r', encoding='utf8').readlines()]
            self.stopwords = {}.fromkeys(stoplist)


    def _split_sentences(self, texts):
        '''
        Split texts into individual sentences, using the punctuation marks
        (.!?。!?) as delimiters, and return them as a list.
        :param texts: input text
        :return: list of sentences
        '''
        splitstr = '.!?。!?'
        start = 0
        index = 0  # position of the current character
        sentences = []
        for text in texts:
            if text in splitstr:  # current character is a sentence-ending punctuation mark
                sentences.append(texts[start:index + 1])  # slice up to and including the punctuation
                start = index + 1  # start marks the beginning of the next sentence
            index += 1
        if start < len(texts):
            sentences.append(texts[start:])  # handle text that does not end with punctuation

        return sentences

    def _score_sentences(self, sentences, topn_words):
        '''
        Score each sentence using the top-N keywords.
        :param sentences: list of sentences
        :param topn_words: list of keywords
        :return: list of (sentence index, score) tuples
        '''
        scores = []
        sentence_idx = -1
        for s in [list(jieba.cut(s)) for s in sentences]:
            sentence_idx += 1
            word_idx = []
            for w in topn_words:
                try:
                    word_idx.append(s.index(w))  # position of the keyword's first occurrence in this sentence
                except ValueError:  # w is not in the sentence
                    pass
            word_idx.sort()
            if len(word_idx) == 0:
                continue
            # Group consecutive keyword positions into clusters using the distance threshold.
            clusters = []
            cluster = [word_idx[0]]
            i = 1
            while i < len(word_idx):
                if word_idx[i] - word_idx[i - 1] < self.CLUSTER_THRESHOLD:
                    cluster.append(word_idx[i])
                else:
                    clusters.append(cluster[:])
                    cluster = [word_idx[i]]
                i += 1
            clusters.append(cluster)
            # Score each cluster; the highest cluster score becomes the sentence score.
            max_cluster_score = 0
            for c in clusters:
                significant_words_in_cluster = len(c)
                total_words_in_cluster = c[-1] - c[0] + 1
                score = 1.0 * significant_words_in_cluster * significant_words_in_cluster / total_words_in_cluster
                if score > max_cluster_score:
                    max_cluster_score = score
            scores.append((sentence_idx, max_cluster_score))
        return scores

    def summaryScoredtxt(self, text):
        # split the text into sentences
        sentences = self._split_sentences(text)

        # tokenize, dropping stop words and single-character tokens
        words = [w for sentence in sentences for w in jieba.cut(sentence)
                 if w not in self.stopwords and len(w) > 1 and w != '\t']
        # equivalent expanded form:
        # words = []
        # for sentence in sentences:
        #     for w in jieba.cut(sentence):
        #         if w not in self.stopwords and len(w) > 1 and w != '\t':
        #             words.append(w)

        # count word frequencies
        wordfre = nltk.FreqDist(words)

        # take the N most frequent words as keywords
        topn_words = [w[0] for w in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:self.N]

        # score the sentences with the top-N keywords
        scored_sentences = self._score_sentences(sentences, topn_words)

        # filter out unimportant sentences using the mean and standard deviation of the scores
        avg = numpy.mean([s[1] for s in scored_sentences])  # mean
        std = numpy.std([s[1] for s in scored_sentences])  # standard deviation
        summarySentences = []
        for (sent_idx, score) in scored_sentences:
            if score > (avg + 0.5 * std):
                summarySentences.append(sentences[sent_idx])
                print(sentences[sent_idx])
        return summarySentences

    def summaryTopNtxt(self, text):
        # split the text into sentences
        sentences = self._split_sentences(text)

        # tokenize, dropping stop words and single-character tokens
        words = [w for sentence in sentences for w in jieba.cut(sentence)
                 if w not in self.stopwords and len(w) > 1 and w != '\t']

        # count word frequencies
        wordfre = nltk.FreqDist(words)

        # take the N most frequent words as keywords
        topn_words = [w[0] for w in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:self.N]

        # score the sentences with the top-N keywords
        scored_sentences = self._score_sentences(sentences, topn_words)

        # take the TOP_SENTENCES highest-scoring sentences, then restore document order
        top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-self.TOP_SENTENCES:]
        top_n_scored = sorted(top_n_scored, key=lambda s: s[0])
        summarySentences = []
        for (idx, score) in top_n_scored:
            print(sentences[idx])
            summarySentences.append(sentences[idx])

        return summarySentences



if __name__ == '__main__':
    obj = SummaryTxt('stopword.txt')

    # txt=u'十八大以來的五年,是黨和國家發展進程中極不平凡的五年。面對世界經濟復甦乏力、局部衝突和動盪頻發、全球性問題加劇的外部環境,面對我國經濟發展進入新常態等一系列深刻變化,我們堅持穩中求進工作總基調,迎難而上,開拓進取,取得了改革開放和社會主義現代化建設的歷史性成就。' \
    #     u'爲貫徹十八大精神,黨中央召開七次全會,分別就政府機構改革和職能轉變、全面深化改革、全面推進依法治國、制定“十三五”規劃、全面從嚴治黨等重大問題作出決定和部署。五年來,我們統籌推進“五位一體”總體佈局、協調推進“四個全面”戰略佈局,“十二五”規劃勝利完成,“十三五”規劃順利實施,黨和國家事業全面開創新局面。' \
    #     u'經濟建設取得重大成就。堅定不移貫徹新發展理念,堅決端正發展觀念、轉變發展方式,發展質量和效益不斷提升。經濟保持中高速增長,在世界主要國家中名列前茅,國內生產總值從五十四萬億元增長到八十萬億元,穩居世界第二,對世界經濟增長貢獻率超過百分之三十。供給側結構性改革深入推進,經濟結構不斷優化,數字經濟等新興產業蓬勃發展,高鐵、公路、橋樑、港口、機場等基礎設施建設快速推進。農業現代化穩步推進,糧食生產能力達到一萬二千億斤。城鎮化率年均提高一點二個百分點,八千多萬農業轉移人口成爲城鎮居民。區域發展協調性增強,“一帶一路”建設、京津冀協同發展、長江經濟帶發展成效顯著。創新驅動發展戰略大力實施,創新型國家建設成果豐碩,天宮、蛟龍、天眼、悟空、墨子、大飛機等重大科技成果相繼問世。南海島礁建設積極推進。開放型經濟新體制逐步健全,對外貿易、對外投資、外匯儲備穩居世界前列。' \
    #     u'全面深化改革取得重大突破。蹄疾步穩推進全面深化改革,堅決破除各方面體制機制弊端。改革全面發力、多點突破、縱深推進,着力增強改革系統性、整體性、協同性,壓茬拓展改革廣度和深度,推出一千五百多項改革舉措,重要領域和關鍵環節改革取得突破性進展,主要領域改革主體框架基本確立。中國特色社會主義制度更加完善,國家治理體系和治理能力現代化水平明顯提高,全社會發展活力和創新活力明顯增強。'

    # txt ='The information disclosed by the Film Funds Office of the State Administration of Press, Publication, Radio, Film and Television shows that, the total box office in China amounted to nearly 3 billion yuan during the first six days of the lunar year (February 8 - 13), an increase of 67% compared to the 1.797 billion yuan in the Chinese Spring Festival period in 2015, becoming the "Best Chinese Spring Festival Period in History".' \
    #      'During the Chinese Spring Festival period, "The Mermaid" contributed to a box office of 1.46 billion yuan. "The Man From Macau III" reached a box office of 680 million yuan. "The Journey to the West: The Monkey King 2" had a box office of 650 million yuan. "Kung Fu Panda 3" also had a box office of exceeding 130 million. These four blockbusters together contributed more than 95% of the total box office during the Chinese Spring Festival period.' \
    #      'There were many factors contributing to the popularity during the Chinese Spring Festival period. Apparently, the overall popular film market with good box office was driven by the emergence of a few blockbusters. In fact, apart from the appeal of the films, other factors like film ticket subsidy of online seat-selection companies, cinema channel sinking and the film-viewing heat in the middle and small cities driven by the home-returning wave were all main factors contributing to this blowout. A management of Shanghai Film Group told the 21st Century Business Herald.'
    txt = 'Monetary policy summary The Bank of England’s Monetary Policy Committee (MPC) sets monetary policy in order to meet the 2% inflation target and in a way that helps to sustain growth and employment.  At its meeting ending on 5 August 2015, the MPC voted by a majority of 8-1 to maintain Bank Rate at 0.5%.  The Committee voted unanimously to maintain the stock of purchased assets financed by the issuance of central bank reserves at £375 billion, and so to reinvest the £16.9 billion of cash flows associated with the redemption of the September 2015 gilt held in the Asset Purchase Facility. CPI inflation fell back to zero in June.  As set out in the Governor’s open letter to the Chancellor, around three quarters of the deviation of inflation from the 2% target, or 1½ percentage points, reflects unusually low contributions from energy, food, and other imported goods prices.  The remaining quarter of the deviation of inflation from target, or ½ a percentage point, reflects the past weakness of domestic cost growth, and unit labour costs in particular.  The combined weakness in domestic costs and imported goods prices is evident in subdued core inflation, which on most measures is currently around 1%. With some underutilised resources remaining in the economy and with inflation below the target, the Committee intends to set monetary policy in order to ensure that growth is sufficient to absorb the remaining economic slack so as to return inflation to the target within two years.  Conditional upon Bank Rate following the gently rising path implied by market yields, the Committee judges that this is likely to be achieved. In its latest economic projections, the Committee projects UK-weighted world demand to expand at a moderate pace.  Growth in advanced economies is expected to be a touch faster, and growth in emerging economies a little slower, than in the past few years.  The support to UK exports from steady global demand growth is expected to be counterbalanced, however, by the effect of the past appreciation of sterling.  Risks to global growth are judged to be skewed moderately to the downside reflecting, for example, risks to activity in the euro area and China. Private domestic demand growth in the United Kingdom is expected to remain robust.  Household spending has been supported by the boost to real incomes from lower food and energy prices.  Wage growth has picked up as the labour market has tightened and productivity has strengthened.  Business and consumer confidence remain high, while credit conditions have continued to improve, with historically low mortgage rates providing support to activity in the housing market.  Business investment has made a substantial contribution to growth in recent years.  Firms have invested to expand capacity, supported by accommodative financial conditions.  Despite weakening slightly, surveys suggest continued robust investment growth ahead.  This will support the continuing increase of underlying productivity growth towards past average rates. Robust private domestic demand is expected to produce sufficient momentum to eliminate the margin of spare capacity over the next year or so, despite the continuing fiscal consolidation and modest global growth.  This is judged likely to generate the rise in domestic costs expected to be necessary to return inflation to the target in the medium term. The near-term outlook for inflation is muted.  
The falls in energy prices of the past few months will continue to bear down on inflation at least until the middle of next year.  Nonetheless, a range of measures suggest that medium-term inflation expectations remain well anchored.  There is little evidence in wage settlements or spending patterns of any deflationary mindset among businesses and households. Sterling has appreciated by 3½% since May and 20% since its trough in March 2013.  The drag on import prices from this appreciation will continue to push down on inflation for some time to come, posing a downside risk to its path in the near term.  Set against that, the degree of slack in the economy has diminished substantially over the past two and a half years.  The unemployment rate has fallen by more than 2 percentage points since the middle of 2013, and the ratio of job vacancies to unemployment has returned from well below to around its pre-crisis average.  The margin of spare capacity is currently judged to be around ½% of GDP, with a range of views among MPC members around that central estimate.  A further modest tightening of the labour market is expected, supporting a continued firming in the growth of wages and unit labour costs over the next three years, counterbalancing the drag on inflation from sterling. Were Bank Rate to follow the gently rising path implied by market yields, the Committee judges that demand growth would be sufficient to return inflation to the target within two years.  In its projections, inflation then moves slightly above the target in the third year of the forecast period as sustained growth leads to a degree of excess demand. Underlying those projections are significant judgements in a number of areas, as described in the August Inflation Report.  In any one of these areas, developments might easily turn out differently than assumed with implications for the outlook for growth and inflation, and therefore for the appropriate stance of monetary policy.  Reflecting that, there is a spread of views among MPC members about the balance of risks to inflation relative to the best collective judgement presented in the August Report.  At the Committee’s meeting ending on 5 August, the majority of MPC members judged it appropriate to leave the stance of monetary policy unchanged at present.  Ian McCafferty preferred to increase Bank Rate by 25 basis points, given his view that demand growth and wage pressures were likely to be greater, and the margin of spare capacity smaller, than embodied in the Committee’s collective August projections. All members agree that, given the likely persistence of the headwinds weighing on the economy, when Bank Rate does begin to rise, it is expected to do so more gradually and to a lower level than in recent cycles.  This guidance is an expectation, not a promise.  The actual path Bank Rate will follow over the next few years will depend on the economic circumstances.  The Committee will continue to monitor closely the incoming data.'
    print(txt)
    print("--")
    obj.summaryScoredtxt(txt)

    print("----")
    obj.summaryTopNtxt(txt)

Java implementation:

//import com.hankcs.hanlp.HanLP;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

//import com.hankcs.hanlp.seg.common.Term;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
/**
 * @Author:sks
 * @Description: Extracts the key sentences of a text as a summary, using the positions
 * of the top-n keywords within each sentence. Three ways to select the summary sentences:
 * 1. mean and standard deviation, 2. top-n sentences, 3. maximal marginal relevance top-n sentences.
 * @Date:Created in 16:40 2017/12/22
 * @Modified by:
 **/
public class NewsSummary {
    // number of keywords to keep
    int N = 50;

    // distance threshold between keywords within one cluster
    int CLUSTER_THRESHOLD = 5;

    // number of top-n sentences to return
    int TOP_SENTENCES = 10;

    // maximal marginal relevance trade-off parameter
    double λ = 0.4;

    // supported sentence-scoring styles
    final Set<String> styleSet = new HashSet<String>();

    // stop word list
    Set<String> stopWords = new HashSet<String>();

    // sentence index and its token list
    Map<Integer,List<String>> sentSegmentWords = null;

    public NewsSummary(){
        this.loadStopWords("D:\\work\\Solr\\solr-python\\CNstopwords.txt");
        styleSet.add("meanstd");
        styleSet.add("default");
        styleSet.add("MMR");
    }

    /**
     * Load the stop word list.
     * @param path path to the stop word file
     */
    private void loadStopWords(String path){
        BufferedReader br=null;
        try
        {
            InputStreamReader reader = new InputStreamReader(new FileInputStream(path),"utf-8");
            br = new BufferedReader(reader);
            String line=null;
            while((line=br.readLine())!=null)
            {
                stopWords.add(line);
            }
            br.close();
        }catch(IOException e){
            e.printStackTrace();
        }
    }

    /**
     * @Author:sks
     * @Description: Split the text into sentences using a regular expression.
     * @Date:
     */
    private List<String> SplitSentences(String text){
        List<String> sentences = new ArrayList<String>();
        String regEx = "[!?。!?.]";
        Pattern p = Pattern.compile(regEx);
        String[] sents = p.split(text);
        Matcher m = p.matcher(text);

        int sentsLen = sents.length;
        if(sentsLen>0)
        {  // re-attach each sentence's trailing punctuation mark
            int index = 0;
            while(index < sentsLen)
            {
                if(m.find())
                {
                    sents[index] += m.group();
                }
                index++;
            }
        }
        for(String sentence:sents){
            // strip HTML entities left over from text copied from web pages
            sentence=sentence.replaceAll("(&rdquo;|&ldquo;|&mdash;|&lsquo;|&rsquo;|&middot;|&quot;|&darr;|&bull;)", "");
            sentences.add(sentence.trim());
        }
        return sentences;
    }

    /**
     * Tokenize text with the IK analyzer.
     * @param text input text
     * @return list of tokens, with stop words and single-character tokens removed
     */
    private List<String> IKSegment(String text){
        List<String> wordList = new ArrayList<String>();
        Reader reader = new StringReader(text);
        IKSegmenter ik = new IKSegmenter(reader,true);
        Lexeme lex = null;
        try {
            while((lex=ik.next())!=null)
            {
                String word=lex.getLexemeText();
                if(word.equals("nbsp") || this.stopWords.contains(word))
                    continue;
                if(word.length()>1 && !word.equals("\t"))
                    wordList.add(word);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return wordList;
    }


    /**
     * Score each sentence as keywordsLen*keywordsLen/totalWordsLen of its best keyword cluster.
     * @param sentences list of sentences
     * @param topnWords top-n keywords
     * @return map of sentence index to score
     */
    private Map<Integer,Double> scoreSentences(List<String> sentences,List<String> topnWords){
        Map<Integer, Double> scoresMap=new LinkedHashMap<Integer,Double>();//sentence index -> score
        sentSegmentWords=new HashMap<Integer,List<String>>();
        int sentence_idx=-1;//sentence index
        for(String sentence:sentences){
            sentence_idx+=1;
            List<String> words=this.IKSegment(sentence);//tokenize each sentence
//            List<String> words= HanLP.segment(sentence);
            sentSegmentWords.put(sentence_idx, words);
            List<Integer> word_idx=new ArrayList<Integer>();//position of each keyword in this sentence
            for(String word:topnWords){
                if(words.contains(word)){
                    word_idx.add(words.indexOf(word));//only the first occurrence is used
                }
            }
            if(word_idx.size()==0)
                continue;
            Collections.sort(word_idx);
            //Group consecutive keyword positions into clusters using the distance threshold.
            List<List<Integer>> clusters=new ArrayList<List<Integer>>();//keyword clusters of this sentence
            List<Integer> cluster=new ArrayList<Integer>();
            cluster.add(word_idx.get(0));
            int i=1;
            while(i<word_idx.size()){
                if((word_idx.get(i)-word_idx.get(i-1))<this.CLUSTER_THRESHOLD)
                    cluster.add(word_idx.get(i));
                else{
                    clusters.add(cluster);
                    cluster=new ArrayList<Integer>();
                    cluster.add(word_idx.get(i));
                }
                i+=1;
            }
            clusters.add(cluster);
            //Score each cluster; the highest cluster score becomes the sentence score.
            double max_cluster_score=0.0;
            for(List<Integer> clu:clusters){
                int keywordsLen=clu.size();//number of keywords in the cluster
                int totalWordsLen=clu.get(keywordsLen-1)-clu.get(0)+1;//total number of words spanned
                double score=1.0*keywordsLen*keywordsLen/totalWordsLen;
                if(score>max_cluster_score)
                    max_cluster_score=score;
            }
            scoresMap.put(sentence_idx,max_cluster_score);
        }
        return scoresMap;
    }


    /**
    * @Author:sks
    * @Description: Summarize by filtering sentences with the mean and standard deviation of the scores.
    * @Date:
    */
    public String SummaryMeanstdTxt(String text){
        //split the text into a list of sentences
        List<String> sentencesList = this.SplitSentences(text);

        //tokenize the text with the IK analyzer
        List<String> words = this.IKSegment(text);
//        List<Term> words1= HanLP.segment(text);

        //count token frequencies
        Map<String,Integer> wordsMap = new HashMap<String,Integer>();
        for(String word:words){
            Integer val = wordsMap.get(word);
            wordsMap.put(word,val == null ? 1: val + 1);
        }

        //use a priority queue to sort tokens by frequency
        Queue<Map.Entry<String, Integer>> wordsQueue=new PriorityQueue<Map.Entry<String,Integer>>(wordsMap.size(),new Comparator<Map.Entry<String,Integer>>()
        {
            @Override
            public int compare(Entry<String, Integer> o1,
                               Entry<String, Integer> o2) {
                return o2.getValue()-o1.getValue();
            }
        });
        wordsQueue.addAll(wordsMap.entrySet());

        if( N > wordsMap.size())
            N = wordsQueue.size();

        //keep the N most frequent tokens as the top-n keywords
        List<String> wordsList = new ArrayList<String>(N);
        for(int i = 0;i < N;i++){
            Entry<String,Integer> entry= wordsQueue.poll();
            wordsList.add(entry.getKey());
        }

        //score the sentences with the keywords; scores are keyed by sentence index in document order
        Map<Integer,Double> scoresLinkedMap = scoreSentences(sentencesList,wordsList);

        //approach 1: filter out unimportant sentences with the mean and standard deviation
        Map<Integer,String> keySentence = new LinkedHashMap<Integer,String>();

        //mean of the sentence scores
        double sentenceMean = 0.0;
        for(double value:scoresLinkedMap.values()){
            sentenceMean += value;
        }
        sentenceMean /= scoresLinkedMap.size();

        //standard deviation of the sentence scores
        double sentenceStd=0.0;
        for(Double score:scoresLinkedMap.values()){
            sentenceStd += Math.pow((score-sentenceMean), 2);
        }
        sentenceStd = Math.sqrt(sentenceStd / scoresLinkedMap.size());

        for(Map.Entry<Integer, Double> entry:scoresLinkedMap.entrySet()){
            //keep only sentences scoring above mean + 0.5 * std
            if(entry.getValue()>(sentenceMean+0.5*sentenceStd))
                keySentence.put(entry.getKey(), sentencesList.get(entry.getKey()));
        }

        StringBuilder sb = new StringBuilder();
        for(int index:keySentence.keySet())
            sb.append(keySentence.get(index));
        return sb.toString();
    }

    /**
    * @Author:sks
    * @Description: Default approach, returning the top-n highest-scoring sentences.
    * @Date:
    */
    public String SummaryTopNTxt(String text){
        //split the text into a list of sentences
        List<String> sentencesList = this.SplitSentences(text);

        //tokenize the text with the IK analyzer
        List<String> words = this.IKSegment(text);
//        List<Term> words1= HanLP.segment(text);

        //count token frequencies
        Map<String,Integer> wordsMap = new HashMap<String,Integer>();
        for(String word:words){
            Integer val = wordsMap.get(word);
            wordsMap.put(word,val == null ? 1: val + 1);
        }

        //use a priority queue to sort tokens by frequency
        Queue<Map.Entry<String, Integer>> wordsQueue=new PriorityQueue<Map.Entry<String,Integer>>(wordsMap.size(),new Comparator<Map.Entry<String,Integer>>()
        {
            @Override
            public int compare(Entry<String, Integer> o1,
                               Entry<String, Integer> o2) {
                return o2.getValue()-o1.getValue();
            }
        });
        wordsQueue.addAll(wordsMap.entrySet());

        if( N > wordsMap.size())
            N = wordsQueue.size();

        //keep the N most frequent tokens as the top-n keywords
        List<String> wordsList = new ArrayList<String>(N);
        for(int i = 0;i < N;i++){
            Entry<String,Integer> entry= wordsQueue.poll();
            wordsList.add(entry.getKey());
        }

        //score the sentences with the keywords; scores are keyed by sentence index in document order
        Map<Integer,Double> scoresLinkedMap = scoreSentences(sentencesList,wordsList);
        //sort the (sentence index, score) entries by score in descending order
        List<Map.Entry<Integer, Double>> sortedSentList = new ArrayList<Map.Entry<Integer,Double>>(scoresLinkedMap.entrySet());
        Collections.sort(sortedSentList, new Comparator<Map.Entry<Integer, Double>>(){
            @Override
            public int compare(Entry<Integer, Double> o1,Entry<Integer, Double> o2) {
                return Double.compare(o2.getValue(), o1.getValue());
            }
        });

        //approach 2: return the top-n highest-scoring sentences, in document order
        Map<Integer,String> keySentence = new TreeMap<Integer,String>();
        int count = 0;
        for(Map.Entry<Integer, Double> entry:sortedSentList){
            count++;
            keySentence.put(entry.getKey(), sentencesList.get(entry.getKey()));
            if(count == this.TOP_SENTENCES)
                break;
        }

        StringBuilder sb=new StringBuilder();
        for(int index:keySentence.keySet())
            sb.append(keySentence.get(index));
        return sb.toString();
    }

    /**
    * @Author:sks
    * @Description: Summarize using maximal marginal relevance (MMR).
    * @Date:
    */
    public String SummaryMMRNTxt(String text){
        //split the text into a list of sentences
        List<String> sentencesList = this.SplitSentences(text);

        //tokenize the text with the IK analyzer
        List<String> words = this.IKSegment(text);
//        List<Term> words1= HanLP.segment(text);

        //count token frequencies
        Map<String,Integer> wordsMap = new HashMap<String,Integer>();
        for(String word:words){
            Integer val = wordsMap.get(word);
            wordsMap.put(word,val == null ? 1: val + 1);
        }

        //use a priority queue to sort tokens by frequency
        Queue<Map.Entry<String, Integer>> wordsQueue=new PriorityQueue<Map.Entry<String,Integer>>(wordsMap.size(),new Comparator<Map.Entry<String,Integer>>()
        {
            @Override
            public int compare(Entry<String, Integer> o1,
                               Entry<String, Integer> o2) {
                return o2.getValue()-o1.getValue();
            }
        });
        wordsQueue.addAll(wordsMap.entrySet());

        if( N > wordsMap.size())
            N = wordsQueue.size();

        //keep the N most frequent tokens as the top-n keywords
        List<String> wordsList = new ArrayList<String>(N);
        for(int i = 0;i < N;i++){
            Entry<String,Integer> entry= wordsQueue.poll();
            wordsList.add(entry.getKey());
        }

        //score the sentences with the keywords; scores are keyed by sentence index in document order
        Map<Integer,Double> scoresLinkedMap = scoreSentences(sentencesList,wordsList);
        //sort the (sentence index, score) entries by score in descending order
        List<Map.Entry<Integer, Double>> sortedSentList = new ArrayList<Map.Entry<Integer,Double>>(scoresLinkedMap.entrySet());
        Collections.sort(sortedSentList, new Comparator<Map.Entry<Integer, Double>>(){
            @Override
            public int compare(Entry<Integer, Double> o1,Entry<Integer, Double> o2) {
                return Double.compare(o2.getValue(), o1.getValue());
            }
        });

        //approach 3: re-rank with maximal marginal relevance and return the top-n sentences
        if(sentencesList.size()==2)
        {
            return sentencesList.get(0)+sentencesList.get(1);
        }
        else if(sentencesList.size()==1)
            return sentencesList.get(0);
        Map<Integer,String> keySentence = new TreeMap<Integer,String>();
        int count = 0;
        Map<Integer,Double> MMR_SentScore = MMR(sortedSentList);
        for(Map.Entry<Integer, Double> entry:MMR_SentScore.entrySet())
        {
            count++;
            int sentIndex=entry.getKey();
            String sentence=sentencesList.get(sentIndex);
            keySentence.put(sentIndex, sentence);
            if(count==this.TOP_SENTENCES)
                break;
        }

        StringBuilder sb=new StringBuilder();
        for(int index:keySentence.keySet())
            sb.append(keySentence.get(index));
        return sb.toString();
    }
    /**
     * Compute a text summary.
     * @param text input text
     * @param style one of meanstd, default, MMR
     * @return the summary
     */
    public String summarize(String text,String style){
        try {
            if(!styleSet.contains(style) || text.trim().equals(""))
                throw new IllegalArgumentException("summarize(String text,String style): text must not be empty and style must be meanstd, default or MMR");
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }

        //split the text into a list of sentences
        List<String> sentencesList = this.SplitSentences(text);

        //tokenize the text with the IK analyzer
        List<String> words = this.IKSegment(text);
//        List<Term> words1= HanLP.segment(text);

        //count token frequencies
        Map<String,Integer> wordsMap = new HashMap<String,Integer>();
        for(String word:words){
            Integer val = wordsMap.get(word);
            wordsMap.put(word,val == null ? 1: val + 1);
        }

        //use a priority queue to sort tokens by frequency
        Queue<Map.Entry<String, Integer>> wordsQueue=new PriorityQueue<Map.Entry<String,Integer>>(wordsMap.size(),new Comparator<Map.Entry<String,Integer>>()
        {
            @Override
            public int compare(Entry<String, Integer> o1,
                               Entry<String, Integer> o2) {
                return o2.getValue()-o1.getValue();
            }
        });
        wordsQueue.addAll(wordsMap.entrySet());

        if( N > wordsMap.size())
            N = wordsQueue.size();

        //keep the N most frequent tokens as the top-n keywords
        List<String> wordsList = new ArrayList<String>(N);
        for(int i = 0;i < N;i++){
            Entry<String,Integer> entry= wordsQueue.poll();
            wordsList.add(entry.getKey());
        }

        //score the sentences with the keywords; scores are keyed by sentence index in document order
        Map<Integer,Double> scoresLinkedMap = scoreSentences(sentencesList,wordsList);

        Map<Integer,String> keySentence=null;

        //approach 1: filter out unimportant sentences with the mean and standard deviation
        if(style.equals("meanstd"))
        {
            keySentence = new LinkedHashMap<Integer,String>();

            //mean of the sentence scores
            double sentenceMean = 0.0;
            for(double value:scoresLinkedMap.values()){
                sentenceMean += value;
            }
            sentenceMean /= scoresLinkedMap.size();

            //standard deviation of the sentence scores
            double sentenceStd=0.0;
            for(Double score:scoresLinkedMap.values()){
                sentenceStd += Math.pow((score-sentenceMean), 2);
            }
            sentenceStd = Math.sqrt(sentenceStd / scoresLinkedMap.size());

            for(Map.Entry<Integer, Double> entry:scoresLinkedMap.entrySet()){
                //keep only sentences scoring above mean + 0.5 * std
                if(entry.getValue()>(sentenceMean+0.5*sentenceStd))
                    keySentence.put(entry.getKey(), sentencesList.get(entry.getKey()));
            }
        }

        //sort the (sentence index, score) entries by score in descending order
        List<Map.Entry<Integer, Double>> sortedSentList = new ArrayList<Map.Entry<Integer,Double>>(scoresLinkedMap.entrySet());
        Collections.sort(sortedSentList, new Comparator<Map.Entry<Integer, Double>>(){
            @Override
            public int compare(Entry<Integer, Double> o1,Entry<Integer, Double> o2) {
                return Double.compare(o2.getValue(), o1.getValue());
            }
        });

        //approach 2: return the top-n highest-scoring sentences, in document order
        if(style.equals("default")){
            keySentence = new TreeMap<Integer,String>();
            int count = 0;
            for(Map.Entry<Integer, Double> entry:sortedSentList){
                count++;
                keySentence.put(entry.getKey(), sentencesList.get(entry.getKey()));
                if(count == this.TOP_SENTENCES)
                    break;
            }
        }

        //approach 3: re-rank with maximal marginal relevance and return the top-n sentences
        if(style.equals("MMR"))
        {
            if(sentencesList.size()==2)
            {
                return sentencesList.get(0)+sentencesList.get(1);
            }
            else if(sentencesList.size()==1)
                return sentencesList.get(0);
            keySentence = new TreeMap<Integer,String>();
            int count = 0;
            Map<Integer,Double> MMR_SentScore = MMR(sortedSentList);
            for(Map.Entry<Integer, Double> entry:MMR_SentScore.entrySet())
            {
                count++;
                int sentIndex=entry.getKey();
                String sentence=sentencesList.get(sentIndex);
                keySentence.put(sentIndex, sentence);
                if(count==this.TOP_SENTENCES)
                    break;
            }
        }
        StringBuilder sb=new StringBuilder();
        for(int index:keySentence.keySet())
            sb.append(keySentence.get(index));
        return sb.toString();
    }

    /**
     * Maximal Marginal Relevance (MMR): trades accuracy against diversity via λ.
     * mmr(i) = λ*score(i) - (1-λ)*max[similarity(i,j)], where score(i) is the score of
     * sentence i and similarity(i,j) is the similarity between sentences i and j.
     * User-tunable diversity through the λ parameter:
     * - high λ: higher accuracy
     * - low λ: higher diversity
     * @param sortedSentList sentence (index, score) entries sorted by score
     * @return map of sentence index to MMR score
     */
    private Map<Integer,Double> MMR(List<Map.Entry<Integer, Double>> sortedSentList){
        double[][] simSentArray=sentJSimilarity();//pairwise similarity of all sentences
        Map<Integer,Double> sortedLinkedSent=new LinkedHashMap<Integer,Double>();
        for(Map.Entry<Integer, Double> entry:sortedSentList){
            sortedLinkedSent.put(entry.getKey(),entry.getValue());
        }
        Map<Integer,Double> MMR_SentScore=new LinkedHashMap<Integer,Double>();//final MMR scores (sentence index -> score)
        Map.Entry<Integer, Double> Entry=sortedSentList.get(0);//start by adding the highest-scoring sentence
        MMR_SentScore.put(Entry.getKey(), Entry.getValue());
        boolean flag=true;
        while(flag){
            int index=0;
            double maxScore=Double.NEGATIVE_INFINITY;//find the sentence with the highest MMR score in this iteration
            for(Map.Entry<Integer, Double> entry:sortedLinkedSent.entrySet()){
                if(MMR_SentScore.containsKey(entry.getKey())) continue;
                double simSentence=0.0;
                for(Map.Entry<Integer, Double> MMREntry:MMR_SentScore.entrySet()){//maximum similarity to any already-selected sentence
                    double simSen=0.0;
                    if(entry.getKey()>MMREntry.getKey())
                        simSen=simSentArray[MMREntry.getKey()][entry.getKey()];
                    else
                        simSen=simSentArray[entry.getKey()][MMREntry.getKey()];
                    if(simSen>simSentence){
                        simSentence=simSen;
                    }
                }
                simSentence=λ*entry.getValue()-(1-λ)*simSentence;
                if(simSentence>maxScore){
                    maxScore=simSentence;
                    index=entry.getKey();//sentence index
                }
            }
            MMR_SentScore.put(index, maxScore);
            if(MMR_SentScore.size()==sortedLinkedSent.size())
                flag=false;
        }
        return MMR_SentScore;
    }

    /**
     * Pairwise sentence similarity, computed for every sentence pair with simple Jaccard similarity.
     * @return matrix of pairwise similarities (upper triangle filled)
     */
    private double[][] sentJSimilarity(){
        //System.out.println("sentJSimilarity in...");
        int size=sentSegmentWords.size();
        double[][] simSent=new double[size][size];
        for(Map.Entry<Integer, List<String>> entry:sentSegmentWords.entrySet()){
            for(Map.Entry<Integer, List<String>> entry1:sentSegmentWords.entrySet()){
                if(entry.getKey()>=entry1.getKey()) continue;
                int commonWords=0;
                double sim=0.0;
                for(String entryStr:entry.getValue()){
                    if(entry1.getValue().contains(entryStr))
                        commonWords++;
                }
                sim=1.0*commonWords/(entry.getValue().size()+entry1.getValue().size()-commonWords);
                simSent[entry.getKey()][entry1.getKey()]=sim;
            }
        }
        //System.out.println("sentJSimilarity out...");
        return simSent;
    }

    public static void main(String[] args){

        NewsSummary summary=new NewsSummary();


        String text="我國古代歷史演義小說的代表作。明代小說家羅貫中依據有關三國的歷史、雜記,在廣泛吸取民間傳說和民間藝人創作成果的基礎上,加工、再創作了這部長篇章回小說。" +
                "作品寫的是漢末到晉初這一歷史時期魏、蜀、吳三個封建統治集團間政治、軍事、外交等各方面的複雜鬥爭。通過這些描寫,揭露了社會的黑暗與腐朽,譴責了統治階級的殘暴與奸詐," +
                "反映了人民在動亂時代的苦難和明君仁政的願望。小說也反映了作者對農民起義的偏見,以及因果報應和宿命論等思想。戰爭描寫是《三國演義》突出的藝術成就。" +
                "這部小說通過驚心動魄的軍事、政治鬥爭,運用誇張、對比、烘托、渲染等藝術手法,成功地塑造了諸葛亮、曹操、關羽、張飛等一批鮮明、生動的人物形象。" +
                "《三國演義》結構宏偉而又嚴密精巧,語言簡潔、明快、生動。有的評論認爲這部作品在藝術上的不足之處是人物性格缺乏發展變化,有的人物渲染誇張過分導致失真。" +
                "《三國演義》標誌着歷史演義小說的輝煌成就。在傳播政治、軍事鬥爭經驗、推動歷史演義創作的繁榮等方面都起過積極作用。" +
                "《三國演義》的版本主要有明嘉靖刻本《三國志通俗演義》和清毛宗崗增刪評點的《三國志演義》";


        String keySentences=summary.SummaryMeanstdTxt(text);
        System.out.println("summary: "+keySentences);

        String topSentences=summary.SummaryTopNTxt(text);
        System.out.println("summary: "+topSentences);

        String mmrSentences=summary.SummaryMMRNTxt(text);
        System.out.println("summary: "+mmrSentences);


    }
}
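
The Python version above stops at mean/std filtering and plain top-n selection; the Java class adds the MMR re-ranking. For reference, here is a minimal Python sketch of the same greedy MMR idea (the toy scores, token lists, and default lam value are made-up illustrations; the Java code uses λ = 0.4 and Jaccard similarity):

def mmr_select(scores, similarity, lam=0.4, top_n=5):
    """Greedily pick sentences maximizing lam*score(i) - (1-lam)*max_j similarity(i, j)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = {ranked[0]: scores[ranked[0]]}  # seed with the highest-scoring sentence
    while len(selected) < min(top_n, len(scores)):
        best_idx, best_score = None, float('-inf')
        for i in scores:
            if i in selected:
                continue
            # penalty: maximum similarity to any already-selected sentence
            max_sim = max(similarity(i, j) for j in selected)
            mmr = lam * scores[i] - (1 - lam) * max_sim
            if mmr > best_score:
                best_idx, best_score = i, mmr
        selected[best_idx] = best_score
    return selected

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / float(len(sa | sb))

# toy example: sentence 1 nearly duplicates sentence 0, so MMR prefers sentence 2
sent_tokens = {0: ['rates', 'held', 'at', 'half', 'percent'],
               1: ['rates', 'kept', 'at', 'half', 'percent'],
               2: ['inflation', 'fell', 'to', 'zero']}
scores = {0: 2.0, 1: 1.9, 2: 1.5}
sim = lambda i, j: jaccard(sent_tokens[i], sent_tokens[j])
print(mmr_select(scores, sim, top_n=2))  # {0: 2.0, 2: 0.6}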

 

Alternatively, you can use the HanLP Chinese language processing package, which ships with a summarization method:

  

  import com.hankcs.hanlp.HanLP;
  import com.hankcs.hanlp.corpus.tag.Nature;
  import com.hankcs.hanlp.seg.common.Term;
  // summarize; 200 is the maximum length of the summary
  String summary = HanLP.getSummary(txt, 200);

  This requires hanlp-portable-1.5.2.jar and hanlp-solr-plugin-1.1.2.jar.

 
