[Java Office Automation (8)] -- Automatic News Classification with Naive Bayes

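Automatic news classification is simple: it just takes a hundred million little details, and 2,000 years later the data is all classified. I was terrified.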

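We have already used Naive Bayes to filter spam automatically and to detect gender from a person's name. The same approach lets us classify articles automatically, which is what we implement today. First, we need a large amount of text data...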

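1. Feature representation

The words that appear in a news article can be represented as a feature vector, for example X = {yesterday, is, domestic, investment, market, ...}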
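
As a minimal sketch of this representation, the snippet below turns an already segmented article into a word-to-count map. The class name `BagOfWords`, the method `toFeatures`, and the sample tokens are hypothetical and only serve as illustration; the real project segments text with IK Analyzer first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original project.
class BagOfWords {
    /** Counts how often each token occurs; insertion order is kept for readability. */
    static Map<String, Integer> toFeatures(List<String> tokens) {
        Map<String, Integer> features = new LinkedHashMap<>();
        for (String token : tokens) {
            // Start at 1 on first occurrence, otherwise add 1 to the existing count
            features.merge(token, 1, Integer::sum);
        }
        return features;
    }

    public static void main(String[] args) {
        // Assumed sample: an article that has already been split into words
        List<String> tokens = Arrays.asList(
                "yesterday", "is", "domestic", "investment", "market", "investment");
        System.out.println(toFeatures(tokens));
    }
}
```

Keeping the counts, rather than only the set of words, is what the multinomial variant of Naive Bayes needs later on.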

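2. Feature selection

Some words in the feature set do little to help classification and can even introduce noise, so we need to trim a hundred million little details...
We remove words such as "is" and "yesterday"; after selection the features might be X = {domestic, investment, market, ...}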
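
One simple way to do this selection is a stop-word list. The sketch below assumes a tiny hard-coded list; the class name `StopWordFilter` and the list contents are hypothetical, and a real list would normally be loaded from a file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the original project.
class StopWordFilter {
    // Assumed stop-word list, for illustration only
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "yesterday"));

    /** Returns a copy of the features without stop words; the input map is left untouched. */
    static Map<String, Integer> select(Map<String, Integer> features) {
        Map<String, Integer> selected = new LinkedHashMap<>(features);
        // Removing keys from the key-set view removes the entries from the copied map
        selected.keySet().removeAll(STOP_WORDS);
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> features = new LinkedHashMap<>();
        features.put("yesterday", 1);
        features.put("is", 1);
        features.put("domestic", 1);
        features.put("investment", 2);
        features.put("market", 1);
        System.out.println(select(features));
    }
}
```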

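3. Model selection

Practical steps:
Create the folders, create the files, crawl with multiple threads, and use JS to simulate clicks so the page loads more content.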

for(var i = 0;i<100;i++){
    $(".more").click();
}

// Next file index: the number of existing files plus one
int b = f33.length + 1;
// Zero-pad the index to four digits and append the extension, e.g. 7 becomes "0007.txt"
key44 = String.format("%04d", b) + ".txt";

 * Multinomial Naive Bayes classification result
 * P(C_i|w_1,w_2...w_n) = P(w_1,w_2...w_n|C_i) * P(C_i) / P(w_1,w_2...w_n)
 * = P(w_1|C_i) * P(w_2|C_i)...P(w_n|C_i) * P(C_i) / (P(w_1) * P(w_2) ...P(w_n))
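
The formula can be sketched in code as follows. The denominator is the same for every class, so the sketch only compares the numerators. Two choices here are assumptions that the text above does not state: add-one (Laplace) smoothing, so that an unseen word does not force the whole product to zero, and summing logarithms instead of multiplying probabilities, so that long articles do not underflow. The class name `MultinomialNaiveBayes` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the original project's classifier.
class MultinomialNaiveBayes {
    private final Map<String, Integer> docCount = new HashMap<>();               // documents per class
    private final Map<String, Map<String, Integer>> wordCount = new HashMap<>(); // word counts per class
    private final Map<String, Integer> totalWords = new HashMap<>();             // total words per class
    private final Set<String> vocabulary = new HashSet<>();                      // all words seen in training
    private int totalDocs = 0;

    /** Adds one training document (word -> frequency) to the given class. */
    void train(String label, Map<String, Integer> doc) {
        docCount.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCount.computeIfAbsent(label, k -> new HashMap<>());
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
            totalWords.merge(label, e.getValue(), Integer::sum);
            vocabulary.add(e.getKey());
        }
    }

    /** log P(C_i) plus, for each word, its frequency times log P(w|C_i) with add-one smoothing. */
    double logScore(String label, Map<String, Integer> doc) {
        double score = Math.log((double) docCount.get(label) / totalDocs);
        Map<String, Integer> counts = wordCount.get(label);
        double denominator = totalWords.getOrDefault(label, 0) + vocabulary.size();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double numerator = counts.getOrDefault(e.getKey(), 0) + 1.0;
            score += e.getValue() * Math.log(numerator / denominator);
        }
        return score;
    }

    /** Returns the trained class with the highest log score, or null if nothing was trained. */
    String classify(Map<String, Integer> doc) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCount.keySet()) {
            double s = logScore(label, doc);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MultinomialNaiveBayes nb = new MultinomialNaiveBayes();
        Map<String, Integer> finance = new HashMap<>();
        finance.put("investment", 2);
        finance.put("market", 1);
        Map<String, Integer> sports = new HashMap<>();
        sports.put("match", 2);
        sports.put("team", 1);
        nb.train("finance", finance);
        nb.train("sports", sports);

        Map<String, Integer> query = new HashMap<>();
        query.put("investment", 1);
        System.out.println(nb.classify(query));
    }
}
```

The word-frequency maps produced by segmentation can be passed to `train` one file at a time, with the category folder name as the label.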

Tricky parts:

Depth-first traversal

Files.walkFileTree(Paths.get(trainFileDir.getAbsolutePath()), new SimpleFileVisitor<Path>() {
            @Override // Called for each file that is visited
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                musicList2.add(file.toFile());
                String filePath = file.toFile().getAbsolutePath();
                // Segment the text to get the words and word frequencies of each training document
                Map<String, Integer> contentSegs = null;
                try {
                    contentSegs = IKWordSegmentation.segString(FileOptionUtil.readFile(filePath));
                } catch (Exception e) {
                    e.printStackTrace();
                }
                // Group the per-file word frequencies by category (the training directory name)
                allTrainFileSegsMap
                        .computeIfAbsent(trainFileDir.getName(), k -> new HashMap<>())
                        .put(filePath, contentSegs);
                // Keep walking the rest of the tree
                return FileVisitResult.CONTINUE;
            }
        });

Bonus functions:

/**
     * Word-frequency count.
     *
     * @param content     the text to segment
     * @param frequencies word frequencies so far; key: word, value: number of occurrences (may be null)
     * @return the updated frequency map
     * @throws IOException if the segmenter fails to read the content
     */
    public static Map<String, Integer> count(String content, Map<String, Integer> frequencies) throws IOException {
        if (frequencies == null) {
            frequencies = new HashMap<>();
        }
        if (StringUtils.isBlank(content)) {
            return frequencies;
        }

        // The second argument (true) turns on IK Analyzer's smart segmentation mode
        IKSegmenter ikSegmenter = new IKSegmenter(new StringReader(content), true);

        Lexeme lexeme;
        while ((lexeme = ikSegmenter.next()) != null) {
            final String text = lexeme.getLexemeText();

            // Skip single-character tokens
            if (text.length() > 1) {
                // Start at 1 on first occurrence, otherwise increment
                frequencies.merge(text, 1, Integer::sum);
            }
        }

        return frequencies;
    }

    /**
     * Sorts entries by number of occurrences, from highest to lowest.
     *
     * @param data word frequencies; key: word, value: number of occurrences
     * @return the entries sorted by descending count
     */
    public static List<Map.Entry<String, Integer>> order(Map<String, Integer> data) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>(data.entrySet());
        // Integer.compare avoids the overflow risk of subtracting one count from the other
        result.sort((o1, o2) -> Integer.compare(o2.getValue(), o1.getValue()));
        return result;
    }

Bonus class:

public class FileOptionUtil {
    public static void main(String[] args) {
        readDirs("C:\\Users\\yanhui\\Desktop\\課件和代碼\\第2課\\Lecture_2\\Lecture_2\\Naive-Bayes-Text-Classifier\\Database\\SogouC\\Sample");
    }

    /** Recursively collects the absolute paths of all .txt files under the given directory. */
    public static List<String> readDirs(String absolutePath) {
        List<String> paths = new ArrayList<>();
        File dir = new File(absolutePath);
        // List the files and subdirectories directly under this path
        File[] arr = dir.listFiles();
        if (arr == null) { // not a directory, or it could not be read
            return paths;
        }
        for (File file : arr) {
            if (file.isFile() && file.getName().endsWith(".txt")) { // collect .txt files
                paths.add(file.getAbsolutePath());
            } else if (file.isDirectory()) { // recurse into subdirectories and keep their results
                paths.addAll(readDirs(file.getAbsolutePath()));
            }
        }
        return paths;
    }

    public static String readFile(String filePath) {
        return _txtUtils.readTxtFile(filePath);
    }
}

Paid content

To get the code, add WX (bin490647751) and pay 9.9 yuan to receive the [Java Office Automation] article series.
