Implementing a Local MapReduce with Multithreading to Find Shakespeare's Most-Used Words

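A recent course assignment at my school asked us to implement MapReduce ourselves, use it to find the words Shakespeare used most often, and write the result to a txt file.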

Writing the file-handling methods

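import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FileOperator {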
    public static String readFile(String path){
        StringBuilder res = new StringBuilder();
        try(InputStreamReader read = new InputStreamReader(new FileInputStream(path));
            BufferedReader reader = new BufferedReader(read)){
            while(true){
                String line = reader.readLine();
                if(line==null)
                    break;
                res.append(line).append(" ");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return res.toString();
    }

    public static String outputResultToFile(List<Map.Entry<String,Integer>> sortedReduce){
        File writeName = new File("./output.txt");
        try {
            if(writeName.exists()){
                if(!writeName.delete())
                    throw new IOException("File already exist and can't be deleted");
            }
            if(!writeName.createNewFile())
                throw new IOException("Failed in creating file");
        } catch (IOException e) {
            e.printStackTrace();
        }
        try(BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(writeName)))) {
            for(Map.Entry<String,Integer> entry:sortedReduce){
                out.write(entry.getKey()+","+entry.getValue());
                out.newLine();
            }
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "Succeed in computing";
    }
    public static List<Map.Entry<String,Integer>> sort(Map<String,Integer> reduce){
        List<Map.Entry<String, Integer>> sortedReduce = new ArrayList<>(reduce.entrySet());
        // descending by count; Integer.compare avoids the overflow risk of subtracting values
        sortedReduce.sort((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()));
        return sortedReduce;
    }
}

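The readFile() method reads the input txt file into a single string, which is then handed to the MapFunc class. outputResultToFile() writes the result produced by the ReduceFunc class to a file, after sort() has ordered it by count. Both the reader and the writer are the buffered variants, BufferedReader and BufferedWriter, which speed up file reads and writes by reducing the number of underlying I/O calls.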

Writing the Map class

My design uses the producer-consumer pattern between Map and Reduce, so the threads need a blocking queue as a buffer for passing data. First, a Transfer class:

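import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

public class Transfer {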
    public static final BlockingDeque<Map<String, List<Integer>>> buffer = new LinkedBlockingDeque<>();
    public static final BlockingDeque[] pipeline = new LinkedBlockingDeque[8];
    static {
        for (int i = 0; i < pipeline.length ; i++) {
            pipeline[i] = new LinkedBlockingDeque<Map<String,List<Integer>>>();
        }
    }
}

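The two class variables are the buffers. The first, buffer, is a single blocking queue; the second, pipeline, is an array of blocking queues. I wanted to test which performs better: all threads sharing one blocking queue, or each map/reduce thread pair having a queue of its own.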
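To make the per-pair layout concrete, here is a minimal sketch, separate from the project code. The class name QueueSketch, the method handOff, and the use of BlockingQueue with Integer payloads (instead of BlockingDeque with word maps) are assumptions made for illustration only. Producer i puts a value into queue i and consumer i takes from the same queue, so each consumer is guaranteed to receive its own producer's result; with one shared queue, any consumer could receive any producer's result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueSketch {
    // Hypothetical helper: one queue per producer/consumer pair.
    static int[] handOff(int pairs) {
        List<BlockingQueue<Integer>> queues = new ArrayList<>();
        for (int i = 0; i < pairs; i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
        int[] received = new int[pairs];
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < pairs; i++) {
            final int id = i;
            // producer: publishes its result into its own queue
            threads.add(new Thread(() -> {
                try {
                    queues.get(id).put(id * 10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            // consumer: blocks on take() until the matching producer has published
            threads.add(new Thread(() -> {
                try {
                    received[id] = queues.get(id).take();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(handOff(3))); // prints [0, 10, 20]
    }
}
```

Using a List of typed queues also sidesteps the raw-typed array that Java forces on a generic BlockingDeque[].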

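Next comes the Map class. stopWords is a Set holding the words that are excluded from the count. Its initialization logic lives in a static block, so it runs once during class initialization rather than once per instance.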

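The input string is converted into a String array in which each element is a word (or an empty string, which is skipped later); see the code for the exact logic.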
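As a rough illustration of that tokenization rule, the sketch below keeps letters, keeps an apostrophe only when it sits between two letters, and turns everything else into a space before splitting. TokenizeSketch and tokenize are hypothetical names, and the apostrophe check is a simplified variant of the rule, not a copy of the project's method.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // Hypothetical helper: letters stay, an apostrophe stays only between two letters.
    static List<String> tokenize(String text) {
        char[] chars = text.toCharArray();
        char[] out = new char[chars.length];
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            boolean innerApostrophe = c == '\'' && i > 0 && i < chars.length - 1
                    && Character.isLetter(chars[i - 1])
                    && Character.isLetter(chars[i + 1]);
            out[i] = (Character.isLetter(c) || innerApostrophe) ? c : ' ';
        }
        List<String> words = new ArrayList<>();
        // consecutive spaces produce empty strings, which are dropped here
        for (String w : new String(out).split(" ")) {
            if (!w.isEmpty()) {
                words.add(w.toLowerCase());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [tis, but, thy, name, that's, my, enemy]
        System.out.println(tokenize("'Tis but thy name, that's my enemy."));
    }
}
```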

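map() is the version that uses a single blocking queue; mapUsingPipeline() gives each map/reduce thread pair its own blocking queue. The tasks are managed by a thread pool so that threads can be reused, instead of creating a new thread for every task and discarding it afterwards. The CountDownLatch makes the main thread wait: only after all worker tasks have finished does the main thread shut down the pool. I use 8 tasks here, splitting the array into 8 slices so that each task processes one slice.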

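import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MapFunc {
    private static final Set<String> stopWords;
    // blocking queue (buffer): each task puts its result here for reduce to process
    private static final BlockingDeque<Map<String,List<Integer>>> blockingDeque =Transfer.buffer;
    // pipeline: one queue per map/reduce thread pair
    private static final BlockingDeque[] pipeline = Transfer.pipeline;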


    static {
        stopWords = new HashSet<>();
        String path1 = "./stopwords1.txt";
        String path2 = "./stopwords2.txt";
        getStopWords(path1);
        getStopWords(path2);
    }

    private static void getStopWords(String path) {
        try(InputStreamReader read = new InputStreamReader(new FileInputStream(path));
            BufferedReader reader = new BufferedReader(read)){
            while(true){
                String line = reader.readLine();
                if(line==null)
                    break;
                stopWords.add(line.toLowerCase());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }


    public void map(String string){
        String[] temp = generateUnfilteredWords(string);
        // process the slices in parallel
        ExecutorService executorService =  Executors.newCachedThreadPool();
        final int tasks = 8; // number of map tasks
        // the pool is shut down only after every task has counted down
        final CountDownLatch countDownLatch = new CountDownLatch(tasks);
        final int purSize = temp.length/tasks;
        for (int i = 0; i < tasks ; i++) {
            int finalI = i;
            executorService.execute(()->{
                Map<String,List<Integer>> keyValues = new ConcurrentHashMap<>();
                int index = finalI*purSize;
                // the last task also takes the remainder left over by the integer division
                int end = (finalI == tasks-1) ? temp.length : index+purSize;
                for (int j = index; j < end; j++) {
                    String element = temp[j].toLowerCase();
                    if (!element.equals("")&&!stopWords.contains(element)) {
                        // the list should be thread-safe
                        keyValues.computeIfAbsent(element, k -> new CopyOnWriteArrayList<>()).add(1);
                    }
                }
                try {
                    blockingDeque.put(keyValues);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                countDownLatch.countDown();
            });
        }

        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        executorService.shutdown();

    }

    public void mapUsingPipeline(String string){
        String[] temp = generateUnfilteredWords(string);
        // process the slices in parallel
        ExecutorService executorService =  Executors.newCachedThreadPool();
        final int tasks = 8; // number of map tasks
        // the pool is shut down only after every task has counted down
        final CountDownLatch countDownLatch = new CountDownLatch(tasks);
        final int purSize = temp.length/tasks;
        for (int i = 0; i < tasks ; i++) {
            int finalI = i;
            executorService.execute(()->{
                Map<String,List<Integer>> keyValues = new ConcurrentHashMap<>();
                int index = finalI*purSize;
                // the last task also takes the remainder left over by the integer division
                int end = (finalI == tasks-1) ? temp.length : index+purSize;
                for (int j = index; j < end; j++) {
                    String element = temp[j].toLowerCase();
                    if (!element.equals("")&&!stopWords.contains(element)) {
                        // the list should be thread-safe
                        keyValues.computeIfAbsent(element, k -> new CopyOnWriteArrayList<>()).add(1);
                    }
                }
                try {
                    // task i hands its result to reduce task i through its own queue
                    pipeline[finalI].put(keyValues);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                countDownLatch.countDown();
            });
        }

        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        executorService.shutdown();

    }


    public static String[] generateUnfilteredWords(String string){
        char[] chars = string.toCharArray();
        for (int i = 0; i <chars.length ; i++) {
            char c = chars[i];
            if(!isCharacter(c,chars,i)){
                chars[i] = ' ';
            }
        }
        return String.valueOf(chars).split(" ");
    }

    private static boolean isCharacter(char c,char[] chars,int index){
            return Character.isUpperCase(c) || Character.isLowerCase(c) ||
                    (index > 0 && index < chars.length - 1
                            && c == '\'' && chars[index + 1] != ' ' && chars[index - 1] != ' ');
        }
}
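
When an array is split with integer division, up to tasks-1 trailing elements fall outside the slices unless one slice absorbs the remainder. The sketch below isolates that slicing together with the thread pool and CountDownLatch pattern. SliceSketch, bounds, and countParallel are hypothetical names, and a fixed-size pool is used here as an assumption; it is not the project's code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class SliceSketch {
    // Hypothetical helper: [start, end) for slice i; the last slice takes the remainder.
    static int[] bounds(int length, int tasks, int i) {
        int size = length / tasks;
        int start = i * size;
        int end = (i == tasks - 1) ? length : start + size;
        return new int[] {start, end};
    }

    // Runs one task per slice and returns how many elements were covered in total.
    static int countParallel(String[] words, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger processed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            int[] b = bounds(words.length, tasks, i);
            pool.execute(() -> {
                try {
                    processed.addAndGet(b[1] - b[0]);
                } finally {
                    latch.countDown(); // count down even if the task body throws
                }
            });
        }
        try {
            latch.await(); // the caller blocks until all slices are done
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return processed.get();
    }

    public static void main(String[] args) {
        // 10 elements in 8 slices: slices 0..6 hold one element, slice 7 holds three
        System.out.println(countParallel(new String[10], 8)); // prints 10
    }
}
```

Placing countDown() in a finally block is a defensive choice: if a task fails, the waiting thread is still released.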

Writing the reducer

Each thread in the reducer class takes the result of a map thread from the blocking queue and merges its own sums into a ConcurrentHashMap that is held as a class field.

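import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReduceFunc {
    public static final Map<String,Integer> reduceResult = new ConcurrentHashMap<>();
    // buffer that receives the results of finished map tasks
    private static final BlockingDeque<Map<String,List<Integer>>> blockingDeque = Transfer.buffer;

    private static final BlockingDeque[] pipeline = Transfer.pipeline;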

    public Map<String,Integer> reduce(){
        ExecutorService executorService = Executors.newCachedThreadPool();
        int tasks = 8;
        CountDownLatch countDownLatch = new CountDownLatch(tasks);
        for (int i = 0; i <tasks ; i++) {
            // reduce task
            executorService.execute(()->{
                try {
                    // take a map task's result from the buffer; blocks until one is available
                    Map<String,List<Integer>> map1 = blockingDeque.take();
                    for(Map.Entry<String,List<Integer>> entry:map1.entrySet()){
                        int sum = 0;
                        try {
                            for(Integer integer:entry.getValue()){

                                sum+=integer;
                            }
                            // add this task's sum to the final result; merge() is atomic,
                            // whereas getOrDefault() followed by put() can lose concurrent updates
                            reduceResult.merge(entry.getKey(), sum, Integer::sum);
                        }
                        catch (NullPointerException e){
                            System.out.println(entry);
                            e.printStackTrace();
                        }

                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                countDownLatch.countDown();
            });
        }
        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        executorService.shutdown();

        return reduceResult;
    }

    public Map<String,Integer> reduceUsingPipeline(){
        ExecutorService executorService = Executors.newCachedThreadPool();
        int tasks = 8;
        CountDownLatch countDownLatch = new CountDownLatch(tasks);
        for (int i = 0; i <tasks ; i++) {
            int finalI = i;
            // reduce task
            executorService.execute(()->{
                try {
                    // take the matching map task's result from its queue; blocks until it is available
                    Map<String,List<Integer>> map1 = (Map<String,List<Integer>>)pipeline[finalI].take();
                    for(Map.Entry<String,List<Integer>> entry:map1.entrySet()){
                        int sum = 0;
                        try {
                            for(Integer integer:entry.getValue()){

                                sum+=integer;
                            }
                            // add this task's sum to the final result; merge() is atomic,
                            // whereas getOrDefault() followed by put() can lose concurrent updates
                            reduceResult.merge(entry.getKey(), sum, Integer::sum);
                        }
                        catch (NullPointerException e){
                            System.out.println(entry);
                            e.printStackTrace();
                        }

                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                countDownLatch.countDown();
            });
        }
        try {
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        executorService.shutdown();

        return reduceResult;
    }

}
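
A point worth isolating from the reducer: even on a ConcurrentHashMap, reading a count and then writing it back as two separate calls is not atomic, so two threads updating the same word can overwrite each other's increment. A minimal sketch of the atomic alternative, Map.merge(), follows. MergeSketch and mergeAll are hypothetical names, and the partial maps already hold summed counts, which is a simplification of the word-to-list maps used in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class MergeSketch {
    // Hypothetical helper: one thread per partial result, all merging into one shared map.
    static Map<String, Integer> mergeAll(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (Map<String, Integer> partial : partials) {
            threads.add(new Thread(() ->
                    // merge() inserts the count or adds it to the existing one atomically
                    partial.forEach((word, count) -> total.merge(word, count, Integer::sum))));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("to", 2);
        first.put("be", 2);
        Map<String, Integer> second = new HashMap<>();
        second.put("to", 2);
        second.put("not", 1);
        // TreeMap only makes the printed order deterministic
        System.out.println(new TreeMap<>(mergeAll(Arrays.asList(first, second))));
        // prints {be=2, not=1, to=4}
    }
}
```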

Writing the test method

import java.util.Map;

public class TestMain {
    public static void main(String[] args) {
        String res = FileOperator.readFile("./shakespeare.txt");
        MapFunc mapper = new MapFunc();
        ReduceFunc reducer = new ReduceFunc();
        long begin = System.currentTimeMillis();
        mapper.map(res);
        Map<String,Integer> reduce = reducer.reduce();
        System.out.println("processing takes "+String.valueOf(((double) System.currentTimeMillis()-begin)/1000)+"s");
        System.out.println(FileOperator.outputResultToFile(FileOperator.sort(reduce)));

    }
}

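Run the test method and output.txt is generated in the working directory, so the test succeeded. In my tests, the performance of using a single blocking queue and of using one blocking queue per thread pair differed very little.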

The code is on GitHub: https://github.com/scientist272/Shakespeare_1
