Reading Text Resources in a Java MapReduce Program on Alibaba Cloud Shujia (數加) MaxCompute, with Command-Line and IDE Run Configuration

Original

2018-09-04 18:12

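Recently I had a task that required extracting keywords from product data. The keywords come from a dictionary whose files contain product category terms, dish names and so on. I chose to implement it with MapReduce (MR) on Alibaba Cloud MaxCompute.

At first I thought MR could not read files, but it turns out it can. Reference: https://help.aliyun.com/document_detail/27891.html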

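        // Needs java.io.ByteArrayOutputStream and java.nio.charset.StandardCharsets;
        // bufferedInput, dictSet and filter are declared elsewhere in the mapper.
        try {
            byte[] buffer = new byte[1024 * 1024];
            int bytesRead = 0;
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            bufferedInput = context.readResourceFileAsStream("category_words.txt");
            // read() may return fewer bytes than the file holds, so collect
            // everything first; splitting each chunk could cut a word in half.
            while ((bytesRead = bufferedInput.read(buffer)) != -1) {
                content.write(buffer, 0, bytesRead);
            }
            bufferedInput.close();

            // Assumption: the dictionary file is UTF-8 encoded, one word per line.
            String text = new String(content.toByteArray(), StandardCharsets.UTF_8);
            for (String chunk : text.split("\n")) {
                String word = chunk.trim(); // also drops a trailing '\r'
                if (!word.isEmpty()) {
                    dictSet.add(word);
                    //System.out.println("add word:" + word);
                }
            }

            filter = new SensitivewordFilter(dictSet);

        } catch (FileNotFoundException ex) {
            throw ex;
        } catch (IOException ex) {
            // getStackTrace().toString() would only print the array reference
            ex.printStackTrace();
        }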

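Note that the content obtained through context.readResourceFileAsStream is read in full and then split into lines by hand. After splitting, each line is added to my own word set.

Note: with byte[] buffer = new byte[1024 * 1024] I allocate a 1 MB buffer up front so that the whole file can be read in one go. This assumes the file is smaller than 1 MB. Keep in mind that read() is not guaranteed to return the whole file in a single call, so the loop has to keep reading until it returns -1.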
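As a side note, here is a minimal sketch of the same load-and-split step written against a plain InputStream, so it can be tried outside MaxCompute. The class name ResourceWords and the method loadWords are made up for illustration, and UTF-8 encoding is my assumption. Inside a mapper the stream would come from context.readResourceFileAsStream(...).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the MaxCompute SDK.
class ResourceWords {

    // Reads one word per line; whitespace is trimmed and blank lines are skipped.
    // Assumption: the stream contains UTF-8 text.
    static Set<String> loadWords(InputStream in) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException ex) {
            // Surface read failures instead of silently continuing with a partial set.
            throw new UncheckedIOException(ex);
        }
        return words;
    }

    public static void main(String[] args) {
        byte[] sample = "noodles\nhot pot\n\nnoodles\n".getBytes(StandardCharsets.UTF_8);
        Set<String> words = loadWords(new ByteArrayInputStream(sample));
        System.out.println(words); // prints [noodles, hot pot]
    }
}
```

Reading line by line through a BufferedReader avoids both the fixed buffer size and the risk of a word being cut at a chunk boundary, at the cost of no longer being a single bulk read.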

Passing the input and output tables as command-line arguments:

    public static void main(String[] args) throws java.lang.Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <in_project>-<in_table>-<partition> <out_table>");
            System.exit(2);
        }

        JobConf job = new JobConf();

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumCombiner.class);
        job.setReducerClass(SumReducer.class);

        // args[0]: projectxxx-dwd_input_table-dt=20170601
        // (project, table and partition joined with '-')
        System.out.println(args[0]);
        String[] arr = args[0].split("-");
        if (arr.length != 3) {
            System.err.println("Expected <in_project>-<in_table>-<partition>, got: " + args[0]);
            System.exit(2);
        }
        String inputProject = arr[0];
        String table = arr[1];
        String part = arr[2];

        // Map output: the key columns identify mall, shop and word; the value is the count
        job.setMapOutputKeySchema(SchemaUtils.fromString("mall_id:string,mall_name:string,shop_id:string,shop_name:string,word:string"));
        job.setMapOutputValueSchema(SchemaUtils.fromString("count:bigint"));
        InputUtils.addTable(TableInfo.builder().projectName(inputProject).tableName(table).partSpec(part).build(), job);
        OutputUtils.addTable(TableInfo.builder().tableName(args[1]).build(), job);

        JobClient.runJob(job);
    }

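After building the jar in IDEA, the jar has to be uploaded to the project's resources on Alibaba Cloud.

After that, the ODPS task can be run from my own laptop with the ODPS command-line client. Reference: https://yq.aliyun.com/articles/1487

Command: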

jar -resources ChoseWords.jar,category_words.txt -classpath /Users/xxx/IdeaProjects/ChoseWords/out/artifacts/ChoseWords_jar/ChoseWords.jar com.aliyun.odps.mapred.open.example.WordCount projectxxx-dwd_input_table-dt=20170601 dwd_shop_tags;

The project, table and partition of the input table are packed into a single argument above, and the program splits them apart itself.
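The splitting step could be isolated in a small helper like the sketch below. TableArg and parse are hypothetical names of mine, not part of the MaxCompute SDK. The sketch assumes that project and table names contain no "-", and uses a split limit of 3 so that any further "-" inside the partition spec stays intact.

```java
// Hypothetical helper for illustration; not part of the MaxCompute SDK.
class TableArg {
    final String project;
    final String table;
    final String partition;

    private TableArg(String project, String table, String partition) {
        this.project = project;
        this.table = table;
        this.partition = partition;
    }

    // Expects "<project>-<table>-<partition>",
    // e.g. "projectxxx-dwd_input_table-dt=20170601".
    static TableArg parse(String arg) {
        // Limit 3: everything after the second '-' belongs to the partition spec.
        String[] parts = arg.split("-", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected <project>-<table>-<partition>, got: " + arg);
        }
        return new TableArg(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        TableArg t = parse("projectxxx-dwd_input_table-dt=20170601");
        System.out.println(t.project + " / " + t.table + " / " + t.partition);
        // prints: projectxxx / dwd_input_table / dt=20170601
    }
}
```

Failing fast with a clear message is friendlier than the ArrayIndexOutOfBoundsException that a malformed argument would otherwise cause.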

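Finally, to run the job from Alibaba Cloud's task scheduling, set the mapper, reducer and combiner and it is ready to run. Note that in the mapper's class name, the last class is preceded by a $ sign, not a dot. That one is a bit of a trap.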
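The $ is how Java names nested classes at the binary level: Class.getName() returns the $ form, which is also what Class.forName expects, while the dotted form is only the source-level name. Presumably the scheduler loads the classes by name, which would explain why it wants the $ form. A small sketch, with WordCountDemo as a made-up stand-in for the job class:

```java
// Stand-in for the job class; the names here are illustrative only.
class WordCountDemo {

    // A nested class, like a mapper declared inside the job class.
    static class TokenizerMapper { }

    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> mapper = TokenizerMapper.class;

        // Binary name: outer and nested class are joined with '$'.
        System.out.println(mapper.getName());

        // Canonical (source-level) name: joined with '.'.
        System.out.println(mapper.getCanonicalName());

        // Loading by name works with the binary name.
        System.out.println(Class.forName(mapper.getName()) == mapper);
    }
}
```

Compiled in the default package, the first line printed ends in WordCountDemo$TokenizerMapper and the second in WordCountDemo.TokenizerMapper.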
