An Analysis of Mahout's Item-Based Collaborative Filtering on Hadoop

Note: to analyze Mahout's item-based collaborative filtering, we must first find the driver class that triggers it: org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. Reading its code shows how item-based collaborative filtering is implemented on Hadoop.

In the run() method of RecommenderJob, the leading addOption calls merely declare the job's parameters and how they are parsed; the actual item-based collaborative filtering pipeline begins at the statement AtomicInteger currentPhase = new AtomicInteger(), which initializes currentPhase to 0.

The main steps of the item-based collaborative filtering implementation on Hadoop

Step 1: run the PreparePreferenceMatrixJob job

if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      ToolRunner.run(getConf(), new PreparePreferenceMatrixJob(), new String[]{
        "--input", getInputPath().toString(),
        "--output", prepPath.toString(),
        "--minPrefsPerUser", String.valueOf(minPrefsPerUser),
        "--booleanData", String.valueOf(booleanData),
        "--tempDir", getTempPath().toString(),
      });
      numberOfUsers = HadoopUtil.readInt(new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
    }

Reading the source of shouldRunNextPhase(parsedArgs, currentPhase) shows that this method checks whether currentPhase lies between startPhase and endPhase, and increments currentPhase by one as a side effect each time it is called. If the check passes, the PreparePreferenceMatrixJob job is executed.
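For reference, the phase-gating logic in AbstractJob looks roughly like this (a paraphrased sketch; check the AbstractJob source of your Mahout version for the exact code):

protected static boolean shouldRunNextPhase(Map<String, List<String>> args, AtomicInteger currentPhase) {
  int phase = currentPhase.getAndIncrement();          // read the current phase, then advance it
  String startPhase = getOption(args, "--startPhase"); // null when not set on the command line
  String endPhase = getOption(args, "--endPhase");
  boolean phaseSkipped = (startPhase != null && phase < Integer.parseInt(startPhase))
      || (endPhase != null && phase > Integer.parseInt(endPhase));
  if (phaseSkipped) {
    log.info("Skipping phase {}", phase);
  }
  return !phaseSkipped;
}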

Now locate the org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob class and look at its run() method; this job is composed of several sub-jobs.

First, the itemIDIndex sub-job runs, building the mapping between item IDs and int indexes.


Job itemIDIndex = prepareJob(getInputPath(), getOutputPath(ITEMID_INDEX), TextInputFormat.class,
            ItemIDIndexMapper.class, VarIntWritable.class, VarLongWritable.class, ItemIDIndexReducer.class,
            VarIntWritable.class, VarLongWritable.class, SequenceFileOutputFormat.class);
itemIDIndex.setCombinerClass(ItemIDIndexReducer.class);
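
ItemIDIndexMapper derives the int index by hashing the long item ID via TasteHadoopUtils.idToIndex; the idea is roughly the following (a paraphrased sketch; the masking keeps the index non-negative):

// Sketch of the long-to-int hashing idea behind TasteHadoopUtils.idToIndex
static int idToIndex(long itemID) {
  return 0x7FFFFFFF & ((int) (itemID ^ (itemID >>> 32)));
}

ItemIDIndexReducer then keeps one long ID per index (the minimum among colliding IDs), which is later used to translate indexes back to item IDs.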

Next, the toUserVectors sub-job runs, turning the raw preference data into one vector per user.

Job toUserVectors = prepareJob(getInputPath(),
                                   getOutputPath(USER_VECTORS),
                                   TextInputFormat.class,
                                   ToItemPrefsMapper.class,
                                   VarLongWritable.class,
                                   booleanData ? VarLongWritable.class : EntityPrefWritable.class,
                                   ToUserVectorsReducer.class,
                                   VarLongWritable.class,
                                   VectorWritable.class,
                                   SequenceFileOutputFormat.class);
    toUserVectors.getConfiguration().setBoolean(RecommenderJob.BOOLEAN_DATA, booleanData);
    toUserVectors.getConfiguration().setInt(ToUserVectorsReducer.MIN_PREFERENCES_PER_USER, minPrefsPerUser);
    toUserVectors.getConfiguration().set(ToEntityPrefsMapper.RATING_SHIFT, String.valueOf(ratingShift));
    succeeded = toUserVectors.waitForCompletion(true);
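
Conceptually, ToItemPrefsMapper parses each input line (userID,itemID,rating) and ToUserVectorsReducer assembles all of a user's preferences into one sparse vector indexed by hashed item ID. A hypothetical illustration of what the reducer builds for a user who rated items 101 and 102 (classes are from org.apache.mahout.math and org.apache.mahout.cf.taste.hadoop):

// Hypothetical illustration: the user vector for input lines "1,101,5.0"
// and "1,102,3.0"; item IDs are hashed to int indexes as in ItemIDIndexMapper.
Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
userVector.set(TasteHadoopUtils.idToIndex(101L), 5.0);
userVector.set(TasteHadoopUtils.idToIndex(102L), 3.0);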


Then the toItemVectors sub-job runs, transposing the user vectors (not the raw data) into item vectors, which form the rating matrix.
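
The corresponding excerpt from PreparePreferenceMatrixJob.run() is roughly the following (reproduced from the Mahout 0.9 source; verify against your version). Note that its input is the USER_VECTORS output of the previous sub-job:

//build the rating matrix
Job toItemVectors = prepareJob(getOutputPath(USER_VECTORS), getOutputPath(RATING_MATRIX),
        ToItemVectorsMapper.class, IntWritable.class, VectorWritable.class, ToItemVectorsReducer.class,
        IntWritable.class, VectorWritable.class);
toItemVectors.setCombinerClass(ToItemVectorsReducer.class);

succeeded = toItemVectors.waitForCompletion(true);
if (!succeeded) {
  return -1;
}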


Step 2: this step contains two jobs. The first is RowSimilarityJob, which computes the similarity between rows of the rating matrix, i.e. between items.

if (shouldRunNextPhase(parsedArgs, currentPhase)) {

      /* special behavior if phase 1 is skipped */
      if (numberOfUsers == -1) {
        numberOfUsers = (int) HadoopUtil.countRecords(new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
                PathType.LIST, null, getConf());
      }

      //calculate the co-occurrence matrix
      ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
        "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
        "--output", similarityMatrixPath.toString(),
        "--numberOfColumns", String.valueOf(numberOfUsers),
        "--similarityClassname", similarityClassname,
        "--maxObservationsPerRow", String.valueOf(maxPrefsInItemSimilarity),
        "--maxObservationsPerColumn", String.valueOf(maxPrefsInItemSimilarity),
        "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
        "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
        "--threshold", String.valueOf(threshold),
        "--randomSeed", String.valueOf(randomSeed),
        "--tempDir", getTempPath().toString(),
      });

Now locate the org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob class and look at its run() method; it, too, launches several sub-jobs.

First, the countObservations sub-job runs, counting the number of non-zero observations in each column of the rating matrix, i.e. per user.

Job countObservations = prepareJob(getInputPath(), getTempPath("notUsed"), CountObservationsMapper.class,
        NullWritable.class, VectorWritable.class, SumObservationsReducer.class, NullWritable.class,
        VectorWritable.class);
    countObservations.setCombinerClass(VectorSumCombiner.class);
    countObservations.getConfiguration().set(OBSERVATIONS_PER_COLUMN_PATH, observationsPerColumnPath.toString());
    countObservations.setNumReduceTasks(1);
    countObservations.waitForCompletion(true);
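
The idea behind CountObservationsMapper and SumObservationsReducer is simple; a paraphrased sketch (the real mapper accumulates counts locally and emits them in cleanup()):

// For each row of the rating matrix, mark every non-zero position with a 1;
// summing these marker vectors (combiner plus the single reducer) yields, at
// entry u, the number of observations in column u, i.e. for user u.
protected void map(IntWritable rowIndex, VectorWritable rowWritable, Context ctx)
    throws IOException, InterruptedException {
  Vector columnCounts = new RandomAccessSparseVector(Integer.MAX_VALUE);
  for (Vector.Element e : rowWritable.get().nonZeroes()) {
    columnCounts.setQuick(e.index(), 1);
  }
  ctx.write(NullWritable.get(), new VectorWritable(columnCounts));
}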

Next, the normsAndTranspose sub-job runs, producing the weights file: the rating matrix transposed so that rows are keyed by user, together with the row norms needed later by the similarity measure.

if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job normsAndTranspose = prepareJob(getInputPath(), weightsPath, VectorNormMapper.class, IntWritable.class,
          VectorWritable.class, MergeVectorsReducer.class, IntWritable.class, VectorWritable.class);
      normsAndTranspose.setCombinerClass(MergeVectorsCombiner.class);
      Configuration normsAndTransposeConf = normsAndTranspose.getConfiguration();
      normsAndTransposeConf.set(THRESHOLD, String.valueOf(threshold));
      normsAndTransposeConf.set(NORMS_PATH, normsPath.toString());
      normsAndTransposeConf.set(NUM_NON_ZERO_ENTRIES_PATH, numNonZeroEntriesPath.toString());
      normsAndTransposeConf.set(MAXVALUES_PATH, maxValuesPath.toString());
      normsAndTransposeConf.set(SIMILARITY_CLASSNAME, similarityClassname);
      normsAndTransposeConf.set(OBSERVATIONS_PER_COLUMN_PATH, observationsPerColumnPath.toString());
      normsAndTransposeConf.set(MAX_OBSERVATIONS_PER_ROW, String.valueOf(maxObservationsPerRow));
      normsAndTransposeConf.set(MAX_OBSERVATIONS_PER_COLUMN, String.valueOf(maxObservationsPerColumn));
      normsAndTransposeConf.set(RANDOM_SEED, String.valueOf(randomSeed));

      boolean succeeded = normsAndTranspose.waitForCompletion(true);
      if (!succeeded) {
        return -1;
      }
    }
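
What VectorNormMapper achieves, in a paraphrased sketch (similarity is the configured VectorSimilarityMeasure; the real mapper also applies the threshold and down-sampling options):

protected void map(IntWritable itemIndex, VectorWritable itemRatings, Context ctx)
    throws IOException, InterruptedException {
  // normalize the item row for the chosen similarity measure
  Vector row = similarity.normalize(itemRatings.get());
  // transpose: emit one single-entry column fragment per rating, keyed by user
  for (Vector.Element e : row.nonZeroes()) {
    Vector partialColumn = new RandomAccessSparseVector(Integer.MAX_VALUE);
    partialColumn.setQuick(itemIndex.get(), e.get());
    ctx.write(new IntWritable(e.index()), new VectorWritable(partialColumn));
  }
  // the norm of each item row is accumulated on the side and emitted in cleanup()
}

MergeVectorsReducer then merges the fragments into complete per-user rows.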

Then the pairwiseSimilarity sub-job runs, producing the pairwiseSimilarity file of item-pair similarities.

if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job pairwiseSimilarity = prepareJob(weightsPath, pairwiseSimilarityPath, CooccurrencesMapper.class,
          IntWritable.class, VectorWritable.class, SimilarityReducer.class, IntWritable.class, VectorWritable.class);
      pairwiseSimilarity.setCombinerClass(VectorSumReducer.class);
      Configuration pairwiseConf = pairwiseSimilarity.getConfiguration();
      pairwiseConf.set(THRESHOLD, String.valueOf(threshold));
      pairwiseConf.set(NORMS_PATH, normsPath.toString());
      pairwiseConf.set(NUM_NON_ZERO_ENTRIES_PATH, numNonZeroEntriesPath.toString());
      pairwiseConf.set(MAXVALUES_PATH, maxValuesPath.toString());
      pairwiseConf.set(SIMILARITY_CLASSNAME, similarityClassname);
      pairwiseConf.setInt(NUMBER_OF_COLUMNS, numberOfColumns);
      pairwiseConf.setBoolean(EXCLUDE_SELF_SIMILARITY, excludeSelfSimilarity);
      boolean succeeded = pairwiseSimilarity.waitForCompletion(true);
      if (!succeeded) {
        return -1;
      }
    }
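
CooccurrencesMapper walks each per-user row of the weights file and accumulates, for every pair of co-occurring items, their contributions (for cosine, products of ratings); SimilarityReducer then combines these aggregates with the precomputed norms. A tiny worked example for the cosine case, with hypothetical numbers:

// Two item rows over three users: a = (5, 3, 0), b = (4, 0, 1).
double dot   = 5 * 4 + 3 * 0 + 0 * 1;        // 20, accumulated by CooccurrencesMapper
double normA = Math.sqrt(5 * 5 + 3 * 3);     // ~5.83, read from the norms file
double normB = Math.sqrt(4 * 4 + 1 * 1);     // ~4.12
double sim   = dot / (normA * normB);        // ~0.83, computed in SimilarityReducer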

Finally, the asMatrix sub-job runs, producing the file of item similarity values.

if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job asMatrix = prepareJob(pairwiseSimilarityPath, getOutputPath(), UnsymmetrifyMapper.class,
          IntWritable.class, VectorWritable.class, MergeToTopKSimilaritiesReducer.class, IntWritable.class,
          VectorWritable.class);
      asMatrix.setCombinerClass(MergeToTopKSimilaritiesReducer.class);
      asMatrix.getConfiguration().setInt(MAX_SIMILARITIES_PER_ROW, maxSimilaritiesPerRow);
      boolean succeeded = asMatrix.waitForCompletion(true);
      if (!succeeded) {
        return -1;
      }
    }
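
Because pairwiseSimilarity emits each pair only once, UnsymmetrifyMapper mirrors every entry into the other item's row, and MergeToTopKSimilaritiesReducer merges the fragments while keeping only the maxSimilaritiesPerRow strongest similarities per item. A minimal sketch of that top-K selection (not Mahout's actual TopElementsQueue code):

// Keep the k largest (index, similarity) entries of a merged row.
static Vector keepTopK(Vector similarities, int k) {
  PriorityQueue<double[]> queue =
      new PriorityQueue<>(k + 1, Comparator.comparingDouble(p -> p[1]));
  for (Vector.Element e : similarities.nonZeroes()) {
    queue.add(new double[] { e.index(), e.get() });
    if (queue.size() > k) {
      queue.poll();   // evict the currently smallest similarity
    }
  }
  Vector topK = new RandomAccessSparseVector(similarities.size(), k);
  for (double[] entry : queue) {
    topK.setQuick((int) entry[0], entry[1]);
  }
  return topK;
}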

The second job of step 2 is outputSimilarityMatrix. It is conditional: it runs only when the outputPathForSimilarityMatrix option is set, writing the similarity matrix out as text and translating the int indexes back to item IDs via the ITEMID_INDEX mapping.

if (hasOption("outputPathForSimilarityMatrix")) {
        Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));

        Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
            SequenceFileInputFormat.class, ItemSimilarityJob.MostSimilarItemPairsMapper.class,
            EntityEntityWritable.class, DoubleWritable.class, ItemSimilarityJob.MostSimilarItemPairsReducer.class,
            EntityEntityWritable.class, DoubleWritable.class, TextOutputFormat.class);

        Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
        mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
            new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
        mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
        outputSimilarityMatrix.waitForCompletion(true);
      }


Step 3: run the partialMultiply job, which multiplies the similarity (co-occurrence) matrix by the user vectors.

//start the multiplication of the co-occurrence matrix by the user vectors
    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
      Job partialMultiply = new Job(getConf(), "partialMultiply");
      Configuration partialMultiplyConf = partialMultiply.getConfiguration();

      MultipleInputs.addInputPath(partialMultiply, similarityMatrixPath, SequenceFileInputFormat.class,
                                  SimilarityMatrixRowWrapperMapper.class);
      MultipleInputs.addInputPath(partialMultiply, new Path(prepPath, PreparePreferenceMatrixJob.USER_VECTORS),
          SequenceFileInputFormat.class, UserVectorSplitterMapper.class);
      partialMultiply.setJarByClass(ToVectorAndPrefReducer.class);
      partialMultiply.setMapOutputKeyClass(VarIntWritable.class);
      partialMultiply.setMapOutputValueClass(VectorOrPrefWritable.class);
      partialMultiply.setReducerClass(ToVectorAndPrefReducer.class);
      partialMultiply.setOutputFormatClass(SequenceFileOutputFormat.class);
      partialMultiply.setOutputKeyClass(VarIntWritable.class);
      partialMultiply.setOutputValueClass(VectorAndPrefsWritable.class);
      partialMultiplyConf.setBoolean("mapred.compress.map.output", true);
      partialMultiplyConf.set("mapred.output.dir", partialMultiplyPath.toString());

      if (usersFile != null) {
        partialMultiplyConf.set(UserVectorSplitterMapper.USERS_FILE, usersFile);
      }
      partialMultiplyConf.setInt(UserVectorSplitterMapper.MAX_PREFS_PER_USER_CONSIDERED, maxPrefsPerUser);

      boolean succeeded = partialMultiply.waitForCompletion(true);
      if (!succeeded) {
        return -1;
      }
    }
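
The design here is that both inputs are mapped into the same key space, the item index: SimilarityMatrixRowWrapperMapper wraps each similarity-matrix row as a VectorOrPrefWritable, while UserVectorSplitterMapper splits each user vector into per-item (userID, pref) entries. ToVectorAndPrefReducer zips them back together; a paraphrased sketch:

// For one item index, collect the item's similarity column together with
// all (userID, pref) pairs observed for that item.
protected void reduce(VarIntWritable itemIndex, Iterable<VectorOrPrefWritable> values, Context ctx)
    throws IOException, InterruptedException {
  List<Long> userIDs = new ArrayList<>();
  List<Float> prefValues = new ArrayList<>();
  Vector similarityColumn = null;
  for (VectorOrPrefWritable value : values) {
    if (value.getVector() != null) {
      similarityColumn = value.getVector();   // the row from the similarity matrix
    } else {
      userIDs.add(value.getUserID());         // one user's preference for this item
      prefValues.add(value.getValue());
    }
  }
  ctx.write(itemIndex, new VectorAndPrefsWritable(similarityColumn, userIDs, prefValues));
}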


Step 4: this step first runs a conditional job, itemFiltering, which is skipped when filterFile is not set; the filter file lists userID,itemID pairs to exclude from the recommendations.

if (filterFile != null) {
        Job itemFiltering = prepareJob(new Path(filterFile), explicitFilterPath, TextInputFormat.class,
                ItemFilterMapper.class, VarLongWritable.class, VarLongWritable.class,
                ItemFilterAsVectorAndPrefsReducer.class, VarIntWritable.class, VectorAndPrefsWritable.class,
                SequenceFileOutputFormat.class);
        boolean succeeded = itemFiltering.waitForCompletion(true);
        if (!succeeded) {
          return -1;
        }
      }

Next, the aggregateAndRecommend job runs, producing the final recommendation results. PartialMultiplyMapper emits, for each user who rated a given item, that preference paired with the item's similarity column; AggregateAndRecommendReducer aggregates these per user into a ranked list of recommended items.

Job aggregateAndRecommend = prepareJob(
              new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
              PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
              AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
              outputFormat);
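
Here outputFormat is TextOutputFormat by default (RecommenderJob switches to SequenceFileOutputFormat when sequence-file output is requested). For non-boolean data, the reducer's estimate is a similarity-weighted average of the user's ratings; a paraphrased sketch of the arithmetic:

// estimate(u, i) = sum_j(sim(i, j) * pref(u, j)) / sum_j(|sim(i, j)|),
// where j runs over the items user u has rated.
double numerator = 0.0;
double denominator = 0.0;
for (int j = 0; j < similarities.length; j++) {
  numerator   += similarities[j] * prefs[j];
  denominator += Math.abs(similarities[j]);
}
double estimatedPref = numerator / denominator;

The reducer then keeps the items with the highest estimates for each user and writes them out as RecommendedItemsWritable.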