1. Converting documents to vectors with TF-IDF
Take the following three sentences as an example:
羅湖發佈大梧桐新興產業帶整體規劃
深化夥伴關係,增強發展動力
爲世界經濟發展貢獻中國智慧
After word segmentation they become:
[羅湖, 發佈, 大梧桐, 新興產業, 帶, 整體, 規劃]
[深化, 夥伴, 關係, 增強, 發展, 動力]
[爲, 世界, 經濟發展, 貢獻, 中國, 智慧]
After term frequency (TF) is computed, where TF = the number of times a term appears in a document:
(262144,[10607,18037,52497,53469,105320,122761,220591],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])
(262144,[8684,20809,154835,191088,208112,213540],[1.0,1.0,1.0,1.0,1.0,1.0])
(262144,[21159,30073,53529,60542,148594,197957],[1.0,1.0,1.0,1.0,1.0,1.0])
262144 is the total feature dimension; the larger this value, the lower the probability that two different terms hash to the same bucket, and the more accurate the result.
[10607,18037,52497,53469,105320,122761,220591] are the hash bucket indices of 羅湖, 發佈, 大梧桐, 新興產業, 帶, 整體, 規劃.
[1.0,1.0,1.0,1.0,1.0,1.0,1.0] are the number of times 羅湖, 發佈, 大梧桐, 新興產業, 帶, 整體, 規劃 appear in the sentence.
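Those bucket indices come from the hashing trick. A minimal sketch of the idea follows; note that Spark 2.x's `HashingTF` actually uses MurmurHash3, so plain `String.hashCode` here is only an illustration of mapping a term into a fixed-size index space:

```java
public class HashingSketch {
    static final int NUM_FEATURES = 262144; // Spark's default feature dimension

    // Map a term to a bucket index in [0, NUM_FEATURES); illustration only,
    // Spark's HashingTF uses MurmurHash3 rather than String.hashCode.
    static int bucket(String term) {
        int raw = term.hashCode() % NUM_FEATURES;
        return raw < 0 ? raw + NUM_FEATURES : raw; // keep the index non-negative
    }

    public static void main(String[] args) {
        for (String w : new String[]{"羅湖", "發佈", "規劃"}) {
            System.out.println(w + " -> " + bucket(w));
        }
    }
}
```

The larger `NUM_FEATURES` is, the rarer it is for two distinct terms to collide into the same bucket, which is why the article notes that a bigger dimension gives more accurate results.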
After inverse document frequency (IDF), where IDF = log(total number of documents / number of documents containing the term):
[6.062092444847088,7.766840537085513,7.073693356525568,5.201891179623976,7.073693356525568,5.3689452642871425,6.514077568590145]
[3.8750202389748862,5.464255444091467,6.062092444847088,7.3613754289773485,6.668228248417403,5.975081067857458]
[6.2627631403092385,4.822401557919072,6.2627631403092385,6.2627631403092385,3.547332831909406,4.065538562973019]
where [6.062092444847088,7.766840537085513,7.073693356525568,5.201891179623976,7.073693356525568,5.3689452642871425,6.514077568590145] are the inverse document frequencies of 羅湖, 發佈, 大梧桐, 新興產業, 帶, 整體, 規劃.
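Spark ML's `IDF` actually applies a smoothed variant of the formula, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) the number of documents containing term t; the +1 terms avoid division by zero for unseen terms. A self-contained sketch (the counts below are made up for illustration, not taken from the actual corpus):

```java
public class IdfSketch {
    // Smoothed IDF as used by Spark ML: log((m + 1) / (df + 1))
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        // Hypothetical: 16632 documents, a term appearing in 38 of them
        System.out.println(idf(16632, 38));
        // Rarer terms get a higher IDF weight
        System.out.println(idf(16632, 5000));
    }
}
```

The rarer a term is across the corpus, the larger its IDF, so common function words are down-weighted while distinctive terms dominate the final TF-IDF vector.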
2. Similarity measures
The clustering chapters of *Mahout in Action* cover several similarity measures:
Euclidean distance
Given two points in a plane, this is the distance a ruler would measure between them.
Squared Euclidean distance
The square of the Euclidean distance.
Manhattan distance
The distance between two points is the sum of the absolute differences of their coordinates.
Cosine distance
The cosine measure treats each point as a vector from the origin; two vectors form an angle. When the angle is small the vectors point in roughly the same direction, so the points are considered close. As the angle approaches zero its cosine approaches 1, and the cosine decreases as the angle grows.
The cosine formula for two n-dimensional vectors:
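Written out (the standard cosine similarity, supplied here since the original embedded image is missing):

$$\cos\theta = \frac{A \cdot B}{\lVert A\rVert \, \lVert B\rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$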
Tanimoto distance
The cosine measure ignores vector magnitude, which suits some data sets but can produce poor clusterings in others; the Tanimoto measure captures both the angle between points and their relative distance.
Weighted distance
Allows individual dimensions to be weighted, increasing or decreasing their influence on the distance value.
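As a self-contained illustration of the measures above on plain double arrays (separate from the Spark vector types used later in the article):

```java
public class DistanceSketch {
    // Euclidean distance: straight-line distance between two points
    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Manhattan distance: sum of absolute coordinate differences
    static double manhattan(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }

    // Cosine similarity: dot product over the product of the L2 norms
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Note that `cosine` returns 1.0 for vectors pointing in the same direction regardless of their lengths, which is exactly the magnitude-blindness the Tanimoto measure is meant to address.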
3. Implementation
Spark ML ships a TF-IDF implementation, Spark SQL makes it easy to read and sort the results, and the bundled linear-algebra routines provide the dot product and norms needed here. This article uses cosine similarity to measure document similarity, with the formula given in the previous section.
The test data was crawled from the web between December 7 and December 12; the sample contains 16,632 records.
Each record has the format Id@==@publish time@==@title@==@content@==@source, stored in penngo_07_12.txt.
The first record is a hot news story from that period. This example computes the similarity of every article against that first one, sorts the results from most to least similar, and saves them to a text file.
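The record layout above can be parsed with a plain split on the `@==@` delimiter, as the full program does later. A small standalone sketch (the title/content/source values here are made-up placeholders; the id is the one used in the article's example):

```java
public class RecordParseSketch {
    public static void main(String[] args) {
        // A made-up record in the Id@==@publish time@==@title@==@content@==@source layout
        String line = "58528946cc9434e17d8b4593@==@2016-12-07@==@示例標題@==@示例內容@==@示例來源";
        String[] parts = line.split("@==@"); // '@' and '=' are regex-safe literals
        System.out.println(parts.length);    // 5 fields
        System.out.println(parts[0]);        // the record id
    }
}
```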
Create the Spark project with Maven.
pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.spark.penngo</groupId>
    <artifactId>spark_test</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>spark_test</name>
    <url>http://maven.apache.org</url>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.lionsoul</groupId>
            <artifactId>jcseg-core</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
        <!--
        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongodb-driver</artifactId>
            <version>3.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.1</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.21</version>
        </dependency>
        -->
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
SimilarityTest.java
package com.spark.penngo.tfidf;
import com.spark.test.tfidf.util.SimilartyData;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.BLAS;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.*;
import org.lionsoul.jcseg.tokenizer.core.*;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.util.*;
/**
 * Compute document similarity. https://my.oschina.net/penngo/blog
 * (TfIdfData and SimilartyData are simple JavaBeans carrying the
 * id/title/segment and id/title/similarty fields used below.)
 */
public class SimilarityTest {
    private static SparkSession spark = null;
    private static String splitTag = "@==@";

    public static Dataset<Row> tfidf(Dataset<Row> dataset) {
        Tokenizer tokenizer = new Tokenizer().setInputCol("segment").setOutputCol("words");
        Dataset<Row> wordsData = tokenizer.transform(dataset);
        HashingTF hashingTF = new HashingTF()
                .setInputCol("words")
                .setOutputCol("rawFeatures");
        Dataset<Row> featurizedData = hashingTF.transform(wordsData);
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
        IDFModel idfModel = idf.fit(featurizedData);
        Dataset<Row> rescaledData = idfModel.transform(featurizedData);
        return rescaledData;
    }

    public static Dataset<Row> readTxt(String dataPath) {
        JavaRDD<TfIdfData> newsInfoRDD = spark.read().textFile(dataPath).javaRDD().map(new Function<String, TfIdfData>() {
            private ISegment seg = null;

            private void initSegment() throws Exception {
                if (seg == null) {
                    JcsegTaskConfig config = new JcsegTaskConfig();
                    config.setLoadCJKPos(true);
                    String path = new File("").getAbsolutePath() + "/data/lexicon";
                    ADictionary dic = DictionaryFactory.createDefaultDictionary(config);
                    dic.loadDirectory(path);
                    seg = SegmentFactory.createJcseg(JcsegTaskConfig.COMPLEX_MODE, config, dic);
                }
            }

            public TfIdfData call(String line) throws Exception {
                initSegment();
                TfIdfData newsInfo = new TfIdfData();
                String[] lines = line.split(splitTag);
                if (lines.length < 5) {
                    System.out.println("error==" + line);
                }
                String id = lines[0];
                String publish_timestamp = lines[1];
                String title = lines[2];
                String content = lines[3];
                String source = lines.length > 4 ? lines[4] : "";
                // Segment the content with jcseg, joining tokens with spaces
                // so the Tokenizer stage can split them back apart
                seg.reset(new StringReader(content));
                StringBuffer sff = new StringBuffer();
                IWord word = seg.next();
                while (word != null) {
                    sff.append(word.getValue()).append(" ");
                    word = seg.next();
                }
                newsInfo.setId(id);
                newsInfo.setTitle(title);
                newsInfo.setSegment(sff.toString());
                return newsInfo;
            }
        });
        Dataset<Row> dataset = spark.createDataFrame(
                newsInfoRDD,
                TfIdfData.class
        );
        return dataset;
    }

    public static SparkSession initSpark() {
        if (spark == null) {
            spark = SparkSession
                    .builder()
                    .appName("SimilarityPenngoTest").master("local[3]")
                    .getOrCreate();
        }
        return spark;
    }

    public static void similarDataset(String id, Dataset<Row> dataSet, String datePath) throws Exception {
        Row firstRow = dataSet.select("id", "title", "features").where("id ='" + id + "'").first();
        Vector firstFeatures = firstRow.getAs(2);
        Dataset<SimilartyData> similarDataset = dataSet.select("id", "title", "features").map(new MapFunction<Row, SimilartyData>() {
            public SimilartyData call(Row row) {
                String id = row.getString(0);
                String title = row.getString(1);
                Vector features = row.getAs(2);
                // Cosine similarity: dot product divided by the product of the L2 norms
                double dot = BLAS.dot(firstFeatures.toSparse(), features.toSparse());
                double v1 = Vectors.norm(firstFeatures.toSparse(), 2.0);
                double v2 = Vectors.norm(features.toSparse(), 2.0);
                double similarty = dot / (v1 * v2);
                SimilartyData similartyData = new SimilartyData();
                similartyData.setId(id);
                similartyData.setTitle(title);
                similartyData.setSimilarty(similarty);
                return similartyData;
            }
        }, Encoders.bean(SimilartyData.class));
        Dataset<Row> similarDataset2 = spark.createDataFrame(
                similarDataset.toJavaRDD(),
                SimilartyData.class
        );
        FileOutputStream out = new FileOutputStream(datePath);
        OutputStreamWriter osw = new OutputStreamWriter(out, "UTF-8");
        // Sort by similarity, highest first, and write one record per line
        similarDataset2.select("id", "title", "similarty").sort(functions.desc("similarty")).collectAsList().forEach(row -> {
            try {
                StringBuffer sff = new StringBuffer();
                String sid = row.getAs(0);
                String title = row.getAs(1);
                double similarty = row.getAs(2);
                sff.append(sid).append(" ").append(similarty).append(" ").append(title).append("\n");
                osw.write(sff.toString());
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        osw.close();
        out.close();
    }

    public static void run() throws Exception {
        initSpark();
        String dataPath = new File("").getAbsolutePath() + "/data/penngo_07_12.txt";
        Dataset<Row> dataSet = readTxt(dataPath);
        dataSet.show();
        Dataset<Row> tfidfDataSet = tfidf(dataSet);
        String id = "58528946cc9434e17d8b4593";
        String similarFile = new File("").getAbsolutePath() + "/data/penngo_07_12_similar.txt";
        similarDataset(id, tfidfDataSet, similarFile);
    }

    public static void main(String[] args) throws Exception {
        // When running on Windows, point Hadoop at a local installation:
        //System.setProperty("hadoop.home.dir", "D:/penngo/hadoop-2.6.4");
        //System.setProperty("HADOOP_USER_NAME", "root");
        run();
    }
}
In the output, the more similar an article is, the earlier it appears; on the sample data the results largely meet expectations. The sorted list is written to penngo_07_12_similar.txt.
References
*Mahout in Action*