Spark MLlib: Implementing Simple Logistic Regression with a Pipeline

MLlib's standardized machine learning API makes it easy to combine multiple algorithms into a single Pipeline, which can be pictured as a machine learning workflow.

The Pipeline in this example has three stages:

Stage 1: A Tokenizer splits each document into words.

Stage 2: HashingTF converts the words into feature vectors.

Stage 3: LogisticRegression trains a model on the feature vectors and labels.

The following walks through Pipeline development using a simple logistic regression as the example.

First, define a JavaBean class to hold the records:

import java.io.Serializable;

// JavaBean holding one document record; Spark infers the DataFrame schema from its getters
public class Person implements Serializable {
	private static final long serialVersionUID = 1L;
	private long id;
	private String text;
	private double label;

	public Person() {}
	public Person(long id, String text) {
		this.id = id;
		this.text = text;
	}
	public Person(long id, String text, double label) {
		this.id = id;
		this.text = text;
		this.label = label;
	}
	public long getId() {
		return id;
	}
	public void setId(long id) {
		this.id = id;
	}
	public String getText() {
		return text;
	}
	public void setText(String text) {
		this.text = text;
	}
	public double getLabel() {
		return label;
	}
	public void setLabel(double label) {
		this.label = label;
	}
	@Override
	public String toString() {
		return "Person [id=" + id + ", text=" + text + ", label=" + label + "]";
	}
}
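
Note: createDataFrame infers the DataFrame schema from the bean's getters, so Person must be a public, serializable class. The driver program is then:
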
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Driver class (the class name is illustrative)
public class PipelineDemo {

	public static void main(String[] args) {

		SparkSession sparkSession = SparkSession
				.builder()
				.appName("loi")
				.master("local[1]")
				.getOrCreate();
		// Create an RDD of Person objects from a text file
		JavaRDD<String> dataRDD = sparkSession
				.read()
				.textFile("E:/sparkMlib/sparkMlib/src/mllib/testLine.txt")
				.javaRDD();
		// Parse each line of the form "id,text,label" into a Person bean
		JavaRDD<Person> dataPerson = dataRDD.map(new Function<String, Person>() {
			public Person call(String line) throws Exception {
				String[] parts = line.split(",");
				Person person = new Person();
				person.setId(Long.parseLong(parts[0]));
				person.setText(parts[1]);
				person.setLabel(Double.parseDouble(parts[2]));
				return person;
			}
		});
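		// For reference: testLine.txt is assumed to hold comma-separated
		// "id,text,label" lines; hypothetical contents might look like:
		//   0,a b c d e spark,1.0
		//   1,b d,0.0
		//   2,spark f g h,1.0
		//   3,hadoop mapreduce,0.0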
		// Apply a schema to an RDD of JavaBeans to get a DataFrame
		Dataset<Row> training = sparkSession.createDataFrame(dataPerson, Person.class);
		// Stage 1: split each document's text into words
		Tokenizer tokenizer = new Tokenizer()
				.setInputCol("text")
				.setOutputCol("words");
		// Stage 2: hash the words into a fixed-length feature vector
		HashingTF hashingTF = new HashingTF()
				.setNumFeatures(1000)
				.setInputCol(tokenizer.getOutputCol())
				.setOutputCol("features");
		// Stage 3: logistic regression as the learning algorithm
		LogisticRegression lr = new LogisticRegression()
				.setMaxIter(30)
				.setRegParam(0.001);
		// Assemble the three stages into a Pipeline
		Pipeline pipeline = new Pipeline()
				.setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
		// Fit the pipeline to the training documents
		PipelineModel model = pipeline.fit(training);
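
		// Optional: persist the fitted model for later reuse (the path below
		// is hypothetical); PipelineModel.load() restores it.
		// model.write().overwrite().save("E:/sparkMlib/model/lr-pipeline");
		// PipelineModel sameModel = PipelineModel.load("E:/sparkMlib/model/lr-pipeline");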
		
		// Prepare test documents; these have no labels
		Dataset<Row> test = sparkSession.createDataFrame(
				Arrays.asList(
						new Person(4, "spark i j k"),
						new Person(5, "l m n"),
						new Person(6, "spark hadoop spark"),
						new Person(7, "apache hadoop")),
				Person.class);
		// Make predictions and print id, text, class probabilities, and predicted label
		Dataset<Row> predictions = model.transform(test);
		for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
			System.out.println(r.get(0) + ":" + r.get(1) + ":" + r.get(2) + ":" + r.get(3));
		}
		
	}
}
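
Each printed row has the form id:text:probability:prediction, where probability is a two-element vector holding the model's probabilities for labels 0.0 and 1.0; the actual numbers depend on the training data.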


That completes a simple logistic regression model implemented with a Pipeline.




