Spark MLlib: Implementing Simple Logistic Regression with a Pipeline

MLlib's standardized machine learning API makes it easy to combine multiple algorithms into a single Pipeline, which can be pictured as a machine learning workflow.

The Pipeline in this example has three stages:

Stage 1: A Tokenizer splits each document into words.

Stage 2: HashingTF converts the words into feature vectors.

Stage 3: LogisticRegression trains a model on the feature vectors and labels.

The following walks through Pipeline development using a simple logistic regression as the example.

First, define a JavaBean class to hold the records:

import java.io.Serializable;

// JavaBean holding one document record; Spark infers the DataFrame schema from its getters
public class Person implements Serializable {
	private static final long serialVersionUID = 1L;
	private long id;
	private String text;
	private double label;

	public Person() {}
	public Person(long id, String text) {
		this.id = id;
		this.text = text;
	}
	public Person(long id, String text, double label) {
		this.id = id;
		this.text = text;
		this.label = label;
	}
	public long getId() {
		return id;
	}
	public void setId(long id) {
		this.id = id;
	}
	public String getText() {
		return text;
	}
	public void setText(String text) {
		this.text = text;
	}
	public double getLabel() {
		return label;
	}
	public void setLabel(double label) {
		this.label = label;
	}
	@Override
	public String toString() {
		return "Person [id=" + id + ", text=" + text + ", label=" + label + "]";
	}
}
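
Note: createDataFrame infers the DataFrame schema from the bean's getters, so Person must be a public, serializable class. The driver program is then:
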
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Driver class (the class name is illustrative)
public class PipelineDemo {

	public static void main(String[] args) {

		SparkSession sparkSession = SparkSession
				.builder()
				.appName("loi")
				.master("local[1]")
				.getOrCreate();
		// Create an RDD of Person objects from a text file
		JavaRDD<String> dataRDD = sparkSession
				.read()
				.textFile("E:/sparkMlib/sparkMlib/src/mllib/testLine.txt")
				.javaRDD();
		// Parse each line of the form "id,text,label" into a Person bean
		JavaRDD<Person> dataPerson = dataRDD.map(new Function<String, Person>() {
			public Person call(String line) throws Exception {
				String[] parts = line.split(",");
				Person person = new Person();
				person.setId(Long.parseLong(parts[0]));
				person.setText(parts[1]);
				person.setLabel(Double.parseDouble(parts[2]));
				return person;
			}
		});
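		// For reference: testLine.txt is assumed to hold comma-separated
		// "id,text,label" lines; hypothetical contents might look like:
		//   0,a b c d e spark,1.0
		//   1,b d,0.0
		//   2,spark f g h,1.0
		//   3,hadoop mapreduce,0.0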
		// Apply a schema to an RDD of JavaBeans to get a DataFrame
		Dataset<Row> training = sparkSession.createDataFrame(dataPerson, Person.class);
		// Stage 1: split each document's text into words
		Tokenizer tokenizer = new Tokenizer()
				.setInputCol("text")
				.setOutputCol("words");
		// Stage 2: hash the words into a fixed-length feature vector
		HashingTF hashingTF = new HashingTF()
				.setNumFeatures(1000)
				.setInputCol(tokenizer.getOutputCol())
				.setOutputCol("features");
		// Stage 3: logistic regression as the learning algorithm
		LogisticRegression lr = new LogisticRegression()
				.setMaxIter(30)
				.setRegParam(0.001);
		// Assemble the three stages into a Pipeline
		Pipeline pipeline = new Pipeline()
				.setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
		// Fit the pipeline to the training documents
		PipelineModel model = pipeline.fit(training);
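
		// Optional: persist the fitted model for later reuse (the path below
		// is hypothetical); PipelineModel.load() restores it.
		// model.write().overwrite().save("E:/sparkMlib/model/lr-pipeline");
		// PipelineModel sameModel = PipelineModel.load("E:/sparkMlib/model/lr-pipeline");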
		
		// Prepare test documents; these have no labels
		Dataset<Row> test = sparkSession.createDataFrame(
				Arrays.asList(
						new Person(4, "spark i j k"),
						new Person(5, "l m n"),
						new Person(6, "spark hadoop spark"),
						new Person(7, "apache hadoop")),
				Person.class);
		// Make predictions and print id, text, class probabilities, and predicted label
		Dataset<Row> predictions = model.transform(test);
		for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
			System.out.println(r.get(0) + ":" + r.get(1) + ":" + r.get(2) + ":" + r.get(3));
		}
		
	}
}
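
Each printed row has the form id:text:probability:prediction, where probability is a two-element vector holding the model's probabilities for labels 0.0 and 1.0; the actual numbers depend on the training data.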


That completes a simple logistic regression model implemented with a Pipeline.




