Before getting into the actual data processing, I think it's worth taking some time to learn about and understand UDFs.
UDF
UDF stands for User-Defined Functions. UDFs are a Spark SQL feature for defining new column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets.
I found a fairly easy-to-follow introductory example on Databricks:
Register the function as a UDF
```scala
val squared = (s: Int) => {
  s * s
}
spark.udf.register("square", squared)
```
Call the UDF in Spark SQL
```scala
spark.range(1, 20).registerTempTable("test")
```

```sql
%sql select id, square(id) as id_squared from test
```
My understanding: first define a function squared that returns the square of its input, then register it under the name square, after which square can be called directly in Spark SQL.
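To convince myself this works end to end, here is a minimal sketch that runs in a plain spark-shell (no Databricks %sql cell needed); note that spark.range produces a bigint id column, so the function's parameter is typed as Long here:

```scala
// Minimal spark-shell sketch: `spark` (a SparkSession) is already defined there.
val squared = (s: Long) => s * s   // Long because spark.range yields a bigint `id` column

// Register the Scala function under the SQL name "square".
spark.udf.register("square", squared)

// createOrReplaceTempView is the Spark 2.x replacement for registerTempTable.
spark.range(1, 20).createOrReplaceTempView("test")

// Call the UDF from SQL just like a built-in function.
spark.sql("SELECT id, square(id) AS id_squared FROM test").show()
```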
Example 1: Temperature conversion
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object ScalaUDFExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Scala UDF Example")
    val spark = SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()

    val ds = spark.read.json("temperatures.json")
    ds.createOrReplaceTempView("citytemps")

    // Register the UDF with our SparkSession
    spark.udf.register("CTOF", (degreesCelcius: Double) => ((degreesCelcius * 9.0 / 5.0) + 32.0))

    spark.sql("SELECT city, CTOF(avgLow) AS avgLowF, CTOF(avgHigh) AS avgHighF FROM citytemps").show()
  }
}
```
We define a UDF to convert the temperatures in the following JSON data from degrees Celsius to degrees Fahrenheit:
1{"city":"St. John's","avgHigh":8.7,"avgLow":0.6} 2{"city":"Charlottetown","avgHigh":9.7,"avgLow":0.9} 3{"city":"Halifax","avgHigh":11.0,"avgLow":1.6} 4{"city":"Fredericton","avgHigh":11.2,"avgLow":-0.5} 5{"city":"Quebec","avgHigh":9.0,"avgLow":-1.0} 6{"city":"Montreal","avgHigh":11.1,"avgLow":1.4} 7...
Example 2: Time conversion
```scala
case class Purchase(customer_id: Int, purchase_id: Int, date: String, time: String, tz: String, amount: Double)

val x = sc.parallelize(Array(
  Purchase(123, 234, "2007-12-12", "20:50", "UTC", 500.99),
  Purchase(123, 247, "2007-12-12", "15:30", "PST", 300.22),
  Purchase(189, 254, "2007-12-13", "00:50", "EST", 122.19),
  Purchase(187, 299, "2007-12-12", "07:30", "UTC", 524.37)
))

val df = sqlContext.createDataFrame(x)
df.registerTempTable("df")
```
Define the custom function:
```scala
def makeDT(date: String, time: String, tz: String) = s"$date $time $tz"
sqlContext.udf.register("makeDt", makeDT(_: String, _: String, _: String))

// Now we can use our function directly in SparkSQL.
sqlContext.sql("SELECT amount, makeDt(date, time, tz) from df").take(2)
// but not outside
df.select($"customer_id", makeDt($"date", $"time", $"tz"), $"amount").take(2) // fails
```
If you want to use the function outside of SQL (i.e. in the DataFrame DSL), you have to create a UDF with org.apache.spark.sql.functions.udf:
```scala
import org.apache.spark.sql.functions.udf
val makeDt = udf(makeDT(_: String, _: String, _: String))
// now this works
df.select($"customer_id", makeDt($"date", $"time", $"tz"), $"amount").take(2)
```
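A side note based on my understanding of the Spark 2.x API (treat it as a sketch): spark.udf.register itself returns a UserDefinedFunction, so a single call can serve both the SQL name and the DataFrame DSL, without a separate functions.udf wrapper. Here df and the "df" temp table are the ones created above:

```scala
// Sketch (Spark 2.x): register(...) returns a UserDefinedFunction,
// so the same definition works in SQL and in the DataFrame DSL.
def makeDT(date: String, time: String, tz: String) = s"$date $time $tz"

val makeDt = spark.udf.register("makeDt", makeDT _)

// usable in SQL...
spark.sql("SELECT amount, makeDt(date, time, tz) FROM df").take(2)
// ...and in the DSL, with no extra wrapping step
df.select($"customer_id", makeDt($"date", $"time", $"tz"), $"amount").take(2)
```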
Hands-on practice
Let's write a UDF that buckets Int values into ranges:
```scala
val formatDistribution = (view: Int) => {
  if (view < 10) {
    "<10"
  } else if (view <= 100) {
    "10~100"
  } else if (view <= 1000) {
    "100~1K"
  } else if (view <= 10000) {
    "1K~10K"
  } else if (view <= 100000) {
    "10K~100K"
  } else {
    ">100K"
  }
}
```
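Before registering it, note that formatDistribution is just an ordinary Scala function, so it can be sanity-checked directly in the REPL:

```scala
formatDistribution(7)      // "<10"
formatDistribution(523)    // "100~1K"
formatDistribution(250000) // ">100K"
```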
Register it:
```scala
session.udf.register("formatDistribution", UDF.formatDistribution)
```
SQL:
```scala
session.sql("select user_id, formatDistribution(variance_digg_count) as variance from video")
```
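The same bucketing also works from the DataFrame DSL by wrapping the function with org.apache.spark.sql.functions.udf. A sketch, assuming session is the SparkSession used above, formatDistribution is in scope, and video has the user_id and variance_digg_count columns from the query:

```scala
import org.apache.spark.sql.functions.udf
import session.implicits._   // for the $"column" syntax

// Wrap the plain Scala function so it can be applied to Column expressions
val formatDistributionUdf = udf(formatDistribution)

session.table("video")
  .select($"user_id", formatDistributionUdf($"variance_digg_count").as("variance"))
  .show()
```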
Having written this far and looking back at UDFs, I feel they are really just a convenient way to package an operation such as a categorization or a conversion, much like a function in Python, except that a UDF usually refers specifically to a function used within Spark SQL. I also noticed that this is quite similar to user-defined functions in SQL:
```sql
CREATE FUNCTION [owner.]<function_name>
(
    -- parameters required by the function (there may be none)
    [<@param1> <parameter type>]
    [, <@param2> <parameter type>]…
)
RETURNS TABLE
AS
RETURN
(
    -- the SELECT statement whose result is returned
    SELECT <query>
)
```
```sql
/*
 * Create a table-valued function that returns the personal information
 * of account holders whose total transaction amount exceeds 10,000
 */
create function getCustInfo()
returns @CustInfo table                              -- returns a table type
(
    CustID int,                                      -- account ID
    CustName varchar(20) not null,                   -- account holder name
    IDCard varchar(18),                              -- ID card number
    TelePhone varchar(13) not null,                  -- phone number
    Address varchar(50) default('地址不详')           -- address (default: "address unknown")
)
as
begin
    -- populate the table variable
    insert into @CustInfo
    select CustID, CustName, IDCard, TelePhone, Address from AccountInfo
    where CustID in (select CustID from CardInfo
        where CardID in (select CardID from TransInfo
            group by CardID, transID, TransType, TransMoney, TransDate
            having sum(TransMoney) > 10000))
    return
end
go
-- call the table-valued function
select * from getCustInfo()
go
```
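That example is a table-valued function, which returns a whole result set; the closer analogue to a Spark scalar UDF would be a scalar SQL function. Here is a rough T-SQL sketch mirroring the CTOF UDF from Example 1 (the citytemps table is assumed only for illustration):

```sql
-- Sketch: a scalar T-SQL function, roughly the counterpart of the CTOF Spark UDF above
create function dbo.CTOF (@degreesCelsius float)
returns float
as
begin
    return @degreesCelsius * 9.0 / 5.0 + 32.0
end
go

-- Called like a built-in function inside a query
select city, dbo.CTOF(avgLow) as avgLowF, dbo.CTOF(avgHigh) as avgHighF from citytemps
go
```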
They really do seem to accomplish the same thing by different routes~