Relations (tables) in Pig exist only for the current session; once you exit the Pig command line, they are gone.
1. Start Hadoop's JobHistory Server (so that finished MapReduce jobs can be inspected):
mr-jobhistory-daemon.sh start historyserver
Web Console: http://ip:19888/jobhistory
2. Common PigLatin statements
(*)load: load data and create a table, similar to CREATE TABLE
(*)foreach: a loop that processes each row of a table
(*)group ... by: grouping
(*)filter: filtering, similar to WHERE
(*)join: join, for multi-table queries
(*)union: set operation (note: Pig Latin has no built-in intersect; an intersection has to be expressed with a join)
(*)generate: project columns, similar to: select col1,col2,col3
None of the statements above triggers a computation by itself; only the following statements launch a MapReduce job immediately (see the short sketch after the Spark comparison below):
(*)dump: print the result to the screen
(*)store: write the result to a file
In Spark the operators fall into the same two kinds:
(1)Transformation: lazy evaluation
(2)Action: triggers the computation
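A minimal sketch of this lazy/eager split (the load statement matches section 3 below; the output path /output/rich is only an example):
emp = load '/scott/emp.csv' using PigStorage(',') as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);  -- lazy
rich = filter emp by sal > 2000;  -- lazy, still no job
dump rich;  -- triggers a MapReduce job and prints to the screen
store rich into '/output/rich' using PigStorage(',');  -- triggers a job and writes to a file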
3. Analyzing data with PigLatin: the data files are emp.csv and dept.csv (a sample row from emp.csv):
7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
(1) Create the employee table
emp = load '/scott/emp.csv';
Check the table structure:
describe emp; ---> Schema for emp unknown.
Because we did not specify a schema when creating the table, the schema of emp is reported as unknown.
(2) Create the employee table with a schema: the default data type is bytearray
emp = load '/scott/emp.csv' as(empno,ename,job,mgr,hiredate,sal,comm,deptno);
Since no data type was specified for the columns, they all default to bytearray.
Then use dump emp to look at the data.
The output shows a long run of commas: because no delimiter was specified when loading the CSV, Pig falls back to PigStorage's default delimiter (tab), so each whole line goes into the first field and the remaining columns stay empty.
Create the table with a schema, column types, and an explicit field delimiter:
emp = load '/scott/emp.csv' using PigStorage(',') as(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
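With the schema and types declared, describe now reports them (a sketch of the expected output of Pig's describe command):
describe emp; ---> emp: {empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int}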
Create the department table:
dept = load '/scott/dept.csv' using PigStorage(',') as(deptno:int,dname:chararray,loc:chararray);
(3) join: query employee information: employee name and department name
SQL: select ename,dname
from emp,dept
where emp.deptno=dept.deptno;
PL: t31 = join dept by deptno, emp by deptno; ---> does not trigger a computation yet
Data in emp and dept: (the contents of the two tables loaded above; not reproduced here)
Data in t31:
(10,ACCOUNTING,NEW YORK,7934,MILLER,CLERK,7782,1982/1/23,1300,0,10)
(10,ACCOUNTING,NEW YORK,7839,KING,PRESIDENT,-1,1981/11/17,5000,0,10)
(10,ACCOUNTING,NEW YORK,7782,CLARK,MANAGER,7839,1981/6/9,2450,0,10)
(20,RESEARCH,DALLAS,7876,ADAMS,CLERK,7788,1987/5/23,1100,0,20)
(20,RESEARCH,DALLAS,7788,SCOTT,ANALYST,7566,1987/4/19,3000,0,20)
(20,RESEARCH,DALLAS,7369,SMITH,CLERK,7902,1980/12/17,800,0,20)
(20,RESEARCH,DALLAS,7566,JONES,MANAGER,7839,1981/4/2,2975,0,20)
(20,RESEARCH,DALLAS,7902,FORD,ANALYST,7566,1981/12/3,3000,0,20)
(30,SALES,CHICAGO,7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
(30,SALES,CHICAGO,7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
(30,SALES,CHICAGO,7698,BLAKE,MANAGER,7839,1981/5/1,2850,0,30)
(30,SALES,CHICAGO,7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
(30,SALES,CHICAGO,7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
(30,SALES,CHICAGO,7900,JAMES,CLERK,7698,1981/12/3,950,0,30)
t32 = foreach t31 generate dept::dname,emp::ename; ---> does not trigger a computation yet
dump t32; -----> triggers the computation immediately
(4) Query employee information: employee number, name, and salary
SQL: select empno,ename,sal from emp;
PL: emp4 = foreach emp generate empno,ename,sal; ---> does not trigger a computation yet
dump emp4; -----> triggers the computation immediately
(5) Query employee information: sorted by salary
SQL: select * from emp order by sal;
PL: emp5 = order emp by sal; ---> does not trigger a computation yet (lazy evaluation)
dump emp5; -----> triggers the computation immediately
(6) Grouping: find the maximum salary in each department
SQL: select deptno,max(sal) from emp group by deptno;
PL: Step 1: grouping
emp61 = group emp by deptno;
Table structure:
emp61: {group: int,
emp: {(empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}
Data: dump emp61;
(10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,0,10),
(7839,KING,PRESIDENT,-1,1981/11/17,5000,0,10),
(7782,CLARK,MANAGER,7839,1981/6/9,2450,0,10)})
(20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,0,20),
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,0,20),
(7369,SMITH,CLERK,7902,1980/12/17,800,0,20),
(7566,JONES,MANAGER,7839,1981/4/2,2975,0,20),
(7902,FORD,ANALYST,7566,1981/12/3,3000,0,20)})
(30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,0,30),
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),
(7900,JAMES,CLERK,7698,1981/12/3,950,0,30)})
Step 2: maximum salary
emp62 = foreach emp61 generate group,MAX(emp.sal);
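Dumping emp62 triggers the MapReduce job; given the data shown above, the expected output is:
dump emp62;
(10,5000)
(20,3000)
(30,2850)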
(7) Run WordCount
① Load the data ----> lazy evaluation
mydata = load '/input/data.txt' as (line:chararray);
② Split each line into words ----> lazy evaluation
words = foreach mydata generate flatten(TOKENIZE(line)) as word;
③ Group by word ----> lazy evaluation
grpd = group words by word;
④ Count the words in each group ----> lazy evaluation
cntd = foreach grpd generate group,COUNT(words);
⑤ Print the result ----> triggers the computation
dump cntd;
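To keep the result in HDFS instead of only printing it, store triggers the job as well (the output directory /output/wc below is just an example and must not already exist):
store cntd into '/output/wc' using PigStorage(',');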
Pig user-defined functions (UDFs): filter functions, eval functions, and load functions
Required jar files:
$PIG_HOME/pig-0.17.0-core-h2.jar
$PIG_HOME/lib
$PIG_HOME/lib/h2
$HADOOP_HOME/share/hadoop/common
$HADOOP_HOME/share/hadoop/common/lib
1. Custom filter function: acts like a WHERE clause
Example: find the employees whose salary is greater than 3000
package demo;
import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
public class IsSalaryTooHigh extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // the first (and only) argument is the employee's salary
        int sal = (Integer) tuple.get(0);
        return sal > 3000;
    }
}
2. Custom eval function: computes the value of an expression
Example: determine a grade based on the employee's salary
sal<=1000 returns Grade A
1000<sal<=3000 returns Grade B
sal>3000 returns Grade C
package demo;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class CheckSalaryGrade extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        int sal = (Integer) tuple.get(0);
        /* sal<=1000        returns Grade A
           1000<sal<=3000   returns Grade B
           sal>3000         returns Grade C */
        if (sal > 3000) {
            return "Grade C";
        } else if (sal > 1000) {
            return "Grade B";
        } else {
            return "Grade A";
        }
    }
}
3. Custom load function
The MapReduce jar files are also needed:
$HADOOP_HOME/share/hadoop/mapreduce
$HADOOP_HOME/share/hadoop/mapreduce/lib
package demo;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
public class MyloadFunction extends LoadFunc {
    private RecordReader reader;

    @Override
    public InputFormat getInputFormat() throws IOException {
        // read the input as plain text, one line per record
        return new TextInputFormat();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple tuple = null;
        try {
            if (!reader.nextKeyValue()) {
                // no more input
                return null;
            }
            // create the tuple that will be returned
            tuple = TupleFactory.getInstance().newTuple();
            // get the current line, e.g. "I love Beijing"
            Text value = (Text) this.reader.getCurrentValue();
            String data = value.toString();
            // split the line into words
            String[] words = data.split(" ");
            // create a bag: one tuple per word
            DataBag bag = BagFactory.getInstance().newDefaultBag();
            for (String w : words) {
                Tuple one = TupleFactory.getInstance().newTuple();
                // put the word into a tuple
                one.append(w);
                // and the tuple into the bag
                bag.add(one);
            }
            // finally, put the bag into the result tuple
            tuple.append(bag);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
        return tuple;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public void setLocation(String path, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, path);
    }
}
Once these classes are written, package them into a jar and register it with Pig's register command:
register /root/training/pigdemo.jar
You can also use the define command to give an alias to a specific function inside the jar; we will skip that for now and test it at the end.
Then test the custom functions:
emp1 = filter emp by demo.IsSalaryTooHigh(sal);
emp2 = foreach emp generate ename, demo.CheckSalaryGrade(sal);
mydata = load '/input/data.txt' using demo.MyloadFunction();
Finally, test the define command:
define isSTH demo.IsSalaryTooHigh();
emp1 = filter emp by isSTH(sal);
The effect is the same as before.
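define works the same way for the eval function, for example (the alias name getGrade is arbitrary):
define getGrade demo.CheckSalaryGrade();
emp2 = foreach emp generate ename, getGrade(sal);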