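With Pig's core code stripped out, we can start working our way into the code itself.
Most source-code analyses online start from a handful of core classes and draw class diagrams, flowcharts, and so on. Let's take a different approach and peel the code like an onion: start at the outer layers and work inward step by step until we reach the core. This gives the analysis a gentle slope and makes it easier.
First, look at the names of Pig's source files. Many of them tell you what the file does: IsDouble.java obviously checks whether a value is a Double, and XPath.java obviously handles XPath queries against XML. Here is the code for the IsDouble and XPath classes: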
/**
* This UDF is used to check whether the String input is a Double.
* Note this function checks for Double range.
* If range is not important, use IsNumeric instead if you would like to check if a String is numeric.
* Also IsNumeric performs slightly better compared to this function.
*/
public class IsDouble extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return false;
        try {
            String str = (String)input.get(0);
            if (str == null || str.length() == 0) return false;
            Double.parseDouble(str);
        } catch (NumberFormatException nfe) {
            return false;
        } catch (ClassCastException e) {
            warn("Unable to cast input "+input.get(0)+" of class "+
                    input.get(0).getClass()+" to String", PigWarning.UDF_WARNING_1);
            return false;
        }
        return true;
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.BOOLEAN));
    }
}
/**
* XPath is a function that allows for text extraction from xml
*/
public class XPath extends EvalFunc<String> {

    /** Hold onto last xpath & xml in case the next call to xpath() is feeding the same xml document
     * The reason for this is because creating an xpath object is costly. */
    private javax.xml.xpath.XPath xpath = null;
    private String xml = null;
    private Document document;

    private static boolean cache = true;

    /**
     * input should contain: 1) xml 2) xpath 3) optional cache xml doc flag
     *
     * Usage:
     * 1) XPath(xml, xpath)
     * 2) XPath(xml, xpath, false)
     *
     * @param 1st element should to be the xml
     *        2nd element should be the xpath
     *        3rd optional boolean cache flag (default true)
     *
     * This UDF will cache the last xml document. This is helpful when multiple consecutive xpath calls are made for the same xml document.
     * Caching can be turned off to ensure that the UDF's recreates the internal javax.xml.xpath.XPath for every call
     *
     * @return chararrary result or null if no match
     */
    @Override
    public String exec(final Tuple input) throws IOException {
        if (input == null || input.size() <= 1) {
            warn("Error processing input, not enough parameters or null input" + input,
                    PigWarning.UDF_WARNING_1);
            return null;
        }

        if (input.size() > 3) {
            warn("Error processing input, too many parameters" + input,
                    PigWarning.UDF_WARNING_1);
            return null;
        }

        try {
            final String xml = (String) input.get(0);

            if (input.size() > 2)
                cache = (Boolean) input.get(2);

            if (!cache || xpath == null || !xml.equals(this.xml)) {
                final InputSource source = new InputSource(new StringReader(xml));
                this.xml = xml; //track the xml for subsequent calls to this udf

                final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                final DocumentBuilder db = dbf.newDocumentBuilder();
                this.document = db.parse(source);

                final XPathFactory xpathFactory = XPathFactory.newInstance();
                this.xpath = xpathFactory.newXPath();
            }

            final String xpathString = (String) input.get(1);
            final String value = xpath.evaluate(xpathString, document);
            return value;
        } catch (Exception e) {
            warn("Error processing input " + input.getType(0),
                    PigWarning.UDF_WARNING_1);
            return null;
        }
    }

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        final List<FuncSpec> funcList = new ArrayList<FuncSpec>();

        /*either two chararray arguments*/
        List<FieldSchema> fields = new ArrayList<FieldSchema>();
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        Schema twoArgInSchema = new Schema(fields);
        funcList.add(new FuncSpec(this.getClass().getName(), twoArgInSchema));

        /*or two chararray and a boolean argument*/
        fields = new ArrayList<FieldSchema>();
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        fields.add(new Schema.FieldSchema(null, DataType.BOOLEAN));
        Schema threeArgInSchema = new Schema(fields);
        funcList.add(new FuncSpec(this.getClass().getName(), threeArgInSchema));

        return funcList;
    }
}
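Both classes extend a generic class called EvalFunc. Anyone who has used Pig knows that it lets you write UDFs (user-defined functions) to implement your own computations, and every such function has to extend this generic class. Pig ships with many of these functions; put plainly, they are UDFs that Pig has already written for you.
These UDF classes all have a simple structure. The main differences are:
1. The return type differs.
As shown above, IsDouble's exec method returns a Boolean that says whether the input is a Double value, while XPath's exec returns a String extracted from the XML.
2. The algorithm, that is, the implementation of exec(), differs.
Functions that do different things are naturally implemented differently.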
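To make the two differences concrete, the core of each exec() can be pulled out of Pig and run with nothing but the JDK. The following is a minimal sketch, not Pig code: the class name `UdfCoreSketch` and the method names `isDouble` and `evalXPath` are hypothetical, and the Tuple handling, warnings, and document caching of the real UDFs are left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-alone class; it only mirrors the logic inside the two exec() methods.
class UdfCoreSketch {

    // Core of IsDouble.exec(): true only if the string parses as a double.
    static boolean isDouble(String str) {
        if (str == null || str.length() == 0) {
            return false;
        }
        try {
            Double.parseDouble(str);
            return true;
        } catch (NumberFormatException nfe) {
            return false;
        }
    }

    // Core of XPath.exec(): parse the XML and evaluate the expression as a string.
    // Like the UDF, it returns null when anything goes wrong.
    static String evalXPath(String xml, String expression) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = db.parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, document);
        } catch (Exception e) {
            return null;
        }
    }
}
```

Both methods have the same shape (validate the input, compute, fall back to a default on failure); only the return type and the computation in the middle change, which is the pattern the list above describes.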
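These classes play a supporting role in the Pig source code. A quick look at their structure and algorithms is enough, and then they can be deleted.
Some readers asked how I removed Pig's UDF classes from among so many source files. I am analyzing the code on Windows, so I used Visual Studio's Find In Files to search for java files containing “extends EvalFunc”. On Linux, shell commands offer a similar feature, and so does Eclipse.
Here is the search result from Visual Studio: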
Matching lines: 319 Matching files: 248 Total files searched: 1157
As you can see, Pig contains 248 such files. Some of the extending classes are themselves still generic, for example AccumulatorEvalFunc<T>.
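For readers without Visual Studio, the same kind of search can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `EvalFuncFinder` and its `find` method are hypothetical, files are read as UTF-8, and the sketch reports matching files only, not the matching-line count that Visual Studio also prints.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the .java files under a directory that contain a given text.
class EvalFuncFinder {

    static List<Path> find(Path root, String needle) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .filter(p -> contains(p, needle))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file into memory, which is fine for source files.
    private static boolean contains(Path file, String needle) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return text.contains(needle);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The source root is passed on the command line; the current directory is the default.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> hits = find(root, "extends EvalFunc");
        hits.forEach(System.out::println);
        System.out.println("Matching files: " + hits.size());
    }
}
```

Run against the Pig source root, this prints one path per matching file, so the list can be reviewed before anything is deleted.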
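I sorted these classes into the following groups by purpose:
1. Type-check group
These classes are named IsXXX and check whether an input value is of a given type, like the IsDouble class above. Every class in this group is simple.
2. Format-conversion group
These classes are named XXXToYYY and convert input type XXX to output type YYY. ISOToUnix is one example; its code follows: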
/**
* <p>ISOToUnix converts ISO8601 datetime strings to Unix Time Longs</p>
* <ul>
* <li>Jodatime: http://joda-time.sourceforge.net/</li>
* <li>ISO8601 Date Format: http://en.wikipedia.org/wiki/ISO_8601</li>
* <li>Unix Time: http://en.wikipedia.org/wiki/Unix_time</li>
* </ul>
* <br />
* <pre>
* Example usage:
*
* REGISTER /Users/me/commiter/piggybank/java/piggybank.jar ;
* REGISTER /Users/me/commiter/piggybank/java/lib/joda-time-1.6.jar ;
*
* DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
*
* ISOin = LOAD 'test.tsv' USING PigStorage('\t') AS (dt:chararray, dt2:chararray);
*
* DESCRIBE ISOin;
* ISOin: {dt: chararray,dt2: chararray}
*
* DUMP ISOin;
*
* (2009-01-07T01:07:01.000Z,2008-02-01T00:00:00.000Z)
* (2008-02-06T02:06:02.000Z,2008-02-01T00:00:00.000Z)
* (2007-03-05T03:05:03.000Z,2008-02-01T00:00:00.000Z)
* ...
*
* toUnix = FOREACH ISOin GENERATE ISOToUnix(dt) AS unixTime:long;
*
* DESCRIBE toUnix;
* toUnix: {unixTime: long}
*
* DUMP toUnix;
*
* (1231290421000L)
* (1202263562000L)
* (1173063903000L)
* ...
*</pre>
*/
public class ISOToUnix extends EvalFunc<Long> {

    @Override
    public Long exec(Tuple input) throws IOException
    {
        if (input == null || input.size() < 1) {
            return null;
        }

        // Set the time to default or the output is in UTC
        DateTimeZone.setDefault(DateTimeZone.UTC);

        DateTime result = new DateTime(input.get(0).toString());

        return result.getMillis();
    }

    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.LONG));
    }

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));

        return funcList;
    }
}
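It converts an ISO 8601 date string into Unix time stored as a long (in milliseconds, as the sample output in the Javadoc shows).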
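The conversion itself is easy to try outside Pig. The sketch below swaps Joda-Time for the JDK's java.time package, so it is not the piggybank implementation; the class name `IsoToUnixSketch` and method name `isoToUnixMillis` are hypothetical. One known difference: `Instant.parse` expects a full instant such as `2009-01-07T01:07:01.000Z`, whereas Joda's `DateTime` constructor accepts more input shapes.

```java
import java.time.Instant;

// Hypothetical stand-alone helper using java.time instead of Joda-Time.
class IsoToUnixSketch {

    // Parses an ISO 8601 instant (UTC, ending in 'Z') and returns epoch milliseconds.
    static long isoToUnixMillis(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }
}
```

For example, `isoToUnixMillis("2009-01-07T01:07:01.000Z")` yields 1231290421000, the same value as the first row of the sample DUMP output in the Javadoc.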
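3. Math group
These classes are named after mathematical operations, such as SIN (sine), ASIN (arcsine), MAX (maximum), IntAbs (integer absolute value), and RANDOM (random number). Their implementations are all simple.
The group also includes arbitrary-precision arithmetic whose output type is BigDecimal, such as BigDecimalAbs (absolute value) and BigDecimalAvg (average).
Date arithmetic is a special case of this group; ISODaysBetween, for example, computes the number of days between two dates.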
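The operations in this group map almost directly onto JDK calls, which is why their implementations stay short. Here is a minimal sketch with hypothetical names (`MathGroupSketch`, `abs`, `avg`, `daysBetween`); the rounding mode chosen for the average and the argument order of `daysBetween` are assumptions of this sketch, not a statement about how Pig's classes behave.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;

// Hypothetical stand-alone helpers; they are not Pig's BigDecimalAbs, BigDecimalAvg or ISODaysBetween.
class MathGroupSketch {

    // Absolute value of an arbitrary-precision number; null in, null out.
    static BigDecimal abs(BigDecimal value) {
        return value == null ? null : value.abs();
    }

    // Average of a list; DECIMAL128 bounds the precision so 1/3 does not throw.
    static BigDecimal avg(List<BigDecimal> values) {
        if (values == null || values.isEmpty()) {
            return null;
        }
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal v : values) {
            sum = sum.add(v);
        }
        return sum.divide(BigDecimal.valueOf(values.size()), MathContext.DECIMAL128);
    }

    // Whole days from the earlier instant to the later one (partial days are dropped).
    static long daysBetween(String laterIso, String earlierIso) {
        return ChronoUnit.DAYS.between(Instant.parse(earlierIso), Instant.parse(laterIso));
    }
}
```

Using two timestamps from the ISOToUnix sample data, `daysBetween("2009-01-07T01:07:01.000Z", "2008-02-01T00:00:00.000Z")` gives 341.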
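4. String-processing group
These classes take a String as input and are named after a string operation, such as UPPER (convert to upper case), Reverse (reverse a string), and Trim (strip leading and trailing whitespace).
Note that this group also contains classes such as HashFNV and HashFNV1, which compute a hash value from a string, and RegexMatch, which tests a string against a regular expression.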
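To give a feel for what a string-hashing function in this group does, here is a sketch of the generic 32-bit FNV-1a algorithm next to a one-line string reversal. It is an assumption-laden illustration: the names `StringGroupSketch`, `fnv1a32`, and `reverse` are hypothetical, and the FNV variant, hash width, and byte encoding used by Pig's HashFNV classes may differ from what is shown here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone helpers; not the piggybank Reverse or HashFNV classes.
class StringGroupSketch {

    // Reverses a string; null in, null out.
    static String reverse(String s) {
        return s == null ? null : new StringBuilder(s).reverse().toString();
    }

    // Generic 32-bit FNV-1a over the UTF-8 bytes: XOR each byte in, then multiply by the FNV prime.
    static int fnv1a32(String s) {
        int hash = 0x811c9dc5;                 // FNV offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;                // FNV prime; int overflow wraps modulo 2^32
        }
        return hash;
    }
}
```

Java's wrapping int arithmetic provides the modulo 2^32 step for free, so the whole hash fits in a short loop.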
5. Assertion group
The Assert class checks whether an expression is True and throws an IOException when it is not. Here is the code:
public class Assert extends EvalFunc<Boolean>
{
    @Override
    public Boolean exec(Tuple tuple)
            throws IOException
    {
        if (!(Boolean) tuple.get(0)) {
            if (tuple.size() > 1) {
                throw new IOException("Assertion violated: " + tuple.get(1).toString());
            }
            else {
                throw new IOException("Assertion violated. ");
            }
        }
        else {
            return true;
        }
    }
}
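6. Script-execution group
This group runs a method written in another scripting language: the Jruby class, for example, runs Ruby scripts, and the JsFunction class runs JavaScript.
With the analysis above done, the computation functions can be deleted, which leaves roughly 820 java files.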