Apache Hadoop Pig Source Code Analysis (2)

With Pig's core code separated out, we can now work our way into the code itself.

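Most source code analyses you find online start from a few core classes and draw class diagrams, flowcharts, and so on. Let's take a different approach here and peel the code like an onion: start at the periphery and work inward, step by step, to the core. That gives us a gentle slope and makes the analysis easier.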

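Start by looking at the file names in Pig's source tree. For many files, the name alone tells you what the class does: IsDouble.java obviously checks whether a value is a Double, and XPath.java obviously handles XPath queries against XML. Here is the code of the IsDouble and XPath classes: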

/**
 * This UDF is used to check whether the String input is a Double.
 * Note this function checks for Double range.
 * If range is not important, use IsNumeric instead if you would like to check if a String is numeric. 
 * Also IsNumeric performs slightly better compared to this function.
 */

public class IsDouble extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return false;
        try {
            String str = (String)input.get(0);
            if (str == null || str.length() == 0) return false;
            Double.parseDouble(str);
        } catch (NumberFormatException nfe) {
            return false;
        } catch (ClassCastException e) {
            warn("Unable to cast input "+input.get(0)+" of class "+
                    input.get(0).getClass()+" to String", PigWarning.UDF_WARNING_1);
            return false;
        }

        return true;
    }
    
    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.BOOLEAN)); 
    }
}

/**
 * XPath is a function that allows for text extraction from xml
 */
public class XPath extends EvalFunc<String> {

    /** Hold onto last xpath & xml in case the next call to xpath() is feeding the same xml document
     * The reason for this is because creating an xpath object is costly. */
    private javax.xml.xpath.XPath xpath = null;
    private String xml = null;
    private Document document;
    
    private static boolean cache = true;
    
    /**
     * input should contain: 1) xml 2) xpath 3) optional cache xml doc flag
     * 
     * Usage:
     * 1) XPath(xml, xpath)
     * 2) XPath(xml, xpath, false) 
     * 
     * @param 1st element should to be the xml
     *        2nd element should be the xpath
     *        3rd optional boolean cache flag (default true)
     *        
     * This UDF will cache the last xml document. This is helpful when multiple consecutive xpath calls are made for the same xml document.
     * Caching can be turned off to ensure that the UDF's recreates the internal javax.xml.xpath.XPath for every call
     * 
     * @return chararrary result or null if no match
     */
    @Override
    public String exec(final Tuple input) throws IOException {

        if (input == null || input.size() <= 1) {
            warn("Error processing input, not enough parameters or null input" + input,
                    PigWarning.UDF_WARNING_1);
            return null;
        }


        if (input.size() > 3) {
            warn("Error processing input, too many parameters" + input,
                    PigWarning.UDF_WARNING_1);
            return null;
        }

        try {

            final String xml = (String) input.get(0);
            
            if(input.size() > 2)
                cache = (Boolean) input.get(2);
            
            if(!cache || xpath == null || !xml.equals(this.xml))
            {
                final InputSource source = new InputSource(new StringReader(xml));
                
                this.xml = xml; //track the xml for subsequent calls to this udf

                final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                final DocumentBuilder db = dbf.newDocumentBuilder();
                
                this.document = db.parse(source);

                final XPathFactory xpathFactory = XPathFactory.newInstance();

                this.xpath = xpathFactory.newXPath();
                
            }
            
            final String xpathString = (String) input.get(1);

            final String value = xpath.evaluate(xpathString, document);

            return value;

        } catch (Exception e) {
            warn("Error processing input " + input.getType(0), 
                    PigWarning.UDF_WARNING_1);
            
            return null;
        }
    }

	@Override
	public List<FuncSpec> getArgToFuncMapping() throws FrontendException {

		final List<FuncSpec> funcList = new ArrayList<FuncSpec>();

		/*either two chararray arguments*/
		List<FieldSchema> fields = new ArrayList<FieldSchema>();
		fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
		fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));

		Schema twoArgInSchema = new Schema(fields);

		funcList.add(new FuncSpec(this.getClass().getName(), twoArgInSchema));

		/*or two chararray and a boolean argument*/
		fields = new ArrayList<FieldSchema>();
		fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
		fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
		fields.add(new Schema.FieldSchema(null, DataType.BOOLEAN));

		Schema threeArgInSchema = new Schema(fields);

		funcList.add(new FuncSpec(this.getClass().getName(), threeArgInSchema));

		return funcList;
	}

}

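As you can see, both extend a generic class called EvalFunc. Anyone who has used Pig knows that it supports UDFs (user-defined functions), so that you can implement your own evaluation functions, and those functions must extend this generic class. Pig ships with many such functions; put simply, they are UDFs that Pig has already written for you.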

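These UDF classes are all fairly simple in structure. They differ mainly in two ways:

1. The return type differs.

As shown above, IsDouble's exec method returns a Boolean that says whether the input is a Double value, while XPath's exec returns a String extracted from the XML.

2. The algorithm itself, that is, the implementation of exec(), differs.

Functions that do different things are naturally implemented differently.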
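
To make those two differences concrete, here is a minimal sketch that compiles without Pig on the classpath. SketchEvalFunc, SketchIsDouble, and SketchUpper are hypothetical stand-ins invented for this illustration: the real EvalFunc receives a Tuple and declares IOException, which the sketch replaces with a plain List and no checked exception.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for Pig's EvalFunc<T>. The real class takes a Tuple
// and declares IOException; this sketch uses a List to stay self-contained.
abstract class SketchEvalFunc<T> {
    abstract T exec(List<Object> input);
}

// Same shape as IsDouble: the type parameter is Boolean and the result
// depends on whether the first field parses as a double.
class SketchIsDouble extends SketchEvalFunc<Boolean> {
    @Override
    Boolean exec(List<Object> input) {
        if (input == null || input.isEmpty()) return false;
        Object first = input.get(0);
        if (!(first instanceof String) || ((String) first).isEmpty()) return false;
        try {
            Double.parseDouble((String) first);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

// A String-returning function: only the type parameter and the body of exec change.
class SketchUpper extends SketchEvalFunc<String> {
    @Override
    String exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) return null;
        return input.get(0).toString().toUpperCase(Locale.ROOT);
    }
}

class UdfShapeDemo {
    public static void main(String[] args) {
        System.out.println(new SketchIsDouble().exec(Arrays.<Object>asList("3.14"))); // true
        System.out.println(new SketchIsDouble().exec(Arrays.<Object>asList("pig")));  // false
        System.out.println(new SketchUpper().exec(Arrays.<Object>asList("pig")));     // PIG
    }
}
```

Everything else about the two classes is identical, which is why a quick skim is enough for this whole family of files.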

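These classes play a supporting role in Pig's source code. A quick look at their structure and algorithms is enough, after which they can be deleted.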

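Some readers asked how I picked Pig's UDF classes out of so many source files in order to delete them. Since I analyze the code on Windows, I used Visual Studio's Find In Files feature to search for Java files containing "extends EvalFunc". If you work on Linux, shell commands such as grep can do the same, and so can Eclipse.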
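
If you would rather not depend on a particular IDE, the same search can be approximated in a few lines of Java. This is only a sketch of the idea: the class name EvalFuncFinder and its find method are made up for this post, and it does a plain substring match, so a declaration that breaks the line between "extends" and "EvalFunc" would be missed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class EvalFuncFinder {
    // Returns every .java file under root whose text contains the marker string.
    static List<Path> find(Path root, String marker) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().endsWith(".java"))
                .filter(p -> contains(p, marker))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    private static boolean contains(Path file, String marker) {
        try {
            // ISO_8859_1 maps every byte to a character, so decoding never
            // fails on files saved in an unexpected encoding.
            byte[] bytes = Files.readAllBytes(file);
            return new String(bytes, StandardCharsets.ISO_8859_1).contains(marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the source root as the first argument; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> hits = find(root, "extends EvalFunc");
        hits.forEach(System.out::println);
        System.out.println("Matching files: " + hits.size());
    }
}
```

Run it with the Pig source root as the only argument. The count it prints corresponds to the "Matching files" figure in a Find In Files report, not to the number of matching lines.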

Here is the result of the search in Visual Studio:

Matching lines: 319 Matching files: 248 Total files searched: 1157

As you can see, Pig contains 248 such files. Some of the subclasses are themselves still generic classes, for example AccumulatorEvalFunc<T>.

I sorted these classes by purpose into the following groups:

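1. Type-checking group

These classes are named IsXXX and check whether an input value is of a certain type, like the IsDouble class mentioned above. The classes in this group are all very simple.

2. Format conversion group

These classes are named XXXToYYY and convert an input of type XXX into an output of type YYY. ISOToUnix is an example; its code follows: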

/**
 * <p>ISOToUnix converts ISO8601 datetime strings to Unix Time Longs</p>
 * <ul>
 * <li>Jodatime: http://joda-time.sourceforge.net/</li>
 * <li>ISO8601 Date Format: http://en.wikipedia.org/wiki/ISO_8601</li>
 * <li>Unix Time: http://en.wikipedia.org/wiki/Unix_time</li>
 * </ul>
 * <br />
 * <pre>
 * Example usage:
 *
 * REGISTER /Users/me/commiter/piggybank/java/piggybank.jar ;
 * REGISTER /Users/me/commiter/piggybank/java/lib/joda-time-1.6.jar ;
 *
 * DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
 *
 * ISOin = LOAD 'test.tsv' USING PigStorage('\t') AS (dt:chararray, dt2:chararray);
 *
 * DESCRIBE ISOin;
 * ISOin: {dt: chararray,dt2: chararray}
 *
 * DUMP ISOin;
 *
 * (2009-01-07T01:07:01.000Z,2008-02-01T00:00:00.000Z)
 * (2008-02-06T02:06:02.000Z,2008-02-01T00:00:00.000Z)
 * (2007-03-05T03:05:03.000Z,2008-02-01T00:00:00.000Z)
 * ...
 *
 * toUnix = FOREACH ISOin GENERATE ISOToUnix(dt) AS unixTime:long;
 *
 * DESCRIBE toUnix;
 * toUnix: {unixTime: long}
 *
 * DUMP toUnix;
 *
 * (1231290421000L)
 * (1202263562000L)
 * (1173063903000L)
 * ...
 *</pre>
 */

public class ISOToUnix extends EvalFunc<Long> {

    @Override
    public Long exec(Tuple input) throws IOException
    {
        if (input == null || input.size() < 1) {
            return null;
        }
        
        // Set the time to default or the output is in UTC
        DateTimeZone.setDefault(DateTimeZone.UTC);

        DateTime result = new DateTime(input.get(0).toString());

        return result.getMillis();
    }

	@Override
	public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.LONG));
	}

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));

        return funcList;
    }
}


It converts a date string into a Unix-style long timestamp (milliseconds since the epoch, as the getMillis() call shows).
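
The conversion itself is easy to reproduce without Joda-Time. The sketch below uses java.time.Instant from the JDK instead, so it is a swapped-in technique rather than what ISOToUnix actually does, and the class name IsoToUnixSketch is invented here. Note that Instant.parse is stricter than Joda's DateTime constructor: on Java 8 it only accepts instants written in UTC with a trailing Z.

```java
import java.time.Instant;

class IsoToUnixSketch {
    // Parses an ISO 8601 instant in UTC, e.g. 2009-01-07T01:07:01.000Z,
    // and returns milliseconds since 1970-01-01T00:00:00Z.
    static long toUnixMillis(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    public static void main(String[] args) {
        // First two sample rows from the ISOToUnix Javadoc above
        System.out.println(toUnixMillis("2009-01-07T01:07:01.000Z")); // 1231290421000
        System.out.println(toUnixMillis("2008-02-06T02:06:02.000Z")); // 1202263562000
    }
}
```

Both values agree with the DUMP output quoted in the Javadoc.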

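3. Math group

These classes are named after mathematical operations, for example SIN (sine), ASIN (arcsine), MAX (maximum), IntAbs (absolute value of an integer), and RANDOM (random number). Their implementations are all very simple.

This group also includes big-number arithmetic whose output type is BigDecimal, such as BigDecimalAbs (absolute value) and BigDecimalAvg (average).

Date arithmetic is a special case of this group, for example ISODaysBetween, which computes the number of days between two dates.

4. String-processing group

These classes take a String as input and are named after a string operation, for example UPPER (convert to upper case), Reverse (reverse the string), and Trim (strip leading and trailing whitespace).

Note that this group also contains classes such as HashFNV and HashFNV1, which compute a hash value from a string, and RegexMatch, which matches a string against a regular expression.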
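
For readers who have not met FNV before, here is a sketch of the general idea. It implements the textbook 32-bit FNV-1a variant over UTF-8 bytes. I have not checked it against Pig's HashFNV classes, which may use a different variant or bit width, so treat Fnv1aSketch as a hypothetical illustration and not as Pig's implementation.

```java
import java.nio.charset.StandardCharsets;

class Fnv1aSketch {
    // Standard 32-bit FNV parameters
    private static final int OFFSET_BASIS = 0x811C9DC5;
    private static final int PRIME = 0x01000193;

    // FNV-1a: XOR each byte into the hash, then multiply by the prime.
    // Java int arithmetic wraps modulo 2^32, which is what the algorithm needs.
    static int hash32(String s) {
        int hash = OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(hash32("a"))); // e40c292c
    }
}
```

The whole algorithm is one loop, which is typical of the string-processing group: a short exec body wrapped in the usual EvalFunc boilerplate.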

5. Assertion group

The Assert class checks whether an expression is true and throws an IOException when it is not. Here is the code:

public class Assert extends EvalFunc<Boolean>
{
  @Override
  public Boolean exec(Tuple tuple)
      throws IOException
  {
    if (!(Boolean) tuple.get(0)) {
      if (tuple.size() > 1) {
        throw new IOException("Assertion violated: " + tuple.get(1).toString());
      }
      else {
        throw new IOException("Assertion violated. ");
      }
    }
    else {
      return true;
    }
  }
}

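6. Script execution group

This group can run a function written in another scripting language: for example, the Jruby class runs Ruby scripts and the JsFunction class runs JavaScript.

With the analysis above done, the evaluation functions can be deleted, which leaves roughly 820 Java files.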