spark-sql-catalyst

@(spark)[sql][catalyst]
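In short, this component does the work of an optimizer. There is a paper on it that explains things clearly and can be read as a high-level design.

There is also a blog post that covers much the same content.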

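Overall, what Catalyst does is essentially what a traditional relational database does:
1. Parse: turn the SQL statement into a valid syntax tree.
2. Resolve: verify that the columns, tables, and so on actually exist, and bind the schemas of tables and columns to their concrete names.
3. Generate the concrete logical plan. See talyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala for details; typical operators include filter, project, sort, and union.
4. Optimize: this is a rule-based optimizer, and the code is in catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala.

In principle, Catalyst is not inherently tied to Spark; it can be viewed as a standalone SQL optimizer.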
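To make the rule-based optimization step more concrete, here is a minimal sketch in plain Scala with no Spark dependency. The plan classes (Plan, Scan, Filter, Project), the two rules, and ToyOptimizer are all hypothetical names invented for illustration; they are not Catalyst's real classes, and conditions are plain strings only to keep the example short. The idea shown is just "apply rewrite rules repeatedly until the plan stops changing".

```scala
// Toy logical plan; illustrative names only, not Catalyst's classes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan

object ToyOptimizer {
  type Rule = Plan => Plan

  // Apply a partial function bottom-up: rewrite the children first, then the node itself.
  def transformUp(plan: Plan)(pf: PartialFunction[Plan, Plan]): Plan = {
    val withNewChildren: Plan = plan match {
      case Filter(c, child)     => Filter(c, transformUp(child)(pf))
      case Project(cols, child) => Project(cols, transformUp(child)(pf))
      case leaf                 => leaf
    }
    pf.applyOrElse(withNewChildren, (p: Plan) => p)
  }

  // Rule 1: drop a filter whose condition is trivially true.
  val removeTrueFilter: Rule = plan => transformUp(plan) {
    case Filter("true", child) => child
  }

  // Rule 2: merge two adjacent filters into a single conjunction.
  val combineFilters: Rule = plan => transformUp(plan) {
    case Filter(outer, Filter(inner, child)) => Filter(s"($inner) AND ($outer)", child)
  }

  // Run every rule in order, and repeat until the plan no longer changes (a fixed point).
  def optimize(plan: Plan, rules: Seq[Rule] = Seq(removeTrueFilter, combineFilters)): Plan = {
    val next = rules.foldLeft(plan)((p, rule) => rule(p))
    if (next == plan) plan else optimize(next, rules)
  }
}
```

Because the plan nodes are immutable case classes, each rule returns a new tree, and structural equality gives a cheap way to detect that a fixed point has been reached.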

types

Native types

Worth mentioning is UserDefinedType:

/**                                                                                                                                                                     
 * ::DeveloperApi::                                                                                                                                                     
 * The data type for User Defined Types (UDTs).                                                                                                                         
 *                                                                                                                                                                      
 * This interface allows a user to make their own classes more interoperable with SparkSQL;                                                                             
 * e.g., by creating a [[UserDefinedType]] for a class X, it becomes possible to create                                                                                 
 * a `DataFrame` which has class X in the schema.                                                                                                                       
 *                                                                                                                                                                      
 * For SparkSQL to recognize UDTs, the UDT must be annotated with                                                                                                       
 * [[SQLUserDefinedType]].                                                                                                                                              
 *                                                                                                                                                                      
 * The conversion via `serialize` occurs when instantiating a `DataFrame` from another RDD.                                                                             
 * The conversion via `deserialize` occurs when reading from a `DataFrame`.                                                                                             
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class UserDefinedType[UserType] extends DataType with Serializable {          

Let's look at an example:

class PointUDT extends UserDefinedType[Point] {
    def dataType = StructType(Seq( // Our native structure
        StructField("x", DoubleType),
        StructField("y", DoubleType)
    ))
    def serialize(p: Point) = Row(p.x, p.y)
    def deserialize(r: Row) =
        Point(r.getDouble(0), r.getDouble(1))
}

Decimal

For the dreaded decimal type, there is a dedicated class that optimizes it:

/**                                                                                                                                                                     
 * A mutable implementation of BigDecimal that can hold a Long if values are small enough.                                                                              
 *                                                                                                                                                                      
 * The semantics of the fields are as follows:                                                                                                                          
 * - _precision and _scale represent the SQL precision and scale we are looking for                                                                                     
 * - If decimalVal is set, it represents the whole decimal value                                                                                                        
 * - Otherwise, the decimal value is longVal / (10 ** _scale)                                                                                                           
 */                                                                                                                                                                     
final class Decimal extends Ordered[Decimal] with Serializable {  
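The field semantics in the comment above can be illustrated with a small sketch. This is not Spark's Decimal: CompactDecimal is a hypothetical, immutable simplification written for this note, and the 18-digit threshold is an assumption chosen because any 18-digit decimal integer fits in a Long. It only shows the idea of storing an unscaled Long when the value is small and falling back to a BigDecimal otherwise.

```scala
// Hypothetical simplification of the "hold a Long if the value is small enough" idea.
final class CompactDecimal private (
    val longVal: Long,                   // unscaled value, meaningful only when decimalVal is empty
    val decimalVal: Option[BigDecimal],  // set only when the value does not fit the compact form
    val scale: Int) {

  def isCompact: Boolean = decimalVal.isEmpty

  // If decimalVal is set it is the whole value; otherwise the value is longVal / (10 ** scale).
  def toBigDecimal: BigDecimal = decimalVal.getOrElse(BigDecimal(longVal, scale))
}

object CompactDecimal {
  // Assumption for this sketch: up to 18 decimal digits always fit in a Long.
  val MaxLongDigits = 18

  def apply(value: BigDecimal): CompactDecimal =
    if (value.precision <= MaxLongDigits)
      new CompactDecimal(value.bigDecimal.unscaledValue().longValue(), None, value.scale)
    else
      new CompactDecimal(0L, Some(value), value.scale)
}
```

For example, CompactDecimal(BigDecimal("123.45")) would be stored as the Long 12345 with scale 2, while a 22-digit value keeps its BigDecimal.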

Metadata

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 *                                                                                                                                                                      
 * Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,                                                                      
 * Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and                                                                       
 * Array[Metadata]. JSON is used for serialization.                                                                                                                     
 *                                                                                                                                                                      
 * The default constructor is private. User should use either [[MetadataBuilder]] or                                                                                    
 * [[Metadata.fromJson()]] to create Metadata instances.                                                                                                                
 *                                                                                                                                                                      
 * @param map an immutable map that stores the data                                                                                                                     
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
sealed class Metadata private[types] (private[types] val map: Map[String, Any])                                                                                         
  extends Serializable {    

Points to note

  1. Read the parser documentation carefully, especially the operators.
  2. In regular expressions, (?i) starts case-insensitive mode and (?-i) turns it off.
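The two regex flags can be tried out with a short sketch. The object name CaseInsensitiveRegex and the helper matchesFully are made up for this example; the flags themselves are standard java.util.regex embedded flags, which Scala's Regex is built on.

```scala
import scala.util.matching.Regex

object CaseInsensitiveRegex {
  // (?i) at the start makes the whole pattern case-insensitive.
  val select: Regex = "(?i)select".r

  // (?i) ... (?-i) limits case-insensitivity to the part in between:
  // "select" matches in any case, but " FROM" must appear exactly as written.
  val mixed: Regex = "(?i)select(?-i) FROM".r

  // True only if the regex matches the entire input string.
  def matchesFully(regex: Regex, s: String): Boolean =
    regex.pattern.matcher(s).matches()
}
```

With these definitions, "SeLeCt" matches select in full, "select FROM" matches mixed, and "SELECT from" does not match mixed because the lower-case "from" falls outside the case-insensitive region.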

Tree

The main data type in Catalyst is a tree composed of node objects. Each node has a node type and zero or more children. New node types are defined in Scala as subclasses of the TreeNode class. These objects are immutable and can be manipulated using functional transformations, as discussed in the next subsection.

abstract class TreeNode[BaseType <: TreeNode[BaseType]] {                                                                                                               
  self: BaseType with Product =>   

TreeNode defines a large number of traversal, map, copy, and transform methods.
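As a rough illustration of what an immutable tree with a functional transform looks like, here is a miniature written from scratch. Node, Lit, Attr, and Add are hypothetical stand-ins rather than Catalyst's TreeNode hierarchy, and the pre-order traversal is an assumption of this sketch.

```scala
// Hypothetical miniature of an immutable tree; not Catalyst's actual TreeNode.
sealed trait Node {
  def children: Seq[Node]

  // Nodes are immutable, so "changing" the children means building a new node.
  def withNewChildren(newChildren: Seq[Node]): Node

  // Pre-order transform: try the rule on this node, then recurse into the children of the result.
  // Nodes the rule does not match are kept as they are.
  def transform(rule: PartialFunction[Node, Node]): Node = {
    val afterRule = rule.applyOrElse(this, (n: Node) => n)
    afterRule.withNewChildren(afterRule.children.map(_.transform(rule)))
  }

  // Visit every node in pre-order.
  def foreach(f: Node => Unit): Unit = {
    f(this)
    children.foreach(_.foreach(f))
  }
}

case class Lit(value: Int) extends Node {
  def children: Seq[Node] = Nil
  def withNewChildren(newChildren: Seq[Node]): Node = this
}

case class Attr(name: String) extends Node {
  def children: Seq[Node] = Nil
  def withNewChildren(newChildren: Seq[Node]): Node = this
}

case class Add(left: Node, right: Node) extends Node {
  def children: Seq[Node] = Seq(left, right)
  def withNewChildren(newChildren: Seq[Node]): Node = Add(newChildren(0), newChildren(1))
}
```

A rule is just a partial function over nodes. For the tree Add(Attr("x"), Add(Lit(1), Lit(2))), calling transform with the single case Add(Lit(a), Lit(b)) => Lit(a + b) rewrites only the inner addition and leaves the rest of the tree untouched.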

Expression

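Expression is a huge branch of the code; anyone who has worked on databases knows how complex this area gets. Allow me to paste the Expression code here.
The concrete subclasses are not discussed further in this article.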

abstract class Expression extends TreeNode[Expression] {                                                                                                                
  self: Product =>                                                                                                                                                      

  /** The narrowest possible type that is produced when this expression is evaluated. */                                                                                
  type EvaluatedType <: Any                                                                                                                                             

  /**                                                                                                                                                                   
   * Returns true when an expression is a candidate for static evaluation before the query is                                                                           
   * executed.                                                                                                                                                          
   *                                                                                                                                                                    
   * The following conditions are used to determine suitability for constant folding:                                                                                   
   *  - A [[Coalesce]] is foldable if all of its children are foldable                                                                                                  
   *  - A [[BinaryExpression]] is foldable if its both left and right child are foldable                                                                                
   *  - A [[Not]], [[IsNull]], or [[IsNotNull]] is foldable if its child is foldable                                                                                    
   *  - A [[Literal]] is foldable                                                                                                                                       
   *  - A [[Cast]] or [[UnaryMinus]] is foldable if its child is foldable                                                                                               
   */                                                                                                                                                                   
  def foldable: Boolean = false                                                                                                                                         
  def nullable: Boolean                                                                                                                                                 
  def references: AttributeSet = AttributeSet(children.flatMap(_.references.iterator))                                                                                  

  /** Returns the result of evaluating this expression on a given input Row */                                                                                          
  def eval(input: Row = null): EvaluatedType                                                                                                                            

  /**
   * Returns `true` if this expression and all its children have been resolved to a specific schema
   * and `false` if it still contains any unresolved placeholders. Implementations of expressions                                                                       
   * should override this if the resolution of this type of expression involves more than just                                                                          
   * the resolution of its children.                                                                                                                                    
   */                                                                                                                                                                   
  lazy val resolved: Boolean = childrenResolved                                                                                                                         

  /**                                                                                                                                                                   
   * Returns the [[DataType]] of the result of evaluating this expression.  It is                                                                                       
   * invalid to query the dataType of an unresolved expression (i.e., when `resolved` == false).                                                                        
   */                                                                                                                                                                   
  def dataType: DataType                                                                                                                                                

  /**                                                                                                                                                                   
   * Returns true if  all the children of this expression have been resolved to a specific schema                                                                       
   * and false if any still contains any unresolved placeholders.                                                                                                       
   */                                                                                                                                                                   
  def childrenResolved: Boolean = !children.exists(!_.resolved)                                                                                                         

  /**
   * Returns a string representation of this expression that does not have developer centric
   * debugging information like the expression id.                                                                                                                      
   */                                                                                                                                                                   
  def prettyString: String = {                                                                                                                                          
    transform {                                                                                                                                                         
      case a: AttributeReference => PrettyAttribute(a.name)                                                                                                             
      case u: UnresolvedAttribute => PrettyAttribute(u.name)                                                                                                            
    }.toString                                                                                                                                                          
  }                                                                                                                                                                     
}    
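The foldable and eval members above lend themselves to a small worked sketch of constant folding. Everything here (Expr, Literal, Attribute, Plus, Negate, ConstantFolding) is a hypothetical toy over integers written for this note, not the real Expression hierarchy; it follows the conditions listed in the comment: a literal is foldable, and a binary or unary node is foldable if its children are.

```scala
// Toy expression tree over Ints; illustrative only.
sealed trait Expr {
  // True when the expression can be evaluated before the query runs.
  def foldable: Boolean = false
  // Evaluate against an input row, modelled here as a name -> value map.
  def eval(input: Map[String, Int]): Int
}

case class Literal(value: Int) extends Expr {
  override def foldable: Boolean = true
  def eval(input: Map[String, Int]): Int = value
}

case class Attribute(name: String) extends Expr {
  // Depends on the input row, so it keeps the default foldable = false.
  def eval(input: Map[String, Int]): Int = input(name)
}

case class Plus(left: Expr, right: Expr) extends Expr {
  override def foldable: Boolean = left.foldable && right.foldable
  def eval(input: Map[String, Int]): Int = left.eval(input) + right.eval(input)
}

case class Negate(child: Expr) extends Expr {
  override def foldable: Boolean = child.foldable
  def eval(input: Map[String, Int]): Int = -child.eval(input)
}

object ConstantFolding {
  // Replace every foldable subtree with the literal it evaluates to.
  def fold(e: Expr): Expr = e match {
    case l: Literal               => l
    case other if other.foldable  => Literal(other.eval(Map.empty))
    case Plus(l, r)               => Plus(fold(l), fold(r))
    case Negate(c)                => Negate(fold(c))
    case a: Attribute             => a
  }
}
```

Folding x + (1 + (-2)) in this toy leaves the attribute alone and collapses the constant subtree, giving Plus(Attribute("x"), Literal(-1)).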

DSL

catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala defines a large number of implicit conversions to support the DSL.
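How implicit conversions can turn ordinary Scala syntax into an expression tree is easiest to see in a toy version. The names below (DslExpr, IntLit, Col, GreaterThan, EqualTo, And, ToyDsl) are hypothetical and much smaller than the real dsl package; this is a sketch of the technique only.

```scala
import scala.language.implicitConversions

// Toy expression nodes that the DSL builds; illustrative names only.
sealed trait DslExpr
case class IntLit(value: Int) extends DslExpr
case class Col(name: String) extends DslExpr
case class GreaterThan(left: DslExpr, right: DslExpr) extends DslExpr
case class EqualTo(left: DslExpr, right: DslExpr) extends DslExpr
case class And(left: DslExpr, right: DslExpr) extends DslExpr

object ToyDsl {
  // Lift a plain Int into a literal node wherever a DslExpr is expected.
  implicit def intToLit(i: Int): DslExpr = IntLit(i)

  // Add operator syntax to every expression.
  implicit class DslOps(val left: DslExpr) extends AnyVal {
    def >(right: DslExpr): DslExpr = GreaterThan(left, right)
    def ===(right: DslExpr): DslExpr = EqualTo(left, right)
    def &&(right: DslExpr): DslExpr = And(left, right)
  }
}

object DslDemo {
  import ToyDsl._

  // Reads like a predicate, but evaluates to an expression tree.
  val predicate: DslExpr = (Col("age") > 18) && (Col("dept") === 1)
}
```

Here the compiler inserts DslOps to find the operators and intToLit to wrap 18 and 1, so predicate is the tree And(GreaterThan(Col("age"), IntLit(18)), EqualTo(Col("dept"), IntLit(1))).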

SqlLexical

class SqlLexical extends StdLexical {

SqlParser

/**                                                                                                                                                                     
 * A very simple SQL parser.  Based loosely on:                                                                                                                         
 * https://github.com/stephentu/scala-sql-parser/blob/master/src/main/scala/parser.scala                                                                                
 *                                                                                                                                                                      
 * Limitations:                                                                                                                                                         
 *  - Only supports a very limited subset of SQL.                                                                                                                       
 *                                                                                                                                                                      
 * This is currently included mostly for illustrative purposes.  Users wanting more complete support                                                                    
 * for a SQL like language should checkout the HiveQL support in the sql/hive sub-project.                                                                              
 */                                                                                                                                                                     
class SqlParser extends AbstractSparkSQLParser with DataTypeParser {   

Including comments, the file is 386 lines in total. It does not cover full SQL, of course, but it is adequate and fairly concise.

plans

QueryPlan

abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanType] {   

The base class of all plans.

JoinType

sealed abstract class JoinType                                                                                                                                          

case object Inner extends JoinType                                                                                                                                      

case object LeftOuter extends JoinType                                                                                                                                  

case object RightOuter extends JoinType                                                                                                                                 

case object FullOuter extends JoinType                                                                                                                                  

case object LeftSemi extends JoinType  
