可编程的SQL是什么样的?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你使用传统编程语言,比如Python,那么恭喜你,你可能需要解决大部分你不需要解决的问题,"},{"type":"text","marks":[{"type":"strong"}],"text":"用Python就相当于拿到了零部件,而不是一辆能跑的汽车。"},{"type":"text","text":"你花了大量时间去组装汽车,而不是操控汽车去抵达自己的目的地。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大部分非计算机专业的同学要解决的核心问题是数据操作问题,无论你是摆地摊、开餐馆,又或者在办公室做个小职员,在政府机构做工作,你都需要基本的数据处理能力,这本质上是信息处理能力。 但是在操作数据前,你必须要学习诸如变量,函数,线程,分布式等等各种仅仅和语言自身相关的特性,这就变得很没有必要了。操作数据我们也可以使用 Excel(以及类似的软件),但是Excel有Excel的限制,譬如各种鼠标“点点点”的操作,还是有点低效的,有很多较为复杂的逻辑也不太好执行,数据规模也有限。那么,什么样的交互最快,以及可扩展性最好?答案是语言,你和计算机系统约定好的一个语言。有了语言交流,总是比各种“点点点”的操作更高效。那这个语言是啥呢?就是SQL。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是SQL也有些毛病,首先它最早是为了关系型数据库而设计的,适合查询而非ETL,但是现在人们慢慢把它扩展到ETL、流式处理、甚至AI上,这就有点吃力了。 第二个问题是,它是声明式的,导致缺乏可编程性。所谓可编程性是指,我们应该具备创建小型、可理解、可重用的逻辑片段,并且这些逻辑片段还要被测试、被命名、被组织成包,而这些包可以用来构造更多有用的逻辑片段,这样的工作流程才是合理又便捷的。更进一步地说,这些“高阶”能力应该是可选的,我们希望用户一开始就使用最简单的方式来完成手头的工作,而不是显摆一些高阶技巧。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以最后的结论是,我们希望:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"保留SQL的所有原有优势,简洁易懂,上手就可以干活。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"允许用户进阶,提供更多可编程能力,但是以一种SQL Style的方式提供。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"保留原有SQL精髓"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们仅仅对SQL做了丢丢调整,在每条 SQL 语句结尾增加了一个表名,也就是任何一条SQL语句的结果集都可以命名为一张新的表。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"load hive.`raw.stripe_discounts` as discounts;\nload hive.`raw.stripe_invoice_items` as invoice_items;\nselect\n invoice_items.*,\n case\n when discounts.discount_type = 'percent'\n then amount * (1.0 - discounts.discount_value::float \/ 100)\n else amount - discounts.discount_value\n end as discounted_amount\n\n from invoice_items\n\n left outer join discounts\n on invoice_items.customer_id = discounts.customer_id\n and invoice_items.invoice_date > discounts.discount_start\n and (invoice_items.invoice_date < discounts.discount_end\n or discounts.discount_end is null)as joined;\n\n\nselect\n\n id,\n invoice_id,\n customer_id,\n coalesce(discounted_amount, amount) as discounted_amount,\n currency,\n description,\n created_at,\n deleted_at\n\n from joinedas final;\n\nselect * from final as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家看到,每条SQL的执行结果都被取名为一张新表,然后下一条SQL可以引用前面SQL产生的表,相比传统我们需要insert 然后再读取,会简单很多,也更自然,速度更快。而且对于数据处理,我们也无需在一条SQL语句里写复杂的嵌套子查询和Join了,我们可以将SQL展开来书写,校本化,更加易于阅读和使用。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"支持更多数据源"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"传统SQL是假定你在一个数据源中的,因为你只能按库表方式去使用,在普通Web开发里,是你配置的数据库。而在大数据里,一般是数据仓库或者数据湖。 但是随着联邦查询越来越多,越来越普及,我们希望给SQL提供更多的加载和保存多种数据源的能力。我们通过提供load语句来完成。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"load excel.`.\/example-data\/excel\/hello_world.xlsx` \nwhere header=\"true\" \nas hello_world;\n\n\nselect hello from hello_world as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上面的示例可以看到,我们加载了一个excel文件,然后映射成一张表,之后可以用标准的SQL进行处理。如果要将结果保存到数仓也很简单:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"save overwrite hello_word as hive.`tmp.excel_table`;\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"变量"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"变量是一个编程语言里,一般你会接触到的第一个概念。我们也给SQL增加了这种能力。比如:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"-- It takes effect since the declaration in the same cell.\nset world=\"world\";\n\n\nselect \"hello ${world}\" as title \nas output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在可编程SQL中,变量支持多种类型,诸如sql、shell、conf、defaultParam等等去满足各种需求和场景。下面是一个典型的例子:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set date=`select date_sub(CAST(current_timestamp() as DATE), 1) as dt` \nwhere type=\"sql\";\n\n\nselect \"${date}\" as dt as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"后面我们会有更多变量的介绍。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"调用外部模块的代码"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"传统编程语言如Java、Python,他们的生态都是靠第三方模块来提供的。第三方模块会被打包成诸如如Jar 、pip 然后让其他项目引用。 原生的SQL是很难复用的,所以没有形成类似的机制,更多的是随用随写。 但是随着SQL能力的扩展,在流、在批、在机器学习上的应用越来越多,能写越来越复杂的逻辑,也慢慢有了更多的可复用诉求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们通过引入include 关键字,可以引入本项目或者github上的SQL代码。https:\/\/github.com\/allwefantasy\/lib-core 是我们使用可编程SQL写的一个第三方模块。假设我们要引用里面定义的一个UDF 函数 "},{"type":"codeinline","content":[{"type":"text","text":"hello"}]},{"type":"text","text":",第一步是引入模块:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"include lib.`github.com\/allwefantasy\/lib-core`\nwhere \n-- libMirror=\"gitee.com\" and -- 配置代理\n-- commit=\"\" and -- 配置commit点\nalias=\"libCore\";\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二步就是引入相应的udf包,然后在SQL中使用:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"include local.`libCore.udf.hello`;\nselect hello() as name as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"是不是很酷?"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"宏函数"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"函数是代码复用的基础。几乎任何语言都有函数的概念。我们在SQL中也引入的宏函数的概念。但这个宏函数和原生的SQL中的函数比如 split、concat 等等是不一样的。它是SQL语言级别的函数。我们来看看示例:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set loadExcel = '''\nload excel.`{0}` \nwhere header=\"true\" \nas {1}\n''';\n\n\n!loadExcel .\/example-data\/excel\/hello_world.xlsx helloTable;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在这段代码中:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"我们申明了一个变量 loadExcel,并且给他设置了一段代码。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"loadExcel 有诸如 {0}, {1}的占位符。这些会被后续调用时的参数动态替换。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"使用功能 "},{"type":"codeinline","content":[{"type":"text","text":"!"}]},{"type":"text","text":" 将loadExcel变量转化为宏函数进行调用。参数传递类似命令行。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们也支持命名参数:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set loadExcel = '''\nload excel.`${path}` \nwhere header=\"true\" \nas ${tableName}\n''';\n\n\n!loadExcel _ -path .\/example-data\/excel\/hello_world.xlsx -tableName helloTable;\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"原生SQL函数的动态扩展"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"像传统关系型数据库,几乎无法扩展SQL的内置函数。在Hive\/Spark中,通常需要以Jar包形式提供,可能涉及到重启应用,比较繁琐也比较重。 现在,我们把SQL UDF 书写变成和书写SQL一样。 我们来看一个例子:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"register ScriptUDF.`` as arrayLast \nwhere lang=\"scala\"\nand code='''def apply(a:Seq[String])={\n a.last\n}'''\nand udfType=\"udf\";\n\n\nselect arrayLast(array(\"a\",\"b\")) as lastChar as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上面的代码中,我们通过register语法注册了一个函数叫 arrayLast,功能是拿到数组的最后一个值。 我们使用scala代码书写这段逻辑。之后我们可以立马在SQL中使用功能这个函数。是不是随写随用?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"当然,通过模块的能力,你也可以把这些函数集中在一起,然后通过include引入。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"分支语法"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SQL最大的欠缺就是没有分支语句,这导致了一个啥问题呢?它需要寄生在其他语言之中,利用其他语言的分支语句。现在,我们原生的给SQL 加上了这个能力。 看如下代码:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set a = \"wow,jack\";\n\n\n!if ''' split(:a,\",\")[0] == \"jack\" ''';\n select 1 as a as b;\n!else;\n select 2 as a as b;\n!fi;\n\n\nselect * from b as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在分支语句中的条件表达式中,你可以使用一切内置、或者我们扩展的原生函数。比如在上面的例子里,我们在if 语句中使用了 split函数。还有一个大家用得非常多的场景,就是我先查一张表,根据条件决定接着执行什么样的逻辑。这个有了分支语法以后也会变得很简单,比如:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"select 1 as a as mockTable;\nset b_count=`select count(*) from mockTable ` where type=\"sql\" and mode=\"runtime\";\n\n\n!if ''':b_count > 11 ''';\n \n select 1 as a from b as final_table;\n!else; \n select 2 as a from b as final_table;\n!fi; \n\n\nselect * from final_table as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上面的代码示例中,我们先查询 mockTable里有多少数据,如果大于11条,执行 A语句,否则执行B 语句,执行完成后的结果继续被后面的SQL 处理。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"机器学习(内置算法)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SQL表达机器学习其实是比较困难的。但是别忘了我们是可编程的SQL呀。我们来看看示例,第一步我们准备一些数据:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"include project.`.\/src\/common\/mock_data.mlsql`;\n-- create mock\/validate\/test dataset.\nselect vec_dense(features) as features, label as label from mock_data as mock_data;\nselect * from mock_data as mock_data_validate;\nselect * from mock_data as mock_data_test;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接着我们就可以引入一个内置的算法来完成模型的训练。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"train mock_data as RandomForest.`\/tmp\/models\/randomforest` where\n\nkeepVersion=\"true\" \n\nand evaluateTable=\"mock_data_validate\"\n\nand `fitParam.0.labelCol`=\"label\"\nand `fitParam.0.featuresCol`=\"features\"\nand `fitParam.0.maxDepth`=\"2\"\n;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这个语句表达的含义是什么呢? 对mock_data表的数据使用RandomForest进行训练,训练时的参数来自where语句中,训练后的模型保存在路径\/tmp\/models\/randomforest 里。是不是非常naive!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之后你马上可以进行批量预测:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"predict mock_data_test as RandomForest.`\/tmp\/models\/randomforest` as predicted_table;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"或者将模型注册成UDF函数,使用Select语句进行预测:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"register RandomForest.`\/tmp\/models\/randomforest` as model_predict;\nselect vec_array(model_predict(features)) as predicted_value from mock_data as output;\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Python脚本支持"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在可编程SQL里, SQL是一等公民, Python只是一些字符串片段。下面是一段示例代码:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"select 1 as a as mockTable;\n\n!python conf \"schema=st(field(a,long))\";\n\nrun command as Ray.`` where \ninputTable=\"mockTable\"\nand outputTable=\"newMockTable\"\nand code='''\nfrom pyjava.api.mlsql import RayContext\n\nray_context = RayContext.connect(globals(),None)\n\nnewrows = []\nfor row in ray_context.collect():\n row[\"a\"] = 2\n newrows.append(row)\n \ncontext.build_result(newrows)\n''';\n\n\nselect * from newMockTable as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这段代码,我们使用功能Ray 模块执行Python脚本,这段Python脚本会对 mockTable表加工,把a字段从1修改为2,然后处理的结果可以继续被SQL处理。是不是很酷?随时随地写Python处理数据或者做机器学习,数据获取和加工则是标准的SQL来完成。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"插件"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可编程SQL无论语法还是内核功能应该是可以扩展的。 比如我需要一个可以产生测试数据的功能。我只要执行如下指令就可以安装具有这个功能的插件:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"!plugin app add - \"mlsql-mllib-3.0\";\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然后我就获得了一个叫SampleDatasetExt的工具,它可以产生大量的测试数据:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"run command as SampleDatasetExt.`` \nwhere columns=\"id,features,label\" \nand size=\"100000\" \nand featuresSize=\"100\" \nand labelSize=\"2\" \nas mockData;\n\nselect * from mockData as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上面的示例代码中,我们通过SampleDatasetExt 产生了一个具有三列的表,表的记录数为100000, 其中feature字段数组大小为100,label字段的数组大小为2。之后我们可以使用select语句进行查询进一步加工。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"更多编程小trick"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如下面一段代码在实际生产里是常态:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"select SUM( case when `id` is null or `id`='' then 1 else 0 end ) as id,\nSUM( case when `diagnosis` is null or `diagnosis`='' then 1 else 0 end ) as diagnosis,\nSUM( case when `radius_mean` is null or `radius_mean`='' then 1 else 0 end ) as radius_mean,\nSUM( case when `texture_mean` is null or `texture_mean`='' then 1 else 0 end ) as texture_mean,\nSUM( case when `perimeter_mean` is null or `perimeter_mean`='' then 1 else 0 end ) as perimeter_mean,\nSUM( case when `area_mean` is null or `area_mean`='' then 1 else 0 end ) as area_mean,\nSUM( case when `smoothness_mean` is null or `smoothness_mean`='' then 1 else 0 end ) as smoothness_mean,\nSUM( case when `compactness_mean` is null or `compactness_mean`='' then 1 else 0 end ) as compactness_mean,\nSUM( case when `concavity_mean` is null or `concavity_mean`='' then 1 else 0 end ) as concavity_mean,\nSUM( case when `concave points_mean` is null or `concave points_mean`='' then 1 else 0 end ) as concave_points_mean,\nSUM( case when `symmetry_mean` is null or `symmetry_mean`='' then 1 else 0 end ) as symmetry_mean,\nSUM( case when `fractal_dimension_mean` is null or `fractal_dimension_mean`='' then 1 else 0 end ) as fractal_dimension_mean,\nSUM( case when `radius_se` is null or `radius_se`='' then 1 else 0 end ) as radius_se,\nSUM( case when `texture_se` is null or `texture_se`='' then 1 else 0 end ) as texture_se,\nSUM( case when `perimeter_se` is null or `perimeter_se`='' then 1 else 0 end ) as perimeter_se,\nSUM( case when `area_se` is null or `area_se`='' then 1 else 0 end ) as area_se,\nSUM( case when `smoothness_se` is null or `smoothness_se`='' then 1 else 0 end ) as smoothness_se,\nSUM( case when `compactness_se` is null or `compactness_se`='' then 1 else 0 end ) as compactness_se,\nSUM( case when `concavity_se` is null or `concavity_se`='' then 1 else 0 end ) as concavity_se,\nSUM( case when `concave points_se` is null or `concave points_se`='' then 1 else 0 end ) as concave_points_se,\nSUM( case when `symmetry_se` is null or `symmetry_se`='' then 1 else 0 end ) as symmetry_se,\nSUM( case when `fractal_dimension_se` is null or `fractal_dimension_se`='' then 1 else 0 end ) as fractal_dimension_se,\nSUM( case when `radius_worst` is null or `radius_worst`='' then 1 else 0 end ) as radius_worst,\nSUM( case when `texture_worst` is null or `texture_worst`='' then 1 else 0 end ) as texture_worst,\nSUM( case when `perimeter_worst` is null or `perimeter_worst`='' then 1 else 0 end ) as perimeter_worst,\nSUM( case when `area_worst` is null or `area_worst`='' then 1 else 0 end ) as area_worst,\nSUM( case when `smoothness_worst` is null or `smoothness_worst`='' then 1 else 0 end ) as smoothness_worst,\nSUM( case when `compactness_worst` is null or `compactness_worst`='' then 1 else 0 end ) as compactness_worst,\nSUM( case when `concavity_worst` is null or `concavity_worst`='' then 1 else 0 end ) as concavity_worst,\nSUM( case when `concave points_worst` is null or `concave points_worst`='' then 1 else 0 end ) as concave_points_worst,\nSUM( case when `symmetry_worst` is null or `symmetry_worst`='' then 1 else 0 end ) as symmetry_worst,\nSUM( case when `fractal_dimension_worst` is null or `fractal_dimension_worst`='' then 1 else 0 end ) as fractal_dimension_worst,\nSUM( case when `_c32` is null or `_c32`='' then 1 else 0 end ) as _c32\nfrom data as data_id;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"写的手累?那有么有办法简化呢?当然有啦。 我们毕竟是可编程是SQL呀。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一个有意思的解决方法如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"select \n#set($colums=[\"id\",\"diagnosis\",\"fractal_dimension_worst\"])\n#foreach( $column in $colums )\n SUM( case when `$column` is null or `$column`='' then 1 else 0 end ) as $column,\n#end\n 1 as a from newTable as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们可以使用内置的 "},{"type":"codeinline","content":[{"type":"text","text":"#foreach"}]},{"type":"text","text":" 循环。先通过set设置所有字段名称,然后通过foreach循环来生成sum语句。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这就完了?就如同茴字有好多写法,我们还有其他的玩法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set sum_tpl = '''\nSUM( case when `{0}` is null or `{0}`='' then 1 else 0 end ) as {0}\n''';\n\n\nselect ${template.get(\"sum_tpl\",\"diagnosis\")},\n${template.get(\"sum_tpl\",\"radius_mean\")},\n${template.get(\"sum_tpl\",\"texture_mean\")},\nfrom data as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们可以通过set 进行模板设置,然后在sql语句里通过template.get( 语句进行模板渲染。 对于一个很复杂的SQL 语句,里面可能存在多个类似sum \/case when的重复语句,那么我们就可以使用这种方式了。而且可以做到一处修改,处处生效。不然万一你 sum里的1要改成2,那可是要改好几十个语句的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"恩,除了这些,还有非常多的好玩的玩法等待你去挖掘,SQL 再也不 Boring 了。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"不是最后的最后"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"可以看到,我们给原生SQL扩展了变量、函数、多数据源支持、第三方模块、原生函数动态扩展、分支语法、机器学习、python脚本支持、插件等等诸多功能。就像TypeScript对JavaScript的增强一样,大家也可以只用最基础的SQL语法。但是一旦有需要,你就可以使用更多高阶功能满足自己的诉求。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"最后"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这个可编程的SQL是还在梦想中么?当然不是,它就在这里: https:\/\/mlsql.tech 我们提供了桌面版和在线试用版本,快来感受下吧!"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章