What Does Programmable SQL Look Like?

Original

2021-10-27 16:23

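{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If you use a traditional programming language such as Python, then congratulations: you will probably have to solve a lot of problems you should never have needed to solve. "},{"type":"text","marks":[{"type":"strong"}],"text":"Using Python is like being handed a box of parts rather than a car that runs."},{"type":"text","text":" You spend most of your time assembling the car instead of driving it to your destination."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For most people outside computer science, the core problem is manipulating data. Whether you run a street stall, own a restaurant, work as an office clerk, or serve in a government agency, you need basic data-processing skills, which are really information-processing skills. Yet before you can touch any data, you have to learn variables, functions, threads, distributed computing, and other features that concern only the language itself, and that is unnecessary. We can also manipulate data with Excel (or similar software), but Excel has its limits: endless point-and-click with the mouse is inefficient, more complex logic is hard to express, and the data volume it handles is limited. So which kind of interaction is fastest and scales best? The answer is a language, one agreed upon between you and the computer system. Communicating through a language is always more efficient than point-and-click. And which language is that? SQL."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"But SQL has flaws of its own. First, it was originally designed for relational databases and suits querying rather than ETL, yet people are gradually stretching it to ETL, stream processing, and even AI, where it starts to strain. Second, it is declarative, which leaves it short on programmability. By programmability we mean the ability to create small, understandable, reusable pieces of logic that can be tested, named, and organized into packages, which in turn can be used to build further useful pieces of logic; that is what a sensible, convenient workflow looks like. Going a step further, these “advanced” capabilities should be optional: we want users to start with the simplest way to get the job at hand done, not to show off advanced tricks."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So our conclusion is that we want to:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Keep all of SQL's existing strengths: concise, easy to understand, and usable right away."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null}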
,"content":[{"type":"text","text":"Let users grow into more programmability, delivered in a SQL style."}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Keeping the Essence of SQL"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We made only a tiny adjustment to SQL: every statement ends with a table name, so the result set of any SQL statement can be named as a new table."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"load hive.`raw.stripe_discounts` as discounts;\nload hive.`raw.stripe_invoice_items` as invoice_items;\nselect\n invoice_items.*,\n case\n when discounts.discount_type = 'percent'\n then amount * (1.0 - discounts.discount_value::float \/ 100)\n else amount - discounts.discount_value\n end as discounted_amount\n\n from invoice_items\n\n left outer join discounts\n on invoice_items.customer_id = discounts.customer_id\n and invoice_items.invoice_date > discounts.discount_start\n and (invoice_items.invoice_date < discounts.discount_end\n or discounts.discount_end is null) as joined;\n\n\nselect\n\n id,\n invoice_id,\n customer_id,\n coalesce(discounted_amount, amount) as discounted_amount,\n currency,\n description,\n created_at,\n deleted_at\n\n from joined as final;\n\nselect * from final as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As you can see, the result of each SQL statement is named as a new table, and the next statement can reference tables produced by earlier ones. Compared with the traditional approach of inserting into a table and then reading it back, this is much simpler, more natural, and faster. For data processing, we also no longer need to cram complex nested subqueries and joins into a single statement; we can unfold the SQL into a script-like sequence that is easier to read and use."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Supporting More Data Sources"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traditional SQL assumes you work inside a single data source, because you can only address data as databases and tables. In ordinary web development that source is the database you configured; in big data it is usually a data warehouse or data lake. But as federated queries become more common, we want SQL to be able to load from and save to many kinds of data sources. We do this with the load statement."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"load excel.`.\/example-data\/excel\/hello_world.xlsx` \nwhere header=\"true\" \nas hello_world;\n\n\nselect hello from hello_world as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the example above, we load an Excel file and map it to a table, which can then be processed with standard SQL. Saving the result to the data warehouse is just as simple:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"save overwrite hello_world as hive.`tmp.excel_table`;\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Variables"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Variables are usually the first concept you meet in a programming language. We added this capability to SQL as well. For example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"-- Takes effect from this declaration onward, within the same cell.\nset world=\"world\";\n\n\nselect \"hello ${world}\" as title \nas 
output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In programmable SQL, variables come in several types, such as sql, shell, conf, and defaultParam, to cover different needs and scenarios. Here is a typical example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set date=`select date_sub(CAST(current_timestamp() as DATE), 1) as dt` \nwhere type=\"sql\";\n\n\nselect \"${date}\" as dt as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We will cover variables in more detail later."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Calling Code from External Modules"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The ecosystems of traditional languages such as Java and Python are built on third-party modules, which are packaged as Jar files or pip packages for other projects to reference. Plain SQL is hard to reuse, so no comparable mechanism emerged; code is mostly written on the spot as needed. But as SQL's capabilities expand into streaming, batch, and machine learning, and the logic it expresses grows more complex, the demand for reuse is growing too."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the include keyword, we can pull in SQL code from the current project or from GitHub. https:\/\/github.com\/allwefantasy\/lib-core is a third-party module we wrote in programmable SQL. Suppose we want to use the UDF "},{"type":"codeinline","content":[{"type":"text","text":"hello"}]},{"type":"text","text":" defined there. The first step is to include the module:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"include lib.`github.com\/allwefantasy\/lib-core`\nwhere \n-- libMirror=\"gitee.com\" and -- set a mirror\n-- commit=\"\" and -- 
pin a specific commit\nalias=\"libCore\";\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second step is to include the corresponding UDF package and use it in SQL:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"include local.`libCore.udf.hello`;\nselect hello() as name as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pretty cool, right?"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Macro Functions"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Functions are the foundation of code reuse, and almost every language has them. We introduced macro functions into SQL as well. A macro function differs from native SQL functions such as split and concat: it is a function at the level of the SQL language itself. Let's look at an example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set loadExcel = '''\nload excel.`{0}` \nwhere header=\"true\" \nas {1}\n''';\n\n\n!loadExcel .\/example-data\/excel\/hello_world.xlsx helloTable;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this code:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"We declare a variable loadExcel and assign it a snippet of code."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"loadExcel contains placeholders such as {0} and {1}, which are replaced dynamically by the arguments supplied at call time."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"The "},{"type":"codeinline","content":[{"type":"text","text":"!"}]},{"type":"text","text":" prefix turns the loadExcel variable into a macro function call. Arguments are passed much like on a command line."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Named parameters are supported too:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set loadExcel = '''\nload excel.`${path}` \nwhere header=\"true\" \nas ${tableName}\n''';\n\n\n!loadExcel _ -path .\/example-data\/excel\/hello_world.xlsx -tableName helloTable;\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Dynamically Extending Native SQL Functions"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In traditional relational databases it is nearly impossible to extend SQL's built-in functions. In Hive\/Spark, extensions usually have to be shipped as Jar packages and may require restarting the application, which is tedious and heavyweight. Now writing a SQL UDF is just like writing SQL. Let's look at an example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"register ScriptUDF.`` as arrayLast \nwhere lang=\"scala\"\nand code='''def apply(a:Seq[String])={\n a.last\n}'''\nand udfType=\"udf\";\n\n\nselect arrayLast(array(\"a\",\"b\")) as lastChar as 
output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the code above, the register syntax registers a function named arrayLast that returns the last element of an array. The logic is written in Scala, and the function can be used in SQL immediately afterwards. Write it, then use it, just like that."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Of course, with modules you can also collect such functions in one place and bring them in with include."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Branching Syntax"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SQL's biggest gap is the lack of branch statements. What problem does that cause? SQL ends up embedded in a host language and borrows that language's branching. Now we have added this capability to SQL natively. Consider the following code:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set a = \"wow,jack\";\n\n\n!if ''' split(:a,\",\")[0] == \"jack\" ''';\n select 1 as a as b;\n!else;\n select 2 as a as b;\n!fi;\n\n\nselect * from b as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a branch condition you can use any built-in function or any native function we have extended. In the example above, the if statement uses the split function. Another very common scenario is querying a table first and then deciding what to run next based on the result. Branching makes this easy as well, for example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"select 1 as a as mockTable;\nset b_count=`select count(*) from mockTable ` where type=\"sql\" and mode=\"runtime\";\n\n\n!if ''':b_count > 11 ''';\n \n select 1 as a from mockTable as final_table;\n!else; \n select 2 as a from mockTable as final_table;\n!fi; \n\n\nselect * from final_table as 
output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the example above, we first count the rows in mockTable. If there are more than 11, statement A runs; otherwise statement B runs. The result can then be processed by the SQL that follows."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Machine Learning (Built-in Algorithms)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Expressing machine learning in SQL is actually rather hard. But don't forget that ours is programmable SQL. Let's look at an example. First we prepare some data:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"include project.`.\/src\/common\/mock_data.mlsql`;\n-- create mock\/validate\/test dataset.\nselect vec_dense(features) as features, label as label from mock_data as mock_data;\nselect * from mock_data as mock_data_validate;\nselect * from mock_data as mock_data_test;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next we bring in a built-in algorithm to train the model."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"train mock_data as RandomForest.`\/tmp\/models\/randomforest` where\n\nkeepVersion=\"true\" \n\nand evaluateTable=\"mock_data_validate\"\n\nand `fitParam.0.labelCol`=\"label\"\nand `fitParam.0.featuresCol`=\"features\"\nand `fitParam.0.maxDepth`=\"2\"\n;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What does this statement mean? It trains a RandomForest on the data in the mock_data table, takes the training parameters from the where clause, and saves the trained model under the path \/tmp\/models\/randomforest. Simple, isn't it?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You can then run batch prediction right away:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"predict mock_data_test as RandomForest.`\/tmp\/models\/randomforest` as predicted_table;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Or register the model as a UDF and predict with a select statement:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"register RandomForest.`\/tmp\/models\/randomforest` as model_predict;\nselect vec_array(model_predict(features)) as predicted_value from mock_data as output;\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Python Script Support"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In programmable SQL, SQL is the first-class citizen and Python is just a string snippet. Here is an example:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"select 1 as a as mockTable;\n\n!python conf \"schema=st(field(a,long))\";\n\nrun command as Ray.`` where \ninputTable=\"mockTable\"\nand outputTable=\"newMockTable\"\nand code='''\nfrom pyjava.api.mlsql import RayContext\n\nray_context = RayContext.connect(globals(),None)\n\nnewrows = []\nfor row in ray_context.collect():\n row[\"a\"] = 2\n newrows.append(row)\n \ncontext.build_result(newrows)\n''';\n\n\nselect * from newMockTable as 
output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here we use the Ray module to run a Python script. The script processes the mockTable table, changing field a from 1 to 2, and the result can then be processed further with SQL. Pretty cool, right? You can write Python anywhere to process data or do machine learning, while data retrieval and transformation stay in standard SQL."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Plugins"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Programmable SQL should be extensible in both its syntax and its core features. Say I need a way to generate test data. I only have to run the following command to install a plugin that provides it:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"!plugin app add - \"mlsql-mllib-3.0\";\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That gives me a tool called SampleDatasetExt, which can generate large volumes of test data:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"run command as SampleDatasetExt.`` \nwhere columns=\"id,features,label\" \nand size=\"100000\" \nand featuresSize=\"100\" \nand labelSize=\"2\" \nas mockData;\n\nselect * from mockData as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the example above, SampleDatasetExt generates a three-column table with 100000 records, with the features array sized 100 and labelSize set to 2. We can then query it with select statements for further processing."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"More Programming Tricks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Code like the following is routine in real production work:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"select SUM( case when `id` is null or `id`='' then 1 else 0 end ) as id,\nSUM( case when `diagnosis` is null or `diagnosis`='' then 1 else 0 end ) as diagnosis,\nSUM( case when `radius_mean` is null or `radius_mean`='' then 1 else 0 end ) as radius_mean,\nSUM( case when `texture_mean` is null or `texture_mean`='' then 1 else 0 end ) as texture_mean,\nSUM( case when `perimeter_mean` is null or `perimeter_mean`='' then 1 else 0 end ) as perimeter_mean,\nSUM( case when `area_mean` is null or `area_mean`='' then 1 else 0 end ) as area_mean,\nSUM( case when `smoothness_mean` is null or `smoothness_mean`='' then 1 else 0 end ) as smoothness_mean,\nSUM( case when `compactness_mean` is null or `compactness_mean`='' then 1 else 0 end ) as compactness_mean,\nSUM( case when `concavity_mean` is null or `concavity_mean`='' then 1 else 0 end ) as concavity_mean,\nSUM( case when `concave points_mean` is null or `concave points_mean`='' then 1 else 0 end ) as concave_points_mean,\nSUM( case when `symmetry_mean` is null or `symmetry_mean`='' then 1 else 0 end ) as symmetry_mean,\nSUM( case when `fractal_dimension_mean` is null or `fractal_dimension_mean`='' then 1 else 0 end ) as fractal_dimension_mean,\nSUM( case when `radius_se` is null or `radius_se`='' then 1 else 0 end ) as radius_se,\nSUM( case when `texture_se` is null or `texture_se`='' then 1 else 0 end ) as texture_se,\nSUM( case when `perimeter_se` is null or `perimeter_se`='' then 1 else 0 end ) as perimeter_se,\nSUM( case when `area_se` is null or `area_se`='' then 1 
else 0 end ) as area_se,\nSUM( case when `smoothness_se` is null or `smoothness_se`='' then 1 else 0 end ) as smoothness_se,\nSUM( case when `compactness_se` is null or `compactness_se`='' then 1 else 0 end ) as compactness_se,\nSUM( case when `concavity_se` is null or `concavity_se`='' then 1 else 0 end ) as concavity_se,\nSUM( case when `concave points_se` is null or `concave points_se`='' then 1 else 0 end ) as concave_points_se,\nSUM( case when `symmetry_se` is null or `symmetry_se`='' then 1 else 0 end ) as symmetry_se,\nSUM( case when `fractal_dimension_se` is null or `fractal_dimension_se`='' then 1 else 0 end ) as fractal_dimension_se,\nSUM( case when `radius_worst` is null or `radius_worst`='' then 1 else 0 end ) as radius_worst,\nSUM( case when `texture_worst` is null or `texture_worst`='' then 1 else 0 end ) as texture_worst,\nSUM( case when `perimeter_worst` is null or `perimeter_worst`='' then 1 else 0 end ) as perimeter_worst,\nSUM( case when `area_worst` is null or `area_worst`='' then 1 else 0 end ) as area_worst,\nSUM( case when `smoothness_worst` is null or `smoothness_worst`='' then 1 else 0 end ) as smoothness_worst,\nSUM( case when `compactness_worst` is null or `compactness_worst`='' then 1 else 0 end ) as compactness_worst,\nSUM( case when `concavity_worst` is null or `concavity_worst`='' then 1 else 0 end ) as concavity_worst,\nSUM( case when `concave points_worst` is null or `concave points_worst`='' then 1 else 0 end ) as concave_points_worst,\nSUM( case when `symmetry_worst` is null or `symmetry_worst`='' then 1 else 0 end ) as symmetry_worst,\nSUM( case when `fractal_dimension_worst` is null or `fractal_dimension_worst`='' then 1 else 0 end ) as fractal_dimension_worst,\nSUM( case when `_c32` is null or `_c32`='' then 1 else 0 end ) as _c32\nfrom data as 
data_id;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tired of typing all that? Is there a way to simplify it? Of course there is. This is programmable SQL, after all."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One interesting solution looks like this:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"select \n#set($columns=[\"id\",\"diagnosis\",\"fractal_dimension_worst\"])\n#foreach( $column in $columns )\n SUM( case when `$column` is null or `$column`='' then 1 else 0 end ) as $column,\n#end\n 1 as a from newTable as output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We can use the built-in "},{"type":"codeinline","content":[{"type":"text","text":"#foreach"}]},{"type":"text","text":" loop: first set all the field names with #set, then generate the sum expressions with the foreach loop."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Is that all? Just as the character 茴 can be written in several ways, there is more than one way to do this."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"ruby"},"content":[{"type":"text","text":"set sum_tpl = '''\nSUM( case when `{0}` is null or `{0}`='' then 1 else 0 end ) as {0}\n''';\n\n\nselect ${template.get(\"sum_tpl\",\"diagnosis\")},\n${template.get(\"sum_tpl\",\"radius_mean\")},\n${template.get(\"sum_tpl\",\"texture_mean\")}\nfrom data as 
output;\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We define a template with set and then render it inside the SQL statement with template.get. A complex SQL statement may contain many repeated sum \/ case when expressions, and this approach handles them well. It also means a change in one place takes effect everywhere. Otherwise, if the 1 inside the sum ever had to become 2, you would need to edit dozens of expressions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond these, there are plenty more fun tricks waiting to be discovered. SQL is no longer boring."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Not Quite the End"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As you can see, we have extended native SQL with variables, functions, multiple data sources, third-party modules, dynamic extension of native functions, branching syntax, machine learning, Python script support, plugins, and more. Much as TypeScript enhances JavaScript, you can stick to the most basic SQL syntax, and reach for the advanced features whenever you need them."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Finally"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Is this programmable SQL still just a dream? Of course not. It is right here: https:\/\/mlsql.tech We offer a desktop edition and an online trial, so come and give it a try!"}]}]}
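
The macro-function section of the article describes templates whose placeholders ({0}, {1}, or named ones such as ${path}) are replaced by command-line style arguments at call time. The Python sketch below only illustrates that substitution idea. The helper `render_macro` and the two template constants are hypothetical names invented for this example; this is not how the MLSQL engine is implemented, and the rule that a leading `_` token is skipped is an assumption based on the article's named-parameter call.

```python
import re
import shlex

# Templates copied from the article's loadExcel examples (trailing spaces dropped).
LOAD_EXCEL_POSITIONAL = '''
load excel.`{0}`
where header="true"
as {1}
'''

LOAD_EXCEL_NAMED = '''
load excel.`${path}`
where header="true"
as ${tableName}
'''


def render_macro(template: str, command_line: str) -> str:
    """Expand {N} and ${name} placeholders using command-line style arguments.

    Hypothetical helper for illustration only; it mimics the idea of macro
    expansion and is not the real MLSQL implementation.
    """
    tokens = shlex.split(command_line)
    positional = []
    named = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "_":
            # Assumption: a bare "_" is a filler token and carries no value.
            i += 1
        elif token.startswith("-") and len(token) > 1 and i + 1 < len(tokens):
            # "-name value" pairs become named arguments.
            named[token[1:]] = tokens[i + 1]
            i += 2
        else:
            positional.append(token)
            i += 1

    # Replace {0}, {1}, ... with positional arguments.
    rendered = re.sub(r"\{(\d+)\}", lambda m: positional[int(m.group(1))], template)
    # Replace ${name} with named arguments.
    rendered = re.sub(r"\$\{(\w+)\}", lambda m: named[m.group(1)], rendered)
    return rendered.strip()


if __name__ == "__main__":
    print(render_macro(LOAD_EXCEL_POSITIONAL,
                       "./example-data/excel/hello_world.xlsx helloTable"))
```

Under these assumptions, the positional call and the named call (`_ -path ... -tableName ...`) expand to the same load statement, which is the behavior the article describes for `!loadExcel`.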
