SQL Server Columnstore Index Performance Summary (8): Dictionaries in Columnstore

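Continuing from the previous post, SQL Server Columnstore Index Performance Summary (7): Loading Data into the Delta Store of a Columnstore Index. The dictionary has come up several times already, so this post gives a quick introduction to what it actually is, to help you understand columnstore better.
This part does not go very deep, however, because the feature applies only to SQL Server and not to SQL DB or SQL DW (now called Azure Synapse Analytics).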

Environment

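  This post continues to use ContosoRetailDW for the demo. First, though, we use the script below to find which tables are good candidates for a clustered columnstore index (more than 1 million rows):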

-- Return the tables suggested for a clustered columnstore index in a data warehouse environment
SELECT object_schema_name(t.object_id) AS 'Schema'
	,object_name(t.object_id) AS 'Table'
	,sum(p.rows) AS 'Row Count'
	,(
		SELECT count(*)
		FROM sys.columns AS col
		WHERE t.object_id = col.object_id
		) AS 'Cols Count'
	,(
		-- No join to sys.types here: several types can share one system_type_id,
		-- which would duplicate rows and inflate the sum.
		SELECT sum(col.max_length)
		FROM sys.columns AS col
		WHERE t.object_id = col.object_id
		) AS 'Cols Max Length'
	,(
		SELECT count(*)
		FROM sys.columns AS col
		JOIN sys.types AS tp ON col.system_type_id = tp.system_type_id
		WHERE t.object_id = col.object_id
			AND (
				UPPER(tp.name) IN (
					'TEXT'
					,'NTEXT'
					,'TIMESTAMP'
					,'HIERARCHYID'
					,'SQL_VARIANT'
					,'XML'
					,'GEOGRAPHY'
					,'GEOMETRY'
					)
				OR (
					UPPER(tp.name) IN (
						'VARCHAR'
						,'NVARCHAR'
						)
					AND (
						col.max_length = 8000
						OR col.max_length = - 1
						)
					)
				)
		) AS 'Unsupported Columns'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type = 'PK'
			AND parent_object_id = t.object_id
		) AS 'Primary Key'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type = 'F'
			AND parent_object_id = t.object_id
		) AS 'Foreign Keys'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type IN (
				'UQ'
				,'D'
				,'C'
				)
			AND parent_object_id = t.object_id
		) AS 'Constraints'
	,(
		SELECT count(*)
		FROM sys.objects
		WHERE type IN (
				'TA'
				,'TR'
				)
			AND parent_object_id = t.object_id
		) AS 'Triggers'
	,t.is_tracked_by_cdc AS 'CDC'
	,t.is_memory_optimized AS 'Hekaton'
	,t.is_replicated AS 'Replication'
	,CASE WHEN t.filestream_data_space_id IS NULL THEN 0 ELSE 1 END AS 'FileStream' -- 1 when the table has a FILESTREAM data space
	,t.is_filetable AS 'FileTable'
FROM sys.tables t
INNER JOIN sys.partitions AS p ON t.object_id = p.object_id
WHERE p.data_compression IN (
		0
		,1
		,2
		) -- None, Row, Page
	AND p.index_id IN (
		0
		,1
		)
	AND (
		SELECT count(*)
		FROM sys.indexes ind
		WHERE t.object_id = ind.object_id
			AND ind.type IN (
				5
				,6
				)
		) = 0
GROUP BY t.object_id
	,t.is_tracked_by_cdc
	,t.is_memory_optimized
	,t.is_filetable
	,t.is_replicated
	,t.filestream_data_space_id
HAVING sum(p.rows) > 1000000
ORDER BY sum(p.rows) DESC

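  A reminder: this script is neither universal nor an absolute standard, so treat it as a reference only. The focus here is on large tables. In the result, 5 tables meet the condition: FactOnlineSales, FactInventory, FactSalesQuota, FactSales, and FactStrategyPlan.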

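  Because these tables have quite a few primary keys and foreign keys, we clean those up first to reduce their impact. If you are worried about affecting later work, back up the database before you proceed: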

-- Drop the primary keys:
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [PK_FactOnlineSales_SalesKey]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [PK_FactStrategyPlan_StrategyPlanKey]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [PK_FactSales_SalesKey]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [PK_FactInventory_InventoryKey]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [PK_FactSalesQuota_SalesQuotaKey]

-- Drop the foreign keys:
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCurrency]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCustomer]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimDate]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimProduct]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimPromotion]
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimStore]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimAccount]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimCurrency]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimDate]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimEntity]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimProductCategory]
ALTER TABLE dbo.[FactStrategyPlan] DROP CONSTRAINT [FK_FactStrategyPlan_DimScenario]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimChannel]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimCurrency]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimDate]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimProduct]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimPromotion]
ALTER TABLE dbo.[FactSales] DROP CONSTRAINT [FK_FactSales_DimStore]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimCurrency]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimDate]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimProduct]
ALTER TABLE dbo.[FactInventory] DROP CONSTRAINT [FK_FactInventory_DimStore]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimChannel]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimCurrency]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimDate]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimProduct]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimScenario]
ALTER TABLE dbo.[FactSalesQuota] DROP CONSTRAINT [FK_FactSalesQuota_DimStore]
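
Constraint names can differ between copies of ContosoRetailDW, so it may be worth listing what actually exists on the five tables before (or after) running the drops. The query below is only a sketch; it reuses the same `sys.objects` type codes ('PK' and 'F') as the candidate script earlier in this post.

```sql
-- List the primary keys and foreign keys defined on the five fact tables.
-- An empty result after the cleanup means every constraint was dropped.
SELECT object_name(parent_object_id) AS 'Table'
	,name AS 'Constraint'
	,type_desc AS 'Type'
FROM sys.objects
WHERE type IN ('PK', 'F')
	AND parent_object_id IN (
		object_id('dbo.FactOnlineSales')
		,object_id('dbo.FactStrategyPlan')
		,object_id('dbo.FactSales')
		,object_id('dbo.FactInventory')
		,object_id('dbo.FactSalesQuota')
		)
ORDER BY object_name(parent_object_id), type_desc, name;
```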

  Next we create a clustered columnstore index (CCI from here on) on each table:

Create Clustered Columnstore Index CCI 	on dbo.FactOnlineSales 	WITH (DATA_COMPRESSION = COLUMNSTORE);
Create Clustered Columnstore Index CCI	on dbo.FactStrategyPlan WITH (DATA_COMPRESSION = COLUMNSTORE);
Create Clustered Columnstore Index CCI	on dbo.FactSales WITH (DATA_COMPRESSION = COLUMNSTORE);
Create Clustered Columnstore Index CCI	on dbo.FactInventory WITH (DATA_COMPRESSION = COLUMNSTORE);
Create Clustered Columnstore Index CCI	on dbo.FactSalesQuota 	WITH (DATA_COMPRESSION = COLUMNSTORE);

  We use the script below to query the dictionary information:

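-- Summarize the dictionaries per table and dictionary_id.
-- object_id comes from sys.partitions; the dictionary view is joined on hobt_id.
SELECT object_name(part.object_id) AS 'Table'
	,dict.dictionary_id
	,count(*) AS 'Number of Dictionaries'
	,sum(dict.entry_count) AS 'Entry Count'
	,min(dict.on_disk_size) AS 'Min Size'
	,max(dict.on_disk_size) AS 'Max Size'
	,avg(dict.on_disk_size) AS 'Avg Size'
FROM sys.column_store_dictionaries AS dict
JOIN sys.partitions AS part ON dict.hobt_id = part.hobt_id
GROUP BY part.object_id
	,dict.dictionary_id
ORDER BY object_name(part.object_id)
	,dict.dictionary_id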

  From the output of this query you can roughly see that as dictionary_id increases, the number of dictionaries decreases.

Introduction and Summary

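  In a columnstore, some columns, character types for example, need an additional dictionary that encodes their values into a form the columnstore can store. Dictionaries come in two kinds, global and local, and are associated with column segments; a global dictionary is shared across all segments of a column, while a local dictionary belongs to individual segments.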

Dictionaries are used to encode some data types, not all of them, so not every column in a columnstore has a dictionary.
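
To see which columns actually received a dictionary, you can compare the table's columns with the rows in `sys.column_store_dictionaries`. The following is a minimal sketch for dbo.FactOnlineSales. It assumes that, for a clustered columnstore index, the `column_id` in the dictionary view lines up with `sys.columns.column_id`; that usually holds for a table whose columns have not been dropped or reordered, but treat it as an assumption.

```sql
-- For each column of dbo.FactOnlineSales, count the dictionaries built for it.
-- A count of 0 means the column is stored without any dictionary.
SELECT col.column_id
	,col.name AS 'Column'
	,tp.name AS 'Data Type'
	,count(dict.dictionary_id) AS 'Number of Dictionaries'
FROM sys.partitions AS part
JOIN sys.columns AS col ON col.object_id = part.object_id
JOIN sys.types AS tp ON tp.user_type_id = col.user_type_id
LEFT JOIN sys.column_store_dictionaries AS dict ON dict.hobt_id = part.hobt_id
	AND dict.column_id = col.column_id -- assumed mapping, see the note above
WHERE part.object_id = object_id('dbo.FactOnlineSales')
	AND part.index_id = 1 -- the clustered columnstore index
GROUP BY col.column_id
	,col.name
	,tp.name
ORDER BY col.column_id;
```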

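  Also note the entry count: it does not simply rise or fall but first rises and then falls. You can also see that the sizes and the number of dictionaries differ greatly.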
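  In practice, dictionaries are used mainly for character types. A dictionary stores the list of unique values in a column, mapping each data value to a dictionary value, and the columnstore then stores the dictionary value instead of the original data value. Take a color column: red, yellow, blue, and green are stored as 1/2/3/4 in the dictionary and in the columnstore, which reduces storage space further (this example is only an illustration; colors are not really stored by this exact rule).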
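
The color example can be mimicked with two table variables. This is only an illustration of the idea of dictionary encoding, not the engine's real storage format; the names `@Source`, `@Dictionary`, and `DictValue` are made up for the sketch.

```sql
-- Original column values, with repeats.
DECLARE @Source TABLE (RowId int IDENTITY(1, 1), Color varchar(10));
INSERT INTO @Source (Color)
VALUES ('Red'), ('Yellow'), ('Blue'), ('Green'), ('Red'), ('Blue'), ('Red');

-- The "dictionary": one row per distinct value, numbered by first appearance.
DECLARE @Dictionary TABLE (DictValue int, Color varchar(10));
INSERT INTO @Dictionary (DictValue, Color)
SELECT row_number() OVER (ORDER BY min(RowId)), Color
FROM @Source
GROUP BY Color;

-- The "encoded column": small integers are stored instead of the strings.
SELECT s.RowId, d.DictValue
FROM @Source AS s
JOIN @Dictionary AS d ON d.Color = s.Color
ORDER BY s.RowId;

-- Reading the data back means looking each integer up in the dictionary again.
SELECT d.DictValue, d.Color
FROM @Dictionary AS d
ORDER BY d.DictValue;
```

With seven source rows but only four distinct colors, the dictionary holds four strings and the encoded column holds seven integers, which is where the space saving comes from.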
  In most cases a global dictionary is used for the relevant columns, but some segments use a local dictionary instead.
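
One way to check how global and local dictionaries are distributed is to look at `sys.column_store_segments`, which records a `primary_dictionary_id` (global) and a `secondary_dictionary_id` (local) for each segment, with -1 meaning no dictionary of that kind. The query below is a sketch built on that reading of the two columns; verify it against the documentation for your SQL Server version.

```sql
-- Per table and column: how many segments reference a global and/or a local dictionary.
SELECT object_name(part.object_id) AS 'Table'
	,seg.column_id
	,count(*) AS 'Segments'
	,sum(CASE WHEN seg.primary_dictionary_id <> -1 THEN 1 ELSE 0 END) AS 'With Global Dictionary'
	,sum(CASE WHEN seg.secondary_dictionary_id <> -1 THEN 1 ELSE 0 END) AS 'With Local Dictionary'
FROM sys.column_store_segments AS seg
JOIN sys.partitions AS part ON seg.hobt_id = part.hobt_id
GROUP BY part.object_id
	,seg.column_id
ORDER BY object_name(part.object_id)
	,seg.column_id;
```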
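  At this point it is easy to see the trade-off: storage shrinks, but you have introduced a kind of "lookup or configuration table", so every operation takes one extra step. It is undeniable that this approach greatly reduces storage space; whether the query gain outweighs the lookup overhead has to be analyzed case by case. In general, the only thing we can usually do is rebuild or reorganize the index.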
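
For reference, rebuilding or reorganizing the CCI created earlier in this post looks like the sketch below, using dbo.FactOnlineSales as the example. Whether either operation changes the dictionaries depends on the data, so run the dictionary query before and after to compare.

```sql
-- Rebuild: recompresses all row groups of the index from scratch.
ALTER INDEX CCI ON dbo.FactOnlineSales REBUILD;

-- Reorganize: a lighter-weight maintenance operation on the same index.
ALTER INDEX CCI ON dbo.FactOnlineSales REORGANIZE;
```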

  Next: SQL Server Columnstore Index Performance Summary (9):
