SQL Server: Selecting Random Rows from a Large Table

Summary: The authors offer new ideas on how to select random rows from a large table. (4 printed pages)

Original article: http://msdn.microsoft.com/en-us/library/cc441928.aspx

If you use Microsoft SQL Server 2000, you likely have run into the following problem: You want to select a random sampling of rows from a large table with lots of rows, but you are unsure of how to do so. Having a random sampling of rows can be useful when you want to make a smaller version of the table or if you want to troubleshoot a problem by seeing what kinds of rows are in the table.

To get a random sampling, you might be tempted to select the top n rows from the table. However, this sample is not random, and the first n rows are not necessarily representative of the whole table. Other solutions exist that involve adding columns to the tables; however, adding columns is not always possible or practical.

The standard way to grab random rows from a small table is to use a query such as the following:

  SELECT TOP 10 PERCENT *
  FROM Table1
  ORDER BY NEWID()

The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random; so, when you sort by that GUID with the ORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.
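
If you want a fixed number of rows rather than a percentage, the same idea works with TOP n. Here is a minimal sketch, using the same placeholder table Table1 as above and an arbitrary count of 100 rows:

  -- Sketch: pick 100 rows at random instead of a percentage.
  SELECT TOP 100 *
  FROM Table1
  ORDER BY NEWID()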

Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables. However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

  1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
  2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:

  SELECT * FROM Table1
  WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10

The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.
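
As a sketch of how the threshold might be parameterized, the desired percentage can be pulled into a variable; the variable name @Percent is illustrative and not part of the original query:

  -- Sketch: the same filter with the sampling percentage in a variable.
  DECLARE @Percent int
  SET @Percent = 10   -- select approximately 10 percent of the rows

  SELECT *
  FROM Table1
  WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < @Percent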

Let's take a closer look at how the ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) portion of this query works. The BINARY_CHECKSUM function generates a checksum value that is based on the values of the columns that you specify. If two rows are different, they typically will generate different checksum numbers. The BINARY_CHECKSUM function generally is used to verify whether any of the columns in a row in a table have changed. However, for our purposes, it generates a number that looks like a random number for each row.

The shortcoming of using the BINARY_CHECKSUM function for our purpose is that, every time that it is used on a row that has not been modified, it returns the same checksum number. Thus, when it is used by itself, subsequent runs of the query return the same "random" set of rows, which obviously is not desirable.
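
To see this for yourself, here is a sketch of the filter with RAND removed; assuming no row in Table1 has been modified, running it repeatedly returns the identical set of rows:

  -- Sketch: without RAND, the checksum depends only on the column values,
  -- so repeated runs against an unchanged table return the same "random" rows.
  SELECT * FROM Table1
  WHERE (ABS(CAST(BINARY_CHECKSUM(*) as int)) % 100) < 10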

To fix this shortcoming, we added the RAND function to the BINARY_CHECKSUM query. The RAND function scrambles the numbers that are returned by the BINARY_CHECKSUM function. Thus, we get a different set of rows each time the query is run, making it truly random. The ABS and CAST functions are used because BINARY_CHECKSUM(*) * RAND returns a float that can be a negative number.

The asterisk (*) in BINARY_CHECKSUM(*) tells the function to use all of the columns in the row in its calculations. Alternatively, you can specify a subset of the columns in place of the asterisk. Because this function is CPU-intensive, specifying the minimum number of columns or minimum number of bytes will give you the best performance. The best candidates would be the columns in a unique index. If you decide to use specific columns instead of all of the columns, you can add NEWID as a column in the BINARY_CHECKSUM function, so that the BINARY_CHECKSUM query will return a random number each time. Thus, you do not need to use RAND in the query, which simplifies it slightly, as shown here:

  SELECT * FROM Table1
  WHERE (ABS(CAST((BINARY_CHECKSUM(keycol1, NEWID())) as int)) % 100) < 10

Because no sorting is involved in the BINARY_CHECKSUM query, only a single pass through the table is required to choose n percent of the rows. Time and I/O both stay linear, in proportion to the size of the table.
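
For reference, measurements like those summarized in Table 1 below can be gathered with SQL Server's statistics switches; this is only a sketch, assuming a test table named Table1:

  -- Sketch: report logical reads and CPU/elapsed time for both approaches.
  SET STATISTICS IO ON
  SET STATISTICS TIME ON

  -- NEWID query: sorts the entire table (the worktable reads come from the sort).
  SELECT TOP 10 PERCENT * FROM Table1 ORDER BY NEWID()

  -- BINARY_CHECKSUM query: a single pass over the table, no worktable.
  SELECT * FROM Table1
  WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10

  SET STATISTICS IO OFF
  SET STATISTICS TIME OFF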

To test the BINARY_CHECKSUM query against the NEWID query, we set up three large tables that contain 1 million rows (435MB), 7 million rows (3GB), and 14 million rows (5.4GB), respectively, on an HP ProLiant DL580 G2 server with 1GB of memory, four 2.2GHz Intel processors, and eight 36GB disks in a RAID 1+0 configuration. Table 1 shows the results. Figure 1 compares graphically how long the queries took. As Figure 1 and Table 1 both show, the BINARY_CHECKSUM query saves a lot of time and I/O compared with the NEWID query.

The SQL Server team at Microsoft realized that not being able to take random samples of rows easily was a common problem in SQL Server 2000; so, the team addressed the problem in SQL Server 2005 by introducing the TABLESAMPLE clause. This clause selects a subset of rows by choosing random data pages and returning all of the rows on those pages. However, for those of us who still have products that run on SQL Server 2000 and need backward-compatibility, or who need truly row-level randomness, the BINARY_CHECKSUM query is a very effective workaround.
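
For completeness, here is a sketch of the TABLESAMPLE clause mentioned above. Because it samples whole data pages rather than individual rows, the number of rows returned is only approximately 10 percent; the optional REPEATABLE clause (with an arbitrary seed) makes the sample repeatable across runs:

  -- Sketch: page-level sampling in SQL Server 2005 and later.
  SELECT *
  FROM Table1 TABLESAMPLE (10 PERCENT) REPEATABLE (42)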

Table 1. Test results

  Table size      | Query                 | Time (sec) | Table1 logical I/O count | Worktable logical I/O count | Total I/O count | CPU time (msec)
  1 million rows  | NEWID query           | 14.3       | 27,076                   | 1,046,172                   | 1,073,248       | 32,142
  1 million rows  | BINARY_CHECKSUM query | 0.7        | 27,076                   | 0                           | 19,807          | 2,781
  7 million rows  | NEWID query           | 134        | 193,790                  | 7,332,291                   | 7,526,081       | 227,250
  7 million rows  | BINARY_CHECKSUM query | 10         | 193,790                  | 0                           | 193,790         | 28,812
  13 million rows | NEWID query           | 253        | 347,420                  | 13,810,132                  | 14,157,552      | 422,891
  13 million rows | BINARY_CHECKSUM query | 21         | 347,420                  | 0                           | 347,420         | 49,203

Figure 1. NEWID vs. BINARY_CHECKSUM queries

About the authors

Marcelo De Barros ([email protected]) is a Senior Test Manager at Microsoft. He received an undergraduate degree in Computer Science in Brazil (1999) and an M.S. in Computer Science from the University of Washington (2006). Marcelo's areas of expertise include performance and scalability testing and capacity planning.

Kenton Gidewall ([email protected]) is a Software Design Engineer/Test at Microsoft. He has a B.S. in Computer Science from Brigham Young University (1989) and an M.S. in Computer Science from Michigan State (1992). Kenton's areas of expertise include high TPS, large-scale system development, and performance testing.

This article was published originally in SQL Server Magazine. Copyright 2007 by Penton Media. Reproduced with permission.

Community Additions

Tried these queries

Here is what I tried to select a random row ...


SELECT TOP 1 * FROM TableName TABLESAMPLE (0.1 PERCENT) ORDER BY NEWID()


This runs much faster than any checksum approach, and it is less likely to pick the same row as the random result each time.



Just by experience.

5/28/2013

reading comprehension and statistics

The previous poster has a problem with reading comprehension and statistics.

First, the article quite clearly says "In this example, we want approximately 10 percent of the rows selected randomly".  Second, the article also quite clearly is targeted towards large tables, since the title of the article is "Selecting Rows Randomly from a Large Table".
This is where the statistics part kicks in. The solution presented, as the poster has demonstrated quite clearly, does not work with small data samples, because the statistical standard deviation is huge. However, as the sample size increases, the standard deviation drops off very quickly, meaning that the result gets closer and closer to the desired percentage. I'd show the math for that, but honestly, that requires work, so either trust me or do the math yourself.

As a super quick example, consider a coin flip: a single flip is either heads or tails (1 or 0), but over a large number of flips the average converges to approximately half heads and half tails (0.5).

The bottom line is, for large tables (approximately > 10,000 rows) this approach works very well and produces a data set that will meet most needs for a sample. Just don't use it on small data sets or expect it to return exactly the percentage you asked for.
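
A quick way to sanity-check this on your own large table (Table1 here is just a placeholder) is to compare how many rows the filter actually returns against the total; on a big table the ratio should land close to the 10 percent target:

  -- Sketch: how close does the filter come to the 10 percent target?
  SELECT COUNT(*) AS SampledRows,
      (SELECT COUNT(*) FROM Table1) AS TotalRows
  FROM Table1
  WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10
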
10/11/2012

The algorithm in this article is incorrect for percentages. Don't use it.

The flaw in the logic is the assumption that the expression ABS(CAST((BINARY_CHECKSUM(ID, NEWID())) as int)) % 100 will produce an even distribution of results. If the distribution were even, then applying % 100 < 10 would always return 10 rows per 100. The resulting distribution is in no way even. Here is the distribution for 20 rows:


  Abs         AbsMod100   RowNumber
  541059703   0           19
  1991435269  4           9
  886205707   5           7
  150167240   7           14
  161972694   10          15
  317579465   13          4
  508825937   25          3
  385506293   31          18
  806118863   35          1
  1572451654  52          6
  1692251164  52          11
  662344368   60          2
  1353054314  68          8
  1624363067  75          16
  894891176   81          12
  905686391   83          13
  1168792690  87          17
  1285066729  91          10
  1656198436  95          20
  7309091     98          5

AbsMod100 < 10 in this case would return 4 rows out of 20, or 20% rather than 10%. Here's a query to reproduce the flaw:


SELECT *
FROM (
    SELECT ABS(CAST((BINARY_CHECKSUM(object_id, NEWID())) as int)) As [Abs],
        ABS(CAST((BINARY_CHECKSUM(object_id, NEWID())) as int)) % 100 As [AbsMod100],
        ROW_NUMBER() OVER (ORDER BY object_id) As RowNumber
    FROM sys.objects
) t
WHERE RowNumber BETWEEN 1 AND 20
ORDER BY [AbsMod100]