More with SQL Server 2005 : Top n Per Group, Paging, and Common Table Expressions

http://weblogs.sqlteam.com/jeffs/archive/2007/03/30/More-SQL-Server-2005-Solutions.aspx#38351

Jeff Smith

I previously wrote about a few of the new features in SQL 2005 and how they can be used to solve some old "classic" SQL problems very easily, and I thought I'd briefly discuss a few more.  None of this is earth-shattering stuff, but you may find seeing a bunch of these techniques listed in one place useful.

All examples use the excellent Major League Baseball database that you can download for free here. Download the Access version, and then use the Upsizing Wizard to export the data into SQL 2005. It's a great resource and lots of fun for baseball fans; overall the design is fairly good but not perfect (yearID ?). If you enjoy baseball and know that you should be practicing your SQL a little more, this database makes practicing fun and interesting. Lots of my upcoming posts will use this database, so if you want to "play along," go ahead and download it and set it up. Let me know if you have any problems getting the data into SQL Server and, if so, I will write up some instructions. If you don't have Access, there is a CSV text file version available as well, though that will take a bit more work to get into SQL Server.

Returning Top N Rows Per Group

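The following returns the top 10 home run hitters for each year after 1990. The key is to calculate the "Home Run Rank" of each player for each year. Because RANK() gives tied players the same rank, a year can return more than 10 rows when players are tied at the cutoff.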

select
  HRRanks.*
from
(
    -- total home runs per player per year, ranked within each year
    select
      b.yearID, b.playerID, sum(b.HR) as TotalHR,
      rank() over (partition by b.yearID order by sum(b.HR) desc) as HR_Rank
    from
      Batting b
    where
      b.yearID > 1990
    group by
      b.yearID, b.playerID
)
  HRRanks
where
  HRRanks.HR_Rank <= 10


Notice the use of the derived table: window functions such as RANK() are allowed only in the SELECT and ORDER BY clauses, so we cannot reference the expression directly in the WHERE clause. To return the player's name, simply join this to the Master table in the database.

The basic idea is this: you PARTITION BY the grouping for which you want to return the top n rows, and you ORDER BY the columns that determine the ranking within each group. So, if you wanted to return the top 10 salesmen per region in terms of total sales, you would calculate RANK() OVER (PARTITION BY Region ORDER BY TotalSales DESC) for each row.
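As a rough sketch of the salesmen-per-region idea, the batch below ranks a handful of made-up sales rows. The table variable @Sales and its columns (Region, Salesman, Amount) are hypothetical names invented for illustration and are not part of the baseball database; it returns the top 2 per region rather than the top 10 to keep the sample data small.

```sql
-- Hypothetical sample data; @Sales, Region, Salesman and Amount are illustrative names.
declare @Sales table (Region varchar(20), Salesman varchar(20), Amount money)

insert into @Sales (Region, Salesman, Amount)
select 'East', 'Ann', 500 union all
select 'East', 'Ann', 300 union all
select 'East', 'Bob', 700 union all
select 'East', 'Cal', 200 union all
select 'West', 'Dee', 400 union all
select 'West', 'Eve', 900 union all
select 'West', 'Fay', 100

-- Top 2 salesmen per region by total sales
select
  SalesRanks.*
from
(
  select
    s.Region, s.Salesman, sum(s.Amount) as TotalSales,
    rank() over (partition by s.Region order by sum(s.Amount) desc) as SalesRank
  from
    @Sales s
  group by
    s.Region, s.Salesman
)
  SalesRanks
where
  SalesRanks.SalesRank <= 2
order by
  SalesRanks.Region, SalesRanks.SalesRank
```

With this sample data, East returns Ann (800) and Bob (700), and West returns Eve (900) and Dee (400). The structure is the same as the home run query: aggregate per group member, rank within the partition, then filter on the rank from outside the derived table.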

Paging Results with SQL 2005 using ROW_NUMBER()

Paging data is so much easier in SQL 2005.  All of my old techniques are no longer needed, which is great because they were hard to implement.

A new function called ROW_NUMBER() works in much the same way as RANK(), except that it numbers every row sequentially, even when rows tie. Of course, we must establish a clear, unique ORDER BY; otherwise the results will not be deterministic and you will not always get the same rows returned for each page.
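To make the difference concrete, here is a small sketch that applies both functions to the same rows. The table variable @Scores and its data are made up for illustration.

```sql
-- Hypothetical sample data: two players tied at 40 home runs.
declare @Scores table (Player varchar(10), HR int)

insert into @Scores (Player, HR)
select 'A', 40 union all
select 'B', 40 union all
select 'C', 35

select
  s.Player, s.HR,
  rank()       over (order by s.HR desc)           as HR_Rank,  -- ties share a rank: 1, 1, 3
  row_number() over (order by s.HR desc, s.Player) as RowNum    -- always sequential: 1, 2, 3
from
  @Scores s
order by
  RowNum
```

Adding Player to the ROW_NUMBER() ordering acts as the tie-breaker here; without it, either of the tied rows could legitimately be numbered 1 on any given run.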

In this example, we'll page the "Master Table" of players, sorted by last name and first name, using lahmanID (the primary key of the table) as the "tie-breaker". We'll set a couple of variables, which could be turned into parameters in a stored proc, to indicate the starting and ending rows to return:

declare @startRow int
declare @endRow int

-- first and last row numbers of the page to return (inclusive)
set @startRow = 40
set @endRow = 70

select
  MasterRowNums.*
from
(
  select
    m.nameLast, m.nameFirst, m.lahmanID,
    ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
  from
    [master] m
)
  MasterRowNums
where
  RowNum between @startRow and @endRow
order by
  nameLast, nameFirst, lahmanID


Notice that we still cannot reference the function directly in the WHERE clause and are using a derived table again, and also that we must specify the ordering twice: once in the ROW_NUMBER() function and once in the ORDER BY clause.

Using this basic technique, it is easy to see how to write a stored procedure that will let you page rows in a table or SELECT, and it is much more efficient than any pre-SQL 2005 method available. I recommend experimenting with this and doing your paging server-side instead of client-side, since client-side techniques require that the database still process and return all of the rows even if they are not being displayed to the user.
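One way such a stored procedure might look is sketched below. The procedure name dbo.GetMasterPage and the parameters @PageNumber and @PageSize are hypothetical; the table and columns are the ones used in the paging example above, so it assumes the baseball database is the current database.

```sql
-- Hypothetical procedure name and parameters; table and columns match the paging example.
create procedure dbo.GetMasterPage
  @PageNumber int,   -- 1-based page number
  @PageSize int      -- rows per page
as
begin
  set nocount on

  declare @startRow int
  declare @endRow int

  -- translate the page number into an inclusive row range
  set @startRow = (@PageNumber - 1) * @PageSize + 1
  set @endRow = @PageNumber * @PageSize

  select
    MasterRowNums.nameLast, MasterRowNums.nameFirst, MasterRowNums.lahmanID
  from
  (
    select
      m.nameLast, m.nameFirst, m.lahmanID,
      row_number() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
    from
      [master] m
  )
    MasterRowNums
  where
    MasterRowNums.RowNum between @startRow and @endRow
  order by
    MasterRowNums.RowNum
end
```

Once created, running exec dbo.GetMasterPage @PageNumber = 3, @PageSize = 25 in a separate batch would return rows 51 through 75. Sorting the outer query by RowNum gives the same order as repeating the three sort columns, since RowNum was generated from them.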

Common Table Expressions - Easier Derived Tables

In both of the previous examples, we used derived tables, which added an extra layer of complexity to our SQL statement. New in SQL 2005 are Common Table Expressions (commonly called CTEs), which are a much nicer way to work with derived tables and also much more powerful.

I often instruct people to think one step at a time when writing SQL statements, building each piece as a separate SELECT and then putting them all together at the end.  CTEs make this very easy and very intuitive.  Basically, instead of writing the derived table "in-line", you can "declare" it first, at the beginning of your SELECT.  The previous example would look like this using a CTE:

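-- assumes @startRow and @endRow are declared and set as in the previous example;
-- within a batch, the statement before WITH must end with a semicolon
with MasterRowNums as
(
  select m.nameLast, m.nameFirst, m.lahmanID,
        ROW_NUMBER() over (order by m.nameLast, m.nameFirst, m.lahmanID) as RowNum
  from [master] m
)
select
  MasterRowNums.*
from
  MasterRowNums
where
  RowNum between @startRow and @endRow
order by
  nameLast, nameFirst, lahmanID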


I feel that this convention is much easier to read and work with, and certainly easier to test, since you can focus on one part of the SELECT at a time without getting lost in mazes of indentation. This is useful when you have lots of derived tables, and even more useful when you need to reference the same derived table more than once: with standard derived tables, you would have to repeat the entire SQL (or create a view or temp table), but here we can declare it just once and reference it as many times as we need.
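As a small sketch of referencing one CTE more than once, the batch below compares each year's home run total with the previous year's by joining the CTE to itself. The table variable @Batting and its rows are made-up stand-ins for the real Batting table, and YearlyHR, curr and prev are names chosen for illustration.

```sql
-- Hypothetical sample data standing in for the Batting table.
declare @Batting table (yearID int, playerID varchar(10), HR int)

insert into @Batting (yearID, playerID, HR)
select 1998, 'p1', 70 union all
select 1998, 'p2', 66 union all
select 1999, 'p1', 65 union all
select 1999, 'p2', 63 union all
select 2000, 'p2', 50;   -- semicolon required before the WITH that follows

with YearlyHR as
(
  select b.yearID, sum(b.HR) as TotalHR
  from @Batting b
  group by b.yearID
)
select
  curr.yearID, curr.TotalHR,
  prev.TotalHR as PrevYearHR,
  curr.TotalHR - prev.TotalHR as HRChange
from
  YearlyHR curr                -- first reference to the CTE
  left outer join YearlyHR prev   -- second reference to the same CTE
    on prev.yearID = curr.yearID - 1
order by
  curr.yearID
```

With this sample data the totals are 136 for 1998, 128 for 1999 and 50 for 2000, so HRChange is NULL, -8 and -78 respectively. Written with derived tables, the grouping query would have to appear twice.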

Here's the Top Rows Per Group solution given above, using a CTE:

with HRRanks as
(
    select b.yearID, b.playerID, sum(b.HR) as TotalHR,
        rank() over (partition by b.yearID order by sum(b.HR) desc) as HR_Rank
    from Batting b
    where b.yearID > 1990
    group by b.yearID, b.playerID
)
select
   HRRanks.*
from
   HRRanks
where
  HRRanks.HR_Rank <= 10



More to Come!

Stay tuned, there's lots more to come and we'll be using that baseball database quite a bit.  We'll mix learning new features in SQL Server 2005 with classic SQL techniques, and along the way we'll do some fun baseball analysis as well.
