13 February, 2012

How to calculate median in MSSQL

Definition of median:
Median is described as the numerical value separating the higher half from the lower half of a
sample, a population or a probability distribution.

The median value of a list of numbers can be found by arranging all the observations from lowest value to the highest value and picking the middle one.
If there is an uneven number of observations, then median will be the middle value.
If there is an even number of observations, the median is then usually defined to be the mean of the two middle values.

So lets say we have this sample data:
Lot Wafer ResultName Value Comment
8485474 1 someresult 50
8485474 1 someresult 40
8485474 1 someresult 30 <- Median 1 / Median 2     (Median1 + Median 2) / 2 = Median = 30
8485474 1 someresult 20
8485474 1 someresult 10
8485474 2 someresult 60
8485474 2 someresult 50
8485474 2 someresult 40 <- Median 1
8485474 2 someresult 30 <- Median 2     (Median1 + Median 2) / 2 = Median = 35
8485474 2 someresult 20
8485474 2 someresult 10

First we need to sort and number the resultvalues ascending.
Then we need to get the result count for each wafer.
To do this, we can use the row_number() function / count(*) function together with over / partition by.

(select l.lot, 
          w.waferno, 
          dr.resultvalue,
          row_number() over(partition by l.lot, w.waferno order by dr.resultvalue) as rownum,
          count(*) over(partition by l.lot, w.waferno) as resultcount
  from lot l inner join
  wafer w on l.lot = w.lot inner join
  results r on w.waferid = r.waferid inner join 
  resulttype rt on r.resulttypeid = rt.resulttypeid
where l.lot = 8485474 
and rt.resultname = 'someresult') as medianvalues

The above query will give us this resultset:
Lot Wafer ResultName Value RowNum ResultCount
8485474 1 someresult 10 1 5
8485474 1 someresult 20 2 5
8485474 1 someresult 30 3 5
8485474 1 someresult 40 4 5
8485474 1 someresult 50 5 5
8485474 2 someresult 10 1 6
8485474 2 someresult 20 2 6
8485474 2 someresult 30 3 6
8485474 2 someresult 40 4 6
8485474 2 someresult 50 5 6
8485474 2 someresult 60 6 6

Now we can filter out the middle rows with the follwing formula:
(resultcount + 1) / 2 and (resultcount + 2) / 2)
So for wafer 1:
(5 + 1) / 2 = 3
(5 + 2) / 2 = 3  (3.5 will be 3 when converted to int)
And for wafer 2:
(6 + 1) / 2 = 3 (3.5 will be 3 when converted to int)
(6 + 2) / 2 = 4 
So now that we have the two middle rows, we can just do an average of the two rows:
So for wafer 1:
30 + 30 / 2 = 30  (Row 3 + Row 3) / 2
And for wafer 2:
30 + 40 / 2 = 35  (Row 3 + Row 4) / 2

Now let's put it all together:
select lot, 
       waferno, 
       resultcount, 
       avg(resultvalue) as median, 
       (resultcount + 1) / 2 as middlerow1, 
       (resultcount + 2) / 2 as middlerow2
from
  (select l.lot, 
          w.waferno, 
          dr.resultvalue,
          row_number() over(partition by l.lot, w.waferno order by dr.resultvalue) as rownum,
          count(*) over(partition by l.lot, w.waferno) as resultcount
  from lot l inner join
  wafer w on l.lot = w.lot inner join
  results r on w.waferid = r.waferid inner join resulttype rt on r.resulttypeid = rt.resulttypeid
where l.lot = 8485474 
and rt.resultname = 'someresult') as medianvalues
where rownum in((resultcount + 1) / 2, (resultcount + 2) / 2)
group by lot,waferno, resultcount
order by lot,waferno
Or if you prefer the new with statement:
with medianvalues as
(
select l.lot, 
          w.waferno, 
          dr.resultvalue,
          row_number() over(partition by l.lot, w.waferno order by dr.resultvalue) as rownum,
          count(*) over(partition by l.lot, w.waferno) as resultcount
  from lot l inner join
  wafer w on l.lot = w.lot inner join
  results r on w.waferid = r.waferid inner join resulttype rt on r.resulttypeid = rt.resulttypeid
where l.lot = 8485474 
and rt.resultname = 'someresult'
)
select lot, 
       waferno, 
       resultcount, 
       avg(resultvalue) as median, 
       (resultcount + 1) / 2 as middlerow1, 
       (resultcount + 2) / 2 as middlerow2
from medianvalues
where rownum in((resultcount + 1) / 2, (resultcount + 2) / 2)
group by lot,waferno, resultcount
order by lot,waferno

The above query will give us this resultset:
Lot Wafer ResultCount Median MiddleRow1 MiddleRow2
8485474 1 5 30 3 3
8485474 2 6 35 3 4

No comments:

Post a Comment