13 February, 2012

How to calculate median in MSSQL

Definition of median:
Median is described as the numerical value separating the higher half from the lower half of a
sample, a population or a probability distribution.

The median value of a list of numbers can be found by arranging all the observations from lowest value to the highest value and picking the middle one.
If there is an uneven number of observations, then median will be the middle value.
If there is an even number of observations, the median is then usually defined to be the mean of the two middle values.

So lets say we have this sample data:
Lot Wafer ResultName Value Comment
8485474 1 someresult 50
8485474 1 someresult 40
8485474 1 someresult 30 <- Median 1 / Median 2     (Median1 + Median 2) / 2 = Median = 30
8485474 1 someresult 20
8485474 1 someresult 10
8485474 2 someresult 60
8485474 2 someresult 50
8485474 2 someresult 40 <- Median 1
8485474 2 someresult 30 <- Median 2     (Median1 + Median 2) / 2 = Median = 35
8485474 2 someresult 20
8485474 2 someresult 10

First we need to sort and number the resultvalues ascending.
Then we need to get the result count for each wafer.
To do this, we can use the row_number() function / count(*) function together with over / partition by.

(select l.lot, 
          w.waferno, 
          dr.resultvalue,
          row_number() over(partition by l.lot, w.waferno order by dr.resultvalue) as rownum,
          count(*) over(partition by l.lot, w.waferno) as resultcount
  from lot l inner join
  wafer w on l.lot = w.lot inner join
  results r on w.waferid = r.waferid inner join 
  resulttype rt on r.resulttypeid = rt.resulttypeid
where l.lot = 8485474 
and rt.resultname = 'someresult') as medianvalues

The above query will give us this resultset:
Lot Wafer ResultName Value RowNum ResultCount
8485474 1 someresult 10 1 5
8485474 1 someresult 20 2 5
8485474 1 someresult 30 3 5
8485474 1 someresult 40 4 5
8485474 1 someresult 50 5 5
8485474 2 someresult 10 1 6
8485474 2 someresult 20 2 6
8485474 2 someresult 30 3 6
8485474 2 someresult 40 4 6
8485474 2 someresult 50 5 6
8485474 2 someresult 60 6 6

Now we can filter out the middle rows with the follwing formula:
(resultcount + 1) / 2 and (resultcount + 2) / 2)
So for wafer 1:
(5 + 1) / 2 = 3
(5 + 2) / 2 = 3  (3.5 will be 3 when converted to int)
And for wafer 2:
(6 + 1) / 2 = 3 (3.5 will be 3 when converted to int)
(6 + 2) / 2 = 4 
So now that we have the two middle rows, we can just do an average of the two rows:
So for wafer 1:
30 + 30 / 2 = 30  (Row 3 + Row 3) / 2
And for wafer 2:
30 + 40 / 2 = 35  (Row 3 + Row 4) / 2

Now let's put it all together:
select lot, 
       waferno, 
       resultcount, 
       avg(resultvalue) as median, 
       (resultcount + 1) / 2 as middlerow1, 
       (resultcount + 2) / 2 as middlerow2
from
  (select l.lot, 
          w.waferno, 
          dr.resultvalue,
          row_number() over(partition by l.lot, w.waferno order by dr.resultvalue) as rownum,
          count(*) over(partition by l.lot, w.waferno) as resultcount
  from lot l inner join
  wafer w on l.lot = w.lot inner join
  results r on w.waferid = r.waferid inner join resulttype rt on r.resulttypeid = rt.resulttypeid
where l.lot = 8485474 
and rt.resultname = 'someresult') as medianvalues
where rownum in((resultcount + 1) / 2, (resultcount + 2) / 2)
group by lot,waferno, resultcount
order by lot,waferno
Or if you prefer the new with statement:
with medianvalues as
(
select l.lot, 
          w.waferno, 
          dr.resultvalue,
          row_number() over(partition by l.lot, w.waferno order by dr.resultvalue) as rownum,
          count(*) over(partition by l.lot, w.waferno) as resultcount
  from lot l inner join
  wafer w on l.lot = w.lot inner join
  results r on w.waferid = r.waferid inner join resulttype rt on r.resulttypeid = rt.resulttypeid
where l.lot = 8485474 
and rt.resultname = 'someresult'
)
select lot, 
       waferno, 
       resultcount, 
       avg(resultvalue) as median, 
       (resultcount + 1) / 2 as middlerow1, 
       (resultcount + 2) / 2 as middlerow2
from medianvalues
where rownum in((resultcount + 1) / 2, (resultcount + 2) / 2)
group by lot,waferno, resultcount
order by lot,waferno

The above query will give us this resultset:
Lot Wafer ResultCount Median MiddleRow1 MiddleRow2
8485474 1 5 30 3 3
8485474 2 6 35 3 4

09 February, 2012

Formatted TimeDiff UDF for MSSQL

I just created a UDF function for displaying a formatted time diff value in MSSQL.
Why ? I googled this and found a lot of solutions that did not work :(
So i just had to look into it.

I wanted the time difference to be displayed in the following format:
HHH:MM:SS (@type = 0)
Where HHH = Total number of hours,  so if the difference is one week,
then this value will be 168. (168:00:00).

I also created another variant:
D MM:HH:SS (@type = 1)
Where D = Total number of days, so if the difference is one week,
then this value will be 7. (7 00:00:00).

create function [dbo].[TimeDiff](@date1 datetime, @date2 datetime, @type smallint)
returns varchar(50)
as
begin
declare @days int
declare @totalhours int
declare @hours int
declare @minutes int
declare @seconds int
declare @ret varchar(50)
set @seconds = datediff(ss, @date1,@date2) 
set @totalhours = @seconds / 3600 
set @days = @totalhours / 24 
set @hours = @totalhours - (@days * 24)
set @minutes = ((@seconds - (3600 * @totalhours)) / 60)
set @seconds = ((@seconds - (3600 * @totalhours) - (60 * @minutes)))
if(@type = 0)
begin
    --HHH:MM:SS 
    set @ret = case when @totalhours < 10 then '0' + convert(varchar(20), @totalhours) else convert(varchar(20), @totalhours) end + ':' + right('0' + convert(varchar(20), @minutes), 2) + ':' + right('0' + convert(varchar(20), @seconds), 2)
end
else
begin
    --D HH:MM:SS
    set @ret = convert(varchar(20), @days) + ' ' + right('0' + convert(varchar(20), @hours), 2) + ':' + right('0' + convert(varchar(20), @minutes), 2) + ':' + right('0' + convert(varchar(20), @seconds), 2)
end
return @ret
end

07 February, 2012

Custom multiply aggregation function in AS2008

In a recent project of mine i needed a custom aggregation function that needed to multiply som yield numbers of a specific measure in a given dimension.

My yield numbers are actually stored as a number between 0 and 100. And not between 0 and 1 which i guess are more common when expressed as percentage.
The reason for this is some limitations in our 3rd party reporting charts.
Any way this is how i made the custom aggregation function that will work with both variants:

Create a c# class library.
Add a reference to this dll (ADOMD.Net interfaces):
C:\Program Files\Microsoft Analysis Services\AS OLEDB\10\msmgdsrv.dll
Add a class named ASUtils with the following content:

namespace ASUtils
{
    using System;
    using Microsoft.AnalysisServices.AdomdServer;

    public class Multiplication
    {
        [SafeToPrepare(true)]
        public static Nullable<double> Multiply(Set inputSet, Expression inputExpression, double divider, bool excludeZeroValues)
        {            
            Nullable<double> num = null;
            bool first = true;
            foreach (Tuple tuple in inputSet.Tuples)
            {
                MDXValue mdxValue = inputExpression.Calculate(tuple);
                bool isEmpty = (mdxValue.ToChar() == 0x00);
                if (!isEmpty)
                {
                    double value = mdxValue.ToDouble();
                    if (value != 0 || (value == 0 && excludeZeroValues == false))
                    {
                        if (first == true)
                        {
                            num = value / divider;
                            first = false;
                        }
                        else
                        {
                            num *= value / divider;
                        }
                    }
                }
            }

            return num * divider;
        }
    }
}
I did not find any built in functions that could check if a value was empty or not, but this seems to work:
bool isEmpty = (mdxValue.ToChar() == 0x00);
Any other solutions out there ?

Build the project, then in your AS2008 cube project, add a reference to the new class library.
In your cube, under calculations. Add a new calculated member named YieldM with this code:
ASUtils.Multiply([Process].[ProcessNoGroup].Children,[Measures].[Yield], 100.0, false)
[Process] = Dimension
[ProcessNoGroup] = Hierarchy
[Measures].[Yield] = Measure containg numbers between 0 and 100.

If the measure contains numbers beteen 0 and 1, then you can change the multiplyer to 1 like:
ASUtils.Multiply([Process].[ProcessNoGroup].Children,[Measures].[Yield], 1.0, false)
Empty values are always eliminated from the multiply.
You can also exclude 0 (zero) values by changing the last parameter to true.

So if we now look at the yield measure over time and process, we have this test data:
Time Dimension
Process Dimension
Yield
W01
Process1
94.88
W01
Process2
96.42
W01
Process3
95.03
W02
Process1
98.86
W02
Process2
96.42
W02
Process3
94.03

When we aggregate this by week and look at our calculated YieldM measure:
Time Dimension
YieldM
Formula
W01
86.93
(94.88*96.42*95.03)
W01
89.63
(98.86*96.42*94.03)

And voila, our custom aggregation is working as it should.