InsideOut

Bring your inside out...

My Photo
Name:
Location: Pune, Bangalore, India

I like doing things in new ways... music... touring... swimming... teaching...mathematics...

Tuesday, January 11, 2005

The subset-rule is just a rule-of-thumb!

What I discuss here is actually related to my previous blog. I succeeded in drawing attentions of many. I got very good comments from people including the ones I'd never met. The first one was about a typo from anonymous.

The second one is really dear to me, as I did not expect anybody to post a blood-red-faced comment! Mr. Mahadevan, I see your face still blood-red as you are going through this blog. I, no doubt, have the capability to stir up emotions in the readers' minds.

The third one was what I expected from somebody. But, that does not mean, it's not dear to me. I believe there are people really interested in languages and formalisms and they can't simply keep quite; justifying my expectation. In fact, I had a chat with Manivannan. He asked me those questions. I eventually managed him post those as comments. Thanks Manivannan.

Let me quickly go through the formalisms.

Work-space:
1. We have one table T1, with columns { C1, C2, C3, C4 }. This's the schema, in fact.
2. As far as the data is concerned, let's have a data-instance as

{
(1 1 0 0),
(1 1 0 1),
(0 0 1 1),
(0 0 1 0),
(0 0 0 1),
(1 0 1 1)
}

where each row is enclosed in the parentheses (- - - -). Each row has four entities or values, correponding to the four columns (*** Let, for simplicity, their values be 0 or 1. ***)
Assumptions:
I can't think of any, right now!

Definitions:
1. Partition (grouping of rows) -- If R is an equivalence relation on a set S, then R divides S into disjoint subsets {S1, S2, ..., Sn} where 1 <= n <= |S|. This comes from the basic discrete-algebra. For our case, S is the table T1 and an element of the set T1 is of form (C1, C2, c3, C4) as we have four columns. SQL programmers like to call each element as a row.

The R is the equality of two tuples. Two tuples (e1, e2, ..., ek)
and (p1, p2, ..., pk) are said to be equal iff the following boolean expression evaluated to be true:
(e1 == p1) ^ (e2 == p2) ^ ...^ (ek == pk). The == is the standard C/Java boolean operator and the operator ^ is the boolean AND.

Are you thinking what I am thinking? Let me cross-check if your brain is in synch with mine, through the following example:

If we partition the above data-instance of T1, defined in the work-space, using R acting on the tuple (C1, C2, C3, C4) then we get each row/element fall into different partitions; the number of partitions being equal to the number of rows in T1.

If the R acts on the tuple (C1, C2), we get three partitions, namely,
{ (1 1 0 0), (1 1 0 1) }
{ (0 0 1 1),
(0 0 1 0), (0 0 0 1) } and
{ (1 0 1 1) }
Yes, we applied the equality on the first two columns for each row!

So, let me define R(Ci, Cj) as partitioning the table T1 based on equality of the tuple (Ci, Cj), that is, based on the columns Ci and Cj of T1.

--------------------------------------------------------------------------

2. FR ('from') clause --
This caluse takes a comma-list of tables, does their cross product (join in database parlance) and produces final big table. This caluse operates first of all other clauses. This justifies considering only one table T1 in the work-space.

3. GB ('group by') clause -- This clause partitions the big table resulting from the FR clause. So, GROUP BY C1, C3 is effectively R(C1, C3) over the table.

4. Scalar function -- The one which operated on the column(s) of the same row, to give out one value. These include arithmetic operators. These are like multi-variate algebraic functions of form f(x, y, z); x, y, z being the entities of the same row.

5. Aggregate functions -- The one that operates on a column of all the rows, to give out one value. E.g. SUM, AVG, MIN, MAX.

SELECT SUM(C3) , AVG(C1)
FROM T1

Results in
--------
3 | 0.5
--------

6. SL ('select') clause -- This clause projects what you want to see in the output, from the result after GB (if present); precisely the tuple structure of the relust. So, SL is not a row-selector.

SL accepts a comma-list of entities. Each entity can be a simple column or a scalar/aggregate function applied over one or more columns.

A typical example would be
SELECT f1(C1, C2), f2(C1, C4) ...
where f1 and f2 are scalar functions. But it is important to note that f1
(C1, C2) operates on single row at a time.

SELECT C1, C2 + C3
FROM T1
returns
1 1
1 1
0 1
0 1
0 0
1 1

But what happens when GB is present? The SL must apply on each partition! But what? A partition can have more than one rows! So, how is it possible for the scalar function in SL to operate on a partition? Are you getting the point?

Let's quickly go through this example
SELECT C1 , C2 * 2
FROM T1
GROUP BY C1, C2
Note that, f1(x) = x and f2(x) = x * 2, here.

This returns
1 2
0 0
1 0
Why because, The GB partitions the table T1 (with six rows/elements) into three partitions (recall, R(C1, C2) over T1)! The SL then operates on these partitions (see up, the definition #1) to fetch the tuples (C1, C2*2); one per partition.


Could you say, what could be the output of
SELECT C1, C3
FROM T1
GROUP BY C1, C2

Think! See, C1 is unique for each partition. Same for C2. But C3? Let's go through each partition one by one!

The first partition
{ (1 1 0 0), (1 1 0 1) } has C3 as 0 in both the tuples. No Problem!

The second partition
{ (0 0 1 1), (0 0 1 0), (0 0 0 1) } has C3 varying! So how can you expect a scalar function to act on C3 for this partition? You just can't! Same is the story with C4.

But, what was so special about C1 and C2 then? How could the scalar function act over the C1 and C2? Yes, you got it! Because, they are unique in each partition, as they are the ones using which the partitioning is done. So, they got to be unique to preserve the disjoint-ness of the partitions! Can your generalise this? You'll will reach close to the subset rule here -- I promise!

Thus, the SQL we just saw is invalid. Now what if, instead of scalar function, we use an aggregate function? Nice idea, huh? The aggregate function acts on a group of elements and return a single value, something a scalar function can digest!

So,
SELECT C1, (SUM(C3)) * 2
FROM T1
GROUP BY C1, C2

Following table sows the computation.

Partition -------------------------- C1 ------- C2 ------------- SUM(C3)
-----------------------------------------------------------------------------
{ (1 1 0 0), (1 1 0 1) } -------------- 1 --------- 1 ------------- 0+0 = 0
{ (0 0 1 1), (0 0 1 0), (0 0 0 1) } --- 0 --------- 0 ------------- 1+1+0 = 0
{ (1 0 1 1) } ---------------------- 1 --------- 0 -------------- 1

------- to be continued ----------



3 Comments:

Blogger Srihari SN said...

Mr Banerjee

the last thing that would stir up my emotions is some subset rule theory and some donkey coding ;)

Check this out to see how nerd ur -
http://www.wxplotter.com/ft_nq.php

3:40 PM  
Anonymous Anonymous said...

I stopped! This blog-editor-app itself got frustrated and gave up, unable to manage the burden of my stuff!

10:13 PM  
Blogger Sujeet Banerjee said...

I don't wish to continue above, due to the crappy behaviour of the editor. Perhaps this happens if one uses fonts heavily and the matter also being huge!

5:46 PM  

Post a Comment

<< Home