InsideOut: January 2005

Tuesday, January 18, 2005

KISS

Yeah! Everyone dreams about having one in one's life. In fact, it has to be that way, for most of the people to relish it...understand it...maintain it. I won't say much about KISS, when already so many, more experienced people have said about it, and almost made you feel it! Just go and feel it yourself -- http://www.adambosworth.net/archives/000031.html

I'd say, it's a learning for the geeky people, for the sake of mere mortals, how to KISS!

From Adam's blog:

... There are those who will say that all this is utopian. If Utopian means not being afraid to dream, then indeed it is. So was Tim’s initial vision of universal access to information. So is Google’s mission. T.E Lawrence wrote in the Seven Pillars of Wisdom,

“All men dream: but not equally. Those who dream by night in the dusty recesses of their minds wake in the day to find that it was vanity: but the dreamers of the day are dangerous men, for they may act their dream with open eyes, to make it possible”
I encourage all of you to act your dreams with open eyes. I encourage all of you to dream...

Tuesday, January 11, 2005

The subset-rule is just a rule-of-thumb!

What I discuss here is actually related to my previous blog. I succeeded in drawing attentions of many. I got very good comments from people including the ones I'd never met. The first one was about a typo from anonymous.

The second one is really dear to me, as I did not expect anybody to post a blood-red-faced comment! Mr. Mahadevan, I see your face still blood-red as you are going through this blog. I, no doubt, have the capability to stir up emotions in the readers' minds.

The third one was what I expected from somebody. But, that does not mean, it's not dear to me. I believe there are people really interested in languages and formalisms and they can't simply keep quite; justifying my expectation. In fact, I had a chat with Manivannan. He asked me those questions. I eventually managed him post those as comments. Thanks Manivannan.

Let me quickly go through the formalisms.

Work-space:

1. We have one table T1, with columns { C1, C2, C3, C4 }. This's the schema, in fact.
2. As far as the data is concerned, let's have a data-instance as

{
(1 1 0 0),
(1 1 0 1),
(0 0 1 1),
(0 0 1 0),
(0 0 0 1),
(1 0 1 1)
}

where each row is enclosed in the parentheses (- - - -). Each row has four entities or values, correponding to the four columns (*** Let, for simplicity, their values be 0 or 1. ***)

Assumptions:

I can't think of any, right now!

Definitions:

1. Partition (grouping of rows) -- If R is an equivalence relation on a set S, then R divides S into disjoint subsets {S1, S2, ..., Sn} where 1 <= n <= |S|. This comes from the basic discrete-algebra. For our case, S is the table T1 and an element of the set T1 is of form (C1, C2, c3, C4) as we have four columns. SQL programmers like to call each element as a row.

The R is the equality of two tuples. Two tuples (e1, e2, ..., ek) and (p1, p2, ..., pk) are said to be equal iff the following boolean expression evaluated to be true:
(e1 == p1) ^ (e2 == p2) ^ ...^ (ek == pk). The == is the standard C/Java boolean operator and the operator ^ is the boolean AND.

Are you thinking what I am thinking? Let me cross-check if your brain is in synch with mine, through the following example:

If we partition the above data-instance of T1, defined in the work-space, using R acting on the tuple (C1, C2, C3, C4) then we get each row/element fall into different partitions; the number of partitions being equal to the number of rows in T1.

If the R acts on the tuple (C1, C2), we get three partitions, namely,
{ (1 1 0 0), (1 1 0 1) }
{ (0 0 1 1), (0 0 1 0), (0 0 0 1) } and
{ (1 0 1 1) }
Yes, we applied the equality on the first two columns for each row!

So, let me define R(Ci, Cj) as partitioning the table T1 based on equality of the tuple (Ci, Cj), that is, based on the columns Ci and Cj of T1.

--------------------------------------------------------------------------

2. FR ('from') clause --
This caluse takes a comma-list of tables, does their cross product (join in database parlance) and produces final big table. This caluse operates first of all other clauses. This justifies considering only one table T1 in the work-space.

3. GB ('group by') clause -- This clause partitions the big table resulting from the FR clause. So, GROUP BY C1, C3 is effectively R(C1, C3) over the table.

4. Scalar function -- The one which operated on the column(s) of the same row, to give out one value. These include arithmetic operators. These are like multi-variate algebraic functions of form f(x, y, z); x, y, z being the entities of the same row.

5. Aggregate functions -- The one that operates on a column of all the rows, to give out one value. E.g. SUM, AVG, MIN, MAX.

SELECT SUM(C3) , AVG(C1)
FROM T1

Results in
--------
3 | 0.5
--------

6. SL ('select') clause -- This clause projects what you want to see in the output, from the result after GB (if present); precisely the tuple structure of the relust. So, SL is not a row-selector.

SL accepts a comma-list of entities. Each entity can be a simple column or a scalar/aggregate function applied over one or more columns.

A typical example would be
SELECT f1(C1, C2), f2(C1, C4) ...
where f1 and f2 are scalar functions. But it is important to note that f1(C1, C2) operates on single row at a time.

SELECT C1, C2 + C3
FROM T1
returns
1 1
1 1
0 1
0 1
0 0
1 1

But what happens when GB is present? The SL must apply on each partition! But what? A partition can have more than one rows! So, how is it possible for the scalar function in SL to operate on a partition? Are you getting the point?

Let's quickly go through this example
SELECT C1 , C2 * 2
FROM T1
GROUP BY C1, C2
Note that, f1(x) = x and f2(x) = x * 2, here.

This returns
1 2
0 0
1 0
Why because, The GB partitions the table T1 (with six rows/elements) into three partitions (recall, R(C1, C2) over T1)! The SL then operates on these partitions (see up, the definition #1) to fetch the tuples (C1, C2*2); one per partition.

Could you say, what could be the output of
SELECT C1, C3
FROM T1
GROUP BY C1, C2

Think! See, C1 is unique for each partition. Same for C2. But C3? Let's go through each partition one by one!

The first partition { (1 1 0 0), (1 1 0 1) } has C3 as 0 in both the tuples. No Problem!

The second partition { (0 0 1 1), (0 0 1 0), (0 0 0 1) } has C3 varying! So how can you expect a scalar function to act on C3 for this partition? You just can't! Same is the story with C4.

But, what was so special about C1 and C2 then? How could the scalar function act over the C1 and C2? Yes, you got it! Because, they are unique in each partition, as they are the ones using which the partitioning is done. So, they got to be unique to preserve the disjoint-ness of the partitions! Can your generalise this? You'll will reach close to the subset rule here -- I promise!

Thus, the SQL we just saw is invalid. Now what if, instead of scalar function, we use an aggregate function? Nice idea, huh? The aggregate function acts on a group of elements and return a single value, something a scalar function can digest!

So,
SELECT C1, (SUM(C3)) * 2
FROM T1
GROUP BY C1, C2

Following table sows the computation.

Partition -------------------------- C1 ------- C2 ------------- SUM(C3)
-----------------------------------------------------------------------------
{ (1 1 0 0), (1 1 0 1) } -------------- 1 --------- 1 ------------- 0+0 = 0
{ (0 0 1 1), (0 0 1 0), (0 0 0 1) } --- 0 --------- 0 ------------- 1+1+0 = 0
{ (1 0 1 1) } ---------------------- 1 --------- 0 -------------- 1

------- to be continued ----------

Tuesday, January 04, 2005

Oracle's Funny Interpretter

Quite sometime back I got interested in the GROUP BY clause of SQL-92. I feels that's an interesting clause, especially when it works in combination with expressions and aggregate functions.

Oh! Before I dive deep into the thought provoking title, let me put some background for those who have a faint idea of SQL-92 constructs. SQL is a query-language (something that helps to fetch/update data using small chunks of code). It expands to Structured Query language. The question 'how much structured' really makes me sink deep into thoughts. Hmmm...let me not spoil the spirit by discussing the structuredness of SQL. But I'd say it's successful (like HTML) because of its simplicity and flexible nature (I think, I sound applauding!).

I don't have patience to teach you SQL here (I think its pretty simple, get C.J. Date's book in front of you). My focus would be on the SELECT construct, that helps you fetch the data of your desire. GROUP BY clause helps you partition the data as per your desire (i.e each partition following some characteristic that you specify). Rest of this blog, I'll talk about partitioning (I guess, 'grouping' is the appropriate term for the SQL programmers).

I'd also mention this:

SELECT COL1, COL2
FROM TAB1
GROUP BY COL1, COL2, COL3

The column-list in SL (short for 'SELECT') clause must be a subset of that in GB (short for 'Group By') clause. So the query above is valid. What follows is an invalid query (as COL2 is not in the GB-list):

SELECT COL1, COL2
FROM TAB1
GROUP BY COL1, COL3

Let me use Pink color for GB-List and Green for the SL-List. SL clause can take any column if it's an argument to an aggregate function. Thus, the following query is valid.

SELECT COL1, sum(COL2)
FROM TAB1
GROUP BY COL1, COL3

If you know a bit of SQL, you see every thing is in perfect shape. But what if we bring in expressions? OK, let's see:

SELECT (COL2 * 2)
FROM TAB1
GROUP BY COL1, (COL2 * 2)

is valid, because the SL-list is a subset of the GB-list (as per SQL-92 requirement). The following queries are all valid and are treated well by the Oracle's SQL Interpretter:

SELECT (COL2 * 2 * 2)
FROM TAB1
GROUP BY COL1, (COL2 * 2)

SELECT (COL1 + COL2)
FROM TAB1
GROUP BY (COL1 * COL2)

SELECT (COL1 + COL2)
FROM TAB1
GROUP BY COL1, COL2

SELECT (COL2 * 2)
FROM TAB1
GROUP BY COL1, (2 * COL2)

SELECT (COL2 * 2 * 2)
FROM TAB1
GROUP BY COL1, (COL2 * 4)

Sorry if this confuses you! The Subset rule is just a rule of thumb. To judge their validity, requires a bit of going into SQL-92, which gives a wide allowance of what is acceptable in a query with a GB. I won't try to explain it here. But I may take if offline through comments.

I was a fan of Oracle. I believed the developers really pumped in good amount of AI to achieve the expression comparision, until I came across the following observation with the Oracle's Interpretter:

SQL> select (2*comm) + (2*sal) from bonus group by (2*comm) + (2*sal);

(2*COMM)+(2*SAL)
----------------
150000
150020
150220

SQL> select (2*comm) + (2*sal) from bonus group by (2*comm) + 2*(sal);

(2*COMM)+(2*SAL)
----------------
150000
150020
150220

SQL> select (2*comm) + (2*sal) from bonus group by (2*comm) + (sal)*2;
select (2*comm) + (2*sal) from bonus group by (2*comm) + (sal)*2
*
ERROR at line 1:
ORA-00979: not a GROUP BY _expression

You observed what happened?...NO?? Let me hand it over to you. Oracle's Interpretter says, the following SQL query is invalid (in fact, it's a valid SQL-92 query)

SELECT COL1 * 2 + COL2 *2
FROM TAB1
GROUP BY COL1 * 2 + 2* (COL2)

Whoa! I am not criticizing the failure of Oracle to treate this one correctly. But rather I want to bring your attention to a pattern. Go back and read Oracle's reaponse in Red. What does the success of the first two queries suggest you? To me it appeared that Oracle implements a good AI (this is just not as trivial as Expression-String-matching or Expression-Syntax-Tree matching) to resolve the expression-comparision! In fact, this conclusion is buttressed by the support of other sofisticated GB queries involving expression (re-read the statement in Blue, back).

But hey! What happened to the third query in Red? Are COL1 * 2 + COL2 *2 and COL1 * 2 + 2* (COL2) not equivalent? They are! Then why does Oracle's interpretter fail here? This suggests Donkey-work done!

Another valid SQL, that fails in Oracle.

SELECT 2*(COL1) + 2*(COL2)
FROM TAB1
GROUP BY 2*(COL1 + COL2)

We know that 2*(COL1) + 2*(COL2) and 2*(COL1 + COL2) are equivalent. The question is how do you make the Computer recognize this! One way is to write code for each pattern you can think of (this is exactly what I mean by Donkey-Coding!). Another is to have a robust AI that matches expression as Human Brain does (here your system will work in case of unforeseen patterns as well; not dependent on how many patterns you could think of!).

The title of this blog could appropriately have been Oracle's Donkey interpretter! Whatever is it... it's really costly to implement such a robust AI. But at the same time, it's needed to support all the common patterns and Oracle has done that! No wonder, it's successful.

Sunday, January 02, 2005

Returned back to Pune

The return journey wasn't that interesting. Yeah...You guessed it right! The students were still on vacation!

I was jam-packed with winter clothes when I left Kanpur (my hometown, read previous blogs for knowing more about Kanpur!) as it was terribly cold there (almost melting your bones)!

The train was late due to fog, by few hours! It got more and more as it approached Pune. It was really frustrating, until I heard a voice. It was, in fact, a puzzle asked by someone to little kids. The voice dragged me to him. He was a professor in Computer Science at some college affiliated to the University of Delhi. The puzzle was interesting, easy though. He told me, 'you can write a computer program simulating the solution, like you might have done for the Tower of Hanoi problem'.

I know, you are eager to hear the puzzle! Here is it:

There is a group of 3 monks and 3 cannibals. They are all on the bank of a river with a wish to cross it.

But argh! There is only one boat and it's so small that it can carry only two persons(at most) at a time.

Also there's is another condition - that is, at any time or place, in a group the number of monks should not be less than that of the cannibals, else the monks will be eaten up by the cannibals. Now work out a solution to fulfill their wish to cross the river!

If you want to send your soluion, please do it here through the comments and not through the e-mail. You may post your comment as 'anonymous', if
1. you don't want to disclose yourself or
2. if you are lazy enough to create your account in www.bloggers.com

I reached Pune on Monday, December 27, 2004 (at about 7:15 pm), the very next day the Tsunami debacle occured!

From flashback

InsideOut

About Me

Other Blogs