SQL Aggregation Tutorial

I’ve been working on a Basic Aggregation tutorial for my students. I think this might be close to what may benefit them. However, I thought it would be great to put it out there and solicit ideas. If you have some on improving this post, please let me know.

My first take at the post …

This is a lesson on basic aggregation in SQL. Aggregation in SQL means counting and adding values, and grouping the results of those counts or sums. Aggregation is a critical part of using the SQL language. At a basic level, aggregation includes the COUNT, SUM, AVG, MAX, and MIN aggregation functions; and the ORDER BY, GROUP BY, and HAVING clauses.

You’ll find the setup scripts for these examples at the bottom of this blog page. The best way to use this page is to copy the setup code, run it in your database, and then test the examples as you work through them.

Data set ↓

This section discusses the data set.

Illustrating aggregation is a challenge. A trivial set of numbers shows the basic concept but doesn’t present much of a challenge, or sometimes much relevance, while a more applied sample set makes mental calculation and comprehension of the functions more difficult. As a result of that trade-off, this post uses two copies of what it calls the ordinal numbers (here, the single-digit integers from zero to nine), plus one row per set that holds null values.

After you run the setup script, you can display the data set with the following query.

-- Oracle only SQL*Plus formatting command to smooth output display.
SET PAGESIZE 99
 
-- Query ordered by the sets and then the integers in ascending order.
SELECT * FROM ordinal ORDER BY 2, 4;

You could have sorted them more easily by using the ORDINAL_ID surrogate key column but it wouldn’t demonstrate the ORDER BY clause. In some databases, like Oracle and MySQL, the ORDER BY clause lets you use column names, column aliases, or the position of the column in the SELECT list. Unfortunately, this isn’t true across all databases.

The ORDER BY clause does let you see that you can set nested ordering, like the numbers inside the data sets that are qualified by the LIST_SET column. This is demonstrated in the output below.

ORDINAL_ID LIST_SET             LIST_NAME  LIST_VALUE
---------- -------------------- ---------- ----------
         1 Value Set A          Zero                0
         2 Value Set A          One                 1
         3 Value Set A          Two                 2
         4 Value Set A          Three               3
         5 Value Set A          Four                4
         6 Value Set A          Five                5
         7 Value Set A          Six                 6
         8 Value Set A          Seven               7
         9 Value Set A          Eight               8
        10 Value Set A          Nine                9
        11 Value Set A
        12 Value Set B          Zero                0
        13 Value Set B          One                 1
        14 Value Set B          Two                 2
        15 Value Set B          Three               3
        16 Value Set B          Four                4
        17 Value Set B          Five                5
        18 Value Set B          Six                 6
        19 Value Set B          Seven               7
        20 Value Set B          Eight               8
        21 Value Set B          Nine                9
        22 Value Set B
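
Since positional sort keys aren’t portable everywhere, the same nested ordering can also be written with column names. The following is only a sketch, and it rests on an assumption: it uses Python’s built-in sqlite3 module as a stand-in database so it can run without Oracle or MySQL. The table is rebuilt in memory from the values in the setup script. SQLite sorts nulls first in ascending order, so the sketch adds an extra sort key to reproduce the nulls-last order shown in the Oracle output above.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# Same nested ordering as ORDER BY 2, 4, written with column names.
# The "list_value IS NULL" key is a SQLite-specific way to push nulls last.
ordered = conn.execute(
    "SELECT ordinal_id, list_set, list_name, list_value "
    "FROM ordinal "
    "ORDER BY list_set, list_value IS NULL, list_value"
).fetchall()

ids = [row[0] for row in ordered]
print(ids)  # the surrogate keys come back in order, 1 through 22
```

In Oracle, ORDER BY list_set, list_value alone should produce the same order, because Oracle sorts nulls last in ascending order by default.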

Single row aggregation function use ↓

This section discusses and demonstrates the five aggregation functions in the post. They return a single row of data.

COUNT function

The COUNT function has two behaviors that may apply. One is when you count rows and the other is when you count values. Both return only a single row.

Counting by Reference: The following example demonstrates counting rows. In this scenario, the COUNT function takes a single call parameter, which is an asterisk (*). The asterisk actually counts the references to the rows in the table, and in Oracle is equivalent to COUNT(ROWID). The pseudo column ROWID points to the physical block address where the row is stored in the database. As such, the asterisk acts very much like a pointer in the C or C++ language (that is if you substitute a block address for a memory address).

SELECT COUNT(*) FROM ordinal;

It returns:

  COUNT(*)
----------
        22

When you count rows, you count them whether they contain values or not. In an unconstrained table, you could even insert a meaningless row with null values in every column, and the COUNT function with an asterisk would still count it.

Counting by Value: The following examples demonstrate approaches to counting values. It’s important to note that when you count values, you ignore null values. You can also count distinct things or all things when you count by value.

The LIST_SET column doesn’t contain any null values (as you can see by inspecting the data set earlier in this blog page). The following counts the number of values in the LIST_SET column. You could also substitute COUNT(ALL list_set) and it would return the same thing because ALL is the default and applies unless you override it with the DISTINCT keyword.

SELECT COUNT(list_set) FROM ordinal;

It returns:

COUNT(LIST_SET)
---------------
             22

Since the LIST_SET column only contains one of two values, you can count that to make sure. You do it with the following syntax:

SELECT COUNT(DISTINCT list_set) FROM ordinal;

It returns:

COUNT(DISTINCTLIST_SET)
-----------------------
                      2

The LIST_NAME and LIST_VALUE columns both contain two null values. Let’s see what they return when we count all or distinct values. This example simplifies it a step more by performing both counts side-by-side. This is possible because both aggregations return a single row.

SELECT COUNT(ALL list_name) AS name_number
,      COUNT(DISTINCT list_value) AS value_number
FROM   ordinal;

It returns the twenty total string values found in the first column and the ten unique numeric values found in the second column. The COUNT function with the DISTINCT filter acts like a COUNT function with a GROUP BY clause on the LIST_VALUE column, as you’ll see later in the discussion.

NAME_NUMBER VALUE_NUMBER
----------- ------------
         20           10
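
The three counting behaviors can be checked side by side. This is a sketch under an assumption: it uses Python’s built-in sqlite3 module as a stand-in for Oracle or MySQL, with the table rebuilt in memory from the setup script’s values. It also checks the comparison made above, that counting distinct values matches the number of non-null groups a GROUP BY on the same column would return.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# Rows, non-null values, and distinct non-null values in one pass.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(list_name), COUNT(DISTINCT list_value) "
    "FROM ordinal"
).fetchone()

# One row per non-null LIST_VALUE, which is what COUNT(DISTINCT ...) tallies.
groups = conn.execute(
    "SELECT list_value FROM ordinal "
    "WHERE list_value IS NOT NULL "
    "GROUP BY list_value"
).fetchall()

print(counts)       # (22, 20, 10)
print(len(groups))  # 10
```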

SUM, AVG, MAX, and MIN functions

Math Operations by Value: These math aggregations are done with the SUM, AVG, MAX, and MIN aggregation functions. The SUM and AVG functions are restricted to columns that contain numeric values, while the MAX and MIN functions also work on other ordered types such as strings and dates. Each of them supports the default ALL keyword or the DISTINCT keyword, and they behave much as the COUNT function discussed earlier.

There is a significant difference in what they can do. You can count unique strings with the COUNT function alone, but you can’t sum a column of numbers for each value of another column without a GROUP BY clause. You see examples of the GROUP BY clause in the next part of this post.

Since these behave more or less the same, their examples are grouped together: summing and averaging in one query, and taking the maximum and minimum values in another. For reference, the ordinal numbers sum to 45, so two sets of them naturally sum to 90. The average is 4.5 either way: 90 divided by the 20 non-null values, or 45 divided by the 10 unique values.

SELECT SUM(ALL list_value) AS sum_all
,      SUM(DISTINCT list_value) AS sum_distinct
,      AVG(ALL list_value) AS avg_all
,      AVG(DISTINCT list_value) AS avg_distinct
FROM   ordinal;

It returns:

   SUM_ALL SUM_DISTINCT    AVG_ALL AVG_DISTINCT
---------- ------------ ---------- ------------
        90           45        4.5          4.5
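
The averages in that output only work out to 4.5 because the aggregation functions skip null values: the divisor is the count of non-null values, not the count of rows. A minimal sketch of that arithmetic follows, assuming Python’s built-in sqlite3 module as a stand-in database (not one of the post’s target databases) and the same 22 seeded rows.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

(sum_all, value_count, row_count,
 avg_all, sum_distinct, avg_distinct) = conn.execute(
    "SELECT SUM(list_value), COUNT(list_value), COUNT(*), "
    "AVG(list_value), SUM(DISTINCT list_value), AVG(DISTINCT list_value) "
    "FROM ordinal"
).fetchone()

# AVG divides by the 20 non-null values, not by the 22 rows.
print(sum_all / value_count)  # 4.5, the same as avg_all
print(sum_all / row_count)    # about 4.09, which is not what AVG returns
```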

The next example makes the same comparison with the MAX and MIN functions. The maximum and minimum values are the same with either keyword because removing duplicates doesn’t change the smallest or largest value in a set.

SELECT MIN(ALL list_value) AS min_all
,      MIN(DISTINCT list_value) AS min_distinct
,      MAX(ALL list_value) AS max_all
,      MAX(DISTINCT list_value) AS max_distinct
FROM   ordinal;

It returns:

   MIN_ALL MIN_DISTINCT    MAX_ALL MAX_DISTINCT
---------- ------------ ---------- ------------
         0            0          9            9

Multiple row aggregation function use ↓

This section discusses and demonstrates the five aggregation functions where they return more than a single row of data, which involves grouping by non-aggregated columns. This section covers the GROUP BY and HAVING clauses.

COUNT function

Counting by Groups: The following example demonstrates how we can count rows or values by groups. The example counts rows by reference and value, but it does so by grouping the return values on the unique string values in the LIST_SET column.

SELECT   list_set AS grouping_by_column
,        COUNT(*)
,        COUNT(list_value)
FROM     ordinal
GROUP BY list_set;

It returns the following values:

GROUPING_BY_COLUMN     COUNT(*) COUNT(LIST_VALUE)
-------------------- ---------- -----------------
Value Set A                  11                10
Value Set B                  11                10

The results show that there are eleven rows for each set, and that ten of the rows contain values while one holds a null value.
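
The gap between the two counts is itself a useful number: subtracting the value count from the row count gives the number of nulls in each group. The sketch below shows this, assuming Python’s built-in sqlite3 module as a stand-in database and the same seeded rows as the setup script. The ORDER BY clause is added only to make the row order predictable.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# Rows, non-null values, and nulls for each LIST_SET group.
per_set = conn.execute(
    "SELECT list_set, COUNT(*), COUNT(list_value), "
    "COUNT(*) - COUNT(list_value) AS null_count "
    "FROM ordinal "
    "GROUP BY list_set "
    "ORDER BY list_set"
).fetchall()

for row in per_set:
    print(row)
```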

You can also group on columns that contain null values or sets of columns. The following example groups by two columns, where one contains a null value.

SELECT   list_set AS grouping_by_not_null
,        list_name AS group_by_null_too
,        COUNT(*)
,        COUNT(list_value)
FROM     ordinal
WHERE    list_set = 'Value Set A'
GROUP BY list_set
,        list_name;

It returns the following values:

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO   COUNT(*) COUNT(LIST_VALUE)
-------------------- ----------------- ---------- -----------------
Value Set A          Zero                       1                 1
Value Set A          Five                       1                 1
Value Set A          Three                      1                 1
Value Set A          Four                       1                 1
Value Set A          One                        1                 1
Value Set A          Two                        1                 1
Value Set A          Eight                      1                 1
Value Set A          Nine                       1                 1
Value Set A          Seven                      1                 1
Value Set A          Six                        1                 1
Value Set A                                     1                 0

The last row returned is the one where both the grouping column value and the counted column value are null. You should note that the GROUP BY applies to a string column and the return set includes a row grouped by its null value. The count of a column that holds only null values is always zero.

Another thing that you may not have noticed is that the rows are no longer in ascending numeric order by the LIST_VALUE column, or in any obvious order by the LIST_NAME column. There are two reasons why this occurs. With regard to the LIST_NAME column, SQL doesn’t guarantee any row order without an ORDER BY clause, and Oracle and many other databases may group rows with algorithms, such as hashing, that leave the results unordered. The LIST_VALUE column is not in the GROUP BY clause and therefore can’t be used directly as an ORDER BY key.

When you attempt to use a column outside of the GROUP BY clause to order the return set, like this:

SELECT   list_set AS grouping_by_not_null
,        list_name AS group_by_null_too
,        COUNT(*)
,        COUNT(list_value)
FROM     ordinal
WHERE    list_set = 'Value Set A'
GROUP BY list_set
,        list_name
ORDER BY list_value;

You would see the following error in an Oracle Database 10g instance:

ORDER BY list_value
         *
ERROR at line 9:
ORA-00979: not a GROUP BY expression

You can sort this particular set with a CASE expression in the ORDER BY clause because the list of values is short, but this wouldn’t be a good idea in many other solution sets. However, it certainly highlights some of the potential for custom sorting of result sets.

SELECT   list_set AS grouping_by_not_null
,        list_name AS group_by_null_too
,        COUNT(*)
,        COUNT(list_value)
FROM     ordinal
WHERE    list_set = 'Value Set A'
GROUP BY list_set
,        list_name
ORDER BY CASE
           WHEN list_name = 'Zero'  THEN 0
           WHEN list_name = 'One'   THEN 1
           WHEN list_name = 'Two'   THEN 2
           WHEN list_name = 'Three' THEN 3
           WHEN list_name = 'Four'  THEN 4
           WHEN list_name = 'Five'  THEN 5
           WHEN list_name = 'Six'   THEN 6
           WHEN list_name = 'Seven' THEN 7
           WHEN list_name = 'Eight' THEN 8
           WHEN list_name = 'Nine'  THEN 9
         END;

It returns the following values:

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO   COUNT(*) COUNT(LIST_VALUE)
-------------------- ----------------- ---------- -----------------
Value Set A          Zero                       1                 1
Value Set A          One                        1                 1
Value Set A          Two                        1                 1
Value Set A          Three                      1                 1
Value Set A          Four                       1                 1
Value Set A          Five                       1                 1
Value Set A          Six                        1                 1
Value Set A          Seven                      1                 1
Value Set A          Eight                      1                 1
Value Set A          Nine                       1                 1
Value Set A                                     1                 0

A better way to sort this type of data would involve putting it into an inline view and then joining it against the base data set. This allows you to use a non-grouping column as the sort key in an ORDER BY clause. Here’s an example of that approach, but note that it adds system overhead and should only be used when SQL must perform the sort operation.

SELECT   ilv.grouping_by_not_null
,        ilv.group_by_null_too
,        ilv.row_count
,        ilv.column_count
FROM    (SELECT   list_set AS grouping_by_not_null
         ,        list_name AS group_by_null_too
         ,        COUNT(*) AS row_count
         ,        COUNT(list_value) AS column_count
         FROM     ordinal
         WHERE    list_set = 'Value Set A'
         GROUP BY list_set
         ,        list_name) ilv LEFT JOIN ordinal o
ON       ilv.grouping_by_not_null = o.list_set
AND      ilv.group_by_null_too = o.list_name
ORDER BY o.list_value;

That returns a perfectly ordered set because the LEFT JOIN keeps the inline view’s null row even though its null LIST_NAME value can’t match any row in the join condition. The unmatched row gets a null sort key, which Oracle sorts last in ascending order. It returns the following set:

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO  ROW_COUNT COLUMN_COUNT
-------------------- ----------------- ---------- ------------
Value Set A          Zero                       1            1
Value Set A          One                        1            1
Value Set A          Two                        1            1
Value Set A          Three                      1            1
Value Set A          Four                       1            1
Value Set A          Five                       1            1
Value Set A          Six                        1            1
Value Set A          Seven                      1            1
Value Set A          Eight                      1            1
Value Set A          Nine                       1            1
Value Set A                                     1            0
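
A lighter alternative, which is not the approach used in this post, is to sort on an aggregate of the non-grouping column. An aggregate such as MIN(list_value) is allowed in the ORDER BY clause of a grouped query, and here each group holds a single value, so the minimum is that value. The sketch below assumes Python’s built-in sqlite3 module as a stand-in database. The extra IS NULL sort key is SQLite-specific and only there to push the null group last, as Oracle does by default.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# Sort the grouped rows by an aggregate of the non-grouping column.
ordered = conn.execute(
    "SELECT list_set, list_name, COUNT(*), COUNT(list_value) "
    "FROM ordinal "
    "WHERE list_set = 'Value Set A' "
    "GROUP BY list_set, list_name "
    "ORDER BY MIN(list_value) IS NULL, MIN(list_value)"
).fetchall()

name_order = [row[1] for row in ordered]
print(name_order)  # Zero through Nine, then None for the null group
```

This avoids the join entirely, but it only gives a meaningful order when each group maps to one sort value, as it does in this data set.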

SUM, AVG, MAX, and MIN functions

Math Operations by Groups: These math aggregations are done with the SUM, AVG, MAX, and MIN aggregation functions. Like the previous single row return examples, they’re applied here to a column that contains numeric values. While they also support the default ALL or DISTINCT keywords, it didn’t seem necessary to demonstrate more of those behaviors here.

The following sums, averages, and gets both the minimum and maximum value by unique LIST_SET column values:

SELECT   list_set AS grouping_by_not_null
,        SUM(list_value) AS ordinal_sum
,        AVG(list_value) AS ordinal_avg
,        MIN(list_value) AS ordinal_min
,        MAX(list_value) AS ordinal_max
FROM     ordinal
GROUP BY list_set;

It returns the following two rows:

GROUPING_BY_NOT_NULL ORDINAL_SUM ORDINAL_AVG ORDINAL_MIN ORDINAL_MAX
-------------------- ----------- ----------- ----------- -----------
Value Set A                   45         4.5           0           9
Value Set B                   45         4.5           0           9

Expanding the GROUP BY criteria to include the LIST_NAME column shows what happens when the SUM and AVG functions work with only null values. They always return null when every column value in the group is null, and so do the MIN and MAX functions. This is handy as you’ll see in the selective aggregation discussion in a subsequent tutorial.

As done in the prior example, this filters out the second value set, to focus on what’s happening with the rows grouped on two columns.

SELECT    list_set AS grouping_by_not_null
,         list_name AS group_by_null_too
,         SUM(list_value) AS ordinal_sum
,         AVG(list_value) AS ordinal_avg
,         MIN(list_value) AS ordinal_min
,         MAX(list_value) AS ordinal_max
FROM      ordinal
WHERE NOT list_set = 'Value Set B'
GROUP BY  list_set
,         list_name;

It returns the following set of records, which are not in numeric order because grouping doesn’t guarantee any order. It’s not hard to read because there aren’t too many rows returned.

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO ORDINAL_SUM ORDINAL_AVG ORDINAL_MIN ORDINAL_MAX
-------------------- ----------------- ----------- ----------- ----------- -----------
Value Set A          Zero                        0           0           0           0
Value Set A          Five                        5           5           5           5
Value Set A          Three                       3           3           3           3
Value Set A          Four                        4           4           4           4
Value Set A          One                         1           1           1           1
Value Set A          Two                         2           2           2           2
Value Set A          Eight                       8           8           8           8
Value Set A          Nine                        9           9           9           9
Value Set A          Seven                       7           7           7           7
Value Set A          Six                         6           6           6           6
Value Set A
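
The blank last row in that output is a group whose only column value is null, so every aggregate over it comes back null rather than zero. A quick way to confirm this is sketched below, assuming Python’s built-in sqlite3 module as a stand-in database, where a SQL null arrives in Python as None.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# Key each group's aggregates by its LIST_NAME (None is the null group).
by_name = {
    row[0]: row[1:]
    for row in conn.execute(
        "SELECT list_name, SUM(list_value), AVG(list_value), "
        "MIN(list_value), MAX(list_value) "
        "FROM ordinal "
        "WHERE list_set = 'Value Set A' "
        "GROUP BY list_set, list_name"
    )
}

print(by_name["Five"])  # sum, average, minimum, and maximum are all five
print(by_name[None])    # (None, None, None, None)
```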

If SQL must sort these into numeric order, it can do so as in the prior example with the results from the COUNT function: wrap the aggregation in an inline view and join it back to the base table. The following shows you how to do that, this time for the second value set.

SELECT   ilv.grouping_by_not_null
,        ilv.group_by_null_too
,        ilv.ordinal_sum
,        ilv.ordinal_avg
,        ilv.ordinal_min
,        ilv.ordinal_max
FROM    (SELECT   list_set AS grouping_by_not_null
         ,        list_name AS group_by_null_too
         ,        SUM(list_value) AS ordinal_sum
         ,        AVG(list_value) AS ordinal_avg
         ,        MIN(list_value) AS ordinal_min
         ,        MAX(list_value) AS ordinal_max
         FROM     ordinal
         WHERE    list_set = 'Value Set B'
         GROUP BY list_set
         ,        list_name) ilv LEFT JOIN ordinal o
ON       ilv.grouping_by_not_null = o.list_set
AND      ilv.group_by_null_too = o.list_name
ORDER BY o.list_value;

Now you should see an ordered set like this:

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO ORDINAL_SUM ORDINAL_AVG ORDINAL_MIN ORDINAL_MAX
-------------------- ----------------- ----------- ----------- ----------- -----------
Value Set B          Zero                        0           0           0           0
Value Set B          One                         1           1           1           1
Value Set B          Two                         2           2           2           2
Value Set B          Three                       3           3           3           3
Value Set B          Four                        4           4           4           4
Value Set B          Five                        5           5           5           5
Value Set B          Six                         6           6           6           6
Value Set B          Seven                       7           7           7           7
Value Set B          Eight                       8           8           8           8
Value Set B          Nine                        9           9           9           9
Value Set B

Math Operations with the HAVING clause: The HAVING clause is very useful when you want to find rows that duplicate a column or set of column values. It’s also useful for filtering aggregated data sets.

The HAVING clause affects all aggregate rows and acts somewhat like a filter in the WHERE clause of a non-aggregated query. The difference is that the WHERE clause filters the base result set before it is grouped, while the HAVING clause filters the aggregated result set.
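
The difference shows up when the same threshold is applied in each clause. In the sketch below, the WHERE version tests each row’s LIST_VALUE before grouping, while the HAVING version tests each group’s total after grouping. It assumes Python’s built-in sqlite3 module as a stand-in database, and the threshold of eight is an arbitrary value chosen for illustration.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# WHERE: drop rows whose own value is below 8, then group what is left.
where_rows = conn.execute(
    "SELECT list_name, SUM(list_value) FROM ordinal "
    "WHERE list_value >= 8 "
    "GROUP BY list_name ORDER BY list_name"
).fetchall()

# HAVING: group every row first, then drop groups whose total is below 8.
having_rows = conn.execute(
    "SELECT list_name, SUM(list_value) FROM ordinal "
    "GROUP BY list_name "
    "HAVING SUM(list_value) >= 8 "
    "ORDER BY list_name"
).fetchall()

print(where_rows)   # [('Eight', 16), ('Nine', 18)]
print(having_rows)  # six groups, from Four (total 8) up to Nine (total 18)
```

The names Four through Seven survive only in the HAVING version because their two rows add up to at least eight even though neither row does on its own.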

The following example captures the sum and average for only the odd LIST_VALUE column values. It does that by using modular (or modulo) mathematics through the MOD function in Oracle’s SQL implementation. Modulo math works by calculating the remainder of integer division. When you divide any integer by two and the remainder is one, you have an odd number.

Here’s the sample query that illustrates a CASE expression inside the HAVING clause. The CASE expression only adds the number when it is an odd number. Any even number is zeroed out. The comparison operator ensures that rows are returned only when the SUM is greater than zero, which only happens for odd numbers.

-- GROUP BY comes before HAVING, the conventional order that MySQL also requires.
SELECT   list_set AS grouping_by_not_null
,        list_name AS group_by_null_too
,        SUM(list_value) AS ordinal_sum
,        AVG(list_value) AS ordinal_avg
FROM     ordinal
WHERE    list_set = 'Value Set A'
GROUP BY list_set
,        list_name
HAVING   SUM(CASE
               WHEN MOD(list_value,2) = 1 THEN list_value ELSE 0
             END) > 0;

The data set, as you’ve seen with others, isn’t sorted in ascending order because grouping doesn’t guarantee an order. Since there are only five rows returned, it is easy to see that the HAVING clause let us narrow the return set to odd numbers. What we might overlook is that the HAVING clause is always a filter. This means the expression in the HAVING clause needs a comparison operator so that it evaluates to true or false.

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO ORDINAL_SUM ORDINAL_AVG
-------------------- ----------------- ----------- -----------
Value Set A          Five                        5           5
Value Set A          Three                       3           3
Value Set A          One                         1           1
Value Set A          Nine                        9           9
Value Set A          Seven                       7           7

As you’ve seen in earlier examples, you can also put this inside an inline view and sort the aggregated results. Here’s the code to accomplish that.

SELECT   ilv.grouping_by_not_null
,        ilv.group_by_null_too
,        ilv.ordinal_sum
,        ilv.ordinal_avg
FROM    (SELECT   list_set AS grouping_by_not_null
         ,        list_name AS group_by_null_too
         ,        SUM(list_value) AS ordinal_sum
         ,        AVG(list_value) AS ordinal_avg
         FROM     ordinal
         WHERE    list_set = 'Value Set A'
         GROUP BY list_set
         ,        list_name
         HAVING   SUM(CASE
                        WHEN MOD(list_value,2) = 1 THEN list_value ELSE 0
                      END) > 0) ilv LEFT JOIN ordinal o
ON       ilv.grouping_by_not_null = o.list_set
AND      ilv.group_by_null_too = o.list_name
ORDER BY o.list_value;

It naturally returns the ordered set of aggregated results from only odd numbers in the base set of values.

GROUPING_BY_NOT_NULL GROUP_BY_NULL_TOO ORDINAL_SUM ORDINAL_AVG
-------------------- ----------------- ----------- -----------
Value Set A          One                         1           1
Value Set A          Three                       3           3
Value Set A          Five                        5           5
Value Set A          Seven                       7           7
Value Set A          Nine                        9           9
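
The other use of the HAVING clause mentioned at the start of this discussion, finding rows that duplicate a column or set of column values, follows the same pattern with COUNT in place of SUM. The sketch below assumes Python’s built-in sqlite3 module as a stand-in database. In this data set every LIST_NAME value appears once in each value set, so each name is a duplicate on its own but no combination of LIST_SET and LIST_NAME repeats.

```python
import sqlite3

# Stand-in database: an in-memory SQLite copy of the post's ORDINAL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ordinal (ordinal_id INTEGER PRIMARY KEY, "
    "list_set TEXT, list_name TEXT, list_value INTEGER)"
)
names = ["Zero", "One", "Two", "Three", "Four",
         "Five", "Six", "Seven", "Eight", "Nine"]
for list_set in ("Value Set A", "Value Set B"):
    seed = [(list_set, name, value) for value, name in enumerate(names)]
    seed.append((list_set, None, None))  # one all-null row per set
    conn.executemany(
        "INSERT INTO ordinal (list_set, list_name, list_value) "
        "VALUES (?, ?, ?)",
        seed,
    )

# Names that occur more than once across the whole table.
dupes = conn.execute(
    "SELECT list_name, COUNT(*) FROM ordinal "
    "GROUP BY list_name "
    "HAVING COUNT(*) > 1"
).fetchall()

# Set-and-name combinations that occur more than once (there are none).
repeated_pairs = conn.execute(
    "SELECT list_set, list_name, COUNT(*) FROM ordinal "
    "GROUP BY list_set, list_name "
    "HAVING COUNT(*) > 1"
).fetchall()

print(len(dupes))      # 11 groups: ten names plus the null group
print(repeated_pairs)  # []
```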

Setup Script ↓

The items below contain the source code that seeds the examples.

Oracle SQL Join Supporting Script ↓

This has the setup script for the example tables.

-- Conditionally drop the table and sequence.
BEGIN
  FOR i IN (SELECT TABLE_NAME
            FROM   user_tables
            WHERE  TABLE_NAME = 'ORDINAL') LOOP
    EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name;
  END LOOP;
  FOR i IN (SELECT sequence_name
            FROM   user_sequences
            WHERE  sequence_name = 'ORDINAL_S1') LOOP
    EXECUTE IMMEDIATE 'DROP SEQUENCE '||i.sequence_name;
  END LOOP;
END;
/
 
-- Create the aggregation sample table and sequence.
CREATE TABLE ordinal
( ordinal_id  NUMBER
, list_set    VARCHAR2(20)
, list_name   VARCHAR2(5)
, list_value  NUMBER);
 
CREATE SEQUENCE ordinal_s1;
 
-- Seeding values in the aggregation sample table.
DECLARE
  -- Define local types.
  TYPE listg IS TABLE OF VARCHAR2(20);
  TYPE listn IS TABLE OF NUMBER;
  TYPE lists IS TABLE OF VARCHAR2(5);
 
  -- Declare local variables.
  groups LISTG := listg('Value Set A','Value Set B');
  labels LISTS := lists('Zero','One','Two','Three','Four','Five','Six','Seven','Eight','Nine');
  valuen LISTN := listn(0,1,2,3,4,5,6,7,8,9);
 
BEGIN
  -- Outer loop sets the group level.
  FOR i IN 1..groups.COUNT LOOP
    -- Inner loop sets the row level.
    FOR j IN 1..labels.COUNT LOOP
      INSERT INTO ordinal
      VALUES (ordinal_s1.NEXTVAL,groups(i),labels(j),valuen(j));
    END LOOP;
 
    -- Insert the null values for each group.
    INSERT INTO ordinal
    VALUES (ordinal_s1.NEXTVAL,groups(i),NULL,NULL);
  END LOOP;
  COMMIT;
END;
/
 
-- Commit the inserts.
COMMIT;

MySQL SQL Join Supporting Script ↓

This has the setup script for the example tables.

-- Conditionally drop the table.
SELECT 'DROP TABLE ordinal' AS Statement;
DROP TABLE IF EXISTS ordinal;
 
-- Create the table.
SELECT 'CREATE TABLE ordinal' AS Statement;
CREATE TABLE ordinal
( ordinal_id  INT UNSIGNED PRIMARY KEY AUTO_INCREMENT
, list_set    VARCHAR(20)
, list_name   VARCHAR(5)
, list_value  INT UNSIGNED);
 
-- Seed the row values.
SELECT 'INSERT INTO ordinal' AS Statement;
INSERT INTO ordinal
VALUES
 (NULL,'Value Set A','Zero','0')
,(NULL,'Value Set A','One','1')
,(NULL,'Value Set A','Two','2')
,(NULL,'Value Set A','Three','3')
,(NULL,'Value Set A','Four','4')
,(NULL,'Value Set A','Five','5')
,(NULL,'Value Set A','Six','6')
,(NULL,'Value Set A','Seven','7')
,(NULL,'Value Set A','Eight','8')
,(NULL,'Value Set A','Nine','9')
,(NULL,'Value Set A',NULL,NULL)
,(NULL,'Value Set B','Zero','0')
,(NULL,'Value Set B','One','1')
,(NULL,'Value Set B','Two','2')
,(NULL,'Value Set B','Three','3')
,(NULL,'Value Set B','Four','4')
,(NULL,'Value Set B','Five','5')
,(NULL,'Value Set B','Six','6')
,(NULL,'Value Set B','Seven','7')
,(NULL,'Value Set B','Eight','8')
,(NULL,'Value Set B','Nine','9')
,(NULL,'Value Set B',NULL,NULL);
 
-- Commit the inserts.
COMMIT;

Written by maclochlainn

February 26th, 2010 at 1:29 am

Posted in MySQL,Oracle,Oracle XE,sql,SQL Server 2008
