We started by talking in general terms about normal forms.
In general, if you follow the guidelines we talked about yesterday, then your database tables will all conform to all three normal forms. By the midterm, you should know the difference between first, second, and third normal form. But if I run into you a year from now, I won't care if you can't remember the difference between second and third normal form, but you better still be able to put together a database that conforms to all three.
A database is in first normal form (1NF) if for every row, each column has only one value. In other words, each entry in the table only has one thing in it. This might seem like a pointless definition because in most database systems (including MySQL), it's the only way to do things. It's just not possible to violate the first normal form if you're using MySQL.
Strictly defined, a table is in second normal form (2NF) if no non-prime attribute depends on a proper subset of any candidate key. (Actually, it also has to be in First Normal Form, but that goes without saying.) That is quite a mouthful. So let's break it down bit by bit and use an example of a table that's not in 2NF to help explain what was going on.
Login | Name | DOB | Phone Number | Job Title |
bsmith12 | Bob Smith | 3/12/1980 | 555-1111 | lecturer |
bsmith12 | Bob Smith | 3/12/1980 | 555-1234 | lecturer |
. | . | . | . | . |
. | . | . | . | . |
. | . | . | . | . |
We're assuming that there's a lot more data in this table. For this example, we're also going to assume that logins are unique, that every employee only has one job title, and that no two employees have the same name AND date of birth. But it is possible for two employees to have the same phone number, or the same date of birth, or the same full name. Also, as shown by the two rows above, employees can have multiple phone numbers.
Note that the definition of 2NF doesn't care about what the primary key is. It does care about candidate keys, however. In our example, there are two candidate keys. Now you might want to use the login as the primary key, but it's no good by itself because while it uniquely identifies a person, it doesn't uniquely identify a row in this database. You'd have to combine the login with the phone number, and together they are a candidate key. Similarly, Name, DOB, and Phone Number together form a candidate key.
The definition of 2NF talks about "non-prime" attributes. Those are the columns that are not part of any candidate key. In this case, the only non-prime key is Job Title. The values here depend upon the values in the candidate keys. (That's what makes them candidate keys.) So if you know the Login and the Phone Number in a particular row, then there's only one possible Job Title in that row (because the login uniquely identifies a particular employee and each employee only has one job title). Also, if you know the Name, DOB, and Phone Number, in a particular row, then that's also enough to fix the Job Title in that row. This is what it means to say that the Job Title "depends" on the candidate keys.
Of course, you may have noticed that the Phone Number was really of no help in determining the Job Title. Once we knew the Login (or the Name and DOB), that was enough to fix the Job Title. The Job Title actually depends on a part of the candidate key, not the whole candidate key. "Proper subset" is just a mathy way of saying a part of something and not the whole thing. So because the Job Title depends on part of one of the candidate keys, this table is not in second normal form.
So to rephrase the definition of 2NF: a table violates second normal form if it has a column that's not part of a candidate key and which depends only on part of one of the candidate keys.
Your instincts may tell you that the real problem is with the phone numbers, and in a way, you are right, but the reason why it's a problem can be seen when you think about what happens to the database when an employee gets a new job title. You have to change their job title every place it appears, and it appears redundantly here, so that's a problem.
The fix is to do what we did yesterday, and to make a separate table for phone numbers.
So now we come to third normal form (3NF). Strictly defined, a table is in 3NF if every attribute depends on all of the primary key and on nothing but the primary key. (Also, it has to be in second normal form before you even ask about third normal form.)
Right off the bat, there are a few obvious differences from the 2NF definition. Here, we talk about any attribute, not just the non-prime ones. Also, the definition depends on what we chose for our primary key, and not on any candidate key. So we'll need to make sure in our example that we specify what the primary key is. I'll use a different* example than the one in class here.
*The example we used in class (with the birth years and zodiac signs) is still a good example of breaking the "all of the primary key" part because the sign only depends on the date of birth, but as a result, it also violates second normal form. So I thought an example that broke the other part of the 3NF definition ("nothing but the primary key") would be more useful.
ISBN | Title | Author | Author's Birth Year |
0380789035 | American Gods | Neil Gaiman | 1960 |
140123402X | World's End | Neil Gaiman | 1960 |
9997531590 | World's End | Upton Sinclair | 1878 |
. | . | . | . |
. | . | . | . |
. | . | . | . |
So imagine this is a table for a library database with information about the books in the library. The primary key is ISBN. In fact, it's the only possible primary key (in other words, it's the only candidate key). This is in second normal form because all of the other columns are uniquely determined once you know the ISBN. All the non-prime keys depend on the only candidate key, and since the only candidate key is just one column, they can't depend on only part of that key.
This table fits the part of the 3NF definition that requires the attributes to depend on "all of the primary key". But it doesn't fit the "nothing but the primary" key part. Because you can see that the Author's Birth Year depends on the Author. Once you know the author, the author's birth year is fixed. So this table is not in third normal form.
There is such a thing as fourth normal form and even fifth normal form, but we won't get into that in this class.
Next we moved on to doing more advanced queries in MySQL. I started by opening up the database we used last time with the employees and their phone numbers.
[ewennstr@silo ~]$ cd mysql2
[ewennstr@silo mysql2]$ ./connect_as_ewennstr
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 14
Server version: 5.6.19 MySQL Community Server (GPL)
Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use corporation;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql>
Just as a reminder, I looked at what tables the database corporation
has, along with some information about those tables.
mysql> show tables;
+-----------------------+
| Tables_in_corporation |
+-----------------------+
| employees |
| phonenumbers |
+-----------------------+
2 rows in set (0.00 sec)
mysql> describe employees;
+-----------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+----------+------+-----+---------+-------+
| login | char(8) | NO | PRI | NULL | |
| firstname | char(30) | YES | | NULL | |
| lastname | char(50) | YES | | NULL | |
| dob | date | YES | | NULL | |
+-----------+----------+------+-----+---------+-------+
4 rows in set (0.00 sec)
mysql> describe phonenumbers;
+-------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+-------+
| login | char(8) | NO | PRI | | |
| phonenumber | char(20) | NO | PRI | | |
+-------------+----------+------+-----+---------+-------+
2 rows in set (0.00 sec)
You can see what the fields are, what their types are, whether they're allowed to have null values, whether they're part of the primary key, and what the default value is. (I have no idea what "Extra" is for.) Note that by definition, primary key columns can't have null entries.)
Next, we took a look at the data in the actual tables by running very simple queries. (For a complete description of all the different ways you can run MySQL queries, look here.)
mysql> select * from employees;
+----------+-----------+-----------+------------+
| login | firstname | lastname | dob |
+----------+-----------+-----------+------------+
| ewennstr | Erik | Wennstrom | 1978-08-19 |
| leesmith | Lee | Smith | 1985-07-22 |
| lesmith1 | Lee | Smith | 1946-06-15 |
+----------+-----------+-----------+------------+
3 rows in set (0.00 sec)
mysql> select * from phonenumbers;
+----------+-------------+
| login | phonenumber |
+----------+-------------+
| ewennstr | 555-1111 |
| ewennstr | 555-8989 |
| leesmith | 555-1234 |
+----------+-------------+
3 rows in set (0.00 sec)
It's worth noting that this particular display with the nicely printed ASCII tables is just part of the command-line interface we're using. All of these queries that we're writing today can be used in other ways too. For example, a PHP script could submit one of these queries to the MySQL server. In that case, the server wouldn't send back an ASCII formatted table like this, but something more like a two-dimensional array with all the values in it.
The *
says to get data from every column in the table. If we want, we can restrict which columns are returned.
mysql> select firstname, lastname from employees;
+-----------+-----------+
| firstname | lastname |
+-----------+-----------+
| Erik | Wennstrom |
| Lee | Smith |
| Lee | Smith |
+-----------+-----------+
3 rows in set (0.00 sec)
mysql> select login from phonenumbers;
+----------+
| login |
+----------+
| ewennstr |
| ewennstr |
| leesmith |
+----------+
3 rows in set (0.00 sec)
We can also resort the results.
mysql> select * from employees order by lastname;
+----------+-----------+-----------+------------+
| login | firstname | lastname | dob |
+----------+-----------+-----------+------------+
| leesmith | Lee | Smith | 1985-07-22 |
| lesmith1 | Lee | Smith | 1946-06-15 |
| ewennstr | Erik | Wennstrom | 1978-08-19 |
+----------+-----------+-----------+------------+
3 rows in set (0.00 sec)
mysql> select * from employees order by lastname desc;
+----------+-----------+-----------+------------+
| login | firstname | lastname | dob |
+----------+-----------+-----------+------------+
| ewennstr | Erik | Wennstrom | 1978-08-19 |
| leesmith | Lee | Smith | 1985-07-22 |
| lesmith1 | Lee | Smith | 1946-06-15 |
+----------+-----------+-----------+------------+
3 rows in set (0.00 sec)
If there are rows which have the same value in a particular column (such as for the Logins column in our Phonenumbers table), we can group those rows together and summarise the results. In this next query, we counted the number of different phone numbers that appear for each login in the Phonenumbers table.
mysql> select login, count(phonenumber) from phonenumbers group by login;
+----------+--------------------+
| login | count(phonenumber) |
+----------+--------------------+
| ewennstr | 2 |
| leesmith | 1 |
+----------+--------------------+
2 rows in set (0.00 sec)
If you have a column with numerical values, you can also combine those values by finding the average (using the avg()
function) or adding them up (using the sum()
function). But note that these so-called "aggregate" functions only make sense if you have a group by
clause in the query. You can see a list of all the aggregate functions you can use with group by
here.
Sometimes it's useful to give a name (this is called an "alias") to a calculated value, either to make it look nicer:
mysql> select login, count(phonenumber) as phones from phonenumbers group by login;
+----------+--------+
| login | phones |
+----------+--------+
| ewennstr | 2 |
| leesmith | 1 |
+----------+--------+
2 rows in set (0.00 sec)
Or if you want to use the alias later in a different part of the query:
mysql> select login, count(phonenumber) as phones from phonenumbers group by login having phones>=2;
+----------+--------+
| login | phones |
+----------+--------+
| ewennstr | 2 |
+----------+--------+
1 row in set (0.00 sec)
Here, you can see the having
keyword at work. It acts as a filter, only leaving those rows that match the given condition. There is also the keyword where
, which works in a very similar way. The primary difference is that a where
clause is calculated before MySQL aggregates things with group by
, and a having
clause is calculated after. So if you want to refer to something that was calculated by grouping things together (like an average, sum, or count), then you need to use having
. If you want to filter rows of your table out before counting (or averaging or summing), then you need to use a where
clause. (For example, if you were counting how many employees had a birthday in August, you'd want to include the condition about the birthdate being in August in a where
clause.) If there is no group by
clause, then it doesn't matter which you use, but it's polite to use where
in those situations.
You can use most of the same operators that you're used to from other programming languages when you're writing conditions. Although the syntax for equality is a single equal sign (=
) and not two (==
). See here for specifics.
So far, we haven't put together our two tables yet. To do this, we "join" the tables by looking for rows that have matching values in the columns that we care about. Here, we want the login from the phonenumbers database to match the login from the login from the employees database.
mysql> select firstname, lastname, phonenumber from employees,phonenumbers where employees.login = phonenumbers.login;
+-----------+-----------+-------------+
| firstname | lastname | phonenumber |
+-----------+-----------+-------------+
| Erik | Wennstrom | 555-1111 |
| Erik | Wennstrom | 555-8989 |
| Lee | Smith | 555-1234 |
+-----------+-----------+-------------+
3 rows in set (0.00 sec)
This is another use of the where
clause: to identify which columns we want to use to join the two tables. Note that when there's more than one table, we use the syntax tablename.columnname
to refer to a column in a particular table. But when columnname only appears in one of the tables, we can leave out the tablename.
There are many different ways to join tables together, but for our purposes, this should be enough. You can see all the different ways here.
Of course, for most interesting things, you'll want to join tables together and do some aggregation. (And maybe also do some filtering.)
mysql> select firstname, lastname, count(phonenumber) from employees,phonenumbers where employees.login = phonenumbers.login group by phonenumbers.login;
+-----------+-----------+--------------------+
| firstname | lastname | count(phonenumber) |
+-----------+-----------+--------------------+
| Erik | Wennstrom | 2 |
| Lee | Smith | 1 |
+-----------+-----------+--------------------+
2 rows in set (0.01 sec)
mysql> select firstname, lastname, count(phonenumber) as phones from employees,phonenumbers where employees.login = phonenumbers.login group by phonenumbers.login having phones<=1;
+-----------+----------+--------+
| firstname | lastname | phones |
+-----------+----------+--------+
| Lee | Smith | 1 |
+-----------+----------+--------+
1 row in set (0.00 sec)
Near the end of class, someone asked me if we could do this last query (where we asked for the first and last name of everyone who had no more than one phone number) without displaying the phone numbers, and the answer is yes:
mysql> select firstname, lastname from employees,phonenumbers where employees.login = phonenumbers.login group by phonenumbers.login having count(phonenumber)<=1;
+-----------+----------+
| firstname | lastname |
+-----------+----------+
| Lee | Smith |
+-----------+----------+
1 row in set (0.00 sec)