Fall Semester 2004


Lecture Notes Eleven: Storing State Information in SQL Databases
The main issue in preserving state information is where to store it. Six frequently used places are shown in the following list. They can be broadly broken down into client-side techniques (items 1 through 3) and server-side techniques (items 4 through 6).

  1. Store state in hidden fields
  2. Store state in cookies
  3. Store state in the URI
  4. Store state in web server process memory
  5. Store state in a file
  6. Store state in a database

The first approach is covered in the previous set of notes and the homework assignment. Let's now move to addressing the approach at the other end of the spectrum; but before doing that let's briefly say something about all the other approaches.

2. Maintaining State with Cookies

HTTP cookies are named bits of information that are transmitted between the server and the browser within the HTTP header. Ordinarily the server creates a cookie by including a Set-Cookie field in the HTTP header. The browser then stashes away the cookie information in a small in-memory or on-disk database. The next time the browser makes a request from that particular server, it returns that cookie in a Cookie field.

Cookies are relatively flexible. You can create cookies that will be returned to only one specific server or to any server in your domain. You can set them up so that they're returned only when users access a particular part of a document tree or any URI in the document hierarchy. They can be set to expire immediately when the user exits the browser, or they can be made to persist on the user's disk database for an extended period of time. You can also create secure cookies that are only returned to the server when a secure protocol, such as SSL, is in effect. This prevents cookies from being intercepted in transit by network eavesdroppers.

The exact format of HTTP cookies is somewhat involved and is described in the HTTP specification. Fortunately it's easy to make cookies in the right format using the CGI::Cookie module. As a matter of fact CGI.pm allows you to set and retrieve cookies that consist of entire hashes, and the conversion is handled automatically. We don't plan to look at an implementation involving cookies, but this overview is meant to give us an idea of what that would look like, just so we know if we were to.

(Stein and MacEahern do provide such an implementation, though).

3. Protecting Client Side Information

The cookie-based implementation of the hangman game is a lot classier than the first implementation. Not only does it have the advantage of maintaining state across browser sessions, but the game is also somewhat harder to cheat. While the user is actively playing the game, the cookie is kept in memory where it is difficult to read without the benefit of a debugger. However, after the user quits the browsing session, the cookie is written out to disk; determined cheaters could still find and edit the cookie database file if they wanted to make their statistics look better.

When you store information on the client side of the connection, peeking and tampering is a general problem. Fortunately, the cure is relatively simple. To prevent tampering, you can use a message authentication check (MAC) -- a form of checksum that will detect if the user has altered the information in any way. To prevent peeking, you can encrypt the information using an encryption key that is known to you but not to the user.

3.1 Message Authentication Checks

How can we add a MAC to the cookie used in the last section's example? There are many ways to compute a checksum, but the most reliable use a class of algorithms known as message digests. A message digest algorithm takes a large amount of data (usually called a "message") and crunches it through a complex series of bit shifts, rotates, and other bitwise operations until it has been reduced to a smallish number known as a hash. (Hashes like these are essentially guiding searches in Perl's associative arrays, also known as hash tables). The widely used MD5 message digest algorithm produces a 128-bit hash.

Because information is lost during the message digest operation, it is a one-way afffair: given a hash, you can't reconstruct the original message. Because of the complexity of the digest operation, it is extremely difficult to deliberately create a message that will digest to a particular hash. Changing just one bit anywhere in a message will result in a hash that is utterly unlike the previous one. However, you can confirm that a particular message was likely to have a produced a particular hash simply by running the message through the digest algorithm again and comparing the result to the hash.

To create a MAC, follow this general recipe:

  1. Choose a secret key. The key can be any combination of characters of any length. Long keys that don't spell out words and phrases are preferred. Keep the secret key well guarded.

  2. Select the fields that will be used for the MAC. You should include any field that you don't want the user to alter. You can also add consistency checking fields such as the remote browser's IP address and an expiration date. This helps protect against the information being intercepted en route by an eavesdropper and used later to impersonate the user.

  3. Compute the MAC by concatenating the fields and the secret key and running them through the digest algorithm. You actually need to concatenate the key and run the digest algorithm twice. Otherwise a technically savvy user could take advantage of one of the mathematical properties of the algorithm to append his own data to the end of the fields. (You should read the rest of the details in Stein and MacEahern). The MAC is now sent to the user along with the other state information.

  4. When the state information is returned by the user, retrieve the various fields and the MAC. Repeat the digest process and compare it to the retrieved MAC. If they match, you know that the user hasn't modified or deleted any of the fields.

To test such an implementation one can do the following: try quitting the browser, open up the cookie database file with a text editor, make some changes to the cookie by hand (try increasing your number of wins by a few notches). Then when you try to open the hangman script again, the program should bring you up short. With minor changes, everything described above can be easily adapted for use with the hidden field version of the hangman script.

3.2 Encrypting Client-Side State Information

Message authentication checks implement a "look but don't touch" policy. Users can't modify the state information, but they can still see what's there. In many web applications, there's no harm in this, but with the hangman game it has the unwanted consequence that the user can peek at the unknown word, either by viewing the page source in the fill-out form version or by quitting the browser and viewing the cookie database file.

To prevent this from happenning without abandoning client-side storage entirely, you can encrypt the state information. Your application will have the secret key necessary to decrypt the information, but without launching an expensive cryptanalysis project (and maybe not even then) the user won't be able to get at the data. Encryption can be combined with a MAC in order to obtain truly bullet-proof client-side authentication. (One can use the Crypt::CBC Perl module, and the code is in the book).

4. Storing State at the Server Side

Client-side storage of state information works well when each of the user sessions is independent of the others. But what if we wanted to combine the information from users, for example, to display a list of the top-scoring players in an Internet-wide tournament?

This is where server-side storage comes in. When you store the user information at the server side rather than the client side, you have full access to the list of all users and to the record of what they've done and what they're doing. You can crunch, tally, tabulate, and cross-reference this information to your heart's content. Server-side storage also has the advantage of being more secure, since the information never leaves the server, and it is more resilient to failure. If the user's browser crashes in the midst of accepting or updating a cookie, that information isn't lost because it's stored safely on the server. The downside is scalability and performance. Each user session that you store on the server side consumes some amount of memory, disk, and CPU cycles. When you store state information on the server side, you have to be careful to conserve these resources, for example by deleting user sessions that are no longer in use.

We will consider two types of server-side techniques in this section: storing the information transiently in main memory and storing it in an SQL database.

4.1 Storing State Information in Main Memory

Because Apache server processes are persistent across multiple accesses, you can store small amounts of state information in main memory. When the user first runs your application, it generates a random unique session identifier (session ID) and stores the state information in a data structure, for instance, a hash table keyed by the session ID. The application then sends the session ID back to the user in the form of a cookie, a hidden field, or a component of a URI. When the same user connects again, your application recovers the session ID and retrieves the state information from its data structure. (That, if you are using mod_perl to write Apache modules, and we won't.)

The main problem with this technique is that the amount of state information that you can store in shared memory is very limited, making it unsuitable for high-volume or high-reliability applications. A better server-side solution involves using relational database management systems (RDBMS's), which we turn to in this next section, the main topic of these notes.

4.2 Storing State Information in SQL Databases

Persistent memory is only suitable for storing small amounts of state information for relatively short periods of time. If you need to reliably store lots of information for a long time, you need a server-side database.

The DBI library, designed by Tim Bunce and others, is a generic Perl interface to relational database management systems (RDBMS's) that speak SQL (Structured Query Language). The DBI library speaks to specific databases by way of DBD (DataBase Driver) modules. You can make queries on any database that has a DBD module available for it. These modules are sometimes provided by the database vendor, and sometimes by third parties. DBD modules for Oracle, Sybase, Illustra, mSQL, MySQL, and others can be found at CPAN.

Full information on using DBI can be found in the Perl DBI book. We'll summarize just enough here so that you can follow the examples if you are not already familiar with DBI.

Before you can work with the DBI interface, you must select and install a relational database. We have MySQL installed, and that's what we will use.

In relational databases, all information is organized in tables. Each row of the table is a data record, and each column is a field of the record. For example here's one way to represent the hangman data:

table: dgerman_hangman
+----------+--------+-------+------+---+------------+-----+--------------+
|session_id|  WORD  |GUESSED|GAMENO|WON|GUESSES_LEFT|TOTAL|      modified|
+----------+--------+-------+------+---+------------+-----+--------------+
|fd2c95dd1 |entice  |e      |    10|  6|           6|   34|19980623195601|
|97aff0de2 |bifocals|aeilort|     4|  2|           3|   20|19980623221335|
+----------+--------+-------+------+---+------------+-----+--------------+
Most of the columns in the table above directly correspond to the fields in the now-familiar hangman state object. In addition to these fields we add two more columns.
  1. session_id is a string that uniquely identifies each user session and is used as a key in to the table for fast record lookup. For reasons that will become apparent soon, we use a short hexadecimal string as the session ID. We also add a timestamp field named

  2. modified which holds the date and time at which the record was last changed. If you look carefully, you'll see that the column consists of the four-digit year and two digits each for the for the month, day, hour, minute, and second. This timestamp will come in handy for detecting old unused sessions and clearing them out periodically.

In SQL databases, each table column has a defined data type and a maximum field length. Available data types include integers, floating point numbers, character strings, date/time types, and sometimes more esoteric types. Unfortunately the data types supported by database management systems vary considerably, limiting the portability of applications among different vendors' products. In this chapter, our examples use MySQL data types and functions. One may have to make some modifications in order to support another database system.

The most basic way to communicate with an SQL database is via a text monitor -- a small terminal-like application in which you type SQL queries to the database and view the results. To create the definition for the table shown above, you could issue the SQL CREATE command:

mysql> create table dgerman_hangman (
    ->   session_id   char(8) primary key, 
    ->   WORD         char(30), 
    ->   GUESSED      char(26), 
    ->   GAMENO       int, 
    ->   WON          int, 
    ->   GUESSES_LEFT int, 
    ->   TOTAL        int, 
    ->   modified     timestamp
    -> ); 
Query OK, 0 rows affected (0.06 sec)
This declares a table named dgerman_hangman using the MySQL syntax. The session_id column is declared to be a string of at most eight characters, and it is also declared to be the primary key for the table. This ensures that a given session ID is unique, and speeds up table lookups considerably. The WORD and GUESSED columns are declared to be strings of at most 30 and 26 characters, respectively, and GAMENO, WON, GUESSES_LEFT, and TOTAL are delcared to be integers (using the default length). We declare the column named modified to be a timestamp, taking advantage of a MySQL-specific feature that updates the field automatically whenever the record that contains it is changed.

You can then load some sample data into the database using an SQL INSERT statement:

mysql> insert into dgerman_hangman  
    ->   (session_id, WORD, GUESSED, GAMENO, WON, GUESSES_LEFT, TOTAL)
    -> values 
    ->   ('a0000001', 'spruce',  '',      1,   0,            6,     0);
Query OK, 1 row affected (0.02 sec)

mysql> 
This inserts the indicated values for the columns session_id through TOTAL. We don't explicitely set the value of the modified column because MySQL takes care of that for us.

We can now perform some queries over the database using the SQL SELECT statement.

To see everything in the dgerman_hangman table:

mysql> select * from dgerman_hangman;
+------------+--------+---------+--------+------+--------------+-------+----------------+
| session_id | WORD   | GUESSED | GAMENO | WON  | GUESSES_LEFT | TOTAL | modified       |
+------------+--------+---------+--------+------+--------------+-------+----------------+
| a0000001   | spruce |         |      1 |    0 |            6 |     0 | 20020704022752 |
+------------+--------+---------+--------+------+--------------+-------+----------------+
1 row in set (0.01 sec)

mysql> 
The part of the query following the SELECT command chooses which columns to display. In this case we use * to indicate all columns. The FROM keyword names the table to select the data from.

If we wished to look at just the

fields from the table, we could use this query:
mysql> select session_id, WORD, GAMENO from dgerman_hangman; 
+------------+--------+--------+
| session_id | WORD   | GAMENO |
+------------+--------+--------+
| a0000001   | spruce |      1 |
+------------+--------+--------+
1 row in set (0.00 sec)

mysql> 
An optional WHERE clause allows us to filter the records so that only records matching a set of criteria are displayed. For example, this query shows only session records from players who have played five games or more:
mysql> insert into dgerman_hangman 
    ->   (session_id, WORD, GUESSED, GAMENO, WON, GUESSES_LEFT, TOTAL) 
    -> values 
    ->   ('fd2c95dd1', 'entice'  ,       'e', 10, 6, 6, 34), 
    ->   ('97aff0de2', 'bifocals', 'aeilort',  4, 2, 3, 20); 
Query OK, 2 rows affected (0.00 sec)

mysql> select * from dgerman_hangman;
+------------+----------+---------+--------+------+--------------+-------+----------------+
| session_id | WORD     | GUESSED | GAMENO | WON  | GUESSES_LEFT | TOTAL | modified       |
+------------+----------+---------+--------+------+--------------+-------+----------------+
| a0000001   | spruce   |         |      1 |    0 |            6 |     0 | 20020704022752 |
| fd2c95dd   | entice   | e       |     10 |    6 |            6 |    34 | 20020704024506 |
| 97aff0de   | bifocals | aeilort |      4 |    2 |            3 |    20 | 20020704024506 |
+------------+----------+---------+--------+------+--------------+-------+----------------+
3 rows in set (0.01 sec)

mysql> select session_id, WORD, GAMENO from dgerman_hangman where GAMENO >= 5; 
+------------+--------+--------+
| session_id | WORD   | GAMENO |
+------------+--------+--------+
| fd2c95dd   | entice |     10 |
+------------+--------+--------+
1 row in set (0.02 sec)

mysql> 
This query retrieves the session with the ID a0000001:
mysql> select session_id, WORD, GAMENO from dgerman_hangman where session_id = 'a0000001'; 
+------------+--------+--------+
| session_id | WORD   | GAMENO |
+------------+--------+--------+
| a0000001   | spruce |      1 |
+------------+--------+--------+
1 row in set (0.00 sec)

mysql> 
Finally, this query retrieves all sessions that were modified within the past 24 hours:
mysql> select session_id, WORD, GAMENO from dgerman_hangman
    -> where unix_timestamp() - unix_timestamp(modified) < 60 * 60 * 24; 
+------------+----------+--------+
| session_id | WORD     | GAMENO |
+------------+----------+--------+
| a0000001   | spruce   |      1 |
| fd2c95dd   | entice   |     10 |
| 97aff0de   | bifocals |      4 |
+------------+----------+--------+
3 rows in set (0.00 sec)

mysql>
Here's how one would select the sessions modified within the past 10 minutes (in slow motion):
mysql> select session_id, WORD, GAMENO, modified from dgerman_hangman;
+------------+----------+--------+----------------+
| session_id | WORD     | GAMENO | modified       |
+------------+----------+--------+----------------+
| a0000001   | spruce   |      1 | 20020704022752 |
| fd2c95dd   | entice   |     10 | 20020704024506 |
| 97aff0de   | bifocals |      4 | 20020704024506 |
+------------+----------+--------+----------------+
3 rows in set (0.00 sec)

mysql> select session_id, WORD, GAMENO, modified from dgerman_hangman 
    -> where unix_timestamp() - unix_timestamp(modified) < 60 * 10; 
+------------+----------+--------+----------------+
| session_id | WORD     | GAMENO | modified       |
+------------+----------+--------+----------------+
| fd2c95dd   | entice   |     10 | 20020704024506 |
| 97aff0de   | bifocals |      4 | 20020704024506 |
+------------+----------+--------+----------------+
2 rows in set (0.00 sec)

mysql> 
These last two examples show the use of the MySQL-specific unix_timestamp() function. Called without arguments, unix_timestamp() returns the current time and date as the number of seconds since the start of the Unix epoch. The function can also be called with a timestamp field as the argument, in which case it operates on the timestamp rather than the current time. The effect of the query above is to subtract the modified field from the current time and compare the difference to one day. The SQL language allows you to form queries that are substantially more complex than these, including ones that that combine the results of multiple tables. We may say more about this later.

The INSERT statement can only be used to create a new record (or row) of the table. If we were to try to execute the insertion statement shown earlier a second time, the attempt would fail because any given session ID can only occur once in the table. This feature guarantees the uniqueness of session IDs. To change the values in an existing record, we would use an UPDATE statement instead.

A typical UPDATE statement looks like this:

mysql> update dgerman_hangman set gameno=gameno+1   
    -> where session_id='a0000001'; 
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select session_id, WORD, GAMENO, modified from dgerman_hangman; 
+------------+----------+--------+----------------+
| session_id | WORD     | GAMENO | modified       |
+------------+----------+--------+----------------+
| a0000001   | spruce   |      2 | 20020704030627 |
| fd2c95dd   | entice   |     10 | 20020704024506 |
| 97aff0de   | bifocals |      4 | 20020704024506 |
+------------+----------+--------+----------------+
3 rows in set (0.00 sec)

mysql> 
Lastly, the DELETE statement can be used to delete all records that satisfy the criteria set out in the WHERE clause. This query deletes all sessions older than a day:
mysql> delete from dgerman_hangman
    -> where unix_timestamp() - unix_timestamp(modified) > 60 * 60 * 24; 
Query OK, 0 rows affected (0.00 sec)

mysql> 
Let's try again with a 15 minute expiration limit:
mysql> delete from dgerman_hangman 
    -> where unix_timestamp() - unix_timestamp(modified) > 60 * 15;
Query OK, 2 rows affected (0.00 sec)

mysql> 
If you forget to include a WHERE clause in the UPDATE and DELETE statements, every record in the database will be affected by the operation. This is generally to be avoided.

4.3 Using DBI

The DBI interface provides methods for opening SQL databases, sending queries to the opened database, and reading the answers returned by those queries.

To open a database, you call DBI->connect() with the "data source name," a string that tells the database driver where the database is located. If the database requires a username and password for access, you can pass that information in the connect() call as well. The format of the data source name is DBMS-specific. For a MySQL database, it looks like this:

"dbi:mysql:$database:$hostname:$port"
All MySQL data sources begin with "dbi:mysql". They are followed by the name of the database, and, optionally, by the name and port of the remote host on which the DBMS is running. If the hostname and port are omitted, the driver defaults to using a standard port on the local host. To connect to a database named www on the local host using the username games and the password grok, you'd make this call:
$dbh = DBI->connect('dbi:mysql:www', 'games', 'grok'); 
If successful, connect() returns a database handle, $dbh, which is used for subsequent communication with the database. The connect() method also accepts an optional fourth argument which consists of a hash reference of parameter name=value pairs. These control a variety of database options, such as whether to automatically commit all changes made to the database. The only option that we'll use in the examples that follow is PrintError, which when set to false, supresses the printing of unwanted warnings to the server error log.

The database handle has several methods, the most important of which are do(), prepare(), and errstr(). do() is used to execute SQL statements which do not return a list of records, such as INSERT, DELETE, UPDATE, or CREATE. If the operation is successful, do() returns a count of the number of rows modified. For example, the following query sets the GAMENO field of all sessions to 1, and returns the number of rows affected:

$count = $dbh->do('UPDATE dgerman_hangman SET GAMENO=1'); 
die $dbh->errstr unless defined $count;
If the database encountered an error while processing the statement (for example the SQL contained a syntax error), it will return undef. The errstr() method can be used to retrieve an informative error message from the driver.

SELECT queries can return a potentially large number of records, often more than will fit into memory at once. For this reason, the results from SELECT queries are returned in the form of statement handle objects. You then call the statement handle's fetch() method repeatedly to retrieve each row of the result. Here's an example of retrieving the session_id and WORD fields from each session in the hangman database:

$sth = $dbh->prepare('SELECT session_id, WORD FROM dgerman_hangman') 
       || die $dbh->errstr;
$sth->execute() || die $sth->errstr;
while (my $row = $sth->fetch) {
  my($session, $word) = @$row; 
  print "$session => $session, $word => $word\n"; 
} 
$sth->finish; 
The example starts with a call to the database handle's prepare() method with the text of the SQL SELECT method. prepare() parses the SQL and checks it for syntactic correctness but does not actually execute it. The query is returned as a statement handler which we store into the variable $sth. If some error occurred while preparing the statement, prepare() returns undef, in which case we return the errstr() error text.

Next we call the statement handler's execute() method. This performs the query and returns either the number of rows retrieved or undef if an error occurred. In the case of a syntactically correct query that happens to return no rows (because the table is empty or because no records satisfied the criteria in the WHERE clause), execute() returns the value 0E0 which Perl regards as true in a logical context, but as zero in a numeric one.

Now we enter a loop in which we call the statement handler's fetch() method. Each time it's called, fetch() returns the requested columns in the form of an array reference. To retrieve the values themselves, we just dereference the value into a list. Because we requested the column session_id and WORD, we get a reference to a two-item array back from fetch(). When there are no more rows left, fetch() returns undef.

DBI actually offers a family of fetch functions. fetchrow_array() is like fetch(), but it dereferences the row first and returns an array corresponding to the list of requested columns. Another function, fetchrow_hashref(), turns the current row into a hash of the column names and their values and returns the hash's reference to the caller. This allows us to make the example above more readable at the cost of making it somewhat less efficient:

$sth = $dbh->prepare('SELECT session_id, WORD FROM dgerman_hangman')
       || die $dbh->errstr; 
$sth->execute || die $sth->errstr;
while (my $row = $sth->fetchrow_hashref) {
  print "session => $row->{session_id}, word => $row->{WORD}\n"; 
}
$sth->finish;
DBI also provides a fetchrow_arrayref() method for fetching the row as an array reference. It is identical in every respect to fetch(). Also, when you are finished with a statement handler, you should call its finish() method in order to free up the resources it uses.

The last thing you need to know about statement handlers is that many DBI drivers allow you to put placeholders, indicated by the ? character, inside SQL statements. prepare() compiles the statement and returns a statement handler as before, but when you later call execute() you pass in the values to be substituted into the placeholders. This allows you to treat statement handlers much as you would a subroutine by calling it repeatedly with different runtime arguments. For example, we can create a statement handler for returning the entire row of a given session with this bit of code:

$sth = $dbh->prepare('SELECT * FROM dgerman_hangman WHERE session_id=?'); 
Now we can fetch information on session fd2c95dd, by calling the statement handler's execute() method this way:
$sth->execute('fd2c95dd');
The same statement handler can later be used to fetch information from other named sessions. You should still call finish() at the end of each series of fetches, even though you are going to reuse the statement handler. Failure to do so can lead to memory leaks.

When you are completely finished with a database handle, you should call its disconnect() method in order to sever the connection and clean up.

4.4 A DBI Backend for Hangman

Like the persistent memory version of the hangman game, the DBI implementation has to have code to open the database, to set and fetch session records from the database, to generate unique session IDs for each incoming connection, and to expire old sessions that we're no longer interested in. The code presented at the end of these notes shows what's new and different on the server side. There are no visible changes in the user interface.

This script assumes a database has already been set up that contains a table named hangman (don't forget our convention of prefixing the name with your username so we can distinguish the tables) with this structure (I used to have such a table, now I drop it, so I can create a new one):

mysql> show tables like '%hangman%'; 
+----------------------------+
| Tables_in_a348 (%hangman%) |
+----------------------------+
| dgerman_hangman            |
+----------------------------+
1 row in set (0.02 sec)

mysql> drop table dgerman_hangman;
Query OK, 0 rows affected (0.00 sec)

mysql> show tables like '%hangman%'; 
Empty set (0.02 sec)

mysql> create table dgerman_hangman (
    ->   session_id      char(8) primary key, 
    ->   WORD            char(30), 
    ->   GUESSED         char(26), 
    ->   GAMENO          int, 
    ->   WON             int, 
    ->   GUESSES_LEFT    int, 
    ->   TOTAL           int, 
    ->   modified        timestamp,
    ->   key(modified)
    -> ); 
Query OK, 0 rows affected (0.00 sec)

mysql> describe dgerman_hangman;
+--------------+---------------+------+-----+---------+-------+---------------------------------+
| Field        | Type          | Null | Key | Default | Extra | Privileges                      |
+--------------+---------------+------+-----+---------+-------+---------------------------------+
| session_id   | char(8)       |      | PRI |         |       | select,insert,update,references |
| WORD         | char(30)      | YES  |     | NULL    |       | select,insert,update,references |
| GUESSED      | char(26)      | YES  |     | NULL    |       | select,insert,update,references |
| GAMENO       | int(11)       | YES  |     | NULL    |       | select,insert,update,references |
| WON          | int(11)       | YES  |     | NULL    |       | select,insert,update,references |
| GUESSES_LEFT | int(11)       | YES  |     | NULL    |       | select,insert,update,references |
| TOTAL        | int(11)       | YES  |     | NULL    |       | select,insert,update,references |
| modified     | timestamp(14) | YES  | MUL | NULL    |       | select,insert,update,references |
+--------------+---------------+------+-----+---------+-------+---------------------------------+
8 rows in set (0.00 sec)

mysql> 
The modified field is a MySQL-specific data type, and later we will take advantage of other MySQL features involving the handling of dates. SQL databases vary widely in their handling of dates and times, and we prefer to show you an efficient implementation of the application on a specific database than an inefficient implementation that might work more generically. To port this code to the database of your choice, you will need to change the data type of the modified column to a date/time type that your database understands and modify the expires() subroutine to work with this changed type.

Before stepping through the script, let's first look at get_state() and save_state():

# get the state from the database ------------------------------get_state---
sub get_state {

  my $id = shift;  

  my $query = 
      "SELECT * FROM $DB_TABLE WHERE session_id = '$id' AND WORD LIKE '%'"; 

  my $sth = $DBH->prepare($query) || die "Prepare: ", $DBH->errstr;

  $sth->execute || die "Execute: ", $sth->errstr; 

  my $state = $sth->fetchrow_hashref; 

  $sth->finish; 

  return $state; 

}
get_state() is responsible for recovering the state information as a hash reference, given the ID of an existing session. At its core is this SQL statement:

select from dgerman_hangman where session_id='$id'
This selects all columns from the record named by the session ID. We then call DBI's fetchrow_hashref() to retrieve the record in the form as a hash reference in which the keys (WORD, GUESSED, GAMENO, and so on) correspond to the columns of the selected record. As it happens, this hashref is identical to the state variable that the higher levels of the script operate on, so all we have to do is to return it.

The save_state() subroutine is almost as simple:

# save the state in the database ------------------------------save_state---
sub save_state {
  my ($state, $id) = @_; 
  my $sth = $DBH->prepare(<<END) || die "prepare: ", $DBH->errstr;
UPDATE $DB_TABLE
  SET WORD=?,GUESSED=?,GAMENO=?,WON=?,TOTAL=?,GUESSES_LEFT=?
  WHERE session_id='$id'
END
  $sth->execute(@{$state}{qw(WORD GUESSED GAMENO WON TOTAL GUESSES_LEFT)})
    || die "execute: ", $DBH->errstr;
  $sth->finish; 
} 
This subroutine constructs a DBI statement handler containing placeholders for the six keys in $state. It then calls the statement handler's execute() statement to write the values from $state into the database.

The remainder of the code is concerned with the generation and maintenance of session IDs. Although most of the state information is stored on the server's side of the connection, there's more to the story. There will always have to be some information stored by the client, because otherwise, there would be no way for the server to distinguish one client from another and, hence, no way to retrieve the correct session record. Some of the obvious ways of distinguishing one client from another, such as recording their IP addresses, do not work well in practice (a dial-in user may have several IP addresses, and conversely, all America Online users share the IP address of a few large proxy servers). The general technique for identifying clients is to generate a session ID for them when they first connect to your application and then arrange for them to return the session ID to you on subsequent requests. A session ID can be anything you like. In the hangman game we use an eight-digit hexadecimal number, which is sufficient for about four billion active sessions.

We've already seen two techniques that can be adapted to this purpose: HTTP cookies and fill-out forms. Because the session ID is a relatively small amount of information, there's also a third option available to us. We can store the session ID in the URI itself as additional path information. When a connection comes in from a new client we assign it a randomly generated ID, append it to our URI as additional path information, and send the client an HTTP redirect() directive to make it fetch this new URI. On subsequent requests, we recover the session ID from the additional path information. This technique has an advantage over cookies in that it is compatible with all browsers, including those for which the user has disabled cookies. It has the disadvantage that the session ID is visible to the user. The URI displayed by the browser will look something like this:

http://burrowww.cs.indiana.edu:17600/cgi-bin/stein/dbi/d00f9f35
A side benefit of this technique is that the user can bookmark this URI, session ID and all, and come back to a game later.

Beginning our walkthrough of the script, we bring in the DBI library and define a few new constants:

#!/usr/bin/perl

use DBI; 
use MD5; 
use CGI; 

$WORDS     = '/usr/share/lib/dict/words'; 
$TRIES     = 6; 
$EXPIRE    = 60 * 60 * 24 * 30;     # allow 30 days before expiration 
$DB        = "DBI:mysql:a348";      # data source name (database) 
$username  = "a348";                # username
$password  = "a348AG";              # password
$DBAUTH    = "$username:$password"; 
$SECRET    = " ***( something secret, whatever )*** "; 
$MAX_TRIES = 10; 
$DB_TABLE  = "dgerman_hangman"; 
$ID_LENGTH = 8;                     # length of the session ID 

$q = new CGI; 
Here are brief notes about each:
# Open the database --------------------------------------------------------
$DBH = DBI->connect($DB, $username, $password, {PrintError => 0})
  || die "Couldn't open database: ", $DBI::errstr; 

# get the current session ID, or make one ----------------------------------
my ($session_id, $note) = &get_session_id(); 
The script begins by opening the database and saving its database handle in a global named $DBH. Next, we retrieve the session ID (or generate a new one) by calling a subroutine named get_session_id(). get_session_id() returns a two-element list: the session ID and a note that can be used to alert the user to exceptional conditions. In this script, the only exceptional condition that occurs is when the user tries to use a session ID that has expired.
# retrieve the state -------------------------------------------------------
my $state = &get_state($session_id) unless $q->param('clear'); 

# reinitialize if we need to -----------------------------------------------
if (!$state || $q->param('restart')) { $state = &initialize($state) }

# process the current guess, if any ----------------------------------------
my ($message, $status) = &process_guess($q->param('guess') || '', $state); 

# save the modified state --------------------------------------------------
&save_state($state, $session_id); 
With the session ID in hand, we retrieve the state by calling the get_state() subroutine that we looked at earlier. We then (re)initialize the state variable as before if need be, process the user's guess if any, and call save_state() to write the modified session back to the database. The remainder of the script is unchanged from previous versions, except that we display the note returned by get_session_id() at the top of the page if it's nonempty.

We'll look at the get_session_id() subroutine now, which is responsible for retrieving an existing session ID or generating a new one.

# retrieve the session ID from the path info. if it's -----get_session_id---
# not already there, add it to the path info (more or less) with a redirect 
sub get_session_id {

  my (@result); 
  &expire_old_sessions(); 
  my ($id) = $q->path_info() =~ m:^/([a-h0-9]{$ID_LENGTH}):o;
  return @result if $id and @result = &check_id($id); 
  
  # if we get here, there's not already an ID in the path info 
  my $session_id = &generate_id(); 
  die "Couldn't make a new session id" unless $session_id; 
  print $q->redirect($q->script_name() . "/$session_id"); 
  exit 0; 
}
This subroutine first expires all out-of-date sessions by calling expire_old_sessions(). (If there are many session records, expire_old_sessions() will rapidly become a performance drain on the script.) In high-volume applications, you will want to move session expiration into a separate standalone process that runs at regular intervals under the Unix cron or NT at utilities. For the hangman application, a nightly expiration is more than sufficient.) Next it calls CGI.pm's path_info() function to return the additional path information and attempt to match it against the expected session ID pattern. If a likely looking session ID is found, we call check_id() to ensure that the session ID actually corresponds to a database record. Otherwise, we call generate_id() to create a new session ID. We append the ID to our URI (using CGI.pm's script_name() function), incorporate it into a call to redirect(), and exit. In this case, the subroutine never returns to the caller, but the redirected browser immediately generates a second call to the script, this time with the session ID appended to the URI.

The expire_old_sessions() is simple:

sub expire_old_sessions { # --------------------------expire_old_sessions---
  $DBH->do(<<END); 
DELETE FROM $DB_TABLE
  WHERE (unix_timestamp() - unix_timestamp(modified)) > $EXPIRE
END
}
The subroutine consists of a single DBI call that sends an SQL DELETE statement to the database. The effect of the call is to delete all session records that are older than the time limit set by the $EXPIRE constant.

generate_id(), which chooses new session IDs, is slightly more complex:

# find a new unique ID and insert it into the database -------generate_id---
sub generate_id {
  # create a new session id 
  my $tries = 0; 
  my $id = &hash($SECRET . rand()); 

  while ($tries++ < $MAX_TRIES) {
    last if $DBH->do("INSERT INTO $DB_TABLE (session_id) VALUES ('$id')"); 
    $id = &hash($id); 
  }

  return undef if $tries >= $MAX_TRIES; # we failed 
  return $id; 
}
The reason for this complexity is that it is important to generate a unique session ID in such a way that valid session IDs cannot be trivially guessed. Otherwise it would be possible for a malicious person to hijack another user's session by misappropriating that user's session ID. This is not important in the hangman game, but becomes an issue in transactions in which things of value (money, merchandise, confidential information) are changing hands. A simple sequence of session IDs, such as choosing one higher than the previous highest, is too obvious. IDs generated from the rand() call are unreliable as well, because once you know where you are in the series, you can generate all the subsequent values.

Instead, we use a combination of rand() and the MD5 message digest algorithm. We begin by computing the MD5 hash of the value of rand() concatenated with a secret phrase. This extra concatenation step makes it impossible to derive the value of the next session ID from the previous one. Instead of calling MD5 directly, we call a small internal subroutine, hash(), to compute the MD5 hash and then truncate it to eight characters. This reduces the size of the session ID at the cost of making the ID somewhat easier to guess (The size of the session ID determines the number of guesses a would-be hijacker has to make before getting a correct one). There are about 4.3 billion eight-digit session IDs. If you have 10,000 active sessions, this means that the hijacker has to guess (and try) 430,000 IDs before getting lucky. You'll probably notice this number of hits on your server long before anything untoward happens. If you have 100,000 active sessions, however, only 43,000 guesses are required, and you might want to use a longer session ID. In practice, it's almost always easier for a hijacker to recover a session ID by some other method (such as packet sniffing) than by guessing.) We then enter a loop in which we repeatedly attempt to insert the current session ID into the database. If a record with that session ID does not already exist in the database, the insertion statement returns a true result code and we immediately return the ID. Otherwise, we generate a new trial ID by hashing the current ID concatenated with the secret, and try again. We do this up to $MAX_TRIES times, at which point we give up. This allows us to fill up the space of possible session IDs to approximately 90 percent, or around 3 billion.

The check_id() subroutine is called by get_session_id() when the browser provides a previous session ID. Its job is to check that the session ID still corresponds to a database record. If not, it attempts to insert a record with that session ID into the database and delivers a warning to the user that his game session may have expired.

# check to see that an old ID is valid --------------------------check_id---
sub check_id {

  my $id = shift; 

  return ($id, '') 

    if $DBH->do("SELECT 1 FROM $DB_TABLE WHERE session_id = '$id'") > 0; 

  return ($id, 'The record of your game may have expired. Restarting.')

    if $DBH->do("INSERT INTO $DB_TABLE (session_id) VALUES ('$id')"); 

  return (); 

}

The reason we try to reuse old session IDs is that the user may have bookmarked the URI of the game, session ID and all. We honor the bookmark so that the user doesn't have to discard it and enter a new one after his session has expired. check_id() consists of two DBI calls. In the first, it makes an SQL SELECT query looking for a record matching the provided session ID. Since we're only interested in whether the query succeeds or fails, we select a constant 1 instead of a named set of columns. If the query fails, then the database does not already contain the session ID. We call DBI again to insert the session ID into the database. If this fails (which it might in the unusual case of another instance of this script picking the same session ID from within generate_id()) we return an empty list. Otherwise, we return the ID and the warning message. Although the user has lost the record of his old set of games, his bookmarked URI will still be valid and can now be used to return to the new set.

The last new routine defined in this version of the game is hash(), which simply computes the MD5 digest of the value passed to it, then truncates it to $ID_LENGTH characters:

# generate a hash value ---------------------------------------------hash---
sub hash {
  my $value = shift; 
  return substr(MD5->hexhash($value), 0, $ID_LENGTH); 
}

Finally, here's how the whole program looks like:

#!/usr/bin/perl

use DBI; 
use MD5; 
use CGI; 

$WORDS     = '/usr/share/lib/dict/words'; 
$TRIES     = 6; 
$EXPIRE    = 60 * 60 * 24 * 30;     # allow 30 days before expiration 
$DB        = "DBI:mysql:a348";      # data source name (database) 
$username  = "a348";                # username
$password  = "a348AG";              # password
$DBAUTH    = "$username:$password"; 
$SECRET    = " ***( something secret, whatever )*** "; 
$MAX_TRIES = 10; 
$DB_TABLE  = "dgerman_hangman"; 
$ID_LENGTH = 8;                     # length of the session ID 

$q = new CGI; 

# Open the database --------------------------------------------------------
$DBH = DBI->connect($DB, $username, $password, {PrintError => 0})
  || die "Couldn't open database: ", $DBI::errstr; 

# get the current session ID, or make one ----------------------------------
my ($session_id, $note) = &get_session_id(); 

# retrieve the state -------------------------------------------------------
my $state = &get_state($session_id) unless $q->param('clear'); 

# reinitialize if we need to -----------------------------------------------
if (!$state || $q->param('restart')) { $state = &initialize($state) }

# process the current guess, if any ----------------------------------------
my ($message, $status) = &process_guess($q->param('guess') || '', $state); 

# save the modified state --------------------------------------------------
&save_state($state, $session_id); 

# start the page -----------------------------------------------------------
print $q->header, 
      $q->start_html(-title => 'Database Sessions with URL Rewriting',
                     -bgcolor => 'white'); 

print $q->p($q->font({-color=>'red'}, $note)) if $note; 

# draw the picture ---------------------------------------------------------
&picture($state); 

# draw the statistics ------------------------------------------------------
&status($message, $state); 

# prompt the user to restart or for his next guess -------------------------
if ($status =~ /^(won|lost)$/) { # to restart 
  &show_restart_form($state); 
} else {                         # for his/her next game 
  &show_guess_form($state);   
}

print $q->end_html; 

$DBH->disconnect; 

#------------(subroutines)--------------------------------------------------

# get the state from the database ------------------------------get_state---
sub get_state {

  my $id = shift;  

  my $query = 
      "SELECT * FROM $DB_TABLE WHERE session_id = '$id' AND WORD LIKE '%'"; 

  my $sth = $DBH->prepare($query) || die "Prepare: ", $DBH->errstr;

  $sth->execute || die "Execute: ", $sth->errstr; 

  my $state = $sth->fetchrow_hashref; 

  $sth->finish; 

  return $state; 

}

# called to initialize a whole new state object ---------------initialize---
# or to create a new game 
sub initialize { 

  my $state = shift; 

  $state = {} unless $state; 

  $state->{WORD} = &pick_random_word(); 

  $state->{GUESSES_LEFT} = $TRIES; 

  $state->{GUESSED} = ''; 

  $state->{GAMENO} += 1; 

  $state->{WON}    += 0; 

  $state->{TOTAL}  += 0; 

  return $state; 
}

# called to process the user's guest -----------------------process_guess---
sub process_guess { 
  my ($guess, $state) = @_; 

  # lose immediately if user has no more guesses left 
  return ('', 'lost') unless $state->{GUESSES_LEFT} > 0; 

  # create hash containing the letters guessed thus far
  my %guessed = map { $_ => 1 } $state->{GUESSED} =~ /(.)/g; 

  # create hash containing the letters in the original word 
  my %letters = map { $_ => 1 } $state->{WORD} =~ /(.)/g; 

  # return immediately if user has already guessed the word 
  return ('', 'won') unless grep (!$guessed{$_}, keys %letters); 

  # do nothing more (stop here) if no guess is provided 
  return ('', 'continue') unless $guess; 

  # this section processes individual letter guesses 
  $guess = lc $guess; 
  return ("Not a valid letter or word!", 'error') unless $guess =~ /^[a-z]+$/; 
  return ("You already guessed that letter!", 'error') if ($guessed{$guess}); 

  # this section is called when the user guesses the whole world 
  if (length($guess) > 1 && $guess ne $state->{WORD}) {
    $state->{TOTAL} += $state->{GUESSES_LEFT}; 
    return (qq{You lose. The word was "$state->{WORD}."}, 'lost');  
  }

  # update the list of guesses 
  foreach ($guess =~ /(.)/g) { $guessed{$_}++; }
  $state->{GUESSED} = join('', sort keys %guessed); 

  # correct guess -- word completely filled in
  unless (grep(!$guessed{$_}, keys %letters)) {
    $state->{WON}++; 
    return (qq{You got it! The word was "$state->{WORD}."}, 'won'); 
  }

  # incorrect guess
  if (! $letters{$guess}) { 
    $state->{TOTAL}++; 
    $state->{GUESSES_LEFT}--; 

    # user runs out of turns 
    return (qq{The jig is up. The word was "$state->{WORD}".}, 'lost') 
      if $state->{GUESSES_LEFT} <= 0; 

    return ('Wrong guess!', 'continue'); 
  } 

  # correct guess but word still incomplete 
  return ('Good guess!', 'continue'); 
}

# create the cute hangman picture --------------------------------picture---
sub picture { 
  my $state = shift; 
  my $tries_left = $state->{GUESSES_LEFT}; 
  my $picture = sprintf("/h%d.gif", $TRIES - $tries_left); 

  print $q->img( {-src=>$picture, 
                  -align=>'LEFT',
                  -alt=>"[$tries_left tries_left]"
                 } 
               ); 
}

# print the status ------------------------------------------------status---
sub status { 
  my ($message, $state) = @_; 
  print qq {
    <table width=100%> <tr> 
           <td> <b> Word #: </b> $state->{GAMENO} ($state->{WORD}) </td>  
           <td> <b> Guessed: </b> $state->{GUESSED} </td> 
      </tr> <tr> 
           <td> <b> Won: </b>  $state->{WON} </td> 
           <td> <b> Current average: </b> },
 
                    sprintf("%2.3f", $state->{TOTAL} / $state->{GAMENO}), 

       qq{ </td> <td> <b> Overall average: </b> }, 

            $state->{GAMENO} > 1 ? sprintf("%2.3f", 
                                             ($state->{TOTAL} - 
                                                ($TRIES - 
                                                 $state->{GUESSES_LEFT}
                                                )
                                             ) / ($state->{GAMENO} - 1)
                                          ) 
                                   : '0.000', 

       qq{ </td> 
      </tr> 
    </table> 
  }; 

  my %guessed = (); 
  my @guessed = $state->{GUESSED} =~ /(.)/g; 
  foreach $letter (@guessed) { 
    $guessed{$letter} = 1; 
  } # instead of my %guessed = map { $_ => 1 } $state->{GUESSED} =~ s/(.)/g; 

  print $q->h2("Word:", 
               map { $guessed{$_} ? $_ : '_' } 
                   $state->{"WORD"} =~ /(.)/g
  );
  
  print $q->h2($q->font({-color=>'red'}, $message)) if $message; 

}

# ask the user if (s)he wants to start over ------------show_restart_form---
sub show_restart_form { 
  my $state = shift;
  print $q->start_form(), 
        "Do you want to play again?",
        $q->submit(-name=>'restart', 
                   -value=>'Another game'), qq{     }, 
        $q->checkbox(-name=>'clear', -label=>'Clear scores'); 

  delete $state->{"WORD"}; 

  # &save_state($state); 

  print $q->end_form; 
} 

# print the fill-out form for requesting input -----------show_guess_form---
sub show_guess_form { 

  my $state = shift; 

  print $q->start_form(), 
        "Your guess: ", 
        $q->textfield(-name=>'guess', 
                      -value=>'', 
                      -override=>1), 
        $q->submit(value=>'Guess');

  # &save_state($state); 

  print $q->end_form; 
}

# pick a word, any word ---------------------------------pick_random_word---
sub pick_random_word { 
  open (AB, $WORDS); 
  my @words = <AB>; 
  close(AB); 
  my $chosenWord = $words[int(rand($#words + 1))]; 
  chop($chosenWord); 
  return lc $chosenWord; 
} 

# save the state in the database ------------------------------save_state---
sub save_state {
  my ($state, $id) = @_; 
  my $sth = $DBH->prepare(<<END) || die "prepare: ", $DBH->errstr;
UPDATE $DB_TABLE
  SET WORD=?,GUESSED=?,GAMENO=?,WON=?,TOTAL=?,GUESSES_LEFT=?
  WHERE session_id='$id'
END
  $sth->execute(@{$state}{qw(WORD GUESSED GAMENO WON TOTAL GUESSES_LEFT)})
    || die "execute: ", $DBH->errstr;
  $sth->finish; 
} 

# retrieve the session ID from the path info. if it's -----get_session_id---
# not already there, add it to the path info (more or less) with a redirect 
sub get_session_id {

  my (@result); 
  &expire_old_sessions(); 
  my ($id) = $q->path_info() =~ m:^/([a-h0-9]{$ID_LENGTH}):o;
  return @result if $id and @result = &check_id($id); 
  
  # if we get here, there's not already an ID in the path info 
  my $session_id = &generate_id(); 
  die "Couldn't make a new session id" unless $session_id; 
  print $q->redirect($q->script_name() . "/$session_id"); 
  exit 0; 
}

# find a new unique ID and insert it into the database -------generate_id---
sub generate_id {
  # create a new session id 
  my $tries = 0; 
  my $id = &hash($SECRET . rand()); 

  while ($tries++ < $MAX_TRIES) {
    last if $DBH->do("INSERT INTO $DB_TABLE (session_id) VALUES ('$id')"); 
    $id = &hash($id); 
  }

  return undef if $tries >= $MAX_TRIES; # we failed 
  return $id; 
}

# check to see that an old ID is valid --------------------------check_id---
sub check_id {

  my $id = shift; 

  return ($id, '') 

    if $DBH->do("SELECT 1 FROM $DB_TABLE WHERE session_id = '$id'") > 0; 

  return ($id, 'The record of your game may have expired. Restarting.')

    if $DBH->do("INSERT INTO $DB_TABLE (session_id) VALUES ('$id')"); 

  return (); 

}

# generate a hash value ---------------------------------------------hash---
sub hash {
  my $value = shift; 
  return substr(MD5->hexhash($value), 0, $ID_LENGTH); 
}

sub expire_old_sessions { # --------------------------expire_old_sessions---
  $DBH->do(<<END); 
DELETE FROM $DB_TABLE
  WHERE (unix_timestamp() - unix_timestamp(modified)) > $EXPIRE
END
}

You can play this version of the game here.

Please check the contents of dgerman_hangman as you play!

4.5 Using DBI to Its Full Advantage

We've reached the end of our excursion into server side state maintenance techniques and we should feel that we have covered (and learned) a lot. This is, as you would expect, only the beginning.

Once you keep session information stored in a database, there are all sorts of interesting things you can do with it. For example, you can easily compute statistics, such as the average number of games that users have played or how many guesses they have to make on average to arrive at the correct answer.

We could take advantage of this ability to create a "top winners" list, but we won't, at least not immediately. We feel that we've already covered enough. The book, though, has all the details and I strongly encourage you to read it, if you have the inclination, and I very much recommend and endorse it as the outstanding work that it is.


Last updated: Oct 13, 2004 by Adrian German for A348/A548