Topics for today:
  1. Announcements
  2. The basics of Web communication.
  3. A review of Apache install.
  4. A few Perl programs.

1. Announcements

A few servers were already up and running at the beginning of the second lecture. We reviewed how the servers could be accessed; the URL of a server is
http://hostname.cs.indiana.edu:portnumber
where hostname and portnumber were listed at
http://www.cs.indiana.edu/l/www/classes/a348/students.html
which also indicates which servers are running already.

For example, the server that we use for class demos can be reached at:

http://tucotuco.cs.indiana.edu:19800
What gets retrieved in response to such a request from a web browser?

The complete answer to this question is this:

Based on the shape of the URL it must be the index.html file in the DocumentRoot directory of the web server that runs on tucotuco, servicing port #19800.

But we can get even more specific about this: since tucotuco:19800 is the demo server we know it's administered by dgerman, as the web course page says. Because dgerman has followed the conventions that everybody had to follow, and since we know the burrow environment like the back of our hands, the path of the file that gets retrieved is this:

/u/dgerman/httpd/htdocs/index.html
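The mapping from URL to file described above can be sketched in a few lines of Perl. This is only an illustration of the reasoning, not how the server itself is implemented: the helper names parse_url and url_path_to_file are made up for this example, and the rule that a path ending in "/" resolves to index.html is assumed from the discussion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a URL of the form http://host:port/path
# into its parts. A missing port defaults to 80 and a missing path
# to "/", following common HTTP convention.
sub parse_url {
    my ($url) = @_;
    my ($host, $port, $path) = $url =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$}
        or return;
    return ($host, $port // 80, $path // '/');
}

# Hypothetical helper: map the URL path onto a file under DocumentRoot.
# Assumption: a path ending in "/" resolves to index.html.
sub url_path_to_file {
    my ($document_root, $path) = @_;
    $path .= 'index.html' if $path =~ m{/$};
    return $document_root . $path;
}

my ($host, $port, $path) = parse_url('http://tucotuco.cs.indiana.edu:19800');
print "contact $host on port $port\n";
print url_path_to_file('/u/dgerman/httpd/htdocs', $path), "\n";
```

For the demo server's URL the second line printed is the path given above, /u/dgerman/httpd/htdocs/index.html.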

We are now ready to start our prelude to CGI.

2. The basics of web communication.

We start by describing the relationship between a web server and a web browser. We detail the interaction that goes on behind the scenes and show what happens when a browser requests a web page from a remote server. We then introduce the CGI (Common Gateway Interface) and explain how the browser-server interaction changes with the addition of CGI scripts.

To understand how scripts interact with web browsers and servers we begin by reviewing a simpler interaction: how static HTML files are requested by users and displayed to them. Let's say you have the following simple HTML file called hello.html:

<html>
<head>
<title> Hello world! </title>
</head>
<body>
<h1> Hello world! </h1>
<p> How are you doing? </p>
</body>
</html>
Let's now assume that you put this file in your DocumentRoot and make it readable by the world.
tucotuco.cs.indiana.edu% pwd
/nfs/paca/home/user2/dgerman/httpd/htdocs
tucotuco.cs.indiana.edu% vi hello.html
tucotuco.cs.indiana.edu% cat hello.html
<html>
<head>
<title> Hello world! </title>
</head>
<body>
<h1> Hello world! </h1>
<p> How are you doing? </p>
</body>
</html>
tucotuco.cs.indiana.edu% ls -l hell*     
-rw-r--r--   1 dgerman  students     126 Sep  5 18:39 hello.html
tucotuco.cs.indiana.edu% 
Once we've created the HTML text, it may seem that the process of delivering it to a web browser should be a trivial task.

But serving even a simple page like this one requires that a lot of coordination occur between the browser and the web server on which the page is stored.

By web server we mean a program residing on a host machine that uses the Hypertext Transfer Protocol (HTTP) to communicate with the browser. Your

/u/username/httpd/httpd
is such a program.

The web is based on a client-server model. This means that there is a server (that provides resources) and a client (which requests them). We need to keep this in mind about them:

For the World Wide Web the role of the client is played by a web browser (such as Netscape), and the web server is the software that delivers resources (computer files, images, movies, sound files) to one or more web browsers.

There are thousands of web servers throughout the world (wide web) but they are all accessible from any browser because they have all agreed to use a common protocol: the Hypertext Transfer Protocol (HTTP). HTTP is based on an exchange of requests and responses.

Each request can be thought of as a command, or action, which is sent by the browser to the server to be carried out. The server performs the requested service and returns its answer in the form of a response.
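To make the request/response exchange concrete, here is a minimal Perl sketch of what the two messages look like as text. The helper names build_get_request and parse_response are hypothetical, and the sample response is made up for illustration; it was not captured from a real server.

```perl
use strict;
use warnings;

# Hypothetical helper: build the text of a minimal HTTP/1.0 GET request.
# The request is one command line followed by an empty line.
sub build_get_request {
    my ($path) = @_;
    return "GET $path HTTP/1.0\r\n\r\n";
}

# Hypothetical helper: split a raw response into its status line,
# its headers (as a hash reference, names lowercased), and its data.
sub parse_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my %headers;
    for my $line (@header_lines) {
        my ($name, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
        $headers{lc $name} = $value;
    }
    return ($status_line, \%headers, defined $body ? $body : '');
}

# A made-up response, for illustration only.
my $sample = "HTTP/1.0 200 OK\r\n"
           . "Content-type: text/html\r\n"
           . "\r\n"
           . "<h1> Hello world! </h1>\n";

print build_get_request('/hello.html');
my ($status, $headers, $body) = parse_response($sample);
print "status: $status\n";
print "type:   $headers->{'content-type'}\n";
```

The point of the sketch is only that both messages are plain text: a command line from the browser, and a status line, headers, an empty line, and the data from the server.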

[figure1]

The components of a simple WWW interaction are the user, the client, and the server. The client acts as an intermediary between the user and the server.

Steps 1-7 detail the basic information flow in a simple HTTP transaction. Essentially the client requests a file and the server delivers it. The entire HTTP process takes place as a result of simple transactions of requests and responses.

  1. The user sees an interesting URL
    http://tucotuco.cs.indiana.edu:19800/hello.html
    and clicks the hyperlink or types the URL into the browser.

  2. The browser interprets the command. (That is, it recognizes that this is different from printing, creating a bookmark, saving a file, or changing preferences.) It determines that the computer
    tucotuco.cs.indiana.edu
    needs to be contacted on port 19800 and that the hello.html file is needed. It then asks for the file by sending the HTTP GET command to the server (which you don't see here just yet; we'll see how this works when we simulate this with telnet).

  3. The browser sends the GET request to the server indicating what file it needs. This request travels through the Internet, going from computer to computer until it reaches tucotuco. There's a security aspect here that we will discuss later.

  4. The server receives and parses the request. It uses the file extension (.html) to determine the type of information in the file. The .html extension means that, before sending the file back to the browser, the server first announces the file's type with the header Content-type: text/html. You do not have to write this in the file; the server infers it from the file's extension.

  5. An HTTP response goes from server to the client. The headers that are part of the message indicate that the request was OK and that the data returned is of
    Content-type: text/html
    The headers are followed by the HTML data itself.

  6. The Content-type header tells the browser that the data is HTML, so the browser formats and renders the text appropriately, including highlighting hyperlinks.

  7. The user views the HTML output and has the opportunity to select another hyperlink, starting the cycle over again.
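The server's side of steps 4 and 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not Apache's actual code: the helper names and the three-entry extension table are invented for the example, and the fallback type application/octet-stream for unknown extensions is an assumption (a real server takes its table and default from its configuration).

```perl
use strict;
use warnings;

# Illustrative extension-to-type table; a real server reads a much
# longer list from its configuration files.
my %TYPE_FOR = (
    html => 'text/html',
    txt  => 'text/plain',
    gif  => 'image/gif',
);

# Step 4: infer the Content-type from the file extension.
# Assumption: unknown or missing extensions fall back to a generic type.
sub content_type_for {
    my ($filename) = @_;
    my ($ext) = $filename =~ /\.([^.\/]+)$/;
    return 'application/octet-stream' unless defined $ext;
    return $TYPE_FOR{lc $ext} || 'application/octet-stream';
}

# Step 5: put a status line and the headers in front of the file's data.
sub build_response {
    my ($filename, $data) = @_;
    return "HTTP/1.0 200 OK\r\n"
         . 'Content-type: ' . content_type_for($filename) . "\r\n"
         . "\r\n"
         . $data;
}

print build_response('hello.html', "<h1> Hello world! </h1>\n");
```

Note that nothing in hello.html itself mentions text/html; the header comes entirely from the file name.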
Of course you want to achieve more than this limited functionality. You may want to query a database and return the result to the user. The HTTP server can't do this directly; instead it delegates the request to an external program to which it has access, which acts as a gateway (or intermediary) between the server and the data repository.

When the server receives a request to access the database it passes the request to a gateway program which does whatever is necessary to get the data and return the results to the server.

The server then repackages the information from the script and forwards it back to the client. (In a sense the server acts as a sort of translator, taking data from either a file or a script and providing it to the browsers in a consistent and uniform manner.)

We make two observations now:

Clearly, to make this relationship work, the gateway programs and the server must communicate with each other. The details of this interaction are specified by the Common Gateway Interface (CGI). The CGI protocol defines the input that a script can expect to receive from the server as well as the output it must return to the server in order to be understood. What the script does between the input and the output is entirely up to the script.

So the process of servicing the

http://tucotuco.cs.indiana.edu:19800/cgi-bin/hello
request is different for the server: from the shape of the request it recognizes that it needs to execute the script specified by that address (or path). Upon starting the script, the server provides it with a variety of potentially useful information (such as the name of the machine from which the request originated, the type of browser used, etc.). Additional data may be passed by the server to the script, but we'll cover that later.
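Under CGI, the information the server provides arrives in the script's environment, which Perl exposes as the %ENV hash. The sketch below reports the requesting machine and browser. REMOTE_HOST, REMOTE_ADDR and HTTP_USER_AGENT are standard CGI variable names; the helper names describe_request and escape_html, and the 'unknown' fallback, are choices made for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Make a value safe to embed in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: describe the request from the CGI environment.
# A server may leave REMOTE_HOST unset, so fall back to REMOTE_ADDR.
sub describe_request {
    my (%env) = @_;
    my $host    = $env{REMOTE_HOST} || $env{REMOTE_ADDR} || 'unknown';
    my $browser = $env{HTTP_USER_AGENT} || 'unknown';
    return '<p> You are calling from ' . escape_html($host)
         . ' using ' . escape_html($browser) . " </p>\n";
}

# A CGI script must announce the type of its output first,
# followed by an empty line.
print "Content-type: text/html\n\n";
print describe_request(%ENV);
```

Run from the command line, where none of these variables are set, the script simply reports 'unknown' for both values.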

What follows is of no concern to the server, other than the output of the script, which the server will send back to the requesting browser.

You take a lot of responsibility this way if you're writing the script. While the server doesn't care how the script generates its output, it does need to know the format of the output: the script's output is, after all, the server's input. Recall that when the web server delivers a static file to the browser, it uses a filename extension to determine what to return in the Content-type header. A script's output has no extension to go by, so the script must print the Content-type header itself, followed by an empty line, before the data.


The hello script:

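#!/usr/bin/perl
# The interpreter path above is an assumption; use the location of perl on your system.
print qq{Content-type: text/html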

<html>
<head>
<title> Hello world! </title>
</head>
<body>
<h1> Hello world! </h1>
<p> How are you doing? </p>
</body>
</html>}; 
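What the server does with the output of such a script can be sketched roughly as follows. This is an illustration only, not Apache's implementation: the helper name repackage_script_output is hypothetical, and answering with a 500 status when the output lacks a Content-type header and empty line is a simplifying assumption about how a server reports a malformed script.

```perl
use strict;
use warnings;

# Hypothetical helper: take the raw output of a gateway program
# (headers, an empty line, then data) and turn it into a full response.
sub repackage_script_output {
    my ($output) = @_;
    my ($headers, $body) = split /\r?\n\r?\n/, $output, 2;

    # Assumption: output without a Content-type header and an empty
    # line is treated as a script error.
    return "HTTP/1.0 500 Internal Server Error\r\n\r\n"
        unless defined $body && $headers =~ /^Content-type:/mi;

    $headers =~ s/\r?\n/\r\n/g;    # use network line endings in the headers
    return "HTTP/1.0 200 OK\r\n$headers\r\n\r\n$body";
}

my $script_output = "Content-type: text/html\n\n<h1> Hello world! </h1>\n";
print repackage_script_output($script_output);
```

The script supplies the Content-type line and the data; the server only adds the status line in front and passes the rest along.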

[figure 2 here]

The information flow is as follows:

Review: Basics of Web Communication

3. A review of Apache install

We have described the basic directory structure and have identified three important elements:

We have described the installation of a web server starting from a precompiled (binary) distribution.

We have identified the conf directory and three configuration files:

We made the following changes to the configuration files:

Then we were ready to start the web server.

We described how the server can be started and stopped.

We mentioned that we need to be able to restart the server automatically if we need to be on-line continuously, and we have said we will describe ways to do that using cron.

The process id (pid) of the http daemon is located in a file httpd.pid in the logs directory.
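As a small illustration of how the pid file can be used, the following Perl sketch checks whether the daemon recorded in httpd.pid is still alive; a check like this is the kind of thing a periodic job could run before deciding to restart the server. The helper names are hypothetical, and the path /u/username/httpd/logs/httpd.pid assumes that the logs directory sits inside the server directory mentioned earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: read the process id stored in a pid file.
# Returns nothing if the file is missing or does not start with a number.
sub read_pid {
    my ($pid_file) = @_;
    open my $fh, '<', $pid_file or return;
    my $line = <$fh>;
    close $fh;
    return unless defined $line && $line =~ /^\s*(\d+)/;
    return $1;
}

# Signal 0 delivers nothing; kill only reports whether the process
# exists and may be signalled by the current user.
sub is_running {
    my ($pid) = @_;
    return (defined $pid && kill(0, $pid)) ? 1 : 0;
}

my $pid_file = '/u/username/httpd/logs/httpd.pid';    # assumed location
my $pid      = read_pid($pid_file);
print is_running($pid)
    ? "httpd is running (pid $pid)\n"
    : "httpd is not running\n";
```

Because the check relies on being allowed to signal the process, it is meant to be run by the same user who started the server.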

The same directory contains a file where accesses are logged and a file where errors are recorded. We looked at them and described them briefly.

4. A few Perl programs

We didn't get to this in the lecture but a separate document was posted the day after the lecture with a brief introduction to Perl. See the announcements page.