Fall Semester 2002


Lecture Notes Seven: Pattern matching in Perl. Building a CGI processor.

Let's now look at pattern matching.

1. Basic Pattern Matching in Perl

We use the =~ operator with the substitution operator on its right-hand side: the letter s, followed by a slash-delimited pattern to be matched and a replacement string. When the pattern matches, the string between the second and third slashes replaces the matched text. There are several rules and exceptions; we summarize the ones we care about here through a few examples.

The dot (.) matches any individual character except newline.

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
$a = "1234567890"; 
$a =~ s/./a/; 
print $a; 
frilled.cs.indiana.edu%./alpha
a234567890frilled.cs.indiana.edu%
To have the substitution happen everywhere possible, use g (global) after the third slash.

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
$a = "1234567890"; 
$a =~ s/./a/g; 
print $a; 
frilled.cs.indiana.edu%./alpha
aaaaaaaaaafrilled.cs.indiana.edu%
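The rule that the dot does not match a newline is easy to check with the global substitution. The sketch below is only an illustration (the two-line test string and the variable names are ours, not part of the lecture's alpha script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A two-line string: the newline sits between "12345" and "67890".
my $text = "12345\n67890";

# Replace every character that the dot can match with the letter a.
(my $replaced = $text) =~ s/./a/g;

# The ten digits become a's; the newline between them is left alone.
print $replaced, "\n";
```

Running it prints two lines of five a's each, because the newline is the one character the dot skipped.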
The pattern can be bigger (or longer):
frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
$a = "1234567890"; 
$a =~ s/../a/g; 
print $a; 
frilled.cs.indiana.edu%./alpha
aaaaafrilled.cs.indiana.edu%
Parentheses can be used as memory elements:

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
$a = "1234567890"; 
$a =~ s/(.)(.)/$2$1/g; 
print $a; 
frilled.cs.indiana.edu%./alpha
2143658709frilled.cs.indiana.edu%
And they can include larger patterns:
frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
$a = "1234567890"; 
$a =~ s/(..)/$1+1/g; 
print $a; 
frilled.cs.indiana.edu%./alpha
12+134+156+178+190+1frilled.cs.indiana.edu%
To have the part between the last two slashes act as Perl code use e (evaluate) after the third slash.

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
$a = "1234567890"; 
$a =~ s/(..)/$1+1/ge; 
print $a; 
frilled.cs.indiana.edu%./alpha
1335577991frilled.cs.indiana.edu%
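To see the effect of e in isolation, here is a small side-by-side sketch under the same assumptions as the examples above (the string "1234" and the variable names are made up for the illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $plain     = "1234";
my $evaluated = "1234";

# Without e the replacement is just an interpolated string:
# each digit d is replaced by the text "d*2".
$plain =~ s/(.)/$1*2/g;

# With e the replacement is a Perl expression:
# each digit d is replaced by the value of d*2.
$evaluated =~ s/(.)/$1*2/ge;

print "$plain\n";       # 1*22*23*24*2
print "$evaluated\n";   # 2468
```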
A few other things needed in ReadParse are listed below.

2. Additional Information

Characters have (decimal) ASCII codes that can be obtained with ord.

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
@values = ('A', 'B', 'C', 'D', 'E'); 
foreach $value (@values) {
  print $value, " has ASCII code: ", ord($value), "\n"; 
} 
frilled.cs.indiana.edu%./alpha
A has ASCII code: 65
B has ASCII code: 66
C has ASCII code: 67
D has ASCII code: 68
E has ASCII code: 69
frilled.cs.indiana.edu%
ASCII codes can be turned into characters with chr.

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
@values = (65, 66, 67, 68, 69); 
foreach $value (@values) {
  print "ASCII code $value stands for: ", chr($value), "\n"; 
} 
frilled.cs.indiana.edu%./alpha
ASCII code 65 stands for: A
ASCII code 66 stands for: B
ASCII code 67 stands for: C
ASCII code 68 stands for: D
ASCII code 69 stands for: E
frilled.cs.indiana.edu%
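Since ord and chr undo each other, they can be combined. A minimal sketch (the variable names are ours):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $letter = 'A';

my $code = ord($letter);        # 65, the ASCII code of A
my $next = chr($code + 1);      # the character with code 66, that is B
my $same = chr(ord($letter));   # a round trip gives back A

print "$letter has code $code, the next character is $next\n";
print "chr(ord('$letter')) is $same\n";
```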
The hex function turns a hexadecimal value into a decimal one.

frilled.cs.indiana.edu%cat alpha
#!/usr/bin/perl
@values = (1, 10, 20, 100, 110, 111); 
foreach $value (@values) {
  print "$value in base 16 is equal to ", hex($value), " in base 10.\n"; 
} 
frilled.cs.indiana.edu%./alpha
1 in base 16 is equal to 1 in base 10.
10 in base 16 is equal to 16 in base 10.
20 in base 16 is equal to 32 in base 10.
100 in base 16 is equal to 256 in base 10.
110 in base 16 is equal to 272 in base 10.
111 in base 16 is equal to 273 in base 10.
frilled.cs.indiana.edu%
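The values above only use the digits 0-9, but hex also accepts the letters a-f (in either case). The sketch below shows that, and also how hex combines with chr, and how sprintf goes the other way. The variable names are ours, and the use of sprintf is an addition that the notes do not otherwise rely on:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $slash_code = hex("2F");   # 2*16 + 15 = 47
my $largest    = hex("ff");   # 15*16 + 15 = 255

# Two hexadecimal digits to a character: hex gives the code, chr the character.
my $char = chr(hex("41"));    # code 65, the letter A

# The other direction: a character to % plus two hexadecimal digits.
# In the format, %% is a literal percent sign and %02X is the code in hex.
my $encoded = sprintf("%%%02X", ord("/"));   # %2F

print "$slash_code $largest $char $encoded\n";
```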

3. Basic HTML Forms

Next we can discuss the various HTML form elements, for example:

To display:                   Use:                                   Attributes:
A form                        <form> ... HTML form info </form>      method, action, enctype
Single-line text field        <input type=text>                      name, value, maxlength, size
Single-line password field    <input type=password>                  name, value, maxlength, size
Multiple-line text area       <textarea></textarea>                  name, cols, rows, wrap
Checkbox                      <input type=checkbox>                  name, value, checked
Radio buttons                 <input type=radio>                     name, value, checked
List of choices               <select> items in list... </select>    name, multiple, size
Items in a <select> list      <option>                               value, selected
Clickable image               <input type=image>                     name, align, src
File upload                   <input type=file>                      name, accept
Hidden field                  <input type=hidden>                    name, value
Reset button                  <input type=reset>                     value
Submit button                 <input type=submit>                    name, value
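To make the list concrete, here is a sketch of a Perl script that prints a form using several of these elements. All field names and values (nickname, secret, comments and so on) are made up for the illustration; the action points at the printenv script discussed in these notes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Every name and value below is hypothetical, chosen only for this sketch.
my $form = qq{
<form method="GET" action="/cgi-bin/printenv">
  Name: <input type="text" name="nickname" size="20" maxlength="40"> <p>
  Password: <input type="password" name="secret" size="20"> <p>
  Comments: <textarea name="comments" rows="3" cols="40"></textarea> <p>
  <input type="checkbox" name="subscribe" value="yes" checked> Subscribe <p>
  <input type="radio" name="level" value="A348" checked> A348
  <input type="radio" name="level" value="A548"> A548 <p>
  <select name="color" size="1">
    <option value="r" selected>Red</option>
    <option value="g">Green</option>
  </select> <p>
  <input type="hidden" name="source" value="notes">
  <input type="submit" value="Send"> <input type="reset" value="Clear">
</form>
};

# As with any CGI script, the MIME type and an empty line come first.
print "Content-type: text/html\n\n<html><body>$form</body></html>\n";
```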

We now want to build a generic CGI processor.

4. Building a Generic CGI Processor

We also need to come up with a definition of CGI.

For this purpose let's again review what we have done so far in terms of CGI.

  1. We started with a hello.html in Lab Two, placed in htdocs.

  2. We then wrote a script (called hello), placed in cgi-bin, whose output was the same as what we got when we accessed the hello.html file on the web. hello.html was in htdocs; hello was in your script (cgi-bin) directory.

  3. The difference between them was that the script was entirely responsible for the output and so it had to start it with its MIME type:
    "Content-type: text/html\n\n"
    was the first thing that the script was supposed to write. Note the two newline characters: an empty line is required after the MIME type. We then changed the script's output a little, to make it display an image.

  4. Then we asked whether we could make it display something new every time. We introduced a bit of randomness, so that the output changes from one call to the next most of the time (the same image can still come up twice in a row).

    To implement the change in output we created a list of names of images. Then every time the script is called, a random number that represents an index in the list of names of images will be produced and the image with that index will appear in the output.
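The random choice described above can be sketched as follows. The image file names and the /images/ path are placeholders we made up, not the ones used in the lab:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical list of image names; use the names of your own images.
my @images = ('one.gif', 'two.gif', 'three.gif');

# rand(@images) is a fraction from 0 up to (but not including) the
# number of images; int() cuts it down to a valid index.
my $index  = int(rand(@images));
my $picked = $images[$index];

print "Content-type: text/html\n\n";
print qq{<html><body><img src="/images/$picked"></body></html>\n};
```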

  5. That's an improvement: the output is changing, but it's not predictable. Is there any way to make the user participate, and maybe choose the output? Can the user talk to the script (instead of just starting it)?

    We said the answer was "yes" and to explain that we introduced a short script by the name of printenv. Each one of our servers had this script in their cgi-bin directories after installation. It looked like this:

    #!/usr/bin/perl
    
    print "Content-type: text/html\n\n<html><body><pre>"; 
    
    foreach $elem (keys %ENV) {
      print $elem, " --> ", $ENV{$elem}, "\n"; 
    } 
    
    print "</pre></body></html>"; 
  6. The hash %ENV is built by the system: the browser, the server, and the host operating system all contribute to it, and the info is passed to the script. One of the keys in this hash table is called QUERY_STRING. If we put a ? (question mark) after the name of the program (when we invoke its URL) the string that follows, up to the first blank space, will be placed in
    $ENV{"QUERY_STRING"}
    We also noted that there was an entry in %ENV for REQUEST_METHOD. The value associated with $ENV{REQUEST_METHOD} was GET (please confirm that through your own experiments).

    OK, that was the review.

  7. Now we need to talk a bit about forms, and we create a very simple one, that looks like this:

    <form method="GET" action="/cgi-bin/printenv"> 
    <input type=text name=fieldOne> <p> 
    <input type=text name=fieldTwo> <p> 
    <input type=text name=fieldThree> <p> 
    <input type=submit> <p> </form>

    Using this form we should be able to call our script, and even pass spaces to it.

  8. But we notice a conversion process: the spaces we typed arrive in QUERY_STRING as + signs.

  9. It is happening with other characters too, such as slashes (/), which arrive as %2F.

  10. So we decide to clarify what this means.

    CGI is, in fact, the transfer of information

    1. from the browser,
    2. through the server,
    3. into the script.

    And the transfer can be done in two ways, that are identified by the keywords

    1. GET and
    2. POST.

  11. Regardless of the method (be it GET or POST) the transfer always involves the encoding of special characters in a particular way. It is the purpose of this lecture to clarify the encoding scheme as well as how one can access that information (that is passed to the script) inside the script.

  12. The encoding involves turning special characters into hexadecimal codes. To retrieve them you need to know the encoding scheme, and to use substitutions.

  13. The scheme is that every encoded character is turned into % followed by the two hexadecimal characters that make up the ASCII code of the character.

    An example: A has ASCII code 65 (base 10).

    In base 16 this is: 41 (base 16).

  14. We discussed how we compute the base 10 equivalent of a number in base 16 and that we have 16 symbols that we could use to write numbers in this base: 0-9, and a-f.

    There are 256 character codes, so two hexadecimal digits would be enough to represent them all (from 0 all the way up to ff in base 16, which is 255 in base 10).
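To get a feeling for the scheme, here is a sketch of the encoding direction, that is, roughly what the browser does before it sends the data. It is a simplification under our own assumptions: the function name encode_field is made up, only letters and digits are left alone (real browsers also leave a few punctuation characters untouched), spaces are turned into + signs, and only single-byte characters are handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# encode_field is a hypothetical helper, not part of the course scripts.
sub encode_field {
    my ($text) = @_;
    # Anything that is not a letter, a digit or a space becomes
    # % followed by the two hexadecimal digits of its code.
    $text =~ s/([^A-Za-z0-9 ])/sprintf("%%%02X", ord($1))/ge;
    # Spaces are sent as + signs.
    $text =~ tr/ /+/;
    return $text;
}

print encode_field("a/b c"), "\n";   # a%2Fb+c
print encode_field("1+1=2"), "\n";   # 1%2B1%3D2
```

Note that a literal + in the data has to be encoded (as %2B), precisely because + is already taken to mean a space.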

  15. If the user has a form that specifies GET as the transmission mode, then all the data will be put together in one long string, encoded as described above, and placed such that the script will find it in $ENV{"QUERY_STRING"}.

  16. To decode it one would do the following:
    $input = $ENV{"QUERY_STRING"};
    $input =~ s/%(..)/chr(hex($1))/ge; 
    Now, this second line will have to be clarified, but this is not as hard as it may appear.

  17. And that's because we have already explained it (only in stages).
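The stages can be put side by side in a short sketch. The sample string is ours; the substitution is the one shown in the notes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stage by stage, for a single encoded character:
my $hex_digits = "2F";               # what (..) captures from %2F
my $decimal    = hex($hex_digits);   # 47
my $character  = chr($decimal);      # the slash, /

# All at once, for every %XX in a string:
#   %(..) matches a % and remembers the two characters after it in $1,
#   e makes chr(hex($1)) run as Perl code, g repeats it for every match.
my $input = "a%2Fb%41";
(my $decoded = $input) =~ s/%(..)/chr(hex($1))/ge;

print "$hex_digits -> $decimal -> $character\n";   # 2F -> 47 -> /
print "$input -> $decoded\n";                      # a%2Fb%41 -> a/bA
```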

  18. If the method is POST then the info no longer comes through the QUERY_STRING and instead the script is receiving it through a channel that it identifies as its standard input (STDIN). So the read process will be somewhat different:
    read(STDIN, $input, $ENV{"CONTENT_LENGTH"}); 
  19. We read from the standard input, into a buffer called $input and we need to specify how many characters we want to read. Fortunately this number is available to us in the %ENV hash table, associated with the CONTENT_LENGTH key.
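The read call can be tried without a web server. In the sketch below an in-memory filehandle stands in for STDIN and we set CONTENT_LENGTH ourselves; both are assumptions made only so that the example runs from the command line (under a real server the script reads from STDIN and the server sets the length). The sample data is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical request body, already encoded the way a browser would send it.
my $body = 'thoughts=Hello+world&email=me%40example.com';

# Under a real server this value is set for us.
$ENV{CONTENT_LENGTH} = length($body);

# A filehandle that reads from the string, standing in for STDIN.
open(my $fake_stdin, '<', \$body) or die "cannot open in-memory handle: $!";

my $input = '';
# read(HANDLE, BUFFER, LENGTH) returns how many characters it actually read.
my $got = read($fake_stdin, $input, $ENV{CONTENT_LENGTH});
close($fake_stdin);

print "read $got characters: $input\n";
```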

  20. So now we can write a script that can read the info regardless of how it comes:

    1. with GET it's in $ENV{'QUERY_STRING'}
    2. with POST it's coming through STDIN

    3. We start from:
      #!/usr/bin/perl
      
      &printHeader;
      
      if    ($ENV{REQUEST_METHOD} eq 'GET' ) { 
        print "Called with GET." ; 
      } elsif ($ENV{REQUEST_METHOD} eq 'POST') { 
        print "Called with POST."; 
      } else {
        print "Method not supported.\n"; 
      } 
      
      &printTrailer; 
      
      sub printHeader { print "Content-type: text/html\n\n<html><body>"; } 
      
      sub printTrailer { print "</body></html>"; }
    4. Our next step is to print a form when called for the first time (with GET), and to print the contents of all the fields in reply to any subsequent POST call.

    5. So we should try something like this:

      #!/usr/bin/perl
             
      &printHeader;
             
      if    ($ENV{"REQUEST_METHOD"} eq 'GET' ) { 
        $me = $ENV{"SCRIPT_NAME"}; 
        print qq{ 
          <form method=POST action=$me> 
          Please write your thoughts below: <p> 
          <textarea name="thoughts" rows=5 cols=60></textarea> <p> 
          Also please write your e-mail address here: 
             <input type="text" name="email"> <p>     
          <input type="submit"> 
          </form> 
        };  
      } elsif ($ENV{REQUEST_METHOD} eq 'POST') { 
        print "Called with POST.";
      } else {
        print "Method not supported.\n"; 
      } 
             
      &printTrailer; 
             
      sub printHeader  { print "Content-type: text/html\n\n<html><body>"; } 
      
      sub printTrailer { print "</body></html>"; }
    6. The next step is a significant leap: we want to read the data and print it back.

      #!/usr/bin/perl
             
      &printHeader;
      
      &readParse; 
             
      if    ($ENV{"REQUEST_METHOD"} eq 'GET' ) { 
        $me = $ENV{"SCRIPT_NAME"}; 
        print qq{ 
          <form method=POST action=$me> 
          Please write your thoughts below: <p> 
          <textarea name="thoughts" rows=5 cols=60></textarea> 
          <p> Also please write your e-mail address here: 
          <input type="text" name="email"> <p>     
          <input type="submit"> 
          </form> 
        };  
      } elsif ($ENV{"REQUEST_METHOD"} eq 'POST') { 
        print "Called with POST.<pre>";
        foreach $k (keys %in) {
            print $k, " --> ", $in{$k}, "<br>"; 
        } 
        print "</pre>";
      } else {
        print "Method not supported.\n"; 
      } 
             
      &printTrailer; 
             
      sub printHeader  { print "Content-type: text/html\n\n<html><body>"; } 
      
      sub printTrailer { print "</body></html>"; }
      
      sub readParse {
          if      ($ENV{"REQUEST_METHOD"} eq 'GET' ) {
              $input = $ENV{"QUERY_STRING"}; 
          } elsif ($ENV{"REQUEST_METHOD"} eq 'POST') {
              read (STDIN, $input, $ENV{"CONTENT_LENGTH"}); 
          } else {
              print "Unsupported method."; 
              &printTrailer; 
              exit; 
          } 
      
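          @input = split(/&/, $input); 
          foreach $elem (@input) {
              # Split on the first = before decoding, so that an encoded
              # = (%3D) inside the data is not taken for the separator.
              ($key, $value) = split(/=/, $elem, 2); 
              foreach $part ($key, $value) {
                  $part =~ s/\+/ /g;                  # a + stands for a space
                  $part =~ s/%(..)/chr(hex($1))/ge;   # %XX stands for one character
              } 
              $in{$key} = $value; 
          } 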
      } 
      In class we need to explain this very thoroughly.

    7. We have in fact seen some of it last time so it shouldn't be too hard.
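For experimenting outside the web server, the parsing step can be isolated in a function that takes the encoded string as an argument. This is a sketch under our own assumptions, not the script from class: the name parse_form_data and the sample data are made up, and the function returns a hash instead of filling the global %in. It splits each pair on the first = before decoding, and turns + into a space before decoding %XX, so that an encoded + or = in the data survives.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_form_data is a hypothetical stand-alone variant of readParse.
sub parse_form_data {
    my ($input) = @_;
    my %fields;
    foreach my $pair (split /&/, $input) {
        next if $pair eq '';                        # skip empty pieces
        my ($key, $value) = split /=/, $pair, 2;    # split on the first = only
        $value = '' unless defined $value;          # a field with no value
        foreach my $part ($key, $value) {
            $part =~ tr/+/ /;                                # + means space
            $part =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX means one character
        }
        $fields{$key} = $value;
    }
    return %fields;
}

my %in = parse_form_data('thoughts=1%2B1+is+2%21&email=me%40example.com');
foreach my $k (sort keys %in) {
    print $k, " --> ", $in{$k}, "\n";
}
# email --> me@example.com
# thoughts --> 1+1 is 2!
```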


    Last updated: Sep 22, 2002 by Adrian German for A348/A548