We will discuss pattern matching in Perl this week. The context will be: summarizing information from files. We will start with filehandles, and describe pattern matching and regular expressions in the context of locating and extracting information that is read from the files.


A filehandle is just a name you give to a file, device, socket or pipe to help you remember which one you're talking about (also to hide the complexities of buffering and such).

Internally, filehandles are similar to streams in C++.

You create a filehandle and attach it to a file by using the open function. It takes two parameters: the filehandle and the filename.

Perl gives you some predefined (and preopened) filehandles:

STDIN  - your program's normal input channel 
STDOUT - your program's normal output channel 
STDERR - additional output channel (for snide remarks) 
These filehandles are typically attached to your terminal but they may also be attached to other files or pipes.

You can use the open function to create filehandles for various purposes (input, output, piping) so you need to specify what behaviour you want:

open(AB, "filename");                  # read from file
open(AB, "<filename");                 # same, explicitly
open(AB, ">filename");                 # create and write file
open(AB, ">>filename");                # append to file, create if needed
open(AB, "| output_pipe_command");     # set up an output filter
open(AB, "input_pipe_command |");      # set up an input filter
The name you pick for the filehandle is arbitrary. Once opened, the filehandle can be used to access the file or pipe until explicitly closed, which you can do with close.
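As a small sketch of the modes listed above, the fragment below creates a file, appends to it, and reads it back. The scratch file name and the filehandle names OUT and IN are made up for illustration. The "or die" checks are an addition that the listing above does not show: open returns false when it fails, and $! holds the system's error message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file; $$ is the id of the current process.
my $file = "/tmp/open_demo_" . $$ . ".txt";

open(OUT, ">$file") or die "cannot create $file: $!";     # create and write
print OUT "first line\n";
close(OUT);

open(OUT, ">>$file") or die "cannot append to $file: $!"; # append
print OUT "second line\n";
close(OUT);

open(IN, "<$file") or die "cannot read $file: $!";        # read
my $count = 0;
my $last  = "";
while (my $x = <IN>) {
  $count += 1;      # count the lines as we list them
  $last = $x;       # remember the most recent line
  print $x;
}
close(IN);

unlink($file);      # remove the scratch file again
```

Because the second open used >> rather than >, the first line is still there when the file is read back.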

Once a filehandle is open for reading you can read lines from it just as you can read from standard input with STDIN (as we did in the first command-line calculator that we wrote in lab2).

So, for example, to read lines from a file specified in the command line:

open (AB, $ARGV[0]); 
while ($x = <AB>) {
  print $x; 
} 
close(AB); 
The fragment above just lists the lines in the specified file and is therefore, for all practical purposes, equivalent to the cat command. Note that the newly opened filehandle is used inside the angle brackets just as we have used STDIN previously.

If you have a filehandle open for writing or appending, and if you want to print to it, you must place the filehandle immediately after the print keyword and before the other arguments. No comma should occur between the filehandle and the rest of the arguments.
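A minimal sketch of that rule, assuming a made-up scratch file: the filehandle AB comes right after print, with no comma, and the comma-separated list of things to print follows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file; $$ is the id of the current process.
my $file = "/tmp/print_demo_" . $$ . ".txt";

open(AB, ">$file") or die "cannot create $file: $!";
print AB "hits: ", 42, "\n";     # filehandle, NO comma, then the list to print
close(AB);

# Read the line back to check what was written.
open(AB, $file) or die "cannot read $file: $!";
my $line = <AB>;
close(AB);

print STDOUT $line;              # STDOUT is a filehandle too, and is the default
unlink($file);
```

Writing print AB, "hits: " with a comma after AB would not do what we want: Perl would no longer treat AB as the filehandle to print to.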

When you read from a filehandle you can specify either a scalar context (read one line which is then stored into the scalar variable that appears on the left)

$x = <AB>;
or a list context:
@x = <AB>; 
which reads all the lines from AB and places them in
$x[0], $x[1],... $x[$#x]. 
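The difference between the two contexts can be sketched as follows. The scratch file and its three lines are made up for illustration. Note that the list-context read picks up where the scalar read stopped, so it returns only the lines that have not been read yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file with three lines in it.
my $file = "/tmp/context_demo_" . $$ . ".txt";
open(AB, ">$file") or die "cannot create $file: $!";
print AB "one\n", "two\n", "three\n";
close(AB);

open(AB, $file) or die "cannot read $file: $!";
my $first = <AB>;    # scalar context: reads exactly one line
my @rest  = <AB>;    # list context: reads all the remaining lines
close(AB);

print "first line: $first";
print "lines left: ", scalar(@rest), "\n";
print "last index: $#rest\n";
print "last line:  $rest[$#rest]";

unlink($file);
```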

Exact pattern matching

The =~ operator is used for pattern matching.

The pattern itself is specified between leaning toothpicks, or slashes.

$x =~ /foo/; 
is a statement that checks whether the string $x contains the pattern foo. The match returns a true or false value (1 or the empty string), so it can be used as a condition in an if statement.

$x =~ /foo/i 
does the same thing but ignores case.
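Here is a short sketch that uses both forms as conditions. The three test strings and the %result hash are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %result;    # what we found out about each string

foreach my $x ("foobar", "Food", "bar") {
  if ($x =~ /foo/) {
    $result{$x} = "contains foo";
  } elsif ($x =~ /foo/i) {
    $result{$x} = "contains foo only when case is ignored";
  } else {
    $result{$x} = "does not contain foo";
  }
  print $x, ": ", $result{$x}, "\n";
}
```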

So this is how we locate patterns.

If we locate them we could also replace them, and we do that with the s operator.

For example,

$x =~ s/foo/bar/; 
replaces the first occurrence of foo with bar in $x.

$x =~ s/foo/bar/g; 
performs a global replacement of all occurrences of foo with bar in $x (if any exist).
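The two forms can be compared side by side on the same made-up string. The sketch also relies on one fact not mentioned above: a substitution returns the number of replacements it made, which is handy for checking whether anything changed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first = "foo and foo";
$first =~ s/foo/bar/;                 # only the first foo is replaced

my $all   = "foo and foo";
my $count = ($all =~ s/foo/bar/g);    # every foo is replaced; $count is 2

print "$first\n";
print "$all\n";
print "replacements made with g: $count\n";
```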

Regular expressions

A regular expression is a way of describing a set of strings without having to list all of the strings in the set.

We start from exact patterns, like the string foo, or abc and we introduce quantifiers: * and +.

A character followed by * describes a string of zero or more such characters. Thus

/aba/
refers to the pattern
aba
and
/ab*a/
refers to the pattern that starts with a, is followed by zero or more b's and ends with an a.

* specifies that the preceding character can appear zero or more times. + has a similar meaning: it says that the preceding character appears one or more times. * and + are two of a set of characters that have a special meaning and are therefore called metacharacters:

\ | ( ) [ { ^ $ * + ? .
We'll mention two of them, ( and [, and then we'll write the two programs we wanted to write.
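Before turning to those two, the difference between * and + can be checked with a few made-up strings. This is only a sketch; the %star and %plus hashes are names chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%star, %plus);    # results for /ab*a/ and /ab+a/

foreach my $s ("aa", "aba", "abbba", "aca") {
  $star{$s} = ($s =~ /ab*a/) ? "yes" : "no";   # zero or more b's between the a's
  $plus{$s} = ($s =~ /ab+a/) ? "yes" : "no";   # at least one b between the a's
  print $s, "  ab*a: ", $star{$s}, "  ab+a: ", $plus{$s}, "\n";
}
```

The string aa is the interesting case: it matches ab*a (zero b's are allowed) but not ab+a.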

( and ) can be used to 'capture' the parts of the string that match. The captured text is stored in special variables: $1 for the first pair of parens, $2 for the second, $3 for the third, and so forth.

Example:

$x = "abbbc"; 
$x =~ /a(b*)c/; 
print $1; 
will print
bbb
In other words if the pattern specified inside the leaning toothpicks matches then $1 (which is a special variable) immediately becomes whatever the parens are enclosing.
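With two pairs of parens we get both $1 and $2. The sketch below uses a made-up string and copies the captured text into ordinary variables. It also tests the match in an if, because $1 and $2 are only set by a successful match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = "abbbcdd";
my ($bs, $ds);

if ($x =~ /a(b*)c(d+)/) {
  ($bs, $ds) = ($1, $2);            # save the captures in our own variables
  print "first group:  $bs\n";
  print "second group: $ds\n";
} else {
  print "no match\n";
}
```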

Classes of characters

Square brackets describe a set of characters, much as { and } describe a set in mathematics. The class matches any one character from the set.

[a-z] means one alphabetic lowercase character
[a-zA-Z] means one alphabetic character
[0-9] means a digit
[a-zA-Z0-9_] is also shortened \w
[0-9] is also shortened \d
[^0-9] means anything but digit
[^\w] is also shortened \W
[ \t\r\n\f] is white space \s
  • \n is newline
  • \r is carriage return
  • \f is formfeed
  • \t is tab
  • there's a blank at the beginning
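To see the shorthand classes at work, the sketch below tries a few made-up one-character strings against \w, \d, \s and \W and records which of them match. The %labels hash is a name chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %labels;    # for each character, the classes it belongs to

foreach my $c ("x", "7", "_", " ", "-") {
  my @found;
  push(@found, '\w') if $c =~ /\w/;   # letter, digit or underscore
  push(@found, '\d') if $c =~ /\d/;   # digit
  push(@found, '\s') if $c =~ /\s/;   # white space
  push(@found, '\W') if $c =~ /\W/;   # anything that is not \w
  $labels{$c} = join(" ", @found);
  print "'", $c, "' matches: ", $labels{$c}, "\n";
}
```

Note that the underscore counts as a \w character, and that every character is either \w or \W but never both.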
Four examples

1. Here's a program that puts parens around a's in the strings that it receives from the command line.

tucotuco.cs.indiana.edu% vi sub
tucotuco.cs.indiana.edu% cat sub
#!/usr/bin/perl

$ARGV[0] =~ s/(a)/($1)/g; 

print $ARGV[0], "\n"; 
tucotuco.cs.indiana.edu% ./sub abcdefghabcdefgh
(a)bcdefgh(a)bcdefgh
tucotuco.cs.indiana.edu% ./sub "abc def gha"
(a)bc def gh(a)
2. Here's another program that does the same thing with any alphabetic character:
tucotuco.cs.indiana.edu% vi sub1
tucotuco.cs.indiana.edu% cat sub1
#!/usr/bin/perl

$ARGV[0] =~ s/([a-zA-Z])/($1)/g;

print $ARGV[0], "\n"; 
tucotuco.cs.indiana.edu% ./sub1 "a1 bc3 4_&c +=m "
(a)1 (b)(c)3 4_&(c) +=(m) 
3. Here's a program that reads the index.html file and, for each line that has what looks like a hyperlink on it, prints the address the link points to:
open (AB, "/u/dgerman/httpd/htdocs/index.html");
while ($x = <AB>) {
  if ($x =~ /<a href="([^"]+)">([^<]+)<\/a>/) {
    print $1, "\n";   # one address per line
  } 
} 
close(AB); 
Note: Thanks to Tom Bloomfield and Kwang Lim for pointing out an omission: the backslash in <\/a> was missing. The first and the last slash delimit the pattern. Without the backslash, the slash inside </a> would have ended the pattern, producing an error right before the a that follows it.

Thanks also to those who noticed the typo in the pattern matching operator, which should be =~ but for some reason appeared as ~. Sorry for the trouble.

The two patterns in round parens are non-empty strings that will be stored in $1 and $2 after they match. The first one is a string that contains at least one character and does not contain double quotes. (This makes the pattern matching mechanism stop at the first ").

The second one describes a non-empty (+) string of characters that does not contain the < sign (which is where the description of the hyper-reference ends).

If you look closely you will see outside these two patterns the clear structure of an

<a href="...">...</a>
tag, except we have put those two intimidating patterns where the ellipses are.
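The same pattern can be tried on a single string without opening a file. The line of HTML below is made up for illustration. The sketch prints both captured pieces: $1 is the address and $2 is the text of the link.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up line of HTML, in single quotes so the double quotes inside are literal.
my $x = 'See <a href="http://www.perl.org/">the Perl home page</a> for more.';
my ($url, $text);

if ($x =~ /<a href="([^"]+)">([^<]+)<\/a>/) {
  ($url, $text) = ($1, $2);
  print "address: $url\n";
  print "text:    $text\n";
}
```

The pattern is deliberately simple. It will miss links whose tag is written differently (extra attributes, single quotes, uppercase letters), so it finds what looks like a hyperlink rather than every hyperlink.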

4. Lines in access_log start like this:

129.79.207.219 - - [16/Sep/1998:01:29:37 
This can be described as follows:
^[\S]+ - - \[[^:]+:\d\d:\d\d:\d\d
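that is: a run of non-blank characters at the start of the line (the IP number), then - - and an opening square bracket, then everything up to the first colon (the date), and finally three two-digit fields separated by colons (hour, minutes, seconds). Thus we can collect this information to build a table of the number of hits, grouped by hour, for the server.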

open (AB, "httpd/logs/access_log");
while ($x = <AB>) {
  if ($x =~ /^([\S]+) - - \[([^:]+:\d\d):\d\d:\d\d/) {
    $hits{$2} += 1;   # $2 is the date and hour, $1 is the IP number
  } 
} 
close(AB); 
The first pair of parens collects the IP number, the second one a date and hour like this:
16/Sep/1998:01
that means Sept 16, 1998, 1am.

For each request to the server there is a line in the log file.

Each line has the time of access.

We basically count the lines (which stand for hits) and put them in bins, one such bin for each distinct hour of our lives.
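The counting and the final table can be sketched on a handful of sample lines instead of a real log. Only the first line below comes from the example above; the others are made up. The sketch uses the second captured group (the date and hour) as the key of the %hits hash and then prints one row per bin.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample beginnings of log lines; all but the first are made up.
my @log = (
  '129.79.207.219 - - [16/Sep/1998:01:29:37 ',
  '129.79.207.219 - - [16/Sep/1998:01:45:02 ',
  '192.0.2.7 - - [16/Sep/1998:02:03:10 ',
  'this line does not look like a log entry',
);

my %hits;    # key: date and hour, value: number of hits in that hour

foreach my $x (@log) {
  if ($x =~ /^([\S]+) - - \[([^:]+:\d\d):\d\d:\d\d/) {
    $hits{$2} += 1;                 # $2 is something like 16/Sep/1998:01
  }
}

# One row per bin. sort compares the keys as strings, which is fine within
# one day but is not chronological across different days or months.
foreach my $hour (sort keys %hits) {
  print $hour, "  ", $hits{$hour}, "\n";
}
```

Lines that do not match the pattern are simply skipped, so they do not end up in any bin.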