CSCI A348/A548 Lecture Notes 9, Fall 1999
One thing that we will discuss this week will be pattern matching in Perl. The context will be: summarizing information from files. We will start with filehandles, and describe pattern matching and regular expressions in the context of locating and extracting information that is read from the files.
1. Filehandles
A filehandle is just a name you give to a file, device, socket or pipe to help you remember which one you're talking about (also to hide the complexities of buffering and such). Internally, filehandles are similar to streams in C++ or Java.
You create a filehandle and attach it to a file by using the open function.
It takes two parameters: the filehandle and the filename.
Perl gives you some predefined (and preopened) filehandles:
STDIN - your program's normal input channel
STDOUT - your program's normal output channel
STDERR - additional output channel (for snide remarks)
The filename passed to open can carry a prefix or suffix that selects how the file is opened:

    open (AB, "filename");                # read from file
    open (AB, "<filename");               # same, explicitly
    open (AB, ">filename");               # create and write file
    open (AB, ">>filename");              # append to file, create if needed
    open (AB, "| output_pipe_command");   # set up an output filter
    open (AB, "input_pipe_command |");    # set up an input filter

The name you pick for the filehandle is arbitrary. Once opened, the filehandle can be used to access the file or pipe until explicitly closed, which you can do with close.
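The open modes listed above can be tried out on a scratch file. This is only a minimal sketch: the file name is a made-up example, and the "or die" checks are an addition that the listing above does not use. It also relies on print with a filehandle and on reading in list context, both of which are described later in this section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical scratch file, named after the process id so it is unlikely to clash.
my $file = "/tmp/a348_filehandle_demo.$$";

# ">" creates the file (or truncates it) and opens it for writing.
open (OUT, ">$file") or die "cannot create $file: $!";
print OUT "first line\n";
close(OUT);

# ">>" opens the same file for appending.
open (OUT, ">>$file") or die "cannot append to $file: $!";
print OUT "second line\n";
close(OUT);

# "<" opens the file for reading; list context reads every line at once.
open (IN, "<$file") or die "cannot read $file: $!";
my @lines = <IN>;
close(IN);

print "read ", scalar(@lines), " lines\n";
unlink($file);    # remove the scratch file again
```

Checking the result of open is worth the extra typing: without it, a misspelled file name simply produces a loop that reads nothing.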
Once a filehandle is open for reading you can read lines from it just as you can read from standard input with STDIN.
So, for example, to read lines from a file specified in the command line:
    open (AB, $ARGV[0]);
    while ($x = <AB>) {
        print $x;
    }
    close(AB);

The fragment above just lists the lines in the specified file and is therefore, for all practical purposes, equivalent to the Unix cat command. Note that the newly opened filehandle is used inside the angle brackets just as we have used STDIN previously. Also, note that to make the program completely equivalent to cat we'd have to process all arguments passed to it on the command line, like this:

    foreach $argv (@ARGV) {
        open (AB, $argv);
        while ($x = <AB>) {
            print $x;
        }
        close(AB);
    }

If you have a filehandle open for writing or appending, and if you want to print to it, you must place the filehandle immediately after the print keyword and before the other arguments. No comma should occur between the filehandle and the rest of the arguments. (I personally never remember this, so this is the first compile error I have to fix.)

When you read from a filehandle you can specify either a scalar context (read one line, which is then stored into the scalar variable that appears on the left)

    $x = <AB>;

or a list context:

    @x = <AB>;

which reads all the lines from AB and places them in $x[0], $x[1],... $x[$#x].

2. Exact pattern matching
The =~ operator is used for pattern matching. The pattern itself is specified between leaning toothpicks, or slashes.

    $x =~ /foo/;

is a statement that checks whether the string $x contains the pattern foo in it. This statement returns a true or false value (1 on a match; the empty string otherwise, which counts as 0 in numeric context) so it can be used as a condition in an if statement.

    $x =~ /foo/i

does the same thing but ignores case.
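As a small illustration, here is a sketch that uses both forms of the match as if conditions. The sample lines are invented and stand in for lines read from a filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines standing in for lines read from a file.
my @lines = ("Some food here\n", "nothing to see\n", "FOO in capitals\n");

my (@exact, @nocase);
foreach my $x (@lines) {
    if ($x =~ /foo/)  { push @exact,  $x; }   # case-sensitive match
    if ($x =~ /foo/i) { push @nocase, $x; }   # same pattern, ignoring case
}

print "exact:  ", scalar(@exact),  " line(s)\n";
print "nocase: ", scalar(@nocase), " line(s)\n";
```

Note that the first line matches because foo occurs inside food: the pattern is looked for anywhere in the string, not as a separate word.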
So this is how we locate patterns. If we locate them we could also replace them, and we do that with the s operator. For example,

    $x =~ s/foo/bar/;

replaces the first occurrence of foo with bar in $x.

    $x =~ s/foo/bar/g;

performs a global replacement of all occurrences of foo with bar in $x (if any exist).

3. Regular expressions
A regular expression is a way of describing a set of strings without having to list all of the strings in the set.
We start from exact patterns, like the string foo, or abc, and we introduce quantifiers: * and +. A character followed by * describes a string of zero or more such characters. Thus

    /aba/

refers to the pattern aba and

    /ab*a/

refers to the pattern that starts with a, is followed by zero or more b's and ends with an a. + has a similar meaning: it says that the preceding character appears at least once. * and + are two of a set of characters that have a special meaning and are therefore called metacharacters. They are listed below:

    \ | ( [ { ^ $ * + ? .

We'll mention two of them, ( and [, and then we'll move on.
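The difference between the two quantifiers is easy to see on a few test strings. This is a sketch with invented candidate strings; the ^ and $ anchors are an addition here, used so that the whole string has to fit the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical candidate strings.
my @candidates = ("aa", "aba", "abbba", "aca");

my (@star, @plus);
foreach my $s (@candidates) {
    push @star, $s if $s =~ /^ab*a$/;   # zero or more b's between the a's
    push @plus, $s if $s =~ /^ab+a$/;   # at least one b between the a's
}

print "ab*a matches: @star\n";   # aa aba abbba
print "ab+a matches: @plus\n";   # aba abbba
```

Only aa separates the two: it has zero b's, which * accepts and + rejects.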
( together with its associate ) can be used to capture and remember the parts of the string that match. These are captured in special variables: $1, $2, $3, and so forth. The numbers represent the order of the parens in the pattern. Example:

    $x = "abbbc";
    $x =~ /a(b*)c/;
    print $1;

will print

    bbb

In other words, if the pattern specified inside the leaning toothpicks matches, then $1 (which is a special variable) immediately becomes whatever the parens are enclosing.

3.1 Classes of characters
The square bracket is used just as { and }'s are used in mathematics to denote sets, although the notation is somewhat different.

[a-z] | means one lowercase alphabetic character |
[a-zA-Z] | means one alphabetic character |
[0-9] | means a digit |
[a-zA-Z0-9_] | is also shortened \w |
[0-9] | is also shortened \d |
[^0-9] | means anything but a digit |
[^\w] | is also shortened \W |
[ \t\r\n\f] | is white space, also shortened \s |
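To get a feel for the classes, the sketch below counts how many characters of a sample string fall into each one. It assumes one feature not covered above: a match with the g modifier, evaluated in list context, returns the list of everything that matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample string with letters, digits, blanks, an underscore and an ampersand.
my $text = "a1 bc3 4_&c";

# In list context, a /g match returns every substring that matched.
my @digits  = $text =~ /[0-9]/g;    # same as \d
my @word    = $text =~ /\w/g;       # letters, digits and underscore
my @space   = $text =~ /\s/g;       # white space
my @nonword = $text =~ /[^\w]/g;    # same as \W

printf "%d digits, %d word chars, %d spaces, %d non-word chars\n",
    scalar(@digits), scalar(@word), scalar(@space), scalar(@nonword);
```

Every character is either a word character or a non-word character, so the second and fourth counts add up to the length of the string.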
1. Here's a program that puts parens around a's in the strings that it receives from the command line.

    tucotuco.cs.indiana.edu% cat sub
    #!/usr/bin/perl
    $ARGV[0] =~ s/(a)/($1)/g;
    print $ARGV[0], "\n";
    tucotuco.cs.indiana.edu% ./sub abcdefghabcdefgh
    (a)bcdefgh(a)bcdefgh
    tucotuco.cs.indiana.edu% ./sub "abc def gha"
    (a)bc def gh(a)

Note the use of double quotes to specify a string with blank spaces in it.
2. Here's another program that does the same thing with any alphabetic character:

    tucotuco.cs.indiana.edu% cat sub1
    #!/usr/bin/perl
    $ARGV[0] =~ s/([a-zA-Z])/($1)/g;
    print $ARGV[0], "\n";
    tucotuco.cs.indiana.edu% ./sub1 "a1 bc3 4_&c +=m "
    (a)1 (b)(c)3 4_&(c) +=(m)

3. Here's a program that reads the index.html file and, for each line that has what looks like a hyperlink on it, prints the link's target:

    open (AB, "/u/dgerman/httpd/htdocs/index.html");
    while ($x = <AB>) {
        if ($x =~ /<a href="([^"]+)">([^<]+)<\/a>/) {
            print $1;
        }
    }
    close(AB);

The two patterns in round parens are non-empty strings that will be stored in $1 and $2 after they match. The first one is a string that contains at least one character and does not contain double quotes. (This makes the pattern matching mechanism stop at the first double quote encountered.) The second one describes a non-empty (+) string of characters that does not contain the < sign (which is where the description of the hyper-reference ends).

If you look close you will see outside these two patterns the clear structure of an

    <a href="...">...</a>

tag, except we have put those two intimidating patterns where the ellipses are.
4. Lines in access_log start like this:

    129.79.207.219 - - [16/Sep/1999:01:29:37

This can be described as follows:

    ^[\S]+ - - \[[^:]+:\d\d:\d\d:\d\d

that is:

- ^ outside ['s means start of string
- one or more non-whitespace (\S) chars
- the - - pattern
- a literal [ (the \ keeps it from acting as a metacharacter)
- one or more characters other than : (the date), a : character, and two digits (for the hour)
- a : character, two digits (for the number of minutes), again :, and two digits for the number of seconds

    open (AB, "httpd/logs/access_log");
    while ($x = <AB>) {
        if ($x =~ /^([\S]+) - - \[([^:]+:\d\d):\d\d:\d\d/) {
            $hits{$2} += 1;    # one bin per date-and-hour
        }
    }
    close(AB);

The first pair of parens collects the IP number, the second one a date like this:

    16/Sep/1999:01

that means Sept 16, and the time 1am. For each request to the server there is a line in the log file. Each line has the time of access. We basically count the lines (which stand for hits) and put them in bins, one such bin for each distinct hour of our server's life.
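The binning can be tried without a real log file. In the sketch below the log lines are invented (only the beginning of the first one comes from the example above), and the bins are keyed on the second pair of parens so that there is one bin per hour, as just described.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical log lines; everything after the timestamp is made up.
my @log = (
    '129.79.207.219 - - [16/Sep/1999:01:29:37 -0500] "GET / HTTP/1.0" 200 1024',
    '129.79.207.219 - - [16/Sep/1999:01:45:02 -0500] "GET /a.html HTTP/1.0" 200 512',
    '10.0.0.7 - - [16/Sep/1999:02:03:11 -0500] "GET / HTTP/1.0" 200 1024',
);

my %hits;
foreach my $x (@log) {
    # $1 is the IP number, $2 is the date plus the hour, e.g. 16/Sep/1999:01
    if ($x =~ /^(\S+) - - \[([^:]+:\d\d):\d\d:\d\d/) {
        $hits{$2} += 1;
    }
}

# Print one line per hour with the number of hits in that hour.
foreach my $hour (sort keys %hits) {
    print "$hour  $hits{$hour}\n";
}
```

Using $1 as the hash key instead would count hits per client rather than per hour.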
5. CGI.pm
Last time we developed a ReadParse function that does the reading and parsing for us within a CGI script. This function was developed following Steve Brenner's cgi-lib.pl library. This library is mentioned in your book on page 503. We only developed a core part of it, the most basic part that does CGI processing. The function looked like this:

    sub ReadParse {
        local ($i, $key, $val) = @_;
        if ($ENV{'REQUEST_METHOD'} eq 'GET') {
            $in = $ENV{'QUERY_STRING'};
        } elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
            # the CGI variable is CONTENT_LENGTH (underscore, not hyphen)
            read (STDIN, $in, $ENV{'CONTENT_LENGTH'});
        }
        @in = split(/&/, $in);
        for ($i = 0; $i <= $#in; $i++) {
            $in[$i] =~ s/\+/ /g;
            ($key, $val) = split(/=/, $in[$i]);
            $key =~ s/%(..)/pack("c", hex($1))/ge;
            $val =~ s/%(..)/pack("c", hex($1))/ge;
            if (defined($in{$key})) {
                $in{$key} .= "\0";    # thanks to Rudy!
            }
            $in{$key} .= $val;
        }
    }

You can use this in all your CGI scripts to process incoming data.
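The decoding steps inside ReadParse can be exercised outside a web server. The following is a sketch only: parse_query is a hypothetical helper (not part of cgi-lib.pl), the query string is invented, and it uses chr instead of pack and a split limit of 2, which are small departures from the function above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decode a query string into a hash, the way ReadParse does.
sub parse_query {
    my ($query) = @_;
    my %form;
    foreach my $pair (split(/&/, $query)) {
        $pair =~ s/\+/ /g;                          # + stands for a blank
        my ($key, $val) = split(/=/, $pair, 2);     # split on the first = only
        $val = "" unless defined $val;
        $key =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX is a hex-encoded char
        $val =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        $form{$key} .= "\0" if defined $form{$key};     # separate repeated keys
        $form{$key} .= $val;
    }
    return %form;
}

# Invented query string with a blank, a repeated key and encoded characters.
my %form = parse_query("name=Ada+Lovelace&lang=perl&lang=c%2B%2B&note=50%25");

foreach my $key (sort keys %form) {
    my @values = split(/\0/, $form{$key});
    print "$key: ", join(", ", @values), "\n";
}
```

The + signs are turned into blanks before the %XX sequences are decoded, so an encoded %2B survives as a real + in the value.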
For example here's a circular script:
    #!/usr/local/bin/perl

    &ReadParse;
    &header("Lab 5 Circular Script");
    if ($ENV{REQUEST_METHOD} eq 'GET') {
        &printform;
    } elsif ($ENV{REQUEST_METHOD} eq 'POST') {
        &printform($in{count});
    }
    &trailer;

    sub printform {
        local ($arg) = @_;
        local $count = $arg + 1;
        print qq{
    <form method="POST" action="$ENV{SCRIPT_NAME}">
    Your call has number: <font size=+5>$count</font>. <p>
    Press <input type="submit" value="here"> to call again.
    <input type="hidden" name="count" value="$count">
    </form>
        };
    }

    sub header {
        local ($t) = @_;
        print "Content-type: text/html\n\n<html><head>";
        print "<title>$t</title></head><body bgcolor=white>\n";
    }

    sub trailer {
        print "\n</body></html>";
    }

    sub ReadParse {
        local ($i, $key, $val) = @_;
        if ($ENV{'REQUEST_METHOD'} eq 'GET') {
            $in = $ENV{'QUERY_STRING'};
        } elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
            read (STDIN, $in, $ENV{'CONTENT_LENGTH'});
        }
        @in = split(/&/, $in);
        for ($i = 0; $i <= $#in; $i++) {
            $in[$i] =~ s/\+/ /g;
            ($key, $val) = split(/=/, $in[$i]);
            $key =~ s/%(..)/pack("c", hex($1))/ge;
            $val =~ s/%(..)/pack("c", hex($1))/ge;
            if (defined($in{$key})) {
                $in{$key} .= "\0";
            }
            $in{$key} .= $val;
        }
    }

Try this script.
But notice that our ReadParse can't handle file uploads and is not as robust as it could and should be. We can save some time by using one of the freely available libraries, such as CGI.pm.

CGI.pm is a Perl library for writing CGI scripts. It handles many of the ugly details of creating HTTP headers, parsing query strings, and maintaining the state of fill-out forms so that you can concentrate on the task at hand. The module is widely used and frequently updated. The description in your book starts on page 494.
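To give a first impression, here is how the circular script might look with CGI.pm doing the parsing. This is a sketch under assumptions, not the version from the book: it assumes CGI.pm is installed (recent Perl releases no longer bundle it), and it uses only the new, param, header and script_name methods, writing the HTML by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# CGI->new reads and decodes the GET or POST data for us.
my $q = CGI->new;

# param returns the value of a form field, or undef if it was not sent.
my $previous = $q->param('count');
my $count    = (defined $previous ? $previous : 0) + 1;

# script_name gives the URL path of this script, for the form's action.
my $self = $q->script_name;

# header prints the Content-type line and the blank line that follows it.
print $q->header('text/html');
print <<"END";
<html><head><title>Circular Script</title></head>
<body bgcolor="white">
<form method="POST" action="$self">
Your call has number: <font size="+5">$count</font>. <p>
Press <input type="submit" value="here"> to call again.
<input type="hidden" name="count" value="$count">
</form>
</body></html>
END
```

The script no longer needs to look at REQUEST_METHOD: a missing count parameter on the first call simply starts the counter at 1.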