Today, we talked about form validation in PHP, and we introduced the syntax for defining functions in PHP. As an example, we revisited the first CGI script we wrote way back at the beginning of the semester (a whole week and a half ago). This was a Python script with an HTML interface that asked for the user's name and age, and when "Proceed" was pressed, greeted them by name and told them how old they'd be in the following year.
Our first task was simply to re-implement this in PHP. This time, we decided to do it as a single combined HTML web page / PHP script instead of a separate HTML interface and Python script. Here's what it looked like:
<html>
<head>
<title>Age Next Year</title>
</head>
<body>
<form method="get">
Name: <input type="text" name="who">
Age: <input type="number" name="age">
<input type="submit" name="action" value="Proceed">
</form>
<br>
<?php
if (isset($_GET["who"]) and isset($_GET["age"])) {
echo "Hello, " . $_GET["who"] . ". Next year, you will be " . ($_GET["age"] + 1) . " years old.";
}
?>
</body>
</html>
Now this works as advertised, but it's not very secure. You can type anything you want into the "Name" field, and it will be sent right back into the code of the web page. It's not just about displaying things: if you put some HTML code into that field, the browser will interpret it as if it were part of the web page. So for example, if you input <h1>H4X0R5!</h1> into the "Name" field, you might end up with a page where "H4X0R5!" appears as a large heading in the middle of the greeting.
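It doesn't take a whole lot of imagination to see how big a problem this can be. Being able to inject arbitrary markup into a web page means that an unscrupulous user could add JavaScript that runs in the browser of anyone who opens a link containing the malicious input. (Injected PHP, by contrast, would not be executed here, because the server sends the echoed text straight to the browser without running it.) So this is definitely something that needs to be fixed.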
There are (at least) two approaches we could take to fix the problem. The first is to search the input for HTML tags and, if any are found, reject the input and tell the user not to use them. The second is to accept the HTML tags, but encode them in such a way that they'll just be displayed and won't be interpreted as code. For this problem, we took the second approach.
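For comparison, the first approach (rejecting input that contains tags) could be sketched roughly like this. The helper name containsHtmlTags() is made up for this illustration and was not part of the class code. It leans on the built-in strip_tags(), which removes anything that looks like a tag, so if stripping changes the string, the string must have contained one.

```php
<?php
// Hypothetical helper (not from the class code): reports whether the
// input contains anything that PHP's strip_tags() would remove.
function containsHtmlTags($data) {
    return strip_tags($data) !== $data;
}

// Made-up sample input, the same one used in the example above.
$attempt = "<h1>H4X0R5!</h1>";

if (containsHtmlTags($attempt)) {
    echo "Please don't use HTML tags in your name.";
} else {
    echo "Input accepted.";
}
```

This only detects tags; a stray ampersand or quote would still get through, which is one reason encoding the output is usually the more thorough choice.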
There are fortunately only a few special characters in HTML. There are the angle brackets < and > that start and end every HTML tag. The ampersand & is used to display special characters, and you could make an argument that single and double quotes should be protected too. Fortunately, there's a way to display characters in HTML without them being interpreted as code. They're called "HTML entities", and they consist of an ampersand, followed by a code describing the character and then a semi-colon. So for example, the "less than" sign < can be displayed by using the HTML entity &lt; and the ampersand & by using &amp;. You can also use HTML entities to display characters that you can't or don't want to type directly into the code. They're often used for things like the euro sign € (&euro;) or the copyright symbol © (&copy;). You can see a big list of them here.
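To see the correspondence between entities and characters from PHP itself, one option is the built-in html_entity_decode(), which turns entities back into the characters they stand for. This is only a small illustration; the variable names are made up, and the class code did not use this function.

```php
<?php
// Each entity is an ampersand, a name, and a semicolon.
$encoded = "&lt; &gt; &amp; &euro; &copy;";

// html_entity_decode() converts entities back into plain characters.
// The flags and charset are spelled out so the result does not depend
// on the defaults of a particular PHP version.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, "UTF-8");

echo $decoded . "\n";
```

Run from the command line, this prints the five characters themselves: the two angle brackets, the ampersand, the euro sign, and the copyright symbol.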
It wouldn't be too much trouble to write a little function that replaces all of these symbols with their HTML entities. But fortunately, there's a built-in PHP function that does this for us called htmlspecialchars(). So we can slip that into our code to protect the page from this style of injection attack:
<html>
<head>
<title>Age Next Year</title>
</head>
<body>
<form method="get">
Name: <input type="text" name="who">
Age: <input type="number" name="age">
<input type="submit" name="action" value="Proceed">
</form>
<br>
<?php
if (isset($_GET["who"]) and isset($_GET["age"])) {
echo "Hello, " . htmlspecialchars($_GET["who"]) . ". Next year, you will be " . ($_GET["age"] + 1) . " years old.";
}
?>
</body>
</html>
Now if we try the same trick, the tag is no longer interpreted: the page simply displays the literal text <h1>H4X0R5!</h1> as part of the greeting.
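To check what the browser actually receives, the sketch below runs the same input through htmlspecialchars() and, for comparison, through a hand-written stand-in of the kind mentioned earlier. The name encodeSpecials() is invented for this sketch, and it covers only the four characters listed in its table, so it is an illustration and not a replacement for the built-in.

```php
<?php
// A hand-written stand-in for htmlspecialchars(), for illustration only.
// The name encodeSpecials() is made up for this sketch.
function encodeSpecials($data) {
    // strtr() with an array replaces each character in a single pass,
    // so the "&" inside a freshly produced "&lt;" is not encoded again.
    return strtr($data, array(
        "&" => "&amp;",
        "<" => "&lt;",
        ">" => "&gt;",
        '"' => "&quot;",
    ));
}

$attack = "<h1>H4X0R5!</h1>";

// ENT_QUOTES asks htmlspecialchars() to encode single quotes as well
// as double quotes; the charset is stated explicitly.
echo htmlspecialchars($attack, ENT_QUOTES, "UTF-8") . "\n";
echo encodeSpecials($attack) . "\n";
```

Both lines print &lt;h1&gt;H4X0R5!&lt;/h1&gt;, which the browser shows as text instead of treating it as a heading.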
While we're preprocessing our input, there are other things we can do to clean it up. For example, we could remove any extra spaces or tabs or whatever that might've appeared at the beginning or the end of the "Name" string. That sort of extra white space is common when people copy and paste data into forms. The built-in PHP function trim() can do this job for us. But since we're likely to have to do this sort of thing repeatedly, it makes sense to put it all into a single function. That way, if we have to make a change to how we clean up the user's input, we only have to change it in one place. It will also make the code easier to read.
So here's what our clean-up function (called janitor()) looks like:
function janitor($data) {
$data = trim($data);
$data = htmlspecialchars($data);
return $data;
}
PHP function definitions start with the word function. Here, $data is the temporary name we've given to the one argument our function takes. The function trims off the extra white space and then replaces all of the HTML special characters with the appropriate HTML entities. Lastly, the function returns the now-cleaned string.
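As a quick check of how the function behaves, here is a small sketch that calls janitor() on a made-up string with stray white space and a tag in it. The sample input and the variable names $raw and $clean are assumptions for illustration.

```php
<?php
function janitor($data) {
    $data = trim($data);
    $data = htmlspecialchars($data);
    return $data;
}

// Made-up sample input: stray white space around an HTML tag.
$raw = "  \t<b>Ada</b>  \n";
$clean = janitor($raw);

// var_dump() shows the exact string along with its length.
var_dump($clean);
```

The cleaned value is &lt;b&gt;Ada&lt;/b&gt; with no surrounding white space. Note that $raw itself is left untouched: the function works on its own copy of the argument and hands back the result through return.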
I decided to clean up the age input as well. Also, I separated the cleaning process onto its own line because that one line of code was already too long. Here's the result:
<html>
<head>
<title>Age Next Year</title>
</head>
<body>
<form method="get">
Name: <input type="text" name="who">
Age: <input type="number" name="age">
<input type="submit" name="action" value="Proceed">
</form>
<br>
<?php
if (isset($_GET["who"]) and isset($_GET["age"])) {
$who = janitor($_GET["who"]);
$age = janitor($_GET["age"]);
echo "Hello, " . $who . ". Next year, you will be " . ($age + 1) . " years old.";
}
function janitor($data) {
$data = trim($data);
$data = htmlspecialchars($data);
return $data;
}
?>
</body>
</html>
So we've protected our script against the user injecting damaging code. But we can do more than that. As it is, we get strange behavior when we enter in non-numerical data for the "Age" field. So we'll want to make sure that the user only enters in a number.
This time, we'll take the strategy of simply rejecting any non-numerical input. To do this, we'll make use of the built-in PHP function is_numeric(). Here's the result:
<html>
<head>
<title>Age Next Year</title>
</head>
<body>
<form method="get">
Name: <input type="text" name="who">
Age: <input type="number" name="age">
<input type="submit" name="action" value="Proceed">
</form>
<br>
<?php
if (isset($_GET["who"]) and isset($_GET["age"])) {
$who = janitor($_GET["who"]);
$age = janitor($_GET["age"]);
if (is_numeric($age)) {
echo "Hello, " . $who . ". Next year, you will be " . ($age + 1) . " years old.";
}
else {
echo "Please enter a number for your age.";
}
}
function janitor($data) {
$data = trim($data);
$data = htmlspecialchars($data);
return $data;
}
?>
</body>
</html>
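It's worth knowing that is_numeric() accepts any numeric string, including decimals, negative numbers, and exponent notation. The sketch below compares it against a stricter check built on filter_var() with FILTER_VALIDATE_INT. The stricter check is an optional alternative and was not part of the class code, and the 0 to 150 range is an arbitrary assumption made for this example.

```php
<?php
// Sample inputs are made up for illustration.
$samples = array("42", "4.5", "-7", "1e3", "abc", "");

foreach ($samples as $age) {
    // is_numeric() accepts any numeric string, not just whole numbers.
    $loose = is_numeric($age);

    // A stricter, optional check: a whole number within an assumed range.
    // filter_var() returns the integer on success and false on failure.
    $strict = filter_var($age, FILTER_VALIDATE_INT,
        array("options" => array("min_range" => 0, "max_range" => 150)));

    echo var_export($age, true)
        . ": is_numeric=" . var_export($loose, true)
        . ", whole number in range=" . var_export($strict !== false, true)
        . "\n";
}
```

Here the first four samples pass is_numeric(), but only "42" passes the stricter check. Whether an age like 4.5 should be accepted is a design decision for the form.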
This is as far as we got in class, but a good next step would be to deal with the possibility that the user doesn't fill out one of the required fields. This is slightly tricky because when you hit the "Proceed" button, every field in the form is sent, so all of the corresponding $_GET entries will be set. It's just that if you leave one of the fields blank, that entry will be set to the empty string. So it's not enough just to check that isset($_GET["who"]) is true; you also need to verify that $_GET["who"]=="" is false.
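One way this check might look is sketched below. The helper name fieldFilled() is hypothetical, and a plain array stands in for $_GET so the example can be run on its own. It also trims the value first, on the assumption that a field containing only spaces should count as blank.

```php
<?php
// Hypothetical helper: true only if the field was submitted as a string
// and is not blank once surrounding white space is removed.
function fieldFilled($params, $key) {
    return isset($params[$key])
        && is_string($params[$key])
        && trim($params[$key]) !== "";
}

// Stand-in for $_GET after pressing "Proceed" with the name left blank.
$params = array("who" => "", "age" => "30", "action" => "Proceed");

if (fieldFilled($params, "who") && fieldFilled($params, "age")) {
    echo "Both fields were filled in.";
} else {
    echo "Please fill in both your name and your age.";
}
```

Comparing against the empty string is used here instead of PHP's empty(), because empty("0") is true and would wrongly reject an age of 0.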