When application security was still in it’s infancy, there were discussions on how to protect applications from newly discovered injection vulnerabilities. "Sanitize Input" was a popular solution that rolled off the tongue nicely and was not overly complicated to explain. It was also, a very generic solution that would (hopefully) be part of a more complete approach.
As much as "Sanitizing Input" makes sense, so does writing your code in a way which, allows you to handle failure safely. This way, when the unexpected does happen, an entire operation doesn't fall down, introduce a bug or propagate unsafe data.
Question: Where does this approach fail miserably?
Answer: When it is the only approach you have.
The OWASP Top 10 categorizes XSS and SQL Injection separately. As an attacker, you are injecting data that is handled insecurely by application code. In this way, it is really just another form of injection. On that note, let’s discuss two manifestations of injection. SQL Injection and HTML Injection (XSS).
SQL Injection: "Save your one-liners for the bar". Parametrization of database queries is a classic method for handling queries safely and in many cases more efficiently. From a security standpoint, parametrized queries help to solidify the boundaries between user data and SQL statements. It ensures that data submitted by the user will be separated from the actual database query and won’t interfere with the SQL code and ultimately the database.
Example of lazy code....
http://www.example.com/example.php?use_name=gevans
$uname = $_GET['user_name']
....and this is the classic example everyone shows, nothing new here, that illustrates a SQL Injection flaw where the data ($uname) is actually included in the SQL statement.
"SELECT user_id from users where username = $uname;"
This programming flaw has destroyed the boundary between the SQL command and user-supplied input. Because the user data is now cast as a string-- it is no longer clear to the SQL server what part was supplied by the developer and what was supplied by the user. The whole query can fall apart by appending double quotes. This is not just a vulnerability, this is bad programming. Sure, it takes one line to write the query but there is no further sanity checking here. The string is formed, sent to the server, and executed as SQL. How are parametrized queries different?
Parametrized queries separate the data from the query so that we as coders don’t miscommunicate our intentions to the database server. How does it work? The majority of the query is sent to the server MINUS the actual user submitted data. So, the query is prepared (meaning sent to the server), a response comes back with a token (minus MySQL as I understand it), and THEN, the variable is sent to the server with the token and a SQL query executes. This means the expected query and actual data that we've gathered from the user are separated prior to execution.
Lets provide a visualization
// Pass in db credentials as well as the host it is located on and the database we'd like to connect to
$conn = new PDO("mysql:host=$dbhost;dbname=$dbname",$dbuser,$dbpass);
// Prepare the statement
$sql = "SELECT user_id from users where username = ?";
//Execute the query, taking in the variable data ($user_name) from the user
$q = $conn->prepare($sql);
$q->execute(array($uname));
$object = $q->fetchColumn();
As you can see, the $sql statement is prepared and the server knows exactly what it should look like. Next, the SQL statement is executed, passing in the variable value in place of the "?" (shown above). By specifying that question mark, you tell the db, this is my statement but I don't know what the value will be.....I'll give you that on the next call.
Lets examine XSS. Again, something I hear a lot is "Sanitize your input". Some people even go as far as "Whitelist" versus "Blacklist". Okay, great, that is not extensible and ultimately context matters. What do I mean? It is a very one-sided approach with a lot of assumptions. Let me draw a picture for you. The understanding, as of right now, is the data comes in one place and is potentially echoed in another. So the model looks something like this:
A typical example would be a registration form. You sign up with your First Name, Last Name, etc. Upon successful authentication to the application, you notice a little message at the top right....
So....."Welcome, Ken!", I wonder where that value came from? When we registered, our information was stored in the db, later extracted after login, and shown on the page. Now, we should be safe right? Even if we had attempted to place JavaScript in the First Name value upon registration, it wouldn't have matter.....We Sanitized!!!
Two months later, a user complains that they signed up with a misspelled username and would like the ability to change it. A new developer is assigned the task of adding the ability to edit your first and last name and does so. The new developers assumption is that we are going to safely handle that data when rendered to the user. But we aren't. We sanitized the input and didn't bother with handling the data. Our model has changed from Input/Output to......
So with one additional point of input, our model gets (very slightly) more complicated. Now imagine adding multiple points of input, multiple points of output. Now split input into data entry (processed), storage handling (stored in the db) and then do the same for output. While we are at, lets throw a web-service that consumes the data as well. It becomes very easy to see how "Sanitize Input" doesn't scale, isn't a sure-fire solution, and really oversimplifies the problem for those who are looking to either receive or give an easy answer.
In summary, please join me in the fight to stop the mindless regurgitation of old material.
Cheers,
Ken