As previously discussed, there are a number of inherent flaws in the practice of web content classification and in how the fruits of that classification are sold in the form of content filtering products. The impact of these flaws, however, is not limited to the consumers of content filtering products. Mis-classification of web content affects websites themselves, often as harshly as it affects the users of content filtering who need access to the resources those sites provide. Before discussing the impact of mis-classification on legitimate websites and users of content filtering, it is important to understand the process by which a website is classified (or mis-classified) by the content classification industry.
In the mid-1990s, as the World Wide Web rose to popularity, stories in the mass media prompted parents and religious leaders to worry that children might access indecent material. In 1996, the United States Congress reacted with the Communications Decency Act (CDA), banning indecency on the Internet. Civil liberties groups challenged the law on First Amendment grounds, and in 1997 the U.S. Supreme Court struck down the CDA's anti-indecency provisions in Reno v. ACLU. While the CDA fell on free-speech grounds, another long-standing objection to content filtering has been the correctness of the filtering decisions made by content filtering products. Overly broad filters against key words, whether in the content of a website or in its URL, led to sites carrying information about breast cancer, clothing, and poultry recipes being filtered because of their use of the word "breast". In a particularly famous example, Beaver College in Glenside, Pennsylvania changed its name to Arcadia University in part because content-filtering software was blocking access to the school's web site, treating the college's use of "beaver" as indecent.
These kinds of mis-classifications are very common due to the way most content filtering technology works. Content classification companies maintain huge databases of web sites classified into various categories, and users of content filtering software may then choose which categories to allow or block. In addition to URLs, so-called key words are used to help block sites which may not be specifically classified, but which may contain material related to a filtered category, as well as web searches for restricted material. Key words are generally regular expressions which can be broadly applied to either the name of a site or to the content of the site itself. URL-based key word filters are generally faster but less reliable: the entire site need not be fetched and checked before being displayed to the end user, but the URL often bears little relation to the content of the page itself. Content-based key word filters are considered more reliable but slower, because the entire page must be processed. The accuracy of key word-based filtering comes into question when one looks more deeply at regular expressions and how they are used.
A regular expression is a string that is used to describe or match a set of strings according to a certain set of syntactical rules. There are a number of different flavors of regular expression syntax, including POSIX (basic and extended), Perl-compatible (PCRE), and the traditional UNIX tools. While the distinctions between the syntaxes are numerous, the goal is the same: to match every string containing a specific, described pattern, and only those strings. When the set of strings is every URL on the Web, writing even a single pattern which will match one particular objectionable word, and nothing else, is a monumental task. Combine that with simple inattention to detail or laziness and, suddenly, content filters block every instance of the string "sex" or "orgy", which leads to the blacklisting of sites relating to Sussex or Essex, or the opera Porgy and Bess, as well as legitimate sites containing sex educational material.
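The fragility of such patterns is easy to demonstrate. Below is a minimal sketch using Python's re module; the ".example" domains are hypothetical stand-ins, not any vendor's actual filter rules. A bare substring pattern for "sex" flags Sussex and Middlesex, while even a word-boundary version still cannot tell a sex-education site from pornography:

```python
import re

# Naive keyword filter: any URL containing the substring "sex" is flagged.
naive = re.compile(r"sex", re.IGNORECASE)

# Slightly more careful: require word boundaries around the keyword,
# so "sex" buried inside "Sussex" no longer matches.
careful = re.compile(r"\bsex\b", re.IGNORECASE)

urls = [
    "http://www.sussex.example/",       # hypothetical county site
    "http://www.middlesex.example/",    # hypothetical college site
    "http://sex-education.example/",    # hypothetical sex-ed resource
]

for url in urls:
    print(f"{url}  naive={bool(naive.search(url))}  "
          f"careful={bool(careful.search(url))}")

# The naive pattern flags all three. The careful pattern spares Sussex
# and Middlesex, but still blocks the sex-education site: no regex can
# judge whether a page has "serious literary, artistic, political or
# scientific value".
```

Word boundaries fix the Sussex problem but do nothing for the deeper one: the pattern matches characters, not meaning.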
To better illustrate this problem, consider the following web sites and their URLs. Each one is a legitimate business that has absolutely nothing to do with indecent material, but runs the risk of being filtered for 'inappropriate content' simply because of the site's name. Some filtering products act on the assumption that if a 'bad' word is in the URL or site name, it must reflect the content the site is offering.
http://www.dickssportinggoods.com/ - Dick's Sporting Goods
http://www.msexchange.org/ - Microsoft Exchange Server Resource Site
http://www.cummingfirst.com/ - First Cumming Methodist Church
http://www.penisland.net/ - Pen Island
http://www.molestationnursery.com/ - Mole Station Native Nursery
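As a quick sanity check, every one of the five URLs above contains a substring that a careless blocklist would flag. The following sketch uses a hypothetical word list of my own construction, not any vendor's actual database:

```python
import re

# Hypothetical substring blocklist of the sort a careless URL filter
# might apply. This list is illustrative only.
BAD_WORDS = ["dick", "sex", "cum", "penis", "molestation"]
pattern = re.compile("|".join(BAD_WORDS), re.IGNORECASE)

sites = [
    "http://www.dickssportinggoods.com/",   # Dick's Sporting Goods
    "http://www.msexchange.org/",           # MS Exchange resource site
    "http://www.cummingfirst.com/",         # First Cumming Methodist Church
    "http://www.penisland.net/",            # Pen Island
    "http://www.molestationnursery.com/",   # Mole Station Native Nursery
]

for url in sites:
    hit = pattern.search(url)
    # Every one of these legitimate sites trips the filter.
    print(f"{url}  blocked on: {hit.group()}")
```

All five businesses get blacklisted on the strength of a substring match, which is precisely the failure mode described above.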
In a whitepaper published by the Electronic Frontier Foundation, authors Seth Finkelstein and Lee Tien had the following to say regarding the use of key words:
"From a theoretical point of view, the claimed abilities would require a computer-science breakthrough of Nobel-Prize magnitude. Consider the legal test for 'obscenity,' which requires that an obscene work 'taken as a whole lack serious literary, artistic, political or scientific value.' It is hard to see how anyone can seriously assert that computer programs could make such a judgment when humans endlessly debate these concepts."
It is important to note that, despite the marketing jargon surrounding the functionality of content filtering software, key words really are the only intelligence used to make automatic filtering decisions. When an administrator enables such functionality, which is not turned on by default in some content filtering products, they may inadvertently block thousands of harmless sites. The remaining sites are filtered based on a statically maintained list of URLs and IP addresses. These massive databases are maintained by the filtering companies themselves, regarded as the keys to their intellectual-property kingdom, kept completely opaque to the end user, and sold as an update service, much like antivirus vendors sell virus definition updates. When a site like attrition.org gets categorized as "Racism and Hate", that is the content filtering company forcing its views, or its mistake, on its customers. Considering that only a small fraction of the sites contained in these databases are ever viewed by human eyes, mis-classified sites constantly enter the databases and are rarely (if ever) re-classified correctly, unless or until a site realizes that it's been blacklisted and complains.
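To make the mechanics concrete, here is a toy model; this is an assumption on my part about the general shape of such systems, since real products use large proprietary databases rather than a Python dict. The point it illustrates: one database entry assigns one category to a hostname, and every page under that host inherits the verdict.

```python
# Toy model of category-based blocking from a static URL database.
# Category names and the example hostname mirror the story told in
# this article; everything else is hypothetical.

BLOCKED_CATEGORIES = {"Racism and Hate", "Adult Content"}

# One entry per hostname: the whole site gets a single category.
CATEGORY_DB = {
    "attrition.org": "Racism and Hate",       # the mis-classification at issue
    "example-news.example": "News and Media", # hypothetical entry
}

def is_blocked(host: str) -> bool:
    # Hosts absent from the database would fall through to key word
    # heuristics (not modeled here).
    return CATEGORY_DB.get(host) in BLOCKED_CATEGORIES

print(is_blocked("attrition.org"))         # True: every page under the host is denied
print(is_blocked("example-news.example"))  # False
```

Note that there is no per-page granularity in this model: a single human (or automated) judgment about one page condemns the entire domain, which is exactly what happened in the incident described next.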
Before you casually dismiss this issue thinking "it doesn't affect me", consider that these content filtering products are frequently used for public Internet access. This means that if you use WiFi at your local coffee shop, public kiosk terminals at an airport or even the computers at your local library, you may be browsing the Web under the guidance and supervision of these products...
... which leads us to a time in the not-so-distant past (as you will read below) when attrition.org was again directly affected by the content filtering industry. Shortly after the release of an email to the "General Attrition Mayhem Mail List", we received a curious email stating the following:
From: A.A. (email@example.com)
To: lyger (firstname.lastname@example.org)
Date: Sat, 16 Jun 2007 13:01:50 -0700 (PDT)
Subject: Bloody Bobbing Bollocks, You've Been Blocked!

lyger-

Wow... just wow. Imagine my happiness when I arrived at work after a few days off and saw a bunch of new dataloss entries and a new going postal. "Good times," I thought to myself, as I typed attrition.org into my address bar only to be greeted by this:

You have attempted to access a site that is not consistent with [Company]'s Internet Usage Policy. Your request for http://attrition.org/postal/p0014.html was denied because of its content categorization: "Racism and Hate"

--------------------------------------------------------------------------------

Use of the Internet by [Company] employees is permitted and encouraged where such use is suitable for business purposes and supports the goals and objectives of [Company] and its business units. The Internet is to be used in a manner that is consistent with [Company]'s standards of business conduct as defined in the company's Ethics in the Workplace policy, is a part of the normal execution of an employee's job responsibilities, and does not compromise the security or the integrity of [Company]'s information systems. This policy covers all connections via intranet, extranet, Internet, and any remote methods that allow physical or logical connectivity to internal [Company] information systems using [Company] resources. Violations of this policy may be subject to disciplinary action up to and including termination of employment.

Racism and hate? I am Jack's complete shock. Okay, maybe hate of stupid people but c'mon? Racism? Attrition.org? What the fucking fuck? I can only think that one of the myriad images from the image gallery or some mirrored page defacement is what did this. How did my corporate overlords even find out about Attrition? I demand answers, dammit! Okay, I know you don't have any answers for me but this sucks.
No more Going Postal. No more defacement mirror. No more reviews. No more charlatans. Damn, what am I going to do when I'm bored at work... oh, now I see. They blocked all the fun websites so we'll review company news and policies if we aren't busy. (Like we're "supposed" to.) Those bastards!

It's been fun checking Attrition.org out for the all-too-brief period we've had together. I'll see you when I get internet at home again... now, to check and see if http://www.racismandhate.org is blocked.

a.

P.S.- As per usual, I'd like to humbly request that, should this e-mail be featured in the Going Postal section, you don't use my name or e-mail address. (Especially since this involves my job.) I also have a new request, being that you tell me it's going to be used since I won't be able to actually see it from work and can't see it from home for at least another month and a half. I'd prefer it not be used but that's just my preference.
The most heinous "[Company] disclaimer" shown above appears to be their actual message displayed upon blocking a web site on their network. Why not just say "access denied, go read our AUP" and be done with it? I responded back, asking for more information about the filtering software in question. Before I received a reply, yet another email hit my inbox:
From: A.G. (email@example.com)
To: firstname.lastname@example.org
Date: Sat, 16 Jun 2007 17:44:45 -0400
Subject: Interesting with regards to Websense.

Looks like Attrition is a hotbed of racists, extremists, and hatemongers. At least according to Websense. Attrition has now been blocked under the dubiously humorous category "Racism and Hate". It seems Websense Inc. or whoever rules their 'block list' has a hard on for you guys. It also seems that I'll be getting my dose of sarcasm and infosec news elsewhere. Well, however long it takes me to find another working proxy. Good luck getting Websense to unblock Attrition.org.
For a touch of historical flavor, we will say that attrition.org was once categorized as "hacking", which is another category generally blocked by many companies that use content-filtering products. Attrition.org was later placed under the "Computer Security" category after a long e-mail campaign and many customer complaints, so the recategorization to "Racism and Hate" concerned us for two reasons:
We searched Websense's public web site in an attempt to find a link, address or form we could use to request a site recategorization. Nothing appeared to be available. We then obtained an email address for Websense's support division and contacted them to request that the site be reviewed and appropriately recategorized. After 48 hours, no response was received. After a few minutes of research and strategic e-mails, we found a friend who is an existing Websense customer. On his own time, he contacted Websense by phone and spoke with tech support personnel who directed him to a web link that would allow a customer recategorization request. The recategorization request was made, and within three hours, the following message was delivered via email to his account:
Thank you for writing to Websense. The sites you submitted have been reviewed and categorized accordingly:

http://attrition.org/ - Information Technology
So customers who pay for the product can request support for a recategorization, but sites which may be wrongfully categorized sometimes have no readily available means to request a manual peer review? That seems wrong, and it really does smack of "big business": we can affect the way you are perceived by millions, but if you're not paying us, go screw yourselves and find someone who is. In other words, if you're not our customer, you get no service, even if we screwed you... and worse, potentially libeled you.
So why was attrition.org recategorized as "Racism and Hate"? We don't really have an honest or truthful answer, but we do have a few ideas:
If one of these were the reason, why not apply that designation to /postal instead of the entire site? If it was for the mirror content, why not a custom message explaining that it is a security site mirroring content from criminal activity and is useful to law enforcement and security personnel, but may be offensive to some? Websense could use this type of custom message as a value-add, giving customers more reason to continue using their product. If you think this takes too much time and effort, do you really want to browse the web and get denied access to content based on three-second snap judgements of web sites and their material?
As described by Submicron, the impact of a "negative recategorization" affects more than just a company using a content-filtering product. It also affects the person who visited the site (should their visit be logged and flagged), and the site itself, which may have to jump through countless hoops in order to be fairly categorized. In some cases, reputations may be at stake, and it hardly seems right that companies with arbitrary control over viewable content should be able to unilaterally make decisions such as these and subject millions of users to them.