Website defacement detection techniques
Posted on October 14, 2009 in Research • 4 min read
Table of Contents
1. Website defacement
2. Anomaly detection systems
2.1 Checksum comparison
2.2 Diff comparison
2.3 DOM tree analysis
2.4 Complex algorithms
3. Signature detection
4. Thresholds and worst cases
1. Website defacement
A website defacement is the unauthorized substitution of a web page or
a part of it by a system cracker. A defacement is generally meant as a
kind of electronic graffiti, although recently it has become a means
to spread messages by politically motivated cyber protesters or
hacktivists.
This is a very common form of attack that seriously damages the trust
and the reputation of a website.
Detecting web page defacements is one of the main services of a security monitoring system.
Some time ago I wrote a small and smart application to detect website defacements at large scale, able to monitor a lot (thousands) of websites. It was a test to collect some statistics, so I tried to build it in a short time: I wrote it in a few days.
So I asked myself which techniques and technologies I could use to get the highest detection rate with the minimum effort.
I chose Ruby, with Ruby on Rails for the user interface and EventMachine to speed up performance.
With only a few days of development I couldn’t struggle with complex defacement detection algorithms, so I chose some very simple techniques which, after some months of testing, turned out to be very effective. The performance and detection rate of these “poor man’s” techniques are comparable to those of some commercial monitoring systems.
The key feature of the proposed techniques is that they do not require the installation of a component (like a HIDS) or the participation of the site maintainers. They require only the URL of the website to monitor.
Today I want to share this brainstorming about website defacement detection techniques.
2. Anomaly detection systems
Anomaly detection refers to detecting patterns in a given data set
that do not conform to an established normal behavior. The patterns
thus detected are called anomalies and often translate to critical and
actionable information in several application domains.
The defacement monitoring application needs to detect a change in a web page and decide whether it is “normal” or an “anomaly”.
To build the set of “normal” behavior, a preliminary learning phase creates a profile of the monitored web page; the website can then be monitored for “anomalous” changes.
The detection of a defacement is based on a dynamic threshold: if the web page changes beyond this threshold, the system treats it as defaced and raises a defacement alert.
This threshold is updated over time to avoid the obsolescence of its value and of the learning set.
2.1 Checksum comparison
The simplest way to detect a change in some text-formatted data, like an HTML page, is to compute and check its checksum with a hash algorithm like MD5 or SHA1.
Even the smallest change in the monitored web page generates a different checksum, so you can detect a defacement.
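In Ruby this boils down to a few lines; a minimal sketch, assuming the baseline checksum was stored during the learning phase:

```ruby
require 'digest/sha1'
require 'net/http'
require 'uri'

# Compare the SHA1 of the freshly fetched page with the checksum
# stored during the learning phase: any change at all is a mismatch.
def defaced_by_checksum?(url, known_checksum)
  html = Net::HTTP.get(URI.parse(url))
  Digest::SHA1.hexdigest(html) != known_checksum
end
```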
This works well for “best of the ‘90s” websites which use only static content, but for today’s web pages, whose content changes at every reload, this technique is quite obsolete.
For example, a web page with a counter or a timer inside changes its content at every reload, so the checksum is different every time.
Moreover, this type of check cannot fit a threshold-based system, because the comparison yields only a true or false result.
2.2 Diff comparison
There are some libraries in Python and Ruby implementing the widely known Unix tool diff; using them we can get the difference between two web pages.
We can build a threshold-based system, learning the usual difference percentage of a web page and checking whether a changeset stays under that threshold.
This is a very fast yet very effective technique which works well on most dynamic sites.
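A minimal sketch in Ruby, assuming the diff-lcs gem and a line-oriented comparison against the baseline snapshot stored during the learning phase:

```ruby
require 'diff/lcs'   # gem install diff-lcs

# Percentage of lines that differ between the stored baseline and the
# freshly fetched page.
def change_percentage(baseline_html, current_html)
  old_lines = baseline_html.split("\n")
  new_lines = current_html.split("\n")
  changed   = Diff::LCS.diff(old_lines, new_lines).flatten.size
  total     = [old_lines.size, new_lines.size].max
  total.zero? ? 0.0 : changed * 100.0 / total
end

# Raise an alert only when the change exceeds the learned threshold.
def defaced_by_diff?(baseline_html, current_html, threshold)
  change_percentage(baseline_html, current_html) > threshold
end
```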
2.3 DOM tree analysis
This is a strategy similar to the diff comparison, but it uses the DOM tree instead of the plain HTML content for the comparison.
The layout of a website, its tags and attributes, changes little over time. Using this fact you can build a threshold-based system as above.
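A minimal sketch, assuming Nokogiri for the HTML parsing and reusing the diff-based percentage of the previous technique, this time on tag sequences instead of raw lines:

```ruby
require 'nokogiri'   # gem install nokogiri
require 'diff/lcs'   # gem install diff-lcs

# Reduce a page to the sequence of its element names, so that text
# changes are ignored and only structural (layout) changes are measured.
def dom_signature(html)
  tags = []
  Nokogiri::HTML(html).traverse { |node| tags << node.name if node.element? }
  tags
end

# Same threshold-based measure as the diff comparison, applied to
# the DOM tag sequences.
def dom_change_percentage(baseline_html, current_html)
  old_tags = dom_signature(baseline_html)
  new_tags = dom_signature(current_html)
  changed  = Diff::LCS.diff(old_tags, new_tags).flatten.size
  total    = [old_tags.size, new_tags.size].max
  total.zero? ? 0.0 : changed * 100.0 / total
end
```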
2.4 Complex algorithms
You can design a lot of algorithms, or use some of the already known ones, but this is a very expensive work. I haven’t used any complex logic or algorithm, but if you want to follow this path you can find a lot of academic papers in this field.
3. Signature detection
The web pages are examined for pre-configured and predetermined attack patterns known as signatures. Many attacks today have distinct signatures, and the collection of these signatures must be constantly updated to mitigate emerging threats. I used the wide database of Zone-H to build a signature set that is always up to date.
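A minimal sketch of such a scan; the patterns below are only illustrative placeholders, not the real Zone-H derived set:

```ruby
# A few illustrative signatures: a real set would be built from the
# Zone-H archive and refreshed regularly.
SIGNATURES = [
  /hacked\s+by/i,
  /owned\s+by/i,
  /defaced\s+by/i
]

# True if the fetched page matches any known defacement signature.
def matches_signature?(html)
  SIGNATURES.any? { |sig| html =~ sig }
end
```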
4. Thresholds and worst cases
The biggest effort is designing the engagement rules and tuning good thresholds.
The percentage of change of a website can vary over time; an evaluation that combines both anomaly detection and signature detection, using a weighted logic, can help to reduce false positives.
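As a sketch of what this weighted logic could look like (weights and cut-off are purely illustrative, not the values I used):

```ruby
# Hypothetical weighted combination of the two detectors: the anomaly
# score is the change relative to the learned threshold (capped at 1.0),
# the signature score is binary, and the weights are only an example of
# the tuning this step requires.
def defacement_score(change_pct, threshold, signature_hit)
  anomaly_score = threshold > 0 ? [change_pct.to_f / threshold, 1.0].min : 0.0
  0.6 * anomaly_score + 0.4 * (signature_hit ? 1.0 : 0.0)
end

# Alert only when the combined evidence is strong enough.
def alert?(change_pct, threshold, signature_hit)
  defacement_score(change_pct, threshold, signature_hit) > 0.7
end
```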
You must also remember to deal with website restyling, layout changes, and widgets or banners that can be removed or added.
As of today there are some worst cases that cause false negatives: defacements done via JavaScript (leveraging an XSS vulnerability) or via CSS, and partial defacements (do you remember the securityfocus.com defacement?) where only a part of the website, like an image or a banner, changes.