15.0 - Web Master Trickery
15.1 - Deep Mining
15.1.1 - The Problem(s)
One of the problems facing those of us who manage web sites is deep mining (often called hotlinking). This is a technique whereby someone at another web site places an image from your site on their web page by pointing to the image URL on your site. When visitors to their site view that page, the image is served from your own server, using your server's CPU time and your network bandwidth, both of which are limited resources that you pay for. In most cases, there is no benefit to you, just the loss of bandwidth and time.
Typically, webmasters deal with this by editing the Apache configuration file and telling the web server not to serve images for any request whose HTTP_REFERER value (the HTTP Referer header sent by the browser) names a web site that is known to be deep mining your images. This works, but there are a couple of problems. First, not all web browsers send a Referer header, and second, you have to be reasonably vigilant about watching your server logs to see who is deep mining your images so you can keep your configuration file updated, which can be a drain on your own time and resources.
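The blacklist decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not an Apache configuration; the host names and the function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist, built by hand from what shows up in the server logs.
BLACKLISTED_HOSTS = {"leech.example.com", "badsite.example.net"}

def referer_host(referer):
    """Return the host named in a Referer value, or None if it is absent or unusable."""
    if not referer:
        return None
    try:
        return urlsplit(referer).hostname
    except ValueError:
        # Malformed value; treat it the same as a missing Referer.
        return None

def serve_image_blacklist(referer):
    """Blacklist policy: refuse only when the Referer names a known offender."""
    return referer_host(referer) not in BLACKLISTED_HOSTS
```

Note that a request with no Referer at all is served, and so is a request from any deep-mining site that has not yet been added to the list, which is exactly why the list needs constant attention.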
A slight variation on this theme is to refuse to serve images to any web browser that does not specifically provide an HTTP_REFERER showing that the request came from a page on your site. This works very well in that it eliminates any need for you to watch your server logs (everyone anywhere else is locked out, so there is no need to watch for them), but it has the rather serious downside of preventing legitimate users of your site who are not providing HTTP_REFERER from seeing your images. This, for most, is unacceptable. Furthermore, the trend seems to be for more and more firewalls to (very unwisely, but that's another rant) strip the HTTP_REFERER value, so as time goes on, more and more of your legitimate visitors will be unable to see your site's images when they visit.
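For comparison, here is a minimal Python sketch of this stricter "own site only" policy. The host names and the function name are hypothetical, and a real site would express this in its web server configuration rather than in Python.

```python
from urllib.parse import urlsplit

# Hypothetical list of host names under which your own site is reachable.
OWN_HOSTS = {"www.example.com", "example.com"}

def serve_image_own_site_only(referer):
    """Strict policy: serve only when the Referer names one of your own hosts."""
    if not referer:
        # No Referer supplied: the visitor is refused, legitimate or not.
        return False
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False
    return host in OWN_HOSTS
```

The first branch is the downside in a nutshell: a visitor whose browser or firewall withholds the Referer is indistinguishable from a deep miner and gets no images.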
What is needed is a new approach, one that never refuses a legitimate site visitor, doesn't require maintenance to keep blacklists (and/or whitelists) updated, yet prevents deep mining from becoming a serious drain on your resources.
15.1.2 - A Solution
First, all images are placed in a directory with a unique, non-public and unusual name, such as "xyzzyfunguwsimages". Second, all HTML pages are written to reference these images, but these HTML pages are in turn placed in a directory with a unique, non-public and unusual name, such as "xyzzymossyhtml". With this setup in place, each day at midnight a "cron" job runs which creates a new softlink with a random name, such as "akshd876_images", that points to the image directory called "xyzzyfunguwsimages". Then, each of the HTML files from "xyzzymossyhtml" is copied by a special script to the server's normal "html_docs" directory, and during the copy process, each reference to "xyzzyfunguwsimages" is modified to point to the new softlink, "akshd876_images". The old softlinks are then deleted, and the site is up and running with the images in a brand new and unpredictable location. The source copies of the HTML pages in "xyzzymossyhtml" are unmodified, and so are ready to do this again at any time.
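The rotation step might look something like the following Python sketch. It is not the actual script used on my sites; the function name, the "_images" suffix convention, and the assumption that all pages are "*.html" files in a single directory are illustrative only.

```python
import os
import secrets
from pathlib import Path

LINK_SUFFIX = "_images"  # assumed naming convention for the rotating softlinks

def rotate_image_link(image_dir, source_html_dir, public_dir):
    """Create a fresh random softlink to image_dir inside public_dir,
    republish the HTML pages so they reference it, then delete the old
    softlinks. Returns the new softlink name."""
    image_dir = Path(image_dir).resolve()
    source_html_dir = Path(source_html_dir)
    public_dir = Path(public_dir)

    # Unpredictable name for this run's softlink, e.g. "3f9c1a7be2d04c55_images".
    new_name = secrets.token_hex(8) + LINK_SUFFIX
    os.symlink(image_dir, public_dir / new_name, target_is_directory=True)

    # Copy each source page, rewriting the hidden directory name on the way.
    # The source copies are only read, never modified.
    for page in source_html_dir.glob("*.html"):
        text = page.read_text(encoding="utf-8")
        rewritten = text.replace(image_dir.name, new_name)
        (public_dir / page.name).write_text(rewritten, encoding="utf-8")

    # Delete the old softlinks only after the new pages are in place.
    for entry in public_dir.iterdir():
        if (entry.is_symlink() and entry.name.endswith(LINK_SUFFIX)
                and entry.name != new_name):
            entry.unlink()
    return new_name
```

A call such as rotate_image_link("xyzzyfunguwsimages", "xyzzymossyhtml", "html_docs"), with the full paths filled in for your server, would be run from cron; a crontab schedule of "0 0 * * *" fires it at midnight each day. The old softlinks are removed last so that the published pages never point at a link that does not exist yet.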
The end result is that anyone who wishes to deep mine your images will have to update the location of those images each and every day. Should that not prove effective, increase the frequency of the cron job to as often as once per hour. Your web pages will always work, as they always point at the correct location, but remote web sites will have obsolete URLs as soon as the next cron job runs.
I have been using this technique on a couple of commercial web sites that are image-heavy and were suffering from a great deal of deep mining, and we no longer have any kind of deep mining problem at all. Yes, you can get around this with a counter-script that reads the site on a regular basis, so it isn't perfect. However, that requires a level of sophistication that, so far, I have not seen applied to the problem. And of course, should someone go to such an extreme, you can simply blacklist them; it will be very uncommon, so you don't have to worry about constant maintenance as you would with a blacklist-only approach.

