Thursday, April 24, 2008

Password protecting your site

I wanted to loosely password protect my website: enough that search engines wouldn't index it and random people I didn't know wouldn't read it (without some work), but still easy as pie for people I did know to get in and see what they want.

The solution I came up with was a form that asks for a password (there are no user names). If you enter it correctly, it sets a cookie in your browser that should last for a year. Every time you visit the site your cookie's life is extended, so as long as you come to the site at least once a year it should never ask you for a password again. I made the password ridiculously easy: something anyone who knew me would know the answer to. I also made it possible to have a list of acceptable passwords, so you could ask for the name of one of my cats and any correct name would get you in.
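To make the idea concrete, here is a rough Python sketch of the two pieces described above: checking an answer against a list of acceptable ones, and building a cookie that lasts a year. It is not the code in restrict.php (that is PHP); the cookie name, the token value, the answer list and the case-insensitive comparison are all assumptions made for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical values, not taken from restrict.php.
ACCEPTED_ANSWERS = {"red", "blue"}
COOKIE_NAME = "site_access"
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def is_accepted(answer):
    """Return True if the answer matches any acceptable password.

    Comparing case-insensitively and ignoring surrounding whitespace is an
    assumption; it just keeps things easy for friendly visitors.
    """
    return answer.strip().lower() in ACCEPTED_ANSWERS


def access_cookie_header(token="ok"):
    """Build a Set-Cookie header whose cookie lasts one year.

    Sending this header again on every visit restarts the year, which is
    how the cookie's life gets extended.
    """
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = token
    cookie[COOKIE_NAME]["max-age"] = ONE_YEAR_SECONDS
    cookie[COOKIE_NAME]["path"] = "/"
    return cookie.output(header="Set-Cookie:")
```

A page would call is_accepted() on the submitted answer and, if it passes, send access_cookie_header() with the response. Since this is only loose protection, a fixed token like "ok" is enough for the sketch; anything stricter would want a value that is harder to guess.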

I didn't want to use basic HTTP auth because there's not enough room to explain why you're asking for a password or to give hints about what might work. It also requires a username, which further complicates matters, and you can only tell people what to enter after they have failed once. That's a bad user experience.

The devil's in the details, of course. I usually run Cookie Safe, which lets me choose which sites are allowed to set cookies, and I'm very frustrated when a page tries to set cookies, fails, but doesn't tell me. So I try to detect that scenario and report an error if it happens. If you're only allowing session cookies I try to set one of those too, but then you'll need to type a password next time you visit.

I also wanted it to be modular, so I could use the same code from many pages. So I made it a PHP include that you can use with only a small amount of code in the page you're protecting (your Blogger template, for example).

Here's how you use it:

1. Download restrict.php and save it somewhere on your server.

2. Create a page with the form users will enter the password on. A minimal example is included below:
<?php
if (isset($ARK_RESTRICT_ERROR)) {
print "<h1>$ARK_RESTRICT_ERROR</h1>\n";
}
?>
<p>Access to this site is restricted!</p>

<p>Please enter my favorite color:</p>
<form method="post" action="<?php echo CurrentPageUrl() ?>">
<input type="text" name="answer" />
<input type="submit" value="submit"/>
</form>
However, remember that this is the page search engines will see, so you might want to include a:
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />
at the top, and perhaps some better instructions or an explanation of why you do this. You might also want to use the same template you use on the rest of your site.

3. Now for every page you want to protect you need to add this at the very very top of the page:
<?php
$ARK_RESTRICT_ANSWERS = array('red', 'no', 'blue', 'arghhh');
$ARK_RESTRICT_FORM = '/www/html/form.php';
include_once('/www/html/restrict.php');
?>
Make sure you add it at the very very top, since it sets some cookies and sends a Location: HTTP header. If anything is output before it runs there will be errors.

Note how you provide the paths to the files on the web server machine (do not use URLs).

That's it, it should just work now. Hope you find it useful.

Possible improvements:

TODO(ark) add a long error description variable too
TODO(ark) try to set cookies using javascript and report an error if there is one before the user even tries a password.

Wednesday, April 23, 2008

unwelcome user agents

The following user agents are not welcome on my website:

SetEnvIfNoCase User-Agent "^Biz360" bad_bot
SetEnvIfNoCase User-Agent "^Blogslive" bad_bot
SetEnvIfNoCase User-Agent "^Cazoodle" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^FeedLounge" bad_bot
SetEnvIfNoCase User-Agent "^OmniExplorer" bad_bot
SetEnvIfNoCase User-Agent "^Sphere" bad_bot
SetEnvIfNoCase User-Agent "^SurveyBot" bad_bot
SetEnvIfNoCase User-Agent "^edgeio" bad_bot
SetEnvIfNoCase User-Agent "^ia_archiver" bad_bot
SetEnvIfNoCase User-Agent "^nutch" bad_bot
SetEnvIfNoCase User-Agent "^panscient.com" bad_bot
SetEnvIfNoCase User-Agent "^ping.blo.gs" bad_bot
SetEnvIfNoCase User-Agent "^topicblogs" bad_bot
SetEnvIfNoCase User-Agent "^Moreoverbot" bad_bot
SetEnvIfNoCase User-Agent "^BlogSearch" bad_bot
SetEnvIfNoCase User-Agent "Twiceler" bad_bot
SetEnvIfNoCase User-Agent "^BlogPulse" bad_bot
SetEnvIfNoCase User-Agent "FreeMyFeed" bad_bot
# SetEnvIfNoCase User-Agent "^Java" bad_bot
I didn't have the courage to deny all Java folks.

Then I just have this in my Apache config:

<Directory /home/ark/html/>
Order allow,deny
allow from all
Deny from env=bad_bot
</Directory>
I'll try and keep this post up to date. Mostly you get on this list if you're a robot that's crawling (and indexing) my rss feeds.
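For anyone wondering how those rules match, here is a small Python sketch of the same idea: each pattern is a regular expression tested case-insensitively against the User-Agent string, and a leading ^ means it has to match at the very start. This only illustrates the matching logic, not how Apache implements it, and the short pattern list and the function name are chosen just for the example.

```python
import re

# A few of the patterns from the list above; "^" anchors to the start of
# the User-Agent string, patterns without it can match anywhere.
BAD_BOT_PATTERNS = [
    r"^Biz360",
    r"^EmailSiphon",
    r"^ia_archiver",
    r"Twiceler",
]

# Compile once, ignoring case like SetEnvIfNoCase does.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in BAD_BOT_PATTERNS]


def is_bad_bot(user_agent):
    """Return True if any pattern matches the given User-Agent string."""
    return any(p.search(user_agent) for p in _COMPILED)
```

This is why Twiceler has no ^ in front of it: an unanchored pattern still matches when the bot's name appears in the middle of a longer user agent string.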

Tuesday, April 22, 2008

Search Results

I added search to this blog. If you're viewing at http://wtwf.com/scripts/ you can see it over there on the right.

Awesome!

feedFixer

feedFixer takes an Atom feed and inserts as many elements as it can to indicate that this feed is private and should not be indexed by search engines. It also modifies old entries in the feed and 'expires' them, replacing the title and the text.

Download: feedFixer

It's designed to be set up and run every hour (or minute) and only kicks into action when it notices that one of the source feeds is newer than the destination feed.
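The "only kicks into action" check can be as simple as comparing file modification times. Here is a minimal Python sketch of that kind of check; it is a guess at the approach, not the code from feedFixer, and the function name is made up for the example.

```python
import os


def needs_update(source_path, dest_path):
    """Return True if the destination feed is missing or older than the source."""
    if not os.path.exists(dest_path):
        # Nothing has been generated yet, so there is work to do.
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(dest_path)
```

With a check like this, running the script from cron every minute costs almost nothing when the source feed hasn't changed.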

It's configured by a simple Python file, ~/.feedFixerrc:
# -*- Python -*-
global BASE_DIR, FEEDS

BASE_DIR = os.path.expanduser("/home/blogger")
FEEDS = (FeedFixer("html/atom.fromblogger.xml",
"html/atom.xml",
url="http://example.com/atom.xml"),
)
or you could just modify the FEEDS global variable in the source file (but that will make upgrading harder).

I do this because Google Reader keeps posts around forever. If you subscribe to a feed you can then read everything that was ever posted to the feed. Having that much of my data in Google's control didn't make me feel comfortable, so I wanted a way to take it back. Tombstoning works for that. It also means that other feed crawlers who might find my site will only get one month's worth of posts to crawl at any one point, and that should be a moving window, with old posts replaced by tombstones when they crawl again.
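Here is a rough Python sketch of what tombstoning looks like: entries older than a cutoff keep their place in the feed, but their title and text are replaced with a notice. It is not the feedFixer source. The function names, the 30 day default and the notice text are made up for the example, and it assumes every entry has a published element holding a timestamp with a timezone offset (or a trailing Z).

```python
from datetime import datetime, timedelta, timezone
from xml.dom import minidom


def _set_text(element, text):
    """Replace everything inside an element with a single text node."""
    while element.firstChild:
        element.removeChild(element.firstChild)
    element.appendChild(element.ownerDocument.createTextNode(text))


def tombstone_old_entries(atom_xml, now, max_age_days=30,
                          notice="This post has expired."):
    """Return the feed with old entries' title and text replaced.

    `now` must be a timezone-aware datetime so it can be compared with the
    feed's timestamps.
    """
    doc = minidom.parseString(atom_xml)
    cutoff = now - timedelta(days=max_age_days)
    for entry in doc.getElementsByTagName("entry"):
        published = entry.getElementsByTagName("published")
        if not published or published[0].firstChild is None:
            continue  # no usable date, so leave the entry alone
        stamp = published[0].firstChild.data.strip()
        when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if when < cutoff:
            # Keep the entry (and its id) but blank out what it said.
            for tag in ("title", "content", "summary"):
                for node in entry.getElementsByTagName(tag):
                    _set_text(node, notice)
    return doc.toxml()
```

Because the entry and its id stay in the feed, a reader that re-fetches it should treat the tombstone as an update to the old post rather than as a new post.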

I've written about feeds and search engines before but thought feedFixer deserved its own page.

Implementation notes:
I started off using ElementTree to navigate the XML. It was so much easier, but its handling of namespaces on output rendered the output useless. I could work out how to name each namespace, but I wanted a default one with no prefix at all and couldn't do it. I moved over to xml.dom.minidom, which worked pretty well, and the output was lovely!

I want to change a few more things but I wanted to get this out there first:

TODO(ark) random text replace: Allow you to replace any text in the feed. Blogger feeds are full of the blog's id, and from this you can construct a feed url that gets all the data you're hiding directly from Blogger. However, doing this would make all the posts appear as new (new ids) so I didn't want to do it. Might be useful though.

TODO(ark) replace link too: when I tombstone a post I leave the link in there, but you might want to replace it with a standard link, or make it empty.

Final touches:
Once you've expired all the old items in the current feed and Google Reader's crawler has picked it up, you should see your old items tombstoned in Google Reader. However, if you keep going back you'll see that your really old posts are still there, because they were too old to be in the current feed. You can make a new feed with all your items in it. If you're using Blogger, all you need is your blog's id number. It'll be up there at the top of the page when you start a post to your blog, or you can find it in the feed. It looks like this: 10693130
wget 'http://www.blogger.com/feeds/1YOURBLOGNUMBER1/posts/default?start-index=1&max-results=200' -O fullfeed.xml
Now run feedFixer over that file and copy the result to the place where your blog's feed lives. Allow a few hours or a day for Google Reader to pick it up and ALL your old posts should now be tombstoned! If you still see one or two old posts that you posted but later deleted, there is a fix for that too: label each such post in Google Reader with a tag (I use 'delete'), then go into Google Reader's settings and, under tags, make your delete tag public. View the public page and copy the link to that page's feed. Now do this on the command line:
wget 'PASTE_THE_FEED_URL_FROM_YOUR_LABEL_PAGE' -O - | xmllint --format - | fgrep original-id
Now you have a list of ids for those posts. Go back and make your label page private again. Take your feedFixed full feed, add a few more entries to it (xmllint --format makes this a pleasure to edit) and paste in those original-ids. Once Google Reader's feed fetcher picks them up they'll get tombstoned too.

protecting your blog:
In my next post I'll show you how to protect your blog from search engines while still letting your users in with a minimum amount of hassle.

Tuesday, April 15, 2008

Archiving things on the web

I often see images as I'm browsing around that make me laugh a lot (myconfinedspace.com has been an especially good source recently). I'd like to keep them, but it needs to be as easy and seamless as possible. Here's what I do.

If I can, I select the image, open up Google Notebook and click on 'clip'.
Now it's in my Google Notebook.
However, Notebook used to serve a cached copy of the image and no longer does, so I have one more step to make sure I don't lose the image.
I have a cron job that runs every week and backs up my public notebook url and all the images referenced from it. It's easy to do with wget. Here's the command line I use:

wget -t 1 -T 15 -N -E -H -k -K -p http://google.com/notebook/public/NOTEBOOK_PATH
That makes a whole bunch of directories for each website and has all the images on the local disk. If an image vanishes off the net it should stay around in the directories and I'll still have it.

After it's downloaded you can run my wwwis script over it to fix all the image widths and heights too.