feedFixer
Posted by ark
feedFixer takes an Atom feed and inserts as many elements as it can to indicate that the feed is private and should not be indexed by search engines. It also 'expires' old entries in the feed, replacing the title and the text with a tombstone.
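To make the expiry step concrete, here's a minimal sketch of tombstoning an entry with xml.dom.minidom (the library feedFixer ended up using; see the implementation notes below). The helper and the placeholder wording are illustrative, not feedFixer's actual code:

import xml.dom.minidom

ATOM_NS = "http://www.w3.org/2005/Atom"

def tombstone(doc, entry):
    # Replace the entry's <title> and <content> text with placeholders
    # so anything that cached the post only keeps a stub. The wording
    # here is made up; feedFixer's real replacement text may differ.
    for tag, text in (("title", "expired"), ("content", "This post has expired.")):
        for node in entry.getElementsByTagNameNS(ATOM_NS, tag):
            while node.firstChild:            # drop the old children
                node.removeChild(node.firstChild)
            node.appendChild(doc.createTextNode(text))

doc = xml.dom.minidom.parse("atom.xml")
for entry in doc.getElementsByTagNameNS(ATOM_NS, "entry"):
    tombstone(doc, entry)   # the real tool only expires *old* entries
with open("atom.out.xml", "w") as f:
    f.write(doc.toxml())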

Download: feedFixer

It's designed to be set up and run every hour (or minute), and it only kicks into action when it notices one of the source feeds is newer than the destination feed.
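That freshness test is presumably just a file-modification-time comparison, which is why running it from cron every hour is cheap. A sketch of the idea (hypothetical helper, not feedFixer's code):

import os

def needs_rebuild(src, dst):
    # Only do work when the destination is missing or older than the
    # source; otherwise the cron run is a no-op.
    if not os.path.exists(dst):
        return True
    return os.path.getmtime(src) > os.path.getmtime(dst)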

It's configured by a simple Python file, ~/.feedFixerrc:
# -*- Python -*-
global BASE_DIR, FEEDS

BASE_DIR = os.path.expanduser("/home/blogger")
FEEDS = (FeedFixer("html/atom.fromblogger.xml",
                   "html/atom.xml",
                   url="http://example.com/atom.xml"),
         )
Or you could just modify the FEEDS global variable in the source file (but that will make upgrading harder).
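For context, an rc file like this is usually consumed by exec'ing it in the program's own namespace, which is why the global statement and the bare FeedFixer name work. A minimal sketch of that pattern (the loader is my assumption, not feedFixer's actual code):

import os

def load_config(namespace):
    # Run ~/.feedFixerrc with the program's globals so its assignments
    # to BASE_DIR and FEEDS (and its use of FeedFixer) resolve against
    # the tool's own namespace.
    rc = os.path.expanduser("~/.feedFixerrc")
    if os.path.exists(rc):
        with open(rc) as f:
            exec(f.read(), namespace)
    return namespace

load_config(globals())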

I do this because Google Reader keeps posts around forever: if you subscribe to a feed, you can then read everything that was ever posted to it. Having that much of my data in Google's control didn't make me feel comfortable, so I wanted a way to take it back. Tombstoning works for that. It also means that other feed crawlers who might find my site will only get one month's worth of posts to crawl at any one point, and that window moves, with old posts replaced by tombstones when they crawl again.

I've written about feeds and search engines before, but I thought feedFixer deserved its own page.

Implementation notes:
I started off using ElementTree to navigate the XML, which was much easier, but its handling of namespaces on output rendered the output useless: I could work out how to name each namespace, but I wanted a default one with no prefix at all and couldn't get it. I moved over to xml.dom.minidom, which worked pretty well, and the output was lovely!
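To illustrate the difference: ElementTree of that era re-serialized every namespace with a generated prefix (ns0:, ns1:, ...), so a feed that came in with a clean default xmlns went out fully prefixed, whereas minidom round-trips xmlns declarations exactly as written. A small demonstration (atom.xml is a placeholder filename):

import xml.dom.minidom
import xml.etree.ElementTree as ET

# Old ElementTree turned <feed xmlns="http://www.w3.org/2005/Atom"> into
# <ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom">. Modern versions can
# restore the unprefixed default with
# ET.register_namespace("", "http://www.w3.org/2005/Atom"),
# but that API didn't exist in the stdlib when feedFixer was written.
print(ET.tostring(ET.parse("atom.xml").getroot()).decode())

# minidom treats xmlns as an ordinary attribute and preserves it, so the
# default namespace survives the round trip untouched.
print(xml.dom.minidom.parse("atom.xml").toxml())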

There are a few more things I want to change, but I wanted to get this out there first:

TODO(ark) random text replace: allow you to replace any text in the feed. Blogger feeds are full of the blog's id, and from it you can construct a feed URL that fetches all the data you're hiding directly from Blogger. However, doing this would make all the posts appear as new (new ids), so I didn't do it. Might be useful though.
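If that TODO were implemented, the simplest form is probably a plain substitution pass over the serialized feed before it's written out. A hedged sketch (the replacement table is made up; the id is the example one from later in this post):

# Hypothetical: blindly replace sensitive strings in the serialized feed.
# Per the caveat above, rewriting ids makes every post look new to
# readers, so a real version would need to leave <id> elements alone.
REPLACEMENTS = {"10693130": "REDACTED"}  # blog id -> placeholder

def scrub(xml_text):
    for old, new in REPLACEMENTS.items():
        xml_text = xml_text.replace(old, new)
    return xml_text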

TODO(ark) replace link too: when I tombstone a post I leave the link in there, but you might want to replace it with a standard link, or make it empty.

Final touches:
Once you've expired all the old items in the current feed and Google Reader's crawler has picked it up, you should see your old items tombstoned in Google Reader. However, if you keep going back you'll see your really old posts are still there: they were too old to be in the current feed. You can make a new feed with all your items in it. If you're using Blogger, all you need is your blog's id number. It'll be up at the top of the page when you start a post to your blog, or you can find it in the feed; it looks like this: 10693130
wget 'http://www.blogger.com/feeds/1YOURBLOGNUMBER1/posts/default?start-index=1&max-results=200' -O fullfeed.xml
Now run feedFixer over that file and copy the result to the place your blog's feed is served from. Allow a few hours or a day for Google Reader to pick it up, and ALL your old posts should now be tombstoned! If you see one or two old posts that you posted but deleted, there is a fix for that too: label each post in Google Reader with a tag (I use 'delete'), then go into Google Reader's settings and, under tags, make your delete tag public. View the public page and copy the link for that page's feed. Now do this on the command line:
wget 'PASTE_THE_FEED_URL_FROM_YOUR_LABEL_PAGE' -O - | xmllint --format - | fgrep original-id
Now you have a list of ids for those posts. Go back and make your label page private again. Take your feedFixed full feed, add a few more entries to it (xmllint --format makes this a pleasure to edit), and paste in those original-ids. Once Google Reader's feed fetcher picks them up, they'll get tombstoned too.

Protecting your blog:
In my next post I'll show you how to protect your blog from search engines while still letting your users in with a minimum amount of hassle.

Posted Tuesday 22 April 2008