Showing newest posts with label rss. Show older posts

Tuesday, February 16, 2010

notes from moving to blogger custom domains

I've been moving my blogs over to blogger's custom domains since blogger will be stopping support for ftp sometime soon (now later than they announced). Here are some of the good and bad experiences I had during the move.

1. No way to customize the 'request access to this private blog page'.

Some of my blogs are private. I used to implement this with a php wrapper in the template that set a cookie based on a really simple question. My main goal was keeping out robots, not real people. With custom domains the option I chose was to have a private, invite-only blog. That's all cool, but the page you get when you go to the blog not logged in, or logged in without permission, provides no way at all to request permission to read the blog or to contact the blog author. Nor is there any way to customize the text on the page to tell people how to get access to the blog.

2. No way to set the favicon

Favicons are important to me. I have too many tabs open in Google Chrome to not use favicons. The default favicon is the blogger favicon, which is annoying since it's the same as the blogger compose favicon too. I tried adding 'link rel="icon"' higher or lower down in the header and Chrome just ignored it.

In the end I copied some javascript that replaces the favicon. It's not perfect but it works most of the time. Webmaster Central still uses the blogger icon which is annoying.

3. Alternative/Feed information in the header

The Blogger feed page is super confusing. I can put my feedburner url in there but alas it never seems to show up in the source of the served page. Then people will subscribe to the blog via the blogger url, which I may want to change later. The subscribe gadget/buttons do this too. Plus it also puts links to rss versions of the feed in the header, which I don't have feedburner versions of. It's horrible that so many years after Google bought feedburner it's still not integrated well.

Also I have lots of subscribers who are subscribed to a url on the old site. I can't have the new custom domain replace the old site since I still need these links to work. Don't tell me about the failover site, because that doesn't work when you have a private blog!

In the end I wrote more javascript to strip out all the old feed information and replace it with my feed information. I also use my own subscribe buttons with my own urls in them.
I had to move all my blogs to custom domains which were different than the original blog urls. Now I need to wait for Google to index my public blogs even though the content didn't really change.

4. Those annoying screwdriver/wrench icons to edit gadgets! I turned off quicklinks.

I turned off quicklinks and yet blogger still insists on showing me the quick edit tools for all my gadgets, this is especially annoying on my private blogs where I must be signed in!
I ended up getting rid of these by hand editing the expanded template html, a terrifying experience.

5. oauth feeds have drafts

As I documented before I fetch the feeds to my private blogs and make restricted versions of those feeds available semi-publicly. I was surprised to find out that feeds fetched via oauth had the draft blog posts in them too! I should have made a google account with only read access and fetched the blog feed using that (then getting my oauth stored credentials would have only given you access to read the blog, not post to it too!).

6. search gadgets for label or subscribe

The tools to search the layout gadgets are awful. I searched for label or labels and didn't get the main google provided label gadget (or it was hard to find). Same with trying to find a subscribe gadget (which ended up not working for me anyway; see my complaint about feeds).

7. Redirecting the Old Blog to the New Blog.

O.K. enough complaining. I've listed how I fixed a few issues. Here's some other fixes I found useful:
For my public blogs it was easy to modify .htaccess files to redirect traffic to the new blog. Here are some rules that worked for me. Note that archives used to end with .php but now end with .html. It seems blogger keeps the file extensions for old posts so they still have .php at the end.

RedirectMatch 301 /scripts/(2[0-9]*.*) http://blog.wtwf.com/$1
RedirectMatch 301 /scripts/labels/(.*)\.php http://blog.wtwf.com/search/label/$1
RedirectMatch 301 /scripts/archive/(.*)\.php http://blog.wtwf.com/$1.html
RedirectMatch 301 /scripts$ http://blog.wtwf.com/
RedirectMatch 301 /scripts/$ http://blog.wtwf.com/
RedirectMatch 301 /scripts/index\.php$ http://blog.wtwf.com/
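To sanity check rules like these before putting them live, here is a rough Python sketch that mimics them. It assumes mod_alias tries the rules in the order written, takes the first match, and treats each pattern as an unanchored regex; the function name redirect_target is made up for this example and none of this is Apache's own code.

```python
import re

# The same rules as the .htaccess above, with $1 written as \1 for Python.
RULES = [
    (r"/scripts/(2[0-9]*.*)", r"http://blog.wtwf.com/\1"),
    (r"/scripts/labels/(.*)\.php", r"http://blog.wtwf.com/search/label/\1"),
    (r"/scripts/archive/(.*)\.php", r"http://blog.wtwf.com/\1.html"),
    (r"/scripts$", "http://blog.wtwf.com/"),
    (r"/scripts/$", "http://blog.wtwf.com/"),
    (r"/scripts/index\.php$", "http://blog.wtwf.com/"),
]


def redirect_target(path):
    """Return the URL the first matching rule would redirect to, or None."""
    for pattern, template in RULES:
        match = re.search(pattern, path)  # unanchored, like RedirectMatch
        if match:
            return match.expand(template)
    return None
```

Feeding it a handful of old urls (a post, a label page, an archive page) is a quick way to spot a rule that swallows more than it should.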

For my private blogs I wanted to put up an alert that the blog had moved but give a way to contact me. I did it with the following php that I added to the template so it appeared on every page:

<style>
.alert {
  background-color: #f00;
  color: #fff;
  padding: 50px 0px 50px 5%;
  margin: 50px 0px 50px 5%;
  border: double 3px #000;
  font-weight: bold;
  text-align: center;
}

.alert a, .alert a:hover {
  color: #fff;
}
</style>
<div class="alert">
This blog has moved!

The new location of this page is:
<?php echo $newloc ?>

If you are unable to read that page then please Email us using the Email
link on the left.
</div>

I added email addresses using javascript so they're not easily harvestable.

I do like that I can add javascript to my new blogger template by adding a html/javascript gadget, that worked pretty well for me.

Monday, February 15, 2010

more private feeds

I've posted before about my little script to make atom feeds more private by expiring old posts by replacing the text, here and here. I've also posted about weak password protecting your sites. This was all that I did to have a blog that was private from the search engines but easy for people to read either in google reader or via email or in their browser. I did this using blogger.com's sftp support (I also ran a chrooted sftp server). Seems that my paranoia has fueled quite a few posts here in the past, eh!

Well, Blogger has decided to stop supporting sftp publishing so I've had to find another solution. Unlucky for me, I like the blogger posting UI, but I think I found a good solution that will allow me to keep my content out of search engines and still make it accessible to folks that I want to read it. I'm not talking about this blog by the way, I'm talking about my daughter's blog and my personal blog. I'm going to have a blog that is invite only: you need to be specifically invited to read it, otherwise you end up at a rather unhelpful 'you need access' page with no way to contact me to ask for access.

To make my blog readable in google reader all I needed to do was get the feed and make it available. I thought of a few ways to do this. First plan was to subscribe to the blog and then get some kind of email to rss gateway worked out. I wish mailman supported this but it doesn't. I did manage to get a patch to do it, but never worked on it because by the time I got the patch I had also found my final solution. I wrote a Google AppEngine app that stored oauth credentials and fetched the feed from blogger, trimmed it and published it for all to see. I used code from the gdata python blogger oauth example and salmon protocol and ended up with my own 'Feed App' (say it with a New Zealand accent). Check out the source code at:

http://code.google.com/p/wtwf/source/browse/#svn/trunk/feeds/feedappwtwf

I plan to extend it to implement the expandRss functionality in a server that should be more reliable than my home network connection, but for now it does the job and that makes me happy.

I encountered a few problems while writing it. The oauth code required you to be logged in, but I wanted some of the urls to work without being logged in. The oauth.py code failed silently when the user wasn't logged in (very annoying!). I finally tracked it down and used users.create_login_url('/oauth/request_token') to make sure the user was logged in. I also found out that when you get the feed via oauth you end up getting all the draft blog posts too, so I had to write in a quick check to filter those out. I guess I really should make a google account that can only read the blogs; then I might only get a feed with the real blog posts in it.
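For the curious, a draft filter along these lines can be sketched in a few lines of Python. This is not the code from the Feed App. It assumes drafts are flagged with an app:draft element containing "yes" inside app:control, and it checks two namespace URIs that have been used for the Atom Publishing Protocol; the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumption: drafts carry <app:control><app:draft>yes</app:draft></app:control>
# in one of these two namespaces.
APP_NAMESPACES = ("http://purl.org/atom/app#", "http://www.w3.org/2007/app")


def is_draft(entry):
    """True if the entry carries an app:draft element whose text is 'yes'."""
    for ns in APP_NAMESPACES:
        draft = entry.find("{%s}control/{%s}draft" % (ns, ns))
        if draft is not None and (draft.text or "").strip().lower() == "yes":
            return True
    return False


def strip_drafts(atom_xml):
    """Return the feed as a string with every draft entry removed."""
    ET.register_namespace("", ATOM_NS)  # keep Atom as the default namespace
    root = ET.fromstring(atom_xml)
    for entry in root.findall("{%s}entry" % ATOM_NS):
        if is_draft(entry):
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")
```

Fetching the feed with a read-only account would still be the safer fix, since a filter like this only helps if the draft flag is really there.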

Now I'm off to work out how to point all my old blog posts at the new blog that's living in a subdomain and to play with the new fancy blogger templates, whee!

Sunday, June 29, 2008

Moving to google code

I'm slowly moving all my code over to being hosted on Google Code. This will allow me to learn how to use subversion and also allow me to have RSS Feeds for my checkins. I can also have a wiki page per project and then let this blog be a blog about updates to each of the projects (or more likely new projects, I mean, really, who wants to maintain their old code).

I've already checked in some stuff, some of it hasn't ever been released before! Go check it out at: http://wtwf.googlecode.com/

If you want to follow the feed of stuff I check in the feed url is: http://code.google.com/feeds/p/wtwf/svnchanges/basic

If you use google reader, you can get a preview of what the source code changes feed looks like in google reader.

So far I'm liking it a lot. You can't beat the price!

Perhaps one day I'll even have people submit fixes (yay!), or file bugs (boo!) using the google code interface.

Saturday, May 24, 2008

Jaiku

I've been playing around with Jaiku, Google's purchased Twitter clone/competitor. I also played with Twitter at the same time (actually revived an account I made ages ago). Twitter sucks, it's way way too slow to even approach usability so I gave up on it. Jaiku is zippy fast which is likely a function of the number of users of each service, but perhaps one scales better than the other.

I've had a need for something that's not quite blogging, but still blogging. I try to keep my posts down, so that when they do show up they might be somewhat interesting, but I also have random blabbering, thoughts and observations I'd like to note and perhaps aggregate into a post at some point. Micro-blogging seemed ideal for this. But I also didn't want a high barrier to entry so Jaiku's IM robot seemed like a good idea. You just IM them a message and it shows up in your jaiku account. Perfect for easy Jaiku-ing. However the robot has an annoying habit of sending you messages about what everyone else is doing. I guess that's the point, but I find it annoying. Especially in gmail where it flashes the title when you have an unread IM from Jaiku.

Grouping together posting and reading in the same IM robot is like making my blogger post page also be my google reader. They should be two separate activities that I can merge if I like.

Looks like I'm going to have to learn a Jaiku API and how to write my own jabber robot (in python!). Perhaps I'll host it on Google App Engine (I wonder if you can for things like this?).

As an aside, getting the IM set up was a nightmare since my Google Apps domain wasn't exporting SRV DNS records for jabber. The DNS hosting (chosen by Google) was down for editing so I had to deal with their support department (ugh!). Here is the magic command to make sure you have it set up right (replace example.com with your domain name):
dig srv _xmpp-server._tcp.example.com

Wednesday, April 23, 2008

unwelcome user agents

The following useragents are not welcome on my website:

SetEnvIfNoCase User-Agent "^Biz360" bad_bot
SetEnvIfNoCase User-Agent "^Blogslive" bad_bot
SetEnvIfNoCase User-Agent "^Cazoodle" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^FeedLounge" bad_bot
SetEnvIfNoCase User-Agent "^OmniExplorer" bad_bot
SetEnvIfNoCase User-Agent "^Sphere" bad_bot
SetEnvIfNoCase User-Agent "^SurveyBot" bad_bot
SetEnvIfNoCase User-Agent "^edgeio" bad_bot
SetEnvIfNoCase User-Agent "^ia_archiver" bad_bot
SetEnvIfNoCase User-Agent "^nutch" bad_bot
SetEnvIfNoCase User-Agent "^panscient.com" bad_bot
SetEnvIfNoCase User-Agent "^ping.blo.gs" bad_bot
SetEnvIfNoCase User-Agent "^topicblogs" bad_bot
SetEnvIfNoCase User-Agent "^Moreoverbot" bad_bot
SetEnvIfNoCase User-Agent "^BlogSearch" bad_bot
SetEnvIfNoCase User-Agent "Twiceler" bad_bot
SetEnvIfNoCase User-Agent "^BlogPulse" bad_bot
SetEnvIfNoCase User-Agent "FreeMyFeed" bad_bot
# SetEnvIfNoCase User-Agent "^Java" bad_bot
I didn't have the courage to deny all Java folks.

Then I just have this in my apache config

<Directory /home/ark/html/>
Order allow,deny
allow from all
Deny from env=bad_bot
</Directory>
I'll try and keep this post up to date. Mostly you get on this list if you're a robot that's crawling (and indexing) my rss feeds.
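If you want to check a user agent string against a list like this without touching apache, here is a small Python sketch. It assumes SetEnvIfNoCase does an unanchored, case-insensitive regex match (so a leading ^ is the only thing pinning a pattern to the start), and it only includes a handful of the patterns above; is_bad_bot is a made-up name.

```python
import re

# A few of the patterns from the list above, copied as-is.
BAD_BOT_PATTERNS = [
    "^Biz360",
    "^EmailSiphon",
    "^ia_archiver",
    "Twiceler",
    "FreeMyFeed",
]

# Compile once, ignoring case like SetEnvIfNoCase does.
BAD_BOT_RES = [re.compile(p, re.IGNORECASE) for p in BAD_BOT_PATTERNS]


def is_bad_bot(user_agent):
    """True if any pattern matches somewhere in the user agent string."""
    return any(r.search(user_agent) for r in BAD_BOT_RES)
```

Running it over the user agents in an access log is an easy way to see who a new pattern would lock out before adding it.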

Tuesday, April 22, 2008

feedFixer

feedFixer takes an Atom feed and inserts as many elements as it can to indicate that this feed is private and should not be indexed by search engines. It also modifies old entries in the feed and 'expires' them, replacing the title and the text.

Download: feedFixer

It's designed to be set up and run every hour (or minute) and only kicks into action when it notices one of the source feeds is newer than the destination feed.

It's configured by a simple python file ~/.feedFixerrc
# -*- Python -*-
global BASE_DIR, FEEDS

BASE_DIR = os.path.expanduser("/home/blogger")
FEEDS = (FeedFixer("html/atom.fromblogger.xml",
                   "html/atom.xml",
                   url="http://example.com/atom.xml"),
         )
or you could just modify the FEEDS global variable in the source file (but that will make upgrading harder).

I do this because Google Reader keeps posts around forever. If you subscribe to a feed you can then read everything that was ever posted to the feed. Having that much of my data in Google's control didn't make me feel comfortable so I wanted a way to take it back. Tombstoning works for that. It also means that other feed crawlers who might find my site will only get one month's worth of posts to crawl at any one point, and that should be a moving window with old posts replaced with tombstones when they crawl again.
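To make the tombstoning idea concrete, here is a minimal sketch using xml.dom.minidom. It is not feedFixer's actual code: the 30 day window, the placeholder text and the use of each entry's published element as the date are all assumptions made for the example, and tombstone_old_entries is a made-up name.

```python
import datetime
from xml.dom import minidom


def _set_text(entry, tag, text):
    """Replace the contents of every <tag> inside entry with plain text."""
    for node in entry.getElementsByTagName(tag):
        while node.firstChild:
            node.removeChild(node.firstChild)
        node.appendChild(node.ownerDocument.createTextNode(text))
        if node.hasAttribute("type"):
            node.setAttribute("type", "text")


def tombstone_old_entries(atom_xml, now, max_age_days=30):
    """Blank the title and content of entries older than max_age_days."""
    doc = minidom.parseString(atom_xml)
    cutoff = now - datetime.timedelta(days=max_age_days)
    for entry in doc.getElementsByTagName("entry"):
        # Assumption: every entry has a <published> date like 2008-04-20T...
        published = entry.getElementsByTagName("published")[0].firstChild.data
        when = datetime.datetime.strptime(published[:10], "%Y-%m-%d")
        if when < cutoff:
            _set_text(entry, "title", "Expired")
            _set_text(entry, "content", "This post has expired.")
    return doc.toxml()
```

The entry ids are left alone on purpose: a reader only overwrites its stored copy of a post if the tombstone arrives with the same id.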

I've written about feeds and search engines before but thought feedFixer deserved its own page.

Implementation notes:
I started off using ElementTree to navigate the XML. It was so much easier, but its handling of namespaces on output rendered the output useless. I could work out how to name each namespace, but I wanted a default one with no prefix at all and couldn't do it. I moved over to using xml.dom.minidom, which worked pretty well, and the output was lovely!
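Here is a tiny illustration of why minidom was pleasant for this: it treats the xmlns declaration as an ordinary attribute, so a round trip keeps the default namespace with no invented prefixes. The feed string and the roundtrip name are made up for the example.

```python
from xml.dom import minidom

SAMPLE = '<feed xmlns="http://www.w3.org/2005/Atom"><title>x</title></feed>'


def roundtrip(xml_text):
    """Parse and re-serialise the document element with minidom."""
    doc = minidom.parseString(xml_text)
    return doc.documentElement.toxml()


output = roundtrip(SAMPLE)
print(output)
```

Newer versions of ElementTree can get a similar result by registering an empty prefix with register_namespace, so this is less of a problem than it used to be.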

I want to change a few more things but I wanted to get this out there first:

TODO(ark) random text replace: Allow you to replace any text in the feed. Blogger feeds are full of the blog's id, and from this you can construct a feed url that gets all this data you're hiding direct from blogger. However doing this would make all the posts appear as new (new id's) so I didn't want to do it. Might be useful though.

TODO(ark) replace link too: when I tombstone a post I leave the link in there, but you might want to replace it with a standard link, or make it empty.

Final touches:
Once you've expired all the old items in the current feed and Google Reader's crawler has picked it up, you should see your old items tombstoned in google reader. However, if you keep going back you'll see your really old posts are still there. This is because they were too old to be in the current feed. You can make a new feed with all your items in it; if you're using blogger all you need is your blog's id number. It'll be up there at the top of the page when you start a post to your blog, or you can find it in the feed. It looks like this: 10693130
wget 'http://www.blogger.com/feeds/1YOURBLOGNUMBER1/posts/default?start-index=1&max-results=200' -O fullfeed.xml
Now run feedFixer over that file and copy the result to the place where the feed for your blog lives. Allow a few hours or a day for Google Reader to pick it up and ALL your old posts should now be tombstoned! If you see one or two old posts that you posted but deleted there is a fix for that too: label each post in google reader with a tag (I use 'delete'), then go into settings for google reader and under tags make your delete tag public. View the public page and copy the link for the feed for that page. Now do this on the command line:
wget 'PASTE_THE_FEED_URL_FROM_YOUR_LABEL_PAGE' -O - | xmllint --format - | fgrep original-id
Now you have a list of ids for those posts. Go back and make your label page private again. Get your feedFixed full feed and add a few more entries to it (xmllint --format makes this a pleasure to edit) and paste in those original-ids. Once google reader's feed fetcher picks them up they'll get tombstoned too.
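If you'd rather not eyeball fgrep output, the same ids can be pulled out with a few lines of Python. This sketch assumes the ids show up as attributes whose name ends in original-id (which is what the fgrep above is hunting for) and it deliberately doesn't care which element carries them; original_ids is a made-up name.

```python
from xml.dom import minidom


def original_ids(feed_xml):
    """Collect the value of every attribute named ...original-id, in order."""
    doc = minidom.parseString(feed_xml)
    found = []
    for node in doc.getElementsByTagName("*"):  # every element in the feed
        attrs = node.attributes
        for i in range(attrs.length):
            attr = attrs.item(i)
            if attr.name.endswith("original-id"):
                found.append(attr.value)
    return found
```

Save the wget output to a file, read it in and pass the text to original_ids to get a plain list you can paste from.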

protecting your blog:
In my next post I'll show you how to protect your blog from search engines while still letting your users in with a minimum amount of hassle.

Saturday, August 18, 2007

My Feed is Private KTHXBYE

This post has been superseded by my post on feedFixer.


Did you know... even if your site is completely blocked via robots.txt, if you publish a feed and someone subscribes to it in bloglines or other web based readers, all your content in that feed is considered fair game (at least to bloglines it is, see this excellent explanation from feedburner). But don't worry, help is at hand: there are at least TWO standards on how to block people from indexing your rss content. The leader appears to be bloglines' own: http://www.bloglines.com/about/specs/fac-1.0 and there's also a w3c one that no one links to so I'm guessing it's dead in the water? http://www.w3.org/TR/access-control/. Bloglines really shows why w3c is generally unusable. w3c doc, completely unreadable and no simple examples, old and busted. bloglines doc, small, simple, easy examples to follow, new hotness.

Sadly Blogger.com which I use to manage my blogs doesn't allow me to turn this on for my feeds so I had to write something that did it for me, since I was changing my feeds anyway. I figured I might as well start using feedburner.com as well to get better stats on how many people are reading my blogs.

Hopefully this change will go completely unnoticed, I'm redirecting via .htaccess with a RedirectPermanent so I think the feed readers should update their stuff and stop hitting the old urls soon?

I wrote a small program to fix a feed and make it private. I present to you feedFixer, you run it, it takes a feed from one file and writes the same feed (but with private flags in it) to another file.
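To give an idea of what a "private flag" looks like, here is a sketch that adds a bloglines style access restriction to an Atom feed with minidom. This is not feedFixer itself. The element name, the relationship="deny" attribute and the namespace are taken from my reading of the bloglines spec linked above, so treat them as assumptions and check the spec; mark_private is a made-up name.

```python
from xml.dom import minidom

# Assumption: namespace and element as described in the bloglines FAC spec.
FAC_NS = "http://www.bloglines.com/about/specs/fac-1.0"


def mark_private(atom_xml):
    """Return the feed with an access:restriction deny element added."""
    doc = minidom.parseString(atom_xml)
    feed = doc.documentElement
    feed.setAttribute("xmlns:access", FAC_NS)
    restriction = doc.createElement("access:restriction")
    restriction.setAttribute("relationship", "deny")
    # Put the flag first so it is easy to spot in the output.
    feed.insertBefore(restriction, feed.firstChild)
    return doc.toxml()
```

Whether a given reader or crawler honours the flag is up to them, which is why expiring old posts is still worth doing.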

I like how it's configured: you have a file in ~/.feedFixerrc that is plain python, and this file gets evaluated when feedFixer is run. An example file looks like this:

# -*- Python -*-
global BASE_DIR, FEEDS

BASE_DIR = os.path.expanduser("~/html")
FEEDS = (("blog/atom.public.xml", "blog/atom.private.xml"),
)
You can add many feeds. It won't run if the destination file is newer than the source file. I just run it from a crontab every 10 minutes with this:

1-51/10 * * * * feedFixer >feedfixer-log.txt 2>&1
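The "only do work when the source changed" check is just a modification time comparison. Here is a rough sketch of the idea (not the actual feedFixer code; needs_update is a made-up name), which is what makes it cheap to run from cron every few minutes:

```python
import os


def needs_update(source, destination):
    """True if destination is missing or older than source."""
    if not os.path.exists(destination):
        return True
    return os.path.getmtime(source) > os.path.getmtime(destination)
```

A run that finds nothing newer can exit straight away without parsing any XML.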
Strangely enough this blog is the one I actually want the content indexed for and searchable. It's my other personal blogs (me, baby, house) that I don't want indexed.

Saturday, July 14, 2007

Public Google Bookmarks Feeds

I used to use del.icio.us for my bookmarks. There were two things I really liked about it, private bookmarks and public bookmark feeds. There were many things I didn't like about it, it was slow, foxylicious was kinda funky to show my bookmarks in my browser and adding bookmarks was really slow as the browser popped up a whole new window and it was a little too public. Plus I always, always had to pause and think really hard about where to add those damn periods when I was typing in the url. However the public feeds were great, I could subscribe to them in Google Reader and then embed them on my web pages and even do some simple jiggery pokery so they'd look just like regular lists and just like part of my template. That was the "Recent Links" you'd see on the side, go ahead and look, it's back now and that's really the point of this post....

When Google Toolbar started to integrate Google Bookmarks I decided to make the switch. I used Mihai's script to move over all my del.icio.us bookmarks. It was going great, I was easily adding bookmarks by clicking on the star in my toolbar (I really only use toolbar for that now and the very fancy 'UP' button (a feature so trivial it never seems to get mentioned on any google toolbar pages)). However I forgot all about my links feeds for my guests! They were going stale for over a year. Eventually I just removed that section from my website, but I missed it and wanted it back.

Google Bookmarks have rss feeds, but they're protected behind some authentication. I finally worked out how to authenticate to Google Services (without sending my google password over basic auth (sheesh!)). You need to get an SID cookie value and then pass that in with future requests. The GData API provided something that was pretty close to what I wanted, and I finally hacked something together. It runs every hour and gets feeds and html for some queries. I implemented private bookmarks by having a label called _private that I remove from all results.
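The _private trick is simple enough to show in a few lines. This sketch reads it as "drop every bookmark carrying the _private label before publishing"; the list of dicts is a made-up structure for the example and not what bookmarks.py actually uses.

```python
PRIVATE_LABEL = "_private"


def public_bookmarks(bookmarks):
    """Keep only the bookmarks that do not carry the private label."""
    return [b for b in bookmarks if PRIVATE_LABEL not in b.get("labels", [])]


# Hypothetical data, just to show the shape the function expects.
example = [
    {"title": "feedFixer", "labels": ["rss", "python"]},
    {"title": "bank login", "labels": ["_private", "money"]},
    {"title": "no labels at all"},
]
visible = public_bookmarks(example)
```

Anything without the label goes out in the public feed, so the default is public and you have to remember to tag the things you want hidden.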

It also runs as a monitor every day from a crontab to make sure it's still running.

bookmarks.py and .bookmarksrc

Hope you find it useful.

Saturday, November 19, 2005

expandRss

Since I discovered the joy that is blogs, rss and of course Google Reader I've found one or two feeds that don't quite provide the data I want. Usually they just provide a small summary and a link to the full story, or some crucial part of the post is missing. It started off with woot.com's feed which shows you the woot but not the price (what's the point in that?) so I wrote a small python script that fetched the feed and the price and updated the feed. Then it was b3ta.com's newsletter that had one article per newsletter without the contents of the newsletter, so I changed my script to get that too, and even make a whole bunch of articles with the link be the url that each little snippet points to. Then came snopes.com which was kinda generic, it used to point to a page with the content, so I made it scrape that content and replace the description with the scraped content.

So now I share my script here, hopeful that it'll be useful to someone. I'm not publishing my feeds since my goal isn't to upset the feed publishers/sites. If you make changes or find it useful please let me know.

expandRss
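The generic case (follow each item's link and swap the summary for the real content) boils down to something like the sketch below. It is not the expandRss script: it assumes a plain RSS 2.0 feed with item, link and description elements, and it takes the fetching and scraping step as a function you pass in, since that part is different for every site. expand_feed and fetch_content are made-up names.

```python
import xml.etree.ElementTree as ET


def expand_feed(rss_xml, fetch_content):
    """Replace each item's description with fetch_content(item link)."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # nothing to follow for this item
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        description.text = fetch_content(link.strip())
    return ET.tostring(root, encoding="unicode")
```

Keeping the fetcher separate also makes it easy to try the feed rewriting with a fake function before pointing it at a real site.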