2004-06-16

Blog Backup & Wget

I posted a question to Blogger support about a week ago about backing up my blog. I had realized that I had all this work out there, but I was completely relying on their servers to keep it safe. I also had no easy way to move the data over to my own FTP server if I ever wanted to.

This was a few days before weblogs.com abruptly closed.

Blogger posted a solution on their help site under advanced topics. Their solution was to make some complicated manual configuration change to put all your blog entries on one page, then save it. I was not in a rush to do this. I was tempted to just go out and browse each entry and save it.
I had also set the option to have Blogger email me each post, and I set up a filter for the emails. But there is a problem with this solution: you only get the initial entry; you don't get any updates if you edit an entry.

Then I suddenly realized that I might be able to use wget to back up my blog. Wget is a command-line utility that can be used to fetch a web page.
It turned out to be even easier than I expected! All I had to do was add the '-r' (recursive) command-line switch, and it traced through all my entries and saved them to relative files on my hard drive. I was really impressed by all the great options in wget.
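
For anyone who wants to try the same thing, here is a minimal sketch of that recursive fetch. It is not my exact command line: the blog address is a placeholder, the mirror_blog function name and the WGET variable are made up for illustration, and '-np' is an extra switch I am assuming you want so that wget stays inside the blog instead of wandering up the site.

```shell
#!/bin/sh
# Sketch: mirror a blog with wget. Names here are illustrative only.
WGET="${WGET:-wget}"   # override to experiment, e.g. WGET=echo

mirror_blog() {
    # -r  : recurse through the links found on each page
    # -np : never climb above the starting directory
    "$WGET" -r -np "$1"
}

# Usage (example.blogspot.com is a placeholder address):
#   mirror_blog http://example.blogspot.com/
```

By default wget saves the pages under a directory named after the host, so the links between entries keep working as relative files.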

But I decided I wanted more. I wanted to back up my image files too (posted on my ISP web server and Hello's photos1.blogger.com server). So I wrote a shell script to extract all the image file URLs and put them in a list file to use with wget.
Then I updated the script to run the site wget first, and I set both wget commands to fetch only newer files. I run it on my Windows machine under Cygwin. I would imagine it would run fine under Linux or other UN*X platforms.
I am sharing my script under a GNU-like open source license: getmyblog.
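
To give an idea of the approach, here is a rough sketch of the image step. This is not the getmyblog script itself: the list_image_urls function, the directory name, and images.txt are hypothetical, and it assumes the images are referenced with absolute URLs in src="..." attributes and end in .jpg, .jpeg, .gif, or .png.

```shell
#!/bin/sh
# Sketch only, not the getmyblog script. Names are illustrative.
WGET="${WGET:-wget}"

# Print every absolute image URL found in src="..." attributes of the
# HTML files under directory $1, one per line, duplicates removed.
list_image_urls() {
    find "$1" -type f -name '*.html' -exec cat {} + |
        grep -Eio 'src="https?://[^"]+\.(jpe?g|gif|png)"' |
        sed -e 's/^[^"]*"//' -e 's/"$//' |
        sort -u
}

# Usage (directory and list file names are placeholders):
#   list_image_urls example.blogspot.com > images.txt
#   "$WGET" -N -x -i images.txt
#     -N : only fetch files newer than the local copy
#     -x : recreate host/directory structure locally
#     -i : read the URLs to fetch from the list file
```

The '-x' switch is my assumption about what is convenient here: with images spread over two servers, it keeps each host's files in its own directory.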

For those running Windows without Cygwin, I found that there is a Windows port of wget (and a bunch of other UN*X utilities) at unxutils.sourceforge.net. I have tested the wget.exe program, and it works great for a vanilla Windows backup of a blog site, for just the HTML files.

On an interesting note, I had been playing with wget a few days before as a way to post XML-RPC to Blogger's API (which is in deprecation - the old API will be going away). It worked very well for a simple command, but I did not test creating an entry with it. I imagine it would not be hard to write a script that uploads backed-up blog files through wget to the XML-RPC interface.
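
Here is a sketch of how such an XML-RPC post might look. It is an illustration, not what I actually ran: the xmlrpc_body and xml_escape helpers, the demo.echo method name, and the ENDPOINT variable are all hypothetical, it only handles string parameters, and the real method names and endpoint would have to come from Blogger's API documentation.

```shell
#!/bin/sh
# Sketch: build an XML-RPC request body and POST it with wget.
WGET="${WGET:-wget}"

# Escape the characters that are special in XML text.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Usage: xmlrpc_body METHOD [STRING_PARAM...]
# Writes a methodCall document with string-typed parameters to stdout.
xmlrpc_body() {
    method=$1
    shift
    printf '<?xml version="1.0"?>\n<methodCall><methodName>%s</methodName><params>' \
        "$(xml_escape "$method")"
    for p in "$@"; do
        printf '<param><value><string>%s</string></value></param>' "$(xml_escape "$p")"
    done
    printf '</params></methodCall>\n'
}

# Usage (method name and ENDPOINT are placeholders):
#   xmlrpc_body demo.echo hello > request.xml
#   "$WGET" -O - --header='Content-Type: text/xml' \
#       --post-file=request.xml "$ENDPOINT"
```

The '--post-file' switch sends the saved request as the POST body, and '-O -' prints the server's XML response to the screen instead of saving it to a file.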

1 comment:

Keith said...

Update - it seems that Blogger is sending an email on at least some updates. I don't know if they made a change, or I just missed it on a few edits, or if it matters what you edit.
Hm - I made a minor correction to the latest entry, but I didn't receive an updated email. Perhaps it was because it was during the same minute as the original post? Anyway - I still say you can't count on email as a backup for your blog, but it does add to the safety.