My old wiki code used query strings to identify resources, which is not exactly kosher according to the HTTP spec. I remember reading, in those early days (;-), some of Google’s “notes to webmasters”, in which they claimed that Google might not properly index a site that used query strings. I think they didn’t want to spider a potentially infinite space of dynamic pages.
And, in fact, the first few months my site was up, Google was fetching the robots.txt and the front page, but not digging any deeper. I fiddled around with various things, and ended up removing a big meta keywords element. After that, causally or not, Google started spidering me.
But I never liked the query string URIs. After a couple of attempts I’ve finally replaced them with pathinfo-style URIs.
I want the old URIs to continue to work, however, since there are going to be lots of cached links (at Google at least) and I don’t want everything to immediately break.
Essentially the task is to forward
/wiki?show=<pagename>
to
/show/<pagename>
That seems pretty easy, right? Since I’m using Apache as a “dispatcher” – the “show” in that URI is actually a script – I figured it would be easy to do this using RedirectMatch. After some befuddlement I realized that the query string – like the pathinfo string – doesn’t show up in the URI that the Apache directives get to work with. The query string and pathinfo parts of the request URI have already been written to environment variables. Since those directives don’t have access to the envvars representing an HTTP request, I had two choices:
- have “wiki” be a small redirector script that extracts the query string and generates an HTTP Redirect to the new URI; or
- try using the rewrite engine (mod_rewrite).
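For comparison, the first option could look something like the sketch below. It is not my actual redirector; the function name `redirect_target` is made up for illustration, and it assumes the old script lives at /wiki and the page name arrives in the `show` parameter. A real script would then send the returned path in the Location header of a 301 response.

```python
from urllib.parse import urlsplit, parse_qs, quote

def redirect_target(request_uri):
    """Map an old /wiki?show=<pagename> URI to /show/<pagename>.

    Returns None when the URI is not an old-style wiki link.
    (Hypothetical helper, for illustration only.)
    """
    parts = urlsplit(request_uri)
    if parts.path != "/wiki":
        return None
    # parse_qs drops blank values, so "show=" yields no entry at all.
    pages = parse_qs(parts.query).get("show")
    if not pages:
        return None
    # Re-encode the page name so it is safe as a single path segment.
    return "/show/" + quote(pages[0], safe="")
```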
I’ve looked at mod_rewrite in the past and found it to be a huge messy powerful kludge. It’s ugly. And the documentation is appalling. Actually, the Apache server documentation in general is appalling. A perfect example of this is that nowhere do they say that query strings and pathinfo parts are stripped out of the URI before matching with Alias, AliasMatch, Redirect, and RedirectMatch.
Even though I essentially had the code to do the redirect (my wiki code has to parse the query string, and I wrote a redirector to solve another problem), I decided to try mod_rewrite. It wasn’t too bad, except for one thing: the way that you tell it to elide the query string in the rewritten URI is completely counterintuitive.
Here is the config snippet:
RewriteEngine On
RewriteCond %{QUERY_STRING} show=(.+)
RewriteRule ^/(browse|wiki) /show/%1? [last,redirect=301]
It’s pretty simple: turn on the engine. Match a URI with a query string of the form “show=<page>”. Capture the <page> in %1. Now rewrite /browse or /wiki URIs with this query string to “/show/<page>”. The trailing “?” is the “clue” to the rewriter to null out the query string. Since the rewritten URI can include a query string, it seems to me that simply leaving it off (which is what I did at first) ought to be enough to drop it. If I want to keep it, I should have to write it on the right side.
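To make the matching concrete, here is a rough Python imitation of what the condition and rule do. It is only a sketch of the regex logic, not of mod_rewrite itself, and the names `COND`, `RULE`, and `rewrite` are invented for the example. It assumes the path and the query string arrive separately, which is how the rewrite engine sees them.

```python
import re

# Same patterns as the RewriteCond and RewriteRule above.
COND = re.compile(r"show=(.+)")       # tested against the query string
RULE = re.compile(r"^/(browse|wiki)")  # tested against the URI path

def rewrite(path, query_string):
    """Return the redirect target, or None if the rule does not apply."""
    cond = COND.search(query_string)
    if cond is None or RULE.search(path) is None:
        return None
    # The trailing "?" in the real rule drops the old query string,
    # so only the captured page name (%1) is carried over.
    return "/show/" + cond.group(1)
```

One thing the sketch makes visible: the pattern is greedy and unanchored, so a query string like “show=A&b=1” would capture “A&b=1” as the page name. That never bit me, since my old URIs only ever carried the one parameter.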
The “[last,redirect=301]” bit is a note to
- not match any more rules, and
- send a Permanent Redirect (301) to the client.
Since these URIs are history I figured 301 (permanent) made more sense than 302 (temporary).
That’s it! It works great.