Sunday, October 14, 2007

On generating and searching static web pages

Summary: Adding search capability to static pages is easy with Google; creating heavily template-based web pages with Apache Forrest is not quite so easy but worth the effort, and there are some nice templates around, like Mollio.

Long Version.

Anyone who has looked at my web pages knows that they are lacking in style, both figuratively and literally, at least with respect to appearance. My pitiful excuse is that they started out as a place to disseminate the Medical Image Format FAQ, and since that started out as a set of plain text files posted via Usenet Newsgroups, the bulk of the material had no style to start with. Since then, I guess I have just focused more on content than appearance; shameful, I know.

But recently I have been thinking about how to create a more friendly web site, specifically in the context of the "Patient Contributed Image Repository" site that I am working on building. Since the primary audience will be ordinary people like patients and their families, a pleasant, professional appearance with easily navigable and clearly readable content is required. At the same time I have been working on the XML representation of the DICOM Standard, which we are doing in DocBook, and though I have previously used XSL-T quite a lot to transform structure and extract content, I have now been forced to learn something about CSS as well.

Accordingly, I began looking around for nice "templates" to use, as well as ways of automating the transformation of structured content into web pages (without having to re-invent this from scratch).

I am looking only for straightforward navigation and layout, preferably CSS-based, since frames and tables seem to be regarded as passé these days. Frames in particular seem to be positively "harmful" in some folks' opinion. Table-based layout versus CSS-based layout seems to be more a question of ease versus browser compatibility (see for example, Tables Vs. CSS - A Fight to the Death, and Why avoiding tables (for layout) is important). Strict XHTML requires the use of styles anyway, forbidding the legacy appearance-related tags, though of course one can still use tables for layout; but the writing is on the wall: avoiding CSS is just not an option. Such stylesheets are potentially sufficiently complex, though, that using somebody else's professionally designed template seems like a good tactic, especially if that professional is a trained and/or experienced graphic designer or artist.

In my hunt for nice templates for web sites (as opposed to entire documents), the only (free and reusable) ones I have come across so far that I liked enough to recommend are those from Mollio, which have a stark simplicity with sufficient functionality to match most typical modern web sites.

But I was still faced with having to do a lot of untidy manual cutting and pasting on multiple pages, as well as maintaining many internal and external navigation links. Most tedious. Being both an XSL-T and DocBook aficionado, I was most pleased to discover the "Website" package amongst the (many) types of output that the DocBook stylesheets can generate, and even more pleased to discover that it was reasonably thoroughly documented in the standard text, "DocBook XSL: The Complete Guide". However, before getting too far into playing with it, I found what seemed to be a more "active" set of tools developed from DocBook Website called SilkPage. These stylesheets seemed to be quite a lot easier to use, and more thoroughly documented. Some preliminary experiments were quite promising.

However, yet more searching revealed the existence of the Apache Forrest project, which seems to have taken over where SilkPage left off (and indeed the SilkPage developer, Sina K. Heshmati, seems to have moved on to Forrest). Forrest is dead easy to get going (all pure Java and client-side), including generating a "seed" set of pages with a single command, which can then be edited to include the outline, content and layout that you desire. Though Forrest is still in development and not officially released yet, what is currently supplied looks like it works pretty well, and the default appearance of the seed "project" looks pretty good using the supplied appearance ("skin"), with more skins promised in the future (if you don't want to create your own). You can see some real-world examples here. Even better, Forrest promises DocBook support as well, though the primary "content" format seems to be something called "xdoc", which contains a limited set of tags and, I gather, grew out of the Maven project. I haven't experimented sufficiently yet to decide whether xdoc or DocBook will be more suitable for my new web pages; there is probably a lot more tooling for the latter, but if the former is sufficient I may well opt for its simplicity.

Anyway, if I find anything better I will let you know, but for the time being Forrest seems to satisfy my relatively straightforward requirements of being able to create, and more importantly maintain, a non-trivial set of static page content with a contemporary appearance and navigation.

Of course it goes without saying that the pages will avoid the use of proprietary rubbish like Flash, which I (and it seems, many others) hate with a vengeance and regard as the modern equivalent of flashing text or banners. Indeed, I hate Flash so much that maybe I will start a "Flash-free validation service", maybe with a cute little logo to include on your site if you pass.

To the extent possible, I also want to avoid anything that might be configured off in the user's browser or require plugins or be non-portable across browsers, which includes Javascript, applets, etc. I don't yet know to what extent Forrest supports these, though. The Patient Contributed Image Repository site will allow uploading of files, so something like Java Web Start will probably be required, but there is no getting around that, unfortunately (I do love Java Web Start, by the way, and have had great fun experimenting with pages that automatically download the correct JRE on a platform-neutral and browser-neutral basis and load the right native libraries for JAI, etc., but that is a subject for another day).

I have mentioned static content several times, and this is a consequence of my preference for avoiding server-side deployment issues that require any particular choice of server pages, database, etc., if at all possible. For complex content this always raises the question of how to search for stuff, and in perusing the Forrest documentation I came across a page that addresses this question, which reminded me about the possibility of using Google to search a particular site.

The bottom line is that one just needs to get the right parameters into the URL. A normal Google search for the word "bla" looks like "http://www.google.com/search?q=bla". To constrain the search to a specific site only, such as my site, just add "&sitesearch=www.dclunie.com". Note that the sitesearch parameter can include sub-folders. It is trivial to insert a simple form element in any static web page to do this, and there are some simple examples at Dave Taylor's page. Note in particular that no Javascript is required to do this, no indirection through anybody else's site is required (which some downloadable scripts for this seem to do, perhaps nefariously to gather your details), and you don't have to have a special account or be registered with Google. This despite links to what apparently used to be the Google Free page describing this, "http://www.google.com/searchcode.html", which now redirects one to a page called Google Custom Search Engine, which seems to imply that more is necessary. I am sure there are more powerful features there, but the simple approach seems good enough.

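For the curious, here is a minimal sketch of the URL trick described above, written in Python purely for illustration. The "q" and "sitesearch" parameters and the base URL are the ones quoted above; the function names and the exact form markup (field size, button label) are my own assumptions, not anything copied from Google's or Dave Taylor's pages.

```python
from html import escape
from urllib.parse import urlencode

# Base URL as quoted in the text above.
GOOGLE_SEARCH = "http://www.google.com/search"


def site_search_url(query, site):
    """Return a Google search URL restricted to one site (hypothetical helper)."""
    return GOOGLE_SEARCH + "?" + urlencode({"q": query, "sitesearch": site})


def site_search_form(site):
    """Return a plain HTML form that sends the same two parameters via GET.

    The markup is an illustrative assumption; it needs no Javascript.
    """
    return (
        f'<form method="get" action="{escape(GOOGLE_SEARCH)}">\n'
        f'  <input type="text" name="q" size="30">\n'
        f'  <input type="hidden" name="sitesearch" value="{escape(site)}">\n'
        f'  <input type="submit" value="Search">\n'
        f'</form>'
    )


if __name__ == "__main__":
    # prints http://www.google.com/search?q=bla&sitesearch=www.dclunie.com
    print(site_search_url("bla", "www.dclunie.com"))
    print(site_search_form("www.dclunie.com"))
```

The form is just static text that can be pasted into any page; submitting it makes the browser build the same URL that the first function produces.
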
It took me only a few minutes to augment my own home page with an ugly little search tool and configure it to search not just my own site, but also a few favorites, like the current DICOM standard. Indeed, since these blog pages, though created with Blogger, are actually stored at and served from my primary web site, they get searched as well. As do PDFs, which is particularly cool.

Anyway, just thought you might like to know. Not that I am promising to update my primary site so that it looks halfway decent anytime soon. As I mentioned, this is for a new project. The FAQ, though, is quite structured despite its hand-written content, so it might be possible to automate most of that conversion.