A while back James Robertson pointed to two blog entries about migrating legacy content into a CMS, one by Martin White and one by David Gammel. Quick summary: White says to plan ahead, and Gammel encourages you to inventory, delete, and use temps. Good advice all around.
Here's how I'd approach the problem.
First, inventory and identify (and remove) ROT (Redundant, Outdated, Trivial) content. This is also a great place to get the metadata assigned.
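A rough sketch of what a first ROT pass could look like in Python. Everything here is an assumption made for illustration: the `flag_rot` name, the `(path, text, modified)` tuples, and the thresholds (two years for "outdated", twenty words for "trivial"). A real inventory would need rules agreed on with the content owners, and a person should review anything before it gets deleted.

```python
import hashlib
from datetime import datetime, timedelta


def flag_rot(pages, now, max_age_days=730, min_words=20):
    """Return {path: [reasons]} for pages that look redundant, outdated, or trivial.

    `pages` is an iterable of (path, text, modified) tuples. The thresholds
    are illustrative defaults, not recommendations.
    """
    seen = {}      # digest of normalized text -> first path that had it
    flagged = {}
    cutoff = now - timedelta(days=max_age_days)
    for path, text, modified in pages:
        reasons = []
        # Normalize whitespace and case so near-identical copies hash the same.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            reasons.append("redundant (same text as %s)" % seen[digest])
        else:
            seen[digest] = path
        if modified < cutoff:
            reasons.append("outdated")
        if len(text.split()) < min_words:
            reasons.append("trivial")
        if reasons:
            flagged[path] = reasons
    return flagged
```

The output is just a list of candidates with reasons attached, which is a convenient thing to hand to whoever decides what actually gets cut.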
Assuming you need the content in some sort of useful form (rather than the tag soup it's currently in), you'll need to do some serious conversion. I think that much of this can be automated. A mix of a database, HTML Tidy, and Python (or Perl) should be enough to slam things into shape. You'll probably still need to make some manual changes, but this should take care of a good deal of the content.
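To make that a bit more concrete, here is a minimal sketch of the automated part using only the Python standard library. Note the substitutions: `html.parser` stands in for the HTML Tidy cleanup step, and SQLite stands in for "a database". The `TextExtractor`, `convert`, and `load` names and the `pages` table layout are hypothetical, and the sketch only pulls out the title and visible text, which is much less than a real conversion would need.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> and visible text from sloppy HTML."""

    SKIP = {"script", "style"}  # content that should never reach the CMS

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))


def convert(html):
    """Return (title, body_text) extracted from a tag-soup document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title.split()), "\n".join(parser.chunks)


def load(conn, path, html):
    """Store the converted page in a staging table, keyed by its old path."""
    title, body = convert(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(path TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (path, title, body)
    )
    conn.commit()
```

Keeping the converted pages in a staging table means the manual fixes can happen there, against the old paths, before anything goes into the CMS.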
Then, suck it up into the CMS and tweak as needed.
Posted by Karl
April 18, 2003 12:56 PM