James Linden

~# data ninja / linux guru / web dev geek / robotics nerd / idea machine / N6NRD

The Anarchy of Taxonomies, Hierarchies, and Versioning

Currently, Project Gutenberg uses an often confusing file hierarchy. The two-digit year in the base-level directory names causes two problems:

  1. The directories don't sort chronologically, since "00" comes before "99".
  2. The years don't tell anyone anything - particularly since etext04 already exists and it isn't even 2003 yet.

Granted, the sorting issue is relatively minor, but the bogus etext creation year is confusing and downright illogical.
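
To make the sorting issue concrete, here is a minimal sketch in Python. The directory names other than etext04 are illustrative, and the pivot year used to guess the century is purely an assumption for the example, not anything Project Gutenberg defines.

```python
# Illustrative directory names in the two-digit-year style (only etext04
# is mentioned in the article; the rest are made up for the example).
dirs = ["etext99", "etext00", "etext98", "etext04", "etext01"]

# A plain lexicographic sort puts the 2000s ahead of the 1990s.
print(sorted(dirs))
# ['etext00', 'etext01', 'etext04', 'etext98', 'etext99']


def expand_year(name, pivot=90):
    """Guess a four-digit year from the two-digit suffix.

    The pivot (suffixes >= 90 are treated as 19xx) is an assumption.
    """
    yy = int(name[-2:])
    century = 1900 if yy >= pivot else 2000
    return century + yy


# Sorting on the expanded year restores chronological order.
chronological = sorted(dirs, key=expand_year)
print(chronological)
# ['etext98', 'etext99', 'etext00', 'etext01', 'etext04']
```

Note that this only repairs the ordering. It does nothing for the second problem, since the year in the name still says nothing reliable about the etext itself.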

So, this brings up the question - "How do you build the base level of the hierarchy?" This question has many possible answers, each with its own problems.

For example, if you sort alphabetically by author's last name, you end up with a lot of "v" entries for "various authors", or "m" entries for "multiple authors". This, of course, can defeat the purpose very quickly.

If you sort alphabetically by book title, you don't have the "v" and "m" issue, but you run into things like:

  1. When do you make "A Princess of Mars" into "Princess of Mars, A" - which moves the file from "a" to "p" - and when do you not do that?
  2. How do you decide where "Beethoven's Fifth Symphony" goes - is it "Fifth Symphony" by "Beethoven", and therefore goes under "f" for "Fifth", or is it "Beethoven's Fifth Symphony", and goes under "b" for "Beethoven"?
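
The first case can at least be written down as a mechanical rule. The following Python sketch assumes a fixed list of English articles and hypothetical helper names (filing_title, bucket); it is one possible rule, not a cataloguing standard.

```python
# Leading words to move to the end of the title. This list is an
# assumption and covers English only.
ARTICLES = ("a", "an", "the")


def filing_title(title):
    """Move a leading article to the end: 'A Foo' -> 'Foo, A'."""
    first, _, rest = title.partition(" ")
    if rest and first.lower() in ARTICLES:
        return f"{rest}, {first}"
    return title


def bucket(title):
    """Return the single-letter directory a title would be filed under."""
    return filing_title(title)[0].lower()


print(filing_title("A Princess of Mars"))   # Princess of Mars, A
print(bucket("A Princess of Mars"))         # p
print(bucket("Beethoven's Fifth Symphony")) # b
```

The second case is not something code can settle: the rule above files the symphony under "b" only because nobody told it that "Beethoven's" is really the author. That decision has to be made by a person and stored as data.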

These issues come up not only on the file end of the picture, but also on the interface end. In the case of Project Gutenberg, that means FTP for the files and the web for the interface. The difference, of course, is that on the web we can offer all the different sorting options using database-driven scripts, so these hierarchy issues become practically nonexistent. On the file (FTP) end there is no database, since FTP is purely file-based, and the hierarchy remains a nightmare.

So, how do we solve the problem, without creating half a dozen entire hierarchy combinations? Again, this issue could have several possible solutions.

One solution would be to write a smarter FTP daemon, one that uses database indexes to dynamically create the "file" listings that the FTP client "sees". "How exactly would you do this?" you may be asking. The answer is actually quite simple. Allow me to explain.

The file transfer protocol (FTP) is simple, and clients can already send option commands to change from binary to ASCII mode, turn hash mark display on or off, etc. (The actual file downloading portion of FTP is immaterial to this article.) By using a few secondary option commands, the FTP server can be told to switch from author mode to title mode, or to any other mode desired. Then, when the list command is received, the FTP server can generate the proper "file" hierarchy for that mode. When changing directories, the FTP server can simply add a WHERE clause to the SQL that creates the file list. For example: the directory /author/a/ would actually be WHERE LEFT(author,1) = 'a' in the SQL query.
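
Here is a rough sketch of that directory-to-query translation in Python. Everything in it is assumed for illustration: the books table, its columns, the sample rows and file names, and the function names. It uses SQLite from the standard library, which has no LEFT() function, so substr(column, 1, 1) stands in for LEFT(column, 1). A real daemon would hook this into its LIST handler rather than call it directly.

```python
import sqlite3

# Hypothetical whitelist: virtual top-level directory -> table column.
# Only whitelisted column names are ever placed into the SQL text.
MODES = {"author": "author", "title": "title"}


def listing_query(path):
    """Translate a virtual path such as /author/a/ into SQL plus parameters."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts or parts[0] not in MODES:
        raise ValueError(f"unknown hierarchy: {path!r}")
    column = MODES[parts[0]]
    sql = "SELECT filename FROM books"
    params = []
    if len(parts) > 1:
        # The WHERE clause the article describes, with the letter bound
        # as a parameter instead of being pasted into the query.
        sql += f" WHERE lower(substr({column}, 1, 1)) = ?"
        params.append(parts[1].lower())
    sql += f" ORDER BY {column}"
    return sql, params


# A tiny in-memory catalog with made-up rows, just to run the queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (author TEXT, title TEXT, filename TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Austen, Jane", "Emma", "emma.txt"),
        ("Alcott, Louisa May", "Little Women", "lwmen.txt"),
        ("Burroughs, Edgar Rice", "A Princess of Mars", "pmars.txt"),
    ],
)


def list_directory(path):
    """Return the 'file' names a client would see in a virtual directory."""
    sql, params = listing_query(path)
    return [row[0] for row in conn.execute(sql, params)]


print(list_directory("/author/a/"))  # ['lwmen.txt', 'emma.txt']
print(list_directory("/title/a/"))   # ['pmars.txt']
```

The same physical file shows up under /author/b/ and /title/a/ without being copied anywhere, which is the whole point: the hierarchy lives in the query, not on the disk.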

Using this concept would make the FTP server as flexible as the web interface, without requiring a special FTP client. Adding multiple file formats is just as simple, as would be adding other hierarchies, including Library of Congress, Dewey Decimal, PICS ratings, DMOZ, TopicMaps, etc.
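
As a sketch of how cheap an extra hierarchy becomes once listings are generated, the Python below keeps each hierarchy as a small function from a catalog record to a virtual directory. The records, field names, and the "format" hierarchy keyed on file extension are all assumptions made for the example; a classification scheme such as Dewey Decimal would need its own field in the catalog.

```python
from collections import defaultdict

# Hypothetical catalog records; field names and values are illustrative.
BOOKS = [
    {"author": "Austen, Jane", "title": "Emma", "file": "emma.txt"},
    {"author": "Burroughs, Edgar Rice", "title": "Princess of Mars, A",
     "file": "pmars.txt"},
]

# Each hierarchy maps one record to the virtual directory it appears in.
HIERARCHIES = {
    "author": lambda b: f"/author/{b['author'][0].lower()}/",
    "title": lambda b: f"/title/{b['title'][0].lower()}/",
}


def build_tree(books, hierarchies):
    """Return {virtual directory: [file names]} across all hierarchies."""
    tree = defaultdict(list)
    for book in books:
        for to_dir in hierarchies.values():
            tree[to_dir(book)].append(book["file"])
    return dict(tree)


# Adding a hierarchy is one more entry; no files are moved or copied.
HIERARCHIES["format"] = lambda b: f"/format/{b['file'].rsplit('.', 1)[-1]}/"

tree = build_tree(BOOKS, HIERARCHIES)
print(tree["/author/a/"])    # ['emma.txt']
print(tree["/title/p/"])     # ['pmars.txt']
print(tree["/format/txt/"])  # ['emma.txt', 'pmars.txt']
```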

Of course, this concept isn't limited to just ebooks or Project Gutenberg, nor is it completely new. ProFTPd has a third-party (beta) module for database servers that support TDS.

Founder / Head Geek
Digital Dock, LLC
aka kodekrash & N6NRD
Alexandria, LA USA
