Going meta on metadata

In this week’s Organizing and Access to Information class I fear I steered the conversation into a detour on the subject of metadata. The prof’s introduction to the concept began straightforwardly enough — metadata is data about data, e.g. author, title, publication date, etc. But then as his examples got more complex he began to call things metadata that I would have considered part of the data itself. I don’t have his slides in front of me but I think it started when he put chapter 1, page 1 of Pride and Prejudice on the screen and said that its structure — chapters, paragraphs, etc. — was also metadata. That surprised me, and we spent probably far too much class time working on why. (I feel guilty but not too guilty — I kept offering to drop it, but other people had questions and comments, too.)

It boiled down to this: in my naive interpretation, metadata is information that applies to the “information object” as a whole, or is extrinsic to it in some way. I’d call anything integrated with the meat of the object “data”, not “metadata”. That includes structure and layout information: the information represented in typography and layout on a printed page, or in ordinary inline markup on the web, etc. Of course we can abstract structure away from presentation, but that doesn’t mean that the structure is no longer part of the work. Where Jane Austen chose to put her paragraph breaks is as much a part of the novel as the words she chose to put inside them. There were some interesting examples presented in class, such as whether the abbreviation and typography conventions used to identify the parts of speech in a dictionary are metadata. I argued that whether represented through an italic n. or an XML <part-of-speech> element, the part of speech is an integral part of the content. Or in multimedia terms, the string of bits representing an audio or video stream is the data and it doesn’t make sense to speak of the volume level or amount of cowbell as being a separate thing called metadata.

The prof worked hard to explain his model and I was probably just being thickheaded not to get it. On reflection I see that the concepts of “data” and “metadata” are conventions and where to draw the line is a matter of utility; if it is helpful in a certain setting to call internal markup (or its pre-digital equivalents of layout and typography) “metadata” then so be it. Also, there are many times when it’s good for information to live in both places: you can hear the cowbell in the audio stream, and you may also want an access point in the catalog of your music library which says “Cowbell: track 6890691, timepoint 1:37”.

But I don’t think I’m alone in my naive model. The very next day’s reading assignment was chapter 3 of Erik Ray’s Learning XML, where he defines the term: “Metadata is information about the document that is not part of the flow.” What I’m wondering now is whether these diverging definitions have any consequences beyond occasional confusion.
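To make the two definitions concrete, here is a rough Python sketch of the dictionary example. The element names (`entry`, `meta`, `body`, `headword`, `part-of-speech`, `definition`) and the sample entry are invented for illustration; they are not taken from the class slides or from Ray's book. Under the "not part of the flow" reading, only the `<meta>` block counts as metadata, while the part-of-speech markup travels with the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry; every element name here is an assumption
# made up for this example.
ENTRY = """
<entry>
  <meta>
    <source>Example Dictionary</source>
    <language>en</language>
  </meta>
  <body><headword>cowbell</headword> <part-of-speech>n.</part-of-speech> <definition>A bell hung around a cow's neck.</definition></body>
</entry>
"""

root = ET.fromstring(ENTRY)

# Information about the entry as a whole, kept outside the flow of the text.
out_of_flow = {child.tag: child.text for child in root.find("meta")}

# The flow itself: the text a reader sees, plus the inline markup that
# structures it (including the part of speech).
in_flow_text = "".join(root.find("body").itertext())
in_flow_markup = [child.tag for child in root.find("body")]

print(out_of_flow)
print(in_flow_text)
print(in_flow_markup)
```

The prof's broader model would presumably count `in_flow_markup` as metadata too, which is exactly where the two definitions part ways.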

Managing PDFs with iPapers

Revisiting the topic of how to manage a personal collection of PDFs, Don Turnbull turned up the interesting program iPapers by Toshihiro Aoyama. It manages PDFs in an iTunes-like interface.

It’s a nifty program but it does have a couple of limitations. It’s intended for use with the medical bibliographic service PubMed, so incorporating PubMed articles is a simple matter of dragging and dropping, and iPapers will then retrieve the bibliographic metadata from PubMed much as iTunes retrieves track listings. For non-PubMed articles, though, the import process is much less straightforward. An obvious improvement would be to make drag and drop work for any PDF and to pop up the dialogue for hand-editing bibliographic data by default when the program can’t retrieve it from PubMed.

Even better would be to support plugins so third parties could write interfaces to PubMed’s counterparts in other fields, perhaps to CiteSeer or Google Scholar. The hardest part there might be determining unique identifiers by which to do the lookup.
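As a sketch of what such a plugin system might look like, here is some minimal Python. This is not how iPapers works internally; `Citation`, `LookupPlugin`, `InMemoryPlugin`, and `resolve` are hypothetical names, and the identifiers and records are placeholders. A real plugin would query a remote service instead of an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol, Tuple


@dataclass
class Citation:
    title: str
    authors: List[str]
    year: int


class LookupPlugin(Protocol):
    """What a third-party plugin would need to provide (an assumption)."""
    name: str

    def lookup(self, identifier: str) -> Optional[Citation]: ...


class InMemoryPlugin:
    """Stand-in for a real bibliographic service; does no network access."""

    def __init__(self, name: str, records: Dict[str, Citation]):
        self.name = name
        self._records = records

    def lookup(self, identifier: str) -> Optional[Citation]:
        return self._records.get(identifier)


def resolve(identifier: str,
            plugins: List[LookupPlugin]) -> Optional[Tuple[str, Citation]]:
    """Ask each plugin in turn; return (plugin name, citation) or None."""
    for plugin in plugins:
        citation = plugin.lookup(identifier)
        if citation is not None:
            return plugin.name, citation
    return None


# Placeholder data: the identifiers and records are made up.
plugins = [
    InMemoryPlugin("medical", {"med:001": Citation("Example article A", ["A. Author"], 2004)}),
    InMemoryPlugin("computing", {"cs:042": Citation("Example article B", ["B. Author"], 2003)}),
]

found = resolve("cs:042", plugins)
missing = resolve("unknown:999", plugins)
print(found)
print(missing)
```

When `resolve` returns `None`, the program could fall back to the hand-editing dialogue. The sketch also shows where the identifier problem bites: each service needs some key to look up, and a bare PDF doesn't necessarily carry one.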

Another limitation is iPapers’ model of metadata. It is strictly oriented toward journal articles, so books, handouts, PDF archives of webpages, etc. fall outside its scope. Maybe more importantly, iPapers doesn’t currently allow for user-definable fields or tags. I’m hoping for a tool which will let me sort PDFs by topics, courses, and my own writing projects.
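For what it's worth, the kind of record I have in mind is not complicated. Below is a minimal Python sketch, assuming a made-up `PdfRecord` type with free-form fields and tags; the file names, field names, and tags are all invented, and none of this reflects iPapers' actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PdfRecord:
    path: str
    kind: str = "article"  # could also be "book", "handout", "webpage", ...
    fields: Dict[str, str] = field(default_factory=dict)  # user-defined fields
    tags: Set[str] = field(default_factory=set)           # user-defined tags


def index_by_tag(records: List[PdfRecord]) -> Dict[str, List[str]]:
    """Map each tag to the paths of the PDFs that carry it."""
    index = defaultdict(list)
    for record in records:
        for tag in sorted(record.tags):
            index[tag].append(record.path)
    return dict(index)


# Invented sample library.
library = [
    PdfRecord("ray-ch3.pdf", kind="book",
              fields={"course": "Organizing Information"},
              tags={"xml", "metadata"}),
    PdfRecord("handout-week2.pdf", kind="handout", tags={"metadata"}),
    PdfRecord("some-article.pdf", tags={"writing-project"}),
]

by_tag = index_by_tag(library)
print(by_tag["metadata"])
```

Topics, courses, and writing projects all reduce to tags or fields here, and the `kind` field is what would let books and handouts sit alongside journal articles.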

But it’s new and hopefully Toshihiro Aoyama is still adding features. Check it out and send him your encouragement.

Social networking vs. just the apps, ma’am

A new baby was born in my extended family this weekend and I was asked to recommend an easy way for someone without a personal website to share photos. I of course suggested Flickr and the designated photographer started uploading baby pics. But then a few more requirements emerged and I realized that maybe Flickr, wonderful as it is, wasn’t the right choice for this application, so with a heavy heart I sent her on to Ofoto.

Why the change of course? Because despite the intrinsically social nature of photo trading, these people don’t really want a social network system. The requirement that they omitted from the initial conversation is that they want the baby pics to be private. Fine, Flickr can do that, you just have to mark your photos as visible only to friends and family — and then your friends and family have to join Flickr. Ofoto, if I understood their intro right, is perfectly happy just to host an album and let you send out a URL and a password for viewing it. Ofoto’s approach is asymmetrical and not especially social, but sometimes you don’t want to invite people to join with you in creating a beautiful virtual world. Sometimes all you want is a dumb app.

I’m sure there are lessons to be learned here, but it’s late and I still have homework to do so I’ll save them for another day.