In this week’s Organizing and Access to Information class, I fear I steered the conversation into a detour on the subject of metadata. The prof’s introduction to the concept began straightforwardly enough: metadata is data about data, such as author, title, and publication date. But as his examples got more complex, he began to call things metadata that I would have considered part of the data itself. I don’t have his slides in front of me, but I think it started when he put chapter 1, page 1 of Pride and Prejudice on the screen and said that its structure (chapters, paragraphs, and so on) was also metadata. That surprised me, and we spent probably far too much class time working out why. (I feel guilty, but not too guilty: I kept offering to drop it, but other people had questions and comments, too.)
It boiled down to this: in my naive interpretation, metadata is information that applies to the “information object” as a whole, or is extrinsic to it in some way. I’d call anything integrated with the meat of the object “data”, not “metadata”. That includes structure and layout information: the information represented in typography and layout on a printed page, or in ordinary inline markup on the web, etc. Of course we can abstract structure away from presentation, but that doesn’t mean that the structure is no longer part of the work. Where Jane Austen chose to put her paragraph breaks is as much a part of the novel as the words she chose to put inside them. There were some interesting examples presented in class, such as whether the abbreviation and typography conventions used to identify the parts of speech in a dictionary are metadata. I argued that whether it is represented through an italic n. or an XML <part-of-speech> element, the part of speech is an integral part of the content. Or in multimedia terms, the string of bits representing an audio or video stream is the data, and it doesn’t make sense to speak of the volume level or amount of cowbell as being a separate thing called metadata.
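To make the distinction concrete, here is a rough sketch in Python of how the dictionary example might look under my naive model. Everything in it is an assumption made up for illustration: the element names (entry, meta, flow, headword, pos, def) don’t come from any real dictionary schema, and the sample entry is invented. The point is only that the part of speech sits inside the content the reader sees, while the information about the entry as a whole sits outside it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names (entry, meta, flow, pos, ...) are
# invented for illustration and do not follow any real dictionary schema.
DOC = """
<entry>
  <meta>
    <source>Example Dictionary</source>
    <edition>3</edition>
  </meta>
  <flow><headword>cowbell</headword> <pos>n.</pos> <def>A bell hung around a cow's neck.</def></flow>
</entry>
"""

root = ET.fromstring(DOC)

# Information about the entry as a whole, kept outside the readable content.
metadata = {child.tag: child.text for child in root.find("meta")}

# The content itself: what a reader sees, in order, with its structure.
flow = root.find("flow")
flow_parts = [(el.tag, el.text) for el in flow]

# The same content with the markup flattened away; "n." is still there.
flow_text = "".join(flow.itertext())

print(metadata)
print(flow_parts)
print(flow_text)
```

In this sketch, dropping the meta block leaves the text of the entry untouched, whereas the part of speech survives even when every tag in the flow is thrown away, because it was part of what the entry says all along.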
The prof worked hard to explain his model and I was probably just being thickheaded not to get it. On reflection I see that the concepts of “data” and “metadata” are conventions and where to draw the line is a matter of utility; if it is helpful in a certain setting to call internal markup (or its pre-digital equivalents of layout and typography) “metadata” then so be it. Also, there are many times when it’s good for information to live in both places: you can hear the cowbell in the audio stream, and you may also want an access point in the catalog of your music library which says “Cowbell: track 6890691, timepoint 1:37”.
But I don’t think I’m alone in my naive model. The very next day’s reading assignment was chapter 3 of Erik Ray’s Learning XML, where he defines the term: “Metadata is information about the document that is not part of the flow.” What I’m wondering now is whether these diverging definitions have any consequences beyond occasional confusion.