Monday, 29 February 2016

Metadata Update #29 - Strings Versus Things

Strings versus things – this has become a common debate in cataloging circles lately.  The idea that a string of text which represents or describes something is harder to construct and less versatile than an assigned code which can be mapped to represent the "thing" is not new.  The reality is that there is a long tradition of "string creation" in library metadata.  The science of creating metadata for libraries had its origin in a time before the invention of electronic computers.  The earliest metadata was recorded by human beings on paper, coded using human language, and read and interpreted directly by the human eye.  Humans can readily make sense of words and sentences, which are made up of strings of text.  To make sense of those strings, they are typically organized in a standard way (e.g. ISBD).   When the MARC standard was developed in the late 1960s, it was built to organize and format strings of text in a container which could be read by computers.  The Library of Congress could, for example, have decided over 40 years ago to move library metadata away from strings, gradually shifting the tradition of constructing complex strings toward the use of codes to represent "things", but this transition has, for the most part, not happened.

There is an irony in holding tight to the "strings" model.  Many librarians and library workers may find "string-based" metadata more user-friendly because the values are expressed in human language.  A person who is not trained in library metadata standards can generally read and make sense of string-based metadata, so many library workers may feel comfortable with and prefer it.  The irony arises from the fact that learning how to create effective strings which can function in today's crowded information environment is a challenging task.   This year at ALA Midwinter, I heard a number of speakers discuss the length of time that it takes for a new librarian to build basic cataloging skills, let alone become an expert.  Most experienced cataloguers estimate the learning period at two to three years, and that estimate assumes that the new librarian has a local expert to instruct and mentor them through the process.  In smaller libraries, the reality is that the new librarian may be doing much of the learning on his or her own, which makes the learning curve steeper.  From my own experience with taking the NACO training and going through the review process, I can attest to the complexity that lies behind learning to do a task which may seem, on the surface, to be little more than data entry.  So, considering all of the time and effort it takes to learn how to create string-based metadata, one might assume that the quality and utility of the result would be superior to other forms of metadata.  While the quality is likely to be equal, many may find it initially surprising that code-based representations of "things" are actually much more useful and versatile.  The reality is that once a code is assigned to a person, place, thing or concept, that code can be mapped to multiple languages and scripts.
The ability to map codes in this manner helps address a new discovery environment which is not only diverse locally but must serve an increasingly global audience.  Thus, by assigning a code once, for all discovery environments, the metadata creator can work in his or her native language to create highly flexible, internationalized metadata. Perhaps libraries' recent experiments with BIBFRAME and other forms of linked data are starting to bring this reality into focus.
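As a rough sketch of the idea (the identifier and labels below are invented for illustration, not taken from any real authority file), mapping one code to display strings in several languages might look like this:

```python
# One identifier (the "thing") mapped to display strings in several
# languages.  The "ex:" identifier and its labels are invented.
authority = {
    "ex:n0001": {
        "en": "Montreal (Quebec)",
        "fr": "Montréal (Québec)",
    }
}

def label_for(identifier, lang, fallback="en"):
    """Return the display string for an identifier in the requested
    language, falling back to English when no label exists."""
    labels = authority.get(identifier, {})
    return labels.get(lang) or labels.get(fallback)

print(label_for("ex:n0001", "fr"))  # Montréal (Québec)
print(label_for("ex:n0001", "de"))  # no German label, so the English one
```

The cataloguer records the code once; each discovery environment picks the label that suits its audience.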

To suggest that all "string" based metadata is of the past and will soon disappear, and that all new metadata will be code-based representations of "things", creates a false dichotomy.  We are not likely ever to get rid of string-based metadata entirely.  In fact, we need metadata which makes the links between the representative codes and their equivalent text strings in various languages and scripts.  In our current environment, we are gradually moving into a hybrid situation.  Our "string-based" MARC records are now being enriched by OCLC, which is adding URIs in $0 subfields.  Those URIs are the new code equivalent for the "thing" which is expressed by the string.  While it is hard to imagine a form of library metadata which is not readable in the record format to which we are accustomed, we now expect that the vast majority of academic library discovery metadata will take the form of linked data which has no "record structure".  The recent addition of the $0 subfield to MARC records is one of our first steps into the new world of library data.  It is a sign that the transition is real and that it has begun.  I certainly will be interested to see, first of all, how the major transition which needs to occur will unfold and, secondly, how we can improve our discovery environments by making the changes.  I'm sure that in the upcoming months and years this blog will revisit the issue of "strings versus things" from time to time.
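To make the hybrid situation concrete, here is a small sketch that pulls the $0 identifiers out of a MARC field written in the familiar "$x value" text convention. The heading and URI below are made up for the example; real $0 values point at services such as id.loc.gov or viaf.org:

```python
# Extract subfield values from a MARC field rendered in the common
# "$x value" text convention.  The sample field is invented; the $0
# subfield carries the URI ("code") equivalent of the $a string.
def subfield_values(field_text, code):
    """Return the values of every occurrence of the given subfield."""
    parts = field_text.split("$")[1:]   # drop the text before the first $
    return [p[1:].strip() for p in parts if p.startswith(code)]

field = "650 _0 $a Cataloging. $0 http://id.example.org/authorities/sh0000001"
print(subfield_values(field, "a"))  # the human-readable string
print(subfield_values(field, "0"))  # the machine-actionable identifier
```

The string and the code travel together in the same record: the hybrid model in miniature.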

Now, to round off this post, I will introduce a tool of interest.  In this case it's not so much a tool as an interesting experiment that OCLC is working on to demonstrate how authority data and linked data can be used to create a useful discovery environment.  This experiment is called WorldCat Identities and can be found at: .  To see the results of the experiment in linking up various sources of linked data, users can just click one of the top 100 names listed on the main page, or search for the name of a person.  When I demonstrate this site to others, I often pick Justin Trudeau, seeing as he has both created resources (i.e. writing, interviews, theatrical performances, etc.) and has had works written about him.  It is interesting how the linked data can be brought together to create timelines and also provide links to other persons.  You may also notice with Justin Trudeau that there is some orphaned metadata which forms its own results set and doesn't seem to link back to the other content.  I think that it's this sort of problem which points to an area toward which the efforts of cataloguers will need to be redirected in the near future.   From time to time, I check into this website and search the same names to see if known problems have been resolved or if anything new has been added.  I've found that it's not unusual to find new content or relationships displayed.   What hasn't changed are the problems with what we would, in traditional cataloging, call authority control.

Monday, 22 February 2016

Metadata Update #28 - Identifiers, 3 years later

Way back in Metadata Update #13 (Feb. 14, 2013), I spoke briefly about the role and importance of identifiers in online electronic information.  Three years later, it has become clear that the talk about identifiers wasn't a flash in the pan. They were the talk of the town at ALA Midwinter once again.

As libraries experimented with BIBFRAME and moved from BF 1.0 to BF 2.0 and linked data work moved from the theoretical LD4L to the practical LD4P, certain things about library data and the wider information environment, things that we have “sort of known” for a long time, have gradually started to come into much clearer focus and we are starting to understand what they really mean for the day to day work of creating and managing metadata.  One of those “things” is the importance of identifiers.

Libraries have long made use of controlled vocabularies, where a single word, phrase or form of a name is used to represent a single person, place, thing or event.  Cataloguers and other librarians understand that the use of these vocabularies (controlled headings, name authority data, etc.) assists with collocation and disambiguation.  I often hear discussions about the need for increased disambiguation of terms and persons in online environments.   It seems that as our body of electronic information grows, we increasingly need to be able to bring together all of the information on the same topic with ease while eliminating the voluminous noise of irrelevant information.  In academic environments, the ability of authors to collocate their work, and of institutions to do the same for the textual and artistic outputs of their faculty and researchers, is becoming increasingly important.  Identifiers help with all of this.

The idea behind "identifiers" builds on the older concept of controlled vocabularies.   The thing (the identifier) used to represent persons, places, things or events is much more flexible and powerful in our current information environment than was possible in the past.  For the most part, identifiers are numeric rather than textual or string representations.  The power they carry is that they can be mapped to multiple scripts and languages and linked to multiple other identifiers, so that information seekers can explore topics and see relationships among persons and events in increasingly complex and rich ways.  As VIAF has shown, even where there are multiple systems of identifiers, it is possible to map them all to support increasingly powerful ways to collocate the works of an author and disambiguate similar or identical names.  This can be clearly seen in some of the better linked data discovery environments.   As I found out at ALA this year, not all linked data is good linked data, but for those who get it right, your socks can be knocked off.
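A toy sketch of that cross-system mapping, in the spirit of what VIAF does (every identifier below is invented; VIAF's actual service returns comparable mappings as structured data):

```python
# A toy VIAF-style cluster: one hub identifier mapped to the same
# person's identifiers in other schemes.  All values are invented.
cluster = {
    "viaf:0000001": {
        "LC":   ["n00000001"],
        "ISNI": ["0000 0000 0000 0001"],
        "BNF":  ["cb000000001"],
    }
}

def same_entity(hub_id, scheme_a, id_a, scheme_b, id_b):
    """True when two scheme-specific identifiers sit in the same
    cluster, i.e. they are known to name the same person."""
    links = cluster.get(hub_id, {})
    return id_a in links.get(scheme_a, []) and id_b in links.get(scheme_b, [])

print(same_entity("viaf:0000001",
                  "LC", "n00000001",
                  "ISNI", "0000 0000 0000 0001"))  # True
```

Once identifiers from different schemes are clustered this way, collocation and disambiguation fall out of a simple lookup rather than string comparison.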

So, now we know that identifiers are a good thing and linked data is a good thing, but what does it all mean for the average metadata or cataloging librarian?  At ALA it became apparent that librarians are beginning to think about all of our controlled headings for which there is no associated authority data, keywords stored as subject headings and, to a lesser extent, blind references.  We've known about these sorts of issues for a long time, but the limited complexity of our OPACs and discovery systems has not caused the systems to break down.  However, linked data triplestores can't be Swiss cheese: if you're linking to something, the thing that you want to link to needs to exist.  So, what does this really mean – what do we need to face up to in our day-to-day work?  A lot of librarians are taking the situation to mean that we need to start creating a whole lot of identifiers somehow.  Some are looking at lowering the training threshold in order to dramatically increase the production of NARs (LC name authority records), while others are looking at automated or systematic ways to create ISNI and ORCID identifiers, and others still are looking at ways to create local identifiers as placeholders until proper identifiers in the form of NARs or LCSH headings (for example) are created.  On the flip side, I've also heard that it's not necessary for every person, place, thing, etc. to have an associated identifier for BIBFRAME or other library linked data to work.  In response, I've heard other librarians say that these systems may work, but they won't work as well as they could.
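The placeholder approach mentioned above might be sketched like this. It is only a sketch of one way to do it: "id.example.org" stands in for whatever namespace a library actually controls, and the deterministic UUID trick is my assumption, not an established practice:

```python
import uuid

# Mint a stable local placeholder URI for a heading that has no
# authority record yet.  uuid5 is deterministic, so the same
# normalized heading always mints the same placeholder - no
# duplicate identifiers for the same string.
BASE = "http://id.example.org/local/"          # stand-in namespace
LOCAL_NS = uuid.uuid5(uuid.NAMESPACE_URL, BASE)

def mint_placeholder(heading):
    """Derive a deterministic local URI from a normalized heading."""
    key = " ".join(heading.split()).lower()    # collapse spaces, fold case
    return BASE + str(uuid.uuid5(LOCAL_NS, key))

a = mint_placeholder("Smith, Jane,  1970-")
b = mint_placeholder("smith, jane, 1970-")
print(a == b)  # True: normalization makes minting idempotent
```

When a proper NAR or other authority eventually exists, the local URI can be mapped to it the same way any other pair of identifiers is mapped.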

For myself, I know how much work it is to create NARs now that RDA coding is required.  However, when I look at the results of a new RDA NAR in terms of quality and potential in our more complex information environments, I can see the value of the work.  My mind does remain a little boggled about where we go from here.  I'm taking a stab at assuming that it would be a good idea for me to get more efficient at creating NARs, and to focus on making more NARs for authors originating in my geographical region and for those from whom I can collect the information we now store in RDA NARs.  In general, thoughts about identifiers and how we can create more of them remain in the back of my mind, and I will most certainly be looking for opportunities as they arise.

Given the readership my blog seems to have picked up again in the last couple of months, I’d like to put the question out to my readers as to what they think about identifiers in a linked data environment and what it all means for the work that we are doing today and will need to do in the near future.  As many of us say, now is an interesting time to be a librarian and this is another issue which reinforces that idea.

By the way, I have heard that some of my readers have been binge reading my blog posts!  Thanks for your email and feedback.  I had even forgotten some of what I wrote three or more years ago.  I find it interesting to hear what people have found the most interesting and also to learn how quickly things can date in our field.  It reminds me that I shouldn't let quite so much time pass between posts.  I hear and appreciate what one of you has said about the overall rate of change in our field, and I suffer from the same information overload and overwhelming curiosity about what is up-and-coming.  I have a big backlog of topics I'd like to cover, and I think I was getting bogged down in that list and ended up not writing much of anything.   So I will shift my focus away from that list and onto what is currently being discussed.  It might be easier for me to keep up if I write about what happens to have my interest at the moment rather than topics I feel I "ought" to cover.  Seeing as there is interest in reading the blog, I'll try to increase my projected output from one post every two months or so to twice a month.  We'll see how that goes.  And, if I don't keep my word, let me know by email again!

By the way, a number of you have said that the mini-MOOC and the cataloging calculator were very valuable.  I'll continue to suggest tools and videos.  This time you might want to have a look at OCLC's "Classify" tool, which can be a quick way to work out a classification number when you either don't have time to search around ClassWeb or don't even have a subscription.  Check it out:

Tuesday, 16 February 2016

Metadata Update #27 - Key messages from ALA

It's been a little over a month since I was at ALA Midwinter in Boston and I've been thinking about the key messages that I came away with this year.  I don't think that there is any question that there was a lot of talk about "strings v things" and the increasing importance of "identities".  However, what were the bigger messages?  I did have to take my time and think through the issues and discussions!  There was a lot to consider.  In the end, this is what I concluded:

1)  Libraries have talked about modernizing their metadata, replacing MARC and entering the global information environment for a very long time.  Now we are actually doing it.  It is real; things are happening.  You don't even have to look very long to find evidence of it - if you know where and how to look.
2)  Libraries have used the same models and concepts for a very long time and have massive amounts of metadata which reflect the traditional way of thinking about information (legacy metadata).  For a long time the issue of dealing with legacy metadata in a new environment seemed insurmountable, however, while issues have not been resolved, there does seem to be a bit of light showing.
3)  There are leaders in the library community (national libraries and large research and academic libraries) who have the brainpower, funding and other resources to actually take on the task of testing and tweaking the new models and developing practical ways to implement them.  We can benefit greatly by following their progress and studying their findings.  It came through very clearly at ALA that there are small steps we can all take now to make the eventual change smoother and less stressful.

As usual, I came back from ALA all pumped up from the meetings and the fascinating discussions I had.  I continue to believe that now is an extremely exciting time to be a librarian.  Of course, I sometimes get a little discouraged when I try to share the ideas and vision with others locally who don't yet understand the bigger picture and feel that the change is too vague and far off to be of concern.  I guess the message I can take away from that is that I have some work to do locally in terms of helping to uncover the new vision.

After considering the big picture, I think it is worth talking about why "strings v things" and "identities" were such big topics of discussion, but I'd like to leave that for future posts.  I think the three key messages I've listed in this blog are important enough to get their own consideration today.