Research Guides: Making Pacific Language Materials Discoverable: Describing Oceania's Languages

Metadata standards

Pacific language list comparing standards

Comparative standards for Pacific language description
Compares the description standards that librarians and linguists use for the more than 1,400 Pacific languages.

Workflow

Our workflow evolved over the course of the project. In the early months, the first steps were to develop a thesaurus mapping ISO 639-3 codes to MARC language codes and Library of Congress Subject Headings (LCSH) (see boxes below), and to examine the entire Pacific Collection catalog and create "priority lists" of items to process.
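As a rough illustration of what such a thesaurus might look like, here is a minimal Python sketch. It is not the project's actual data model: the `LanguageEntry` class, its field names, and the lookup helpers are assumptions made for this example. The "Examplish" row is a hypothetical placeholder that uses `qaa`, a code from the ISO 639-3 range reserved for local use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LanguageEntry:
    """One thesaurus row linking the three description standards."""
    iso639_3: str                      # ISO 639-3 code used by linguists
    name: str                          # preferred language name
    marc: Optional[str] = None         # MARC language code, if any
    lcsh: Optional[str] = None         # established subject heading, if any
    alternates: List[str] = field(default_factory=list)  # other names/spellings


ENTRIES = [
    LanguageEntry("fij", "Fijian", marc="fij", lcsh="Fijian language"),
    # ISO 639-3 and MARC use different codes for Maori.
    LanguageEntry("mri", "Maori", marc="mao", lcsh="Maori language",
                  alternates=["Māori"]),
    # Hypothetical placeholder: collective MARC code, no subject heading.
    LanguageEntry("qaa", "Examplish", marc="map", lcsh=None,
                  alternates=["Exemplish"]),
]


def build_index(entries: List[LanguageEntry]) -> Dict[str, LanguageEntry]:
    """Index every preferred and alternate name, ignoring case."""
    index: Dict[str, LanguageEntry] = {}
    for entry in entries:
        for label in [entry.name, *entry.alternates]:
            index[label.casefold()] = entry
    return index


def lookup(index: Dict[str, LanguageEntry], label: str) -> Optional[LanguageEntry]:
    """Return the entry for a name found in a book, or None if unknown."""
    return index.get(label.strip().casefold())
```

Indexing alternate names alongside the preferred name means that a name found on a title page can be resolved to a single ISO 639-3 code, even when the MARC code or subject heading is missing.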

Some items were batch processed because their catalog entries indicated that they contained language materials. This was possible for major regional languages such as Fijian, Tahitian, Samoan, Tongan, and Maori.

For the remaining 12,000+ items, we gave top priority to items that were likely to contain Pacific language material but were not described well enough in our catalog to batch process; these had to be pulled from the shelves and physically inspected. We spent the first year of the project looking through individual books, first to determine whether they contained language material and then to identify the language by its name in the ISO 639-3 standard.

Second-priority items were those not already described in one of the three ways noted below as containing language material, but still likely to contain it. For example, ethnographies and other texts describing culture often include significant word lists or explain aspects of the culture using local-language terms.

Once we finished all the items in the main stacks, we moved to our Creole collection, our rare items, and our oversized items.

Why does this project require linguistic expertise?

Identifying languages for this project is not as straightforward as one might think!

In many cases, the name of the language listed in the book did not match any language names in MARC, LCSH, or ISO 639-3; for many other items, there was no language name listed. Describing these items properly requires a bit of sleuthing.

Clues for identifying languages can be found in various places:

The location of publication may narrow down the range of possible languages found in the item
Extended online searches (Google, OLAC, WorldCat) may reveal the language name
Villages, towns, or tribes mentioned in the text may point to a specific language

In some cases, we relied on outside expertise from linguists who are familiar with the languages and language families represented in the Pacific Collection, including Robert Blust, Bill Palmer, Jeff Siegel, Piet Lincoln, and Alexandre François.

Librarians describing language

Currently, librarians describe language in their catalog records in three ways.

The boxes below describe the limitations of each and how this project addressed them.

Notes field (546)

The notes fields are free-text prose, and the 546 field in particular is designated for the language of the text. The limitation of this type of description is that it is discoverable only via keyword search, which is problematic given that over 80% of Pacific languages have alternate spellings or dialect names. Getting comprehensive results requires a highly elaborate search or multiple searches, depending on the database being used.

Despite these limitations, this project reviewed existing notes for accuracy and added notes as needed. Added notes frequently include specific dialect information or a more detailed description of how a language is used in the text.
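To make the keyword-search limitation concrete, the sketch below shows the kind of expanded query a searcher would have to assemble by hand to catch every spelling of a language name. The function name `expand_query` and the quoted `OR` syntax are assumptions made for illustration; actual query syntax varies by catalog and database.

```python
from typing import Iterable


def expand_query(names: Iterable[str]) -> str:
    """Join every known spelling of a language name into one OR query.

    Duplicates that differ only in letter case are dropped, since most
    keyword searches are case-insensitive.
    """
    seen = set()
    terms = []
    for name in names:
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            terms.append('"' + name + '"')
    return " OR ".join(terms)


# Two spellings of the same language name, plus a case-only duplicate.
print(expand_query(["Maori", "Māori", "maori"]))
```

A searcher who does not already know the variant names cannot build such a query, which is why controlled vocabularies and codes matter for discovery.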

Library of Congress Subject Headings (650)

The Library of Congress Subject Headings (LCSH) are a controlled vocabulary used by librarians to ensure that the topics of publications are consistently described. This is especially important considering that over 80% of Pacific languages have alternate or dialect names.

At this moment, only 33% of Pacific languages have an established LCSH. This is due to the principle of literary warrant: if not enough material exists about a topic, no subject heading is created.

An outcome of this project will be tracking which languages need an LCSH and finding adequate materials to support proposals for additional LCSH terms for Pacific languages.

Library of Congress MARC language codes field (008/041)

The Library of Congress (LoC) maintains a MARC code list for languages, which corresponds with the ISO 639-2 list of language codes. These are 3-letter codes and they are entered primarily into the 041 field.

In the LoC list, only 23 Pacific languages, or 2%, have a unique one-to-one code. A further 11% have only what we call a collective code. For Pacific languages, there are two main collective codes: paa (Papuan (Other)) and map (Austronesian (Other)). Hundreds of Pacific languages are subsumed within these two collective codes and are therefore hard to find with any precision.

The remaining 87% of Pacific languages are not included in the LoC language code list at all. Even with the correct spelling of a language name, a librarian would be unable to find a code for the vast majority of Pacific languages.
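The three coverage categories described above (unique code, collective code, no code) can be expressed as a simple classification. The following Python sketch is only an illustration: the function names are assumptions, and the set of collective codes contains only the two named above, whereas the full MARC list includes others.

```python
from collections import Counter
from typing import Dict, List, Optional

# Only the two main collective codes named in this guide; the full MARC
# code list contains other collective codes not modeled here.
COLLECTIVE_CODES = {"paa", "map"}


def marc_coverage(marc_code: Optional[str]) -> str:
    """Classify how well a language is covered by the MARC code list."""
    if marc_code is None:
        return "none"          # language does not appear in the list
    if marc_code in COLLECTIVE_CODES:
        return "collective"    # shares a code with many other languages
    return "unique"            # one-to-one code


def coverage_shares(marc_codes: List[Optional[str]]) -> Dict[str, int]:
    """Return the rounded percentage of languages in each category."""
    total = len(marc_codes)
    if total == 0:
        return {}
    counts = Counter(marc_coverage(code) for code in marc_codes)
    return {category: round(100 * n / total) for category, n in counts.items()}
```

Run over a full list of Pacific languages and their MARC codes, a tally like this would yield the kind of percentages reported above.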

The bulk of the work of this project is to describe each item using the codes that linguists use, ISO 639-3, which is maintained by SIL International, the publisher of Ethnologue. All the publications with Pacific language text are now described using a unique language code.