Each speech variety in a collection has its own metadata.
The metadata used for cataloguing and retrieval include the
kinds of public information that would ordinarily appear in the
introduction of a book, monograph, or major article on the
languages covered by the collection.
Language codes. The three-letter Ethnologue
Code should be filled in from
the current edition of the Ethnologue at a convenient time (see footnote 1). There is a code for every known language.
The codes have been proposed as the main component of the
international standard for the ULC (Universal Language Code ISO/DIS
639-3). The code for Filomeno Mata-Coahuitlan Totonac, for example,
is tlp. There are also codes in the ULC worked out by
Linguist List for extinct languages, which are of considerable
interest to historical linguists.
If you're working with a speech variety
that is not listed in the Ethnologue, and its speakers can't
understand any listed variety well without learning it as a
second language, temporarily fill in the code for the
linguistically closest listed variety. Then tell the Editor of the
Ethnologue what you know about the variety and its
relationship with nearby varieties so that a definitive code can be
assigned, and ask for the new code so you can update your
information about the variety.
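As a rough illustration of the code format described above, the following Python sketch checks only that a string looks like a current three-letter code (lower case, as in tlp). The function name is hypothetical and is not part of Wordcorr, and passing the check does not mean the code is actually assigned in the Ethnologue.

```python
import re

def is_wellformed_code(code):
    """Return True if `code` is exactly three lower-case ASCII letters.

    This is a format check only (an assumption for illustration);
    it does not look the code up in the Ethnologue.
    """
    return re.fullmatch(r"[a-z]{3}", code) is not None

# Example calls
print(is_wellformed_code("tlp"))   # a current-style code
print(is_wellformed_code("HWI"))   # an older upper-case code
```

Note that lower-casing an older code is not a safe conversion: according to footnote 1, HAW used to denote Hawai`i Pidgin, whereas haw now denotes Hawaiian.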
Language main name. The obligatory Name
is the full name by which the language is known locally or referred
to in the linguistics literature. The full name of a language can
be long but should still be given in full, even if it's a mouthful
like Filomeno Mata-Coahuitlan Totonac. If you need diacritical
marks for the name, you can use Wordcorr's facility for typing IPA
symbols (which our Web editor can't do - sorry). No two varieties
in the same collection can have the same name.
Short language name. The obligatory Short
Name is helpful for archiving and retrieval purposes, and for
publishing results. It should be as short as possible so as to take
up little room on the screen (one to eight characters), and must be
unique within the collection. It should contain no spaces (example:
Many comparative monographs have a list of
short names used throughout. Linguists working in a particular area
are likely to have short names already in use for all the varieties
they are familiar with, so find out what their
conventions are before deciding on your short names. No two varieties
in the same collection can have the same short name.
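The short-name rules above (one to eight characters, no spaces, unique within the collection) can be sketched as a small check. This is a minimal Python sketch, not Wordcorr's own validation: the function name and the sample short names are hypothetical, and it assumes that uniqueness means an exact, case-sensitive match, which the text does not specify for short names.

```python
def short_name_problems(names):
    """Return a list of readable problems found in a collection's short names."""
    problems = []
    seen = set()
    for name in names:
        if not 1 <= len(name) <= 8:
            problems.append(f"{name!r}: length must be 1 to 8 characters")
        if " " in name:
            problems.append(f"{name!r}: must not contain spaces")
        if name in seen:  # assumption: exact-match comparison
            problems.append(f"{name!r}: duplicated within the collection")
        seen.add(name)
    return problems

# Hypothetical short names for one collection
for problem in short_name_problems(["FiloMata", "Fil Mata", "FiloMata"]):
    print(problem)
```

Returning a list of problems rather than stopping at the first one lets you review a whole collection's short names in one pass.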
Language abbreviation. The Abbreviation
is one to three characters that identify the variety uniquely. The
abbreviation is optional. It exists mainly to support importing
from the WordSurv program,
which requires you to identify each variety with a unique single
character. No two varieties in the same collection can have the
same abbreviation, and upper and lower case abbreviations (like
"A" and "a") are treated as distinct.
Locale where collected. The Locale
is the location where the variety is spoken or where you collected
the word list. Linguists have been known to collect word lists in
airports and at river junctions, sometimes at quite a distance from
where the language communities live.
Language family. The Genetic
classification may not be fully known - figuring it out may be
one reason why you're doing what you're doing. But put in as much
of a series of groupings as you can. If in doubt, follow the
Ethnologue; if you end up improving the classification, tell the Ethnologue Editor.
Quality. The Quality box
contains your overall evaluation of the word list. Use a letter
ranking (A, B, C, ...) where A is top quality and F is hopelessly bad
(there are such things, especially some of the word lists collected
by 19th century travelers that are the only record we have of some
Alternate language names. Many
languages are known by more than one name. Under Alternate
language names give a list. It helps to follow the Ethnologue's
convention of putting pejorative names in double quotes; both
"Auca" and "Araucanian," for example, derive
from Quechuan auka, which means something like 'savage'. The
people involved call their languages Waorani and Mapudungun
respectively, and don't appreciate being called what their ancient
enemies the Quechuas have called them.
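The quoting convention for pejorative names could be applied mechanically when the list is written out. The following is a minimal Python sketch under that assumption: the function name is hypothetical, and the sample list (including the spelling Huaorani) is for illustration only.

```python
def format_alternate_names(names, pejorative=()):
    """Join alternate names with commas, double-quoting the pejorative ones."""
    flagged = set(pejorative)
    return ", ".join(f'"{n}"' if n in flagged else n for n in names)

# Illustrative list for a variety whose main name is Waorani
print(format_alternate_names(["Huaorani", "Auca"], pejorative={"Auca"}))
```

Keeping the pejorative names in a separate set means the plain names stay searchable without the quotation marks.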
Where spoken is a verbal
description of the area where the language is spoken. For small
communities it may include latitude and longitude; others may be
better described by geographic regions (like "Southernmost 130
km of the Sierra Madre Occidental") or political areas like
provinces or districts. Frequently more than one language group
inhabits a single area, or one group's territory overlaps with another's.
Country where collected may
be the country where the variety is spoken, or it may be somewhere
Sources. In deciding how to cite
unpublished sources, linguists should go out of their way to
protect individuals whose participation in the research might be
adversely construed either by members of their own societies or by
local authorities. For example, it might be more prudent in some
circumstances to say "Two men in their 40s from the coastal
region" than to give names and places.
Unpublished source may simply be the
name of the person who gave you the word list, followed by the
words "native speaker" if appropriate, the person's sex
and estimated age, and the geographic or social sector of the speech
community with which that person is identified, if it's okay with
the person to include that.
Or it could be something like "word
list collected by XXX in 1932" if XXX collected it years ago
and you have XXX's permission to circulate it.
Published source is one or more
standard bibliographical entries for information you got from
published works. If a work is already mentioned in the collection
metadata, you may refer to it in the variety metadata by just the
author and the date.
Remarks holds anything else
that you feel needs to be said, but that doesn't fit convincingly
into any of the other fields.
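Taken together, the fields described above amount to one record per speech variety. The sketch below shows one way such a record might be modelled in Python. It is an illustration only: the class and attribute names are hypothetical and do not reflect how Wordcorr stores its data. Only the two fields the text calls obligatory (Name and Short Name) are required here; everything else defaults to empty.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VarietyMetadata:
    # Obligatory fields
    name: str                      # full language name
    short_name: str                # one to eight characters, no spaces
    # Remaining fields default to empty (an assumption for this sketch)
    ethnologue_code: str = ""      # three-letter code, e.g. "tlp"
    abbreviation: str = ""         # one to three characters
    locale: str = ""               # where the list was collected
    genetic_classification: str = ""
    quality: str = ""              # letter ranking, A is best
    alternate_names: List[str] = field(default_factory=list)
    where_spoken: str = ""
    country_where_collected: str = ""
    unpublished_source: str = ""
    published_source: str = ""
    remarks: str = ""

variety = VarietyMetadata(
    name="Filomeno Mata-Coahuitlan Totonac",
    short_name="FiloMata",         # hypothetical short name
    ethnologue_code="tlp",
)
print(variety.name, variety.ethnologue_code)
```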
Wordcorr and OLAC. The Variety
metadata match the precise set used by OLAC. The OLAC standard,
however, defines source information only for entire works, in
Wordcorr's case the data collection.
As a result, OLAC treats all the speech
varieties in a collection minimally. They are considered instances
of the repeatable <subject> element (in the sense of subject
language, that is, a language that the work contains
information about, as opposed to the language the work is written
OLAC searches, therefore, can currently
(February 2006) find only the full name of the variety and its
language code, which is often enough to make other linguists aware
that the data exist. Wordcorr simply keeps the detailed information
on varieties within your collection, and passes on only the name
and code to the OLAC repository. If OLAC broadens the scope of its
searches, as we hope it eventually will, we will adjust Wordcorr to
pass on the more detailed data on the varieties that is kept inside
Terminology is taken from both the
linguistically specific OLAC metadata standard and the broader Dublin Core
Metadata Initiative, which provides the framework for
handling and searching for all types of data including
bibliographies via the Internet.
1. Some of the
codes in the 14th edition of the Ethnologue (2000) were
changed in the 15th edition (2005) for compatibility
with the earlier ISO standard. The current codes are written
in lower case to highlight the transition. For example, Hawaiian
(formerly HWI) is now haw in ISO/DIS
639-3, for continuity with the earlier standard, and Hawai`i Pidgin
(formerly HAW) is now hwc.