Variety metadata. Each speech variety in a collection has its own metadata. The metadata used for cataloguing and retrieval include the kinds of public information that would ordinarily appear in the introduction of a book, monograph, or major article on the languages covered by the collection.

Language codes. The three-letter Ethnologue Code should be filled in from the current edition of the Ethnologue at a convenient time.[1] There is a code for every known language. The codes have been proposed as the main component of the international standard for the ULC (Universal Language Code, ISO/DIS 639-3). The code for Filomeno Mata-Coahuitlan Totonac, for example, is tlp. There are also codes in the ULC worked out by Linguist List for extinct languages, which are of considerable interest to historical linguists.

If you're working with a speech variety that is not listed in the Ethnologue, and its speakers can't understand any listed variety well without learning it as a second language, temporarily fill in the code for the linguistically closest known variety. Then tell the Editor of the Ethnologue what you know about the variety and its relationship with nearby varieties so that a definitive code can be assigned. Ask what the new code is so you can update your information about the variety.
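As a rough illustration of the code format described above, the following Python sketch checks only the shape of a code (three lower-case letters, as in tlp). The function name looks_like_language_code is hypothetical and is not part of Wordcorr. The check cannot tell whether a code is actually assigned in the Ethnologue; for that you still need to consult the current edition.

```python
import re

# Assumption: a current code is exactly three lower-case ASCII letters.
CODE_PATTERN = re.compile(r"[a-z]{3}")


def looks_like_language_code(code: str) -> bool:
    """Return True if the string has the shape of a three-letter code.

    This checks the format only; it does not check that the code
    is actually listed in the Ethnologue.
    """
    return CODE_PATTERN.fullmatch(code) is not None
```

A check like this would accept tlp and haw but reject an older upper-case code such as HWI, which needs to be looked up again rather than simply converted to lower case.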

Language main name. The obligatory Name is the full name by which the language is known locally or referred to in the linguistics literature. The full name of a language can be long but should still be given in full, even if it's a mouthful like Filomeno Mata-Coahuitlan Totonac. If you need diacritical marks for the name, you can use Wordcorr's facility for typing IPA symbols (which our Web editor can't do - sorry). No two varieties in the same collection can have the same name.

Short language name. The obligatory Short Name is helpful for archiving and retrieval purposes, and for publishing results. It should be as short as possible so as to take up little room on the screen (one to eight characters), and must be unique within the collection. It should contain no spaces (example: FMCTot).

Many comparative monographs have a list of short names used throughout. Linguists working in a particular area are likely to have short names already in use for all the varieties they are familiar with, so find out about their conventions before deciding on your short names. No two varieties in the same collection can have the same short name.

Language abbreviation. The Abbreviation consists of one to three characters that identify the variety uniquely. The abbreviation is optional. It exists mainly to support importing from the WordSurv program, which requires you to identify each variety with a unique single character. No two varieties in the same collection can have the same abbreviation, and upper and lower case abbreviations (like "A" and "a") are treated as distinct.
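The rules for names, short names, and abbreviations can be summarized in a short Python sketch. It is only an illustration of the constraints stated above, not Wordcorr's own code: the Variety class and the check_varieties function are hypothetical, and comparing names and short names as exact strings is an assumption (only abbreviations are explicitly described as case-sensitive).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variety:
    """Hypothetical record holding the three naming fields."""
    name: str                            # obligatory full name
    short_name: str                      # obligatory, 1 to 8 characters
    abbreviation: Optional[str] = None   # optional, 1 to 3 characters


def check_varieties(varieties: List[Variety]) -> List[str]:
    """Return a list of problems found in one collection's varieties."""
    problems = []
    seen_names, seen_short, seen_abbr = set(), set(), set()
    for v in varieties:
        if not v.name:
            problems.append("missing name")
        elif v.name in seen_names:
            problems.append(f"duplicate name: {v.name}")
        seen_names.add(v.name)

        if not 1 <= len(v.short_name) <= 8:
            problems.append(f"short name not 1 to 8 characters: {v.short_name!r}")
        if any(ch.isspace() for ch in v.short_name):
            problems.append(f"short name contains a space: {v.short_name!r}")
        if v.short_name in seen_short:
            problems.append(f"duplicate short name: {v.short_name}")
        seen_short.add(v.short_name)

        if v.abbreviation is not None:
            if not 1 <= len(v.abbreviation) <= 3:
                problems.append(f"abbreviation not 1 to 3 characters: {v.abbreviation!r}")
            # "A" and "a" are distinct, so compare without folding case.
            if v.abbreviation in seen_abbr:
                problems.append(f"duplicate abbreviation: {v.abbreviation}")
            seen_abbr.add(v.abbreviation)
    return problems
```

Run on a collection in which two varieties share the short name FMCTot but have the abbreviations "A" and "a", a check like this would report the duplicate short name and accept the two abbreviations.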

Locale where collected. The Locale is the location where the variety is spoken or where you collected the word list. Linguists have been known to collect word lists in airports and at river junctions, sometimes at quite a distance from where the language communities live.

Language family. The Genetic classification may not be fully known - figuring it out may be one reason why you're doing what you're doing. But put in as much of a series of groupings as you can. If in doubt, follow the Ethnologue; if you end up improving the classification, tell the Ethnologue Editor.

Quality. The Quality box contains your overall evaluation of the word list. Use an A, B, C, ... type of ranking where A is top quality and F is hopelessly bad (there are such things, especially some of the word lists collected by 19th-century travelers that are the only record we have of some languages).

Alternate language names. Many languages are known by more than one name. Under Alternate language names give a list. It helps to follow the Ethnologue's convention of putting pejorative names in double quotes; both "Auca" and "Araucanian," for example, derive from Quechuan auka, which means something like 'savage'. The people involved call their languages Waorani and Mapudungun respectively, and don't appreciate being called what their ancient enemies the Quechuas have called them.

Where spoken is a verbal description of the area where the language is spoken. For small communities it may include latitude and longitude; others may be better described by geographic regions (like "Southernmost 130 km of the Sierra Madre Occidental") or political areas like provinces or districts. Frequently more than one language group inhabits a single area, or overlaps with the territory of another.

Country where collected may be the country where the variety is spoken, or it may be somewhere else.

Sources. In deciding how to cite unpublished sources, linguists should go out of their way to protect individuals whose participation in the research might be adversely construed either by members of their own societies or by local authorities. For example, it might be more prudent in some circumstances to say "Two men in their 40s from the coastal region" than to give names and places.

Unpublished source may simply be the name of the person who gave you the word list, followed by the words "native speaker" if appropriate, the person's sex and estimated age, and the geographic or social sector of the speech community with which that person is identified, if it's okay with the person to include that.

Or it could be something like "word list collected by XXX in 1932" if XXX collected it years ago and you have XXX's permission to circulate it.

Published source is one or more standard bibliographical entries for information you got from published works. If a work is already mentioned in the collection metadata, you may refer to it in the variety metadata by just the author and the date.

Remarks holds anything else that you feel needs to be said, but that doesn't fit convincingly into any of the other fields.

Wordcorr and OLAC. The Variety metadata match the precise set used by OLAC. The OLAC standard, however, defines source information only for entire works, which in Wordcorr's case means the data collection.

As a result, OLAC treats all the speech varieties in a collection minimally. They are considered instances of the repeatable <subject> element (in the sense of subject language, that is, a language that the work contains information about, as opposed to the language the work is written in).

OLAC searches, therefore, can currently (February 2006) find only the full name of the variety and its language code, which is often enough to make other linguists aware that the data exist. Wordcorr simply keeps the detailed information on varieties within your collection, and passes on only the name and code to the OLAC repository. If OLAC broadens the scope of its searches, as we hope it eventually will, we will adjust Wordcorr to pass on the more detailed data on the varieties that is kept inside Wordcorr.
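The reduction described above, in which detailed variety information stays inside the collection and only the name and code are passed on, can be pictured with a minimal Python sketch. The VarietyRecord class, its field names, and the olac_subjects function are all hypothetical; this is not Wordcorr's implementation, and it does not produce actual OLAC XML.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VarietyRecord:
    """Hypothetical variety record; only a few fields are shown."""
    name: str             # full language name
    code: str             # three-letter language code
    short_name: str = ""  # kept within the collection only
    locale: str = ""      # kept within the collection only
    quality: str = ""     # kept within the collection only


def olac_subjects(varieties: List[VarietyRecord]) -> List[Tuple[str, str]]:
    """Keep only the (name, code) pair for each variety.

    Each pair stands for one repeatable subject entry; everything
    else in the record is left behind in the collection.
    """
    return [(v.name, v.code) for v in varieties]
```

For a record holding Filomeno Mata-Coahuitlan Totonac with code tlp and short name FMCTot, only the full name and tlp would be passed on.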

Terminology is taken from both the linguistically specific OLAC metadata standard and the broader Dublin Core Metadata Initiative, which provides the framework for handling and searching for all types of data including bibliographies via the Internet.

[1] Some of the codes in the 14th edition of the Ethnologue (2000) have been changed in the 15th edition (2005) to provide compatibility with the earlier ISO standard. The current codes have been changed to lower case to highlight the transition: for example, Hawaiian (formerly HWI) has been changed to haw in ISO/DIS 639-3 for continuity with an earlier standard, and Hawai`i Pidgin (formerly HAW) has been changed to hwc.

