Advancing discovery
Promoting learning
Broadening participation
Society at large

Advancing discovery. Wordcorr advances discovery at three levels: experienced, student, and interested.

Experienced: Experienced scholars with huge amounts of data profit from the fact that

  • no data or hypotheses leak out of Wordcorr.
  • all the data are available all the time.
  • observations expressing part of the investigator's analysis can be attached to any relevant unit.
  • a Residue section holds everything for which the scholar has not yet found a place in the analysis, including things that do not fit the analysis because they are due to language contact or internal analogies.

As in all science, it's the things that don't quite fit the big picture that are the cracks through which new insights make their way to the inside of one's mental box.

Furthermore, the fact that scholars no longer have to spend most of their time on data management details, to the detriment of thinking analytically, means that novel ideas have a better chance of surfacing.

And the ability to set up separate views to follow out the implications of several incompatible hypotheses at the same time should lead to more thorough documentation of the reasons for preferring or rejecting alternative analyses.

Student: Students of comparative linguistics have been known to fall asleep after a certain number of pages of data that began to look all the same. Experience with Wordcorr's interactive approach suggests that it holds the user's attention with the intensity of the more cerebral types of video game.

This means that the student using a prepared data set is more motivated to retrace the original scholar's path of discovery, and not just to read about the conclusions the scholar reached and the controversies along the way. In fact, the student just might see something the established scholar missed.

Interested: People with no linguistic background but plenty of curiosity are welcome to try Wordcorr to browse the same demonstration data sets that the developers of Wordcorr used to test the program. With the help of this Web site and the Wordcorr Help facility they can learn how to try their hand at making comparisons.

Some of them will not only be motivated to play with real language data; they will discover things about language in general they hadn't thought of before. Personal discovery of that kind could lead some to become linguists, and might soften the prejudice some people were brought up with against languages other than their own.

Promoting learning. Graduate students in master's or doctoral programs find Wordcorr useful for storing and archiving their field data and combining them with other relevant data already available from archives or publications. Then, as they tabulate what they have collected and form their own hypotheses about the patterns of language divergence that underlie each correspondence set in their tabulations, they go through the prime learning experience of graduate school: doing a workmanlike job that actually advances knowledge.

Classroom discussions at both graduate and undergraduate levels should be interesting, since the students are likely to notice things the professor has never dealt with. The "interested" category may turn into an avenue of self-directed learning.

Broadening participation. Trends in the kinds of papers accepted for meetings such as those of the Linguistic Society of America suggest that fewer people than before are active in comparative linguistic research.

One reason may be that when students of linguistics are exposed to the comparative method, they find it exciting -- until it hits them how much picky work is involved and how easy it is to overlook something, at which point semantics begins to look like a better career choice. Knowing that there is a tool that diminishes the picky work and makes it hard to overlook anything could lead to an increase in participation.

The ability to form research teams by exchanging files over the Internet can come to involve collaboration among many institutions, domestic and foreign. And the Interested status may draw in students and members of the public, including native speakers of some of the languages who are otherwise underrepresented in linguistic scholarship because of geographic or social isolation from mainstream academic institutions, or lack of funding, but who nevertheless have much to contribute.

Infrastructure. Linguist List lets any linguist search any archive that follows the norms put forth by the Open Language Archives Community (OLAC), including the Linguist List's own Electronic Metastructure for Endangered Languages Data (EMELD). In addition, Linguist List as eventual host of this Web site provides a practical point of contact for exchanging Wordcorr files among the Wordcorr community over the Internet. In this way research teams can form and discuss each other's analyses. Professors of linguistics and their students can interchange data and analyses.

The Wordcorr design began with an impossible alternative: to create and maintain an infrastructure capable of managing the data and analytical work necessary to complete a thorough classification of all the world's languages in a single large database.

Were it to operate on that scale, its data component would come to around 200 megabytes: say 10,000 speech varieties with an average of 1,000 entries per variety, each entry holding a datum of on average 10 segments of Unicode characters in UTF-8 encoding, at roughly two bytes per segment. (The figure of 10,000 varieties is based on the 6,900 living languages in the 15th edition of the Ethnologue, plus revisiting poorly documented dialects of known languages and finding varieties that linguists are still not aware of; Kurebito 2001 is an example of an actual 1,000-entry data list.)

Such world scale tabulation could well involve 1,000 investigators, with each investigator looking at 100 varieties and some varieties being looked at by more than one investigator, through 10 different views with annotations of 10 bytes for each datum. That would make the results component a little over 10 gigabytes, assuming 50 protosegments per view, 100 correspondence sets per protosegment, and 100 Unicode characters in UTF-8 encoding in each correspondence set, plus 20 4-byte citations per set. With a 50% overhead for the management component tables and behind-the-scenes linking tables, this would put the worldwide database for comparative linguistics at around 45 gigabytes, a modest size as serious databases go.

But such a world-size database would require costly maintenance over decades. One could guess that in five years the data component might grow to 5,000 varieties averaging 500 entries per variety, requiring 50 megabytes. By that time tabulation activity might reach 300 investigators working on an average of 50 varieties each, in 3 views, giving 450 megabytes. Results would still be around 50 protosegments per view and 100 correspondence sets per protosegment, but only 50 segments per average correspondence set because of the 50-variety scope, and fewer available citations per set, giving another 450 megabytes. With overhead, the actual database in five years would be around 1.5 gigabytes, which would fit on the 2002-model laptop computer this page is being edited on with room to spare.
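The arithmetic behind these estimates can be sketched in a few lines of Python. This is only a back-of-envelope illustration, not part of Wordcorr: the function names are invented here, and the figure of two bytes per segment is an assumption chosen because it reproduces the five-year estimates of 50 megabytes for data and 450 megabytes for results. The five-year results estimate is also assumed to count segment bytes only, with no citation bytes.

```python
# Back-of-envelope sizing for a Wordcorr-style database.
# All names below are illustrative; none of them come from Wordcorr itself.

BYTES_PER_SEGMENT = 2          # assumption: average UTF-8 size of one segment
PROTOSEGMENTS_PER_VIEW = 50    # from the text
SETS_PER_PROTOSEGMENT = 100    # from the text


def data_bytes(varieties, entries_per_variety, segments_per_datum=10):
    """Data component: one datum per entry per variety."""
    return (varieties * entries_per_variety
            * segments_per_datum * BYTES_PER_SEGMENT)


def results_bytes(investigators, views, segments_per_set,
                  citations_per_set=0, bytes_per_citation=4):
    """Results component: correspondence sets plus optional citations."""
    sets = (investigators * views
            * PROTOSEGMENTS_PER_VIEW * SETS_PER_PROTOSEGMENT)
    bytes_per_set = (segments_per_set * BYTES_PER_SEGMENT
                     + citations_per_set * bytes_per_citation)
    return sets * bytes_per_set


def megabytes(n_bytes):
    """Convert bytes to decimal megabytes."""
    return n_bytes / 1_000_000


# Five-year scenario: 5,000 varieties, 500 entries each;
# 300 investigators, 3 views, 50 segments per correspondence set.
print("five-year data (MB):   ", megabytes(data_bytes(5_000, 500)))
print("five-year results (MB):", megabytes(results_bytes(300, 3, 50)))

# World-scale scenario: 10,000 varieties, 1,000 entries each.
print("world-scale data (MB): ", megabytes(data_bytes(10_000, 1_000)))
```

Changing BYTES_PER_SEGMENT shows how sensitive the totals are to the encoding assumption: phonetic symbols outside the ASCII range take two or more bytes each in UTF-8, so the true average depends on the transcriptions involved.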

Educational use of Wordcorr to teach comparative phonology might swell this number to 2 gigabytes; but it is unlikely to strain the resources, because educational users are likely to stick to small collections with relatively few varieties. (Agard's excellent pedagogical presentation of the Romance language family (1984), for example, has 475 entries for 8 varieties; it would be ideal as a data set for educational use with Wordcorr.)

The important decision for Wordcorr turned out to be designing it so that one installation could in principle handle a huge amount of data, but committing resources only for what could be developed in the two years of funding. Having multiple copies of Wordcorr data collections spread independently around the world is a better way to go than putting them in just a few archival databases, because dispersal can be done free over the Internet, and is an effective kind of insurance against catastrophes. That way, we can keep operating without an enormous database that requires a staff of its own.

So the main focus of the Wordcorr Project quickly came to be the standalone application, with its own local database inside the computer of an individual investigator. If that person is out in the field collecting and analyzing primary data, it doesn't matter if there is easy Internet access or not. The person may have already been working alone for years and may not be in a position to join in team research. But whenever there is opportunity to connect with colleagues over the Internet, everything is ready to go.

One person's data component is not likely to be larger than 100 varieties, and many investigators collect only about 300 entries per variety, giving about a megabyte total. The results of tabulation for just one investigator, not hundreds, for 100 varieties and perhaps 3 views, give another megabyte. The results part is comparable to the others, averaging 50 protosegments per view, 100 sets per protosegment, 100 segments per set with citations, giving around 2 megabytes. The grand total for a substantial amount of data is under 5 megabytes. (Behind all that, the Wordcorr program and its database take up 13 megabytes, and the Java Runtime Environment that Wordcorr draws on is 68 megabytes.)

Once linkages between individual investigators start to form, file exchanges over the Internet among colleagues (even scholars who work in remote locations can stumble across an Internet cafe every now and then) can turn into productive research networks.

Dissemination. We had proposed to circulate a printed report in the usual fashion, to maybe a few dozen interested scholars. But we realized that with around 400 downloads already out on Release 2.0 of Wordcorr, this Web site (which as of November 2005 attracts over 550 different visitors each month) is much more effective as a means of dissemination.

Society at large. There is always public interest in knowing about how languages have developed and diverged.

  • Archaeology
  • Genetics
  • Comparative linguistics

are our main sources of knowledge about the paths taken by peoples whose history has never been written, and about what may have gone on in times before any history was written anywhere.

At the other end of the scale of language relationships, knowing about the ways in which closely related speech varieties can diverge from each other meshes with Agard's typology of sound changes that result in language differentiation by inhibiting intelligibility (Agard 1984, pp. 41-47; Grimes 1995a, pp. 4-8; Milliken 1988). Intelligibility and lack of intelligibility among speech varieties are also of interest to educators in multilingual or multidialectal areas.

For example, when Grimes was on Saipan Island in the Northern Marianas consulting with a project on Carolinian languages, he met a high school principal from Pohnpei in the Federated States of Micronesia. The educator was concerned with providing school texts for students on island chains. In many parts of the Pacific, people on island chains speak related languages that are unlike enough that their speakers do not understand each other readily unless they have learned the other varieties as second languages. (German and Dutch, or Spanish and Catalan, are European examples of the same phenomenon.) Grimes had worked out preliminary comparative tables for Carolinian by traditional paper-and-pencil means and had used them with the STAMP program (Weber et al. 1990) to switch a folk tale from one variety to another. When the educator saw the output, he saw it as a possible solution for his textbook problem.

Making it easy for people at large to try their hand at language comparison using Wordcorr could have a modest societal impact in two different directions. First, it may well attract more people into linguistics. Second, it may encourage people to discover for themselves the patterning and beauty of languages that they had previously thought of as primitive or deficient.

Look at references for this background material, or at the CSH Collections to see how Wordcorr has already helped in data preservation.

How it works: Wordcorr gives you control over five main functions, called Data, Views, Annotate, Tabulate, and Refine.

Data includes inputting, editing, importing and exporting files, and inspection of the data in a collection. Data are stored in text form (audio samples are useful too, but that's for the future). You have easy access to the entire IPA phonetic alphabet.

Views allows you to define multiple views of the common data in order to try out different approaches to analysis. For each individual in a research team or linguistics class, the common data are the starting point for defining multiple views and for annotating the same data differently in each view. Views can differ in coverage and ordering of speech varieties. By setting up different views, you can follow out the implications even of conflicting hypotheses.

Annotate lets you tell Wordcorr your judgments about which forms in an entry might be treated as cognates, and how the segments in them are to be lined up for comparison. You may modify the annotations as you go. If they change, Wordcorr helps you unroll the original tabulation and step through it again on the new arrangement.

Tabulate takes the data and annotations for the entries and groups in a particular view and from them generates the correspondence sets that are the primary pieces of evidence in comparative analysis. You specify a phonological environment and a tentative protosegment for each correspondence set, and Wordcorr organizes the sets accordingly. You can look at the complete register of correspondence sets (including residual sets), and at the annotated data they are derived from, at any stage of the tabulation.
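The core idea of tabulation can be sketched in Python as follows. This is a minimal illustration of how aligned forms yield correspondence sets, not Wordcorr's actual implementation. The variety names, glosses, and forms are invented placeholders rather than real language data, and the sketch leaves out the phonological environment and tentative protosegment that the user supplies for each set.

```python
from collections import defaultdict

# Hypothetical view: three varieties in a fixed order.
VARIETIES = ["A", "B", "C"]

# Hypothetical annotated data: for each entry, one aligned form per variety.
# Each form is a list of segments; "-" marks a gap added during alignment.
ALIGNED = {
    "stone": {"A": ["p", "a", "t", "u"],
              "B": ["f", "a", "t", "u"],
              "C": ["p", "a", "t", "-"]},
    "fire":  {"A": ["a", "p", "i"],
              "B": ["a", "f", "i"],
              "C": ["a", "p", "i"]},
}


def tabulate(aligned, varieties):
    """Group aligned columns into correspondence sets.

    Each key is a tuple of segments, one per variety in view order.
    Each value lists the (entry, position) citations that attest the set.
    """
    sets = defaultdict(list)
    for gloss, forms in aligned.items():
        # zip(*...) walks the aligned forms column by column.
        columns = zip(*(forms[v] for v in varieties))
        for position, column in enumerate(columns):
            sets[column].append((gloss, position))
    return dict(sets)


table = tabulate(ALIGNED, VARIETIES)
for segments, citations in table.items():
    print(" : ".join(segments), citations)
```

In this toy data the set p : f : p is attested by both entries, which is the kind of recurring pattern that counts as evidence in comparative analysis, while a set attested only once would be a candidate for closer inspection.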

Refine allows you to change how the results are arranged, by correspondence sets in clusters representing a particular protosegment and environment. You can move sets from the place where they were registered on initial tabulation to a more appropriate place. The cluster concept allows you to work with correspondence sets for which data are missing, which may be indeterminate as to where they fit the analysis best. Clusters also help in filtering out sets that represent borrowings or internal analogies, by moving them into Residue. Display Evidence constructs a presentation suitable as an appendix to a comparative monograph, containing a listing as complete as the investigator wants of detailed evidence for each conclusion the linguist has reached, starting with the most convincing evidence.

