Dataset history

Timothy Usher, Santa Fe Institute

This is a brief account of the history and conventions of successive versions of my lexical dataset: my late 1990s v.1 dataset; Usher and Whitehouse (2006), as quoted in LinguistList's LEGO project and Utilika's PanLex; and the Newguineaworld dataset, which supersedes both. It closes with a vision for the dataset's future direction.

Usher v.1

My earliest lexical dataset began in 1996 as a digitization of Joseph Greenberg's “Indo-Pacific” notebooks, unpublished collations of mostly colonial-era vocabularies to a single list of meanings. It is immediately obvious to anyone who views these that what Greenberg was aiming for was a spreadsheet, and that he would have wanted the order of these entries to be manipulable, had this been possible under the pencil-on-paper technology regime of his day.

Through the early 2000s, this was expanded to include more contemporary attestations. The result was a landscape-oriented document including most New Guinea languages, with languages in rows and meanings in columns, which could be scrolled very quickly by basic term. It was this document that turned up the relationship between Karkar Yuri and East Pauwasi River, among a number of other low-hanging fruit. I still use it today to match mystery vocabularies to known languages, and to tentatively explore unconsidered hypotheses.

Its strengths were accompanied by several critical and irreparable flaws. Transcriptions were anglophone in a standard ASCII font; the transformations were not documented. Greenberg had provided his sources at each notebook's onset, but I hadn't yet the awareness to see how important this was. Most egregiously, several sources were conflated into a single row when it was assumed that they represented the same variety of a given language.

Usher-Whitehouse

In the early 2000s, Merritt Ruhlen introduced me to Paul Whitehouse, who had independently been pursuing a very similar project, with a shared focus on New Guinean languages. By the mid 2000s, Paul and I had merged our collections.

Paul's formats, while similar in purpose with much overlapping coverage, were superior in several respects. First, they used a standard IPA font, though still ASCII rather than UTF-8. Second, he had come to realize the strong need to distinguish records from different sources earlier than I did, although his earliest work did not specify these very well. Finally, they were in portrait orientation, with languages in columns and meanings in rows. This eroded a key functionality of my v.1 dataset (above) but avoided the problem of having to break up single records into multiple tables, since spreadsheets were at the time generally limited to 256 columns and our comparative list was longer than that. These advantages convinced me to adopt his format, with many of my v.1 vocabularies ported forward to be compatible.

During the Usher-Whitehouse years, we significantly expanded our coverage, with Paul concentrating especially on unpublished survey vocabularies from the Summer Institute of Linguistics in Ukarumpa. We gradually expanded our comparative term list from 987 terms to 1,820, largely in order to accommodate New Guinea local glosses which were present in Usher v.1 but not Whitehouse due to its more global ambitions, and to accommodate Australian local terms such as flora and fauna which had not hitherto been included in either comparative termlist. This was to culminate in the 2,656-row sort between the Usher-Whitehouse termlist and that of IDS, which (not entirely by choice) became the basis for the LEGO vocabularies.

Newguineaworld

The Newguineaworld dataset, as it exists and is under expansion on this site, has departed from the curational practices and intentions of Usher-Whitehouse in several respects. First, the dual purposes of the Usher-Whitehouse format, as archival record and as comparative tool, have been split into several stages. The foundational stage is archival in nature, aiming to model the presentational intentions of the source within a standardized manipulable spreadsheet format. Individual entries are accompanied by a page number and an original display order. This supports citation and verification, while providing a living link between the words we see on the page and the data structures we'd like to create.
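An archival-stage entry of this kind can be pictured as a small record. The following is only a sketch under stated assumptions: the class and field names (ArchivalEntry, source, page, display_order, gloss, form) are hypothetical and are not taken from the Newguineaworld files, display order is assumed to run across the whole source, and the data are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivalEntry:
    """One attestation as it appears in a source (hypothetical schema)."""
    source: str         # citation key for the source (placeholder)
    page: int           # page on which the entry is printed
    display_order: int  # position of the entry within the source as presented
    gloss: str          # gloss exactly as the source gives it
    form: str           # form exactly as the source transcribes it


def in_source_order(entries):
    """Return entries in the order the source presents them."""
    return sorted(entries, key=lambda e: e.display_order)


def citation(entry):
    """Build a short reference for checking an entry against the page."""
    return f"{entry.source}: {entry.page}"


# Placeholder data, invented for illustration only.
entries = [
    ArchivalEntry("Source A", 13, 2, "water", "form-2"),
    ArchivalEntry("Source A", 12, 1, "head", "form-1"),
]

for e in in_source_order(entries):
    print(citation(e), e.gloss, e.form)
```

Because the page number and display order travel with every entry, the rows can be re-sorted for any purpose and still be restored to the source's own presentation.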

The big disadvantage relative to Usher-Whitehouse, and even more so to Usher v.1, is the deprecation of the comparative function in the archival stage documents. For a quick browse through a very large number of languages, these early versions remain superior. However, the virtues of this format were held to outweigh this as the quality demands of my comparative work increased. More broadly, I was sick of redoing it, rechecking sources to add missed glosses, struggling to find page numbers to proof particular entries, etc. Indeed, some have been redone yet again to less-than-archival standards on other websites, and no doubt collectively redone many times on different researchers' desktops. It seemed best to do things correctly the first time, with a mind towards how the comparative function can be rebuilt within a structure that draws from and correlates these foundational entities to one another, without sacrificing the reliability, verifiability and citability of individual attestations.

Next steps

So, how can the comparative function be rebuilt? By comparative function is meant the ability to easily compare comparable meanings across languages, even if the source glosses are slightly different. Excepting a few standardized termlists, each source nearly always presents a different selection of glosses in a different order. In Usher-Whitehouse and its predecessors, this was handled by directly rekeying each entry into the appropriate cell, sometimes but not always with an original gloss appended to the form in brackets and/or quotes. This was expedient, but resulted in something which needed to be redone, with the revisitation of all the original sources. What should be done is to correlate original display orders in columns to the master comparative termlist (“concepticon”) using a sort function or its programming equivalent. These sorter files are being created in the background, but cannot be technically implemented at this time.
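The correlation of display orders to a master termlist might be sketched as follows. This is an illustration under stated assumptions and not the sorter files themselves: the concept IDs, the dictionary layout, and the function name align are hypothetical, and all glosses and forms are invented placeholders.

```python
# Master comparative termlist, in comparative order (placeholder concept IDs).
MASTER = ["HEAD", "WATER", "FIRE"]

# Archival entries for one source, keyed by original display order.
# Glosses and forms are invented placeholders.
entries_a = {
    1: ("water, river", "form-a1"),
    2: ("head", "form-a2"),
    3: ("my father's canoe", "form-a3"),
}

# Hypothetical sorter for that source: display order -> concept ID.
# Entries with no counterpart in the master list are simply left out.
sorter_a = {1: "WATER", 2: "HEAD"}


def align(entries, sorter, master):
    """Arrange a source's entries by master concept without rekeying them."""
    by_concept = {}
    for order, (gloss, form) in entries.items():
        concept = sorter.get(order)
        if concept is not None:
            # Keep the display order and original gloss with the form.
            by_concept.setdefault(concept, []).append((order, gloss, form))
    # One row per master concept, empty where the source has no entry.
    return [(concept, by_concept.get(concept, [])) for concept in master]


for concept, hits in align(entries_a, sorter_a, MASTER):
    print(concept, hits)
```

The archival entries are never modified here; the comparative view is derived from them, so a correction to the sorter does not require returning to the original source.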

Another vital comparative function is the ability to compare forms under a single system of transcription, since following more than a handful of systems at once is distracting and confusing. In Usher-Whitehouse, this was addressed by regularizing transcriptions within the foundational document, such that original transcriptions were neither visible nor always easily recoverable, even when the rules were known. This, too, required revisitation of all the original sources. The solution adopted here is to preserve original transcriptions in one set of columns (or rows) alongside clearly demarcated standard IPA versions, so that one or the other, or both, can be displayed depending upon our immediate purposes without compromising either the integrity of the record or orthographic consistency across sources.
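A minimal sketch of keeping both transcriptions side by side follows. Everything in it is an assumption made for illustration: the rule table is a toy mapping and not the conversion rules used for any actual source, the function names are hypothetical, and "yanga" is an invented string, not an attested form.

```python
# Toy conversion rules for one hypothetical source orthography.
# These are illustrative only, not the rules applied to any real source.
RULES = {"ng": "ŋ", "y": "j"}


def standardize(form, rules):
    """Rewrite a form left to right, preferring the longest matching rule."""
    keys = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(form):
        for key in keys:
            if form.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule applies at this position: copy the character as it is.
            out.append(form[i])
            i += 1
    return "".join(out)


def display(row, mode="both"):
    """Show the original transcription, the standardized one, or both."""
    if mode == "original":
        return row["original"]
    if mode == "ipa":
        return row["ipa"]
    if mode == "both":
        return f'{row["original"]} [{row["ipa"]}]'
    raise ValueError(f"unknown mode: {mode}")


original = "yanga"  # invented placeholder form
row = {"original": original, "ipa": standardize(original, RULES)}
print(display(row, "both"))
```

Since the original column is stored untouched, a revised rule table can be reapplied at any time without revisiting the source.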

The most sweeping transformation is already being implemented: the integration of the Newguineaworld lexical dataset with a text-based historical-phonological encyclopedia, under the theory that context multiplies value. Correspondingly, it serves as a core component of the evidentiary basis for our original findings.