Updated: Lightning talk on Pleiades at NEH

UPDATED 4 March 2016: Followers of the ISAW News Blog will have seen the recent piece on the award of an NEH grant for new work on the Pleiades gazetteer and the follow-up piece about my intended participation in a meeting of similarly funded project directors. NEH's Office of Digital Humanities has now posted video of all the talks on YouTube. You can drop into my 3 minutes, starting at 16:01:10. Or, if you'd prefer to read rather than listen, here are the words I read and the slides I showed during the 3 minutes allotted to my "lightning talk".

Slides are at Slideshare.

I am here today as a representative of the Pleiades community, an international group of volunteers who build and maintain the most comprehensive geospatial dataset for antiquity available today. Like many people in this room, we believe that the study of the past is a fundamental aspect of the Humanities endeavor. It's essential to understanding what it means to be human today, and to envisioning how we might be better humans in the future. Our collective past is geographically entangled: "where" is the stage on which the human drama is played, and it's an important analytical variable in every field of the past-oriented Humanities: history, archaeology, linguistics, text analysis, and so on. We also believe that the places and spaces known to and inhabited by our ancestors are the precious and fragile property of every person alive today, no matter whether we can still see and touch that heritage, or only just imagine it. 
So, scholars, students, and the public need free and open data about their ancient geography. They need it in order to learn about the past, to advance research, and to inform conservation. They also need it to connect digital images, texts, and other information across the web, regardless of where that information is created or hosted. Unsurprisingly, these same individuals have the collective skill and energy necessary to create and improve geographic information, if only we can put good tools in their hands.
That's what Pleiades does. It combines web-enabled public participation with peer review and editorial oversight in order to identify and describe ancient places and spaces. It continuously enables and draws upon the work of individuals, groups, and their computational agents as a core component of a growing, public scholarly communications network. 
And now, thanks to the Endowment, its reviewers, and the National Council on the Humanities, we have the opportunity to supercharge that network. Responding to recent adjustments in the guidelines for the digital humanities implementation grants, we requested funds to retool Pleiades. A decade of growth and diversification has left our web application underpowered and unreliable even as more users and external projects look to Pleiades as a source of information and a venue for publication. We need more power to address the most urgent needs articulated by the community: accelerated content creation and review, faster dissemination and discovery, display support for phones and tablets, expanded spatial and temporal coverage, flexible modeling of spatial relationships, and comprehensive, customized access and preservation. 
Thanks to NEH, we can continue going where no daughters of Atlas have gone before!

Sorting Unicode Strings Across Languages and Writing Systems in Python

Sometimes when you put together a particular list of character strings, a particular use case, a particular audience, and default behaviors, you don't get what you need. Consider an arbitrarily-ordered list of Unicode strings:

 >>> titles = [  
... u'Alétheia - Revista de estudos sobre Antigüidade e Medievo',
... u'Archaeology Times',
... u'ákoue',
... u'Journal of Ancient Fish',
... u'Zeitschrift für Numismatik',
... u'Antípoda',
... u'Antipodes',
... u'Alecto',
... u'Ägyptische Residenzen und Tempel',
... u'Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο',
... u'Античный мир и археология',
... u'ACME',
... u'Ávila',
... u'Άβιλα',
... u'Araştırma Sonuçları Toplantıları',
... u'Archäologische Informationen',
... u'Académie des Inscriptions et Belles-Lettres: Lettre d’information',
... u'Àvila',
... u'‘Atiqot',
... u'Aleppo'
... ]

Sort the list:

 >>> for title in sorted(titles): print title  
Académie des Inscriptions et Belles-Lettres: Lettre d’information
Alétheia - Revista de estudos sobre Antigüidade e Medievo
Araştırma Sonuçları Toplantıları
Archaeology Times
Archäologische Informationen
Journal of Ancient Fish
Zeitschrift für Numismatik
Ägyptische Residenzen und Tempel
Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο
Античный мир и археология
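
That interleaving is just the default comparison at work: Python orders strings by raw Unicode code point, so every accented Latin character, and every Greek and Cyrillic character, sorts after 'Z'. A quick check (Python 3 syntax):

```python
# Default string comparison is by Unicode code point, not alphabetically.
print(ord('Z'), ord('Ä'), ord('Α'), ord('А'))  # Latin Z, Latin Ä, Greek Α, Cyrillic А

assert 'Ä' > 'Z'  # U+00C4 (196) sorts after U+005A (90)
assert 'Α' > 'Ä'  # Greek capital alpha is U+0391 (913)
assert 'А' > 'Α'  # Cyrillic capital A is U+0410 (1040)
```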

You and your users may not be satisfied with this result. Perhaps you'd prefer to see a list sorted across languages and scripts in a way that considers Roman characters (i.e., A-Z, a-z) as equivalent for purposes of sorting regardless of whether they bear diacritics or not (e.g., A == Á == À). Perhaps you'd like to go even further and consider characters equivalent across writing systems on the basis of a Romanization scheme (e.g., Greek α == Russian а == Latin/English a).

One way is to write a function that gives us an alternative sort key for each string; that is, a derivative string that, when sorted against other such strings, gives the desired result.

If we're comfortable with some amount of naiveté in the results, we can get this done pretty quickly for some languages and scripts by taking advantage of existing packages in the Python open-source ecosystem.
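
The mechanism for plugging in such a function is the key argument to sorted(): Python compares the keys the function returns instead of the original strings. A trivial sketch (Python 3 syntax) using plain case-folding as the key:

```python
# A sort key is just a function passed to sorted(); Python sorts by the
# values it returns rather than by the original strings.
def sort_key(s):
    # Placeholder key: fold case so 'apple' and 'Apple' compare as equals.
    return s.lower()

words = ['banana', 'Cherry', 'apple']
print(sorted(words))                 # ['Cherry', 'apple', 'banana'] -- code-point order
print(sorted(words, key=sort_key))   # ['apple', 'banana', 'Cherry']
```

The sections below are about building a much smarter key function than this one.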

How to ignore diacritics

The venerable ASCII character encoding scheme (see also the "Basic Latin" Unicode code block) provides for only the baseline Roman characters, plus Arabic numerals, English-standard punctuation, and some ancillary things that don't concern us here.

Tomaž Šolc's unidecode package (a port of Sean M. Burke's Text::Unidecode Perl module) provides a quick and easy way to:
[take] Unicode data and [try] to represent it in ASCII characters ... where the compromises taken when mapping between two character sets are chosen to be near what a human with a US keyboard would choose.
How does that work out for our example list of strings?

 >>> from unidecode import unidecode  
>>> print(titles[0])
Alétheia - Revista de estudos sobre Antigüidade e Medievo
>>> print(unidecode(titles[0]))
Aletheia - Revista de estudos sobre Antiguidade e Medievo
>>> for title in titles: print(unidecode(title))
Aletheia - Revista de estudos sobre Antiguidade e Medievo
Archaeology Times
Journal of Ancient Fish
Zeitschrift fur Numismatik
Agyptische Residenzen und Tempel
Akamas, Omilos Anadeixes Mnemeion Salaminos, Enemerotiko Deltio
Antichnyi mir i arkheologiia
Arastirma Sonuclari Toplantilari
Archaologische Informationen
Academie des Inscriptions et Belles-Lettres: Lettre d'information

You'll have noted that unidecode.unidecode() does more than just ignore diacritics. It attempts to transliterate non-Roman characters (i.e., "Romanize" them) as well. Indeed, note what the docs say:
The quality of resulting ASCII representation varies. For languages of western origin it should be between perfect and good. On the other hand transliteration (i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system) of languages like Chinese, Japanese or Korean is a very complex issue and this library does not even attempt to address it. It draws the line at context-free character-by-character mapping. So a good rule of thumb is that the further the script you are transliterating is from Latin alphabet, the worse the transliteration will be.
Note that this module generally produces better results than simply stripping accents from characters (which can be done in Python with built-in functions). It is based on hand-tuned character mappings that for example also contain ASCII approximations for symbols and non-Latin alphabets.
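
The "built-in functions" route the docs allude to lives in the unicodedata module: normalize the string to a decomposed form (NFKD) so each accented character becomes a base character plus combining marks, then drop the marks. A sketch in Python 3 syntax:

```python
import unicodedata

def strip_accents(s):
    # Decompose each character into base + combining marks (NFKD),
    # then keep only the non-combining characters.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('Alétheia - Revista de estudos sobre Antigüidade e Medievo'))
# Aletheia - Revista de estudos sobre Antiguidade e Medievo
print(strip_accents('Ακάμας'))  # Ακαμας -- accents gone, but no transliteration
```

Unlike unidecode, this leaves non-Latin base characters (Greek, Cyrillic, etc.) untouched, so it only solves the diacritics half of our problem.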

Alternative Romanization techniques

I have no doubt that there are a wide variety of good techniques and packages for Romanizing character strings available in Python. I have not done a comprehensive search for these, and would welcome relevant, collegial comments with links on this post.

I did notice Artur Barseghyan's transliterate package. It is a:
Bi-directional transliterator for Python [that] transliterates (unicode) strings
according to the rules specified in the language packs (source script <->
target script).
At the time of this writing, the package provided Romanization for strings identifiable as written in the standard scripts (as cataloged in the IANA language subtag registry) for the following languages:

 >>> from transliterate import get_available_language_codes as get_langs  
>>> get_langs()
['el', 'ka', 'hy', 'ru', 'bg', 'uk']

Side note: getting language names for IANA codes

There's open source for that too: Matthew Caruana Galizia's IANA Language Tags project, about which:
IANA's official repository is in record-jar format and is hard to parse. This project provides neatly organized JSON files representing that data.
It also provides a JavaScript API. But I don't need that for this purpose. I can just grab the JSON version of the IANA repository data from Python, with an assist from the requests package. I can then use it to make human-readable names for the languages the transliterate package supports:

 >>> import requests  
>>> r = requests.get('https://raw.githubusercontent.com/mattcg/language-subtag-registry/master/data/json/registry.json')
>>> r.status_code
>>> lang_registry = r.json()
>>> languages = {}
>>> for lang in lang_registry:
...     if lang['Type'] == 'language':
...         languages[lang['Subtag']] = lang['Description'][0]
>>> len(languages)
>>> romanizable_languages = [languages[code] for code in get_langs()]
>>> for l in romanizable_languages: print(l)
Modern Greek (1453-)
Georgian
Armenian
Russian
Bulgarian
Ukrainian

Of course, there's a lot more one can do with that IANA JSON file... but let's get back to Romanization, by way of ...

Language and script detection

The transliterate package demands that you be able to identify the language (and implicitly the writing system) of the string you want to Romanize. The package provides a "very basic" language detection method to help us out:

 >>> from transliterate import detect_language  
>>> for title in titles: print(u'{0}: "{1}"'.format(detect_language(title), title))
ru: "Alétheia - Revista de estudos sobre Antigüidade e Medievo"
ru: "Archaeology Times"
None: "ákoue"
ru: "Journal of Ancient Fish"
ru: "Zeitschrift für Numismatik"
ru: "Antípoda"
ru: "Antipodes"
ru: "Alecto"
ru: "Ägyptische Residenzen und Tempel"
el: "Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο"
ru: "Античный мир и археология"
ru: "ACME"
None: "Ávila"
el: "Άβιλα"
ru: "Araştırma Sonuçları Toplantıları"
ru: "Archäologische Informationen"
ru: "Académie des Inscriptions et Belles-Lettres: Lettre d’information"
None: "Àvila"
ru: "‘Atiqot"
ru: "Aleppo"

The apparent default value of 'ru' (Russian) for any pure-ASCII string is problematic.

There is another package designed specifically for language detection: Marco Lui's langid package:
langid.py is a standalone Language Identification (LangID) tool.
The design principles are as follows:
  1. Fast
  2. Pre-trained over a large number of languages (currently 97)
  3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
  4. Single .py file with minimal dependencies
  5. Deployable as a web service
Let's give it a whirl:

 >>> import langid  
>>> print(titles[0])
Alétheia - Revista de estudos sobre Antigüidade e Medievo
>>> langid.classify(titles[0])
('pt', 0.9997639656878511)
>>> for title in titles: print(u'{0}: "{1}"'.format(repr(langid.classify(title)), title))
('pt', 0.9997639656878511): "Alétheia - Revista de estudos sobre Antigüidade e Medievo"
('hu', 0.4951981167657506): "Archaeology Times"
('cs', 0.6559598835005537): "ákoue"
('en', 0.9999542989722191): "Journal of Ancient Fish"
('de', 1.0): "Zeitschrift für Numismatik"
('cs', 0.9327142178388013): "Antípoda"
('pt', 0.35121448605116784): "Antipodes"
('en', 0.16946150595865334): "Alecto"
('de', 0.9999999632942209): "Ägyptische Residenzen und Tempel"
('el', 1.0): "Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο"
('ru', 0.9999999999665641): "Античный мир и археология"
('en', 0.16946150595865334): "ACME"
('lv', 0.3049662840719183): "Ávila"
('el', 1.0): "Άβιλα"
('tr', 0.9999345038597317): "Araştırma Sonuçları Toplantıları"
('de', 0.9999982379021314): "Archäologische Informationen"
('fr', 0.9999999999999973): "Académie des Inscriptions et Belles-Lettres: Lettre d’information"
('en', 0.16946150595865334): "Àvila"
('fr', 0.9511801373660571): "‘Atiqot"
('en', 0.31773663282480374): "Aleppo"
>>> for title in titles: print(u'{0}: "{1}"'.format([None, languages[langid.classify(title)[0]]][langid.classify(title)[1] > 0.9], title))
Portuguese: "Alétheia - Revista de estudos sobre Antigüidade e Medievo"
None: "Archaeology Times"
None: "ákoue"
English: "Journal of Ancient Fish"
German: "Zeitschrift für Numismatik"
Czech: "Antípoda"
None: "Antipodes"
None: "Alecto"
German: "Ägyptische Residenzen und Tempel"
Modern Greek (1453-): "Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο"
Russian: "Античный мир и археология"
None: "ACME"
None: "Ávila"
Modern Greek (1453-): "Άβιλα"
Turkish: "Araştırma Sonuçları Toplantıları"
German: "Archäologische Informationen"
French: "Académie des Inscriptions et Belles-Lettres: Lettre d’information"
None: "Àvila"
French: "‘Atiqot"
None: "Aleppo"

That's better, especially if we pay attention to the probability measures attached to each result.

The missing pieces

So, I think we can now imagine some process that follows this outline:

  • for each string:
    • if not every character in the string is ASCII
      • try to use langid.classify() to determine language
        • if language is successfully determined:
          • if transliterate.translit() supports the language, get the transliteration
      • remove remaining non-ASCII characters by brute force
      • if result is a zero-length string, step back to the original string (what else can you do?)
    • else: just use the original string
    • strip all the punctuation
    • convert everything to lowercase
    • normalize or remove spaces (depending on how you want to deal with word breaks in sorting)

ASCII detection and stripping

It's pretty easy to strip non-ASCII characters from a Unicode string in Python:

>>> t = u"Antípoda"
>>> t.encode('ascii', 'ignore')
'Antpoda'

We can exploit this to do quick and dirty "all ASCII" detection:

>>> t = u"Antípoda"
>>> t == unicode(t.encode('ascii', 'ignore'))
False
>>> u = u'Chicken'
>>> u == unicode(u.encode('ascii', 'ignore'))
True

This approach depends upon the assumption that the strings you're starting with are really Unicode strings. If you were a huge regular-expressions fan, you could use the Python re module to perform a similar test, but I'm betting it would run slower.

If you weren't so confident in the consistency of your source list, you'd need to do some preprocessing. Unicode normalization might be necessary. You might have to resort to the chardet module in order to guess at encodings.
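
In Python 3 there is no separate unicode type, so the test above changes shape; str.isascii() (added in Python 3.7) does the job directly, and an encode-and-catch helper works on older versions. A sketch:

```python
# Python 3 equivalents of the ASCII checks above.
t = 'Antípoda'
u = 'Chicken'
assert not t.isascii()  # str.isascii() requires Python 3.7+
assert u.isascii()

def is_ascii(s):
    # Portable variant: attempt a strict ASCII encode and catch the failure.
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

assert not is_ascii(t)
assert is_ascii(u)
```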

Stripping punctuation

Now here's a job for regular expressions. We can make short work of this task especially if we take advantage of the new, alternative regex package for Python, which is intended to eventually replace the current implementation.

 >>> import regex as re  
>>> rx = re.compile(ur'[\p{P}_\d]+')
>>> t = u'Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο'
>>> print(rx.sub(u'', t))
Ακάμας Όμιλος Ανάδειξης Μνημείων Σαλαμίνος Ενημερωτικό Δελτίο
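
If pulling in the third-party regex package isn't an option, the standard library's unicodedata module can approximate the same pattern: \p{P} corresponds to the Unicode general categories beginning with 'P'. A sketch in Python 3 syntax (note that filtering all of category N* is slightly broader than \d, which matches only decimal digits):

```python
import unicodedata

def strip_punctuation(s):
    # Drop characters whose Unicode general category is punctuation (P*)
    # or a number (N*), mirroring the [\p{P}_\d]+ pattern above.
    # The underscore is category Pc, so P* already covers it.
    return ''.join(c for c in s if unicodedata.category(c)[0] not in 'PN')

t = 'Ακάμας, Όμιλος Ανάδειξης Μνημείων Σαλαμίνος, Ενημερωτικό Δελτίο'
print(strip_punctuation(t))
# Ακάμας Όμιλος Ανάδειξης Μνημείων Σαλαμίνος Ενημερωτικό Δελτίο
```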

Putting it all together

At this point all the pieces are on the table: diacritic folding and Romanization (unidecode, transliterate), language detection (langid), punctuation stripping (regex), plus lowercasing and whitespace normalization from the standard library. Wiring them together in the order given by the outline above yields the alternative sort key we set out to build.
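
As a minimal sketch of that combination, here is a standard-library-only sort-key function in Python 3 syntax. It implements the diacritic, punctuation, case, and whitespace steps from the outline, but skips the langid/transliterate Romanization step, so non-Latin scripts pass through untransliterated:

```python
import unicodedata

def sort_key(s):
    """Build a naive cross-language sort key, following the outline above."""
    # 1. Strip diacritics: decompose (NFKD), then drop combining marks.
    s = unicodedata.normalize('NFKD', s)
    s = ''.join(c for c in s if not unicodedata.combining(c))
    # 2. Strip punctuation (category P*) and numbers (N*).
    s = ''.join(c for c in s if unicodedata.category(c)[0] not in 'PN')
    # 3. Lowercase and collapse runs of whitespace to single spaces.
    return ' '.join(s.lower().split())

titles = ['Ávila', 'Àvila', 'Antipodes', 'Antípoda', 'Alecto', 'ACME']
for t in sorted(titles, key=sort_key):
    print(t)
# ACME, Alecto, Antípoda, Antipodes, Ávila, Àvila
```

Because the sort is stable, 'Ávila' and 'Àvila' (which produce identical keys) keep their original relative order; a real implementation would add the Romanization step and perhaps a tie-breaker on the original string.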

New in Maia: Kristina Killgrove

I have just added the following two blogs to the Maia Atlantis feed aggregator:

title = Kristina Killgrove (Forbes)
url = http://www.forbes.com/sites/kristinakillgrove/
creators = Kristina Killgrove
description = Kristina Killgrove's stories.
feed = http://www.forbes.com/sites/kristinakillgrove/feed/

title = Powered By Osteons
url = http://www.poweredbyosteons.org/
creators = Kristina Killgrove
feed = http://www.poweredbyosteons.org/feeds/posts/default?alt=rss

I'm embarrassed to admit that Powered By Osteons is only now getting into Maia. I've been impressed with and reading it for a long time; I don't know how I failed to include it previously. My apologies to the author!


Worth remembering:

$ pdftotext -eol mac -nopgbrk foo.pdf 

$ find . -iname '*.pdf' -exec pdftotext -eol mac -nopgbrk {} \;