Pleiades by Country: Iraq and Syria

I was recently asked by a colleague how many "Iraqi and Syrian places" there are in Pleiades. Pleiades is not set up to automatically answer that kind of question; indeed we don't store modern national boundaries in the system at all. So, I thought I'd try to start a series of blog posts in which I work through getting the answer. Hopefully along the way we'll observe some useful things about the structure of Pleiades data and develop some ideas about how to exploit it and make it more useful.

First, a caveat. The structure of the Pleiades dataset is not simple, so the first rule of GIS applies here (as elsewhere): get to know your dataset before you draw conclusions from it. For Pleiades, essential reading includes: "Pleiades Data Model" and the "Pleiades Downloads" page, as well as the README file for any dataset one downloads.

Next, we need data. For the Pleiades place resources, we'll grab the latest nightly "dump" file, in CSV format:

curl -O

For modern countries, we'll grab the latest version (March 2013) of the US State Department's Detailed World Polygons file, derived from the Large Scale International Boundaries (LSIB) dataset.

curl -O

If we pull both of these datasets into a GIS (I'm using QGIS) using their native geographic reference system (both are in WGS-84), our first order of business is obvious: discard Pleiades data that is not in Iraq. 

Since the State Department claims online that the LSIB boundary data is accurate to within "a couple of kilometers" and I don't have any more accurate metadata for these boundaries, I'd like to buffer the Iraq country polygon by 5 kilometers before running an intersection selection on the Pleiades data. To do this easily in QGIS (with the vector geoprocessing buffer tool), I first need to reproject both datasets into a coordinate system that uses meters, rather than degrees. Not wanting to spend too much time pondering, I did a quick web search, came across Dwayne Wilkins' Iraqi GIS Projections page, and picked the UTM 38N (WGS-84) projection, which is already provided, along with so many others, in QGIS. In saving off the projected version of the country boundaries, I selected just the polygon for Iraq in order to save myself a bit of time later. Here are the resulting datasets in their new cartographic projection (zoomed in a general way to southwest Asia):

Since we're now in a projection that uses meters, it's an easy matter to set the parameters in the buffer tool so that we get a 5km exterior buffer around the country polygon:

In QGIS, vector -> geoprocessing -> intersect lets us select just those Pleiades places that fall within that buffer polygon (446 point features, including at least one that certainly falls in modern Syria and another in Iran, but let's not worry about that for the moment):
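The same first pass can be sketched in a few lines of Python. This is a crude approximation of the QGIS workflow, not a substitute for a proper buffer and intersection: it just tests each place's representative point against a rectangle padded by roughly five kilometers' worth of degrees. The bounding-box coordinates here are rough assumptions, and the reprLat/reprLong column names are as I recall them from the dump's README, so check your copy of the file.

```python
import csv
import io

# Illustrative bounding box for Iraq in decimal degrees (an assumption,
# not an authoritative boundary), padded by ~0.05 degrees, which is very
# roughly 5 km at these latitudes.
PAD = 0.05
MIN_LON, MAX_LON = 38.8 - PAD, 48.6 + PAD
MIN_LAT, MAX_LAT = 29.0 - PAD, 37.4 + PAD

# Tiny stand-in for the Pleiades places CSV dump; the real file has many
# more columns, but reprLat/reprLong carry the representative point.
sample = io.StringIO(
    "id,title,reprLat,reprLong\n"
    "1,Babylon,32.54,44.42\n"
    "2,Athens,37.97,23.72\n"
    "3,NoCoordinates,,\n"
)

kept = []
for row in csv.DictReader(sample):
    try:
        lat, lon = float(row["reprLat"]), float(row["reprLong"])
    except ValueError:
        continue  # places without coordinates can't be filtered spatially
    if MIN_LON <= lon <= MAX_LON and MIN_LAT <= lat <= MAX_LAT:
        kept.append(row["title"])

print(kept)
```

Of the sample rows, only Babylon falls inside the padded rectangle; a real analysis still needs the buffered-polygon intersection described above.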

It's at this point that we have to fall back on our knowledge of the Pleiades dataset. Pleiades is in the first instance a historical and archaeological gazetteer, not a database of extant archaeological sites. 

New in Maia: Turkish Archaeological News and Sarah E. Bond

I have just added the following blogs to the Maia Atlantis feed aggregator:

title = Turkish Archaeological News
url =
license =
feed =

title = Sarah E. Bond
url =
description = Late Antiquity, Digital Humanities, and Musings on the Classical World
feed =

New in Maia: Mār Šiprim and Laboratoire Orient et Méditerranée

I have added feeds for the following web resources to the Maia Atlantis feed aggregator:

title = Mār Šiprim
url =
creators = International Association for Assyriology
license = None
description = Official Newsletter for the International Association for Assyriology (IAA). Through this Newsletter, the IAA aims to provide an online platform for Assyriologists and Near-Eastern enthusiasts where they can interact with each other on both an intellectual and an informal level, thus establishing an international linkage among colleagues.
keywords = None
feed =

title = Laboratoire Orient et Méditerranée
url =
creators = None
license = None
description = Orient & Méditerranée is a joint research unit (Unité Mixte de Recherche) in historical, philological, and religious sciences, associating the Centre National de la Recherche Scientifique (CNRS), the Université Paris-Sorbonne (Paris IV), the Université Panthéon-Sorbonne (Paris 1), and the École Pratique des Hautes Études
keywords = académie des inscriptions et belles-lettres, actualités, annuaire, antique, antiques, antiquité classique et tardive, arabie, araméen, archeology, archives, archéologiques, archéologues, bible, calendrier, centre national de la recherche scientifique, chantiers de fouille, cnrs, collections, colloques, collège de france, communication, contact, coopérations, coran, cours spécialisés, crédits, disciplines, distinctions, documentaires, débuts du christianisme, electroniques, formation, historiens des religions, informations administratives, initiation, islam médiéval, langue syriaque, les chercheurs du lesa, lesa, liens utiles, linguistes, l’université panthéon-sorbonne, l’université paris-sorbonne, mediterranee, membres, missions de terrain, monde byzantin, monde méditerranéen, mondes cananéen, médecine grecque, méditerranée, médiévale, organigramme, orient, orient & méditerranée, orient chrétien, ougarit, ouvrages récents, paris 1, paris iv, philologiques, philologues, phénicien, plan du site, proche-orient, programmes, présentation, publications, publications des membres de l’umr, punique, qumrân, rassemble cinq laboratoires, recherches, religions monothéistes, responsabilité d’entreprises documentaires, ressources documentaires, revues, sciences historiques, sciences humaines, sciences religieuses, soutenances, spip 2, spécialistes du monde, séminaires, sémitique, sémitique occidental, template, textes fondateurs, thèses, thèses en cours, umr 8167, umr8167, unité mixte de recherche, vallée de l’euphrate syrien, valorisation de la recherche, vient de paraître, école pratique des hautes études, écoles doctorales, époques, éthiopie, études sémitiques
feed =

New in Planet Maia: Building Tabernae and Archaeology of Portus (MOOC)

I have just added the following resources to the Maia Atlantis feed aggregator:

title = Building Tabernae
url =
creators = Miko Flohr
license = None
description = About two years ago, I received a quarter million euro grant from the Dutch government for a  four year project on urban commercial investment in Roman Italy, and a project blog was already in the proposal. The project – Building Tabernae – started April, 2013, and is now about to enter a new phase, in which some results will start emerging, and new data will be gathered. The blog, I hope, is a way to force the scholar in charge of this project – me – to record and communicate the project’s successes and failures, and everything else that one encounters when investigating commercial investment in Roman Italy, and to be available for discussion with specialists and non-specialists alike.
feed =

title = Archaeology of Portus: Exploring the Lost Harbour of Ancient Rome
url =
creators = University of Southampton and FutureLearn
license = None
description = The University of Southampton and FutureLearn are running a MOOC (Massive Open Online Course), focusing on the archaeological work in progress at the Roman site of Portus. It is one of a number of Southampton-based courses that will be made available for you to study online, for free, wherever you are based in the world, in partnership with FutureLearn.
feed =

Additions and corrections in Planet Atlantides

I've just added the following blog to the Maia and Electra feed aggregators:

title = Standards for Networking Ancient Prosopographies
url =
creators = Gabriel Bodard, et al.
description = The Standards for Networking Ancient Prosopographies: Data and Relations in Greco-Roman Names (hereafter SNAP:DRGN or SNAP) project aims to address the problem of linking together large collections of material (datasets) containing information about persons, names and person-like entities managed in heterogeneous systems and formats.
feed =

I've also updated the entry for MutEc as follows (corrected feed url):

title = Mutualisation d'outils numériques pour les éditions critiques et les corpus (MutEC)
url =
creators = Marjorie Burghart, et al.
description = MutEC is a platform for sharing, accumulating, and disseminating the technologies and methodologies emerging in the field of the digital humanities.
feed =

MITH and tDAR continue to respond to our bot with 403 Forbidden, so their content will not appear in the aggregators.

Planet Atlantides Updates: Antiquitas, Archeomatica, Source, tDAR and MITH

I have added subscriptions for the following resources to the indicated aggregators at Planet Atlantides:

To Electra:

title = Source: Journalism Code, Context & Community
site =
license = CC Attribution 3.0
feed =

To Maia:

title = Antiquitas
site =
creators = Hervé Huntzinger
description = This blog aims to bring together the teaching and research community involved in the « Sciences de l'Antiquité » program at the Université de Lorraine. It provides prospective students with clear information about the courses on offer; it gives master's and doctoral students a space to showcase their work and take their first steps in research; and it offers faculty a platform for keeping researchers, students, and the informed public up to date on current research. The program is attached to the Hiscant-MA research group (EA1132), which specializes in Sciences de l'Antiquité.
feed =

I have also updated the feed URL in both Electra and Maia for the following resource:

title = Archeomatica: Tecnologie per i Beni Culturali
site =
description = All the news about technologies applied to cultural heritage for restoration and conservation
feed =

The following resources are presently responding to requests from the Planet Atlantides Feed Bot for access to their feeds with a 403 Forbidden HTTP status code. Consequently, updates from these resources will not appear in the aggregators unless and until the curators of these resources make a server configuration change that permits us to syndicate their content.

title = Maryland Institute for Technology in the Humanities (MITH)
site =
description = Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland, College Park
feed =

title = The Digital Archaeological Record (tDAR)
site =
feed =

Mining AWOL more carefully for ISSNs

I made a couple of bad assumptions in my previous attempt to mine ISSNs out of the content of the AWOL Blog:

  1. I assumed that the string "ISSN" would always appear in all caps.
  2. I assumed that the string "ISSN" would be followed immediately by a colon (:).

In fact, the following command indicates that there are at least 673 posts containing at least one instance of the string "issn" (ignoring capitalization) in the AWOL content:
ack -hilo issn  post-*.xml | wc -l
In an attempt to make sure we're capturing real ISSN strings, I refined the regular expression to try to capture a leading "ISSN" string, and then everything possibly following, until and including a properly formatted ISSN number. I've seen both ####-#### and ######## (where # is either a digit or the character "X") in the wild, so I accommodated both possibilities. Here's the command:
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml > issn-raw.txt
You can see the raw list of the matched strings here. If we count the lines generated by that command instead of saving them to a file, we can see that there are at least 1931 ISSNs in AWOL.
ack -hio 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
Then I wondered, are we getting just one ISSN per file or multiples? We know that some of the posts in the blog are about single resources, but there are also plenty of posts about collections and also posts that gather up all the references to every known instance of a particular genre (e.g., open-access journals or journals in JSTOR). So I modified the command to count how many files have these "well-formed" ISSN strings in them (the -l option to ack):
ack -hilo 'issn[^\d]*[\dX]{4}-?[\dX]{4}' post-*.xml | wc -l
That gives a total of 638 affected files. Here's a list of the affected files, for future team reference.
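For anyone who would rather do the matching in a script, the ack pattern above ports directly to Python's re module; the sample text below is invented for illustration.

```python
import re

# Case-insensitive Python equivalent of the ack pattern: "issn", then any
# run of non-digits, then eight digits/X with an optional medial hyphen.
ISSN_RE = re.compile(r'issn[^\d]*[\dX]{4}-?[\dX]{4}', re.IGNORECASE)

sample = ('Print ISSN: 1234-5678. ISSN électronique 2049537X. '
          'An ISBN 978-3-11-020684-2 should not match.')
matches = ISSN_RE.findall(sample)
print(matches)
```

Both the hyphenated and unhyphenated forms are caught, while the ISBN is ignored because the literal "issn" prefix is required.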

One wonders about the discrepancy between 638 and 673, but at least I now know I have a regular expression that can capture most of the ISSNs and their values. I'll do some spot-checking later to see if I can figure out what's being missed and why.

More importantly, it's now very clear that mining the ISSNs out of the blog posts on our way to Zotero is a worthwhile task. Not only will we be able to add them to the records, we may also be able to use them to look up existing catalog data from other databases with which to better populate the fields in the corresponding Zotero records.

Mining AWOL for Identifiers

NB: There is now a follow-up post to this one, in which various bad assumptions made here are addressed: "Mining AWOL more carefully for ISSNs".

In collaboration with Pavan Artri, Dawn Gross, Chuck Jones, Ronak Parpani, and David Ratzan, I'm currently working on a project to port the content of Chuck's Ancient World Online (AWOL) blog to a Zotero library. Funded in part by a grant from the Gladys Krieble Delmas Foundation, the idea is to make the information Chuck gathers available for more structured data needs, like citation generation, creation of library catalog records, and participation in linked data graphs. So far, we have code that successfully parses the Atom XML "backup" file we can get from Blogger and uses the Zotero API to create a Zotero record for each blog post and to populate its title (derived from the title of the post), url (the first link we find in the body of the post), and tags (pulled from the Blogger "labels").

We know that some of the post bodies also contain standard numbers (like ISSNs and ISBNs), but it has been unclear how many of them there are and how regular the structure of the text strings in which they appear might be. Would it be worthwhile to try to mine them out programmatically and insert them into the Zotero records as well? If so, what's our best strategy for capturing them ... i.e., what sort of parenthetical remarks, whitespace, and punctuation might intervene between them and the corresponding values? Time to do some data prospecting ...

We'd previously split the monolithic "backup" XML file into individual XML files, one per post (click at your own risk; there are a lot of files in that github listing and your browser performance in rendering the page and its JavaScript may vary). Rather than writing a script to parse all that stuff just to figure out what's going on, I decided to try my new favorite can-opener, ack (previously installed stresslessly on my Mac with another great tool, the Homebrew package manager).
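For context, that per-post split can be sketched with Python's standard-library XML tools. The feed snippet below is a minimal, invented stand-in for the real Blogger export; only the Atom namespace URI is taken from the standard.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A minimal stand-in for the Blogger Atom "backup" file; real entries
# carry many more elements (content, links, labels, and so on).
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Post one</title></entry>
  <entry><title>Post two</title></entry>
</feed>"""

root = ET.fromstring(feed_xml)
titles = []
for i, entry in enumerate(root.findall(ATOM + "entry"), 1):
    # In the real workflow each entry would be serialized to post-<i>.xml,
    # e.g. with ET.ElementTree(entry).write(f"post-{i}.xml").
    titles.append(entry.findtext(ATOM + "title"))
print(titles)
```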

Time for some fun with regular expressions! I worked on this part iteratively, trying to start out as liberally as possible, thereby letting in a lot of irrelevant stuff so as not to miss anything good. I assumed that we want to catch acronyms, so strings of two or more capital letters, preceded by a word boundary. I didn't want to just use a [A-Z] range, since AWOL indexes multilingual resources, so I had recourse to the Unicode Categories feature that's available in most modern regular expression engines, including recent versions of Perl (on which ack relies). So, I started off with:

\b\p{Lu}\p{Lu}+
After some iteration on the results, I ended up with something more complex, trying to capture anything that fell between the acronym itself and the first subsequent colon, which seemed to be the standard delimiter between the designation+explanation of the type of identifier and the identifying value itself. I figure we'll worry about how to parse the value later, once we're sure which identifiers we want to capture. So, here's the regex I ultimately used:

\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]
The full ack command looked like this:
ack -oh "\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]" post-*.xml > ../awol-acronyms/raw.txt
where the -h option tells ack to "suppress the prefixing of filenames on output when multiple files are searched" and the -o option tells it to "show only the part of each line matching" my regex pattern (quotes from the ack man page). You can browse the raw results here.
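A rough Python equivalent of this approach, as a sketch only: the standard-library re module lacks \p{Lu}, so an ASCII [A-Z] class stands in for the Unicode uppercase-letter category, the trailing-delimiter logic is simplified to "up to the first colon", and the sample string is invented.

```python
import re

# ASCII-only simplification of the acronym-to-colon capture: two or more
# capital letters at a word boundary, then everything up to the first colon.
ACRONYM_RE = re.compile(r'\b[A-Z]{2,}[^:]*:')

sample = 'The ISSN electronic edition: 2049-537X and DOI: 10.2972/hesperia'
print(ACRONYM_RE.findall(sample))
```

For genuinely multilingual matching in Python, the third-party regex module supports \p{Lu} directly.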

So, how to get this text file into a more analyzable state? First, I thought I'd pull it into my text editor, Sublime, and use its text manipulation functions to filter for unique lines and then sort them. But then, it occurred to me that I really wanted to know frequency of identifier classes across the whole of the blog content, so I turned to OpenRefine.

I followed OR's standard process for importing a text file (being sure to set the right character encoding for the file on which I was working). Then, I used the column edit functionality and the string manipulation functions in the Open Refine Expression Language (abbreviated GREL because it used to be called "Google Refine Expression Language") to clean up the strings (regularizing whitespace, trimming leading and trailing whitespace, converting everything to uppercase, and getting rid of whitespace immediately preceding colons). That part could all have been done in a step outside OR with other tools, but I didn't think about it until I was already there.
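The same cleanup steps can indeed be done outside OpenRefine; here is one way to sketch them with Python's standard library (the sample value is invented).

```python
import re

def clean(value):
    value = re.sub(r'\s+', ' ', value).strip()  # regularize and trim whitespace
    value = value.upper()                       # fold everything to uppercase
    value = re.sub(r'\s+:', ':', value)         # drop whitespace before colons
    return value

print(clean('  Issn \t électronique  :'))
```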

Then came the part OR is actually good at, faceting the data (i.e., getting all the unique strings and counts of same). I then used the GREL facetCount() function to get those values into the table itself, followed this recipe to get rid of matching rows in the data, and exported a CSV file of the unique terms and their counts (github's default display for CSV makes our initial column very wide, so you may have to click on the "raw" link to see all the columns of data).
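The faceting step itself amounts to counting unique strings, which can be sketched with Python's collections.Counter; the sample values below are invented stand-ins for the cleaned column.

```python
from collections import Counter

# Cleaned identifier strings standing in for the OpenRefine column.
values = ['ISSN:', 'ISSN:', 'ISSN PAPER:', 'ISSN:', 'DOI:']
facets = Counter(values)

# most_common() yields the unique strings with their counts, like a facet.
for value, count in facets.most_common():
    print(value, count)
```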

There are some things that need investigating, but what strikes me is that only ISSN is probably worth capturing programmatically. ISSNs appear 44 times in 12 different variations:

ISSN: 17
ISSN paper: 9
ISSN electrònic: 4
ISSN electronic edition: 2
ISSN electrónico: 2
ISSN électronique: 2
ISSN impreso: 2
ISSN Online: 2
ISSN edición electrónica: 1
ISSN format papier: 1
ISSN Print: 1
ISSN print edition: 1

Compare ISBNs:

ISBN of Second Part: 2
ISBN Compiled by: 1

DOIs make only one appearance, and there are no Library of Congress cataloging numbers.

Now to point my collaborators at this blog post and see if they agree with me...