Ancient Studies Needs Open Bibliographic Data and Associated URIs

Update 1: links throughout, minor formatting changes, proper Creative Commons public domain tools, a parenthetical about import paths from EndNote and the like, and a few typo fixes.

The NEH-funded Linked Ancient World Data Institute, still in progress at ISAW, has got me thinking about a number of things. One of them is bibliography and linked data. Here's a brain dump, intended to spark conversation and collaboration.

What We Need

  • As much bibliographic data as possible, for both primary and secondary sources (print and digital), publicly released to third parties under either a public domain declaration or an unrestrictive open license.
  • Stable HTTP URIs for every work and author included in those datasets.

Why

Bibliographic and citation collection and management are integral to every research and publication project in ancient studies. We could save each other a lot of time, and get more substantive work done in the field, if it were simpler and easier to do. We could more easily and effectively tie together disparate work published on the web (and appearing on the web through retrospective digitization) if we had a common infrastructure and shared point of reference. There's already a lot of digital data in various hands that could support such an effort, but a good chunk of it is not out where anybody with good will and talent can get at it to improve it, build tools around it, etc.

What I Want You (and Me) To Do If You Have Bibliographic Data
  1. Release it to the world through a third party. No matter what format it's in, give a copy to someone else whose function is hosting free data on the web. Dump it into a public repository at github.com or sourceforge.net. Put it into a shared library at Zotero, BibSonomy, Mendeley, or another bibliographic content website (most have easy upload/import paths from EndNote and other citation management applications). Hosting a copy yourself is fine, but giving it to a third party demonstrates your bona fides, gets it out of your nifty but restrictive search engine or database, and increments your bus number.
  2. Release it under a Creative Commons Public Domain Mark or Public Domain Dedication (CC0). Or if you can't do that, find as open a Creative Commons or similar license as you can. Don't try to control it. If there's some aspect of the data that you can't (because of rights encumbrance) or don't want to (why?) give away to make the world a better place, find a quick way to extract, filter, or excerpt that aspect and get the rest out.
  3. Alert the world to your philanthropy. Blog or tweet about it. Post a link to the data on your institutional website. Above all, alert Chuck Jones and Phoebe Acheson so it gets announced via Ancient World Online and/or Ancient World Open Bibliographies.
  4. Do the same if you have other useful data, like identifiers for modern or ancient works or authors.
  5. Get in touch with me and/or anyone else to talk about the next step: setting up stable HTTP URIs corresponding to this stuff.
Who I'm Talking To

First of all, I'm talking to myself, my collaborators, and my team-mates at ISAW. I intend to eat my own dogfood.

Here are other institutions and entities I know about who have potentially useful data.
  • The Open Library: data about books is already out there and available, and there are ways to add more
  • Perseus Project: a huge, FRBR-ized collection of MODS records for Greek and Latin authors, works, and modern editions thereof.
  • Center for Hellenic Studies: identifiers for Greek and Latin authors and works
  • L'Année Philologique and its institutional partners like the American Philological Association: the big collection of analytic secondary bibliography for classics (journal articles)
  • TOCS-IN: a collaboratively collected batch of analytic secondary bibliography for classics
  • Papyri.info and its contributing project partners: TEI bibliographic records for much of the bibliography produced for or cited by Greek and Latin papyrologists (plus other ancient language/script traditions in papyrology)
  • Gnomon Bibliographische Datenbank: masses of bibliographic data for books and articles for classics
  • Any and every university library system that has a dedicated or easily extracted set of associated catalog records. Especially any with unique collections (e.g., Cincinnati) or those with databases of analytical bibliography down to the level of articles in journals and collections.
  • Ditto any and every ancient studies digital project that has bibliographic data in a database.
Comments, Reactions, Suggestions

Welcome, encouraged, and essential. By comment here or otherwise (but not by private email, please!).

First pass at extracting useful data from my dissertation

You'll find context in yesterday's post on the dissertation.

It turns out it wasn't as hard as I anticipated to start getting useful information extracted from my born-digital-for-printing-on-dead-trees dissertation. Here's a not-yet-perfect XML serialization (borrowing tags from the TEI) of "instance" information found in the diss narrative:

https://github.com/paregorios/demarc/blob/master/xml/instances.xml

Each instance is a historical event (or in some cases event series) relating to boundary demarcation or dispute within the empire. Here's a comparison between the original formatting for paper and the XML.

For paper: [image of the print-formatted version in the original post]

XML:
<?xml version="1.0" encoding="UTF-8"?>
<div type="instance" xml:id="INST9">
<idno type="original">INST9</idno>
<head>A Negotiated Boundary between the <placeName
type="ancient">Zamucci</placeName> and the <placeName
type="ancient">Muduciuvi</placeName></head>
<p rend="indent">Burton 2000, no. 78</p>
<p>Date(s): <date>AD 86</date></p>
<p type="treDisputeStatement">This boundary marker was placed in
accordance with the agreement of both parties (<foreign xml:lang="la">ex
conven/tione utrarumque nationum</foreign>), and therefore may be taken as
evidence of a <hi rend="bold">boundary dispute</hi>.</p>
<p rend="indent">This single boundary marker from coastal <placeName
type="modern">Libya</placeName> provides the only evidence for the resolution
of a boundary dispute between these two indigenous peoples. The date of the
demarcation, as calculated from the imperial titulature, places the event in
the same year as the reported ‘destruction’ of the <placeName
type="ancient">Nasamones</placeName> by <placeName type="ancient">Legio III
Augusta</placeName> as a consequence of a tax revolt in which tax collectors
were killed.<note n="286"> Zonaras 11.19. </note> It is not clear whether
the boundary action was related to the conflict, or merely took advantage of
the temporary presence of the legionary legate in what ought to have been
part of the proconsular province. Surviving documentation for proconsuls
during the 80s AD is incomplete, and therefore we cannot say who was
governing <placeName type="ancient">Africa Proconsularis </placeName>at the
time of this demarcation.<note n="287"> Thomasson 1996, 45-48. </note>
Neither party seems to have been related to the <placeName
type="ancient">Nasamones</placeName>; rather, they are thought to be sub-
tribes of the <placeName type="ancient">Macae.</placeName><note
n="288">Mattingly 1994, 27-28, 32, 74, 76.. </note></p>
</div>



One thing that made this a lot easier than it might have been was the way I used styles in Microsoft Word back when I created the original version of the document. Rather than just painting formatting onto my text for headings, paragraphs, strings of characters, and so forth, I created a custom "style" for each type of thing I wanted to paint (e.g., an "instance heading" or a "personal name"). I associated the desired visual formatting with each of these, but the names themselves (since they captured the semantic distinctions I was interested in) provided hooks today for writing this stuff out as sort-of TEI XML.
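
For the curious, the mechanics of exploiting those style names are straightforward. Here's a back-of-the-envelope sketch of the idea in Python, not the actual conversion script I used: it assumes a .docx copy of the dissertation and the third-party python-docx package, and the style names and output elements in the mapping are made up for illustration.

# Sketch: walk a .docx file and emit sort-of-TEI elements keyed on the
# custom Word paragraph styles (style names here are illustrative).
from docx import Document  # third-party package: python-docx
from xml.sax.saxutils import escape

# Hypothetical mapping from custom Word style names to output markup.
STYLE_MAP = {
    'instance heading': 'head',
    'dispute statement': 'p type="treDisputeStatement"',
    'Normal': 'p',
}

doc = Document('dissertation.docx')
for para in doc.paragraphs:
    markup = STYLE_MAP.get(para.style.name)
    if markup is None:
        continue  # a style we haven't decided how to handle yet
    # character-level styles (e.g., a "personal name" style) live on
    # para.runs and would be handled the same way, one level down
    print('<%s>%s</%s>' % (markup, escape(para.text), markup.split()[0]))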

There's more to do, obviously, but this was a satisfying first step.

Five Minutes to Ancient World Linked Data JavaScript

Probably not even that long:

  • Signed in to Blogger
  • Went to the blog overview
  • Selected the "template" menu option
  • Selected the "Edit HTML" button 
  • Selected the "Proceed" button because I am fearless!
  • Scrolled to the bottom of the HTML "head" element and pasted in the following two lines:
<script src='http://isawnyu.github.com/awld-js/lib/requirejs/require.min.js' type='text/javascript'></script>
<script src='http://isawnyu.github.com/awld-js/awld.js?autoinit' type='text/javascript'></script>
  • Save and enjoy

Information about (and code) for the Ancient World Linked Data JavaScript library.

Open-Access Epigraphic Evidence for Boundary Disputes in the Roman Empire

I've been sitting on the dissertation way too long. So here it is, unleashed upon the world under the terms of a Creative Commons Attribution-ShareAlike license.

I have visions of hacking it up into a super-cool, linked data, ever-updated information resource, but there's no reason -- even though it's pretty niche in a lot of ways -- why anyone who might benefit from having, using, or critiquing it in the meantime should have to wait for that to happen.

Comments, questions, and post-release reviews are welcome via comment here, or via email to tom.elliott@nyu.edu, or on your own blog. And feel free to fork the repos and play around if you're mob-epigraphically inclined.

Give Me the Zotero Item Keys!

I fear and hope that this post will cause someone smarter than me to pipe up and say UR DOIN IT WRONG ITZ EZ LYK DIS ...

Here's the use case:

The Integrating Digital Papyrology project (and friends) have a Zotero group library populated with 1,445 bibliographic records that were developed on the basis of an old, built-by-hand Checklist of Editions of Greek and Latin Papyri (etc.). A lot of checking and improving was done to the data in Zotero.

Separately, there's now a much larger pile of bibliographic records related to papyrology that were collected (on different criteria) by the Bibliographie Papyrologique project. They have been machine-converted (into TEI document fragments) from a sui generis FileMaker Pro database and are now hosted via papyri.info (the raw data is on github).

There is considerable overlap between these two datasets, but also significant divergence. We want to merge "matching" records in a carefully supervised way, making sure not to lose any of the extra goodness that BP adds to the data but taking full advantage of the corrections and improvements that were done to the Checklist data.

We started by doing an export-to-RDF of the Zotero data and, as a first step, that was banged up (programmatically) against the TEI data on the basis of titles. Probable matches were hand-checked and a resulting pairing of papyri.info bibliographic ID numbers against Zotero short titles was produced. You can see the resulting XML here.
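
For anyone curious about that first pass, the core idea is simple: normalize the title strings from both datasets and score their similarity, keeping anything above a threshold for hand-checking. Here's an illustrative sketch in Python (not the actual matching code, which is the work of others; the function names and threshold are invented):

# Sketch of first-pass title matching between two bibliographies.
# zotero_titles and pi_titles would be dicts of {identifier: title},
# built by parsing the Zotero RDF export and the papyri.info TEI.
import re
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', '', title.lower())).strip()

def probable_matches(zotero_titles, pi_titles, threshold=0.9):
    """Yield (zotero_id, papyri_info_id, score) for likely pairs."""
    for zid, ztitle in zotero_titles.items():
        for pid, ptitle in pi_titles.items():
            score = SequenceMatcher(None, normalize(ztitle),
                                    normalize(ptitle)).ratio()
            if score >= threshold:
                yield zid, pid, score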

I should point out that almost everything up to here including the creation and improvement of the data, as well as anything below regarding the bibliography in papyri.info, is the work of others. Those others include Gabriel Bodard, Hugh Cayless, James Cowey, Carmen Lantz, Adam Prins, Josh Sosin, and Jen Thum. And the BP team. And probably others I'm forgetting at the moment or who have labored out of my sight. I erect this shambles of a lean-to on the shoulders of giants.

To guide the work of our bibliographic researchers in analyzing the matched records, I wanted to create an HTML file that looks like this:
  • Checklist Short Title = Papyri.info ID number and Full Title String
  • BGU 10 = PI idno 7513: Papyrusurkunden aus ptolemäischer Zeit. (Ägyptische Urkunden aus den Staatlichen Museen zu Berlin. Griechische Urkunden. X. Band.)
  • etc. 
In that list, I wanted items to the left to be linked to the online view of the Zotero record at zotero.org and items on the right linked to the online view of the TEI record at papyri.info. The XML data we got from the initial match process provided the papyri.info bibliographic ID numbers, from which it's easy to construct the corresponding URIs, e.g., http://papyri.info/biblio/7513.

But Zotero presented a problem. URIs for bibliographic records on the Zotero server use alphanumeric "item keys" like this: CJ3WSG3S (as in https://www.zotero.org/groups/papyrology/items/itemKey/CJ3WSG3S/).

That item key string is not, to my knowledge, included in any of the export formats produced by the Zotero desktop client, nor is it surfaced in its interface (argh). It appears possible to hunt them down programmatically via the Zotero Read API, though I haven't tried it for reasons that will be explained shortly. It is certainly possible to hunt for them manually via the web interface, but I'm not going to try that for more than about 3 records.
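
Had I gone the API route, I gather the query would look something like the following with pyzotero (untested on my part; the group ID and API key are placeholders, and the item layout assumes the current Zotero API):

# Untested sketch: pull short titles and item keys for a Zotero group
# library via the Zotero Read API, using the pyzotero wrapper.
from pyzotero import zotero

GROUP_ID = '12345'        # placeholder: the numeric ID of the group library
API_KEY = 'your-api-key'  # placeholder

zot = zotero.Zotero(GROUP_ID, 'group', API_KEY)
for item in zot.everything(zot.top()):
    print(item['data'].get('shortTitle', ''), item['key'])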

How I got the Zotero item keys

So, I have two choices at this point: write some code to automate hunting the item keys via the Zotero Read API or crack open the Zotero SQLite database on my local client and see if the item keys are lurking in there too. Since I'm on a newish laptop on which I hadn't yet installed Xcode, which seems to be a prerequisite to installing support for a Python virtual environment, which is the preferred way to get pip, which is the preferred install prerequisite for pyzotero, which is the Python wrapper for the Zotero API, I had to make some choices about which yaks to shave.

I decided to start the (notoriously slow) Xcode download yak and then have a go at the SQLite yak while that was going on.

I grabbed the trial version of RazorSQL (which looked like a good shortcut after a few minutes of Googling), made a copy of my Zotero database, and started poking around. I thought about looking for detailed documentation (starting here I guess), but direct inspection started yielding results so I just kept going commando-style. It became clear at once that I wasn't going to find a single table containing my bibliographic entries. The Zotero client database is all normalized and modularized and stuff. So I viewed table columns and table contents as necessary and started building a SQL query to get at what I wanted. Here's what ultimately worked:

SELECT itemDataValues.value, items.key FROM items 
INNER JOIN libraries ON items.libraryID = libraries.libraryID
INNER JOIN groups ON libraries.libraryID = groups.libraryID
INNER JOIN itemData ON items.itemID = itemData.itemID
INNER JOIN itemDataValues ON itemData.valueID = itemDataValues.valueID
INNER JOIN fields ON itemData.fieldID = fields.fieldID
WHERE groups.name= "Papyrology" AND fields.fieldID=116

The SELECT statement gets me two values for each match dredged up by the rest of the query: a value stored in the itemDataValues table and a key stored in the items table. The various JOINs are used to get us close to the specific value (i.e., a short title) that we want. 116 in the fieldID field of the fields table corresponds to the short title field you see in your Zotero client. I found that out by inspecting the fields table; I could have used more JOINs to be able to use the string "shortTitle" in my WHERE clause, but that would have just taken more time.

The results of that query against my database looked like this:

P.Cair.Preis.      2245UKTH
CPR 18             26K8TAJT
P.Bodm. 28         282XKDE9
P.Gebelen          29ETKPXC
O.Krok             2BBMS7NS
P.Carlsb. 5        2D2ZNT4C
P.Mich.Aphrod.     2DTD2NIZ
P.Carlsb. 9        2FWF6T6I
P.Col. 1           2G4CF756
P.Lond.Copt. 2     2GAEU5QP
P.Harr. 1          2GCCNGJV
O.Deir el-Bahari   2GH3FEA2
P.Harrauer         2H3T6EU2
(etc.)

So: copy that tabular result out of the RazorSQL GUI, paste it into a new LibreOffice spreadsheet, save it, and I've got an XML file that I can dip into from the XSLT I had already started on to produce my HTML view.
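
(For anyone allergic to the spreadsheet detour, a few lines of Python would do the same conversion; this is illustrative only, and the file names and element names are invented rather than what my XSLT actually expects.)

# Illustrative alternative to the LibreOffice step: turn the two-column
# (short title, item key) dump into a simple XML lookup file.
import csv
from xml.sax.saxutils import escape

with open('zotero-keys.tsv', newline='') as tsv, \
     open('zotero-keys.xml', 'w') as out:
    out.write('<keys>\n')
    for short_title, item_key in csv.reader(tsv, delimiter='\t'):
        out.write('  <item shortTitle="%s" key="%s"/>\n'
                  % (escape(short_title, {'"': '&quot;'}), item_key))
    out.write('</keys>\n')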

Here's the resulting HTML file.

On we go.

Oh, and for those paying attention to such things, Xcode finished downloading about two-thirds of the way through this process ...

Playing with PELAGIOS: Open Context and Labels

Latest in the Playing with PELAGIOS series.

I've just modified the tooling and re-run the Pleiades-oriented-view-of-the-GAWD report to include the RDF triples just published by Open Context and to exploit, when available, rdfs:label on the annotation target in order to produce more human-readable links in the HTML output. This required the addition of an OPTIONAL clause to the SPARQL query, as well as modifications to the results-processing XSLT. The new versions are indicated/linked on the report page.
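
To give a flavor of the change without pasting the whole query here, the pattern is roughly as follows. This is an illustrative Python sketch using the SPARQLWrapper package against a local 4store endpoint, not the actual tooling (which does its post-processing in XSLT); the endpoint URL, namespaces, and variable names are stand-ins.

# Illustrative sketch: fetch annotation targets and, when available,
# their rdfs:label values (hence the OPTIONAL clause).
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX oac: <http://www.openannotation.org/ns/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?target ?body ?label WHERE {
  ?annotation oac:hasTarget ?target ;
              oac:hasBody ?body .
  OPTIONAL { ?target rdfs:label ?label }
}
"""

sparql = SPARQLWrapper('http://localhost:8000/sparql/')  # local 4store endpoint; port is a stand-in
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()['results']['bindings']:
    # use the label when present, fall back to the raw target URI otherwise
    label = row.get('label', {}).get('value', row['target']['value'])
    print(label, row['body']['value'])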

You can see the results of these changes, for example, in the Antiochia/Theoupolis page.

Playing with PELAGIOS: The GAWD is Live

This is the latest in an ongoing series chronicling my dalliances with data published by the PELAGIOS project partners.

I think it's safe to say, thanks to the PELAGIOS partner institutions, that we do have a Graph of Ancient World Data (GAWD) on the web. It's still in early stages, and one has to do some downloading, unzipping, and so forth to engage with it at the moment, but indeed the long-awaited day has dawned.

Here's the perspective, as of last Friday, from the vantage point of Pleiades. I've used SPARQL to query the GAWD for all information resources that the partners claim (via their RDF data dumps) are related to Pleiades information resources. I.e., I'm pulling out a list of information resources about texts, pictures, objects, grouped by their relationships to what Pleiades knows about ancient places (findspot, original location, etc.). I've sorted that view of the graph by the titles Pleiades gives to its place-related information resources and generated an HTML view of the result. It's here for your browsing pleasure.

Next Steps and Desiderata

For various technical reasons, I'm not yet touching the data of a couple of PELAGIOS partners (CLAROS and SPQR), but those issues will hopefully be resolved soon. I still need to dig into figuring out what Open Context is doing on this front. Other key resources -- especially those emanating from ISAW -- are not yet ready to produce RDF (but we're working on it).

There are a few things I'd like the PELAGIOS partners to consider/discuss adding to their data:

  • Titles/labels for the information resources (using rdfs:label?). This would make it possible for me to produce more intuitive/helpful labels for users of my HTML index. Descriptions would be cool too. As would some indication of the type of thing(s) a given resource addresses (e.g., place, statue, inscription, text)
  • Categorization of the relationships between their information resources and Pleiades information resources. Perhaps some variation of the terms originally explored by Concordia (whence the GAWD moniker), as someone on the PELAGIOS list has already suggested.
What would you like to see added to the GAWD? What would you do with it?

Playing with PELAGIOS: Dealing with a bazillion RDF files

Latest in the Playing with PELAGIOS series.

Some of the PELAGIOS partners distribute their annotation RDF in a relatively small number of files. Others (like SPQR and ANS) have a very large number of files. This makes the technique I used earlier for adding triples to the database ungainly. Fortunately, 4store provides some command line methods for loading triples.

First, stop the 4store http server (why?):
$ killall 4s-httpd
Try to import all the RDF files. Rats!
$ 4s-import -a pelagios *.rdf
-bash: /Applications/4store.app/Contents/MacOS/bin/4s-import: Argument list too long
Bash to the rescue (but note that doing one file at a time has a cost on the 4store side):
$ for f in *.rdf; do 4s-import -av pelagios $f; done
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.00000.rdf>
Pass 1, processed 10 triples (10)
Pass 2, processed 10 triples, 8912 triples/s
Updating index
Index update took 0.000890 seconds
Imported 10 triples, average 4266 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.101.rdf>
Pass 1, processed 11 triples (11)
Pass 2, processed 11 triples, 9856 triples/s
Updating index
Index update took 0.000936 seconds
Imported 11 triples, average 4493 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.10176.rdf>
Pass 1, processed 8 triples (8)
Pass 2, processed 8 triples, 6600 triples/s
Updating index
Index update took 0.000892 seconds
Imported 8 triples, average 3256 triples/s
...
This took a while. There are 86,200 files in the ANS annotation batch.

Note the use of the -a option on 4s-import to ensure the triples are added to the current contents of the database, rather than replacing them! Note also the -v option, which is what gives you the report (otherwise, it's silent and that makes my ctrl-c finger twitchy).

Now, back to the SPARQL mines.

Playing with PELAGIOS: Arachne was easy after nomisma

Querying Pleiades annotations out of Arachne RDF was as simple as loading the Arachne Objects by Places RDF file into 4store the same way I did nomisma and running the same SPARQL query. Cost: 5 minutes. Now I know about 29 objects in the Arachne database that they think are related to Akragas/Agrigentum. For example: