Playing with PELAGIOS: Dealing with a bazillion RDF files

Latest in a Playing with PELAGIOS series

Some of the PELAGIOS partners distribute their annotation RDF in a relatively small number of files. Others (like SPQR and ANS) have a very large number of files. This makes the technique I used earlier for adding triples to the database ungainly. Fortunately, 4store provides some command line methods for loading triples.

First, stop the 4store http server (why?):
$ killall 4s-httpd
Try to import all the RDF files.  Rats!
$ 4s-import -a pelagios *.rdf
-bash: /Applications/4store.app/Contents/MacOS/bin/4s-import: Argument list too long
Bash to the rescue (but note that doing one file at a time has a cost on the 4store side):
$ for f in *.rdf; do 4s-import -av pelagios "$f"; done
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.00000.rdf>
Pass 1, processed 10 triples (10)
Pass 2, processed 10 triples, 8912 triples/s
Updating index
Index update took 0.000890 seconds
Imported 10 triples, average 4266 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.101.rdf>
Pass 1, processed 11 triples (11)
Pass 2, processed 11 triples, 9856 triples/s
Updating index
Index update took 0.000936 seconds
Imported 11 triples, average 4493 triples/s
Reading <file:///Users/paregorios/Documents/files/P/pelagios-data/coins/0000.999.10176.rdf>
Pass 1, processed 8 triples (8)
Pass 2, processed 8 triples, 6600 triples/s
Updating index
Index update took 0.000892 seconds
Imported 8 triples, average 3256 triples/s
...
This took a while. There are 86,200 files in the ANS annotation batch.

Note the use of the -a option on 4s-import to ensure the triples are added to the current contents of the database, rather than replacing them! Note also the -v option, which is what gives you the report (otherwise, it's silent and that makes my ctrl-c finger twitchy).
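
If the one-file-at-a-time cost bothers you, a middle path is to let find and xargs hand 4s-import as many files per call as the argument limit allows. What follows is a minimal sketch, not something I ran against the full ANS batch. It assumes 4s-import accepts several files in one call (the failed glob attempt above suggests it does), and the IMPORTER variable is a hypothetical hook of my own for dry runs, not a 4store feature.

```shell
# Sketch: import many RDF files in batches rather than one file per call.
# IMPORTER is a hypothetical override (useful for dry runs with "echo");
# it defaults to the real 4s-import command.
import_rdf_batched() {
  local dir="$1" kb="$2"
  # find emits NUL-terminated names; xargs packs as many as fit onto each
  # command line, so "Argument list too long" is avoided.
  # -a matters here: every batch must append to the database, not replace it.
  find "$dir" -maxdepth 1 -type f -name '*.rdf' -print0 |
    xargs -0 "${IMPORTER:-4s-import}" -av "$kb"
}

# Example (not run here): import_rdf_batched . pelagios
```

Running it first as IMPORTER=echo import_rdf_batched . pelagios prints the command lines that would be executed without touching the database.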

Now, back to the SPARQL mines.

Playing with PELAGIOS: Arachne was easy after nomisma

Querying Pleiades annotations out of Arachne RDF was as simple as loading the Arachne Objects by Places RDF file into 4store the same way I did nomisma and running the same SPARQL query.  Cost: 5 minutes. Now I know about 29 objects in the Arachne database that they think are related to Akragas/Agrigentum. For example:

Playing with PELAGIOS: Nomisma

So, I want to see how hard it is to query the RDF that PELAGIOS partners are putting together. The first experiment is documented below.

Step 1: Set up a Triplestore (something to load the RDF into and support queries)

Context: I'm a triplestore n00b. 

I found Jeni Tennison's Getting Started with RDF and SPARQL Using 4store and RDF.rb and, though I had no interest in messing around with Ruby as part of this exercise, the recommendation of 4store as a triplestore sounded good, so I went hunting for a Mac binary and downloaded it.

Step 2: Grab RDF describing content in Nomisma.org

Context: I'm a point-and-click expert.

I downloaded the PELAGIOS-conformant RDF data published by Nomisma.org at http://nomisma.org/nomisma.org.pelagios.rdf.

Background: "Nomisma.org is a collaborative effort to provide stable digital representations of numismatic concepts and entities, for example the generic idea of a coin hoard or an actual hoard as documented in the print publication An Inventory of Greek Coin Hoards (IGCH)."

Step 3: Fire up 4store and load in the nomisma.org data

Context: I'm a 4store n00b, but I can cut and paste, read and reason, and experiment.

Double-clicked the 4store icon in my Applications folder. It opened a terminal window.

To create and start up an empty database for my triples, I followed the 4store instructions and Tennison's post (mutatis mutandis) and so typed the following in the terminal window ("pelagios" is the name I gave to my database; you could call yours "ray" or "jay" if you like):

$ 4s-backend-setup pelagios
$ 4s-backend pelagios
Then I started up 4store's SPARQL http server and aimed it at the still-empty "pelagios" database so I could load my data and try my hand at some queries:
$ 4s-httpd pelagios
Loading the nomisma data was then as simple as moving to the directory where I'd saved the RDF file and typing:
$ curl -T nomisma.org.pelagios.rdf 'http://localhost:8080/data/http://nomisma.org/nomisma.org.pelagios.rdf/'
Note how the URI base for nomisma items is appended to the URL string passed via curl. This is how you specify the "model URI" for the graph of triples that gets created from the RDF.
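
To make the URL construction explicit, here is a small sketch that wraps the same curl call. The ENDPOINT variable and the helper names data_url and load_rdf are assumptions of mine for illustration. The only thing taken from the steps above is the pattern itself: the graph URI appended after /data/ on the local httpd instance.

```shell
# Sketch: PUT one RDF file into a named graph over 4store's HTTP interface.
# ENDPOINT, data_url and load_rdf are hypothetical names for illustration.
ENDPOINT="${ENDPOINT:-http://localhost:8080}"

# The graph ("model") URI is appended verbatim after /data/.
data_url() {
  printf '%s/data/%s\n' "$ENDPOINT" "$1"
}

load_rdf() {
  # $1 = local RDF file, $2 = graph URI to store its triples under
  curl -T "$1" "$(data_url "$2")"
}

# Example (not run here):
# load_rdf nomisma.org.pelagios.rdf http://nomisma.org/nomisma.org.pelagios.rdf/
```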

Step 4: Try to construct a query and dig out some data.

Context: I'm a SPARQL n00b, but I'd done some SQL back in the day and XML and namespaces are pretty much burned into my soul at this point. 

Following Tennison's example, I pointed my browser at http://localhost:8080/test/. I got 4store's SPARQL test query interface. I googled around looking grumpily at different SPARQL "how-tos" and "getting starteds" and trying stuff and pondering repeated failure until this worked:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?x
WHERE {
?x oac:hasBody <http://pleiades.stoa.org/places/462086> .
}

That's "find the ID of every OAC Annotation in the triplestore that's linked to Pleiades Place 462086" (i.e., Akragas/Agrigentum, modern Agrigento in Sicily). It's a list like this:
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch1910-agrigentum-5
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch2089-agrigentum-24
  • http://nomisma.org/nomisma.org.pelagios.rdf#igch2101-agrigentum-32
  • ...
51 IDs in all.

But what I really want is a list of the IDs of the nomisma entities themselves so I can go look up the details and learn things. Back to the SPARQL mines until I produced this:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?nomismaid
WHERE {
?x oac:hasBody <http://pleiades.stoa.org/places/462086> .
?x oac:hasTarget ?nomismaid .
}

Now I have a list of 51 nomisma IDs: one for the mint and 50 coin hoards that illustrate the economic network in which the ancient city participated (e.g., http://nomisma.org/id/igch2081).
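
Since only the Pleiades ID changes from one place to the next, the query is easy to template. The sketch below generates it for any numeric place ID. The function name targets_query and the variable ?target are my own choices, and only the oac prefix is declared because it is the only one the query uses.

```shell
# Sketch: emit the "targets annotated with this Pleiades place" query.
# targets_query is a hypothetical helper name.
targets_query() {
  place_id="$1"
  # Accept digits only, so nothing unexpected ends up inside the query.
  case "$place_id" in
    ''|*[!0-9]*) echo "not a numeric Pleiades ID: $place_id" >&2; return 1 ;;
  esac
  cat <<EOF
PREFIX oac: <http://www.openannotation.org/ns/>

SELECT ?target
WHERE {
  ?x oac:hasBody <http://pleiades.stoa.org/places/${place_id}> .
  ?x oac:hasTarget ?target .
}
EOF
}

# Example: targets_query 462086
```

The output can be pasted into the test query interface. Assuming your httpd instance exposes its SPARQL endpoint at /sparql/ (check your install), something like curl --data-urlencode "query=$(targets_query 462086)" http://localhost:8080/sparql/ should run it from the command line instead.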

Cost: about 2 hours of time, 1 cup of coffee, and three favors from Sebastian Heath on IRC.

Up next: Arachne, the object database of the Deutsches Archäologisches Institut.



Changes to Electra and Maia Atlantis

I've just added the feed for the news blog on the following site to both Maia and Electra:

The following sites have been removed from Maia for the reasons indicated (they were not in Electra):
  • Antiquated Vagaries: feed returns 401 (i.e., it's been taken private)
Records for some other blogs were updated to reflect the fact that their feeds had moved to new URLs (with proper forwarding instructions). I pass over these details in silence here.

Pleiades, Flickr, and the Ancient World Image Bank

Many of you are already aware that Pleiades, Flickr, and the Ancient World Image Bank have joined forces to link together online, open-access imagery and ancient geographical information. This blog post is intended to answer some lingering questions that users and potential contributors have been asking about the process.

How to Construct a Pleiades Machine Tag

In Flickr, you add the machine tag the same way you add regular tags when editing an individual image or a set or group in the organizer. The machine tag should use the following syntax:
pleiades:TERM=#####
where "TERM" is one of the recognized terms (originally from the Concordia Thesaurus, aka the Graph of Ancient World Data, or GAWD, terms) listed below and "#####" is the numeric identifier of the Pleiades place you wish to associate with the photo.

You can get the identifier by visiting pleiades.stoa.org then searching for and finding the place. Copy the numeric portion of the URL of the place page and paste it into your tag.

So, for example, if I wanted to tag a photo that "depicts" ancient Athens (or a portion thereof), I'd visit Pleiades and search for Athens. I'd find this place page: http://pleiades.stoa.org/places/579885/. So, I'd grab 579885 and construct the following machine tag for use in Flickr:
pleiades:depicts=579885
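If you'd rather not copy the number by hand, the same steps can be sketched in a few lines of shell. The helper name pleiades_tag is hypothetical (it is not part of Flickr or Pleiades), and the sketch assumes the place URL ends with the numeric identifier, optionally followed by a slash, as in the Athens example.

```shell
# Sketch: derive a machine tag from a Pleiades place URL (or a bare ID).
# pleiades_tag is a hypothetical helper name.
pleiades_tag() {
  term="$1"
  url="${2%/}"        # drop one trailing slash, if present
  id="${url##*/}"     # keep whatever follows the last remaining slash
  case "$id" in
    ''|*[!0-9]*) echo "no numeric ID found in: $2" >&2; return 1 ;;
  esac
  printf 'pleiades:%s=%s\n' "$term" "$id"
}

pleiades_tag depicts http://pleiades.stoa.org/places/579885/
# prints: pleiades:depicts=579885
```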

Recognized Terms in Pleiades Machine Tags

TERMS for use in Pleiades machine tags should begin with a lowercase letter. The following TERMS are recognized in Pleiades machine tags:
depicts
the photo so tagged can be said to "depict" the referenced ancient place or a significant or exemplary portion thereof
(this term is equivalent to CIDOC CRM p62 "depicts")
findspot
the photo so tagged shows an object that was first found in modern times at the referenced ancient place
(this term is particularly useful for items now in museums or elsewhere, especially those no longer at the initial place of finding)
origin
the photo so tagged shows an object that is believed, with reasonably high certainty, to have been originally located or produced at the referenced ancient place
(this can differ from the findspot, as when an inscription or other object was moved in antiquity)
observedAt
the photo so tagged shows an object that was observed in modern times at the referenced ancient place
(the implication being that the place observed is neither the modern findspot nor the presumed original location; I suspect this term will rarely need to be used to link photos with Pleiades place resources)
where
the photo so tagged is related in some way to the referenced ancient place, but for some unspecified reason no more specific relationship can be asserted
(this term should not be used unless none of the above terms are deemed to be appropriate)
place
this term is DEPRECATED; it was originally used in exploring the Pleiades machine tag idea (and is highlighted in my previous blog post). Its semantics are assumed to be equivalent to "where". If at all possible, photos carrying this tag should be updated to use one of the more specific terms above.

Please note that pleiades:places=##### (i.e., with a plural) is not a recognized machine tag. Its behavior in Flickr or Pleiades is undefined. So is the behavior of any Pleiades machine tag with a misspelling or a term not included in the list above. ("finspot" seems to be a popular typo at present).
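
To catch typos like these before they reach Flickr, a tag can be checked against the syntax and the term list above. This is only a sketch: is_valid_pleiades_tag is a hypothetical helper, and it assumes the identifier is purely numeric. It says nothing about whether the ID actually exists in Pleiades.

```shell
# Sketch: return 0 if a tag matches pleiades:TERM=##### with a recognized
# TERM, and 1 otherwise. is_valid_pleiades_tag is a hypothetical helper.
is_valid_pleiades_tag() {
  tag="$1"
  case "$tag" in
    pleiades:*=*) ;;
    *) return 1 ;;
  esac
  rest="${tag#pleiades:}"
  term="${rest%%=*}"   # text before the first "="
  id="${rest#*=}"      # text after the first "="
  case "$term" in
    # "place" is deprecated but still on the recognized list
    depicts|findspot|origin|observedAt|where|place) ;;
    *) return 1 ;;
  esac
  case "$id" in
    ''|*[!0-9]*) return 1 ;;
  esac
  return 0
}

# Example:
# is_valid_pleiades_tag pleiades:finspot=579885 || echo "check your spelling"
```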

Any photo tagged with the proper Pleiades machine tag syntax and one of the terms above will be noticed by Pleiades and picked up in the summary counts and links on individual place resource pages. In order for a photo to be considered for the Pleiades Places group on Flickr (and therefore as a "portrait image" for a Pleiades place), the photo must be tagged with a pleiades:depicts tag.

How do I Add Pleiades Machine Tags Quickly?

It would be unreasonable of us to ask a prolific photographer with an extensive, well-tagged collection already on Flickr to go through and individually add appropriate machine tags by hand. Fortunately, Flickr provides a mechanism for easy batch editing of tags.
  1. Suppose that you have tagged a large number of your photos with the name of the ancient site (e.g., Halicarnassus)
  2. Visit the following link: http://www.flickr.com/photos/me/alltags/; it will give you an alphabetical list of all your tags.
  3. Find the name of the site and click on the corresponding link. You'll see all the photos thus tagged in your photostream.
  4. Find and click on the "Change this tag" link (Really. Skip the "Edit these in a batch" link for now).
  5. Insert the cursor after your existing tag string, type a space, then type or paste in the desired Pleiades machine tag (read the fine print on that page for an extended explanation of what's going on).
  6. Click the "save" button. Flickr will go off and add the new tag to all those images at once.
  7. If you'd rather be more selective about which photos you want to add a tag to, you can choose the "Edit these in a batch" link I told you to skip above, then paste the Pleiades machine tag into the tag lists associated with only those images you wish to update.

Why Can't We Just Use Geotagging That's Already in the Photos?

Some photographers have geotagged their photos, either using a GPS-enabled digital camera or some method of post-processing. I was recently asked on Twitter why we're putting people to all the trouble above when we could just use the geotagging. There are several reasons:
  • Not everybody's photos are geotagged.
  • Many ancient sites are coincident with urban areas (or areas of natural beauty or places where someone took a picture of their dog) and so mere proximity to an ancient site can't be interpreted as indicating a given photo is relevant to a nearby Pleiades place.
  • Horizontal precision and accuracy can vary widely in geotagged photos as a function of the geotagging method used and the interests and skill of the person doing the geotagging. As a result, a photo might be geotagged at a location closer to another, unrelated Pleiades place.
  • The horizontal precision and accuracy of Pleiades coordinates also vary widely, given the varying sources from which they derive and the subsequent coordinate extraction methods. This makes the process of proximity correlation even more fraught.
This is not to say that we're not interested in exploring uses for geotagged photos in Flickr (or supporting a geotag-your-photo-using-Pleiades-coordinates tool), but I hope this discussion helps explain why we like the machine-tag approach for indicating relevance.

Please let me know, via comments here, if you have additional questions or suggestions.