Batch XML validation at the command line

Updated 8 August 2017 to reflect changes in the installation pattern for jing.

This validates against a RelaxNG schema; I had help figuring it out from Hugh and Ryan at DC3:

$ find {searchpath} -name "*.xml" -print | parallel --tag jing {relaxngpath}
The find command hunts down all files ending with ".xml" in the directory tree under searchpath. The parallel command takes that list of files and fires off (in parallel) a jing validation run for each of them. The --tag option, which belongs to parallel rather than jing, prefixes each line of output with the name of the file being validated, so every error message arrives labeled with its source file. In my experience this is much faster than running each jing call in sequence, e.g., with the -exec primary in find.
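For comparison, the sequential approach alluded to above would look something like this, using find's -exec primary with the same placeholder paths:

$ find {searchpath} -name "*.xml" -exec jing {relaxngpath} {} \;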

As I'm running on a Mac, I had to install GNU Parallel and the Jing RelaxNG Validator. That's what Homebrew is for:
$ brew install jing-trang   # the formula formerly named plain "jing"
$ brew install parallel
NB: you may have to install an older version of Java before you can get the jing-trang formula to work in Homebrew (e.g., brew install java6).
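Once everything is installed, a quick (optional) sanity check is to confirm both tools landed on your PATH; exact output will vary by machine:

$ which jing parallel
$ parallel --version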

What's the context, you ask? I have lots of reasons to want to be able to do this. The proximal cause was batch-validating all the EpiDoc XML files for the inscriptions that are included in the Corpus of Campā Inscriptions before regenerating the site for an update today. I wanted to see quickly if there were any encoding errors in the XML that might blow up the XSL transforms we use to generate the site. So, what I actually ran was:
$ curl -O http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng
$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng
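For a large corpus it can be handy to capture the whole report, both output streams, in a file for later triage (the output filename here is just an example):

$ find ./texts/xml -name '*.xml' -print | parallel --tag jing tei-epidoc.rng > validation-report.txt 2>&1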
Thanks to everybody who built all these tools!


Planet Atlantides grows up and gets its own user-agent string

So, sobered by recent spelunking and bad-bot-chasing in various server logs, and convicted by sage advice in the UniversalFeedParser documentation that everyone ought to follow, I have customized the bot Planet Atlantides uses to fetch web feeds so that it identifies itself unambiguously to the web servers from which it requests those feeds.

Here's the explanatory text I just posted to the Planet Atlantides home page. Please let me know if you have suggestions or critiques.

Feed reading, bots, and user agents

As implied above, Planet Atlantides uses Sam Ruby's "Venus" branch of the Planet "river of news" feed reader. That code is written in the Python language and uses an earlier version of the Universal Feed Parser library for fetching web feeds (RSS and Atom formats). Out of the box, its HTTP requests use the feed parser's default user agent string, so your server logs will only have recorded "UniversalFeedParser/4.2-pre-274-svn +http://feedparser.org/" when our copy of the software pulled your feed in the past.
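If you're curious what such a request looks like from the client side, you can mimic one with curl and watch the header go out (the feed URL below is just a placeholder):

$ curl -sv -A "UniversalFeedParser/4.2-pre-274-svn +http://feedparser.org/" -o /dev/null http://example.org/feed.atom 2>&1 | grep 'User-Agent'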

Effective 27 February 2014, the Planet Atlantides production version of the code now identifies itself with the following user agent string: "PlanetAtlantidesFeedBot/0.2 +http://planet.atlantides.org/". Production code runs on a machine with the IP address 66.35.62.81, and never runs more than once per hour. Apart from a one-time set of test episodes on 27 February 2014 itself, log entries recording our user agent string and a different IP address represent spoofing by a potential bad actor other than me and my automagical bot. You should nuke them from orbit; it's the only way to be sure. Note that from time to time I may run test code from other IP addresses, but in future I will use a user agent string beginning with "PlanetAtlantidesTestBot" for such runs. You can expect them to be infrequent and irregular.
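If you administer a server and want to check for that sort of spoofing, a one-liner along these lines will surface log entries claiming our user agent string but originating from some other address (the log path and the Apache-style format, with the client IP as the first field, are assumptions; adjust for your own setup):

$ grep 'PlanetAtlantidesFeedBot' /var/log/apache2/access.log | grep -v '^66\.35\.62\.81 '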

Please email me if you have any questions about Planet Atlantides, its bot, or these user agent strings. In particular, if you put something like "PlanetAtlantidesFeedBot is messing up my site" in your subject line, I'll look at it and respond as quickly as I can.

Pruned from Maia: Dead and Damaged Feeds

The following resources have been pruned from the Maia Atlantis feed aggregator because their feeds (and in some cases the whole resource) have disappeared with no alternative address or are consistently returning errors:

  • GIS for Archaeology and CRM (formerly at http://www.gisarch.com; domain now up for sale)
  • ABZU Recent Additions (feed returns 404)
  • epea pteroenta (feed and site perpetually return 500)
  • Internet Archaeology (feed content is invalid; site sports a notice saying a server upgrade is impending)
  • Portable Antiquities Scheme Blog (feed returns 404)
  • Art and Social Identities in Late Antiquity (University of Aarhus) (site and feed are gone)
  • ArcLand News (feed returns 404; site sports a notice saying a server upgrade is impending)
  • Jonathan Eaton (Imperium Sine Fine) (feed returns 404; blogger site says "feed has been removed")
Please contact me if you have updated feed URLs for any of these resources.
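(In case it's useful: the quickest way I know to check a feed's status from the command line is a HEAD request with curl; the URL below is just a placeholder.)

$ curl -sI http://example.org/feed.xml | head -n 1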

I have also updated a number of feeds that had moved, including some that did not provide redirects and had to be sought out manually.