What Zotero Wants From My Blog

So, I'm on a quest to get my blog to play well with Zotero. Today I come to grips with the translator code that Zotero uses to create a record from a blog entry.

In the first report from the field, I did some comparative metadata investigations to see why some (WordPress) blogs worked well with the Zotero browser plugin, but mine (built with the Nikola static site generator) didn't. There was nothing super-obvious in the HTML metadata, but I was suspicious that the mere fact of being WordPress or not might have something to do with it.

The Zotero developer community puts its translator code on GitHub in a public repository, so we can look at it directly. There are multiple translators, some keyed to particular domains. The one that's getting used for both my blog and the ones I'm comparing calls itself (in the Zotero interface) "Embedded Metadata." A bit of filename skimming leads us to a likely JavaScript source file named "Embedded Metadata.js". Searching in page for the string "blog," we discover quickly that our guess about special handling for WordPress is probably right:

      } else if(tag.toLowerCase() == 'generator') {
        var lcValue = value.toLowerCase();
        if(lcValue.indexOf('blogger') != -1
          || lcValue.indexOf('wordpress') != -1
          || lcValue.indexOf('wooframework') != -1
        ) {
          generatorType = 'blogPost';
        }
      }

I'm not fluent in JavaScript, but after searching for where else generatorType gets used (it appears only in this file), it's pretty clear that, in the absence of any stronger indicia, Zotero will interpret an HTML page that claims to have been generated by WordPress as a blog post:

      rdf.defaultUnknownType = hwType || hwTypeGuess || generatorType || 
        (nodes.length ? "webpage":false);
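For readers as shaky on JavaScript as I am, here is a minimal sketch of how that chain of || operators behaves: it yields the first "truthy" value, reading left to right. The function name pickDefaultType and the nodeCount parameter are my own inventions for illustration, not part of the Zotero source, and I'm making no claims about what hwType and hwTypeGuess actually hold.

```javascript
// Hypothetical helper (not in the Zotero source) that mirrors the
// fallback chain quoted above. The first truthy operand wins.
function pickDefaultType(hwType, hwTypeGuess, generatorType, nodeCount) {
  return hwType || hwTypeGuess || generatorType ||
    (nodeCount ? "webpage" : false);
}

// Nothing but a WordPress-style generator hint: the generator type is used.
console.log(pickDefaultType(undefined, undefined, "blogPost", 3)); // blogPost

// An earlier value in the chain takes priority over the generator hint.
console.log(pickDefaultType("journalArticle", undefined, "blogPost", 3)); // journalArticle

// No hints at all, but some metadata nodes exist: plain "webpage".
console.log(pickDefaultType(undefined, undefined, undefined, 3)); // webpage
```

So the generator hint is only a fallback: it matters when nothing earlier in the chain has supplied a type.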

Indeed, in the WordPress blog we were examining previously (ruthtillman.com), posts include a <meta> element in the HTML header that identifies the creating application:

<meta name="generator" content="WordPress 4.9.5" />

The Nikola theme I'm using also writes a generator tag for my blog posts, but since Nikola can just as well be used to build non-blog sites, it wouldn't be appropriate for the Zotero translator to force the item type to blogPost on the strength of that tag.
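To make the difference concrete, here is a rough re-creation of the generator check as a standalone function. The name generatorTypeFor is hypothetical, and the Nikola generator string below is an assumption on my part about what the theme emits; the three engine names are the ones in the translator excerpt.

```javascript
// Hypothetical standalone version of the generator check quoted earlier.
// Returns "blogPost" for the blog engines the translator recognizes,
// and undefined for everything else.
function generatorTypeFor(content) {
  const lcValue = content.toLowerCase();
  const blogEngines = ["blogger", "wordpress", "wooframework"];
  return blogEngines.some((name) => lcValue.includes(name))
    ? "blogPost"
    : undefined;
}

console.log(generatorTypeFor("WordPress 4.9.5")); // blogPost

// Assumed Nikola generator string; it matches none of the engine names.
console.log(generatorTypeFor("Nikola (getnikola.com)")); // undefined
```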

Now what?

Well, given that the variable name rdf.defaultUnknownType contains the word "default," presumably there are other ways in which the itemType can be determined. In fact, just a few lines down, we find the promising invocation of a JavaScript function named rdf.detectType:

  _itemType = nodes.length ? rdf.detectType({},nodes[0],{}) : rdf.defaultUnknownType;

More searching leads us to another source file in the same repository, RDF.js, wherein the function detectType is defined. At the very bottom of this function, we see that the value of an internal variable named itemType is returned by the function. We need to find out how it gets set to "blogPost". Reading backward from this point, we quickly come across a big assignment:

  var itemType = t.zotero || t.bib || t.prism || t.eprints || t.og || t.dc ||
    exports.defaultUnknownType || t.zoteroGuess || t.bibGuess ||
    t.prismGuess || t.ogGuess || t.dcGuess;

There must be code further above that line that sets one or more of the variables on the right-hand side of the assignment to "blogPost". I don't want to read this whole function, so let's search for the string "blogPost". There it is:

  //PRISM:genre
  type = getFirstResults(node, [n.prism+"genre", n.prism2_0+"genre",
    n.prism2_1+"genre"]);
  switch(type) {
    case 'abstract':
    case 'acknowledgements':
    case 'authorbio':
    case 'bibliography':
    case 'index':
    case 'tableofcontents':
      t.prism = 'bookSection';
    break;
    case 'autobiography':
    case 'biography':
      t.prism = 'book';
    break;
    case 'blogentry':
      t.prism = 'blogPost';
    break;
    case 'homepage':
    case 'webliography':
      t.prism = 'webpage';
    break;
    case 'interview':
      t.prism = 'interview';
    break;
    case 'letters':
      t.prism = 'letter';
    break;
    case 'adaptation':
    case 'analysis':
      t.prismGuess = 'journalArticle';
    break;
    case 'column':
    case 'newsbulletin':
    case 'opinion':
      //magazine or newspaper
      t.prismGuess = 'newspaperArticle';
    break;
    case 'coverstory':
    case 'essay':
    case 'feature':
    case 'insidecover':
      //journal or magazine
      t.prismGuess = 'magazineArticle';
    break;
  }
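Putting the two excerpts together, here is a small sketch of why a definite PRISM genre should win out while a mere guess would not. Both function names (classifyPrismGenre and resolveItemType) are made up for this illustration, the switch is trimmed to a few of the cases above, and the real detectType may set other t.* values from other metadata that I'm ignoring here.

```javascript
// Hypothetical, trimmed-down version of the PRISM genre switch.
function classifyPrismGenre(genre) {
  const t = {};
  switch (genre) {
    case "blogentry":
      t.prism = "blogPost"; // a definite type
      break;
    case "homepage":
    case "webliography":
      t.prism = "webpage";
      break;
    case "essay":
    case "feature":
      t.prismGuess = "magazineArticle"; // only a guess
      break;
  }
  return t;
}

// Hypothetical wrapper around the big assignment: definite types first,
// then the default, then the guesses.
function resolveItemType(t, defaultUnknownType) {
  return t.zotero || t.bib || t.prism || t.eprints || t.og || t.dc ||
    defaultUnknownType || t.zoteroGuess || t.bibGuess ||
    t.prismGuess || t.ogGuess || t.dcGuess;
}

// A definite PRISM type beats the default.
console.log(resolveItemType(classifyPrismGenre("blogentry"), "webpage")); // blogPost

// A guess loses to the default ...
console.log(resolveItemType(classifyPrismGenre("essay"), "webpage")); // webpage

// ... and is only used when there is no default at all.
console.log(resolveItemType(classifyPrismGenre("essay"), false)); // magazineArticle
```

Note that t.zotero and t.bib sit ahead of t.prism in the chain, so this only holds if nothing else on the page sets those.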

My takeaway from this slapdash bit of code forensics is that we need to figure out how to embed some RDF metadata that uses the PRISM vocabulary to assert that the "genre" of our blog posts is "blogentry."

Stay tuned to my Zotero tag for the next dispatch ...