\b\p{Lu}\p{Lu}+

After some iteration on the results, I ended up with something more complex, trying to capture anything that fell between the acronym itself and the first subsequent colon, which seemed to be the standard delimiter between the designation+explanation of the type of identifier and the identifying value itself. I figure we'll worry about how to parse the value later, once we're sure which identifiers we want to capture. So, here's the regex I ultimately used:
\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]

The full ack command looked like this:
ack -oh "\b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]" post-*.xml > ../awol-acronyms/raw.txt

where the -h option tells ack to "suppress the prefixing of filenames on output when multiple files are searched" and the -o option tells ack to "show only the part of each line matching" my regex pattern (quotes from the ack man page). You can browse the raw results here.
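For readers without ack at hand, the same kind of extraction can be approximated in Python. What follows is only a sketch under a few assumptions, not the method used for this post. Python's standard re module has no \p{Lu} or \p{P}, so the sketch builds those character classes from unicodedata. Inside a character class, \b means a backspace character, and the sketch writes that as \x08. The function names (chars_in_category, extract_labels, tally_labels) are made up for illustration, and the tallying step is an extra that the ack command itself does not perform.

```python
import re
import sys
import unicodedata
from collections import Counter


def chars_in_category(prefix):
    """Return, regex-escaped, every code point whose Unicode general
    category starts with `prefix` (for example "Lu" or "P")."""
    return "".join(
        re.escape(chr(cp))
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )


# Stand-ins for \p{Lu} and \p{P}, which Python's re does not support.
UPPER = chars_in_category("Lu")
PUNCT = chars_in_category("P")

# Assumed Python equivalent of: \b\p{Lu}\p{Lu}+[:\s][^\b\p{P}]*[\b\:]
PATTERN = re.compile(
    r"\b[" + UPPER + r"][" + UPPER + r"]+"  # two or more uppercase letters
    r"[:\s]"                                # a colon or whitespace
    r"[^\x08" + PUNCT + r"]*"               # anything but backspace/punctuation
    r"[\x08:]"                              # a backspace or the closing colon
)


def extract_labels(lines):
    """Yield only the matching part of each line, with no filename prefix."""
    for line in lines:
        yield from PATTERN.findall(line)


def tally_labels(lines):
    """Count how often each matched label occurs (hypothetical extra step)."""
    return Counter(extract_labels(lines))


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real post-*.xml files.
    sample = [
        "ISSN paper: 1111-2222",
        "ISSN paper: 3333-4444",
        "ONLINE ISSN: 5555-6666",
        "ISSN electronic edition: 7777-8888",
    ]
    for label, count in tally_labels(sample).most_common():
        print(label, count)
```

To run it over real files, open each one and pass the file object to tally_labels, since iterating over a file yields its lines.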
ISSN: | 17 |
ISSN paper: | 9 |
ISSN electrònic: | 4 |
ISSN electronic edition: | 2 |
ISSN electrónico: | 2 |
ISSN électronique: | 2 |
ISSN impreso: | 2 |
ISSN Online: | 2 |
ISSN edición electrónica: | 1 |
ISSN format papier: | 1 |
ISSN Print: | 1 |
ISSN print edition: | 1 |
ONLINE ISSN: | 1 |
PRINT ISSN: | 1 |
ISBN of Second Part: | 2 |
ISBN: | 1 |
ISBN Compiled by: | 1 |