This post is the second in a series that I hope will become more regular. I want to share XQuery code that I've been writing for the development of digital library projects.
This library module (which contains one function) was partially written by one of my colleagues, Greg Murray. We wanted to replace "dirty" OCR with high-quality TEI text in the XML documents for 60 digital books. Initially, Greg was going to write an XSLT stylesheet to do this, but he found some XQuery code on the TEI wiki (written by David Sewell) that gave us the idea for doing it in XQuery instead.
The function has two parameters: the identifier for the TEI document ($teiID), which lives in one database, and the identifier for the OCR text ($iaID), originally downloaded from the Internet Archive, which lives in another database. I was able to query both databases at once using a MarkLogic (ML) function, xdmp:eval(), and the nodes are replaced using another ML function, xdmp:node-replace().
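Before the full module, here is a minimal sketch of the cross-database lookup on its own. It assumes a MarkLogic database named "tei" (as in this post) and uses a hypothetical document URI, "/TEI/DLAK/example.xml". The variable name $uri is also just for illustration. The database is selected through the options element that xdmp:eval() accepts as its third argument.

```xquery
xquery version "1.0-ml";

(: Sketch only: the document URI below is a hypothetical placeholder. :)
let $query := '
  xquery version "1.0-ml";
  declare variable $uri as xs:string external;
  fn:doc($uri)
'
(: Run the query string against the "tei" database instead of the current one. :)
let $options :=
  <options xmlns="xdmp:eval">
    <database>{ xdmp:database("tei") }</database>
  </options>
(: External variables are passed as a sequence of QName/value pairs. :)
return xdmp:eval($query, (xs:QName("uri"), "/TEI/DLAK/example.xml"), $options)
```

Everything outside the query string still runs against the current database, which is how one function can read from one database and write to another.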
xquery version "1.0-ml";

(: This module provides a function to replace OCR text with TEI text.
   The function gets the text nodes between TEI page breaks and wraps
   those text nodes in elements. :)

module namespace tei = "http://www.catalogingfutures.com/tei-replace";

declare namespace ia = "http://www.catalogingfutures.com/ia";

(: Functions in a library module must be declared in the module namespace. :)
declare function tei:tei-replace($teiID as xs:string?, $iaID as xs:string?)
{
  (: Query evaluated against the "tei" database; $tei:t carries the TEI identifier. :)
  let $query := 'xquery version "1.0-ml";
    declare namespace tei = "http://www.catalogingfutures.com/tei-replace";
    declare variable $tei:t as xs:string external;
    fn:doc(fn:concat("/TEI/DLAK/", $tei:t, ".xml"))/TEI.2/text'
  (: xdmp:eval() takes its target database from an options element. :)
  let $dbase :=
    <options xmlns="xdmp:eval">
      <database>{ xdmp:database("tei") }</database>
    </options>
  let $tei := xdmp:eval($query, (xs:QName("tei:t"), $teiID), $dbase)
  (: The OCR text to be replaced lives in the current database. :)
  let $oldNode := fn:doc(fn:concat("/xml/", $iaID, ".xml"))/ia:doc/ia:text
  let $pbs := $tei/descendant::*:pb
  let $pages :=
    for $pb at $count in $pbs
    (: Empty for the last page break, so the last page runs to the end of the text. :)
    let $next := $pbs[$count + 1]
    return
      (: Wrapper element: the name ia:page is an assumption. :)
      <ia:page>{
        for $text-node in $tei/descendant::text()[. >> $pb and (fn:empty($next) or . << $next)]
        return $text-node
      }</ia:page>
  let $newNode :=
    <text type="tei" fileid="{ $teiID }" xmlns="http://www.catalogingfutures.com/ia">{ $pages }</text>
  return xdmp:node-replace($oldNode, $newNode)
};
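The heart of the function is the pair of node-order comparisons: `>>` ("follows in document order") and `<<` ("precedes in document order"). The following is a small sketch of that technique in standard XQuery, with no MarkLogic functions, so it can be tried in any XQuery processor. The sample document, the `<page>` wrapper element and the `n` attribute are all made up for illustration and are not taken from our TEI files.

```xquery
xquery version "1.0";

(: Hypothetical sample standing in for a TEI <text> element. :)
let $text :=
  <text>
    <body>
      <pb n="1"/><p>First page, first paragraph.</p>
      <p>First page, <pb n="2"/>second page.</p>
      <pb n="3"/><p>Third page.</p>
    </body>
  </text>
let $pbs := $text/descendant::pb
for $pb at $count in $pbs
(: The following page break, or the empty sequence for the last page. :)
let $next := $pbs[$count + 1]
(: Text nodes after this page break and before the next one, if there is one. :)
let $nodes := $text/descendant::text()[. >> $pb and (empty($next) or . << $next)]
return <page n="{ $pb/@n }">{ normalize-space(string-join($nodes, " ")) }</page>
```

Because page breaks are empty milestone elements, a page can start in the middle of a paragraph. Here the text before `<pb n="2"/>` in the second paragraph ends up on page 1 and the text after it on page 2, which is why the selection works on text nodes rather than on whole paragraphs.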