• ADHO DH 2020 long paper - The Semantics of Structure in Large Historical Corpora

    Author(s):
    Rik Hoekstra, Marijn Koolen
    Date:
    2020
    Group(s):
    DH2020
    Subject(s):
    Digital humanities, Text data mining
    Item Type:
    Abstract
    Tag(s):
    digital historical corpora, information extraction, text analysis, Digital history, Text analytics
    Permanent URL:
    http://dx.doi.org/10.17613/m1rp-6460
    Abstract:
    There are two approaches to structuring historical corpora that are too large to process manually. The first is an inductive method that extracts implicit entities and meaning from textual (and sometimes visual) content. With the help of AI or manually compiled (existing) lists of entities, the extracted entities are converted into information. The second, which Colavizza (2019) calls referential information systems, takes existing reference systems (such as archival indexes) and uses them to contextualize individual documents. Both methods are used to turn corpora into computer-accessible information systems. Ideally, combining both approaches would yield a more complete information system, but in practice they are hard to bridge because of a number of problems. This paper presents an approach that addresses those problems by combining inductive methods of automated text analysis and information extraction with knowledge of the referential information systems, adding rich semantic layers of information to large historical corpora.
    Metadata:
    Status:
    Published
    Last Updated:
    4 years ago
    License:
    All Rights Reserved

    Downloads

    Item Name: pdf the-semantics-of-structure-in-large-historical-corpora.pdf