PhD defence: Entity Resolution on Historical Knowledge Graphs

Humanities researchers, such as historians and literary scholars, increasingly use semantic web technology in their projects. This technology makes it easier to access large-scale data sets from the cultural heritage world, such as the indexes on persons and locations of the Amsterdam City Archives. It also facilitates the integration of different data sources into knowledge graphs, which in turn enables cross-data set analyses that were previously infeasible. This makes it possible, for example, to reconstruct someone's life using primary archival sources.

However, integrating different historical data sets entails a number of complications. Because the majority of these archival data sets are aimed at providing quick and easy access, it is likely that the same person is included multiple times within a single data set and/or appears in several data sets, each time with a new entry and a unique identifier. Until these entries are resolved and the duplicate entities are disambiguated, it is not possible to conduct this type of investigation.

This dissertation addresses this problem by describing a method to reduce, if not eliminate, the duplicates in a set of knowledge graphs. It does so by clustering these unique entries, where each cluster represents a single real-world entity. To this end, we describe the application of 'embeddings': a technique for representing entities, such as nodes in a knowledge graph, as numerical vectors that a computer can process. Within this method, the embeddings are constructed in such a way that high similarity between two nodes is indicative of a duplicate.
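
As a rough illustration of how similarity between embeddings can flag candidate duplicates, the following minimal Python sketch compares hypothetical vectors using cosine similarity. The record names, the three-dimensional vectors, the choice of cosine similarity and the threshold of 0.95 are all assumptions made for this example; they are not taken from the dissertation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for three archive records (illustrative values only).
embeddings = {
    "record_1": [0.9, 0.1, 0.3],
    "record_2": [0.8, 0.2, 0.35],
    "record_3": [-0.2, 0.9, 0.1],
}

# Assumed similarity threshold; a real value would have to be tuned.
THRESHOLD = 0.95


def candidate_duplicates(embeddings, threshold):
    """Return all record pairs whose similarity reaches the threshold."""
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                pairs.append((a, b))
    return pairs


print(candidate_duplicates(embeddings, THRESHOLD))
```

In this toy data only `record_1` and `record_2` point in nearly the same direction, so they form the single candidate pair.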

However, relying on these pairwise matches is not without risk. For example, applying a threshold value can lead to a transitivity violation: the entity pairs $(A, B)$ and $(B, C)$ can both have high similarity, while the pair $(A, C)$ does not. To address this issue, we employ algorithms that use the pairwise similarities to find clusters that conform as closely as possible to the computed similarities. Nevertheless, these clustering algorithms still produce false positives and false negatives. To counteract this and significantly improve the clustering results, this work describes the use of domain-specific knowledge and constraints to detect and correct clustering errors. Examples of such constraints are that one cannot marry oneself and that a person is baptized before being buried.
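
The transitivity problem and the role of a domain constraint can be sketched in a few lines of Python. The similarity values, the threshold and the marriage record below are hypothetical, and the checks are a simplified illustration rather than the algorithms used in the dissertation.

```python
from itertools import combinations

# Hypothetical pairwise similarities between three entries A, B and C.
similarity = {
    frozenset({"A", "B"}): 0.92,
    frozenset({"B", "C"}): 0.91,
    frozenset({"A", "C"}): 0.40,
}

# Assumed threshold above which two entries count as a match.
THRESHOLD = 0.9


def is_match(x, y):
    """True if the pair's similarity reaches the threshold."""
    return similarity[frozenset({x, y})] >= THRESHOLD


def transitivity_violations(entities):
    """Return triples in which exactly two of the three pairs match."""
    violations = []
    for triple in combinations(sorted(entities), 3):
        matched = sum(is_match(x, y) for x, y in combinations(triple, 2))
        if matched == 2:
            violations.append(triple)
    return violations


def violates_marriage_constraint(cluster, marriages):
    """True if a cluster contains both partners of one marriage record,
    which would mean a person married themselves."""
    return any(x in cluster and y in cluster for x, y in marriages)


# Hypothetical marriage record in which entries A and C are the two partners.
marriages = [("A", "C")]

print(transitivity_violations({"A", "B", "C"}))
print(violates_marriage_constraint({"A", "B", "C"}, marriages))
print(violates_marriage_constraint({"A", "B"}, marriages))
```

Here the pairs $(A, B)$ and $(B, C)$ pass the threshold but $(A, C)$ does not, so merging all three entries into one cluster would contradict the computed similarities. The marriage constraint independently rejects that merged cluster, while the smaller cluster containing only A and B remains admissible.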

Start date and time
End date and time
Location: Academiegebouw, Domplein 29 & online (livestream link)
PhD candidate: J. Baas
Dissertation: Entity Resolution on Historical Knowledge Graphs
PhD supervisor(s): prof. dr. M.M. Dastani, prof. dr. E. Stronks
Co-supervisor(s): dr. A.J. Feelders