Geographic and temporal information retrieval on massive document collections
Dr Benjamin Adams, Research Fellow, Department of Computer Science
The Frankenplace tool running on 64K screens in the Centre for eResearch visualisation facility.
Overview
The project aims to discover the relationships between the topics, places, and events found in a very large document corpus. By understanding these relationships, we can build new kinds of exploratory ad hoc search systems that use the contexts of geography and time to help people find otherwise hidden information through interactive map- and timeline-based search interfaces.
Frankenplace
Frankenplace is an interactive thematic map search engine that uses geographic context as a means to discover, organise, and interactively visualise the documents related to a search query. The current version of Frankenplace indexes over 5 million articles from the English version of Wikipedia and online social media blog entries.
Cartographers have known for centuries the power of organising thematic information by geographic context and visualising it on a map. A thematic map allows you to understand the spatial dynamics of a topic within a structure that has strong reference points in human experience: the myriad places and regions around the world. At the same time, a thematic map can show us important relationships between these places by revealing where they are similar or different with respect to some property of interest.
Frankenplace is also an ad hoc search engine designed to help users find relevant documents that match a query. By visualising the interaction between the thematic and geographic content of documents, the map interface lets users quickly explore hundreds or thousands of documents that match their query while bringing their background knowledge of geography to bear on their interpretation of the results. Likewise, unexpected patterns that arise on the map provide opportunities to learn new things about how a topic of interest relates to places.
This chart shows the relative number of references to dates by century for China, France, Iraq, and Greece. These results come from over 68 million references mined out of the English Wikipedia.
How are times and places related?
There is a demonstrated need for knowledge systems that can organise large document collections through space and time (e.g., for the digital humanities). However, little is actually known about the statistics of how places and times are written about in these document collections. We are currently using the Pan cluster to mine these relationships. We hope these insights will lead to more useful schemas for spatially and temporally organising and presenting search results in interactive information retrieval systems.
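As a rough illustration of the kind of statistic involved, the sketch below bins date references by century for each place and converts the counts to relative shares. It is a minimal example under stated assumptions, not the project's mining code: the input is assumed to be already-extracted (place, year) pairs, only CE years are handled, and the names `century_of` and `relative_century_counts` are hypothetical.

```python
from collections import Counter, defaultdict


def century_of(year: int) -> int:
    """Map a CE year to its century number (1-100 -> 1, 1901-2000 -> 20).

    Assumption: BCE dates are out of scope for this sketch.
    """
    if year < 1:
        raise ValueError("this sketch only handles CE years")
    return (year - 1) // 100 + 1


def relative_century_counts(references):
    """Return, for each place, the share of its date references per century.

    `references` is an iterable of (place, year) pairs, assumed to have been
    produced by an earlier tagging step.
    """
    counts = defaultdict(Counter)
    for place, year in references:
        counts[place][century_of(year)] += 1

    shares = {}
    for place, by_century in counts.items():
        total = sum(by_century.values())
        shares[place] = {
            century: n / total for century, n in sorted(by_century.items())
        }
    return shares
```

Normalising within each place makes places with very different numbers of references comparable on the same chart.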
How High Performance Computing has helped
The NeSI High Performance Computing (HPC) facilities have been a critical resource in our research. They have allowed us to build the indices for Frankenplace orders of magnitude faster than we would have been able to otherwise. We were able to pre-process millions of documents in parallel, cleaning the text of irrelevant HTML and other noise and running natural language processing code to tag place names and dates. To build the indices we also needed to perform millions of spatial intersection operations to match text paragraphs in these documents to cells on a discrete global grid. Here again, the ability to run these operations in parallel has been a tremendous help in allowing us to test, improve, and re-run our algorithms. In future work, we plan to use the advanced GPU facilities on the cluster to create better geographic and temporal taggers using deep neural networks.
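The two parallel steps described above can be sketched roughly as follows. This is an illustrative example only, not the Frankenplace pipeline: it uses Python's standard library on a single machine rather than a cluster scheduler, it omits the place-name and date tagging step, and it substitutes a simple equal-angle latitude/longitude grid for whatever discrete global grid the project actually uses. The names `clean_html`, `clean_all`, and `cell_for_point` are hypothetical.

```python
import math
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Strip markup from one document and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def clean_all(docs, workers=4):
    """Clean many documents in parallel across worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_html, docs, chunksize=64))


def cell_for_point(lat: float, lon: float, cell_deg: float = 1.0):
    """Return the (row, col) of the grid cell containing a point.

    Assumption: a plain equal-angle grid stands in for a real discrete
    global grid; points on the top or right edge go to the last cell.
    """
    rows = math.ceil(180 / cell_deg)
    cols = math.ceil(360 / cell_deg)
    row = min(int((lat + 90) // cell_deg), rows - 1)
    col = min(int((lon + 180) // cell_deg), cols - 1)
    return row, col
```

Because each document is cleaned independently, the work divides cleanly across processes. When run as a script, `clean_all` should be called from inside an `if __name__ == "__main__":` guard so that worker processes can start safely on every platform.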
The role of the visualisation facility
In addition to the HPC infrastructure that NeSI provides, the Centre for eResearch's advanced visualisation facility offered us an entirely new avenue of research: exploring the impact of large-scale collaborative and interactive visualisation on map-based search. The ability to see search results for the entire world at high granularity has led to new opportunities to evaluate the role of human-computer interaction in our research.