Big Data Opportunities and Challenges for IR, Text Mining, and NLP
Professor Beth Plale

This event took place on 12th December 2013 at 10:30am (10:30 GMT)
Knowledge Media Institute, Berrill Building, The Open University, Milton Keynes, United Kingdom, MK7 6AA

The HathiTrust Research Center (HTRC) is a collaborative effort of Indiana University and the University of Illinois at Urbana-Champaign, along with the HathiTrust, to provide a new mode of access to the content of research libraries. That is, HTRC enables computational exploration of the digitized volumes that make up the HathiTrust digital library.
HTRC was launched in 2011, and Phase I of the initiative was dedicated to constructing the underlying software and services. Spring 2013 marks Phase II, which focuses on engaging with the research community to support and showcase computational research on the public domain corpus, alongside ongoing technical development.

In this talk, I will discuss two recent developments at HTRC:

Community Contributed Analytics in Secure Capsule. Through funding from the Alfred P. Sloan Foundation, HTRC is developing secure software through which researchers can submit their own analytics algorithms to run against the full corpus of 11 million volumes, including both public domain and copyrighted content. Researchers with smaller-scale needs obtain a dedicated virtual machine (VM) that is pre-configured but can be customized by the researcher with his or her own software. The VM runs on HTRC compute resources. While it is running, the VM has restricted network access to protect the data. HTRC is also working on expanding access to statistical information about the entire 11 million volume corpus, working with community members to identify particularly useful information such as page-level token counts.
Metadata Enhancement. Going beyond MARC, the HTRC team is adding more metadata fields to better serve the diverse needs of the community. Our indexing service has separated the full-text index from the metadata index, making it more convenient to add metadata fields without interfering with the OCR content. So far, “gender” and “token count” fields have been added, and we plan to investigate and implement additional attributes.
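
As a rough illustration of why keeping the two indexes apart is convenient, the sketch below stores OCR tokens and metadata fields in separate structures, so that adding a field such as a token count never touches the full-text index. This is not HTRC's implementation: the class `VolumeIndex`, its methods, the volume IDs, and the sample text are all hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict


class VolumeIndex:
    """Toy model: full text and metadata live in separate indexes (hypothetical)."""

    def __init__(self):
        # Full-text index: token -> set of volume IDs containing it.
        self.fulltext = defaultdict(set)
        # Metadata index: volume ID -> {field name: value}.
        self.metadata = defaultdict(dict)

    def add_volume(self, volume_id, ocr_text, **fields):
        # Assumption: naive lowercase whitespace tokenization of the OCR text.
        for token in ocr_text.lower().split():
            self.fulltext[token].add(volume_id)
        self.metadata[volume_id].update(fields)

    def add_metadata_field(self, volume_id, name, value):
        # Only the metadata index changes; the full-text index is untouched.
        self.metadata[volume_id][name] = value

    def search(self, token, **filters):
        # Look up the token in the full-text index, then filter on metadata.
        hits = self.fulltext.get(token.lower(), set())
        return sorted(
            v for v in hits
            if all(self.metadata.get(v, {}).get(k) == val
                   for k, val in filters.items())
        )


# Hypothetical usage with made-up volumes.
index = VolumeIndex()
index.add_volume("vol-1", "the whale and the sea", title="Sample A")
index.add_volume("vol-2", "the sea at night", title="Sample B")

# A new field is added later without re-indexing any OCR text.
index.add_metadata_field("vol-1", "token_count", 5)
matches = index.search("sea", token_count=5)
```

In this toy model, `matches` contains only the volume whose metadata carries the new field, while the full-text index is exactly what it was before the field was added.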
See more details at

The webcast was open to 100 users

Click below to play the event (53 minutes)
