Invited Speaker: Dataset Profiles - investigating the role of data in experimental NLP
Prof. Anne de Roeck

This event took place on 8th March 2006 at 9:00am (09:00 GMT)
Knowledge Media Institute, Berrill Building, The Open University, Milton Keynes, United Kingdom, MK7 6AA

The replay includes a welcome introduction to the 9th annual CLUK Research Colloquium by Professor Donia Scott, The Open University.

Abstract:
It has long been known that the performance of Information Retrieval and Natural Language Processing techniques on a particular task is very sensitive to the characteristics of the data on which they are used. Though widely accepted, this fact has never been taken to its logical conclusion: in evaluation, for instance, experimental results are reported without reference to the impact of the underlying datasets or collections. This raises serious methodological and practical issues around replicability. These could be addressed if we had reliable ways of profiling datasets, using measures that highlight relevant differences between collections. A first step would be to investigate what such measures might look like for a given range of tasks or techniques.

In this talk, I will show that even standard textual datasets such as the TIPSTER collection differ in ways that challenge widely accepted assumptions about the general applicability of techniques, and that similar differences in data profile show up between texts in the same genre but in different languages. In exploring what might be suitable profiling measures, I will set out some desirable properties that such measures should have. I will then introduce our work on modelling term burstiness, and explore what term distribution and variations in the burstiness patterns of a term's occurrences can tell us about genres and datasets.

The webcast was open to 50 users

Replay duration: 63 minutes