Evaluation Methodologies for Multilabel Classification Evaluation
Stefanie Nowak
This event took place on 18th December 2009 at 11:30am (11:30 GMT)
Knowledge Media Institute, Berrill Building, The Open University, Milton Keynes, United Kingdom, MK7 6AA
Semantic indexing of multimedia content is a key research challenge in the multimedia community, and several benchmarking campaigns exist that assess the performance of such systems. My PhD thesis deals with approaches for annotating images with multiple visual concepts and with evaluation methodologies for assessing annotation performance.

After a short outline of the different parts of my thesis, I will illustrate in more detail three case studies that were performed based on the results of a recent benchmarking event in ImageCLEF. In ImageCLEF 2009, we conducted a task aimed at the detection of 53 visual concepts in consumer photos. These concepts are structured in an ontology that covers the scene description of photos, the representation of photo content and the photo quality. For performance assessment, a recently proposed ontology-based measure was utilized that takes the hierarchy and the relations of the ontology into account and generates a score per photo.

Starting from this benchmark, three case studies related to evaluation methodologies have been conducted. The first study deals with ground-truth assessment for benchmark datasets: we investigate how much annotations from experts differ from each other, how different sets of annotations influence the ranking of systems, and whether these annotations can be obtained with a crowdsourcing approach. The second case study examines the behaviour of different evaluation measures for multilabel evaluation and points out their strengths and weaknesses; concept-based and example-based evaluation measures are compared based on the resulting rankings of systems. In the third case study, the ontology-based evaluation measure is extended with semantic relatedness metrics. We apply several semantic relatedness measures based on web search engines, WordNet and Wikipedia, and evaluate the measures with respect to stability and the rankings they produce.
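To make the distinction drawn in the second case study concrete, the following is a minimal sketch of the two averaging views of multilabel F1: concept-based (macro) averaging scores each concept across all photos, while example-based averaging scores each photo across its concepts. The concept names and the toy annotations are invented for illustration and are not the ImageCLEF data or the thesis' actual measures.

```python
# Toy illustration of concept-based vs. example-based multilabel F1.
# Concept names and annotations below are hypothetical examples.

def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when precision and recall are both empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Ground-truth and predicted concept sets, one entry per photo (invented).
truth = [{"sky", "tree"}, {"person"}, {"sky", "person", "indoor"}]
pred  = [{"sky"}, {"person", "tree"}, {"sky", "person"}]

concepts = set().union(*truth, *pred)

def concept_f1(truth, pred):
    """Concept-based (macro) F1: one score per concept, then averaged."""
    scores = []
    for c in sorted(concepts):
        tp = sum(c in t and c in p for t, p in zip(truth, pred))
        fp = sum(c not in t and c in p for t, p in zip(truth, pred))
        fn = sum(c in t and c not in p for t, p in zip(truth, pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def example_f1(truth, pred):
    """Example-based F1: one score per photo, then averaged."""
    scores = []
    for t, p in zip(truth, pred):
        scores.append(f1(len(t & p), len(p - t), len(t - p)))
    return sum(scores) / len(scores)

print(round(concept_f1(truth, pred), 3))  # 0.5
print(round(example_f1(truth, pred), 3))  # 0.711
```

On this toy data the two views disagree noticeably: the rarely predicted concepts ("tree", "indoor") drag the concept-based score down to 0.5, while every photo individually is annotated fairly well, so the example-based score is higher. Differences of this kind are why the two families of measures can rank systems differently.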
The webcast was open to 100 users