ISWC 2009 Keynote/Tom Mitchell
Populating the Semantic Web by Macro-Reading Internet Text
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
A key question to the future of the semantic web is "how will we acquire structured information to populate the semantic web on a vast scale?" One approach is to enter this information manually. A second approach is to take advantage of the great deal of structured information already present in various databases, and to develop common ontologies, publishing standards, and reward systems to make this data widely accessible. We consider here a third approach: developing software that automatically extracts structured information from unstructured text present on the web.
This talk will survey attempts to extract structured knowledge from unstructured text, and will focus on an approach with three characteristics that we hypothesize make it viable. First, in contrast to the very difficult problem of reading information from a single document, we consider the much easier problem of reading hundreds of millions of documents simultaneously, so that our system can extract facts that are stated many times by combining evidence from many documents. Second, our system begins with a given ontology that defines the types of information to be extracted, enabling it to focus its effort and to ignore most of the text which is irrelevant to the target ontology. Third, the system uses a new class of semi-supervised learning algorithms to learn how to extract information from web pages -- algorithms designed to achieve greater accuracy when given more complex ontologies. Our experiments show that this approach can produce knowledge bases containing tens of thousands of facts to populate given ontologies with approximately 90% accuracy, starting with only a handful of labeled training examples and 200 million unlabeled web pages.