PCD Seminar 1/20/95 Marti Hearst

Context and structure in full-text information access

Marti Hearst, Xerox PARC
hearst@parc.xerox.com

Seminar on People, Computers, and Design
Stanford University January 20, 1995

Until recently there has been little research on information access from full-length texts (as opposed to titles, abstracts and newswire articles). I suggest that full-text documents be characterized according to the structure of their content, that is, as a set of main topics that occur throughout the length of the text, and a sequence of local subtopical discussions. I have developed an algorithm, called TextTiling, that uses lexical frequency information to partition expository texts according to subtopic structure, and whose results correspond well to human judgments.
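As a rough illustration of how lexical frequency information can suggest subtopic boundaries, the sketch below compares the word counts of adjacent sentences and starts a new segment wherever their similarity drops. It is a simplified stand-in, not the TextTiling algorithm itself: the function names, the sentence-level comparison, the cosine measure, and the fixed threshold are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gap_similarities(sentences):
    """Lexical similarity across each gap between neighbouring sentences."""
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def segment(sentences, threshold=0.1):
    """Group sentences into segments, splitting at low-similarity gaps.

    The threshold is an illustrative assumption, not a value from the talk.
    """
    if not sentences:
        return []
    segments, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], gap_similarities(sentences)):
        if sim < threshold:
            segments.append(current)
            current = [sentence]
        else:
            current.append(sentence)
    segments.append(current)
    return segments


sentences = [
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "stars burn hydrogen fuel",
    "hydrogen fuel powers stars",
]
for seg in segment(sentences):
    print(seg)
```

In this toy input the two middle sentences share no vocabulary, so the split falls between them. A realistic segmenter would compare larger blocks of text and smooth the similarity curve before choosing boundaries.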

An important aspect of information access is the display of retrieval results. Most systems use inter-document similarity as a criterion for display, via clustering or some version of the vector space similarity measure. However, inter-document similarity measures that work well for short texts and abstracts are often inappropriate for long texts. As an alternative, I present a new graphical search tool, called TileBars, which uses term distribution information to show the relationship between the subtopic structure of the retrieved texts and the terms of the query. TileBars use TextTiles to simultaneously and compactly display query term frequency, query term distribution and relative document length, providing an informative alternative to ranking long texts according to their overall similarity to a query.
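The idea of showing query term frequency, term distribution and document length together can be sketched in text form. In the hypothetical example below, each document is given as a list of already-segmented tiles, each row corresponds to one set of query terms, and each character stands for one tile, with a denser character for more hits. The `tilebar` function, the `SHADES` characters, and the whitespace tokenization are assumptions for illustration only; the actual TileBars display is graphical.

```python
# Characters used from "no hits" to "three or more hits" (an assumed scale).
SHADES = ".-=#"


def tilebar(tiles, term_sets):
    """Return one string per term set, with one character per tile.

    The length of each row reflects the number of tiles (document length);
    the character at each position reflects how often the terms occur there.
    """
    rows = []
    for terms in term_sets:
        row = ""
        for tile in tiles:
            words = tile.lower().split()
            hits = sum(words.count(term) for term in terms)
            row += SHADES[min(hits, len(SHADES) - 1)]
        rows.append(row)
    return rows


tiles = [
    "law suit filed",
    "network protocol",
    "network law law network",
]
rows = tilebar(tiles, [["law"], ["network"]])
for row in rows:
    print(row)
```

Reading the two rows side by side shows where the term sets overlap: here only the third tile contains both "law" and "network", which a single whole-document similarity score would not reveal.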

I have recently completed a set of experiments on a large full-text collection. They show that term distribution and overlap constraints, such as those used in the TileBar search interface, can significantly improve retrieval results as measured by standard evaluation criteria.

Marti Hearst completed her PhD in Computer Science at UC Berkeley in April 1994 and is now a member of the research staff at Xerox PARC. Her research interests include intelligent information access, corpus-based computational linguistics, user interfaces, and psycholinguistics.
