WordSieve identifies task-specific terms and generates representations of documents that reflect both their content and the context in which they were consulted. It detects boundaries between user tasks from implicit information, namely the characteristics of document access subsequences, and describes each task by the keywords that are useful in distinguishing those subsequences. To illustrate, consider a user searching the web first for information about genetic algorithms, then for some other subject. The term ``genetic'' is a good task-specific term since it occurs frequently for a certain period of time and then stops occurring. In doing so, ``genetic'' effectively segments the document accesses into two groups, the first of which is probably related to ``genetic.'' (Note that the term ``genetic'' need not occur in every document accessed in the sequence, and that there may be some overlap between tasks as transitions occur.)
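The segmentation intuition in this example can be made concrete with a small sketch. This is an illustration only, not WordSieve's algorithm: the function names (`term_presence`, `split_score`, `best_boundary`) and the sample document titles are hypothetical. The sketch records, for each accessed document, whether a term occurs in it, and then looks for the split point where the term's occurrence rate changes most sharply.

```python
def term_presence(docs, term):
    """For each document, in access order, report whether `term` occurs in it."""
    return [term in doc.lower().split() for doc in docs]


def split_score(presence, k):
    """Occurrence rate in the first k documents minus the rate in the rest."""
    before, after = presence[:k], presence[k:]
    return sum(before) / len(before) - sum(after) / len(after)


def best_boundary(presence):
    """Return the split index with the largest change in occurrence rate."""
    candidates = range(1, len(presence))
    return max(candidates, key=lambda k: abs(split_score(presence, k)))


# Hypothetical access sequence: a genetic-algorithms task, then an unrelated one.
docs = [
    "genetic algorithms use crossover and mutation",
    "a survey of evolutionary computation",  # same task, term absent
    "genetic programming and fitness functions",
    "genetic operators in practice",
    "cheap flights to lisbon",
    "lisbon hotel reviews",
    "portugal travel guide",
]

presence = term_presence(docs, "genetic")
boundary = best_boundary(presence)
print(presence)
print("task boundary after document", boundary)
```

In this toy sequence the term is absent from the second document, yet the sharpest change in occurrence rate still falls between the fourth and fifth accesses, which matches the note that a task-specific term need not occur in every document of its task.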
WordSieve represents user access patterns in a three-layer architecture: the lowest layer reflects the frequency of words currently occurring in the document stream, and the upper two layers reflect a user access profile. Each layer consists of a set of units, each of which corresponds to a word from the input text stream. WordSieve uses a competitive learning algorithm to assign keywords to units as the document stream passes through it. The goal is to identify keywords that occur frequently during a given task but infrequently during other tasks. WordSieve's information flow is shown in Figure 1, and the function of each layer is described below.
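As a rough picture of what it means for words to compete for a limited set of units, the following sketch models a single frequency-tracking layer. It is a generic illustration under stated assumptions, not the update rule used by WordSieve: the class name `FrequencyLayer`, the parameters `capacity`, `gain`, and `decay`, and the deterministic replace-the-weakest policy are all hypothetical choices made for this example.

```python
class FrequencyLayer:
    """Toy layer: a fixed number of units, each bound to one word.

    Assumed rule (hypothetical, for illustration only): every observed
    word decays all activations, boosts its own unit if it has one, and
    otherwise may take over the weakest unit.
    """

    def __init__(self, capacity=2, gain=1.0, decay=0.9):
        self.capacity = capacity
        self.gain = gain
        self.decay = decay
        self.units = {}  # word -> activation

    def observe(self, word):
        # All existing units lose some activation on every input word.
        for w in self.units:
            self.units[w] *= self.decay

        if word in self.units:
            # The word already owns a unit: reinforce it.
            self.units[word] += self.gain
        elif len(self.units) < self.capacity:
            # A free unit is available: claim it.
            self.units[word] = self.gain
        else:
            # Competition: the newcomer displaces the weakest unit,
            # but only if that unit has decayed below the entry level.
            weakest = min(self.units, key=self.units.get)
            if self.units[weakest] < self.gain:
                del self.units[weakest]
                self.units[word] = self.gain


layer = FrequencyLayer(capacity=2)
stream = "genetic algorithm genetic crossover genetic mutation genetic fitness"
for token in stream.split():
    layer.observe(token)

print(sorted(layer.units, key=layer.units.get, reverse=True))
```

With these assumed settings, a word that recurs throughout the stream keeps its unit, while words that appear once are displaced as newer words arrive. A real system would need further machinery to separate task-specific words from words that are frequent everywhere, which is the role the text assigns to the upper layers.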