Clusters of documents can be summarized by finding the top terms (words) for the documents in the cluster, e.g., by taking the most frequent k terms, where k is a constant, say 10, or by taking all terms that occur more frequently than a specified threshold. Suppose that K-means is used to find clusters of both documents and words for a document data set.
(a) How might a set of term clusters defined by the top terms in a document
cluster differ from the word clusters found by clustering the terms with
K-means?
(b) How could term clustering be used to define clusters of documents?
(a) First, the term clusters defined by top terms could, and likely would,
overlap somewhat, since a frequent term can rank among the top terms of
more than one document cluster. Second, it is likely that many terms
would not appear in any of the clusters formed by the top terms. In
contrast, a K-means clustering of the terms would cover all the terms
and would not be overlapping.
(b) An obvious approach would be to take the top documents for a term
cluster; i.e., those documents that most frequently contain the terms in
the cluster.
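
The two answers above can be illustrated with a minimal sketch. It does not run K-means itself: the toy documents, the document labels in `doc_labels`, and the term partition in `term_clusters` are all hypothetical values standing in for the output of a K-means run, and the helper names `top_terms` and `docs_for_term_cluster` are made up for this example.

```python
from collections import Counter

# Hypothetical toy data: each document is a list of terms.
docs = [
    ["game", "game", "team", "team", "ball"],
    ["game", "team", "score", "coach"],
    ["game", "game", "vote", "vote", "party"],
    ["vote", "game", "election", "senate"],
]
# Assumed result of K-means on the documents (one cluster id per document).
doc_labels = [0, 0, 1, 1]
# Assumed result of K-means on the terms: a partition of the vocabulary.
term_clusters = {
    0: {"game", "team", "ball", "score", "coach"},
    1: {"vote", "party", "election", "senate"},
}


def top_terms(docs, doc_labels, k):
    """Return the set of the k most frequent terms of each document cluster."""
    counts = {}
    for doc, label in zip(docs, doc_labels):
        counts.setdefault(label, Counter()).update(doc)
    return {label: {term for term, _ in c.most_common(k)}
            for label, c in counts.items()}


def docs_for_term_cluster(docs, terms, n):
    """Return indices of the n documents that contain the given terms most often."""
    scores = [sum(1 for term in doc if term in terms) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])[:n]


# Part (a): top-term clusters may overlap and may leave terms uncovered.
top = top_terms(docs, doc_labels, k=2)
vocabulary = {term for doc in docs for term in doc}
overlap = top[0] & top[1]
uncovered = vocabulary - (top[0] | top[1])

# Part (b): take the top documents for each term cluster.
doc_clusters = {cid: docs_for_term_cluster(docs, terms, n=2)
                for cid, terms in term_clusters.items()}

print("top terms per document cluster:", top)
print("shared by both top-term sets:", overlap)
print("terms in no top-term set:", sorted(uncovered))
print("top documents per term cluster:", doc_clusters)
```

In this toy data the term "game" is among the top terms of both document clusters, and most of the vocabulary is in neither top-term set, whereas the assumed K-means partition assigns every term to exactly one cluster. Like top terms, the top documents chosen per term cluster need not be disjoint or cover every document in general.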