The k-Means algorithm uses a similarity metric based on the distance between a record and a cluster centroid. If the attributes of the records are not quantitative but categorical in nature, such as Income Level with values {low, medium, high}, Married with values {yes, no}, or State of Residence with values {Alabama, Alaska,…, Wyoming}, then the distance metric is not meaningful. Define a more suitable similarity metric that can be used for clustering data records that contain categorical data.

What will be an ideal response?


Instead of a distance metric, we can define a similarity metric between
two records based on the number of attribute values the two records have
in common across all dimensions.

For n-dimensional data records, we define the similarity of two records
rj and rk as

similarity(rj, rk) = sim(rj1, rk1) + sim(rj2, rk2) + ... + sim(rjn, rkn)

where rji is the value of the i-th attribute of record rj, and
sim(rji, rki) is 1 if rji = rki and 0 otherwise.

Here a higher similarity value plays the role of a smaller distance:
the more attribute values two records share, the closer together they are.
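The definition above can be sketched in a few lines of Python. This is only an illustration: it assumes each record is a tuple of categorical values in a fixed attribute order, and the helper mismatch_distance (n minus the similarity, i.e. the number of attributes on which the records differ) is a hypothetical name added here to show one way of turning the similarity into a distance-like quantity.

```python
def similarity(rj, rk):
    """Count the attributes on which two equal-length records agree."""
    if len(rj) != len(rk):
        raise ValueError("records must have the same number of attributes")
    # sim(rji, rki) is 1 when the i-th values match and 0 otherwise.
    return sum(1 for a, b in zip(rj, rk) if a == b)


def mismatch_distance(rj, rk):
    """Hypothetical helper: number of attributes on which the records differ."""
    return len(rj) - similarity(rj, rk)


# Illustrative records: (income level, married, state)
ra = ("high", "yes", "ny")
rb = ("low", "no", "ny")
print(similarity(ra, rb))         # 1
print(mismatch_distance(ra, rb))  # 2
```

With this convention, maximizing similarity and minimizing the mismatch count select the same nearest record.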

For example, consider the following 3 records:

RID  INCOME LEVEL  MARRIED  STATE
1    high          yes      ny
2    low           no       ny
3    high          yes      ca

We have the following similarity values (terms in the order
income level, married, state):
similarity(1,2) = 0 + 0 + 1 = 1
similarity(1,3) = 1 + 1 + 0 = 2
similarity(2,3) = 0 + 0 + 0 = 0

Records 1 and 3 are the most similar (or closest).
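As a quick check of the example, the three records can be compared pairwise in Python. This is a minimal sketch: the dictionary layout keyed by RID and the variable names are assumptions made for illustration, not part of the original answer.

```python
from itertools import combinations

# The three example records, keyed by RID: (income level, married, state)
records = {
    1: ("high", "yes", "ny"),
    2: ("low", "no", "ny"),
    3: ("high", "yes", "ca"),
}


def similarity(rj, rk):
    """Number of attributes on which the two records have the same value."""
    return sum(a == b for a, b in zip(rj, rk))


# Similarity for every unordered pair of records
pairwise = {
    (j, k): similarity(records[j], records[k])
    for j, k in combinations(sorted(records), 2)
}
closest_pair = max(pairwise, key=pairwise.get)

print(pairwise)      # {(1, 2): 1, (1, 3): 2, (2, 3): 0}
print(closest_pair)  # (1, 3)
```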

Computer Science & Information Technology
