1. Below is a table representing eight transactions and five items: Beer, Coke, Pepsi, Milk, and Juice. The items are represented by their first letters; e.g., "M" = Milk. An "x" indicates membership of the item in the transaction.
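Items (columns): B, C, P, M, J
Transactions (rows): 1 through 8
[Table body not recoverable: the column positions of the "x" marks were lost. Transactions 1, 2, 4, 6, and 7 contain two items each, transactions 3 and 5 contain three, and transaction 8 contains four.]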
Compute the support for each of the 10 pairs of items. If the support threshold is 2, which of the pairs are frequent itemsets?
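Answer: Here is the table of supports for the ten pairs:
     J  M  P  C
B    2  2  1  3
C    2  3  0
P    1  1
M    2
With a support threshold of 2, the frequent pairs are those with support at least 2: BC, BM, BJ, CM, CJ, and MJ. The four pairs containing P (BP, CP, PM, PJ) fall below the threshold.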
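The threshold test can be sketched in a few lines of Python. The dictionary below copies the ten supports from the table in the answer; the names `pair_support` and `frequent_pairs` are illustrative and not part of the original exercise.

```python
# Pair supports copied from the support table in the answer.
pair_support = {
    "BJ": 2, "BM": 2, "BP": 1, "BC": 3,
    "CJ": 2, "CM": 3, "CP": 0,
    "PJ": 1, "PM": 1,
    "MJ": 2,
}


def frequent_pairs(supports, threshold):
    """Return, in sorted order, the pairs whose support meets the threshold."""
    return sorted(pair for pair, count in supports.items() if count >= threshold)


frequent = frequent_pairs(pair_support, threshold=2)
print(frequent)  # ['BC', 'BJ', 'BM', 'CJ', 'CM', 'MJ']
```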
2. Here is a table with seven transactions and six items, A through F. An "x" indicates that the item is in the transaction.
Items (columns): A, B, C, D, E, F
[Table body not recoverable: the seven rows contain 28 "x" marks in total, but their row and column positions were lost.]
Assume that the support threshold is 2. Find all the closed frequent itemsets.
Answer: There are many ways to find the frequent itemsets, but the amount of data is small, so we'll just list the results.
Among the pairs, all but AF are frequent. The counts are:
AC, CE: 5
AE, CD: 4
AD, BC, BD, BE, BF, DE: 3
AB, CF, DF, EF: 2
Here are the counts of the frequent triples:
ACE: 4
ACD, BCE, CDE: 3
ABC, ABE, ADE, BCD, BDE, BDF, BCF, BEF, CEF: 2
There are four quadruples that are frequent, all with counts of 2: BCEF, BCDE, ACDE, and ABCE. There are no frequent sets of five items.
To be closed, the itemset must have a larger count than all of its immediate supersets. Thus, all four of the listed quadruples are closed. A triple with a count of 2 cannot be closed unless it is contained in none of the four frequent quadruples. Among these, only BDF qualifies as closed. However, each of the triples with a count of 3 or 4 is closed, since there are no quadruples with counts this high.
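Among the pairs, only AC, CE, CD, BD, and BF are closed; CE qualifies because its count of 5 exceeds that of every triple containing it (the largest is ACE, with 4). Among the singletons, A, E, and F are not closed. A, which appears 5 times, is contained in AC, which also occurs 5 times; E, which appears 5 times, is contained in CE, which also occurs 5 times; and F, which occurs 3 times, is not closed because BF also appears 3 times.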
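The closedness test can also be checked by brute force. The sketch below is an illustration under a loud assumption: the seven transactions are a reconstruction that reproduces every pair, triple, and quadruple count listed in the answer, but their row order (and the names `transactions`, `support`, `is_closed`) is hypothetical, since the original table body did not survive.

```python
from itertools import combinations

# ASSUMPTION: reconstructed transactions, chosen to match the counts in the
# answer; the original table's row order is unknown.
transactions = [
    set("ABCEF"), set("ABCDE"), set("ACDE"), set("ACD"),
    set("ACE"), set("BCDEF"), set("BDF"),
]
ITEMS = "ABCDEF"
THRESHOLD = 2


def support(itemset):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


# Enumerate all 63 non-empty itemsets; feasible only because the data is tiny.
frequent = {}
for size in range(1, len(ITEMS) + 1):
    for combo in combinations(ITEMS, size):
        count = support(combo)
        if count >= THRESHOLD:
            frequent["".join(combo)] = count


def is_closed(itemset):
    """Closed: every immediate superset has a strictly smaller count."""
    others = [item for item in ITEMS if item not in itemset]
    return all(support(itemset + item) < frequent[itemset] for item in others)


closed = {s: c for s, c in frequent.items() if is_closed(s)}
for size in range(1, 5):
    print(size, sorted(s for s in closed if len(s) == size))
```

Enumerating every itemset is exponential in the number of items, so this approach only suits toy data of this size.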
3. Find the set of 2-shingles for the "document":
ABRACADABRA
and also for the "document":
BRICABRAC
Answer the following questions:
1. How many 2-shingles does ABRACADABRA have?
2. How many 2-shingles does BRICABRAC have?
3. How many 2-shingles do they have in common?
4. What is the Jaccard similarity between the two documents?
Answer: ABRACADABRA has 7 distinct 2-shingles: AB, BR, RA, AC, CA, AD, DA.
BRICABRAC also has 7 distinct 2-shingles: BR, RI, IC, CA, AB, RA, AC.
There are 5 shingles in common: AB, BR, RA, AC, CA.
As there are 7 + 7 - 5 = 9 different shingles in all, the Jaccard similarity is 5/9.
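The counts above can be reproduced with a short Python sketch. The helper names `shingles` and `jaccard` are illustrative; the only inputs are the two strings from the question.

```python
def shingles(text, k=2):
    """Return the set of distinct k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


s1 = shingles("ABRACADABRA")
s2 = shingles("BRICABRAC")

print(len(s1), len(s2))   # 7 7
print(sorted(s1 & s2))    # ['AB', 'AC', 'BR', 'CA', 'RA']
print(len(s1 | s2))       # 9
print(jaccard(s1, s2))    # 5/9, roughly 0.556
```

Using sets rather than lists matters here: repeated shingles such as AB, BR, and RA in ABRACADABRA are counted once.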
4. Consider the following matrix:
     C1  C2  C3  C4
R1    0   1   1   0
R2    1   0   1   1
R3    0   1   0   1
R4    0   0   1   0
R5    1   0   1   0
R6    0   1   0   0
Perform a minhashing of the data, with the order of rows: R4, R6, R1, R3, R5, R2. State the correct minhash value of each column.
Answer: Look at the rows in the stated order R4, R6, R1, R3, R5, R2, and for each row, make that row the minhash value of every column that has a 1 in that row and has not yet been assigned a minhash value. We start with R4, which has a 1 only in column C3, so the minhash value for C3 is R4.
Next, we consider R6, which has 1 in C2 only. Since C2 does not yet have a minhash value, R6 becomes its value.
Next is R1, with 1's in C2 and C3. However, both these columns already have minhash values, so we do nothing.
Next, consider R3. It has 1's in C2 and C4. C2 already has a minhash value, but C4 does not. Thus, the minhash value of C4 is R3.
When we consider R5 next, we see it has 1's in C1 and C3. The latter already has a minhash value, but R5 becomes the minhash value for C1. Since all columns now have minhash values, we are done: the minhash values are C1 = R5, C2 = R6, C3 = R4, and C4 = R3.
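The row-by-row procedure described in this answer can be sketched in Python as follows. The matrix and row order are taken from the question; the function name `minhash` and the dictionary layout are illustrative choices, not part of the original exercise.

```python
# Characteristic matrix from the question: rows R1..R6, columns C1..C4.
matrix = {
    "R1": [0, 1, 1, 0],
    "R2": [1, 0, 1, 1],
    "R3": [0, 1, 0, 1],
    "R4": [0, 0, 1, 0],
    "R5": [1, 0, 1, 0],
    "R6": [0, 1, 0, 0],
}
columns = ["C1", "C2", "C3", "C4"]
order = ["R4", "R6", "R1", "R3", "R5", "R2"]


def minhash(matrix, columns, order):
    """For each column, find the first row in `order` that has a 1 there."""
    result = {}
    for row in order:
        for col, bit in zip(columns, matrix[row]):
            if bit == 1 and col not in result:
                result[col] = row
        if len(result) == len(columns):
            break  # every column has its minhash value; later rows are irrelevant
    return result


print(minhash(matrix, columns, order))
# {'C3': 'R4', 'C2': 'R6', 'C4': 'R3', 'C1': 'R5'}
```

The early exit mirrors the observation that R2 never needs to be examined once all four columns have a value.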
5. Perform a hierarchical clustering of the following six points:
using the centroid proximity measure (distance between two clusters is the distance between their centroids). If you do this task correctly, you will find that there is a stage at which there is a tie for which pair of clusters is closest. Follow both choices. You will find that some sets of points are clusters in both cases, some sets are clusters in only one, and some are not clusters regardless of which choice you make.
Answer: First, A and B, being the closest pair of points, get merged. The centroid for this pair is at (5,5). The next closest pair of centroids is C and F, so these are merged and their centroid is at (24.5, 13.5). At this point, there is a tie for the closest centroids: AB and CF have centroids at distance sqrt(452.5), and so do D and CF. Thus, there are two possible third merges:
1. Merge AB and CF, giving three clusters ABCF, D, and E. The centroid of ABCF is (14.75, 9.25). In this case, the next merge is ABCF with E.
2. Merge CF with D, giving three clusters CDF, AB, and E. The centroid of CDF is at (27.33, 20). In this case, the next merge is E with AB.
As a result, the two sequences of clusters created are:
1. AB, CF, ABCF, ABCEF, ABCDEF.
2. AB, CF, CDF, ABE, ABCDEF.
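The centroid arithmetic in this answer can be checked with a small sketch. It uses only the two centroids stated in the answer, (5,5) for AB and (24.5, 13.5) for CF, since the coordinates of the individual points are not reproduced here; the helper names `merge_centroids` and `squared_distance` are illustrative.

```python
def merge_centroids(c1, n1, c2, n2):
    """Centroid of the union of two clusters, weighted by cluster size."""
    total = n1 + n2
    return tuple((a * n1 + b * n2) / total for a, b in zip(c1, c2))


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


ab = (5.0, 5.0)     # centroid of {A, B}, as stated in the answer
cf = (24.5, 13.5)   # centroid of {C, F}, as stated in the answer

print(squared_distance(ab, cf))        # 452.5
print(merge_centroids(ab, 2, cf, 2))   # (14.75, 9.25)
```

Weighting by cluster size is what keeps the merged centroid equal to the mean of all member points; with two clusters of equal size it reduces to the midpoint.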
6. Consider three Web pages with the following links:
