
Hierarchical Document Clustering: A hierarchical clustering technique generates a hierarchical structure of partitions. The structure has a single all-inclusive cluster at the top and singleton clusters of individual points at the bottom. It builds a tree structure by repeatedly merging the two nearest points from the bottom up; this structure is also known as a dendrogram. Hierarchical clustering algorithms are divided into two categories, namely agglomerative and divisive. The concept behind both techniques is the same; the only difference is that the agglomerative approach works bottom-up, whereas the divisive algorithm works top-down. In document clustering the agglomerative approach is commonly used. In the agglomerative approach, the clustering process starts from the bottom: the two most similar documents are merged together, then the next most similar document is merged, and this process is repeated until all the documents are merged into a single cluster. To compute the similarity of two documents, a standard distance measure is used.

Fig. 1 Hierarchical Clustering

Fig. 1 shows the hierarchical clustering process for 11 data points: the two nearest points are repeatedly merged with each other to form the hierarchical structure.

The Agglomerative Hierarchical Clustering Process: The agglomerative hierarchical clustering algorithm is briefly as follows.
1. Compute the similarity between all pairs of clusters using a similarity function, i.e., calculate a similarity matrix whose ij-th entry gives the similarity between the i-th and j-th clusters.
2. Merge the two clusters that are closest to each other.
3. Update the similarity matrix.
4. Repeat steps 2 and 3 until only a single cluster remains.

K-Means Document Clustering: K-Means is the most commonly used partitional clustering algorithm. It is an unsupervised learning algorithm which groups N documents into K clusters. K-Means is a centroid-based algorithm which defines k centroids, one for each cluster.
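The agglomerative merge loop from the previous section can be sketched in a few lines of Python. This is a minimal illustration on 1-D points; the single-linkage distance and the list-based bookkeeping are assumptions for the sketch, since the text leaves the distance measure open.

```python
# Minimal agglomerative (bottom-up) clustering sketch on 1-D points.
# Single-linkage distance is an illustrative assumption.

def cluster_distance(a, b):
    """Single-linkage distance: closest pair between two clusters."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points):
    """Merge the two nearest clusters until one remains; return merge history."""
    clusters = [[p] for p in points]          # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: find and merge the closest pair of clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        # Step 3: "update the similarity matrix" by rebuilding the cluster list.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

history = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0])
print(history[0])   # the two closest points merge first
```

With five points, four merges are performed before a single cluster remains, mirroring the dendrogram of Fig. 1.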
The distance of each document from each centroid is calculated; to compute this distance, the K-Means algorithm commonly uses the Euclidean distance. Based on these distances, documents are re-clustered. This process is repeated until some predefined stopping criterion is met; for example, the change between iterations should fall below some threshold value ε. The K-Means algorithm is simple to implement.

Fig. 2 K-Means Clustering

The figure above shows K-Means clustering with four clusters and their centroids.

The K-Means Algorithm Process: The following steps give the process of K-Means clustering in brief.
· Initially the document set is partitioned into K clusters; each document is randomly assigned to one of these clusters.
· Calculate the centroid of each cluster.
· For each document, calculate its distance to each cluster using the similarity measure. If the document is nearest to its own cluster, leave it where it is; otherwise, move it into its nearest cluster.
· Repeat the above step until the clusters are stable, i.e., no document moves from one cluster to another; at this point the clustering process stops.

Difference between K-Means and Hierarchical Clustering:
· K-Means clustering handles big data more efficiently than hierarchical clustering.
· As K-Means clustering begins by initializing the clusters randomly, its results may vary when the same algorithm is executed more than once; hierarchical clustering produces consistent results.
· K-Means works most effectively when the shape of the clusters is hyper-spherical.
· When implementing the K-Means clustering algorithm, the number of clusters to be formed must be specified in advance.

Fuzzy C-Means Clustering: Fuzzy C-Means is the most popular soft clustering algorithm. It is a variation of the hard K-Means algorithm.
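The K-Means loop described above can be sketched as follows. This is a minimal illustration on 2-D points with Euclidean distance; the fixed starting centroids are an assumption made to keep the run deterministic, whereas the text initializes the assignments randomly.

```python
import math

# Minimal K-Means sketch (2-D points, Euclidean distance).
# Fixed initial centroids are an illustrative assumption.

def euclidean(p, q):
    return math.dist(p, q)

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assign each document/point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; stop when they no longer move (stable clusters).
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

On this toy data the two centroids settle at the means of the two visible groups after a couple of iterations.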
In K-Means, each document belongs to exactly one cluster. As a result, two points which are close to each other may belong to two different clusters (see Fig. 2). We can easily imagine that there must be some relationship between two such data points; hence they should fall in both clusters, which leads to overlapping. There must be some methodology which permits a point to fall in more than one cluster, and this is achieved by a soft computing technique known as Fuzzy C-Means. Fuzzy clustering allows a data point to fall in more than one cluster; specifically, each data point belongs to every cluster, and the degree of belongingness of each data point to a cluster is represented by a membership value assigned to it. As the Fuzzy C-Means algorithm is a variation of the K-Means algorithm, it works in a similar fashion.

The Fuzzy C-Means Algorithm Process: The following steps give the process of Fuzzy C-Means clustering in brief.
· Initialize a fuzzy membership partition, i.e., initialize U by assigning some membership value to each data point.
· Calculate the centroid of each cluster.
· Compute the distance of each data point from the centroids and update the fuzzy membership matrix.
· If the difference between two successive fuzzy membership matrices is less than some threshold ε, stop; otherwise repeat the above process.

The detailed fuzzy clustering algorithm is studied further. The major advantage of fuzzy clustering is that it permits overlapping. The major drawback of the fuzzy clustering algorithm FCM is that it imposes the restriction that the sum of the membership values of a data point xi over all clusters must be equal to 1. This causes a problem with outlier points, i.e., points that do not fall in any cluster: the constraint may assign high membership values to such outliers, which misguides the clustering.
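The FCM update loop above can be sketched as follows on 1-D data. The fuzzifier m = 2 and the threshold eps (standing in for the ε mentioned above) are assumed values; the membership update is the standard FCM formula u_ik = 1 / Σ_j (d_ik / d_jk)^(2/(m-1)), so each point's memberships sum to 1.

```python
# Fuzzy C-Means sketch on 1-D data; m and eps are illustrative assumptions.

def update_memberships(points, centroids, m=2.0):
    """u[k][i] = membership of point k in cluster i; each row sums to 1."""
    u = []
    for x in points:
        dists = [abs(x - v) or 1e-12 for v in centroids]   # avoid divide-by-zero
        row = [
            1.0 / sum((dists[i] / dists[j]) ** (2.0 / (m - 1.0))
                      for j in range(len(centroids)))
            for i in range(len(centroids))
        ]
        u.append(row)
    return u

def update_centroids(points, u, m=2.0):
    """v_i = Σ_k u_ik^m x_k / Σ_k u_ik^m."""
    return [
        sum((u[k][i] ** m) * points[k] for k in range(len(points)))
        / sum(u[k][i] ** m for k in range(len(points)))
        for i in range(len(u[0]))
    ]

def fcm(points, centroids, m=2.0, eps=1e-6, max_iter=100):
    for _ in range(max_iter):
        u = update_memberships(points, centroids, m)
        new_centroids = update_centroids(points, u, m)
        if max(abs(a - b) for a, b in zip(centroids, new_centroids)) < eps:
            break
        centroids = new_centroids
    return centroids, u

centroids, u = fcm([1.0, 1.2, 5.0, 5.2], centroids=[0.0, 6.0])
print(centroids)
```

Every point receives a nonzero membership in both clusters, which is exactly the overlapping behaviour the text describes.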
The second problem is that the membership values are interdependent, which may produce ambiguous results. Possibilistic C-Means (PCM) and Possibilistic Fuzzy C-Means (PFCM) are two variations of the FCM algorithm.

Possibilistic C-Means Clustering: The FCM algorithm is incapable of handling noisy data. For example, suppose we have formed two clusters of N data elements and some data element xi is at the same distance from both cluster centers. In this case the FCM algorithm assigns it an equal membership value of 0.5 for both clusters, even though in reality the element is not close to either cluster. This happens due to the FCM restriction that the sum of membership values must be equal to 1. To overcome this problem of noisy data misguiding the clustering results, Krishnapuram and Keller proposed a new clustering algorithm known as Possibilistic C-Means (PCM). In the PCM algorithm, the constraint that the membership values sum to 1 is relaxed; each value may lie anywhere in the interval 0-1 [10]. In the PCM algorithm, each cluster is independent of the other clusters, and the objective function is based on the typicality values of the data points to the clusters. Implementing PCM with typicality values removes the FCM constraint that each column of U must sum to one; with typicality values we get a matrix whose rows and columns are independent. PCM produces low typicality values for outliers and automatically eliminates these noise points. The drawback of PCM is that, since the columns and rows of the typicality matrix are independent of each other, it may generate coincident clusters.

Fuzzy Possibilistic C-Means Clustering: The FPCM algorithm, a hybridization of FCM and PCM, is designed to overcome the problem of coincident clusters.
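To illustrate how PCM typicality behaves, here is a sketch of the Krishnapuram-Keller typicality formula t = 1 / (1 + (d²/η)^(1/(m-1))); the scale η = 1 and fuzzifier m = 2, along with the centroid positions, are assumed values for the example, not taken from the text.

```python
# PCM typicality sketch: unlike an FCM membership row, typicality values
# need not sum to 1, so a noisy point far from every centroid simply gets
# low typicality everywhere. eta and m are illustrative assumptions.

def typicality(dist_sq, eta, m=2.0):
    """Krishnapuram-Keller typicality for one point-cluster distance."""
    return 1.0 / (1.0 + (dist_sq / eta) ** (1.0 / (m - 1.0)))

centroids = [1.0, 5.0]
eta = 1.0

def row(x):
    """Typicality of point x to each of the two clusters."""
    return [typicality((x - v) ** 2, eta) for v in centroids]

near = row(1.1)     # close to the first centroid: high typicality there
noise = row(50.0)   # outlier: low typicality to both clusters
print(near, noise)
```

The outlier's typicalities are near zero for both clusters, which is how PCM "automatically eliminates" noise points, while FCM would have been forced to spread a full unit of membership over them.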
It incorporates features of both Fuzzy C-Means and Possibilistic C-Means, using the fuzzy membership values of FCM and the typicality values of PCM to obtain a better clustering model. This algorithm imposes the constraint that the sum of the typicality values of all data points to a cluster must be equal to one, which may lead to problems for big data sets [10].
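A minimal sketch of the FPCM updates just described, assuming m = 2 and η = 2: the typicalities of each cluster are normalized over the data points (the sum-to-one constraint the text flags as problematic for large data sets), and the centroid update weights each point by u^m + t^η.

```python
# FPCM sketch: the centroid update blends FCM memberships (u) with
# typicalities (t) normalised over the data points of each cluster.
# m, eta, and the example membership column are illustrative assumptions.

def fpcm_typicalities(points, centroid, eta=2.0):
    """t_k for one cluster, normalised so the values sum to 1 over all points."""
    d = [abs(x - centroid) or 1e-12 for x in points]
    w = [(1.0 / dk) ** (2.0 / (eta - 1.0)) for dk in d]
    s = sum(w)
    return [wk / s for wk in w]

def fpcm_centroid(points, u_col, t_col, m=2.0, eta=2.0):
    """v_i = sum_k (u_k^m + t_k^eta) x_k / sum_k (u_k^m + t_k^eta)."""
    weights = [u ** m + t ** eta for u, t in zip(u_col, t_col)]
    return sum(w * x for w, x in zip(weights, points)) / sum(weights)

points = [1.0, 1.2, 5.0]
t = fpcm_typicalities(points, centroid=1.1)
v = fpcm_centroid(points, u_col=[0.9, 0.9, 0.1], t_col=t)
print(t, v)
```

Because the typicalities of one cluster must share a single unit of mass across all N points, each individual t_k shrinks as N grows; this is the scaling problem with big data sets noted above.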
