This work uses a randomly sub-sampled version of the 20 Newsgroups data set for training and testing. The sub-sample allocates 75% of the data (14,135 documents) to training the machine learning algorithms and the remaining 25% (4,711 documents) to testing their accuracy by comparing the predicted output with the actual output.
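The paper does not include its sub-sampling code; as a minimal sketch, a comparable split can be produced from scikit-learn's copy of the corpus (the random_state value is an assumption, and rounding in train_test_split may shift the counts by a document):

    # 75/25 split of 20 Newsgroups (18,846 documents in total)
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split

    corpus = fetch_20newsgroups(subset='all')
    train_docs, test_docs, train_labels, test_labels = train_test_split(
        corpus.data, corpus.target, test_size=0.25, random_state=42)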
This article uses the open-source tool Word2vec to convert the text data into vectors of scores. Word2vec is a well-established algorithm for measuring syntactic and semantic word similarity: it represents each word as a mathematical vector, a distributed numerical representation of word features such as the contexts in which the word appears. The similarity between two words is computed as the cosine similarity of their vectors (the dot product of the length-normalised vectors). Word2vec thereby groups the vectors of similar words together in vector space, detecting similarity mathematically.
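As an illustration of the similarity measure, cosine similarity is the dot product of the two vectors after length normalisation (the toy vectors below are made up purely for the example):

    import numpy as np

    def cosine_similarity(v1, v2):
        # dot product of the L2-normalised vectors
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    # Toy 3-dimensional "word vectors", for illustration only.
    v_car, v_truck = np.array([0.9, 0.1, 0.4]), np.array([0.8, 0.2, 0.5])
    print(cosine_similarity(v_car, v_truck))  # close to 1.0: similar words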
The parameters used by the Word2vec model to generate word vectors were chosen by observing the output against the requirements of this paper. They are:

- size: the dimensionality of each word vector, set to 10
- window: the context window size, which pulls the vectors of co-occurring words closer, set to 5
- min_count: ignores words below this frequency threshold, which helps reduce noise in the semantic space, set to 5
- workers: the number of CPU cores to be used, set to 4
- sg: the training architecture; set to 0, which selects the CBOW architecture of the Word2vec model
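In gensim these settings translate into the sketch below. This is a hypothetical configuration, not the paper's exact code: gensim >= 4.0 renames size to vector_size, the whitespace tokenisation stands in for whatever preprocessing the paper applied, and train_docs is the training split from the earlier sketch:

    from gensim.models import Word2Vec

    sentences = [doc.lower().split() for doc in train_docs]
    model = Word2Vec(sentences,
                     vector_size=10,  # dimensionality of each word vector ("size")
                     window=5,        # context window size
                     min_count=5,     # ignore words occurring fewer than 5 times
                     workers=4,       # CPU cores used for training
                     sg=0)            # 0 selects the CBOW architecture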
While implementing the NB classifier, the smoothing parameter alpha is set to 0.1. In Linear SVC, the parameter C is assigned the value 5.0; C is essentially a regularization parameter that controls the trade-off between achieving a low error on the training data and keeping the model weights small. In the decision tree classifier, the splitting criterion is left at its default (gini), while max depth is set to 3 and the minimum samples per leaf to 5. In the random forest, the number of estimators (trees) is set to 100 and max depth, which bounds the depth of each tree, is set to 2. A configuration sketch collecting these settings is given at the end of the next section.

Evaluating Results

In Word2vec, a distributed representation of each word is used. As the neural network reads through document after document, the learned word vectors come to display very interesting relationships with one another. The basic idea is that semantic vectors, such as the ones provided by Word2vec, should preserve most of the relevant information about a text while having relatively low dimensionality, which allows better machine learning treatment than a straight one-hot encoding of words. The learned word representations are in fact found to capture meaningful syntactic and semantic regularities in a very simple way. Word vectors with such semantic relationships can improve many existing Natural Language Processing (NLP) applications, such as machine translation, information retrieval and question answering systems, and may enable future applications yet to be invented. All of these applications rely on word embeddings, numerical representations of text that computers can handle. Word2vec creates representations of words that capture their meanings, their semantic relationships and the different types of contexts they are used in. Similar words are encoded into similar vector representations, and the algorithm ignores words that occur below a certain number of times. The resulting models are fast to train and provide good performance and accuracy.
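These regularities can be inspected directly on a trained model; for example, with the gensim model sketched earlier (the query words are illustrative and the output depends entirely on the training corpus):

    # Nearest neighbours of a word in the learned vector space.
    print(model.wv.most_similar('space', topn=5))
    # Cosine similarity between two specific words.
    print(model.wv.similarity('hockey', 'baseball'))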
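For reference, the classifier settings listed before this section can be written as scikit-learn estimators. This is a configuration sketch only: X_train, X_test, y_train and y_test stand for whatever document features the pipeline produces, which the paper does not spell out, and MultinomialNB requires non-negative features, so it suits count-based features rather than raw word-vector averages:

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    classifiers = {
        'Naive Bayes':   MultinomialNB(alpha=0.1),
        'Linear SVC':    LinearSVC(C=5.0),
        'Decision Tree': DecisionTreeClassifier(criterion='gini',
                                                max_depth=3,
                                                min_samples_leaf=5),
        'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=2),
    }
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)               # assumed feature matrices
        print(name, clf.score(X_test, y_test))  # accuracy on the test split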
