1.

What are word embeddings? Can you talk about some state-of-the-art techniques for word embeddings?

Answer»

Wikipedia defines word embedding as a collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Word embeddings are a way to transform the words in a text into numerical vectors so that they can be analysed by standard machine learning algorithms that require numerical vectors as input.

Vectorisation can be done in many ways, for example one-hot encoding, Latent Semantic Analysis (LSA), and TF-IDF (Term Frequency-Inverse Document Frequency). However, these representations capture a somewhat different, document-centric idea of semantic similarity.
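For concreteness, here is a minimal sketch of two of these document-centric vectorisations using scikit-learn (the library choice and the two example sentences are illustrative additions, not part of the original answer; scikit-learn 1.0 or later is assumed for get_feature_names_out):

  # Requires scikit-learn; these produce the sparse, document-centric
  # representations mentioned above.
  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

  docs = ["Paris is the capital of France.",
          "Berlin is the capital of Germany."]

  # Count-based bag-of-words: one column per vocabulary term.
  bow = CountVectorizer()
  X_bow = bow.fit_transform(docs)
  print(bow.get_feature_names_out())
  print(X_bow.toarray())

  # TF-IDF reweights the counts so terms shared by every document
  # (e.g. "capital") contribute less than terms specific to one document.
  tfidf = TfidfVectorizer()
  X_tfidf = tfidf.fit_transform(docs)
  print(X_tfidf.toarray().round(2))

Note that each vector here describes a whole document rather than a single word, which is exactly the document-centric view contrasted with distributed word representations below.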

Distributed Representation:

Distributed representations attempt to capture the meaning of a word by considering its relations with other words in its context. The idea is captured in this quote from J. R. Firth, the linguist who first proposed it: "You shall know a word by the company it keeps" (for more background, see Document Embedding with Paragraph Vectors, by Andrew M. Dai, Christopher Olah, and Quoc V. Le, arXiv:1507.07998, 2015).

Consider the following pair of sentences: 

Paris is the capital of France. Berlin is the capital of Germany. 

Even assuming you have no knowledge of world geography (or English, for that matter), you would still conclude without too much effort that the word pairs (Paris, Berlin) and (France, Germany) are related in some way, and that the corresponding words in each pair are related to each other in the same way, that is:

Paris : France :: Berlin : Germany 

Thus, the aim of distributed representations is to find a general transformation function φ that converts each word to its associated vector such that relations of the following form hold true:

φ("Paris") - φ("France") ≈ φ("Berlin") - φ("Germany")

Word2vec: 

The word2vec group of models was created in 2013 by a team of researchers at Google led by Tomas Mikolov. The models are essentially unsupervised: they take a large corpus of text as input and produce a vector space of words. The dimensionality of the word2vec embedding space is usually much lower than that of the one-hot embedding space, which equals the size of the vocabulary, and the word2vec embeddings are dense rather than sparse.
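To make the contrast concrete, here is a small NumPy sketch comparing a one-hot vector (dimension equal to the vocabulary size, almost all zeros) with a dense embedding row; the toy vocabulary, the dimension d, and the random values are illustrative only, since word2vec would learn the dense values from a corpus:

  import numpy as np

  vocab = ["paris", "france", "berlin", "germany", "capital"]
  V = len(vocab)   # vocabulary size (5 here; tens of thousands in practice)
  d = 3            # embedding dimension (toy value; word2vec typically uses 100-300)

  # One-hot: one V-dimensional sparse vector per word (a single 1, the rest 0s).
  one_hot = np.eye(V)
  print(one_hot[vocab.index("paris")])           # [1. 0. 0. 0. 0.]

  # Dense embedding: a V x d matrix of real numbers; each row is a word vector.
  # (Random here; word2vec learns these values from text.)
  embedding_matrix = np.random.default_rng(0).normal(size=(V, d))
  print(embedding_matrix[vocab.index("paris")])  # e.g. [ 0.126 -0.132  0.640]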

The two architectures for word2vec are as follows: 

  • Continuous Bag Of Words (CBOW) 
  • Skip-gram 

In the CBOW architecture, the model predicts the current word given a window of surrounding context words; the order of the context words does not influence the prediction (the bag-of-words assumption). In the skip-gram architecture, the model predicts the surrounding context words given the centre word. According to the authors, CBOW is faster to train, but skip-gram does a better job of representing infrequent words.
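As a rough illustration, the sketch below uses gensim's Word2Vec implementation (4.x API assumed), where the sg flag switches between the two architectures; the two-sentence corpus is far too small to learn meaningful vectors and is only there to show the calls:

  # Requires gensim (4.x API assumed).
  from gensim.models import Word2Vec

  sentences = [["paris", "is", "the", "capital", "of", "france"],
               ["berlin", "is", "the", "capital", "of", "germany"]]

  # sg=0 -> CBOW: predict the centre word from its context window.
  cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

  # sg=1 -> skip-gram: predict the context words from the centre word.
  skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

  print(cbow.wv["paris"].shape)                     # (50,) dense vector
  print(skipgram.wv.most_similar("paris", topn=3))  # nearest neighbours in the toy space

In practice the corpus would contain millions of sentences, and vector_size is typically set somewhere in the range of 100 to 300.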


