NLP Lecture 4 (cs224d)

Usually, we have a training dataset made up of the following:
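The larger the context you use, the more of the word order you ignore, and the less you can tell whether a word was in the position of an adverb, an adjective, or a noun. As an analogy, when a person reads a passage, the longer the passage is, the less the order of individual words matters: swapping one or two words does not change the meaning of the whole passage. The exact spelling of one or two words matters just as little, since a couple of misspellings do not change the overall meaning either.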
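![](http://s9.sinaimg.cn/middle/002RSgYjzy78bBeRGdi38&690)

W holds the softmax weights. The y in the numerator is the index of the ground-truth class, and the denominator sums the values over all possible classes, so the result is a probability.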
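
The softmax probability described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `softmax_prob`, the layout of `W` as one row of weights per class, and all numeric values are assumptions made for the example, not something taken from the lecture.

```python
import math

def softmax_prob(W, x, y):
    """Return p(y | x) for a softmax classifier whose row W[c] scores class c."""
    # Score of each class c: dot product of the weight row W[c] with the input x.
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    # Subtract the largest score before exponentiating, for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    # Numerator: the ground-truth class y. Denominator: sum over all classes.
    return exps[y] / sum(exps)

# Made-up weights for 3 classes over a 2-dimensional input.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [2.0, 1.0]
probs = [softmax_prob(W, x, c) for c in range(len(W))]
```

The class scores here are 2, 1 and 3, so the third class receives the highest probability, and the three probabilities sum to 1.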
- The word vector matrix L is also called the lookup table.
- Word vectors = word embeddings = word representations
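
The name "lookup table" can be illustrated with a small sketch. It assumes that L stores one column per vocabulary word (so L has one row per embedding dimension); the vocabulary, the values, and the helper names `lookup` and `one_hot_product` are hypothetical.

```python
# Hypothetical 3-word vocabulary and a 2 x 3 matrix L (one column per word).
vocab = ["the", "cat", "sat"]
L = [
    [0.1, 0.4, 0.7],  # dimension 0 of every word vector
    [0.2, 0.5, 0.8],  # dimension 1 of every word vector
]

def lookup(L, index):
    """Return column `index` of L, i.e. the vector of the word with that index."""
    return [row[index] for row in L]

def one_hot_product(L, index):
    """Multiply L by a one-hot vector; this selects the same column as lookup."""
    e = [1.0 if j == index else 0.0 for j in range(len(L[0]))]
    return [sum(l_ij * e_j for l_ij, e_j in zip(row, e)) for row in L]

cat_vec = lookup(L, vocab.index("cat"))
```

Both functions return the same vector; the direct column read simply avoids the multiplication, which is why L is treated as a table to look things up in.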
![](http://s9.sinaimg.cn/middle/002RSgYjzy78bDHMMKA18&690)

Only the decision boundary needs to be updated:

![](http://s11.sinaimg.cn/middle/002RSgYjzy78bDHPqqeba&690)

For deep learning, both W and the word vectors x have to be learned:

![](http://s8.sinaimg.cn/middle/002RSgYjzy78bDHNDddd7&690)

If you only have a small training data set, don't train the word vectors.
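
To make the difference between updating only W and updating both W and x concrete, here is a short derivation sketch. It assumes a cross-entropy loss J on a single example, with p the vector of softmax probabilities and t the one-hot vector of the ground-truth class; the symbols J, p and t are introduced here for illustration.

```latex
p = \mathrm{softmax}(Wx), \qquad J = -\log p_y

\frac{\partial J}{\partial W} = (p - t)\,x^{\top}
\qquad
\frac{\partial J}{\partial x} = W^{\top}(p - t)
```

Updating only the decision boundary uses the first gradient alone. Training the word vectors as well also applies the second gradient to x.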
Without window classification, ambiguity problems arise easily.

How is window classification implemented? Instead of classifying a single word, just classify a word together with its context window of neighboring words. Assign a label to the center word, then concatenate the word vectors of all the words around it to form one longer vector.

How is the window then classified? Still with softmax, except that the input is no longer just the center word vector but the concatenation of all the word vectors in the window.
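
The concatenation step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `window_vector` and `softmax`, the 2-dimensional vectors, the window size of 1, and the weights `W` are all made up, and padding at the sentence boundaries is not handled.

```python
import math

def window_vector(vectors, center, size):
    """Concatenate the vectors of the words in [center - size, center + size]."""
    window = []
    for i in range(center - size, center + size + 1):
        window.extend(vectors[i])
    return window

def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 2-dimensional vectors for a 5-word sentence.
sentence_vectors = [[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.2, 0.6], [0.0, 0.3]]
x_window = window_vector(sentence_vectors, center=2, size=1)  # 3 words * 2 dims

# Hypothetical weights: 2 classes, one weight per entry of the window vector.
W = [[0.5, -0.2, 0.1, 0.3, 0.0, -0.1],
     [-0.3, 0.4, 0.2, -0.1, 0.1, 0.2]]
scores = [sum(w * v for w, v in zip(row, x_window)) for row in W]
probs = softmax(scores)  # distribution over labels for the center word
```

The classifier itself is unchanged; only the input grows from one word vector to the concatenated window vector, so each row of `W` needs as many weights as the window vector has entries.
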
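- A single neuron is essentially a binary logistic regression unit; a layer of neurons runs several logistic regressions at the same time.
- Adding more output layers makes the structure more complex and the model more capable.
- Adding one or more hidden layers makes it more powerful still.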
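
The bullets above can be illustrated with a tiny forward pass. This sketch assumes sigmoid units and one hidden layer; the function names `sigmoid` and `layer`, the layer sizes, and every weight value are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    """One layer: each row of W, with its bias, acts as one logistic unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

x = [1.0, 0.0]                                   # made-up input
W1 = [[0.5, -0.5], [0.3, 0.8], [-0.2, 0.1]]      # hidden layer: 3 units
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, -1.0, 0.5]]                          # output layer: 1 unit
b2 = [0.0]

hidden = layer(W1, b1, x)       # three logistic units run side by side
output = layer(W2, b2, hidden)  # the output unit reads the hidden activations
```

Stacking more layers only means calling `layer` again on the previous result.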
There is plenty of material on backpropagation (BP) online. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.
Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers.
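
As a rough illustration of "repeatedly adjusting the weights", here is a minimal sketch of backpropagation on the smallest possible network: one input, one hidden sigmoid unit and one output sigmoid unit. The squared-error loss, the learning rate, the starting weights and all function names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)  # hidden unit
    y = sigmoid(w2 * h)  # output unit
    return h, y

def loss(w1, w2, x, target):
    """Squared difference between actual and desired output."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def gradients(w1, w2, x, target):
    """Propagate the output error backwards to both weights (chain rule)."""
    h, y = forward(w1, w2, x)
    delta_out = (y - target) * y * (1.0 - y)       # error at the output unit
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * h * (1.0 - h)  # error passed to the hidden unit
    grad_w1 = delta_hidden * x
    return grad_w1, grad_w2

w1, w2, x, target, lr = 0.5, -0.3, 1.0, 1.0, 0.5   # made-up starting point
initial_loss = loss(w1, w2, x, target)
for _ in range(200):
    g1, g2 = gradients(w1, w2, x, target)
    w1 -= lr * g1  # adjust each weight against its gradient
    w2 -= lr * g2
final_loss = loss(w1, w2, x, target)
```

After the loop, `final_loss` is lower than `initial_loss`. The analytic gradients can be checked against finite differences of `loss` if in doubt.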