An Empirical Study of Smoothing Techniques for Language Modeling http://www.aclweb.org/anthology/P96-1041
Scalable Modified Kneser-Ney Language Model Estimation https://kheafield.com/papers/edinburgh/estimate_paper.pdf
A Neural Probabilistic Language Model http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Efficient Estimation of Word Representations in Vector Space https://arxiv.org/pdf/1301.3781.pdf
Best tutorial: Understanding LSTM Networks http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Network Regularization https://arxiv.org/pdf/1409.2329.pdf
Best survey: An Overview of Gradient Descent Optimization Algorithms http://ruder.io/optimizing-gradient-descent/index.html#minibatchgradientdescent
Random Search for Hyper-Parameter Optimization http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
A good answer from Cross Validated (Stack Exchange) on selecting hyperparameters in deep learning https://stats.stackexchange.com/questions/95495/guideline-to-select-the-hyperparameters-in-deep-learning