Word and character based language modeling using LSTM
Anubhab Majumdar
Department of Computer Science, North Carolina State University

Abstract— Language is “...a systematic means of communicating ideas or feelings by the use of conventionalized signs, sounds, gestures, or marks having understood meanings” [2]. In natural language processing, we try to come up with algorithms that can parse information, facts and sentiments quickly, efficiently and accurately from written text or speech. The fundamental problem in solving any of these tasks is understanding the language itself, specifically its semantics and grammar. Language modeling addresses this key issue - it is a statistical measure of whether a sequence of words semantically makes sense in a particular language [3]. In this report we look at how we can model the English language using a deep learning architecture called the Recurrent Neural Network (RNN) with Long Short Term Memory (LSTM) units. Two modeling approaches are explored - one at the word level and another at the character level. The report compares and contrasts the two approaches through exhaustive experiments and identifies the trade-offs and limitations of both.

I. INTRODUCTION
Machine learning is being used extensively to solve some of the most complex problems involving images, speech and written text. Deep neural networks have been employed to solve these problems in the fields of computer vision, speech recognition and natural language processing. Therefore, it is only natural that deep neural networks would be used to try to solve the problem of language modeling. This section introduces the problem we are trying to solve and the approaches we will use to tackle it.

A. Problem Statement
As mentioned earlier, a fundamental approach to solving many NLP problems is to develop an algorithm that models a particular language. We should be able to encapsulate the semantics and grammar of a language as a statistical model. The model should be able to generate a sequence of words based on one or more words as input, and the generated sequence should be meaningful and correct in terms of the rules of the language. But this is an extremely difficult task: languages have complicated and intricate sets of rules, a large repertoire of characters and words, and a wide variety of exceptions to the general rules. Moreover, one language can have variations of the same set of rules, and all of this needs to be taken into account while building the model. Another major challenge is to come up with an exhaustive training set. As mentioned already, a language has millions of distinct words and thousands of intricate rules and exceptions. Humans learn a language by interacting and by learning words and grammar. For training a machine, we need to build a corpus of sentences that uses most of the words present in the language, most of the rules of the language, and the exceptions. This is necessary because the model needs to learn all these words and rules from the available corpus. All these constraints need to be kept in mind while building a dataset.

B. Approaches
The report explores deep neural network architectures capable of handling textual content. One key distinction of textual content from something like an image is variable length - the architecture should be able to handle variable-length sentences. Secondly, the architecture should take into account the context of the input sequence. This is especially important for long sentences, because the next sequence of words may depend on what was said at the very beginning of the sentence. Finally, the output of the network should be continuous: given an input of a certain length, we expect the network to be able to churn out a continuous sequence of output (words or characters). Given these requirements, we employ a Recurrent Neural Network (RNN) architecture. RNN can handle continuous and variable-length input. Also, RNN maintains the context of the sentence using feedback from the previous time step.

Lastly, RNN is capable of producing continuous output when provided any sequence of words as input. RNNs are well suited to our sequential time-series data, but they are not suitable for long sentences. Therefore, to improve performance, we use Long Short Term Memory (LSTM) units in the RNN architecture. They improve RNN performance by retaining context over long periods of time. The RNN was trained at two different levels of granularity - word based and character based. As the names suggest, in the word based model the network predicts the next word at every time step, whereas in the character based model the network predicts the next character at every time step. There are significant differences and trade-offs between the two approaches, and the report explores them through rigorous experiments.

II. RELATED WORKS
Language modeling has been an active field of work for a long time. Various approaches have been taken to solve this complex task, and in this section we take a look at some of these approaches and briefly discuss their methodology and results. Brown et al. tried to model the English language using n-gram models. The paper also “...discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words” (location - Abstract) [6]. Mnih et al. describe “...three new probabilistic language models that use distributed word representations to define the distribution of the next word in a sequence given several preceding words” (location - Introduction) [7]. The paper describes an undirected model, a model with temporal connections and binary variables, and a model that uses linear functions to predict the next word given a sequence of words as input. The paper compares the models and shows that one of them outperforms the state-of-the-art n-gram model available. Sundermeyer et al. used LSTM units with RNN to “...gain considerable improvements in WER on top of a state-of-the-art speech recognition system” (location - Abstract) [5]. The paper describes how neural networks are limited in accuracy due to their inability to handle continuous sequences of data and how the RNN architecture is a possible solution. However, RNN is difficult to train, and this is overcome with LSTM units. Kim et al. describe an approach where the modeling is done at the character level instead of the word level [4].

The authors “...employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model” (location - Abstract) [4]. They achieve state-of-the-art accuracy on the English Penn Treebank dataset, a standard dataset used to benchmark NLP models. Interestingly, while modeling certain languages like Spanish and Russian, the model performs exceptionally well compared to word-level language models.

III. DEEP NEURAL NETWORK APPROACH TO LANGUAGE MODELING
In this project, we employ an RNN for modeling the English language. However, RNN is difficult to train and is limited in handling long sequences, so LSTM units are used to overcome this shortcoming. We also experiment with two different approaches - word level and character level models. In this section we discuss the theory and mathematics involved in RNN and LSTM.

A. Neural Networks
A standard representation of a neural network is shown in Figure 1. It consists of a fixed-length input layer, multiple hidden layers with varying numbers of nodes and non-linear activation functions, and an output layer with a softmax function to predict the class of the input instance. We perform a feed-forward pass to predict the class and train the network using backpropagation. With sufficient training instances and by varying the hyperparameters (input layer size, number of hidden layers, size of hidden layers, activation functions, etc.), neural networks can be trained to produce excellent results on a variety of pattern recognition problems.

Fig. 1. Neural Network (source - http://neuralnetworksanddeeplearning.com/images/tikz11.png) [8]

However, this standard neural network architecture is not suitable for language modeling. Firstly, textual input for language modeling has varied length. We prefer to train a neural network with sentences, and not all sentences are of a fixed length. Therefore, to train a neural network, we would need to cut the sentences to a fixed length and thereby lose context. This would affect the accuracy of the model. Secondly, neural networks fail to contextualize over multiple inputs. Each training instance is treated as discrete, and the output depends solely on the current input. However, this may not be the case for language modeling. For example, while trying to predict the next word, we may need to remember which words were encountered beforehand.

B. Recurrent Neural Network
A Recurrent Neural Network (RNN) is designed to handle inputs that appear as sequences over time, such as speech, text or video. For any such type of data, the output may depend not only on the current instance but also on any number of past instances encountered. Thus, we need feedback from previous instances, besides the current instance, to make accurate predictions. This is what RNN enables us to do. A pictorial representation of RNN is shown in Figure 2. It shows the working of an RNN unrolled in time.

Fig. 2. Recurrent Neural Network (source - http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png) [9]

RNN addresses all the problems of the vanilla neural network architecture and is a suitable candidate for the purpose of language modeling. However, RNN has some severe limitations when handling long sequences. The weights of any neural network are tuned using backpropagation. For RNN, however, we backpropagate over multiple instances instead of just one, because we incorporate the feedback from the last instance while performing the feed-forward pass. This operation is problematic for gradient descent because it causes either the exploding gradient or the vanishing gradient problem [11], [12], [13]. Exploding gradients can be handled by thresholding the gradient's value [11], but the vanishing gradient is a severe problem. This handicaps RNN in remembering long sequences.

C. Long Short Term Memory
To address the problem of vanishing gradients in RNN, Hochreiter et al. came up with the Long Short Term Memory cell [1]. It is depicted in Figure 3.

Fig. 3. Long Short Term Memory (source - https://deeplearning4j.org/img/greff_lstm_diagram.png) [10]

To retain past instances, the LSTM cell incorporates a memory unit and allows three operations on it - read, write and forget. The operations are controlled through gates. However, unlike digital gates, which can have values of either 0 or 1, here the gates take values in the range [0, 1]. This allows the LSTM cell to retain information as needed and prevents both the exploding and vanishing gradient problems. The gate values are produced by logistic sigmoid functions [1], [11], [12], [13]. LSTM units can be used as nodes to create an RNN architecture. Hochreiter et al. have shown through various experiments that such an architecture produces much better results than a non-LSTM RNN architecture [1]. The only disadvantage is that LSTM units introduce more parameters to train - this requires more computational power, training data and time.
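For reference, the gating just described can be written out explicitly. The equations below are the commonly used LSTM formulation with a forget gate (the notation is ours, not taken from the report or its figures): given input x_t, previous hidden state h_{t-1} and previous cell state c_{t-1},

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input/write gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output/read gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

Here \sigma is the logistic sigmoid and \odot denotes element-wise multiplication; the sigmoid keeps each gate value in [0, 1], which is exactly the behaviour described above.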

D. Word based language modeling
In a word based language model, we expect the LSTM network to output a word when provided with a sequence of input words. The granularity level of such a model is the word. The approach to train such a network is as follows [15]:
1) Start with a corpus of text with N unique words. Fix a window of size t, where each cell of the window encompasses a word.
2) Initialize an LSTM network with random weights, with t input nodes and N output nodes.
3) Feed forward the t words of the window through the LSTM network to predict the (t + 1)th word.
4) Use the actual (t + 1)th word in the text corpus to calculate the error in prediction. Use backpropagation to tune the weights of the LSTM network.
5) Move the window by one word. GOTO step 3.

Given enough parameters in the network and a sufficient text corpus, the network will be trained such that it can predict the next word when provided a sequence of words as input. The actual implementation differs a bit from this pseudo code, as shown in the sketch below, but the essential idea remains the same.
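Purely as an illustration, a minimal sketch of the sliding window in steps 1-5 might look like the following. Here `word_ids` is an assumed variable holding the corpus already converted to a list of integer word ids; the actual implementation in Section IV follows the tutorial [15] and differs in details.

# Illustrative sketch of the sliding-window construction for a word based model.
import numpy as np

def make_windows(word_ids, t):
    """Each window of t words is an input; the word just after it is the target."""
    X, y = [], []
    for i in range(len(word_ids) - t):
        X.append(word_ids[i:i + t])   # t consecutive words (the window)
        y.append(word_ids[i + t])     # the (t + 1)th word (expected output)
    return np.array(X), np.array(y)

# Example with t = HISTORY = 10, as used in the experiments below:
# X_train, y_train = make_windows(word_ids, t=10)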

E. Character based language modeling
Word based models are a bit restrictive: they strictly learn the patterns of the training data rather than a generic model of a particular language. Also, they cannot predict punctuation, which is important for accurate expression. If we can train a network such that it learns a language at the character level, i.e., it understands how to use letters to construct words, and words with punctuation to construct sentences, then it would be a much more powerful model of the language. Character based language modeling focuses on building an LSTM network capable of doing just that [16]. The approach to train such a network is similar to the word based model [16]:
1) Start with a corpus of text with N unique characters (letters, punctuation and blank space). Fix a window of size t, where each cell of the window encompasses a character.
2) Initialize an LSTM network with random weights, with t input nodes and N output nodes.
3) Feed forward the t characters of the window through the LSTM network to predict the (t + 1)th character.
4) Use the actual (t + 1)th character in the text corpus to calculate the error in prediction. Use backpropagation to tune the weights of the LSTM network.
5) Move the window by one character. GOTO step 3.
The only disadvantage of such a model is that it requires more parameters - that means more hidden nodes and layers, which translates to more training time and compute resources. It also requires much more training data to accurately model the language [16].
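To make the character-level setup concrete, the short sketch below (illustrative only; the file name is a placeholder for the preprocessed text described in Section IV) shows why the output layer shrinks from thousands of nodes to roughly N = 30 when we switch granularity.

# Illustrative comparison of character and word vocabularies.
text = open("sherlock_clean.txt").read().lower()    # hypothetical preprocessed file

chars = sorted(set(text))            # unique characters, incl. space and punctuation
words = sorted(set(text.split()))    # unique whitespace-separated tokens

char_to_int = {c: i for i, c in enumerate(chars)}   # integer id per character
print(len(chars), len(words))        # on this dataset: ~30 characters vs. thousands of words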

IV. EXPERIMENTS
This section describes the experiments performed with word and character based language modeling. We delve into the details of the dataset, the preprocessing and cleaning of the data, the training of different models with varied hyperparameters, the test set, how the models are evaluated, and the results.

A. Dataset
The experiments involved modeling the English language using LSTM. For data, I decided to select novels that are freely available on the internet. Novels are a good source of data because:
• They have a substantial volume of sentences.
• A large variety of words is used. This is good because sentences constructed by me may not have such a wide range of words because of my limited vocabulary.
• Novels have a good proportion of narration and dialogue. This helps the model learn a variety of language patterns.
• Novels help humans get better at a particular language. Thus they may be helpful while modeling a language.
For my experiments, I have chosen the classic work of Arthur Conan Doyle, “The Adventures of Sherlock Holmes”, available freely at [14]. The book is available in .txt format.

B. Preprocessing
The novel was too large to be used for training on the limited hardware I had access to, so I used only the first four stories to train the model. Also, the .txt formatted file had empty lines, special characters (like ‘*’), roman numerals, a contents page and information like the ISBN. In the preprocessing phase, all of this was removed using a combination of manual checking and a script. The text was then clean and ready to be used for training. It had a total of 3077 lines and 4569 unique words.

C. Training word based models
We use the preprocessed text to train word based models. The code and methodology are developed following the tutorial [15].
1) Tokenized input: The first step in training is to identify the unique words in the training set. This is done by reading the entire text and removing punctuation and newlines. Then we split the text on white space and store the results in a list. Finally, we perform a set operation to retrieve the unique words. The next step is to map each of these words to a unique integer. This is necessary for transforming input sentences into a stream of integers, a format that can be passed as input to the Keras API. This can be done easily using the Tokenizer module of the Keras package. Thus, at the end of this section, we have a mapping of words to their corresponding integer representation [15].
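A minimal sketch of this word-to-integer mapping is shown below. The import path and file name are assumptions rather than the report's exact code, but the Tokenizer calls shown are the standard Keras API.

# Illustrative word-to-integer mapping with the Keras Tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer

text = open("sherlock_clean.txt").read()           # hypothetical preprocessed file
tokenizer = Tokenizer()                            # default filters strip punctuation
tokenizer.fit_on_texts([text])                     # build the word -> integer index

word_to_int = tokenizer.word_index                 # e.g. {'the': 1, 'and': 2, ...}
encoded = tokenizer.texts_to_sequences([text])[0]  # the corpus as a stream of integers
vocab_size = len(word_to_int) + 1                  # +1 because index 0 is reserved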

2) Training instances: The LSTM library in Tensorflow needs the following for training:
• Fixed length input - the LSTM library takes a fixed number of word encodings as input.
• Expected output - the next word of the sequence is expected as output.
The LSTM library in Tensorflow works as a supervised learning setup. It expects a fixed-length sequence of word encodings, indexed [i + 1, ..., i + t]. It performs a feed-forward pass with the t word encodings and the feedback from the last pass to predict the (t + 1)th word. During the training phase the expected output is provided; this is used to train the weights of the network using backpropagation. The input is a list of t word encodings. The output is one-hot encoded, that is, the output layer is expected to have as many nodes as there are unique words. The output should be a 1-D vector, where the predicted word has value 1 while all other words have value 0. To achieve this, we first fix the hyperparameter HISTORY, which specifies the sequence length t of the training instances. Next we move a window of size HISTORY over the sentences and copy the content of the window into the training set. At every step, we move the window by one word. The (HISTORY + 1)th word, i.e., the word just right of the window, is our expected output. This word is one-hot encoded and stored in the label set. After this section, we are ready with the test and training instances [15].
3) Network definition: Now we define the LSTM network architecture. The architecture has the following components:
• Input layer - This is determined by the hyperparameter HISTORY. It expects an integer array of size HISTORY.
• Hidden layer(s) - The architecture can have one or more hidden layers with varying numbers of LSTM units. These layers house the LSTM units and are responsible for maintaining context from past instances and contextualizing over multiple sentences. The two hyperparameters to decide here are the number of LSTM units per layer and the number of hidden layers.
• Output layer - The output layer dimension is determined by the number of unique words in the training set. It uses a softmax function to identify the word that has the highest probability of appearing next, given the input sequence.
At the end of this section, we have the network defined as a graph in Tensorflow [15].
4) Training: With the training and testing sets prepared and the network defined, we are ready to train. We use the standard Tensorflow cross entropy as the loss function and the adam optimizer, as suggested in [15]. Another hyperparameter we decide here is the epoch count. The number of epochs affects the train and test accuracy of the network and is important to experiment with. Finally, we pass the training set and label set to the network along with the loss function, optimizer and epoch count. At the end of this step, we have an LSTM network trained on English text that can be used to predict the next word given a sequence of words [15].
5) Generate output: To test the model, we provide a sequence of words as input and the model predicts the next word in the sequence. We can append this predicted word to the previous (HISTORY - 1) words of the input sequence and predict the next word. In this way we can predict as many words as we want [15].
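The following sketch illustrates the network definition and a training step described above. The vocabulary size echoes the 4569 unique words reported earlier and the layer sizes echo one row of Table I, but the embedding size, the dummy batch and the exact API calls are assumptions, not the report's code.

# Illustrative word based LSTM language model, loosely following [15].
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical

HISTORY = 10        # sequence length t
VOCAB = 4570        # 4569 unique words + 1 (index 0 is reserved by the Tokenizer)
UNITS = 100         # LSTM units in the hidden layer

model = Sequential([
    Embedding(VOCAB, 50),                 # map each word id to a 50-d vector (assumed size)
    LSTM(UNITS),                          # hidden layer of LSTM units
    Dense(VOCAB, activation="softmax"),   # one output node per unique word
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Dummy batch standing in for the real sliding-window instances and one-hot labels.
X = np.random.randint(1, VOCAB, size=(32, HISTORY))
y = to_categorical(np.random.randint(1, VOCAB, size=32), num_classes=VOCAB)
model.fit(X, y, epochs=1, verbose=0)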





D. Training character based models
We use the preprocessed text to train character based models. The code and methodology are developed following the tutorial [16].
1) Tokenized inputs: For training a character based model, we need tokenized characters, as the Keras API for LSTM expects an array of integers as input. We start by reading the entire text and removing the newline characters. Next, we use a set operation to figure out how many unique characters and punctuation marks are present in the entire text, including the blank space. For our dataset, we have 30 unique characters (26 letters, 3 punctuation marks and the blank space). We use the Tokenizer module of the Keras package to map characters to unique integers. Thus, at the end of this section, we have a mapping of characters to their corresponding integer representation [16].
2) Training instances: As mentioned in the corresponding section of word based modeling, the LSTM library in Tensorflow needs the following for training:
• Fixed length input - the LSTM library takes a fixed number of character encodings as input.
• Expected output - the next character of the sequence is expected as output.
The LSTM library in Tensorflow works as a supervised learning setup. It expects a fixed-length sequence of character encodings, indexed [i + 1, ..., i + t]. It performs a feed-forward pass with the t character encodings and the feedback from the last pass to predict the (t + 1)th character. During the training phase the expected output is provided; this is used to train the weights of the network using backpropagation. As in word based modeling, the input is a list of t character encodings. The output is one-hot encoded, that is, the output layer is expected to have as many nodes as there are unique characters. The output should be a 1-D vector, where the predicted character has value 1 while all other characters have value 0. To achieve this, we first fix the hyperparameter HISTORY, which specifies the sequence length t of the training instances. Next we move a window of size HISTORY over the sentences; only this time the granularity level is characters, so we copy HISTORY characters into the training set. At every step, we move the window by one character. The (HISTORY + 1)th character, i.e., the character just right of the window, is our expected output. This character is one-hot encoded and stored in the label set. For the dataset we have, this produces 237581 training instances. After this section, we are ready with the test and training instances [16].
3) Network definition: Next we build the network. Similar to the word based model, it has three components [16]:
• Input layer - This is determined by the hyperparameter HISTORY. It expects an integer array of size HISTORY.
• Hidden layer(s) - The architecture can have one or more hidden layers with varying numbers of LSTM units. The two hyperparameters to decide here are the number of LSTM units per layer and the number of hidden layers.
• Output layer - The output layer dimension is determined by the number of unique characters in the training set. It uses a softmax function to identify the character that has the highest probability of appearing next, given the input sequence.
4) Training: Similar to the word based model, we pass the training and label sets to the network. We use the standard Tensorflow cross entropy loss function and the adam optimizer. The epoch count is a hyperparameter decided in this phase. At the end of this step, we have an LSTM network trained on English text that can be used to predict the next character given a sequence of characters [16].
5) Generate output: With the network trained, we can use it to predict the next character given a sequence of characters as input. We can again use the predicted character along with the existing sequence to predict yet another character. Following this technique, we can predict as many characters as we want [16].
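As an illustration of this generation step, a minimal greedy loop might look like the following. Here `model`, `char_to_int` and `int_to_char` are assumed to come from the training and tokenization steps above; this is a sketch, not the report's code.

# Illustrative character-by-character generation with a trained model.
import numpy as np

def generate(model, seed, char_to_int, int_to_char, history=10, n_chars=50):
    """Repeatedly predict the next character and append it to the running text."""
    text = seed                                               # seed should have >= `history` characters
    for _ in range(n_chars):
        window = text[-history:]                              # last HISTORY characters
        x = np.array([[char_to_int[c] for c in window]])      # shape (1, history)
        probs = model.predict(x, verbose=0)[0]                # softmax over characters
        text += int_to_char[int(np.argmax(probs))]            # greedily take the most likely character
    return text

# Example: generate(model, "my wife was standing in the street, ", char_to_int, int_to_char)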

E. Testing
It is difficult to quantify the performance of the different networks. We tried the following ways to test the different models:
• We note the final training accuracy. This is the accuracy of the network on the entire training set after the last epoch.
• We use sentences from the training set to test the models. We select random sentences of varying length from the training set and use them to generate the next few words/characters.
• We use a sequence of seed words/characters and use the model to generate the next few words/characters. The quality is judged manually, not with any quantitative measure.

F. Results
The results for the different testing approaches mentioned above are reported in this section. Table I shows the training accuracy of the various word based models trained. The hyperparameters varied are history (number of past sequences), LSTM units per hidden layer, number of hidden layers and epochs.

TABLE I: TRAINING ACCURACY OF WORD BASED MODELS

History  Units  Layers  Epochs  Train Accuracy
10       100    1       200     97.17
10       100    2       200     91.73
10       300    2       200     82.47
10       300    3       200     98.66

Table II shows the training accuracy of the various character based models trained.

TABLE II: TRAINING ACCURACY OF CHARACTER BASED MODELS

History  Units  Layers  Epochs  Train Accuracy
10       75     1       100     62.67
10       75     2       100     67.77
10       100    2       300     73.95
10       100    3       300     75.99
10       300    2       200     89.00
10       300    3       300     89.44

Table III and Table IV show much more interesting results. We selected two random sentences from the training set and used the different models to generate the next words of the sequence. Each model is expressed as a tuple of the form Granularity (Word/Character), HISTORY, UNITS, LAYERS. The seed sentence for Table III is “Last Monday I had finished for the day and was dressing in my room above the opium den when I looked out of my window and saw, to my horror and astonishment, that my wife was standing in the street,”. The expected output is “with her eyes fixed full upon me.” Table III shows the output of the different word and character based models.

TABLE III: SAMPLE OUTPUT ON SENTENCES FROM TRAINING SET I

W,10,100,2: Last Monday I had finished for the day and was dressing in my room above the opium den when I looked out of my window and saw, to my horror and astonishment, that my wife was standing in the street, with her eyes fixed full upon me i gave the

W,10,300,3: Last Monday I had finished for the day and was dressing in my room above the opium den when I looked out of my window and saw, to my horror and astonishment, that my wife was standing in the street, with her the fixed full which me i along of

C,10,75,2: last monday i had finished for the day and was dressing in my room above the opium den when i looked out of my window and saw, to my horror and astonishment, that my wife was standing in the street, and the lascar of the street, and the lascar of t

C,10,100,3: last monday i had finished for the day and was dressing in my room above the opium den when i looked out of my window and saw, to my horror and astonishment, that my wife was standing in the street, and i can only deduce from it not are to a conclu

C,10,300,3: last monday i had finished for the day and was dressing in my room above the opium den when i looked out of my window and saw, to my horror and astonishment, that my wife was standing in the street, and it was a sign that the matter was so dear to

The seed sentence for Table IV is “A man entered who could hardly have been less than six feet six inches in height”. The expected output is “with the chest and limbs of a Hercules.” Table IV shows the output of the different word and character based models.

TABLE IV: SAMPLE OUTPUT ON SENTENCES FROM TRAINING SET II

W,10,100,2: A man entered who could hardly have been less than six feet six inches in height such anything mr holmes i pray just had that that

W,10,300,3: A man entered who could hardly have been less than six feet six inches in height his the go long to bearing known so come keen

C,10,75,2: a man entered who could hardly have been less than six feet six inches in height, but as the steps which had been to talk, provin

C,10,100,3: a man entered who could hardly have been less than six feet six inches in height of the street with some play. a sandwich and abso

C,10,300,3: a man entered who could hardly have been less than six feet six inches in height, with the conviction of young mccarthy. i promis

Table V shows the output on providing only the one word “Sherlock”. There is no expected output; rather, the idea was to observe whether the words/characters predicted by the models make sense or not.

TABLE V: SAMPLE OUTPUT ON SENTENCES OUTSIDE THE TRAINING SET

W,10,100,2: Sherlock holmes made back in a hansom well had seen to

W,10,300,3: Sherlock receive receive receive receive receive receive receive that drawn on

C,10,75,2: sherlockess which lay at the time when he was a woman was

C,10,100,3: sherlock as a lover my heart of him, and that the surgeon

C,10,300,3: sherlock to come for half wages, it was at least four and

V. ANALYSIS AND COMPARISON
In this section, we try to rationalize the experimental results. Table I shows the training accuracy of the word based models. We can observe that for 100 LSTM units, the training accuracy decreases when a hidden layer is added. This may be a sign that the network is overfitting in the case of the first architecture. However, on adding one more layer, the total number of parameters increases, but the training accuracy is still low. This could be because both networks have the same epoch count; perhaps increasing the epoch count for the second network would increase its training accuracy. The scenario is a bit different for the character based models. Here we see the accuracy consistently increase as more parameters are added. With 75 LSTM units and 1 hidden layer, we achieve a training accuracy of 62.67%, which could indicate underfitting. When we add one more hidden layer, the training accuracy increases. This trend continues as we increase the LSTM units per layer and keep adding more layers. The highest training accuracy obtained is 89.44%.

Next, we test the output generated by the trained models. We choose a long sentence from the training dataset and try to predict the last seven words of the sentence using the models. The results are shown in Table III. We can see that the word based model with 100 LSTM units and 1 layer produces perfect output. This may be an indication that the network is not overfitting. The second model, with 2 hidden layers, gets four of the first five words right. This could indicate that more epochs are needed to train the network since it has more parameters. The character models fail to predict any of the seven words correctly. Character models require more parameters and more training examples than word models, because they model the language at a more generic level than word based models. However, if we observe closely, we can see that the words predicted by the character models immediately after the seed sentence ends make sense as per the semantics of the English language. For example, one of the outputs reads “... my wife was standing in the street, and it was a sign that”. This sentence is not wrong. Thus the network can, at the very least, partially model the English language.

Another sentence from the training dataset was tried - “A man entered who could hardly have been less than six feet six inches in height”; this time the sentence is much shorter. The results are shown in Table IV. This time, the word models could not predict the next words correctly. In fact, the predicted words did not even make much sense semantically. The situation is similar for the character based models. We believe the reason is the lower word count in the seed sentence: because there are fewer words, the model cannot gather enough context to predict the next words correctly.

Finally, we tested the models by providing just the word “Sherlock” as the seed. We wanted to observe what the models predict as the next word. Table V shows the results of the experiment. The word based model with 100 LSTM units and 1 hidden layer outputs “holmes” as the next word, which makes perfect sense. The second word based model, however, got stuck, repeatedly predicting “receive” as the next word. This may indicate that it requires more epochs and training data to model the language more accurately. The last two character models did not generate the word “holmes”, but the first few words they predicted were semantically and grammatically correct and made sense. Thus we can conclude that the network can, at the very least, partially model the English language, even with such a limited number of parameters and training set.

Overall, we can say that the word based models actually fared better, especially when the input sentence was part of the training set. The character models did generate some meaningful output, even though it was not the expected output. What I observed is that the word based model is trained more coarsely on this training set: it does a better job of predicting words in the context of the training set. The character model, however, is more generic; in the results we can see it predicting words that are semantically correct rather than a line it encountered in the training set. Another observation concerns cost - it takes far more time and example data to sufficiently train a character model. Given the limited compute resources I had, I could not use a larger training set or a deeper architecture, but the network still learned to model the English language to some extent. The character model also has the additional advantage of predicting punctuation, something the word based model lacks.

VI. CONCLUSION
This project takes a look at the field of language modeling using deep neural networks. Specifically, we examine two approaches using LSTM models - a word based and a character based model. We introduced the problem statement and delved into the theory of RNN and LSTM before moving on to the implementation. Experiments were performed by varying hyperparameters of the network such as the number of hidden units, the number of hidden layers and the number of past sequences. We used chapters from “The Adventures of Sherlock Holmes” to train our models and tested them using different approaches. Finally, we did a detailed analysis of the results and compared the different models based on the experimental results. This project has helped me understand the theory and mathematics behind LSTM networks and how to implement them using Python and the Keras library. I have also identified the strengths and shortcomings of word based and character based language modeling techniques and, armed with this knowledge, can conduct further experiments and research in natural language processing tasks like language translation, speech recognition, information retrieval and generating summaries from large documents.

REFERENCES
[1] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[2] Merriam-Webster Dictionary; https://www.merriam-webster.com/dictionary/language; accessed on 11/22/2017.
[3] Language Modeling; https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf; accessed on 11/22/2017.
[4] Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush. "Character-Aware Neural Language Models." In AAAI, pp. 2741-2749. 2016.
[5] Sundermeyer, Martin, Ralf Schlüter, and Hermann Ney. "LSTM neural networks for language modeling." In Thirteenth Annual Conference of the International Speech Communication Association. 2012.
[6] Brown, Peter F., Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. "Class-based n-gram models of natural language." Computational Linguistics 18, no. 4 (1992): 467-479.
[7] Mnih, Andriy, and Geoffrey Hinton. "Three new graphical models for statistical language modelling." In Proceedings of the 24th International Conference on Machine Learning, pp. 641-648. ACM, 2007.
[8] Neural network image; http://neuralnetworksanddeeplearning.com/images/tikz11.png; accessed on 12/03/2017.
[9] Recurrent Neural Network image; http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png; accessed on 12/03/2017.
[10] Long Short Term Memory unit image; https://deeplearning4j.org/img/greff_lstm_diagram.png; accessed on 12/03/2017.
[11] Deep Learning - Udacity; https://www.youtube.com/playlist?list=PLAwxTw4SYaPn_OWPFT9ulXLuQrImzHfOV; accessed on 12/03/2017.
[12] Understanding LSTM Networks - Colah's blog; http://colah.github.io/posts/2015-08-Understanding-LSTMs/; accessed on 12/03/2017.
[13] Beginner's Guide to Recurrent Networks and LSTMs - DeepLearning4J; https://deeplearning4j.org/lstm.html; accessed on 12/03/2017.
[14] Project Gutenberg; https://www.gutenberg.org/; accessed on 12/03/2017.
[15] How to Develop Word-Based Neural Language Models in Python with Keras; https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/; accessed on 12/03/2017.
[16] How to Develop a Character-Based Neural Language Model in Keras; https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/; accessed on 12/03/2017.
