Chapter 6 Sequence Data

1. The full sequences of successive outputs for each timestep: (batch_size, timesteps, output_features)
2. Only the last output for each input sequence: (batch_size, output_features)

SimpleRNN can return either of the following:
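
A minimal sketch of the two modes, assuming standalone Keras and an Embedding front end with a 10,000-word vocabulary (both choices are illustrative, not from the original card):

from keras.models import Sequential
from keras import layers

# Mode 1: return the full sequence of outputs, shape (batch_size, timesteps, 32)
model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.SimpleRNN(32, return_sequences=True))
model.summary()

# Mode 2 (the default): return only the last output, shape (batch_size, 32)
model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.SimpleRNN(32))
model.summary()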

1. Recurrent dropout - fights overfitting in recurrent layers
2. Stacking recurrent layers - increases the representational power of the network (at a higher computational cost)
3. Bidirectional recurrent layers - present the same information to a recurrent network in different ways, increasing accuracy and mitigating forgetting issues

Three methods to improve an advanced RNN

Natural Language Processing

What are bidirectional RNNs mostly used for?

A for loop that reuses quantities computed during the previous iteration of the loop

A simpler explanation of an RNN

Time can be treated as a spatial dimension, like the height and width of a 2D image. 1D convnets are a faster alternative to RNNs and have sometimes beaten them at text classification and timeseries forecasting. Similar to 2D convnets, they extract patches of sequence data and transform them into a map of important patterns (analogous to a feature map, although not called that)

1D Convnet
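
A minimal sketch of a 1D convnet for text classification, assuming standalone Keras, a 10,000-word vocabulary, and 500-word samples (all illustrative values):

from keras.models import Sequential
from keras import layers

max_features = 10000   # vocabulary size (assumed)
max_len = 500          # words kept per sample (assumed)

model = Sequential()
model.add(layers.Embedding(max_features, 128, input_length=max_len))
model.add(layers.Conv1D(32, 7, activation='relu'))   # each window of 7 timesteps is a "patch"
model.add(layers.MaxPooling1D(5))                    # downsample the sequence
model.add(layers.Conv1D(32, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())               # collapse to a single feature vector
model.add(layers.Dense(1, activation='sigmoid'))     # binary (e.g., sentiment) output
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])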

Document classification, such as identifying the type of a news article
Timeseries comparison, such as estimating how closely related two documents or stock tickers are
Sentiment analysis, such as detecting the positivity or negativity of a movie review or tweet
Timeseries forecasting, such as predicting the weather

Applications of Sequence Data

Strong when the data sequence can usefully be read in both directions, as in NLP. Not strong performers when the recent past is much more informative than the distant past for predicting what comes next

Bidirectional RNNs pros and cons

Not always. When applying this concept to the weather data, we got WORSE results than keeping the chronological status quo. That's because the weather is better predicted from the data points closest to the prediction target, so processing the series in reverse hurts. It can pay off in NLP, however, where the importance of a word doesn't usually depend on its position in the sentence.

Does going antichronological in the bidirectional approach always yield better results?

Why: the CNN can preprocess the data quickly and then hand it off so the RNN can handle the order-sensitive components of the data. Useful when the sequences are so long that RNNs alone cannot handle them; the CNN turns the thousands of timesteps into a downsampled sequence of higher-level features. Interestingly, this method hasn't received much attention in research and practice, but it should

Combining RNN and CNN
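
A minimal sketch of this combination, assuming standalone Keras and the weather-style timeseries from the other cards (float_data, an array of shape (samples, features), is assumed to exist):

from keras.models import Sequential
from keras import layers

model = Sequential()
# Conv1D + pooling shrink a very long sequence into a shorter one of higher-level features
model.add(layers.Conv1D(32, 5, activation='relu',
                        input_shape=(None, float_data.shape[-1])))
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(32, 5, activation='relu'))
# The GRU then handles order sensitivity on the downsampled sequence
model.add(layers.GRU(32, dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mae')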

The output of the loop at time t

Define timestep in RNN

We place words in a geometric space based on relevance. If wolves and lions sit in the upper part of the vector space and dogs and cats in the lower part, an UP vector takes us toward wild animals and a RIGHT vector takes us from canines to felines. If a "female" vector were applied to "king", the space would land near "queen"

Embedding space
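
A toy numpy sketch of that vector arithmetic; the 2-D coordinates below are made up purely for illustration, since real embedding spaces are learned or pre-trained and have many more dimensions:

import numpy as np

# Made-up 2-D "embedding space"
vectors = {
    'king':  np.array([5.0, 4.0]),
    'queen': np.array([5.0, 1.0]),
    'man':   np.array([1.0, 4.0]),
    'woman': np.array([1.0, 1.0]),
}

# The "female" direction: woman - man
female = vectors['woman'] - vectors['man']

# Applying that direction to 'king' should land near 'queen'
candidate = vectors['king'] + female
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - candidate))
print(closest)   # -> 'queen'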

Consists of using two GRUs or LSTMs: one processes the input chronologically, the other antichronologically, and the two representations are then merged. A normal RNN processes the data chronologically, which yields a completely different representation than processing it in reverse. By combining both representations, the bidirectional RNN can catch patterns missed by a unidirectional RNN

Explain bidirectional RNNs
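
A minimal sketch using the layers.Bidirectional wrapper on an IMDB-style sentiment task, assuming standalone Keras and a 10,000-word vocabulary (illustrative values):

from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Embedding(10000, 32))
# Bidirectional trains one LSTM on the sequence as-is and a second one on the
# reversed sequence, then merges the two representations
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])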

Two arguments: dropout, a float specifying the dropout rate for the input units of the layer, and recurrent_dropout, specifying the dropout rate of the recurrent units

Explain Keras's built-in dropout mechanism in the RNN layers
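
A minimal sketch of both arguments on a GRU layer; the rates and the float_data input shape are illustrative, borrowed from the weather example:

from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.2,            # dropout rate for the layer's input units
                     recurrent_dropout=0.2,  # dropout rate for the recurrent (state-to-state) units
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.Dense(1))
model.compile(optimizer='rmsprop', loss='mae')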

They are fed all the timeseries data at once, because these networks do not hold any memory of the data; they just apply their transformations to it and pass it on

Feedforward networks

Recurrent attention and sequence masking, which are relevant for NLP

For future study:

Same principle as the LSTM but more streamlined and cheaper to run, although lacking some of the representational power of an LSTM

Gated Recurrent Unit (GRU) Layer

Apply the same dropout mask (the same pattern of dropped units) at every timestep, including on the inner recurrent activations, so the network can properly propagate its learning error. Dropping units randomly at each timestep would disrupt the error signal

How to apply dropout to RNNs

1. Adjust the number of units (neurons) in the stacked RNN layers
2. Adjust the learning rate of RMSprop
3. Try LSTM layers instead of GRU layers
4. Try a bigger Dense layer, or a stack of Dense layers, on top
5. Once you have tuned all these models, remember to run the best one on the test set

How to improve our wx forecast prediction

Go to the data generator and replace the last line with yield samples[:, ::-1, :], targets

How to reverse data timesteps when applying a bidirectional RNN
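
A tiny standalone check of what that slice does (array shapes below are assumed from the weather example): it reverses only the time axis, leaving the batch and feature axes untouched:

import numpy as np

samples = np.random.random((128, 240, 14))   # (batch_size, timesteps, features), assumed shapes
targets = np.random.random((128,))

reversed_samples = samples[:, ::-1, :]       # same data, antichronological order
assert reversed_samples.shape == samples.shape
assert (reversed_samples[:, 0, :] == samples[:, -1, :]).all()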

1D convnets will turn out to work at least as well and are cheaper (good for text data, where a keyword at the beginning is just as meaningful as a keyword found at the end)

If global order isn't fundamentally meaningful

Use a recurrent network to process it, because the recent past is more informative than the distant past.

If global order matters in your sequence data

Finding representations that are different from what was done previously (looking at your data from a new angle)

Intuition behind ensembling (spinning off why we use bidirectional RNNs)

More of an art. Guidelines on how to work on a given problem are always provided, but every problem is unique, and extensive trial and error, along with creativity, is needed to best approach it

Is deep learning an art or a science?

many good defaults in its functions. When in doubt, just leave them alone

Keras has.....

The LSTM achieves 88% val_acc, but it can still do better after you tune hyperparameters such as the embedding size and the LSTM output dimensionality. A lack of regularization may also be an issue. Finally, LSTMs show their strengths in Q&A and machine translation, not so much in sentiment analysis

LSTM's effectiveness at IMDB sentiment analysis

How does the LSTM save information and solve the vanishing-gradient problem? It has a conveyor belt (carry track) that runs parallel to the sequence being processed, where information can be stored and moved to a later timestep for use. Most importantly, it allows past information to be reinjected at a later time

Long Short-Term Memory (LSTM) Layer
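
A minimal IMDB-style sentiment model with a single LSTM layer, assuming standalone Keras and a 10,000-word vocabulary (illustrative values):

from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Embedding(10000, 32))
model.add(layers.LSTM(32))      # the carry track lets earlier information be reinjected later
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])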

Per the author, trying to forecast the future price of securities on the stock market is a waste of time. Markets have very different statistical characteristics than natural phenomena such as weather patterns; when it comes to markets, past performance is not a good predictor of the future. My take, however, is to observe economic metrics such as growth and other quarterly figures and see how they affect the larger market indexes. If people can predict recessions based on a series of measures, then we can determine which direction certain securities may move in response to what is happening in the overall economy.

Markets and Machine Learning

When you lack data, you can download pre-trained embeddings that were computed using word-occurrence statistics. The most successful schemes: the Word2vec algorithm (from Google) and GloVe

Pre-trained embeddings
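
A sketch of loading pre-trained GloVe vectors into an Embedding layer; the file name, the sizes, and the word_index dictionary (from a fitted Tokenizer, see the Sequencing card) are all assumptions:

import numpy as np
from keras.models import Sequential
from keras import layers

max_words, embedding_dim, maxlen = 10000, 100, 100   # assumed sizes

# Parse the GloVe file into a {word: vector} map (file name assumed)
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Build a (max_words, embedding_dim) matrix aligned with our word indices;
# word_index is assumed to come from a fitted Tokenizer
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

model = Sequential()
model.add(layers.Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))

# Drop the pre-trained vectors into the Embedding layer and freeze them
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False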

Like the one we recently worked with, it contains over 400,000 words with 100 dimensions each. That means every word is represented by a 100-dimensional vector, and the geometric distances between vectors define the inter-word relationships across sentences, words, and other representations. On this task they were not as accurate as the 1D convnet or recurrent neural network models

Pre-trained embeddings and their details

Looks at each word while keeping memories of what came before, maintaining a fluid representation of the meaning conveyed within the sentence. It maintains information about a state as it goes through the sequence. It's called recurrent because, as it experiences new data, it loops back over what it computed from the old data

Recurrent Neural Network

Giving each of the words a unique index or number identifier. Once you do this with the Tokenizer class, you can create a word_index, which is a dictionary of the common words, each holding its own unique identifier

Sequencing
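
A minimal sketch of that workflow with the Keras Tokenizer; the toy corpus and the num_words/maxlen values are made up:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ['The cat sat on the mat.', 'The dog ate my homework.']   # toy corpus

tokenizer = Tokenizer(num_words=1000)             # keep only the 1,000 most common words
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)   # each word replaced by its unique index
word_index = tokenizer.word_index                 # dict mapping words to their identifiers
data = pad_sequences(sequences, maxlen=10)        # pad/truncate to a common length
print(word_index)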

Suffers from the vanishing-gradient problem: the more layers you add, the harder it is to remember past inputs. LSTM and GRU solve this.

SimpleRNN's major issue

Offer more representational power than a single layer, which helps in areas such as machine translation. But they falter under their heavy computational requirements, making them not worth it for simpler problems

Stacked RNN layers pros and cons

Recurrent neural networks and 1D convnets

Two fundamental deep-learning algorithms for sequence processing:

LSTM and GRU. SimpleRNN is generally too simple to be of real use

Two other recurrent layers:

One-hot encoding - have 200 columns for all the words and put a 1 where the word is present, like in IMDB
Token embedding - low-dimensional floating-point vectors; they pack more information with fewer dimensions

Two ways to associate a vector with a token
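
A short sketch contrasting the two, assuming a toy 1,000-word vocabulary and a made-up sequence of word indices:

import numpy as np
from keras import layers
from keras.utils import to_categorical

vocab_size = 1000
tokens = np.array([4, 20, 7])   # toy sequence of word indices

# 1. One-hot encoding: one wide, sparse vector per word
one_hot = to_categorical(tokens, num_classes=vocab_size)   # shape (3, 1000), mostly zeros

# 2. Token embedding: dense, low-dimensional, learned vectors
embedding_layer = layers.Embedding(vocab_size, 8)          # each word becomes an 8-d float vector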

from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.GRU(32,
                     dropout=0.1,
                     recurrent_dropout=0.5,
                     return_sequences=True,
                     input_shape=(None, float_data.shape[-1])))
model.add(layers.GRU(64, activation='relu', dropout=0.1, recurrent_dropout=0.5))
model.add(layers.Dense(1))

What does a stack of recurrent layers look like?

It is the number of samples in the validation set divided by the batch size. validation_steps lets fit_generator know when to stop drawing data from the validation generator, since that generator loops through the data endlessly

What does validation_steps mean
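
A minimal sketch of how that number is computed and passed to fit_generator; the index boundaries, lookback, batch size, and the model/generator objects are all assumed from the weather example:

val_min, val_max = 200001, 300000        # index boundaries of the validation split (assumed)
lookback, batch_size = 1440, 128         # generator settings (assumed)

# How many batches to draw from val_gen in order to see the whole validation set once
val_steps = (val_max - val_min - lookback) // batch_size

history = model.fit_generator(train_gen,
                              steps_per_epoch=500,
                              epochs=20,
                              validation_data=val_gen,
                              validation_steps=val_steps)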

It will freeze your model and you will not proceed past the first epoch

What happens when you mess up validation_steps?

It slightly improved val_loss, but nowhere near the improvement we observed from applying dropout. You can add more layers, but it gets computationally expensive. All we gained was about a 1% reduction in val_loss.

When stacking recurrent layers in our wx forecast problem, what did we observe?

Recurrent networks (GRU or LSTM) outperform models that flatten the temporal data (Dense)

Which layer should you use when temporal ordering matters?

Random forest and logistic regression

Who uses n-gram?

We achieved only an 81% val_acc because we were drawing just the first 500 words from each review, and SimpleRNNs are not that good at long text sequences; LSTMs and GRUs perform much better

Why did the Simple RNN network not perform well on the imdb set?

Increase the representational power of a network

Why do you stack RNN networks?

When the tokens have no specific order. NOTE: coming up with the right n-grams or word sequences won't be necessary, because 1D convnets and recurrent networks will learn representations for these word groups on their own

bag-of-words

Overlapping groups of multiple consecutive words or characters

n-gram

The specific unit (such as a word or character) that text is broken down into

token

