ML4T Final Prep


KNN where K varies - when does it overfit?

-k=N: we get a flat line (the prediction is the mean of all the data)
-k=1: the model tags each individual data point and is most likely to overfit
-As k increases we are less likely to overfit

ML Optimizer and Parameterized model

-Find minimum values of functions
-Build parameterized models based on data
-The optimizer marches down the graph (gradient descent) to find a minimum
-import scipy.optimize as spo, then call spo.minimize(f, xguess, method='SLSQP', options={'disp': True})
-The minimizer finds the coefficients C0, C1, etc. of a parameterized model, e.g. f(x) = m*x + b, equivalently f(x) = C0*x + C1
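A minimal sketch of the above, fitting the line f(x) = C0*x + C1 by minimizing a sum-of-squared-errors function with spo.minimize; the data and the error function are illustrative assumptions, not course-provided code:

import numpy as np
import scipy.optimize as spo

def error(C, data):
    # Sum of squared errors between the line C[0]*x + C[1] and the observed y values.
    return np.sum((data[:, 1] - (C[0] * data[:, 0] + C[1])) ** 2)

# Hypothetical noisy data scattered around y = 4x + 2.
x = np.linspace(0, 10, 50)
y = 4 * x + 2 + np.random.normal(0, 1.0, size=x.shape)
data = np.column_stack((x, y))

guess = np.array([0.0, 0.0])  # initial guess for [C0, C1]
result = spo.minimize(error, guess, args=(data,), method='SLSQP', options={'disp': True})
print(result.x)  # fitted coefficients, roughly [4, 2]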

Write Put

-Give someone else the option to sell us stock at the strike price if they choose to do so
-We want the price of the stock to stay at or above the strike (not go down), so the put expires worthless and we keep the premium
-Loss is bounded (the stock can only fall to zero)

RL as a Trading Problem

-In trading, the environment is the market, actions are trades (buy/sell/hold), states s are stock factors, and the reward r is the return
-actions => buy, sell, do nothing
-states => holding long, Bollinger value, daily return
-rewards => return from trade, daily return

Q-Learning Gamma and Alpha

-A low value of gamma means we value later/future rewards less (similar to a high discount rate)
-A high value of gamma near 1 means we value later/future rewards much more (a reward 20 steps in the future is worth nearly as much as a reward now)
-Lower values of alpha cause us to learn more slowly
-Higher values of alpha cause us to learn more quickly

2 approaches to finding policies from experience tuples

-Model-based: build models of T[s,a,s'] and R[s,a], then use value iteration or policy iteration
-Model-free: Q-learning

Reinforcement Learning

-The other learners discussed previously provide a forecast, which ignores the certainty of the price change and when to exit the position
-RL learns policies that specify which action to take
-We take the action that maximizes reward

Dyna

-The problem with Q-learning is that it takes many experience tuples to converge (reach a maximum reward). This is expensive when interacting with the real world, because we have to take a real step (execute a trade) in order to gather data
-Dyna solves this by building models of T (transition matrix) and R (reward matrix); then, after each real interaction with the world, we hallucinate many additional interactions to update the Q table
-Rather than interacting with the real world, we hallucinate an experience. We leverage experience from the real world and update our model more completely (maybe 100 hallucinated updates per real step)
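A rough sketch of the Dyna-style loop, assuming a tabular Q learner; the helper name, the count-based T model, and the smoothed R model are illustrative simplifications rather than the exact course implementation:

import numpy as np

def dyna_update(Q, T_count, R, s, a, s_prime, r, alpha=0.2, gamma=0.9, n_hallucinations=100):
    # 1) Direct Q-learning update from the real experience tuple <s, a, s', r>.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max())
    # 2) Update the models: count the observed transition, smooth the reward estimate.
    T_count[s, a, s_prime] += 1
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r
    # 3) Hallucinate: replay random previously visited (s, a) pairs using the learned models.
    visited = np.argwhere(T_count.sum(axis=2) > 0)
    for _ in range(n_hallucinations):
        hs, ha = visited[np.random.randint(len(visited))]
        probs = T_count[hs, ha] / T_count[hs, ha].sum()
        hs_prime = np.random.choice(len(probs), p=probs)
        Q[hs, ha] = (1 - alpha) * Q[hs, ha] + alpha * (R[hs, ha] + gamma * Q[hs_prime].max())
    return Q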

RL Problem - defining variables

-Sense the environment, think, then take another action
-The Q algorithm determines the policy pi that maximizes reward
-S = state, what we observe in the environment
-pi(s) = policy, the action chosen from an input state
-A = the action output by the policy that affects the environment
-T = transition function: taking an action in a state moves us to a new state s'
-R = the reward associated with taking a particular action in a particular state
-In trading: the environment is the market, actions are trades (buy/sell/hold), states are stock factors, rewards are returns

Q Learning for Trading - possible state values

-Possible state values are the factors used to describe the market situation, e.g. Bollinger value, daily return, and whether we are holding the stock
-Continuous factors have to be discretized; there is also an algorithm to calculate the thresholds used for that discretization

LinReg Overfitting where d (degree) varies

-As we increase the polynomial degree d we are more likely to overfit (with x^3 we get an extra curl in the fitted curve compared with just x^2)
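A small sketch of this effect on hypothetical data: in-sample RMSE keeps shrinking as d grows, while out-of-sample RMSE typically stops improving and then worsens:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x ** 2 + rng.normal(0, 1.0, size=x.shape)  # hypothetical quadratic data plus noise

train_x, test_x = x[::2], x[1::2]  # simple alternating in/out-of-sample split
train_y, test_y = y[::2], y[1::2]

for d in (1, 2, 5, 9):
    coeffs = np.polyfit(train_x, train_y, d)  # fit a degree-d polynomial
    rmse_in = np.sqrt(np.mean((np.polyval(coeffs, train_x) - train_y) ** 2))
    rmse_out = np.sqrt(np.mean((np.polyval(coeffs, test_x) - test_y) ** 2))
    print(d, round(rmse_in, 3), round(rmse_out, 3))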

Covered Call

-Buy a stock, then write a call: give someone else the privilege to buy the stock away from us at the strike price
-We collect the premium, but miss out on upside potential if the stock gets 'called away'

Dyna Q Recap

-Direct RL from real experience tuples gathered by acting in an environment
-Updating an internal model of the environment
-Using the model to simulate experiences

Write Call

-Initially profitable: we collect the premium up front
-Give someone else the right to buy the stock from us at the strike price if they choose

Q-Learning

-Model-free approach: does not know about or use models T or R
-Builds a table of utility values (Q values) as the agent interacts with the world; the Q values can be used at each step to select the best action based on what it has learned so far
-Guaranteed to converge to an optimal policy (given sufficient exploration)

Options

-Refers to exchange-traded options, not employee stock options
-A legal contract which gives the buyer the right to buy or sell the underlying stock at a specific price on or before the expiration date (US options: on or before; European options: only on the expiration date)
-Specific price = strike price
-Last = premium (quoted per share, although an options contract is written for round lots of 100 shares)
-Break-even (for a bought call) = (strike price) + (premium)
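For example, with hypothetical numbers: buying a call with a $100 strike for a $2.73 per-share premium costs 100 * 2.73 = $273 for the contract, and the stock must rise above 100 + 2.73 = $102.73 before expiration for the position to profit.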

Buy Call

-right to purchase stock at strike price on or before exp date -initially lose premium -profit unlimited

Buy Put

-The right to sell (effectively short) the stock at the strike price on or before the expiration date
-Profit is bounded, since the stock price can only fall to zero

Butterfly Option

-Strategy for a sideways market
-Loss is capped
-Example: AAPL is at 111. Buy a 105 and a 115, and write two 110s (all calls)
-Net premium: -7.16 + (2 * 2.73) - 0.53 = -2.23 per share
-Cost to enter the butterfly: $223 (for contracts of 100 shares)

Cross validation

-Training/testing is generally a 60/40 split, but in cross validation we slice the data into different chunks and, in each fold, train on a different 80% portion and test on the remaining 20%

Q Learning - What to use as Reward for Fastest Convergence

-Using the daily return as the reward gives more frequent feedback and converges faster (compared with rewarding only the final return of a trade)

Model Free vs Model Based

1) In model-free reinforcement learning (for example Q-learning), we do not learn a model of the world. We do not explicitly learn transition probabilities or reward functions; we only try to learn the Q-values of actions, or only learn the policy. Essentially, we just learn the mapping from states to actions, possibly also modelling how much reward we expect to get in the long run. The algorithm learns directly when to take what action.
2) In model-based reinforcement learning, you keep track of the transition probabilities and the reward function. These are typically learned as parameterized models. The models learn what the effect will be of taking a particular action in a particular state. This results in an estimated Markov decision process, which can then be solved exactly or approximately, depending on the setting and what is feasible. Model-based techniques tend to do better since they keep a more detailed model of the world; however, for this very same reason, they require more data. Q-learning was brilliant because it is based on the fact that you only need to know what action to take, not why. Of course, knowing why gives you a more detailed model of the world.

Steps to Optimize a Portfolio

1) Provide a function f(x) to minimize (e.g. f(x) = negative Sharpe ratio)
2) Provide an initial guess for x (where x is the vector of allocations)
3) Call the optimizer
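A sketch of these three steps with scipy, under illustrative assumptions (random-walk price data, a 252-day year, zero risk-free rate); the bounds and the sum-to-one constraint are the usual way to keep allocations valid:

import numpy as np
import scipy.optimize as spo

def neg_sharpe(allocs, prices, samples_per_year=252):
    # Step 1: the function to minimize, the negative Sharpe ratio of the allocated portfolio.
    normed = prices / prices[0]  # normalize prices to day one
    port_val = (normed * allocs).sum(axis=1)  # portfolio value over time
    daily_rets = port_val[1:] / port_val[:-1] - 1
    return -np.sqrt(samples_per_year) * daily_rets.mean() / daily_rets.std()

# Hypothetical price data: 252 days x 4 symbols of random-walk prices.
rng = np.random.default_rng(1)
prices = 100 * np.cumprod(1 + rng.normal(0.0005, 0.01, size=(252, 4)), axis=0)

# Step 2: an initial guess (equal allocations), plus bounds and a sum-to-one constraint.
n = prices.shape[1]
guess = np.ones(n) / n
bounds = [(0.0, 1.0)] * n
constraints = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)

# Step 3: call the optimizer.
result = spo.minimize(neg_sharpe, guess, args=(prices,), method='SLSQP',
                      bounds=bounds, constraints=constraints)
print(result.x)  # allocations that maximize the Sharpe ratio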

Q-Learning Procedure

1) Select training data
2) Iterate over time, obtaining an experience tuple at each step
3) Test the policy pi
4) Repeat until converged
-We are converged when the return stops improving

Options Chain

A form of quoting options prices through a list of all of the options for a given security. An option chain is simply a listing of all the put and call option strike prices, along with their premiums, for a given maturity period. The majority of online brokers and stock trading platforms display option quotes in the form of an option chain.
-Strike price: the price we can buy/sell at on or before the expiration date
-Last: the premium

Correlation vs RMSE

As RMSE increases, correlation generally decreases; the two are usually inversely related.

LinReg vs KNN vs Decision Tree Performance (cost of query, cost of learning)

Cost of learning (least to most): 1) KNN - just plop the data into RAM and query later 2) LinReg 3) Decision tree - especially a decision forest
Cost of query (least to most): 1) LinReg - a parametric model is easy to compute 2) Decision tree - a binary tree over 1000 elements needs at most about 10 questions 3) KNN - worst, since we have to compute the distance to ALL individual data points, sort them, and find the closest K points
Quality: LinReg - not great
Space: LinReg beats KNN (it stores only a few parameters rather than all the data)
Decision trees: we don't need to normalize the data
KNN: we do have to normalize the data
Parametric: training is slow but querying is fast
Non-parametric/instance-based: training is fast but querying is slow

RL: What to Optimize

The goal is to find a policy pi(s) that chooses actions so as to maximize the (discounted) future reward.

Intrinsic Value of Stock

Intrinsic value (call) = underlying price - strike price
Intrinsic value (put) = strike price - underlying price
In-the-money (call): strike price < underlying price
In-the-money (put): strike price > underlying price
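These formulas as a tiny helper (a hypothetical function, not from the course); intrinsic value is floored at zero, since an out-of-the-money option has no intrinsic value:

def intrinsic_value(option_type, underlying, strike):
    # Call: what you gain by buying at the strike and selling at the market price.
    # Put: what you gain by buying at the market price and selling at the strike.
    if option_type == 'call':
        return max(underlying - strike, 0.0)
    return max(strike - underlying, 0.0)

print(intrinsic_value('call', 111.0, 105.0))  # 6.0, in the money
print(intrinsic_value('put', 111.0, 105.0))   # 0.0, out of the money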

KNN vs DT - which needs to be normalized?

KNN needs to be normalized Decision Trees do not need to be normalized

Options Pros and Cons

Pros:
1) Higher leverage - we can control more stock using less money
2) We can't lose more than the premium paid up front ($273 in our example of 100 shares at a $2.73 premium)
Cons:
1) The premium is money lost up front, paid for the contract
2) Expiration dates add a layer to the bet - it's usually a short time period
3) We don't own the stock, so no dividends, voting rights, etc.

Q-Learning Pros and Cons

Pros:
1) Model-free approaches can easily be applied to domains where all states and/or transitions are not fully defined
2) No need for additional data structures to store the transition model T or reward model R
3) The Q value for any state-action pair takes future rewards into account; it encodes both the best possible value of a state and the best policy in terms of the action that should be taken
Cons:
1) The reward often comes in the future - representing this requires look-ahead and careful weighting
2) Taking random actions (such as trades) just to learn a good strategy is costly (you will lose money on those trades)
3) Con #2 can be mitigated by simulating the effect of actions on historical data

Q-Learning Variables

Q[s,a] = immediate reward + discounted future reward
-Q[s,a] represents the value of taking action a in state s: the immediate reward plus the discounted reward for future actions
-Look over all actions in the Q table and find the one that maximizes Q[s,a]; this is denoted argmax_a(Q[s,a])
-The optimal policy is pi*(s) and the optimal Q table is Q*[s,a]

Decision trees

A query comes in and bounces down the tree; each node of the tree represents a yes/no question. We finally reach a leaf, whose value is the regression value returned.
Decision forests: lots of decision trees together; query each one and combine the results into an overall answer.

Markov Decision Problems

RL problems are a form of Markov decision problem:
-A set of states S
-A set of actions A
-A transition function T[s,a,s']: the probability that, if we are in state s and take action a, we end up in state s'
-The probabilities over all next states we might end up in sum to 1
-A reward function R[s,a]: if we are in state s and take action a, we get some reward
-Find a policy pi(s) that will maximize the reward

Calculating R

-R[s,a] = the expected reward if we are in state s and take action a
-r = the immediate reward we receive in a single experience tuple
-R can be estimated incrementally from experience, e.g. R'[s,a] = (1 - alpha) * R[s,a] + alpha * r

Regression vs Classification

Regression: try to make a numerical prediction
Classification: classify the input into one of several types

Q Learning Random Action

-Success depends on exploring as much of the state and action space as possible
-We do this by flipping a coin twice: 1) take a random action or pick argmax_a(Q[s,a])? 2) if random, which random action?
-This is controlled by the random action rate (RAR); at the beginning a high RAR forces us to explore the states, and it is typically decayed as learning progresses
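A sketch of that two-coin-flip action selection; the parameter names rar and radr (random action decay rate) are illustrative:

import numpy as np

def choose_action(Q, s, rar, rng=np.random.default_rng()):
    # First coin flip: random action or exploit the Q table?
    if rng.random() < rar:
        # Second coin flip: which random action?
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

rar, radr = 0.9, 0.99  # start exploring a lot, decay the rate each step
Q = np.zeros((10, 3))  # hypothetical 10 states x 3 actions
a = choose_action(Q, s=0, rar=rar)
rar *= radr  # explore less as learning progresses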

Equations for: - Discounted Reward - Finite horizon - Infinite horizon

-Infinite horizon: maximize the sum of all rewards over all future steps, sum i=1 to inf [ r(i) ]
-Finite horizon: maximize the sum of rewards over some fixed number of steps n, sum i=1 to n [ r(i) ]
-Discounted reward: sum i=1 to inf [ gamma^(i-1) * r(i) ]
-With discounting, a reward now is worth more than the same reward later; the goal is to maximize the sum of all (discounted) future rewards
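A quick numeric check of how gamma weights future rewards, using a hypothetical stream of constant rewards:

rewards = [1.0] * 20  # hypothetical: a reward of 1 at every step
for gamma in (0.0, 0.5, 0.95, 1.0):
    discounted = sum(gamma ** (i - 1) * r for i, r in enumerate(rewards, start=1))
    print(gamma, round(discounted, 3))
# gamma = 0 counts only the immediate reward; gamma = 1 counts all 20 rewards equally.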

Calculating T transition matrix

T[s,a,s'] = Tc[s,a,s'] / (sum over i of Tc[s,a,i]), where Tc[s,a,s'] counts how many times the transition s,a -> s' has been observed
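A minimal sketch of turning transition counts into probabilities; initializing the counts to a tiny value (e.g. 0.00001) to avoid division by zero is one common convention, assumed here:

import numpy as np

num_states, num_actions = 10, 3
Tc = np.full((num_states, num_actions, num_states), 1e-5)  # transition counts

# After each real experience tuple <s, a, s'>, increment the count:
s, a, s_prime = 2, 1, 4
Tc[s, a, s_prime] += 1

# T[s, a, s'] = Tc[s, a, s'] / sum_i Tc[s, a, i]
T = Tc / Tc.sum(axis=2, keepdims=True)
print(T[s, a].sum())  # each T[s, a, :] is a probability distribution summing to 1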

Time Value

Time Value = Premium - Intrinsic Value In general, the more time to expiration, the greater the time value of the option. It represents the amount of time the option position has to become profitable due to a favorable move in the underlying price. In most cases, investors are willing to pay a higher premium for more time (assuming the different options have the same exercise price), since time increases the likelihood that the position will become profitable. Time value decreases over time and decays to zero at expiration. This phenomenon is known as time decay.

Q-Learning Update Rule

alpha = learning rate, gamma = discount rate, Q' = the new, improved version of Q
The formula for computing Q for any state-action pair <s, a>, given an experience tuple <s, a, s', r>, is:
Q'[s, a] = (1 - α) · Q[s, a] + α · (r + γ · Q[s', argmax_a'(Q[s', a'])])
Here:
• r = R[s, a] is the immediate reward for taking action a in state s,
• γ ∈ [0, 1] (gamma) is the discount factor used to progressively reduce the value of future rewards,
• s' is the resulting next state,
• argmax_a'(Q[s', a']) is the action that maximizes the Q-value among all possible actions a' from s', and
• α ∈ [0, 1] (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values.
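The update rule translated directly into code, assuming a tabular Q stored as a 2-D numpy array (the variable names are illustrative):

import numpy as np

def q_update(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9):
    # Blend the old estimate with the new one: immediate reward plus the
    # discounted value of the best action available from the next state.
    best_next = Q[s_prime, np.argmax(Q[s_prime])]  # Q[s', argmax_a'(Q[s', a'])]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_next)
    return Q

Q = np.zeros((100, 3))  # hypothetical 100 states x 3 actions
Q = q_update(Q, s=5, a=1, s_prime=6, r=0.01)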

Ensemble Learner

-Combine multiple different models and take the mean of their outputs at the end
-Lower error than an individual learner
-Each type of learner has its own bias, so combining them is better
-Less overfitting

Linear regression (parametric learning)

Finds parameters for a model: use the data to estimate the parameters, then throw the data away.
Problems for trading:
-Noisy and uncertain - there is value to be found, but it has to be accumulated over many trading opportunities
-Challenging to estimate confidence
-Holding time/allocation is uncertain
-RL policy learning handles these issues better

Overfitting

In-sample error decreasing while out-of-sample error is increasing.
For KNN, when k=1 the model fits the training data perfectly, so in-sample error is low but out-of-sample error is high. As k increases the model becomes more general and out-of-sample error decreases. After a while the model becomes too general and starts performing worse on both train and test.

K Nearest Neighbor (KNN / instance based)

-Keep the historical X, Y pair data; when we want to make a prediction we consult that data directly
-Use the mean of the y values of the k nearest neighbors as the prediction
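A bare-bones KNN regression query, assuming X is an array of feature rows and Y the matching labels (an illustrative sketch, not the course API):

import numpy as np

def knn_predict(X, Y, query, k=3):
    # Distance from the query point to every stored data point.
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest points
    return Y[nearest].mean()         # mean of their y values

X = np.array([[1.0], [2.0], [3.0], [10.0]])
Y = np.array([1.1, 2.1, 2.9, 10.2])
print(knn_predict(X, Y, query=np.array([2.5]), k=3))  # about 2.03, mean of the 3 closest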

Boosting

-Modified bagging in which each new bag preferentially samples the data points that were modeled poorly by the previous bags
-The more bags we have, the more likely boosting (AdaBoost) is to overfit
-Boosting and bagging algorithms are just wrappers around existing learners (KNN, LinReg, decision trees)

Backtesting

roll back time and test system for different time periods

RMS Error

-Root mean square error: take the square root of the mean of the squared errors; an approximation of the average error
-Out-of-sample (test) error is generally larger than in-sample error
-A good tool, but with financial data an ordinary cross-validation split can peek into future data, which is bad; we avoid this with roll-forward cross validation
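The formula in code, on hypothetical predicted vs. actual arrays:

import numpy as np

def rmse(y_pred, y_actual):
    # Square the errors, average them, then take the square root.
    return np.sqrt(np.mean((y_pred - y_actual) ** 2))

y_actual = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
print(rmse(y_pred, y_actual))  # about 0.194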

Bagging

Train the same learning algorithm on different subsets of the original data, where each subset (bag) is selected at random with replacement; combine the resulting learners' outputs
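A sketch of bagging wrapped around any learner exposing train/query methods; that interface and the helper names are assumptions for illustration, not a specific library API:

import numpy as np

def bag_indices(n, num_bags, rng=np.random.default_rng(0)):
    # Each bag draws n indices at random WITH replacement from the n data points.
    return [rng.integers(0, n, size=n) for _ in range(num_bags)]

def bagged_predict(learners, X_train, Y_train, X_query):
    preds = []
    for idx, learner in zip(bag_indices(len(X_train), len(learners)), learners):
        learner.train(X_train[idx], Y_train[idx])  # train this learner on its bag
        preds.append(learner.query(X_query))       # query the trained learner
    return np.mean(preds, axis=0)                  # ensemble output: mean over the bags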

Correlation

We plot our test (actual) values vs. our predicted values:
1) points along a straight line = good/high correlation
2) a scattered shotgun blast = bad/low correlation
Range: -1 to +1

Supervised vs Unsupervised

Supervised: we show the machine many examples of X and Y, which is how it learns to predict Y from X
Unsupervised: only inputs, no labels

Kernel Regression

Weight the contributions of each of the nearest neighbors according to how distant they are. This is instance-based and just an alternative to KNN (which weights each neighbor equally).

