ML4T Final Prep
KNN where K varies - when does it overfit?
-k=1: the model tags each individual training point, so it fits the training data exactly and is most likely to overfit
-as k increases we are less likely to overfit
-k=N: the prediction is a flat line (the mean of the data)
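A minimal sketch of this effect, assuming scikit-learn is available (the synthetic data and names below are just for illustration):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    X = np.linspace(0, 10, 50).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(0, 0.2, 50)   # noisy training data

    for k in (1, 5, len(X)):
        model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        # k=1 reproduces the training data exactly (overfit);
        # k=len(X) predicts the overall mean everywhere (the flatline)
        print(k, model.score(X, y))   # in-sample R^2 drops as k grows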
ML Optimizer and Parameterized model
-Optimizers find minimum values of functions and can be used to build parameterized models from data
-The optimizer marches down the graph (e.g., gradient descent) to find a minimum
-import scipy.optimize as spo; spo.minimize(f, xguess, method='SLSQP', options={'disp': True})
-The minimizer finds the coefficients C0, C1, etc. of the model, e.g. f(x) = mx + b written as f(x) = C0*x + C1
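A small runnable version of that snippet, minimizing squared error to fit the line C0*x + C1 (the data points below are made up for illustration):

    import numpy as np
    import scipy.optimize as spo

    def error(C, data):
        # sum of squared differences between the y values and the line C[0]*x + C[1]
        return np.sum((data[:, 1] - (C[0] * data[:, 0] + C[1])) ** 2)

    data = np.array([[0, 1.0], [1, 3.1], [2, 4.9], [3, 7.2]])   # made-up (x, y) points
    Cguess = np.array([0.0, 0.0])
    result = spo.minimize(error, Cguess, args=(data,), method='SLSQP',
                          options={'disp': True})
    print("fitted line: y = {:.2f}x + {:.2f}".format(result.x[0], result.x[1]))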
Write Put
-We give someone else the option to sell us the stock at the strike price if they choose to do so -we collect the premium and want the stock price to stay at or above the strike so the put expires worthless -loss is bounded (the stock can only fall to zero), though it can be large
RL as a Trading Problem
-In trading: the environment is the market, actions are trades, states are stock factors, and the reward is return
-actions => buy, sell, do nothing
-states => holding long, Bollinger value, daily return
-rewards => return from trade, daily return
Q-Learning Gamma and Alpha
-A low value of gamma means we value later/future rewards less (similar to a high discount rate) -A value of gamma near 1 means we value future rewards much more (a reward 20 steps in the future is worth nearly as much as a reward now) -Low values of alpha cause us to learn more slowly -High values of alpha cause us to learn more quickly
2 approaches to finding policies from experience tuples
-Model-based: build models of T[s,a,s'] and R[s,a], then use value iteration or policy iteration -Model-free: Q-learning
Reinforcement Learning
-The learners discussed previously provide a forecast, which ignores how certain the price change is and when to exit a position -RL learns policies that say which specific action to take -We take the action that maximizes reward
Dyna
-Problem with Q-learning: it takes many experience tuples to converge. This is expensive when interacting with the real world, because we have to take a real step (execute a trade) to gather each tuple
-Dyna addresses this by building models of T (transition matrix) and R (reward matrix); after each real interaction with the world we hallucinate many additional interactions and use them to update the Q table
-Rather than interacting with the real world for every update, we hallucinate experiences: we leverage each real experience to update the Q table many more times (maybe 100)
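A rough sketch of the hallucination step, assuming Q, Tc (transition counts, initialized with a small positive value), and R (expected rewards) are NumPy arrays already being updated from real experience; the function and variable names here are my own, not the course API:

    import numpy as np

    def dyna_hallucinate(Q, Tc, R, alpha, gamma, n_updates=100):
        num_states, num_actions = Q.shape
        for _ in range(n_updates):
            s = np.random.randint(num_states)            # random previously seen state
            a = np.random.randint(num_actions)           # random action
            probs = Tc[s, a] / Tc[s, a].sum()            # learned transition probabilities
            s_prime = np.random.choice(num_states, p=probs)
            r = R[s, a]                                  # expected reward from the model
            Q[s, a] = (1 - alpha) * Q[s, a] + \
                      alpha * (r + gamma * Q[s_prime].max())
        return Q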
RL Problem - defining variables
-Sense the environment, think, take an action, then repeat
-The Q-learning algorithm determines the policy pi that maximizes reward
-S = state, what we observe in the environment
-pi(s) = policy, which maps an input state to an action
-a = action output by the policy that affects the environment
-T = transition function: taking an action in a state moves us to a new state s'
-R = reward associated with taking a particular action in a particular state
-In trading: the environment is the market, actions are trades (buy/sell/hold), states are stock factors, and the reward is return
Q Learning for Trading - possible state values
-State values come from discretized indicators (e.g., Bollinger value, daily return, whether we are holding the stock) -there is also an algorithm to calculate the thresholds used to discretize each indicator; see the sketch below
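A sketch of one way to compute those thresholds, assuming each indicator is discretized into a fixed number of bins by sorting the training data (function names are mine):

    import numpy as np

    def compute_thresholds(indicator_values, steps=10):
        # sort the indicator and take evenly spaced cut points as bin boundaries
        data = np.sort(indicator_values)
        step_size = len(data) // steps
        return np.array([data[(i + 1) * step_size - 1] for i in range(steps - 1)])

    def discretize(value, thresholds):
        # map a raw indicator value to an integer in [0, steps - 1]
        return int(np.searchsorted(thresholds, value))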
LinReg Overfitting where d (degree) varies
-As we increase the polynomial degree d we are more likely to overfit (with x^3 we get that extra curl compared with just x^2)
Covered Call
-Buy a stock, then write a call: we give someone else the privilege to buy the stock away from us at the strike price -we keep the premium but miss out on upside potential if the stock gets called away
Dyna Q Recap
-Direct RL from real experience tuples gathered by acting in an environment -updating an internal model of the environment -using the model to simulate additional experiences
Write Call
-Initially profitable because we collect the premium -gives someone else the right to buy the stock from us at the strike price if they choose
Q-Learning
-Model-free approach: it does not know about or use models of T or R -builds a table of utility values (Q values) as the agent interacts with the world; the Q values can be used at each step to select the best action based on what has been learned so far -guaranteed to converge to an optimal policy (given sufficient exploration)
Options
-Refers to exchange-traded options, not employee stock options
-A legal contract which gives the buyer the right, but not the obligation, to buy or sell the underlying stock at a specific price on or before the expiration date (US options: on or before; European options: only on the expiration date)
-Specific price = strike price
-Last = premium, quoted per share (although an options contract is written for a round lot of 100 shares)
-Break even (for a call buyer) = strike price + premium
Buy Call
-The right to purchase the stock at the strike price on or before the expiration date -we pay the premium up front -potential profit is unlimited
Buy Put
-The right to sell the stock at the strike price on or before the expiration date -profit is bounded (the stock can only fall to zero)
Butterfly Option
-Strategy for a sideways market -loss is capped
-Example: AAPL is at 111. Buy a 105 call and a 115 call, write two 110 calls
-Premium: -7.16 + (2 * 2.73) - 0.53 = -2.23 per share
-Cost to enter the butterfly: $223 (contracts are for 100 shares)
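A quick check of the per-share payoff at expiration using the premiums above (illustrative arithmetic only):

    def butterfly_payoff(price_at_exp):
        # long one 105 call, short two 110 calls, long one 115 call (per share)
        long_105 = max(price_at_exp - 105, 0)
        short_110 = -2 * max(price_at_exp - 110, 0)
        long_115 = max(price_at_exp - 115, 0)
        net_premium = -7.16 + 2 * 2.73 - 0.53        # -2.23 per share to enter
        return long_105 + short_110 + long_115 + net_premium

    for p in (100, 110, 120):
        print(p, round(butterfly_payoff(p), 2))      # -2.23, 2.77, -2.23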
Cross validation
-Training is generally a 60/40 split, but in cross validation we slice the data into several chunks and train/test on different portions (e.g., 5 folds, each training on 80% of the data and testing on the remaining 20%)
Q Learning - What to use as Reward for Fastest Convergence
-Using daily return as the reward gives more frequent feedback and converges faster (compared with only rewarding the cumulative return when a position is closed)
Model Free vs Model Based
1) In Model-Free Reinforcement Learning (for example Q-learning), we do not learn a model of the world. We do not explicitly learn transition probabilities or reward functions; we only try to learn the Q-values of actions, or only learn the policy. Essentially, we just learn the mapping from states to actions, perhaps modelling how much reward we expect to get in the long run. The algorithm learns directly when to take what action. 2) In Model-Based Reinforcement Learning, we keep track of the transition probabilities and the reward function. These are typically learned as parametrized models. The models learn what the effect of taking a particular action in a particular state will be. This results in an estimated Markov Decision Process which can then be solved either exactly or approximately, depending on the setting and what is feasible. Model-based techniques tend to do better since they keep a more detailed model of the world; however, for this very same reason, they require more data. Q-learning is appealing because it only needs to know what action to take, not why, though knowing why does give you a more detailed model of the world.
Steps to Optimize a Portfolio
1) Provide a function f(x) to minimize (e.g., f(x) = negative Sharpe ratio) 2) Provide an initial guess for x (where x is the vector of allocations) 3) Call the optimizer
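A hedged sketch of those three steps, assuming daily_rets is a days-by-assets NumPy array of daily returns, 252 trading days per year, and a risk-free rate of zero:

    import numpy as np
    import scipy.optimize as spo

    def neg_sharpe(allocs, daily_rets):
        # negative Sharpe ratio, so minimizing it maximizes the Sharpe ratio
        port_rets = daily_rets @ allocs
        return -np.sqrt(252) * port_rets.mean() / port_rets.std()

    def optimize_allocs(daily_rets):
        n = daily_rets.shape[1]
        guess = np.ones(n) / n                                    # equal-weight initial guess
        bounds = [(0.0, 1.0)] * n                                 # no shorting or leverage
        cons = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1},)  # allocations sum to 1
        result = spo.minimize(neg_sharpe, guess, args=(daily_rets,),
                              method='SLSQP', bounds=bounds, constraints=cons)
        return result.x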
Q-Learning Procedure
1) Select training data 2) Iterate over time, obtaining experience tuples and updating Q 3) Test the resulting policy pi 4) Repeat until converged -we are converged when the return stops improving
Options Chain
A form of quoting options prices through a list of all of the options for a given security. An option chain is simply a listing of all the put and call option strike prices along with their premiums for a given maturity period. The majority of online brokers and stock trading platforms display option quotes in the form of an option chain. -Strike price = the price at which the holder can buy/sell on or before the expiration date -Last = the premium
Correlation vs RMSE
As RMSE increases, correlation generally decreases; the two are usually inversely related (though not always)
LinReg vs KNN vs Decision Tree Performance (cost of query, cost of learning)
Cost of learning (least to most): 1) KNN - just store the data in RAM and query later 2) LinReg 3) Decision Tree - especially a decision forest
Cost of query (least to most): 1) LinReg - a parametric model is cheap to evaluate 2) Decision Tree - a binary tree over 1000 elements needs at most about 10 questions 3) KNN - worst, since we have to compute the distance to ALL data points, sort them, and find the closest k points
Quality: LinReg is often not great
Space: LinReg wins, since it stores only the parameters; KNN must store all the data
Normalization: decision trees do not need normalized data; KNN does
Parametric: training is slow but querying is fast
Non-parametric/instance-based: training is fast but querying is slow
RL: What to Optimize
The goal is to find a policy pi(s) that chooses actions so as to maximize the (discounted) sum of future rewards
Intrinsic Value of Stock
Intrinsic Value (Call) = Underlying Price - Strike Price
Intrinsic Value (Put) = Strike Price - Underlying Price
In-the-Money (Call): Strike Price < Underlying Price
In-the-Money (Put): Strike Price > Underlying Price
KNN vs DT - which needs to be normalized?
KNN data needs to be normalized (it is distance-based); decision trees do not need normalized data
Options Pros and Cons
Pros: 1) Higher leverage - you can control more stock using less money 2) You can't lose more than the premium paid up front ($273 in our example with 100 shares and a premium of $2.73)
Cons: 1) The premium is paid (and potentially lost) up front 2) Expiration dates add another layer to the bet - it's usually a short time period 3) You don't own the stock, so no dividends, voting rights, etc.
Q-Learning Pros and Cons
Pros: 1) Model-free approaches can easily be applied to domains where not all states and/or transitions are fully defined 2) No additional data structures are needed to store the transitions T[s,a,s'] or rewards R[s,a] 3) The Q value for any state-action pair takes future rewards into account; it encodes both the best possible value of a state and the best policy in terms of the action that should be taken
Cons: 1) Reward often comes in the future - representing this requires look-ahead and careful weighting 2) Taking random actions (such as trades) just to learn a good strategy is costly (you will lose money on those trades) 3) #2 can be mitigated by simulating the effect of actions on historical data
Q-Learning Variables
Q[s,a] = immediate reward + discounted future reward. Q represents the value of taking action a in state s: the immediate reward plus the discounted reward for future actions.
-To act, look over all actions in the Q table and choose the one for which Q[s,a] is maximized; this is denoted argmax_a Q[s,a]
-The optimal policy is pi*(s), with corresponding optimal values Q*[s,a]
Decision trees
A query comes in and bounces down the tree - each node of the tree represents a yes/no question. We finally reach a leaf, which holds the regression value that is returned. Decision forests: many decision trees together - query each one and combine the results (e.g., take the mean)
Markov Decision Problems
RL problems are a form of Markov decision problem:
-a set of states S
-a set of actions A
-a transition function T[s,a,s'] - the probability that if we are in state s and take action a, we end up in state s'; summed over all possible next states s', the probabilities equal 1
-a reward function R[s,a] - the reward we get for taking action a in state s
The goal is to find a policy pi(s) that maximizes reward
Calculating R
R[s,a] = expected reward if we are in state s and take action a; r = the immediate reward observed in a single experience tuple. R can be estimated from experience, e.g. with a running average: R'[s,a] = (1 - alpha) * R[s,a] + alpha * r
Regression vs Classification
Regression - try to make a numerical prediction. Classification - classify each example into one of several types/classes
Q Learning Random Action
-Success depends on exploring as much of the state and action space as possible -we do this by flipping a coin twice: 1) take a random action, or pick argmax_a Q[s,a]? 2) if random, which random action? -this is controlled via the random action rate (RAR): a high RAR at the beginning forces us to explore the states, and it is typically decayed over time
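A minimal sketch of the two coin flips, assuming Q is a 2-D NumPy array indexed by [state, action] and rar is the current random action rate:

    import numpy as np

    def choose_action(Q, s, rar):
        if np.random.random() < rar:              # flip 1: explore or exploit?
            return np.random.randint(Q.shape[1])  # flip 2: which random action?
        return int(np.argmax(Q[s]))               # exploit: best known action

    # rar typically starts high and is decayed after each step, e.g. rar *= 0.99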
Equations for: - Discounted Reward - Finite horizon - Infinite horizon
-Discounted reward: sum over i = 1 to infinity of gamma^(i-1) * r_i
-A reward now is worth more than the same reward later; the goal is to maximize the sum of all future (discounted) rewards
-Infinite horizon: sum of all rewards over all future steps, sum over i = 1 to infinity of r_i
-Finite horizon: sum of rewards over some fixed number of steps n, sum over i = 1 to n of r_i
Calculating T transition matrix
T[s,a,s'] = Tc[s,a,s'] / sum over i of Tc[s,a,i], where Tc[s,a,s'] counts how many times the transition (s, a, s') has been observed
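The same formula in code, assuming Tc is a NumPy array of transition counts initialized with a small positive value so we never divide by zero:

    import numpy as np

    num_states, num_actions = 100, 3
    Tc = np.full((num_states, num_actions, num_states), 1e-5)   # tiny prior count

    def observe(Tc, s, a, s_prime):
        Tc[s, a, s_prime] += 1                   # count each observed transition

    def T(Tc, s, a):
        # T[s, a, s'] = Tc[s, a, s'] / sum over i of Tc[s, a, i]
        return Tc[s, a] / Tc[s, a].sum()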
Time Value
Time Value = Premium - Intrinsic Value In general, the more time to expiration, the greater the time value of the option. It represents the amount of time the option position has to become profitable due to a favorable move in the underlying price. In most cases, investors are willing to pay a higher premium for more time (assuming the different options have the same exercise price), since time increases the likelihood that the position will become profitable. Time value decreases over time and decays to zero at expiration. This phenomenon is known as time decay.
Q-Learning Update Rule
alpha = learning rate, gamma = discount rate, Q' = the new, improved version of Q.
The formula for updating Q for a state-action pair <s, a>, given an experience tuple <s, a, s', r>, is:
Q'[s, a] = (1 - α) · Q[s, a] + α · (r + γ · Q[s', argmax_a'(Q[s', a'])])
Here:
• r = R[s, a] is the immediate reward for taking action a in state s
• γ ∈ [0, 1] (gamma) is the discount factor used to progressively reduce the value of future rewards
• s' is the resulting next state
• argmax_a'(Q[s', a']) is the action that maximizes the Q-value among all possible actions a' from s'
• α ∈ [0, 1] (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values
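The update rule translated directly into code, assuming Q is a 2-D NumPy array indexed by [state, action]:

    def q_update(Q, s, a, s_prime, r, alpha, gamma):
        # Q'[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * max over a' of Q[s', a'])
        best_future = Q[s_prime].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_future)
        return Q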
Ensemble Learner
Combine multiple different models and take the mean of their outputs -lower error than any individual learner -each type of learner has its own bias, so combining them is better -less overfitting
Linear regression (parametric learning)
Finds parameters for a model: use the data to estimate the parameters, then throw the data away.
Problems for trading:
-the data is noisy and uncertain - there is value to be found, but it has to be accumulated over many trading opportunities
-it is challenging to estimate confidence in a forecast
-holding time and allocation are uncertain
-RL-style policy learning handles these issues better
Overfitting
In-sample error decreasing while out-of-sample error increases. For KNN, when k=1 the model fits the training data perfectly, so in-sample error is low but out-of-sample error is high. As k increases the model becomes more general and out-of-sample error decreases. After a while the model becomes too general and starts performing worse on both the training and the test data.
K Nearest Neighbor (KNN / instance based)
Keep the historical X, Y pair data - when we want to make a prediction we use that data directly -find the k nearest neighbors of the query point and use the mean of their y values
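A minimal NumPy sketch of that prediction rule (names are mine):

    import numpy as np

    def knn_predict(Xtrain, ytrain, xquery, k=3):
        dists = np.linalg.norm(Xtrain - xquery, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:k]                   # indices of the k closest points
        return ytrain[nearest].mean()                     # prediction = mean of their y values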
Boosting
A modified form of bagging where each new bag favors the data points that were modeled poorly by the previous bags -the more bags we use, the more likely boosting (e.g., AdaBoost) is to overfit -boosting and bagging are just wrappers around existing learning algorithms (KNN, LinReg, decision trees)
Backtesting
Roll back time and test the system over different historical time periods
RMS Error
Root mean squared error - the square root of the mean of the squared errors: RMSE = sqrt(mean((y_predict - y_test)^2)). An approximation of the average error. Out-of-sample (test) error is generally larger than in-sample error. -a good tool, but with financial data ordinary cross validation can peek into future data, which is bad; we can avoid this with roll-forward cross validation (the training window always comes before the test window)
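The formula in code (a trivial sketch):

    import numpy as np

    def rmse(y_test, y_pred):
        # square root of the mean of the squared errors
        return np.sqrt(((y_test - y_pred) ** 2).mean())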
Bagging
Train the same learning algorithm on different subsets of the original data, with each subset selected at random with replacement
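A sketch of the sampling step, assuming n training rows and that one learner is trained per bag and their predictions averaged (names are mine):

    import numpy as np

    def bag_indices(n, num_bags, seed=0):
        rng = np.random.default_rng(seed)
        # each bag samples n rows at random, with replacement
        return [rng.integers(0, n, size=n) for _ in range(num_bags)]

    # train one learner per bag on (X[idx], y[idx]) and average their predictions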
Correlation
Plot predicted values against actual test values: 1) a tight straight line means good/high correlation 2) a scattered shotgun pattern means poor correlation. Range: -1 to +1
Supervised vs Unsupervised
Supervised - we show the machine many examples of X and Y pairs, which is how it learns to predict Y from X. Unsupervised - only inputs, no labels
Kernel Regression
Weight the contribution of each of the nearest neighbors according to how distant it is from the query point. This is instance-based and is an alternative to KNN (which weights all k neighbors equally)