Competitive Programming - Bit Shifting, Formulas, and More., Machine Learning - Andrew Ng, Vanilla JS Project Bits, Programming Paradigms & Functional Programming & Haskell, Angular 2, Javascript: OOP, Design Patterns and more, JS Datastructures, Rea...
Gradient Descent for linear regression?
(Review again)
How to check if a number is a power of two?
(n & (n - 1)) == 0 this works because every power of 2 has just one 1 in binary
Get absolute value of n using bit ops?
(n ^ (n >> 31)) - (n >> 31);
Add one to n using bit ops
-~n
The sum of powers of two is...
2^(n+1) - 1
What is a Node?
A Node is an interface from which a number of DOM types inherit and allow these various types to be treated or tested similarly.
Who founded Lambda Calculus?
Alonzo Church (1930)
What does it mean for an algorithm to converge?
An iterative algorithm is said to converge when as the iterations proceed the output gets closer and closer to a specific value. In some circumstances, an algorithm will diverge; its output will undergo larger and larger oscillations, never approaching a useful result. The "converge to a global optimum" phrase in your first sentence is a reference to algorithms which may converge, but not to the "optimal" value (e.g. a hill-climbing algorithm which, depending on initial conditions, may converge to a local maximum, never reaching the global maximum).
What is an octet? Why do we use the term?
An octet is a unit of digital information in computing and telecommunications that consists of eight bits. The term is often used when the term byte might be ambiguous, since historically there was no standard definition for the size of the byte.
What function should you use when intentionally iterating with a predicate that produces side effects?
Array.prototype.foreach(predicate) -- as recommended by Kyle Simpson
Classifier?
Classifier: A classifier is a special case of a hypothesis (nowadays, often learned by a machine learning algorithm). A classifier is a hypothesis or discrete-valued function that is used to assign (categorical) class labels to particular data points. In the email classification example, this classifier could be a hypothesis for labeling emails as spam or non-spam. However, a hypothesis must not necessarily be synonymous to a classifier. In a different application, our hypothesis could be a function for mapping study time and educational backgrounds of students to their future SAT scores.
What is feature scaling?
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. The range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization[citation needed]. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance[citation needed]. If one of the features has a broad range of values, the distance will be governed by this particular feature[citation needed]. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance[citation needed]. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it[citation needed]. Some methods are rescaling, standardization, scaling to unit length.
Synonym for Input variable?
Features
What letter is typically used to depict a cost function?
Function J
Derive SSE
Given a linear regression model, the difference at each predicted point with the correct point is given by diff = y_i - (mx + b)
What is a nibble in Computer Programming?
Half an octet, 4-bit aggregation 0xF
What is "hypothesis" in machine learning?
Hypothesis: A hypothesis is a certain function that we believe (or hope) is similar to the true function, the target function that we want to model. In context of email spam classification, it would be the rule we came up with that allows us to separate spam from non-spam emails.
Inflection points
Inflection points are where the function changes concavity. Since concave up corresponds to a positive second derivative and concave down corresponds to a negative second derivative, then when the function changes from concave up to concave down (or vise versa) the second derivative must equal zero at that point. So the second derivative must equal zero to be an inflection point. But don't get excited yet. You have to make sure that the concavity actually changes at that point.
First FP programming language?
Lisp -- this is important because Lisp is often used as a good comparison and frequently referenced.
Model definition?
Model: In machine learning field, the terms hypothesis and model are often used interchangeably. In other sciences, they can have different meanings, i.e., the hypothesis would be the "educated guess" by the scientist, and the model would be the manifestation of this guess that can be used to test the hypothesis.
Can you assign to a const after declaring the const?
No. A const promises that we won't change the value of the variable for the rest of its life and this is considered a reassignment or rewrite.
Sum of geometric series?
S_n = a_1(1-r^n)/(1-r)
Sum of an arithmetic sequence?
S_n = terms(a_1 + a_n) / 2
Supervised learning
Supervised learning is a type of machine learning algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can make predictions of the response values for a new dataset. A test dataset is often used to validate the model. Using larger training datasets often yield models with higher predictive power that can generalize well for new datasets. Called Supervised learning BECAUSE the data is labeled with the "correct" responses.
What is SSE?
The sum of squared error
3D Surface Plot - how can it be used to plot the cost function?
Theta 0 and Theta 1 in a univariate linear regression can be plotted on the x and y axes. the Z axis will indicate the actual cost
Training sample definition?
Training sample: A training sample is a data point x in an available training set that we use for tackling a predictive modeling task. For example, if we are interested in classifying emails, one email in our dataset would be one training sample. Sometimes, people also use the synonymous terms training instance or training example.
Unsupervised learning
Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning. Unsupervised learning is closely related to the problem of density estimation in statistics.[1] However unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data.
Nth term of geometric series?
a_n = a_1 * r^(n-1)
Nth term of arithmetic sequence?
a_n = a_1 + (n-1)d
x^i_j notation in ML means?
an index into a training set for the ITH training example and JTH feature (input variable)
character literals in C++
completely distinct from string literals - and always enclosed in single quotes whereas strings are always enclosed in double quotes.
How to detect if two integers have opposite signs?
int x, y; // input values to compare signs bool f = ((x ^ y) < 0); // true iff x and y have opposite signs
How to clear a bit
number &= ~(1 << x);
How to toggle a bit?
number ^= 1 << x;
hypothesis model
remember that the HYPOTHESIS MODEL or SET is what is depicted with h(theta)
Document.querySelector(String selector)
returns the first element node within the document in document order that matches the specified selectors.
What does this hypothesis represent? h_theta(x) = theta_0 + theta_1 x
univariate linear regression model
local variables
varaibles defined inside a pair of curly braces and exist only while executing the part of the program within those braces.
when does a stream stop writing
whitespace encounter
How to fill first byte with 1's to a variable x?
x |= (0xFF)
Subtract one to n using bit ops
~-n
What is the downside of using an alpha (learning rate) that is too small?
Gradient descent can be way too slow.
What is the downside of using an alpha (learning rate) that is too big?
Gradient descent can overshoot the minimum and it may fail to converge or even diverge.
Document.createElement
In an HTML document, the Document.createElement() method creates the specified HTML element or an HTMLUnknownElement if the given element name isn't a known one.
Higher derivatives?
Let f be a differentiable function, and let f ′(x) be its derivative. The derivative of f ′(x) (if it has one) is written f ′′(x) and is called the second derivative of f. Similarly, the derivative of a second derivative, if it exists, is written f ′′′(x) and is called the third derivative of f. Continuing this process, one can define, if it exists, the nth derivative as the derivative of the (n-1)th derivative. These repeated derivatives are called higher-order derivatives. The nth derivative is also called the derivative of order n.
Synonym for output variable?
Targets
How do you change the sign of an integer?
~x + 1
"Batch" Gradient Descent (BGD or GD)
Each step of gradient descent uses all the training examples. batch GD - This is different from (SGD - stochastic gradient descent or MB-GD - mini batch gradient descent) In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. In case of very large datasets, using GD can be quite costly since we are only taking a single step for one pass over the training set -- thus, the larger the training set, the slower our algorithm updates the weights and the longer it may take until it converges to the global cost minimum (note that the SSE cost function is convex).
What is a side effect?
In computer science, a function or expression is said to have a side effect if it modifies some state or has an observable interaction with calling functions or the outside world. For example, a particular function might modify a global variable or static variable, modify one of its arguments, raise an exception, write data to a display or file, read data, or call other side-effecting functions. In the presence of side effects, a program's behaviour may depend on history; that is, the order of evaluation matters. Understanding and debugging a function with side effects requires knowledge about the context and its possible histories
For a sufficiently small alpha...
J(Theta) should decrease EVERY iteration
Regression vs Classification?
Regression: the output variable takes continuous values. - Price of house given a size. Classification: the output variable takes class labels, or discrete value output - Breast cancer, malignant or benign? Almost like quantitative vs categorical
What is left associativity?
a Q b Q c If Q is left associative, then it evaluates as (a Q b) Q c And if it is right associative, then it evaluates as a Q (b Q c)
How to check if the nth bit is set?
if (x & (1 << n)) { //set } else { //not-set }
Get Max Int (Bit Ops)
int maxInt = ~(1 << 31); int maxInt = (1 << 31) - 1; int maxInt = (1 << -1) - 1;
In 8-bit Two's Complement what is -1 in binary, what about -127?
1111 1111
What are contour plots?
A contour plot is a graphical technique for representing a #D surface by plotting constant z slices, called contours onto a 2Dimensional format. That is, given a value for z, lines are drawn for connecting the x,y, coordinates twhere that z value occurs. The circles in a contour plot are called level sets - the function J is equal here.The center of the contour plot is the minimum of the cost function typically in ML.
Cost function vs Gradient Descent?
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
Array.prototype.slice vs Array.prototype.slice is a good demonstration of what?
A pure function vs a non-pure function. arr.slice([begin[, end]]) --> begin is zero-indexed, end is zero-indexed but is not included in the slice. slice does NOT alter the given array. It returns a shallow copy of elements from the original array. array.splice(start, deleteCount[, item1[, item2[, ...]]]) --> start is the beginning index to start changing the array, deleteCount is the number of elements to delete, items are anything to be added var myFish = ['angel', 'clown', 'mandarin', 'surgeon']; // removes 0 elements from index 2, and inserts 'drum' var removed = myFish.splice(2, 0, 'drum'); // myFish is ['angel', 'clown', 'drum', 'mandarin', 'surgeon'] // removed is [], no elements removed // myFish is ['angel', 'clown', 'drum', 'mandarin', 'surgeon'] // removes 1 element from index 3 removed = myFish.splice(3, 1); // myFish is ['angel', 'clown', 'drum', 'surgeon'] // removed is ['mandarin'] // myFish is ['angel', 'clown', 'drum', 'surgeon'] // removes 1 element from index 2, and inserts 'trumpet' removed = myFish.splice(2, 1, 'trumpet'); // myFish is ['angel', 'clown', 'trumpet', 'surgeon'] // removed is ['drum'] // myFish is ['angel', 'clown', 'trumpet', 'surgeon'] // removes 2 elements from index 0, and inserts 'parrot', 'anemone' and 'blue' removed = myFish.splice(0, 2, 'parrot', 'anemone', 'blue'); // myFish is ['parrot', 'anemone', 'blue', 'trumpet', 'surgeon'] // removed is ['angel', 'clown'] // myFish is ['parrot', 'anemone', 'blue', 'trumpet', 'surgeon'] // removes 2 elements from index 2 removed = myFish.splice(myFish.length -3, 2); // myFish is ['parrot', 'anemone', 'surgeon'] // removed is ['blue', 'trumpet']
Why is it unnecessary to change alpha over time to ensure that the gradient descent converges to a local minimum?
As we approach a local minimum, the gradient descent will take smaller steps because of the change of the derivative or the steepness of the cost function J. Don't need to worry about divergence.
What is cross-validation?
Cross-validation, sometimes called rotation estimation,[1][2][3] is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset).[4] The goal of cross validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
Why do we square instead of using the absolute value when calculating variance and standard deviation?
First I'll answer the mathematical question asked in the question details, which I'm going to restate because I think it is stated wrong: The short answer is "Because of Jensen's inequality." See http://en.wikipedia.org/wiki/Jen... and the rest of the article for context. It says in particular that for a concave function What about the more general question, "Why variance?" I don't believe there is any compelling conceptual reason to use variance as a measure of spread. If forced to choose, my guess is that most people would say more robust measures like interquartile range or MAD better capture the concept of "spread" in most cases. But variance (and more generally "sum of squares") has some attractive properties, many of which flow from the Pythagorean theorem one way or another. Here some of them, without much math: We can decompose sums of squares into meaningful components like "between group variance" and "within-group variance." To generalize the above point, when a random variable Y Y is partly explained by another random variable X X there is a useful decomposition of the variance of Y Y into the part explained by X X and the unexplained part. (See http://en.wikipedia.org/wiki/Law...). If we think more broadly about mean squared error, this too can be decomposed into the sum of variance and squared bias. It is easy to interpret this total error as the sum of "systematic error" and "noise." Often we want to minimize our error. When the error is a sum of squares, we are minimizing something quadratic. This is easily accomplished by solving linear equations. So yes, variance and mean squared error are conveniences rather than conceptual necessities. But they are convenient conveniences.
What are polymorphic types in Haskell?
Haskell also incorporates polymorphic types---types that are universally quantified in some way over all types. Polymorphic type expressions essentially describe families of types. For example, (forall a)[a] is the family of types consisting of, for every type a, the type of lists of a. Lists of integers (e.g. [1,2,3]), lists of characters (['a','b','c']), even lists of lists of integers, etc., are all members of this family. (Note, however, that [2,'b'] is not a valid example, since there is no single type that contains both 2 and 'b'.)
Gradient Descent algorithm
Image result for gradient descent Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. used with simultaneous updates of the parameters of success Susceptible to falling into local optimum depending on initialization.
Pure Function conditions
In computer programming, a function may be considered a pure function if both of the following statements about the function hold: The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information or state that may change while program execution proceeds or between different executions of the program, nor can it depend on any external input from I/O devices (usually—see below). Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects or output to I/O devices (usually—see below).
What is a stream?
In computer science, a stream is a sequence of data elements made available over time. A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches Streams are processed differently from batch data - normal functions cannot operate on streams as a whole, as they have potentially unlimited data, and formally, streams are codata (potentially unlimited), not data (which is finite). Functions that operate on a stream, producing another stream, are known as filters, and can be connected in pipelines, analogously to function composition. Filters may operate on one item of a stream at a time, or may base an item of output on multiple items of input, such as a moving average. - *processed one at a time rather than in batches*
K-Fold?
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling (see below) is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used,[7] but in general k remains an unfixed parameter. When k=n (the number of observations), the k-fold cross-validation is exactly the leave-one-out cross-validation. In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels. Ultimately this helps fix the problem that we want to maximize both the training and test sets in cross-validation.
Convex functions?
In mathematics, a real-valued function defined on an interval is called convex (or convex downward or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph, in a Euclidean space (or more generally a vector space) of at least two dimensions. Equivalently, a function is convex if its epigraph (the set of points on or above the graph of the function) is a convex set. Well-known examples of convex functions include the quadratic function {\displaystyle x^{2}} x^{2} and the exponential function {\displaystyle e^{x}} e^{x} for any real number x. Convex functions play an important role in many areas of mathematics. They are especially important in the study of optimization problems where they are distinguished by a number of convenient properties. For instance, a (strictly) convex function on an open set has no more than one minimum.
What are first order methods in numerical analysis?
In numerical analysis, methods that have at most linear local error are called first order methods. They are frequently based on finite differences, a local linear approximation.
What is normalization?
In statistics and applications of statistics, normalization can have a range of meanings.[1] In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment. In the case of normalization of scores in educational assessment, there may be an intention to align distributions to a normal distribution. A different approach to normalization of probability distributions is quantile normalization, where the quantiles of the different measures are brought into alignment.
What is overfitting?
In statistics and machine learning, one of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data. In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data. how to avoid overfitting? cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison and more!
What is a simple linear regression?
In statistics, simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible. minimize squared error
Learning algorithm?
Learning algorithm: Again, our goal is to find or approximate the target function, and the learning algorithm is a set of instructions that tries to model the target function using our training dataset. A learning algorithm comes with a hypothesis space, the set of possible hypotheses it can come up with in order to model the unknown target function by formulating the final hypothesis
Node.appendChild()
Node.appendChild() method adds a node to the end of the list of children of a specified parent node. If the given child is a reference to an existing node in the document, appendChild() moves it from its current position to the new position.
Standardization vs normalization
Normalization rescales the values from to a range of [0,1]. This might useful in some cases where all parameters need to have the same positive scale, but outliers from data set are lost. Xchanged = (X - Xmin)/(Xmax-Xmin) Standardization rescales data to have a mean of 0 and standard deviation of 1 (unit variance). Xchanged = (x-mean)/sd For most applications standardization is recommended. In the business world, "normalization" typically means that the range of values are "normalized to be from 0.0 to 1.0". "Standardization" typically means that the range of values are "standardized" to measure how many standard deviations the value is from its mean. However, not everyone would agree with that. It's best to explain your definitions before you use them. In any case, your transformation needs to provide something useful.
Bitwise Swap Intuition x = x xor y y = x xor y x = x xor y
On line 1 we combine x and y (using XOR) to get this "hybrid" and we store it back in x. XOR is a great way to save information, because you can remove it by doing an XOR again. So, this is exactly what we do on line 2. We XOR the hybrid with y, which cancels out all the y information, leaving us only with x. We save this result back into y, so now they have swapped. On the last line, x still has the hybrid value. We XOR it yet again with y (now with x's original value) to remove all traces of x out of the hybrid. This leaves us with y, and the swap is complete! *The mathematics are fairly simple and work because XOR has a useful property, when you XOR A and B, you get a value C. If you XOR C and A you'll get B back, if you XOR C and B you'll get A back.*
What is the difference between a deep copy and a shallow copy?
Shallow copies duplicate as little as possible. A shallow copy of a collection is a copy of the collection structure, not the elements. With a shallow copy, two collections now share the individual elements. Deep copies duplicate everything. A deep copy of a collection is two collections with all of the elements in the original collection duplicated.
Target Function definition?
Target function: In predictive modeling, we are typically interested in modeling a particular process; we want to learn or approximate a particular function that, for example, let's us distinguish spam from non-spam email. The target function f(x) = y is the true function f that we want to model. The target function is the (unknown) function which the learning problem attempts to approximate.
Cocktail party effect/problem
The cocktail party effect is the phenomenon of being able to focus one's auditory attention on a particular stimulus while filtering out a range of other stimuli, much the same way that a partygoer can focus on a single conversation in a noisy room. Example of source separation.
What does the derivative of a function tell us?
The derivative of a function of a real variable measures the sensitivity to change of a quantity (a function value or dependent variable) which is determined by another quantity (the independent variable). Derivatives are a fundamental tool of calculus. For example, the derivative of the position of a moving object with respect to time is the object's velocity: this measures how quickly the position of the object changes when time is advanced. The derivative of a function of a single variable at a chosen input value, when it exists, is the slope of the tangent line to the graph of the function at that point. The tangent line is the best linear approximation of the function near that input value. For this reason, the derivative is often described as the "instantaneous rate of change", the ratio of the instantaneous change in the dependent variable to that of the independent variable.
What happens if you initialize a parameter at a local minimum and attempt to use gradient descent on it?
The derivative turns out to be zero because the tangent is a flat line meaning that regardless of alpha it is multiplied by zero, indicating no change.
What to do when J(Theta) is moving up and down in waves ?
Use a smaller Alpha!!
How to swap two values a,b using bits?
a ^= b; b ^= a; a ^= b; OR (easier to memorize) x = x xor y y = x xor y x = x xor y
Clustering
a method of unsupervised learning - a good way of discovering unknown relationships in datasets. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
multivariate linear regression
ask: why is the notation shorter and what does that convenience notation indicate?
why use features that are on a similar scale?
contour plots with differently scaled features will be extremely thin or extremely fat resulting in a very slow gradient descent (convergence is slower) get them to a -1 <= x <= 1 scale. poorly scaled is too large -100 to 100 or -0.00001 or 0.00001
How to make sure gradient descent is working properly?
create an automatic convergence test - declare convergence based on amount of decrease of J(Theta) plot on graph, y axis being J and axis being number of iterations.
Get Min Int (Bit Ops)
int minInt = 1 << 31; int minInt = 1 << -1;
What does J(Theta) increasing tell you about ur gradient descent?
it's not working lol. use a bigger alpha. on the other end if you use too big of an alpha you'll end up with a bowl shaped curve and you might be moving farther away from convergence.
What is a lambda expression?
lambda expression in computer programming, also called anonymous function, a function (or a subroutine) defined, and possibly called, without being bound to an identifier Lambda Expressions are nameless functions given as constant values. They can appear anywhere that any other constant may, but are typically written as a parameter to some other function. The canonical example is that you'll pass a comparison function to a generic "sort" routine, and instead of going to the trouble of defining a whole function (and incurring the lexical discontinuity and namespace pollution) to describe this comparison, you can just pass a lambda expression describing the comparison. HOWEVER, this misses one of the most important features of Lambda Expressions, which is that they execute in the context of their appearance. Therefore, they can use the values of the variables that are defined in that context. This differentiates function-pointers from true lambda expressions. In languages supporting mutable variables, proper lambda expressions offer the power change the values of those variables. Lambda expressions are rooted in lambda calculus.
How to set any bit?
number |= 1 << x;
SSE formula?
observation - mean for each observation squared.
What does theta typically represent in stat/ML?
quite often θ stands for the set of parameters of a distribution.
Definition of stochastic?
randomly determined; having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely.
How to iterate across an object with side effects?
use the for... in loop while checking .hasOwnProperty because for...in will check parents as well.