Data Science and Statistics

What is a "Non-Linear Kernel"?

"Kernels" are transformation functions in Support Vector Machines. They take input data that isn't linearly separable (no line can be drawn between the points in their current space to separate them) and map it into a higher-dimensional space. Within the higher-dimensional space a hyperplane is used to separate the points instead. In higher dimensions it's a lot easier to separate points with a hyperplane, because there's a lot more space to fit between them.

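A minimal sketch of the idea, using made-up 1-D data and an explicit lift to (x, x**2). The data, the `feature_map` name and the threshold are all assumptions for illustration. Real SVM kernels never build the lifted points explicitly, they only compute inner products in the lifted space.

```python
# Hypothetical 1-D toy data: class 1 sits between two groups of class 0,
# so no single threshold on x can separate the classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 1, 1, 0, 0]


def feature_map(x):
    """Lift a 1-D point into 2-D as (x, x**2)."""
    return (x, x * x)


lifted = [feature_map(x) for x in xs]

# In the lifted space the horizontal line x2 = 2.5 acts as a separating hyperplane.
threshold = 2.5
predicted = [1 if x2 < threshold else 0 for (_, x2) in lifted]

print(predicted == labels)  # True
```
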
What's the difference between a "Forecast" and a "Prediction"?

A Forecast can be considered a subset of all possible prediction problems where you're trying to predict values that are going to occur in the future and aren't immediately accessible in the present. For example, based on input parameters of your company you might predict how much money you'll be earning in 2 years. This is forecast because you can't know the true answer until time has elapsed.

What is the Canonical Correlation?

A correlational technique used when there are two or more X and two or more Y. It seems to be a way of grouping variables and comparing them. According to Wikipedia, it allows you to infer information from covariance matrices. It sounds like it's called canonical because it was the first time a method like this was introduced. Almost all other forms of correlation are a derivation of this method. You try and find vectors that multiply two different distributions and in doing so maximise the correlation between the datasets. The vectors applied to X and Y produce the first pair of canonical variables. This process can be repeated to get more canonical variables. I guess this allows you to get at what correlations are actually occurring in the dataset that might be difficult to see from two-variable analysis. This would allow you to see the maximum pattern that exists between your datasets.

What are Constraints in statistics?

A description from Wikipedia says that a constraint is a measure of the statistical dependence between variables. The relationships between them *constrain* the allowed values of the statistic. For example, the probabilities of the first five faces of a die constrain the probability of the sixth face because the probabilities of all the faces must sum to one. Constraints can also be applied that force models to behave in accurate ways. For example, if you're building a statistical model to explain a physical process the model would have to obey physical constraints that prevent it from giving you erroneous values. Such constraints can also be applied to parameters in Machine Learning models to improve their accuracy by enforcing logical relationships. In terms of pure statistics though, it appears to be about relationships between variables. Although, perhaps that's true for the physical world too. For example, the temperature of a substance in a constrained vessel and the pressure are related. You can't increase temperature without increasing pressure. Therefore, a model that considered these two completely independent would produce incorrect results. In summary, you have to be careful of how the parameters you're using to build your model, or your statistic, interact. If they're dependent on one another in some way then you have to take that into consideration and recognise you have a constraint. This could affect the degrees of freedom of your model and therefore future hypothesis testing.

What is a Metaheuristic?

A heuristic is a technique for approximating the solution to a problem without completely solving the problem. They're used everywhere, particularly when calculation of a solution is NP-Hard. A Metaheuristic is a heuristic way of selecting a heuristic to solve a problem! In other words, it's a higher level heuristic for selecting the approximately correct heuristic for your problem! Okay, that's all very confusing, but you can think of it like a general set of ideas that help you form a strategy to fight somebody. Imagine warfare: you can't *know* the best way to beat your enemy, your tactics are approximate. But you can work within a group of ideas and strategies for selecting your approximate tactics. That's a Metaheuristic.

What is Bayesian Statistics?

A method in statistics that estimates epistemological uncertainty using probability. Within this paradigm, you specify degrees of belief in states of nature. They are non-negative and sum to one. Degrees of belief in states of nature means you have a distribution over the things that you think are true: the "prior distribution". You then fit data to your model and update the distribution of things that you think are true. This becomes the "posterior distribution". This is what you use moving forward to make inferential decisions.

What is a p-value?

A p-value is the probability of observing a result at least as extreme as the one you actually observed, assuming the null hypothesis is true. The lower this value is, the harder it is to explain your observed result by random chance under the null hypothesis. For example, say I had a certain null hypothesis and based on that null hypothesis I calculated that a result like the observed one would happen randomly 0.000001% of the time. That is strong evidence against the null hypothesis... although the probability that the null hypothesis itself is wrong cannot be read off the p-value: https://stats.stackexchange.com/questions/275527/using-p-value-to-compute-the-probability-of-hypothesis-being-true-what-else-is The value we chose above is far below the typical threshold in statistics, which is p < 0.05. That is, a 5% chance of seeing a result this extreme purely by random chance when the null hypothesis is true. Pretty pathetic test if you ask me.
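
As a small worked example of the usual definition (the probability, if the null hypothesis is true, of a result at least as extreme as the one observed), here is a sketch for a coin-flip test. The scenario (9 heads in 10 flips, null hypothesis of a fair coin) and the function name `binomial_p_value` are assumptions made up for illustration.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(X >= heads) when the null hypothesis holds."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Hypothetical observation: 9 heads out of 10 flips of a supposedly fair coin.
p = binomial_p_value(heads=9, flips=10)
print(p)  # 11/1024, about 0.0107

# Compare against the conventional 0.05 threshold.
print(p < 0.05)  # True
```

Note that this number is a statement about the data under the null hypothesis, not the probability that the null hypothesis is true.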

What is Statistics?

A practical mathematical science that involves collecting, analysing, interpreting, and presenting data. Basically, if you're collecting information, and you're studying it in any way, you're doing some form of statistics.

What is Statistical Hypothesis Testing?

A procedure that allows us to evaluate hypotheses about population parameters based on sample statistics. This is a form of "statistical inference". In other words, we form a hypothesis. We then test that hypothesis by computing a test statistic of some kind, for example from a random sample. Depending on the value of the test statistic we can reject or fail to reject our hypothesis at some defined probability level. This idea allows us to make probabilistic statements about population level parameters, like the mean.

What is Inferential Statistics?

A set of statistical techniques that allows you to make inferences about a population, given a sample of that population. This is the type of statistics that is used to make predictions about populations. This is the kind that's most important to me as I'm trying to make predictions about the population of people who apply for credit cards. Particularly I'm trying to infer who's most likely to be fraudulent!

What is a Statistic?

A statistic is a quantity computed from a sample that is considered for some "statistical purpose". A "statistical purpose" could be something like estimating a population parameter, describing a sample, or evaluating a hypothesis. Statistics used for different purposes generally have different names. For example, if a statistic is used to estimate a population parameter it's called an "estimator"; if it's used to describe a sample it's called a "descriptive statistic"; if it's used to test a hypothesis it's called a "test statistic". ChatGPT: A numerical measure that summarizes a set of data.

What is a Box Cox transformation?

A transformation of your target variable so that it more closely follows a normal distribution. The name comes from the authors who came up with the idea, who were named Box and Cox. I feel like this type of transformation is unlikely to be useful for a binary classification model.

What is a Copula Transformation?

According to this question and answer: https://stats.stackexchange.com/questions/88661/what-is-copula-transformation A Copula transformation allows you to remove all the noise from a pair of features and analyse ONLY the dependence between them. Apparently this method makes it easier to see when they share information? (Maybe) More formally: In probability theory and statistics, a copula is a multivariate cumulative distribution function for which the marginal probability distribution of each variable is uniform on the interval [0, 1]. So it forces all the values in the variable into a small range, which makes it easier to tell when things vary together.

What is Apache Arrow?

Apache Arrow is a small and fast data format that was designed to be efficient in memory. It's a collection of C, C++, Rust, or Go data structures that can store and process data efficiently. It's being integrated into Pandas for the 2.0 update. It comes with a binary format called "feather", which should make loading and storing pandas DataFrames simpler and quicker.

Why do Point Estimates reduce the degrees of freedom?

Apparently this reduces the degrees of freedom because the population parameter estimate takes up one of the possible degrees of freedom. There's something about condensing all the values into a single estimate that fundamentally means there's less information and reduces our degrees of freedom. Not entirely sure how this works though, it would be good to see a mathematical explanation.
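
One way to see it numerically: once the sample mean is computed, the deviations from it must sum to zero, so only n - 1 of them are free to vary. The sketch below uses made-up Gaussian data purely for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical sample of five observations.
sample = [random.gauss(10, 2) for _ in range(5)]
x_bar = mean(sample)

# Deviations from the sample mean always sum to zero (up to rounding) ...
deviations = [x - x_bar for x in sample]
total = sum(deviations)

# ... so the last deviation is fully determined by the other n - 1.
last_from_others = -sum(deviations[:-1])

print(abs(total) < 1e-9)
print(abs(last_from_others - deviations[-1]) < 1e-9)
```

This is the usual motivation for dividing by n - 1 rather than n in the sample variance: estimating the mean from the same data uses up one degree of freedom.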

What are some of the issues with R^2?

Because R^2 uses the total variation in relation to the mean as a denominator, if your data is positioned closely to that mean then the total variation due to the mean will be low, and any improvements, even if visually they fit much better, will not be as pronounced as if the data were far away. For an example, you can read this article: https://www.quantics.co.uk/blog/r-we-squared-yet-why-r-squared-is-a-poor-metric-for-goodness-of-fit/ It's possible to have data drawn from the same underlying population that fit the same model and have much different R^2 scores simply due to where those data points are relative to the mean. This means R^2 is not a data-independent metric for goodness of fit.
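
A rough sketch of that effect, with invented numbers: the same line (y = 2x) and the same noise are scored on two sets of x values, one bunched near the mean and one spread out. The helper names and data are assumptions for illustration only.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Identical noise in both cases, so the residuals are exactly the same size.
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]


def score_on(xs):
    """Score the true line y = 2x against noisy observations at the given xs."""
    y_pred = [2 * x for x in xs]
    y_true = [yp + e for yp, e in zip(y_pred, noise)]
    return r_squared(y_true, y_pred)


r2_narrow = score_on([4.0, 4.4, 4.8, 5.2, 5.6, 6.0])  # data bunched near the mean
r2_wide = score_on([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])   # data spread far from the mean

print(r2_narrow < r2_wide)  # True: same model, same noise, different R^2
```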

Why do some people describe machine learning models as hypotheses?

Because the form of model chosen is an assumption about the dataset you're training on. For example, a linear model implies that your dataset is linear! But, more generally, you can think of a machine learning model as a learning algorithm that provides you with a hypothesis. It selects the hypothesis, from all the potential hypotheses, that most closely matches your data. It does this using the model parameters. The model's parameters and underlying behaviour form the hypothesis you're making about the dataset. Your hypothesis is accurate if your model is accurate.

What are Sparse Data Matrices?

Data matrices that have mainly 0 or None values. Essentially, a Sparse data matrix contains little information in the majority of its columns. This can be an issue because it adds significant and unnecessary computational overhead. They also add unnecessary complexity to a model, because the model now requires extra columns, even though most of those columns are unimportant. Linear regression is apparently susceptible to this because it uses matrix calculations for optimisation (so says ChatGPT). When you have a large and mostly meaningless matrix this results in seriously wasted computation for very little improvement. Hence, it's susceptible to the issue.

What is Wide Data?

Data that has a lot of variables/ columns/ features. Hastie says the number of rows must be less than the number of columns for the data to be wide.

What is a Transformation in data science?

Data transformations take data and process it, cleanse it, and structure it into a format that's useable by the next stage in the data pipeline. For example, the next stage could be a machine learning model that can only handle numerical data. Because this is a brand new field, the precise definition is still somewhat up in the air. Personally, it seems to me that a data transformation is an action taken on data you have that changes its representation in some way. Typically this is done to make the data easily useable in a future part of your pipeline. It also seems that some people consider deleting data a transformation, so perhaps you have to consider "data" to mean the "dataset" as opposed to individual pieces of information. Otherwise transforming something into non-existence doesn't make much sense. You can also include in that the process of copying, replicating, adding, etc. These are "dataset transformations" but they still seem to fall under the generic heading of "data transformations".

What are Descriptive Statistics?

Descriptive statistics are summary statistics. They provide basic summaries of information in the dataset. They're a good way to get an overview of a dataset, but their generality means they cannot tell you the whole story of your data. They tend to describe the entire dataset as a complete entity, they're not granular in how they explain the dataset.

What are Design Matrices?

Design matrices are basically the dataset you use to train your model. More precisely, they contain information about observations relating to specific objects. Each row in the design matrix represents a unique observation, and each column represents a type of observation made on many objects.

What is Feature Engineering?

Feature engineering is the process of re-representing a dataset so that it's easier for a model to make good predictions on it. This process is used to minimise features whilst maintaining performance. In some cases this process may be even more important. Models could break when features aren't in the proper formats.

What is a Fixed Effect?

Fixed effects are effects that have a defined constant effect, particularly on a statistical model. In that sense they are a constant of the model. Another way to define it is that they are an effect that doesn't vary across a population; they remain constant for everybody in that population. Please see: https://ademos.people.uic.edu/Chapter17.html

Who was Bayes, what did he contribute?

He developed a method for applying probability to epistemological uncertainty. In other words, uncertainty that is purely due to not knowing the nature of the world. As opposed to uncertainty about the future, which is the uncertainty that occurs in games of random chance. Specifically, he started with a set of parameters theta and data y, and defined the following relation: p(theta | y) = p(y | theta)*p(theta)/p(y) Essentially, this says that the probability of theta given the data y (the posterior) equals the probability of y given theta (the likelihood), multiplied by the prior probability of theta and divided by the overall probability of y.
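
To make the relation p(theta | y) = p(y | theta)*p(theta)/p(y) concrete, here is a minimal sketch with a discrete theta. The setup (a coin whose heads probability is one of three candidate values, a uniform prior, and 3 heads with 1 tail observed) is invented for illustration.

```python
# Hypothetical setup: theta is the probability a coin lands heads,
# restricted to three candidate values with a uniform prior p(theta).
thetas = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]

# Observed data y: 3 heads and 1 tail.
heads, tails = 3, 1

# Likelihood p(y | theta) for each candidate theta.
# (The binomial coefficient is the same for every theta, so it cancels out.)
likelihood = [t**heads * (1 - t) ** tails for t in thetas]

# Evidence p(y): the likelihood averaged over the prior.
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Posterior p(theta | y) by Bayes' rule.
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

for t, post in zip(thetas, posterior):
    print(t, round(post, 3))
```

After seeing mostly heads, belief shifts towards theta = 0.75, but the other values keep some posterior probability.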

How is Hypothesis testing used in Exploratory Data Analysis?

Hypothesis testing allows you to test assumptions that you have about a dataset and is an important final part of EDA. For example, you might think you see a correlation between two variables. But before you engineer a joint feature from them you want to check the correlation is statistically significant. You do this using a hypothesis test. This is probably something I could start doing more of in my own code, as I have the hypothesis testing module already built. It seems like you can take an SQR4 approach to this. Whilst going through the initial stages of EDA (cleaning, visualising, and summarising) you can put together a list of questions that you have about the dataset. Then, when you've gone through the whole thing, you can use those questions to generate and test hypotheses that you have about the data.

How does LightGBM calculate information gain?

If you read their paper, you can see that information gain is calculated https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

What are Degrees of Freedom in statistics?

In statistics, this refers to the maximum number of logically independent variables in a system. For example, for a dice roll, there could be 6 probabilities. One for each face of the dice. However, the sixth probability is defined by the preceding five. It has to take the value 1-(5 preceding probabilities). Therefore, there are only 5 logically independent variables in this situation, and the degrees of freedom are 5 and not 6. From Wikipedia, the degrees of freedom in a statistic are the values that are allowed to differ during the calculation. In the dice example, the first five faces can take whatever probability they like, but the sixth is constrained to a value defined by the preceding 5. Basically, you need to look at your problem and figure out if there's any dependence between the values you're using to create your statistic. If there are some dependencies you need to think if logically one value is defined by the other. If it is, then you should consider whether you need to reduce your degrees of freedom. https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)

What is mRMR or MRmr?

In turn these stand for "minimum Redundancy, Maximum Relevancy", and correspondingly "Maximum Relevancy, minimum redundancy". This is a feature selection technique that uses mutual information to remove features that share the same information, and features that are irrelevant to the target column.

What is the Central Limit Theorem?

It describes the tendency of the average of independent random variables to converge to a normal distribution as more of them are averaged, whatever distribution the individual variables come from (provided it has finite variance). This is an interesting idea. Does that mean, if they're not independent, you can show that by seeing what happens to the distribution? I guess yes, but calculating that distribution is such hard work you may as well not bother and use a different and approximate method! So is that what a lot of statistics is? Trying to find an approximation of the distribution between variables?
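
A quick simulation sketch of the theorem, assuming draws from Uniform(0, 1) (which is clearly not normal). The sample size, repeat count and seed are arbitrary choices for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
N_REPEATS = 5000


def sample_mean(n):
    """Mean of n independent draws from Uniform(0, 1)."""
    return sum(random.random() for _ in range(n)) / n


means = [sample_mean(SAMPLE_SIZE) for _ in range(N_REPEATS)]

# Theory: the means should centre on 0.5 with spread sqrt(1/12) / sqrt(n).
expected_sd = (1 / 12) ** 0.5 / SAMPLE_SIZE**0.5
grand_mean = sum(means) / N_REPEATS

# A normal distribution puts roughly 68% of its mass within one standard deviation.
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / N_REPEATS

print(round(grand_mean, 2))
print(round(within_one_sd, 2))
```

If the averages behave like a normal distribution, the second number should land close to 0.68 even though no single draw is normally distributed.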

How does LightGBM handle categorical features?

It internally enumerates them and uses that enumerated representation of the data to make decisions about which way different categorical features will go in its decid

What does MAR mean?

It means Missing At Random. This means the missingness follows a pattern that can be explained by some other observed feature of the data. For example, imagine you had a dataset that showed performance in a series of basketball games for a set of players. Some of the players have missing data for a whole series of games. When you look closer you see that on those days they're recorded as ill and didn't play, hence no data. This is an example of missing at random. In the above example the data is missing because of another feature in the dataset, but not because of the data values themselves. https://stefvanbuuren.name/fimd/sec-MCAR.html

What does MNAR mean?

It means Missing Not At Random. This is when the missingness of your data *is* dependent on the values of the data in that particular column. For example, you have a weighing scale that returns NaN for very heavy objects. That means the missing values are dependent on how heavy the object is. Beyond a certain weight all the values are missing. This is a difficult class to deal with, and you have to perform tests and experiments to check why this has happened, and how sensitive what you're doing is to this data being missing. https://stefvanbuuren.name/fimd/sec-MCAR.html

What is a Learned Embedding in encoding?

It sounds like it's a method of producing an ordering from a nominal categorical column. Basically, it seems each category is abstracted out into a vector and then run through a neural net. The categories that are most similar are allowed to group together and the overall relationship between all the classes becomes more obvious. That allows you to give the classes a label encoding despite them being nominally nominal (lol). A great thing about this technique is that you don't have to use the same model to create the embedding as you do once the embedding has been created. The embedding can be applied generally to any model. It can be used to transform the data as a pre-processing step for any arbitrarily complicated model. This technique was originally used in language models to find words that had a similar meaning to one another. An embedding layer will be required for each categorical variable, so this could add significant complexity to a model build. However, if it's possible, it could also make my models far more robust!

What does MCAR mean?

It stands for Missing Completely At Random. It means that data is missing for no reason at all. It's not dependent on anything else in the dataset. This occurs when a value is missing purely due to random chance. It also means that any possible value that missing data could take is equally likely to be missing. Essentially, this is like accidentally missing a tick in a box in an in-person survey and running it through a machine which collects no value in the box. This is completely random, you missed the box and forgot about it, and you're equally likely to miss any box. https://stefvanbuuren.name/fimd/sec-MCAR.html

Is Multiple Imputation useful in predictive modelling?

It would appear the answer to this is a reasonably definitive no! Multiple imputation is used for statistical tests where assumptions about the distribution of the predictors are known in advance. In a predictive modelling scenario you may know absolutely nothing about the predictors, and so you can't test how reasonable your imputations are. This is why I think it's not so useful. It's costly to do this over and over, and it's also not clear how you would store the imputation for application to new data, because that's the point of a predictor, to generalise to new data. But let's say you do store the imputation somehow, perhaps it's a KNN model, now you have to run your data through the KNN before it's allowed to go through your true model. Therefore, you're exposing yourself to potentially serious issues. http://www.feat.engineering/imputation-methods.html

What is the Z statistic?

It's a count of how many standard deviations away from the mean a particular observation is. The more standard deviations away from the expected mean an observation is, the more likely your expected mean and distribution is wrong. This statistic is used in hypothesis testing to accept or reject a null hypothesis.
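
A minimal sketch of both uses: the z-score of a single observation, and the z statistic for a sample mean in a hypothesis test. The function names and numbers are assumptions for illustration, and the test version assumes the population standard deviation is known.

```python
from math import sqrt


def z_score(observation, mean, std_dev):
    """How many standard deviations an observation sits from the mean."""
    return (observation - mean) / std_dev


def z_statistic(sample_mean, null_mean, population_std, n):
    """z for a sample mean: the distance from the null mean in standard errors."""
    return (sample_mean - null_mean) / (population_std / sqrt(n))


# Hypothetical numbers: expected mean 100, standard deviation 15.
z_single = z_score(130, 100, 15)
z_test = z_statistic(sample_mean=103, null_mean=100, population_std=15, n=100)

print(z_single)  # 2.0
print(z_test)    # 2.0
```

A single value 30 points above the mean and a mean of 100 values that is only 3 points above it are equally surprising here, because averaging shrinks the standard error by sqrt(n).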

What is Label Encoding?

It's a data transformation that takes a categorical column and assigns each of its categories a numerical value. This can be an issue because the column may not have an order and the numerical values can make the machine learning algorithm believe that there is an order, or treat the column as if it has an order. Therefore, this sort of encoding should only be done on ordinal categorical columns, not nominal columns.
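
A small sketch of the difference, using invented columns. For the ordinal column the mapping is written by hand so that the integers respect the real order; for the nominal column the integers are assigned alphabetically, which imposes an order that doesn't exist.

```python
# Ordinal column: the categories have a genuine order, so the mapping encodes it.
sizes = ["small", "large", "medium", "small"]
size_order = {"small": 0, "medium": 1, "large": 2}
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)  # [0, 2, 1, 0]

# Nominal column: codes are assigned alphabetically (blue=0, green=1, red=2).
# A model may now treat red as "greater than" blue, which is meaningless.
colours = ["red", "green", "blue", "green"]
colour_codes = {c: i for i, c in enumerate(sorted(set(colours)))}
encoded_colours = [colour_codes[c] for c in colours]
print(encoded_colours)  # [2, 1, 0, 1]
```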

What is a Probability Distribution?

It's a distribution that represents the probability of all the possible outcomes of an experiment. For example, dice rolls. Formally, it's a mathematical function that gives the probabilities of occurrence of different possible outcomes from an experiment. In other words, if you rolled 30 dice and counted the score as 58, you could use the probability distribution function to calculate how likely that particular outcome is. In this case there would be some defined value, as the numbers produced are discrete. More generally, in a situation that doesn't produce discrete outcomes and is practically infinite, the probability distribution can be integrated over to find the probability of a particular REGION of outcomes occurring. Take for example the position of an electron: any single position has a probability of occurring of 0, but a region could have a probability of 0.5.

What is Generalized Least Squares?

It's a generalisation of the least squares algorithm that takes into account correlations (and unequal variances) between the errors of the observations. It helps you "estimate the unknown parameters" in your model. Basically, it's an extension of ordinary least squares. You use a weighted least squares to make your fit, but base the weights on the covariance structure of the errors. This makes your parameter estimates more efficient when the errors are correlated. https://en.wikipedia.org/wiki/Generalized_least_squares

What is Mutual Information?

It's a measure of how much uncertainty you can remove from your prediction about a column Y, given that you have all the information from column X. If you read the Wikipedia article, they seem to say that this is a measure of the amount of wasted units of information when you decide to encode X and Y as independent random variables, when, in fact, they have some amount of dependence on one another. You can calculate mutual information by summing, over every x \in X and y \in Y, the product of the joint probability and the logarithm of the joint probability divided by the product of the marginal probabilities. https://en.wikipedia.org/wiki/Mutual_information
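
The sum described above can be sketched directly for a small discrete joint distribution. The function name and the two example tables (one fully independent, one where X and Y are identical) are made up for illustration.

```python
from math import log2


def mutual_information(joint):
    """joint[i][j] = p(x_i, y_j); returns I(X; Y) in bits."""
    p_x = [sum(row) for row in joint]        # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    return sum(
        p_xy * log2(p_xy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p_xy in enumerate(row)
        if p_xy > 0  # terms with zero joint probability contribute nothing
    )


# X and Y independent: knowing X tells you nothing about Y.
independent = [[0.25, 0.25], [0.25, 0.25]]

# X and Y identical fair bits: knowing X removes all uncertainty about Y.
identical = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))  # 0.0
print(mutual_information(identical))    # 1.0
```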

What is a Mixed Model?

It's a model that includes two types of effect. Fixed effects and random effects. Fixed effect variables have a constant unchanging effect on the model, whereas random effect variables have a changing non-constant effect. Each type of effect is associated with a variable. Mixed models try and figure out what effect each variable is having whilst accounting for the fact that some of the variability in the model is due to random unmeasured factors. Apparently these are very well studied models, and they have strong theoretical underpinnings which makes them useful for doing a style of categorical encoding. All mixed means is "contains random and fixed effects". Please see: https://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode

What is Dask?

It's a parallel computing library for Python. It allows you to work with parallelised numpy arrays and pandas dataframes. This can massively speed up your data science workflow. It's the backend for the dataprep library apparently, which makes the dataprep library fast!

What's the drawback of the Mean Absolute Error metric?

It's a scale-dependent metric, which means its value is dependent on the data it's being used on. Therefore, it cannot be compared across datasets that have different scales/ units, etc.
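
A short sketch of what scale dependence looks like in practice: the same hypothetical height predictions scored in metres and in centimetres. The data and helper name are invented for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# The same three predictions, expressed in two different units.
heights_m = [1.60, 1.75, 1.82]
preds_m = [1.65, 1.70, 1.80]

heights_cm = [160, 175, 182]
preds_cm = [165, 170, 180]

mae_m = mean_absolute_error(heights_m, preds_m)     # about 0.04
mae_cm = mean_absolute_error(heights_cm, preds_cm)  # 4.0

print(mae_m, mae_cm)
```

The model is equally good in both cases, but the metric differs by a factor of 100 purely because of the units.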

What is the idea of the "No Free Lunch" theorem in Data Science?

It's an idea from a paper (I believe) that states all optimization algorithms perform equally well if you average them across all possible problems. The idea here being that if you take every possible problem that exists, there is no single solution. That means there is also no single solution for every machine learning problem that exists, as machine learning is a form of optimisation. The theorem is named "No-Free-Lunch" because it matches the idea that you cannot have something for nothing. Your optimisation algorithm may be incredible for the problem you're trying to solve, but it cannot solve all the problems you want to solve.

What is Helmert Encoding?

It's something to do with calculating the mean of a category and comparing to the other categories. But it's not really clear what the "mean" of a category means here, so I don't know.

What's a Conditional Shift?

It's the shift in the distribution conditional on some value.

What is the Variance-Bias Trade off?

It's the trade off inherent in any modelling process. You want your models to be portable across datasets, and therefore to have a low level of variance! However, you also want them to be accurate, in other words, to have a low level of bias. Your model choice affects this trade-off because they all have different characteristics. Therefore, you have to be clear about what you want when selecting a model. You also have to be clear about hyperparameter selection for those models. This is because hyperparameters can have a large effect on how the model trains, and therefore what patterns it's trained upon. This has actually given me a great idea for a model setup that allows you to get exactly the variance and bias you want by combining decision trees and linear/ logistic regression.

What is Multicollinearity?

It's when two or more features/ variables have a high degree of correlation with one another. That is, they seem to be predicting roughly the same thing and are actually good predictors of one another! This is a form of redundancy which can be an issue for some models. Perhaps all these variables are good predictors, so the model overfits to include all of them, despite them representing the exact same information. This has the overall effect of lowering the model's performance. It can also affect the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model.

What is Target Encoding? (also called mean encoding or likelihood encoding)

It's when you use known information about the target of your training data and use it to encode your categorical columns. This usually results in target leakage and can cause your model to overfit, which is highly undesirable. You can get around this by adding noise to the data, or using a different form of encoding that prevents the fit from being too powerful. You can also add regularization to prevent your model fitting on these variables too well. It seems to follow some sort of mathematical definition. I guess that definition provides you with nice feature properties. There are various types of this form of encoding that have come about to reduce the issues with overfitting. At its most basic it takes the category and sums up all the values it's associated with in the target. It then divides the sum by the number of times the category appears and hey presto, you have an average target value for when this category turns up. Apparently, this style of encoding has been used to win numerous Kaggle competitions, so there's clearly something about it that's effective. You can look up this encoding on categorical encoders: https://contrib.scikit-learn.org/category_encoders/
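
The most basic version described above (sum the target per category, divide by the category count) can be sketched in a few lines. This is a plain-Python illustration with invented data and a hypothetical `target_encode` helper, not the category_encoders implementation, and it has no noise or regularization, so it would leak the target if fitted and applied on the same rows.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Hypothetical training data: a categorical column and a binary target.
cities = ["A", "A", "B", "B", "B", "C"]
fraud = [1, 0, 1, 1, 0, 0]

encoding = target_encode(cities, fraud)
encoded_column = [encoding[c] for c in cities]

print(encoding["A"])  # 0.5
print(encoding["C"])  # 0.0
```

Category "C" appears only once, so its encoding is just that single row's target value, which is exactly the kind of overfitting the smoothed variants try to reduce.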

Why do you need Mixed Models?

Mixed models basically combine two types of model. One which can be used to predict the fixed (non-random) parts of the problem you're trying to explain, and another to explain the random variability in the problem you're trying to explain. Think of it this way, Linear or Logistic Regression are simple models. They can explain fixed effects easily, because the fixed effects are constant and known. But what if the problem you're working on doesn't allow you to use fixed effect models for everything? How do you deal with the non-linear variability? This is when you turn to another type of "model", i.e. you turn to something like a normal distribution, which "models" the expected parameters of your dataset. According to this: https://www.theanalysisfactor.com/understanding-random-effects-in-mixed-models/, this can allow you to incorporate ideas into your model such as variability due to unmeasured factors. For example, perhaps one doctor has a better recovery rate than another. Perhaps you can't find a way to explain that difference statistically. Mixed effect models allow you to explain that statistically! That's why you need them, to incorporate effects that would be difficult to see otherwise. To summarise, it appears that mixed models are useful for quantifying random noise in your data and why that noise might be appearing, whilst still maintaining an underlying predictive model that is strong and valid for your fixed effect variables (parameters, whatever). Splitting them this way allows you to keep the simplicity of the fixed effect model, whilst still accounting for the randomness it doesn't explain. This can be called a "soft-constraint". Mixed models can posit distributions for latent, unobserved parameters. They're not fully Bayesian though, as they don't have priors for the top-level parameters (hyperpar

What is a James Stein encoding?

Not entirely sure, sounds like some statistical gubbins. However, from what I can gather, it's a form of target encoding built on the James-Stein estimator. The mean target value calculated for each category is shrunk towards the mean of the entire target column, and how far it is shrunk depends on how noisy that category's estimate is.

What's the clash between Frequentists and Bayesians?

Part of this is philosophical: what are you doing when you measure random effects? The Bayesians say you have random effects all drawn from some normal distribution and fixed effects drawn from independent distributions with known priors. That makes it simpler to think about...? (maybe)

What are Bayesians?

People who believe it makes sense to assume a probability distribution that data is drawn from and then see how that matches the real behaviour. They believe that frequentists are a special form of Bayesians who are dishonest in their assumptions.

What are Frequentists?

People who believe that the only sensible way to calculate probabilities or statistics is to count how frequently different things happen and make no assumptions.

What is Partial Pooling?

Pooling refers to combining datasets that you believe have roughly the same population level parameters and using them together to perform some form of statistical test. Partial pooling doesn't assume that you've drawn samples from the same population parameters. Instead, it assumes you've drawn samples from the same *distribution* of population level parameters. That is, your samples are related by a distribution of parameters defined as: N(\mu, \sigma).

What is a Random Effect?

Random effects are effects that vary across a population, or vary in their effect on an output when put into a statistical model. Because they are variable it's not always knowable what they will do, hence they are called random (? Don't actually know this for certain). Random effects are basically uncontrolled variability in your sample. Say you're building a model on a sample: it has some amount of variability that you cannot capture with the data that you've collected about the sample. The variability that cannot be explained can be termed "random". It may not actually be random, but so far as your dataset is concerned it's completely random. Who knows why there are these small variations? Perhaps some of them are explainable, perhaps some of them aren't. Perhaps it all comes down to the uncertainty relation at the base of physics: you cannot be certain about anything. A random effect is something that you model with a soft constraint. Basically, suppose you notice that group X has a different value on average compared to group Y, but you can't capture this with the variables you have. You take groups X and Y, compute some value for each of them, assume it's a population parameter for the full population of group X or Y, and give it a Gaussian distribution in your model. That allows your model to capture the effect within those groups without specifying any more parameters about them.

What is a Shrinkage Estimator in Statistics?

Shrinkage is the reduction in the effects of sampling variance. Therefore, a shrinkage estimator is probably an estimator that makes a prediction of a population parameter whilst shrinking the variance in that estimate. In loose terms, a previously estimated value, perhaps from a small sample (high variance), is shrunk towards a different value by the inclusion of new information. A shrinkage estimator implicitly or explicitly incorporates the effects of shrinkage. This means that a naive or raw estimate is combined with extra information that improves the estimation.
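
One very simple way to picture "a raw estimate combined with extra information" is a weighted average between a group's own mean and the overall mean, where small groups get pulled harder. The sketch below is an assumption-laden illustration: the function name, the `strength` parameter and the weighting rule n / (n + strength) are chosen for clarity, and this is not the specific James-Stein formula.

```python
def shrink_towards(group_mean, group_size, global_mean, strength=10.0):
    """Blend a group's raw mean with the global mean.

    The weight on the raw mean grows with the group size, so estimates
    from small (high variance) groups are shrunk the most.
    `strength` is a hypothetical tuning knob, not a standard name.
    """
    weight = group_size / (group_size + strength)
    return weight * group_mean + (1.0 - weight) * global_mean


# Same raw mean, very different amounts of evidence behind it.
small_group = shrink_towards(group_mean=0.9, group_size=2, global_mean=0.5)
large_group = shrink_towards(group_mean=0.9, group_size=200, global_mean=0.5)
```

The small group ends up much closer to the global mean of 0.5, while the large group stays near its raw mean of 0.9.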

What is a Dichotomized Transformation?

Sounds like it's one hot encoding. Effectively splitting the data into true and false, based on some parameter. Dichotomy normally means two though, hence the 'di'. That could be the case here, or it could just be saying that you get several dichotomous columns at the end (One-Hot-Encoding).

What is Hierarchical Bayesian Modelling?

Statistical model implemented over multiple levels. It uses the data passed to it to estimate the values of the posterior distribution. It's composed of multiple sub models that are combined together to create the complete model that makes predictions. You can fit the data to hierarchical distributions that you have assumed form your complete model. Doing that gives you more flexibility in the definition of the model and allows you to make better predictions.

What is Bias in a Machine Learning model?

Taken from the statistical idea of bias. It's a deviation of the model from the underlying truth. In other words, a model is biased if its predictions don't match reality. Which basically means all models are biased in some sense, because no model is a perfect match to reality. Simple models usually have higher bias as they're unable to capture complex patterns. Models like linear regression are an example of this. Complex models, like Tree ensembles, usually have lower bias as they are able to capture the complex patterns.

What is Variance in a Machine Learning model?

Taken from the statistical idea of variance. Models that have a high variance give predictions that have a high level of reliance on small subsets of the dataset. Models that have a low variance have a higher reliance on general trends. An example of the former is a decision tree, which is partially dependent on several small groups when it makes predictions. An example of the latter is a linear regression model, which cannot model non-linear behaviour but is not susceptible to overfitting little patterns (it doesn't notice them).

How does the R^2 metric work?

The R^2 metric uses the following equation: R^2 = 1 - SS_R / SS_T. SS_T is the total variation in the dataset relative to the mean, so it's the sum of the squared differences between each y value and the mean y value. SS_R is the total variation in the dataset relative to the model, so it's the sum of the squared differences between each y value and its prediction by the model. Taking the ratio of these tells you how much better the model is compared to the mean. Taking it away from 1 allows you to say the optimum score is 1, meaning zero variation is left unexplained by your model. A score of 0 tells you the model is no better than the mean, and any score in between tells you your model is better to some degree. A score < 0 tells you your model is *worse* than a simple mean model, which tends to tell you that you shouldn't bother using it.
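
To make the three cases (1, 0 and below 0) concrete, here is a small sketch of the calculation using SS_R and SS_T as sums of squared differences. The function name and the toy numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_R / SS_T, using sums of squared differences."""
    mean_y = sum(y_true) / len(y_true)
    ss_t = sum((y - mean_y) ** 2 for y in y_true)             # variation around the mean
    ss_r = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # variation around the model
    return 1.0 - ss_r / ss_t


y = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y, [1.0, 2.0, 3.0, 4.0])     # model matches every point
mean_only = r_squared(y, [2.5, 2.5, 2.5, 2.5])   # model always predicts the mean
backwards = r_squared(y, [4.0, 3.0, 2.0, 1.0])   # model gets the trend the wrong way round
```

The perfect model scores 1, the mean-only model scores 0, and the backwards model scores below 0 because it does worse than simply predicting the mean.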

What is an α?

The significance level of a statistical hypothesis test. It sets how happy you are with the possibility of incorrectly rejecting your null hypothesis. The larger it is, the more likely you are to incorrectly reject your null hypothesis. 5% is a commonly chosen level.

What is a database schema?

The term "schema" comes from Greek and means "shape" or "plan". This fits because a database schema is a plan telling you about the shape of a database and how the information fits together. A schema defines the fields in a table, the tables, and the relationships between the tables (relations). They were developed by IBM in the 1970s to handle the increasing amount of data that was being produced and the database operations that needed to be performed. They developed the idea of a relational database that used a schema to formally describe what it contained. NoSQL databases, like MongoDB, don't require you to describe your schema so formally, but having some kind of schema is still very helpful when you're trying to alter or access the database.

Where does the Z statistic name come from?

The z-score is the number of standard deviations a value lies above or below the mean. If you read this post on Reddit: https://www.reddit.com/r/statistics/comments/mr3gwc/m_why_are_they_called_z_scores/ and this article: https://builtin.com/data-science/z-test-statistics, it seems that the 'Z' in the Z statistic is derived from the idea of a z-score, and the name of the z-score is purely for convenience. It's the next letter after the algebraic x and y.
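
As a quick illustration of "number of standard deviations above or below the mean", the sketch below computes z-scores for a toy list. It assumes the population standard deviation (`statistics.pstdev`); whether you want that or the sample version depends on your setting.

```python
import statistics


def z_scores(values):
    """Return how many standard deviations each value is from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / sd for v in values]


# Toy data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(data)
```

Here the value 9 is two standard deviations above the mean and the value 2 is one and a half below it.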

What are Marginals?

These sound like marginal distributions. These are the distributions that are left when you sum (or integrate) the joint probability over all the other features of a model and keep only the small set of features you're interested in. It's as if you only had access to that subset and the rest had been totalled up. So the "Marginals" of a model are these distributions.
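
For a discrete case, getting a marginal is just adding up the joint probabilities over the variable you want to ignore. The joint table below is invented for illustration, and the helper name is hypothetical.

```python
from collections import defaultdict

# Hypothetical joint distribution over (weather, arrival).
joint = {
    ("rain", "late"): 0.15,
    ("rain", "on_time"): 0.15,
    ("dry", "late"): 0.10,
    ("dry", "on_time"): 0.60,
}


def marginal(joint_probs, axis):
    """Sum the joint probabilities over every variable except `axis`."""
    totals = defaultdict(float)
    for outcome, p in joint_probs.items():
        totals[outcome[axis]] += p
    return dict(totals)


weather_marginal = marginal(joint, axis=0)  # arrival has been summed out
arrival_marginal = marginal(joint, axis=1)  # weather has been summed out
```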

What are Point Estimates?

A point estimate is a single value that is used to estimate a population parameter, like a mean or a variance. It effectively "points" to a single value calculated from the sample. In other words, it refers to a single point in the parameter space, not an interval, which is another common type of estimate. The name comes from this fact. Point estimation is the process of finding that single approximate value for the parameter.

What are priors in Bayesian Statistics?

They are prior assumptions about what you believe the underlying distribution (or its parameters) to be, before you've seen the data. You add them into your model and they help inform your prediction. They are updated with the observed data to give you a posterior distribution, which is effectively an update of your assumptions about your data/ the underlying distribution.
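
A standard textbook case of a prior being updated into a posterior is the Beta prior on a success probability with yes/no data, where the update is just adding counts. The sketch below assumes that conjugate setup; the function names and numbers are illustrative only.

```python
def update_beta(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Prior belief: the success rate is probably around 0.5 (Beta(2, 2)).
prior = (2, 2)

# Hypothetical data: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

prior_mean = beta_mean(*prior)
posterior_mean = beta_mean(*posterior)
```

The posterior mean sits between the prior mean (0.5) and the raw observed rate (0.7), which is the prior and the data each pulling on the estimate.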

How are Learned Embeddings useful in Machine Learning?

They can be used to represent abstract concepts, such as words or parts of an image, in a low dimensional space. This makes them more efficient for training machine learning algorithms. They're effectively vector representations of parts of data. This could be useful for categorical columns, as each category could be represented by a vector, with the most similar categories grouped together. The groupings could allow for an order which makes the data encodable via label encoding. You need to choose an embedding dimension for each categorical variable. The Machine Learning Mastery article recommends using 10.

What are Mixed Models useful for in Machine Learning?

They're useful because building mixed effect models makes them resistant to variability caused by correlated data. Apparently, mixed effect models can actually take advantage of the correlated aspect of the data and produce better models from it. As it's the algorithm's specification that makes it resistant, there's less to think about when building your model... probably. https://stats.stackexchange.com/questions/250277/are-mixed-models-useful-as-predictive-models

Support Vector Classifier

This is a classification method that finds an (N-1)-dimensional hyperplane in a space of N dimensional points that separates the points into two classes. This method is called a soft-margin method, because when it separates the points it's okay with misclassifying some of them if it produces a better model. This is compared to the maximal margin classifier, which will perfectly split the points if it can, even if it means new data is misclassified.

What does Statistical Significance mean?

This is related to statistical hypothesis testing. To perform statistical hypothesis testing, you produce a null hypothesis and a test statistic. You then calculate how extreme your test statistic is given your null hypothesis. The extremeness of your test statistic allows you to determine if your result is statistically significant. Extremeness is measured by a p-value, which is the probability of seeing a test statistic at least as extreme as yours if the null hypothesis is true. We call this probability p. The Statistical Significance Level is frequently termed α, which is the probability of incorrectly rejecting your null hypothesis, given that your null hypothesis is true. This α is also known as the chance of a false positive or the chance of a Type I error. So the p-value must be below your statistical significance level for the result to be statistically significant. That is, we need p < α if we wish to claim statistical significance. α can be set to any arbitrary level, but a common level is 0.05, or a 5% chance of rejecting the null hypothesis if it's true. This is not useful for us; we should be able to go much lower without issue.
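
Here is a minimal sketch of the p < α comparison. It assumes a two-sided test whose statistic follows a standard normal distribution under the null hypothesis (a z-test), which is only one of many possible setups; the function name is made up.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


alpha = 0.05  # chosen significance level

p_weak = two_sided_p_value(1.0)    # test statistic 1 standard deviation out
p_strong = two_sided_p_value(3.0)  # test statistic 3 standard deviations out

weak_is_significant = p_weak < alpha
strong_is_significant = p_strong < alpha
```

A statistic of about 1.96 gives a p-value of roughly 0.05, which is where the familiar cut-off for a 5% two-sided test comes from.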

What is Big Data?

This is such a new field that these terms don't have completely agreed upon definitions. However, Max Kuhn's feature engineering book describes it as data that has a lot of points, but not necessarily a lot of variables (features). A definition from the A level course is that "Big Data" is data that cannot be contained on a single computer. It's so large it has to be stored across multiple systems and batch processed (for example). I suppose it could also mean datasets that have hundreds of thousands to millions of features as well though, think NLP. Again, from the A level syllabus, there are a few hallmarks of big data: Lots of it, More Variety, More Velocity. Be aware, more data is not always better. High bias, low variance models (linear regression, logistic regression) can't really take advantage of it at all! So with those models it's best to not bother and save yourself the extra training time.

What is Listwise Deletion?

This is the process of deleting all rows that have one or more missing values in your dataset. This is not typically considered good practice, although it is a very common one. This method makes the very big, and generally untrue, assumption that the missingness is MCAR (Missing Completely At Random). Because this is not the general case at all, this can introduce very serious bias in your datasets, as you can end up throwing away over 50% of the data and all the relationships that come with that.
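
In code, listwise deletion is about as simple as it sounds, which is probably why it's so common. The sketch below uses a list of dictionaries with `None` marking a missing value; the data and function name are invented for illustration.

```python
def listwise_delete(rows):
    """Keep only the rows that have no missing (None) values at all."""
    return [row for row in rows if all(value is not None for value in row.values())]


# Hypothetical dataset where None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Leeds"},
    {"age": None, "income": 48000, "city": "York"},
    {"age": 29, "income": None, "city": "Leeds"},
    {"age": 41, "income": 61000, "city": None},
    {"age": 50, "income": 58000, "city": "Hull"},
]

complete_rows = listwise_delete(rows)
fraction_lost = 1 - len(complete_rows) / len(rows)
```

Each dropped row here is missing only one field, yet 60% of the rows are lost, which shows how quickly the deletions add up.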

What is Multiple Imputation?

This is when you create multiple imputed versions of your dataset based on the background distribution of the missing data in each column and across the rows. Creating many of these datasets allows you to incorporate the uncertainty associated with your guess at imputation. Doing this means you're not as worried that your results are dependent on the imputation method you used. Instead your results are an average across the different imputed datasets. How often a particular imputed value is drawn depends on how probable that value is given the observed data. If a value is very probable, it will tend to be filled in across most of the imputed datasets, however many you create.

What is One Hot Encoding?

This term comes from computer science. It treats every category in a categorical column as its own variable and creates a new column for it. This prevents you from treating the column as if it's ordinal and has some relationship between its categories.
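
A bare-bones version of "one new column per category" might look like the sketch below. The function name and the `is_` column prefix are made up; real libraries offer their own encoders with more options.

```python
def one_hot_encode(values):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return [
        {f"is_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


sizes = ["small", "large", "medium", "small"]
encoded = one_hot_encode(sizes)
```

Each row has exactly one column set to 1 (the "hot" one), and nothing in the encoding says that "medium" sits between "small" and "large".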

What does "sparse" mean in Machine Learning?

This term seems to refer to models that use few parameters, or matrices that have few non-zero (important) values in them for the space they take. For example, sparse models are models that require few parameters, as most of the parameters have been zeroed out by some form of regularization.

What is Statistical Inference?

Using data analysis to infer the properties of an underlying probability distribution. This method allows you to infer population parameters by devising hypotheses and performing statistical tests. You can perform tests by sampling from an (assumed) background population and calculating how likely your observations are given your original (null) hypothesis. It seems like statistical inference is the name given to inferring things about probability distributions through the use of statistics. Using data drawn from the population with some form of sampling, you use statistical inference to make a proposition about a population. The propositions could be things like a "point-estimate" (I think this is something like an estimate of a population parameter), or the rejection or acceptance of a null hypothesis.

What are Ordinal variables?

Variables that are categorical and have some order to them. This is something like: small, medium, large. These things are categories, but they have a clear order to them which contains information that is probably useful to a model. In other words, the categories may turn up in the data in any order, but however you encode them should still follow the order they represent.

What are Nominal variables?

Variables that have no inherent order to them. The term comes from the idea of a "name", that's where "nom" comes from. Names are categorical, but don't have an inherent order beyond alphabetical, which could be completely meaningless for the task we're dealing with. They could be associated with something, they may even have an order, but that order is not contained in their categorical representation alone.

What simple methods are there for imputation?

You can do mean imputation, which just puts the mean of a column into every missing value. This doesn't affect the mean of your data, but it's recommended you don't do this unless you have very few missing values. You can do regression imputation, which is taking the data that exists and performing regression to estimate the missing data. However, this method is very dangerous, as it reinforces patterns in the data. Depending on the extent of the issue, that can severely bias your model. You can do stochastic regression. This is an improved version of regression imputation that doesn't have the same issue with reinforcing patterns, because it adds a level of randomness to the imputations. Finally, you could use an indicator variable. That is, for each incomplete column, you could replace each missing value with 0 or some other filler and then put 1 in an indicator column to show that this value was previously missing. Apparently, this has shown some good effects, but can be biased and incorrect in non-perfect scenarios. https://stefvanbuuren.name/fimd/sec-nutshell.html
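
Two of the simple methods above, mean imputation and the indicator variable, can be combined in a few lines. This sketch uses the column mean as the filler value (the text mentions 0 or some other filler, so that choice is an assumption), and the function name is hypothetical.

```python
def mean_impute_with_indicator(column):
    """Fill missing (None) values with the observed mean and flag where they were."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in column]
    indicator = [1 if v is None else 0 for v in column]  # 1 = value was missing
    return imputed, indicator


heights = [10.0, None, 14.0, None, 12.0]
filled, was_missing = mean_impute_with_indicator(heights)

mean_before = sum(v for v in heights if v is not None) / 3
mean_after = sum(filled) / len(filled)
```

The mean is the same before and after, as claimed above, but the spread of the column shrinks because every filled value sits exactly on the mean.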

What metrics can you use to measure the performance of a regression model?

You can use:
Mean-squared-error: To do this you take the square of the difference between the predicted and the actual values, and then find their mean.
Root-mean-squared-error: To do this you do the same thing as the mean-squared-error but take the square root to bring it back to the same units as the target.
R^2: So apparently R^2 is a ratio between the sum of squares that results because of the regression and the total variation that's present in the dataset. If the regression can explain all the variation this goes to a value of 1. It typically takes a value between 0 and 1.
Adjusted R^2: Which is apparently similar but also takes into account how many variables are in the model.
Mean Absolute Percentage Error: This calculates the average absolute percentage difference between the predicted and the actual values. This is more useful when you naturally have wide variation in your dataset.
https://deepchecks.com/question/what-are-the-best-metrics-for-the-regression-model/ https://corporatefinanceinstitute.com/resources/data-science/r-squared/ https://scikit-learn.org/stable/modules/classes.html#regression-metrics
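
The error-based metrics are short enough to write by hand, which can help when checking what a library is reporting. The sketch below covers mean-squared-error, root-mean-squared-error and mean absolute percentage error; the function names and the toy numbers are invented, and the MAPE version assumes no actual value is zero.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |actual - predicted| / |actual|, as a percentage."""
    ratios = [abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]

mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
```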

What should you do to impute in predictive modelling?

You could use a Tree method or a KNN to predict what the values should be based on your background dataset. This could work, but, ultimately, you're potentially exposing yourself to issues in the future. It's probably best to think hard about your missing data and figure out what to do with columns that have a lot of missing data, but can't really be dealt with through encoding or some other method.

What is Stein's Paradox?

https://efron.ckirby.su.domains//other/Article1977.pdf

What is the Probability Integral Transform?

https://en.wikipedia.org/wiki/Probability_integral_transform It's also known as the Universality of the Uniform. It's a transform that takes data values modelled as random variables from a continuous distribution and, by passing them through that distribution's own cumulative distribution function, turns them into random variables with a standard uniform distribution. This is a general result apparently. This method is used in Copulas to make the relationship between variables more readily apparent.
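
You can see the transform at work with a quick simulation: draw from an exponential distribution, push each draw through the exponential CDF, and check that the results look uniform on [0, 1]. The rate, seed and sample size below are arbitrary choices for illustration.

```python
import math
import random

rate = 2.0
rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw samples from an exponential distribution with the chosen rate.
samples = [rng.expovariate(rate) for _ in range(10_000)]


def exponential_cdf(x, lam):
    """CDF of the exponential distribution: F(x) = 1 - exp(-lam * x)."""
    return 1.0 - math.exp(-lam * x)


# Probability integral transform: apply the distribution's own CDF.
uniformised = [exponential_cdf(x, rate) for x in samples]

mean_u = sum(uniformised) / len(uniformised)
share_below_quarter = sum(u < 0.25 for u in uniformised) / len(uniformised)
```

If the transformed values really are standard uniform, their mean should be close to 0.5 and about a quarter of them should fall below 0.25.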

What are Generalised Linear Mixed Models?

https://stats.stackexchange.com/questions/17331/what-is-the-difference-between-generalized-estimating-equations-and-glmm/17403#17403

What are Generalized Estimating Equations?

https://stats.stackexchange.com/questions/17331/what-is-the-difference-between-generalized-estimating-equations-and-glmm/17403#17403

