DSCI 4520 Midterm

False Negative

B. Actual 1 and Predicted 0: predicted negative but actually positive.

Linear Regression

Provides a mathematical approach to model the relationship between variables; we hope the regression model outperforms the "average" model. The goal is to identify the line through the data that minimizes error, and we use the least squares method to find that line. Ordinary least squares lets us predict a target (dependent) variable based on one or more input (independent) variables: y = β_0 + β_1·x + ε, where y is the target/dependent variable, x is the predictor/independent variable, β (beta) are the coefficients used in the model, and ε (epsilon) is the error term. The dependent variable must be a numerical variable; independent variables may be numerical or categorical (via dummy coding). When the model is used for prediction, the error (residual) value is unknown at the given point.
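A minimal R sketch of fitting this kind of model with ordinary least squares; the small data frame here is made up purely for illustration.
# Fit y = b0 + b1*x by ordinary least squares on illustrative data
df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))
fit <- lm(y ~ x, data = df)                 # least squares estimates of beta_0 and beta_1
coef(fit)                                   # intercept and slope
predict(fit, newdata = data.frame(x = 6))   # prediction; the residual at a new point is unknown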

Variable R Types

R Vector: Character (categorical), Numeric (3, continuous), Integer (3L, discrete), Logical (T/F, binary), Complex (1+5i, continuous)
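A quick R sketch of these vector types (the values are illustrative):
chr <- c("red", "blue")   # character (categorical)
num <- 3                  # numeric (continuous)
int <- 3L                 # integer (discrete)
lgl <- c(TRUE, FALSE)     # logical (binary)
cmp <- 1 + 5i             # complex
class(int)                # "integer"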

RFM analysis

Recency: How recent was the customer's last purchase? Frequency: How often did this customer make a purchase in a given period? Monetary: How much money did the customer spend in a given period?

Minimizing Error

Regression seeks to find coefficient values (βs) that minimize the sum of all squared error terms, ∑ ε_i^2 (minimize e^2).

Over Fitting Problem

Rule of thumb for the number of records and predictors: n ≥ 5(p + 2). Including uncorrelated variables in the model increases the variance of the predictions; excluding correlated variables from the model increases the average error of predictions. Including too many variables in the model increases the risk of over-fitting. When over-fitted, regression coefficients represent the noise rather than the genuine relationships in the population. Three strategies to avoid over-fitting: 1- Splitting the data into validation and training sets (see the sketch below) 2- Adding variables to the model only if they improve the model performance and goodness-of-fit; also use metrics that take into account the number of variables 3- Penalizing the model for including more variables: regularization
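A short R sketch of strategy 1 (holding out a validation set to detect over-fitting), using the built-in mtcars data purely as a stand-in:
set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.6 * n))   # e.g., a 60/40 train/validation split
train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]
fit <- lm(mpg ~ wt + hp, data = train)          # fit on the training set only
mean(abs(valid$mpg - predict(fit, valid)))      # validation MAE as an over-fitting check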

SAS Approach

SEMMA: Sample Explore Modify Model Assess

Data Mining Process

Sampling - We begin with raw data. In most organizations, far more data is available than is really needed to develop statistically relevant models. In order to reduce effort, a sample is generally drawn from the pool of available data.
Preprocessing - Before we can begin modeling, the target data must be clean and ready to use. Business data is often "dirty."
Modeling - Once the data are clean, we can begin to produce models. It is very common to develop multiple competing models and select the best of the group.
Interpretation - A good performing model tells you something by identifying patterns in the data. If that information is interesting (previously unknown and potentially useful), it may be of value to the organization.
Decision making - Once an insight has been gleaned from the data, a decision maker can act with confidence, knowing that action is supported by the past experience of the organization.

Pearson Correlation Coefficient

Shows how strongly changes in one variable are associated with changes in another variable; its value is between -1 and 1. A fundamental assumption of the linear regression model is that THERE IS a linear relationship between the input and output variables, and the correlation coefficient is a good measure of such a relationship. In statistics, it is not acceptable to include an input variable in a regression model that has no correlation with the output variable.
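A minimal R sketch of computing a Pearson correlation, using the built-in mtcars data as an example:
cor(mtcars$wt, mtcars$mpg)        # close to -1: strong negative linear association
cor.test(mtcars$wt, mtcars$mpg)   # tests whether the correlation differs from zero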

Character

Stores strings of letters, numbers, and symbols.

Data-Driven Decision-Making Technology Pyramid

TOP: Decision Making (End User)
Data Presentation (Visualization Techniques)
Data Mining (Information Discovery)
Data Exploration (Statistical Analysis, Querying and Reporting)
Data Warehouses/Data Marts (OLAP, MDA)
BOTTOM: Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)

Model Terms

Thing you want to predict: Statisticians - Dependent Variable; Data Miners - Target (Outcome); Social Scientists - Response Variable. Things you can use to predict the target: Statisticians - Independent Variables; Data Miners - Predictors; Social Scientists - Explanatory Variables.

Issues with Data Mining

Too much data. Need interesting results. Need special skills. Data quality. How to combine data and appropriate external data. What if stats misled us? (Spurious-correlations). Subject to human thinking.

It is a supervised learning task to classify credit card purchases into fraudulent and legitimate ones. True/False

True

Histogram

a chart that plots the distribution of a numeric variable's values as a series of bars. Each bar typically covers a range of numeric values called a bin or class; a bar's height indicates the frequency of data points with a value within the corresponding bin.

Linear Regression

a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. (numerical)

Supervised Learning

a machine learning approach that's defined by its use of labeled datasets

Confusion Matrix

a: number of positive instances predicted as positive (true positive) b: number of positive instances predicted as negative (false negative) c: number of negative instances predicted as positive (false positive) d: number of negative instances predicted as negative (true negative)
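A quick R sketch of building such a confusion matrix with table(); the vectors here are illustrative:
actual    <- c(1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 1, 0)
table(Actual = actual, Predicted = predicted)   # cells correspond to a (TP), b (FN), c (FP), d (TN)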

Bar chart

are used to compare counts, sizes, or other metrics of a categorical variable at different levels. Can be used to compare and rank levels.

Ordinal

categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. EX: socio economic status ("low income","middle income","high income"), education level ("high school","BS","MS","PhD"), income level ("less than 50K", "50K-100K", "over 100K"), satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like").

Boxplots

composed of an x-axis and a y-axis. The x-axis assigns one box for each Category or Numeric field variable. The y-axis is used to measure the minimum, first quartile, median, third quartile, and maximum value in a set of numbers.

Binary

data whose unit can take on only two possible states. These are often labelled as 0 and 1. EX: Smoking is a binary variable with only two possible values: yes or no. A medical test has two possible outcomes: positive or negative. Gender is traditionally described as male or female

Nominal

"labeled" or "named" data which can be divided into various groups that do not overlap. labelled into mutually exclusive categories within a variable. EX: male/female (albeit somewhat outdated), hair color, nationalities, names of people, eye color, smart phone

To create a plot to demonstrate the interactions of three variables, we may (a) All of the other three (b) Create subplots based on the categories of the third variable (c) Map the third variable to different sizes for the geometric objects (d) Map the third variable to different colors for the geometric objects

(a) All of the other three

Which of the following statements about business intelligence workflow is CORRECT? (a) Data in the operational database is transformed to analytical data in the data warehouse (b) External data sources are never used in the BI analytics (c) Business users and BI analysts are usually the same group of people (d) Operational software is directly connected to data warehouse

(a) Data in the operational database is transformed to analytical data in the data warehouse

The target variable in a multiple linear regression model must be a: (a) Numeric variable (b) Categorical variable (c) Binary variable (d) Ordinal variable

(a) Numeric variable

What is the measurement level of the variable that measures customer intention as strongly agree, agree, neutral, disagree and strongly disagree? (a) Ordinal (b) Binary (c) Numeric (d) Nominal

(a) Ordinal

We have developed a model to predict student grades based on the number of books they read and the number of classes they attend. The model is estimated as follows: Grade = b0 + b1 * Books + b2 * Attendance, where b0 = 37.59, b1 = 3.14, b2 = 1.19. What is the predicted grade of a student who reads 3 books and attends 10 classes this semester (rounded to 1 decimal point)? (a) 73.8 (b) 58.9 (c) 41.9 (d) 48.2

(b) 58.9
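A quick check of the arithmetic in R:
b0 <- 37.59; b1 <- 3.14; b2 <- 1.19
b0 + b1 * 3 + b2 * 10   # 37.59 + 9.42 + 11.90 = 58.91, which rounds to 58.9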

In the logistic regression model the target variable is: (a) A numeric variable (b) A binary variable (c) Either a numeric or a binary variable (d) A number between zero and one

(b) A binary variable

Which of the following statements is FALSE about data-driven decision-making method: (a) It is based on the collected facts (b) It is loaded with assumptions and theories (c) It requires computational work (d) It is an iterative and algorithmic process

(b) It is loaded with assumptions and theories

Which of the following statements about data mining process is INCORRECT? (a) Cleaning and pre-processing data usually takes the majority of the time spent on the task (b) The overall objective of any data mining process is to optimize the prediction algorithms (c) No value presents in raw, unprocessed, and unactionable data (d) Data mining is a multidisciplinary field that makes a significant use of statistics

(b) The overall objective of any data mining process is to optimize the prediction algorithms

We import the dataset WestRoxbury, which contains records of more than 5000 houses in West Roxbury. The data dictionary is shown below:
TABLE 2.1 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATASET
TOTAL VALUE: Total assessed value for property, in thousands of USD
TAX: Tax bill amount based on total assessed value multiplied by the tax rate, in USD
LOT SQ FT: Total lot size of parcel in square feet
YR BUILT: Year the property was built
GROSS AREA: Gross floor area
LIVING AREA: Total living area for residential properties (ft2)
FLOORS: Number of floors
ROOMS: Total number of rooms
BEDROOMS: Total number of bedrooms
FULL BATH: Total number of full baths
HALF BATH: Total number of half baths
KITCHEN: Total number of kitchens
FIREPLACE: Total number of fireplaces
REMODEL: When the house was remodeled (Recent/Old/None)
We have built a linear regression model to predict Total Value of the house. R output of the model summary is shown below: REMODELOld: p = 0.0133; REMODELRecent: p = 0.00000000002. Which of the following statements is INCORRECT? Assume the significance level is 0.05. (a) Overall, the model is significant with the F-statistic of 3602 (b) Coefficients of the Remodel variable are significant at all levels of REMODEL (c) Coefficient of REMODELOld is less significant than REMODELRecent (d) The intercept is significant

(c) Coefficient of REMODELOld is less significant than REMODELRecent

Which of the following steps comes before other steps in the data mining process? (a) Model training (b) Data pre-processing (c) Define the problem (d) Specify the modeling approach

(c) Define the problem

Which of the following statements is INCORRECT about the logistic regression model? (a) Logistic regression can be used for classification (b) Logistic regression uses odds and the natural logarithm function (c) In the logistic regression, the intercept cannot be zero because of the natural logarithm function (d) Logistic regression can be developed for a binary or a multi-class target variable

(c) In the logistic regression, the intercept cannot be zero because of the natural logarithm function

Which of the following plot types is suitable for showing correlation between two numeric variables? (a) Stacked bar chart (b) Side-by-Side bar chart (c) Scatter plot (d) Histogram

(c) Scatter Plot

Given the following confusion matrix and assuming the positive class is "1", what is the specificity score of the model? (rounded to 2 decimal points)
Actual 1: Predicted 1 = 156, Predicted 0 = 13
Actual 0: Predicted 1 = 37, Predicted 0 = 401
Specificity (TNR) = True Negatives / (True Negatives + False Positives)
(a) 0.90 (b) 0.86 (c) 0.14 (d) 0.92

(d) 0.92
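Working through the calculation (a quick R check):
TN <- 401; FP <- 37
round(TN / (TN + FP), 2)   # 401 / 438 = 0.9155..., which rounds to 0.92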

We import the dataset WestRoxbury, which contains records of more than 5000 houses in West Roxbury. The data dictionary is shown below:
TABLE 2.1 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATASET
TOTAL VALUE: Total assessed value for property, in thousands of USD
TAX: Tax bill amount based on total assessed value multiplied by the tax rate, in USD
LOT SQ FT: Total lot size of parcel in square feet
YR BUILT: Year the property was built
GROSS AREA: Gross floor area
LIVING AREA: Total living area for residential properties (ft2)
FLOORS: Number of floors
ROOMS: Total number of rooms
BEDROOMS: Total number of bedrooms
FULL BATH: Total number of full baths
HALF BATH: Total number of half baths
KITCHEN: Total number of kitchens
FIREPLACE: Total number of fireplaces
REMODEL: When the house was remodeled (Recent/Old/None)
To explore the relationship between Total Value and Remodel, what graphs may we use? (a) A scatter plot with remodel categories as the color (b) Two histograms, one for Total Value and another for Remodel (c) A heat map using Total Value as the color (d) Boxplots across remodel categories

(d) Boxplots across remodel categories

In R, what kind of variables represents Numerical Discrete Variables? (a) Complex (b) Character (c) Numeric (d) Integer

(d) Integer

The following figure shows residual plots of two linear regression models, A and B. [Residual plots not reproduced here.] Which of the following statements is CORRECT? (a) Both models have met all the assumptions of the linear regression model (b) Model A is violating the linearity assumption (c) Model B is better than Model A because it shows smaller residuals (d) Model B is violating the homoscedasticity assumption

(d) Model B is violating homoscedasticity assumption

We have developed a standardized linear regression model to predict car prices based on 5 predictors: Age, Fuel Type, HP, CC, and Tax Price= -0.77 * AGEs + 0.47 * FUELTYPEdiesels + 0.27 * FUELTYPEgasolines + 0.40 * HPs - 0.26 * CCs + 0.24 * TAXs Which of the following statements is INCORRECT about interpretation of the model? (a) As the age of a car increases, its price drops on average (b) All coefficients of the standardized model are between -1 and 1 (c) Age of the car has the largest effect on the price (d) One unit increase in HP is associated with 0.40 unit change of price

(d) One unit increase in HP is associated with 0.40 unit change of price

Which of the following tasks is unsupervised learning task? (a) Identify a network data packet as dangerous vs. non-dangerous (b) Separate loyal customers from not-loyal customers (c) Forecasting sales with a predictive model (d) Segment customers into unknown groups based on their demographics and purchase history

(d) Segment customers into unknown groups based on their demographics and purchase history

Which of the following statements is TRUE about classification and linear regression? (a) Both methods are unsupervised learning algorithms (b) Linear regression is a generalized form of all classification algorithms (c) Linear regression is used for prediction, but classification is not useful for prediction (d) Target variable in classification and linear regression is categorical and numerical, respectively.

(d) Target variable in classification and linear regression is categorical and numerical, respectively.

Linear Model Development

0. Remove outliers, handle missing values, prepare data
1. Select variables to be included in the model
2. Split data into training and validation sets
3. Fit the model on the training set
4. Predict target values on the validation set
5. Calculate model performance metrics for both training and validation set predictions
6. Go to step 1 if model performance is not acceptable
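A compact R sketch of steps 2-5, using the built-in mtcars data as a stand-in (variable selection in step 1 is assumed to be done):
set.seed(42)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))          # step 2: split
train <- mtcars[idx, ]; valid <- mtcars[-idx, ]
fit   <- lm(mpg ~ wt + hp, data = train)                                 # step 3: fit on training set
pred  <- predict(fit, newdata = valid)                                   # step 4: predict on validation set
rmse  <- function(a, p) sqrt(mean((a - p)^2))
c(train = rmse(train$mpg, fitted(fit)), valid = rmse(valid$mpg, pred))   # step 5: compare performance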

Regression Assumptions

1) Linearity
• Violations of the linearity assumption can usually be observed in a scatter plot of x and y
• The residual plot exaggerates the non-linearity and makes it easier to see
• To resolve violations: add nonlinear transformations of the variables, or add new variables
• Simple linear regression results in a poor model (high error); a better approach is to perform a quadratic regression by using a squared term to "bend" the line (see the sketch after this card)
2) Homoscedasticity
• Violation of the homoscedasticity assumption means that as you increase x, your model gets progressively worse (or better) at prediction
• To resolve violations: transform variables, or add new variables
3) Autocorrelation
• Autocorrelation means that the errors are not independent of each other (i.e., the magnitude of the last error somehow influences the magnitude of this error)
• Autocorrelation is an issue in time series data (think about how the season might impact sales)
• To resolve violations: add new variables, or add lags as predictors
4) Normal Distribution
• If there is no discernable pattern in your residual plot, your errors are probably normally distributed
• Violations of the assumption of normal distribution are generally related to non-normality of model parameters
• To resolve violations: transform variables
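A short R sketch of diagnosing a linearity violation and fixing it with a quadratic term, on simulated data:
set.seed(7)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 2)   # a truly curved relationship
fit1 <- lm(y ~ x)                         # simple linear fit
plot(x, resid(fit1))                      # residual plot exaggerates the curvature
fit2 <- lm(y ~ x + I(x^2))                # quadratic regression "bends" the line
plot(x, resid(fit2))                      # the pattern largely disappears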

Layered grammar

1- Data: the foundational layer that provides quantitative and qualitative ingredients for your plot. It is more efficient and easier if data is in the tidy format
2- Aesthetics (aes): the layer that adjusts the visibility of items in your plot: x, y (variables that should become visible), colour (color of visible geometries), fill (the inside color of the visible geoms), group (what group a geom belongs to), shape (the figure used to plot a point), linetype (the type of line used: solid, dashed, etc.), size (size scaling of the geom elements for an extra dimension), alpha (the transparency of the geom)
3- Geometric objects (geoms - determine the type of plot): geom_point() scatterplot, geom_line() lines connecting points by increasing value of x, geom_path() lines connecting points in sequence of appearance, geom_boxplot() box and whiskers plot for categorical variables, geom_bar() bar charts for categorical x axis, geom_histogram() histogram for continuous x axis, geom_violin() distribution kernel of data dispersion, geom_smooth() function line based on data
4- Facets: facet_wrap() or facet_grid() for small multiples
5- Statistics: similar to geoms, but computed; show means, counts, and other statistical summaries of data
6- Coordinates (fitting data onto a page): coord_cartesian() to set limits of the cartesian coordinates, coord_polar() for circular plots, coord_map() for different map projections
7- Themes: overall visual defaults - fonts, colors, shapes, outlines
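A small ggplot2 sketch that stacks several of these layers (assumes the ggplot2 package is installed; mtcars is just an example dataset):
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +   # data + aesthetics
  geom_point(size = 2) +                                       # geometric object
  geom_smooth(method = "lm", se = FALSE) +                     # a computed statistics layer
  facet_wrap(~ am) +                                           # facets (small multiples)
  theme_minimal()                                              # theme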

Metrics to measure accuracy of model prediction

1- MAE (mean absolute error/deviation): gives the magnitude of the average absolute error.
2- Mean Error: similar to MAE except that it retains the sign of the errors, so that negative errors cancel out positive errors of the same magnitude. It therefore gives an indication of whether the predictions are on average over- or under-predicting the outcome variable.
3- MPE (mean percentage error): gives the percentage score of how predictions deviate from the actual values (on average), taking into account the direction of the error.
4- MAPE (mean absolute percentage error): gives a percentage score of how predictions deviate (on average) from the actual values.
5- RMSE (root mean squared error): similar to the standard error of estimate in linear regression, except that it is computed on the validation data rather than on the training data. It has the same units as the outcome variable. (more useful and popular)
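A quick R sketch computing these metrics from actual and predicted values (the numbers are illustrative):
actual    <- c(10, 12,  9, 15, 11)
predicted <- c(11, 10, 10, 14, 13)
e <- actual - predicted
mean(e)                       # Mean Error (signed; errors can cancel out)
mean(abs(e))                  # MAE
mean(e / actual) * 100        # MPE (%)
mean(abs(e) / actual) * 100   # MAPE (%)
sqrt(mean(e^2))               # RMSE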

Data Mining Process

1. Define/understand purpose
2. Obtain data (may involve random sampling)
3. Pre-process, clean, explore, and visualize data
4. Reduce the data; if supervised DM, partition it
5. Specify task (classification, clustering, etc.)
6. Choose the model and techniques (regression, CART, neural networks, etc.)
7. Fit the models to data; iterative implementation and hyper-parameter tuning
8. Assess results - compare models
9. Deploy best model

Data Driven Decision Making

1. Identify and frame the problem (know your mission)
2. Identify sources of relevant data: data owner, data platform, privacy issues, validity and accuracy of data
3. Collect, clean, organize data
4. Select and perform appropriate analytical method: simulation, statistical analysis, data mining, etc.
5. Post-process the results (visualization) and put them in the right context to make sense of them and draw conclusions

Multiple regression model assumptions:

1. There must be a linear relationship between each predictor and the target variable
2. Residuals of the model should be normally distributed
3. No multi-collinearity: predictors are not highly correlated
4. Independence of observations
5. Residuals have constant variance at every point in the linear model (homoscedasticity assumption)

Scatter plot

A 2-dimensional plot that shows the relationship between two numerical variables (how they change). Can show 3rd and 4th variable with size, shape, color

Numeric Variable Normalization

A mathematical function that changes the scale (range) and values of a numerical variable by applying the same function to every value
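A tiny R sketch of one common choice, min-max normalization (the values are illustrative):
x <- c(10, 20, 35, 50)
(x - min(x)) / (max(x) - min(x))   # rescaled to the range [0, 1]
scale(x)                           # alternative: z-score standardization (mean 0, sd 1)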

Factors in R

A special way to store categorical variables in R. A factor variable can only take specific levels (e.g., Male, Female), whereas a character variable can take any value. Factor variables are (behind the scenes) integer variables that reference a defined set of characters (like a map). Ex: levels Male and Female are represented by 1 and 2.
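A quick R sketch showing the levels and the underlying integer codes:
sex <- factor(c("Male", "Female", "Female", "Male"), levels = c("Male", "Female"))
levels(sex)       # "Male" "Female"
as.integer(sex)   # 1 2 2 1 - the integer codes referencing the levels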

Unsupervised Learning

A type of model creation, derived from the field of machine learning, that does not have a defined target variable.

True Positive

A. 1 and 1

False Positive

C. Actual 0 and Predicted 1: predicted positive but actually negative.

IBM Approach

CRISP-DM: Business Understanding goes back and forth with Data Understanding; Data Understanding leads to Data Preparation; Data Preparation goes back and forth with Modeling; Modeling leads to Evaluation; Evaluation leads back to Business Understanding or on to Deployment.

Variable Types

Categorical: Nominal (if two levels, binary; can be expressed as numbers, e.g., freshman=1, sophomore=2, junior=3, senior=4), Ordinal (has order). Numerical: Discrete (# of people, cars, houses, cities), Continuous.

2 Supervised Learning

Classification: predicts a categorical target; the target variable is usually binary (purchase/no purchase, graduate/withdraw). Linear Regression: predicts a numerical target; the target variable is usually a continuous numeric one (sales, student GPA, # of car accidents).

True Negative

D. 0 and 0

Data Mining vs Statistics

Data Mining - Goal: to discover useful patterns in plenty of data (whole to part, inductive process); starts with data; works for larger data sets; widely used in business applications and scientific discoveries. Statistics - Goal: to infer properties of a population by examining samples (part to whole, deductive process); starts with a theory; better for smaller data sets; widely used in quantitative research and scientific discoveries.

Core Ideas in Data Mining

Data cleaning and preprocessing, data exploration, data and dimension reduction, visualization, classification, regression, association rules and recommenders, sampling, training, testing

Why Data Visualization

External representations augment human capacity by allowing us to surpass the limitations of our own internal cognition and memory; in other words, visualization balances cognition with perception. Visualizations can be designed to support perceptual inferences which humans make easily. Identify and understand the patterns and trends in data (from data to information). Present a large amount of data in a very compact form. Engage the audience. Re-evaluate the business problem and re-define it if necessary. Select more appropriate data mining models.

Confusion Matrix Terms

Fall Out (FPR) = 1 - Specificity; Miss Rate (FNR) = 1 - Recall (Sensitivity); Precision = 1 - False Discovery Rate (FDR). Sensitivity/Recall: portion of positives classified as positive. Fall Out (FPR): portion of negatives classified as positive. Miss Rate (FNR): portion of positives classified as negative. Good classifier: high TPR (low FNR), low FPR.

Consider two models A and B. If the prediction accuracy of Model A is higher than that of Model B for the training dataset, we can safely say that Model A is better than Model B. True/False

False

Training data set serves to detect and prevent overfitting of the predictive model. True/False

False

Horizontal vs Vertical Cases

Horizontal: can be used across a wide range of industries and applied to many different areas of a given business (e.g., Accounting, Management, Operations). Vertical: targeted at a specific industry (e.g., Healthcare, Logistics, Retail).

Success?

Improved Business Outcomes

Model performance evaluation

In statistics: goodness-of-fit is quantified by a variety of metrics. Can the selected model sufficiently explain the data? It is used to select the best model and understand the limits of its generalizability. In data mining: assessment is based on the validation set. Can the model acceptably predict future records? It is used to select the model, tune its hyper-parameters, and estimate its predictive power.

Tabular Data Formats

Individual files on local/remote storage devices: plain text formats (fixed width, tab delimited, space delimited, comma delimited), spreadsheet formats (Excel, OpenDocument, OpenOffice), statistical formats (Stata DTA, SAS7BDAT, SPSS SAV, RData). Tables on DBMS servers: MS Access, SQL tables (MySQL, MS SQL Server, PostgreSQL, ...), distributed servers (Hadoop, BigQuery, AWS).

Two decision making processes

Intuition-based: the more traditional approach; it relies on the expertise, feelings, thoughts, and guesses of the decision-maker (typically one person). It may or may not involve brainstorming sessions, and arguments for or against alternatives are important. It draws on past experiences, creativity, and the holistic vision of the decision-makers. Data-driven: usually starts with questions about the data that is available and what solution "emerges" from it. It involves an algorithmic, step-by-step, systematic process with significant computational work. In this approach the people who make the final decisions are usually not the people who collect, clean, process, discover, and present the data. Each approach has its pros and cons depending on the business and situation. However, businesses are increasingly leaning toward data-driven decision making, or at least informing their intuition-based process with the patterns and insights discovered in the business data.

Variable Selection

It is expensive or not feasible to collect a full complement of predictors The more predictors, the higher the chance of missing values in the data Parsimony is an important property of good models: easier to explain The more predictors, the higher the chance of multi-collinearity (correlation between the predictors): less stable estimates of model coefficients Rule of thumb for the number of records and predictors: n≥5(p+2)

Transformation

Normal distributions are desirable but not always present in a given data set Transformation involves mathematically manipulating the data to make modeling more feasible Transformation can hurt the interpretability of your model

Linear Relationship

One where increasing or decreasing one variable n times is associated with a corresponding increase or decrease of n times in the other variable too. Association of change; proportionality of change; could be causal or not.

Numeric Variable Transformation

Performing mathematical functions on them and creating new variables that are better suited for our data mining model

Residual

The difference between the observed value and the predicted value, denoted by e. A negative residual means the model is overpredicting and a positive residual means it is underpredicting. It is very common to use 'error' and 'residual' interchangeably; however, they are not the same. With residuals we can calculate the mean error, mean absolute error, and root mean squared error, which are model performance metrics.

gg in R

ggplot2 is a very useful and popular R package developed by Hadley Wickham for creating presentation-quality visualizations in a wide variety of contexts. It is based on the concepts of the layered grammar of graphics: robust, logical and structured syntax, flexible, functional, reproducible.

Dummy Coding

The practice of converting a categorical variable into several binary variables (1's and 0's) that can be used in a regression model. If your variable has n levels, dummy coding will always result in n-1 variables R and most other modern data modeling tools will perform the dummy coding operation for you
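A quick R sketch showing how a factor is dummy-coded into n-1 binary columns; the REMODEL-style variable here is illustrative:
remodel <- factor(c("None", "Old", "Recent", "None"))
model.matrix(~ remodel)   # "None" becomes the reference level; Old and Recent get 0/1 columns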

R Objects

Vector (1-D Homogeneous), Matrices (2-D Homogeneous), List (Heterogeneous), Factor (T/f; small, medium, large; Homogeneous), Data Frame (Heterogeneous)

Heat Map

depicts values for a main variable of interest across two axis variables as a grid of colored squares

Box plot

gives a visual stat summary of a numeric variable. The line that divides the box into 2 parts represents the median of the data. The end of the box shows the upper and lower quartiles. The extreme lines show the highest and lowest value excluding outliers

Heat map

is a 3d graphical representation of data where the individual values contained in a matrix are represented as colors. It is a bit like looking a data table from above. It is useful to display a general view of numerical data, not to extract specific data point

Integer

is a whole number (not a fractional number) that can be positive, negative, or zero

Business Intelligence

is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies. It is an infrastructure that helps collect, store, transform, analyze, and present data to support the business decision-making process. It enables easy access to data to provide business managers with the ability to conduct analysis, and transforms data to information (knowledge) and to decisions that finally lead to action.

ROC Curve

is used to compare classifiers. The random classifier (benchmark) is the diagonal line; the curve that stands on top of the others shows the best classifier. Moving away from the diagonal line in the up-left direction means the classification performance is improving.

Missing Values in R

is.na( )
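A quick R sketch of locating and handling missing values with is.na():
x <- c(5, NA, 3, NA, 8)
is.na(x)                               # TRUE where a value is missing
sum(is.na(x))                          # count of missing values
mean(x, na.rm = TRUE)                  # many functions can skip NAs explicitly
x[is.na(x)] <- mean(x, na.rm = TRUE)   # simple mean imputation (one possible strategy)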

Business intelligence workflow

provides processes and systems that optimize complicated data production methods initially done 'manually' by data analysts in each commercial area

Categorical

represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group

Stacked Bar Chart

shows two categorical variables. The first (and primary) variable is shown along the entire length of the bar, and the second variable is represented as stacks within each categorical bar

Numerical

the data that is in the form of numbers, and not in any language or descriptive form

Numeric

the data that is in the form of numbers, and not in any language or descriptive form. EX: height, weight, age, number of movies watched, IQ

Propensity

the probability of class membership. All classification models first calculate the propensity score of a record, then decide class membership based on a default cutoff value. Changing the propensity cutoff value can change the TPR and FPR.
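A tiny R sketch of turning propensity scores into class labels at different cutoffs (the scores are illustrative):
prop <- c(0.91, 0.55, 0.42, 0.08, 0.67)
ifelse(prop >= 0.5, 1, 0)   # default cutoff of 0.5
ifelse(prop >= 0.7, 1, 0)   # raising the cutoff lowers both TPR and FPR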

Classification

the process of organizing data into categories that make it easy to retrieve, sort and store for future use. (categorical)

Data Mining Process

understanding data through cleaning raw data, finding patterns, creating models, and testing those models

Histogram

univariate plot that shows the distribution of data (value vs. frequency). Useful to understand the nature of a variable in terms of its spread, skewness, and peaks.

Side by Side Bar Chart

used to display two categorical variables

What is data-driven decision-making?

using facts, metrics, and data to guide strategic business decisions that align with your goals, objectives, and initiatives.

Complex

usually a composite of other existing data types. For example, you might create a complex data type whose components include built-in types, opaque types, distinct types, or other complex types. EX: bills of materials, word processing documents, maps, time-series, images and video.

Negative Skewness

• Moderately negative: x_new = sqrt(K - x)
• Substantially negative: x_new = log10(K - x)
• Where K is a constant from which each score is subtracted so that the smallest score is 1; usually equal to the largest score plus 1

Positive Skewness

• Moderately positive: x_new = sqrt(x)
• Substantially positive: x_new = log10(x)
• Substantially positive (with zero values): x_new = log10(x + C), where C is a constant added to each score so that the smallest score is 1
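A short R sketch of these transformations on a simulated right-skewed variable:
x <- rexp(100, rate = 1)   # right-skewed example data
x_mod <- sqrt(x)           # moderately positive skew
x_sub <- log10(x + 1)      # substantially positive skew with zero values (here C = 1)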

Data Mining

is the extraction of interesting (previously unknown and potentially useful) patterns or knowledge from large volumes of data; is an attempt to remove some of the uncertainty associated with decision making in the business environment; falls under the Business Intelligence umbrella and has numerous AKAs (business analytics, data analysis, knowledge extraction, knowledge discovery in databases (KDD)).

