MIS 3300 Midterm 1


Correlation _______ causation.

is NOT

Parsing

locating and identifying individual data elements in the source files and then isolating these data elements in the target files

What k-means clustering is (steps involved), what k stands for, and how k is determined

1. The number k of clusters is fixed by the analyst 2. Aggregation centroids (seeds) are provided 3. All units are assigned to the nearest cluster seed 4. New seeds are computed 5. Go back to step 3 until no further reclassification is necessary. K-means clustering is the clustering of data around a set of centroid values. K stands for the number of groups requested by the data analyst; the analyst determines k.

How Confidence values are calculated and the meaning/interpretation

A conditional probability estimate (how likely one item is to be purchased given that another item is purchased). Example: how likely fins are to be purchased if a mask has already been purchased. Example with fins and masks: to find the confidence value, take the 150 people who have fins and a mask, and divide that by the 270 people who have a mask in total. We are trying to find the probability that people have fins given that they already have a mask.
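
The fins/mask arithmetic above can be sketched in a couple of lines of Python (the counts come straight from the example):

```python
# Confidence(mask => fins): of the people who bought a mask,
# what fraction also bought fins?
both_fins_and_mask = 150   # people with fins AND a mask (from the example)
mask_total = 270           # people with a mask in total (from the example)

confidence = both_fins_and_mask / mask_total
print(round(confidence, 3))  # 0.556
```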

Loose data

A source of Data: data (e.g. spreadsheets) that isn't necessarily part of a centralized database

Business Intelligence

A source of Data: a set of theories, methodologies, architectures, and technologies that transform raw data into meaningful and useful information for business purposes

Predictive Analytics

A source of Data: a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events

Prescriptive Analytics

A source of Data: automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences, and business rules, to make predictions and then suggests decision options to take advantage of the predictions

An organization's internal databases

A source of Data: sales data, employee information, etc.

Data Mining

A source of Data: the computational process of discovering patterns in large data sets ("big data")

Business Analytics

A source of Data: the skills, technologies, applications, and practices for continuous, iterative exploration and investigation of past business performance to gain insight and drive business planning

Big Data

A source of Data: the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

Data Marts

A source of Data: thematic databases (e.g., a sales data mart), typically subject-specific subsets of a data warehouse

Descriptive Analytics

A source of Data: to gain insight from historical data with reporting, scorecards, clustering, etc.

Target or Label attribute

An attribute we'd like to predict: the dependent variable.

Data instance

An entity (such as an object or person) that is described by a collection of attributes, also known as a feature vector

Diagnostic Data Analytics:

Data Mining -- the computational process of discovering patterns in large data sets ("big data").

If you want to compare results across categories, use a ______ chart (e.g., showing the total sales for the quarter by each of five retail outlets).

Bar Chart

Goal of Supervised data, and an example

Build a simple model that accurately predicts the value of a target attribute using the values of other attributes. Example: How can we predict if a customer will cancel their service after their contract expires?

The sequence of steps in the CRISP-DM process

Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

Inaccuracy: the threat it poses to data analysis

Inaccurate data can completely throw off your analysis if it is not caught in the Data Understanding phase. To guard against it, make sure the values make sense and that maintenance is performed on sensors, machines, etc.

Nominal to Numerical (Operator in RapidMiner) Does what?

Changes nominal data (words) to numerical data. If you have a categorical variable in your dataset that you'd like to see included in the analysis, you'll need to first turn that categorical variable into a set of dummy variables. The Nominal to Numerical operator does just that.
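
A minimal sketch of what dummy-variable encoding does (the function name and the color values are made up for illustration; RapidMiner's operator handles this internally):

```python
# Turn one categorical column into a set of 0/1 dummy variables,
# one dummy per category observed in the data.
def to_dummies(values):
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

rows = to_dummies(["red", "blue", "red"])
print(rows[0])  # {'is_blue': 0, 'is_red': 1}
```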

The difference between classification and regression types of supervised data mining

Classification: assign data into predefined classes (ex: spam detection, fraudulent credit card detection); uses categorical (discrete) data. Regression: predict a real value for a given data instance (ex: predict the price for a given house); uses numerical (continuous) data.

Is Numerical data discrete or continuous?

Continuous

What correlation analysis is, the type of data it requires, and the types of business questions it can answer

Correlation analysis is a statistical technique that allows us to see the extent to which variables relate to each other (or one variable predicts the value of another). The type of data it requires is quantitative continuous (numerical) data. It can answer questions like "how does a customer's age relate to how much they spend?"

What data science is and how it's used?

DS uses certain methods (algorithms) and other systems to extract knowledge from structured and unstructured data. Useful programs/tools include Excel, Power BI, and RapidMiner. Results can be used for business inquiries, case studies, and various other purposes.

Nominal Data

Data for which each observation can be assigned to one of any number of labels. These labels are sometimes referred to as classes. (each observation in a dataset belongs to one of X number of different categories) (ex. Continents, zip codes, months of the year etc.)

Inconsistency: the threat it poses to data analysis

Data is in different units, different scales, etc. Make sure everything is on the same scale or converted to a z-score

Assumptions and limitations of correlation analysis, including homoscedasticity

Data must be measured on an interval or ratio scale.
Data must be at least approximately normally distributed:
-The x and y data should be sampled from populations with normal distributions
-Do not overly rely on a single indicator of normality; use histograms, skewness, and kurtosis
Avoid outliers:
-Outliers can disproportionately increase or decrease r
-Consider removing outliers or transforming the data (eg logarithmic transformation)

The following are examples of WHAT KIND of data? Errors (typos, misspellings), Inconsistent Data, Absence of Data, Multipurpose Fields, Contradicting Data, Violation of Business Rules, Reused Primary Keys

Dirty Data

Goal of Unsupervised data, and an example

Discover naturally occurring patterns or groupings within the data without trying to predict the value of a target attribute. Example: Do our customers fall naturally into different groups?

Is Categorical data discrete or continuous?

Discrete

CRISP-DM PROCESS: Data Preparation

Ensure that the data is accurate and can be used efficiently and effectively, usually by putting things together and cleaning them up in one way or another.

CRISP-DM PROCESS: Evaluation

Evaluating and interpreting the results of those models.

The types of data sources that can be imported into Power BI

Excel, text, CSV, SQL, web, and PowerBI datasets

What ETL is and why it is important in data analytics?

Extract, Transform, Load. It is important because it allows for the data to be analyzed and processed properly and in the right location. Another Reason it is important: if you put garbage in, you get garbage out (GIGO). It keeps the information uniform and in the correct form for the warehouse or application it is used in.

What data mining is

Extracting or "mining" knowledge from data; the discovery of hidden and actionable patterns in data; extracting implicit, previously unknown, unexpected, and potentially useful information/knowledge from data.

Data Mining Applications

Identifying (ex. identifying fraudulent transactions); Extracting (ex. extracting purchase patterns from existing records); Forecasting (ex. forecasting future sales and needs according to given samples); Grouping (ex. extracting groups of like-minded people within a larger group).

Incomplete Data: the threat it poses to data analysis

If you are looking to predict a dependent variable, missing data can make it impossible to do so. The best way to deal with this is either to fill in the missing values using common sense or to remove any incomplete records that you can't fill in.

5 Common steps in data cleansing and how long data cleansing takes as part of the overall data mining process

The steps in the data cleansing process are parsing, correcting, standardizing, matching, and consolidating. 50-80% of time is spent on data cleansing. This is referred to as "data janitor-work"

What association analysis is, the type of data it requires, and the types of business questions it can answer.

Finding frequent patterns or associations among sets of items or objects in transaction databases, relational databases, and other information repositories (when x is true, y is also true). Binomial data is used (1 or 0, yes or no), showing whether something occurred or not. Uses: market basket analysis; cross-selling (the customer bought x, so they may be interested in y); deciding what products to sell together; and determining how confident we are that if a customer buys x they will also buy y.

Interval

Interval variables have numerical values that can be measured along a continuum. The distance between values is consistent, unlike with ordinal variables: the difference between 4 & 5 is the same as the distance between 12 & 13. Can be zero or less than zero.

Basic principles of visualization, including most common chart types and their purpose (e.g., line charts, box plots, pie charts, heat or filled maps, stacked column chart, etc.) and how to create visualizations that clearly communicate a trend

Histogram: a columnar chart used to describe distributions. Each bar represents the frequency with which a given value is observed. Often you need to create categories with ranges to make a histogram work.
Bar chart: if you want to compare results across categories (e.g., showing the total sales for the quarter by each of five retail outlets).
Line chart: if your data is being measured over time (e.g., showing revenue per year over the last 10 years).
Pie chart: if you want to show the proportion of something that is accounted for by various classes of a variable (e.g., the percentage of total revenue contributed by each of the five retail outlets).
Stacked column chart: if you want to compare results across categories, but then further break down those results by sub-categories.

Homoscedastic VS Heteroscedasticity

Homoscedastic: the variance of one variable is roughly the SAME at all levels of the other variable. Heteroscedasticity: the variance of one variable is DIFFERENT at different levels of the other variable.

What a centroid is and how to interpret results using centroid values

In a clustering analysis, the centroid represents the mean value of a given variable for a given cluster

Understand what cluster analysis is, the type of data it requires, and the types of business questions it can answer

In clustering analysis, the analyst('s software) uses a dataset's observed variables to create discrete groupings of (similar) observations. Clustering analysis can only use numerical data sets. It answers questions such as: Do cases tend to cluster into natural groups that we can use to take some action? Do certain groups of customers tend to display similar purchase patterns? Are there certain clients who have a higher risk profile than others? Who are my organization's best members? Who is most likely to buy my services? What types of products are most profitable?

5 Common forms of "dirty data"

Inaccuracy, Incomplete Data, Inconsistency, Time appropriateness, Uniqueness.

How to interpret a cluster analysis

Interpretation involves examining the distinguishing characteristics of each cluster's profile and identifying substantial differences among clusters The cluster centroid, a mean profile of the cluster on each clustering variable, is particularly useful in the interpretation stage. Cluster solutions failing to show substantial variation indicate other cluster solutions should be examined. The cluster centroid should also be assessed against the analyst's prior expectations based on theory or practical experience

Basic functions and capabilities of Power BI in transforming data (no specific user interface questions; just general concepts and functionality) - e.g., trimming

It can help correct typos in the data set so all values are consistent. It lets the user trim extra spaces at the ends of values for consistency. You can split a column of values into two columns to aid understanding of the dataset. You can consolidate duplicated records so that the data is unique. You can save your updated dataset back into Excel to add other formulas or corrections that still need to be made.

If your data is being measured over time, a ______ chart is usually the right choice (e.g., if you're showing revenue per year over the last 10 years).

Line Chart

Uniqueness: the threat it poses to data analysis

Make sure that the data is normalized / all data only appears once.

Time appropriateness: the threat it poses to data analysis

Make sure the data was collected recently enough, and is used soon enough, that it will actually do some good; data from 30 years ago probably shouldn't be used today.

Understand each of the following descriptive statistics and their purpose: mean, median, mode, variance, standard deviation, interquartile range, and outliers (no need to memorize equations)

Mean: arithmetic average of a set of values. Describes "the center." Susceptible to outliers and skew. Mean = sum of all data points / number of data points.
Median: middle number when values are ordered smallest to largest. Not very affected by outliers or skew.
Mode: the value that occurs most frequently in the data set.
Variance: a measure of how much the observed values tend to differ (or "vary") from the mean value.
Standard deviation (SD): the typical distance of data points from the center of the dataset. For normally distributed data, about 68% of values fall within 1 SD, 95% within 2 SDs, and 99% within 3 SDs.
Interquartile range (IQR): the range covering the middle 50% of the data (Q1 to Q3).
Outliers: data points that lie unusually far from the rest of the data (e.g., beyond the IQR-based fences, or more than 2 standard deviations from the mean).
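
These statistics can all be computed with Python's standard library; a quick sketch on a made-up dataset:

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = st.mean(data)           # 5.0
median = st.median(data)       # 4.5 (average of the two middle values)
mode = st.mode(data)           # 4 (most frequent value)
variance = st.pvariance(data)  # 4.0 (population variance)
stdev = st.pstdev(data)        # 2.0 (population standard deviation)

# Interquartile range: the middle 50% of the data (Q3 - Q1)
q1, _, q3 = st.quantiles(data, n=4)
iqr = q3 - q1
print(mean, median, mode, variance, stdev, iqr)
```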

Correlation matrix: (Operator in RapidMiner) Does what?

Measures the degree of correlation between 2 or more attributes. This correlation can be positive or negative.

what do skewness and kurtosis refer to?

Skewness relates to the shape of the distribution: the degree to which a distribution is asymmetric (has a long tail on one side or the other) relative to a normal distribution. If the data are positively skewed, the mean is greater than the median; if negatively skewed, the mean is less than the median.
Kurtosis is the degree to which the distribution is peaked or flat: k = 0 is mesokurtic (normal), k > 0 is leptokurtic, k < 0 is platykurtic. Low kurtosis means the data lack outliers; high kurtosis means heavy tails/outliers.
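
A hedged sketch of how these two shape measures can be computed (using the standard moment-based definitions, which the course may or may not use; the function names are made up):

```python
import statistics as st

def skewness(xs):
    # Population (moment) skewness: the third standardized moment.
    # 0 for a symmetric distribution; positive = long right tail.
    m = st.fmean(xs)
    s = st.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def excess_kurtosis(xs):
    # Fourth standardized moment minus 3, so a normal distribution
    # scores 0 (mesokurtic); > 0 leptokurtic, < 0 platykurtic.
    m = st.fmean(xs)
    s = st.pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4) - 3

print(skewness([1, 2, 3, 4, 5]))         # 0.0 (symmetric data)
print(excess_kurtosis([1, 2, 3, 4, 5]))  # negative: flat, platykurtic
```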

Continuous variables:

Numerical values are numbers that act like numbers and fit meaningfully on a number line. Ratio and Interval are both continuous variables.

CRISP-DM PROCESS: Modeling

Performing analyses, including data-mining techniques such as clustering, regression, and decision tree analysis, to help us understand the current situation and/or predict how changes (i.e., business decisions) will/can affect future outcomes and/or what decisions should be applied to present/future events.

If you want to show the proportion of something that is accounted for by various classes of a variable, use a _______ chart (e.g., showing the percentage - of total revenue contributed by each of the five retail outlets).

Pie Chart

Primary and Secondary Sources of Data

Primary data: Surveys, interviews, focus groups Secondary data: research articles, internet searches, etc.

What is the difference between qualitative and quantitative data

Qualitative data: data about qualities - can't necessarily be measured. Also termed categorical (e.g. hair color, eye color). Quantitative data: data about quantities - can measure or count them. Two types of Quantitative data:

Discrete Data is:

Quantitative: can be counted in whole numbers or integers (e.g., # kids in a family)

Continuous Data is:

Quantitative: can be measured on a finer and finer scale (e.g., age or height)

Ratio

Ratio variables are numerical and have meaningful intervals, but they have one more feature that interval variables don't have: the ratios between values are meaningful. They have a true zero point and cannot be less than zero.

Ordinal Data

Refers to data that imply a certain order or ranking but don't necessarily function quite like numbers on a number line. (Disagree, Somewhat Disagree, Neither Agree or Disagree, Somewhat Agree, Agree) or (Cold, Warm, Hot) or (1st, 2nd, 3rd,...)

5 Basic functions we studied in RapidMiner (in developing data mining processes).

Retrieve (the data set), Set Role, Nominal to Numerical, Numerical to Binomial, and Select Attributes

Matching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications

If you want to compare results across categories, but then further break down those results by sub-categories, use a ______________ __________ chart.

Stacked Column

What scatter plots look like for strong versus no relationship between two variables

With a strong relationship, the points sit compactly, close to the line of best fit. With a weaker or no relationship, the points fall farther from the line of best fit (showing no clear pattern).

Data Mining tasks (supervised and unsupervised):

Supervised:
Classification: assign data into predefined classes (ex. spam detection, fraudulent credit card detection)
Regression: predict a real value for a given data instance (ex. predict the price for a house)
Unsupervised:
Clustering: group similar items together into clusters (ex. detect communities in a social network)

Are the following Supervised or Unsupervised forms of data? linear and logistic regression, decision tree analysis, linear discriminant analysis, neural network analysis, and random decision forests

Supervised (target variable, attempt to predict a real value).

Select Attributes (Operator in RapidMiner) Does what?

Tells rapid miner which attributes to take into consideration when analyzing the data. This operator allows you to specify which attributes (variables) within the entire dataset will continue forward in the process. It's useful if there are variables in your dataset that aren't going to be helpful in the analysis.

FP-Growth operator in RapidMiner (why it's used, how to use it)

The FP-Growth operator outputs frequent item sets that meet whatever criterion you establish; it identifies the item sets present in a dataset and yields their support values (so long as the support value is higher than the minimum requirement). The support is the probability that two items will be purchased together. The way we understand it, the FP-Growth operator basically acts as a filter: it limits the output so that only item sets meeting the criteria we establish (minimum support values) appear in the results.

The basics of how the k-means algorithm works, including the concept of cluster seeds (centroids)

The centroid represents the mean value of a given variable for a given cluster. 1. The number k of clusters is fixed by the analyst. 2. An initial set of k "seeds" (aggregation centroids) is provided: the first k elements, or other seeds (randomly selected or explicitly defined). 3. Given a certain fixed threshold, all units are assigned to the nearest cluster's seed. 4. New seeds are computed. 5. Go back to Step 3 until no reclassification is necessary.
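
The assign-then-recompute loop can be sketched for one-dimensional data (illustration only; real tools such as RapidMiner's k-Means operator handle many dimensions, and the function name and data here are made up):

```python
# Minimal 1-D k-means: assign points to the nearest seed, recompute
# seeds as cluster means, repeat until nothing is reclassified.
def kmeans_1d(points, seeds, max_iter=100):
    centroids = list(seeds)                       # step 2: initial seeds
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:                          # step 3: nearest seed
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]  # step 4
        if new_centroids == centroids:            # step 5: stable, stop
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 10, 11], seeds=[1, 10])
print(centroids)  # [1.5, 10.5]
```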

What the coefficient of determination is and how to calculate and interpret it

The coefficient of determination, most commonly known as R^2 or "R-squared," is the percentage of variance in one variable explained by the other variable in a correlation analysis. This value is found by squaring the correlation coefficient. For example, if we have a correlation coefficient of 0.8 for variables A and B, that tells us that the variation in A accounts for 0.8^2 = 0.64, or 64%, of the variation in B.
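
A sketch of the calculation: compute Pearson's r by hand, then square it to get R^2 (the helper name and data points are made up for illustration):

```python
# Pearson correlation coefficient from first principles:
# covariance of x and y divided by the product of their spreads.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear data
r_squared = r ** 2                         # e.g., r = 0.8 would give 0.64
print(round(r, 6), round(r_squared, 6))
```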

How to interpret a correlation coefficient and correlation analysis results

The higher the absolute value of r, the stronger the correlation, regardless of whether it is positive or negative; r itself ranges from -1 to 1, so |r| goes from 0 to 1. R-squared tells us the percentage of the variance in one variable explained by the other variable in a correlation analysis. Strength guide for |r|: 0.8-1 very strong; 0.6-0.8 strong; 0.4-0.6 moderate; 0.2-0.4 weak; 0-0.2 very weak.

How support values are calculated and the meaning/interpretation

The probability that two items will be purchased together (example: fins and a mask), i.e., the number of times the items were purchased together divided by the total number of transactions. Support = P(A&B) = the number of people who bought both fins and masks / 1,000 total transactions.
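
Continuing the fins/mask example in Python (the count of 150 joint purchases and the 1,000-transaction total come from the examples above):

```python
# Support(fins, mask) = joint purchases / total transactions
both_fins_and_mask = 150   # transactions containing fins AND a mask
total_transactions = 1000  # all transactions

support = both_fins_and_mask / total_transactions
print(support)  # 0.15
```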

Potential effect of outliers on cluster analysis

They can severely distort the representativeness of analysis results because the centroid value(s) are heavily impacted by the outlier. Outliers can be very detrimental to a cluster analysis, especially when they represent a very small proportion of the observations and really distort the analysis results. This is because centroid values are a mean of all cluster observations, and an outlier in the cluster will heavily impact the mean/centroid value. Since cluster analysis is interpreted based on the centroids, outliers distort cluster analysis results.

Set Role (Operator in RapidMiner) Does what?

This operator assigns a special role (such as id, or label for the target attribute) to an attribute, telling RapidMiner how that attribute should be treated in the analysis.

Generate aggregates: (Operator in RapidMiner) Does what?

This operator creates a new example set from the input example set, showing the results of the selected aggregation functions. The most-used functions are sum, count, min, max, and average.

Typical uses of Supervised Data

Typical uses: Prediction, Diagnostic

CRISP-DM PROCESS: Business Understanding

Understand the problem, the people who have the problem, the constraints around the eventual solution, and the expected outcome of the analysis.

The difference between supervised and unsupervised data mining

Unsupervised data analysis: refers to any data analysis that is descriptive in nature and in which there is no dependent variable (sometimes referred to as a target variable) Supervised data analysis: Data analysis in which there's a target variable. Attempts to predict a real value or assign data into a predefined class (Spam detection, Fraud detection, Price of a House).

CRISP-DM PROCESS: Deployment

Using the evaluation to develop a plan of action, then executing that plan.

Basic functions of Power BI in creating visualization dashboards (linkage, slicers, filters)

Visualization such as bar graphs, scatter plots, pie graphs, etc. can be added. Variables can then be added to the Axis, Legend, and Values in order to compare variables with each other. Results can then be filtered to narrow selections.

How to calculate similarity/dissimilarity between cases (e.g., Euclidean distance) and how to interpret distance measures

We are trying to maximize intra-class homogeneity, and maximize inter-class heterogeneity (trying to find the distance between each observation) Euclidean distance: Is a measure of the true straight line distance between each observation set in Euclidean space. (can be calculated by the hypotenuse on a Cartesian Coordinate Plane or a number line). Values that have a greater distance apart are more dissimilar.
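The straight-line distance described above is available directly in Python's standard library; a small sketch with two made-up observations:

```python
import math

# Euclidean distance between two observations (feature vectors).
# A larger distance means the observations are more dissimilar.
a = (1.0, 2.0)
b = (4.0, 6.0)

dist = math.dist(a, b)  # hypotenuse of a 3-4-5 right triangle
print(dist)  # 5.0
```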

What convergent validity is and when we use it

When two variables appear to be measuring the same thing (example: two survey questions designed to gauge a respondent's view of the same topic or concept). This is beneficial because if these two variables do in fact correlate, we can be more confident in the results. Once we decide two or more variables converge, we aggregate/combine them, compute a new mean value, and use that. If two or more variables were meant to converge but do not show a strong correlation, the data may be thrown out.

CRISP-DM PROCESS: Data Understanding

Where we obtain, evaluate, and explore available data, trying to get a better sense of (a) what it tells us about the problem and (b) what it might be able to offer in terms of analysis and solutions.

The tradeoffs between adjusting the minimum support and confidence thresholds and the resulting association rules generated

You can change this parameter to lower (or raise!) the minimum support value required for RapidMiner to output an itemset with its corresponding support value. Lower values will return more item sets, and higher values will return fewer. The same holds for confidence thresholds: lowering the minimum value returns more rules, and raising it returns fewer. This makes sense because a higher minimum value limits the results more. It's like limiting a dataset to students who only received A's in a class versus students who received anything above a C: the lower, more inclusive threshold returns far more students.

Numerical to Binomial (Operator in RapidMiner) Does what?

You may have data that are primarily numerical using 1's and 0's. Unfortunately, FP-Growth requires all values to be either True or False. Numerical to Binomial translates values of 1 and 0 into values of True and False.

Why we sometimes normalize data in cluster analysis (z-score)

You normalize data by getting the variables onto a common scale so you can compare them; expressing every variable as z-scores puts them all on the same scale. A z-score is the number of standard deviations the observed value is from the mean of all observed values for that variable.
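
The z-score conversion can be sketched in a few lines (the helper name and values are made up for illustration):

```python
import statistics as st

def z_scores(values):
    # z = (x - mean) / standard deviation:
    # how many SDs each value sits from the mean.
    m = st.mean(values)
    s = st.pstdev(values)
    return [(v - m) / s for v in values]

print(z_scores([10, 20, 30]))  # roughly [-1.22, 0.0, 1.22]
```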

Feature/Attribute

a property or characteristic of an object, often referred to as a variable

Predictive Data Analytics: (what it is used for and what questions it answers)

a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. Question: What might happen in the future?

Consolidating

analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation

Standardizing

applying conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules

Prescriptive Data Analytics: (what it is used for and what questions it answers)

automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences, and business rules, to make predictions and then suggests decision options to take advantage of the predictions. Question: What should we do about it?

Record, case, example, and observation are all data ________________

data instances

4 Different types of data analytics

diagnostic, descriptive, predictive, prescriptive

Correcting

modifying individual data components either manually or using sophisticated data algorithms and secondary data sources

Retrieve the Data Set (Operator in RapidMiner) Does what?

pulls the values into the design that we want

Dichotomous data

refers to any variable for which the value must be one of two possible values (either/or) (True/False)

How Lift values are calculated and the meaning/interpretation

the ratio of a rule's confidence to the base probability of buying the consequent item: lift(X ==> Y) = confidence(X ==> Y) / P(Y). Example with fins and masks: lift(mask ==> fins) = confidence(mask ==> fins) / probability of fins. A lift greater than 1 means buying the antecedent makes the consequent more likely than its baseline probability.
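
A sketch of the lift calculation, reusing the fins/mask counts from the confidence example; the 300 fins purchases and 1,000-transaction total are assumed figures for illustration:

```python
both = 150    # transactions with fins AND a mask (from the example)
mask = 270    # transactions with a mask (from the example)
fins = 300    # transactions with fins (assumed figure)
total = 1000  # total transactions (assumed figure)

confidence = both / mask        # confidence(mask => fins)
base_prob_fins = fins / total   # baseline P(fins) = 0.30
lift = confidence / base_prob_fins
print(round(lift, 2))  # 1.85: fins are more likely given a mask purchase
```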

Descriptive Data Analytics: (what it is used for and what questions it answers)

used to gain insight from historical data with reporting, scorecards, clustering, etc. Question: What happened?

What an association rule looks like, including antecedents and consequents

{x} ==> {y}, x is the antecedent (or premise), y is the consequent (or conclusion) Example: x= Bread, Y=Milk. If you already have bread, what is the likelihood that you also have milk? This is an association rule.
