Tutorial Questions

Explain the purpose of Market Basket Analysis

- discovers co-occurrence relationships among items in customer baskets - helps retailers understand customer purchasing behaviour which can help increase profit margin

List four data mining applications in businesses

- financial services (fraud detection), telecommunications (cross-sell opportunities), retail (loyalty programs), web services (intelligent search engines)

Explain the relationship between Market-Basket transactions and Association Rule.

- given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Explain the following distribution plots and give an example for each. • Box Plots

- indicates which observations, if any, are considered unusual or outliers. graphically depicts the 5 number summary: 1. smallest observation 2. lower quartile 3. median 4. upper quartile 5. largest observation

Explain the following basic plots. Line Graphs

- plots values of two variables against each other and connects data values with a line

Explain the following basic plots. Scatter Plots

- plots values of two variables against each other. displays relationship between two numerical values.

Describe the term Cross-selling using marketing examples.

- push new products to current customers based on their past purchases - e.g. buy a case from Amazon and it may suggest a travel carrying bag

Describe the main methods for normalising data.

- subtract the mean and divide by the standard deviation (z-score standardisation) - alternatively, scale to the range 0 to 1 by subtracting the minimum and dividing by the range (min-max scaling)
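A minimal Python sketch of the two methods above (z-score standardisation and min-max scaling); the function names and sample income values are illustrative only.

```python
import numpy as np

def z_score(x):
    # Standardise: subtract the mean and divide by the standard deviation.
    return (x - x.mean()) / x.std()

def min_max(x):
    # Scale to [0, 1]: subtract the minimum and divide by the range.
    return (x - x.min()) / (x.max() - x.min())

income = np.array([30_000, 55_000, 80_000, 120_000], dtype=float)
print(z_score(income))  # centred on 0 with unit standard deviation
print(min_max(income))  # all values fall between 0 and 1
```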

Discuss the term clustering and aspects of clustering in the context of data mining.

- unsupervised learning technique for finding similarity groups in data (all variables are treated equally as input variables). - maximise similarity between cases within a cluster - maximise dissimilarity (minimise similarity) between cases in different clusters

Explain how we can determine the distance between two cases.

- first, appropriately scale the numeric attributes so that no variable dominates the distance. - the distance between two individual cases is then typically measured with a metric such as Euclidean distance. - between clusters, you can use average linkage (average distance), complete linkage (maximum distance), single linkage (minimum distance), centroid distance (distance between centroids) or Ward's variance (ANOVA sum of squares between two clusters, added up over all input variables), as sketched below.
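As a hedged illustration, the pairwise case distances and the linkage rules listed above can be computed with SciPy; the toy data below is made up and assumed to be already scaled.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy cases with two input variables, assumed already scaled to [0, 1].
X = np.array([[0.10, 0.20],
              [0.15, 0.25],
              [0.80, 0.90],
              [0.85, 0.95]])

print(pdist(X))  # Euclidean distance between every pair of individual cases

# Hierarchical clustering under each between-cluster distance rule listed above.
for method in ("single", "complete", "average", "centroid", "ward"):
    print(method, linkage(X, method=method))
```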

List three basic algorithms for building decision trees

1. CART (Classification and Regression Trees) 2. C4.5 and C5.0 (or SEE5) 3. CHAID (Chi-squared Automatic Interaction Detection)

Briefly explain how the decision tree is built essentially

1. Choose the splitting rule at the root node 2. Consider splitting each leaf node (k) 3. Consider splitting each variable (j) at k 4. Find the j and k which give the overall best split in the current tree 5. Split the tree using the overall best split, provided it is significant

Using the example given in the lecture, allocate the following cases into suitable brand. Show the steps of your calculation. Case 1 - Income= $80,000 Car age=5 years Case 2 - Income= $100,000 Car age=6 years

Case 1: income = 0.4, car age ≈ 0.43 Case 2: income = 0.5, car age ≈ 0.29 Steps: income is scaled as income / $200,000; car age is scaled as 1 if the car is under 12 months old, otherwise (8 - age) / 7 (see the sketch below).
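A small sketch of that arithmetic, taking the $200,000 divisor and the (8 - age)/7 rule exactly as quoted from the lecture; the function names are illustrative.

```python
def scale_income(income):
    # Lecture scaling as quoted above: income divided by $200,000.
    return income / 200_000

def scale_car_age(age_years):
    # Lecture scaling as quoted above: 1 if the car is under 12 months old,
    # otherwise (8 - age) / 7.
    return 1.0 if age_years < 1 else (8 - age_years) / 7

for income, age in [(80_000, 5), (100_000, 6)]:
    print(scale_income(income), round(scale_car_age(age), 2))
# Case 1: 0.4 and 0.43   Case 2: 0.5 and 0.29
```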

In the root node, assume that the training data has 65% GOOD cases and 35% BAD cases. Calculate the Gini and Entropy impurity index.

Gini = 2(p1)(p2) = 2(0.65)(0.35) = 0.455 Entropy = -[p1 ln(p1) + p2 ln(p2)] / ln(2) = -[0.65 ln(0.65) + 0.35 ln(0.35)] / ln(2) = (0.280 + 0.367) / 0.693 ≈ 0.934 (worked through in the sketch below)
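A short check of both impurity indexes for p1 = 0.65, p2 = 0.35, using the standard formulas (natural logs divided by ln 2 give base-2 entropy).

```python
import math

p1, p2 = 0.65, 0.35

gini = 2 * p1 * p2                                               # 2(p1)(p2)
entropy = -(p1 * math.log(p1) + p2 * math.log(p2)) / math.log(2)

print(round(gini, 3))     # 0.455
print(round(entropy, 3))  # 0.934
```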

Explain how Average Square Error is used to adjust Neural Network model's weight parameters.

The weights are adjusted iteratively so as to reduce the average square error between the predicted and actual target values on the training data. The average square error therefore always decreases with the iterations of the algorithm for the training data. For the validation data, it decreases with the iterations at first, then starts to increase because the neural network model is being overfitted to the training data and is beginning to fit the errors in the training data.

Explain the term Score Data Set in data mining.

a new data set consisting of unclassified new data cases. the chosen fitted model is used to classify the new cases- process is called scoring the data.

Discuss advantages and disadvantages of association rules

advantages: simple computations, can be undirected, no hypothesis is necessary before analysis, different data forms can be analysed disadvantages: market basket analysis only identifies hypotheses, which need to be tested, measurement of impact is needed, it is difficult to identify product groupings, and complexity grows exponentially

Explain Neural Network model in the context of data mining

uses a network of connected nodes (neurons) arranged in layers that connect the input variables to the output; the connection weights are estimated from the training data.

Explain the goal of the Lift and how the chart could be used for business intelligence.

useful for assessing performance by identifying the most important class. gives explicit assessment of results over a large number of cutoffs. can use it to see how many loans you want to grant or how many tax records to examine.

Describe Euclidean distance.

the straight-line distance between two cases: the square root of the sum of squared differences between their attribute values. appropriate when the different measurements are proportionate and measured in the same units

Explain the term Data Transformation in data mining.

brings data into a natural scale. useful when dealing with skewed data. e.g. square root, reciprocal, logarithm, raising to a power.

Explain what should be involved in the Data Understanding phase

collecting, describing, exploring and verifying data quality. addresses what data resources are available and the characteristics of those resources

Consider all possible binary splits of variable x having s distinct values. What are the possible binary splits for a continuous variable x, an ordinal variable x, and a nominal variable x?

continuous: s-1 ordinal: s-1 nominal: 2^(s-1)-1
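A brute-force check of the nominal count, assuming s = 4 distinct values; the helper function is illustrative only.

```python
from itertools import combinations

def nominal_splits(values):
    # Every way to divide the distinct values into two non-empty groups,
    # counting each unordered pair of groups once: 2^(s-1) - 1 splits.
    values = list(values)
    splits = []
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if (right, left) not in splits:
                splits.append((left, right))
    return splits

s = 4
print(s - 1)                                          # continuous / ordinal: 3
print(len(nominal_splits("ABCD")), 2 ** (s - 1) - 1)  # nominal: 7 7
```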

List the steps of CRISP-DM

business understanding > data understanding > data preparation > modelling > evaluation > deployment

Divisive clustering

hierarchical clustering procedure where all objects start out in one giant cluster; clusters are formed by dividing this cluster into smaller and smaller clusters

agglomerative clustering

hierarchical clustering procedure where each object starts out in a separate cluster; clusters are formed by grouping objects into bigger and bigger clusters

Discuss the purpose Business Understanding in the context of data mining

identifying business objectives, assess the situation, determine the goals of the project, produce a plan

Explain the following variable selection method in Logistic Regression: Stepwise

like the forward method-- systematically adds effects that are significantly associated with the target. however, after an effect is added to the model, stepwise may remove an effect already in the model that is not significantly associated with the target.

Explain Pruning a Decision Tree in the context of data mining.

methods of selecting the tree size with low bias and variance to return good performance

Describe the main methods of handling missing data and noisy data.

missing: omission or imputation (replacing missing values with reasonable substitutes) noisy: binning, regression, clustering, combining computer and human inspection.
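A minimal pandas sketch of two of those methods, imputation and binning; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 38, None, 52],
                   "income": [30_000, 45_000, 100_000, 52_000, 48_000, 61_000]})

# Imputation: replace missing values with a reasonable substitute (here, the median).
df["age"] = df["age"].fillna(df["age"].median())

# Binning: smooth a noisy numeric field by grouping values into equal-frequency bins.
df["income_bin"] = pd.qcut(df["income"], q=3, labels=["low", "mid", "high"])

print(df)
```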

List and explain four main data measurement in data mining.

nominal - no order. gives names/labels to various categories ordinal - has order, but the interval between measurements is not meaningful. interval - meaningful intervals between measurements, but no true starting point ratio - highest level of measurement. ratios are meaningful because there is a true zero (starting) point.

Explain the term Data Partition in data mining.

partitions the data into 3 subgroups: 1. training data - used to build the model 2. validation data - refines the model (for decision tree and neural network models) 3. test data - used to assess the model performance (i.e. accuracy). the best model is the one that performs best on the test data set.
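One common way to produce the three subsets, sketched with scikit-learn's train_test_split; the 60/20/20 proportions and the placeholder data are assumptions, not a fixed rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for prepared inputs (X) and a target (y).
X = np.arange(200).reshape(100, 2)
y = np.random.randint(0, 2, size=100)

# Carve off the test set first, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25,
                                                      random_state=1)
# Roughly 60% training, 20% validation, 20% test.
```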

List the steps of the K-means Algorithm.

partitions the given data into k clusters: 1. randomly choose k data points to be the initial centroids (cluster centres) 2. assign each data point to the closest centroid 3. re-compute the centroids using the current cluster memberships 4. repeat steps 2-3 until the clusters stabilise (a minimal sketch follows).
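A minimal NumPy sketch of those four steps (it ignores edge cases such as a cluster becoming empty); the data is random and illustrative.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each data point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. re-compute the centroids using the current cluster memberships
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. repeat steps 2-3 until the clusters stabilise
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.rand(20, 2), np.random.rand(20, 2) + 2])
labels, centroids = k_means(X, k=2)
```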

Explain the following distribution plots and give an example for each. • Histograms

shows the distribution of a numerical variable by grouping its values into bins and plotting the frequency of each bin, e.g. the distribution of customer ages.

Explain how to evaluate the Cumulative Response rate.

the cumulative proportion of responses. at the last decile, it indicates the baseline percentage of responders.

Define and explain the Cumulative Lift chart.

the cumulative ratio of the percentage of captured responses within each decile to the baseline percentage response. the higher the value, the better the model.

Explain what the lower and higher cut-off value choice of the ROC chart represents.

the cut-off choice represents a trade-off between sensitivity and specificity: a lower cut-off gives high sensitivity but low specificity; a higher cut-off gives low sensitivity but high specificity. ideally we want both high sensitivity and high specificity.
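The trade-off can be read off a computed ROC curve, for example with scikit-learn's roc_curve; the scores below are made-up model outputs.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # actual classes
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])   # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for cutoff, sensitivity, specificity in zip(thresholds, tpr, 1 - fpr):
    print(f"cut-off {cutoff:.2f}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```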

Explain the Supervised Learning and its objective. Give an example area of application.

the data consists of a number of cases for each of which input variables and a target variable are recorded. the objective is to model the value of the target variable using the input variables, i.e. to learn a mapping function from the inputs to the output. example area of application: credit card fraud detection - accept or stop a transaction

Explain what should be considered when selecting data mining modelling techniques.

the data types available for mining, the data mining goals, and any specific modelling requirements

Explain Logistic Regression model in the context of data mining

the probability (p) that the target variable (y) takes the value 1 is modelled as a transformation of a linear combination of input variables.
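In symbols, p = 1 / (1 + exp(-(b0 + b1x1 + ... + bkxk))). A tiny sketch of that transformation with made-up coefficients, not fitted to any real data:

```python
import numpy as np

def logistic_probability(x, intercept, coefficients):
    # p(y = 1) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))
    linear = intercept + np.dot(coefficients, x)
    return 1.0 / (1.0 + np.exp(-linear))

# Illustrative coefficients and inputs only.
print(logistic_probability(x=np.array([0.4, 0.43]),
                           intercept=-1.0,
                           coefficients=np.array([2.0, 1.5])))
```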

Define and explain the Lift chart.

the ratio of percentage captured response within each decile to the baseline percentage response.
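A hedged pandas sketch of computing lift by decile from scored data; the scores and responses are simulated, and the decile construction shown is one common convention rather than the only one.

```python
import numpy as np
import pandas as pd

# Simulated scored data: a model probability and an actual response per case.
rng = np.random.default_rng(0)
scores = rng.random(1000)
responses = (rng.random(1000) < scores * 0.3).astype(int)

df = pd.DataFrame({"score": scores, "response": responses})
# Decile 1 holds the highest-scoring 10% of cases, decile 10 the lowest.
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       q=10, labels=range(1, 11))

baseline = df["response"].mean()
lift = df.groupby("decile", observed=True)["response"].mean() / baseline
print(lift)  # values above 1 in the top deciles mean the model beats random selection
```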

Use diagrams to illustrate the Supervised Learning Process.

training data > learning algorithm > model > test data > accuracy

List Data Preparation tasks

extracting & integrating data, reconciling inconsistent field values, identify missing/incorrect/extreme values, data selection, transforming relevant fields, splitting data into training and test data sets

Discuss advantages and disadvantages of clustering.

- Advantages: useful for initial investigation of large datasets, treats variables equally so there is no need to identify a target variable, allows all types of input variables, the procedure is easy to apply and understand, and useful clusters may be identified by the procedure. - Disadvantages: can be difficult to choose the right distance measure when there are different types of input variables, results can be sensitive to the choice of k, resulting clusters can be difficult to interpret and some may be meaningless if the wrong value of k is chosen.

Explain the following basic plots. Bar Charts

- bar chart shows the frequency (count) of each category of a categorical variable.

Data Mining

-The Automated Extraction of Hidden Predictive information from Large databases. (HELPA) -Process of exploration and analysis of large quantities of data in order to discover meaningful patterns and rules

List and briefly explain three classification techniques

1. regression: linear or any other polynomial 2. decision tree classifier: divides the decision space into piecewise constant regions 3. neural networks: partition by non-linear boundaries

Discuss the advantages of using Logistic Regression model in predictive modelling.

1. can be written explicitly and are easy to program 2. can easily deal with continuous variables (using their actual values) 3. can deal easily with simple interactions between input variables (using the interaction builder) 4. relatively stable

Discuss the disadvantages of using Logistic Regression model in predictive modelling.

1. can't deal with missing values in input variables. 2. sensitive to outliers in continuous input variables. 3. unable to model complicated interactions between input variables.

List three clustering business applications.

1. clustering bank customers based on personal variables and accounts they hold, banking activities and balances 2. clustering supermarket customers based on their personal variables and shopping habits 3. clustering clothing customers based on their physical measurements

List three main Predictive Models and Important Assumptions when apply predictive models.

1. decision tree models 2. logistic regression models 3. neural network models important assumptions: the past is a good predictor of the future; data is available; the data contains what we want it to predict

Discuss the disadvantages of using a Neural Network model in predictive modelling.

1. difficult to explain 2. difficult to interpret 3. can't deal with missing values in input variables 4. sensitive to outliers in continuous input variables 5. often over parameterised and the parameter estimates can therefore be unstable.

List the three types of data mining experts

1. domain expert (business and its problems) 2. data expert (data structures) 3. analytical expert (capabilities and limitations of the data mining techniques)

Discuss the advantages of using a Neural Network model in predictive modelling.

1. flexible 2. can give good predictions in cases where the relationship between the target variable and the input variables are complex 3. work better than regression if relationships are nonlinear

What are the foundations of Data Mining? (3)

1. increasing computing power 2. improved data collection 3. statistical and machine learning algorithms

Describe the process phases of CRISP-DM

Business Understanding > Data Understanding > Data Preparation > Modelling > Evaluation > Deployment

Explain why data need to be cleaned before mining.

It may be incorrect, noisy, inconsistent, incomplete and may have disguised missing data.

Describe the process phases of SEMMA

Sample > Explore > Modify > Model > Assess

Describe the difference between Supervised Learning and Unsupervised Learning.

Supervised learning (predict) discovers patterns in the data that relate input variables with a target variable and unsupervised learning (explore) has no target variable.

Business Intelligence (5)

a set of theories, methodologies, processes, architectures and technologies that transform raw data into meaningful and useful information for business purposes.

Explain why we need standardise data before clustering with K-means algorithm.

because raw distance measurements are highly influenced by the scale of measurement, so all variables need to be numerical and scaled to the range [0, 1]

Explain the following variable selection method in Logistic Regression: Backward

begins with all candidate effects in the model and then systematically removes effects that are not significantly associated with the target, until all effects remaining in the model meet the "stay significance level" or until the "stop criterion" is met.

Explain the following variable selection method in Logistic Regression: Forward

begins with no candidate effects in the model and then systematically adds effects that are significantly associated with the target until none of the remaining effects meet the "entry significance level" or until the "stop criterion" is met.

Discuss the Evaluation and its tasks.

evaluate how the data mining results can help you achieve your objectives - meets business objectives - any issues not considered? - model makes sense - model is actionable

