Chapter 4 BI

Define Gini index. What does it measure?

A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable.
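
As a rough illustration of the purity idea above, the sketch below assumes the common decision-tree formulation of Gini impurity, 1 minus the sum of squared class proportions, and scores a candidate split by the size-weighted impurity of the groups it produces. The function names and the toy labels are hypothetical, chosen only for this example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(groups):
    """Size-weighted average Gini impurity of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_impurity(g) for g in groups)

# Hypothetical toy data: four records, two per class.
parent = ["yes", "yes", "no", "no"]
pure_split = [["yes", "yes"], ["no", "no"]]    # branching separates the classes
mixed_split = [["yes", "no"], ["yes", "no"]]   # branching separates nothing

print(gini_impurity(parent))    # impurity before branching
print(split_gini(pure_split))   # lower value means purer branches
print(split_gini(mixed_split))
```

A lower weighted value after branching means the attribute produces purer groups, which is why a decision tree would prefer the first split over the second.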

What is the major difference between cluster analysis and classification?

Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained they are able to classify future cases. Clustering partitions pattern records into natural segments or clusters.
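
The contrast can be sketched in a few lines, assuming one-dimensional numeric data. The classifier below (a nearest-centroid rule, used here only as a simple stand-in for any classification method) needs the class labels to train, while the clustering routine (a basic k-means with a naive initialization) sees only the values. All names and data are hypothetical.

```python
def nearest_centroid_fit(points, labels):
    """Supervised: use the known labels to compute one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def nearest_centroid_predict(centroids, x):
    """Assign a new case to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def kmeans_1d(points, k, iterations=10):
    """Unsupervised: group points into k clusters without any labels."""
    centers = list(points[:k])  # naive initialization: the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [min(range(k), key=lambda j: abs(centers[j] - x))
                      for x in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return assignment, centers

# Hypothetical toy data.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
known_labels = ["low", "low", "low", "high", "high", "high"]

# Classification: train on labeled examples, then classify a future case.
centroids = nearest_centroid_fit(values, known_labels)
print(nearest_centroid_predict(centroids, 4.5))

# Clustering: discover segments from the values alone.
assignment, centers = kmeans_1d(values, k=2)
print(assignment)
```

The clusters carry only arbitrary numeric ids; giving them a meaning such as "low" or "high" is left to the analyst, whereas the classifier returns the labels it was trained on.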

What are the most popular commercial data mining tools?

Examples include IBM (SPSS), SAS, and StatSoft. Most commercial tools are developed by the largest statistical software companies.

What are the major data mining processes?

Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.

Define data mining.

The process through which previously unknown patterns in data are discovered; that is, discovering knowledge from large amounts of data.

What are the privacy issues in data mining?

Data that is collected, stored, and analyzed in data mining often contains information about real people, including identification, demographic, financial, personal, and behavioral information. Much of this data can be analyzed by third-party data providers. To maintain privacy and protect individuals' rights, data mining professionals have ethical and legal obligations.

What are the most popular free data mining tools? Why are they gaining overwhelming popularity?

Weka is the most popular free data mining tool; others include RapidMiner and Microsoft's SQL Server. They are popular for their intuitive user interfaces, their fairly large number of available algorithms, and their variety of visualization features.

What are three common data mining mistakes/blunders?

Selecting the wrong problem for data mining.
Ignoring what your sponsor thinks data mining is and what it really can or cannot do.
Beginning without an end in mind.
Leaving insufficient time for data preparation.
Looking only at aggregated results and not at individual records.

List and briefly define the phases in the CRISP-DM process.

CRISP-DM provides a systematic and orderly way to conduct data mining projects. The process has six steps. First, an understanding of the data and an understanding of the business issues to be addressed are developed concurrently. Next, the data are prepared for modeling and modeled, the model results are evaluated, and the models are deployed for regular use. The six steps are: business understanding, data understanding, data preparation, model building, testing and evaluating the model, and deployment.

How does CRISP-DM differ from SEMMA?

CRISP-DM: A cross-industry standardized process for conducting data mining projects. It is a sequence of six steps that starts with a good understanding of the business and the need for the data mining project and ends with the deployment of a solution that satisfies the specific business need.
SEMMA: An alternative process for data mining projects proposed by the SAS Institute. The acronym stands for "Sample, Explore, Modify, Model, and Assess."
Difference: CRISP-DM is the more comprehensive approach, broadly covering the business and all relevant data. SEMMA deals only with the data mining goals and objectives.

What are three common myths about data mining?

Data mining provides instant, crystal-ball-like predictions.
Data mining is not yet viable for business applications.
Data mining requires a separate, dedicated database.
Only those with advanced degrees can do data mining.
Data mining is only for large firms that have lots of customer data.

What are some major data mining methods and algorithms?

Data mining tasks can be classified into three main categories: prediction, association, and segmentation (clustering). Based on the way in which the patterns are extracted from the historical data, the learning algorithms of data mining methods can be classified as either supervised or unsupervised. With supervised learning algorithms, the training data includes both the descriptive attributes (i.e., independent variables or decision variables) and the class attribute (i.e., output variable or result variable). In contrast, with unsupervised learning the training data includes only the descriptive attributes.
Algorithms by task:
Prediction: decision trees, neural networks, linear/nonlinear regression, exponential smoothing.
Association: OneR, Expectation Maximization, Graph-Based Matching.
Segmentation: k-means, Expectation Maximization.

List and briefly define at least two classification techniques.

Decision tree analysis: A machine-learning technique that is arguably the most popular classification technique in the data mining arena.
Statistical analysis: Statistical classification techniques include logistic regression and discriminant analysis, both of which assume that the relationships between the input and output variables are linear in nature, that the data is normally distributed, and that the variables are not correlated and are independent of each other.
Case-based reasoning: This approach uses historical cases to recognize commonalities in order to assign a new case to the most probable category.
Bayesian classifiers: This approach uses probability theory to build classification models, based on past occurrences, that are capable of placing a new instance into the most probable class (or category).
Genetic algorithms: These use the analogy of natural evolution to build directed search-based mechanisms to classify data samples.
Rough sets: This method takes into account the partial membership of class labels in predefined categories when building models (collections of rules) for classification problems.
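
To make the Bayesian classifier entry more concrete, here is a minimal sketch of a naive Bayes classifier for categorical attributes. It assumes the attributes are independent given the class and applies add-one (Laplace) smoothing; neither choice comes from the text above. The function names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    distinct_values = defaultdict(set)   # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            distinct_values[i].add(value)
    return class_counts, value_counts, distinct_values

def predict_naive_bayes(model, row):
    """Return the class with the highest prior-times-likelihood score."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class) estimated from past occurrences
        for i, value in enumerate(row):
            # Add-one smoothed estimate of P(value | class).
            seen = value_counts[(label, i)][value]
            score *= (seen + 1) / (n + len(distinct_values[i]))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical past occurrences: (outlook, temperature) -> class.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("sunny", "hot")))
print(predict_naive_bayes(model, ("rainy", "cool")))
```

The smoothing keeps an attribute value that never appeared with a class from forcing that class's score to zero.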

What are the key differences among the major data mining tasks?

Prediction: the act of telling about the future. It differs from simple guessing by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. A term commonly associated with prediction is forecasting. Although many believe the two terms are synonymous, there is a subtle but critical difference between them: whereas prediction is largely experience and opinion based, forecasting is data and model based. That is, in order of increasing reliability, one might list the relevant terms as guessing, predicting, and forecasting. In data mining terminology, prediction and forecasting are used synonymously, and the term prediction is used as the common name for the act. Prediction is supervised learning, while the other two tasks (association and segmentation) are unsupervised.

