ISM3451_Exam1_TermsChapter1-2

Profiling

AKA: Behavior description. Attempts to characterize the typical behavior of an individual, group, or population. Behavior can be described generally over an entire population, or down to the level of small groups or even individuals. Profiling is used to establish behavioral norms for anomaly detection applications such as fraud detection. Example: "What is the typical cell phone usage of this customer segment?" Profiling cell phone usage might require a complex description of night and weekend airtime averages, international usage, roaming charges, etc.
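
As a rough illustration of using a profile as a behavioral norm for anomaly detection, the sketch below summarizes usage as a mean and standard deviation and flags values far from that norm. The function names, the sample data, and the three-standard-deviation threshold are all assumptions made for this example; a real profile would likely be a far richer description.

```python
from statistics import mean, stdev

def build_profile(usage_minutes):
    """Summarize typical behavior as a mean and a standard deviation."""
    return {"mean": mean(usage_minutes), "stdev": stdev(usage_minutes)}

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the mean."""
    return abs(value - profile["mean"]) > threshold * profile["stdev"]

# Hypothetical weekend airtime (minutes) for one customer segment.
weekend_minutes = [42, 38, 45, 40, 41, 39, 44]
profile = build_profile(weekend_minutes)

print(is_anomalous(43, profile))   # close to the norm
print(is_anomalous(300, profile))  # far from the norm
```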

Strategic Asset

Assets that an entity needs in order to maintain its ability to achieve future outcomes. Without such assets, the future well-being of the company could be in jeopardy. Certain competitive advantages are based not on the distinctive capabilities of firms but on their dominance or market position. Formally: an asset or group of assets that the entity needs to retain if it is to maintain its capacity to achieve or promote any outcome that it determines to be important to its current or future well-being.

Classification

Attempts to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. A data mining procedure produces a model that, given a new individual, determines which class that individual belongs to. -Example: "Among all the customers of MegaTelCo, which are likely to respond to a given offer?" The two classes could be called "will respond" and "will not respond".
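
One possible way to picture a classification model is a k-nearest-neighbors vote, sketched below. This is only one of many classification techniques and is not named in these notes; the two numeric features, the training rows, and the function name `knn_classify` are hypothetical.

```python
from collections import Counter
from math import dist

def knn_classify(training, new_point, k=3):
    """Predict a class by majority vote among the k closest labeled examples."""
    neighbors = sorted(training, key=lambda row: dist(row[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled customers: (feature vector, known class).
training = [
    ((1, 2), "will not respond"),
    ((2, 1), "will not respond"),
    ((1, 1), "will not respond"),
    ((8, 9), "will respond"),
    ((9, 8), "will respond"),
    ((9, 9), "will respond"),
]

print(knn_classify(training, (8, 8)))
print(knn_classify(training, (2, 2)))
```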

Clustering

Attempts to group individuals in a population together by their similarity, but not driven by any specific purpose. It is useful in preliminary domain exploration to see which natural groups exist, because these groups in turn may suggest other data mining tasks or approaches. -Example: "Do our customers form natural groups or segments?" Clustering is also used as input to decision-making processes focusing on questions such as: "What products should we offer or develop? How should our customer care teams (or sales teams) be structured?"
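
To make "natural groups" concrete, here is a minimal k-means sketch. K-means is one common clustering algorithm, chosen here as an assumption since the notes do not name a method; the points and the starting centroids are made up, and the starting centroids are fixed so the run is repeatable.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical customers described by two numeric features.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Note that no target variable appears anywhere: the groups come only from similarity between the points.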

Causal Modeling

Attempts to help us understand what events or actions actually influence others. Techniques for causal modeling include those involving a substantial investment in data, such as randomized controlled experiments, as well as sophisticated methods for drawing causal conclusions from observational data. Both methods can be viewed as "counterfactual" analysis. The data scientist should always include the assumptions that must be made for the causal conclusion to hold.
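
For a randomized controlled experiment, the simplest estimate of an effect is the difference in average outcomes between the treated and untreated groups. The sketch below assumes individuals were assigned to groups at random, which is exactly the kind of assumption that should be stated alongside a causal conclusion; the outcome data and the function name are hypothetical.

```python
from statistics import mean

def estimated_effect(treated_outcomes, control_outcomes):
    """Difference in mean outcome between treated and control groups.
    Only a fair causal estimate if group assignment was random."""
    return mean(treated_outcomes) - mean(control_outcomes)

# Hypothetical outcomes: 1 = purchased, 0 = did not purchase.
saw_ad = [1, 1, 0, 1]
no_ad = [0, 1, 0, 0]

effect = estimated_effect(saw_ad, no_ad)
print(effect)
```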

Similarity Matching

Attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. Similarity matching is the basis for one of the most popular methods for making product recommendations. Similarity measures underlie certain solutions to other data mining tasks, such as classification, regression, and clustering. -Example: IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities. They use similarity matching based on "firmographic" data describing characteristics of the companies.
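
A minimal sketch of similarity matching, assuming each company is described by a small numeric vector and that cosine similarity is the measure used. Both are assumptions made for illustration; the company names and values are invented and are not the data IBM uses.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def most_similar(target, candidates):
    """Return the name of the candidate whose vector is closest to the target."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))

# Hypothetical "firmographic" vectors, already scaled to comparable ranges.
best_customer = (0.9, 0.8, 0.1)
prospects = {
    "Acme": (0.8, 0.9, 0.2),
    "Globex": (0.1, 0.2, 0.9),
}

print(most_similar(best_customer, prospects))
```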

Link Prediction

Attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Common in social networking systems. -Example: "Since you and Karen share 10 friends, maybe you'd like to be Karen's friend?" Link Prediction can also estimate the strength of a link and form the basis for recommendations.
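
The shared-friends example can be sketched as a count of common neighbors in a friendship graph, where the count doubles as a crude estimate of link strength. The graph, the names, and the `min_shared` cutoff are hypothetical, and real systems typically use richer signals.

```python
def suggest_links(graph, person, min_shared=2):
    """Suggest people who are not yet linked to `person` but share at least
    `min_shared` friends. Returns {candidate: number of shared friends}."""
    suggestions = {}
    for other in graph:
        if other == person or other in graph[person]:
            continue
        shared = len(graph[person] & graph[other])
        if shared >= min_shared:
            suggestions[other] = shared
    return suggestions

# Hypothetical friendship graph: each person maps to a set of friends.
graph = {
    "you": {"ann", "bob", "cat"},
    "karen": {"ann", "bob", "dan"},
    "ann": {"you", "karen"},
    "bob": {"you", "karen"},
    "cat": {"you"},
    "dan": {"karen"},
}

print(suggest_links(graph, "you"))
```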

Data Reduction

Attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. The smaller dataset may be easier to deal with or process and may better reveal the information. Data reduction usually involves loss of information; what matters is the trade-off for improved insight.

Data Mining Process

Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The process itself is iterative and not linear. The Data Mining Process is based on the Cross Industry Standard Process for Data Mining (CRISP-DM).

Data Science

Involves principles, processes, and techniques for understanding phenomena via (automated) analysis of data. The ultimate goal of data science is to improve decision-making.

Decomposing Data

(1) A critical skill in data science is the ability to decompose a data-analytics problem into pieces such that each piece matches a known task for which tools are available. This skill ties back to the two types of decisions. (2) In collaboration with business stakeholders, data scientists decompose a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks.

Difference Between Data Mining and Data Science

(1) Data Science: a set of fundamental principles that guide the extraction of knowledge from data. As a term, Data Science is applied more broadly than the traditional use of "Data Mining". (2) Data Mining: the extraction of knowledge from data, via technologies that incorporate these principles. (3) For the sake of this course, we will use "data mining" and "data science" as synonyms.

Difference between Data Science and Data-Driven Business

(1) Data science needs access to data, and it often benefits from sophisticated data engineering that data processing technologies may facilitate. (2) Data processing technologies are very important for many data-oriented business tasks that do not involve extracting knowledge for data-driven decision-making.

Types of Decisions in the Book

(1) Decisions for which "discoveries" need to be made within data. (2) Decisions that repeat, especially at massive scale, and so decision-making can benefit from even small increases in decision-making accuracy.

Scoring or Class Probability Estimation

A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class. In our customer response scenario, a scoring model would be able to evaluate each individual customer and produce a score of how likely each is to respond to the offer. Classification and scoring are very closely related; a model that can do one can usually be modified to do the other.
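
To show how scoring relates to classification, the sketch below returns the fraction of an individual's nearest neighbors that responded, rather than a single class label. The nearest-neighbor approach, the data, and the function name `response_score` are assumptions for illustration only.

```python
from math import dist

def response_score(training, new_point, k=3, positive="will respond"):
    """Estimate the likelihood of the positive class as the share of the
    k nearest labeled examples that belong to it."""
    neighbors = sorted(training, key=lambda row: dist(row[0], new_point))[:k]
    return sum(1 for _, label in neighbors if label == positive) / k

# Hypothetical labeled customers: (feature vector, known class).
training = [
    ((1, 1), "will not respond"),
    ((2, 2), "will not respond"),
    ((5, 5), "will respond"),
    ((6, 6), "will not respond"),
    ((8, 8), "will respond"),
    ((9, 9), "will respond"),
]

high = response_score(training, (9, 8))
low = response_score(training, (1, 2))
print(high, low)
```

Applying a cutoff to the score (for example, treating anything above 0.5 as "will respond") turns the scoring model back into a classifier.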

Co-Occurrence Grouping

AKA: Frequent Itemset Mining. Attempts to find associations between entities based on transactions involving them. -Example: "What items are commonly purchased together?" Considers similarity of objects based on their appearing together in transactions. -Example: Analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce more frequently than expected. This could suggest a special promotion, product display, or combination offer. Some recommendation systems also perform a type of affinity grouping by finding, for example, pairs of books that are purchased frequently by the same people. *RESULT: A description of items that occur together. Usually includes statistics on the frequency of the co-occurrence and an estimate of how surprising it is.*
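
One common way to express "how surprising" a co-occurrence is, assumed here for illustration, is lift: how often a pair appears together divided by how often it would appear together if the two items were bought independently. A lift above 1 means the pair occurs more often than expected. The transactions below are invented.

```python
from collections import Counter
from itertools import combinations

def pair_lift(transactions):
    """Return {(item_a, item_b): lift} for every pair seen together."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    return {
        pair: (count / n) / ((item_counts[pair[0]] / n) * (item_counts[pair[1]] / n))
        for pair, count in pair_counts.items()
    }

# Hypothetical supermarket baskets.
transactions = [
    {"ground meat", "hot sauce"},
    {"ground meat", "hot sauce", "bread"},
    {"bread", "milk"},
    {"milk"},
]

lift = pair_lift(transactions)
print(lift[("ground meat", "hot sauce")])
```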

Regression

AKA: Value Estimation. Attempts to estimate or predict, for each individual, the numerical value of some variable for that individual. A regression procedure produces a model that, given an individual, estimates the value of a particular variable specific to that individual. -Example: "How much will a given customer use the service?" The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage.
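
As a small worked case, the sketch below fits a straight line by ordinary least squares and uses it to estimate usage for a new individual. A single-feature linear model is an assumption made for brevity; the feature (number of lines on the account), the usage figures, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Estimate the target value for a new individual."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical history: lines on the account vs. monthly minutes used.
lines_on_account = [1, 2, 3, 4]
monthly_minutes = [110, 210, 310, 410]

model = fit_line(lines_on_account, monthly_minutes)
print(predict(model, 5))
```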

Supervised Core Data Mining Tasks

Classification, regression, and causal modeling.

Core Data Mining Tasks

Classification, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, data reduction, and causal modeling.

Unsupervised Core Data Mining Tasks

Clustering, co-occurrence grouping, and profiling.

Market-Basket Analysis

Analyzing the co-occurrence of products in purchases is a common type of co-occurrence grouping known as market-basket analysis.

Unsupervised Method

Unsupervised methods are considered open-ended and do not have a specific target variable. The net result is a grouping/clustering of data points.

Data Preparation

Data preparation is a phase that often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results. Typical examples of data preparation are converting data to tabular format, removing or inferring missing values, and converting data to different types. Some data mining techniques are designed for symbolic and categorical data, while others handle only numeric values. Data scientists may spend considerable time early in the process defining variables used later in the process. This is one of the main points at which human creativity, common sense, and business knowledge come into play. Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables.
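
Two of the steps mentioned above, inferring missing values and converting data to a different type, might look like the following. Filling gaps with the mean and one-hot encoding categories are just two common choices assumed for this sketch; the column contents are made up.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def one_hot(values):
    """Convert categorical values to 0/1 columns, one per category
    (categories in sorted order), for techniques that need numeric input."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical raw columns from a customer table.
monthly_minutes = [10, None, 20]
plan_type = ["prepaid", "contract", "prepaid"]

print(impute_mean(monthly_minutes))
print(one_hot(plan_type))
```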

Big Data

Essentially means data sets that are too large for traditional data processing systems, and therefore require new processing technologies.

Exploiting Data

Exploiting data, both new and existing, can be a competitive advantage. Companies that believe this view data analysis as critical to their business strategy and are developing and investing in data scientists.

Data Mining and Its Results

There is a framework to collect the data, mine the data, model the data, and present the results. We "mine" the data to find patterns and this constitutes the BUILD models component. We use the RESULTS of the data mining tasks to influence and inform the process (ex: storytelling with data).

Supervised Method

Supervised methods have a specific target variable; segmentation is done for a specific reason. Technically, another condition must be met: there must be data on the target.

Counterfactual Analysis

Attempts to understand the difference between two situations, which cannot both happen: one where the "treatment" event (ex: showing an advertisement to a particular individual) happens, and one where it does not.

Other Analytics Techniques and Technologies

Include statistics, database querying, data warehousing, regression analysis, and machine learning/data mining.

Data Mining Process versus Software Development Cycle

It is tempting - but usually a mistake - to view the data mining process as a software development cycle. The CRISP cycle is based around exploration; it iterates on APPROACHES and STRATEGY rather than on software designs.

Data-Driven Decision Making

Refers to the practice of basing decisions on the analysis of data, rather than purely on intuition.

Core Data Mining Tasks that Can Be Either Supervised or Unsupervised

Similarity matching, link prediction, and data reduction.

Regression Analysis in Data Science

Some of the same methods we discuss in this book are at the core of a different set of analytic methods, which are often collected under the rubric of regression analysis and are widely applied in statistics and in other fields founded on econometric analysis. Here we are less interested in explaining a particular dataset than in extracting patterns that will generalize to other data, for the purpose of improving some business process. Typically this involves estimating or predicting values for cases that are not in the analyzed data set, which makes it useful for testing patterns on NEW data to evaluate their generality. Second, regression analysis supports techniques for reducing the tendency to find patterns specific to a particular set of data, and it introduces different views through dimensions, positioning the models to predict data with a focus on reducing uncertainty.

Business Understanding and Data Understanding

One of the most important fundamental principles of data science is that the design team must think carefully about the problem to be solved. This stresses the importance of correctly identifying and confirming the problem statement in both individual and team problem-solving models.

Modeling Stage

The output of modeling is some sort of model or pattern capturing regularities in the data. The modeling stage is the primary place where data mining techniques are applied to the data. This is the part of the craft where the most science and technology can be brought to bear.

Goals of the Course

The primary goals of the course are to (1) help you view business problems from a data perspective and (2) understand principles of extracting useful knowledge from data.

