INFO 320 Midterm

Data cleaning/scrubbing

-Step 2 of data preprocessing -Impute data (fill missing values with the most probable value) -Reduce noise (smooth out outliers) -Eliminate duplicates
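
The three cleaning tasks above can be sketched in a few lines. This is only an illustration: the sample rows, the use of the median as the "most probable value", and the 0 to 100 clipping bounds are all assumptions made for the example, not part of the course material.

```python
from statistics import median

def clean(rows):
    """Deduplicate, impute, and clip a list of (name, age) tuples."""
    # 1. Eliminate duplicates while preserving the original order.
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    # 2. Impute: replace missing ages (None) with the median of known ages.
    #    The median is an assumed stand-in for "most probable value".
    known = [age for _, age in unique if age is not None]
    fill = median(known)
    imputed = [(name, fill if age is None else age) for name, age in unique]
    # 3. Reduce noise: clip ages to a plausible range (hypothetical bounds).
    return [(name, min(max(age, 0), 100)) for name, age in imputed]

rows = [("ann", 34), ("bob", None), ("ann", 34), ("cy", 290), ("di", 40)]
cleaned = clean(rows)
print(cleaned)
```

Clipping is one of several ways to tame outliers; binning or smoothing by averages would fit the same step.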

data transformation

-Step 3 of data preprocessing -Process of changing the data from their original form to a format suitable for performing a data analysis. -Normalize data -Discretization (converting numeric values to categorical)
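
A minimal sketch of the two transformations named above, assuming min-max scaling as the normalization method and hand-picked cut points for discretization. The income figures, cut points, and labels are hypothetical.

```python
def min_max(values):
    """Rescale numbers to the 0..1 range (one common way to normalize)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, cut_points, labels):
    """Map a number to a category; labels has one more entry than cut_points."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

incomes = [20_000, 50_000, 80_000, 120_000]
scaled = min_max(incomes)
bands = [discretize(v, [40_000, 100_000], ["low", "mid", "high"]) for v in incomes]
print(scaled)
print(bands)
```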

Why do we need a standardized data mining process? What are the most commonly used data mining processes?

-A data mining project must follow a systematic project management process to be successful. Several data mining processes have been proposed: CRISP-DM, SEMMA, and KDD.

List and briefly define the central tendency measures of descriptive statistics.

-A measure of central tendency is a single numerical value that aims to describe a set of data by simply identifying or estimating the central position within the data. -Mean -Median -Mode
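
As a quick illustration, Python's standard `statistics` module computes all three measures. The exam scores below are made up for the example.

```python
from statistics import mean, median, mode

scores = [70, 85, 85, 90, 100]  # hypothetical exam scores

average = mean(scores)    # arithmetic average
middle = median(scores)   # middle value of the sorted data
most_common = mode(scores)  # most frequent value

print(average, middle, most_common)
```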

Why has information visualization become a centerpiece in business intelligence and analytics?

-A picture is worth a thousand words. Hence, information visualization is key to quickly providing an understanding of complex information. -In an increasingly complex information age, visualization is even more important.

What is a business report (enterprise reporting)? Why is it needed?

-A written document that contains information regarding business matters. -It is an essential part of the larger drive toward improved managerial decision making and organizational knowledge management. -The foundation of these reports is various sources of data coming from both inside and outside the organization. -Creation of these reports involves ETL (extract, transform, and load) procedures in coordination with a data warehouse and then using one or more reporting tools. -Although reports can be distributed in print form or via e-mail, they are typically accessed via a corporate intranet. -The key characteristics of a good business report include clarity, brevity, completeness, and correctness.

Information Visualization

-Answers "what happened" and "what is happening" and is closely associated with business intelligence (routine reports, scorecards, and dashboards).

Regression

-Attempts to describe the dependence of a response variable on one (or more) explanatory variables where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. -Concerned with the relationships between all explanatory variables and the response variable.

Discuss the differences between the two most commonly used data mining processes, CRISP-DM and SEMMA.

-CRISP-DM takes a more comprehensive approach—including understanding of the business and the relevant data -SEMMA implicitly assumes that the data mining project's goals and objectives along with the appropriate data sources have been identified and understood.

data warehouse

-Can be defined as "a pool of data produced to support decision making." -"A subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process." -However, it is hard, if not impossible, to conceive of a data warehouse that would not be subject-oriented, integrated, etc.

What is the main difference between classification and clustering? Explain using concrete examples.

-Classification learns patterns from past data in order to place new instances (with unknown labels) into their respective groups or classes. The objective of classification is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior. Classifying customers as likely to buy or not buy is an example. -Cluster analysis is an exploratory data analysis tool for solving classification problems when the groups are not known in advance. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Grouping customers according to demographics is an example.
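
To make the "learn from labeled history, then label a new case" idea concrete, here is a tiny one-nearest-neighbor classifier. It is only one of many classification techniques, chosen for brevity; the visit counts and labels are hypothetical.

```python
def nearest_neighbor(history, x):
    """Return the label of the past case whose feature is closest to x."""
    return min(history, key=lambda pair: abs(pair[0] - x))[1]

# Labeled past data: (store visits per month, outcome). Hypothetical values.
history = [(1, "no buy"), (2, "no buy"), (8, "buy"), (9, "buy")]

frequent = nearest_neighbor(history, 7)
rare = nearest_neighbor(history, 3)
print(frequent, rare)
```

Clustering would start from the visit counts alone, with no "buy"/"no buy" labels, and would have to discover the two groups itself.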

What are the commonalities and differences between regression and correlation?

-Correlation makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate on the degree of association between the variables. -On the other hand, regression attempts to describe the dependence of a response variable on one (or more) explanatory variables where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect
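
One way to see the difference is that correlation is symmetric while regression is not. The sketch below uses the standard Pearson coefficient and the least-squares slope; the ad-spend and sales figures are hypothetical.

```python
from math import sqrt

def _sums(x, y):
    """Return the centered cross-product and squared sums for x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy

def pearson_r(x, y):
    """Degree of association; the same value whichever variable comes first."""
    sxy, sxx, syy = _sums(x, y)
    return sxy / sqrt(sxx * syy)

def slope(explanatory, response):
    """Least-squares slope for predicting response from explanatory."""
    sxy, sxx, _ = _sums(explanatory, response)
    return sxy / sxx

ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

print(pearson_r(ad_spend, sales), pearson_r(sales, ad_spend))  # identical
print(slope(ad_spend, sales), slope(sales, ad_spend))          # differ
```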

Why do we need data preprocessing?

-Data preprocessing is essential to any successful data mining study. Good data leads to good information; good information leads to good decisions.

What are the privacy issues with data mining? Do you think they are substantiated?

-Data that is collected, stored, and analyzed in data mining often contains information about real people. -includes identification, demographic, financial, personal, and behavioral information. -can be accessed through some third-party data providers.

What are the main differences between descriptive and inferential statistics?

-Descriptive statistics is all about describing the sample data on hand. -Inferential statistics is about drawing inferences or conclusions about the characteristics of the population.

Association

-Establishing relationships among items that occur together -used to discover two or more items (or events or concepts) that go together.
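
Association rules are commonly scored with support and confidence. Those two measures are not named in the card above, so treat them as an assumption of this sketch; the baskets are hypothetical.

```python
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def count(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)

def support(items):
    """Share of all baskets that contain the item set."""
    return count(items) / len(baskets)

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```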

Why should storytelling be a part of your reporting and data visualization?

-Everyone who has data to analyze has stories to tell. -Stories bring life to data and facts. -They can help you make sense and order out of a disparate collection of facts. -They make it easier to remember key points and can paint a vivid picture of what the future can look like. -Creates interactivity

Clustering

-Finding groups of entities with similar characteristics. -partitions pattern records into natural segments or clusters. Each segment's members share similar characteristics.
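
A minimal sketch of one clustering technique, k-means, on one-dimensional data. The choice of k-means, the starting centers, and the customer ages are assumptions made for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

ages = [22, 25, 27, 58, 61, 65]  # hypothetical customer ages
centers, clusters = kmeans_1d(ages, centers=[20, 70])
print(centers)
print(clusters)
```

No labels are supplied; the two segments emerge from the similarity of the values alone.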

List and describe the main steps to follow in developing a linear regression model.

-First, perform a quick assessment of the data through the use of a scatter plot and/or correlations. -Next, perform model fitting by transforming the data into a more usable format and estimating any needed parameters. -Third, assess the model by testing its assumptions and evaluating its fit. -Finally, if these steps show that regression is warranted, deploy the regression model.
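
The fitting and assessment steps can be sketched for the simplest case, one explanatory variable fitted by ordinary least squares and judged by R-squared. The data values and variable names are hypothetical, and a real assessment would also check the regression assumptions.

```python
def fit_line(x, y):
    """Estimate slope and intercept by ordinary least squares."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(x, y, slope, intercept):
    """Share of the variation in y explained by the fitted line."""
    my = sum(y) / len(y)
    sse = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

hours = [1, 2, 3, 4, 5]   # hypothetical explanatory variable
score = [2, 4, 5, 4, 5]   # hypothetical response variable

b1, b0 = fit_line(hours, score)
fit = r_squared(hours, score, b1, b0)
prediction = b1 * 6 + b0  # deployment: predict for a new case
print(b1, b0, fit, prediction)
```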

A data mart can replace a data warehouse or complement it. Compare and discuss these options.

-For a data mart (DM) to replace a data warehouse (DW), it must make the DW unnecessary. This would mean that all the analyses for which the DW would be used can instead be satisfied by one or more DMs. -A DM can be less expensive in terms of development and computer resources. -In other situations, a data mart can be used for some, but not all, of the analyses that would otherwise use the DW. For those analyses, the smaller DM is more efficient, quite possibly enough so to justify the cost of having a DM in addition to a DW. Here the DM complements the DW.

What is an information dashboard? Why are they so popular for BI software tools?

-Information dashboards provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily drilled in and further explored. -common components of most, if not all, performance management systems, performance measurement systems, BPM software suites, and BI platforms.

Classification Data Mining (Prediction)

Analyzing the historical behavior of groups of entities with similar characteristics, to predict the future behavior of a new entity from its similarity to those groups.

Describe the cyclic process of management and comment on the role of business reports.

-Involves these steps: data acquisition leads to information generation which leads to decision making which leads to business process management. Perhaps the most critical task in this cyclic process is the reporting (i.e., information generation)—converting data from different sources into actionable information.

What are the most common metrics that make for analytics-ready data?

-It must be relevant to the problem at hand -Meet the quality/quantity requirements. -Have a certain data structure in place with key fields/variables with properly normalized values -Conform to organizational definitions.

What are the main types of charts/graphs? Why

-Line graphs are good for time-series data. -Bar charts are good for depicting nominal or numerical data that can be easily categorized. -Pie charts should be used for depicting proportions. -Scatter plots and bubble charts are good for illustrating relationships between two or three variables (bubble charts add a dimension via the size of the dot). -Histograms are like bar charts, except they depict frequency distributions. -Gantt charts and PERT charts are good at illustrating project timelines and task dependencies. -Geographic maps show geographic information. -Bullet graphs show progress toward a goal. -Heat maps and highlight tables illustrate the comparison of continuous values across two categories using color. -Tree maps are good for showing hierarchical information.

What are the commonalities and differences between linear regression and logistic regression?

-Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. -It differs from linear regression with one major point: its output (response variable) is a class as opposed to a numerical variable.
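
The difference in output can be sketched as follows: both models compute the same linear score, but logistic regression passes it through the sigmoid function to obtain a probability and then a class. The weight, bias, and 0.5 threshold are hypothetical, not fitted values.

```python
from math import exp

def linear_predict(x, w, b):
    """Linear regression output: an unbounded number."""
    return w * x + b

def logistic_predict(x, w, b, threshold=0.5):
    """Logistic regression output: a probability and the resulting class."""
    probability = 1 / (1 + exp(-(w * x + b)))
    return probability, int(probability >= threshold)

w, b = 1.0, -3.0  # hypothetical coefficients

number = linear_predict(5, w, b)
p_high, class_high = logistic_predict(5, w, b)
p_low, class_low = logistic_predict(1, w, b)
print(number, p_high, class_high, p_low, class_low)
```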

Are data mining processes a mere sequential set of activities?

-No. Even though these steps are sequential in nature, there is usually a great deal of backtracking. -Data mining is driven by experience and experimentation. Depending on the problem/situation and the knowledge/experience of the analyst, the whole process can be very iterative.

Inmon Approach to Data Warehouse Development

-Starts with an enterprise data warehouse, creating data marts as subsets if appropriate. -Most effective when there is a recognized need for an EDW, an executive "champion" of the project, and a willingness to invest in a data warehousing infrastructure before it will show results.

Kimball Approach to Data Warehouse Development

-Starts with data marts, consolidating them into an EDW later if appropriate. -Most effective when it is desired to provide a "proof of concept" implementation before embarking on a full-scale EDW project or when a well-defined area with the greatest benefits can be identified.

Data Consolidation

-Step 1 of data preprocessing -Collecting, storing, and integrating data

data reduction

-Step 4 of data preprocessing -Reduce dimension -Reduce volume -Balance data
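
A rough sketch of the three reduction tasks: dropping columns (dimension), sampling rows (volume), and undersampling the majority class (balance). The helper names, the toy churn data, and the choice of undersampling are all assumptions for illustration.

```python
import random
from collections import Counter

def select_columns(rows, keep):
    """Reduce dimension: keep only the named variables."""
    return [{name: row[name] for name in keep} for row in rows]

def sample_rows(rows, fraction, seed=0):
    """Reduce volume: draw a random sample of the rows."""
    rng = random.Random(seed)
    return rng.sample(rows, max(1, round(len(rows) * fraction)))

def undersample(rows, label, seed=0):
    """Balance data: shrink every class to the size of the smallest one."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[label], []).append(row)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

# Hypothetical data: 4 churners and 16 non-churners, plus a useless column.
rows = [{"id": i, "age": 20 + i, "noise": 0, "churn": i % 5 == 0} for i in range(20)]

narrow = select_columns(rows, ["age", "churn"])
half = sample_rows(narrow, 0.5)
balanced = undersample(narrow, "churn")
print(len(half), Counter(row["churn"] for row in balanced))
```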

What are the best practices in business reporting? How can we make our reports stand out?

-Story Telling/ Story Structure -Be authentic -Visual -Invite and direct discussion. -Dashboard/Scorecard

Prediction Data-mining

-The act of telling about the future by taking into account the experiences, opinions, and other relevant information in conducting the task of foretelling. -Whereas prediction is largely experience and opinion based, forecasting is data and model based.

Visual analytics

-The combination of visualization and predictive analytics -Visual analytics is aimed at answering "why is it happening," "what is more likely to happen," and is usually associated with business analytics (forecasting, segmentation, and correlation analysis).

Discuss what an organization should consider before making a decision to purchase data mining software.

-cost/benefit analysis -people with the expertise to use the software and perform the analyses -availability of historical data -a business need for the data mining software.

Web accessibility of a data warehouse

-important because many analysis applications are Web-based, because users often access data over the Web (or over an intranet using the same tools) and because data from the Web may feed the DW.

Enterprise Data Warehouse

-Large-scale data warehouse that is used across the enterprise for decision support. -Provides integration of data from many sources into a standard format. -Provides data for DSS applications such as CRM, SCM, and BPM.

benefits of data warehouse

-provides decision-making information, organized in a way that facilitates the types of access required for that purpose and supported by a wide range of software designed to work with it.

Define data mining. Why are there many names and definitions for data mining?

1.) Data mining is the process through which previously unknown patterns in data are discovered. 2.) A process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data. 3.) Data mining is the process of finding mathematical patterns from (usually) large sets of data; these can be rules, affinities, correlations, trends, or prediction models.

Prescriptive Analytics

A set of techniques that use descriptive data and forecasts to identify the decisions most likely to result in the best performance. For example, prescriptive analytics could compare the forecast impact on profits of different baggage fees. It might show that raising baggage fees by $5 will lead to the greatest profits after the airline takes into account fee revenues, ticket sales, the amount of baggage carried, and the cost to transport the baggage.
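
In the spirit of the baggage-fee example, a prescriptive step can be sketched as "score each candidate decision with a forecast, then pick the best". The demand model, the cost per bag, and the candidate fees below are invented purely for illustration.

```python
COST_PER_BAG = 5  # hypothetical cost to transport one bag

def forecast_profit(fee):
    """Hypothetical forecast: higher fees mean fewer checked bags."""
    bags = 1000 - 40 * fee
    return bags * (fee - COST_PER_BAG)

candidate_fees = [5, 10, 15, 20, 25]

# Prescriptive step: choose the decision with the best forecast outcome.
best_fee = max(candidate_fees, key=forecast_profit)
best_profit = forecast_profit(best_fee)
print(best_fee, best_profit)
```

The forecast itself is the predictive part; selecting the fee is what makes the analysis prescriptive.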

independent data mart

A small data warehouse designed for a strategic business unit or a department

dependent data mart

A subset that is created directly from a data warehouse

Operational Data Store

A type of database often used as an interim area for a data warehouse, especially for customer information files. -Supports short-term decisions involving mission-critical applications.

Considering the new and broad definition of business analytics, what are the main inputs and outputs to the analytics continuum?

Because of the broader definition of business analytics, almost any data from almost any source can be considered an input. In the same way, after analytics has been performed, output can take a wide variety of forms depending on the specific business purpose.

What are the main data mining application areas?

CRM, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, travel, healthcare, medicine, entertainment, homeland security and law enforcement, and sports. -The commonalities are the need for predictions and forecasting for planning purposes and to support decision making.

Correlation

Correlation makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate on the degree of association between the variables. -Interested in the low-level relationships between two variables.

Where does the data for business analytics come from? What are the sources and the nature of that incoming data?

Data can come from a wide variety of locations. Examples can include business processes and systems, the Internet and social media, and machines or the Internet of Things

Compare data integration and ETL. How are they related?

Data integration consists of three processes that integrate data from multiple sources into a data warehouse: accessing the data, combining different views of the data, and capturing changes to the data. -It makes data available to ETL tools and, through the three processes of ETL, to the analysis tools of the data warehousing environment.
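
The three ETL processes can be sketched with the standard library alone. Here an in-memory SQLite table stands in for the data warehouse, and the CSV text, table name, and column names are hypothetical.

```python
import csv
import io
import sqlite3

RAW = "customer,amount\nann, 10.50\nbob,7\n"  # hypothetical source extract

def extract(text):
    """Extract: read rows out of the source format."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names and convert amounts to numbers."""
    return [(row["customer"].strip().title(), float(row["amount"])) for row in rows]

def load(rows, conn):
    """Load: write the cleaned rows into the target store."""
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
names = [row[0] for row in conn.execute("SELECT customer FROM sales ORDER BY customer")]
print(total, names)
```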

How do you describe the importance of data in analytics?

Data is the main ingredient in all forms of analytics. You cannot have analytics without data.

Why are there many names and definitions for data mining?

Data mining has many names and definitions because some software vendors have stretched the term beyond its original limits to include most forms of data analysis, in order to increase sales by capitalizing on the popularity of data mining.

Data reduction can be applied to rows (sampling) and/or columns (variable selection). Which is more challenging? Explain.

Data reduction as it applies to variable selection is more complex. This is because variables to be studied must be selected and others discarded. This is typically done by individuals who are experts in the field.

structured data

Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.

Why is the original/raw data not readily usable by analytics tasks?

It is often dirty, misaligned, overly complex, and inaccurate

Can we use the same data representation for all analytics models (i.e., do different analytics models require different data representation schema)? Why, or why not?

No, other data types, including textual, spatial, imagery, video, and voice, need to be converted into some form of categorical or numeric representation before they can be processed by analytics methods. -Needs to be structured

Metric Management Reports

Outcome-oriented metrics based on service level agreements and/or key performance indicators.

Data Mining Methods

Prediction(classification or regression) -Clustering -Association

Dashboard-Type Reports

Present a range of performance indicators on one page, with both static/predefined elements and customizable widgets and views

Analytics

Process of developing actionable decisions based on insights generated from historical data

What are the two most commonly used shape characteristics to describe a data distribution?

Skewness and Kurtosis
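
Both shape measures can be computed from central moments. The sketch below uses the population (moment-based) formulas and reports excess kurtosis (kurtosis minus 3); other definitions with sample-size corrections exist, and the data values are hypothetical.

```python
def shape(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

sym_skew, sym_kurt = shape(symmetric)
tail_skew, tail_kurt = shape(right_tailed)
print(sym_skew, sym_kurt)
print(tail_skew, tail_kurt)
```

A skewness near zero indicates symmetry, while a positive value indicates a longer right tail.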

What is the relationship between statistics and business analytics?

Statistics can be used as a part of business analytics, either to help generate reports or as a presentation format.

What is a box-and-whiskers plot? What types of statistical information does it represent?

The box-and-whiskers plot is a graphical illustration of several descriptive statistics about a given data set. The box plot shows the centrality, the dispersion, and the minimum and maximum ranges.
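
The numbers behind a box plot are often called the five-number summary. A minimal sketch using `statistics.quantiles` follows; the "inclusive" quartile method is one of several conventions, and the delivery times are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

delivery_days = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, q1, med, q3, high = five_number_summary(delivery_days)
iqr = q3 - q1  # height of the box: the spread of the middle half
print(low, q1, med, q3, high, iqr)
```

The median gives the centrality, the box from Q1 to Q3 gives the dispersion, and the whiskers reach toward the minimum and maximum.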

What are the main categories of data? What types of data can we use for BI and analytics?

The main categories of data are structured data and unstructured data. Both of these types of data can be used for business intelligence and analytics, although it is easier and more expedient to use structured data.

What are the main data preprocessing steps? List and explain their importance in analytics.

The main data preprocessing steps include data consolidation, data cleaning, data transformation, and data reduction.

What is time series?

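A time series is a sequence of data points of the variable of interest, measured at successive points in time spaced at uniform intervals. Time series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values.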
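
One of the simplest forecasting models is a moving average of the most recent observations. It is used here only as an illustration; the monthly sales figures and the window of three are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)

monthly_sales = [100, 110, 120, 130, 140]  # observed in time order
next_month = moving_average_forecast(monthly_sales)
print(next_month)
```

Because it averages past values, this model lags behind a steady trend; trend-aware methods would forecast higher here.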

DSS (decision support system)

Umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies

Describe how data integration can lead to higher levels of data quality.

Without a data integration process to combine data in a planned and structured manner, data might be combined incorrectly. That could lead to misunderstood data (a measurement in meters taken as being in feet) and to inconsistent data (data from one source applying to calendar months, data from another to four-week or five-week fiscal months).
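
The meters-versus-feet problem can be sketched as a planned conversion step that runs before sources are combined. The record layout, site names, and helper name are hypothetical.

```python
FEET_PER_METER = 3.28084

def to_meters(value, unit):
    """Convert a length to meters; reject units the process does not know."""
    if unit == "m":
        return value
    if unit == "ft":
        return value / FEET_PER_METER
    raise ValueError(f"unknown unit: {unit}")

# Two hypothetical sources that record the same length in different units.
source_a = [{"site": "A", "length": 10.0, "unit": "m"}]
source_b = [{"site": "B", "length": 32.8084, "unit": "ft"}]

integrated = [
    {"site": row["site"], "length_m": to_meters(row["length"], row["unit"])}
    for row in source_a + source_b
]
print(integrated)
```

Raising an error on an unknown unit keeps misunderstood measurements from slipping silently into the combined data.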

Business Intelligence (BI)

a broad category of applications, technologies, and processes for gathering, storing, accessing, and analyzing data to help business users make better decisions.

Data Mart

a data collection, smaller than the data warehouse, that addresses the needs of a particular department or functional area of the business

Descriptive analytics

applying historical or real-time data to know what is happening in the organization and understand some underlying trends and causes of such occurrences. In the airline example, descriptive analytics would include data about ticket prices, baggage fees, ticket sales, baggage volume, and so on, applied to find relationships among these variables.

data mart

contains a subset of data warehouse information

Information Generation

converting data from different sources into actionable information.

What are the most commonly pronounced assumptions for linear regression? What is crucial to the regression models against these assumptions?

Linearity, independence of errors, normality of errors, constant variance, and multicollinearity (the explanatory variables should not be highly correlated with one another).

Skewness

measure of asymmetry in a distribution of the data that portrays a unimodal structure—only one peak exists in the distribution of the data

Kurtosis

measure to use in characterizing the shape of a unimodal distribution that is more interested in characterizing the peak/tall/skinny nature of the distribution.

unstructured data

nonnumeric information that is typically formatted in a way that is meant for human eyes and not easily understood by computers. -Photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.

Balanced scorecard reports

present an integrated view of a company's health and include financial, customer, business process, and learning/growth perspectives.

Predictive Analytics

the use of statistical techniques and data mining to determine what is likely to happen in the future. For example, an airline might use predictive analytics to forecast the impact on sales and profits if it raises baggage fees by $10. It applies information from descriptive analytics. Predictive analytics may be applied to prescriptive analytics.

Distinguish BI from DSS.

· BI uses a data warehouse, whereas DSS can use any data source (including a data warehouse). · Most DSS are built to support decision making directly, whereas most BI systems are built to provide information which it is believed will lead to improved decision making. · BI has a strategy/executive orientation whereas DSS are usually oriented toward analysts. · BI systems tend to be developed with commercially available tools, whereas DSS tend to use more custom programming to deal with problems that may be unstructured. · DSS methodologies and tools originated largely in academia, whereas BI arose largely from the software industry. Many BI tools, such as data mining and predictive analysis, have come to be considered DSS tools as well.

Dashboard design

· Benchmark key performance indicators with industry standards. · Wrap the dashboard metrics with contextual metadata (e.g., data source, data currency, refresh schedule). · Prioritize and rank alerts/exceptions streamed to the dashboard. · Enrich the dashboard with business users' comments. · Present information in three different levels (visual dashboard, static report, and self-service cube). · Pick the right visual construct using dashboard design principles. · Provide for guided analytics.

What are the most common myths and mistakes about data mining?

· Data mining provides instant, crystal-ball predictions. · Data mining is not yet viable for business applications. · Data mining requires a separate, dedicated database. · Only those with advanced degrees can do data mining. · Data mining is only for large firms that have lots of customer data.

Discuss the major drivers and benefits of data warehousing to end users.

· Increased competition and pace of business, leading to increased need for good decisions quickly · Successful pioneering experiences with data warehouses, leading to their wider user acceptance · Decreasing hardware costs, making terabyte databases with masses of historical data economically feasible for more firms · Increased availability of software to manage a large data warehouse · Increased availability of analysis tools making DWs potentially more useful · Increased computer literacy of decision makers, making them more likely to use these tools

major issues in implementing BI.

· Properly appreciating the different classes of potential users of the BI applications. · Properly aligning BI with the business strategy. · Developing BI applications that meet users' needs for real-time, on-demand capabilities. · Determining whether to develop or acquire BI systems, and how to do so. · Justifying the BI investment using cost-benefit analysis. · Ensuring security and privacy protection. · Integrating BI applications with organizational systems, databases, and e-commerce.

What are the main reasons for the recent popularity of data mining?

• More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace • General recognition of the untapped value hidden in large data sources • Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc. • Consolidation of databases and other data repositories into a single location in the form of a data warehouse • The exponential increase in data processing and storage technologies • Significant reduction in the cost of hardware and software for data storage and processing • Movement toward the de-massification (conversion of information resources into nonphysical form) of business practices

