5.1 The Business Intelligence (BI) Stack

How to come up with good ideas

There are three ways: (1) take a completely random guess and hope that it works; (2) draw from your past experience (and that of your co-workers and employees) to base new ideas upon; or (3) gather data about your performance and use it to determine cause and effect. The very best decisions (at least in business) are generally based on data.

conceptual characteristics of good measures.

Simple; easily obtainable; precisely definable; objective; valid; robust; and putting them all together.

dashboard

A summary of KPIs that the user can explore. Good dashboards (i.e., more expensive dashboards) also give managers tools for exploring the KPIs in more detail and breaking them down into more specific measures. Dashboards can include both descriptive and limited predictive elements, yet their primary purpose is to focus on KPIs such as revenue. Visualization dashboards are a popular way to summarize a series of KPIs, data descriptions, and even data predictions in a single user interface. Dashboard design is an important area of research and a profitable business for skilled user-interface designers because dashboards have such a strong effect on the successful consumption of big data by business users.

Descriptive approach to data mining

Descriptive data mining refers to the set of tools and procedures designed to analyze data in ways that describe the past and immediate present state of the business processes that the data are produced from.

Descriptive Tools and Techniques

Descriptive tools and techniques comprise most of those that people refer to when they speak of "data mining" or "BI". The idea here is to take large amounts of data (hence the term "big data") and summarize it into either a set of key performance indicators (KPIs) or ad-hoc measures.

ETL

ETL: Extract, Transform, Load. Taking data from an operational database, making it functional for another database, then loading it to that new database. Extract/export means that the data is copied from the operational database(s) and other data sources. Transform means that we do not want to load the data "as-is" into another database; we don't need that same level of detail for our analyses. Therefore, we will typically summarize it in some ways. For example, rather than store each individual sale, we will store the daily sales by product line, salesperson, store location, etc. Lastly, load simply means that the data is then pasted into a new database.
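
The three ETL steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table names (`sale`, `daily_sales`), the column names, and the sample rows are hypothetical, and two in-memory SQLite databases stand in for the operational database and the analytical database.

```python
import sqlite3

# Hypothetical operational data: one row per individual sale.
SALES = [
    ("2024-01-01", "Shoes", "North", 50.0),
    ("2024-01-01", "Shoes", "North", 70.0),
    ("2024-01-01", "Hats", "South", 20.0),
    ("2024-01-02", "Shoes", "North", 60.0),
]


def extract(conn):
    """Extract: copy the detailed rows out of the operational database."""
    return conn.execute(
        "SELECT sale_date, product_line, store, amount FROM sale"
    ).fetchall()


def transform(rows):
    """Transform: summarize individual sales into daily totals."""
    totals = {}
    for sale_date, product_line, store, amount in rows:
        key = (sale_date, product_line, store)
        totals[key] = totals.get(key, 0.0) + amount
    return [key + (total,) for key, total in sorted(totals.items())]


def load(conn, summary):
    """Load: write the summarized rows into the analytical database."""
    conn.execute(
        "CREATE TABLE daily_sales "
        "(sale_date TEXT, product_line TEXT, store TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?, ?)", summary)
    conn.commit()


# Stand-in for the operational database.
operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE sale (sale_date TEXT, product_line TEXT, store TEXT, amount REAL)"
)
operational.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALES)

# Stand-in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(operational)))
daily_rows = warehouse.execute(
    "SELECT * FROM daily_sales ORDER BY sale_date, product_line"
).fetchall()
```

The four individual sales become three summary rows, one per date, product line, and store, which is the kind of reduced detail the transform step is meant to produce.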

the "cause" of success

Harder to determine. Does organic food cause autism? Probably not. In fact, the ridiculous chart suggesting that it does was made as an example of how data can be terribly misinterpreted if you don't know how to use it.

Precisely Definable

Measures should be clearly defined so that they can be applied and evaluated consistently. Organizations need to establish and adhere to specific rules when collecting measures and ensure that these rules are being followed uniformly so that the integrity of the data is maintained. In baseball, for example, the total number of 'at-bats' a player has determines, in part, his batting average. Therefore, it is important to decide what constitutes an at-bat. When a player is issued a walk, is that considered an at-bat? How about when the result of the at-bat is a sacrifice? An error? These parameters have already been determined for baseball, and it is important that these measures be made uniformly for each player to maintain consistency.

robust

Measures should be insensitive to insignificant changes in the process or product. Baseball measurements, for example, are robust: unless the entire structure of the game were changed, the measurements would not be affected by small changes. Changing the height of the pitcher's mound or the distance from the pitcher's mound to home plate will not change the measurements used to assess a pitcher's performance.

OLAP Cube

OLAP stands for "Online Analytical Processing." An OLAP cube is a table of summarized data.
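
As a rough illustration of "a table of summarized data," the sketch below totals a measure for every combination of dimension values, including roll-ups. The fact rows, the dimension names, and the `"ALL"` roll-up label are assumptions made for the example; real OLAP tools store and index their cubes quite differently.

```python
from itertools import product

# Hypothetical fact rows: one summarized sales figure per product line and store.
FACTS = [
    {"product_line": "Shoes", "store": "North", "revenue": 120.0},
    {"product_line": "Shoes", "store": "South", "revenue": 80.0},
    {"product_line": "Hats", "store": "North", "revenue": 30.0},
]
DIMENSIONS = ("product_line", "store")


def build_cube(facts, dimensions, measure):
    """Sum the measure for every cell, where a cell fixes each dimension
    to a specific value or rolls it up to "ALL"."""
    cube = {}
    for fact in facts:
        # Each dimension is either kept at its own value or rolled up.
        choices = [(fact[d], "ALL") for d in dimensions]
        for cell in product(*choices):
            cube[cell] = cube.get(cell, 0.0) + fact[measure]
    return cube


cube = build_cube(FACTS, DIMENSIONS, "revenue")
shoes_all_stores = cube[("Shoes", "ALL")]   # revenue for Shoes across stores
north_all_lines = cube[("ALL", "North")]    # revenue for North across lines
grand_total = cube[("ALL", "ALL")]          # revenue across everything
```

Looking up a cell such as `("Shoes", "ALL")` is then a single dictionary access, which is the point of summarizing ahead of time.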

Predictive approach to data mining

Predictive data mining refers to the set of tools and procedures designed to predict the most likely future outcomes, including performance, states, preferences, and much more, based on historical data. Predictions are typically more difficult to obtain than descriptions, but they are also more useful for making intelligent business decisions.

The "effect" of success

The "effect" that we are interested in is business success.

valid

The measure should actually reflect the property it is intended to measure. For example, how would you measure the success of a recording artist? If you wanted to measure commercial success, you could track the number of albums they sell and their total radio airtime. Conversely, if you wanted to measure a singer's vocal ability, you might have a panel of experts assess whether the artist's voice is consistently on key and how wide their vocal range is. Often measures are designed to reflect specific process improvement constructs such as quality and efficiency.

Other sources

The statistical formulas used in the analyses above can also be used to improve other important steps in data analysis. For example, if you have missing consumer data in your analytical database, many statistical formulas will automatically ignore all of that consumer's data because they require complete data to work at all. Let's say that only 8 percent of your customers have completed their entire online profile. It would be very sad to have to ignore the other 92 percent of your customers. So what options do you have? You can start paying for external databases that might be able to fill in the gaps. However, those databases often have the exact same information you have. Another option is to fill in the missing values of each customer with the average of all other customers. While this will allow you to use more of your data, it will likely reduce the strength of your relationships. More recently, a popular technique has been to use the same statistical analysis used in key influencer analysis (e.g., regression) to predict the most likely value of the missing data based on the actual values of all other attributes of the record. For example, if you know that a customer is male, age 20, not a homeowner, and works part time, your regression model is likely to predict that this person has a partial college education. While it is definitely possible that this prediction is wrong, it is much more likely to be accurate than simply using the most common value for education found in the data.
You can also use clustering algorithms, found in category detection tools, to identify records that are outliers. For example, your customers may have an average income of $75,000 per year. Therefore, a customer making $200,000 per year may appear to be an outlier. Removing outliers from the data is a great way to improve your predictive power. However, a clustering algorithm would examine all of the other attributes of this seeming outlier in concert and find that because that customer has a graduate degree and 30 years of full-time work experience, they are well within the normal range of customers. Similarly, another customer who earns the exact average of $75,000 would be identified as an outlier if they are 16 years old.
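
To make the difference between the two fill-in options concrete, here is a minimal sketch that imputes one missing income, first with the average of all other customers and then with a one-variable regression. The customer records are hypothetical, and they were chosen to lie exactly on a line so the arithmetic is easy to check; real data would be noisier and would normally use more than one predictor.

```python
from statistics import mean

# Hypothetical customers: (years of work experience, income).
# None marks a missing income. Known incomes follow 25000 + 2500 * years.
CUSTOMERS = [(2, 30000.0), (10, 50000.0), (20, 75000.0), (30, 100000.0), (6, None)]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


known = [(x, y) for x, y in CUSTOMERS if y is not None]
xs = [x for x, _ in known]
ys = [y for _, y in known]

# Option 1: fill every gap with the average of the known values.
mean_fill = mean(ys)

# Option 2: predict each gap from the record's other attribute.
intercept, slope = fit_line(xs, ys)
regression_fill = {x: intercept + slope * x for x, y in CUSTOMERS if y is None}
```

For the customer with 6 years of experience, the average assigns the same income as everyone else's mean, while the regression estimate reflects that this customer is early in their career.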

A data warehouse may include data from which of the following sources?

Web scraping; third-party databases for sale by subscription; operational or "transactional" databases; company email.

combining them all together

A smart manager will collect several measures that, taken together, she or he believes represent what a construct such as customer satisfaction truly is.

Business intelligence (BI)

A term with a broad meaning that generally refers to the process and technology tools used to transform raw data into meaningful and useful information that supports business decision making. It is often used interchangeably with BA. More recently, the term "BI" has been used to refer more specifically to the portion of the process that describes and reports on the past and existing state of the business.

Detecting categories

Also known as "clustering": the process of grouping related data. Clustering analyses will not only tell you how many clusters were found, but also the primary characteristics (attribute:value pairs) of each cluster. This will allow you to group your customers into segments and create unique strategies for each segment.
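
One common clustering technique is k-means, sketched below as an illustration rather than as the method any particular BI tool uses. The customer points, the choice of two clusters, and the starting centroids are all assumptions; the attributes are also left unscaled, which a real analysis would usually correct.

```python
from math import dist
from statistics import mean

# Hypothetical customers as (age, income in $1000s).
CUSTOMERS = [(22, 30), (25, 35), (24, 32), (50, 90), (55, 95), (52, 88)]


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, and repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: dist(point, centroids[i])
            )
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(mean(axis) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Starting centroids are an arbitrary choice: the first and last customer.
centroids, clusters = k_means(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[-1]])
```

Each final centroid summarizes its segment, for example a younger, lower-income group and an older, higher-income group, which is the kind of per-cluster characteristic the definition above refers to.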

Market Basket Analysis

Analyzing and predicting related purchases, e.g., "customers who bought this item also bought..." The statistical technique used to perform market basket analysis is called "association analysis." If you've ever visited amazon.com, then you've seen market basket analysis, as new products are always suggested based on the product you are viewing.
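
Association analysis is usually described with two numbers per rule: support (how often the items appear together) and confidence (how often the second item appears given the first). The sketch below computes both for item pairs; the baskets are hypothetical, and restricting rules to pairs is a simplification.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
]


def pair_rules(baskets):
    """Return {(a, b): (support, confidence)} for the rule
    "customers who bought a also bought b"."""
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), together in pair_counts.items():
        support = together / len(baskets)
        rules[(a, b)] = (support, together / item_counts[a])
        rules[(b, a)] = (support, together / item_counts[b])
    return rules


rules = pair_rules(BASKETS)
jam_to_butter = rules[("jam", "butter")]
bread_to_butter = rules[("bread", "butter")]
```

In this toy data every basket containing jam also contains butter, so that rule has a confidence of 1.0, while only two of the three bread baskets contain butter.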

Why are BA and BI ambiguous?

Each term is often used to refer to all data analytics purposes, techniques, and tools rather than a single subset. The terms are frequently ambiguous because this area is changing faster than perhaps any other type of organizational technology.

simple

A measure is simple when few inputs are required to produce the same measure that could be produced with more inputs. Being simple does not make a measure the right one if it doesn't provide the meaning you are looking for. However, all other things being equal (ceteris paribus), a simpler measure is better.

easily obtainable

Includes two parts: ease of collection and ease of calculation. Ease of collection: How easy are the measures to collect? Does collection require human counting, or is the data generated through automation? For example, at checkout stands, the process of running items over a scanner generates the information. Thus, the work itself is "sensed" by some sort of monitoring device. In baseball, the process is not automated, but it is not difficult to obtain counts for the different measures. Anyone with a basic knowledge of the game could collect measures that would cover all of the categories necessary. Ease of calculation: How easily can you calculate useful information from the source data? Counts and simple ratios are easy to calculate. Measures that take several variables into account are harder to calculate. It is often beneficial to automate the collection and calculation tasks when possible; otherwise, the work is unnecessarily arduous and costly, prone to error, and less likely to be done. This is especially true of measurements that are computation-intensive or that require an array of data to calculate.

Analyzing Key Influencers

Also called "key influencer analysis": measuring the correlation between key independent (x) variables and a dependent variable (y). This is a great way to find out, for example, what characteristics of potential customers are related to their level of repeat purchases in your store. As customers have more education, they may be more likely to make a purchase (a relationship that can be drawn as a line of best fit through a scatterplot). However, it is very important to be wary of assuming causality even though you may find statistically significant results. For example, it may not be a customer's education that causes them to make purchases; rather, their education led to greater income, which led to purchases. There are several statistical formulas used to make these kinds of predictions. A line of best fit comes from a regression analysis. Other formulas include Naive Bayes, decision trees, neural networks, and more.
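
The correlation and the line of best fit described above can both be computed from the same sums. This is a minimal sketch with one independent variable; the education and repeat-purchase figures are invented for illustration, and a strong correlation here says nothing about causality.

```python
from math import sqrt
from statistics import mean

# Hypothetical customers: years of education (x) and repeat purchases (y).
EDUCATION_YEARS = [10, 12, 14, 16, 18]
REPEAT_PURCHASES = [1, 2, 2, 4, 5]


def correlation_and_line(xs, ys):
    """Return Pearson's r plus the slope and intercept of the
    least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return r, slope, intercept


r, slope, intercept = correlation_and_line(EDUCATION_YEARS, REPEAT_PURCHASES)

# Predicted repeat purchases for a customer with 15 years of education.
predicted = intercept + slope * 15
```

In this made-up sample each additional year of education is associated with about half an additional repeat purchase.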

purpose of predictive tools and techniques

To outsmart the competition by making the products that consumers really want (perhaps before they know they want them), recognizing viruses before they become "known" viruses, knowing whether a consumer is going to buy our product before we give them the sales pitch, and much more.

data cleaning

Part of the "transform" step of ETL. Data cleaning is the process of improving the quality of the data. Should you ETL every single sale as it happens in real time? Probably not, because that would require significant extra computing power. So, should you ETL once a month? Well, it depends. Is the value of having the most recent month's data greater than the cost of running an ETL batch once per month? If so, then you should probably run it more often. In other words, that decision affects the timeliness of your data. So, let's say you decide to ETL every morning at 4 a.m., which is when your website traffic is at its lowest. In the past 24 hours, let's say you've had 2,500 new customers register. However, few of them filled out all of their profile information. Can you design your ETL process to query the United States Postal Service database and fill in any missing data (to be added into your data warehouse, not the operational customer database that customers see)? Yes, you can. That would improve the missing values problem and possibly any accuracy problems with their contact information. In fact, there are many other databases you can pay to access and program into your ETL process to improve or clean your data.

data mart

Portions of a data warehouse. A data mart exists for two reasons: (1) it limits information security vulnerability (i.e., not everyone sees, or has access to, all the data), and (2) it reduces the complexity of the data for end users because they don't see anything they don't need. As organizations realized the value in these mini analytical databases, they realized that the data storage should be centralized. As a result, in many cases, data warehouses were a "backtracked" technology designed to share the analytical data being stored by each organizational department.

Key performance indicators

pre-planned measures that have been carefully determined to indicate the organization's performance on a particular business process.

Data mining

The process of analyzing large amounts of data (a.k.a. "big data") to create new information that is useful in making unstructured business decisions and solving unstructured business problems. It can take a "descriptive" or "predictive" approach.

Forecasting

process of predicting future values over interval time periods based on known, measured values of the same interval periods. As a result, forecasting always has a standard time period and is charted over time. Sales revenues, profit, costs, and market demand are among the most common measures forecasted over time (i.e., time-series). The ARMA (autoregressive moving average) and ARIMA (autoregressive integrated moving average) formulas are among the most common statistical formulas used for forecasting.
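
A full ARMA or ARIMA fit is beyond a short example, so the sketch below uses only the simplest autoregressive case, AR(1), in which each period's value is predicted from the previous period's value. This is a deliberate simplification and not an ARMA or ARIMA implementation. The monthly sales figures are hypothetical and were generated to follow an AR(1) pattern exactly so the fitted coefficients are easy to verify.

```python
from statistics import mean

# Hypothetical monthly sales, one value per equal-length period.
# Generated as next = 20 + 0.9 * previous so the fit is easy to check.
MONTHLY_SALES = [100.0, 110.0, 119.0, 127.1, 134.39]


def fit_ar1(series):
    """Least-squares fit of y[t] = const + phi * y[t-1]."""
    prev, curr = series[:-1], series[1:]
    p_bar, c_bar = mean(prev), mean(curr)
    phi = sum((p - p_bar) * (c - c_bar) for p, c in zip(prev, curr)) / sum(
        (p - p_bar) ** 2 for p in prev
    )
    return c_bar - phi * p_bar, phi


def forecast(series, steps):
    """Roll the fitted AR(1) equation forward the requested number of periods."""
    const, phi = fit_ar1(series)
    predictions, last = [], series[-1]
    for _ in range(steps):
        last = const + phi * last
        predictions.append(last)
    return predictions


const, phi = fit_ar1(MONTHLY_SALES)
next_two_months = forecast(MONTHLY_SALES, 2)
```

Because the forecast for the second month is built on the forecast for the first, errors compound the further ahead you predict, which is one reason forecasts are charted period by period.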

Predictive data analysis

Requires relatively complex statistical formulas that use historical data to make predictions about the relationships between sets of variables. It includes detecting categories, analyzing key influencers, forecasting, market basket analysis, and other sources. Creating new and useful ways to integrate statistical prediction into your business processes is a great way to save costs, increase revenues, and become noticed by your managers.

Time-Series Analysis

showing trends in financial data over time

Business Analytics (BA)

A subset of BI (the two terms are often used interchangeably). BA is often used to refer to the statistical analyses performed to help predict the future.

BI Stack

the set of technologies (hardware and software) that either support or directly offer data description and analytics capabilities.

ad-hoc measures

Measures created "on the fly" to help in decision making, typically associated with an ad-hoc database query. This is a primary reason that Microsoft Excel is such a mainstream tool at every level in an organization. It has many features for data cleaning and analysis on the fly. Excel provides a nice platform for downloading relatively small amounts of data and creating (or experimenting with) new measures. Often, KPIs are created in Excel before they become true KPIs and are shifted into a more permanent online, web-based dashboard.

objective

Two or more qualified observers should arrive at the same value for the measure. In baseball, for example, every once in a while the official scorer will score something as a hit that could be considered an error, or vice versa. These occurrences are rare and do not have a great impact on the outcome of the measurements.

Data Sources

Where the data comes from. Much of our data will come from our operational databases. Operational databases keep track of every piece of data necessary for our business processes in a non-redundant (a.k.a. "normalized") relational database. However, there are a lot of other sources of good data that can be used to create business intelligence to aid decision making. Any company-issued hardware (laptops, phones, tablets, smart ID cards, or even watches) can collect data through user interactions or through sensors. Any data transferred over company-managed networks can be used as a data source. Any data you can legally obtain through web searches or web scraping is also a usable source of data. Another important source of data is the research we hire consulting firms to perform or that we perform ourselves. For example, we may administer a survey on our company website. Or, we may hire a firm to call our customers and ask them about their satisfaction with our products. Similarly, we likely have supply chain partners who are incentivized to see us succeed. For example, our suppliers want us to keep buying parts and products from them. Therefore, they will often offer us access to some of their data, which may be useful. Finally, it is very common to purchase access to third-party databases. Consider the major credit bureaus (Equifax, TransUnion, and Experian). They offer consumer credit information for sale. The United States Postal Service sells access to address information. A company can use these data sources to "fill in the gaps" in its own consumer data collection.

data warehouse (a.k.a. analytical database)

Where we load the recently transformed data and where the relevant data is stored. The same rules of normalization (non-redundancy of data) no longer apply. Data warehouses have their own rules and schemas (e.g., the "star" schema) that are optimized for ad-hoc data analyses. Hadoop is one popular technology for integrating data from a wide range of sources into a data warehouse.

