BCOR 2020 Exam 1
Benefits of BI and Analytics
1. Detect fraud 2. Improve forecasting 3. Increase sales 4. Optimize operations 5. Reduce costs
Options for addressing missing data
1. Discard observations (rows) with any missing values 2. Discard any variable (column) with missing values 3. Fill in missing entries with estimated values 4. Apply a data-mining algorithm that can handle missing values
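Example (a minimal Python sketch, not from the course notes; the rows and the column names id, age, and income are made up, None is assumed to mark a missing entry, and impute_mean is a hypothetical helper). It illustrates the first three options on a tiny data set:

```python
# Hypothetical data set: None marks a missing entry.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 29, "income": 49000},
]

# Option 1: discard observations (rows) with any missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option 2: discard variables (columns) with any missing value.
complete_cols = [c for c in rows[0] if all(r[c] is not None for r in rows)]

# Option 3: fill in missing entries with an estimate (here, the column mean).
def impute_mean(rows, column):
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

imputed = impute_mean(impute_mean(rows, "age"), "income")
```

Option 4 depends on the algorithm chosen and is not shown. The column mean is only one possible estimate for option 3.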
Challenges of Big Data
1. How to choose what subset of the data to store 2. Where and how to store the data 3. How to find the nuggets of data that are relevant to the decision making at hand 4. How to derive value from the relevant data 5. How to identify which data needs to be protected from unauthorized access
Decision Making Process
1. Identify and define the problem 2. Determine the criteria 3. Determine the set of alternative solutions 4. Evaluate the alternatives 5. Choose an alternative
Data mining
A BI analytics tool used to explore large amounts of data for hidden patterns to predict future trends and behaviors for use in decision making
Data Attribute
A characteristic of an entity
Data modeling
A diagram of data entities and their relationships
Primary Key
A field or set of fields that uniquely identifies the record
Histogram
A graphical display of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the class intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.
Conversion Funnel
A graphical representation that summarizes the steps a consumer takes in making the decision to buy your product and become a customer.
Database Management System (DBMS)
A group of programs that manipulate the database and provide an interface between the database and the user of the database and other application programs.
Data warehouse
A large database that collects business information from many sources in the enterprise in support of management decision making
Linear Regression
A mathematical technique for predicting the value of a dependent variable based on a single independent variable and the linear relationship between the two
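Example (a minimal Python sketch, assuming the usual least-squares fit; the notes do not name a fitting method, and fit_line plus the ad_spend and sales figures are made up for illustration):

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]   # independent variable (hypothetical)
sales = [3, 5, 7, 9, 11]     # dependent variable (hypothetical)

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = intercept + slope * 6  # prediction for ad_spend = 6
```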
Geometric Mean
A measure of location that is calculated by finding the nth root of the product of n values, used in analyzing growth rates in financial data
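Example (a minimal Python sketch; the three yearly growth factors are made up). A growth factor of 1.10 means +10% for the year, and the geometric mean gives the constant yearly factor that would produce the same overall growth:

```python
import math

growth_factors = [1.10, 0.95, 1.20]  # hypothetical yearly growth factors
n = len(growth_factors)

# nth root of the product of the n values
geo_mean = math.prod(growth_factors) ** (1 / n)

# Average yearly growth rate implied by the geometric mean
avg_growth_rate = geo_mean - 1

# For comparison: the arithmetic mean overstates the average growth
arith_mean = sum(growth_factors) / n
```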
Variance
A measure of variability that uses all the data. It is based on the deviations about the mean (the difference between the value of each observation and the mean). The deviations are squared and then averaged (dividing by n - 1 for a sample), so the variance measures how far a set of numbers is spread out from its average value
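Example (a minimal Python sketch, assuming the sample variance, which divides by n - 1; the five data values are made up):

```python
import math

data = [46, 54, 42, 46, 32]  # hypothetical sample
n = len(data)
mean = sum(data) / n

# Deviation about the mean for each observation, then squared
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1
sample_variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the positive square root of the variance
sample_std = math.sqrt(sample_variance)
```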
Online Analytical Processing (OLAP)
A method to analyze multidimensional data from many different perspectives. OLAP enables users to identify issues and opportunities and perform trend analysis
Data Entity
A person, place, or thing for which data is collected, stored, or maintained
Data Query
A request for information with certain characteristics
Value Chain
A series (chain) of activities that an organization performs to transform inputs into outputs in such a way that the value of the input is increased.
Information System
A set of interrelated elements that collect, process, store, and disseminate data and information and provide a feedback mechanism to monitor and control its operation to make sure it continues to meet its goals and objectives
Process
A set of logically related tasks performed to achieve a defined outcome
Computer-based information system (CBIS)
A single set of hardware, software, databases, networks, people, and procedures that are configured to collect, manipulate, store, and process data into information.
Sample data
A subset of the population
Frequency Distributions for Categorical Data
A summary of data that shows the frequency of observations in each of several non-overlapping classes (bins)
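Example (a minimal Python sketch; the list of soft-drink purchases is made up). It builds the frequency distribution and, from it, the relative and percent frequencies:

```python
from collections import Counter

# Hypothetical categorical data: one soft-drink purchase per observation
purchases = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
             "Pepsi", "Coke", "Sprite", "Pepsi", "Coke"]

n = len(purchases)
frequency = Counter(purchases)  # count of observations per class

relative_frequency = {k: v / n for k, v in frequency.items()}
percent_frequency = {k: 100 * v / n for k, v in frequency.items()}
```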
Scatter chart
A useful graph for analyzing the relationship between two variables. Positive relationship: as one variable increases, the other generally increases as well
Word cloud
A visual depiction of a set of words that have been grouped together because of the frequency of their occurrence.
Tactical (Managerial) Decisions
About how the organizations should achieve the goals and objectives set by strategy
Operational Decisions
Affect how the firm runs its day-to-day operations
Technology infrastructure
All the hardware, software, databases, telecommunications, people, and procedures that are configured to collect, manipulate, store, and process data into information.
Database as a Service (DaaS)
An arrangement where the database is stored on a service provider's servers and accessed by the service subscriber over a network, typically the Internet, with the database administration handled by the service provider.
Database
An organized collection of data
Data with bell-shaped distribution
Approx. 68% will be within 1 standard deviation, approx. 95% of the data will be within 2 standard deviations, and almost all data will be within 3 standard deviations.
Categorical data
Arithmetic operations can't be performed on them
Utility Theory
Assigns values to outcomes based on the decision maker's attitude toward risk, loss, and other factors
Knowledge
Awareness and understanding of a set of information and the ways it can be useful to support a task. The process of defining relationships among data to create useful information requires knowledge
Spreadsheets
Business managers can often import data into a spreadsheet program, which can then perform operations on the data based on formulas created by the end user. Spreadsheets are also used to create reports and graphs based on that data. Excel Scenario Manager: used to perform "what-if" analysis to evaluate various alternatives
Range
Can be found by subtracting the smallest value from the largest value in the data set. Drawback: range is based on only two of the observations and thus is highly influenced by extreme values
Cross-sectional data
Collected from several entities at the same, or approximately the same, point in time
Time Series data
Collected over several time periods
Random sampling
Collecting a sample in a way that ensures that (i) each element selected comes from the same population and (ii) each element is selected independently
Data Dashboard
Collections of tables, charts, maps, and summary statistics that are updated as new data become available
Simulation Optimization
Combines the use of probability and statistics to model uncertainty with optimization techniques to find good decisions in highly complex and highly uncertain settings
Predictive Analytics
Consists of techniques that use models constructed from past data to predict the future or ascertain the impact of one variable on another. Examples include: Linear regression, time series analysis, some data-mining techniques, and simulation (risk-analysis)
Data Cubes
Contain numeric facts called measures which are categorized by dimensions, such as time and geography. Can be built to summarize unit sales of a specific item on a specific day for a specific store
Before building a database
Content, access, logical structure, physical organization, archiving, security
Information
A collection of data organized so that it has value beyond the individual facts themselves. Data by itself isn't very useful
Enterprise data modeling
Data modeling done at the level of the entire enterprise
Entity relationship (ER) diagrams
Data models that use basic graphical symbols to show the organization of and relationship between data
Sources of Data
Data necessary to analyze a business problem or opportunity can often be obtained with an appropriate study (experimental or observational)
Hierarchy of Data
Database > Files > Records > Fields > Characters (8 bits)
Covariance
Descriptive measure of the linear association between two variables. If the covariance is > 0 it indicates a positive relationship. If it is near 0 the variables aren't linearly related. If it is < 0 they're negatively related. In Excel: =COVARIANCE.S(ARRAY,ARRAY)
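Example (a minimal Python sketch of the sample covariance, which divides by n - 1 as Excel's COVARIANCE.S does; sample_covariance and the data are made up for illustration):

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises with x
y_down = [10, 8, 6, 4, 2]   # falls as x rises

cov_positive = sample_covariance(x, y_up)    # > 0: positive relationship
cov_negative = sample_covariance(x, y_down)  # < 0: negative relationship
```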
Coefficient of Variation
Descriptive statistic that indicates how large the standard deviation is relative to the mean. Expressed as a percentage
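Example (a minimal Python sketch, assuming the sample standard deviation; the data values are made up):

```python
import statistics

data = [46, 54, 42, 46, 32]  # hypothetical sample

mean = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation (n - 1)

# Coefficient of variation: standard deviation as a percentage of the mean
coefficient_of_variation = std_dev / mean * 100
```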
Supply Chain Management (SCM)
Encompasses all the activities required to get the right product into the right consumer's hands in the right quantity at the right time and at the right cost
Descriptive Analytics
Encompasses the set of techniques that describes what has happened in the past. Examples include: data queries, reports, descriptive statistics, and data visualization including data dashboards, some data-mining techniques, and basic what-if spreadsheet models
Business Problems
Every business has problems. Ex. How much to produce? How much to buy? When to open the store? Products to promote?
ETL process
Extract, transform, and load
Identifying Outliers
Extreme values in a data set. Can be identified using z-scores. Any data value with a z-score less than -3 or greater than +3 is an outlier. Such data values can be reviewed to determine their accuracy and whether they belong in the data set.
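Example (a minimal Python sketch, assuming z-scores are computed with the sample mean and sample standard deviation; the 20 data values are made up, with one deliberately extreme value):

```python
import statistics

# Hypothetical data: 19 values near 50 and one extreme value
data = [48, 52, 50, 49, 51, 47, 53, 50, 50, 49,
        51, 52, 48, 50, 51, 49, 50, 52, 48, 120]

mean = statistics.mean(data)
std_dev = statistics.stdev(data)

# z-score: distance from the mean in units of standard deviations
z_scores = [(x - mean) / std_dev for x in data]

# Flag values whose z-score is below -3 or above +3 for review
outliers = [x for x, z in zip(data, z_scores) if z < -3 or z > 3]
```

Note that a small data set cannot produce a z-score beyond 3 no matter how extreme one value is, so this rule is mainly useful for larger samples.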
Box plot
Graphical summary of the distribution of data, developed from the quartiles for a data set. Limits are located using the IQR: Lower limit = Q1 - 1.5(IQR) and Upper limit = Q3 + 1.5(IQR)
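Example (a minimal Python sketch of the box plot limits, placed 1.5 times the IQR below Q1 and above Q3; the data are made up, and statistics.quantiles uses one of several quartile conventions, so other tools may give slightly different quartiles):

```python
import statistics

data = [12, 15, 18, 20, 22, 24, 25, 27, 30, 33, 75]  # hypothetical values

# Quartiles (default "exclusive" method of statistics.quantiles)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits would be drawn as individual outlier points
outside_limits = [x for x in data if x < lower_limit or x > upper_limit]
```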
Strategic Decisions
High level issues concerned with overall directions of the organization. Define goals and strategies.
Neural computing
Historical data is examined for patterns that are then used to make predictions
Case-based reasoning
Historical if-then-else cases are used to recognize patterns
Business Intelligence
Includes a wide range of applications, practices, and technologies for the extraction, transformation, integration, visualization, analysis, interpretation, and presentation of data to support improved decision making. Data used in BI is often pulled from multiple sources and may come from sources internal and external to the organization
Group IS
Includes information systems that improve communications and support collaboration among members of a work group
Personal IS
Includes information systems that improve the productivity of individual users
Enterprise IS
Includes information systems that organizations use to define structured interactions among their own employees and/or external customers, suppliers, government agencies, etc.
Prescriptive Analytics
Indicates a best course of action to take. Predictive models provide a forecast or prediction but do not provide a decision; a forecast or prediction combined with a rule becomes a prescriptive model. Examples include: rule-based models, portfolio models in finance, supply network design models in operations, and price-markdown models in retailing (optimization models)
Data Preparation
Uses descriptive statistics and data visualization to treat missing data and to identify erroneous data and outliers
Supply Chain
Key value chain in a manufacturing organization
Non-experimental (observational)
Make no attempt to control the variables of interest
Correlation Coefficient
Measures the linear relationship between two variables. Not affected by the units of measurement for x and y. If it's less than 0 the relationship is negative linear, if it's near 0 there is no linear relationship, and if it's greater than 0 the relationship is positive linear. In Excel: =CORREL(ARRAY,ARRAY)
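Example (a minimal Python sketch of the correlation coefficient; the correlation function and the data are made up for illustration). The last lines check that changing the units of x leaves the result unchanged:

```python
import math

def correlation(x, y):
    """Correlation coefficient computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)  # roughly 0.77: positive linear relationship

# Same x measured in different units (e.g. meters -> centimeters)
x_rescaled = [xi * 100 for xi in x]
r_rescaled = correlation(x_rescaled, y)
```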
Z-Score
Measures the relative location of a value in the data set. Helps to determine how far a value is from the mean relative to the data set's standard deviation. Often called the standardized value.
Optimization Models
Models that give the best decision subject to the constraints of the situation
Mean/Arithmetic Mean
Most common measure of location and average value for a variable
Frequency distributions for quantitative data
Must be more careful in defining the non-overlapping bins to be used in distribution
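Example (a minimal Python sketch, assuming integer data and equal-width bins 10-19, 20-29, 30-39; the data and the bin width are made up). Each value falls in exactly one bin, so the bins do not overlap:

```python
from collections import Counter

data = [12, 15, 18, 20, 22, 24, 25, 27, 30, 33]  # hypothetical values
bin_width = 10

# Map each value to the lower limit of its bin, then count per bin
counts = Counter((x // bin_width) * bin_width for x in data)

# Text version of a histogram: one row per bin, one * per observation
for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower}-{upper}: {'*' * counts[lower]}")
```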
Legitimately Missing Data
Naturally missing data. Generally no remedial action taken
Quantitative data
Numeric and arithmetic operations can be performed on them
Managers
Plan, coordinate, organize, and lead their organizations to better performance
Types of data
Population and sample data, Quantitative and categorical data, and cross-sectional and time-series data
Database activities
Providing a user view of the database, adding and modifying data, storing and retrieving data, manipulating data, and generating reports
Domain
Range of allowable values for a data attribute
Histogram Skewness
Skewed in the direction in which the tail extends farther (long tail to the right = skewed right; long tail to the left = skewed left)
Three types of business decisions
Strategic, Tactical (Managerial), and Operational
Percent frequency
Summarizes the percent frequency of the data for each bin
Relative frequency
Tabular summary of data showing the relative frequency for each bin
Data lake
Takes a "store everything" approach to big data, saving all the data in its raw, unaltered form. Also called enterprise data hub. Raw data is available when the users decide just how they want to use the data. Only when the data is accessed for a specific analysis is it extracted from the data lake
Data
The facts and figures collected, analyzed, and summarized for presentation and interpretation. Raw facts: Alphanumeric, audio, image, and video
Standard Deviation
The positive square root of the variance, measured in the same units as the original data, used to quantify the amount of variation or dispersion of a set of data values
Data Visualization Tools
The presentation of data in a pictorial or graphical format. Representing data in a visual form brings immediate impact to dull and boring numbers
Population data
The set of all elements of interest in a particular study
Business Analytics
The scientific process of transforming data into insights for making better decisions. Creates insights from data, improves our ability to more accurately forecast for planning, helps us quantify risk, and yields better alternatives through analysis and optimization.
Data Item
The specific value of an attribute
Imputation
The systematic replacement of missing values with values that seem reasonable
Missing at random (MAR)
The tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data. Ex. Diagnostic tests are missing when the patient is too sick to undergo the procedure
Missing completely at random (MCAR)
The tendency for an observation to be missing the value for some variable is entirely random; whether data are missing does not depend on either the value of the missing data or any other variable in the data
Missing not at random (MNAR)
The tendency for the value of a variable to be missing is related to the value that's missing. Ex. high income individuals don't want to report income
Data Mining
The use of analytical techniques for better understanding patterns and relationships that exist in large data sets
Percentile
The value of a variable at which a specified (approximate) percentage of observations are below that value. The pth percentile tells us the point in the data where approx. p percent of the observations have values less than the pth percentile and approx. (100-p) percent of the observations have values greater than the pth percentile.
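Example (a minimal Python sketch, assuming the location rule position = (p/100)(n + 1) with linear interpolation between neighboring values; this is one of several percentile conventions, and the percentile function and data are made up for illustration):

```python
def percentile(data, p):
    """pth percentile using location (p/100)*(n+1), 1-based, interpolated."""
    ordered = sorted(data)
    n = len(ordered)
    location = p / 100 * (n + 1)
    lower_index = int(location)        # whole part of the position
    fraction = location - lower_index  # how far toward the next value
    if lower_index < 1:
        return ordered[0]
    if lower_index >= n:
        return ordered[-1]
    below = ordered[lower_index - 1]
    above = ordered[lower_index]
    return below + fraction * (above - below)

data = [12, 15, 18, 20, 22, 24, 25, 27, 30, 33, 75]  # hypothetical values

p50 = percentile(data, 50)  # position 6 -> the 6th smallest value
p80 = percentile(data, 80)  # position 9.6 -> between the 9th and 10th values
```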
Common approaches to decision making
Tradition ("we've always done it this way"), intuition ("gut feeling"), rules of thumb (e.g., offer two sections of BCOR 2020 each semester), and using relevant data
The Database Approach
Traditional approach to data management: Each distinct operational system used data files dedicated to that system. The database approach: Information systems share a pool of related data, which offers the ability to share data and information resources; a database management system (DBMS) is required
Challenges of decision making
Uncertainties and enormous number of alternatives
Illegitimately missing data
Unnaturally occurring missing data
Simulation (Risk-Analysis)
Use of probability and statistics to construct a computer model to study the impact of uncertainty on a decision
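Example (a minimal Python sketch of a risk-analysis simulation; the price, unit cost, order quantity, and the normal demand distribution are all made-up assumptions). It studies how uncertain demand affects the profit from a fixed ordering decision:

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

PRICE = 10       # hypothetical selling price per unit
UNIT_COST = 6    # hypothetical cost per unit ordered
ORDER_QTY = 100  # the decision being evaluated

def simulate_profit():
    # Uncertain input: demand assumed normal (mean 100, sd 20), floored at 0
    demand = max(0, random.gauss(100, 20))
    units_sold = min(demand, ORDER_QTY)
    return PRICE * units_sold - UNIT_COST * ORDER_QTY

profits = [simulate_profit() for _ in range(10_000)]

average_profit = sum(profits) / len(profits)
probability_of_loss = sum(p < 0 for p in profits) / len(profits)
```

Repeating the run for several order quantities would show how the decision changes the risk profile.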
Cumulative distributions
Uses the number of classes, class widths and class limits developed for the frequency distribution, shows the number of data items with values less than or equal to the upper class limit of each class
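Example (a minimal Python sketch; the class limits and frequencies are made up). The cumulative frequency for each class is the running total of the frequencies up to and including that class:

```python
from itertools import accumulate

# Hypothetical frequency distribution with classes 10-19, 20-29, 30-39
upper_class_limits = [19, 29, 39]
frequencies = [3, 5, 2]

# Running total: number of items less than or equal to each upper limit
cumulative_frequencies = list(accumulate(frequencies))

cumulative_table = list(zip(upper_class_limits, cumulative_frequencies))
```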
Median
Value in the middle when the data are arranged in ascending order; with an even number of observations, take the mean of the middle two values. The mean is the more commonly used measure of central location, but it is influenced by extremely small and large data values, so when a data set contains extreme values the median is preferred
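Example (a minimal Python sketch; the values are made up). Adding one extreme value moves the mean a lot but the median only slightly:

```python
import statistics

typical = [30, 32, 35, 38, 40]   # hypothetical values
with_extreme = typical + [400]   # same data plus one extreme value

mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)

mean_extreme = statistics.mean(with_extreme)
# Even number of values: median is the mean of the middle two (35 and 38)
median_extreme = statistics.median(with_extreme)
```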
Mode
Value that occurs the most frequently in a given data set. Multimodal data: data with multiple modes. Bimodal data: data that contains exactly two modes.
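Example (a minimal Python sketch; the data are made up). statistics.multimode returns every value tied for the highest frequency, so it handles bimodal and multimodal data:

```python
import statistics

one_mode = [1, 2, 2, 3, 4]
two_modes = [1, 2, 2, 3, 3, 4]

modes_single = statistics.multimode(one_mode)   # one most frequent value
modes_bimodal = statistics.multimode(two_modes) # two values tie: bimodal
```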
Experimental
Variable of interest is first identified then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest
Quartiles
When data is divided into 4 equal parts. Each part contains approx. 25% of observations. Second quartile = the median. The difference between the third and first quartiles is the IQR (interquartile range).
Empirical rule
When the distribution of the data exhibits a symmetrical bell-shape the empirical rule can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean
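Example (a minimal Python sketch that checks the rule on simulated bell-shaped data; the mean of 100, standard deviation of 15, and sample size are made-up assumptions, and share_within is a hypothetical helper):

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Simulated bell-shaped data
data = [random.gauss(100, 15) for _ in range(20_000)]

mean = sum(data) / len(data)
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * std_dev for x in data) / len(data)

# Expect roughly 0.68, 0.95, and nearly 1.0 for k = 1, 2, 3
shares = {k: share_within(k) for k in (1, 2, 3)}
```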
Relational Database Model
a simple but highly useful way to organize data into collections of two-dimensional tables called relations. Each row in the table represents an entity and each column represents an attribute of that entity
SQL Databases
SQL is a special-purpose programming language for accessing and manipulating data stored in a relational database. SQL databases conform to ACID (atomicity, consistency, isolation, and durability) properties. In 1986 SQL was adopted by ANSI as the standard query language for relational databases
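Example (a minimal sketch using Python's built-in sqlite3 module as a stand-in relational database; the notes do not name a DBMS, and the customer table, its columns, and its rows are made up). It shows a table with a primary key, a SQL query, and the database rejecting a duplicate key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database

# Each row is an entity (a customer); each column is an attribute
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"  # uniquely identifies each record
    " name TEXT NOT NULL,"
    " city TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ana", "Boulder"), (2, "Ben", "Denver"), (3, "Cy", "Boulder")],
)

# Data query: request the records with a certain characteristic
boulder_names = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Boulder",)
).fetchall()

# A second record with customer_id 1 violates the primary key
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Dup', 'Golden')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.close()
```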
Association analysis
a specialized set of algorithms sorts through data and forms statistical rules about relationships among the items
Data mart
a subset of a data warehouse that is used by small- and medium-sized businesses and departments within large companies to support decision making. A specific area in the data mart might contain greater detailed data than the data warehouse
Reporting and Querying Tools
can present data in an easy to understand fashion via formatted data, graphs, charts. many tools enable users to make their own data requests and format the results without the need for additional help from IT organizations
Big data
extremely large and complex datasets, typically characterized as being of high volume, variety, and velocity
Database Administrators (DBAs)
Skilled and trained IS professionals who work with users to define their data needs, apply database programming languages to craft a set of databases to meet those needs, test and evaluate databases, implement changes to improve database performance, and ensure that data is secure from unauthorized access