Business analytics
calculation of distances
- Euclidean distance - Squared (or absolute) Euclidean distance - City block (Manhattan) distance - Chebyshev distance - Mahalanobis distance (D²)
IBM describes the phenomenon of big data through the four Vs:
1) Volume 2) Velocity 3) Variety 4) Veracity
observation, or record
A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database. For example, Jack's record is represented by a row: student ID; first name; last name; address; phone number; major; status.
observation
A set of values corresponding to a set of variables.
cumulative frequency distributions
A tabular summary of quantitative data showing the number of data values that are less than or equal to the upper class limit of each bin
Decision Analysis
A technique used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events.
tertiary colors
A tertiary color is a color made by mixing one primary color with one secondary color. These colors can be divided into cool and warm colors.
charts
A visual method for displaying data; also called a graph or a figure.
random sampling
Collecting a sample that ensures that (1) each element selected comes from the same population and (2) each element is selected independently.
cross-sectional data
Data collected at the same or approximately the same point in time.
unstructured data
Data that do not exist in a fixed location and can include text documents, PDFs, voice messages, and emails. Example: a form where you have to type in the exact address yourself.
structured data
Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion. Example: a form where you enter an address by choosing from a drop-down menu.
time series data
Data that are collected over a period of time (minutes, hours, days, months, years, etc.). For example, John's GPA from freshman through senior year is time series data. Graphs of time series data help analysts understand what happened in the past, identify trends over time, and project future levels for the time series.
variation
Differences in values of a variable over observations.
Internet of Things (IoT)
The technology that allows data collected from sensors in all types of machines to be sent over the Internet to repositories where it can be stored and analyzed.
Missing completely at random (MCAR)
The tendency for an observation to be missing a value of some variable is entirely random. For example, if missing value for a question on a survey is completely unrelated to the value that is missing and is also completely unrelated to the value of any other question on the survey, the missing value is MCAR.
Missing Not at Random (MNAR)
The tendency for an observation to be missing a value of some variable is related to the missing value.
Missing at random (MAR)
The tendency for an observation to be missing a value of some variable is related to the value of some other variable(s) in the data. For example if the responses to one survey question collected by a specific employee were lost due to a data entry error, then the treatment of the missing data may be less critical.
growth factor
The percentage increase of a value over a period of time is calculated using the formula (growth factor − 1), expressed as a percentage. A growth factor less than 1 indicates negative growth, whereas a growth factor greater than 1 indicates positive growth. The growth factor cannot be less than zero.
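As a rough illustration of the relationship between a growth factor and a percentage change, here is a minimal Python sketch. The function names and the sample values are hypothetical, chosen only for this example.

```python
def growth_factor_between(old_value: float, new_value: float) -> float:
    """Growth factor for one period: the new value divided by the old value."""
    return new_value / old_value


def percent_change(growth_factor: float) -> float:
    """Percentage change implied by a growth factor: (growth factor - 1) * 100."""
    if growth_factor < 0:
        raise ValueError("a growth factor cannot be less than zero")
    return (growth_factor - 1) * 100


# Hypothetical values: sales rise from 200 to 250, then fall from 250 to 125.
up = growth_factor_between(200, 250)    # greater than 1: positive growth
down = growth_factor_between(250, 125)  # less than 1: negative growth
print(percent_change(up), percent_change(down))
```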
data ink ratio
The ratio of the amount of ink used in a table or chart that is necessary to convey information to the total amount of ink used in the table and chart. Ink used that is not necessary to convey information reduces the data-ink ratio.
simulation
The use of probability and statistics to construct a computer model to study the impact of uncertainty on the decision at hand. For example, banks often use simulation to model investment and default risk in order to stress-test financial models. Simulation is also often used in the pharmaceutical industry to assess the risk of introducing a new drug.
simulation optimization
The use of probability and statistics to model uncertainty, combined with optimization techniques, to find good decisions in highly complex and highly uncertain settings.
population
The set of all elements of interest in a particular study.
Which of the following is not present in a time series?
operational variations
The percent of misclassified records out of the total records in the validation data is known as the
overall error rate
two approaches to drawing a conclusion in a hypothesis test are
p-value and critical value.
A forecast is defined as
prediction of future values of a time series.
A Type I error is committed when
a true null hypothesis is rejected.
The value of the ___________ is used to estimate the value of the population parameter.
sample statistic
_______ are used in the pharmaceutical industry to assess the risk of introducing a new drug.
simulations
The __________ is a measure of the error that results from using the estimated regression equation to predict the values of the dependent variable in the sample.
sum of squares due to error (SSE)
Data mining methods for classifying or estimating an outcome based on a set of input variables is referred to as
supervised learning
In preparing categorical variables for analysis, it is usually best to
convert the categories to binary dummy variables. Typically, it is best to encode categorical variables with 0-1 dummy variables. Using 0-1 dummy variables to encode categorical variables with many different categories results in a large number of variables. In some cases, the number of categories may be reduced by combining categories.
__________ compares the number of actual Class 1 observations identified if considered in decreasing order of their estimated probability of being in Class 1 with the number identified if randomly classified.
cumulative lift
Data dashboards are a type of _________ analytics.
descriptive
experimental study
a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest. For example, if a pharmaceutical firm conducts an experiment to learn about how a new drug affects blood pressure, then blood pressure is the variable of interest. The dosage level of the new drug is another variable that is hoped to have a causal effect on blood pressure.
unnatural and warm colors
are considered to draw the reader's attention.
A test set is the data set used to
estimate performance of the final model on unseen data.
Bayes' Theorem decision tree
excel example.
color scales
Format cells with different colors based on the relative value of a cell compared to other selected cells. Two-color or three-color scales can be applied.
Data are considered quantitative data
if they are numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on them.
Which statement is true about mutually exclusive events?
If event A and event B cannot occur at the same time, they are called mutually exclusive.
All the events in the sample space that are not part of the specified event are called
the complement of the event, denoted Ac
Association Rules
If-then statements that convey the likelihood of certain items being purchased together. Widely used in marketing. A rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
A(n) __________ is a visual representation that shows which entities affect others in a model.
influence diagram
the idea of using data to create
information, then knowledge, then strategy. Data > information > knowledge > plan to guide actions or key decisions.
Strategy
is a plan of action or policy to achieve your goals.
cluster analysis
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups. It is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.
stemming
The process of converting a word to its stem or root word. For example, stemming would drop the "ing" from "stacking" and the "ed" from "stacked" and place only "stack" in the list of words to be tracked.
sample space
the set of all outcomes of an experiment
when moving from information to knowledge
why questions
time series analysis and forecasting is one of the most
widely used analytics in business and economics.
sample mean
x̄ = (Σ xi) / n = (x1 + x2 + ... + xn) / n
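The sample mean formula can be sketched in a few lines of Python. The function name and the data values below are hypothetical and only illustrate the calculation.

```python
def sample_mean(values):
    """Sum the observations and divide by the number of observations n."""
    if not values:
        raise ValueError("the sample mean needs at least one observation")
    return sum(values) / len(values)


# Hypothetical sample of n = 4 observations.
print(sample_mean([2, 4, 6, 8]))
```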
Analytics is generally thought to comprise three broad categories of techniques:
- descriptive analytics - predictive analytics - prescriptive analytics
the increase in the use of data mining techniques in businesses has been caused largely by three events:
1) the explosion in the amount of data being produced and electronically tracked. 2) the ability to electronically warehouse these data. 3) the affordability of computer power to analyze the data.
How many Class 1 observations are correctly classified as Class 1 in the confusion matrix below?
Actual class | Predicted 1 | Predicted 0
1 | 221 | 100
0 | 30 | 3,000
221, 1.1
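For readers who want to check the counts, the sketch below stores the same confusion matrix in Python and derives the overall error rate (misclassified records divided by total records). The dictionary layout and function names are assumptions made for this illustration, not part of any library.

```python
# Keys are (actual class, predicted class); values are record counts
# taken from the confusion matrix in the question above.
confusion = {(1, 1): 221, (1, 0): 100, (0, 1): 30, (0, 0): 3000}


def overall_error_rate(cm):
    """Share of all records whose predicted class differs from the actual class."""
    total = sum(cm.values())
    misclassified = sum(n for (actual, predicted), n in cm.items() if actual != predicted)
    return misclassified / total


def class_error_rate(cm, cls):
    """Share of the records that actually belong to `cls` that were misclassified."""
    actual_total = sum(n for (actual, _), n in cm.items() if actual == cls)
    wrong = sum(n for (actual, predicted), n in cm.items()
                if actual == cls and predicted != cls)
    return wrong / actual_total


print("Class 1 correctly classified:", confusion[(1, 1)])
print("Overall error rate:", round(overall_error_rate(confusion), 4))
print("Class 1 error rate:", round(class_error_rate(confusion, 1), 4))
print("Class 0 error rate:", round(class_error_rate(confusion, 0), 4))
```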
data dashboard
A collection of tables, charts, and maps used to help management monitor specific aspects of the company's performance related to their decision-making responsibilities. For corporate-level managers, daily data dashboards might summarize sales by region, current inventory levels, and other company-wide metrics; front-line managers may view dashboards that contain metrics related to staffing levels, local inventory levels, and short-term sales forecasts.
tactical decision
A decision concerned with how the organization should achieve the goals and objectives set by its strategy. Tactical decisions are the responsibility of mid-level management and typically span a year.
venn diagram
A graphical representation of the sample space and operations involving events, in which the sample space is represented by a rectangle and events are represented as circles within the sample space. Event A is the area within the circle; the sample space (S) is depicted as the rectangle; the area within the rectangle but outside the circle is the complement of event A (Ac). P(A) + P(Ac) = 1.
box plot
A graphical summary of data based on the quartiles of a distribution. also known as box-and-whisker plots.
multiplication law
A law used to compute the probability of the intersection of events. For two events A and B, the multiplication law is P(A∩B) = P(B)* P(A|B) or P(A∩B) = P(A)*P(B|A). For two independent events, it reduces to P(A∩B) = P(A) * P(B).
presence/absence or binary term-document matrix.
A matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or absence of a particular word in a particular document ( 1= present and 0 = not present)
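A presence/absence matrix of this kind can be sketched in Python as follows. The three short documents are hypothetical, and the sketch assumes the text is already clean enough to split on whitespace; real text would need tokenization first.

```python
def binary_term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per word,
    with 1 if the word is present in the document and 0 if it is not."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    matrix = [[1 if term in doc else 0 for term in vocabulary] for doc in tokenized]
    return vocabulary, matrix


# Hypothetical customer comments.
docs = ["great service", "service was slow", "great food great price"]
vocabulary, matrix = binary_term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Note that "great" appears twice in the third document but is still recorded as 1, because the matrix tracks presence rather than frequency.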
mean (arithmetic mean)
A measure of central location computed by summing the data values and dividing by the number of observations; also called the average value. The sample mean is denoted by x̄; the population mean is denoted by µ.
mode
A measure of central location defined as the value that occurs with greatest frequency. Occasionally the greatest frequency occurs at two or more different values, in which case more than one mode exists. If data contain at least two modes, we say that they are multimodal. A special case of multimodal data occurs when the data contain exactly two modes; in such cases we say that the data are bimodal
median
A measure of central location provided by the value in the middle when the data are arranged in ascending order. For example, when n = 12 (an even number), the median is the average of the middle two values, here 199,500 and 208,000: (199,500 + 208,000) / 2 = 203,750.
Skewness
A measure of the lack of symmetry in a distribution.
range
A measure of variability defined to be the largest value minus the smallest value. The range can be calculated in Excel using the MAX and MIN functions.
Bayes' Theorem
A method used to compute posterior probabilities. It starts from prior probability estimates for specific events of interest and provides a means of revising them into posterior probabilities.
Probability
A numerical measure of the likelihood that an event will occur. it can be used as a measure of the uncertainty associated with an event.
rule based model
A prescriptive model that is based on a rule or set of rules.
addition law
A probability law used to compute the probability of the union of events. For two events A and B, the addition law is P(A∪B) = P(A) + P(B) − P(A∩B). For two mutually exclusive events, P(A∩B) = 0, so P(A∪B) = P(A) + P(B). The addition law is helpful when we are interested in knowing the probability that at least one of two events will occur. That is, with events A and B we are interested in knowing the probability that event A or event B occurs or both events occur.
random experiment
A process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined by chance.
data query
A request for information with certain characteristics from a database.
empirical rule
A rule that can be used to compute the percentage of data values that must be within 1, 2, or 3 standard deviations of the mean for data that exhibit a bell-shaped distribution. Approximately 68% of the data values will be within 1 standard deviation of the mean. Approximately 95% of the data values will be within 2 standard deviations of the mean. Almost all of the data values (99.7%) will be within 3 standard deviations of the mean.
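The three percentages can be checked numerically. The sketch below assumes that "bell-shaped" means exactly normally distributed and uses the error function from Python's standard library; the function name is hypothetical.

```python
import math


def share_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

For data that are only roughly bell-shaped, these shares are approximations, which is why the rule is stated with "approximately".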
crosstabulation
A tabular summary of data for two variables. The classes of one variable are represented by the rows; the classes for the other variable are represented by the columns.
relative frequency distribution
A tabular summary of data showing the fraction or proportion of observations in each of several nonoverlapping categories or classes.
frequency distribution
A tabular summary of data showing the number (frequency) of data values in each of several nonoverlapping bins.
percent frequency distribution
A tabular summary of data showing the percentage of observations in each of several nonoverlapping bins/classes.
dendrogram
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering.
z-score, also called standardized value
A value computed by dividing the deviation about the mean (xi − x̄) by the standard deviation s. A z-score is referred to as a standardized value and denotes the number of standard deviations that xi is from the mean. Excel function: =STANDARDIZE. zi = (xi − x̄) / s, where zi = the z-score for xi, x̄ = the sample mean, and s = the sample standard deviation.
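A minimal Python sketch of the z-score calculation follows, using the sample standard deviation (n − 1 in the denominator) from the standard library. The function name and data are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize each value: subtract the sample mean, then divide by
    the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in values]


# Hypothetical sample: mean 2 and sample standard deviation 1.
print(z_scores([1, 2, 3]))
```

A z-score of 0 means the value equals the mean; negative and positive z-scores lie below and above the mean, respectively.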
percentile
A value such that approximately p% of the observations have values less than the pth percentile; hence, approximately (100 − p)% of the observations have values greater than the pth percentile. The 50th percentile is the median.
market basket analysis
Analysis of items frequently co-occurring in transactions (such as purchases). Example: in a rule stating that customers who buy bread and jelly also buy peanut butter, bread and jelly are the antecedents and peanut butter is the consequent.
data scientists
Analysts trained in both computer science and statistics who know how to effectively process and analyze massive amounts of data.
Descriptive analytics
Analytical tools that describe what has happened and why it happened. Examples are data queries, reports, descriptive statistics, data visualization including data dashboards, some data-mining techniques, and basic what-if spreadsheet models.
big data
Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.
volume
Because data are collected electronically, we are able to collect more of it. To be useful, these data must be stored, and this storage has led to vast quantities of data. Many companies now store in excess of 100 terabytes of data
number of bins
Bins are formed by specifying the ranges used to group the data. As a general guideline, we recommend using from 5 to 20 bins. For a small number of data items, as few as five or six bins may be used to summarize the data. For a larger number of data items, more bins are usually required.
complementary colors
Colors located directly opposite one another on the color wheel
categorical data
Data for which categories of like items are identified by labels or names. Arithmetic operations cannot be performed on categorical data.
quantitative data
Data for which numerical values are used to indicate magnitude, such as how many or how much. Arithmetic operations such as addition, subtraction, and multiplication can be performed on quantitative data.
unstructured data (text data)
Data, such as text, audio, or video, that cannot be stored in a traditional structured database. Unlike structured data, text data may contain slang, typos, all caps, sarcasm, etc.
The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:
Determine the number of nonoverlapping bins. Determine the width of each bin. Determine the bin limits.
when we collect data we are:
Gathering the past observed values of a variable. By collecting these past values, our goal is to learn more about variation, for example in a particular index or company.
euclidean distance
Geometric measure of dissimilarity between observations based on the Pythagorean theorem. Let u = (u1, u2, ..., uq) and v = (v1, v2, ..., vq) each comprise measurements of q variables. The distance between u and v is: duv = sqrt[(u1 − v1)^2 + (u2 − v2)^2 + ... + (uq − vq)^2]. Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values. Euclidean distance is highly influenced by the scale on which variables are measured.
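The distance calculation can be sketched in Python as below: each pair of corresponding measurements is subtracted, squared, and summed before taking the square root. The function name and the example observations are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Square root of the sum of squared differences across the q variables."""
    if len(u) != len(v):
        raise ValueError("both observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical observations measured on q = 2 variables.
print(euclidean_distance((0, 0), (3, 4)))

# Scale matters: expressing the second variable in units 1,000 times smaller
# (so its numbers are 1,000 times larger) lets that variable dominate the distance.
print(euclidean_distance((0, 0), (3, 4000)))
```

Because of this sensitivity to scale, variables are often standardized (for example with z-scores) before distances are computed.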
decision making can be defined as the following process:
Identify and define the problem. Determine the criteria that will be used to evaluate alternative solutions. Determine the set of alternative solutions. Evaluate the alternatives. Choose an alternative.
__________ is the most critical step of the decision-making process.
Identifying and defining the problem.
variety
In addition to the sheer volume and speed with which companies now collect data, more complicated types of data are now available and are proving to be of great value to businesses. Text data are collected by monitoring what is being said about a company's products or services on social media platforms such as Twitter. Audio data are collected from service calls (on a service call, you will often hear "this call may be monitored for quality control"). Video data collected by in-store video cameras are used to analyze shopping behavior. Analyzing information generated by these nontraditional sources is more complicated in part because of the processing required to transform the data into a numerical form that can be analyzed.
prior probability
Initial estimate of the probabilities of events.
what falls under predictive analysis and is referred to as risk analysis?
Linear regression, time series analysis, some data-mining techniques, and simulation
location of the pth percentile
Lp = (p/100)(n + 1)
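One way to turn the location formula into a percentile value is sketched below. The source gives only the location; the linear interpolation between neighboring sorted values when the location is not a whole number is an assumption of this sketch, as are the function names and the sample data.

```python
def percentile_location(p: float, n: int) -> float:
    """Position (counting from 1) of the pth percentile in sorted data of size n."""
    return (p / 100) * (n + 1)


def percentile(values, p: float) -> float:
    """pth percentile, interpolating between neighbors (an assumption of this sketch)."""
    ordered = sorted(values)
    n = len(ordered)
    # Keep the location inside the data so extreme p values do not run off the ends.
    location = min(max(percentile_location(p, n), 1), n)
    lower = int(location)          # whole-number part of the position
    fraction = location - lower    # how far to move toward the next value
    if lower == n:
        return float(ordered[-1])
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical data: 12 sorted values 10, 20, ..., 120.
data = list(range(10, 130, 10))
print(percentile_location(50, len(data)), percentile(data, 50))
```

With n = 12, the 50th percentile has location 6.5, so it falls halfway between the 6th and 7th sorted values, which matches the median rule for an even number of observations.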
complete linkage
Measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters. defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.
single linkage
Measure of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters. The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar. thus, single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster.
Group average linkage
Measure of calculating dissimilarity between clusters by considering the distance between each pair of observations between two clusters. Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters
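The complete, single, and group average linkage definitions differ only in how they summarize the distances between pairs of observations, one from each cluster. The sketch below assumes Euclidean distance as the dissimilarity measure; the function names and the two small clusters are hypothetical.

```python
import math
from itertools import product
from statistics import mean


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(cluster_a, cluster_b):
    """Distance for every pair made of one observation from each cluster."""
    return [euclidean(u, v) for u, v in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the two most similar (closest) observations."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two most dissimilar (farthest) observations."""
    return max(pairwise_distances(cluster_a, cluster_b))


def group_average_linkage(cluster_a, cluster_b):
    """Average distance over all pairs of observations."""
    return mean(pairwise_distances(cluster_a, cluster_b))


# Hypothetical clusters measured on a single variable.
a = [(0,), (1,)]
b = [(4,), (6,)]
print(single_linkage(a, b), complete_linkage(a, b), group_average_linkage(a, b))
```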
Jaccard coefficient
Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries. Jaccard coefficient = (number of variables with matching nonzero value for observations u and v) / (total number of variables − number of variables with matching zero value for u and v).
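A small Python sketch of the Jaccard coefficient for two binary observations follows. It uses the standard form of the coefficient, in which variables where both observations are 0 are excluded from the denominator. The function name, the example vectors, and the choice to return 0.0 when every variable is a matching zero are assumptions of this sketch.

```python
def jaccard_coefficient(u, v):
    """Matching 1s divided by the number of variables that are not matching 0s."""
    if len(u) != len(v):
        raise ValueError("both observations must have the same number of variables")
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - both_zero
    if denominator == 0:
        return 0.0  # assumption: no nonzero entries at all means no evidence of similarity
    return both_one / denominator


# Hypothetical observations on 5 binary variables.
u = [1, 0, 1, 0, 1]
v = [1, 1, 0, 0, 1]
print(jaccard_coefficient(u, v))
```

Here two variables match on 1, one matches on 0, and two disagree, so the coefficient is 2 / (5 − 1).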
median linkage
Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters. is analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.
illegitimately missing data
Missing data that do not occur naturally. These cases can result for a variety of reasons, such as a respondent electing not to answer a question that she or he is expected to answer, a respondent dropping out of a study before its completion, or sensors or other electronic data collection equipment failing during a study.
Two Events are Independent If...
P(A|B)=P(A) or P(B|A) = P(B)
Gestalt principles
Principles that describe the brain's organization of sensory information into meaningful units and patterns.
hierarchical clustering
Process of agglomerating observations into a series of nested groups based on a measure of similarity.
k-means clustering
Process of organizing observations into one of k distinct groups based on a measure of similarity, typically Euclidean distance.
MapReduce
Programming model used within Hadoop that performs the two major steps for which it is named: the map step and the reduce step. The map step divides the data into manageable subsets and distributes it to the computers in the cluster for storing and processing. The reduce step collects answers from the nodes and combines them into an answer to the original problem
data security
Protecting stored data from destructive forces or unauthorized users.
posterior probabilities
Revised probabilities of events based on additional information.
As an example of the addition law, consider a study conducted by the human resources manager of a major computer software company. The study showed that 30% of the employees who left the firm within two years did so primarily because they were dissatisfied with their salary, 20% left because they were dissatisfied with their work assignments, and 12% of the former employees indicated dissatisfaction with both their salary and their work assignments. What is the probability that an employee who leaves within two years does so because of dissatisfaction with salary, dissatisfaction with the work assignment, or both?
S = the event that an employee leaves because of salary. W = the event that an employee leaves because of work assignment. P(S) = .30, P(W) = .20, P(S∩W) = .12. P(S∪W) = P(S) + P(W) − P(S∩W) = .30 + .20 − .12 = .38
When you use excel, you can sort your data on the HOME tab, in the EDITING GROUP or on the DATA TAB.
Sort and Filter in excel. also can right click and use the drop menu > sort option.
imputation
Systematic replacement of missing values with values that seem reasonable.
Prescriptive Analytics
Techniques that analyze input data and yield a best course of action. Why will it happen? Predictive models provide a forecast or prediction but do not provide a decision; prescriptive models go further and indicate a course of action. Used to construct optimal portfolios of investments, to allocate assets, and to create optimal capital budgeting plans. For example, GE Asset Management uses optimization models to decide how to invest its own cash received from insurance policies and other financial products, as well as the cash of its clients, such as Genworth Financial. The estimated benefit from the optimization models was $75 million over a five-year period.
predictive analytics
Techniques that use models constructed from past data to predict the future or to ascertain the impact of one variable on another. What will happen, and when will it happen? For example, past data on product sales may be used to construct a mathematical model to predict future sales. This model can factor in the product's growth trajectory and seasonality based on past patterns. A packaged-food manufacturer may use point-of-sale scanner data from retail outlets to help in estimating the lift in unit sales due to coupons or sales events. Survey data and past purchase behavior may be used to help predict the market share of a new product. Also used to forecast financial performance, to assess the risk of investment portfolios and projects, and to construct financial instruments such as derivatives.
data cleansing
The data in a data set are often said to be "dirty" and "raw" before they have been put into a form that is best suited for investigation, analysis, and modeling. Common tasks in data preparation include treating missing data, identifying erroneous data and outliers, and defining the appropriate way to represent variables.
Complement of A
The event consisting of all outcomes that are not in A, denoted Ac. Example: in a coin toss, if heads is event A, then tails is the complement of A.
intersection of A and B
The event containing the outcomes belonging to both A and B. The intersection of A and B is denoted A∩B.
support count
The number of times that a collection of items occurs together in a transaction data set.
Market Segmentation
The partitioning of customers into groups that share common characteristics so that a business may target customers within a group with a tailored marketing strategy.
Joint Probability
The probability of two events both occurring; in other words, the probability of the intersection of two events.
tokenization
The process of dividing text into separate terms, referred to as tokens. First, symbols and punctuations must be removed from the document and all letters should be converted to lowercase. For example, "Awesome!", "awesome," and "#Awesome" should all be converted to "awesome."
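The "awesome" example can be reproduced with a short Python sketch. Treating any run of letters or digits as a token, and dropping everything else, is a simplifying assumption made here; the function name is hypothetical.

```python
import re


def tokenize(text: str):
    """Lowercase the text, then keep runs of letters and digits as tokens,
    which drops symbols and punctuation such as '!', ',' and '#'."""
    return re.findall(r"[a-z0-9]+", text.lower())


for raw in ("Awesome!", "awesome,", "#Awesome"):
    print(raw, "->", tokenize(raw))
```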
text mining
The process of extracting useful information from text data. To be analyzed, text data must first be converted to structured data so that the tools of descriptive statistics, data visualization, and data mining can be applied.
lift ratio
The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction. Lift ratio = confidence / (support of consequent / total number of transactions). A lift ratio greater than one suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than having no rule at all.
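To make the lift ratio formula concrete, the sketch below computes support count, confidence, and lift for one rule. The five transactions are made-up illustration data, and the function names are hypothetical rather than taken from any library.

```python
# Hypothetical transaction data: each transaction is a set of purchased items.
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly"},
    {"bread", "jelly", "peanut butter", "milk"},
    {"milk", "peanut butter"},
    {"bread", "milk"},
]


def support_count(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))


def lift_ratio(antecedent, consequent, transactions):
    """Confidence divided by the consequent's share of all transactions."""
    baseline = support_count(consequent, transactions) / len(transactions)
    return confidence(antecedent, consequent, transactions) / baseline


rule_if = {"bread", "jelly"}
rule_then = {"peanut butter"}
print(confidence(rule_if, rule_then, transactions))
print(lift_ratio(rule_if, rule_then, transactions))
```

In this made-up data, the rule's confidence is 2/3 while peanut butter appears in 3 of 5 transactions overall, so the lift ratio is slightly above one.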
business analytics
The scientific process of transforming data into insight for making better decisions. Used for data-driven or fact-based decision making. Also called: - data analytics - data intelligence - business science
utility theory
The study of the total worth or relative desirability of a particular outcome that reflects the decision maker's attitude toward a collection of factors such as profit, loss, and risk.
marginal probabilities
The values in the margins of a joint probability table that provide the probabilities of each event separately.
independent events
Two events A and B are independent if P(A|B) = P(A) or P(B|A) = P(B); the events do not influence each other. The probability of event A is not changed by the existence of event B.
The random numbers generated using Excel's RAND function follow a __________ probability distribution between 0 and 1.
UNIFORM
What makes decision making difficult and challenging?
Uncertainty is probably the number one challenge.
the event containing the outcomes belonging to A or B or both is the ______ (∪) of A and B
Union
Centroid linkage
Uses the averaging concept of cluster centroids to define between-cluster similarity. The centroid for cluster k, denoted as Ck , is found by calculating the average value for each variable across all observations in a cluster
Colors, hue
A powerful preattentive attribute: with color you can direct the audience's attention or alert them to something important. Besides hue, many color models also include: - saturation (intensity) - value (brightness)
sample
a subset of the population. For example, with the thousands of publicly traded companies in the United States, tracking and analyzing all of these stocks every day would be too time consuming and expensive. The Dow represents a sample of 30 stocks of large public companies based in the United States, and it is often interpreted to represent the larger population of all publicly traded companies. It is very important to collect sample data that are representative of the population data so that generalizations can be made from them.
Predictive and prescriptive analytics are sometimes referred to as
advanced analytics
Spreadsheet models are referred to as what-if models because they
allow easy instantaneous recalculation for a change in model inputs.
A normally distributed error term with a mean of zero would
allow more accurate modeling
orientation
Another preattentive attribute. Similar to shape, it can be useful for showing categorical comparisons or as a direction icon. Example: an arrow pointing up could mean something has increased, and an arrow pointing down could mean something has decreased.
data bars
applies a gradient or filled bar in which the width of the bar represents the cell's value with respect to other cells.
icon sets
are symbols or signs that classify data into categories based on the values in a range
top-down k-means clustering
Assigning each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible. The number of clusters, k, must be specified in advance. The algorithm repeats this process (calculate each cluster centroid, assign each observation to the cluster with the nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached. If you know how many clusters you want and have a larger data set (more than 500 observations), choose k-means.
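The loop described above (assign each observation to the nearest centroid, recompute centroids, stop when nothing changes or a maximum number of iterations is reached) can be sketched in plain Python. Using the first k observations as the starting centroids is an assumption made to keep the example deterministic; practical implementations usually choose starting points at random. All names and data are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(points):
    """Average value of each variable across the observations in a cluster."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(observations, k, max_iterations=100):
    # Assumption of this sketch: start from the first k observations.
    centroids = [tuple(float(x) for x in obs) for obs in observations[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assign each observation to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: euclidean(obs, centroids[j]))
            for obs in observations
        ]
        if new_assignments == assignments:
            break  # no change in the clusters
        assignments = new_assignments
        # Recalculate each cluster centroid from its current members.
        for j in range(k):
            members = [obs for obs, a in zip(observations, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = centroid(members)
    return assignments, centroids


# Hypothetical observations that form two visibly separate groups.
data = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
labels, centers = k_means(data, k=2)
print(labels)
print(centers)
```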
to minimize visual interference
Avoid placing text on backgrounds that make it difficult to read. A subtle, low-contrast background with little texture will interfere less.
As the number of degrees of freedom for a t distribution increases, the difference between the t distribution and the standard normal distribution
becomes smaller
Data Visualization
Can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Very helpful for identifying data errors and for reducing the size of your data set by highlighting important relationships and trends. Also important in conveying your analysis to others: it illuminates the data so that insights can be gained.
25th, 50th, and 75th percentiles (quartiles)
Can be found using the quartile function in Excel (=QUARTILE.EXC()) or using the percentile function.
Size
can be used to encode both categorical and quantitative data. makes it easier to understand.
color value or brightness
can be very useful to encode quantitative values. for an example in a sequential or diverging color scheme. value is perceived as ordered.
If arithmetic operations cannot be performed on the data, they are considered to be:
categorical data
width of the bins
choose a width for the bins. As a general guideline, we recommend that the width be the same for each bin. Thus the choices of the number of bins and the width of bins are not independent decisions. A larger number of bins means a smaller bin width and vice versa. To determine an approximate bin width, we begin by identifying the largest and smallest data values.
As we increase the cutoff value, _______ error will decrease and _________ error will rise.
class 0; class 1
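A sketch of why the two error rates move in opposite directions. The scored observations and the helper `class_error_rates` are hypothetical; the assumption is that an observation is classified as class 1 when its predicted probability is at least the cutoff.

```python
# Hypothetical validation data: (predicted probability of class 1, actual class).
scored = [(0.95, 1), (0.80, 1), (0.65, 0), (0.55, 1),
          (0.40, 0), (0.30, 1), (0.20, 0), (0.05, 0)]

def class_error_rates(scored, cutoff):
    """Return (class 0 error rate, class 1 error rate) at the given cutoff."""
    # Class 0 error: actual class 0 observations classified as class 1.
    false_pos = sum(1 for p, y in scored if y == 0 and p >= cutoff)
    # Class 1 error: actual class 1 observations classified as class 0.
    false_neg = sum(1 for p, y in scored if y == 1 and p < cutoff)
    n0 = sum(1 for _, y in scored if y == 0)
    n1 = sum(1 for _, y in scored if y == 1)
    return false_pos / n0, false_neg / n1

low = class_error_rates(scored, cutoff=0.25)
high = class_error_rates(scored, cutoff=0.75)
print(low)   # (0.5, 0.0)
print(high)  # (0.0, 0.5)
```

Raising the cutoff from 0.25 to 0.75 lowers the class 0 error rate and raises the class 1 error rate on this toy data.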
The ___________ is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.
coefficient of determination
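A short sketch of the calculation, assuming the usual definition r² = 1 − SSE/SST. The observed values and fitted values are hypothetical (the fitted values come from the line ŷ = 2.2 + 0.6x for x = 1, ..., 5).

```python
def r_squared(y, y_hat):
    """Proportion of the variability in y explained by the fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((a - p) ** 2 for a, p in zip(y, y_hat))  # unexplained variation
    sst = sum((a - y_bar) ** 2 for a in y)             # total variation
    return 1 - sse / sst

y = [2.0, 4.0, 5.0, 4.0, 5.0]      # observed values (hypothetical)
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]  # predictions from the estimated regression equation
r2 = r_squared(y, y_hat)
print(round(r2, 4))  # 0.6
```

Here about 60% of the variability in y is explained by the estimated regression equation.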
color hints
color is subjective, but color theory is a science. Focus on when and how to use color in a visualization so that the colors work in unity. Use pure colors to highlight important elements and subdued hues for everything else. Limit yourself to 2-3 color choices, and focus color efforts on the visual targets.
highlight
color is used to highlight one data point or category.
Preattentive attributes
color, size, orientation, and texture. Your brain processes this information before focusing attention on anything.
Alerting
colors are used to get the reader's attention, usually with an alarming/alerting color that tells the reader that something is wrong. In Western culture, red is associated with bad.
diverging
colors encode a quantitative value that has a midpoint. The midpoint can be zero, the average, or a target you want to set.
categorical
colors encode categories; contrasting colors for individual comparison.
sequential
colors encode a quantitative value from low to high.
analogous colors
colors that are next to each other on the color wheel
A(n) __________ matrix displays a model's correct and incorrect classification.
confusion
corporate-level managers use _______ to summarize sales by region, current inventory levels, and other company-wide metrics all in a single screen.
data dashboards
A retail store owner offers a discount on product A and predicts that the customers would purchase products B and C in addition to product A. Identify the technique used to make such a prediction.
data mining
The extraction of information on the number of shipments, how much was included in each shipment, the date each shipment was sent, and so on from the manufacturing plant's database exemplifies:
data queries.
__________ is a method of extracting data relevant to the business problem under consideration. It is the first step in the data mining process.
data sampling
When a decision maker is faced with several alternatives and an uncertain set of future events, s/he uses __________ to develop an optimal strategy.
decision analysis
event
defined as a collection of outcomes. For example, consider the case of an expansion project being undertaken by California Power & Light Company (CP&L): past data show the number of construction projects that required 8, 9, 10, 11, and 12 months.
In a linear regression model, the variable that is being predicted or explained is known as _____________. It is denoted by y and is often referred to as the response variable.
dependent variable
The mean absolute error, mean squared error, and mean absolute percentage error are all methods to measure the accuracy of a forecast. These methods measure forecast accuracy by
determining how well a particular forecasting method is able to reproduce the time series data that are already available.
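A sketch of the three accuracy measures on a hypothetical four-period series, assuming forecast error = actual value − forecast.

```python
# Hypothetical time series values and the forecasts made for the same periods.
actual = [100.0, 110.0, 120.0, 130.0]
forecast = [90.0, 115.0, 120.0, 117.0]

# Forecast error for each period: actual minus forecast.
errors = [a - f for a, f in zip(actual, forecast)]  # [10.0, -5.0, 0.0, 13.0]
n = len(errors)

mae = sum(abs(e) for e in errors) / n                             # mean absolute error
mse = sum(e ** 2 for e in errors) / n                             # mean squared error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n  # mean absolute percentage error

print(mae, mse, round(mape, 2))  # 7.0 73.5 6.14
```

A positive error (such as 10.0 in the first period) means the forecast was below the actual value.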
A cluster's __________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.
durability or strength
highlight cell rules
enables you to apply a highlight to the cells that meet your condition
top/bottom rules
enables you to specify the top or bottom number or percentage, or values that are above or below the average value in a specified range.
In the simple linear regression model, the ____________ accounts for the variability in the dependent variable that cannot be explained by the linear relationship between the variables.
error term
Determine a freshman's likely first-year grade point average from the student's Scholastic Aptitude Test (SAT) score, high school grade point average, and number of extra-curricular activities. This is an example of
estimation of a continuous outcome
__________ is the amount by which the predicted value differs from the observed value of the time series variable.
forecast error
Excel's __________ tool allows the user to determine the value of an input cell that will cause the value of a related output cell to equal some specified value.
goal seek
cluster analysis questions
how do we measure similarity? how do we form clusters? how many groups do we form?
when moving from knowledge to strategy
how do we, how can we, and what can we...
when moving from information to strategy, these are the how questions:
how many, how much, what are...
veracity
how much uncertainty is in the data. Inconsistencies in units of measure and the lack of reliability of responses in terms of bias also increase the complexity of the data. For example, the data could have many missing values, which makes reliable analysis a challenge.
increasing the color intensity
increasing saturation and brightness draws the eye and means the point is more important.
In a linear regression model, the variable (or variables) used for predicting or explaining values of the response variable are known as the __________. It(they) is(are) denoted by x.
independent variable
An estimate of a population parameter that provides an interval of values believed to contain the value of the parameter is known as the
interval estimate
shape
is a preattentive attribute. Can be used for categorical comparisons but is not useful for quantitative comparisons.
texture
is another preattentive attribute. It was common to use texture in data visualization when printers could print only in black and white. Useful for encoding categorical data, not quantitative data.
knowledge
is awareness and understanding of a set of information and of the ways it can be used to support a task. - information and skills acquired through experience or education; the theoretical or practical understanding of a subject.
position, length, or height
is much better for showing precise quantitative comparisons.
information
is a collection of data organized in such a way that they have value beyond the facts themselves. - organized facts provided or learned about something or someone.
The goal of data mining is to
extract patterns and knowledge from large amounts of data.
by changing the color of the important data relative to the irrelevant data
it makes the important data easier to see and read.
_________ attempts to classify a categorical outcome as a linear function of explanatory variables.
logistic regression
nonexperimental, or observational, studies
make no attempt to control the variables of interest. 1) Identify research questions and variables. 2) develop survey/interview, then distribute and collect. A survey is perhaps the most common type of observational study. For instance, in a personal interview survey, research questions are first identified. Then a questionnaire is designed and administered to a sample of individuals.
A __________ decision is one in which companies have to decide whether they should manufacture a product or outsource production to another firm.
make versus buy
Cluster analysis is commonly used in marketing to divide consumers into different homogeneous groups, a process known as
market segmentation
legitimately missing data
missing data that occur naturally. For example, respondents to a survey may be asked if they belong to a fraternity or a sorority, and then in the next question are asked how long they have belonged to a fraternity or a sorority. If a respondent does not belong to a fraternity or a sorority, she or he should skip the ensuing question about how long.
__________ refers to the degree of correlation among independent variables in a regression model.
multicollinearity
bin limits
must be chosen so that each data item belongs to one and only one bin. The lower bin limit identifies the smallest possible data value assigned to the bin. The upper bin limit identifies the largest possible data value assigned to the bin.
A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size
n has the same probability of being selected.
in a normal distribution, which is greater, the mean or the median?
neither: in a normal distribution the mean, the median, and the mode are all equal.
probability is the
numerical measure of the likelihood that an event will occur.
we can use charts to visualize our data and
obtain more information about the data set.
With reference to a spreadsheet model, an uncontrollable model input is known as a(n)
parameter
What do nodes in an influence diagram represent?
parts of the model
It is the responsibility of managers to
plan, coordinate, organize, and lead their organizations to better performance. Ultimately, managers' responsibilities require that they make strategic, tactical, or operational decisions.
A simple random sample of 31 observations was taken from a large population. The sample mean equals 5. Five is a
point estimate
The purpose of statistical inference is to make estimates or draw conclusions about a
population based upon information obtained from the sample
Bayes' theorem is a method used to compute _______ probabilities
posterior
A forecast that helps direct police officers to areas where crimes are likely to occur based on past data is an example of
predictive analytics.
color wheel
primary colors are red, yellow, and blue. Secondary colors are green, orange, and purple, which are created by mixing the primary colors.
One of the most important uses of a histogram is to
provide information about the shape, or form, of a distribution: moderately skewed left, moderately skewed right, symmetric, or highly skewed right.
probability distribution
describes how likely each possible value of a random variable is. - it is useful when you want to know which outcomes are most likely.
A time series plot of a period of time (quarterly) versus quarterly sales (in $1,000s) is shown below. Which of the following data patterns best describes the scenario shown
seasonal pattern and linear trend
The goal of clustering is to
segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration. It is commonly used in marketing to divide customers into different homogeneous groups, a process known as market segmentation. Cluster analysis can also be used to identify outliers.
three primary ways to use color in data visualization
sequential, diverging, and categorical.
the triangular distribution is a good model for _____ distributions
skewed
In a simple linear regression analysis the quantity that gives the amount by which the dependent variable changes for a unit change in the independent variable is called the
slope of the regression line
the least squares regression line minimizes the sum of the
squared differences between actual and predicted y values
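A sketch of the least squares calculation for a simple linear regression, assuming the standard closed-form estimates b1 = Σ(xi − x̅)(yi − y̅)/Σ(xi − x̅)² and b0 = y̅ − b1·x̅. The data are hypothetical.

```python
# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Slope: change in predicted y for a one-unit change in x.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
# Intercept: forces the line through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

# Sum of squared differences between actual and predicted y values.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
print(round(b0, 2), round(b1, 2), round(sse, 2))  # 2.2 0.6 2.4
```

Any other choice of intercept and slope gives a larger `sse` for these data, which is what "minimizes the sum of squared differences" means.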
bottom-up hierarchical clustering
starts with each observation belonging to its own cluster, then sequentially merges the most similar clusters to create a series of nested clusters. Suited to small data sets (fewer than 500 observations).
visual perception
the ability to interpret the surrounding environment by processing information that is contained in visible light.
Which of the following statements is correct?
the binomial distribution is a discrete probability distribution and the normal distribution is a continuous probability distribution.
correlation
the correlation coefficient is the most widely used measure of the strength of the relationship: a standardized measure of linear association between two variables that takes on values between −1 and +1. Values near −1 indicate a strong negative linear relationship, values near +1 indicate a strong positive linear relationship, and values near zero indicate the lack of a linear relationship. Population correlation: ρxy = σxy/(σx σy). Sample correlation: rxy = sxy/(sx sy), where rxy = sample correlation coefficient, sxy = sample covariance, sx = sample standard deviation of x, and sy = sample standard deviation of y.
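A sketch of the sample correlation formula rxy = sxy/(sx sy) on hypothetical data, with the sample covariance and sample standard deviations computed using n − 1 in the denominator.

```python
import math

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)  # sample covariance
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))              # sample std dev of x
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))              # sample std dev of y

r_xy = s_xy / (s_x * s_y)
print(round(r_xy, 4))  # 0.7746
```

The result is near +1, so these hypothetical data show a fairly strong positive linear relationship.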
relative frequency
the fraction or percent of the time that an event occurs in an experiment. Relative frequency of a bin = frequency of the bin / n
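A small sketch of the formula, using hypothetical letter grades as the bins and `collections.Counter` to get the frequencies.

```python
from collections import Counter

# Hypothetical observations; each distinct value acts as a bin.
grades = ["A", "B", "B", "C", "A", "B", "C", "B", "A", "B"]
n = len(grades)

frequency = Counter(grades)  # frequency of each bin
relative_frequency = {g: count / n for g, count in frequency.items()}

print(relative_frequency["B"])  # 0.5
```

The relative frequencies across all bins always sum to 1.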
A procedure for using sample data to find the estimated regression equation is
the least squares method
logic
the objects in the same group or cluster are more similar to each other than to those in other groups or clusters. For example, class: instructor and students; gender: male, female, or a third option; status: freshman, sophomore, junior, senior.
Remedial action is considered for illegitimately missing data.
the primary options for addressing such missing data are (1) to discard observations (rows) with any missing values, (2) to discard any variable (column) with missing values, (3) to fill in missing entries with estimated values, or (4) to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.
Data can be categorized in several ways based on how
they are collected and the type of data collected.
A time series plot of a period of time (in weeks) versus sales (in 1,000's of gallons) is shown below. Which of the following data patterns best describes the scenario shown?
time series with a horizontal pattern
Simulation optimization helps
to find good decisions in highly complex and highly uncertain settings.
Which of the following states the objective of time series analysis?
to uncover a pattern in a time series and then extrapolate the pattern into the future
Which of the following would be a likely mathematical expression for Total Revenue?
total revenue = production volume * revenue per unit.
The impact of two inputs on the output of interest is summarized by a
two-way data table
A positive forecast error indicates that the forecasting method ________ the dependent variable
underestimated
order is a powerful way to
understand your data. by ordering your data it becomes easier to see patterns.
The goal of __________ is to use the variable values to identify relationships between observations.
unsupervised learning
In which of the following scenarios would it be appropriate to use hierarchical clustering?
when binary or ordinal data needs to be clustered
covariance
A descriptive measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship. For a sample of size n with observations (x1, y1), (x2, y2), ..., (xn, yn), the sample covariance is sxy = Σ(xi − x̅)(yi − y̅)/(n − 1). The population covariance is σxy = Σ(xi − μx)(yi − μy)/N.
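A sketch contrasting the sample and population covariance on the same hypothetical data; the only difference is the denominator (n − 1 versus N).

```python
# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sum of the products of paired deviations from the means.
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

sample_cov = cross / (n - 1)  # treats the data as a sample of size n
population_cov = cross / n    # treats the data as the whole population (N = n)

print(sample_cov, population_cov)  # 1.5 1.2
```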
population mean
μ = (Σ Xi)/N
measuring similarity between observations:
bottom-up: hierarchical clustering; top-down: k-means clustering.
frequency term-document matrix
A matrix whose rows represent documents and columns represent tokens (terms), and the entries in the matrix are the frequency of occurrence of each token (term) in each document.
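A minimal sketch of building such a matrix. The two-document corpus is hypothetical, and tokens are assumed to be lowercase words split on whitespace (real text mining usually adds more preprocessing).

```python
from collections import Counter

# Hypothetical corpus of two short documents.
corpus = ["the service was great", "great food great service"]

# Count how often each token occurs in each document.
counts = [Counter(doc.split()) for doc in corpus]

# Columns: every distinct term in the corpus, in alphabetical order.
terms = sorted(set().union(*counts))

# Rows: one per document; entries are term frequencies (0 if the term is absent).
matrix = [[c[t] for t in terms] for c in counts]

print(terms)   # ['food', 'great', 'service', 'the', 'was']
print(matrix)  # [[0, 1, 1, 1, 1], [1, 2, 1, 0, 0]]
```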
Strategic decision
A decision that involves higher-level issues and that is concerned with the overall direction of the organization, defining the overall goals and aspirations for the organization's future
geometric mean
A measure of central location that is calculated by finding the nth root of the product of n values. x̅g = (x1 · x2 · ... · xn)^(1/n)
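A sketch of the nth-root calculation on hypothetical yearly growth factors (1.10 means +10%, 0.95 means −5%), which is a common use of the geometric mean.

```python
import math

# Hypothetical growth factors for three consecutive years.
growth_factors = [1.10, 0.95, 1.20]
n = len(growth_factors)

# nth root of the product of the n values.
geo_mean = math.prod(growth_factors) ** (1 / n)

print(round(geo_mean, 4))  # 1.0784
```

Growing by the geometric mean every year gives the same overall growth as the three actual factors.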
coefficient of variation
A measure of relative variability computed by dividing the standard deviation by the mean and multiplying by 100. CV = (standard deviation / mean) × 100%
Variance
A measure of variability based on the squared deviations of the data values about the mean. Population variance: σ² = Σ(xi − μ)²/N. Sample variance: s² = Σ(xi − x̅)²/(n − 1).
standard deviation
A measure of variability computed by taking the positive square root of the variance. Sample: s = sqrt(s²). Population: σ = sqrt(σ²).
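A sketch tying together variance, standard deviation, and the coefficient of variation on hypothetical data. The sample versions divide by n − 1; the population variance divides by N.

```python
import math

# Hypothetical data values.
data = [4.0, 8.0, 6.0, 5.0, 7.0]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations about the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # data treated as the whole population
sample_variance = squared_deviations / (n - 1)  # data treated as a sample
sample_std = math.sqrt(sample_variance)         # positive square root of the variance
cv = sample_std / mean * 100                    # coefficient of variation, in percent

print(population_variance, sample_variance)  # 2.0 2.5
print(round(sample_std, 4), round(cv, 2))    # 1.5811 26.35
```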
approximate bin width
(largest data value − smallest data value) / number of bins
the human brain can remember approximately
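A sketch of the bin-width guideline on hypothetical data. Rounding the approximate width up to a convenient whole number is an assumption made here for illustration.

```python
import math

# Hypothetical data values.
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
number_of_bins = 5

# (largest data value - smallest data value) / number of bins
approx_width = (max(data) - min(data)) / number_of_bins
bin_width = math.ceil(approx_width)  # round up so the bins cover the full range

print(approx_width, bin_width)  # 4.2 5
```

Choosing more bins would give a smaller width, which is why the two choices are not independent.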
10,000 visuals with an 83% recollection rate.
optimization models
A mathematical model that gives the best decision, subject to the situation's constraints.
variable
A characteristic or quantity of interest that can take on different values.
corpus
A collection of documents to be analyzed.
operational decisions
A decision concerned with how the organization is run from day to day. - domain of operations managers. - closest to the customers.
histograms
A graphical presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the bin intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.
scatter chart
A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis.
random variable, or uncertain variable
A quantity whose values are not known with certainty
pivot table
An interactive crosstabulation created in Excel.
Hadoop
An open-source programming environment that supports big data processing through distributed storage and distributed processing on clusters of computers.
outliers
An unusually large or unusually small data value.
unsupervised learning
Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process. - a descriptive data-mining technique used to identify relationships between observations. There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results. Thought of as high-dimensional descriptive analytics because these techniques are designed to describe patterns and relationships in large data sets with many observations of many variables.
Probability of an Event
Equal to the sum of the probabilities of the outcomes in the event. The probability of event A is denoted P(A).
mutually exclusive events
Events that have no outcomes in common. A∩B is empty, so P(A∩B) = 0 and P(AUB) = P(A) + P(B).
data mining example for predictive analysis
For example, a large grocery store chain might be interested in developing a targeted marketing campaign that offers a discount coupon on potato chips. By studying historical point-of-sale data, the store may be able to use data mining to predict which customers are the most likely to respond to an offer on discounted chips by purchasing higher-margin items such as beer or soft drinks in addition to the chips, thus increasing the store's overall revenue.
matching coefficient
Measure of similarity between observations based on the number of matching values of categorical variables.
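A sketch of one common form of the matching coefficient: the number of variables on which two observations agree, divided by the total number of variables. The observations and the function name are hypothetical.

```python
def matching_coefficient(obs_a, obs_b):
    """Share of categorical variables on which the two observations have the same value."""
    matches = sum(1 for a, b in zip(obs_a, obs_b) if a == b)
    return matches / len(obs_a)

# Hypothetical observations: (gender, marital status, homeowner, has loan).
customer_1 = ("F", "Married", "Yes", "No")
customer_2 = ("F", "Single", "Yes", "Yes")

similarity = matching_coefficient(customer_1, customer_2)
print(similarity)  # 0.5
```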
Computing Probability Using the complement
P(A) = 1 - P(Ac)
three events with addition law: A,B,C
P(AUBUC) = P(A)+P(B)+P(C) - P(A∩B)-P(A∩C)-P(B∩C) +P(A∩B∩C)
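A sketch that checks the three-event addition law on a hypothetical experiment with 12 equally likely outcomes, using exact fractions so the two sides can be compared directly.

```python
from fractions import Fraction

# Hypothetical experiment: outcomes 1..12, all equally likely.
outcomes = set(range(1, 13))
A = {n for n in outcomes if n % 2 == 0}  # even outcomes
B = {n for n in outcomes if n % 3 == 0}  # multiples of 3
C = {n for n in outcomes if n >= 9}      # outcomes of 9 or more

def prob(event):
    """Probability of an event: sum of the (equal) probabilities of its outcomes."""
    return Fraction(len(event), len(outcomes))

lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))

print(lhs, rhs)  # 3/4 3/4
```

The same `prob` helper also shows the complement rule: `prob(A)` equals `1 - prob(outcomes - A)`.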
What are the two decisions that you can make from performing a hypothesis test?
Reject the null hypothesis, or fail to reject the null hypothesis.
Data
Raw data; The facts and figures collected, analyzed, and summarized for presentation and interpretation.
velocity
Real-time capture and analysis of data present unique challenges both in how data are stored and the speed with which those data can be analyzed for decision making. For example, the New York Stock Exchange collects 1 terabyte of data in a single trading session, and having current data and real-time rules for trades and predictive modeling are important for managing stock portfolios.
harmony effect
Shapes that have similar characteristics are visually read as harmonious.
Picks and Axes Inc. is an Internet-based retail seller of hiking boots and mountaineering gear. The company decides to open retail stores across the major areas of the city to help complement its Internet-based strategy. This activity would be categorized as a(n)
Strategic decision
confidence
The conditional probability that the consequent of an association rule occurs given that the antecedent occurs: confidence = support of (antecedent and consequent) / support of antecedent. A high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, but a high value of confidence can be misleading.
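A sketch of the confidence calculation for the rule "if bread, then jelly". The transactions are hypothetical, and support is counted here as the number of transactions that contain an item set.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread"},
    {"milk", "jelly"},
    {"bread", "jelly"},
]

def support(item_set):
    """Number of transactions that contain every item in item_set."""
    return sum(1 for t in transactions if item_set <= t)

antecedent = {"bread"}  # the "if" portion of the rule
consequent = {"jelly"}  # the "then" portion of the rule

confidence = support(antecedent | consequent) / support(antecedent)
print(confidence)  # 0.5
```

Bread appears in four transactions and bread with jelly in two, so the rule holds in half of the transactions where the antecedent occurs.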
Interquartile Range (IQR)
The difference between the third and first quartiles.
Union of A and B
The event containing the outcomes belonging to A or B or both. The union of A and B is denoted by AUB.
Which of the following approaches is a good way to proceed with the influence diagram building for a problem?
The influence diagram for a portion of the problem is built first and then expanded until the total problem is conceptually modeled.
antecedent
The item set corresponding to the if portion of an if-then association rule.
consequent
The item set corresponding to the then portion of an if-then association rule.
Trend refers to
The long-run shift or movement in the time series observable over several periods of time.
bins
The nonoverlapping groupings of data used to create a frequency distribution. Bins for categorical data are also known as classes.
degrees of freedom
The number of individual scores that can vary without changing the sample mean. Statistically written as 'N-1' where N represents the number of subjects.
Conditional probability
The probability of an event given that another event has already occurred. The conditional probability of A given B is written P(A|B) and read "the probability of A given B." P(A|B) = P(A∩B)/P(B), or P(B|A) = P(A∩B)/P(A).
the most common form of distribution is
a frequency distribution, which shows how often values appear in each range.
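A sketch of both conditional probability formulas using hypothetical counts from 100 customers, with exact fractions to avoid rounding.

```python
from fractions import Fraction

# Hypothetical counts out of 100 customers.
total = 100
bought_a = 40     # customers who bought product A
bought_b = 30     # customers who bought product B
bought_both = 12  # customers who bought both A and B

p_a = Fraction(bought_a, total)
p_b = Fraction(bought_b, total)
p_a_and_b = Fraction(bought_both, total)

p_a_given_b = p_a_and_b / p_b  # P(A|B) = P(A∩B)/P(B)
p_b_given_a = p_a_and_b / p_a  # P(B|A) = P(A∩B)/P(A)

print(p_a_given_b, p_b_given_a)  # 2/5 3/10
```

The two results differ because they condition on different events, even though the intersection in the numerator is the same.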
A one-way data table summarizes
a single input's impact on the output of interest.
why do we need data mining?
because there is a huge amount of data that we have to deal with.
Earth tones and cool colors
can be used for categorical data
data mining
the use of a variety of statistical analysis tools to uncover previously unknown patterns in the data stored in databases or relationships among variables. - extraction of data from a database. For example, by analyzing text on social network platforms like Twitter, data-mining techniques (including cluster analysis and sentiment analysis) are used by companies to better understand their customers. By categorizing certain words as positive or negative and keeping track of how often those words appear in tweets, a company like Apple can better understand how its customers are feeling about a product like the Apple Watch.
If a time series plot exhibits a horizontal pattern, then
there is still not enough evidence to conclude that the time series is stationary