Business analytics

calculation of distances

- Euclidean distance - squared (or absolute) Euclidean distance - city block (Manhattan) distance - Chebyshev distance - Mahalanobis distance (D2)

IBM describes the phenomenon of big data through the four Vs:

1) Volume 2) Velocity 3) variety 4) Veracity

observation, or record

A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database. For example, Jack's record is a single row containing student ID, first name, last name, address, phone number, major, and status.

observation

A set of values corresponding to a set of variables.

cumulative frequency distributions

A tabular summary of quantitative data showing the number of data values that are less than or equal to the upper class limit of each bin

Decision Analysis

A technique used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events.

tertiary colors

A tertiary color is a color made by mixing one primary color with one secondary color. These colors can be divided into cool and warm colors.

charts

A visual method for displaying data; also called a graph or a figure.

random sampling

Collecting a sample that ensures that (1) each element selected comes from the same population and (2) each element is selected independently.

cross-sectional data

Data collected at the same or approximately the same point in time.

unstructured data

Data that do not exist in a fixed location or format, such as text documents, PDFs, voice messages, and emails. Example: a form field where you have to type in the exact address yourself.

structured data

Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion. Example: entering an address by choosing from a drop-down menu.

time series data

Data that are collected over a period of time (minutes, hours, days, months, years, etc.). For example, John's GPA from freshman to senior year is time series data. Graphs of time series data help analysts understand what happened in the past, identify trends over time, and project future levels for the time series.

variation

Differences in values of a variable over observations.

Internet of Things (IoT)

The technology that allows data collected from sensors in all types of machines to be sent over the Internet to repositories where it can be stored and analyzed.

Missing completely at random (MCAR)

The tendency for an observation to be missing a value of some variable is entirely random. For example, if a missing value for a question on a survey is completely unrelated to the value that is missing and is also completely unrelated to the value of any other question on the survey, the missing value is MCAR.

Missing Not at Random (MNAR)

The tendency for an observation to be missing a value of some variable is related to the missing value.

Missing at random (MAR)

The tendency for an observation to be missing a value of some variable is related to the value of some other variable(s) in the data. For example if the responses to one survey question collected by a specific employee were lost due to a data entry error, then the treatment of the missing data may be less critical.

growth factor

The percentage increase of a value over a period of time is calculated using the formula (growth factor - 1). A growth factor less than 1 indicates negative growth, whereas a growth factor greater than 1 indicates positive growth. The growth factor cannot be less than zero.
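As a quick check of the relationship between a growth factor and a percentage change, here is a minimal Python sketch. It assumes the growth factor is the ratio of the new value to the old value, so the percentage change is (growth factor - 1); the function names are illustrative, not from the course material.

```python
def growth_factor_between(old_value: float, new_value: float) -> float:
    """Growth factor for one period: the ratio of the new value to the old value."""
    return new_value / old_value


def percent_change(growth_factor: float) -> float:
    """Percentage change implied by a growth factor: (growth factor - 1) * 100."""
    if growth_factor < 0:
        raise ValueError("a growth factor cannot be less than zero")
    return (growth_factor - 1.0) * 100.0


# Hypothetical values: sales rise from 200 to 210, then fall from 200 to 180.
rise = percent_change(growth_factor_between(200, 210))   # positive growth
fall = percent_change(growth_factor_between(200, 180))   # negative growth
print(rise, fall)
```

A growth factor above 1 gives a positive percentage, and one below 1 gives a negative percentage.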

data ink ratio

The ratio of the amount of ink used in a table or chart that is necessary to convey information to the total amount of ink used in the table and chart. Ink used that is not necessary to convey information reduces the data-ink ratio.

simulation

The use of probability and statistics to construct a computer model to study the impact of uncertainty on the decision at hand. For example, banks often use simulation to model investment and default risk in order to stress-test financial models. Simulation is also often used in the pharmaceutical industry to assess the risk of introducing a new drug.

simulation optimization

The use of probability and statistics to model uncertainty, combined with optimization techniques, to find good decisions in highly complex and highly uncertain settings.

population

The set of all elements of interest in a particular study.

Which of the following is not present in a time series?

operational variations

The percent of misclassified records out of the total records in the validation data is known as the

overall error rate

two approaches to drawing a conclusion in a hypothesis test are

p-value and critical value.

A forecast is defined as

prediction of future values of a time series.

A Type I error is committed when

a true null hypothesis is rejected.

The value of the ___________ is used to estimate the value of the population parameter.

sample statistic

_______ are used in the pharmaceutical industry to assess the risk of introducing a new drug.

simulations

The __________ is a measure of the error that results from using the estimated regression equation to predict the values of the dependent variable in the sample.

sum of squares due to error (SSE)

Data mining methods for classifying or estimating an outcome based on a set of input variables are referred to as

supervised learning

In preparing categorical variables for analysis, it is usually best to

convert the categories to binary dummy variables. Typically, it is best to encode categorical variables with 0-1 dummy variables. Using 0-1 dummy variables to encode categorical variables with many different categories results in a large number of variables. In some cases, the number of categories may be reduced by combining categories.

__________ compares the number of actual Class 1 observations identified if considered in decreasing order of their estimated probability of being in Class 1 with the number of actual Class 1 observations identified if randomly classified.

cumulative lift

Data dashboards are a type of _________ analytics.

descriptive

experimental study

a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest. For example, if a pharmaceutical firm conducts an experiment to learn about how a new drug affects blood pressure, then blood pressure is the variable of interest. The dosage level of the new drug is another variable that is hoped to have a causal effect on blood pressure.

unnatural and warm colors

are considered to draw the reader's attention.

A test set is the data set used to

estimate performance of the final model on unseen data.

Bayes' Theorem decision tree

excel example.

color scales

format cells with different colors based on the relative value of a cell compared to other selected cells. Two- or three-color scales can be applied.

Data are considered quantitative data

if numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on them.

Which statement is true about mutually exclusive events?

If event A and event B cannot occur at the same time, they are called mutually exclusive.

All the events in the sample space that are not part of the specified event are called

the complement of the event, denoted Ac

Association Rules

If-then statements that convey the likelihood of certain items being purchased together. Widely used in marketing. A rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

A(n) __________ is a visual representation that shows which entities affect others in a model

influence diagram

the idea of using data to create

information, then knowledge, then strategy: data > information > knowledge > a plan to guide actions or key decisions

Strategy

is a plan of action or policy to achieve your goals.

cluster analysis

A data preparation technique used in market segmentation to divide consumers into different homogeneous groups; a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess.

stemming

The process of converting a word to its stem or root word. For example, stemming would drop the "ing" and "ed" from "stacking" and "stacked" and place only "stack" in the list of words to be tracked.

sample space

the set of all outcomes of an experiment

when moving from information to knowledge

why questions

time series analysis and forecasting is one of the most

widely used analytics in business and economics.

sample mean

x̄ = (Σ xi) / n = (x1 + x2 + ... + xn) / n

Analytics is generally thought to comprise three broad categories of techniques:

- descriptive analytics - predictive analytics - prescriptive analytics

the increase in the use of data mining techniques in businesses has been caused largely by three events:

1) the explosion in the amount of data being produced and electronically tracked. 2) the ability to electronically warehouse these data. 3) the affordability of computer power to analyze the data.

How many Class 1's are correctly classified as Class 1 in the confusion matrix below?
              Predicted 1   Predicted 0
Actual 1          221           100
Actual 0           30         3,000

221, 1.1
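The counts in the confusion matrix above can also be turned into error rates. The following Python sketch assumes the flattened table reads as rows = actual class and columns = predicted class (1 first, then 0); the variable names are illustrative.

```python
# Counts taken from the confusion matrix in the question above
# (assumed layout: rows = actual class, columns = predicted class).
actual1_pred1 = 221    # Class 1 records correctly classified as Class 1
actual1_pred0 = 100    # Class 1 records misclassified as Class 0
actual0_pred1 = 30     # Class 0 records misclassified as Class 1
actual0_pred0 = 3000   # Class 0 records correctly classified as Class 0

total = actual1_pred1 + actual1_pred0 + actual0_pred1 + actual0_pred0

# Overall error rate: misclassified records out of all records.
overall_error_rate = (actual1_pred0 + actual0_pred1) / total

# Class-specific error rates.
class1_error_rate = actual1_pred0 / (actual1_pred1 + actual1_pred0)
class0_error_rate = actual0_pred1 / (actual0_pred1 + actual0_pred0)

print(f"overall: {overall_error_rate:.4f}")
print(f"class 1: {class1_error_rate:.4f}, class 0: {class0_error_rate:.4f}")
```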

data dashboard

A collection of tables, charts, and maps to help management monitor selected aspects of the company's performance. used to help management monitor specific aspects of the company's performance related to their decision-making responsibilities. For corporate-level managers, daily data dashboards might summarize sales by region, current inventory levels, and other company-wide metrics; front-line managers may view dashboards that contain metrics related to staffing levels, local inventory levels, and short-term sales forecasts.

tactical decision

A decision concerned with how the organization should achieve the goals and objectives set by its strategy. - Responsibility of mid-level management. - Typically spans a year.

venn diagram

A graphical representation of the sample space and operations involving events, in which the sample space is represented by a rectangle and events are represented as circles within the sample space. Event A is the area inside the circle; the sample space (S) is the rectangle; the area inside the rectangle but outside the circle is the complement of event A (Ac), so P(A) + P(Ac) = 1.

box plot

A graphical summary of data based on the quartiles of a distribution. also known as box-and-whisker plots.

multiplication law

A law used to compute the probability of the intersection of events. For two events A and B, the multiplication law is P(A∩B) = P(B)* P(A|B) or P(A∩B) = P(A)*P(B|A). For two independent events, it reduces to P(A∩B) = P(A) * P(B).

presence/absence or binary term-document matrix.

A matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or absence of a particular word in a particular document (1 = present and 0 = not present).

mean (arithmetic mean)

A measure of central location computed by summing the data values and dividing by the number of observations; also called the average value. Denoted by x̄ for sample data and by µ for a population.

mode

A measure of central location defined as the value that occurs with greatest frequency. Occasionally the greatest frequency occurs at two or more different values, in which case more than one mode exists. If data contain at least two modes, we say that they are multimodal. A special case of multimodal data occurs when the data contain exactly two modes; in such cases we say that the data are bimodal

median

A measure of central location provided by the value in the middle when the data are arranged in ascending order. With an even number of observations, the median is the average of the middle two values. For example, when n = 12 and the middle two values are 199,500 and 208,000, the median is (199,500 + 208,000) / 2 = 203,750.
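The odd and even cases of the median can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a library routine; the function name and sample values are assumptions.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle two


odd_case = median([7, 1, 4])              # sorted: 1, 4, 7
even_case = median([199_500, 208_000])    # the two middle values from the card above
print(odd_case, even_case)
```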

Skewness

A measure of the lack of symmetry in a distribution.

range

A measure of variability defined to be the largest value minus the smallest value. The range can be calculated in Excel using the MAX and MIN functions.

Bayes' Theorem

A method used to compute posterior probabilities. It provides a means of revising prior probability estimates for specific events of interest once additional information becomes available.

Probability

A numerical measure of the likelihood that an event will occur. It can be used as a measure of the uncertainty associated with an event.
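A small Python sketch may help show how prior probabilities are revised into posterior probabilities. The supplier scenario and all numbers below are hypothetical, chosen only to illustrate the calculation P(Ai|B) = P(Ai)P(B|Ai) / Σ P(Aj)P(B|Aj).

```python
def posterior_probabilities(priors, likelihoods):
    """Apply Bayes' theorem to mutually exclusive, collectively exhaustive events.

    priors[i]      = P(A_i)
    likelihoods[i] = P(B | A_i)
    returns        P(A_i | B) for each i
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(A_i and B)
    prob_b = sum(joint)                                    # P(B)
    return [j / prob_b for j in joint]


# Hypothetical example: two suppliers provide 65% and 35% of parts,
# and 2% and 5% of their parts, respectively, are defective.
priors = [0.65, 0.35]
likelihoods = [0.02, 0.05]
posteriors = posterior_probabilities(priors, likelihoods)
print(posteriors)  # probability each supplier produced a part known to be defective
```

Observing a defective part shifts probability toward the supplier with the higher defect rate.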

rule based model

A prescriptive model that is based on a rule or set of rules.

addition law

A probability law used to compute the probability of the union of events. For two events A and B, the addition law is P(A∪B) = P(A) + P(B) - P(A∩B). For two mutually exclusive events, P(A∩B) = 0, so P(A∪B) = P(A) + P(B). The addition law is helpful when we are interested in knowing the probability that at least one of two events will occur. That is, with events A and B we are interested in knowing the probability that event A or event B occurs or both events occur.

random experiment

A process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined by chance.

data query

A request for information with certain characteristics from a database.

empirical rule

A rule that can be used to compute the percentage of data values that must be within 1, 2, or 3 standard deviations of the mean for data that exhibit a bell-shaped distribution. Approximately 68% of the data values will be within 1 standard deviation of the mean. Approximately 95% of the data values will be within 2 standard deviations of the mean. Almost all of the data values (99.7%) will be within 3 standard deviations of the mean.
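The three percentages in the empirical rule can be checked against the normal distribution. The sketch below assumes the bell-shaped distribution is exactly normal and uses the standard identity P(|Z| < k) = erf(k / sqrt(2)); the function name is illustrative.

```python
import math


def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```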

crosstabulation

A tabular summary of data for two variables. The classes of one variable are represented by the rows; the classes for the other variable are represented by the columns.

relative frequency distribution

A tabular summary of data showing the fraction or proportion of observations in each of several nonoverlapping categories or classes.

frequency distribution

A tabular summary of data showing the number (frequency) of data values in each of several nonoverlapping bins.

percent frequency distribution

A tabular summary of data showing the percentage of observations in each of several nonoverlapping bins/classes.

dendrogram

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering.

z-score, also called standardized value

A value computed by dividing the deviation about the mean (xi - x̄) by the standard deviation s. A z-score is referred to as a standardized value and denotes the number of standard deviations that xi is from the mean: zi = (xi - x̄) / s, where zi is the z-score for xi, x̄ is the sample mean, and s is the sample standard deviation. Excel function: =STANDARDIZE
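As an illustration of the z-score formula, the following Python sketch standardizes a small made-up data set using the sample mean and the sample standard deviation (n - 1 divisor) from the standard library.

```python
import statistics


def z_scores(values):
    """Standardize each value: (x_i - sample mean) / sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in values]


# Hypothetical data; the middle value equals the mean, so its z-score is 0.
scores = z_scores([1, 2, 3, 4, 5])
print(scores)
```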

percentile

A value such that approximately p% of the observations have values less than the pth percentile; hence, approximately (100 - p)% of the observations have values greater than the pth percentile. The 50th percentile is the median.

market basket analysis

Analysis of items frequently co-occurring in transactions (such as purchases). Example: in the rule "if bread and jelly, then peanut butter," bread and jelly are the antecedents and peanut butter is the consequent.

data scientists

Analysts trained in both computer science and statistics who know how to effectively process and analyze massive amounts of data.

Descriptive analytics

Analytical tools that describe what has happened and why it happened. Examples are data queries, reports, descriptive statistics, data visualization including data dashboards, some data-mining techniques, and basic what-if spreadsheet models.

big data

Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.

volume

Because data are collected electronically, we are able to collect more of it. To be useful, these data must be stored, and this storage has led to vast quantities of data. Many companies now store in excess of 100 terabytes of data

number of bins

Bins are formed by specifying the ranges used to group the data. As a general guideline, we recommend using from 5 to 20 bins. For a small number of data items, as few as five or six bins may be used to summarize the data. For a larger number of data items, more bins are usually required.

complementary colors

Colors located directly opposite one another on the color wheel

categorical data

Data for which categories of like items are identified by labels or names. Arithmetic operations cannot be performed on categorical data.

quantitative data

Data for which numerical values are used to indicate magnitude, such as how many or how much. Arithmetic operations such as addition, subtraction, and multiplication can be performed on quantitative data.

unstructured data (text data)

Data, such as text, audio, or video, that cannot be stored in a traditional structured database. Unlike structured data, text data may contain slang, typos, all caps, sarcasm, etc.

The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:

Determine the number of nonoverlapping bins. Determine the width of each bin. Determine the bin limits.

when we collect data we are:

Gathering the past observed values of a variable. By collecting these past values, our goal is to learn more about variation in that variable, such as for particular index companies.

euclidean distance

Geometric measure of dissimilarity between observations based on the Pythagorean theorem. Let u = (u1, u2, ..., uq) and v = (v1, v2, ..., vq) each comprise measurements of q variables. The distance between u and v is duv = sqrt[(u1 - v1)^2 + (u2 - v2)^2 + ... + (uq - vq)^2]. Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values. Euclidean distance is highly influenced by the scale on which variables are measured.
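To see both the formula and the effect of measurement scale, here is a minimal Python sketch. Each coordinate difference (ui - vi) is squared and summed before taking the square root. The two customers and their values are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """sqrt of the sum of squared differences (u_i - v_i)^2 over all q variables."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


# Hypothetical customers measured as (age in years, income in dollars).
in_dollars = euclidean_distance((25, 40_000), (45, 41_000))

# Same customers with income expressed in thousands of dollars.
in_thousands = euclidean_distance((25, 40), (45, 41))

print(in_dollars)    # dominated by the income difference
print(in_thousands)  # dominated by the age difference
```

Because the result depends so heavily on units, variables are often standardized (for example with z-scores) before distances are computed.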

decision making can be defined as the following process:

Identify and define the problem. Determine the criteria that will be used to evaluate alternative solutions. Determine the set of alternative solutions. Evaluate the alternatives. Choose an alternative.

__________ is the most critical step of the decision-making process.

Identifying and defining the problem.

variety

In addition to the sheer volume and speed with which companies now collect data, more complicated types of data are now available and are proving to be of great value to businesses. Text data are collected by monitoring what is being said about a company's products or services on social media platforms such as Twitter. Audio data are collected from service calls (on a service call, you will often hear "this call may be monitored for quality control"). Video data collected by in-store video cameras are used to analyze shopping behavior. Analyzing information generated by these nontraditional sources is more complicated in part because of the processing required to transform the data into a numerical form that can be analyzed.

prior probability

Initial estimate of the probabilities of events.

what falls under predictive analysis and is referred to as risk analysis?

Linear regression, time series analysis, some data-mining techniques, and simulation

location of the pth percentile

Lp = p/100*(n+1)
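The location formula gives a position in the sorted data, which may fall between two observations. The sketch below assumes linear interpolation between the neighboring values when the location is not a whole number; that interpolation step is an assumption, since the card states only the location formula.

```python
def percentile(values, p):
    """pth percentile using the location L_p = (p / 100) * (n + 1) on sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    location = p / 100 * (n + 1)   # 1-based position in the sorted data
    lower = int(location)          # whole-number part of the position
    fraction = location - lower    # how far to move toward the next value
    if lower < 1:                  # position falls before the first observation
        return ordered[0]
    if lower >= n:                 # position falls at or after the last observation
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [10, 20, 30, 40]            # hypothetical data, n = 4
print(percentile(data, 25))        # location 1.25: a quarter of the way from 10 to 20
print(percentile(data, 50))        # location 2.5: halfway between 20 and 30
```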

complete linkage

Measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters. defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

single linkage

Measure of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters. The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar. thus, single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster.

Group average linkage

Measure of calculating dissimilarity between clusters by considering the distance between each pair of observations between two clusters. Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters
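The complete, single, and group average linkage measures defined above differ only in how they summarize the pairwise distances between two clusters. The Python sketch below assumes Euclidean distance as the dissimilarity measure and uses two small made-up clusters.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one observation from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Distance between the two most similar (closest) observations."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the two most dissimilar (farthest) observations."""
    return max(pairwise_distances(cluster_a, cluster_b))


def group_average_linkage(cluster_a, cluster_b):
    """Average distance over all pairs of observations between the clusters."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical clusters on a line; pairwise distances are 4, 6, 3, and 5.
a = [(0, 0), (1, 0)]
b = [(4, 0), (6, 0)]
print(single_linkage(a, b), complete_linkage(a, b), group_average_linkage(a, b))
```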

Jaccard coefficient

Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries. Jaccard coefficient = (number of variables with matching nonzero value for observations u and v) / (total number of variables - number of variables with matching zero value for observations u and v).
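A short Python sketch of the Jaccard coefficient for two binary observations follows. Matching zeros are removed from the denominator, so only variables where at least one observation has a 1 are counted. The example vectors are made up.

```python
def jaccard_coefficient(u, v):
    """Matching 1s divided by (number of variables minus matching 0s)."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    matching_ones = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    matching_zeros = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - matching_zeros
    # If every variable is a matching zero, there is nothing to compare.
    return matching_ones / denominator if denominator else 0.0


# Hypothetical purchase indicators for five products.
u = [1, 0, 1, 0, 1]
v = [1, 1, 0, 0, 1]
similarity = jaccard_coefficient(u, v)   # 2 matching ones, 1 matching zero
print(similarity)
```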

median linkage

Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters. is analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.

illegitimately missing data

Missing data that do not occur naturally. These cases can result for a variety of reasons, such as a respondent electing not to answer a question that she or he is expected to answer, a respondent dropping out of a study before its completion, or sensors or other electronic data collection equipment failing during a study.

Two Events are Independent If...

P(A|B)=P(A) or P(B|A) = P(B)

Gestalt principles

Principles that describe the brain's organization of sensory information into meaningful units and patterns.

hierarchical clustering

Process of agglomerating observations into a series of nested groups based on a measure of similarity.

k-means clustering

Process of organizing observations into one of k distinct groups based on a measure of similarity, typically Euclidean distance.

MapReduce

Programming model used within Hadoop that performs the two major steps for which it is named: the map step and the reduce step. The map step divides the data into manageable subsets and distributes it to the computers in the cluster for storing and processing. The reduce step collects answers from the nodes and combines them into an answer to the original problem
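The map and reduce steps can be imitated on a single machine with a word-count example. This Python sketch only mimics the idea in one process; it does not use Hadoop, and the function names and documents are illustrative assumptions.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(documents, n_workers):
    """Map step, part 1: divide the data into manageable subsets, one per worker."""
    return [documents[i::n_workers] for i in range(n_workers)]


def count_words(chunk):
    """Map step, part 2: each worker processes only its own subset."""
    counts = Counter()
    for document in chunk:
        counts.update(document.lower().split())
    return counts


def combine(partial_counts):
    """Reduce step: merge the workers' partial answers into one overall answer."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


documents = ["big data is big", "data is stored", "big clusters process data"]
chunks = split_into_chunks(documents, n_workers=2)
word_counts = combine(count_words(chunk) for chunk in chunks)
print(word_counts.most_common(3))
```

In a real cluster, each chunk would be stored and processed on a different computer, and only the small partial counts would travel back for the reduce step.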

data security

Protecting stored data from destructive forces or unauthorized users.

posterior probabilities

Revised probabilities of events based on additional information.

As an example of the addition law, consider a study conducted by the human resources manager of a major computer software company. The study showed that 30% of the employees who left the firm within two years did so primarily because they were dissatisfied with their salary, 20% left because they were dissatisfied with their work assignments, and 12% of the former employees indicated dissatisfaction with both their salary and their work assignments. What is the probability that an employee who leaves within two years does so because of dissatisfaction with salary, dissatisfaction with the work assignment, or both?

S = the event that an employee leaves because of salary. W = the event that an employee leaves because of work assignment. P(S) = 0.30, P(W) = 0.20, P(S∩W) = 0.12. P(S∪W) = P(S) + P(W) - P(S∩W) = 0.30 + 0.20 - 0.12 = 0.38

When you use Excel, you can sort your data on the HOME tab, in the EDITING group, or on the DATA tab.

Sort & Filter in Excel. You can also right-click and use the drop-down menu > Sort option.

imputation

Systematic replacement of missing values with values that seem reasonable.

Prescriptive Analytics

Techniques that analyze input data and yield a best course of action; they answer the question of what should be done. Unlike descriptive and predictive models, which provide a forecast or prediction but do not provide a decision, prescriptive models recommend a decision. They are used to construct optimal portfolios of investments, to allocate assets, and to create optimal capital budgeting plans. For example, GE Asset Management uses optimization models to decide how to invest its own cash received from insurance policies and other financial products, as well as the cash of its clients, such as Genworth Financial. The estimated benefit from the optimization models was $75 million over a five-year period.

predictive analytics

Techniques that use models constructed from past data to predict the future or to ascertain the impact of one variable on another. They answer the questions of what will happen and when it will happen. For example, past data on product sales may be used to construct a mathematical model to predict future sales. This model can factor in the product's growth trajectory and seasonality based on past patterns. A packaged-food manufacturer may use point-of-sale scanner data from retail outlets to help in estimating the lift in unit sales due to coupons or sales events. Survey data and past purchase behavior may be used to help predict the market share of a new product. Predictive analytics is also used to forecast financial performance, to assess the risk of investment portfolios and projects, and to construct financial instruments such as derivatives.

data cleansing

The data in a data set are often said to be "dirty" and "raw" before they have been put into a form that is best suited for investigation, analysis, and modeling. Common tasks in data preparation include treating missing data, identifying erroneous data and outliers, and defining the appropriate way to represent variables.

Complement of A

The event consisting of all outcomes that are not in A, denoted Ac. Example: in a coin flip, if heads is event A, then tails is the complement of A.

intersection of A and B

The event containing the outcomes belonging to both A and B. The intersection of A and B is denoted A∩B.

support count

The number of times that a collection of items occurs together in a transaction data set.

Market Segmentation

The partitioning of customers into groups that share common characteristics so that a business may target customers within a group with a tailored marketing strategy.

Joint Probability

The probability of two events both occurring; in other words, the probability of the intersection of two events.

tokenization

The process of dividing text into separate terms, referred to as tokens. First, symbols and punctuations must be removed from the document and all letters should be converted to lowercase. For example, "Awesome!", "awesome," and "#Awesome" should all be converted to "awesome."
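The cleanup described above (lowercasing, then dropping symbols and punctuation) can be sketched with a regular expression in Python. This is a deliberately simple tokenizer: it also splits contractions such as "don't" into two tokens, which a production tokenizer would handle differently.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


tokens = tokenize('Awesome! awesome, #Awesome')
print(tokens)  # all three variants reduce to the same token
```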

text mining

The process of extracting useful information from text data. To be analyzed, text data must first be converted to structured data so that the tools of descriptive statistics, data visualization, and data mining can be applied.

lift ratio

The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction. Lift ratio = confidence / (support count of consequent / total number of transactions). A lift ratio greater than one suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than having no rule at all.
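For an association rule, the lift ratio can be computed directly from support counts. The Python sketch below uses eight made-up transactions and takes confidence to be the support count of antecedent and consequent together divided by the support count of the antecedent.

```python
# Hypothetical transaction data: each set is one customer's basket.
transactions = [
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly", "peanut butter"},
    {"bread", "jelly"},
    {"bread", "peanut butter"},
    {"milk"},
    {"peanut butter", "milk"},
    {"bread", "jelly", "peanut butter", "milk"},
    {"milk", "bread"},
]


def support_count(items):
    """Number of transactions that contain every item in the collection."""
    return sum(1 for basket in transactions if items <= basket)


antecedent = {"bread", "jelly"}
consequent = {"peanut butter"}

# Confidence: share of antecedent transactions that also contain the consequent.
confidence = support_count(antecedent | consequent) / support_count(antecedent)

# Baseline: share of all transactions that contain the consequent.
consequent_share = support_count(consequent) / len(transactions)

lift_ratio = confidence / consequent_share
print(confidence, consequent_share, lift_ratio)
```

Here the rule identifies peanut butter buyers somewhat better than picking transactions at random, because the lift ratio is above one.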

business analytics

The scientific process of transforming data into insight for making better decisions. Used for data-driven or fact-based decision making. Also called: - data analytics - data intelligence - business science

utility theory

The study of the total worth or relative desirability of a particular outcome that reflects the decision maker's attitude toward a collection of factors such as profit, loss, and risk.

marginal probabilities

The values in the margins of a joint probability table that provide the probabilities of each event separately.

independent events

Two events A and B are independent if P(A|B) = P(A) or P(B|A) = P(B); the events do not influence each other. The probability of event A is not changed by the occurrence of event B.

The random numbers generated using Excel's RAND function follow a __________ probability distribution between 0 and 1.

UNIFORM

What makes decision making difficult and challenging?

Uncertainty is probably the number one challenge.

the event containing the outcomes belonging to A or B or both is the ______ (∪) of A and B

Union

Centroid linkage

Uses the averaging concept of cluster centroids to define between-cluster similarity. The centroid for cluster k, denoted as Ck , is found by calculating the average value for each variable across all observations in a cluster

Colors, hue

A powerful preattentive attribute: with color you can control the audience's attention or alert them to something important. Besides hue, color models include: - saturation (intensity) - value (brightness)

sample

a subset of the population. For example, with the thousands of publicly traded companies in the United States, tracking and analyzing all of these stocks every day would be too time consuming and expensive. The Dow represents a sample of 30 stocks of large public companies based in the United States, and it is often interpreted to represent the larger population of all publicly traded companies. It is very important to collect sample data that are representative of the population data so that generalizations can be made from them.

Predictive and prescriptive analytics are sometimes referred as

advanced analytics

Spreadsheet models are referred to as what-if models because they

allow easy instantaneous recalculation for a change in model inputs.

A normally distributed error term with a mean of zero would

allow more accurate modeling

orientation

Another preattentive attribute. Similar to shape, it can be useful for showing categorical comparisons or as a direction icon. Example: an arrow pointing up could mean something has increased, and an arrow pointing down could mean something has decreased.

data bars

applies a gradient or filled bar in which the width of the bar represents the cell's value with respect to other cells.

icon sets

are symbols or signs that classify data into categories based on the values in a range

top-down k-means clustering

Assigning each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible. The number of clusters, k, must be specified in advance. The algorithm repeats this process (calculate each cluster centroid, assign each observation to the cluster with the nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached. If you know how many clusters you want and have a larger data set (more than 500 observations), choose k-means.

to minimize visual interference

Avoid placing text on backgrounds that make it difficult to read. A subtle, low-contrast background with little texture will interfere less.
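The assign-and-recompute loop of k-means can be sketched in plain Python. This is a minimal illustration under two assumptions that are not in the card: the first k observations serve as the initial centroids, and Euclidean distance measures similarity. Real implementations typically use better initialization.

```python
import math


def k_means(points, k, max_iterations=100):
    """Assign each observation to one of k clusters by nearest centroid."""
    centroids = [points[i] for i in range(k)]   # naive start: first k observations
    assignment = None
    for _ in range(max_iterations):
        # Assign each observation to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:        # no change in the clusters: stop
            break
        assignment = new_assignment
        # Recalculate each centroid as the average of its cluster's observations.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


# Hypothetical two-variable observations forming two visible groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```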

As the number of degrees of freedom for a t distribution increases, the difference between the t distribution and the standard normal distribution

becomes smaller

Data Visualization

Can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Very helpful for identifying data errors and for reducing the size of your data set by highlighting important relationships and trends. Important in conveying your analysis to others and in illuminating the data to gain insights.

25th, 50th, 75th percentiles (quartiles)

can be found using the QUARTILE.EXC function in Excel (=QUARTILE.EXC()) or by using a percentile function.

Size

can be used to encode both categorical and quantitative data. makes it easier to understand.

color value or brightness

can be very useful to encode quantitative values, for example in a sequential or diverging color scheme. Value is perceived as ordered.

If arithmetic operations cannot be performed on the data, they are considered to be:

categorical data

width of the bins

choose a width for the bins. As a general guideline, we recommend that the width be the same for each bin. Thus the choices of the number of bins and the width of bins are not independent decisions. A larger number of bins means a smaller bin width and vice versa. To determine an approximate bin width, we begin by identifying the largest and smallest data values.

As we increase the cutoff value, _______ error will decrease and _________ error will rise.

class 0; class 1

The ___________ is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.

coefficient of determination

color hints

color is subjective, and color theory is a science. focus on when and how to use color in visualization in unity. pure colors highlight important elements. subdued hues for everything else. limit to 2-3 color choices. focus on applying color efforts to visual targets.

highlight

color is used to highlight one data point or category.

Preattentive attributes

color, size, orientation, and texture. Your brain processes the information prior to focusing attention on anything.

Alerting

colors are used to get the reader's attention. Usually done with an alarming/alerting color to tell the reader that something is wrong. In western culture, red is associated with bad.

diverging

colors encode a quantitative value that has a mid-point. The mid-point can be zero, the average, or a target you want to set.

categorical

colors encode categories; contrasting colors for individual comparison.

sequential

colors encode a quantitative value from low to high.

analogous colors

colors that are next to each other on the color wheel

A(n) __________ matrix displays a model's correct and incorrect classification.

confusion
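
As a rough illustration, a confusion matrix for a two-class model can be tallied by comparing each predicted class with the actual class. The function names, the sample probabilities, and the two cutoff values below are hypothetical. The example also shows how moving the cutoff changes the class 0 and class 1 error rates.

```python
def confusion_matrix(actual, probs, cutoff):
    """Count the four outcomes for a two-class model at a given cutoff."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for y, p in zip(actual, probs):
        predicted = 1 if p >= cutoff else 0
        if y == 1:
            counts["TP" if predicted == 1 else "FN"] += 1
        else:
            counts["FP" if predicted == 1 else "TN"] += 1
    return counts

def class_errors(counts):
    """Return (class 1 error rate, class 0 error rate)."""
    class1_error = counts["FN"] / (counts["TP"] + counts["FN"])
    class0_error = counts["FP"] / (counts["FP"] + counts["TN"])
    return class1_error, class0_error

actual = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.1]  # made-up predicted probabilities of class 1

low = confusion_matrix(actual, probs, cutoff=0.35)
high = confusion_matrix(actual, probs, cutoff=0.65)
print(low, class_errors(low))
print(high, class_errors(high))
```

With these numbers, raising the cutoff from 0.35 to 0.65 lowers the class 0 error rate and raises the class 1 error rate.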

corporate-level managers use _______ to summarize sales by region, current inventory levels, and other company-wide metrics all in a single screen.

data dashboards

A retail store owner offers a discount on product A and predicts that the customers would purchase products B and C in addition to product A. Identify the technique used to make such a prediction.

data mining

the extraction of information on the number of shipments, how much was included in each shipment, the date each shipment was sent, and so on from the manufacturing plant's database exemplifies:

data queries.

__________ is a method of extracting data relevant to the business problem under consideration. It is the first step in the data mining process.

data sampling

When a decision maker is faced with several alternatives and an uncertain set of future events, s/he uses __________ to develop an optimal strategy.

decision analysis

event

defined as a collection of outcomes. For example, consider the case of an expansion project being undertaken by California Power & Light Company (CP&L). shows the number of past construction projects that required 8, 9, 10, 11, and 12 months.

In a linear regression model, the variable that is being predicted or explained is known as _____________. It is denoted by y and is often referred to as the response variable.

dependent variable

The mean absolute error, mean squared error, and mean absolute percentage error are all methods to measure the accuracy of a forecast. These methods measure forecast accuracy by

determining how well a particular forecasting method is able to reproduce the time series data that are already available.
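
The three accuracy measures can be computed directly from a list of actual values and the corresponding forecasts. This sketch assumes the common convention that forecast error = actual value minus forecast. The function name and the numbers are illustrative only.

```python
def forecast_accuracy(actual, forecast):
    """Return (MAE, MSE, MAPE) for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]  # forecast error per period
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n                 # mean absolute error
    mse = sum(e * e for e in errors) / n                  # mean squared error
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100  # in percent
    return mae, mse, mape

actual = [100, 200, 400]
forecast = [110, 180, 400]
mae, mse, mape = forecast_accuracy(actual, forecast)
print(mae, mse, mape)
```

MAPE divides by the actual values, so it is undefined when an actual value is zero.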

A cluster's __________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

durability or strength

highlight cell rules

enables you to apply a highlight to the cells that meet your condition

top/bottom rules

enables you to specify the top or bottom number or percentage, or values that are above or below the average value in a specified range.

In the simple linear regression model, the ____________ accounts for the variability in the dependent variable that cannot be explained by the linear relationship between the variables.

error term

Determine a freshman's likely first-year grade point average from the student's Scholastic Aptitude Test (SAT) score, high school grade point average, and number of extra-curricular activities. This is an example of

estimation of a continuous outcome

__________ is the amount by which the predicted value differs from the observed value of the time series variable.

forecast error

Excel's __________ tool allows the user to determine the value of an input cell that will cause the value of a related output cell to equal some specified value.

goal seek

cluster analysis questions

how do we measure similarity? how do we form clusters? how many groups do we form?

when moving from knowledge to strategy

how do we, how can we, and what can we...

when moving from information to strategy these are the: how questions

how many, how much what are

veracity

how much uncertainty is in the data. Inconsistencies in units of measure and the lack of reliability of responses in terms of bias also increase the complexity of the data. For example, the data could have many missing values, which makes reliable analysis a challenge.

increasing the color intensity

increasing saturation and brightness draws the eye and means the point is more important.

In a linear regression model, the variable (or variables) used for predicting or explaining values of the response variable are known as the __________. It(they) is(are) denoted by x.

independent variable

An estimate of a population parameter that provides an interval of values believed to contain the value of the parameter is known as the

interval estimate

shape

is a preattentive attribute. can be used for categorical comparisons but not useful for quantitative comparisons.

texture

is another preattentive attribute. It was common to use texture in data visualization when printers only had black and white. Useful to encode categorical data, not quantitative data.

knowledge

is awareness and understanding of a set of information, and ways it can be used to support a task. - information and skills acquired through experience or education; the theoretical or practical understanding of a subject.

position, length, or height

is much better for showing precise quantitative comparisons.

information

is the collection of data organized in such a way that they have value beyond the facts themselves. - organized facts provided or learned about something or someone.

The goal of data mining is to

extract patterns and knowledge from large amounts of data.

by changing the color of the important data and the irrelevant data

it makes the important data easier to see and read.

_________ attempts to classify a categorical outcome as a linear function of explanatory variables.

logistic regression

nonexperimental, or observational, studies

make no attempt to control the variables of interest. 1) Identify research questions and variables. 2) develop survey/interview, then distribute and collect. A survey is perhaps the most common type of observational study. For instance, in a personal interview survey, research questions are first identified. Then a questionnaire is designed and administered to a sample of individuals.

A __________ decision is one in which companies have to decide whether they should manufacture a product or outsource production to another firm.

make versus buy

Cluster analysis is commonly used in marketing to divide consumers into different homogeneous groups, a process known as

market segmentation

legitimately missing data

missing data that occur naturally. For example, respondents to a survey may be asked if they belong to a fraternity or a sorority, and then in the next question are asked how long they have belonged to a fraternity or a sorority. If a respondent does not belong to a fraternity or a sorority, she or he should skip the ensuing question about how long.

__________ refers to the degree of correlation among independent variables in a regression model.

multicollinearity

bin limits

must be chosen so that each data item belongs to one and only one class. The lower bin limit identifies the smallest possible data value assigned to the bin. The upper bin limit identifies the largest possible data value assigned to the class.

A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size

n has the same probability of being selected.

in a normal distribution, which is greater, the mean or the median?

neither; in a normal distribution the mean, median, and mode are all equal.

probability is the

numerical measure of the likelihood that an event will occur.

we can use charts to visualize our data and

obtain more information about the data set.

With reference to a spreadsheet model, an uncontrollable model input is known as a(n)

parameter

What do nodes in an influence diagram represent?

parts of the model

It is the responsibility of managers to

plan, coordinate, organize, and lead their organizations to better performance. Ultimately, managers' responsibilities require that they make strategic, tactical, or operational decisions.

A simple random sample of 31 observations was taken from a large population. The sample mean equals 5. Five is a

point estimate

The purpose of statistical inference is to make estimates or draw conclusions about a

population based upon information obtained from the sample

Bayes' theorem is a method used to compute _______ probabilities

posterior

A forecast that helps direct police officers to areas where crimes are likely to occur based on past data is an example of

predictive analytics.

color wheel

primary colors are red, yellow, and blue. secondary colors are green, orange, and purple. which are created by mixing the primary colors.

One of the most important uses of a histogram is to

provide information about the shape, or form, of a distribution: moderately skewed left, moderately skewed right, symmetric, or highly skewed right.

probability distribution

describes how likely each possible value of a random variable is. - it is useful when you want to know which outcomes are most likely

A time series plot of a period of time (quarterly) versus quarterly sales (in $1,000s) is shown below. Which of the following data patterns best describes the scenario shown

seasonal pattern and linear trend

The goal of clustering is to

segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration. Commonly used in marketing to divide customers into different homogeneous groups; known as market segmentation. Cluster analysis can also be used to identify outliers.

three primary ways to use colors in data visualization

sequential, diverging, and categorical.

the triangular distribution is a good model for _____ distributions

skewed

In a simple linear regression analysis the quantity that gives the amount by which the dependent variable changes for a unit change in the independent variable is called the

slope of the regression line

the least squares regression line minimizes the sum of the

squared differences between actual and predicted y values
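
For simple linear regression, the least squares slope and intercept have closed-form formulas, and the coefficient of determination follows from the resulting sums of squares. The sketch below uses those standard formulas. The function names and the small data set are made up for illustration.

```python
def least_squares(x, y):
    """Return (intercept b0, slope b1) of the least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # change in y for a one-unit change in x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(x, y, b0, b1):
    """Proportion of the variability in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
r2 = r_squared(x, y, b0, b1)
print(b0, b1, r2)
```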

bottom-up hierarchical clustering

starts with each observation belonging to its own cluster, then sequentially merges the most similar clusters to create a series of nested clusters. Suited to small data sets (fewer than 500 observations).

visual perception

the ability to interpret the surrounding environment by processing information that is contained in visible light.

Which of the following statements is correct?

the binomial distribution is a discrete probability distribution and the normal distribution is a continuous probability distribution.

correlation

the correlation coefficient is the most widely used measure of the strength of the relationship: a standardized measure of linear association between two variables that takes on values between −1 and +1. Values near −1 indicate a strong negative linear relationship, values near +1 indicate a strong positive linear relationship, and values near zero indicate the lack of a linear relationship. Population correlation: ρxy = σxy/(σx σy). Sample correlation: rxy = sxy/(sx sy), where rxy = sample correlation coefficient, sxy = sample covariance, sx = sample standard deviation of x, and sy = sample standard deviation of y.
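
A short Python sketch of the sample correlation coefficient, computed as the sample covariance divided by the product of the two sample standard deviations. The function name and the data are illustrative assumptions.

```python
import math

def sample_correlation(x, y):
    """Sample correlation: sample covariance / (sample std dev of x * sample std dev of y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return sxy / (sx * sy)

r_pos = sample_correlation([1, 2, 3, 4], [2, 4, 6, 8])  # y rises exactly in line with x
r_neg = sample_correlation([1, 2, 3, 4], [8, 6, 4, 2])  # y falls exactly in line with x
print(r_pos, r_neg)
```

The two calls return values at the extremes of the range, +1 and −1, up to floating-point rounding.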

relative frequency

the fraction or percent of the time that an event occurs in an experiment. Relative frequency of a bin = frequency of the bin / n

A procedure for using sample data to find the estimated regression equation is

the least squares method

logic

the objects in the same group or cluster are more similar to each other than to those in other groups or clusters. For example, class: instructor and students; gender: male, female, or a third option; status: freshman, sophomore, junior, senior.

Remedial action is considered for illegitimately missing data.

the primary options for addressing such missing data are (1) to discard observations (rows) with any missing values, (2) to discard any variable (column) with missing values, (3) to fill in missing entries with estimated values, or (4) to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.

Data can be categorized in several ways based on how

they are collected and the type collected.

A time series plot of a period of time (in weeks) versus sales (in 1,000's of gallons) is shown below. Which of the following data patterns best describes the scenario shown?

time series with a horizontal pattern

Simulation optimization helps

to find good decisions in highly complex and highly uncertain settings.

Which of the following states the objective of time series analysis?

to uncover a pattern in a time series and then extrapolate the pattern into the future

Which of the following would be a likely mathematical expression for Total Revenue?

total revenue = production volume * revenue per unit.

The impact of two inputs on the output of interest is summarized by a

two-way data table

A positive forecast error indicates that the forecasting method ________ the dependent variable

underestimated

order is a powerful way to

understand your data. by ordering your data it becomes easier to see patterns.

The goal of __________ is to use the variable values to identify relationships between observations.

unsupervised learning

In which of the following scenarios would it be appropriate to use hierarchical clustering?

when binary or ordinal data needs to be clustered

covariance

A descriptive measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship. Sample covariance, for a sample of size n with observations (x1, y1), ..., (xn, yn): sxy = Σ(xi - x̅)(yi - ȳ)/(n - 1). Population covariance: σxy = Σ(xi - µx)(yi - µy)/N.
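
The sample and population versions differ only in the divisor (n − 1 versus N). A minimal sketch, with an illustrative function name and made-up data:

```python
def covariance(x, y, sample=True):
    """Covariance of paired data; divides by n - 1 for a sample, by N for a population."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / (n - 1) if sample else total / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_xy = covariance(x, y)                      # treat the data as a sample
sigma_xy = covariance(x, y, sample=False)    # treat the data as the whole population
print(s_xy, sigma_xy)
```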

population mean

μ = (Σ Xi)/N

measuring similarity between observations:

bottom-up hierarchical clustering; top-down k-means clustering.

frequency term-document matrix

A matrix whose rows represent documents and columns represent tokens (terms), and the entries in the matrix are the frequency of occurrence of each token (term) in each document.
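
A small sketch of how such a matrix could be built. Splitting on whitespace and lowercasing is an assumed, simplistic tokenization, and the three short documents are invented for the example.

```python
def term_document_matrix(documents):
    """Rows are documents, columns are terms, entries are term counts."""
    tokenized = [doc.lower().split() for doc in documents]  # assumption: whitespace tokens
    terms = sorted({token for doc in tokenized for token in doc})
    matrix = [[doc.count(term) for term in terms] for doc in tokenized]
    return terms, matrix

corpus = ["great service great food", "slow service", "food was great"]
terms, matrix = term_document_matrix(corpus)
print(terms)
for row in matrix:
    print(row)
```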

Strategic decision

A decision that involves higher-level issues and that is concerned with the overall direction of the organization, defining the overall goals and aspirations for the organization's future

geometric mean

A measure of central location that is calculated by finding the nth root of the product of n values. x̅g = (x1 * x2 * ... * xn)^(1/n)
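
In Python the nth root of the product can be written with `math.prod` (available in Python 3.8 and later). The function name and the values are illustrative.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

gm_two = geometric_mean([2, 8])       # square root of 16
gm_three = geometric_mean([1, 3, 9])  # cube root of 27
print(gm_two, gm_three)
```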

coefficient of variation

A measure of relative variability computed by dividing the standard deviation by the mean and multiplying by 100. CV = (standard deviation / mean) * 100%

Variance

A measure of variability based on the squared deviations of the data values about the mean. Population variance: σ² = Σ(xi - μ)²/N. Sample variance: s² = Σ(xi - x̅)²/(n - 1).

standard deviation

A measure of variability computed by taking the positive square root of the variance. s = sqrt(s^2) σ = sqrt(σ² )
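
Python's standard `statistics` module provides both the population and the sample versions of these measures. The sample versions divide by n − 1. The data below are made up, and the coefficient of variation is computed here from the sample standard deviation, which is an assumption about which version is wanted.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)   # divides by N
samp_var = statistics.variance(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)       # square root of the population variance
samp_sd = statistics.stdev(data)       # square root of the sample variance

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv = samp_sd / statistics.mean(data) * 100
print(pop_var, samp_var, pop_sd, samp_sd, cv)
```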

approximate bin width

(largest data value- smallest data value)/number of bins
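
The bin-width formula leads straight into a frequency distribution. In this sketch each bin includes its lower limit and excludes its upper limit, except the last bin, which also includes the largest value. That convention, the function name, and the data are assumptions made for illustration.

```python
def frequency_distribution(data, num_bins):
    """Return (bin width, frequency per bin, relative frequency per bin)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins  # approximate bin width
    counts = [0] * num_bins
    for value in data:
        # The largest value would land one past the end, so clamp it into the last bin.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    relative = [c / len(data) for c in counts]  # frequency of the bin / n
    return width, counts, relative

data = [12, 14, 15, 17, 19, 20, 22, 25, 28, 32]
width, counts, relative = frequency_distribution(data, num_bins=4)
print(width, counts, relative)
```

In practice the computed width is usually rounded to a convenient number before the bin limits are set.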

the human brain can remember approximately

10,000 visuals with an 83% recollection rate.

optimization models

A mathematical model that gives the best decision, subject to the situation's constraints.

variable

A characteristic or quantity of interest that can take on different values.

corpus

A collection of documents to be analyzed.

operational decisions

A decision concerned with how the organization is run from day to day -domain of operations managers. + closest to the customers.

histograms

A graphical presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the bin intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.

scatter chart

A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis.

random variable, or uncertain variable

A quantity whose values are not known with certainty

pivot table

An interactive crosstabulation created in Excel.

Hadoop

An open-source programming environment that supports big data processing through distributed storage and distributed processing on clusters of computers.

outliers

An unusually large or unusually small data value.

unsupervised learning

Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process. - a descriptive data-mining technique used to identify relationships between observations. There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results. Thought of as high-dimensional descriptive analytics because these techniques are designed to describe patterns and relationships in large data sets with many observations of many variables.

Probability of an Event

Equal to the sum of the probabilities of outcomes for the event. denoted as P()

mutually exclusive events

Events that have no outcomes in common. A∩B is empty, so P(A∩B) = 0 and P(AUB) = P(A) + P(B)

data mining example for predictive analysis

For example, a large grocery store chain might be interested in developing a targeted marketing campaign that offers a discount coupon on potato chips. By studying historical point-of-sale data, the store may be able to use data mining to predict which customers are the most likely to respond to an offer on discounted chips by purchasing higher-margin items such as beer or soft drinks in addition to the chips, thus increasing the store's overall revenue.

matching coefficient

Measure of similarity between observations based on the number of matching values of categorical variables.
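
A minimal sketch, assuming the matching coefficient is computed as the number of variables with matching values divided by the total number of variables. The function name and the two observations are hypothetical.

```python
def matching_coefficient(obs_a, obs_b):
    """Share of categorical variables on which two observations agree."""
    matches = sum(1 for u, v in zip(obs_a, obs_b) if u == v)
    return matches / len(obs_a)

# Two survey respondents described by five yes/no variables.
obs1 = ["yes", "no", "yes", "yes", "no"]
obs2 = ["yes", "yes", "yes", "no", "no"]
similarity = matching_coefficient(obs1, obs2)
print(similarity)
```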

Computing Probability Using the complement

P(A) = 1 - P(Ac)

three events with addition law: A,B,C

P(AUBUC) = P(A)+P(B)+P(C) - P(A∩B)-P(A∩C)-P(B∩C) +P(A∩B∩C)
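
The three-event addition law can be checked on a small example with equally likely outcomes. The sample space (the integers 1 to 12) and the three events are made up for illustration. Exact fractions are used so the comparison is not affected by rounding. The last line also checks the complement rule.

```python
from fractions import Fraction

sample_space = set(range(1, 13))  # 12 equally likely outcomes

def prob(event):
    """Probability of an event = favorable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {n for n in sample_space if n % 2 == 0}  # even numbers
B = {n for n in sample_space if n % 3 == 0}  # multiples of 3
C = {n for n in sample_space if n > 8}       # numbers above 8

direct = prob(A | B | C)
by_formula = (prob(A) + prob(B) + prob(C)
              - prob(A & B) - prob(A & C) - prob(B & C)
              + prob(A & B & C))

complement_check = 1 - prob(sample_space - A)  # should equal P(A)
print(direct, by_formula, complement_check)
```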

What are the two decisions that you can make from performing a hypothesis test?

REJECT The null hypothesis, fail to reject the null hypothesis.

Data

Raw data; The facts and figures collected, analyzed, and summarized for presentation and interpretation.

velocity

Real-time capture and analysis of data present unique challenges both in how data are stored and the speed with which those data can be analyzed for decision making. For example, the New York Stock Exchange collects 1 terabyte of data in a single trading session, and having current data and real-time rules for trades and predictive modeling are important for managing stock portfolios.

harmony effect

Shapes that have similar characteristics are visually read as harmonious.

Picks and Axes Inc. is an Internet-based retail seller of hiking boots and mountaineering gear. The company decides to open retail stores across the major areas of the city to help complement its Internet-based strategy. This activity would be categorized as a(n)

Strategic decision

confidence

The conditional probability that the consequent of an association rule occurs given the antecedent occurs. Confidence = support of {antecedent and consequent} / support of antecedent. A high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, but a high value of confidence can be misleading.
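
As an illustration, support can be counted as the number of transactions containing an item set, and confidence is then the ratio of two supports. The function names and the five shopping baskets below are hypothetical.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item in the item set."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "milk"},
    {"bread", "jelly"},
    {"milk"},
    {"bread", "jelly", "peanut butter"},
]

# Rule: if {bread} then {jelly}
conf = confidence({"bread"}, {"jelly"}, transactions)
print(conf)
```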

Interquartile Range (IQR)

The difference between the third and first quartiles.

Union of A and B

The event containing the outcomes belonging to A or B or both. The union of A and B is denoted by AUB.

Which of the following approaches is a good way to proceed with the influence diagram building for a problem?

The influence diagram for a portion of the problem is built first and then expanded until the total problem is conceptually modeled.

antecedent

The item set corresponding to the if portion of an if-then association rule.

consequent

The item set corresponding to the then portion of an if-then association rule.

Trend refers to

The long-run shift or movement in the time series observable over several periods of time.

bins

The nonoverlapping groupings of data used to create a frequency distribution. Bins for categorical data are also known as classes.

degrees of freedom

The number of individual scores that can vary without changing the sample mean. Statistically written as 'N-1' where N represents the number of subjects.

Conditional probability

The probability of an event given that another event has already occurred. The conditional probability of A given B is written P(A|B) and read "the probability of A given B." P(A|B) = P(A∩B)/P(B) or P(B|A) = P(A∩B)/P(A)
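
A quick numerical check of the formula, using one roll of a fair six-sided die as an assumed example: A is "the roll is even" and B is "the roll is greater than 3".

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

A = {2, 4, 6}  # roll is even
B = {4, 5, 6}  # roll is greater than 3

p_a_given_b = prob(A & B) / prob(B)  # P(A|B) = P(A∩B)/P(B)
p_b_given_a = prob(A & B) / prob(A)  # P(B|A) = P(A∩B)/P(A)
print(p_a_given_b, p_b_given_a)
```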

the most common form of distribution is

a frequency distribution, which determines how often a value appears in a range.

A one-way data table summarizes

a single input's impact on the output of interest.

why do we need data mining?

because there is a huge amount of data that we have to deal with.

Earthtones and cool colors

can be used for categorical data

data mining

the use of a variety of statistical analysis tools to uncover previously unknown patterns in the data stored in databases or relationships among variables. -extraction of data from data base. For example, by analyzing text on social network platforms like Twitter, data-mining techniques (including cluster analysis and sentiment analysis) are used by companies to better understand their customers. By categorizing certain words as positive or negative and keeping track of how often those words appear in tweets, a company like Apple can better understand how its customers are feeling about a product like the Apple Watch.

If a time series plot exhibits a horizontal pattern, then

there is still not enough evidence to conclude that the time series is stationary

