BANA 5368 Final Exam (CH 1-4, 5-8)
Crosstabulation
tabular summary of two variables. the two variables can be either categorical or quantitative, but quantitative variables must be converted to categorical by creating intervals
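A minimal Python sketch of a crosstabulation. The records, the rating labels, and the interval width of 10 are hypothetical and chosen only for illustration; the point is that the quantitative variable (price) is binned into intervals before counting.

```python
from collections import Counter

# Hypothetical observations: (quality rating, meal price in dollars).
records = [("Good", 18), ("Very Good", 22), ("Good", 28),
           ("Excellent", 33), ("Very Good", 28), ("Excellent", 44)]

def price_interval(price, width=10):
    """Convert a quantitative price into a categorical interval label."""
    low = (price // width) * width
    return f"{low}-{low + width - 1}"

# Count each (row category, column category) pair.
crosstab = Counter((rating, price_interval(price)) for rating, price in records)

for (rating, interval), count in sorted(crosstab.items()):
    print(rating, interval, count)
```

Each printed line is one cell of the crosstabulation: a rating, a price interval, and the number of observations in that cell.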
4 options for illegitimately missing data
1. discard observations (rows) that have missing data 2. discard the variable (column) that has missing data 3. fill in missing values with estimated values 4. apply a data mining algorithm capable of dealing with missing values
experiments have three characteristics
1. each experimental trial has only two mutually exclusive outcomes 2. each experimental trial is independent of all other trials 3. the probability of success is constant, and therefore the probability of failure is also constant. if a trial has all three characteristics, then the trial is said to be a Bernoulli trial
6 Rules of probability
1. a probability must be a number no smaller than 0 & no greater than 1
2. the probability of an event is equal to the sum of the probabilities of the elementary outcomes in the event set
3. the sum of the probabilities of the elementary outcomes in the sample space must equal 1: Pr(S) = 1, where S represents the sample space
4. multiplicative rule: Pr(A ∩ B) = Pr(A and B) = Pr(A) Pr(B|A) or Pr(B) Pr(A|B)
5. additive rule: Pr(A ∪ B) = Pr(A or B) = Pr(A) + Pr(B) - Pr(A ∩ B)
6. law of complements: Pr(A) = 1 - Pr(Ac), where Ac is the complement of A
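The multiplicative, additive, and complement rules can be checked on a small example. This is a sketch that assumes one roll of a fair six-sided die; the events A and B and the helper names are hypothetical.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}      # event: roll is even
B = {4, 5, 6}      # event: roll is greater than 3

def pr(event):
    """Probability of an event when all elementary outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def pr_given(event, given):
    """Conditional probability Pr(event | given)."""
    return pr(event & given) / pr(given)

# Rule 4 (multiplicative), rule 5 (additive), rule 6 (complements).
multiplicative_ok = pr(A & B) == pr(A) * pr_given(B, A) == pr(B) * pr_given(A, B)
additive_ok = pr(A | B) == pr(A) + pr(B) - pr(A & B)
complement_ok = pr(A) == 1 - pr(sample_space - A)
```

Exact fractions are used so the equalities hold without floating point rounding.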
use tables to display data when
1. reader needs to refer to specific numerical values 2. reader needs to make precise comparisons between values and not just relative comparisons 3. the values being displayed have different units or very different magnitudes
The 4 Vs (of big data)
1. volume 2. velocity 3. variety 4. veracity
quartiles
the 25th, 50th, and 75th percentiles (Q1, Q2 = median, Q3)
empirical rule
for bell-shaped (approximately normal) data: approximately 68% of all observations will be within 1 standard deviation of the mean, approximately 95% will be within 2 standard deviations of the mean, and approximately 99.7% will be within 3 standard deviations of the mean
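The three percentages come from the normal distribution. As a sketch, the probability of falling within k standard deviations of the mean can be computed with the error function from Python's math module (the function name prob_within is a hypothetical helper):

```python
import math

def prob_within(k):
    """Pr(|X - mean| <= k standard deviations) for a normal distribution."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

For k = 1, 2, 3 this gives about 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% figures.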
probability density function
A function used to compute probabilities for a continuous random variable. The area under the graph of a probability density function over an interval represents probability.
column charts
A graphical presentation that uses vertical bars to display the magnitude of quantitative data. Each bar typically represents a class of a categorical variable.
Binomial Probability Distribution
A probability distribution showing the probability of x successes in n trials of a binomial experiment.
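A short sketch of the standard binomial probability function, f(x) = C(n, x) p^x (1 - p)^(n - x), where p is the probability of success on each trial. The function name and the coin example are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Assumed example: probability of exactly 2 heads in 3 fair coin flips.
two_heads = binomial_pmf(2, 3, 0.5)
```

Summing binomial_pmf(x, n, p) over x = 0, ..., n gives 1, as any probability distribution must.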
Interquartile Range (IQR)
Q3 - Q1: the difference between the third and first quartiles. the larger the IQR, the more variability
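A sketch of computing the quartiles and the IQR with Python's statistics module. The data values are hypothetical. Note that quantiles() uses its default "exclusive" method here; other software may use a slightly different quartile convention and give slightly different values.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical observations

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
```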
Missing at random (MAR)
The tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data.
normal distribution
a bell-shaped curve describing the spread of a characteristic throughout a population. the area under the bell-shaped curve = 1
big data
a broad term for datasets so large or complex that traditional data processing applications are inadequate.
population parameter
a characteristic of the population (true mean height of all college students in tx)
sample statistic
a characteristic of the sample (ex. the mean height of a random sample of students in tx). remember: we use sample statistics to estimate the population parameters
variable
a feature, characteristic, or quantity of the data object that can take on a variety of values (car color, square footage of a house, age of a person)
tree map
a graphical presentation that is useful for visualizing hierarchical data along multiple dimensions. groups data according to the classes of a categorical variable and uses rectangles whose size relates to the magnitude of a quantitative variable
map reduce
a programming model used within Hadoop that manages the distribution of the data to the cluster (the map) and then collects the completed answers from the cluster (the reduce)
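The map and reduce idea can be sketched in plain Python with a word count. This runs in a single process and does not use Hadoop or its API; the chunks stand in for pieces of data that a cluster would process separately, and all names are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: combine the pairs from all chunks into one total per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical chunks, as if each were sent to a different machine.
chunks = ["big data big", "data volume"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
```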
random sample (infinite population)
a random sample of size n from an infinite population is a sample selected such that the following conditions are satisfied: 1. each element selected comes from the same population 2. each element is selected independently (think of an assembly line)
representative sample
a sample that has the same characteristics as the population from which it is drawn. (ex for a college campus consisting of 22,000 students with 52% being female and 48% male, the representative sample would have the same proportions of females and males)
random sample
a sample where the data objects are selected at random from the population. A random sample is considered to be a representative sample
simple random sample
a simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected. to do this: 1. assign a random number to each observation in the data set 2. sort the data set using the random number as the key index 3. select the n observations with the smallest random numbers
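The three-step procedure can be sketched in Python as follows. The function name and the seed parameter are assumptions added for illustration (the seed only makes the example repeatable).

```python
import random

def simple_random_sample(observations, n, seed=None):
    """Select n observations using the assign, sort, select procedure."""
    rng = random.Random(seed)
    # 1. assign a random number to each observation
    keyed = [(rng.random(), obs) for obs in observations]
    # 2. sort the data set using the random number as the key
    keyed.sort(key=lambda pair: pair[0])
    # 3. select the n observations with the smallest random numbers
    return [obs for _, obs in keyed[:n]]

# Hypothetical population of 100 numbered observations.
sample = simple_random_sample(range(100), 10, seed=1)
```

In practice, random.sample(population, n) from the standard library draws a simple random sample directly.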
clustered column or clustered bar chart
a special type of column (bar) chart in which multiple bars are clustered in the same class to compare multiple variables AKA side by side column (bar) chart
Linear regression
a statistical method for modeling the relationship between a single dependent variable and one or more independent (explanatory) variables
geographic information system (GIS)
a system that merges maps and statistics to present data collected over different geographies
imputation
a systematic method of replacing missing values. common choices include using the variable's median, mean, or mode
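A minimal sketch of imputation in Python, assuming missing values are stored as None. The function name and the sample values are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, method="median"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[method](observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with one missing value.
filled = impute([1, None, 3, 8], method="median")
```

The median is often preferred for skewed quantitative variables, and the mode is the usual choice for categorical variables.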
Missing not at random (MNAR)
the tendency for the value of a variable to be missing is related to the value that is missing
random variable
a variable whose value is not known with certainty beforehand (ex. color: you do not know the color of the third vehicle in the first row of the parking lot until you go look)
total ink
all of the ink used to generate the table or chart
Data mining
analytical techniques used to understand patterns and relationships in large data sets (ex. analyzing twitter feeds using textual analysis and cluster analysis)
data mining
analytical techniques used to understand patterns and relationships between variables
classical approach
based on the idea that the probabilities of all elementary outcomes in the sample space must add up to 1. if there are n equally likely elementary outcomes, then the probability of any single outcome is 1/n
rules of expected value
c represents a constant term and x represents a discrete random variable. mu represents the mean of the probability distribution and is not necessarily the same as arithmetic mean
three ways of assigning probabilities
1. classical 2. relative frequency 3. subjective
Data dashboard
collections of tables, charts, maps and summary statistics which are updated as new data becomes available (ex. webpage updated hourly that shows sales by product line)
population
complete set of data elements that you are interested in studying (ex. all cars in the state of texas, or all college students in us)
event
condition defined by the researcher (ex. flipping a coin and getting at least one head). the event set is the set of elementary outcomes that satisfy the definition of the event
time series data
data collected about a single data object over several time periods (ex. daily closing stock price for IBM common stock over the last year)
cross-sectional data
data collected about several data objects at the same approximate time (first quarter sales data for all of the companies in the fortune 500)
categorical data
data points represent membership in a category, arithmetic operations cannot be performed directly on the data. categorical data can be summarized by counts
raw data
data that has not been processed in any way: it has not been sorted, treated, or checked
probability
defined as the likelihood that something will actually happen. the probability of an event must be a number between 0 and 1, with two extremes: 0 = something won't happen, 1 = something will happen. a probability is never negative
sample space
defined as the set of all possible elementary outcomes. a tree diagram can be used to list them
complement of an event
defined as the set of elementary outcomes in the sample space but not in the event set. the complement of A (written Ac) consists of all elementary outcomes in the sample space that are not in A. if you combine A and Ac you get the sample space S. this relationship is always true: an event set combined with its complement will give you the sample space
union of events
defined as the set of elementary outcomes that are in either set or both (ex. the union A ∪ B is equal to the set of elementary outcomes that are in A or B or both)
variety
different types of data can be analyzed with respect to the impact they have on a decision variable
exponential probability distribution
focuses on the length of time between occurrences of an event
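A sketch of the standard cumulative probability for the exponential distribution, Pr(x <= t) = 1 - e^(-t / mu), where mu is the mean time between occurrences. The function name and the 10-minute example are assumptions.

```python
from math import exp

def exponential_cdf(t, mean_time):
    """Pr(time between occurrences <= t) when the mean time is mean_time."""
    return 1 - exp(-t / mean_time)

# Assumed example: customers arrive on average 10 minutes apart.
prob_within_5_minutes = exponential_cdf(5, 10)
```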
poisson probability model
focuses on the number of times something happens in a stated interval of time or distance (ex. number of people that will enter my shop in the next hour). x! = x factorial
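A sketch of the standard Poisson probability function, f(x) = mu^x e^(-mu) / x!, where mu is the expected number of occurrences in the interval. The function name and the value of mu are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when mu occurrences are expected."""
    return mu**x * exp(-mu) / factorial(x)

# Assumed example: on average 2 people enter the shop per hour.
prob_no_customers = poisson_pmf(0, 2)
prob_three_customers = poisson_pmf(3, 2)
```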
pivot charts
graphical presentation created in excel that functions similarly to a pivot table
bar charts
graphical presentation that uses horizontal bars to display the magnitude of quantitative data each bar typically represents a class of a categorical variable
scatter chart matrix
graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables
pie charts vs bar charts
pie charts are a graphical presentation used to compare categorical data. because of the difficulty in comparing relative areas of the chart, pie charts are not recommended; bar or column charts are superior
parallel coordinates plot
graphical presentation used to examine more than two variables in which each variable is represented by a different vertical axis. each observation in a data set is plotted in a parallel coordinates plot by drawing a line between the values of each variable for the observation
bubble charts
graphical presentation used to visualize 3 variables in a two dimensional graph. the two axes represent two variables, and the magnitude of the third variable is given by the size of the bubble
tree diagram
graphical technique for showing all possible outcomes in a probability experiment
veracity
how much uncertainty there is in the data (missing data, measurement errors, data entry errors, and other inconsistencies in the data)
rule based models
if a is true, then do b (ex. If the credit score is less than 600 do not grant the loan)
intersection of events
the intersection of two events is defined as the set of elementary outcomes that are in both event sets at the same time (ex. the intersection A ∩ B is equal to the set of elementary outcomes that are in both A & B)
non data ink
ink used in the table or chart that does not help communicate the meaning of the data
pivot tables
Microsoft Excel's name for crosstabulations
optimization models
models that give the best decision given the situation (ex. using historical data, determine the combination of stocks that yield the highest return for a specified level of risk)
observational study
neither the variable of interest nor any of the related variables is controlled (ex. consumer surveys, point of sale data, questionnaires)
frequency distribution
shows the number of times each data value (or class) occurs in the data set. often expressed as a table or graph: if the data is categorical a bar graph is used, if quantitative a histogram is used
quantitative data
numerical data that can be added, subtracted, multiplied, or divided (ex the amount of money people spend on lunch)
goal of sampling
obtain a representative sample
elementary outcome
one of the possible outcomes to an experiment (flipping a coin the elementary outcomes are heads (H) and tails (T))
Hadoop
open source programming environment that distributes data and processing over a cluster of computers
conditional probability
probability of one event happening given that another event has already occurred. the notation Pr(A|B) signifies the conditional probability of event A occurring given that B has already occurred. Pr(A|B) is read as probability of A given B
Experiment
process for which you don't know the outcome ahead of time (ex. flipping a coin, making a sales call)
volume
quantity of data collected is huge
data ink ratio
ratio of data ink to total ink (data ink / total ink). a higher data ink ratio is better because it means less non data ink
Data Query
request for specified types of information from a database
discrete probability distribution table
shows the possible values of x along with the associated probabilities. all probabilities are between 0 & 1, and the sum of all probabilities = 1
sparklines
special type of line chart that indicates the trend of the data but not the magnitude. does not include axes or labels
velocity
speed of decision making is increasing
time series analysis
statistical methods used to establish a trend line that can be used to predict what will happen in the short run future
sample
subset of the population (ex. 100 students chosen from a college campus of 22k students)
descriptive analytics
technique that uses historical data to describe what happened in the past
prescriptive analytics
techniques that identify the best or optimal course of action
data
the facts and figures collected, analyzed, and summarized for presentation and interpretation
geometric mean
the mean of n numbers expressed as the n-th root of their product
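A sketch of the geometric mean computed two ways in Python: directly from the definition and with the standard library. The growth factors are hypothetical.

```python
from math import prod
from statistics import geometric_mean

growth_factors = [1.10, 0.95, 1.20]  # hypothetical yearly growth factors

# n-th root of the product of the n numbers.
by_definition = prod(growth_factors) ** (1 / len(growth_factors))
from_library = geometric_mean(growth_factors)
```

The two results agree up to floating point rounding.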
expected value
the mean of the probability distribution is called the expected value and written as mu = E (x)
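For a discrete random variable the expected value is the probability-weighted sum of the possible values, mu = E(x) = sum of x f(x). A sketch in Python, assuming a hypothetical distribution for the number of heads in two fair coin flips:

```python
from fractions import Fraction

# Hypothetical distribution: number of heads in two fair coin flips.
distribution = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def is_valid(dist):
    """Each probability is between 0 and 1, and the probabilities sum to 1."""
    return all(0 <= p <= 1 for p in dist.values()) and sum(dist.values()) == 1

def expected_value(dist):
    """mu = E(x) = sum of x * f(x) over all possible values x."""
    return sum(x * p for x, p in dist.items())

mu = expected_value(distribution)
```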
data object
the object on which the data is collected (ex. cars, people, states, companies, machines). the term population element is also used as a synonym for the data object
data ink
the quantity of ink used to display the actual data itself
subjective approach
the subjective approach to assigning probabilities relies upon a person's intuition in situations where many things influence the probability of an outcome a subjective probability may be the only way to assign a probability
uniform probability distribution
the uniform probability distribution is defined for a continuous random variable x which has an equal chance of taking a value anywhere within the interval (a <= x <= b)
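Because every value in the interval is equally likely, the probability of landing in a sub-interval is its length divided by (b - a). A sketch, with a hypothetical function name and hypothetical interval values:

```python
def uniform_probability(c, d, a, b):
    """Pr(c <= x <= d) for x uniform on [a, b]: overlap length / (b - a)."""
    low, high = max(a, c), min(b, d)
    return max(0.0, high - low) / (b - a)

# Assumed example: x is uniform between 120 and 140.
prob_upper_half = uniform_probability(130, 140, 120, 140)
```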
heat maps
two dimensional graphical presentation of data in which color shadings indicate magnitude
mutually exclusive or incompatible events
two events A & B are mutually exclusive if the occurrence of one event eliminates the possibility of the other event occurring (ex. flipping a coin & getting heads eliminates the possibility of getting tails on the same flip). if 2 events A & B are mutually exclusive, then Pr(A|B) = 0 and Pr(B|A) = 0
independence
two events A & B are said to be independent if the occurrence of one event has NO impact on the probability of the other event occurring. if A and B are independent, then Pr(A|B) = Pr(A) and Pr(B|A) = Pr(B)
independence
two events A and B are independent if Pr(A|B) = Pr(A) and Pr(B|A) = Pr(B). you can use this idea to test for independence: calculate Pr(A|B) and see if it equals Pr(A). if they are not equal, then A & B are not independent
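The independence test can be sketched in Python. The example assumes one card drawn from a standard 52-card deck; the events and helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely cards

def pr(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(len(event), len(deck))

def pr_given(a, b):
    """Pr(A | B) = Pr(A and B) / Pr(B)."""
    return pr(a & b) / pr(b)

def independent(a, b):
    """Test for independence: does Pr(A | B) equal Pr(A)?"""
    return pr_given(a, b) == pr(a)

aces = {card for card in deck if card[0] == "A"}
hearts = {card for card in deck if card[1] == "hearts"}
face_cards = {card for card in deck if card[0] in {"J", "Q", "K"}}
```

Here independent(aces, hearts) is True because Pr(ace | heart) = Pr(ace) = 1/13. Aces and face cards are mutually exclusive, so Pr(ace | face card) = 0 and the two events are not independent.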
Predictive analytics
use historical data to predict what will happen in the future
Simulations
use of statistics to construct computer models to study the impact of uncertainty on a decision
relative frequency
used to estimate probability: number of times the event occurs / number of trials
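A sketch of the relative frequency approach using simulated coin flips. The simulation, the seed, and the number of trials are assumptions chosen for illustration; in practice the counts would come from observed data.

```python
import random

rng = random.Random(42)   # fixed seed so the example is repeatable
trials = 10_000

# Count how many simulated fair coin flips come up heads.
heads = sum(rng.random() < 0.5 for _ in range(trials))

# Relative frequency estimate of Pr(heads).
estimate = heads / trials
```

With more trials the estimate tends to move closer to the true probability of 0.5.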
simulation optimization
using simulation along with optimization techniques to identify the best decision in a complex and uncertain environment (ex. a business employs various supply chains and production processes; choosing the best combination of resources and processes depends on a variety of probabilities)
experimental study
variable of interest is identified and then related variables are controlled or manipulated to determine how they influence the variable of interest (ex. effectiveness of a new fertilizer is the variable of interest, the type of soil and amount of rainfall are variables being controlled)
discrete probability distribution
probability distribution for a random variable that can only have certain values (ex. number of 3 point shots made in a basketball game, number of cars in the parking lot at 7 am)
data dashboards
visualization tool that updates in real time and gives multiple outputs. may provide you with the latest sales data, number of employees currently working, status of various machines used in production, and number of completed orders shipped
Table design principles
• Avoid using vertical lines in a table unless they are necessary for clarity.
• Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place.
• Columns of numerical data should generally be right aligned.
• All numerical values in a column should display the same number of digits to the right of the decimal.
• Text should be left aligned.