BANA 5368 Final Exam (CH 1-4, 5-8)


Crosstabulation

A tabular summary of data for two variables. The variables can be either categorical or quantitative, but quantitative variables must first be converted to categorical ones by creating intervals (bins).
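A minimal sketch of a crosstabulation in plain Python, assuming made-up order data and arbitrary bin boundaries (neither comes from the course material). It shows the quantitative variable being binned into intervals before counting.

```python
from collections import Counter

# Hypothetical observations: (payment method, order amount in dollars).
orders = [
    ("card", 12.50), ("cash", 48.00), ("card", 75.25),
    ("cash", 8.99), ("card", 51.10), ("card", 19.99),
]

def amount_bin(amount):
    """Convert a quantitative amount into a categorical interval (assumed bins)."""
    if amount < 25:
        return "under 25"
    if amount < 50:
        return "25 to 49.99"
    return "50 and over"

# Count observations for each (row category, column category) pair.
crosstab = Counter((method, amount_bin(amount)) for method, amount in orders)

bins = ["under 25", "25 to 49.99", "50 and over"]
for method in ("card", "cash"):
    print(method, [crosstab[(method, b)] for b in bins])
# card [2, 0, 2]
# cash [1, 1, 0]
```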

4 options for illegitimately missing data

1. discard observations (rows) that have missing data
2. discard the variable (column) that has missing data
3. fill in missing values with estimated values
4. apply a data mining algorithm capable of dealing with missing values

experiments have three characteristics

1. each experimental trial has only two mutually exclusive outcomes (success or failure)
2. each experimental trial is independent of all other trials
3. the probability of success is constant from trial to trial, so the probability of failure is also constant
A trial with all three characteristics is called a Bernoulli trial.

6 Rules of probability

1. A probability must be a number no smaller than 0 and no greater than 1.
2. The probability of an event equals the sum of the probabilities of the elementary outcomes in the event set.
3. The probabilities of the elementary outcomes in the sample space must sum to 1: Pr(S) = 1, where S is the sample space.
4. Multiplicative rule: Pr(A ∩ B) = Pr(A and B) = Pr(A) Pr(B | A) = Pr(B) Pr(A | B).
5. Additive rule: Pr(A ∪ B) = Pr(A or B) = Pr(A) + Pr(B) - Pr(A ∩ B).
6. Law of complements: Pr(A) = 1 - Pr(Ac), where Ac is the complement of A.
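The rules can be checked on a small example. The sketch below assumes a fair six-sided die with equally likely outcomes; the events A (even roll) and B (roll greater than 3) and the helper names are illustrative choices, not part of the card.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed example)
A = {2, 4, 6}                        # event: roll is even
B = {4, 5, 6}                        # event: roll is greater than 3

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def pr_given(event, given):
    """Conditional probability Pr(event | given)."""
    return Fraction(len(event & given), len(given))

assert 0 <= pr(A) <= 1                                            # rule 1
assert pr(A) == sum(pr({outcome}) for outcome in A)               # rule 2
assert pr(sample_space) == 1                                      # rule 3
assert pr(A & B) == pr(A) * pr_given(B, A) == pr(B) * pr_given(A, B)  # rule 4
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)                     # rule 5
assert pr(A) == 1 - pr(sample_space - A)                          # rule 6
print(pr(A & B), pr(A | B))  # 1/3 2/3
```

In the code, `&` and `|` are Python's set intersection and union operators.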

use tables to display data when

1. the reader needs to refer to specific numerical values
2. the reader needs to make precise comparisons between values and not just relative comparisons
3. the values being displayed have different units or very different magnitudes

The 4 Vs

1. volume 2. velocity 3. variety 4. veracity

quartiles

The 25th, 50th, and 75th percentiles (Q1, Q2, and Q3); they divide the sorted data into four equal parts.

empirical rule

For data with an approximately bell-shaped (normal) distribution: about 68% of observations fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
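The three percentages come from areas under the standard normal curve. A short check using Python's standard library (the variable names are illustrative):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean.
for k in (1, 2, 3):
    share = z.cdf(k) - z.cdf(-k)
    print(f"within {k} standard deviation(s): {share:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```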

probability density function

A function used to compute probabilities for a continuous random variable. The area under the graph of a probability density function over an interval represents probability.

column charts

A graphical presentation that uses vertical bars to display the magnitude of quantitative data. Each bar typically represents a class of a categorical variable.

Binomial Probability Distribution

A probability distribution showing the probability of x successes in n trials of a binomial experiment.
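A minimal sketch of the binomial probability function, f(x) = C(n, x) p^x (1 - p)^(n - x). The function name and the example values (5 trials, success probability 0.5) are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 2 heads in 5 flips of a fair coin.
print(binomial_pmf(2, 5, 0.5))  # 0.3125
```

Summing `binomial_pmf(x, n, p)` over x = 0, ..., n gives 1, as any probability distribution must.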

Interquartile Range (IQR)

Q3 - Q1: the difference between the third and first quartiles. The larger the IQR, the more variability in the data.
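A small sketch of computing the IQR with Python's standard library, using made-up data. Note that quartile conventions differ between textbooks and software; `statistics.quantiles` uses its default "exclusive" method here, which may not match the course's hand calculation on every data set.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7]  # hypothetical observations

# n=4 splits the data into quartiles and returns the cut points [Q1, Q2, Q3].
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 2.0 6.0 4.0
```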

Missing at random (MAR)

The tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data.

normal distribution

A bell-shaped curve describing how a characteristic is spread throughout a population. The total area under the curve equals 1.

big data

a broad term for datasets so large or complex that traditional data processing applications are inadequate.

population parameter

a characteristic of the population (ex. the true mean height of all college students in TX)

sample statistic

a characteristic of the sample (ex. the mean height of a random sample of students in TX). Remember: we use sample statistics to estimate population parameters.

variable

a feature, characteristic, or quantity of the data object that can take on a variety of values (car color, square footage of a house, age of a person)

tree map

A graphical presentation that is useful for visualizing hierarchical data along multiple dimensions. It groups data according to the classes of a categorical variable and uses rectangles whose size relates to the magnitude of a quantitative variable.

map reduce

a programming model used within Hadoop that manages the distribution of data and work to the cluster (the map) and then collects the completed answers from the cluster (the reduce)
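To make the map and reduce steps concrete, here is a toy word count in plain Python. It is only a single-machine sketch of the idea, not Hadoop's actual API; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical input, imagined as chunks of data sent to different machines.
documents = ["big data big clusters", "data moves to the cluster"]

def map_phase(document):
    """Map: emit a (key, value) pair for every word in one chunk."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single answer."""
    return key, sum(values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts["big"], counts["data"], counts["cluster"])  # 2 2 1
```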

random sample (infinite population)

A random sample of size n from an infinite population is a sample selected such that the following conditions are satisfied: 1. each element selected comes from the same population 2. each element is selected independently. Think of items coming off an assembly line.

representative sample

a sample that has the same characteristics as the population from which it is drawn. (ex for a college campus consisting of 22,000 students with 52% being female and 48% male, the representative sample would have the same proportions of females and males)

random sample

a sample in which the data objects are selected at random from the population. A random sample is generally expected to be representative of the population.

simple random sample

A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected. To do this:
1. assign a random number to each observation in the data set
2. sort the data set using the random number as the key
3. select the n observations with the smallest random numbers
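The three steps above can be sketched as follows. The function name, the `seed` parameter, and the example population of 100 numbered observations are assumptions added for illustration.

```python
import random

def simple_random_sample(observations, n, seed=None):
    """Select n observations by the assign, sort, and select procedure."""
    rng = random.Random(seed)
    # 1. Assign a random number to each observation.
    keyed = [(rng.random(), obs) for obs in observations]
    # 2. Sort using the random number as the key.
    keyed.sort(key=lambda pair: pair[0])
    # 3. Keep the n observations with the smallest random numbers.
    return [obs for _, obs in keyed[:n]]

population = list(range(1, 101))   # hypothetical finite population, N = 100
sample = simple_random_sample(population, n=10, seed=42)
print(sample)
```

In practice, Python's `random.sample(population, n)` draws a sample without replacement directly; the version above just mirrors the card's procedure.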

clustered column or clustered bar chart

a special type of column (bar) chart in which multiple bars are clustered in the same class to compare multiple variables AKA side by side column (bar) chart

Linear regression

a statistical method for modeling the relationship between a single dependent variable and one or more independent (explanatory) variables

geographic information systems (GIS)

a system that merges maps and statistics to present data collected over different geographies

imputation

a systematic method of replacing missing values with estimates; common choices include the variable's median, mean, or mode
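A minimal sketch of mean and median imputation, assuming a made-up column of ages in which `None` marks a missing value.

```python
from statistics import mean, median

ages = [34, None, 29, 41, None, 38]   # hypothetical column with missing values

# Estimate the replacement value from the observed (non-missing) entries only.
observed = [a for a in ages if a is not None]
fill_mean = mean(observed)
fill_median = median(observed)

mean_imputed = [a if a is not None else fill_mean for a in ages]
median_imputed = [a if a is not None else fill_median for a in ages]

print(mean_imputed)    # [34, 35.5, 29, 41, 35.5, 38]
print(median_imputed)  # [34, 36.0, 29, 41, 36.0, 38]
```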

Missing not at random (MNAR)

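The tendency for an observation to be missing a value for some variable is related to the missing value itself (ex. people with very high incomes being less likely to report their income).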

random variable

a variable whose value is not known with certainty beforehand (ex. you do not know the color of the third vehicle in the first row of the parking lot until you go look)

total ink

all of the ink used to generate the table or chart

Data mining

analytical techniques used to understand patterns and relationships in large data sets (ex. analyzing Twitter feeds using textual analysis and cluster analysis)

data mining

analytical techniques used to understand patterns and relationships between variables

classical approach

Based on the idea that the probabilities of all elementary outcomes in the sample space must add up to 1. If the sample space has n equally likely elementary outcomes, then the probability of any single outcome is 1/n.

rules of expected value

In the rules of expected value, c represents a constant term and x represents a discrete random variable. mu represents the mean of the probability distribution, which is not necessarily the same as the arithmetic mean of a data set.

three ways of assigning probabilities

1. classical 2. relative frequency 3. subjective

Data dashboard

collections of tables, charts, maps and summary statistics which are updated as new data becomes available (ex. webpage updated hourly that shows sales by product line)

population

complete set of data elements that you are interested in studying (ex. all cars in the state of Texas, or all college students in the US)

event

A condition defined by the researcher (ex. flipping a coin and getting at least one head). The event set is the set of elementary outcomes that satisfy the definition of the event.

time series data

data collected about a single data object over several time periods (ex. daily closing stock price for IBM common stock over the last year)

cross-sectional data

data collected about several data objects at the same approximate time (first quarter sales data for all of the companies in the fortune 500)

categorical data

Data points represent membership in a category; arithmetic operations cannot be performed directly on the data. Categorical data can be summarized by counts.

raw data

data that has not been processed in any way: it has not been sorted, treated, or checked

probability

The likelihood that something will actually happen. The probability of an event must be a number between 0 and 1, where 0 means the event will not happen and 1 means it will certainly happen. A probability is never negative.

sample space

the set of all possible elementary outcomes; a tree diagram can be used to list them

complement of an event

The set of elementary outcomes that are in the sample space but not in the event set. The complement of A (written Ac) consists of all elementary outcomes in the sample space that are not in A. Combining A and Ac gives the sample space S; this relationship is always true.

union of events

The set of elementary outcomes that are in either event set or in both (ex. the union A ∪ B is the set of elementary outcomes that are in A, in B, or in both).

variety

different types of data can be analyzed with respect to the impact they have on a decision variable

exponential probability distribution

focuses on the length of time between occurrences of an event
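A short sketch of the exponential cumulative probability, P(x <= x0) = 1 - e^(-x0 / mean). The function name and the example (an average of 10 minutes between arrivals) are assumptions for illustration.

```python
from math import exp

def exponential_cdf(x, mean_time):
    """Probability that the time until the next occurrence is at most x,
    given the mean time between occurrences."""
    return 1 - exp(-x / mean_time)

# Arrivals average 10 minutes apart: chance the next one comes within 5 minutes.
print(f"{exponential_cdf(5, 10):.4f}")  # 0.3935
```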

poisson probability model

Focuses on the number of times something happens in a stated interval of time or distance (ex. the number of people who will enter my shop in the next hour). In the formula, x! means x factorial.
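A minimal sketch of the Poisson probability function, f(x) = mu^x e^(-mu) / x!, where mu is the expected number of occurrences in the interval. The function name and the example rate (3 customers per hour) are assumed for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences in an interval
    where mu occurrences are expected."""
    return mu**x * exp(-mu) / factorial(x)

# The shop averages 3 customers per hour: chance of exactly 2 in the next hour.
print(f"{poisson_pmf(2, 3):.4f}")  # 0.2240
```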

pivot charts

graphical presentation created in excel that functions similarly to a pivot table

bar charts

graphical presentation that uses horizontal bars to display the magnitude of quantitative data each bar typically represents a class of a categorical variable

scatter chart matrix

graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables

pie charts vs bar charts

Pie charts are a graphical presentation used to compare categorical data. Because it is difficult to compare the relative areas of the chart, pie charts are not recommended; bar or column charts are superior.

parallel coordinates plot

graphical presentation used to examine more than two variables in which each variable is represented by a different vertical axis. each observation in a data set is plotted in a parallel coordinates plot by drawing a line between the values of each variable for the observation

bubble charts

Graphical presentation used to visualize 3 variables in a two dimensional graph. The two axes represent two variables, and the magnitude of the third variable is given by the size of the bubble.

tree diagram

graphical technique for showing all possible outcomes in a probability experiment

veracity

how much uncertainty there is in the data (missing data, measurement errors, data entry errors, and other inconsistencies)

rule based models

if a is true, then do b (ex. If the credit score is less than 600 do not grant the loan)
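The loan example can be written as a one-rule function. The card only states what happens below 600; the function name and the outcome returned for other scores are placeholders added here.

```python
def loan_decision(credit_score, threshold=600):
    """Rule: if the credit score is less than the threshold, do not grant the loan."""
    if credit_score < threshold:
        return "do not grant the loan"
    # Assumed placeholder: the card does not say what happens otherwise.
    return "pass to further review"

print(loan_decision(580))  # do not grant the loan
print(loan_decision(600))  # pass to further review
```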

intersection of events

The intersection of two events is the set of elementary outcomes that belong to both event sets at the same time (ex. the intersection A ∩ B is the set of elementary outcomes that are in both A and B).

non data ink

ink used in the table or chart that does not help communicate the meaning of the data

pivot tables

Microsoft Excel calls crosstabulations PivotTables

optimization models

models that give the best decision given the situation (ex. using historical data, determine the combination of stocks that yield the highest return for a specified level of risk)

observational study

Neither the variable of interest nor any of the related variables are controlled (ex. consumer surveys, point of sale data, questionnaires).

frequency distribution

The number of times each data value occurs in the data set, often expressed as a table or graph. If the data is categorical a bar chart is used; if it is quantitative a histogram is used.
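A small sketch of building a frequency distribution for categorical data, using a made-up list of car colors.

```python
from collections import Counter

colors = ["red", "blue", "red", "white", "blue", "red"]  # hypothetical observations

# Frequency: how many times each value occurs.
frequency = Counter(colors)

# Relative frequency: each count as a share of all observations.
relative = {value: count / len(colors) for value, count in frequency.items()}

for value, count in frequency.most_common():
    print(value, count, f"{relative[value]:.2f}")
# red 3 0.50
# blue 2 0.33
# white 1 0.17
```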

quantitative data

numerical data that can be added, subtracted, multiplied, or divided (ex the amount of money people spend on lunch)

goal of sampling

obtain a representative sample

elementary outcome

one of the possible outcomes to an experiment (flipping a coin the elementary outcomes are heads (H) and tails (T))

Hadoop

open source programming environment that distributes data and processing over a cluster of computers

conditional probability

The probability of one event happening given that another event has already occurred. The notation Pr(A | B) signifies the conditional probability of event A occurring given that B has already occurred, and is read as the probability of A given B.

Experiment

a process whose outcome you do not know ahead of time (ex. flipping a coin, making a sales call)

volume

quantity of data collected is huge

data ink ratio

The ratio of data ink to total ink (data ink / total ink). A higher data ink ratio is better because it means less non data ink.

Data Query

request for specified types of information from a database

discrete probability distribution table

Shows the possible values of x along with their associated probabilities. Each probability is between 0 and 1, and the probabilities sum to 1.

sparklines

special type of line chart that indicates the trend of the data but not the magnitude. does not include axes or labels

velocity

speed of decision making is increasing

time series analysis

statistical methods used to establish a trend line that can be used to predict what will happen in the short run future

sample

subset of the population (ex. 100 students chosen from a college campus of 22k students)

descriptive analytics

technique that uses historical data to describe what happened in the past

prescriptive analytics

techniques that identify the best or optimal course of action

data

the facts and figures collected, analyzed, and summarized for presentation and interpretation

geometric mean

the mean of n numbers expressed as the n-th root of their product
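A minimal sketch of the definition, assuming all values are positive. The function name is illustrative; Python 3.8+ also ships `statistics.geometric_mean`.

```python
from math import prod

def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return prod(values) ** (1 / len(values))

print(geometric_mean([2, 8]))  # 4.0

# A typical use: the average growth factor over several periods (assumed data).
growth_factors = [1.10, 0.95, 1.20]
average_factor = geometric_mean(growth_factors)
```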

expected value

the mean of the probability distribution is called the expected value and written as mu = E (x)
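For a discrete random variable, the expected value is the probability-weighted sum of the possible values, E(x) = sum of x * f(x). The sketch below uses a made-up distribution table and exact fractions to avoid rounding noise.

```python
from fractions import Fraction

# Hypothetical discrete probability distribution table: value of x -> probability.
distribution = {
    0: Fraction(2, 10),
    1: Fraction(5, 10),
    2: Fraction(3, 10),
}

# A valid table has probabilities between 0 and 1 that sum to 1.
assert all(0 <= p <= 1 for p in distribution.values())
assert sum(distribution.values()) == 1

# mu = E(x): weight each value by its probability and add them up.
expected_value = sum(x * p for x, p in distribution.items())
print(expected_value)  # 11/10
```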

data object

the object on which the data is collected (ex. cars, people, states, companies, machines). The term population element is also used as a synonym for data object.

data ink

the quantity of ink used to display the actual data itself

subjective approach

The subjective approach to assigning probabilities relies upon a person's intuition. In situations where many things influence the probability of an outcome, a subjective probability may be the only way to assign a probability.

uniform probability distribution

the uniform probability distribution is defined for a continuous random variable x that has an equal chance of taking a value anywhere within the interval a <= x <= b
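Because every value in the interval is equally likely, the probability of landing in a sub-interval from c to d is its share of the total width, (d - c) / (b - a). A small sketch with assumed names and numbers:

```python
def uniform_probability(c, d, a, b):
    """P(c <= x <= d) for x uniformly distributed on the interval [a, b]."""
    if not (a <= c <= d <= b):
        raise ValueError("need a <= c <= d <= b")
    return (d - c) / (b - a)

# x is uniform between 0 and 10: chance that x falls between 2 and 5.
print(uniform_probability(2, 5, a=0, b=10))  # 0.3
```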

heat maps

two dimensional graphical presentation of data in which color shadings indicate magnitude

mutually exclusive or incompatible events

Two events A and B are mutually exclusive if the occurrence of one event eliminates the possibility of the other event occurring (ex. getting heads on a coin flip eliminates the possibility of getting tails on the same flip). If A and B are mutually exclusive, then Pr(A | B) = 0 and Pr(B | A) = 0.

independence

Two events A and B are said to be independent if the occurrence of one event has no impact on the probability of the other event occurring. If A and B are independent, then Pr(A | B) = Pr(A) and Pr(B | A) = Pr(B).

independence

Two events A and B are independent if Pr(A | B) = Pr(A) and Pr(B | A) = Pr(B). You can use this idea to test for independence: calculate Pr(A | B) and see whether it equals Pr(A). If they are not equal, then A and B are not independent.
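The test can be tried on a small example. This sketch assumes one roll of a fair die; the events and helper names are illustrative choices, not part of the card.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed example)
A = {2, 4, 6}                        # roll is even
B = {1, 2, 3, 4}                     # roll is 4 or less
C = {4, 5, 6}                        # roll is greater than 3

def pr(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def pr_given(event, given):
    """Conditional probability Pr(event | given)."""
    return Fraction(len(event & given), len(given))

def independent(x, y):
    """Test for independence: does Pr(x | y) equal Pr(x)?"""
    return pr_given(x, y) == pr(x)

print(independent(A, B))  # True: Pr(A | B) = 2/4 = 1/2 = Pr(A)
print(independent(A, C))  # False: Pr(A | C) = 2/3, but Pr(A) = 1/2
```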

Predictive analytics

use historical data to predict what will happen in the future

Simulations

use of statistics to construct computer models to study the impact of uncertainty on a decision

relative frequency

Used to estimate a probability from observed data: divide the number of times the event occurs by the number of trials.
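A quick simulation sketch of the relative frequency approach, assuming a fair coin, 10,000 simulated flips, and an arbitrary seed so the run is repeatable.

```python
import random

rng = random.Random(2024)   # arbitrary seed, chosen only for repeatability
trials = 10_000

# Count how many simulated flips come up heads.
heads = sum(rng.random() < 0.5 for _ in range(trials))

# Relative frequency = number of occurrences / number of trials.
relative_frequency = heads / trials
print(relative_frequency)   # close to 0.5, but not exactly 0.5
```

With more trials, the estimate tends to settle closer to the true probability.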

simulation optimization

using simulation along with optimization techniques to identify the best decision in a complex and uncertain environment (ex. a business employs various supply chains and production processes; choosing the best combination of resources and processes depends on a variety of probabilities)

experimental study

variable of interest is identified and then related variables are controlled or manipulated to determine how they influence the variable of interest (ex. effectiveness of a new fertilizer is the variable of interest, the type of soil and amount of rainfall are variables being controlled)

discrete probability distribution

a probability distribution for a random variable that can only take certain values (ex. the number of 3 point shots made in a basketball game, the number of cars in the parking lot at 7 am)

data dashboards

A visualization tool that updates in real time and gives multiple outputs. It may provide you with the latest sales data, the number of employees currently working, the status of various machines used in production, and the number of completed orders shipped.

Table design principles

•Avoid using vertical lines in a table unless they are necessary for clarity.
•Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place.
•Columns of numerical data should generally be right aligned.
•All numerical values in a column should display the same number of digits to the right of the decimal.
•Text should be left aligned.

