IDIS 450 Exam 1
Principal Component Analysis
Find a projection that captures the largest amount of variation in the data. The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.
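A minimal sketch of the idea in Python with NumPy (the data matrix X and its values are made up for illustration):

    import numpy as np

    # Toy data: rows are observations, columns are features (made-up values)
    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors define the new space
    order = np.argsort(eigvals)[::-1]       # order components by variance captured
    components = eigvecs[:, order]

    Z = Xc @ components[:, :1]              # project onto the top component (dimensionality reduction)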
Multiple regression
allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
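An illustrative sketch with ordinary least squares in NumPy (the feature matrix X and response y below are invented):

    import numpy as np

    # Hypothetical data: two features per observation and one response
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
    y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

    A = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares fit of y ~ b0 + b1*x1 + b2*x2
    print(coeffs)                                    # [intercept, coefficient for x1, coefficient for x2]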
Log-Linear Model
approximates discrete multidimensional probability distributions
Compound Events
composition of two or more other events; can be formed in two different ways (union or intersection)
Incomplete Data
lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
Sample Point
most basic outcome of an experiment
Random Sample
If n elements are selected from a population in such a way that every set of n elements in the population has an equal probability of being selected, the n elements are said to be a random sample.
Completeness
not recorded, unavailable
Derivable data
one attribute may be a derived attribute in another table
Interval Data
Ordinal data but with constant differences between observations; no true zero point
Probability Formula
P(event) = (number of ways the event can occur) / (total number of outcomes)
Inconsistent
Containing discrepancies in codes or names
Noisy
Containing errors or outliers
Ratio Data
Continuous values that have a natural zero point; ratios are meaningful
Accuracy
Correct or wrong, accurate or not
Independence
If A and B are independent, then Cov(A,B) = 0, but the converse is not true: some pairs of random variables may have a covariance of 0 yet not be independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
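A small sketch of that caveat (the variable names and values are made up): b below is fully determined by a, yet their sample covariance is close to 0.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(size=100_000)   # symmetric around 0
    b = a ** 2                     # b depends entirely on a (not independent)

    print(np.cov(a, b)[0, 1])      # approximately 0 despite the dependence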
Applications of Business Analytics
Customer relationship management; sports game strategies; pricing decisions; HR planning; supply chain management; financial and marketing analytics
Example of Categorical Data
Customer location; employee classification (manager, supervisor)
Conditional Probability
The probability of an event given that another event has occurred: P(A | B) = P(A ∩ B) / P(B). Revise the original sample space to account for the new information.
Statistical Independence
One event's occurrence does not affect the probability of another event
Mutually Exclusive Events
Events do not occur simultaneously
incorrect attribute values may be due to
Faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions
Data cleaning tasks
Fill in missing values; identify outliers and smooth out noisy data; correct inconsistent data
Data Cleaning
Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies
Steps for Calculating Probability
1. Define the experiment: describe the process used to make an observation and the type of observation that will be recorded. 2. List the sample points. 3. Assign probabilities to the sample points. 4. Determine the collection of sample points contained in the event of interest. 5. Sum the sample point probabilities to get the event probability.
Concept Hierarchies
reduce the data by collecting and replacing low-level concepts with higher-level concepts
Consistency
some data modified but some not; dangling data
Example of Interval Data
Temperature readings; SAT scores
Chi Squared Test
Chi-squared = sum over cells of (observed - expected)^2 / expected; the larger the value, the more likely the variables are related
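A hedged example using SciPy, assuming a made-up 2x2 contingency table of counts:

    from scipy.stats import chi2_contingency

    # Hypothetical counts for two categorical variables
    table = [[250, 200],
             [ 50, 1000]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)   # a large chi-squared value (small p) suggests the variables are related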
Object Identification
the same attribute or object may have different names in different databases
Timeliness:
timely update
Dimensional Outliers
Univariate outliers; multivariate outliers
Multiplicative Rule
Used to get compound probabilities for the intersection of events: P(A ∩ B) = P(A) P(B | A) = P(B) P(A | B)
Z Score Method
z = (x - mu) / sigma. Very effective when values of the feature fit a Gaussian distribution; easy to implement; useful for low-dimensional feature sets. Not recommended when the data cannot be assumed to be parametric.
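A minimal sketch in Python (the values and the cutoff of 2 are illustrative assumptions; 3 is another common choice):

    import numpy as np

    x = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 25.0])   # made-up feature values
    z = (x - x.mean()) / x.std()                        # z = (x - mu) / sigma

    print(x[np.abs(z) > 2])                             # values with |z| above the cutoff are flagged as outliers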
Other data problems which require data cleaning
Duplicate records; incomplete data; inconsistent data
Mutually Exclusive
Two outcomes cannot occur at the same time
Probability
A measure of how likely an event is to occur
Predictive Analytics
Analyzes past performance, extrapolating it to the future; predicts risks
Stratified Sampling
Applied to a population divided into subsets; allocates an appropriate proportion of samples to each subset
Four Types of Data based on measurement scale
Categorical (nominal) data; ordinal data; interval data; ratio data
Sample Space
Collection of all possible outcomes
Example of Ordinal Data
College football rankings; survey responses (poor, good)
Causes of Outliers
Data entry errors; measurement errors; experimental errors; intentional; data processing errors; sampling errors; natural
IQR Method
The data does not need to follow any distribution but must be a ratio data type. Calculate the median, Q1, Q3, and IQR. If a value is beyond 1.5*IQR (below Q1 or above Q3), it is a minor outlier; if beyond 3*IQR, it is a major outlier.
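A short sketch of the rule with NumPy (the data values are made up):

    import numpy as np

    x = np.array([7, 9, 10, 10, 11, 12, 13, 14, 15, 40])   # made-up ratio data
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1

    minor = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)    # beyond 1.5*IQR: minor outlier
    major = (x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)    # beyond 3*IQR: major outlier
    print(x[minor], x[major])                              # 40 is flagged by both rules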
Why Data Preprocessing
Data in the real world is dirty: lots of potentially incorrect data
Missing Data
Data is not always available
Categorical Data (Nominal Data)
Data placed in categories according to a specified characteristic; categories bear no quantitative relationship to one another
Ordinal Data
Data that is ranked or ordered according to some relationship with one another; no fixed units of measure
Clustering
Detect and remove outliers
Intentional
Disguised missing data
Cluster Sampling
Divide the population into clusters and sample a set of clusters
Sampling from a continuous process
Fix a time and select the next n items produced after that time, or select n times at random and select the next item produced after each of these times
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources are different
Bayes Rule
Given k mutually exclusive and exhaustive events B1, B2, ..., Bk, such that P(B1) + P(B2) + ... + P(Bk) = 1, and an observed event A, then P(Bi | A) = P(A | Bi) P(Bi) / [P(A | B1) P(B1) + P(A | B2) P(B2) + ... + P(A | Bk) P(Bk)]
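A worked sketch in Python with invented priors and likelihoods (three events B1, B2, B3 partitioning the sample space):

    # Hypothetical numbers, chosen only to illustrate the formula
    prior = [0.5, 0.3, 0.2]          # P(B1), P(B2), P(B3); they sum to 1
    likelihood = [0.1, 0.4, 0.7]     # P(A | B1), P(A | B2), P(A | B3)

    p_a = sum(p * l for p, l in zip(prior, likelihood))           # total probability of A
    posterior = [p * l / p_a for p, l in zip(prior, likelihood)]  # P(Bi | A) by Bayes' rule
    print(posterior)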
Importance of Selection
How a sample is selected from a population is of vital importance in statistical inference because the probability of an observed sample will be used to infer the characteristics of the sampled population.
Interpretability
How easily the data can be understood
Entity identification problem
Identify real-world entities from multiple data sources
Analyzing the Problem
Identifying and applying appropriate business analytics techniques typically involves experimentation, statistical analysis, or a solution process
Negative Covariance
If Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value
Positive Covariance
If Cov(A,B) > 0, then A and B both tend to be larger than their expected values
Data Integration
Integration of multiple databases or files
Equal Width Partitioning
Divides the range into N intervals of equal size (uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N. The most straightforward approach, but outliers may dominate the presentation and skewed data is not handled well.
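A sketch of equal-width partitioning with NumPy (the values and N = 3 are assumptions for illustration):

    import numpy as np

    x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # made-up values
    N = 3
    A, B = x.min(), x.max()
    W = (B - A) / N                      # width = (B - A) / N
    edges = A + W * np.arange(N + 1)     # interval boundaries: [4, 14, 24, 34]
    bins = np.digitize(x, edges[1:-1])   # bin index (0..N-1) for each value
    print(edges, bins)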
Equal Depth Partitioning
Divides the range into N intervals, each containing approximately the same number of samples. Good data scaling; managing categorical attributes can be tricky.
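For comparison, a sketch of equal-depth partitioning on the same made-up values, using percentiles as cut points:

    import numpy as np

    x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    N = 3
    edges = np.percentile(x, np.linspace(0, 100, N + 1))   # cut points at equal percentile steps
    bins = np.digitize(x, edges[1:-1], right=True)         # each bin holds about len(x) / N values
    print(edges, bins)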
Subjective Method
Judgment sampling; convenience sampling
Redundant Attributes
May be detected by correlation analysis or covariance analysis
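A small illustration of correlation analysis for redundancy (the attribute names and values are hypothetical):

    import numpy as np

    rng = np.random.default_rng(1)
    height_cm = rng.normal(170, 10, size=200)       # made-up attribute
    height_in = height_cm / 2.54                    # redundant: derivable from height_cm
    weight_kg = rng.normal(70, 8, size=200)         # unrelated attribute

    print(np.corrcoef(height_cm, height_in)[0, 1])  # near 1.0: likely redundant pair
    print(np.corrcoef(height_cm, weight_kg)[0, 1])  # near 0.0: no linear redundancy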
Example of Ratio Data
Monthly sales; delivery times
Random Number Generators
Most researchers rely on random number generators to automatically generate the random sample. Random number generators are available in table form, and they are built into most statistical software packages.
Data transformation
Normalization and aggregation
Sampling Plan:
Objectives; target population; population frame; operational procedures for data collection; statistical tools for data analysis
Data Reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Collectively Exhaustive
At least one of the events must occur; together the events cover the entire sample space
Intersection
Outcomes in both events A and B; 'AND' statement. Denoted by the symbol ∩ (i.e., A ∩ B)
union
Outcomes in either event A or B or both; 'OR' statement. Denoted by the symbol ∪ (i.e., A ∪ B)
Data Discretization
Part of data reduction but of particular importance, especially for numerical data
Environmental Outliers
Point outliers; contextual outliers; collective outliers
Experiment
Process of observation that leads to a single outcome that cannot be predicted with certainty
Noise
Random error or variance in a measured variable
Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Numerosity reduction
Regression and log-linear models; histograms, clustering, sampling; data cube aggregation
Systematic (periodic sampling)
Selects every nth item from the population
Probability Sampling
Simple random sampling: involves selecting items from a population so that every subset of a given size has an equal chance of being selected
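A minimal sketch of drawing a simple random sample in Python (the population of 100 items and sample size of 10 are assumptions):

    import random

    population = list(range(1, 101))        # hypothetical population of 100 items
    sample = random.sample(population, 10)  # every subset of size 10 is equally likely
    print(sample)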
Regression
Smooth by fitting the data to regression functions
Structuring the Problem
Stating goals and objectives; characterizing the possible decisions; identifying any constraints and restrictions
Complementary Events
The event that A does not occur; all outcomes not in A. Denote the complement of A by A^c.
Implementing the Solution
Translate the results of the model back to the real world. Requires providing adequate resources, motivating employees, eliminating resistance to change, modifying organizational policies, and developing trust.
Additive Rule
Used to get compound probabilities for the union of events: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Descriptive Analytics
Uses data to understand the past and present; summarizes data into meaningful charts and reports; identifies patterns and trends in data
Prescriptive Analytics
Uses optimization techniques to identify the best alternatives; often combined with predictive analytics to account for risk
Dimensionality Reduction
Wavelet transforms; principal components analysis (PCA); feature subset selection, feature creation
outcome
a possible result of a probability experiment
Event
a specific result of a probability experiment
Linear Regression
data are modeled to fit a straight line
Combined computer and human inspection
detect suspicious values and have humans check them
Binning
First sort the data and partition it into (equal-frequency) bins; then one can smooth by bin means, smooth by bin medians, or smooth by bin boundaries
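A short sketch of smoothing by bin means and by bin boundaries (the data and the choice of 4 equal-frequency bins are made up):

    import numpy as np

    x = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))  # sorted data
    bins = np.array_split(x, 4)                                           # 4 equal-frequency bins

    by_means = [np.full(len(b), b.mean()) for b in bins]                  # replace each value with its bin mean
    by_bounds = [np.where(b - b.min() < b.max() - b, b.min(), b.max())    # replace with the nearer bin boundary
                 for b in bins]
    print(by_means)
    print(by_bounds)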
Believability
how trustworthy the data is
schema integration
integrate metadata from different sources