IDIS 450 Exam 1

Principal Component Analysis

Find a projection that captures the largest amount of variation in the data. The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space.
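
As a minimal sketch of the idea above, the following uses only the Python standard library to build a covariance matrix and approximate its leading eigenvector with power iteration. The data set and the helper names (`covariance_matrix`, `leading_eigenvector`) are made up for illustration, and power iteration is just one simple way to obtain the first eigenvector; statistical packages typically use a full eigendecomposition or SVD.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: approximate the eigenvector with the largest eigenvalue."""
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2-D data lying close to the line y = x.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov, centered = covariance_matrix(data)
pc1 = leading_eigenvector(cov)

# Project each centered point onto the first principal component (2-D -> 1-D).
projected = [sum(c[j] * pc1[j] for j in range(2)) for c in centered]
```

Because the points lie near y = x, the first principal component comes out close to the direction (0.71, 0.71), and the one-dimensional `projected` values retain almost all of the variation.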

Multiple regression

Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Log-linear model

approximates discrete multidimensional probability distributions

Compound Events

A composition of two or more other events. Can be formed in two different ways: union or intersection.

Incomplete Data

lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

Sample Point

most basic outcome of an experiment

Random Sample

If n elements are selected from a population in such a way that every set of n elements in the population has an equal probability of being selected, the n elements are said to be a random sample.

Completeness

Whether values are missing: not recorded or unavailable

Derivable data

One attribute may be a derived attribute in another table

Interval Data

Ordinal data but with constant differences between observations. No true zero point.

Probability Formula

P(event) = (number of ways the event can occur) / (total number of outcomes), assuming equally likely outcomes

Inconsistent

Containing discrepancies in codes or names

Noisy

Containing errors or outliers

Ratio Data

Continuous values that have a natural zero point. Ratios are meaningful.

Accuracy

Correct or wrong, accurate or not

Independence

If A and B are independent, then Cov(A,B) = 0, but the converse is not true: some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
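
A small counterexample may help here. The sketch below assumes a hypothetical random variable X that takes the values -1, 0, and 1 with equal probability and sets Y = X². The covariance is exactly 0, yet Y is completely determined by X, so the two are not independent.

```python
from fractions import Fraction

outcomes = [-1, 0, 1]          # values of X, each with probability 1/3
p = Fraction(1, 3)

e_x = sum(p * x for x in outcomes)            # E[X]
e_y = sum(p * x ** 2 for x in outcomes)       # E[Y], where Y = X^2
e_xy = sum(p * x * x ** 2 for x in outcomes)  # E[XY] = E[X^3]
cov = e_xy - e_x * e_y                        # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence would require P(X=0 and Y=0) == P(X=0) * P(Y=0).
p_x0 = p
p_y0 = sum(p for x in outcomes if x ** 2 == 0)
p_x0_and_y0 = p                               # X = 0 forces Y = 0
independent = (p_x0_and_y0 == p_x0 * p_y0)
```

Here `cov` is 0 while `independent` is `False`, since 1/3 is not equal to 1/3 × 1/3.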

Applications of Business Analytics

Customer relationship management, sports game strategies, pricing decisions, HR planning, supply chain management, financial and marketing activities

Example of Categorical Data

Customer's location, employee classification (manager, supervisor)

Conditional Probability

The probability of an event given that another event has occurred. Revise the original sample space to account for the new information.
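
To illustrate revising the sample space, here is a short sketch using a hypothetical fair six-sided die. Event A is "the roll is even" and event B is "the roll is greater than 3". Once B is known to have occurred, only the outcomes in B remain possible.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}              # fair die, equally likely outcomes
a = {x for x in sample_space if x % 2 == 0}    # A: roll is even
b = {x for x in sample_space if x > 3}         # B: roll is greater than 3

p_a = Fraction(len(a), len(sample_space))      # unconditional P(A)

# Given B, the revised sample space is B itself, so count A's outcomes within B.
p_a_given_b = Fraction(len(a & b), len(b))
```

P(A) is 1/2, but P(A | B) is 2/3 because two of the three outcomes in B (4 and 6) are even.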

Statistical Independence

The occurrence of one event does not affect the probability of another event

Mutually Exclusive Events

Events do not occur simultaneously

incorrect attribute values may be due to

Faulty data collection instruments, data entry problems, data transmission problems, technology limitations, inconsistency in naming conventions

Data cleaning tasks

Fill in missing values, identify outliers and smooth out noisy data, correct inconsistent data

Data Cleaning

Fill in missing values, smooth noisy data, identify or remove outliers and noisy data, and resolve inconsistencies

Steps for Calculating Probability

1. Define the experiment; describe the process used to make an observation and the type of observation that will be recorded
2. List the sample points
3. Assign probabilities to the sample points
4. Determine the collection of sample points contained in the event of interest
5. Sum the sample point probabilities to get the event probability
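
The five steps can be sketched in code. The example below assumes a hypothetical experiment of tossing two fair coins and an event of interest "exactly one head"; both are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Step 1: the experiment is tossing two fair coins and recording each face.
# Step 2: list the sample points.
sample_points = ["".join(t) for t in product("HT", repeat=2)]

# Step 3: assign probabilities (equally likely for fair coins).
probabilities = {pt: Fraction(1, len(sample_points)) for pt in sample_points}

# Step 4: collect the sample points in the event "exactly one head".
event = [pt for pt in sample_points if pt.count("H") == 1]

# Step 5: sum the sample point probabilities.
p_event = sum(probabilities[pt] for pt in event)
```

The sample points are HH, HT, TH, and TT, the event contains HT and TH, and the event probability is 1/2.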

Concept Hierarchies

reduce the data by collecting and replacing low level concepts by higher level concepts

Consistency

Some data modified but some not; dangling references

Example of Interval Data

Temperature readings, SAT scores

Chi Squared Test

The larger the chi-squared value, the more likely the variables are related

Object Identification

The same attribute or object may have different names in different databases

Timeliness

Whether the data are updated in a timely manner

Dimensional Outliers

Univariate outliers, multivariate outliers

Multiplicative Rule

used to get compound probabilities for intersection events

Z Score Method

z = (x - mu)/sigma, where mu is the mean and sigma is the standard deviation. Very effective when the values of the feature fit a Gaussian distribution. Easy to implement. Useful for low-dimensional feature sets. Not recommended when the data cannot be assumed to be parametric.
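
A minimal sketch of the z-score method follows, computing z as (x - mean) / standard deviation. The function name, the sample readings, and the cutoff of 3 standard deviations are assumptions for illustration; the cutoff is a common convention, not something the card specifies.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold (assumed cutoff: 3)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    # Assumes sigma > 0, i.e. the values are not all identical.
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 50]
outliers = z_score_outliers(readings)
```

With these readings only the value 50 is flagged. Note that in very small samples a single extreme value inflates the standard deviation, so no point may reach the cutoff at all.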

Other data problems which require data cleaning

Duplicate records, incomplete data, inconsistent data

Mutually Exclusive

Two outcomes cannot occur at the same time

Probability

A measure of how likely an event is to occur

Predictive Analytics

Analyzes past performance and extrapolates to the future; predicts risk

Stratified Sampling

Applied to a population divided into subsets (strata); allocates an appropriate proportion of samples to each subset

Four Types of Data based on measurement scale

Categorical (nominal) data, ordinal data, interval data, ratio data

Sample Space

Collection of all possible outcomes

Example of Ordinal Data

College football rankings, survey responses (poor or good)

Causes of Outliers

Data entry errors, measurement errors, experimental errors, intentional, data processing errors, sampling errors, natural

IQR Method

Data do not need to follow any particular distribution. Needs ratio-type data. Calculate the median, Q1, Q3, and IQR (Q3 - Q1). A value more than 1.5*IQR below Q1 or above Q3 is a minor outlier; a value more than 3*IQR below Q1 or above Q3 is a major outlier.
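
The rule can be sketched as follows, measuring distance from the quartiles (below Q1 or above Q3), which is the usual reading of "beyond 1.5*IQR". The function name and data are hypothetical, and `statistics.quantiles` is used with its default method; other quartile conventions can give slightly different cut points.

```python
import statistics

def classify_outliers(values):
    """Split values into minor and major outliers using the IQR fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    minor, major = [], []
    for x in values:
        distance = max(q1 - x, x - q3)   # how far x lies outside [Q1, Q3]
        if distance > 3 * iqr:
            major.append(x)
        elif distance > 1.5 * iqr:
            minor.append(x)
    return minor, major

# Hypothetical delivery times in days.
times = [10, 11, 12, 12, 13, 13, 14, 15, 16, 25, 60]
minor, major = classify_outliers(times)
```

For this data Q1 = 12, Q3 = 16, and IQR = 4, so the minor fence above Q3 is 22 and the major fence is 28. The value 25 is a minor outlier and 60 is a major outlier.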

Why Data Preprocessing

Data in the real world is dirty: lots of potentially incorrect data

Missing Data

Data is not always available

Categorical Data (Nominal Data)

Data placed in categories according to a specified characteristic. Categories bear no quantitative relationship to one another.

Ordinal Data

Data that are ranked or ordered according to some relationship with one another. No fixed units of measure.

Clustering

Detect and remove outliers

Intentional

Disguised missing data

Cluster Sampling

Divide the population into clusters and sample a set of clusters

Sampling from a continuous process

Fix a time and select n items after that time, or select n times at random and select the next item produced after each of these times

Detecting and resolving data value conflicts

For the same real world entity, attribute values from different sources are different

Bayes Rule

Given k mutually exclusive and exhaustive events B1, B2, ..., Bk, such that P(B1) + P(B2) + ... + P(Bk) = 1, and an observed event A, then P(Bi | A) = P(Bi) P(A | Bi) / [P(B1) P(A | B1) + P(B2) P(A | B2) + ... + P(Bk) P(A | Bk)]
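
As a sketch of Bayes' rule in its standard form, P(Bi | A) = P(Bi) P(A | Bi) / [P(B1) P(A | B1) + ... + P(Bk) P(A | Bk)], the function below turns priors P(Bi) and likelihoods P(A | Bi) into posteriors P(Bi | A). The two-machine scenario and its numbers are invented for illustration.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(Bi | A) for each i, given P(Bi) and P(A | Bi)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Bi) * P(A | Bi)
    total = sum(joint)                                    # P(A), by total probability
    return [j / total for j in joint]

# Hypothetical: machine B1 makes 60% of parts, B2 makes 40%.
priors = [Fraction(3, 5), Fraction(2, 5)]
# Hypothetical defect rates: P(defective | B1) = 2%, P(defective | B2) = 5%.
likelihoods = [Fraction(1, 50), Fraction(1, 20)]

posteriors = bayes_posteriors(priors, likelihoods)
```

Given a defective part, the posterior probabilities are 3/8 for B1 and 5/8 for B2, even though B1 produces more parts overall.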

Importance of Selection

How a sample is selected from a population is of vital importance in statistical inference because the probability of an observed sample will be used to infer the characteristics of the sampled population.

Interpretability

How easily the data can be understood

Entity identification problem

Identify real-world entities from multiple data sources

Analyzing the Problem

Identifying and applying appropriate business analytics techniques; typically involves experimentation, statistical analysis, or a solution process

Negative Covariance

If Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value

Positive Covariance

If Cov(A,B) > 0, then A and B tend to be larger than their expected values together

Data Integration

Integration of multiple databases, or files

Equal Width Partitioning

Divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N. The most straightforward method, but outliers may dominate the presentation, and skewed data are not handled well.
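
A minimal sketch of equal-width partitioning using W = (B - A)/N. The function name and the sample values are assumptions for illustration; the largest value is placed in the last interval so that the top boundary is included.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins   # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for x in values:
        # The maximum value would land at index n_bins, so clamp it into the last bin.
        index = min(int((x - low) / width), n_bins - 1)
        bins[index].append(x)
    return width, bins

# Hypothetical sorted attribute values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width, bins = equal_width_bins(prices, 3)
```

Here W = (34 - 4)/3 = 10, giving the intervals [4, 14), [14, 24), and [24, 34] with 2, 3, and 4 values respectively. The uneven counts show why skewed data and outliers are a problem for this method.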

Equal Depth Partitioning

Divides the range into N intervals, each containing approximately the same number of samples. Good data scaling. Managing categorical attributes can be tricky.
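
A minimal sketch of equal-depth (equal-frequency) partitioning: sort the values, then cut the sorted list into N slices of nearly equal length. The function name and data are hypothetical.

```python
def equal_depth_bins(values, n_bins):
    """Partition values into n_bins groups with approximately equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i takes the sorted positions from i*n//n_bins up to (i+1)*n//n_bins.
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins]
            for i in range(n_bins)]

# Hypothetical attribute values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
```

Each of the three bins holds three values, regardless of how wide a range each bin spans.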

Subjective Method

Judgment sampling, convenience sampling

Redundant Attributes

May be able to be detected by correlation analysis or covariance analysis

Example of Ratio Data

Monthly sales, delivery times

Random Number Generators

Most researchers rely on random number generators to automatically generate the random sample. Random number generators are available in table form, and they are built into most statistical software packages.

Data transformation

Normalization and aggregation

Sampling Plan

Objectives, target population, population frame, operational procedures for data collection, statistical tools for data analysis

Data Reduction

Obtains a representation that is reduced in volume but produces the same or similar analytical results

Collectively Exhaustive

At least one of the outcomes in the sample space must occur

Intersection

Outcomes in both events A and B. 'AND' statement. Denoted by the symbol ∩ (i.e., A ∩ B).

Union

Outcomes in either event A or B, or both. 'OR' statement. Denoted by the symbol ∪ (i.e., A ∪ B).

Data Discretization

Part of data reduction but of particular importance, especially for numerical data

Environmental Outliers

Point outliers, contextual outliers, collective outliers

Experiment

Process of observation that leads to a single outcome that cannot be predicted with certainty

Noise

Random error or variance in a measured variable

Discretization

Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Numerosity reduction

Regression and log-linear models; histograms, clustering, sampling; data cube aggregation

Systematic (periodic) sampling

Selects every nth item from the population

Probability Sampling

Simple random sampling: involves selecting items from a population so that every subset of a given size has an equal chance of being selected

Regression

Smooth by fitting the data into regression functions

Structuring the Problem

Stating goals and objectives, characterizing the possible decisions, identifying any constraints and restrictions

Complementary Events

The event that A does not occur. All events not in A. Denote the complement of A by A^c.

Implementing the Solution

Translate the results of the model back to the real world. Requires providing adequate resources, motivating employees, eliminating resistance to change, modifying organizational policies, and developing trust.

Additive Rule

Used to get compound probabilities for union of events
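
The additive rule in its standard form is P(A ∪ B) = P(A) + P(B) - P(A ∩ B); the intersection is subtracted so that shared outcomes are not counted twice. The sketch below checks it on a hypothetical fair die, with A = "even" and B = "greater than 3".

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}              # fair die, equally likely outcomes
a = {x for x in sample_space if x % 2 == 0}    # A: roll is even
b = {x for x in sample_space if x > 3}         # B: roll is greater than 3

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

# Additive rule: subtract the overlap so it is not double counted.
p_union_rule = prob(a) + prob(b) - prob(a & b)

# Direct count of the union, for comparison.
p_union_direct = prob(a | b)
```

Both approaches give 2/3: 1/2 + 1/2 - 1/3 by the rule, and four favorable outcomes (2, 4, 5, 6) out of six by direct count.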

Descriptive Analytics

Uses data to understand the past and present; summarizes data into meaningful charts and reports; identifies patterns and trends in data

Prescriptive Analytics

Uses optimization techniques to identify the best alternatives; often combined with predictive analytics to account for risk

Dimensionality reduction

Wavelet transforms, Principal Components Analysis (PCA), feature subset selection, feature creation

Outcome

A possible result of a probability experiment

Event

a specific result of a probability experiment

Linear Regression

Data are modeled to fit a straight line

Combined computer and human inspection

detect suspicious values and check by humans

Binning

First sort the data and partition them into (equal-frequency) bins. Then one can smooth by bin means, by bin medians, or by bin boundaries.
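
A short sketch of two of the smoothing options, applied to bins that are assumed to be already sorted and partitioned. The function names and the bin contents are hypothetical. Smoothing by means replaces every value with its bin's mean; smoothing by boundaries replaces every value with the closer of its bin's minimum and maximum.

```python
def smooth_by_means(bins):
    """Replace each value with the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the nearer bin boundary (ties go to the lower one)."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if x - low <= high - x else high for x in b])
    return smoothed

# Hypothetical sorted data already split into three equal-frequency bins.
bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The bin means are 9, 22, and 29, so the first bin becomes 9, 9, 9. With boundary smoothing the first bin becomes 4, 4, 15 because 8 is closer to 4 than to 15.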

Believability

How trustworthy the data are

Schema integration

Integrate metadata from different sources
