Data Discovery-Exam 1 Review

Regression

Correlation: makes no a priori assumption about whether one variable depends on the other. It is not concerned with modeling the relationship between the variables; it only estimates their degree of association. Correlation does not mean causation.
Regression: models the dependence of a variable on one or more explanatory variables. Used for hypothesis testing and for prediction/forecasting.
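
The contrast between the two can be sketched in a few lines. This is a minimal illustration, assuming made-up `ad_spend` and `sales` values and hypothetical helper names (`pearson_r`, `fit_line`); it is not taken from the course material.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: degree of linear association, symmetric in x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Simple linear regression: least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ad_spend = [1, 2, 3, 4, 5]  # made-up explanatory variable
sales = [2, 4, 5, 4, 5]     # made-up dependent variable

r = pearson_r(ad_spend, sales)                 # about 0.77
slope, intercept = fit_line(ad_spend, sales)   # about 0.6 and 2.2
prediction = intercept + slope * 6             # forecast for a new x value
```

Swapping the two arguments of `pearson_r` gives the same value, while swapping them in `fit_line` gives a different line, which is the sense in which only regression assumes a direction of dependence.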

CRISP-DM

Cross-Industry Standard Process for Data Mining. Proposed in the 1990s by a European consortium. Highly repetitive and experimental.

Data mining: Regression

Data split: training data and test data. Accuracy measures: MSE (mean squared error) and MAD (mean absolute deviation).
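
As a rough sketch of the two accuracy measures, assuming made-up `actual` and `predicted` values for the test data (the function names are hypothetical):

```python
def mean_squared_error(actual, predicted):
    """MSE: average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_deviation(actual, predicted):
    """MAD: average of the absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 12, 14, 16]     # made-up test-set outcomes
predicted = [11, 11, 15, 18]  # made-up model predictions

mse = mean_squared_error(actual, predicted)       # 1.75
mad = mean_absolute_deviation(actual, predicted)  # 1.25
```

MSE penalizes large errors more heavily than MAD because the errors are squared before averaging.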

SEMMA

Developed by SAS Institute

Inferential statistics

Drawing inferences about the population based on sample data

Online Analytical Processing (OLAP)

A system that enables the user to query it, returns the results, and supports analysis. Uses data warehouses. Goal: decision support. Used for analysis. Example: business reporting system.

Online Transaction Processing (OLTP)

A system that is primarily responsible for capturing and storing data related to day-to-day business functions. Uses operational databases. Goal: capture and store data. Not for analysis purposes. Examples: ERP, SCMS, CRM.

KDD

Knowledge Discovery in Databases

Time Series Forecasting

Mathematical modeling to predict future values based on previously observed values. Methods: naive forecast, moving average, exponential smoothing, ARIMA.
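
The three simpler methods can be sketched as follows. This assumes a made-up `demand` series, a window of 3, and a smoothing constant `alpha` of 0.5, all chosen only for illustration; ARIMA is omitted because it needs a fitting procedure.

```python
def naive_forecast(series):
    """Naive forecast: the next value equals the last observed value."""
    return series[-1]

def moving_average_forecast(series, window):
    """Moving average: the next value is the mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / window

def exponential_smoothing_forecast(series, alpha):
    """Simple exponential smoothing, initialized with the first observation."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

demand = [10, 12, 13, 12, 15]  # made-up past observations

naive = naive_forecast(demand)                         # 15
moving_avg = moving_average_forecast(demand, 3)        # (13 + 12 + 15) / 3
smoothed = exponential_smoothing_forecast(demand, 0.5) # 13.5
```

A larger `alpha` makes the smoothed forecast react faster to recent values; a smaller one gives more weight to older history.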

Dispersion

Measures: range (max - min), standard deviation, mean absolute deviation, quartiles/interquartile range.
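
A small sketch of these measures using Python's standard `statistics` module and a made-up data set. The population standard deviation is used here as an assumption; quartile values depend on the interpolation convention, so other tools may report a slightly different interquartile range.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

data_range = max(data) - min(data)   # 7
std_dev = statistics.pstdev(data)    # population standard deviation, 2.0
mean = statistics.mean(data)         # 5
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)  # 1.5

# quantiles(n=4) returns the three cut points Q1, Q2 (the median), and Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```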

Scatter Plot

Shows precise, data-dense visualizations of correlations and clusters between two numeric variables.

Pie chart

Shows a part-to-whole relationship

Distribution shape

Skewness: a measure of asymmetry. Kurtosis: the peaked (tall and skinny) nature of the distribution.
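
One common way to compute both is from central moments. The sketch below assumes the moment-based (population) definitions and made-up data; statistical packages often apply small-sample corrections, so their numbers can differ.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Positive for a long right tail, negative for a long left tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Kurtosis relative to the normal distribution, whose kurtosis is 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # made-up symmetric data
right_skewed = [1, 1, 2, 2, 10]  # made-up data with one large value

skew_symmetric = skewness(symmetric)  # 0.0
skew_right = skewness(right_skewed)   # positive
kurt_symmetric = excess_kurtosis(symmetric)
```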

Line Chart

Shows change over time. Use it when you have one date variable and one numeric variable.

Data consolidation

Access and collect the data; select and filter the data; integrate and unify the data.

Predictive Analytics

Aims to determine what is likely to happen in the future. Answers the questions "What will happen?" and "Why will it happen?" Looks at past data to predict the future. Example: Amazon's predictive analytics.

The Command Line

Also referred to as the shell, bash, or terminal. It is the text interface for executing text-based programs. Think of it as interacting with your computer behind the scenes.

Data mining: types of patterns

Association, prediction, clustering, and sequential patterns.

Business questions

define the requirements of the metric and determine its usefulness

descriptive statistics

Describes the data; used for descriptive analytics.

Descriptive analytics

Also called reporting analytics. Answers the questions "What happened?" and "What is happening?" Retrospective analysis of historical data. Example tools: Tableau, Power BI.

Data mining: association

Finds interesting relationships between variables. Employs unsupervised learning. Also known as market basket analysis. Input: simple point-of-sale transaction data. Output: the most frequent affinities among items.
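
Affinities are typically scored with support and confidence. The sketch below assumes five made-up point-of-sale transactions and hypothetical helper names; real market basket analysis would run an algorithm such as Apriori over far more data.

```python
# Each transaction is the set of items bought together (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

support_bread_milk = support({"bread", "milk"})       # 3 of 5 baskets
conf_bread_to_milk = confidence({"bread"}, {"milk"})  # 3 of the 4 bread baskets
```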

data cleaning

Handle missing values in the data; identify and reduce noise in the data; find and eliminate erroneous data.

Logistic Regression

Can have one or more explanatory variables. Used to estimate a categorical variable: a binomial variable (two classes) or a multinomial variable (more than two classes).
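
For the binomial case with a single explanatory variable, the model passes a linear score through the logistic (sigmoid) function to get a probability. The coefficients below are hypothetical values picked for illustration, not fitted from data.

```python
from math import exp

def predict_probability(x, intercept, slope):
    """Probability of the positive class: 1 / (1 + e^-(intercept + slope * x))."""
    z = intercept + slope * x
    return 1 / (1 + exp(-z))

def classify(x, intercept, slope, threshold=0.5):
    """Turn the probability into a class label, 1 or 0."""
    return 1 if predict_probability(x, intercept, slope) >= threshold else 0

INTERCEPT = -4.0  # hypothetical coefficient
SLOPE = 0.1       # hypothetical coefficient

p_low = predict_probability(20, INTERCEPT, SLOPE)   # about 0.12
p_high = predict_probability(60, INTERCEPT, SLOPE)  # about 0.88
labels = [classify(x, INTERCEPT, SLOPE) for x in (20, 60)]  # [0, 1]
```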

Data Taxonomy

Categorical: represents the labels of multiple classes used to divide a variable into specific groups.
Nominal (categorical): simple codes assigned to objects as labels; the codes are not measurements.
Ordinal (categorical): codes assigned to objects or events as labels that also represent the rank order among them.
Numeric: represents the numeric values of specific variables.
Interval (numeric): measured on an interval scale, where the difference between values is meaningful but there is no absolute zero value.
Ratio (numeric): measurement variables commonly found in the physical sciences and engineering; the scale has an absolute zero value, so ratios between values are meaningful.

Return on Assets (ROA)

is a financial metric that indicates how profitable a firm is relative to its assets.

Current ratio

is a metric that shows a firm's ability to pay short-term liabilities

Inventory Turnover

is a ratio that shows how many times a firm has sold and replaced its inventory during a given period.

Return on Equity (ROE)

is another financial metric that shows how efficiently a firm generates profits from its shareholders' equity.
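
The four ratios on these cards (ROA, current ratio, inventory turnover, ROE) are commonly computed as shown below. The formulas are the usual textbook versions, and the figures are made up; some courses use average rather than year-end balances.

```python
def return_on_assets(net_income, total_assets):
    """ROA = net income / total assets."""
    return net_income / total_assets

def return_on_equity(net_income, shareholders_equity):
    """ROE = net income / shareholders' equity."""
    return net_income / shareholders_equity

def current_ratio(current_assets, current_liabilities):
    """Current ratio = current assets / current liabilities."""
    return current_assets / current_liabilities

def inventory_turnover(cost_of_goods_sold, average_inventory):
    """Inventory turnover = cost of goods sold / average inventory."""
    return cost_of_goods_sold / average_inventory

# Made-up figures for a hypothetical firm
roa = return_on_assets(net_income=50, total_assets=500)                    # 0.1
roe = return_on_equity(net_income=50, shareholders_equity=250)             # 0.2
liquidity = current_ratio(current_assets=300, current_liabilities=150)     # 2.0
turnover = inventory_turnover(cost_of_goods_sold=600, average_inventory=100)  # 6.0
```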

Basic file commands

Make a new directory; make a new file; rename a file or a directory; copy a file or directory; download a file from the web.

Central tendency

Median: the number in the middle of the sorted data. Mode: the most frequent occurrence.
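
Both measures are available in Python's standard `statistics` module. A quick sketch with made-up scores:

```python
import statistics

scores = [1, 2, 2, 3, 8, 9]  # made-up data, already sorted for readability

# With an even number of values, the median is the average of the two middle ones.
median = statistics.median(scores)  # (2 + 3) / 2 = 2.5
mode = statistics.mode(scores)      # 2 appears most often
```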

data transformation

Normalize the data; discretize or aggregate the data; construct new attributes.
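
Each of the three steps can be sketched on made-up data. Min-max scaling is only one of several normalization methods, and the age bins and the `profit` attribute are assumptions chosen for illustration.

```python
def min_max_normalize(values):
    """Normalize: rescale values to the range 0..1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def discretize_age(age):
    """Discretize: map a numeric age to a labeled bin (hypothetical cut-offs)."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"

ages = [20, 30, 40, 60]  # made-up numeric attribute
normalized_ages = min_max_normalize(ages)         # [0.0, 0.25, 0.5, 1.0]
age_groups = [discretize_age(a) for a in ages]

# Construct a new attribute from two existing ones (made-up values)
revenue = [100, 250]
cost = [60, 200]
profit = [r - c for r, c in zip(revenue, cost)]   # [40, 50]
```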

data reduction

Reduce the number of attributes; reduce the number of records; balance skewed data.

change directory

Relative path; absolute path; double dot (..), which refers to the parent directory.

Things we can do from command line

Run a Python script; install software; connect to remote servers; do simple and repetitive tasks faster and more efficiently.

Profit margin

Shows the degree to which a firm makes money.
The gross profit margin: represents the percent of total sales revenue that the firm retains after incurring the direct costs associated with producing the goods and services it sold.
The value chain profit margin: indicates the percentage of profit after COGS, SG&A, and R&D.
The net profit margin: shows how much of each revenue dollar earned is translated into bottom-line profits.
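
A sketch of the three margins, assuming made-up income statement figures and reading each definition as a share of revenue (the function names are hypothetical):

```python
def gross_profit_margin(revenue, cogs):
    """Share of revenue left after the direct cost of goods sold."""
    return (revenue - cogs) / revenue

def value_chain_profit_margin(revenue, cogs, sga, rnd):
    """Share of revenue left after COGS, SG&A, and R&D."""
    return (revenue - cogs - sga - rnd) / revenue

def net_profit_margin(revenue, net_income):
    """Share of each revenue dollar that becomes bottom-line profit."""
    return net_income / revenue

# Made-up figures for a hypothetical firm
REVENUE, COGS, SGA, RND, NET_INCOME = 1000, 600, 150, 50, 120

gross = gross_profit_margin(REVENUE, COGS)                         # 0.40
value_chain = value_chain_profit_margin(REVENUE, COGS, SGA, RND)   # 0.20
net = net_profit_margin(REVENUE, NET_INCOME)                       # 0.12
```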

Data mining: Data split

Simple split: split the data into 2 mutually exclusive sets: training data and test data.
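
A minimal sketch of a simple split, assuming a 70/30 ratio and a fixed random seed (both are illustrative choices, and `simple_split` is a hypothetical helper):

```python
import random

def simple_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records, then cut it into two disjoint sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    train_set = shuffled[n_test:]
    return train_set, test_set

records = list(range(10))  # made-up record IDs
train_set, test_set = simple_split(records)
```

Shuffling before cutting avoids a split that simply follows the original record order, and the fixed seed makes the split reproducible.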

Nature of Data

Structured data: targeted for computers to process; numeric versus nominal. Unstructured/textual data: targeted for humans to process/digest. Semi-structured data: XML, HTML, log files.

Analytics

the process of developing actionable decisions or recommendations for action based on insights generated from historical data

Cash-to-cash Cycle

the time between when you pay your supplier and when your customer pays you.
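
The cycle is often estimated from three day counts: days of inventory plus days of receivables minus days of payables. The sketch below assumes that common formula and made-up values.

```python
def cash_to_cash_cycle(days_inventory, days_receivables, days_payables):
    """Days between paying suppliers and collecting from customers."""
    return days_inventory + days_receivables - days_payables

# Made-up figures: inventory is held 45 days, customers pay after 30 days,
# and the firm pays its suppliers after 40 days.
cycle_days = cash_to_cash_cycle(45, 30, 40)  # 35
```

A shorter cycle means cash is tied up for less time; a negative value means customers pay before the supplier has to be paid.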

Data mining: Clustering

Used for automatic identification of natural groupings of things. Employs unsupervised learning. In marketing: segmentation.

The four pillars of visualization

1. Has clear purpose 2. Includes only the relevant content 3. Uses appropriate structure 4. Has useful formatting

Data Analysts

1. Understand the business: how it operates, its products/services, the industry, the supply chain, etc. 2. Understand the data: data collection, data management, data preprocessing. 3. Understand methods to analyze the data: statistics.

Perform and present analysis

1. calculate the metrics 2. create graphs from the metrics 3. read and interpret the graph 4. research the firm

Data preprocessing

1. data consolidation 2. data cleaning 3. data transformation 4. data reduction

Supervised learning

A learning approach with a priori knowledge. Uses labeled data: the training data includes both the input and the outcome. The construction of proper training, validation, and test sets is crucial.

Unsupervised learning

A learning approach without guidance. No labeled data: the model is not provided with the correct results. Exploratory model.

Data Mining Process

A systematic way to conduct data mining projects. Most common standard processes: CRISP-DM, KDD, SEMMA.

Prescriptive analytics

Aims to determine the best possible decision. Answers the questions "What should I do?" and "Why should I do it?" Uses both descriptive and predictive analytics to create the alternatives, and then determines the best one. Example: UPS driver routing system.

Data mining: Classification

Supervised: learns from past data to classify new data. The output variable is categorical. Assessment methods:
Predictive accuracy: hit rate.
Speed: model building versus predicting/usage speed.
Robustness: ability to make predictions given noisy data.
Scalability: ability to construct a prediction model given the size of the data.
Interpretability: transparency, explainability.

Bar chart

The most simple bar charts Difficult to read

