Data Discovery-Exam 1 Review
Regression
Correlation: no a priori assumption about whether one variable depends on the other; it estimates the degree of association between variables rather than modeling how one depends on another. Correlation does not mean causation. Regression: models the dependence of a variable on one or more explanatory variables; used for hypothesis testing and prediction/forecasting.
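A minimal sketch of the difference, assuming Pearson's r as the correlation measure and ordinary least squares for a one-variable regression. The data and the names `ad_spend` and `sales` are hypothetical.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Degree of linear association; symmetric in xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (y depends on x)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data that follows y = 1 + 2x exactly
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

r = pearson_r(ad_spend, sales)                # association only
intercept, slope = fit_line(ad_spend, sales)  # dependence of sales on ad_spend
prediction = intercept + slope * 6            # forecast for an unseen x
```

Swapping the two arguments leaves `r` unchanged but changes the fitted line, which is the sense in which only regression assumes a dependent variable.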
CRISP-DM
Cross-Industry Standard Process for Data Mining. Proposed in the 1990s by a European consortium. The process is highly repetitive and experimental.
Data mining: Regression
Data split: training data and test data. Accuracy measures: MSE (mean squared error) and MAD (mean absolute deviation).
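A short sketch of a simple split and the two accuracy measures. The 70/30 proportion, the fixed seed, and the sample numbers are assumptions for illustration, not values from the course.

```python
import random

def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy, then cut it into two mutually exclusive sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mse(actual, predicted):
    """Mean squared error: average of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mad(actual, predicted):
    """Mean absolute deviation: average of absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

train, test = simple_split(list(range(10)))

# Hypothetical test-set outcomes and model predictions
actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 18]
test_mse = mse(actual, predicted)
test_mad = mad(actual, predicted)
```

MSE penalizes large errors more heavily than MAD because each error is squared before averaging.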
SEMMA
Developed by SAS Institute
Inferential statistics
Drawing inferences about the population based on sample data
Online Analytical Processing (OLAP)
A system that enables the user to query the data, get results, and conduct an analysis. Uses data warehouses. Goal: decision support. Used for analysis. Example: a business reporting system.
Online Transaction Processing (OLTP)
A system that is primarily responsible for capturing and storing data related to day-to-day business functions. Uses operational databases. Goal: capture and store data. Not for analysis purposes. Examples: ERP, SCMS, CRM.
KDD
Knowledge Discovery in Databases
Time Series Forecasting
Mathematical modeling to predict future values based on previously observed values. Methods: naive forecast, moving average, exponential smoothing, ARIMA.
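The first three methods can be sketched in a few lines. The window size, the smoothing constant `alpha`, and the `demand` series are assumed values. ARIMA is left out because it needs a dedicated statistics library.

```python
def naive_forecast(series):
    """Next value = last observed value."""
    return series[-1]

def moving_average_forecast(series, window=3):
    """Next value = mean of the most recent `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

def exponential_smoothing_forecast(series, alpha=0.5):
    """Simple exponential smoothing; recent values get more weight."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical demand history
demand = [10, 12, 14, 16]
naive = naive_forecast(demand)
moving_avg = moving_average_forecast(demand)
smoothed = exponential_smoothing_forecast(demand)
```

With an upward trend like this one, the moving average and the smoothed forecast both lag behind the naive forecast, since they blend in older, lower values.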
Dispersion
Range (max - min); standard deviation; mean absolute deviation; quartiles/interquartile range.
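A sketch of the four measures using Python's standard `statistics` module. The sample values are made up, and the population form of the standard deviation (`pstdev`) is an assumption; use `stdev` for a sample.

```python
from statistics import mean, pstdev, quantiles

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

value_range = max(values) - min(values)

# Population standard deviation (divides by n)
std_dev = pstdev(values)

# Mean absolute deviation: average distance from the mean
center = mean(values)
mean_abs_dev = mean(abs(v - center) for v in values)

# Quartiles cut the sorted data into four parts; IQR = Q3 - Q1
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1
```

The range and the standard deviation are sensitive to extreme values, while the interquartile range only looks at the middle half of the data.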
Scatter Plot
Shows precise, data-dense visualizations of correlations and clusters between two numeric variables.
Pie chart
Shows a part-to-whole relationship
Distribution shape
Skewness: a measure of asymmetry. Kurtosis: the peaked/tall/skinny nature of the distribution.
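Both shape measures can be sketched from central moments. This assumes the population (moment-based) definitions, and excess kurtosis, which subtracts 3 so that a normal distribution scores 0. Statistical packages may apply sample corrections and give slightly different numbers.

```python
def central_moment(values, k):
    """Average of (value - mean) ** k."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3: higher means more peaked / heavier tails."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]    # hypothetical data with one large value

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
kurt_symmetric = excess_kurtosis(symmetric)
```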
Line Chart
Shows change over time. Use when you have one date variable and one numeric variable.
Data consolidation
Access and collect the data; select and filter the data; integrate and unify the data.
Predictive Analytics
Aims to determine what is likely to happen in the future. Answers the questions "What will happen?" and "Why will it happen?" Looks at past data to predict the future. Example: Amazon's predictive analytics.
The Command Line
Also referred to as the shell or terminal (Bash is one common shell). It is the text interface for executing text-based programs. Think of it as interacting with your computer behind the scenes.
Data mining types of pattern
Association; prediction; cluster; sequential.
Business questions
define the requirements of the metric and determine its usefulness
descriptive statistics
Describes the data; used for descriptive analytics.
Descriptive analytics
Descriptive or reporting analytics. Answers the questions "What happened?" and "What is happening?" Retrospective analysis of historical data. Example tools: Tableau, Power BI.
Data mining: association
Finds interesting relationships between variables. Employs unsupervised learning. Also known as market basket analysis. Input: simple point-of-sale transaction data. Output: the most frequent affinities among items.
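A minimal sketch of market basket analysis, assuming the usual support and confidence measures for scoring affinities (the card itself does not name them). The transactions are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions (one set of items per basket)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def pair_support(baskets):
    """Share of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets with `antecedent`, the share that also has `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    both = [b for b in with_antecedent if consequent in b]
    return len(both) / len(with_antecedent)

support = pair_support(transactions)
bread_milk_support = support[("bread", "milk")]
bread_to_milk = confidence(transactions, "bread", "milk")
most_frequent_pair = max(support, key=support.get)
```

No labels are supplied anywhere, which is why association mining counts as unsupervised learning.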
data cleaning
Handle missing values in the data; identify and reduce noise in the data; find and eliminate erroneous data.
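One possible cleaning pass, sketched under assumptions: missing values are marked with `None`, values outside a plausible range count as erroneous, and both are replaced by the median of the valid values. Median imputation is only one of several reasonable choices.

```python
from statistics import median

def clean_ages(ages, low=0, high=120):
    """Replace missing (None) and out-of-range ages with the median valid age."""
    def is_valid(age):
        return age is not None and low <= age <= high

    valid = [a for a in ages if is_valid(a)]
    fill = median(valid)
    return [a if is_valid(a) else fill for a in ages]

# Hypothetical column with one missing value and one data-entry error
raw_ages = [25, None, 30, 400, 35]
cleaned = clean_ages(raw_ages)
```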
Logistic Regression
Can have one or more explanatory variables. Used to estimate a categorical variable: binomial (two classes) or multinomial (more than two classes).
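For the binomial case, a fitted logistic regression turns a weighted sum of the explanatory variables into a probability through the logistic (sigmoid) function. The sketch below assumes the coefficients are already estimated; the numbers and the 0.5 cutoff are hypothetical.

```python
import math

def predict_probability(x, intercept, coefficients):
    """P(class = 1) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1 / (1 + math.exp(-z))

def classify(x, intercept, coefficients, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_probability(x, intercept, coefficients) >= threshold else 0

# Hypothetical fitted model with two explanatory variables
intercept = -1.0
coefficients = [0.5, 0.0]

p_boundary = predict_probability([2.0, 1.0], intercept, coefficients)  # z = 0
p_low = predict_probability([0.0, 0.0], intercept, coefficients)       # z = -1
label_low = classify([0.0, 0.0], intercept, coefficients)
```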
Data Taxonomy
Categorical: represents the labels of multiple classes used to divide a variable into specific groups. Nominal: simple codes assigned to objects as labels, which are not measurements. Ordinal: codes assigned to objects or events as labels that also represent the rank order among them. Numeric: represents the numeric values of specific variables. Interval: the difference between values is meaningful, but the scale has no absolute zero (for example, temperature in Celsius). Ratio: measurement variables commonly found in the physical sciences and engineering; the scale has an absolute zero (for example, mass or length).
Return on Assets (ROA)
is a financial metric that indicates how profitable a firm is relative to its assets.
Current ratio
is a metric that shows a firm's ability to pay short-term liabilities
Inventory Turnover
is a ratio that shows how many times a firm has sold and replaced its inventory during a given period.
Return on Equity (ROE)
is another financial metric; it shows how efficiently a firm generates profits from its shareholders' equity.
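The four ratios above (ROA, current ratio, inventory turnover, ROE) can be sketched with their common textbook formulas. All figures are hypothetical, and using cost of goods sold over average inventory for turnover is an assumption; some sources use sales instead.

```python
def return_on_assets(net_income, total_assets):
    return net_income / total_assets

def current_ratio(current_assets, current_liabilities):
    return current_assets / current_liabilities

def inventory_turnover(cost_of_goods_sold, average_inventory):
    return cost_of_goods_sold / average_inventory

def return_on_equity(net_income, shareholders_equity):
    return net_income / shareholders_equity

# Hypothetical firm, figures in thousands of dollars
roa = return_on_assets(net_income=50, total_assets=500)
liquidity = current_ratio(current_assets=300, current_liabilities=150)
turnover = inventory_turnover(cost_of_goods_sold=400, average_inventory=100)
roe = return_on_equity(net_income=50, shareholders_equity=250)
```

Here the firm earns 10 cents per dollar of assets and 20 cents per dollar of equity, covers its short-term liabilities twice over, and sells through its inventory four times in the period.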
Basic file commands
Make a new directory; make a new file; rename a file or a directory; copy a file or directory; download a file from the web.
Central tendency
Median: the number in the middle. Mode: the most frequent occurrence.
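A quick check of both definitions with Python's standard `statistics` module; the list of values is made up.

```python
from statistics import median, mode

values = [3, 1, 4, 1, 5]   # hypothetical observations

middle = median(values)        # sorts to 1, 1, 3, 4, 5 and takes the middle one
most_frequent = mode(values)   # the value that occurs most often
```

With an even number of observations, `median` returns the average of the two middle values.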
data transformation
Normalize the data; discretize or aggregate the data; construct new attributes.
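Two of these transformations sketched in code, assuming min-max scaling as the normalization method and fixed cut points for discretization. The income values, cut points, and labels are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value, cut_points, labels):
    """Map a number to a bin label; needs len(labels) == len(cut_points) + 1."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

# Hypothetical incomes in thousands of dollars
incomes = [20, 40, 60, 100]
normalized = min_max_normalize(incomes)
bands = [discretize(v, cut_points=[30, 70], labels=["low", "mid", "high"])
         for v in incomes]
```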
data reduction
Reduce the number of attributes; reduce the number of records; balance skewed data.
change directory
Relative path; absolute path; double dot (..), which refers to the parent directory.
Things we can do from command line
Run a Python script; install software; connect to a remote server; do simple and repetitive tasks faster and more efficiently.
Profit margin
Shows the degree to which a firm makes money. The gross profit margin: represents the percent of total sales revenue that the firm retains after incurring the direct costs associated with producing the goods and services it sold. The value chain profit margin: indicates the percentage of profit after COGS, SGA, and R&D. The net profit margin: shows how much of each revenue dollar earned is translated into bottom-line profits.
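The three margins differ only in which costs are subtracted before dividing by revenue. The sketch below follows the definitions on this card; the income statement figures are hypothetical.

```python
def gross_profit_margin(revenue, cogs):
    """Share of revenue left after direct production costs (COGS)."""
    return (revenue - cogs) / revenue

def value_chain_profit_margin(revenue, cogs, sga, rd):
    """Share of revenue left after COGS, SG&A, and R&D."""
    return (revenue - cogs - sga - rd) / revenue

def net_profit_margin(revenue, net_income):
    """Bottom-line profit per dollar of revenue."""
    return net_income / revenue

# Hypothetical income statement, figures in thousands of dollars
revenue, cogs, sga, rd, net_income = 1000, 600, 200, 50, 80

gross = gross_profit_margin(revenue, cogs)
value_chain = value_chain_profit_margin(revenue, cogs, sga, rd)
net = net_profit_margin(revenue, net_income)
```

Each margin is smaller than the one before it because more costs have been taken out.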
Data mining: Data split
Simple split: split the data into 2 mutually exclusive sets: training data and test data.
Nature of Data
Structured data: targeted for computers to process; numeric versus nominal. Unstructured/textual data: targeted for humans to process/digest. Semi-structured data: XML, HTML, log files.
Analytics
the process of developing actionable decisions or recommendations for action based on insights generated from historical data
Cash-to-cash Cycle
the time between when you pay your supplier and when your customer pays you.
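The cycle is commonly computed from three day counts: days of inventory plus days of receivables minus days of payables. That formula is the standard textbook one rather than something stated on this card, and the day counts below are hypothetical.

```python
def cash_to_cash_cycle(days_inventory, days_receivable, days_payable):
    """Days between paying suppliers and collecting from customers."""
    return days_inventory + days_receivable - days_payable

# Hypothetical firm: holds inventory 40 days, collects in 30, pays suppliers in 45
cycle_days = cash_to_cash_cycle(days_inventory=40, days_receivable=30, days_payable=45)
```

A shorter cycle means cash is tied up in operations for less time. A negative value means customers pay before the supplier has to be paid.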
Data mining: Clustering
Used for automatic identification of natural groupings of things. Employs unsupervised learning. In marketing: segmentation.
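One way to see clustering at work is a tiny one-dimensional k-means. The card does not name an algorithm, so k-means is a choice made here for illustration. The spending values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly spend per customer
spend = [10, 12, 11, 50, 52, 49]
centers, segments = kmeans_1d(spend, centers=[10, 50])
```

No segment labels are given in advance; the two groups (low and high spenders) emerge from the data, which matches the marketing use of clustering for segmentation.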
The four pillars of visualization
1. Has a clear purpose 2. Includes only the relevant content 3. Uses appropriate structure 4. Has useful formatting
Data Analysts
1. Understand the business: how it operates, the products/services, the industry, the supply chain, etc. 2. Understand the data: data collection, data management, data preprocessing 3. Understand methods to analyze the data: statistics
Perform and present analysis
1. calculate the metrics 2. create graphs from the metrics 3. read and interpret the graph 4. research the firm
Data preprocessing
1. data consolidation 2. data cleaning 3. data transformation 4. data reduction
Supervised learning
A learning approach with a priori knowledge. Uses labeled data: the training data includes both the input and the outcome. The construction of proper training, validation, and test sets is crucial.
Unsupervised learning
A learning approach without guidance. No labeled data: the model is not provided with the correct results. Exploratory model.
Data Mining Process
A systematic way to conduct data mining projects. Most common standard processes: CRISP-DM, KDD, SEMMA.
Prescriptive analytics
Aims to determine the best possible decision. Answers the questions "What should I do?" and "Why should I do it?" Uses both descriptive and predictive analytics to create the alternatives, and then determines the best one. Example: UPS driver routing system.
Data Mining: Classification
Supervised: learn from past data, classify new data. The output variable is categorical. Assessment methods: Predictive accuracy: hit rate. Speed: model building versus predicting/usage speed. Robustness: ability to make predictions given noisy data. Scalability: ability to construct predictions given the size of the data. Interpretability: transparency, explainability.
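Predictive accuracy (hit rate) is the share of records the classifier labels correctly. A minimal sketch with hypothetical labels:

```python
def hit_rate(actual, predicted):
    """Fraction of records where the predicted class matches the actual class."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

# Hypothetical test-set labels and classifier output
actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no"]

accuracy = hit_rate(actual, predicted)
```

The hit rate should be measured on test data the model did not see during training, otherwise it overstates how well new data will be classified.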
Bar chart
The simplest type of bar chart. Difficult to read.