Intro to Data Analytics Study Guide

Text mining

discovery of patterns and relationships from large sets of unstructured data

Gnuplot

is a command-line and GUI program that can generate two- and three-dimensional plots of functions, data, and data fits

ARIMA model (AutoRegressive Integrated Moving Average)

is a popular statistical model used for time series forecasting

Imputation

is the process of replacing missing data with substituted values
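
One common way to do this is mean imputation: each missing value is replaced with the average of the observed values. The sketch below is only an illustration; the function name `impute_mean` and the sample ages are made up, and missing entries are assumed to be stored as `None`.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical data: two ages are missing.
ages = [30.0, None, 40.0, 50.0, None]
print(impute_mean(ages))  # [30.0, 40.0, 40.0, 50.0, 40.0]
```

Mean imputation is simple but shrinks the spread of the data, so other substitutes (median, a model-based estimate) may suit some studies better.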

Swift

a powerful and intuitive programming language optimized for iOS, macOS, and other Apple platforms

Operationalization

the process of assigning a precise method for measuring a term being examined for use in a particular study

Cohort analysis

the process of breaking a data set into groups of similar data, often into a customer demographic. This allows data analysts and other users of data analytics to further dive into the numbers relating to a specific subset of data.
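
A minimal sketch of the grouping step, assuming each record is a dictionary with a cohort label and a numeric value. The field names (`age_group`, `spend`) and the records are hypothetical.

```python
from collections import defaultdict

def average_by_cohort(records, cohort_key, value_key):
    """Group records by a cohort label and average a numeric field per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[cohort_key]].append(record[value_key])
    return {cohort: sum(vals) / len(vals) for cohort, vals in groups.items()}

# Hypothetical customer records.
customers = [
    {"age_group": "18-34", "spend": 20.0},
    {"age_group": "18-34", "spend": 40.0},
    {"age_group": "35-54", "spend": 60.0},
]
print(average_by_cohort(customers, "age_group", "spend"))
# {'18-34': 30.0, '35-54': 60.0}
```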

Line charts

a chart that plots continuous data points connected by lines to compare trends over time. Place time measurements on the x-axis.

standard deviation

a computed measure of how much scores vary around the mean score
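
As a worked example, the population standard deviation is the square root of the average squared distance from the mean. The function name and the scores below are illustrative only; a sample standard deviation would divide by n - 1 instead of n.

```python
import math

def population_std(scores):
    """Square root of the mean squared deviation from the mean."""
    m = sum(scores) / len(scores)
    variance = sum((x - m) ** 2 for x in scores) / len(scores)
    return math.sqrt(variance)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
print(population_std(scores))  # 2.0
```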

Data Visualization

describes technologies that allow users to see or visualize data to transform information into a business perspective

Result analysis

detailed description of the results obtained through experimentation.

Data preprocessing

A tedious process of converting raw data into an analytic ready state.

Random forest

An algorithm used for regression or classification that builds a collection of decision trees; the trees "vote" on the prediction (a majority vote for classification, an average for regression)

Data administrator

An individual responsible for defining and implementing consistent principles for a variety of data issues.

R

A free, open-source statistical programming language, built and maintained for statisticians by statisticians. Capable of both data analysis and data graphics. You can write your own functions and packages to make graphics the way you want.

Data Preparation

Gather and organize the data in the correct formats and structures for analysis

Ruby

Has a dynamic type system and automatic memory management.

Data Wrangling

The process of cleaning, unifying, and preparing unorganized and scattered data sets for easy access and analysis. (converting unstructured data to structured data)

Predictive analytics

This moves to what is likely going to happen in the near term. What happened to sales the last time we had a hot summer? How many weather models predict a hot summer this year?

Prescriptive analytics

This suggests a course of action. If the likelihood of a hot summer, measured as an average of these five weather models, is above 58%, we should add an evening shift to the brewery and rent an additional tank to increase output.

Database administrator

Which stakeholder has access to essential tables or storage systems and guarantees the highest levels of security in the data repository?

Data scientist

extracts knowledge from data by performing statistical analysis, data mining, and advanced analytics on big data to identify trends, market changes, and other relevant information

Prospects

in the target market, but are not yet customers

Business intelligence analyst

makes sense of an organization's data and information, and presents the findings to senior staff for the purpose of making tactical and strategic decisions

Monte Carlo simulations

model the probability of different outcomes happening. They're often used for risk mitigation and loss prevention. These simulations incorporate multiple values and variables and often have greater forecasting capabilities than other data analytics approaches.
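
A small sketch of the idea: simulate a random process many times and count how often an outcome occurs. The dice scenario is a stand-in chosen because the exact answer (6/36, about 0.167) is known; the function name and seed are assumptions for the example.

```python
import random

def estimate_probability(trials, seed=0):
    """Estimate P(sum of two dice >= 10) by repeated random simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total >= 10:
            hits += 1
    return hits / trials

estimate = estimate_probability(100_000)
print(f"simulated: {estimate:.3f}, exact: {6 / 36:.3f}")
```

In a risk setting the dice would be replaced by random draws for uncertain inputs such as costs or demand.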

Type 2 Errors (or false negatives)

occur when we fail to reject a null hypothesis that is actually false.

Type 1 Errors (or false positives)

occur when we reject a null hypothesis that is actually true.

Data post-processing

a simple way of applying mathematical expressions, logic, arithmetic, and conditional functions to data

Association rules analysis

specify a relation between attributes that appears more frequently than expected if the attributes were independent.
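
"More frequently than expected if the attributes were independent" is usually measured with lift: a lift above 1 means the two items co-occur more often than independence would predict. The sketch below computes support, confidence, and lift for a single rule; the function name and the shopping baskets are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_c = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if has_a == 0 or has_c == 0:
        raise ValueError("both items must appear in at least one transaction")
    support = has_both / n           # share of baskets containing both items
    confidence = has_both / has_a    # P(consequent | antecedent)
    lift = confidence / (has_c / n)  # confidence relative to P(consequent)
    return support, confidence, lift

# Hypothetical baskets.
baskets = [{"bread", "butter"}, {"bread", "butter"}, {"bread"}, {"milk"}]
support, confidence, lift = rule_metrics(baskets, "bread", "butter")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```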

Business user

subject matter expert, who benefits from the results

Support vector machines (SVM)

supervised learning models with associated learning algorithms that analyze data for classification and regression analysis

Machine learning

the extraction of knowledge from data based on algorithms created from training data

data integration

the integration of data from multiple sources, which provides a unified view of all data

Database administrator

the person responsible for coordinating, controlling, and managing the database

data profiling

the process of collecting statistics and information about data in an existing source

Data Modeling

the process of determining the users' information needs and identifying relationships among the data

Model Deployment

the process of putting machine learning models into production

Feature selection

the process of selecting attributes which are most predictive of the class we are predicting

Established Customers

those new customers who return, for whom the relationship is hopefully broadening or deepening

MATLAB

used for a variety of mathematical calculations and tasks

Regression analysis

used to analyze the relationship between one or more independent variables and a dependent variable.

Project managers

work with project sponsors, project team, and other people involved in a project to meet project goals

Alpine Miner

a tool that allows non-data scientists to create predictive analytics data models without writing code

Tableau

an American interactive data visualization software company focused on business intelligence

K-means

an algorithm in which "k" indicates the number of clusters and "means" represents the clusters' centroids
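
A minimal one-dimensional sketch of the usual iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. Real uses work on multi-dimensional data and choose starting centroids more carefully; here the starting centroids are passed in by hand, and all names and numbers are illustrative.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds on 1-D data."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```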

Responders

are prospects who have exhibited some interest (filling out application, registering on a website, etc.)

New Customers

are responders who have made a commitment, usually an agreement to pay (first purchase, signed a contract, registered on a site with some personal information)

Former Customers

are those who have left, either as a result of voluntary attrition (because they have defected to a competitor or no longer see value in the product), forced attrition (because they have not paid their bills), or expected attrition (because they are no longer in the target market).

SPSS Modeler

data mining software used to build predictive models and conduct other analytic tasks

Project sponsor

The person who provides the direction and funding for a project

Project managers

Which stakeholder is primarily responsible for ensuring the desired quality of the project?

SAS/ACCESS

Which tool is used to connect users to relational databases and data warehouse appliances in the model planning phase?

Project sponsor

Which role is responsible for project initiation and providing the requirements for a project?

Customer Life Cycle

1. Prospects 2. Responders 3. New Customers 4. Established Customers 5. Former Customers

SAS Enterprise Miner

A comprehensive commercial data mining software tool

Time series analysis

A forecasting method that uses historical sales data to discover patterns in the firm's sales over time and generally involves trend, cycle, seasonal, and random factor analyses

Oversampling

A form of probability sampling; a variation of stratified random sampling in which the researcher intentionally overrepresents one or more groups.

Scatter plots

A graph with points plotted to show a possible relationship between two sets of data.

Text analysis

A process for extracting value from large quantities of unstructured text data.

Data engineer

A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure

data deduplication

A specialized data compression technique for eliminating duplicate copies of repeating data

Microsoft Excel

A spreadsheet application tool that analyzes data in a table format using formulas

Logistic regression

A statistical analysis which determines an individual's risk of the outcome as a function of a risk factor. The outcome of interest has two categories.
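
The "risk as a function of a risk factor" part can be sketched with the logistic (sigmoid) function, which turns a linear score into a probability between 0 and 1. The coefficients below are invented for illustration, not fitted to any data.

```python
import math

def predict_probability(x, intercept, slope):
    """Logistic function: probability of the outcome given risk factor x."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))

# Hypothetical coefficients: risk reaches 50% when x = 40.
for x in (20, 40, 60):
    print(x, round(predict_probability(x, intercept=-4.0, slope=0.1), 3))
```

A cutoff (often 0.5) then assigns each individual to one of the two outcome categories.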

Minitab

A statistical package for performing statistical analysis, designed to perform analysis as accurately as possible

Python

An easy-to-learn, general-purpose, high-level scripting language.

Naive Bayes

A predictive classification technique: training data is used to classify new data points. For example, given training examples such as (red, round: apple) and (yellow, oblong: banana), a new object that is red is more likely to be an apple.
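
The fruit example can be sketched as a tiny categorical Naive Bayes classifier. This is a simplified illustration: it uses add-one smoothing and assumes every feature has exactly two possible values (`values_per_feature=2`), and the training rows are made up.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each class and each (feature position, value) occurs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[label][(i, value)] += 1
    return class_counts, feature_counts

def predict(model, features, smoothing=1.0, values_per_feature=2):
    """Pick the class with the largest prior times product of feature likelihoods."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            seen = feature_counts[label][(i, value)]
            score *= (seen + smoothing) / (count + smoothing * values_per_feature)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (color, shape) -> fruit.
examples = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("yellow", "oblong"), "banana"),
    (("yellow", "oblong"), "banana"),
]
model = train(examples)
print(predict(model, ("red", "round")))      # apple
print(predict(model, ("yellow", "oblong")))  # banana
```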

SPSS modeler

It is used in the model execution phase to apply the trained model to new data for predictions

Box plots

To build one, make sure the numbers are in order from least to greatest. Find the median, then find Q1 and Q3 by taking the medians of the lower and upper halves.
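
Those steps can be sketched as a five-number summary (minimum, Q1, median, Q3, maximum). This version leaves the middle value out of both halves when the count is odd, which is one of several quartile conventions; the function name and data are illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    values = sorted(data)  # step 1: order from least to greatest
    n = len(values)
    half = n // 2
    lower = values[:half]
    # With an odd count, skip the middle value when forming the upper half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return values[0], median(lower), median(values), median(upper), values[-1]

print(five_number_summary([7, 1, 3, 9, 5, 11, 13]))  # (1, 3, 7, 11, 13)
```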

Data interpretation

Making sense of and analyzing data to find patterns and trends.

Hadoop

Open-source software framework that enables distributed parallel processing of huge amounts of data across many inexpensive computers.

Data analyst

Someone who collects, transforms, and organizes data in order to draw conclusions, make predictions, and drive informed decision-making

p-value

The probability level which forms basis for deciding if results are statistically significant (not due to chance). The probability of observing a test statistic as extreme as, or more extreme than, the statistic obtained from a sample, under the assumption that the null hypothesis is true.
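
A worked example, assuming a fair-coin null hypothesis: the one-sided p-value for seeing at least 9 heads in 10 flips is the total binomial probability of 9 or 10 heads. The function name is made up for this sketch.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """P(at least `successes` successes in `trials`) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

p = binomial_p_value(9, 10)  # (10 + 1) / 1024
print(round(p, 4))  # 0.0107
```

Because 0.0107 is below the common 0.05 cutoff, this result would usually be called statistically significant.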

Descriptive analytics

This describes what has happened over a given period of time. Have the number of views gone up? Are sales stronger this month than last?

Diagnostic analytics

This focuses more on why something happened. It involves more diverse data inputs and a bit of hypothesizing. Did the weather affect beer sales? Did that latest marketing campaign impact sales?

A/B testing

This is the process of comparing two variations of a single variable to determine which performs best in order to help improve marketing efforts
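
One common way to decide which variation "performs best" is a two-proportion z-test on conversion rates. The sketch below is an illustration under a normal approximation; the function name and the visitor counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if A and B were identical
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical test: 100 of 1000 visitors convert on A, 130 of 1000 on B.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```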

Nonlinear regression

Used if a hypothesis exists that suggests a curvilinear relationship between the predictor variables and the criterion variable.

Histograms

Used when data is continuous. The bars touch each other.

Cross-validation

Verifying the results obtained from a validation study by administering a test or test battery to a different sample (drawn from the same population)
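
In machine learning the same idea is often applied as k-fold cross-validation, a related but different procedure from the one defined above: the data is split into k parts, and each part takes a turn as the held-out sample. The sketch below only produces the index splits; the function name is an assumption for the example.

```python
def k_fold_splits(n_items, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_items))
    fold_size, remainder = divmod(n_items, k)
    splits = []
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra item.
        end = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        splits.append((train, test))
        start = end
    return splits

for train, test in k_fold_splits(6, 3):
    print("train:", train, "test:", test)
```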

Maintaining Databases

What is a skill required of a data engineer?

D3.js

Which data visualization tool in the communicate results phase is used to create web-based visualization?

Business users

Which group of stakeholders comprises the professionals, such as line managers?

Data engineer

Which job position is primarily responsible for designing and constructing data pipelines within the field of data analytics?

Data engineer

Which role in a data analytics project helps data scientists shape data for analysis?

Data scientist

Which role in a data analytics project provides expertise for analytical techniques?

Decision tree

a graph of decisions and their possible consequences; it is used to create a plan to reach a goal

OpenLayers

a high-performance, feature-packed library for creating interactive maps on the web

coefficient of determination

a measure of the amount of variation in the dependent variable about its mean that is explained by the regression equation
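
As a formula, R² = 1 - SS_res / SS_tot, where SS_tot is the variation of the dependent variable about its mean and SS_res is the variation left over after the model's predictions. The values below are made-up observations and predictions used only to show the arithmetic.

```python
def r_squared(actual, predicted):
    """Share of the variation about the mean that the predictions account for."""
    mean_y = sum(actual) / len(actual)
    ss_tot = sum((y - mean_y) ** 2 for y in actual)                # total variation
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
    return 1 - ss_res / ss_tot

actual = [1, 2, 3, 4]              # SS_tot = 5
predicted = [1.5, 1.5, 3.5, 3.5]   # SS_res = 1
print(round(r_squared(actual, predicted), 3))  # 0.8
```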

Correlation coefficient

a statistical measure that quantifies the strength and direction of a linear relationship between two variables. It is a number between -1 and 1 that tells you how closely the measurements of the two variables move together across a dataset
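
A minimal sketch of the Pearson correlation coefficient, computed from deviations about each variable's mean. The function name and data are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 3))  # 1.0 (perfect positive)
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 3))  # -1.0 (perfect negative)
```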

Principal component analysis

a statistical method to simplify the description of a set of interrelated variables. Its general objectives are data reduction and interpretation; there is no separation into dependent and independent variables; the original set of correlated variables is transformed into a smaller set of uncorrelated variables called the principal components. Often used as the first step in a factor analysis.

Linear regression

a statistical method used to fit a linear model to a given data set
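
For one predictor, the usual least-squares fit has a closed form: the slope is the co-variation of x and y divided by the variation of x, and the intercept makes the line pass through the point of means. The function name and data below are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points that lie exactly on y = 2x + 1.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```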

Multiple regression

a statistical technique that computes the relationship between a predictor variable and a criterion variable, controlling for other predictor variables

Cluster analysis

a technique used to divide an information set into mutually exclusive groups such that the members of each group are as close together as possible to one another and the different groups are as far apart as possible

Factor analysis

correlations among many variables are analyzed to identify closely related clusters of variables

