D204 The Data Analytics Journey

In which phase of the data analytics life cycle does an analyst build a histogram?

Data Exploration

Applying Phase

1. Present Model 2. Deploy Model 3. Revisit the Model 4. Archive Assets

Components of EDARP

1. Define the questions 2. Method to answer the questions 3. Wrangling 4. Budget 5. Stakeholders 6. Deployment 7. Maintenance and Delivery

3 ways to list a project

1. Hold a meeting 2. Categorize tasks (WBS) Work Breakdown Structure 3. Ask Others

Mel's project will take an average of 24 weeks to complete, and worst case is 36 weeks. To be 90 percent safe, he should add some contingency. What contingency would you recommend that Mel take?

36 - 24 = 12 weeks; 12 weeks / 2 = 6 weeks
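
The arithmetic on this card can be sketched in a few lines of Python. The rule (half the gap between the worst case and the average) is taken from the card itself; the function name `contingency_weeks` is hypothetical.

```python
def contingency_weeks(average_weeks: float, worst_case_weeks: float) -> float:
    """Contingency as half the gap between the worst case and the average."""
    return (worst_case_weeks - average_weeks) / 2


# Mel's project: 24 weeks on average, 36 weeks in the worst case.
print(contingency_weeks(24, 36))  # 6.0
```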

How does one define research questions with an organization?

A solid understanding of the root question must be achieved. Working backwards from the desired output can help frame the right questions to ask.

Power and Interest Grid

A tool used to group stakeholders based on their level of authority (power) and their level of concern (interest) for project outcomes

What does Bayes' Theorem describe?

A way to compute the probability of an event given prior knowledge related to the event

Prescriptive

Analysis with an end goal of making a recommendation. Keyword: Causation. Methods: quasi-experiments, A/B testing. Finds the best action to bring about your goals; focuses on cause-and-effect relationships in your data.

Turbine, Inc. is implementing a wind energy project. The key driver for the project is quality. What should the PM do with the key driver?

Add a safety margin to the key driver.

Infographics

Adobe illustrator

Example of Machine Centric

Alexa, smart watch wearables

Example of quantitative data

Alice has 8 dogs (a count, so discrete data), and all the dogs weigh between 10 and 25 pounds (a measurement, so continuous data).

What is Interpretability?

Allows humans to understand the process by which algorithms process data so they can apply those principles to new situations.

SVG

Scalable Vector Graphics: an XML-based vector image format for two-dimensional graphics with support for interactivity and animation.

What is most critical about the critical path?

Any delay in the critical path delays the project finish date

SPSS

App for data analysis. It uses a point-and-click graphical user interface.

What does API stand for?

Application Programming Interface

What is "A" in IRAC?

Application-How does a company/government apply the rules to the situation?

A data analytics project team is preparing to develop a predictive model that will be included within a business intelligence tool for upper management. Which step should be considered for inclusion when creating the project schedule?

Business intelligence tool interface training

Who are the end users?

Business managers, executives and even customers who are using the outputs of the research

What formats can be exported?

CSV, JSON

What is the key focus for prescriptive and the main question?

Causal/manipulate and How can we make it happen? How can we change it?

What happens in the data explorations phase?

Central Tendency/ Measures of center (e.g., mean, median, mode), Variability (e.g., standard deviations and quartiles) and distributions (e.g., normal, skewed, etc) Identify basic correlations between variables Pattern discovery
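
As a rough illustration of these exploration measures, the sketch below uses Python's standard `statistics` module. The sample `values` is made up for the example.

```python
import statistics

values = [2, 3, 3, 5, 7, 8, 21]  # made-up sample with one large value

# Measures of center
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of variability
stdev = statistics.stdev(values)                 # sample standard deviation
q1, q2, q3 = statistics.quantiles(values, n=4)   # quartile cut points
iqr = q3 - q1                                    # interquartile range

print(mean, median, mode)
print(round(stdev, 2), q1, q3, iqr)
```

In this sample the mean sits above the median, which is consistent with a longer right tail.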

What are the challenges of passive collection?

Check for shared meanings, need to check for limit cases, ensure adequate representation

Salience Model

Classifies stakeholders based on their power, urgency, and legitimacy

You must measure how positive the reception of the new lunch menu has been at your client's restaurant. What would be the best approach to perform this task?

Collect social media reviews of the restaurant and then apply opinion mining techniques to quantify the satisfaction expressed for the menu.

CSV file

Comma separated values file; a text file with one record per line, and the field of each record separated by commas

You possess data from an online retailer. The data includes customer ages, whether they are foreign or domestic customers, and the dollar value of the items purchased. How would you determine if older customers spend more money on each purchase?

Compute the correlation coefficient between customer age and purchase amount for all customers
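
A minimal sketch of that computation, assuming the Pearson correlation coefficient is the one intended. The function name `pearson_r` and the `ages` and `amounts` values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ages = [22, 35, 41, 58, 63]               # hypothetical customer ages
amounts = [18.0, 25.5, 31.0, 44.0, 52.5]  # hypothetical purchase amounts in dollars

r = pearson_r(ages, amounts)
print(round(r, 3))
```

A value of `r` close to +1 would suggest that older customers tend to spend more per purchase; it would not show that age causes the spending.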

What is "C" in IRAC?

Conclusion-What comes out of the rule and did it infringe on the situation?

After you develop a problem that you want to solve and perform cleaning on the data, what are you now ready to do?

Conduct exploratory data analysis.

A specific drug is manufactured for the treatment of depression. The company decides to ignore research results on an alternative, less expensive drug treatment in order to make a higher profit. Which ASA ethical standard has the company violated?

Conflict of Interest

Actionable insights

Controllable and Practical

What is the key focus for predictive and the main question?

Correlation and What will happen in the future?

Iron Triangle

Cost, Quality, Time

Another name for Predictive Modeling

Data Modeling Correlation based models Regression models Time series

Modeling Phase

Create model, Validate model, Evaluate model, Refine model

What happens in the Data Mining Phase?

Creating training and testing datasets to build models from Identify/detect patterns Determine if groups (clusters) exist in data Classify data into groups Create models that "learn" and improve (e.g., machine/deep learning, AI, etc) Test Hypotheses Refine

What is critical path?

The critical path is the longest/slowest path to completing a project
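
One way to see why the longest path sets the project length is to compute it for a small task graph. The tasks, durations, and dependencies below are hypothetical, and `earliest_finish` is an illustrative helper, not part of any project-management tool.

```python
from functools import lru_cache

# Hypothetical tasks: duration in weeks and the tasks each one depends on.
durations = {"plan": 2, "acquire": 3, "clean": 4, "model": 5, "report": 1}
depends_on = {
    "plan": [],
    "acquire": ["plan"],
    "clean": ["acquire"],
    "model": ["clean"],
    "report": ["plan"],
}


@lru_cache(maxsize=None)
def earliest_finish(task: str) -> int:
    """Own duration plus the latest earliest-finish among prerequisites."""
    start = max((earliest_finish(dep) for dep in depends_on[task]), default=0)
    return start + durations[task]


# The project cannot finish before its longest dependency chain does.
project_length = max(earliest_finish(task) for task in durations)
print(project_length)  # 14
```

Here the chain plan, acquire, clean, model is the critical path; a delay in any of those tasks delays the finish date, while `report` has slack.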

Active Listening

Cue into the verbal and nonverbal messages and the intent of what is being said

Another name for Reporting and Visualization

Dashboards

What are the 3 phases in Wrangling?

Data Acquisition, Data Cleaning and Data Exploration

Andre is part of a data science team. He is good at creating visualizations and reports. Which role on the team is best for him?

Data analyst

Data Science team

Data analysts, research lead and project manager. Their job is to create an interesting data model. Show trends in the model. Show a correlation between certain topics and the likelihood to be shared.

ETL

Data integration that refers to three steps Extract, Transform, Load. Used to blend data from multiple sources
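
The three steps can be sketched with Python's standard `csv` and `json` modules. The two sources, their field names, and the `warehouse` list are made up for the example; a real pipeline would read from and write to actual systems.

```python
import csv
import io
import json

# Extract: read records from two made-up sources with different formats.
csv_source = "id,name\n1,Ada\n2,Grace\n"
json_source = '[{"id": 3, "full_name": "Alan"}]'

csv_rows = list(csv.DictReader(io.StringIO(csv_source)))
json_rows = json.loads(json_source)

# Transform: bring both sources to one schema (integer id, "name" field).
records = [{"id": int(row["id"]), "name": row["name"]} for row in csv_rows]
records += [{"id": row["id"], "name": row["full_name"]} for row in json_rows]

# Load: write the blended records to a destination (an in-memory list here).
warehouse = []
warehouse.extend(records)

print(warehouse)
```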

An analyst realizes that the data set has been reduced significantly, resulting in sample sizes that are too small. In which phase of the data analytics life cycle did this likely occur?

Data mining

What happens in the discovery phase?

Define goals, organize resources, coordinate people, schedule the project

What is business understanding in the discovery phase?

Defines the major questions of interest that need to be answered, understands the needs of the stakeholder, and assesses the resource constraints of the project

What do open-source software tools and widely available analysis tools, such as spreadsheets, help accomplish?

Data democratization

What is the name for a chart that shows "branches" or cases splitting from one, giant cluster, to individual clusters?

Dendrogram

Which analytic method asks the question "What happened in the past?"

Descriptive analytics

What are the methods for modeling?

Descriptive, predictive, prescriptive

Why is it productive to aggregate models?

Different models tend to overestimate and underestimate their predictions, so the differences frequently cancel out
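
A small numeric sketch of that cancellation, using made-up predictions from four hypothetical models and a simple average as the aggregation rule (one of several possible choices):

```python
true_value = 100.0
predictions = [92.0, 97.0, 104.0, 109.0]  # made-up outputs of four models

# Aggregate by simple averaging.
ensemble = sum(predictions) / len(predictions)

individual_errors = [abs(p - true_value) for p in predictions]
ensemble_error = abs(ensemble - true_value)

print(individual_errors)  # [8.0, 3.0, 4.0, 9.0]
print(ensemble_error)     # 0.5
```

Two models underestimate and two overestimate, so the average lands closer to the true value than any single model does in this example.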

Which type of stakeholder mapping technique would best be used for managing change and communication?

Direction of influence. This technique groups stakeholders by their direction of influence, and can be particularly helpful for managing change and communications

Types of quantitative data

Discrete data: values that fall on counts (1, 2, 3). Continuous data: things that can be measured.

Which characteristics are used to group together a cluster analysis?

Distance and Similarity
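
As an illustration of grouping by distance, the sketch below assigns each point to the closest of two cluster centers using Euclidean distance. The points, the centers, and the helper `nearest` are hypothetical; a full clustering algorithm would also have to find the centers.

```python
import math

centroids = {"a": (0.0, 0.0), "b": (10.0, 10.0)}   # assumed cluster centers
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]      # made-up observations


def nearest(point):
    """Name of the centroid with the smallest Euclidean distance to point."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


labels = [nearest(p) for p in points]
print(labels)  # ['a', 'b', 'a']
```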

Potential problem with Report/Visualization

Because reports can reach a large audience, mistakes can lead to bad business decisions and lost revenue. Improper scales in graphs can push readers toward an inaccurate interpretation of the story.

What is GDPR?

EU General Data Protection Regulation

What happens in the Predictive Modeling Phase?

Estimate/project future values or likelihood of an event Extend correlations found in EDA to mathematical models Predict/determine output values based on input values Cross-validation of predictive models to ensure accuracy.

Critical Listening

Evaluate and judge with the intent of looking at the logic of what is being heard

Universal data tool

Excel spreadsheets, google sheets

What strategy will contribute to effective data representation and reporting?

Excluding unrelated data

A data analytics project manager has been asked to complete a project on a very short timeline. What action is likely to yield positive results?

Expand the team with experienced staff

What is the key focus for diagnostic and the main question?

Explained reason and Why did it happen?

Another name for data exploration

Exploratory Data Analysis (EDA) Descriptive Statistics

What does EDARP mean?

Exploring Data Analysis Research Plan

Another name for Data Acquisition

Data extraction, data gathering, data querying, data collection, ETL (extract, transform, load), web scraping

In-house Data

Fastest way to start. Restrictions may not apply, talk with creator

Which concept should be considered when choosing variables for inclusion in a linear regression model?

Feasibility of controlling variables

What happens in the Data Cleaning phase?

Fixing improperly formatted values Dealing with duplicates, missing data, and outliers Data reduction

Which technique can a project manager use to foster the identification of quality data analytics questions?

Frequent collaboration with the team

What happens in the Data Acquisition phase?

Gather/collect data from a variety of sources. Provide structure to data accessible via relational databases (SQL). Build a data pipeline (ETL). Use an API to download data from an external source.

A U.S company collects and sells information on consumers. Which law prevents the company from collecting information on European Union consumers without their permission?

General Data Protection Regulation

Python

General Purpose Programming Language

Wrangling Phase

Get data, clean data, explore data, refine data

Organize resources

Get the right tools. Software, hardware, staff.

Clustering

Grouping data, can be geographical K dimensional space measure distances relies on the similarities between them data mining phase unsupervised learning

What is the Venn Diagram for Data Science?

Hacking Skills, Math and Statistics, and Domain Expertise

Descriptive Analytics

Historical data to better understand the changes that have occurred in a business. Uncovers error in data. Helps understand the distribution of variables. Uses mean, median, mode for counting things Histograms and Bell Curve univariate descriptive

Example of Human Accessible

Home Loan Approval, Credit Card Approval, etc.

Which statistical technique should be used to draw conclusions about an entire population based on a representative sample?

Hypothesis testing

Which task would an analyst consider first during the discovery phase of the data analytics lifecycle?

Identify the project goals

Example of Predictive

In what month did we sell the most cars?

What can be identified using a box plot?

Interquartile range

What is "I" in IRAC

Issue: find the legal problem that is inherent in the situation

Gantt chart

It comes after doing the post-it notes. Input into Excel it helps to see the project, time, cost.

Machine Learning as a Service (MLaaS)

Cloud-hosted machine learning, where the software, often with a drag-and-drop interface, is hosted on the same servers that store the data and house the processors. It democratizes data science.

What does API mean?

It is a way of sharing data. It can take data from one application to another or from a server to your computer.

Reinforcement Learning

It is an algorithm that is designed to reach a particular outcome, for instance running through the levels of a game

What does "D3" mean?

It stands for Data Driven Documents. It helps visualize the data in a web browser.

An example of API

JSON

TwentySomething, Inc. is implementing a data mart server environment. James, the PM, realizes the key driver for the project is quality. What should James refrain from doing with the key driver?

Ignoring time and cost and placing all focus on quality as the key driver

JSON

JavaScript Object Notation

Collaboration tools

Jira, Slack, Teams and PivotalTracker

Classifying Methods

K-means, k-nearest neighbors. Binary classification. Many categories. Distance measures.

Semi-structured data

Key Characteristic: Loosely organized into categories using meta tags. Typical File Types: JSON, XML, Email, Web pages

Predictive

Keyword: Correlation analyze past trends and data to provide future insights Models like regression and decision trees What might happen in the future? Churn Analysis counting, mapping, visualization forecasting

Classification

Labeling discrete data. supervised learning locate,compare,assign Data mining phase

Potential Problems with Business Understanding

Lack of clear focus on stakeholders, timeline, limitations and budget could potentially derail an analysis

What is a common method used for feature selection?

Lasso Regression

Prescriptive Examples

If we launch a new product in January, will it be as well received as if we launch in July or November?

What does GDPR mean?

Law about privacy

Statutory Law

Law passed by the U.S. Congress or state legislatures. For example: Congress passed the Genetic Information Nondiscrimination Act, otherwise known as GINA, to regulate how employers and insurance companies can, if at all, access our genetic information.

What is statutory law?

Laws created by representative bodies, such as the U.S. Congress

Why do data and information systems come before laws?

Laws need to be driven by specific products, processes or events that already took place. Laws are created after something takes place to control parameters

Machine Learning Specialists

People who use math and computer science to advance AI

Your boss needs a quick way to view the project you are managing. You decide to use Excel and create a Gantt chart. What is the first step in creating a Gantt chart?

List the critical path task first

Self-generated data

Looping back, where the computer engages with itself to create data for training the machine.

Another name for Data Mining

Machine Learning Deep Learning AI (artificial intelligence) Supervised/ Unsupervised Models

What is the common duty of a data administrator?

Maintain data on the IT infrastructure

What are the downsides of in-house data?

May not be well documented or well maintained, or may not exist

How do you know if a distribution is negative skew or left skew?

The left tail of the distribution is longer than the right

Schedule the project

Meet with the team and the customer to work out the last details on time and how long the project will take

What does the Critical Path represent in the project?

Minimum time to complete dependent tasks

Entrepreneurs

Need all the skills for business acumen

Scope Creep

New requirements added that increase the time/resources needed to complete the project.

What is unsupervised learning?

No human interventions. Enter data without labeling.

What are the two categorical data?

Nominal(Name)--red, green, blue Ordinal(with an order)-small,medium,large

What is the key focus for descriptive and the main question?

Observation and what happened?

Open data

Open library of data. It is free.

Which type of data analysis is appropriate if the goal is to minimize the cost of a diet, using a data set consisting of the following variables: protein content, fat content, and cost per unit?

Optimization

What is a soft skill?

Persuasion Communication Emotional intelligence Active listening Logic and reasoning Interpersonal skills Negotiation

Data Acquisition Phase

Phase of collecting data. Data is retrieved from sources such as SQL databases, surveys, and web scraping. Techniques: ETL, API

Pathways to Data Analytics

Planning, Wrangling, Modeling, Applying

What are the mapping techniques for stakeholders?

Power and Interest Grid Salience Model Direction of Influence

What is the expected sales forecast for 2021?

Predictive Analytics

What will happen in the future? Churn Analysis

Predictive Modeling

During which phase in the data analytics life cycle would a churn analysis be performed?

Predictive analysis

Which analytic method asks the question "What might happen in the future?"

Predictive analytics

How can we increase our sales by 20% in 2021?

Prescriptive

Which analytics method asks the question "What should we do going forward?"

Prescriptive analytics

Present the model

Presentation of the findings

Data Scraping

Process of extracting data from formats that were not specifically designed for data sharing

Who clears organizational hurdles?

Project manager

Which party has the primary vision for a data analytics project and brings resources to complete it?

Project sponsors

What are the two purposes of the reporting phase of the data analytics life cycle?

Provide the conclusions from the analysis in an engaging manner and provide actionable insights that can inform decision making

Meaning of POWER

Purpose, Outcome, What's in it for them, Engagement, Roles and responsibilities

Which tool has libraries that expand its visualization capabilities?

Python

What are the tools used?

Python R SQL Tableau

Potential Problems with Data Acquisition

Quality and type of data may make access more difficult

An analyst has been asked to analyze the open-ended responses from customers on a satisfaction survey. Which type of data is the analyst working with on this project?

Qualitative

What are some practical methods for identifying cause and effect relationships in your data?

Quasi-experiments

Measure of Variation

Range, quartiles, variance, or standard deviation

What do tools like Slack and Teams include for users?

Real Time Instant Messaging

What is an example of an external stakeholder for a data analytics project?

Regulatory body

Outside people

Regulatory groups

What is "R" in IRAC?

Rule: What, if any, rules are relevant to the situation?

Potential problems with Data Mining

Running a model on the entire data set is problematic; the data needs to be split into training and testing datasets to build models

What happens in the Business Understanding phase?

Scope project Identify stakeholders and research questions/KPIs Identify timeline, budget, and participants

Which of the following models the relationship between a dependent variable and a single independent variable?

Simple Linear regression
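
A minimal sketch of simple linear regression, fitting one dependent variable against one independent variable with ordinary least squares. The function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```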

Potential problems with Data Exploration

Skipping these steps could enable faulty perceptions of the data which hurt advanced analytics

Potential Problems with Data Cleaning

Some cleaning techniques could dramatically change data/outcomes Outliers not dealt with can cause problems with statistical models due to excessive variability.

SQL

Structured Query Language

What types of analyzing methods are used for descriptive analysis?

Summary Statistics, Clustering, and Regression

Data Engineers

System Architects. They help get/give permissions for the data

What software specializes in visualization, graphs, and charts for user interaction?

Tableau and PowerBI

What happens in the Reporting/Visualization phase?

Tell a story with data Provide a summary of analytic analysis Provide insights to stakeholders Create insightful graphs that showcase trends and forecasts

What will be a consequence of poor attention to detail during the data exploration phase?

The analyst will lack insight into the structure of the data set

Research lead

This is someone from the business side who pushes the team to ask interesting questions. Identifying key problems. They know the most about the business. Know topic categories. Drive questions

Source of EDARP

The goal is to create some form of value

What do vertical lines on a Gantt chart represent?

The limits of the allowable movement of floating tasks

What mistake is commonly made during the predictive analytics phase?

The model is developed before the research question is known

Overfitting

The process of fitting a model too closely to the training data for the model to be effective on other data.

The PM and the accountant are reviewing the financial information of a project. Which assumption should the accountant refrain from making?

The project is linear.

You are creating a list of people who are stakeholders for your project. Who would be the key stakeholder?

The project sponsor

For a project to have success, who must engage in the project work and outcomes?

The project stakeholders

How do you know if a distribution is positive skew or right skew?

The right tail of the distribution is longer than the left

Numerical measurements of the amount of a toxic chemical substance are recorded in a large database. Which hypothesis can the data analyst answer through exploratory data analytics methods?

The statistical distribution of the chemical measurements is normal.

Mode

The value that occurs most frequently in a given data set.

Data scientists

They combine the work of data analysts with creating software, working with mathematics, knowing the business, asking interesting questions, and hacking.

Managers

They oversee entire process of data analysts, data engineers, machine learning specialists and even researchers

Researchers

They use data to help them further studies

Unicorn

This is a full-stack data scientist who can do it all, and do it at absolute peak performance.

How would you use Bayes' theorem for spam detection?

To calculate the probability a message is spam if it contains certain flagged words
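
The spam calculation can be sketched directly from Bayes' theorem, P(spam | word) = P(word | spam) P(spam) / P(word). All three input probabilities below are made-up numbers, and `posterior_spam` is a hypothetical helper name.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) via Bayes' theorem for a single flagged word."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Assumed inputs: 20% of mail is spam; the flagged word appears in
# 50% of spam messages and in 5% of legitimate messages.
p = posterior_spam(p_spam=0.2, p_word_given_spam=0.5, p_word_given_ham=0.05)
print(round(p, 3))  # 0.714
```

Under these assumed numbers, seeing the flagged word raises the probability of spam from the 20% prior to about 71%.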

You are at the end of a very long project, and the last task to complete is a review. What is the benefit of performing a project review with your team?

To learn for next time

During Data Mining, why might an analyst resample a data set with replacement data?

Too little data for training and testing data sets
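
Resampling with replacement (often called bootstrapping) can be sketched with `random.choices`, which draws with replacement. The data values and the seed are arbitrary choices for the example.

```python
import random

rng = random.Random(0)       # fixed seed so the run is repeatable
data = [12, 15, 9, 20, 17]   # made-up small data set

# Draw a sample of the same size; the same value may be picked more than once.
resample = rng.choices(data, k=len(data))
print(resample)
```

Repeating the draw many times yields many plausible data sets from one small original, which is why the technique helps when there is too little data to split.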

Potential problems with Predictive Modeling

Too many input variables (predictors) can cause problems Correlation does not imply causation Time series models often need sufficient time data to offer precise trending Predictive model accuracy should be assessed using cross-validation.

Jerome has organized his Post-it notes and is now estimating durations. What is a typical practice for estimating durations at this stage?

Using average times in weeks

What are effective interpersonal communications?

Verbal and Listening

Which aspect of data exploration occurs when an analyst writes code to compile a bar graph of dog food sales per month?

Verification through visualization

What elements make up "Big Data"?

Volume, Velocity, Variety

You have a data set containing four columns of data that describe an index number, a model number, weight class, and weight. Which column contains quantitative data?

Weight

Example of Prescriptive

What color and sizes of shirts should we sell to maximize profits?

Example of Descriptive Analytics

What has happened over the last 5 years?

Prescriptive modeling question

What types of donuts should we sell to maximize profits?

Enforce learning

When the project manager turns insight into something actionable

What happens in the data acquisition phase?

Wrangling: Get data, Clean data

Is Tableau easy to use?

Yes, it allows a novice user the ability to interact with the data and spot trends and patterns

C

You are running a cluster of servers taking in streaming data from cell towers. There's a bottleneck in the extract, transform, load (ETL) pipeline. What is the most likely issue? A Users are overloading the servers with data queries. B There are more servers than needed in the cluster. C Load balancing is not evenly splitting the load among servers

Geoplotlib

a Python toolbox for geographic visualizations

Time Series Analysis

a data metric over time. Ex. Area chart

Calculus

a function that describes the relationship between price and sales. Ex. to find the best price for maximizing revenue, you must first have a formula that says how sales are related to price

AI(Artificial Intelligence)

a replica of the human brain that can solve cognitive task. Programs that learn from data and machine learning.

Conflict of Interest

a situation in which an action by a company or individual results in an unfair benefit.

Bayesian analysis

a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that the average male height is between 70 and 80 inches or that the average female height is between 60 and 70 inches?

Hierarchical clustering

algorithm that groups objects into groups called clusters

Histograms

allows a way to graph numerical data in "groups" or bins that allow bars to represent frequencies. __________ are convenient graphs to show outliers in data and skewness (i.e., asymmetry of the data). For example, this is a __________ on "Years of Experience" which shows a right-sided (i.e., positive) skewness

Reporting and Visualization Phase

an analyst tells the story of the data and uses graphs and interactive dashboards to inform others of the findings

Data Flow

analyzes the flow of data through a system and the user's experience

JASP

app for data analysis. It is free and open source. It is good for democratizing data

JAMOVI

app for data analysis. It is free and open source. Statistical spreadsheet software that aims to simplify two aspects of using R

Expert Systems

are algorithms that mimic decision making process of a human domain expert. Can spell out every step in a decision tree like a flow chart

Packages

are collections of code that give additional functionality to a programming language and simplify many common tasks.

C, and C++, and Java

are general-purpose languages that are used for the back end, the foundational elements of data science, and they provide maximum speed

Cross-validation

splits the data into groups (folds), builds the model on some groups, and tests it on the group that was held out
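
A sketch of how k-fold cross-validation divides the data, assuming the common scheme in which each group takes one turn as the test set. The generator `k_fold_splits` is a hypothetical helper that works on row indices only; model fitting is left out.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    fold_size, remainder = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra item.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


splits = list(k_fold_splits(6, 3))
for train, test in splits:
    print(train, test)
```

Every row lands in a test set exactly once, so the accuracy estimate does not depend on a single lucky or unlucky split.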

Bokeh

can create interactive charts and graphs. It allows you to create tools for the user so they can access and explore your visualizations.

Principal Component Analysis

combine multiple, correlated variables into a single component

Create the model

comes after refining the data. The analyst will figure out what type of model to use

Linear regression

common and powerful technique for combining many variables into an equation to predict a single outcome

Nominal comparison

comparing values of categories or sub-categories

Generative Adversarial Networks

one neural network creates data and a second neural network tests that data

Evaluate the model

consider this as double checking your data

Pygal

creates charts that are interactive SVGs with minimal lines of code

GGplot2

creates professional-looking plots with minimal user code

What is quantitative data?

data can be quantified, verified and measured. All values are numerical.

What is qualitative data?

data obtained by the researcher from first-hand observation, interviews, surveys, focus groups. All non-numerical sources

What is categorical data?

data that can be grouped into a category.

Tidy Data

data that is easily imported. One sheet per file, one level of observation, column=variable

Human accessible

decision making - Many algorithmic decisions are made automatically, and even implemented automatically. But they're designed such that humans can at least understand what happened in them. Such as, for instance, with an online mortgage application.

Human in the Loop

decision making is where advanced algorithms can make and even implement their own decisions, as with self-driving cars.

Data definitions

defines product, customer and process data

infographics (information graphics)

displays information graphically so it can be easily understood

Data relationships

drives the architecture which consists of the boundaries of how data relates to each other

Joyce is part of a data science team. She is very interested in running experiments. At what point would Joyce do this?

during the process of asking questions

Predictive Modeling Phase

enable an analyst to move beyond describing the data to creating models that enable predicting outcomes of interest cross-validation of predictive models to ensure accuracy estimate/project values or likelihood of an event

Looping back

engage themselves to create the data training machine

Correlation

examining the relationship of 2 or more numerical values

Deviation

explores how data points relate to each other. Easy to use graphs to examine

Factor Analysis

find the underlying common factor that gives rise to multiple indicators

Optimization Analysis

finding the best value for one or more target variables given certain constraints. It shows what value a variable should have, given certain constraints or restraints
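
A toy version of such an analysis, loosely modeled on a diet-cost problem: find the cheapest mix of two foods that meets a protein requirement. The foods, their numbers, and the brute-force search are all assumptions made for the example; real problems of this kind are usually handed to a dedicated solver.

```python
from itertools import product

# Made-up foods: protein grams and cost per serving.
foods = {
    "beans": {"protein": 8, "cost": 1.0},
    "chicken": {"protein": 25, "cost": 3.0},
}
required_protein = 50  # the constraint

best = None  # (cost, beans_servings, chicken_servings)
for beans, chicken in product(range(11), repeat=2):
    protein = beans * foods["beans"]["protein"] + chicken * foods["chicken"]["protein"]
    cost = beans * foods["beans"]["cost"] + chicken * foods["chicken"]["cost"]
    # Keep the cheapest combination that satisfies the constraint.
    if protein >= required_protein and (best is None or cost < best[0]):
        best = (cost, beans, chicken)

print(best)  # (6.0, 0, 2)
```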

Explicit Rules

flow charts to show what method

Democratization Data

for everyone to have access to the information and platform

GIGO

garbage in, garbage out

Creating data

get your own data through natural observations, informal discussions, interviews, surveys

What is Constitutional law?

gives the government authority to act and restricts that authority to avoid infringement

Optimization analysts

goal seeking analysts

Implicit rules

help the algorithm function. They are the rules that are developed while analyzing the test data. They cannot be easily described to humans

Key Project Driver

helps to figure out what is the most important to the stakeholder

PCA vs. FA

in principal component analysis, the variables come first and the component results from it. In factor analysis, this hidden factor comes first and gives rise to the individual variables.

Data Ethics

institutional review board, informed consent, confidentiality

Tableau

interactive data exploration and visualization. Brings in data from many different sources and formats, which facilitates data integration. Can help anyone see and understand their data via visual dashboards

Qlik

interactive data exploration. Same as Tableau

Heat Maps

is a colorful graph that can visually show frequency or interaction using a range of colors. Example, a heatmap is applied to a webpage to show where customers typically hover their mouse or click/interact with the website.

Decision tree

is a decision support tool. Sequence of binary decisions based on your data, that can combine to predict an outcome. It branches from one decision to the next

Scatterplots

is a two dimensional graph which is great to visualize correlation or relationships.

What are quasi-experiments?

useful ways to approximate cause and effect, as are what-if simulations and optimization models

Feature

is a variable or dimension in the data

IRAC

is our legal analysis tool to understand how to move from identifying a legal issue to reaching a conclusion and a decision about how to take action

Data privacy

is responsibly collecting, using and storing data about people, in line with the expectations of those people, your customers, regulations and laws

Data analysts

is the person who obtains and scrubs the data. They will display the data in graphs and reports

Factor analysis

is to find the underlying common factor that gives rise to multiple indicators.

Causation

is when there is a real-world explanation for WHY this is logically happening; it implies a cause and effect

Pandas

is a Python library (built on top of NumPy) that provides data structures and serves as a data analysis tool

What is common law?

laws created by the courts, such as the rules of negligence

Dimensionality reduction

learning to read a language. Errors tend to cancel out.

JSON means

lightweight format for storing and transporting data. Open standard file format and data interchange format that uses human-readable text to store and transmit data objects. It is text only.
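
As a minimal sketch of JSON being plain text that stores and transports data objects, the example below round-trips a small record through Python's standard `json` module. The record and its field names are hypothetical and invented for illustration only.

```python
import json

# Hypothetical record, invented for illustration only.
record = {"name": "Alice", "dogs": 8, "weights_lb": [10.5, 25.0]}

# Serialize the Python object to human-readable JSON text.
text = json.dumps(record)

# Parse the text back into Python objects.
restored = json.loads(text)

print(text)
print(restored == record)
```

The serialized value is an ordinary string, which is why JSON can be written to a file or sent over a network without any special binary handling.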

K-Dimensional Space

locate each data point, each observation, in a multidimensional space

Univariate Descriptive

look for one number that can represent the entire collection (e.g., the mean)

Refine the model

make sure you are answering the question that has been provided

Tensorflow

makes it easy to do deep learning with neural networks

Autocorrelation

means that each point in time is influenced by the points that came before it.

The standard deviation

measures how closely the values in the sample cluster around the mean
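
A small sketch may help here: two samples can share the same mean while differing widely in spread. The two lists below are made-up values, and the sketch uses the sample standard deviation from Python's `statistics` module.

```python
import statistics

# Two made-up samples with the same mean (10) but different spread.
tight = [9, 10, 10, 11]
spread = [2, 6, 14, 18]

mean_tight = statistics.mean(tight)
mean_spread = statistics.mean(spread)

# Sample standard deviation: larger when values sit farther from the mean.
sd_tight = statistics.stdev(tight)
sd_spread = statistics.stdev(spread)

print(mean_tight, sd_tight)
print(mean_spread, sd_spread)
```

The second sample has the larger standard deviation because its values sit farther from the shared mean.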

Unstructured data

most abundant type of data and covers everything from text and audio to images and videos. Cannot be stored in a structured database

Python is

most popular language for data science and machine learning. It is very clean and easy to learn. Manages memory and large data sets better

Passive Collection

one-trial learning.

Functional managers

the ones who oversee a couple of data analyst teams

Project team

people who do the day-to-day work on the project. Managers, analysts, techs. People outside of the company

Project sponsor

people who help fund the project and see it through to the end

Subject Matter Experts (SME)

people who know everything. They have been in a field for a long time and know the "ins" and "outs"

Regulators

people who set rules for business

Bayes' Theorem

posterior probability as a function of the likelihood, the probability of getting the data you found

Structured data

has a predefined data model and is simple to analyze

Anomaly Detection

process of identifying rare or unexpected items or events in a data set that do not conform to other items in the data set.

What do BAs need to understand and analyze the linkages between?

process, rules, and data.

R

programming language that was developed especially for work in data analysis. Statistical uses. Works with vectorized operations and non-standard evaluation

Privacy Torts

protect individuals' rights to keep certain things out of public view even if they are true

Boxplot

provides a concise summary of the quartiles of numerical data (i.e., cutpoints that divide the data into four segments of 25% each). This graph is also convenient for detecting outliers and skewness.
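
The quartiles behind a boxplot can be sketched with the standard library. The data below are made up, and the 1.5 × IQR fence used to flag outliers is the common boxplot convention, assumed here rather than taken from the card.

```python
import statistics

# Made-up data with one obviously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50]

# Three cutpoints that split the data into four 25% segments.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range and the conventional 1.5 * IQR fences (an assumption).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)
print(outliers)
```

Note that `statistics.quantiles` supports more than one interpolation method, so other tools may report slightly different quartile values for the same data.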

Get Data

pull the data from the database or data warehouse.

Which are examples of models used in predictive analytics?

regression and decision trees

What are the techniques used?

regression, decision trees, classification, clustering, neural networks, time series, PCA

BI(Business Intelligence)

relies on structured dashboards. Emphasizes speed, insight, access

Levels of granularity

right level of details. Cannot be estimated, overlap, ongoing

Kendall and her team are preparing a list of risks and entering them in a risk assessment chart. How do they calculate the weighted factor?

risk factor x impact factor
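
As a rough sketch of the weighted-factor calculation, the snippet below multiplies a risk factor by an impact factor for each entry and ranks the results. The risk names and the scores are hypothetical and invented for illustration only.

```python
# Hypothetical risks: (name, risk factor, impact factor).
risks = [
    ("Key analyst leaves", 2, 5),
    ("Data source delayed", 4, 3),
    ("Scope creep", 3, 3),
]

# Weighted factor = risk factor x impact factor.
weighted = [(name, risk * impact) for name, risk, impact in risks]

# Highest weighted factor first, so the most pressing risk is on top.
weighted.sort(key=lambda item: item[1], reverse=True)

for name, score in weighted:
    print(name, score)
```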

XML

self-describing (self-defining) format: the definition of the data is embedded with the data. When the data arrives, there is no need to prebuild a structure to store it

Neural networks

series of algorithms that mimic the operations of a human brain to recognize relationships between vast amounts of data. They pick up on characteristics that are not readily visible to humans and that would not necessarily make sense to a human

Data Cleaning Phase

goes by several terms, like data cleansing, data wrangling, data munging, and feature engineering. Includes fixing improperly formatted values, data reduction, and dealing with duplicates, missing data, and outliers. This phase cannot be skipped, or the results may become irrelevant

Part to whole

simply how a smaller subset compares with the larger whole

Refine the Data

sometimes you need to reshape the data. Ex. you might not have enough data and have to go back to your resources to gather more information

Community groups

specific type of group of people looking for information from the data team

HTML

tagging text files to achieve font, color, graphic, and hyperlink effects

Decomposition

take a trend over time and break it down into several elements

Implicit Rules

rules that help algorithms function. They cannot be easily described to humans, but they are very effective

Principal Component Analysis (PCA)

takes your multiple correlated variables and combines them into a single component score

Data Exploration Phase

the analyst begins to understand the basic nature of the data and the relationships within it. Includes pattern discovery, descriptive analytics, verification of visualization, central tendency, and distribution (normal, left-skewed, right-skewed). Relies on the use of data visualization tools

Business architecture

the business strategy, business capabilities, business knowledge, value streams and organizational views

Central Limit Theorem

the distribution of sample averages tends to be normal regardless of the shape of the process distribution
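
A quick simulation can illustrate the idea. This is only a sketch under assumed settings: the draws come from a skewed exponential distribution with mean 1, and the sample size of 30, the 2,000 repetitions, and the seed are all arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # arbitrary choice for illustration
NUM_SAMPLES = 2000   # arbitrary choice for illustration

# Each sample is drawn from a skewed (exponential) distribution with mean 1.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)

# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the mean.
within_one_sd = sum(
    abs(m - center) <= spread for m in sample_means
) / NUM_SAMPLES

print(center, spread, within_one_sd)
```

Even though the individual draws are strongly skewed, the sample averages cluster around the process mean, and the share within one standard deviation should land near the roughly 68% expected of a normal distribution.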

Median

the middle score in a distribution; half the scores are above it and half are below it

What is posterior probability?

the probability of the cause, such as disease, given the effect, such as a positive medical test for the disease
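
The disease-and-test example can be worked through with Bayes' theorem. The prevalence, sensitivity, and false-positive rate below are hypothetical numbers chosen only to illustrate the calculation.

```python
# Hypothetical inputs, chosen only for illustration.
prior = 0.01            # P(disease): prevalence in the population
sensitivity = 0.95      # P(positive | disease)
false_positive = 0.05   # P(positive | no disease)

# Total probability of a positive test, across sick and healthy people.
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem: P(disease | positive).
posterior = sensitivity * prior / p_positive

print(round(posterior, 3))
```

With these assumed inputs the posterior is far lower than the test's sensitivity, because the disease is rare and most positive results come from the much larger healthy group.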

Ranking

the showing of two variables. Used to show changes over time

Analyst

they are the worker bees of data analytics. They deal with data for businesses using the data life cycle

Hold Testing Data

think outside the box. Take the model that you built, run it through cross-validation, and apply it only once to the hold-out data to see if it stands up to the unusual

Clean Data

this is where you will spend 80% of your time. This is where you format the data

Validate the model

making sure the data is sound and can be justified is very important

Seaborn

to make things "sexy". It has built-in schemes and color palettes

Crashing

to speed up a plan in a project

Define goals

trying to understand what the "customer" wants and what they really need. Kick-off meetings

Content Listening

understand and retain the information provided and identify the main points of the message

Purpose of EDARP

understand the objective and detail our path. Drives buy-in from the org in support of the project. It helps drive the success of the project

Regression analysis

understandable questions to predict a single outcome based on multiple predictor variables

Business acumen

understanding business concepts, being business savvy and understanding how business works.

Explore the Data

understanding of the data. Using this data to answer the question

Serendipity

unexpected insights, untapped potential or values

Automated Code Review

use of analytics methods to inspect and review source code to detect bugs or security issues

Algebra

used to scale up a solution

What is supervised learning?

uses algorithms that "learn" from data entered by a person

Distribution

visualization that shows data often surrounding a central value

Matplotlib

Python plotting library that tries to make easy things easy and hard things possible. Makes data easy to read and visualize

Linear Algebra

works with matrices and vectors

