D204 The Data Analytics Journey
In which phase of the data analytics life cycle does an analyst build a histogram?
Data Exploration
Applying Phase
1. Present Model 2. Deploy Model 3. Revisit the Model 4. Archive Assets
Components of EDARP
1. Define the questions 2. Method to answer the questions 3. Wrangling 4. Budget 5. Stakeholders 6. Deployment 7. Maintenance and Delivery
3 ways to list a project
1. Hold a meeting 2. Categorize tasks with a Work Breakdown Structure (WBS) 3. Ask others
Mel's project will take an average of 24 weeks to complete, and worst case is 36 weeks. To be 90 percent safe, he should add some contingency. What contingency would you recommend that Mel take?
36 - 24 = 12 weeks; 12 weeks / 2 = 6 weeks of contingency
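The arithmetic on this card can be sketched as a small helper. The function name is hypothetical, and the rule (half the gap between worst case and average) is simply the one the card applies.

```python
def contingency_weeks(average_weeks: float, worst_case_weeks: float) -> float:
    """Half the gap between the worst case and the average, as on the card."""
    return (worst_case_weeks - average_weeks) / 2


# Mel's project: average 24 weeks, worst case 36 weeks.
mel_contingency = contingency_weeks(24, 36)
print(mel_contingency)  # 6.0
```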
How does one define research questions with an organization?
A solid understanding of the root question must be achieved. Working backwards from the desired output can help frame the right questions to ask.
Power and Interest Grid
A tool used to group stakeholders based on their level of authority (power) and their level of concern (interest) for project outcomes
What does Bayes' Theorem describe?
A way to compute the probability of an event given prior knowledge related to the event
Prescriptive
Analysis with the end goal of making a recommendation. Keyword: causation. Uses quasi-experiments and A/B testing to find the best action to bring about your goals. Focuses on cause-and-effect relationships in your data.
Turbine, Inc. is implementing a wind energy project. The key driver for the project is quality. What should the PM do with the key driver?
Add a safety margin to the key driver.
Infographics
Adobe Illustrator
Example of Machine Centric
Alexa, smart watch wearables
Example of quantitative data
Alice has 8 dogs (a count, so discrete data), and all the dogs weigh between 10 and 25 pounds (a measurement, so continuous data).
What is Interpretability?
Allows humans to understand the process by which algorithms process data so they can apply those principles to new situations.
SVG
Scalable Vector Graphics: an XML-based vector image format for two-dimensional graphics with support for interactivity and animation.
What is the most critical aspect of the critical path?
Any delay in the critical path delays the project finish date
SPSS
App for data analysis. It uses point and click graphical user interface.
What does API stand for?
Application Programming Interface
What is "A" in IRAC?
Application: How does a company/government apply the rules to the situation?
A data analytics project team is preparing to develop a predictive model that will be included within a business intelligence tool for upper management. Which step should be considered for inclusion when creating the project schedule?
Business intelligence tool interface training
Who are the end users?
Business managers, executives and even customers who are using the outputs of the research
What formats can be exported?
CSV, JSON
What is the key focus for prescriptive and the main question?
Causal/manipulate and How can we make it happen? How can we change it?
What happens in the data explorations phase?
Central tendency/measures of center (e.g., mean, median, mode), variability (e.g., standard deviations and quartiles), and distributions (e.g., normal, skewed). Identify basic correlations between variables. Pattern discovery.
What are the challenges of passive collection?
Check for shared meanings, need to check for limit cases, ensure adequate representation
Salience Model
Classifies stakeholders based on their power, urgency, and legitimacy
You must measure how positive the reception of the new lunch menu has been at your client's restaurant. What would be the best approach to perform this task?
Collect social media reviews of the restaurant and then apply opinion mining techniques to quantify the satisfaction expressed for the menu.
CSV file
Comma separated values file; a text file with one record per line, and the field of each record separated by commas
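A CSV file with one record per line can be read with Python's standard `csv` module. This is a minimal sketch; the column names and values are made up for illustration.

```python
import csv
import io

# A tiny in-memory CSV: a header line, then one record per line.
raw = "name,dogs\nAlice,8\nBob,2\n"

# DictReader maps each record's fields to the header names.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row["name"], row["dogs"])
```

Note that every field comes back as a string, so numeric columns need an explicit conversion such as `int(row["dogs"])`.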
You possess data from an online retailer. The data includes customer ages, whether they are foreign or domestic customers, and the dollar value of the items purchased. How would you determine if older customers spend more money on each purchase?
Compute the correlation coefficient between customer age and purchase amount for all customers.
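The correlation approach can be sketched with a hand-written Pearson coefficient. The ages and purchase amounts below are invented sample values, not data from the retailer in the question.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical sample: customer ages and dollar value per purchase.
ages = [22, 35, 47, 58, 63]
purchase_amounts = [20.0, 35.0, 41.0, 60.0, 58.0]

r = pearson_r(ages, purchase_amounts)
print(round(r, 2))  # close to +1: older customers spend more in this sample
```

A value near +1 suggests purchase amount rises with age, near -1 that it falls, and near 0 that there is no linear relationship.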
What is "C" in IRAC?
Conclusion: What comes out of the rule, and did it infringe on the situation?
After you develop a problem that you want to solve and perform cleaning on the data, what are you now ready to do?
Conduct exploratory data analysis.
A specific drug is manufactured for the treatment of depression. The company decides to ignore research results on an alternative, less expensive drug treatment in order to make a higher profit. Which ASA ethical standard has the company violated?
Conflict of Interest
Actionable insights
Controllable and Practical
What is the key focus for predictive and the main question?
Correlation and What will happen in the future?
Iron Triangle
Cost, Quality, Time
Another name for Predictive Modeling
Data Modeling Correlation based models Regression models Time series
Modeling Phase
Create model, Validate model, Evaluate model, Refine model
What happens in the Data Mining Phase?
Create training and testing datasets to build models from. Identify/detect patterns. Determine if groups (clusters) exist in the data. Classify data into groups. Create models that "learn" and improve (e.g., machine/deep learning, AI). Test hypotheses. Refine.
What is critical path?
The critical path is the longest (slowest) path to completing a project.
Active Listening
Cue into the verbal and nonverbal messages and the intent of what is being said
Another name for Reporting and Visualization
Dashboards
What are the 3 phases in Wrangling?
Data Acquisition, Data Cleaning and Data Exploration
Andre is part of a data science team. He is good at creating visualizations and reports. Which role on the team is best for him?
Data analyst
Data Science team
Data analysts, research lead and project manager. Their job is to create an interesting data model. Show trends in the model. Show a correlation between certain topics and the likelihood to be shared.
ETL
Data integration that refers to three steps Extract, Transform, Load. Used to blend data from multiple sources
An analyst realizes that the data set has been reduced significantly, resulting in sample sizes that are too small. In which phase of the data analytics life cycle did this likely occur?
Data mining
What happens in the discovery phase?
Define goals, organize resources, coordinate people, schedule the project
What is business understanding in the discovery phase?
Defines the major questions of interest that need to be answered, understands the needs of the stakeholders, and assesses the resource constraints of the project
What do open-source software tools and widely available analysis tools, such as spreadsheets, help accomplish?
Data democratization
What is the name for a chart that shows "branches" or cases splitting from one, giant cluster, to individual clusters?
Dendrogram
Which analytic method asks the question "What happened in the past?"
Descriptive analytics
What are the methods for modeling?
Descriptive, predictive, prescriptive
Why is it productive to aggregate models?
Different models tend to overestimate and underestimate their predictions, so the differences frequently cancel out
Which type of stakeholder mapping technique would best be used for managing change and communication?
Direction of influence. This technique groups stakeholders by their direction of influence, and can be particularly helpful for managing change and communications
Types of quantitative data
Discrete data: counts (1, 2, 3). Continuous data: things that can be measured.
Which characteristics are used to group together a cluster analysis?
Distance and Similarity
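Grouping by distance can be illustrated by assigning each point to its nearest cluster center. This is only a sketch of the assignment step, with made-up points and centers; it is not a full clustering algorithm.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def nearest_center(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    return min(range(len(centers)), key=lambda c: dist(point, centers[c]))


# Hypothetical 2-D observations and two assumed cluster centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]

labels = [nearest_center(p, centers) for p in points]
print(labels)  # [0, 0, 1, 1]
```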
Potential problem with Report/Visualization
Because of the potentially large audience, mistakes can cause bad business decisions and loss of revenue. Improper scales in graphs can push interpretations of the story that are inaccurate.
What is GDPR?
EU General Data Protection Regulation
What happens in the Predictive Modeling Phase?
Estimate/project future values or the likelihood of an event. Extend correlations found in EDA to mathematical models. Predict/determine output values based on input values. Cross-validate predictive models to ensure accuracy.
Critical Listening
Evaluate and judge with the intent of looking at the logic of what is being heard
Universal data tool
Excel spreadsheets, Google Sheets
What strategy will contribute to effective data representation and reporting?
Excluding unrelated data
A data analytics project manager has been asked to complete a project on a very short timeline. What action is likely to yield positive results?
Expand the team with experienced staff
What is the key focus for diagnostic and the main question?
Explained reason and Why did it happen?
Another name for data exploration
Exploratory Data Analysis (EDA) Descriptive Statistics
What does EDARP mean?
Exploratory Data Analysis Research Plan
Another name for Data Acquisition
Data extraction, data gathering, data querying, data collection, ETL (extract, transform, load), web scraping
In-house Data
Fastest way to start. Restrictions may not apply, talk with creator
Which concept should be considered when choosing variables for inclusion in a linear regression model?
Feasibility of controlling variables
What happens in the Data Cleaning phase?
Fixing improperly formatted values. Dealing with duplicates, missing data, and outliers. Data reduction.
Which technique can a project manager use to foster the identification of quality data analytics questions?
Frequent collaboration with the team
What happens in the Data Acquisition phase?
Gather/collect data from a variety of sources. Provide structure to data accessible via relational databases (SQL). Build a data pipeline (ETL). Use an API to download data from an external source.
A U.S company collects and sells information on consumers. Which law prevents the company from collecting information on European Union consumers without their permission?
General Data Protection Regulation
Python
General Purpose Programming Language
Wrangling Phase
Get data, clean data, explore data, refine data
Organize resources
Get the right tools. Software, hardware, staff.
Clustering
Grouping data by the similarities between observations; groups can be geographical. Measures distances in K-dimensional space. Data mining phase. Unsupervised learning.
What is the Venn Diagram for Data Science?
Hacking Skills, Math and Statistics, and Domain Expertise
Descriptive Analytics
Uses historical data to better understand the changes that have occurred in a business. Uncovers errors in data. Helps understand the distribution of variables. Uses mean, median, mode, and counts. Histograms and bell curves. Univariate descriptives.
Example of Human Accessible
Home Loan Approval, Credit Card Approval, etc.
Which statistical technique should be used to draw conclusions about an entire population based on a representative sample?
Hypothesis testing
Which task would an analyst consider first during the discovery phase of the data analytics lifecycle?
Identify the project goals
Example of Predictive
In what month did we sell the most cars?
What can be identified using a box plot?
Interquartile range
What is "I" in IRAC
Issue: Find the legal problem that is inherent in the situation.
Gantt chart
It comes after doing the post-it notes. Input into Excel it helps to see the project, time, cost.
Machine Learning as a Service (MLaaS)
Cloud-hosted machine learning, where the software, often with a drag-and-drop interface, is hosted on the same servers that store the data and house the processors. It democratizes data science.
What does API mean?
It is a way of sharing data. It can take data from one application to another or from a server to your computer.
Reinforcement Learning
It is an algorithm that is designed to reach a particular outcome, for instance running through the levels of a game.
What does "D3" mean?
It stands for Data Driven Documents. It helps visualize the data in a web browser.
An example of API
JSON
TwentySomething, Inc. is implementing a data mart server environment. James, the PM, realizes the key driver for the project is quality. What should James refrain from doing with the key driver?
Ignoring time and cost and placing all focus on quality as the key driver
JSON
JavaScript Object Notation
Collaboration tools
Jira, Slack, Teams and PivotalTracker
Classifying Methods
K-means, k-nearest neighbors. Binary classification. Many categories. Distance measures.
Semi-structured data
Key Characteristic: Loosely organized into categories using meta tags. Typical File Types: JSON, XML, Email, Web pages
Predictive
Keyword: correlation. Analyzes past trends and data to provide future insights. Models like regression and decision trees. What might happen in the future? Churn analysis, counting, mapping, visualization, forecasting.
Classification
Labeling discrete data. Supervised learning. Locate, compare, assign. Data mining phase.
Potential Problems with Business Understanding
Lack of clear focus on stakeholders, timeline, limitations and budget could potentially derail an analysis
What is a common method used for feature selection?
Lasso Regression
Prescriptive Examples
If we launch a new product in January, will it be as well received as if we launch in July or November?
What does GDPR mean?
Law about privacy
Statutory Law
Law passed by the U.S. Congress or state legislatures. For example, Congress passed the Genetic Information Nondiscrimination Act, otherwise known as GINA, to regulate how employers and insurance companies can, if at all, access our genetic information.
What is statutory law?
Laws created by representative bodies such as the U.S. Congress
Why do data and information systems come before laws?
Laws need to be driven by specific products, processes or events that already took place. Laws are created after something takes place to control parameters
Machine Learning Specialists
People who use math and computer science to advance AI
Your boss needs a quick way to view the project you are managing. You decide to use Excel and create a Gantt chart. What is the first step in creating a Gantt chart?
List the critical path task first
Self-generated data
Looping back: the computer engages with itself to create data for training the machine.
Another name for Data Mining
Machine learning, deep learning, AI (artificial intelligence), supervised/unsupervised models
What is the common duty of a data administrator?
Maintain data on the IT infrastructure
What are the downsides of in-house data?
May not be well documented or well maintained, or may not exist
How do you know if a distribution is negative skew or left skew?
The left tail of the distribution is longer than the right.
Schedule the project
Meet with the team and the customer to work out the last details on time and how long the project will take
What does the Critical Path represent in the project?
Minimum time to complete dependent tasks
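The idea that the critical path is the longest chain of dependent tasks, and therefore the minimum project duration, can be sketched as follows. The task names, durations, and dependencies are invented for illustration.

```python
from functools import lru_cache

# Hypothetical tasks: duration in weeks and the tasks each one depends on.
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
depends_on = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


@lru_cache(maxsize=None)
def earliest_finish(task):
    """A task can start only after all of its dependencies have finished."""
    start = max((earliest_finish(dep) for dep in depends_on[task]), default=0)
    return start + durations[task]


# The project cannot finish before its longest dependency chain does.
project_length = max(earliest_finish(task) for task in durations)
print(project_length)  # 8, via the chain A -> C -> D
```

Delaying any task on the chain A, C, D pushes out the finish date, while B has one week of float.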
Entrepreneurs
Need all the skills for business acumen
Scope Creep
New requirements added that increase the time/resources needed to complete the project.
What is unsupervised learning?
No human interventions. Enter data without labeling.
What are the two categorical data?
Nominal (names): red, green, blue. Ordinal (with an order): small, medium, large.
What is the key focus for descriptive and the main question?
Observation and what happened?
Open data
Open library of data. It is free.
Which type of data analysis is appropriate if the goal is to minimize the cost of a diet, using a data set consisting of the following variables: protein content, fat content, and cost per unit?
Optimization
What is a soft skill?
Persuasion, communication, emotional intelligence, active listening, logic and reasoning, interpersonal skills, negotiation
Data Acquisition Phase
Phase of collecting data. Data is retrieved from databases (e.g., via SQL), surveys, and web scraping. Techniques: ETL, API
Pathways to Data Analytics
Planning, Wrangling, Modeling, Applying
What are the mapping techniques for stakeholders?
Power and Interest Grid Salience Model Direction of Influence
What is the expected sales forecast for 2021?
Predictive Analytics
What will happen in the future? Churn Analysis
Predictive Modeling
During which phase in the data analytics life cycle would a churn analysis be performed?
Predictive analysis
Which analytic method asks the question "What might happen in the future?"
Predictive analytics
How can we increase our sales by 20% in 2021?
Prescriptive
Which analytics method asks the question "What should we do going forward?"
Prescriptive analytics
Present the model
Presentation of the findings
Data Scraping
Process of extracting data from formats that were not specifically designed for data sharing
Who clears organizational hurdles?
Project manager
Which party has the primary vision for a data analytics project and brings resources to complete it?
Project sponsors
What are the two purposes of the reporting phase of the data analytics life cycle?
Provide the conclusions from the analysis in an engaging manner, and provide actionable insights that can inform decision making
Meaning of POWER
Purpose, Outcome, What's in it for them, Engagement, Roles and responsibilities
Which tool has libraries that expand its visualization capabilities?
Python
What are the tools used?
Python R SQL Tableau
Potential Problems with Data Acquisition
Quality and type of data may make access more difficult
An analyst has been asked to analyze the open-ended responses from customers on a satisfaction survey. Which type of data is the analyst working with on this project?
Qualitative
What are some practical methods for identifying cause and effect relationships in your data?
Quasi-experiments
Measure of Variation
Range, quartiles, variance, or standard deviation
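These measures of variation can be computed with Python's standard `statistics` module. The sample values are made up, and the population versions of variance and standard deviation are used here as an assumption; `statistics.variance` and `statistics.stdev` give the sample versions.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

value_range = max(values) - min(values)        # largest minus smallest
variance = statistics.pvariance(values)        # mean squared deviation from the mean
std_dev = statistics.pstdev(values)            # square root of the variance
quartiles = statistics.quantiles(values, n=4)  # three cut points: Q1, Q2, Q3

print(value_range, variance, std_dev)  # 7 4 2.0
print(quartiles)
```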
What do tools like Slack and Teams include for users?
Real Time Instant Messaging
What is an example of an external stakeholder for a data analytics project?
Regulatory body
Outside people
Regulatory groups
What is "R" in IRAC?
Rule: What, if any, rules are relevant to the situation?
Potential problems with Data Mining
Running models on the entire data set is problematic; the data needs to be split into training and testing datasets to build models
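Splitting a data set into training and testing subsets can be sketched with the standard library. The function name, the 20% test fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten records
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

The model is fit on `train` only, and `test` is kept aside to check how well it generalizes.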
What happens in the Business Understanding phase?
Scope the project. Identify stakeholders and research questions/KPIs. Identify timeline, budget, and participants.
Which of the following models the relationship between a dependent variable and a single independent variable?
Simple Linear regression
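Simple linear regression with one independent variable can be sketched with the ordinary least-squares formulas. The data points are invented and lie exactly on the line y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Independent variable x and dependent variable y (hypothetical values).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```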
Potential problems with Data Exploration
Skipping these steps could enable faulty perceptions of the data which hurt advanced analytics
Potential Problems with Data Cleaning
Some cleaning techniques could dramatically change data/outcomes. Outliers not dealt with can cause problems with statistical models due to excessive variability.
SQL
Structured Query Language
What types of analyzing methods are used for descriptive analysis?
Summary Statistics, Clustering, and Regression
Data Engineers
System Architects. They help get/give permissions for the data
What software specializes in visualization, graphs, and charts for user interaction?
Tableau and PowerBI
What happens in the Reporting/Visualization phase?
Tell a story with data. Provide a summary of the analysis. Provide insights to stakeholders. Create insightful graphs that showcase trends and forecasts.
What will be a consequence of poor attention to detail during the data exploration phase?
The analyst will lack insight into the structure of the data set
Research lead
This is someone from the business side who pushes the team to ask interesting questions. Identifying key problems. They know the most about the business. Know topic categories. Drive questions
Source of EDARP
The goal is to create some form of value
What do vertical lines on a Gantt chart represent?
The limits of the allowable movement of floating tasks
What mistake is commonly made during the predictive analytics phase?
The model is developed before the research question is known
Overfitting
The process of fitting a model too closely to the training data for the model to be effective on other data.
The PM and the accountant are reviewing the financial information of a project. Which assumption should the accountant refrain from making?
The project is linear.
You are creating a list of people who are stakeholders for your project. Who would be the key stakeholder?
The project sponsor
For a project to have success, who must engage in the project work and outcomes?
The project stakeholders
How do you know if a distribution is positive skew or right skew?
The right tail of the distribution is longer than the left
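The direction of skew can be checked numerically with a moment-based skewness coefficient: positive for a longer right tail, negative for a longer left tail. The sample values are made up for illustration.

```python
def skewness(values):
    """Third central moment divided by the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 10]         # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]      # mirror image stretches the left tail

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```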
Numerical measurements of the amount of a toxic chemical substance are recorded in a large database. Which hypothesis can the data analyst answer through exploratory data analytics methods?
The statistical distribution of the chemical measurements is normal.
Mode
The value that occurs most frequently in a given data set.
Data scientists
They act as data analysts, create software, work with mathematics, know the business, ask interesting questions, and have hacking skills.
Managers
They oversee entire process of data analysts, data engineers, machine learning specialists and even researchers
Researchers
They use data to help them further studies
Unicorn
This is a full-stack data scientist who can do it all, and do it at absolute peak performance.
How would you use Bayes' theorem for spam detection?
To calculate the probability a message is spam if it contains certain flagged words
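The spam calculation follows directly from Bayes' theorem: P(spam | word) = P(word | spam) P(spam) / P(word). The sketch below uses made-up probabilities; real values would be estimated from labeled messages.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """P(spam | word) via Bayes' theorem for a single flagged word."""
    p_ham = 1 - p_spam
    # Total probability of seeing the word in any message.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Assumed numbers: the word appears in 60% of spam, 5% of non-spam,
# and 20% of all messages are spam.
p = posterior_spam(0.6, 0.05, 0.2)
print(round(p, 2))  # 0.75
```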
You are at the end of a very long project, and the last task to complete is a review. What is the benefit of performing a project review with your team?
To learn for next time
During Data Mining, why might an analyst resample a data set with replacement data?
Too little data for training and testing data sets
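Resampling with replacement (bootstrapping) can be sketched as drawing as many values as the original data set holds, allowing repeats. The observations and the seed are assumptions for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
observations = [3.1, 4.7, 5.0, 6.2, 7.9]
resample = bootstrap_sample(observations, rng)
print(resample)
```

Each resample has the same size as the original but may repeat some values and omit others, which lets a small data set be reused many times.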
Potential problems with Predictive Modeling
Too many input variables (predictors) can cause problems. Correlation does not imply causation. Time series models often need sufficient time data to offer precise trending. Predictive model accuracy should be assessed using cross-validation.
Jerome has organized his Post-it notes and is now estimating durations. What is a typical practice for estimating durations at this stage?
Using average times in weeks
What are effective interpersonal communications?
Verbal and Listening
Which aspect of data exploration occurs when an analyst writes code to compile a bar graph of dog food sales per month?
Verification through visualization
What elements make up "Big Data"?
Volume, Velocity, Variety
You have a data set containing four columns of data that describe an index number, a model number, weight class, and weight. Which column contains quantitative data?
Weight
Example of Prescriptive
What color and sizes of shirts should we sell to maximize profits?
Example of Descriptive Analytics
What has happened over the last 5 years?
Prescriptive modeling question
What types of donuts should we sell to maximize profits?
Enforce learning
When the project manager turns insight into something actionable
What happens in the data acquisition phase?
Wrangling: Get data, Clean data
Is Tableau easy to use?
Yes, it allows a novice user the ability to interact with the data and spot trends and patterns
C
You are running a cluster of servers taking in streaming data from cell towers. There's a bottleneck in the extract, transform, load (ETL) pipeline. What is the most likely issue? A. Users are overloading the servers with data queries. B. There are more servers than needed in the cluster. C. Load balancing is not evenly splitting the load among servers.
Geoplotlib
a Python toolbox for geographic visualizations
Time Series Analysis
a data metric over time. Ex. Area chart
Calculus
a function that describes the relationship between price and sales. Ex. to find the best price for maximizing revenue, you must first have a formula that says how sales are related to price
AI(Artificial Intelligence)
a replica of the human brain that can solve cognitive tasks. Programs that learn from data, as in machine learning.
Conflict of Interest
a situation in which an action by a company or individual results in an unfair benefit.
Bayesian analysis
a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that the average male height is between 70 and 80 inches or that the average female height is between 60 and 70 inches?
Hierarchical clustering
algorithm that groups objects into groups called clusters
Histograms
allows a way to graph numerical data in "groups" or bins that allow bars to represent frequencies. __________ are convenient graphs to show outliers in data and skewness (i.e., asymmetry of the data). For example, this is a __________ on "Years of Experience" which shows a right-sided (i.e., positive) skewness
Reporting and Visualization Phase
an analyst tells the story of the data and uses graphs and interactive dashboards to inform others of the findings
Data Flow
analyzes the flow of data through a system and the user experience
JASP
app for data analysis. It is free and open source. It is good for democratizing data
JAMOVI
app for data analysis. It is free and open source. Statistical spreadsheet software that aims to simplify two aspects of using R
Expert Systems
are algorithms that mimic decision making process of a human domain expert. Can spell out every step in a decision tree like a flow chart
Packages
are collections of code that give additional functionality to a programming language and simplify many common tasks.
C, and C++, and Java
are general-purpose languages that are used for the back end, the foundational elements of data science, and they provide maximum speed
Cross-validation
splits the data into groups (folds), builds the model on some groups, and tests it on the group that was held out
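The splitting step of k-fold cross-validation can be sketched as a generator of train/test index lists. The function name and the round-robin fold assignment are assumptions for illustration; fitting and scoring a model on each split is left out.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each fold is held out exactly once."""
    # Assign row indices to k folds in round-robin order.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(6, 3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Averaging the test score over all k splits gives a more stable accuracy estimate than a single train/test split.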
Bokeh
can create interactive charts and graphs. It allows you to create tools for the user so they can access and explore your visualizations.
Principal Component Analysis
combine multiple, correlated variables into a single component
Create the model
comes after refining the data. The analyst will figure out what type of model to use
Linear regression
common and powerful technique for combining many variables into an equation to predict a single outcome
Nominal comparison
comparing values of categories or sub-categories
Generative Adversarial Networks
computer creates one neural network that creates data and the other tests the data
Evaluate the model
consider this as double checking your data
Pygal
creates charts that are interactive SVGs with minimal lines of code
ggplot2
creates professional-looking plots with minimal user code
What is quantitative data?
data can be quantified, verified and measured. All values are numerical.
What is qualitative data?
data obtained by the researcher from first-hand observation, interviews, surveys, focus groups. All non-numerical sources
What is categorical data?
data that can be grouped into categories.
Tidy Data
data that is easily imported. One sheet per file, one level of observation, column=variable
Human accessible
decision making - Many algorithmic decisions are made automatically, and even implemented automatically. But they're designed such that humans can at least understand what happened in them. Such as, for instance, with an online mortgage application.
Human in the Loop
decision making is where advanced algorithms can make and even implement their own decisions, as with self-driving cars.
Data definitions
defines product, customer and process data
infographics (information graphics)
displays information graphically so it can be easily understood
Data relationships
drives the architecture which consists of the boundaries of how data relates to each other
Joyce is part of a data science team. She is very interested in running experiments. At what point would Joyce do this?
during the process of asking questions
Predictive Modeling Phase
enables an analyst to move beyond describing the data to creating models that predict outcomes of interest. Estimate/project values or the likelihood of an event. Cross-validate predictive models to ensure accuracy.
Looping back
engage themselves to create the data training machine
Correlation
examining the relationship of 2 or more numerical values
Deviation
explores how data points relate to each other. Easy to use graphs to examine
Factor Analysis
find the underlying common factor that gives rise to multiple indicators
Optimization Analysis
finding the best value for one or more target variables given certain constraints. It shows what value a variable should have, given certain constraints or restraints
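Finding the best value under constraints can be sketched with a brute-force search over a small diet problem. The foods, nutrient numbers, costs, and limits are all invented, and a real problem of this kind would normally use a linear programming solver rather than exhaustive search.

```python
from itertools import product

# Hypothetical per-unit protein (g), fat (g), and cost ($).
foods = {
    "beans": {"protein": 7, "fat": 1, "cost": 0.50},
    "cheese": {"protein": 6, "fat": 9, "cost": 0.90},
}
MIN_PROTEIN = 30  # constraint: at least this much protein
MAX_FAT = 20      # constraint: at most this much fat
MAX_UNITS = 10    # search limit per food

names = list(foods)
best_plan, best_cost = None, float("inf")
for quantities in product(range(MAX_UNITS + 1), repeat=len(names)):
    protein = sum(q * foods[n]["protein"] for q, n in zip(quantities, names))
    fat = sum(q * foods[n]["fat"] for q, n in zip(quantities, names))
    cost = sum(q * foods[n]["cost"] for q, n in zip(quantities, names))
    # Keep the cheapest plan that satisfies both constraints.
    if protein >= MIN_PROTEIN and fat <= MAX_FAT and cost < best_cost:
        best_plan, best_cost = dict(zip(names, quantities)), cost

print(best_plan, best_cost)  # {'beans': 5, 'cheese': 0} 2.5
```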
Explicit Rules
flow charts to show what method
Democratization Data
for everyone to have access to the information and platform
GIGO
garbage in, garbage out
Creating data
get your own data through natural observations, informal discussions, interviews, surveys
What is Constitutional law?
gives the government authority to act and restricts that authority to avoid infringement
Optimization analysts
goal seeking analysts
Implicit rules
help the algorithm function. They are the rules that are developed while analyzing the test data. They cannot be easily described to humans
Key Project Driver
helps to figure out what is the most important to the stakeholder
PCA vs. FA
in principal component analysis, the variables come first and the component results from it. In factor analysis, this hidden factor comes first and gives rise to the individual variables.
Data Ethics
institutional review board, informed consent, confidentiality
Tableau
interactive data exploration and visualization. Brings in data from many different sources and formats. Facilitates data integration. Can help anyone see and understand their data via visual dashboards
Qlik
interactive data exploration. Same as Tableau
Heat Maps
is a colorful graph that can visually show frequency or interaction using a range of colors. Example, a heatmap is applied to a webpage to show where customers typically hover their mouse or click/interact with the website.
Decision tree
is a decision support tool. Sequence of binary decisions based on your data, that can combine to predict an outcome. It branches from one decision to the next
Scatterplots
is a two dimensional graph which is great to visualize correlation or relationships.
What is quasi-experiments?
is a useful way to approximate cause and effect, as are what-if simulations and optimization models
Feature
is a variable or dimension in the data
IRAC
is our legal analysis tool to understand how to move from identifying a legal issue to reaching a conclusion and a decision about how to take action
Data privacy
is responsibly collecting, using and storing data about people, in line with the expectations of those people, your customers, regulations and laws
Data analysts
is the person who obtains and scrubs the data. They will display the data in graphs and reports
Factor analysis
is to find the underlying common factor that gives rise to multiple indicators.
Causation
is when there is a real-world explanation for WHY this is logically happening; it implies a cause and effect
Pandas
is a Python library that is great for data structures and as a data analysis tool
What is common law?
laws created by the courts, such as the rules on negligence
Dimensionality reduction
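reducing the number of variables by combining them into fewer composites, much like learning to read a language by combining letters into words. Errors in the individual variables tend to cancel out.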
JSON means
lightweight format for storing and transporting data. Open standard file format and data interchange format that uses human-readable text to store and transmit data objects. It is text only.
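The text-only, human-readable nature of JSON can be seen with Python's standard `json` module. The record below is made up for illustration.

```python
import json

# A hypothetical data object.
record = {"customer": "Alice", "dogs": 8, "weights_lb": [10.5, 25.0]}

text = json.dumps(record)    # serialize the object to plain JSON text
restored = json.loads(text)  # parse the text back into Python objects

print(text)
print(restored == record)  # True
```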
K-Dimensional Space
locate each data point, each observation, in a multidimensional space
Univariate Descriptive
look for one number that can represent the entire collection (for example, the mean)
Refine the model
make sure you are answering the question that has been provided
TensorFlow
makes it easy to do deep learning with neural networks
Autocorrelation
means that each point in time is influenced by the points that came before it.
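
One way to put a number on this is the lag-1 autocorrelation: how strongly each value tracks the value just before it. The sketch below uses a common textbook estimator; the two series are made-up examples.

```python
def autocorrelation(series, lag=1):
    """Correlation between a series and itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    # Total squared deviation from the mean (denominator).
    denom = sum((x - mean) ** 2 for x in series)
    # Products of deviations for points that are `lag` steps apart.
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # each point follows the last
alternating = [1, -1, 1, -1, 1, -1, 1, -1]  # each point flips the last

trend_r = autocorrelation(trend)              # positive
alternating_r = autocorrelation(alternating)  # negative
```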
The standard deviation
measures how spread out the values in the sample are around the mean
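
A quick illustration with Python's standard `statistics` module: two made-up samples share the same mean, but one is much more spread out, so its standard deviation is larger.

```python
import statistics

# Two hypothetical samples with the same mean (50).
tight = [49, 50, 50, 51]
wide = [10, 40, 60, 90]

tight_sd = statistics.stdev(tight)  # small: values sit close to the mean
wide_sd = statistics.stdev(wide)    # large: values sit far from the mean
```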
Unstructured data
most abundant type of data; covers everything from text and audio to images and videos. Does not fit neatly into a structured database
Python is
most popular language for data science and machine learning. It is very clean and easy to learn. Manages memory and large data sets better
Passive Collection
one-trial learning.
Functional managers
those who oversee a couple of data analyst teams
Project team
people who do the day-to-day work on the project. Managers, analysts, techs. May include people outside of the company
Project sponsor
people who help fund the project and see it through to the end
Subject Matter Experts (SMEs)
people with deep knowledge of a field. They have been in it for a long time and know the "ins" and "outs"
Regulators
people who set rules for business
Bayes' Theorem
gives the posterior probability as a function of the likelihood, the prior probability, and the probability of getting the data you found
Structured data
has a predefined data model and is simple to analyze
Anomaly Detection
process of identifying rare or unexpected items or events in a data set that do not conform to other items in the data set.
What do BAs need to understand and analyze the linkages between?
process, rules, and data.
R
programming language that was developed especially for work in data analysis. Statistical uses. Works with vectorized operations and non-standard evaluation
Privacy Torts
protect individuals' rights to keep certain things out of public view, even if they are true
Boxplot
provides a concise summary of the quartiles of numerical data (i.e., cut points that divide the data into four segments of 25% each). This graph is also convenient for detecting outliers and skewness.
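
A sketch of the numbers behind a boxplot, using the standard `statistics` module. The data set is made up, and the 1.5 x IQR fence is a common plotting convention assumed here, not something the card specifies.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 40]

# Three cut points that split the data into four 25% segments.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range and the conventional 1.5 x IQR outlier fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
```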
Get Data
pull the data from the database or data warehouse.
Which are examples of models used in predictive analytics?
regression and decision trees
What are the techniques used?
regression, decision trees, classification, clustering, neural networks, time series, PCA
BI (Business Intelligence)
relies on structured dashboards. Emphasizes speed, insight, and access
Levels of granularity
right level of details. Cannot be estimated, overlap, ongoing
Kendall and her team are preparing a list of risks and entering them in a risk assessment chart. How do they calculate the weighted factor?
risk factor x impact factor
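
The calculation can be sketched as a small risk assessment chart. The risks and their 1 to 5 scores below are hypothetical placeholders.

```python
# Hypothetical risks, each scored 1 (low) to 5 (high).
risks = [
    {"risk": "Key analyst leaves", "risk_factor": 2, "impact_factor": 5},
    {"risk": "Data source delayed", "risk_factor": 4, "impact_factor": 3},
    {"risk": "Scope creep", "risk_factor": 3, "impact_factor": 2},
]

# Weighted factor = risk factor x impact factor.
for entry in risks:
    entry["weighted_factor"] = entry["risk_factor"] * entry["impact_factor"]

# Highest weighted factor first, so the team knows where to focus.
ranked = sorted(risks, key=lambda entry: entry["weighted_factor"], reverse=True)
```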
XML
self-describing: the definition of the data is embedded with the data. When the data arrives, there is no need to prebuild a structure to store it
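
A small sketch of what self-describing means in practice, using Python's standard `xml.etree.ElementTree`. The document is a made-up example; the point is that the field names arrive with the values.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: tags describe the data they carry.
document = "<dog><name>Rex</name><weight unit='lb'>25</weight></dog>"

root = ET.fromstring(document)

# No schema was declared in advance; field names are read from the tags.
fields = {child.tag: child.text for child in root}
unit = root.find("weight").get("unit")
```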
Neural networks
series of algorithms that mimic the operations of a human brain to recognize relationships in vast amounts of data. They can pick up characteristics that are not readily visible to humans and that would not necessarily make sense to a human
Data Cleaning Phase
goes by several terms: data cleansing, data wrangling, data munging, feature engineering. Involves fixing improperly formatted values, data reduction, and dealing with duplicates, missing data, and outliers. Cannot skip this phase, or the results may become irrelevant
Part to whole
simply how a smaller subset compares with the larger whole
Refine the Data
sometimes you need to reshape the data. For example, you might not have enough data and have to go back to your sources to gather more information
Community groups
specific type of group of people looking for information from the data team
HTML
markup language for tagging text files to achieve font, color, graphic, and hyperlink effects
Decomposition
take a trend over time and break it down into several elements
Implicit Rules
rules that help algorithms function. They cannot be easily described to humans, but they are very effective
Principal Component Analysis (PCA)
takes multiple correlated variables and combines them into a single component score
Data Exploration Phase
the analyst begins to understand the basic nature of the data and the relationships within it. Covers pattern discovery, descriptive analytics, verification through visualization, central tendency, and distribution (normal, left-skewed, right-skewed). Relies on the use of data visualization tools
Business architecture
the business strategy, business capabilities, business knowledge, value streams and organizational views
Central Limit Theorem
the distribution of sample averages tends to be normal regardless of the shape of the process distribution
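
A simulation sketch of the theorem using only the standard library. The exponential distribution, the sample size of 50, and the 2,000 repetitions are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50   # observations per sample (arbitrary choice)
N_SAMPLES = 2000   # number of sample averages to collect (arbitrary choice)


def sample_mean(size):
    """Average of `size` draws from a strongly right-skewed distribution."""
    return statistics.mean(random.expovariate(1.0) for _ in range(size))


means = [sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

# The averages cluster around the process mean (1.0), with a spread close
# to sigma / sqrt(n) = 1 / sqrt(50), even though single draws are skewed.
center = statistics.mean(means)
spread = statistics.stdev(means)
```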
Median
the middle score in a distribution; half the scores are above it and half are below it
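
A brief example with the standard `statistics` module, using made-up scores, showing that the median stays in the middle of the data even when one extreme value pulls the mean upward.

```python
import statistics

# Hypothetical scores with one extreme value.
scores = [1, 3, 5, 7, 100]

middle = statistics.median(scores)  # half the scores above, half below
average = statistics.mean(scores)   # pulled upward by the 100
```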
What is posterior probability?
the probability of the cause, such as disease, given the effect, such as a positive medical test for the disease
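
The medical test example can be worked through with Bayes' theorem. The prevalence, sensitivity, and false positive rate below are assumed values chosen for illustration, not figures from the course.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Probability of a positive test overall: true positives + false positives.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Assumed inputs: 1% of people have the disease, the test catches 95% of
# true cases, and 5% of healthy people test positive anyway.
p_disease_given_positive = posterior(
    prior=0.01, sensitivity=0.95, false_positive_rate=0.05
)
```

Under these assumed numbers the posterior works out to roughly 16%, far lower than the test's 95% sensitivity, because the disease is rare to begin with.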
Ranking
the showing of two variables. Used to show changes over time
Analyst
they are the worker bees of data analytics. They deal with data for businesses using the data life cycle
Hold Testing Data
think outside the box. Take the model you built through cross-validation, then apply it only once to the holdout data to see if it stands up to the unusual
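
A rough sketch of the bookkeeping, assuming 100 rows, a 20% holdout, and 5 folds (all arbitrary choices). Model fitting is left as comments because the card does not name a specific model.

```python
import random

random.seed(42)  # fixed seed so the split is repeatable

rows = list(range(100))  # stand-in for row indices of a data set
random.shuffle(rows)

# Set the holdout data aside first and do not touch it during tuning.
holdout = rows[:20]
working = rows[20:]

# Cross-validation happens only on the working data.
K = 5
folds = [working[i::K] for i in range(K)]
for i, validation in enumerate(folds):
    training = [r for j, fold in enumerate(folds) if j != i for r in fold]
    # Fit the model on `training` and score it on `validation` here.

# After tuning is finished, score the final model on `holdout` exactly once.
```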
Clean Data
this is where you will spend 80% of your time. This is where you format the data
Validate the model
making sure the data is sound and the results can be justified is very important
Seaborn
Python visualization library used to make things "sexy". It has built-in schemes and color palettes
Crashing
to speed up a plan in a project
Define goals
trying to understand what the "customer" wants and what they really need. Kick-off meetings
Content Listening
understand and retain the information provided and identify the main points of the message
Purpose of EDARP
understand the objective and detail our path. Drives buy-in from the org in support of the project. It helps drive the success of the project
Regression analysis
understandable equations to predict a single outcome based on multiple predictor variables
Business acumen
understanding business concepts, being business savvy and understanding how business works.
Explore the Data
understanding of the data. Using this data to answer the question
Serendipity
unexpected insights, untapped potential or values
Automated Code Review
use of analytics methods to inspect and review source code to detect bugs or security issues
Algebra
used to scale up a solution
What is supervised learning?
uses algorithms that "learn" from labeled data provided by a person
Distribution
visualization that shows data often surrounding a central value
Matplotlib
Python plotting library that tries to make easy things easy and hard things possible. Makes data easy to read and visualize
Linear Algebra
works with matrices and vectors