Data Analytics Journey
A software intermediary that allows two applications to talk to each other. In other words, it is the messenger that delivers your request to the provider that you are requesting it from and then delivers the response back to you (e.g., pay with PayPal).
API
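A minimal Python sketch of one application calling another's API over HTTP; the endpoint URL and the shape of the response are hypothetical placeholders, not from the source:

    # Minimal API call sketch using the requests library.
    # The URL, query parameter, and response fields below are hypothetical.
    import requests

    response = requests.get(
        "https://api.example.com/v1/orders",   # hypothetical endpoint
        params={"status": "shipped"},          # query parameters sent with the request
        timeout=10,
    )
    response.raise_for_status()                # raise an error if the provider rejected the request
    orders = response.json()                   # the provider's reply, parsed from JSON
    print(len(orders), "orders returned")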
It involves being able to listen to others with understanding and empathy.
Active Listening
Is the identification of rare items, events, or observations in a dataset which differ from the norm or raise suspicions. It can be used to detect fraud, intrusion, outliers, technical glitches, etc. in a dataset.
Anomaly Detection
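One simple way to flag anomalies is a z-score check. A Python sketch (the sample values and the threshold of 2 are illustrative assumptions, not part of the source):

    # Flag observations whose z-score exceeds a chosen threshold.
    import numpy as np

    data = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2])  # 42.0 is an obvious outlier
    z_scores = (data - data.mean()) / data.std()
    anomalies = data[np.abs(z_scores) > 2]     # threshold is an assumption; tune per dataset
    print(anomalies)                           # -> [42.]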
API stands for
Application Programming Interface
The development of smart machines capable of performing tasks that typically require human intelligence.
Artificial Intelligence
Relates the probability of observing the data given a hypothesis to the probability of the hypothesis given the observed data. It gives you the after-the-data (posterior) probability of a hypothesis as a function of the likelihood of the data: the probability of getting the data you found.
Bayes' Theorem
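A worked Bayes' theorem calculation in Python, with made-up illustrative probabilities:

    # Worked Bayes' theorem example with illustrative (made-up) numbers:
    # P(H | D) = P(D | H) * P(H) / P(D)
    prior = 0.01          # P(H): prior probability the hypothesis is true
    likelihood = 0.90     # P(D | H): probability of the observed data if H is true
    false_alarm = 0.05    # P(D | not H): probability of the data if H is false

    # Total probability of observing the data, P(D)
    evidence = likelihood * prior + false_alarm * (1 - prior)

    posterior = likelihood * prior / evidence  # after-the-data probability of H
    print(round(posterior, 3))                 # about 0.154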
Provides a concise summary of the quartiles of numerical data (i.e., the cut points that divide the data into 25% segments). This graph is also convenient for detecting outliers and skewness.
Boxplot
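A minimal matplotlib sketch of a boxplot (the sample values are made up):

    # Draw a boxplot to summarize quartiles and spot outliers.
    import matplotlib.pyplot as plt

    sample = [7, 8, 8, 9, 10, 10, 11, 12, 13, 30]  # made-up data; 30 will show as an outlier
    plt.boxplot(sample)
    plt.title("Boxplot of sample values")
    plt.ylabel("Value")
    plt.show()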
An analyst defines the major questions of interest that need to be answered, determines the needs of the stakeholders, and assesses the resource constraints of the project. Project outcomes are defined. Which phase of the data analytics life cycle is this?
Business Understanding/Discovery Phase
Scope Statement, Stakeholder Register, Gantt Chart, and Network Diagram are all tools used in which phase of the data analytics life cycle?
Business Understanding/Discovery Phase
A technique in which the analyst wants to assign an item to a specific category based on various conditions.
Classification
Lack of ______________ on stakeholders, timeline, limitations, and budget could potentially derail an analysis.
Clear focus
Groups are unknown and the analyst wishes to determine whether the objects belong to any group. An example is when data on search queries are analyzed to determine whether they group a particular way and how many groups exist. Examples include genome patterns, Google News, and point cloud processing.
Clustering
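A minimal clustering sketch using scikit-learn's k-means (the points and the choice of two clusters are illustrative assumptions):

    # Group unlabeled points into clusters with k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1, 2], [1, 4], [1, 0],
                       [10, 2], [10, 4], [10, 0]])  # made-up 2-D data with two obvious groups
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(model.labels_)           # cluster assignment for each point, e.g. [1 1 1 0 0 0]
    print(model.cluster_centers_)  # coordinates of the two cluster centers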
Means creating meaningful dialog together that focuses on the problem, opportunity, and solution. Participants can use diagrams, charts, and visuals. Its strategy aims at bringing together different groups of people and third parties to assist with a project or product development (example tools could be Google Docs, Slack, Microsoft Teams, etc.).
Co-creation
___________ is, in the context of a data framework, not being ethical or compromising an analysis to allow it to lean toward favorable results.
Conflict of Interest
A delay in the ______ activities could delay the project.
Critical Path
The longest path of activities on a project, or the minimum time necessary to complete all project work.
Critical Path
Is a JavaScript library for manipulating documents based on data. Helps bring data to life using HTML, SVG and CSS.
D3.js (Data-Driven Documents)
Collecting data phase. Data is collected and stored for easy retrieval from a database, perhaps a component of a data warehouse, by using a language like SQL. Can use web scraping and surveys to acquire data. Which phase of the data analytics life cycle is this?
Data Acquisition
SQL, Web Scraping Software, Surveys, Input Data (Self-Generated Data), and NoSQL (used to collect unstructured data) are all tools used in which phase of the data analytics life cycle?
Data Acquisition Phase
The role in the workplace in a data analytic project that obtains and cleans data, displays data in reports, and searches for trends and outliers.
Data Analyst
Also known as data cleansing, data wrangling, data munging, and feature engineering. Analysts will use SQL, Python, R, or Excel to perform data modifications and transformations. Which phase of the data analytics life cycle is this?
Data Cleaning
Python, R, SQL, Excel are all tools used in which phase of the data analytics life cycle?
Data Cleaning Phase
The analyst begins to understand the basic nature of the data, the relationships within it (between data variables), the structure of the dataset, the presence of outliers, and the distribution of data values. This phase uses data visualization tools and numerical summaries such as measures of central tendency and variability. Which phase of the data analytics life cycle is this?
Data Exploration
Distributions (normal or skewed curves), visualization tools (Tableau, R, Python, RStudio, and histograms), and statistical tools (such as mean, median, and mode) are all tools used in which phase of the data analytics life cycle?
Data Exploration
Looks for patterns in large sets of data. Tools are Python and R. Also called machine learning, a specialized segment of data mining techniques that continually updates to improve modeling over time. Which phase of the data analytics life cycle is this?
Data Mining Phase
Simply reducing the amount or volume of data in storage or a database. One of the goals is to optimize storage capacity.
Data Reduction
Dashboards, Tableau, Storytelling (a feature of Tableau), graphs, charts, images, histograms, etc. are all tools used in which phase of the data analytics life cycle?
Data Reporting
The analyst tells the story of the data and uses graphs or interactive dashboards to inform others of findings from analyses. Tools such as Tableau are used to spot trends and patterns. The goal is to give actionable insight to stakeholders. Which phase of the data analytics life cycle is this?
Data Reporting Phase
Tree-like model of alternative decisions and their consequences. It is a whole series, a sequence of binary decisions based on your data, that combine to predict an outcome. It branches out from one decision to the next.
Decision trees.
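A minimal decision tree sketch with scikit-learn (the training data is made up):

    # A tiny decision tree: a sequence of binary splits that predicts a class label.
    from sklearn.tree import DecisionTreeClassifier

    # Made-up training data: [hours_studied, hours_slept] -> passed (1) or failed (0)
    X = [[1, 4], [2, 5], [8, 7], [9, 8], [3, 3], [10, 6]]
    y = [0, 0, 1, 1, 0, 1]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict([[7, 7]]))  # -> [1], reached by branching through the learned splits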
Breaking trend over time into components; its procedures are used in time series to describe the reasons for variations in trend.
Decomposition
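A sketch of decomposition using statsmodels' seasonal_decompose on a constructed monthly series (the upward trend and 12-month seasonal pattern are built in for illustration):

    # Split a time series into trend, seasonal, and residual components.
    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    index = pd.date_range("2020-01-01", periods=48, freq="MS")
    values = np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12)
    series = pd.Series(values, index=index)

    result = seasonal_decompose(series, model="additive", period=12)
    print(result.trend.dropna().head())   # the underlying trend component
    print(result.seasonal.head(12))       # the repeating seasonal component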
Uses neural networks capable of performing tasks such as text classification; one variant, the recurrent neural network (RNN), works best on sequential data.
Deep Learning
The ability for information in digital format to be accessible to the average end user. One of the goals is to allow non-specialists to access data without specialized technical requirements. It means that everyone should have access to the data and there isn't a gatekeeper that can create a bottleneck to the data.
Democratization
The interpretation of historical data to better explain market developments. Which type of analytics is this?
Descriptive
Which type of analytics asks the question, "What happened?"
Descriptive
It enables the extraction of value from data by posing the right questions and conducting in-depth investigations into the problems. Which type of analytics is this?
Diagnostic
Which type of analytics asks the question, "Why did it happen?"
Diagnostic
Reduces the number of variables and the amount of data. You will deal with a single score and not multiple scores or a lot of data. It uses techniques such as Principal Component Analysis (PCA), Factor Analysis, and feature selection.
Dimensionality reduction
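A minimal PCA sketch with scikit-learn (the data and the choice to keep two components are illustrative):

    # Reduce several correlated variables to a smaller number of principal components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100), rng.normal(size=100)])

    pca = PCA(n_components=2).fit(data)    # keep 2 components instead of 3 original variables
    reduced = pca.transform(data)
    print(reduced.shape)                   # (100, 2)
    print(pca.explained_variance_ratio_)   # share of variance each component retains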
Is a type of data integration that is used to blend data from several sources. It's often used to build a data warehouse.
ETL
Another version of ETL; it tends to load anything and everything into a warehouse or a data lake, from where it can be analyzed at a later point in time.
ETLTL
Persuasion, verbal communication, non-verbal communication, active listening, problem-solving, and decision-making are all examples of:
Effective interpersonal communication skills.
XML stands for
Extensible Markup Language
ETLTL stands for
Extract, Transform, Load, Transform, and Load.
ETL stands for
Extract, Transform, and Load.
How does one define research questions within an organization?
Formulate questions that align with the organizational needs.
A colorful graph that can visually show frequency or interaction using a range of colors (red is used for the highest frequency, blue for the lowest).
Heatmap
Algorithm that groups similar objects into groups that are called clusters.
Hierarchical Clustering.
A simple and commonly used plot to quickly check the distribution of a sample set. The data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is retained. It helps show outliers in data and skewness.
Histogram
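A minimal histogram sketch with matplotlib (the sample and the bin count of 20 are arbitrary choices for illustration):

    # Sort sample values into bins and plot the counts to check the distribution's shape.
    import numpy as np
    import matplotlib.pyplot as plt

    sample = np.random.default_rng(0).normal(loc=50, scale=10, size=500)  # made-up sample
    plt.hist(sample, bins=20)     # 20 bins is an arbitrary choice for this sketch
    plt.xlabel("Value")
    plt.ylabel("Count")
    plt.title("Histogram of sample values")
    plt.show()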
The ___________ shows in graphical form the project constraints of Time, Cost, and Scope. Quality is a central theme, which is at the midpoint. If you make a change to one constraint, the other two need to be adjusted accordingly; otherwise, quality will suffer.
Iron Triangle
A lightweight format for storing and transporting data on networks. Also an open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types.
JSON
JSON stands for
JavaScript Object Notation
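A minimal sketch of writing and reading JSON with Python's built-in json module (the record is made up):

    # Store and transport a data object as JSON text (attribute-value pairs and arrays).
    import json

    record = {"order_id": 1042, "items": ["widget", "gadget"], "paid": True}  # made-up record

    text = json.dumps(record)     # serialize to a human-readable JSON string
    print(text)                   # {"order_id": 1042, "items": ["widget", "gadget"], "paid": true}

    restored = json.loads(text)   # parse the string back into a Python object
    print(restored["items"][0])   # widget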
Is an array of services that provide machine learning tools as part of cloud computing services. Helps clients benefit from machine learning without the associated cost, time, and risk of establishing an in-house machine learning team.
MLaaS (Machine Learning as a Service)
Involves using algorithms and statistical models to analyze and draw inferences from patterns in data. Focuses on the development of computer programs that can access data and use it to learn for themselves.
Machine Learning
MLaaS stands for
Machine Learning as a Service
Named after Thomas Bayes, it is an algorithm that applies Bayes' theorem to estimate the conditional probability of an outcome.
Naive Bayes
Algorithms that mimic the operations of the human brain to recognize relationships between vast amounts of data. They are modeled roughly after the neurons inside the biological brain: on-and-off switches that relate to each other, taking very basic pieces of information and connecting them with many other nodes to make very high-level cognitive decisions and classifications.
Neural Networks
A symmetrical curve centered around the mean. Its data follows the empirical rule, which indicates the percentage of the data set that falls within (plus or minus) 1, 2, and 3 standard deviations of the mean (roughly 68%, 95%, and 99.7%, respectively).
Normal Distribution (bell-shaped)
Finding the best value for one or more target variables given certain constraints. It shows what value a variable should have, given certain conditions or restraints.
Optimization Analysis
Organizations responsible for carrying out specific project activities in a manner and scope indicated in an application form.
Partners
It uses data, statistical algorithms, and machine learning techniques to determine the likelihood of potential outcomes. The aim is to have the best assessment of what will happen in the future, rather than simply understanding what has happened. Which type of analytics is this?
Predictive
Which type of analytics asks the question, "What will happen?"
Predictive
Python and R are the only tools used in which 2 phases of the data analytics life cycle?
Predictive Modeling and Data Mining
It helps organizations make decisions. Which type of analytics is this?
Prescriptive
Predictive analytics uses collected data to come up with possible future outcomes, and _______________ analytics takes that data and makes decisions that cause future outcomes.
Prescriptive
Predictive and ______________ analytics are two forward-looking tools used by business leaders.
Prescriptive
Which type of analytics asks the question, "How can we make it happen?"
Prescriptive
The role in the workplace in a data analytic project that coordinates and manages the triple constraints, and gets the data/reports out to the organization.
Project Manager
The role in the workplace in a data analytic project that provides direction
Project Manager
The role in the workplace in a data analytic project that provides funds:
Project Sponsor
What are the implications of undefined outcomes of potential data analytics projects?
The project will not be aligned with organizational needs.
A production-ready language with capacity to be a single tool that integrates with every part of your workflow.
Python
Any piece of functionality is always written the same way with:
Python
Coding and debugging are easy because of the simple syntax.
Python
Easier for people with a software engineering background.
Python
Open-source, general-purpose programming language. It provides a more general approach and has several libraries that are useful to data science. Used by engineers and programmers.
Python
The indentation of code affects its meaning.
Python
Used by programmers that want to delve into data analysis or apply statistical techniques, and by developers and programmers that turn to data science.
Python
End-to-end platform which includes data integration.
Qlik
Known as nominal or ordinal. Describes the basic features of the data in a study.
Qualitative
Known as numerical, parametric, or interval data.
Quantitative
Easier for people with no coding experience.
R
Open-source programming language with new libraries or tools added continuously. It is mainly used for statistical analysis. Used by statisticians, educational researchers, etc.
R
Statistical models can be written with only a few lines.
R
The indentation of code does not affect its meaning.
R
The same piece of functionality can be written in several ways.
R
Used by statisticians, engineers, and scientists without computer programming skills. It's popular in academia, finance, pharmaceuticals, media, and marketing.
R
Used primarily in academics and research and is great for exploratory data analysis. In recent years, enterprise usage has rapidly expanded.
R
Is a technique that allows us to predict an outcome based on a set of predictor variables. It is like providing output given a set of inputs.
Regression
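A minimal regression sketch with scikit-learn (the square-footage and price values are made up):

    # Predict an outcome from a predictor variable with a simple linear regression.
    from sklearn.linear_model import LinearRegression

    # Made-up data: square footage -> house price (in thousands)
    X = [[800], [1000], [1200], [1500], [1800]]
    y = [150, 180, 210, 260, 310]

    model = LinearRegression().fit(X, y)
    print(model.predict([[1300]]))       # predicted price for a 1300 sq ft house
    print(model.coef_, model.intercept_) # slope and intercept of the fitted line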
A collection of data items with predefined relationships between them (e.g., collection of tables).
Relational database
The role in the workplace in a data analytic project that pushes the team to ask interesting questions and identifies key problems.
Researcher
A domain specific language used in programming and designed for managing data in relational database management systems. Helps pull data from databases.
SQL
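A minimal sketch of pulling data with a SQL query, using Python's built-in sqlite3 and a throwaway in-memory table (the table and values are made up):

    # Pull data from a relational database with a SQL query.
    import sqlite3

    conn = sqlite3.connect(":memory:")   # in-memory database just for the example
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("East", 120.0), ("West", 95.5), ("East", 80.0)])

    # The SELECT statement groups and aggregates rows before returning them
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    print(rows)                          # [('East', 200.0), ('West', 95.5)]
    conn.close()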
A two dimensional graph which is great to visualize correlation or relationships. Each dot on the graph represents an outcome for two numerical variables of interest.
Scatterplot.
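A minimal scatterplot sketch with matplotlib (the paired values are made up):

    # Plot two numerical variables against each other to inspect their relationship.
    import matplotlib.pyplot as plt

    hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]          # made-up paired observations
    exam_score = [52, 55, 61, 64, 70, 74, 79, 85]

    plt.scatter(hours_studied, exam_score)            # each dot is one outcome for both variables
    plt.xlabel("Hours studied")
    plt.ylabel("Exam score")
    plt.title("Scatterplot showing a positive relationship")
    plt.show()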
Program that searches for and identifies items in a database that correspond to keywords or characters specified by the user.
Search Engine
Loosely organized data placed in categories using tags (emails, CSV, XML, JSON docs, etc.).
Semi-structured
A graph that shows, in the frequency domain, the information carried by the autocovariance in the time domain.
Spectral density
People who have an interest/power in any decision or activity of the project/organization. They could be involved in project plan development, change control boards, requirements gathering, risk management, and/or advocacy.
Stakeholders
By skipping the data exploration phase, the analyst will lack insight into the:
Structure of the data set
Type of data that is numbered and labeled; stored in an organized framework with columns and rows (e.g. in relational databases)?
Structured
Machine Learning algorithm that learns on a labeled dataset, providing an answer key that the algorithm can use to evaluate its accuracy on training data (e.g., classification and regression).
Supervised Model
Is a visual analytics engine that makes it easier to create interactive visual analytics in the form of dashboards.
Tableau
_______ data set is used to validate the model built.
Test (or validation)
Allows the analyst to move beyond describing the data to creating models that enable predictions of outcomes of interest. Python and R are used in automating the training and use of models. Which phase of the data analytics life cycle is this?
The Predictive Modeling Phase
The role in the workplace in a data analytic project that is the ninja that knows everything.
The Unicorn
A statistical tool that deals with a sequence of data in chronological order. A technique that looks for trends in data over time. It also involves separating data into an overall trend and other components.
Time Series
The __________ data set is used to build the model. Data points in the __________ set are excluded from the test (validation) set.
Training
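A minimal sketch of a training/test split with scikit-learn's train_test_split (the data and the 25% test size are illustrative assumptions):

    # Build the model on the training set and check it on the held-out test set.
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Made-up data: [feature1, feature2] -> class label
    X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [0, 1], [5, 4]]
    y = [0, 0, 0, 1, 1, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    model = LogisticRegression().fit(X_train, y_train)  # trained only on the training set
    print(model.score(X_test, y_test))                  # accuracy on unseen test points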
A regression analysis of a value as a function of time. Understanding how and why things have changed over time (e.g., stock prices). Involves figuring out the path your data is on.
Trend Analysis
The five attributes used to determine quality in data are:
Uniqueness, relevance, reliability, validity, and accuracy.
Text-heavy information that isn't organized in a clearly defined framework (texts, videos, audio, etc.).
Unstructured
Provides unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own. Example: clustering, anomaly detection, neural network.
Unsupervised Model
Outliers not dealt with can cause problems with statistical models due to excessive ___________.
Variability
What is the most effective way of virtual communication?
Video Conferencing
A set of codes, or tags, that describes the text in a digital document.
XML
Co-creation is
collaboration.
Programming languages are
compiled
Programming uses a __________ to convert the language to machine language.
compiler
Scripting languages are:
interpreted
Scripting uses an ______________ (like PowerShell) to convert the language to machine language.
interpreter
What decisions are necessary to initiate a data analytics project?
knowing the goals of the organization, resource availability, stakeholders, and the outcome of the project.
The general approach that the classification model uses is to
locate, compare, assign.
The _____ is the portion of the bell curve distribution having many occurrences far from the central part of the distribution.
long tail