D204
Predictive Modeling
Allows the analyst to move beyond describing the data to creating models that enable predictions of outcomes of interest. Python and R are used in automating the training and use of models.
Data Cleaning
Also known as data cleansing, data wrangling, data munging, and feature engineering. The analyst will use SQL, Python, R, or Excel to perform data modifications and transformations.
Business Understanding/Discovery phase
An analyst defines the major questions of interest that need to be answered, determines the needs of the stakeholders, and assesses the resource constraints of the project. Define project outcomes.
Data Mining
An in-depth step to discover patterns using automated methods such as machine learning.
Data Exploration
An initial step to uncover initial patterns using both manual and automated methods.
Data Exploration
Analyst begins to understand the basic nature of the data, the relationships within it (between data variables), the structure of the dataset, the presence of outliers, and the distribution of data values. This phase uses data visualization tools and numerical summaries such as measures of central tendency and variability.
Data Reporting
Analyst tells the story of the data and uses graphs or interactive dashboards to inform others of the findings from the analyses. Tools such as Tableau are used to spot trends and patterns. The goal is to give actionable insight to stakeholders.
Data Acquisition
Data collection phase. Data is collected and stored for easy retrieval from a database, perhaps a component of a data warehouse, using a language like SQL. Web scraping and surveys can also be used to acquire data.
Prescriptive Analytics
How can we make it happen? - Keywords: Change/Action/Solution/Causality/Manipulation/Decision Making. It helps organizations make decisions.
Formulate questions that align with the organizational needs.
How does one define research questions within an organization?
Data Sources
In-house, open data, web server, data lake, data warehouse, and self-generated are types of what?
Data Mining
Looks for patterns in large sets of data. Tools include Python and R. Closely related to machine learning, a specialized segment of data mining techniques that continually updates to improve modeling over time.
Video conferencing
The most effective way of virtual communication
Unicorn
The person that knows everything.
Project Managers
This person coordinates and manages the triple constraints, and gets the data/reports out to the organization.
Data analyst
This person obtains and cleans data, displays data in reports, and searches for trends and outliers.
Program Managers
This person provides direction
Project sponsor
This person provides funds, resources.
Researcher
This person pushes the team to ask interesting questions and identifies key problems
Iron Triangle
This shows in graphical form the project constraints of Time, Cost, Scope/Quality
Linear Regression
Used to predict the value of a variable based on the value of another variable. The variable to be predicted is called the dependent (target) variable, and the variable used to predict it is the independent variable.
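As a rough illustration of this card, here is a minimal least-squares sketch using only the Python standard library. The function name `fit_line` and the hours/scores data are hypothetical and are not part of the course material.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: hours studied (independent) vs. exam score (dependent)
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # predicted score for 6 hours of study
print(slope, intercept, predicted)
```

The independent variable goes in as `xs`, the dependent variable as `ys`, and the fitted line is then used to predict a value that was not in the data.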
Legal frameworks for data governance
What are the following: data privacy laws covering the collection and sharing of personally identifiable information (PII) (examples: GDPR in the EU, HIPAA in the US). Note that IRAC is not a privacy law: it is an acronym that generally stands for Issue, Rule, Application, and Conclusion, and it functions as a methodology for legal analysis. The IRAC format is mostly used in hypothetical questions in law school.
Project will not be aligned with organization needs.
What are the implications of undefined outcomes of potential data analytics projects?
Predictive and Prescriptive analytics
What are two forward-looking tools used by business leaders? ___________ analytics uses collected data to come up with future outcomes, and ____________ analytics takes that data and makes decisions that cause future outcomes.
Knowing the goals of an organization, resource availability, stakeholders, and the outcome(s) of the project.
What decisions are necessary to initiate a data analytics project?
Descriptive Analytics
What happened? - Observation/Describe event. It is the interpretation of historical data to better explain market developments.
Active Listening
What involves being able to listen to others with understanding and empathy.
Machine Learning
What involves using algorithms and statistical models to analyze and draw inferences from patterns in data; and focuses on the development of computer programs that can access data and use it to learn for themselves.
Quality
What is a central theme which is at the midpoint. If you break the Iron Triangle by making a change to one constraint, the other two need to be adjusted accordingly; otherwise this will suffer.
Relational Database
What is a collection of data items with predefined relationships between them e.g. collection of tables.
Supervised Model
What is a machine learning algorithm that learns on a labeled dataset, providing an answer key that the algorithm can use to evaluate its accuracy on training data. E.g. classification and regression.
Decision trees
What is a model of alternative decisions and their consequences. It is a whole series, a sequence of binary decisions based on your data, that can combine to predict an outcome. It branches out from one decision to the next.
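The "sequence of binary decisions" idea can be sketched as nested yes/no checks. This is a hand-written toy tree, not one learned from data, and the field names, thresholds, and labels are all hypothetical.

```python
def predict_churn(customer):
    """Walk a tiny hand-written decision tree; each `if` is one binary split."""
    if customer["months_active"] < 6:
        # Branch for newer customers
        if customer["support_tickets"] > 2:
            return "likely to churn"
        return "at risk"
    # Branch for longer-term customers
    if customer["monthly_spend"] < 20:
        return "at risk"
    return "likely to stay"

new_unhappy = {"months_active": 3, "support_tickets": 5, "monthly_spend": 50}
loyal = {"months_active": 24, "support_tickets": 0, "monthly_spend": 80}
print(predict_churn(new_unhappy), "|", predict_churn(loyal))
```

In practice a library would choose the splits from training data; the structure of the resulting model is the same kind of branching shown here.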
Classification
What is a technique in which the analyst wants to assign an item to a specific category based on various conditions. The general approach that the model uses is to find the location of the item among measurements of interest, compare this item to items close by, then assign it to a group. Also used for object detection, spam detection, cancer detection, etc.
Regression
What is a technique that allows us to predict an outcome based on a set of predictor variables. It is like providing output given a set of inputs.
Deep Learning
What is a type of machine learning built on multi-layer neural networks, capable of tasks such as text classification. It includes recurrent neural networks (RNNs), which work best on sequential data.
Hierarchical Clustering
What is an algorithm that groups similar objects into nested groups called clusters, building a hierarchy of clusters.
Neural Networks
What is an algorithm that mimics the operations of the human brain to recognize relationships in vast amounts of data. It is modeled roughly after the neurons inside a biological brain, which act like on and off switches that relate to each other. It takes very basic pieces of information, connects them through many other nodes, and builds up to very high-level cognitive decisions and classifications. Example: NN and NLP techniques can be used to analyze product reviews submitted by customers and identify positive and negative sentiments in those reviews.
Decomposition
What is breaking trends over time into components. Its procedures are used in time series to describe the reasons for variations in trend.
Trend analysis
What is defined as studying a value as a function of time. Understanding how and why things have changed over time. Ex. stock prices. In data analytics, this involves figuring out the path your data is on. It starts by plotting the points on a graph of changes over time, then connecting them with a line and trying to find a function for that line, such as the number of people who visit a site, movement, etc.
Time Series
What is defined as a statistical tool that deals with a sequence of data in chronological order. A technique that looks for trends in data over time. It also involves separating data into an overall trend. Examples include the daily log returns on a stock, monthly values of the consumer price index, or CPI, which is a measure of the national inflation rate.
Optimization Analysis
What is defined as finding the best value for one or more target variables given certain constraints. It shows what value a variable should have, given certain conditions or constraints.
Clustering
What is used when groupings are unknown, and the analyst wishes to determine if the objects belong to any group. An example is when data on search queries are analyzed to determine if they group in a particular way and how many groups exist. Examples include genome patterns, Google News, and point cloud processing.
Data Reduction
What is simply reducing the amount or volume of data in each storage or database. One of the goals is to optimize storage capacity.
Democratization
What is the ability for information in digital format to be accessible to the average end-user. One of the goals is to allow non-specialists to be able to access data without technical requirement. It means that everyone should have access to the data and there isn't a gatekeeper that can create a bottleneck to the data.
Artificial Intelligence
What is the development of smart machines capable of performing tasks that typically require human intelligence. Examples: visual perception, speech recognition, online cheque processing, decision-making, natural language processing (NLP), etc.
Anomaly Detection
What is the identification of rare items, events or observations in a dataset which differ from the norm or raise suspicions. It can be used to detect fraud, intrusion, outliers, technical glitch, etc. in a dataset. Tools include R, RStudio, Tableau, MS Excel, Editor, etc. Techniques include local outlier factor (LOF), alfa function, etc.
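A minimal sketch of the idea, using a simple z-score rule rather than the local outlier factor named on the card. The threshold of 2 standard deviations, the function name, and the order counts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; assumes values are not all equal
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily order counts with one suspicious spike
daily_orders = [10, 11, 9, 10, 12, 10, 11, 50]
flagged = zscore_outliers(daily_orders)
print(flagged)
```

A z-score rule is easy to explain but is itself distorted by extreme values, which is one reason more robust techniques exist.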
Critical path
What is the longest path of activities on a project, which gives the minimum time necessary to complete all project work. Delaying these activities could delay the project.
Bayes' theorem
What relates the probability of a hypothesis given the observed data to the probability of observing that data given the hypothesis. It gives you the posterior (after-the-data) probability of a hypothesis as a function of the likelihood of the data, the prior probability of the hypothesis, and the probability of getting the data you found.
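To make the "longest path" idea concrete, here is a small sketch that computes the earliest finish time of each task and walks back along the slowest chain. The task names, durations, and dependencies are hypothetical.

```python
from functools import lru_cache

# Hypothetical tasks: name -> (duration in days, prerequisite tasks)
tasks = {
    "collect": (3, []),
    "clean": (2, ["collect"]),
    "model": (4, ["clean"]),
    "dashboard": (3, ["clean"]),
    "report": (1, ["model", "dashboard"]),
}

@lru_cache(maxsize=None)
def earliest_finish(name):
    """Earliest day a task can finish: its duration plus its slowest prerequisite."""
    duration, prereqs = tasks[name]
    return duration + max((earliest_finish(p) for p in prereqs), default=0)

def critical_path():
    """Start at the last-finishing task and follow the slowest prerequisite back."""
    current = max(tasks, key=earliest_finish)
    path = [current]
    while tasks[current][1]:
        current = max(tasks[current][1], key=earliest_finish)
        path.append(current)
    return list(reversed(path))

project_length = max(earliest_finish(t) for t in tasks)
path = critical_path()
print(project_length, path)
```

In this toy plan, `dashboard` has one day of slack, so delaying it by a day does not move the finish date, while delaying any task on the returned path does.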
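A short numeric sketch may help here. It applies Bayes' theorem to a yes/no hypothesis; the function name, parameter names, and the fraud-screening numbers are all hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | D) for a binary hypothesis H, via Bayes' theorem.

    prior               = P(H)
    likelihood          = P(D | H)
    false_positive_rate = P(D | not H)
    """
    # Total probability of seeing the data: P(D) = P(D|H)P(H) + P(D|not H)P(not H)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical: 1% of transactions are fraudulent; an alert fires on 90% of
# fraudulent transactions and on 5% of legitimate ones.
p_fraud_given_alert = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p_fraud_given_alert, 3))
```

Even with a fairly accurate alert, the posterior stays well below 50% here because the prior (the base rate of fraud) is so low.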
Co-creation/collaboration
What means creating meaningful dialog together that focuses on the problem, opportunity, and solution and can use diagrams, charts, and visuals? Its strategy aims at bringing together different groups of people and third parties to assist with a project or product development. Examples of tools teams use are Microsoft Teams, Google Docs, and Slack.
Unsupervised Model
What provides unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own. Examples: clustering, anomaly detection, and some neural networks.
Dimensionality Reduction
What reduces the number of variables and the amount of data. You will deal with a single score and not multiple scores or a lot of data. It uses techniques such as Principal Component Analysis (PCA), Factor Analysis, & Feature Selection.
Conflict of Interest
What refers to not being ethical or compromising analysis to allow it to lean towards favorable results.
Effective interpersonal communication
What skills include persuasion, verbal communication, non-verbal communication, active listening, problem-solving, and decision-making.
Predictive Analytics
What will happen? - Correlation. Predicts what will happen in the future. It uses data, statistical algorithms, and machine learning techniques to determine the likelihood of potential outcomes. The aim is to have the best assessment of what will happen in the future, rather than simply understanding what has happened.
Bell curve with a long tail end
Which data analytics application/process has a portion of the distribution that has many occurrences far from the central part of the distribution. In sales, it may mean more people buying individualized niche products.
D3.js (Data-Driven Documents)
Which data analytics application/process is a JavaScript library for manipulating documents based on data. It helps bring data to life using HTML, SVG and CSS.
Heatmap
Which data analytics application/process is a colorful graph that can visually show frequency or interaction using a range of colors. Red is mostly used for the highest frequency, while blue is used for the lowest.
SQL
Which data analytics application/process is a domain specific language used in programming and designed for managing data in relational database management systems. Helps pull data from databases.
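As an illustration of pulling data from a relational database, the sketch below runs SQL through Python's built-in `sqlite3` module against an in-memory database. The table names, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 15.0), (3, 2, 40.0)],
)

# Join the two related tables and total the order amounts per customer
totals = conn.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
conn.close()
print(totals)
```

The `JOIN` relies on the predefined relationship between the two tables (`orders.customer_id` pointing at `customers.id`), which is what makes the database relational.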
JSON (JavaScript Object Notation)
Which data analytics application/process is a lightweight format for storing and transporting data on networks. Also an open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute-value pairs and array data types.
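A quick sketch of the attribute-value pairs and arrays described on this card, using Python's built-in `json` module. The record contents are hypothetical.

```python
import json

# A data object made of attribute-value pairs, one of which holds an array
record = {"customer_id": 42, "name": "Ada", "tags": ["new", "email"]}

text = json.dumps(record)    # serialize to human-readable text for storage or transport
restored = json.loads(text)  # parse the text back into Python objects

print(text)
print(restored["tags"][0])
```

The round trip returns an object equal to the original, which is what makes the format convenient for moving data between systems.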
Search Engine
Which data analytics application/process is a program that searches for and identifies items in a database that correspond to keywords or characters specified by the user.
Histogram
Which data analytics application/process is a simple and commonly used plot to quickly check the distribution of a sample of data. In the histogram, the data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is retained. It helps show outliers in data and skewness.
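The binning step described on this card can be sketched in a few lines. This version uses equal-width bins and places the maximum value in the last bin; the function name and the sample data are hypothetical.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes the values are not all identical
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
counts = histogram_counts(data, bins=3)

# Crude text rendering: one '#' per observation in each bin
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

Most values pile up in the first bin while a single value sits alone in the last one, which is how a histogram hints at skewness and outliers.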
Application Programming Interface (API)
Which data analytics application/process is a software intermediary that allows two applications to talk to each other. In other words, it is the messenger that delivers your request to the provider that you are requesting it from and then delivers the response back to you e.g. pay with PayPal, SQL.
Normal Distribution (Bell-Shaped)
Which data analytics application/process is a symmetrical curve centered around the mean. Its data follows the empirical rule, which indicates the percentage of the data set that falls within (plus or minus) 1, 2, and 3 standard deviations of the mean: roughly 68%, 95%, and 99.7%.
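The empirical-rule percentages can be checked numerically. This sketch uses the standard identity that the share of a normal distribution within k standard deviations of the mean is erf(k / sqrt(2)); the function name is hypothetical.

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
```

The results come out close to 68%, 95%, and 99.7%.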
Scatterplot
Which data analytics application/process is a two-dimensional graph which is great to visualize correlation or relationships. Each dot represents an outcome for two numerical variables of interest.
Extract, Transform, and Load (ETL)
Which data analytics application/process is a type of data integration that is used to blend data from several sources. It's often used to build a data warehouse.
Tableau
Which data analytics application/process is a visual analytics engine that makes it easier to create interactive visual analytics in the form of dashboards.
Machine Learning as a Service
Which data analytics application/process is an array of services that provide machine learning tools as part of cloud computing services. It helps clients benefit from machine learning without the associated cost, time, and risk of establishing an in-house team.
Qlik
Which data analytics application/process is an end-to-end platform which includes data integration.
R
Which data analytics application/process is an open-source programming language with new libraries or tools added continuously.
Training Data Set
Which data analytics application/process is implemented to build up a model. Data points in this set are excluded from the test set.
Python
Which data analytics application/process is an open-source general-purpose programming language. Python provides a more general approach and has several libraries that are useful to data science. Used by engineers and programmers.
Test (validation)
Which data analytics application/process is used to validate the model built.
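A minimal sketch of how a data set might be divided into the training and test sets described on these two cards. The 80/20 ratio, the fixed seed, and the function name are assumptions, not something the course prescribes.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then cut it into (train, test) with no overlap."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that simply follows the original ordering of the data, and keeping the two sets disjoint means the model is validated on observations it has not seen.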
Boxplot
Which data analytics application/process provides a concise summary of the quartiles of numerical data (i.e., cut points that divide the data into four segments of 25% each). This graph is also convenient for detecting outliers and skewness.
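The numbers behind a boxplot can be sketched with the standard library. The outlier rule used here (more than 1.5 times the interquartile range beyond the quartiles) is a common convention and an assumption on my part, and the sample data is hypothetical.

```python
from statistics import quantiles

data = [2, 3, 4, 5, 5, 6, 7, 8, 9, 40]

# Three cut points that divide the sorted data into four equal-sized groups
q1, median, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Conventional "fences": points beyond them are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(q1, median, q3, outliers)
```

Different tools compute quartiles with slightly different interpolation methods, so the exact cut points can vary a little between packages.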
ETLTL (extract, transform, load, transform, and load)
Which data analytics application/process tends to load anything and everything into a warehouse or a data lake from where it can be analyzed at a later point of time.
XML
Which data analytics language is a set of codes, or tags, that describes the text in a digital document.
Qualitative
Which data type is known as nominal or ordinal. Describes the basic features of the data in a study.
Quantitative
Which data type is known as numerical, parametric, or interval data.
Semi-Structured
Which data type is loosely organized in categories using tags. e.g. Emails, CSV, XML, JSON doc., etc.
Structured
Which data type is numbered, labeled, and stored in an organized framework with columns and rows, e.g. SQL databases, Excel, etc.
Unstructured
Which data type is text heavy, with information not organized in a clearly defined framework, e.g. text, videos, audio, etc.
Python
Which programming language can be described as one where the indentation of code affects its meaning. Any piece of functionality is usually written the same way.
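A small sketch of what "indentation affects meaning" looks like in practice. The two hypothetical functions below differ only in how far the `return` line is indented.

```python
def count_big(values, threshold):
    total = 0
    for v in values:
        if v > threshold:
            total += 1
    return total  # function level: runs once, after the loop has finished

def count_big_misindented(values, threshold):
    total = 0
    for v in values:
        if v > threshold:
            total += 1
        return total  # loop level: returns during the very first iteration

print(count_big([5, 1, 9], 2))
print(count_big_misindented([5, 1, 9], 2))
```

Both functions are valid Python, but the second stops after looking at a single item, so the same statements produce different results depending purely on indentation.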
R
Which programming language can be described as one where the indentation of code does not affect its meaning. The same piece of functionality can be written in several ways.
Python
Which programming language makes coding and debugging easy because of its simple syntax.
Python
Which programming language is a production-ready language with capacity to be a single tool that integrates with every part of your workflow!
R
Which programming language is easier for people with no coding experience. Statistical models can be written with only a few lines.
Python
Which programming language is easier for people with a software engineering background.
Python
Which programming language is used by programmers that want to delve into data analysis or apply statistical techniques, and by developers and programmers that turn to data science.
R
Which programming language is used by statisticians, engineers, and scientists without computer programming skills. It's popular in academia, finance, pharmaceuticals, media, and marketing.
R
Which programming language is used primarily in academics and research and is great for exploratory data analysis. In recent years, enterprise usage has rapidly expanded.
Stakeholders
Who are people who have an interest/power in any decision or activity of the project/organization. They could be involved in project plan development, the change control board, requirements gathering, risk management, and advocacy.
Partners
Who are the organizations responsible for carrying out specific project activities in the manner and scope indicated in an application form.
Third Parties
Who may include regulatory agencies/customers.
Diagnostic Analytics
Why did it happen? - Explains the reason for the event. It enables the extraction of value from data by posing the right questions and conducting in-depth investigations into the problems.
Programming
____________ languages are compiled and use a compiler to convert the language to machine language.
Scripting
____________ languages are interpreted and use an interpreter (like PowerShell) to convert the language to machine language.