The Data Analytics Journey D204
Meaning of E-T-L
Extraction, Transformation, Loading. **Look up when it is used and for what.** ETL is typically used to move data from operational source systems into a data warehouse or other analysis-ready store.
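As a hedged illustration only, a minimal ETL sketch with pandas; the file and column names are hypothetical:

```python
# Minimal ETL sketch with pandas. File and column names are hypothetical.
import pandas as pd

# Extract: pull raw data from a source (here, a CSV file)
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and reshape
clean = raw.dropna(subset=["amount"]).copy()     # drop rows missing the amount
clean["amount"] = clean["amount"].astype(float)  # normalize the data type

# Load: write to the destination (here another file; in practice,
# often a data warehouse)
clean.to_csv("sales_clean.csv", index=False)
```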
Which IS typically a good approach to the requirements process?
Facilitate activities that inspire discovery of what is not known. Expect requirements to change as stakeholders evolve their thinking about the future state. Understand and facilitate a learning process to get to the future state.
What is a key role the BA plays?
Facilitate the decision making process.
Facilitating meetings
Facilitating meetings is about making sure the meeting makes good use of everyone's time and meets its objective, where the objective is something everyone in the meeting cares about.
Predictive models
Find and use relevant past data. It doesn't have to be really old; it can be data from yesterday. But you always have to use data from the past, because that's the only data you can get. Then you model the outcome using any of many possible approaches.
Verbal communication and articulation
First, how we articulate the questions we ask. Second, how we articulate a summary of what we're hearing. And last, how we articulate general information: action items in meetings, instructions for facilitated activities, and our interpersonal articulation.
process to resolve conflict
First, separate the people from the problem. Next, make sure your relationship integrity remains. Then, listen. Listen with the intent to learn. And next, let everyone be heard. Agree on the problem and find the common goals and intentions. And last, explore options
Statutory law
For example: Congress passed The Genetic Information Nondiscrimination Act, otherwise known as GINA, to regulate how employers and insurance companies can, if at all, access our genetic information
Human-Accessible decisions
Many algorithmic decisions are made automatically, and even implemented automatically. But they're designed such that humans can at least understand what happened in them. Such as, for instance, with an online mortgage application.
EDARP in BUSINESS
**Driven by the needs of the organization.** May act as a sort of project charter between teams; deliverables may be pushed to the analyst from the organization; must balance interpretability against robustness of the outputs. Practical application to the business; generally a user of methods.
Meaning of IRAC
**Issue, Rule, Application, and Conclusion**: a traditional method for analyzing legal problems that arise in any context. It's also a time-tested, straightforward problem-solving tool that can be applied to other disciplines to help us work logically from issue identification to action.
skews
**Positively skewed distributions: most values sit at the low end with a long tail toward the high end, e.g., company valuations or house prices. Negatively skewed: most values sit at the high end with a trailing tail toward the low end, e.g., birth weight. U-shaped distribution: values pile up at both extremes, e.g., the reviews a polarizing movie gets.**
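To check skew numerically, a small sketch using scipy.stats.skew on made-up numbers:

```python
# Checking skew numerically -- a sketch on made-up data.
from scipy.stats import skew

house_prices = [100, 120, 130, 150, 160, 900]   # long right tail
birth_weights = [1.1, 3.0, 3.2, 3.3, 3.4, 3.5]  # long left tail

print(skew(house_prices))    # positive -> positively skewed
print(skew(birth_weights))   # negative -> negatively skewed
```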
What is the right question? [a] How can we increase our market share in the US? - [b] Which geographic market that we operate in has the most potential for us to increase our market share?
Although both a and b are focused on market share, b's question is framed more specifically so we can identify the data needed. For example, we'll need both competitor market shares and our market share for every location where we have a store.
APIs
An API, or Application Programming Interface, isn't a source of data; rather, it's a way of sharing data. It can take data from one application to another, commonly as JSON.
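A hedged sketch of pulling JSON from an API with Python's requests library; the URL is hypothetical, and real APIs usually also require authentication (an API key or token):

```python
# Hypothetical API call: fetch JSON and parse it into Python objects.
import requests

response = requests.get("https://api.example.com/v1/orders")
response.raise_for_status()   # fail loudly on HTTP errors
data = response.json()        # parse the JSON payload
print(type(data))
```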
expert system
An expert system is an approach to machine decision-making in which algorithms are designed that mimic the decision-making process of a human domain expert.
What are the most effective techniques for stimulating powerful conversation?
Ask high-impact questions and give everyone space to reflect. Asking high-impact questions is a powerful technique for deep and meaningful conversation, especially when everyone is given some time to think before anyone chimes in to start the dialog.
Open data
Basically, it's data that is free: it has no cost and it's free to use, so you can integrate it into your projects. Sources: number one is government data, number two is scientific data, and the third is data from social media and tech companies.
Change management
Change management is about how we support and guide those impacted by changes.
What is the right question? [Andreas] Do we have enough employees? - [Katie] Given the projects that we have scheduled for next quarter, do we have enough skilled personnel to fulfill our commitments?
Clearly Andreas hasn't quite got the hang of asking focused data questions. His question is still too vague and not actionable enough for a data analysis. Katie's question is well structured, so it's clear we'll need to know the size of the projects, anticipated deadlines, required skill sets, number of available employees, and current capacity.
Scraping data
Data scraping is, in a sense, the found art of data science. It's when you take the data that's around you, tables on pages and graphs in newspapers, and integrate that information into your data science work. Unlike the data that's available with API's or Application Programming Interfaces, which is specifically designed for sharing, Data scraping is for data that isn't necessarily created with that integration in mind.
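As a sketch only, scraping an HTML table with requests and BeautifulSoup; the URL and page structure are hypothetical, and the legal and ethical caveats discussed below still apply:

```python
# Hypothetical scraping sketch: pull the first table off a page.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/prices").text
soup = BeautifulSoup(html, "html.parser")

# Extract the text from every cell of the first table on the page
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    for tr in soup.find("table").find_all("tr")
]
print(rows[:3])
```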
Feature selection and creation
Dimension reduction's often used as a part of getting the data ready so you can then start looking at which features to include in the models you're creating.
Which is typically NOT a good approach to the requirements process?
Directly asking stakeholders what they want, or having them send it in writing to make sure you are not getting it wrong. Correct, this is not the best approach: stakeholders often struggle to articulate what they need, so more advanced techniques are better suited for this.
EDARP in RESEARCH
Driven by the researcher and the field of study, final output does not need to fit within business constraints, concerned with advancing a field of study, may not require discussion of budget constraints.
What are NOT effective techniques for stimulating powerful conversation?
Getting an email chain going to get everyone talking; asking a leader to weigh in first.
COMPONENTS of an EDARP
A good foundation for the work: what components to use, what methods to use, and what data we have access to.
LAW is territorial
It's confined to the territory or the place that created it. **Technology is NOT territorial.**
Upside to In-house data
It's the fastest way to start, and you may actually be able to talk with the people who gathered the data in the first place.
Classifying Methods
K-means, k-nearest neighbors; binary classification; many categories; distance measures. (Note: K-means is usually treated as a clustering method; see Cluster Analysis Methods.)
Why do data and information systems come before laws?
Laws are created after something takes place, to control parameters.
Classifying
Locate the case in a k-dimensional space, where k is the number of variables or different kinds of information that you have. There will probably be more than three; it might be hundreds or thousands. Once you have it located in that space, compare the labels on nearby data (assuming, of course, that the other data already has labels saying whether it's a photo of a cat, or a dog, or a building). Then assign the new case to the same category. LOCATE, COMPARE, ASSIGN.
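Locate, compare, assign maps directly onto k-nearest neighbors. A minimal scikit-learn sketch on made-up 2-D points and labels:

```python
# Locate, compare, assign with k-nearest neighbors on toy data.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]  # located in 2-D space
y = ["cat", "cat", "dog", "dog"]                      # existing labels

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Compare the new case to its nearest labeled neighbors, then assign
print(model.predict([[1.2, 0.95]]))   # -> ['cat']
```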
Dimensionality reduction Methods
Number one is principal component analysis, often just called principal components or PCA. And the idea here is that you take your multiple correlated variables and you combine them into a single component score. Another very common approach is factor analysis. And functionally, it works exactly the same way. People use it for the same thing, although the philosophy behind factor analysis is very different. Here your goal is to find the underlying common factor that gives rise to multiple indicators.
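A minimal PCA sketch with scikit-learn on two correlated toy variables, combining them into a single component score as described:

```python
# PCA sketch: two correlated variables collapse into one component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=100)])  # correlated pair

pca = PCA(n_components=1)
scores = pca.fit_transform(X)          # one component score per observation
print(pca.explained_variance_ratio_)   # nearly all variance in one component
```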
Algebra
Number one is that it allows you to scale up. The solution you create to a problem should deal efficiently with many instances at once. Basically create it once, run it many times. And the other one closely related to that is the ability to generalize. Your solution should not apply to just a few specific cases with what's called Magic Numbers, but to cases that vary in a wide range of arbitrary ways, so you want to prepare for as many contingencies as possible
predictive analytics
One of them is trying to predict future events: using presently available data to predict something that will happen later, such as using past medical records to predict future health. The other, possibly more common, use is using prediction to refer to alternative events, that is, approximating how a human would perform the same task. So you might have a machine do something like classifying photos, performing the task a person would otherwise do.
What is the meeting POWER approach?
POWER is an acronym that stands for purpose, outcomes, what's in it for them, engagement, and roles and responsibilities
Part to whole charts
Pie charts/bar charts
EDARP
Exploratory Data Analysis Research Plan.
People processes: Mentoring
Set expectations. As the mentor, what do you expect from your mentee? And as the mentee, what do you hope to get out of the relationship? Have clear goals.
EDARP Similarities for business and research
Share similar structure of the research plan. Goal is to create some form of value, guide the analyst through a complex and iterative process
measures that give you a numerical description of association
correlation coefficient or regression analysis
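For example, a correlation coefficient computed with NumPy on made-up numbers:

```python
# Pearson correlation coefficient on made-up data.
import numpy as np

hours = np.array([1, 2, 3, 4, 5])
score = np.array([52, 55, 61, 70, 74])

r = np.corrcoef(hours, score)[0, 1]
print(round(r, 3))   # close to 1: a strong positive association
```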
Validating models methods
Take your data and split it into two groups: training data and testing data. **Cross-validation**: you still hold out testing data, but the validation happens inside the training data. You take the training data and split it into several pieces, say six groups; then you use five at a time to build a model and the sixth group to test it, then rotate through a different set of five and verify against a different sixth of the data, and so on.
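A hedged sketch of the split-and-rotate idea with scikit-learn; the logistic-regression model and the built-in iris data are stand-ins:

```python
# Train/test split plus 6-fold cross-validation on the training data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Rotate through six folds: train on five, validate on the sixth
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=6)
print(scores.mean())
```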
Interpretability
The point of all this is that in your analysis, interpretability is critical. You're telling a story, and you need to be able to make sense of your findings so you can make reasonable and justifiable recommendations.
Data preparation Time
Data preparation takes about 80% of the time; everything else falls into the remaining 20%.
Why have software Standards?
These standards serve as a baseline for decision-making, help reduce uncertainty, and ultimately save time, as expectations are clearly understood by all
decision tree
This is a whole series, a sequence of binary decisions, based on your data, that can combine to predict an outcome. It's called a tree because it branches out from one decision to the next
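A small scikit-learn sketch on toy data that prints the fitted tree's branching rules:

```python
# Decision tree: a sequence of binary splits that predict an outcome.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]  # age, income (made up)
y = [0, 1, 1, 0]                                          # outcome to predict

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the branching rules
```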
simple Recommendations
This is where the algorithm processes your data and makes a recommendation, or suggestion to you and you can either take it or leave it. A few places where this approach shows up are things like, for instance, online shopping, where you have a recommendation engine that says "Based on your past purchase history, "you might want to look at this."
Bayes' theorem
What Bayes' Theorem does is it gives you the posterior or after-the-data probability of a hypothesis as a function of the likelihood of the data given the hypothesis, the prior probability of the hypothesis and the probability of getting the data you found.
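In symbols, the standard form of the theorem matching that description:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$

where $P(H \mid D)$ is the posterior, $P(D \mid H)$ the likelihood of the data given the hypothesis, $P(H)$ the prior, and $P(D)$ the probability of getting the data you found.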
Demo day
demo day is where you get to share the accomplishments from the sprint with your team
Clustering
You can look at things like a K-dimensional space. You locate each data point, each observation, in a multidimensional space with K dimensions for K variables: if you have five variables, K is five; if you have 500, then you have 500 dimensions. You then need a way to measure the distance between each point and every other point, and you're looking for clumps and gaps.
Optimization and the combinatorial explosion
You're trying to find an optimum solution, but randomly going through every possibility doesn't work. This is called the combinatorial explosion because the growth is explosive as the number of units and the number of possibilities rises and so you need to find another way that can save you some time and still help you find an optimum solution.
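A quick illustration of how fast exhaustive search blows up: the number of orderings of n items is n!, which already exceeds a trillion by n = 15.

```python
# Factorial growth: why "try every possibility" stops working fast.
import math

for n in [5, 10, 15, 20]:
    print(n, math.factorial(n))
```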
What is the right question? [a] Should we expand internationally?[b] Could we capture 1% of the retail market in Toronto if we open a store there?
[a] is predictably asking broad questions while b is focusing the analysis on a specific goal and city. To answer b's question, we'll need competitor data in Toronto, our success in comparable markets, and customer surveys in Toronto asking about brand awareness.
What is the right question? a) How can we cut costs across the retail company? - b) Can we reduce cotton materials cost by 20% and linen materials costs by 10% from Supplier A?
a) is too broad to design a study. b), on the other hand, is detailed and specific. To answer the question, we'll need cost data for the supply chain and suppliers, and our historical relationship with suppliers.
Which branch of law is privacy housed in?
All three. Privacy, for example, is sourced in common law in what we call privacy torts. It's sourced in statutory law such as GINA and other legislation that protects financial, health, and student privacy. And it is sourced in the U.S. Constitution's amendments.
Purpose of an EDARP
An effective EDARP will convince the reader of the potential value of your work. In doing so, the support of the organization will help drive the success of the project.
People processes: Onboarding
break down your onboarding task into time buckets
decomposition
breaking things down from the whole into their elements, to try to see what's happening with your data over time. This is decomposition. Think of it like disassembling a clock or some other item. You're going to take the trend over time and break it down into several separate elements. You're going to look at the overall trend, you're going to look at seasonal or a cyclical trend, and you're going to have some leftover random noise
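A hedged sketch with statsmodels' seasonal_decompose on a synthetic monthly series, splitting it into trend, seasonal, and residual (noise) parts:

```python
# Time-series decomposition on a synthetic monthly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")   # 4 years, monthly
values = np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12)
series = pd.Series(values, index=idx)

parts = seasonal_decompose(series, model="additive")
print(parts.trend.dropna().head())   # overall trend
print(parts.seasonal.head(12))       # repeating seasonal pattern
```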
The enumeration of explicit rules
Business strategies, flowcharts, or criteria for medical diagnoses.
What are the two values involved in being an ethical person?
caring for one's own well-being and the well-being of others
Validating models
Check your work. The basic principle is pretty easy, even if people outside of data science don't do it very often.
Descriptive analyses
Descriptive analyses are one way of doing this. It's a little like cleaning up the mess in your data to find clarity in the meaning of what you have. There are three very general steps to descriptive statistics. Number one, visualize your data: make a graph and look at it. Number two, compute univariate descriptive statistics, things like the mean, an easy way of looking at one variable at a time. Then go on to measures of association, the connections between the variables in your data.
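The three steps in miniature with pandas, on made-up data (plotting assumes matplotlib is available):

```python
# Visualize, summarize one variable at a time, then look at associations.
import pandas as pd

df = pd.DataFrame({"height": [160, 172, 168, 181, 175],
                   "weight": [55, 70, 66, 82, 74]})

df.hist()             # step 1: visualize (histogram of each variable)
print(df.describe())  # step 2: univariate descriptives (mean, std, ...)
print(df.corr())      # step 3: measures of association (correlations)
```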
descriptive methods
descriptive stats, plotting data, outlier detection
Self-generated data
Reinforcement learning (external feedback) and generative adversarial networks (internal feedback).
So what can you do to make sure your stakeholders feel connected with the project from beginning to end?
The first step is often a project kickoff meeting. This helps you establish rapport with your stakeholders and makes it much easier for them to feel connected to the project and get you the answers you need to be successful. Next, after committing to a given set of project communications, find out how it's working. You can take the pulse of your stakeholders by means of a short survey.
effective post-mortem communication
Focus on the facts, don't assign blame, and aim to have actionable takeaways.
distribution charts
frequency tables
GIGO
garbage in, garbage out. That's a truism from computer science. The information you're going to get from your analysis is only as good as the information that you put into it
Passive collection of training data
Gathering enormous amounts of data doesn't always involve enormous amounts of work; in certain respects, you can just sit there and wait for it to come to you, e.g., photo classification. One issue with this, and it's actually a huge issue, is that you need to ensure you have adequate representation when categorizing photos, including limit cases.
Cluster Analysis Methods
Hierarchical clustering; K-means, or a group centroid model; density models or distribution models; or a linkage clustering model.
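As one concrete example, a K-means (centroid model) sketch with scikit-learn on toy points:

```python
# K-means: group points into k clusters by distance to cluster centers.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the two group centroids
```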
Downside to In-house data
If it was an ad-hoc project, it may not be well documented. And the biggest downside is that the data simply may not exist: maybe what you need really isn't there in your organization.
where is negligence law housed
in common law
PCA vs FA
in principal component analysis, the variables come first and the component results from it. In factor analysis, this hidden factor comes first and gives rise to the individual variables.
Research Ethics when gathering data
Informed consent; also sometimes confidentiality or anonymity.
If you don't have access to all the data you want
Working with what you do have is often an effective way to figure out a better or more complete set of data to collect in the future.
Agile team meetings: IPMs
Iteration planning meetings. At the start of each iteration, you all get together and plan which portion of the product backlog will get attention in the upcoming iteration. This is known as your IPM.
Feature selection and creation Methods
Basic correlation, stepwise regression, lasso regression, ridge regression.
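A hedged lasso sketch with scikit-learn on synthetic data: the L1 penalty shrinks uninformative coefficients to exactly zero, which is what makes it useful for feature selection:

```python
# Lasso as a feature selector on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # near-zero coefficients flag features to drop
```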
linear regression
Linear regression is a common and powerful technique for combining many variables in an equation to predict a single outcome.
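A minimal scikit-learn sketch on made-up data, combining two predictors in one equation to predict a single outcome:

```python
# Linear regression: fit an equation, then predict a new case.
from sklearn.linear_model import LinearRegression

X = [[1, 50], [2, 60], [3, 65], [4, 80]]   # two predictor variables
y = [110, 135, 150, 178]                   # single outcome

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)       # the fitted equation
print(model.predict([[5, 90]]))            # predict a new case
```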
Neural networks
look at things in a different way than humans do and in certain situations they're able to develop rules for classification, even when humans can't see anything more than static.
Steps for Descriptive Analyses
Looking at your data through charts, e.g., a histogram.
descriptive stats
Looking at historical data or trends; observations that happened in the past.
MLaaS
Machine learning as a service. Examples: Amazon Machine Learning, Google AutoML, IBM Watson Analytics.
Often, more data will come in as you're analyzing
Your question must change to reflect this addition. Sometimes data isn't clean, which may put a time crunch on your project if you don't take this into account.
Creating data/Get your own Data
Natural observation; informal discussions with, for instance, potential clients, which you can do in person one-on-one, in a focus group setting, or online through email or chat, asking specific questions to get the information you need to focus your own projects; and surveys. Words > numbers: let people express themselves, and start general.
implicit rules
The implicit rules are rules that help the algorithms function. They are the rules the algorithms develop by analyzing the test data, and they're implicit because they cannot be easily described to humans.
why big data projects fail?
Poor organization is the biggest factor.
Dimensionality reduction
Reduce the number of variables and the amount of data that you're dealing with. Each variable, each factor or feature, has error associated with it: it doesn't measure exactly what you want and brings in some other stuff. But when you combine many variables or features together, the errors tend to cancel out.
predictive methods
regression, classification, clustering
What makes a meeting ineffective
scheduling a meeting last minute, not having a clear plan, and having too many people present.
predictive stats
Seeking some sort of prediction; forecasting.
Microsoft Excel and its many versions. Google Sheets
Spreadsheets are the universal data tool. It's my untested theory that there are more datasets in spreadsheets than in any other format in the world. The rows and columns are very familiar to a very large number of people, and they know how to explore the data and access it using those tools. The most common format by far.
Data Researcher shares insights to
Stakeholders.
Anomaly detection
These are cases that are distant from the others in a multidimensional space. They can also be cases that don't follow an expected pattern or trend over time, or, in the case of fraud, cases that match known anomalies or other fraudulent cases.
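One simple distance-based check, sketched with z-scores on made-up numbers (real anomaly detection usually uses more robust methods):

```python
# Flag cases far from the others using z-scores.
import numpy as np

values = np.array([10, 11, 9, 10, 12, 11, 48])   # one obvious outlier
z = (values - values.mean()) / values.std()      # distance in standard deviations

print(values[np.abs(z) > 2])   # cases far from the others -> [48]
```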
Constitutional law
When the government takes actions with respect to data, it is likely the Constitution will have a role restricting what the government can and can't do.
Scraping Data and Ethics
There are still legal and ethical constraints that you need to be aware of. For instance, you need to respect people's privacy: if the data is private, you still need to maintain that privacy. You need to respect copyright: just because something's on the web doesn't mean you can use it for whatever you want. The idea here is that visible doesn't mean open; just like in an open market, just because something is there in front of you and doesn't have a price tag doesn't mean it's free. There are still important elements of law, policy, and social practice that need to be respected to keep yourself out of some very serious trouble. So keep that in mind when you're doing data scraping.
Machine-Centric processing and action
This is when machines are talking to other machines. The best example of this is the Internet of Things, which can include things like wearables: my smart watch talks to my phone, which talks to the internet, which talks to my car, sharing and processing data at each point. Also smart homes: you can say hello to your smart speaker, which turns on the lights.
Human-in-the-Loop decision making
This is where advanced algorithms can make and even implement their own decisions, as with self-driving cars. I remember the first time my car turned its steering wheel on its own. But humans are usually at the ready to take over if needed.
Calculus
To do maximization and minimization, when you're trying to find the balance between disparate demands.
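A small numerical sketch with scipy: minimize a made-up cost f(x) = x² + 16/x that balances two competing demands; calculus gives the same answer via f'(x) = 0, at x = 2.

```python
# Find the balance point of two competing costs numerically.
from scipy.optimize import minimize_scalar

result = minimize_scalar(lambda x: x**2 + 16 / x,
                         bounds=(0.1, 10), method="bounded")
print(result.x)   # approximately 2.0, where the demands balance
```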
Why would we enable software automation
To let machines do as much of the dirty work as possible. Items in the low-reward, high-conflict area are where automation is going to help us out, so you can focus on more high-reward items.
Predictive Model Critical Step
validate your model by testing it against new data, often against data that's been set aside for this very purpose. This is the step that's often neglected in a lot of scientific research, but it's nearly universal in predictive analytics and it's a critical part of making sure that your model works well outside of the constraints of the data that you had available
prescriptive stats
We recommend some action for the business.
Predictive analytics
which future events are the most likely.
what's a univariate descriptive
you can look for one number that might be able to represent the entire collection.