Data Science Foundation: Fundamentals
Data preparation
80% of project time is typically spent on data prep. Column = variable; row = case/observation; one sheet per file. Each file has one level of observations (e.g., vendor address file, order file). Tidy data: each column represents a variable and each row an observation.
Feature Selection and Creation
A feature is a variable or dimension in the data. Features can be combined to create new features. (Dimension reduction is often used as part of getting the data ready so you can then decide which features to include in the models you're creating.) Methods: correlation; stepwise regression (enters all potential variables and looks at correlations); lasso and ridge regression. When selecting a variable, ask: can you control it; what is the ROI; is it sensible (does it make sense to select that variable)?
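A minimal sketch of the lasso idea, assuming scikit-learn and synthetic data (the features and numbers are invented): coefficients that lasso shrinks to zero mark features the model is effectively dropping.

    # Hypothetical example: lasso regression as a feature-selection aid (assumes scikit-learn).
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                          # five candidate features (synthetic)
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only two features actually matter

    model = Lasso(alpha=0.1)
    model.fit(StandardScaler().fit_transform(X), y)

    # Features with coefficients shrunk to (near) zero are candidates to drop.
    for i, coef in enumerate(model.coef_):
        print(f"feature {i}: coefficient = {coef:.2f}")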
Infographics
Adobe Illustrator now has a chart rendering tool.
Artificial Intelligence
Algorithms that learn from data; broadly: machine learning. Strong or General AI: a replica of the human brain that can solve any cognitive task. Weak or Narrow AI: algorithms that focus on specific well-defined tasks. You can't do AI without data science
Aggregating Models
Any one guess may be high or low. When you combine several different models (central limit theorem), the errors tend to cancel out and you end up with a composite estimate that's generally closer to the true value. It takes extra time and effort but gives you multiple perspectives, compensating for weaknesses and building on strengths. You can find the signal amid the noise. More stable. Many eyes on the same problem.
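A toy illustration of the idea, assuming NumPy and made-up numbers: averaging several noisy estimates typically lands closer to the true value than any single one.

    # Illustrative sketch: averaging several imperfect estimates so errors tend to cancel out.
    import numpy as np

    true_value = 100.0
    rng = np.random.default_rng(1)

    # Three hypothetical models, each noisy in its own way.
    predictions = np.array([true_value + rng.normal(0, 10, size=1000) for _ in range(3)])

    single_error = np.abs(predictions[0] - true_value).mean()
    ensemble_error = np.abs(predictions.mean(axis=0) - true_value).mean()
    print(f"average error of one model: {single_error:.2f}")
    print(f"average error of the combined estimate: {ensemble_error:.2f}")  # typically smaller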
Tools for Data Science: Applications
Apps: more common, more accessible; good for exploring, good for sharing. Most common: spreadsheets (universal: Excel, Google Sheets; good for browsing and exporting). SQL (Structured Query Language): access data stored in databases. Visualization: Tableau, Power BI, Qlik - interactive data exploration. Apps for data analysis (point and click, making analysis easier for non-specialists to conduct): SPSS, JASP, jamovi. Good for democratizing data. Let the tools and techniques follow the question.
The first step in the data science pathway is "define goals." Why is this the best place to start a data science project?
Clarifying your project's goals up front will help you at every step of the project pathway, from framing questions to choosing data and algorithms to interpreting and applying your results. Feedback Goals influence every step of the data science pathway, from planning to wrangling to modeling to applying.
Math for Data Science: Optimization and the combinatorial explosion
Combinatorial explosion: as the number of units and the number of possibilities rise, growth is explosive and quickly gets out of hand. You can use Excel, calculus, or optimization (linear programming). Using Solver in Excel, you can do optimization: a way to evaluate various combinations and determine how to get everything done with the best possible revenue outcome.
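The notes mention Excel's Solver; as a hedged alternative sketch, the same kind of linear program can be set up in Python with SciPy's linprog (the products, prices, and constraints below are invented for illustration).

    # Hypothetical problem: two products, limited labor and materials, maximize revenue.
    from scipy.optimize import linprog

    # Revenue per unit: product A = $40, product B = $30. linprog minimizes, so negate to maximize.
    c = [-40, -30]
    A_ub = [[2, 1],    # labor hours per unit (100 hours available)
            [1, 2]]    # material units per unit (80 units available)
    b_ub = [100, 80]

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print("units of A and B to produce:", result.x)
    print("maximum revenue:", -result.fun)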
Tableau
Convert your data into something you can see. Easy to learn and use. Creates dynamic dashboards.
Descriptive work
Counting the frequency of topics on social media. Cluster analysis of a customer database.
D3
D3.js (Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It makes use of the Scalable Vector Graphics, HTML5, and Cascading Style Sheets standards. You can create charts and graphs for browsers, and many libraries are available, so you can build on other people's work. NOT so easy to use and learn!
Entrepreneur
Data-based startups often need all skills, including business skills, and creativity in planning and execution.
Actionable Insights:
Data and data science are for doing. Focus on things that are controllable (specific); be practical (ROI) - the impact should be large enough to justify the effort. You want to build up: have sequential steps.
Data science methods can contribute to business intelligence by which tasks?
Data cleaning, Data modeling (outcomes), finding trends and anomalies in the data
Deviation
Shows how data points relate to each other and whether a data point differs from the mean; we are seeing whether it is normal or unusual. Easiest to see with a reference line in the graph.
Analyst
Day-to-day data tasks: web analytics, SQL, visualizations. Good for business decision-making.
Big Data:
Definition: unusual volume, velocity, and variety. You can do big data without the full toolkit of data science.
Data-Driven Decision Making (Future)
Democratization of data: you won't have to collect it anymore - it will be available to anyone in the company to run hypotheses and experiments.
Data engineer
Developers, architects. Focus on hardware and software.
Substantive Expertise
Each domain has its own goals, methods, and constraints: what constitutes value, and how to implement insights.
Social Issues:
Engage with respect; don't fool or deceive.
The enumeration of explicit rules
Expert systems are algorithms that mimic the decision-making process of a human domain expert. Examples: data analysis methods (flow charts), medical diagnoses (the DSM in psychiatry), business strategies (what to do if X happens).
Machine Learning Specialists
Extensive work in computer science and mathematics. Deep learning, artificial intelligence.
Trend analysis
Figure out the path your data is on, so you can inform decisions about whether to stay on the current path or whether changes need to be made. Autocorrelation: today's value is associated with yesterday's. You are looking for consistency in change. Growth can be linear, exponential, logarithmic (the rate diminishes), sigmoid, or sinusoidal. Change points are changes in the resting state of the data; you may look at historical events that can explain those changes. You can use R for this. Decomposition: taking the trend over time and breaking it down into several separate elements. All of these start with plotting the dots and connecting them.
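A minimal sketch of the "plot the dots and fit a trend" step, using Python rather than the R mentioned in the notes (the monthly sales figures are synthetic).

    import numpy as np
    import matplotlib.pyplot as plt

    months = np.arange(24)                           # two years of hypothetical monthly data
    sales = 100 + 5 * months + np.random.default_rng(2).normal(0, 10, size=24)

    slope, intercept = np.polyfit(months, sales, 1)  # fit a straight line to estimate the trend
    plt.plot(months, sales, "o-", label="observed")  # plot the dots and connect them
    plt.plot(months, intercept + slope * months, label=f"linear trend (~{slope:.1f} per month)")
    plt.legend()
    plt.show()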
Validating Models
Principle: check your work. Will it work with anything else? How? Training data and testing data: the training set is used to build the model, while the testing (or validation) set is used to validate it. Two common approaches: cross-validation, which splits the training data and uses some splits to create the model and the remaining split to test it; and holdout validation, where you set aside roughly 20% of the data (never looked at or touched) and apply the model to it just once to see how it performs.
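An illustrative sketch of both approaches with scikit-learn on synthetic data (the dataset and model choices are assumptions, not part of the original notes).

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

    # Holdout: set ~20% aside, fit on the rest, score the untouched holdout once.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("holdout R^2:", model.score(X_test, y_test))

    # Cross-validation: split the training data several ways and average the scores.
    print("cross-validation R^2:", cross_val_score(LinearRegression(), X_train, y_train, cv=5).mean())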
Legal issues:
Privacy laws: GDPR. HIPAA: Health Insurance Portability and Accountability Act. FERPA: Family Educational Rights and Privacy Act.
Classifying
The process of grouping together items that are alike in some way. 1) Locate the case in K-dimensional space. 2) Compare the labels on nearby data. 3) Assign the new case to the same category. Use K-means (assign the case to the closest of K centroids) or K-nearest neighbors (use the most common category of the K cases closest to the new case). It can be binary (yes/no) or involve many categories, distance measures, and confidence levels. Bayes' theorem allows you to combine data about sensitivity, specificity, and base rates. Which two methods are common algorithms for classifying new cases into existing categories? K-means and k-nearest neighbors. While many methods can be used for matching cases to existing categories, k-means and k-nearest neighbors are two of the most common and useful.
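A small k-nearest-neighbors sketch, assuming scikit-learn and a toy two-feature dataset: locate the new case, look at nearby labels, assign the most common category.

    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical two-feature training data with known categories.
    X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
    y_train = ["small", "small", "small", "large", "large", "large"]

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    new_case = [[2, 2]]
    print(knn.predict(new_case))          # most common category among the 3 nearest neighbors
    print(knn.predict_proba(new_case))    # a rough confidence level for each category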
Data Science
The skills and techniques for dealing with challenging data. Not mutually exclusive with AI. You can do data science without AI, machine learning, big data, predictive analytics, or prescriptive analytics.
Time Series
Tracking a data metric over time, with the independent variable (time) on one axis.
What is a major advantage of understanding the algebra behind data science procedures?
You will better understand how to diagnose problems and respond when things don't work as expected. Feedback Data doesn't always match the assumptions and requirements of algorithms, so things can go wrong. Understanding the algebra behind the algorithms can help you respond to problems intelligently.
Neural networks can consist of millions of interconnected nodes processing information in complicated ways. What is an important consequence of this fact?
developers sometimes have to rely on methods similar to those of psychological researchers to infer what information is being processed and how. Feedback Psychological methods of inference can be useful in understanding the sometimes opaque processes of neural networks.
What is a "posterior probability" in Bayes' Theorem?
A posterior probability is the probability of the cause, such as a disease, given the effect, such as a positive medical test for the disease. Feedback Bayes' Theorem combines the probability of a hypothesis (the "prior") with the likelihood of the data given the hypothesis and the base rate of the cause to get the posterior probability, or probability of the hypothesis given the data.
Research ethics on gathering your own data
1) Informed consent: When you're gathering data from people, they need to know what you want from them, and they also need to know what you're going to do with it so they can make an informed decision about whether they want to participate. 2) Privacy: You need to keep identifiers to a minimum. Don't gather information that you don't need, and keep the data confidential, and protected.
Passive Collection of Training Data:
1) Photo classification (photos are tagged online) 2) Autonomous Cars (get data constantly to improve the car function) 3) Health Data: my watch gathers data. You can get enormous amounts of data that is either general or specific. Challenge: adequate representation; need to check for shared meaning; need to check for limit cases.
Data science pathway
1) Planning: define goals; organize resources (right computers, software), coordinate people; schedule the project. 2) Wrangling: get data; clean data (fits into the program); explore (visualizations); Refine data 3) Modeling: create the statistical model; validate it; evaluate the model; refine the model. 4) Applying: presenting the model; deploy the model; revisit the model (how well is it performing); archive the assets.
Heat Map
A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heatmaps are used in various forms of analytics but are most commonly used to show user behaviour on specific webpages or webpage templates.
Bubble Charts
A type of scatter plot with circular symbols used to rank your data with bubbles. Can be overlaid on a map.
Math for Data Science: Algebra
Allows you to scale up: your solution should deal efficiently with many instances at once. Generalize: your solution should apply not just to a few specific cases but to cases that vary in arbitrary ways. Elementary algebra (linear regression); linear algebra (works with vectors and matrices). Choose procedures: know which algorithms will work best with your data to answer your questions. Resolve problems: know what to do when things don't go as expected so you can respond thoughtfully.
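As a sketch of how linear algebra shows up in practice, here is ordinary least squares solved via the normal equations with NumPy (the data and coefficients are invented for illustration).

    import numpy as np

    rng = np.random.default_rng(3)
    X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # design matrix with an intercept column
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(0, 0.1, size=100)

    # Normal equations: beta = (X^T X)^(-1) X^T y, expressed as a linear solve.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)   # close to the true coefficients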
Prescriptive Analytics
Cause-and-effect relationships. Observed correlation (the effect is likely when the cause is present): correlation coefficient. Temporal precedence (the cause comes BEFORE the effect). No other explanation: the connection can't be accounted for by anything else. Gold standard: the RCT (randomized controlled trial), which is difficult to do. A/B testing in web applications: you show one offer on one version of a site and a different offer on another - which gets more clicks? What-if simulations: if this is true, then what would we expect? Optimization models: how to spend time and money to maximize the outcome; another name for this is mathematical programming. Cross-lag correlations and quasi-experiments are useful ways to approximate cause-and-effect relationships, as are what-if simulations and optimization models. Iteration is critical: test it over and over again. You can have prescriptive analytics without data science. Establishing causality may be impossible, but prescriptive analytics can get you "close enough."
Difference between classification and clustering:
Classification and clustering are two methods of pattern identification used in machine learning. Although both techniques have certain similarities, the difference is that classification uses predefined classes to which objects are assigned, while clustering identifies similarities between objects and groups them according to those shared characteristics, which also differentiate them from other groups of objects. These groups are known as "clusters." Say you have ducks, goats, chickens, turkeys, sheep, and rabbits. Classification would put all the ducks together, all the turkeys together, all the sheep together, and so forth. Clustering would group all the mammals together (rabbits, sheep, and goats) and all the fowl together (ducks, turkeys, chickens).
Nominal Comparison
Compares values of categories or sub-categories. Food at a BBQ: # of hot dogs, hamburgers, buns
What is the difference between "general AI" and "narrow AI"?
General AI attempts to build a general-purpose thinking machine, while narrow AI focuses on algorithms for specific tasks like translating language. Feedback General AI has historically focused on creating machines that can solve any problem, but narrow AI, where most of the recent technical growth has occurred, focuses on well-defined, specific problems.
C, C++, Java
General purpose languages for back end and maximum speed. (JSON)
Business Intelligence
Getting insights to do something better in your business. Emphasizes speed, accessibility, insight. Often relies on structured dashboards. Data science helps set up BI and makes it possible; business intelligence gives purpose to data science. Collect and clean data; build models of outcomes; find trends and anomalies.
Predictive Models:
Find and use relevant past data; model the outcome; apply the model to new data; validate the model against new data. Useful for predicting whether someone will develop an illness or recover, or pay off a loan. Two meanings of prediction: one is predicting future events - using presently available data to predict something that will happen later, or using past medical records to predict future health. The other, possibly more common, use is prediction of alternative events - approximating how a human would perform the same task. Methods: classification methods (k-nearest neighbors, nearest centroid classification); decision trees (a way of tracking the most influential data in determining where a particular case will end up); neural networks (a form of machine learning that has proven immensely adaptive and powerful); regression analysis (which gives you an understandable equation to predict a single outcome from multiple predictor variables - for example, using the amount of time a person spends on your website to predict their purchase volume). These methods are flexible with data, and the models can be easy to interpret.
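A hedged sketch of the regression example mentioned above, using scikit-learn and invented data relating time on a website to purchase volume.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(4)
    minutes_on_site = rng.uniform(1, 30, size=200).reshape(-1, 1)
    purchase_volume = 5 + 2.5 * minutes_on_site.ravel() + rng.normal(0, 5, size=200)

    model = LinearRegression().fit(minutes_on_site, purchase_volume)
    # An understandable equation: intercept plus a coefficient per minute on the site.
    print(f"purchases ~ {model.intercept_:.1f} + {model.coef_[0]:.1f} * minutes on site")
    print("predicted volume for a 20-minute visit:", model.predict([[20]])[0])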
Interpretability:
First, identify who is going to use the results. If a machine: the algorithm just needs to work - it doesn't need to understand the principles. If a human: they can take the information and apply it to new situations. You are telling a story - make sense of your findings to make recommendations.
Researcher
Focus on domain-specific research. Physics and genetics are common. More statistical expertise.
Ethical Issues:
Forms of fairness: equity, equality, and need. Forms of justice: distributive, procedural, interactional. Authenticity: who/what you are dealing with.
Scraping Data:
Found art of data science: find data around you that you can use for your needs. Legal/Ethical constraints: need to respect privacy; copyright; visible doesn't mean open. Data scraping refers to the process of extracting data from formats that were not specifically designed for data sharing. Feedback Data scraping is the creative work in getting data from formats that were not designed for data sharing, such as heat maps or image PDFs.
Clustering
Grouping data - can be geographical. K-dimensional space: locate each data point (each observation) in a multidimensional space with K dimensions for K variables. You then need a way to measure the distance between each point and every other point, looking for clumps and gaps. You can measure distance in many ways. You can use K-means or a group centroid model. What is the name for a chart that shows "branches" or cases splitting from one giant cluster into individual clusters? A dendrogram. Coming from the Greek word for "branch," a dendrogram shows the hierarchical structure of clusters.
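An illustrative sketch using scikit-learn and SciPy on made-up points: k-means assigns each point to the closest centroid, and a dendrogram shows the hierarchical cluster structure.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    points = np.vstack([rng.normal(0, 1, size=(20, 2)),      # one clump
                        rng.normal(6, 1, size=(20, 2))])     # another clump

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print(labels)                                            # each point assigned to the closest centroid

    dendrogram(linkage(points, method="ward"))               # branches splitting from one giant cluster
    plt.show()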
Neural Networks
Growth area in machine learning. Takes in information for processing - it approximates how the human brain works. Computing power and raw data have exploded. Tiny steps with data lead to amazing analytics results. Inference: you need to infer how it is functioning. Legal issues: GDPR covers privacy in the EU (right to explanation - if you've been harmed by a decision, you have the right to know why). EU residents have the "right to explanation," or the right to appeal any machine-made decision that harms them. Because neural networks are so complex, this can be very difficult or impossible.
Part to whole
How a smaller subset compares to the larger whole. (Pie chart, stacked bar chart)
The generation of implicit rules
Implicit rules are rules that help the algorithms function. They are the rules the algorithms develop by analyzing the training data, and they're implicit because they cannot be easily described to humans.
Data Availability
In house: fastest way to start; restrictions may not apply; you may be able to talk to the people who created the data. Issues: not well documented and maintained; may not exist. Open data: data that is freely available to the public. Can be government, scientific, or social media data.
Math for Data Science: Calculus
Involved any time you are doing maximization or minimization, e.g., revenue maximization.
API: Application Programming Interface
An API isn't a source of data; rather, it's a way of sharing data. It can take data from one application to another or from a server to your computer. It's the thing that routes the data, translates it, and gets it ready for use. It allows you to access data and include it in your data science programming. JSON (JavaScript Object Notation) is used here and can be consumed from Python and Java. Social APIs (Twitter, Facebook); utilities (Dropbox, Google); commerce (Stripe, Mailchimp, Slack). It can become a process or an app. What kind of data can be accessed with APIs? Both proprietary and open data.
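A minimal sketch of pulling JSON from an API with Python's requests library; the URL and parameters below are placeholders, not a real endpoint (real services usually require an API key).

    import requests

    response = requests.get("https://api.example.com/v1/orders", params={"status": "shipped"})
    response.raise_for_status()          # stop early if the request failed

    orders = response.json()             # JSON is parsed into Python lists and dictionaries
    for order in orders:
        print(order)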
SQL
Language for relational database queries and manipulations
The Derivation of rules from Data Analysis.
Linear regressions: combine many variables to predict a single outcome. Decision tree: a sequence of binary decisions that combine to predict an outcome.
Self Generated Data
Looping back: computers engage with themselves to create data. Needed for training machine learning algorithms. Benefits: millions of variations and trials; machines can create scenarios that humans wouldn't. Needed for creating rules. Self-generated data means programming computers to engage with themselves to create their own training data. Feedback Self-generated data is a method for creating training data for machine learning models where computers engage themselves to generate data. Method: generative adversarial networks.
Project Managers
Manage the project. Big picture: frame business-relevant questions. Must "speak data," though they may not be able to do the analysis themselves. A data science manager oversees the entire project and helps place it in a business context.
Predictive Analytics
May involve predictions using large datasets and sophisticated algorithms like neural networks. Predictive analytics also refers to models that estimate what a human judge would do if given the same task, such as categorizing photos.
Forms of mathematics
Probability, linear algebra, calculus, and regression. Choose procedures: judge the fit between your questions, your data, and your procedure. Diagnose problems: know what to do when a procedure fails or gives you impossible results.
Creating Data
Natural observation, informal discussions, formal interviews, surveys (closed-ended questions). Words > numbers. Be as open-ended as possible: start with the big picture and then narrow it down, moving from general to more specific. Experiments: A/B testing - two versions of a website; which one is more effective?
TensorFlow
Open source library used for deep learning. Deep learning neural networks
Bar chart:
Compares values for different items, possibly over time; good for showing a distribution.
Types of Data
Part to whole, distribution, nominal comparison, time-series, correlation, ranking, deviation.
Bayes' Theorem
Gives the posterior probability as a function of the likelihood, the prior probability, and the probability of getting the data you found. Used for medical diagnosis: if a person tests positive on a test that is 90% accurate, what is the probability the person has the disease?
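A worked sketch of that calculation in Python, with assumed numbers (1% base rate, 90% sensitivity, 90% specificity); none of these figures come from the notes.

    prior = 0.01                    # P(disease)
    sensitivity = 0.90              # P(positive | disease)
    false_positive_rate = 0.10      # P(positive | no disease) = 1 - specificity

    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    posterior = sensitivity * prior / p_positive     # P(disease | positive)
    print(f"P(disease | positive test) = {posterior:.2%}")   # roughly 8%, far lower than 90%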
Data Analysis (past)
Predecessor to data analytics: used in science and by statisticians (insurance and finance). Collecting data was difficult, expensive, and slow.
Predictive Analytics
Predict outcomes: restorative justice, clicks and purchases, risk of disease (Bayesian), classification of photos, correlation. Work done by data science researchers: predictions that involve difficult (unstructured) data and sophisticated models (neural networks). You can do prediction without data science: clean quantitative data sets, common models.
Common predictive analytics tasks include which of these?
Predicting a patient's risk for a disease; predicting the classification of photos; predicting whether a customer will purchase a product online. Although predictions may include historical data, the focus is on bringing about changes in the future.
Python & R
Programming languages for data manipulation and modeling
Languages for Data Science
Python: most popular for data science and machine learning. General purpose, easy to learn, works great with large data. R: a programming language specifically for data analysis, popular among scientists and researchers; works natively with data. Others: SQL, Java, Julia, Scala, Matlab. Expand functionality with packages.
Different Types of Data
Quantitative data: quantified, verified, and measured; all values are numerical. Two categories: 1) discrete data, based on counts; 2) continuous data, things that can be measured, such as time, weight, height, and width. Discrete would be "I have 8 dogs"; continuous: "I have 8 dogs that weigh between 30 and 35 lbs." Categorical data: grouped into a category. It can be nominal (names: trucks, boats) or ordinal (in order: small, medium, large).
Pie Charts
a graph in which a circle is divided into sectors that each represent a portion of the whole. Not good for comparison.
Agency of Algorithms and Decision Makers
Recommendations: an algorithmic process you can accept or reject. "Based on your shopping patterns, may we suggest XYZ"; "based on what you've read, you may like this." Uses your own past behavior to give you recommendations. Human in the loop: humans make and implement decisions; e.g., self-driving cars, where you are there to intervene or make the final decision if needed. Human accessible: the algorithm makes the decision, but you need to be able to understand how it reached it; e.g., online mortgage applications. Machine-centric: machines talk to other machines; a smartwatch talks to a phone. It's the Internet of Things.
Correlation
Relationship of two or more numerical variables. Can show positive (grouped toward the higher end - right side), negative (grouped toward the lower end - left side), null (grouped around the same value on the Y axis), linear (positive or negative), or exponential relationships.
MLaaS: Machine Learning as a Service
SaaS: software as a service: making software accessible through the internet instead of using it from your desktop. MLaaS: it's a way of making the entire process of data science, machine learning, and artificial intelligence easier to access, easier to setup, and easier to get going. Azure ML, Amazon Machine Learning, IBM Watson. They put the analysis where the data is stored. Give you very flexible computing requirements: you can rent hardware as needed. It's a way to democratize the process.
Heat Maps
Show high and low values, or high and/or low density, in the data. Can use a range of colors or multiple densities.
Area Chart:
Similar to line charts, except the areas under the lines are filled in.
Data Analytics (present)
Spread from science to the business world. Inexpensive and readily available. Better tools (excel, tableau, R). Easy, inexpensive and fast.
Machine Learning
The ability of algorithms to learn from data and improve their function in the future. Memorization is easy; spotting patterns is hard; new situations are challenging. Machine learning really can't be done without data science. Sub discipline of data science.
Dimensionality Reduction
The idea of dimension reduction is to reduce the number of variables and the amount of data that you're dealing with. Reasons: 1) each variable has error associated with it; when you bring in many features, the errors tend to cancel out. 2) Reduced collinearity: the overlap between predictor variables in the model, which can create problems. 3) Fewer features means faster processing. 4) Improves generalizability - more stable. General ways to do it: principal component analysis (combine multiple variables into a single component - the variables come first); factor analysis (find the underlying common factor that gives rise to multiple indicators - the factors come first). Methods: exploratory analysis, confirmatory analysis, levels of measurement, multiple algorithms.
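A small PCA sketch with scikit-learn on synthetic, deliberately overlapping features (everything here is invented for illustration).

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(6)
    base = rng.normal(size=(300, 2))
    # Six observed features that mostly restate two underlying dimensions (collinear by design).
    X = np.column_stack([base, base @ rng.normal(size=(2, 4)) + rng.normal(0, 0.1, size=(300, 4))])

    pca = PCA(n_components=2)
    components = pca.fit_transform(StandardScaler().fit_transform(X))
    print("shape after reduction:", components.shape)        # 300 cases, 2 components
    print("variance explained:", pca.explained_variance_ratio_)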
Scatterplot
a graphed cluster of dots, each of which represents the values of two variables. You can add a lot of information to the mix.
Ranking
Two or more variables shown in a greater-than, less-than, or equal-to relationship.
Distribution
A type of visualization that shows how data are distributed, often around a central value.
Quantitative Analysts (Quants)
Use data science to scientifically investigate investment hypotheses and build models to predict investment outcomes. They changed the stock market. They have largely been replaced by high-frequency automated trading algorithms, which now account for most of the trading done on Wall Street.
Python
Best-in-class for data science. One of the easiest languages to learn, read, clean, and fix. You can use libraries to visualize (matplotlib), though its default aesthetics are dull. You can use pandas (data structures and data analysis tools) and Seaborn (makes the output look better). ggplot2 is a plotting system requiring minimal code; very bland. Bokeh creates interactive charts and graphs. Pygal produces interactive SVGs with minimal lines of code. Geoplot for maps.
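A tiny plotting sketch, assuming pandas, matplotlib, and seaborn, with made-up data.

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df = pd.DataFrame({"minutes_on_site": [2, 5, 8, 12, 15, 20, 25, 30],
                       "purchases":       [0, 1, 1,  2,  2,  3,  4,  5]})

    sns.set_theme()                                  # seaborn styling on top of matplotlib
    sns.scatterplot(data=df, x="minutes_on_site", y="purchases")
    plt.title("Purchases vs. time on site (hypothetical data)")
    plt.show()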
Descriptive Analyses
Used to simplify data into manageable levels. It's like cleaning up the mess in your data to find clarity in the meaning of what you have. Three general steps: 1) Visualize the data - make a graph, bell curve, histogram. Positive skew: the tail stretches toward the high end (most of the numbers are at the low end); negative skew: the tail stretches toward the low end (most of the numbers are at the high end). It could be U-shaped: most of the data is at the far left or right. 2) Compute univariate descriptive statistics: mean (the average, or balance point), mode (most common value), median (splits the data into two equal halves). Measures of variability: range (distance from lowest to highest value), quartiles or IQR (splits the data into 25% groups), and the variance and standard deviation, both used in statistics. 3) Go to measures of association, or the connection between variables. Associations can be shown with scatterplots; numerical measures include the correlation coefficient and regression analysis. Regardless: the data must be representative of the larger group. Be attentive to outliers; open-ended scores can dramatically affect the data.
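A quick sketch of the univariate statistics with pandas, using an invented set of scores that includes one outlier to show how it pulls the mean.

    import pandas as pd

    scores = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 22])   # note the outlier at 22

    print("mean:", scores.mean())          # balance point, pulled upward by the outlier
    print("median:", scores.median())      # splits the data into two equal halves
    print("mode:", scores.mode().tolist()) # most common value(s)
    print("range:", scores.max() - scores.min())
    print("IQR:", scores.quantile(0.75) - scores.quantile(0.25))
    print("standard deviation:", scores.std())
    scores.hist()                           # visualize the distribution (requires matplotlib)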
Line Charts
a chart that plots continuously distributed data points to compare trends over time. Place time measurements on the X axis.
According to the example calculation in the video, what information do you have to have in order to use calculus?
a function that describes the relationship between price and sales Feedback In order to use calculus to find the best price for maximizing revenue, you must first have a formula that says how sales are related to price.
How do expert systems mimic the decision-making of experts?
by explicitly listing decisions and outcomes in a logical chain like a flow chart Feedback An expert system spells out every step in a decision tree like a flow chart.
Data visualization can be considered an example of what?
data science without big data Feedback Creative data visualization often requires substantial computer programming and mathematical skills, and so can be considered data science, even if it doesn't require all three Vs of big data.
Predictive analytics focuses on predicting what's likely to happen in the future. On the other hand, prescriptive analytics focuses on which of these?
identifying cause-and-effect relationships in your data Feedback Prescriptive analytics focuses on cause-and-effect relationships so you can determine the best actions to bring about your goals.
Computers frequently work with data in matrices that are arranged in rows and columns. What is the name for the version of algebra that works best with matrices?
linear algebra Feedback Linear algebra is the form of algebra that deals with matrices. It is used in the algorithms that computers typically use for analyzing data.
What is one of the rare qualities that creates such a high demand for data scientists?
the ability to find order, meaning, and value in unstructured data Feedback Data scientists are valuable because they are able to find value in unstructured data, but they're also able to predict outcomes and automate processes.
Anomaly detection
the process of identifying rare or unexpected items or events in a data set that do not conform to the other items. This can be serendipity - unexpected insights, untapped potential/value. Anomalies can indicate fraud, process failure, or potential value. All have in common: they are outliers; they don't follow expected patterns. Methods: regression, Bayesian analysis, hierarchical clustering, neural networks. Dealing with rare events leads to unbalanced models. Difficult data (biometrics, multimedia).
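A very simple anomaly-detection sketch (a z-score rule on made-up transaction amounts); a real project would use the more sophisticated methods listed above.

    import numpy as np

    transactions = np.array([52, 48, 50, 51, 49, 47, 53, 50, 250, 48])   # one suspicious amount

    z_scores = (transactions - transactions.mean()) / transactions.std()
    outliers = transactions[np.abs(z_scores) > 2]    # values more than 2 standard deviations out
    print("possible anomalies:", outliers)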