Data science: Foundation
Supply and demand for data science
Now I know that we're all supposed to be happy just being who we are, but sometimes it's nice to have some external validation, and it's also nice to get paid. And that brings up a curious statement from 2012 that appeared in Harvard Business Review. In a groundbreaking article, Thomas Davenport and D. J. Patil made the extraordinary claim that data science, of all things, was the sexiest job of the 21st century. Now, it's a shocking thing to say, but they had some good reasons for saying it: data scientists had a valuable combination of rare qualities that put them in high demand. So let's take a look at each of those things. First, the rare qualities. What is special about data scientists that would make it such an amazing job? Well, they find order, meaning, and value in unstructured data. If you think about the text from social media, that's unstructured; it doesn't go into rows and columns. If you think about the deluge of data coming from so many different sources, data scientists are able to find some meaning in it. Also, they specialize in predicting outcomes and building predictive models. And they're able to automate really cumbersome processes to make a business run more efficiently. So those are some very rare and valuable qualities. Then, in terms of high demand, what's going on is that because data science can provide hidden insight, it's able to provide a competitive advantage, and every business wants that. And while traditionally it's been the tech industry, the thing to know is that it's not just the tech industry: it's healthcare, telecommunications, energy, banking, financial services and insurance, retail, media and entertainment, construction, education, manufacturing, cybersecurity, transportation, government, and nonprofit organizations, among others. Really what's happening is a spread of the value of, and the demand for, data science. It originated in the tech industry, think people hiring for Google search or Facebook recommendation engines, but the value is being seen everywhere, and as the practice spreads, more and more companies in more and more sectors are able to see and pursue the value in data science. Growth in job ads: to give you a little bit of data on data science, a January 2019 report from Indeed found growth in job ads. They reported a 29% increase in job ads for data science over one year, and a 344% increase over six years. There's also growth in job searches, that is, people looking for jobs in data science, and it does grow, but the same Indeed report found only 14% growth over one year. Put those two things together and there is a gap between supply and demand: a 29% increase in demand over one year, but only 14% growth in supply. That gap in supply and demand is what makes it a very good time to go into data science. Part of this shows up in the salary for data science. The average base salary for data science, according to Glassdoor in 2021, was $117,000 a year. Compare that to the national median personal income of about $51,000; that's more than twice as much. And what that lets you know is, yes, this is a good job to have. Glassdoor listed data science as the number one job in the US for four consecutive years, from 2016 to 2019, and it's still in the top three. Even as the field spreads, it remains an amazingly rewarding and productive one.
And that is what led Harvard Business Review to call it the sexiest job of the 21st century.
Data preparation
- [Instructor] Anybody who's cooked knows how time consuming food prep can be. And that doesn't say anything about actually going to the market, finding the ingredients, putting things together in bowls and sorting them, let alone cooking the food. It turns out there's a similar kind of thing that happens in data science, and that's the data preparation part. The rule of thumb is that 80% of the time on any data science project is typically spent just getting the data ready; everything else falls into the remaining 20%. That can seem massively inefficient, and you may wonder what your motivation is to go through something so time consuming, really such drudgery. Well, in one phrase, it's GIGO, that is, garbage in, garbage out. That's a truism from computer science. The information you're going to get from your analysis is only as good as the information you put into it. And to put it in starker terms, there's a wonderful phrase from Twitter: most people who think they want machine learning or AI really just need linear regression on cleaned-up data. Linear regression is a very basic, simple, and useful procedure, and the point, as a rule of thumb, is that if your data is properly prepared, then the analysis can be quick, clean, easy, and easy to interpret. Now, when it comes to data preparation in data science, one of the most common phrases you'll hear is tidy data, which seems a little silly, but the concept comes from data scientist Hadley Wickham, and it refers to a way of setting your data up so it can be easily imported into a program and easily organized and manipulated. It revolves around some very basic principles. Number one, each column in your file is equivalent to a variable, and each row in your file is a case or observation. Also, you should have one sheet per file. In an Excel workbook you can have lots of different sheets, but a CSV file has only one sheet. And each file should have just one level of observation. So you might have one sheet on orders, another on SKUs, another on individual clients, another on companies, and so on. If you do this, it makes it very easy to import the data and get the program up and running. Now, this may seem really obvious, and you might ask why we even have to explain it. It's because data in spreadsheets frequently is not tidy. You have things like titles, and you have images and figures and graphs, and you have merged cells, and you have color used to indicate a data value, or you have sub-tables within the sheet, or you have summary values, or you have comments and notes that may actually contain important data. All of that can be useful if you're never going beyond that particular spreadsheet. But if you're trying to take it into another program, all of it gets in the way. And then there are other problems that show up in any kind of data. For instance, do you actually know what the variable and value labels are? Do you know what the name of this variable means, because sometimes they're cryptic, or what a three on employment status means? Do you have missing values where you should have data? Do you have misspelled text?
If people are writing down the name of the town that they live in or the company they work for, they could write that really an infinite number of ways. Or in a spreadsheet, it's not uncommon for numbers to accidentally be represented in the spreadsheet as text, and then you can't do numerical manipulations with it. And then there's a question of what to do with outliers? And then there's metadata, things like where did the data come from, who's the sample, how was it processed? All of this is information you need to have in order to have a clean data set that you know the context and the circumstances around it, that you can analyze it. And that's to say nothing about trying to get data out of things like scanned PDFs or print tables or print graphs, all of which require either a lot of manual transcription or a lot of very fancy coding. I mean, even take something as simple as emojis, which are now a significant and meaningful piece of communication, especially in social media. This is the rolling on the floor laughing emoji. There are at least 17 different ways of coding this digitally. Here's a few of 'em. And if you're going to be using this as information, you need to prepare your data to code all of these in one single way so that you can then look at these summaries altogether and try to get some meaning out of it. I know it's a lot of work, but just like food prep is a necessary step to get something beautiful and delicious, data prep is a necessary, vital step to get something meaningful and actionable out of your data. So give it the time and the intention it deserves. You'll be richly rewarded.
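To make that kind of cleanup a little more concrete, here is a minimal sketch in Python with pandas on an invented three-row table; the column names, the city spellings, and the emoji codes below are hypothetical examples, not data from the course. It converts numbers stored as text into real numbers, maps free-text spellings of a city onto one canonical label, and collapses several different encodings of the same emoji into a single code, which is exactly the kind of normalization described above.

```python
import pandas as pd

# Hypothetical raw export with common "untidy" problems:
# numbers stored as text, inconsistent city spellings, and
# several different codings of the same emoji.
raw = pd.DataFrame({
    "order_total": ["19.99", "7.50", "12.00"],           # numbers as text
    "city":        ["NYC", "New York", "new york city"],  # free-text variants
    "reaction":    ["\U0001F923", ":rofl:", "&#129315;"]  # emoji coded 3 ways
})

# 1. Coerce text that should be numeric into real numbers.
raw["order_total"] = pd.to_numeric(raw["order_total"], errors="coerce")

# 2. Map free-text variants onto one canonical label.
city_map = {"nyc": "New York", "new york": "New York",
            "new york city": "New York"}
raw["city"] = raw["city"].str.strip().str.lower().map(city_map)

# 3. Collapse the different encodings of the same emoji into one code.
rofl_variants = {"\U0001F923", ":rofl:", "&#129315;"}
raw["reaction"] = raw["reaction"].apply(
    lambda x: "ROFL" if x in rofl_variants else x)

print(raw.dtypes)
print(raw)
```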
Optimization and the combinatorial explosion
- [Instructor] If you want to have a winning team, you need to have both excellent players and excellent teams, or combinations of players. Now, you might be tempted to simply take the players that you have and try them in every possible situation, to see where they work together in the different positions. If you're in a sport that only has a few people, like, say, beach volleyball, where there are two people on each team, this is something you can do. Let's say you have four players total to choose from, and you want to try them out two at a time, where you put each person into a position and you try all the possible permutations. Well, the formula for that is the one right here, where N is how many players we're choosing from, that's four, and R is how many we're taking at a time, that's two. And that gives us 12 possible permutations. That's something you could do in a day or two and feel confident about the decision that you've made. But what if you have more people? Let's say you're actually dealing with basketball. Now you're going to have five people on the court at a time, and you're going to have 15 people on an NBA roster to choose from. So now you have permutations of 15 players, that's your N, taken five at a time. And if you're randomly shuffling them around in positions to see who does better at what and how well they work together, this is your formula. And unfortunately, now you have 360,360 possible permutations. That's going to keep you busy for a really, really long time. And you know what? It's not even as bad as it gets. Let's go to baseball, and let's say you want to try out the 25 players on your roster, where you put nine of them on the field at a time. This is actually the sport where I first heard people talk about this. Well, the math gets out of hand very quickly. You're doing permutations where N is 25 players and R is nine at a time, and that gets you over 741 billion possible permutations, which is more possibilities than there have been years since the universe began, and so that's just not going to work. You're trying to find an optimal solution, but randomly going through every possibility doesn't work. This is called the combinatorial explosion, because the growth is explosive as the number of units and the number of possibilities rises (there's a quick code check of these counts at the end of this section). So you need to find another way that can save you some time and still help you find an optimal solution. There are a few simple ways of approaching this. Number one is just to go into Excel. If you can estimate a basic function, you can use trial and error to model the function and look for a local maximum. Excel also has What-If Analysis and scenarios that help you do that. You can also use calculus. For simple functions, you can use a calculus-based approach that I've demonstrated elsewhere; on the other hand, you need to be able to estimate the function and get a derivative. And then there's also optimization, also known as mathematical optimization or mathematical programming, certain versions of which are known as linear programming. And I want to show you very quickly how I can use optimization in Excel to answer a basic question. Now, my goal here is not to show you every step involved in Excel. What I'm trying to do is show you that it is possible and what it looks like.
If you decide it's something you want to pursue, you're going to need to go back and spend a little more time learning the ins and outs of how this works. I'm going to be using a little piece of software that is an included add-in for Excel called Solver. What I've done is set up a hypothetical situation where a person owns a yoga studio and they're trying to maximize their revenue for the amount of time they spend. The first thing I do, over here, is put down a list of possible things a person could spend their time on, from responding to social media and writing a newsletter to teaching classes to recording videos. And I say about how long it takes to do each one and how much each of those impacts the bottom line. Then, one of the important things about optimization is putting in constraints. So here I say we need to do at least this many, on the left, and no more than this many, on the right. You have to respond to your email, but you can't teach classes all day long; you'd get burned out. And so you have these minima and these maxima. Then we're going to use the Solver to try to find an optimal number of units, that is, how much of your time you should spend on each of these activities. We're going to maximize the impact, which is basically revenue, and then we're going to make it so it's no more than 40 hours, because you still have a life you've got to live. Now, when you call up the Solver dialog box, it looks like this. This is where you say what you're trying to maximize, what you're going to constrain, and what it's allowed to change. There are a number of options here, and it can get pretty complex. Again, my goal here is not to show you how to use it, but simply that it exists and that it can solve this kind of problem. When I hit Solve, what it does is adjust column F, and it says, this is your optimal distribution: spend this much time on each of these things. And what that's going to do is maximize the impact, which is going to be associated with revenue. It also says how much time you're going to spend on each one. I also put a little dot plot here at the end, just to give a visualization that you're going to spend most of your time teaching group classes in the studio, but that you'll do at least a little bit of everything else as a way of maximizing the impact. This is a way of reaching a conclusion about the optimal way of allocating your time, maybe even recruiting a team, without having to go through billions and billions of possibilities. That's one of the beauties of data analysis and data science: to help you cut through the mess and find the most direct way to the goals that are important to you.
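As a quick check of the lineup counts quoted earlier, here is a short Python snippet using the standard library's math.perm, which computes the number of ordered arrangements P(n, r) = n! / (n - r)!:

```python
from math import perm

# Number of ordered lineups (permutations) when filling r positions
# from a roster of n players: P(n, r) = n! / (n - r)!
print(perm(4, 2))    # beach volleyball:   4 players, 2 positions ->            12
print(perm(15, 5))   # basketball roster: 15 players, 5 positions ->       360,360
print(perm(25, 9))   # baseball roster:   25 players, 9 positions -> 741,354,768,000
```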
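And as a rough code analogue of what Solver does in the spreadsheet, here is a sketch of the same kind of problem as a linear program using SciPy's linprog. The activities, hours, impact values, and bounds below are invented for illustration; they are not the numbers from the instructor's worksheet.

```python
from scipy.optimize import linprog

# Hypothetical yoga-studio schedule: choose weekly units of each activity
# to maximize impact, subject to a 40-hour week and per-activity bounds.
activities      = ["email", "newsletter", "group classes", "private lessons", "videos"]
hours_per_unit  = [0.25, 2.0, 1.5, 1.0, 3.0]    # time cost of one unit
impact_per_unit = [1.0, 8.0, 20.0, 15.0, 10.0]  # revenue proxy of one unit

# linprog minimizes, so negate the impact coefficients to maximize impact.
c = [-i for i in impact_per_unit]

# One inequality constraint: total hours must not exceed 40.
A_ub = [hours_per_unit]
b_ub = [40]

# Per-activity (minimum, maximum) unit counts, like Solver's constraint cells.
bounds = [(5, 40), (1, 3), (5, 20), (0, 10), (0, 3)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
for name, units in zip(activities, res.x):
    print(f"{name:16s} {units:5.1f} units/week")
print("total hours: ", sum(h * u for h, u in zip(hours_per_unit, res.x)))
print("total impact:", -res.fun)
```

Like Solver, linprog adjusts the decision variables, the units per activity, to maximize the objective while respecting the hour budget and the per-activity minimums and maximums.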
Business intelligence
- [Instructor] It's an article of faith for me that any organization will do better by using data to help with their strategy and with their day-to-day decisions, but it reminds me of one of my favorite quotes from over 100 years ago. William James was one of the founders of American psychology and philosophy, and he's best known for functionalism in psychology and pragmatism in philosophy. He had this to say: "My thinking is first and last and always for the sake of my doing." That was summarized by another prominent American psychologist, Susan Fiske, as "Thinking is for doing." The point is that when we think, the way our brain works, it's not just there because it's there; it's there to serve a particular purpose. And I think the same thing is true about data and data science in general. In fact, I like to say data is for doing. The whole point of gathering data, the whole point of doing the analysis, is to get some insight that's going to allow us to do something better. And truthfully, business intelligence is the one field that epitomizes this goal. Business intelligence, or BI, is all about getting the insight to do something better in your business. And business intelligence methods, or BI methods, are pretty simple. They are designed to emphasize speed, accessibility, and insight right there; you can do them on your tablet, you can do them on your phone. And they often rely on structured dashboards, like these graphs that you see. Maybe you do a social media campaign and you can go and see the analytics dashboard. Or you have videos on YouTube or Vimeo or someplace; you can get the analytics and see how well a video is performing, who's watching it, and when. That's a business intelligence dashboard of a sort. So if this is all about the goal of data, that data is for doing, and BI does that so well, where does data science come into all of this? Well, it actually comes in earlier in the picture. Data science helps set things up for business intelligence, and I'll give you a few examples. Number one, data science can help tremendously in collecting, cleaning, preparing, and manipulating the data. In fact, for some of the most important companies in business intelligence, say, for instance, Domo, their major selling point is the way they ingest and process the information to make it easily accessible to other people. Next, data science can be used to build the models that predict particular outcomes. So you will have a structure there in your data that is running, for instance, a regression, or a decision tree, or some other model to make sense of the data. And while the person doesn't have to specifically manipulate that model, it's available to them, and it's what produces the outcomes they're seeing. And then finally, two of the most important things you can do in business intelligence are finding trends, to predict what's likely to happen next, and flagging anomalies: this one's an outlier, something may be wrong here, or we may have a new case with potential hidden value. Any one of those is going to require some very strong data science to do it well. Even if the user-facing element is a very simple set of graphs on a tablet, the data science goes into the preparation and the offering of the information. And so really, I like to think of it this way: data science is what makes business intelligence possible.
You need data science to get the information together from so many different sources and sometimes doing complex modeling. And also, I like to think that business intelligence gives purpose to data science. It's one of the things that helps fulfill the goal-driven, application-oriented element of data science. And so, data science makes BI possible, but BI really shows to the best extent how data science can be used to make practical decisions that make organizations function more effectively and more efficiently.
The enumeration of explicit rules
- [Instructor] Let's say you meet someone at a party, and after talking for a while, you start to wonder if that person might be interested in you. This is apparently a question that is on a lot of people's minds. If Google's auto-complete is to be trusted, assessing attraction is a major research question: of the top 10 searches that start with "how to tell," the first two are on this topic, shortly followed by how to tell if an egg is bad. And in fact, men appear to be sufficiently difficult to read that they get to appear twice on the top 10 list. So that lets you know we need an answer to this question: how can you tell if somebody is interested in you? Well, maybe we can propose some rules. Maybe they're interested in you because they said so, or maybe they smiled and made eye contact, or maybe they just swiped right. Then again, there are the insecure doubts that pop up and undermine your belief in these things. Maybe "it's wonderful to meet you" isn't diagnostic and they say that to everybody. Maybe they smile when they're bored, or maybe they slipped on the ice and fell and accidentally swiped the wrong way. All these things can undermine your faith in the rules that you have. But lest you think I'm just being silly with these examples, I want to point out that this is a legitimate data science problem. Dating apps are a multi-billion dollar business, and if you can help people find someone they love, then you've truly accomplished something worthwhile too. So, if we want to write a program to help people figure out if someone likes them, maybe we just need to include a little more detail and create a flow chart with explicit rules and qualifications before concluding with a definitive yes or no. This is an example of what's called an expert system. An expert system is an approach to machine decision-making in which algorithms are designed to mimic the decision-making process of a human domain expert. In this case, maybe Jane Austen; she was an incisive and strategic decision maker when it came to matters of the heart. For example, here's a set of observations from Pride and Prejudice, where it's pointed out that being wholly engrossed by one person and inattentive to others, offending people by not asking them to dance, and having people speak to you and not answering are telling signs. She asks, "Could there be finer symptoms?" Could there be better diagnostic criteria for amorous inclinations? She says, "Is not general incivility the very essence of love?" And so maybe we could create an expert system based on these criteria. Now, I can imagine a lot of situations where this might not work well, but there are many other situations in which expert systems, which model the decision-making of an expert, have worked remarkably well. Those include things like flow charts to help you determine which data analysis method is most appropriate for answering the question you have, given the data that's available to you. Or criteria for medical diagnoses: I'm in psychology, and so we use the DSM, the Diagnostic and Statistical Manual of the American Psychiatric Association, to say, if a person has these symptoms, but not these, then this is the likely diagnosis. Or even business strategies, where you can give general checklists of approaches on what to do first, what to do next, and how to reach the goals that you have. But it's important to remember that, like any system, the logic and flow of expert-based decisions has its limits.
You're eventually going to hit the wall. You're going to meet situations that you just can't accurately predict or there'll be things that you never anticipated. And so it turns out that there are significant limits to enumerating explicit rules for decision-making. And in fact, we need more flexible and more powerful methods which is what I'm going to turn to next.
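As a toy illustration of what an expert system's explicit rules look like in code, here is a short, hypothetical sketch. The criteria and verdicts are invented for the example; this is not a validated model of anything.

```python
def interested(said_so: bool, smiled_and_eye_contact: bool,
               swiped_right: bool, answered_messages: bool) -> str:
    """Toy expert system: hand-written rules that try to mimic how a
    human 'expert' might decide whether someone is interested."""
    if said_so:
        return "yes"
    if swiped_right and answered_messages:
        return "probably"
    if smiled_and_eye_contact and answered_messages:
        return "maybe"
    return "unclear"

# Example: smiling, eye contact, and answered messages, but nothing explicit.
print(interested(said_so=False, smiled_and_eye_contact=True,
                 swiped_right=False, answered_messages=True))  # -> "maybe"
```

The limitation described above is visible right in the code: every rule had to be written by hand, and any situation the rules never anticipated falls straight through to "unclear."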
Data ethics
- [Instructor] Let's start by stating the obvious: if your data project ends with everybody screaming at you, then something has gone wrong. And when it comes to data ethics, things can go very wrong very quickly. Now, let me give you a different perspective for a moment. My background is in academic research. I have a PhD in social psychology, which is a research field, and I'm a tenured full-time professor at a university. Whenever an academic does research with people, there are a few general principles that apply. For instance, it is absolutely mandatory to have voluntary participation; that is, people need to choose freely, without coercion, to participate in your study. Second, also absolutely mandatory, there needs to be informed consent. People must know what they're getting into so they can make a free, voluntary choice to participate. Then there are two other things that are not necessarily mandatory, but you've got to have a good reason if you're not going to use them. One is anonymity, that is, there are no easy identifiers of people in the data; you don't ask for their names or their social security numbers or anything like that. You try to make it so that people feel their data is going to be private. The other is confidentiality. Again, not absolutely required, but you've got to have a good reason if you're not going to use it. Confidentiality means that regardless of whether the data is anonymous or not, people know you're not going to freely share it, that it's going to stay within certain tight bounds. I have a locked filing cabinet in my locked office, which is where I'm required to keep the data that I've gathered. It is very tight. And so this is one way of thinking about data ethics: voluntary participation, informed consent, and, whenever possible, anonymity and confidentiality. That's the academic research world. On the other hand, in the data science world, there's an unfortunate history of anything goes. There was a time, not that long ago, when it seemed that the governing rule was do whatever you want. That included scraping private or copyrighted data from the web or social media, being sneaky about using cookies on people's web browsers, selling private information, and a host of other potential dumpster fires in the tech world. Now, I'm just talking about gathering data, not how it might be used in machine learning algorithms, which is a different set of significant issues. On the other hand, I have mentioned elsewhere, more than once, the changes in the regulatory environment, the single most important of which is the GDPR, the European Union's General Data Protection Regulation, a very serious piece of legislation that completely alters the way that data is gathered. That's why you now have cookie notices. That's why companies now have to specifically ask you for permission to gather data and to share it with others. And here in the United States, there's the California Consumer Privacy Act, which has spawned a whole host of changes, not just in California, but around the country. And so these laws, and ones similar to them, are changing the environment in which people gather data, and they're giving some very specific guidelines for data ethics. Now, you want to be able to play by the rules and keep things in bounds, and so here are the general principles for data ethics in data science, again, not an exhaustive list.
But number one is privacy by default, people's data is private, and they have to specifically tell you whether you can use it or do something with it. That's why even the check boxes that say you can gather the data must be unchecked by default. They have to go in and check it. Also, you require active consent to gather, to use, and to sell data. And that people, for instance, have the right to view the data you've gathered about them, or to have it removed from your datasets. And then there's the development of privacy-first hardware like cell phones and their operating systems that give people much tighter control on privacy, or software like the browser or the search engine that you use may be a privacy-first piece of software. And so all of these are creating a very different environment in which the data science world operates, and while it puts a lot of restrictions on, it does it in a way that I think actually makes our work much more productive. The idea here is that if you respect people's individuality, their autonomy, and their privacy when gathering data for your projects, you build up trust and good will, which can be scarce commodities, and you get better data both now and in the future because people are willing to participate and work with you again. And then you can go from screaming people to happy people, which is what all of us are working for.
Supervised learning with predictive models
- [Instructor] Marriage is a beautiful thing, where people come together and set out on a new path full of hope and possibilities. Then again, it's been suggested that half of the marriages in the U.S. end in divorce, which is a huge challenge for everyone involved. But 50% is just a flip of a coin. If you were trying to predict whether a particular marriage would last or whether it would end in divorce, you could just predict that everybody would stay married, or that everybody would get divorced, and you'd be right 50% of the time without even trying. In a lot of fields, being right 50% of the time would be an astounding success. For example, maybe only 5% of companies that receive venture capital funding end up performing as projected, and there are billions of dollars at stake. If you could be right 50% of the time in your venture capital investments, you'd be on fire. And that brings up the obvious question: how can you tell which companies will succeed and which will fail? Not surprisingly, many methods have been proposed. Apparently, being too wordy in your emails is a sign of imminent business failure, but I think that's anecdotal data and not a proper data science predictive analysis. Here's the general approach for trying to use data to predict what's going to happen. First, find and use relevant past data. It doesn't have to be really old; it can be data from yesterday. But you always have to use data from the past, because that's the only data you can get. Then you model the outcome using any of many possible choices. And then you take that model and apply it to new data to see what's going on there. There's actually a fourth critical step, separate from applying, and that's to validate your model by testing it against new data, often against data that's been set aside for this very purpose. This is a step that's often neglected in a lot of scientific research, but it's nearly universal in predictive analytics, and it's a critical part of making sure that your model works well outside of the constraints of the data that you had available. Now, there are a number of areas where predictive analytics as a field has been especially useful: things like predicting whether a particular person will develop an illness or whether they'll recover from an illness, whether a particular person is likely to pay off a mortgage or a loan, or whether an investment will pay off for you, and then even more mundane things, like building a recommendation engine to suggest other things that people can buy when they're shopping online. All of these are hugely influential areas and major consumers of predictive analytics methods. Now, I do want to mention that there are two different meanings of the word prediction when we talk about predictive analytics. One of them is predicting future events, that is, using presently available data to predict something that will happen later, or using past medical records to predict future health. And this is what we think of when we hear the word prediction: we think about trying to look into the future. On the other hand, that's not necessarily even the most common use of the word in predictive analytics. The other, possibly more common, use is prediction of alternative events, that is, approximating how a human would perform the same task.
So you can have a machine do something like classifying photos, and you want to say whether this is a person, whether this is a dog, whether this is a house. You're not trying to look into the future, but you're trying to say, if a person were to do this, what would they do? We're trying to accurately estimate what would happen in that case. And so you also might try inferring what additional information might reveal. So we know 20 pieces of information about this medical case; from that, we might infer that the patient has a particular disease, but we wouldn't know for sure until we do a direct test. So we're trying to estimate what's happening there. Now, when you go about your analysis, there are a few general categories of methods for doing a predictive analytics project. Number one is classification methods. That includes things like K-nearest neighbors and nearest centroid classification, and it's also connected to clustering methods, such as K-means. You can also use decision trees and random forests, which are several decision trees put together, as a way of tracking the most influential data and determining where a particular case is going to end up. And then, extremely powerful in data science, there are neural networks, a form of machine learning that has proven to be immensely adaptive and powerful, although it can be very hard sometimes to follow exactly what's going on inside. All of these methods have been very useful for trying to predict what's going to happen with a particular case. But I do want to mention one other approach that's been enormously useful and dates back a lot further than most of these, and that's regression analysis, which gives you an understandable equation to predict a single outcome based on multiple predictor variables. And it can be a very simple thing, like this equation that uses the amount of time a person spends on your website to predict their purchase volume. Now, this is fictional data, but you get to see, we have a scatter plot, we draw a regression line through it, and we even have an equation there at the top of the chart. And this is a regression equation written entirely symbolically; I showed this to you before. It's where you're trying to predict an outcome, Y, for individual i, and you're using several predictors, X1, X2, X3, and their regression coefficients, to predict their score. So, for instance, the example I used was predicting salary, and you can write it out this way too, where the salary for individual i is going to be $50,000, that's the intercept, plus $2,000 for each year of experience, plus $5,000 for each step of their negotiating ability on a one-to-five scale, plus $30,000 if they're the founder or owner of the company. So that's a regression equation, and it would be very useful for predicting something like salary. And it's a conceptually easy way to analyze the data and make sense of the results. There are a few nice things about regression models. Number one, they're very flexible in the kind of data they can work with. Different versions of regression can work with predictors or outcomes that are quantitative or continuous, ordinal, dichotomous, or polytomous categorical variables. They also can create flexible models: they're usually linear, but they can also be curvilinear, quantile-based, or multilevel. You have a lot of choices. And generally they're easy to interpret, compared to many other data science procedures.
The results of regression analysis are easy to read, interpret, present, and even to put into action. But this is simply one choice among many for predictive analytics, where you're trying to use your data to estimate what's going to happen in the future. We have a lot of resources here where you can find other courses that will give you specific instruction on predictive analytics methods and help you find the actionable next steps in your data.
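To make the salary example concrete, here is that equation written as a small Python function, using exactly the coefficients given above. In practice, of course, the coefficients would be estimated from data, for instance with a tool like scikit-learn's LinearRegression, rather than typed in by hand.

```python
# Salary equation from the example above:
# salary_i = 50,000
#            + 2,000 * (years of experience)
#            + 5,000 * (negotiating ability, on a 1-5 scale)
#            + 30,000 * (1 if founder/owner, else 0)

def predicted_salary(years_experience: float,
                     negotiating_skill: int,
                     is_founder: bool) -> float:
    return (50_000
            + 2_000 * years_experience
            + 5_000 * negotiating_skill
            + 30_000 * (1 if is_founder else 0))

# Ten years of experience, negotiating skill of 4, not a founder:
print(predicted_salary(10, 4, False))  # -> 90000
```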
Machine learning as a service
- [Instructor] One of the things that's most predictable about technology is that things get faster, smaller, easier, and better over time. That's the essence of Moore's law, which originally talked about just the density of transistors on circuits doubling about every two years. But think about the women working here on ENIAC, the Electronic Numerical Integrator And Computer, which was the first electronic general-purpose computer, back in 1945. It was huge. It filled up a room, and it took a whole team of people to run it. Then things evolved to brightly colored reel-to-reel computers. Then, later, you get to your desktop Macintosh; I still have my Mac Classic II. And before you know it, you are running your billion dollar tech company from your cell phone. Now, one of the most important developments in the internet era has been SaaS, or software as a service. Just think of any time you've used an online application, like Excel Online, instead of an application installed locally on your own computer, like the regular version of Excel. And now a similar revolution has been happening in data science with machine learning as a service, or MLaaS. All of the major cloud providers have announced machine learning as a service offerings. They include Microsoft Azure ML, Amazon SageMaker, Google Cloud AutoML, and IBM Watson Machine Learning. Each of these companies provides a host of specialized offerings as well, which cover things like text to speech and back again, in the case of transcription, especially for people who dictate medical records; chatbots with enough intelligence to do natural language processing and, hopefully, answer people's questions and take care of matters quickly and efficiently; and various forms of content recognition, such as identifying objects, people, scenes, text, and activities in images and videos, detecting inappropriate content posted online, and maybe doing facial analysis and facial search. And all of these can be done using the specialized, highly developed and refined offerings from each of these major machine learning as a service providers. Now, there are several advantages to these. Number one, these approaches put the analysis where the data is stored, and that's because each of these companies also offers cloud-based data storage that's enormously flexible. It also allows you to do the computing right there, with flexible computing requirements: you pay for what you need. If you only have a little bit of data, you don't pay as much as if you're doing something huge. Also, they can provide a convenient drag-and-drop interface. And so what they all do is make the process more immediate, because it's where your data is, and more organized. And because they do the heavy lifting, it makes it faster for you, easier to meet deadlines, and easier to update things as needed, and you can get very high-quality results in a short amount of time using these specialized online services for machine learning.
Supervised vs. unsupervised learning
- [Instructor] People can be very enthusiastic about data science, and sometimes a little too enthusiastic; they sometimes seem to confuse data science with magic. But as it turns out, data science doesn't involve genies in a bottle, and it doesn't involve fairy dust. Instead, it involves algorithms and machine learning. Algorithms are just a series of decisions, not too different from this flowchart. In fact, it's those algorithms that make machine learning possible. Or rather, machine learning is the ability of a computer program to find patterns in data all on its own, without you specifically saying, if it's red, do this, if it's green, do that; it sees what the pattern is without being explicitly programmed. And this is where some of the most important developments in data science have happened. For our purposes, I want to briefly introduce three versions of machine learning, three general categories. The first is supervised learning. This is where you are classifying new cases into pre-existing, labeled categories. It's like taking a piece of mail and deciding which mailbox it goes into. This is one of the biggest and most productive areas of data science. It also covers the prediction of defined outcomes: if you know a person's blood test score, you might try to get other variables about their behavior that can predict their score on that outcome. So you have a defined category or outcome. And supervised learning can involve methods as simple as regression, logistic regression or linear regression, or something as complicated as a deep learning neural network. Next is unsupervised learning. This is where you simply find similarities without established labels or scores. You don't have a single outcome; you're just putting like with like, like sorting fruit, putting the same kind together. Good examples of this are the various forms of clustering methods, or dimensionality reduction, where you have a lot of different variables in your dataset and you're trying to reduce them to a smaller composite set to make your life a little easier. There's also anomaly detection, finding things that stick out from the pattern, possibly as a sign of something going wrong with the system, or maybe as an indicator of a niche where there might be more value that you haven't explored. And then the third general category is reinforcement learning. These are the algorithms that help computers play Go, help robots run parkour, and help cars drive themselves. Now, understand, here you don't have one set outcome, and you're not putting things into a category. The algorithm simply gets rewarded, it gets points, it gets scored, for better performance than before, and it learns how to do more and more of what it is getting rewarded for. And so, between those three general categories: supervised learning, where you're classifying into known categories or scores; unsupervised learning, where you're simply trying to find the similarities within the data; and reinforcement learning, where you're giving rewards for performing a given task better, that summarizes some of the most important and magic-like elements of data science.
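Here is a minimal sketch of the difference between the first two categories, using scikit-learn and its built-in iris dataset: a supervised model that learns from provided labels, and an unsupervised model that just groups similar cases. The specific models and dataset are illustrative choices, not ones named in the course.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: the labels y are given, and the model learns
# to predict them for new cases.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised prediction:", clf.predict(X[:1]))

# Unsupervised learning: no labels are given; the algorithm simply
# groups similar cases together into clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("unsupervised cluster:", km.labels_[:1])
```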
Bayes' theorem
- [Instructor] Performing is hard. You put in a lifetime of training and weeks of preparation for an event, and even when you've done the very best you can, you can never be completely certain that everything is going to work out exactly the way you wanted it to. There's a certain element of luck or probability associated with it. It's the same thing in data science: no matter how big your data set, no matter how sophisticated your analysis and the resources available to your organization, there's still an inescapable element of probability. It's just how it works. And one of the best things you can do is to explicitly incorporate that uncertainty into your data science work, to give you more meaningful and more reliable insights. This is Bayes' theorem, and it is one of the keys to incorporating that uncertainty. What Bayes' theorem does is give you the posterior, or after-the-data, probability of a hypothesis as a function of the likelihood of the data given the hypothesis, the prior probability of the hypothesis, and the probability of getting the data you found. Now, you can also write it out as a formula, like this, but it's going to be a lot easier if I give you an example and do the work graphically. So let's take a look at medical diagnosis as a way of applying Bayesian analysis, or Bayes' theorem, to interpreting the results of our study. There are three things you need to know. First, you need to know the base rate of the disease: how common is it overall? Let's assume that we have a disease that affects 5% of the people who are tested. Then we need to know the true positive rate: 90% of the people with the disease will test positive, which means that 10% won't; those will be false negatives. There's also a false positive rate: 10% of the people who do not have the disease will also test positive. It depends on how the test is set up, but that's not at all unlikely. So we have the base rate, the true positive rate, and the false positive rate, and we can use those to answer this question: if a person tests positive, and the test is advertised as 90% accurate, then what is the probability that they actually have the disease? Well, I'm going to give you a hint: the answer is not 90%, and that has to do with the way we incorporate the base rate and the false positives. So let's go and look at this. This square represents 100% of the people tested for the disease. Up here at the top, we have 5% of the total; that's the people who actually have the disease. Below that are the people without the disease; that's 95% of the total. Then, if we take the people with the disease and give them a test, the people who have the disease and test positive, that's in blue, are 90% of them, and 90% of 5% is 4.5% of the total number of people. These are the true positives. Next to that, we add the people without the disease who test positive. That's 10% of the 95%, so that's 9.5% of the total, and those are the false positives. And so you can see that everybody in blue got a positive result, but what's the probability that you actually have the disease? To find out, we're going to calculate the posterior probability of the disease: we take the true positives and divide them by all of the positives, the true and the false positives. In this particular case, that's 4.5% divided by the sum of 4.5% and 9.5%, which is 14%. And 4.5% divided by 14% is 32.1%.
And what this means is that even though the test is advertised as 90% accurate, depending on the base rate of the disease and the false positive rate, a positive test result may still mean that you have less than a one-in-three chance of actually having the disease. That's a huge difference between what people expect the result to mean and how it actually plays out in practice. And that's one of the most important things you can do for accurately interpreting, and putting into action, the results of your data science analysis: to get meaningful and accurate insight.
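Here is the same worked example as a few lines of Python, using the numbers from above (5% base rate, 90% true positive rate, 10% false positive rate):

```python
# Posterior probability of disease given a positive test:
# P(D | +) = P(+ | D) * P(D) / P(+)

base_rate      = 0.05   # 5% of the people tested have the disease
true_pos_rate  = 0.90   # 90% of people with the disease test positive
false_pos_rate = 0.10   # 10% of people without the disease test positive

# P(+): all positives = true positives + false positives
p_positive = (true_pos_rate * base_rate
              + false_pos_rate * (1 - base_rate))       # 0.045 + 0.095 = 0.14

posterior = true_pos_rate * base_rate / p_positive       # 0.045 / 0.14

print(f"P(disease | positive test) = {posterior:.1%}")   # about 32.1%
```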
Reinforcement learning
- [Instructor] Perhaps you've seen the video of the Atlas robot by Boston Dynamics, not as anthropomorphic as this one, but still bipedal, running through a parkour course and doing backflips. It's amazing. It's a moment that is incredible, both for its technical accomplishments and for its ability to make me feel even more uncoordinated than usual. Reinforcement learning is the key to a whole realm of extraordinary accomplishments in data science. So, for instance, there are computer algorithms that are able to play chess and Go, games that people thought were way too complicated for computers, and not only do they learn to play, they beat the best people in the world. Or self-driving cars that can keep themselves between the lanes and get you across the country. Or stock market trading, where algorithms are designed to find optimal times to buy and sell certain stocks in certain amounts to maximize the overall profit. Or, one of the more interesting applications, because it's life and death, is healthcare, where reinforcement learning can be used to determine dosages for medication, dynamic treatment regimes, or DTRs, for chronic illness, planning and conducting clinical trials, and a whole host of other outcomes. The basic principles of reinforcement learning are pretty simple. You start by setting up the situation. You need to give the algorithm the rules and constraints; it needs to know what's allowed and when it's offside. Then you need to have discrete steps in time where you can measure both what it has done and the reward function, or Q. Q is the expected value of the total reward from the current starting point, which is why one of the most common approaches to reinforcement learning is called Q-learning. So you have to be able to measure, or put numbers on, the progress that is being made. And if you have that reward function, then you can put the algorithm into play. What you're trying to do is create what's called a policy map, which gives the probability of taking a particular action when in a given state or condition. And then you have these twin ideas of exploration, which means the algorithm is going to try lots and lots of different possible behaviors, make lots of different choices at each given point, go in every direction, and see how well it works; and exploitation, which in reinforcement learning is a good thing: it means exploiting the things it has already learned in order to continue its progress. That's how you assemble the policy map, and that's the general process, in very broad strokes, of how reinforcement learning works. Or, if you want something slightly more specific, you'll come across the acronym SARSA, which stands for state, action, reward, and then the subsequent state and the subsequent action. This is an iterative algorithm used in a kind of policy learning based on Markov decision processes, which is very, very common within reinforcement learning. On the other hand, one of the major challenges in reinforcement learning is what's called the credit assignment problem, and that is exactly who you're going to give that gold star to. This has to do with the gap in time between the action and the feedback. You want that to be as short as possible, to make the reinforcement unambiguous.
But if you're working on a self-driving car, for instance, you want it to measure mile by mile, or maybe even foot by foot, so you can give it immediate feedback, instead of just a yes or no on whether it successfully got somebody across the country. Now, that's an absurd example, but for instance, with some medical interventions, like changing a person's medications or trying a new procedure, there can be a difficult period of adaptation or a recovery period before the eventual improvement has manifested. And by the way, this is similar in concept to what is called the attribution problem in online marketing, where you're trying to determine exactly which factors led, for example, to an eventual purchase. And that's a major challenge. It's knowing where exactly, again, where are you going to give the credit, where are you going to put that gold star? Now, another interesting challenge in reinforcement learning is initializing where to start. Because you can provide the algorithm with specific examples to learn from. You can have it watch thousands or millions of chess games. You can have it get all this data for medical records. There's a lot that you can do to get it set up. On the other hand, you can also have these quote unquote naive situations, where all you do is give it the rules. You simply tell it what the rules of chess are, and it learns on its own how to progress. There are fascinating examples of watching virtual robots in games, learning how to move themselves with no instructions at all, except for the basic laws of physics. And sometimes you end up with peculiar solutions, but they're effective. And so this is one of the choices you get to make in terms of setting up your reinforcement learning approach. Now, if you'd like to learn more about this topic, because there is a ton more that you could learn, I highly recommend the excellent course Reinforcement Learning Foundations by Khaulat Abdulhakeem. What this shows you is one method for doing some of the extraordinary magic that is associated with data science in terms of teaching programs, machines, what the goal is, and then letting them figure it out on their own, have their own developmental sequence in terms of getting to it. It's an extraordinarily flexible and creative approach to a lot of very challenging problems, and one of the things that makes data science so useful.
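To give a feel for the mechanics, here is a minimal tabular Q-learning sketch on an invented five-state corridor, where the only reward is for reaching the final state. The environment, the learning rate, the discount, and the epsilon-greedy exploration setting are all illustrative assumptions, not numbers from the course.

```python
import random

# Toy corridor: states 0..4, actions move left (-1) or right (+1).
# Reaching state 4 gives reward 1 and ends the episode; everything
# else gives reward 0. (Invented example for illustration only.)
n_states, actions = 5, [-1, +1]
alpha, gamma, epsilon = 0.5, 0.9, 0.2        # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(n_states)]    # Q[state][action index]

for episode in range(200):
    s = 0
    while s != 4:
        # Exploration vs. exploitation: epsilon-greedy action choice.
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: nudge Q toward reward + discounted future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# The learned policy map should come to prefer "right" in every state.
print([("left", "right")[q.index(max(q))] for q in Q[:4]])
```

Even in this tiny example you can see the pieces described above: discrete time steps, a measurable reward, exploration versus exploitation, and a learned policy map, which here amounts to "go right" in every state.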
Classifying
- [Instructor] So maybe you've got a dog, and maybe your dog does cute things like sticking its nose in your camera. Now, my dog's too short for that, but you take a picture or a video to save the moment. One interesting consequence of that process is that your phone's photo program is going to start analyzing the photo to determine what it's a photo of. That way, you can search for it later by typing "dog," without ever having had to tell the program that's what it is. And that's the result of a machine learning algorithm taking the data, analyzing the photos, classifying them as a dog, a cat, or a child, and adding those labels to the data. In fact, classifying is one of the most important tasks that data science algorithms perform, and they do it on all kinds of data. The general idea of automated classification is pretty simple to describe. Locate the case in a K-dimensional space, where K is the number of variables, or different kinds of information, that you have. It's probably going to be more than three; it might be hundreds or thousands. Once you have it located in that space, compare the labels on nearby data. That, of course, assumes that the other data already has labels, that it says whether it's a photo of a cat or a dog or a building. And then, once you've done that, assign the new case to the same category as the nearby data. So, in principle, it's a pretty simple process. Now, in terms of which data you're going to compare it to, you can do that using one of two common methods, among other choices. A very common one is called K-means. This is where you choose the number of categories that you want; you can actually say, "I only want two, or I want five, or I want 100." Then what the algorithm does is create centroids, which are like means in multidimensional space, and it will create as many centroids as you want groups. So when you put your new data in, it will assign that new case to the closest of those K centroids; again, it might be two, might be five, might be 100. Another approach is called K-nearest neighbors. What it does in this case is find where your data is in the multidimensional space, look at the closest cases next to it, and you can pick how many you want, it might be the five closest, the 20 closest, the 100 closest, then look at the categories of those cases and assign your new data to the most common category among them. Now, as you might guess, classification is a huge topic in data science, machine learning, and artificial intelligence, and so there are many, many options for how to do this process. You're going to have to spend a little time talking with your team to decide which approach is going to best meet your individual goals. Some of the things you're going to have to consider are whether you want to make a binary classification, that's just a yes or no, like whether a credit card transaction is or is not fraudulent, or whether you have many possible categories, like what's in a photo, or what kind of movie to recommend to someone. You also have a lot of choices for how you measure the distance: how close is it to something else? You can use Euclidean distance, Manhattan distance, edit distance, and so on. And you also need to decide whether you're going to compare it to one central point, like a centroid, or to several nearby points.
And then you also need to make a decision about confidence level, especially when you have a significant classification: how certain do you have to be that it's the right one? Some cases fit beautifully; others are much harder to classify. Now, once you've done the classification, you want to evaluate your performance, and there are a few different ways to do that. You can look at the total accuracy. So, in a binary classification, like whether a transaction is legitimate or fraudulent, what percentage of the total cases got put into the right category? This is simple to calculate, and it's intuitive to understand, but it's problematic, because if one category is much more common than the others, you can get high overall accuracy without even having a functional model. So you want to start looking at things a little more particularly, like, for instance, sensitivity. This is the true positive rate: if a case is supposed to be in a particular category, what's the likelihood that it will actually end up there? So if a person has a disease, what's the probability they will be correctly diagnosed with that disease? And there's also specificity, which is the true negative rate. What this means is that a case should only be placed in a category when it is supposed to go there; you don't want other cases accidentally going in. And that's one of the purposes of Bayes' theorem, which I have talked about elsewhere. Bayes' theorem allows you to combine data about sensitivity, specificity, and the base rates, that is, how common the thing is overall. Again, my goal here is not to show you the step-by-step details, but to give you an overall map. For more information on classification methods, we have a wide variety of courses using languages like Python and R that can walk you through the entire process. But the goal is the same: using automated methods to help you identify what you have, and, by placing it into relevant categories, helping you get more context and more value out of the data so you can provide better products and services.
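Here is a short sketch of a K-nearest neighbors classifier together with the three evaluation ideas just described, using scikit-learn and its built-in breast cancer dataset as a stand-in binary problem. The dataset and the choice of five neighbors are illustrative assumptions, not choices made in the course.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Binary classification with K-nearest neighbors (k = 5),
# held-out test data for evaluation.
X, y = load_breast_cancer(return_X_y=True)   # 1 = benign, 0 = malignant
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy, sensitivity (true positive rate), specificity (true negative rate).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy:   ", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
```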
Explainable AI
So when are the robot overlords arriving? Well, many years ago, I was listening to a talk show on the radio and a person called in to ask the person being interviewed if they ever thought that computers were going to get too powerful and take over. Well, his response surprised me. Essentially he said, "That happened a long time ago." And the evidence he gave for his claim was, "Do you know how many computer chips there are in your house? Do you know what they do? Do they wait for you before they do anything?" And the answer to these, of course, is no, I have no idea how many computer chips there are in my house, and I imagine that's true for most people. And so in one sense, when the chips are everywhere and we don't even know where they are and what they're doing, maybe they have taken over. Now the same general principle applies to artificial intelligence, or AI. It's everywhere. It's ubiquitous. AI is involved when you search the internet, shop online, get recommendations for TV shows or a song, use a smart thermostat, ride in a car that can steer itself, call your bank, and so on. It's happening all the time and its use is expanding exponentially in more and more domains. Now, there's a trick about this, and that is AI, artificial intelligence, is often opaque. Modern AI relies heavily on versions of neural networks, especially deep learning neural networks. They do amazing things, but it can be difficult, or even impossible, to trace the data flow through the network. And so they're called black box procedures. You've got this input that goes into it, something happens there in between, and then you've got an output that comes out. It's quick, it's effective, it does amazing things, but basically nobody knows exactly what's happening in there with the data. And that gives us some really serious challenges, especially when you look at laws like the CCPA, that's the California Consumer Privacy Act, and the GDPR, that's the European Union's General Data Protection Regulation. Those are very significant laws that require at least some level of explainability from algorithmic decisions. Also, if businesses are going to use or expand their use of AI procedures, they need to have some confidence in the algorithms and have some idea how they will function in new situations. And finally, the people who benefit from all of this, the people like us who use these computers to make decisions, need to have a certain level of trust if we're actually going to use them. Now, I'm just going to mention for a moment that my academic training is in research psychology, and trying to understand what's going on in an artificial brain so you can explain it sounds an awful lot like the job we have trying to figure out what's going on in a biological brain. The short version is, it's really complicated. But the good news is that data scientists created the goal of XAI, or explainable artificial intelligence, an approach that specifically focuses on machine learning and how you can get from the input to the output and understand what's going on in between, and it hopes to unite the astounding success of algorithms like neural networks with the legal and commercial and social need for understanding. Now, there are three main criteria for XAI algorithms. The first one is transparency.
Essentially this means that you can see the process happening, like going from a black box to a glass box, so you can see how the model parameters are extracted from training data and how labels are generated for the testing data. Next is interpretability. This is the ability to identify the underlying basis for decision-making in a way that's understandable to humans. It's also been described as the ability to determine cause and effect from machine learning models. And then the third one is explainability, which refers to how the features in a model contribute to a decision, or more specifically, what a given node in a neural network, the little circles here that are connected, what a given node in a neural network represents and how it contributes to the overall model, the output there at the bottom of this one. Now something that I really want to point out is this is aspirational. Explainable AI is a goal. It doesn't refer to any particular algorithm. There's not like a deep learning neural network, and then an explainable AI network. It's a way of describing things, and the trick here is that the goal of explanation can conflict with the goals of speed and accuracy. So it's the attempt to strike the balance. But there are several useful methods for approaching this. One, for instance, is the use of algorithms that are more transparent. So things like linear regression, that's usually easily interpretable, or a decision tree like we see here, or maybe collections of decision trees. Those are things that people can understand better, and they are still often very effective. There's also research in developing what you can call parallel algorithms, in that you have one algorithm that is doing the main work, but next to it, and at the same time, you have another algorithm that is working to explain and follow the process of what's going on in there. Think of it as an explainable version of generative adversarial networks, where you have two networks working simultaneously to produce a particular outcome. And then there's also the general approach of what's called variable importance, and a lot of measures use that. They may not be able to say specifically that this data point triggered this, but they say the model, in general, relies heavily on this data, less heavily on this data. Which gets to the idea that there are existing solutions, but they are generally proprietary versions of XAI. So for example, IBM has Watson OpenScale, and it can do a lot of this work. It has visualization and a short computer-generated text description of the variable importance in a model. It also has something called contrastive explanation, and that is the minimum change required in the data to change the predicted outcome. It's a nice way of thinking about it, and not surprisingly, there are similar offerings from other cloud computing providers. This is a very hot market and a lot of people are working in it. Then again, sometimes you just want to simplify things a little bit. So for instance, there's an algorithm called COMPAS, C-O-M-P-A-S, and that's a proprietary black box algorithm that is sometimes used for recidivism, or reoffending, risk predictions in, for instance, deciding whether somebody gets to go on parole or not, and the trick is you don't know what is in there because it's a black box of some kind, also because it's proprietary. 
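To make the variable importance idea concrete before returning to that example, here is a minimal sketch using permutation importance, one generic way to measure it: shuffle one feature at a time and see how much the model's performance drops. It assumes scikit-learn and simulated data; it is not how any particular commercial XAI product works.

```python
# Minimal sketch of "variable importance": rank features by how much scrambling
# each one hurts a fitted model. The dataset is simulated for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation importance: shuffle one feature at a time and measure the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: importance {score:.3f}")
```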
The people who made it aren't telling anybody what's in there, and the problem is that there are identified, well-known biases and really some issues with the COMPAS program. And so a collection of researchers from Berkeley, Harvard, and Duke developed something called CORELS, which is an acronym for Certifiably Optimal Rule Lists, and what this is, is like a decision tree, but they were able to develop a very short list of nested if/then statements that happen to match the performance of two-year recidivism risk predictions from the proprietary black box COMPAS algorithm, and theirs is only four lines. It says, first, if the offender's age is between 18 and 20 and they are male, then predict that they're going to reoffend. Well, that's easy. Else, if they are a little older, 21 to 23, and if they have two to three prior offenses, but sex is no longer in the equation, then predict, yes, they will reoffend. Or, a little simpler, if their number of prior offenses is greater than three, regardless of their age or sex, then predict yes. And anybody who's not covered by any of those, predict no. Four lines. Extremely effective and very, very interpretable. Now, the other problem is, it's also not really satisfactory, 'cause you think, man, there are a lot more things that should go into a sentencing decision. But what this does is to highlight that, if these four lines can match the predictive accuracy of some proprietary black box algorithm, then maybe we don't want to be using that particular algorithm. Maybe you want to try something that does capture a little more of the nuance of the things that are important in our lives. Now, what all of this means: it's a long road. We're not there yet, but the need for explainable AI is abundantly clear, and there's some promising progress towards the goal. Hopefully in the not-too-distant future, we'll see some advances that unite the extraordinary abilities of AI algorithms with the transparency, interpretability, and explainability that all of us need.
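For reference, here is that four-rule list written out as plain Python if/else logic. The thresholds follow the description above; this is only an illustration of how readable a rule list is, not the researchers' actual code.

```python
# The four-line CORELS-style rule list described above, as plain if/else logic.
def predict_reoffend(age: int, male: bool, priors: int) -> bool:
    if 18 <= age <= 20 and male:
        return True          # rule 1: age 18-20 and male -> predict reoffend
    elif 21 <= age <= 23 and 2 <= priors <= 3:
        return True          # rule 2: age 21-23 with 2-3 priors -> predict reoffend
    elif priors > 3:
        return True          # rule 3: more than three priors -> predict reoffend
    else:
        return False         # default: predict no

print(predict_reoffend(age=19, male=True, priors=0))   # True
print(predict_reoffend(age=40, male=False, priors=1))  # False
```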
Bias
Sometimes, despite your best efforts and intentions, your data science projects just go haywire and you get garbage out of them. Sometimes it's obvious, like a glitchy screen, but sometimes it's less obvious, such as when an algorithm is biased in ways that may not be immediately visible. I can give you a few classic examples. One is that several years ago, Microsoft created the Tay Twitter bot, which took only 12 hours to become a misogynistic, anti-Semitic conspiracy theorist and had to be taken down. Or there's the COMPAS sentencing software, which stands for Correctional Offender Management Profiling for Alternative Sanctions, and which gave inaccurate, racially biased recidivism predictions for black and white defendants. A report by ProPublica found that black defendants who in reality did not re-offend in a two-year timeframe were nearly twice as likely to be misclassified as higher risk compared to similar white defendants. It was 45% versus 23%. There's also PredPol, which stands for predictive policing, a system that predicted crimes at much higher than actual rates in primarily minority neighborhoods, leading to increased policing. And then another familiar one is that Google's online job ad system showed ads for much higher-paying jobs to men than it did to women. Now, these are, by this point in time, well-known errors and many of them have been responsibly addressed, but I want to point out that there are a couple of different sources for the errors, and the ones I want to talk about right now are, first, the technical glitches. So for instance, you have a training dataset with limited variability, and so the model can't go outside of that very well. It can't extrapolate very well. Also, sometimes you have statistical artifacts from small samples. If you're using an algorithm that relies on confidence intervals, but one of your groups is much smaller than the other, then their confidence interval is going to be much larger. And if you have a minimum criterion, like, for instance, you must have a predicted probability of repaying your loan that is at least this high, a smaller group just isn't going to make it as often as a larger group, simply because of the way confidence intervals are calculated. Also, maybe you're focusing on overall accuracy of your classification model and ignoring what's happening with subgroups. If you have a very rare disease, you can simply ignore its existence and be highly accurate overall, but everybody recognizes that would be a very serious problem. Each of these things can happen just as a matter of implementing the algorithm, or the math that goes behind it, without necessarily having some of the bigger bias problems. On the other hand, you may also commit some failures, things you should have known better about when conducting the research. So for instance, maybe there was a failure to gather diverse training datasets. It's incumbent on somebody who's creating a system to make the effort to get a wider range of data to use with their system. Second, maybe there was a failure to capture a diversity of data labels. You know, not everybody interprets the same thing the same way. I mean, I don't know whether you see these robots as cute or as scary, and the answer can differ from one person to another. So when you were labeling your data, deciding is this cute? Is this scary? Is it big? Is it small? Is it good? Is it bad?
You need to get a very wide range of people to provide that data. And then there's also the failure to use more flexible algorithms, something that can capture the nuance that's in the data, the exceptions, the outliers that can matter, especially when you're looking at relatively small groups and relatively rare outcomes. Also, there's the risk of what are called self-fulfilling prophecies. So let's take the job ad as an example here. If a woman is shown ads for lower-paying jobs, well, she'll probably apply for one of those lower-paying jobs. And then, by extension, get one of those lower-paying jobs. And what that does is that the fact that she has that lower-paying job now becomes a data point that goes into the next iteration of the algorithm, which says, "Aha, one more woman with a lower-paying job. We will show more of them." And so what happens is that the algorithms actually have the possibility of creating the reality that they're trying to predict. Now, mind you, this is not blaming the victim, but it does let you know that when you are creating the algorithm, you have to find ways to get past some of these self-fulfilling prophecies or even vicious cycles. There are a few things that you can do. Number one, you can deliberately check for biased output. Compare it to some gold standard. Are you in fact showing men and women jobs that pay the same amount of money? Are you in fact getting output from more than just, say, English-speaking, upper-middle-class people in the United States, but from a much broader group? You can check that with your data. Also, when you're developing something that can have implications for lots of different groups, consult with all the parties. Do some focus groups, talk to people, and see how they see the results of your algorithm. And then finally, include diversity. Again, this means make a deliberate effort to include a broad range of people and of circumstances in your training data, in the labels that you have, and in the way that you develop the algorithms. Diversity can make such a difference. Demographic diversity, worldview diversity, technical diversity; any number of these can make your algorithm more robust and applicable to what is really a very broad world. Finally, if you'd like more information on this, you can consult the course, "AI Accountability Essential Training," which addresses issues of bias in algorithms specifically.
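As a minimal sketch of that first step, deliberately checking for biased output, here is one simple way to compare a model's favorable-prediction rate across groups. It assumes pandas, and the group labels and predictions are invented for illustration; a real audit would use a proper fairness metric and much more data.

```python
# Minimal sketch of a bias check: compare the rate of favorable predictions by group.
import pandas as pd

results = pd.DataFrame({
    "group":     ["men", "men", "men", "women", "women", "women"],
    "predicted": [1, 1, 0, 0, 0, 1],   # 1 = shown the high-paying job ad
})

# Rate of favorable predictions by group; a large gap is a flag to investigate
rates = results.groupby("group")["predicted"].mean()
print(rates)
print("gap:", rates.max() - rates.min())
```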
Artificial intelligence
The HUMAN MIND seems to work in MYSTERIOUS ways, and sometimes CONCEPTUALLY and EMPIRICALLY DISTINCT phenomena seem to occupy the same cognitive space and as a result can get muddled up in the process. That seems to be the case for DATA SCIENCE and for ARTIFICIAL intelligence, which are sometimes TREATED as SYNONYMS. But before I compare and contrast the two fields, 'cause there are DIFFERENCES, I want to mention a few things about the NATURE of categories and definitions. The first thing is that CATEGORIES are CONSTRUCTS; they're not things that exist out there in the world, but they are ways of thinking about things. So they're CONSTRUCTED mental cognitive phenomena. You put them together, which means they can be put together in different ways. Second, categories and DEFINITIONS SERVE FUNCTIONAL PURPOSES. They don't exist for their own personal satisfaction; somebody created them because they allowed them to accomplish a particular task. And the final thing is that the USE of CONSTRUCTS varies BY NEED. The idea here is that maybe your CONSTRUCTS NEED to change or be REFRAMED depending on what you're doing at the moment, 'cause there's not an INHERENT, INESCAPABLE TRUTH to them, but again, they are conveniences, they are manners of speaking. And this whole thing about constructs and definitions makes me think about the question of whether TOMATOES are FRUITS or VEGETABLES. Now everybody knows that tomatoes are supposed to be fruit, but everybody also knows that you would never put tomatoes in a fruit salad; instead they go on a vegetable plate with the carrots and celery. Now the answer to this PARADOX is actually simple. Fruit is a botanical term, vegetable is a culinary term. They're not parallel or even very well coordinated systems of categorization, which is why confusion like this can arise. Also, anybody who's ever tried to organize their music or their movies knows that categories are shifty things. There are dozens, hundreds of categories of hip hop music, as well as opera, heavy metal, or what have you. Long ago, I decided that instead of trying to identify some sort of intrinsic essence, the true category of the music, it was best to simply GIVE CATEGORIES to THINGS that I wanted to HEAR TOGETHER, regardless of how other people thought of them or even what the artist thought. It WAS a FUNCTIONAL CATEGORY for me. And that gets us back to the question of data science and artificial intelligence. These are FUNCTIONAL CATEGORIES. And so let's go back to what we even mean by artificial intelligence. Well, there's a joke that it simply means whatever a computer can't do yet, that's intelligence. Obviously that's a joke, because COMPUTERS are always LEARNING how to do new things. People SET a STANDARD, the computer achieves it, and then they say, well, that's not really intelligence, it's something else. Another way to think about it is that artificial intelligence is when COMPUTERS are ABLE to ACCOMPLISH TASKS that normally REQUIRE HUMANS to do them. Now, what's interesting about that is that these TWO ELEMENTS, whatever a computer can't do and tasks that require humans, really go back to the 50s, the first major boom of artificial intelligence, when researchers were trying many different approaches to have computers do the work of humans. Many of those approaches were based on EXTENSIVE CODING of EXPERT KNOWLEDGE and DECISION PATHS. Think of enormous decision trees. That approach more recently became known as good old-fashioned artificial intelligence, or G-O-F-A-I, GOFAI.
The approach was promising for a little while, but it ultimately faded when the magnitude of the task became apparent and researchers realized that the work they had done didn't have the flexibility needed for what they were hoping for, which is some sort of true general intelligence. And so more recently, artificial intelligence has come to refer to programs, or algorithms, or sequences of equations or computer code, that CAN LEARN FROM the DATA. Now, some of these are very simple approaches and some of them are extraordinarily sophisticated, but they allow the COMPUTERS to do things that, again, normally humans would have done, and the computers can get better and better at it. Some examples of this include classifying photos without human assistance, translating text or even spoken language from one language to another, or mastering games like Go or Chess or other games that people thought a machine would never be able to do. And so this last one, that is, a PROGRAM that can LEARN FROM DATA, is probably the best working definition of artificial intelligence. And while it can include very simple models, a REGRESSION MODEL for example, it usually refers to TWO APPROACHES in particular, MACHINE LEARNING ALGORITHMS as a general category and DEEP LEARNING NEURAL NETWORKS as a particular instance. I'm going to say more about each of those elsewhere, but I did want to bring up one more important distinction when talking about AI. And that is the difference between what is called Strong or General AI, where you want to have a replica of the human brain that can solve any task, and that's the thing that we normally think of in science fiction, that computer that can talk to you and intuit all sorts of things. That was the original goal of artificial intelligence research back in the 50s, but it ended up being really unworkable. Instead, when researchers refocused from trying to create a general-purpose mechanical brain to what is sometimes called WEAK or NARROW AI, that is, ALGORITHMS that FOCUS on a specific WELL-DEFINED TASK, there was enormous growth. It turned out that this focus, the specificity, is what made the explosive growth of AI possible. Now let's get back to our original question. How does artificial intelligence compare or contrast with data science? Well, it's a little like the fruit versus vegetable conundrum. In terms of artificial intelligence, again, think that it means algorithms that learn from data, broadly, machine learning. Now there are very wide takes on this; lots of very smart people INTERPRET these things DIFFERENTLY, and they will say, no, no, they're absolutely different, or we know one is subsumed in the other, and that lets you know, again, there's not an intrinsic essence here. These are functional constructs, many different ways of thinking about them. But MACHINE LEARNING is a COMPUTER PROGRAM that can LEARN FROM the DATA and LEARN to do TASKS on ITS OWN, like classifying photos. And then Data Science generally REFERS to SKILLS and TECHNIQUES for DEALING with CHALLENGING DATA. Now it happens that a lot of CHALLENGING data is INVOLVED in AI, and so SOMETIMES PEOPLE SAY that DATA science is a SUBSET of AI, OR AI is a SUBSET of DATA science, but I like to think of it being a little more like this: DATA SCIENCE is a very BIG, broad TERM. Machine Learning overlaps a lot with Data Science, but you can have machine learning that doesn't incorporate what we normally think of as data science.
Neural networks kind of overlap with both of them, but they also kind of do their own thing. And then in my mind, my personal take on this is that AI is this fuzzy little category that overlaps them and is off to the side. Again, the idea is these are constructs, these are ways of thinking about things that serve particular purposes, but in all of them, what we're trying to do is use computer programs to help organize and analyze and get insight out of data to solve novel problems.
The data science pathway
The insights you get from data science can feel like a gift to your business, but you don't get to just open your hands and have it delivered to you with a bow on it. Really, there are a lot of MOVING PARTS and THINGS that have to be planned and coordinated for all of this to work properly. I like to think of data science projects like walking down a pathway where each step gets you closer to the goal that you have in mind. And with that, I want to introduce you to a way of thinking about the DATA SCIENCE PATHWAY. First, it BEGINS with planning your PROJECT. You first need to DEFINE your GOALS. What is it that you're actually trying to find out or accomplish? That way, you can know when you're on target or when you need to redirect a little bit. You need to ORGANIZE your RESOURCES, and that can include things as simple as getting the right computers and software, accessing the data, and getting people and their time available. You need to COORDINATE the work of those PEOPLE, because data science is a TEAM EFFORT; not everybody's going to be doing the same thing, and some things have to happen first and some happen later. You also need to SCHEDULE the PROJECT so it doesn't expand to fill up an enormous amount of time. Time boxing, or saying we will ACCOMPLISH this TASK in this amount of TIME, can be especially useful when you're working on a tight timeframe or you have a budget and you're working with a client. Second, after planning, the next step is going to be WRANGLING or PREPARING the DATA. That means you need to first get the data. You may be gathering new data, you may be using open data sources, you may be using public APIs, but you have to actually get the raw materials together. The next one, step six, is CLEANING the DATA, which actually is an ENORMOUS TASK within data science. It's about getting the data ready so it fits into the paradigm, for instance, the program and the applications that you're using, so that you can process it to get the insight that you need. Once the data is prepared and it's in your computer, you need to EXPLORE the DATA, maybe making VISUALIZATIONS, maybe doing some numerical summaries, a way of getting a feel for what's going on in there. And then based on your EXPLORATION, you may need to REFINE the DATA. You may need to recategorize cases. You may need to combine variables into new scores, any of the things that can help you get it prepared for the insight. Third, the next category in your data pathway is MODELING. This is where you actually create the STATISTICAL MODEL. You do the LINEAR REGRESSION, you do the DECISION TREE, you do the deep LEARNING NEURAL NETWORK, but then you need to VALIDATE the MODEL. How well do you know that this is going to generalize from the current dataset to other datasets? In a lot of research, that step is left out, and you often end up with conclusions that fall apart when you go to new places. So, VALIDATION'S a very important part of this. The next step is EVALUATING the model. How well does it fit the data? What's the return on investment for it? How usable is it going to be? And then based on those, you may need to refine the model. You may need to try processing in a different way, adjust the parameters in your neural network, or get additional variables to include in your linear regression. Any one of those can help you build a better model to achieve the goals that you had in mind in the first place. And then finally, the LAST PART of the DATA PATHWAY is applying the MODEL.
And that includes PRESENTING the MODEL, showing what you learned to other people, to the decision-makers, to the invested parties, to your client, so they know what it is that you've found. Then you deploy the model. Say, for instance, you created a recommendation engine. You actually need to put it online so that it can start providing those recommendations to clients, or you put it into a dashboard so it can start providing recommendations to your decision-makers. You will eventually need to REVISIT the MODEL, see how well it's performing, especially when you have new data and maybe a new context in which it's operating. And then you may need to revise it and try the process over again. And then finally, once you've done all of this, there's the matter of ARCHIVING the ASSETS. Really, cleaning up after yourself is very important in data science. This includes DOCUMENTING where the data came from and how you processed it, it includes commenting the code that you used to analyze it, and it includes making things FUTURE-PROOF. All of these together make the project easier to manage and easier to calculate the return on investment for. Taken together, those steps on the pathway get you to your goal. It could be an amazing view at the end of your hike, or it could be an amazing insight into your business model, which was your purpose all along.
Roles and teams in data science
Data science is fundamentally a team sport. There are so many different skills and so many different elements involved in a data science project that you're going to need people from all sorts of different backgrounds and with different techniques to contribute to the overall success of the project. I want to talk about a few of these IMPORTANT ROLES. The first one is the DATA ENGINEERS. These are the developers and the system architects, the people who FOCUS on the HARDWARE and the SOFTWARE that make data science possible. They provide the foundation for all of the other analyses. They focus on the speed and the reliability and the availability of the work that you do. Next are machine LEARNING SPECIALISTS. These are people who have extensive backgrounds in computer science and in mathematics, they work in deep learning, they work in artificial intelligence, and they're the ones who have the intimate understanding of the algorithms and understand exactly how they're working with the data to produce the results that you're looking for. In an entirely different vein are people who are researchers, and by that, I mean TOPICAL RESEARCHERS. They focus on DOMAIN-SPECIFIC research; for instance, physics and genetics are common, so is astrophysics, so is medicine, so is psychology. And these kinds of researchers, while they CONNECT with DATA SCIENCE, are usually better versed in the design of research within their particular FIELD and in doing common statistical analyses. That's where their expertise lies, but they connect with data science in that they're trying to find the answers to some of these big-picture questions that data scientists can also contribute to. Also, any business doing its job has ANALYSTS. These are people who do the day-to-day DATA TASKS that are necessary for any business to RUN EFFICIENTLY. Those include things like web analytics, SQL, that's structured query language, data visualizations, and the reports that go into business intelligence. These ALLOW people to MAKE good business DECISIONS; they let you see how you're performing, where you need to reorient, and how you can better reach your goals. Then there are the MANAGERS. These are the people who MANAGE the ENTIRE data science project, and they're in charge of doing a couple of very important things. One is they need to frame the business-relevant questions and solutions. So they're the ONES WHO have the BIG PICTURE. They know what they're trying to accomplish with the project. And then they need to keep people on track and moving towards it. And to do that, they don't necessarily need to know how to build a neural network, they don't need to make the data visualizations, but they need to speak data so they can understand how the data relates to the question they're trying to answer, and they can help take the information that the other people are producing and put it together into a cohesive whole. Now, there are people who are ENTREPRENEURS, and in this case, you might have a data-based startup. The trick here is you often need all of the skills, including the BUSINESS ACUMEN, to make the business run well. You also need some great CREATIVITY in PLANNING your PROJECTS and in the EXECUTION that gets you towards your entrepreneurial goals. And then there's the UNICORN, also known as the rockstar or the ninja. This is a full-stack DATA SCIENTIST who can do it all and do it at absolute peak performance.
Well, that's a nice thing to have. On the other hand, such a person is very RARE, which is why we call them the unicorn. Also, you don't want to rely on one person for everything. Aside from the fact that they're hard to find and sometimes hard to keep, you're only getting a SINGLE PERSPECTIVE or APPROACH to your business questions, and you usually need something more diverse than that. And what that suggests is the common approach to getting all the skills you need for a data project, and that is by TEAM. You can get a unicorn by team, where you get the people who have all the necessary skills, from the foundational data engineer, to the machine learning specialist, to the analyst, to the managers, all working together to get the insight from your data and help your project reach its greatest potential in moving your organization towards its own goals.
Interpretable methods
I have a sister who went to culinary school and taught me the important difference between the terms cooking and baking. Now, cooking is a general term, but it usually refers to heating food on a stove, like when you make stew. The thing about cooking, at least by this definition, is that you can easily change what the ingredients are and the quantities; you can just throw stuff in to taste. As long as you're not doing haute cuisine, it's pretty straightforward, and it's not hard to tell how things are going to end up. Baking, on the other hand, is a little more complicated. If you want to bake a cake, you have to be very precise with your ingredients, how they're combined, their temperature, and the time. You can't just improvise your way to a chocolate souffle, even when you have a pretty good understanding of kitchen chemistry. Baking is inherently a more opaque process than most kinds of stovetop cooking. And as you might guess, there is an analogy here to data science. Specifically, when it comes to choosing algorithms to use in your projects, some are easier to interpret. That includes methods like contingency tables, with the rows and columns that simply give the frequencies or the means for different combinations, or t-tests that are used in the A/B testing that's really common in web design, where you're simply comparing the means of two different groups, or correlation coefficients, or linear regression, where you say for each point on this variable, add X number of points, subtract this number for that, and so on and so forth, and then decision trees. All of these are easy, at least in theory, to understand and to interpret, and you can usually have a pretty good idea of how it's going to work out. It makes me think of a single tree out on a hill. It's easy to see what it is. A tree is a conceptual thing. It's not complicated in theory. Now that said, linear regressions and decision trees can get enormous and very complex, but the theory behind them is pretty straightforward. On the other hand, there are other algorithms in data science that are a little more like jungles or rainforests, and they are harder to interpret. That includes, for instance, regression with interactions, where you no longer know immediately how a change will affect things, or random forests, that is, large collections of decision trees, where you could visualize an individual tree, but knowing how they're all going to act together is a little more complicated, or machine learning algorithms like support vector machines. And then, of course, neural networks are really our best possible example of a complex and opaque, difficult-to-understand procedure, but one that works very, very well. Now, what I'm saying here, that some are easier to interpret and some are harder to interpret, is not exactly the same thing as Occam's razor, where the simplest explanation is taken as the one that's most likely to be true. That is a principle that I strongly support, and I encourage people to use the simplest method wherever possible. Also, this easier and harder to interpret isn't quite like the less-is-more minimalism of architect Mies van der Rohe that I love so much. Instead, it's a little more like going from a black box, where it's hard to know exactly how the data is processed. Neural networks use a lot of very non-linear transformations of the data, and it's hard to predict how the input data will combine for the output.
A black box, while it can be extremely effective, can be hard to interpret and hard to apply in a new situation, so the goal is moving to a glass box, a clear box. These are methods where you can see how the data is processed, even if there's a lot going on in there, and, theoretically, it's easy to predict the outcomes from the input. And so when a clear method, an easy-to-interpret one, works, there should be a preference for it. On the other hand, there's also the idea that there is necessary complexity. Circuit boards are complicated, but they need to be complicated in order to accomplish their purpose. But ultimately, your analysis is like a map that helps your organization get from point A, that's where you currently are, to point B, that's your goal, your intended outcome. And when possible, choose a method that helps you best understand how to get there, so you can reach your goal, so you can deal with any potential setbacks, and so you can know when you're there. It helps you work more efficiently and more effectively when you are able to interpret your algorithms.
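As a minimal sketch of a glass-box method, here is a linear regression whose coefficients can be read directly; the data is simulated with a known relationship, so the numbers are purely illustrative.

```python
# Minimal sketch of an interpretable "glass box" model: linear regression,
# where each coefficient can be read off and explained. Data is simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                               # three predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)    # known relationship plus noise

model = LinearRegression().fit(X, y)

# Each coefficient says: for one unit of this variable, add this much to the prediction
print("intercept:", round(model.intercept_, 2))
print("slopes:   ", np.round(model.coef_, 2))               # roughly [2.0, -1.0, 0.0]
```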
Time-series data
When you go out hiking in a new place, it's nice to have a path to help you get where you're going. It's also nice to know if that path is actually going to take you someplace you want to be. As a data scientist, your job is to figure out what path the data is on, so you can inform decisions about whether to stay on the current path or whether changes need to be made. The most basic way to do this is to take your data over time and make a line chart. It's a graph of those changes over time; connect the points to make a clear line, and maybe you've got something that's interpretable right off the bat. On the other hand, when you're looking at changes over time, one of the things you need to be aware of is autocorrelation. This is a situation where the value at each point in time is influenced by the previous values, more or less. So, it doesn't suddenly go from being 80 degrees one day to 20 degrees the next day to 112 the next day; the temperature of each day is associated with what the temperature was the day before. And in most time series, what you're looking for is consistency of changes, say, for instance, on a percentage basis. You may also be trying to figure out what the actual function is, the shape of the curve. So you're trying to get a mathematical formula, a function that describes the curve. It might be cyclical, something that has a seasonal variation. It might actually be several functions at once that you have to combine, but I can show you some of the simplest ones. The easiest one by far is linear growth, where the same quantity is added every time period, like dollars per hour or per year for an employee. Or you can have exponential growth, where instead of going up the same number of units at each step, it goes up the same percentage at each step. If you're dealing with, say, for instance, the number of followers on social media, at a certain point it's going to look like this. If you're looking at the growth of a stock, it usually looks something like this. You can also have logarithmic trends that rise rapidly at first but slow down as they approach a ceiling, like operating at a hundred percent capacity, if the initial growth is fast. Or a sigmoid, a logistic function, which is used in a number of algorithms; it starts slowly, it accelerates, then tapers off as limits are reached, such as markets reaching saturation for a new product. And then there's a sinusoidal or cyclic sine wave. This is something that goes up and down over time, again, like seasonal variation in the temperature, or spending on trips to the beach over time. And there are several options for looking at these variations in trend analysis. They include methods like time series decomposition, ARIMA models, and even neural networks, where you're looking for substantial and perhaps even qualitative changes over time, like a flock of birds that's moving, perching, moving, and so on. But let's look at the first method, where we're going to try to break things down into their elements and see what's happening with your data over time. The first method is decomposition. This is where you take a trend over time and try to break it down into its several constituent elements. That includes an overall trend, say, for instance, a stock price that's generally uphill, but it can also include a seasonal or cyclical trend as well as some random noise.
And so here's a graph that's looking at a stock market index over time, from 1990 up to about 2017. And you can see it's generally uphill and it's got a lot of bounciness going on in there. Well, we can do a decomposition. What that does is it takes our dataset, this is a compressed version of what we saw just a second ago, and it smooths it out to give us an overall trend. You can see it's basically going from the bottom left to the top right. It's uphill, it's got a few bumps in there, but it's generally uphill. The third section is the seasonal trend. That's looking at how, every year, it seems to go up a certain amount, back down, then up, and then back down. Now, there are certain situations where this is going to make a lot of sense. It depends on what the stock market index is actually tracking and how things react, in terms of how interpretable this method is. But you can see that there is a seasonal element to it. And then the last one is the random element, which says, once you take out this smooth general trend and you take out the seasonal trend, this is what's left over, and you can see it goes from about a thousand points down to about a thousand points up. And so it's something that matters, but if you go back to the top one, you can see that those are just the little squiggly variations on it. This is a way of decomposing the trend, breaking it down into constituent elements, to see the overall trend, the cyclical part, as well as the noise that's left over. A more advanced approach is something called an ARIMA. That stands for autoregressive integrated moving average model. It's a mouthful, which is why people normally just say ARIMA. The autoregressive, or AR, part means that later values in the time series are predicted by earlier or lagged values. The I in ARIMA is the integrated part, which means the raw values are replaced by differenced values, the changes from one time point to the next. And then the MA at the end means that regression errors are modeled as linear combinations of current and previous error values. That's the moving average part. Now, there are a lot of variations on this. There's ARMA, which doesn't have the I for the differencing. There's SARIMA and SARIMAX, which include seasonal components, among others. But ARIMA is a great way of breaking down the data by looking back, looking over your shoulder, to see what happened in the past, to try to predict what's going to happen in the near future. So here's a pretty well-known dataset. It's about monthly international air passengers, going from about 1948 up to about 1963 or '64. And so this is when air travel was still pretty uncommon, but you can see that it is generally uphill, and it also looks like there's a big seasonal trend. Now, the thing that you don't want to do is just run a regression line through the data, because you're missing a huge amount. You can get the overall trend, if you have to describe it in three words, but you're missing a lot. And so what I did instead is separate the last three years from the earlier years, use an ARIMA model, actually a SARIMA, because it has this seasonal component, to estimate the trend over time, and then project into the future. And you have both the confidence intervals there in the gray, as well as a very close match between the predicted values and the observed values, which are shown in the red and the blue. And so it's able to capture both the general upward trend as well as the seasonal variation.
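Here's a minimal sketch of both steps, a decomposition and a seasonal ARIMA, assuming the statsmodels library is available; the monthly series is simulated rather than the stock or airline data shown in the video, and the model orders are illustrative rather than tuned.

```python
# Minimal sketch: decompose a monthly series and fit a seasonal ARIMA (SARIMA).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Simulated monthly data: upward trend + yearly cycle + random noise
idx = pd.date_range("1990-01", periods=120, freq="MS")
values = (np.arange(120) * 2
          + 10 * np.sin(np.arange(120) * 2 * np.pi / 12)
          + np.random.default_rng(0).normal(scale=3, size=120))
series = pd.Series(values, index=idx)

# Decomposition: split into trend, seasonal, and residual (random) components
parts = seasonal_decompose(series, model="additive", period=12)
print(parts.trend.dropna().head())

# SARIMA: autoregressive + differencing + moving average, with a 12-month seasonal part
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(model.forecast(steps=12))   # projection for the next year
```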
So when you're working with time series data, some variation of an ARIMA model is almost guaranteed to be part of your approach to understanding what's happening with that data. On the other hand, you can also use neural networks, a tool of choice in the data science world these days. When you're working with time series data, two of the most common choices are a recurrent neural network, or RNN, and a multilayer perceptron, or MLP. And because these are neural networks, you have an input layer, you have one or more hidden layers, and you have an output layer that says what the predicted values are going to be for that point in time. Now, the thing about this is that neural networks use non-linear activation functions, which, again, makes it possible for them to capture some unusual variation and trends in the data. So here's the same dataset with airline passengers over time, and in the black, up to about 1960, that's the training data. And then I actually tried three different versions of neural networks. This one is in fact an MLP, a multilayer perceptron, but you can see the very close alignment of the predicted and the observed values in the red and the blue, where a neural network is able to get something that's even more precise than what we had with the ARIMA model. And again, because you so often want to have multiple perspectives, it would be worth your while to do a decomposition, and an ARIMA model, and maybe some kind of neural network, to try to get multiple views on the same phenomenon and give you more confidence in what you're doing. But what all of these approaches have in common is that they allow you to take what's behind you, and they give you confidence to move forward.
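And here is a minimal sketch of the MLP idea, assuming scikit-learn: turn the series into lagged inputs, train on the earlier portion, and predict the held-out end. The simulated series, the number of lags, and the layer sizes are arbitrary choices for illustration, not a tuned model.

```python
# Minimal sketch: a multilayer perceptron (MLP) predicting the next value of a
# series from its previous 12 values. The series is simulated for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(200)
series = t * 0.5 + 10 * np.sin(t * 2 * np.pi / 12) + rng.normal(scale=1, size=200)

# Turn the series into supervised data: 12 lagged values -> the next value
lags = 12
X = np.array([series[i:i + lags] for i in range(len(series) - lags)])
y = series[lags:]

# Train on everything except the last 24 points, then predict those held-out points
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X[:-24], y[:-24])
print(model.predict(X[-24:]))   # compare against y[-24:], the observed values
```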
Algebra
When you're working with data, it's not too hard to come up with a solution when you have only one problem at a time, and it's basically stationary. But it's a whole different story when you have thousands of problems stampeding at you at the same time. In that case, you need a more flexible approach, and this is where algebra can come to the rescue. There are two reasons that it's important to understand algebra in data science. Number one is that it allows you to scale up. The solution you create to a problem should deal efficiently with many instances at once; basically, create it once, run it many times. And the other one, closely related to that, is the ability to generalize. Your solutions should not apply to just a few specific cases with what are called magic numbers, but to cases that vary in a wide range of arbitrary ways, so you want to prepare for as many contingencies as possible. And so we'll start with the basic building blocks of data science, which is elementary algebra. An algebraic equation looks like this; this in fact is a linear regression equation, but what we're doing is we're using letters to stand in for numbers. That actually is an interesting thing, because a lot of people think that people who work in mathematics and work in data work with numbers all the time. We actually work with variables, and so this is the association between variables. So let's start right here. Here on the far left, y is the outcome, and the subscript i means it's for case i, person or observation i; that could be one or two or a thousand or a million. Right next to that is a Greek letter, a lowercase beta, like a B, and it has a subscript zero because it is the Y-intercept. It's sort of the starting value before we add anything else. Next to that is another beta, but this time with a sub one. This is a regression coefficient, and it's the slope that we use for the first predictor variable, X1, and we're going to multiply it by the value of X1 for case i. Then we do similar things with the second regression coefficient and the second predictor variable, then the third regression coefficient and the third predictor variable. And here at the end, we have an epsilon, and this stands for error. It means how far off our prediction is from the actual value for each person, one at a time. And as wonderful as a regression equation like that is, the power of computing comes in when we go past a single dimension to the rows and columns of a matrix; that's how your computer likes to see math and how it processes it. This is what's known as linear algebra. It works with matrices and vectors. Over here on the far left is a vector that has all of the outcome scores for each case. In this situation, there are only two: there's y sub one for the first person and y sub two for the second. If there were a hundred people, we would have a hundred numbers all arranged vertically. If there were a million, we'd have a million numbers arranged vertically. Those are the outcomes, and they're in a vector. Right next to this is all of the scores. This is a matrix, because it has both rows and columns, and it contains the data for each individual person. Next to that is another vector, which has the regression coefficients, written again with the beta, and we have the intercept at the top and then each of the three slopes.
And then we finish with a vector of error terms for each individual person. There are only two in this case, but there could be a thousand or a million. Let me fill this in with some numbers so you can see how it works in practice. We're going to estimate the salary for a person working in data science, and this is actually based loosely on real data from a few years ago. Let's say we have two people: the first one has a known salary of $137,000, and the other one has a salary of $80,000. And what we're going to do is see how we can run that through this matrix equation, this linear algebra, to estimate those numbers. The first thing here is we have their data in the matrix on the left and the regression coefficients on the right. This first one is the intercept; everybody gets a one, because it's multiplied by $50,000, the starting value that everybody gets. Next to that is a number that indicates years of experience. So, this person has nine years, and for each year, we estimate an additional $2,000 in salary. Next to that is a number that indicates negotiating ability on a one-to-five scale, where one is a very poor negotiator and five is a very strong negotiator. For each step up, we predict a $5,000 increase in annual salary. This person has a three, right in the middle, so we would add $15,000 onto their expected salary. This last one is an indicator variable, zero or one, where zero is no and one is yes, to say whether this person is a founder or an owner of the company. If they are, then we would expect them to make about $30,000 more per year; that's reasonable. And when you put all of these things together, we predict $113,000 for this first person. Now, they actually had $137,000, so we're off by a little bit, and that's the error term. That doesn't necessarily mean we messed up; we only have three variables in the equation. We didn't put down where they live, we didn't put down what sector they're working in, we didn't put down what kinds of client projects they have, though those things would influence it as well. But this shows you how you can use these vectors and matrices to do the algebra for an important question like estimating salary. Now, one of the neat things about matrix notation is that it's like packing a suitcase: you can go from lots of different individual things to a much more compact and maneuverable and agile package. This is matrix notation, and it's the same information that I showed you just a moment ago, except now, you see how it's in bold. Each of these symbols now stands for an entire vector or matrix. The Y on the left stands for every single outcome score for every person in our data. The X is the matrix of all of the predictor data, the bolded beta is the vector of all of the regression coefficients, and the epsilon over here is the vector of all the error terms. So it's very compact. And this is the way computers like to deal with the information; it makes it much easier for them to manipulate it and get the things that you're looking for. Now, even though you're going to be using the computer to do this, there are a couple of reasons you want to be aware of how the algebra functions. Number one is that it allows you to choose procedures well. You can note which algorithms will work best with the data that you currently have to answer the questions that motivated your project in the first place. And the second one is that it can help you resolve problems.
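Here's a minimal sketch of that salary example as matrix arithmetic in NumPy, using the intercept and slopes from the walkthrough; the founder indicator for the first person is filled in as 1 so the prediction matches the $113,000 described above, which is an assumption about the on-screen numbers.

```python
# The salary example as linear algebra: y_hat = X @ beta.
import numpy as np

# Columns: intercept (always 1), years of experience, negotiating ability (1-5), founder (0/1)
X = np.array([[1, 9, 3, 1]])                        # the first person's data

beta = np.array([50_000, 2_000, 5_000, 30_000])     # intercept and the three slopes

y_hat = X @ beta                                    # matrix multiplication
print(y_hat)                                        # [113000] -> the predicted salary

actual = np.array([137_000])
print(actual - y_hat)                               # [24000] -> the error term (epsilon)
```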
Things don't always go as planned and you know what to do when things don't go as expected, so you can respond thoughtfully and get the insight and the actionable steps you need out of your data.
AutoML
Working with data can be challenging under the best of circumstances, and there's a lot of thankless work that goes into it. For example, there's the common saying that 80% of the time on any data project is spent getting the data prepared, and that certainly matches my experience. The data preparation tasks involve things like converting categorical features or variables to a numerical format, dealing with missing data, rescaling the data, or the complicated procedures of feature engineering, feature extraction, and feature selection, which are central to building a machine learning model. Also, when you're doing machine learning, there is the difficult matter of hyperparameters. These are the settings that are used for the various algorithms, so they're like the knobs and the switches you have to set before you can actually have the data analyzed. Now, sometimes these are pretty simple. For linear regression, maybe it's just the alpha level, or the false positive rate, in hypothesis testing. For K-nearest neighbors, it's usually just the number of neighbors to consider. But for algorithms like deep learning, there can be many more, like the number of hidden layers, the number of units per layer, the learning rate, the dropout rate, the number of epochs, and so on. There's also the matter of how you set up validation, and all of these make a difference. They can affect the performance of the algorithm, and they can also affect the reproducibility of the algorithm. That's particularly important because when people publish the results of these models, or when they share them in some way, they usually don't tell you what all the hyperparameters are. And so it becomes very difficult to get the same results with the same data, or even just to know which settings to use. And so there are several different methods for optimizing the selection of hyperparameters. That can include methods like grid search, random search, Bayesian optimization, gradient-based optimization, and evolutionary optimization, among others. It just lets you know it's an important and complicated task. And so people have spent a lot of time trying to figure out how to respond to this. And the nice thing is there are some helpful approaches that automate elements of setting up the analysis. Now, there are two very general categories, and there's overlap between them. The first is open source solutions, which is what I want to mention right here. These are applications and packages of code that work with Python, R, and other languages that are frequently used in data science. There are also proprietary solutions. These are specialized commercial applications that fall under the general rubric of machine learning as a service, and I'll talk about those in another video, but I want to mention some of the common players in AutoML and the open source world. Probably the most important is auto-sklearn. This is an open source tool implemented in Python, built around the scikit-learn library, probably the most common approach to machine learning around. There's AutoKeras, which is of course built on Keras, also used for deep learning. AutoGluon is an open source AutoML toolkit developed by Amazon, and then there's Neural Network Intelligence, or NNI. This is an open source toolkit developed by Microsoft that performs things like efficient neural architecture search and hyperparameter tuning, and there are others.
There's H2O's AutoML, there's the AutoML package for R, there's TransmogrifAI, and a selection of others. But what all of them do is get you through some of the more tedious and time-consuming parts of machine learning, so that you can develop models that are more accurate, more robust, more repeatable, and more useful, getting you the insight you need from your data. That's the whole point of conducting the project in the first place.
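To show the kind of work these tools automate, here's a minimal sketch of a plain hyperparameter grid search done by hand with scikit-learn; it is not any particular AutoML package, and the simulated data and the parameter grid are invented for illustration.

```python
# Minimal sketch of hyperparameter tuning: try several settings with cross-validation
# and keep the best one. AutoML tools automate this (and much more) for you.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The hyperparameter being tuned here: how many neighbors to consider
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11, 21]},
    cv=5,                      # 5-fold cross-validation for each setting
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```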
Calculus
You may have the best product or service in the world, but if you want to get paid, you've got to make the sale, and you've got to do it in a way that's profitable for you. Surprisingly, calculus may be one of the things to help you do just that. The idea here is that calculus is involved anytime you're trying to do maximization and minimization, when you're trying to find the balance between disparate demands. Let me give you an example of how this might work. Let's say that you sell a corporate coaching package online, that you currently sell it for $500, and that you have 300 sales per week. That's $150,000 of revenue per week. But let's say that based on your experience with adjusting prices, you've determined that for every $10 off of the price, you can add 15 sales per week. And let's also assume, just for purposes of this analysis, that there's no increase in overhead. So the idea here is you can change the sales by adjusting the price. But where are you going to have the maximum revenue? Well, let's start with a simple thing, the formula. Revenue is equal to the price times the number of sales. That's easy. Now, just a second ago, I said that the price is $500, but for every $10 of discount, you can change sales. So the price is $500 minus $10 times d, where d is the unit of discount. And then next to that is sales. Currently you're having 300 sales per week, but for each unit of discount, you can add 15 more sales. Okay, we've got an equation. And if you multiply this through, go back to high school algebra, then what you get is -150d squared plus 4,500d plus 150,000. And this is the thing that we can use to maximize the revenue. This is where calculus comes in. What we're going to do is take the derivative of this formula. Now, this one actually wouldn't be too hard to do by hand, or you can just drop it into an online calculator and it'll do it for you. But the derivative is what's going to help us find the best discount for maximizing revenue. So if we take the derivative, it's -300 times (d minus 15). All right, we want to find out when this is equal to zero, because that lets us know where the maximum of the curve is. So we set it equal to zero. We divide both sides by -300, and that just cancels out. And then we add 15 to both sides, and we get d is equal to 15. Now, let me show you what that actually represents. This is a graph of the equation that I showed you earlier, and it has the units of discount across the bottom and the weekly revenue up the side. And you can see that it goes up and then it curves back down. We want to find where that curve is the highest. Now, one way to do that is to put a vertical line through the top, and the highest point actually is this one right here. It's 15 units of discount, which is the same thing we got from the calculus. Now, let's go back and determine what that means for our price. The price is $500 minus $10 per unit of discount. We decided that 15 was the optimal solution. $10 times 15 is $150, and $500 minus $150 is $350. So that's the price that's going to get us the optimal revenue. Well, let's see how that affects sales. We go back to sales: we originally have 300 per week, and we had determined that for every unit of discount, we could get 15 more sales per week. Well, we decided that 15 was the ideal number of units of discount. 15 times 15 is 225. Add that to 300, and you get 525 sales per week once we make the change.
So our current revenue is $500 times 300 sales per week; that's $150,000 in revenue per week. But if we were to drop the price down to $350, we would increase the sales to 525, and that would give us an estimated total revenue of $183,750. And that's a lot more money. The ratio is 1.225, which means it's a 22.5% improvement in revenue. In fact, let's look at the revenue this way. If we lower the price by 30%, going from $500 to $350 is a 30% reduction, we are able to increase the sales by 75%, and taken together, that increases the revenue by 22.5%. That's an increase of almost $2 million annually, simply by making things more affordable, reaching a wider audience, and helping them reach their own professional dreams. And that is the way that calculus can help you get paid.
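Here's a minimal sketch of the same optimization in Python, assuming the numbers from this example and the sympy library, a tool choice of mine rather than the course's. It reproduces the expanded formula, the derivative, and the optimal discount.

```python
# A minimal sketch of the revenue optimization above: $500 base price,
# 300 weekly sales, 15 extra sales per $10 of discount. Requires sympy.
import sympy as sp

d = sp.symbols('d')                       # d = number of $10 discount units
revenue = (500 - 10 * d) * (300 + 15 * d)

expanded = sp.expand(revenue)             # -150*d**2 + 4500*d + 150000
derivative = sp.diff(revenue, d)          # -300*d + 4500
optimal_d = sp.solve(derivative, d)[0]    # 15

print("Expanded revenue:", expanded)
print("Optimal discount units:", optimal_d)                       # 15 -> $350 price
print("Weekly revenue at optimum:", revenue.subs(d, optimal_d))   # 183750
```

Running it confirms the numbers worked out by hand: 15 units of discount, a $350 price, and $183,750 in weekly revenue.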
Security
- [Instructor] Your data has a lot of value all on its own, and so does your work on your data. The challenge, of course, is that value sometimes attracts people who don't have your best interests at heart. So, for instance, sometimes people will come and steal your datasets; there's an enormous amount of value in the raw data that you have. Or they might come and steal your algorithm, the process that takes the raw data and converts it into a finished product. Or maybe they just avoid all of that and steal your output directly: they let you do all the data gathering, let you do all the calculating, and they just take the final product. Any one of these is a major security risk to a data science project. And it's frustrating enough when it's things like trying to predict what the weather is going to be, but think about it when it's a high-stakes outcome, something like complex machine learning that can be used to make life-and-death decisions, like the diagnosis of brain injuries. The idea there is, what could go wrong if somebody is either trying to take what you have or deliberately mess with you? Well, it turns out that the general phrase for this in the machine learning world is adversarial attacks. That's when an adversary, somebody who's working against you, attacks your system, and these are attempts to either bypass or corrupt your system, your machine learning algorithm. Now, there are several kinds of people who will do this, and not all of them are bad guys. For one thing, there are privacy activists. These are people who work to counter machine learning the same way you do when you turn on the privacy protections on your phone or your web browser. There are also people who are trying to conceal illegal or unethical behavior, or maybe frame somebody else. Maybe they're trying to interfere with a competitor's performance or gain an advantage. Maybe they're trying to get past security settings and reach users, say with spam messages, or turn your computer into a spam bot or a blockchain zombie. Or maybe they're trying to hack into the system and steal, or, possibly worse, manipulate data, which in and of itself can be an act of cyberterrorism. But let me discuss two different kinds of adversarial attacks, regardless of the motivation. The first one is what's called an evasion attack. Think of it as somebody putting on a mask. This is where a model already exists, say, for instance, a facial recognition system, and it's an attack on that implemented model by trying to get around it. Sometimes the people doing this will create new examples that fool the algorithm and, for instance, make it think that they're somebody else. I know this is a silly example right here with this mask, but there are ways to do it very subtly, say by printing glasses frames that look like a silly hodgepodge but make the system think that you are a specific other person. Evasion attacks can also be done with images, with video or audio, or even with something like a text-classifying system. Now, the good news is that evasion attacks can usually be overcome and defeated by updating the algorithm. You've got this new information, you know what's happening, and you can adapt to it. On the other hand, there are what are called poisoning attacks. Think of it as something that's completely corroding and corrupting your algorithm.
This is an attack on a model that is in development, before it's been implemented. Now, that's harder to do, but what it does is bake the mistakes into the algorithm itself. The problem here is that algorithms that are frequently updated, and which are therefore less susceptible to evasion attacks, are in fact more susceptible to poisoning attacks, and so security becomes a major issue, especially when you're dealing with significant outcomes: things like financial systems, health systems, military systems. It's going to make a very big difference there. And so, what can you do about it? Well, there are at least a few options. One, of course, is to show a little less trust and put in a little more security effort. I'm a particularly trusting person, but I understand the need for this in many situations. Another one is what's called a white hat evaluation. That's where you hire a hacker and get them to try to break into your system or corrupt it and see what it takes. They share what they find with you, and then you can make the necessary changes. There's also regular testing of algorithms: are they, in fact, working the way you expect them to? Think of it as a periodic checkup on the performance of the model. And then, of course, there are laws and regulations that govern the security of algorithms, their performance, and attacks against them. All of these things together can make for a more secure environment where you can conduct your machine learning and your data science, and get your outcomes without having to worry so much about getting tripped up in the process. You want to bring the value without bringing the drama.
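To make the evasion idea a little more concrete, here is a toy sketch in Python, my own illustration rather than anything from the course: for a simple linear classifier, a small, carefully aimed nudge to an input is enough to flip the model's decision, which is the kind of weakness an evasion attack exploits. The data, model, and library choices are all assumptions for demonstration.

```python
# Toy evasion-attack illustration: a tiny perturbation flips a linear
# classifier's prediction. Synthetic data and scikit-learn are assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression().fit(X, y)

x = X[0]
w, b = model.coef_[0], model.intercept_[0]

# For a linear model, the smallest nudge that crosses the decision boundary
# points along the weight vector; step just past the boundary (factor 1.1).
step = -1.1 * (w @ x + b) / (w @ w) * w
x_adv = x + step

print("Original prediction: ", model.predict([x])[0])
print("Perturbed prediction:", model.predict([x_adv])[0])
print("Size of nudge:       ", np.linalg.norm(step))
```

Deep networks are attacked with more elaborate versions of the same idea, but the principle is identical: a change too small for a person to care about can push a model across its decision boundary.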
Validating models
- [Narrator] Several years ago, my wife and I adopted our second child. No, this isn't her; it's some other lovely child. Now, we have loved and raised her the best we could, but we made one critical error that has caused her unanticipated grief and could take years to recover from: we used a non-standard character in her name. Those two dots over the E are a diaeresis, which indicates that the second vowel should be pronounced as its own syllable, and by the way, that's not to be confused with the identical-looking but functionally distinct umlaut, which softens the sound of a vowel or indicates that you are a heavy metal band. Now, it turns out that there are still a lot of web forms out there that don't like non-ASCII characters, and they will tell you that you've entered invalid data. Well, there are other people whose names have caused problems; it's not just us. Aside from having apostrophes or hyphens in your name, there are, for instance, mononyms, people who have just one name, like pop singers and people in Indonesia and Myanmar, or names with a single letter. I filled out a form yesterday which said you had to have at least two letters in your name, but O is a last name in Korea, and E occurs occasionally in China. Or you get names that are longer than the forms allow; this, for instance, is a Hawaiian name. And then there are the very unfortunate people whose last name is Null, which just causes the entire system to crash. Now, what's happening here, in many cases, is that a well-meaning programmer has created a name validation system and has tested it against common names. Think of it as an algorithm that is designed to classify text as either a valid name or not a valid name, but it turns out the world is a big place, and apparently programmers don't name their kids Zoë. So problems come up and the systems break down. So it's important to check your work, or, as it applies to data science and machine learning, it's important to see how well your model works with data that you didn't use to build the model. Donald Knuth, emeritus professor of computer science at Stanford and the author of "The Art of Computer Programming," put it this way. He once wrote, "Beware of bugs in the above code; I have only proved it correct, not tried it." And so, even though you feel like your logic might be airtight, you've got to check it out with a lot of different variations. Now, one very common way to do this is a validation process, where you take a large dataset and split it up, and you have training data, which is where you build the model. That's where you try to understand the relationship between the factors in your data to allow you, for instance, to predict a score or classify a case at the end. You can also have what's called cross-validation, where you take the dataset that you're training with and split it into, say, five parts, and then you take four of the parts to model what might be happening with the fifth one, and then you rotate through, so every section of the training data has a chance to be the outcome that the rest of them are trying to predict. This is considered a very good strategy; it allows you to deal with some of the variation that's inherent in your data. And then you can have holdout validation.
When you first split your data into training data, which, maybe, you then split into cross-validation data, you also created another partition of your data that never got used in developing the models. That's called the holdout; that's the testing data. The idea is that you develop your model with the training data, hopefully going through cross-validation as well, and when you think you've got something that hits the mark, you test it on the holdout data, and that gives you the best impression of how well your model will generalize to new situations. You can also think of in-the-bag and out-of-bag data, which is something that applies to certain algorithms. In-the-bag data is the data that is used to build a model, what you are working with at this exact moment. In bagging algorithms, or bootstrap aggregating, these are randomly selected cases from your overall data that you use to build the model, and it's often used with random forest models, or collections of decision trees. So in-the-bag data is the data that is used to build the current model, and separate from that, you have out-of-bag data that you can use to validate. These are data points that were not randomly selected when getting the data to build the model. Then, for each data point that was out of the bag, that wasn't included when you built a particular decision tree, you can check the predictions of all of the trees that didn't use that particular point, and from those various predictions, you can see how well they match up with the true state, the actual classification of the point, and calculate the out-of-bag error, the OOB error. This is a great way of validating some approaches, again, particularly random forests, a very, very common approach within data science. And once you do that, you can be much more confident that your work does what you think it should do and that it's ready to go out there into the real world.
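Here's a minimal sketch of those three ideas in Python, assuming the scikit-learn library and synthetic data, neither of which comes from the course: a holdout split, five-fold cross-validation on the training portion, and an out-of-bag score from a random forest.

```python
# Validation sketch: holdout split, cross-validation, and out-of-bag error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Holdout: set aside testing data that never touches model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Cross-validation: rotate through five folds of the training data.
model = RandomForestClassifier(oob_score=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validation accuracy:", cv_scores.mean().round(3))

# Out-of-bag: cases left out of each bootstrap sample validate those trees.
model.fit(X_train, y_train)
print("Out-of-bag accuracy:      ", round(model.oob_score_, 3))

# Final check on the untouched holdout data.
print("Holdout accuracy:         ", round(model.score(X_test, y_test), 3))
```

If the cross-validation, out-of-bag, and holdout numbers tell roughly the same story, that's a good sign the model will generalize; if the holdout score drops sharply, the model was probably overfit to the training data.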
Legal
- [Instructor] Data science can give you a lot of knowledge, and by extension a lot of power, but as anyone who has seen Spider-Man knows, with great power comes great responsibility, and the same thing applies in data science. Now, earlier on in the tech world, including data science, there was this little line from Mark Zuckerberg of Facebook: "Move fast and break things." And people did amazing things in data science; they were grabbing data from all over the place, putting it in their algorithms, and making these amazing predictions. On the other hand, that wild west phase, well, it was exciting, but it brought up some very serious issues, for instance about privacy, about copyright, about how to treat people. And so current data science is a lot more, instead of moving fast and breaking things, "wait just a moment, let's do this a little more deliberately." Now, in terms of laws and regulations, when you're working with data science, some of the important ones you have to deal with are, for instance, HIPAA. In the US, that's the Health Insurance Portability and Accountability Act, which places very serious restrictions on the confidentiality and privacy of medical records. There's also FERPA, the Family Educational Rights and Privacy Act, again a US law, that has to do with information from schools and education. More recently, there's the CCPA, the California Consumer Privacy Act, which went into effect in 2020, as well as the amended and expanded California Privacy Rights Act, which have important implications for how you do your work. I'll talk more about that in just a second. And then really the biggest one for the largest number of people is the GDPR, the European Union's General Data Protection Regulation. I'm going to say a few things about these last two. So, if we talk about the California Consumer Privacy Act, given the number of data companies and tech companies that are in California, and really the number of people who interact with people in California, this one's really significant. Here are some of the general provisions of the CCPA. Number one, people have a right to know what personal data is being collected about them; you have to tell them what you're gathering. Number two, they have a right to know whether their personal data is sold or disclosed, and to whom. Selling data has been a very big part of the data science world, but it turns out now that people have to know. Also, people have the right to say no to the sale of their personal data. They have the right to access the data that you have gathered about them, and they can request that a business delete any personal information about a consumer collected from that consumer, so you can say, "I'm out of your system." And they have the right not to be discriminated against for exercising their privacy rights. These all sound like basic human rights, but a lot of them are there because some of the practices in the wild west days of the data science world ran very contrary to this, especially gathering data without people's knowledge and selling their data without their knowledge. And so this makes a very big change. And then the other, larger one, mostly across the ocean, is the European Union's GDPR, or General Data Protection Regulation.
But again, it's not just if you're in the European Union; it's if you're interacting with customers in the European Union, and given that everybody online interacts with everybody everywhere, it's relevant. This one talks about things like transparency and modalities. The data controller who gathers the data must provide information to you, the data subject, in a concise, transparent, intelligible, and easily accessible form, using clear and plain language. There is also a right to information and access. This gives people the right to access their personal data and information and see how that information is being processed. The data controller who gathers the information must provide, upon request, an overview of the categories of data that are being processed, the purposes of that processing, with whom the data is shared, and how it acquired the data. And, of course, people can access an actual copy of the data that's been gathered. Then there is rectification and erasure. A right to be forgotten, that is, to be completely deleted, has been replaced by a more limited right to erasure, to remove your information. The data subject has the right to request erasure of personal data related to them, on any one of a number of grounds, within 30 days. And then there's the right to object and automated decisions. This allows an individual to object to the processing of personal information for marketing or non-service-related purposes, and the data controller, the one who gathers the data about them, has to inform the individual of their right to object from the first communication that the controller has with them. That's why you have all these checkboxes now, and why, by default, they are not checked. And the last thing I want to say about this is that there are fines. The GDPR can impose fines of up to €20 million, at the moment that's about $23 million, or up to 4% of a company's annual worldwide turnover from the preceding financial year. So this is a regulation that has teeth. Taking both of these as examples, the California law and the European Union law, I'll also point out that once the United Kingdom left the European Union, it developed its own UK GDPR, which is basically identical, and other countries, other states, and other organizations are developing similar regulations. And so the very short version of all of this is, as you do your data science, as you gather data, as you process the data, as you do whatever wonderful things it is that you do with the data, remember that the guard rails are important, both for your safety and for the safety of others. So be mindful and be respectful.
The role of questions in data science
You as a DATA SCIENTIST are in the business of answering QUESTIONS. A person, say the marketing director, comes to you with some kind of question, so you gather and you process and analyze the data. You BUILD a model or several models and you come up with an answer to the question that motivated the project. But chances are, if you've been at this for a while, it's a question that you've been asked before, and the answer you've given is closely related to the answers that you have given before. But one of the STRANGE things about life, including data science, is that the REAL VALUE ISN'T always in the ANSWERS. For example, here's something from Esther Dyson, a forward-looking investor and philanthropist. She says, "The definition of the problem, rather than its solution, will be the SCARCE resource in the future." Think about that. She is saying that ANSWERS or solutions WILL be LESS VALUABLE THAN the actual QUESTIONS that motivate them, and I think that's true in data science. So you need to change your perspective and find a new way to ask questions that can be answered with your approaches. Look at the problem differently and see what extra value you can bring in. It might be helpful, for instance, to think of the tools that you use. If you think of some of the most common TOOLS used when WORKING with DATA, things that you've probably used hundreds or even thousands of times, things like t-tests, regression, classification, principal component analysis, and probability values, among lots of others, the big question is what exactly these procedures do, compared to how they are interpreted. So, for example, a t-test is used to compare the means of two groups, and that's the bread and butter of A/B testing in the user design world. Well, t-tests make some assumptions, first off. They only work properly when you have normal distributions, that's bell curves, similar sample sizes in your two groups, approximately equal variance, and independent observations. They're kind of sensitive to these things, and what they tell you is whether the means of those two groups are different. What's important there is that they ignore other differences. If the medians are different, it doesn't care. If there are outliers in one group or the other, it doesn't care. If there's a difference in variance, it doesn't look at that. And so people often use the t-test to indicate whether TWO GROUPS are the SAME or DIFFERENT, but remember, it only looks at one point, the mean of each group. And so if you're trying to think a little more creatively, you might want to look beyond just that one element of that one test as a way of trying to bring some extra value into your work. In fact, when you start thinking a little more creatively about the questions that you could answer and the methods that you could use to explore them, then you find that your data science can get a lot more compelling. So think about it. The reports are going to be more interesting; you're not telling people the same stuff they've had before. I understand the value of PREDICTABILITY in dashboards, which are very useful for getting an idea of a constant process. But again, in data science, you're usually being ASKED to do SOMETHING NOVEL, and so you can get MORE INTERESTING INSIGHTS. Also, by using these different approaches and thinking maybe this common approach isn't what I need, you can FIND some HIDDEN VALUE.
And again, that's one of the things that makes data science so important: the ability to get to that hidden value. And obviously, by extension, that also tells you that you can get some competitive advantage for the people who've commissioned this project. Again, by finding something UNEXPECTED in there, by using a creative approach, you can present it to them in a way that both gets their attention and helps them further their business, and that is the great advantage of data science and thinking about new questions. So you don't want to just find answers. That's maybe what they pay you for, what they are expecting, but you can do more than that by learning to ask better, more interesting, and more informative questions and getting the answers to them. You can FIND VALUE, and that is where the promise of data science is realized.
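As a small illustration of that point about t-tests, here is a sketch of my own in Python with scipy, not something from the course: two groups with the same mean but wildly different spread. Because the t-test only compares means, it will typically report no difference at all, even though the groups are obviously not the same.

```python
# Two groups with equal means but very different spread: a t-test, which
# only compares means, typically sees "no difference." Synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=100, scale=5, size=200)    # tight around 100
group_b = rng.normal(loc=100, scale=25, size=200)   # same mean, wide spread

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")        # usually p well above 0.05
print("Std dev A:", group_a.std().round(1), " Std dev B:", group_b.std().round(1))
```

Asking a different question of the same data, say about variance, outliers, or the shape of the distributions, would surface exactly the kind of hidden value described above.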
Self-generated data
- [Instructor] When I was growing up, I remember an ad for toys that said, "Wind it up and watch it go." But now you can do a similar kind of thing with data science. You can do this looping-back process, where computers, the algorithms in them, can engage with themselves to create the data they need for machine learning algorithms. It's a little bit like the mythical self-consuming snake that comes all the way back around. And the reason this is important is that you need data for training your machine learning algorithms so they can determine how to categorize something or the best way to proceed. And having the machines generate that data by engaging with themselves is an amazingly efficient and useful way of doing that. There are at least three different versions of this, and I'm giving a little bit of my own terminology here. The first one is what I'm calling external reinforcement learning. Now, reinforcement learning is a very common term; it means an algorithm that is designed to reach a particular outcome, like, for instance, running through the levels of a game. I'm calling it external because it's focusing on some outside contingency, and it's this method, for example, that allowed Google's DeepMind algorithms to teach an AI to learn on its own how to move through the levels of a video game. This was a major accomplishment: to do it with no instruction except "move forward, get to the end." There are also generative adversarial networks, which are used a lot in things like generating audio or video or images that seem photorealistic. It's both exciting and scary at the same time. The idea here is that one neural network generates an image and a second neural network tries to determine whether that image is legitimate or whether it's been modified in some way, and if it gets caught with a fake, then the first one has to learn how to do it better. This has gotten to the point where you can have photorealistic face swapping in videos, again, both exciting and scary, but done with what's called a generative adversarial network. And then there's another kind of reinforcement learning, which I'm calling internal. This is the kind where the algorithm works with itself, and the best example of this is DeepMind again, learning to play chess and Go and other games by playing millions and millions of games against itself in just a few hours and mastering the game completely. Now, there are a few important benefits to this. One, of course, is that you can get millions of variations and millions of trials, gargantuan amounts of data, very, very quickly. Another is that sometimes the algorithms can create scenarios that humans wouldn't, something they wouldn't even think of or deem possible. And this kind of data is needed for creating the rules that go into the algorithms of the machine learning that I'll talk about later. The reason this is so important is that the kind of data you can create so quickly using this method is exactly the kind of data, in both variability and quantity, that you need for creating effective rules in machine learning algorithms, which is what we're going to turn to next.
Data vendors
- [Instructor] When a dentist buys or sells their practice, the office location and the technical equipment are often included, but the real value comes in getting the accumulated list of customers, the patients, and hopefully their continued patronage. After all, everybody knows that the hardest part of any business is building your customer base. And that is where data vendors come in for data science. Now, a data vendor is a third-party organization that sells data, and they can give you very detailed data. There are several advantages to working with data vendors. Number one is the volume and variety of data they have. In terms of volume, you could potentially get data on millions of people, and in terms of variety, you could get tens of thousands of different indicators. There is so much data out there, and many of these organizations put it together and package it in a way that makes it very quick and easy, relatively speaking, to get started with it. Also, maybe you have a very specific group that you want to focus on. Maybe you're looking for a demographic group, or maybe you're looking for people who are at a very particular point of the buying cycle in a niche industry. If you can specify it, you might be able to find a data vendor that already has those people identified by where they are in the sales funnel, in the buying cycle. And then, finally, they can create indicators. You can choose what it is you want to know. They can give you a score for a person's inclination to quit their current cell phone provider and get a new one, or they can give you an indicator of a person's responsiveness to a particular kind of marketing campaign. There's a lot that they can do, and all of these are strong advantages of working with data vendors. So, for example, if you're selling a product or service, you could probably do your business a lot faster and better if you knew things about the online behaviors of potential customers. Those could include things like who they are, what their names are, what their email addresses are, where they live. Things like the technology used: are they accessing this on a laptop or a mobile device? Are they doing it through a web search, through social media, or through something else? Search keywords: how do they find things, and what specifically are they using as their method, so you could try to get yourself in front of their eyes? And you can even get things like their web history, all the things they clicked on to get to a particular place, very detailed information about what happened. And the data vendors can get their data from a lot of different sources. So, for instance, the cookies that go onto your internet browser can be used for that information, or who searched for job postings, or who downloaded particular files, or who read news articles or clicked on the links, who said what on social media, or liked it or forwarded it, who left a product review and what did they say. There are so many different sources of information; these are just a few, there are hundreds. On the other hand, there are some disadvantages to working with data vendors, and first and foremost is cost. This is a service where they're potentially saving you a lot of time, but you're going to pay for it.
Data can be extremely expensive, especially if you're looking for a very large amount of data on a very specific group and you want detailed information; it comes at a cost. Second, there is the question of accuracy. Now, data vendors are going to go out of their way to make sure that their data is useful, but it is something that you are going to have to verify before you go ahead with the rest of your project. You need to know how they calculated things, you need to have some indication of the validity of their scores and the validity of their sampling, and it behooves you to check that yourself. And then, finally, a very, very big one: there's the issue of privacy. Data vendors usually get their data without people necessarily knowing that they have it, and a lot of people object to the gathering and selling of personal data. Not surprisingly, over the last few years, many laws have strictly regulated the practice. As I've mentioned elsewhere, there's the European Union's General Data Protection Regulation, and there's also the California Consumer Privacy Act, along with the privacy policies of cell phone companies, browser companies, and so on. So there are some significant advantages, but there are some significant costs, both literal financial costs as well as things like social goodwill, and these are things you have to think about. The overall point is that working with data is a challenging thing in the best of circumstances, and the option of working with data vendors can be enormously powerful, but it does have its trade-offs. Just as purchasing a professional practice has a cost, paying the money up front and then working to keep the clientele, it also allows you to get up and running as quickly, and hopefully as effectively, as possible.
Predictive analytics
- [Instructor] When a person is convicted of a crime, a judge has to decide what the appropriate response is and how that might help bring about positive outcomes. One interesting thing that can contribute to that is what's called restorative justice. This is a form of justice that focuses on repairing the harm done, as opposed to punishment, and it often involves, at the judge's discretion and the victim's desire, mediation between the victim and the offender. Now, one of the interesting things about this is that it's a pretty easy procedure, and it has some very significant outcomes. Participating in restorative justice predicts improved outcomes on all of the following. People feel that they were able to tell their story and that their opinion was considered. They feel that the process or outcome was fair. They feel that the judge or mediator was fair. They feel that the offender was held accountable. An apology or forgiveness was offered. There's a better perception of the other party at the end of all of this. The victim is less upset about the crime. The victim is less afraid of revictimization. Those are absolutely critical. And then one more: there's a lower recidivism rate. Offenders who go through restorative justice are less likely to commit crimes again in the future. All of these are very significant outcomes and can be predicted with this one relatively simple intervention of restorative justice. And so when a judge is trying to make a decision, this is one thing they can keep in mind in trying to predict a particular outcome. Now, in the world of predictive analytics, where you're using data to try to predict outcomes, restorative justice is a very simple example based on simple analyses. Within data science and predictive analytics, you'll see more complicated things, like, for instance, whether a person is more likely to click on a particular button or make a purchase based on a particular offer. You're going to see medical researchers looking at things that can predict the risk of a disease, as well as the responsiveness to particular treatments. You'll also look at things like the classification of photos, where what's being predicted is whether a machine can accurately predict what a human would do if they did the same task. These are all major topics within the field of predictive analytics. Now, the relationship between data science and predictive analytics is roughly like this: data science is one circle, predictive analytics is another, and there's a lot of overlap. An enormous amount of the work in predictive analytics is done by data science researchers, and there are a few important meeting points at the intersection between the two. So, predictions that involve difficult data: if you're using unstructured data, like social media posts or video, that doesn't fit into the nice rows and columns of a spreadsheet, you're probably going to need data science to do that. Similarly, predictions that involve sophisticated models, like the neural network we have here, require some really high-end programming to make them happen, and so data science is going to be important to those particular kinds of predictive analytics projects. On the other hand, it's entirely possible to do predictions without the full data science toolkit. If you have clean quantitative datasets, nice rows and columns of numbers, then you're in good shape.
The same is true if you're using a common model, like a linear regression or a decision tree, both of which are extremely effective but also pretty easy to run and pretty easy to interpret. So in these situations, you can do useful and accurate predictions without having to have the entire background of data science. Also, it's possible to do data science without necessarily being involved in the business of predictions. If you're doing things like clustering cases, or counting how often something happens, or mapping, like what we see here, or data visualization, these can be significant areas of data science, depending both on the data that you're bringing in and the methods that you're using, but they don't involve predictions per se. And so what this lets you know is that while data science can contribute significantly to the practice of predictive analytics, they are still distinguishable fields, and depending on your purposes, you may or may not need the full range of data science skills, the full toolkit, to get to your predictive purposes. But either way, you're going to be able to get more insight into how people are likely to react and how you can best adapt to those situations.
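As a rough sketch of that "clean data plus a common model" case, here is what a simple prediction can look like in Python with scikit-learn; the campaign numbers are invented purely for illustration.

```python
# Prediction without the full data science toolkit: tidy numbers plus
# a plain linear regression. The data here is made up for demonstration.
from sklearn.linear_model import LinearRegression

# Hypothetical past campaigns: [ad spend in $1000s, emails sent in 1000s]
X = [[10, 5], [20, 8], [30, 12], [40, 15], [50, 20]]
y = [120, 210, 300, 380, 470]           # resulting sales

model = LinearRegression().fit(X, y)
print("Predicted sales for a new campaign:",
      model.predict([[35, 14]])[0].round(1))
print("Coefficients:", model.coef_.round(1))
```

The coefficients are readable on their own, which is exactly why models like this remain so popular when interpretation matters as much as accuracy.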
Applications for data analysis
- [Instructor] When people think about data science, machine learning, and artificial intelligence, the talk turns almost immediately to tools, things like programming languages and sophisticated computer setups. But remember, the tools are simply a means to an end, and even then only part of it. The most important part of any data science project, by far, is the question itself and the creativity that comes in exploring that question and working to find possible answers using the tools that best match your questions. And sometimes those tools are simple ones. It's good to remember, even in data science, that we should start with the simple and not move on to the complicated until it's necessary. And for that reason, I suggest that we start with data science applications. You may wonder, why apps? Well, number one, they're more common. They're generally more accessible; more people are able to use them. They're often very good for exploring and browsing the data, and they can be very good for sharing, again, because so many people have them and know how to use them. By far, the most common application for data work is going to be the humble spreadsheet, and there are a few reasons why this should be the case. Number one, I consider spreadsheets the universal data tool. It's my untested theory that there are more datasets in spreadsheets than in any other format in the world. The rows and columns are very familiar to a very large number of people, and they know how to explore and access the data using those tools. The most common by far is Microsoft Excel, in its many versions; Google Sheets is also extremely common, and there are others. The great thing about spreadsheets is they're good for browsing. You sort through the data, you filter the data; it makes it really easy to get a hands-on look at what's going on in there. They're also great for exporting and sharing the data. Any program in the world can read a CSV file, or comma-separated values, which is the generic version of a spreadsheet. Your client will probably give you the data in a spreadsheet, and they'll probably want the results back in a spreadsheet. You can do what you want in between, but that spreadsheet is going to serve as the common ground. Another very common data tool, even though it's not really an application but a language, is SQL, or "sequel," which stands for Structured Query Language. This is a way of accessing data stored in databases, usually relational databases, where you select the data, you specify the criteria you want, and you can combine and reformat it in the ways that work best. You only need maybe a dozen or so commands in SQL to accomplish the majority of tasks that you need, so a little bit of familiarity with SQL is going to go a very long way. And then there are the dedicated apps for visualization. That includes things like Tableau, in its desktop, public, and server versions, and Qlik. What these do is facilitate data integration; that's one of their great strengths. They bring in data from lots of different sources and formats and put it together in a pretty seamless way. And their purpose is interactive data exploration: they let you click on subgroups, drill down, and expand what you have, and they're very, very good at that. And then there are apps for data analysis. These are applications that are specifically designed for point-and-click data analysis.
And I know a lot of data scientists think that coding is always better at everything, but the point-and-click graphical user interface makes things accessible to a very large number of people. And so this includes common programs like SPSS, or JASP, or my personal favorite jamovi. JASP and jamovi are both free and open source. And what they do is they make the analysis friendly. Again, the more people you can get working with data the better, and these applications are very good at democratizing data. But whatever you do, just remember to stay focused on your question and let the tools and the techniques follow your question. Start simple with the basic applications and move on only as the question requires it. That way you can be sure to find the meaning and the value as you uncover it in your data.
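Here's a tiny sketch of that "start simple" workflow in Python; the table, file name, and column names are invented for illustration, and pandas plus the built-in sqlite3 module stand in for whatever spreadsheet and database tools you actually use.

```python
# Simple tools first: a CSV as the common ground, then a few lines of SQL.
import pandas as pd
import sqlite3

# A tiny stand-in for a client spreadsheet, written out and read back as CSV.
pd.DataFrame({
    "region":  ["North", "South", "North", "West"],
    "revenue": [1200, 800, 950, 1100],
}).to_csv("sales.csv", index=False)

df = pd.read_csv("sales.csv")            # any program in the world can read this

# A handful of SQL commands cover most day-to-day questions.
con = sqlite3.connect(":memory:")
df.to_sql("sales", con, index=False)
print(pd.read_sql_query(
    "SELECT region, SUM(revenue) AS total_revenue "
    "FROM sales GROUP BY region ORDER BY total_revenue DESC", con))
```

Nothing here requires specialized infrastructure, which is the point: only move past the simple tools when the question demands it.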
The generation of implicit rules
- [Instructor] When the aliens land, we'll have a better chance of understanding them because of our experience with machine learning and artificial intelligence. AI doesn't see things the way that people do, and it doesn't reason the way that people do. For example, DeepMind's AlphaGo, AlphaGo Zero, and AlphaZero are three generations of AIs that have come to completely master the game Go, which is massively more complex than chess. But these AIs beat all of their opponents not by mimicking the moves of human Go masters, but by coming up with their own unorthodox but extremely effective strategies. One computer science researcher described these AIs' gameplay in these terms: it's like an alien civilization inventing its own mathematics, which allows it to do things like time travel. So, there's something very significant going on here, and part of the issue is this. Neural networks, a particularly powerful kind of machine learning technique, can do amazing things with games, and can also quickly spot differences between, for example, two closely related breeds of dogs, like the English Mastiff and the French Mastiff. Humans will tell you that the English Mastiff is generally larger in size, about 200 pounds, with a large muzzle and often a distinctive black mask, whereas the French Mastiff is not quite so huge and tends to have a shorter muzzle and slightly different coloration. Neural networks, on the other hand, can distinguish these two breeds reliably, but because of the peculiar things they focus on, they may have trouble distinguishing either of them from, say, an image of a toaster that has certain pixels strategically doctored. And you can see the course AI Accountability Essential Training if you want to learn more about the peculiarities of AI vision. Neural networks look at things in a different way than humans do, and in certain situations, they're able to develop rules for classification even when humans can't see anything more than static. This is where implicit rules come in. The algorithm knows what the decision rules are and can apply them to new cases, even if the rules tend to be extraordinarily complex; computers are able to keep track of these things better than humans can. So, the implicit rules are rules that help the algorithms function. They are the rules the algorithms develop by analyzing the training data, and they're implicit because they cannot be easily described to humans. They may rely on features that humans can't even detect, or they may be nonsensical to humans, but the trick is that those implicit rules can still be very effective. And that's one of the trade-offs that goes on with machine learning. You can think of it this way. It leaves you as a human decision-maker in an interesting position. You can trust the machine to do its work very quickly and often very accurately, but in a slightly mysterious way, relying on implicit rules that it, the algorithm, inferred from the data. Or you can use other, more explicit processes, like regression and decision trees, that are easier to understand and monitor but may not be as effective overall. It's a difficult decision, but one that should definitely be left in the hands of the humans.
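For contrast with those implicit rules, here is a quick sketch, my own example in Python with scikit-learn rather than anything from the course, of an explicit model: a small decision tree whose entire rule set can be printed out and read by a human.

```python
# An explicit model: every rule a small decision tree uses can be printed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    data.data, data.target)

# Every decision the model will ever make is visible here.
print(export_text(tree, feature_names=list(data.feature_names)))
```

A deep neural network offers no equivalent printout, which is exactly the trade-off described above: implicit rules can be more powerful, but explicit ones are far easier to understand and monitor.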
Agency of algorithms and decision-makers
- [Instructor] When we think about artificial intelligence and how it works and how it might make decisions and act on its own, we tend to think of things like this: a robot holding a computer right next to you. But the fact is, most of the time when we're dealing with artificial intelligence, it's something a lot closer to this. Nevertheless, I want to suggest at least four ways that work in data science can contribute to the interplay of human and artificial intelligence, of personal and machine agency. The first is what I call simple recommendations, and then there's human-in-the-loop decision-making, then human-accessible decisions, and then machine-centric processing and action. I want to talk a little more about each of these, so let's start with recommendations. This is where the algorithm processes your data and makes a recommendation or a suggestion to you, and you can either take it or leave it. A few places where this approach shows up are things like, for instance, online shopping, where a recommendation engine says that based on your past purchase history, you might want to look at this, or the same thing with online movies or music. It looks at what you did and what you like, and it suggests other things, and you can decide whether you want to pick up on that or not. Another one is an online news feed. This says, based on what you've clicked in the past and the things that you've selected, you might like this. It's a little bit different because this time it's just a yes or no decision, but it's still up to you what you click on. Another one is maps, where you enter your location and it suggests a route to you based on traffic and time, and you can follow it or do something else if you want. But in all of these, data science is being used to take a truly huge amount of information about your own past behavior, and about what other people have done under similar circumstances, and combine it to give the most likely recommendations to you. But the agency still rests with the human; they get to decide what to do. Next is human-in-the-loop decision-making. This is where advanced algorithms can make and even implement their own decisions, as with self-driving cars, and I remember the first time my car turned its steering wheel on its own. But humans are usually at the ready to take over if needed. Another example might be something as simple as spam filters. You go in every now and then and check up on how well it's performing. So it can do it on its own, but you need to be there to take over just in case. A third kind of decision-making, and interplay between the algorithm and the human, is what I call human-accessible decision-making. Many algorithmic decisions are made automatically and even implemented automatically, but they're designed such that humans can at least understand what happened in them. For instance, with an online mortgage application, you put the information in and they can tell you immediately whether you're accepted or rejected, but because of recent laws, such as the European Union's GDPR, the General Data Protection Regulation, the organizations who run these algorithms need to be able to explain how the algorithm reached its decision. Even if humans aren't usually involved in making these decisions, the process still has to be open to humans. And then, finally, there's machine-centric.
This is when machines are talking to other machines, and the best example of this is the internet of things. That can include things like wearables: my smartwatch talks to my phone, which talks to the internet, which talks to my car, sharing and processing data at each point. Also smart homes: you can say hello to your smart speaker, which turns on the lights, adjusts the temperature, starts the coffee, plays the news, and so on. And there are smart grids, which allow, for example, two-way communication between a power utility and the houses or businesses it serves. That enables more efficient routing of power, recovery from blackouts, integration with consumer-generated power, and so on. The important thing is that this last category, machine-centric decisions and the internet of things, is starting to constitute an enormous amount of the data that's available for data science work. But all of these approaches, from simple recommendations up to machine-centric processing, show the different kinds of relationships between data, human decision-makers, machine algorithms, and the conclusions that they reach. Each one is going to work in different circumstances. And so it's your job, as somebody who may be working in data science, to find the best balance between the speed and efficiency of machine decision-making and respect for the autonomy and individuality of the humans that you're interacting with.
Getting started
Data surrounds us and infuses everything we do, from making toast to driving cars across the country to inventing new paradigms for social interaction. We'll explore some of the ways that data science allows us to ask and answer new questions that we previously didn't even dream of. To do that, we'll see how data science connects to other data-rich fields like artificial intelligence, machine learning, and prescriptive analytics. We'll look at the fundamental practices for gathering and analyzing data, formulating rules for classification and decision-making, and implementing those insights. We'll touch on some of the tools that you can use in data science, but we'll focus primarily on the meaning and the promise of data in our lives. Because this discussion focuses on ideas as opposed to specific techniques, if you want to know how you can thrive in the new world of data, regardless of your technical background, you can get a better understanding of how to draw on data to do the things that are important to you and to do them more effectively and more efficiently.
Deep learning neural networks
A light switch is a simple everyday object. It turns things on, it turns them off. The NEURONS IN YOUR BRAIN are ALSO ON/OFF SWITCHES. They get turned on, and when that happens, an electrical impulse travels down the axon and possibly turns on other neurons. And when you have a whole bunch of switches, say a hundred billion of them, like the neurons in your head, and YOU GET THEM ALL CONNECTED, THEN AMAZING THINGS can happen, like LOVE and CONSCIOUSNESS. And IN data science, AN ARTIFICIAL NEURAL NETWORK, whose FUNCTIONING is loosely INSPIRED BY OUR BIOLOGICAL NEURAL NETWORKS, can also give rise to surprising things. Examples of this include autonomous cars that can drive themselves safely, or medical diagnostics, where it can help diagnose diseases in MRIs and EKGs, or the ability to find photos without you ever tagging them. In fact, NEURAL NETWORKS, and DEEP LEARNING NEURAL NETWORKS in particular, have been so SUCCESSFUL and used in so many ways over the last few years that we're going through a period of development and expansion right now that's been called the NEW CAMBRIAN EXPLOSION IN AI, because there are so many new variations coming out. But to get back to our original question, WHAT exactly is a DEEP LEARNING NEURAL NETWORK? Well, here's a very simple version: it CONSISTS OF AN INPUT LAYER, SEVERAL HIDDEN LAYERS that DO THE ACTUAL PROCESSING, AND an OUTPUT LAYER THAT GIVES YOU YOUR RESPONSE. So imagine, for example, that you want to create a neural network that can categorize what's in a photo. You take a digital photo and you get the X and Y coordinates of each pixel, along with its red, green, and blue components. So those are five numbers for each pixel, and you probably have a lot of pixels in your photo, but that information goes into the first, or input, layer of the neural network. And then maybe what happens is it takes that information and starts combining it. If we want to think about it the way a human might do it, and the neural network doesn't necessarily work this way, you could think of the first step as finding lines in the image. Then, from there, it does some more processing and finds the edges or outlines of objects. Then, from there, it identifies the shapes that are in the photo. And then maybe at that point it can give you an output and say what it is a picture of. So it's sequential processing. Now, it's actually very, very complicated, because these are nonlinear transformations that go on, and that's part of what makes the magic of a deep learning neural network. Also, real networks are a lot more complicated; they can potentially have millions of neurons and even more connections between them. In fact, work is progressing on what is called GPT-4 right now; this is a proposed neural network that, by some reports, would have 100 trillion parameters. That's just absolutely staggering, and it would be far and away the closest thing we have to general-purpose AI compared to anything else that currently exists. But there's something important to remember: just like a human brain, it gets to be a little complicated, or potentially massively complicated, and it can be a little hard to know exactly what's going on in there. And so when it comes to trying to understand what's happening inside a neural network, you have to rely a fair amount on INFERENCE.
Sometimes you have to infer how the neural network is functioning, because it turns out to BE NEARLY IMPOSSIBLE TO TRACE THE DATA THROUGH THE TRANSFORMATIONS in a neural network. That's why people call them black box models. You know the data that goes in, you know what it gives you when it comes out, and it's very hard to identify exactly what's going on in the middle. Now, what's funny about that is that sometimes you end up using the same methods as psychological researchers trying to understand what's going on in a person's mind. It also explains why there's a lot of work going on right now in what's called EXPLAINABLE ARTIFICIAL INTELLIGENCE, or XAI, as a critical element in the implementation and the trustworthiness of AI. Also, there are important legal issues involved in deep learning neural networks. For instance, the European Union has its General Data Protection Regulation, or GDPR, and there is also the California Consumer Privacy Act, or CCPA. Both of these are very significant laws that have requirements, for example, for transparency, meaning that when an algorithm makes a decision that affects somebody, they have to know how it got there, and explainability, so they know how it reached that decision. And it turns out both of those can be very complicated. So we're still in a period of trying to develop models that can meet these requirements, as well as trying to best understand how to interpret the laws, so that we don't completely shut down the development of deep learning neural networks but find a way to make them more open and more trustworthy to the people who rely on them. Finally, I just want to mention very briefly some of the really interesting DEVELOPMENTS going on in DEEP LEARNING NEURAL NETWORKS and AI in general. One of them is what are called SHALLOW NEURAL NETWORKS, which allow you to use neural networks when you have a much smaller dataset, because normally you require massive, massive datasets. What can you do, say, when you have a relatively small number of MRIs? It requires a revised algorithm. There are also efficient neural networks. What that refers to is that the energy requirements for training and running neural networks can sometimes be overwhelming; they use a huge amount of power. And so methods are being developed to try to reduce the energy demands and the computing demands of neural networks. Part of that is showing up in the use of neural networks, or variations on them, for machine learning in embedded devices, little sensors, things like your smoke detector, where it can be smart enough to tell whether the house is actually on fire or it's just smoke from your toaster. Embedding machine learning into very low-powered, very limited devices is another exciting area of development. There's also hybrid AI, which in part goes back to the idea of what's called good old-fashioned AI, from back in the '50s, when people tried to develop systems that encoded expert knowledge of a domain; there's an attempt to use neural networks that draw, at least in part, on expert knowledge to make them more functional within a particular domain. And that might be related to what are also called liquid neural networks. These are flexible algorithms that keep developing in response to new data, not just the training data where they first learned how to do things, because most deep learning neural networks are kind of set in stone: once they've been trained, they just kind of operate that way after that.
But liquid neural networks are more flexible, and one interesting consequence is that the cause-and-effect relationships that arise from that responsiveness can also make them more interpretable and more explainable. And so there's a lot of overlap between data science and artificial intelligence, and also the power of deep learning neural networks, the massive, explosive growth going on in that field, and how it can be applied within data science.
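To make the layered structure a little more tangible, here is a bare-bones sketch in Python of one pass through a tiny network: an input layer, one hidden layer with a nonlinear step, and an output layer. The sizes and random weights are arbitrary assumptions of mine; an untrained network like this hasn't learned anything yet, which is exactly why the training discussed in the next section matters.

```python
# One forward pass through a tiny, untrained network: input -> hidden -> output.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):                      # the nonlinear step between layers
    return np.maximum(0, z)

def softmax(z):                   # turns raw outputs into class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.random(5)                 # input layer: 5 numbers (e.g., pixel values)
W1 = rng.normal(size=(8, 5))      # weights into a hidden layer of 8 units
W2 = rng.normal(size=(3, 8))      # weights into an output layer of 3 classes

hidden = relu(W1 @ x)
output = softmax(W2 @ hidden)
print("Class probabilities:", output.round(3))   # sums to 1
```

A real deep network stacks many such layers and has millions of weights, which is precisely why tracing the data through all of those transformations by hand is practically impossible.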
Machine learning
Back in the day, a machine was just a machine. It did whatever machine things it did, like stamping metal or turning a propeller, or maybe washing your clothes with a fair amount of help on your part. But nowadays, MACHINES HAVE TO DO MORE THAN JUST THEIR GIVEN MECHANICAL FUNCTION. Now a washing machine's supposed to be smart. It's supposed to learn about you and how you like your clothes, it's supposed to adjust its functions according to its sensors, and it's supposed to send you a gentle message on your phone when it's all done taking care of everything. This is a big change, not just for washing machines, but for so many other machines, and for data science processes as well. This gets to the issue of machine learning, and a very simple definition of that is the ability of algorithms to learn from data, and to learn in such a way that they can improve their function in the future. Now, LEARNING IS a pretty UNIVERSAL THING. Here's how humans learn. For humans, MEMORIZATION is hard. I know this, I teach, and MEMORIZATION is something my students STRUGGLE with every semester. On the other hand, SPOTTING PATTERNS is often pretty EASY FOR HUMANS, as is reacting well and adaptively to new situations that resemble the old ones in many, but not all, ways. The way that machines learn is a little bit different. Unlike humans, MEMORIZATION is really EASY FOR MACHINES. You can give them a million digits, and they'll remember them perfectly and give them right back to you. But FOR A MACHINE, for an algorithm, SPOTTING PATTERNS, whether it's a visual pattern or a pattern over time, is much HARDER. And new situations can be very challenging for algorithms, which have to take what they learned previously and adapt it to something that may differ in a few significant ways. But the general idea is that ONCE YOU FIGURE OUT HOW MACHINES LEARN, and the ways that you can work with that, you CAN DO SOME USEFUL THINGS. So, for instance, there's SPAM EMAIL: you get a new email and the algorithm can tell whether it's spam. I use a couple of different email providers, and I will tell you, some of them are much better at this than others. There's also IMAGE IDENTIFICATION, for instance telling whether this is a human face, or whose face it is. Or there's the translation of languages, where you enter text, either written or spoken, and it translates it, a very complicated task for humans, but something machines have learned how to do much better than they used to. It's still not 100%, but it's getting closer all the time. Now, the important thing here is that you're not specifying all the criteria in each of these examples, and you're not laying out a giant collection of if-this-then-that statements in a flowchart. That would be something called AN EXPERT SYSTEM. Those were created several decades ago and have been found to have limited utility, and they're certainly not responsible for the modern developments of machine learning. Instead, a more common approach is to just TEACH YOUR MACHINE. You TRAIN IT. And the way you do that is you show the algorithm millions of labeled examples. If you're trying to teach it to identify photos of cats versus other animals, you give it millions of photos and you say, this is a cat, this is a cat, this is not, this is not, this is. And then the algorithm finds its own distinctive features that are consistent across at least many of the examples of cats.
Now what's important here is that the features, the things in the pictures that the algorithm latches onto, may not be relevant to humans. We look at things like the eyes and the whiskers and the nose and the ears. The algorithm might be looking at the curve on the outside of the cheek relative to the height of one ear compared to the other. It might be looking just at a small patch of lines around the nose. Those may not be the things that humans latch onto, and sometimes they're not even visible to humans. It turns out that algorithms can find things that are very subtle: pixel-by-pixel changes in images, or very faint sounds in audio clips, or individual letters in text. And they can respond to those. That's both a blessing and a curse. It means that an algorithm can find things that humans don't, but it ALSO CAN REACT IN STRANGE WAYS OCCASIONALLY. But once you take all this training, you give your algorithm millions of labeled examples and it starts classifying things, well, THEN YOU WANT TO USE SOMETHING like a NEURAL NETWORK, which has been responsible for THE MAJOR GROWTH in MACHINE LEARNING and DATA SCIENCE in the past five years or so. These diagrams here are different layouts of possible neural networks that go from the left to the right. Some of them circle around, or return back to where they were, but all of these are different ways of taking the information and processing it. Now, the theory of neural networks, or artificial neural networks, has existed for years. The theory is not new. What's different, however, is that computing power has recently caught up to the demands that the theory places, and in addition the AVAILABILITY OF LABELED DATA, primarily thanks to social media, has recently caught up too. And so now we have this perfect combination. The theory has existed, but the computing power and the raw data that it needs have both arrived to make it possible to do these computations that in many ways resemble what goes on in the human brain, and that allow the network to think creatively about the data, find its own patterns, and label things. Now, I do want to say something about THE RELATIONSHIP BETWEEN DATA SCIENCE and MACHINE LEARNING. DATA SCIENCE can definitely BE DONE WITHOUT MACHINE LEARNING. Any traditional classification task, a logistic regression or a decision tree, is not usually machine learning, and it's very effective data science. The same goes for most predictive models, or even something like a sentiment analysis of social media text. On the other hand, MACHINE LEARNING WITHOUT DATA SCIENCE, well, you know, NOT SO MUCH. It's possible to do machine learning without extensive domain expertise, which is one element of data science, but you would nearly always want to do this in collaboration with some sort of topical expert. Mostly I like to think of MACHINE LEARNING AS A SUBDISCIPLINE OF DATA SCIENCE. And that brings up one more thing I want to say. The neural networks, and the DEEP LEARNING NEURAL NETWORKS IN PARTICULAR, that have been RESPONSIBLE for NEARLY ALL OF THESE AMAZING DEVELOPMENTS in machine learning are a little bit of a black box, which means it's hard to know exactly what the algorithm is looking at or how it's processing the data. And one result of that is IT KIND OF LIMITS YOUR ABILITY TO INTERPRET WHAT'S GOING ON, even though the predictions and classifications can be amazingly accurate.
I'll say more about NEURAL NETWORKS and THESE ISSUES elsewhere, but they highlight the trade-offs, the potential, and the compromises that are inherent in some of these really exciting developments that have been taking place in one extraordinarily influential part of the data science world.
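To make the teach-your-machine idea from this section a little more concrete, here is a minimal sketch, assuming Python with the scikit-learn library and using synthetic numbers in place of labeled photos; the point is only the pattern of fitting a model on labeled examples and then checking it on examples it has never seen.

```python
# A minimal sketch of "training" a classifier on labeled examples.
# Synthetic numeric features stand in for image pixels; labels mark cat vs. not-cat.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Fabricated example data: 5,000 labeled cases with 20 features each.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A small neural network; during training it finds its own distinguishing features.
model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
model.fit(X_train, y_train)

# Evaluate on examples the model has never seen.
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```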
Actionable insights
If you're working in a startup or, really, any entrepreneurial or organizational setting, then you know that your work is all about getting results. And that brings up something that I mentioned earlier in this course and want to mention again, from one of my heroes, the American psychologist and philosopher William James, who said, "My thinking is first and last and always for the sake of my doing." His point is that human cognition is designed to fulfill goals, to help us reach a particular end. And I like to summarize that and apply it to data science with this thought: data and data science are for doing. They exist to help us do things. And the reason we do these projects is to help us accomplish things that are important to us. Remember, when you did the project, there was a goal, there was some motivation. What motivated the project? What sorts of things did you stick up on the wall? These are the questions you wanted answered. Why was the project conducted? The goal is usually to direct some kind of particular action. Should we open a new store over here? Should we reduce the price over here? Should we partner up with this other organization over here? And your analysis should be able to guide those actions. And so, can you remember what those clear questions were? And can you give a clear, well-articulated, and justifiable response to those questions based on your data science project? When you do, there are a few things you want to keep in mind. -You need to focus on things that are CONTROLLABLE. The analysis might say that companies founded in the 80s have greater success, but if your company was founded in 2015, there's not much you can do about that. So focus on something that is under your control, and try to make it something specific. -Also, be PRACTICAL. Think about the return on investment, or ROI. You need to work on things and recommend actions where the impact will be large enough to justify the effort. -Also, if you're giving recommendations to a client, make sure it's something that they are actually capable of doing. -And then you want to BUILD UP, you want to have sequential steps. You want to make a small recommendation, carry through on it, and then build on it as you see the results of each earlier step. The data science project is designed to fulfill all of these requirements in a way that the benefits will be visible to the client as you help them find an answer to their question. And if you can get that done, then you've done exactly what you meant to do when you started in data science. And that is worthy of an office celebration, so congratulations.
Creating data
- [Instructor] Sometimes, you need something special, something that's not already there. In the data science world, there's a lot of data that you can get from in-house data, open data, APIs, and even data scraping. But if you still can't get the data you need to answer the questions you care about, then you can go the DIY route and gather your own data. There are several different ways to go about this. The first one I would recommend is just natural observation. See what people are doing. Go outside. See what's happening in the real world. Or observe online. See what people are saying about the topics that you're interested in. Just watching is going to be the first and most helpful way of gathering new data. Once you've watched a little bit of what's happening, you can try having informal discussions with, for instance, potential clients. You can do this in person in a one-on-one or a focus group setting. You can do it online through email or through chat. And this time, you're asking specific questions to get the information you need to focus your own projects. If you've gone through that, you might consider doing formal interviews. This is where you have a list of questions, things you are specifically trying to focus on, and you're getting direct feedback from potential customers and clients. And if you want to go one step beyond that, you can do surveys or questionnaires. You can start asking closed-ended questions: ask people to check off ratings like excellent or good, or ask them to say yes or no, whether they would recommend something in particular. You usually don't want to do that, however, until you've done the other things ahead of time, 'cause it makes a lot of assumptions. The idea here is that you already know what the range of responses is. So make sure you don't get ahead of yourself with this one. Throughout all of this, one general principle is important, especially in preliminary research, and that is that words are greater than numbers. When you're trying to understand people's experience, be as open-ended as possible. Let them describe it in their own terms. You can narrow things down and quantify them later by counting, for instance, how many people gave this general response or that general response, but start by letting people express themselves as freely as possible. Also, a couple of things to keep in mind when you're trying to gather your own data. Number one is don't get ahead of yourself. Again, start with the very big picture and then slowly narrow it down. Don't do a survey until you've done interviews. Don't do interviews until you've watched how people are behaving. Start with the general information and then move to more specific things as you focus on the things that you can actually have some influence on, the actionable steps in your own projects. Also, be respectful of people's time and information. They're giving you something of value, so make sure that you are respectful. Don't take more time than you need to. Don't gather information you don't need. Gather just what is necessary to help you get the insights you need in your own particular project. Now, another method of gathering data that can be extremely useful is an experiment. I come from a background in experimental social psychology, where experiments are very time-consuming and very labor-intensive. But that's not what we're talking about in the e-commerce and tech world. Usually here, we're referring to what's called A/B testing.
Where, for instance, you prepare two different versions of a website or a landing page and you see how many people click on things as a result of those two different versions. That's extremely quick, easy, and effective, and people should constantly be doing it to try to refine and focus on the most effective elements of their website. But through all of this, there is one overarching principle. When you're gathering data, when you're engaging with people, seriously, keep it simple. Simpler is better; more focused is more efficient; and you're going to be better able to interpret it and get useful next steps out of the whole thing. I also want to mention two elements of research ethics that are especially important when you are gathering new data. Number one is informed consent. When you're gathering data from people, they need to know what you want from them. And they also need to know what you're going to do with it, so they can make an informed decision about whether they want to participate. Remember, they're giving you something of value. This is something that needs to be their own free choice. The second one, which can be expressed in many different ways, is privacy, also sometimes confidentiality or anonymity. You need to keep identifiers to a minimum. Don't gather information that you don't need, and keep the data confidential and protected. Now, that's actually a challenge, because when you've got this kind of data, it's often of value and it's easy for it to get compromised. So maintaining privacy really is incumbent upon the researcher, and it's part of building the trust and the good faith for people to work with you and others again in the future. And then finally, to repeat something I said before, people are sharing something of value with you. So it's always a good idea to show gratitude in response, both by saying thank you in the short term, and also by providing them with better, more useful services that are going to make things better in their own lives.
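To make the A/B testing idea a little more concrete, here is a minimal sketch, assuming Python with the statsmodels library and using made-up click counts; it simply asks whether the difference between the two page versions is bigger than chance alone would explain.

```python
# A minimal A/B test sketch: compare click-through rates on two page versions.
# The counts below are fabricated for illustration.
from statsmodels.stats.proportion import proportions_ztest

clicks = [230, 270]      # conversions on version A and version B
visitors = [5000, 5000]  # visitors shown each version

# Two-proportion z-test: is the difference in rates bigger than chance would explain?
stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
print(f"Rate A: {clicks[0]/visitors[0]:.3f}, Rate B: {clicks[1]/visitors[1]:.3f}")
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```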
Aggregating models
- [Instructor] There's a saying that life imitates art. Well, for some time, as I've been preparing for this presentation, I planned to talk about how people estimate the amount of money in a jar of coins. And then it turned out that our family inherited a giant jar of coins from my mother-in-law, and we asked everybody in our extended family for their guesses as to how much money was in the jar. I guessed $165, but my wife, who was much more optimistic, guessed $642.50. Well, we eventually took the jar to the money counter and it turned out that it had $476 in it, which, as it happens, is almost exactly halfway between the guess that I made and the one my wife made. Any one guess, it turns out, is like an arrow shot at a target. It may be high or low, it may be more or less accurate, but if you take many guesses and average them, like averaging the guesses that my wife and I each made, the errors tend to cancel out and you end up with a composite estimate that's generally closer to the true value than any one guess. And that was true in our situation. This is sometimes called the wisdom of the crowd, although in statistics it's a function of something called the central limit theorem and the way sampling distributions behave. But the idea is that combining information is nearly always better than individual bits and pieces of information. In data science, the wisdom of the crowd comes in the form of combining the results of several different models of the same dataset, all trying to predict the same outcome: maybe a linear regression, and a lasso regression, and a decision tree, and a neural network, all using the same data and all predicting the same outcome. Now, one reason why it can be helpful to combine estimates is because, just as some people will underestimate the number of coins in a jar like I did and some people will overestimate like my wife did, machine learning algorithms can also miss the mark in a few ways. There can be underfitting, or bias. That's where your model is too simple and it loses key detail in the data. It doesn't work to simply say, well, if you have this many years of education, your expected income is this level. That's just a straight-ahead regression line, and it turns out that it misses a huge amount of the variation in the data and it doesn't generalize well. You may develop the model on your training data, but it just kind of falls flat when you try to apply it to a larger dataset. The flip side of underfitting, or bias, in your machine learning algorithms is overfitting, where you have the problem of variance. This is where your model is too complicated; it follows every twist and turn in the data and matches it too closely. And it turns out that that also doesn't generalize well. You get a model that has a thousand different parameters when you're trying to predict something that should only involve four things. It doesn't serve the purpose of simplifying, which is the general goal of analysis. And again, it doesn't translate well from your training data, where you built the model, to the real world. You can think of it this way. Here we have a little scatterplot that has a predictor variable across the bottom and an output variable on the side. And you can see that it's generally uphill, but it curves down at the end on the right side.
Well, the underfit model, the bad model here that has too much bias, just draws a straight line. And yeah, there is a straight trend in there somewhere, but it's missing the curve. On the far right we have a model that has been overfit and has too much variance, where we've got a line that's looping through every single point. And so, yeah, the line connects all the points, but there's no way it's going to work that exact same way with more data from the same general population. Instead, you want the one that's in the middle. That is an optimal level that balances between the bias of the underfit model and the variance of the overfit model, one that generally comes through the middle and shows you that there's that descending curve at the end. Another way to look at it is this chart, which shows the model complexity across the bottom. Think of a straight linear regression on the left and some super complicated neural network on the right. And then the prediction error goes up the side, where more error is bad. What you have is the very, very simple model, which has high bias. It's got a lot of error because it's too simple, but that error drops as complexity increases. That's the blue line dropping down. On the other hand, a model with high variance gets more and more error as the model gets more complex. And so, ideally, what you're going to do is take a function of those two curves, the blue descending one for bias and the red ascending one for variance, combine them, and you get this kind of smooth U-shaped curve in yellow. And you want to pick the part where that is the lowest. So, you're going to pick a level of model complexity that gives you the lowest possible value for prediction error based on the methods that you applied to your data. Now, this trade-off between bias and variance gives rise to the saying "no free lunch," which, in the machine learning world, means that no one algorithm is always going to be the best. No one choice will always minimize the error and give you the best solution to the bias-variance trade-off. On the other hand, there are several things that you can do to try to get a good solution by combining the estimates. So, for instance, there's bagging, or bootstrap aggregating, where you pull multiple random samples from your training dataset, you build several iterations of the same algorithm, like a decision tree, to predict the outcome on those different samples, and then you combine the estimates, perhaps by using the most common outcome category or by averaging the estimates, a bit like voting. You can also use boosting. This is a sequential process where in the first step a model is built, for example, to classify cases. In the second step, a model is built for the cases that the first step got wrong, the misclassifications, and it goes on at each step, building new models for the cases that the previous model missed. And then finally, there's stacking. This is a way of using several different algorithms, like a logistic regression, K-nearest neighbors, a decision tree, and a neural network, to classify cases, and then using a higher-level algorithm to build a weighted model that uses the predictions of those first-level models to come up with an overall prediction. All three of these approaches, bagging, boosting, and stacking, have proven to be very effective under different circumstances for combining different predictions to get the wisdom of the crowd in data science.
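As a rough illustration of those three approaches, here is a short sketch, assuming Python with scikit-learn and synthetic data; it compares bagged trees, boosted trees, and a stacked ensemble on the same classification task.

```python
# A rough sketch of bagging, boosting, and stacking with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Fabricated classification data standing in for a real training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    # Bagging: many trees (the default base learner), each fit to a bootstrap sample,
    # with their votes combined.
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # Boosting: trees built one after another, each focusing on the earlier mistakes.
    "boosting": GradientBoostingClassifier(random_state=0),
    # Stacking: several different algorithms combined by a higher-level model.
    "stacking": StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("knn", KNeighborsClassifier()),
                    ("tree", DecisionTreeClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy = {scores.mean():.3f}")
```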
Now, there are some tremendous benefits to combining models in this way. The first is that you get multiple perspectives, several different ways of looking at the data. Again, no one system is going to be ideal for everything, and so it behooves you to try several different models to see how well they work with the particular task at hand. And by doing so, they allow you to find the signal amid the noise. The reason that works is that each system will generally bring its own noise, but that noise tends to cancel out across models, and what you are left with is something that is usually more stable and more generalizable. The idea is that by combining the estimates you get from each of these models, you are able to find an optimal solution to the bias-variance trade-off. And so, you can think of it as a kind of cooperation between the models, the idea that many eyes on the same problem can lead to the best possible solution, which is why you got involved in data science in the first place.
Generative adversarial networks (GANs)
- [Instructor] This painting is called Edmond de Belamy, and it's notable for a few reasons aside from being pretty, smudgy, and impressionistic. First, it sold in 2018 at auction for over $400,000, and that's a lot of money by most people's standards. Second, if you take a closer look at the bottom right of the picture, where the artist typically signs their name, you'll see not a name but a mathematical formula. In fact, this formula is part of the computer algorithm that produced the painting based on its experience viewing over 15,000 portraits from WikiArt. This was an algorithmically produced work of art. It was created with what is called a generative adversarial network, and that actually consists of two pieces. First, there is a generative network, or a generator. This is an algorithm that produces novel output based on experience with training data. So you show it a whole bunch of information, like a bunch of pictures, and it produces something that is inspired by those other ones, maybe with some kind of random or evolutionary variation on it, and it outputs that. Now, the most common approach here is what's called a deconvolutional neural network. But the output from this network is then sent to a separate network. This is a discriminative network, or a discriminator, and this is an algorithm that attempts to distinguish the generated output from non-generated output. So it has also seen a whole lot of real paintings, and it tries to tell the ones that a computer created from the ones that humans created. And the most common approach here is what's called a convolutional neural network, which is frequently used for analyzing visual data. And so you have these two networks, the generator and the discriminator, in competition with each other. It's a little bit like when you're playing soccer and you're going for that penalty kick: you've got the offense trying to get the ball into the net and the defense trying to stop the ball and keep it out. It is a competition. And one of the interesting things about competition, as we all know, is that it makes things advance. So there's a progression. The generator gets better and better at creating data that fools the discriminator. At the same time, the discriminator gets better and better at spotting the generated data. So it's kind of like an arms race: they each improve, and what you get is data that's more and more realistic. For instance, every one of these photos here off to the side, these are not real people. These are portraits that were generated by a generative adversarial network, and you can see it's really effective. Now, there are a number of use cases, and I'll start with what you might call the legitimate use cases. You can use these generative adversarial networks, or GANs, to upscale images and videos. Say you've got a low-resolution photo or a low-resolution digital video; a GAN can fill in the details enough to make everything high-resolution 4K. GANs can be used to create images for fashion and for design visualizations, like these shoes. I don't even know if these are real shoes. A GAN could create products that look real and images that look real. They're also useful for creating realistic animated movies. Several very large-budget Hollywood movies have included extremely realistic animation that was created using GANs.
And then they can also be used in various kinds of scientific simulations and even in molecular research. So this creative process, which generates novel data and then sends it to a discriminator that tries to distinguish between the two so the generator has to raise its game, is very productive in a lot of ways. On the other hand, everybody knows that there are some potential problems with this. GANs make it possible to create fake images, fake videos, fake audio, fake text, and even fake datasets. And then you can also use them to create fake social media profiles and posts, and to place fake phone calls, send fake text messages, and so on. It actually becomes really difficult sometimes to know what's real and what's not; it's easy to take an image of one person and then create a video of them saying whatever you want them to say, in their own voice. And so there is a very serious problematic element to this. That doesn't mean we shouldn't have GANs. GANs are able to do so much of the magic of data science. And I like to think that with great promise, the ability to have computers do things that we just didn't even think were possible, comes great risk and a great need to be mindful of how this is used. And so there are laws in certain locations that ban the use of GANs to generate, for instance, fake images of actual people as a kind of defamation, but there needs to be a broader societal discussion about what is acceptable, what is not, and what the guardrails are to keep this particular technology in line while allowing us to benefit from its amazing abilities.
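For readers who want to see the generator-versus-discriminator competition in code, here is a bare-bones sketch, assuming Python with PyTorch; it uses a toy one-dimensional distribution rather than images, so every network size and number in it is illustrative, not anything from the painting example.

```python
# A bare-bones GAN sketch in PyTorch: a generator learns to mimic a simple 1-D
# Gaussian "real" distribution while a discriminator tries to tell real from fake.
import torch
import torch.nn as nn

torch.manual_seed(0)

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0   # "real" data: mean 3.0, sd 0.5
    fake = generator(torch.randn(64, 8))    # generator output from random noise

    # Train the discriminator to label real data as 1 and generated data as 0.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator (make it output 1 for fakes).
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

samples = generator(torch.randn(1000, 8)).detach()
print("Generated mean/sd:", samples.mean().item(), samples.std().item())  # should approach 3.0 / 0.5
```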
The derivation of rules from data analysis
- [Instructor] A dancer has to spend years working on their craft to deliver a masterful performance. One of the paradoxes of this training is that sometimes you have to think a little bit less in order to move better. Your conscious processes can interfere with fluid and meaningful movement. Sometimes, you just have to calm down all your ideas about expert decision-making systems and the rules that they bring along, and let the data have a say in how you should go about your work. We'll start by looking at linear regression, which is a common and powerful technique for combining many variables in an equation to predict a single outcome, the same way that many different streams can all combine into a single river. We'll do this by looking at an example based on a data science salary survey. And so this is based on real data, and the coefficients are based on the actual analysis, although I'm only showing a few of the variables that went into the equation. What the researchers found is that you could predict a person's salary in data science by first starting with a value of $30,500 per year. That's the intercept. And then for each year above 18, so take their age and subtract 18, you add $1,400. And then to that, you add $5,900 for each point on a five-point bargaining scale from one to five. So people who were better at bargaining made more money each year. And to that, you add $380 for each hour per week that they work through the year. Taken together, you can combine that information, age and bargaining ability and time spent working, to make a single prediction about their salary working in data science. It's an easy way to take these multiple sources and combine them using this rule, which comes from the data that you fed it, to best predict the one outcome. Another method that's frequently used in data science is what's called a decision tree. This is a whole sequence of binary decisions based on your data that combine to predict an outcome. It's called a tree because it branches out from one decision to the next. In this example, we'll look at a simple analysis of the classic dataset on classifying iris flowers as one of three different species, depending on the length and the width of the petal and the sepal. And the decision tree looks like this. It's extremely simple, because there are only four variables in the dataset. What you start with, based on the analysis, is the single most important thing: the length of the petal. If the petal is less than or equal to 1.9 centimeters, then we would predict that 100% of them are going to be Iris setosa, that's a particular species. On the other hand, if the length of the petal is more than 1.9 centimeters, we have to make another decision. That is, we have to look at the width of the petal. And if that width is less than or equal to 1.7 centimeters, then we need to look at the length of the petal again. And if it's less than or equal to 4.8 centimeters, then there's a very high probability that it's a versicolor. And the algorithm used to calculate this works by finding out which variable splits produce the best predictions, and whether you need to split any further. So it's not like somebody sat down and said, "I think length is most important. Let's do that first." Instead, the data drove the entire decision. And you'll notice, by the way, there are only two variables in this.
There's petal length, which appears twice, and petal width. The two other variables in the dataset, the sepal length and the sepal width, don't factor into it at all, not in this simple model. And what this lets you do is it lets you have a data-driven method of classifying observations into one category or another. You can then feed that into your algorithm and use it to process new information. Again, either approach, linear regression or decision trees, or so many others, can give you the information you need for data-based decision-making, either in person or as a part of your data science algorithms. And that will get you on your way.
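Here is a brief sketch of both ideas in Python, assuming scikit-learn for the tree; the salary function simply restates the coefficients described above, and the person fed into it is hypothetical.

```python
# The salary equation from the survey example, written as a plain Python function.
def predicted_salary(age, bargaining, hours_per_week):
    # intercept + $1,400 per year over 18 + $5,900 per bargaining point + $380 per weekly hour
    return 30500 + 1400 * (age - 18) + 5900 * bargaining + 380 * hours_per_week

# A hypothetical person: 35 years old, bargaining score of 3, working 45 hours a week.
print(predicted_salary(age=35, bargaining=3, hours_per_week=45))

# A small decision tree on the classic iris data, similar in spirit to the one described.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# The printed rules show that petal length and petal width drive the splits.
print(export_text(tree, feature_names=list(iris.feature_names)))
```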
Sampling and probability
- [Instructor] AI and data science are amazing, but things don't always go as expected. Algorithms crash, or they can produce severely biased results that lead to a very embarrassing PR problem for your organization. And part of this problem, especially with bias, can be traced to sampling issues. Now, if you want to have a valid and generalizable result, one of the things that you need, not the only one, but one of them, is a representative sample. And the easiest way, at least in theory, to get a representative sample is through simple random sampling, which is not to be confused with haphazard or convenience sampling. A simple random sample draws from a large population, say, for example, a complete list of customer transactions or social media followers, and draws from that list with an equal probability for every case. This is done repeatedly until you have the number of observations that you need for whatever procedure you're running. If you do this, there is a high probability, it's not a hundred percent, but often close enough, that whatever you find in your sample data will generalize well to the population, the entire group that you drew your data from. A related procedure is used whenever people do cross-validation or holdout validation, where sampling is used to separate the data into, for example, training and testing sets for analysis and verification. The idea there is that the model developed with one randomly selected dataset will also function well with the others, and so sampling is important for that. On the other hand, here's something you need to know about sampling and probability. You know that if you flip a coin 10 times, and you do that repeatedly and count the number of heads you get each time, you won't always get the same number, even though the coin is the same. It hasn't changed, and the probability of getting heads hasn't changed either. It's going to be 50% each time. Anyone who has done cross-validation or holdout validation knows that you never get exactly the same model and the same accuracy from one random subset to another, but hopefully they're close. And if you use enough data, if you have a large enough sample to begin with, they should be very close. But this only applies if you have the variability in your data to begin with, which brings up another important point about new populations. The other thing about variability is the stuff that never got captured and put into your dataset in the first place. When a company branches out to a new product line or to a new market, they're going into a kind of terra incognita. You can't automatically assume that the models you've made for one will transfer to the other. You may have a great understanding of how social influence works on, say, Twitter, but that may not transfer well to TikTok. Or you may have an excellent model for predicting the purchasing behavior of 30-somethings on the Pacific coast, but it may not work well with teens on the Gulf coast, let alone with grandparents in Central America. The best algorithm in the world can't predict something that just doesn't appear in your data, which means it is incumbent on you to spread your net wider, to make sure you actually have data that captures the variability that you care about. Also, you need to qualify your results. All of this should serve as a reminder that the results of any analysis or model are inherently probabilistic.
They are not guaranteed to be exactly the same, except maybe in the case of a highly controlled simulation, but that's not what you were looking for in the first place. Every model is an approximation of reality. It responds to reality, but there's always some slippage, some discrepancy, and this slippage is important to keep in mind as we talk about some of the mathematics behind data science, so we don't automatically fall into the idea that if there's an equation, we're obviously dealing with something absolute and indisputable. Keep in mind that you still need to be circumspect. You need to be humble, and sampling, which introduces limits to our analysis just as partitioning datasets does, lets us know that everything has a built-in, inescapable variability that we're going to have to acknowledge and include in our conclusions, so that we can have a model that generalizes and is more applicable to a larger population without getting us in trouble by going too far. Keep your limits in mind and you will be able to deliver better value in your data science.
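A quick way to see that inherent variability is to refit the same model on several different random splits of the same data; here is a minimal sketch, assuming Python and scikit-learn, with synthetic data.

```python
# A quick sketch of sampling variability: the same model, refit on different random
# train/test splits, gives slightly different accuracy each time.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"split {seed}: holdout accuracy = {acc:.3f}")
```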
Dimensionality reduction
- [Instructor] Back in 1957, in his legendary song "Rock and Roll Music," Chuck Berry lamented musicians who make things too complicated and, to quote, "change the beauty of the melody until it sounds just like a symphony." And that's why he loves rock and roll music. The same idea explains why most bands have four people, like The Beatles right here, or possibly three or five. That's enough people, enough instruments, to fill all the sonic regions without overwhelming you with too much information and resulting in cacophony. Jumping way ahead in time, there's a similar problem in data science. Data coming down in a Matrix-like stream looks really cool, but it's hard to get meaning out of it and it's hard to know what to do as a result of it. We need a way to get through the confusion and the haze and pull things into focus. Fortunately, there's a way to do that in data science. The idea of dimension reduction is to reduce the number of variables and the amount of data that you're dealing with. So instead of dealing with dozens or hundreds, or maybe even thousands of variables, you're dealing with a single score, like how likely a person is to behave in a particular way. It sounds counterintuitive, but there are actually some very good reasons for doing this. First off, each variable, each factor or feature, has error associated with it. It doesn't measure exactly what you want; it brings in some other stuff. But when you combine many variables or features, the errors tend to cancel out. So if they're all pointing in slightly different directions, you end up centered on what it is you want. Also, by going from many individual measurements to a single conglomerate measurement, you reduce the effect of something called collinearity, which is the association, the overlap, between predictor variables in the model, and which creates some significant problems. So if you have fewer variables, there are fewer problems with collinearity. Also, not surprisingly, when you have a few features you're dealing with instead of hundreds, you are able to do things faster. Your computer is able to process the information with greater speed. And another really nice consequence is that it improves generalizability. Again, because you're getting rid of, or averaging out, the idiosyncratic variation in each observation and each variable, you're going to get something much more stable that you're able to apply to new situations better. Now, there are two general ways to do this. There are a lot more options, but the two most common are these. Number one is called principal component analysis, often just called principal components or PCA. And the idea here is that you take your multiple correlated variables and you combine them into a single component score. So let's say you give a personality questionnaire and it's got 50 questions on it, with 10 questions for each element of personality. Then you can combine those into five components, if the analysis supports that combination, and you only have five things to deal with as opposed to 50. That's much easier to deal with. Another very common approach is factor analysis, and functionally it works much the same way; people use it for the same thing, although the philosophy behind factor analysis is very different. Here, your goal is to find the underlying common factor that gives rise to multiple indicators.
So in principal component analysis, the variables come first and the component results from them. In factor analysis, the hidden factor comes first and it gives rise to the individual variables. That said, even though they are conceptually very different in that way, people tend to use the two interchangeably. What they let you do is group variables in ways that make sense. Now, there are a lot of variations on methods for dimension reduction. You might, for instance, be engaged in an exploratory analysis, where you're just trying to find out what's there in the data in front of you, or a confirmatory analysis, where you have a known structure and you're trying to see how well your current data fit it. You have different methods for different levels of measurement. If you have something quantitative, where you're looking at how long it takes somebody to do something or the value of their purchases, that's one approach. But if you're counting yes-or-no responses, whether cases fall into a particular category, you're going to need to do something else. Also, there are multiple algorithms, many different ways of measuring the similarity between variables, the ways they overlap, and the degree of variation they share. And so there are some very important details on these, but we can save those for a more detailed treatment in another video. Right now, I want you to know that this possibility exists and that it is worth looking into. I mean, think about it. Dimension reduction in data is like learning to read a language. At first, you just see random shapes. Then you see individual characters, then words, then sentences, and then, finally, ideas. You can go from hundreds of pieces of information, a line here, a circle there, down to just a handful. And that's what you need to get meaning out of your data and to do something useful with it.
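Here is a minimal PCA sketch, assuming Python and scikit-learn, with fabricated questionnaire-style data; it shows ten correlated items collapsing into two component scores.

```python
# A minimal PCA sketch: collapse correlated survey-style items into a few component scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Fabricated data: 300 respondents, 10 items that are noisy copies of 2 underlying traits.
traits = rng.normal(size=(300, 2))
items = np.hstack([traits[:, [0]] + rng.normal(scale=0.5, size=(300, 5)),
                   traits[:, [1]] + rng.normal(scale=0.5, size=(300, 5))])

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(items))

print(scores.shape)                   # (300, 2): ten correlated items reduced to two scores
print(pca.explained_variance_ratio_)  # how much of the variance each component captures
```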
Clustering
- [Instructor] Everybody in a crowd is their own person. Each person is a unique individual. And perhaps in an ideal world, your organization would acknowledge that and interact with each person in a tailored and unique way. But for right now, we face a lot of limitations, and there are plenty of times where it's helpful to create groups or clusters of people who might be similar in important ways. These can include marketing segments, where you might show the same ads or the same offers to a group of people. Or developing curricula for exceptional students, like gifted and talented or artistic students. Or maybe developing treatments for similar medical groups. Now, when you look at clusters in the United States, it's easy to start with each state represented separately, but it's really common practice to group the states into, say, four large regions of geographically adjacent states, like the South, the West, the Northeast, and the Midwest. And that makes a lot of sense if you're actually having to travel around from one to another. But you don't have to group by just what's physically next to each other. For example, a soccer team has matching jerseys and the players coordinate their movement because they're a team. That could serve as the basis of a behavioral cluster as opposed to a geographic one. And you can use a lot of different measures for assessing similarity, not just physical location. You can look at things like a k-dimensional space. So you locate each data point, each observation, in a multidimensional space with k dimensions for k variables. So if you have five variables, k is five. If you have 500, then you have 500 dimensions. What you need to do then is find a way to measure the distance between each point, every point to every other point, and you're looking for clumps and gaps. You can measure distance in a lot of ways. You can use Euclidean distance. That's the standard straight line between points in a multidimensional space. You can use things like Manhattan distance, Jaccard distance, cosine distance, edit distance. There are a lot of choices in how you measure the distance, or the similarity, between points in your dataset. Let me give you an example, though, of cluster analysis in real life. One of my favorite studies is based on what's called the Big Five. I have a background in social and personality psychology, and the Big Five is a group of five very common personality factors that show up in a lot of different places under a lot of different situations. The actual factors are extroversion versus introversion, agreeableness, conscientiousness, neuroticism, which means your emotions change a lot, as opposed to stability, and then openness, specifically openness to new experiences. One study I know actually tried to group the states in the US using these Big Five personality factors. The researchers got information from social media posts, evaluated each state, and created a profile. And from that, they found that the states in the US fell into three very broad groups. The big group there in orange, down in the middle, they called friendly and conventional. The yellow along the West Coast, and actually a little bit on the East Coast, they called relaxed and creative. And the green, which is most of the Northeast but also Texas, is temperamental and uninhibited.
And these are different ways of thinking about the kinds of things that people think about and the ways that they behave. Now, you could use psychological characteristics, or maybe you could group states by, for instance, how they search for things online, which might be more relevant if you're doing e-commerce. So I went to Google Correlate and I chose a dozen search terms that I thought might roughly correspond to the Big Five personality factors, and what that data tells you is the relative popularity of each search term on a state-by-state basis. I then did an analysis in R using what's called hierarchical clustering, where all of the states start together and then it splits them apart one step at a time. And you can see, for instance, that my state of Utah is kind of unusual, sitting by itself over here, but you can see the degree of connection between each of the states, all 48 of them; it doesn't include Alaska or Hawaii because the original researchers didn't have those in their personality data. I could say, give me two groups, and then it groups them this way. You can see we have just these five states over here listed on the right. Or we could say give us three groups, in which case all it does is separate Utah from those other five. But you can go down to a level that seems to work well, something that makes sense and that works with your organization's needs. Now, I want you to know that when you're doing a cluster analysis, you could do hierarchical clustering, which is what I just did. That's a very common approach, but there are a lot of alternatives. You can do something called k-means, or a group centroid model. You can use density models or distribution models, or a linkage clustering model. Again, we have other resources here that will show you the details on each of these and how to carry them out. Mostly, I want you to be aware that these alternatives exist and that they can be useful in different situations for putting your data together. Remember, with any analysis, something like clustering exists to help you decide how you want to do things. So use your own experience and use common sense as you interpret and implement the results of your analysis, and you will get more value and more direction out of it for your own organization.
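Here is a small hierarchical clustering sketch, assuming Python with SciPy rather than the R analysis described above, and using fabricated profiles in place of the Google Correlate data; it builds the tree and then cuts it into a chosen number of groups.

```python
# A minimal hierarchical clustering sketch, in the spirit of the state-by-state example.
# The data here are fabricated "search interest" profiles for a handful of made-up cases.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = rng.normal(size=(12, 6))   # 12 cases, 6 standardized search-term scores each

# Ward linkage builds the tree; cutting it yields a chosen number of groups.
tree = linkage(profiles, method="ward")
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)  # cluster assignment (1-3) for each of the 12 cases
```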
Open data
- [Instructor] I love my public library here in Salt Lake City. And this right here is the main library downtown. It's a beautiful building. It has desks by a wall of windows looking out over the Wasatch Mountains. It's a great place to be, but even more, I love it because it's a library, and they have books there. They have a collection of over half a million books and other resources, including rows and rows of beautiful books on architecture and landscaping that I could never purchase on my own. I love to browse the shelves, find something unexpected and beautiful, go by the windows, and enjoy it. Open data is like the public library of data science. It's where you can get beautiful things that you might not be able to gather or process on your own, and it's all for the public benefit. Basically, it's data that is free, both in the sense that it has no cost and in the sense that you're free to use it: you can integrate it into your projects and do something amazing. Now, there are a few major sources of open data. Number one is government data. Number two is scientific data. And the third one is data from social media and tech companies. And I want to give you a few examples of each of these. Let's start with government sources. In the United States, the biggest and most obvious one is data.gov, which is the home of open data from the U.S. Federal Government, where you can find an enormous number of datasets on anything from consumer spending, to local government, to finance, a wide range of topics that would be really useful for data science. For example, you can look up the Housing Affordability Data System, which is going to have a big impact on where you can employ people to work for you and where potential customers are for your products and services. If you're in the United Kingdom, then data.gov.uk is going to be the major choice. Or if you're in Sweden, you've got open data sources from the government there too. In fact, the Open Knowledge Foundation runs what they call the Global Open Data Index, which gives you information about the open data sources in countries around the world. At the state level you also have open data sources. I live in Utah, and this is utah.gov's Open Data Catalog, and you can even drill down to the city level. It's within the same website, but now I can focus on specific cases for my city. For scientific data, one great source is the ICSU's World Data System, which hosts a wide range of scientific datasets that you can use in your own work. There's also Nature, the major science journal, which has put together a resource called Scientific Data that is designed to both house and facilitate the sharing of the datasets used within this kind of research. There's the Open Science Data Cloud, which, by the way, is not to be confused with the Open Data Science Conference, a wonderful event. And then there's the Center for Open Science, which gives you the ability to house your actual research there. So many opportunities to upload data, to find data, to download and share it, and then share your results with other people. And then, for social data, one of my favorites is Google Trends, where you can look at national trends in search terms. So, for instance, here's one that is looking at the relative popularity over time of the terms data science, machine learning, and artificial intelligence. And then Yahoo Finance is one of the great sources for stock market information.
And then Twitter allows you to search by tags and by users, and it lets you look at things like #datascience; you can download that information by using the Twitter API and include it in your own analysis. So there's an enormous range of possibilities with open data. It's like bringing the best of library availability and teamwork to your own data science projects. So find what's out there, get the information you need to answer your specific questions, and get rolling with the actionable insights you need.
Languages for data science
- [Instructor] I love the saxophone. I play it very badly, but I love to listen to it. And one of the curious things about the saxophone is that if you want to play gigs professionally, you can't play just the saxophone. At the very least, you have to play both the alto and the tenor saxophone, as well as the flute and the clarinet. And for other gigs, you may need to be able to play oboe, English horn, bassoon, bass clarinet, and even recorder and crumhorn, like one of my teachers. You have to be a musical polyglot. I mention this because one of the most common questions in data science is whether you should work in Python or in R, two very common languages for working with data. The reason this question comes up is because programming languages give you immense control over your work in data science. You may find that your questions go beyond the capabilities of data analysis applications, and so the ability to create something custom-tailored that matches your needs exactly, which is the whole point of data science in the first place, is going to be critical. But let me say something more about Python and R. Python is currently the most popular language for data science and machine learning. It's a general-purpose programming language. You can do anything with Python; people do enormous numbers of things that are outside of data science with it. Also, Python code is very clean and very easy to learn, and so there are some great advantages to Python. R, on the other hand, is a programming language that was developed specifically for work in data analysis, and R is still very popular with scientists and with researchers. Now, there are some important technical differences between the two, such as the fact that R works natively with vectorized operations and does non-standard evaluation, while Python manages memory and large datasets better in its default setup. But neither of those is fixed; both of them can be adapted to do other things. And really, the sum of this is that, just as any professional saxophonist is going to need to be able to play several different instruments, any professional data scientist is going to need to be able to work comfortably in several different languages. Those languages can include both Python and R; they can include SQL, or Structured Query Language, or Java or Julia or Scala or MATLAB. Really, all of these serve different purposes. They overlap, but depending on the question that you're trying to answer, the kind of data that you have, and the level at which you're working, you may need to work with some, many, or all of these. Now, I do want to mention one other reason why programming languages are so helpful in data science, and that's because you can expand their functionality with packages. These are collections of code that you can download that give extra functionality or facilitate the entire process of working with data. And often it's the packages that are more influential than the actual language. So something like TensorFlow, which makes it so easy to do deep learning neural networks, can be used in Python or in R, and it's going to facilitate your work. But no matter what language and what packages you use, it is true that the programming languages used in data science are going to give you this really fine-level control over your analysis and let you tailor it to the data and to the question that you have.
Feature selection and creation
- [Instructor] I teach statistics to undergraduate students who don't always see how it connects to their lives. I can give specific examples about each of their fields, but I've found that even the most recalcitrant student can get excited about data when we talk about sports, like baseball. Baseball is a data-friendly sport. It's been going on for over 100 years, there are 162 games in the regular season, and they count everything. If you're trying to figure out how good, for example, a particular batter is, you can start with these basic bits of data, and you'll have an enormous amount of information to work with. These are the features in the dataset that you start with. But if you're a coach or a manager, you can do a lot more than just use those raw data points to make a strategy. You can start combining them to create new features in your dataset and finding value and possibilities in your team. Now, you can start with really simple ones. This is the batting average, and all it is is the number of hits divided by the number of at-bats. And you need to know that those are defined in particular ways, but it's just one number divided by the other. Or you can get something more sophisticated, like the on-base percentage, where you take three things, the number of hits, the number of bases on balls, and hit-by-pitch, and divide that by four things: at-bats, bases on balls, hit-by-pitch, and sacrifice flies. That gives you a better measure, according to some judgment. Or if you want to jump ahead to the 21st century, you can start getting really fancy with something like Weighted Runs Created Plus, where you have a whole plethora of things you're putting together. And what's interesting is that every one of those is actually its own formula going into it. So that one's complicated, but it's all based on these little bits and pieces of information that are available. Before I go ahead, I want to mention one thing, and that's that feature selection and creation is a different process from the dimension reduction that I mentioned elsewhere. Dimension reduction is often used as part of getting the data ready, so you can then start looking at which features to include in the models you're creating, and that's what we're addressing right now. So given that you have these formulas to create all these new features, to pick the best players or best outcomes, which ones should you actually use when you're making that decision? Which ones have the greatest decision value? Well, if you've seen the movie "Moneyball," which is a dramatized account of how the Oakland A's general manager, Billy Beane, used data to select and assign players, you will remember he keeps directing the scouts toward one primary factor over any other: whether a player reliably gets on base. He had data to drive that decision and to guide him to that focus, although they didn't share that process with us in the movie. But I can tell you basically how it works outside of baseball. There are a few methods that you can use for feature selection and feature creation in your data science projects. So, for instance, you can start with just basic correlation. Is this variable correlated with the outcome, or is that variable correlated? Which one has a bigger correlation? That works, but it's one variable at a time. And because correlation generally looks at linear associations, it has some limits.
Or you could do something called stepwise regression, where you take all of your potential predictor variables, you put them in the computer, and you say, this is the outcome variable. It looks at the correlations and picks the best one, and then it starts doing what are called partial correlations. It's a really easy way to sift through the data. You know, you just hit go, and it's done. The problem, however, is that stepwise regression really capitalizes on chance fluctuations in your dataset, and you can get results that simply are not going to generalize to anything else. And so stepwise is generally considered a poor choice, even if it's an easy one. On the other hand, more modern methods like lasso regression, that's the least absolute shrinkage and selection operator, and ridge regression are better approaches that are more robust to these flukes of chance variation. They give a better impression of each variable's role in the equation and which ones you should emphasize. And if you use something like a neural network, you can look at variable importance. And there are a lot of other ways of evaluating each one of these. But when you're selecting your variables, there are a few things you want to keep in mind. Number one, is it something that you can control? Ideally, if you're trying to bring about a particular outcome, you want to have the ability to make it happen. So look at variables that are under your control, or that you can select, at least. Then look at the ROI, the return on investment. Not everything that can be manipulated or controlled can be controlled easily or inexpensively. And so you need to look at the combined cost and the value, or the benefit, that you get from working with that particular predictor. And the third one is, is it sensible? Does the model make sense? Does it make sense to include this particular variable in your equation? You've got experience, you know your domain. Always keep that in mind as you're evaluating the information that you can use in your model to make predictions. Taken together, that will let you make an informed choice about the best things in your data for predicting and, ideally, for bringing about the things that matter to you and to your organization.
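Here is a short sketch of feature creation followed by lasso-based selection, assuming Python with pandas and scikit-learn; the baseball-style numbers and the outcome variable are entirely fabricated for illustration.

```python
# A sketch of creating a derived feature and then letting a lasso pick among features.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hits": rng.integers(80, 200, n),
    "at_bats": rng.integers(400, 650, n),
    "walks": rng.integers(20, 100, n),
})

# Feature creation: combine raw counts into a ratio feature, like a batting average.
df["batting_avg"] = df["hits"] / df["at_bats"]

# A fabricated outcome that mostly depends on the derived feature.
outcome = 100 * df["batting_avg"] + 0.05 * df["walks"] + rng.normal(scale=1.0, size=n)

X = StandardScaler().fit_transform(df)
model = LassoCV(cv=5).fit(X, outcome)

# Features whose coefficients are shrunk to (near) zero are candidates to drop.
for name, coef in zip(df.columns, model.coef_):
    print(f"{name:12s} {coef: .3f}")
```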
Anomaly detection
- [Instructor] Some time ago, I opened up Launchpad on my Mac, which is a way of launching applications. And it's supposed to look like this. However, this particular time something weird happened. And this is what I saw instead. Now, normally when you get an anomaly like this, you just restart the app or reboot the computer, but it turns out I'm fascinated by generative art, or art that comes through as the result of an algorithm, often with a fair amount of randomness thrown in. So before I restarted things and got it back to normal, I took a screenshot, and I've gone back to it several times. I consider this an excellent example of found generative art. Really, a happy digital glitch, or a fortuitous anomaly. It's also an example of serendipity, or the unexpected insight that can come along. Well-known examples of serendipity include Silly Putty, Velcro, popsicles, and of course the Post-it notes that every office has. You can think about these as trying to find anomalies, unusual things, and latching onto them. Now usually when we talk about anomalies, we talk about things like fraud detection. Is a particular transaction legitimate or fraudulent? You can also use it to detect imminent process failure, like a machine's going to break, or an employee has a heightened risk of burnout or leaving the company. But you can also think of this as a way of identifying cases with potentially untapped value, a new category, a new audience that you can work with. Now, what all of these have in common is a focus on outliers. These are cases that are distant from the others in a multidimensional space. They also can be cases that don't follow an expected pattern or trend over time. Or, in the case of fraud, there may be cases that match known anomalies or other fraudulent cases. Any of these can be ways of identifying anomalies and responding appropriately to them. And when you do that, it brings up the usual suspects, the usual methods for analyzing data in data science, things like regression. Does this particular observation fit well with the prediction, or is there a large error? You can do Bayesian analysis to get a posterior probability that this is a fraudulent transaction. You can do hierarchical clustering or even neural networks as a way of finding how well the data fits these known patterns. And if it doesn't, you may have an anomaly. Now there are a couple of things that make this a little harder than it might be otherwise. Number one is that we are dealing with rare events. By definition, if it's an anomaly, it's not common. So things like fraud are uncommon, and that leads to what are called unbalanced models. When you're trying to predict something that happens only 1% or one-tenth of a percent of the time, you've got to have a huge amount of data, and you have to have a model that can deal well with that kind of categorical imbalance. The second thing is difficult data. You may not be dealing just with a nice SQL database. You may have biometric data. You may have multimedia data. You may have things like time-sensitive signatures, where you have to measure how something unfolds over the course of an event. So as an example of all of this, think about when you've used your credit card to make a purchase online. You, the online store, and your credit card company all have a vested interest in making sure the transaction is legitimate because fraud costs money, it takes time, and it causes headaches.
So your credit card company can take several steps to identify legitimate cases and potential anomalies. They might look at something like the purchase characteristics. What was purchased, for how much, when and where, through what means, and so on. I got a call a few years ago from my credit card company when someone tried to spend several thousand dollars on a hotel several thousand miles from my home. You can also use personal measures, things like biometrics or latency in typing. You can measure a person's height approximately by the angle at which they hold their cell phone, or the mode of transportation they're on by the vibration picked up through the accelerometer. Or, an interesting one: the way a person signs their name, or the way a person moves the mouse to find the cursor on their computer, is, in fact, a signature that can be measured and stored, and new data can be compared against it. And then there are general trends. Are there new scams going around? Are they more common in one area or another? Have scammers found ways around old technology? And so on. Needless to say, fraud detection is a cat and mouse game. So there's constant research and progress in data science to deal with new methods of fraud and to harness the evolving capabilities of machines and algorithms. And that means that you will want to stay in touch with the resources available here to learn the specific details about the latest and greatest methods in data science, for things like fraud detection and any kind of anomaly detection, including the potential for new discoveries via serendipity.
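To tie the transaction example back to the outlier methods above, here's a minimal sketch of anomaly detection with an isolation forest on simulated purchase data; the features, amounts, and contamination setting are purely illustrative, not how any real card network works.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Simulated purchases: [amount in dollars, hour of day]. The bulk of the data is
# ordinary; the last two rows are deliberately unusual.
normal = np.column_stack([rng.normal(60, 20, 500), rng.normal(14, 3, 500)])
unusual = np.array([[4000.0, 3.0], [2500.0, 4.0]])
X = np.vstack([normal, unusual])

# An isolation forest scores how easily each point can be isolated from the rest;
# points that are easy to isolate get flagged as anomalies (-1).
detector = IsolationForest(contamination=0.01, random_state=0)
flags = detector.fit_predict(X)

print("Rows flagged as anomalies:", np.where(flags == -1)[0])
```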
Labeling data
- [Instructor] Sometimes it's nice to know where you are and what's around you. Now, I mentioned elsewhere that the supervised and unsupervised learning approaches differ primarily in terms of whether the data are labeled, and in a machine learning context, the labels are the outcomes that we care about, whether it's a category that you classify something into, or a score of some kind that you're trying to predict. But what those labels do, just like this sign, is orient us in the data. They give us a point of focus, like this art installation that's just a few blocks from my home in Salt Lake City. But there's an art to labeling data, just as there's an art to sorting your laundry so things don't come out with their colors all mixed up, except instead of sorting one basket of laundry, you might be trying to sort a hundred million cases, which is why you need to use something a little more robust than one item at a time. Now, there are a few ways that you can get labels for your data. The easiest is to use data that has labels already, like a financial dataset that already has the successful loan payments marked, or a biological dataset that has the species indicated already. That's kind of like getting a special little gift because you can take it and get up and running on your project right away. Or maybe you can do some kind of automatic labeling, like the automated mail sorting here. The trick is that, A, somebody has to have already developed a method for creating those labels, and B, it's probably not going to be 100% accurate. So you have to think about how much noise, or how many misclassifications, you can tolerate in your data. And third, maybe you can turn to crowdsourcing, through a service like Mechanical Turk or CrowdFlower, where you pay a lot of people a little bit of money to manually categorize data. I know that last one sounds really tedious, and that's because it is tedious, but it's also the source of some of the best-known datasets in the machine learning world. For instance, ImageNet is a database that was created in 2007 for object recognition in digital images. It consisted of people rating 3.2 million images, just to say what was in the pictures. There were over 20,000 different categories of things in those pictures, and they were hand labeled by nearly 50,000 people on Amazon's Mechanical Turk. What this did is it gave the absolutely critical reference source to begin developing enormous models to deal with the complex issue of the categorization of images. Now, another way to do this is CAPTCHAs, or the Completely Automated Public Turing test to tell Computers and Humans Apart, or the "prove you're a human" thing. So for instance, you go to log into something and it tells you to click all the images of shoes, so you click the two shoes, and sometimes it'll ask you to do it again. What's going on here is that this kind of authentication task, which is designed to tell whether you're a human by having people select the objects in photos, isn't just finding out whether you're human. It's also gathering manual classification of photos for use in other tasks. So it's kind of a twofer for them, and you may not even be aware that you're performing this crowdsourced work. It's an example of what's called an embedded training task for embedded training data.
On the other hand, I will mention that there are some ethical issues when you have to keep doing it (I've had to do CAPTCHAs three times in a row) or when it doesn't really seem necessary; it feels like someone's kind of getting some free data from you. But it is a way to get your categorized data. Now, that's if you don't care who does the categorization. On the other hand, there are times when the labels that you need are critical; it can be a matter of life and death. For instance, you need a cancer doctor, an oncologist, to identify potential tumors in CT scans like the ones we have here. In that case, because you need a very, very well-trained expert to rate the data, and it can take a long time, and it's very important to be accurate, you can go through this in steps. You can have a pre-training dataset that might have a very large number of cases that have each been reviewed only one time by one doctor. Take that as the first estimate, understanding that any one doctor could make a mistake, could miss something, or it might be a difficult case. The advantage of this is that you can get a lot of data relatively quickly, and then you can train your machine learning algorithm on this larger dataset with what are called weak labels, that is, labels that are not 100% guaranteed to be accurate. Then you can get another, smaller dataset where, for example, every case has been reviewed by multiple doctors, so you can have very high confidence in the accuracy of the labels, and that dataset can be used to fine-tune the model that came from the first, maybe-not-100%-accurate dataset. So again, you can work in successive approximations until you get something that you are very confident in. Now, all of this is separate from the other issue of bias in labeling. This is a very big issue, and it's much more than simply having noise in the data; it's about missing the target entirely. Sometimes it's a technical issue, but sometimes it's something that has some very serious social repercussions, and there are several things that can give rise to it. One, for instance, is that maybe you just have limited data that you're developing your algorithm with, so the training data doesn't contain the full range of possibilities, like our semi-mythical black swan here. Maybe you don't have the black swan in your training data, because black swans are very rare. I don't think I've ever personally seen one, but it's the sort of thing that could show up in the real data, and your algorithm would miss it. So you need to check whether certain groups or categories are underrepresented or even excluded from your data. You also have to watch out for the potential of confounded categories, where even though this group always goes with that group, you need to be able to tell them apart, because they function differently inside the algorithm. There's also the issue of limited or biased labels, and this is where you have labels created by people with similar cultures and expectations, so they may view things the same way. Take, for instance, the high-end glamping setup here, with a very fancy canvas tent out in the woods. Some people love this, and they would be willing to pay a lot of money for it. Others just want to run away screaming. Now suppose you have a very similar group of people who are rating, for instance, accommodations, and they all give this one a five because it looks so wonderful.
Well, that's going to be biased, because not everybody wants it. Sometimes these are just matters of individual differences, but they can also be cultural differences, in either very small cultures or very large groups, where you can't reliably expect these people to view things the same way as other people. When that happens, the labels that go into the machine learning algorithm can be biased in a way that makes the algorithm less useful for those other groups of people, or potentially outright misleading. And so the idea is that social judgments that happen at the labeling stage can get baked into the algorithms. They become part of it. So you have a very significant responsibility to get accurate and sufficiently diverse data and labels for use in your algorithm. The general rule that you've probably heard before is G-I-G-O, or GIGO: garbage in, garbage out. If you have bad data, or restricted data, or culturally limited data going into your algorithm, you're not going to get very useful results. You might be able to predict for that one group of people that you got the data from. But generally, when you send something out into the world, you're going to want it to be applicable to much more of the world, and so you need to have something that's much more diverse going in. And so the labels that you get become really a matter of both technical accuracy and social justice, and it's something that fully deserves your attention.
Prescriptive analytics
- [Instructor] Sometimes, you just have to do the impossible. About 2,500 years ago, the Greek philosopher Zeno of Elea argued that it was impossible to get from point A to point B, like walking across your room. His reasoning was that before you could get all the way to point B, you first had to get halfway there. And before you could get the rest of the way, you had to go halfway again. The process of getting halfway would occur an infinite number of times, which Zeno said wasn't possible, so you could never get to point B. Now, aside from the fact that Zeno didn't know that you could solve an infinite series problem with calculus, the obvious answer is that people walk from one part of the room to the other all the time. So the theoretically impossible task was obviously possible and accomplished frequently. And that gets us to an interesting issue about cause and effect relationships. Now, strictly speaking, three things need to happen to be able to say that one thing causes another. The first is that there needs to be an observed correlation between the putative cause and the effect. That is, the effect needs to be more likely when the cause is present. If it's not, it can't possibly cause it. The second thing is temporal precedence, and that simply means that the cause needs to come before the effect if it's going to be a cause. And both of those are pretty easy to establish. The first one, you just need something like a correlation coefficient. The second one, you just need to show that the cause happened first. You can set that out pretty easily. But the third one's the kicker: no other explanations for the association between the possible cause and effect. The connection between those two can't be accounted for by anything else. The problem is, that part is theoretically impossible. You can't show that there's nothing else there. And so, while we do pretty well with number one and number two, number three is this huge sticking point. On the other hand, that doesn't mean you can't establish causality. It means you just have to get close enough for practical purposes. Now, let me go back and compare what I'm talking about here with cause and effect to something we've seen previously. I've spoken about predictive analytics. That is where you're focusing on correlations, because correlations are adequate for saying, "If this happens, then this will probably happen as well." And there's a huge amount of work in data science on predictive analytics, and really amazing things have come out of that. On the other hand, prescriptive analytics is about causation. You're trying to specifically focus on things that you can do to make something happen that's important to you. Now, the gold standard for establishing cause and effect is what's called an RCT, or a randomized controlled trial. Theoretically, they're very simple. You assign a bunch of people to one situation or another. You do that randomly. You control all the other conditions, and then you see how things come out. Theoretically, very simple to do. But I can tell you, given my training as an experimental research psychologist, they can be enormously difficult and often complex in practice. And so the theory is nice and clean, but the practice can be really difficult. There is one major exception to that.
And that's A/B testing for web design, where, for instance, you set up your software to have one offer on one version of your website and another offer on another version, and you see which one gets more clicks. That can be automated, and it is an experimental design, and it's randomized. It's an example of what we're looking for, even though it's a very simple one. But something more complex than that, like, for instance, does improving public transportation in a city have a direct effect on the influx of new businesses? That's a huge experiment. That's very, very difficult to do well. And so the gold standard is the randomized controlled trial, but it's often very difficult to do in reality. And that leads you to some of the more practical solutions, the alternatives that can help you get close to a cause and effect conclusion, even if they can't get you 100% of the way. Those include things like what-if simulations. These are ways of manipulating data in a spreadsheet that say, "Well, if this is true and if we have these parameters, then what will we expect?" And then, you can simply see how that matches up with reality a little bit later. You can do optimization models. These are correlational models based on the information you have that say, "If we balance things out, so we spend a certain amount of time and money on this, or if we price things in a particular way, that will maximize an outcome." Again, it's correlational, but it often gives you specific suggestions on what to do based on that past information. You can do what are called cross-lag correlations. This is where you have data at two or more specific points in time, and you're able to see if changes in the cause at time one produce corresponding changes in the effect at time two, and not vice versa. And then, finally, there's the entire category of what are called quasi-experiments. These are a whole host of research designs that let you use correlational data to try to estimate the size of the causal relationship between two variables. On the other hand, one of the easiest ways to isolate causality is simply to do things again and again. Iteration is critical. You may be familiar with this chart, which comes from the agile design process. You design, you develop, you try something out once, you test it, and then you do it again, make a variation, do it again, make a variation. And as you do that, you will come close enough to causality through your repeated experience that you'll be able to isolate it and say, "This particular action is producing the effect that we want." That is the prescriptive analysis. That's the result that you're looking for. And now, let me say something about how prescriptive analytics and data science compare and contrast with one another. Specifically, you can have prescriptive analytics without requiring the full data science toolkit. If you're doing experimental research and you have well-structured data (it's nice and quantitative, you've got complete data, and that includes most automated A/B experiments), you can do a very good prescriptive analysis without needing everything that goes into data science. On the other hand, there are times when you're doing data science without necessarily trying to prescribe a particular plan of action. Predictive and descriptive work fall into that category. That includes things like classifying and clustering, doing trend analysis, and identifying anomalies.
And so, that's when data science doesn't need prescriptive analytics, as opposed to when prescriptive analytics doesn't need data science. And so they are distinguishable fields. On the other hand, I do want to finish with this one thing about causality, which is so central to prescriptive analytics. Establishing causality may be, at least in theory, impossible, but prescriptive analytics can get you close enough for any practical purpose and help put you and your organization on the right path to maximizing the outcomes that are most important to you.
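Before moving on, here's a small sketch of the automated A/B test idea described above: two versions of an offer, their click counts, and a chi-square test of whether the difference looks bigger than chance. The visitor and click numbers are invented.

```python
from scipy.stats import chi2_contingency

# Invented A/B test results: clicks vs. no-clicks for two versions of an offer,
# each shown to 5,000 visitors.
table = [
    [240, 4760],   # version A: 240 clicks
    [310, 4690],   # version B: 310 clicks
]

chi2, p_value, dof, expected = chi2_contingency(table)

print(f"Version A click rate: {240 / 5000:.2%}")
print(f"Version B click rate: {310 / 5000:.2%}")
print(f"p-value: {p_value:.4f}")  # a small p-value suggests the difference is unlikely to be chance alone
```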
APIs
- [Instructor] When you draw a picture or write a letter, chances are that you can draw well with one of your hands, your dominant hand, and not so much with the other. I recently heard someone describe this as having a well-developed API for your dominant hand, but only a clunky one for the non-dominant hand. An API, or application programming interface, isn't a source of data, but rather a way of sharing data. It can take data from one application to another or from a server to your computer. It's the thing that routes the data, translates it, and gets it ready for use. I want to show you a simple example of how this works. And so I've gone to this website that has what's called the JSON placeholder. JSON stands for JavaScript Object Notation. It's a data format. And if we scroll down here, you'll see this little tiny piece of code. And what it says is: go to this web address, get the data there, and then show it, include it. And you can just click on this to see what it does. There's the data in JSON format. If you want to go to just this web address directly, you can, and there's the same data. You can include this information in a Python script or an R script or some other web application that you're developing. It brings the data in and allows you to get up and running very, very quickly. Now, APIs can be used for a million different things. Three very common categories include social APIs that allow you to access data from Twitter or Facebook and other sources, as well as use them as logins for your own sites. Utilities, things like Dropbox and Google Maps, so you can include that information in your own apps. Or commerce: Stripe for processing payments, or Mailchimp for email marketing, or things like Slack, or a million other applications. The data can be open, which means all you need is the address to get it. Or it may be proprietary. Maybe you have to have a subscription or you purchase it, and then you'll need to log in. But the general process is the same. You include this bit of code, and it brings the data in and gets you up and running. You can then use that data in data analysis, so it becomes one step of a data science project. Or maybe you're creating an app. You can make a commercial application that relies on data that it pulls from any of several different APIs, like weather and directions. Really, the idea here is that APIs are about teamwork. APIs facilitate the process of bringing things together and then adding value to your analysis and to the data science-based services that you offer.
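As a quick illustration of what that looks like from Python, here's a minimal sketch using the requests library; I'm assuming the public jsonplaceholder.typicode.com test service as a stand-in for the site shown in the video.

```python
import requests

# A minimal sketch of calling an API from Python. The public
# jsonplaceholder.typicode.com test service stands in for the demo site;
# swap in whatever endpoint you actually need.
url = "https://jsonplaceholder.typicode.com/todos/1"

response = requests.get(url, timeout=10)
response.raise_for_status()      # raise an error if the request failed

data = response.json()           # parse the JSON payload into a Python dictionary
print(data)
```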
The importance of interpretability
- [Instructor] You've done all the work. You've planned the project, you found data, you cleaned and organized the data. You created a model, you validated the model, and you just want to put a bow on it and be done. Well, one thing that you need to consider before you wrap it all up is: who is going to make the decision? Who's using the results and the insights that you got from your analysis? Because you have a couple of choices. One is that maybe you're developing something that is for the benefit and use of algorithms. This is, for instance, a recommendation system, which automatically puts something in front of people, or a mortgage application system, which processes an application immediately while people are still on the website. In that case, the machines are the ones making the decisions, and machines don't need to understand what they're working with. They have the data, and if you've set up the algorithm properly, they can just kind of run with it. Also, machines and algorithms can create complex models, much more complex than a human could easily understand, and implement them directly and immediately. And so if you've done your analysis in such a way that an algorithm is going to be the one working with it, then you don't need to worry too much about how interpretable it is, because the machine's not spending time on that. On the other hand, if you have done your work for the benefit of humans, humans need to understand the principles involved. They need to know why things are happening the way that they are so they can then take that information and reason from the data to apply it to new situations. It's the principles that are going to be important. And so you're going to have to be able to explain that to them as a result of your work in data science. Now, the trick is that some results are easy to interpret. Here's a decision tree I showed you earlier. It's about classifying flowers as one of three different species. You only have to make three decisions. It says, first, look at the petal length. And if it's long, then look at the petal width. And if that is short, then look at the petal length again. And if you do that, you can make a very good classification. This is a very simple system, and it's human accessible. Anybody can work with this. On the other hand, some other results are very difficult to interpret. This is another decision tree that I showed you earlier. It's enormously complicated by regular human standards. You'd have kind of a hard time following through with this. And algorithms that are made in data science, like with deep learning, are vastly more complex than this. And so you're going to have a hard time explaining to somebody else how this works and why it's set up the way it is and what they can do with it. The point of all this is that in your analysis, interpretability is critical. You're telling a story, and you need to be able to make sense of your findings so you can make reasonable and justifiable recommendations. Tell a story that makes sense, that is clear and compelling. Only then can you see the value from your data science project.
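Here's a small illustration of the interpretable end of that spectrum: a shallow decision tree on the classic iris flower data, printed as plain-text rules a person can read top to bottom. It's a sketch, not the exact tree shown in the course.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree stays human-readable: just a handful of petal and sepal checks.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules as plain text so a person can follow them top to bottom.
print(export_text(tree, feature_names=list(iris.feature_names)))
```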
Passive collection of training data
- [Narrator] Some things you can learn faster than others, and food is a good example. A lot of people love eating mussels, but some people eat them, get sick, and then never want to touch them again. And I'm one of those people. This is called a conditioned taste aversion, and it results from something that is psychologically unusual, and that's one-trial learning. You only have to do it once to get it ingrained into your behavior. But if you're working in data science, and especially if you're training a machine learning algorithm, you're going to need a lot more than one trial for learning. You're going to need an enormous amount of labeled training data to get your algorithm up and working properly. One of the interesting things about data science is that gathering enormous amounts of data doesn't always involve enormous amounts of work. In certain respects, you can just sit there and wait for it to come to you. That's the beauty of passive data collection. Now, there are a few examples of this that are pretty common. One, for instance, is photo classification. You don't have to classify the photos. The people who take the photos and load them online will often tag them for you, put titles on them, or share them in a folder. That's classification that comes along with the data, which you can use in machine learning without having to go through the work personally. Or autonomous cars: as they drive, they are gathering enormous amounts of information from the whole plethora of sensors that they have. That information is combined and uploaded to distant servers, and that allows you to improve the way that the automobiles function. Or even health data: people who have smartwatches are able to constantly gather information about their activity, their number of steps, how many flights of stairs they've walked, how long they've slept, their heart rate. All this information is gathered without them doing any extra work. And if you're the provider of an app that measures health, you can get that information directly without having to do anything else for it. Now, there are several benefits to this kind of passive collection of training data. Number one is that you can get a lot of it. You can get enormous amounts of data very quickly simply by setting up the procedure and then letting it roll, either automatically or by outsourcing it to the people who use it. The data that you gather can be either very general or very, very specific. General, like categorizing photos as cats or dogs, or specific, like knowing what a particular person's heart rate is at a particular time of day. There are, on the other hand, some challenges associated with this passive data collection. One, and this is actually a huge issue, is that you need to ensure that you have adequate representation. Take categorizing photos: you need to make sure you have lots of different kinds of photos of lots of different kinds of things and different kinds of people, so you can get all of those categorized and your algorithm understands the diversity of the information it will be encountering. You also need to check for shared meaning. For instance, if a label is something like happy or beautiful, you need to make sure that people interpret it in the same way. You also need to check for limit cases. Think about, for instance, heart rate.
Some people are going to have higher heart rate than others, and you don't want to have an algorithm that always says anything above this level is a heart problem, anything below is fine because that is going to vary from one person to another and one situation to another. But what all of this does together is it helps you assemble the massive amounts of data that you need to get critical work done in data science.
Big data
- [Narrator] There was a time just a few years ago when data science and big data were practically synonymous terms, as were semi-magical words like Hadoop, which brought up all the amazing things happening in data science. But things are a little different now, so it's important to distinguish between the two fields. I'll start by reminding you what we're talking about when we talk about big data. Big data is data that is characterized by any or all of three characteristics: unusual volume, unusual velocity, and unusual variety. Again, singly or together, these can constitute big data. Let me talk about each of these in turn. First, volume. The amount of data that's become available, even over the last five years, is really extraordinary. Take customer transactions at the grocery store: the databases that track these transactions and compile them in consumer loyalty programs have hundreds of billions of rows of data on purchases. GPS data from phones includes information from billions of people, constantly throughout the day. Or scientific data. For instance, this image of the black hole in Messier 87 from the Event Horizon Telescope, released in April of 2019, involved half a ton of hard drives that had to be transported on airplanes to central processing locations, because that was several times faster than trying to use the internet. Any one of these is an overwhelming dataset for normal methods. And that brought about some of the most common technologies associated with big data: distributed file systems, like Hadoop, that made it possible to take these collections that were simply too big to fit on any one computer, any one drive, put them across many, and still be able to integrate them in ways that let you get collective intelligence out of them. Then there's velocity, and the prime culprit in this one is social media. YouTube gets 300 hours of new video uploaded every single minute. It gets about five billion views per day. Instagram had 95 million posts per day, and that was back in 2016, when it only had half as many users as it does now. And Facebook generates about four petabytes of data per day. The data is coming in so fast, it's a fire hose that no common methods that existed before the big data revolution could handle. This required new ways of transporting data, integrating data, and being able to update your analyses constantly to match the new information. And then finally, there's the variety, probably one of the most important elements of big data. That includes things like multimedia data: images and video and audio. Those don't fit into spreadsheets. Or biometric data: facial recognition, your fingerprints, your heart readings, and the way you move the mouse on your computer to find out where the cursor went. That's a distinctive signature that is recorded and identified for each user. And then there's graph data. That's the data about social networks and the connections between people. That requires a very special kind of database. Again, it doesn't fit into the regular rows and columns of a conventional dataset. So all of these posed extraordinary challenges for simply getting the data in, let alone knowing how to process it in useful ways. Now, it is possible to distinguish big data and data science.
For instance, you can do big data without necessarily requiring the full toolkit of data science, which includes computer programming, math and statistics, and domain expertise. So for instance, you might have a large dataset, but if it's structured and very consistent, maybe you don't have to do any special programming. Or you have streaming data. It's coming in very fast, but it only has a few variables, a few kinds of measurements. Again, you can set it up once and kind of run with it as you go. And so that might be considered big data, but it doesn't necessarily require the full range of skills of data science. You can also have data science without big data. And that's any time you have a creative combination of multiple datasets, or you have unstructured text, like social media posts. Or you're doing data visualization. You may not have large datasets with these, but you're definitely going to need the programming ability and the mathematical ability, as well as the topical expertise, to make these work well. And so now that I've distinguished them, I want to return to one particularly important question. You can find this on the internet. And the question is, is big data dead? Because its interest peaked about four or five years ago, and it looks like it's been going down since then. So is big data passé? Is it no longer there? Well, it's actually quite the opposite. It turns out that big data is alive and well. It's everywhere. It has simply become the new normal for data. The practices that it introduced, the techniques that it made possible, are used every single day now in the data science world. And so while it's possible to separate big data and data science, the two have become so integrated now that big data is simply taken for granted as an element of the new normal in the data world.
Descriptive analyses
- [Narrator] When it comes to business decisions, humans and machines approach things very differently. One element of this is that machines have essentially perfect memory. You can give data to them once, and they'll probably give it back to you exactly the same way later. They are also able to see all of the data at once, in detail, in a way that humans can't. On the other hand, they're not very good at spotting general patterns in data. There are some ways around that, but it's not one of the strong points of algorithms. Human decision makers, on the other hand, are very good at finding patterns and connecting the data to outside situations. On the other hand, humans have limited cognitive bandwidth. We can only think of so many things at a time. One of the consequences of that is that we need to simplify the data. We need to narrow it down to a manageable level and try to find the signal in the noise. And so descriptive analyses are one way of doing this. It's a little like cleaning up the mess in your data to find clarity in the meaning of what you have. And I like to think that there are three very general steps to descriptive statistics. Number one, visualize your data: make a graph and look at it. Number two, compute univariate descriptive statistics. This is things like the mean, an easy way of looking at one variable at a time. And then, go on to measures of association, or the connections between the variables in your data. But before I move on, I do want to remind you of my goal in this course. I'm not trying to teach you all of the details of every procedure; rather, I'm trying to give you a map, an overview of what's involved in data science. We have excellent resources here at LinkedIn Learning, and when you find something that looks like it's going to be useful for you, I encourage you to go find some of the other resources that can give you the step-by-step detail you need. Right now, we're trying to get a feel for what is possible and what sorts of things you can integrate. And so with that in mind, let's go back to the first step of descriptive analyses, and that's to start by looking at your data. We're visual animals, and visual displays are very dense in information. So you might try doing something as simple as a histogram. This shows the distribution of scores on a quantitative variable, also sometimes called a continuous variable. The bell curve, which is high in the middle, tapers off nicely to each side, and doesn't have any big outliers, is a common occurrence, and it forms the basis of a lot of methods for analyzing data. On the other hand, if you're working with something like financial data, you're going to have a lot of positively skewed distributions. Most of the numbers are at the low end, and a very small number go very, very high up. Think of the valuations of companies or the cost of houses. That requires a different approach, but it's easy to see by looking at what you have. Or maybe you have negative skew, where most of the people are at the high end and the trailing ones are at the low end. If you think of something like birth weight, that's an example of this. Or maybe you have a U-shaped distribution, where most of the people are either all the way at the right or all the way at the left, and although it's possible for people to be in the middle, there aren't many. That's a little bit like a polarizing movie and the reviews that it gets.
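If you want to try that first step yourself, here's a minimal sketch that draws histograms for two invented variables, one roughly bell-shaped and one positively skewed, the way financial data often is.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two invented variables: one roughly bell-shaped, one positively skewed (like prices).
symmetric = rng.normal(loc=100, scale=15, size=1000)
skewed = rng.lognormal(mean=4, sigma=0.8, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(symmetric, bins=30)
axes[0].set_title("Roughly normal")
axes[1].hist(skewed, bins=30)
axes[1].set_title("Positively skewed")
plt.tight_layout()
plt.show()
```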
But once you get some visualizations, you can look for one number that might be able to represent the entire collection. That's a univariate descriptive. The most common of these is going to be the mode. If each box here represents one data point, the mode is simply the most common score. And that's going to be right here on the left at one, because there are more ones than there are of any other score. Or maybe you want the median, the score that splits the distribution into two equal-size halves. We have six scores down here, and we have six scores right here, so the median is 3.5. That splits the dataset into two equal halves. Or you have the mean. This one actually has a formula: the sum of X divided by N. It also has a geometric expression. The mean is actually the balance point. If you put these as actual boxes on a seesaw, the mean is where it's going to balance. And in this case, it's exactly at four; it's going to rest flat at that point. And so these are very common procedures. I imagine you know them already, but think of them as a good place to start when you're looking at your data. And if you can choose a second number to describe your data, you should consider a measure of variability, which tells you how different the scores are from each other. That can include things like the range, which is simply the distance between the highest and lowest scores; the quartiles, or IQR, which split the data up into 25% groups; and the variance and the standard deviation, two very closely related measures that are used in a lot of statistics. And you will also want to look at associations. So, for instance, this is a scatterplot that shows the association between the psychological characteristics of openness and agreeableness at a state-by-state level. You can look at some measures that give you a numerical description of association, like the correlation coefficient, or regression analysis, like I just barely showed you, or, depending on your data, maybe an odds ratio or a risk ratio. But remember a few things. The data that you're analyzing must be representative of the larger group you're trying to understand. Things like the level of measurement, whether it's nominal, ordinal, interval, or ratio, are going to have an impact on what measures you use and the kinds of inferences you can make. You always need to be attentive to the effect of outliers. If you have one score that's very different from all the others, that's going to throw off a lot of these measures. Also, open-ended scores, where you have something like one, two, three, four, or five-plus, or undefined scores, where somebody started something but didn't finish, can also have a dramatic effect on the data. So you want to screen your data for these things. Now, I can't go into the detail of all of these things right here, but we do have other courses that can do that for you, such as Data Fluency: Exploring and Describing Data. Go there, and go to the other courses available that give you an introduction to these basic concepts of understanding what's going on in your data and describing the patterns that you can find, so you can get started on the further exploration of your data science analyses.
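As a quick worked example, here are those univariate descriptives and a simple correlation computed in pandas; the scores are made up to match the pattern just described (mode 1, median 3.5, mean 4), and the openness and agreeableness values are invented.

```python
import pandas as pd

# Made-up scores chosen to match the pattern described above: mode 1, median 3.5, mean 4.
scores = pd.Series([1, 1, 1, 2, 3, 3, 4, 5, 5, 6, 8, 9])

print("Mode:  ", scores.mode().iloc[0])
print("Median:", scores.median())
print("Mean:  ", scores.mean())
print("Range: ", scores.max() - scores.min())
print("SD:    ", round(scores.std(), 2))

# A simple measure of association between two invented variables.
df = pd.DataFrame({
    "openness":      [3.2, 3.8, 2.9, 4.1, 3.5, 3.0, 4.4, 3.7],
    "agreeableness": [3.9, 3.4, 4.2, 3.1, 3.6, 4.0, 3.0, 3.3],
})
print("Correlation:", round(df["openness"].corr(df["agreeableness"]), 2))
```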
In-house data
- [Presenter] Data science projects can feel like massive, overwhelming undertakings, like epic expeditions. But sometimes you can get started right here, right now; that is, your organization may already have the data that you need. And it may be, for instance, the fastest way to start, because it's already in the format that you need. Also, restrictions may not apply, things like GDPR and FERPA and other privacy regulations. If the data's being used exclusively within the organization that gathered it, for its own purposes, maybe some of those regulations don't apply, which means you have a little more flexibility in what you're able to do. Also, maybe you can talk with the data's creators. Maybe the people who gathered the data in the first place are still there, and you can get some of the details you need about the process. And so between getting up and running right away, maybe having a little more latitude in how you work with the data, and the ability to talk with the people, hopefully, who gathered it, this can be a great way to start. And what it means is that the pieces may fit perfectly. They may have the same code, use the same software, comply with the same standards, and save you a lot of time. And so if that data exists, you are in great shape. On the other hand, when it comes to in-house data, there are downsides. So, for instance, if the data was collected in an ad hoc manner, it may not be well-documented. Maybe it's not documented at all, because nobody ever thought anybody else was going to look at it. It also may not be well-maintained. Maybe they didn't label things, or they didn't tell you what the transformations were, or maybe the data's out of date, but you just can't tell. And of course, the biggest one is that in-house data simply may not exist, and so it's not an option. That said, there's an interesting qualification on this idea of data that does not exist, and it's the concept of dark data. Dark data is data that does in fact exist but isn't used to drive insights or make decisions, or really isn't used at all. And like the dark matter in the universe, which can't be seen directly but is hypothesized to constitute maybe 85% of the matter of the universe, there's a lot of it. There's a lot of dark matter out there, and there's a lot of dark data out there. So let me explain a little bit of what I mean by dark data. A 2019 report by Splunk included the following statistics from a survey of more than 1,300 business managers and IT leaders from around the globe. When asked about their data, they said approximately 12% of their data was actively used; that is, they actually used it to get some insights and make some decisions. On the other hand, about twice as much, 23% of the data, was ROT, which means redundant, obsolete, and trivial. So it's data that exists, but it's kind of useless. And the thing is, it costs money to have this out-of-date data around. In fact, there's an estimate that the cost of this redundant, obsolete, and trivial data could be over 3 trillion dollars. And then finally, the largest chunk by far, 65%, was dark data. This is data that exists, but it's hidden within networks, with people, or on their own machines. Now, other estimates put the percentage of dark data even higher, from 90 to even 99% of all data.
And don't forget, this matters because, A, it takes resources and takes money and equipment to store the data, and there are also regulatory issues and even some risks associated with having very large amounts of data around that you actually don't need for your current purposes. Now, in terms of why people have this dark data, what the challenges are, one is an organization may not have the right tools to capture or to use the data. 85% of people surveyed said that was one of the reasons that they couldn't work with it. The flip side of that is they've got too much data. They are so overwhelmed with the big data fire hose coming in, and they don't have enough analytics or the ability to work with it. 39% of people gave that as a reason. Next is that they simply couldn't use the unstructured data. Remember, things like text, free text, the things that people put on social media or the reviews they put on an e-commerce site, that's valuable, but it is difficult to process, especially if you're working in Excel. And then finally, there's the ever-present risk that the data is dark because it's missing or incomplete, and simply not in a standard to be usable. And 66%, two-thirds of the people, gave that as a reason for their dark data. Now, that said, there is some value in simply turning around, look around, see what's at your organization. Chances are there is at least some data, possibly a very large amount, that is already very well-suited to the projects you have in mind to bring value to your organization. And maybe with some extra work, with some extra exploration, you can find ways to bring in all of that extra data that currently is not being used, that's dark data, to bring more insights and more value to your data science work.
Scraping data
- [Tutor] Watts Towers in Los Angeles is a collection of sculptures and structures by Simon Rodia that are nearly a hundred feet tall and made from things that he found around him: scrap pieces of rebar, pieces of porcelain tile, glass, bottles, seashells, mirrors, broken pottery, and so on. The towers are a testament to what a creative and persistent person can do with the things that they find all around them. Data scraping is, in a sense, the found art of data science. It's when you take the data that's around you, tables on web pages and graphs in newspapers, and integrate that information into your data science work. Unlike the data that's available through APIs, or application programming interfaces, which is specifically designed for sharing, data scraping is for data that isn't necessarily created with that integration in mind. But I need to immediately make a quick statement about ethics and data science. Even though it's possible to scrape data from digital and print sources, there are still legal and ethical constraints that you need to be aware of. For instance, you need to respect people's privacy. If the data is private, you still need to maintain that privacy. You need to respect copyright. Just because something is on the web doesn't mean that you can use it for whatever you want. The idea is that visible doesn't mean open. Just like in an open market, just because it's there in front of you and doesn't have a price tag doesn't mean it's free. There are still important laws, policies, and social practices that need to be respected so you don't get yourself in some very serious trouble. So keep that in mind when you're doing data scraping. Now, for instance, let's say you're at Wikipedia and you find a table with some data that you want. Here's a list of dance companies. You can actually copy and paste this information, but you can also use some very, very simple tools for scraping. In fact, if you want to put it into a Google sheet, there's even a function that's designed specifically for that. All you need to do is open up a Google sheet and then use this function, IMPORTHTML. You just give it the address that you're looking for, say that you're importing a table, and then, if there's more than one table, you have to give it the number. This page only has one, so I can just put in 1, and there's the data. It just fills it in automatically. And that makes it possible for us to get a huge jumpstart on our analysis. Some other kinds of data that you might be interested in scraping include things like online reviews and ranking data. You can use specialized apps for scraping consistently structured data, or you can use packages in programming languages like Python and R. Now, one interesting version of this is taking something like a heat map, or a choropleth, as it's actually called in this particular case. Here's a map that I created in Google Correlate on the relative popularity of the term data science. Now, if you're on Google Correlate, this is interactive. You can hover over it and you can download the data, but right now, this is just a static image. And if you wanted to get data from this, you could, for instance, write a script that would go through the image pixel by pixel and get the color of each pixel.
You can then compare each pixel's color to the scale at the bottom left, get the XY coordinates for each pixel, compare those to a shapefile of the US, and then ultimately put the data in a spreadsheet. It's not an enormous amount of work, and it is a way of recovering the data that was used to create a map like this. If you want to see a really good example of data scraping of a sort, where you're taking image data and getting something useful out of it, here at publicintelligence.net, they're reporting on a project on Syrian refugee camps. What they're using here are satellite photos, and then they're using a machine learning algorithm to count the number of tents within the camp. And that's important because, as you see it go over time, the camp gets much, much larger. This can be used to assess the severity of a humanitarian crisis and how well you can respond to it. But it's the same basic principle of data scraping: taking something that was created in one way, just an image, and then using computer algorithms to extract some information out of it, giving you the ability to get some extra insight and figure out what the next steps in your project need to be.
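And as a quick illustration of table scraping in code, here's a hedged sketch using pandas, which can pull HTML tables from a page much like the IMPORTHTML function in Google Sheets; the exact Wikipedia page is an assumption, and you'll need an HTML parser such as lxml installed.

```python
import pandas as pd

# pandas can pull HTML tables straight from a web page, much like IMPORTHTML in
# Google Sheets. The exact Wikipedia page is an assumption, and read_html needs an
# HTML parser such as lxml or html5lib installed.
url = "https://en.wikipedia.org/wiki/List_of_dance_companies"

tables = pd.read_html(url)          # one DataFrame per <table> found on the page
print(len(tables), "tables found")
print(tables[0].head())             # first few rows of the first table
```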
The CRISP-DM model in data science
So it turns out that I am not necessarily the most organized person in the world. I have, in fact, had times when important documents and even data were stored approximately like this, but we all know that the composting approach to organization makes your DATA PROJECTS a LOT HARDER, which is why some very smart people have spent a lot of time thinking about HOW to ORGANIZE DATA PROJECTS more EFFECTIVELY. And one of the most common approaches in the DATA MINING world is something called CRISP-DM, which is short for Cross-Industry Standard Process for Data Mining. This is a great way of thinking about how to organize data projects, not just in data mining, but really anywhere. CRISP-DM describes a data project as having SIX PHASES. Now, again, it's specifically about data mining, but you can apply it to anything. So the first phase is BUSINESS UNDERSTANDING. The idea here is: what's the business objective? What is your actual goal? Then you do a situation assessment and determine the data mining goal, again trying to know what you're trying to accomplish as you go through the project. And then you produce a project plan; you actually write it down and create the map so you know when you're at the goal. The next phase is DATA UNDERSTANDING. You collect the initial data, you describe the data, you explore it, and you verify the data quality, an important step that, if you don't perform it adequately, can send everything else straight out the window. Then you have DATA PREPARATION. You get your dataset, you select data, you clean the data, and you construct the data by, for instance, creating indicators or factor scores or features that you're going to use to build your models. You can also integrate the data, not just with the current dataset, but with other datasets that are available to you, to bring in extra meaning. And then you FORMAT the DATA in an appropriate way for the tools and the analysis that you're going to conduct. After that is MODELING, where you select a modeling technique, generate a test design, and build the model. So, for instance, you do your neural network or your random forest, and then you assess the model. Often, if you're doing a classification task, that means looking at how accurate it is, or what its level of specificity or sensitivity is under certain conditions. Then you have EVALUATION. You evaluate the results and you review the process. How well did this go? And then you determine the next steps. And then you finish in CRISP-DM with DEPLOYMENT, which I like to think of as the "we're open for business" stage. You plan the DEPLOYMENT, what you're actually going to do with your results, as well as plan the monitoring and maintenance, because what you find is that there is often drift between a model established at one point in time and the way the data evolves, and so you need to check up on it frequently. And then there's PRODUCING a final report for the stakeholders and reviewing the project overall. Taken together, these steps are one very helpful and, in certain fields, well-established way of planning and organizing a data project. They make sure you cover all your bases and check the list so you get done what you need to, and that gives you the best chance of working efficiently the first time through, getting the value from the data, and sending it off to your stakeholders.
The data science Venn diagram
Sometimes the whole is greater than the sum of its parts, and you can see this in a lot of different places. Take for instance, music. You take John, and Paul, and George, and Ringo, all wonderful musicians in their own rights, but put them together and you have the Beatles, and you have revolutionized popular culture. Or, take the fact that everybody has a circle of friends, and basically everybody has the internet now, and you have created social networks, and you have revolutionized the computing world. Or, in 2013, Drew Conway proposed the combination of HACKING SKILLS, that's computer PROGRAMMING, and MATH and STATISTICS and substantive, or topical DOMAIN, expertise, TOGETHER give you DATA SCIENCE, a new field that has revolutionized both the technology and the business world. And I want to talk a little more about why each of those THREE ELEMENTS in the Venn diagram of data science are so important. 1. the HACKING SKILLS, or computer programming, the reason that's important is because you have such -novel sources of data. You have social media and social networks. - You have CHALLENGING FORMATS like the graph data from those social networks, or images, or video that don't fit into the rows and columns of a spreadsheet, -or you have STREAMING DATA like sensor data or live web data that comes in so fast that you can't pause it to analyze it. All of these REQUIRE the CREATIVITY that comes WITH HACKING and the ability to work freely with what you have in front of you. 2. computer programming skills, there's a few things that are very USEFUL in data science. - The ABILITY to WORK WITH with a LANGUAGE like Python or R. These are programming languages that are very frequently used for data manipulation and modeling. - There's C, and C++, and Java. These are general purpose languages that are used for the backend, the foundational elements of data science, and they provide maximum speed. -There's SQL, or SEQUEL, that stands for structured query language. This is a language for working with relational databases to do queries and data manipulation -And then there are packages you can use in other languages like TensorFlow. This is an open source library that's used for deep learning, and that has revolutionized the way that data science is performed right now. And then there's the mathematical elements of data science. First off, there are several -forms of mathematics that are particularly useful in data science. There's PROBABILITY, and LINEAR ALGEBRA, and CALCULUS, and REGRESSION, and I'll talk about some of each of these, but they allow you to do something important. -Number one, they allow you to CHOOSE the PROCEDURES. You want to judge the fit between your question, which is always the first and most important thing, the data that you have available to you, and then you CHOOSE a PROCEDURE that ANSWERS your QUESTIONS based on your data, and -IF you understand the mathematics and how it works, you'll be able to make a much better and more informed choice, and also you'll be able to diagnose problems. MURPHY'S LAW applies in data science, as well as everywhere else, that ANYTHING that can GO WRONG will go WRONG. And you need to know what to do when the PROCEDURES that you've CHOSEN FAIL, or they give sometimes IMPOSSIBLE RESULTS. You need to UNDERSTAND exactly how the DATA is being MANIPULATED, so you can see where the trouble areas are and how to resolve them. 3. And then the third area of CONWAY'S DATA science Venn diagram is SUBSTANTIVE EXPERTISE. 
-Each domain, or topic area, has its own goals, methods, and constraints. If you're working in social media marketing, you're going to have a very different set of goals and methods than if you're working in biomedical informatics. -You need to know what constitutes value in the particular domain you're working in. -And finally, you need to know how to IMPLEMENT the INSIGHTS, because data science is an ACTION-ORIENTED field. It's designed to tell you what to do next to get the most value and provide the best service that you possibly can based on the data that you have. So, taken together, the hacking or programming, the math and statistics, and the substantive expertise are the individual elements or components, the parts, that make up the greater-than-the-sum whole of data science.
Next steps and additional resources
This may be the end of this course, but it's just a beginning for you, and so it's time for you to start making plans on how you're going to produce your own data science revolution. I want you to remember that this course is designed to be a foundation. It's an introduction that gives you a feel for the breadth and depth of data science. I haven't been able to go into detail on these topics, but that's okay. Right here at LinkedIn learning, you have a wealth of additional courses that can give you the step-by-step instruction that you need. Consider LEARNING new things, like for instance how to program in Python or R or how to work with open data or how to build algorithms for machine learning and artificial intelligence. Any of these would be fantastically useful tools and approaches in data science. Also learn how to APPLY the things that you've worked with. Get some courses on data-driven decision-making in business settings, get more information on business strategy, and how the information you use can help any organization make better decisions in their own operations, and then get information on how people work and the elements that are most important in fields like marketing or nonprofits or healthcare or education or whatever is of greatest use and interest to you. Also get connected with the actual people in data science. Go to conferences. There's so many different ways you can do this. For example, you can go to large national general topic conferences like -the Strata Data Conferences, or one of my favorites, - ODSC for the Open Data Science Conference. Both of these meet in multiple locations throughout the year. There's one that's going to be near you that you can go to, or maybe have a specialized topic. - Like for instance, I've gone to the Marketing Analytics and Data Science or MADS conference, which very specifically focuses on data science within this particular marketing realm. Or where you live, you may have local events. -I'm in Utah and each year we have the Silicon Slopes Tech Summit, which has tracks focused on artificial intelligence, machine learning, and data science. But remember, all of this is fundamentally about people and the way that you can both better understand the people in the world around you and turn that understanding into something of value for the people you work with. So thanks for joining me and good luck in the great things that you'll accomplish.