Quiz 1 Prep Predictive Analytics

Why Did Google Flu Trends Fail?

"A huge collection of misinformation cannot produce a small gem of true information" The people who were looking up flu like symptoms themselves dont know if they have the flue, they're completely ignorant to that fact, so this data is useless.

What skills do you need to be a Data Scientist?

1. Ability to handle and navigate business situations
2. Real big data expertise (e.g., being able to process 50 million rows of data in 2 hours)
3. Ability to sense data
4. A distrust of models
5. Knowledge of the curse of big data
6. Ability to communicate with management and understand the problems it is trying to solve
7. Ability to correctly assess the lift or ROI on the salary paid to you
8. Ability to quickly identify a simple, robust, and scalable solution to a problem
9. Ability to convince and drive management in the right direction, sometimes against its will, for the benefit of the company, its users, and its shareholders
10. Good programming skills and algorithm knowledge

Data Preparation Phase

An important and time consuming part of data mining which can take up 50%-80% of the project's time and effort. It involves selecting data to include, cleaning data to improve data quality, constructing new data that may be required, integrating multiple data sets, and formatting data.

What is science?

An organized way of gathering and analyzing evidence about the natural world.

What are the limitations of R?

It was and still is limited to in-memory data processing

Is a Data Analytics project's lifecycle the same as a software engineering lifecycle?

No, they are vastly different, but they do share some similarities.

Discrete Data

Numerical data that has gaps between its possible values.

semi-structured data

Data that falls in between structured and unstructured data and can possibly be converted into structured data.

Examples of Structured Information

Information that is stored in databases, well-formed documents, and XML.

Business Understanding (or Discovery) Phase

Involves determining and defining business objectives in business terms, translating these into data mining goals, and making a project assessment and plan.

Why Google Flu is a failure: Notes

The problem is that most people don't know what "the flu" is, and relying on Google searches by people who may be utterly ignorant about the flu does not produce useful information. Or to put it another way, a huge collection of misinformation cannot produce a small gem of true information. One problem is that Google's scientists have never revealed what search terms they actually use to track the flu. A bigger problem with Google Flu is that most people who think they have "the flu" do not. The vast majority of doctor's office visits for flu-like symptoms turn out to be other viruses. The CDC tracks these visits under "influenza-like illness" because so many turn out to be something else.

Using artificial intelligence to predict legislative votes in the United States Congress

We present in this study an experimental artificial intelligence tool to predict the likelihood of a legislative bill becoming a law. Using historical data of legislative bills, we designed an ensemble of predictive analytics algorithms that can predict whether or not a bill will pass both the Senate and the House of Representatives. Empirical results indicate that a bill's legislative vote could be predicted with an 80% accuracy using AI algorithms.

Examples of Unstructured Information

Web pages, presentations, documents (PDF, DOC), emails, images, videos, blogs, log files.

What is the curse of big data?

With a large enough sample, everything is considered statistically significant even associations that are practically not significant or interesting
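A minimal sketch of this effect on synthetic data (the sample sizes and the 0.005 effect size are arbitrary choices for illustration): the same practically negligible association is "insignificant" at n = 100 but highly "significant" at n = 1,000,000.

```python
# The curse of big data in miniature: a negligible effect becomes
# statistically significant once the sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (100, 1_000_000):
    x = rng.normal(size=n)
    y = 0.005 * x + rng.normal(size=n)   # effect far too small to matter in practice
    r, p = stats.pearsonr(x, y)
    print(f"n={n:>9,}  r={r:+.4f}  p={p:.3g}")
# At n=100 the tiny effect is indistinguishable from noise; at n=1,000,000
# the same negligible effect yields a very small p-value.
```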

Predictive Analytics

a branch of data science that is forward looking; we use historical data to predict events; we use past events to predict future outcomes.

What is big data hubris?

the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis

COVID-19 early-alert signals using human behavior alternative data

Google searches create a window into population-wide thoughts and plans not just of individuals, but populations at large. Since the outbreak of COVID-19 and the non-pharmaceutical interventions introduced to contain it, searches for socially distanced activities have trended. We hypothesize that trends in the volume of search queries related to activities associated with COVID-19 transmission correlate with subsequent COVID-19 caseloads. We present a preliminary analytics framework that examines the relationship between Google search queries and the number of newly confirmed COVID-19 cases in the United States. We designed an experimental tool with search volume indices to track interest in queries related to two themes: isolation and mobility. Our goal was to capture the underlying social dynamics of an unprecedented pandemic using alternative data sources that are new to epidemiology. Our results indicate that the net movement index we defined correlates with COVID-19 weekly new case growth rate with a lag of between 10 and 14 days for the United States at-large, as well as at the state level for 42 out of 50 states, with the exception of 8 states (DE, IA, KS, NE, ND, SD, WV, WY), from March to June 2020. In addition, an increasing caseload was seen over the summer in some southern US states. A sharp rise in mobility indices was followed by a sharp increase, respectively, in the case growth data, as seen in our case study of Arizona, California, Florida, and Texas. A sharp decline in mobility indices is often followed by a sharp decline, respectively, in the case growth data, as seen in our case study of Arizona, California, Florida, Texas, and New York. The digital epidemiology framework presented here aims to discover predictors of the pandemic's curve, which could supplement traditional predictive models and inform early warning systems and public health policies.

What is Hadoop and MapReduce? What is their relationship?

Hadoop is an implementation of MapReduce, much like Java is an implementation of OOP.
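For illustration, a single-machine sketch of the MapReduce programming model that Hadoop implements at scale; the function names below are descriptive labels, not Hadoop APIs.

```python
# Toy MapReduce word count: map emits (key, value) pairs, shuffle groups
# them by key, reduce combines each group into one result.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield word.lower(), 1            # emit ("word", 1) per occurrence

def shuffle(pairs):
    groups = defaultdict(list)           # group values by key, as the
    for key, value in pairs:             # framework does between map and reduce
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)              # combine each key's values

docs = ["the flu is not the cold", "big data is not magic"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)                            # {'the': 2, 'flu': 1, 'is': 2, ...}
```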

What is CRISP-DM?

*CRISP-DM*: *Cr*oss-*I*ndustry *S*tandard *P*rocess for *D*ata *M*ining. A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

Define what a recommender system is

A recommender system, also known as a recommendation system, is a type of information filtering system used in various fields to suggest products, services, or information to users based on their preferences, behavior, and interaction with the system. These systems are used to personalize user experiences, thus improving customer satisfaction and driving business growth. Recommender systems are widely used in various online platforms including e-commerce websites like Amazon, music and movie streaming services like Spotify and Netflix, and social media platforms like Facebook and LinkedIn.

Recommender systems can be categorized into three main types:

1. Collaborative filtering: This method makes recommendations based on patterns of user behavior. The underlying assumption is that if two users agree on one issue, they are likely to agree on others as well. For example, if user A and user B both liked certain books, and user A likes another book, then that book could be recommended to user B.
2. Content-based filtering: This method makes recommendations based on the description of items and a profile of the user's preferences. For example, if a user frequently watches action movies, the system will recommend more action movies.
3. Hybrid methods: These methods combine collaborative filtering and content-based filtering. Hybrid methods can be more effective as they can incorporate the advantages of both methods and overcome certain limitations.

Recommender systems play a crucial role in data-driven businesses, and they are an active area of research in the field of machine learning and artificial intelligence.
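A minimal sketch of the collaborative-filtering idea above, using a tiny hypothetical ratings matrix; real systems use far larger matrices and more careful normalization.

```python
# User-based collaborative filtering: predict a user's missing ratings
# from the ratings of users with similar tastes.
import numpy as np

ratings = np.array([                     # rows = users, cols = items, 0 = unrated
    [5, 4, 0, 1],                        # user A
    [4, 5, 0, 1],                        # user B
    [1, 0, 5, 4],                        # user C
], dtype=float)

def cosine_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 1                               # recommend for user B
sims = np.array([cosine_sim(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0.0                       # exclude self-similarity

for item in np.where(ratings[target] == 0)[0]:
    rated = ratings[:, item] > 0         # users who rated this item
    score = sims[rated] @ ratings[rated, item] / (sims[rated].sum() or 1.0)
    print(f"predicted rating for item {item}: {score:.2f}")
```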

What is a NoSQL database?

A term used to describe high-performance, non-relational databases. NoSQL databases use a variety of data models, including document, graph, key/value, and columnar.
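For illustration only, the four data models named above sketched as plain Python structures; these are stand-ins for the storage engines themselves, not database APIs.

```python
# Key/value: opaque values addressed by a unique key.
key_value = {"user:42": b"...serialized blob..."}

# Document: self-describing, nested records (as in MongoDB).
document = {"_id": 42, "name": "Ada", "orders": [{"sku": "X1", "qty": 2}]}

# Graph: nodes plus typed edges between them.
graph = {"nodes": {"Ada", "Bob"}, "edges": [("Ada", "FOLLOWS", "Bob")]}

# Columnar: values stored column by column rather than row by row.
columnar = {"name": ["Ada", "Bob"], "age": [36, 41]}
```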

Machine Learning

A type of artificial intelligence that leverages massive amounts of data so that computers can improve the accuracy of actions and predictions on their own without additional programming.

Emotion Artificial Intelligence derived from Ensemble Learning

Abstract— We present in this work a predictive analytics framework that can computationally identify and categorize opinions expressed in text to discover and analyze attitudes towards a particular topic or product. We provide a new approach based on an ensemble model of three widely used sentiment analysis algorithms: TextBlob, OpinionFinder and Stanford NLP. In this work we investigated the performance of these latter algorithms on large, real datasets. Then, we designed two ensembles (1) one based on multivariate regression that computes a final prediction from three classification algorithms and (2) an ensemble that is based on majority rule. We computed the accuracy of the ensemble framework on labeled real datasets used in the literature that include tweets, as well as Amazon, Yelp and IMDb movie reviews. Our experiments indicated that the ensemble algorithms outperformed all three sentiment algorithms. The ensemble learning algorithm draws from the strengths of the individual sentiment algorithms, avoiding the need to select just one algorithm, creating a stronger tool for harnessing Emotion AI. This approach creates promises beyond the tweets and reviews analyzed here and can potentially be applied to marketing, finance, politics, and beyond.
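The majority-rule ensemble described in the abstract can be sketched as follows; the three toy keyword "classifiers" are hypothetical stand-ins for TextBlob, OpinionFinder, and Stanford NLP, not the real tools.

```python
# Majority-rule ensemble: each classifier votes, and the most common
# label wins, so no single algorithm has to be trusted on its own.
from collections import Counter

def classifier_a(text): return "positive" if "good" in text else "negative"
def classifier_b(text): return "positive" if "love" in text else "negative"
def classifier_c(text): return "negative" if "bad" in text else "positive"

def majority_vote(text):
    votes = [clf(text) for clf in (classifier_a, classifier_b, classifier_c)]
    return Counter(votes).most_common(1)[0][0]

print(majority_vote("a good movie, I love it"))     # positive (3-0 vote)
print(majority_vote("good plot but a bad ending"))  # negative (2-1 vote)
```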

Twitter mood predicts the stock market

Abstract—Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e. can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.

Explain "Variety"

Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when we consider all these different types of data together

Unsupervised Learning

Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process.

Supervised Learning

Category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

What is data classification?

Data classification in machine learning is a type of supervised learning approach where the computer program learns from the data input given to it and uses this learning to classify new observations. The data set used for learning is called the training set, and the data used to validate the model is called the test set.

In a classification problem, the objective is to assign a data point to one of a number of predefined categories or classes based on its features. For instance, an email can be classified as "spam" or "not spam", or a tumor can be classified as "benign" or "malignant" based on its characteristics.

There are various types of classification algorithms, each with its strengths and weaknesses. Common ones include logistic regression, decision trees, random forests, gradient boosting, support vector machines (SVM), and neural networks, among others. In binary classification, there are only two possible classes. Multiclass classification problems have more than two possible classes, and multilabel classification involves assigning a data point to multiple classes.

Classification models are evaluated based on metrics like accuracy, precision, recall, F1 score, and Area Under the Receiver Operating Characteristic Curve (AUROC), among others. Each of these metrics provides different insights into the performance of the model, and the appropriate metric to use depends on the specific problem and business context.
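A minimal sketch of this workflow with scikit-learn (assumed installed); the built-in breast-cancer data set stands in for the benign/malignant example above.

```python
# Binary classification: learn from a training set, validate on a test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)         # benign vs. malignant tumors

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)           # training set / test set

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))    # two of the metrics
print("F1 score:", f1_score(y_test, pred))          # discussed above
```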

data clustering

Data clustering is a technique used in machine learning and statistics to group data points or items into subsets or "clusters" based on similarities or shared characteristics. The goal of clustering is to ensure that data points in the same cluster are as similar as possible (according to some defined similarity measure), and data points in different clusters are as dissimilar as possible.

Clustering is a form of unsupervised learning, which means it does not rely on pre-labeled examples to learn the relationships within the data. Instead, it identifies patterns and structures in the data itself. This distinguishes it from supervised learning methods such as classification, which require labeled data to train the model.

Common clustering methods include K-means clustering, hierarchical clustering, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Each of these methods has its own strengths and weaknesses, and the appropriate method to use depends on the specific nature and requirements of the data and the problem at hand.

It is important to note that the results of clustering can be highly dependent on the choice of similarity measure, the clustering algorithm, and the specific parameters chosen for the algorithm. Thus, careful validation and interpretation of clustering results is necessary. Clustering is often used in a wide variety of fields, including computer vision, market research, image analysis, information retrieval, bioinformatics, and more.
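A minimal K-means sketch with scikit-learn (assumed installed) on synthetic, unlabeled data:

```python
# K-means clustering: group unlabeled points into k clusters such that
# points within a cluster are close to their cluster's centroid.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # 3 natural groups

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
```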

structured data

Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.

continuous data

Data that can take on any value; there is no space between data values for a given domain. Graphs of continuous data are represented by solid lines.

Categorical Data

Data that consists of names, labels, or other nonnumerical values

Explain "Volume"

Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes of information.

what is data?

Facts that you can draw conclusions from

What are some examples of NoSQL?

Examples of NoSQL databases: MongoDB, MarkLogic.

Restaurant Health Inspections and Crime Statistics Predict the Real Estate Market in NYC

Predictions of apartment prices in New York City (NYC) have always been of interest to new homeowners, investors, Wall Street fund managers, and inhabitants of the city. In recent years, average prices have risen to the highest ever recorded, rebounding after the 2008 economic recession. Although prices are trending up, not all apartments are. Different regions of the city have appreciated differently over time; knowing where to buy or sell is essential for all stakeholders. In this project, we propose a predictive analytics framework that analyzes new alternative data sources to extract predictive features of the NYC real estate market. Our experiments indicated that restaurant health inspection data and crime statistics can help predict apartment prices in NYC. The framework we introduce in this work uses an artificial recurrent neural network with Long Short-Term Memory (LSTM) units and incorporates the two latter predictive features to predict future prices of apartments. Empirical results show that feeding predictive features from (1) restaurant inspections data and (2) crime statistics to a neural network with LSTM units results in smaller errors than the traditional Autoregressive Integrated Moving Average (ARIMA) model, which is normally used for this type of regression. Predictive analytics based on non-linear models with features from alternative data sources can capture hidden relationships that linear models are not able to discover. The framework presented in this study has the potential to serve as a supplement to the traditional forecasting tools of real estate markets.

What is R? And why does knowing just R not make you a data scientist?

R is an open-source statistical programming language that is more than 20 years old; it is the successor of S+. Knowing just R does not make you a data scientist: R is only one tool, and data science also demands the broader skills listed earlier (business sense, big data expertise, communication, and so on).

What are the successors of R?

R was extended with other technologies, such as RHadoop (R + Hadoop), to bypass its in-memory limitation.

GKB: A predictive analytics framework to generate online product recommendations

Recommender Systems are essential to many of the largest internet companies' core products. Online users today expect sites that offer a large assortment of products to also serve recommendations. These recommendations are based on various pieces of data including user ratings of products and product features. In this paper, we explore the use case of applying foundational recommender system techniques and algorithms to provide book recommendations. First, we introduce what recommender systems are and the different types. Then, we will describe the Greenquist-Kilitcioglu-Bari (GKB) framework, an end-to-end process of building out a fully functional and live recommendation system that can be hosted on the internet, which has an RMSE of 0.842. Steps of the process that will be highlighted are data collection and preprocessing, model selection and evaluation, combining different models to create a hybrid model, and hosting the models on a live website that can serve recommendations in real time to many users. We also use the trained model to serve recommendations to a new user that was not part of the training process. This approach creates promises beyond book recommendations and can be applied to marketing, finance, politics, e-commerce, and any data matching applications.

What are graph databases? Examples?

Databases that rely on the concepts of edges and nodes to manage and access data. An example is ArangoDB.

Explain "Velocity":

Sometimes 2 minutes is too late. For time sensitive processes such as catching fraud, big data must be used as it streams into your enterprise to maximize its value

Granger causality

"Suppose that we have three terms, X_t, Y_t, and W_t, and that we first attempt to forecast X_{t+1} using past terms of X_t and W_t. We then try to forecast X_{t+1} using past terms of X_t, Y_t, and W_t. If the second forecast is found to be more successful, according to standard cost functions, then the past of Y appears to contain information helping in forecasting X_{t+1} that is not in past X_t or W_t. ... Thus, Y_t would "Granger cause" X_{t+1} if (a) Y_t occurs before X_{t+1}; and (b) it contains information useful in forecasting X_{t+1} that is not found in a group of other appropriate variables."
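The definition can be sketched numerically on synthetic data: simulate an X that is driven by lagged Y, then compare forecast errors with and without past Y. (The coefficients below are invented; statsmodels also ships a formal test, statsmodels.tsa.stattools.grangercausalitytests.)

```python
# Granger-style comparison: does adding past Y improve a forecast of X
# beyond what past X alone achieves?
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):                        # X is driven by lagged Y
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.normal()

target = x[1:]                               # forecast x[t+1]...
X_past = x[:-1].reshape(-1, 1)               # ...from x[t] alone
XY_past = np.column_stack([x[:-1], y[:-1]])  # ...from x[t] and y[t]

mse_x = np.mean((LinearRegression().fit(X_past, target).predict(X_past) - target) ** 2)
mse_xy = np.mean((LinearRegression().fit(XY_past, target).predict(XY_past) - target) ** 2)

# A clearly smaller mse_xy means past Y helps forecast X: Y "Granger causes" X.
print(f"MSE with past X only: {mse_x:.3f}; with past X and Y: {mse_xy:.3f}")
```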

The Real First Class? Inferring Confidential Corporate Mergers and Government Relations from Air Traffic Communication

This paper exploits publicly available aircraft metadata in conjunction with unfiltered air traffic communication gathered from a global collaborative sensor network to study the privacy impact of large-scale aircraft tracking on governments and public corporations. First, we use movement data of 542 verified aircraft used by 113 different governments to identify events and relationships in the real world. We develop a spatio-temporal clustering method which returns 47 public and 18 non-public meetings attended by dedicated government aircraft over the course of 18 months. Additionally, we illustrate the ease of analyzing the long-term behavior and relationships of aviation users through the example of foreign governments visiting Europe. Secondly, we exploit the same types of data to predict potential merger and acquisition (M&A) activities by 36 corporations listed on the US and European stock markets. We identify seven M&A cases, in all of which the buyer has used corporate aircraft to visit the target prior to the official announcement, on average 61 days before. Finally, we analyze five existing technical and non-technical mitigation options available to the individual stakeholders. We quantify their popularity and effectiveness, finding that despite their current widespread use, they are ineffective against the presented exploits. Consequently, we argue that regulatory and technical changes are required to be able to protect the privacy of non-commercial aviation users in the future.

Earthquake Shakes Twitter Users

Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.

Event Detection In Twitter

Twitter, as a form of social media, is fast emerging in recent years. Users are using Twitter to report real-life events. This paper focuses on detecting those events by analyzing the text stream in Twitter. Although event detection has long been a research topic, the characteristics of Twitter make it a non-trivial task. Tweets reporting such events are usually overwhelmed by high flood of meaningless "babbles". Moreover, event detection algorithm needs to be scalable given the sheer amount of tweets. This paper attempts to tackle these challenges with EDCoW (Event Detection with Clustering of Wavelet-based Signals). EDCoW builds signals for individual words by applying wavelet analysis on the frequency-based raw signals of the words. It then filters away the trivial words by looking at their corresponding signal auto-correlations. The remaining words are then clustered to form events with a modularity-based graph partitioning technique. Experimental studies show promising result of EDCoW. We also present the design of a proof-of-concept system, which was used to analyze netizens' online discussion about Singapore General Election 2011.

What are the three V's of big data?

Velocity, Volume, Variety

Data Understanding Phase

Involves collecting initial data, describing the data in terms of amount, type, and quality, exploring available tools, and verifying data quality.

Deployment Phase

Involves consolidating the findings, determining what might be deployed, and planning the monitoring and maintenance required to keep the model relevant.

Evaluation Phase

Involves evaluating the results against the business success criteria defined at the beginning of the project.

Model Planning and Building

Involves selecting suitable modeling techniques, generating test designs to validate the model, building predictive models, and assessing these models.

Case of Predictive Analytics: A predictive model is a mathematical function that predicts the value of some output variables based on the mapping between input variables. Historical data is used to train the model to arrive at the most suitable modeling technique. For example, a predictive model might predict the risk of developing a certain disease based on patient details. Some commonly used modeling techniques are as follows (see the regression sketch after this list):

- Regression analysis, which analyzes the relationship between the response or dependent variable and a set of independent or predictor variables.
- Decision trees, which help explore possible outcomes for various options.
- Cluster analysis, which groups objects into clusters to look for patterns.
- Association techniques, which discover relationships between variables in large databases.
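To make the first of those techniques concrete, here is a minimal regression sketch on synthetic data; the coefficients and data set are invented for illustration, and scikit-learn is assumed to be available.

```python
# Regression analysis in miniature: train a predictive model on
# "historical" (here synthetic) data, then assess it on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))                      # two predictor variables
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=200)   # response variable

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)           # train on historical data

print("learned coefficients:", model.coef_)                # approx. [3.0, -1.5]
print("held-out R^2:", model.score(X_test, y_test))        # assess the model
```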

unstructured data

Nonnumeric information that is typically formatted in a way that is meant for human eyes and is not easily understood by computers. Ex.: "We have 5 used white balls with a diameter of 45 mm at 50 cents each."

Predicting Financial Markets using the Wisdom of Crowds

In the world of finance, one key lesson is the importance of psychology in the behavior of financial markets. Many investors are irrationally exuberant when making financial decisions, but predictive analytics can generate insights that are free of investors' emotions, and hence human irrational exuberance in decision-making can be mitigated. Data sources that investors adopt in their investment decision-making are, in most cases, traditional, including quarterly earnings reports and financial statements. In this work, we propose a predictive analytics framework that aims at mining insights from two alternative data sources: news articles and micro-blogs. We investigate the predictive correlation and causation between (1) collective opinion mining in news articles fused with Twitter mood and (2) movements in financial markets. Experimental results indicate a relationship between stock market prices and collective opinion mining variations on news articles combined with Twitter's sentiment variations. The framework introduced in this work could potentially be adopted as a supplement to the conventional analyses being used in major investment banks. This research was partially funded by the Australian government under the Endeavour Awards research grant.

