Real Time Midterm

InfoSphere Streams

"Streams allows you to apply the analytics on data in motion."

Four big V's of data

1. Volume -Scale of data 2. Veracity -Uncertainty of data 3. Variety -Different forms of data 4. Velocity -Analysis of streaming data

Extract, Transform and Load

(in computing) refers to a process in database usage, and especially in data warehousing, that: Extracts data from homogeneous or heterogeneous data sources. Transforms the data into the proper format or structure for querying and analysis. Loads it into the final target, such as a database or data warehouse.

Sentiment Analysis

-(or opinion mining) is a natural language processing technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs. -Maybe you want to gauge brand sentiment on social media, in real time and over time, so you can detect disgruntled customers immediately and respond as soon as possible.

E-Commerce

-Amazon Pricing -"Retail price wars online have entered a new era of speed and precision, creating a confusing landscape for shoppers in which prices leap and plummet on short notice." -NY Times -MarketTrack.com

Brand Sentiment Example

-Around Christmas time, Expedia Canada ran a classic "escape winter" marketing campaign. All was well, except for the screeching violin they chose as background music. Understandably, people took to social media, blogs, and forums. Expedia noticed right away and removed the ad. -Then, they created a series of follow-up spin-off videos: one showed the original actor smashing the violin; another invited a real negative Twitter user to rip the violin out of the actor's hands on screen. Though their original campaign was a flop, Expedia were able to redeem themselves by listening to their customers and responding. -Sentiment analysis allows you to automatically monitor all chatter around your brand and detect and address this type of potentially-explosive scenario while you still have time to defuse it.

Brand Monitoring

-Brand monitoring offers a wealth of insights from conversations happening about your brand from all over the internet. Analyze news articles, blogs, forums, and more to gauge brand sentiment, and target certain demographics or regions, as desired. Automatically categorize the urgency of all brand mentions and route them instantly to designated team members. -Get an understanding of customer feelings and opinions, beyond mere numbers and statistics. Understand how your brand image evolves over time, and compare it to that of your competition. You can tune into a specific point in time to follow product releases, marketing campaigns, IPO filings, etc., and compare them to past events.

What problems are associated with data sources in motion?

-Can be unstructured, different kinds of data -Security

Streams Output

-Dashboards -Maps -Visualizations -Text Alert/Email

1. Eviction policy

-Defines how large a window can get -Determines which older tuples are evicted from the window and removed

4. Multilingual sentiment analysis

-Multilingual sentiment analysis can be difficult. It involves a lot of preprocessing and resources. Most of these resources are available online (e.g. sentiment lexicons), while others need to be created (e.g. translated corpora or noise detection algorithms), but you'll need to know how to code to use them.

Setting Variables in python

-N = 100 -Text = "Python"

1. Tumbling windows

-Non-overlapping sets of consecutive tuples -Defined only with an eviction policy

Problems Associated with Massive Amounts of "moving" Data

-Produced too quickly to store -Not in the right format, unstructured -Not enough room to store data -Too expensive to store data -Already too late in finding the information after the data is stored in tables.

Social Media Monitoring

-Sentiment analysis is used in social media monitoring, allowing businesses to gain insights about how customers feel about certain topics, and detect urgent issues in real time before they spiral out of control. -On the fateful evening of April 9th, 2017, United Airlines forcibly removed a passenger from an overbooked flight. The nightmare-ish incident was filmed by other passengers on their smartphones and posted immediately. One of the videos, posted to Facebook, was shared more than 87,000 times and viewed 6.8 million times by 6pm on Monday, just 24 hours later. -The fiasco was only magnified by the company's dismissive response. On Monday afternoon, United's CEO tweeted a statement apologizing for "having to re-accommodate customers."

Sentiment Algorithms

-Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials.

How Does Sentiment Analysis Work?

-Sentiment analysis, otherwise known as opinion mining, works thanks to natural language processing (NLP) and machine learning algorithms, to automatically determine the emotional tone behind online conversations.

Benefits of python

-Simple interface -Similar concepts to all languages -Valuable tool -Easy to find code -Well supported -Free

2. Emotion detection

-This type of sentiment analysis aims to detect emotions, like happiness, frustration, anger, sadness, and so on. Many emotion detection systems use lexicons (i.e. lists of words and the emotions they convey) or complex machine learning algorithms. -One of the downsides of using lexicons is that people express emotions in different ways. Some words that typically express anger, like bad or kill (e.g. your product is so bad or your customer support is killing me) might also express happiness (e.g. this is bad ass or you are killing it).

Pip Install

-Tweepy -Textblob

Expectations of the students

-Understand concepts and terminology behind streams and big data. -Determine when a streams application makes sense to use. -Understand cloud applications like AWS and Azure. -Understand how compute processing power is utilized and allocated in production streams environments. -Be able to explain virtual environments. -Identify potential applications of streams technology. -Expand your skill set, further develop the problem-solving skills necessary for complex applications, and be creative.

1. Rule-based approach to sentiment analysis:

-Usually, a rule-based system uses a set of human-crafted rules to help identify subjectivity, polarity, or the subject of an opinion. -These rules may include various NLP techniques developed in computational linguistics, such as: •Stemming, tokenization (breaking the text down into words, then comparing each word against the database to see whether it is neutral, negative, or positive), part-of-speech tagging and parsing. •Lexicons (i.e. lists of words and expressions). •Two lists: count the number of negative and positive words.

3. Aspect-based Sentiment Analysis

-Usually, when analyzing sentiments of texts, let's say product reviews, you'll want to know which particular aspects or features people are mentioning in a positive, neutral, or negative way. That's where aspect-based sentiment analysis can help, for example in this text: "The battery life of this camera is too short", an aspect-based classifier would be able to determine that the sentence expresses a negative opinion about the feature battery life.

2. Trigger policy

-Determines when an operation, such as aggregation, takes place as new tuples arrive into the window

2. Sliding windows

-Windows formed by adding new tuples to the end of the window and evicting old tuples from the beginning of the window -Defined with both an eviction and a trigger policy

Eric's past roles

-eLocal - Technical Services Director -FireCenter - Systems Program Manager/Grad Student -WashCorp - Web and Applications Developer -College of Business - IT Director -Currently GIS Manager

Data currently being produced

-http://www.internetlivestats.com/ -4.4 billion internet users -2.5 quintillion bytes (2.5 exabytes) produced per day. -By 2020 we expect to have stored 35 zettabytes (ZB) of data. (= 37,580,963,840 TB) -"But in the past 2 years alone, data has grown so much that I would not be surprised if total data size passes 50 ZB." -Much of the data we are producing today isn't being analyzed. -Every second 6000 tweets are sent -http://www.internetlivestats.com/twitter-statistics/ -Facebook's daily data production has been cited over time at 10 TB, 500+ TB, 600 TB, and 4 PB per day, with a data warehouse around 300 PB
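
The terabyte figure quoted for 35 ZB can be checked with a little arithmetic. This is a minimal sketch that assumes binary units (1 ZB = 1024^3 TB), which is the convention that reproduces the number on the card; the variable names are illustrative.

```python
# Assumption: binary prefixes, so ZB -> EB -> PB -> TB is three steps of 1024
TB_PER_ZB = 1024 ** 3

stored_zb = 35
stored_tb = stored_zb * TB_PER_ZB

print(f"{stored_tb:,} TB")  # 37,580,963,840 TB
```

With decimal units (1 ZB = 1000^3 TB) the same 35 ZB would be a round 35,000,000,000 TB.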

skillset of data scientists

-understand analytics -but also well versed in IT -often having advanced degrees in computer science, computational physics or biology or network oriented social sciences -Their upgraded data management skill set includes programming, mathematical and statistical skills, as well as business acumen and the ability to *communicate effectively with decision makers* -*This combination of skills, valuable as it is, is in very short supply*

Three alternatives for determining how many tuples are evicted

1. Count -Once the window fills up, one old tuple is evicted for each arriving tuple 2. Time -Tuples that have been in the window longer than the specified time period are evicted -Tuple eviction takes place independently of tuple insertion 3. Delta -Evict tuples for which a specified attribute's value is more than delta less than that attribute's value in the new tuple
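
Delta eviction is the least intuitive of the three alternatives, so here is a minimal Python sketch of it. It is not Streams code: the function name, the dictionary-based tuples, and the `ts` attribute are assumptions made for illustration, and `delta` is assumed to be non-negative.

```python
from collections import deque

def insert_with_delta_eviction(window, new_tuple, attribute, delta):
    """Append new_tuple, then evict old tuples whose attribute lags it by more than delta."""
    window.append(new_tuple)
    # Evict from the front (oldest first) while the gap exceeds delta
    while new_tuple[attribute] - window[0][attribute] > delta:
        window.popleft()
    return window

window = deque()
for ts in [0, 2, 5, 9, 10]:
    insert_with_delta_eviction(window, {"ts": ts}, "ts", delta=4)

remaining = [t["ts"] for t in window]
print(remaining)  # [9, 10]
```

When the tuple with `ts` 10 arrives, the tuple with `ts` 5 is more than 4 behind it and is evicted, while the one with `ts` 9 stays.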

Properties that define a window

1. Eviction policy 2. Trigger policy 3. Partitioning

Types of Sentiment Analysis

1. Fine-grained Sentiment Analysis 2. Emotion detection 3. Aspect-based Sentiment Analysis 4. Multilingual sentiment analysis

Windows specifications (policy specifications)

1. Punctuation -Allows one to work with a set of tuples that are between punctuations -punct() 2. Count (Fixed window size) -Allows one to work with a fixed-size set of tuples -count(int32 n) or count(uint32 n) 3. Time (Fixed time interval) -Allows one to work with a set of tuples that arrived in a specified period of time -time(int32 n) or time(uint32 n) or time(float64 n) 4. Delta (A specified numeric or timestamp attribute increasing by a specified amount) -Allows one to work with a set of tuples where each tuple's specified attribute value is no more than x greater than that of the first tuple in the window -Usually used for time-based windows based on timestamps that are part of tuple data -The attribute's value in one tuple must never be less than its value in the previous tuple -delta(attribute, delta amount)

Sentiment analysis algorithms fall into one of three buckets

1. Rule-based: these systems automatically perform sentiment analysis based on a set of manually crafted rules. 2. Automatic: systems rely on machine learning techniques to learn from data. 3. Hybrid systems combine both rule-based and automatic approaches.

The overall benefits of sentiment analysis

1. Sorting Data at Scale •Can you imagine manually sorting through thousands of tweets, customer support conversations, or surveys? There's just too much business data to process manually. Sentiment analysis helps businesses process huge amounts of data in an efficient and cost-effective way. 2. Real-Time Analysis •Sentiment analysis can identify critical issues in real time. For example, is a PR crisis on social media escalating? Is an angry customer about to churn? Sentiment analysis models can help you immediately identify these kinds of situations, so you can take action right away. 3. Consistent criteria •It's estimated that people only agree around 60-65% of the time when determining the sentiment of a particular text. Tagging text by sentiment is highly subjective, influenced by personal experiences, thoughts, and beliefs. By using a centralized sentiment analysis system, companies can apply the same criteria to all of their data, helping them improve accuracy and gain better insights.

Course Concepts

1. Sources 2. Transformation 3. Algorithms 4. Visualization

Window types

1. Tumbling windows 2. Sliding windows

Uses of Sentiment Analysis

1. Social Media Monitoring 2. Brand Monitoring 3. Voice of customer (VoC) 4. Customer Service 5. Market Research

PARTITIONING WINDOWS

A partition is a set of tuples with the same values for the partitioning attributes. Instead of maintaining a single window per port: -The operator maintains one window per port per partition. Example: -For a schema with a patientID attribute, there is one partition for each value of this attribute -Allows for tuples related to the same patient to be manipulated together. When partitioning is defined: -Windows are formed and processed independently for each partition
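
The patientID example can be sketched in Python as one small window per patient. This is an illustration rather than Streams code: the function name, the `pulse` attribute, and the choice of a count-based window of size 2 are assumptions.

```python
from collections import defaultdict, deque

def partitioned_windows(tuples, key, size):
    """Keep one count-based window of `size` tuples per value of the partitioning key."""
    windows = defaultdict(lambda: deque(maxlen=size))
    for item in tuples:
        # Each key value gets its own window; the oldest tuple drops off when it is full
        windows[item[key]].append(item)
    return windows

readings = [
    {"patientID": "A", "pulse": 70},
    {"patientID": "B", "pulse": 90},
    {"patientID": "A", "pulse": 72},
    {"patientID": "A", "pulse": 75},
]
windows = partitioned_windows(readings, "patientID", size=2)

pulses_a = [t["pulse"] for t in windows["A"]]
pulses_b = [t["pulse"] for t in windows["B"]]
print(pulses_a, pulses_b)  # [72, 75] [90]
```

Patient A's window evicts its oldest reading without affecting patient B's window, which is the point of partitioning.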

Relational databases

Access, SQL, Oracle, DB2 (MongoDB is a non-relational document database and belongs with NoSQL)

String

Definition -List of characters Example -"GOOG" Values -Characters

Float

Definition -Number including fractions Size -32 bits Range -3.4E+38 to +3.4E+38

Double

Definition -Number including fractions, twice the size of Float -Think of it in terms of memory on a computer: each value takes twice as much space -Very large -If we know the data is going to be an integer, why not use doubles anyway? Because it will slow down performance Size -64 bits Range -1.7E+308 to +1.7E+308

Integers

Definition -Whole numbers, no fractions or anything following decimal point Example -0, 10, -10, 5000 Values -Whole Numbers Range -2147483648 through 2147483647
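
The range on this card matches a signed 32-bit integer, which can be verified with a short calculation. The 32-bit width is an assumption inferred from the numbers; the card itself does not state a size. Python's own `int` has no fixed width, so this only illustrates where the limits come from.

```python
# Assumption: a signed 32-bit integer (one bit for the sign, 31 for the magnitude)
BITS = 32
int_min = -(2 ** (BITS - 1))
int_max = 2 ** (BITS - 1) - 1

print(int_min, int_max)  # -2147483648 2147483647
```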

Boolean

Definition -a binary variable, having two possible values called "true" and "false." Example -X = True Values -True or False

List

Definition -is a sequence of mutable objects Example -List = [1, 2, 3, 4] Values -String, Integer, Float, List...

Tuple

Definition -tuple is a sequence of immutable objects -Immutable objects = can't be changed --A dataset you don't want someone to change --Things that will be flying by that you want to be running queries on Example -("IBM", 158.09, "GOOG", 525.34) Values -String, Integer, Float, List...
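
A short sketch of what immutability means in practice, contrasted with a list. It assumes Python syntax, where a tuple is written with parentheses; the variable names and the replacement price of 160.00 are illustrative.

```python
quote = ("IBM", 158.09, "GOOG", 525.34)   # tuple: cannot be changed in place

try:
    quote[1] = 160.00                      # raises TypeError for a tuple
    changed = True
except TypeError:
    changed = False

prices = [158.09, 525.34]                  # list: can be changed in place
prices[0] = 160.00

print(changed, prices)  # False [160.0, 525.34]
```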

Edgar Codd

English computer scientist who, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases

Examples of Streams Applications

Financial Services: risk and fraud management, customer analytics. Transportation: logistics optimization, traffic congestion. Healthcare/Life Sciences: medical record text analytics, genomic analytics. Telecommunications: call detail record processing, customer profile monetization. Energy and Utilities: smart meter analytics, asset management. Digital Media: real-time ad targeting, website analysis. Retail: omnichannel marketing, clickstream analysis. Law Enforcement: real-time multimodal surveillance, cyber security detection.

Tweet2File

Grabs tweets based on keywords and writes them to a file.

Example of rule-based approach

Here's a basic example of how a rule-based system works: 1. Defines two lists of polarized words (e.g. negative words such as bad, worst, ugly, etc. and positive words such as good, best, beautiful, etc.). 2. Counts the number of positive and negative words that appear in a given text. 3. If the number of positive word appearances is greater than the number of negative word appearances, the system returns a positive sentiment, and vice versa. If the numbers are even, the system will return a neutral sentiment.
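
The three steps above can be sketched in a few lines of Python. The word lists are only the examples named on the card, and the function name and the simple regular-expression tokenizer are assumptions; a real lexicon would be far larger.

```python
import re

# Step 1: two tiny lists of polarized words (illustrative only)
POSITIVE_WORDS = {"good", "best", "beautiful"}
NEGATIVE_WORDS = {"bad", "worst", "ugly"}

def rule_based_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' by counting polarized words."""
    words = re.findall(r"[a-z']+", text.lower())
    # Step 2: count appearances of positive and negative words
    positives = sum(1 for w in words if w in POSITIVE_WORDS)
    negatives = sum(1 for w in words if w in NEGATIVE_WORDS)
    # Step 3: compare the counts; a tie is neutral
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The best camera, beautiful photos, bad strap"))  # positive
```

Because it only counts words, this approach misses negation and sarcasm, which is one reason automatic and hybrid systems exist.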

1. Fine-grained Sentiment Analysis

If polarity precision is important to your business, you might consider expanding your polarity categories to include: •Very positive •Positive •Neutral •Negative •Very negative This is usually referred to as fine-grained sentiment analysis, and could be used to interpret 5-star ratings in a review, for example: •Very Positive = 5 stars •Very Negative = 1 star
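
One way to picture the five categories is as a lookup from star rating to label. The card only gives the two endpoints (5 stars and 1 star); the mapping of 2, 3, and 4 stars below is an assumption, as are the names used.

```python
# 5 and 1 come from the card; the middle three entries are assumed for illustration
STAR_TO_LABEL = {
    5: "Very positive",
    4: "Positive",
    3: "Neutral",
    2: "Negative",
    1: "Very negative",
}

def label_for_stars(stars):
    """Translate a 1-5 star rating into a fine-grained sentiment label."""
    return STAR_TO_LABEL[stars]

print(label_for_stars(5), "/", label_for_stars(1))  # Very positive / Very negative
```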

Tumbling Window

Is specified by providing an eviction policy only: -Punctuation -Count -Time -Delta. Stores tuples until the window is full: -Full is based upon the eviction policy. When the window is full: -Executes the operator behavior. After the operator has been executed: -Flushes the window
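
The store, execute, flush cycle can be sketched in Python for a count-based tumbling window. This is an illustration, not Streams code; the function name and the sample prices are assumptions, and averaging stands in for "the operator behavior".

```python
def tumbling_count_windows(tuples, size):
    """Yield non-overlapping windows of `size` consecutive tuples."""
    window = []
    for item in tuples:
        window.append(item)           # store tuples until the window is full
        if len(window) == size:       # "full" is defined by the count policy
            yield list(window)        # hand the full window to the operator
            window.clear()            # flush the window

prices = [10, 12, 11, 15, 14, 13, 20]
averages = [sum(w) / len(w) for w in tumbling_count_windows(prices, 3)]
print(averages)  # [11.0, 14.0]
```

The final value, 20, never fills a window, so no result is produced for it in this sketch.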

Twitter Data with Python

Issue with sorting for streaming analytics: you don't have all the data, sorting on a moving target

Non-relational databases

NoSQL

3. Partitioning

Maintains separate windows for each group of tuples with the same grouping key value

Files in python

Opening a file for read access -FileRead = open('filename', 'mode') (read = 'r', write = 'w') Reading the file with a loop (a loop steps through a defined set of objects, e.g. until we're done with the file, or a hundred times) -for i in range(N): Line = FileRead.readline(); print(Line) Writing to a file -FileWrite = open('newfile.txt', 'w') -Line = FileRead.readline() -FileWrite.write(Line + "\n") Closing a file -FileRead.close() -FileWrite.close()
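
The same open, loop, write, close steps look like this as a runnable Python 3 sketch. The helper name, the temporary folder, and the sample file contents are assumptions for illustration; `with` closes both files automatically, which replaces the explicit close calls.

```python
import os
import tempfile

def copy_first_lines(source_path, target_path, n):
    """Copy up to n lines from source_path to target_path; return the lines copied."""
    copied = []
    with open(source_path, "r") as file_read, open(target_path, "w") as file_write:
        for _ in range(n):
            line = file_read.readline()
            if not line:              # end of file reached before n lines
                break
            line = line.rstrip("\n")
            file_write.write(line + "\n")
            copied.append(line)
    return copied

# Build a small sample file so the example is self-contained
folder = tempfile.mkdtemp()
source = os.path.join(folder, "tweets.txt")
target = os.path.join(folder, "newfile.txt")
with open(source, "w") as f:
    f.write("first\nsecond\nthird\n")

lines = copy_first_lines(source, target, 2)
print(lines)  # ['first', 'second']
```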

AVG (AZURE STREAM ANALYTICS)

Returns the average of the values in a group. Null values are ignored. Syntax: AVG ( expression ). Arguments: expression is an expression of the exact numeric or approximate numeric data type category. AVG can be used with bigint and float columns. Aggregate functions and subqueries are not permitted. Return types: the return type is determined by the type of the evaluated result of expression. Example: SELECT System.TimeStamp AS OutTime, TollId, AVG(Toll) FROM Input TIMESTAMP BY EntryTime GROUP BY TollId, TumblingWindow(minute, 3)
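
To see what the example query computes, here is a rough Python analogue: average toll per TollId within fixed three-minute windows, skipping nulls. It is a simplified sketch, not Azure Stream Analytics behavior: the function name and sample events are assumptions, times are plain seconds, and results are keyed by window start, so window labeling and boundary handling may differ from the service.

```python
from collections import defaultdict

def tumbling_average(events, window_seconds):
    """Average toll per (window start, toll_id) over non-overlapping time windows."""
    groups = defaultdict(list)
    for entry_time, toll_id, toll in events:
        window_start = entry_time - (entry_time % window_seconds)
        if toll is not None:          # mirror AVG ignoring null values
            groups[(window_start, toll_id)].append(toll)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# (entry_time in seconds, toll_id, toll): hypothetical sample data
events = [(0, 1, 4.0), (60, 1, 6.0), (100, 2, 5.0), (185, 1, 8.0), (190, 1, None)]
result = tumbling_average(events, window_seconds=180)
print(result)  # {(0, 1): 5.0, (0, 2): 5.0, (180, 1): 8.0}
```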

STREAMS WINDOWS

Sorting, aggregating, or joining data in a relational table: -All data in the table can be processed. For streaming data, data flows continuously: -No beginning, no end -Requires a different paradigm for sorting, aggregating and joining. Can only work with a subset of consecutive tuples. This finite set of tuples is called a window.

Strings and Lists in Python

split() - Splits a string on a delimiter such as a space, tab or comma. -myList = Line.split(' ') -myList[0] returns the first value in the list -Another term for list: array -Index 0 identifies the first value in the list, 1 the second value len() - Returns the length of the object -listLength = len(object)
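
A quick runnable example of both functions. The sample line is made up for illustration (the date in it is hypothetical); note that Python method names are lowercase and case-sensitive.

```python
line = "GOOG,525.34,2015-01-02"

my_list = line.split(",")      # split on the comma delimiter
first_value = my_list[0]       # index 0 is the first value
list_length = len(my_list)     # number of items in the list

print(first_value, list_length)  # GOOG 3
```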

How does streaming analytics differ from database driven analytics?

Streaming analytics processes data in real time, as it arrives, rather than after it has been stored in tables

SLIDING WINDOWS - TRIGGER POLICIES

Three alternatives for determining when a sliding window is processed: 1. count(n) -When a specified number of tuples have arrived since the last time a window was processed 2. time(n) -When a specified amount of time has passed 3. delta(attrib, value) -When a tuple arrives whose specified non-decreasing numeric attribute value is more than a specified amount greater than that attribute's value for the tuple that triggered the last processing operation *Eviction policy and trigger policy are independent* All nine combinations are possible
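
The independence of eviction and trigger policies is easiest to see with one of the nine combinations: count-based eviction with a count-based trigger. This Python sketch is a simplification, not Streams code; the function and parameter names are assumptions, and it triggers even before the window has filled.

```python
from collections import deque

def sliding_windows(tuples, evict_count, trigger_count):
    """Yield the window contents each time trigger_count new tuples have arrived."""
    window = deque(maxlen=evict_count)   # eviction: oldest tuple drops off when full
    since_trigger = 0
    for item in tuples:
        window.append(item)
        since_trigger += 1
        if since_trigger == trigger_count:   # trigger: process the window
            yield list(window)
            since_trigger = 0

readings = [1, 2, 3, 4, 5, 6]
snapshots = list(sliding_windows(readings, evict_count=3, trigger_count=2))
print(snapshots)  # [[1, 2], [2, 3, 4], [4, 5, 6]]
```

The window holds up to three tuples but is processed every two arrivals, so consecutive snapshots overlap, unlike a tumbling window.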

SLIDING WINDOWS - EVICTION POLICIES

When a new tuple arrives -It is added to the end of the window -Zero or more old tuples are evicted from the start of the window (first-in/first-out)

WriteSentiment.py

Writes Sentiment values to file. All files will be in the same folder where the python files reside

4. Visualization

•Dashboards •Maps •Graphs •Alerts •Real-Time

2. Transformation

•Filter, Sort, Split •Tools (Python, Streams, etc)

3. Algorithms

•Statistics •Sentiment •Image Recognition •R, SPSS

1. Sources

•Text, FTP •Sensor •Video •HTTP •Real-Time

