ASU CIS415 Quiz2

Bayes' Theorem

- Prior probability P(h): the probability that a hypothesis h is true before seeing any evidence.
- Posterior (conditional) probability P(h|d): the probability that h is true given some additional evidence d.
- Bayes' Theorem relates P(h|d), P(d), P(d|h), and P(h). It is useful because data about P(d|h) is usually easier to obtain than data about P(h|d).
- P(h|d) = P(d|h) P(h) / P(d)
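As a rough illustration of the formula, the sketch below computes a posterior from a prior and two likelihoods. The numbers (a 1% prior, a 90% hit rate, a 5% false-alarm rate) and the name `posterior` are made up for this example, and P(d) is obtained from the law of total probability.

```python
def posterior(p_d_given_h: float, p_h: float, p_d: float) -> float:
    """Return P(h|d) = P(d|h) * P(h) / P(d)."""
    if p_d <= 0:
        raise ValueError("P(d) must be positive")
    return p_d_given_h * p_h / p_d


# Hypothetical inputs, chosen only for illustration.
p_h = 0.01              # prior P(h)
p_d_given_h = 0.90      # likelihood P(d|h)
p_d_given_not_h = 0.05  # P(d|not h)

# Law of total probability: P(d) = P(d|h)P(h) + P(d|not h)P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_d_given_h, p_h, p_d)
print(round(p_h_given_d, 4))
```

Even with a strong likelihood, the small prior keeps the posterior low, which is the usual point of working through an example like this.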

Node Centrality Measures

- Degree centrality ("celebrities"): the number of connections a node has.
- Closeness centrality ("gossips"): find the shortest path between each pair of nodes using Dijkstra's algorithm, then take each node's average shortest distance to all other nodes, divided by the maximum distance. Closeness = 1 / average distance.
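A minimal sketch of both measures on a small unweighted graph follows. It makes two simplifying assumptions: breadth-first search stands in for Dijkstra's algorithm (the two give the same distances when every edge has weight 1), and closeness is left as 1 / average distance without further normalization. The graph and function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop counts from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def degree_centrality(graph, node):
    """Number of connections the node has."""
    return len(graph[node])


def closeness_centrality(graph, node):
    """1 / average shortest distance to all other reachable nodes."""
    dist = bfs_distances(graph, node)
    others = [d for n, d in dist.items() if n != node]
    if not others:
        return 0.0
    return 1.0 / (sum(others) / len(others))


# Hypothetical star graph: B is connected to everyone else.
graph = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
print(degree_centrality(graph, "B"), closeness_centrality(graph, "B"))
```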

Search

- Depth-first search (DFS): go deep before going broad. Visit a neighbor, then that neighbor's neighbors, and so on, before returning to the remaining neighbors.
- Breadth-first search (BFS): go broad before going deep. Visit all immediate neighbors first, then proceed to their neighbors.
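The two visiting orders can be compared on a small example. This is a sketch on a hypothetical adjacency-list graph; neighbors are visited in the order they are listed.

```python
from collections import deque


def dfs(graph, start):
    """Depth-first order: follow one neighbor as far as possible first."""
    visited, order = set(), []

    def visit(node):
        visited.add(node)
        order.append(node)
        for nb in graph[node]:
            if nb not in visited:
                visit(nb)

    visit(start)
    return order


def bfs(graph, start):
    """Breadth-first order: all immediate neighbors before going deeper."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                queue.append(nb)
    return order


# Hypothetical graph: A has two branches, B-D and C-E.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B"],
    "E": ["C"],
}
print(dfs(graph, "A"))  # ['A', 'B', 'D', 'C', 'E']
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D', 'E']
```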

Text pre-processing decisions/ when do we pre-process?

- lowercase
- remove punctuation
- remove digits
- remove short words
- remove stop words
- stem the words
- retain only unique words
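The steps above can be chained into one small pipeline. This is only a sketch: the stop-word list is a tiny illustrative set, `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, and the minimum word length of 3 is an assumed setting.

```python
import string

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_len=3):
    text = text.lower()                                   # lowercase
    drop = str.maketrans("", "", string.punctuation + string.digits)
    text = text.translate(drop)                           # punctuation, digits
    tokens = [t for t in text.split() if len(t) >= min_len]       # short words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    stems = [crude_stem(t) for t in tokens]               # stem
    return sorted(set(stems))                             # unique words


print(preprocess("The 3 cats are chasing the cat!"))  # ['cat', 'chas']
```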

Graph Measures

- Number of nodes
- Number of edges
- Density: actual edges / potential edges
- Diameter: compute the shortest distance between every pair of nodes; the diameter is the largest of these distances
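A short sketch of density and diameter follows, assuming an undirected, connected, unweighted graph stored as an adjacency list, so that potential edges = n(n-1)/2 and distances are hop counts. The example graph and function names are hypothetical.

```python
from collections import deque


def shortest_hops(graph, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def density(graph):
    """Actual edges divided by potential edges n(n-1)/2."""
    n = len(graph)
    edges = sum(len(nbs) for nbs in graph.values()) / 2  # each edge listed twice
    return edges / (n * (n - 1) / 2)


def diameter(graph):
    """Largest shortest-path distance over all node pairs."""
    return max(max(shortest_hops(graph, s).values()) for s in graph)


# Hypothetical path graph A - B - C - D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(density(graph), diameter(graph))  # 0.5 3
```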

Text Classification

- Text classification is the task of assigning a category or class to a text.
- With supervised learning: start with a set of documents labeled with the classes they belong to, extract a set of features from the documents (often words), and produce a classifier that can assign a new document to the appropriate class. The focus here is on Naive Bayes classification.
- Corpus: a collection of documents.
- Document: a collection of tokens (terms or words).

Clustering coefficients

- Measures the proportion of the ego's friends that are also friends with each other.
- Star networks, with a single broadcast node and passive listeners, have a low coefficient.
- Dense ego networks, with a lot of mutual trust, have a high coefficient.
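The contrast between a star and a dense ego network can be checked directly. In this sketch the graphs are hypothetical and stored as dictionaries of neighbor sets; the coefficient is the number of links among the ego's friends divided by the number of possible friend pairs.

```python
from itertools import combinations


def clustering_coefficient(graph, ego):
    """Proportion of the ego's friend pairs that are themselves connected."""
    friends = graph[ego]
    k = len(friends)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(friends, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)


# Hypothetical star: the hub's listeners do not know each other.
star = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}

# Hypothetical dense network: everyone knows everyone.
clique = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

print(clustering_coefficient(star, "hub"))  # 0.0
print(clustering_coefficient(clique, "x"))  # 1.0
```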

Item Based Filtering

1) Adjusted cosine similarity
2) Weighted slope one

Adjusted cosine similarity

1) Compute the item-to-item similarity matrix using the mean-adjusted cosine similarity measure (similar to Pearson).
2) Calculate the normalized ratings of user v for all items i (each must be between -1 and 1); similar to kNN.
3) Predict the normalized rating of how well user v will like item k.
4) Denormalize the prediction score.
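The four steps can be sketched as follows. Several details are assumptions rather than something the notes specify: ratings are on a 1 to 5 scale, each user's mean is taken over all of that user's ratings, the prediction is a similarity-weighted average of normalized ratings, and the data and function names are hypothetical.

```python
from math import sqrt


def adjusted_cosine(ratings, item_i, item_j):
    """Similarity of two items after subtracting each user's mean rating.
    Only users who rated both items contribute."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - mean
            dj = user_ratings[item_j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (sqrt(den_i) * sqrt(den_j))


def normalize(r, lo=1, hi=5):
    """Map a rating on [lo, hi] to [-1, 1]."""
    return (2 * (r - lo) - (hi - lo)) / (hi - lo)


def denormalize(nr, lo=1, hi=5):
    """Map a normalized rating on [-1, 1] back to [lo, hi]."""
    return 0.5 * (nr + 1) * (hi - lo) + lo


def predict(ratings, user, target):
    """Similarity-weighted average of the user's normalized ratings."""
    num = den = 0.0
    for item, r in ratings[user].items():
        if item == target:
            continue
        s = adjusted_cosine(ratings, target, item)
        num += s * normalize(r)
        den += abs(s)
    if den == 0:
        return None
    return denormalize(num / den)


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5, "c": 2},
    "u3": {"a": 1, "b": 2},
}
print(predict(ratings, "u3", "c"))
```

Here u3 disliked the two items that other users rated opposite to item c, so the predicted rating for c comes out on the high side of the scale.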

User Based Filtering - measures of similarity

1) Distance-based (dis)similarity measures
2) Cosine-based similarity measure
3) Pearson correlation-based similarity measure
4) k-nearest neighbor

Naive Bayes Classifier

A family of algorithms that treat every feature as independent of every other feature, given the class.

Maximum A Posteriori (MAP) rule

Bayes' Theorem is used to decide among alternative hypotheses by computing the probability of each hypothesis given the data and picking the one with the highest probability. Because P(d) is the same for every hypothesis, it can be dropped: h_MAP = argmax over h of P(d|h) P(h).
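A tiny Naive Bayes text classifier shows the MAP rule at work: score each class by P(d|h) P(h) and keep the best one. This sketch adds two things the notes do not mention, add-one (Laplace) smoothing and log probabilities to avoid underflow; the training documents and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(docs):
    """docs: list of (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Return the class h maximizing log P(h) + sum of log P(token|h)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for h, n_h in class_counts.items():
        score = log(n_h / total_docs)  # log prior, log P(h)
        total_words = sum(word_counts[h].values())
        for t in tokens:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += log((word_counts[h][t] + 1) / (total_words + len(vocab)))
        scores[h] = score
    return max(scores, key=scores.get)


# Hypothetical labeled corpus.
docs = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "cheap", "now"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "today"], "ham"),
]
model = train(docs)
print(classify(["cheap", "buy"], model))      # spam
print(classify(["meeting", "lunch"], model))  # ham
```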

Betweenness Centrality

Location in the network:
- *Control* data flow
- Serve as a *"Broker"*
- *Isolate* leadership
How many people have to go through this individual to get to others?

Predicting Ratings

Mean Absolute Error (MAE) measures the average absolute deviation between predicted ratings and users' true ratings (RMSE is also used). We can also convert ratings to binary (bad vs. good) and evaluate the predictions with the true positive rate and the false positive rate (1 - specificity).
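Both error measures are short enough to write out. The sketch below uses hypothetical predicted and true ratings; RMSE squares each error before averaging, so it penalizes large misses more heavily than the mean absolute error does.

```python
from math import sqrt


def mae(predicted, actual):
    """Mean absolute error: average of |predicted - actual|."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error: square root of the average squared error."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical ratings.
predicted = [4, 3, 5, 2]
actual = [5, 3, 4, 4]
print(mae(predicted, actual))   # 1.0
print(rmse(predicted, actual))  # sqrt(1.5), about 1.2247
```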

Graph measures: Shortest Path

Shortest path: the path with the fewest nodes (in an unweighted graph). Dijkstra's Algorithm: for a given vertex, find the lowest-cost path to all other vertices, where cost is the sum of edge weights. In a weighted graph, the shortest path is the one with the least total weight.
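A compact version of Dijkstra's algorithm using a priority queue is sketched below. It assumes non-negative edge weights and a graph stored as a dictionary of `{neighbor: weight}` dictionaries; the example graph is hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Lowest total edge weight from source to every reachable vertex."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale entry: a cheaper path was already found
        for nb, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return dist


# Hypothetical weighted, undirected graph.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

Note that the cheapest route from A to D goes through three edges (A-B-C-D, cost 4) rather than two (A-B-D, cost 6), so fewest hops and least weight can disagree.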

TFIDF

TFIDF(t,d) = TF(t,d) * IDF(t), where:
- N is the number of documents
- Nt is the number of documents containing term t
- TF(t,d) is the number of times term t appears in document d, divided by the total number of terms in d
- IDF(t) = log(N / (1 + Nt))
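The definition translates almost line for line into code. This sketch assumes the natural logarithm (the notes do not state the base) and uses a hypothetical four-document corpus of pre-tokenized text.

```python
from math import log


def tf(term, doc):
    """Occurrences of term in doc, divided by the number of terms in doc."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """log(N / (1 + Nt)), with Nt the number of docs containing term."""
    n_t = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / (1 + n_t))


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["data", "mining", "is", "fun"],
    ["data", "science"],
    ["graph", "mining"],
    ["text", "analysis"],
]
print(tfidf("fun", corpus[0], corpus))   # rarer term, higher score
print(tfidf("data", corpus[0], corpus))  # more common term, lower score
```

With the 1 + Nt denominator, a term that appears in every document gets a slightly negative IDF, which is worth keeping in mind when comparing scores.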

Model Based Filtering

User-based filtering: find similar users and use their ratings for recommendations. Item-based filtering: find similar items and use them, together with the user's ratings, to generate recommendations. In model-based filtering the ratings are NOT stored; we just build a model based on closeness.

Collaborative filtering problems

- Cold start
- Scalability
- Sparsity
- Winner takes all

Weighted slope one

Computationally efficient item-to-item method. "Weighted" because each deviation is multiplied by its cardinality (the number of users who rated both items).
1) Compute the deviations between every pair of items.
2) Predict the rating of how well user v will like item k.
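Both steps can be sketched in a few lines. The deviation of item i from item j is taken here as the average of (rating of i minus rating of j) over users who rated both, and the prediction weights each deviation by the number of such users. The ratings and function names are hypothetical.

```python
from collections import defaultdict


def compute_deviations(ratings):
    """Step 1: average rating difference for every ordered item pair,
    plus the number of users (cardinality) behind each average."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    sums[(i, j)] += r_i - r_j
                    counts[(i, j)] += 1
    devs = {pair: sums[pair] / counts[pair] for pair in sums}
    return devs, dict(counts)


def predict(ratings, devs, counts, user, target):
    """Step 2: cardinality-weighted average of (deviation + user's rating)."""
    num = den = 0.0
    for item, r in ratings[user].items():
        if item != target and (target, item) in devs:
            c = counts[(target, item)]
            num += (devs[(target, item)] + r) * c
            den += c
    return num / den if den else None


# Hypothetical ratings.
ratings = {
    "amy": {"x": 4, "y": 3, "z": 4},
    "ben": {"x": 5, "y": 2},
    "clara": {"y": 3.5, "z": 4},
    "daisy": {"x": 5, "z": 3},
}
devs, counts = compute_deviations(ratings)
print(predict(ratings, devs, counts, "ben", "z"))  # 3.375
```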

Distance based methods

Distance between two points P = (a, b) and Q = (c, d) in 2-dimensional space.
- Minkowski distance (with parameter r) = (|a-c|^r + |b-d|^r)^(1/r), for example with r = 3
- Manhattan distance = Minkowski with r = 1
- Euclidean distance = Minkowski with r = 2
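One function covers all three distances, since Manhattan and Euclidean are the r = 1 and r = 2 cases. The points below are hypothetical, and the function accepts any number of dimensions even though the card only discusses two.

```python
def minkowski(p, q, r):
    """(sum of |p_k - q_k|^r over coordinates k) ^ (1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


# Hypothetical points.
P, Q = (1, 2), (4, 6)
print(minkowski(P, Q, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(P, Q, 2))  # Euclidean: sqrt(9 + 16) = 5.0
print(minkowski(P, Q, 3))  # r = 3: cube root of 27 + 64
```

As r grows, the largest coordinate difference dominates the result.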

Sparsity

Even though there are a lot of users, not everyone rates everything. There may be books on Amazon that no one has rated, so an algorithm might not be able to find nearest neighbors.

Types of ratings

- Explicit ratings: the user explicitly rates the item, e.g. thumbs up or thumbs down. The problem is that most people are too lazy to do this.
- Implicit ratings: ratings inferred by keeping track of users' behavior, e.g. Amazon's 'frequently bought also' tab. The problem is that it is hard to distinguish individual preferences.

Eigenvector Centrality

How well connected are the nodes I'm connected to? A recursive version of degree centrality.
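The recursion can be approximated by repeatedly replacing each node's score with the sum of its neighbors' scores, a technique known as power iteration. This is a sketch under assumptions: the graph is undirected, connected, and not bipartite (so the iteration settles), scores are rescaled so the largest is 1, and 100 iterations is an arbitrary cutoff. The example graph is hypothetical.

```python
def eigenvector_centrality(graph, iterations=100):
    """Power iteration: a node's score is the sum of its neighbors' scores."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new = {node: sum(scores[nb] for nb in graph[node]) for node in graph}
        norm = max(new.values()) or 1.0  # rescale so the top score is 1
        scores = {node: value / norm for node, value in new.items()}
    return scores


# Hypothetical graph: triangle A-B-C with D hanging off C.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
scores = eigenvector_centrality(graph)
print(max(scores, key=scores.get))  # C
```

D has only one connection, but it is to the best-connected node, so its score is lower than A's without being negligible.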

Pearson Correlation

A measure of the linear dependence between two variables.
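A direct implementation of the correlation coefficient is sketched below: the covariance of the two variables divided by the product of their standard deviations. The input lists are hypothetical, and returning 0.0 for a constant input is a convention chosen for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, in [-1, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # convention for this sketch: constant input
    return cov / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # close to 1: perfect positive
print(pearson([1, 2, 3], [3, 2, 1]))        # close to -1: perfect negative
```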

Islands

Only retain edges that have weight > threshold. What remains is a subcore of maximal activity between nodes that have developed a trust relationship.
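The filtering step is a one-liner once edges carry weights. In this sketch the edge list, the `(u, v, weight)` tuple layout, and the threshold are all hypothetical.

```python
def island_filter(edges, threshold):
    """Keep only edges whose weight is strictly above the threshold."""
    return [(u, v, w) for u, v, w in edges if w > threshold]


# Hypothetical weighted edges, e.g. number of interactions between two people.
edges = [("A", "B", 5), ("B", "C", 1), ("C", "D", 3)]
print(island_filter(edges, 2))  # [('A', 'B', 5), ('C', 'D', 3)]
```

Raising the threshold splits the network into smaller, more tightly bound groups.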

Winner takes all

A recommendation system's bias towards already popular items.

Ego Network

The web and characteristics of social relationships that surround an individual, from the focal individual's perspective. Ego networks are subnetworks centered on a certain node, extracted with breadth-first search while limiting the depth to <= 3.
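Extracting an ego network amounts to a breadth-first search that stops expanding once it reaches the depth limit. The sketch below returns the set of nodes within `max_depth` hops of the ego; the chain graph and function name are hypothetical.

```python
from collections import deque


def ego_network(graph, ego, max_depth=3):
    """Nodes reachable from ego in at most max_depth hops."""
    depth = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nb in graph[node]:
            if nb not in depth:
                depth[nb] = depth[node] + 1
                queue.append(nb)
    return set(depth)


# Hypothetical chain A - B - C - D - E - F.
graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D", "F"],
    "F": ["E"],
}
print(sorted(ego_network(graph, "A")))  # ['A', 'B', 'C', 'D']
```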

Confusion matrix

Columns are the predicted condition; rows are the true condition.

          Predicted P   Predicted N
True P    tp            fn
True N    fp            tn

Accuracy is (tp + tn) / n, where n is the total number of instances in the test set.
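Accuracy counts the correct predictions, true positives plus true negatives, over all test instances. The sketch below tallies the four cells from hypothetical binary labels (1 = positive) and also reports the true positive rate and the false positive rate (1 - specificity).

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Return (tp, fn, fp, tn) for binary labels."""
    tp = fn = fp = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Correct predictions over all instances."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical test labels and predictions.
true_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(true_labels, predicted)
print(tp, fn, fp, tn)            # 3 1 1 3
print(accuracy(tp, fn, fp, tn))  # 0.75
print(tp / (tp + fn))            # true positive rate: 0.75
print(fp / (fp + tn))            # false positive rate: 0.25
```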

Cold Start

A recommender typically needs a significant amount of existing data on a user in order to make recommendations, so it struggles with new users.

Scalability

User-based methods work great for thousands of users, but scalability becomes a problem with millions of users because you have to calculate many more distances.

