ASU CIS415 Quiz2
Bayes' Theorem
- Prior probability P(h): the probability a hypothesis is true.
- Posterior or conditional probability P(h|d): the probability the hypothesis is true given some additional evidence d.
- Bayes' Theorem describes the relationship between P(h|d), P(d), P(d|h), and P(h). It is usually easier to get data about P(d|h) than about P(h|d).
- P(h|d) = P(d|h)P(h) / P(d)
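A minimal Python sketch of the formula (the numbers in the example are made up for illustration):

    def posterior(p_d_given_h, p_h, p_d):
        """Bayes' Theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
        return p_d_given_h * p_h / p_d

    # Illustrative numbers: hypothesis with prior 0.01; the evidence is seen
    # 90% of the time when the hypothesis is true and 10% of the time overall.
    print(posterior(0.9, 0.01, 0.1))  # 0.09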
Node Centrality Measures
- Degree centrality (celebrities): the number of connections a node has.
- Closeness centrality (gossip): compute the shortest path between each pair of nodes using Dijkstra's algorithm, then take the average shortest distance to all other nodes, divided by the maximum distance. closeness = 1 / avg distance
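A minimal sketch of closeness on an unweighted graph (BFS gives the shortest paths here, since every edge has weight 1; the toy graph is an assumption):

    from collections import deque

    def closeness(graph, node):
        """closeness = 1 / average shortest distance to all other nodes.
        graph: dict mapping node -> list of neighbors (unweighted)."""
        dist = {node: 0}
        queue = deque([node])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        others = [d for n, d in dist.items() if n != node]
        return 1 / (sum(others) / len(others))

    g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # toy path graph a-b-c
    print(len(g["b"]))        # degree centrality of b: 2
    print(closeness(g, "b"))  # avg distance 1.0 -> closeness 1.0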
Search
- Depth-first search (DFS): go deep before going broad. Visit a neighbor, then that neighbor's neighbors, and continue down before backtracking.
- Breadth-first search (BFS): go broad before going deep. Visit all immediate neighbors first, then proceed to their neighbors.
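A minimal sketch of both traversals (the toy graph is an assumption; DFS uses a stack, BFS a queue):

    from collections import deque

    def dfs(graph, start):
        """Depth-first: follow one branch all the way down before backtracking."""
        visited, stack = [], [start]
        while stack:
            node = stack.pop()                    # LIFO -> deep before broad
            if node not in visited:
                visited.append(node)
                stack.extend(reversed(graph[node]))
        return visited

    def bfs(graph, start):
        """Breadth-first: visit all immediate neighbors before going deeper."""
        visited, queue = [], deque([start])
        while queue:
            node = queue.popleft()                # FIFO -> broad before deep
            if node not in visited:
                visited.append(node)
                queue.extend(graph[node])
        return visited

    g = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
    print(dfs(g, "a"))  # ['a', 'b', 'd', 'c']
    print(bfs(g, "a"))  # ['a', 'b', 'c', 'd']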
Text pre-processing decisions / when do we pre-process?
- lowercase
- remove punctuation
- remove digits
- remove short words
- remove stop words
- stem the words
- retain only unique words
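A minimal sketch of the pipeline (the stop-word list is a tiny stand-in, and the suffix-stripping is a crude placeholder for a real stemmer such as Porter):

    import re

    STOP_WORDS = {"the", "a", "an", "is", "of", "and"}  # tiny stand-in list

    def preprocess(text, min_len=3):
        text = text.lower()                        # lowercase
        text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
        text = re.sub(r"\d+", "", text)            # remove digits
        words = [w for w in text.split()
                 if len(w) >= min_len              # remove short words
                 and w not in STOP_WORDS]          # remove stop words
        words = [re.sub(r"(ing|ed|s)$", "", w) for w in words]  # crude "stemming"
        return sorted(set(words))                  # retain only unique words

    print(preprocess("The 2 cats were running and jumped!"))
    # ['cat', 'jump', 'runn', 'were']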
Graph Measures
- number of nodes
- number of edges
- density (actual edges / potential edges)
- diameter (compute the shortest distance between every pair of nodes; the diameter is the maximum of them all)
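A minimal sketch of density for an undirected simple graph (the toy edge list is an assumption; diameter would additionally need all-pairs shortest paths, e.g., a BFS from every node):

    def density(n_nodes, edges):
        """density = actual edges / potential edges (undirected, no self-loops)."""
        potential = n_nodes * (n_nodes - 1) / 2
        return len(edges) / potential

    print(density(4, [("a", "b"), ("b", "c"), ("c", "d")]))  # 3/6 = 0.5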
Text Classification
- The task of assigning a category or a class to a text.
- Through supervised learning: start with a set of documents labeled with the classes they belong to, extract a set of features from the docs (often words), and produce a classifier that can classify a new doc into the appropriate class. The focus here is Naive Bayes classification.
- Corpus: a collection of docs. Document: a collection of tokens, terms, or words.
Clustering coefficients
- Measures the proportion of the ego's friends that are also friends with each other.
- Star networks, with a single broadcast node and passive listeners, will have a low coefficient.
- Dense ego networks will have a lot of mutual trust and a high coefficient.
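A minimal sketch (the toy star network is an assumption), which also shows the low coefficient of a broadcast hub:

    def clustering_coefficient(graph, ego):
        """Proportion of the ego's friends that are also friends with each other.
        graph: dict mapping node -> set of neighbors (undirected)."""
        friends = list(graph[ego])
        k = len(friends)
        if k < 2:
            return 0.0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if friends[j] in graph[friends[i]])
        return links / (k * (k - 1) / 2)

    # Star network: the hub's friends don't know each other -> coefficient 0
    star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
    print(clustering_coefficient(star, "hub"))  # 0.0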
Item Based Filtering
1) adjusted cosine similarity 2) weighted slope one
adjusted cosine similarity
1) Compute an item-to-item similarity matrix based on the mean-adjusted cosine similarity measure (similar to Pearson).
2) Calculate the normalized ratings of user v for all items i (must be between -1 and 1), similar to kNN.
3) Predict the normalized rating of how well user v will like item k.
4) Denormalize the prediction score.
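A minimal sketch of step 1 (the ratings dict and its values are made up; each user's mean rating is subtracted before taking the cosine over co-rated users):

    from math import sqrt

    def adjusted_cosine(ratings, item_i, item_j):
        """ratings: dict user -> {item: rating}."""
        num = den_i = den_j = 0.0
        for r in ratings.values():
            if item_i in r and item_j in r:
                mean = sum(r.values()) / len(r)   # this user's mean rating
                di, dj = r[item_i] - mean, r[item_j] - mean
                num += di * dj
                den_i += di * di
                den_j += dj * dj
        return num / (sqrt(den_i) * sqrt(den_j))

    ratings = {"u1": {"A": 5, "B": 3}, "u2": {"A": 4, "B": 2}}
    print(adjusted_cosine(ratings, "A", "B"))
    # -1.0: A is rated above each user's mean, B below it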
User Based Filtering - measures of similarity
1) distance based (dis)similarity measures 2) cosine-based similarity measure 3) pearson correlation-based similarity measure 4) k nearest neighbor
Naive Bayes Classifier
A family of algorithms that treat every feature as independent of every other feature (conditionally independent, given the class).
Maximum A Posteriori (MAP) rule
Bayes' Theorem is used to decide among alternative hypotheses by computing the probability of each hypothesis and picking the one with the highest probability. Since P(d) is the same for every hypothesis, this is essentially h_MAP = argmax over h of P(d|h)P(h).
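A minimal Naive Bayes sketch of the MAP rule (the priors and per-word likelihoods are made-up illustrative numbers):

    def h_map(priors, likelihoods, features):
        """Pick the hypothesis maximizing P(d|h)P(h), with P(d|h) taken as the
        product of per-feature probabilities (naive independence)."""
        best, best_score = None, -1.0
        for h, p_h in priors.items():
            score = p_h
            for f in features:
                score *= likelihoods[h].get(f, 0.0)
            if score > best_score:
                best, best_score = h, score
        return best, best_score

    priors = {"spam": 0.3, "ham": 0.7}                    # P(h)
    likelihoods = {"spam": {"free": 0.4, "hello": 0.1},   # P(word|h)
                   "ham":  {"free": 0.05, "hello": 0.3}}
    print(h_map(priors, likelihoods, ["free", "hello"]))  # ('spam', 0.012)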
Betweenness Centrality
Location in the network:
- *Control* data flow
- Serve as a *"broker"*
- *Isolate* leadership
How many people have to go through this individual to get to others?
Predicting Ratings
Mean Absolute Error (MAE) measures the average absolute deviation between predicted ratings and the user's true ratings (RMSE is also used). We can also convert ratings to binary (good vs. bad) and evaluate the predictions with the true positive rate against 1 - specificity (an ROC curve).
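A minimal sketch of both error measures (the rating lists are made up):

    from math import sqrt

    def mae(predicted, actual):
        return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

    def rmse(predicted, actual):
        return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    pred, true = [3.5, 4.0, 2.0], [4, 4, 1]
    print(mae(pred, true))   # 0.5
    print(rmse(pred, true))  # ~0.645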
Graph measures: Shortest Path
Shortest path: the path through the least number of nodes (or, with weights, the least total weight).
Dijkstra's algorithm: for a given vertex, find the lowest-cost path to all other vertices, where cost is determined by summing edge weights; the shortest path is the one with the least weight.
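A minimal Dijkstra sketch with a heap (the weighted toy graph is an assumption):

    import heapq

    def dijkstra(graph, source):
        """Lowest-cost path from source to every other vertex; cost is the sum
        of edge weights. graph: dict node -> {neighbor: weight}."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                          # stale heap entry
            for v, w in graph[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return dist

    g = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}
    print(dijkstra(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}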
TFIDF
TFIDF(t,d) = TF(t,d) * IDF(t), where:
- TF(t,d) is the number of times term t appears in document d, divided by the total number of terms in d
- IDF(t) = log(N / (1 + Nt)), where N is the number of documents and Nt is the number of documents containing term t
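A minimal sketch using the formulas above (the toy corpus is an assumption):

    from math import log

    def tfidf(term, doc, docs):
        """doc: list of tokens; docs: list of all token lists."""
        tf = doc.count(term) / len(doc)           # TF(t,d)
        n_t = sum(1 for d in docs if term in d)   # Nt
        return tf * log(len(docs) / (1 + n_t))    # TF * IDF

    docs = [["cat", "sat"], ["dog", "sat"], ["cat", "cat", "ran"]]
    print(tfidf("dog", docs[1], docs))  # 0.5 * log(3/2) ≈ 0.203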
Model Based Filtering
User-based filtering: find similar users and use their ratings to generate recommendations.
Item-based filtering: find similar items and use them, together with the user's ratings, to generate recommendations.
In the model-based approach, ratings are NOT stored; we just build a model based on closeness.
Collaborative filtering problems
- cold start
- scalability
- sparsity
- winner takes all
weighted slope one
Computationally efficient item-to-item method; "weighted" because each deviation is multiplied by its cardinality (the number of users who rated both items).
1) Compute the deviations between every pair of items.
2) Predict the rating of how well user v will like item k.
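A minimal Weighted Slope One sketch (the ratings dict is made up; deviations are weighted by cardinality, i.e., the number of users who rated both items):

    def slope_one_predict(ratings, user, target):
        """ratings: dict user -> {item: rating}."""
        num = den = 0.0
        for j, r_uj in ratings[user].items():
            if j == target:
                continue
            pairs = [(r[target], r[j]) for r in ratings.values()
                     if target in r and j in r]
            if not pairs:
                continue
            card = len(pairs)                     # users who rated both items
            dev = sum(ri - rj for ri, rj in pairs) / card
            num += (dev + r_uj) * card            # weight by cardinality
            den += card
        return num / den

    ratings = {"u1": {"A": 5, "B": 3},
               "u2": {"A": 3, "B": 4, "C": 2},
               "v":  {"B": 4}}
    print(slope_one_predict(ratings, "v", "A"))  # 4.5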
Distance based methods
Distance between two points P = (a, b) and Q = (c, d) in 2-dimensional space:
- Minkowski distance (with parameter r) = (|a-c|^r + |b-d|^r)^(1/r)
- Manhattan distance = Minkowski with r = 1
- Euclidean distance = Minkowski with r = 2
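A minimal sketch (the example points are made up); Manhattan and Euclidean fall out as special cases:

    def minkowski(p, q, r):
        """Minkowski distance with parameter r between points p and q."""
        return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

    p, q = (0, 0), (3, 4)
    print(minkowski(p, q, 1))  # Manhattan: 7.0
    print(minkowski(p, q, 2))  # Euclidean: 5.0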
sparsity
Even though there are a lot of users, not everyone rates everything, so there may be books on Amazon that no one has rated, and an algorithm might not be able to find nearest neighbors.
types of ratings
Explicit ratings: the user explicitly rates the item, e.g., thumbs up or down. The problem is that most people are too lazy to do this.
Implicit ratings: ratings inferred by keeping track of user behavior, e.g., Amazon's "frequently bought together" tab. The problem is that it is hard to distinguish individual preferences.
Eigenvector Centrality
How well-connected are those I'm connected to? A recursive version of degree centrality.
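A minimal power-iteration sketch of that recursion (the toy graph is an assumption; each node's score is repeatedly replaced by the sum of its neighbors' scores):

    def eigenvector_centrality(graph, iters=50):
        """graph: dict node -> list of neighbors (undirected, connected)."""
        scores = {n: 1.0 for n in graph}
        for _ in range(iters):
            new = {n: sum(scores[m] for m in graph[n]) for n in graph}
            norm = max(new.values())
            scores = {n: s / norm for n, s in new.items()}  # renormalize
        return scores

    g = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
    print(eigenvector_centrality(g))  # "b" (best-connected) scores highest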
Pearson Correlation
A measure of the linear dependence between two variables.
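A minimal sketch (the data is made up; the result lies in [-1, 1]):

    from math import sqrt

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect linear dependence)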
Islands
Only retain edges that have weight > threshold. What remains is a subcore of maximal activity between nodes that have developed a trust relationship.
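A minimal sketch of the thresholding step (the edge list is an assumption):

    def islands(edges, threshold):
        """Keep only edges with weight > threshold.
        edges: list of (u, v, weight) tuples."""
        return [(u, v, w) for u, v, w in edges if w > threshold]

    e = [("a", "b", 5), ("b", "c", 1), ("c", "d", 7)]
    print(islands(e, 3))  # [('a', 'b', 5), ('c', 'd', 7)]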
winner takes all
recommendation system's bias towards already popular items
Ego Network
The web and characteristics of social relationships that surround an individual, from the focal individual's perspective. Subnetworks centered on a certain node, extracted with breadth-first search while limiting the depth to <= 3.
confusion matrix
Top is the predicted condition, the side is the true condition:

             pred P   pred N
    true P     tp       fn
    true N     fp       tn

accuracy = (tp + tn) / n, where n is the total number of instances in the test set.
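A minimal sketch of the accuracy formula (the counts are made up):

    def accuracy(tp, fn, fp, tn):
        """accuracy = (tp + tn) / n, with n the total number of instances."""
        return (tp + tn) / (tp + fn + fp + tn)

    print(accuracy(tp=40, fn=10, fp=5, tn=45))  # 0.85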
Cold Start
Recommenders typically need a significant amount of existing data on a user in order to make recommendations, so brand-new users are a problem.
Scalability
User-based methods work great for thousands of users, but scalability becomes a problem when there are millions of users because many more distances have to be calculated.