Week 10 - Graph Database and ML
Execution of a Pregel program
"1. Many copies of the program begin executing on a cluster of machines 2. The master assigns a partition of the input to each worker • Each worker loads the vertices and marks them as active 3. The master instructs each worker to perform a superstep • Each worker loops through its active vertices and computes for each vertex • Messages are sent asynchronously, but are delivered before the end of the superstep. This step is repeated as long as any vertices are active, or any messages are in transit 4. After the computation halts, the master may instruct each worker to save its portion of the graph "
Describe vertex-oriented graph processing
"Based on BSP (Bulk Synchronous Parallel) model. Input: a set of directed graph to Pregel, run computation (fusion and diffusion) at each vertex. Repeat computation. Halt when each vertex votes to halt. Output: returns another directed graph. Computations: Fusion: aggregate information from neighbors to a set of vertices Diffusion: propagate information from a vertex to a set of neighbors"
How does Pregel guarantee fault-tolerance? Describe checkpointing, failure detection and recovery
"Checkpointing • The master periodically instructs the workers to save the state of their partitions to persistent storage • e.g., Vertex values, edge values, incoming messages • Failure detection • Using regular "ping" messages Recovery • The master reassigns graph partitions to the currently available workers • The workers all reload their partition state from most recent available checkpoint"
What is the Google Cloud AI Platform? What tools does it provide?
"Ecosystem to deploy an ML model. Notebooks => Platform Training => Continuous Evaluation => Predictions. KubeFlow: deployment of ML on Kubernetes AutoML tables"
Optimizations for K-Means
"Elbow method to look for the optimal number of clusters Remove outliers Standardize input "
Frequent pattern mining (fpm) and its use cases
"Given a set of transactions: 1. Calculate item frequencies and identify frequent items. 2. Use a suffix tree (FP-tree) to encode transaction 3. Extract frequent item set from the FP-tree"
Differences between Graph Databases and Relational Databases
"Graph Databases Overview: - Associative data sets describing some attributes of objects such as persons or cars. - Structure of OOP applications. - Do not require JOINs. Relational Database: - Perform same operation on large numbers of data elements. - Row keys are unique and using lots of entity instance tables - Use relational model of data.
What is Graph Processing? What is a graph database?
"Graph databases with explicit graph structure: Each node knows its adjacent nodes. Cost of node A to node B (hop or local step) is the same regardless of how many nodes there are. Can be faster than relational for graph type queries: who is a friend of a friend? (Social media). Scales well. Does not require joins. Less rigid schema permits easier evolution.
What is a hyper parameter? What is AutoML? How do they relate to one another?
"Hyperparameter: parameter about training of the model. AutoML creates a search space to fine tune the hyperparameters (grid search, random search, gradient descent)"
What does Microsoft Azure offer for cloud-based machine learning?
"Managed platform Visual workflow design for no-code ML tasks Managed Jupyter Notebooks"
What is Spark GraphFrames? What are some of the graph operators that are used in GraphFrames?
"Move away from RDD and more towards DataFrame. Python interface. Support motif finding for structural pattern searches. Easier to do optimization. Method 1: DataFrame & GraphFrame operations such as Motif finding as Series of DataFrame joins (vertices DF and edges DF to find structural pattern) and Use other pre-packaged algorithms. Method 2: Message passing: Send messages between vertices, and aggregate messages for each vertex. Method 3: Pregel"
What is the OSEMN Data Science Model? What happens in each of its stages? What cloud tools can you use in each stage?
"Obtain, Scrub, Explore, Model, Interpret. Obtain: data sources on the cloud, command line, APIs. Scrub: data prep and data wrangling Explore: Analyze, descriptive stats, EDA Model: train and test models, evaluation (precision, recall, f1 score, MAE, RMSE). Interpreting data "
Describe the Pregel system architecture
"Pregel system uses the master/worker model. Master: Maintains worker. Recovers faults of workers. Provides Web-UI monitoring tool of job progress Worker: Processes its task. Communicates with the other workers "
What is Spark GraphX? What are some of the graph operators that are used in GraphX?
"RDD based API for graph processing. Operators: vertices(), edges(), reverse(), subgraph(), mapVertices(), mapEdges(), joinVertices(), connectedComponents()"
What is GraphX? What technologies does GraphX use?
"RDD based API for graph processing. "
What is Human in the loop AI? Why use it? What are its strengths and weaknesses? What are tools that support Human in the loop AI?
"Reframe the automation problem as a HCI design problem. Pros: Gain in transparency, human judgement, no need to build the perfect AI system. Cons: need to pay extra. Unbiased reviews. Human errors. Tools: Amazon Mechanical Turk (Augmented AI, Ground Truth) "
Describe the standard execution flow in Giraph
"Setup: load the graph from disk, assign vertices to workers, validate worker health Compute (repeat): assign msg to workers, iterate on active vertices, call compute() Synchronize (repeat): send msgs to workers, compute aggregators, checkpoint Teardown: write back result, write back aggregators. "
What are examples of unstructured data? What are the ML cloud tools specifically designed to handle unstructured data?
"Unstructured: voice, language, vision. Vision: AWS Rekognition, Azure Computer Vision, Google Vision AI, IBM Watson visual recognition Voiced-based services: AWS Lex, Google Cloud Speech-to-Text, Azure Speech-To-Text, IBM Watson Speech to text NLP: AWS Lex, AWS Polly, Microsoft Language Understanding, Google natural language, IBM Watson natural language classifier. "
Describe the responsibilities of the following components in Giraph - ZooKeeper, Master, and Worker
"ZooKeeper: computation state (partition/worker mapping, global state, checkpoint paths, aggregator values, statistics) Master: coordination (assign partition to workers, coordinate synchronization, request checkpoints, aggregate values, collect health status) Worker: vertices (invoke Compute() function, send/receive/assign msgs, compute local aggregation values)."
What is classification with respect to machine learning?
A type of supervised learning that predicts a discrete variable. The discrete variable can be binary or multiclass.
What is the K-Means algorithm used for? How does it work?
K-means is an unsupervised clustering algorithm. Each data point belongs to exactly one centroid, and the number of centroids equals the number of clusters. The aim is to minimize intra-cluster distances while maximizing inter-cluster distances. 1) Randomly select c cluster centers (use the elbow method to pick c). 2) Calculate the distance between each data point and each cluster center. 3) Assign each data point to the cluster center nearest to it. 4) Recalculate each cluster center as the mean of the coordinates of the data points that belong to that cluster. 5) Recalculate the distance between each data point and the newly obtained cluster centers. 6) If no data point was reassigned, stop; otherwise repeat from step 3.
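The steps above translate fairly directly into code. This is a minimal sketch under stated assumptions: points are tuples of numbers, the initial centers are sampled from the data, squared Euclidean distance is used, and the `seed` and `max_iter` parameters are additions for reproducibility and safety rather than part of the notes.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, c, seed=0, max_iter=100):
    """Returns (centers, assignment), where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)                      # step 1
    assignment = None
    for _ in range(max_iter):
        # Steps 2, 3 and 5: assign every point to its nearest center.
        new_assignment = [
            min(range(c), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:                 # step 6: nothing moved
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its members.
        for k in range(c):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                                  # keep old center if cluster is empty
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, c=2)
```

Because the result depends on the random initial centers, practical implementations usually run several initializations and keep the best one.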
Describe the Naive Bayes algorithm
Assumption: the features are conditionally independent given the class, i.e. P(x(1), ..., x(n) | y) = P(x(1)|y) * ... * P(x(n)|y). In Spark MLlib, the input is an RDD of LabeledPoint and a smoothing parameter lambda. The output is a Naive Bayes model that can be used for multinomial classification tasks.
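A from-scratch sketch of multinomial Naive Bayes with additive smoothing, to make the independence assumption concrete. It is not the Spark MLlib implementation: the `(label, counts)` input format and the names `train_nb` and `predict_nb` are assumptions, and the class prior is left unsmoothed for simplicity.

```python
import math
from collections import defaultdict

def train_nb(samples, lam=1.0):
    """samples: list of (label, feature_counts); lam: smoothing parameter."""
    n_features = len(samples[0][1])
    class_docs = defaultdict(int)
    class_feat = defaultdict(lambda: [0.0] * n_features)
    for label, x in samples:
        class_docs[label] += 1
        for j, v in enumerate(x):
            class_feat[label][j] += v
    model = {}
    for label, n_docs in class_docs.items():
        log_prior = math.log(n_docs / len(samples))
        denom = sum(class_feat[label]) + lam * n_features
        # Smoothed log P(feature j | class)
        log_theta = [math.log((class_feat[label][j] + lam) / denom)
                     for j in range(n_features)]
        model[label] = (log_prior, log_theta)
    return model

def predict_nb(model, x):
    """Pick the class with the highest log posterior (up to a constant)."""
    def score(label):
        log_prior, log_theta = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up.
        return log_prior + sum(v * lt for v, lt in zip(x, log_theta))
    return max(model, key=score)

model = train_nb([
    ("spam", [3, 0, 1]), ("spam", [2, 0, 0]),
    ("ham", [0, 3, 1]), ("ham", [0, 2, 2]),
])
```

The smoothing term keeps a feature that never appeared with a class from forcing that class's probability to zero.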
What is collaborative filtering?
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
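One common variant, user-based collaborative filtering, can be sketched as follows. The choice of cosine similarity, the nested-dict rating format, and the names `cosine` and `predict_rating` are assumptions made for this example; other variants (item-based, matrix factorization) exist.

```python
import math

def cosine(u, v):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None   # None: nobody comparable rated it

ratings = {
    "alice": {"m1": 5, "m2": 1},
    "bob":   {"m1": 5, "m2": 1, "m3": 5},
    "carol": {"m1": 1, "m2": 5, "m3": 1},
}
alice_m3 = predict_rating(ratings, "alice", "m3")
```

Here alice's tastes match bob's more closely than carol's, so bob's rating of `m3` carries more weight in the prediction.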
Describe an algorithm to find connected components in Giraph
Compute: Propagate smallest vertex label to neighbors until convergence. In the end, all vertices of a component will have the same label.
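A single-process sketch of this label propagation, written as a loop of supersteps. It is not Giraph code: the adjacency-dict input, the use of the vertex id as the initial label, and the function name are assumptions for illustration.

```python
def connected_components(neighbors):
    """neighbors: dict vertex -> list of adjacent vertices (undirected graph).

    Returns dict vertex -> smallest vertex id in its component.
    """
    labels = {v: v for v in neighbors}
    # Superstep 0: every vertex sends its own label to its neighbors.
    messages = {v: [] for v in neighbors}
    for v, adjacent in neighbors.items():
        for n in adjacent:
            messages[n].append(labels[v])
    # Later supersteps: adopt a smaller label if one arrives, then forward it.
    while any(messages.values()):
        next_messages = {v: [] for v in neighbors}
        for v, incoming in messages.items():
            if incoming and min(incoming) < labels[v]:
                labels[v] = min(incoming)
                for n in neighbors[v]:
                    next_messages[n].append(labels[v])
        messages = next_messages
    return labels

components = connected_components(
    {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
)
```

A vertex only sends messages when its label changes, so the computation goes quiet (converges) once every component agrees on its smallest label.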
Limitations of current shared-memory systems for graph processing
Existing shared-memory parallel graph algorithms provide no fault tolerance.
What are Graphs? Describe nodes / vertices, edges, etc
A graph DB is a storage system that provides index-free adjacency. Nodes (vertices) represent entities (objects). Edges represent the relationships (verbs, adjectives, attributes) between vertices. Properties are pertinent information attached to nodes and edges.
What is Pregel? How does it work? What are its properties? How does it detect and respond to failures?
A graph processing model introduced in a research paper by Google. It uses index-free adjacency to store its vertices and edges. Vertices are a first-class primitive type. The master regularly sends "ping" messages to the workers to detect failures. In case of a failure, the master reassigns the affected partitions, and the workers reload their state from the most recent available checkpoint.
How does GraphX compare to Giraph / Pregel?
GraphX implements the Pregel model on top of RDDs when processing the graph. Pregel/Giraph uses a master/worker model with vertices as first-class citizens.
Describe how AWS SageMaker supports ML on the cloud.
Label (AWS SageMaker Ground Truth), Build (SageMaker Studio), Train and Tune (Experiments, Debugger, and Tuning), Deploy and Manage (Model Monitor, Neo, Augmented AI).
Limitations of Map-Reduce with respect to graph processing
Limits: - out-of-core memory - graph state is computationally heavy. Explanation: graph computations involve local data, and the connectivity is sparse. The data may not fit on one node, which makes graph processing harder in the MapReduce model. MapReduce is also inefficient here because the graph state must be stored at each stage of the graph algorithm, and each computational stage produces a lot of communication between stages.
Describe a few models which can be used for classification
Logistic regression, Naive Bayes, SVM, decision trees, random forests, and gradient-boosted trees.
How does Data Mining and Machine Learning relate to Artificial Intelligence? What are some applications?
ML and DM are subsets of AI. Applications: information retrieval, statistics, biology, linear algebra, marketing, and sales.
Machine Learning in the Cloud
ML and data mining
Graph operators supported by GraphX
Operators: vertices(), edges(), reverse(), subgraph(), mapVertices(), mapEdges(), joinVertices(), connectedComponents()
Storage of temporary and persistent data in Pregel
Persistent data is stored as files on a distributed storage system (such as GFS, HDFS, or Bigtable). Temporary data is stored on local disk.
What is Giraph? How does it differ from Pregel? Why was it created?
An open-source implementation based on Pregel; like Pregel, it uses the master/worker model. Giraph adds several features beyond the basic Pregel model, including master computation, sharded aggregators, edge-oriented input, and out-of-core computation.
Layout of a standard Pregel program
Property graph, vertices table, edges table.
Building a graph in GraphX
In Scala: build the graph from a list of vertices and a list of edges.
What is Mahout? What are its goals?
Apache Mahout is a scalable ML library that implements different approaches to ML.
What is Spark? What are its goals?
Scalable ML models within the Spark environment; the goals are ease of use and scalability. Ted Malaska thinks Spark is a type of DB.
What is index-free adjacency?
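Each vertex stores direct references to its adjacent vertices (vertex, <list reachableVertices>), so the graph can be traversed without a global index lookup; the same graph results regardless of the index of each row.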
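A small sketch of the idea: each node object holds direct references to its neighbors, so a traversal follows pointers instead of consulting a lookup table. The `Node` class, `connect`, and `friends_of_friends` are hypothetical names for this illustration, not any graph database's API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []      # direct references to adjacent Node objects

def connect(a, b):
    """Add an undirected edge by storing each node in the other's neighbor list."""
    a.neighbors.append(b)
    b.neighbors.append(a)

def friends_of_friends(node):
    """Two hops out, following references only; no global index is consulted."""
    result = set()
    for friend in node.neighbors:
        for fof in friend.neighbors:
            if fof is not node and fof not in node.neighbors:
                result.add(fof.name)
    return result

alice, bob, carol, dave = Node("alice"), Node("bob"), Node("carol"), Node("dave")
connect(alice, bob)
connect(bob, carol)
connect(carol, dave)
```

The cost of each hop depends only on the size of the neighbor list being followed, not on the total number of nodes in the graph.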
What are super steps? What does each super step involve doing?
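Supersteps are a sequence of iterations/computations. The user subclasses Vertex and writes the Compute() method, which runs for each active vertex in every superstep. Within it, a vertex can get/set its outgoing edges, get/set its own value, and send/receive messages.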
Differences between undirected and directed graphs
An undirected graph can be treated as a directed graph in which every edge is bidirectional (present in both directions).
What is clustering? What is it used for?
An unsupervised ML technique that groups similar objects into clusters. It is used to gain more information about the groups, or the clusters can later be used as features in an ML model.
Primitives in Pregel? Are edges or are vertices first-class citizens in Pregel?
Vertices and edges. Vertices are first-class citizens.