Big Data, Weeks 1-7
What are the 4 ways provided to construct an RDD?
1) RDDs can be created through the parallelization of Scala collections 2) RDDs can be created through the transformation of parent RDDs 3) RDDs can be created by reading from a file 4) RDDs can be maintained through changing the persistence of an existing RDD (i.e. caching)
1-to-1
1-to-1: Person, Signature; Each person is associated with exactly one signature and each signature is associated with exactly one person
1-to-N
1-to-N: Director, Movies; Each Director may be associated with many movies, but each movie has only one director
Short
16 bits/2 bytes, -2^15 through 2^15 - 1
What's the default port for MongoDB?
27017
Int
32 bits/4 bytes, -2^31 through 2^31 -1
Float
4 bytes, roughly 6-7 significant decimal digits of precision
Long
64 bits/8 bytes, -2^63 through 2^63 - 1
Byte
8 bits/1 byte, -2^7 through 2^7 - 1
Double
8 bytes, roughly 15-16 significant decimal digits of precision
How many blocks will a 200MB file be stored in in HDFS, if we assume default HDFS block size for Hadoop v2+?
A 200MB file will be stored in two blocks, as each block in HDFS stores 128MB of data by default
What is a Hive bucket?
A Hive bucket is similar to a partition in that it divides a table into a number of subtables, however, these subtables are based on a more arbitrary data column than in a partition, allowing each bucket to effectively serve as a smaller random sample of its database at large.
What is a Spark Application? Job? Stage? Task?
A Spark application is the full program that contains all of the various RDDs/jobs that will be evaluated upon its execution. A Spark job is spawned each time an action, such as .collect() or .take(), is called upon an RDD, and is made up of stages. A Spark stage is a series of transformations that can be performed upon an RDD without any need to repartition; each time we call a shuffle transformation on an RDD, we end one stage and begin a new one. A Spark task is each individual transformation, such as .map or .filter, that is performed upon an RDD. As long as no shuffle occurs, an RDD will be passed from one task to the next in the same stage.
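A minimal Scala sketch of how these pieces relate (assuming an existing SparkContext sc; the file name is illustrative):
    val counts = sc.textFile("input.txt")   // creating an RDD
      .flatMap(_.split(" "))                // narrow transformations: tasks within one stage
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // shuffle: ends one stage and begins the next
    counts.collect()                        // action: spawns a job made up of the stages above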
nano
A commonly used CLI text editor
What is a Container in YARN?
A container is a bundle of computing resources given to a worker by the Scheduler
What is a daemon?
A daemon is a long-running background process that allows for the functioning of programs and processes that the end-user interacts with
What is a database?
A database is an application, running on a server, that is responsible for the long-term storage and management of data, allowing for large amounts of data to be stored and accessed efficiently as needed.
What is Apache Kafka
A distributed event streaming platform
lineage
A fault tolerance concept that allows our RDDs to be reconstructed based on the data from which they were constructed and the transformations necessary for their construction.
What is a Future?
A future is a part of the program execution that is directed to execute in a separate thread, allowing the main program flow to continue while the future is being executed. This is useful when a task takes a relatively long time to execute and we do not want the main execution flow to be held up while the task completes. The result of the Future can then be called later in the program's execution.
What is a hashCode? How are hashCodes used in (default) partitioning?
A hashCode is the integer produced by a hashing algorithm, a mathematical function that always returns the same output when given the same input. Hashes can represent things like files and can be used for integrity checks. In the default partitioning algorithm, a hash is computed for each intermediary key and then taken modulo R (the number of reduce tasks/output files specified by the user) to decide which partition the record belongs to.
What is a higher order function?
A higher order function is a function that takes a function as an argument and/or returns a function.
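A short Scala sketch (names are illustrative):
    def applyTwice(f: Int => Int, x: Int): Int = f(f(x))  // takes a function as an argument
    applyTwice(_ + 1, 3)                                   // returns 5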
In broad terms, what is a Join?
A join is a method that combines two datasets (e.g. two pair RDDs or two tables) based on a shared key, outputting each matching key together with its corresponding values as a tuple.
What is a Lambda?
A lambda is an alternative method for using functions that does not require functions to be defined beforehand and then called when needed, but rather allows for functions that only need to be used once to be defined at the time of execution.
cache
A method that allows us to change the persistence of an RDD so that, once computed, it remains in memory. A cache call on an RDD that would overflow available RAM will lead to partitions being dropped and recalculated as necessary.
When a Kafka broker that is the leader for some partition fails how is a new leader chosen?
A new leader is selected from among the set of in sync replicas
What does a partitioned table look like in HDFS?
A partitioned table will be divided into multiple directories within HDFS, with each directory corresponding to one partition.
What is a primary key?
A primary key is analogous to the object id in Mongo, it serves as a unique identifier for a given record.
What is a pure function?
A pure function is a function that takes an input and produces an output with no "side effects", i.e. a mathematical function
What do we need to include in any effective recursive function?
A recursive function needs two things to be effective: it needs to call itself on a different input than it was originally called with, and it needs to have a base case.
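A short Scala sketch (illustrative):
    def factorial(n: Int): Int =
      if (n <= 1) 1               // base case
      else n * factorial(n - 1)   // recursive call on a smaller input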
What is a side effect?
A side effect occurs when a chunk of code does something other than simply return a value, such as performing I/O, i.e. printing to the console
What is a Thread?
A thread is a single flow of program execution. Each thread has its own stack.
What is a Trait?
A trait is similar to an interface in that it stores values or methods that can be shared by the classes that implement the trait.
What's the difference between a while loop and a do-while loop?
A while loop will only execute while the condition is true, thus it may never execute if the condition is never met. A do-while loop, however, will always execute at least once.
What is a Working Set?
A working set is a dataset that is stored in memory (RAM) to allow for quick access and use.
What is AWS?
Amazon Web Services (AWS) is a cloud-based service provided by Amazon that allows for the execution of applications on virtual servers running on hardware maintained by Amazon. AWS has all the typical benefits of a cloud service, including elasticity, scalability, and the ability to use large amounts of computing resources on a pay-per-use basis, which is ultimately much cheaper than maintaining a server farm directly.
ArrayBuffer
An ArrayBuffer is like a List but is mutable
What is an ObjectId?
An ObjectId is a unique identifier assigned to each document in MongoDB. It is stored in the _id field (shown as $oid in extended JSON output).
What does it mean to transform an RDD?
An RDD can be created by performing a transformation on a parent RDD, i.e. mapping a function to the values stored in an existing RDD.
What's the difference between an absolute and a relative path?
An absolute file path gives the entire file path of a file or directory beginning at the root directory, while a relative file path gives the path to a file or directory relative to the current working directory.
How do we define an auxiliary constructor?
An auxiliary constructor is a constructor defined within the body of a class that makes a call to the primary constructor to instantiate an object but does not need to be passed the full set of parameters.
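A short Scala sketch (names are illustrative):
    class Person(val name: String, val age: Int) {
      def this(name: String) = this(name, 0)   // auxiliary constructor calls the primary constructor
    }
    val p = new Person("Ada")                  // age defaults to 0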
What is found inside an empty Option?
An empty option contains None
What are some examples of side effects?
Anytime a function alters some sort of external state, this is called a side effect. Some common examples include a method which alters some form of data stored within an object or a method that serves some sort of I/O purpose, like printing a line to the screen.
EC2?
EC2 stands for Elastic Compute Cloud and is the AWS service that allows for the creation of virtual servers providing large amounts of computing resources that can be used to run applications and perform various jobs. EC2 instances are billed on a per-hour fee model.
What is beeline?
Beeline is a CLI designed to interact with Hive, allowing for easy and efficient querying of Hive databases via the use of HQL.
In a typical Hadoop cluster, what's the relationship between HDFS data nodes and YARN node managers?
Both HDFS data nodes as well as YARN node managers exist on the worker machines within a Hadoop cluster. It is the responsibility of the data nodes to maintain the data that is stored on the machine, while it is the responsibility of the node manager to manage the computing resources (called a "container") that can be used to perform tasks using the data.
What are some differences between hard disk space and RAM?
Both hard disk space and RAM provide memory used for storing computer data. However, RAM is volatile memory, meaning it does not persist on the loss of power, while disk space is persistent memory, allowing data stored in hard disk space to continue to exist after shutdown. Typically, data that is going to be used for some type of process is moved to RAM to allow quick and easy access, while data that needs to be stored for a longer period of time is written to disk
What are some steps we can take to debug a Spark Application?
Both our driver program as well as our executors will write to StdErr when an issue is encountered with execution. By looking at the output to this file (typically stored in /var/log/spark) we can find what the issue with our application is. Typical issues are OOM for program drivers and GC issues for executors. These errors can be rectified by allocating more memory to the appropriate machine
Why are broadcast joins significantly faster than shuffle joins?
Broadcast joins are faster than shuffle joins because they do not require an expensive shuffle step, instead passing the full RDD to each executor as a broadcast variable.
What is a broadcast variable?
Broadcast variables are the immutable/read-only, global variables used within Spark. Broadcast variables are useful in situations where worker nodes must share a variable to perform their work, but will not need to alter the variable, i.e. filtering a list RDD based on a fixed value.
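A minimal sketch (assuming an existing SparkContext sc and an RDD namesRdd; names are illustrative):
    val allowed = sc.broadcast(Set("alice", "bob"))   // read-only copy shipped to every executor
    val filtered = namesRdd.filter(name => allowed.value.contains(name))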
How many replications of data does GFS provide by default? Block size?
By default, GFS provides a replication set of 3 for every chunk, which are 64MB in size
What is the default number of replications for each block?
By default, blocks are given three total replications
What is the size of an Int? a Double? (the default integer and decimal types)
By default, integers are stored as Ints(4 bytes) unless otherwise specified, while Decimals are stored as Doubles(8 bytes) unless otherwise specified
Is the Master fault tolerant in GFS? Why or why not?
By default, the master is not made fault tolerant: failure of the single master is considered highly unlikely, and should it fail, the computation (e.g. the MapReduce job) can simply be restarted.
EMR?
EMR stands for Elastic MapReduce and is AWS's service that allows a Hadoop cluster to be created on top of EC2 instances. This EMR cluster can then be used for the execution of distributed jobs, such as MapReduce and Spark jobs.
How does Structured Streaming let us process an incoming stream of data?
By processing the incoming stream in small batches, then appending incoming records to our streaming DF/DSs
How does Spark Streaming let us process an incoming stream of data?
By processing the incoming stream in small batches, then appending new RDDs to our DStreams
What's the name for the code written in .class files?
Bytecode
What is the CAP Theorem?
CAP stands for Consistency, Availability, Partition Tolerance. CAP theorem states only 2 of these 3 concepts can be achieved simultaneously, not all 3.
How do we create a table?
CREATE TABLE table_name (column1 TYPE, column2 TYPE, ...), optionally followed by clauses such as ROW FORMAT, STORED AS, and TBLPROPERTIES
What is CRUD?
CRUD stands for Create, Read, Update and Delete, and represents the 4 major ways in which a user interacts with a database
What does it mean to cache an RDD?
Caching an RDD means storing it in RAM so that it can be used for further analysis without having to be recreated, i.e. it overrides the RDD's naturally ephemeral state.
What is a case class?
Case classes are classes that are automatically given certain "syntactic sugar" that helps in creating immutable classes. For example, fields in case classes are val by default and several methods that may be advantageous are generated.
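A short Scala sketch (names are illustrative):
    case class Point(x: Int, y: Int)   // fields are vals; equals, hashCode, toString, copy, apply are generated
    val p = Point(1, 2)                // apply: no `new` needed
    val q = p.copy(y = 5)              // copy with one field changed
    p == Point(1, 2)                   // structural equality: true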
cd
Changes directories into a given directory
What are classes? Objects?
Classes are blueprints for objects that contain data and methods. For a class to be used, it must be instantiated as an object which is then stored in memory and can be used as needed without affecting the class itself
What does Cluster Computing refer to?
Cluster computing refers to the concept of storing and processing large amounts of data across a networked set of computers within which each computer represents one node of the larger cluster.
How do we write code to throw an Exception?
An exception is thrown using the throw keyword, e.g. throw new Exception("something went wrong"). Exceptions are then caught in a try/catch block, where the catch clause uses case patterns to match the exception type.
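A short Scala sketch of throwing and catching (names are illustrative):
    def divide(a: Int, b: Int): Int =
      if (b == 0) throw new ArithmeticException("divide by zero")   // throwing
      else a / b
    try divide(1, 0)
    catch { case e: ArithmeticException => println(e.getMessage) }  // catching with case patterns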
Provide an example of a good column or set of columns to partition on.
Columns used for partitioning should be those of relative importance to the dataset at large. For example, if one were to create a table of all the Wikipedia view data provided by the Wikimedia analytics dumps, it may be valuable to partition the data on domain code, allowing for quicker queries of data within one of the many domain codes available.
How many partitions does a single task work on?
Each task operates on its own partition, meaning that the number of partitions for our dataset will determine the number of tasks that are spawned.
Encapsulation
Encapsulation is the concept of protecting the internal workings of an object or class by allowing an object's data to be manipulated only by the object itself. Other objects can call the methods an object exposes to manipulate its own state, but cannot directly manipulate the data of another object.
What are Errors?
Errors are similar to exceptions, however they occur when something goes wrong in the JVM, such as a stack overflow. Unlike exceptions, these should not (and generally cannot) be caught by the program
What are Exceptions?
Exceptions are thrown by the program whenever something unexpected occurs, such as a function being passed an incorrect data type or an attempt to access an item in an array using an invalid index
What is an executor? What are executors when we run Spark on YARN?
Executors are the worker processes of Spark, i.e. where the actual evaluation of RDDs as well as the caching of RDDs in memory occurs. When we run Spark on YARN, executors run as containers on the YARN NodeManagers.
Transformation
Expressions performed upon an RDD to create a child RDD, i.e. RDD1.map(_ + 2) = RDD2
What does FROM do?
FROM is used in conjunction with SELECT to specify which table to search for a given record criteria.
True or False: Pub Sub messaging works well as long as there are relatively few publisher-subscriber pairs, but it is inferior to client-server messaging when we need to send messages to 100s of machines.
False
True or False: Since Kafka is a streaming platform, we need to be careful to have our applications wait for subscribers before sending messages
False
True or False: Structured Streaming in Spark SQL is built on top of Spark Streaming (with DStreams)?
False
What does .filter do?
Filter allows for the return of only certain elements within a collection, i.e. list.filter(_ > 4) will return only those elements in the list that are greater than 4.
What are some transformations available on an RDD?
flatMap, map, filter, and reduceByKey. (parallelize, by contrast, is not a transformation; it is the SparkContext method used to create an RDD from a collection.)
What happens to the output of completed Map tasks on failed worker nodes?
Completed Map tasks on failed worker nodes are re-executed because their output is written to the local disk of the failed worker and would thus no longer be accessible
What are Counters for?
Counters can be implemented to count the number of occurrences of a given value. The counter is initialized to 0 and incremented with each relevant record reduced. Ex: counting the number of total words processed. (MapReduce automatically generates counters for the number of (key, value) pairs input and output.)
touch
Creates a new empty file
What are DML, DDL, DQL?
DML, DDL, and DQL are all subsets of SQL used to refer to the type of actions being performed in each SQL statement. DML stands for data manipulation language and refers to statements such as INSERT, UPDATE, and DELETE, i.e. statements that affect records in the database. DDL stands for Data Definition Language and refers to statements used to define tables for records, such as CREATE, ALTER, and DROP. DQL stands for Data Query Language and refers to statements used to read data from the database, namely SELECT.
lazy evaluation
Data is not read from memory until an action is called upon an object that contains or uses that data.
What is Data Locality? Why is it important?
Data locality refers to the concept of the proximity between the data store and the processor/worker that will access the data. Since it is far faster and more efficient for a given worker to access data from the local disk than over the network, the master assigns tasks to the workers based on data locality.
What is data locality and why is it important?
Data-locality is a term used to describe where data is being stored in the physical computer cluster in relation to the resources that need to access the data. Data locality is important because it is much faster for a task to be performed on data written to the local disk than it is to perform a task on data that must be accessed over the network.
What are DataFrames?
DataFrames are an abstraction of RDDs used with Spark SQL that are similar to their namesake in R and Python. They allow data scientists who may not be familiar with RDD concepts to still perform SQL queries via Spark.
How do we convert a DataFrame to a DataSet?
DataFrames are converted to DataSets via a call to .as, providing a defined case class for typing, i.e. val ds = df.as[CaseClass]
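A minimal sketch (assuming an existing SparkSession spark and a DataFrame df whose columns match the case class; names are illustrative):
    case class Person(name: String, age: Long)
    import spark.implicits._        // provides the encoder for the case class
    val ds = df.as[Person]          // typed Dataset[Person]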
How are DataNodes fault tolerant?
DataNode fault tolerance is achieved by the NameNode, which in the event that it stops receiving heartbeats from a DataNode, will use replication sets to ensure the data is preserved and a new replication set is created to maintain the data replication factor.
What are DataSets?
DataSets are similar to DataFrames, in that they are an abstraction of RDDs, however they differ in that they require strict data typing via the use of case classes, and thus guarantee compile time type safety.
Why do we use databases instead of just writing to file?
Databases allow for the storage of large amounts of data that can persist over long periods of time and are available remotely to many users. While most of these features are also available to some degree if we write to a file instead, the real advantage that databases provide is efficiency in accessing, updating, and removing entries. In a file, it would be necessary to comb through the entire file using some sort of string identification method, while databases have built in methods to quickly query and return appropriate results.
What are the ways we could provide default values for a field (Scala class)?
Default values can be provided for a field by using the "=" operator with a parameter in the primary constructor. Alternatively, auxiliary constructors can be used to instantiate objects with some chosen default values for certain parameters.
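A short sketch of a default field value (names are illustrative):
    class Config(val host: String = "localhost", val port: Int = 8080)  // "=" supplies default values
    val c = new Config()               // uses both defaults
    val d = new Config(port = 9000)    // overrides only the port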
What is Dictionary encoding?
Dictionary encoding is a method for data compression used by Parquet in which a dictionary is created to represent the values stored within a column and assign each a key. The data in the column can then be stored simply using the dictionary and keys rather than storing the entire column.
What is a distributed application? A distributed data store?
Distributed applications and databases are those that are stored/run on a cluster of machines connected over a network. This allows for the application or database to run in parallel and thus leads to much faster execution time.
What about handling document relationships using references -- advantages and disadvantages?
Document relationships can also be handled using references, rather than embedding documents within other documents. These references can be useful when dealing with large datasets as they allow for a large amount of data to be referenced within a document, while allowing that dataset to also be a stand-alone document. However, references have an added disadvantage in that they require an additional query to find the referenced document
In MongoDB, what is a document?
Documents are the fundamental building blocks for data storage in MongoDB. Each document contains several fields and the corresponding data for those fields. When we access data from MongoDB, we do so by retrieving documents from collections.
What is spark.driver.memory? What about spark.executor.memory?
Driver memory in Spark is the memory allocated to the driver program, which is 1GB by default. If the executors return more than 1GB of data to the driver program, this can cause an OOM error, leading the driver program to crash and thus terminate the execution of our Spark job. Executor memory refers to the amount of RAM available on the worker node of an executor that is allocated to the processing of Spark tasks. This memory is equivalent to roughly 60% of all memory available on the worker node by default, though it can be adjusted. Exceeding Executor memory will cause data to be spilled to disk, which can lead to inefficiency in the execution of a spark task as it incurs I/O overhead.
Parquet is stored efficiently on disk and is easy to query, traits that make it useful for big data. What are the downsides of the parquet format?
Due to Parquet's columnar storage and compression techniques, it is far more difficult and inefficient to write to or alter records in a Parquet file. Parquet is ideal for use with read-only data that will be queried but not changed.
broadcast variable
Immutable/read-only, global variable to be passed to all worker nodes in a Spark job.
What is a managed table?
In contrast to an external table, a managed table is a table in which Hive is directly responsible for managing the data. This means that Hive can guarantee the data is normalized, but also means that if the table holding the data is dropped, the data will be lost.
Do we need to declare type for our variables in Scala? always?
It is not necessary to declare variable types in Scala, as Scala can infer the type from the value assigned. A type can be declared, however, by placing a colon and the type after the variable name, e.g. var x: Int = 44
How does Scala relate to the JRE and JVM?
Scala relates to the JRE and JVM in the same way Java does: Scala code compiles to JVM bytecode (.class files), which is then executed by the JVM provided by the JRE.
What is Scalability? How is it achieved in Mongo?
Scalability means that the system is capable of being expanded to deal with increased demands. This is achieved in Mongo using sharding.
Set
Sets are like Maps but rather than containing key, value pairs, they contain only values and cannot contain duplicates. They can be either mutable or immutable, and are not indexed.
Explain sharding
Sharding is used in Mongo to achieve scalability. Sharding allows us to divide a set into subsets when the demand to write to or read from that set is increased. EX: If our Author collection is receiving a large number of writes, we can shard the set into Authors with names A-M and those with names N-Z, so each set will only need to deal with part of the increased demand
What is a shuffle in Spark?
A shuffle is the redistribution of data across partitions (and therefore potentially across executors and over the network). Shuffles in Spark occur when we call wide transformations such as .reduceByKey, .sortBy, .join, and .repartition
What is an ApplicationMaster? How many of them are there per job?
The ApplicationMaster requests resources from the Scheduler for each task within a job. There is one ApplicationMaster per job.
What is the ApplicationsManager?
The ApplicationsManager is responsible for accepting submitted jobs, creating an ApplicationMaster for each job, and maintaining fault tolerance of the ApplicationMasters.
What is CDH?
Cloudera's Distribution including Apache Hadoop (CDH) was a version of Hadoop that bundled cluster management tools together with the MapReduce tools provided by Hadoop
What is the Hive metastore?
The Hive metastore is a relational database, separate from the data itself, in which Hive stores the metadata for all of its tables: schemas, column types, partitions, and the HDFS locations of the underlying data. Hive consults the metastore to know how and where each table's data is stored.
What is the JDK and what does it let us do?
The Java Development Kit is a tool used by Java developers that allows for the compilation of Java code into .class files as well as the execution of these files. JDK contains the JRE.
What is the JRE and what does it let us do?
The Java Runtime Environment provides the JVM (Java Virtual Machine) and the core class libraries needed to run compiled .class files; the JVM loads the bytecode and executes it, JIT-compiling it to machine code as it runs. The JRE contains the JVM.
What does the Map part of MapReduce take in? What does it output?
The Map part of MapReduce takes in an input (key, value) pair and produces intermediate (key, value) pairs that are written to the local disk
Be able to explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]
The Mapper and Reducer classes referred to above are used by Hadoop to provide the logic used in the map and reduce tasks. In this particular case, the Mapper class is defined with a map function that takes a LongWritable (line number from an html document) and a string of Text (the text of the line) and then outputs a key, value pair of Text and IntWritable (individual words from the text followed by a 1). The reducer then takes these key, value pairs and reduces them to provide a final word count from the original document
What does the target folder contain in Scala?
The target folder is used for all files that are created as a result of building and running the program, such as the .class files created at compilation time
What does it mean that functions are first class citizens in Scala?
The term "first class citizen" refers to something that can be passed as a parameter to a function, returned from a function, or stored as a variable. Scala allows for functions to be used in all these ways
Why are there multiple APIs to work with Spark SQL?
The three types of APIs available in Spark SQL are DataFrames, DataSets, and SQL. DataFrames and DataSets both allow for the interaction of SQL queries on RDDs with DataFrames being untyped and DataSets making use of case classes for compile time type safety. Basic Spark SQL makes use of views, which are created from DataFrames and can be queried using SQL queries passed as strings. Each of these APIs has unique benefits and drawbacks, and choosing which to use will be based upon a user's familiarity and circumstances.
True or False: A messaging queue can be used to provide eventual consistency across disparate data stores
True
True or False: In Kafka, the subscribers pull messages/events from topics, rather than having new messages/events sent to them automatically
True
True or False: It's reasonable to have Kafka be the only streaming input to our Spark cluster
True
True or False: Kafka Stream Processing lets us transform and join streams inside of Kafka, instead of having an external application consume one topic and produce another
True
True or False: Kafka events can be received by any number of consumers and can be saved indefinitely
True
True or False: One of the advantages of Pub Sub is that, since messages go through channels, publishers and subscribers are decoupled
True
True or False: Since Kafka writes and reads are sequential writes/reads involving the log associated with a topic stored on the filesystem, Kafka can take advantage of OS optimizations for writing/reading files
True
What are some examples of filters?
Two of the most common filters used in MongoDB queries are the equal and exists filter. Equal is a filter that is given two arguments, a field name, and a corresponding value, and returns results in which the given field name matches the specified value, i.e. ("Name", "Bob"). Exists is a method that is given only one argument, such as the name of a field, and returns any documents that contain that field.
What's the difference in output between MapReduce wordcount in Hadoop and .map followed by .reduceByKey in Spark?
Ultimately, the difference between .map followed by .reduceByKey in Spark and a typical MapReduce is only that the .map followed by .reduceByKey is lazy by nature and will not produce an actual output until an action, such as .collect or .take. It is also important to note that the shuffle/sort phase occurs by default in a MapReduce, but in the Spark example above, it would be necessary to make a call to .sortBy.
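A minimal Spark word count sketch in Scala (assuming an existing SparkContext sc; the file name is illustrative):
    val counts = sc.textFile("words.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // lazy: nothing runs yet
    counts.sortBy(_._2, ascending = false).take(5)   // action triggers the job; sorting must be requested explicitly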
What is cardinality?
While multiplicity refers to the proportionality relationship between objects within a database, cardinality refers to the magnitude of a given set of objects, i.e. how large is N in a 1-to-N relationship.
How many NameNodes exist on a cluster?
While there is only one active NameNode running on the cluster at a given time, there can be up to two additional NameNodes that help maintain fault tolerance should the NameNode fail. The Secondary NameNode periodically merges the NameNode's edit log into its fsimage (checkpointing), limiting metadata loss, but it cannot take over. The Standby NameNode, used in a high-availability configuration, keeps an up-to-date copy of the NameNode's metadata by reading its edit log, so that should the active NameNode fail, it is prepared to step in and take over.
Explain the deficiency in using Hive for interactive analysis on datasets. How does Spark alleviate this problem?
With each query made in Hive, we must run a full MapReduce. Spark allows for the creation of RDDs which can be persisted in memory. This allows for us to run interactive analysis on RDDs at any point in our MapReduce-like functions and rerun further functions upon the data as fits our needs.
How is a partitioned parquet file stored in the filesystem?
Without the coalesce method, each partition will be a file in its own folder, with each folder named based on the partition contained within.
What methods get generated when we declare a case class?
equals, hashCode, toString, copy, apply and unapply
Describe the flow of events in Kafka using proper terminology
events are sent to topics by producers and are read from topics by consumers. The machines in the cluster are called brokers
What two commands are used in sequence to save changes to our local repo?
git add and git commit (alternatively git commit -a can be used to perform both steps at once)
What command moves changes from GitHub to our local repo?
git pull
What command moves changes from our local repo to GitHub?
git push
Some example CRUD methods for Mongo? (The Scala methods mostly match mongo shell syntax)
insertOne, insertMany, updateOne, updateMany, deleteOne, deleteMany
What are the 3 Vs of big data?
· Volume - The large amount of data that must be processed. · Velocity - The speed at which new data is being created. · Variety - The variance in types of data that must be accounted for in big data storage and processing.
What does the src folder contain in a sbt project?
The src folder is used to contain all the source code used in the project, i.e. the .scala files
What are some functions available to us when using DataFrames?
.select, .filter, .groupBy, .agg, .drop, .write
Where is the default location of Hive's data in HDFS?
/user/hive/warehouse
What does BASE stand for?
Basically Available, Soft State, Eventual Consistency
Abstraction
Abstraction is the concept that only the data and methods necessary to use an object are "visible" to the end user. In other words, it is not necessary to understand data structure or methods that an object/class has or how they work in full detail because abstraction allows the use of methods and data structures without needing to look "under the hood"
What is an accumulator?
Accumulators are mutable, global variables which can be passed along with closures to worker nodes. These worker nodes will then alter their accumulator based on the closure passed to them and return this accumulator to the driver program which can then aggregate the accumulator totals from all worker nodes to create one final accumulator, i.e. adding the elements of an RDD that are distributed across the cluster.
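A minimal sketch (assuming an existing SparkContext sc; names are illustrative):
    val errorCount = sc.longAccumulator("errors")
    sc.textFile("log.txt").foreach { line =>
      if (line.contains("ERROR")) errorCount.add(1)   // each executor updates its own copy
    }
    println(errorCount.value)                         // the driver sees the merged total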
What's the difference between aggregate and scalar functions?
Aggregate functions perform some calculation on multiple values within a column collectively, such as an average. Scalar functions perform calculations on individual values within a column, such as adding 10 to each value in a column.
What is an enumeration?
An enumeration is a case object that extends a sealed trait to provide a list of choices with which the trait can be defined. Using enumerations solves the issue present with string flags in that other developers working on the code cannot use alternative names or spellings when referring to a potential value for a variable, but must instead choose from the list of defined enumerations, i.e. pizza example
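A short sketch in the spirit of the pizza example referenced above (names are illustrative):
    sealed trait CrustType
    case object Thin    extends CrustType
    case object Thick   extends CrustType
    case object Stuffed extends CrustType
    case class Pizza(crust: CrustType)   // only the defined choices can be supplied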
What's the difference between an expression and a statement?
An expression is used to return a value, whereas a statement is used for its side effects, i.e. what it does rather than what it returns.
What is an External table?
An external table is a type of Hive table in which the underlying data is not controlled by Hive, but rather is stored elsewhere within the HDFS. This allows for the protection of data, i.e. if a table containing the data is dropped, the data itself will not be affected. However, it also means Hive is unable to guarantee the data, i.e. the data may not be normalized.
Is if-else an expression or a statement in Scala?
An if-else can be either a statement or an expression based on context. An if-else is a statement in that it evaluates some form of logic, however it can be used as an expression to return a value. In Scala, we consider if-else to be primarily an expression since it returns some value rather than executes a code block.
When is an object on the Heap eligible for garbage collection?
An object in the Heap is eligible for garbage collection when no remaining functions in the Stack contain a reference to the object.
How do we write an Option that contains a value?
An option that contains a value will be Some(value)
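A short Scala sketch (names are illustrative):
    val found: Option[Int] = Some(30)   // a value is wrapped in Some
    val missing: Option[Int] = None     // an empty Option is None
    found.getOrElse(0)                  // 30; missing.getOrElse(0) would give 0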
What does the apply method do?
Apply allows for the instantiation of an object without the use of the new keyword
How is the data of Kafka events structured?
As immutable key-value pairs
Why might we want mutable state (OOP) in our application? Why might we want immutable state (FP)?
As mentioned in the previous answer, OOP provides a lot of functionality in circumstances where changes to state are valuable and need to be tracked over time. FP, on the other hand, is valuable when a function is not concerned with such state, because it is not susceptible to the anomalous behavior that mutable state can introduce.
If Kafka attempts to provide a message to a consumer, but doesn't try again in the case of potential failure, Kafka is providing what delivery guarantee?
At most once delivery
What does it mean that an operation or transaction on a data store is atomic?
Atomicity means that an operation on a document will either fully succeed or not execute at all, e.g. an update operation will fully update a given document or not update it at all, but it will never partially update it.
What does ACID stand for?
Atomicity, Consistency, Isolation, Durability
What is BASH?
Bash stands for Bourne Again Shell and is one of the most common Unix shells used to interact with a computer or server from the command line.
What are some of the data types included in BSON?
BSON includes all of the standard data types available in JSON, however it also includes additional data types, such as datetime and byte array, and expands on JSON's single "number" type by providing types such as Double, Decimal, 32-bit Int and 64-bit Int.
What is BSON?
BSON is shorthand for Binary JSON and is the data interchange format used by MongoDB
What is a foreign key?
Foreign keys are used to create references between records in a SQL database.
Action
Function calls that act upon, and thus trigger the evaluation of, our RDDs, i.e. reduce, collect, take, and foreach.
What is function composition?
Function composition is the process of providing one function as the input to another function, i.e. f(g(x))
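A short Scala sketch (names are illustrative):
    val addOne: Int => Int = _ + 1
    val double: Int => Int = _ * 2
    val f = double compose addOne   // f(x) == double(addOne(x))
    f(3)                            // 8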
What is GFS?
GFS stands for Google File System and is a proprietary distributed file system developed by Google. It distributes files across a cluster using a master node as well as several chunk servers. The master is responsible for assigning and storing metadata related to file chunks, such as a 64-bit unique identifier and the mapping of chunk locations on the cluster, while the chunk servers are responsible for the storage of the chunks themselves. GFS replicates these chunks across the cluster to provide high availability.
What is garbage collection?
Garbage collection is the process through which objects that are no longer necessary to the functioning of the program are removed from the Heap, thus freeing up memory for new objects to be instantiated as well as improving application performance.
What is a generic?
Generics are used with collections by allowing the collection to take a type as a parameter. For example, the code "val myList = List[Int](1, 2, 3)" creates a val named myList that will only ever contain Ints. This helps with compile-time type safety: any function that uses the list as an argument cannot be given the wrong element type, which helps prevent runtime exceptions.
What is Git?
Git is version control software that allows a developer to easily keep track of changes made to their program and to restore earlier versions if necessary.
What is GitHub? How is it related to Git?
GitHub is a cloud-based repository for Git projects allowing multiple developers to collaborate on a project and see the changes others have made to the code without necessarily having their own source code changed. Changes made in Git on the local machine can be pushed to the GitHub repository allowing others to pull those changes to their own local repositories.
What is High Availability? How is it achieved in Mongo?
High availability means that the system will almost always be available, in other words, the system is prepared to maintain availability even in the event of a failure of some sort. This is achieved in Mongo by using replica sets
What is Hive?
Hive is data warehouse software built atop HDFS that provides a SQL-like language (HQL) for querying data stored within HDFS.
What is a Hive partition?
Hive partitions allow for the segmenting of data on a given column such that all queries of data based on that column will be faster, as it is only necessary to query the data contained in the given partition of data rather than querying the entire database.
How do we write the output of a query to HDFS?
INSERT OVERWRITE DIRECTORY 'file/path' *HQL QUERY*
If the storage level for a persist is MEMORY_ONLY and there isn't enough memory, what happens?
If the memory required to cache an RDD exceeds the memory available, as many partitions of the RDD as possible will be cached in memory, and those that do not fit will be recomputed as necessary in future operations.
If a producer provides events A, B, then C to a kafka topic, will the topic always preserve the order A,B,C
If there is only a single partition for that topic then the order will be preserved but it is not a guarantee across partitions
Why should we be careful about using accumulators outside of an action?
If we use an accumulator outside of an action, potential re-executions of that action, whether the result of a failed node, a speculative execution, or the rebuilding of a partition may result in the value of the action being added to the accumulator redundantly.
Why might we want to write an application using pure functions instead of impure functions?
Impure functions, i.e. functions that change the state of external objects and variables, can be useful in a number of contexts where a developer may want to track changes. However, they can also cause issues with respect to the execution of other functions that depend upon the state of those external objects and variables, which is where pure functions are valuable. Since pure functions have no side effects, a program using only pure functions is far less likely to run into bugs.
What does it mean that side effects should be atomic and idempotent?
In MapReduce, we want our side effects to be atomic so that if multiple workers run the same function, the side-effect output does not get produced partially or repeatedly. We also want them to be idempotent, meaning applying the function more than once produces the same result as applying it once, i.e. it only creates the given side effect the first time it is applied.
What is a collection?
In MongoDB, collections are used for storing documents of a given sort, which can then be imported by accessing the collection
What does SELECT do?
In SQL, a SELECT statement serves the same function as a projection in Mongo, i.e. it tells the database which fields from a given record to return.
What is the closure of a task? Can we use variables in a closure?
The closure of a task is the set of variables and methods that must be serialized and shipped to each executor for the task to run. We can read variables inside a closure, but each executor receives its own copy, so mutating an ordinary variable inside a closure has no effect on the driver's copy and it will appear as if the closure has done nothing (accumulators should be used for that instead).
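A short sketch of the pitfall (assuming an existing SparkContext sc):
    var counter = 0
    sc.parallelize(1 to 100).foreach(x => counter += x)   // each executor mutates only its own copy
    println(counter)                                      // still 0 on the driver; use an accumulator instead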
How are DataFrames and DataSets "unified" in Spark 2.0?
In Spark 2.0, DataFrames and Datasets were unified by defining a DataFrame as a Dataset of the generic Row type, i.e. Dataset[Row].
What is a reasonable number of partitions for Spark?
In Spark, it is good practice to have at least 2x as many partitions as executors in order to maximize parallelization of tasks. As a general rule, if the number of partitions causes individual tasks to finish in less than about 100ms, there are probably too many partitions, since the overhead of scheduling tasks begins to outweigh the work done per task.
Walk through a MapReduce job that counts words in the sentence: "the quick brown fox jumped over the lazy dog"
In a MapReduce job performed on the above sentence using two map tasks, the sentence would be split into two chunks and the master would assign one half of the sentence to each map task. Each map task would map every word in its chunk to an intermediate (key, value) pair of the form (word, 1). These intermediate (key, value) pairs would then be written to local disk on the corresponding worker machines. The reduce task would then group the pairs by key and sum the values to produce a word count. Since only one word, namely "the", appears more than once in our input, the reducer would only need to combine the two (key, value) pairs representing "the" into one final output; every other word's count would pass through as 1.
What is a filter in a MongoDB query?
In a MongoDB query, filters are used to help specify the relevant fields that should be used when attempting to find a document.
How does a broadcast join work in Spark?
In a broadcast join, the smaller RDD being joined is stored in a broadcast variable which is passed to all executors and joined based on key.
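A rough sketch of the idea (assuming an existing SparkContext sc and pair RDDs big and small, where small fits in memory; names are illustrative):
    val smallMap = sc.broadcast(small.collectAsMap())   // ship the small RDD's data to every executor
    val joined = big.flatMap { case (k, v) =>
      smallMap.value.get(k).map(w => (k, (v, w)))       // local lookup on each executor, no shuffle of `big`
    }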
What is a cross join / cartesian join?
In a cartesian join, all records from one table are joined with all records from another table, creating an output table that contains all possible combinations of records.
How does a shuffle hash join work in Spark?
In a shuffle join, the RDD being joined is shuffled across all partitions and joined based on key.
What does a bucketed table look like in HDFS?
In contrast to a partitioned table, which will appear as multiple directories in our Hive database, a bucketed table will appear as multiple files within a single directory within HDFS
How does MapReduce increase processing speed when some machines in the cluster are running slowly but don't fail?
In order to increase processing speed, when a MapReduce operation is close to completion, the master schedules back-up executions of the remaining in-progress tasks on other machines and marks each task as complete once either the primary or back-up execution finishes, freeing up both workers to be assigned new tasks. This is known as speculative execution.
Why can't we always use broadcast joins?
In order to perform a broadcast join, one of the RDDs being joined must be small enough to fit in RAM so it can be stored as a Broadcast Variable.
How do we make a Dataset queryable using SQL strings?
In order to query a DataSet with SQL strings, a temp view must be created via a call to .createOrReplaceTempView, i.e. ds.createOrReplaceTempView("view name")
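A minimal sketch (assuming an existing SparkSession spark and a Dataset ds with name and age fields; names are illustrative):
    ds.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age >= 18")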
How would the Logistic Regression example function differently if we left off the call to .cache() on the points parsed from file?
In the Logistic Regression example from the paper, the .cache() call keeps the parsed points in memory so they can be iterated over as the Logistic Regression runs. If the points were not cached, they would have to be re-read and re-parsed from the file on every iteration, drastically slowing down the program's overall execution.
What do the indentations mean in the printed lineage?
Indentations in our printed lineage represent the beginning of a new stage once our RDD is translated into a physical plan. Each indentation block represents one stage, with new stages occurring whenever a shuffle is required.
What is an Index?
Indexes are applied to certain fields within a document in order to improve the time it takes to run a query based on that field, as the query can check the index rather than needing to check every document to see if the relevant field matches.
What advantages does creating an Index give us? Disadvantages?
Indexing provides the advantage mentioned above, i.e. it speeds up query time. However, it also comes with the disadvantage that it increases the time it takes to create new documents, as every applicable index must be updated when a new document is inserted.
Inheritance
Inheritance allows for child objects to "inherit" data structures and methods from parent objects, allowing for reusability of code and thus preventing the need to rewrite code from parent classes in every subsequent child class
Why is Zookeeper is necessary for Kafka to run?
It helps handle the failure of Kafka brokers by informing consumers of changes in which broker they should connect to
What are JSON data types?
JSON has 6 data types: String, Number, Object, Array, Boolean, and Null
What is JSON?
JSON stands for JavaScript Object Notation and is a text format used for storing information about objects. JSON documents contain several fields and their corresponding values. These fields are formatted as "Field Name" : "Value"
What are some data formats we can query with Spark SQL?
JSON, CSV, Parquet, Hive Tables, and SQL Tables
How do we load data into a table?
LOAD DATA [LOCAL] INPATH 'file/path' INTO TABLE table_name
How can we write Lambdas in Scala?
Lambdas in Scala are written using the => operator between an input and the return value, i.e. (a,b) => a+b takes two arguments, a and b, and returns the sum.
How can we see the lineage of an RDD?
Lineage can be accessed via a .toDebugString call on an RDD. If we want this information printed to stdout, we can call println(rdd.toDebugString).
ls -al
Lists all files, including hidden files, in long format for a given directory
List
Lists are immutable, singly-linked, indexed collections that store multiple items of the same data type. Lists are valuable in FP because they are immutable, so there can be no change to their state and no unintended side effects
How do messaging queues differ from pub sub?
MQs expect messages to be resolved by consumers, and not every consumer will receive all messages in the queue
mkdir
Makes a new directory
What does .map do?
Map is a collections method that allows a function/method to be applied to each element within a collection individually.
Where do Map tasks read from? Write to?
Map tasks are assigned to workers based on data locality, i.e. the master will attempt to assign a map task to a worker that has a replica of its input data stored locally. If that is not possible, it will assign the task to an available worker as near to the input data as possible. Map tasks are written to the disk of the worker that performs them.
Why do we say that MapReduce has an acyclic data flow?
MapReduce is said to have an acyclic data flow due to the fact that, once a MapReduce is initiated, data flows through all the steps of the MapReduce (Mapper, Combiner, Partition, Sort & Shuffle, Reduce) without any opportunity to interrupt and re-run portions of the MapReduce architecture. For example, we cannot run our mapper and combiner, then rerun our data through our mapper and combiner again, then send it through the remaining step. All steps must occur in order.
Map
Maps are collections that contain key, values pairs in which each key points to exactly one value, allowing for values to be extracted using their keys. Maps can be either mutable or immutable. Maps are hash tables.
How do methods executing on the stack interact with objects on the heap?
Methods on the Stack interact with objects in the Heap via references that "point" the method to the object. These references tell the methods what objects they are meant to be manipulating or otherwise iterating with.
In what situations might we want to use multiple threads? (just some)
Multithreading allows different parts of a program to be executed simultaneously, i.e. parallel processing. This is useful when dealing with large amounts of data processing, provided the results of one thread's execution do not depend on the results of another thread's execution.
What is multiplicity?
Multiplicity is a term used to refer to the number of connections between different tables/collections in a database. For example, if a data set has a 1-to-N relationship, it means each single object of a certain type is related to multiple objects of a different type.
accumulator
Mutable, global variable to be used by each node, then passed to the program driver to be aggregated in a Spark job.
N-to-N
N-to-N: Car, Auto Parts; Each car is associated with several auto parts, and each part is associated with several different cars
Can we freely store any data we like in an RDBMS table the same way we store any data in a mongo collection?
No, RDBMSs follow a set of rules in the way they store data. This process is referred to as normalization.
Do SQL databases have embedded records like MongoDB has embedded documents?
No, SQL databases do not use embedded records in the same way Mongo databases embed documents. Instead, tables in SQL databases use references, storing references to records in other tables as foreign keys.
In Mongo, are operations on embedded documents atomic? What about operations on referenced documents?
Operations on embedded documents are atomic in Mongo because the operation occurs on only one document. Operations on references, however, are not atomic because multiple documents must be queried when using references, meaning that it is possible for the operations on one document to succeed but the operations on another to fail.
How do operators work in Scala? Where is "+" defined?
Operators are calls to object methods in Scala. the method for "+" is defined in both the Int class and the String class.
What is an Option?
Options can be used when searching for values in a collection that may or may not exist. If the value does not exist, the option will return None rather than throwing an exception. Otherwise the Option will return Some(value).
cat
Outputs the entire text of a document
What is a package manager? What package manager do we have on Ubuntu?
Package managers allow new packages to be installed, and existing packages to be updated, without the user having to worry about dependencies. Ubuntu uses Debian's APT (Advanced Package Tool).
How can Parquet's columnar storage efficiently encode the column values: "Kentucky, Kentucky, Kentucky, Kentucky, Virginia, Virginia, Virginia"?
Parquet has two methods that could be used to efficiently store this data, RLE and Dictionary Encoding. In this case, since all of the matching records appear consecutively, it would be most efficient to store the data with RLE, i.e. 4 | Kentucky | 3 | Virginia
What is Parquet?
Parquet is a file storage format that uses columnar storage to maximize the efficiency of compression techniques on the underlying data.
What's the benefit of partitioning?
Partitioning allows for faster data queries by allowing a given query to be done on only the data within a given partition.
How do Kafka topics achieve scalability?
Partitioning events in the topic by key and storing replicas of partitions across the cluster
What does it mean to perform a parallel operation on an RDD?
Performing parallel operations upon an RDD means the RDD is partitioned across multiple machines and operations are run simultaneously across these various nodes before the final product is returned to the program driver.
How do permissions work in Unix?
Permissions in Unix are broken up into three classes: owner, group, and others. The owner class refers to the permissions of the user who owns the file/directory, group to the group that owns the file/directory, and others to all remaining users who are neither the owner nor in the group. Each class can have read, write, and/or execute permissions, in any combination.
What are Persistence Storage Levels in Spark?
Persistence storage levels in spark refer to the way in which data is stored, i.e. MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY.
Polymorphism
Polymorphism allows for the code that is inherited to behave differently in different contexts. Two child classes may both inherit the same method from their parent class, however, polymorphism allows the method to function differently in the context of each child class if necessary
pwd
Prints working directory
How does the Scala REPL store the results of evaluated expressions?
Scala stores the results from expressions evaluated in REPL under the variables res# with # starting at 0 and ascending in order.
Why do we make use of programming paradigms like Functional Programming and Object-Oriented Programming?
Programming paradigms are used to allow for more consistent structuring of code and thus facilitate collaborative projects that involve multiple developers.
What is a projection in a MongoDB query?
Projection is a query modification that allows for the return of only certain fields within a document. It can be achieved by passing the field name followed by : and then 0 to not return the field or 1 to return the field, i.e. db.mycol.find({},{"title":1,_id:0}) will display the "title" field, but not the _id
history
Provides a list of previously entered commands
less
Opens a file in a pager, showing one screenful at a time; the user can scroll forward and backward to read through the document a few lines at a time
man
Provides the manual pages for a given command
What is the lineage of an RDD?
RDDs are based upon the concept of lineage. This means that each RDD stores information that points to the data from which it was derived and the transformation process that was used to create the RDD. This provides a level of fault tolerance for our RDDs, as an RDD is able to rebuild a partition stored on a failed node based on the lineage data it contains.
When we cache an RDD, can we use it across Tasks? Stages? Jobs?
Yes. A cached RDD's partitions remain in executor memory for the lifetime of the application, so they can be reused by later tasks, stages, and jobs within that application (they are not shared across separate applications). Even uncached RDDs may effectively be reused across jobs when their shuffle output has been written to local disk.
How do we specify we're reading data into a table from a csv file in HQL?
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
What is a REPL?
Read, Eval, Print, Loop (REPL) is a command line interpreter that allows code to be run line by line directly in the shell. This feature can be useful for debugging and testing small chunks of code quickly.
What is recursion?
Recursion occurs when a function calls itself as part of its execution.
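A small illustrative example in Scala, computing a factorial recursively:
    def factorial(n: Int): Long =
      if (n <= 1) 1L else n * factorial(n - 1)   // the function calls itself until the base case
    factorial(5)   // 120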
What does reduce do?
Reduce takes two elements from a collection and applies a function to them. It then takes the return value from the function and uses it with another value from the collection to perform the function again. This process continues until all elements in the collection have been used as input.
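For illustration, summing a small Scala List with reduce:
    val nums = List(1, 2, 3, 4)
    nums.reduce((a, b) => a + b)   // ((1 + 2) + 3) + 4 = 10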
Where do Reduce tasks read from? Write to?
Reduce tasks read, over the network, from the local disks of the workers that completed the Map tasks. They write their results to output files.
What are some actions available on an RDD?
Reduce, collect, take, foreach.
What does RDBMS stand for?
Relational Database Management System
Explain replica sets
Replica sets in Mongo are used for maintaining high availability. When our database is running, we maintain multiple copies of the data, in which one member serves as the primary and the remaining serve as secondaries should an issue arise with the primary. Typically, a replica set has at least 3 members running in a distributed manner
What does RDD stand for?
Resilient Distributed Dataset
RDD
Resilient Distributed Dataset.
How does a Future return a value or exception?
Return values are wrapped in a Success and exceptions are wrapped in a Failure
What is RLE?
Run-length Encoding (RLE) is a compression method used to store data efficiently in Parquet. It makes use of repeating values in a column by storing the value and the number of times it appears consecutively, rather than storing the value multiple times.
S3?
S3 (Simple Storage Service) is AWS's object storage service and allows for storing files of a variety of types, including text files, CSVs, JARs, and more. S3 also allows for easy integration between your S3 storage and other services available on AWS, making it ideal for storing files that serve as input for jobs running on EC2
How do we group records and find the count in each group?
SELECT column, COUNT(*) FROM table_name GROUP BY column
How do we query data in a table?
SELECT columns FROM table_name (FILTERS)
What is a SQL dialect?
SQL dialect refers to the particulars of a given SQL implementation. The most common SQL dialect is "core SQL"/"ANSI SQL" which is implemented in a nearly identical fashion across all RDBMSs. However, a database may then require some additional functionality that can be provided through the implementation of its own SQL dialect.
What is ssh?
SSH stands for Secure Shell protocol and is an encrypted communication protocol that allows for the transmission of files as well as the creation of remote connections over a network. SSH uses port 22 and is a secure alternative to the insecure telnet protocol. It can also be used in place of FTP.
Is Scala statically or dynamically typed?
Scala is statically typed, i.e. the data type is known at compilation
How does Scala relate to Java?
Scala relates to Java in several ways, including the use of the JRE and JVM as well as the use of many Java libraries. It also uses many of the same data types as Java.
What does it mean to have data skew and why does this matter when bucketing?
Since buckets are calculated using a hashCode mod method like that of MapReduce partitioning, it is possible for data skew to lead to unevenly distributed, and thus misleading buckets. If the skew within our original data leads to a disproportionate amount of our data being inserted into a given bucket while the other buckets receive far less data, we may create buckets that misrepresent our data. When bucketing, the goal should be to create roughly evenly filled buckets of randomized groups of our data. This is best accomplished by bucketing on columns of high cardinality with relatively little importance to our analysis.
Are the workers in GFS fault tolerant? Why or why not?
Since there are many workers operating simultaneously, it is relatively likely that several workers will fail. For this reason, workers are fault tolerant: if a worker fails during an operation, the operation will simply be re-executed by another worker when one becomes available.
Some levels have _SER, what does this mean?
Some levels of memory allow for the data to be stored in a serialized format, which will reduce the amount of memory space used to store the data, but will result in additional computing overhead in both the serialization and deserialization processes.
What do sort and limit do in Mongo?
Sort and limit are query modifications: limit restricts the number of documents returned by a given query, and sort orders them based on some argument, such as descending order of _id
What is Spark SQL?
Spark SQL is a Spark abstraction that allows the querying and manipulation of RDDs, stored as DataFrames/DataSets, using basic SQL syntax
How does Spark SQL relate to the Spark applications we've been writing, using RDDs?
Spark SQL is built on top of Spark and abstracts direct manipulation of RDDs into SQL queries, allowing those who may be unfamiliar with RDD manipulation, but are comfortable with SQL to work with large datasets via Spark.
What's the difference between cluster mode and client mode on YARN?
Spark can operate in two different modes when using YARN as a cluster manager. First, it can operate in cluster mode, which means that the Spark driver program runs within the ApplicationMaster, and output from the driver program will appear inside the container spun up for the ApplicationMaster. In contrast, when Spark runs in client mode, the driver program runs separately from YARN and simply communicates with the ApplicationMaster to ensure that resources are allocated appropriately. In this case, the driver program will typically run on the machine from which the job was submitted (often the cluster's master node), and output from the driver program will appear on that same machine.
Why does Spark need special tools for shared variables, instead of just declaring, for instance, var counter=0?
Spark uses Scala closures, which are functions that depend on variables defined outside their own scope. Because these closures are serialized and sent to worker nodes, each executor receives its own copy of any variable captured in the closure, and updates made on the workers are never propagated back to the driver program. This means a plain var counter updated inside a closure would still read 0 on the driver; Spark instead provides shared-variable tools such as accumulators and broadcast variables.
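A minimal sketch of such a tool, assuming an existing SparkContext named sc:
    val counter = sc.longAccumulator("counter")            // accumulator: a shared, write-only counter
    sc.parallelize(1 to 100).foreach(_ => counter.add(1L)) // updates on workers are merged back
    println(counter.value)                                  // 100 on the driver; a plain var would still be 0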
How does Spark SQL evaluate a SQL query?
Spark uses a catalyst optimizer which transforms a Spark SQL query/DataFrame to create a logical plan, which is then optimized and translated into a number of potential physical plans, one of which is selected for final evaluation based upon a cost model.
What other contexts are superseded by SparkSession?
SparkContext, SqlContext and HiveContext are all superseded by SparkSession.
What is the SparkSession?
SparkSession is a unified entry point for Spark SQL that combines SparkContext, SqlContext, HiveContext.
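A typical way to obtain one (a sketch; the app name and master setting are placeholders):
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder()
      .appName("my-app")
      .master("local[*]")        // placeholder; normally supplied by spark-submit
      .getOrCreate()
    val sc = spark.sparkContext  // the underlying SparkContext remains accessible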
What does it mean to run an EMR Step Execution?
Step execution allows for the spinning up of an EMR cluster, the execution of a number of MapReduce jobs as a series of steps, and finally the termination of the cluster. Using this method provides an easy-to-understand workflow when running jobs on EMR and can help prevent accidentally leaving a cluster running after it is no longer in use, thus preventing unnecessary/unintentional expenses.
What are some benefits of storing your data in partitions?
Storing data in partitions allows for faster queries based on those partitions.
How does String interpolation work in Scala?
String interpolation is used with the syntax s"Some text ${variable or expression}", allowing for a cleaner way to write Strings and include variables in them.
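For example:
    val name = "Scala"
    println(s"Hello, $name! 2 + 2 = ${2 + 2}")   // prints: Hello, Scala! 2 + 2 = 4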
How do we provide structure to the data contained in a DataSet?
Structure is provided to a DataSet by supplying a pre-defined case class as its type.
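A sketch, assuming a SparkSession named spark and a hypothetical people.json file:
    case class Person(name: String, age: Long)
    import spark.implicits._
    val people = spark.read.json("people.json").as[Person]   // Dataset[Person]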
What does SQL stand for?
Structured Query Language
What are some examples of structured data? Unstructured data? Semi-structured?
Structured data refers to data whose size and data types are always consistent, such as in an RDBMS. Unstructured data refers to things like raw text or image files, which can vary in size from one input to another and can have a wide range of data types. In between the two is semi-structured data, which refers to data such as csv files that have some degree of structure, but not to the same degree as a database.
What is the job of the NameNode? What about the DataNode?
The NameNode serves as the master node for the HDFS, while each worker in the cluster is given its own DataNode. The DataNode is responsible for the storage of data, as well as reporting back the health of the node to the NameNode. The NameNode is responsible for keeping the filesystem image as well as recording edits to the file system in a log. It is also responsible for making sure that proper data replication is preserved based on the information it receives from DataNodes
What does the NodeManager do?
The NodeManager is responsible for managing the computing resources ("container") it has been assigned by the ResourceManager
What is the significance of the Object class in Scala?
The Object class (AnyRef in Scala) is the superclass of all reference types and thus all objects in Scala. This means that all objects inherit certain members from it, such as the .toString method
What does the PATH environment variable do?
The PATH variable tells Bash which directories to search to find the binary files for commands in the file structure. This allows a command to be run from the command line simply by typing its name from any location.
Which of the I/O step in a MapReduce would you expect to take the longest?
The Reduce I/O takes longer than the Map I/O because it requires data to be transmitted over the network rather than read/written from/to the local disk (or near the local disk). However, a given MapReduce job will have far more Map tasks performed than Reduce tasks, so it is beneficial to put the larger I/O burden on Reduce.
What does the Reduce part of MapReduce take in? What does it output?
The Reduce part of MapReduce reads the intermediate (key, value) pairs and performs a reduce function that outputs a final (key, value) pair, which is written to an output file
What is sbt (high level)?
The Scala build tool (sbt) is the primary build tool for Scala projects and offers several features, including incremental compilation, allowing for faster and less resource-intensive compilation.
Which responsibilities does the Scheduler have?
The Scheduler is responsible for allocating resources based on requests from a given job's ApplicationMaster.
How does a Secondary NameNode make the NameNode fault tolerant?
The Secondary NameNode periodically checkpoints the NameNode's metadata (merging the edit log into the filesystem image) and keeps a copy of it, preventing catastrophic data loss if the NameNode should fail.
What is the Spark History Server?
The Spark History Server is a service provided by EMR when a Spark job is executed. It breaks the Spark job down into a number of clear and concise visualizations, such as a timeline and a Directed Acyclic Graph, that show the execution of stages and tasks in a given Spark job as well as the time and resources used by each. This allows for a better understanding of the job's execution and makes it easier to debug and tune the underlying code.
What is the Stack? What is stored on the Stack?
The Stack is the term used for the space in memory in which the execution of methods and functions from the program takes place. The Stack functions as LIFO (Last In, First Out): new method calls are placed at the top of the stack and executed first, and the most recently called method must finish before the methods beneath it can continue.
What purpose does a StandBy NameNode serve?
The StandBy NameNode can step in and take over as NameNode should the active NameNode fail.
What is the _id field for in mongo?
The _id field uniquely identifies each document in MongoDB (by default it holds an ObjectId). It is automatically generated and indexed for every document, allowing it to be used in database queries
What is the difference between inner, outer left, outer right, and outer full joins?
The four major types of joins are inner, outer left, outer right, and outer full. An inner join will only return the rows where the join condition is true for both tables. Outer left and outer right joins return the rows in which the join condition is true, as well as the remaining unmatched rows from either the left table (left outer) or the right table (right outer). Full outer joins include the records where the join condition is true, as well as all additional records from both tables.
What is the catalyst optimizer?
The catalyst optimizer is used in Spark SQL to translate the initial query into a logical plan, and finally a workable physical plan that is then evaluated.
How does the chmod command change file permissions?
The chmod command can be used to explicitly assign privileges to the owner (user), group, and others. This can be accomplished either using octal number format, e.g. 777 for all privileges to all three, or through symbolic format, e.g. u+rwx, g+rwx, o+rwx
When does the combine phase run, and where does each combine task run?
The combine phase runs on the local machine after the worker has completed its map tasks. This allows for some reduction to take place before the data is sent over the network to the final reducers, thus limiting the use of network resources and speeding up the execution of the job.
What is the optional Combiner step?
The combiner function is useful in that it acts as an intermediary reduce to partially merge highly repetitive intermediate keys so the final Reduce functions do not need to read as much data over the network, thus improving performance.
Why is the Combiner step useful in the case of a wordcount?
The combiner function is useful in the context of a large wordcount application because it is likely that the initial Map function will produce a highly repetitive set of intermediate keys, thus the combiner can help ease the load of network traffic that would otherwise be produced during the Reduce function.
How are replications typically distributed across the cluster? What is *rack awareness*?
The data is typically stored with two replicas on one rack, while the third is stored on a different rack. This rack awareness is important because it allows for the proper balancing of considerations regarding network resources and fault tolerance
What is the storage level for .cache()?
The default storage level for .cache() is MEMORY_ONLY
What does find() do?
The find method is used to query a collection in MongoDB and return some result. It is typically used with some sort of filter to help narrow and specify the query.
What is the Heap? What is stored on the Heap?
The heap is conceptually similar to the Stack; however, instead of storing function and method calls, it stores the objects which have been instantiated and upon which methods and functions operate.
What is the lineage of an RDD?
The lineage of an RDD is the information an RDD stores about its construction that allows RDDs to maintain a level of fault tolerance. The lineage is made up of two parts, information about the data the RDD needs to read from during construction, as well as the transformations it must perform upon that data. This lineage forms the logical plan for the execution of an RDD.
In Spark, what is the logical plan? The physical plan?
The logical plan for our RDD is represented by the lineage of the RDD, and corresponds to the data that will be used as well as the transformations that will occur on that data. The physical plan is the translation of that logical plan into actual processes that occurs when we call an action upon our RDD, thus causing it to be evaluated. The physical plan consists of stages, which consist of tasks.
What are some disadvantages of handling document relationships by embedding documents?
The major disadvantages of using embedded documents arise in cases in which the cardinality of N is very large: embedding may slow down interaction with the larger document, and any change to an embedded document must then be applied to every other document in which it is embedded
How many records would be in the output of a cross join/cartesian join of two tables with 10 records each?
The number of records produced by a cartesian join is A x B, where A is the number of records in table A and B is the number of records in table B. For this example, 100 records will be in the output table.
How does the onComplete callback function work?
The onComplete callback is registered on a Future and is invoked with the Future's result (a Success or Failure) once it is done executing.
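A minimal sketch of registering the callback:
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Success, Failure}
    val f = Future { 21 * 2 }
    f.onComplete {
      case Success(value) => println(s"Got $value")    // return value wrapped in Success
      case Failure(e)     => println(s"Failed: $e")    // exception wrapped in Failure
    }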
What rules does Mongo enforce about the structure of documents inside a collection?
The only rule Mongo explicitly enforces about the structure of documents inside of a collection is that each document must have an _id field. All other rules are defined in the implementation of the database.
Know the input and output of the shuffle + sort phase.
The shuffle + sort phase takes in the output from map tasks, partitions the output by keys and sorts the partitions so all values for a given key are passed to the reducer together.
What needs to be true about the types contained in the Mapper and Reducer generics?
The types of the above generics must be written to match the input the function will receive and the types of the (key, value) pairs it will output. It is therefore important that the Mapper class has the right generic types defined based on the data it will be provided and that the (key, value) pair output of the Mapper matches the (key, value) pair input of the Reducer
What are some benefits to using the cloud?
There are a number of benefits to using the cloud, including elasticity, high availability, and cost effectiveness. Since cloud resources are virtualized, they are highly elastic, meaning that the resources needed for a given job can be expanded as demand increases and contracted when they are no longer necessary. This elasticity leads to both scalability, as the system expands to meet higher levels of demand, and high availability, as the service is guaranteed to be available at almost all times. Also, the ability to transfer the burden of physical resource storage and maintenance to the cloud provider means it is cheaper and more feasible to run large cluster computing operations.
What are some advantages of handling document relationships by embedding documents?
There are several advantages to handling relationships by embedding documents, including ease and speed of accessing embedded items, easy to understand relationships, and logical organization
hdfs dfs -get /user/adam/myfile ~
This command will copy the file "/user/adam/myfile" from the distributed filesystem to the user's home directory on the local machine.
hdfs dfs -put ~/coolfile /user/adam/
This command will copy the file "coolfile" from the user's home directory on the local machine to the folder /user/adam on the distributed filesystem.
What does a custom partitioning function change about how your MapReduce job runs?
Though MapReduce comes with a default partitioning function, users can specify a custom partitioning function to partition the data used by the reduce function in a manner that will produce the desired output. For example, if a user wants all data of a certain type to end up in the same output file, such as all URL keys for a single host, a custom partitioner can route those keys to the same reducer and thus the same output file
What is Throwable?
Throwable is the superclass of both Exception and Error.
How might we scale a HDFS cluster past a few thousand machines?
To scale an HDFS cluster past a few thousand machines, it becomes necessary to create a federated cluster in which multiple clusters, each with its own NameNode, are connected via the network.
How do we see the logical and physical plans produced to evaluate a DataSet?
To see the logical and physical plans of a Dataset, we make a call to .explain(true). The true Boolean tells Spark to provide an extended explanation which includes the logical plans in addition to the physical plan; otherwise, only the physical plan will be shown.
How do we specify dependencies using sbt?
To specify dependencies in sbt, we need to edit the build.sbt file of our scala project. In the build.sbt file, we can add the line "libraryDependencies +=" followed by the dependency we wish to use. (Typically, these appear as a set of strings separated by %% and %).
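For example, a build.sbt line adding Spark Core (the version shown is only illustrative):
    libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0"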
What is/was Unix? Why is Ubuntu a Unix-like operating system?
Unix was an open OS developed at Bell Labs in the 1970s that quickly became popular among tech-minded individuals due to its open nature. However, in the 1980s, Bell attempted to revise the Unix business model to make it a closed-source OS, leading to a collapse of the Unix market. Several attempts were made to create Unix-like OSs that would be like Unix but would provide an open-source model the same way the original Unix did. The two main branches spawned by this open-source movement were the Berkeley Software Distribution and GNU/Linux (GNU tools/apps running on a Linux kernel). Ubuntu is considered a Unix-like operating system because it is a Linux distribution (specifically a derivative of Debian).
cp
Used to copy a file or directory to another location in the file structure
mv
Used to move a file or directory to another location in the file structure
rm
Used to remove a file, or recursively (-r option) to remove a directory
What are users, what are groups?
A user is simply any given account on a Unix-based system. Users are assigned to groups, which are then given privileges based on group membership in a role-based access control model.
If I join two datasets with 10 records each, what is the maximum possible number of records in the output excluding cartesian joins?
Using an outer full join, if no records match in either table, the maximum number of possible output records is 20, i.e. all records from both tables.
What is a VM?
VM stands for virtual machine and is one of the main methods of virtualizing an operating system. In the case of a type II hypervisor, such as VirtualBox or VMWare, the virtualized OS runs within a "sandboxed" environment on top of an underlying OS. The virtual OS is known as the "guest" OS, while the OS of the system hosting the VM is known as the "host".
Vector
Vectors are indexed, immutable data sequences that are similar to lists, but allow for random access and thus consistent querying speed across the list.
What does WHERE do?
WHERE functions like filter in an RDBMS/SQL-database by providing some filtering parameter for the records the query wants returned, i.e. SELECT * FROM movies WHERE year > 1999 will return all records from the movie database in which the year is greater than 1999
How do we filter the records from a query?
WHERE (or HAVING when filtering grouped results); GROUP BY and LIMIT can further shape what is returned
How can we do something with the result of a Future?
We can use the results of a future by calling on them at a later point in the program. Multiple methods exist for retrieving a Future's result, including Await.result/Await.ready, which block until the Future completes, and the .onComplete callback (Thread.sleep is sometimes used in simple examples just to keep the main thread alive long enough)
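A sketch of blocking for the result with Await (the computation is a stand-in):
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global
    val f = Future { Thread.sleep(500); 42 }   // stand-in for a long-running task
    val result = Await.result(f, 10.seconds)   // blocks until the Future completes or times out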
How do we interact with the distributed filesystem in Hadoop?
We interact with the distributed filesystem from the command line using the hdfs command and the appropriate option flags.
What was the "Hadoop Explosion"?
When Hadoop initially became popular, it provided an open source framework for tackling MapReduce jobs. For this reason, the rise of Hadoop led to the creation of many new tools to allow for more specialized use of MapReduce in varying contexts.
What does it mean that a function returns Unit in Scala?
When a function returns Unit, it is like void in Java, meaning the function does not return a meaningful value.
What does it mean to "spill to disk" when executing spark tasks?
When executing a Spark task, it is possible for the amount of data on which transformations are being performed to exceed the RAM available to the executor. In this case, the excess data will be spilled to disk and read back when necessary.
RDDs are lazy and ephemeral. What does this mean?
When we say that RDDs are lazy and ephemeral, we refer to two different traits of our RDDs. 1) They are lazy in that they contain pointers to the data from which they are created, but do not actually read that data or perform the necessary transformations upon it until an action is performed upon them, forcing them to instantiate. 2) RDDs are ephemeral because they are discarded from memory after they are no longer in use unless otherwise cached or persisted through the use of commands.
What does it mean when we say an RDD is a collection of objects partitioned across a set of machines?
When we say that an RDD is a collection of objects partitioned across a cluster, we refer to the fact that a given RDD is divided into read-only objects when it is instantiated, i.e. it cannot be altered upon instantiation, and that these objects are spread across the cluster such that the failure of a given worker node will lead to the loss of only a small piece of the RDD, which can then be quickly and effectively rebuilt due to RDDs' lineage-based model.
How can we partition files we write using dataframes in Spark?
When writing files using dataframes, we can use the following syntax to partition the file: df.write.partitionBy("partitionKey1", "partitionKey2", ...)
Some persistence levels have _2, what does this mean?
When a persistence level ends in _2, it means the partitions will be replicated and stored on two separate nodes in the cluster
How would we write code that handles a thrown Exception?
Whenever a code block may cause an Exception, it is best practice to nest the code within a try-catch statement to ensure proper exception handling. This can be done by placing the code within a try block, and adding a catch block for various exceptions that may occur from improper input to the code, thus producing some sort of output that deals with the exception and allows the program to continue functioning.
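A small self-contained example in Scala:
    try {
      val n = "not a number".toInt   // throws NumberFormatException
      println(n)
    } catch {
      case e: NumberFormatException => println(s"Bad input: ${e.getMessage}")   // handle and keep running
    }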
What is a port number?
Whenever a computer/server attempts to communicate with another computer/server over a network, the connection must be sent to both the correct IP address and the correct port number on that address. A given computer has 65,535 ports that are each capable of running a service. Typically, ports 1-1023 are reserved for common communication protocols.
Why can Spark skip entire stages if they are the same between jobs?
Whenever a stage ends, typically as the result of a shuffle, data is repartitioned and written to disk. This allows for further stages executed upon this data to skip the initial stage as they can read the post stage RDD from disk rather than needing to reconstruct it.
What is a join condition?
Whenever two tables are joined, it is necessary to provide a join condition, which provides the context for the join. In SQL, this is done with an expression that states which columns should be joined upon, such as table1.column1=table2.column1
When, during a Spark Job, do we need to pay attention to the number of partitions and adjust if necessary?
Whenever we first create an RDD by reading in some form of input data, it is good practice to ensure proper partitioning before we perform the initial stage of transformations. It is also important to pay attention to the number of partitions after a shuffle stage when our data set is large, to ensure we are not reducing our number of partitions in a way that is detrimental to the parallel execution of our tasks.
What does CAP mean for our distributed data stores when they have network problems?
Whenever we have network problems, it is necessary for our database to have partition tolerance. Since CAP theorem states we can only achieve 2 of the 3 CAP goals at one time, the necessity for partition tolerance when there is a network problem means that a distributed data store experiencing a network problem must choose between maintaining consistency, or availability.
What does it mean that parquet is columnar storage?
Whereas file storage formats such as csv and json use a row-based storage format, in which all the columns of a given record/row are stored together, parquet uses columnar storage, meaning all values for every record in a given column are stored in sequence in memory. This allows for a number of compression techniques, like dictionary encoding and run-length encoding that makes storage of large datasets more efficient.
Can we access the SparkContext via a SparkSession?
Yes, SparkContext can be accessed from within a SparkSession.
Are DataSets lazily evaluated, like RDDs?
Yes, it is necessary to call an action, such as .show to evaluate a DataSet.
BigDecimal
any precision
BigInt
any size number
Pub Sub is a design pattern for messaging that is best described as...
publishers publish messages to channels, subscribers receive messages from those channels
We read/write streaming DataSets with what methods
readStream, writeStream
What is the return type of spark.sql("SELECT * FROM mytable") ?
spark.sql returns a DataFrame
What is the difference between val and var?
val is an immutable variable while var is a mutable variable.
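For example:
    val x = 1
    // x = 2   // compile error: reassignment to val
    var y = 1
    y = 2      // allowed: a var can be reassigned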