Big Data, Weeks 1-5

What does the apply method do?

Apply allows for the instantiation of an object without the use of the new keyword
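
For illustration, a minimal Scala sketch (names illustrative):

  class Point(val x: Int, val y: Int)
  object Point {
    // apply lets callers write Point(1, 2) instead of new Point(1, 2)
    def apply(x: Int, y: Int): Point = new Point(x, y)
  }
  val p = Point(1, 2)   // the compiler expands this to Point.apply(1, 2)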

How many replications of data does GFS provide by default? Block size?

By default, GFS keeps 3 replicas of every chunk; chunks are 64MB in size

What is the default number of replications for each block?

By default, blocks are given three total replications

What's the name for the code written in .class files?

Bytecode

pwd

Prints working directory

What are Counters for?

Counters can be implemented to count the number of occurrences for a given value. The counter is initialized to 0 and incremented with each relevant record reduced. Ex: Counting the number of total words processed. (MapReduce automatically generates counters for the number of (key, value) pairs input and output.)

touch

Creates a new empty file

What are some transformations available on an RDD?

flatMap, map, filter, and reduceByKey. (Note that parallelize is not a transformation; it is a SparkContext method that creates an RDD from a collection.)

What is function composition?

Function composition is the process of providing one function as the input to another function, i.e. f(g(x))
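
In Scala, Function1 provides compose and andThen for this; a small sketch:

  val f = (x: Int) => x + 1
  val g = (x: Int) => x * 2
  val fg = f compose g    // fg(x) == f(g(x))
  fg(3)                   // f(g(3)) = f(6) = 7
  (f andThen g)(3)        // g(f(3)) = g(4) = 8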

broadcast variable

Immutable/read-only, global variable to be passed to all worker nodes in a Spark job.

What does SELECT do?

In SQL, a SELECT statement serves the same function as a projection in Mongo, i.e. it tells the database which fields from a given record to return.

What is a projection in a MongoDB query?

Projection is a query modification that allows for the return of only certain fields within a document. It can be achieved by passing the field name followed by : and then 0 to not return the field or 1 to return the field, i.e. db.mycol.find({},{"title":1,_id:0}) will display the "title" field, but not the _id

history

Provides a list of previously entered commands

less

Displays a file one screen at a time as output. Can be advanced to read through a document a few lines at a time

How do we specify we're reading data into a table from a csv file in HQL?

ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

Where do Reduce tasks read from? Write to?

Reduce tasks read intermediate data over the network from the local disks of the workers that completed the Map tasks. They write their results to output files.

What does SQL stand for?

Structured Query Language

What is an ApplicationMaster? How many of them are there per job?

The ApplicationMaster requests resources from the Scheduler for each task within a job. There is one ApplicationMaster per job.

Do SQL databases have embedded records like MongoDB has embedded documents?

No, SQL databases do not use embedded records the way Mongo databases embed documents. Instead, tables in SQL databases use references, storing references to rows in other tables as "foreign keys".

What does the NodeManager do?

The NodeManager is responsible for managing the computing resources ("container") it has been assigned by the ResourceManager

What is an Option?

Options can be used when searching for values in a collection that may or may not exist. If the value does not exist, the option will return None rather than throwing an exception. Otherwise the Option will return Some(value).
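
A minimal sketch using Map.get, which returns an Option:

  val capitals = Map("France" -> "Paris")
  capitals.get("France")                       // Some("Paris")
  capitals.get("Spain")                        // None -- no exception thrown
  capitals.get("Spain").getOrElse("unknown")   // "unknown"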

cat

Outputs the entire text of a document

What's the benefit of partitioning?

Partitioning allows for faster data queries by allowing a given query to be done on only the data within a given partition.

What does it mean to perform a parallel operation on an RDD?

Performing parallel operations upon an RDD means the RDD is partitioned across multiple machines and operations are run simultaneously across these various nodes before the final product is returned to the program driver.

Where is the default location of Hive's data in HDFS?

/user/hive/warehouse

What is a SQL dialect?

A variant of SQL specific to the database product in which it is used. Examples: PL/SQL (Procedural Language/SQL, Oracle), Transact-SQL (Microsoft SQL Server), PL/pgSQL (PostgreSQL)

What does ACID stand for?

1. Atomicity - Atomicity refers to the integrity of the entire database transaction, not just a component of it. In other words, if one part of a transaction fails, the whole transaction fails with it. 2. Consistency - Only data which follows the database's rules (schema) is permitted to be written to the database. 3. Isolation - The ability to concurrently process multiple transactions in a way that one does not affect another. 4. Durability - In databases that possess durability, data is saved once a transaction is completed, even if a power outage or system failure occurs right afterwards.

What is a filter in a MongoDB query?

Filters are used to narrow down the set of results obtained by using a query.

Float

4 bytes, about 6-7 decimal digits of precision

nano

A commonly used CLI text editor

What is a primary key?

A primary key is analogous to the object id in Mongo, it serves as a unique identifier for a given record.

How do we define an auxiliary constructor?

An auxiliary constructor is a constructor defined within the body of a class that makes a call to the primary constructor to instantiate an object but does not need to be passed the full set of parameters.
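
A minimal sketch (names illustrative):

  class Rectangle(val width: Int, val height: Int) {
    // auxiliary constructor: calls the primary constructor with fewer parameters
    def this(side: Int) = this(side, side)
  }
  val square = new Rectangle(5)   // uses the auxiliary constructor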

What is Throwable?

Throwable is the superclass of both Exception and Error.

What command moves changes from GitHub to our local repo?

git pull

What are the 4 ways provided to construct an RDD?

1) RDDs can be created through the parallelization of Scala collections 2) RDDs can be created through the transformation of parent RDDs 3) RDDs can be created by reading from a file 4) RDDs can be maintained through changing the persistence of an existing RDD (i.e. caching)
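
A minimal sketch of all four, assuming an existing SparkContext sc (file path illustrative):

  val fromCollection = sc.parallelize(Seq(1, 2, 3))     // 1) parallelize a collection
  val fromParent     = fromCollection.map(_ * 2)        // 2) transform a parent RDD
  val fromFile       = sc.textFile("hdfs:///data.txt")  // 3) read from a file
  fromParent.cache()                                    // 4) change persistence (caching)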

1-to-1

1-to-1: Person, Signature; Each person is associated with exactly one signature and each signature is associated with exactly one person

1-to-N

1-to-N: Director, Movies; Each Director may be associated with many movies, but each movie has only one director

What is cardinality?

Cardinality is how many unique elements are in a column relative to the number of rows in the set.

What is a collection?

1. A grouping of MongoDB documents. 2. The equivalent of an RDBMS table. 3. Exists within a single database.

Explain replica sets

1. A replica set in MongoDB is a group of mongod processes that maintain the same data set. 2. Replica sets provide redundancy and high availability, and are the basis for all production deployments.

What is a database?

1. An organized collection of structured information, or data, typically stored electronically in a computer system. 2. The data can then be easily accessed, managed, and modified.

What is BSON?

1. BSON simply stands for Binary JSON. 2. BSON's binary structure encodes type and length information, which allows it to be parsed much more quickly.

What does BASE stand for?

1. Basically Available - This constraint states that the system guarantees the availability of the data, in the sense of the CAP theorem; there will be a response to any request. But that response could still be 'failure' to obtain the requested data, or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account. 2. Soft state - The state of the system may change over time, so even during times without input there may be changes going on due to 'eventual consistency'; thus the state of the system is always 'soft.' 3. Eventual Consistency - The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system continues to receive input and does not check the consistency of every transaction before it moves on to the next one.

How do we specify dependencies using sbt?

1. Declaring a dependency looks like this, where groupID, artifactID, and revision are strings: libraryDependencies += groupID % artifactID % revision 2. Or like this, where configuration can be a string or a Configuration value (such as Test): libraryDependencies += groupID % artifactID % revision % configuration
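
A concrete example in build.sbt (version number illustrative); the %% variant appends the project's Scala version to the artifact ID:

  libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.15" % Test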

What is a port number?

1. In computer networking, a port number is a number that identifies a communication endpoint. 2. At the software level, within an operating system, a port number is a number that identifies a specific process or a type of network service.

Explain sharding

1. Sharding is a method for distributing data across multiple machines. 2. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.

What do sort and limit do?

sort() specifies the order in which the query returns matching documents. You must apply sort() to the cursor before retrieving any documents from the database: db.users.find({ }).sort( { age : -1, posts: 1 } ) sorts the documents first by the age field in descending order and then by the posts field in ascending order. limit() is used on a cursor to specify the maximum number of documents the cursor will return. Syntax: db.collection.find(<query>).limit(<number>)

What is an ObjectId?

1. The ObjectId class is the default primary key for a MongoDB document. 2. Usually found in the _id field in an inserted document.

What is the _id field for in mongo?

1. The _id field contains a unique ObjectID value. 2. The _id field is the primary key for the collection so that each document can be uniquely identified in the collection.

What does find() do?

The collection.find() action returns a cursor object that points to any documents that match the specified query filters.

In MongoDB, what is a document?

1. The format in which MongoDB stores data. 2. More specifically, BSON documents; BSON is a binary representation of JSON documents. 3. MongoDB documents are composed of field-and-value pairs.

What advantages does creating an Index give us? Disadvantages?

Advantages: 1. They make it possible to quickly retrieve data. 2. They can be used for sorting. 3. Unique indexes guarantee uniquely identifiable records in the database. Disadvantages: 1. They increase the storage space needed. 2. They decrease performance on inserts and deletes, since these commands cause re-indexing.

What are JSON data types?

1. a string 2. a number 3. an object (JSON object) 4. an array 5. a boolean 6. null

Short

16 bits/2 bytes, -2^15 through 2^15 - 1

What's the default port for MongoDB?

27017

Int

32 bits/4 bytes, -2^31 through 2^31 - 1

Long

64 bits/8 bytes, -2^63 through 2^63 - 1

Byte

8 bits/1 byte, -2^7 through 2^7 - 1

Double

8 bytes, about 15-16 decimal digits of precision

How many blocks will a 200MB file be stored in in HDFS, if we assume default HDFS block size for Hadoop v2+?

A 200MB file will be stored in two blocks, as each block in HDFS stores 128MB of data by default

What is a foreign key?

A FOREIGN KEY is a field (or collection of fields) in one table that refers to the PRIMARY KEY in another table

What is a Hive bucket?

A Hive bucket is similar to a partition in that it divides a table into a number of subtables; however, buckets are based on a hash of a chosen column rather than on the column's distinct values, allowing each bucket to effectively serve as a smaller, roughly random sample of its table at large.

What is a Spark Application? Job? Stage? Task?

A Spark application is the full program that contains all of the various RDDs/jobs that will be evaluated upon its execution. A Spark job is spawned each time an action such as .collect() or .take() is called upon an RDD, and is made up of stages. A Spark stage is a series of transformations that can be performed upon an RDD without any need to repartition; each time we call a shuffle transformation on an RDD, we end one stage and begin a new stage. A Spark task is the smallest unit of execution: one stage's pipeline of transformations, such as .map or .filter, applied to a single partition, so a stage spawns one task per partition. As long as no shuffle occurs, an RDD partition is passed from one transformation to the next within the same task.

What is a Container in YARN?

A container is a bundle of computing resources given to a worker by the Scheduler

What is a daemon?

A daemon is a long-running background process that allows for the functioning of programs and processes that the end-user interacts with

lineage

A fault tolerance concept that allows our RDDs to be reconstructed based on the data from which they were constructed and the transformations necessary for their construction.

What is a Future?

A future is a part of the program execution that is directed to execute in a separate thread, allowing the main program flow to continue while the future is being executed. This is useful when a task takes a relatively long time to execute and we do not want the main execution flow to be held up while the task completes. The result of the Future can then be called later in the program's execution.
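
A minimal sketch (the computation inside the Future is illustrative):

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.util.{Failure, Success}

  val f = Future { Thread.sleep(500); 42 }   // runs on a separate thread
  f.onComplete {
    case Success(result) => println(s"Got: $result")            // value wrapped in Success
    case Failure(e)      => println(s"Failed: ${e.getMessage}") // exception wrapped in Failure
  }
  // the main program flow continues here while f executes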

What is a hashCode? How are hashCode used in (default) partitions?

A hashing algorithm is a mathematical function which will always return the same output when given the same input. It is used to convert things like files into hashes and can be used for integrity checks. In the default partition algorithm, a hashing algorithm is applied to each intermediate key, and the result is taken modulo R (the number of reduce tasks/output files specified by the user) to choose a partition.
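
A tiny sketch of that default scheme (the key type and names are illustrative):

  // default MapReduce-style partitioning: hash(key) mod R
  def partitionFor(key: String, R: Int): Int =
    (key.hashCode & Int.MaxValue) % R   // masking keeps the hash non-negative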

What is a higher order function?

A higher order function is a function that takes a function as an argument and/or returns a function.
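
For example:

  // applyTwice is higher-order: it takes the function f as an argument
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
  applyTwice(_ + 3, 10)      // 16
  List(1, 2, 3).map(_ * 2)   // map is itself a higher-order function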

What is a Lambda?

A lambda is an anonymous function: rather than defining a function beforehand and calling it when needed, a function that only needs to be used once can be defined at the point of use.

cache

A method that changes the persistence of an RDD so that it remains in memory after it is first computed. A cache call on an RDD that would overflow the available RAM leads to partitions being dropped and recalculated as necessary.

What does a partitioned table look like in HDFS?

A partitioned table will be divided into multiple directories within HDFS, with each directory corresponding to one partition.

What is a pure function?

A pure function is a function that takes an input and produces an output with no "side effects", i.e. a mathematical function

What do we need to include in any effective recursive function?

A recursive function needs two things to be effective: it needs to call itself with different arguments than it was originally called with, and it needs to have a base case.
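
A minimal sketch showing both requirements:

  import scala.annotation.tailrec

  @tailrec
  def factorial(n: Int, acc: Int = 1): Int =
    if (n <= 1) acc                   // base case
    else factorial(n - 1, acc * n)    // recursive call with changed arguments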

What is a side effect?

A side effect occurs when a chunk of code does something other than simply return a value, such as performing I/O, i.e. printing to the console

What is a Thread?

A thread is a single flow of program execution. Each thread has its own stack.

What is a Trait?

A trait is similar to an interface in that it stores values or methods that can be shared by classes that implement (extend) the trait.
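
A minimal sketch (names illustrative):

  trait Greeter {
    def greeting: String = "Hello"    // concrete member shared by implementors
    def greet(name: String): String   // abstract member
  }
  class EnglishGreeter extends Greeter {
    def greet(name: String): String = s"$greeting, $name!"
  }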

What's the difference between a while loop and a do-while loop?

A while loop will only execute while the condition is true, thus it may never execute if the condition is never met. A do-while loop, however, will always execute at least once.

What is a Working Set?

A working set is a dataset that is stored in memory (RAM) to allow for quick access and use.

Abstraction

Abstraction is the concept that only the data and methods necessary to use an object are "visible" to the end user. In other words, it is not necessary to understand an object's internal data structures or how its methods work in full detail; abstraction allows the use of methods and data structures without needing to look "under the hood"

What is an accumulator?

Accumulators are mutable, global variables which can be passed along with closures to worker nodes. These worker nodes will then alter their accumulator based on the closure passed to them and return this accumulator to the driver program which can then aggregate the accumulator totals from all worker nodes to create one final accumulator, i.e. adding the elements of an RDD that are distributed across the cluster.
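
A minimal sketch using the Spark 2.x accumulator API, assuming an existing SparkContext sc (file path illustrative):

  val errorCount = sc.longAccumulator("errors")
  sc.textFile("hdfs:///logs.txt").foreach { line =>
    if (line.contains("ERROR")) errorCount.add(1)   // each worker updates its local copy
  }
  println(errorCount.value)   // the driver sees the aggregated total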

What is AWS?

Amazon Web Services (AWS) is a cloud platform provided by Amazon that allows applications to be run on virtual servers maintained by Amazon. AWS has all the typical benefits of a cloud service, including elasticity, scalability, and the ability to use large amounts of computing resources on a per-use fee basis, which is often much cheaper than maintaining a server farm directly.

ArrayBuffer

An ArrayBuffer is like a List but is mutable

What does it mean to transform an RDD?

An RDD can be created by performing a transformation on a parent RDD, i.e. mapping a function to the values stored in an existing RDD.

What's the difference between an absolute and a relative path?

An absolute file path gives the entire file path of a file or directory beginning at the root directory, while a relative file path gives the path to a file or directory relative to the current working directory.

What is found inside an empty Option?

An empty option contains None

What is an enumeration?

An enumeration is a case object that extends a sealed trait to provide a list of choices with which the trait can be defined. Using enumerations solves the issue present with string flags in that other developers working on the code cannot use alternative names or spellings when referring to a potential value for a variable, but must instead choose from the list of defined enumerations, i.e. pizza example
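
A minimal sketch of the pizza example the card refers to (names assumed for illustration):

  sealed trait Topping
  case object Cheese    extends Topping
  case object Pepperoni extends Topping
  case object Mushroom  extends Topping

  // callers must pick one of the defined values; a misspelled string cannot compile
  class Pizza(var topping: Topping = Cheese)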

What's the difference between an expression and a statement?

An expression is used to return a value, whereas a statement is used for its side effects, i.e. what it does rather than what it returns.

What is an External table?

An external table is a type of Hive table in which the underlying data is not controlled by Hive, but rather is stored elsewhere within the HDFS. This allows for the protection of data, i.e. if a table containing the data is dropped, the data itself will not be affected. However, it also means Hive is unable to guarantee the data, i.e. the data may not be normalized.

Is if-else an expression or a statement in Scala?

An if-else can be either a statement or an expression based on context. An if-else is a statement in that it evaluates some form of logic, however it can be used as an expression to return a value. In Scala, we consider if-else to be primarily an expression since it returns some value rather than executes a code block.

When is an object on the Heap eligible for garbage collection?

An object in the Heap is eligible for garbage collection when it is no longer reachable, i.e. no function remaining on the Stack holds a reference to it, directly or indirectly.

How do we write an Option that contains a value?

An option that contains a value will be Some(value)

What are some examples of side effects?

Anytime a function alters some sort of external state, this is called a side effect. Some common examples include a method which alters some form of data stored within an object or a method that serves some sort of I/O purpose, like printing a line to the screen.

Why might we want mutable state (OOP) in our application? Why might we want immutable state (FP)?

OOP provides a lot of functionality in circumstances where changes to state are valuable and need to be tracked over time. FP is valuable when a function is not concerned with such state, because immutable state is not susceptible to the anomalous behavior that shared mutable state can cause.

What does it mean that an operation or transaction on a data store is atomic?

Atomicity means that operations on a document will either fully succeed or will not execute at all, i.e. an updateMany operation will fully update a given document or not update it at all, but it will never partially update it.

What is BASH?

BASH stands for Bourne Again Shell and is one of the most common UNIX-based shells used to interact with a computer or server from the command line.

What are some of the data types included in BSON?

BSON includes all of the standard data types available in JSON, but it also includes additional data types, such as datetime and byte array, and expands on JSON's single "number" type by providing types such as Double, Decimal, 32-bit Int, and 64-bit Int.

What is beeline?

Beeline is a CLI designed to interact with Hive, allowing for easy and efficient querying of Hive databases via the use of HQL.

In a typical Hadoop cluster, what's the relationship between HDFS data nodes and YARN node managers?

Both HDFS data nodes as well as YARN node managers exist on the worker machines within a Hadoop cluster. It is the responsibility of the data nodes to maintain the data that is stored on the machine, while it is the responsibility of the node manager to manage the computing resources (called a "container") that can be used to perform tasks using the data.

What are some differences between hard disk space and RAM?

Both hard disk space and RAM provide memory used for storing computer data. However, RAM is volatile memory, meaning it does not persist on the loss of power, while disk space is persistent memory, allowing data stored in hard disk space to continue to exist after shutdown. Typically, data that is going to be used for some type of process is moved to RAM to allow quick and easy access, while data that needs to be stored for a longer period of time is written to disk.

What are some steps we can take to debug a Spark Application?

Both our driver program and our executors write to stderr when an issue is encountered during execution. By looking at this output (typically stored in /var/log/spark) we can find the issue with our application. Typical issues are OOM errors for program drivers and GC issues for executors; these can be rectified by allocating more memory to the appropriate machine.

What is a broadcast variable?

Broadcast variables are the immutable/read-only, global variables used within Spark. Broadcast variables are useful in situations where worker nodes must share a variable to perform their work, but will not need to alter the variable, i.e. filtering a list RDD based on a fixed value.
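
A minimal sketch, assuming an existing SparkContext sc:

  val keep = sc.broadcast(Set("spark", "hive"))            // shipped once to each worker
  val words = sc.parallelize(Seq("spark", "mongo", "hive"))
  words.filter(w => keep.value.contains(w)).collect()      // Array("spark", "hive")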

What is the size of an Int? a Double? (the default integer and decimal types)

By default, integers are stored as Ints (4 bytes) unless otherwise specified, while decimals are stored as Doubles (8 bytes) unless otherwise specified

Is the Master fault tolerant in GFS? Why or why not?

By default, the master does not have a fault-tolerant implementation, as it is highly unlikely that the single master machine would fail; should it fail, the job can simply be restarted

What is the CAP Theorem?

CAP stands for Consistency, Availability, Partition Tolerance. CAP theorem states only 2 of these 3 concepts can be achieved simultaneously, not all 3.

How do we create a table?

CREATE TABLE table_name (column1 TYPE, column2 TYPE), optionally followed by clauses such as ROW FORMAT ... and TBLPROPERTIES (...)

What is CRUD?

CRUD stands for Create, Read, Update and Delete, and represents the 4 major ways in which a user interacts with a database

What does it mean to cache an RDD?

Caching an RDD means storing it in RAM so that it can be used for further analysis without having to be recreated, i.e. it overrides the RDD's naturally ephemeral state.

cd

Changes directories into a given directory

What are classes? Objects?

Classes are blueprints for objects that contain data and methods. For a class to be used, it must be instantiated as an object which is then stored in memory and can be used as needed without affecting the class itself

What does Cluster Computing refer to?

Cluster computing refers to the concept of storing and processing large amounts of data across a networked set of computers within which each computer represents one node of the larger cluster.

How do we write code to throw an Exception?

An exception is thrown with the throw keyword, e.g. throw new Exception("message"). It can then be caught in a try/catch block, whose catch clause uses case-matching syntax in Scala.
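
A minimal sketch of both sides:

  def divide(a: Int, b: Int): Int =
    if (b == 0) throw new ArithmeticException("division by zero")   // throw keyword
    else a / b

  try divide(1, 0)
  catch {
    case e: ArithmeticException => println(s"Caught: ${e.getMessage}")   // case-match syntax
  }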

Provide an example of a good column or set of columns to partition on.

Columns used for partitioning should be those of relative importance to the dataset at large. For example, if one were to create a table of all the Wikipedia view data provided by the Wikimedia analytics dumps, it may be valuable to partition the data on domain code, allowing for quicker queries of data within one of the many domain codes available.

What happens to the output of completed Map tasks on failed worker nodes?

Completed Map tasks on failed worker nodes are re-executed because their output was written to the local disk of the failed worker and is thus no longer accessible.

Some example CRUD methods for Mongo? (The Scala methods mostly match mongo shell syntax)

Create: 1. db.collection.insertOne() 2. db.collection.insertMany() Read: 3. db.collection.find() Update: 4. db.collection.updateOne() 5. db.collection.updateMany() 6. db.collection.replaceOne() Delete: 7. db.collection.deleteOne() 8. db.collection.deleteMany()

What are DML, DDL, DQL?

DML, DDL, and DQL are all subsets of SQL used to refer to the type of actions being performed in each SQL statement. DML stands for data manipulation language and refers to statements such as INSERT, UPDATE, and DELETE, i.e. statements that affect records in the database. DDL stands for Data Definition Language and refers to statements used to define tables for records, such as CREATE, ALTER, and DROP. DQL stands for Data Query Language and refers to statements used to read data from the database, namely SELECT.

lazy evaluation

Data is not read from memory until an action is called upon an object that contains or uses that data.

What is Data Locality? Why is it important?

Data locality refers to the concept of the proximity between the data store and the processor/worker that will access the data. Since it is far faster and more efficient for a given worker to access data from the local disk than over the network, the master assigns tasks to the workers based on data locality.

What is data locality and why is it important?

Data-locality is a term used to describe where data is being stored in the physical computer cluster in relation to the resources that need to access the data. Data locality is important because it is much faster for a task to be performed on data written to the local disk than it is to perform a task on data that must be accessed over the network.

How are DataNodes fault tolerant?

DataNode fault tolerance is achieved by the NameNode: if it stops receiving heartbeats from a DataNode, it re-replicates the blocks that node held from the surviving replicas, ensuring the data is preserved and the replication factor is maintained.

What are the ways we could provide default values for a field (Scala class)?

Default values can be provided for a field by using the "=" operator with a parameter in the primary constructor. Alternatively, auxiliary constructors can be used to instantiate objects with some chosen default values for certain parameters.

What is a distributed application? A distributed data store?

Distributed applications and databases are those that are stored/run on a cluster of machines connected over a network. This allows for the application or database to run in parallel and thus leads to much faster execution time.

What about handling document relationships using references -- advantages and disadvantages?

Document relationships can also be handled using references, rather than embedding documents within other documents. These references can be useful when dealing with large datasets as they allow for a large amount of data to be referenced within a document, while allowing that dataset to also be a stand-alone document. However, references have an added disadvantage in that they require an additional query to find the referenced document

What is spark.driver.memory? What about spark.executor.memory?

Driver memory in Spark is the memory allocated to the driver program, which is 1GB by default. If the executors return more than 1GB of data to the driver program, this can cause an OOM error, leading the driver program to crash and thus terminate the execution of our Spark job. Executor memory refers to the amount of RAM available on the worker node of an executor that is allocated to the processing of Spark tasks. This memory is equivalent to roughly 60% of all memory available on the worker node by default, though it can be adjusted. Exceeding Executor memory will cause data to be spilled to disk, which can lead to inefficiency in the execution of a spark task as it incurs I/O overhead.

EC2?

EC2 stands for Elastic Compute Cloud and is the AWS service that allows for the creation of virtual servers providing large amounts of computing resources that can be used to run applications and perform various jobs. EC2 instances are billed on a per-hour fee model.

EMR?

EMR stands for Elastic MapReduce and is AWS's service that allows a Hadoop cluster to be created on top of EC2 instances. The EMR cluster can then be used to execute distributed jobs, such as MapReduce and Spark jobs.

How many partitions does a single task work on?

Each task operates on its own partition, meaning that the number of partitions for our dataset will determine the number of tasks that are spawned.

Encapsulation

Encapsulation protects the internal workings of an object or class by allowing an object's data to be manipulated only by the object itself. Other objects can call a method that an object uses to manipulate itself, but cannot directly manipulate the data of another object.

What are Errors?

Errors are similar to exceptions, however they occur when something goes wrong in the JVM, such as a stack overflow. Unlike exceptions, these should not (and generally cannot) be caught by the program

What are some examples of filters?

Examples: delete a document whose "qty" field equals 20: col.deleteOne(equal("qty", 20)); find all documents where the value of the qty field is either 5 or 15: col.find(in("qty", 5, 15))

What are Exceptions?

Exceptions are thrown by the program whenever something unexpected occurs, such as a function being passed an incorrect data type or an attempt to access an item in an array using an invalid index

What is an executor? What are executors when we run Spark on YARN?

Executors are the worker processes of Spark, i.e. where the actual evaluation of RDDs, as well as the caching of RDDs in memory, occurs. When we run Spark on YARN, executors run as containers on the YARN NodeManagers.

Transformation

Expressions performed upon an RDD to create a child RDD, i.e. RDD1.map(_ + 2) = RDD2

What does FROM do?

FROM is used in conjunction with SELECT to specify which table to search for a given record criteria.

What does filter do?

Filter allows for the return of only certain elements within a collection, i.e. list.filter(_ > 4) will return only those elements in the list that are greater than 4.

Action

Function calls that act upon, and thus trigger the evaluation of, our RDDs, i.e. reduce, collect, take, and foreach.

What is GFS?

GFS stands for Google File System and is a proprietary distributed file system developed by Google. It distributes files across a cluster using a master node as well as several chunk servers. The master server is responsible for assigning and storing metadata related to file chunks, such as a 64-bit unique identifier and the mapping of chunk locations on the cluster, while the chunk servers are responsible for the storage of chunks. GFS replicates these chunks across the cluster to provide high availability.

What is garbage collection?

Garbage collection is the process through which objects that are no longer necessary to the functioning of the program are removed from the Heap, thus freeing up memory for new objects to be instantiated as well as improving application performance.

What is a generic?

Generics are used with collections by telling the collection to take a type as a parameter. For example, the code val myList = List[Int](1, 2, 3) creates a val named myList using a generic to ensure myList will only ever contain Integer types. This helps with compile-time type safety because it ensures the collection will only ever contain the type defined by the generic, so any function that may use the list as an argument will not be given the wrong type, helping prevent runtime exceptions.

What is Git?

Git is version control software that allows a developer to easily keep track of changes made to their program and to restore earlier versions if necessary.

What is GitHub? How is it related to Git?

GitHub is a cloud-based repository for Git projects allowing multiple developers to collaborate on a project and see the changes others have made to the code without necessarily having their own source code changed. Changes made in Git on the local machine can be pushed to the GitHub repository allowing others to pull those changes to their own local repositories.

What is High Availability? How is it achieved in Mongo?

High availability means that the system will almost always be available, in other words, the system is prepared to maintain availability even in the event of a failure of some sort. This is achieved in Mongo by using replica sets

What is Hive?

Hive is a data warehouse software built atop HDFS that provides for SQL-like querying of data stored within HDFS.

What is a Hive partition?

Hive partitions allow for the segmenting of data on a given column such that all queries of data based on that column will be faster, as it is only necessary to query the data contained in the given partition of data rather than querying the entire database.

How do we write the output of a query to HDFS?

INSERT OVERWRITE DIRECTORY 'file/path' *HQL QUERY*

If the storage level for a persist is MEMORY_ONLY and there isn't enough memory, what happens?

If the memory required to cache an RDD exceeds the memory available, as many partitions of the RDD as possible will be cached in memory, and those that do not fit will be recomputed as necessary in future operations.

Why should we be careful about using accumulators outside of an action?

If we use an accumulator outside of an action, potential re-executions of that action, whether the result of a failed node, a speculative execution, or the rebuilding of a partition may result in the value of the action being added to the accumulator redundantly.

Why might we want to write an application using pure functions instead of impure functions?

Impure functions, i.e. functions that change the state of external objects and variables, can be useful in contexts where a developer wants to track changes. However, they can also cause issues for other functions that depend upon that external state, which is where pure functions are valuable. Since pure functions have no side effects, a program using only pure functions is far less likely to run into bugs.

What does it mean that side effects should be atomic and idempotent?

In MapReduce, we want our side effects to be atomic so that if multiple workers run the same function, the side-effect output does not get reproduced over and over. We also want them to be idempotent, meaning the function does not change the output if applied a second time, i.e. it only creates the given side effect the first time it is applied.

What is the closure of a task? Can we use variables in a closure?

In Scala, a closure is a function which acts upon (captures) variables outside its own scope. In Spark, the closure of a task is serialized and shipped to each worker, so a worker only ever mutates its own copy of any captured variable; those changes are never returned to the driver, making it appear as if the closure has done nothing. Plain variables therefore cannot be used as shared state in a closure; accumulators serve that purpose instead.

What is a reasonable number of partitions for Spark?

In Spark, it is good practice to have at least 2x as many partitions as executors in order to maximize parallelization of tasks. As a general rule, if the number of partitions causes tasks to execute in less than about 100ms, there are probably too many partitions: the tasks complete so quickly that the overhead of scheduling them becomes burdensome.

Walk through a MapReduce job that counts words in the sentence: "the quick brown fox jumped over the lazy dog"

In a MapReduce job performed on the above sentence using two map tasks, the sentence would be split into two chunks and the master would assign one half of the sentence to each of the map tasks. The map tasks would map each word into an intermediate (key, value) pair, with a key corresponding to each unique word in the input. These intermediate (key, value) pairs would then be written to local disk on the corresponding worker machines. The reduce tasks would then look for repeating keys and combine them, performing a sum to produce a word count. Since only one word, namely "the", appears more than once in our input, only one actual combination occurs, i.e. reducing the two (key, value) pairs representing "the" into one final output.
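
A minimal Spark analog of this word count, assuming an existing SparkContext sc:

  val sentence = "the quick brown fox jumped over the lazy dog"
  sc.parallelize(Seq(sentence))
    .flatMap(_.split(" "))     // map phase: one record per word
    .map(word => (word, 1))    // intermediate (key, value) pairs
    .reduceByKey(_ + _)        // reduce phase: sums the counts per key
    .collect()                 // ("the", 2) plus one entry per remaining word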

What does a bucketed table look like in HDFS?

In contrast to a partitioned table, which will appear as multiple directories in our Hive database, a bucketed table will appear as multiple files within a single directory within HDFS

What is a managed table?

In contrast to an external table, a managed table is a table in which Hive is directly responsible for managing the data. This means that Hive can guarantee the data is normalized, but also means that if the table holding the data is dropped, the data will be lost.

How does MapReduce increase processing speed when some machines in the cluster are running slowly but don't fail?

In order to increase processing speed, MapReduce uses a mechanism in which it assigns nearly completed tasks a back-up execution in another machine and marks the task as complete once either the primary or back-up execution completes, freeing up both workers to be assigned a new task. This is known as speculative execution.

How would the Logistic Regression example function differently if we left off the call to .cache() on the points parsed from file?

In the Logistic Regression example from the paper, the .cache() call keeps the parsed points in memory so they can be iterated over as the Logistic Regression runs. If the point values were not cached, they would need to be re-read from file (and re-parsed) on every iteration, drastically slowing down the program's overall execution.

What do the indentations mean in the printed lineage?

Indentations in our printed lineage represent the beginning of a new stage once our RDD is translated into a physical plan. Each indentation block represents one stage, with new stages occurring whenever a shuffle is required.

What is an Index?

Indexes are applied to certain fields within a document in order to improve the time it takes to run a query based on that field, as the query can check the index rather than needing to check every document to see if the relevant field matches.

Inheritance

Inheritance allows for child objects to "inherit" data structures and methods from parent objects, allowing for reusability of code and thus preventing the need to rewrite code from parent classes in every subsequent child class

Do we need to declare type for our variables in Scala? always?

It is not necessary to declare variable types in Scala, as Scala can infer the data type from the value. A type can be declared, however, using ":" followed by the data type. Ex: var x: Int = 44

What is JSON?

JSON stands for JavaScript Object Notation and is a text format used for storing information about objects. JSON documents contain fields and their corresponding values, formatted as "Field Name" : "Value"

How do we load data into a table?

LOAD DATA [LOCAL] INPATH 'file/path' INTO TABLE table_name

How can we write Lambdas in Scala?

Lambdas in Scala are written using the => operator between an input and the return value, i.e. (a,b) => a+b takes two arguments, a and b, and returns the sum.

How can we see the lineage of an RDD?

Lineage can be accessed via a .toDebugString call on an RDD. To print it to stdout, call println(rdd.toDebugString).

ls -al

Lists all files, including hidden files, in long format for a given directory

List

Lists are immutable, singly-linked, indexed collections that store multiple items of the same data type. Lists are valuable in FP because they are immutable: there can be no change to their state and thus no unintended side effects

mkdir

Makes a new directory

What does map do?

Map is a collections method that allows a function/method to be applied to each element within a collection individually.

Where do Map tasks read from? Write to?

Map tasks are assigned to workers based on data locality, i.e. the master will attempt to assign a map task to a worker that has a replica of its input data stored locally. If that is not possible, it will assign the task to an available worker as near to the input data as possible. Map tasks write their output to the local disk of the worker that performs them.

Why do we say that MapReduce has an acyclic data flow?

MapReduce is said to have an acyclic data flow due to the fact that, once a MapReduce is initiated, data flows through all the steps of the MapReduce (Mapper, Combiner, Partition, Sort & Shuffle, Reduce) without any opportunity to interrupt and re-run portions of the MapReduce architecture. For example, we cannot run our mapper and combiner, then rerun our data through our mapper and combiner again, then send it through the remaining step. All steps must occur in order.

Map

Maps are collections that contain key-value pairs in which each key points to exactly one value, allowing values to be extracted using their keys. Maps can be either mutable or immutable. Scala's default Maps are hash tables.

How do methods executing on the stack interact with objects on the heap?

Methods on the Stack interact with objects in the Heap via references that "point" the method to the object. These references tell the methods which objects they are meant to be manipulating or otherwise interacting with.

In what situations might we want to use multiple threads? (just some)

Multithreading allows different parts of a program to be executed simultaneously, i.e. parallel processing. This is useful when dealing with large amounts of data processing, provided the results of one thread's execution are not dependent on the results of another's.

What is multiplicity?

Multiplicity is a term used to refer to the number of connections between different tables/collections in a database. For example, if a data set has a 1-to-N relationship, it means each single object of a certain type is related to multiple objects of a different type.

accumulator

Mutable, global variable to be used by each node, then passed to the program driver to be aggregated in a Spark job.

N-to-N

N-to-N: Car, Auto Parts; Each car is associated with several auto parts, and each part is associated with several different cars

Can we freely store any data we like in an RDBMS table the same way we store any data in a mongo collection?

No, RDBMSs follow a set of rules in the way they store data. This process is referred to as normalization.

In Mongo, are operations on embedded documents atomic? What about operations on referenced documents?

Operations on embedded documents are atomic in Mongo because the operation occurs on only one document. Operations on references, however, are not atomic because multiple documents must be queried when using references, meaning that it is possible for the operations on one document to succeed but the operations on another to fail.

How do operators work in Scala? Where is "+" defined?

Operators are calls to object methods in Scala. The method for "+" is defined in both the Int class and the String class.
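
For example:

  val a = 1 + 2       // syntactic sugar for the method call below
  val b = (1).+(2)    // "+" is an ordinary method defined on Int
  val s = "a" + "b"   // String defines its own "+" method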

What is a package manager? What package manager do we have on Ubuntu?

Package managers allow new packages to be installed, and existing packages to be updated, without having to worry about dependencies. Ubuntu uses Debian's APT (Advanced Package Tool).

How do permissions work in Unix?

Permissions in Unix are broken up into three subsets: owner (user), group, and others. The owner category refers to the permissions of the owner of the file/directory, group to the group that owns the file/directory, and others to all remaining users. Each of these subsets can have read, write, or execute privileges, or some combination of those privileges.

What are Persistence Storage Levels in Spark?

Persistence storage levels in spark refer to the way in which data is stored, i.e. MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY.

Polymorphism

Polymorphism allows for the code that is inherited to behave differently in different contexts. Two child classes may both inherit the same method from their parent class, however, polymorphism allows the method to function differently in the context of each child class if necessary

Why do we make use of programming paradigms like Functional Programming and Object-Oriented Programming?

Programming paradigms are used to allow for more consistent structuring of code and thus facilitate collaborative projects that involve multiple developers.

man

Provides the manual pages for a given command

What is the lineage of an RDD?

RDDs are based upon the concept of lineage. This means that each RDD stores information that points to the data from which it was derived and the transformation process that was used to create the RDD. This provides a level of fault tolerance for our RDDs, as an RDD is able to rebuild a partition stored on a failed node based on the lineage data it contains.

When we cache an RDD, can we use it across Tasks? Stages? Jobs?

RDDs typically are not shared across jobs, however it is possible for them to be shared if they are written to local disk as part of a shuffle. RDDs are shared across tasks when they are cached and can also be shared across stages.

What is a REPL?

Read, Eval, Print, Loop (REPL) is a command line interpreter that allows code to be run line by line directly in the shell. This feature can be useful for debugging and testing small chunks of code quickly.

What is recursion?

Recursion occurs when a function calls itself as part of its execution.

What are some actions available on an RDD?

Reduce, collect, take, foreach.

What does RDBMS stand for?

Relational Database Management System

What does RDD stand for?

Resilient Distributed Dataset

RDD

Resilient Distributed Dataset.

How does a Future return a value or exception?

Return values are wrapped in a Success and exceptions are wrapped in a Failure

S3?

S3 (Simple Storage Service) is AWS's object storage service, which stores files of a variety of types, including text files, CSVs, JARs, and more. S3 also integrates easily with other services available on AWS, making it ideal for storing files that serve as input for jobs on an EC2 instance

How do we group records and find the count in each group?

SELECT column, COUNT(*) FROM table_name GROUP BY column

How do we query data in a table?

SELECT columns FROM table_name [WHERE filters]

What is ssh?

SSH stands for Secure Shell and is an encrypted communication protocol that allows for the transmission of files as well as the creation of remote connections over a network. SSH uses port 22 and is a secure replacement for the insecure telnet protocol. It can also be used in place of FTP.

What is a case class?

Scala case classes are regular classes that are immutable by default and decomposable through pattern matching. They do not need the new keyword to be instantiated, all parameters listed in the case class are public and immutable by default, and instances are compared by structure rather than by reference.
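
A minimal sketch (names illustrative):

  case class Person(name: String, age: Int)
  val a = Person("Ada", 36)   // no "new" needed: a generated apply is used
  val b = Person("Ada", 36)
  a == b                      // true: compared by structure, not by reference
  a match {                   // decomposable through pattern matching
    case Person(name, _) => println(name)
  }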

Is Scala statically or dynamically typed?

Scala is statically typed, i.e. the data type is known at compilation

How does Scala relate to Java?

Scala relates to Java in several ways, including the use of the JRE and JVM as well as the use of many Java libraries. It also uses many similar data types as Java.

How does the Scala REPL store the results of evaluated expressions?

Scala stores the results from expressions evaluated in REPL under the variables res# with # starting at 0 and ascending in order.

How does Scala relate to the JRE and JVM?

Scala uses the JRE and JVM in a similar manner to Java: its compiled .class files contain bytecode that the JVM (provided by the JRE) executes, translating it into machine code as it runs

What is Scalability? How is it achieved in Mongo?

Scalability means that the system is capable of being expanded to deal with increased demands. This is achieved in Mongo using sharding.

Set

Sets are like Maps but rather than containing key-value pairs, they contain only values and cannot contain duplicates. They can be either mutable or immutable, and are not indexed.

What is a shuffle in Spark?

Shuffles in Spark occur whenever data must be redistributed across partitions, e.g. when we call .reduceByKey, .sortBy, .groupByKey, .join, or .repartition

What does it mean to have data skew and why does this matter when bucketing?

Since buckets are calculated using a hashCode mod method like that of MapReduce partitioning, it is possible for data skew to lead to unevenly distributed, and thus misleading buckets. If the skew within our original data leads to a disproportionate amount of our data being inserted into a given bucket while the other buckets receive far less data, we may create buckets that misrepresent our data. When bucketing, the goal should be to create roughly evenly filled buckets of randomized groups of our data. This is best accomplished by bucketing on columns of high cardinality with relatively little importance to our analysis.

Are the workers in GFS fault tolerant? Why or why not?

Since there are many workers operating simultaneously, it is relatively likely that several workers will fail. For this reason, workers are fault tolerant: if a worker fails during an operation, the operation is simply re-executed by another worker when one becomes available.

Some levels have _SER, what does this mean?

Some levels of memory allow for the data to be stored in a serialized format, which will reduce the amount of memory space used to store the data, but will result in additional computing overhead in both the serialization and deserialization processes.

Why do we use databases instead of just writing to file?

Some reasons: 1. Multiple users are modifying the data 2. Your data is larger than your memory 3. You can query data in a database 4. You can easily create reports from a database

What's the difference between cluster mode and client mode on YARN?

Spark can operate in two different modes when using YARN as a cluster manager. First, it can operate in cluster mode, which means the Spark driver program runs within the ApplicationMaster, and output from the driver program appears inside the container spun up for the ApplicationMaster. In contrast, when Spark runs in client mode, the driver program runs separately from YARN and simply communicates with the ApplicationMaster to ensure that resources are allocated appropriately. In this case, the driver program typically runs on the machine from which the job was submitted (often the cluster's master node), and output from the driver program appears on that machine.

Why does Spark need special tools for shared variables, instead of just declaring, for instance, var counter=0?

Spark uses Scala closures, which are functions dependent on variables that exist outside their scope. Since these closures are copied and passed to worker nodes, any variable assigned within them is updated only in each worker's local copy and is never returned to the driver program. This means a var counter would not be properly updated by the worker nodes, and the counter on the driver would still read 0.

What does it mean to run an EMR Step Execution?

Step execution allows for the spinning up of an EMR cluster, the execution of a number of jobs as a series of steps, and finally the termination of the cluster. This method provides an easy-to-understand workflow when running jobs on EMR and can help prevent accidentally leaving a cluster running after it is no longer in use, thus preventing unnecessary or unintentional expenses.

How does String interpolation work in Scala?

String interpolation uses the syntax s"Some text ${variable or expression}", allowing a cleaner way to write Strings that include variables.

What are some examples of structured data? Unstructured data? Semi-structured?

Structured data refers to data whose size and data types are always consistent, as in an RDBMS. Unstructured data refers to things like raw text or image files, which can vary in size from one input to another and can have a wide range of data types. In between the two is semi-structured data, such as CSV files, which have some degree of structure, but not to the same degree as a database.

What is the ApplicationsManager?

The ApplicationsManager is responsible for accepting submitted jobs, creating an ApplicationMaster for each job, and maintaining fault tolerance of the ApplicationMasters.

What is CDH?

The Cloudera Distribution Hadoop (CDH) was a version of Hadoop that bundled cluster management tools together with the MapReduce tools provided by Hadoop

What is the Hive metastore?

The Hive metastore is the repository where Hive keeps the metadata for all tables in a Hive database (schemas, column types, and data locations), stored in a separate relational database rather than in HDFS. This allows the metadata to be queried while protecting it from writes that could undermine the structure of tables within Hive.

What is the JDK and what does it let us do?

The Java Development Kit is a tool used by Java developers that allows for the compilation of Java code into .class files as well as the execution of these files. JDK contains the JRE.

What is the JRE and what does it let us do?

The Java Runtime Environment provides the JVM (Java Virtual Machine) and the core libraries needed to run compiled .class files; the JVM executes the bytecode, translating it into machine code as it runs. The JRE contains the JVM.

What does the Map part of MapReduce take in? What does it output?

The Map part of MapReduce takes in an input (key, value) pair and produces intermediate (key, value) pairs that are written to the local disk

Be able to explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]

The Mapper and Reducer classes referred to above are used by Hadoop to provide the logic for the map and reduce tasks. In this particular case, the Mapper is defined with a map function that takes a LongWritable (the byte offset of a line within the input file) and Text (the text of the line), and outputs (key, value) pairs of Text and IntWritable (an individual word from the text followed by a 1). The Reducer then takes these (key, value) pairs and reduces them to provide a final word count for the original document
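
A minimal word-count sketch against the Hadoop MapReduce API (class names illustrative, details simplified):

  import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
  import org.apache.hadoop.mapreduce.{Mapper, Reducer}

  class WordMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
    override def map(key: LongWritable, value: Text,
        context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
      value.toString.split("\\s+")                 // split the line into words
        .foreach(w => context.write(new Text(w), new IntWritable(1)))
  }

  class WordReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
    override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
        context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
      var sum = 0
      val it = values.iterator()
      while (it.hasNext) sum += it.next().get      // sum the 1s for this word
      context.write(key, new IntWritable(sum))
    }
  }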

What is the job of the NameNode? What about the DataNode?

The NameNode serves as the master node for the HDFS, while each worker in the cluster is given its own DataNode. The DataNode is responsible for the storage of data, as well as reporting back the health of the node to the NameNode. The NameNode is responsible for keeping the filesystem image as well as recording edits to the file system in a log. It is also responsible for making sure that proper data replication is preserved based on the information it receives from DataNodes

What is the significance of the Object class in Scala?

In Scala, every class ultimately descends from the root type Any; reference classes descend from AnyRef, which is equivalent to Java's Object class. This means all objects inherit certain members, such as the .toString method

What does the PATH environment variable do?

The PATH variable tells BASH which directories to search to find the binary files for commands. This allows a command to be run from the command line simply by typing its name from any location.

Which of the I/O step in a MapReduce would you expect to take the longest?

The Reduce I/O takes longer than the Map I/O because it requires data to be transmitted over the network rather than read from/written to the local disk (or near the local disk). However, a given MapReduce job will have far more Map tasks than Reduce tasks, so it is beneficial to put the larger I/O burden on Reduce.

What does the Reduce part of MapReduce take in? What does it output?

The Reduce part of MapReduce reads the intermediate (key, value) pairs and performs a reduce function that outputs final (key, value) pairs, which are written to an output file.

Which responsibilities does the Scheduler have?

The Scheduler is responsible for allocating resources based on requests from a given job's ApplicationMaster.

What purpose does a Secondary NameNode serve?

The Secondary NameNode periodically reads the NameNode's edit log and merges it into the filesystem image (checkpointing). Despite its name, it is not a hot standby; its checkpoints keep the edit log compact and speed up recovery should the NameNode fail.

What is the Spark History Server?

The Spark History Server is a web UI (provided by services such as EMR) that is available when a Spark job is executed. It breaks the job down into a number of clear, concise visualizations, such as an event timeline and a Directed Acyclic Graph, that show the execution of stages and tasks and the time and resources each consumed. This makes it easier to understand how a given Spark job executed and to debug or tune the underlying code.

What is the Stack? What is stored on the Stack?

The Stack is the space in memory in which the execution of methods and functions takes place. The Stack is LIFO (Last In, First Out): each new method call is pushed onto the top of the stack, executes, and is popped off before the method beneath it resumes.

How does a Standby NameNode make the NameNode fault tolerant?

The Standby NameNode maintains an up-to-date copy of the NameNode's metadata so it can take over quickly, and without catastrophic data loss, should the active NameNode fail.

How does the chmod command change file permissions?

The chmod command explicitly assigns privileges to the owning user, the group, and other users. This can be done either in octal format, e.g. chmod 777 myfile to grant all privileges to all three, or in symbolic format, e.g. chmod u+rwx,g+rwx,o+rwx myfile.

When does the combine phase run, and where does each combine task run?

The combine phase runs on the local machine after the worker has completed its map tasks. This allows some reduction to take place before the data is sent over the network to the final reducers, limiting the use of network resources and speeding up the execution of the job.

What is the optional Combiner step?

The combiner function acts as an intermediary reduce that partially merges highly repetitive intermediate keys, so the final Reduce functions do not need to read as much data over the network, improving performance.

Why is the Combiner step useful in the case of a wordcount?

The combiner is useful in the context of a large wordcount application because the initial Map function is likely to produce a highly repetitive set of intermediate keys; the combiner can therefore greatly reduce the network traffic that would otherwise be generated before the Reduce function.
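
As a sketch of how this is wired up in the Hadoop driver (assuming the illustrative TokenMapper and SumReducer classes from the word-count example above), the combiner is typically just the reducer class itself:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance(new Configuration(), "wordcount")
    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer]) // partial, map-side reduce of repeated keys
    job.setReducerClass(classOf[SumReducer])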

How are replications typically distributed across the cluster? What is *rack awareness*?

Data is typically stored with two replicas on one rack and the third replica on a different rack. This rack awareness is important because it balances network resource usage (transfers within a rack are cheap) against fault tolerance (an entire rack can fail at once).

What is the storage level for .cache()?

The default storage level for .cache() is MEMORY_ONLY
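
A minimal sketch, assuming a SparkContext named sc and an illustrative HDFS path:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input.txt")
    rdd.cache() // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
    // An explicit level could instead be chosen, e.g.:
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)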

What is the Heap? What is stored on the Heap?

The Heap is conceptually similar to the Stack, but instead of storing function and method calls, it stores the objects that have been instantiated and upon which methods and functions operate.

What is the lineage of an RDD?

The lineage of an RDD is the information the RDD stores about its own construction, which allows RDDs to maintain fault tolerance. The lineage is made up of two parts: the data the RDD needs to read during construction, and the transformations it must perform on that data. This lineage forms the logical plan for the evaluation of the RDD.

In Spark, what is the logical plan? The physical plan?

The logical plan for an RDD is represented by its lineage: the data that will be read and the transformations that will occur on that data. The physical plan is the translation of the logical plan into actual processes, produced when we call an action on the RDD and force it to be evaluated. The physical plan consists of stages, which consist of tasks.
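
A minimal sketch of inspecting the logical plan, assuming a SparkContext named sc and an illustrative path; toDebugString prints the RDD's lineage:

    val cleaned = sc.textFile("hdfs:///data/log.txt") // data source recorded in the lineage
      .filter(_.nonEmpty)                             // transformations recorded in the lineage
      .map(_.toLowerCase)

    println(cleaned.toDebugString) // prints the lineage, i.e. the logical plan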

What are some disadvantages of handling document relationships by embedding documents?

The major disadvantages of embedded documents appear when the cardinality of N is very large: the parent document becomes slower to work with, and any change to an embedded document must be applied separately to every other document in which it is embedded.

How does the onComplete callback function work?

The onComplete callback registers a function that is invoked with the Future's result (a Success or a Failure) once the Future finishes executing.
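
A minimal sketch of registering the callback (the computation itself is illustrative):

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Success, Failure}

    val f: Future[Int] = Future { 21 * 2 } // runs asynchronously

    f.onComplete {
      case Success(value) => println(s"Got $value")              // Future finished normally
      case Failure(e)     => println(s"Failed: ${e.getMessage}") // Future threw an exception
    }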

What rules does Mongo enforce about the structure of documents inside a collection?

The only rule Mongo explicitly enforces about the structure of documents inside a collection is that each document must have an _id field. All other rules are defined by the implementation of the database.

What does reduce do?

The reduce() method is a higher-order function that takes all the elements in a collection (Array, List, etc.) and combines them using a binary operation to produce a single value. The operation must be commutative and associative, since the order in which elements are combined is not guaranteed.
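
For example:

    val nums = List(1, 2, 3, 4)
    val total = nums.reduce(_ + _) // 10; + is associative and commutative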

Know the input and output of the shuffle + sort phase.

The shuffle + sort phase takes in the output from map tasks, partitions the output by keys and sorts the partitions so all values for a given key are passed to the reducer together.

What does the src folder contain in a sbt project?

The src folder is used to contain all the source code used in the project, i.e. the .scala files

What does the target folder contain in Scala?

The target folder is used for all files created as a result of building the program, such as the .class files produced at compilation time.

What does it mean that functions are first class citizens in Scala?

The term "first class citizen" refers to something that can be passed as a parameter to a function, returned from a function, or stored as a variable. Scala allows for functions to be used in all these ways

What needs to be true about the types contained in the above generics?

The types of the above generics must match the input the function will receive and the types of the (key, value) pairs it will output. It is therefore important that the Mapper class has the right generic types for the data it will be given, and that the (key, value) output types of the Mapper match the (key, value) input types of the Reducer.

What are some benefits to using the cloud?

There are a number of benefits to using the cloud, including elasticity, high availability, and cost effectiveness. Since cloud resources are virtualized, they are highly elastic: the resources needed for a given job can be expanded as demand increases and contracted when they are no longer necessary. This elasticity provides both scalability, as the system expands to meet higher levels of demand, and high availability, since the service is guaranteed to be reachable at almost all times. Finally, transferring the burden of physical resource storage and maintenance to the cloud provider makes it cheaper and more feasible to run large cluster-computing operations.

What are some advantages of handling document relationships by embedding documents?

There are several advantages to handling relationships by embedding documents, including fast and easy access to embedded items, relationships that are easy to understand, and logical organization.

hdfs dfs -get /user/adam/myfile ~

This command will copy the file "/user/adam/myfile" from the distributed filesystem to the user's home directory on the local machine.

hdfs dfs -put ~/coolfile /user/adam/

This command will copy the file "coolfile" from the user's home directory on the local machine to the folder /user/adam on the distributed filesystem.

What does a custom partitioning function change about how your MapReduce job runs?

Though MapReduce comes with a default partitioning function, users can specify a custom partitioning function to control which reduce task, and therefore which output file, each intermediate key is sent to. For example, a user might partition by hostname so that all URL keys for a single host end up in the same output file.

How might we scale a HDFS cluster past a few thousand machines?

To scale up a HDFS cluster past a few thousand machines, it becomes necessary to create a federated cluster in which multiple clusters with their own NameNodes are connected via the network.

What's the difference in output between MapReduce wordcount in Hadoop and .map followed by .reduceByKey in Spark?

Ultimately, the difference is that .map followed by .reduceByKey in Spark is lazy and will not produce output until an action, such as .collect or .take, is called. It is also worth noting that the shuffle/sort phase sorts keys by default in a MapReduce, whereas in the Spark example a call to .sortBy or .sortByKey would be needed to get sorted output.
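
A minimal word-count sketch in Spark, assuming a SparkContext named sc and an illustrative path:

    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // lazy: nothing has executed yet

    counts.take(10).foreach(println) // the action triggers execution
    // Unlike Hadoop MapReduce, the output is not sorted by key unless we call .sortByKey()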

What is/was Unix? Why is Ubuntu a Unix-like operating system?

Unix was an OS developed at Bell Labs in the 1970s that quickly became popular among tech-minded individuals due to its open nature. In the 1980s, however, Bell revised the Unix business model to make it closed source, leading to a collapse of the Unix market. Several projects then set out to create Unix-like OSs that would provide the open model of the original Unix. The two main branches spawned by this movement were the Berkeley Software Distribution (BSD) and GNU/Linux (GNU tools and applications running on a Linux kernel). Ubuntu is considered a Unix-like operating system because it is a Linux distribution (specifically a derivative of Debian).

cp

Used to copy a file or directory to another location in the file structure

mv

Used to move a file or directory to another location in the file structure

rm

Used to remove a file, or recursively (-r option) to remove a directory

What are users, what are groups?

Users are the individual accounts on a Unix-based system. Users are assigned to groups, which are granted privileges by virtue of group membership in a role-based access control model.

What is a VM?

VM stands for virtual machine, one of the main methods of virtualizing an operating system. In the case of a type II hypervisor, such as VirtualBox or VMWare, the virtualized OS runs within a "sandboxed" environment on top of an underlying OS. The virtual OS is known as the "guest" OS, while the OS of the system hosting the VM is known as the "host".

Vector

Vectors are indexed, immutable sequences that are similar to lists but allow for random access, giving consistent access speed at any position in the sequence.
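
For example:

    val v = Vector(10, 20, 30)
    v(1)             // 20: fast random access at any index
    val v2 = v :+ 40 // immutable: appending returns a new Vector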

What does WHERE do?

In SQL, WHERE acts as a filter by providing a condition that records must satisfy to be returned, e.g. SELECT * FROM movies WHERE year > 1999 will return all records from the movies table in which the year is greater than 1999.

How do we filter the records from a query?

WHERE (filters individual rows), GROUP BY with HAVING (filters groups), and LIMIT (caps the number of rows returned)

How can we do something with the result of a Future?

We can use the result of a Future by retrieving it at a later point in the program. Several mechanisms exist, including blocking with Await.ready or Await.result and registering an .onComplete callback. (Pausing with Thread.sleep merely waits and is not a reliable way to obtain a Future's result.)

How do we interact with the distributed filesystem in Hadoop?

We interact with the distributed filesystem from the command line using the hdfs command and the appropriate option flags.

What was the "Hadoop Explosion"?

When Hadoop initially became popular, it provided an open source framework for tackling MapReduce jobs. For this reason, the rise of Hadoop led to the creation of many new tools to allow for more specialized use of MapReduce in varying contexts.

What does it mean that a function returns Unit in Scala?

When a function returns Unit, it is like a void return type in Java, meaning the function does not return a meaningful value.
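
For example:

    def greet(name: String): Unit = println(s"Hello, $name") // side effect only, no value returned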

What does it mean to "spill to disk" when executing spark tasks?

When executing a Spark task, it is possible for the amount of data on which transformations are being performed to exceed the RAM available to the executor. In this case, the excess data will be spilled to disk and read back when necessary.

RDDs are lazy and ephemeral. What does this mean?

When we say that RDDs are lazy and ephemeral, we refer to two different traits. 1) They are lazy in that they contain a pointer to the data from which they are created, but do not actually read that data or perform the necessary transformations until an action is performed upon them, forcing them to materialize. 2) They are ephemeral in that they are discarded from memory once no longer in use, unless explicitly cached or persisted.

What does it mean when we say an RDD is a collection of objects partitioned across a set of machines?

When we say that an RDD is a collection of objects partitioned across a cluster, we refer to the fact that a given RDD is divided into read-only partitions when it is instantiated (it cannot be altered after instantiation), and that these partitions are spread across the cluster. The failure of a given worker node therefore loses only a small piece of the RDD, which can be quickly and effectively rebuilt thanks to the RDD's lineage-based model.

Some persistence levels have _2, what does this mean?

When a storage level ends in _2 (e.g. MEMORY_ONLY_2), each partition is replicated and stored on two separate nodes in the cluster.

How would we write code that handles a thrown Exception?

Whenever a code block may throw an Exception, it is best practice to nest the code within a try-catch statement to ensure proper exception handling: place the code in a try block, and add a catch block for the exceptions that improper input may raise, producing output that deals with the exception and allows the program to continue functioning.
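
A minimal sketch (the parsing function and fallback value are illustrative):

    def parseIntOrZero(s: String): Int =
      try {
        s.toInt // may throw NumberFormatException on bad input
      } catch {
        case e: NumberFormatException =>
          println(s"Could not parse '$s': ${e.getMessage}")
          0 // fall back to a default so the program can continue
      }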

Why can Spark skip entire stages if they are the same between jobs?

Whenever a stage ends, typically as the result of a shuffle, data is repartitioned and written to disk. Later jobs that include the same initial stage can therefore skip it, reading the post-stage RDD from disk rather than reconstructing it.

When, during a Spark Job, do we need to pay attention to the number of partitions and adjust if necessary?

Whenever we first create an RDD by reading in input data, it is good practice to ensure proper partitioning before the initial stage of transformations. It is also important to check the number of partitions after a shuffle when the data set is large, to ensure we are not reducing the number of partitions in a way that is detrimental to the parallel execution of our tasks.

What does CAP mean for our distributed data stores when they have network problems?

Whenever we have network problems, it is necessary for our database to have partition tolerance. Since CAP theorem states we can only achieve 2 of the 3 CAP goals at one time, the necessity for partition tolerance when there is a network problem means that a distributed data store experiencing a network problem must choose between maintaining consistency, or availability.

How many NameNodes exist on a cluster?

While there is only one active (primary) NameNode running on the cluster at a given time, up to two additional NameNodes can help maintain fault tolerance: the Standby NameNode and the Secondary NameNode. The Standby NameNode keeps an up-to-date copy of the NameNode's metadata by replaying its edit log, so it can step in and take over should the active NameNode fail. The Secondary NameNode periodically merges that edit log into the filesystem image (checkpointing), keeping the log compact and speeding recovery.

Explain the deficiency in using Hive for interactive analysis on datasets. How does Spark alleviate this problem?

With each query made in Hive, we must run a full MapReduce job. Spark instead allows for the creation of RDDs that can be persisted in memory, letting us run interactive analysis at any point in a MapReduce-like pipeline and rerun further functions on the data as our needs change.

BigDecimal

A decimal number of arbitrary precision

BigInt

An integer of arbitrary size

What methods get generated when we declare a case class?

equals, hashCode, toString, copy, apply and unapply
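
For example:

    case class Point(x: Int, y: Int)

    val p = Point(1, 2)   // apply: no `new` keyword needed
    val q = p.copy(y = 5) // copy: Point(1,5)
    p == Point(1, 2)      // equals/hashCode: structural equality, true
    p.toString            // "Point(1,2)"
    val Point(a, b) = p   // unapply: pattern-matching extraction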

What two commands are used in sequence to save changes to our local repo?

git add and git commit (alternatively, git commit -a stages and commits changes to all already-tracked files in one step)

What command moves changes from our local repo to GitHub?

git push

What is sbt (high level)?

sbt is an open-source build tool for Scala and Java projects, similar to Apache's Maven and Ant. Its main features are native support for compiling Scala code, integration with many Scala test frameworks, and continuous compilation, testing, and deployment.

What is the difference between val and var?

val is an immutable variable while var is a mutable variable.
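
For example:

    val x = 1
    // x = 2   // does not compile: a val cannot be reassigned
    var y = 1
    y = 2      // fine: a var is mutable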

What are the 3 Vs of big data?

Volume - the large amount of data that must be processed.
Velocity - the speed at which new data is being created.
Variety - the variance in the types of data that must be accounted for in big data storage and processing.

