Big Data
What is MapReduce? What is the syntax you use to run a MapReduce program?
MapReduce is a programming model in Hadoop for processing large data sets in parallel across a cluster of computers; the data typically resides in HDFS (the Hadoop Distributed File System). The syntax to run a MapReduce program is: hadoop jar <jar_file> <class_name> <input_path> <output_path>.
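For illustration only (the jar name, main class, and paths below are placeholders, not part of the standard answer), a word-count job might be launched like this:
hadoop jar wordcount.jar org.example.WordCount /user/dataflair/input /user/dataflair/output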
What is an RDD?
Resilient Distributed Datasets (RDDs) are the core abstraction in Spark; to understand how Spark works, we should know what RDDs are and how they work. A Spark RDD is a fault-tolerant, distributed collection of data that can be operated on in parallel. Each RDD is split into multiple partitions, and Spark runs one task for each partition. Spark RDDs can contain any type of Python, Java or Scala objects, including user-defined classes. An RDD is not the actual data; it is an object that holds information about the data residing on the cluster. RDDs enable fault-tolerant, distributed, in-memory computation.
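A minimal Scala sketch of the idea, assuming an existing SparkContext named sc (as in the other examples in this section):
val data = Seq(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data, 2)   // a distributed collection split into 2 partitions
val doubled = rdd.map(_ * 2)        // builds a new RDD; no computation happens yet
doubled.collect()                   // runs one task per partition and returns Array(2, 4, 6, 8, 10)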
What does the copyToLocal HDFS command do?
It is similar to the get command, except that the destination is restricted to a local file reference. example: hdfs dfs -copyToLocal /user/dataflair/dir1/sample /home/dataflair/Desktop
What are the main differences between NAS (Network-attached storage) and HDFS?
The main differences between NAS (Network-attached storage) and HDFS - - HDFS runs on a cluster of machines while NAS runs on an individual dedicated machine. - HDFS replicates data blocks across the cluster for fault tolerance, so data is deliberately stored redundantly; NAS follows a different replication protocol, so there is far less data redundancy. - In HDFS, data is stored as data blocks on the local drives of the cluster machines, whereas in NAS it is stored on dedicated hardware.
Define respective components of YARN
The two main components of YARN are - 1. ResourceManager - This component receives processing requests and allocates them to the respective NodeManagers according to processing needs. 2. NodeManager - It executes tasks on every single DataNode.
What does the getfacl HDFS command do?
This Apache Hadoop command shows the Access Control Lists (ACLs) of files and directories. If a directory contains a default ACL, then getfacl also displays the default ACL. Options: -R: It displays the ACLs of all files and directories recursively. <path>: The file or directory to list. example: hadoop fs -getfacl /user/dataflair/dir1/sample hadoop fs -getfacl -R /user/dataflair/dir1
How do you specify the number of partitions while creating an RDD? What are the functions?
You can specify the number of partitions while creating an RDD either with sc.textFile or with sc.parallelize by passing the partition count as the second argument (for textFile it is the minimum number of partitions), as follows: val rdd1 = sc.parallelize(data, 4) val rdd2 = sc.textFile("path", 4)
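To verify the partition count of an RDD created as above, the RDD API's getNumPartitions method can be used:
rdd1.getNumPartitions   // => 4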
What is fsck?
fsck stands for File System Check. It is a command used by HDFS to check the file system for inconsistencies, such as missing, corrupt, or under-replicated blocks; for example, it reports if there are any missing blocks for a file. Unlike a traditional fsck, it only reports the problems it finds, it does not repair them.
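An illustrative invocation (the path is a placeholder; -files, -blocks and -locations simply make the report more detailed):
hdfs fsck /user/dataflair/dir1 -files -blocks -locations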
What is cache() and persist() in Spark?
With cache(), you use only the default storage level, MEMORY_ONLY. With persist(), you can specify which storage level you want (see RDD persistence in the Spark documentation). Use persist() if you want to assign a storage level other than MEMORY_ONLY to the RDD. Other options include: - MEMORY_ONLY_SER, combined with a fast serialization library, to make the objects much more space-efficient but still reasonably fast to access (Java and Scala). - Disk-backed levels such as MEMORY_AND_DISK and DISK_ONLY.
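A short sketch, where rddA and rddB are placeholders for existing RDDs (note that a storage level can only be assigned once per RDD):
import org.apache.spark.storage.StorageLevel
rddA.cache()                                    // equivalent to rddA.persist(StorageLevel.MEMORY_ONLY)
rddB.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk if needed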
How do you copy the file or directory from the local file system to the destination in HDFS?
hdfs dfs -put /home/dataflair/Desktop/sample /user/dataflair/dir1
What is the Command to format the NameNode?
$ hdfs namenode -format
Explain the Apache Spark Architecture. How to Run Spark applications?
- An Apache Spark application contains two kinds of programs, namely a Driver program and Worker programs. - A cluster manager sits in between to interact with these two cluster nodes. The Spark Context keeps in touch with the worker nodes with the help of the Cluster Manager. - The Spark Context is like a master and the Spark workers are like slaves. - Workers contain the executors that run the job. If any dependencies or arguments have to be passed, the Spark Context takes care of that. RDDs reside on the Spark Executors. - You can also run Spark applications locally using threads, and if you want to take advantage of distributed environments you can take the help of S3, HDFS or any other storage system.
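For example, a packaged application might be submitted with spark-submit (the class name and jar name below are placeholders):
spark-submit --class org.example.MyApp --master local[4] myapp.jar    # run locally using 4 threads
spark-submit --class org.example.MyApp --master yarn --deploy-mode cluster myapp.jar    # run on a YARN cluster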
What is Apache Spark?
Apache Spark is a cluster computing framework which runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing of a wide variety of data from multiple sources. In Spark, a task is an operation that can be a map task or a reduce task. The Spark Context handles the execution of the job and also provides APIs in different languages, i.e., Scala, Java and Python, to develop applications, with faster execution as compared to MapReduce.
What is the role of coalesce() and repartition () in Spark?
Both coalesce and repartition are used to modify the number of partitions in an RDD, but coalesce avoids a full shuffle. If you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. Repartition performs a coalesce with shuffle: it results in the specified number of partitions with the data distributed using a hash partitioner.
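A brief sketch, where rdd is a placeholder for an existing RDD with 1000 partitions:
val fewer = rdd.coalesce(100)          // narrows to 100 partitions without a full shuffle
val reshuffled = rdd.repartition(100)  // same as coalesce(100, shuffle = true): full shuffle, even distribution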
Describe coalesce() in Spark
It avoids a full shuffle. If it's known that the number of partitions is decreasing, then the executor can safely keep data on the minimum number of partitions, only moving the data off the extra nodes onto the nodes that we kept. So, it would go something like this: Node 1 = 1,2,3 Node 2 = 4,5,6 Node 3 = 7,8,9 Node 4 = 10,11,12 Then coalesce down to 2 partitions: Node 1 = 1,2,3 + (10,11,12) Node 3 = 7,8,9 + (4,5,6) Notice that Node 1 and Node 3 did not require their original data to move.
What do you understand by Rack Awareness in Hadoop?
It is an algorithm applied by the NameNode to decide how blocks and their replicas are placed. Depending on rack definitions, network traffic between DataNodes within the same rack is minimized. For example, if we consider a replication factor of 3, two copies will be placed on one rack whereas the third copy is placed on a separate rack.
What is data preparation?
It is the process of gathering, combining, structuring and organizing data so it can be analyzed as part of data visualization, analytics and machine learning applications.
What is a lazy evaluation in Spark?
Lazy evaluation in Spark means that execution does not start until an action is triggered. Lazy evaluation comes into the picture with Spark transformations: transformations are lazy in nature, meaning that when we call a transformation on an RDD, it does not execute immediately.
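A small illustration, assuming an existing SparkContext sc (the file path is a placeholder):
val lines = sc.textFile("/data/logs.txt")        // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))   // transformation: still nothing executed
val count = errors.count()                       // action: now Spark reads the file and runs the filter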
Describe structured data vs. unstructured data
Structured data consists of clearly defined data types whose pattern makes them easily searchable, while unstructured data - "everything else" - consists of data that is usually not as easily searchable, including formats like audio, video, and social media postings.
What does the get HDFS command do?
This HDFS fs command copies the file or directory in HDFS identified by the source to the local file system path identified by a local destination. example: hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop
What does the cat HDFS command do?
This Hadoop fs shell command displays the contents of the filename on console or stdout. example: hdfs dfs -cat /user/dataflair/dir1/sample
How is Spark different from MapReduce? Is Spark faster than MapReduce?
Yes, Spark is faster than MapReduce. There are a few important reasons why, some of which are below: There is no tight coupling in Spark, i.e., there is no mandatory rule that reduce must come after map. Spark tries to keep the data "in-memory" as much as possible. In MapReduce, the intermediate data is stored in HDFS, so it takes longer to read it back, but this is not the case with Spark.
How do you list the contents of a directory in HDFS?
hdfs dfs -ls /user/dataflair/dir1
How do you create a directory in HDFS?
hdfs dfs -mkdir /user/dataflair/dir1
What does dfs mean in the hdfs command prompt? e.g. hdfs dfs
It runs a filesystem command on the file systems supported by Hadoop.
What is the data preparation process?
After data has been validated and reconciled, data preparation software runs files through a workflow, during which specific operations are applied to files. For example, this step may involve creating a new field in the data file that aggregates counts from preexisting fields, or applying a statistical formula -- such as a linear or logistic regression model -- to the data. After going through the workflow, data is output into a finalized file that can be loaded into a database or other data store, where it is available to be analyzed.
What is an entity relationship (ER) model?
An ER model describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types). Three levels of abstraction: Physical layer — how data is stored on hardware (actual bytes, files on disk, etc.) Logical layer — how data is stored in the database (types of records, relationships, etc.) View layer — how applications access data (hiding record details, more convenience, etc.)
Explain the different modes in which Hadoop run.
Apache Hadoop runs in the following three modes - 1. Standalone (Local) Mode - By default, Hadoop runs in local mode, i.e. on a single, non-distributed node. This mode uses the local file system to perform input and output operations. It does not support the use of HDFS and needs no custom configuration of the configuration files, so it is mainly used for debugging. 2. Pseudo-Distributed Mode - In the pseudo-distributed mode, Hadoop runs on a single node just like the Standalone mode, but each daemon runs in a separate Java process. As all the daemons run on a single node, the same node acts as both Master and Slave. 3. Fully-Distributed Mode - In the fully-distributed mode, all the daemons run on separate individual nodes and thus form a multi-node cluster, with different nodes serving as Master and Slave nodes.
What do you know about the term "Big Data"?
Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, which is why special tools and methods are used to perform operations on vast collections of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows companies to make better business decisions backed by data.
What are the common input formats in Hadoop?
Below are the common input formats in Hadoop - 1. Text Input Format - The default input format in Hadoop; it reads plain text files broken into lines. 2. Sequence File Input Format - Used to read files stored in sequence file format. 3. Key Value Input Format - Used for plain text files in which each line is split into a key and a value by a separator character (tab by default).
How is big data analysis helpful in increasing business revenue?
Big data analysis has become very important for businesses. It helps businesses differentiate themselves from others and increase their revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. Big data analytics also enables businesses to launch new products depending on customer needs and preferences. These factors help businesses earn more revenue, and thus companies are using big data analytics. Companies may see a significant increase of 5-20% in revenue by implementing big data analytics. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.
Tell us how big data and Hadoop are related to each other.
Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. The framework can be used by professionals to analyze big data and help businesses make decisions.
What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?
Blocks are the smallest continuous unit of data storage on a hard drive. In HDFS, blocks are stored across the Hadoop cluster. The default block size in Hadoop 1 is 64 MB. The default block size in Hadoop 2 is 128 MB. Yes, we can change the block size by using the parameter dfs.blocksize (dfs.block.size in older releases) in the hdfs-site.xml file.
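For instance, a 256 MB block size could be set in hdfs-site.xml with something like the following (the value shown is purely illustrative):
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>  <!-- 256 MB expressed in bytes -->
</property>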
What is Distributed Cache in a MapReduce Framework
Distributed Cache is a feature of the Hadoop MapReduce framework for caching files needed by applications. The Hadoop framework makes the cached files available to every map/reduce task running on the data nodes, so the tasks can access them as local files within the designated job.
Which hardware configuration is most beneficial for Hadoop jobs?
Dual-processor or dual-core machines with 4-8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow and needs to be customized accordingly. Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the most common kinds of internal data corruption.
Explain the core components of Hadoop.
Hadoop is an open source framework that is meant for storage and processing of big data in a distributed manner. The core components of Hadoop are - 1. Hadoop MapReduce - MapReduce is the Hadoop layer that is responsible for data processing. Applications written with MapReduce process the unstructured and structured data stored in HDFS. It is responsible for the parallel processing of high volumes of data by dividing the work into independent tasks. The processing is done in two phases, Map and Reduce. The Map is the first phase of processing, which specifies complex logic code, and the Reduce is the second phase of processing, which specifies light-weight operations. 2. YARN - The processing framework in Hadoop is YARN. It is used for resource management and supports multiple data processing engines, i.e. data science, real-time streaming, and batch processing. 3. HDFS (Hadoop Distributed File System) - HDFS is the basic storage system of Hadoop. The large data files running on a cluster of commodity hardware are stored in HDFS. It can store data in a reliable manner even when hardware fails.
What happens when two users try to access the same file in the HDFS?
HDFS NameNode supports exclusive write only. Hence, only the first user will receive the grant for file access and the second user will be rejected.
What is a Hadoop Mapper?
The Hadoop Mapper task processes each input record and generates new <key, value> pairs. These <key, value> pairs can be completely different from the input pair. The output of the mapper task is the full collection of all these <key, value> pairs. Before the output of each mapper task is written, it is partitioned on the basis of the key and then sorted. This partitioning ensures that all the values for each key are grouped together.
What are the different file permissions in HDFS for files or directory levels?
The Hadoop Distributed File System (HDFS) uses a specific permissions model for files and directories. The following user levels are used in HDFS: Owner, Group, and Others. For each of these user levels, the following permissions are applicable: read (r), write (w), and execute (x). The permissions mentioned above work differently for files and directories.
Explain some important features of Hadoop.
Hadoop supports the storage and processing of big data. It is the best solution for handling big data challenges. Some important features of Hadoop are: 1. Open Source - Hadoop is an open source framework, which means it is available free of cost. Also, users are allowed to change the source code as per their requirements. 2. Distributed Processing - Hadoop supports distributed processing of data, i.e. faster processing. The data in Hadoop HDFS is stored in a distributed manner and MapReduce is responsible for the parallel processing of data. 3. Fault Tolerance - Hadoop is highly fault-tolerant. It creates three replicas for each block at different nodes by default. This number can be changed according to the requirement. So, we can recover the data from another node if one node fails. The detection of node failure and recovery of data is done automatically. 4. Reliability - Hadoop stores data on the cluster in a reliable manner that is independent of the machine. So, the data stored in the Hadoop environment is not affected by the failure of a machine. 5. Scalability - Another important feature of Hadoop is scalability. It is compatible with other hardware and we can easily add new hardware to the nodes. 6. High Availability - The data stored in Hadoop is available to access even after a hardware failure. In case of hardware failure, the data can be accessed from another path.
How do you convert unstructured data to structured data?
Information Extraction & NLP are the ways to go. Imagine you have a template which should be filled with information extracted from an unstructured data feed. This is the most rudimentary way in which structured data is constructed from unstructured feeds. There is also research on discovering structure from unstructured data. Here there is no template; instead, a graph is typically constructed with nodes representing information extracts and links representing how the information fragments are related to each other.
Explain JobTracker in Hadoop
JobTracker is a JVM process in Hadoop to submit and track MapReduce jobs. JobTracker performs the following activities in Hadoop in sequence - 1. JobTracker receives jobs that a client application submits to it. 2. JobTracker notifies the NameNode to determine the data nodes. 3. JobTracker allocates TaskTracker nodes based on available slots. 4. It submits the work to the allocated TaskTracker nodes. 5. JobTracker monitors the TaskTracker nodes. 6. When a task fails, JobTracker is notified and decides how to reallocate the task.
How can you achieve security in Hadoop?
Kerberos is used to achieve security in Hadoop. At a high level, there are 3 steps to access a service while using Kerberos, each involving a message exchange with a server. 1. Authentication - The first step is authentication of the client to the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client. 2. Authorization - In this step, the client uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server). 3. Service Request - This is the final step to achieve security in Hadoop. The client uses the service ticket to authenticate itself to the server.
Explain Spark RDD vs Hadoop
Let's look at the features of Resilient Distributed Datasets in the below explanation: - In Hadoop, we store the data as blocks on different data nodes. In Spark, instead of following the above approach, we make partitions of the RDDs and store them in worker nodes (datanodes), where they are computed in parallel across all the nodes. - In Hadoop, we need to replicate the data for fault recovery; in Spark, replication is not required because RDDs can be recomputed. - RDDs load the data for us and are resilient, which means they can be recomputed. - RDDs support two types of operations: Transformations, which create a new dataset from the previous RDD, and Actions, which return a value to the driver program after performing a computation on the dataset. - RDDs keep track of the transformations applied to them (and can be checkpointed periodically), so if a node fails, the lost RDD partitions can be rebuilt on the other nodes in parallel.
Do you prefer good data or good models? Why?
Many companies want to follow a strict process of evaluating data, meaning they have already selected data models. In this case, having good data can be game-changing. The other way around also works, as a model is chosen based on good data. Answer this from your experience. However, don't say that having both good data and good models is important, as it is hard to have both in real-life projects.
What does the mv HDFS command do?
This basic HDFS command moves the file or directory indicated by the source to destination, within HDFS. example: hadoop fs -mv /user/dataflair/dir1/purchases.txt /user/dataflair/dir2
Why is Hadoop used for Big Data Analytics?
Since data analysis has become one of the key parameters of business, enterprises are dealing with massive amounts of structured, unstructured and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major part with its capabilities of storage, processing, and data collection. Moreover, Hadoop is open source and runs on commodity hardware, hence it is a cost-effective solution for businesses.
What is the difference between "HDFS Block" and "Input Split"?
HDFS physically divides the input data into blocks for processing; each of these is known as an HDFS Block. An Input Split is the logical division of the data that a Mapper works on during the mapping operation.
Explain the steps to be followed to deploy a Big Data solution.
i. Data Ingestion - The first step in deploying a big data solution is data ingestion, i.e. extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or any other log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or real-time streaming, and the extracted data is then stored in HDFS. The overall flow is: Data Ingestion -> Data Storage -> Data Processing. ii. Data Storage - After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (i.e. HBase). HDFS storage works well for sequential access whereas HBase suits random read/write access. iii. Data Processing - The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc.
How do you recover a NameNode when it is down?
The following steps need to be executed to get the Hadoop cluster up and running: 1. Use the FsImage, which is the file system metadata replica, to start a new NameNode. 2. Configure the DataNodes and also the clients so that they acknowledge the newly started NameNode. 3. Once the new NameNode completes loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start serving clients. In the case of large Hadoop clusters, the NameNode recovery process consumes a lot of time, which turns out to be a significant challenge during routine maintenance. The FsImage is a point-in-time snapshot of HDFS's namespace.
What are the configuration parameters in a "MapReduce" program?
The main configuration parameters in the "MapReduce" framework are: 1. the input locations of jobs in the distributed file system, 2. the output location of jobs in the distributed file system, 3. the input format of data, 4. the output format of data, 5. the class which contains the map function, 6. the class which contains the reduce function, and 7. the JAR file which contains the mapper, reducer and driver classes.
Describe repartition() in Spark.
The repartition algorithm does a full shuffle and creates new partitions with data that's distributed evenly. Let's create a DataFrame with the numbers from 1 to 12. val x = (1 to 12).toList val numbersDf = x.toDF("number") numbersDf contains 4 partitions on my machine. numbersDf.rdd.partitions.size // => 4 Here is how the data is divided on the partitions: Partition 00000: 1, 2, 3 Partition 00001: 4, 5, 6 Partition 00002: 7, 8, 9 Partition 00003: 10, 11, 12 Let's do a full-shuffle with the repartition method and get this data on two nodes. val numbersDfR = numbersDf.repartition(2) Here is how the numbersDfR data is partitioned on my machine: Partition A: 1, 3, 4, 6, 7, 9, 10, 12 Partition B: 2, 5, 8, 11 The repartition method makes new partitions and evenly distributes the data in the new partitions (the data distribution is more even for larger data sets).
Define respective components of HDFS
The two main components of HDFS are - 1. NameNode - This is the master node, which manages the metadata information for the data blocks within HDFS. 2. DataNode/Slave node - This is the slave node, which stores the actual data and serves read/write requests from clients. In addition to the NameNode serving client requests, an HDFS cluster can run either of the two following helper nodes - CheckpointNode - It runs on a different host from the NameNode. - BackupNode - It is a read-only NameNode which contains file system metadata information excluding the block locations.
How does Hadoop MapReduce works?
There are two phases in a MapReduce operation. 1. Map phase - In this phase, the input data is split and processed by map tasks, which run in parallel; this split data is used for the analysis. 2. Reduce phase - In this phase, the intermediate output produced by the map tasks is aggregated across the entire collection, and the result is produced.
What does the getmerge HDFS command do?
This HDFS basic command retrieves all files that match the source path entered by the user in HDFS, and merges them into one single file in the local file system identified by the local destination. example: hdfs dfs -getmerge /user/dataflair/dir2/sample /home/dataflair/Desktop
What does the getfattr HDFS command do?
This HDFS file system command displays the extended attribute names and values, if any, for a file or directory. Options: -R: It recursively lists the attributes for all files and directories. -n name: It displays the named extended attribute value. -d: It displays all the extended attribute values associated with the pathname. -e encoding: It encodes values after extracting them. The valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes (" "), while values encoded as hexadecimal and base64 are prefixed with 0x and 0s respectively. path: The file or directory. example: hadoop fs -getfattr -d /user/dataflair/dir1/sample
What does the cp HDFS command do?
This Hadoop File system shell command copies the file or directory identified by the source to destination, within HDFS. example: hadoop fs -cp /user/dataflair/dir2/purchases.txt /user/dataflair/dir1
What does the copyFromLocal HDFS command do?
This hadoop shell command is similar to put command, but the source is restricted to a local file reference. example: hdfs dfs -copyFromLocal /home/dataflair/Desktop/sample /user/dataflair/dir1
What are actions and transformations in Spark?
Transformations create new RDDs from existing RDDs; they are lazy and will not be executed until you call an action. E.g.: map(), filter(), flatMap(), etc. Actions return results computed from an RDD. E.g.: reduce(), count(), collect(), etc.
Mention some Transformations and Actions.
Transformations: map(), filter(), flatMap() Actions: reduce(), count(), collect()
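A compact illustration of both, assuming an existing SparkContext sc:
val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)   // transformation: lazily builds a new RDD
val total = evens.reduce(_ + _)       // action: triggers execution and returns 30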