Hadoop & HDFS

What steps are taken if the primary NameNode fails?

1.) Use the file system metadata replica (FsImage) to start a new NameNode. 2.) Configure the DataNodes and clients so that they recognize the newly started NameNode. 3.) The new NameNode starts serving clients once it has finished loading the last checkpoint FsImage (for metadata information) and has received enough block reports from the DataNodes.

What is the Secondary NameNode? Is it a substitute or backup node for the NameNode?

The Secondary NameNode is a helper daemon that performs checkpointing in HDFS. No, it is not a backup or a substitute node for the NameNode. It periodically takes the edit logs (metadata file) from the NameNode and merges them with the FsImage (File system Image) to produce an updated FsImage and to prevent the edit logs from becoming too large.

Who is the 'user' in HDFS?

Anyone who tries to retrieve data stored in HDFS is a user. The client is not the end user but an application that uses the JobTracker and TaskTracker to retrieve data.

What is the problem in having lots of small files in HDFS?

The NameNode stores the file system metadata in RAM, so the amount of memory available limits the number of files the HDFS file system can hold. In other words, too many files generate too much metadata, and keeping all of that metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block or directory takes about 150 bytes.
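
The rule of thumb above can be turned into a rough estimate. The following is a minimal sketch, assuming one metadata object per file, per block and per directory at 150 bytes each; the function name and parameters are made up for illustration and real NameNode heap usage will differ.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted in the answer above


def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate metadata size: one object per file, per block and per directory."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


# Ten million small files, each small enough to fit in a single block.
many_small = namenode_memory_bytes(10_000_000)

# Ten thousand large files of eight blocks each (hypothetical comparison).
few_large = namenode_memory_bytes(10_000, blocks_per_file=8)

print(f"small files: {many_small / 1024**2:.0f} MiB of metadata")
print(f"large files: {few_large / 1024**2:.0f} MiB of metadata")
```

The comparison shows why the file count, not the volume of data, is what pressures the NameNode.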

What is meant by 'commodity hardware'? Can Hadoop work on them?

Average, inexpensive systems are known as commodity hardware, and Hadoop can be installed on any of them. Hadoop does not require high-end hardware to function.

What is a block scanner in HDFS?

The block scanner runs periodically on every DataNode to verify whether the stored data blocks are correct. The following steps occur when the block scanner detects a corrupted data block: First, the DataNode reports the corrupted block to the NameNode. Then, the NameNode starts creating a new replica from a correct replica of the block held on another DataNode. The corrupted data block is not deleted until the number of correct replicas matches the replication factor (3 by default).

Can blocks be broken down by HDFS if a machine does not have the capacity to copy as many blocks as the user wants?

Blocks in HDFS cannot be broken down. The master node calculates the required space and decides how data is transferred to a machine with less free space.

What is checkpointing in Hadoop?

Checkpointing is the process of combining the Edit Logs with the FsImage (File system Image). It is performed by the Secondary NameNode (or by the StandBy NameNode in a High Availability setup).
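
A toy model can make the merge concrete. The following is a minimal sketch, assuming the FsImage is a dictionary from path to metadata and the edit log is a list of operation tuples; both representations and the operation names are invented for illustration and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay the edit log on a copy of the FsImage; return (new image, empty log)."""
    namespace = dict(fsimage)  # leave the previous checkpoint untouched
    for op, *args in edit_log:
        if op == "create":
            path, meta = args
            namespace[path] = meta
        elif op == "delete":
            namespace.pop(args[0], None)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
    return namespace, []  # the merged edits no longer need to be kept


old_image = {"/a": {"replication": 3}}
edits = [
    ("create", "/b", {"replication": 3}),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, remaining_edits = checkpoint(old_image, edits)
print(new_image)
```

After the merge the edit log starts empty again, which is what keeps it from growing without bound.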

Define data integrity. How does HDFS ensure the integrity of the data blocks it stores?

Data integrity refers to the correctness of the data. It is very important to have an assurance that the data stored in HDFS is correct, yet there is always a slight chance that data gets corrupted during I/O operations on the disk. By default, HDFS creates a checksum for all the data written to it and verifies the data against that checksum during read operations. In addition, each DataNode periodically runs a block scanner, which verifies the correctness of the data blocks stored in HDFS.
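
The checksum idea can be sketched in a few lines. This is only an illustration: it uses plain CRC32 from Python's zlib module over fixed-size chunks, and both the 512-byte chunk size and the helper names are assumptions chosen for the example rather than a description of the HDFS implementation.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def chunk_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Compute one CRC32 value per chunk, as done when the data is written."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def corrupted_chunks(data, expected, chunk_size=BYTES_PER_CHECKSUM):
    """Recompute checksums on read and return the indexes that do not match."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 6            # 1536 bytes, three chunks
stored_sums = chunk_checksums(original)     # saved alongside the data

damaged = original[:600] + b"\xff" + original[601:]  # flip one byte in chunk 1
print(corrupted_chunks(original, stored_sums))
print(corrupted_chunks(damaged, stored_sums))
```

A reader that finds a mismatching chunk would then fetch the block from another replica.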

What is a DataNode?

DataNodes are the slave nodes in HDFS. A DataNode runs on commodity hardware and provides storage for the data. It serves the read and write requests of HDFS clients.

Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. How many blocks will be created in total, and what will be the size of each block?

The default block size in Hadoop 2.x is 128 MB. So, a file of size 514 MB will be divided into 5 blocks (514 MB / 128 MB, rounded up), where the first four blocks are 128 MB each and the last block is only 2 MB. Since we are using the default replication factor of 3, each block will be stored three times. Therefore, we will have 15 blocks in total: 12 blocks of 128 MB each and 3 blocks of 2 MB each.
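
The arithmetic above can be checked with a short script. This is a minimal sketch with made-up function names; sizes are in whole megabytes for simplicity.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def total_replicas(file_size_mb, block_size_mb=128, replication=3):
    """Count the stored block replicas across the cluster."""
    return len(split_into_blocks(file_size_mb, block_size_mb)) * replication


print(split_into_blocks(514))  # [128, 128, 128, 128, 2]
print(total_replicas(514))     # 15
```

Note that the last block occupies only 2 MB on disk; a block is not padded to the full block size.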

There are two files associated with metadata present in the NameNode, what are they?

FsImage: It contains the complete state of the file system namespace since the start of the NameNode. EditLogs: It contains all the recent modifications made to the file system relative to the most recent FsImage.

What is HDFS?

HDFS is a file system used to store large data files. It handles streaming data access and runs on clusters of commodity hardware.

HDFS stores data using commodity hardware, which has a higher chance of failure. So, how does HDFS ensure the fault tolerance of the system?

HDFS provides fault tolerance by replicating the data blocks and distributing them among different DataNodes across the cluster. By default, the replication factor is set to 3, and it is configurable. So, if I store a file of 1 GB in HDFS with the default replication factor of 3, it will occupy a total of 3 GB because of the replication. Now, even if a DataNode fails or a data block gets corrupted, I can retrieve the data from the other replicas stored on different DataNodes.

What is meant by streaming access?

HDFS works on the principle of "write once, read many", and the focus is on fast and accurate data retrieval. Streaming access refers to reading the complete data set rather than retrieving a single record from it.

What is the difference between Hadoop 1.x and Hadoop 2.x?

Hadoop 1.x supported only one NameNode, which made it a single point of failure, and used the JobTracker and TaskTracker daemons (on the master and worker nodes respectively) for job scheduling and execution. Hadoop 2.x removes the single point of failure with an active and a StandBy NameNode, and replaces the JobTracker and TaskTracker with the ResourceManager and NodeManager.

What is a heartbeat in HDFS?

Heartbeats in HDFS are the signals that DataNodes send to the NameNode to indicate that they are functioning properly (alive). If the signal is not received, it indicates a problem with that DataNode. By default, the heartbeat interval is 3 seconds, which can be configured using dfs.heartbeat.interval in hdfs-site.xml.
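
The bookkeeping behind heartbeats can be sketched as follows. This is a simplified model, not NameNode code: the class name is invented, timestamps are plain numbers in seconds, and the timeout of ten missed heartbeats is an arbitrary value picked for the example.

```python
HEARTBEAT_INTERVAL_S = 3  # default interval mentioned in the answer above


class HeartbeatMonitor:
    """Track the last heartbeat per DataNode and flag nodes that went silent."""

    def __init__(self, timeout_s=10 * HEARTBEAT_INTERVAL_S):
        self.timeout_s = timeout_s  # hypothetical threshold for this sketch
        self.last_seen = {}

    def record(self, node, now):
        """Called whenever a heartbeat arrives from `node` at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.record("dn1", now=0)
monitor.record("dn2", now=0)
monitor.record("dn1", now=30)       # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=31))   # ['dn2']
```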

What is the difference between NAS (Network Attached Storage) and HDFS?

Here are the key differences between NAS and HDFS: Network-attached storage (NAS) is a file-level data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides a service for storing and accessing files, whereas the Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity hardware. In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS, data is stored on dedicated hardware. HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computation. HDFS uses commodity hardware, which is cost effective, whereas NAS relies on high-end storage devices that come at a high cost.

How does the NameNode determine which DataNode to write to?

The NameNode holds metadata about all the DataNodes and uses it to decide which DataNode will store the data.

What do you mean by the High Availability of a NameNode? How is it achieved?

The NameNode used to be a single point of failure in Hadoop 1.x, where the whole Hadoop cluster became unavailable as soon as the NameNode went down. High Availability of the NameNode refers to keeping a NameNode active at all times to serve the requests of Hadoop clients. To solve the single point of failure problem, the HA feature was introduced in Hadoop 2.x, where the HDFS cluster has two NameNodes in an active/passive configuration. Hence, if the active NameNode fails, the passive NameNode can take over its responsibilities and keep HDFS up and running.

Would the calculations made on one node be replicated to others in HDFS?

No. The calculation is made on the original node only. Only if that node fails does the master node rerun the calculation on a second node.

Can you modify the file present in HDFS?

No, I cannot modify the files already present in HDFS, as HDFS follows Write Once Read Many model. But, I can always append data into the existing HDFS file.

Can multiple clients write into an HDFS file concurrently?

No, multiple clients can't write into an HDFS file concurrently. HDFS follows single writer multiple reader model. The client which opens a file for writing is granted a lease by the NameNode. Now suppose, in the meanwhile, some other client wants to write into that very file and asks NameNode for the write permission. At first, the NameNode will check whether the lease for writing into that very particular file has been granted to someone else or not. Then, it will reject the write request of the other client if the lease has been acquired by someone else, who is currently writing into the very file.
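
The lease check described above can be modelled in a few lines. This is a toy sketch: the class and method names are hypothetical, and lease expiry and renewal are left out.

```python
class LeaseManager:
    """Grant at most one write lease per file path."""

    def __init__(self):
        self._holders = {}  # path -> client currently holding the lease

    def open_for_write(self, path, client):
        """Grant the lease if it is free or already held by this client."""
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            return False  # another client is writing, so reject the request
        self._holders[path] = client
        return True

    def close(self, path, client):
        """Release the lease when the writer closes the file."""
        if self._holders.get(path) == client:
            del self._holders[path]


leases = LeaseManager()
print(leases.open_for_write("/logs/app.log", "client-A"))  # True
print(leases.open_for_write("/logs/app.log", "client-B"))  # False
leases.close("/logs/app.log", "client-A")
print(leases.open_for_write("/logs/app.log", "client-B"))  # True
```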

What is a rack awareness algorithm and why is it used in Hadoop?

The Rack Awareness algorithm in Hadoop ensures that the replicas of a block are not all stored on a single rack. With a replication factor of 3, the Rack Awareness algorithm says that the first replica of a block is stored on the local rack and the next two replicas are stored on a different (remote) rack, each on a different DataNode within that rack. There are two reasons for using Rack Awareness: To improve network performance: In general, there is greater network bandwidth between machines in the same rack than between machines in different racks. So, Rack Awareness helps reduce write traffic between racks and thus provides better write performance. To prevent loss of data: I don't have to worry about the data even if an entire rack fails because of a switch failure or power failure. As the saying goes, never put all your eggs in one basket.
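
The placement rule described above can be sketched like this. It is a deterministic simplification for illustration only: the topology dictionary and function name are invented, the writer is assumed to run on a DataNode, and the remote rack and its nodes are picked in sorted order rather than by the selection logic a real cluster uses.

```python
def place_replicas(topology, writer_node):
    """Pick three targets: the writer's node plus two nodes on one remote rack."""
    # Find the rack that contains the writer; the first replica stays local.
    local_rack = next(
        rack for rack, nodes in topology.items() if writer_node in nodes
    )
    # Choose a different rack that can hold two replicas on distinct nodes.
    remote_rack = next(
        rack for rack in sorted(topology)
        if rack != local_rack and len(topology[rack]) >= 2
    )
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]


cluster = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5"],
}
print(place_replicas(cluster, "dn1"))  # ['dn1', 'dn3', 'dn4']
```

With this layout, losing either rack still leaves at least one replica, and only one copy of the block crosses the rack boundary during the write.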

What is Rack Awareness?

Rack Awareness is the process that Hadoop follows to ensure fault tolerance at the block level. HDFS places a replica of a block on a different rack from the one holding the other replicas of that block, so that if an entire rack fails, at least one copy of the block is still available.

What is a rack in HDFS?

A rack is a physical collection of DataNodes housed together in a single location.

What are the key features of HDFS?

Some of the prominent features of HDFS are as follows:
- Cost effective and scalable: HDFS is generally deployed on commodity hardware, so it is very economical in terms of the cost of ownership of the project. Also, one can scale the cluster by adding more nodes.
- Variety and volume of data: HDFS is all about storing huge amounts of data (terabytes and petabytes) of different kinds. So, I can store any type of data in HDFS, be it structured, unstructured or semi-structured.
- Reliability and fault tolerance: HDFS divides the given data into data blocks, replicates them and stores them in a distributed fashion across the Hadoop cluster. This makes HDFS very reliable and fault tolerant.
- High throughput: Throughput is the amount of work done in a unit of time. HDFS provides high-throughput access to application data.

Why is replication pursued in HDFS even though it causes data redundancy?

Systems with an average configuration can crash at any time. HDFS replicates data and stores it at three different locations, which makes the system highly fault tolerant. If the data at one location becomes corrupt or inaccessible, it can be retrieved from another location.

What is a NameNode in Hadoop?

The NameNode is the master node that manages all the DataNodes (slave nodes). It records the metadata for all the files stored in the cluster (on the DataNodes), e.g. the location of the stored blocks, the size of the files, permissions, hierarchy, etc.

What is the Secondary NameNode?

The Secondary NameNode helps the primary NameNode by periodically merging the changes in the log file (journal) into the fsimage, so that the NameNode does not have to replay a large number of changes when the cluster is restarted.

What is the StandBy NameNode?

The StandBy NameNode is the "Hot Swap" NameNode. Its job is to take over as the primary NameNode if the primary NameNode fails. This is to ensure that there is no Single Point of Failure, so that operations can continue seamlessly.

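What is YARN Federation / Hadoop Federation?

HDFS Federation is a feature added in Hadoop 2.x that allows a cluster to use multiple independent NameNodes, each managing its own namespace. YARN Federation is a separate feature that joins multiple YARN sub-clusters into one larger cluster.

How is data or a file written into HDFS?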

The best way to answer this question is to take an example of a client and list the steps that will happen while performing the write without going into much of the details: Suppose a client wants to write a file into HDFS. So, the following steps will be performed internally during the whole HDFS write process: The client will divide the files into blocks and will send a write request to the NameNode. For each block, the NameNode will provide the client a list containing the IP address of DataNodes (depending on replication factor, 3 by default) where the data block has to be copied eventually. The client will copy the first block into the first DataNode and then the other copies of the block will be replicated by the DataNodes themselves in a sequential manner.
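
The last step, where the client writes to one DataNode and the DataNodes pass the block along, can be illustrated with a toy pipeline. This sketch uses an in-memory dictionary in place of real DataNodes, and the function name and acknowledgement counting are assumptions made for the example.

```python
def write_block(block_id, pipeline, storage):
    """Store the block on the first node, which forwards it down the pipeline.

    Returns the number of acknowledgements collected along the pipeline.
    """
    if not pipeline:
        return 0
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block_id)      # this node stores a copy
    return 1 + write_block(block_id, rest, storage)    # then forwards downstream


datanodes = {}  # node name -> list of block ids it holds
targets = ["dn1", "dn3", "dn4"]  # list the NameNode would hand to the client
acks = write_block("blk_0001", targets, datanodes)
print(acks)       # 3
print(datanodes)  # each target now holds blk_0001
```

The client sends the data only once; the remaining copies travel between DataNodes.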

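How does the client communicate with the NameNode and DataNodes in HDFS?

Clients communicate with the NameNode through RPC calls over TCP/IP and exchange block data with the DataNodes through a streaming data transfer protocol, also over TCP. SSH is used only by the cluster start and stop scripts to launch daemons on remote machines, not for client communication.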

What do you mean by meta data in HDFS? List the files associated with metadata.

The metadata in HDFS represents the structure of HDFS directories and files. It also includes information about those directories and files such as ownership, permissions, quotas, and replication factor. The files associated with the metadata are the FsImage and the EditLogs.

How would you check whether your NameNode is working or not?

There are many ways to check the status of the NameNode. Most commonly, one uses the jps command to check the status of all the daemons running in the HDFS. Alternatively, one can visit the NameNode's Web UI for the same.

What is the difference between traditional RDBMS and Hadoop?

This question seems to be very easy, but in an interview these simple questions matter a lot. Make sure to mention: Data types, processing, schema on read/write, read/write speed, best use case and cost

What is throughput? How does HDFS provide good throughput?

Throughput is the amount of work done in a unit of time. HDFS provides good throughput because: HDFS is based on the Write Once, Read Many model, which simplifies data coherency issues since data written once can't be modified, and therefore provides high-throughput data access. In Hadoop, the computation is moved towards the data, which reduces network congestion and therefore enhances the overall system throughput.

Is the NameNode machine the same as a DataNode machine in terms of hardware?

Unlike the DataNodes, a NameNode is a highly available server that manages the File System Namespace and maintains the metadata information. Therefore, NameNode requires higher RAM for storing the metadata information corresponding to the millions of HDFS files in the memory, whereas the DataNode needs to have a higher disk capacity for storing huge data sets.

How is a DataNode identified as saturated?

When a DataNode is full and has no space left, the NameNode identifies it.

Explain the HDFS Architecture and list the various HDFS daemons in HDFS cluster?

While listing the various HDFS daemons, you should also talk briefly about their roles. Here is how you should answer this question: Apache Hadoop HDFS Architecture follows a Master/Slave topology where a cluster comprises a single NameNode (Master node or daemon) and all the other nodes are DataNodes (Slave nodes or daemons). The following daemons run in an HDFS cluster:
- NameNode: It is the master daemon that maintains and manages the data blocks present on the DataNodes.
- DataNode: DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware. It is responsible for the actual storage and serves read and write requests from clients.
- Secondary NameNode: The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. It performs checkpointing.

Can you change the block size of HDFS files?

Yes, I can change the block size of HDFS files by changing the dfs.blocksize parameter in hdfs-site.xml. But, I will have to restart the cluster for this property change to take effect. The new value applies only to files written after the change; existing files keep the block size they were written with.

Can we have a different replication factor for existing files in HDFS?

Yes, one can have a different replication factor for files existing in HDFS. Suppose I have a file named test.xml stored within the sample directory in my HDFS with the replication factor set to 1. Now, the command for changing the replication factor of the test.xml file to 3 is: hadoop fs -setrep -w 3 /sample/test.xml Finally, I can check whether the replication factor has been changed by using the following command: hadoop fs -ls /sample or hadoop fsck /sample/test.xml -files

Does HDFS allow a client to read a file which is already opened for writing?

Yes, one can read a file which is already opened for writing. The problem in reading a file that is currently being written lies in the consistency of the data: HDFS does not guarantee that data written into the file will be visible to a new reader before the file has been closed. To address this, one can call the hflush operation explicitly, which pushes all the data in the buffer into the write pipeline and then waits for acknowledgements from the DataNodes. After that, the data written into the file before the hflush operation is guaranteed to be visible to readers.

What is a block?

You should begin the answer with a general definition of a block. Then, you should briefly explain the blocks present in HDFS and mention their default size. Blocks are the smallest continuous locations on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. The default size of a block in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x), which is much larger than in a Linux file system, where the block size is 4 KB. The reason for this large block size is to minimize the cost of seeks and to reduce the metadata generated per block.

