Big Data

What does the copyToLocal HDFS command do?

Similar to the get command, except that the destination is restricted to a local file reference. example: hdfs dfs -copyToLocal /user/dataflair/dir1/sample /home/dataflair/Desktop

Describe structured data vs. unstructured data

Structured data is comprised of clearly defined data types whose pattern makes them easily searchable; while unstructured data - "everything else" - is comprised of data that is usually not as easily searchable, including formats like audio, video, and social media postings.
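A minimal Python sketch of the difference, using made-up records (the names orders, posts, total_for, and posts_mentioning are illustrative assumptions, not part of any library). Structured records can be queried by field name, while free text can only be searched on a best-effort basis.

```python
import re

# Structured: every record has the same named, typed fields.
orders = [
    {"order_id": 1, "customer": "alice", "total": 30.0},
    {"order_id": 2, "customer": "bob", "total": 12.5},
    {"order_id": 3, "customer": "alice", "total": 7.5},
]

# Unstructured: free text with no schema.
posts = [
    "Alice here: just spent 30 dollars, totally worth it!",
    "bob says the checkout page was slow today",
]

def total_for(customer):
    """Exact lookup by field name, which is trivial on structured data."""
    return sum(o["total"] for o in orders if o["customer"] == customer)

def posts_mentioning(name):
    """Best-effort text search, the only option when there is no schema."""
    pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
    return [p for p in posts if pattern.search(p)]
```

The structured query returns an exact answer, whereas the text search only finds posts that happen to spell the name out.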

What does the getfacl HDFS command do?

This Apache Hadoop command shows the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, getfacl displays it as well.
Options:
-R: Lists the ACLs of all files and directories recursively.
<path>: File or directory to list.
examples:
hadoop fs -getfacl /user/dataflair/dir1/sample
hadoop fs -getfacl -R /user/dataflair/dir1

What does the getfattr HDFS command do?

This HDFS file system command displays the extended attribute names and values (if any) for a file or directory.
Options:
-R: Recursively lists the attributes for all files and directories.
-n name: Displays the named extended attribute value.
-d: Displays all the extended attribute values associated with the pathname.
-e encoding: Encodes values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes (" "), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
path: The file or directory.
example: hadoop fs -getfattr -d /user/dataflair/dir1/sample
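The three -e encodings can be illustrated with a short Python sketch. It only mimics the output conventions described above (quotes, 0x, 0s); it is not Hadoop's own code, and encode_xattr is a hypothetical helper name. The letter case of the hex digits is an assumption.

```python
import base64

def encode_xattr(value: bytes, encoding: str) -> str:
    """Format an extended attribute value in the style described for -e."""
    if encoding == "text":
        # Text values are shown inside double quotes.
        return '"' + value.decode("utf-8") + '"'
    if encoding == "hex":
        # Hexadecimal values carry the 0x prefix.
        return "0x" + value.hex()
    if encoding == "base64":
        # Base64 values carry the 0s prefix.
        return "0s" + base64.b64encode(value).decode("ascii")
    raise ValueError(f"unknown encoding: {encoding}")

for enc in ("text", "hex", "base64"):
    print(enc, encode_xattr(b"hi", enc))
```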

What does the mv HDFS command do?

This basic HDFS command moves the file or directory indicated by the source to the destination, within HDFS. example: hadoop fs -mv /user/dataflair/dir1/purchases.txt /user/dataflair/dir2

What is fsck?

fsck stands for File System Check. It is an HDFS command that checks the file system for inconsistencies and reports problems with files. For example, if any blocks of a file are missing, fsck reports them. Unlike the fsck utility of native file systems, the HDFS version only reports the problems it detects and does not correct them.
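The idea of a missing-block check can be sketched in Python. This is a toy model under stated assumptions, not how HDFS implements fsck: file_blocks stands in for the file-to-block metadata, datanode_blocks for what each DataNode reports holding, and find_missing_blocks is a hypothetical name.

```python
def find_missing_blocks(file_blocks, datanode_blocks):
    """Return {path: [block ids with no live replica]} for affected files."""
    live = set()
    for blocks in datanode_blocks.values():
        live.update(blocks)

    report = {}
    for path, blocks in file_blocks.items():
        missing = [b for b in blocks if b not in live]
        if missing:
            report[path] = missing
    return report

# Hypothetical cluster state: block b2 is not held by any DataNode.
file_blocks = {"/a": ["b1", "b2"], "/b": ["b3"]}
datanode_blocks = {"dn1": {"b1"}, "dn2": {"b1", "b3"}}
print(find_missing_blocks(file_blocks, datanode_blocks))
```

Like fsck, the sketch only reports the problem; it does not attempt a repair.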

How do you list the contents of a directory in HDFS?

hdfs dfs -ls /user/dataflair/dir1

What is an entity relationship model?

An ER model describes interrelated things of interest in a specific domain of knowledge. A basic ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between entities (instances of those entity types).
Three levels of abstraction:
- Physical layer: how data is stored on hardware (actual bytes, files on disk, etc.)
- Logical layer: how data is stored in the database (types of records, relationships, etc.)
- View layer: how applications access data (hiding record details, more convenience, etc.)
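Entity types and relationships can be sketched with Python dataclasses. The Customer, Order, and Places names are invented for illustration; an ER model is normally drawn as a diagram rather than written as code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    """Entity type: classifies one kind of thing of interest."""
    customer_id: int
    name: str

@dataclass(frozen=True)
class Order:
    """A second entity type."""
    order_id: int
    total: float

@dataclass(frozen=True)
class Places:
    """Relationship: links one Customer entity to one Order entity."""
    customer: Customer
    order: Order

def orders_of(customer, relationships):
    """Follow the Places relationship from a customer to that customer's orders."""
    return [r.order for r in relationships if r.customer == customer]

# Entities are instances of the entity types.
alice = Customer(1, "Alice")
bob = Customer(2, "Bob")
placed = [
    Places(alice, Order(10, 30.0)),
    Places(bob, Order(11, 12.5)),
    Places(alice, Order(12, 7.5)),
]
```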

What is the command to format the NameNode?

$ hdfs namenode -format

How do you create a directory in HDFS?

hdfs dfs -mkdir /user/dataflair/dir1

How do you copy the file or directory from the local file system to the destination in HDFS?

hdfs dfs -put /home/dataflair/Desktop/sample /user/dataflair/dir1

What do you know about the term "Big Data"?

Big Data is a term for datasets so large and complex that traditional relational databases struggle to handle them, which is why specialized tools and methods are used to operate on such vast collections of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. It also allows companies to make better business decisions backed by data.

How is big data analysis helpful in increasing business revenue?

Big data analysis has become very important for businesses. It helps them differentiate themselves from competitors and increase revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. It also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, which is why companies adopt big data analytics. Companies may see a revenue increase of 5-20% by implementing big data analytics. Popular companies that use big data analytics to increase their revenue include Walmart, LinkedIn, Facebook, Twitter, and Bank of America.

Tell us how big data and Hadoop are related to each other.

Big data and Hadoop are closely related, and the two terms are often used almost synonymously. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make decisions.

Do you prefer good data or good models? Why?

Many companies want to follow a strict process for evaluating data, which means they have already selected their data models. In this case, having good data can be game-changing. The other way around also works, since a model is chosen based on good data. Answer from your own experience. However, don't say that having both good data and good models is important, as it is hard to have both in real-life projects.

Why is Hadoop used for Big Data Analytics?

Data analysis has become one of the key parameters of business, so enterprises are dealing with massive amounts of structured, unstructured, and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major part with its capabilities for:
- Storage
- Processing
- Data collection
Moreover, Hadoop is open source and runs on commodity hardware, which makes it a cost-effective solution for businesses.

Explain the steps to be followed to deploy a Big Data solution.

Data Ingestion -> Data Storage -> Data Processing
i. Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or through real-time streaming.
ii. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.
iii. Data Processing
The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks such as Spark, MapReduce, or Pig.
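The three steps can be sketched as a tiny in-memory pipeline in Python. Everything here is a stand-in chosen for illustration: the sources dictionary plays the role of a CRM and log files, a plain list replaces HDFS or HBase, and a word count replaces a Spark or MapReduce job.

```python
from collections import Counter

def ingest(sources):
    """Step i: pull raw records from every source (batch style)."""
    for name, records in sources.items():
        for record in records:
            yield {"source": name, "text": record}

def store(records):
    """Step ii: persist the records (a list stands in for HDFS or HBase)."""
    return list(records)

def process(stored):
    """Step iii: run a job over the stored data (here, a word count)."""
    counts = Counter()
    for rec in stored:
        counts.update(rec["text"].lower().split())
    return counts

# Hypothetical input data.
sources = {"crm": ["new lead created"], "logs": ["error new session"]}
stored = store(ingest(sources))
word_counts = process(stored)
```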

What are the main differences between NAS (Network-attached storage) and HDFS?

The main differences between NAS (Network-attached storage) and HDFS:
- HDFS runs on a cluster of machines, while NAS runs on an individual machine. Because HDFS replicates data blocks across the machines of the cluster, redundant copies of data are common in HDFS. NAS uses a different replication protocol, so the chances of data redundancy are much lower.
- In HDFS, data is stored as data blocks on the local drives of the cluster machines. In NAS, it is stored on dedicated hardware.

Define respective components of HDFS

The two main components of HDFS are:
1. NameNode: the master node, which processes metadata information for the data blocks within HDFS.
2. DataNode (slave node): the node that acts as a slave and stores the data, for processing and for use by the NameNode.
In addition to serving client requests, the NameNode can execute either of the following two roles:
- CheckpointNode: runs on a different host from the NameNode.
- BackupNode: a read-only NameNode that contains the file system metadata, excluding the block locations.
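The split between metadata and data can be sketched in Python as a toy model. This is not the HDFS implementation: the put and get helpers, the round-robin placement, and the tiny block_size and replication values are all assumptions made for illustration.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""
    def __init__(self):
        self.files = {}  # path -> list of (block_id, [DataNode names])

def put(namenode, datanodes, path, data, block_size=4, replication=2):
    """Split data into blocks, copy each block to several DataNodes,
    and record the block locations in the NameNode."""
    entries = []
    n = len(datanodes)
    for index, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#{index}"
        chunk = data[offset:offset + block_size]
        # Round-robin placement (an assumption of this sketch).
        targets = [datanodes[(index + r) % n] for r in range(replication)]
        for dn in targets:
            dn.blocks[block_id] = chunk
        entries.append((block_id, [dn.name for dn in targets]))
    namenode.files[path] = entries

def get(namenode, datanodes, path):
    """Reassemble a file, using any surviving replica of each block."""
    by_name = {dn.name: dn for dn in datanodes}
    data = b""
    for block_id, holders in namenode.files[path]:
        replica = next(
            (by_name[h].blocks[block_id] for h in holders
             if block_id in by_name[h].blocks),
            None,
        )
        if replica is None:
            raise IOError(f"missing block {block_id}")
        data += replica
    return data
```

Because every block is held by two DataNodes in this sketch, the file can still be read after one DataNode loses its blocks.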

Define respective components of YARN

The two main components of YARN are:
1. ResourceManager: receives processing requests and allocates them to the respective NodeManagers according to processing needs.
2. NodeManager: runs on every DataNode and executes the tasks there.
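The division of labor between the two components can be sketched in Python. This is a simplified model, not YARN's scheduler: the memory-only accounting and the "most free memory first" placement rule are assumptions of the sketch.

```python
class NodeManager:
    """Runs on one worker node and executes the tasks assigned to it."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.tasks = []

    def launch(self, task, memory_mb):
        self.free_mb -= memory_mb
        self.tasks.append(task)

class ResourceManager:
    """Receives requests and hands them to a NodeManager with capacity."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, task, memory_mb):
        candidates = [nm for nm in self.node_managers
                      if nm.free_mb >= memory_mb]
        if not candidates:
            return None  # no node can run the task right now
        # Placement rule assumed for this sketch: most free memory wins.
        chosen = max(candidates, key=lambda nm: nm.free_mb)
        chosen.launch(task, memory_mb)
        return chosen.name
```

A request that no NodeManager can fit is simply rejected here; a real scheduler would queue it.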

What does the getmerge HDFS command do?

This basic HDFS command retrieves all files in HDFS that match the source path entered by the user and copies them into a single merged file in the local file system, identified by the local destination. example: hdfs dfs -getmerge /user/dataflair/dir2/sample /home/dataflair/Desktop
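The merge step can be sketched on the local file system in Python. This only imitates the idea of concatenating matching files into one; merge_files is a hypothetical helper, it does not talk to HDFS, and the sorted order is an assumption of the sketch.

```python
import glob
import os
import tempfile

def merge_files(src_pattern, local_dst):
    """Concatenate every file matching src_pattern into local_dst.
    Returns the number of files merged."""
    # Resolve the matches first so the destination is never one of them.
    paths = sorted(glob.glob(src_pattern))
    with open(local_dst, "wb") as out:
        for path in paths:
            with open(path, "rb") as f:
                out.write(f.read())
    return len(paths)

# Usage with two throwaway part files.
with tempfile.TemporaryDirectory() as d:
    for name, text in [("part-0", b"a\n"), ("part-1", b"b\n")]:
        with open(os.path.join(d, name), "wb") as f:
            f.write(text)
    merged_count = merge_files(os.path.join(d, "part-*"),
                               os.path.join(d, "merged.txt"))
    with open(os.path.join(d, "merged.txt"), "rb") as f:
        merged_bytes = f.read()
```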

What does the get HDFS command do?

This HDFS fs command copies the file or directory in HDFS identified by the source to the local file system path identified by a local destination. example: hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop

What does the cp HDFS command do?

This Hadoop file system shell command copies the file or directory identified by the source to the destination, within HDFS. example: hadoop fs -cp /user/dataflair/dir2/purchases.txt /user/dataflair/dir1

What does the cat HDFS command do?

This Hadoop fs shell command prints the contents of the file to the console (stdout). example: hdfs dfs -cat /user/dataflair/dir1/sample

What does the copyFromLocal HDFS command do?

This Hadoop shell command is similar to the put command, except that the source is restricted to a local file reference. example: hdfs dfs -copyFromLocal /home/dataflair/Desktop/sample /user/dataflair/dir1

What does dfs mean in the hdfs command prompt? e.g. hdfs dfs

The dfs argument runs a filesystem command on the file systems supported in Hadoop.

