Chapter 2 - Data science
What is the term for a massive amount of data that cannot be stored, processed, or analyzed using traditional methods?
Big data
The main cause of the Industrial Revolution was the effects created by the
Agricultural revolution
What are the four basic kinds of devices?
Memory, Microprocessors, Logic, and Networks
T/F. Rural-to-urban migration is not a result of the industrial revolution
F (rural-to-urban migration was a direct result of industrialization)
In which industrial revolution was the internet introduced?
The Third Industrial Revolution
Communication between humans and machines requires an interface, which may include
- Gesture recognition - Voice recognition - Terminal
What are the goals of data analysis in the Big Data Value Chain?
- Highlighting relevant data - Synthesizing and extracting useful hidden information with high potential from a business point of view.
What are the characteristics of the Hadoop ecosystem?
- Reliability - Flexibility - Scalability
List things introduced or generated in the Second Industrial Revolution
- Telegraph, Telephone, and Electrical power - The modern lightbulb, the assembly line, the automobile, aircraft, and the construction of the transcontinental railroad
How many types of data are there? A. 2 B. 3 C. 5 D. 4
B (3: structured, semi-structured, and unstructured)
Hadoop is a framework that works with a variety of related tools. Common cohorts include: A. MapReduce, Hive and HBase B. MapReduce, MySQL and Google Apps C. MapReduce, Hummer and Iguana D. MapReduce, Heron and Trumpet
A
In the Hadoop ecosystem, which component performs data management? A. Spark and MapReduce B. Oozie, Zookeeper C. Sqoop and Flume D. Hive, Pig
A
Which of the following are the Goals of HDFS? A. Fault detection and recovery B. Huge datasets C. Hardware at data D. All of the above
All
Which of the following is not a future trend of networks A. 5G technology B. Rise of centralization C. Embedded computation D. Network developments in edge computing
B
The minimum amount of data that HDFS can read or write is called a _____________. A. Datanode B. Namenode C. Block D. None of the above
Block
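The block concept can be illustrated with a small sketch (plain Python, not the real HDFS API; HDFS's default block size is 128 MB, scaled down here to toy byte sizes purely for demonstration):

```python
# Toy illustration of splitting a file into fixed-size blocks,
# the way HDFS splits files across DataNodes (default: 128 MB).
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte string into chunks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

file_bytes = b"x" * 300          # pretend this is a 300-byte file
blocks = split_into_blocks(file_bytes, 128)
print([len(b) for b in blocks])  # [128, 128, 44]
```

Note that the last block is smaller than the block size, exactly as in HDFS, where the final block of a file only occupies as much space as it needs.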
A device that provides interfacing, data communication, signal processing and other similar key functionalities is a A. Microprocessor B. Memory device C. Logic device D. Network device
C
What is the general term that describes the delivery of on-demand services, usually through the internet, on a pay-per-use basis?
Cloud Computing
On which platform does Hadoop run? A. Bare metal B. Debian C. Cross-platform D. Unix-like
Cross-platform
According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop? A. Big data management and data mining B. Collecting and storing unstructured data C. Management of Hadoop clusters D. Data warehousing and business intelligence
D
All of the following are service enabling devices except A. Modems B. Routers C. Switches D. Hadoop
D
All of the following are the specific functions of Logic Devices except A. Control operations B. Data communication C. Signal processing D. None
D
Hadoop benefits big data users for the following reasons except: A. It can store and process vast amounts of structured, semi-structured and unstructured data, quickly B. It can support real-time analytics to help drive better operational decision-making C. It protects application and data processing against hardware failures D. It requires data to be pre-processed for storage before filtering it for specific analytic uses
D
_____ is making the raw data acquired amenable to use in decision-making as well as domain-specific usage
Data Analysis
In data value chain, the activities of ensuring that data are trustworthy, discoverable, accessible, reusable and fit their purpose is called
Data Curation
__________ covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity
Data Usage
____________ is the re-structuring or re-ordering of data by people or machines to increase their usefulness and add values for a particular purpose.
Data processing
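As a minimal sketch of that definition (the records and numbers below are made up for illustration), re-ordering raw records and deriving a summary value turns raw data into something more useful:

```python
# Data processing sketch: re-structure (sort) raw monthly sales
# records and derive a summary value, increasing their usefulness.
raw = [("2024-03", 120), ("2024-01", 80), ("2024-02", 95)]

ordered = sorted(raw)                         # re-order by month
total = sum(amount for _, amount in ordered)  # add value: a total

print(ordered[0], total)  # ('2024-01', 80) 295
```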
______ is defined as the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data
Data storage
_____ describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
Data value Chain
What is a data value chain and briefly explain the different chain activities in it
Describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse. Its main chain activities are data acquisition, data analysis, data curation, data storage, and data usage.
__ and ___are used to ingest data from external sources into HDFS
Flume and Sqoop
What is human-computer interaction, and how do human beings interact with computers?
HCI is the study of how people interact with computers and to what extent computers are or are not developed for successful interaction with human beings. The user interacts directly with hardware for human input and output, such as displays, e.g. through a graphical user interface.
Which Hadoop component is used for a distributed data storage unit?
HDFS - Hadoop Distributed File System
_______ is described as a transition to new manufacturing processes
IR 1.0
_______ IR is also known as the Technological Revolution
IR 2.0
Which IR is known as the Digital Revolution?
IR 3.0
Stage of the industrial revolution characterized by applying IoT, AI, and big data technologies to industry to allow intelligent production
Industry 4.0
Which technique in big data life cycle is used to transfer data from various sources to Hadoop?
Ingest
____ is the input, or what you tell the computer to do or save
Input
One of the following is not a common example of a data type? A. Integer B. Float C. Text D. Variable
D. Variable
What is a cyber-physical system
A mechanism that is controlled or monitored by computer-based algorithms, tightly integrated with the Internet and its users.
What are examples of unstructured data?
Audio files, video files, images, and free-form text documents (JSON and XML are semi-structured, not unstructured)
What are the specific functions of Logic Devices
Logic devices provide specific functions, including device-to-device interfacing, data communication, signal processing, data display, timing and control operations
Which component of Hadoop is used for programming based data processing? A. HIVE B. Spark MLLib C. MapReduce D. Oozie
MapReduce
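The MapReduce model can be sketched in a few lines (plain Python standing in for the Java API that Hadoop actually uses): a map phase emits (word, 1) pairs, and a shuffle/reduce phase groups them by key and sums the counts.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle + reduce phase: group pairs by key and sum the counts.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big", "data value chain"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'value': 1, 'chain': 1}
```

In real Hadoop, the map tasks run in parallel on the nodes holding the data blocks, and the framework handles the shuffle between map and reduce; this sketch only shows the programming model.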
Which of the following is some sort of hardware architecture or software framework that allows the software to run? A. Hadoop file system B. Platform C. Malware D. Chatbot
Platform
The goal of most big data systems is A. To save cost B. To reduce complexity C. To surface insight D. To secure data
C. To surface insight
"Information Technology" is an example of A. Primary Industry B. Secondary Industry C. Tertiary Industry D. Quaternary Industry
Quaternary
All of the following accurately describe Hadoop, EXCEPT: A. Open source B. Real-time C. Java-based D. Distributed computing approach
Real-time
Which of the following is not Features Of Hadoop? A. Suitable for Big Data Analysis B. Scalability C. Robust D. Fault Tolerance
Robust
In which agricultural revolution was mass crop production introduced?
Second Agricultural Revolution or The British Agricultural Revolution
What type of data is self-describing data?
Semi-structured data, e.g. JSON, XML
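"Self-describing" means the field names travel with the values, so no external schema is needed to interpret a record. A minimal sketch using Python's standard json module (the record shown is made up for illustration):

```python
import json

# A JSON record carries its own field names -- this is what
# "self-describing" (semi-structured) means.
record = '{"name": "Abebe", "age": 25, "skills": ["Hadoop", "Spark"]}'
data = json.loads(record)

print(data["name"])    # Abebe
print(data["skills"])  # ['Hadoop', 'Spark']
```

Contrast this with structured data, where the schema lives in the database table definition, and unstructured data, which has no predefined model at all.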
T/F Data is independent of information whereas information is dependent on data
T
What stage of the industrial revolution provides services like teaching and nursing?
The Fourth Industrial Revolution
What is the core factor of IR 3.0 revolution?
The mass production and widespread use of digital logic circuits
List and explain basic components of Hadoop system?
There are three components of Hadoop:
- Hadoop HDFS - the Hadoop Distributed File System (HDFS) is the storage unit.
- Hadoop MapReduce - the processing unit.
- Hadoop YARN - Yet Another Resource Negotiator (YARN) is the resource management unit.
What do we mean by data science is a multi-disciplinary and why it is multi-disciplinary
Data science is multi-disciplinary because it draws on methods and knowledge from several fields; the approach generally includes data mining, forecasting, machine learning, predictive analytics, statistics, and text analytics.
List Service Enabling Devices (SEDs)
- Traditional channel service units (CSUs) and data service units (DSUs) - Modems - Routers - Switches - Conferencing equipment - Network appliances (NIDs and SIDs) - Hosting equipment and servers
A much bigger percentage of all the data in our world is unstructured data. T/F
True
True or false? Hadoop can be used to create distributed clusters, based on commodity servers, that provide low-cost processing and storage for unstructured data, log files and other forms of big data.
True
True or false? MapReduce can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of unstructured data.
True
What type of data doesn't have any predefined data model?
Unstructured data
Which characteristic of big data indicates uncertainty in data due to inconsistency and ambiguity?
Veracity
The primary goal of HCI is to improve the interaction between computers and computers. T/F
False; it is between computers and humans
The bi-directional information flow between a human brain and a machine is referred to as A. Neuro-modulation B. Social computing C. Human machine interface D. Brain computer interface
D
Which of the following is not a feature of HDFS? A. It is suitable for distributed storage and processing. B. Streaming access to file system data. C. HDFS provides file permissions and authentication. D. Hadoop does not provide a command interface to interact with HDFS.
D
In which stage of the big data life cycle is data stored in the distributed file system (HDFS) and the NoSQL distributed database, HBase?
Processing
________ is used to transfer data from an RDBMS to HDFS in the Hadoop ecosystem
Sqoop
Which of the following is regarded as smart Revolution A. Information revolution B. Agriculture revolution C. Industrial revolution (second revolution) D. Knowledge revolution
A
What are the different activities of Big-data life cycle A. Ingesting data, persisting data, computing and analyzing data, and visualizing results B. Acquisition, analysis, curation, storage, and usage C. Veracity, variability, and value D. Input, processing, and output
B
_______ is a model for enabling convenient, on-demand network access to a shared pool of computing resources
Cloud computing
Discuss the difference between cloud computing and cluster computing.
Cloud computing: the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user. Large clouds often have functions distributed over multiple locations, each location being a data center.
Cluster computing: a computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike cloud computing, computer clusters have each node set to perform the same task, controlled and scheduled by software.
Is the process of gathering, filtering, and cleaning data before it is put in a data warehouse
Data acquisition
Data curators are also known as
Data annotators
___________is performed by expert curators that are responsible for improving the accessibility and quality of data.
Data curation
True or false? Due to Hadoop's ability to manage unstructured and semi-structured data and because of its scale-out support for handling ever-growing quantities of data, many experts view it as a replacement for the enterprise data warehouse.
False
_______is an open-source framework intended to make interaction with big data easier
Hadoop
____is a distributed file system that may run on a cluster of commodity machines, where the storage of data is distributed among the cluster and the processing is distributed too.
Hadoop
List the importance of cluster computing
High availability through fault tolerance and resilience, load balancing and scaling capabilities, and performance improvements
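Load balancing, one of the cluster benefits listed above, can be sketched as a simple round-robin scheduler (plain Python; the node and task names are made up, and real cluster schedulers are far more sophisticated):

```python
from itertools import cycle

# Round-robin load balancing: hand each incoming task to the next
# node in the cluster, wrapping back to the first node at the end.
def round_robin(nodes, tasks):
    assignment = {}
    node_cycle = cycle(nodes)
    for task in tasks:
        assignment[task] = next(node_cycle)
    return assignment

nodes = ["node-1", "node-2", "node-3"]   # hypothetical cluster nodes
tasks = ["t1", "t2", "t3", "t4"]
print(round_robin(nodes, tasks))
# {'t1': 'node-1', 't2': 'node-2', 't3': 'node-3', 't4': 'node-1'}
```

Spreading tasks evenly this way is what keeps any single node from becoming a bottleneck, which is where the performance and scaling benefits come from.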
_______ IR introduced the transition from mechanical and analog electronic technology to digital electronics
IR 3.0
List and explain the five characteristics of Big Data (the 5 V's)
Volume, Velocity, Variety, Veracity, and Value.
- Volume: refers to how much data is actually collected. Volume is the base of big data, as it is the initial size and amount of data collected; if the volume of data is large enough, it can be considered big data.
- Velocity: refers to how fast data can be generated, gathered, and analyzed, and how quickly that data moves. This is important for companies that need their data to flow quickly, so it's available at the right times to make the best business decisions possible.
- Variety: refers to the diversity of data types. An organization might obtain data from a number of different data sources, which may vary in value.
- Veracity: relates to how reliable data is; it covers consistency, accuracy, quality, and trustworthiness. Data veracity refers to bias, noise, and abnormality in data.
- Value: refers to the usefulness of gathered data for your business. Data by itself, regardless of its volume, usually isn't very useful; to be valuable, it needs to be converted into insights or information, and that is where data processing steps in.