Big Data

Data Architect

A data architect lays down the foundation for data management systems to ingest, integrate and maintain all the data sources. This role requires knowledge of tools like SQL, XML, Hive, Pig, Spark, etc.

Four Main Modules within Hadoop

Hadoop Distributed File System (HDFS), YARN, Common, and MapReduce

Big Data Analytics

"Big data analytics" is a phrase coined to refer to the analysis of datasets so large that traditional data processing software cannot manage them. For example, big data is used to pick out trends and patterns in economics, which are then used to predict what will happen in the future. Such volumes of data require more robust software for processing and are best handled by data processing frameworks.

Data Lake

A data lake is a vast pool of raw data, the purpose of which is not yet defined. Data lakes are often used by data scientists, and they are highly accessible and quick to update.

Data Pipeline

A data pipeline is software that eliminates many manual steps from the process and enables a smooth, automated flow of data from one station to the next. It starts by defining what, where, and how data is collected. It automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. Big Data Engineers must be able to design, build, and maintain data pipelines.
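
The flow from one station to the next can be sketched in a few lines of Python. This is only an illustration under assumed names: the stage functions, the `amount` field, and the in-memory list used as a sink are hypothetical, not part of any particular pipeline tool.

```python
def transform(records):
    """Convert the raw string amount into a number."""
    for record in records:
        yield {**record, "amount": float(record["amount"])}


def validate(records):
    """Drop records that fail a simple sanity check (negative amounts)."""
    for record in records:
        if record["amount"] >= 0:
            yield record


def run_pipeline(source, stages, sink):
    """Pass the data through each stage in order, then load it into the sink."""
    data = source
    for stage in stages:
        data = stage(data)
    for record in data:
        sink.append(record)
    return sink


# Hypothetical raw input, as it might arrive from a collection step.
raw = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "-3"},
    {"id": 3, "amount": "7"},
]
loaded = run_pipeline(raw, [transform, validate], [])
```

Because each stage is a generator, records stream through one at a time, and adding a new step only means adding another function to the list.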

Data Warehouse

A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. Data warehouses are often used by business professionals, and they are more complicated and more costly to change than data lakes.

Example Data Warehouses

Amazon Redshift, Google BigQuery, IBM Db2 Warehouse, Microsoft Azure SQL Data Warehouse, Oracle Autonomous Data Warehouse, SAP Data Warehouse Cloud, Snowflake

Apache Kafka

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Apache Spark

Apache Spark is a batch processing framework that is also capable of stream processing, making it a hybrid framework. It is easy to use, and applications can be written in Java, Scala, Python, and R. This open-source cluster-computing framework is well suited to machine learning. Spark can be run on a single machine, with one executor for every CPU core. It can be used as a standalone framework or in conjunction with Hadoop or Apache Mesos, so it fits a wide range of deployments. Spark can access data sources such as HDFS, Cassandra, HBase, and S3 for distributed storage.

Apache Storm

Apache Storm is another open-source framework, but one that provides distributed, real-time stream processing. Storm is mostly written in Clojure, and can be used with any programming language. The idea behind Storm is to define small, discrete operations, and then compose those operations into a topology, which acts as a pipeline to transform data.

Database Administrator

As the name suggests, a person working in this role requires extensive knowledge of databases. Responsibilities include ensuring that the databases are available to all the required users, are maintained properly, and function without any hiccups when new features are added.

NoSQL Tech

NoSQL databases were introduced as the requirements of organizations grew beyond structured data. They can store large volumes of structured, semi-structured & unstructured data, with quick iteration and an agile structure that follows application requirements.

Cassandra

Cassandra is a highly scalable database with incremental scalability. The best part of Cassandra is minimal administration and no single point of failure. It is good for applications with fast, random reads & writes. It provides AP (Availability & Partition tolerance) out of CAP.

Examples of NoSQL Tech

Cassandra, Mongo, HBase, CosmosDB, DynamoDB

Data Mining

Data Mining is the process of analyzing data to extract information not offered by the raw data alone. Data mining allows you to find anomalies, patterns and correlations within large data sets to predict outcomes.
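
As one small illustration of finding anomalies, the sketch below flags values that sit far from the mean of a dataset. The sample readings and the z-score threshold of 2 are assumptions chosen for the example; real data mining uses much larger datasets and more robust methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in standard deviations) exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 50]
anomalies = find_anomalies(readings)
```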

ETL/Data Warehousing Solutions

Data warehousing is very important when managing a huge amount of data coming in from heterogeneous sources, where you need to apply ETL (Extract, Transform, Load). A data warehouse is used for data analytics & reporting, and is a crucial part of Business Intelligence. It is very important for a Big Data Engineer to master a data warehousing or ETL tool. After mastering one, it becomes easy to learn new tools, as the fundamentals remain the same.

Data Ingestion

Data ingestion means taking the data from the various sources and then ingesting it into the data lake, including both batch and real-time extraction methods.

Data Transformation

Data transformation alters original or raw data to make it more suitable for data mining.

ETL Process

ETL is short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database. Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources. Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data. Load is the process of writing the data into the target database/mart/store/warehouse.
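
The three steps can be sketched with Python's built-in `sqlite3` module. The table names, columns, and the `COUNTRY_NAMES` lookup table are hypothetical, and two in-memory databases stand in for a real source and target.

```python
import sqlite3

# Source database with raw order data (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, country_code TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "US", 1250), (2, "DE", 990), (3, "US", 400)],
)

# Lookup table used by the transform step.
COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}


def extract(conn):
    """Extract: read the rows out of the source database."""
    return conn.execute(
        "SELECT id, country_code, amount_cents FROM orders ORDER BY id"
    ).fetchall()


def transform(rows):
    """Transform: apply the lookup table and convert cents to currency units."""
    return [(oid, COUNTRY_NAMES[code], cents / 100) for oid, code, cents in rows]


def load(conn, rows):
    """Load: write the converted rows into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders_clean (id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", rows)
    conn.commit()


target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))

# The target can now answer reporting queries directly.
result = target.execute(
    "SELECT country, SUM(amount) FROM orders_clean GROUP BY country ORDER BY country"
).fetchall()
```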

Flink

Flink is an open-source hybrid framework: it is primarily a stream processor, but it can also manage batch tasks. It uses a high-throughput, low-latency streaming engine written in Java and Scala, and its pipelined runtime system allows the execution of both batch and stream processing programs. Programs can be written in Java, Scala, Python, and SQL. Flink does not provide its own storage system, however, so you will have to use it in conjunction with another framework.

Flume & Sqoop

Flume is a tool used to import unstructured data into HDFS, whereas Sqoop is used to import & export structured data between an RDBMS and HDFS.

HBase

HBase is a column-oriented NoSQL database on top of HDFS, which makes it a good scalable & distributed big data store. It is good for applications that need optimized reads & range-based scans. It provides CP (Consistency & Partition tolerance) out of CAP.

Hadoop Distributed File System (HDFS)

HDFS is a highly distributed file storage system designed to manage large amounts of data at high speeds. It divides the data into subsets and distributes the subsets onto different servers for processing.
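
The idea of dividing data into subsets and spreading them across servers can be shown with a toy model. This is not the HDFS API: the function names, node names, tiny block size, and round-robin placement are all assumptions made for illustration, and real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=2):
    """Assign each block index to `replication` different nodes, round-robin."""
    placement = {}
    for index in range(len(blocks)):
        placement[index] = [nodes[(index + r) % len(nodes)] for r in range(replication)]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])
```

Keeping each block on more than one node is what lets the data survive the loss of a single server.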

Hadoop

Hadoop is a Java based, open source, high speed, fault-tolerant distributed storage and computational framework. Hadoop uses low-cost hardware to create clusters of thousands of computer nodes to store and process data.

5 Best Data Processing Frameworks

Hadoop, Apache Spark, Apache Storm, Samza, Flink

Pig & Hive

Hive is a data warehousing tool on top of HDFS. It lets professionals with an SQL background perform analytics. Apache Pig, by contrast, is a high-level scripting language used for data transformation on top of Hadoop. Hive is generally used by data analysts for creating reports, whereas Pig is used by researchers for programming. Both are easy to learn if you are familiar with SQL.

UNIX, Linux, Solaris or MS Windows

Various operating systems are used across the industry. Unix & Linux are among the most prominent, and a Big Data Engineer needs to master at least one of them.

Informatica & Talend

Informatica & Talend are two well-known tools used in the industry. Informatica & Talend Open Studio are data integration tools with an ETL architecture. The major benefit of Talend is its support for Big Data frameworks.

Hadoop MapReduce

MapReduce is a parallel processing paradigm that allows data to be processed in parallel on top of distributed Hadoop storage, i.e. HDFS. Hadoop MapReduce is an implementation of the MapReduce programming model for large-scale data processing.
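
The programming model itself can be sketched in plain Python with the classic word count. This runs on one machine and does not use the Hadoop API; the function names and sample lines are illustrative assumptions. In Hadoop, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "data pipelines"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```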

MongoDB

MongoDB is a document-oriented NoSQL database which is schema-free, i.e. your schema can evolve as the application grows. It also gives full index support for high performance & replication for fault tolerance. It has a master-slave architecture & provides CP out of CAP. It is widely used for web applications & semi-structured data handling.
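
What "schema-free" means in practice can be sketched with plain Python dictionaries standing in for documents. This is not the MongoDB driver API: the `users` list, its fields, and the `find` helper are hypothetical, and are only meant to show documents of different shapes living in one collection.

```python
# A "collection" modelled as a list of dict documents.
users = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Lin", "email": "lin@example.com"},  # field added later
    {"_id": 3, "name": "Sam", "tags": ["admin", "beta"]},   # nested list
]


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Queries have to tolerate documents that lack a field.
with_email = [doc["name"] for doc in users if "email" in doc]
sam = find(users, name="Sam")
```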

Examples of SQL Tech

MySQL, SQL, PL/SQL, Oracle

Performance Optimization

Performance optimization refers to building a system which is both scalable and efficient. A Big Data Engineer needs to make sure that the complete process, from query execution to visualizing the data through reports & interactive dashboards, is optimized. Big Data Engineers need to be able to automate these processes, optimize data delivery, and redesign the complete architecture to improve performance.

RDBMS

RDBMS stands for "Relational Database Management System." An RDBMS is a DBMS designed specifically for relational databases. Therefore, RDBMSes are a subset of DBMSes. A relational database refers to a database that stores data in a structured format, using rows and columns.

Samza

Samza is an open-source framework for near-real-time, asynchronous, distributed stream processing. More specifically, Samza handles immutable streams, meaning transformations create new streams that will be consumed by other components without any effect on the initial stream. This framework works in conjunction with other frameworks, using Apache Kafka for messaging and Hadoop YARN.

Elements included in Spark Core

Spark SQL, which provides a domain-specific language used to manipulate DataFrames. Spark Streaming, which processes data in mini-batches for RDD transformations, allowing the same application code written for batch analytics to also be used for streaming analytics. Spark MLlib, a machine-learning library that makes large-scale machine learning pipelines simpler. GraphX, the distributed graph processing framework on top of Apache Spark.
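
The mini-batch idea behind Spark Streaming can be illustrated without Spark. The sketch below is plain Python, not the Spark API: `count_errors`, `micro_batches`, and the sample log are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def count_errors(records):
    """Analytics logic written once: count the records that contain 'ERROR'."""
    return sum(1 for record in records if "ERROR" in record)


def micro_batches(stream, batch_size):
    """Cut a stream into small consecutive batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


log = ["ok", "ERROR disk", "ok", "ERROR net", "ERROR cpu"]

# Batch analytics: run the logic over the whole dataset at once.
batch_result = count_errors(log)

# Streaming analytics: run the same logic over each mini-batch as it arrives.
stream_results = [count_errors(batch) for batch in micro_batches(log, 2)]
```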

SQL-based Tech for Engineers

Structured Query Language (SQL) is used to structure, manipulate & manage data stored in databases. As Data Engineers work closely with relational databases, they need to have a strong command of SQL. PL/SQL is also prominently used in the industry. PL/SQL provides procedural programming features on top of SQL.

Examples of ETL/Data Warehousing Solutions

Talend, Informatica, Pentaho, CloverETL, Oracle Data Integrator, SAS, Oracle Warehouse

Data Engineer

The master of the lot. A data engineer, as we've already seen, needs to have knowledge of database tools, languages like Python and Java, distributed systems like Hadoop, among other things. It's a combination of tasks into one single role.

Data Engineer vs. Data Scientist

The skills and responsibilities of Data Scientists and Data Engineers often overlap, though the positions are becoming distinct roles. Data Scientists tend to focus on the translation of Big Data into Business Intelligence, while Data Engineers focus much more on building the Data Architecture and infrastructure for data generation. Data Scientists need Data Engineers to create the environment and infrastructure they work in. A Data Scientist is focused more on interacting with the infrastructure than building and maintaining it. Data Scientists are given the responsibility of taking raw data, and turning it into useful, understandable, actionable information. Data Scientists work with Big Data, and Data Engineers work with data infrastructures and foundations.

Hadoop YARN

YARN (Yet Another Resource Negotiator) is the resource management platform that manages the computing resources in clusters, allocating them to different applications and scheduling users' jobs.

ZooKeeper

ZooKeeper acts as a coordinator among the distributed services running in a Hadoop environment. It helps with configuration management and with synchronizing services.

