BigDataEx1
KISS
"Keep It Simple, Sunshine" (KISS) Words to live by at Internet scale
real time for data analysts
"pretty fast" at the data layer and "very fast" at the decision layer.
Platform engineering embodies the mantra
"the whole is more than the sum of its parts" and can make up for many deficiencies in particular technologies.
Stakeholder subsets
"value" and "real time" will suggest different meanings to different subsets of stakeholders.
How much does India need in infrastructure investment?
$1.5 trillion in just the next 10 years to help modernize its economy and lift more of its people out of poverty.
Hadoop YARN?
- A framework for job scheduling and cluster resource management -Hadoop MapReduce: a YARN-based system for parallel processing of large data sets -Manages the cluster resources for job processing
What is RTBDA?
- Real-Time Big Data Analytics Stack -RTBDA technology exists for a specific purpose: creating value from data. It is also important to remember that "value" and "real time" will suggest different meanings to different subsets of stakeholders.
Hadoop Distributed File System (HDFS)?
-A distributed file system that provides high-throughput access to application data -is a distributed, scalable, and portable file system written in Java for the Hadoop framework -can store any type of file -data is automatically split into chunks and replicated for high availability
Apache Spark
-A super fast, in-memory data processing engine. -Open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMPLab, and open sourced in 2010 as an Apache project. -Industry standard for real-time streaming data analytics -Can combine SQL, streaming, and complex analytics
Velocity
-Analysis of streaming data -speed of data
HDFS consists of:
-DataNodes: storage of data blocks -NameNode: coordination of dataNodes
Hadoop is not?
-Database (random Access) -Interactive OLAP (for the moment) -Updates to files -Nonparallel work -Many small files -Low latency
Variety
-Different forms of data -Diversity of data
DAG (In Optimizer)
-Directed Acyclic Graph -In a Spark DAG, every edge directs from earlier to later in the sequence. When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks.
Parallel Processing (YARN: MapReduce and more)
-Distributed calculations (no cross "record/file" dependencies) -Process to data -Self-contained -> can fail and be restarted seamlessly -Schedules/executes tasks as "close" to data as possible
Hadoop tools?
-ETL(extract, transform, load) -BI -Data Storage -Predictive and Statistical Modeling -Machine Learning -others
GFS
-Google File System -designed to solve the issues with distributed systems
Hadoop Ecosystem projects included in Cloudera's CDH:
-Spark, HBase, Hive, Impala, Parquet, Sqoop, Flume/Kafka, Solr, Hue, Sentry
Hadoop "Ecosystem"
-Tools built around the core Hadoop -All ecosystem tools are open source -Tools are designed to extend Hadoop's Functionality -New tools are added all the time
What are the 5 Phases of Real-Time?
1) Data Distillation 2) Model Development 3) Validation and Deployment 4) Real-Time Scoring 5) Model Refresh
How much will NEC Corp invest to set up an analytics center in Noida?
10 million, over a set timeframe of 3 years.
When was the beginning of the digital age?
2002
When was Hadoop published?
2003-2004. Based on the solution used by Google in the 1990s
Reports suggest that the big data and analytics market in India will grow approximately __ times, to ______ by 2020
8 times, to $16 billion
A cloud running a mail application
A cloud running a mail application that supports hundreds of millions of users already started out with hundreds of millions of mailboxes or little piles of hay. -conventional clouds have natural isolation
Big data platforms are monster computers
A single Hadoop cluster with serious punch consists of hundreds of racks of servers and switches. These racks do not include the surrounding infrastructure used to get the bales of data onto the cluster.
Druid
A system for scanning tens of billions of records per second. It can query 6 terabytes of in-memory data in 1.4 seconds.
What did Cloudera create?
A system of tools, including Flume and SQOOP, which handle ingestion from multiple sources into Hadoop, and Impala, which enables real-time, ad hoc querying of data.
Storm helps with
Ads at the right time
Thrift Server
Allows external clients to interact with Hive
The "bigness" of big data depends on its location in the stack.
At the data layer, it is not unusual to see petabytes or even exabytes of data. At the analytics layer, you're more likely to encounter gigabytes and terabytes of refined data. By the time you reach the integration layer, you're handling megabytes. At the decision layer, the data sets have dwindled down to kilobytes, and we're measuring data less in terms of scale and more in terms of bandwidth. The higher you go in the stack, the less data you need to manage. At the top of the stack, size is considerably less relevant than speed.
Data Layer
At the foundation is the data layer. At this level you have structured data in an RDBMS, NoSQL, HBase, or Impala; unstructured data in Hadoop MapReduce; streaming data from the web, social media, sensors and operational systems; and limited capabilities for performing descriptive analytics. Tools such as Hive, HBase, Storm and Spark also sit at this layer. (It has been suggested that the data layer be divided into two layers, one for storage and the other for query processing.)
What was the process before Impala?
Before Impala, you did machine learning and larger-scale processes in Hadoop, and the ad hoc analysis in Hive, which involves relatively slow batch processing. Alternatively, you can perform the ad hoc analysis against a traditional database system, which limits your ad-hoc exploration to the data that is captured and loaded into the pre-defined schema. So essentially you are doing machine learning on one side, ad hoc querying on the other side, and then correlating the data between the two systems.
What is Apache Hive built on?
Built on top of Apache Hadoop for providing data summarization, query, and analysis
Hadoop has already been demonstrated to scale to very high node counts (thousands of nodes)
But since a 500-node cluster generates more throughput than a 450-node cluster, whether it is 8% or 11% faster isn't as important as Hadoop's ability to scale beyond thousands of nodes.
Most widely used enterprise-ready Hadoop distributions?
Cloudera, Hortonworks, and MapR
Parquet
Columnar data storage format
normalization
Common relational database design encouraged a practice called normalization, which maximized flexibility in case users needed to add new tables or new columns to existing tables. -also minimized the duplication of data between tables because disk space was expensive.
Driver
Acts like a controller that receives Hive statements, then compiles and optimizes them
How does data appear?
Data appears in all kinds of ways and people have all kinds of preferences for how they want to express what they want, and what kinds of languages they want to write their queries in.
Perishability
Data lives on a spectrum of perishability that spans from seconds to decades. Perishability puts the emphasis on insight not retention.
Sqoop
Data movement/ETL to and from RDBMS
What led to Hadoop
Doug Cutting's Open source project: Nutch
What does Impala do?
Enables ad hoc SQL analysis directly on top of your big data systems. You don't have to define the schema before you load the data.
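A minimal sketch of what such an ad hoc query might look like from Python, assuming the impyla client library and a reachable Impala daemon; the host name and table are hypothetical:

```python
# Sketch: ad hoc SQL against Impala via the impyla client (assumed installed).
# Host, port, and table names are illustrative placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# No schema had to be designed up front for an exploratory query like this one.
cur.execute(
    "SELECT page, COUNT(*) AS hits FROM web_logs "
    "GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cur.fetchall():
    print(page, hits)

cur.close()
conn.close()
```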
No two platforms are alike unless they are built to extremely repeatable specifications.
For this very reason, Internet-scale pioneers go to great lengths to minimize platform variance in an attempt to avoid conditions that might trigger end-case, high-complexity software failures, which are very nasty to triage.
Google's System
Google definitely wants their system to be as fast as possible, and they definitely put real-time constraints on the internals of their system to make sure that it gives up on certain approaches very quickly. But overall, the system itself is not real time. It's pretty fast, almost all the time. This is the squishy definition of real time.
In traditional enterprise-scale software
HA capabilities are not valued as features, but in supercomputer clusters with thousands of interconnected components, HA is as important as scalability.
Although Hadoop infrastructure looks very similar to the servers running all those mail-boxes
Hadoop clusters remain one single, monolithic supercomputer disguised as a collection of cheap, commodity servers.
Murphy's Law:
If it can fail, it will. -In reality, everything doesn't fail. There are parts that might fail but not all parts do fail, so it is important to assess the probability of a part failing.
SHARED-NOTHING
In practice, they do a little sharing, but only enough to propagate the illusion of a single, monolithic supercomputer.
Real-time scoring
In real-time systems, scoring is triggered by actions at the decision layer (by consumers at a website or by an operational system through an API), and the actual communications are brokered by the integration layer. In the scoring phase, some real-time systems will use the same hardware that's used in the data layer, but they will not use the same data. At this phase of the process, the deployed scoring rules are "divorced" from the data in the data layer or data mart. Note also that at this phase, the limitations of Hadoop become apparent. Hadoop today is not particularly well-suited for real-time scoring, although it can be used for "near real-time" applications such as populating large tables or pre-computing scores. Newer technologies such as Cloudera's Impala are designed to improve Hadoop's real-time capabilities.
system redundancy strategy of no single point of failure (noSPOF)
It is a design principle applied to platforms to allow them to function in the presence of abnormal operating conditions. At Internet scale, not all single points of failure are created equal, so applying the principle across all potential points of failure is difficult to implement, complex to manage and very expensive.
Integration Layer
It is the "glue" that holds the end-user applications and analytics engines together, and it usually includes a rules engine or CEP engine, and an API for dynamic analytics that "brokers" communication between app developers and data scientists.
A CBO engineered for highly normalized OLTP schemas
It was not designed for big data schemas that have simpler tables with billions of rows.
Ad Hoc queries in real-time vs traditional analytics
It will take you longer to recognize and respond to new kinds of fraud if you use traditional analytics on top of a traditional enterprise data warehouse, than it would if you had the capabilities to run ad hoc queries in real time. When dealing with fraud, every lost minute translates into lost money.
Japanese companies are using India as a _____________ ______ to expand into Africa, and service providers are expanding from Japan into India
Manufacturing base
What technologies enable you to run queries without changing the data structures underneath?
MapReduce, Hive, and Impala.
What is real-time big data about?
Not just a process for storing petabytes or exabytes of data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time. It's about detecting fraud while someone is swiping a credit card, or triggering an offer while a shopper is standing at a checkout line, or placing an ad on a website while someone is reading a specific article. It's about combining and analyzing data so you can take the right action, at the right time, and at the right place.
single, general-purpose database kernel or CBO
Not possible to build a single, general-purpose database kernel or CBO to handle the entire spectrum.
OLTP-based CBOs
OLTP-based CBOs optimize space for time, whereas big data CBOs must optimize time for space.
What does Hive provide?
Provides a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. -Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
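For illustration, a hedged sketch of running a HiveQL-style aggregation through PySpark's Hive integration instead of hand-written MapReduce Java code; the session setup and table name are assumptions, not part of the card:

```python
# Sketch: querying Hive-managed tables with SQL rather than the MapReduce
# Java API, using PySpark's Hive support. The table name is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-query-sketch")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL-style aggregation; the engine compiles this into distributed jobs.
result = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")
result.show()
spark.stop()
```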
Impala
SQL Query Engine designed for BI workloads
Hive
SQL processing engine designed for batch workloads
Volume
Scale of data
What do both Drill and Dremel do?
Scan data in parallel
Ser De
Serializer / Deserializer -The deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. -The Hive serializer takes this Java object and converts it into a suitable format that can be stored in HDFS.
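A conceptual sketch of the serialize/deserialize round trip in plain Python (not Hive's Java SerDe interface), just to illustrate the idea:

```python
# Conceptual SerDe sketch: a deserializer turns a stored text record into an
# object the engine can manipulate; a serializer turns it back into a
# storable form. The record content is illustrative.
import json

def deserialize(line: str) -> dict:
    # e.g. a JSON-per-line record read from HDFS
    return json.loads(line)

def serialize(record: dict) -> str:
    return json.dumps(record)

stored = '{"user": "ada", "clicks": 3}'
record = deserialize(stored)          # now a manipulable object
record["clicks"] += 1
print(serialize(record))              # ready to be written back out
```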
The File System (FS) Shell?
Includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others.
The easiest way to access hundreds of terabytes of data requires access methods to be simple (and by implication, scalable)
Simple and scalable requires a relatively simple method like key-value pair. The new generation of fast and cheap noSQL databases now being used for big data applications are also known as key-value pair databases. The structural antithesis of noSQL is the class of complex and expensive uberSQL relational databases.
Enterprises attempting to use Oracle's clustering technology, RAC, found it nearly impossible to set up
Since this failure could be a result of their customers' own poor platform engineering (which exposed more bugs), Oracle designed an engineered platform that combined all the components and product engineering expertise, which made successful experiences possible. The resulting product was Exadata.
Stream
Stream processing looks at small amounts of data as they arrive. You can do intense computations, like parallel search, and merge queries on the fly. Normally if you want to do a search query, you need to create search indexes, which can be a slow process on one machine. With Storm, you can stream the process across many machines, and get much quicker results.
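A conceptual, single-machine sketch of stream processing in plain Python (not Storm's API): each record is handled as it arrives and running results stay current:

```python
# Conceptual stream processing sketch: records are processed as they arrive
# and running results are updated incrementally. A real topology would
# spread this work across many machines; the sample records are illustrative.
from collections import Counter

def incoming_stream():
    # Stand-in for a live source such as a message queue.
    yield from ["click home", "click cart", "search shoes", "click home"]

running_counts = Counter()
for record in incoming_stream():
    for token in record.split():
        running_counts[token] += 1
    # Results are available immediately, without waiting for a batch to finish.
    print(dict(running_counts))
```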
creating of analytics vs consumption of analytics
TWO DIFFERENT THINGS
Prior to the 1990s, data sources were abundant, but the high cost of storage still meant stashing much of it on tape.
Tape technology has more lives than cats. Even as disk storage approached $1/TB, tape remains a couple of orders of magnitude cheaper. Big data starts to live up to its name not when enterprises have 10 petabytes in their cluster, but when they can afford to load 500 exabytes.
Apache's Hadoop Definition
The Apache Hadoop software library is a framework that allows for the distributed processing of large scale data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Value
The ability to achieve greater insights from data
Hadoop Common?
The common utilities that support the other Hadoop modules
Pi Estimation Method
The pi sample uses a statistical (quasi-Monte Carlo) method to estimate the value of pi. Points are placed at random in a unit square. The square also contains a circle. The probability that the points fall within the circle is equal to the area of the circle, pi/4. The value of pi can be estimated from the value of 4R, where R is the ratio of the number of points that are inside the circle to the total number of points within the square. The larger the sample of points used, the better the estimate is.
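A single-machine Python sketch of the same quasi-Monte Carlo idea (the Hadoop sample distributes this work across map tasks):

```python
# Estimate pi as 4R, where R is the fraction of random points in the unit
# square that land inside the inscribed circle of radius 0.5.
import random

def estimate_pi(num_points: int) -> float:
    inside = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()       # point in the unit square
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:   # inside the circle
            inside += 1
    return 4.0 * inside / num_points                  # pi is approximately 4R

print(estimate_pi(1_000_000))  # larger samples give better estimates
```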
SQL queries frequently require many tables to be joined together
The piece of magic inside the database kernel that makes this possible is called the SQL query parser/optimizer or cost-based optimizer (CBO).
example of perpetual prototyping.
The strategy to optimize everything is an operational example of perpetual prototyping.
Decision Layer
This is where the rubber meets the road, and it can include end-user applications such as desktop, mobile, and interactive web apps, as well as business intelligence software. This is the layer that most people "see". -It is the layer at which business analysts, c-suite executives, and customers interact with the real-time big data analytics system.
perpetual prototyping (PP)
To shift to a higher rate of release and still provide features that are stable, many Internet-scale organizations develop their products using an approach called perpetual prototyping (PP), which blurs together the formerly discrete steps of prototyping, development, testing, and release into a continuous loop.
Hadoop is distributed
a Hadoop cluster can have several machines
platform
a collection of sub-systems or components that must operate like one thing
Shark
a data warehousing system that has a Spark engine
A cluster?
a group of computers working together
Big data evolved from
a need to solve an emerging set of supercomputing problems.
Most significantly, cloud and big data
achieve scalability in very different ways, for very different reasons.
adding features quickly can destabilize any software system
achieving equilibrium between innovation and stability is important.
Internet-scale distributed systems
also contain critical pathways, but have fewer of them in addition to having many parallel service pathways.
The isolation benefits that make any shared-nothing platform achieve scalability
also make clouds scale.
the first rule of assembly lines
always optimize the cost and effort of the steps in the pipeline.
Matei Zaharia
an author of Spark, an open source cluster computing system that can be programmed quickly and runs fast.
What are we witnessing in big data now?
an explosion of new techniques for analyzing large data sets. In addition to new capabilities for handling large amounts of data, we're also seeing a proliferation of new technologies designed to handle complex, non-traditional data -- precisely the kinds of unstructured or semi-structured data generated by social media, mobile communications, customer service records, warranties, census reports, sensors, and web logs.
Storm
an open source, low-latency stream processing system designed to integrate with existing queuing and bandwidth systems. -Used by companies such as Twitter, the Weather Channel, Groupon and Ooyala.
Conventional clouds
are a form of platform engineering designed to meet very specific and mostly operational requirements.
Real-Time Online algorithms
are constrained by time and space limitations. If you "unbound" them to allow more data, they can no longer function as real-time algorithms.
Enterprise-scale systems
are typically not highly distributed and are more susceptible to just a few critical pathways of failure.
The art and craft of platform engineering at Internet scale demands three critical tenets:
avoid complexity, prototype perpetually, and optimize everything.
Two paradigms for data processing
batch and stream
In CPU design, breaking down the circuitry pipeline
breaking down the circuitry pipeline that executes instructions allows that pipeline to run at high speeds.
The Hadoop isolation mechanism
breaks up a single pile of hay into hundreds of piles so each one of those nodes works on its own private pile. Hadoop creates synthesized isolation.
Big data cluster software like Hadoop
breaks up the analysis of a single, massive dataset into hundreds of identical steps (pipelining) and then runs hundreds of copies at once (parallelism).
Big data consists of a broad spectrum of purpose-built workloads
but traditional business intelligence products are either too general-purpose to address this diverse spectrum or too purpose-built and can only address a narrow range of workloads.
Conventional clouds designed for mailboxes can be elastic
but will not elastically accommodate supercomputing clusters that, as a single unit of service, span 25 racks.
Overly structured data
by definition, has been editorialized, refined, and transformed.
Hadoop is scalable
can add more machines to the cluster (proportionally adds capacity)
Hadoop is Fault-Tolerant
can recover from hardware failures -Master re-assigns work -Data replicates by default on 3 machines -Nodes that recover rejoin the cluster automatically
A long product development cycle
can result in an expensive, late, or uncompetitive product. It also can lead to company failure. This traditional process is simply not responsive enough for big data.
Meaning of "real time"
can vary depending on the context in which it is used. -In the same sense that there really is no such thing as unstructured data, there's no such thing as real time; there's only near real time.
The most common notion of scalability
comes from outside the data center, in economies of scale.
Like scientific supercomputing
commercial supercomputing cannot be solved using products from a single vendor.
Spark analytics
companies have UIs that launch Spark on the back end of analytics dashboards. You see the statistics on a dashboard and if you're wondering about some data that hasn't been computed, you can ask a question that goes out to a parallel computation on Spark, and you get back an answer in about half a second.
As applications became more complex
constructing platforms with silos became even more difficult.
Model Refresh
data is always changing, so there needs to be a way to refresh the data and refresh the model built on the original data. The existing scripts or programs used to run the data and build the models can be re-used to refresh the models. Simple exploratory data analysis is also recommended, along with periodic (weekly, daily, or hourly) model refreshes. The refresh process, as well as validation and deployment, can be automated using web-based services such as RevoDeployR, a part of the Revolution R Enterprise solution.
Seymour Cray
designed the Control Data 6600 in the 1960s which is considered one of the first successful supercomputers. -Successful because Cray pipelined tasks within the CPU and then turned to parallelism to increase the number of mathematical, or floating-point results that could be calculated in a second.
A small startup, Teradata
developed one of the first database kernels to handle queries against a TB of data.
"the present" meaning
It has different meanings to different users. From the perspective of an online merchant, "the present" means the attention span of a potential customer. If the processing time of a transaction exceeds the customer's attention span, the merchant doesn't consider it real time.
ETL to ELT
economics and scale of Hadoop change the order to ELT since the raw data is loaded before the scheduling power of Map/Reduce can be brought to bear on multiple transform pipelines.
Simple steps make it possible to
either reduce effort required (price) or increase the production (performance).
Shared-everything data architectures
emphasize the value gained by having all nodes see a common set of data. -The need to insure a single, shared view is traded for scalability.
Each cluster node performs
exactly the same task, but the cluster contains hundreds of identically configured computers all running exactly the same job on their private slice of data.
The law of averages
for a 400-node cluster means failures are a constant, so the software must provide the ability to scale and keep the cluster continuously available in the face of component failures.
Building a silo-centric Disaster Recovery plan
forces precise co-ordination across every single silo, which is organizationally complex and expensive. Although many companies get by with a silo approach to enterprise-scale disaster recovery, it's rarely optimal. At Internet scale, it doesn't work at all.
Clusters are loaded two ways:
from all the existing stranded sources and the greenfield sources (such as a gaggle of web server logs).
Hadoop evolved directly from
commodity scientific supercomputing clusters developed in the 1990s.
Larry Ellison's
genius was to take IBM's relational database technology and place it on the seminal VAX and create one of the first enterprise software companies in the post-mainframe era.
What do technologies like Hadoop do?
give you the scale and flexibility to store data before you know how you are going to process it.
Cloud computing evolved from
grid computing, which evolved from client-server computing, and so on, back to the IBM Selectric.
disk drive technology
has not changed much from the drives of the 1980s
NoSQL databases
have become popular for their affordability and ease of use while operating at internet scale.
memory and storage
have not tracked Moore's Law
Companies using a PP style of development
have pushed testing and integration phases into sections of their production environment.
Pipelining
helps with efficiency by breaking the task into a series of simpler steps. Simple steps require less skill; incremental staff could be paid less.
People in the loop
If you have people in the loop, it's not real time. Most people take a second or two to react, and that's plenty of time for a traditional transactional system to handle input and output.
RTBDA focus on stakeholders
important because it reminds us that the RTBDA technology exists for a specific purpose: creating value from the data.
Spark
in-memory and Streaming processing framework
The cast of developers required
includes data scientists, workflow engineers (data wranglers) and cluster engineers who keep the supercomputers fed and clothed.
Smith says it is helpful to divide the RTBDA process?
into five phases: data distillation, model development, validation and deployment, real-time scoring, and model refresh. At each phase, the terms "real time" and "big data" are fluid in meaning
A relational data model (or schema)
is a collection of tables with various columns. This model provided far more flexibility than the approach it replaced (IBM's IMS hierarchical data model from the 1970s), yet relational technology still required users to know, ahead of time, which columns went into what tables.
pathway severity index (PSI)
is a combination of the probability and outcome (or reduction in availability) from the failure of each pathway. Any pathway, software or hardware, with a high PSI requires a redundancy strategy.
A Hadoop cluster
is a couple of orders of magnitude (hundreds of times) cheaper than platforms built on relational technology and, in most cases, the price/performance is several orders of magnitude (thousands of times) better.
Cyclic graph?
is a directed graph which contains a path from at least one node back to itself. In simple terms, cyclic graphs contain a cycle.
Indexed Sequential Access Method
arose from the need to access information in a way that was simple and fast, yet not necessarily sequential.
What's an acyclic graph?
it is a directed graph which contains absolutely no cycle; that is, no node can be traversed back to itself.
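A short Python sketch that checks whether a directed graph is acyclic -- the property a DAG scheduler relies on (the graph literals are illustrative):

```python
# Depth-first cycle check: a "back edge" to a node on the current path
# means the graph contains a cycle, so it is not a DAG.
def is_acyclic(graph: dict) -> bool:
    """Return True if the directed graph (node -> list of neighbors) has no cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node) -> bool:
        color[node] = GRAY                       # node is on the current DFS path
        for neighbor in graph.get(node, ()):
            state = color.get(neighbor, WHITE)
            if state == GRAY:                    # back edge: cycle found
                return False
            if state == WHITE and not visit(neighbor):
                return False
        color[node] = BLACK                      # fully explored
        return True

    return all(visit(node) for node in graph if color.get(node, WHITE) == WHITE)

# A Spark-like pipeline (read -> map -> filter -> collect) is acyclic:
print(is_acyclic({"read": ["map"], "map": ["filter"], "filter": ["collect"], "collect": []}))  # True
print(is_acyclic({"a": ["b"], "b": ["a"]}))  # False: contains a cycle
```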
If Hadoop is imposed on a conventional cloud
it needs to be a cloud designed to run virtual supercomputers in the first place, not a conventional cloud that has been remodeled, repurposed and reshaped. Big data clouds have to be designed for big data.
When HDFS creates a file
it spreads the file over all available nodes and makes enough copies so that when a job runs on the cluster, there are enough spare copies of the file to ensure as much parallelism and protection as possible.
Scalability isn't just about doing things faster:
it's about enabling the growth of a business and being able to juggle the chainsaws of margins, quality and cost.
Master nodes?
manages the distribution of work and data to worker nodes
Real time or near real-time systems
means architectures that allow you to respond to data as you receive it without necessarily persisting it to a database first.
Most relational databases were not designed to handle acre-feet of data
most were designed to be proficient at online transaction processing (OLTP).
Big data Clusters
must be built for speed, scale, and efficiency.
Big data platforms
must be designed to scale and continue to work in the face of failure.
Internet-scale platforms
must operate at such high level of performance, complexity, and cost that their solution space must always be optimized at the intersection of operations, economics, and architecture.
Like Exadata, big data supercomputers
need to be constructed as engineered platforms and this construction requires an engineering approach where all the hardware and software components are treated as a single system. That's the platform way - the way it was before these components were sold by silos of vendors.
HBase
noSQL database built on HDFS
Todays technology
not designed for real time
Silos
obfuscate the true nature of computing platforms as a single system of interconnected hardware and software.
organization of data found in the family of NoSQL databases
often pejoratively described as unstructured, but a better way to describe it is simply structured.
Scientific and commercial supercomputing clusters are mostly about
parallelism with a little bit of pipelining on the side.
The top three categories of system failure
physical plant, operator error, and software bugs.
Big data clusters achieve scalability based on
pipelining and parallelism, the same assembly line principles found in fast-food kitchens.
Hadoop is a
powerful programming platform, NOT an application platform
David Smith
proposes a four layer RTBDA technology stack. Geared for predictive analytics, it still serves as a good general model.
When products within silos or niches were sold to customers
putting the system together was no longer any single supplier's responsibility; it became the customers' job.
physical similarities in cloud and big data
racks of cloud servers and racks of Hadoop servers are constructed from the same physical components. But Hadoop transforms those servers into a single 1000-node supercomputer, whereas conventional clouds host thousands of private mailboxes.
What is a big issue in many situations involving big data?
random failures and resulting data loss
Storm
relatively user friendly, solves really hard problems such as fault tolerance and dealing with partial failures in distributed processing. -It is a platform you can build on. You don't have to focus on the infrastructure because that work has already been done. You can set up Storm by yourself and have it running in minutes.
Spark
relies on "resilient distributed datasets" (RDDs) and can be used to interactively query 1 to 2 terabytes of data in less than a second. -in cases of machine learning, Spark can run 10x to 100x faster than Hadoop MapReduce.
A cluster that scales better
remains smaller, which also improves its operational efficiency.
Analytics Layer
sits above the data layer. Includes a production environment for deploying real-time scoring and dynamic analytics; a development environment for building models; and a local data mart that is updated periodically from the data layer, situated near the analytics engine to improve performance.
Hadoop achieves impressive scalability with shared-nothing isolation techniques
so Hadoop clusters hate to share hardware with anything else. -Share no data and share no hardware - Hadoop is shared-nothing
Complexity contributes to two major failure categories
software bugs and operator error.
What does GFS store?
stores large volumes of data, and distributed MapReduce processes that data
Metastore
stores metadata; stored in traditional RDBMS format
Flume, Kafka
streaming data ingestion
Conventional clouds consist of applications
such as mailboxes and Windows desktops and web servers because those applications no longer saturate commodity servers.
Hadoop clouds must be designed to support
supercomputers, not idle mailboxes.
Hadoop was designed to operate on
terabytes of data spread over thousands of flaky nodes with thousands of flaky drives.
Solr
text search functionality
An ISAM file has a single index key
that could be used to randomly access single records, instead of using a ponderous sequential scan.
Components of a platform do not have to be mature or stable
that is the best-of-breed myth corresponding to the current silo view of enterprise engineering.
The desire to put big data into clouds so it all operates as one fully elastic supercomputing platform overlooks
the added complexity that results from this convergence. COMPLEXITY IMPEDES SCALABILITY
Because internet-scale platforms are highly distributed
the effects of the distribution replaces just a few high PSI pathways with hundreds of low PSI pathways.
Hadoop consists of two major components
the file system (HDFS) and a parallel job scheduler.
Hadoop seems to be related to cloud computing because
the first cloud implementations came from the same companies attempting Internet-scale computing.
Once real-time analytics becomes a commodity,
the focus will shift from data science to DECISION SCIENCE.
Validation and Deployment
the goal at this phase is testing the model to make sure that it works in the real world. The validation process involves re-extracting fresh data, running it against the model, and comparing results with outcomes run on data that's been withheld as a validation set. If the model works, it can be deployed into a production environment.
If the platform needs to handle hundreds of millions of users affordably
the secret sauce is in the platform engineering, not in the aggregation of best-of-breed products.
What happens when you have lots of data moving across multiple networks and many machines?
there's a greater chance that something will break and portions of the data won't be available.
best of breed at internet scale
unaffordable
Veracity
uncertainty of data
MapReduce Process
usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the map, which are then the input to the reduce task. Typically both the input and the output of the job are stored in a file-system.
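A local Python simulation of that map -> sort/shuffle -> reduce flow, using word count as the job (a real Hadoop job runs the same logic in parallel over HDFS chunks):

```python
# Local simulation of map -> sort/shuffle -> reduce. The input "chunks"
# stand in for HDFS input splits processed by parallel map tasks.
from itertools import groupby
from operator import itemgetter

def map_phase(chunk: str):
    for word in chunk.split():
        yield (word, 1)                      # emit (key, value) pairs

def reduce_phase(word, counts):
    return (word, sum(counts))               # combine all values for one key

chunks = ["big data is big", "data is data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
mapped.sort(key=itemgetter(0))               # the framework's sort/shuffle step
reduced = [reduce_phase(word, (c for _, c in group))
           for word, group in groupby(mapped, key=itemgetter(0))]
print(reduced)   # [('big', 2), ('data', 3), ('is', 2)]
```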
The network fabric of a Hadoop cluster is the
veins and arteries of HDFS and must be designed for scalable throughput - not attachment and manageability, which is the default topology for conventional clouds and most enterprise networks.
In the 1970s, IBM's monopoly was curtailed
enough for other startups such as Amdahl, DEC, and Oracle to emerge and begin providing IBM customers with alternatives.
Exadata
was originally designed for the data warehouse market, but found more success with mainstream Oracle RAC customers running applications like SAP.
Hue
web based user interface for Hadoop
The file system and scheduling capabilities in Hadoop were primarily designed to operate on
unreliable commodity components, and to be tolerant of them.
Hadoop also redefines ETL to ELTP
where the P stands for Park. The raw data, processed data, and archived data can now all park together in a single, affordable reservoir.
What type of intuition do people need to develop?
which kinds of processing are bounded in time, and which kinds aren't
The Apache Drill Project
will address the "squishy" factor by scanning through smaller sets of data very quickly. Drill is the open source cousin of Dremel, a Google tool that rips through larger data sets at blazing speeds and spits out summary results, sidestepping the scale issue. -Drill may be complementary to existing frameworks such as Hadoop.
The recent emergence of flash as storage technology and Hadoop as a low-cost alternative to arrays of expensive disks
will combine to produce its own form of disruption to that industry.
processes
you need processes for translating the analytics into good decisions.
Hadoop CAN run on conventional clouds, but
your SLA mileage will vary a lot. Performance and scalability will be severely compromised by underpowered nodes connected to underpowered external arrays or any other users running guests on the physical servers that contain Hadoop data nodes.
Two kind of nodes?
-Master node (Name Node) -Worker nodes (Data node)
Spark Components?
-Spark SQL: Fast engine for Hive interactive queries -Spark Streaming: Real-time data analysis -MLlib: Machine learning algorithms -GraphX: Graph processing algorithms -Has an advanced DAG execution engine (Apache Spark Core Engine) -Comes with built-in operators
Advantages of Spark?
-Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm: speed; support for Java, Scala, and Python; and streaming, machine learning, and visualization.
What are the four Vs of big data? what is the fifth?
-Volume -Variety -Velocity -Veracity. The fifth is Value.
What do most organizations prefer?
-An enterprise-ready distribution of Hadoop that is: Tested thoroughly, supported, and integrates well with Hadoop projects and other key software like ETL tools and databases.
What does Drill do?
-Brings big data analytics a step closer to real-time interactive processing, which is definitely a step in the right direction -Drill is trying to be more things to more people -- probably at the cost of some performance; that's mostly due to the different environment.
What class hardware do Master nodes use?
-Carrier-class hardware -dual power supplies -dual ethernet cards -hard drives use RAID (Redundant Array of Inexpensive Disks) to protect from data loss -a reasonable amount of RAM and a reasonable number of CPUs are required: 64 GB for 20 nodes or less, 96 GB for up to 300 nodes.
What does Hadoop Include?
-Hadoop Distributed File System (HDFS) -Hadoop YARN -Hadoop Common
Storage: Hadoop Distributed File System (HDFS)
-Hierarchical file system; no innate database, a dump of schema-less data -no updates, only replace/append -self-healing nodes/computers -128 MB file "chunks" (stored 3x; replication count and block size can be set per file) -Scalable, cloud friendly
HDFS is manipulated via:
-Java APIs -Utilities: fsck (diagnose the health of the file system, find missing files or blocks) and Rebalancer (balance the cluster when the data is unevenly distributed among DataNodes) -Command Line Interface (CLI) -Configuration -Nodes host an internal web server; node status and files can be viewed via a web browser
CheckpointNode
-Performs "checkpoints," i.e., every hour or every 1M transactions (configurable) -Copies "fsimage" to primary, clears edits -NOT a back-up to nameNode and dataNodes do not connect to CheckpointNode -Previously called secondary NameNode
BackupNode
-Receives stream of changes to "edit" and thus has a real time view of fsimage available for NameNode -NOT NameNode hot restore/switchover
SQOOP
-SQL + Hadoop = sqoop -Used to import data from relational databases into Hadoop and to export data from Hadoop to relational databases.
Hadoop Is?
-Scalable, for parallel/distributable problems (no dependencies across data) -A write once, read many solution (vs. RDBMS for write and update a lot)
Hadoop- Processing, storing, and analyzing large volumes of data
-Software: handles distribution of data, handling failures -Hardware: handles storage of data and processing power
The main components and ecosystem of big data as:
-Techniques for analyzing data, such as A/B testing, machine learning and natural language processing -Big Data technologies, like business intelligence, cloud computing and databases -Visualization, such as charts, graphs and other displays of the data.
NameNode Function: Coordination
-Tracks metadata information: tracks in memory where blocks are written. Transactions are logged on disk (can be used to restore from a NameNode failure) -Provides information to applications (aka clients): identifies DataNodes for write and read (how many, and which configuration: node, rack, data center). Supplies arguments for the application's calls to DataNodes
Three Dimensions of big data?
-Volume, Variety, and Velocity.
DataNode Functions
-accepts read/write calls from client applications (directed to the DataNode by the NameNode) -Performs block creation, deletion, and replication upon instruction from the NameNode (reports results to the NameNode) -Sends "heartbeats" (every 2 to 10 secs) to the NameNode -No in-place changes: only append, create, overwrite
What is the Hadoop solution?
-bring computation to the data rather than bringing data to the computation -Distribute computing to where data is stored -run computations where data resides
Why is more disk needed for worker nodes?
-by default the HDFS data is replicated 3 times -20-30% of cluster capacity is needed for temporary raw storage -4 x your data storage need is a good number for estimation -a good practical maximum is 36 TB per worker node: 12x3 TB drives
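A tiny Python sketch of that sizing rule of thumb (the raw data figure is illustrative):

```python
# Rough capacity sketch using the rules of thumb above: 3x replication plus
# 20-30% temporary space, so plan for about 4x the raw data size, with a
# practical cap of roughly 36 TB (12 x 3 TB drives) per worker node.
raw_data_tb = 100                       # illustrative amount of data to store
planned_capacity_tb = raw_data_tb * 4   # replication + temporary/raw workspace
max_per_worker_tb = 12 * 3              # practical per-node maximum

workers_needed = -(-planned_capacity_tb // max_per_worker_tb)  # ceiling division
print(planned_capacity_tb, "TB total ->", workers_needed, "worker nodes")
```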
Apache Hive?
-Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL -Used to manipulate data
Daemons running on the Master nodes?
-ensure that the entire cluster works. Daemons for HDFS and YARN control the entire cluster. -A failed daemon on a Master node may result in the entire cluster being unavailable. -Master nodes are configured for high availability in an active-passive mode.
What is a live data stream?
-includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.
Worker node hardware?
-midrange CPUs are okay -The more RAM the better: memory-intensive processing frameworks and tools are being used, e.g. Spark and Impala. HDFS caching can take advantage of extra RAM, and 512 GB RAM/node or better is not uncommon.
Hadoop is Open Source
-overseen by Apache -close to 100 committers from companies like Cloudera, Hortonworks, etc.
What do worker nodes do?
-perform the work -can be scaled horizontally -Daemons on worker nodes handle the data processing -A failed worker node does not bring the cluster down because of data replication and high availability
Hadoop MapReduce
-processing framework to process the data -other processing frameworks are also now available -A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
NameNode functions
-works with applications to identify DataNodes for write and read calls -Coordinates replication writes -On start, reads the "fsimage" file, which captures the "state" of HDFS -Operations result in append writes to a tracking file ("edits"), NOT to "fsimage", and thus changes are not immediately reflected in "fsimage"
Many private conventional clouds will not be able to support Hadoop because they rely on each individual application to provide its own isolation
Hadoop in a cloud means large, pseudo-monolithic supercomputer clusters lumbering around. Hadoop is not elastic like millions of mailboxes can be.
Japan is seeking ________, while India craves _____________ and ___________. The leaders of both countries, _______________(india) and _______________(Japan), are also working to counter the growing regional influence of _________ -- an important economic partner to both but also historically a rival.
Japan is seeking growth markets, while India craves advanced technology and foreign investment. The leaders of both countries, Narendra Modi (India) and Shinzo Abe (Japan), are also working to counter the growing regional influence of China -- an important economic partner to both but also historically a rival.
Data Distillation
Like unrefined oil, data in the data layer is crude and messy. It lacks the structure required for building models or performing analysis. The data distillation phase includes extracting features from unstructured text, combining disparate data sources, filtering for populations of interest, selecting relevant features and outcomes for modeling, and exporting sets of distilled data to a local data mart.
A caveat on the refresh phase
Refreshing the model based on re-ingesting the data and re-running the scripts will only work for a limited time, since the underlying data - and even the underlying structure of the data - will eventually change so much that the model will no longer be valid. Important variables can become non-significant, non-significant variables can become important, and new data sources are continuously emerging. If the model accuracy measure begins drifting, go back to phase 2 and re-examine the data. If necessary, go back to phase 1 and rebuild the model from scratch.
Sentry
an authorization tool for managing security
When a shared-nothing cluster doubles in size,
a cluster that scales perfectly operates twice as fast, or the job completes in half the time.
Daemon?
is a program running on a node
Choosing margins over markets
is a reasonable strategy in tertiary sectors that are mature and not subject to continuous disruption.
Wordcount
is a simple application that counts the number of occurrences of each word in a given input set.
Commodity hardware?
is an affordable system without expensive options like RAID or hot swappable CPUs. -applies mainly to data nodes; name nodes should be high quality and very reliable.
a node?
is an individual computer in that cluster
The responsiveness and degree of unavailability is determined
both by expectations and the perception of time.
Stranded, captive data
is the result of vendors optimizing their products for margins and not markets.
A file system
isn't usually thought of as a distinct piece of technology, but more as a tightly integrated piece or natural extension to the database or operating system kernel.
Model Development
processes in this phase include feature selection, sampling and aggregation; variable transformation; model estimation; model refinement; and model benchmarking. The goal at this phase is creating a predictive model that is powerful, robust, comprehensible and implementable. The key requirements for data scientists at this phase are speed, flexibility, productivity, and reproducibility. These requirements are critical in the context of big data: a data scientist will typically construct, refine and compare dozens of models in the search for a powerful and robust real-time algorithm.
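A hedged sketch of the estimate-and-benchmark loop on a distilled sample, using scikit-learn as an assumed stand-in for the modeling tool (synthetic data and illustrative models, not the text's own tooling):

```python
# Estimate two candidate models and benchmark them on held-out data before
# choosing one to refine and deploy. Data here is synthetic and illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(name, round(auc, 3))    # benchmark each candidate on held-out data
```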
batch
processing is fundamentally high-latency. So if you're trying to look at a terabyte of data all at once, you'll never be able to do that computation in less than a second with batch processing
Once credit cards became popular
processing systems had to be built to handle the load and, more importantly, handle the growth without constant re-engineering. These early platforms were built around mainframes, peripheral equipment (networks and storage), and software, all from a single vendor.
HDFS contains a feature called federation
that, over time, could be used to create a reservoir of reservoirs, which will make it possible to create planetary file systems that can act locally but think globally.
Cloud computing evolved from
the need to handle millions of free email users cheaply.