ISM4402 Exam 2 Short Answer
Describe data stream mining and how it is used.
Data stream mining is the process of extracting novel patterns and knowledge structures from continuous, rapid streams of data records. Examples of data streams include sensor data, computer network traffic, phone conversations, ATM transactions, web searches, and financial data. The goal is to predict the class or value of new instances in the data stream as they arrive.
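A minimal sketch of how such a predictor might work, assuming a labeled stream; the sliding window and the 1-nearest-neighbor rule here are illustrative choices, not prescribed by the text:

```python
from collections import deque

def stream_classify(stream, window_size=1000):
    """Predict the class of each arriving instance from a sliding
    window of recently seen labeled instances (1-nearest neighbor)."""
    window = deque(maxlen=window_size)  # oldest records fall out automatically
    for features, label in stream:
        if window:
            # Predict from the closest instance currently in the window.
            nearest = min(
                window,
                key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], features)),
            )
            prediction = nearest[1]
        else:
            prediction = None  # no history seen yet
        yield prediction
        window.append((features, label))  # label assumed to arrive with the record

# Example: a toy stream of (features, label) pairs.
stream = [((0.1, 0.2), "ok"), ((0.9, 0.8), "churn"), ((0.15, 0.25), "ok")]
print(list(stream_classify(stream, window_size=2)))  # [None, 'ok', 'ok']
```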
In the opening vignette, why was the Telecom company so concerned about the loss of customers, if customer churn is common in that industry?
The company was concerned about its loss of customers because the loss was occurring at such a high rate: the company was losing customers faster than it was gaining them. Additionally, the company had identified that the loss of these customers could be traced back to customer service interactions. Because of this, the company felt that the loss of customers was something that could be analyzed and, hopefully, controlled.
List and describe four of the most critical success factors for Big Data analytics.
• A clear business need. Business investments ought to be made for the good of the business, not for the sake of mere technology advancements.
• Strong, committed sponsorship (executive champion). Without a committed executive sponsor, it is difficult for a large-scale analytics initiative to secure the resources and organizational backing it needs.
• Alignment between the business and IT strategy. Analytics should play the enabling role in successful execution of the business strategy.
• A fact-based decision-making culture. In a fact-based decision-making culture, the numbers rather than intuition, gut feeling, or supposition drive decision making.
• A strong data infrastructure. Success requires marrying the old with the new for a holistic infrastructure that works synergistically.
When considering Big Data projects and architecture, list and describe five challenges designers should be mindful of in order to make the journey to analytics competency less stressful.
• Data volume: the ability to capture, store, and process the huge volume of data at an acceptable speed.
• Data integration: the ability to combine data that is not similar in structure or source, and to do so quickly and at a reasonable cost.
• Processing capabilities: the ability to process the data quickly, as it is captured.
• Data governance: the ability to keep up with the security, privacy, ownership, and quality issues of Big Data.
• Skills availability: Big Data is being harnessed with new tools and is being looked at in different ways, and people with these skills are in short supply.
• Solution cost: to ensure a positive ROI on a Big Data project, it is crucial to reduce the cost of the solutions used to find that value.
What is NoSQL as used for Big Data? Describe its major downsides.
• NoSQL is a new style of database that processes large volumes of multi-structured data. NoSQL databases are aimed at serving up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications.
• The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools.
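As an illustration of the access pattern these databases serve, here is a minimal sketch using an in-memory dict as a stand-in for a key-value NoSQL store; the store and record names are hypothetical:

```python
# In-memory dict as a stand-in for a NoSQL key-value store: records are
# fetched whole, by key, rather than assembled through relational joins.
profile_store = {}

def put(user_id, document):
    # Last write wins: no multi-record ACID transaction is involved.
    profile_store[user_id] = document

def get(user_id):
    # Serves one discrete, multi-structured record straight to the
    # application; a real store shards keys across many nodes for scale.
    return profile_store.get(user_id)

put("u42", {"name": "Ada", "clicks": [101, 204], "segment": "high-value"})
print(get("u42"))  # {'name': 'Ada', 'clicks': [101, 204], 'segment': 'high-value'}
```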
List and briefly discuss the three characteristics that define and make the case for data warehousing.
1) Data warehouse performance, backed by mature capabilities such as the cost-based optimizer.
2) Integration of data in a way that provides business value and answers essential business questions.
3) Interactive BI tools that give business users direct access to data warehouse insights.
Define MapReduce.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
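A single-machine sketch of the model using the canonical word-count example; real implementations such as Hadoop distribute these phases across a cluster, but the map, shuffle/group, and reduce steps have the same shape:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # map: emit an intermediate (key, value) pair for each word
    for word in text.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # reduce: merge all values that share the same intermediate key
    return (word, sum(counts))

documents = {"d1": "big data big value", "d2": "data stream data"}

# Shuffle/group: collect intermediate values by key (the framework's job).
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_phase(doc_id, text):
        groups[word].append(count)

results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(sorted(results))  # [('big', 2), ('data', 3), ('stream', 1), ('value', 1)]
```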
What are the differences between stream analytics and perpetual analytics? When would you use one or the other?
• Streaming analytics involves applying transaction-level logic to real-time observations; the rules applied to these observations take into account previous observations only as long as they occurred within the prescribed window. Perpetual analytics, by contrast, evaluates every incoming observation against all prior observations; there is no window size.
• When transactional volumes are high and the time-to-decision is too short, favoring non-persistence and small window sizes, streaming analytics is the appropriate choice. However, when the mission is critical and transaction volumes can be managed in real time, perpetual analytics is the better answer (a sketch contrasting the two follows this list).
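A minimal sketch of the difference in the state each approach keeps, using a running average as a stand-in for the analytic; the class names and window size are illustrative:

```python
from collections import deque

class StreamingStat:
    """Streaming analytics: state is limited to a fixed window of recent
    observations; anything older is forgotten."""
    def __init__(self, window_size=3):
        self.window = deque(maxlen=window_size)

    def observe(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)  # average over the window only

class PerpetualStat:
    """Perpetual analytics: every observation is evaluated against all prior
    observations; nothing falls out of scope."""
    def __init__(self):
        self.count, self.total = 0, 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # average over the full history

s, p = StreamingStat(window_size=3), PerpetualStat()
for v in [10, 10, 10, 100]:
    print(s.observe(v), p.observe(v))  # the two diverge once old values expire
```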
List and describe the three main "V"s that characterize Big Data.
• Volume: driven by transaction-based data, social media, sensor data, and RFID and GPS data.
• Variety: data today comes in all types of formats, from structured to semi-structured to unstructured.
• Velocity: this refers both to how fast data is being produced and to how fast the data must be processed.
Why are some portions of tape backup workloads being redirected to Hadoop clusters today?
• The difficulty of retrieval: data stored offline on tape takes a long time to retrieve, tape formats change over time, and the media are prone to data loss.
• It has been shown that there is value in keeping historical data online and accessible, and Hadoop clusters make keeping it online practical.
