ISM4402 Exam 2 Short Answers
List and describe the three main "V"s that characterize Big Data.
• Volume: This is obviously the most common trait of Big Data. Many factors contributed to the exponential increase in data volume, such as transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, automatically generated RFID and GPS data, and so forth.
• Variety: Data today comes in all types of formats, ranging from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, e-mail, XML, meter-collected and sensor-captured data, to video, audio, and stock ticker data. By some estimates, 80 to 85 percent of all organizations' data is in some sort of unstructured or semistructured format.
• Velocity: This refers to both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags, automated sensors, GPS devices, and smart meters are driving an increasing need to deal with torrents of data in near-real time.
List and briefly discuss the three characteristics that define and make the case for data warehousing.
1) Data warehouse performance: More advanced forms of indexing such as materialized views, aggregate join indexes, cube indexes, and sparse join indexes enable numerous performance gains in data warehouses. The most important performance enhancement to date is the cost-based optimizer, which examines incoming SQL and considers multiple plans for executing each query as fast as possible.
2) Integrating data that provides business value: Integrated data is the unique foundation required to answer essential business questions.
3) Interactive BI tools: These tools allow business users to have direct access to data warehouse insights. Users are able to extract business value from the data and supply valuable strategic information to the executive staff.
Identify, with a brief description, each of the four steps in the sentiment analysis process.
1. Sentiment Detection: Here the goal is to differentiate between a fact and an opinion, which may be viewed as classification of text as objective or subjective.
2. N-P Polarity Classification: Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities, or locate its position on the continuum between these two polarities.
3. Target Identification: The goal of this step is to accurately identify the target of the expressed sentiment.
4. Collection and Aggregation: In this step all text data points in the document are aggregated and converted to a single sentiment measure for the whole document.
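The four steps can be illustrated with a minimal, lexicon-based sketch in Python. The tiny word lists, the `tokens` and `analyze_document` helper names, and the naive "word after 'the'" target rule are illustrative assumptions, not a standard algorithm or library API.

```python
# A minimal lexicon-based sketch of the four-step sentiment analysis process.
SUBJECTIVE_WORDS = {"love", "hate", "great", "terrible", "awful", "excellent"}
POSITIVE_WORDS = {"love", "great", "excellent"}
NEGATIVE_WORDS = {"hate", "terrible", "awful"}

def tokens(sentence):
    return [w.strip(".,!?").lower() for w in sentence.split()]

def is_subjective(sentence):
    """Step 1 - Sentiment detection: classify text as objective (fact) or subjective (opinion)."""
    return any(w in SUBJECTIVE_WORDS for w in tokens(sentence))

def polarity(sentence):
    """Step 2 - N-P polarity: score on a continuum from -1 (negative) to +1 (positive)."""
    words = tokens(sentence)
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def target(sentence):
    """Step 3 - Target identification: naively take the word following 'the' as the target."""
    words = tokens(sentence)
    for i, w in enumerate(words[:-1]):
        if w == "the":
            return words[i + 1]
    return None

def analyze_document(sentences):
    """Step 4 - Collection and aggregation: average sentence polarities into one document score."""
    opinions = [s for s in sentences if is_subjective(s)]
    scores = [polarity(s) for s in opinions]
    return {
        "targets": [target(s) for s in opinions],
        "document_sentiment": sum(scores) / len(scores) if scores else 0.0,
    }

print(analyze_document([
    "The battery lasts ten hours.",   # objective -> filtered out in step 1
    "I love the screen.",             # subjective, positive, target: screen
    "The keyboard is terrible.",      # subjective, negative, target: keyboard
]))
```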
What is the difference between white hat and black hat SEO activities?
An SEO technique is considered white hat if it conforms to the search engines' guidelines and involves no deception. Because search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White-hat SEO is not just about following guidelines, but about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see. Black-hat SEO attempts to improve rankings in ways that are disapproved of by the search engines, involve deception, or try to divert search engine algorithms from their intended purpose.
Define MapReduce.
As described by Dean and Ghemawat (2004), MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
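A toy word count in plain Python (no Hadoop involved) can illustrate the programming model's map and reduce phases; the function names and the in-memory "shuffle" dictionary are assumptions made for illustration only.

```python
# A toy illustration of the MapReduce programming model in plain Python.
# A real implementation (e.g., Hadoop) distributes these phases across a
# cluster of commodity machines; here the "shuffle" is an in-memory dict.
from collections import defaultdict

def map_phase(document):
    """Map: emit (key, value) pairs -- here, (word, 1) for every word."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """Reduce: combine all values emitted for the same key."""
    return key, sum(values)

def mapreduce(documents):
    grouped = defaultdict(list)           # shuffle/sort: group values by key
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(mapreduce(["big data needs big clusters", "data streams and data lakes"]))
# {'big': 2, 'data': 3, 'needs': 1, 'clusters': 1, 'streams': 1, 'and': 1, 'lakes': 1}
```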
Describe data stream mining and how it is used.
Data stream mining, as an enabling technology for stream analytics, is the process of extracting novel patterns and knowledge structures from continuous, rapid data records. A data stream is a continuous, ordered sequence of instances that, in many applications of data stream mining, can be read/processed only once or a small number of times using limited computing and storage capabilities. Examples of data streams include sensor data, computer network traffic, phone conversations, ATM transactions, web searches, and financial data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery. In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the stream.
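A single-pass sketch in Python suggests how this works in practice: each instance is read once, compact summaries are updated incrementally, and the class of the next instance is predicted from those summaries. The sensor-reading stream, the two class labels, and the nearest-mean rule are illustrative assumptions, not a specific stream-mining algorithm.

```python
# A single-pass sketch: each instance is read once, compact summaries are
# updated incrementally (limited memory), and the class of the next instance
# is predicted from those summaries.

class StreamingNearestMean:
    def __init__(self):
        self.count = {}   # class label -> number of instances seen so far
        self.mean = {}    # class label -> running mean of the feature

    def predict(self, x):
        """Assign x to the class whose running mean is closest (None if nothing seen yet)."""
        if not self.mean:
            return None
        return min(self.mean, key=lambda label: abs(self.mean[label] - x))

    def update(self, x, label):
        """Incorporate one instance without storing it."""
        n = self.count.get(label, 0) + 1
        m = self.mean.get(label, 0.0)
        self.count[label] = n
        self.mean[label] = m + (x - m) / n   # incremental mean update

model = StreamingNearestMean()
stream = [(21.0, "normal"), (22.5, "normal"), (80.0, "fault"), (23.0, "normal"), (78.5, "fault")]
for reading, label in stream:
    print(reading, "->", model.predict(reading))   # predict before the true label arrives
    model.update(reading, label)
```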
Why are the users' page views and time spent on your Web site important metrics?
If people come to your Web site and don't view many pages, that is undesirable and your Web site may have issues with its design or structure. Another explanation for low page views is a disconnect between the marketing messages that brought visitors to the site and the content that is actually available there. Generally, the longer a person spends on your Web site, the better: it could mean they are carefully reviewing your content, utilizing the interactive components you have available, and building toward an informed decision to buy, respond, or take the next step you've provided. That said, time on site also needs to be examined against the number of pages viewed to make sure the visitor isn't spending his or her time trying to locate content that should be more readily accessible.
How would you describe information extraction in text mining?
Information extraction is the identification of key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching.
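A minimal sketch of this idea in Python: predefined patterns (a date, a monetary amount, and an "X acquired Y" relationship) are matched against raw text with regular expressions. The example text and patterns are assumptions for illustration only.

```python
# A minimal sketch of information extraction via pattern matching:
# predefined objects and a simple relationship are pulled from raw text
# with regular expressions.
import re

text = "Acme Corp acquired Beta LLC on 2015-06-30 for $12,500,000."

patterns = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "amount": r"\$[\d,]+",
    "acquisition": r"([A-Z]\w+(?: [A-Z]\w+)*) acquired ([A-Z]\w+(?: [A-Z]\w+)*)",
}

for name, pattern in patterns.items():
    match = re.search(pattern, text)
    if match:
        # print the captured groups if any, otherwise the whole match
        print(name, "->", match.groups() or match.group())
```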
Natural language processing (NLP), a subfield of artificial intelligence and computational linguistics, is an important component of text mining. What is the definition of NLP?
NLP is a discipline that studies the problem of understanding natural human language, with a view to converting depictions of human language into more formal representations, in the form of numeric and symbolic data, that are easier for computer programs to manipulate.
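As a rough illustration of converting language into a numeric representation, the sketch below turns two sentences into term-frequency (bag-of-words) vectors over a shared vocabulary; the sample sentences are assumptions, and real NLP pipelines go far beyond this.

```python
# A minimal sketch of converting natural language into a numeric form a
# program can manipulate: each sentence becomes a term-frequency vector over
# a shared vocabulary (a bag-of-words model that ignores word order and grammar).
from collections import Counter

sentences = [
    "the service was excellent",
    "the service was slow and the food was cold",
]

vocabulary = sorted({word for s in sentences for word in s.split()})
vectors = [[Counter(s.split())[word] for word in vocabulary] for s in sentences]

print(vocabulary)
for vector in vectors:
    print(vector)
```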
What is search engine optimization (SEO) and why is it important for organizations that own Web sites?
Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine's natural (unpaid or organic) search results. In general, the higher a site is ranked on the search results page, and the more frequently it appears in the search results list, the more visitors it will receive from the search engine's users. Being indexed by search engines like Google, Bing, and Yahoo! is not good enough for businesses. Getting ranked on the most widely used search engines and getting ranked higher than your competitors are what make the difference.
Provide some examples where a sensitivity analysis may be used.
Sensitivity analyses are used for:
• Revising models to eliminate too-large sensitivities
• Adding details about sensitive variables or scenarios
• Obtaining better estimates of sensitive external variables
• Altering a real-world system to reduce actual sensitivities
• Accepting and using the sensitive (and hence vulnerable) real world, leading to the continuous and close monitoring of actual results
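A one-way sensitivity analysis can be sketched in a few lines of Python: one uncertain input is varied across a range while the others are held fixed, and the change in the result variable is observed. The profit model and all numbers below are illustrative assumptions.

```python
# A one-way sensitivity analysis sketch: vary a single uncertain input
# (unit cost) across a range while holding the other inputs fixed, and
# observe how sensitive the result variable (profit) is to that change.

def profit(units_sold, price, unit_cost, fixed_cost):
    return units_sold * (price - unit_cost) - fixed_cost

BASE = {"units_sold": 10_000, "price": 8.0, "unit_cost": 5.0, "fixed_cost": 12_000}

for unit_cost in [4.0, 4.5, 5.0, 5.5, 6.0]:
    scenario = {**BASE, "unit_cost": unit_cost}
    print(f"unit_cost={unit_cost:.2f} -> profit={profit(**scenario):,.0f}")
```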
Why is the Monte Carlo simulation popular for solving business problems?
The Monte Carlo simulation is a probabilistic simulation. It is built around a model of the decision problem in which the uncertain variables are represented by probability distributions rather than single values. This allows a huge number of simulation runs to be performed, with random values drawn for the uncertain variables on each run; in this way, the model may be solved hundreds or thousands of times. The results for the dependent (performance) variables can then be analyzed as statistical distributions. This reveals the range of possible outcomes, as well as providing information about how the performance variables behave under different levels of uncertainty.
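A minimal sketch of the idea in Python, with the profit model, the distributions, and all parameter values assumed purely for illustration:

```python
# A minimal Monte Carlo simulation sketch: the uncertain inputs (demand and
# unit cost) are drawn from assumed probability distributions, the profit
# model is evaluated thousands of times, and the distribution of the
# performance variable is then summarized.
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def profit(demand, unit_cost, price=8.0, fixed_cost=12_000):
    return demand * (price - unit_cost) - fixed_cost

results = []
for _ in range(10_000):
    demand = random.gauss(10_000, 1_500)   # uncertain demand
    unit_cost = random.uniform(4.5, 5.5)   # uncertain unit cost
    results.append(profit(demand, unit_cost))

quants = statistics.quantiles(results, n=20)
print("mean profit:", round(statistics.mean(results)))
print("5th percentile:", round(quants[0]), " 95th percentile:", round(quants[-1]))
```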
Why are spreadsheet applications so commonly used for decision modeling?
Spreadsheets are often used for this purpose because they are very approachable and easy to use for end users. Spreadsheets have a shallow learning curve that allows basic functions to be learned quickly. Additionally, spreadsheets have evolved over time to include a more robust set of features and functions. These functions can also be augmented through the use of add-ins, many of which are designed with decision support systems in mind.
In the opening vignette, why was the Telecom company so concerned about the loss of customers, if customer churn is common in that industry?
The company was concerned because it was losing customers at such a high rate that it was losing them faster than it was gaining them. Additionally, the company had identified that the loss of these customers could be traced back to customer service interactions. Because of this, the company felt that the loss of customers was something that could be analyzed and, hopefully, controlled.
List and describe the most common approaches for treating uncertainty.
There are two common approaches to dealing with uncertainty: the optimistic approach and the pessimistic approach. The optimistic approach assumes that the outcome for each alternative will be the best possible, and then the best of those best outcomes is selected. Under the pessimistic approach, the worst possible outcome is assumed for each alternative, and then the best of the worst is selected.
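The two rules can be sketched against an assumed payoff table in Python; the alternatives and payoff numbers are invented for illustration.

```python
# The two approaches applied to an assumed payoff table: rows are decision
# alternatives, columns are possible states of nature (all payoffs invented).
# Optimistic (maximax): pick the alternative whose best-case payoff is best.
# Pessimistic (maximin): pick the alternative whose worst-case payoff is best.

payoffs = {
    "bonds":   [8, 8, 8],
    "stocks":  [25, 5, -10],
    "savings": [4, 4, 4],
}

optimistic = max(payoffs, key=lambda alt: max(payoffs[alt]))   # maximax -> "stocks"
pessimistic = max(payoffs, key=lambda alt: min(payoffs[alt]))  # maximin -> "bonds"

print("optimistic choice:", optimistic)
print("pessimistic choice:", pessimistic)
```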
List and briefly discuss the major components of a quantitative model.
These components include:
1. Result (outcome) variables: These reflect the level of effectiveness of a system; that is, they indicate how well the system performs or attains its goal(s).
2. Decision variables: These describe alternative courses of action; the decision maker controls the decision variables.
3. Uncontrollable variables: In any decision-making situation, there are factors that affect the result variables but are not under the control of the decision maker.
4. Intermediate result variables: These reflect intermediate outcomes in mathematical models.
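A tiny pricing model sketch in Python labels each component type; the relationships and numbers are assumptions chosen only to make the roles concrete.

```python
# A tiny model sketch labeling each component type:
#   decision variable         -> price (controlled by the decision maker)
#   uncontrollable variable   -> competitor_price (outside the decision maker's control)
#   intermediate result       -> units_sold (an intermediate outcome)
#   result (outcome) variable -> profit (how well the system attains its goal)

def model(price, competitor_price):
    units_sold = 5_000 - 400 * (price - competitor_price)   # intermediate result variable
    profit = units_sold * (price - 3.0) - 2_000             # result (outcome) variable
    return units_sold, profit

print(model(price=6.0, competitor_price=5.5))   # price is the decision variable
```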
How are linear programming models vulnerable when used in complex situations?
These models can be vulnerable when used in very complex situations for a number of reasons. One reason is the possibility that not all parameters can be known or estimated accurately. Another concern is that the standard assumptions of a linear programming formulation may not hold in more dynamic, real-world environments. Additionally, in more complex environments, all actors may not be wholly rational economic decision makers, as the model assumes.
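For context, a minimal product-mix formulation is sketched below using SciPy's `linprog` (assuming SciPy is available; the products, profit coefficients, and resource limits are invented). The formulation presumes the coefficients are known with certainty and the relationships stay linear, which is exactly where the vulnerabilities above arise.

```python
# A minimal product-mix LP sketch (assumed data): choose quantities x1, x2 of
# two products to maximize profit subject to labor and material limits.
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2  ->  linprog minimizes, so negate the objective.
c = [-40, -30]
A_ub = [[1, 1],   # labor hours:    x1 +   x2 <= 40
        [2, 1]]   # raw material: 2*x1 +   x2 <= 60
b_ub = [40, 60]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("optimal mix:", result.x)        # approximately [20, 20]
print("maximum profit:", -result.fun)  # 1400
```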
Describe the query-specific clustering method as it relates to clustering.
This method employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less similar documents, creating a spectrum of relevance levels among the documents.
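A rough Python sketch of the intuition (not the exact algorithm): documents are scored by similarity to the query, and the most relevant ones form a small, tight inner cluster nested inside progressively larger, less similar layers. The toy documents, the bag-of-words cosine measure, and the thresholds are all assumptions.

```python
# A rough sketch of the idea behind query-specific clustering: the most
# relevant documents form a tight inner cluster nested inside larger,
# less similar clusters, creating a spectrum of relevance levels.
from collections import Counter
from math import sqrt

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

query = "big data analytics"
documents = [
    "big data analytics platforms",
    "analytics for big data streams",
    "data warehouse design",
    "office furniture catalog",
]

ranked = sorted(documents, key=lambda d: cosine(query, d), reverse=True)
inner = [d for d in ranked if cosine(query, d) >= 0.6]          # tight, most relevant cluster
middle = [d for d in ranked if 0.2 <= cosine(query, d) < 0.6]   # nested, less similar layer
outer = ranked                                                   # the full spectrum of relevance
print("inner:", inner)
print("middle:", middle)
print("outer:", outer)
```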
Why do many believe that making decisions under uncertainty is more difficult than making decisions under risk?
This opinion is commonly held because making decisions under uncertainty allows for an unlimited number of possible outcomes with no knowledge of the likelihood of those outcomes. In contrast, decision making under risk also allows for an unlimited number of outcomes, but the probability of each outcome is known.
Why is there a trend to developing and using cloud-based tools for modeling?
This trend exists because it simplifies the process for users. These systems give them access to powerful tools and pre-existing models that they can use to solve business problems. Because these systems are cloud-based, users also avoid much of the cost and effort of operating and maintaining the underlying infrastructure themselves.
List and describe four of the most critical success factors for Big Data analytics.
• A clear business need (alignment with the vision and the strategy). Business investments ought to be made for the good of the business, not for the sake of mere technology advancements. Therefore, the main driver for Big Data analytics should be the needs of the business at any level: strategic, tactical, or operational.
• Strong, committed sponsorship (executive champion). It is a well-known fact that if you don't have strong, committed executive sponsorship, it is difficult (if not impossible) to succeed. If the scope is a single or a few analytical applications, the sponsorship can be at the departmental level. However, if the target is enterprise-wide organizational transformation, which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels and organization-wide.
• Alignment between the business and IT strategy. It is essential to make sure that the analytics work is always supporting the business strategy, and not the other way around. Analytics should play the enabling role in the successful execution of the business strategy.
• A fact-based decision-making culture. In a fact-based decision-making culture, the numbers, rather than intuition, gut feeling, or supposition, drive decision making. There is also a culture of experimentation to see what works and what doesn't. To create a fact-based decision-making culture, senior management needs to do the following: recognize that some people can't or won't adjust; be a vocal supporter; stress that outdated methods must be discontinued; ask to see what analytics went into decisions; and link incentives and compensation to desired behaviors.
• A strong data infrastructure. Data warehouses have provided the data infrastructure for analytics. This infrastructure is changing and being enhanced in the Big Data era with new technologies. Success requires marrying the old with the new for a holistic infrastructure that works synergistically.
When considering Big Data projects and architecture, list and describe five challenges designers should be mindful of in order to make the journey to analytics competency less stressful.
• Data volume: The ability to capture, store, and process the huge volume of data at an acceptable speed so that the latest information is available to decision makers when they need it.
• Data integration: The ability to combine data that is not similar in structure or source and to do so quickly and at reasonable cost.
• Processing capabilities: The ability to process the data quickly, as it is captured. The traditional way of collecting and then processing the data may not work. In many situations data needs to be analyzed as soon as it is captured to leverage the most value.
• Data governance: The ability to keep up with the security, privacy, ownership, and quality issues of Big Data. As the volume, variety (format and source), and velocity of data change, so should the capabilities of governance practices.
• Skills availability: Big Data is being harnessed with new tools and is being looked at in different ways. There is a shortage of data scientists with the skills to do the job.
• Solution cost: Since Big Data has opened up a world of possible business improvements, there is a great deal of experimentation and discovery taking place to determine the patterns that matter and the insights that turn to value. To ensure a positive ROI on a Big Data project, therefore, it is crucial to reduce the cost of the solutions used to find that value.
What are the three categories of social media analytics technologies and what do they do?
• Descriptive analytics: Uses simple statistics to identify activity characteristics and trends, such as how many followers you have, how many reviews were generated on Facebook, and which channels are being used most often.
• Social network analysis: Follows the links between friends, fans, and followers to identify connections of influence as well as the biggest sources of influence.
• Advanced analytics: Includes predictive analytics and text analytics that examine the content in online conversations to identify themes, sentiments, and connections that would not be revealed by casual surveillance.
Why are some portions of tape backup workloads being redirected to Hadoop clusters today?
• First, while it may appear inexpensive to store data on tape, the true cost comes with the difficulty of retrieval. Not only is the data stored offline, requiring hours if not days to restore, but tape cartridges themselves are also prone to degradation over time, making data loss a reality and forcing companies to factor in those costs. To make matters worse, tape formats change every couple of years, requiring organizations to either perform massive data migrations to the newest tape format or risk the inability to restore data from obsolete tapes.
• Second, it has been shown that there is value in keeping historical data online and accessible. As in the clickstream example, keeping raw data on a spinning disk for a longer duration makes it easy for companies to revisit data when the context changes and new constraints need to be applied. Searching thousands of disks with Hadoop is dramatically faster and easier than spinning through hundreds of magnetic tapes. Additionally, as disk densities continue to double every 18 months, it becomes economically feasible for organizations to hold many years' worth of raw or refined data in HDFS.
What are the differences between stream analytics and perpetual analytics? When would you use one or the other?
• In many cases they are used synonymously. However, in the context of intelligent systems, there is a difference. Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to these observations take into account previous observations as long as they occurred in the prescribed window; these windows have some arbitrary size (e.g., last 5 seconds, last 10,000 observations, etc.). Perpetual analytics, on the other hand, evaluates every incoming observation against all prior observations, where there is no window size. Recognizing how the new observation relates to all prior observations enables the discovery of real-time insight.
• When transactional volumes are high and the time-to-decision is too short, favoring nonpersistence and small window sizes, this translates into using streaming analytics. However, when the mission is critical and transaction volumes can be managed in real time, then perpetual analytics is a better answer.
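The contrast can be sketched in Python: a streaming rule looks back only over a fixed window of recent observations, while a perpetual rule evaluates each new observation against everything seen so far. The transaction amounts and the 5-observation window are illustrative assumptions.

```python
# A sketch of the contrast: the streaming view only considers the last
# WINDOW_SIZE observations, while the perpetual view relates each new
# observation to every prior observation.
from collections import deque
import statistics

WINDOW_SIZE = 5
window = deque(maxlen=WINDOW_SIZE)   # streaming analytics: bounded look-back
history = []                         # perpetual analytics: all prior observations

for amount in [20, 22, 19, 21, 23, 500, 24, 22, 480, 21]:
    if history:
        print(f"{amount:>4}  window mean={statistics.mean(window):7.1f}  "
              f"all-history mean={statistics.mean(history):7.1f}")
    window.append(amount)
    history.append(amount)
```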
What is NoSQL as used for Big Data? Describe its major downsides.
• NoSQL is a new style of database that has emerged, like Hadoop, to process large volumes of multi-structured data. However, whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (though there are some important exceptions), at serving up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications. This capability is sorely lacking from relational database technology, which simply can't maintain needed application performance levels at Big Data scale.
• The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools.
In what ways does the Web pose great challenges for effective and efficient knowledge discovery through data mining?
• The Web is too big for effective data mining. The Web is so large and growing so rapidly that it is difficult to even quantify its size. Because of the sheer size of the Web, it is not feasible to set up a data warehouse to replicate, store, and integrate all of the data on the Web, making data collection and integration a challenge.
• The Web is too complex. The complexity of a Web page is far greater than that of a page in a traditional text document collection. Web pages lack a unified structure and contain far more authoring style and content variation than any set of books, articles, or other traditional text-based documents.
• The Web is too dynamic. The Web is a highly dynamic information source. Not only does the Web grow rapidly, but its content is constantly being updated. Blogs, news stories, stock market results, weather reports, sports scores, prices, company advertisements, and numerous other types of information are updated regularly on the Web.
• The Web is not specific to a domain. The Web serves a broad diversity of communities and connects billions of workstations. Web users have very different backgrounds, interests, and usage purposes. Most users may not have good knowledge of the structure of the information network and may not be aware of the heavy cost of a particular search that they perform.
• The Web has everything. Only a small portion of the information on the Web is truly relevant or useful to someone (or some task). Finding the portion of the Web that is truly relevant to a person and the task being performed is a prominent issue in Web-related research.