MIS chapter 6 part 2
Web structure mining
The development of useful information from the links included in Web documents. (book) Web structure mining examines data related to the structure of a particular website.
Data Cleansing
The process of detecting incorrect or insufficient data. (book) Data cleansing, also known as data scrubbing, consists of activities for detecting and correcting data in a database that are incorrect, incomplete, improperly formatted, or redundant. Data cleansing not only corrects data but also enforces consistency among different sets of data that originated in separate information systems.
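A minimal sketch of typical cleansing steps in Python with the pandas library (the table and column names here are made up for illustration):

    # Data-cleansing sketch: standardize formats, drop incomplete rows,
    # and remove redundant records with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "name":  ["Ann Lee", "Bo Chen", None, "Ann Lee"],
        "email": ["ann@x.com", "bo@x.com", "cy@x.com", "ANN@X.COM"],
    })

    df["email"] = df["email"].str.lower()       # enforce one consistent format
    df = df.dropna(subset=["name"])             # detect/remove incomplete rows
    df = df.drop_duplicates(subset=["email"])   # remove redundant records
    print(df)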
Unstructured data, most in the form of text files, is believed to account for more than 80 percent of useful organizational information and is one of the major sources of big data that firms want to analyze.
database server
A computer in a client/server environment that is responsible for running a DBMS to process SQL statements and perform database management tasks
A data warehouse
A logical collection of information, gathered from many different operational databases, that supports business analysis activities and decision-making tasks. (book) A data warehouse is a database that stores current and historical data of potential interest to decision makers throughout the company. The data originate in many core operational transaction systems, such as systems for sales, customer accounts, and manufacturing, and may include data from website transactions. The data warehouse extracts current and historical data from multiple operational systems inside the organization.
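A toy extract-transform-load sketch in Python showing the idea of pulling data from separate operational systems into one warehouse table (the source systems, table, and fields are hypothetical):

    # ETL sketch: extract rows from two operational sources, transform
    # them into a common format, and load them into a warehouse table.
    import sqlite3

    sales_system = [("2024-01-05", "a100", 250.0)]   # order-entry records
    web_system   = [("2024-01-06", "A100", 99.5)]    # website transactions

    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales "
                 "(sale_date TEXT, product_id TEXT, amount REAL)")
    for date, product, amount in sales_system + web_system:
        # Transform: normalize product IDs before loading.
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     (date, product.upper(), amount))
    conn.commit()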
distributed database
A logically related database that is stored over two or more physically independent sites. (book) A distributed database is one that is stored in multiple physical locations. Parts or copies of the database are physically stored in one location, and other parts or copies are maintained in other locations.
Online Analytical Processing (OLAP)
A method of querying and reporting that takes data from standard relational databases, calculates and summarizes the data, and then stores the data in a special database called a data cube. (book) OLAP supports multidimensional data analysis, enabling users to view the same data in different ways using multiple dimensions. OLAP enables users to obtain online answers to ad hoc questions rapidly, even when the data are stored in very large databases, such as sales figures for multiple years.
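A small illustration of the multidimensional idea using a pandas pivot table (the figures are invented):

    # OLAP-style view: the same sales data sliced along two dimensions,
    # region (rows) and quarter (columns).
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "amount":  [100, 120, 90, 130],
    })
    cube = sales.pivot_table(values="amount", index="region",
                             columns="quarter", aggfunc="sum")
    print(cube)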
Hadoop Distributed File System (HDFS)
A system used for data storage that divides the data into subsets and distributes the subsets onto different servers for processing. HDFS links together the file systems on the numerous nodes in a Hadoop cluster to turn them into one big file system. Hadoop's MapReduce was inspired by Google's MapReduce system for breaking down processing of huge data sets and assigning work to the various nodes in a cluster.
web usage mining
An analysis of a website's usage patterns, such as navigational paths or time spent. (book) Web usage mining examines the user interaction data a web server records whenever it receives requests for a website's resources. The usage data record the user's behavior when the user browses or makes transactions on the website and are collected in a server log.
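A sketch of one usage-mining step: counting requested pages in a server log (the log lines are fabricated samples in common log format):

    # Web usage mining sketch: tally which resources users request.
    import re
    from collections import Counter

    log_lines = [
        '10.0.0.1 - - [05/Jan/2024] "GET /products HTTP/1.1" 200 512',
        '10.0.0.2 - - [05/Jan/2024] "GET /cart HTTP/1.1" 200 256',
        '10.0.0.1 - - [05/Jan/2024] "GET /products HTTP/1.1" 200 512',
    ]

    hits = Counter()
    for line in log_lines:
        match = re.search(r'"GET (\S+) ', line)
        if match:
            hits[match.group(1)] += 1
    print(hits.most_common())   # most requested resources first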
MapReduce
An open-source application programming interface (API) that provides fast data analytics services; one of the main Big Data technologies that allows organizations to process massive data stores.
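A minimal single-machine word count written in the MapReduce style; real Hadoop distributes the map and reduce steps across cluster nodes:

    # MapReduce sketch: map emits (word, 1) pairs; reduce sums per key.
    from collections import defaultdict

    documents = ["big data big insight", "big cluster"]

    # Map step: each document yields (key, value) pairs.
    pairs = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle/reduce step: group pairs by key and combine their values.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    print(dict(counts))   # {'big': 3, 'data': 1, 'insight': 1, 'cluster': 1}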
The discovery and analysis of useful patterns and information from the World Wide Web is called web mining.
Businesses might turn to web mining to help them understand customer behavior, evaluate the effectiveness of a particular website, or quantify the success of a marketing campaign. Web mining looks for patterns in data through content mining, structure mining, and usage mining.
Non-relational database management systems
A database management system for working with large quantities of structured and unstructured data that would be difficult to analyze with a relational model. (book) Non-relational DBMS use a more flexible data model and are designed for managing large data sets across many distributed machines and for easily scaling up or down. They are useful for accelerating simple queries against large volumes of structured and unstructured data, including web, social media, graphics, and other forms of data that are difficult to analyze with traditional SQL-based tools.
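A tiny illustration of the flexible data model, using plain Python dictionaries to stand in for documents in a NoSQL-style collection:

    # NoSQL-style documents: records in one collection need not share
    # a rigid schema, unlike rows in a relational table.
    collection = [
        {"id": 1, "name": "Ann", "tags": ["vip"]},
        {"id": 2, "name": "Bo", "social": {"handle": "@bo"}},  # different fields
    ]

    # Simple query: find documents that carry a given field/value.
    vips = [doc for doc in collection if "vip" in doc.get("tags", [])]
    print(vips)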
There are a number of advantages to using the web to access an organization's internal databases.
First, everyone knows how to use web browser software, and employees require much less training than if they used proprietary query tools. Second, the web interface requires few or no changes to the internal database. Companies leverage their investments in older systems because it costs much less to add a web interface in front of a legacy system than to redesign and rebuild the system to improve user access.
information policy
Formal rules governing the maintenance, distribution, and use of information in an organization. (book) An information policy specifies the organization's rules for sharing, disseminating, acquiring, standardizing, classifying, and inventorying information. Information policies identify which users and organizational units can share information, where information can be distributed, and who is responsible for updating and maintaining the information.
Analytic Platforms
High-speed platforms using both relational and non-relational technology that are optimized for analyzing large data sets. (book) Analytic platforms such as IBM PureData System for Analytics feature preconfigured hardware-software systems that are specifically designed for query processing and analytics. Analytic platforms also include in-memory systems and NoSQL non-relational database management systems.
CGI script
A program written in a programming language that communicates with the web server. (book) A CGI script is a compact program that uses the Common Gateway Interface (CGI) specification for processing data on a web server.
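A minimal CGI sketch in Python: under the CGI specification, the web server passes request data to the script through environment variables and standard input, and the script writes the HTTP response to standard output:

    #!/usr/bin/env python3
    # Minimal CGI script: read a query-string parameter, emit a header
    # line, a blank line, then the response body.
    import os
    from urllib.parse import parse_qs

    params = parse_qs(os.environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]

    print("Content-Type: text/html")
    print()                                   # blank line ends the headers
    print(f"<html><body>Hello, {name}</body></html>")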
A data lake
A storage repository that holds a vast amount of raw data in its original format until the business needs it. (book) A data lake is a repository for raw unstructured data or structured data that for the most part have not yet been analyzed, and the data can be accessed in many ways. The data lake stores these data in their native format until they are needed.
data quality audit
A structured survey of the accuracy and level of completeness of the data in an information system. (book) Data quality audits can be performed by surveying entire data files, surveying samples from data files, or surveying end users for their perceptions of data quality.
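A sketch of one audit measurement, completeness per field, over a sample of records (the field names are hypothetical):

    # Data quality audit sketch: report what share of sampled records
    # has each field filled in.
    records = [
        {"customer": "Ann", "email": "ann@x.com", "phone": None},
        {"customer": "Bo",  "email": None,        "phone": "555-0100"},
    ]

    for field in ("customer", "email", "phone"):
        filled = sum(1 for r in records if r.get(field))
        print(f"{field}: {filled / len(records):.0%} complete")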
sentiment analysis
A technique that allows marketers to analyze data from social media sites to collect consumer comments about companies and their products. (book) Sentiment analysis software can mine text comments in an email message, blog, social media conversation, or survey form to detect favorable and unfavorable opinions about specific subjects.
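A toy lexicon-based scorer showing the idea; commercial sentiment analysis software relies on much larger lexicons or trained models:

    # Sentiment sketch: count positive minus negative words per comment.
    POSITIVE = {"great", "love", "excellent"}
    NEGATIVE = {"poor", "hate", "broken"}

    def score(comment):
        words = comment.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    print(score("love the phone and the excellent battery"))   #  2, favorable
    print(score("poor build and the screen is broken"))        # -2, unfavorable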
text mining
Analyzes unstructured data to find trends and patterns in words and sentences. (book) Text mining tools are now available to help businesses analyze these data. These tools can extract key elements from unstructured big data sets, discover patterns and relationships, and summarize the information.
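One simple text mining step, extracting the most frequent terms after filtering common stop words (the stop list and text are made up):

    # Text mining sketch: surface frequently occurring terms.
    from collections import Counter

    STOP = {"the", "a", "is", "and", "of", "to"}
    text = "the shipment is late and the shipment is damaged"

    terms = [w for w in text.lower().split() if w not in STOP]
    print(Counter(terms).most_common(3))   # [('shipment', 2), ('late', 1), ('damaged', 1)]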
Database administration
A person or department that develops procedures and practices to ensure efficient and orderly multiuser processing of the database, to control changes to database structure, and to protect the database. (book) A large organization will also have a database design and management group within the corporate information systems division that is responsible for defining and organizing the structure and content of the database and maintaining it. In close cooperation with users, the design group establishes the physical database, the logical relations among elements, and the access rules and security procedures. The functions it performs are called database administration.
sequences
events linked over time.
web content mining
Extracting textual information from web documents. (book) Web content mining is the process of extracting knowledge from the content of web pages, which may include text, image, audio, and video data.
Data administration
is responsible for the specific policies and procedures through which data can be managed as an organizational resource. These responsibilities include developing information policy, planning for data, overseeing logical database design and data dictionary development, and monitoring how information systems specialists and end-user groups use data.
Associations
occurrences linked to a single event.
Hadoop
An open source software framework managed by the Apache Software Foundation that enables distributed parallel processing of huge amounts of data across inexpensive computers. It breaks a big data problem down into sub-problems, distributes them among up to thousands of inexpensive computer processing nodes, and then combines the results into a smaller data set that is easier to analyze. For handling unstructured and semi-structured data in vast quantities, as well as structured data, organizations are using Hadoop.
Classification
recognizes patterns that describe the group to which an item belongs by examining existing items that have been classified and by inferring a set of rules.
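A minimal classification sketch: assign a new item to the group of its nearest already-classified example (one made-up feature, monthly spend):

    # Classification sketch: one-nearest-neighbor on a single feature.
    classified = [(120, "high value"), (35, "low value"), (90, "high value")]

    def classify(spend):
        nearest = min(classified, key=lambda item: abs(item[0] - spend))
        return nearest[1]

    print(classify(100))   # 'high value' (closest to the 90 example)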
The NoSQL database can handle structured, semi-structured, and unstructured information without requiring tedious, expensive, and time-consuming database mapping to normalize all data to a rigid schema, as required by relational databases.
A data mart
A subset of a data warehouse in which only a focused portion of the data warehouse information is kept. (book) A data mart is a subset of a data warehouse in which a summarized or highly focused portion of the organization's data is placed in a separate database for a specific population of users.
in-memory computing
Technology for very rapid analysis and processing of large quantities of data by storing the data in the computer's main memory rather than in secondary storage. (book) In-memory computing relies primarily on a computer's main memory (RAM) for data storage. (Conventional DBMS use disk storage systems.) Users access data stored in the system's primary memory, thereby eliminating bottlenecks from retrieving and reading data in a traditional, disk-based database and dramatically shortening query response times.
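A quick illustration with SQLite, which can keep an entire database in RAM; the ':memory:' connection string is standard SQLite, everything else is an invented example:

    # In-memory database sketch: data lives in RAM, so queries avoid
    # disk retrieval entirely.
    import sqlite3

    conn = sqlite3.connect(":memory:")   # RAM-resident, not on disk
    conn.execute("CREATE TABLE sales (amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (25.5,)])
    print(conn.execute("SELECT SUM(amount) FROM sales").fetchone())   # (35.5,)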
Big Data
The exponential growth in the volume, variety, and velocity of information and the development of complex, new tools to analyze and create meaning from such data. (book) Big data describes data sets with volumes so huge that they are beyond the ability of a typical DBMS to capture, store, and analyze. Big data is often characterized by the "3Vs": the extreme volume of data, the wide variety of data types and sources, and the velocity at which the data must be processed. Big data doesn't designate any specific quantity but usually refers to data in the petabyte and exabyte range; in other words, billions to trillions of records, respectively, from different sources. Big data are produced in much larger quantities and much more rapidly than traditional data.
Data mining
The process of analyzing data to extract information not offered by the raw data alone. (book) Data mining is more discovery-driven. Data mining provides insights into corporate data that cannot be obtained with OLAP by finding hidden patterns and relationships in large databases and inferring rules from them to predict future behavior. A typical use of data mining is to provide detailed analyses of patterns in customer data for one-to-one marketing campaigns or for identifying profitable customers.
Data that are inaccurate, untimely, or inconsistent with other sources of information create serious operational and financial problems for businesses, even with a well-designed database and information policy.
forecasting
uses predictions in a different way. It uses a series of existing values to forecast what other values will be.
Clustering
works in a manner similar to classification when no groups have yet been defined.
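A tiny one-dimensional k-means sketch that discovers two groups in purchase amounts with no labels defined in advance (all numbers invented):

    # Clustering sketch: alternate between assigning points to the
    # nearest center and recomputing each center as its group's mean.
    amounts = [5, 8, 7, 95, 102, 99]
    centers = [float(min(amounts)), float(max(amounts))]  # initial guesses

    for _ in range(10):
        groups = [[], []]
        for x in amounts:
            nearest = min((0, 1), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) for g in groups]

    print(groups)    # [[5, 8, 7], [95, 102, 99]]
    print(centers)   # centers near 6.7 and 98.7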