Chapter 6: Deep Learning and Cognitive Computing

In backpropagation, the learning algorithm includes the following procedures:

1. Initialize weights with random values and set other parameters.
2. Read in the input vector and the desired output.
3. Compute the actual output via the calculations, working forward through the layers.
4. Compute the error.
5. Change the weights by working backward from the output layer through the hidden layers.
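
The five steps can be sketched in plain Python for a tiny network with two inputs, two hidden neurons, and one output. This is only an illustration under stated assumptions: sigmoid transfer functions, a squared-error measure, no bias terms, and a made-up training pair and learning rate.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hidden, w_out, lr=0.5):
    # Step 3: compute the actual output, working forward through the layers.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    # Step 4: compute the error (desired output minus actual output).
    error = target - y
    # Step 5: work backward from the output layer through the hidden layer.
    delta_out = error * y * (1.0 - y)
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(w_out)):
        w_out[j] += lr * delta_out * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += lr * delta_hidden[j] * x[i]
    return abs(error)

# Step 1: initialize weights with random values (seed fixed for repeatability).
random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(2)]

# Step 2: read in the input vector and the desired output (hypothetical values).
x, target = [1.0, 0.5], 1.0

errors = [train_step(x, target, w_hidden, w_out) for _ in range(200)]
print(errors[0], errors[-1])  # the error shrinks as the weights are adjusted
```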

regularization

A large group of strategies known as — strategies is designed to prevent models from overfitting by making changes or defining constraints for the model parameters or the performance function.

overfitting

It happens when the trained model is highly fitted to the training data set but performs poorly with regard to external data sets. It causes serious issues with respect to the generalizability of the model.

Training

The process of adjusting weights and biases in a neural network

tensors

Multi-dimensional arrays used as inputs in deep neural networks. They provide us with the ability to include additional dimensions (e.g., time, location) in analyzing the data sets.

performance function (F)

(a.k.a. cost function or loss function) A supervised learning problem is characterized by defining a ___ and optimizing (minimizing) that function by changing model parameters.

How do you minimize delta errors?

By adjusting the network's weights. The key is to change the weights in the proper direction, making changes that reduce delta (i.e., error).

neurons are processing units (also called processing elements [PEs])

that perform a set of predefined mathematical operations on the numerical values coming from the input variables or from the other neuron outputs to create and push out its own outputs.

Google Lens

An application that uses deep learning artificial neural network algorithms (along with other AI techniques) to deliver information about the images users capture of nearby objects. This involves identifying the objects, products, plants, animals, and locations and providing information about them on the Internet.

error

The difference between the actual output and the target. Usually, the performance function is nothing but a measure of ___.

convolution layer

A layer containing a convolution function in a CNN

iterative approach to solving a nonlinear optimization problem that is very similar in meaning to the one characterizing multiple linear regression.

A more complicated expression can be derived to work backward in a similar way from the output neurons through the hidden layers to calculate the corrections to the associated weights of the inner neurons.

Processing Element (PE)

A neuron in a neural network. Each neuron receives inputs, processes them, and delivers a single output. The input can be raw input data or the output of other processing elements. The output can be the final result (e.g., 1 means yes, 0 means no), or it can be input to other neurons.

Compute Unified Device Architecture (CUDA)

A parallel computing architecture and API for breaking down larger processing tasks into smaller tasks and processing them simultaneously on a GPU.

Equivariance

Another benefit of parameter sharing in the convolution function: a convolution layer in a network will have a property called ___ to translation. It simply means that any changes in the input will lead to a change in the output in the same way. Apart from image-processing applications, this feature is especially useful for analyzing time-series data using convolutional networks, where convolution can produce a kind of timeline that shows when each feature appears in the input.

deep belief networks (DBNs)

As the depth of a network increases, the chances of reaching a global optimum using random initializations with the gradient-based algorithms decrease. In such cases, pretraining the network parameters using some unsupervised deep learning methods such as ______ can be helpful. The introduction of ______ in 2006 is considered the beginning of the current deep learning renaissance.

two main reasons why the initial idea of deep learning had to wait more than two decades to advance:

1. Computing capabilities needed to be more advanced. 2. Large and feature-rich digitized data sets were not available.

difference between traditional ML and DL

DL mimics the thought process of humans using mathematical algorithms to learn from data. It has the capacity to automatically learn the knowledge necessary to do informal tasks and, as a result, to extract some advanced features that improve system performance. Unlike traditional ML, you do not have to hand-feed features to the algorithm.

representation learning

Deep learning is a member of the family of techniques known as representation learning, according to Goodfellow and his coworkers (2016). Representation learning techniques are a type of machine learning (which is itself a part of AI) in which the emphasis is on the system learning and discovering features in addition to learning the mapping from those features to the output/target.

Input

Each corresponds to a single attribute. The numeric value, or the numeric representation of non-numeric value, of an attribute is the input to the network. Several types of data, such as text, picture, and voice, can be used as inputs. Preprocessing may be needed to convert the data into meaningful —— from symbolic/non-numeric data or to numeric/scale data.

delta

Error: the difference between the actual output (Y or YT) and the desired output (Z) for a given set of inputs

_______ has enabled us to successfully run neural networks with over a million neurons.

GPU technology

word2vec project

A project published by Google that caused the applications of deep learning for text mining to increase remarkably. It is a two-layer neural network that takes a large text corpus as the input and converts each word in the corpus to a numeric vector of any given size (typically ranging from 100 to 1,000) with very interesting features. It is not a deep learning algorithm. The vectors are specified in such a way that those of words with a similar context are placed very close to each other in the n-dimensional vector space. It is able to handle and correctly represent words including typos, abbreviations, and informal conversations. The algorithm maintains the words' relative associations.

threshold value

If an output value is smaller than the threshold value, it will not be passed to the next level of neurons. For example, with a threshold of 0.5, any value of 0.5 or less becomes 0, and any value above 0.5 becomes 1. Sometimes a ___ is used instead of a transformation function.
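
A minimal sketch of the 0.5 cutoff described above. The function name `apply_threshold` is made up for illustration.

```python
def apply_threshold(value, threshold=0.5):
    # Values at or below the threshold become 0; values above it become 1.
    return 1 if value > threshold else 0

print([apply_threshold(v) for v in (0.2, 0.5, 0.51, 0.9)])
```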

Weights and feedback roles are called what in dynamic networks?

In a dynamic network, the weights are called the long-term memory, while the feedbacks play the role of the short-term memory.

Define long short-term memory (LSTM) network

Regular RNNs have difficulty learning long-term dependencies; LSTM networks do not have such a shortcoming. The term long short-term memory network refers to a network in which we are trying to remember what happened in the past (i.e., feedbacks; previous outputs of the layers) for a long enough time so that it can be used/leveraged in accomplishing the task when needed.

AlexNet

One of the first convolutional networks designed for image classification using the ImageNet data set was —-. It was composed of five convolution layers followed by three fully connected (a.k.a. dense) layers. One of the contributions of this relatively simple architecture that made its training remarkably faster and computationally efficient was the use of rectified linear unit (ReLu) transfer functions in the convolution layers instead of the traditional sigmoid functions. By doing so, the designers addressed the issue called the vanishing gradient problem caused by very small derivatives of sigmoid functions in some regions of the images. The other important contribution of this network that has a dramatic role in improving the efficiency of deep networks was the introduction of the concept of dropout layers to the CNNs as a regularization technique to reduce overfitting. A dropout layer typically comes after the fully connected layers and applies a random probability to the neurons to switch off some of them and make the network sparser.

convolution function

The solution to the following problem: "In an image-processing task using a neural network for images of size 150 * 150 pixels, each input matrix will contain 22,500 (i.e., 150 times 150) integers, each of which should be assigned its own weight parameter per each neuron it goes into throughout the network. Therefore, having even only a single layer requires thousands of weight parameters to be defined and trained. As one might guess, this fact would dramatically increase the required time and processing power to train a network, since in each training iteration, all of those weight parameters have to be updated by the SGD algorithm." The trick is called parameter sharing. Specifically, in a convolution layer, instead of having a weight for each input, there is a set of weights referred to as the convolution kernel or filter, which is shared between inputs and moves around the input matrix to produce the outputs. An additional benefit of parameter sharing: by applying the convolution operation, we actually convert the input matrix into an output in which the parts that have a particular feature (reflected by the kernel) stand out. This characteristic of convolution functions is especially useful in practical image-processing applications.
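
The sliding-kernel idea can be sketched in plain Python. The helper `convolve2d`, the toy image, and the 1 * 2 edge-detecting kernel are all illustrative assumptions. As is common in deep learning, the kernel is applied without flipping it.

```python
def convolve2d(image, kernel):
    # Slide one shared kernel over the image with a stride of 1 and no padding.
    r, c = len(kernel), len(kernel[0])
    out_rows = len(image) - r + 1
    out_cols = len(image[0]) - c + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = sum(kernel[k][l] * image[i + k][j + l]
                        for k in range(r) for l in range(c))
            row.append(total)
        output.append(row)
    return output

# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1]]  # responds where brightness increases from left to right

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

The same two kernel weights are reused at every position, and the nonzero column in the output marks where the feature (the edge) appears.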

The main contribution of the GoogLeNet architecture was to introduce a module called Inception, which is based on this idea:

The idea of Inception is that because one would have no idea of the size of convolution kernel that would perform best on a particular data set, it is better to include multiple convolutions and let the network decide which one to use. Therefore, as shown in Figure 6.27, in each convolution layer, the data coming from the previous layer is passed through multiple types of convolution and the outputs are concatenated before going to the next layer. Such architecture allows the model to take into account both local features via smaller convolutions and highly abstracted features via larger ones.

R * 1

The input to a regular ANN model is typically an array of size R * 1, where R is the number of input variables.

Recurrent neural networks (RNN)

Neural networks in which feedback connections are allowed. This category includes general RNN architectures as well as a specific variation of RNNs called long short-term memory (LSTM) networks.

Transfer Function

The relationship between the internal activation level and the output can be linear or nonlinear. The relationship is expressed by one of several types of ___. Selection of the specific activation function affects the network's operation. The function modifies the output levels to fit within a reasonable range of values. Without such a transformation, the value of the output becomes very large, especially when there are several layers of neurons. The transformation can occur at the output of each processing element, or it can be performed only at the final output nodes.
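
A short sketch of three commonly used transfer functions (sigmoid, hyperbolic tangent, and ReLU), showing how the first two squeeze even very large activation levels into a bounded range. The sample activation levels are arbitrary.

```python
import math

def sigmoid(x):
    # Maps any activation level into the range 0 to 1.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Passes positive activation levels through and zeroes out the rest.
    return max(0.0, x)

activation_levels = [-50.0, -1.0, 0.0, 1.0, 50.0]
for s in activation_levels:
    print(s, sigmoid(s), math.tanh(s), relu(s))
```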

weights

They express the relative strength (or mathematical value) of the input data or the many connections that transfer data from layer to layer. In other words, ___ express the relative importance of each input to a processing element and, ultimately, to the output. They are crucial in that they store learned patterns of information. It is through repeated adjustments of ___ that a network learns.

Popular software libraries developed for programming GPUs for general-purpose processing (just as CPUs), and particularly for deep learning and analysis of Big Data

Torch, Caffe, TensorFlow, Theano, and Keras

dropout layers

Used in CNNs as a regularization technique to reduce overfitting. A dropout layer typically comes after the fully connected layers and applies a random probability to the neurons to switch off some of them and make the network sparser.
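
A minimal sketch of what a dropout layer does to a layer's outputs during training. Rescaling the surviving outputs by the keep probability ("inverted dropout") is a common convention assumed here, not something the card specifies. The activations and drop probability are made up.

```python
import random

def dropout(activations, drop_prob, rng):
    # Switch each neuron off with probability drop_prob; scale up the survivors
    # so the expected total signal stays the same (assumed convention).
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() >= drop_prob else 0.0
            for a in activations]

rng = random.Random(42)
layer_output = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]
sparse_output = dropout(layer_output, drop_prob=0.5, rng=rng)
print(sparse_output)  # some entries are zeroed, making the layer sparser
```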

vanishing gradient problem

Using rectified linear unit (ReLu) transfer functions in the convolution layers instead of the traditional sigmoid functions addressed the issue called —— caused by very small derivatives of sigmoid functions in some regions of the images.
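
A rough numeric illustration of why small sigmoid derivatives are a problem. During backpropagation the gradient is multiplied by one derivative per layer, so the sketch below raises a single derivative to the power of the depth. This simplification ignores the weights, and the depth and activation level are arbitrary.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # never larger than 0.25

def relu_derivative(x):
    return 1.0 if x > 0 else 0.0

depth = 10  # number of layers the gradient passes through (hypothetical)
x = 2.0     # activation level at each layer (hypothetical)

sigmoid_chain = sigmoid_derivative(x) ** depth
relu_chain = relu_derivative(x) ** depth
print(sigmoid_chain, relu_chain)  # the sigmoid product is vanishingly small
```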

Pretraining an MLP using unsupervised deep learning methods such as deep belief networks (DBNs)

Using these unsupervised learning methods, we can train the MLP layers, one at a time, starting from the first layer, using the output of each layer as the input to the subsequent layer and initializing that layer with an unsupervised learning algorithm. At the end, we will have a set of initialized values for the parameters across the whole network. Those pretrained parameters, instead of randomly initialized parameters, then can be used as the initial values in the supervised learning of the MLP. This pretraining procedure has been shown to cause significant improvements to the deep classification applications.

neural network

A category of AI that attempts to emulate the way the human brain works. It is composed of processing elements that are organized in different ways to form the network's structure. The basic processing unit in a neural network is the neuron. A number of neurons are then organized to establish a network of neurons. Neurons can be organized in a number of different ways; these various network patterns are referred to as topologies or network architectures. One of the most popular approaches, known as the feedforward-multilayered perceptron, allows all neurons to link the output in one layer to the input of the next layer, but it does not allow any feedback linkage.

According to the universal approximation theorem

a sufficiently large single-layer MLP network will be able to approximate any function. Although theoretically founded, such a layer with many neurons may be prohibitively large and hence may fail to learn the underlying patterns correctly. Whereas theoretically it is still an open research question, practically using more layers in a network seems to be more effective and computationally more efficient than using many neurons in a few layers.

stochastic gradient descent (SGD)

an iterative gradient-based optimizer used for finding the minimum (i.e., the lowest point) in performance functions, as in the case of neural networks. The idea behind — algorithm is that the derivative of the performance function with respect to each current weight or bias indicates the amount of change in the error measure by each unit of change in that weight or bias element. These derivatives are referred to as network gradients
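
A small sketch of the idea, assuming a one-input linear model and a squared-error performance function. The data points, learning rate, and function name `sgd_fit` are illustrative choices. Each weight and bias is moved against its gradient, one randomly ordered sample at a time.

```python
import random

def sgd_fit(samples, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # the "stochastic" part: visit samples in random order
        for x, y in data:
            error = (w * x + b) - y
            grad_w = 2.0 * error * x  # derivative of squared error w.r.t. w
            grad_b = 2.0 * error      # derivative of squared error w.r.t. b
            w -= lr * grad_w          # step in the direction that reduces error
            b -= lr * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = sgd_fit(samples)
print(w, b)  # close to 2 and 1
```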

Keras

An open-source neural network library written in Python that functions as a high-level application programming interface (API) and is able to run on top of various deep learning frameworks including Theano and TensorFlow. In essence, ___, just by getting the key properties of network building blocks (i.e., type of layers, transfer functions, and optimizers) via an extremely simple syntax, automatically generates syntax in one of the deep learning frameworks and runs that framework in the backend. While ___ is efficient enough to build and run general deep learning models in just a few minutes, it does not provide several advanced operations provided by TensorFlow or Theano. Therefore, in dealing with special deep network models that require advanced settings, one still needs to directly use those frameworks instead of ___ (or other APIs such as Lasagne) as a proxy.

MLP deep networks, also known as deep feedforward networks

are the most general type of deep networks. These networks are simply large-scale neural networks that can contain many layers of neurons and handle tensors as their input. The types and characteristics of the network elements (i.e., weight functions, transfer functions) are pretty much the same as in the standard ANN models. These models are called feedforward because the flow of information that goes through them is always forwarding and no feedback connections (i.e., connections in which outputs of a model are fed back to itself) are allowed. Generally, a sequential order of layers has to be held between the input and the output layers in the ____ type network architecture. This means that the input vector has to pass through all layers sequentially and cannot skip any of them; moreover, it cannot be directly connected to any layer except for the very first one; the output of each layer is the input to the subsequent layer.

LSTM networks

are variations of recurrent neural networks that today are known as the most effective sequence modeling technique and are the base of many practical applications. Regular RNNs have difficulty learning long-term dependencies; ___ do not have such a shortcoming. The term long short-term memory network refers to a network in which we are trying to remember what happened in the past (i.e., feedbacks; previous outputs of the layers) for a long enough time so that it can be used/leveraged in accomplishing the task when needed.

ImageNet

by far the most widely used benchmarking data set to assess the efficiency and accuracy of deep networks designed by deep learning researchers.

multilayer perceptron networks

can also be used for various prediction, classification, and clustering purposes. Especially when a large number of input variables are involved or in cases that the nature of input has to be an N -dimensional array, a deep multilayer network design needs to be employed.

Genetic Algorithms (GAs)

can be used to guide the selection of the network parameters to maximize the performance of the desired output. In fact, most commercial ANN software tools are now using GA to help users "optimize" the network parameters in a semiautomated manner

The summation function

computes the weighted sums of all input elements entering each processing element. It multiplies each input value by its weight and totals the values for a weighted sum. The result is the internal stimulation, or activation level, of the neuron.
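
A minimal sketch of the weighted sum for a single processing element. The input values and weights are made up.

```python
def weighted_sum(inputs, weights):
    # Multiply each input by its weight and total the products.
    return sum(x * w for x, w in zip(inputs, weights))

inputs = [3.0, 1.0, 2.0]
weights = [0.2, 0.4, 0.1]
activation_level = weighted_sum(inputs, weights)
print(activation_level)  # approximately 1.2 (0.6 + 0.4 + 0.2)
```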

Output

Contains the solution to a problem. The ANN assigns numeric values to the output, which may then need to be converted into categorical output using a threshold value so that the results would be 1 for "yes" and 0 for "no."

pooling (a.k.a. subsampling) layer.

A convolution layer is often followed by a ___. These layers are in charge of consolidating the large tensors to one with a smaller size and reducing the number of model parameters while keeping their important features. Normally, this involves an r * c consolidation window (similar to a kernel in the convolution function) that moves around the input matrix and in each move calculates some summary statistics (e.g., the maximum or the average) of the elements involved in the consolidation window so that it can be put in the output image. Given the size of the consolidation window (i.e., r and c), the stride should be carefully selected so that there are no overlaps in the consolidations. The operation using an r * c consolidation window reduces the number of rows and columns of the input matrix by a factor of r and c, respectively. It is especially useful in the image-processing applications of deep learning in which the critical task is to determine whether a feature (e.g., a particular animal) is present in an image while the exact spatial location of the feature in the picture is not important. However, if the location of features is important in a particular context, applying a ___ function could potentially be misleading. Sometimes it is used just to modify the size of matrices coming from the previous layer and convert them to a specified size required by the following layer in the network.
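
A sketch of the consolidation window, assuming the maximum as the summary statistic (max pooling) and a stride equal to the window size so that windows do not overlap. The 4 * 4 input matrix is a toy example.

```python
def max_pool(matrix, r, c):
    # Move an r * c window in steps of r rows and c columns (no overlaps)
    # and keep the largest value found in each window.
    rows, cols = len(matrix), len(matrix[0])
    return [
        [max(matrix[i + k][j + l] for k in range(r) for l in range(c))
         for j in range(0, cols - c + 1, c)]
        for i in range(0, rows - r + 1, r)
    ]

matrix = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool(matrix, 2, 2)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4 * 4 input shrinks to 2 * 2, a reduction by a factor of r = 2 in rows and c = 2 in columns.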

differences in the steps/tasks that need to be performed when building a typical deep learning model versus the steps/tasks performed when building models with classic machine-learning algorithms

Deep learning is able to deal with more complicated tasks with a higher level of sophistication by employing many layers of connected neurons along with much larger data sets to automatically characterize variables and solve the problems, but only at the expense of a great deal of computational effort.

AI is reentering the world, faster and stronger because of

Deep learning and cognitive computing. Both are helping to make accurate and timely decisions by harnessing the rapidly expanding Big Data resources.

Theano

A library used to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays (i.e., tensors) on CPU or GPU platforms. ___ was one of the first deep learning frameworks and later became a source of inspiration for the developers of TensorFlow. ___ and TensorFlow both pursue a similar procedure in the sense that in both a typical network implementation involves two sections: in the first section, a computational graph is built by defining the network variables and operations to be done on them; the second section runs that graph (in ___ by compiling the graph into a function and in TensorFlow by creating a session). What happens in these libraries is that the user defines the structure of the network by providing some simple and symbolic syntax understandable even for beginners in programming, and the library automatically generates appropriate code in either C (for processing on CPU) or CUDA (for processing on GPU) to implement the defined network. ___ also includes some built-in functions to visualize computational graphs as well as to plot the network performance metrics, even though its visualization features are not comparable to TensorBoard.

multi-instance learning problem

The assumption behind the distant supervision method was later relaxed by Riedel, Yao, and McCallum (2010), who modeled the problem as a ___. They suggest assigning labels to a bag of instances rather than a single instance, which can reduce the noise of the distant supervision method and create more realistic labeled training data sets.

Parameter Sharing

Effectively reduces the required time and processing power to train the network by reducing the number of weight parameters. Specifically, in a convolution layer, instead of having a weight for each input, there is a set of weights referred to as the convolution kernel or filter, which is shared between inputs and moves around the input matrix to produce the outputs. The kernel is typically represented as a small matrix W of size r * c that is applied to a given input matrix V.
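
The saving can be made concrete with a little arithmetic. The 150 * 150 image size comes from the convolution function card; the 100-neuron layer and the 3 * 3 kernel are hypothetical sizes chosen only for comparison.

```python
image_rows, image_cols = 150, 150
inputs_per_image = image_rows * image_cols  # 22,500 input values

# Without sharing: every input gets its own weight for every neuron it feeds.
dense_neurons = 100  # hypothetical layer size
dense_weights = inputs_per_image * dense_neurons

# With sharing: one small kernel W of size r * c is reused across the image.
kernel_rows, kernel_cols = 3, 3  # hypothetical kernel size
conv_weights = kernel_rows * kernel_cols

print(dense_weights, conv_weights)  # 2250000 versus 9
```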

Artificial Neural Networks (ANNs)

essentially simplified abstractions of the human brain and its complex biological networks of neurons.

Sensitivity analysis

Has been the front-runner of the techniques proposed for shedding light on the black-box characterization of trained neural networks. It is a method for extracting the cause-and-effect relationships among the inputs and the outputs of a trained neural network model. In the process of performing sensitivity analysis, the trained neural network's learning capability is disabled so that the network weights are not affected. The basic procedure behind sensitivity analysis is that the inputs to the network are systematically manipulated within the allowable value ranges, and the corresponding change in the output is recorded for each and every input variable.
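
The basic procedure can be sketched as follows. The "trained model" here is a stand-in function with fixed coefficients (so nothing is learned during the analysis), and the input names `x1` and `x2`, their ranges, and the baseline values are all hypothetical.

```python
def sensitivity(model, baseline, ranges, steps=11):
    # Vary one input at a time across its allowable range, holding the others
    # at their baseline values, and record how much the output moves.
    effects = {}
    for name, (low, high) in ranges.items():
        outputs = []
        for s in range(steps):
            probe = dict(baseline)
            probe[name] = low + (high - low) * s / (steps - 1)
            outputs.append(model(probe))
        effects[name] = max(outputs) - min(outputs)
    return effects

def trained_model(inputs):
    # Stand-in for a trained network whose weights are frozen.
    return 3.0 * inputs["x1"] + 0.5 * inputs["x2"]

baseline = {"x1": 0.5, "x2": 0.5}
ranges = {"x1": (0.0, 1.0), "x2": (0.0, 1.0)}
effects = sensitivity(trained_model, baseline, ranges)
print(effects)  # x1 moves the output far more than x2
```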

larger strides, or kernel movements

If we want the output matrix to be even smaller, we can use ___. Normally, the kernel is moved one step at a time (i.e., stride = 1) when performing the convolution. By increasing this stride to 2, the size of the output matrix is reduced by a factor of 2.
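
The effect of the stride on the output size along one dimension can be checked with the usual no-padding formula. The input and kernel sizes below are only examples.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    # Number of kernel positions along one dimension, without padding.
    return (input_size - kernel_size) // stride + 1

size_stride_1 = conv_output_size(150, 3, stride=1)
size_stride_2 = conv_output_size(150, 3, stride=2)
print(size_stride_1, size_stride_2)  # 148 74
```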

LSTM networks have been widely used in many sequence modeling applications, including

Image captioning, handwriting recognition and generation, machine translation, speech recognition, and parsing

A hidden layer

is a layer of neurons that takes input from the previous layer and converts those inputs into outputs for further processing. Several ___ layers can be placed between the input and output layers, although it is common to use only one layer. In that case, the ___ layer simply converts inputs into a nonlinear combination and passes the transformed inputs to the output layer. The most common interpretation of the ___ layer is as a feature-extraction mechanism; that is, the ___ layer converts the original inputs in the problem into a higher-level combination of such inputs.

Recurrent Neural Network (RNN)

A type of neural network specifically designed to process sequential inputs. An RNN basically models a dynamic system where (at least in one of its hidden neurons) the state of the system (i.e., output of a hidden neuron) at each time point t depends on both the inputs to the system at that time and its state at the previous time point t - 1. In other words, RNNs are the type of neural networks that have memory and that apply that memory to determine their future outputs. For tasks such as playing chess or writing an essay, it is crucial to consider a memory element for the neural network that takes into account the effect of prior moves (in the chess example) and prior sentences and paragraphs (in the essay example) to determine the best output. This memory portrays and creates the context required for the learning and understanding. An RNN models sequential interactions through a hidden state, or memory. It can take up to N inputs and produce up to N outputs. In static networks, the output depends only on the current input; in dynamic networks like RNNs, on the other hand, both inputs and outputs are sequences (patterns). Therefore, a dynamic network is a dynamic system rather than a function because its output depends not only on the input but also on the previous outputs.
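
The dependence of the state at time t on the state at time t - 1 can be sketched with a single hidden neuron. The tanh transfer function and the weights `w_x` and `w_h` are illustrative assumptions.

```python
import math

def rnn_run(inputs, w_x=0.8, w_h=0.5, h0=0.0):
    # The state at time t depends on the input at t and the state at t - 1.
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

with_pulse = rnn_run([1.0, 0.0, 0.0])
without_pulse = rnn_run([0.0, 0.0, 0.0])
print(with_pulse)
print(without_pulse)
```

Both sequences end with the same two inputs (0.0, 0.0), yet the final states differ because the first run still "remembers" the initial input.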

There are two alternative approaches for computing the gradients in the RNNs:

Real-time recurrent learning (RTRL) and backpropagation through time (BTT, more commonly abbreviated BPTT)

Caffe

An open-source deep learning framework that has multiple options to be used as a high-level scripting language, including the command line, Python, and MATLAB interfaces. The deep learning libraries in ___ are written in the C++ programming language. Everything is done using text files instead of code. Generally, we need to prepare two text files with the .prototxt extension, which the Caffe engine reads in a human-readable, JSON-like text format (Protocol Buffers text format). The first text file, known as the architecture file, defines the architecture of the network layer by layer, where each layer is defined by a name, a type (e.g., data, convolution, output), the names of its previous (bottom) and next (top) layers in the architecture, and some required parameters (e.g., kernel size and stride for a convolutional layer). The second text file, known as the solver file, specifies the properties of the training algorithm, including the learning rate, maximum number of iterations, and processing unit (CPU or GPU) to be used for training the network. While ___ supports multiple types of deep network architectures like CNN and LSTM, it is particularly known to be an efficient framework for image processing due to its incredible speed in processing image files.

TensorFlow

An open-source deep learning framework. It is the only deep learning framework that, in addition to CPUs and GPUs, supports Tensor Processing Units (TPUs), a type of processor developed by Google in 2016 for the specific purpose of neural network machine learning. Google has not yet made TPUs available to the market. A detailed study performed by Google shows that TPUs deliver 30 to 80 times higher performance per watt than contemporary CPUs and GPUs. Another interesting feature of ___ is its visualization module, TensorBoard. TensorBoard refers to a Web application involving a handful of visualization tools to visualize network graphs and plot quantitative network metrics with the aim of helping users to better understand what is going on during training procedures and to debug possible issues.

torch

An open-source scientific computing framework for implementing machine-learning algorithms using GPUs. The ___ framework is a library based on LuaJIT, a compiled version of the popular Lua programming language. It adds a number of valuable features to Lua that make deep learning analyses possible: it enables supporting n-dimensional arrays (i.e., tensors), whereas tables (i.e., two-dimensional arrays) normally are the only data-structuring method used by Lua. Also, while Lua by default uses the CPU to run programs, ___ enables use of GPUs for running programs written in the Lua language. The easy and extremely fast scripting properties of LuaJIT, along with its flexibility, are among its strengths.

convolutional neural network (CNN)

A popular variation of the deep MLP architecture specifically designed for computer vision applications (e.g., image recognition, handwritten text processing). It is also applicable to nonimage data sets. Its main characteristic is having at least one layer involving a convolution weight function instead of general matrix multiplication.

Cognitive analytics

refers to cognitive computing-branded technology platforms, such as IBM Watson, that specialize in processing and analyzing large, unstructured data sets. Typically, word processing documents, e-mails, videos, images, audio files, presentations, Web pages, social media, and many other data formats need to be manually tagged with metadata before they can be fed into a traditional analytics engine and Big Data tools for computational analyses and insight generation. The principal benefit of utilizing cognitive analytics over those traditional Big Data analytics tools is that for cognitive analytics such data sets do not need to be pretagged. Cognitive analytics systems can use machine learning to adapt to different contexts with minimal human supervision. These systems can be equipped with a chatbot or search assistant that understands queries, explains data insights, and interacts with humans in human languages.

GPU technology has enabled us to

successfully run neural networks with over a million neurons.

distant supervision method

The big challenge in text mining approaches is the lack of sufficiently large annotated data sets for supervised training of deep networks. A distant supervision method of training has been proposed to address this challenge. It suggests that large amounts of training data can be produced by aligning knowledge base (KB) facts with texts. In fact, this approach is based on the assumption that if a particular type of relation exists between an entity pair (e.g., "A" is a component of "B") in the KB, then every text document containing the mention of the entity pair would express that relation. However, since this assumption was not very realistic, Riedel, Yao, and McCallum (2010) later relaxed it by modeling the problem as a multi-instance learning problem. They suggest assigning labels to a bag of instances rather than a single instance, which can reduce the noise of the distant supervision method and create more realistic labeled training data sets.

feedforward

the flow of information that goes through them is always forwarding and no feedback connections (i.e., connections in which outputs of a model are fed back to itself) are allowed.

Supervised Learning

The learning process is inductive; that is, connection weights are derived from existing cases. The usual process of learning involves three tasks: 1. Compute temporary outputs. 2. Compare outputs with desired targets. 3. Adjust the weights and repeat the process.

parallel processing

The processing of many aspects of a problem simultaneously. In an ANN, when information is processed, many of the processing elements perform their computations at the same time.

Cognitive Computing

The use of artificial intelligence techniques and access to vast amounts of data to simulate human problem solving in complex situations with ambiguity, changing data, and even conflicting information. A ___ system offers a synthesis not just of information sources but also of influences, contexts, and insights. To achieve such a high level of performance, cognitive systems often need to weigh conflicting evidence and suggest an answer that is "best" rather than "right." The term refers to the computing systems that use mathematical models to emulate (or partially simulate) the human cognition process to find solutions to complex problems and situations where the potential answers can be imprecise. It makes a new class of problems computable. It addresses highly complex situations that are characterized by ambiguity and uncertainty; in other words, it handles the kinds of problems that are thought to be solvable by human ingenuity and creativity. It can find and synthesize data from various information sources and weigh context and conflicting evidence inherent in the data to provide the best possible answers to a given question or problem. To achieve this, ___ systems include self-learning technologies that use data mining, pattern recognition, deep learning, and NLP to mimic the way the human brain works. These systems have the loftier goal of creating algorithms that mimic the human brain's reasoning process to help humans solve an array of problems as the data and the problems constantly change.

feedforward

There are no interconnections between the output of a processing element and the input of a node in the same layer or in a preceding layer.

The memory concept (i.e., remembering "what happened in the past") is incorporated in LSTM networks by adding four additional layers to the typical recurrent network architecture. Describe each.

Three gate layers, namely the input gate, forget (a.k.a. feedback) gate, and output gate, plus an additional layer called the Constant Error Carousel (CEC), also known as the state unit, which integrates those gates and makes them interact with the other layers. Each gate is nothing but a layer with two inputs, one from the network input and the other a feedback from the final output of the whole network. The gates use log-sigmoid transfer functions, so their outputs are between 0 and 1 and describe how much of each component (either input, feedback, or output) should be let through the network. The CEC is a layer that falls between the input and the output layers in a recurrent network architecture and applies the gates' outputs to make the short-term memory long. The gates control the flow of information through the network and dynamically change the time scale of integration based on the input sequence. As a result, LSTM networks are able to learn long-term dependencies among the sequence of inputs more easily than regular RNNs. We typically do not want to indiscriminately remember everything that has happened in the past, so gating provides the capability of remembering prior outputs selectively: the input gate allows selective inputs to the CEC; the forget gate clears unwanted previous feedbacks from the CEC; and the output gate allows selective outputs from the CEC.
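
A rough sketch of the gating described above, using the commonly published LSTM update equations rather than anything specific to the textbook. It models a single scalar unit; the parameter layout (`params` mapping each layer to a weight for the input, a weight for the feedback, and a bias), the tanh "candidate" layer, and all numeric values are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Log-sigmoid transfer function; its output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM (scalar toy version).

    p maps each layer name to a hypothetical (w_input, w_feedback, bias) triple.
    """
    def net(name):
        # Every layer sees two inputs: the network input and the fed-back output.
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(net("input_gate"))    # how much new input enters the state
    f = sigmoid(net("forget_gate"))   # how much of the old state is kept
    o = sigmoid(net("output_gate"))   # how much of the state is let out
    candidate = math.tanh(net("candidate"))
    c = f * c_prev + i * candidate    # state unit (the CEC in the text above)
    h = o * math.tanh(c)
    return h, c, {"input": i, "forget": f, "output": o}


# Arbitrary illustrative parameters, not trained values.
params = {
    "input_gate": (0.5, 0.1, 0.0),
    "forget_gate": (0.3, 0.2, 1.0),
    "output_gate": (0.4, 0.1, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c, gates = lstm_step(x, h, c, params)
```

With the forget gate near 1 and the input gate near 0, the state passes through a step almost unchanged, which is how the short-term memory is made long.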

Why do image classification networks traditionally involve two pipelines: visual feature extraction and image classification?

Today there are many large and feature-rich databases available for such applications. Nevertheless, the biggest challenge is that in supervised learning applications, one needs an already annotated (i.e., labeled) data set to train the model before it can be used for prediction/identification of other unknown cases. Whereas extracting features of data sets using CNN layers is an unsupervised task, the extracted features will not be of much use without labeled cases to develop a classification network in a supervised learning fashion.

Convolution

A linear operation, typically shown by the star-with-circle symbol, that essentially aims at extracting simple patterns from sophisticated data patterns.

relation extraction

Various types of deep networks have been applied to the word embeddings created by this algorithm to accomplish different objectives. In particular, a large group of studies has developed convolutional networks applied to the word embeddings with the aim of relation extraction from textual data sets. Relation extraction is one of the subtasks of natural language processing (NLP) that focuses on determining whether two or more named entities recognized in the text form specific relationships.

r - 1 and c - 1 in the convolution function

Convolving with an r × c kernel shrinks the output; for example, using a 2 × 2 kernel for convolution, the output matrix has 1 row and 1 column fewer than the input matrix. To prevent this change of size, we can pad the outside of the input matrix with zeros before convolving, that is, add r - 1 rows and c - 1 columns of zeros to the input matrix.
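
A small sketch of the size change and the zero-padding fix. It assumes a stride of 1 and applies the kernel without flipping it (as CNN layers typically do); placing the extra zeros at the bottom and right is also an assumption, since the card does not say where the padding goes. The helper names are made up for this example.

```python
def convolve2d_valid(matrix, kernel):
    """Slide the kernel over the matrix with no padding and a stride of 1."""
    rows, cols = len(matrix), len(matrix[0])
    r, c = len(kernel), len(kernel[0])
    out = []
    for i in range(rows - r + 1):
        out_row = []
        for j in range(cols - c + 1):
            out_row.append(sum(matrix[i + a][j + b] * kernel[a][b]
                               for a in range(r) for b in range(c)))
        out.append(out_row)
    return out


def zero_pad(matrix, extra_rows, extra_cols):
    """Add rows and columns of zeros (here at the bottom and right; an assumption)."""
    width = len(matrix[0]) + extra_cols
    padded = [list(row) + [0] * extra_cols for row in matrix]
    padded += [[0] * width for _ in range(extra_rows)]
    return padded


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]           # r = 2, c = 2

shrunk = convolve2d_valid(image, kernel)                  # 2 x 2 output
same_size = convolve2d_valid(zero_pad(image, 1, 1), kernel)  # 3 x 3 output
```

Without padding the 3 × 3 input yields a 2 × 2 output; padding with r - 1 = 1 row and c - 1 = 1 column restores the 3 × 3 size.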

For any output neuron j, error (delta) = (Zj - Yj) (df/dx)

where Z and Y are the desired and actual outputs, respectively. Using the sigmoid function, f = [1 + exp(-x)]^(-1), where x is proportional to the sum of the weighted inputs to the neuron, is an effective way to compute the output of a neuron in practice. With this function, the derivative is df/dx = f(1 - f), and the error is a simple function of the desired and actual outputs. The factor f(1 - f), the derivative of the logistic function, serves to keep the error correction well bounded. The weight of each input to the jth neuron is then changed in proportion to this calculated error.
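
The delta formula can be checked numerically with a short sketch. The function names, the learning rate, and the sample numbers are assumptions for illustration; the update rule shown (weight change proportional to delta times the input) is one common reading of "changed in proportion to this calculated error."

```python
import math


def sigmoid(x):
    """f = [1 + exp(-x)]^(-1)"""
    return 1.0 / (1.0 + math.exp(-x))


def output_delta(desired, weighted_sum):
    """Return (delta, actual) for one output neuron: delta = (Z - Y) * df/dx."""
    actual = sigmoid(weighted_sum)        # Y
    derivative = actual * (1.0 - actual)  # df/dx = f(1 - f)
    return (desired - actual) * derivative, actual


def update_weights(weights, inputs, delta, lr=0.5):
    """Change each input weight in proportion to delta (lr is a hypothetical rate)."""
    return [w + lr * delta * x for w, x in zip(weights, inputs)]


# With a weighted sum of 0, Y = 0.5 and df/dx = 0.25, so delta = (1 - 0.5) * 0.25.
delta, actual = output_delta(desired=1.0, weighted_sum=0.0)
new_weights = update_weights([0.0, 0.0], [1.0, 2.0], delta)
```

Since f(1 - f) never exceeds 0.25, the correction stays bounded even when the raw error (Z - Y) is large.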

Calculation of network gradients in the neural networks requires application of an algorithm called backpropagation,

which is the most popular neural network learning algorithm; it applies the chain rule of calculus to compute the derivatives of functions formed by composing other functions whose derivatives are known. A backpropagation network includes one or more hidden layers. This type of network is considered feedforward because there are no interconnections between the output of a processing element and the input of a node in the same layer or in a preceding layer. Externally provided correct patterns are compared with the neural network's output during (supervised) training, and feedback is used to adjust the weights until the network has categorized all training patterns as correctly as possible (the error tolerance is set in advance).
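
To make the chain-rule idea concrete, here is a minimal sketch for a toy network with one input, one hidden neuron, and one output neuron. The squared-error loss, the absence of biases, and all names and numbers are assumptions chosen for brevity, not details taken from the source.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, w2):
    """Work forward through the layers: input -> hidden -> output."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return h, y


def loss(x, z, w1, w2):
    """Squared error between the desired output z and the actual output."""
    _, y = forward(x, w1, w2)
    return 0.5 * (z - y) ** 2


def backprop(x, z, w1, w2):
    """Work backward from the output layer, applying the chain rule at each step."""
    h, y = forward(x, w1, w2)
    delta_out = (z - y) * y * (1.0 - y)             # output-layer delta
    grad_w2 = -delta_out * h                        # dLoss/dw2
    delta_hidden = delta_out * w2 * h * (1.0 - h)   # delta passed back to the hidden layer
    grad_w1 = -delta_hidden * x                     # dLoss/dw1
    return grad_w1, grad_w2


x, z, w1, w2 = 0.7, 1.0, 0.3, -0.4
grad_w1, grad_w2 = backprop(x, z, w1, w2)

# One gradient-descent step with a hypothetical learning rate of 0.5.
lr = 0.5
loss_before = loss(x, z, w1, w2)
loss_after = loss(x, z, w1 - lr * grad_w1, w2 - lr * grad_w2)
```

A quick way to trust such a derivation is to compare each gradient with a finite-difference estimate of the loss, which should agree to several decimal places.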

Cognitive computing requires vast amounts of structured and unstructured data fed to machine-learning algorithms. Over time, cognitive systems are able to refine the way in which they learn and recognize patterns and the way they process data to become capable of anticipating new problems and modeling and proposing possible solutions. To achieve those capabilities, cognitive computing systems must have the following key attributes as defined by the Cognitive Computing Consortium (2018):

• Adaptive: Cognitive systems must be flexible enough to learn as information changes and goals evolve. The systems must be able to digest dynamic data in real time and make adjustments as the data and environment change.
• Interactive: Human-computer interaction (HCI) is a critical component in cognitive systems. Users must be able to interact with cognitive machines and define their needs as those needs change. The technologies must also be able to interact with other processors, devices, and cloud platforms.
• Iterative and stateful: Cognitive computing technologies can also identify problems by asking questions or pulling in additional data if a stated problem is vague or incomplete. The systems do this by maintaining information about similar situations that have previously occurred.
• Contextual: Understanding context is critical in thought processes, so cognitive systems must understand, identify, and mine contextual data, such as syntax, time, location, domain, requirements, and a specific user's profile, tasks, or goals. Cognitive systems may draw on multiple sources of information, including structured and unstructured data and visual, auditory, or sensor data.

