Natural Language Processing and related topics

What are some open-source NLP libraries?

- Apache OpenNLP: a machine learning toolkit that provides tokenizers, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, coreference resolution, and more.
- Natural Language Toolkit (NLTK): a Python library that provides modules for processing text: classifying, tokenizing, stemming, tagging, parsing, and more.
- Stanford NLP: a suite of NLP tools that provide part-of-speech tagging, named entity recognition, a coreference resolution system, sentiment analysis, and more.
- MALLET: a Java package that provides Latent Dirichlet Allocation, document classification, clustering, topic modeling, information extraction, and more.

These libraries provide the algorithmic building blocks of NLP in real-world applications. Algorithmia provides a free API endpoint for many of these algorithms, without your ever having to set up or provision servers and infrastructure.
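
As a minimal, hedged sketch of what calling one of these libraries looks like (using NLTK, and assuming `pip install nltk` has been run):

```python
# Minimal NLTK sketch: sentence segmentation and word tokenization.
import nltk

nltk.download("punkt")  # fetch the pre-trained tokenizer model (one-time)

text = "NLTK is a Python library. It provides tokenizers, stemmers, and taggers."
for sentence in nltk.sent_tokenize(text):   # split the text into sentences
    print(nltk.word_tokenize(sentence))     # split each sentence into tokens
```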

What are some of the tools available for Sentiment Analysis?

- Meltwater: Assess the tone of commentary as a proxy for brand reputation and uncover new insights that help you understand your target audience.
- Google Alerts: A simple and very useful way to monitor your search queries. I use it to track "content marketing" and get regular email updates on the latest relevant Google results. This is a good starting point for tracking influencers, trends and competitors.
- People Browser: Find all the mentions of your brand, industry and competitors and analyze sentiment. This tool allows you to compare the volume of mentions before, during and after your marketing campaigns.
- Google Analytics: A powerful tool for discovering which channels influenced your subscribers and buyers. Create custom reports, annotations to keep uninterrupted records of your marketing and web design actions, as well as advanced segments to break down visitor data and gain valuable insights on their online experiences.
- Hootsuite: A great freemium tool that allows you to manage and measure your social networks. The premium subscription provides enhanced analytics at a very reasonable USD 5.99 per month.
- Tweetstats: A fun, free tool that allows you to graph your Twitter stats. Simply enter your Twitter handle and "let the magic happen."
- Facebook Insights: If you have more than 30 Likes on your Facebook Page, you can start measuring its performance with Insights. See total page Likes, number of fans, daily active users, new Likes/Unlikes, Like sources, demographics, page views and unique page views, tab views, external referrers, media consumption and more!
- Pagelever: Another tool for measuring Facebook activity. Pagelever gives you the ability to precisely measure each stage of how content is consumed and shared on the Facebook platform.
- Social Mention: The social media equivalent of Google Alerts, this useful tool allows you to track mentions of identified keywords in video, blogs, microblogs, events, bookmarks, comments, news, Q&A, hashtags and even audio media. It also indicates whether mentions are positive, negative, or neutral.
- Marketing Grader: HubSpot's Marketing Grader is a tool for grading your entire marketing funnel. It uses over 35 metrics to calculate your grade by looking at whether you are regularly blogging, tweeting, updating Facebook, converting visitors into leads, and more. It's a full-funnel way to measure your inbound marketing initiatives.

What books are recommended for learning about NLP?

- Speech and Language Processing (2nd Edition): "An explosion of Web-based language techniques, merging of distinct fields, availability of phone-based dialogue systems, and much more make this an exciting time in speech and language processing. The first of its kind to thoroughly cover language technology - at all levels and with all modern technologies - this text takes an empirical approach to the subject, based on applying statistical and other machine-learning algorithms to large corpora. The authors cover areas that traditionally are taught in different courses, to describe a unified vision of speech and language processing."
- Foundations of Statistical Natural Language Processing: "This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications."
- Handbook of Natural Language Processing: "The Second Edition presents practical tools and techniques for implementing natural language processing in computer systems. Along with removing outdated material, this edition updates every chapter and expands the content to include emerging areas, such as sentiment analysis."
- Statistical Language Learning (Language, Speech, and Communication): "Eugene Charniak breaks new ground in artificial intelligence research by presenting statistical language processing from an artificial intelligence point of view in a text for researchers and scientists with a traditional computer science background."
- Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit: "This is a book about Natural Language Processing. By 'natural language' we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. At one extreme, it could be as simple as counting word frequencies to compare different writing styles."
- Introduction to Information Retrieval: "As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval systems. However, during the last decade, relentless optimization of information retrieval effectiveness has driven web search engines to new quality levels where most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows, 2004) found that 92% of Internet users say the Internet is a good place to go for getting everyday information. To the surprise of many, the field of information retrieval has moved from being a primarily academic discipline to being the basis underlying most people's preferred means of information access."

What are some NLP examples?

- Use Summarizer to automatically summarize a block of text, extracting topic sentences and ignoring the rest.
- Generate keyword topic tags from a document using LDA (Latent Dirichlet Allocation), which determines the most relevant words in a document. This algorithm is at the heart of the Auto-Tag and Auto-Tag URL microservices (a sketch of the LDA idea follows below).
- Sentiment Analysis, based on StanfordNLP, can be used to identify the feeling, opinion, or belief of a statement, from very negative, to neutral, to very positive. Often, developers will use an algorithm to identify the sentiment of a term in a sentence, or use sentiment analysis to analyze social media.
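
Here is a hedged sketch of that LDA auto-tagging idea (not Algorithmia's actual microservice; the corpus and parameters below are invented for illustration), using scikit-learn:

```python
# Sketch: derive candidate topic tags from documents with LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market fell as investors sold shares",
    "the team won the match after a late goal",
    "shares rallied and the market closed higher",
    "the coach praised the players after the game",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)          # bag-of-words counts feed LDA

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# The highest-weighted words per topic serve as candidate tags.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"topic {i}:", [terms[j] for j in topic.argsort()[-4:][::-1]])
```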

What is Relationship Extraction?

A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.
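
As a toy illustration only (the pattern, sentence, and relation schema are invented; real systems use trained classifiers), a pattern-based relation extractor might look like:

```python
# Toy relation extraction: match "X, the Y of Z," appositives and emit
# (entity, relation, entity) triples. Purely illustrative.
import re

text = "Tim Cook, the CEO of Apple, spoke at the conference."
pattern = re.compile(r"([\w ]+), the ([\w ]+) of ([\w ]+),")

for person, role, org in pattern.findall(text):
    print((person.strip(), role.strip(), org.strip()))
# -> ('Tim Cook', 'CEO', 'Apple')
```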

How does NLP work?

Apart from common word processor operations that treat text as a mere sequence of symbols, NLP considers the hierarchical structure of language: several words make a phrase, several phrases make a sentence and, ultimately, sentences convey ideas. "By analyzing language for its meaning, NLP systems have long filled useful roles, such as correcting grammar, converting speech to text and automatically translating between languages," said John Rehling, an NLP expert at Meltwater Group, in "How Natural Language Processing Helps Uncover Social Media Sentiment."

Why is NLP relevant?

NLP is characterized as a hard problem in computer science. Human language is rarely precise, or plainly spoken. To understand human language is to understand not only the words, but the concepts and how they're linked together to create meaning. Despite language being one of the easiest things for humans to learn, the ambiguity of language is what makes natural language processing a difficult problem for computers to master.

What is Big Data?

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem." Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, medical practitioners, advertisers, and governments alike regularly meet difficulties with large data sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.

What is Natural Language Processing or NLP?

NLP is the field of study that focuses on the interactions between human language and computers. It sits at the intersection of computer science, artificial intelligence, and computational linguistics (Wikipedia). NLP goes by many names — text analytics, data mining, computational linguistics — but the basic principle remains the same. NLP refers to computer systems that derive meaning from human language in a smart and useful way.

What are Lexical Analysis and Tokenization?

In computer science, lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth. Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
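
A self-contained toy lexer in Python makes the character-to-token conversion concrete (the token classes here are invented for the example):

```python
# Toy lexer: convert a character stream into classified tokens.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),            # integer literals
    ("IDENT",  r"[A-Za-z_]\w*"),   # identifiers
    ("OP",     r"[+\-*/=]"),       # single-character operators
    ("WS",     r"\s+"),            # whitespace (skipped below)
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER.finditer(text):
        if match.lastgroup != "WS":
            yield (match.lastgroup, match.group())

print(list(tokenize("price = base + 42")))
# [('IDENT', 'price'), ('OP', '='), ('IDENT', 'base'), ('OP', '+'), ('NUMBER', '42')]
```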

What is Customer Engagement?

Customer Engagement is a business communication connection between an external stakeholder (consumer) and an organization (company or brand) through various channels of correspondence. This connection can be a reaction, interaction, effect or overall customer experience, which takes place online and offline. The term can also be used to define customer-to-customer correspondence regarding a communication, product, service or brand. However, this latter, customer-to-customer dissemination ultimately originates from a business-to-consumer interaction that resonated at a subconscious level.

Why is Big Data an important concept?

Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing Internet of Things devices such as mobile devices, aerial sensors (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data are generated. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.

What is Evangelism Marketing?

Evangelism marketing is an advanced form of word-of-mouth marketing in which companies develop customers who believe so strongly in a particular product or service that they freely try to convince others to buy and use it. The customers become voluntary advocates, actively spreading the word on behalf of the company. Evangelism marketing is sometimes confused with affiliate marketing. However, while affiliate programs provide incentives in the form of money or products, evangelist customers spread their recommendations and recruit new customers out of pure belief, not for the receipt of goods or money. Rather, the goal of the customer evangelist is simply to provide benefit to other individuals. As they act independently, evangelist customers often become key influencers. The fact that evangelists are not paid or associated with any company makes their beliefs perceived by others as credible and trustworthy. "Evangelism" derives from the Greek for "bringing good news," and the marketing term draws from the religious sense, as consumers are driven by their beliefs in a product or service, which they preach in an attempt to convert others.

What is Part-of-speech Tagging?

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.
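
A short hedged sketch with NLTK's bundled stochastic tagger (assuming `nltk` is installed; tag output can vary slightly by model version):

```python
# POS tagging with NLTK's pre-trained (stochastic) tagger.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # the pre-trained tagger model

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```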

What is Stemming?

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation. Stemming programs are commonly referred to as stemming algorithms or stemmers.
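
For example, with NLTK's implementation of the classic Porter stemmer (outputs shown are typical for this algorithm):

```python
# Stemming with the Porter algorithm via NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["argue", "argued", "argues", "arguing"]:
    print(word, "->", stemmer.stem(word))
# All four map to the stem "argu", which is not itself a valid root,
# but related words mapping to the same stem is all stemming requires.
```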

Why is Terminology Extraction relevant?

In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers, web services, recommender systems, etc. The development of terminology extraction is essential to the language industry.

What is Machine Translation?

Machine translation or MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another. On a basic level, MT performs simple substitution of words in one language for words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus statistical and neural techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies. Current MT software can achieve improved output quality. It often allows for customization by domain or profession, improving output by limiting the scope of allowable substitutions where formal or formulaic language is used. MT has also proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is.
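
A toy sketch of that basic word-for-word substitution, with an invented three-word dictionary, shows why it usually fails:

```python
# Naive word-for-word "translation" (English -> Spanish), illustration only.
LEXICON = {"the": "el", "white": "blanco", "house": "casa"}

def naive_mt(sentence):
    return " ".join(LEXICON.get(w, w) for w in sentence.lower().split())

print(naive_mt("the white house"))
# -> "el blanco casa": gender agreement and word order are wrong; the good
# translation "la casa blanca" needs phrase-level knowledge, not word swaps.
```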

What's NLP's role in social media?

NLP can analyze language patterns to understand text. One of the most compelling ways NLP offers valuable intelligence is by tracking sentiment — the tone of a written message (tweet, Facebook update, etc.) — and tagging that text as positive, negative or neutral. Much can be gleaned from sentiment analysis. Companies can target unhappy customers or, more importantly, find their competitors' unhappy customers, and generate leads. I like to call these discoveries "actionable insights" — findings that can be directly implemented into PR, marketing, advertising and sales efforts.

What are the applications of NLP?

NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. Developers can organize and structure knowledge to perform tasks such as automatic text summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, topic extraction, topic segmentation, part-of-speech tagging, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering. NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statistical inference. In general, the more data analyzed, the more accurate the model will be.
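
As a hedged sketch of learning rules from examples rather than hand-coding them (the corpus and labels are invented), here is a tiny Naive Bayes text classifier built with scikit-learn:

```python
# Learn a classification "rule" from labeled examples instead of hand-coding it.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love this product, it works great",
    "excellent quality and fast shipping",
    "terrible, it broke after one day",
    "awful experience, do not buy",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)            # statistical inference from examples

print(model.predict(["the quality is great"]))  # expected: ['positive']
```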

What are the limits of NLP in social media?

NLP technology lacks human-level intelligence, at least for the foreseeable future. On a text-by-text basis, the system's conclusions may be wrong, as no automated sentiment analysis that currently exists can handle certain levels of nuance (sarcasm, for example). Furthermore, certain expressions ("ima") or abbreviations ("#ff") fool the program, especially when people have 140 characters or fewer to express their opinions, or when they use slang, profanity, misspellings and neologisms. Finally, much of social media interaction is personal, expressed between two people or among a group, commonly in the first or second person, in contrast with news or brand posts, which are likely written with a more detached, omniscient tone. On top of that, each participant likely varies their language from one page to another. English usage and tone differ hugely based on the source and the forum.

What is Named-entity Recognition?

Named-entity Recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. Most research on NER systems has been structured as taking an unannotated block of text and producing an annotated block of text that highlights the names of entities:

Jim bought 300 shares of Acme Corp. in 2006. -> [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.
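
A hedged sketch running NLTK's bundled chunker over the example sentence above (assumes the listed models are downloaded; exact labels depend on the pre-trained model):

```python
# Named-entity recognition with NLTK's default chunker.
import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

tokens = nltk.word_tokenize("Jim bought 300 shares of Acme Corp. in 2006.")
tree = nltk.ne_chunk(nltk.pos_tag(tokens))    # tag, then chunk named entities
print(tree)
# Subtrees labeled PERSON, ORGANIZATION, etc. mark the detected entities;
# NLTK's default model does not tag temporal expressions like "2006".
```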

What is Question Answering in computer science?

Question answering (QA) is a computer science discipline within the fields of information retrieval and NLP, which is concerned with building systems that automatically answer questions posed by humans in a natural language. A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents. QA research attempts to deal with a wide range of question types including: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions. Some examples of natural language document collections used for QA systems include: a local collection of reference texts, internal organization documents and web pages, compiled newswire reports, a set of Wikipedia pages, a subset of World Wide Web pages, etc.

How does Terminology Extraction work?

Typically, approaches to automatic term extraction make use of linguistic processors (part-of-speech tagging, phrase chunking) to extract terminological candidates, i.e. syntactically plausible terminological noun phrases, NPs (e.g. compounds "credit card", adjective-NPs "local tourist information office", and prepositional-NPs "board of directors"), which tend to be frequent. Terminological entries are then filtered from the candidate list using statistical and machine learning methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology or a terminology base. Furthermore, terminology extraction is a very useful starting point for semantic similarity, knowledge management, human translation and machine translation, etc.
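
A hedged sketch of the candidate-extraction step only (the chunk grammar is a simplification; the statistical filtering stage is omitted), using NLTK:

```python
# Extract syntactically plausible noun-phrase term candidates.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

grammar = "NP: {<JJ>*<NN.*>+}"   # optional adjectives followed by nouns
chunker = nltk.RegexpParser(grammar)

sentence = "Pay at the local tourist information office with a credit card."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# expected candidates include "local tourist information office" and "credit card"
```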

What is Sentiment Analysis?

Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor). (Source: Wikipedia)

What are the uses of Sentiment Analysis?

Sentiment analysis is extremely useful in social media monitoring as it allows us to gain an overview of the wider public opinion behind certain topics. Social media monitoring tools like Brandwatch Analytics make that process quicker and easier than ever before, thanks to real-time monitoring capabilities. The applications of sentiment analysis are broad and powerful. The ability to extract insights from social data is a practice that is being widely adopted by organizations across the world. Shifts in sentiment on social media have been shown to correlate with shifts in the stock market. The Obama administration used sentiment analysis to gauge public opinion on policy announcements and campaign messages ahead of the 2012 presidential election.

What is Terminology Extraction?

Terminology extraction (also known as topic extraction, term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.

What is Text Mining?

Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods. A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
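
One of the steps named above, lexical analysis of word frequency distributions, fits in a few lines (the document set is invented for illustration):

```python
# Count word frequencies across a small document collection.
from collections import Counter
import re

docs = [
    "Text mining turns text into data for analysis.",
    "Mining patterns in text requires structured data.",
]

freq = Counter(w for d in docs for w in re.findall(r"[a-z]+", d.lower()))
print(freq.most_common(3))
# -> [('text', 3), ('mining', 2), ('data', 2)]
```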

What are the difficulties of Sentiment Analysis?

Human language is complex. Teaching a machine to analyze the various grammatical nuances, cultural variations, slang and misspellings that occur in online mentions is a difficult process. Teaching it to understand how context can affect tone, which is fairly intuitive for humans, is even more difficult. Consider the following sentence: "My flight's been delayed. Brilliant!" Most humans would be able to quickly interpret that the person was being sarcastic. We know that for most people having a delayed flight is not a good experience (unless there's a free bar as recompense involved). By applying this contextual understanding to the sentence, we can easily identify the sentiment as negative. Without contextual understanding, a machine looking at the sentence above might see the word "brilliant" and categorize it as positive.
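
A toy lexicon-based scorer (the lexicon and weights are invented) makes that failure concrete:

```python
# Naive lexicon-based sentiment: no context, so sarcasm scores positive.
LEXICON = {"brilliant": +2, "delayed": -1}   # invented two-word lexicon

def naive_sentiment(text):
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return sum(LEXICON.get(w, 0) for w in words)

print(naive_sentiment("My flight's been delayed. Brilliant!"))
# -> 1 (net positive), even though the sentence is sarcastic and negative.
```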

Where can I get information on NLP on-line?

https://blog.algorithmia.com/introduction-natural-language-processing-nlp/

