nlp


Polysemy

a word with multiple meanings depending on the context, e.g. "In what state would you find Lincoln?", where "Lincoln" can refer to a place or a person.

If there are multiple users manually entering data, then this is a common problem. Maybe I like to use "n/a" but you like to use "na". An easy way to detect these various formats is to put them in a list. Then when we import the data, Pandas will recognize them right away. Here's an example of how we would do that.

import pandas as pd

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property data.csv", na_values=missing_values)

Fill in missing values with a single value.

# Replace missing values with a number
df['ST_NUM'].fillna(125, inplace=True)

A very common way to replace missing values is using a median.

# Replace using median
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'].fillna(median, inplace=True)

Get a total count of missing values.

# Total number of missing values
print(df.isnull().sum().sum())
Out: 8

Covariance & Correlation

- Covariance and correlation both compare two variables.
- The covariance of variables X and Y measures how they vary together: Cov(X, Y) = E[(X - E[X])(Y - E[Y])].
- Correlation is covariance normalized by the standard deviations, Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y), so it always lies in [-1, 1].
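A minimal NumPy sketch of both quantities; the toy data below is made up for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance from the definition: E[(X - E[X])(Y - E[Y])]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation is covariance normalized by the standard deviations
corr_xy = cov_xy / (x.std() * y.std())

print(cov_xy, corr_xy)                 # manual computation
print(np.cov(x, y, bias=True)[0, 1])   # NumPy's (population) covariance
print(np.corrcoef(x, y)[0, 1])         # NumPy's correlation coefficient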

What is entropy? How would you estimate the entropy of some data?

- Entropy is a measure of the disorder in information; it quantifies the amount of information needed to resolve uncertainty.
- The aim of compression in information theory is to resolve this uncertainty with the fewest bits needed to transmit a message accurately.
- Entropy is measured in bits per word (or per symbol).
- Entropy increases when there is more uncertainty about the occurrence of a word.
- Low entropy: the data comes from a distribution with peaks and valleys, so there is less uncertainty about the data.
- High entropy: the data comes from a (near-)uniform distribution, so there is more uncertainty about the data.
- To estimate entropy from data, count the frequency of each symbol, convert the counts to probabilities, and compute H = -Σ p(x) log2 p(x), as sketched below.
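A minimal pure-Python sketch of that frequency-based estimate; the sample strings are made up:

import math
from collections import Counter

def entropy(tokens):
    # Estimate Shannon entropy (bits per token) from observed frequencies
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Uniform-ish data -> higher entropy; peaked data -> lower entropy
print(entropy(list("abcdabcdabcd")))   # 2.0 bits (4 equally likely symbols)
print(entropy(list("aaaaaaaaaaab")))   # much lower (one dominant symbol)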

Which is a better algorithm for POS tagging, SVMs or Hidden Markov Models? Why?

- An HMM (Hidden Markov Model) is used to capture patterns in the occurrence of words in text.
- HMMs are used to solve a variety of NLP problems, including POS tagging.
- POS tagging is part of higher-level applications such as Information Extraction, summarizers, and QA systems.
- Words and letters do not occur at random in natural language. Given any word x in a text stream, we can find n words before and after x.

What is NLP?

- NLP is an automated way to understand or analyze natural languages and extract the required information from such data by applying machine learning algorithms.

Which of the following techniques can be used for the purpose of keyword normalization, the process of converting a keyword into its base form? 1. Lemmatization 2. Levenshtein 3. Stemming 4. Soundex

- Normalization is the process of reducing a word (keyword) to its most basic form, e.g. sadden, saddest, sadly --> 'sad' is the normalized form.
- Levenshtein distance is used to measure the similarity between different strings or sentences, and Soundex is used to index words by their pronunciation, so they are not appropriate tools for keyword normalization.
- Answer: Lemmatization & Stemming.

Probability

- Probability is a number in [0, 1] that indicates how likely it is that some event (or group of events) will occur. 0 means the event never occurs; 1 means the event always occurs.

What is the significance of TF-IDF?

- TF-IDF stands for term frequency-inverse document frequency.
- TF-IDF is one of the most popular term-weighting schemes.
- TF-IDF reflects how important a word is to a document within a collection (corpus).
- TF-IDF is used in recommender systems, search engines, stop-word filtering, text summarization, and classification.
Why IDF:
- IDF measures whether a word is common or rare across all documents. Take the terms "the", "brown", and "cow": "the" is so common that term frequency alone would tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant from non-relevant documents, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
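A short sketch using scikit-learn's TfidfVectorizer on made-up documents (the get_feature_names_out call assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the brown cow",
    "the the the quick brown fox",
    "the cow jumped over the moon",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# "the" appears in every document, so its IDF (and hence its weight) is low;
# rarer terms like "fox" and "moon" get higher IDF
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(term, round(idf, 2))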

Differentiate regular grammar and regular expression.

- Both recognise or generate strings of a regular language, and mathematically there is, in that sense, no difference. Of course, sometimes one model is easier to use than another for a particular task, due to the details of the formalism. Furthermore, the way they work in a human's head is often a little different: finite automata "feel" like computers, regular expressions "feel" like you're constructing a string out of smaller substrings, and regular grammars "feel" like a more traditional grammatical derivation or classification of a sentence in a language (unsurprisingly, when you look at the history).

Regular expressions are defined recursively as follows:
- ∅ is a regular expression
- ε is a regular expression
- a is a regular expression for every a ∈ Σ
- if A and B are regular expressions, then A⋅B is a regular expression (concatenation), A|B is a regular expression (alternation), and A* is a regular expression (Kleene star)
Along with some semantics (i.e. how we interpret the operators to get a string), we get a way of generating strings from a regular language.

Regular grammars consist of a four-tuple (N, Σ, P, S ∈ N), where N is the set of non-terminals, Σ is the set of terminals, S is the start non-terminal, and P is the set of productions, which tell us how to change the start symbol, step by step, into a string in Σ*. P can have its productions drawn from one of two types (not both):
- Right-linear grammars: for non-terminals B, C, terminal a and the empty string ε, all rules are of the form B → a, B → aC, or B → ε.
- Left-linear grammars are the same, but the second rule form is B → Ca.

Comparing the two: regular expressions look like matching rules, or ways of dealing with a string a bit at a time, whereas grammars "label" sections of the string and group labels under new labels to validate the string (i.e. if we can get from S to the string, or vice versa, we're happy).
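For illustration, a small Python sketch exercising the three regular-expression operators above (concatenation, alternation, Kleene star); the pattern itself is made up:

import re

# a, then zero or more b's (Kleene star), then c or d (alternation),
# all concatenated together
pattern = re.compile(r"ab*(c|d)")

for s in ["ac", "abbbd", "ad", "bc"]:
    print(s, bool(pattern.fullmatch(s)))   # True, True, True, False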

What is tokenization in NLP?

- Tokenization is a common task in NLP: breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols, and other elements called tokens. In the process of tokenization, some characters like punctuation marks may be discarded. The tokens become the input for another process like parsing or text mining.
- "Tokens" are usually individual words (at least in languages like English), and "tokenization" is taking a text or set of texts and breaking it up into its individual words. These tokens are then used as the input for other types of analysis or tasks, like parsing (automatically tagging the syntactic relationships between words).
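A minimal, self-contained sketch of tokenization using a regular expression (libraries such as NLTK or spaCy provide more robust tokenizers; the sample sentence is made up):

import re

def tokenize(text):
    # Split text into word and punctuation tokens; whitespace is dropped
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokenization, it turns out, isn't hard!"))
# ['Tokenization', ',', 'it', 'turns', 'out', ',', 'isn', "'", 't', 'hard', '!']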

What is part of speech (POS) tagging?

A Part-Of-Speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token). POS taggers use an algorithm to label terms in text bodies, and often use more fine-grained categories than the basic parts of speech, with tags such as "noun-plural" or even more complex labels. Part-of-speech categorization is taught to school-age children in English grammar, where children perform basic POS tagging as part of their education.
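A short sketch using NLTK's off-the-shelf tagger (assumes NLTK is installed and its tokenizer/tagger resources have been downloaded; resource names can vary slightly between NLTK versions):

import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ...]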

Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of-sample evaluation metric?

AUROC is robust to class imbalance, unlike raw accuracy. For example, if you want to detect a type of cancer that's prevalent in only 1% of the population, you can build a model that achieves 99% accuracy by simply classifying everyone as cancer-free.
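A small sketch of exactly that failure mode, assuming scikit-learn; the 1%-positive data is simulated:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced problem: 1% positives
y_true = np.array([0] * 99 + [1])
y_score = np.zeros(100)                 # a useless model that scores everyone as negative

y_pred = (y_score >= 0.5).astype(int)   # threshold the scores at 0.5
print(accuracy_score(y_true, y_pred))   # 0.99: looks great, but the model is useless
print(roc_auc_score(y_true, y_score))   # 0.5: reveals there is no discriminative power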

What are the advantages and disadvantages of decision trees?

Advantages: Decision trees are easy to interpret, nonparametric (which means they are robust to outliers), and there are relatively few parameters to tune. Disadvantages: Decision trees are prone to overfitting. However, this can be addressed by ensemble methods like random forests or boosted trees.

What are the advantages and disadvantages of neural networks?

Advantages: Neural networks (specifically deep NNs) have led to performance breakthroughs for unstructured datasets such as images, audio, and video. Their incredible flexibility allows them to learn patterns that no other ML algorithm can learn. Disadvantages: However, they require a large amount of training data to converge. It's also difficult to pick the right architecture, and the internal "hidden" layers are incomprehensible.

What is n-gram?

An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability: with larger n, a model can store more context with a well-understood space-time tradeoff, enabling small experiments to scale up efficiently.
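A minimal bigram (n = 2) sketch in pure Python; the toy corpus is made up:

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the rat".split()

# Count bigrams: P(next | previous) ~ count(previous, next) / count(previous)
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# Most likely word following "the" under this tiny bigram model
print(bigrams["the"].most_common(1))   # [('cat', 2)]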

Approximate matching

Approximate string matching (or fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). - The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.
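A minimal sketch of the dictionary sub-problem using Python's standard-library difflib; the dictionary and the misspelled query are made up:

import difflib

dictionary = ["apple", "apply", "ample", "maple", "happily"]

# Find dictionary strings that approximately match a (misspelled) query
print(difflib.get_close_matches("appel", dictionary, n=3, cutoff=0.6))
# e.g. ['apple', 'apply', 'ample'], ordered by similarity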

Explain bagging.

Bagging, or Bootstrap Aggregating, is an ensemble method in which the dataset is first divided into multiple subsets through resampling. Then, each subset is used to train a model, and the final predictions are made through voting or averaging the component models. Bagging is performed in parallel.
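A hedged scikit-learn sketch: bagging decision trees on a built-in dataset (the hyperparameters are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: train many trees on bootstrap resamples and average their votes
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
print(cross_val_score(bagged, X, y, cv=5).mean())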

List some Components of NLP?

Below are a few major components of NLP.
- Entity extraction: it involves segmenting a sentence to identify and extract entities, such as a person (real or fictional), organization, geography, event, etc.
- Syntactic analysis: it refers to the proper ordering of words.
- Pragmatic analysis: it is part of the process of extracting information from text.

Optimization

What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?

Both algorithms are methods for finding a set of parameters that minimize a loss function by evaluating parameters against data and then making adjustments. In standard gradient descent, you'll evaluate all training samples for each set of parameters. This is akin to taking big, slow steps toward the solution. In stochastic gradient descent, you'll evaluate only 1 training sample for the set of parameters before updating them. This is akin to taking small, quick steps toward the solution.
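A minimal NumPy sketch contrasting the two update schemes on a synthetic linear-regression problem (the data, learning rates, and epoch count are made up):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w_gd, w_sgd = np.zeros(3), np.zeros(3)

for epoch in range(50):
    # Batch gradient descent: one update per pass over the full dataset
    w_gd -= 0.1 * (2 / len(y)) * X.T @ (X @ w_gd - y)

    # Stochastic gradient descent: one (smaller) update per individual sample
    for i in rng.permutation(len(y)):
        w_sgd -= 0.01 * 2 * X[i] * (X[i] @ w_sgd - y[i])

print(w_gd)    # both estimates should end up close to [1.5, -2.0, 0.5]
print(w_sgd)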

Ensemble Learning

Combining multiple models for better performance.

Data Preprocessing

Dealing with missing data, skewed distributions, outliers, etc.

What is dependency parsing?

Dependency parsing is also known as syntactic parsing. It is the task of recognizing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated using parsing algorithms. These parse trees are useful in various applications like grammar checking; more importantly, parsing plays a critical role in the semantic analysis stage. For example, to answer the question "Who is the point guard for the LA Lakers in the next game?" we need to figure out its subject, objects, and attributes to help us figure out that the user wants the point guard of the LA Lakers specifically for the next game. The task of syntactic parsing is quite complex because a given sentence can have multiple parse trees, which we call ambiguities. Consider the sentence "Book that flight.", which can form multiple parse trees based on its ambiguous part-of-speech tags unless these ambiguities are resolved. Choosing a correct parse from the multiple possible parses is called syntactic disambiguation. Parsing algorithms like the Cocke-Kasami-Younger (CKY) algorithm, the Earley algorithm, or chart parsing algorithms use a dynamic programming approach to deal with the ambiguity problem.
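A short sketch using spaCy's dependency parser (assumes spaCy and its small English model, en_core_web_sm, are installed):

import spacy

# one-time: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Book that flight")
for token in doc:
    # word, its dependency relation, and the head word it attaches to
    print(token.text, token.dep_, token.head.text)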

When would you use GD over SGD, and vice versa?

GD theoretically minimizes the error function better than SGD. However, SGD converges much faster once the dataset becomes large. That means GD is preferable for small datasets while SGD is preferable for larger ones. In practice, however, SGD is used for most applications because it minimizes the error function well enough while being much faster and more memory efficient for large datasets.

Sampling & Splitting

How to split your datasets to tune parameters and avoid overfitting.

How can you choose a classifier based on training set size?

If the training set is small, high bias / low variance models (e.g. Naive Bayes) tend to perform better because they are less likely to overfit. If the training set is large, low bias / high variance models (e.g. Logistic Regression) tend to perform better because they can reflect more complex relationships.

Explain Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation (LDA) is a common method of topic modeling, or classifying documents by subject matter. LDA is a generative model that represents documents as a mixture of topics that each have their own probability distribution of possible words. The "Dirichlet" distribution is simply a distribution of distributions. In LDA, documents are distributions of topics that are distributions of words.
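A hedged scikit-learn sketch of LDA topic modeling on a few made-up documents (the topic count and other settings are arbitrary):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Each topic is a distribution over words; each document is a mixture of topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print("topic", k, [words[i] for i in topic.argsort()[::-1][:3]])
print(doc_topics.round(2))   # per-document topic mixtures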

LSI (Latent Semantic Indexing)

Latent Semantic Indexing helps search engines accomplish two tasks: it is the part of the algorithm that identifies related words in content to better classify webpages and deliver more accurate search results. By processing synonyms and understanding the relationships between words, the algorithm can interpret webpages more deeply and, therefore, deliver the right content to searchers.

WHY DO WE NEED LATENT SEMANTIC INDEXING? Search engines do not process information the same way humans do. Humans use context, language processing, and association to understand language: we understand that we're talking about smartphones if the words "iPhone," "apps," and "data package" are used, and we don't need to hear the word "smartphone" to understand this is what the information is about. Search engines, on the other hand, use keywords to tell them what information is about, so even if a piece of information uses terms related to the main topic, search engines may have trouble recognizing this. Latent Semantic Indexing resolves this by providing the extra context search engines need to recognize topics in web content.

WHO BENEFITS FROM LATENT SEMANTIC INDEXING? Additional and more accurate categorization benefits users, publishers, and marketers. Search engines can provide more useful and relevant search results because they can decipher language and classify topics based on synonyms and related terms. Marketers can improve their search rankings by adjusting their keywords to use Latent Semantic Indexing best practices. Publishers can connect with a more engaged audience because their content is more targeted. Searchers can more easily find the content that matches their needs and wants.

- Latent semantic indexing, sometimes referred to as latent semantic analysis, is a mathematical method developed in the late 1980s to improve the accuracy of information retrieval. It uses a technique called singular value decomposition (SVD) to scan unstructured data within documents and identify relationships between the concepts contained therein.
- It finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing).
- SVD is a mathematical procedure that transforms the word-document matrix so that the major patterns in the collection are revealed. Minor patterns that are not very important can be removed to identify the major global relationships.
- LSI analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. It assumes that words that are close in meaning will occur in similar pieces of text.
- The hidden relationships are called the "latent semantic structure" of the collection.
- Advantage of LSI: it does not depend on individual words to locate documents, but rather uses a concept or topic to find relevant documents.
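A minimal sketch of the SVD step (often called LSA when done this way) using scikit-learn's TruncatedSVD on a TF-IDF matrix; the documents are made up:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the iPhone ships with many apps and a data package",
    "this smartphone includes apps and mobile data",
    "the cow grazed in the brown field",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# SVD projects the term-document matrix onto a small number of latent
# "concepts"; the intent is that documents about related topics land nearby
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)
print(doc_vectors.round(2))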

How to scale the machine learning algorithms such as k-means clustering to large datasets?

Use mini-batch k-means. It shares the same objective and overall structure as standard k-means, but updates the centroids using small random batches of the data instead of the full dataset, which lets it scale to large datasets.
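A hedged scikit-learn sketch; the simulated data and batch size are arbitrary:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Simulated "large" dataset: ~100k points around 3 centers
X = np.concatenate([
    rng.normal(loc=c, scale=0.5, size=(33000, 2)) for c in ([0, 0], [5, 5], [0, 5])
])

# Mini-batch k-means updates centroids from small random batches
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_.round(1))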

Which is better to use while extracting features character n-grams or word n-grams? Why?

N-grams of text are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward. Google and Microsoft have developed web-scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking, text summarization, and next-word prediction. An n-gram model predicts the occurrence of a word based on the occurrence of its N - 1 previous words. So here we are answering the question: how far back in the history of a sequence of words should we go to predict the next word? For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N - 1 = 1 in this case). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N - 1 = 2 in this case). Character n-grams are built from sequences of characters rather than words: they are more robust to misspellings and out-of-vocabulary words and use a smaller vocabulary, while word n-grams carry more meaning per feature, so the better choice depends on the task (see the sketch below).
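A small sketch contrasting word and character n-gram features with scikit-learn's CountVectorizer; the misspelled toy documents are made up:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["natural language processing", "natural langage procesing"]  # second has typos

word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2))
char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))

# Word n-grams treat the misspelled words as entirely new features,
# while the two documents still share most of their character 3-grams
print(word_ngrams.fit_transform(docs).toarray())
print(char_ngrams.fit_transform(docs).toarray())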

Define the NLP Terminology?

NLP terminology is based on the following factors:
- Weights and Vectors: TF-IDF, length (TF-IDF, doc), word vectors, Google word vectors
- Text Structure: part-of-speech tagging, head of sentence, named entities
- Sentiment Analysis: sentiment dictionary, sentiment entities, sentiment features
- Text Classification: supervised learning, train set, dev (= validation) set, test set, text features, LDA
- Machine Reading: entity extraction, entity linking, dbpedia, FRED (lib) / Pikes

Explain Named entity recognition (NER)?

Named-entity recognition (NER), also known as entity identification or entity chunking/extraction, is a popular technique used in information extraction to identify and segment named entities and classify or categorize them under various predefined classes/categories, e.g. person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
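A short sketch using spaCy's pretrained NER (assumes the en_core_web_sm model is installed; the sentence is a commonly used example, and predicted labels may vary by model version):

import spacy

# one-time: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion in 2024")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG, U.K. GPE, $1 billion MONEY, 2024 DATE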

List some areas of NLP?

Natural Language Processing can be used for:
- Semantic analysis
- Automatic summarization
- Text classification
- Question answering
Some real-life examples of NLP are iOS Siri, the Google Assistant, and Amazon Echo.

Do you have any experience in building ontologies?

- An ontology includes a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains.
- Ontologies define the terms used to describe and represent an area of knowledge; they are used in many applications to capture relationships and boost knowledge management.
- An ontology is an explicit specification of a conceptualization and a formal way to define the semantics of knowledge and data. The formal structure of an ontology makes it a natural way to encode domain knowledge for data mining use.
- The purpose of an ontology is to model the business. It is independent of the computer systems, e.g. legacy or future applications and databases. Its purpose is to use formal logic and common terms to describe the business in a way that both humans and machines can understand.
- An ontology is a formal description of knowledge as a set of concepts within a domain and the relationships that hold between them. There are, of course, other methods that use formal specifications for knowledge representation, such as vocabularies, taxonomies, thesauri, topic maps, and logical models.
Why we need ontologies:
- Problem-solving methods, domain-independent applications, and software agents use ontologies and knowledge bases built from ontologies as data.
- They enable processing based on semantics or reasoning.
- As new ontologies are made, their use hopefully improves problem solving within that domain.
Building ontologies:
- Ontology building includes extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval.
- The process starts by extracting terms and concepts or noun phrases from plain text using linguistic processors such as part-of-speech tagging and phrase chunking. Then statistical or symbolic techniques are used to extract relation signatures, often based on pattern-based or definition-based hypernym extraction techniques.
- Ontologies are often equated with taxonomic hierarchies of classes, class definitions, and the subsumption relation, but ontologies need not be limited to these forms. Common components of ontologies include: individuals (instances or objects); classes (types of objects, or kinds of things); attributes (properties, features, characteristics, or parameters of objects and classes); relations (between classes and objects); events (the changing of attributes or relations); etc.

Explain Principle Component Analysis (PCA).

PCA is a method for transforming features in a dataset by combining them into uncorrelated linear combinations. These new features, or principal components, sequentially maximize the variance represented (i.e. the first principal component has the most variance, the second principal component has the second most, and so on). As a result, PCA is useful for dimensionality reduction because you can set an arbitrary variance cutoff.
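A hedged scikit-learn sketch showing the variance-cutoff idea on simulated correlated features (the data and the 95% cutoff are made up):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# 5 features, several of which are (nearly) redundant
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1], base[:, 1] * 2, rng.normal(size=200),
])

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))

print(pca.explained_variance_ratio_.round(2))
print(X_reduced.shape)   # fewer than 5 columns remain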

What are parametric models? Give an example.

Parametric models are those with a finite number of parameters. To predict new data, you only need to know the parameters of the model. Examples include linear regression, logistic regression, and linear SVMs. Non-parametric models are those with an unbounded number of parameters, allowing for more flexibility. To predict new data, you need to know the parameters of the model and the state of the data that has been observed. Examples include decision trees, k-nearest neighbors, and topic models using Latent Dirichlet Allocation.

What is pragmatic analysis in NLP?

Pragmatic analysis deals with outside-world knowledge, meaning knowledge that is external to the documents and/or queries. Pragmatic analysis reinterprets what was described in terms of what was actually meant, deriving the various aspects of language that require real-world knowledge.

Explain the Bias-Variance Tradeoff.

Predictive models have a tradeoff between bias (how well the model fits the data) and variance (how much the model changes based on changes in the inputs). Simpler models are stable (low variance) but they don't get close to the truth (high bias). More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias). The best model for a given problem usually lies somewhere in the middle.

Word and Meaning Relationships

Relationships between words:
- Synonyms: words with the same meaning; this is the most important relationship between words.
- Antonyms: words with opposite meanings.
- Hyponyms and hypernyms: less well known; they use a hierarchical classification to organize words.

What are 3 ways of reducing dimensionality?

Removing collinear features. Performing PCA, ICA, or other forms of algorithmic dimensionality reduction. Combining features with feature engineering.

What is Lemmatization in NLP?

Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing. Both stemming and lemmatization reduce inflected forms to a common base form, e.g. am, are, is --> be; car, cars, car's, cars' --> car.

Stemming or lemmatization? When should I use stemming and when should I use lemmatization? Both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Stemming follows an algorithm with steps to perform on the words, which makes it faster. Lemmatization, by contrast, uses the WordNet corpus (and a corpus of stop words) to produce the lemma, which makes it slower than stemming; you also have to supply the part of speech to obtain the correct lemma.

So when to use what? If speed is the focus, stemming should be used, since lemmatizers scan a corpus, which costs time and processing. Whether to use stemmers or lemmatizers depends on the application: if you are building a language application in which language is important, you should use lemmatization, as it uses a corpus to match root forms.
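A short NLTK sketch contrasting the two (assumes NLTK is installed; resource names for the one-time download can vary slightly across NLTK versions):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time resource for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"), stemmer.stem("cars"))                  # studi car (a stem may not be a word)
print(lemmatizer.lemmatize("studies"), lemmatizer.lemmatize("cars"))  # study car (a lemma is a real word)
print(lemmatizer.lemmatize("are", pos="v"))                           # be (the POS tag is needed for the right lemma)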

What is the ROC Curve and what is AUC (a.k.a. AUROC)?

The ROC (receiver operating characteristic) curve is a performance plot for binary classifiers of True Positive Rate (y-axis) vs. False Positive Rate (x-axis). AUC is the area under the ROC curve, and it's a common performance metric for evaluating binary classification models. It's equivalent to the expected probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.

What is the "Curse of Dimensionality?"

The difficulty of searching through a solution space becomes much harder as you have more features (dimensions). Consider the analogy of looking for a penny in a line vs. a field vs. a building. The more dimensions you have, the higher volume of data you'll need.

Why are ensemble methods superior to individual models?

They average out biases, reduce variance, and are less likely to overfit. There's a common line in machine learning which is: "ensemble and get 2%." This implies that you can build your models as usual and typically expect a small performance boost from ensembling.

What are 3 data preprocessing techniques to handle outliers?

Winsorize (cap at threshold). Transform to reduce skew (using Box-Cox or similar). Remove outliers if you're certain they are anomalies or measurement errors.
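A minimal NumPy sketch of winsorizing (capping at percentile thresholds); the data and the 1%/99% thresholds are made up, and scipy.stats.mstats.winsorize is a ready-made alternative:

import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=100), [250.0, -90.0])  # two extreme outliers

# Cap values at the 1st and 99th percentiles
low, high = np.percentile(data, [1, 99])
winsorized = np.clip(data, low, high)

print(data.min(), data.max())               # dominated by the outliers
print(winsorized.min(), winsorized.max())   # capped at the percentile thresholds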

If you split your data into train/test splits, is it still possible to overfit your model?

Yes, it's definitely possible. One common beginner mistake is re-tuning a model or training new models with different parameters after seeing its performance on the test set. In this case, it's the model selection process that causes the overfitting. The test set should not be used until you're ready to make your final selection.

How much data should you allocate for your training, validation, and test sets?

You have to find a balance, and there's no right answer for every problem. If your test set is too small, you'll have an unreliable estimation of model performance (performance statistic will have high variance). If your training set is too small, your actual model parameters will have high variance. A good rule of thumb is to use an 80/20 train/test split. Then, your train set can be further split into train/validation or into partitions for cross-validation.
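A quick sketch of the 80/20 split plus a validation carve-out, using scikit-learn's train_test_split (the sizes simply follow the rule of thumb above):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# 80/20 train/test split, then carve a validation set out of the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 / 200 / 200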

