MIS 434 Text Analytics Final


What are the clustering similarity measures?

Computations of the degree of similarity between documents (or between a document and a query), based on the tokens they contain. There are 4 measures:
Overlap coefficient: measures the overlap of 2 document token sets, relative to the smaller document.
Jaccard coefficient: compares the tokens that are shared against all distinct tokens in the two documents (relative to both documents).
Cosine coefficient: calculates the angle between the two document vectors; the smaller the angle (the further from 90 degrees), the more related the documents.
Euclidean distance: measures how far apart the two document vectors are on a normalized scale. Unlike the others, this is a DIFFERENCE measure, not a similarity measure. (HE WILL ASK ABOUT THIS.)
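A minimal sketch of the four measures, assuming documents are represented either as token sets (overlap, Jaccard) or as equal-length term-count vectors (cosine, Euclidean). The function names and the sample documents are made up for illustration.

```python
import math

def overlap_coefficient(a: set, b: set) -> float:
    """Shared tokens divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))

def jaccard_coefficient(a: set, b: set) -> float:
    """Shared tokens divided by all distinct tokens in both sets."""
    return len(a & b) / len(a | b)

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v) -> float:
    """Straight-line distance; larger means LESS similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical documents as token sets
doc_a = {"text", "mining", "is", "fun"}
doc_b = {"text", "mining", "rocks"}
print(overlap_coefficient(doc_a, doc_b))  # 2 shared / 3 in the smaller set
print(jaccard_coefficient(doc_a, doc_b))  # 2 shared / 5 distinct
```

Note that the first three grow as documents get more alike, while the Euclidean value shrinks.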

What is association rule mining? How does it work?

- It finds items that commonly appear together, usually in a shopping basket analysis.
- It works with support and confidence.
Support: the probability that a transaction contains both X and Y. (# of times the itemset appears) / (total transactions) = P(XY)
Confidence: the conditional probability that Y appears given that X appears. P(XY) / P(X)
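A small sketch of support and confidence computed over a made-up list of baskets. The transactions and function names are hypothetical; only the two formulas come from the card above.

```python
# Hypothetical shopping baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset: set, baskets: list) -> float:
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x: set, y: set, baskets: list) -> float:
    """P(XY) / P(X): how often Y shows up in baskets that contain X."""
    return support(x | y, baskets) / support(x, baskets)

print(support({"bread", "milk"}, transactions))       # 2 of 4 baskets
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread baskets
```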

CRISP-DM

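1. Business understanding: the business question to answer
2. Data understanding: data size, quality, variables
3. Data preparation: integration, normalization, selection
4. Modeling: put the data into the proper model
5. Evaluation: determine how effective, efficient, and valuable the model is
6. Deployment: use the results for decisions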

What are the main challenges of text-mining?

1. Highly unstructured data(Talk about this) 2. Huge amounts of data 3. Difficult data collection 4. high dimensionality (talk about this) 5. Data processing

What is classification? What are its components?

A SUPERVISED LEARNING method that classifies new data instances into groups based on training sets. Results are evaluated with a confusion matrix of true/false positives and negatives (TP, FP, TN, FN).
Evaluation methods:
Accuracy: (TP + TN) / All
Error rate: (FP + FN) / All
Sensitivity: TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F-measure: 2 x Precision x Recall / (Precision + Recall)
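The formulas above can be sketched as one function over the four confusion-matrix counts. The function name and the example counts are invented for illustration.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the evaluation measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same formula as sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and error rate always sum to 1, and sensitivity and recall are the same number under two names.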

What is regression? What are its components?

A SUPERVISED LEARNING technique for finding the relationship between a dependent variable and one or more independent variables. Linear regression model: Y = B1X1 + ... + BnXn + E
Y = dependent variable
X = independent variables
B = regression coefficients (weights)
E = error term
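A rough sketch of fitting the one-variable case by ordinary least squares. It assumes a single X and adds an intercept term, which the formula on this card does not show; the data points and the name fit_line are made up.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 1 + 2x, so the error term is zero
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```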

What is text classification?

Assigning the documents in a large collection to predefined classes. A classifier is built from labeled records (the training set) and is then used to label new documents. It also needs a similarity measure, such as the Jaccard or cosine coefficient, so that some documents count as more related than others.

What is a Term-Document Matrix?

A matrix of terms (single words or phrases) by documents, where each cell records the number of times a term appears in a document.
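A minimal sketch of building such a matrix, assuming whitespace tokenization and lowercase terms. The documents and the function name are hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """Map each term to a list of its counts, one entry per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms a document does not contain
    return {term: [c[term] for c in counts] for term in terms}

docs = ["the cat sat", "the cat saw the dog"]
matrix = term_document_matrix(docs)
for term, row in matrix.items():
    print(f"{term:>4}: {row}")
```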

What is relationship extraction? What are its components? Give an example.

After the entities are identified by NER, RE can be treated as a classification problem over the labeled elements in a document: given a pair of entities, decide whether a relationship exists between them. It can also draw on the syntactic structure of the sentence.

What is k-means clustering? Give an example.

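Also known as partition clustering: given a set of documents and a number of clusters k, each document is assigned to the cluster with the nearest centroid. It is an efficient method, but it needs k to be specified in advance, only finds local optima, is sensitive to noisy data and outliers, and works best when the clusters have regular, similarly sized distributions. Like KNN, it relies on distances between data points, but KNN is a supervised classifier while k-means is unsupervised.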
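A bare-bones sketch of the assign-then-update loop, assuming 2-D points and starting centroids supplied by the caller (real tools usually pick them randomly). The points and names are made up, and math.dist needs Python 3.8+.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so stop
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

A different starting pair of centroids can end in a different answer, which is the "local optima" weakness.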

What are collocations? Give an example.

An expression of 2+ words that is a conventional way of saying something. There are three rules:
1. Non-compositional: the meaning of the phrase is not literal and can't be derived from the individual words.
2. Non-substitutable: no word may be replaced with another.
3. Non-modifiable: nothing can be added to or taken away from the expression.
Classic example: "kick the bucket" is synonymous with dying.

What are the different word weighting techniques? Explain each.

Binary term occurrence: does a term appear in the doc? 1 if yes, 0 if no.
TF (term frequency): count of appearances per term, per doc. More frequent = more representative of the topic.
DF (document frequency): number of documents that contain the term. Appearing in fewer documents means more informative (df <= N). idf = log10(N/df)
TF-IDF: combination of term frequency and inverse document frequency, tf * idf. The more times a word appears in fewer documents, the more weight it gets. Works poorly with closely related documents, because the main ideas appear throughout and so get low weights.
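A short sketch of the tf * idf weighting with idf = log10(N/df), assuming whitespace tokenization and raw counts for tf. The documents and the function name are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf * idf} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df: number of documents that contain each term at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights

docs = ["data mining data", "text mining", "text analytics"]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document gets idf = log10(1) = 0, which is why closely related documents weight poorly.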

What is hierarchical clustering? Give an example.

Builds clusters from previously formed clusters, producing a tree structure that looks similar to a decision tree. It does not require k, but it doesn't scale well and can't undo previous merges or splits. The main distinction is that the clustering can start from either end: agglomerative (bottom-up) or divisive (top-down).
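A toy sketch of the agglomerative (bottom-up) direction, assuming 1-D numeric points and single linkage (distance between the two closest members). Both choices, the data, and the function name are assumptions made for illustration.

```python
def agglomerative(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerative([1, 2, 6, 7, 20])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Once two clusters are merged they are never split again, and the nested pairwise search is why the method scales poorly.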

What is stemming?

Reducing all forms of a word to a common normalized stem for analysis (e.g. "connected", "connecting", and "connection" all become "connect").

Give a multi-dimensional view of data mining

DATA TO BE MINED: data warehouses, databases, transactional data, web
KNOWLEDGE TO BE MINED: characterization, classification, association
TECHNIQUES UTILIZED: OLAP, machine learning, stats
APPLICATIONS ADAPTED: retail, telecom, banking

What are the different classification models? What is each used for (give an example)?

Decision tree: classifies unknown instances by following the branches of a tree built from successive splits of the data set. Used to determine the likelihood of a new customer making a purchase.
Naive Bayes: probabilistic classifier that assumes conditional independence between features. Scales nicely with larger training sets. Versatile for things like spam filtering.
KNN: lazy learner (no model beyond finding the closest neighbors); assigns the class most common among the top k neighbors. Slow and computationally expensive at prediction time.
SVM: uses a kernel function (linear, polynomial, RBF) and builds a classifier for each category. Very strong speech recognition tool.
Neural net: give it the input, output, and depth of the inner nodes, and let the neural net figure out what the factors in the middle are. Runs the model many times, adjusting the weights between nodes.
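As one concrete case, KNN can be sketched in a few lines, assuming 2-D numeric features, Euclidean distance, and a majority vote. The training points, labels, and function name are invented, and math.dist needs Python 3.8+.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label a query point by majority vote of its k nearest neighbors."""
    # "Lazy": no model is built, every prediction scans the training set
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (features, class)
train = [
    ((1, 1), "spam"), ((1, 2), "spam"), ((2, 1), "spam"),
    ((8, 8), "ham"), ((9, 8), "ham"), ((8, 9), "ham"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 8)))
```

Sorting the whole training set for every query is what makes KNN slow on large data.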

What are the applications of text classification? What are the features of each representation?

Document categorization: such as newswire filtering or patent classification
Text filtering: classify incoming streams of documents in order to filter them, like spam
Authorship attribution: determine the author of an article
Sentiment analysis: determine the sentiment polarity of documents

What are the levels of sentiment analysis? How is each used? Give examples.

Document level: overall sentiment of the opinion holder. Assumes each document is about one subject from one person.
Sentence level: has 2 tasks, to determine subjectivity and to determine sentiment.
Feature level: looks for things like word groupings that convey sentiment.

What is POS tagging?

Finding what the functional part of speech each word is (noun, verb, etc.).

What are the feature selection methods?

Individual Feature Ranking (IFR): evaluates each feature individually; assumes independence.
Feature Subset Selection (FSS): evaluates subsets of features; there are 2^p - 1 possible combinations.
Filter: uses measures intrinsic to the data to filter out irrelevant features, such as overly abundant occurrences or word stems.
Wrapper: uses an embedded classifier to eliminate irrelevant features, as evaluated by the classification algorithm.

What are the different levels of text representation? Explain and give an example of each.

Lexical: characters, words, phrases, POS tags. Ex. tokenization
Syntactic: parsing. Ex. parse trees
Semantic: looking at how one phrase relates to the next. Ex. WordNet
Pragmatic: understanding the context of the setting.

What are the 2 bases of sentiment analysis? How is each used? Give examples.

Lexicon based: uses a dictionary (possibly with weights) for each sentiment class to score sentiment in a bag-of-words style.
Machine learning: uses a predefined training set and cross-validation so that a machine learning algorithm can determine the predictors for class assignment.
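A toy sketch of the lexicon-based approach, assuming a tiny weighted dictionary and whitespace tokens. The lexicon entries, weights, and function name are made up and are not taken from any real sentiment lexicon.

```python
# Hypothetical weighted lexicon: positive words > 0, negative words < 0
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def lexicon_sentiment(text: str) -> str:
    """Sum the weights of known words and map the total to a class."""
    score = sum(LEXICON.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The food was great but the service was bad"))
print(lexicon_sentiment("terrible movie"))
```

Because this is bag-of-words, word order and negation ("not good") are ignored, which is a known weakness of the approach.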

What is named entity recognition? What are its components? Give an example.

NER identifies names, places, and dates in text and labels each such token with its entity type as a metadata element, for more accurate text analysis.

What is information extraction? What is another name for it?

Natural language processing that automatically extracts structured information from unstructured text, by looking for things like token relationships and token recognition.

What is sentiment analysis? Why is it important?

The analysis of subjective opinions, sentiments, and emotions represented in text. Gives an idea of general opinion surrounding a topic which is what is needed for decision makers. Allows for organizations to monitor reputation and give timely feedback.

What applications are considered analytics?

Visualization techniques, knowledge discovery, statistical summaries, querying, reporting

4 V's of big data

Volume: amount of data
Velocity: speed at which data is coming in
Variety: structured/unstructured data
Veracity: uncertainty of data quality

What is web crawling in RapidMiner? What are regular expressions for?

Web crawling is the ability of the DM software to search the web from a given seed location and find the web pages that meet certain criteria. Regular expressions are the instructions that tell RapidMiner, starting from the seed location, which links to follow and which pages to save.
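A rough illustration of how regular expressions can act as crawl rules, written with Python's re module rather than RapidMiner itself. The example.com URLs and both patterns are hypothetical.

```python
import re

# Hypothetical rules: follow anything under /news/, but only save
# pages that look like /news/<4-digit year>/<name>.html
FOLLOW_PATTERN = re.compile(r"https?://example\.com/news/.*")
STORE_PATTERN = re.compile(r"https?://example\.com/news/\d{4}/.+\.html")

urls = [
    "https://example.com/news/2024/story.html",
    "https://example.com/about",
    "https://example.com/news/index",
]

to_follow = [u for u in urls if FOLLOW_PATTERN.fullmatch(u)]
to_store = [u for u in urls if STORE_PATTERN.fullmatch(u)]
print(to_follow)
print(to_store)
```

The follow rule is usually broader than the store rule, so the crawler can pass through index pages without saving them.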

What is dimensionality reduction? Why is it important for machine learning especially for text analytics?

When a data set has many features (high p), models tend to over-fit, especially as the number of observations n shrinks. Dimensionality reduction reduces noise by eliminating irrelevant features. Methods are either feature selection or feature extraction. The former picks dimensions from the existing data. The latter generates its own dimensions based on relationships in the data.

What is parsing?

Working out the grammatical structure of a sentence: finding which groups of words go together, which words are the subject or object of a verb, etc.

How do you evaluate data-mining models?

Clustering: confusion matrix?
Classification: a confusion matrix, which yields measures like accuracy, error rate, precision, and recall
Regression: fitness of the model (R^2, p-value, MSE)
Association: support, confidence

What are stop words?

Common words that are used in a functional role but don't carry information. These are removed.

What are the evaluation metrics for classification? What are the different methods?

Effectiveness: ability to make the right decision, based on the confusion matrix.
Interpretability: ability to make sense of the model.
Efficiency: time to train on the training set and to classify future sets.
Methods:
Hold-out: split the data into training and testing sets.
LOOCV: leaves one observation out for testing and uses the rest for training; repeats for each observation and averages the results.
K-fold: similar to LOOCV, but splits the data into k folds with multiple data points in each fold, for computational benefit and to reduce possible bias in the data set.
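A small sketch of how k-fold splits can be generated, assuming the observations are referred to by index and are not shuffled first (real tools usually shuffle). The function name is hypothetical. Setting k equal to n gives LOOCV.

```python
def k_fold_splits(n: int, k: int):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(n=6, k=3))
for train, test in splits:
    print("train:", train, "test:", test)

loocv = list(k_fold_splits(n=4, k=4))  # one observation per test fold
```

Every observation lands in a test fold exactly once, and the model's score is averaged over the k runs.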

What are the text mining tasks and what does each mean? Give an example of each.

Exploratory Text Analytics: using text to form hypotheses. Ex. finding symptoms of a disease
Predictive Text Analytics: text classification. Ex. spam filtering
Information Extraction: automatically create knowledge bases so that standard DM techniques can be used. Ex. event detection

Define Data Mining/Knowledge Discovery from Data (KDD)

Extraction of non-trivial, implicit, previously unknown, and potentially useful information, patterns, or knowledge from huge amounts of data.

What are the different DM techniques and what is each used for?

Generalization: information integration and data warehouse construction, AKA generalize, summarize, and contrast data characteristics
Association & Correlation Analysis: find frequent patterns (or frequent itemsets); understand association and correlation vs. causality
Classification: construct models (functions) based on some training examples, and describe/distinguish classes or concepts for future prediction
Cluster Analysis: group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Outlier Analysis: analyze data objects that do not comply with the general behavior of the data set

What is clustering? What are its components? What are the common models?

Grouping collections of data objects so that intra-class similarity is high (cohesiveness) and inter-class similarity is low (distinctiveness). UNSUPERVISED LEARNING method with no predefined classes. This is done in hierarchical (bottom-up = agglomerative, top-down = divisive) or k-means format.

What are n-grams?

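Grouping consecutive words together into single tokens. The n is the number of words in each token: a 2-gram (bigram) has two words, a 3-gram has three. Note that when n-grams are generated up to a maximum length, each shorter gram below it is also collected.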
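A quick sketch of n-gram generation over a list of word tokens. The second helper mirrors the note about shorter grams also being collected; both function names are made up.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngrams_up_to(tokens, max_n):
    """Collect the 1-grams, 2-grams, ... up to max_n-grams together."""
    return [gram for n in range(1, max_n + 1) for gram in ngrams(tokens, n)]

tokens = ["text", "mining", "is", "fun"]
print(ngrams(tokens, 2))
print(len(ngrams_up_to(tokens, 2)))  # 4 unigrams + 3 bigrams
```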

What is feature extraction? How is it different from feature selection? How can it find relationships between words? How can it identify key dimensions?

It generates its own features based on data interrelationships. Ideally these generated dimensions will be fewer than the p original features while still representing a high proportion of the data. It finds these dimensions in text by looking for things like synonyms or semantically related terms. These relationships between words reflect topics or themes in the data. Common methods are PCA, LSA, LDA (latent Dirichlet allocation), and word embedding.

What is web scraping in RapidMiner? What is XPath for?

Specifying and extracting only the component of the web page you want (such as the text of an article). XPath is the query language used to specify the path to a particular component of an HTML/XML page. In RM it's used to extract that page component.
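A minimal illustration of an XPath-style query, using Python's built-in xml.etree.ElementTree instead of RapidMiner. ElementTree supports only a subset of XPath and needs well-formed markup, so this is a sketch; the page content is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with navigation and an article body
page = """<html><body>
<div class="nav">Home</div>
<div class="article"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""

root = ET.fromstring(page)
# Path: any <div> whose class is "article", then its <p> children
paragraphs = [p.text for p in root.findall(".//div[@class='article']/p")]
print(paragraphs)
```

The path skips the navigation div entirely, which is the point of scraping: keep the article text and drop the rest of the page.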

What is tokenization?

Splitting text into individual tokens, usually words.
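A simple sketch using a regular expression, assuming lowercase English text where a token is a run of letters, digits, or apostrophes. The rule and the function name are one possible choice, not a standard.

```python
import re

def tokenize(text: str) -> list:
    """Lowercase the text and pull out runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Text mining isn't easy!"))
```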

What's the difference between supervised and unsupervised learning?

Supervised: the data has predetermined outcomes given for training.
Unsupervised: methods that involve no training set/predetermined outcomes.

What is point-wise mutual information (PMI)? Give an example.

Measures the dependence between words: how much more often they co-occur than would be expected by chance. Example: in "hot dog", "hot" is not that reliant on "dog", while "dog" is more reliant on "hot". The main issue is that rarer words get higher PMI: if two words each appear only once and happen to appear together, their PMI will be very high even though they may be relatively unrelated.
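A small sketch of the usual formula, PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with probabilities estimated from counts. The base-2 log is a common convention assumed here, and all the counts are invented.

```python
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """log2 of observed co-occurrence over what independence would predict."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts from a corpus of 10,000 word positions
common_pair = pmi(count_xy=20, count_x=100, count_y=50, total=10_000)
rare_pair = pmi(count_xy=1, count_x=1, count_y=1, total=10_000)
print(common_pair, rare_pair)
```

The pair seen only once scores higher than the pair seen 20 times, which is the rare-word problem noted above.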

What is exploratory text analysis? What are the types of exploratory text analysis? What is each used for (give examples)?

The goal is to explore the data. What are the frequent phrases? What are the word relationships? Are the documents related? etc.
Frequency analysis: after the text is pre-processed into a term-document matrix, this identifies the most frequent phrases to see what the main topics of the documents are.
Co-occurrence analysis: looks at words that appear together across multiple docs to help identify phrases, collocations (word relationships), and term associations.
Cluster analysis: uses similarity measures between documents to find clusters of documents that are relevant to one another, based on the tokens contained in each.

