text processing

how may stemming produce words that are not complete?

by stripping suffixes mechanically, for example dropping a silent vowel along with the ending
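A minimal sketch of this effect, assuming nltk is installed; the word list is only an illustration. The Porter stemmer returns stems such as "fli" and "agre", which are not complete words.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words whose stems are not dictionary words
words = ["flies", "denied", "agreed", "humbled"]
stems = [stemmer.stem(w) for w in words]

print(stems)  # ['fli', 'deni', 'agre', 'humbl']
```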

how can you parse sentences using nltk?

create a grammar rule set and pass it to ChartParser; when a sentence is ambiguous, the parser returns every valid tree (both trees, in the course example)

in pandas, how do you convert all values in dataframe column 'title' to lowercase? (given df columns 'title' and 'publisher')

df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]
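A small self-contained sketch of the same idea, assuming pandas is installed; the two rows are made-up sample data.

```python
import pandas as pd

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "title": ["Big NEWS Today", "Markets RALLY"],
    "publisher": ["Reuters", "BBC"],
})

# .str.lower() lowercases every string in the column; other columns are untouched
df["title"] = df["title"].str.lower()

print(df[["publisher", "title"]].head())
```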

for chartparse, how do you visualize the parse tree?

for tree in parser.parse(sentence):
    tree.draw()

using beautiful soup, how do you extract text?

from bs4 import BeautifulSoup
# r is the response returned by requests.get(...)
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())

in python, how can you split 'text' into sentences?

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)

using an external library, how to tokenize text into words?

from nltk.tokenize import word_tokenize
words = word_tokenize(text)

in pandas, how do you read a csv file?

import pandas as pd
df = pd.read_csv("filename.csv")

in python, how to replace all chars that are not a-z or A-Z or 0-9 with " " ?

import re  # regular expressions
text = re.sub(r"[^a-zA-Z0-9]", " ", text)

in python, how do you make a simple request to a REST API? (from url https://quotes.rest.qod.json)

import requests
r = requests.get("https://quotes.rest.qod.json")
res = r.json()

how does lemmatization work?

it uses a dictionary for mapping a word back to its root

what is another method to reduce words to a normalized form, other than stemming?

lemmatization

what library is useful for performing text operations?

nltk (natural language toolkit)

in python, what is good for tokenizing tweets?

nltk's TweetTokenizer, which keeps hashtags and emoticons as single tokens
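A brief sketch, assuming nltk is installed; the sample tweet is the one used in the NLTK documentation. It compares TweetTokenizer with a plain split(), which leaves punctuation glued to the hashtag.

```python
from nltk.tokenize import TweetTokenizer

tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

# TweetTokenizer keeps the hashtag and each emoticon as its own token
tokens = TweetTokenizer().tokenize(tweet)

# Plain whitespace splitting leaves the colon attached: '#dummysmiley:'
naive = tweet.split()

print(tokens)
print(naive)
```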

how is nltk word_tokenize different than python split?

nltk's word_tokenize splits a little smarter, e.g. it keeps the period in "Dr." instead of producing "Dr"

in python lemmatizer, how do you specify it to lemmatize verbs?

pass pos='v' to the lemmatize function of nltk's WordNetLemmatizer (generally you can pass the output from the noun lemmatization step into this one)
from nltk.stem.wordnet import WordNetLemmatizer
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]

what are some common stemmers in nltk?

PorterStemmer and SnowballStemmer

is it better to remove punctuation or replace with a space?

replace with a space. This makes sure that words don't get concatenated together in case the original text did not have a space before or after the punctuation.
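A short sketch of the difference using only the standard library; the sample string is made up, with spaces deliberately missing after the punctuation.

```python
import re

# Made-up text with no space after the period or the comma
text = "First sentence.Second one,with no spaces"

# Deleting punctuation glues neighbouring words together
removed = re.sub(r"[^a-zA-Z0-9\s]", "", text)

# Replacing punctuation with a space keeps the words apart
replaced = re.sub(r"[^a-zA-Z0-9]", " ", text)

print(removed.split())   # ['First', 'sentenceSecond', 'onewith', 'no', 'spaces']
print(replaced.split())  # ['First', 'sentence', 'Second', 'one', 'with', 'no', 'spaces']
```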

general workflow for text processing?

1. sentencize
2. normalize
3. tokenize
4. remove stop words
5. stem / lemmatize (nouns)
6. stem / lemmatize (verbs)
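A simplified sketch of the workflow above, assuming nltk is installed. It skips sentence splitting, uses a tiny hand-written stop word set (STOP_WORDS is illustrative, not nltk's list), and uses stemming instead of lemmatization so that no corpus download is needed. The function name process is hypothetical.

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny illustrative stop word set; a real pipeline would use a fuller list
STOP_WORDS = {"the", "a", "and", "are", "is", "this"}


def process(text):
    # Normalize: lowercase, then replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize: split on whitespace
    words = text.split()
    # Remove stop words
    words = [w for w in words if w not in STOP_WORDS]
    # Stem each remaining word
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in words]


result = process("The cats are running, and the dog agreed!")
print(result)  # ['cat', 'run', 'dog', 'agre']
```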

in BeautifulSoup , how can you find all divs with a class of "course-summary-card" ?

soup.find_all("div", class_="course-summary-card")

what is tokenization?

splitting each sentence into a sequence of words

what is parts of speech tagging?

labeling each word with its part of speech (noun, verb, pronoun, and so on)

in BeautifulSoup, how can we get the text of the first link inside an h3 tag?

text.select_one("h3 a").get_text()
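A self-contained sketch combining the two BeautifulSoup lookups, assuming bs4 is installed. The HTML snippet is made up, and the built-in "html.parser" is used here so that html5lib is not required. Note that the keyword is class_ with a trailing underscore, because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration
html = """
<div class="course-summary-card"><h3><a href="/a">Intro to NLP</a></h3></div>
<div class="course-summary-card"><h3><a href="/b">Web Scraping</a></h3></div>
<div class="footer">Contact</div>
"""

soup = BeautifulSoup(html, "html.parser")

# All divs carrying the given class
cards = soup.find_all("div", class_="course-summary-card")

# Text of the first <a> inside an <h3>, for each card
titles = [card.select_one("h3 a").get_text() for card in cards]

print(titles)  # ['Intro to NLP', 'Web Scraping']
```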

in python, how do you remove leading and trailing whitespace?

text.strip()
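A quick sketch with a made-up string. strip() only trims the two ends; collapsing repeated whitespace inside the text takes a split and re-join.

```python
text = "  too   many    spaces \n"

# strip() removes whitespace at the start and end only
trimmed = text.strip()

# split() with no arguments splits on any run of whitespace,
# so re-joining with single spaces collapses the inner gaps too
collapsed = " ".join(text.split())

print(repr(trimmed))    # 'too   many    spaces'
print(repr(collapsed))  # 'too many spaces'
```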

in python, how to convert text to all lowercase?

text=text.lower()

what is stemming?

the process of reducing a word to its stem or root form

what are named entities?

typically noun objects that refer to some specific object, person, or place

how do you effectively parse html docs in python?

use BeautifulSoup

how to remove stopwords in python?

use a python list comprehension
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))  # build the set once, not per word
words = [w for w in words if w not in stop_words]

in python, how can you tag parts of speech?

use nltk's pos_tag
from nltk import pos_tag
from nltk.tokenize import word_tokenize
sentence = word_tokenize("I always lie down to tell a lie")
pos_tag(sentence)
notice that the two occurrences of "lie" are tagged differently

in python, how do you code stemming?

use nltk's PorterStemmer, and pass in all cleaned words
from nltk.stem.porter import PorterStemmer
stemmed = [PorterStemmer().stem(w) for w in words]

in python, how do you implement lemmatization?

use nltk's WordNetLemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]

what is a good way to search for companies of interest in the news?

use named entity recognition to pick out company names

in python how can you label named entities?

use nltk's ne_chunk; you have to tokenize and tag parts of speech first
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
ne_chunk(pos_tag(word_tokenize("Antonio joined Udacity Inc. in California.")))

what are stopwords?

very common words that carry little meaning on their own, like "the", "a", "and", "this"

in python, how to split a sentence(text) into words?

words=text.split()

