text processing
how may stemming produce words that are not complete?
by crudely chopping off word endings, for example dropping a silent 'e', so 'caching' becomes 'cach'
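a quick demo with nltk's PorterStemmer (the sample words are illustrative):
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("caching"))  # -> 'cach' (not a complete word)
print(stemmer.stem("studies"))  # -> 'studi'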
how can you parse sentences using nltk?
create a grammar rule and use ChartParser; for an ambiguous sentence the parser returns both valid trees
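a minimal sketch; the toy grammar and ambiguous sentence here are illustrative assumptions:
import nltk

# a toy context-free grammar for an ambiguous PP-attachment sentence
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'dog' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "the man saw a dog with a telescope".split()
for tree in parser.parse(sentence):
    print(tree)  # prints both valid parse trees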
in pandas, how do you convert all values in dataframe column 'title' to lowercase? given df columns 'title' and 'publisher'
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]
for ChartParser, how do you visualize the parse tree?
for tree in parser.parse(sentence):
    tree.draw()
using beautiful soup, how do you extract text?
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html5lib")  # r is a requests response
print(soup.get_text())
in python, how can you split 'text' into sentences?
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
using an external library, how to tokenize text into words?
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
in pandas, how do you read a csv file?
import pandas as pd
df = pd.read_csv("filename.csv")
in python, how to replace all chars that are not a-z or A-Z or 0-9 with " " ?
import re  # regular expressions
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
in python, how do you do a simple request to a REST API? (from url https://quotes.rest/qod.json)
import requests
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
how does lemmatization work?
it uses a dictionary to map a word back to its root form
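for example, the WordNet lemmatizer maps irregular forms back to their dictionary root:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("geese"))         # -> 'goose' (dictionary lookup, not suffix chopping)
print(lemmatizer.lemmatize("was", pos='v'))  # -> 'be'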
what is another method to reduce words to a normalized form, other than stemming?
lemmatization
what library is useful for performing text operations?
nltk (natural language toolkit)
in python, what is good for tokenizing tweets?
nltk has a TweetTokenizer that handles hashtags and emoticons
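a quick sketch (the sample tweet is made up):
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
print(tknzr.tokenize("loving #nlp :) @someone"))
# -> ['loving', '#nlp', ':)', '@someone']  hashtags and emoticons stay whole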
how is nltk word_tokenize different than python split?
nltk's word_tokenize splits a little smarter: it keeps abbreviations like "Dr." as a single token while splitting other punctuation into separate tokens, whereas str.split only breaks on whitespace
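a side-by-side comparison:
from nltk.tokenize import word_tokenize

text = "Dr. Smith arrived, finally."
print(text.split())
# -> ['Dr.', 'Smith', 'arrived,', 'finally.']  punctuation stays glued to words
print(word_tokenize(text))
# -> ['Dr.', 'Smith', 'arrived', ',', 'finally', '.']  'Dr.' kept intact, commas/periods split off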
in python lemmatizer, how do you specify it to lemmatize verbs?
pass pos='v' into the lemmatize function (generally you can pass the output from the noun lemmatization step into this one); use nltk's WordNetLemmatizer:
from nltk.stem.wordnet import WordNetLemmatizer
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
what are some common stemmers in nltk?
PorterStemmer, SnowballStemmer
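both live in nltk.stem; SnowballStemmer takes a language argument (the example word comes from nltk's own docs):
from nltk.stem import PorterStemmer, SnowballStemmer

print(PorterStemmer().stem("generously"))             # -> 'gener'
print(SnowballStemmer("english").stem("generously"))  # -> 'generous'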
is it better to remove punctuation or replace with a space?
replace with a space. replacing with a space makes sure that words don't get concatenated together, in case the original text did not have a space before or after the punctuation
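a tiny demo of the difference:
import re

text = "end.Start"
print(re.sub(r"[^a-zA-Z0-9]", "", text))   # -> 'endStart' (words concatenated)
print(re.sub(r"[^a-zA-Z0-9]", " ", text))  # -> 'end Start' (words preserved)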
general workflow for text processing?
sentencize -> normalize -> tokenize -> remove stop words -> stem / lemmatize (nouns) -> stem / lemmatize (verbs)
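a minimal end-to-end sketch of that workflow (the sample text is made up; assumes the relevant nltk data has been downloaded):
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Dr. Smith graduated. He started working at Udacity Inc."

for sentence in sent_tokenize(text):                                      # sentencize
    sentence = re.sub(r"[^a-zA-Z0-9]", " ", sentence.lower())             # normalize
    words = word_tokenize(sentence)                                       # tokenize
    words = [w for w in words if w not in stopwords.words("english")]     # remove stop words
    lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]            # nouns
    lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]  # verbs
    print(lemmed)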
in BeautifulSoup , how can you find all divs with a class of "course-summary-card" ?
soup.find_all("dif", class="course-summary-card")
what is tokenization?
splitting each sentence into a sequence of words
what is parts of speech tagging?
tagging each word with its part of speech (noun, verb, pronoun, etc.)
in BeautifulSoup, how can we select h3 tags from text ?
text.select_one("h3 a").get_text()  # CSS selector: the first <a> inside an <h3>
in python, how to remove surrounding whitespace?
text.strip()  # removes leading and trailing whitespace
in python, how to convert text to all lowercase?
text = text.lower()
what is stemming?
the process of reducing a word to its stem or root form
what are named entities?
typically noun phrases that refer to some specific object, person, or place
how do you effectively parse html docs in python?
use BeautifulSoup
how to remove stopwords in python?
use a python list comprehension:
from nltk.corpus import stopwords
words = [w for w in words if w not in stopwords.words("english")]
in python, how can you tag parts of speech?
use nltk's pos_tag:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
sentence = word_tokenize("I always lie down to tell a lie")
pos_tag(sentence)
notice that the two occurrences of "lie" are tagged differently (verb vs. noun)
in python, how do you code stemming?
use nltk's PorterStemmer, and pass in all cleaned words:
from nltk.stem.porter import PorterStemmer
stemmed = [PorterStemmer().stem(w) for w in words]
in python, how do you implement lemmatization?
use nltk's WordNetLemmatizer:
from nltk.stem.wordnet import WordNetLemmatizer
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
what is a good way to search for companies of interest in the news?
using named entity recognition
in python how can you label named entities?
use nltk's ne_chunk; you have to tokenize and tag parts of speech first:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
ne_chunk(pos_tag(word_tokenize("Antonio joined Udacity Inc. in California.")))
what are stopwords?
common words like "the", "a", "and", "this" that carry little meaning on their own
in python, how to split a sentence(text) into words?
words = text.split()