Python for Data Science

unix time

tracks the progress of time by counting seconds since a fixed instant (the Unix epoch, 1970-01-01 00:00:00 UTC); appears as an integer

What is the definition of 'predictive analytics'?

using data to predict outcomes

How do you determine the size of a decision tree?

number of nodes in the tree (excluding the root node)

What is the symbol that denotes Magic Functions in Jupyter? % $ # //

Using markdown cells in Jupyter, how do you format text as bold? **word** *word* ##word## #word#

**word**

two categories of data visualization

-conceptual or data driven (show a concept EX: supply demand curve) -declarative or exploratory (display conclusion to observer)

key features of numpy

-multi dimensional arrays -built in operations/packages -integrate multiple languages -speed

why is numpy so fast

-numpy arrays are fixed in size -elements must be the same type

key features of jupyter notebook

-supports multiple platforms -documented data science -reproducible data science -collaboration in groups

pandas benefits

-variety of data sources -data integration -data transformation -visualizations

Which Root Mean Square Error (RMSE) would represent a perfect prediction with no errors in regression? 0 NaN 1 -1

After running nltk.download("movie_reviews"), which of the following needs to be called to import movie_reviews? from nltk.corpus import movie_reviews from nltk.corpora import movie_reviews nltk.import(movie_reviews) nltk.import("movie_reviews")

As an example, let's consider a data set consisting of two variables, one representing how long students spend studying and the other representing their average test scores. Assuming students study effectively, how would you expect these two variables to be correlated? Positively correlated because I would expect the test scores to go up if they spend more time studying. Negatively correlated because as the test scores go down, students should spend more time studying. No correlation because these variables should be independent.

How do you assign each sample in a dataset to a centroid using the k-means algorithm? Assign the sample to the cluster with the closest centroid. Assign the sample to the cluster with the furthest centroid. Assign the sample to a random cluster.

How do you determine the new centroid of a cluster? Calculate the mean of the cluster Calculate the max of the cluster Calculate the mode of the cluster Calculate the min of the cluster

SELECT name FROM san_diego_elementary WHERE grade > 70 Which students at San Diego Elementary scored higher than 70%? At which elementary schools have students scored higher than 70%? Of all the elementary students in san diego that have an average grade of >70%, how many students are from San Diego Elementary?

What does the "within-cluster sum of squared error" provide? A mathematical measure of the variation within a cluster. An error measurement for a specific sample in relation to the centroid of a particular cluster. An answer to which cluster is the most 'correct.'

What does the following method call return? accuracy_score(y_true = data_test, y_pred = predictions) The fraction of correctly classified samples. The number of correctly classified samples.

What does the most_common function of Counter return? a list of words in the form ('word', frequency) a list of words in the form ('word') a list of words in the form (frequency, 'word') a list of words in the form (frequency)

What is an example of insight turned into action? (pick 1) 1 Marketing a new product based on past sales information 2 Prediction of customers' choices 3 Understanding of customer profiles 4 Gathering data sales data

What is the appropriate input for the following line of code to make a linear regression prediction? y_prediction = regressor.predict(___) x_test x_train y_train y_test

What is the command to get the number of rows in a data set titled "data"? data.shape[0] data.shape[1] data.size() data.length()

What is the definition of corpus? a collection of text in digital form a download interface to pre-processed text datasets

What is the first step in constructing a decision tree? Start with all samples at a node. Partition the samples into subsets based on the input variables. Repeatedly partition data into successively purer subsets until stopping criteria are satisfied.

What is the output of the execute method in sqlite? an iterator a cursor a table

What is the primary data structure for a relational data model? Table List Vector 3-D data frame

What type object is the intersection function called on? set twitter api authentication list

What type of object does the function Kmeans output? kmeans dataframe integer series

What will type(_) return? type of output from previous line type that represents the underscore symbol prompt to fill in the blank for type of output error because there is no object to return the type

What would 18490/1e3 (thousands format) result? 18.49 18490000 1.8490 18490.000

When is a prediction task referred to as simple linear regression? When there is only one input variable. When there is more than one input variable. When there are two input variables.

Which of the following string formatting centers a string to be in the middle of 10 spaces? {:^10} {:10} {:>10} {:<10}

Which parameter in the KMeans clustering algorithm do you have to specify for the number of clusters you want? n_clusters clusters tot cluster_centers

Why are decision boundaries of a decision tree parallel to the axes formed by the variables? Each split considers only a single variable Each subset should be as homogenous as possible The induction algorithm eventually stops expanding

You are given a dataframe labeled x where the column 'number' indicates the index of a record. Which function call would create a new dataframe y that takes more than 10 samples x if x has 100 records? y = x[(x['number']%5)==0] y = x[(x['number']%10)==0] y = x[(x['number']%15)==0]

What are the two components of data retrieval mentioned in this class? The way you specify how to get the desired data out of the relational data store. The internal processing that occurs within the data management system to compute or evaluate that specified retrieval request. The way you store specific data in a data management system. How to handle the data once it has been retrieved.

1 2

What are the ingredients to form a data science problem (select 3)? 1 define what it is you're trying to tackle 2 assess the situation with respect to the problem 3 define your goals and objectives 4 assess the population to which the problem refers

1 2 3

To use scikit-learn: DecisionTreeRegressor, train_test_split, and mean_squared_error, which of the following libraries are necessary? (Choose the best two) pandas sklearn.metrics sklearn.model_selection sklearn.tree scikitlearn

1 5

5 Basic Steps of Data Science

1. Acquire 2. Prepare 3. Analyze 4. Report 5. Act (mnemonic: APARA)

3 questions to ask when reporting (apaRa)

1. What are the main results 2. What value do these results provide 3. How can the model add to this application

steps to formulate a research question

1. define a problem 2. assess the situation 3. Define the goals

two steps of preparing data

1. explore dataset 2. pre-process dataset

2 types of pandas data structures

1. pandas Series 2. pandas DataFrame

principles of data visualization

1. trustworthy (do not mislead audience) 2. accessible (made to be easily perceived and used by audience) 3. elegant (easy to read and interpret in a visually appealing way)

What is an example of Unix time (int64)?

1138537770
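
As a small illustration of the Unix time cards, the integer above can be converted to a readable date with the standard library. This is a minimal sketch; the variable names are chosen here for illustration only.

```python
from datetime import datetime, timezone

# The example Unix time from the card: seconds since 1970-01-01 00:00:00 UTC.
unix_seconds = 1138537770

# Convert to a timezone-aware datetime in UTC.
moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(moment.isoformat())  # 2006-01-29T12:29:30+00:00

# Converting back gives the original integer.
round_trip = int(moment.timestamp())
```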

For a classification problem, if you want to predict the letter grade that a student would receive, what are 2 examples of reasonable input data to consider? Amount of time spent studying Percentage grade these students received in the previous semester Letter grade different students received in another class The students' ID numbers

What 2 statements describe classification in the context of machine learning? Predict the category of the target given input data Supervised task Unsupervised task Numerical target variable

What happens when you join two tables in pandas using natural join? Select all that applies. The common column is represented once. The common row is represented once. The common column is represented twice. The common row is represented twice.

What does the Twitter API allow access to? users location interactions personal information

123

Which of the following are frequent corner cases in tokenization? inconsistent use of punctuation shortened use of words hyphenated words multiple spaces between words

123

In accessing a client using OAuth 1.0A, what are the four required identifiers to establish access to the resources? consumer key consumer secret access token access token key access token secret

1235

Which two code outputs are equivalent? print('#'*5) print ('#*5') print('#####') print('#*'5)

pandas Series

1D labeled array, similar to an ndarray but able to hold multiple data types

Rank 1 ndarray

1D array (a vector)

Automatic translators are a natural language processing (NLP) technique. How do they work? Learn what and how a human talks over time. Takes words, phrases, and context into account to understand what is being said. Process questions, categorize them, and match them to existing answers.

If you have a dataframe titled 'dat' with 5 rows and 2 columns and you run the following line of code, how many boolean values are returned? dat.isnull().any() 2 5 1 10

In the twitter package in python, what does the OAuth function return? list of necessary authentication authentication object twitter API there is no return value

What is the correct way to show the last 2 files using movie_reviews.fileids() where movie_reviews is a downloaded dataset? movie_reviews.fileids()[:-2] movie_reviews.fileids()[-2:] movie_reviews.fileids([:-2]) movie_reviews.fileids([-2:])

What is the difference between regression and classification for machine learning in Python? Regression transforms categorical values to numeric and then follows the same as classification. Regression is used to predict a numeric value while classification is used to predict a categorical value. Classification is used when the input data is categorical and regression is used when the input data is numeric.

What is the first step with any dataset? Research all of the background information. Perform an initial exploration. Verify the findings that others have found.

What is the function call to output the name of columns of a dataframe named x? x.columns(0) x.columns columns(x)

When is it NOT acceptable to avoid axis labels in plots using matplotlib? When you are simply exploring the data and know their values. When you are presenting non-intuitive results to another person. When the labels can be determined by the values (e.g., percentage correct, years).

Which of the following code snippets is a conditional statement? for s in statuses: if not s["text"] in all_text: [s['text'] for s in search_results['statuses'] statuses = filtered_statuses

What is true about data science? Select 3. 1 a static, one time analysis of big data 2 a model generated that leads to insights and can be improved 3 management of data 4 seeing how everything is connected

2 3 4

Which of the following are examples of stop words in English? stop doing those being myself word

2345

pandas DataFrame

2D labeled data structure whose columns can hold different data types

Rank 2 ndarray

2D array (a matrix)

How would you initially handle an anomaly (apparent outlier) in cluster analysis? Throw it out of the dataset Disregard in further analysis Provide further analysis on the anomaly

What is the benefit of a log graph over a graph that has not modified the scales? There is no outright advantage to converting a graph into a log scale. Log scales are significantly better in representing every type of data compared to unmodified scale graphs. Log scales allow a large range of values that are compressed into one point of the graph to be shown.

What is the name of a sophisticated word tokenizer trained on English in nltk? punkd punked punkt punnet

When plotting word frequency, what does a peak distribution most often entail about the vocabulary? There are many unique words There is a large vocabulary There is a focused topic

Which of the following is true about a model? built using test data evaluated on training data trained by the training data set

Assume the code line text = "New York-based" is run, what would be the output of text.split()? ['New',' ','York-based'] ['New','York', '-', 'based'] ['New','York','based'] ['New','York-based']

Given a list named x, what is the object type of the output returned by Counter(x)? no return type dataset list counter

Given: twitter_api = twitter.Twitter(auth=auth) WOE = 1 What is the correct code to find the trend? twitter_api.trends.place(id = WOE) twitter_api.trends.place(1) twitter_api.trends.place(WOE) twitter_api.trends.place(_id = WOE)

How does twitter identify location? city, state, and country longitude and latitude coordinates integer number

What is the correct way to use json to show the first three statuses in a list named statuses? print(json.dumps(statuses[4], indent = 1)) print(json.dumps(statuses[3], indent = 1)) print(json.dumps(statuses[1:4], indent = 1)) print(json.dumps(statuses[0:3], indent = 1))

What is the reason to prefix id with an underscore for query string parameterization with the twitter object? there is no need in the function, it is there to make it easier on the reader it is needed because it is part of the parameter name it is needed for the twitter object to know the type of output from previous line without it, twitter package appends the value to the URL

Where did we import NaiveBayesClassifier from in our movie review NLP notebook? sklearn sklearn.naive_bayes nltk.corpus nltk.classify

Given the python function from the video: def build_bag_of_words_features(words): return {word:True for word in words} What is the parameter type and return type? dictionary and boolean value set of words and boolean value dictionary and set of words dictionary and boolean value set of words and dictionary

What is the default number of rows that the function head() will return for a dataframe in pandas? 10 1 5 2

What is the definition of 'data mining'?

Activities related to finding patterns in databases and data warehouses.

As an example, you have a dataset containing numerical values of subjects' heart rates during exercise and categorical values describing how much they smoke. You want to determine whether smoking and heart rate are related. What machine learning category would this fall under?

Association analysis

For example, you want to predict the number of kids someone will have: either 0, 1, 2, or 3+. Is this an example of regression or classification? Regression Classification

CLASSIFICATION

What is the function KMeans (from sklearn) used for in Python in the example soccer data analysis overview?

Clustering
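
The k-means cards in this set describe two steps: assign each sample to the cluster with the closest centroid, then recompute each centroid as the mean of its cluster. The sketch below shows one such iteration in plain Python. It is not sklearn's KMeans implementation; the helper names `assign_to_closest` and `update_centroids` and the sample points are hypothetical, and it assumes no cluster ends up empty.

```python
import math

def assign_to_closest(points, centroids):
    """Return, for each point, the index of the nearest centroid (Euclidean distance)."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update_centroids(points, labels, k):
    """Return new centroids as the coordinate-wise mean of each cluster's members."""
    new_centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        # Assumption: every cluster has at least one member.
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # arbitrary starting centroids

labels = assign_to_closest(points, centroids)
centroids = update_centroids(points, labels, k=2)
print(labels, centroids)
```

Repeating these two steps until the labels stop changing gives the basic algorithm; the final clusters depend on the starting centroids.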

Which of the following is NOT a step in the formal data science process? Collaborate Acquire Prepare Analyze

Collaborate

Regular expression: '.*\,(.*)\&.*' String in Series: "Feuer, Eis & Dosenbier" What would be the output of the extract function with this string and regular expression?

Eis (the captured group is ' Eis ', including the surrounding spaces)
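
To check what the capture group in this card picks up, the same pattern can be tried with the standard library `re` module. This is a sketch using `re` rather than the pandas `str.extract` call from the course; the variable names are illustrative.

```python
import re

pattern = r'.*\,(.*)\&.*'
text = "Feuer, Eis & Dosenbier"

# The group captures everything between the comma and the ampersand.
match = re.match(pattern, text)
captured = match.group(1)

# The raw capture keeps the spaces around the word; strip() removes them.
cleaned = captured.strip()
print(repr(captured), repr(cleaned))
```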

You are given a data set that contains crime reports for different neighborhoods in New York City and told to analyze it. Would you consider this a declarative or explorative example? Explorative Declarative

Explorative

Which should you do first after you put together data needed for application? Build models to analyze the data Explore the data you put together

Explore the data

XML

Extensible Markup Language, a markup language that uses tags to describe and structure data, often used to exchange the contents of web pages and services; JSON is a separate, common alternative format

A pickle file is a python utility module that uses serialization and cannot be deserialized in a different notebook. True False

In the matplotlib.pyplot function hist, the bins parameter indicates to plot only elements in a list which has a number equal to bins. True False

Information from a tweet is only the data given by the 140 character string. True False

T/F Data science only generates actionable information for the future.

T/F Raw data from sources can always be directly used to perform analysis.

T/F Cluster analysis is a supervised task.

T/F Regression is an unsupervised task.

T/F Test data is the same dataset as training data in classification models.

The accuracy of test data will be high if one data from a large data set was used in the training set. True False

The join operation works efficiently for data of all sizes. True False

The nltk corpus only supports English stopwords. True False

The raw method is unique to movie_reviews, a downloaded dataset. True False

The strings "ABC" and "abc" will be counted as the same string using Counter. True False

True or False: The function call train_test_split(a, b) where a and b are dataframes will always output the same result. True False

Using the string library, the code "string.punctuation" returns a list of punctuation characters. True False

When you concatenate two dataframes using pandas concat function, the number of resulting columns will be the columns that BOTH dataframes have. As an example, if one dataframe has columns titled 'cat','dog' and another dataframe has columns titled 'dog','bunny', then the resulting dataframe columns will be 'dog'. True False

T/F Once you have devised an action based on an insight, there is no need to continue collecting data.

False: you need to keep collecting data to track progress after implementation

Code in the Jupyter code cells are restricted to being one line. True False

False

In machine learning, algorithms and programs directly aim to learn a given task.

False

T/F: Data science is a static, one-time analysis

False

Changing an element of an array slice in numpy will NOT change the original array. True False

False (a slice is a view of the original array, so changing it changes the original)
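
A short sketch of why the answer is False: basic slicing of a numpy array returns a view onto the same data, so writing to the slice writes to the original. The array values here are made up for illustration.

```python
import numpy as np

original = np.array([10, 20, 30, 40, 50])

# Basic slicing returns a view, not a copy.
view = original[1:4]
view[0] = 99  # this also changes original[1]

# An explicit copy is independent of the original.
independent = original[1:4].copy()
independent[1] = -1  # original is unaffected

print(original)
```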

What 2 things are most important in creating elegant visualizations? Focus on what is relevant. Remove anything which isn't adding to the figure. Use a unique style. Make sure the your visualization is trustworthy.

Focus on what is relevant. Remove anything which isn't adding to the figure.

What is NOT important in collecting data? -The user population -The intended uses of the application -How the data was presented

How the data was presented

html

Hyper Text Markup Language, A markup language used to structure text and multimedia documents, and to set up hypertext links between documents. Used extensively on the World Wide Web, it is the basis of every webpage

What is true between supervised and unsupervised approaches?

In supervised approaches, the target is provided. In unsupervised approaches, the target is unavailable.

How do we show the histogram created with graph.hist(...)? It will automatically show once you call the hist() function Call "graph.show()" Call "hist.show()" Include "%matplotlib inline"

Include "%matplotlib inline"

What are the 3 reasons that data scientists working in Python use numpy all the time? Its speed. Its functionality. Many packages rely on numpy. It enables text markup cells.

Its speed. Its functionality. Many packages rely on numpy.

In building a machine learning model, why do we want to adjust the parameters?

To minimize the model's error

If you create a DataFrame using pandas by accessing a column label that doesn't exist, what values are present in that column? NULL 0 No values; error message is printed NaN

NaN

Which of the following are benefits of ndarrays over lists? Select 3. Ndarrays are more space efficient. Ndarrays are more optimized for memory. Ndarrays often have faster computation. Ndarrays have more variable types than lists.

Ndarrays are more space efficient. Ndarrays are more optimized for memory. Ndarrays often have faster computation.

Which is an example of conceptually driven data visualization? Physicists visualize the well-understood relationship between force and acceleration to teach introductory physics students. Doctors try to explore the relationship between a drug and the effect it has on their patients using data visualization. Realtors visualize a data set containing rental listings and the amount of interest they attract.

Physicists visualize the well-understood relationship between force and acceleration to teach introductory physics students.

In what 3 ways can you quickly access numpy array elements? Slicing Using an array of indices Boolean indexing Segmenting an array

Slicing Using an array of indices Boolean indexing

sql

Structured Query Language

JSON is a data format used to communicate semi-structured information. True False

Only authentication needs to be used to create a Twitter API object in python. True False

Outliers can sometimes be critical to finding convincing answers when analyzing data. True False

T/F Data science is an iterative process.

T/F Elements in numpy arrays must be all the same type.

T/F A Root Mean Square Error (RMSE) higher than our mean value would be too high. (Assume all values are positive)

T/F Final clusters are sensitive to initial centroids.

T/F It works out better mathematically to measure the impurity of a split in a decision tree, rather than the purity.

T/F The target variable is always categorical in classification.

The bag-of-words model tracks if a word appears to identify the sentiment or overall idea. True False

We can use '1' and 'True' interchangeably. True False

When you import a dataset using the read_csv function in pandas, the rows of the dataset are Series. True False

ndarrays are mutable. True False

For the merge() function in pandas, how does the parameter "how" handle row indices when how='inner'? The function takes the union of the row indices The function takes the intersection of the row indices The function takes the complement of the row indices

The function takes the intersection of the row indices
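
As a rough illustration of the intersection behavior, the sketch below merges two small dataframes on a shared key column. The dataframes and the column names `id`, `name`, and `score` are invented for this example.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['ann', 'bo', 'cy']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [80, 90, 70]})

# how='inner' keeps only the keys present in both dataframes.
inner = pd.merge(left, right, on='id', how='inner')
print(inner)
```

Only ids 2 and 3 survive, and the key column `id` appears once in the result.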

When working with cells in Jupyter, what does "_" refer to? The output of the last cell executed. A space in the line of code. A string character.

The output of the last cell executed.

What does the parameter unit refer to in the to_datetime function? The unit of the output The unit of the input

The unit of the input (for example 's' when the values are Unix seconds); the output of to_datetime is a datetime64 value, not int64

What requirement is needed to add two numeric numpy arrays? They need to have the same or compatible dimensions. They need to be of the same type. They need to be converted to type float first.

They need to have the same or compatible dimensions.

What does a negative correlation score mean? All the values of those features in our dataset are negative There is no correlation between those features in our data set Those features in our dataset are inversely correlated

Those features in our dataset are inversely correlated

Take a look at the following lines of code: a = np.array([2, 3]) b1 = np.array([1]) b2 = 1 True or False: a+b1 and a+b2 result in the same ndarray. True False

True
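
The card above can be checked directly. In this sketch, broadcasting stretches both the one-element array and the scalar across `a`, so the two sums match.

```python
import numpy as np

a = np.array([2, 3])
b1 = np.array([1])
b2 = 1

# Both the one-element array and the scalar are broadcast across a.
sum_with_array = a + b1
sum_with_scalar = a + b2

print(np.array_equal(sum_with_array, sum_with_scalar))
```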

Look at the following code: b = np.array([1,2,3]) b[1] = 'one' What error prints out after you run these two lines of code? SyntaxError NameError KeyError ValueError

ValueError (not the same type)
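
A minimal sketch that reproduces the error: the array holds integers, so numpy tries to convert the string to an integer and fails. The `error_name` variable is only there to record what was raised.

```python
import numpy as np

b = np.array([1, 2, 3])

try:
    b[1] = 'one'  # numpy cannot convert 'one' to an integer
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__

print(error_name, b)
```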

Which of the following statistics does the describe() function NOT return on dataframe columns? Min Variance Count Mean

Variance

When would you use the machine learning technique 'regression'?

When your model has to predict a numerical value.

What is the difference between deleting a column and popping a column? You can store a deleted column. You can store a popped column. There is no difference.

You can store a popped column.
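
The difference can be seen in a small sketch: `pop` returns the removed column as a Series, while `del` removes the column and returns nothing. The dataframe here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# pop removes column 'b' and hands it back as a Series.
popped = df.pop('b')

# del removes column 'c' without returning anything.
del df['c']

print(popped.tolist(), list(df.columns))
```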

What is the result of the following lines of code? a=np.array(["cat","dog","fish"]) b=np.array(["dog","fish","rabbit"]) print(np.setdiff1d(a,b)) ['cat'] ['rabbit'] ['dog' 'fish'] ['cat' 'dog' 'fish' 'rabbit']

['cat']

data engineering portion of process

acquire, prepare

import pandas as pd ser = pd.Series([100, 200, 300, 400, 500], index = ['tom', 'bob', 'nancy', 'dan', 'eric']) Which of the following calls have the same output? ser.loc[['tom','bob']] ser[[0,1]] ser.iloc[[0,1]] ser[['tom','bob']]

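all (in older pandas versions; ser[[0,1]] relies on treating integer keys as positions on a label-indexed Series, which recent pandas versions deprecate, so prefer ser.iloc[[0,1]])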
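
The sketch below compares the label-based and position-based selections from this card. It leaves out `ser[[0,1]]`, because how integer keys are treated on a label-indexed Series depends on the pandas version.

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

by_label = ser.loc[['tom', 'bob']]    # explicit label-based selection
by_position = ser.iloc[[0, 1]]        # explicit position-based selection
by_brackets = ser[['tom', 'bob']]     # plain brackets with labels

print(by_label.equals(by_position), by_label.equals(by_brackets))
```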

What are the 3 main string operations? split contains extract delete

all but delete

computational data science portion of process

analyze, report, act

What is the next step in building a classification model after the model is constructed and parameters are adjusted?

apply model to new data

How would you change the number 5 to 7 in this matrix? arr = np.array([1,2,3,4,5]) arr[0,5] = 7 arr[4] = 7 arr[5] = 7 arr[0,4] = 7

arr[4] = 7

What is the correct way to access elements of an array "arr" that are less than 0? arr[<0] arr[arr<0] arr[arr[,]<0]

arr[arr<0]
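
Boolean indexing works in two steps, which this sketch separates: the comparison builds a boolean mask, and the mask selects the matching elements. The sample values are invented.

```python
import numpy as np

arr = np.array([3, -1, 0, -7, 5])

# Step 1: the comparison yields one boolean per element.
mask = arr < 0

# Step 2: indexing with the mask keeps only the True positions.
negatives = arr[mask]

print(mask, negatives)
```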

What is the result of the following line of code? import numpy as np np.unique([1,1,3,4,2,3,3]) array([1,2,3,4]) array([1,3,4,2]) array([1,1,3,4,2,3,3]) array([4,3,2,1])

array([1,2,3,4])
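
Both set-style results in the cards above can be reproduced in a few lines. Note that `np.unique` returns its values sorted, which is why the answer is not in order of first appearance.

```python
import numpy as np

# Unique values come back sorted.
unique_values = np.unique([1, 1, 3, 4, 2, 3, 3])

# setdiff1d keeps the values of the first array that are not in the second.
a = np.array(["cat", "dog", "fish"])
b = np.array(["dog", "fish", "rabbit"])
only_in_a = np.setdiff1d(a, b)

print(unique_values, only_in_a)
```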

Which analysis technique has the goal of finding a set of rules to capture associations between items or events? classification regression clustering association analysis graph analytics

association analysis

market basket analysis

association analysis used to predict customer purchasing patterns

Which command allows you to edit the view of the axes on a matplotlib plot? grid() plot() axis() arange()

axis()

Is age group a numeric or a categorical variable?

categorical

transforming a dataset (data munging): SCALING

changing the range of values to be between a specified range so one feature does not dominate the results. EX: height and weight
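
One common way to scale is min-max scaling, sketched below in plain Python. The card does not say which scaling method the course uses, so treat this as one possible approach; the function name `min_max_scale` and the height and weight values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical; map them all to new_min.
        return [new_min for _ in values]
    span = highest - lowest
    return [new_min + (v - lowest) * (new_max - new_min) / span for v in values]

heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

scaled_heights = min_max_scale(heights_cm)
scaled_weights = min_max_scale(weights_kg)
print(scaled_heights, scaled_weights)
```

After scaling, both features lie between 0 and 1, so neither dominates a distance-based analysis just because of its units.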

Analyze data step

classification (predict category) regression (predict numeric value) clustering (organize items into groups, create categories) association (find rules to capture association between items)

Two goals of pre-processing data

clean and transform dataset

csv

comma separated values

When thinking about data visualization, what element is critical in making you value the visualization? Context Raw values Plots and graphs

context

Which function call will allow you to group a dataset (titled dat) by 'time'? dat.groupby('time') groupby(dat['time']) dat.aggregate(['time']) dat['time'].group()

dat.groupby('time')

why is data cleaning necessary?

data is received downstream with little to no control on what is included, data cleaning is to ensure we are looking at "good" data and address data quality issues

what data step is where the most time is spent

data preparation (aPara)

What is the general syntax for calling the mode() function on a dataframe? data_frame[mode] mode(data_frame) data_frame.mode()

data_frame.mode()

What does calling the function dropna() return from a dataframe? dataframe with the rows containing the NA values dataframe omitting NA values dataframe with the columns containing the NA values dataframe with changed NA values by replacing NA values with 0

dataframe omitting NA values

What does the parameter expand represent in the split function? determines if the output is a dataframe creates every split string into a new column pipes the split strings into the output

determines if the output is a dataframe; normally the split function produces a Series of lists of strings, and with expand=True each piece is placed in its own column of a new dataframe
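
A small sketch of the two output shapes of `str.split`, using an invented Series of underscore-separated strings.

```python
import pandas as pd

names = pd.Series(['a_b', 'c_d'])

# Without expand, each element becomes a list of pieces.
as_lists = names.str.split('_')

# With expand=True, each piece gets its own column in a DataFrame.
as_columns = names.str.split('_', expand=True)

print(as_lists.tolist())
print(as_columns)
```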

Which of the following will display the first 7 rows in a data frame object named df? df.head(n=7) df.head(7) df.head([7]) df.head(x=7) df[:7]

df.head(n=7) df.head(7) df[:7]

function for filling missing values with interpolation

df.interpolate(), fills missing values by linear interpolation between neighboring values (the default method)
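
As a quick illustration, the sketch below fills two missing readings with `interpolate()` under its default settings. The column name `reading` and the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'reading': [1.0, np.nan, np.nan, 4.0]})

# The default method fills gaps on a straight line between known values.
filled = df.interpolate()
print(filled['reading'].tolist())
```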

histogram

df.plot.hist() shows distribution of data and skewness

function to check if a string has a certain character

df['column'].str.contains('hello')

function to return the first match of a regular expression found in a column

df['column'].str.extract()

function to replace a string with another

df['column'].str.replace()

function to separate column by a delimiter

df['column'].str.split('_')

What is the function call to find cells in a dataframe df with timestamp on 2007-02-04, given the dataframe has a parsed time column labelled 'parsed_time'? df['parsed_time'] == '2007-02-04' df['parsed_time'] = '2007-02-04' '2007-02-04' >= df['parsed_time'] >= '2007-02-04' '2007-02-05' > df['parsed_time'] > '2007-02-03'

df['parsed_time'] == '2007-02-04'

function for describing a data frame

df_name.describe()

Which function allows you to clean a data set by dropping NA values? na() dropna() drop(na) na(FALSE)

df_name.dropna()

function to remove missing values

df_name.dropna(axis=0); axis=0 is the default and removes rows with missing values, axis=1 removes columns with missing values

function for backfilling missing values

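df_name.fillna(method = 'backfill') (or df_name.bfill() in recent pandas versions, where the method argument is deprecated)

function for forward filling missing values

df_name.fillna(method = 'ffill') (or df_name.ffill() in recent pandas versions)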
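
The missing-value functions above can be compared side by side on a tiny dataframe. This sketch uses the `ffill()` and `bfill()` methods, which perform the same forward and backward fills as `fillna` with a `method` argument; the data is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, 6.0]})

rows_dropped = df.dropna()        # axis=0 (default): drop rows with missing values
cols_dropped = df.dropna(axis=1)  # axis=1: drop columns with missing values
forward = df.ffill()              # copy the previous valid value forward
backward = df.bfill()             # copy the next valid value backward

print(rows_dropped.shape, list(cols_dropped.columns))
print(forward['x'].tolist(), backward['x'].tolist())
```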

function that uses a regular index value (integer) to index

df_name.iloc[1]

function that indexes using a specific object

df_name.loc['apple'] df_name.loc[['apple', 'orange']]

function to sort data in ascending or descending order

df_name.sort_values(by = 'column name', ascending = True); use ascending = False for descending order
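
A short sketch of sorting in both directions. Note that `ascending` takes a boolean, not a string; the dataframe and its `fruit` and `price` columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['pear', 'apple', 'kiwi'], 'price': [3, 1, 2]})

cheapest_first = df.sort_values(by='price', ascending=True)
priciest_first = df.sort_values(by='price', ascending=False)

print(cheapest_first['fruit'].tolist(), priciest_first['fruit'].tolist())
```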

function for determining if all elements are true

dfname.all()

function for determining if any element is true

dfname.any()

function for measuring the dependencies between variables

dfname.corr()

When is it acceptable to avoid axis labels in plots using matplotlib?

during data exploration

broadcasting

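how numpy handles arithmetic between arrays of different shapes: the smaller array is virtually stretched to match the larger one, provided their dimensions are compatible (equal, or one of them is 1)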

Suppose you are looking at a data set consisting of how much students liked a particular class. The rows are students. There are two columns: one asks if the student has taken the class and the other asks to rate the class on a scale of 1-10. However, you notice that some students have not taken the class and therefore do not want to include their ratings. Which data filtering technique would you use to clean this dataset? Slice out columns Filter out rows Transform data

filter rows

Which library can you use to easily create geographic overlays? Folium NumPy Matplotlib IPython

folium

What graph is best to show distribution of data?

histogram and box plot

Which algorithm to build classification models relies on the notion that samples with similar characteristics likely belong to the same class?

kNN

big data

large troves of data we mine for insights

In a decision tree, which nodes do NOT have test conditions?

leaf node

what is the downside to data transformation

less detailed data

What graph is best to show how values in your data change over time?

line graph

Which of the following does a boxplot NOT show you? Median Interquartile range Min and max values Mean

mean

Which is NOT a quality of good data visualization, according to Andy Kirk? Trustworthy Accessible Elegant Meaningful

meaningful

cleaning a dataset

merge duplicate values, remove outliers, estimate invalid values, remove missing values

There is a syntax error in the code below. np.array( [11,12,13],[21,22,23] ) How would you fix it to create the intended 2x3 array? np.array( [[11,12,13],[21,22,23]] ) np.array( [11,12,13,21,22,23] ) np.array( [11,12,13],[21,22,23] ) np.array[ [11,12,13,21,22,23] ]

np.array( [[11,12,13],[21,22,23]] )

How do you create a Rank 1 array with numpy using the numbers 1, 2, 3? np.ndarray([1,2,3]) np.array([1,2,3]) np.array(1,2,3) np.ndarray(1,2,3) np.ndarray([[1,2,3],[3,2,1]) np.array([[1,2,3],[3,2,1])

np.array([1,2,3])

function to stack dataframes and create a new dataframe out of them

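pandas.concat(); columns that do not exist in every dataframe are filled with NaN for the rows that lack them. When stacking rows (the default, axis=0), matching columns are aligned into one column; when concatenating side by side (axis=1), matching column names appear twice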
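
The 'cat', 'dog', 'bunny' example from the true/false card earlier in this set can be run as a sketch. By default `concat` keeps the union of the columns; keeping only the shared columns requires `join='inner'`. The variable names and cell values are invented.

```python
import pandas as pd

pets_a = pd.DataFrame({'cat': [1], 'dog': [2]})
pets_b = pd.DataFrame({'dog': [3], 'bunny': [4]})

# Default (outer) behavior: all columns are kept, gaps become NaN.
stacked = pd.concat([pets_a, pets_b], ignore_index=True)

# join='inner' keeps only the columns both dataframes have.
shared_only = pd.concat([pets_a, pets_b], join='inner', ignore_index=True)

print(stacked)
print(shared_only)
```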

function to merge dataframes

pd.merge(df1, df2, how = 'inner'); keeps only the rows whose key values appear in both dataframes, and the shared key column appears only once

function to convert time to a timestamp format

pd.to_datetime(df_name['column'])
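
A minimal sketch of converting a column of Unix seconds with `pd.to_datetime`. The `unit` argument describes the input values; the column names `timestamp` and `parsed_time` are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': [1138537770, 1136073600]})

# unit='s' says the integers are seconds since the Unix epoch.
df['parsed_time'] = pd.to_datetime(df['timestamp'], unit='s')

print(df['parsed_time'].dtype)
print(df['parsed_time'].tolist())
```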

For which of the following scenarios would you use the analysis technique of classification? predicting the weather predict the price of a stock simulating sales of a new product predicting the score on a test

predicting the weather

transforming a dataset (data munging): Dimensionality Reduction

reduces the number of dimensions (features) in a dataset by keeping a smaller set of dimensions that still captures most of the variability of the dataset

transforming a dataset (data munging): Aggregating data

reduces variability in a dataset by smoothing the data EX: aggregating data by fiscal period rather than by the actual goods issue date

transforming a dataset (data munging): feature selection

removing irrelevant features, combining features, adding features (EX: ODO drop time - pulling out the hour)

examples of data cleaning

replacing values, estimating values, dropping fields

What is the correct word to describe an instance of an entity in your data?

sample

What are ways to manipulate the cleaned data into the format needed for analysis (select 5)? Scaling Transformation Feature selection Dimensionality reduction Data manipulation Denominate data

scaling, feature selection, dimensionality reduction, transformation, data manipulation

Which graphing method should you use to visualize the correlation between two arrays? Histogram Barplot Scatter plot Line plot

scatter

What graph is best to show correlation between two values?

scatter plot

What are the 2 main data structures in pandas?

series, data frame

Which is NOT mentioned in the course as a common similarity measure in cluster analysis? Euclidean distance Manhattan distance Cosine similarity Sine similarity

sine

In general, are classification and regression often supervised or unsupervised approaches?

supervised

A table with a primary key logically implies that the table cannot have a duplicate record. True False

Assert statements help locate bugs. True False

In assigning true or false values in a bag-of-words model, a missing word is equivalent to assigning a false value. True False

SQL for structured relational data can provide more operations than pandas data frames. True False

T/F When you search an incorrectly spelled term online, suggested words is an example of machine learning.
