Pandas
first second reviews.iloc[:,1]
Both loc and iloc are row-fir_____________, column-seco_____________. This is the opposite of what we do in native Python, which is column-first, row-second. This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following: Write the code to get this output! Use iloc to select every row (a bare colon) from the column at position 1, which is the second column by position. The row selector goes before the comma and the column selector after it: [rows, column].
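A minimal sketch of the row-first, column-second order. The small reviews table here is a made-up stand-in for the wine dataset, so its columns and values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews data (not the real dataset).
reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [None, 15.0, 14.0],
})

# Row selector first, column selector second:
# ":" keeps every row, 1 picks the column at position 1 ("points" here).
second_column = reviews.iloc[:, 1]
print(second_column)
```

The result is a Series named after the selected column, with the original row index preserved.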
negative reviews.iloc[-5:]
Finally, it's worth knowing that neg_________________ numbers can be used in selection. This will start counting forwards from the end of the values. So for example here are the last five elements of the dataset. Use iloc with a negative value to get the last 5 rows of all columns, without the print function. Hint: put -5 followed by a colon inside the brackets.
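As a small illustration of negative positions, the sketch below uses a made-up ten-row table (an assumption, not the wine data) so the selected rows are easy to check.

```python
import pandas as pd

# Hypothetical ten-row table; the values equal the row positions.
numbers = pd.DataFrame({"value": range(10)})

# -5: starts five rows before the end and runs forward to the last row.
last_five = numbers.iloc[-5:]
print(last_five)
```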
top_oceania_wines = reviews.loc[ (reviews.country.isin(['Australia', 'New Zealand'])) & (reviews.points >= 95) ]
Create a DataFrame variable top_oceania_wines containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand. Use reviews with loc. Inside the brackets, wrap each condition in its own parentheses: the first calls isin on reviews.country with a list containing 'Australia' and 'New Zealand'; the second checks that reviews.points is greater than or equal to 95. Join the two conditions with &.
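The same pattern can be tried on a tiny made-up table (the four rows below are assumptions chosen so that exactly two of them qualify).

```python
import pandas as pd

# Hypothetical sample; only rows 0 and 3 are from Oceania AND score >= 95.
reviews = pd.DataFrame({
    "country": ["Australia", "New Zealand", "Italy", "Australia"],
    "points": [96, 94, 97, 95],
})

# Each condition is parenthesized because & binds more tightly than >=.
top_oceania_wines = reviews.loc[
    (reviews.country.isin(["Australia", "New Zealand"]))
    & (reviews.points >= 95)
]
print(top_oceania_wines)
```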
indices = [1, 2, 3, 5, 8]; sample_wine = wine.loc[indices]
Create a list with the values 1, 2, 3, 5, and 8 and assign it to a variable called indices. Then create a new variable, sample_wine, by passing indices to wine.loc.
fruits = pd.DataFrame([[30, 21]], columns=['Apples', 'Bananas'])
Get this result!
quantities = ['4 cups', '1 cup', '2 large', '1 can']; items = ['Flour', 'Milk', 'Eggs', 'Spam']; recipe = pd.Series(quantities, index=items, name='Dinner')
Get this result!
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
Get this result! Start with pd.Series([value, value, value], index=['value', 'value', 'value'], name='value'). HINT: each index label is a string made of the year and the word Sales.
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
Get this result! Use pd.DataFrame({'Column 1': ['value 1', 'value 2'], 'Column 2': ['value 1', 'value 2']}, index=['row 1', 'row 2'])
list reviews.iloc[[0,1,2],1]
It's also possible to pass a lis_________: What is the code? Use a list of row positions (0, 1, and 2, the rows for Italy, Portugal, and the US) together with column position 1.
index immutable
Manipulating the index: Label-based selection derives its power from the labels in the ind____________. Critically, the index we use is not immuta____________. We can manipulate the index in any way we see fit. The set_index() method can be used to do the job. Here is what happens when we set_index to the title field: reviews.set_index("title") Now try setting the index to 'country' and see what happens!
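A short sketch of set_index on a made-up two-row table (the titles and countries are assumptions). It also shows that set_index returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical two-row sample with a title column.
reviews = pd.DataFrame({
    "title": ["Wine A 2013", "Wine B 2011"],
    "country": ["Italy", "Portugal"],
})

# set_index returns a new DataFrame; reviews itself keeps its 0..n-1 index.
by_title = reviews.set_index("title")

# The title values are now row labels, so loc can look rows up by title.
print(by_title.loc["Wine B 2011", "country"])
```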
desc = wine.description (or: desc = wine["description"]); first_row = wine.iloc[0]; first_description = wine.description.iloc[0]
Select the description column from wine and assign the result to the variable desc (there are two ways). ANSWER Select the first row of data from wine using iloc, assigning it to the variable first_row. ANSWER Select the first value of the description column of wine using iloc, assigning it to the variable first_description. Hint: chain the column access and iloc. ANSWER
first_descriptions = wine.description.iloc[:10]
Select the first 10 rows (using a colon slice) from the description column in wine using iloc, assigning the result to the variable first_descriptions. Hint: follow the variable.column.iloc[...] pattern.
Series together
The Seri_____________ and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Ser_____________ "glued toge_____________".
index
The pd.read_csv() function is well-endowed, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in ind_________________, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an ind__________________col.
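To see the effect of index_col without needing a file on disk, the sketch below reads a made-up CSV from an in-memory string (the CSV text is an assumption that only mimics a file whose first column is a saved index).

```python
import io
import pandas as pd

# Hypothetical CSV text whose first, unnamed column is a saved index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Without index_col, the saved index comes back as an extra data column.
default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))
print(list(indexed.columns))
```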
empty reviews.loc[reviews.price.notnull()]
The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) em________________ (NaN). For example, to filter out wines lacking a price tag in the dataset, here's what we would do: use the variable with loc, and inside the brackets call notnull on the variable's price column. Try using isnull as well.
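A minimal sketch of both selectors on a made-up table in which one price is missing (the rows are assumptions).

```python
import pandas as pd

# Hypothetical sample; the first wine has no price (stored as NaN).
reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "price": [None, 15.0, 14.0],
})

# notnull() is True where a value is present; isnull() is its complement.
priced = reviews.loc[reviews.price.notnull()]
unpriced = reviews.loc[reviews.price.isnull()]

print(priced)
print(unpriced)
```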
label based data index position reviews.loc[0,'country']
The second paradigm for attribute selection is the one followed by the loc operator: lab____________-bas______________ selection. In this paradigm, it's the dat____________ ind____________ value, not its position, which matters. Get the following using loc. Inside the brackets, the first value is the row's index label and the second is the column name. The format is: [row, col]
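The difference between labels and positions is easiest to see when the index is not 0, 1, 2. The sketch below uses a made-up table with index labels 10, 20, 30 (an assumption for illustration).

```python
import pandas as pd

# Hypothetical table whose index labels differ from the row positions.
df = pd.DataFrame({"country": ["Italy", "Portugal", "US"]},
                  index=[10, 20, 30])

by_label = df.loc[10, "country"]   # row labeled 10, column named "country"
by_position = df.iloc[0, 0]        # first row, first column, by position

print(by_label, by_position)
# df.loc[0, "country"] would raise KeyError: there is no row labeled 0.
```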
reserved reviews['country'][0]
These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reser_________________ characters in them. Get this result! Hint: add a second index after ['country'].
reviews.taster_name.describe()
This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get: USE DESCRIBE CHECK PICTURE
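A small sketch of the type-aware behavior, using a made-up four-row table (an assumption; the taster names are borrowed from the dataset only as sample strings).

```python
import pandas as pd

# Hypothetical sample with one numeric and one string column.
reviews = pd.DataFrame({
    "points": [87, 87, 90, 92],
    "taster_name": ["Roger Voss", "Kerin O'Keefe", "Roger Voss", None],
})

numeric_summary = reviews.points.describe()    # count, mean, std, quartiles
text_summary = reviews.taster_name.describe()  # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

Missing values are left out of count in both summaries.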
cols = ['country', 'province', 'region_1', 'region_2']; indices = [0, 1, 10, 100]; df = reviews.loc[indices, cols]
Alternative to Dataframes. Create a variable df containing the country, province, region_1, and region_2 columns of the records with the index labels 0, 1, 10, and 100. First create a list called cols containing the column names country, province, region_1, and region_2. Then create a list called indices with the values 0, 1, 10, and 100. Finally, create a variable called df by passing indices as the rows and cols as the columns to reviews.loc.
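wine['country'][0] = 'CAMBODIA'; wine.country
Assigning data: Change the country name for index 0 to Cambodia without using loc. Note that this is chained assignment: pandas may emit a SettingWithCopyWarning, and with copy-on-write enabled the DataFrame is left unchanged, so wine.loc[0, 'country'] = 'CAMBODIA' is the more reliable form.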
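For comparison, here is a sketch of the same assignment written as a single loc call, which avoids chained indexing. The two-row wine table is a made-up stand-in.

```python
import pandas as pd

# Hypothetical two-row stand-in for the wine data.
wine = pd.DataFrame({"country": ["Italy", "Portugal"]})

# One indexing step: row label 0, column "country".
wine.loc[0, "country"] = "CAMBODIA"

print(wine.country)
```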
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2']); animals.to_csv("cows_and_goats.csv")
Create a DataFrame called `animals`, then write code to save this DataFrame to disk as a csv file with the name cows_and_goats.csv.
fruit_sales = pd.DataFrame([[35, 21], [41, 34]], columns=['Apples', 'Bananas'], index=['2017 Sales', '2018 Sales'])
Get this result!
table entries value sequence list
A DataFrame is a tab____________. It contains an array of individual entri_____________, each of which has a certain val_____________. Each entry corresponds to a row (or record) and a column. A Series, by contrast, is a sequ_____________ of data values. If a DataFrame is a table, a Series is a lis_____________
column values index parameter name
A Series is, in essence, a single colu_____________ of a DataFrame. So you can assign colu_____________ valu_____________ to the Series the same way as before, using an inde_____________ param_____________. However, a Series does not have a column na_____________, it only has one overall na_____________:
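One way to see the link between a Series name and a DataFrame column is to convert between the two. The sketch below reuses the sales figures from the earlier card; the variable names are assumptions.

```python
import pandas as pd

sales = pd.Series([30, 35, 40],
                  index=["2015 Sales", "2016 Sales", "2017 Sales"],
                  name="Product A")

# Turning the Series into a one-column DataFrame: its name becomes the
# column name.
frame = sales.to_frame()

# Selecting that column gives back a Series with the same name and values.
column = frame["Product A"]

print(frame)
```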
130,000 14
We can use the shape attribute to check how large the resulting DataFrame is: wine_reviews.shape (129971, 14) So our new DataFrame has _________________ records/rows split across ___________________ different columns. That's almost 2 million entries!
everything range values reviews['country'][0:3]
On its own, the : operator, which also comes from native Python, means "everyt_____________". When combined with other selectors, however, it can be used to indicate a ran_____________ of valu_____________. For example, to select the country column from just the first, second, and third row, we would do: Without using iloc, get the country values at index 0, 1, and 2. Use two sets of brackets, or use only one.
list values reviews.loc[reviews.country.isin(['Italy', 'France'])]
Pandas comes with a few built-in conditional selectors, two of which we will highlight here. The first is isin. isin lets you select data whose value "is in" a li_________ of valu_________. For example, here's how we can use it to select wines only from Italy or France: Use the variable with loc, and inside the brackets call isin on the variable's country column, passing the two countries as a list inside the parentheses of isin.
reviews.iloc[1:3,1]
What is the code for just the second and third entries? Use iloc with 2 arguments and the first argument is sliced. The first argument is to select the second and third row and the second argument is to choose the column
simpler indices list of lists wine.loc[:,['country','description']]
iloc is conceptually simp____________ than loc because it ignores the dataset's indi____________. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indi____________ to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. Get the following output! HINT: Use loc with all rows and only the two columns shown. Remember that the two column names go in a list, so there is a list inside the loc brackets.
conditions reviews.country == 'Italy'; reviews.loc[reviews.country == 'Italy']; reviews.loc[reviews.country == 'Germany']
Condit____________ selection: So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on condi____________. For example, suppose that we're interested specifically in better-than-average wines produced in Italy. We can start by checking if each wine is Italian or not: ANSWER This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside of loc to select the relevant data. Use reviews with loc, and inside the brackets compare the country column of reviews to 'Italy' with ==. ANSWER Now try getting reviews for only Germany! ANSWER
cols = ['country', 'variety']; df = reviews.loc[:99, cols]; cols_idx = [0, 11]; df = reviews.iloc[:100, cols_idx]; italian_wines = reviews[reviews.country == 'Italy']
Create a variable df containing the country and variety columns of the first 100 records. Hint: you may use loc or iloc. When working on this question and several of the ones that follow, keep in mind the "gotcha" described in the tutorial: loc slices include the end label, while iloc slices exclude the end position. LOC ANSWER: Create a list cols containing country and variety, then create df by passing reviews.loc the row slice and cols (does the slice end at 99 or 100?). ILOC ANSWER: Create a list cols_idx containing the column positions 0 and 11, then create df by passing reviews.iloc the row slice and cols_idx. NEXT: Create a DataFrame italian_wines containing reviews of wines made in Italy. Hint: reviews.country equals what? Create a new variable by indexing reviews with the condition that reviews.country equals 'Italy'.
indexing reviews['country']
If we have a Python dictionary, we can access its values using the index______________ ([]) operator. We can do the same with columns in a DataFrame. Get this result! Load the wine dataset as reviews (or wine_reviews) and select the country column. You can use either attribute access or the indexing operator!
reviews.points.mean()
If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen. For example, to see the mean of the points allotted (e.g. how well an averagely rated wine does), we can use the mean() function: USE IT FOR AVERAGE POINTS! 88.44713820775404
index based numerical iloc reviews.iloc[0]
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. ilo_____________ follows this paradigm. Write the code to get this output!
reviews.points.describe()
Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the describe() method: USE DESCRIBE Do one for country! CHECK PICTURE
shape = wine.loc[wine.country == 'Italy'].shape[0]; reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]; reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
This DataFrame has ~20,000 rows. The original had ~130,000. That means that around ____% of wines originate from Italy. Write one line of code to verify the original row count and another to verify the Italy row count. Use the variable with loc, put the condition that the country column equals 'Italy' inside the brackets, and read shape off the result. ANSWER Then find the percentage by dividing the Italy count by the original count! ANSWER We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points. We can use the ampersand (&) with loc to bring country and points together. Each condition gets its own parentheses, and the two are joined by & inside a single pair of brackets. ANSWER Suppose we'll buy any wine that's made in Italy or which is rated above average. For this we use a pipe (|) with loc: ANSWER
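The counting and combining steps can be sketched on a made-up five-row table (an assumption; the real dataset gives different numbers).

```python
import pandas as pd

# Hypothetical five-row sample: three Italian wines, three scoring >= 90.
reviews = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "France", "Italy"],
    "points": [91, 85, 93, 88, 90],
})

total_rows = reviews.shape[0]
italy_rows = reviews.loc[reviews.country == "Italy"].shape[0]
italy_share = italy_rows / total_rows  # fraction of rows from Italy

# & keeps rows meeting both conditions; | keeps rows meeting at least one.
both = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]
either = reviews.loc[(reviews.country == "Italy") | (reviews.points >= 90)]

print(italy_share, len(both), len(either))
```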
reviews.taster_name.value_counts()
To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method: Use for taster_name! CHECK PICTURE
reviews.taster_name.unique()
To see a list of unique values we can use the unique() function: Use it for taster_name! array(['Kerin O'Keefe', 'Roger Voss', 'Paul Gregutt', 'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima', 'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan', 'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW', 'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen', 'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams', 'Christina Pickard'], dtype=object)
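The array above includes nan, which points at one difference between the two methods: unique() keeps missing values, while value_counts() drops them unless asked not to. A minimal sketch on a made-up Series (the values are assumptions):

```python
import pandas as pd

# Hypothetical sample with one repeated name and one missing value.
tasters = pd.Series(["Roger Voss", "Kerin O'Keefe", "Roger Voss", None])

distinct = tasters.unique()                       # includes the missing value
counts = tasters.value_counts()                   # missing values dropped
counts_with_missing = tasters.value_counts(dropna=False)

print(distinct)
print(counts)
```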
first included last excluded inclusive
iloc uses the Python stdlib indexing scheme, where the fir____________ element of the range is inclu____________ and the las____________ one exclu____________. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclu_____________. So 0:10 will select entries 0,...,10. Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet). This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
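The counts quoted above can be checked directly. The sketch below builds a made-up table with index 0,...,1000 and a small fruit table with string labels (both are assumptions for illustration).

```python
import pandas as pd

# Hypothetical table with the default integer index 0..1000 (1001 rows).
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])     # end position excluded
n_loc = len(df.loc[0:1000])       # end label included
n_loc_fixed = len(df.loc[0:999])  # one lower to match iloc

# Hypothetical string-labeled table: the end label is part of the result.
fruit = pd.DataFrame({"stock": [3, 5, 2]},
                     index=["Apples", "Bananas", "Potatoes"])
selected = fruit.loc["Apples":"Potatoes"]

print(n_iloc, n_loc, n_loc_fixed, list(selected.index))
```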