Pandas


What are the key features of the pandas library?

- Data alignment
- Memory efficiency
- Reshaping
- Merge and join
- Time series

What is a pandas DataFrame?

A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns.

Explain when the categorical data type is useful.

The categorical data type is useful in the following cases:
- A string variable consists of only a few different values. Converting such a string variable to a categorical variable saves memory.
- The lexical order of a variable is not the same as its logical order ("one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
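The first two cases can be sketched as follows. The data and the variable names are made up for illustration, and the exact memory numbers depend on the pandas version, so only the relative sizes are compared.

```python
import pandas as pd

# Hypothetical example data: a string column with only three distinct values.
sizes = pd.Series(["two", "one", "three", "one", "two"] * 1000)

# Plain strings sort lexically: "one" < "three" < "two".
lexical_order = sorted(sizes.unique())

# An ordered categorical sorts and compares by the declared category order.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
sizes_cat = sizes.astype(order)
logical_order = list(sizes_cat.sort_values().unique())
largest = sizes_cat.max()

# Compare memory footprints (deep=True also counts the string payloads).
bytes_as_str = sizes.memory_usage(deep=True)
bytes_as_cat = sizes_cat.memory_usage(deep=True)
```

Here the categorical stores one small integer code per row plus a single copy of each category, which is why it tends to be smaller when there are few distinct values.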

What is a Time Series in pandas?

A time series is an ordered sequence of data which represents how some quantity changes over time. Pandas supports:
- Parsing time series information from various sources and formats
- Generating sequences of fixed-frequency dates and time spans
- Manipulating and converting date times with timezone information
- Resampling or converting a time series to a particular frequency
- Performing date and time arithmetic with absolute or relative time increments
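A minimal sketch of several of the capabilities listed above, using made-up dates and values. It assumes time zone data is available to pandas, which is the case in a typical installation.

```python
import pandas as pd

# Parse date strings into timestamps.
stamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])

# Generate a fixed-frequency sequence: six daily timestamps.
days = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=days)

# Resample to a lower frequency: sum each two-day window.
two_day = ts.resample("2D").sum()

# Attach a time zone, then convert to another one.
utc = ts.tz_localize("UTC")
paris = utc.tz_convert("Europe/Paris")

# Date arithmetic with an absolute increment.
next_week = days[0] + pd.Timedelta(days=7)
```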

What is categorical data in pandas?

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Examples are gender, social class, blood type, country affiliation, observation time, or rating. All values of categorical data are either in the categories or np.nan.
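As an illustration of the last point, the sketch below builds a categorical from hypothetical blood-type data. The exception type raised for an undeclared value has differed between pandas versions, so both likely types are caught.

```python
import numpy as np
import pandas as pd

# Hypothetical blood-type data with one missing observation.
blood = pd.Categorical(["A", "O", np.nan, "AB"], categories=["A", "B", "AB", "O"])

# Each value is stored as an integer code into the categories; -1 means missing.
codes = list(blood.codes)
missing_mask = pd.isna(blood)

# Assigning a value that is not a declared category is rejected.
try:
    blood[0] = "XX"
    rejected = False
except (TypeError, ValueError):
    rejected = True
```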

Explain reindexing in pandas.

Reindexing means conforming a DataFrame to a new index with optional filling logic, placing NA/NaN in locations that had no value in the previous index. It can change both the row labels and the column labels of a DataFrame.
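A small sketch of reindexing on a made-up DataFrame, showing the default NaN placement, the optional fill value, and reindexing of columns:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

# Conform to a new row index: "d" did not exist before, so it gets NaN.
re_nan = df.reindex(["a", "c", "d"])

# Optional filling logic: use fill_value instead of NaN.
re_filled = df.reindex(["a", "c", "d"], fill_value=0)

# Columns can be reindexed as well; the new column "y" is all NaN.
re_cols = df.reindex(columns=["x", "y"])
```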

What is a Series in pandas?

A Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index. The basic way to create a Series is to call s = pd.Series(data, index=index), where data can be a Python dict, an ndarray, or a scalar value.
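The three kinds of input mentioned above can be sketched like this (the values and labels are arbitrary):

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From an ndarray: an explicit index must match the array length.
from_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(5, index=["p", "q", "r"])
```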

What are the different types of data structures in pandas?

The pandas library supports two main data structures, Series and DataFrame, both built on top of NumPy. A Series is a one-dimensional data structure and a DataFrame is a two-dimensional data structure. Older versions of pandas also provided Panel, a three-dimensional data structure with the axes items, major_axis, and minor_axis; it was deprecated and has since been removed from the library.

What are the different ways a DataFrame can be created?

Using a list of lists:
    import pandas as pd
    # Initialize a list of lists.
    data = [['p', 1], ['q', 2], ['r', 3]]
    # Create the pandas DataFrame.
    df = pd.DataFrame(data, columns=['Letter', 'Number'])
    # Display the DataFrame.
    df
Using a dict of ndarrays/lists: all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.
    # Initialize a dict of lists.
    data = {'Name': ['Tom', 'Jack', 'Nick', 'Julie'], 'marks': [99, 98, 95, 90]}
    # Create the pandas DataFrame with an explicit index.
    df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

How do you copy a Series?

With the copy() method: s2 = s1.copy()
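A short sketch showing that the copy is independent of the original. By default copy() makes a deep copy of the data and the index; the values here are arbitrary.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])

# copy() defaults to deep=True, so s2 owns its own data.
s2 = s1.copy()

# Changing the copy leaves the original untouched.
s2["a"] = 100
```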

What is Python pandas?

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is a high-level data manipulation tool developed by Wes McKinney and built on the NumPy package. It provides fast and flexible data structures that make it easy to work with relational or labeled data.

List the commands useful for creating test objects

pd.DataFrame(np.random.rand(20,5)) | 5 columns and 20 rows of random floats
pd.Series(my_list) | Create a series from an iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) | Add a date index
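Put together as a runnable sketch, with a made-up my_list standing in for any iterable:

```python
import numpy as np
import pandas as pd

# 20 rows and 5 columns of random floats in [0, 1).
df = pd.DataFrame(np.random.rand(20, 5))

# A Series from an iterable (my_list is a placeholder list).
my_list = [1, 2, 3]
s = pd.Series(my_list)

# Replace the default integer index with one daily date per row.
df.index = pd.date_range("1900/1/30", periods=df.shape[0])
```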

List the commands to import data from different sources and formats.

pd.read_csv(filename) | From a CSV file
pd.read_table(filename) | From a delimited text file
pd.read_excel(filename) | From an Excel file
pd.read_sql(query, connection_object) | Read from a SQL table/database
pd.read_json(json_string) | Read from a JSON-formatted string, URL, or file
pd.read_html(url) | Parse an HTML URL, string, or file and extract its tables to a list of DataFrames
pd.read_clipboard() | Take the contents of your clipboard and pass them to read_csv()
pd.DataFrame(dict) | From a dict, with keys for column names and values for data as lists
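A sketch of three of these readers that runs without any files on disk. The text is wrapped in io.StringIO so it behaves like a file; recent pandas versions expect this wrapping for literal JSON strings. The sample data is invented.

```python
import io
import pandas as pd

# CSV text standing in for a file.
csv_text = "name,score\nTom,99\nJack,98\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON text standing in for a file.
json_text = '[{"name": "Tom", "score": 99}, {"name": "Jack", "score": 98}]'
df_json = pd.read_json(io.StringIO(json_text))

# The same table built directly from a dict of lists.
df_dict = pd.DataFrame({"name": ["Tom", "Jack"], "score": [99, 98]})
```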

List the commands to perform common data cleaning tasks (2/2)

s.astype(float) | Convert the datatype of the series to float
s.replace(1,'one') | Replace all values equal to 1 with 'one'
s.replace([1,3],['one','three']) | Replace all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) | Mass renaming of columns (here, adding 1 to numeric column labels)
df.rename(columns={'old_name': 'new_name'}) | Selective renaming
df.set_index('column_one') | Change the index
df.rename(index=lambda x: x + 1) | Mass renaming of index
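The same commands applied to small made-up objects. Because the column labels here are strings, the mass column rename uses an upper-casing function instead of x + 1; that substitution is an illustrative choice, not part of the list above.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
as_float = s.astype(float)
replaced = s.replace([1, 3], ["one", "three"])

df = pd.DataFrame({"old_name": [10, 20], "column_one": ["a", "b"]})

# Selective renaming of one column.
renamed = df.rename(columns={"old_name": "new_name"})

# Mass renaming: the function is applied to every column label.
upper = df.rename(columns=lambda name: name.upper())

# Use an existing column as the index.
indexed = df.set_index("column_one")

# Mass renaming of the (integer) index labels.
shifted = df.rename(index=lambda i: i + 1)
```

Each call returns a new object and leaves s and df unchanged.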

List commands to perform common data cleaning tasks (1/2)

df.columns = ['a','b','c'] | Rename columns
pd.isnull(obj) | Check for null values, returns a Boolean array
pd.notnull(obj) | Opposite of pd.isnull()
df.dropna() | Drop all rows that contain null values
df.dropna(axis=1) | Drop all columns that contain null values
df.dropna(axis=1,thresh=n) | Drop all columns that have fewer than n non-null values
df.fillna(x) | Replace all null values with x
s.fillna(s.mean()) | Replace all null values with the mean (mean can be replaced with other summary statistics such as median)
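A sketch of the null-handling commands on a small invented DataFrame, with comments noting which rows or columns survive each call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, 9.0],
})

null_mask = pd.isnull(df)                  # True where a value is missing
rows_kept = df.dropna()                    # only row 2 has no nulls
cols_kept = df.dropna(axis=1)              # only column "b" has no nulls
cols_thresh = df.dropna(axis=1, thresh=2)  # "c" has one non-null value, so it is dropped
filled = df.fillna(0)                      # every null becomes 0
a_filled = df["a"].fillna(df["a"].mean())  # mean of 1.0 and 3.0 is 2.0
```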

List the commands to compute summary statistics. (These can all be applied to a series as well.)

df.describe() | Summary statistics for numerical columns
df.mean() | Returns the mean of each column
df.corr() | Returns the correlation between columns in a DataFrame
df.count() | Returns the number of non-null values in each DataFrame column
df.max() | Returns the highest value in each column
df.min() | Returns the lowest value in each column
df.median() | Returns the median of each column
df.std() | Returns the standard deviation of each column
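A brief sketch on an invented two-column DataFrame, chosen so that the columns are perfectly correlated and the results are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

summary = df.describe()   # count, mean, std, min, quartiles, max per column
means = df.mean()         # x: 2.5, y: 5.0
medians = df.median()     # x: 2.5, y: 5.0
corr = df.corr()          # y is exactly 2 * x, so the correlation is 1
counts = df.count()       # 4 non-null values in each column
highest = df.max()
lowest = df.min()
spread = df.std()         # sample standard deviation (divides by n - 1)
```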

List the commands to export a DataFrame to CSV, xlsx, SQL, JSON

df.to_csv(filename) | Write to a CSV file
df.to_excel(filename) | Write to an Excel file
df.to_sql(table_name, connection_object) | Write to a SQL table
df.to_json(filename) | Write to a file in JSON format
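A sketch of the CSV and JSON writers that avoids touching the disk: when no filename is given, to_csv() and to_json() return the output as a string. The data and the orient="records" choice are illustrative assumptions.

```python
import io
import json
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Jack"], "score": [99, 98]})

# Without a path, the writers return text instead of creating a file.
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Read the text back to confirm the round trip.
round_trip = pd.read_csv(io.StringIO(csv_text))
records = json.loads(json_text)
```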

List the commands to combine multiple dataframes into one

pd.concat([df1, df2]) | Add the rows of df2 to the end of df1 (columns should be identical); this replaces df1.append(df2), which was removed in pandas 2.0
pd.concat([df1, df2],axis=1) | Add the columns of df2 to the end of df1 (rows should be identical)
df1.join(df2,on=col1,how='inner') | SQL-style join of the columns in df1 with the columns in df2, matching the values of col1 in df1 against the index of df2. 'how' can be one of 'left', 'right', 'outer', 'inner'
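A sketch of the three ways of combining on small invented frames. Rows are stacked with pd.concat, since DataFrame.append is no longer available from pandas 2.0 onward.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "v1": [1, 2]})
df2 = pd.DataFrame({"key": ["c", "d"], "v1": [3, 4]})

# Stack rows: df2's rows follow df1's rows.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side: rows are aligned on the index.
extra = pd.DataFrame({"v2": [10, 20]})
side_by_side = pd.concat([df1, extra], axis=1)

# join matches df1's "key" column against the index of the other frame.
lookup = pd.DataFrame({"v3": [100, 300]}, index=["a", "c"])
joined = df1.join(lookup, on="key", how="inner")
```

With how='inner', only the row whose key "a" appears in lookup's index is kept.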

List the commands used to select a specific subset of data

df[col] | Returns column with label col as Series
df[[col1, col2]] | Returns columns as a new DataFrame
s.iloc[0] | Selection by position
s.loc['index_one'] | Selection by index
df.iloc[0,:] | First row
df.iloc[0,0] | First element of first column
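The selection commands applied to a small invented DataFrame whose index labels match the 'index_one' example:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]},
    index=["index_one", "index_two", "index_three"],
)

s = df["col1"]                 # one column, returned as a Series
sub = df[["col1", "col2"]]     # several columns, returned as a DataFrame
first_by_pos = s.iloc[0]       # selection by integer position
by_label = s.loc["index_one"]  # selection by index label
first_row = df.iloc[0, :]      # first row, returned as a Series
first_cell = df.iloc[0, 0]     # first element of the first column
```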

