pandas
periodObj.start_time periodObj.end_time
...
functools.reduce(function, sequence[, initial])
Apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value. E.G. reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5).
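A minimal sketch of the left-to-right fold described above, including the optional initial value. The input lists are illustrative only.

```python
from functools import reduce

# Fold the list from left to right: ((((1 + 2) + 3) + 4) + 5)
total = reduce(lambda x, y: x + y, [1, 2, 3, 4, 5])

# With an initial value the fold starts from it: (((10 + 1) + 2) + 3)
total_with_initial = reduce(lambda x, y: x + y, [1, 2, 3], 10)

print(total, total_with_initial)
```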
pandas.tools.plotting.bootstrap_plot
Bootstrap plots are used to visually assess the uncertainty of a statistic, such as mean, median, midrange, etc. A random subset of a specified size is selected from a data set, the statistic in question is computed for this subset and the process is repeated a specified number of times. Resulting plots and histograms are what constitutes the bootstrap plot. from pandas.tools.plotting import bootstrap_plot data = pd.Series(np.random.rand(1000)) bootstrap_plot(data, size=50, samples=500, color='grey')
DataFrameObj.plot.kde
Kernel density estimate plot: draws the estimated probability density function (PDF) as a line, e.g. DataFrame.plot.kde()
series.idxmin
Index of first occurrence of minimum of values. This method is the Series version of ``ndarray.argmin``. DataFrame.idxmin
DataFrameObj.shift
One may want to shift or lag the values in a time series back and forward in time. The method for this is shift, which is available on all of the pandas objects. The shift method accepts a freq argument, which can be a DateOffset class, another timedelta-like object, or an offset alias: ts.shift(5, freq='BM') ts.tshift(5, freq='D')
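A small sketch of the difference between shifting values and shifting the index, assuming a made-up three-day daily series. Without freq the values move and the index stays put; with freq the index labels move and the values travel with them.

```python
import pandas as pd

ts = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2011-01-01", periods=3, freq="D"),
)

# No freq: values slide down one slot, the first slot becomes NaN
lagged = ts.shift(1)

# With freq: the index labels are moved one day forward, values are unchanged
moved = ts.shift(1, freq="D")

print(lagged)
print(moved)
```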
pandas.tools.plotting.parallel_coordinates
Parallel coordinates is a plotting technique for plotting multivariate data. It allows one to see clusters in data and to estimate other statistics visually. Using parallel coordinates points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster will appear closer together. parallel_coordinates(data, 'Name')
pandas.Period
Period('2012-05', freq='D')
pandas plotting formatting
Plot formatting: pass logy=True to get a log-scale Y axis. Set legend=False to hide the legend, which is shown by default. With subplots=True, the layout of the subplots can be specified with the layout keyword, which accepts (rows, columns). The layout keyword can also be used in hist and boxplot. If the input is invalid, ValueError is raised.
index.get_level_values(level)
Return vector of label values for requested level, equal to the length of the index
GroupByObj.filter
The argument of filter must be a function that, applied to the group as a whole, returns True or False. e.g. dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')}) dff.groupby('B').filter(lambda x: len(x) > 2) dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
DataFrame matrix multiplication:
The dot method on DataFrame implements matrix multiplication: e.g. df.T.dot(df) Similarly, the dot method on Series implements dot product: s1.dot(s2)
TimestampObj.value
The epoch time in nanoseconds
GroupByObj.groups
The groups attribute is a dict whose keys are the computed unique groups and corresponding values being the axis labels belonging to each group.
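A short sketch of what the groups dict looks like, assuming a hypothetical two-column frame grouped on column A.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x", "y"], "B": [1, 2, 3, 4]})

# Keys are the group names; values are the row labels in each group
groups = df.groupby("A").groups
rows_for_x = list(groups["x"])

print(groups)
```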
DataFrameObj.reorder_levels
The reorder_levels function generalizes the swaplevel function, allowing you to permute the hierarchical index levels in one step:
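As an illustration, the sketch below builds a hypothetical three-level index and compares a two-level swap with a full permutation. The level names are assumptions made for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2], ["x", "y"]],
    names=["first", "second", "third"],
)
df = pd.DataFrame({"value": range(8)}, index=index)

# swaplevel exchanges exactly two levels
swapped = df.swaplevel(0, 1, axis=0)

# reorder_levels permutes all levels in one step; row order is unchanged
reordered = df.reorder_levels(["third", "first", "second"], axis=0)

print(reordered.head())
```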
Slicing methods
There are two explicit slicing methods, plus a general case. Positional-oriented (exclusive of end), e.g. df.iloc[0:3]. Label-oriented (inclusive of end), e.g. df.loc['bark':'bass']. General (either slicing style, depending on whether the slice contains labels or positions), e.g. df.ix[0:3], df.ix['bark':'bass']
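A minimal sketch of the two explicit styles, assuming a made-up frame with a sorted string index. It shows that the positional slice excludes its end while the label slice includes it.

```python
import pandas as pd

df = pd.DataFrame(
    {"v": range(5)},
    index=["apple", "bark", "base", "bass", "cat"],
)

# Positional: rows 0, 1, 2 (position 3 is excluded)
by_position = df.iloc[0:3]

# Label-based: 'bark' through 'bass', both ends included
by_label = df.loc["bark":"bass"]

print(by_position)
print(by_label)
```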
pandas.Timestamp
Timestamp(datetime(2012, 5, 1)) Timestamp('2012-05-01 00:00:00')
doing an operation between DataFrame and Series
When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. In the special case of working with time series data, and the DataFrame index also contains dates, the broadcasting will be column-wise:
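A small sketch of the default row-wise broadcasting, next to an explicit column-wise version using sub with axis=0. The frame and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Default: the Series index is aligned on the DataFrame columns,
# so the same row is subtracted from every row
by_row = df - df.iloc[0]

# Explicit column-wise broadcasting: align the Series on the row index
by_col = df.sub(df["a"], axis=0)

print(by_row)
print(by_col)
```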
pd.MultiIndex.from_product
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product function: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']] pd.MultiIndex.from_product(iterables, names=['first', 'second'])
series.abs
Return a Series with the absolute value of each element.
DataFrameObj.plot.area
area plot
series.argsort()
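Returns the integer positions that would sort the values. This is not a ranking: for distinct values, series.argsort().argsort() gives zero-based ranks, which differ from series.rank() by 1.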
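A short sketch of how argsort relates to rank, assuming a Series of distinct values with no missing data (ties and NaN behave differently).

```python
import pandas as pd

s = pd.Series([30, 10, 20])

# Positions that would sort the values: 10 is at 1, 20 at 2, 30 at 0
order = s.argsort()

# Applying argsort twice yields zero-based ranks
zero_based_rank = s.argsort().argsort()

# rank() is one-based by default
ranks = s.rank()

print(order.tolist(), zero_based_rank.tolist(), ranks.tolist())
```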
DataFrameObj.plot.box
box plot Boxplot can be colorized by passing color keyword. You can pass a dict whose keys are boxes, whiskers, medians and caps. If some keys are missing in the dict, default colors are used for the corresponding artists. Also, boxplot has sym keyword to specify fliers style.
pd.bdate_range
business date range e.g. rng = bdate_range(start, end)
numpy.asarray
convert a pandas series to numpy ndarray
dateutil.relativedelta
d = datetime(2008, 8, 18, 9, 0) d + relativedelta(months=4, days=5)
ts.to_period(freq=None)
dates.to_period(freq='M')
delete a column in dataframe
del df['two'] three = df.pop('three')
pd.get_dummies
df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
pd.get_dummies(df['key'])
     a    b    c
0  0.0  1.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
3  0.0  0.0  1.0
4  1.0  0.0  0.0
5  0.0  1.0  0.0
df[['data1']].join(pd.get_dummies(df['key']))
pandas plotting examples
df.plot(x='A', y='B')
np.diff
difference operator in numpy
Index.isin
e.g. df.index.isin([0,2,4])
DataFrameObj.plot.pie
e.g. df.plot.pie(subplots=True, figsize=(8, 4))
Basic multi-index slicing using slices, lists, and labels.
e.g. dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:] dfmi.loc['A1',(slice(None),'foo')] You can also specify the axis argument to .loc to interpret the passed slicers on a single axis: dfmi.loc(axis=0)[:,:,['C1','C3']] dfmi.loc(axis=0)['A1':'A2','B0',['C0','C2']]
pandas.crosstab()
e.g.
a = np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'], dtype=object)
b = np.array(['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'two', 'two', 'one'], dtype=object)
c = np.array(['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'], dtype=object)
pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b    one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2
GroupByObj.get_group
e.g. df3.groupby(['X']).get_group('A') Note that groupby will preserve the order in which observations are sorted within each group.
DataFrameObj plot with axes
fig, axes = plt.subplots(nrows=2, ncols=2) df['A'].plot(ax=axes[0,0]); axes[0,0].set_title('A'); df['B'].plot(ax=axes[0,1]); axes[0,1].set_title('B'); df['C'].plot(ax=axes[1,0]); axes[1,0].set_title('C'); df['D'].plot(ax=axes[1,1]); axes[1,1].set_title('D');
DataFrameObj.iterrows()
Iterate over DataFrame rows as (index, Series) pairs: for ind, row in df.iterrows(): print(ind, row)
iterate through a GroupByObj
for name, group in df.groupby(['A', 'B']): print(name) print(group)
pandas.tools.plotting.autocorrelation_plot
from pandas.tools.plotting import autocorrelation_plot autocorrelation_plot(data)
pandas.tseries.offsets.DateOffset
from pandas.tseries.offsets import * d + DateOffset(months=4, days=5)
df.plot.hexbin
hexbin plot x input and y input are needed
df.idxmin
idxmin(axis=0, skipna=True): Return index of first occurrence of minimum over requested axis. For a dataframe this returns a sequence corresponding to the columns
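A minimal sketch of idxmin on both axes and on a single Series, using a made-up frame with string row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [3, 1, 2], "y": [5, 7, 1]},
    index=["a", "b", "c"],
)

# axis=0: for each column, the row label of its minimum
per_column = df.idxmin(axis=0)

# axis=1: for each row, the column label of its minimum
per_row = df.idxmin(axis=1)

# Series version: a single label
series_min = df["x"].idxmin()

print(per_column, per_row, series_min, sep="\n")
```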
DataFrameObj.join
is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. join takes an optional on argument which may be a column or multiple column names, which specifies that the passed DataFrame is to be aligned on that column in the DataFrame. left.join(right, how=...) left.join(right, on=key_or_keys) pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
itertools.product
itertools.product(['Ada','Quinn','Violet'],['Comp','Math','Sci'])
np.linspace
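np.linspace(start, stop, num=50) returns num evenly spaced samples over the interval [start, stop], with both endpoints included by default.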
pd.MultiIndex.from_tuples
make multiindex from tuples
pandas.merge
merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True, indicator=False) joining on the columns left: dataFrameObj on the left | right: DFO on the right how: joining method four choices 'left', 'right', 'outer', 'inner' on: list of columns names to join upon left_index / right_index: whether left & right indexes should be used to join left_on/right_on: the name(s) of the columns to be joined on the left/right
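To make the how and indicator parameters concrete, here is a sketch with two hypothetical frames that share a key column.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Default inner join: only keys present in both frames survive
inner = pd.merge(left, right, on="key")

# Outer join keeps every key; indicator=True adds a _merge column
# recording where each row came from
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```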
ndarray.ravel()
ndarray.ravel(order='C') Return a contiguous flattened 1D array. order is e.g. 'C' (row-major, the default) or 'F' (column-major).
pandas.ordered_merge
ordered_merge(left, right, on=None, left_by=None, right_by=None, left_on=None, right_on=None, fill_method=None, suffixes=('_x', '_y')) Perform merge with optional filling/interpolation designed for ordered data like time series data. Optionally perform group-wise merge (see examples) fill_method : {'ffill', None}, default None Interpolation method for data on : label or list Field names to join on. Must be found in both DataFrames. left_on / right_on : label or list, or array-like Field names to join on in left DataFrame.
pd.Period()
pd.Period(value=, freq=)
Deal with epoch time
pd.to_datetime([1349720105, 1349806505, 1349892905, 1349979305, 1350065705], unit='s') datetime.datetime.fromtimestamp
numpy.tile ndarray.repeat
nda.repeat(N) repeats each element N times; numpy.tile(nda, N) repeats the whole array nda N times
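A quick sketch contrasting the two on a made-up three-element array.

```python
import numpy as np

nda = np.array([1, 2, 3])

# repeat: each element is duplicated in place
repeated = nda.repeat(2)

# tile: the whole array is laid end to end
tiled = np.tile(nda, 2)

print(repeated, tiled)
```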
DataFrameObj.values
Returns the values of a DataFrame object as a NumPy ndarray
DataFrameObj.to_string()
returns a string representation of the dataframe object
pandas.date_range
rng = date_range('1/1/2011', periods=72, freq='H')
pandas.tools.plotting.scatter_matrix
scatter plot matrix using the scatter_matrix method e.g. scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')
slice
slice(start, stop[, step]) Create a slice object. This is used for extended slicing (e.g. a[0:10:2]).
series.asfreq
ts.asfreq('45Min', method='pad')
pd.Timestamp
ts_input : datetime-like, str, int, float Value to be converted to Timestamp offset : str, DateOffset Offset which Timestamp will have tz : string, pytz.timezone, dateutil.tz.tzfile or None Time zone for time which Timestamp will have. unit : string E.G. Timestamp(datetime(2012, 5, 1)) Timestamp('2012-05-01 00:00:00')
DataFrameObj.pivot(...)
.pivot(index, columns, values) Produce 'pivot' table based on 3 columns of this DataFrame. Uses unique values from index / columns and fills with values. If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting "pivoted" DataFrame will have hierarchical columns whose topmost level indicates the respective value column:
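A small sketch of both cases described above, assuming a hypothetical long-format frame with two value columns. Omitting values yields hierarchical columns whose top level names the value column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["d1", "d1", "d2", "d2"],
        "variable": ["A", "B", "A", "B"],
        "value": [1.0, 2.0, 3.0, 4.0],
        "other": [10, 20, 30, 40],
    }
)

# One value column: flat columns A and B
wide = df.pivot(index="date", columns="variable", values="value")

# values omitted: columns become a two-level MultiIndex
wide_all = df.pivot(index="date", columns="variable")

print(wide)
print(wide_all)
```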
DataFrameObj.plot.bar/DataFrameObj.plot.barh
Calling a DataFrame's plot.bar() method produces a multiple bar plot: To produce a stacked bar plot, pass stacked=True: To get horizontal bar plots, use the barh method:
DataFrameObj.assign
DataFrameObj.assign(nm1=** , nm2=**) Assign new columns to a DataFrame, returning a new object (a copy) with all the original columns in addition to the new ones. kwargs : keyword, value pairs keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned. iris.assign(ratio = lambda x: (x['Width'] /x['Length']))
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
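A minimal sketch of pivot_table with an aggregation and margins, using made-up sales data. Unlike pivot, duplicate index/column pairs are allowed because they are aggregated.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["N", "N", "S", "S", "S"],
        "product": ["x", "y", "x", "x", "y"],
        "sales": [1.0, 2.0, 3.0, 5.0, 7.0],
    }
)

# Mean sales per (region, product); the two S/x rows are averaged
table = pd.pivot_table(
    df, values="sales", index="region", columns="product", aggfunc="mean"
)

# margins=True adds an 'All' row and column of overall aggregates
with_margins = pd.pivot_table(
    df, values="sales", index="region", columns="product",
    aggfunc="mean", margins=True, margins_name="All",
)

print(table)
print(with_margins)
```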
pandas.tools.plotting.lag_plot
Lag plots are used to check if a data set or time series is random. Random data should not exhibit any structure in the lag plot. Non-random structure implies that the underlying data are not random. from pandas.tools.plotting import lag_plot plt.figure() data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000))) lag_plot(data)
DataFrameObj.query & DataFrameObj.assign
Passing a callable, as opposed to an actual value to be inserted, is useful when you don't have a reference to the DataFrame at hand. This is common when using assign in chains of operations. For example, we can limit the DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot: e.g. iris.query('SepalLength > 5').assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,PetalRatio = lambda x: x.PetalWidth / x.PetalLength) .plot(kind='scatter', x='SepalRatio', y='PetalRatio')
DataFrameObj.div/sub/add/mul
Signature: df.sub(other, axis='columns', level=None, fill_value=None) Docstring: Subtraction of dataframe and other, element-wise (binary operator `sub`). Equivalent to ``dataframe - other``, but with support to substitute a fill_value for missing data in one of the inputs. Parameters ---------- other : Series, DataFrame, or constant axis : {0, 1, 'index', 'columns'} For Series input, axis to match Series index on fill_value : None or float value, default None Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level
DataFrameObj.swaplevel
The swaplevel function can switch the order of two levels: df[:5].swaplevel(0, 1, axis=0)
GroupByObj.transform
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk. For example, suppose we wished to fill the missing values in each group with that group's mean: f = lambda x: x.fillna(x.mean()) transformed = grouped.transform(f)
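A runnable sketch of the fillna function shown above, assuming a hypothetical frame with one missing value. It also contrasts the same-size result of transform with the one-row-per-group result of an aggregation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "val": [1.0, np.nan, 4.0, 6.0]}
)
grouped = df.groupby("key")["val"]

# Fill each group's missing values with that group's mean
f = lambda x: x.fillna(x.mean())
transformed = grouped.transform(f)

# An aggregation, by contrast, returns one value per group
aggregated = grouped.mean()

print(transformed)
print(aggregated)
```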
pd.IndexSlice
You can use pd.IndexSlice for a more natural syntax, writing : instead of slice(None), e.g. idx = pd.IndexSlice dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']] A boolean mask can be combined with the slicers: mask = dfmi[('a','foo')]>200 dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
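A self-contained sketch of the slicer syntax. The frame dfmi here is a hypothetical stand-in built with sorted MultiIndexes on both axes; the level values are assumptions chosen to match the labels used above.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"], ["C0", "C1", "C2", "C3"]]
)
columns = pd.MultiIndex.from_product([["a", "b"], ["bar", "foo"]])
dfmi = pd.DataFrame(
    np.arange(64).reshape(16, 4), index=index, columns=columns
)

# Rows: any first and second level, third level in C1 or C3
# Columns: any first level, second level equal to 'foo'
selected = dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]

print(selected)
```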
DataFrameObj.applymap(func)
Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame. func : Python function, returns a single value from a single value. Examples: df = pd.DataFrame(np.random.randn(3, 3)) df = df.applymap(lambda x: '%.2f' % x)
GroupByObj.agg
aggregate is equivalent to agg
pandas.concat
Takes a list or dict of homogeneously-typed objects and concatenates them with some configurable handling of "what to do with the other axes": pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True) Docstring: Concatenate pandas objects along a particular axis with optional set logic along the other axes. join: {'inner', 'outer'}, default 'outer'. How to handle indexes on other axis(es). Outer for union and inner for intersection. e.g. pd.concat([df1, df2, df3]) pd.concat([s3, s4, s5], axis=1, keys=['red','blue','yellow'])
ts.resample
ts.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0) Convenience method for frequency conversion and resampling of regular time-series data. rule : the offset string or object representing target conversion axis : int, optional, default 0 closed : {'right', 'left'} Which side of bin interval is closed label : {'right', 'left'} Which bin edge label to label bucket with convention : {'start', 'end', 's', 'e'} loffset : timedelta Adjust the resampled time labels base : int, default 0 For frequencies that evenly subdivide 1 day, the "origin" of the aggregated intervals. For example, for '5min' frequency, base could range from 0 through 4. Defaults to 0
ts.to_timestamp
ts.to_timestamp(freq= , how= ) freq : string or DateOffset, default 'D' for week or longer, 'S' otherwise Target frequency how : {'s', 'e', 'start', 'end'}
datetime_index partial string indexing
ts['1/31/2011'] ts[datetime(2011, 12, 25):] ts['10/31/2011':'12/31/2011'] dft['2013-1':'2013-2'] This specifies an exact stop time (and is not the same as the above) dft['2013-1':'2013-2-28 00:00:00'] To select a single row, use .loc dft.loc['2013-1-15 12:30:00']
DataFrameObj.plot.scatter
x input and y input are required To plot multiple column groups in a single axes, repeat plot method specifying target ax. It is recommended to specify color and label keywords to distinguish each groups. ax = df.plot.scatter(x='a', y='b', color='DarkBlue', label='Group 1'); df.plot.scatter(x='c', y='d', color='DarkGreen', label='Group 2', ax=ax);
DataFrameObj.xs
xs(key, axis=0, level=None, copy=None, drop_level=True) Returns a cross-section (row(s) or column(s)) from the Series/DataFrame. Defaults to cross-section on the rows (axis=0). xs() also allows selection with multiple keys df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
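A brief sketch of xs on the row axis, assuming a hypothetical two-level index named first and second. It shows selection on the outer level, on an inner level, and the effect of drop_level.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"v": [1, 2, 3, 4]}, index=index)

# Key on the outermost level; that level is dropped from the result
bar = df.xs("bar")

# Key on an inner level, selected by name
ones = df.xs("one", level="second")

# Keep the selected level in the result index
ones_kept = df.xs("one", level="second", drop_level=False)

print(bar, ones, ones_kept, sep="\n")
```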
zip
zip(iter1 [,iter2 [...]]) --> zip object Return a zip object whose .__next__() method returns a tuple where the i-th element comes from the i-th iterable argument. The .__next__() method continues until the shortest iterable in the argument sequence is exhausted and then it raises StopIteration.
pd.to_datetime datetime.datetime.strftime datetime.datetime.strptime
Common format codes: %a weekday abbreviated name; %A weekday full name; %w weekday as a decimal number; %d day of the month; %b month abbreviated name; %B month full name; %m month as a zero-padded decimal number; %y year without century; %Y year with century; %H hour (24-hour clock); %I hour (12-hour clock); %p locale's AM or PM; %M minute; %S second; %j day of the year
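A short sketch of formatting and parsing with a few of these codes, using only the standard library. The date is arbitrary; locale-dependent codes such as %a and %p are left out so the output does not depend on the machine's locale.

```python
from datetime import datetime

d = datetime(2008, 8, 18, 9, 0)

# Format a datetime as text
text = d.strftime("%Y-%m-%d %H:%M:%S")

# Parse text back into a datetime using a matching format string
parsed = datetime.strptime("18/08/2008 09:00", "%d/%m/%Y %H:%M")

# Day of the year and weekday number (0 is Sunday)
day_of_year = d.strftime("%j")
weekday_number = d.strftime("%w")

print(text, parsed, day_of_year, weekday_number)
```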
DataFrameObj.append
concatenate along axis=0, namely the index: In the case of DataFrame, the indexes must be disjoint but the columns do not need to be: df = df1.append(df2,ignore_index=True) append here does not modify df1 and returns its copy with df2 appended. For DataFrames which don't have a meaningful index, you may wish to append them and ignore the fact that they may have overlapping indexes: To do this, use the ignore_index argument: append may take multiple objects to concatenate: result = df1.append([df2, df3])
pd.cut
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False) Return indices of half-open bins to which each value of `x` belongs. x : Input array to be binned. Has to be 1d. bins : int or sequence of scalars If `bins` int, it defines the # of equal-width bins in the range of `x`. If `bins` is a sequence it defines the bin edges allowing for non-uniform bin width. retbins : bool, optional Whether to return the bins or not. Returns out : Categorical or Series or array of integers if labels is False bins : ndarray of floats ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60]) pd.cut(ages, bins=3)
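A runnable sketch built on the ages array above. The explicit bin edges and the label names in the second call are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])

# Three equal-width bins over the range of ages; labels=False
# returns the integer bin number for each value
codes = pd.cut(ages, bins=3, labels=False)

# Explicit, non-uniform edges with named bins; retbins=True also
# returns the edges that were used
labeled, edges = pd.cut(
    ages,
    bins=[0, 18, 40, 100],
    labels=["child", "adult", "senior"],
    retbins=True,
)

print(codes)
print(list(labeled))
```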
plt.axhline
plt.axhline(y=0, xmin=0, xmax=1, hold=None, **kwargs) Add a horizontal line across the axis. y : scalar, optional, default: 0 y position in data coordinates of the horizontal line. xmin : scalar, optional, default: 0 Should be between 0 and 1, 0 being the far left of the plot, 1 the far right of the plot. xmax : scalar, optional, default: 1 Should be between 0 and 1, 0 being the far left of the plot, 1 the far right of the plot.
stack/unstack
stack: "pivot" a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels. unstack: inverse operation from stack: "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
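A minimal round-trip sketch on a made-up two-by-two frame. With a single level of column labels, stack produces a Series whose index gains the column labels as its innermost level, and unstack moves that level back to the columns.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# Column labels become the innermost level of the row index
stacked = df.stack()

# The innermost row level goes back to the column axis
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```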
ts.asfreq
ts.asfreq(freq, method=None, how=None, normalize=False) Convert all TimeSeries inside to specified frequency using DateOffset objects. Optionally provide fill method to pad/backfill missing values. freq : DateOffset object, or string method : {'backfill', 'bfill', 'pad', 'ffill', None} Method to use for filling holes in reindexed Series. pad / ffill: propagate last valid observation forward to next valid; backfill / bfill: use NEXT valid observation to fill. how : {'start', 'end'}, default end For PeriodIndex only, see PeriodIndex.asfreq normalize : bool, default False Whether to reset output index to midnight
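A small sketch of the fill methods, assuming a made-up series with a one-day gap. Converting to daily frequency creates the missing date, which is then left empty, padded forward, or backfilled.

```python
import pandas as pd

ts = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2011-01-01", "2011-01-03"]),
)

# No fill method: the new date 2011-01-02 holds NaN
daily = ts.asfreq("D")

# pad / ffill: carry the last valid observation forward
padded = ts.asfreq("D", method="pad")

# backfill / bfill: use the next valid observation
backfilled = ts.asfreq("D", method="bfill")

print(daily, padded, backfilled, sep="\n")
```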
series.resample
ts.resample('D').mean()
insert a column to a DataFrame
By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns: df.insert(1, 'bar', df['one'])
Indexing/Slicing
Operation                       Syntax         Result
Select column                   df[col]        Series
Select row by label             df.loc[label]  Series
Select row by integer location  df.iloc[loc]   Series
Slice rows                      df[5:10]       DataFrame
Select rows by boolean vector   df[bool_vec]   DataFrame
dataframe transposing
Transposing To transpose, access the T attribute (also the transpose function), similar to an ndarray:
random.shuffle
import random; random.shuffle
DataFrameObj.plot kind
Plotting methods allow for a handful of plot styles other than the default line plot. These styles can be selected with the kind keyword argument to plot(). They include: 'bar' or 'barh' for bar plots; 'hist' for histograms; 'box' for boxplots; 'kde' or 'density' for density plots; 'area' for area plots; 'scatter' for scatter plots; 'hexbin' for hexagonal bin plots; 'pie' for pie plots.
DataFrameObj.to_string
Render a DataFrame to a console-friendly tabular output.
GroupByObj.apply
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example: grouped['C'].apply(lambda x: x.describe()) apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it. Depending on the path taken and exactly what you are grouping, the grouped column(s) may be included in the output and may also set the index.
pandas.to_datetime
To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function. When passed a Series, this returns a Series (with the same index), while a list-like is converted to a DatetimeIndex: to_datetime(Series(['Jul 31, 2009', '2010-01-10', None])) to_datetime(['2005/11/23', '2010.12.31'])
pd.to_datetime
To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function. When passed a Series, this returns a Series (with the same index), while a list-like is converted to a DatetimeIndex: e.g. to_datetime(['2005/11/23', '2010.12.31']) to_datetime(['04-01-2012 10:00'], dayfirst=True) Specifying a format argument will potentially speed up the conversion considerably.
DataFrameObj.groupby
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False) The mapping can be specified many different ways: A Python function, to be called on each of the axis labels A list or NumPy array of the same length as the selected axis A dict or Series, providing a label -> group name mapping For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler A list of any of the above things By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups: grouped = df.groupby(['A', 'B'], as_index=False)
DataFrameObj.itertuples()
df.itertuples(index=True, name='Pandas') Docstring: Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple. df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b']) >>> df col1 col2 a 1 0.1 b 2 0.2 >>> for row in df.itertuples(): ... print(row) Pandas(Index='a', col1=1, col2=0.10000000000000001) Pandas(Index='b', col1=2, col2=0.20000000000000001)
DataFrameObj.plot.hist
df.plot.hist(stacked=True, bins=20) You can pass other keywords supported by matplotlib hist. For example, horizontal and cumulative histograms can be drawn with orientation='horizontal' and cumulative=True.
DataFrameObj.reset_index
df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
DataFrameObj.combine_first DataFrameObj.update
df1.combine_first(df2)
DataFrameObj.reindex
df2.reindex(index=None, columns=None, **kwargs) Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, only applicable to DataFrames/Series with monotonically increasing/decreasing index. level : int or name Broadcast across a level, matching Index values on the passed MultiIndex level
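A short sketch of reindex with and without a fill method, using a hypothetical frame whose index skips every other integer. The method argument relies on the index being monotonic, as noted above.

```python
import pandas as pd

df = pd.DataFrame({"v": [1.0, 2.0, 3.0]}, index=[0, 2, 4])
new_index = [0, 1, 2, 3, 4]

# Labels missing from the original index get NaN
conformed = df.reindex(new_index)

# method='ffill' fills each new label from the previous valid one
filled = df.reindex(new_index, method="ffill")

print(conformed)
print(filled)
```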
~ : to take the complement of a mask
df[~((df.AAA <= 6) & (df.index.isin([0,2,4])))]
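A runnable sketch of the mask complement above, assuming a made-up five-row frame with a column named AAA.

```python
import pandas as pd

df = pd.DataFrame({"AAA": [4, 5, 6, 7, 8]}, index=[0, 1, 2, 3, 4])

# True where AAA <= 6 and the row label is one of 0, 2, 4
mask = (df.AAA <= 6) & (df.index.isin([0, 2, 4]))

# ~ flips the mask, keeping every row the mask would have excluded
complement = df[~mask]

print(df[mask])
print(complement)
```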