7646 - MT - python
# prices.tail syms=['IBM', 'AAPL', 'HNZ', 'XOM', 'GLD'] prices = pd.DataFrame(np.random.rand(10, len(syms)),columns=syms) print prices.tail(1).values
.tail = display the last 5 rows in array including column headers .tail(1) = display only the last row in the array including column headers .tail(1).values = display only the values of the last row in the array without column headers
s = pd.DataFrame([0, 1, 2, np.nan])
0 0 0.0 1 1.0 2 2.0 3 NaN inside the first [], every element is a row
s = pd.DataFrame([[0, 1, 2, np.nan],[1,2,3,4]])
0 1 2 3 0 0 1 2 NaN 1 1 2 3 4.0 inside the first [], every element is a row
cols = ["AGE","HEIGHT", "WEIGHT"] data = [[1,1,None],[2,None,2]] df1 = pd.DataFrame(data, columns=cols) df1.fillna(20) df1['HEIGHT'].mean() / df1['WEIGHT'].mean()
0.5 since fillna wasn't recorded (no in place or var store)
(1) df = pd.DataFrame(d) (2) normed = df/df.ix[0] (3) normed['AAPL'] = np.nan (4) normed.fillna(value='0') (5) print normed[0:2]
4 was not stored, so we will get: 0 NaN 1.000000 1.000000 1.000000 1.000000 1 NaN 0.979705 0.996171 0.998848 0.999866
numpy.random.randint(low=0, high, size=None, dtype='I') x = np.random.randint(5, size=(4, 5))
5 in this case is a upper bound
DataFrame.groupby('stock_id')
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
numpy.random.randn(d0, d1, ..., dn)
Return a sample (or samples) from the "standard normal" distribution. dn~ dimension n size numpy.random.randn(2,4) -> 2x4 normally distributed
array = np.ones((2,3,4)) array = array * 2 print array.sum(axis=None)
Since each cell is 1, the sum would be 2x3x4x2 = 48
df = pd.DataFrame([52, 46, 50, 51], columns = ['AAPL'], index = ['01-01', '01-02', '01-03', '01-04']) df = df.join(pd.DataFrame([83, 88, 86, 90], columns = ['SPY'], index = ['01-01', '01-02', '01-04', '01-05']))
The date range would be based on the first df, missing data will be NaN 01-01 52 83 01-02 46 88 01-03 50 NaN 01-04 51 86
import string idx = [i for i in string.uppercase] # A, B, C .. Z t = pd.Series(range(26), index=idx) # 0, 1, 2 .. 25 # examples ... t[0] # --> 0 t['A'] # --> 0 t[['A','M']] # --> [0, 12] t['A':'D'] # --> [0, 1, 2, 3] t.iloc[25] # --> 25 t.loc['Z'] # --> 25 t.loc[['A','Z']] # --> [0, 25] t.ix['A':'C'] # --> [0, 1, 2] t.ix[0:2] # --> [0, 1]
True
loc works on labels in the index. iloc works on the positions in the index (so it only takes integers). ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
True
pandas.rolling_mean(df, window=2)
[ nan 2.5 3.5 4.5] rolling_mean will take the current index as right edge by default
j = [8, 7, 6, 5, 4, 3, 2, 1] print [x/j[-2] for x in j[1:-1]]
[3, 3, 2, 2, 1, 1] since both are int, python will use int for the result
a=np.array([(10.0,20.0),(1.0,2.0)]) b=np.array([(100,200),(1,2)]) a/b
[[ 0.1 0.1] [ 1. 1. ]] element wise
x= np.array([[1,2,3],[4,5,6], [7,8,9]]) y = np.amax(x, axis=1) z = x/y
[[0, 0, 0],[1, 0, 0],[2, 1, 1]]
numpy.argmax(a, axis=None, out=None) x = np.array([[1,9,1],[1,1,1],[1,1,1]]) x.argmax(axis=1)
a : Input array Returns the indices of the maximum values along an axis. -> [1,0,0]
import pandas as pd df = pd.DataFrame({'a':[3,3,3], 'b':[6,6,6], 'c':[3,3,3]}) df.ix[0,1]/df[1:]
a b c 1 2 1 2 2 2 1 2 df.ix[0,1] -> 6 df[1:] -> the last 2 rows division will resize df.ix[0,1] to a 2x3 matrix
a = [[]]*4 b = [[] for _ in range(4)] a[0].append(15) b[0].append(15)
a: [[15], [15], [15], [15]] b: [[15], [], [], []]
a = [[]]*4 a[0].append(15) vs a[0] = 15
append(15) will give [15]*4 a[0] only write 1 cell
np.ones([1,5])
array([[ 1., 1., 1., 1., 1.]])
ary = np.array([[[1, 2], [3, 4]], [[5, 6], [7,8]]]) print ary[:, :, 0]
ary[:, :, 0] the first ':': open the first set of [] then [:,0] is just all row, first col
axis = 0 or 1 index [0] or [1]
axis 0: sum col by col or stack on-top (vertical) index [0]: row
numpy.sum(a, axis=)
axis = 0: all row, basically col axis = 1: all col, basically row axis = None: default, all axes = .sum().sum()
A.mean(axis = 1, keepdims = True)
axis = 1 -> mean of each row -> the result will be a col keepdims will keep the result as a colum. If not, it would be converted to row
a = pd.Series([1, 2, 4, 7, 11, 16]) b.values[-1]*1.0 -> 5.0
b = a-a.shift(1)
import scipy.optimize as spo def f(x): '''Arbitrary function''' n = sum(x**2) return n guess = [0.5, 0.1, 1.2, 0.2] spo.minimize(f, guess, method = 'SLSQP', constraints=const) what would be const?
const = ({ 'type': 'eq', 'fun': lambda x: 2 - sum(x)}) type : str Constraint type: 'eq' for equality, 'ineq' for inequality. fun : callable The function defining the constraint. In this case, what we want to minimize jac : callable, optional The Jacobian of fun (only for SLSQP). args : sequence, optional Extra arguments to be passed to the function and Jacobian.
s = pd.DataFrame([[0, 1, 2, np.nan],[1,2,3,4]]) s.count s.size
count: axis : {0 or 'index', 1 or 'columns'}, default 0 (row) -> 0 3: row 0, count 3 non-N/A elements size: -> 4: size will count N/A
np.random.random([4,4]) np.random.random(4) np.random.rand(4)
create a 4x4 random array value [0,1], inclusive
m = [15 16 17 18 19] m.cumsum(axis=1)
cumulative sum [15 31 48 66 85]
data = pd.DataFrame(np.arange(16).reshape(4,4), index=list('abcd'), columns=['one', 'two', 'three', 'four']) output column 'two'
data['two'] data.ix[:, 1] data.ix[:, 'two']
w = [0, 1, 2] df = pd.DataFrame([[1, 1, 1], [2, 2, 2]], columns=["A", "B", "C"], index=["X", "Y"]) df = df.multiply(w) df.ix["Y", "C"]
df = A B C X 0 1 2 Y 0 2 4 df.ix["Y", "C"] ~ df[1][2], but df cannot use single indexing
combined_scores.ix['Ariel'] vs combined_scores[0]
df is unhashable, so cannot use single indexing combined_scores[:1:] would work
import pandas as pd df1 = pd.DataFrame({'Type': ["Stock", None]}) df2 = df1 df1.fillna("Mutual Fund", inplace=True) df2.fillna("ETF")
df2.fillna("ETF") doesn't do anything, since df2 = df1 (by reference), so all NA were filled
df1 = pd.DataFrame( ... ) df2 = pd.DataFrame( ... ) df3 = df1.join(df2)
df3 will follow df1's order
a.dot(b)
dot product a.b (a.k.a matrix multiplication) (cross product needs the determinants, use for vector)
df.drop("B", axis=1)
drop col B
a * a
element wise multiplication =/= dot product
df.fillna(method='ffill', inplace=True) df.fillna(method='bfill', inplace=True)
ffill, then bfill
arr = np.array([[1, 2, 3],[3, 4, 5],[5, 6, 7]]) target = np.array([0, 2]) arr[target][0] arr[target][1]
in this case target is not just [0] and [2] indexes, it will act like [0: 2] in indexing arr[target][0] = arr[0: 2][0] arr[target][1] = arr[0: 2][1]
key left_value 0 0 a 1 1 b 2 2 c 3 3 d 4 4 e key right_value 0 2 f 1 3 g 2 4 h 3 5 i 4 6 j Merge result: Is it an outer or inner merge? 0 2 c f 1 3 d g 2 4 e h
inner, since there is no NA
ax = df.plot(title='P Values', fontsize=12) ax.set_xlabel('Date') ax.set_ylabel('P Value')
label the x and y axes set_xlabel is a property of the plot can also do df.columns.name df.index.name before plotting
numpy's std() calculates the uncorrected sample standard deviation of a given numpy array using what fomula?
math.sqrt(sum((x - x.mean())**2) / (len(x))) (RMS)
merge() vs concat()
merge() is used to combine two (or more) dataframes on the basis of values of common columns(indexes can also be used, use left_index=True and/or right_index=True) concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1)
pandas.rolling_max(df,window=2,min_periods=2)
min_periods: Minimum number of observations in window, just a way to apply the function to a smaller sample than the rolling window
df.ix[bd:ed] df[bd:ed]
not df.ix([bd:ed]) loc: label based: df.loc[row_indexer,column_indexer] iloc: integer position based
DataFrame.join
on : column name, tuple/list of column names, or array-like how : {'left', 'right', 'outer', 'inner'}, default: 'left' left: a left join: only the items relative to the left data table, right: opposite direction. Outer join: union join... inner join == natural join (only elements that exist in both tables sort : boolean, default False Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword)
df.ix[1:2,3:4]
output both rows 1 &2 cross both col 3 & 4
red = np.array([[3,4],[2,4]]) x = ([2,5],[2,3]) x = np.asarray(x) blue = red * x[-2,:]
row wise then element wise multiplication [[6 8], [4 8]]
df/df.shift(1)
shift down basically dividing by the previous element
a.size a.shape[0] a.shape[1]
size = rNum*cNum shape = (rNum,cNum) shape[0] = rNum shape[0] = cNum
x**2
square element wise
np.mean(a.min(axis = 0))
take the min of each col, then take the mean
df.head(rNum) df.tail(rNum)
the first/last rNum rows (default = 5)
np.ones((2,3,4))
will create 2 arrays, each 3x4 and wrap it in another dimension [[[ 1. 1. 1. 1.] [ 1. 1. 1. 1.] [ 1. 1. 1. 1.]] [[ 1. 1. 1. 1.] [ 1. 1. 1. 1.] [ 1. 1. 1. 1.]]] Basically a 2x3x4 array
b = a.copy()
will create another copy of a
df = pd.DataFrame(data) df = df.shift(2)
will shift the df down, and replace the first 2 rows with NAs
df = pd.DataFrame(index=dates) df = df.reset_index(['Date']) df.set_index('count', inplace=True)
# create df with dates index # Returning an index to data # Setting a new index from an existing column inplace=True is the same as df = df.set ...
a = np.array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) i = np.array([[0,1], [1,2]]) j = np.array([[2,3], [1,0]]) print a[i,j]
-> get row indexes from i, get col indexes from j -> write result in the same form as i or j