CMSC320
tabular operations
1. Select/slicing: select only some rows, only some columns, or a combination of both.
2. Aggregate/reduce: combine the values across a column into a single value.
3. Map: apply a function to every row, possibly creating more or fewer columns. Variations allow one row to generate multiple rows in the output (sometimes called "flatmap").
4. Group by: group tuples together by column/dimension.
5. Group by aggregate: compute one aggregate per group. The final result is usually seen as a table.
6. Union/intersection/difference: set operations, valid only if the two tables have identical attributes/columns. Union, intersection, and set difference all manipulate tables as sets. IDs may be treated in different ways, resulting in somewhat different behaviors.
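The operations above can be sketched without any library by treating a table as a list of dicts. This is only an illustrative sketch: the table contents and the column names (name, dept, salary) are made up for the example.

```python
from collections import defaultdict

# Hypothetical table: each row is a dict mapping column name -> value.
rows = [
    {"name": "Ann", "dept": "cs", "salary": 100},
    {"name": "Bob", "dept": "cs", "salary": 80},
    {"name": "Cy", "dept": "math", "salary": 90},
]

# 1. Select: keep only some rows (salary >= 90) and some columns (name).
selected = [{"name": r["name"]} for r in rows if r["salary"] >= 90]

# 2. Aggregate: combine the values of one column into a single value.
total_salary = sum(r["salary"] for r in rows)

# 3. Map: apply a function to every row, here adding a new column.
with_bonus = [{**r, "bonus": r["salary"] // 10} for r in rows]

# 4-5. Group by + aggregate: compute one aggregate (the mean) per group.
groups = defaultdict(list)
for r in rows:
    groups[r["dept"]].append(r["salary"])
mean_by_dept = {dept: sum(vals) / len(vals) for dept, vals in groups.items()}

print(selected)
print(total_salary)
print(mean_by_dept)
```

In pandas the same steps would typically be written with boolean indexing, apply, and groupby, but the underlying ideas are the same.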
JSON from twitter
GET https://api.twitter.com/1.1/friends/list.json?cursor=-1&screen_name=twitter_api&skip_status=true&include_user_entities=false
RESTful APIs (Representational State Transfer)
GET: perform a query and return data
POST: create a new entry or object
PUT: update an existing entry or object
PATCH: partially update an existing entry or object
DELETE: delete an existing entry or object
Real APIs can be more intricate, but the verb (e.g. "PUT") aligns with the action.
JSON Files and Strings
JSON is a method for serializing objects: serialization converts an object into a string, and deserialization converts a string back into an object. It is easy for humans to read. JSON is defined by three constructs:
object - Python dict, hash table, Java map
array - Python list, Java array, vector
value - Python string, float, int, boolean, JSON object, JSON array
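A minimal round trip through the standard-library json module illustrates serialization and deserialization. The record contents are made up for the example.

```python
import json

# Hypothetical object: a dict holding a string, an int, a list, and a boolean.
record = {
    "course": "CMSC320",
    "enrolled": 120,
    "tags": ["data", "python"],
    "online": False,
}

text = json.dumps(record)    # serialize: Python object -> JSON string
restored = json.loads(text)  # deserialize: JSON string -> Python object

print(text)
print(restored == record)
```

Note how the Python boolean False becomes the JSON literal false in the string and is turned back into False on the way in.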
Relationships
Primary keys and foreign keys define the interactions between different tables (aka entities). There are four types: one-to-one, one-to-one-or-none, one-to-many (and many-to-one), and many-to-many. Each connects (one, many) of the rows in one table to (one, many) of the rows in another table.
creating lists in a "pythonic way"
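    P = [2**x for x in range(17)]
    E = [x for x in range(1000) if x % 2 != 0]
map/filter are "lazier" than this: a list comprehension builds the whole list immediately, while map and filter (in Python 3) produce values only on demand.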
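The difference between the eager comprehension and the lazy map/filter can be seen in a short sketch. The variable names are illustrative only.

```python
# Eager: the comprehension computes all 17 powers immediately.
powers = [2**x for x in range(17)]

# Lazy: map and filter return iterators; nothing has been computed yet.
lazy_powers = map(lambda x: 2**x, range(17))
lazy_odds = filter(lambda x: x % 2 != 0, range(1000))

# Values are produced one at a time, only when requested.
first_power = next(lazy_powers)
first_three_odds = [next(lazy_odds) for _ in range(3)]

print(powers[:5])
print(first_power, first_three_odds)
```

Laziness matters when the input is large or when only the first few results are needed.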
scipy cont
SciPy gives you access to a ton of specialized mathematical functionality. Just know it exists; we won't use it much in this class. Some functionality:
Special mathematical functions (scipy.special): elliptic, Bessel, etc.
Integration (scipy.integrate)
Optimization (scipy.optimize)
Interpolation (scipy.interpolate)
Fourier transforms (scipy.fftpack)
Signal processing (scipy.signal)
Linear algebra (scipy.linalg)
Compressed sparse graph routines (scipy.sparse.csgraph)
Spatial data structures and algorithms (scipy.spatial)
Statistics (scipy.stats)
Multidimensional image processing (scipy.ndimage)
Querying a RESTful API
Stateless: with every request, you send along a token/authentication of who you are.
    token = "super_secret_token"
    r = requests.get("https://api.github.com/user", params={"access_token": token})
    print(r.content)
    {"login": "Mohammad Nayeem teli", "id": 10536112, "avatar_url": "http..."}
PUT/POST/DELETE can edit your repositories.
XML, XHTML, HTML
Still hugely popular online, but JSON has replaced XML for:
asynchronous browser <--> server calls
many newer web APIs
summary of operations
Tables: a simple, common abstraction. Subsumes a set of "strings", a common input.
Operations: Select, Map, Aggregate, Reduce, Join/Merge, Union/Concat, Group By.
In a given system/language, the operations may be named differently: SQL uses "join", whereas pandas uses "merge".
Relation
Simplest relation: a table aka tabular data full of unique tuples
XML
XML is a hierarchical markup language:
    <tag attribute="value1">
        <subtag> Some cool words or values go here. </subtag>
        <openclosetag attribute="value2" />
    </tag>
Pandas: History
Written by Wes McKinney. Started in 2008 to get a high-performance, flexible tool for quantitative analysis of financial data. Highly optimized for performance, with critical code paths written in Cython or C.
Key constructs: Series (like a NumPy array) and DataFrame (like a table/relation, or an R data.frame).
Foundation for data wrangling and analysis in Python.
numpy array
A few mechanisms for creating arrays in NumPy:
Conversion from other Python structures (e.g. lists, tuples): any sequence-like data can be mapped to an ndarray.
Built-in NumPy array creation (e.g. arange, ones, zeros): create arrays of all ones, all zeros, increasing numbers from 0 to 1, etc.
Reading arrays from disk, either from standard or custom formats (e.g. a CSV file).
primary key
a unique identifier for every tuple in a relation. Each tuple has one primary key
resizing array
An array's shape can be manipulated by a number of methods. resize(size) modifies an array in place; reshape(size) returns a new array object with the new shape (a view of the same data where possible, otherwise a copy) and leaves the original's shape unchanged.
    a = np.floor(10 * np.random.random((3, 4)))
    print(a)
    [[9. 8. 7. 9.]
     [7. 5. 9. 7.]
     [8. 2. 7. 5.]]
    a.shape
    (3, 4)
    a.ravel()
    array([9., 8., 7., 9., 7., 5., 9., 7., 8., 2., 7., 5.])
    a.shape = (6, 2)
    print(a)
    [[9. 8.]
     [7. 9.]
     [7. 5.]
     [9. 7.]
     [8. 2.]
     [7. 5.]]
    a.transpose()
    array([[9., 7., 7., 9., 8., 7.],
           [8., 9., 5., 7., 2., 5.]])
Python
An interpreted, dynamically-typed, high-level, garbage-collected, object-oriented/functional/imperative, and widely used language.
Interpreted: instructions are executed without first being compiled into machine instructions (CPython does compile to bytecode internally)
Dynamically-typed: type safety is verified at runtime
High-level: abstracted away from the raw metal and kernel
Garbage-collected: memory management is automated
Object-oriented/functional/imperative: you can do bits of OO, functional, and imperative programming
map
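Apply a function to every element of a sequence or iterable. In Python 3, map returns a lazy iterator, so wrap it in list() to see the values.
    arr = [1, 2, 3, 4, 5]
    list(map(lambda x: x**2, arr))
    [1, 4, 9, 16, 25]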
foreign keys
Attributes (columns) that point to a different table's primary key. A table can have multiple foreign keys.
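A small sqlite3 sketch shows a primary key and a foreign key working together. The person table and the owner_id column are hypothetical names chosen for the example; note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# person.id is a primary key; cats.owner_id is a foreign key pointing at it.
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE cats (
        id INTEGER PRIMARY KEY,
        name TEXT,
        owner_id INTEGER REFERENCES person(id)
    )
""")
conn.execute("INSERT INTO person VALUES (1, 'Ann')")
conn.execute("INSERT INTO cats VALUES (1, 'Megabyte', 1)")  # owner 1 exists

# Pointing at a person that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO cats VALUES (2, 'Fuzz Aldrin', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

cat_count = conn.execute("SELECT COUNT(*) FROM cats").fetchone()[0]
print(rejected, cat_count)
```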
Exceptions
Examples: tweepy (a Python Twitter API) returns "Rate limit exceeded"; sqlite (a file-based database) returns an integrity error.
    from platform import python_version
    print('Python', python_version())
    try:
        cause_a_NameError
    except NameError as err:
        print(err, '-> some extra text')
one-to-one
Two tables have a one-to-one relationship if every tuple in the first table corresponds to exactly one entry in the other (person --> SSN). In general, you won't be using these (why not just merge the rows into one table?). Reasons you might:
Split a big row between SSD and HDD, or distribute it
Restrict access to part of a row (some DBMSs allow column-level access control, but not all)
Caching, partitioning, and serious stuff: another class
linear algebra in numpy continued (code)
    u = eye(2)
    array([[1., 0.],
           [0., 1.]])
    j = array([[0.0, -1.0], [1.0, 0.0]])
    dot(j, j)
    array([[-1., 0.],
           [0., -1.]])
    trace(u)  # trace: sum of the diagonal
    2.0
    y = array([[5.], [7.]])
    solve(a, y)  # solve the linear matrix equation a x = y (a is the 2x2 matrix [[1, 2], [3, 4]])
    array([[-3.],
           [4.]])
    eig(j)  # get eigenvalues/eigenvectors of a matrix
    (array([0.+1.j, 0.-1.j]),
     array([[0.7071+0.j, 0.7071+0.j],
            [0.-0.7071j, 0.+0.7071j]]))
example of tidydata
Variable: a measure or attribute (age, sex, weight, height)
Value: a measurement of an attribute (12.2, 42.3 kg, 145.1 cm, M/F)
Observation: all measurements for an object; a specific person is [12.2, 42.3, 145.1, F]
one-to-one-or-none
we want to keep track of people's cats. Each person has at most one entry in the table
NumPy array code
    x = np.array([2, 3, 1, 0])
    x = np.array([[1, 2.0], [0, 0], (1+1j, 3.)])  # note the mix of lists, a tuple, and types
    x
    array([[1.+0.j, 2.+0.j],
           [0.+0.j, 0.+0.j],
           [1.+1.j, 3.+0.j]])
NumPy arrays cont
zeros(shape) - creates an array filled with 0 values with the specified shape. The default dtype is float64.
    np.zeros((2, 3))
    array([[0., 0., 0.],
           [0., 0., 0.]])
ones(shape) - creates an array filled with 1 values.
arange() - like Python's range(), but returns an array.
    np.arange(10)
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    np.arange(2, 10, dtype=float)  # np.float was removed in NumPy 1.24; use float or np.float64
    array([2., 3., 4., 5., 6., 7., 8., 9.])
    np.arange(2, 3, 0.2)
    array([2. , 2.2, 2.4, 2.6, 2.8])
array operations
Basic operations apply element-wise. The result is a new array with the resultant elements.
    a = np.arange(5)
    b = np.arange(5)
    a + b
    array([0, 2, 4, 6, 8])
    a - b
    array([0, 0, 0, 0, 0])
    a**2
    array([0, 1, 4, 9, 16])
    a > 3
    array([False, False, False, False, True])
    10 * np.sin(a)
    array([0., 8.4147, 9.09297, 1.4112, -7.56802])
    a * b
    array([0, 1, 4, 9, 16])
five most common problems with messy data
Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of observational units are stored in the same table
A single observational unit is stored in multiple tables
HTTP requests
    conda install -c anaconda requests=2.21.0
    import requests
    r = requests.get('cmsc320 website url')
    r.status_code
    200
    r.headers['content-type']
    'text/html'
    r.content
    b'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\n<meta charset="utf-8">\n<meta name="viewport"
a web-based application programming interface (API)
A contract between a server and a user stating: "If you send me a specific request, I will return some information in a structured and documented format." More generally, APIs can also perform actions, may not be web-based, and can be a set of protocols for communicating between processes or between an application and an OS.
NumPy
Contains:
a powerful n-dimensional array object
sophisticated (broadcasting/universal) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities, etc.
Can also be used as an efficient multi-dimensional container of generic data.
functions in python
    def my_func(x, y):
        if x > y:
            return x
        else:
            return y

    # a function can return several values as a tuple
    def my_func(x, y):
        return (x - 1, y + 2)

    (a, b) = my_func(1, 2)
    # a = 0, b = 4
HTTP requests cont
URL: https://www.google.com/?q=cmsc320&tbs=qdr:m
HTTP GET request:
    GET /?q=cmsc320&tbs=qdr:m HTTP/1.1
    Host: www.google.com
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
The same request with the requests module:
    params = {"q": "cmsc320", "tbs": "qdr:m"}
    r = requests.get("https://www.google.com", params=params)
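To see how a params dictionary ends up in the URL, here is a sketch using only the standard library (no network call is made). It is an illustration of the encoding step, not a description of how requests is implemented internally.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"q": "cmsc320", "tbs": "qdr:m"}

# Encode the dictionary as a query string and attach it to the base URL.
query = urlencode(params)
url = "https://www.google.com/?" + query

# Decoding the query string recovers the original key/value pairs.
decoded = parse_qs(urlsplit(url).query)

print(url)
print(decoded)
```

The colon in qdr:m is percent-encoded as %3A in the URL and decoded back on the other side.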
printing items in list in PYTHON
    # index-based loop
    idx = 0
    while idx < len(arr):
        print(arr[idx])
        idx += 1

    # Pythonic loop
    for element in arr:
        print(element)
compiling regex
If things are going slowly, or you are going to reuse the regular expression, compile it.
    # compile the regular expression "cmsc320"
    regex = re.compile(r"cmsc320")
    # use it repeatedly to search for matches in text
    regex.match(text)    # does the start of text match?
    regex.search(text)   # find the first match, or None
    regex.findall(text)  # find all matches
searching for elements
If we want to find an element in a table without pandas, the search is O(n): we have to scan the whole table.
indexes
Like a hidden sorted map of references to a specific attribute (column) in a table; allows O(log n) lookup instead of O(n). Actually implemented with data structures like B-trees.
But: indexes are not free.
  takes memory to store
  takes time to build
  takes time to update (add/delete a row, update the column)
But, but: one index is (mostly) free. An index will be built automatically on the primary key.
Think before you build/maintain an index on other attributes.
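The idea of an index as a sorted map of references can be sketched with a sorted list and binary search. Real databases use B-trees, so this is only a simplified stand-in; the rows and function names are made up for the example.

```python
import bisect

# Hypothetical table stored in insertion order: (id, name) tuples.
rows = [(17, "Megabyte"), (3, "Fuzz Aldrin"), (42, "Meowly Cyrus"), (8, "Tabby")]

def scan(key):
    """O(n) lookup: look at every row until the key is found."""
    for row in rows:
        if row[0] == key:
            return row
    return None

# The "index": keys kept in sorted order, each paired with its row position.
index = sorted((row[0], pos) for pos, row in enumerate(rows))
keys = [k for k, _ in index]

def lookup(key):
    """O(log n) lookup: binary search over the sorted keys."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return rows[index[i][1]]
    return None

print(scan(42), lookup(42), lookup(5))
```

The cost noted above is visible here too: the index takes extra memory and must be rebuilt or updated whenever rows change.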
numpy arrays cont pt 2
linspace() - creates an array with a specified number of elements, spaced equally between the specified beginning and end values.
    np.linspace(1., 4., 6)
    array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])
random.random(shape) - creates an array of random floats over the half-open interval [0, 1).
    np.random.random((2, 3))
    array([[0.7586, 0.4176, 0.3500],
           [0.7716, 0.0587, 0.9879]])
list
list(range(10)) [0,1,2,3,4,5,6,7,8,9]
matching sequences and repeating characters
match 'a' 0 or 1 times: a?
match 'a' 0 or more times: a*
match 'a' 1 or more times: a+
match 'a' exactly n times: a{n}
match 'a' at least n times: a{n,}
Can match sets of characters or multiple and more elaborate sets and sequences of chars:
match 'a': a
match 'a', 'b', or 'c': [abc]
match any character but 'a', 'b', or 'c': [^abc]
match any digit: \d (= [0-9])
match any alphanumeric character or underscore: \w (= [a-zA-Z0-9_])
match any whitespace: \s (= [ \t\n\r\f\v])
match any character (except a newline, by default): .
Special characters must be escaped: $ . ^ * + ? { } [ ] ( ) | \
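A few of the character classes and repetition operators above, tried out with Python's re module. The sample strings are made up for the example.

```python
import re

text = "CMSC320 meets in room 1101; cmsc 131 does not."

# \d+ : one or more digits
digits = re.findall(r"\d+", text)

# [Cc] etc. are character sets, \s? is optional whitespace, \d{3} is exactly 3 digits
courses = re.findall(r"[Cc][Mm][Ss][Cc]\s?\d{3}", text)

# Special characters such as $ and . must be escaped to match literally
price = re.search(r"\$\d+\.\d{2}", "cost: $12.50")

print(digits)
print(courses)
print(price.group())
```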
tools to fix common problems in messy data
melting, string splitting, casting
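Melting turns column headers that are really values into rows. A library-free sketch of the idea follows; the melt helper, the sample rows, and the column names are hypothetical and only mimic what pandas.melt does for the simple case of one identifier column.

```python
# Hypothetical "wide" table: income brackets appear as column headers.
wide = [
    {"religion": "Agnostic", "<$10k": 27, "$10-20k": 34},
    {"religion": "Atheist", "<$10k": 12, "$10-20k": 27},
]

def melt(rows, id_var, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_var:
                long_rows.append({id_var: row[id_var], var_name: col, value_name: val})
    return long_rows

long = melt(wide, "religion", "income", "freq")
for row in long:
    print(row)
```

After melting, each row is one observation (religion, income bracket, frequency), which is the tidy layout.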
ndarray
ndarray object: an n-dimensional array of homogeneous data types, with many operations being performed in compiled code for performance
NumPy datatypes
The numpy.dtype class includes:
intc (same as a C int) and intp (used for indexing)
int8, int16, int32, int64
uint8, uint16, uint32, uint64
float16, float32, float64
complex64, complex128
bool_, int_, float_, complex_ are shorthand for the defaults
These can be used as functions to cast literals or sequence types, as well as arguments to NumPy functions that accept the dtype keyword argument.
pooling analyses
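The pooled slope estimate is the average of the N imputed estimates: for N = 2, $\beta_{1p} = (\beta_{1,1} + \beta_{1,2}) / 2$, and in general $\beta_{1p} = \frac{1}{N} \sum_{i=1}^{N} \beta_{1,i}$.
The pooled slope variance is $s^2 = \frac{1}{N} \sum_{i=1}^{N} z_i + \left(1 + \frac{1}{N}\right) \frac{1}{N-1} \sum_{i=1}^{N} (\beta_{1,i} - \beta_{1p})^2$, where $z_i$ is the squared standard error (the variance) of the $i$-th imputed slope.
Standard error: take the square root of the pooled variance.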
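The pooling rule can be written as a few lines of Python. This is a sketch under the assumption that each imputed analysis reports a slope and the variance (squared standard error) of that slope; the function name pool and the sample numbers are made up.

```python
import math

def pool(estimates, variances):
    """Pool N imputed slope estimates and their variances into one estimate and standard error."""
    n = len(estimates)
    pooled = sum(estimates) / n                                     # average of the imputed slopes
    within = sum(variances) / n                                     # average within-imputation variance
    between = sum((b - pooled) ** 2 for b in estimates) / (n - 1)   # spread between imputations
    total = within + (1 + 1 / n) * between                          # pooled variance
    return pooled, math.sqrt(total)                                 # standard error is the square root

# Hypothetical results from two imputed datasets.
slope, se = pool([1.0, 3.0], [0.5, 0.5])
print(slope, se)
```

With these inputs the within part is 0.5 and the between part is 2.0, so the pooled variance is 0.5 + 1.5 * 2.0 = 3.5.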
basic idea of python
Present code in the order that the logic and flow of human thoughts demand, not the machine-needed ordering. Combines source code, text explanation, and the end results of running the code.
How a relational DB fits into your workflow
raw input --> Python <--> structured output (trained classifiers, JSON for D3, visualizations)
Python <--> SQLite file (SQL) <--> SQLite CLI & GUI frontend (SQL)
filter
Returns the elements for which a predicate is true. In Python 3, filter returns a lazy iterator, so wrap it in list() to see the values.
    arr = [1, 2, 3, 4, 5, 6, 7]
    list(filter(lambda x: x % 2 == 0, arr))
    [2, 4, 6]
len
Returns the number of items in an enumerable object (a sequence or collection).
    x = len(['c', 'm', 's', 'c', 3, 2, 0])
    # x = 7
scipy
SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. Basically, SciPy contains various tools and functions for solving common problems in scientific computing.
array operations cont
Since multiplication is done element-wise, you need to specifically perform a dot product to perform matrix multiplication.
    a = np.zeros(4).reshape(2, 2)
    a
    array([[0., 0.],
           [0., 0.]])
    a[0, 0] = 1
    a[1, 1] = 1
    b = np.arange(4).reshape(2, 2)
    b
    array([[0, 1],
           [2, 3]])
    a * b
    array([[0., 0.],
           [0., 3.]])
    np.dot(a, b)
    array([[0., 1.],
           [2., 3.]])
indexing
Single-dimension indexing is accomplished as usual; multidimensional arrays take one index per dimension.
    x = np.arange(10)
    x[2]
    2
    x[-2]
    8
    x.shape = (2, 5)
    x[1, 3]
    8
    x[1, -1]
    9
indexing cont
Slicing is possible just as it is in Python sequences.
    x = np.arange(10)
    x[2:5]
    array([2, 3, 4])
    x[:-7]
    array([0, 1, 2])
    x[1:7:2]
    array([1, 3, 5])
    y = np.arange(35).reshape(5, 7)
    y[1:5:2, ::3]
    array([[ 7, 10, 13],
           [21, 24, 27]])
aside:pandas
So this kinda feels like pandas... and pandas kinda feels like a relational data system...
Pandas is not strictly a relational data system: it has no notion of primary/foreign keys.
It does have indexes (and multi-column indexes):
pandas.Index: ordered, sliceable set storing axis labels
pandas.MultiIndex: hierarchical index
Rule of thumb: do heavy, rough lifting at the relational DB level, then fine-grained slicing and dicing and viz with pandas.
hierarchical indexes
Sometimes a more intuitive organization of the data. Makes it easier to understand and analyze higher-dimensional data: instead of a 3-D array, you may only need a 2-D array.
Pandas: series
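Subclass of numpy.ndarray (in early pandas versions; since pandas 0.13 a Series is its own class that wraps an array)
Data: any type
Index labels need not be ordered
Duplicate labels are possible, but result in reduced functionality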
HTML
The specification is fairly pure. We'll use BeautifulSoup:
    conda install -c asmeurer beautiful-soup=4.3.2
    import requests
    from bs4 import BeautifulSoup
    r = requests.get("https://cs.umd.edu/class/summer/cmsc320/")
    root = BeautifulSoup(r.content, "html.parser")
    root.find_all("a")  # links for CMSC320
Array Operations cont
There are also some built-in methods of ndarray objects. Universal functions that may also be applied include exp, sqrt, add, sin, cos, etc.
    a = np.random.random((2, 3))
    a
    array([[0.682, 0.989, 0.694],
           [0.788, 0.622, 0.405]])
    a.sum()
    4.1807
    a.min()
    0.405
    a.max(axis=0)
    array([0.788, 0.989, 0.694])
    a.min(axis=1)
    array([0.682, 0.405])
axis=0 reduces down the rows, giving one result per column; axis=1 reduces across the columns, giving one result per row.
5 ways to get Data
Direct download and load from local storage
Generate locally via downloaded code (e.g. simulation)
Query data from a database
Query an API from the intra/internet
Scrape data from a webpage
delete row(s) from the table
    # delete row(s) from the table
    cursor.execute("DELETE FROM cats WHERE id = 2")
    conn.commit()
Regular Expressions Cont
    # does the start of text match "cmsc320"?
    match = re.match(r"cmsc320", text)
    # iterate over all matches for "cmsc320" in text
    for match in re.finditer(r"cmsc320", text):
        print(match.start())
    # find all matches
    matches = re.findall(r"cmsc320", text)
crash course in SQL (in Python)
    # make a table
    cursor.execute("""
        CREATE TABLE cats (
            id INTEGER PRIMARY KEY,
            name TEXT
        )""")
Capitalization doesn't matter for SQL reserved words: SELECT = select = SeLeCt.
Rule of thumb: capitalize keywords for readability.
downloading a bunch of files cont
    # cycle through the href for each anchor, checking to see if it's a PDF/PPTX link or not
    for lnk in links:
        href = lnk['href']
        # if it's a PDF/PPTX link, queue a download
        if href.lower().endswith(('.pdf', '.pptx')):
            urld = urlparse.urljoin(url, href)  # urlparse must be the module (urllib.parse in Python 3)
            rd = requests.get(urld, stream=True)
            # write the downloaded file to disk (needs: from os import path; outbase is the output directory)
            outfile = path.join(outbase, href)
            with open(outfile, 'wb') as f:
                f.write(rd.content)
more complicated example cont
    # cleaning out unnecessary rows (do this first: casting NaN to int fails in current pandas)
    df = df.dropna()
    # formatting
    df["week"] = df["week"].str.extract(r"(\d+)", expand=False).astype(int)
    df["rank"] = df["rank"].astype(int)
    # create "date" column
    df["date"] = pd.to_datetime(df["date.entered"]) + pd.to_timedelta(df["week"], unit="w") - pd.DateOffset(weeks=1)
Inserting into table
    # insert into the table
    cursor.execute("INSERT INTO cats VALUES (1, 'Megabyte')")
    cursor.execute("INSERT INTO cats VALUES (2, 'Meowly Cyrus')")
    cursor.execute("INSERT INTO cats VALUES (3, 'Fuzz Aldrin')")
    conn.commit()
more complicated example
    # keep identifier variables
    id_vars = ["year", "artist.inverted", "track", "time", "genre", "date.entered", "date.peaked"]
    # melt the rest into week and rank columns
    df = pd.melt(frame=df, id_vars=id_vars, var_name="week", value_name="rank")
reading rows
    # read all rows from a table
    for row in cursor.execute("SELECT * FROM cats"):
        print(row)
    # read all rows into a pandas DataFrame
    pd.read_sql_query("SELECT * FROM cats", conn, index_col="id")
melting data
    f_df = pd.melt(df, ["religion"], var_name="income", value_name="freq")
    f_df = f_df.sort_values(by=["religion"])
    f_df.head(10)
linear algebra in numpy
    from numpy import *
    from numpy.linalg import *
    a = array([[1.0, 2.0], [3.0, 4.0]])
    a.transpose()
    array([[1., 3.],
           [2., 4.]])
    inv(a)
    array([[-2. ,  1. ],
           [ 1.5, -0.5]])
downloading a bunch of files
    import re
    import requests
    from bs4 import BeautifulSoup
    try:
        import urllib.parse as urlparse  # Python 3
    except ImportError:
        import urlparse  # Python 2
    # HTTP GET request sent to the URL url
    r = requests.get(url)
    # use BeautifulSoup to parse the GET response
    root = BeautifulSoup(r.content, "html.parser")
    links = root.find("div", id="schedule").find("table").find("tbody").find_all("a")
Crash Course in SQL (in Python)
    import sqlite3
    # create a database and connect to it
    conn = sqlite3.connect("cmsc320.db")
    cursor = conn.cursor()
    conn.close()
Cursor: temporary work area in system memory for manipulating SQL statements and return values.
If you close the connection (conn.close()) without committing, any outstanding transaction is rolled back.
SQL join visual
Inner join: only the keys present in both tables
Full join: all rows from both the left and right tables appear in the new table
Left join: all left table values, plus the values present in both
Right join: all right table values, plus the values present in both
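The difference between an inner join and a left join can be seen with two tiny in-memory tables. The table and column names are hypothetical; right and full joins are left out of this sketch because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cats (id INTEGER PRIMARY KEY, name TEXT, owner_id INTEGER);
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO cats VALUES (1, 'Megabyte', 1), (2, 'Fuzz Aldrin', 3);
""")

# Inner join: only keys present in both tables (Ann owns Megabyte).
inner = conn.execute("""
    SELECT p.name, c.name FROM person p
    JOIN cats c ON c.owner_id = p.id
    ORDER BY p.id
""").fetchall()

# Left join: every person is kept; missing cats are padded with NULL (None in Python).
left = conn.execute("""
    SELECT p.name, c.name FROM person p
    LEFT JOIN cats c ON c.owner_id = p.id
    ORDER BY p.id
""").fetchall()

print(inner)
print(left)
```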
printing items in list in JAVA
    int[] arr = new int[10];
    for (int idx = 0; idx < arr.length; ++idx) {
        System.out.println(arr[idx]);
    }
one scipy example
Integral of sin(x) dx from a to b. We have a function object: np.sin defines the sine function for us. We can compute the definite integral from a to b using the quad function.
    res = scipy.integrate.quad(np.sin, 0, np.pi)
    print(res)
    (2.0, 2.22e-14)  # 2, with a very small error margin
    res = scipy.integrate.quad(np.sin, -np.inf, +np.inf)
    print(res)
    (0.0, 0.0)  # integral does not converge
Many-to-many
Keep track of cats' colors. One column per color means too many columns and too many nulls; instead, create a color_id.
Regular expressions
    "filename.pdf".endswith((".pdf", ".pptx"))
    "fiLNmae.pDf".lower().endswith((".pdf", ".pptx"))
Used to search for specific elements, or groups of elements, that match a pattern.
    # find the index of the 1st occurrence of "cmsc320"
    import re
    match = re.search(r"cmsc320", text)
    print(match.start())
merge operations
Merge or join: combine rows/tuples across two tables if they have the same key. Outer joins can be used to "pad" IDs that don't appear in both tables, with the missing values padded with NaN. Three variants: LEFT, RIGHT, FULL. This is SQL terminology, but pandas has these operations as well.
RESTful API status code
200: request was successful
201: a new resource was created
202: request was accepted for processing, but no modification has been made yet
204: request was successful, but the response has no content
400: request was malformed
401: client is unauthorized
404: requested resource was not found
415: requested data format is not supported
422: requested data format had missing data
500: server threw an error while processing the request
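Python's standard library already knows the official names of these codes. The sketch below uses http.HTTPStatus to label a code and sort it into a broad class; the describe helper is a made-up name for the example.

```python
from http import HTTPStatus

def describe(code):
    """Return the code, its standard reason phrase, and its broad class."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {status.phrase}: {kind}"

for code in (200, 201, 404, 500):
    print(describe(code))
```

In client code the first digit is usually what matters: 2xx means the request worked, 4xx means the request should be fixed, and 5xx means the server failed.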
CSV files
Any CSV reader worth anything can parse files with any delimiter, not just ',' (e.g. "TSV": tab-separated).
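As a quick illustration, Python's csv module reads tab-separated text when given a different delimiter. The sample data is made up, and io.StringIO stands in for a file on disk.

```python
import csv
import io

# Hypothetical tab-separated content, held in memory instead of a file.
tsv_text = "course\troom\nCMSC320\t1101\n"

reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

print(rows)
```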
associative tables
Cats in one table and colors in another; an associative table then combines them, holding cat_id and color_id pairs in one table.
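A sketch of the cats/colors associative table in SQLite. The table name cat_colors and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cats (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE colors (id INTEGER PRIMARY KEY, name TEXT);
    -- associative table: one row per (cat, color) pair
    CREATE TABLE cat_colors (
        cat_id INTEGER REFERENCES cats(id),
        color_id INTEGER REFERENCES colors(id),
        PRIMARY KEY (cat_id, color_id)
    );
    INSERT INTO cats VALUES (1, 'Megabyte'), (2, 'Fuzz Aldrin');
    INSERT INTO colors VALUES (1, 'black'), (2, 'white');
    INSERT INTO cat_colors VALUES (1, 1), (1, 2), (2, 2);
""")

# Join through the associative table to list each cat with each of its colors.
pairs = conn.execute("""
    SELECT cats.name, colors.name
    FROM cat_colors
    JOIN cats ON cats.id = cat_colors.cat_id
    JOIN colors ON colors.id = cat_colors.color_id
    ORDER BY cats.id, colors.id
""").fetchall()

print(pairs)
```

One cat can have many colors and one color can belong to many cats, with no null-filled color columns.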
Data manipulation and computation
Data representation (a natural way to think about data): one-dimensional, like an array or vector, or n-dimensional, like an array or matrix.
Operations:
indexing, slicing, filter
map --> apply a function to every element
reduce/aggregate --> combine values to get a single scalar (sum, median)
given two vectors: dot and cross products
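These vector operations can be sketched in plain Python before reaching for NumPy. The vectors v and w are arbitrary sample values.

```python
from functools import reduce

v = [1, 2, 3]
w = [4, 5, 6]

# map: apply a function to every element
doubled = list(map(lambda x: 2 * x, v))

# reduce/aggregate: combine all values into a single scalar
total = reduce(lambda acc, x: acc + x, v, 0)

# dot product: sum of element-wise products
dot = sum(a * b for a, b in zip(v, w))

# cross product of two 3-D vectors
cross = [
    v[1] * w[2] - v[2] * w[1],
    v[2] * w[0] - v[0] * w[2],
    v[0] * w[1] - v[1] * w[0],
]

print(doubled, total, dot, cross)
```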
pandas: dataframes
Each column can have a different type
Row and column index
Mutable size: insert and delete columns
Note the use of the word "index" for what we called "key"; relational databases use "index" to mean something else
Non-unique index values are allowed, but may raise an exception for some operations
So we've queried a server using a well-formed GET request via the requests Python module. What comes back?
General structured data:
Comma-separated values (CSV) files and strings
JavaScript Object Notation (JSON) files and strings
HTML, XHTML, XML files and strings
Domain-specific structured data:
Shapefiles: geospatial vector data (OpenStreetMap)
RVT files: architectural planning (Autodesk Revit)
scipy.integrate
Let's say we do not have a function object; we only have some (x, y) samples that "define" our function. We can estimate the integral using the trapezoidal rule.
    sample_x = np.linspace(0, np.pi, 1000)
    sample_y = np.sin(sample_x)  # creating 1000 samples
    result = scipy.integrate.trapezoid(sample_y, sample_x)  # named trapz in older SciPy releases
    print(result)
    1.99999
    sample_x = np.linspace(0, np.pi, 1000000)
    sample_y = np.sin(sample_x)  # creating a million samples
    result = scipy.integrate.trapezoid(sample_y, sample_x)
    print(result)
    2.0
Tidy Data
Names of files/DataFrames = description of one dataset Enforce one data type per dataset (ish)
difference between NumPy arrays and Python sequences
NumPy arrays have a fixed size. Modifying the size means creating a new array.
The elements of a NumPy array must all be of the same datatype. This can include Python objects, but then you may not get the performance benefits.
More efficient mathematical operations than built-in sequence types.
Authentication and OAUTH
Old and busted:
    r = requests.get("https://api.github.com/user", auth=("nayeemz", "database name"))
New approach: OAuth grants access tokens that give possibly incomplete access to a user or app without exposing a password.
Python 3
Python 3 is intentionally backwards incompatible (but not that incompatible). Biggest changes from Python 2:
print "statement" --> print("function")
1/2 = 0 --> 1/2 = 0.5 and 1//2 = 0
ASCII str default --> Unicode str default
Namespace ambiguity fixed: a list-comprehension variable no longer leaks into the enclosing scope
    i = 1
    [i for i in range(5)]
    print(i)  # Python 2 prints 4; Python 3 prints 1
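Two of these changes are easy to check in a Python 3 interpreter. A small sketch, with variable names chosen only for illustration:

```python
# Division: / is true division, // is floor division.
half = 1 / 2
floor_half = 1 // 2

# Comprehension scope: the loop variable inside the comprehension
# does not overwrite the outer variable of the same name.
i = 1
squares = [i * i for i in range(5)]

print(half, floor_half)
print(i, squares)
```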
Python vs R
Python is a "full" programming language: easier to integrate with systems in the field
R has a more mature set of pure stats libraries
Python is catching up, and is ahead for ML (machine learning)
Python is used more in the tech industry
JSON in python
Some built-in types: "Strings", 1.0, True, False, None
Lists: ["Goodbye", "cruel", "world"]
Dictionaries: {"hello": "bonjour", "goodbye": "au revoir"}
enumerate
    list(enumerate(["311", "320", "330"]))
    [(0, "311"), (1, "320"), (2, "330")]
CSV code
    import csv
    # Python 3: open in text mode with newline="" (Python 2 used "rb")
    with open("schedule.csv", newline="") as f:
        reader = csv.reader(f, delimiter=",", quotechar='"')
        for row in reader:
            print(row)
parsing JSON in python
    import json
    import requests
    r = requests.get('https://api.github.com/search/repositories', params={'q': 'users'})
    data = json.loads(r.content)
    json.load(some_file)            # loads JSON from a file
    json.dump(some_obj, some_file)  # writes JSON to a file
    json.dumps(json_obj)            # returns a JSON string
printing arrays numpy
    import numpy as np
    a = np.arange(3)
    print(a)
    [0 1 2]
    a
    array([0, 1, 2])
    b = np.arange(9).reshape(3, 3)
    print(b)
    [[0 1 2]
     [3 4 5]
     [6 7 8]]
    c = np.arange(8).reshape(2, 2, 2)
    print(c)
    [[[0 1]
      [2 3]]

     [[4 5]
      [6 7]]]
NumPy datatype code
    import numpy as np
    x = np.float32(1.0)
    x
    1.0
    y = np.int_([1, 2, 4])
    y
    array([1, 2, 4])
    z = np.arange(3, dtype=np.uint8)
    z
    array([0, 1, 2], dtype=uint8)
    z.dtype
    dtype('uint8')
SQLite
On-disk relational database management system (RDBMS). Applications connect directly to a file.
Most RDBMSs have applications connect to a server:
Advantages include greater concurrency and less restrictive locking
Disadvantages include, for this class, setup time
All interactions use Structured Query Language (SQL).
data processing operations
Take one or more datasets as input and produce one or more datasets as output.
one-to-many and many-to-one
one person can have one nationality in this example, but one nationality can include many people
indexing cont
Using fewer indices than dimensions results in a subarray:
    x = np.arange(10)
    x.shape = (2, 5)
    x[0]
    array([0, 1, 2, 3, 4])
This means that x[i, j] == x[i][j], but the second form is less efficient because it first creates a temporary subarray.