Introduction to NumPy

While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary-length integers, etc.) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library has facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulation (an essential building block of virtually all technical computing), or any data visualization facilities.

In particular, Python lists are very flexible containers that can be nested arbitrarily deep and which can hold any Python object in them, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices. In contrast, much of our modern heritage of scientific computing has been built on top of libraries written in the Fortran language, which has native support for vectors and matrices as well as a library of mathematical functions that can efficiently operate on entire arrays at once.
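The contrast described above can be sketched with a minimal example (the variable names are illustrative, not from the text): the list version loops element by element in Python, while the NumPy version applies one operation to the whole array at once.

```python
import numpy as np

# Pure-Python approach: loop over a list element by element
lst = [1.0, 2.0, 3.0, 4.0]
doubled_list = [2 * v for v in lst]

# NumPy approach: one vectorized expression over the whole array
arr = np.array(lst)
doubled_arr = 2 * arr

print(doubled_list)   # [2.0, 4.0, 6.0, 8.0]
print(doubled_arr)    # [2. 4. 6. 8.]
```

For small lists the difference is cosmetic; for large arrays the vectorized form is both shorter and dramatically faster, because the loop runs in compiled code.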

1. Basics of Numpy arrays

We now turn our attention to the NumPy library, which forms the base layer for the entire SciPy ecosystem. Once you have installed NumPy, you can import it as

In [1]:
import numpy
numpy.__version__
Out[1]:
'1.11.1'

For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later. By convention, you'll find that most people in the SciPy/PyData world will import NumPy using np as an alias:

In [2]:
import numpy as np
In [3]:
np?
In [4]:
lst = [10, 20, 30, 40]
In [5]:
type(lst)
Out[5]:
list
In [6]:
arr = np.array([10, 20, 30, 40])
In [7]:
arr
Out[7]:
array([10, 20, 30, 40])

Data Types

A Python List Is More Than Just a List

Let's consider now what happens when we use a Python data structure that holds many Python objects. The standard mutable multi-element container in Python is the list. We can create a list of integers as follows:

In [8]:
L = list(range(10))
L
Out[8]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [9]:
type(L[0])
Out[9]:
int

Or, similarly, a list of strings:

In [10]:
L2 = [str(c) for c in L]
L2
Out[10]:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
In [11]:
type(L2[0])
Out[11]:
str

Because of Python's dynamic typing, we can even create heterogeneous lists:

In [12]:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]
Out[12]:
[bool, str, float, int]

Fixed-Type Arrays in Python

Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type:

In [13]:
import array
L = list(range(10))
A = array.array('i', L)
A
Out[13]:
array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [14]:
# integer array:
np.array([1, 4, 2, 5, 3])
Out[14]:
array([1, 4, 2, 5, 3])
In [15]:
np.array([1, 2, 3, 4], dtype='float32')
Out[15]:
array([ 1.,  2.,  3.,  4.], dtype=float32)
In [16]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
Out[16]:
array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])
In [17]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
Out[17]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [18]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)
Out[18]:
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])
In [19]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)
Out[19]:
array([[ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14],
       [ 3.14,  3.14,  3.14,  3.14,  3.14]])
In [20]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)
Out[20]:
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
In [21]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
Out[21]:
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
In [22]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))
Out[22]:
array([[ 0.21289378,  0.52487205,  0.25849942],
       [ 0.90209672,  0.04651644,  0.19503496],
       [ 0.56888404,  0.6758118 ,  0.64613977]])
In [23]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
Out[23]:
array([[ 0.65783253,  0.99990118, -0.09419671],
       [-0.91003253,  0.91356762, -0.56415596],
       [ 0.31124371, -1.5067149 ,  0.26336399]])
In [24]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))
Out[24]:
array([[5, 4, 4],
       [9, 1, 3],
       [6, 3, 2]])
In [25]:
# Create a 3x3 identity matrix
np.eye(3)
Out[25]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
In [26]:
# Create an uninitialized array of three floating-point values
# The values will be whatever happens to already exist at that memory location
np.empty(3)
Out[26]:
array([ 1.,  1.,  1.])

NumPy Standard Data Types

NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string:

np.zeros(10, dtype='int16')

Or using the associated NumPy object:

np.zeros(10, dtype=np.int16)
Data type Description
bool_ Boolean (True or False) stored as a byte
int_ Default integer type (same as C long; normally either int64 or int32)
intc Identical to C int (normally int32 or int64)
intp Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8 Byte (-128 to 127)
int16 Integer (-32768 to 32767)
int32 Integer (-2147483648 to 2147483647)
int64 Integer (-9223372036854775808 to 9223372036854775807)
uint8 Unsigned integer (0 to 255)
uint16 Unsigned integer (0 to 65535)
uint32 Unsigned integer (0 to 4294967295)
uint64 Unsigned integer (0 to 18446744073709551615)
float_ Shorthand for float64.
float16 Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32 Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64 Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_ Shorthand for complex128.
complex64 Complex number, represented by two 32-bit floats
complex128 Complex number, represented by two 64-bit floats

More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the NumPy documentation.
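As a minimal sketch of that more advanced specification: a `<` or `>` prefix on a dtype string selects little- or big-endian byte order (the variable names here are illustrative).

```python
import numpy as np

# '<i4' is a little-endian 32-bit integer, '>i4' a big-endian one
a = np.arange(3, dtype='<i4')
b = a.astype('>i4')   # same values, opposite byte order

print(a.tolist(), b.tolist())   # the values compare equal either way
print(b.dtype.byteorder)        # '>'
```

Byte order matters mainly when exchanging binary data with other systems or reading files written on a different architecture; for ordinary in-memory work the native order is used automatically.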

Element indexing

Elements of a one-dimensional array are accessed with the same syntax as a list:

In [27]:
lst[0]
Out[27]:
10
In [28]:
arr[0]
Out[28]:
10
In [29]:
arr[-1]
Out[29]:
40
In [30]:
arr[2:]
Out[30]:
array([30, 40])

Differences between arrays and lists

The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:

In [31]:
lst[-1] = 'a string inside a list'
lst
Out[31]:
[10, 20, 30, 'a string inside a list']

but the same cannot be done with an array; instead, we get an error message:

In [32]:
arr[-1] = 'a string inside an array'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-29c0bfa5fa8a> in <module>()
----> 1 arr[-1] = 'a string inside an array'

ValueError: invalid literal for int() with base 10: 'a string inside an array'

Array Attributes

The information about the type of an array is contained in its dtype attribute:

In [33]:
arr
Out[33]:
array([10, 20, 30, 40])
In [34]:
np.mean(arr)
Out[34]:
25.0
In [35]:
arr.mean()
Out[35]:
25.0
In [36]:
?arr.dtype

Once an array has been created, its dtype is fixed and it can only store elements of the same type. For this example where the dtype is integer, if we store a floating point number it will be automatically converted into an integer:

In [37]:
arr[-1] = 1.234
arr
Out[37]:
array([10, 20, 30,  1])

Creating Arrays

In [38]:
?np.zeros
In [39]:
np.zeros(5, dtype=float)
Out[39]:
array([ 0.,  0.,  0.,  0.,  0.])
In [40]:
np.zeros(3, dtype=int)
Out[40]:
array([0, 0, 0])
In [41]:
np.zeros(3, dtype=complex)
Out[41]:
array([ 0.+0.j,  0.+0.j,  0.+0.j])

and similarly for ones:

In [42]:
print('5 ones:', np.ones(5))
5 ones: [ 1.  1.  1.  1.  1.]

If we want an array initialized with an arbitrary value, we can create an empty array and then use the fill method to put the value we want into the array:

In [43]:
a = np.empty(4)
a.fill(5.5)
a
Out[43]:
array([ 5.5,  5.5,  5.5,  5.5])

Defining various sequences

NumPy also offers the arange function, which works like the built-in range but returns an array instead of a list:

In [44]:
np.arange(5)
Out[44]:
array([0, 1, 2, 3, 4])
In [45]:
np.arange(1, 100, 3)
Out[45]:
array([ 1,  4,  7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49,
       52, 55, 58, 61, 64, 67, 70, 73, 76, 79, 82, 85, 88, 91, 94, 97])

and the linspace and logspace functions to create linearly and logarithmically spaced grids, respectively, with a fixed number of points and including both ends of the specified interval:

In [46]:
print("A linear grid between 0 and 1:")
print(np.linspace(0, 1, 5))
A linear grid between 0 and 1:
[ 0.    0.25  0.5   0.75  1.  ]
In [47]:
print("A logarithmic grid between 10**1 and 10**3:")
print(np.logspace(1, 3, 4))
A logarithmic grid between 10**1 and 10**3:
[   10.            46.41588834   215.443469    1000.        ]

Creating random arrays

  • np.random
In [48]:
x = np.random.randn(100)
In [49]:
np.random.randn(5)
Out[49]:
array([ 1.69608682,  0.34324429,  0.01655629, -1.42253019, -0.74848612])

whereas this will also give 5 samples, but from a normal distribution with a mean of 10 and a standard deviation of 3:

In [50]:
norm10 = np.random.normal(10, 3, 5)
norm10
Out[50]:
array([ 10.06548702,   3.59663134,   8.16978462,  11.03164637,   6.56228173])
In [51]:
samples = np.random.binomial(10, 0.4, 50)
samples
Out[51]:
array([4, 2, 4, 6, 3, 2, 4, 4, 5, 4, 5, 5, 4, 4, 2, 5, 6, 6, 4, 4, 3, 7, 3,
       4, 3, 5, 3, 4, 3, 4, 3, 7, 8, 4, 5, 3, 3, 6, 5, 6, 5, 3, 4, 2, 3, 3,
       6, 4, 3, 5])

Indexing with other arrays

Above we saw how to index arrays with single numbers and slices, just like Python lists. But arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particularly useful to extract information from an array that matches a certain condition.

Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is true or false:

In [52]:
mask = norm10 > 9 
mask
Out[52]:
array([ True, False, False,  True, False], dtype=bool)

Now that we have this mask, we can use it to either read those values or to reset them to 0:

In [53]:
print('Values above 9:', norm10[mask])
Values above 9: [ 10.06548702  11.03164637]
In [54]:
print('Resetting all values above 9 to 0...')
norm10[mask] = 0
print(norm10)
Resetting all values above 9 to 0...
[ 0.          3.59663134  8.16978462  0.          6.56228173]

2. Arrays with more than one dimension

  • Multidimensional arrays
In [55]:
lst2 = [[1, 2], [3, 4]]
lst2
arr2 = np.array([[1, 2], [3, 4]])
arr2
Out[55]:
array([[1, 2],
       [3, 4]])
In [56]:
print(lst2[0][1])
print(arr2[0,1])
2
2
In [57]:
import numpy as np
In [58]:
np.zeros((2,3))
Out[58]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [59]:
arr = np.random.normal(10, 3, (2, 4))
arr
Out[59]:
array([[  6.65704412,  11.73090744,   9.8732487 ,  10.62893331],
       [  9.3844521 ,  16.07447658,  10.41627934,  13.27352316]])
In [60]:
arr.reshape(4,2)
Out[60]:
array([[  6.65704412,  11.73090744],
       [  9.8732487 ,  10.62893331],
       [  9.3844521 ,  16.07447658],
       [ 10.41627934,  13.27352316]])

In fact, the shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:

In [61]:
arr = np.arange(8).reshape(2, 4)
arr
Out[61]:
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

Views, not Copies

Note that reshaping (like most numpy operations) provides a view of the same memory:

In [62]:
arr = np.arange(8)
arr2 = arr.reshape(2, 4)

arr[3] = 1000
print(arr)
print(arr2)
[   0    1    2 1000    4    5    6    7]
[[   0    1    2 1000]
 [   4    5    6    7]]

This lack of copying allows for very efficient vectorized operations.
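The same view behavior applies to slices: a sketch (with illustrative variable names) showing that writing through a slice modifies the original array's memory.

```python
import numpy as np

arr = np.arange(6)
view = arr[2:5]   # a slice is a view, not a copy

# writing through the view changes the original array
view[:] = -1
print(arr)   # [ 0  1 -1 -1 -1  5]
```

This is why the copy() method shown next is sometimes needed: without it, modifications to a subarray propagate back to the parent.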

Copies

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the copy() method:

In [63]:
arr3_copy = arr2[:2,:2].copy()
print (arr3_copy)
[[0 1]
 [4 5]]
In [64]:
arr3_copy[0,0] = 33
print (arr3_copy)
[[33  1]
 [ 4  5]]
In [65]:
print (arr2)
[[   0    1    2 1000]
 [   4    5    6    7]]

Slices

With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions:

In [66]:
print('Slicing in the second row:', arr2[1, 2:4])
print('All rows, third column   :', arr2[:, 2])
Slicing in the second row: [6 7]
All rows, third column   : [2 6]

If you only provide one index, then you will get an array with one fewer dimension, containing that row:

In [67]:
print('First row:  ', arr2[0])
print('Second row: ', arr2[1])
First row:   [   0    1    2 1000]
Second row:  [4 5 6 7]

Array Properties and Methods

In [68]:
arr = arr2
In [69]:
print('Data type                :', arr.dtype)
print('Total number of elements :', arr.size)
print('Number of dimensions     :', arr.ndim)
print('Shape (dimensionality)   :', arr.shape)
print('Memory used (in bytes)   :', arr.nbytes)
Data type                : int64
Total number of elements : 8
Number of dimensions     : 2
Shape (dimensionality)   : (2, 4)
Memory used (in bytes)   : 64

Arrays also have many useful methods, some especially useful ones are:

In [70]:
print('Minimum and maximum             :', arr.min(), arr.max())
print('Sum and product of all elements :', arr.sum(), arr.prod())
print('Mean and standard deviation     :', arr.mean(), arr.std())
Minimum and maximum             : 0 1000
Sum and product of all elements : 1025 0
Mean and standard deviation     : 128.125 329.545686324

For these methods, the above operations are all computed on all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension by passing the axis parameter; for example:

In [71]:
?arr.sum
In [72]:
?np.sum
In [73]:
?arr.mean
In [74]:
arr
Out[74]:
array([[   0,    1,    2, 1000],
       [   4,    5,    6,    7]])
In [75]:
print('For the following array:\n', arr)
print('The sum of elements along the rows is    :', arr.sum(axis=1))
print('The sum of elements along the columns is :', arr.sum(axis=0))
For the following array:
 [[   0    1    2 1000]
 [   4    5    6    7]]
The sum of elements along the rows is    : [1003   22]
The sum of elements along the columns is : [   4    6    8 1007]
In [76]:
np.zeros((3,4,5,6)).sum(2).shape
Out[76]:
(3, 4, 6)

Another widely used property of arrays is the .T attribute, which allows you to access the transpose of the array:

In [77]:
print('Array:\n', arr)
print('Transpose:\n', arr.T)
Array:
 [[   0    1    2 1000]
 [   4    5    6    7]]
Transpose:
 [[   0    4]
 [   1    5]
 [   2    6]
 [1000    7]]

Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.

Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument.

In [78]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
Out[78]:
array([1, 2, 3, 3, 2, 1])
In [79]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1  2  3  3  2  1 99 99 99]
In [80]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
# concatenate along the first axis
np.concatenate([grid, grid])
Out[80]:
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])
In [81]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
Out[81]:
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
In [82]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])
Out[82]:
array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])
In [83]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])
Out[83]:
array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])

Similarly, np.dstack will stack arrays along the third axis.
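A minimal sketch of np.dstack (array names are illustrative): two 2x2 arrays are stacked along a new third axis, so corresponding elements end up paired.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

stacked = np.dstack([a, b])   # stack along the third (depth) axis
print(stacked.shape)          # (2, 2, 2)
print(stacked[0, 0])          # [1 5] -- element [0,0] of a paired with [0,0] of b
```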

Splitting of arrays

The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit.

In [84]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
In [85]:
grid = np.arange(16).reshape((4, 4))
grid
Out[85]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In [86]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]
In [87]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)
[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]

Similarly, np.dsplit will split arrays along the third axis.
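And a matching sketch for np.dsplit (variable names are illustrative): a three-dimensional array is split along its third axis at the given index.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
left, right = np.dsplit(arr, [2])   # split along axis 2 at index 2

print(left.shape)    # (2, 2, 2)
print(right.shape)   # (2, 2, 2)
```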

More array properties

We don't have time here to look at all the methods and properties of arrays, but here's a complete list. Simply try exploring some of these in IPython to learn more, or read their descriptions in the full NumPy documentation:

arr.T             arr.copy          arr.getfield      arr.put           arr.squeeze
arr.all           arr.ctypes        arr.imag          arr.ravel         arr.std
arr.any           arr.cumprod       arr.item          arr.real          arr.strides
arr.argmax        arr.cumsum        arr.itemset       arr.repeat        arr.sum
arr.argmin        arr.data          arr.itemsize      arr.reshape       arr.swapaxes
arr.argsort       arr.diagonal      arr.max           arr.resize        arr.take
arr.astype        arr.dot           arr.mean          arr.round         arr.tofile
arr.base          arr.dtype         arr.min           arr.searchsorted  arr.tolist
arr.byteswap      arr.dump          arr.nbytes        arr.setasflat     arr.tostring
arr.choose        arr.dumps         arr.ndim          arr.setfield      arr.trace
arr.clip          arr.fill          arr.newbyteorder  arr.setflags      arr.transpose
arr.compress      arr.flags         arr.nonzero       arr.shape         arr.var
arr.conj          arr.flat          arr.prod          arr.size          arr.view
arr.conjugate     arr.flatten       arr.ptp           arr.sort          

3. Operating with arrays

Arrays support all regular arithmetic operators, and the numpy library also contains a complete collection of basic mathematical functions that operate on arrays.

In [88]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print(arr1, '+', arr2, '=', arr1+arr2)
[0 1 2 3] + [10 11 12 13] = [10 12 14 16]
In [89]:
print(arr1, '*', arr2, '=', arr1*arr2)
[0 1 2 3] * [10 11 12 13] = [ 0 11 24 39]
In [90]:
1.5 * arr1
Out[90]:
array([ 0. ,  1.5,  3. ,  4.5])

Computation on NumPy Arrays: Universal Functions

In [91]:
import numpy as np
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)
Out[91]:
array([ 0.16666667,  1.        ,  0.25      ,  0.25      ,  0.125     ])
In [92]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)
1 loop, best of 3: 2.58 s per loop
In [93]:
print(compute_reciprocals(values))
print(1.0 / values)
[ 0.16666667  1.          0.25        0.25        0.125     ]
[ 0.16666667  1.          0.25        0.25        0.125     ]

The following table lists the arithmetic operators implemented in NumPy:

Operator Equivalent ufunc Description
+ np.add Addition (e.g., 1 + 1 = 2)
- np.subtract Subtraction (e.g., 3 - 2 = 1)
- np.negative Unary negation (e.g., -2)
* np.multiply Multiplication (e.g., 2 * 3 = 6)
/ np.divide Division (e.g., 3 / 2 = 1.5)
// np.floor_divide Floor division (e.g., 3 // 2 = 1)
** np.power Exponentiation (e.g., 2 ** 3 = 8)
% np.mod Modulus/remainder (e.g., 9 % 4 = 1)
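The correspondence in the table above can be verified directly: each operator expression produces exactly the same result as a call to its ufunc (a small sketch, with illustrative variable names).

```python
import numpy as np

x = np.arange(5)

# Each arithmetic operator is shorthand for the corresponding ufunc call
same_add = np.array_equal(x + 2, np.add(x, 2))
same_pow = np.array_equal(x ** 2, np.power(x, 2))
same_mod = np.array_equal(x % 3, np.mod(x, 3))

print(same_add, same_pow, same_mod)   # True True True
```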
In [94]:
x = np.array([-2, -1, 0, 1, 2])
abs(x)
Out[94]:
array([2, 1, 0, 1, 2])
In [95]:
np.absolute(x)
Out[95]:
array([2, 1, 0, 1, 2])
In [96]:
np.abs(x)
Out[96]:
array([2, 1, 0, 1, 2])
In [97]:
x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
np.abs(x)
Out[97]:
array([ 5.,  5.,  2.,  1.])

Trigonometric functions

NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist are the trigonometric functions.

In [98]:
theta = np.linspace(0, np.pi, 3)
In [99]:
print("theta      = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
theta      =  [ 0.          1.57079633  3.14159265]
sin(theta) =  [  0.00000000e+00   1.00000000e+00   1.22464680e-16]
cos(theta) =  [  1.00000000e+00   6.12323400e-17  -1.00000000e+00]
tan(theta) =  [  0.00000000e+00   1.63312394e+16  -1.22464680e-16]
In [100]:
x = [-1, 0, 1]
print("x         = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
x         =  [-1, 0, 1]
arcsin(x) =  [-1.57079633  0.          1.57079633]
arccos(x) =  [ 3.14159265  1.57079633  0.        ]
arctan(x) =  [-0.78539816  0.          0.78539816]

Exponents and logarithms

Exponentials are another common type of operation available as NumPy ufuncs:

In [101]:
x = [1, 2, 3]
print("x     =", x)
print("e^x   =", np.exp(x))
print("2^x   =", np.exp2(x))
print("3^x   =", np.power(3, x))
x     = [1, 2, 3]
e^x   = [  2.71828183   7.3890561   20.08553692]
2^x   = [ 2.  4.  8.]
3^x   = [ 3  9 27]
In [102]:
x = [1, 2, 4, 10]
print("x        =", x)
print("ln(x)    =", np.log(x))
print("log2(x)  =", np.log2(x))
print("log10(x) =", np.log10(x))
x        = [1, 2, 4, 10]
ln(x)    = [ 0.          0.69314718  1.38629436  2.30258509]
log2(x)  = [ 0.          1.          2.          3.32192809]
log10(x) = [ 0.          0.30103     0.60205999  1.        ]
In [103]:
x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
exp(x) - 1 = [ 0.          0.0010005   0.01005017  0.10517092]
log(1 + x) = [ 0.          0.0009995   0.00995033  0.09531018]

Specialized ufuncs

NumPy has many more ufuncs available, including hyperbolic trig functions, bitwise arithmetic, comparison operators, conversions from radians to degrees, rounding and remainders, and much more.
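A quick taste of two of the categories just mentioned, hyperbolic trig functions and radian/degree conversions (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

# Hyperbolic trig ufuncs
x = np.array([0.0, 1.0])
print("sinh(x) =", np.sinh(x))
print("tanh(x) =", np.tanh(x))

# Radian/degree conversion ufuncs
deg = np.degrees(np.pi)     # 180.0
rad = np.deg2rad(180.0)     # back to pi
print(deg, rad)
```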

Another excellent source for more specialized and obscure ufuncs is the submodule scipy.special. If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special.

In [104]:
from scipy import special
In [105]:
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x)     =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2)   =", special.beta(x, 2))
gamma(x)     = [  1.00000000e+00   2.40000000e+01   3.62880000e+05]
ln|gamma(x)| = [  0.           3.17805383  12.80182748]
beta(x, 2)   = [ 0.5         0.03333333  0.00909091]
In [106]:
# Error function (integral of Gaussian)
# its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x)  =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
erf(x)  = [ 0.          0.32862676  0.67780119  0.84270079]
erfc(x) = [ 1.          0.67137324  0.32219881  0.15729921]
erfinv(x) = [ 0.          0.27246271  0.73286908         inf]

Advanced Ufunc Features

Many NumPy users make use of ufuncs without ever learning their full set of features.

Specifying output

For large calculations, it is sometimes useful to be able to specify the array where the result of the calculation will be stored. Rather than creating a temporary array, this can be used to write computation results directly to the memory location where you'd like them to be. For all ufuncs, this can be done using the out argument of the function:

In [107]:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
[  0.  10.  20.  30.  40.]
In [108]:
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
[  1.   0.   2.   0.   4.   0.   8.   0.  16.   0.]
In [109]:
x = np.arange(1, 6)
np.add.reduce(x)
Out[109]:
15
In [110]:
np.multiply.reduce(x)
Out[110]:
120
In [111]:
np.add.accumulate(x)
Out[111]:
array([ 1,  3,  6, 10, 15])
In [112]:
np.multiply.accumulate(x)
Out[112]:
array([  1,   2,   6,  24, 120])

Outer products

In [113]:
x = np.arange(1, 6)
np.multiply.outer(x, x)
Out[113]:
array([[ 1,  2,  3,  4,  5],
       [ 2,  4,  6,  8, 10],
       [ 3,  6,  9, 12, 15],
       [ 4,  8, 12, 16, 20],
       [ 5, 10, 15, 20, 25]])

This is a first example of broadcasting.

Broadcasting

In principle, arrays must match in shape for an elementwise operation to be valid, but NumPy will broadcast dimensions when possible. Here is an example of broadcasting a scalar to a 1D array:

In [114]:
print(np.arange(3))
print(np.arange(3) + 5)
[0 1 2]
[5 6 7]

We can also broadcast a 1D array to a 2D array, in this case adding a vector to all rows of a matrix:

In [115]:
np.ones((3, 3))
Out[115]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [116]:
np.arange(3)
Out[116]:
array([0, 1, 2])
In [117]:
np.ones((3, 3)) + np.arange(3)
Out[117]:
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])

We can also broadcast in two directions at a time:

In [118]:
np.arange(3).reshape((3, 1))
Out[118]:
array([[0],
       [1],
       [2]])
In [119]:
np.arange(3)
Out[119]:
array([0, 1, 2])
In [120]:
np.arange(3).reshape((3, 1)) + np.arange(3)
Out[120]:
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])

Visualizing Broadcasting

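In place of a diagram, the broadcast result shape can be inspected programmatically with np.broadcast, which reports the shape two arrays will take after broadcasting without performing any computation (a minimal sketch with illustrative names):

```python
import numpy as np

a = np.ones((3, 1))   # column of shape (3, 1)
b = np.arange(4)      # row of shape (4,)

# np.broadcast computes the combined shape without doing the arithmetic
print(np.broadcast(a, b).shape)   # (3, 4)
print((a + b).shape)              # (3, 4) -- the actual operation agrees
```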

Comparisons, Masks, and Boolean Logic

In [121]:
import numpy as np
import pandas as pd

# use pandas to extract rainfall inches as a NumPy array
rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = rainfall / 254.0  # 1/10mm -> inches
inches.shape
Out[121]:
(365,)
In [122]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot styles
In [123]:
plt.hist(inches, 40);
In [124]:
x = np.array([1, 2, 3, 4, 5])
In [125]:
x < 3  # less than
Out[125]:
array([ True,  True, False, False, False], dtype=bool)
In [126]:
x > 3  # greater than
Out[126]:
array([False, False, False,  True,  True], dtype=bool)
In [127]:
x != 3  # not equal
Out[127]:
array([ True,  True, False,  True,  True], dtype=bool)
In [128]:
(2 * x) == (x ** 2)
Out[128]:
array([False,  True, False, False, False], dtype=bool)
In [129]:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
Out[129]:
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])
In [130]:
# how many values less than 6?
np.count_nonzero(x < 6)
Out[130]:
8
In [131]:
np.sum(x < 6)
Out[131]:
8
In [132]:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)
Out[132]:
array([4, 2, 2])
In [133]:
# are all values in each row less than 8?
np.all(x < 8, axis=1)
Out[133]:
array([ True, False,  True], dtype=bool)
In [134]:
np.sum((inches > 0.5) & (inches < 1))
Out[134]:
29
In [135]:
np.sum(~( (inches <= 0.5) | (inches >= 1) ))
Out[135]:
29
In [136]:
print("Number days without rain:      ", np.sum(inches == 0))
print("Number days with rain:         ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches  :", np.sum((inches > 0) &
                                                (inches < 0.2)))
Number days without rain:       215
Number days with rain:          150
Days with more than 0.5 inches: 37
Rainy days with < 0.2 inches  : 75
In [137]:
# construct a mask of all rainy days
rainy = (inches > 0)

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))
Median precip on rainy days in 2014 (inches):    0.194881889764
Median precip on summer days in 2014 (inches):   0.0
Maximum precip on summer days in 2014 (inches):  0.850393700787
Median precip on non-summer rainy days (inches): 0.200787401575
In [138]:
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B
Out[138]:
array([ True,  True,  True, False,  True,  True], dtype=bool)

Exploring Fancy Indexing

In [139]:
import numpy as np
rand = np.random.RandomState(42)

x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
In [140]:
[x[3], x[7], x[2]]
Out[140]:
[71, 86, 14]
In [141]:
ind = [3, 7, 4]
x[ind]
Out[141]:
array([71, 86, 60])
In [142]:
ind = np.array([[3, 7],
                [4, 5]])
x[ind]
Out[142]:
array([[71, 86],
       [60, 20]])
In [143]:
X = np.arange(12).reshape((3, 4))
X
Out[143]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [144]:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
Out[144]:
array([ 2,  5, 11])
In [145]:
X[row[:, np.newaxis], col]
Out[145]:
array([[ 2,  1,  3],
       [ 6,  5,  7],
       [10,  9, 11]])
In [146]:
row[:, np.newaxis] * col
Out[146]:
array([[0, 0, 0],
       [2, 1, 3],
       [4, 2, 6]])

Combined Indexing

In [147]:
X[2, [2, 0, 1]]
Out[147]:
array([10,  8,  9])
In [148]:
X[1:, [2, 0, 1]]
Out[148]:
array([[ 6,  4,  5],
       [10,  8,  9]])
In [149]:
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
Out[149]:
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])

Example: Selecting Random Points

One common use of fancy indexing is the selection of subsets of rows from a matrix. For example, we might have an $N$ by $D$ matrix representing $N$ points in $D$ dimensions, such as the following points drawn from a two-dimensional normal distribution:

In [150]:
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape
Out[150]:
(100, 2)
In [151]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # for plot styling

plt.scatter(X[:, 0], X[:, 1]);
In [152]:
indices = np.random.choice(X.shape[0], 20, replace=False)
indices
Out[152]:
array([30, 95, 33, 18, 20, 39,  9, 94, 55, 69, 65,  8, 86, 21, 36,  2,  3,
       79, 67, 78])
In [153]:
selection = X[indices]  # fancy indexing here
selection.shape
Out[153]:
(20, 2)
In [154]:
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', s=200);

Modifying Values with Fancy Indexing

In [155]:
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
[ 0 99 99  3 99  5  6  7 99  9]
In [156]:
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)
[ 6.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
In [157]:
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x
Out[157]:
array([ 6.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.])
In [158]:
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
[ 0.  0.  1.  2.  3.  0.  0.  0.  0.  0.]

Example: Binning Data

You can use these ideas to efficiently bin data to create a histogram by hand. For example, imagine we have 100 values drawn from a normal distribution and would like to quickly find where they fall within an array of bins. We could compute this using ufunc.at like this:

In [159]:
np.random.seed(42)
x = np.random.randn(100)

# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)

# find the appropriate bin for each x
i = np.searchsorted(bins, x)

# add 1 to each of these bins
np.add.at(counts, i, 1)
In [160]:
# plot the results
plt.plot(bins, counts, linestyle='steps');

Sorting Arrays

In [161]:
import numpy as np

def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x
In [162]:
x = np.array([2, 1, 4, 3, 5])
selection_sort(x)
Out[162]:
array([1, 2, 3, 4, 5])

Even selection sort, though, is much better than my all-time favorite sorting algorithm, the bogosort:

In [163]:
def bogosort(x):
    while np.any(x[:-1] > x[1:]):
        np.random.shuffle(x)
    return x
In [164]:
x = np.array([2, 1, 4, 3, 5])
bogosort(x)
Out[164]:
array([1, 2, 3, 4, 5])
In [165]:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
Out[165]:
array([1, 2, 3, 4, 5])
In [166]:
x.sort()
print(x)
[1 2 3 4 5]

A related function is argsort, which instead returns the indices of the sorted elements:

In [167]:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)
[1 0 3 2 4]
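These indices can then be fed straight back through fancy indexing to construct the sorted array:

```python
import numpy as np

# argsort returns the indices that would sort the array;
# applying them via fancy indexing recovers the sorted values.
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(x[i])  # → [1 2 3 4 5]
```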

Sorting along rows or columns

In [168]:
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]
In [169]:
# sort each column of X
np.sort(X, axis=0)
Out[169]:
array([[2, 1, 4, 0, 1, 5],
       [5, 2, 5, 4, 3, 7],
       [6, 3, 7, 4, 6, 7],
       [7, 6, 7, 4, 9, 9]])
In [170]:
# sort each row of X
np.sort(X, axis=1)
Out[170]:
array([[3, 4, 6, 6, 7, 9],
       [2, 3, 4, 6, 7, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 5, 9]])

Partial Sorts: Partitioning

In [171]:
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
Out[171]:
array([2, 1, 3, 4, 6, 5, 7])
In [172]:
np.partition(X, 2, axis=1)
Out[172]:
array([[3, 4, 6, 7, 6, 9],
       [2, 3, 4, 7, 6, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 9, 5]])

Example: k-Nearest Neighbors

Let's quickly see how we might use this argsort function along multiple axes to find the nearest neighbors of each point in a set. We'll start by creating a random set of 10 points on a two-dimensional plane. Using the standard convention, we'll arrange these in a $10\times 2$ array:

In [173]:
X = rand.rand(10, 2)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling
plt.scatter(X[:, 0], X[:, 1], s=100);

Now we'll compute the distance between each pair of points. Recall that the squared distance between two points is the sum of the squared differences in each dimension:

In [174]:
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

This operation has a lot packed into it, and it might be a bit confusing if you're unfamiliar with NumPy's broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:

In [175]:
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape
Out[175]:
(10, 10, 2)
In [176]:
# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape
Out[176]:
(10, 10, 2)
In [177]:
# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape
Out[177]:
(10, 10)
In [178]:
dist_sq.diagonal()
Out[178]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
In [179]:
nearest = np.argsort(dist_sq, axis=1)
print(nearest)
[[0 3 9 7 1 4 2 5 6 8]
 [1 4 7 9 3 6 8 5 0 2]
 [2 1 4 6 3 0 8 9 7 5]
 [3 9 7 0 1 4 5 8 6 2]
 [4 1 8 5 6 7 9 3 0 2]
 [5 8 6 4 1 7 9 3 2 0]
 [6 8 5 4 1 7 9 3 2 0]
 [7 9 3 1 4 0 5 8 6 2]
 [8 5 6 4 1 7 9 3 2 0]
 [9 7 3 0 1 4 5 8 6 2]]
In [180]:
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)

In order to visualize this network of neighbors, let's quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:

In [181]:
plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its two nearest neighbors
K = 2

for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black')

Structured Data: NumPy's Structured Arrays

In [182]:
import numpy as np
In [183]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
In [184]:
x = np.zeros(4, dtype=int)
In [185]:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
In [186]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)
 ('Doug', 19, 61.5)]
In [187]:
# Get all names
data['name']
Out[187]:
array(['Alice', 'Bob', 'Cathy', 'Doug'], 
      dtype='<U10')
In [188]:
# Get first row of data
data[0]
Out[188]:
('Alice', 25, 55.0)
In [189]:
# Get the name from the last row
data[-1]['name']
Out[189]:
'Doug'

Using Boolean masking, you can even do more sophisticated operations such as filtering on age:

In [190]:
# Get names where age is under 30
data[data['age'] < 30]['name']
Out[190]:
array(['Alice', 'Doug'], 
      dtype='<U10')

Creating Structured Arrays

Structured array data types can be specified in a number of ways. Earlier, we saw the dictionary method:

In [191]:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})
Out[191]:
dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
In [192]:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})
Out[192]:
dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
In [193]:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
Out[193]:
dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
In [194]:
np.dtype('S10,i4,f8')
Out[194]:
dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])

The shortened string format codes may seem confusing, but they are built on simple principles. The first (optional) character is < or >, which means "little endian" or "big endian," respectively, and specifies the byte-ordering convention. The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below). The last character or characters represent the size of the object in bytes.

Character Description Example
'b' Byte np.dtype('b')
'i' Signed integer np.dtype('i4') == np.int32
'u' Unsigned integer np.dtype('u1') == np.uint8
'f' Floating point np.dtype('f8') == np.float64
'c' Complex floating point np.dtype('c16') == np.complex128
'S', 'a' String np.dtype('S5')
'U' Unicode string np.dtype('U') == np.str_
'V' Raw data (void) np.dtype('V') == np.void
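As a quick sanity check on these codes, here is a small sketch that builds a few dtypes from format strings and inspects their sizes; the size suffix is the width in bytes:

```python
import numpy as np

# Each format string is '<endianness><type character><size in bytes>'.
for code in ['<i4', '<u1', '<f8', '<c16', 'S5']:
    dt = np.dtype(code)
    print(code, dt.name, dt.itemsize)

# 'f8' is an 8-byte (64-bit) float, 'i4' a 4-byte (32-bit) integer:
assert np.dtype('f8') == np.float64
assert np.dtype('i4') == np.int32
```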
In [195]:
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])
(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
In [196]:
data['age']
Out[196]:
array([25, 45, 37, 19], dtype=int32)
In [197]:
data_rec = data.view(np.recarray)
data_rec.age
Out[197]:
array([25, 45, 37, 19], dtype=int32)
In [198]:
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age
The slowest run took 19.07 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 272 ns per loop
100000 loops, best of 3: 8.48 µs per loop
100000 loops, best of 3: 9.92 µs per loop
In [ ]:
 

Questions:

Will the following broadcasting operations work?

In [199]:
arr1 = np.ones((2, 3))
arr2 = np.ones((2, 1))
print(arr1)
print(arr2)
arr1 + arr2
[[ 1.  1.  1.]
 [ 1.  1.  1.]]
[[ 1.]
 [ 1.]]
Out[199]:
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
In [200]:
arr1 = np.ones((2, 3))
arr2 = np.ones(2).reshape(2,1)
print(arr1)
print(arr2)
arr1 + arr2
[[ 1.  1.  1.]
 [ 1.  1.  1.]]
[[ 1.]
 [ 1.]]
Out[200]:
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])

Quick Exercise:

Use np.arange and reshape

A = [[1 2 3 4]
     [5 6 7 8]]

Use np.arange to create the array

B = [1 2]

Use broadcasting to add B to each column of A to create the final array

A + B = [[2  3  4  5]
         [7  8  9 10]]

Hint: what shape does B have to be changed to?

In [ ]:
 

Answers:

In [ ]:
A = np.arange(1, 9).reshape((2, 4))
B = np.arange(1, 3)
A + B.reshape((2, 1))

Another way to change the shape of B is to use the newaxis keyword:
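As a sketch, reusing A and B from the answer above, np.newaxis turns B into a (2, 1) column vector without calling reshape:

```python
import numpy as np

# Same broadcasting exercise, but inserting a new axis instead of reshaping:
A = np.arange(1, 9).reshape((2, 4))
B = np.arange(1, 3)
print(A + B[:, np.newaxis])
# → [[ 2  3  4  5]
#    [ 7  8  9 10]]
```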

Element-wise Functions

As we mentioned before, Numpy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, scipy ships a rich special function library in the scipy.special module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between $0$ and $2\pi$ is as simple as:

In [202]:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
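As a minimal illustration of this element-wise behavior (using only NumPy ufuncs here; the scipy.special functions operate on arrays in the same way), we can verify two familiar identities across whole arrays at once:

```python
import numpy as np

# exp and log apply element-wise, so exp(log(s)) recovers s:
s = np.linspace(1, 10, 5)
print(np.allclose(np.exp(np.log(s)), s))  # → True

# The identity sin^2(t) + cos^2(t) = 1 holds at every sample point:
t = np.linspace(0, 2*np.pi, 100)
print(np.allclose(np.sin(t)**2 + np.cos(t)**2, 1.0))  # → True
```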

4. Linear algebra in Numpy

Numpy ships with a basic linear algebra library, and all arrays have a dot method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:

In [203]:
v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])

print(v1, '.', v2, '=', np.dot(v1, v2))
[2 3 4] . [1 0 1] = 6

Here is a regular matrix-vector multiplication. Note that the array v1 should be viewed as a column vector in traditional linear algebra notation; numpy makes no distinction between row and column vectors, and simply verifies that the dimensions match the rules of matrix multiplication. In this case we have a $2 \times 3$ matrix multiplied by a 3-vector, which produces a 2-vector:

In [204]:
A = np.arange(6).reshape(2, 3)
print(A, 'x', v1, '=', np.dot(A, v1))
[[0 1 2]
 [3 4 5]] x [2 3 4] = [11 38]

For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A \times A^T$:

In [205]:
print(np.dot(A, A.T))
[[ 5 14]
 [14 50]]

and $A^T \times A$:

In [206]:
print(np.dot(A.T, A))
[[ 9 12 15]
 [12 17 22]
 [15 22 29]]

Furthermore, the numpy.linalg module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc. For even more linear algebra tools, scipy.linalg contains the majority of the tools in the classic LAPACK libraries as well as functions to operate on sparse matrices. We refer the reader to the Numpy and Scipy documentation for additional details on these.
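As a brief sketch of what numpy.linalg offers, here is a hand-checkable $2 \times 2$ example covering a determinant, a norm, an eigendecomposition and a Cholesky factorization:

```python
import numpy as np

# A symmetric positive-definite 2x2 matrix, easy to verify by hand:
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

print(np.linalg.det(M))    # determinant: 2*2 - 1*1 = 3
print(np.linalg.norm(M))   # Frobenius norm: sqrt(4 + 1 + 1 + 4)

w, v = np.linalg.eig(M)    # eigenvalues (3 and 1) and eigenvectors
print(np.allclose(np.sort(w), [1.0, 3.0]))  # → True

L = np.linalg.cholesky(M)  # lower-triangular factor with L @ L.T == M
print(np.allclose(L @ L.T, M))              # → True
```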

Reading and writing arrays to disk

Numpy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a text and a binary file containing numerical data. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, with, in this case, 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 bytes that are not readable by a human but which contain the exact same data that the variable pi had in the computer's memory.

The tradeoffs between the two modes are thus:

  • Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can only be used for one- and two-dimensional arrays.

  • Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.
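This tradeoff is easy to see with Python's standard struct module, which packs a float into its raw 8-byte memory representation (standard library only, no NumPy needed):

```python
import math
import struct

# Text representation: a human-readable string of digits.
text = repr(math.pi)
print(text, len(text))       # '3.141592653589793', 17 characters

# Binary representation: the same double packed into exactly 8 bytes.
binary = struct.pack('d', math.pi)
print(len(binary))           # → 8

# Round-tripping the binary form recovers the value exactly:
(recovered,) = struct.unpack('d', binary)
print(recovered == math.pi)  # → True
```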

Text data

First, let's see how to read and write arrays in text mode. The np.savetxt function saves an array to a text file, with options to control the precision, separators and even adding a header:

In [207]:
arr = np.arange(10).reshape(2, 5)
np.savetxt('test.out', arr, fmt='%.2e', header="My dataset")
!cat test.out
# My dataset
0.00e+00 1.00e+00 2.00e+00 3.00e+00 4.00e+00
5.00e+00 6.00e+00 7.00e+00 8.00e+00 9.00e+00

And this same type of file can then be read with the matching np.loadtxt function:

In [208]:
arr2 = np.loadtxt('test.out')
print(arr2)
[[ 0.  1.  2.  3.  4.]
 [ 5.  6.  7.  8.  9.]]

Binary Data

For binary data, Numpy provides the np.save and np.savez routines. The first saves a single array to a file with a .npy extension, while the second can save a group of arrays into a single file with a .npz extension. The files created with these routines can then be read with the np.load function.

Let us first see how to use the simpler np.save function to save a single array:

In [209]:
np.save('test.npy', arr2)
# Now we read this back
arr2n = np.load('test.npy')
# Let's see if any element is non-zero in the difference.
# A value of True would be a problem.
print('Any differences?', np.any(arr2-arr2n))
Any differences? False

Now let us see how the np.savez function works. You give it a filename and either a sequence of arrays or a set of keyword arguments. In the first mode, the function will automatically name the saved arrays in the archive as arr_0, arr_1, etc:

In [210]:
np.savez('test.npz', arr, arr2)
arrays = np.load('test.npz')
arrays.files
Out[210]:
['arr_0', 'arr_1']

.npz: multiple binary outputs in one file

Alternatively, we can explicitly choose how to name the arrays we save:

In [211]:
np.savez('test.npz', array1=arr, array2=arr2)
arrays = np.load('test.npz')
arrays.files
Out[211]:
['array1', 'array2']

The object returned by np.load from an .npz file works like a dictionary, though you can also access its constituent files by attribute using its special .f field; this is best illustrated with an example with the arrays object from above:

In [212]:
print('First row of first array:', arrays['array1'][0])

# This is an equivalent way to get the same field
print('First row of first array:', arrays.f.array1[0])
First row of first array: [0 1 2 3 4]
First row of first array: [0 1 2 3 4]

This .npz format is a very convenient way to package a group of related arrays that pertain to a specific problem compactly, and without loss of information, into a single file. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.

Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete discussion of the possibilities is beyond the scope of this discussion, but of particular interest for scientific users we at least mention the following:

Matlab files

The scipy.io module contains routines to read and write Matlab files in .mat format and files in the NetCDF format that is widely used in certain scientific disciplines.

HDF5 files

For manipulating files in the HDF5 format, there are two excellent options in Python: The PyTables project offers a high-level, object oriented approach to manipulating HDF5 datasets, while the h5py project offers a more direct mapping to the standard HDF5 library interface. Both are excellent tools; if you need to work with HDF5 datasets you should read some of their documentation and examples and decide which approach is a better match for your needs.

Breakout: trapezoidal integration

Illustrates: basic array slicing, functions as first class objects.

In this exercise, you are tasked with implementing the simple trapezoid rule formula for numerical integration. If we want to compute the definite integral

$$ \int_{a}^{b}f(x)dx $$

we can partition the integration interval $[a,b]$ into smaller subintervals, and approximate the area under the curve for each subinterval by the area of the trapezoid created by linearly interpolating between the two function values at each end of the subinterval:

<img src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Trapezoidal_rule_illustration.png/316px-Trapezoidal_rule_illustration.png" />

The blue line represents the function $f(x)$ and the red line is the linear interpolation. By subdividing the interval $[a,b]$, the area under $f(x)$ can thus be approximated as the sum of the areas of all the resulting trapezoids.

If we denote by $x_{i}$ ($i=0,\ldots,n,$ with $x_{0}=a$ and $x_{n}=b$) the abscissas where the function is sampled, then

$$ \int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(f(x_{i})+f(x_{i-1})\right). $$

The common case of using equally spaced abscissas with spacing $h=(b-a)/n$ reads simply

$$ \int_{a}^{b}f(x)dx\approx\frac{h}{2}\sum_{i=1}^{n}\left(f(x_{i})+f(x_{i-1})\right). $$

One frequently receives the function values already precomputed, $y_{i}=f(x_{i}),$ so the equation above becomes

$$ \int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(y_{i}+y_{i-1}\right). $$
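The sum in the last equation maps directly onto array slicing: x[1:] - x[:-1] gives the subinterval widths and y[1:] + y[:-1] the paired function values. A minimal vectorized sketch:

```python
import numpy as np

def trapz(x, y):
    # 0.5 * sum over i of (x_i - x_{i-1}) * (y_i + y_{i-1}),
    # written with slices instead of an explicit loop.
    return 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))

# Check against the exact value of the integral of x^2 on [0, 3]:
x = np.linspace(0, 3, 1000)
print(trapz(x, x**2))  # ≈ 9.0 (the exact value is 9)
```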

Exercises

Part 1

Write a function trapz(x, y), that applies the trapezoid formula to pre-computed values, where x and y are 1-d arrays. (Use numpy operations, not loops!)

In [ ]:
 

Part 2

Write a function trapzf(f, a, b, npts=100) that accepts a function f, the endpoints a and b and the number of samples to take npts. Sample the function uniformly at these points and return the value of the integral.

In [ ]:
 

Part 3

Verify that both functions above are correct by showing that they produce correct values for a simple integral such as $\int_0^3 x^2\,dx$.

In [ ]:
 

Part 4

Preview of what's to come: use the documentation features of IPython to explore the submodule scipy.integrate. Can you find a suitable function to perform the integration above?

In [ ]:
 


Introduction to matplotlib

The matplotlib library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the book. We encourage the reader to read the tutorials included with the matplotlib documentation as well as to browse its extensive gallery of examples that include source code.

Just as we typically use the shorthand np for Numpy, we will use plt for the matplotlib.pyplot module where the easy-to-use plotting functions reside (the library contains a rich object-oriented architecture that we don't have the space to discuss here):

In [1]:
%matplotlib inline
In [2]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
In [3]:
dir(mpl)
Out[3]:
['RcParams',
 'URL_REGEX',
 'Verbose',
 '_DATA_DOC_APPENDIX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '__version__numpy__',
 '_all_deprecated',
 '_cm',
 '_cm_listed',
 '_cntr',
 '_contour',
 '_create_tmp_config_dir',
 '_decode_filesystem_path',
 '_deprecated_ignore_map',
 '_deprecated_map',
 '_error_details_fmt',
 '_get_cachedir',
 '_get_config_or_cache_dir',
 '_get_configdir',
 '_get_data_path',
 '_get_data_path_cached',
 '_get_home',
 '_get_xdg_cache_dir',
 '_get_xdg_config_dir',
 '_image',
 '_is_writable_dir',
 '_mathtext_data',
 '_obsolete_set',
 '_open_file_or_url',
 '_path',
 '_png',
 '_pylab_helpers',
 '_python26',
 '_qhull',
 '_rc_params_in_file',
 '_replacer',
 '_tri',
 '_url_lines',
 '_use_error_msg',
 '_version',
 'absolute_import',
 'afm',
 'artist',
 'axes',
 'axis',
 'backend_bases',
 'backend_tools',
 'backends',
 'bad_pyparsing',
 'bezier',
 'blocking_input',
 'cbook',
 'chain',
 'checkdep_dvipng',
 'checkdep_ghostscript',
 'checkdep_inkscape',
 'checkdep_pdftops',
 'checkdep_ps_distiller',
 'checkdep_tex',
 'checkdep_usetex',
 'checkdep_xmllint',
 'cm',
 'collections',
 'colorbar',
 'colors',
 'compare_versions',
 'compat',
 'container',
 'contextlib',
 'contour',
 'cycler',
 'dates',
 'dateutil',
 'dedent',
 'defaultParams',
 'default_test_modules',
 'distutils',
 'division',
 'docstring',
 'dviread',
 'externals',
 'f',
 'figure',
 'finance',
 'font_manager',
 'fontconfig_pattern',
 'ft2font',
 'functools',
 'get_backend',
 'get_cachedir',
 'get_configdir',
 'get_data_path',
 'get_example_data',
 'get_home',
 'get_label',
 'get_py2exe_datafiles',
 'gridspec',
 'image',
 'inspect',
 'interactive',
 'io',
 'is_interactive',
 'is_string_like',
 'is_url',
 'legend',
 'legend_handler',
 'lines',
 'locale',
 'major',
 'markers',
 'mathtext',
 'matplotlib_fname',
 'minor1',
 'minor2',
 'mlab',
 'mplDeprecation',
 'numpy',
 'offsetbox',
 'os',
 'patches',
 'path',
 'print_function',
 'projections',
 'pylab',
 'pyparsing',
 'pyplot',
 'quiver',
 'rc',
 'rcParams',
 'rcParamsDefault',
 'rcParamsOrig',
 'rc_context',
 'rc_file',
 'rc_file_defaults',
 'rc_params',
 'rc_params_from_file',
 'rcdefaults',
 'rcsetup',
 're',
 'reload',
 's',
 'scale',
 'six',
 'spines',
 'stackplot',
 'streamplot',
 'style',
 'subprocess',
 'sys',
 'table',
 'tempfile',
 'test',
 'texmanager',
 'text',
 'textpath',
 'ticker',
 'tight_bbox',
 'tk_window_focus',
 'tmp',
 'transforms',
 'tri',
 'unicode_literals',
 'units',
 'unpack_labeled_data',
 'urlopen',
 'use',
 'validate_backend',
 'verbose',
 'verify_test_dependencies',
 'warnings',
 'widgets']

Setting Styles

We will use the plt.style directive to choose appropriate aesthetic styles for our figures.

In [ ]:
plt.style.use('classic')
In [ ]:
plt.style.use('ggplot')

1. Getting started

In [4]:
x = np.linspace(-5, 2, 100)
In [5]:
y1 = x**3 + 5*x**2 + 10
In [6]:
y2 = 3*x**2 + 10*x
In [7]:
y3 = 6*x + 10
In [8]:
fig, ax = plt.subplots()
ax.plot(x, y1, label='f(x)')
ax.plot(x, y2, label="f'(x)")
ax.plot(x, y3, label="f''(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
Out[8]:
<matplotlib.legend.Legend at 0x10d021908>
In [9]:
plt.plot(x, y1)
Out[9]:
[<matplotlib.lines.Line2D at 0x10d716400>]
In [10]:
fig, ax = plt.subplots()
ax.plot(x, y1);
In [11]:
fig, ax = plt.subplots()
ax.plot(x, y1, color="blue", label="y(x)")
ax.plot(x, y2, color="red", label="y'(x)")
ax.plot(x, y3, color="green", label="y''(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend();
In [12]:
fig.savefig("figure-1.jpg")
In [13]:
!ls -lh figure-1.jpg
-rw-r--r--@ 1 lee  staff    36K  8 11 23:09 figure-1.jpg
In [14]:
from IPython.display import Image
Image(filename="figure-1.jpg")
Out[14]:

In savefig(), the file format is inferred from the extension of the given filename. Depending on what backends you have installed, many different file formats are available.

In [15]:
fig.canvas.get_supported_filetypes()
Out[15]:
{'eps': 'Encapsulated Postscript',
 'jpeg': 'Joint Photographic Experts Group',
 'jpg': 'Joint Photographic Experts Group',
 'pdf': 'Portable Document Format',
 'pgf': 'PGF code for LaTeX',
 'png': 'Portable Network Graphics',
 'ps': 'Postscript',
 'raw': 'Raw RGBA bitmap',
 'rgba': 'Raw RGBA bitmap',
 'svg': 'Scalable Vector Graphics',
 'svgz': 'Scalable Vector Graphics',
 'tif': 'Tagged Image File Format',
 'tiff': 'Tagged Image File Format'}

Two Interfaces for the Price of One

A potentially confusing feature of Matplotlib is its dual interfaces: a convenient MATLAB-style state-based interface, and a more powerful object-oriented interface. We'll quickly highlight the differences between the two here.

MATLAB-style Interface

Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact. The MATLAB-style tools are contained in the pyplot (plt) interface.

In [16]:
plt.figure()  # create a plot figure

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
In [17]:
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)

# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));
In [18]:
fig, ax = plt.subplots(figsize=(8,5))

ax.plot(x, y1, lw=1.5, color="blue", label="$y(x)$")
ax.plot(x, y2, lw=1.5, color="red", label="$y'(x)$")
ax.plot(x, y3, lw=1.5, color="green", label="$y''(x)$")


ax.plot(x, np.zeros_like(x), lw=0.5, color="black")

ax.plot([-3.33, -3.33], [0, (-3.33)**3 + 5*(-3.33)**2 + 10], ls='--', lw=0.5, color="black")
ax.plot([-3.33], [(-3.33)**3 + 5*(-3.33)**2 + 10], lw=0.5, marker='o', color="blue")


ax.plot([0, 0], [0, 10], lw=0.5, ls='--', color="black")
ax.plot([0], [10], lw=0.5, marker='o', color="blue")

ax.set_ylim(-15, 40)
ax.set_yticks([-10, 0, 10, 20, 30])
ax.set_xticks([-4, -2, 0, 2])

ax.set_xlabel("$x$", fontsize=18)
ax.set_ylabel("$y$", fontsize=18)
ax.legend(loc=0, ncol=3, fontsize=14, frameon=False);
In [19]:
fig.savefig("figure-2.jpg")

Figure

In [20]:
fig = plt.figure(figsize=(8, 2.5), facecolor="#f1f1f1")

# axes coordinates as fractions of the canvas width and height
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax = fig.add_axes((left, bottom, width, height), axisbg="#e1e1e1")

x = np.linspace(-2, 2, 1000)
y1 = np.cos(40 * x)
y2 = np.exp(-x**2)

ax.plot(x, y1 * y2)
ax.plot(x, y2, 'g')
ax.plot(x, -y2, 'g')
ax.set_xlabel("x")
ax.set_ylabel("y")

fig.savefig("graph.png", dpi=100, facecolor="#f1f1f1")
fig.savefig("graph.pdf", dpi=300, facecolor="#f1f1f1")

2. Plot types

Line Colors and Styles

In [21]:
plt.plot(x, np.sin(x - 0), color='blue')        # specify color by name
plt.plot(x, np.sin(x - 1), color='g')           # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')        # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')     # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
In [22]:
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');

# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')  # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
In [23]:
plt.plot(x, x + 0, '-g')  # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r');  # dotted red
In [24]:
fignum = 0

def hide_labels(fig, ax):
    global fignum
    ax.set_xticks([])
    ax.set_yticks([])
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.axis('tight')
    
    fignum += 1
    fig.savefig("plot-types-%d.pdf" % fignum)
In [25]:
x = np.linspace(-3, 3, 25)
y1 = x**3+ 3 * x**2 + 10
y2 = -1.5 * x**3 + 10*x**2 - 15
In [26]:
fig, ax = plt.subplots(figsize=(4, 3))

ax.plot(x, y1)
ax.plot(x, y2)


hide_labels(fig, ax)
In [27]:
fig, ax = plt.subplots(figsize=(4, 3))

ax.step(x, y1)
ax.step(x, y2)

hide_labels(fig, ax)
In [28]:
fig, ax = plt.subplots(figsize=(4, 3))
width = 1/10.0
ax.bar(x - width/2, y1, width=width, color="blue")
ax.bar(x + width/2, y2, width=width, color="green")

hide_labels(fig, ax)
In [29]:
fig, ax = plt.subplots(figsize=(4, 3))
ax.fill_between(x, y1, y2)

hide_labels(fig, ax)
In [30]:
fig, ax = plt.subplots(figsize=(4, 3))
ax.hist(y2, bins=30)
ax.hist(y1, bins=30)

hide_labels(fig, ax)
In [31]:
fig, ax = plt.subplots(figsize=(4, 3))

ax.errorbar(x, y2, yerr=y1, fmt='o-')

hide_labels(fig, ax)
In [32]:
fig, ax = plt.subplots(figsize=(4, 3))

ax.stem(x, y2, 'b', markerfmt='bs')
ax.stem(x, y1, 'r', markerfmt='ro')

hide_labels(fig, ax)
In [33]:
fig, ax = plt.subplots(figsize=(4, 3))

x = np.linspace(0, 5, 50)

ax.scatter(x, -1 + x + 0.25 * x**2 + 2 * np.random.rand(len(x)))
ax.scatter(x, np.sqrt(x) + 2 * np.random.rand(len(x)), color="green")

hide_labels(fig, ax)
In [34]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np

Simple Scatter Plots

In [35]:
x = np.linspace(0, 10, 30)
y = np.sin(x)

plt.plot(x, y, 'o', color='black');
In [36]:
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
    plt.plot(rng.rand(5), rng.rand(5), marker,
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
In [37]:
plt.plot(x, y, '-ok');
In [38]:
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2);
In [39]:
plt.scatter(x, y, marker='o');
In [40]:
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)

plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
            cmap='viridis')
plt.colorbar();  # show color scale
In [41]:
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T

plt.scatter(features[0], features[1], alpha=0.2,
            s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);

Errorbars

In [42]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
In [43]:
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)

plt.errorbar(x, y, yerr=dy, fmt='.k');
In [44]:
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0);
In [45]:
from sklearn.gaussian_process import GaussianProcess

# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit
gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
                     random_start=100)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE)  # 2*sigma ~ 95% confidence region

# Visualize the result
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')

plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
                 color='gray', alpha=0.2)
plt.xlim(0, 10);
/Users/lee/anaconda/lib/python3.5/site-packages/sklearn/utils/deprecation.py:52: DeprecationWarning: Class GaussianProcess is deprecated; GaussianProcess was deprecated in version 0.18 and will be removed in 0.20. Use the GaussianProcessRegressor instead.
  warnings.warn(msg, category=DeprecationWarning)
/Users/lee/anaconda/lib/python3.5/site-packages/sklearn/utils/deprecation.py:70: DeprecationWarning: Function l1_cross_distances is deprecated; l1_cross_distances was deprecated in version 0.18 and will be removed in 0.20.
  warnings.warn(msg, category=DeprecationWarning)
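As the warning notes, GaussianProcess was removed in scikit-learn 0.20. A rough equivalent of the fit above using the replacement class, GaussianProcessRegressor, is sketched below; the RBF kernel and its length scale are assumptions, not an exact translation of the `corr='cubic'` settings:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit (kernel choice is an assumption)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # 2*sigma ~ 95% confidence region
```

The fitted mean and `dyfit` band can then be plotted exactly as in the cell above (`plt.plot`, `plt.fill_between`).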

Density and Contour Plots

In [46]:
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)

X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.contour(X, Y, Z, colors='black');
In [47]:
plt.contour(X, Y, Z, 20, cmap='RdGy');
In [48]:
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
In [49]:
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy')
plt.colorbar()
plt.axis('image');  # equal aspect with limits fit to the data
In [50]:
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)

plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy', alpha=0.5)
plt.colorbar();

Histograms, Binnings, and Density

In [51]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')

data = np.random.randn(1000)
In [52]:
plt.hist(data);
In [53]:
plt.hist(data, bins=30, normed=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
In [54]:
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(histtype='stepfilled', alpha=0.3, normed=True, bins=40)

plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
In [55]:
# simply compute the histogram
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
[ 12 168 492 303  25]
In [56]:
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
In [57]:
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
In [58]:
# Hexagonal binnings
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
In [59]:
# Kernel density estimation
from scipy.stats import gaussian_kde

# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)

# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))

# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
           origin='lower', aspect='auto',
           extent=[-3.5, 3.5, -6, 6],
           cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")

3. Advanced Features

In [60]:
fig, ax = plt.subplots(figsize=(8, 4))

x = np.linspace(-20, 20, 100)
y = np.sin(x) / x

ax.plot(x, y)

ax.set_ylabel("y label")
ax.set_xlabel("x label")


for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_rotation(45)

Axes

Axes Limits

In [61]:
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x))

plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
In [62]:
plt.plot(x, np.sin(x))

plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
In [63]:
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);
In [64]:
plt.plot(x, np.sin(x))
plt.axis('tight');
In [65]:
plt.plot(x, np.sin(x))
plt.axis('equal');
In [66]:
fig, axes = plt.subplots(ncols=2, nrows=3)
In [67]:
fig, axes = plt.subplots(1, 2, figsize=(8, 3.5), sharey=True)

data1 = np.random.randn(200, 2) * np.array([3, 1])
#area1 = (np.random.randn(200) + 0.5) * 100

data2 = np.random.randn(200, 2) * np.array([1, 3])
#area2 = (np.random.randn(200) + 0.5) * 100

axes[0].scatter(data1[:,0], data1[:,1], color="green", marker="s", s=30, alpha=0.5)
axes[0].scatter(data2[:,0], data2[:,1], color="blue", marker="o", s=30, alpha=0.5)

axes[1].hist([data1[:,1], data2[:,1]], bins=15, color=["green", "blue"], alpha=0.5, orientation='horizontal');

Labeling Plots

In [68]:
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
In [69]:
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')

plt.legend();

Legends

In [70]:
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

x = np.linspace(0, 1, 100)

for n in range(4):
    axes[n].plot(x, x, label="y(x) = x")
    axes[n].plot(x, x + x**2, label="y(x) = x + x**2")
    axes[n].legend(loc=n+1)
    axes[n].set_title("legend(loc=%d)" % (n+1))

fig.tight_layout()
fig.savefig("legend-loc.pdf")

Colorbars

In [71]:
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
In [72]:
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])

plt.imshow(I)
plt.colorbar();
In [73]:
plt.imshow(I, cmap='gray');
In [74]:
from matplotlib.colors import LinearSegmentedColormap

def grayscale_cmap(cmap):
    """Return a grayscale version of the given colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    
    # convert RGBA to perceived grayscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
        
    return LinearSegmentedColormap.from_list(cmap.name + "_gray", colors, cmap.N)
    

def view_colormap(cmap):
    """Plot a colormap with its grayscale equivalent"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    
    cmap = grayscale_cmap(cmap)
    grayscale = cmap(np.arange(cmap.N))
    
    fig, ax = plt.subplots(2, figsize=(6, 2),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow([colors], extent=[0, 10, 0, 1])
    ax[1].imshow([grayscale], extent=[0, 10, 0, 1])
    
    
view_colormap('jet')
In [75]:
view_colormap('viridis')
In [76]:
view_colormap('cubehelix')
In [77]:
view_colormap('RdBu')
In [78]:
# make noise in 1% of the image pixels
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))

plt.figure(figsize=(10, 3.5))

plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')
plt.colorbar()

plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
In [79]:
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
In [80]:
# load images of the digits 0 through 5 and visualize several of them
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)

fig, ax = plt.subplots(8, 8, figsize=(6, 6))
for i, axi in enumerate(ax.flat):
    axi.imshow(digits.images[i], cmap='binary')
    axi.set(xticks=[], yticks=[])
In [81]:
# project the digits into 2 dimensions using IsoMap
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
projection = iso.fit_transform(digits.data)

# plot the results
plt.scatter(projection[:, 0], projection[:, 1], lw=0.1,
            c=digits.target, cmap=plt.cm.get_cmap('cubehelix', 6))
plt.colorbar(ticks=range(6), label='digit value')
plt.clim(-0.5, 5.5)

Twin axes

In [82]:
fig, ax1 = plt.subplots(figsize=(8, 4))

r = np.linspace(0, 5, 100)
a = 4 * np.pi * r ** 2  # area
v = (4 * np.pi / 3) * r ** 3  # volume


ax1.set_title("surface area and volume of a sphere", fontsize=16)
ax1.set_xlabel("radius [m]", fontsize=16)

ax1.plot(r, a, lw=2, color="blue")
ax1.set_ylabel("surface area ($m^2$)", fontsize=16, color="blue")
for label in ax1.get_yticklabels():
    label.set_color("blue")

    
ax2 = ax1.twinx()
ax2.plot(r, v, lw=2, color="red")
ax2.set_ylabel(r"volume ($m^3$)", fontsize=16, color="red")
for label in ax2.get_yticklabels():
    label.set_color("red")
    
fig.tight_layout()
fig.savefig("ch4-axis-twin-ax.pdf")

4. Advanced grid layout

Inset

In [83]:
import matplotlib as mpl  # mpl.ticker is used below

fig = plt.figure(figsize=(8, 4))

def f(x):
    return 1/(1 + x**2) + 0.1/(1 + ((3 - x)/0.1)**2)

def plot_and_format_axes(ax, x, f, fontsize):
    ax.plot(x, f(x), linewidth=2)
    ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(5))
    ax.yaxis.set_major_locator(mpl.ticker.MaxNLocator(4))
    ax.set_xlabel(r"$x$", fontsize=fontsize)
    ax.set_ylabel(r"$f(x)$", fontsize=fontsize)
    
# main graph
ax = fig.add_axes([0.1, 0.15, 0.8, 0.8], facecolor="#f5f5f5")  # 'axisbg' in matplotlib < 2.0
x = np.linspace(-4, 14, 1000)
plot_and_format_axes(ax, x, f, 18)

# inset
x0, x1 = 2.5, 3.5
ax = fig.add_axes([0.5, 0.5, 0.38, 0.42], facecolor='none')  # 'axisbg' in matplotlib < 2.0
x = np.linspace(x0, x1, 1000)
plot_and_format_axes(ax, x, f, 14)

fig.savefig("ch4-advanced-axes-inset.pdf")

Subplots

In [84]:
ncols, nrows = 3, 3

fig, axes = plt.subplots(nrows, ncols)

for m in range(nrows):
    for n in range(ncols):
        axes[m, n].set_xticks([])
        axes[m, n].set_yticks([])
        axes[m, n].text(0.5, 0.5, "axes[%d, %d]" % (m, n),
                        horizontalalignment='center')
In [85]:
fig, axes = plt.subplots(2, 2, figsize=(6, 6), sharex=True, sharey=True, squeeze=False)

x1 = np.random.randn(100)
x2 = np.random.randn(100)

axes[0, 0].set_title("Uncorrelated")
axes[0, 0].scatter(x1, x2)

axes[0, 1].set_title("Weakly positively correlated")
axes[0, 1].scatter(x1, x1 + x2)

axes[1, 0].set_title("Weakly negatively correlated")
axes[1, 0].scatter(x1, -x1 + x2)

axes[1, 1].set_title("Strongly correlated")
axes[1, 1].scatter(x1, x1 + 0.15 * x2)

axes[1, 1].set_xlabel("x")
axes[1, 0].set_xlabel("x")
axes[0, 0].set_ylabel("y")
axes[1, 0].set_ylabel("y")

plt.subplots_adjust(left=0.1, right=0.95, bottom=0.1, top=0.95, wspace=0.1, hspace=0.2)

fig.savefig("ch4-advanced-axes-subplots.pdf")

Aside: Matplotlib Gotchas

While most plt functions translate directly to ax methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands. In particular, the functions to set limits, labels, and titles are slightly modified:

  • plt.xlabel() → ax.set_xlabel()
  • plt.ylabel() → ax.set_ylabel()
  • plt.xlim() → ax.set_xlim()
  • plt.ylim() → ax.set_ylim()
  • plt.title() → ax.set_title()

In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ax.set() method to set all these properties at once:

In [90]:
ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
       xlabel='x', ylabel='sin(x)',
       title='A Simple Plot');

Text and Annotation

In [94]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
In [95]:
births = pd.read_csv('data/births.csv')

quartiles = np.percentile(births['births'], [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')

births['day'] = births['day'].astype(int)

births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births',
                                    [births.index.month, births.index.day])
births_by_date.index = [pd.datetime(2012, month, day)
                        for (month, day) in births_by_date.index]

fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
In [76]:
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)

# Add labels to the plot
style = dict(size=10, color='gray')

ax.text('2012-1-1', 3950, "New Year's Day", **style)
ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style)
ax.text('2012-9-4', 4850, "Labor Day", ha='center', **style)
ax.text('2012-10-31', 4600, "Halloween", ha='right', **style)
ax.text('2012-11-25', 4450, "Thanksgiving", ha='center', **style)
ax.text('2012-12-25', 3850, "Christmas ", ha='right', **style)

# Label the axes
ax.set(title='USA births by day of year (1969-1988)',
       ylabel='average daily births')

# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
In [96]:
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)

# Add labels to the plot
ax.annotate("New Year's Day", xy=('2012-1-1', 4100),  xycoords='data',
            xytext=(50, -30), textcoords='offset points',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3,rad=-0.2"))

ax.annotate("Independence Day", xy=('2012-7-4', 4250),  xycoords='data',
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            xytext=(10, -40), textcoords='offset points', ha='center',
            arrowprops=dict(arrowstyle="->"))

ax.annotate('Labor Day', xy=('2012-9-4', 4850), xycoords='data', ha='center',
            xytext=(0, -20), textcoords='offset points')
ax.annotate('', xy=('2012-9-1', 4850), xytext=('2012-9-7', 4850),
            xycoords='data', textcoords='data',
            arrowprops={'arrowstyle': '|-|,widthA=0.2,widthB=0.2', })

ax.annotate('Halloween', xy=('2012-10-31', 4600),  xycoords='data',
            xytext=(-80, -40), textcoords='offset points',
            arrowprops=dict(arrowstyle="fancy",
                            fc="0.6", ec="none",
                            connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('Thanksgiving', xy=('2012-11-25', 4500),  xycoords='data',
            xytext=(-120, -60), textcoords='offset points',
            bbox=dict(boxstyle="round4,pad=.5", fc="0.9"),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="angle,angleA=0,angleB=80,rad=20"))


ax.annotate('Christmas', xy=('2012-12-25', 3850),  xycoords='data',
             xytext=(-30, 0), textcoords='offset points',
             size=13, ha='right', va="center",
             bbox=dict(boxstyle="round", alpha=0.1),
             arrowprops=dict(arrowstyle="wedge,tail_width=0.5", alpha=0.1));

# Label the axes
ax.set(title='USA births by day of year (1969-1988)',
       ylabel='average daily births')

# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));

ax.set_ylim(3600, 5400);

Three-Dimensional Plotting

In [97]:
from mpl_toolkits import mplot3d
In [98]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [99]:
fig = plt.figure()
ax = plt.axes(projection='3d')
In [100]:
ax = plt.axes(projection='3d')

# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')

# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
In [101]:
def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)

X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
In [102]:
ax.view_init(60, 35)
fig
Out[102]:
In [103]:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
In [104]:
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='viridis', edgecolor='none')
ax.set_title('surface');
In [105]:
r = np.linspace(0, 6, 20)
theta = np.linspace(-0.9 * np.pi, 0.8 * np.pi, 40)
r, theta = np.meshgrid(r, theta)

X = r * np.sin(theta)
Y = r * np.cos(theta)
Z = f(X, Y)

ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                cmap='viridis', edgecolor='none');
In [106]:
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)

ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5);
In [107]:
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z,
                cmap='viridis', edgecolor='none');

Geographic Data with Basemap

In [91]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

Once you have the Basemap toolkit installed and imported, geographic plots are just a few lines away (the graphics in the following also require the PIL package in Python 2, or the pillow package in Python 3):

In [92]:
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
In [93]:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6, 
            lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
In [94]:
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')
        
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
In [95]:
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
            lat_0=0, lon_0=0)
draw_map(m)
In [96]:
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
            lat_0=50, lon_0=0)
draw_map(m);
In [97]:
fig, ax = plt.subplots(1, 2, figsize=(12, 8))

for i, res in enumerate(['l', 'h']):
    m = Basemap(projection='gnom', lat_0=57.3, lon_0=-6.2,
                width=90000, height=120000, resolution=res, ax=ax[i])
    m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF')
    m.drawmapboundary(fill_color="#DDEEFF")
    m.drawcoastlines()
    ax[i].set_title("resolution='{0}'".format(res));
In [98]:
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')

# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values

# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', 
            lat_0=37.5, lon_0=-119,
            width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,
          c=np.log10(population), s=area,
          cmap='Reds', alpha=0.5)

# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)

# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a,
                label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
           labelspacing=1, loc='lower left');

Data processing and analysis with pandas

In [108]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
In [109]:
pd.__version__
Out[109]:
'0.18.1'
In [3]:
pd?

1. Series object

In [110]:
s = pd.Series([909976, 8615246, 2872086, 2273305])
In [111]:
s
Out[111]:
0     909976
1    8615246
2    2872086
3    2273305
dtype: int64
In [112]:
type(s)
Out[112]:
pandas.core.series.Series
In [113]:
s.dtype
Out[113]:
dtype('int64')
In [114]:
s.index
Out[114]:
RangeIndex(start=0, stop=4, step=1)
In [115]:
s.values
Out[115]:
array([ 909976, 8615246, 2872086, 2273305])
In [116]:
s.index = ["Stockholm", "London", "Rome", "Paris"]
In [117]:
s.name = "Population"
In [118]:
s
Out[118]:
Stockholm     909976
London       8615246
Rome         2872086
Paris        2273305
Name: Population, dtype: int64
In [119]:
s = pd.Series([909976, 8615246, 2872086, 2273305], 
              index=["Stockholm", "London", "Rome", "Paris"], name="Population")
In [120]:
s["London"]
Out[120]:
8615246
In [121]:
s.Stockholm
Out[121]:
909976
In [122]:
s[["Paris", "Rome"]]
Out[122]:
Paris    2273305
Rome     2872086
Name: Population, dtype: int64
In [123]:
s.median(), s.mean(), s.std(),s.min(), s.max()
Out[123]:
(2572695.5, 3667653.25, 3399048.5005155364, 909976, 8615246)
In [124]:
s.quantile(q=0.25), s.quantile(q=0.5), s.quantile(q=0.75)
Out[124]:
(1932472.75, 2572695.5, 4307876.0)
In [125]:
s.describe()
Out[125]:
count    4.000000e+00
mean     3.667653e+06
std      3.399049e+06
min      9.099760e+05
25%      1.932473e+06
50%      2.572696e+06
75%      4.307876e+06
max      8.615246e+06
Name: Population, dtype: float64
In [126]:
mpl.style.use('ggplot')
In [127]:
fig, axes = plt.subplots(1, 4, figsize=(12, 3))

s.plot(ax=axes[0], kind='line', title="line")
s.plot(ax=axes[1], kind='bar', title="bar")
s.plot(ax=axes[2], kind='box', title="box")
s.plot(ax=axes[3], kind='pie', title="pie")

fig.tight_layout()
fig.savefig("ch12-series-plot.pdf")
fig.savefig("ch12-series-plot.png")

2. DataFrame object

In [128]:
df = pd.DataFrame([[909976, 8615246, 2872086, 2273305],
                   ["Sweden", "United kingdom", "Italy", "France"]])
In [129]:
df
Out[129]:
        0               1        2        3
0  909976         8615246  2872086  2273305
1  Sweden  United kingdom    Italy   France
In [130]:
df = pd.DataFrame([[909976, "Sweden"],
                   [8615246, "United kingdom"], 
                   [2872086, "Italy"],
                   [2273305, "France"]])
In [131]:
df
Out[131]:
         0               1
0   909976          Sweden
1  8615246  United kingdom
2  2872086           Italy
3  2273305          France
In [132]:
df.index = ["Stockholm", "London", "Rome", "Paris"]
In [133]:
df.columns = ["Population", "State"]
In [134]:
df
Out[134]:
           Population           State
Stockholm      909976          Sweden
London        8615246  United kingdom
Rome          2872086           Italy
Paris         2273305          France
In [135]:
df = pd.DataFrame([[909976, "Sweden"],
                   [8615246, "United kingdom"], 
                   [2872086, "Italy"],
                   [2273305, "France"]],
                  index=["Stockholm", "London", "Rome", "Paris"],
                  columns=["Population", "State"])
df
Out[135]:
           Population           State
Stockholm      909976          Sweden
London        8615246  United kingdom
Rome          2872086           Italy
Paris         2273305          France
In [136]:
df = pd.DataFrame({"Population": [909976, 8615246, 2872086, 2273305],
                   "State": ["Sweden", "United kingdom", "Italy", "France"]},
                  index=["Stockholm", "London", "Rome", "Paris"])
df
Out[136]:
           Population           State
Stockholm      909976          Sweden
London        8615246  United kingdom
Rome          2872086           Italy
Paris         2273305          France
In [137]:
df.ix[:,'State']
Out[137]:
Stockholm            Sweden
London       United kingdom
Rome                  Italy
Paris                France
Name: State, dtype: object
In [138]:
df.index
Out[138]:
Index(['Stockholm', 'London', 'Rome', 'Paris'], dtype='object')
In [139]:
df.columns
Out[139]:
Index(['Population', 'State'], dtype='object')
In [140]:
df.values
Out[140]:
array([[909976, 'Sweden'],
       [8615246, 'United kingdom'],
       [2872086, 'Italy'],
       [2273305, 'France']], dtype=object)
In [141]:
df.Population
Out[141]:
Stockholm     909976
London       8615246
Rome         2872086
Paris        2273305
Name: Population, dtype: int64
In [142]:
type(df.Population)
Out[142]:
pandas.core.series.Series
In [143]:
df.Population.Stockholm
Out[143]:
909976
In [144]:
type(df.ix)
Out[144]:
pandas.core.indexing._IXIndexer
In [145]:
print (df.ix["Stockholm"])
print (type(df.ix["Stockholm"]))
Population    909976
State         Sweden
Name: Stockholm, dtype: object
<class 'pandas.core.series.Series'>
In [146]:
df.ix[["Paris", "Rome"]]
Out[146]:
       Population   State
Paris     2273305  France
Rome      2872086   Italy
In [147]:
df.ix[["Paris", "Rome"], "Population"]
Out[147]:
Paris    2273305
Rome     2872086
Name: Population, dtype: int64
In [148]:
df.mean()
Out[148]:
Population    3667653.25
dtype: float64
In [149]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Stockholm to Paris
Data columns (total 2 columns):
Population    4 non-null int64
State         4 non-null object
dtypes: int64(1), object(1)
memory usage: 96.0+ bytes
In [150]:
df.describe()
Out[150]:
         Population
count  4.000000e+00
mean   3.667653e+06
std    3.399049e+06
min    9.099760e+05
25%    1.932473e+06
50%    2.572696e+06
75%    4.307876e+06
max    8.615246e+06
In [151]:
df.dtypes
Out[151]:
Population     int64
State         object
dtype: object
In [152]:
df.head(3)
Out[152]:
           Population           State
Stockholm      909976          Sweden
London        8615246  United kingdom
Rome          2872086           Italy
In [155]:
!head -n5 european_cities.csv
Rank,City,State,Population,Date of census/estimate
1,London[2], United Kingdom,"8,615,246",1 June 2014
2,Berlin, Germany,"3,437,916",31 May 2014
3,Madrid, Spain,"3,165,235",1 January 2014
4,Rome, Italy,"2,872,086",30 September 2014

Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
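A caveat for readers on newer pandas: the ix indexer used in the DataFrame cells above was deprecated in pandas 0.20 and removed in 1.0. Its label-based uses translate directly to loc; a sketch, rebuilding the same cities DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Population": [909976, 8615246, 2872086, 2273305],
                   "State": ["Sweden", "United kingdom", "Italy", "France"]},
                  index=["Stockholm", "London", "Rome", "Paris"])

# df.ix[:, 'State']  ->  label-based column selection
print(df.loc[:, "State"])

# df.ix[["Paris", "Rome"], "Population"]  ->  label-based row/column selection
print(df.loc[["Paris", "Rome"], "Population"])
```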

In [156]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Out[156]:
1    a
3    b
5    c
dtype: object
In [157]:
# explicit index when indexing
data[1]
Out[157]:
'a'
In [158]:
# implicit index when slicing
data[1:3]
Out[158]:
3    b
5    c
dtype: object
In [159]:
data.loc[1]
Out[159]:
'a'
In [160]:
data.loc[1:3]
Out[160]:
1    a
3    b
dtype: object
In [161]:
data.iloc[1]
Out[161]:
'b'
In [162]:
data.iloc[1:3]
Out[162]:
3    b
5    c
dtype: object

Missing Data

In [163]:
vals1 = np.array([1, None, 3, 4])
vals1
Out[163]:
array([1, None, 3, 4], dtype=object)
In [164]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()
dtype = object
10 loops, best of 3: 104 ms per loop

dtype = int
100 loops, best of 3: 4.02 ms per loop

In [165]:
vals1.sum()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-165-749fd8ae6030> in <module>()
----> 1 vals1.sum()

/Users/lee/anaconda/lib/python3.5/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims)
     30 
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):
---> 32     return umr_sum(a, axis, dtype, out, keepdims)
     33 
     34 def _prod(a, axis=None, dtype=None, out=None, keepdims=False):

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
In [166]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype
Out[166]:
dtype('float64')
In [167]:
1 + np.nan
Out[167]:
nan
In [168]:
vals2.sum(), vals2.min(), vals2.max()
Out[168]:
(nan, nan, nan)
In [169]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Out[169]:
(8.0, 1.0, 4.0)
In [170]:
pd.Series([1, np.nan, 2, None])
Out[170]:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
In [171]:
x = pd.Series(range(2), dtype=int)
x
Out[171]:
0    0
1    1
dtype: int64
In [172]:
x[0] = None
x
Out[172]:
0    NaN
1    1.0
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value. (Be aware that there is a proposal to add a native integer NA to Pandas in the future; as of this writing, it has not been included).

While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and in my experience only rarely causes issues.
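Since this was written, pandas (0.24 and later) did add an opt-in nullable integer extension dtype, selected with the capitalized "Int64" name; a brief sketch:

```python
import pandas as pd

# the nullable extension dtype keeps integers intact alongside missing values
s = pd.Series([1, None, 3], dtype="Int64")
print(s.dtype)            # Int64, not float64
print(s.isna().tolist())  # [False, True, False]
```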

The following table lists the upcasting conventions in Pandas when NA values are introduced:

Typeclass   Conversion when storing NAs   NA sentinel value
floating    no change                     np.nan
object      no change                     None or np.nan
integer     cast to float64               np.nan
boolean     cast to object                None or np.nan

Keep in mind that in Pandas, string data is always stored with an object dtype.
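These conventions are easy to verify by constructing a Series with a missing value in each typeclass:

```python
import numpy as np
import pandas as pd

# floating: no conversion needed, NA stored as np.nan
print(pd.Series([1.0, np.nan, 3.0]).dtype)   # float64

# integer: upcast to float64 so np.nan can be stored
print(pd.Series([1, None, 3]).dtype)         # float64

# boolean: upcast to object, None kept as the sentinel
print(pd.Series([True, None, False]).dtype)  # object
```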

Detecting null values

Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:

In [173]:
data = pd.Series([1, np.nan, 'hello', None])
In [174]:
data.isnull()
Out[174]:
0    False
1     True
2    False
3     True
dtype: bool
In [175]:
data[data.notnull()]
Out[175]:
0        1
2    hello
dtype: object
In [176]:
data.dropna()
Out[176]:
0        1
2    hello
dtype: object
In [177]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
Out[177]:
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
In [178]:
data.fillna(0)
Out[178]:
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
In [179]:
# forward-fill
data.fillna(method='ffill')
Out[179]:
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
In [180]:
# back-fill
data.fillna(method='bfill')
Out[180]:
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

Hierarchical Indexing

In [181]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
Out[181]:
(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
In [182]:
pop[('California', 2010):('Texas', 2000)]
Out[182]:
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64
In [183]:
pop[[i for i in pop.index if i[1] == 2010]]
Out[183]:
(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64
In [184]:
index = pd.MultiIndex.from_tuples(index)
index
Out[184]:
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
In [185]:
pop = pop.reindex(index)
pop
Out[185]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [186]:
pop[:, 2010]
Out[186]:
California    37253956
New York      19378102
Texas         25145561
dtype: int64

In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

In [187]:
pop_df = pop.unstack()
pop_df
Out[187]:
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
In [188]:
pop_df.stack()
Out[188]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [189]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
Out[189]:
                     total  under18
California 2000   33871648  9267089
           2010   37253956  9284094
New York   2000   18976457  4687374
           2010   19378102  4318033
Texas      2000   20851820  5906301
           2010   25145561  6879014
In [190]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
Out[190]:
                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
In [191]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
Out[191]:
     data1     data2
a 1  0.284840  0.201477
  2  0.856525  0.874145
b 1  0.002510  0.315458
  2  0.490551  0.046693
In [192]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
Out[192]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
In [193]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Out[193]:
subject     Bob         Guido       Sue
type        HR    Temp  HR    Temp  HR    Temp
year visit
2013 1      49.0  39.3  36.0  37.7  27.0  36.0
     2      24.0  36.8  42.0  36.5  32.0  37.6
2014 1      38.0  38.4  12.0  37.5  43.0  36.8
     2      35.0  35.0  56.0  36.0  40.0  35.2
In [194]:
health_data['Guido', 'HR']
Out[194]:
year  visit
2013  1        36.0
      2        42.0
2014  1        12.0
      2        56.0
Name: (Guido, HR), dtype: float64
In [195]:
health_data.loc[:, ('Bob', 'HR')]
Out[195]:
year  visit
2013  1        49.0
      2        24.0
2014  1        38.0
      2        35.0
Name: (Bob, HR), dtype: float64

Combining Datasets: Concat and Append

In [196]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))
Out[196]:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
In [197]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
In [198]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
Out[198]:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [199]:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
Out[199]:
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
In [200]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
Out[200]:
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
In [201]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
Out[201]:

df1

A B
1 A1 B1
2 A2 B2

df2

A B
3 A3 B3
4 A4 B4

pd.concat([df1, df2])

A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
In [202]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")
Out[202]:

df3

A B
0 A0 B0
1 A1 B1

df4

C D
0 C0 D0
1 C1 D1

pd.concat([df3, df4], axis=1)

A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
In [203]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
Out[203]:

x

A B
0 A0 B0
1 A1 B1

y

A B
0 A2 B2
1 A3 B3

pd.concat([x, y])

A B
0 A0 B0
1 A1 B1
0 A2 B2
1 A3 B3
In [204]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
ValueError: Indexes have overlapping values: [0, 1]
In [205]:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
Out[205]:

x

A B
0 A0 B0
1 A1 B1

y

A B
0 A2 B2
1 A3 B3

pd.concat([x, y], ignore_index=True)

A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
In [206]:
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
Out[206]:

x

A B
0 A0 B0
1 A1 B1

y

A B
0 A2 B2
1 A3 B3

pd.concat([x, y], keys=['x', 'y'])

A B
x 0 A0 B0
1 A1 B1
y 0 A2 B2
1 A3 B3
In [207]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
Out[207]:

df5

A B C
1 A1 B1 C1
2 A2 B2 C2

df6

B C D
3 B3 C3 D3
4 B4 C4 D4

pd.concat([df5, df6])

A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
In [208]:
display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")
Out[208]:

df5

A B C
1 A1 B1 C1
2 A2 B2 C2

df6

B C D
3 B3 C3 D3
4 B4 C4 D4

pd.concat([df5, df6], join='inner')

B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
In [209]:
display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")
Out[209]:

df5

A B C
1 A1 B1 C1
2 A2 B2 C2

df6

B C D
3 B3 C3 D3
4 B4 C4 D4

pd.concat([df5, df6], join_axes=[df5.columns])

A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
In [210]:
display('df1', 'df2', 'df1.append(df2)')
Out[210]:

df1

A B
1 A1 B1
2 A2 B2

df2

A B
3 A3 B3
4 A4 B4

df1.append(df2)

A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4

Merge and Join

In [211]:
import pandas as pd
import numpy as np

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
In [212]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
Out[212]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df2

employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
In [213]:
df3 = pd.merge(df1, df2)
df3
Out[213]:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
In [214]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
Out[214]:

df3

employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

df4

group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve

pd.merge(df3, df4)

employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
In [215]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
Out[215]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df5

group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization

pd.merge(df1, df5)

employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization

The on keyword

In [216]:
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
Out[216]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df2

employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014

pd.merge(df1, df2, on='employee')

employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

The left_on and right_on keywords

In [217]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
Out[217]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df3

name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000

pd.merge(df1, df3, left_on="employee", right_on="name")

employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000
In [218]:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
Out[218]:
employee group salary
0 Bob Accounting 70000
1 Jake Engineering 80000
2 Lisa Engineering 120000
3 Sue HR 90000

The left_index and right_index keywords

In [219]:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
Out[219]:

df1a

group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
In [220]:
display('df1a', 'df2a',
        "pd.merge(df1a, df2a, left_index=True, right_index=True)")
Out[220]:

df1a

group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014

pd.merge(df1a, df2a, left_index=True, right_index=True)

group hire_date
employee
Lisa Engineering 2004
Bob Accounting 2008
Jake Engineering 2012
Sue HR 2014
In [221]:
display('df1a', 'df2a', 'df1a.join(df2a)')
Out[221]:

df1a

group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014

df1a.join(df2a)

group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
In [222]:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
Out[222]:

df1a

group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df3

name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000

pd.merge(df1a, df3, left_index=True, right_on='name')

group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000

Specifying Set Arithmetic for Joins

In [223]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
Out[223]:

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7)

name food drink
0 Mary bread wine
In [224]:
pd.merge(df6, df7, how='inner')
Out[224]:
name food drink
0 Mary bread wine
In [225]:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
Out[225]:

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7, how='outer')

name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
In [226]:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
Out[226]:

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7, how='left')

name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
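For symmetry, `how='right'` keeps all the keys of the right input instead. A minimal sketch, with `df6` and `df7` reconstructed so the cell is self-contained:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

# right join: every row of df7 is kept; Joseph has no food entry, so NaN
merged = pd.merge(df6, df7, how='right')
print(merged)
```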

Aggregation and Grouping

In [227]:
import numpy as np
import pandas as pd

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
In [228]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Out[228]:
(1035, 6)
In [229]:
planets.head()
Out[229]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
In [230]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
Out[230]:
0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64
In [231]:
planets.dropna().describe()
Out[231]:
number orbital_period mass distance year
count 498.00000 498.000000 498.000000 498.000000 498.000000
mean 1.73494 835.778671 2.509320 52.068213 2007.377510
std 1.17572 1469.128259 3.636274 46.596041 4.167284
min 1.00000 1.328300 0.003600 1.350000 1989.000000
25% 1.00000 38.272250 0.212500 24.497500 2005.000000
50% 1.00000 357.000000 1.245000 39.940000 2009.000000
75% 2.00000 999.600000 2.867500 59.332500 2011.000000
max 6.00000 17337.500000 25.000000 354.000000 2014.000000

The following table summarizes some other built-in Pandas aggregations:

Aggregation Description
count() Total number of items
first(), last() First and last item
mean(), median() Mean and median
min(), max() Minimum and maximum
std(), var() Standard deviation and variance
mad() Mean absolute deviation
prod() Product of all items
sum() Sum of all items

These are all methods of DataFrame and Series objects.
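The aggregations in the table can be tried directly on a small Series; a minimal sketch:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0, 4.0])

# a few of the built-in aggregations, called as Series methods
print(ser.count())               # total number of items
print(ser.mean(), ser.median())  # mean and median
print(ser.prod())                # product of all items
print(ser.min(), ser.max())      # minimum and maximum
```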

Split, apply, combine

A canonical example of this split-apply-combine operation, where the "apply" step is a summation aggregation: the data is split into groups by a key, an aggregate is computed for each group independently, and the per-group results are combined into a new table.
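The three steps can be written out by hand (a sketch using the same toy DataFrame as the next cell), although groupby does all of this in one call:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# split: partition the rows by the value of 'key'
groups = {key: grp for key, grp in df.groupby('key')}

# apply: compute a sum for each group independently
sums = {key: grp['data'].sum() for key, grp in groups.items()}

# combine: reassemble the per-group results into a single Series
result = pd.Series(sums)
print(result)
```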

In [232]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
Out[232]:
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
In [233]:
df.groupby('key')
Out[233]:
<pandas.core.groupby.DataFrameGroupBy object at 0x11761ec88>
In [234]:
df.groupby('key').sum()
Out[234]:
data
key
A 3
B 5
C 7
In [235]:
planets.groupby('method')['orbital_period'].median()
Out[235]:
method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
In [236]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))
Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
In [237]:
planets.groupby('method')['year'].describe().unstack()
Out[237]:
count mean std min 25% 50% 75% max
method
Astrometry 2.0 2011.500000 2.121320 2010.0 2010.75 2011.5 2012.25 2013.0
Eclipse Timing Variations 9.0 2010.000000 1.414214 2008.0 2009.00 2010.0 2011.00 2012.0
Imaging 38.0 2009.131579 2.781901 2004.0 2008.00 2009.0 2011.00 2013.0
Microlensing 23.0 2009.782609 2.859697 2004.0 2008.00 2010.0 2012.00 2013.0
Orbital Brightness Modulation 3.0 2011.666667 1.154701 2011.0 2011.00 2011.0 2012.00 2013.0
Pulsar Timing 5.0 1998.400000 8.384510 1992.0 1992.00 1994.0 2003.00 2011.0
Pulsation Timing Variations 1.0 2007.000000 NaN 2007.0 2007.00 2007.0 2007.00 2007.0
Radial Velocity 553.0 2007.518987 4.249052 1989.0 2005.00 2009.0 2011.00 2014.0
Transit 397.0 2011.236776 2.077867 2002.0 2010.00 2012.0 2013.00 2014.0
Transit Timing Variations 4.0 2012.500000 1.290994 2011.0 2011.75 2012.5 2013.25 2014.0

Pivot Tables

In [238]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
In [239]:
titanic.head()
Out[239]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

This contains a wealth of information on each passenger of that ill-fated voyage, including gender, age, class, fare paid, and much more.

In [240]:
titanic.groupby('sex')[['survived']].mean()
Out[240]:
survived
sex
female 0.742038
male 0.188908
In [241]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
Out[241]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
In [242]:
titanic.pivot_table('survived', index='sex', columns='class')
Out[242]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
In [243]:
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
Out[243]:
class            First     Second    Third
sex    age
female (0, 18]   0.909091  1.000000  0.511628
       (18, 80]  0.972973  0.900000  0.423729
male   (0, 18]   0.800000  0.600000  0.215686
       (18, 80]  0.375000  0.071429  0.133663
In [244]:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
Out[244]:
fare             [0, 14.454]                (14.454, 512.329]
class            First  Second    Third     First     Second    Third
sex    age
female (0, 18]   NaN    1.000000  0.714286  0.909091  1.000000  0.318182
       (18, 80]  NaN    0.880000  0.444444  0.972973  0.914286  0.391304
male   (0, 18]   NaN    0.000000  0.260870  0.800000  0.818182  0.178571
       (18, 80]  0.0    0.098039  0.125000  0.391304  0.030303  0.192308

Additional pivot table options

The full call signature of the pivot_table method of DataFrames is as follows:

# call signature as of Pandas 0.18
DataFrame.pivot_table(data, values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None, margins=False,
                      dropna=True, margins_name='All')

The aggfunc keyword controls which aggregation is applied; the default is a mean. As with GroupBy, the aggregation can be specified either as a string naming one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or as a function that implements an aggregation (np.sum, min, sum, etc.).

In [245]:
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived':sum, 'fare':'mean'})
Out[245]:
        survived           fare
class   First Second Third First       Second     Third
sex
female  91.0  70.0   72.0  106.125798  21.970121  16.118810
male    45.0  17.0   47.0  67.226127   19.741782  12.661633
In [246]:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[246]:
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838

3. Larger dataset example

In [247]:
df_pop = pd.read_csv("data/european_cities.csv")
In [248]:
df_pop.head()
Out[248]:
Rank City State Population Date of census/estimate
0 1 London[2] United Kingdom 8,615,246 1 June 2014
1 2 Berlin Germany 3,437,916 31 May 2014
2 3 Madrid Spain 3,165,235 1 January 2014
3 4 Rome Italy 2,872,086 30 September 2014
4 5 Paris France 2,273,305 1 January 2013
In [249]:
type(df_pop.Population[0])
Out[249]:
str
In [250]:
df_pop = pd.read_csv("data/european_cities.csv", delimiter=",", encoding=None, header=0)
In [251]:
df_pop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
Rank                       105 non-null int64
City                       105 non-null object
State                      105 non-null object
Population                 105 non-null object
Date of census/estimate    105 non-null object
dtypes: int64(1), object(4)
memory usage: 4.2+ KB
In [252]:
df_pop.head()
Out[252]:
Rank City State Population Date of census/estimate
0 1 London[2] United Kingdom 8,615,246 1 June 2014
1 2 Berlin Germany 3,437,916 31 May 2014
2 3 Madrid Spain 3,165,235 1 January 2014
3 4 Rome Italy 2,872,086 30 September 2014
4 5 Paris France 2,273,305 1 January 2013
In [253]:
type(df_pop.Population[0])
Out[253]:
str
In [254]:
df_pop["NumericPopulation"] = df_pop.Population.apply(lambda x: int(x.replace(",", "")))
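The same conversion can also be done without apply, using pandas' vectorized string methods. A sketch on sample values shaped like the Population column:

```python
import pandas as pd

s = pd.Series(["8,615,246", "3,437,916", "3,165,235"])

# vectorized equivalent of the apply/lambda above:
# strip the thousands separators, then cast to integers
numeric = s.str.replace(",", "", regex=False).astype(int)
print(numeric.tolist())  # [8615246, 3437916, 3165235]
```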
In [255]:
df_pop["State"][0:3]
Out[255]:
0     United Kingdom
1            Germany
2              Spain
Name: State, dtype: object
In [256]:
df_pop["State"] = df_pop["State"].apply(lambda x: x.strip())  # strip() removes leading/trailing whitespace by default
In [257]:
df_pop.head()
Out[257]:
Rank City State Population Date of census/estimate NumericPopulation
0 1 London[2] United Kingdom 8,615,246 1 June 2014 8615246
1 2 Berlin Germany 3,437,916 31 May 2014 3437916
2 3 Madrid Spain 3,165,235 1 January 2014 3165235
3 4 Rome Italy 2,872,086 30 September 2014 2872086
4 5 Paris France 2,273,305 1 January 2013 2273305
In [258]:
df_pop.dtypes
Out[258]:
Rank                        int64
City                       object
State                      object
Population                 object
Date of census/estimate    object
NumericPopulation           int64
dtype: object
In [259]:
df_pop2 = df_pop.set_index("City")
In [260]:
df_pop2 = df_pop2.sort_index()
In [261]:
df_pop2.head()
Out[261]:
Rank State Population Date of census/estimate NumericPopulation
City
Aarhus 92 Denmark 326,676 1 October 2014 326676
Alicante 86 Spain 334,678 1 January 2012 334678
Amsterdam 23 Netherlands 813,562 31 May 2014 813562
Antwerp 59 Belgium 510,610 1 January 2014 510610
Athens 34 Greece 664,046 24 May 2011 664046
In [262]:
df_pop3 = df_pop.set_index(["State", "City"]).sortlevel(0)
In [263]:
df_pop3.head(7)
Out[263]:
                       Rank  Population  Date of census/estimate  NumericPopulation
State    City
Austria  Vienna           7  1,794,770   1 January 2015           1794770
Belgium  Antwerp         59  510,610     1 January 2014           510610
         Brussels[17]    16  1,175,831   1 January 2014           1175831
Bulgaria Plovdiv         84  341,041     31 December 2013         341041
         Sofia           14  1,291,895   14 December 2014         1291895
         Varna           85  335,819     31 December 2013         335819
Croatia  Zagreb          24  790,017     31 March 2011            790017
In [264]:
df_pop3.ix["Sweden"]
Out[264]:
Rank Population Date of census/estimate NumericPopulation
City
Gothenburg 53 528,014 31 March 2013 528014
Malmö 102 309,105 31 March 2013 309105
Stockholm 20 909,976 31 January 2014 909976
In [265]:
df_pop.ix[df_pop.State == 'Sweden']
Out[265]:
Rank City State Population Date of census/estimate NumericPopulation
19 20 Stockholm Sweden 909,976 31 January 2014 909976
52 53 Gothenburg Sweden 528,014 31 March 2013 528014
101 102 Malmö Sweden 309,105 31 March 2013 309105
In [266]:
df_pop3.ix[("Sweden", "Gothenburg")]
Out[266]:
Rank                                  53
Population                       528,014
Date of census/estimate    31 March 2013
NumericPopulation                 528014
Name: (Sweden, Gothenburg), dtype: object
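Note that .ix was deprecated and later removed from pandas; for readers on newer versions, the same lookups can be written with .loc. A sketch, with the Swedish rows reconstructed by hand:

```python
import pandas as pd

df = pd.DataFrame(
    {"NumericPopulation": [528014, 309105, 909976]},
    index=pd.MultiIndex.from_tuples(
        [("Sweden", "Gothenburg"), ("Sweden", "Malmö"), ("Sweden", "Stockholm")],
        names=["State", "City"]))

print(df.loc["Sweden"])                  # all rows under the 'Sweden' level
print(df.loc[("Sweden", "Gothenburg")])  # a single row, returned as a Series
```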
In [267]:
df_pop.set_index("City").sort_values(["State", "NumericPopulation"], ascending=[False, True]).head()
Out[267]:
Rank State Population Date of census/estimate NumericPopulation
City
Nottingham 103 United Kingdom 308,735 30 June 2012 308735
Wirral 97 United Kingdom 320,229 30 June 2012 320229
Coventry 94 United Kingdom 323,132 30 June 2012 323132
Wakefield 91 United Kingdom 327,627 30 June 2012 327627
Leicester 87 United Kingdom 331,606 30 June 2012 331606
In [268]:
city_counts = df_pop.State.value_counts()
In [269]:
city_counts.name = "# cities in top 105"
In [270]:
df_pop3 = df_pop[["State", "City", "NumericPopulation"]].set_index(["State", "City"])
In [271]:
df_pop3.head()
Out[271]:
                          NumericPopulation
State          City
United Kingdom London[2]  8615246
Germany        Berlin     3437916
Spain          Madrid     3165235
Italy          Rome       2872086
France         Paris      2273305
In [272]:
df_pop4=df_pop3.sum(level="State").sort_values("NumericPopulation", ascending=False)
df_pop4.head()
Out[272]:
NumericPopulation
State
United Kingdom 16011877
Germany 15119548
Spain 10041639
Italy 8764067
Poland 6267409
In [275]:
df_pop5 = df_pop.drop("Rank", axis=1).groupby("State").sum().sort_values("NumericPopulation", ascending=False)
In [276]:
df_pop5.head()
Out[276]:
NumericPopulation
State
United Kingdom 16011877
Germany 15119548
Spain 10041639
Italy 8764067
Poland 6267409
In [277]:
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

city_counts.plot(kind='barh', ax=ax1)
ax1.set_xlabel("# cities in top 105")
df_pop5.NumericPopulation.plot(kind='barh', ax=ax2)
ax2.set_xlabel("Total pop. in top 105 cities")

fig.tight_layout()
fig.savefig("ch12-state-city-counts-sum.pdf")

4.Time series

Basics

In [278]:
import datetime
In [279]:
pd.date_range("2015-1-1", periods=31)
Out[279]:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
               '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
               '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',
               '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',
               '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
               '2015-01-29', '2015-01-30', '2015-01-31'],
              dtype='datetime64[ns]', freq='D')
In [280]:
pd.date_range(datetime.datetime(2015, 1, 1), periods=31)
Out[280]:
DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
               '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
               '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',
               '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',
               '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
               '2015-01-29', '2015-01-30', '2015-01-31'],
              dtype='datetime64[ns]', freq='D')
In [281]:
pd.date_range("2015-1-1 00:00", "2015-1-1 12:00", freq="H")
Out[281]:
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 01:00:00',
               '2015-01-01 02:00:00', '2015-01-01 03:00:00',
               '2015-01-01 04:00:00', '2015-01-01 05:00:00',
               '2015-01-01 06:00:00', '2015-01-01 07:00:00',
               '2015-01-01 08:00:00', '2015-01-01 09:00:00',
               '2015-01-01 10:00:00', '2015-01-01 11:00:00',
               '2015-01-01 12:00:00'],
              dtype='datetime64[ns]', freq='H')
In [282]:
ts1 = pd.Series(np.arange(31), index=pd.date_range("2015-1-1", periods=31))
In [283]:
ts1["2015-1-3"]
Out[283]:
2
In [284]:
ts1.index[2]
Out[284]:
Timestamp('2015-01-03 00:00:00', offset='D')
In [285]:
ts1.index[2].year, ts1.index[2].month, ts1.index[2].day
Out[285]:
(2015, 1, 3)
In [286]:
ts1.index[2].to_pydatetime()
Out[286]:
datetime.datetime(2015, 1, 3, 0, 0)
In [287]:
ts2 = pd.Series(np.random.rand(2), 
                index=[datetime.datetime(2015, 1, 1), datetime.datetime(2015, 2, 1)])
In [288]:
ts2
Out[288]:
2015-01-01    0.160353
2015-02-01    0.484760
dtype: float64
In [289]:
periods = pd.PeriodIndex([pd.Period('2015-01'), pd.Period('2015-02'), pd.Period('2015-03')])
In [290]:
ts3 = pd.Series(np.random.rand(3), periods)
In [291]:
ts3
Out[291]:
2015-01    0.227778
2015-02    0.243639
2015-03    0.964497
Freq: M, dtype: float64
In [292]:
ts3.index
Out[292]:
PeriodIndex(['2015-01', '2015-02', '2015-03'], dtype='int64', freq='M')
In [293]:
ts2.to_period('M')
Out[293]:
2015-01    0.160353
2015-02    0.484760
Freq: M, dtype: float64

Temperature time series example

In [295]:
!head -n 5 data/temperature_outdoor_2014.tsv
1388530986	4.380000
1388531586	4.250000
1388532187	4.190000
1388532787	4.060000
1388533388	4.060000
In [296]:
!wc -l data/temperature_outdoor_2014.tsv
   49548 data/temperature_outdoor_2014.tsv
In [297]:
df1 = pd.read_csv('data/temperature_outdoor_2014.tsv', delimiter="\t", names=["time", "outdoor"])
In [298]:
df2 = pd.read_csv('data/temperature_indoor_2014.tsv', delimiter="\t", names=["time", "indoor"])
In [299]:
df1.head()
Out[299]:
time outdoor
0 1388530986 4.38
1 1388531586 4.25
2 1388532187 4.19
3 1388532787 4.06
4 1388533388 4.06
In [300]:
df2.head()
Out[300]:
time indoor
0 1388530986 21.94
1 1388531586 22.00
2 1388532187 22.00
3 1388532787 22.00
4 1388533388 22.00
In [301]:
df1.time = (pd.to_datetime(df1.time.values, unit="s")
              .tz_localize('UTC').tz_convert('Europe/Stockholm'))
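The conversion chain above works on any sequence of Unix timestamps; a minimal sketch using the first two values from the file:

```python
import pandas as pd

# seconds since the epoch -> timezone-aware DatetimeIndex:
# interpret the raw values as UTC, then convert to local Stockholm time
ts = pd.to_datetime([1388530986, 1388531586], unit="s")
ts = ts.tz_localize("UTC").tz_convert("Europe/Stockholm")
print(str(ts[0]))  # 2014-01-01 00:03:06+01:00
```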
In [302]:
df1 = df1.set_index("time")
In [303]:
df2.time = (pd.to_datetime(df2.time.values, unit="s")
              .tz_localize('UTC').tz_convert('Europe/Stockholm'))
In [304]:
df2 = df2.set_index("time")
In [305]:
df1.head()
Out[305]:
outdoor
time
2014-01-01 00:03:06+01:00 4.38
2014-01-01 00:13:06+01:00 4.25
2014-01-01 00:23:07+01:00 4.19
2014-01-01 00:33:07+01:00 4.06
2014-01-01 00:43:08+01:00 4.06
In [306]:
df1.index[0]
Out[306]:
Timestamp('2014-01-01 00:03:06+0100', tz='Europe/Stockholm')
In [307]:
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
df1.plot(ax=ax)
df2.plot(ax=ax)

fig.tight_layout()

Select January data

In [308]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49548 entries, 2014-01-01 00:03:06+01:00 to 2014-12-30 23:56:35+01:00
Data columns (total 1 columns):
outdoor    49548 non-null float64
dtypes: float64(1)
memory usage: 774.2 KB
In [309]:
df1_jan = df1[(df1.index > "2014-1-1") & (df1.index < "2014-2-1")]
In [310]:
df1.index < "2014-2-1"
Out[310]:
array([ True,  True,  True, ..., False, False, False], dtype=bool)
In [311]:
df1_jan.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4452 entries, 2014-01-01 00:03:06+01:00 to 2014-01-31 23:56:58+01:00
Data columns (total 1 columns):
outdoor    4452 non-null float64
dtypes: float64(1)
memory usage: 69.6 KB
In [312]:
df2_jan = df2["2014-1-1":"2014-1-31"]
In [313]:
fig, ax = plt.subplots(1, 1, figsize=(12, 4))

df1_jan.plot(ax=ax)
df2_jan.plot(ax=ax)

fig.tight_layout()

Group by month

In [314]:
df_month = pd.concat([df.to_period("M").groupby(level=0).mean() for df in [df1, df2]], axis=1)
In [315]:
df_month.head(3)
Out[315]:
outdoor indoor
time
2014-01 -1.776646 19.862590
2014-02 2.231613 20.231507
2014-03 4.615437 19.597748
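The one-liner in In [314] compresses several steps; unpacked for clarity (a sketch on synthetic data, since the sensor files are not included here):

```python
import numpy as np
import pandas as pd

# synthetic stand-ins for df1/df2 (hypothetical values, not the real readings)
idx = pd.date_range("2014-01-01", periods=90, freq="D")
df1 = pd.DataFrame({"outdoor": np.linspace(-2.0, 5.0, 90)}, index=idx)
df2 = pd.DataFrame({"indoor": np.linspace(19.0, 21.0, 90)}, index=idx)

monthly = []
for df in [df1, df2]:
    m = df.to_period("M")          # DatetimeIndex -> monthly PeriodIndex
    m = m.groupby(level=0).mean()  # average all rows within each month
    monthly.append(m)

df_month = pd.concat(monthly, axis=1)  # align the two columns month by month
print(df_month)  # three rows: 2014-01, 2014-02, 2014-03
```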
In [316]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df_month.plot(kind='bar', ax=axes[0])
df_month.plot(kind='box', ax=axes[1])

fig.tight_layout()
In [317]:
df_month
Out[317]:
outdoor indoor
time
2014-01 -1.776646 19.862590
2014-02 2.231613 20.231507
2014-03 4.615437 19.597748
2014-04 8.105193 22.149754
2014-05 12.261396 26.332160
2014-06 15.586955 28.687491
2014-07 20.780314 30.605333
2014-08 16.494823 28.099068
2014-09 12.823905 26.950366
2014-10 9.352000 23.379460
2014-11 4.992142 20.610365
2014-12 -0.058940 16.465674

Assignments - 3

1. Computing primes

Compute the sum of all primes below 100 in two different ways: with an explicit loop, and with vectorized operations.

Hint: the primes below 10 are 2, 3, 5, and 7. To find the remaining primes below 100, sieve out the numbers from 10 to 100 that are divisible by any of 2, 3, 5, or 7; what remains are the primes you are looking for. To sum the primes, you can set the non-primes to 0 and then sum the array.

2. Random walk

  • Simulate a drunkard's random walk in two dimensions. Set up a coordinate system with the drunkard's starting position as the origin (0, 0), the back-to-front direction as the X axis, and the left-to-right direction as the Y axis.
  • Plot the resulting two-dimensional random walk.

3. Trapezoidal rule

  • Use the trapezoidal rule to numerically integrate a quadratic function. Start by writing a function trapzf(f, a, b, npts=100) that approximates the definite integral of f over the interval [a, b] using the formula: $\int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(f(x_{i})+f(x_{i-1})\right).$
  • Plot the quadratic function together with the trapezoids used in the integration.

4. Explore the interact function in the ipywidgets module and use it to draw an interactive function plot.

5. Using pandas, download historical data for the Shanghai Composite Index over some period and explore the dataset.

  • Pandas can fetch Yahoo Finance data directly, which is very convenient. Note that, as officially advised, from pandas.io import data, wb should be replaced with from pandas_datareader import data, wb.
    Pandas for finance documentation
    Shanghai Composite Index: 000001.SS.

6. Scrape the earthquake data from http://data.earthquake.cn/datashare/globeEarthquake_csn.html and analyze whatever aspects interest you.

7. Based on the QQ group data (qqdata.csv), analyze whatever aspects interest you.