While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary length integers, etc) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library have facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), nor any data visualization facilities.
In particular, Python lists are very flexible containers that can be nested arbitrarily deep and which can hold any Python object in them, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices. In contrast, much of our modern heritage of scientific computing has been built on top of libraries written in the Fortran language, which has native support for vectors and matrices as well as a library of mathematical functions that can efficiently operate on entire arrays at once.
We now turn our attention to the Numpy library, which forms the base layer for the entire 'scipy ecosystem'. Once you have installed numpy, you can import it as
import numpy
numpy.__version__
For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later.
By convention, you'll find that most people in the SciPy/PyData world will import NumPy using np as an alias:
import numpy as np
np?
lst = [10, 20, 30, 40]
type(lst)
arr = np.array([10, 20, 30, 40])
arr
L = list(range(10))
L
type(L[0])
Or, similarly, a list of strings:
L2 = [str(c) for c in L]
L2
type(L2[0])
Because of Python's dynamic typing, we can even create heterogeneous lists:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]
Python offers several different options for storing data in efficient, fixed-type data buffers.
The built-in array module (available since Python 3.3) can be used to create dense arrays of a uniform type:
import array
L = list(range(10))
A = array.array('i', L)
A
# integer array:
np.array([1, 4, 2, 5, 3])
np.array([1, 2, 3, 4], dtype='float32')
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))
# Create a 3x3 identity matrix
np.eye(3)
# Create an uninitialized array of three integers
# The values will be whatever happens to already exist at that memory location
np.empty(3)
NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.
The standard NumPy data types are listed in the following table. Note that when constructing an array, they can be specified using a string:
np.zeros(10, dtype='int16')
Or using the associated NumPy object:
np.zeros(10, dtype=np.int16)
| Data type | Description |
|---|---|
bool_ |
Boolean (True or False) stored as a byte |
int_ |
Default integer type (same as C long; normally either int64 or int32) |
intc |
Identical to C int (normally int32 or int64) |
intp |
Integer used for indexing (same as C ssize_t; normally either int32 or int64) |
int8 |
Byte (-128 to 127) |
int16 |
Integer (-32768 to 32767) |
int32 |
Integer (-2147483648 to 2147483647) |
int64 |
Integer (-9223372036854775808 to 9223372036854775807) |
uint8 |
Unsigned integer (0 to 255) |
uint16 |
Unsigned integer (0 to 65535) |
uint32 |
Unsigned integer (0 to 4294967295) |
uint64 |
Unsigned integer (0 to 18446744073709551615) |
float_ |
Shorthand for float64. |
float16 |
Half precision float: sign bit, 5 bits exponent, 10 bits mantissa |
float32 |
Single precision float: sign bit, 8 bits exponent, 23 bits mantissa |
float64 |
Double precision float: sign bit, 11 bits exponent, 52 bits mantissa |
complex_ |
Shorthand for complex128. |
complex64 |
Complex number, represented by two 32-bit floats |
complex128 |
Complex number, represented by two 64-bit floats |
More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the NumPy documentation.
Elements of a one-dimensional array are accessed with the same syntax as a list:
lst[0]
arr[0]
arr[-1]
arr[2:]
The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:
lst[-1] = 'a string inside a list'
lst
but the same can not be done with an array, as we get an error message:
arr[-1] = 'a string inside an array'
The information about the type of an array is contained in its dtype attribute:
arr
np.mean(arr)
arr.mean()
?arr.dtype
Once an array has been created, its dtype is fixed and it can only store elements of the same type. For this example where the dtype is integer, if we store a floating point number it will be automatically converted into an integer:
arr[-1] = 1.234
arr
?np.zeros
np.zeros(5, dtype=float)
np.zeros(3, dtype=int)
np.zeros(3, dtype=complex)
and similarly for ones:
print('5 ones:', np.ones(5))
If we want an array initialized with an arbitrary value, we can create an empty array and then use the fill method to put the value we want into the array:
a = np.empty(4)
a.fill(5.5)
a
Numpy also offers the arange function, which works like the builtin range but returns an array instead of a list:
np.arange(5)
np.arange(1,100,3)
and the linspace and logspace functions to create linearly and logarithmically-spaced grids respectively, with a fixed number of points and including both ends of the specified interval:
print("A linear grid between 0 and 1:")
print(np.linspace(0, 1, 5))
print("A logarithmic grid between 10**1 and 10**3:")
print(np.logspace(1, 3, 4))
np.randomx = np.random.randn(100)
np.random.randn(5)
whereas this will also give 5 samples, but from a normal distribution with a mean of 10 and a variance of 3:
norm10 = np.random.normal(10, 3, 5)
norm10
bin=np.random.binomial(10,0.4,50)
bin
Above we saw how to index arrays with single numbers and slices, just like Python lists. But arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particluarly useful to extract information from an array that matches a certain condition.
Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is true or false:
mask = norm10 > 9
mask
Now that we have this mask, we can use it to either read those values or to reset them to 0:
print('Values above 9:', norm10[mask])
print('Resetting all values above 9 to 0...')
norm10[mask] = 0
print(norm10)
lst2 = [[1, 2], [3, 4]]
lst2
arr2 = np.array([[1, 2], [3, 4]])
arr2
print(lst2[0][1])
print(arr2[0,1])
import numpy as np
np.zeros((2,3))
arr = np.random.normal(10, 3, (2, 4))
arr
arr.reshape(4,2)
In fact, the shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:
arr = np.arange(8).reshape(2, 4)
arr
Note that reshaping (like most numpy operations) provides a view of the same memory:
arr = np.arange(8)
arr2 = arr.reshape(2, 4)
arr[3] = 1000
print(arr)
print(arr2)
This lack of copying allows for very efficient vectorized operations.
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the copy() method:
arr3_copy = arr2[:2,:2].copy()
print (arr3_copy)
arr3_copy[0,0] = 33
print (arr3_copy)
print (arr2)
With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions:
print('Slicing in the second row:', arr2[1, 2:4])
print('All rows, third column :', arr2[:, 2])
If you only provide one index, then you will get an array with one less dimension containing that row:
print('First row: ', arr2[0])
print('Second row: ', arr2[1])
arr = arr2
print('Data type :', arr.dtype)
print('Total number of elements :', arr.size)
print('Number of dimensions :', arr.ndim)
print('Shape (dimensionality) :', arr.shape)
print('Memory used (in bytes) :', arr.nbytes)
Arrays also have many useful methods, some especially useful ones are:
print('Minimum and maximum :', arr.min(), arr.max())
print('Sum and product of all elements :', arr.sum(), arr.prod())
print('Mean and standard deviation :', arr.mean(), arr.std())
For these methods, the above operations area all computed on all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension, by passing the axis parameter; for example:
?arr.sum
?np.sum
?arr.mean
arr
print('For the following array:\n', arr)
print('The sum of elements along the rows is :', arr.sum(axis=1))
print('The sum of elements along the columns is :', arr.sum(axis=0))
np.zeros((3,4,5,6)).sum(2).shape
Another widely used property of arrays is the .T attribute, which allows you to access the transpose of the array:
print('Array:\n', arr)
print('Transpose:\n', arr.T)
All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays.
Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines np.concatenate, np.vstack, and np.hstack.
np.concatenate takes a tuple or list of arrays as its first argument.
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
grid = np.array([[1, 2, 3],
[4, 5, 6]])
# concatenate along the first axis
np.concatenate([grid, grid])
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])
# horizontally stack the arrays
y = np.array([[99],
[99]])
np.hstack([grid, y])
Similary, np.dstack will stack arrays along the third axis.
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit.
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
grid = np.arange(16).reshape((4, 4))
grid
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
left, right = np.hsplit(grid, [2])
print(left)
print(right)
Similarly, np.dsplit will split arrays along the third axis.
We don't have time here to look at all the methods and properties of arrays, here's a complete list. Simply try exploring some of these IPython to learn more, or read their description in the full Numpy documentation:
arr.T arr.copy arr.getfield arr.put arr.squeeze
arr.all arr.ctypes arr.imag arr.ravel arr.std
arr.any arr.cumprod arr.item arr.real arr.strides
arr.argmax arr.cumsum arr.itemset arr.repeat arr.sum
arr.argmin arr.data arr.itemsize arr.reshape arr.swapaxes
arr.argsort arr.diagonal arr.max arr.resize arr.take
arr.astype arr.dot arr.mean arr.round arr.tofile
arr.base arr.dtype arr.min arr.searchsorted arr.tolist
arr.byteswap arr.dump arr.nbytes arr.setasflat arr.tostring
arr.choose arr.dumps arr.ndim arr.setfield arr.trace
arr.clip arr.fill arr.newbyteorder arr.setflags arr.transpose
arr.compress arr.flags arr.nonzero arr.shape arr.var
arr.conj arr.flat arr.prod arr.size arr.view
arr.conjugate arr.flatten arr.ptp arr.sort
Arrays support all regular arithmetic operators, and the numpy library also contains a complete collection of basic mathematical functions that operate on arrays.
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print(arr1, '+', arr2, '=', arr1+arr2)
print(arr1, '*', arr2, '=', arr1*arr2)
1.5 * arr1
import numpy as np
np.random.seed(0)
def compute_reciprocals(values):
output = np.empty(len(values))
for i in range(len(values)):
output[i] = 1.0 / values[i]
return output
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)
print(compute_reciprocals(values))
print(1.0 / values)
The following table lists the arithmetic operators implemented in NumPy:
| Operator | Equivalent ufunc | Description |
|---|---|---|
+ |
np.add |
Addition (e.g., 1 + 1 = 2) |
- |
np.subtract |
Subtraction (e.g., 3 - 2 = 1) |
- |
np.negative |
Unary negation (e.g., -2) |
* |
np.multiply |
Multiplication (e.g., 2 * 3 = 6) |
/ |
np.divide |
Division (e.g., 3 / 2 = 1.5) |
// |
np.floor_divide |
Floor division (e.g., 3 // 2 = 1) |
** |
np.power |
Exponentiation (e.g., 2 ** 3 = 8) |
% |
np.mod |
Modulus/remainder (e.g., 9 % 4 = 1) |
x = np.array([-2, -1, 0, 1, 2])
abs(x)
np.absolute(x)
np.abs(x)
x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
np.abs(x)
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist are the trigonometric functions.
theta = np.linspace(0, np.pi, 3)
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
Another common type of operation available in a NumPy ufunc are the exponentials:
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
NumPy has many more ufuncs available, including hyperbolic trig functions, bitwise arithmetic, comparison operators, conversions from radians to degrees, rounding and remainders, and much more.
Another excellent source for more specialized and obscure ufuncs is the submodule scipy.special.
If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special.
from scipy import special
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
# Error function (integral of Gaussian)
# its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
Many NumPy users make use of ufuncs without ever learning their full set of features.
For large calculations, it is sometimes useful to be able to specify the array where the result of the calculation will be stored.
Rather than creating a temporary array, this can be used to write computation results directly to the memory location where you'd like them to be.
For all ufuncs, this can be done using the out argument of the function:
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
x = np.arange(1, 6)
np.add.reduce(x)
np.multiply.reduce(x)
np.add.accumulate(x)
np.multiply.accumulate(x)
x = np.arange(1, 6)
np.multiply.outer(x, x)
This is a first example of broadcasting
While this means that in principle arrays must always match in their dimensionality in order for an operation to be valid, numpy will broadcast dimensions when possible. Here is an example of broadcasting a scalar to a 1D array:
print(np.arange(3))
print(np.arange(3) + 5)
We can also broadcast a 1D array to a 2D array, in this case adding a vector to all rows of a matrix:
np.ones((3, 3))
np.arange(3)
np.ones((3, 3)) + np.arange(3)
We can also broadcast in two directions at a time:
np.arange(3).reshape((3, 1))
np.arange(3)
np.arange(3).reshape((3, 1)) + np.arange(3)
import numpy as np
import pandas as pd
# use pandas to extract rainfall inches as a NumPy array
rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = rainfall / 254.0 # 1/10mm -> inches
inches.shape
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # set plot styles
plt.hist(inches, 40);
x = np.array([1, 2, 3, 4, 5])
x < 3 # less than
x > 3 # greater than
x != 3 # not equal
(2 * x) == (x ** 2)
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
# how many values less than 6?
np.count_nonzero(x < 6)
np.sum(x < 6)
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)
# are all values in each row less than 8?
np.all(x < 8, axis=1)
np.sum((inches > 0.5) & (inches < 1))
np.sum(~( (inches <= 0.5) | (inches >= 1) ))
print("Number days without rain: ", np.sum(inches == 0))
print("Number days with rain: ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches :", np.sum((inches > 0) &
(inches < 0.2)))
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
print("Median precip on rainy days in 2014 (inches): ",
np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches): ",
np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
np.median(inches[rainy & ~summer]))
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[x[3], x[7], x[2]]
ind = [3, 7, 4]
x[ind]
ind = np.array([[3, 7],
[4, 5]])
x[ind]
X = np.arange(12).reshape((3, 4))
X
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
X[row[:, np.newaxis], col]
row[:, np.newaxis] * col
X[2, [2, 0, 1]]
X[1:, [2, 0, 1]]
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
One common use of fancy indexing is the selection of subsets of rows from a matrix. For example, we might have an $N$ by $D$ matrix representing $N$ points in $D$ dimensions, such as the following points drawn from a two-dimensional normal distribution:
mean = [0, 0]
cov = [[1, 2],
[2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # for plot styling
plt.scatter(X[:, 0], X[:, 1]);
indices = np.random.choice(X.shape[0], 20, replace=False)
indices
selection = X[indices] # fancy indexing here
selection.shape
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
facecolor='none', s=200);
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
You can use these ideas to efficiently bin data to create a histogram by hand.
For example, imagine we have 1,000 values and would like to quickly find where they fall within an array of bins.
We could compute it using ufunc.at like this:
np.random.seed(42)
x = np.random.randn(100)
# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)
# find the appropriate bin for each x
i = np.searchsorted(bins, x)
# add 1 to each of these bins
np.add.at(counts, i, 1)
# plot the results
plt.plot(bins, counts, linestyle='steps');
import numpy as np
def selection_sort(x):
for i in range(len(x)):
swap = i + np.argmin(x[i:])
(x[i], x[swap]) = (x[swap], x[i])
return x
x = np.array([2, 1, 4, 3, 5])
selection_sort(x)
Even selection sort, though, is much better than my all-time favorite sorting algorithms, the bogosort:
def bogosort(x):
while np.any(x[:-1] > x[1:]):
np.random.shuffle(x)
return x
x = np.array([2, 1, 4, 3, 5])
bogosort(x)
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
x.sort()
print(x)
A related function is argsort, which instead returns the indices of the sorted elements:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
# sort each column of X
np.sort(X, axis=0)
# sort each row of X
np.sort(X, axis=1)
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
np.partition(X, 2, axis=1)
Let's quickly see how we might use this argsort function along multiple axes to find the nearest neighbors of each point in a set.
We'll start by creating a random set of 10 points on a two-dimensional plane.
Using the standard convention, we'll arrange these in a $10\times 2$ array:
X = rand.rand(10, 2)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling
plt.scatter(X[:, 0], X[:, 1], s=100);
Now we'll compute the distance between each pair of points. Recall that the squared-distance between two points is the sum of the squared differences in each dimension;
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
This operation has a lot packed into it, and it might be a bit confusing if you're unfamiliar with NumPy's broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape
# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape
# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape
dist_sq.diagonal()
nearest = np.argsort(dist_sq, axis=1)
print(nearest)
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
In order to visualize this network of neighbors, let's quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
K = 2
for i in range(X.shape[0]):
for j in nearest_partition[i, :K+1]:
# plot a line from X[i] to X[j]
# use some zip magic to make it happen:
plt.plot(*zip(X[j], X[i]), color='black')
import numpy as np
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
x = np.zeros(4, dtype=int)
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
print(data.dtype)
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
# Get all names
data['name']
# Get first row of data
data[0]
# Get the name from the last row
data[-1]['name']
Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:
# Get names where age is under 30
data[data['age'] < 30]['name']
Structured array data types can be specified in a number of ways. Earlier, we saw the dictionary method:
np.dtype({'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
np.dtype({'names':('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
np.dtype('S10,i4,f8')
The shortened string format codes may seem confusing, but they are built on simple principles.
The first (optional) character is < or >, which means "little endian" or "big endian," respectively, and specifies the ordering convention for significant bits.
The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).
The last character or characters represents the size of the object in bytes.
| Character | Description | Example |
|---|---|---|
'b' |
Byte | np.dtype('b') |
'i' |
Signed integer | np.dtype('i4') == np.int32 |
'u' |
Unsigned integer | np.dtype('u1') == np.uint8 |
'f' |
Floating point | np.dtype('f8') == np.int64 |
'c' |
Complex floating point | np.dtype('c16') == np.complex128 |
'S', 'a' |
String | np.dtype('S5') |
'U' |
Unicode string | np.dtype('U') == np.str_ |
'V' |
Raw data (void) | np.dtype('V') == np.void |
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])
data['age']
data_rec = data.view(np.recarray)
data_rec.age
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age
Will the following broadcasting operations work?
arr1 = np.ones((2, 3))
arr2 = np.ones((2, 1))
print (arr1)
print (arr2)
arr1 + arr2
arr1 = np.ones((2, 3))
arr2 = np.ones(2).reshape(2,1)
print (arr1)
print (arr2)
arr1 + arr2
arr1 = np.ones((2, 3))
arr2 = np.ones((2, 1))
print (arr1)
print (arr2)
arr1 + arr2
Use np.arange and reshape
A = [[1 2 3 4]
[5 6 7 8]]
Use np.arange to create the array
B = [1 2]
Use broadcasting to add B to each column of A to create the final array
A + B = [[2 3 4 5]
[7 8 9 10]
Hint: what shape does B have to be changed to?
A = np.arange(1, 9).reshape((2, 4))
B = np.arange(1, 3)
A + B.reshape((2, 1))
Another way to change the shape of B is to use the newaxis keyword:
As we mentioned before, Numpy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, scipy ships a rich special function library in the scipy.special module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between $0$ and $2\pi$ is as simple as:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
Numpy ships with a basic linear algebra library, and all arrays have a dot method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:
v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])
print(v1, '.', v2, '=', np.dot(v1, v2))
Here is a regular matrix-vector multiplication, note that the array v1 should be viewed as a column vector in traditional linear algebra notation; numpy makes no distinction between row and column vectors and simply verifies that the dimensions match the required rules of matrix multiplication, in this case we have a $2 \times 3$ matrix multiplied by a 3-vector, which produces a 2-vector:
A = np.arange(6).reshape(2, 3)
print(A, 'x', v1, '=', np.dot(A, v1))
For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A \times A^T$:
print(np.dot(A, A.T))
and $A^T \times A$:
print(np.dot(A.T, A))
Furthermore, the numpy.linalg module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc. For even more linear algebra tools, scipy.linalg contains the majority of the tools in the classic LAPACK libraries as well as functions to operate on sparse matrices. We refer the reader to the Numpy and Scipy documentations for additional details on these.
Numpy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a text and a binary file containing numerical data. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, with in this case 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable pi had in the computer's memory.
The tradeoffs between the two modes are thus:
Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can only be used for one- and two-dimensional arrays.
Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.
First, let's see how to read and write arrays in text mode. The np.savetxt function saves an array to a text file, with options to control the precision, separators and even adding a header:
arr = np.arange(10).reshape(2, 5)
np.savetxt('test.out', arr, fmt='%.2e', header="My dataset")
!cat test.out
And this same type of file can then be read with the matching np.loadtxt function:
arr2 = np.loadtxt('test.out')
print(arr2)
For binary data, Numpy provides the np.save and np.savez routines. The first saves a single array to a file with .npy extension, while the latter can be used to save a group of arrays into a single file with .npz extension. The files created with these routines can then be read with the np.load function.
Let us first see how to use the simpler np.save function to save a single array:
np.save('test.npy', arr2)
# Now we read this back
arr2n = np.load('test.npy')
# Let's see if any element is non-zero in the difference.
# A value of True would be a problem.
print('Any differences?', np.any(arr2-arr2n))
Now let us see how the np.savez function works. You give it a filename and either a sequence of arrays or a set of keywords. In the first mode, the function will auotmatically name the saved arrays in the archive as arr_0, arr_1, etc:
np.savez('test.npz', arr, arr2)
arrays = np.load('test.npz')
arrays.files
Alternatively, we can explicitly choose how to name the arrays we save:
np.savez('test.npz', array1=arr, array2=arr2)
arrays = np.load('test.npz')
arrays.files
The object returned by np.load from an .npz file works like a dictionary, though you can also access its constituent files by attribute using its special .f field; this is best illustrated with an example with the arrays object from above:
print('First row of first array:', arrays['array1'][0])
# This is an equivalent way to get the same field
print('First row of first array:', arrays.f.array1[0])
This .npz format is a very convenient way to package compactly and without loss of information, into a single file, a group of related arrays that pertain to a specific problem. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.
Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete discussion of the possibilities is beyond the scope of this discussion, but of particular interest for scientific users we at least mention the following:
Matlab files
The scipy.io module contains routines to read and write Matlab files in .mat format and files in the NetCDF format that is widely used in certain scientific disciplines.
HDF5 files
For manipulating files in the HDF5 format, there are two excellent options in Python: The PyTables project offers a high-level, object oriented approach to manipulating HDF5 datasets, while the h5py project offers a more direct mapping to the standard HDF5 library interface. Both are excellent tools; if you need to work with HDF5 datasets you should read some of their documentation and examples and decide which approach is a better match for your needs.
Illustrates: basic array slicing, functions as first class objects.
In this exercise, you are tasked with implementing the simple trapezoid rule formula for numerical integration. If we want to compute the definite integral
$$ \int_{a}^{b}f(x)dx $$we can partition the integration interval $[a,b]$ into smaller subintervals, and approximate the area under the curve for each subinterval by the area of the trapezoid created by linearly interpolating between the two function values at each end of the subinterval:
<img src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Trapezoidal_rule_illustration.png/316px-Trapezoidal_rule_illustration.png" /img>
The blue line represents the function $f(x)$ and the red line is the linear interpolation. By subdividing the interval $[a,b]$, the area under $f(x)$ can thus be approximated as the sum of the areas of all the resulting trapezoids.
If we denote by $x_{i}$ ($i=0,\ldots,n,$ with $x_{0}=a$ and $x_{n}=b$) the abscissas where the function is sampled, then
$$ \int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(f(x_{i})+f(x_{i-1})\right). $$The common case of using equally spaced abscissas with spacing $h=(b-a)/n$ reads simply
$$ \int_{a}^{b}f(x)dx\approx\frac{h}{2}\sum_{i=1}^{n}\left(f(x_{i})+f(x_{i-1})\right). $$One frequently receives the function values already precomputed, $y_{i}=f(x_{i}),$ so the equation above becomes
$$ \int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(y_{i}+y_{i-1}\right). $$
Write a function trapzf(f, a, b, npts=100) that accepts a function f, the endpoints a
and b and the number of samples to take npts. Sample the function uniformly at these
points and return the value of the integral.
Verify that both functions above are correct by showing that they produce correct values for a simple integral such as $\int_0^3 x^2$.
Preview of what's to come: use the documentation features of IPython to explore the submodule scipy.integrage. Can you find a suitable function to perform this integration above?
The matplotlib library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the book. We encourage the reader to read the tutorials included with the matplotlib documentation as well as to browse its extensive gallery of examples that include source code.
Just as we typically use the shorthand np for Numpy, we will use plt for the matplotlib.pyplot module where the easy-to-use plotting functions reside (the library contains a rich object-oriented architecture that we don't have the space to discuss here):
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
dir(mpl)
We will use the plt.style directive to choose appropriate aesthetic styles for our figures.
plt.style.use('classic')
plt.style.use('ggplot')
x = np.linspace(-5, 2, 100)
y1 = x**3 + 5*x**2 + 10
y2 = 3*x**2 + 10*x
y3 = 6*x + 10
fig,ax=plt.subplots()
ax.plot(x,y1,label='f(x)')
ax.plot(x,y2,label="f'(x)")
ax.plot(x,y3,label="f''(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.plot(x,y1)
fig, ax = plt.subplots()
ax.plot(x, y1);
fig, ax = plt.subplots()
ax.plot(x, y1, color="blue", label="y(x)")
ax.plot(x, y2, color="red", label="y'(x)")
ax.plot(x, y3, color="green", label="y''(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend();
fig.savefig("figure-1.jpg")
!ls -lh figure-1.jpg
from IPython.display import Image
Image(filename="figure-1.jpg")
In savefig(), the file format is inferred from the extension of the given filename.
Depending on what backends you have installed, many different file formats are available.
fig.canvas.get_supported_filetypes()
A potentially confusing feature of Matplotlib is its dual interfaces: a convenient MATLAB-style state-based interface, and a more powerful object-oriented interface. We'll quickly highlight the differences between the two here.
Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact.
The MATLAB-style tools are contained in the pyplot (plt) interface.
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)
# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));
fig, ax = plt.subplots(figsize=(8,5))
ax.plot(x, y1, lw=1.5, color="blue", label="$y(x)$")
ax.plot(x, y2, lw=1.5, color="red", label="$y'(x)$")
ax.plot(x, y3, lw=1.5, color="green", label="$y''(x)$")
ax.plot(x, np.zeros_like(x), lw=0.5, color="black")
ax.plot([-3.33, -3.33], [0, (-3.33)**3 + 5*(-3.33)**2 + 10], ls='--', lw=0.5, color="black")
ax.plot([-3.33], [(-3.33)**3 + 5*(-3.33)**2 + 10], lw=0.5, marker='o', color="blue")
ax.plot([0, 0], [0, 10], lw=0.5, ls='--', color="black")
ax.plot([0], [10], lw=0.5, marker='o', color="blue")
ax.set_ylim(-15, 40)
ax.set_yticks([-10, 0, 10, 20, 30])
ax.set_xticks([-4, -2, 0, 2])
ax.set_xlabel("$x$", fontsize=18)
ax.set_ylabel("$y$", fontsize=18)
ax.legend(loc=0, ncol=3, fontsize=14, frameon=False);
fig.savefig("figure-2.jpg")
fig = plt.figure(figsize=(8, 2.5), facecolor="#f1f1f1")
# axes coordinates as fractions of the canvas width and height
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax = fig.add_axes((left, bottom, width, height), axisbg="#e1e1e1")
x = np.linspace(-2, 2, 1000)
y1 = np.cos(40 * x)
y2 = np.exp(-x**2)
ax.plot(x, y1 * y2)
ax.plot(x, y2, 'g')
ax.plot(x, -y2, 'g')
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("graph.png", dpi=100, facecolor="#f1f1f1")
fig.savefig("graph.pdf", dpi=300, facecolor="#f1f1f1")
Line Colors and Styles
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted
plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r'); # dotted red
fignum = 0
def hide_labels(fig, ax):
global fignum
ax.set_xticks([])
ax.set_yticks([])
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.axis('tight')
fignum += 1
fig.savefig("plot-types-%d.pdf" % fignum)
x = np.linspace(-3, 3, 25)
y1 = x**3+ 3 * x**2 + 10
y2 = -1.5 * x**3 + 10*x**2 - 15
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, y1)
ax.plot(x, y2)
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
ax.step(x, y1)
ax.step(x, y2)
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
width = 1/10.0
ax.bar(x - width/2, y1, width=width, color="blue")
ax.bar(x + width/2, y2, width=width, color="green")
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
ax.fill_between(x, y1, y2)
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
ax.hist(y2, bins=30)
ax.hist(y1, bins=30)
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
ax.errorbar(x, y2, yerr=y1, fmt='o-')
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
ax.stem(x, y2, 'b', markerfmt='bs')
ax.stem(x, y1, 'r', markerfmt='ro')
hide_labels(fig, ax)
fig, ax = plt.subplots(figsize=(4, 3))
x = np.linspace(0, 5, 50)
ax.scatter(x, -1 + x + 0.25 * x**2 + 2 * np.random.rand(len(x)))
ax.scatter(x, np.sqrt(x) + 2 * np.random.rand(len(x)), color="green")
hide_labels(fig, ax)
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
Simple Scatter Plots
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
plt.plot(rng.rand(5), rng.rand(5), marker,
label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
plt.plot(x, y, '-ok');
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
plt.ylim(-1.2, 1.2);
plt.scatter(x, y, marker='o');
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
cmap='viridis')
plt.colorbar(); # show color scale
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T
plt.scatter(features[0], features[1], alpha=0.2,
s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);
Errorbars
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
ecolor='lightgray', elinewidth=3, capsize=0);
from sklearn.gaussian_process import GaussianProcess
# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)
# Compute the Gaussian process fit
gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
random_start=100)
gp.fit(xdata[:, np.newaxis], ydata)
xfit = np.linspace(0, 10, 1000)
yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
dyfit = 2 * np.sqrt(MSE) # 2*sigma ~ 95% confidence region
# Visualize the result
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
color='gray', alpha=0.2)
plt.xlim(0, 10);
Density and Contour Plots
def f(x, y):
return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black');
plt.contour(X, Y, Z, 20, cmap='RdGy');
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy')
plt.colorbar()
plt.axis(aspect='image');
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy', alpha=0.5)
plt.colorbar();
Histograms, Binnings, and Density
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
plt.hist(data, bins=30, normed=True, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, normed=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
# simply compute the histogram
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
# Hexagonal binnings
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
# Kernel density estimation
from scipy.stats import gaussian_kde
# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)
# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
origin='lower', aspect='auto',
extent=[-3.5, 3.5, -6, 6],
cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")
fig, ax = plt.subplots(figsize=(8, 4))
x = np.linspace(-20, 20, 100)
y = np.sin(x) / x
ax.plot(x, y)
ax.set_ylabel("y label")
ax.set_xlabel("x label")
for label in ax.get_xticklabels() + ax.get_yticklabels():
label.set_rotation(45)
Axes Limits
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
plt.plot(x, np.sin(x))
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);
plt.plot(x, np.sin(x))
plt.axis('tight');
plt.plot(x, np.sin(x))
plt.axis('equal');
fig, axes = plt.subplots(ncols=2, nrows=3)
fig, axes = plt.subplots(1, 2, figsize=(8, 3.5), sharey=True)
data1 = np.random.randn(200, 2) * np.array([3, 1])
#area1 = (np.random.randn(200) + 0.5) * 100
data2 = np.random.randn(200, 2) * np.array([1, 3])
#area2 = (np.random.randn(200) + 0.5) * 100
axes[0].scatter(data1[:,0], data1[:,1], color="green", marker="s", s=30, alpha=0.5)
axes[0].scatter(data2[:,0], data2[:,1], color="blue", marker="o", s=30, alpha=0.5)
axes[1].hist([data1[:,1], data2[:,1]], bins=15, color=["green", "blue"], alpha=0.5, orientation='horizontal');
Labeling Plots
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')
plt.legend();
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
x = np.linspace(0, 1, 100)
for n in range(4):
axes[n].plot(x, x, label="y(x) = x")
axes[n].plot(x, x + x**2, label="y(x) = x + x**2")
axes[n].legend(loc=n+1)
axes[n].set_title("legend(loc=%d)" % (n+1))
fig.tight_layout()
fig.savefig("legend-loc.pdf")
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])
plt.imshow(I)
plt.colorbar();
plt.imshow(I, cmap='gray');
from matplotlib.colors import LinearSegmentedColormap
def grayscale_cmap(cmap):
"""Return a grayscale version of the given colormap"""
cmap = plt.cm.get_cmap(cmap)
colors = cmap(np.arange(cmap.N))
# convert RGBA to perceived grayscale luminance
# cf. http://alienryderflex.com/hsp.html
RGB_weight = [0.299, 0.587, 0.114]
luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
colors[:, :3] = luminance[:, np.newaxis]
return LinearSegmentedColormap.from_list(cmap.name + "_gray", colors, cmap.N)
def view_colormap(cmap):
"""Plot a colormap with its grayscale equivalent"""
cmap = plt.cm.get_cmap(cmap)
colors = cmap(np.arange(cmap.N))
cmap = grayscale_cmap(cmap)
grayscale = cmap(np.arange(cmap.N))
fig, ax = plt.subplots(2, figsize=(6, 2),
subplot_kw=dict(xticks=[], yticks=[]))
ax[0].imshow([colors], extent=[0, 10, 0, 1])
ax[1].imshow([grayscale], extent=[0, 10, 0, 1])
view_colormap('jet')
view_colormap('viridis')
view_colormap('cubehelix')
view_colormap('RdBu')
# make noise in 1% of the image pixels
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))
plt.figure(figsize=(10, 3.5))
plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')
plt.colorbar()
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
# load images of the digits 0 through 5 and visualize several of them
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)
fig, ax = plt.subplots(8, 8, figsize=(6, 6))
for i, axi in enumerate(ax.flat):
axi.imshow(digits.images[i], cmap='binary')
axi.set(xticks=[], yticks=[])
# project the digits into 2 dimensions using IsoMap
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
projection = iso.fit_transform(digits.data)
# plot the results
plt.scatter(projection[:, 0], projection[:, 1], lw=0.1,
c=digits.target, cmap=plt.cm.get_cmap('cubehelix', 6))
plt.colorbar(ticks=range(6), label='digit value')
plt.clim(-0.5, 5.5)
fig, ax1 = plt.subplots(figsize=(8, 4))
r = np.linspace(0, 5, 100)
a = 4 * np.pi * r ** 2 # area
v = (4 * np.pi / 3) * r ** 3 # volume
ax1.set_title("surface area and volume of a sphere", fontsize=16)
ax1.set_xlabel("radius [m]", fontsize=16)
ax1.plot(r, a, lw=2, color="blue")
ax1.set_ylabel("surface area ($m^2$)", fontsize=16, color="blue")
for label in ax1.get_yticklabels():
label.set_color("blue")
ax2 = ax1.twinx()
ax2.plot(r, v, lw=2, color="red")
ax2.set_ylabel(r"volume ($m^3$)", fontsize=16, color="red")
for label in ax2.get_yticklabels():
label.set_color("red")
fig.tight_layout()
fig.savefig("ch4-axis-twin-ax.pdf")
fig = plt.figure(figsize=(8, 4))
def f(x):
return 1/(1 + x**2) + 0.1/(1 + ((3 - x)/0.1)**2)
def plot_and_format_axes(ax, x, f, fontsize):
ax.plot(x, f(x), linewidth=2)
ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(5))
ax.yaxis.set_major_locator(mpl.ticker.MaxNLocator(4))
ax.set_xlabel(r"$x$", fontsize=fontsize)
ax.set_ylabel(r"$f(x)$", fontsize=fontsize)
# main graph
ax = fig.add_axes([0.1, 0.15, 0.8, 0.8], axisbg="#f5f5f5")
x = np.linspace(-4, 14, 1000)
plot_and_format_axes(ax, x, f, 18)
# inset
x0, x1 = 2.5, 3.5
ax = fig.add_axes([0.5, 0.5, 0.38, 0.42], axisbg='none')
x = np.linspace(x0, x1, 1000)
plot_and_format_axes(ax, x, f, 14)
fig.savefig("ch4-advanced-axes-inset.pdf")
ncols, nrows = 3, 3
fig, axes = plt.subplots(nrows, ncols)
for m in range(nrows):
for n in range(ncols):
axes[m, n].set_xticks([])
axes[m, n].set_yticks([])
axes[m, n].text(0.5, 0.5, "axes[%d, %d]" % (m, n),
horizontalalignment='center')
fig, axes = plt.subplots(2, 2, figsize=(6, 6), sharex=True, sharey=True, squeeze=False)
x1 = np.random.randn(100)
x2 = np.random.randn(100)
axes[0, 0].set_title("Uncorrelated")
axes[0, 0].scatter(x1, x2)
axes[0, 1].set_title("Weakly positively correlated")
axes[0, 1].scatter(x1, x1 + x2)
axes[1, 0].set_title("Weakly negatively correlated")
axes[1, 0].scatter(x1, -x1 + x2)
axes[1, 1].set_title("Strongly correlated")
axes[1, 1].scatter(x1, x1 + 0.15 * x2)
axes[1, 1].set_xlabel("x")
axes[1, 0].set_xlabel("x")
axes[0, 0].set_ylabel("y")
axes[1, 0].set_ylabel("y")
plt.subplots_adjust(left=0.1, right=0.95, bottom=0.1, top=0.95, wspace=0.1, hspace=0.2)
fig.savefig("ch4-advanced-axes-subplots.pdf")
While most plt functions translate directly to ax methods (such as plt.plot() → ax.plot(), plt.legend() → ax.legend(), etc.), this is not the case for all commands.
plt.xlabel() → ax.set_xlabel()plt.ylabel() → ax.set_ylabel()plt.xlim() → ax.set_xlim()plt.ylim() → ax.set_ylim()plt.title() → ax.set_title()In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ax.set() method to set all these properties at once:
ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
xlabel='x', ylabel='sin(x)',
title='A Simple Plot');
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
births = pd.read_csv('data/births.csv')
quartiles = np.percentile(births['births'], [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
births['day'] = births['day'].astype(int)
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births',
[births.index.month, births.index.day])
births_by_date.index = [pd.datetime(2012, month, day)
for (month, day) in births_by_date.index]
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)
# Add labels to the plot
style = dict(size=10, color='gray')
ax.text('2012-1-1', 3950, "New Year's Day", **style)
ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style)
ax.text('2012-9-4', 4850, "Labor Day", ha='center', **style)
ax.text('2012-10-31', 4600, "Halloween", ha='right', **style)
ax.text('2012-11-25', 4450, "Thanksgiving", ha='center', **style)
ax.text('2012-12-25', 3850, "Christmas ", ha='right', **style)
# Label the axes
ax.set(title='USA births by day of year (1969-1988)',
ylabel='average daily births')
# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)
# Add labels to the plot
ax.annotate("New Year's Day", xy=('2012-1-1', 4100), xycoords='data',
xytext=(50, -30), textcoords='offset points',
arrowprops=dict(arrowstyle="->",
connectionstyle="arc3,rad=-0.2"))
ax.annotate("Independence Day", xy=('2012-7-4', 4250), xycoords='data',
bbox=dict(boxstyle="round", fc="none", ec="gray"),
xytext=(10, -40), textcoords='offset points', ha='center',
arrowprops=dict(arrowstyle="->"))
ax.annotate('Labor Day', xy=('2012-9-4', 4850), xycoords='data', ha='center',
xytext=(0, -20), textcoords='offset points')
ax.annotate('', xy=('2012-9-1', 4850), xytext=('2012-9-7', 4850),
xycoords='data', textcoords='data',
arrowprops={'arrowstyle': '|-|,widthA=0.2,widthB=0.2', })
ax.annotate('Halloween', xy=('2012-10-31', 4600), xycoords='data',
xytext=(-80, -40), textcoords='offset points',
arrowprops=dict(arrowstyle="fancy",
fc="0.6", ec="none",
connectionstyle="angle3,angleA=0,angleB=-90"))
ax.annotate('Thanksgiving', xy=('2012-11-25', 4500), xycoords='data',
xytext=(-120, -60), textcoords='offset points',
bbox=dict(boxstyle="round4,pad=.5", fc="0.9"),
arrowprops=dict(arrowstyle="->",
connectionstyle="angle,angleA=0,angleB=80,rad=20"))
ax.annotate('Christmas', xy=('2012-12-25', 3850), xycoords='data',
xytext=(-30, 0), textcoords='offset points',
size=13, ha='right', va="center",
bbox=dict(boxstyle="round", alpha=0.1),
arrowprops=dict(arrowstyle="wedge,tail_width=0.5", alpha=0.1));
# Label the axes
ax.set(title='USA births by day of year (1969-1988)',
ylabel='average daily births')
# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
ax.set_ylim(3600, 5400);
from mpl_toolkits import mplot3d
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
def f(x, y):
return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
ax.view_init(60, 35)
fig
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none')
ax.set_title('surface');
r = np.linspace(0, 6, 20)
theta = np.linspace(-0.9 * np.pi, 0.8 * np.pi, 40)
r, theta = np.meshgrid(r, theta)
X = r * np.sin(theta)
Y = r * np.cos(theta)
Z = f(X, Y)
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none');
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5);
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z,
cmap='viridis', edgecolor='none');
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
Once you have the Basemap toolkit installed and imported, geographic plots are just a few lines away (the graphics in the following also requires the PIL package in Python 2, or the pillow package in Python 3):
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
from itertools import chain
def draw_map(m, scale=0.2):
# draw a shaded-relief image
m.shadedrelief(scale=scale)
# lats and longs are returned as a dictionary
lats = m.drawparallels(np.linspace(-90, 90, 13))
lons = m.drawmeridians(np.linspace(-180, 180, 13))
# keys contain the plt.Line2D instances
lat_lines = chain(*(tup[1][0] for tup in lats.items()))
lon_lines = chain(*(tup[1][0] for tup in lons.items()))
all_lines = chain(lat_lines, lon_lines)
# cycle through these lines and set the desired style
for line in all_lines:
line.set(linestyle='-', alpha=0.3, color='w')
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);
fig, ax = plt.subplots(1, 2, figsize=(12, 8))
for i, res in enumerate(['l', 'h']):
m = Basemap(projection='gnom', lat_0=57.3, lon_0=-6.2,
width=90000, height=120000, resolution=res, ax=ax[i])
m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF')
m.drawmapboundary(fill_color="#DDEEFF")
m.drawcoastlines()
ax[i].set_title("resolution='{0}'".format(res));
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h',
lat_0=37.5, lon_0=-119,
width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,
c=np.log10(population), s=area,
cmap='Reds', alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
plt.scatter([], [], c='k', alpha=0.5, s=a,
label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
labelspacing=1, loc='lower left');
pandas¶%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.__version__
pd?
s = pd.Series([909976, 8615246, 2872086, 2273305])
s
type(s)
s.dtype
s.index
s.values
s.index = ["Stockholm", "London", "Rome", "Paris"]
s.name = "Population"
s
s = pd.Series([909976, 8615246, 2872086, 2273305],
index=["Stockholm", "London", "Rome", "Paris"], name="Population")
s["London"]
s.Stockholm
s[["Paris", "Rome"]]
s.median(), s.mean(), s.std(),s.min(), s.max()
s.quantile(q=0.25), s.quantile(q=0.5), s.quantile(q=0.75)
s.describe()
mpl.style.use('ggplot')
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
s.plot(ax=axes[0], kind='line', title="line")
s.plot(ax=axes[1], kind='bar', title="bar")
s.plot(ax=axes[2], kind='box', title="box")
s.plot(ax=axes[3], kind='pie', title="pie")
fig.tight_layout()
fig.savefig("ch12-series-plot.pdf")
fig.savefig("ch12-series-plot.png")
df = pd.DataFrame([[909976, 8615246, 2872086, 2273305],
["Sweden", "United kingdom", "Italy", "France"]])
df
df = pd.DataFrame([[909976, "Sweden"],
[8615246, "United kingdom"],
[2872086, "Italy"],
[2273305, "France"]])
df
df.index = ["Stockholm", "London", "Rome", "Paris"]
df.columns = ["Population", "State"]
df
df = pd.DataFrame([[909976, "Sweden"],
[8615246, "United kingdom"],
[2872086, "Italy"],
[2273305, "France"]],
index=["Stockholm", "London", "Rome", "Paris"],
columns=["Population", "State"])
df
df = pd.DataFrame({"Population": [909976, 8615246, 2872086, 2273305],
"State": ["Sweden", "United kingdom", "Italy", "France"]},
index=["Stockholm", "London", "Rome", "Paris"])
df
df.ix[:,'State']
df.index
df.columns
df.values
df.Population
type(df.Population)
df.Population.Stockholm
type(df.ix)
print (df.ix["Stockholm"])
print (type(df.ix["Stockholm"]))
df.ix[["Paris", "Rome"]]
df.ix[["Paris", "Rome"], "Population"]
df.mean()
df.info()
df.describe()
df.dtypes
df.head(3)
!head -n5 european_cities.csv
These slicing and indexing conventions can be a source of confusion.
For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# explicit index when indexing
data[1]
# implicit index when slicing
data[1:3]
data.loc[1]
data.loc[1:3]
data.iloc[1]
data.iloc[1:3]
vals1 = np.array([1, None, 3, 4])
vals1
for dtype in ['object', 'int']:
print("dtype =", dtype)
%timeit np.arange(1E6, dtype=dtype).sum()
print()
vals1.sum()
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
1 + np.nan
vals2.sum(), vals2.min(), vals2.max()
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
pd.Series([1, np.nan, 2, None])
x = pd.Series(range(2), dtype=int)
x
x[0] = None
x
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value.
(Be aware that there is a proposal to add a native integer NA to Pandas in the future; as of this writing, it has not been included).
While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and in my experience only rarely causes issues.
The following table lists the upcasting conventions in Pandas when NA values are introduced:
| Typeclass | Conversion When Storing NAs | NA Sentinel Value |
|---|---|---|
floating |
No change | np.nan |
object |
No change | None or np.nan |
integer |
Cast to float64 |
np.nan |
boolean |
Cast to object |
None or np.nan |
Keep in mind that in Pandas, string data is always stored with an object dtype.
Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
Either one will return a Boolean mask over the data. For example:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
data[data.notnull()]
data.dropna()
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
data.fillna(0)
# forward-fill
data.fillna(method='ffill')
# back-fill
data.fillna(method='bfill')
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
pop[('California', 2010):('Texas', 2000)]
pop[[i for i in pop.index if i[1] == 2010]]
index = pd.MultiIndex.from_tuples(index)
index
pop = pop.reindex(index)
pop
pop[:, 2010]
In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
pop_df.stack()
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
health_data['Guido', 'HR']
health_data.loc[:, ('Bob', 'HR')]
def make_df(cols, ind):
"""Quickly make a DataFrame"""
data = {c: [str(c) + str(i) for i in ind]
for c in cols}
return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1)
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='col')")
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
display('df5', 'df6',
"pd.concat([df5, df6], join='inner')")
display('df5', 'df6',
"pd.concat([df5, df6], join_axes=[df5.columns])")
display('df1', 'df2', 'df1.append(df2)')
import pandas as pd
import numpy as np
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df3 = pd.merge(df1, df2)
df3
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
on keyword¶display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
left_on and right_on keywords¶df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
left_index and right_index keywords¶df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
display('df1a', 'df2a',
"pd.merge(df1a, df2a, left_index=True, right_index=True)")
display('df1a', 'df2a', 'df1a.join(df2a)')
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
pd.merge(df6, df7, how='inner')
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
import numpy as np
import pandas as pd
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
planets.head()
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
planets.dropna().describe()
The following table summarizes some other built-in Pandas aggregations:
| Aggregation | Description |
|---|---|
count() |
Total number of items |
first(), last() |
First and last item |
mean(), median() |
Mean and median |
min(), max() |
Minimum and maximum |
std(), var() |
Standard deviation and variance |
mad() |
Mean absolute deviation |
prod() |
Product of all items |
sum() |
Sum of all items |
These are all methods of DataFrame and Series objects.
A canonical example of this split-apply-combine operation, where the "apply" is a summation aggregation, is illustrated in this figure:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
df.groupby('key')
df.groupby('key').sum()
planets.groupby('method')['orbital_period'].median()
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
planets.groupby('method')['year'].describe().unstack()
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
This contains a wealth of information on each passenger of that ill-fated voyage, including gender, age, class, fare paid, and much more.
titanic.groupby('sex')[['survived']].mean()
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
titanic.pivot_table('survived', index='sex', columns='class')
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
The full call signature of the pivot_table method of DataFrames is as follows:
# call signature as of Pandas 0.18
DataFrame.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False,
dropna=True, margins_name='All')
The aggfunc keyword controls what type of aggregation is applied, which is a mean by default.
As in the GroupBy, the aggregation specification can be a string representing one of several common choices (e.g., 'sum', 'mean', 'count', 'min', 'max', etc.) or a function that implements an aggregation (e.g., np.sum(), min(), sum(), etc.).
titanic.pivot_table(index='sex', columns='class',
aggfunc={'survived':sum, 'fare':'mean'})
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
df_pop = pd.read_csv("data/european_cities.csv")
df_pop.head()
type(df_pop.Population[0])
df_pop = pd.read_csv("data/european_cities.csv", delimiter=",", encoding=None, header=0)
df_pop.info()
df_pop.head()
type(df_pop.Population[0])
df_pop["NumericPopulation"] = df_pop.Population.apply(lambda x: int(x.replace(",", "")))
df_pop["State"][0:3]
df_pop["State"] = df_pop["State"].apply(lambda x: x.strip())##默认去除空格,一般是去除首尾字符
df_pop.head()
df_pop.dtypes
df_pop2 = df_pop.set_index("City")
df_pop2 = df_pop2.sort_index()
df_pop2.head()
df_pop3 = df_pop.set_index(["State", "City"]).sortlevel(0)
df_pop3.head(7)
df_pop3.ix["Sweden"]
df_pop.ix[df_pop.State == 'Sweden']
df_pop3.ix[("Sweden", "Gothenburg")]
df_pop.set_index("City").sort_values(["State", "NumericPopulation"], ascending=[False, True]).head()
city_counts = df_pop.State.value_counts()
city_counts.name = "# cities in top 105"
df_pop3 = df_pop[["State", "City", "NumericPopulation"]].set_index(["State", "City"])
df_pop3.head()
df_pop4=df_pop3.sum(level="State").sort_values("NumericPopulation", ascending=False)
df_pop4.head()
df_pop5 = df_pop.drop("Rank", axis=1).groupby("State").sum().sort_values("NumericPopulation", ascending=False)
df_pop5.head()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
city_counts.plot(kind='barh', ax=ax1)
ax1.set_xlabel("# cities in top 105")
df_pop5.NumericPopulation.plot(kind='barh', ax=ax2)
ax2.set_xlabel("Total pop. in top 105 cities")
fig.tight_layout()
fig.savefig("ch12-state-city-counts-sum.pdf")
import datetime
pd.date_range("2015-1-1", periods=31)
pd.date_range(datetime.datetime(2015, 1, 1), periods=31)
pd.date_range("2015-1-1 00:00", "2015-1-1 12:00", freq="H")
ts1 = pd.Series(np.arange(31), index=pd.date_range("2015-1-1", periods=31))
ts1["2015-1-3"]
ts1.index[2]
ts1.index[2].year, ts1.index[2].month, ts1.index[2].day
ts1.index[2].to_pydatetime()
ts2 = pd.Series(np.random.rand(2),
index=[datetime.datetime(2015, 1, 1), datetime.datetime(2015, 2, 1)])
ts2
periods = pd.PeriodIndex([pd.Period('2015-01'), pd.Period('2015-02'), pd.Period('2015-03')])
ts3 = pd.Series(np.random.rand(3), periods)
ts3
ts3.index
ts2.to_period('M')
!head -n 5 temperature_outdoor_2014.tsv
!wc -l temperature_outdoor_2014.tsv
df1 = pd.read_csv('data/temperature_outdoor_2014.tsv', delimiter="\t", names=["time", "outdoor"])
df2 = pd.read_csv('data/temperature_indoor_2014.tsv', delimiter="\t", names=["time", "indoor"])
df1.head()
df2.head()
df1.time = (pd.to_datetime(df1.time.values, unit="s")
.tz_localize('UTC').tz_convert('Europe/Stockholm'))
df1 = df1.set_index("time")
df2.time = (pd.to_datetime(df2.time.values, unit="s")
.tz_localize('UTC').tz_convert('Europe/Stockholm'))
df2 = df2.set_index("time")
df1.head()
df1.index[0]
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
df1.plot(ax=ax)
df2.plot(ax=ax)
fig.tight_layout()
df1.info()
df1_jan = df1[(df1.index > "2014-1-1") & (df1.index < "2014-2-1")]
df1.index < "2014-2-1"
df1_jan.info()
df2_jan = df2["2014-1-1":"2014-1-31"]
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
df1_jan.plot(ax=ax)
df2_jan.plot(ax=ax)
fig.tight_layout()
df_month = pd.concat([df.to_period("M").groupby(level=0).mean() for df in [df1, df2]], axis=1)
df_month.head(3)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df_month.plot(kind='bar', ax=axes[0])
df_month.plot(kind='box', ax=axes[1])
fig.tight_layout()
df_month
使用循环和向量化两种不同的方法来计算100以内质数之和
$hint$: 10以内的质数有2、3、5、7四个,要找出100以内的其他质数,只需在10到100里筛掉能被2、3、5、7其中一个整除的,留下的就是剩下要找的质数。如果想求质数之和,可以将非质数赋值为0,再对数值求和。
f在区间[a, b]上定积分的函数 trapzf(f, a, b, npts=100)
要用到的公式为:$\int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(f(x_{i})+f(x_{i-1})\right).$