In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

1. Sum of primes up to 100

Checking whether a number is prime

Testing whether an integer is prime is straightforward: a prime is divisible by no number other than 1 and itself. We also do not need to try divisors from 2 all the way up to n; checking up to int(sqrt(n))+1 suffices, because if n = a*b then the smaller factor is at most sqrt(n). This improves efficiency. Note that the return value is True or False, which makes the boolean indexing later convenient.

In [3]:
def is_prime(num):
    if num <= 1:
        return False
    for i in range(2,int(np.sqrt(num))+1):
        if num % i == 0:
            return False
    return True

Using a loop

The simple brute-force way: loop from 1 to 100, test each number for primality, and if it is prime add it to ans; otherwise skip it. Since %%timeit executes the cell 1000 times, the magic was commented out once the timing was done.

In [4]:
def prime_sum_iter(n=100):
    ans = 0
    for i in range(1,n+1):
        if is_prime(i):
            ans += i
    return ans

print prime_sum_iter()
# %%timeit
# 1000 loops, best of 3: 253 µs per loop
1060

Using NumPy vectorization

With numpy we can vectorize the computation and traverse all the elements more concisely. In this example, the loop-based mindset takes one number at a time and tests it for primality; the vectorized mindset takes the whole array as the variable, tests all of its elements at once, and returns an array of the same size. Since is_prime() itself accepts a single integer, it has to be vectorized before it can accept vectors or arrays: is_prime_vec = np.vectorize(is_prime).

np.vectorize: Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a numpy array as output. See the documentation for details.

is_prime_vec(np_arr) returns a boolean array. For example, with np_arr = array([1,2,3,4]), is_prime_vec(np_arr) returns array([False, True, True, False]), because 2 and 3 are prime while 1 and 4 are not. np_arr[is_prime_vec(np_arr)] is boolean indexing: put simply, it returns the elements in the True positions, here array([2, 3]), since 2 and 3 map to True. Summing the result then matches what the loop computed.
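A minimal check of the example above, with is_prime from the earlier cell in scope:

is_prime_vec = np.vectorize(is_prime)
np_arr = np.array([1, 2, 3, 4])
print(is_prime_vec(np_arr))          # [False  True  True False]
print(np_arr[is_prime_vec(np_arr)])  # [2 3], boolean indexing keeps the True positions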

In [5]:
def prime_sum_vect(n=100):
    np_arr = np.arange(1,n+1)
    is_prime_vec = np.vectorize(is_prime)
    return np.sum(np_arr[is_prime_vec(np_arr)])

print prime_sum_vect()
# %%timeit
# 1000 loops, best of 3: 286 µs per loop
1060

2. Simulating a drunkard's random walk

Suppose each of the drunkard's steps covers a distance of 1 or 2 and takes a completely random direction anywhere in the full 360 degrees; we then simulate the walking path.

We track the drunkard's position with coordinates. Each step draws two random numbers: a step length (how far one step goes, here 1 or 2), r = np.random.randint(1,3,N), and a direction in degrees from 0 to 360, theta = np.radians(np.random.randint(0,361,N)), taking care to convert to radians. The position after each step is the running sum of the displacements, which cumsum computes conveniently: x = np.cumsum(r*np.cos(theta)), y = np.cumsum(r*np.sin(theta)). Then plot it and see how drunken the walk looks.

In [6]:
N = 500 # steps

r = np.random.randint(1,3,N) # move 1 or 2 units of distance each step
theta = np.radians(np.random.randint(0,361,N))

# np.cumsum: Return the cumulative sum of the elements along a given axis.
x = np.cumsum(r*np.cos(theta))
y = np.cumsum(r*np.sin(theta))
plt.plot(x,y);

3. Numerically integrating a quadratic with the trapezoidal rule

$$ \int_{a}^{b}f(x)dx $$

we can partition the integration interval $[a,b]$ into smaller subintervals, and approximate the area under the curve for each subinterval by the area of the trapezoid created by linearly interpolating between the two function values at each end of the subinterval: <img src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Trapezoidal_rule_illustration.png/316px-Trapezoidal_rule_illustration.png" /> The blue line represents the function $f(x)$ and the red line is the linear interpolation. By subdividing the interval $[a,b]$, the area under $f(x)$ can thus be approximated as the sum of the areas of all the resulting trapezoids.

If we denote by $x_{i}$ ($i=0,\ldots,n,$ with $x_{0}=a$ and $x_{n}=b$) the abscissas where the function is sampled, then

$$ \int_{a}^{b}f(x)dx\approx\frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(f(x_{i})+f(x_{i-1})\right). $$

The common case of using equally spaced abscissas with spacing $h=(b-a)/n$ reads simply

$$ \int_{a}^{b}f(x)dx\approx\frac{h}{2}\sum_{i=1}^{n}\left(f(x_{i})+f(x_{i-1})\right). $$

The computation needs nothing but this formula. The idea is simple: split the integration interval into many small pieces and replace each piece with a trapezoid, which is just the usual trick of locally approximating a curve by straight lines. Here we integrate the quadratic below and compare against the scipy module's integration (the exact value is 4.5) as a check. $$ \int_a^b (x^2 - 3x + 2) dx $$
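The exact value follows from the antiderivative:

$$ \int_{-1}^{2} (x^2 - 3x + 2)\,dx = \left[\frac{x^3}{3} - \frac{3x^2}{2} + 2x\right]_{-1}^{2} = \frac{2}{3} - \left(-\frac{23}{6}\right) = 4.5 $$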

In [4]:
from scipy import integrate
import numpy as np
def f(x):
    return x*x - 3*x + 2

def trape(f,a,b,n=100):
    f = np.vectorize(f) # so f can be applied to a whole array
    h = float(b - a)/n
    arr = f(np.linspace(a,b,n+1))
    # (h/2) * (f(x_0) + 2 f(x_1) + ... + 2 f(x_{n-1}) + f(x_n))
    return (h/2.)*(2*arr.sum() - arr[0] - arr[-1])

def quad(f,a,b):
    return integrate.quad(f,a,b)[0] # compare against the scipy library result
In [5]:
a, b = -1, 2
print (trape(f,a,b))
print (quad(f,a,b))
4.50045
4.499999999999999

4. Simulating the Game of Life

The Game of Life is like a little game: picture many small lives, or cells, each of which can come alive or die depending on how many neighbours it has. The rules are as follows (quoted from Wikipedia, which has more detail):

The universe of the Game of Life is an infinite two-dimensional orthogonal grid of square cells, each of which is in one of two possible states, live or dead. Every cell interacts with its eight neighbours, which are the cells that are directly horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:

  • Any live cell with fewer than two live neighbours dies, as if caused by under-population.
  • Any live cell with more than three live neighbours dies, as if by overcrowding.
  • Any live cell with two or three live neighbours lives, unchanged, to the next generation.
  • Any dead cell with exactly three live neighbours becomes a live cell.

The initial pattern constitutes the 'seed' of the system. The first generation is created by applying the above rules simultaneously to every cell in the seed – births and deaths happen simultaneously, and the discrete moment at which this happens is sometimes called a tick. (In other words, each generation is a pure function of the one before.) The rules continue to be applied repeatedly to create further generations.

The goal is to apply these rules and determine the configuration after a number of generations: which cells live and which die. For example:

Z = np.array([[0,0,0,0,0,0],
              [0,0,0,1,0,0],
              [0,1,0,1,0,0],
              [0,0,1,1,0,0],
              [0,0,0,0,0,0],
              [0,0,0,0,0,0]])

After two applications of the rules, this becomes

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

The key to the algorithm is counting each cell's neighbours to decide life or death; the border is assumed dead and represented by 0.

In [9]:
def evolve(Z):
    # count each interior cell's live neighbours via eight shifted slices
    N = (Z[0:-2,0:-2] + Z[0:-2,1:-1] + Z[0:-2,2:] +
         Z[1:-1,0:-2]                + Z[1:-1,2:] +
         Z[2:  ,0:-2] + Z[2:  ,1:-1] + Z[2:  ,2:])
    # Apply rules
    birth = (N==3) & (Z[1:-1,1:-1]==0) # a dead cell with exactly 3 neighbours comes alive
    survive = ((N==2) | (N==3)) & (Z[1:-1,1:-1]==1) # a live cell with 2 or 3 neighbours survives
    Z[...] = 0 # reset everything to 0
    Z[1:-1,1:-1][birth | survive] = 1 # set the born and surviving cells to 1
    return Z
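As a quick sanity check, applying evolve twice to the 6x6 seed above reproduces the array shown earlier (a glider drifts one cell diagonally every four generations):

Z_test = np.array([[0,0,0,0,0,0],
                   [0,0,0,1,0,0],
                   [0,1,0,1,0,0],
                   [0,0,1,1,0,0],
                   [0,0,0,0,0,0],
                   [0,0,0,0,0,0]])
evolve(Z_test)
evolve(Z_test)
print(Z_test)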
In [10]:
Z = np.random.randint(0,2,(256,512))
for i in range(100): evolve(Z)
In [11]:
# plot
size = np.array(Z.shape)
dpi = 72.0
figsize= size[1]/float(dpi),size[0]/float(dpi)
fig = plt.figure(figsize=figsize, dpi=dpi, facecolor="white")
fig.add_axes([0.0, 0.0, 1.0, 1.0], frameon=False)
plt.imshow(Z,interpolation='nearest', cmap=plt.cm.hot)
plt.xticks([]), plt.yticks([])
plt.show()

1. Random walks

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import *
from scipy.stats import norm
In [4]:
sns.set_style("darkgrid") # seaborn's style looks nicer
def random_walk(m,n,N=200):
    fig = plt.figure(figsize=(12,5))
    for i in range(1,m*n+1):
        r = np.random.randint(1,3,N)
        theta = np.radians(np.random.randint(0,361,N))
        x = np.cumsum(r*np.cos(theta))
        y = np.cumsum(r*np.sin(theta))
        plt.subplot(m,n,i) # switch to the given subplot
        plt.plot(x,y)
        plt.xticks([]), plt.yticks([]) # remove the ticks
    plt.show()
    
random_walk(3,4)

2. Plotting a quadratic and the trapezoids of trapezoidal integration

In [10]:
def f(x):
    return x*x - 3*x

a,b = -1,3
x = np.linspace(a,b,100,endpoint=True)
# basic plot
plt.figure(figsize=(10,8))
plt.plot(x,f(x),color='black',alpha=.1)

# shade the integration region with fill_between
plt.axhline(0,color='gray')
plt.fill_between(x,0,f(x),f(x)>0,color='blue',alpha=.15)
plt.fill_between(x,0,f(x),f(x)<0,color='blue',alpha=.05)

n = 7 # use few subintervals so the trapezoids visibly differ from the curve
# draw the trapezoids
x = np.linspace(a,b,n,endpoint=True) # hoisted out of the loop; recomputing it each pass was redundant
for i in range(n):
    # vertical edge
    plt.plot([x[i],x[i]],[0,f(x[i])],color='black',linestyle='--')
    # mark the sample point
    plt.scatter(x[i],f(x[i]),20,color='black')
    if i==0: continue
    # slanted top edge of the trapezoid
    plt.plot([x[i-1],x[i]],[f(x[i-1]),f(x[i])],color='black',linestyle='--')


ax = plt.gca()
# annotation
ax.annotate(r'$f(x)=x^2 - 3x$', xy=(2, -2), xytext=(2.5, -2),arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=-.2"))

# move the spines so the axes pass through the origin
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_position(('data',0))
ax.spines['left'].set_position(('data',0))

# rotate the tick labels
for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_rotation(45)

# set the axis limits
plt.xlim([a-.1,b+.1]),plt.xticks([-1,0,1,2,3])
plt.ylim([-2.3,4.1]),plt.yticks([-2,0,2,4])
plt.show()

3. Interactive functions

The basic idea is simple: write a function you care about, then specify which parameters should be interactive. Calling interact on the function produces interactive controls, such as a slider. See the ipywidgets documentation for more.

In [11]:
# plot a normal curve and tweak its two parameters to watch the pdf and cdf change
def f(mean,std):
    x = np.arange(-10, 10, 0.1)
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    axes[0].plot(x,norm.pdf(x,0,1))
    axes[0].plot(x,norm.pdf(x,mean,std))
    axes[1].plot(x,norm.cdf(x,0,1))
    axes[1].plot(x,norm.cdf(x,mean,std)) 
    
#     fig, axes = plt.subplots(2, 2, figsize=(10, 5))
#     axes[0][0].plot(x,norm.pdf(x,0,1))
#     axes[0][1].plot(x,norm.cdf(x,0,1))
#     axes[1][0].plot(x,norm.pdf(x,mean,std))
#     axes[1][1].plot(x,norm.cdf(x,mean,std)) 

interact(f,mean=(-7.,7,0.5),std=(0.0,5));

Yahoo Finance

Pandas can fetch Yahoo Finance data directly, which is wonderfully convenient. Note that, per the official deprecation notice, from pandas.io import data, wb should be replaced with from pandas_datareader import data, wb.
Pandas for finance documentation
The Shanghai Composite Index is 000001.SS.

In [2]:
import pandas as pd
import numpy as np
from pandas_datareader import data, wb # requires: pip install pandas_datareader
import datetime
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline

Downloading the data directly

The site offers CSV downloads: download the file, then read it with sh_table = pd.read_csv('sh_table.csv').

Fetching data with DataReader

In [3]:
# define the date range for the data
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2016,5,20)
In [4]:
sh = data.DataReader("000001.SS", 'yahoo', start, end)
In [5]:
sh.head(3) # the data arrived
Out[5]:
Open High Low Close Volume Adj Close
Date
2010-01-04 3243.76 3243.76 3243.76 3243.76 0 3243.76
2010-01-05 3282.18 3282.18 3282.18 3282.18 0 3282.18
2010-01-06 3254.22 3254.22 3254.22 3254.22 0 3254.22
In [6]:
sh.describe() # overall view of the data
Out[6]:
Open High Low Close Volume Adj Close
count 1558.000000 1558.000000 1558.000000 1558.000000 1558.0 1558.000000
mean 2659.427452 2659.427452 2659.427452 2659.427452 0.0 2659.427452
std 595.170673 595.170673 595.170673 595.170673 0.0 595.170673
min 1950.010000 1950.010000 1950.010000 1950.010000 0.0 1950.010000
25% 2199.832500 2199.832500 2199.832500 2199.832500 0.0 2199.832500
50% 2492.600000 2492.600000 2492.600000 2492.600000 0.0 2492.600000
75% 2968.977500 2968.977500 2968.977500 2968.977500 0.0 2968.977500
max 5166.350000 5166.350000 5166.350000 5166.350000 0.0 5166.350000
In [7]:
sh['Close'].plot(); # a look at the closing-price trend

Reading and writing data: pd.read_csv and to_csv

Reading data from a file is a very common operation.

In [8]:
sh.to_csv('sh.csv',header=None)
In [9]:
names = ['Date','Open','High','Low','Close','Volume','Adj Close']
sh1 = pd.read_csv('sh.csv',names=names,index_col='Date')
In [10]:
sh1.tail(2)
Out[10]:
Open High Low Close Volume Adj Close
Date
2016-05-19 2806.91 2806.91 2806.91 2806.91 0 2806.91
2016-05-20 2825.48 2825.48 2825.48 2825.48 0 2825.48

Cleaning the data

In [11]:
# drop the Volume column, since all of its values are 0
sh = sh.drop('Volume',axis=1)
In [12]:
sh.head(2) # the Volume column is gone
Out[12]:
Open High Low Close Adj Close
Date
2010-01-04 3243.76 3243.76 3243.76 3243.76 3243.76
2010-01-05 3282.18 3282.18 3282.18 3282.18 3282.18

Checking for missing values

Data is sometimes missing or incomplete, so it needs a check.

In [13]:
# Detect missing values with the isnull function
pd.isnull(sh).head() # or sh.isnull()
Out[13]:
Open High Low Close Adj Close
Date
2010-01-04 False False False False False
2010-01-05 False False False False False
2010-01-06 False False False False False
2010-01-07 False False False False False
2010-01-08 False False False False False

But this returns a boolean DataFrame, and I first just want to know whether any values are missing at all, without scanning by eye.
Use any(); if there are missing values, sum() shows how many.

In [14]:
# see http://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe
sh.isnull().values.any() # the data is complete
Out[14]:
False
In [15]:
sh.isnull().values.sum() # no missing values
Out[15]:
0
In [16]:
sh.head()
Out[16]:
Open High Low Close Adj Close
Date
2010-01-04 3243.76 3243.76 3243.76 3243.76 3243.76
2010-01-05 3282.18 3282.18 3282.18 3282.18 3282.18
2010-01-06 3254.22 3254.22 3254.22 3254.22 3254.22
2010-01-07 3192.78 3192.78 3192.78 3192.78 3192.78
2010-01-08 3196.00 3196.00 3196.00 3196.00 3196.00

Filling in missing values

Here we set a few entries to np.nan and then fill them back in. This has no practical meaning; it is just for practice.

Set the NaNs as follows:

In [17]:
sh.iloc[0,:] = np.nan
sh.iloc[[1,3],1] = np.nan
sh.iloc[2,2] = np.nan
sh.iloc[3,3] = np.nan
In [18]:
sh.head(4)
Out[18]:
Open High Low Close Adj Close
Date
2010-01-04 NaN NaN NaN NaN NaN
2010-01-05 3282.18 NaN 3282.18 3282.18 3282.18
2010-01-06 3254.22 3254.22 NaN 3254.22 3254.22
2010-01-07 3192.78 NaN 3192.78 NaN 3192.78
In [19]:
# check NaN number per column
def num_missing(x):
    return x.isnull().sum()

print "Missing values count per column:"
print sh.apply(num_missing, axis=0)
Missing values count per column:
Open         1
High         3
Low          2
Close        2
Adj Close    1
dtype: int64

dropna
takes several parameters (a toy illustration follows after this list):

  • how='all' drops only rows that are entirely NaN; with axis=1 it acts on columns instead;
  • how='any', the default, drops every row or column containing any NaN;
  • inplace=True operates in place and returns nothing;
  • the default inplace=False returns the result of the drop, leaving the original DataFrame unchanged
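A toy frame (hypothetical data, not the stock table) makes the difference concrete:

demo = pd.DataFrame([[np.nan, np.nan], [1.0, np.nan], [2.0, 3.0]])
print(demo.dropna(how='all'))  # drops only row 0, which is all NaN
print(demo.dropna(how='any'))  # drops rows 0 and 1, each of which contains a NaN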
In [20]:
sh.dropna(how='all',inplace=True); 
In [21]:
sh.head(3)
Out[21]:
Open High Low Close Adj Close
Date
2010-01-05 3282.18 NaN 3282.18 3282.18 3282.18
2010-01-06 3254.22 3254.22 NaN 3254.22 3254.22
2010-01-07 3192.78 NaN 3192.78 NaN 3192.78

Filling with fillna

In [22]:
sh['High'].fillna(sh['High'].mean(),inplace=True) # fill in with the column mean
sh['Low'].fillna(method='ffill',inplace=True) # fill in with the value from the previous row
sh['Close'].fillna(method='bfill',inplace=True) # fill in with the value from the next row
In [23]:
sh.head()
Out[23]:
Open High Low Close Adj Close
Date
2010-01-05 3282.18 2658.308199 3282.18 3282.18 3282.18
2010-01-06 3254.22 3254.220000 3282.18 3254.22 3254.22
2010-01-07 3192.78 2658.308199 3192.78 3196.00 3192.78
2010-01-08 3196.00 3196.000000 3196.00 3196.00 3196.00
2010-01-11 3212.75 3212.750000 3212.75 3212.75 3212.75
In [24]:
sh.isnull().values.sum()
Out[24]:
0

Computing the daily change

The daily change is the amount by which the day's price rose or fell relative to the previous day's close. We add a Change column: the day's Close minus the previous day's. Note that the data has gaps; some dates have no record.

In [25]:
change = sh.Close.diff()
change.fillna(change.mean(),inplace=True)
sh['Change'] = change
In [26]:
sh.head(3)
Out[26]:
Open High Low Close Adj Close Change
Date
2010-01-05 3282.18 2658.308199 3282.18 3282.18 3282.18 -0.293509
2010-01-06 3254.22 3254.220000 3282.18 3254.22 3254.22 -27.960000
2010-01-07 3192.78 2658.308199 3192.78 3196.00 3192.78 -58.220000
In [27]:
# sort by Change; sh itself is unchanged, pass `inplace=True` if needed
sh.sort_values(by='Change',ascending=False).head(3)
Out[27]:
Open High Low Close Adj Close Change
Date
2015-10-12 3287.66 3287.66 3287.66 3287.66 3287.66 234.88
2015-06-30 4277.22 4277.22 4277.22 4277.22 4277.22 224.19
2015-06-01 4828.74 4828.74 4828.74 4828.74 4828.74 217.00

Computing the percent change

In [28]:
sh.head()
Out[28]:
Open High Low Close Adj Close Change
Date
2010-01-05 3282.18 2658.308199 3282.18 3282.18 3282.18 -0.293509
2010-01-06 3254.22 3254.220000 3282.18 3254.22 3254.22 -27.960000
2010-01-07 3192.78 2658.308199 3192.78 3196.00 3192.78 -58.220000
2010-01-08 3196.00 3196.000000 3196.00 3196.00 3196.00 0.000000
2010-01-11 3212.75 3212.750000 3212.75 3212.75 3212.75 16.750000
In [29]:
# stagger with the shift method
# sh['pct_change'] = (sh['Change'] - sh['Change'].shift(1)) / sh['Change'].shift(1)
# or use the pct_change function; note both compute the period-over-period change
# of the Change column itself (sh.Close.pct_change() would give daily returns instead)
sh['pct_change'] = sh.Change.pct_change()
sh.iloc[5:9]
Out[29]:
Open High Low Close Adj Close Change pct_change
Date
2010-01-12 3273.97 3273.97 3273.97 3273.97 3273.97 61.22 2.654925
2010-01-13 3172.66 3172.66 3172.66 3172.66 3172.66 -101.31 -2.654851
2010-01-14 3215.55 3215.55 3215.55 3215.55 3215.55 42.89 -1.423354
2010-01-15 3224.15 3224.15 3224.15 3224.15 3224.15 8.60 -0.799487

Apply and Applymap

Use applymap to format the data, keeping two decimal places throughout; the function is applied element-wise.

In [30]:
sh.head(2)
Out[30]:
Open High Low Close Adj Close Change pct_change
Date
2010-01-05 3282.18 2658.308199 3282.18 3282.18 3282.18 -0.293509 NaN
2010-01-06 3254.22 3254.220000 3282.18 3254.22 3254.22 -27.960000 94.261134
In [31]:
format = lambda x: '%.2f' % x
sh = sh.applymap(format)
In [32]:
sh.head(2)
Out[32]:
Open High Low Close Adj Close Change pct_change
Date
2010-01-05 3282.18 2658.31 3282.18 3282.18 3282.18 -0.29 nan
2010-01-06 3254.22 3254.22 3282.18 3254.22 3254.22 -27.96 94.26
In [33]:
sh = sh.applymap(float) # convert the values back to float
In [34]:
sh.ix[1]
Out[34]:
Open          3254.22
High          3254.22
Low           3282.18
Close         3254.22
Adj Close     3254.22
Change         -27.96
pct_change      94.26
Name: 2010-01-06 00:00:00, dtype: float64

Using apply to find each column's extremes and range

In [35]:
def f(x):
    return pd.Series([x.min(),x.max(),x.max()-x.min()],index=['min','max','range'])
sh.apply(f)
Out[35]:
Open High Low Close Adj Close Change pct_change
min 1950.01 1950.01 1950.01 1950.01 1950.01 -345.35 -inf
max 5166.35 5166.35 5166.35 5166.35 5166.35 234.88 inf
range 3216.34 3216.34 3216.34 3216.34 3216.34 580.23 inf

Group by Month

Manually
In [36]:
# group by month
sh['group_index'] = sh.index.map(lambda x: 100*x.year + x.month)
In [37]:
sh.head(2).append(sh.tail(2)) # inspect group_index, as expected
Out[37]:
Open High Low Close Adj Close Change pct_change group_index
Date
2010-01-05 3282.18 2658.31 3282.18 3282.18 3282.18 -0.29 NaN 201001
2010-01-06 3254.22 3254.22 3282.18 3254.22 3254.22 -27.96 94.26 201001
2016-05-19 2806.91 2806.91 2806.91 2806.91 2806.91 -0.60 -0.98 201605
2016-05-20 2825.48 2825.48 2825.48 2825.48 2825.48 18.57 -31.95 201605
In [38]:
sh.groupby('group_index') # returns a groupby object
Out[38]:
<pandas.core.groupby.DataFrameGroupBy object at 0x10f7a9110>
In [39]:
# aggregate different columns with different functions
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped = sh.groupby('group_index').agg({
        'Open' : peak_to_peak,
        'High' : 'max',
        'Low' : 'min',
        'Change' : 'mean'
    })
grouped.head(3).append(grouped.tail(3))
Out[39]:
High Open Low Change
group_index
201001 3273.97 295.57 2986.61 -15.430526
201002 3060.62 125.91 2934.71 3.132500
201003 3128.47 151.53 2976.94 2.485217
201603 3018.80 285.63 2733.17 13.736522
201604 3082.36 144.04 2938.32 -3.452632
201605 2997.84 190.93 2806.91 -8.060000
Pandas' time-series features make this more convenient
In [40]:
sh.index # a DatetimeIndex offers many convenient operations
Out[40]:
DatetimeIndex(['2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
               '2010-01-11', '2010-01-12', '2010-01-13', '2010-01-14',
               '2010-01-15', '2010-01-18',
               ...
               '2016-05-09', '2016-05-10', '2016-05-11', '2016-05-12',
               '2016-05-13', '2016-05-16', '2016-05-17', '2016-05-18',
               '2016-05-19', '2016-05-20'],
              dtype='datetime64[ns]', name=u'Date', length=1557, freq=None)
In [41]:
grouped_1 = sh.groupby(pd.TimeGrouper("M")).agg({
        'Open' : peak_to_peak,
        'High' : 'max',
        'Low' : 'min',
        'Change' : 'mean'
    })
grouped_1.head(3).append(grouped_1.tail(3))
Out[41]:
High Open Low Change
Date
2010-01-31 3273.97 295.57 2986.61 -15.430526
2010-02-28 3060.62 125.91 2934.71 3.132500
2010-03-31 3128.47 151.53 2976.94 2.485217
2016-03-31 3018.80 285.63 2733.17 13.736522
2016-04-30 3082.36 144.04 2938.32 -3.452632
2016-05-31 2997.84 190.93 2806.91 -8.060000
Or use resample for resampling

It can downsample as well as upsample; an upsampling sketch follows below.
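For upsampling one could, for instance, expand to calendar-day frequency and forward-fill the non-trading days (a sketch only, not used below):

sh_daily = sh[['Close']].resample('D').ffill()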

In [42]:
# syntax like resample('M',how='mean') is deprecated; .mean() and .apply() are recommended
# a dict can be passed to apply
grouped_2 = sh[['Open','High','Low','Change']].resample('M').apply({
        'Open' : peak_to_peak,
        'High' : 'max',
        'Low' : 'min',
        'Change' : 'mean'
    })
grouped_2.head(3).append(grouped_2.tail(3))
Out[42]:
High Open Low Change
Date
2010-01-31 3273.97 295.57 2986.61 -15.430526
2010-02-28 3060.62 125.91 2934.71 3.132500
2010-03-31 3128.47 151.53 2976.94 2.485217
2016-03-31 3018.80 285.63 2733.17 13.736522
2016-04-30 3082.36 144.04 2938.32 -3.452632
2016-05-31 2997.84 190.93 2806.91 -8.060000
In [43]:
sh[['Open']].plot();
plt.show()
In [1]:
import pandas as pd
import requests
import numpy as np
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
html_1 = requests.get('http://zh.wikipedia.org/wiki/%E7%9C%81%E4%BC%9A')
html_1_text = html_1.text
dfs_1 = pd.read_html(html_1_text,header=0,attrs={'class': 'wikitable'})
dfs_1[0].head()
Out[2]:
編號 行政區 簡稱 省會或首府 地區 Unnamed: 5 編號.1 行政區.1 簡稱.1 省會或首府.1 地區.1
0 1 江蘇省 鎮江 華中 20 甘肅省 蘭州 華北 NaN
1 2 浙江省 杭州 華中 21 寧夏省 銀川 塞北 NaN
2 3 安徽省 合肥 華中 22 青海省 西寧 西部 NaN
3 4 江西省 南昌 華中 23 綏遠省 歸綏(今呼和浩特) 塞北 NaN
4 5 湖北省 武昌(今武漢) 華中 24 察哈爾省 張垣(今張家口) 塞北 NaN

Reading tables with read_html

Installation

read_html depends on libraries such as html5lib, lxml, and Beautiful Soup; it raises an error if they are missing (pip install html5lib lxml beautifulsoup4 typically covers them).

A gripe

I had hoped to solve this quietly with one line, but no, there were pitfalls:
dsf = pd.read_html('http://data.earthquake.cn/datashare/globeEarthquake_csn.html',header=0)[4]
Running it, all the text came out garbled. It turned out to be an encoding problem; the fix is given below.

Still, I can't help complaining: why does this site put all of its content in tables? And if it must, could it at least supply an id or a class? Without one, the attrs trick for pinpointing the table fails, which is maddening.

Fetching the data

In [3]:
url = 'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
html = requests.get(url)

The mojibake problem

requests defaults to ISO-8859-1 encoding here, which garbles the text.

In [4]:
html.encoding
Out[4]:
'ISO-8859-1'
In [5]:
html_text = html.text
dfs = pd.read_html(html_text,header=0)
dfs[4].head()
Out[5]:
·¢ÕðÈÕÆÚ ·¢Õðʱ¿Ì γ¶È(¡ã) ¾­¶È(¡ã) Éî¶È(km) Õ𼶠ʼþÀàÐÍ ²Î¿¼µØµã
0 2016-05-24 20:12:09.6 40.9 79.3 6 Ms3.2 ÌìÈ»µØÕð н®°¢¿ËËÕµØÇøÎÚÊ²ÏØ
1 2016-05-24 18:34:34.2 40.2 77.2 8 Ms3.0 ÌìÈ»µØÕð н®¿Ë×ÎÀÕËÕÖݰ¢Í¼Ê²ÊÐ
2 2016-05-24 08:56:14.7 37.4 92.9 10 Ms3.0 ÌìÈ»µØÕð Çຣº£Î÷Öݸñ¶ûľÊÐ
3 2016-05-24 03:09:56.5 28.0 85.3 8 Ms3.8 ÌìÈ»µØÕð Äá²´¶û
4 2016-05-24 02:09:01.5 28.6 87.5 7 Ms3.9 ÌìÈ»µØÕð Î÷²ØÈÕ¿¦ÔòÊж¨ÈÕÏØ

Changing the default encoding

In [6]:
html.encoding =  html.apparent_encoding
html.encoding # GB2312, surprisingly; at first I assumed it was UTF-8...
Out[6]:
'GB2312'
In [7]:
html_text = html.text
dfs = pd.read_html(html_text,header=0) # returns a list whose elements are the tables
dfs[4].head()
Out[7]:
发震日期 发震时刻 纬度(°) 经度(°) 深度(km) 震级 事件类型 参考地点
0 2016-05-24 20:12:09.6 40.9 79.3 6 Ms3.2 天然地震 新疆阿克苏地区乌什县
1 2016-05-24 18:34:34.2 40.2 77.2 8 Ms3.0 天然地震 新疆克孜勒苏州阿图什市
2 2016-05-24 08:56:14.7 37.4 92.9 10 Ms3.0 天然地震 青海海西州格尔木市
3 2016-05-24 03:09:56.5 28.0 85.3 8 Ms3.8 天然地震 尼泊尔
4 2016-05-24 02:09:01.5 28.6 87.5 7 Ms3.9 天然地震 西藏日喀则市定日县
In [8]:
df = dfs[4] 
df.columns = ['date','time','lats','lons','dept','mag','class','place']
In [9]:
df.head(3)
Out[9]:
date time lats lons dept mag class place
0 2016-05-24 20:12:09.6 40.9 79.3 6 Ms3.2 天然地震 新疆阿克苏地区乌什县
1 2016-05-24 18:34:34.2 40.2 77.2 8 Ms3.0 天然地震 新疆克孜勒苏州阿图什市
2 2016-05-24 08:56:14.7 37.4 92.9 10 Ms3.0 天然地震 青海海西州格尔木市

Mapping the earthquake distribution

I don't know much about earthquakes; beyond looking at the distribution I'm not sure what else to analyze, so let's start with a distribution map.

Drop the rows whose magnitude starts with Ms (7 in total) and keep only the ML rows, to keep the analysis simple.

In [10]:
Ms = df.mag.map(lambda x: not x.startswith('Ms')) # boolean Series
df = df[Ms]

Converting the magnitude to a number

In [16]:
import re
get_num = lambda x: float(re.findall('\d+\.\d+', x)[0])
# a = 'ML0.5'
# n = re.findall('\d+\.\d+', a)
# float(n[0])
# df['mag_num'] = df['mag'].map(get_num) triggers a warning here; see:
# http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
temp =  df['mag'].map(get_num)
df.loc[:,('mag_num')] = temp
df.head(3)
Out[16]:
date time lats lons dept mag class place mag_num
5 2016-05-23 23:43:52.5 23.30 103.21 6 ML1.7 天然地震 云南个旧 1.7
6 2016-05-23 23:43:33.9 39.82 118.76 8 ML0.5 天然地震 河北滦县 0.5
7 2016-05-23 23:36:25.9 39.60 76.93 6 ML1.4 天然地震 新疆伽师 1.4
Get the map extent, expressed in latitude and longitude
In [17]:
# lower left
llcrnrlon, llcrnrlat = df['lons'].min(), df['lats'].min()
# upper right
urcrnrlon, urcrnrlat = df['lons'].max(), df['lats'].max()
Get the quake locations (latitude/longitude) and magnitudes
In [18]:
lons, lats = list(df['lons']), list(df['lats'])
mags = list(df['mag_num'])
Helper function

Different magnitude levels get different colors.

In [19]:
# color by magnitude
def get_marker_color(magnitude):
    # Returns green for small earthquakes, yellow for moderate
    #  earthquakes, and red for significant earthquakes.
    if magnitude < 1.5:
        return ('go')
    elif magnitude < 3.0:
        return ('yo')
    else:
        return ('ro')
Plotting
In [20]:
fig = plt.figure(figsize=(14,10))
ax = plt.subplot(1,1,1)
eq_map = Basemap(projection='merc', resolution = 'l', area_thresh = 1000.0,
              lat_0=0, lon_0=120,
              llcrnrlon=llcrnrlon-5, llcrnrlat=llcrnrlat-8,
              urcrnrlon=urcrnrlon+10, urcrnrlat=urcrnrlat+3)

eq_map.drawcoastlines()
eq_map.drawcountries()
eq_map.fillcontinents(color = 'lightgray')
eq_map.drawmapboundary()
eq_map.drawmeridians(np.arange(0, 360, 30))
eq_map.drawparallels(np.arange(-90, 90, 30))

min_marker_size = 4
for lon, lat, mag in zip(lons,lats,mags):
    x,y = eq_map(lon, lat)
    msize = mag*min_marker_size
    marker_string = get_marker_color(mag)
    eq_map.plot(x, y, marker_string, markersize=msize)

plt.show()

This shows how recent earthquakes are distributed across the country.

Mining QQ chat logs

Mainly practice with basic pandas operations.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline
In [2]:
# a first look at the data
!head -n 4 qqdata.csv
id,time
8cha0,2011/7/8 12:11:13
2cha061,2011/7/8 12:11:49
6cha437,2011/7/8 12:13:36
In [3]:
!wc -l qqdata.csv # the data is small, just over ten thousand lines; read it directly
   11563 qqdata.csv

Parsing the timestamps

Read in directly, the time column is of str type; parsing it into a datetime type makes analysis easier.

In [4]:
# the default parse_dates=True cannot parse this format effectively
# http://stackoverflow.com/questions/17465045/can-pandas-automatically-recognize-dates
dateparse = lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
qq = pd.read_csv('qqdata.csv', parse_dates=['time'], date_parser=dateparse)
In [5]:
qq['time'][0] # a Timestamp, not a str
Out[5]:
Timestamp('2011-07-08 12:11:13')
In [6]:
qq.head(3)
Out[6]:
id time
0 8cha0 2011-07-08 12:11:13
1 2cha061 2011-07-08 12:11:49
2 6cha437 2011-07-08 12:13:36

Getting basic facts

Group size

In [7]:
len(qq['id'].unique()) # 144 people in the group
Out[7]:
144

Are the timestamps unique? There are duplicates, so don't set them as the index for now.

In [8]:
time = qq['time']
len(time) == len(time.unique())
Out[8]:
False

Finding the chatterbox

Define the chatterbox as the person who sent the most messages.

In [9]:
qq['count'] = 1 # add a column
In [10]:
# since qq['count'] is all ones, count() could equally be sum()
gp_by_id = qq['count'].groupby(qq['id']).count().sort_values(ascending=False)
In [11]:
type(gp_by_id) # returns a Series
Out[11]:
pandas.core.series.Series
In [12]:
gp_by_id[:5]
Out[12]:
id
7cha1      1511
6cha437    1238
4cha387    1100
8cha08      695
4cha69      533
dtype: int64
In [13]:
plt.figure(figsize=(6,4))
ax = gp_by_id[:10].plot.bar()
ax.set_xticklabels(gp_by_id.index,rotation='45')
ax.set_xlabel('');

There is one very odd id, )chailed (104: Connection reset by pee; let's confirm it is really in the data.

In [14]:
(qq['id'] == ')chailed (104: Connection reset by pee').any()
Out[14]:
True

A perennial lurker too: exactly one message ever, and with a name this hard to remember. Dear group owner, feel free to kick.

In [15]:
gp_by_id.ix[')chailed (104: Connection reset by pee']
Out[15]:
1

Weekly distribution of chat density

See which days of the week the chatter concentrates on.

In [16]:
# add a weekday column derived from time (0 = Monday)
qq['weekday'] = qq['time'].map(lambda x : x.weekday())
In [17]:
gp_by_weekday = qq['count'].groupby(qq['weekday']).count()
In [18]:
gp_by_weekday.plot.bar(); # so much chatter on workdays instead of working

Hourly distribution of chat density

In [19]:
# add an hour column derived from time
qq['hour'] = qq['time'].map(lambda x : x.hour)
gp_by_hour = qq['count'].groupby(qq['hour']).count()
gp_by_hour.plot.bar();

Historical distribution of chat density

In [20]:
# add a day column derived from time
qq['day'] = qq['time'].map(lambda x : x.date())
gp_by_day = qq['count'].groupby(qq['day']).count()
In [21]:
ax = gp_by_day.plot.bar();
ax.set_xticks([])
ax.set_xticklabels([]);

Which user is active on the most days?

A user counts as active on a day if they said anything that day.

In [22]:
# qq.groupby('id'): group by id
# .day: we only care about the active day here
# .nunique(): the number of distinct active days
# equivalent to apply(lambda x: len(x.unique()))
gp_by_act_day = qq.groupby('id').day.nunique().sort_values(ascending=False)
In [23]:
plt.figure(figsize=(6,4))
ax = gp_by_act_day[:10].plot.bar()
ax.set_xticklabels(gp_by_act_day.index,rotation='45')
ax.set_xlabel('');

Active users versus message volume

Check whether days with more active speakers also see more messages.

In [24]:
# active users per day
people = qq['id'].groupby(qq['day']).nunique()
# messages per day
speech = qq['count'].groupby(qq['day']).count()
In [25]:
# a positive correlation is apparent
plt.figure(figsize=(6,4))
ax = plt.scatter(people,speech)
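To quantify the visible positive correlation, a quick check is the Pearson correlation between the two daily series (a sketch; the exact value depends on the data):

print(people.corr(speech)) # Pearson correlation of daily speaker count vs. message count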