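Task description
- Topic: paper code statistics, that is, statistics on the papers that provide code;
- Content: use regular expressions to extract code links, page counts, and figure counts;
- Outcome: learn how to compute statistics with regular expressions.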
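Data processing steps
In the raw arXiv dataset, authors often give a concrete code link in the comments or abstract field of a paper, so we need to find the code links in these fields.
- Determine where the data appears;
- Match it with a regular expression;
- Compute the relevant statistics.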
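The three steps above can be sketched on a single record. This is only an illustration: the `record` dictionary, its text, and the URL in it are invented, and the pattern is deliberately simpler than the one used later in this tutorial.

```python
import re

# Hypothetical record: field names mirror the arXiv metadata, values are invented.
record = {
    "abstract": "We propose a new method. Code: https://github.com/example/repo",
    "comments": "12 pages, 3 figures",
}

# Step 1: decide where the data lives (here: abstract and comments).
text = " ".join(record[field] or "" for field in ("abstract", "comments"))

# Step 2: match with a regular expression (a URL scheme followed by non-whitespace).
links = re.findall(r"https?://\S+", text)

# Step 3: compute a statistic, here simply whether any link was found.
has_code = len(links) > 0
print(links, has_code)
```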
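Regular expressions
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace matched substrings, or to extract the substrings that satisfy some condition.
Ordinary characters: uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols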
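| Character | Description |
|---|---|
| `[ABC]` | Matches any character listed in `[...]`. For example, `[aeiou]` matches every e, o, u, and a in the string "google runoob taobao". |
| `[^ABC]` | Matches any character not listed in `[...]`. For example, `[^aeiou]` matches every character in "google runoob taobao" other than e, o, u, and a. |
| `[A-Z]` | A range: `[A-Z]` matches any uppercase letter, and `[a-z]` matches any lowercase letter. |
| `.` | Matches any single character except line terminators. In many engines this is equivalent to `[^\n\r]`; in Python's `re` module only `\n` is excluded by default. |
| `[\s\S]` | Matches any character. `\s` matches any whitespace character, including newlines; `\S` matches any non-whitespace character. |
| `\w` | Matches letters, digits, and the underscore. Equivalent to `[A-Za-z0-9_]` for ASCII text. |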
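The character classes in the table can be tried directly with Python's `re` module. The sketch below reuses the sample string from the table; the variable names are only illustrative.

```python
import re

s = "google runoob taobao"

# [aeiou]: every vowel, in order of appearance.
vowels = re.findall(r"[aeiou]", s)

# [^aeiou]: everything else, including the two spaces.
non_vowels = re.findall(r"[^aeiou]", s)

# \w+ : runs of letters, digits, and underscores.
words = re.findall(r"\w+", s)

# "." skips the newline by default, while [\s\S] matches it too.
dot = re.findall(r".", "a\nb")
anything = re.findall(r"[\s\S]", "a\nb")

print("".join(vowels), len(non_vowels), words, dot, anything)
```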
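Special characters: characters that carry a special meaning

| Special character | Description |
|---|---|
| `( )` | Mark the start and end of a subexpression. The subexpression can be captured for later use. To match these characters literally, use `\(` and `\)`. |
| `*` | Matches the preceding subexpression zero or more times. To match a literal `*`, use `\*`. |
| `+` | Matches the preceding subexpression one or more times. To match a literal `+`, use `\+`. |
| `.` | Matches any single character except the newline `\n`. To match a literal `.`, use `\.`. |
| `[` | Marks the start of a bracket expression. To match a literal `[`, use `\[`. |
| `?` | Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match a literal `?`, use `\?`. |
| `\` | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, `n` matches the character "n", `\n` matches a newline, the sequence `\\` matches "\", and `\(` matches "(". |
| `^` | Matches the start of the input string, except inside a bracket expression, where it negates the character set. To match a literal `^`, use `\^`. |
| `{` | Marks the start of a quantifier expression. To match a literal `{`, use `\{`. |
| `\|` (vertical bar) | Separates two alternatives. To match a literal vertical bar, escape it with a backslash. |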
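To see why escaping matters, the following sketch compares escaped and unescaped metacharacters on an invented test string. `re.escape` is the standard-library helper that adds the backslashes automatically.

```python
import re

# Invented test string containing several metacharacters.
text = "f(x) = a+b*c? [note] cost: 3.50 | 4x50"

# Escaped parentheses and brackets match themselves.
open_paren = re.findall(r"\(", text)
bracketed = re.findall(r"\[note\]", text)

# An unescaped dot matches any character, so "4x50" is matched as well.
loose = re.findall(r"\d.50", text)
# An escaped dot matches only a literal ".".
strict = re.findall(r"\d\.50", text)

# A literal vertical bar needs a backslash, otherwise it means "or".
bars = re.findall(r"\|", text)

# re.escape builds a pattern that matches an arbitrary string literally.
literal = re.findall(re.escape("a+b*c?"), text)

print(open_paren, bracketed, loose, strict, bars, literal)
```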
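Quantifiers

| Character | Description |
|---|---|
| `*` | Matches the preceding subexpression zero or more times. For example, `zo*` matches "z" as well as "zoo". `*` is equivalent to `{0,}`. |
| `+` | Matches the preceding subexpression one or more times. For example, `zo+` matches "zo" and "zoo" but not "z". `+` is equivalent to `{1,}`. |
| `?` | Matches the preceding subexpression zero or one time. For example, `do(es)?` matches "do", the "does" in "does", and the "do" in "doxy". `?` is equivalent to `{0,1}`. |
| `{n}` | n is a non-negative integer. Matches exactly n times. For example, `o{2}` does not match the "o" in "Bob" but matches the two o's in "food". |
| `{n,}` | n is a non-negative integer. Matches at least n times. For example, `o{2,}` does not match the "o" in "Bob" but matches all the o's in "foooood". `o{1,}` is equivalent to `o+`, and `o{0,}` is equivalent to `o*`. |
| `{n,m}` | m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, `o{1,3}` matches the first three o's in "fooooood". `o{0,1}` is equivalent to `o?`. Note that there must be no space between the comma and the numbers. |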
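The quantifier examples from the table can be checked in Python as follows. The last line additionally shows the non-greedy form `{1,3}?`, which is not in the table but follows from the description of `?` as a non-greedy marker.

```python
import re

# * : zero or more, so "z" alone already matches zo*.
star_z = re.fullmatch(r"zo*", "z") is not None
star_zoo = re.fullmatch(r"zo*", "zoo") is not None

# + : one or more, so "z" alone does not match zo+.
plus_z = re.fullmatch(r"zo+", "z") is not None

# ? : zero or one; (?:...) groups without capturing so findall returns whole matches.
optional = re.findall(r"do(?:es)?", "do does doxy")

# {n}, {n,} and {n,m}
exactly_two = re.findall(r"o{2}", "Bob food")
at_least_two = re.findall(r"o{2,}", "Bob foooood")
first_up_to_three = re.search(r"o{1,3}", "fooooood").group()

# Appending ? makes the quantifier non-greedy: it takes as few characters as possible.
lazy = re.search(r"o{1,3}?", "fooooood").group()

print(star_z, star_zoo, plus_z, optional, exactly_two, at_least_two, first_up_to_three, lazy)
```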
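Implementation and walkthrough
We first compute page statistics, which means extracting the numbers of pages and figures from the comments field. The first step is to read the required fields.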
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
                                 'report-no', 'categories', 'license', 'abstract', 'versions',
                                 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file into a DataFrame.
    path: path to the file
    columns: columns to keep
    count: number of lines to read (None reads the whole file)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)
            d = {col : d[col] for col in columns}
            data.append(d)
    data = pd.DataFrame(data)
    return data
data = readArxivFile('../data/arxiv-metadata-oai-snapshot.json', ['id', 'abstract', 'categories', 'comments'])
data.head()
|   | id | abstract | categories | comments |
|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version |
| 1 | 0704.0002 | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | To appear in Graphs and Combinatorics |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics.gen-ph | 23 pages, 3 figures |
| 3 | 0704.0004 | We show that a determinant of Stirling cycle... | math.CO | 11 pages |
| 4 | 0704.0005 | In this paper we show how to compute the $\L... | math.CA math.FA | None |
Extracting pages
data['pages'] = data['comments'].apply(lambda x: re.findall("[1-9][0-9]* pages", str(x)))
data.head()
|   | id | abstract | categories | comments | pages |
|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | [37 pages] |
| 1 | 0704.0002 | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | To appear in Graphs and Combinatorics | [] |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics.gen-ph | 23 pages, 3 figures | [23 pages] |
| 3 | 0704.0004 | We show that a determinant of Stirling cycle... | math.CO | 11 pages | [11 pages] |
| 4 | 0704.0005 | In this paper we show how to compute the $\L... | math.CA math.FA | None | [] |
data = data[data["pages"].apply(lambda x: len(x) > 0)]
data.head()
|   | id | abstract | categories | comments | pages |
|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | [37 pages] |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics.gen-ph | 23 pages, 3 figures | [23 pages] |
| 3 | 0704.0004 | We show that a determinant of Stirling cycle... | math.CO | 11 pages | [11 pages] |
| 5 | 0704.0006 | We study the two-particle wave function of p... | cond-mat.mes-hall | 6 pages, 4 figures, accepted by PRA | [6 pages] |
| 6 | 0704.0007 | A rather non-standard quantum representation... | gr-qc | 16 pages, no figures. Typos corrected to match... | [16 pages] |
data['pages'] = data.loc[:,'pages'].apply(lambda x: float(x[0].replace(" pages", '')))
data.head()
|   | id | abstract | categories | comments | pages |
|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | 37.0 |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics.gen-ph | 23 pages, 3 figures | 23.0 |
| 3 | 0704.0004 | We show that a determinant of Stirling cycle... | math.CO | 11 pages | 11.0 |
| 5 | 0704.0006 | We study the two-particle wave function of p... | cond-mat.mes-hall | 6 pages, 4 figures, accepted by PRA | 6.0 |
| 6 | 0704.0007 | A rather non-standard quantum representation... | gr-qc | 16 pages, no figures. Typos corrected to match... | 16.0 |
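The find, filter, and convert steps above can also be folded into one helper by putting the number inside a capture group. The sketch below uses only the standard library and the same pattern as above; the helper name `extract_pages` is an illustrative choice, not part of the original code. Like the original pattern, it does not match the singular "1 page".

```python
import re

# Same pattern as above, with the number wrapped in a capture group.
PAGES_RE = re.compile(r"([1-9][0-9]*) pages")

def extract_pages(comment):
    """Return the first page count found in a comment as an int, or None."""
    match = PAGES_RE.search(str(comment))
    return int(match.group(1)) if match else None

# Comments taken from the sample rows shown above (None stands for a missing value).
comments = [
    "37 pages, 15 figures; published version",
    "To appear in Graphs and Combinatorics",
    None,
    "11 pages",
]
pages = [extract_pages(c) for c in comments]
print(pages)
```

Rows for which the helper returns None correspond to the rows dropped by the length filter above.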
Preliminary statistics on pages
data["pages"].describe().astype(int)
count 1089180
mean 17
std 22
min 1
25% 8
50% 13
75% 22
max 11232
Name: pages, dtype: int32
Page counts by category, taking the first category of each paper as its primary category
data["categories"] = data.loc[:, "categories"].apply(lambda x: x.split(' ')[0])
data["categories"] = data.loc[:, "categories"].apply(lambda x: x.split('.')[0])
data.head()
|   | id | abstract | categories | comments | pages |
|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | 37.0 |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics | 23 pages, 3 figures | 23.0 |
| 3 | 0704.0004 | We show that a determinant of Stirling cycle... | math | 11 pages | 11.0 |
| 5 | 0704.0006 | We study the two-particle wave function of p... | cond-mat | 6 pages, 4 figures, accepted by PRA | 6.0 |
| 6 | 0704.0007 | A rather non-standard quantum representation... | gr-qc | 16 pages, no figures. Typos corrected to match... | 16.0 |
fig, axes = plt.subplots(figsize=(12, 6))
data.groupby(["categories"])["pages"].mean().plot(kind="bar")
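Using groupby
Grouping operations are extremely common in everyday analysis, for example:
- Group by gender and compute the mean life expectancy of the national population
- Group by season and standardize the temperature within each season
- Group by class and select the classes whose mean math score exceeds 80
These examples show that a grouping operation needs three elements: the grouping key, the data source, and the operation together with its result. Conversely, once these three are specified, the grouping operation is fully determined, so the general pattern for grouping code is:
df.groupby(grouping key)[data source].operation
For instance, the code for the first example would be: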
df.groupby('Gender')['Longevity'].mean()
Returning to the student physical-examination dataset, the median height by gender can be computed as follows:
In [3]: df = pd.read_csv('data/learn_pandas.csv')
In [4]: df.groupby('Gender')['Height'].median()
Out[4]:
Gender
Female 159.6
Male 173.4
Name: Height, dtype: float64
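To make the three elements concrete without relying on any dataset, the following plain-Python sketch mimics what `df.groupby('Gender')['Height'].median()` computes. The records are invented for illustration and are not taken from `learn_pandas.csv`; pandas itself is not used here.

```python
from collections import defaultdict
from statistics import median

# Invented records; in pandas these would be the rows of a DataFrame.
rows = [
    {"Gender": "Female", "Height": 158.0},
    {"Gender": "Male", "Height": 173.0},
    {"Gender": "Female", "Height": 162.0},
    {"Gender": "Male", "Height": 175.0},
    {"Gender": "Female", "Height": 160.0},
]

# Grouping key: "Gender". Data source: "Height".
groups = defaultdict(list)
for row in rows:
    groups[row["Gender"]].append(row["Height"])

# Operation: median, applied to each group separately.
result = {key: median(values) for key, values in groups.items()}
print(result)
```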
data["figures"] = data.loc[:, "comments"].apply(lambda x: re.findall('[1-9][0-9]* figure', str(x)))
data.head()
|   | id | abstract | categories | comments | pages | figures |
|---|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | 37.0 | [15 figure] |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics | 23 pages, 3 figures | 23.0 | [3 figure] |
| 5 | 0704.0006 | We study the two-particle wave function of p... | cond-mat | 6 pages, 4 figures, accepted by PRA | 6.0 | [4 figure] |
| 9 | 0704.0010 | Partial cubes are isometric subgraphs of hyp... | math | 36 pages, 17 figures | 36.0 | [17 figure] |
| 13 | 0704.0014 | In this article we discuss a relation betwee... | math | 18 pages, 1 figure | 18.0 | [1 figure] |
data = data[data.loc[:,"figures"].apply(lambda x: len(x) > 0)]
data.head()
|   | id | abstract | categories | comments | pages | figures |
|---|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | 37.0 | [15 figure] |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics | 23 pages, 3 figures | 23.0 | [3 figure] |
| 5 | 0704.0006 | We study the two-particle wave function of p... | cond-mat | 6 pages, 4 figures, accepted by PRA | 6.0 | [4 figure] |
| 9 | 0704.0010 | Partial cubes are isometric subgraphs of hyp... | math | 36 pages, 17 figures | 36.0 | [17 figure] |
| 13 | 0704.0014 | In this article we discuss a relation betwee... | math | 18 pages, 1 figure | 18.0 | [1 figure] |
data["figures"] = data.loc[:, "figures"].apply(lambda x: int(x[0].replace(" figure", " ")))
data.head()
|   | id | abstract | categories | comments | pages | figures |
|---|---|---|---|---|---|---|
| 0 | 0704.0001 | A fully differential calculation in perturba... | hep-ph | 37 pages, 15 figures; published version | 37.0 | 15 |
| 2 | 0704.0003 | The evolution of Earth-Moon system is descri... | physics | 23 pages, 3 figures | 23.0 | 3 |
| 5 | 0704.0006 | We study the two-particle wave function of p... | cond-mat | 6 pages, 4 figures, accepted by PRA | 6.0 | 4 |
| 9 | 0704.0010 | Partial cubes are isometric subgraphs of hyp... | math | 36 pages, 17 figures | 36.0 | 17 |
| 13 | 0704.0014 | In this article we discuss a relation betwee... | math | 18 pages, 1 figure | 18.0 | 1 |
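As a variation on the figures extraction above, both counts can be pulled out in one pass with named groups. This is a sketch under stated assumptions rather than the tutorial's method: the helper name `parse_comment`, the optional plural `s`, and the case-insensitive flag are additions, and the third sample comment is invented.

```python
import re

# Assumed variants of the tutorial's patterns: named groups, optional plural "s",
# and case-insensitive matching.
PAGES_RE = re.compile(r"(?P<pages>[1-9][0-9]*) pages?", re.IGNORECASE)
FIGURES_RE = re.compile(r"(?P<figures>[1-9][0-9]*) figures?", re.IGNORECASE)

def parse_comment(comment):
    """Return a dict with the page and figure counts, using None when absent."""
    text = "" if comment is None else str(comment)
    pages = PAGES_RE.search(text)
    figures = FIGURES_RE.search(text)
    return {
        "pages": int(pages.group("pages")) if pages else None,
        "figures": int(figures.group("figures")) if figures else None,
    }

parsed = [
    parse_comment("18 pages, 1 figure"),
    parse_comment("16 pages, no figures. Typos corrected"),
    parse_comment("5 Pages and 2 Figures"),  # invented comment, tests case-insensitivity
    parse_comment(None),
]
print(parsed)
```

The comment "no figures" yields None because the pattern requires a number starting with a non-zero digit, which matches the filtering behavior of the original code.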
fig, axes = plt.subplots(figsize=(12, 6))
data.groupby(["categories"])["figures"].mean().plot(kind="bar")
Extracting code links
print(type(data.comments),type(data["comments"]))
data.comments == data["comments"]
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
0 True
2 True
5 True
9 True
13 True
...
1796880 True
1796885 True
1796890 True
1796894 True
1796902 True
Name: comments, Length: 670321, dtype: bool
data_with_github_code = data[(data.comments.str.contains("github") == True)|
(data.abstract.str.contains("github") == True)
]
data_with_github_code
|   | id | abstract | categories | comments | pages | figures |
|---|---|---|---|---|---|---|
| 87991 | 0810.2412 | The Clifford algebra of a n-dimensional Eucl... | math-ph | 10 pages, 1 figure | 10.0 | 1 |
| 212359 | 1009.2203 | Quantum error correction allows for faulty q... | quant-ph | 38 pages, 15 figure, 10 tables. The algorithm ... | 38.0 | 15 |
| 229459 | 1012.0091 | Conan is a C++ library created for the accur... | q-bio | 5 pages and 1 figure | 5.0 | 1 |
| 253172 | 1103.5904 | Solar tomography has progressed rapidly in r... | astro-ph | 21 pages, 6 figures, 5 tables | 21.0 | 6 |
| 254226 | 1104.0672 | We describe a hybrid Fourier/direct space co... | astro-ph | 10 pages, 6 figures. Submitted to Astronomy an... | 10.0 | 6 |
| ... | ... | ... | ... | ... | ... | ... |
| 1381310 | 2011.08562 | The target identification in brain-computer ... | cs | 12 pages, 6 figures | 12.0 | 6 |
| 1381509 | 2011.08761 | In this paper, we study the problem of imagi... | eess | 10 pages, 2 figures, to be published in STACOM... | 10.0 | 2 |
| 1381606 | 2011.08858 | We derive a simple prescription for includin... | astro-ph | 14 pages; 6 figures; 3 appendices | 14.0 | 6 |
| 1381626 | 2011.08878 | The production of numerical relativity wavef... | gr-qc | 11 pages, 1 figure, 1 table. Open source softw... | 11.0 | 1 |
| 1382418 | 2011.09670 | Rotation detection serves as a fundamental b... | cs | 12 pages, 6 figures, 8 tables | 12.0 | 6 |

2265 rows × 6 columns
data_with_github_code.loc[:, "text"] = data_with_github_code.loc[:, "abstract"].fillna("") + data_with_github_code.loc[:, "comments"].fillna("")
data_with_github_code
|   | id | abstract | categories | comments | pages | figures | text |
|---|---|---|---|---|---|---|---|
| 87991 | 0810.2412 | The Clifford algebra of a n-dimensional Eucl... | math-ph | 10 pages, 1 figure | 10.0 | 1 | The Clifford algebra of a n-dimensional Eucl... |
| 212359 | 1009.2203 | Quantum error correction allows for faulty q... | quant-ph | 38 pages, 15 figure, 10 tables. The algorithm ... | 38.0 | 15 | Quantum error correction allows for faulty q... |
| 229459 | 1012.0091 | Conan is a C++ library created for the accur... | q-bio | 5 pages and 1 figure | 5.0 | 1 | Conan is a C++ library created for the accur... |
| 253172 | 1103.5904 | Solar tomography has progressed rapidly in r... | astro-ph | 21 pages, 6 figures, 5 tables | 21.0 | 6 | Solar tomography has progressed rapidly in r... |
| 254226 | 1104.0672 | We describe a hybrid Fourier/direct space co... | astro-ph | 10 pages, 6 figures. Submitted to Astronomy an... | 10.0 | 6 | We describe a hybrid Fourier/direct space co... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1381310 | 2011.08562 | The target identification in brain-computer ... | cs | 12 pages, 6 figures | 12.0 | 6 | The target identification in brain-computer ... |
| 1381509 | 2011.08761 | In this paper, we study the problem of imagi... | eess | 10 pages, 2 figures, to be published in STACOM... | 10.0 | 2 | In this paper, we study the problem of imagi... |
| 1381606 | 2011.08858 | We derive a simple prescription for includin... | astro-ph | 14 pages; 6 figures; 3 appendices | 14.0 | 6 | We derive a simple prescription for includin... |
| 1381626 | 2011.08878 | The production of numerical relativity wavef... | gr-qc | 11 pages, 1 figure, 1 table. Open source softw... | 11.0 | 1 | The production of numerical relativity wavef... |
| 1382418 | 2011.09670 | Rotation detection serves as a fundamental b... | cs | 12 pages, 6 figures, 8 tables | 12.0 | 6 | Rotation detection serves as a fundamental b... |

2265 rows × 7 columns
pattern = r"[a-z]+://github[^\s]*"  # raw string, so \s reaches the regex engine unchanged
data_with_github_code["code_flag"] = data_with_github_code.loc[:, "text"].str.findall(pattern).apply(lambda x: len(x)>0)
data_with_github_code.head()
|   | id | abstract | categories | comments | pages | figures | text | code_flag |
|---|---|---|---|---|---|---|---|---|
| 87991 | 0810.2412 | The Clifford algebra of a n-dimensional Eucl... | math-ph | 10 pages, 1 figure | 10.0 | 1 | The Clifford algebra of a n-dimensional Eucl... | False |
| 212359 | 1009.2203 | Quantum error correction allows for faulty q... | quant-ph | 38 pages, 15 figure, 10 tables. The algorithm ... | 38.0 | 15 | Quantum error correction allows for faulty q... | True |
| 229459 | 1012.0091 | Conan is a C++ library created for the accur... | q-bio | 5 pages and 1 figure | 5.0 | 1 | Conan is a C++ library created for the accur... | True |
| 253172 | 1103.5904 | Solar tomography has progressed rapidly in r... | astro-ph | 21 pages, 6 figures, 5 tables | 21.0 | 6 | Solar tomography has progressed rapidly in r... | False |
| 254226 | 1104.0672 | We describe a hybrid Fourier/direct space co... | astro-ph | 10 pages, 6 figures. Submitted to Astronomy an... | 10.0 | 6 | We describe a hybrid Fourier/direct space co... | True |
data_with_github_code = data_with_github_code[data_with_github_code["code_flag"] == 1]
fig, axes = plt.subplots(figsize=(12,6))
data_with_github_code.groupby(["categories"])["code_flag"].count().plot(kind="bar")
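The flag above only records whether a link exists. To inspect the links themselves, the same pattern can be applied with `re.findall`, as in the sketch below. The sample sentences and URLs are invented, the helper name `find_github_links` is illustrative, and stripping trailing punctuation is an added assumption about how links tend to appear at the end of a sentence.

```python
import re

# Same pattern as the tutorial: a URL scheme, "://github", then non-whitespace.
GITHUB_RE = re.compile(r"[a-z]+://github[^\s]*")

def find_github_links(text):
    """Return GitHub-style URLs found in text, with trailing punctuation removed."""
    return [url.rstrip(".,;)") for url in GITHUB_RE.findall(text)]

# Invented sample texts.
samples = [
    "Code is available at https://github.com/example/project.",
    "See our github page for details",
    "Mirror: http://github.com/example/other (archived)",
]
links = [find_github_links(s) for s in samples]
print(links)
```

The second sample contains the word "github" but no URL, which illustrates why some rows selected by `str.contains("github")` end up with `code_flag` set to False.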