Task 3: Paper Code Statistics


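Task Description

  • Topic: paper code statistics, that is, statistics on the code that appears across all papers;
  • Content: use regular expressions to count code links, page counts, and figure counts;
  • Outcome: learn how to compute statistics with regular expressions.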

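Data Processing Steps

In the raw arXiv dataset, authors often give a concrete code link in a paper's comments or abstract field, so we need to pull the code links out of these fields.

  • Determine where the data appears;
  • Match it with regular expressions;
  • Compute the relevant statistics.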
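The three steps above can be sketched on a few made-up records. This is only an illustration: the records, the helper name `find_code_links`, and the stricter `github\.com` pattern are assumptions for this sketch, not the pattern used later in this post.

```python
import re
from collections import Counter

# Hypothetical records mimicking the arXiv metadata fields used in this post.
records = [
    {"comments": "10 pages, 2 figures", "abstract": "Code: https://github.com/user/repo"},
    {"comments": None, "abstract": "No code released."},
    {"comments": "Code at https://github.com/other/tool", "abstract": "We study X."},
]

url_pattern = re.compile(r"https?://github\.com/\S+")

def find_code_links(record):
    # Step 1: look in the fields where links usually appear.
    text = " ".join(str(record.get(field) or "") for field in ("abstract", "comments"))
    # Step 2: match with a regular expression.
    return url_pattern.findall(text)

# Step 3: aggregate the matches into a simple statistic.
links_per_record = [find_code_links(r) for r in records]
has_code = Counter(bool(links) for links in links_per_record)
print(has_code)
```

Joining the fields with a space keeps a link at the end of one field from running into the start of the next.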

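Regular Expressions

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract from a string the substrings that satisfy some condition.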
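As a small sketch of those three uses with Python's `re` module (the sample string is taken from the comments column shown later in this post; the variable names are only illustrative):

```python
import re

comment = "37 pages, 15 figures; published version"

# 1. Check whether the string contains a matching substring.
has_pages = re.search(r"[0-9]+ pages", comment) is not None

# 2. Replace every matching substring.
masked = re.sub(r"[0-9]+", "N", comment)

# 3. Extract the substrings that satisfy the pattern.
numbers = re.findall(r"[0-9]+", comment)

print(has_pages, masked, numbers)
```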

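Ordinary characters: uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.

Character  Description
[ABC]  Matches any single character listed in [...]. For example, [aeiou] matches every e, o, u, and a in the string "google runoob taobao".
[^ABC]  Matches any single character not listed in [...]. For example, [^aeiou] matches every character in "google runoob taobao" other than e, o, u, and a (spaces included).
[A-Z]  A range: [A-Z] matches any uppercase letter, and [a-z] matches any lowercase letter.
.  Matches any single character except a line break. Python's re module excludes only \n; some other engines, such as JavaScript, also exclude \r, which makes . equivalent to [^\n\r] there.
[\s\S]  Matches any character. \s matches any whitespace character, including line breaks; \S matches any non-whitespace character.
\w  Matches letters, digits, and the underscore. For ASCII text it is equivalent to [A-Za-z0-9_].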
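The rows of this table can be checked directly with `re.findall`. The snippet below is a minimal sketch that reuses the "google runoob taobao" string from the table; the other sample strings are made up for illustration.

```python
import re

s = "google runoob taobao"

vowels = re.findall(r"[aeiou]", s)        # characters listed in the set
others = re.findall(r"[^aeiou]", s)       # everything else, spaces included
capitals = re.findall(r"[A-Z]", "Hello World")
words = re.findall(r"\w+", "a_1 b-2")     # "-" is not a word character

# "." skips the newline, while [\s\S] matches it as well.
dot = re.findall(r".", "a\nb")
anything = re.findall(r"[\s\S]", "a\nb")

print(len(vowels), "".join(others), capitals, words, dot, anything)
```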

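Special characters: characters that carry a special meaning.

Special character  Description
( )  Mark the start and end of a subexpression. The subexpression can be captured for later use. To match these characters literally, use \( and \).
*  Matches the preceding subexpression zero or more times. To match a literal *, use \*.
+  Matches the preceding subexpression one or more times. To match a literal +, use \+.
.  Matches any single character except the newline \n. To match a literal ., use \. .
[  Marks the start of a bracket expression. To match a literal [, use \[.
?  Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match a literal ?, use \?.
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", '\n' matches a newline, the sequence '\\' matches "\", and '\(' matches "(".
^  Matches the start of the input string, unless it is the first character of a bracket expression, in which case it negates the character set. To match a literal ^, use \^.
{  Marks the start of a quantifier expression. To match a literal {, use \{.
|  Indicates a choice between two alternatives. To match a literal |, use \|.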
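The effect of escaping is easiest to see side by side. The following is a short sketch with made-up sample strings; `re.escape` is the standard-library helper for building a literal pattern from arbitrary text.

```python
import re

# Unescaped, "+" is a quantifier: "a+b" means one or more "a" followed by "b".
unescaped = re.findall(r"a+b", "a+b aab")
# Escaped, "\+" matches the plus sign itself.
escaped = re.findall(r"a\+b", "a+b aab")

# Parentheses and the alternation bar behave the same way.
call = re.findall(r"f\(x\)", "f(x) fx")
either = re.findall(r"cat|dog", "cat dog bird")
bar = re.findall(r"\|", "a|b")

# "^" anchors at the start of the string, but negates inside brackets.
at_start = re.findall(r"^ab", "abab")
not_ab = re.findall(r"[^ab]", "abc")

# re.escape builds a pattern that matches the given text literally.
literal = "1+1=2?"
matches_itself = re.fullmatch(re.escape(literal), literal) is not None

print(unescaped, escaped, call, either, bar, at_start, not_ab, matches_itself)
```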

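Quantifiers

Character  Description
*  Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, zo+ matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, do(es)? matches "do", the "does" in "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, o{2} does not match the "o" in "Bob" but does match the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, o{2,} does not match the "o" in "Bob" but matches all the o's in "foooood". o{1,} is equivalent to o+, and o{0,} is equivalent to o*.
{n,m}  m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, o{1,3} matches the first three o's in "fooooood". o{0,1} is equivalent to o?. Note that there must be no space between the comma and the two numbers.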
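The examples in this table can be reproduced as follows. This is a sketch only; it uses a non-capturing group `(?:es)` in the `do(es)?` example because `re.findall` would otherwise return the group contents instead of the whole match.

```python
import re

star_z = re.fullmatch(r"zo*", "z") is not None      # zero "o" is allowed
star_zoo = re.fullmatch(r"zo*", "zoo") is not None
plus_z = re.fullmatch(r"zo+", "z") is not None      # at least one "o" required

optional = re.findall(r"do(?:es)?", "do does doxy")

exact_bob = re.findall(r"o{2}", "Bob")
exact_food = re.findall(r"o{2}", "food")
at_least = re.findall(r"o{2,}", "foooood")
bounded = re.search(r"o{1,3}", "fooooood").group()

# A trailing "?" makes a quantifier non-greedy: it takes as little as possible.
lazy = re.search(r"o+?", "foooood").group()

print(star_z, star_zoo, plus_z, optional, exact_bob, exact_food, at_least, bounded, lazy)
```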

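Implementation and Walkthrough

We start by counting paper pages, that is, by extracting the numbers of pages and figures from the comments field. The first step is to read in the fields we need.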

import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import matplotlib.pyplot as plt
def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
       'report-no', 'categories', 'license', 'abstract', 'versions',
       'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file into a DataFrame.
        path: path to the file
        columns: columns to keep
        count: number of lines to read
    '''
    
    data  = []
    with open(path, 'r') as f: 
        for idx, line in enumerate(f): 
            if idx == count:
                break
                
            d = json.loads(line)
            d = {col : d[col] for col in columns}
            data.append(d)

    data = pd.DataFrame(data)
    return data

data = readArxivFile('../data/arxiv-metadata-oai-snapshot.json', ['id', 'abstract', 'categories', 'comments'])
data.head()
id abstract categories comments
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version
1 0704.0002 We describe a new algorithm, the $(k,\ell)$-... math.CO cs.CG To appear in Graphs and Combinatorics
2 0704.0003 The evolution of Earth-Moon system is descri... physics.gen-ph 23 pages, 3 figures
3 0704.0004 We show that a determinant of Stirling cycle... math.CO 11 pages
4 0704.0005 In this paper we show how to compute the $\L... math.CA math.FA None
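The snapshot file stores one JSON object per line, which is why the reader above calls `json.loads` on each line. The sketch below shows the same idea on an in-memory sample so it can be run without the dataset. The function name `read_json_lines` and the two records are assumptions made for this illustration; it returns a list of dicts that could be passed to `pd.DataFrame`.

```python
import io
import json
from itertools import islice

def read_json_lines(handle, columns, count=None):
    """Parse up to `count` JSON-lines records, keeping only `columns`."""
    rows = []
    for line in islice(handle, count):   # islice(handle, None) reads everything
        record = json.loads(line)
        rows.append({col: record.get(col) for col in columns})
    return rows

# Two hypothetical records standing in for the snapshot file.
sample = io.StringIO(
    '{"id": "0704.0001", "comments": "37 pages, 15 figures", "title": "A"}\n'
    '{"id": "0704.0002", "comments": null, "title": "B"}\n'
)

first_row = read_json_lines(sample, ["id", "comments"], count=1)
sample.seek(0)
all_rows = read_json_lines(sample, ["id", "comments"])
print(first_row, len(all_rows))
```

A JSON `null` becomes Python `None`, which is why the later code wraps each comment in `str(x)` before matching.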

Extracting pages

# Regex match
data['pages'] = data['comments'].apply(lambda x: re.findall("[1-9][0-9]* pages", str(x)))
data.head()
id abstract categories comments pages
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version [37 pages]
1 0704.0002 We describe a new algorithm, the $(k,\ell)$-... math.CO cs.CG To appear in Graphs and Combinatorics []
2 0704.0003 The evolution of Earth-Moon system is descri... physics.gen-ph 23 pages, 3 figures [23 pages]
3 0704.0004 We show that a determinant of Stirling cycle... math.CO 11 pages [11 pages]
4 0704.0005 In this paper we show how to compute the $\L... math.CA math.FA None []
# Keep only the rows where len(pages) > 0
data = data[data["pages"].apply(lambda x: len(x) > 0)]
data.head()
id abstract categories comments pages
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version [37 pages]
2 0704.0003 The evolution of Earth-Moon system is descri... physics.gen-ph 23 pages, 3 figures [23 pages]
3 0704.0004 We show that a determinant of Stirling cycle... math.CO 11 pages [11 pages]
5 0704.0006 We study the two-particle wave function of p... cond-mat.mes-hall 6 pages, 4 figures, accepted by PRA [6 pages]
6 0704.0007 A rather non-standard quantum representation... gr-qc 16 pages, no figures. Typos corrected to match... [16 pages]
# Convert the pages column to a numeric type
data['pages'] = data.loc[:,'pages'].apply(lambda x: float(x[0].replace(" pages", '')))
data.head()
id abstract categories comments pages
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version 37.0
2 0704.0003 The evolution of Earth-Moon system is descri... physics.gen-ph 23 pages, 3 figures 23.0
3 0704.0004 We show that a determinant of Stirling cycle... math.CO 11 pages 11.0
5 0704.0006 We study the two-particle wave function of p... cond-mat.mes-hall 6 pages, 4 figures, accepted by PRA 6.0
6 0704.0007 A rather non-standard quantum representation... gr-qc 16 pages, no figures. Typos corrected to match... 16.0
# Convert the pages column to a numeric type
# data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(" pages", '')))
# data.head()
# Writing it this way can raise a warning such as:
# SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_index,col_indexer] = value instead
# Stack Overflow explanation: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
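One possible alternative, shown here only as a sketch on a made-up frame, is to let a capture group return the digits directly with `Series.str.extract` and to take an explicit `.copy()` of the filtered frame before assigning to it. The frame `df` and the names `pages` and `with_pages` are assumptions for this example.

```python
import pandas as pd

# Hypothetical comments, including a missing value.
df = pd.DataFrame({"comments": ["37 pages, 15 figures", "To appear in X", None, "11 pages"]})

# The capture group returns just the digits; expand=False yields a Series.
pages = df["comments"].str.extract(r"([1-9][0-9]*) pages", expand=False)

# .copy() makes an independent frame, so the assignment below does not
# write into a slice of df.
with_pages = df[pages.notna()].copy()
with_pages["pages"] = pages[pages.notna()].astype(float)

print(with_pages)
```

Rows without a match come back as missing values, so the filtering step and the type conversion both work from the same Series.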

Preliminary statistics on pages

data["pages"].describe().astype(int)
count    1089180
mean          17
std           22
min            1
25%            8
50%           13
75%           22
max        11232
Name: pages, dtype: int32

Page counts by category, taking each paper's first category as its primary category

data["categories"] = data.loc[:, "categories"].apply(lambda x: x.split(' ')[0])
data["categories"] = data.loc[:, "categories"].apply(lambda x: x.split('.')[0])
data.head()
id abstract categories comments pages
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version 37.0
2 0704.0003 The evolution of Earth-Moon system is descri... physics 23 pages, 3 figures 23.0
3 0704.0004 We show that a determinant of Stirling cycle... math 11 pages 11.0
5 0704.0006 We study the two-particle wave function of p... cond-mat 6 pages, 4 figures, accepted by PRA 6.0
6 0704.0007 A rather non-standard quantum representation... gr-qc 16 pages, no figures. Typos corrected to match... 16.0
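The two `split` calls above can be read as a single helper. The function name `primary_category` is introduced only for this sketch; the sample strings are category values from the table shown earlier.

```python
def primary_category(categories):
    """Return the top-level archive of the first listed category."""
    first = categories.split(" ")[0]   # "math.CO cs.CG" -> "math.CO"
    return first.split(".")[0]         # "math.CO" -> "math"

examples = ["hep-ph", "math.CO cs.CG", "physics.gen-ph"]
result = [primary_category(c) for c in examples]
print(result)
```

Categories without a dot, such as hep-ph, pass through unchanged.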
# Average page count per category
fig, axes = plt.subplots(figsize=(12, 6))
data.groupby(["categories"])["pages"].mean().plot(kind="bar")

[Figure: bar chart of the average page count per category]

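Using groupby

Grouping operations are extremely common in everyday life, for example:

  • Group by gender and compute the mean life expectancy of the national population
  • Group by season and standardize the temperature within each season
  • Group by class and select the classes whose mean math score exceeds 80

These examples show that a grouping operation requires three elements: the grouping key, the data source, and the operation together with the result it returns. Conversely, once these three are specified the grouping operation is fully determined, so grouping code follows this general pattern:

df.groupby(grouping key)[data source].operation

For instance, the code for the first example would be:

df.groupby('Gender')['Longevity'].mean()

Turning to a student physical-fitness dataset, the median height by gender can be computed as follows: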

In [3]: df = pd.read_csv('data/learn_pandas.csv')

In [4]: df.groupby('Gender')['Height'].median()
Out[4]: 
Gender
Female    159.6
Male      173.4
Name: Height, dtype: float64
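The three everyday examples listed earlier correspond to three kinds of groupby operation in pandas: aggregation, transformation, and filtration. The sketch below uses a tiny invented table of class scores (the frame `scores` and its values are assumptions, not data from this post).

```python
import pandas as pd

# Hypothetical class scores, invented for illustration.
scores = pd.DataFrame({
    "Class": ["A", "A", "B", "B"],
    "Math": [90, 80, 70, 60],
})

# Aggregation: one value per group.
class_mean = scores.groupby("Class")["Math"].mean()

# Transformation: one value per row, here standardized within each group.
z = scores.groupby("Class")["Math"].transform(lambda s: (s - s.mean()) / s.std())

# Filtration: keep only the groups that satisfy a condition.
strong = scores.groupby("Class").filter(lambda g: g["Math"].mean() > 80)

print(class_mean, z, strong, sep="\n")
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns the original rows of the groups that pass the test.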

Extracting figures

data["figures"] = data.loc[:, "comments"].apply(lambda x: re.findall('[1-9][0-9]* figure', str(x)))
data.head()
id abstract categories comments pages figures
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version 37.0 [15 figure]
2 0704.0003 The evolution of Earth-Moon system is descri... physics 23 pages, 3 figures 23.0 [3 figure]
5 0704.0006 We study the two-particle wave function of p... cond-mat 6 pages, 4 figures, accepted by PRA 6.0 [4 figure]
9 0704.0010 Partial cubes are isometric subgraphs of hyp... math 36 pages, 17 figures 36.0 [17 figure]
13 0704.0014 In this article we discuss a relation betwee... math 18 pages, 1 figure 18.0 [1 figure]
data = data[data.loc[:,"figures"].apply(lambda x: len(x) > 0)]
data.head()
id abstract categories comments pages figures
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version 37.0 [15 figure]
2 0704.0003 The evolution of Earth-Moon system is descri... physics 23 pages, 3 figures 23.0 [3 figure]
5 0704.0006 We study the two-particle wave function of p... cond-mat 6 pages, 4 figures, accepted by PRA 6.0 [4 figure]
9 0704.0010 Partial cubes are isometric subgraphs of hyp... math 36 pages, 17 figures 36.0 [17 figure]
13 0704.0014 In this article we discuss a relation betwee... math 18 pages, 1 figure 18.0 [1 figure]
data["figures"] = data.loc[:, "figures"].apply(lambda x: int(x[0].replace(" figure", "")))
data.head()
id abstract categories comments pages figures
0 0704.0001 A fully differential calculation in perturba... hep-ph 37 pages, 15 figures; published version 37.0 15
2 0704.0003 The evolution of Earth-Moon system is descri... physics 23 pages, 3 figures 23.0 3
5 0704.0006 We study the two-particle wave function of p... cond-mat 6 pages, 4 figures, accepted by PRA 6.0 4
9 0704.0010 Partial cubes are isometric subgraphs of hyp... math 36 pages, 17 figures 36.0 17
13 0704.0014 In this article we discuss a relation betwee... math 18 pages, 1 figure 18.0 1
# Average figure count per category
fig, axes = plt.subplots(figsize=(12, 6))
data.groupby(["categories"])["figures"].mean().plot(kind="bar")

[Figure: bar chart of the average figure count per category]
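The figure count can also be pulled out with a capture group, which avoids the separate string replacement. This is a minimal sketch using comments from the table above; the helper name `count_figures` and the optional plural `s` in the pattern are choices made for this example.

```python
import re

# The capture group holds the number; "figures?" accepts singular and plural.
FIGURES = re.compile(r"([1-9][0-9]*) figures?\b")

def count_figures(comment):
    """Return the first figure count in a comment, or None if there is none."""
    match = FIGURES.search(str(comment))
    return int(match.group(1)) if match else None

counts = [
    count_figures("37 pages, 15 figures; published version"),
    count_figures("18 pages, 1 figure"),
    count_figures("16 pages, no figures."),
    count_figures(None),
]
print(counts)
```

Comments such as "no figures" contain no leading number, so they return None, just as they produce an empty list with the pattern used in this post.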

Extracting code links

print(type(data.comments),type(data["comments"]))
data.comments == data["comments"]
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
0          True
2          True
5          True
9          True
13         True
           ... 
1796880    True
1796885    True
1796890    True
1796894    True
1796902    True
Name: comments, Length: 670321, dtype: bool
# data_with_github_code = data[(data["comments"].str.contains("github") == True)|
#                             (data["abstract"].str.contains("github") == True)
#                             ]
# data_with_github_code
data_with_github_code = data[(data.comments.str.contains("github") == True)|
                            (data.abstract.str.contains("github") == True)
                            ]
data_with_github_code
id abstract categories comments pages figures
87991 0810.2412 The Clifford algebra of a n-dimensional Eucl... math-ph 10 pages, 1 figure 10.0 1
212359 1009.2203 Quantum error correction allows for faulty q... quant-ph 38 pages, 15 figure, 10 tables. The algorithm ... 38.0 15
229459 1012.0091 Conan is a C++ library created for the accur... q-bio 5 pages and 1 figure 5.0 1
253172 1103.5904 Solar tomography has progressed rapidly in r... astro-ph 21 pages, 6 figures, 5 tables 21.0 6
254226 1104.0672 We describe a hybrid Fourier/direct space co... astro-ph 10 pages, 6 figures. Submitted to Astronomy an... 10.0 6
... ... ... ... ... ... ...
1381310 2011.08562 The target identification in brain-computer ... cs 12 pages, 6 figures 12.0 6
1381509 2011.08761 In this paper, we study the problem of imagi... eess 10 pages, 2 figures, to be published in STACOM... 10.0 2
1381606 2011.08858 We derive a simple prescription for includin... astro-ph 14 pages; 6 figures; 3 appendices 14.0 6
1381626 2011.08878 The production of numerical relativity wavef... gr-qc 11 pages, 1 figure, 1 table. Open source softw... 11.0 1
1382418 2011.09670 Rotation detection serves as a fundamental b... cs 12 pages, 6 figures, 8 tables 12.0 6

2265 rows × 6 columns

data_with_github_code.loc[:, "text"] = data_with_github_code.loc[:, "abstract"].fillna("") + data_with_github_code.loc[:, "comments"].fillna("")
data_with_github_code
id abstract categories comments pages figures text
87991 0810.2412 The Clifford algebra of a n-dimensional Eucl... math-ph 10 pages, 1 figure 10.0 1 The Clifford algebra of a n-dimensional Eucl...
212359 1009.2203 Quantum error correction allows for faulty q... quant-ph 38 pages, 15 figure, 10 tables. The algorithm ... 38.0 15 Quantum error correction allows for faulty q...
229459 1012.0091 Conan is a C++ library created for the accur... q-bio 5 pages and 1 figure 5.0 1 Conan is a C++ library created for the accur...
253172 1103.5904 Solar tomography has progressed rapidly in r... astro-ph 21 pages, 6 figures, 5 tables 21.0 6 Solar tomography has progressed rapidly in r...
254226 1104.0672 We describe a hybrid Fourier/direct space co... astro-ph 10 pages, 6 figures. Submitted to Astronomy an... 10.0 6 We describe a hybrid Fourier/direct space co...
... ... ... ... ... ... ... ...
1381310 2011.08562 The target identification in brain-computer ... cs 12 pages, 6 figures 12.0 6 The target identification in brain-computer ...
1381509 2011.08761 In this paper, we study the problem of imagi... eess 10 pages, 2 figures, to be published in STACOM... 10.0 2 In this paper, we study the problem of imagi...
1381606 2011.08858 We derive a simple prescription for includin... astro-ph 14 pages; 6 figures; 3 appendices 14.0 6 We derive a simple prescription for includin...
1381626 2011.08878 The production of numerical relativity wavef... gr-qc 11 pages, 1 figure, 1 table. Open source softw... 11.0 1 The production of numerical relativity wavef...
1382418 2011.09670 Rotation detection serves as a fundamental b... cs 12 pages, 6 figures, 8 tables 12.0 6 Rotation detection serves as a fundamental b...

2265 rows × 7 columns

pattern = r"[a-z]+://github[^\s]*"  # raw string, so \s reaches the regex engine unchanged
data_with_github_code["code_flag"] = data_with_github_code.loc[:, "text"].str.findall(pattern).apply(lambda x: len(x)>0)
data_with_github_code.head()
id abstract categories comments pages figures text code_flag
87991 0810.2412 The Clifford algebra of a n-dimensional Eucl... math-ph 10 pages, 1 figure 10.0 1 The Clifford algebra of a n-dimensional Eucl... False
212359 1009.2203 Quantum error correction allows for faulty q... quant-ph 38 pages, 15 figure, 10 tables. The algorithm ... 38.0 15 Quantum error correction allows for faulty q... True
229459 1012.0091 Conan is a C++ library created for the accur... q-bio 5 pages and 1 figure 5.0 1 Conan is a C++ library created for the accur... True
253172 1103.5904 Solar tomography has progressed rapidly in r... astro-ph 21 pages, 6 figures, 5 tables 21.0 6 Solar tomography has progressed rapidly in r... False
254226 1104.0672 We describe a hybrid Fourier/direct space co... astro-ph 10 pages, 6 figures. Submitted to Astronomy an... 10.0 6 We describe a hybrid Fourier/direct space co... True
data_with_github_code = data_with_github_code[data_with_github_code["code_flag"] == 1]
fig, axes = plt.subplots(figsize=(12,6))
data_with_github_code.groupby(["categories"])["code_flag"].count().plot(kind="bar")

[Figure: bar chart of the number of papers with a GitHub link per category]
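To see why some rows that mention "github" still get code_flag False, and to keep the link itself rather than just a flag, the same pattern can be applied to a few made-up texts. The `papers` list and the helper `first_link` are assumptions for this sketch.

```python
import re
from collections import Counter

pattern = re.compile(r"[a-z]+://github[^\s]*")   # same pattern as above

# Hypothetical (category, text) pairs.
papers = [
    ("cs", "Code is available at https://github.com/user/repo."),
    ("cs", "See our github page for details."),
    ("astro-ph", "Source: https://github.com/org/tool 10 pages"),
]

def first_link(text):
    match = pattern.search(text)
    # Trailing punctuation belongs to the sentence, not to the URL.
    return match.group().rstrip(".,;)") if match else None

links = [(cat, first_link(text)) for cat, text in papers]
per_category = Counter(cat for cat, link in links if link)
print(links, per_category)
```

The second text contains the word "github" but no URL, so it passes the `str.contains("github")` filter and is then dropped by the pattern.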


Author: Terence Cai
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Terence Cai when reposting.