Spark with Python: A Getting-Started Guide to PySpark - Creating an RDD

最编程 2024-07-01 14:08:13

...

Now we use the downloaded file to create an RDD.

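# Path to the downloaded, gzip-compressed data file
data_file = "./kddcup.data_10_percent.gz"

# sc is assumed to be an existing SparkContext (the pyspark shell provides one);
# textFile() returns an RDD with one element per line of the file
raw_data = sc.textFile(data_file)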

Filtering

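Suppose we want to count how many normal interactions there are in the dataset. We can filter the raw_data RDD as follows, then count the lines that remain.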

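from time import time

# Keep only lines containing the label 'normal.'
# (assumption: each record ends with a label, and normal traffic is tagged 'normal.')
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)

# count() is an action, so the filter is actually evaluated here
t0 = time()
normal_count = normal_raw_data.count()
tt = time() - t0

print("There are {} 'normal' interactions".format(normal_count))
print("Count completed in {} seconds".format(round(tt, 3)))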
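
A filter of this kind boils down to a predicate applied to every line. The sketch below shows the idea in plain Python, without a Spark cluster. The sample lines are made up and heavily shortened for illustration, and is_normal is a hypothetical helper name, not part of PySpark. It assumes the label is the last comma-separated field of each record.

```python
# Hypothetical, shortened sample lines; real records have many more fields.
# Assumption: the label is the last comma-separated field.
sample_lines = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,private,S0,0,0,neptune.",
    "0,tcp,smtp,SF,1022,387,normal.",
]


def is_normal(line):
    """Return True when the record's last field is the label 'normal.'."""
    return line.split(",")[-1] == "normal."


# Plain-Python counterpart of filtering an RDD and then counting it
normal_lines = [line for line in sample_lines if is_normal(line)]
normal_count = len(normal_lines)

print("There are {} 'normal' interactions".format(normal_count))
```

Comparing the last field exactly is a little stricter than a substring test, since it cannot match 'normal.' appearing elsewhere in a line. The same function could be passed to raw_data.filter(is_normal) in place of a lambda.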
