欢迎您访问 最编程 本站为您分享编程语言代码,编程技术文章!
您现在的位置是: 首页

Spark精华代码示例之一:轻松实现宽表与窄表之间的转换

最编程 2024-07-27 13:03:58
...

不定期上代码干货

spark列转行
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext, Row, functions as F
from pyspark.sql.functions import array, col, explode, struct, lit

conf = SparkConf().setAppName("test").setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SQLContext(sc)

# df is datasource, by will exclude column
def df_columns_to_line(df, by):
    # Filter dtypes and split into column names and type description
    df_a = df.select([col(c).cast("string") for c in df.columns])
    cols, dtypes = zip(*((c, t) for (c, t) in df_a.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
      struct(lit(c).alias("feature"), col(c).alias("value")) for c in cols
    ])).alias("kvs")
    return df_a.select(by + [kvs]).select(by + ["kvs.feature", "kvs.value"])

df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])
df_row_data = df_columns_to_line(df, ["A"])
df.show()
df_row_data.show()

>>> df.show()
+---+-----+-----+
|  A|col_1|col_2|
+---+-----+-----+
|  1|  0.0|  0.6|
|  1|  0.6|  0.7|
+---+-----+-----+

>>> df_row_data.show()
+---+-------+-----+
|  A|feature|value|
+---+-------+-----+
|  1|  col_1|  0.0|
|  1|  col_2|  0.6|
|  1|  col_1|  0.6|
|  1|  col_2|  0.7|
+---+-------+-----+


注意feature和value是原多列名转换为行数据后,重新定义的最终两列名

spark行转列
df_features = df_row_data.select('feature').distinct().collect()
features = map(lambda r:r.feature, df_features)
df_column_data = df_row_data.groupby("A").pivot('feature', features).agg(F.first('value', ignorenulls=True))
df_column_data.show()



+---+-----+-----+
|  A|col_2|col_1|
+---+-----+-----+
|  1|  0.6|  0.0|
+---+-----+-----+


行转列比较简单,在上文结果基础上直接转换,关键是pivot函数的使用