
Efficient Column Processing in PySpark

I have a DataFrame with a very large number of columns (>30000). I'm filling it with 1s and 0s based on the first column, like this: for column in list_of_column_names: df = df.w…
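The snippet is cut off, but Solution 2 below refers to the existing approach as a withColumn loop. As context, a hedged reconstruction of that pattern (the 'list_column' name and the exact condition are assumptions taken from the answers):

import pyspark.sql.functions as F

# one withColumn call per target column; with >30000 columns this builds
# a very deep query plan, which is why the question asks for a faster way
for column in list_of_column_names:
    df = df.withColumn(column,
        F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0))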

Solution 1:

You might approach it like this:

import pyspark.sql.functions as F

exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]

df = df.select(['list_column'] + exprs)
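A minimal, self-contained run of this on the sample data from the note at the end (the SparkSession setup and DataFrame construction are assumptions consistent with the example output):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

list_of_column_names = ["Foo", "Bar", "Baz"]

# sample DataFrame with a single array column, as in the example output
df = spark.createDataFrame(
    [(["Foo", "Bak"],), (["Bar", "Baz"],), (["Foo"],)],
    ["list_column"])

exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]

df.select(['list_column'] + exprs).show(truncate=False)

Since all the expressions go into a single select, Catalyst plans one projection rather than thousands of chained withColumn steps.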

Solution 2:

withColumn is already distributed, so a faster approach than what you already have would be hard to find. You can try defining a udf function as follows:

from pyspark.sql import functions as f
from pyspark.sql import types as t

def containsUdf(listColumn):
    # map each known column name to 1 if it appears in this row's list, else 0
    row = {}
    for column in list_of_column_names:
        if column in listColumn:
            row.update({column: 1})
        else:
            row.update({column: 0})
    return row

# the udf returns integers, so the struct fields are declared IntegerType
callContainsUdf = f.udf(containsUdf, t.StructType([t.StructField(x, t.IntegerType(), True) for x in list_of_column_names]))

df.withColumn('struct', callContainsUdf(df['list_column']))\
    .select(f.col('list_column'), f.col('struct.*'))\
    .show(truncate=False)

which should give you

+-----------+---+---+---+
|list_column|Foo|Bar|Baz|
+-----------+---+---+---+
|[Foo, Bak] |1  |0  |0  |
|[Bar, Baz] |0  |1  |1  |
|[Foo]      |1  |0  |0  |
+-----------+---+---+---+

Note: list_of_column_names = ["Foo","Bar","Baz"]
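To reproduce the table above end to end, the input DataFrame can be built like this (the SparkSession setup and createDataFrame call are assumptions consistent with the sample output; the snippet in Solution 2 is then run against df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one array column named 'list_column', matching the rows in the output
df = spark.createDataFrame(
    [(["Foo", "Bak"],), (["Bar", "Baz"],), (["Foo"],)],
    ["list_column"])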
