Create A Tuple Out Of Two Columns - Pyspark

My problem is based on a similar question, "PySpark: Add a new column with a tuple created from columns", with the difference that I have a list of values in each column instead of a single value.

Solution 1:

If the size of the arrays varies from row to row, you'll need a UDF:

from pyspark.sql.functions import udf

@udf("array<struct<_1:double,_2:double>>")
def zip_(xs, ys):
    return list(zip(xs, ys))

df.withColumn("v_tuple", zip_("v1", "v2"))
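To see what the UDF body does for a single row, here is a minimal plain-Python sketch that runs without Spark. The helper name `zip_rows` and the `None` guard are assumptions added for illustration; the UDF above has no such guard and would presumably fail on a null array.

```python
# Plain-Python sketch of the per-row logic inside the UDF (no Spark needed).
# zip_rows is a hypothetical helper name, not part of PySpark.

def zip_rows(xs, ys):
    """Pair up the elements of two lists, as the UDF does for each row."""
    # Assumption: return None when either array is missing (null in Spark).
    if xs is None or ys is None:
        return None
    # zip stops at the shorter input, so unmatched trailing values are dropped.
    return list(zip(xs, ys))


v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 5.0, 6.0]

print(zip_rows(v1, v2))             # [(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)]
print(zip_rows([1.0, 2.0], [3.0]))  # [(1.0, 3.0)]
print(zip_rows(None, v2))           # None
```

Each tuple maps to one struct with fields `_1` and `_2` in the declared return type. If the two arrays in a row can differ in length, decide whether silently dropping the extra values is acceptable for your data.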

In Spark 1.6:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType, StructField, StructType

zip_ = udf(
    lambda xs, ys: list(zip(xs, ys)),
    ArrayType(StructType([StructField("_1", DoubleType()), StructField("_2", DoubleType())])))
