Fixing the error when converting a Pandas dataframe into a Spark dataframe ( Spark dataframe error )

This error occurs when you try to convert a Pandas dataframe to a Spark dataframe and the conversion is done incorrectly, typically because the column types are not ones Spark can infer consistently.

In today’s blog post I am going to look at this annoying and confusing Python error, explain why it happens, and walk through a set of possible fixes.

Exploring the error when converting a Pandas dataframe into a Spark dataframe ( Spark dataframe error )

This error occurs when Spark tries to infer a type for a column whose values are mixed — for example, a pandas "object" column containing both strings and numbers. Spark cannot merge the conflicting types it infers, so the conversion fails.

After double-checking, make sure your error message looks like the one below. Do not confuse it with a different error.

                                                                       #
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
                                                                       #

Below are a number of tested solutions that worked for me.

Solution 1 : Correctly use .astype() to change the type of the columns

When the conversion from Pandas to Spark fails with this error, it usually means Spark could not settle on a single type for one of your columns. This is not unusual, since the operation can be quite tricky.

This is where most developers go wrong: before converting the Pandas dataframe, inspect it and check that every column has a type Spark can map cleanly to one of its own types.
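As a quick way to spot trouble before converting, here is a minimal sketch (the column name `price` and its sample values are made up for illustration). A column holding both strings and numbers gets pandas dtype `object`, and those are exactly the columns Spark struggles to infer a single type for:

```python
import pandas as pd

# A column mixing strings and floats is stored with dtype "object";
# Spark cannot infer one consistent type for it and raises the
# "Can not merge type" TypeError on conversion.
df = pd.DataFrame({"price": ["9.99", 12.5, "N/A"]})
print(df.dtypes)

# List the ambiguous "object" columns to fix before calling createDataFrame:
mixed = [c for c in df.columns if df[c].dtype == "object"]
print(mixed)
```

Any column that shows up in `mixed` is a candidate for an explicit cast, which is what Solution 1 below does.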

To solve the problem, cast the offending columns with .astype( ), like this:

                                                                       #
mydataframe[['myColumn', 'secondcolumn']] = mydataframe[['myColumn', 'secondcolumn']].astype(str)
                                                                       #

Specify the target type inside .astype( ), as in the line above.
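Putting that line in context, here is a small runnable sketch (the column names and values are made up for illustration). After the cast, each column holds a single, consistent type that Spark can map to StringType:

```python
import pandas as pd

# Columns with mixed values: ints, strings, and floats side by side.
mydataframe = pd.DataFrame({
    "myColumn": [1, "2", 3.0],
    "secondcolumn": [None, "a", "b"],
})

# Cast both columns to str so Spark sees one consistent type per column
# instead of a mix it cannot merge.
mydataframe[['myColumn', 'secondcolumn']] = mydataframe[['myColumn', 'secondcolumn']].astype(str)
print(mydataframe.dtypes)
```

Note that `.astype(str)` converts every value, including `None`, to its string form, so handle missing values first if you need real nulls on the Spark side.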

Solution 2 : Correctly use .createDataFrame( )

The second solution is to make sure you call .createDataFrame( ) the way it is meant to be used.

First, read the CSV file using pd.read_csv( ):

                                                                       #
mydataset = pd.read_csv("myfile.csv")
                                                                       #

Then call spark.createDataFrame() on the dataset:

                                                                       #
mysparkdataframe = spark.createDataFrame(mydataset)
                                                                       #
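Combining both solutions, here is a sketch of the full workflow. The helper name `clean_for_spark` is hypothetical, and the final two commented-out lines assume `"myfile.csv"` exists and that a SparkSession named `spark` is already running in your environment:

```python
import pandas as pd

def clean_for_spark(pdf: pd.DataFrame) -> pd.DataFrame:
    """Cast every ambiguous 'object' column to str so Spark can infer a
    single StringType per column. (Hypothetical helper for illustration.)"""
    out = pdf.copy()
    obj_cols = out.select_dtypes(include="object").columns
    out[obj_cols] = out[obj_cols].astype(str)
    return out

# Assumed workflow -- requires myfile.csv and a running SparkSession `spark`:
# mydataset = clean_for_spark(pd.read_csv("myfile.csv"))
# mysparkdataframe = spark.createDataFrame(mydataset)
```

Cleaning the pandas dataframe first means spark.createDataFrame() never has to merge conflicting types, which is exactly what the TypeError complains about.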

I hope one of the fixes above solved your problem. Good luck with your Python projects!

Summing-up : 

If this article has been helpful, please consider donating to my Kofi account, you will find a big red button at the top of this page.

Thank you for reading, keep coding, and cheers. If you want to learn more about Python, please check out the Python Documentation : https://docs.python.org/3/