PySpark Error – PipelinedRDD object has no attribute ‘toDF’

PySpark Error – PipelinedRDD object has no attribute ‘toDF’ is an error raised when you call ‘toDF’ on an RDD in PySpark.

In this article, I will explain why this error happens and how to solve it, starting with the main solution and then exploring other approaches that can fix the issue.

Explaining PySpark Error – PipelinedRDD object has no attribute ‘toDF’

The cause of this error is straightforward, and the solution provided below will usually fix it for good.

Usually, the important line of the error looks like the line below.

                                                                       #
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
                                                                       #

Here is an example of the complete error message.

                                                                       #
Traceback (most recent call last):
File "/home/fred-spark/spark-1.5.0-bin-hadoop2.6/test.py", line 22, in <module>
data = MLUtils.loadLibSVMFile(sc, "/home/fred-spark/svm_capture").toDF()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
                                                                       #

Below is the solution that worked for me and will help you successfully get rid of the error.

Solution: use toDF the right way

The root of the problem is toDF itself: it is not a built-in RDD method. PySpark attaches toDF to RDDs only when a SparkSession (or SQLContext) is created, so calling it before that point raises the AttributeError.

First, add the lines of code below to your file.

                                                                       #
from pyspark.sql import SparkSession
from pyspark import SparkContext
                                                                       #

The code above imports SparkSession, which must be instantiated before toDF will work.

Here is an example of how you can use it.

                                                                       #
sparky = SparkContext()
rolly = sparky.parallelize([("z", 1)])
hasattr(rolly, "toDF")   # False: no SparkSession exists yet
spark = SparkSession(sparky)
hasattr(rolly, "toDF")   # True: creating the SparkSession attached toDF to RDDs
rolly.toDF().show()
                                                                       #

The code above runs without error and prints the following output.

+---+---+
| _1| _2|
+---+---+
|  z|  1|
+---+---+

This error was hard to deal with; I spent hours looking for a proper solution. This was my absolute best attempt, cheers.

Summing-up

I hope this article helped you solve the PySpark Error – PipelinedRDD object has no attribute ‘toDF’. If not, I hope the solutions presented here at least pointed you in the right direction.

Keep coding, and see you in another article. If you want to learn more about Python, check out the Python documentation: https://docs.python.org/3/