Fixing corrupt record error when reading a JSON file into Spark

The corrupt record error occurs when you use PySpark to read a JSON file into a Spark DataFrame and the layout or content of the file does not match what the reader expects.

In today’s article I am going to explain why this confusing error takes place and present a set of possible solutions.

Exploring the corrupt record error when reading a JSON file into Spark

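With the default settings, Spark does not raise an exception when it cannot parse a record. Instead, it stores the raw text in a string column named _corrupt_record. By default the JSON reader expects one complete JSON object per line, so a file that spreads a single object over several lines, uses an unexpected encoding, or is not written in standard JSON ends up in that column.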

Before going further, double-check that your error message looks like the one below, so that you do not confuse this error with a different one.

                                                                       #
_corrupt_record ...
                                                                       #

Below are a number of solutions that I have tried and that worked for me.

Solution 1 : use spark.read.option("multiline", True)

Using PySpark to read a file into a DataFrame can be tricky: one small mistake and you end up with the error we are trying to solve.

The first solution is to set the "multiline" option to True, as in the example below. This tells Spark that a single JSON record may span several lines.

                                                                       #
mydataframe = spark.read.option("multiline",True).json('PathToYourFile')
                                                                       #
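To decide whether the multiline option is likely to help, you can check the file layout before handing it to Spark. The sketch below uses only the Python standard library; the helper names is_json_lines and needs_multiline are hypothetical, not part of PySpark, and the check assumes the file is small enough to read on the driver.

```python
import json


def is_json_lines(path, encoding="utf-8"):
    """Return True if every non-empty line is a complete JSON value."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if not line.strip():
                continue  # blank lines are skipped for this check
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True


def needs_multiline(path, encoding="utf-8"):
    """Return True if the file is valid JSON as a whole but not line by line."""
    if is_json_lines(path, encoding):
        return False  # the default line-by-line reader should cope
    with open(path, encoding=encoding) as handle:
        try:
            json.load(handle)
        except json.JSONDecodeError:
            return False  # not valid JSON at all, so multiline will not help
    return True
```

If needs_multiline returns True for your file, the multiline read shown above is probably the right fix. If both helpers return False, the content itself is not valid JSON and one of the other solutions may apply.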

Sometimes this error is caused by the file encoding. In that case, use the "encoding" option to tell Spark which encoding the file actually uses, for example "cp1252":

                                                                       #
mydataframe = spark.read.option("encoding", "cp1252").json('PathToYourFile')
                                                                       #
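If you are not sure which encoding the file uses, a quick check is to try decoding its bytes with a few candidates. This is a minimal sketch using only the standard library; the function name first_working_encoding and the candidate list are assumptions chosen for illustration.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file, or None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
        return name
    return None
```

The returned name can be passed as the value of the "encoding" option. Keep in mind that a successful decode does not prove the encoding is correct: cp1252 accepts almost any byte sequence, which is why the stricter utf-8 is tried first.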

I hope that after trying the options I have just proposed the error is finally gone for good.

Solution 2 : use the standard JSON notation instead of Python notation

Sometimes this error takes place because the JSON text, in a string column for example, is written in Python notation (None, True, False) instead of standard JSON notation (null, true, false).

To solve the error, convert the text to standard JSON notation.

So, run the code below first:

                                                                       #
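from pyspark.sql import functions as F
# Replace the Python literals None, False and True with their JSON equivalents
mydataframe = mydataframe.withColumn("json_notation", F.regexp_replace(F.regexp_replace(F.regexp_replace("_corrupt_record",
"None", "null"), "False", "false"), "True", "true"))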
                                                                       #

Do this before you parse the data as JSON with a call like this one:

                                                                       #
mydataframe = sqlc.read.json('PathToYourFile')
                                                                       #
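Plain text replacement also changes the words None, True and False when they appear inside string values, and it leaves single quotes untouched. As an alternative sketch, not the method used above, the standard library can parse the Python notation and write it back out as JSON. The helper name python_literal_to_json is hypothetical, and the sketch assumes each record is a valid Python literal.

```python
import ast
import json


def python_literal_to_json(text):
    """Convert a Python-literal string such as "{'a': None}" to a JSON string.

    Returns None when the text cannot be converted.
    """
    try:
        # literal_eval only evaluates literals, it never runs arbitrary code
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return None
    try:
        return json.dumps(value)
    except (TypeError, ValueError):
        return None  # e.g. sets and bytes have no JSON equivalent
```

For example, python_literal_to_json("{'a': None, 'note': 'None of these'}") turns the None value into null while leaving the text 'None of these' intact. In Spark, a function like this could be wrapped in a UDF and applied to the _corrupt_record column, at the cost of being slower than the built-in regexp_replace.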

I hope the methods I offered above have been helpful. Thank you for reading this long post.

Summing-up : 

This is the end of our article. I hope the solutions I presented worked for you. Learning Python is a fun journey, so do not let the errors discourage you.

Thank you for reading, keep coding and cheers. If you want to learn more about Python, please check out the Python Documentation : https://docs.python.org/3/