Solving Python & Pandas Error - 'utf-8' codec can't decode byte in position : invalid start byte

‘utf-8’ codec can’t decode byte in position : invalid start byte is a Python error that occurs when the file you are reading contains a byte that is not valid UTF-8, usually because the file was saved in a different encoding.

In this article I am going to explain why the error pops up in the first place and then solve it. I will also introduce some solutions that have worked for other developers, so you can see whether they solve the error in your situation.

Exploring the Error : ‘utf-8’ codec can’t decode byte in position : invalid start byte

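Python raises this error when it tries to decode data as UTF-8 and meets a byte that is not valid in that encoding. pandas reads files as UTF-8 by default, so read_csv is a common place to see it.

Code as simple as this can cause the error.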

                                                                       #
import pandas as pd
df1 = pd.read_csv("file.csv", sep=";")  # file.csv could also be a link to the csv like https://website url/Dataset/file.csv
df1.head()
                                                                       #

Make sure your error message matches the one below, so you know the solutions in this article apply to your case.

                                                                       #
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
                                                                       #
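To see where the message comes from, here is a minimal sketch that reproduces it without any file. The sample bytes are made up for illustration; byte 0x92 is a curly apostrophe in Windows code pages, but it cannot start a character in UTF-8.

```python
# Hypothetical sample bytes, not taken from any real file.
# 0x92 sits at index 2, right after "It".
raw = b"It\x92s a test"

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # err carries the codec name, the offending byte and its position
    message = str(err)

print(message)
```

The position in the message is the offset of the offending byte, which is what you need when you go looking for it in the file.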

To solve the problem above, I have two solutions that have worked for me. Both are explained in detail below.

Solution 1 : Remove the faulty character or characters

Since the error occurs when the file contains a byte that is not valid UTF-8, the simplest solution is to remove the offending character from your file. A single character saved in Windows-1250 is enough to cause the error, even when the rest of the file is thousands of valid UTF-8 characters.
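Finding the faulty character by eye can be tedious, so here is a rough sketch of how you might locate it. The helper name find_invalid_utf8 and the sample bytes are my own inventions for illustration; for a real file you would read the bytes with open("file.csv", "rb").read() instead.

```python
def find_invalid_utf8(data: bytes):
    """Return (position, byte value) pairs for bytes that stop UTF-8 decoding."""
    bad = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            pos = start + err.start  # err.start is relative to the slice
            bad.append((pos, data[pos]))
            start = pos + 1  # skip the offending byte and keep scanning
    return bad


# Hypothetical CSV content with one stray 0x92 byte.
sample = b"name;note\nAnna;It\x92s fine\n"

for pos, value in find_invalid_utf8(sample):
    print(f"invalid byte {value:#04x} at position {pos}")

# One way to remove the faulty bytes: decode and silently drop what is invalid.
cleaned = sample.decode("utf-8", errors="ignore")
print(cleaned)
```

Keep in mind that errors="ignore" throws the character away, so the apostrophe in this sample is lost. If you would rather keep the character, the next solution may suit you better.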

If this solution does not work, try the solution below.

Solution 2 : Use ‘ISO-8859-1’ instead

The second solution is to stop using ‘utf-8’ for decoding and start using ‘ISO-8859-1’ instead.

The code below illustrates how you can do that:

                                                                       #
with open(fii, 'rb') as f:  # fii is the path to your file; 'rb' reads the raw bytes
    text = f.read().decode('ISO-8859-1')
                                                                       #
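The same idea applies to the pandas call that triggered the error, because read_csv accepts an encoding argument. The sketch below writes a small made-up CSV to a temporary folder so it can run on its own; the file content and column names are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Write a small, hypothetical CSV containing byte 0x92 so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "file.csv")
with open(path, "wb") as f:
    f.write(b"name;note\nAnna;It\x92s fine\n")

# With the default encoding (utf-8) this file raises UnicodeDecodeError.
# Passing encoding="ISO-8859-1" maps every byte to a character, so decoding cannot fail.
df1 = pd.read_csv(path, sep=";", encoding="ISO-8859-1")
print(df1.head())
```

ISO-8859-1 never raises because every byte is valid in it, but the text is only correct if the file was really saved that way. Byte 0x92, for instance, becomes an invisible control character in ISO-8859-1, while encoding="cp1252" turns it into a curly apostrophe. If the result looks odd, trying cp1252 is a reasonable next step.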

To learn more about Python’s codec registry (encoders and decoders) and base classes, check out the link : https://docs.python.org/3/library/codecs.html

Summing-up

I hope this article helped you solve the error. If not, I hope the solutions presented here at least pointed you in the right direction.

Keep coding guys and cheers, see you in another post. If you want to learn more about Python, please check out the Python Documentation : https://docs.python.org/3/
