Fixing Python and Pandas MemoryError when merging two Pandas data frames

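A MemoryError when merging two Pandas data frames occurs when the merge needs more memory than your machine has available. This typically happens when one or both Dataframes are big and the whole merge is done in a single step, instead of reading the data in chunks with the chunksize option of pd.read_csv.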

Today I will explain why this error takes place and how to solve it. I will also add other solutions that could solve the error.

Exploring the Error : Python and Pandas MemoryError when merging two Pandas data frames

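pd.merge builds the complete result in memory, next to the two Dataframes you are merging. When the Dataframes are big, or when the join column contains repeated values so that each row matches many rows on the other side, the result can be much larger than the inputs and Python can no longer allocate the memory it needs.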
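To see why a merge can need far more memory than its inputs, here is a minimal sketch with made-up toy data (the column names key, left_value and right_value are only for illustration). It shows how repeated key values multiply the number of rows in the result.

```python
import pandas as pd

# Hypothetical toy data: the key value "a" is repeated 1,000 times on each side
left = pd.DataFrame({"key": ["a"] * 1000, "left_value": range(1000)})
right = pd.DataFrame({"key": ["a"] * 1000, "right_value": range(1000)})

merged = pd.merge(left, right, on="key")

# Every left row is paired with every right row that shares its key,
# so two inputs of 1,000 rows each produce 1,000 * 1,000 rows
print(len(left), len(right), len(merged))
```

With real data the effect is usually less extreme, but the same multiplication happens for every key value that appears more than once on both sides.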

Please double check the message so you do not confuse this error with a different one. The error message should look like the one below.

                                                                       #
MemoryError: Unable to allocate ... etc.
                                                                       #

Below are the solutions that I have tried and that worked for me.

Solution 1 : Merge the two Pandas Dataframes in chunks by using the chunksize option

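The MemoryError happens when the whole merge is done in a single step, usually when one or both of the Dataframes are big. Note that chunksize is an option of pd.read_csv, not of pd.merge. The idea is to read one of the files in chunks, merge each chunk separately, and append each partial result to a file on disk.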
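If you are not sure how big your Dataframes really are in memory, a quick check like the sketch below can help. The helper name size_in_mb and the example data are my own assumptions for illustration. DataFrame.memory_usage is a regular Pandas method.

```python
import pandas as pd

def size_in_mb(df):
    """Return the approximate in-memory size of a dataframe in megabytes."""
    # deep=True also counts the Python strings stored in object columns
    return df.memory_usage(deep=True).sum() / 1024 ** 2

# Hypothetical stand-in for a dataframe loaded from a CSV file
example = pd.DataFrame({"Colname1": range(100_000), "text": ["some text"] * 100_000})
print(f"{size_in_mb(example):.1f} MB")
```

A dataframe often takes more space in memory than its CSV file takes on disk, especially when it has many text columns.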

In this example we will be working with two dataframes, dataframe1 and dataframe2. Let us start our code with this:

                                                                       #
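import pandas as pd

dataframe1 = pd.read_csv("data1.csv")
# dataframe2 is loaded here only so that its column names are available below
dataframe2 = pd.read_csv("data2.csv")
dataframe2key = dataframe2.Colname2  # the join column of the second file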
                                                                       #

Now that we have loaded the two CSV files data1.csv and data2.csv, let us start the merging process:

                                                                       #
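# Write only the header row of the result file; the merged rows are appended later
df_result = pd.DataFrame(columns=(dataframe1.columns.append(dataframe2.columns)).unique())
df_result.to_csv("dataframe3.csv", index=False)
del dataframe2  # free the memory held by the fully loaded second file

def preprocess(x):
    # Merge one chunk of the second file with dataframe1 and append the rows to the result file
    merged = pd.merge(dataframe1, x, left_on="Colname1", right_on="Colname2")
    merged.to_csv("dataframe3.csv", mode="a", header=False, index=False)

# Read the second file again, this time 2000 rows at a time
reader = pd.read_csv("data2.csv", chunksize=2000)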
                                                                       #

Finally, our code should end with a loop that calls preprocess on every chunk from the reader:

                                                                       #
for r in reader:
    preprocess(r)
                                                                       #
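The steps above can also be collected into one function. The version below is only a sketch under my own assumptions: the function name merge_in_chunks, the temporary files and the toy columns name and score are made up so that the example can be run from start to finish. It writes the header together with the first chunk instead of creating an empty header file first.

```python
import os
import tempfile

import pandas as pd

def merge_in_chunks(left, right_path, out_path, left_on, right_on, chunksize=2000):
    """Merge `left` with the CSV at `right_path` chunk by chunk, writing to `out_path`."""
    first_chunk = True
    for chunk in pd.read_csv(right_path, chunksize=chunksize):
        merged = pd.merge(left, chunk, left_on=left_on, right_on=right_on)
        # Write the header only with the first chunk, then append the rest
        merged.to_csv(out_path, mode="w" if first_chunk else "a",
                      header=first_chunk, index=False)
        first_chunk = False

# Small hypothetical files so the sketch can be run end to end
workdir = tempfile.mkdtemp()
right_path = os.path.join(workdir, "data2.csv")
out_path = os.path.join(workdir, "dataframe3.csv")

left = pd.DataFrame({"Colname1": [1, 2, 3], "name": ["x", "y", "z"]})
pd.DataFrame({"Colname2": [1, 2, 2, 3, 4], "score": [10, 20, 21, 30, 40]}).to_csv(
    right_path, index=False)

merge_in_chunks(left, right_path, out_path, "Colname1", "Colname2", chunksize=2)
result = pd.read_csv(out_path)
print(result)
```

Only the left dataframe and one chunk of the right file are held in memory at any time, so a smaller chunksize lowers the peak memory use at the cost of more write operations.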

This process keeps only dataframe1 and one chunk of the second file in memory at a time, which should be enough to get rid of the MemoryError in most cases. If the error still occurs, try a smaller chunksize. I hope your error is gone by now.

Solution 2 : Replace the NaN values with zeros by using .fillna(0)

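For some developers the merge slows the machine down or even crashes it. A common cause is joining on columns that contain NaN values. Pandas matches NaN keys with each other, so if both join columns contain many NaN values, every NaN row on one side is paired with every NaN row on the other side.

The solution is to get rid of the NaN values in the join column (the right column, for example) before merging. The line below replaces them with zeros by using .fillna(0). Keep in mind that the filled rows will then match any rows whose key is 0 in the other dataframe, so this only helps if 0 is not already a common key value there.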

                                                                       #
dataframe1['rightcolumn'] = dataframe1['rightcolumn'].fillna(0)
                                                                       #
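To make the effect of NaN keys visible, here is a small sketch with made-up data (the column names key, a and b are assumptions for illustration). It compares a merge with NaN keys on both sides, a merge after filling only one side with .fillna(0), and, as an alternative that is not part of the fix above, a merge after dropping the rows with a missing key by using .dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real key and three missing keys on each side
left = pd.DataFrame({"key": [1.0, np.nan, np.nan, np.nan], "a": range(4)})
right = pd.DataFrame({"key": [1.0, np.nan, np.nan, np.nan], "b": range(4)})

# Pandas matches NaN keys with each other: 1 real match plus 3 * 3 NaN pairs
with_nan = pd.merge(left, right, on="key")

# Filling only one side: the filled rows no longer match the NaN rows
# (this assumes 0 is not a real key value in the other dataframe)
left_filled = left.assign(key=left["key"].fillna(0))
filled_one_side = pd.merge(left_filled, right, on="key")

# Alternative: drop the rows whose key is missing before merging
without_nan = pd.merge(left.dropna(subset=["key"]),
                       right.dropna(subset=["key"]), on="key")

print(len(with_nan), len(filled_one_side), len(without_nan))
```

Which option is right depends on your data: filling keeps the rows in the dataframe, while dropping removes them completely.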

I hope this trick did the job for you. Thanks for reading this blog post.

Summing-up : 

The first solution solved the MemoryError when merging two Pandas data frames for me and for most other developers who had this issue. I hope you found a solution in this article. Keep creating and keep coding, cheers.

If you want to learn more about Python, please check out the Python Documentation : https://docs.python.org/3/