Fixing ImportError: No module named numpy on Spark workers

ImportError: No module named numpy on Spark workers is an error that occurs when an application deployed in cluster mode tries to import numpy on a worker node where the module is not installed.

I will explain why this error happens and how to fix it, along with a few alternative solutions that may help depending on your setup.

Exploring the ImportError: No module named numpy on Spark workers

This error occurs when deploying an application in cluster mode: the driver machine may have numpy installed, but the Python code that Spark ships to the executors needs numpy on every worker node as well.

Please do not mix this up with other errors. Check that your error message looks like the message below.

                                                                       #
ImportError: No module named numpy
                                                                       #
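A quick way to confirm the diagnosis is to run the import by hand on a worker node, using the same interpreter your executors use (the plain python command here is an assumption; substitute the interpreter your cluster actually runs):

                                                                       #
# Run this on a worker node; if it fails with the same ImportError,
# numpy is missing from that worker's Python environment.
python -c "import numpy"
                                                                       #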

Below we will tackle the error using multiple possible solutions; pick the one that fits your needs.

Solution 1 : Use Anaconda to make the process easier

You should use Anaconda to ship your Python dependencies when deploying the application in cluster mode, because it saves you from installing numpy and other modules on each machine separately.

To be able to use Anaconda this way, zip your Anaconda installation and upload the archive to the cluster.
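Here is a minimal sketch of those two steps, assuming Anaconda lives in /opt/anaconda and that hdfs://host/path/to/ is where you want to store the archive (both paths are assumptions; adjust them to your cluster):

                                                                       #
# zip the whole Anaconda installation (run from its parent directory;
# /opt/anaconda is an assumed install location)
cd /opt
zip -r anaconda.zip anaconda/
# upload the archive to HDFS so YARN can distribute it to the workers
hdfs dfs -put anaconda.zip hdfs://host/path/to/anaconda.zip
                                                                       #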

Create a script and call it something like myspark, then paste in the following lines of code.

                                                                       #
spark-submit \
 --master yarn \
 --deploy-mode cluster \
                                                                       #

In cluster mode, the Spark driver runs inside an application master process managed by YARN.

Then add the lines below to the script.

                                                                       #
 --archives hdfs://host/path/to/anaconda.zip#python-env \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
 app_main.py
                                                                       #

An application's dependencies are typically distributed to each node via HDFS.

--archives hdfs://host/path/to/anaconda.zip#python-env copies anaconda.zip from the HDFS path to each one of the workers and unpacks it into a directory named python-env (that is what the #python-env suffix is for), which is why PYSPARK_PYTHON points at python-env/anaconda/bin/python.
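For reference, here is the whole myspark script assembled in one place. I have also added spark.executorEnv.PYSPARK_PYTHON, a standard Spark setting that points the executors (not just the driver) at the shipped interpreter; without something like it, the workers may keep using the system Python and the ImportError can persist:

                                                                       #
spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --archives hdfs://host/path/to/anaconda.zip#python-env \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
 --conf spark.executorEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
 app_main.py
                                                                       #

Make the script executable with chmod +x myspark and run it with ./myspark.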

After these steps, the problem should be solved. If that is not the case, please follow the method below.

Solution 2 : Use spark-submit to launch the application on the cluster

Frameworks like YARN ensure that each application is executed in a self-contained environment.

Our goal is to make the dependencies available on each machine; to distribute the Python dependencies we pass them to spark-submit.

spark-submit is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface.

                                                                       #
spark-submit --master yarn --deploy-mode cluster --py-files my_dependency.zip my_script.py
                                                                       #
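Here, my_dependency.zip stands for an archive of your own Python modules; a minimal sketch of building it (the my_package directory is a hypothetical name):

                                                                       #
# bundle a pure-Python package so --py-files can ship it to the workers
# (my_package/ is a placeholder for your own module directory)
zip -r my_dependency.zip my_package/
                                                                       #

Note that --py-files works best for pure-Python code; numpy itself contains compiled extensions, so for numpy specifically the Anaconda approach from Solution 1 (or installing numpy on every node) is usually the more reliable fix.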

I hope these solutions have been helpful and that you have already solved the error. Thank you for reading this entire blog post.

Summing-up : 

Guys, this has been my best attempt at helping you understand and solve this issue. I hope you found a solution which suits your needs.

Thank you for reading, keep coding and cheers. If you want to learn more about Python, please check out the Python Documentation : https://docs.python.org/3/