
How to test AWS Glue jobs locally / run AWS Glue jobs locally – Part 1



AWS Glue – Local Testing using Apache Spark 2.4.3


Recently, AWS released its Glue libraries on GitHub: https://github.com/awslabs/aws-glue-libs


You can clone either Glue 0.9 or Glue 1.0 from the corresponding GitHub branch.


glue-1.0: 
git clone -b glue-1.0 https://github.com/awslabs/aws-glue-libs.git

glue-0.9: 
git clone https://github.com/awslabs/aws-glue-libs.git

Prerequisites:

Maven 3.6.0 or higher

Spark 2.2.x or higher



Step 1: Install Maven and Apache Spark, then configure the environment variables below to match your installation paths


>> vi ~/.bashrc

export M2_HOME=/home/tusharsarde/soft/maven3
export PATH=${M2_HOME}/bin:${PATH}
export PYSPARK_PYTHON=python3
export SPARK_HOME=/tmp/tush/aws-glue-pfa/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/
export PATH=${SPARK_HOME}/bin:${PATH}

>> source ~/.bashrc


Step 2: Clone the git repo and run the glue-setup.sh script. This step creates the jarsv1/ folder and downloads the required jars.


>> cd aws-glue-libs

>> chmod +x bin/*

>> ./bin/glue-setup.sh


Note – Very important step: there is a netty-all-* jar incompatibility.
Spark 2.4.3 ships with netty-all 4.1.x, whereas the Glue setup downloads netty-all-4.0.23.Final.jar, so remove netty-all-4.0.23.Final.jar from jarsv1/ before starting the shell.


Step 3: Test that the Glue PySpark shell runs fine

>> ./bin/gluepyspark
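Once the shell is up, a quick way to confirm the Glue libraries are importable is to build a tiny DynamicFrame. A minimal sketch, assuming the shell predefines sc and spark the way the regular PySpark shell does:

# run inside the gluepyspark shell; sc and spark are assumed to be predefined
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(sc)

# wrap a small local DataFrame in a DynamicFrame to confirm the Glue classes load
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
dyf = DynamicFrame.fromDF(df, glueContext, "smoke_test")
dyf.printSchema()
print(dyf.count())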


OR

Step 3: Submit your AWS Glue script using spark-submit (via the gluesparksubmit wrapper)

>> ./bin/gluesparksubmit --master local glue-spark-pycharm-example.py
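The contents of glue-spark-pycharm-example.py are not shown here; as a rough sketch, a script you could submit this way might look like the following (the names and data below are hypothetical, not the actual example's contents):

# hypothetical minimal Glue script for local testing (sketch only)
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# build a small DataFrame locally instead of reading from the Glue Data Catalog
df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
dyf = DynamicFrame.fromDF(df, glueContext, "people")
dyf.printSchema()
print("row count:", dyf.count())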


Step 4: Comment here if you face any issues :)

In Part 2 we’ll see how to run Glue-1.0 using PyCharm Community Edition.

