Install & run AWS Glue 1.0 and PySpark on Ubuntu 20.04

Background

It’s much faster to develop and debug AWS Glue / PySpark scripts locally than to round-trip every change through AWS.

The Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library instructions describe the installation but are not complete: several additional dependencies need to be in place to make this work.

Also note that the PySpark version used by Glue 1.0 (2.4.3) does not support:

  • Python 3.8, which Ubuntu 20.04 ships with. Some of the PySpark code therefore needs to be patched, as per Stack Overflow and Gist (a quick version check is shown below).
  • Java 11, which Ubuntu 20.04 ships with. OpenJDK 8 headless is therefore installed and made the default runtime.
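
A quick check on a stock Ubuntu 20.04 machine confirms which versions are in play (output varies with patch level, and java is only present once a JDK is installed):

python3 --version   # typically Python 3.8.x on Ubuntu 20.04
java -version       # typically openjdk 11.x if a JDK is already installed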

Install Ubuntu package dependencies

First install the Ubuntu package dependencies:

sudo apt install zip
# Ubuntu 20.04 dropped the Python 2 packages, so use the Python 3 variant of pytest
sudo apt install python3-pytest
# Maven v3.6.3 is currently distributed
sudo apt install maven
# Ubuntu 20.04 comes with openjdk 11 per default, which PySpark is not compatible with
sudo apt install openjdk-8-jdk-headless
sudo update-alternatives --config java
# Choose the option /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
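
For a scripted (non-interactive) setup, update-alternatives can instead be pointed at Java 8 directly; either way, it’s worth verifying the switch:

# Non-interactive equivalent of the --config step above
sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
# Should now report a 1.8.0_x runtime
java -version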

Consider updating python and pip alternatives:

sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1
sudo update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
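
Verify that the alternatives resolve as expected:

python --version   # should report Python 3.8.x
pip --version      # should report a pip bound to Python 3.8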

Install AWS Glue Python library

cd
mkdir -p $HOME/app

# Just get the zip file from GitHub, no need to clone the repo (get glue-1.0, which supports Python 3)
curl -LO https://github.com/awslabs/aws-glue-libs/archive/glue-1.0.zip
unzip glue-1.0.zip -d $HOME/app
mv $HOME/app/aws-glue-libs-glue-1.0 $HOME/app/aws-glue-libs
rm glue-1.0.zip
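
The library ships the shell wrappers used later on; listing them is a quick sanity check that the unzip and rename worked:

ls $HOME/app/aws-glue-libs/bin
# Expect wrappers such as gluepyspark, gluepytest and gluesparksubmit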

Install Glue (1.0) artifacts

cd
mkdir -p $HOME/app

curl -LO https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
tar xvpfz spark-2.4.3-bin-hadoop2.8.tgz -C $HOME/app
rm spark-2.4.3-bin-hadoop2.8.tgz
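
Note that the tarball extracts into a directory with a doubled name, spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8. The paths below depend on it, so confirm it is there:

ls -d $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8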

Now, as per this Gist, the file $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py needs to be edited for PySpark to work with Python 3.8. If this is not done, the PySpark shell will fail to start.

First make a copy of the file:

cp -p $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py \
      $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py.original

Then edit the file $HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/cloudpickle.py and make the necessary changes as per the above-mentioned Gist.
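
For orientation (the Gist remains the authoritative reference, and the sketch below is not its patch verbatim): Python 3.8 added a posonlyargcount field to code objects, so types.CodeType(...) calls written for Python 3.7, such as the one in cloudpickle’s _make_cell_set_template_code, no longer line up with the constructor. The following self-contained illustration, written against Python 3.8 as shipped with Ubuntu 20.04, shows the version branch the patch has to introduce:

import sys
import types

def clone_code(co):
    """Rebuild a code object field by field; illustrates the Python 3.8
    constructor change that breaks the stock cloudpickle.py in Spark 2.4.3."""
    args = [co.co_argcount]
    if sys.version_info >= (3, 8):
        # Python 3.8 inserted co_posonlyargcount as the second argument,
        # which is why pre-3.8 calls to types.CodeType fail.
        args.append(co.co_posonlyargcount)
    args += [
        co.co_kwonlyargcount, co.co_nlocals, co.co_stacksize, co.co_flags,
        co.co_code, co.co_consts, co.co_names, co.co_varnames,
        co.co_filename, co.co_name, co.co_firstlineno, co.co_lnotab,
        co.co_freevars, co.co_cellvars,
    ]
    return types.CodeType(*args)

f = lambda x: x + 1
assert clone_code(f.__code__) == f.__code__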

Create glue source file

mkdir -p $HOME/bin

cat <<EOF >$HOME/bin/glue
# Spark / Glue
export SPARK_HOME=\$HOME/app/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
export PATH=\$SPARK_HOME/bin:\$PATH
export PATH=\$HOME/app/aws-glue-libs/bin:\$PATH
EOF
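
Optionally, append it to ~/.bashrc so that new shells pick the environment up automatically:

echo '. $HOME/bin/glue' >> $HOME/.bashrc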

Source environment and test

The ./bin/gluepyspark command below will download a considerable number of artifacts using Maven on its first run.

. $HOME/bin/glue
cd $HOME/app/aws-glue-libs
# Start Glue Shell
./bin/gluepyspark

PySpark for IDE lookup

Optionally, install PySpark (the same version as that used with Glue) in a virtualenv so that the IDE can resolve its imports.

# For VSCode, run this from the project/git folder
python3 -m venv .
# Activate the virtualenv before installing, otherwise pip installs system-wide
. bin/activate
pip install pyspark==2.4.3

Test AWS Glue set-up & PySpark

Before starting the Glue PySpark shell:

  • Make sure relevant AWS credentials are available via environment variables or an .aws/credentials profile (with AWS_PROFILE set accordingly)
  • Set a suitable AWS region

Start the Glue PySpark shell:

export AWS_REGION=eu-west-1

cd $HOME/app/aws-glue-libs
./bin/gluepyspark

Run some test code - if this doesn’t raise an error then the setup is ready to go:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Wrap the SparkContext in a GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())
# Initialise and commit a no-op job to exercise the Glue job lifecycle
job = Job(glueContext)
job.init('test-job1')
job.commit()
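
Once the interactive test passes, a standalone script can be run in the same environment via gluesparksubmit, the wrapper alongside gluepyspark in aws-glue-libs that hands the Glue libraries to spark-submit. The script path below is a hypothetical example:

cd $HOME/app/aws-glue-libs
# my_job.py is a placeholder for your own Glue script; pass --JOB_NAME
# if the script resolves it via getResolvedOptions
./bin/gluesparksubmit $HOME/projects/my_job.py --JOB_NAME test-job1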