Docker Images
Ocean Spark maintains a fleet of popular Docker images for Apache Spark.
What's a Docker image for Spark?
When Spark runs on Kubernetes, the driver and executors are Docker containers that execute a Docker image specifically built to run Spark.
What’s in these Docker Images?
In addition to a version of Spark itself, the spark:platform images include connectors to popular object stores (S3, GCS, ADLS), Snowflake, Delta Lake, Python support with pip and conda, Jupyter notebook support, Hadoop, the AWS Glue Catalog, and more.
Images to start with
| Full Image name | Spark Version | Scala Version | Python Version | Hadoop Version |
|---|---|---|---|---|
| gcr.io/ocean-spark/spark:platform-3.5-latest | 3.5.3 | 2.12 | 3.10 | 3.3.6 |
| gcr.io/ocean-spark/spark:platform-3.4-latest | 3.4.4 | 2.12 | 3.10 | 3.3.6 |
How to use those images for your apps and jobs?
When submitting Spark apps on Ocean for Apache Spark, you can:
- Omit the image field. In this case, spark:platform will be used by default, matching the Spark version specified in the sparkVersion field. If both the image and sparkVersion fields are specified, the Spark version of the image takes precedence.
- Specify the image in your configuration with the image field (using configOverrides directly in your API call, or using a configuration template).
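As a sketch, a submission that pins the image explicitly might carry a payload like the one below. The field names image, sparkVersion, and configOverrides come from the text above; the job identifier, the mainApplicationFile setting, and the overall payload shape are illustrative assumptions, not the exact API schema:

```python
# Illustrative Ocean Spark submission payload pinning a specific
# spark:platform image. "image" and "configOverrides" are field names
# from the docs; the rest of the shape is a hypothetical sketch.
payload = {
    "jobId": "daily-etl",  # hypothetical job identifier
    "configOverrides": {
        "image": "gcr.io/ocean-spark/spark:platform-3.5-latest",
        # "sparkVersion" could be set instead; if both were given, the
        # Spark version baked into the image would take precedence.
        "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",
    },
}

# The tag encodes the image flavor (here: Spark 3.5, latest build).
print(payload["configOverrides"]["image"])
```

Omitting the image key entirely would instead fall back to the default spark:platform image for the requested sparkVersion.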
Need another image?
To match different dependency and version requirements, you can find more images in Spot's Docker registry.
All of these dependencies can have different versions. A combination of dependency versions is called a flavor of spark:platform on this page. The image tag indicates the flavor of the image and can be adjusted to fit your needs. Here are two examples of image tags:
gcr.io/ocean-spark/spark:platform-3.3.0-latest
gcr.io/ocean-spark/spark:platform-3.3.0-hadoop-3.3.1-java-8-scala-2.12-python-3.8-latest
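The flavor components in a tag can be read mechanically as name-version pairs, with the leading version being the Spark version. As a quick sketch (plain Python, not an official tool), the tags above parse like this:

```python
import re

def parse_flavor(tag: str) -> dict:
    """Split a spark:platform image tag into its component versions.

    Handles tags like
    'platform-3.3.0-hadoop-3.3.1-java-8-scala-2.12-python-3.8-latest';
    the version right after 'platform-' is the Spark version.
    """
    body = tag.removeprefix("platform-").removesuffix("-latest")
    # Split on the known component names, keeping them via the capture group.
    parts = re.split(r"-(hadoop|java|scala|python)-", body)
    flavor = {"spark": parts[0]}
    for name, version in zip(parts[1::2], parts[2::2]):
        flavor[name] = version
    return flavor

print(parse_flavor("platform-3.3.0-latest"))
print(parse_flavor("platform-3.3.0-hadoop-3.3.1-java-8-scala-2.12-python-3.8-latest"))
```

The short tag only pins the Spark version; the long tag pins every component of the flavor.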
Need to build your own Image?
You should use one of the spark:platform images as a base. Once your custom image is built in your local Docker repository, tag and push it; see Set up a Docker registry and push your image.
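A minimal custom image could look like the following sketch. The base image tag comes from the table above; the Python packages and versions are purely illustrative examples of app dependencies:

```dockerfile
# Sketch of a custom image built on a spark:platform base image.
FROM gcr.io/ocean-spark/spark:platform-3.5-latest

# Illustrative extra Python dependencies for your application.
RUN pip install --no-cache-dir pandas==2.1.4 pyarrow==14.0.2
```

After a local `docker build`, tag and push the result to your registry as described in the linked guide.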
Data source connectors
The gcr.io/ocean-spark/spark:platform images support the following data sources:
- AWS S3 (s3a:// or s3:// scheme)
- Google Cloud Storage (gs:// scheme)
- Azure Blob Storage (wasbs:// scheme)
- Azure Data Lake Storage Gen1 (adl:// scheme)
- Azure Data Lake Storage Gen2 (abfss:// scheme)
- Snowflake
- Delta Lake
- AWS Glue
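As a sketch of using one of these connectors, reading from S3 with the s3a:// scheme usually just needs credentials wired in through Spark configuration. The property below is a standard Hadoop S3A setting rather than anything specific to these images, and the values are examples:

```properties
# spark-defaults.conf sketch: resolve AWS credentials via the default
# provider chain (environment, instance profile, etc.).
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
```

Your application can then read data with a call such as `spark.read.parquet("s3a://my-bucket/path/")`.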
To check the versions used by an image, see the release notes.
Python support
The gcr.io/ocean-spark/spark:platform images support PySpark applications. When building a custom image or working from a notebook, additional Python packages can be installed with pip or conda.
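For example, from a Jupyter notebook cell (the package name is illustrative), a package can be added to the running kernel's environment:

```
%pip install scikit-learn
```

For dependencies needed at every run, baking them into a custom image is the more reproducible option.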