Big Data: Spark on K8s

I. Spark on K8s Architecture

1. Advantages of K8s

K8s (Kubernetes) is an open-source container cluster management system that automates the deployment, scaling, and maintenance of containerized applications. Its key strengths include:
1. Failover (automatic rescheduling of failed workloads)
2. Resource scheduling
3. Resource isolation
4. Load balancing
5. Cross-platform deployment

2. K8s Cluster Architecture

(Figure: K8s cluster architecture diagram)

3. How Spark on K8s Works

(Figure: Spark on K8s workflow diagram)

The concrete workflow (using the Spark Operator's SparkApplication CRD) consists of the following steps:

①: The user runs kubectl to create a SparkApplication object; the request is submitted to the kube-apiserver, and the SparkApplication custom resource is persisted to etcd.

②: The SparkApplication controller receives the object event from the kube-apiserver, creates a submission (essentially a parameterized spark-submit command), and sends it to the submission runner.

③: The submission runner submits the application to the K8s cluster, which creates the driver pod. Once the driver pod is running, it spawns the executor pods. While the application runs, the Spark pod monitor watches the application's pod states (visible via kubectl get/describe) and forwards status updates to the controller, which calls the kube-apiserver to update the application's status (i.e., the status field of the SparkApplication CR).

④: The mutating admission webhook creates a Service (svc), through which the Spark Web UI can be viewed.
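
For reference, the SparkApplication object created in step ① is just a custom resource. Below is a minimal manifest sketch, assuming the spark-on-k8s-operator CRDs are installed in the cluster; the namespace, service account, and image reuse values that appear later in this article:

# A minimal sketch; requires the spark-on-k8s-operator to be installed
cat <<'EOF' | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
  sparkVersion: "3.1.3"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 512m
EOF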

Building a private Spark image (reference): https://blog.csdn.net/lhyandlwl/article/details/122025937

II. Building a Private Spark Image

1. Download the Spark Package

Download the Spark 3.1.3 binary package and extract it:

# Download; if the mirror link goes stale, fetch from the official site: http://spark.apache.org/downloads.html
# $ wget https://mirror.bit.edu.cn/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
> wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.2.tgz

> tar -zxvf spark-3.1.3-bin-hadoop3.2.tgz -C /mmp/modules
> cd /mmp/modules/spark-3.1.3-bin-hadoop3.2/bin

# Add to the environment variables
# export SPARK_HOME=/mmp/modules/spark-3.1.3-bin-hadoop3.2

Output:
[root@master1 spark-3.1.3-bin-hadoop3.2]# cd bin
[root@master1 bin]# ls -l
total 116
-rwxr-xr-x. 1 hemei hemei  1089 Feb  7  2022 beeline
-rw-r--r--. 1 hemei hemei  1064 Feb  7  2022 beeline.cmd
-rwxr-xr-x. 1 hemei hemei 10965 Feb  7  2022 docker-image-tool.sh
-rwxr-xr-x. 1 hemei hemei  1935 Feb  7  2022 find-spark-home
-rw-r--r--. 1 hemei hemei  2685 Feb  7  2022 find-spark-home.cmd
-rw-r--r--. 1 hemei hemei  2337 Feb  7  2022 load-spark-env.cmd
-rw-r--r--. 1 hemei hemei  2435 Feb  7  2022 load-spark-env.sh
-rwxr-xr-x. 1 hemei hemei  2634 Feb  7  2022 pyspark
-rw-r--r--. 1 hemei hemei  1540 Feb  7  2022 pyspark2.cmd
-rw-r--r--. 1 hemei hemei  1170 Feb  7  2022 pyspark.cmd
-rwxr-xr-x. 1 hemei hemei  1030 Feb  7  2022 run-example
-rw-r--r--. 1 hemei hemei  1223 Feb  7  2022 run-example.cmd
-rwxr-xr-x. 1 hemei hemei  3539 Feb  7  2022 spark-class
-rwxr-xr-x. 1 hemei hemei  2812 Feb  7  2022 spark-class2.cmd
-rw-r--r--. 1 hemei hemei  1180 Feb  7  2022 spark-class.cmd
-rwxr-xr-x. 1 hemei hemei  1039 Feb  7  2022 sparkR
-rw-r--r--. 1 hemei hemei  1097 Feb  7  2022 sparkR2.cmd
-rw-r--r--. 1 hemei hemei  1168 Feb  7  2022 sparkR.cmd
-rwxr-xr-x. 1 hemei hemei  3122 Feb  7  2022 spark-shell
-rw-r--r--. 1 hemei hemei  1818 Feb  7  2022 spark-shell2.cmd
-rw-r--r--. 1 hemei hemei  1178 Feb  7  2022 spark-shell.cmd
-rwxr-xr-x. 1 hemei hemei  1065 Feb  7  2022 spark-sql
-rw-r--r--. 1 hemei hemei  1118 Feb  7  2022 spark-sql2.cmd
-rw-r--r--. 1 hemei hemei  1173 Feb  7  2022 spark-sql.cmd
-rwxr-xr-x. 1 hemei hemei  1040 Feb  7  2022 spark-submit
-rw-r--r--. 1 hemei hemei  1155 Feb  7  2022 spark-submit2.cmd
-rw-r--r--. 1 hemei hemei  1180 Feb  7  2022 spark-submit.cmd
[root@master1 bin]# 
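
Before building images, it helps to persist SPARK_HOME and sanity-check the unpacked distribution; a minimal sketch (writing to /etc/profile is an assumption, adjust to your setup):

# Persist the variables (path taken from the tar step above), then verify
echo 'export SPARK_HOME=/mmp/modules/spark-3.1.3-bin-hadoop3.2' >> /etc/profile
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> /etc/profile
source /etc/profile
spark-submit --version   # should report version 3.1.3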

2. Build the Image

Spark (since version 2.3) ships with a Dockerfile, which can be found in the kubernetes/dockerfiles/ directory:

[root@master1 spark]# tree /mmp/modules/spark-3.1.3-bin-hadoop3.2/kubernetes/dockerfiles/spark
/mmp/modules/spark-3.1.3-bin-hadoop3.2/kubernetes/dockerfiles/spark
├── bindings
│   ├── python
│   │   └── Dockerfile
│   └── R
│       └── Dockerfile
├── decom.sh
├── Dockerfile
└── entrypoint.sh

3 directories, 5 files
[root@master1 spark]# 

Spark also ships with a script for building and pushing images, bin/docker-image-tool.sh. The build and push commands look like this:

cd $SPARK_HOME
# Build the image
## -p <Dockerfile> selects an alternative Dockerfile for the Python image, e.g. ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile
# $SPARK_HOME/bin/docker-image-tool.sh -r myharbor.com/bigdata -t 3.1.3-hadoop3 build
# push
# $SPARK_HOME/bin/docker-image-tool.sh -r myharbor.com/bigdata -t 3.1.3-hadoop3 push
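
The same script can also bake Python support into a separate spark-py image via the -p flag mentioned above, using the Python Dockerfile bundled with the distribution:

# Optional: build the PySpark variant of the image
$SPARK_HOME/bin/docker-image-tool.sh \
  -r harbor.hmatm.com/modelsvc -t 3.1.3-hadoop3 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build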

Build the image:

# Build the image (actual run)
> $SPARK_HOME/bin/docker-image-tool.sh -r harbor.hmatm.com/modelsvc -t 3.1.3-hadoop3 build
...
Successfully built 6b7f09a69a45
Successfully tagged harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3

[root@master1 spark-3.1.3-bin-hadoop3.2]# docker images | grep 'spark'
harbor.hmatm.com/modelsvc/spark                                         3.1.3-hadoop3   6b7f09a69a45   6 minutes ago   547MB
[root@master1 spark-3.1.3-bin-hadoop3.2]# 
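
If your jobs need extra dependencies (a JDBC driver, OS packages, and so on), one option is to layer a small Dockerfile on top of this image before pushing; a hedged sketch, where the MySQL connector jar is purely a hypothetical example:

# Build a derived image from an inline Dockerfile; the added jar is hypothetical
docker build -t harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3-extra -f- . <<'EOF'
FROM harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3
USER root
# Spark's default image keeps jars under /opt/spark/jars
ADD https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/mysql-connector-java-8.0.28.jar
# Drop back to the non-root spark user defined by the base Dockerfile
USER 185
EOF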

Push the image to the registry:

# push
> $SPARK_HOME/bin/docker-image-tool.sh -r harbor.hmatm.com/modelsvc -t 3.1.3-hadoop3 push

3. Configure Spark User Permissions (RBAC)

> kubectl create ns spark
> kubectl create serviceaccount spark -n spark
> kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark:spark
## Then add this option to spark-submit:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
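
Note that the ClusterRoleBinding above grants the edit role cluster-wide; if you would rather scope the permission to the spark namespace only, a namespaced RoleBinding works as well:

# Namespace-scoped alternative to the ClusterRoleBinding above
kubectl create rolebinding spark-role --clusterrole=edit \
  --serviceaccount=spark:spark -n spark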

4. Submit a Spark Job (cluster mode)

# Look up the K8s apiserver address: kubectl cluster-info
cd $SPARK_HOME
./bin/spark-submit \
    --master k8s://https://192.168.1.61:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3 \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar

Output:

22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Application status for spark-81e407abc7974ecb9304e0dc53d8cb2e (phase: Succeeded)
22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Container final statuses:

         container name: spark-kubernetes-driver
         container image: harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3
         container state: terminated
         container started at: 2022-12-12T16:53:57Z
         container finished at: 2022-12-12T16:54:18Z
         exit code: 0
         termination reason: Completed
22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Application spark-pi with submission ID spark:spark-pi-a52d88850741db4b-driver finished
22/12/13 00:54:20 INFO ShutdownHookManager: Shutdown hook called
22/12/13 00:54:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-00f4d5c4-0fd3-4fdf-8b71-3bd5f550f4d2
[root@master1 spark-3.1.3-bin-hadoop3.2]# kubectl get pods -n spark
NAME                               READY   STATUS      RESTARTS   AGE
spark-pi-a52d88850741db4b-driver   0/1     Completed   0          104s
spark-pi-f63d4185073ce565-driver   0/1     Error       0          7m9s
[root@master1 spark-3.1.3-bin-hadoop3.2]# 
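
The computed result of SparkPi only appears in the driver log; it can be pulled from the completed driver pod listed above:

# Fetch the job result from the finished driver pod
kubectl logs spark-pi-a52d88850741db4b-driver -n spark | grep 'Pi is roughly'
# expected output, e.g.: Pi is roughly 3.14...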

[Note] local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar refers to a path inside the container's filesystem, not on the machine running spark-submit. Besides local://, schemes such as http:// or hdfs:// also work; a path with no scheme is treated as a file on the submitting client (equivalent to file://), which on K8s requires a staging upload path, as shown in the sketch below.
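
If the application jar lives on the submitting machine, Spark 3.x can upload it automatically when spark.kubernetes.file.upload.path points at a shared Hadoop-compatible filesystem; a sketch, where the HDFS staging path, the main class, and the local jar are assumed placeholders:

# Sketch: submit a client-local jar (file:// or no scheme); needs an upload path
# com.example.MyApp and /tmp/my-app.jar are hypothetical placeholders
./bin/spark-submit \
    --master k8s://https://192.168.1.61:6443 \
    --deploy-mode cluster \
    --name my-app \
    --class com.example.MyApp \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3 \
    --conf spark.kubernetes.file.upload.path=hdfs://namenode:9000/spark-upload \
    file:///tmp/my-app.jar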

