Big Data: Spark on K8s
I. Spark on K8s Architecture
1. Advantages of K8s
K8s is an open-source container cluster management system that automates the deployment, scaling, and maintenance of container clusters. Its main advantages include:
1. Failover
2. Resource scheduling
3. Resource isolation
4. Load balancing
5. Cross-platform deployment
2. K8s Cluster Architecture
3. How Spark on K8s Works
The workflow consists of the following steps:
①: The user creates a SparkApplication object with kubectl, which submits the SparkApplication request to the api-server and persists the SparkApplication CRD object to etcd (a manifest sketch follows this list).
②: The SparkApplication controller receives the object event from the kube-api server, creates a submission (essentially a parameterized spark-submit command), and hands it to the submission runner.
③: The submission runner submits the app to the k8s cluster and creates the driver pod. Once the driver pod is running, it creates the executor pods. While the application runs, the spark pod monitor watches the application's pod states (observable via kubectl list/status) and sends pod status updates to the controller; the controller then calls the kube-api to update the SparkApplication's status (in practice, the status field of the SparkApplication CRD object).
④: The mutating admission webhook creates the svc, through which the Spark web UI can be accessed.
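As a concrete illustration of step ①, the following applies a minimal SparkApplication manifest with kubectl. This is only a sketch: it assumes the spark-on-k8s-operator is installed in the cluster, and it reuses the image, namespace, and service account that are set up later in this article; adjust names to your environment.
# Minimal sketch: create a SparkApplication object (assumes the spark-on-k8s-operator is installed)
kubectl apply -n spark -f - <<'EOF'
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
  sparkVersion: "3.1.3"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: 512m
EOF
# Watch the status field that the controller maintains (step ③)
kubectl get sparkapplication spark-pi -n spark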
For background on building a private Spark image, see: https://blog.csdn.net/lhyandlwl/article/details/122025937
II. Building a Private Spark Image
1. Download the Spark Package
Download the Spark 3.1.3 binary package and extract it:
# Download; if this mirror link is dead, get the package from the official site: http://spark.apache.org/downloads.html
# $ wget https://mirror.bit.edu.cn/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
> wget --no-check-certificate https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.2.tgz
> tar -zxvf spark-3.1.3-bin-hadoop3.2.tgz -C /mmp/modules
> cd /mmp/modules/spark-3.1.3-bin-hadoop3.2/bin
# Add to the environment
# export SPARK_HOME=/mmp/modules/spark-3.1.3-bin-hadoop3.2
Output:
[root@master1 spark-3.1.3-bin-hadoop3.2]# cd bin
[root@master1 bin]# ls -l
total 116
-rwxr-xr-x. 1 hemei hemei 1089 Feb 7 2022 beeline
-rw-r--r--. 1 hemei hemei 1064 Feb 7 2022 beeline.cmd
-rwxr-xr-x. 1 hemei hemei 10965 Feb 7 2022 docker-image-tool.sh
-rwxr-xr-x. 1 hemei hemei 1935 Feb 7 2022 find-spark-home
-rw-r--r--. 1 hemei hemei 2685 Feb 7 2022 find-spark-home.cmd
-rw-r--r--. 1 hemei hemei 2337 Feb 7 2022 load-spark-env.cmd
-rw-r--r--. 1 hemei hemei 2435 Feb 7 2022 load-spark-env.sh
-rwxr-xr-x. 1 hemei hemei 2634 Feb 7 2022 pyspark
-rw-r--r--. 1 hemei hemei 1540 Feb 7 2022 pyspark2.cmd
-rw-r--r--. 1 hemei hemei 1170 Feb 7 2022 pyspark.cmd
-rwxr-xr-x. 1 hemei hemei 1030 Feb 7 2022 run-example
-rw-r--r--. 1 hemei hemei 1223 Feb 7 2022 run-example.cmd
-rwxr-xr-x. 1 hemei hemei 3539 Feb 7 2022 spark-class
-rwxr-xr-x. 1 hemei hemei 2812 Feb 7 2022 spark-class2.cmd
-rw-r--r--. 1 hemei hemei 1180 Feb 7 2022 spark-class.cmd
-rwxr-xr-x. 1 hemei hemei 1039 Feb 7 2022 sparkR
-rw-r--r--. 1 hemei hemei 1097 Feb 7 2022 sparkR2.cmd
-rw-r--r--. 1 hemei hemei 1168 Feb 7 2022 sparkR.cmd
-rwxr-xr-x. 1 hemei hemei 3122 Feb 7 2022 spark-shell
-rw-r--r--. 1 hemei hemei 1818 Feb 7 2022 spark-shell2.cmd
-rw-r--r--. 1 hemei hemei 1178 Feb 7 2022 spark-shell.cmd
-rwxr-xr-x. 1 hemei hemei 1065 Feb 7 2022 spark-sql
-rw-r--r--. 1 hemei hemei 1118 Feb 7 2022 spark-sql2.cmd
-rw-r--r--. 1 hemei hemei 1173 Feb 7 2022 spark-sql.cmd
-rwxr-xr-x. 1 hemei hemei 1040 Feb 7 2022 spark-submit
-rw-r--r--. 1 hemei hemei 1155 Feb 7 2022 spark-submit2.cmd
-rw-r--r--. 1 hemei hemei 1180 Feb 7 2022 spark-submit.cmd
[root@master1 bin]#
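To make SPARK_HOME persist across shells and to sanity-check the install, something like the following works (a sketch; using /etc/profile is an assumption, and the paths match the install location above):
# Persist SPARK_HOME and put the Spark binaries on PATH
echo 'export SPARK_HOME=/mmp/modules/spark-3.1.3-bin-hadoop3.2' >> /etc/profile
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> /etc/profile
source /etc/profile
# Verify: should report version 3.1.3
spark-submit --version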
2. Build the Image
Spark (since version 2.3) ships with a Dockerfile, which can be found in the kubernetes/dockerfiles/ directory:
[root@master1 spark]# tree /mmp/modules/spark-3.1.3-bin-hadoop3.2/kubernetes/dockerfiles/spark
/mmp/modules/spark-3.1.3-bin-hadoop3.2/kubernetes/dockerfiles/spark
├── bindings
│   ├── python
│   │   └── Dockerfile
│   └── R
│       └── Dockerfile
├── decom.sh
├── Dockerfile
└── entrypoint.sh
3 directories, 5 files
[root@master1 spark]#
Spark also ships with a script for building and pushing images, bin/docker-image-tool.sh. The commands are as follows:
cd $SPARK_HOME
# Build the image
## -p selects the Dockerfile, e.g. -p ./kubernetes/dockerfiles/spark/Dockerfile
# $SPARK_HOME/bin/docker-image-tool.sh -r myharbor.com/bigdata -t 3.1.3-hadoop3 build
# push
# $SPARK_HOME/bin/docker-image-tool.sh -r myharbor.com/bigdata -t 3.1.3-hadoop3 push
Build the image:
# Actual run: build the image
> $SPARK_HOME/bin/docker-image-tool.sh -r harbor.hmatm.com/modelsvc -t 3.1.3-hadoop3 build
...
Successfully built 6b7f09a69a45
Successfully tagged harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3
[root@master1 spark-3.1.3-bin-hadoop3.2]# docker images | grep 'spark'
harbor.hmatm.com/modelsvc/spark 3.1.3-hadoop3 6b7f09a69a45 6 minutes ago 547MB
[root@master1 spark-3.1.3-bin-hadoop3.2]#
Push the image to the registry:
# push
> $SPARK_HOME/bin/docker-image-tool.sh -r harbor.hmatm.com/modelsvc -t 3.1.3-hadoop3 push
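If you also need an image with Python or R bindings, the same script's -p flag points it at one of the bindings Dockerfiles from the tree above. A sketch, assuming the same registry and tag:
# Build the PySpark image (-p selects the Python bindings Dockerfile)
$SPARK_HOME/bin/docker-image-tool.sh \
  -r harbor.hmatm.com/modelsvc \
  -t 3.1.3-hadoop3 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build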
3. Configure the spark User's Permissions
> kubectl create ns spark
> kubectl create serviceaccount spark -n spark
> kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark:spark
## Then add this option to spark-submit:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
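Before submitting anything, you can confirm the binding took effect by letting kubectl impersonate the service account (a quick optional check):
# Should print "yes": the spark service account may create pods in its namespace
kubectl auth can-i create pods -n spark --as=system:serviceaccount:spark:spark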
4. Submit a Spark Job (cluster mode)
# Check the k8s apiserver address: kubectl cluster-info
cd $SPARK_HOME
./bin/spark-submit \
--master k8s://https://192.168.1.61:6443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3 \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar
Output:
22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Application status for spark-81e407abc7974ecb9304e0dc53d8cb2e (phase: Succeeded)
22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Container final statuses:
container name: spark-kubernetes-driver
container image: harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3
container state: terminated
container started at: 2022-12-12T16:53:57Z
container finished at: 2022-12-12T16:54:18Z
exit code: 0
termination reason: Completed
22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Application spark-pi with submission ID spark:spark-pi-a52d88850741db4b-driver finished
22/12/13 00:54:20 INFO LoggingPodStatusWatcherImpl: Application spark-pi with submission ID spark:spark-pi-a52d88850741db4b-driver finished
22/12/13 00:54:20 INFO ShutdownHookManager: Shutdown hook called
22/12/13 00:54:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-00f4d5c4-0fd3-4fdf-8b71-3bd5f550f4d2
[root@master1 spark-3.1.3-bin-hadoop3.2]# kubectl get pods -n spark
NAME READY STATUS RESTARTS AGE
spark-pi-a52d88850741db4b-driver 0/1 Completed 0 104s
spark-pi-f63d4185073ce565-driver 0/1 Error 0 7m9s
[root@master1 spark-3.1.3-bin-hadoop3.2]#
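To see the job's actual result, read the driver pod's log; the SparkPi example prints its estimate on a single line (pod name taken from the listing above):
# Print the result line from the completed driver pod
kubectl logs spark-pi-a52d88850741db4b-driver -n spark | grep "Pi is roughly"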
[Note] The path local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar refers to the container's filesystem, not that of the machine running spark-submit. Schemes such as HTTP or HDFS can be used instead of local://; if no scheme is specified, local is assumed by default.
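For example, to fetch the application jar from HDFS instead of baking it into the image, point the final argument at an HDFS URI. This is a sketch: the namenode address and jar path below are hypothetical placeholders, and it assumes the cluster pods can reach your HDFS (the hadoop3.2 image built above already contains the HDFS client libraries).
# Hypothetical HDFS location; replace namenode:9000 and the path with your own
./bin/spark-submit \
  --master k8s://https://192.168.1.61:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=harbor.hmatm.com/modelsvc/spark:3.1.3-hadoop3 \
  hdfs://namenode:9000/jars/spark-examples_2.12-3.1.3.jar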