A Job Performance Evaluation Method under Spark Platform

doi:10.3969/j.issn.1671-1122.2022.09.010

Abstract

To address performance evaluation and optimization of running Spark jobs, this paper proposes a performance evaluation and analysis method for Spark jobs based on the analytic hierarchy process (AHP). First, because the accuracy of traditional job type classification depends heavily on feature selection, CPU and I/O features that better reflect actual job behavior are selected and combined with the K-Means clustering algorithm to build a job classifier with higher classification accuracy. Second, the job workflow is optimized by eliminating data sorting, disk spill writes, and file merging during job execution, and the performance metrics of the optimized job serve as the evaluation benchmark, which makes the evaluation more objective and general. Third, the performance metrics are quantified and organized into layers, AHP is used to calculate their weights, and the performance evaluation model is built by combining the job classifier with the evaluation benchmark. Finally, experiments validate three aspects: job type classification, the workflow optimization method, and performance evaluation. The results show that the proposed job type classification and workflow optimization methods are effective and that the evaluation model is accurate.

Key words: Spark, evaluation benchmark, quantification, analytic hierarchy process

CLC Number: TP309

ZHANG Zhenghui, CHEN Xingshu, LUO Yonggang, WU Tianxiong. A Job Performance Evaluation Method under Spark Platform[J]. Netinfo Security, 2022, 22(9): 86-95.

Figures/Tables 15

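Scale	Meaning
1	The two factors are equally important
3	The former is slightly more important than the latter
5	The former is clearly more important than the latter
7	The former is strongly more important than the latter
9	The former is extremely more important than the latter
2, 4, 6, 8	Intermediate values between the adjacent judgments above
Reciprocal	If the importance ratio of factor i to factor j is $a_{ij}$, then the importance ratio of factor j to factor i is $a_{ji}=1/a_{ij}$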
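
To illustrate how judgments on this 1-9 scale turn into weights, the following minimal sketch derives a weight vector from a pairwise comparison matrix by power iteration and then checks Saaty's consistency ratio. The 3x3 matrix is hypothetical and is not taken from the paper, the function names are illustrative, and the random-index values are the commonly tabulated ones.

```python
# Commonly tabulated random consistency index (RI) values, keyed by matrix order n.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Approximate the principal eigenvector (normalized to sum to 1) and lambda_max."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [value / total for value in product]
    # Estimate lambda_max as the mean ratio (A w)_i / w_i.
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(product[i] / weights[i] for i in range(n)) / n
    return weights, lambda_max


def consistency_ratio(lambda_max, n):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1); matrices of order <= 2 are always consistent."""
    if n <= 2:
        return 0.0
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments (NOT from the paper); the lower triangle holds the reciprocals.
judgments = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights, lambda_max = ahp_weights(judgments)
cr = consistency_ratio(lambda_max, len(judgments))
print([round(w, 3) for w in weights], round(cr, 3))
```

A consistency ratio below 0.1 is the usual threshold for accepting a judgment matrix; above it, the pairwise judgments are normally revised.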

Y \ X	$X_1$	$X_2$	$\ldots$	$X_m$	Total ranking weight
$Y_1$	$y_{11}$	$y_{12}$	$\ldots$	$y_{1m}$	$\sum_{i=1}^{m} y_{1i} x_i$
$Y_2$	$y_{21}$	$y_{22}$	$\ldots$	$y_{2m}$	$\sum_{i=1}^{m} y_{2i} x_i$
$\ldots$	$\ldots$	$\ldots$	$\ldots$	$\ldots$	$\ldots$
$Y_n$	$y_{n1}$	$y_{n2}$	$\ldots$	$y_{nm}$	$\sum_{i=1}^{m} y_{ni} x_i$

Criterion layer	B₁	B₂	B₃	B₄	Weight $\omega_{cpu}$
Criterion-layer weight	0.1	0.18	0.36	0.36	-
Shuffle	0	0	0.67	0	0.2412
ShuffleRead	0	0	0	0.57	0.2052
CpuLoad	0	0.67	0	0	0.1206
ShuffleWrite	0	0	0	0.29	0.1044
GC	0	0	0.17	0	0.0612
ThroughputRate	0	0	0	0.14	0.0504
Job	0.4	0	0	0	0.04
Stage	0.4	0	0	0	0.04
PeakMem	0	0.22	0	0	0.0396
Task	0	0	0.08	0	0.0288
ExecutorTime	0	0	0.08	0	0.0288
Config	0.2	0	0	0	0.02
IO	0	0.11	0	0	0.0198
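
The last column of this table follows from the total ranking formula: each indicator's local weights are multiplied by the criterion-layer weights and summed. The sketch below recomputes that column from the values in the table; the variable names are illustrative and not from the paper.

```python
# Criterion-layer weights for B1..B4, copied from the table.
criterion_weights = [0.10, 0.18, 0.36, 0.36]

# Local weight of each indicator under B1..B4, copied from the table.
local_weights = {
    "Shuffle":        [0, 0, 0.67, 0],
    "ShuffleRead":    [0, 0, 0, 0.57],
    "CpuLoad":        [0, 0.67, 0, 0],
    "ShuffleWrite":   [0, 0, 0, 0.29],
    "GC":             [0, 0, 0.17, 0],
    "ThroughputRate": [0, 0, 0, 0.14],
    "Job":            [0.4, 0, 0, 0],
    "Stage":          [0.4, 0, 0, 0],
    "PeakMem":        [0, 0.22, 0, 0],
    "Task":           [0, 0, 0.08, 0],
    "ExecutorTime":   [0, 0, 0.08, 0],
    "Config":         [0.2, 0, 0, 0],
    "IO":             [0, 0.11, 0, 0],
}

# Total ranking weight of indicator j: sum over criteria i of y_ji * x_i.
total_weights = {
    name: sum(y * x for y, x in zip(row, criterion_weights))
    for name, row in local_weights.items()
}

for name, weight in total_weights.items():
    print(f"{name:15s} {weight:.4f}")
print("sum of weights:", round(sum(total_weights.values()), 4))
```

Because every indicator sits under exactly one criterion here, each sum reduces to a single product, and the thirteen total weights add up to 1.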

References 19

[1]	ZAHARIA M, XIN R S, WENDELL P, et al. Apache Spark: A Unified Engine for Big Data Processing[J]. Communications of the ACM, 2016, 59(11): 56-65.
[2]	PEREZ T B G, CHEN Wei, JI R, et al. Pets: Bottleneck-Aware Spark Tuning with Parameter Ensembles[C]// IEEE. 2018 27th International Conference on Computer Communication and Networks (ICCCN). New York: IEEE, 2018: 1-9.
[3]	GULINO A, CANAKOGLU A, CERI S, et al. Performance Prediction for Data-Driven Workflows on Apache Spark[C]// IEEE. 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). New York: IEEE, 2020: 1-8.
[4]	GU Jing, LI Ying, TANG Hongyan, et al. Auto-Tuning Spark Configurations Based on Neural Network[C]// IEEE. International Conference on Communications (ICC). New York: IEEE, 2018: 1-6.
[5]	SHAH S, AMANNEJAD Y, KRISHNAMURTHY D, et al. PERIDOT: Modeling Execution Time of Spark Applications[J]. IEEE Open Journal of the Computer Society, 2021, 2(1): 346-359.
[6]	AHMED N, BARCZAK A L C, RASHID M A, et al. A Parallelization Model for Performance Characterization of Spark Big Data Jobs on Hadoop Clusters[J]. Journal of Big Data, 2021, 8(1): 1-28.
[7]	MYUNG R, YU H. Performance Prediction for Convolutional Neural Network on Spark Cluster[J]. Electronics, 2020, 9(9): 1340-1362.
[8]	PETRIDIS P, GOUNARIS A, TORRES J. Spark Parameter Tuning via Trial-and-Error[C]// Springer. INNS Conference on Big Data. Heidelberg: Springer, 2016: 226-237.
[9]	CHENG Guoli, YING Shi, WANG Bingming, et al. Efficient Performance Prediction for Apache Spark[J]. Journal of Parallel and Distributed Computing, 2021, 149(5): 40-51.
[10]	TIAN Chunqi, LI Jing, WANG Wei, et al. A Method for Improving the Performance of Spark on Container Cluster Based on Machine Learning[J]. Netinfo Security, 2019, 19(4): 11-19.
	田春岐, 李静, 王伟, 等. 一种基于机器学习的Spark容器集群性能提升方法[J]. 信息网络安全, 2019, 19(4): 11-19.
[11]	RUAN Shuhua, PAN Fanfan, CHEN Xingshu, et al. An Intelligent Optimization Method for Spark Job Configuration Parameters[J]. Advanced Engineering Sciences, 2020, 52(1): 191-197.
	阮树骅, 潘梵梵, 陈兴蜀, 等. 一种Spark作业配置参数智能优化方法[J]. 工程科学与技术, 2020, 52(1): 191-197.
[12]	AL-SAYEH H, HAGEDORN S, SATTLER K U. A Gray-Box Modeling Methodology for Runtime Prediction of Apache Spark Jobs[J]. Distributed and Parallel Databases, 2020, 38(4): 819-839.
[13]	KAN Zhongliang, LI Jianzhong. A Regression Model-Based Approach to Spark Task Performance Analysis[J]. Journal of Harbin Institute of Technology, 2018, 50(3): 192-198.
	阚忠良, 李建中. 基于回归模型的Spark任务性能分析方法[J]. 哈尔滨工业大学学报, 2018, 50(3): 192-198.
[14]	LI Cichao. Analysis and Optimization of Hadoop Job Scheduling Algorithm[D]. Wuhan: Wuhan University of Technology, 2015.
	李词超. Hadoop作业调度算法分析与优化[D]. 武汉: 武汉理工大学, 2015.
[15]	LI Zhe. Scheduling Algorithm Based on Job Type Classification and Cost Comparison for Hadoop Platform[D]. Guangzhou: South China University of Technology, 2015.
	李哲. Hadoop平台基于作业类型划分和代价比较的调度算法[D]. 广州: 华南理工大学, 2015.
[16]	AMORIM R C D, MIRKIN B. Minkowski Metric, Feature Weighting and Anomalous Cluster Initializing in K-Means Clustering[J]. Pattern Recognition, 2012, 45(3): 1061-1075.
[17]	SAATY T L. The Analytic Hierarchy Process[M]. New York: McGraw Hill Higher Education, 1980.
[18]	PETRIDIS P, GOUNARIS A, TORRES J. Spark Parameter Tuning via Trial-and-Error[C]// Springer. INNS Conference on Big Data. Heidelberg: Springer, 2016: 226-237.
[19]	WANG Lei, ZHAN Jianfeng, LUO Chunjie, et al. BigDataBench: A Big Data Benchmark Suite from Internet Services[C]// IEEE. The 20th IEEE International Symposium on High Performance Computer Architecture (HPCA-2014). New York: IEEE, 2014: 488-499.

Serializer configuration	Score
Not set, or not KryoSerializer	3
Set to KryoSerializer	5

Dynamic allocation enabled	External shuffle service enabled	Score
Yes/No	Yes	5
No	No	3
Yes	No	1
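
The two scoring tables above can be read as simple lookup rules over a job's configuration. The sketch below encodes them as functions over a plain dictionary of Spark properties. Mapping the table columns to the properties `spark.serializer`, `spark.dynamicAllocation.enabled`, and `spark.shuffle.service.enabled` is an assumption, as are the function names; how the paper combines the two scores into the Config indicator is not stated here, so they are returned separately.

```python
def _is_enabled(conf, key):
    """Treat a missing property as disabled, which matches Spark's default for both flags."""
    return str(conf.get(key, "false")).strip().lower() == "true"


def serializer_score(conf):
    """5 if the serializer is KryoSerializer, 3 if it is unset or anything else."""
    serializer = str(conf.get("spark.serializer", ""))
    return 5 if serializer.endswith("KryoSerializer") else 3


def dynamic_allocation_score(conf):
    """Score the dynamic allocation / external shuffle service combination."""
    dynamic = _is_enabled(conf, "spark.dynamicAllocation.enabled")
    external_shuffle = _is_enabled(conf, "spark.shuffle.service.enabled")
    if external_shuffle:
        return 5  # external shuffle service on, dynamic allocation either way
    return 1 if dynamic else 3  # dynamic allocation without the service scores lowest


# Hypothetical job configuration, for illustration only.
example_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "false",
}
print(serializer_score(example_conf), dynamic_allocation_score(example_conf))
```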

Cluster label	idle-ratio	iowait-ratio	appName
1	4.699	0.018	BigDataBench Sort
1	4.312	0.020	BigDataBench Sort
0	1.662	0.001	BigDataBench Grep
0	1.834	0.003	BigDataBench Grep
0	1.519	0.002	BigDataBench WordCount
0	1.598	0.002	BigDataBench WordCount
0	1.662	0.001	BigDataBench PageRank
0	1.941	0.001	BigDataBench PageRank
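
As a rough illustration of the clustering step, the sketch below runs plain K-Means (Lloyd's algorithm) with k = 2 on the two feature columns of this table. It is not the paper's classifier: the feature extraction and any scaling the authors applied are not shown in this excerpt, the features are used unscaled, and the farthest-point initialization is an assumption made only to keep the run deterministic.

```python
import math

# (idle-ratio, iowait-ratio, appName) rows copied from the table.
samples = [
    (4.699, 0.018, "Sort"), (4.312, 0.020, "Sort"),
    (1.662, 0.001, "Grep"), (1.834, 0.003, "Grep"),
    (1.519, 0.002, "WordCount"), (1.598, 0.002, "WordCount"),
    (1.662, 0.001, "PageRank"), (1.941, 0.001, "PageRank"),
]
points = [(idle, iowait) for idle, iowait, _ in samples]


def kmeans(points, k=2, max_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


labels, centroids = kmeans(points)
for (_, _, app), label in zip(samples, labels):
    print(app, label)
```

Cluster numbering is arbitrary, so the labels printed here may be swapped relative to the table; what matters is that the two Sort runs land in one cluster and the remaining jobs in the other. With unscaled features the idle-ratio column dominates the distance.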

Classification scheme	A	P	R	F1
Proposed scheme	91.1%	94.4%	92.1%	93.3%
Scheme of Ref. [13]	88.1%	93.0%	88.8%	90.8%
Scheme of Ref. [14]	92.8%	-	-	-
Scheme of Ref. [15]	90.4%	-	-	-
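
For readers checking the table, F1 is conventionally the harmonic mean of precision (P) and recall (R). The sketch below applies that standard definition to the two rows that report P and R; the excerpt does not spell out the paper's own definition, so treating it as the harmonic mean is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from the table.
reported = {
    "Proposed scheme": (0.944, 0.921, 0.933),
    "Scheme of Ref. [13]": (0.930, 0.888, 0.908),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.4f}, reported {f1:.3f}")
```

The recomputed values agree with the reported ones to within about 0.001, a gap that is plausibly due to P and R being rounded before publication.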

Job No.	1	2	3	4	3-2	4-2
Time/ms	83039	67997	62663	81235	-	-
Total score	0.774	0.875	0.886	0.833	0.755	0.767
Config	1	1	1	1	1	1
Job	1	1	1	1	1	1
Stage	1	1	1	1	1	1
PeakMem	1	0.704	0.569	0.393	0.443	0.323
CpuLoad	1	1	1	1	0.413	0.672
IO	1	1	1	1	0.992	0.991
GC	0.858	0.888	0.956	0.991	0.914	0.991
Task	1	0.91	1	0.539	0.711	0.539
Shuffle	0.103	0.574	0.613	0.498	0.538	0.502
ExecutorTime	0.996	0.966	1	1	1	1
ShuffleWrite	1	1	1	1	1	1
ThroughputRate	0.977	1	1	0.847	0.482	0.358
ShuffleRead	1	1	1	1	1	1
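
The total scores in this table can be reproduced as a weighted sum of the per-indicator scores. The sketch below uses the $\omega_{cpu}$ weights from the weight table above; the excerpt does not state which weight vector the authors applied to each job, so using $\omega_{cpu}$ for all six columns is an assumption, supported only by the fact that it matches every reported total to within 0.001.

```python
# Total ranking weights (omega_cpu) copied from the weight table.
WEIGHTS_CPU = {
    "Shuffle": 0.2412, "ShuffleRead": 0.2052, "CpuLoad": 0.1206,
    "ShuffleWrite": 0.1044, "GC": 0.0612, "ThroughputRate": 0.0504,
    "Job": 0.04, "Stage": 0.04, "PeakMem": 0.0396, "Task": 0.0288,
    "ExecutorTime": 0.0288, "Config": 0.02, "IO": 0.0198,
}

JOBS = ["1", "2", "3", "4", "3-2", "4-2"]

# Per-indicator scores, one value per job column, copied from the table.
SCORES = {
    "Config":         [1, 1, 1, 1, 1, 1],
    "Job":            [1, 1, 1, 1, 1, 1],
    "Stage":          [1, 1, 1, 1, 1, 1],
    "PeakMem":        [1, 0.704, 0.569, 0.393, 0.443, 0.323],
    "CpuLoad":        [1, 1, 1, 1, 0.413, 0.672],
    "IO":             [1, 1, 1, 1, 0.992, 0.991],
    "GC":             [0.858, 0.888, 0.956, 0.991, 0.914, 0.991],
    "Task":           [1, 0.91, 1, 0.539, 0.711, 0.539],
    "Shuffle":        [0.103, 0.574, 0.613, 0.498, 0.538, 0.502],
    "ExecutorTime":   [0.996, 0.966, 1, 1, 1, 1],
    "ShuffleWrite":   [1, 1, 1, 1, 1, 1],
    "ThroughputRate": [0.977, 1, 1, 0.847, 0.482, 0.358],
    "ShuffleRead":    [1, 1, 1, 1, 1, 1],
}

REPORTED_TOTALS = [0.774, 0.875, 0.886, 0.833, 0.755, 0.767]


def total_score(job_index):
    """Weighted sum of indicator scores for one job column."""
    return sum(WEIGHTS_CPU[name] * SCORES[name][job_index] for name in WEIGHTS_CPU)


for index, job in enumerate(JOBS):
    print(f"job {job}: computed {total_score(index):.4f}, reported {REPORTED_TOTALS[index]:.3f}")
```

Since Shuffle carries the largest weight (0.2412), the low Shuffle score of job 1 (0.103) accounts for most of the gap between its total and a perfect score.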