
spark standalone cluster

조앙'ㅁ' 2018. 3. 16. 23:06

Setting up a standalone cluster with standby masters

Reference: https://spark.apache.org/docs/2.3.0/spark-standalone.html

In short: ZooKeeper is used to run standby masters.

Note that standalone mode can access a plain Hadoop/HDFS just fine, but it provides no way to access a kerberized HDFS. ㅜㅜ


# Preparation

environment:

- centos 7.4

- java 8

- ssh key setting (passwordless ssh from spark-master01 to every node; a quick check follows the list)

- zookeeper for HA

- servers

spark-master01

spark-master02

spark-slave01

spark-slave02

spark-slave03
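
Before changing any configuration, it may be worth a quick sanity check that java 8 is present and that key-based ssh actually works from spark-master01 to every node. A minimal sketch, using the deploy user and hostnames that appear later in this post:

```
# run on spark-master01
java -version    # expect something like "1.8.0_..."

# each host should print its hostname without asking for a password
for h in spark-master02 spark-slave01 spark-slave02 spark-slave03; do
  ssh deploy@$h hostname
done
```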


# Configuration

Finish all of the configuration on master01 first, then copy it to master02 and the slaves.


## master01 setup

```

# download

wget http://mirror.apache-kr.org/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.6.tgz

tar xvfpz spark-2.3.0-bin-hadoop2.6.tgz

ln -s $PWD/spark-2.3.0-bin-hadoop2.6 spark
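
# (the symlink keeps SPARK_HOME stable when upgrading to a newer Spark later)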


# register SPARK_HOME and PATH
# (quoted heredoc so $HOME/$PATH stay unexpanded in ~/.bashrc)
cat <<'EOF' >> ~/.bashrc
export SPARK_HOME=$HOME/spark
export PATH=$PATH:$SPARK_HOME/bin
EOF


source ~/.bashrc
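
# sanity check: the variables above should now be in effect
echo $SPARK_HOME         # e.g. /home/deploy/spark
spark-submit --version   # expect 2.3.0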


# register the slaves (start-slaves.sh later reads this file and ssh-es into each host;
# the hostnames must match the server list above)
cat <<'EOF' >> spark/conf/slaves
spark-slave01
spark-slave02
spark-slave03
EOF


# set up spark-defaults.conf (note: the file Spark actually reads is spark-defaults.conf, with an "s")
# listing both masters lets drivers register with whichever master is alive
cat <<'EOF' >> spark/conf/spark-defaults.conf
spark.master    spark://spark-master01:7077,spark-master02:7077
EOF


# HA settings
# per the standalone docs, the ZooKeeper recovery options are passed to the daemons
# through SPARK_DAEMON_JAVA_OPTS in spark-env.sh (a separate ha.conf would only be
# read if passed explicitly with --properties-file)
cat <<'EOF' >> spark/conf/spark-env.sh
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zookeeperUrl:2181 -Dspark.deploy.zookeeper.dir=/spark-cluster-01"
EOF


```
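
If the masters later refuse to come up in recovery mode, it is worth confirming first that the ZooKeeper ensemble is reachable. A minimal check, assuming the `ruok` four-letter command is enabled on the ZooKeeper side (`zookeeperUrl` is the placeholder from above):

```
# expect the reply "imok"
echo ruok | nc zookeeperUrl 2181
```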


## Copy to master02 and the slaves

Copy with scp -rp: -r recurses into directories, and -p preserves modification/access times and modes.


```

scp -rp ~/.bashrc deploy@spark-master02:~/.bashrc

scp -rp ~/spark deploy@spark-master02:~/spark

for i in $(seq -f "%02g" 1 3); do scp -rp ~/.bashrc deploy@spark-slave$i:~/.bashrc; done

for i in $(seq -f "%02g" 1 3); do scp -rp ~/spark deploy@spark-slave$i:~/spark; done

```
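
A quick loop to confirm everything landed (same hosts and deploy user as above):

```
for h in spark-master02 spark-slave01 spark-slave02 spark-slave03; do
  ssh deploy@$h 'ls ~/spark/bin/spark-submit && grep SPARK_HOME ~/.bashrc'
done
```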


## Edit the start script on master02

Log in to master02; near the bottom of the script, change one of the launch arguments.


spark/sbin/start-master.sh

```

# as-is: this line originally reads $CLASS 1; change it to $CLASS 2

"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \

  --host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \

  $ORIGINAL_ARGS


# to-be

"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 2 \

  --host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \

  $ORIGINAL_ARGS

```
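
For context, spark-daemon.sh folds that instance number into the pid and log file names, so bumping it to 2 keeps master02's files distinct from master01's even if the pid/log directories ever sit on shared storage. Assuming the default SPARK_PID_DIR (/tmp) and ident string (the unix user, here deploy), the names look roughly like:

```
# master02 after the change
/tmp/spark-deploy-org.apache.spark.deploy.master.Master-2.pid
~/spark/logs/spark-deploy-org.apache.spark.deploy.master.Master-2-spark-master02.out
```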


# Starting the cluster

spark-master01: ~/spark/sbin/start-master.sh

spark-master02: ~/spark/sbin/start-master.sh

spark-master01: ~/spark/sbin/start-slaves.sh
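
If everything is up, the master web UIs (default port 8080) should show spark-master01 as ALIVE with all three workers registered, and spark-master02 as STANDBY. A quick smoke test, submitting the SparkPi example that ships with spark-2.3.0-bin-hadoop2.6 against the two-master URL:

```
spark-submit \
  --master spark://spark-master01:7077,spark-master02:7077 \
  --class org.apache.spark.examples.SparkPi \
  ~/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100
```

Failover can be checked by running ~/spark/sbin/stop-master.sh on spark-master01; within a minute or two spark-master02 should switch from STANDBY to ALIVE and the workers re-register against it.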


Works nicely~