
Support init_spark_on_yarn and RayContext #1344

Merged
Merged 18 commits from zhichao-li:nohome into intel-analytics:master on Jun 10, 2019

Conversation

@zhichao-li (Contributor) commented May 21, 2019

This patch provides a mechanism to deploy Python dependencies and Ray services automatically across a YARN cluster.
Based on init_spark_on_yarn and Conda, Python users can run Analytics-Zoo or Ray on YARN in a more Pythonic way: a plain pip install analytics-zoo is enough, with no spark-submit and no need to install Analytics-Zoo or Ray on every cluster node.

Example

import ray

from zoo import init_spark_on_yarn
from zoo.ray.util.raycontext import RayContext

slave_num = 2

sc = init_spark_on_yarn(
    hadoop_conf="/opt/work/almaren-yarn-config/",
    conda_name="ray36-dev",
    num_executor=slave_num,
    executor_cores=28,
    executor_memory="10g",
    driver_memory="2g",
    driver_cores=4,
    extra_executor_memory_for_ray="30g")

ray_ctx = RayContext(sc=sc,
                     object_store_memory="25g")
ray_ctx.init()


@ray.remote
class TestRay():
    def hostname(self):
        import socket
        return socket.gethostname()

    def check_cv2(self):
        # conda install -c conda-forge opencv==3.4.2
        import cv2
        return cv2.__version__

    def ip(self):
        import ray.services as rservices
        return rservices.get_node_ip_address()


actors = [TestRay.remote() for i in range(0, slave_num)]
print([ray.get(actor.hostname.remote()) for actor in actors])
print([ray.get(actor.ip.remote()) for actor in actors])

ray_ctx.stop()

@zhichao-li zhichao-li force-pushed the zhichao-li:nohome branch from 56a44a9 to 80c43fb May 21, 2019

@@ -137,4 +139,39 @@ class PythonZooNet[T: ClassTag](implicit ev: TensorNumeric[T]) extends PythonZoo
toJSample(x).asInstanceOf[RDD[JSample[Float]]], batchSize)
}

val processToBeKill = new ArrayList[String]()

@zhichao-li (Author, Contributor), May 23, 2019:

Switch this to a copy-on-write array to avoid locking here.

@zhichao-li zhichao-li changed the title [WIP] RayRunner RayRunner May 30, 2019

@jason-dai (Contributor) commented May 30, 2019:

  1. Can we have something like:

     sc = init_nncontext_on_yarn()  # or init_zoo_on_yarn()
     ray_ctx = RayContext(sc)
     ray_ctx.init()
     @ray.remote
     class TestRay():
     ray_ctx.stop()

  2. When running on YARN, two executors can possibly run on the same node with the same ip; does it work?

  3. Do we support Spark local? Need an example for that. Do we still start a Ray cluster in this case?
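
Regarding the Spark local question, a minimal sketch of what such an example might look like (hypothetical: it assumes the init_on_local entry point added later in this PR, shown here under the reviewer's suggested init_spark_on_local name):

    import ray
    from zoo import init_spark_on_local          # name per the review suggestion
    from zoo.ray.util.raycontext import RayContext

    sc = init_spark_on_local(cores=4)            # local[4] Spark master
    ray_ctx = RayContext(sc=sc, object_store_memory="2g")
    ray_ctx.init()                               # would start a local Ray cluster

    @ray.remote
    class Echo:
        def ping(self):
            return "pong"

    print(ray.get(Echo.remote().ping.remote()))
    ray_ctx.stop()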

override def run(): Unit = {
  // Give it a chance to be gracefully killed
  killPids(processToBeKill, "kill ")
  Thread.sleep(2000)

@jason-dai (Contributor), May 30, 2019:
if (!processToBeKill.isEmpty()) {
  Thread.sleep(2000)
  killPids(processToBeKill, "kill -9")
}
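
For reference, the same graceful-then-forceful shutdown pattern sketched in Python (the PR implements it in Scala inside JVMGuard; the helper below is illustrative only):

    import os
    import signal
    import time

    def kill_pids(pids):
        """Ask processes to exit, then force-kill any survivors."""
        for pid in pids:
            os.kill(pid, signal.SIGTERM)       # give each a chance to exit gracefully
        if pids:
            time.sleep(2)
            for pid in pids:
                try:
                    os.kill(pid, signal.SIGKILL)   # "kill -9" the stragglers
                except ProcessLookupError:
                    pass                           # already gone
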
driver_memory="1g",
driver_cores=10,
extra_executor_memory_for_ray=None,
extra_pmodule_zip=None,

@jason-dai (Contributor), May 30, 2019:

should give it a clearer name

penv_archive=None,
master="yarn",
hadoop_user_name="root",
spark_yarn_jars=None,

@jason-dai (Contributor), May 30, 2019:

is it spark.yarn.jars or spark.yarn.archive?

sc = self._init_yarn(hadoop_conf=hadoop_conf,
                     spark_yarn_jars=spark_yarn_jars,
                     penv_archive=penv_archive,
                     python_zip_file=extra_pmodule_zip,

@jason-dai (Contributor), May 30, 2019:

why different names? make sure _init_yarn matches init_spark_on_yarn for their parameter lists

command = " --archives {}#python_env --num-executors {} " \
          " --executor-cores {} --executor-memory {}".\
    format(penv_archive, num_executor, executor_cores, executor_memory)
path_to_zoo_jar = get_analytics_zoo_classpath()

@jason-dai (Contributor), May 30, 2019:

this can possibly return BigDL class path, or return nothing at all?

@zhichao-li (Author, Contributor), May 31, 2019:

yes, it would either return a zoo.jar or an empty string.

@jason-dai (Contributor), May 31, 2019:

it can also possibly return bigdl.jar?

@zhichao-li (Author, Contributor), Jun 4, 2019:

No, since it searches the zoo distribution folder. Actually, I'm thinking we should throw an exception here when the jar cannot be found, so the user is aware there's a problem and can fix it either by specifying the proper environment or by reinstalling Analytics-Zoo.

@jason-dai (Contributor), Jun 7, 2019:

in get_analytics_zoo_classpath:

    if os.getenv("BIGDL_CLASSPATH"):
        return os.environ["BIGDL_CLASSPATH"]
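
Combining this suggestion with the fail-fast idea above, a hypothetical version of the helper might read as follows (the share/lib package layout and glob pattern are assumptions, not the PR's actual code):

    import glob
    import os

    def get_analytics_zoo_classpath():
        # honor an explicit override first, as suggested above
        if os.getenv("BIGDL_CLASSPATH"):
            return os.environ["BIGDL_CLASSPATH"]
        jars = glob.glob(os.path.join(os.path.dirname(os.path.abspath(__file__)),
                                      "share", "lib", "*.jar"))
        if not jars:
            # fail fast so the user can fix the environment
            # or reinstall Analytics-Zoo
            raise Exception("Cannot find the zoo jar; set BIGDL_CLASSPATH "
                            "or reinstall Analytics-Zoo")
        return jars[0]
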
        return value
    except Exception:
        raise Exception("Size must be specified as bytes(b),"
                        "kibibytes(k), mebibytes(m), gibibytes(g). "

@jason-dai (Contributor), May 30, 2019:

kilobytes, megabytes, gigabytes
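
For context, a minimal sketch of what such a size parser might look like with the suggested naming (hypothetical; the PR's resourceToBytes implementation may differ):

    def resource_to_bytes(size):
        """Parse sizes like "25g", "512m", or "1024" (plain bytes)."""
        units = {"b": 1, "k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
        value = str(size).lower().strip()
        try:
            if value[-1] in units:
                return int(value[:-1]) * units[value[-1]]
            return int(value)
        except (ValueError, IndexError):
            raise Exception("Size must be specified as bytes(b), kilobytes(k), "
                            "megabytes(m), gigabytes(g), e.g. 25g or 512m")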

self.verbose = verbose
self.labels = """--resources='{"trainer": %s, "ps": %s }' """ % (1, 1)

def gen_stop(self):

@jason-dai (Contributor), May 30, 2019:

is it still needed?

time.sleep(self.WAITING_TIME_SEC)
return process_info

def _start_raylet(self, redis_address):

@jason-dai (Contributor), May 30, 2019:

shall we combine the above two methods?

process_info = session_execute(command=command, env=modified_env, tag="ray_master")
JVMGuard.registerPids(process_info.pids)
process_info.node_ip = rservices.get_node_ip_address()
time.sleep(self.WAITING_TIME_SEC)

@jason-dai (Contributor), May 30, 2019:

why sleep here?

python_bin_dir = "/".join(self.python_loc.split("/")[:-1])
return "{}/python {}/ray".format(python_bin_dir, python_bin_dir)

def gen_ray_booter(self):

@jason-dai (Contributor), May 30, 2019:

gen_ray_start

object_store_memory=resourceToBytes(
    str(object_store_memory)) if object_store_memory else None,
verbose=verbose, env=env)
self._gather_cluster_ips()

@jason-dai (Contributor), May 30, 2019:

is this still needed?

yield task_addrs
tc.barrier()

ips = self.sc.range(0, total_cores,

@jason-dai (Contributor), May 30, 2019:

potentially two executors can run on the same node in YARN, and they will have the same ip
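
To illustrate the concern, a hypothetical way to gather one address per physical node rather than per Spark task (illustrative only; sc is the SparkContext):

    def gather_unique_ips(sc, total_cores):
        def task_ip(_):
            import ray.services as rservices
            return rservices.get_node_ip_address()
        ips = sc.range(0, total_cores, numSlices=total_cores).map(task_ip).collect()
        return sorted(set(ips))  # collapse executors that share a host/ip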


def _get_num_ray_nodes(self):
    if "local" in self.sc.master:
        return int(re.match(r"local\[(.*)\]", self.sc.master).group(1))

@jason-dai (Contributor), May 30, 2019:

what if local[*]?
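
A hedged sketch of handling local[*] as well (an assumption about the eventual fix, not the PR's actual code):

    import multiprocessing
    import re

    def get_num_ray_nodes(master):
        match = re.match(r"local\[(.*)\]", master)
        if match:
            value = match.group(1)
            if value == "*":
                return multiprocessing.cpu_count()  # all available cores
            return int(value)
        return 1  # non-local masters are handled elsewhere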

@zhichao-li zhichao-li changed the title RayRunner Support init_on_yarn and RayContext Jun 5, 2019

@@ -43,7 +43,8 @@ if [[ $py_version == *"Python 2.7"* ]]; then
--ignore=../test/zoo/pipeline/api/keras2/ \
--ignore=../test/zoo/pipeline/autograd/ \
--ignore=../test/zoo/pipeline/inference/ \
--ignore=../test/zoo/tfpark/test_text_models.py
--ignore=../test/zoo/tfpark/test_text_models.py \
--ignore=../test/zoo/ray/

@zhichao-li (Author, Contributor), Jun 5, 2019:

Disabling this and testing it manually for now. Will add it back once the Jenkins environment is ready.

@@ -111,56 +149,28 @@ def _yarn_opt(jars):
format(get_analytics_zoo_classpath())
return command

def _submit_opt(master):
def _submit_opt():
    conf = {
        "spark.driver.memory": driver_memory,
        "spark.driver.cores": driver_cores,
        "spark.scheduler.minRegisteredResourcesRatio": "1.0"}
    # "spark.task.cpus": executor_cores}

@jason-dai (Contributor), Jun 7, 2019:

remove this line

python_location=python_location)


def init_on_yarn(hadoop_conf,

@jason-dai (Contributor), Jun 7, 2019:

init_spark_on_yarn? Does it support yarn-client or yarn-cluster?

@zhichao-li (Author, Contributor), Jun 10, 2019:

Only yarn-client is supported for now; will add docs for that.

@@ -20,6 +20,81 @@
import os


def init_on_local(cores=2, conf=None, python_location=None, spark_log_level="WARN",

@jason-dai (Contributor), Jun 7, 2019:

init_spark_on_local?

self.ray_processesMonitor = None
self.verbose = verbose
self.redis_port = self._new_port() if not redis_port else redis_port
self.ray_context = RayServiceFuncGenerator(

@jason-dai (Contributor), Jun 7, 2019:

ray_service instead of ray_context

        return value
    except Exception:
        raise Exception("Size must be specified as bytes(b),"
                        "kilobytes(k), megabytes(m), gibabytes(g). "

@jason-dai (Contributor), Jun 7, 2019:

gigabytes

override def run(): Unit = {
  // Give it a chance to be gracefully killed
  killPids(processToBeKill, "kill ")
  if (processToBeKill.isEmpty) {

@jason-dai (Contributor), Jun 7, 2019:

if (!processToBeKill.isEmpty)

self.ray_processesMonitor.clean_fn()
self.stopped = True

def raw_stop(self):

@jason-dai (Contributor), Jun 7, 2019:

purge?

@zhichao-li zhichao-li changed the title Support init_on_yarn and RayContext Support init_spark_on_yarn and RayContext Jun 10, 2019

@zhichao-li (Contributor, Author) commented Jun 10, 2019:

Build # 2356

@zhichao-li zhichao-li merged commit fe6b03f into intel-analytics:master Jun 10, 2019
