add image analysis w/ tensorflow #318

Open · wants to merge 1 commit into master
Conversation

@h324yang commented Apr 25, 2019

JCDL 2019 demo.

Uses AUT and an SSD model with TensorFlow to run object detection analysis on web archives.


  1. The default setting is standalone mode, so you need to set up a master and slaves first.
  2. Run detect.py to compute and store the object probabilities and the image byte strings.
  3. Run extract_images.py to get image files from the results of step 2.
@codecov-io commented Apr 25, 2019

Codecov Report

Merging #318 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #318   +/-   ##
=======================================
  Coverage   75.95%   75.95%           
=======================================
  Files          41       41           
  Lines        1148     1148           
  Branches      200      200           
=======================================
  Hits          872      872           
  Misses        209      209           
  Partials       67       67

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb05f7...fbf31fb. Read the comment docs.

@ruebot (Member) commented Apr 28, 2019

@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out documentation that we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some?

@lintool do you want #241 open still? Does this supersede it?

@ruebot (Member) commented Apr 28, 2019

...and is this a part of everything that should be included, or just helpers for the work you did on the paper?

@h324yang (Author) commented May 6, 2019

Distributed image analysis via the integration of AUT and TensorFlow

GitHub issue(s): #240 #241

What does this Pull Request do?

  • Integrates AUT and TensorFlow through a Python interface (PySpark).
  • This is the code for the JCDL 2019 paper.
  • Single Shot MultiBox Detector (SSD) is used so far because of its balance between speed and accuracy.
  • The inference scores and the image byte strings are stored first.
  • The image extractor is then used to write out image files (e.g., JPEG, GIF) whose scores are higher than a user-defined threshold.

How should this be tested?

Step 1: Run detection

python aut/src/main/python/tf/detect.py \
		--web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
		--aut_jar aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
		--aut_py aut/src/main/python \
		--spark spark-2.3.2-bin-hadoop2.7/bin \
		--master spark://127.0.1.1:7077 \
		--img_model ssd \
		--filter_size 640 640 \
		--output_path warc_res

Step 2: Extract Images

python aut/src/main/python/tf/extract_images.py \
		--res_dir warc_res \
		--output_dir warc_imgs \
		--threshold 0.85
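
The extraction step can also be driven programmatically. The sketch below mirrors the SSDExtractor calls quoted in the review comments further down; the import path is an assumption:

# Hypothetical programmatic equivalent of Step 2.
from model.object_detection import SSDExtractor

# Read the stored detection scores and image byte strings from
# warc_res, and write out image files for every class whose
# confidence score exceeds 0.85.
extractor = SSDExtractor("warc_res", "warc_imgs")
extractor.extract_and_save(class_ids="all", threshold=0.85)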

Additional Notes:

Python Dependencies

My Python environment is listed here. Though it's not the minimal set of requirements, for a quick setup you can download the list and run pip install -r req.txt.

Note that you should ensure the driver and the workers use the same Python version. You can set it as follows:

export PYSPARK_PYTHON=[YOUR PYTHON]
export PYSPARK_DRIVER_PYTHON=[YOUR PYTHON]

Spark Mode

The default mode is standalone. For example, you can launch it as follows:

cd spark-2.3.2-bin-hadoop2.7
./sbin/start-master.sh
./sbin/start-slave.sh spark://127.0.1.1:7077

The Spark parameters are set via init_spark() in src/main/python/tf/util/init.py.
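
For orientation, here is a minimal sketch of what such an initializer typically looks like; the option names and the app name below are assumptions, not the actual contents of init.py:

from pyspark.sql import SparkSession

def init_spark(master, aut_jar):
    # Hypothetical sketch: attach the aut fatjar so its classes are
    # available on both the driver and the executors.
    return SparkSession.builder \
        .appName("aut-image-analysis") \
        .master(master) \
        .config("spark.jars", aut_jar) \
        .getOrCreate()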

Design Details

  • The pre-trained model and the corresponding dictionary for label mapping are stored in src/main/python/tf/model/graph/ and src/main/python/tf/model/category/, respectively.
  • For each pre-trained model (there is only one for now), we define a model class and an extractor class, e.g., SSD and SSDExtractor in src/main/python/tf/model/object_detection.py.
  • The model class (e.g., SSD) is used to derive the pandas UDF for inference; see the sketch below.
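
For illustration, a minimal sketch of deriving such a UDF under Spark 2.3, assuming a hypothetical predict() method on SSD that maps one image byte string to a list of confidence scores:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, FloatType
from model.object_detection import SSD  # import path is an assumption

model = SSD()  # hypothetical: wraps the frozen TensorFlow graph

@pandas_udf(ArrayType(FloatType()), PandasUDFType.SCALAR)
def detect_udf(image_bytes):
    # Batched inference over a column of image byte strings;
    # returns the per-image detection confidence scores.
    return image_bytes.apply(model.predict)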

Interested parties

@lintool

@ruebot (Member) commented May 30, 2019

@h324yang can you remove the binaries from the PR, and provide code comments and instructions in the PR testing comment on where to locate, download, and place them?

parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')
parser.add_argument('--aut_jar', help='aut compiled jar package', default='aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar')
parser.add_argument('--aut_py', help='path to python package', default='aut/src/main/python')
parser.add_argument('--spark', help='path to python package', default='spark-2.3.2-bin-hadoop2.7/bin')

@ruebot (Member) commented May 31, 2019

Change to: help='Path to Apache Spark.'

parser = argparse.ArgumentParser(description='PySpark for Web Archive Image Retrieval')
parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')
parser.add_argument('--aut_jar', help='aut compiled jar package', default='aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar')
parser.add_argument('--aut_py', help='path to python package', default='aut/src/main/python')

@ruebot (Member) commented May 31, 2019

Is this supposed to be the Python binary? Or something else? Needs a better help description.

def get_args():
    parser = argparse.ArgumentParser(description='PySpark for Web Archive Image Retrieval')
    parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')
    parser.add_argument('--aut_jar', help='aut compiled jar package', default='aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar')

@ruebot (Member) commented May 31, 2019

Change to: help='Path to compiled aut jar.'


def get_args():
    parser = argparse.ArgumentParser(description='PySpark for Web Archive Image Retrieval')
    parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')

@ruebot (Member) commented May 31, 2019

Change to: help='Path to warcs.'

parser.add_argument('--aut_py', help='path to python package', default='aut/src/main/python')
parser.add_argument('--spark', help='path to python package', default='spark-2.3.2-bin-hadoop2.7/bin')
parser.add_argument('--master', help='master IP address', default='spark://127.0.1.1:7077')
parser.add_argument('--img_model', help='model for image processing, use ssd', default='ssd')

@ruebot (Member) commented May 31, 2019

model for image processing, use ssd

If this is the only option, why is there an argument?

extractor = SSDExtractor(args.res_dir, args.output_dir)
extractor.extract_and_save(class_ids="all", threshold=args.threshold)


@ruebot (Member) commented May 31, 2019

Remove unnecessary end lines.



def get_args():
    parser = argparse.ArgumentParser(description='Extracting images from model output')

@ruebot (Member) commented May 31, 2019

Change to: description='Extracting images from model output.'


def get_args():
    parser = argparse.ArgumentParser(description='Extracting images from model output')
    parser.add_argument('--res_dir', help='result (model output) dir')

@ruebot (Member) commented May 31, 2019

Change to: help='Path of result (model output) directory.'

def get_args():
    parser = argparse.ArgumentParser(description='Extracting images from model output')
    parser.add_argument('--res_dir', help='result (model output) dir')
    parser.add_argument('--output_dir', help='extracted image file output dir')

@ruebot (Member) commented May 31, 2019

Change to: help='Path of extracted image file output directory.'

parser = argparse.ArgumentParser(description='Extracting images from model output')
parser.add_argument('--res_dir', help='result (model output) dir')
parser.add_argument('--output_dir', help='extracted image file output dir')
parser.add_argument('--threshold', type=float, help='threshold of detection confidence scores')

@ruebot (Member) commented May 31, 2019

Change to: help='Threshold of detection confidence scores.'
