
add image analysis w/ tensorflow #318

Merged
merged 5 commits into master on Jul 5, 2019

4 participants
@h324yang
Contributor

commented Apr 25, 2019

JCDL2019 demo

Uses AUT and an SSD model with TensorFlow to run object detection analysis on web archives.


  1. The default setting is standalone mode, so set up the master and slaves first.
  2. Run detect.py to compute and store the object probabilities and the image byte strings.
  3. Run extract_images.py to extract image files from the results of step 2.
@codecov-io


commented Apr 25, 2019

Codecov Report

Merging #318 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #318   +/-   ##
=======================================
  Coverage   75.95%   75.95%           
=======================================
  Files          41       41           
  Lines        1148     1148           
  Branches      200      200           
=======================================
  Hits          872      872           
  Misses        209      209           
  Partials       67       67

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb05f7...4d104b0. Read the comment docs.

@ruebot

Member

commented Apr 28, 2019

@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out documentation that we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some?

@lintool do you want #241 open still? Does this supersede it?

@ruebot

Member

commented Apr 28, 2019

...and is this part of everything that should be included, or just helpers for the work you did on the paper?

@h324yang

Contributor Author

commented May 6, 2019

Distributed image analysis via the integration of AUT and Tensorflow

GitHub issue(s): #240 #241

What does this Pull Request do?

  • Integrates AUT and TensorFlow through a Python interface (PySpark).
  • The code for the JCDL 2019 paper.
  • Single Shot MultiBox Detector (SSD) is used so far because of its balance between speed and accuracy.
  • The inference scores and the byte strings of the images are stored first.
  • The image extractor then retrieves the image files (e.g., JPEG, GIF) whose scores are higher than a user-defined threshold.
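The thresholding step can be sketched as follows. This is a hypothetical illustration, not the PR's actual extract_images.py code; the record layout ("url", "scores") is an assumption made for the example.

```python
# Hypothetical sketch of the score-threshold filter described above;
# the record layout ("url", "scores") is assumed for illustration.
def filter_detections(records, threshold=0.85):
    # Keep only images whose best detection score clears the threshold.
    return [r for r in records if max(r["scores"]) >= threshold]

records = [
    {"url": "a.jpg", "scores": [0.91, 0.40]},
    {"url": "b.gif", "scores": [0.30]},
]
kept = filter_detections(records)
print([r["url"] for r in kept])  # -> ['a.jpg']
```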

How should this be tested?

Step 1: Run detection

python aut/src/main/python/tf/detect.py \
		--web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
		--aut_jar aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
		--spark spark-2.3.2-bin-hadoop2.7/bin \
		--master spark://127.0.1.1:7077 \
		--img_model ssd \
		--filter_size 640 640 \
		--output_path warc_res

Step 2: Extract Images

python aut/src/main/python/tf/extract_images.py \
		--res_dir warc_res \
		--output_dir warc_imgs \
		--threshold 0.85

Additional Notes:

Python Dependency

My Python environment is listed here. Though it's not the minimal requirement, to set up quickly you can download it directly and then run pip install -r req.txt.

Note that the driver and workers must use the same Python version. You can set it as follows:

export PYSPARK_PYTHON=[YOUR PYTHON]
export PYSPARK_DRIVER_PYTHON=[YOUR PYTHON]

Spark Mode

The default mode is standalone. For example, you can launch in this mode as follows:

cd spark-2.3.2-bin-hadoop2.7
./sbin/start-master.sh
./sbin/start-slave.sh 127.0.1.1:7077

The Spark parameters are set using init_spark() in src/main/python/tf/util/init.py.
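A rough sketch of the kind of configuration such a helper might build (the keys mirror settings mentioned elsewhere in this thread; the values are illustrative, not the PR's actual defaults, and the function name is hypothetical):

```python
# Hypothetical sketch of the configuration an init_spark()-style helper
# might assemble; the actual src/main/python/tf/util/init.py may differ.
def build_spark_conf(master, app_name="aut-image-analysis"):
    # Keys discussed in this thread; values are examples only.
    return {
        "spark.master": master,
        "spark.app.name": app_name,
        "spark.cores.max": "48",
        "spark.network.timeout": "1000000",
        "spark.sql.execution.arrow.maxRecordsPerBatch": "640",
    }

conf = build_spark_conf("spark://127.0.1.1:7077")
# In a real helper, these pairs would be applied to a pyspark SparkConf
# before creating the SparkContext.
```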

Design Details

  • The pre-trained model and the corresponding dictionary for label mapping are stored in src/main/python/tf/model/graph/ and src/main/python/tf/model/category/, respectively.
  • For each pre-trained model (though there is only one for now), we define a model class and an extractor class, such as SSD and SSDExtractor in src/main/python/tf/model/object_detection.py.
  • The model class (e.g., SSD) is used to derive the pandas UDF for inference.
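The design above can be sketched roughly as follows. The method names and the stubbed bodies are assumptions for illustration; the real classes wrap TensorFlow inference on the frozen SSD graph.

```python
# Hypothetical sketch of the design described above: a model class whose
# per-image inference is lifted into a batch function that, in the PR,
# would be wrapped as a pyspark pandas UDF. The real SSD class runs the
# frozen TensorFlow graph; the prediction here is a stand-in.
class SSD:
    def predict(self, image_bytes):
        # Stand-in for running the SSD graph on one image's bytes.
        return {"scores": [0.9], "classes": ["person"]}

def make_batch_fn(model):
    # In the PR, a function of this shape would be decorated with
    # pyspark.sql.functions.pandas_udf so Spark streams batches of
    # image byte strings to the Python workers.
    def batch_fn(image_bytes_batch):
        return [model.predict(b) for b in image_bytes_batch]
    return batch_fn

fn = make_batch_fn(SSD())
```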

Interested parties

@lintool

@ruebot

Member

commented May 30, 2019

@h324yang can you remove the binaries from the PR, and provide code comments plus instructions in the PR testing comment on where to locate, download, and place them?

Review comments (resolved, outdated) on src/main/python/tf/util/init.py and src/main/python/tf/extract_images.py.
@ruebot

Member

commented Jun 5, 2019

@h324yang I'm unable to get this to run.

$ cat warc-image-classification/run_detection.sh 
export PYSPARK_PYTHON=/home/ruestn/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/ruestn/anaconda3/bin/python

python /home/ruestn/aut/src/main/python/tf/detect.py --web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
    --aut_jar /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
    --aut_py /home/ruestn/aut/src/main/python \
    --spark /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin \
    --master spark://127.0.1.1:7077 \
    --img_model ssd \
    --filter_size 100 100 \
    --output_path /home/ruestn/aut_318_test

I get:

$ ./run_detection.sh 
Traceback (most recent call last):
  File "/home/ruestn/aut/src/main/python/tf/detect.py", line 3, in <module>
    from util.init import *
  File "/home/ruestn/aut/src/main/python/tf/util/init.py", line 4, in <module>
    from pyspark import SparkConf, SparkContext, SQLContext
ModuleNotFoundError: No module named 'pyspark'
@ruebot

Member

commented Jun 5, 2019

Chatting with Leo in Slack; guess who did a 🤦‍♂?

I was giving a path to Python, not PySpark, without having PySpark installed for Anaconda Python.

@ruebot

Member

commented Jun 6, 2019

First pass worked with some tweaks; I changed "spark.cores.max" to "48" and added "spark.network.timeout" set to "1000000".

We should definitely figure out a way to pass the Spark conf settings, since a user will definitely need to tweak them depending on their setup. I don't think we should have the conf settings hard coded in src/main/python/tf/util/init.py.

With auk we just pass a whole bunch of flags when we run Spark. That might not be ideal here since we already pass a lot of flags. Or we just roll with it. Or, we include a sample conf file in the repo, and tell folks to copy that and tweak it as needed.

What do you think @h324yang @lintool @ianmilligan1?

@ianmilligan1

Member

commented Jun 6, 2019

All of the options sound good to me for various reasons! But I think at this stage as a prototype function we could probably just have people add some flags and roll with it – down the line, perhaps as a separate issue, come up with a conf file to try to reduce some of the flag soup? @ruebot

@ruebot

Member

commented Jun 19, 2019

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
@ruebot

Member

commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

@h324yang

Contributor Author

commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

Seems like an OOM error; the arguments I set in util/init.py were optimized and running well on Tuna. I got some errors, but I don't think OOM was a frequent one. Are you also running on Tuna?

Maybe a lower value of "spark.sql.execution.arrow.maxRecordsPerBatch" could help, e.g., 1280 -> 640. (Indeed, tuning such settings bothered me a lot :-/)

@ruebot

Member

commented Jun 24, 2019

@h324yang I ended up dropping it down to 320, and doing 10 WARCs instead of the previous attempts of doing 1000, and 100. It was a lot more stable with 10, and the initial job completed successfully.

@h324yang

Contributor Author

commented Jun 29, 2019

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.

I updated to the TF 1.14.0 API, i.e., tf.io.gfile.GFile.

h324yang added some commits Jun 29, 2019

@h324yang

Contributor Author

commented Jun 29, 2019

@ruebot I've done all the requested changes except for --img_model; the reason is explained in the thread. A conf file has also been added. Please re-review the new commits.

@ruebot

ruebot approved these changes Jul 3, 2019

@ruebot
Member

left a comment

@h324yang we still have the model files. Those need to be pulled out. I don't believe we can distribute them, based on a discussion with @lintool.

@h324yang

Contributor Author

commented Jul 3, 2019

Sorry! That slipped my mind; I've removed it now.
The model is from the TF detection model zoo: ssd_mobilenet_v1_fpn_coco

We can download it and move frozen_inference_graph.pb to the designated folder aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640.

For example:

wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
tar -xzvf ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
mkdir -p aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/
cp ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/frozen_inference_graph.pb aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/

Then, we need the category mapping file mscoco_label_map.pbtxt, which can be downloaded from here; move it to the designated folder aut/src/main/python/tf/model/category/.

For example:

mkdir -p aut/src/main/python/tf/model/category/
cd aut/src/main/python/tf/model/category/
wget https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_label_map.pbtxt
@ruebot

ruebot approved these changes Jul 5, 2019

@ruebot ruebot merged commit 7a61f0e into archivesunleashed:master Jul 5, 2019

3 checks passed

codecov/patch Coverage not affected when comparing 5cb05f7...99f0779
Details
codecov/project 75.95% remains the same compared to 5cb05f7
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details