
add image analysis w/ tensorflow #318

Merged
merged 5 commits into master on Jul 5, 2019

4 participants
@h324yang
Contributor

commented Apr 25, 2019

JCDL2019 demo

Uses AUT and an SSD model with TensorFlow to run object detection analysis on web archives.


  1. The default setting is standalone mode, so set up the master and slaves first.
  2. Run detect.py to compute and store the object probabilities and the image byte strings.
  3. Run extract_images.py to extract image files from the results of step 2.
@codecov-io


commented Apr 25, 2019

Codecov Report

Merging #318 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #318   +/-   ##
=======================================
  Coverage   75.95%   75.95%           
=======================================
  Files          41       41           
  Lines        1148     1148           
  Branches      200      200           
=======================================
  Hits          872      872           
  Misses        209      209           
  Partials       67       67

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb05f7...4d104b0. Read the comment docs.

@ruebot

Member

commented Apr 28, 2019

@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out documentation that we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some?

@lintool do you want #241 open still? Does this supersede it?

@ruebot

Member

commented Apr 28, 2019

...and is this part of everything that should be included, or just helpers for the work you did on the paper?

@h324yang

Contributor Author

commented May 6, 2019

Distributed image analysis via the integration of AUT and Tensorflow

GitHub issue(s): #240 #241

What does this Pull Request do?

  • Integrates AUT and TensorFlow through a Python interface (PySpark).
  • The code for the JCDL 2019 paper.
  • Single Shot MultiBox Detector (SSD) is used so far because of its balance between speed and accuracy.
  • The inference scores and the byte strings of the images are stored first.
  • The image extractor then retrieves the image files (e.g., JPEG, GIF) whose scores are higher than a user-defined threshold.
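The thresholding step can be sketched as follows. This is a hypothetical illustration, not the PR's actual extract_images.py code; the record layout ("url", "scores") is an assumption made for the example.

```python
# Hypothetical sketch of the score-threshold filter described above;
# the record layout ("url", "scores") is assumed for illustration.
def filter_detections(records, threshold=0.85):
    # Keep only images whose best detection score clears the threshold.
    return [r for r in records if max(r["scores"]) >= threshold]

records = [
    {"url": "a.jpg", "scores": [0.91, 0.40]},
    {"url": "b.gif", "scores": [0.30]},
]
kept = filter_detections(records)
print([r["url"] for r in kept])  # -> ['a.jpg']
```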

How should this be tested?

Step 1: Run detection

python aut/src/main/python/tf/detect.py \
		--web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
		--aut_jar aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
		--spark spark-2.3.2-bin-hadoop2.7/bin \
		--master spark://127.0.1.1:7077 \
		--img_model ssd \
		--filter_size 640 640 \
		--output_path warc_res

Step 2: Extract Images

python aut/src/main/python/tf/extract_images.py \
		--res_dir warc_res \
		--output_dir warc_imgs \
		--threshold 0.85

Additional Notes:

Python Dependency

My Python environment is listed here. Though it's not the minimal requirement, to set up quickly you can download it directly and then run pip install -r req.txt.

Note that the driver and workers must use the same Python version. You can set it as follows:

export PYSPARK_PYTHON=[YOUR PYTHON]
export PYSPARK_DRIVER_PYTHON=[YOUR PYTHON]

Spark Mode

The default mode is standalone. For example, you can launch in this mode as follows:

cd spark-2.3.2-bin-hadoop2.7
./sbin/start-master.sh
./sbin/start-slave.sh 127.0.1.1:7077

The Spark parameters are set using init_spark() in src/main/python/tf/util/init.py.
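A rough sketch of the kind of configuration such a helper might build (the keys mirror settings mentioned elsewhere in this thread; the values are illustrative, not the PR's actual defaults, and the function name is hypothetical):

```python
# Hypothetical sketch of the configuration an init_spark()-style helper
# might assemble; the actual src/main/python/tf/util/init.py may differ.
def build_spark_conf(master, app_name="aut-image-analysis"):
    # Keys discussed in this thread; values are examples only.
    return {
        "spark.master": master,
        "spark.app.name": app_name,
        "spark.cores.max": "48",
        "spark.network.timeout": "1000000",
        "spark.sql.execution.arrow.maxRecordsPerBatch": "640",
    }

conf = build_spark_conf("spark://127.0.1.1:7077")
# In a real helper, these pairs would be applied to a pyspark SparkConf
# before creating the SparkContext.
```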

Design Details

  • The pre-trained model and the corresponding dictionary for label mapping are stored in src/main/python/tf/model/graph/ and src/main/python/tf/model/category/, respectively.
  • For each pre-trained model (though there is only one for now), we define a model class and an extractor class, such as SSD and SSDExtractor in src/main/python/tf/model/object_detection.py.
  • The model class (e.g., SSD) is used to derive the pandas UDF for inference.
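The design above can be sketched roughly as follows. The method names and the stubbed bodies are assumptions for illustration; the real classes wrap TensorFlow inference on the frozen SSD graph.

```python
# Hypothetical sketch of the design described above: a model class whose
# per-image inference is lifted into a batch function that, in the PR,
# would be wrapped as a pyspark pandas UDF. The real SSD class runs the
# frozen TensorFlow graph; the prediction here is a stand-in.
class SSD:
    def predict(self, image_bytes):
        # Stand-in for running the SSD graph on one image's bytes.
        return {"scores": [0.9], "classes": ["person"]}

def make_batch_fn(model):
    # In the PR, a function of this shape would be decorated with
    # pyspark.sql.functions.pandas_udf so Spark streams batches of
    # image byte strings to the Python workers.
    def batch_fn(image_bytes_batch):
        return [model.predict(b) for b in image_bytes_batch]
    return batch_fn

fn = make_batch_fn(SSD())
```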

Interested parties

@lintool

@ruebot

Member

commented May 30, 2019

@h324yang can you remove the binaries from the PR, and provide code comments plus instructions in the PR testing comment on where to locate, download, and place them?

Review comments (resolved, outdated) on src/main/python/tf/util/init.py and src/main/python/tf/extract_images.py.
@ruebot

Member

commented Jun 5, 2019

@h324yang I'm unable to get this to run.

$ cat warc-image-classification/run_detection.sh 
export PYSPARK_PYTHON=/home/ruestn/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/ruestn/anaconda3/bin/python

python /home/ruestn/aut/src/main/python/tf/detect.py --web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
    --aut_jar /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
    --aut_py /home/ruestn/aut/src/main/python \
    --spark /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin \
    --master spark://127.0.1.1:7077 \
    --img_model ssd \
    --filter_size 100 100 \
    --output_path /home/ruestn/aut_318_test

I get:

$ ./run_detection.sh 
Traceback (most recent call last):
  File "/home/ruestn/aut/src/main/python/tf/detect.py", line 3, in <module>
    from util.init import *
  File "/home/ruestn/aut/src/main/python/tf/util/init.py", line 4, in <module>
    from pyspark import SparkConf, SparkContext, SQLContext
ModuleNotFoundError: No module named 'pyspark'
@ruebot

Member

commented Jun 5, 2019

Chatting with Leo in Slack; guess who did a 🤦‍♂?

I was giving a path to Python, not PySpark, without having PySpark installed for Anaconda Python.

@ruebot

Member

commented Jun 6, 2019

First pass worked with some tweaks; I changed "spark.cores.max" to "48" and added "spark.network.timeout" set to "1000000".

We should definitely figure out a way to pass the Spark conf settings, since a user will definitely need to tweak them depending on their setup. I don't think we should have the conf settings hard coded in src/main/python/tf/util/init.py.

With auk we just pass a whole bunch of flags when we run Spark. That might not be ideal here since we already pass a lot of flags. Or we just roll with it. Or, we include a sample conf file in the repo, and tell folks to copy that and tweak it as needed.

What do you think @h324yang @lintool @ianmilligan1?

@ianmilligan1

Member

commented Jun 6, 2019

All of the options sound good to me for various reasons! But I think at this stage as a prototype function we could probably just have people add some flags and roll with it – down the line, perhaps as a separate issue, come up with a conf file to try to reduce some of the flag soup? @ruebot

@ruebot

Member

commented Jun 19, 2019

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
@ruebot

Member

commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

@h324yang

Contributor Author

commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

Seems like an OOM error; the arguments I set in util/init.py were optimized and running well on Tuna. I got some errors, but I don't think OOM was a frequent one. Are you also running on Tuna?

Maybe a lower value of "spark.sql.execution.arrow.maxRecordsPerBatch" could help, e.g., 1280 -> 640. (Indeed, tuning such settings bothered me a lot :-/)

@ruebot

Member

commented Jun 24, 2019

@h324yang I ended up dropping it down to 320, and doing 10 WARCs instead of the previous attempts of doing 1000, and 100. It was a lot more stable with 10, and the initial job completed successfully.

@h324yang

Contributor Author

commented Jun 29, 2019

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.

I updated to the TF 1.14.0 API, i.e., tf.io.gfile.GFile.

h324yang added some commits Jun 29, 2019

@h324yang

Contributor Author

commented Jun 29, 2019

@ruebot I've done all the requested changes except for --img_model; the reason is explained in the thread. A conf file has also been added. Please re-review the new commits.

@ruebot

ruebot approved these changes Jul 3, 2019

@ruebot
Member

left a comment

@h324yang we still have the model files. Those need to be pulled out. I don't believe we can distribute them, based on a discussion with @lintool.

@h324yang

Contributor Author

commented Jul 3, 2019

Sorry! That slipped my mind; I've removed it now.
The model is from the TF detection model zoo: ssd_mobilenet_v1_fpn_coco

We can download it and move frozen_inference_graph.pb to the designated folder aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640.

For example:

wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
tar -xzvf ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
mkdir -p aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/
cp ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/frozen_inference_graph.pb aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/

Then, we need the category mapping file mscoco_label_map.pbtxt, which can be downloaded from here; move it to the designated folder aut/src/main/python/tf/model/category/.

For example:

mkdir -p aut/src/main/python/tf/model/category/
cd aut/src/main/python/tf/model/category/
wget https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_label_map.pbtxt
@ruebot

ruebot approved these changes Jul 5, 2019

@ruebot ruebot merged commit 7a61f0e into archivesunleashed:master Jul 5, 2019

3 checks passed

codecov/patch Coverage not affected when comparing 5cb05f7...99f0779
Details
codecov/project 75.95% remains the same compared to 5cb05f7
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details