add image analysis w/ tensorflow #318

Open · wants to merge 1 commit into base: master

Conversation

@h324yang

commented Apr 25, 2019

JCDL 2019 demo

Using AUT and an SSD model with TensorFlow to do object detection analysis on web archives.


  1. The default setting is standalone mode, so you need to set up the Spark master and slaves first.
  2. Run detect.py to compute and store the object probabilities and the image byte strings.
  3. Run extract_images.py to get image files from the results of step 2.
@codecov-io


commented Apr 25, 2019

Codecov Report

Merging #318 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #318   +/-   ##
=======================================
  Coverage   75.95%   75.95%           
=======================================
  Files          41       41           
  Lines        1148     1148           
  Branches      200      200           
=======================================
  Hits          872      872           
  Misses        209      209           
  Partials       67       67

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb05f7...fbf31fb. Read the comment docs.

@ruebot

Member

commented Apr 28, 2019

@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out the documentation we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some?

@lintool do you want #241 open still? Does this supersede it?

@ruebot

Member

commented Apr 28, 2019

...and is this a part of everything that should be included, or just helpers for the work you did on the paper?

@h324yang

Author

commented May 6, 2019

Distributed image analysis via the integration of AUT and TensorFlow

GitHub issue(s): #240 #241

What does this Pull Request do?

  • Integrates AUT and TensorFlow with a Python interface (PySpark).
  • Contains the code for the JCDL 2019 paper.
  • Single Shot MultiBox Detector (SSD) is used so far, because of its balance between speed and accuracy.
  • The inference scores and the byte strings of the images are stored first.
  • The image extractor is then used to get the image files (i.e., JPEG, GIF, etc.) whose scores are higher than the threshold defined by users; see the sketch after this list.
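
A minimal sketch of that thresholding step, assuming hypothetical field names ("scores", "bytes") and a flat per-image record layout; the actual logic lives in SSDExtractor.extract_and_save:

import os

def save_images_above_threshold(rows, output_dir, threshold=0.85):
    # Keep only images whose best detection score clears the threshold.
    # The "scores"/"bytes" keys and the .jpg extension are assumptions
    # for illustration only.
    os.makedirs(output_dir, exist_ok=True)
    for i, row in enumerate(rows):
        if max(row["scores"]) >= threshold:
            with open(os.path.join(output_dir, "img_{}.jpg".format(i)), "wb") as f:
                f.write(row["bytes"])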

How should this be tested?

Step 1: Run detection

python aut/src/main/python/tf/detect.py \
		--web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
		--aut_jar aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
		--aut_py aut/src/main/python \
		--spark spark-2.3.2-bin-hadoop2.7/bin \
		--master spark://127.0.1.1:7077 \
		--img_model ssd \
		--filter_size 640 640 \
		--output_path warc_res

Step 2: Extract Images

python aut/src/main/python/tf/extract_images.py \
		--res_dir warc_res \
		--output_dir warc_imgs \
		--threshold 0.85

Additional Notes:

Python Dependency

My Python environment is listed here. Though it's not the minimal requirement, to set up quickly you can download it directly and then run pip install -r req.txt.

Note that you should ensure that the driver and the workers use the same Python version. You can set it as follows:

export PYSPARK_PYTHON=[YOUR PYTHON]
export PYSPARK_DRIVER_PYTHON=[YOUR PYTHON]

Spark Mode

The default mode is standalone. For example, you can launch it as follows:

cd spark-2.3.2-bin-hadoop2.7
./sbin/start-master.sh
./sbin/start-slave.sh spark://127.0.1.1:7077

The Spark parameters are set by init_spark() in src/main/python/tf/util/init.py.
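
For reference, a rough sketch of what a hardcoded init_spark() can look like; the signature and the exact settings below are assumptions based on values mentioned later in this thread, not the file's actual contents:

from pyspark import SparkConf, SparkContext, SQLContext

def init_spark(master, aut_jar):
    # Hardcoded settings (assumed values); see the discussion below about
    # making these configurable instead of baking them in here.
    conf = (SparkConf()
            .setMaster(master)
            .set("spark.jars", aut_jar)
            .set("spark.cores.max", "48")
            .set("spark.network.timeout", "1000000")
            .set("spark.sql.execution.arrow.enabled", "true")
            .set("spark.sql.execution.arrow.maxRecordsPerBatch", "1280"))
    sc = SparkContext(conf=conf)
    return sc, SQLContext(sc)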

Design Details

  • The pre-trained model and the corresponding dictionary for label mapping are stored in src/main/python/tf/model/graph/ and src/main/python/tf/model/category/, respectively.
  • For each pre-trained model (though there is only one for now), we define a model class and an extractor class, e.g., SSD and SSDExtractor in src/main/python/tf/model/object_detection.py.
  • The model class (e.g., SSD) is used to derive the pandas UDF for inference; see the sketch below.
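
A rough illustration of that last bullet, assuming a hypothetical model.predict(bytes) method that returns a list of per-class scores; the real interface lives in object_detection.py:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, FloatType

def make_detect_udf(model):
    # Wrap the model in a scalar (Arrow-based) pandas UDF so inference
    # runs over batches of image byte strings on the workers.
    @pandas_udf(ArrayType(FloatType()), PandasUDFType.SCALAR)
    def detect(image_bytes):
        return image_bytes.apply(lambda b: model.predict(b))
    return detect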

Interested parties

@lintool

@ruebot

Member

commented May 30, 2019

@h324yang can you remove the binaries from the PR, and provide code comments and instructions in the PR testing comment on where to locate them, download them, and place them?

parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')
parser.add_argument('--aut_jar', help='aut compiled jar package', default='aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar')
parser.add_argument('--aut_py', help='path to python package', default='aut/src/main/python')
parser.add_argument('--spark', help='path to python package', default='spark-2.3.2-bin-hadoop2.7/bin')

@ruebot

ruebot May 31, 2019

Member

Change to: help='Path to Apache Spark.'

parser = argparse.ArgumentParser(description='PySpark for Web Archive Image Retrieval')
parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')
parser.add_argument('--aut_jar', help='aut compiled jar package', default='aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar')
parser.add_argument('--aut_py', help='path to python package', default='aut/src/main/python')

@ruebot

ruebot May 31, 2019

Member

Is this supposed to be the Python binary? Or something else? Needs a better help description.

def get_args():
parser = argparse.ArgumentParser(description='PySpark for Web Archive Image Retrieval')
parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')
parser.add_argument('--aut_jar', help='aut compiled jar package', default='aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar')

@ruebot

ruebot May 31, 2019

Member

Change to: help='Path to compiled aut jar.'


def get_args():
parser = argparse.ArgumentParser(description='PySpark for Web Archive Image Retrieval')
parser.add_argument('--web_archive', help='input directory for web archive data', default='/tuna1/scratch/nruest/geocites/warcs')

@ruebot

ruebot May 31, 2019

Member

Change to: help='Path to warcs.'

parser.add_argument('--aut_py', help='path to python package', default='aut/src/main/python')
parser.add_argument('--spark', help='path to python package', default='spark-2.3.2-bin-hadoop2.7/bin')
parser.add_argument('--master', help='master IP address', default='spark://127.0.1.1:7077')
parser.add_argument('--img_model', help='model for image processing, use ssd', default='ssd')

@ruebot

ruebot May 31, 2019

Member

model for image processing, use ssd

If this is the only option, why is there an argument?

extractor = SSDExtractor(args.res_dir, args.output_dir)
extractor.extract_and_save(class_ids="all", threshold=args.threshold)


@ruebot

ruebot May 31, 2019

Member

Remove unnecessary end lines.



def get_args():
parser = argparse.ArgumentParser(description='Extracting images from model output')

@ruebot

ruebot May 31, 2019

Member

Change to: description='Extracting images from model output.'


def get_args():
parser = argparse.ArgumentParser(description='Extracting images from model output')
parser.add_argument('--res_dir', help='result (model output) dir')

@ruebot

ruebot May 31, 2019

Member

Change to: help='Path of result (model output) directory.'

def get_args():
parser = argparse.ArgumentParser(description='Extracting images from model output')
parser.add_argument('--res_dir', help='result (model output) dir')
parser.add_argument('--output_dir', help='extracted image file output dir')

@ruebot

ruebot May 31, 2019

Member

Change to: help='Path of extracted image file output directory.'

parser = argparse.ArgumentParser(description='Extracting images from model output')
parser.add_argument('--res_dir', help='result (model output) dir')
parser.add_argument('--output_dir', help='extracted image file output dir')
parser.add_argument('--threshold', type=float, help='threshold of detection confidence scores')

@ruebot

ruebot May 31, 2019

Member

Change to: help='Threshold of detection confidence scores.'

@ruebot

Member

commented Jun 5, 2019

@h324yang I'm unable to get this to run.

$ cat warc-image-classification/run_detection.sh 
export PYSPARK_PYTHON=/home/ruestn/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/ruestn/anaconda3/bin/python

python /home/ruestn/aut/src/main/python/tf/detect.py --web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
    --aut_jar /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
    --aut_py /home/ruestn/aut/src/main/python \
    --spark /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin \
    --master spark://127.0.1.1:7077 \
    --img_model ssd \
    --filter_size 100 100 \
    --output_path /home/ruestn/aut_318_test

I get:

$ ./run_detection.sh 
Traceback (most recent call last):
  File "/home/ruestn/aut/src/main/python/tf/detect.py", line 3, in <module>
    from util.init import *
  File "/home/ruestn/aut/src/main/python/tf/util/init.py", line 4, in <module>
    from pyspark import SparkConf, SparkContext, SQLContext
ModuleNotFoundError: No module named 'pyspark'
@ruebot

Member

commented Jun 5, 2019

Chatting with Leo in Slack; guess who did a 🤦‍♂?

I was giving a path to Python, not PySpark, without having PySpark installed for Anaconda Python.

@ruebot

Member

commented Jun 6, 2019

First pass worked with some tweaks; I changed "spark.cores.max" to "48" and added "spark.network.timeout", "1000000".

We should definitely figure out a way to pass the Spark conf settings, since a user will definitely need to tweak them depending on their setup. I don't think we should have the conf settings hard coded in src/main/python/tf/util/init.py.

With auk we just pass a whole bunch of flags when we run Spark. That might not be ideal here, since we already pass a lot of flags. Or we just roll with it. Or we include a sample conf file in the repo, and tell folks to copy that and tweak it as needed.
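
If we go the sample-conf-file route, a minimal sketch of reading spark-defaults-style key/value pairs into a SparkConf (the file name and format are assumptions):

from pyspark import SparkConf

def load_spark_conf(path):
    # Read "key value" pairs, skipping blanks and comments, and apply
    # them to a SparkConf that init_spark() could then build on.
    conf = SparkConf()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                key, value = line.split(None, 1)
                conf.set(key, value)
    return conf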

What do you think @h324yang @lintool @ianmilligan1?

@ianmilligan1

Member

commented Jun 6, 2019

All of the options sound good to me for various reasons! But I think at this stage as a prototype function we could probably just have people add some flags and roll with it – down the line, perhaps as a separate issue, come up with a conf file to try to reduce some of the flag soup? @ruebot

@ruebot

Member

commented Jun 19, 2019

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
@ruebot

Member

commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

@h324yang

Author

commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

Seems like an OOM error. The arguments I set in util/init.py were tuned and ran well on Tuna. I got some errors, but I don't think OOM was a frequent one. Did you also run on Tuna?

Maybe a lower value of "spark.sql.execution.arrow.maxRecordsPerBatch" could help, e.g., 1280 -> 640. (Indeed, tuning such settings bothered me a lot :-/)
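
For example, assuming that setting stays in init_spark()'s SparkConf for now, the change would just be:

conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "640")  # was 1280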
