Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract popular images - Data Frame implementation #382

Merged
merged 13 commits into from Nov 21, 2019

Conversation

@SinghGursimran
Copy link
Contributor

SinghGursimran commented Nov 20, 2019

Extract popular images - Data Frame implementation

#380

For Testing:

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz",sc)
					 .images()	  

ExtractPopularImages(df,10,30,30).show()
@SinghGursimran

This comment has been minimized.

Copy link
Contributor Author

SinghGursimran commented Nov 20, 2019

Scala doesn't support function overloading with default arguments. For the RDD implementation, minWidth and minHeight arguments were optional. For the current data frame implementation, they are necessary. If it is required to be kept as optional, I can

  1. Shift df implementation to a new object (ExtractPopularImagesDF)
    OR
  2. give a new name to the method. (That would require calling method name along with the object at the time of implementation)
@codecov

This comment has been minimized.

Copy link

codecov bot commented Nov 20, 2019

Codecov Report

Merging #382 into master will increase coverage by 0.22%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master    #382      +/-   ##
=========================================
+ Coverage   76.47%   76.7%   +0.22%     
=========================================
  Files          40      41       +1     
  Lines        1437    1451      +14     
  Branches      268     268              
=========================================
+ Hits         1099    1113      +14     
  Misses        221     221              
  Partials      117     117
@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 20, 2019

A new method, ExtractPopularImagesDF makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

@SinghGursimran tests? 😃

@ianmilligan1

This comment has been minimized.

Copy link
Member

ianmilligan1 commented Nov 20, 2019

A new method, ExtractPopularImagesDF makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

That makes sense to me too!

@lintool

This comment has been minimized.

Copy link
Member

lintool commented Nov 20, 2019

Let's use a different convention for "end-to-end" functionalities. One option would be to have all UDFs be verb phrases, e.g., ExtractX, and "end-to-end" functionalities be noun phrases. So this would be PopularImagesExtractor. That way it'll be clear on what to use in what context.

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 20, 2019

@lintool so, should we have @SinghGursimran change the existing RDD method to PopularImagesExtractorRDD, and this new one should be PopularImagesExtractorDF?

@lintool

This comment has been minimized.

Copy link
Member

lintool commented Nov 20, 2019

Yes, if you like my suggestion of nouns vs. verbs.

I.e., UDFs are verbs, "do this".
"Assemblies" are thing-doers.

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 20, 2019

Cool.

That make sense @SinghGursimran?

@SinghGursimran

This comment has been minimized.

Copy link
Contributor Author

SinghGursimran commented Nov 20, 2019

@ruebot
Ya! I will change accordingly.
Regarding tests, there's an orderBy() function in implementation. If the value of key is same it orders randomly. If the count of two image URLs is same, the order of Data Frame row might change on running the code again. For the archive in resources, all images appear only once that is they all have count one. Adding a static test won't be feasible in such a case.

g285sing
@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 21, 2019

@SinghGursimran so, for the test. Can we assert other items in the DataFrame that is returned, that is not dependent on the order it returns in? count maybe? Or, combine distinct and count to hit the same thing each time?

@SinghGursimran

This comment has been minimized.

Copy link
Contributor Author

SinghGursimran commented Nov 21, 2019

@SinghGursimran so, for the test. Can we assert other items in the DataFrame that is returned, that is not dependent on the order it returns in? count maybe? Or, combine distinct and count to hit the same thing each time?

Actually, for the archive available in the resources, the count is 1 for each data entry in the row.
The sequence of the whole row changes for the same count on subsequent running of the code.
But, yes, I can keep count as the factor because it is always 1. Shall I do that?

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 21, 2019

Yes, let's do that to get something in there, and we can loop back around to it later and see if we can in improve it.

ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 21, 2019
… Hash".

- See archivesunleashed/aut#382
@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 21, 2019

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 21, 2019

Tested on 10 local GeoCities WARCs:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1",sc)
                                  .images()        

ExtractPopularImages(df,10,30,30).show()

// Exiting paste mode, now interpreting.

19/11/21 12:09:57 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
[Stage 0:>                (0 + 10) / 10][Stage 1:>                 (0 + 0) / 10]19/11/21 12:10:00 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

19/11/21 12:10:00 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
+--------------------+-----+                                                    
|                 url|count|
+--------------------+-----+
|http://geocities....|   51|
|http://geocities....|   31|
|http://www.geocit...|   29|
|http://i24.photob...|   28|
|http://geocities....|   26|
|http://geocities....|   24|
|http://geocities....|   22|
|http://www.geocit...|   22|
|http://www.geocit...|   22|
|http://geocities....|   21|
+--------------------+-----+

import io.archivesunleashed._
import io.archivesunleashed.app._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 8 more fields]

I'll squash and merge once we get the test.

g285sing added 2 commits Nov 21, 2019
@ruebot
ruebot approved these changes Nov 21, 2019
@ruebot ruebot merged commit 4042180 into archivesunleashed:master Nov 21, 2019
3 checks passed
3 checks passed
codecov/patch 100% of diff hit (target 76.47%)
Details
codecov/project 76.7% (+0.22%) compared to c4eaca9
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
ianmilligan1 added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 22, 2019
#28)

* Add example for Scala DF version of "Extract Most Frequent Images MD5 Hash".

- See archivesunleashed/aut#382

* rename
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.