Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract popular images - Data Frame implementation #382

Open
wants to merge 9 commits into
base: master
from

Conversation

@SinghGursimran
Copy link
Contributor

SinghGursimran commented Nov 20, 2019

Extract popular images - Data Frame implementation

#380

For Testing:

import io.archivesunleashed._
import io.archivesunleashed.app._

val df = RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz",sc)
					 .images()	  

ExtractPopularImages(df,10,30,30).show()
@SinghGursimran

This comment has been minimized.

Copy link
Contributor Author

SinghGursimran commented Nov 20, 2019

Scala doesn't support function overloading with default arguments. For the RDD implementation, minWidth and minHeight arguments were optional. For the current data frame implementation, they are necessary. If it is required to be kept as optional, I can

  1. Shift df implementation to a new object (ExtractPopularImagesDF)
    OR
  2. give a new name to the method. (That would require calling method name along with the object at the time of implementation)
@codecov

This comment has been minimized.

Copy link

codecov bot commented Nov 20, 2019

Codecov Report

Merging #382 into master will decrease coverage by 0.53%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master     #382      +/-   ##
==========================================
- Coverage   76.23%   75.69%   -0.54%     
==========================================
  Files          40       40              
  Lines        1422     1432      +10     
  Branches      268      268              
==========================================
  Hits         1084     1084              
- Misses        221      231      +10     
  Partials      117      117
@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 20, 2019

A new method, ExtractPopularImagesDF makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

@SinghGursimran tests? 😃

@ianmilligan1

This comment has been minimized.

Copy link
Member

ianmilligan1 commented Nov 20, 2019

A new method, ExtractPopularImagesDF makes sense to me. @ianmilligan1 @lintool work for you, or y'all prefer another path?

That makes sense to me too!

@lintool

This comment has been minimized.

Copy link
Member

lintool commented Nov 20, 2019

Let's use a different convention for "end-to-end" functionalities. One option would be to have all UDFs be verb phrases, e.g., ExtractX, and "end-to-end" functionalities be noun phrases. So this would be PopularImagesExtractor. That way it'll be clear on what to use in what context.

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 20, 2019

@lintool so, should we have @SinghGursimran change the existing RDD method to PopularImagesExtractorRDD, and this new one should be PopularImagesExtractorDF?

@lintool

This comment has been minimized.

Copy link
Member

lintool commented Nov 20, 2019

Yes, if you like my suggestion of nouns vs. verbs.

I.e., UDFs are verbs, "do this".
"Assemblies" are thing-doers.

@ruebot

This comment has been minimized.

Copy link
Member

ruebot commented Nov 20, 2019

Cool.

That make sense @SinghGursimran?

@SinghGursimran

This comment has been minimized.

Copy link
Contributor Author

SinghGursimran commented Nov 20, 2019

@ruebot
Ya! I will change accordingly.
Regarding tests, there's an orderBy() function in implementation. If the value of key is same it orders randomly. If the count of two image URLs is same, the order of Data Frame row might change on running the code again. For the archive in resources, all images appear only once that is they all have count one. Adding a static test won't be feasible in such a case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.