More Data Frame Implementations + Code Refactoring #383

Merged

14 commits merged into archivesunleashed:master on Nov 21, 2019

Conversation

SinghGursimran (Contributor) commented Nov 20, 2019

Added more DataFrame implementations, along with some code refactoring.

Addresses #223.

For testing:

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepImages()
  .all()
  .select(ComputeImageSizeDF($"bytes").as("size"), ComputeMD5DF($"bytes").as("md5"), ComputeSHA1DF($"bytes").as("sha1"))
  .show(20, false)
```
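For context, a minimal sketch of how binary-column UDFs of this sort could be implemented; these are illustrative stand-ins, not the actual io.archivesunleashed.df definitions:

```scala
import java.security.MessageDigest
import org.apache.spark.sql.functions.udf

// Hypothetical stand-ins for the PR's hashing UDFs: digest the raw bytes
// of each record and return a lowercase hex string.
val md5UDF = udf((bytes: Array[Byte]) =>
  MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString)

val sha1UDF = udf((bytes: Array[Byte]) =>
  MessageDigest.getInstance("SHA-1").digest(bytes).map("%02x".format(_)).mkString)
```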
codecov bot commented Nov 20, 2019

Codecov Report

Merging #383 into master will increase coverage by 0.24%.
The diff coverage is 100%.

```
@@            Coverage Diff             @@
##           master     #383      +/-   ##
==========================================
+ Coverage   76.23%   76.47%   +0.24%     
==========================================
  Files          40       40              
  Lines        1422     1437      +15     
  Branches      268      268              
==========================================
+ Hits         1084     1099      +15     
  Misses        221      221              
  Partials      117      117
```
ruebot (Member) commented Nov 20, 2019

Interesting! I like it. Can you add a test?

ruebot (Member) left a comment

A couple of small changes. Overall, great!

Let's see what @lintool thinks about the name, and we might as well get this added to the Python side of things as well, so it works with PySpark: https://github.com/archivesunleashed/aut/blob/e2ec5a17502709de08c191fbf4783a3f6f0e8199/src/main/python/aut/common.py

You should be able to see the implementation pattern there. But if not, get hold of me on Slack, and I'm happy to help out.

```diff
@@ -27,6 +27,12 @@ class DataFrameLoader(sc: SparkContext) {
       .pages()
   }

+  def pagesWithBytes(path: String): DataFrame = {
```

ruebot (Member) commented Nov 20, 2019

Let's get a doc comment for this method. You can just crib from the methods around it for the pattern.
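A sketch of what such a doc comment might look like, cribbing from the style of neighboring loader methods (the exact wording and described columns are assumptions here):

```scala
/** Creates a DataFrame of pages, including the raw bytes of each record.
  *
  * @param path the path to the WARC/ARC file(s)
  * @return a DataFrame with the usual page columns plus a binary bytes column.
  */
def pagesWithBytes(path: String): DataFrame = {
```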

```diff
@@ -115,6 +115,24 @@ package object archivesunleashed {
     sqlContext.getOrCreate().createDataFrame(records, schema)
   }

+  /* Creates a column for Bytes as well in Dataframe.
+     Call KeepImages OR KeepValidPages on RDD depending upon the requirement before calling this method */
+  def pagesWithBytes(): DataFrame = {
```

ruebot (Member) commented Nov 20, 2019

@lintool you like naming things, you good with this one, or you got something better? 😃

ruebot (Member) commented Nov 21, 2019

@SinghGursimran as per the Slack conversation, let's call this one all.

I can take care of renaming pages to html tomorrow or later tonight under the cover of a separate PR.

```diff
@@ -30,7 +30,7 @@ import io.archivesunleashed.matchbox.ExtractDate.DateComponent.DateComponent
 import java.net.URI
 import java.net.URL
 import org.apache.spark.sql.{DataFrame, Row, SparkSession}
-import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
+import org.apache.spark.sql.types.{IntegerType, StringType, BinaryType, StructField, StructType}
```

ruebot (Member) commented Nov 20, 2019

Let's get this in alphabetical order so Scalastyle is happy. Pedantic, sorry.
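For reference, the alphabetized form of that import would read:

```scala
import org.apache.spark.sql.types.{BinaryType, IntegerType, StringType, StructField, StructType}
```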

lintool (Member) commented Nov 20, 2019

With DF, we shouldn't need pagesWithBytes; we should be okay with the widest possible table (as many fields as we need) in a single method. Spark is supposed to do the optimization behind the scenes, i.e., if you don't select a column, Spark should be smart enough to project it away early on.

Worth testing just to make sure.
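One quick way to test it (a sketch; all() is the method name adopted in this PR, and the url column name is an assumption):

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// If Spark prunes unselected columns, the optimized plan for this query
// should not carry the heavy "bytes" column at all.
val df = RecordLoader.loadArchives("example.warc.gz", sc).all()
df.select($"url").explain(true)
// Inspect the projected schema in the optimized logical plan:
// "bytes" should have been projected away.
```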

SinghGursimran (Contributor, Author) commented Nov 20, 2019

@lintool Initially, I added this only for images, since I had to keep images for analysis. Instead of making it specific to images, I created a general method, considering that "bytes" might be required for non-image analysis in the future as well.
The previous "pages" method keeps only valid pages and drops images.
Also, that "pages" method is already used in many cases where we actually require only valid pages, so I thought it best not to change it.
If you want, I can change "pagesWithBytes" to be specific to images.
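For context, the usage pattern described by the doc comment in the diff would look like this (a sketch; keepValidPages and keepImages are existing aut RDD filters, and all is the name settled on later in this thread):

```scala
import io.archivesunleashed._

// Valid HTML pages, with a bytes column:
RecordLoader.loadArchives("example.warc.gz", sc).keepValidPages().all()

// Images, with a bytes column (as in the test snippet above):
RecordLoader.loadArchives("example.warc.gz", sc).keepImages().all()
```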

ruebot (Member) left a comment

Updates from Slack convo


ruebot approved these changes Nov 21, 2019

ruebot and others added 3 commits Nov 21, 2019

ruebot merged commit c4eaca9 into archivesunleashed:master Nov 21, 2019

1 check was pending: continuous-integration/travis-ci/pr (The Travis CI build is in progress)