More Data Frame Implementations + Code Refactoring #383
codecov (bot) commented Nov 20, 2019
Codecov Report
@@            Coverage Diff             @@
##           master     #383      +/-   ##
==========================================
+ Coverage   76.23%   76.47%   +0.24%
==========================================
  Files          40       40
  Lines        1422     1437      +15
  Branches      268      268
==========================================
+ Hits         1084     1099      +15
  Misses        221      221
  Partials      117      117
Interesting! I like it. Can you add a test?
A couple of small changes. Overall, great! Let's see what @lintool thinks about the name. We might as well get this added to the Python side of things too, so it works with PySpark: https://github.com/archivesunleashed/aut/blob/e2ec5a17502709de08c191fbf4783a3f6f0e8199/src/main/python/aut/common.py You should be able to see the implementation pattern there. If not, get hold of me in Slack, and I'm happy to help out.
@@ -27,6 +27,12 @@ class DataFrameLoader(sc: SparkContext) {
       .pages()
   }

+
+  def pagesWithBytes(path: String): DataFrame = {
ruebot (Member), Nov 20, 2019
Let's get a doc comment for this method. You can just crib from the methods around it for the pattern.
@@ -115,6 +115,24 @@ package object archivesunleashed {
       sqlContext.getOrCreate().createDataFrame(records, schema)
     }

+
+    /*Creates a column for Bytes as well in Dataframe.
+      Call KeepImages OR KeepValidPages on RDD depending upon the requirement before calling this method */
+    def pagesWithBytes(): DataFrame = {
ruebot (Member), Nov 20, 2019
@lintool you like naming things, you good with this one, or you got something better?
ruebot (Member), Nov 21, 2019
@SinghGursimran as per Slack conversation, let's call this one `all`. I can take care of renaming `pages` to `html` tomorrow or later tonight under the cover of a separate PR.
@@ -30,7 +30,7 @@ import io.archivesunleashed.matchbox.ExtractDate.DateComponent.DateComponent
 import java.net.URI
 import java.net.URL
 import org.apache.spark.sql.{DataFrame, Row, SparkSession}
-import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
+import org.apache.spark.sql.types.{IntegerType, StringType, BinaryType, StructField, StructType}
ruebot (Member), Nov 20, 2019
Let's get this in alphabetical order so Scalastyle is happy. Pedantic, sorry.
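For illustration, the reordering requested here would presumably look like the line below, with the selectors inside the braces sorted alphabetically. This is a sketch of the reviewer's suggestion, not the commit that was actually pushed.

```scala
// Selectors sorted alphabetically: BinaryType now comes first.
import org.apache.spark.sql.types.{BinaryType, IntegerType, StringType, StructField, StructType}
```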
With DF, we shouldn't need Worth testing just to make sure.
@lintool Initially, I added this only for images, since I had to keep images for analysis. Rather than making it specific to images, I created a general method, considering that "bytes" might be required in the future for non-image analysis as well.
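As a rough illustration of the general "bytes" column described above, here is a minimal, self-contained Spark sketch that builds a DataFrame whose schema includes a `BinaryType` field via `createDataFrame(rows, schema)`. The object name `BytesColumnSketch`, the helper `toDataFrame`, the column names, and the sample records are all hypothetical; the schema this PR actually uses is not shown in the thread.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{BinaryType, StringType, StructField, StructType}

object BytesColumnSketch {
  // Hypothetical column names, chosen for illustration only.
  val schema: StructType = StructType(Seq(
    StructField("url", StringType, nullable = true),
    StructField("mime_type", StringType, nullable = true),
    StructField("bytes", BinaryType, nullable = true)
  ))

  // Turn (url, mime type, raw bytes) records into a DataFrame with a binary column.
  def toDataFrame(spark: SparkSession,
                  records: Seq[(String, String, Array[Byte])]): DataFrame = {
    val rows = spark.sparkContext.parallelize(records).map {
      case (url, mime, bytes) => Row(url, mime, bytes)
    }
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bytes-column-sketch")
      .getOrCreate()

    // Made-up sample records: one image, one HTML page.
    val df = toDataFrame(spark, Seq(
      ("http://example.com/a.png", "image/png", Array[Byte](1, 2, 3)),
      ("http://example.com/index.html", "text/html", "<html></html>".getBytes("UTF-8"))
    ))

    df.printSchema()
    spark.stop()
  }
}
```

Keeping the bytes in a general column like this leaves filtering to the caller, for example by restricting on the MIME type column when only images are wanted.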
Updates from Slack convo
Merged c4eaca9 into archivesunleashed:master
SinghGursimran commented Nov 20, 2019
Added more Data Frame Implementations along with some Code Refactoring
#223
For Testing: