Dataframe matchbox Implementations #387

Merged
merged 12 commits into from Dec 5, 2019

@SinghGursimran
Contributor

SinghGursimran commented Dec 4, 2019

DataFrame implementations for ExtractDate, DetectLanguage and ExtractBoilerpipeText

For Testing:

ExtractDate:

import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions._


RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc)
			.webpages()
			.select(ExtractDateDF($"crawl_date",lit("YYYY")))
			.show(3,false)

DetectLanguage:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc)
			.webpages()
			.select(DetectLanguageDF($"content"))
			.show(3,false)

ExtractBoilerpipeText:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc)
			.webpages()
			.select(ExtractBoilerpipeTextDF($"content"))
			.show(3,false)
@ruebot ruebot self-requested a review Dec 4, 2019
@codecov

codecov bot commented Dec 4, 2019

Codecov Report

Merging #387 into master will decrease coverage by 0.73%.
The diff coverage is 48.27%.

@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
- Coverage    76.7%   75.97%   -0.74%     
==========================================
  Files          41       41              
  Lines        1451     1469      +18     
  Branches      268      274       +6     
==========================================
+ Hits         1113     1116       +3     
- Misses        221      236      +15     
  Partials      117      117
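
The percentages in the report look consistent with hits divided by lines. That is an assumption about how the report derives them, not something stated in this thread. A minimal Scala sketch to check the arithmetic (the `CoverageCheck` object and `coverage` method are hypothetical names introduced for illustration):

```scala
object CoverageCheck {
  // Assumption: coverage percentage = 100 * hits / lines.
  def coverage(hits: Int, lines: Int): Double = 100.0 * hits / lines

  def main(args: Array[String]): Unit = {
    val before = coverage(1113, 1451) // master, from the report above
    val after  = coverage(1116, 1469) // with #387 merged, from the report above
    println(f"before: $before%.2f%%")
    println(f"after:  $after%.2f%%")
    println(f"change: ${after - before}%.2f%%")
  }
}
```

Under that assumption the two figures come out at roughly 76.7 and 75.97, a drop of about 0.74 points, which matches the table.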
@ruebot
ruebot approved these changes Dec 5, 2019
Member

ruebot left a comment

Couple formatting and comment prose tweaks.

Tested, and works as expected. I'll get a PR in for the docs.

@@ -26,6 +26,7 @@ import java.util.Base64
/**
* UDFs for data frames.
*/

@ruebot

ruebot Dec 5, 2019

Member

Let's remove this blank line.

@@ -49,4 +49,29 @@ object ExtractDate {
""
}
}

/** Extracts the wanted date component from a date (for DataFrames).

@ruebot

ruebot Dec 5, 2019

Member

Let's reword this to:

Extracts a provided date component from a date (for DataFrames).
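
As a rough illustration of what "extracts a provided date component from a date" means, here is a minimal sketch in plain Scala, without Spark. It is not the code from this PR. The `DateComponentSketch` object and its `extract` method are hypothetical, and both the `YYYYMMDD` input layout and the component names other than `YYYY` are assumptions.

```scala
object DateComponentSketch {
  // Hypothetical helper: returns the requested component of a date string
  // assumed to start with YYYYMMDD, or "" when the input is missing or too short.
  def extract(fullDate: String, component: String): String =
    if (fullDate == null || fullDate.length < 8) {
      ""
    } else {
      component match {
        case "YYYY"   => fullDate.substring(0, 4) // year
        case "MM"     => fullDate.substring(4, 6) // month
        case "DD"     => fullDate.substring(6, 8) // day
        case "YYYYMM" => fullDate.substring(0, 6) // year and month
        case _        => fullDate.substring(0, 8) // fall back to the full date
      }
    }

  def main(args: Array[String]): Unit = {
    println(extract("20080430", "YYYY"))   // prints 2008
    println(extract("20080430", "YYYYMM")) // prints 200804
  }
}
```

A DataFrame version would presumably wrap a function like this so it can be applied to a column such as `crawl_date`, as in the `ExtractDateDF($"crawl_date",lit("YYYY"))` call in the testing notes.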

g285sing added 2 commits Dec 5, 2019
@ruebot
ruebot approved these changes Dec 5, 2019
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Dec 5, 2019
@ruebot

Member

ruebot commented Dec 5, 2019

@ruebot ruebot merged commit 079cd24 into archivesunleashed:master Dec 5, 2019
1 of 3 checks passed
codecov/patch 48.27% of diff hit (target 76.7%)
codecov/project 75.97% (-0.74%) compared to 560ed2b
continuous-integration/travis-ci/pr The Travis CI build passed
2 participants