Setup for Serializable APIs on DataFrames #389

Merged
merged 12 commits into archivesunleashed:master on Dec 17, 2019

Contributor

SinghGursimran commented Dec 16, 2019

Setup for Serializable APIs on DataFrames

#223

For Testing:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc)
			.all()
			.keepValidPagesDF()
			.select($"crawl_date",$"url")
			.show(10,false)

Since this migrates RDD functionality to DataFrames, the final result should be the same as the one given by the following transformation:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz",sc)
			.webpages()
			.select($"crawl_date",$"url")
			.show(10,false)
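
As a rough way to check this beyond eyeballing the `show()` output, the two results can be compared as sets. This is only a sketch for the Spark shell: it assumes `sc` and the `$"..."` column syntax are in scope as in the snippets above, and the names `path`, `viaFilter`, `viaWebpages`, `onlyInFilter`, and `onlyInWebpages` are made up for illustration. Note that `except` ignores duplicate rows, so this checks set equality rather than exact row counts.

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Same test fixture as in the snippets above.
val path = "./src/test/resources/arc/example.arc.gz"

// New DataFrame filter introduced by this PR.
val viaFilter = RecordLoader.loadArchives(path, sc)
  .all()
  .keepValidPagesDF()
  .select($"crawl_date", $"url")

// Existing webpages() extraction used as the reference.
val viaWebpages = RecordLoader.loadArchives(path, sc)
  .webpages()
  .select($"crawl_date", $"url")

// Rows that appear in one result but not in the other.
val onlyInFilter = viaFilter.except(viaWebpages).count()
val onlyInWebpages = viaWebpages.except(viaFilter).count()

println(s"only in keepValidPagesDF: $onlyInFilter, only in webpages: $onlyInWebpages")
```

If the two paths are equivalent on this fixture, both counts should be 0.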

codecov bot commented Dec 16, 2019

Codecov Report

Merging #389 into master will increase coverage by 0.15%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #389      +/-   ##
==========================================
+ Coverage   76.87%   77.03%   +0.15%     
==========================================
  Files          40       40              
  Lines        1466     1476      +10     
  Branches      274      274              
==========================================
+ Hits         1127     1137      +10     
  Misses        217      217              
  Partials      122      122

Member

ruebot commented Dec 16, 2019

@SinghGursimran got tests? 😃

Contributor Author

SinghGursimran commented Dec 16, 2019

> @SinghGursimran got tests? 😃

Oh! Sorry, I forgot about that. I will add them now.

Member

ruebot commented Dec 17, 2019

Looks good!

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz",sc)
			.all()
			.keepValidPagesDF()
			.select($"crawl_date",$"url")
			.show(10,false)
			
+----------+-----------------------------------------------------------------------------------+
|crawl_date|url                                                                                |
+----------+-----------------------------------------------------------------------------------+
|20091027  |http://geocities.com/babiekaos/Links.html                                          |
|20091027  |http://geocities.com/cloneaccount3/6490/                                           |
|20091027  |http://www.geocities.com/coledale28/hi-power-soldiers-music.html                   |
|20091027  |http://www.geocities.com/orvilleduncan811/12-day-of-christmas-sheet-music.html     |
|20091027  |http://geocities.com/jtbm71/fotos/2000/                                            |
|20091027  |http://geocities.com/cancmay/s/sunshine.html                                       |
|20091027  |http://www.talent-direct.com/cgi-bin/tal_pro.cgi?profile=ARZCdYbJU5KsMARKdUxiO4l3DY|
|20091027  |http://geocities.com/akimi919/sp_ph/?M=A                                           |
|20091027  |http://geocities.com/cancmay/s/save-tonight.html                                   |
|20091027  |http://www.geocities.com/orvilleduncan811/child-youth-elbow-knee-pad.html          |
+----------+-----------------------------------------------------------------------------------+
only showing top 10 rows

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz",sc)
			.webpages()
			.select($"crawl_date",$"url")
			.show(10,false)
			
+----------+-----------------------------------------------------------------------------------+
|crawl_date|url                                                                                |
+----------+-----------------------------------------------------------------------------------+
|20091027  |http://geocities.com/babiekaos/Links.html                                          |
|20091027  |http://geocities.com/cloneaccount3/6490/                                           |
|20091027  |http://www.geocities.com/coledale28/hi-power-soldiers-music.html                   |
|20091027  |http://www.geocities.com/orvilleduncan811/12-day-of-christmas-sheet-music.html     |
|20091027  |http://geocities.com/jtbm71/fotos/2000/                                            |
|20091027  |http://geocities.com/cancmay/s/sunshine.html                                       |
|20091027  |http://www.talent-direct.com/cgi-bin/tal_pro.cgi?profile=ARZCdYbJU5KsMARKdUxiO4l3DY|
|20091027  |http://geocities.com/akimi919/sp_ph/?M=A                                           |
|20091027  |http://geocities.com/cancmay/s/save-tonight.html                                   |
|20091027  |http://www.geocities.com/orvilleduncan811/child-youth-elbow-knee-pad.html          |
+----------+-----------------------------------------------------------------------------------+
only showing top 10 rows

ruebot approved these changes Dec 17, 2019

Member

ruebot commented Dec 17, 2019

Once @ianmilligan1 merges #388, I'll get this merged. Should be here in a bit, or later this afternoon.

Member

ruebot commented Dec 17, 2019

I think we're good on docs for this, since we already have webpages() documented in a few places.

[nruest@wombat:aut-docs] (git)-[master]-$ ag -R "webpages()" current 
current/image-analysis.md
302:              .webpages()

current/collection-analysis.md
47:WebArchive(sc, sqlContext, "src/test/resources/warc/example.warc.gz").webpages() \
94:df = archive.webpages()

current/setting-up-aut.md
217:webpages = archive.webpages()
218:webpages.printSchema()

current/link-analysis.md
198:df = archive.webpages()

ruebot merged commit ca928d8 into archivesunleashed:master Dec 17, 2019
3 checks passed
codecov/patch 100% of diff hit (target 76.87%)
codecov/project 77.03% (+0.15%) compared to 9e32284
continuous-integration/travis-ci/pr The Travis CI build passed