Matchbox utilities to DataFrames #380

Merged
10 commits merged into archivesunleashed:master on Nov 18, 2019

Conversation

@SinghGursimran
Contributor

SinghGursimran commented Nov 17, 2019

Extends the matchbox utilities to DataFrames.

Issue: #223

For Testing:

import io.archivesunleashed.df._
import io.archivesunleashed._

RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .pages()
  .select(GetExtensionMime($"url", $"mime_type_web_server").as("extension"))
  .show(20, false)

RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .pages()
  .select($"url".as("url"), explode_outer(ExtractImageLinks($"url", $"content")).as("imageLinks"))
  .show(20, false)

RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .pages()
  .select(ComputeMD5DF($"content").as("MD5Hash"))
  .show(20, false)

RecordLoader.loadArchives("./ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .pages()
  .select(ComputeSHA1($"content").as("SHA1Hash"))
  .show(20, false)
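
For readers unfamiliar with `explode_outer` in the second query: Spark produces one output row per array element, and a row whose array is null or empty still yields a single row with a null value. The sketch below mimics that behaviour on plain Scala collections so it can be run without Spark. `ExplodeOuterSketch` and `explodeOuter` are hypothetical names used only for illustration; they are not part of the toolkit or of Spark, and `None` stands in for SQL null.

```scala
object ExplodeOuterSketch {
  // Each input row is (key, values). Emit one (key, Some(value)) per element;
  // a row with no values is kept as a single (key, None) instead of being dropped.
  def explodeOuter[A, B](rows: Seq[(A, Seq[B])]): Seq[(A, Option[B])] =
    rows.flatMap { case (key, values) =>
      val exploded: Seq[Option[B]] =
        if (values.isEmpty) Seq(None) else values.map(Some(_))
      exploded.map(value => (key, value))
    }
}
```

This is why the `url` column repeats once per extracted image link when the query is run, and why pages without any image links would still appear once with a null `imageLinks` value.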
@codecov

codecov bot commented Nov 17, 2019

Codecov Report

Merging #380 into master will increase coverage by 0.06%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
+ Coverage   76.16%   76.23%   +0.06%     
==========================================
  Files          40       40              
  Lines        1418     1422       +4     
  Branches      268      268              
==========================================
+ Hits         1080     1084       +4     
  Misses        221      221              
  Partials      117      117
@ruebot

Member

ruebot commented Nov 18, 2019

Solid. Let's make that small tweak, and I'll merge. Thanks @SinghGursimran!

+---------+                                                                     
|extension|
+---------+
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|html     |
|htm      |
|html     |
|htm      |
|html     |
|html     |
|htm      |
|html     |
|html     |
+---------+
only showing top 20 rows

+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
|url                                                                          |imageLinks                                                                                        |
+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/logo.jpg                  |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/btn_rss.gif               |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/mainbanner_5.jpg          |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/ev_speaks.jpg             |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/arrowright.gif            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/bnr_experience.jpg        |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/hdr_exper.jpg             |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/arrowright.gif            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/bnr_action.jpg            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/arrowright.gif            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/bnr_news3.jpg             |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/arrowright.gif            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/arrowright.gif            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/arrowright.gif            |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/sponsors/merck_frosst.jpg |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/sponsors/swc.jpg          |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/sponsors/enbridge.jpg     |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/sponsors/janssen_ortho.jpg|
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/sponsors/telus.jpg        |
|http://www.equalvoice.ca/images/images/french/js/images/sponsors/enbridge.jpg|http://www.equalvoice.ca/images/images/french/js/images/sponsors/images/sponsors/cibc.jpg         |
+-----------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
only showing top 20 rows

+--------------------------------+
|MD5Hash                         |
+--------------------------------+
|43080eb4a517016a55dbd6eb721876ed|
|df930922f68845b00a2d827c58b20852|
|876b1466c0cc33b43daec3f212db1b8e|
|c1835f1a2eacef60aec9d6ca4a49b5af|
|b4f392a203b9d6492d3b2e3991a51c1c|
|0c09dc7286cdc5d0745da8e96f12195d|
|2b1ba24768290ddaba1a5925e98038b9|
|139c357aba3089270c23269e51df46fc|
|619888ebfbd073ef93685fb7ed9cb6ce|
|0b1d2bf4350dd76cb3242b294f91231d|
|1d1a45168e1ee68915d9ff90588b9f1c|
|d994c5a717057218db5a4966618f0311|
|91244df58cefe7ff1b555870d7407f42|
|2aa77d59ab242354080b0861e8b5140d|
|0e304566e430bf4694c52919f39da047|
|292ee14685bebd3eabf409b356a04211|
|463c54230e1d77ec4db98ce9f257c417|
|9db9e8cb0c7ae00048615623ac26c57a|
|e8c5b6b88b8de89f3ea3d4349fc7db59|
|0f26c55255ba6fa4136636f73b79d406|
+--------------------------------+
only showing top 20 rows

+----------------------------------------+
|SHA1Hash                                |
+----------------------------------------+
|0b5d369a5ae6d341642dab9e2bc6b56246adfc7c|
|b1616d9ab173a6a4737ef53a2a3aa5aa832edd4b|
|d815844ee63d167d8276c445c887b09b5fa12393|
|0b990b59b5b9cb78c7163e87485ca968cd85ecf7|
|7a859f5c32eee0d0c0333a68b7492f17cfea7f98|
|69863c660dbd8c5d4e4df53780ce878c43c9cce6|
|38e6874972e9236294135f8236388479e80d40bc|
|45d7d34c2400cfb96a2b93398bd956193508687e|
|5d3b1dbe9d0fb82928a14d86cb8f96726b052d7c|
|a8810cb3926ba06c6239b135dcac6df1b2d05633|
|eb0fb595cd33af362cfff5dd171697ce5deb2c59|
|e8a0602a6b5d76924b13ec800096403f33aafc75|
|0b5f376f6a8d9767eed161ec8bd30ca29d04998a|
|024b3c7ad5a0b0805b66c80592c0c869cee622b2|
|4a6906df2f777a83b4813f444042863cdbc9c7ac|
|aa09f10ce985737e05cbf032c867517e96d34c37|
|6b8b752ed983796b458373d790299d0f74eda73b|
|b927c5ea334c73060fd3ed478521f2b3f4cce4d0|
|932fb2d7782972fd68036064541d32c9e6279a1c|
|f9fc2a3ee9db808833a52aec20c8e025dc39e9cf|
+----------------------------------------+
only showing top 20 rows

import io.archivesunleashed.df._
import io.archivesunleashed._

val ComputeMD5DF = udf((content: String) => io.archivesunleashed.matchbox.ComputeMD5.apply(content.getBytes()))

val ComputeSHA1 = udf((content: String) => io.archivesunleashed.matchbox.ComputeSHA1.apply(content.getBytes()))

@ruebot

ruebot Nov 18, 2019

Member

I'm thinking we should add DF to the end of this one as well, so it will potentially cut down on confusion.
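
As a rough illustration of what the two UDFs above wrap, here is a minimal, Spark-free sketch that hex-encodes MD5 and SHA-1 digests with `java.security.MessageDigest`. It is a stand-in under stated assumptions, not the matchbox implementation: `DigestSketch`, `hexDigest`, `md5` and `sha1` are hypothetical names, and the assumption that the matchbox functions return lowercase hex strings is inferred only from the sample output in this thread. The sketch also passes an explicit UTF-8 charset, whereas `content.getBytes()` with no argument uses the JVM's default charset.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object DigestSketch {
  // Hex-encode the digest of `bytes` under a JCA algorithm name such as "MD5" or "SHA-1".
  def hexDigest(algorithm: String, bytes: Array[Byte]): String =
    MessageDigest.getInstance(algorithm)
      .digest(bytes)
      .map(b => "%02x".format(b & 0xff)) // mask to 0..255 so negative bytes format as two hex digits
      .mkString

  // Explicit UTF-8 keeps the hash independent of the platform default charset.
  def md5(content: String): String =
    hexDigest("MD5", content.getBytes(StandardCharsets.UTF_8))

  def sha1(content: String): String =
    hexDigest("SHA-1", content.getBytes(StandardCharsets.UTF_8))
}
```

In a Spark session, functions like these could be wrapped the same way as in the diff, for example `udf((content: String) => DigestSketch.md5(content))`.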

g285sing
@ruebot
ruebot approved these changes Nov 18, 2019
@ruebot ruebot merged commit a081d7b into archivesunleashed:master Nov 18, 2019
3 checks passed
codecov/patch 100% of diff hit (target 76.16%)
codecov/project 76.23% (+0.06%) compared to 67ca17d
continuous-integration/travis-ci/pr The Travis CI build passed
2 participants