Move data frame fields names to snake_case. #327

Merged
merged 1 commit into master on Jul 18, 2019

3 participants
@ruebot (Member) commented Jul 17, 2019

GitHub issue(s): #229

What does this Pull Request do?

Move data frame field names to snake_case.
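
For readers unfamiliar with the convention, here is a minimal sketch of what a snake_case conversion looks like in plain Scala. This is not code from the PR: the `FieldNames` object and `toSnakeCase` helper are hypothetical, and the PR does not list the old field names, so the Pascal-cased inputs in the comments are assumed for illustration only.

```scala
object FieldNames {
  // Convert a camelCase or PascalCase identifier to snake_case.
  def toSnakeCase(name: String): String =
    name
      // Split an acronym from the word that follows it: "URLString" -> "URL_String"
      .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
      // Split a lowercase letter or digit from a following capital: "CrawlDate" -> "Crawl_Date"
      .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
      .toLowerCase
}

// Hypothetical inputs, chosen to line up with the new names in this PR:
// FieldNames.toSnakeCase("CrawlDate")  // "crawl_date"
// FieldNames.toSnakeCase("MimeType")   // "mime_type"
// FieldNames.toSnakeCase("mime_type")  // "mime_type" (already snake_case, unchanged)
```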

How should this be tested?

  1. Pull down branch
  2. rm -rf ~/.m2/repository/* && mvn clean install
  3. rm -rf ~/.ivy2/* && ~/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell --packages io.archivesunleashed:aut:0.17.1-SNAPSHOT
  4. Test the following scripts:

List of Domains

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc)
  .extractValidPagesDF()

df.printSchema()

df.select(ExtractBaseDomain($"url").as("domain"))
  .groupBy("domain").count().orderBy(desc("count")).show()

// Exiting paste mode, now interpreting.

root
 |-- crawl_date: string (nullable = true)
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- content: string (nullable = true)

+--------------------+-----+                                                    
|              domain|count|
+--------------------+-----+
|       geocities.com|18341|
|   www.geocities.com| 4183|
|animaldiversity.u...|    4|
|webspace.webring.com|    3|
|          weather.bg|    2|
|   www.sitemeter.com|    2|
|       www.mystat.pl|    2|
|nude-tyhai-girls....|    1|
|stop-unwanted-pgo...|    1|
|        www.icoc.org|    1|
|   www.gospelcom.net|    1|
|   www.gafa-koeln.de|    1|
|  pub12.bravenet.com|    1|
|bilder-sextv.sand...|    1|
|horn-eng.sexsex.s...|    1|
|  www.adi-design.org|    1|
|gay-eng.sexsex.sa...|    1|
|       builds-il.com|    1|
|    www.prempree.com|    1|
|   pic.geocities.com|    1|
+--------------------+-----+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [crawl_date: string, url: string ... 2 more fields]

Hyperlink Network

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc)
  .extractHyperlinksDF()

df.printSchema()

df.select(RemovePrefixWWW(ExtractBaseDomain($"src")).as("src_domain"),
    RemovePrefixWWW(ExtractBaseDomain($"dest")).as("dest_domain"))
  .groupBy("src_domain", "dest_domain").count().orderBy(desc("src_domain")).show()

// Exiting paste mode, now interpreting.

root
 |-- crawl_date: string (nullable = true)
 |-- src: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- anchor: string (nullable = true)

+--------------------+--------------------+-----+                               
|          src_domain|         dest_domain|count|
+--------------------+--------------------+-----+
|webspace.webring.com|         webring.com|    6|
|webspace.webring.com|gettysburg.cdmhos...|    1|
|webspace.webring.com|dunedinlibraries.com|    1|
|webspace.webring.com|    bardic-music.com|    1|
|webspace.webring.com|     georgelloyd.com|    2|
|webspace.webring.com|webspace.webring.com|    3|
|webspace.webring.com|       teara.govt.nz|    1|
|webspace.webring.com|hyperion-records....|    1|
|webspace.webring.com|                    |    1|
|webspace.webring.com|         indiana.edu|    1|
|webspace.webring.com|  whirligig-tv.co.uk|    1|
|webspace.webring.com|          bris.ac.uk|    1|
|webspace.webring.com|        hwwilson.com|    1|
|webspace.webring.com|  orchestranet.co.uk|    1|
|webspace.webring.com|       geocities.com|    1|
|webspace.webring.com|classicalmusic.co.uk|    1|
|webspace.webring.com|       eldrbarry.net|    1|
|webspace.webring.com|     musicweb.uk.net|   10|
|     warpedspace.org|     internet.com.uy|    1|
|     warpedspace.org|   c.viardot.free.fr|    1|
+--------------------+--------------------+-----+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [crawl_date: string, src: string ... 2 more fields]

Image Analysis

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocites/1/*gz", sc)
  .extractImageDetailsDF()

df.printSchema()

df.select($"url", $"mime_type", $"width", $"height", $"md5", $"bytes").orderBy(desc("md5")).show()

// Exiting paste mode, now interpreting.

root
 |-- url: string (nullable = true)
 |-- mime_type: string (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
 |-- md5: string (nullable = true)
 |-- bytes: string (nullable = true)

+--------------------+----------+-----+------+--------------------+--------------------+
|                 url| mime_type|width|height|                 md5|               bytes|
+--------------------+----------+-----+------+--------------------+--------------------+
|http://geocities....| image/gif|  450|   300|ffffb9dbe3bc135e9...|R0lGODlhwgEsAeYAA...|
|http://geocities....| text/html|    0|     0|fffdcef3dc2d35afd...|PEhUTUw+PEhFQUQ+P...|
|http://geocities....|image/jpeg|  300|   201|fffc050897bb59612...|/9j/4AAQSkZJRgABA...|
|http://geocities....| image/gif|  143|   150|fff7909e24515e253...|R0lGODlhjwCWAPcAA...|
|http://www.geocit...|image/jpeg|  800|   248|fff771f736ef5c7d0...|/9j/4AAQSkZJRgABA...|
|http://geocities....| image/gif|  248|    60|fff35cb4fe9711295...|R0lGODlh+AA8AIAAA...|
|http://geocities....|image/jpeg|  423|   549|fff314108cac46e23...|/9j/4AAQSkZJRgABA...|
|http://geocities....| image/gif|   53|    64|ffec4f50051f526b1...|R0lGODlhNQBAAIcAA...|
|http://geocities....| image/gif|   90|   320|ffeb52a309fcd4905...|R0lGODlhWgBAAbMAA...|
|http://geocities....|image/jpeg|  270|   175|ffe855670a1619852...|/9j/4AAQSkZJRgABA...|
|http://geocities....|image/jpeg|  150|   177|ffe22af593db84b8d...|/9j/4AAQSkZJRgABA...|
|http://geocities....|image/jpeg|  367|   433|ffe1ee98aead8a0bd...|/9j/4AAQSkZJRgABA...|
|http://geocities....|image/jpeg|  542|   789|ffe1d329704d6f7b8...|/9j/4AAQSkZJRgABA...|
|http://geocities....|image/jpeg|  800|   600|ffdeeacd915b4218c...|/9j/4AAQSkZJRgABA...|
|http://geocities....| image/gif|   75|    50|ffd93f41d9809150f...|R0lGODlhSwAyALMAA...|
|http://geocities....|image/jpeg|  263|   155|ffd91a558b0250944...|/9j/4AAQSkZJRgABA...|
|http://geocities....|image/jpeg|  100|    97|ffd387e0ec3617fe1...|/9j/4AAQSkZJRgABA...|
|http://geocities....| image/gif|  100|    21|ffd3857d353bec674...|R0lGODlhZAAVAPcAA...|
|http://geocities....|image/jpeg|   46|    63|ffd2b13fb18d1e095...|/9j/4AAQSkZJRgABA...|
|http://geocities....| image/gif|  118|    40|ffd0fa4f0c923a369...|R0lGODdhdgAoAPAAA...|
+--------------------+----------+-----+------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [url: string, mime_type: string ... 4 more fields]
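
A quick way to confirm that every column in the three schemas above follows the convention is to check the names against a pattern. The sketch below is plain Scala with no Spark dependency; `SnakeCaseCheck`, `isSnakeCase`, and `nonConforming` are hypothetical names, and the regular expression is one reasonable definition of snake_case rather than anything AUT enforces.

```scala
object SnakeCaseCheck {
  // Lowercase letters and digits, in groups separated by single underscores.
  private val SnakeCase = "^[a-z][a-z0-9]*(_[a-z0-9]+)*$".r

  def isSnakeCase(name: String): Boolean =
    SnakeCase.pattern.matcher(name).matches()

  // Return the names that do not follow the convention.
  def nonConforming(columns: Seq[String]): Seq[String] =
    columns.filterNot(isSnakeCase)

  // Column names copied from the three printSchema() outputs above.
  val validPages: Seq[String] = Seq("crawl_date", "url", "mime_type", "content")
  val hyperlinks: Seq[String] = Seq("crawl_date", "src", "dest", "anchor")
  val imageDetails: Seq[String] =
    Seq("url", "mime_type", "width", "height", "md5", "bytes")
}
```

In spark-shell the same check could be run against a live data frame with `SnakeCaseCheck.nonConforming(df.columns.toSeq)`, which should come back empty for the schemas shown here.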

Additional Notes:

@ianmilligan1 @lintool should we go a bit further on some of these field names? For example, should we change src to source, or dest_domain to destination_domain? If so, I'm happy to update this PR.
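
If the longer names are wanted without another change to the library, a caller can rename columns on their own side. A minimal sketch, assuming it is pasted into spark-shell (where `spark` is predefined) and using a made-up one-row stand-in for the `extractHyperlinksDF()` output; `source` and `destination` are only the names floated above, not fields AUT provides:

```scala
// Run inside spark-shell, where `spark` is already defined.
import spark.implicits._

// Stand-in for extractHyperlinksDF(), using the schema printed above.
// The row values are invented for illustration.
val links = Seq(
  ("20190717", "http://example.com/a", "http://example.org/b", "a link")
).toDF("crawl_date", "src", "dest", "anchor")

// withColumnRenamed returns a new DataFrame; `links` itself is unchanged.
val renamed = links
  .withColumnRenamed("src", "source")
  .withColumnRenamed("dest", "destination")

renamed.printSchema()
```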

Also, we need to update the documentation at archivesunleashed.org/aut for the next release. The above examples should do it, since that's where they came from.

@ruebot requested review from lintool and ianmilligan1 Jul 17, 2019

@codecov-io commented Jul 17, 2019

Codecov Report

Merging #327 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #327   +/-   ##
=======================================
  Coverage   74.97%   74.97%           
=======================================
  Files          39       39           
  Lines        1123     1123           
  Branches      197      197           
=======================================
  Hits          842      842           
  Misses        215      215           
  Partials       66       66

Impacted Files                                         Coverage Δ
...chivesunleashed/app/DomainFrequencyExtractor.scala  100% <100%> (ø) ⬆️
...o/archivesunleashed/app/DomainGraphExtractor.scala  100% <100%> (ø) ⬆️
src/main/scala/io/archivesunleashed/package.scala      84.54% <100%> (ø) ⬆️
.../io/archivesunleashed/app/PlainTextExtractor.scala  100% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0e701b2...edec1b2.

@ianmilligan1 (Member) left a comment

Tested all scripts locally with my own sample data and all worked perfectly – great stuff, @ruebot. If you want to change the field names to be more descriptive, happy to re-review too, but otherwise happy to merge. Just let me know.

@ruebot (Member, Author) commented Jul 18, 2019

eh, let's just leave it as is now.

@ianmilligan1 merged commit f35d54e into master Jul 18, 2019

3 checks passed

codecov/patch 100% of diff hit (target 74.97%)
codecov/project 74.97% (+0%) compared to 0e701b2
continuous-integration/travis-ci/pr The Travis CI build passed