Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve and clean-up Scaladocs; resolves #184 #193

Merged
merged 28 commits into from Apr 11, 2018

Conversation

Projects
None yet
4 participants
@ruebot
Copy link
Member

ruebot commented Apr 6, 2018

GitHub issue(s):

#184

What does this Pull Request do?

Improve and clean-up Scaladocs so that we will have much improved Scaladocs for the next release.

Every single scala file was touched.

How should this be tested?

  • TravisCI should take care of the build test
  • mvn site -DskipTests and visit target/site/scaladocs/index.html with a browser
  • We should do a smoke test with some auk standard output since we pulled in #189 and #186.

Interested parties

@lintool I'd appreciate your eyes on the doc comments in the code.
@ianmilligan1 can you do the smoke test, and any other review that you'd like to do?

Before this is merged let me know since we're squashing and merging this, the commit message is going to be pretty crazy with all the commits we've had go into this.

@greebie++ for his work on this.

ruebot and others added some commits Mar 20, 2018

@greebie

This comment has been minimized.

Just a small typo here. ext => text.

@codecov

This comment has been minimized.

Copy link

codecov bot commented Apr 6, 2018

Codecov Report

Merging #193 into master will decrease coverage by 0.05%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #193      +/-   ##
==========================================
- Coverage   67.34%   67.29%   -0.06%     
==========================================
  Files          32       32              
  Lines         637      636       -1     
  Branches      124      124              
==========================================
- Hits          429      428       -1     
  Misses        167      167              
  Partials       41       41
Impacted Files Coverage Δ
src/main/scala/io/archivesunleashed/package.scala 71.66% <ø> (ø) ⬆️
...io/archivesunleashed/matchbox/NER3Classifier.scala 0% <ø> (ø) ⬆️
...in/scala/io/archivesunleashed/util/JsonUtils.scala 100% <ø> (ø) ⬆️
...ala/io/archivesunleashed/matchbox/ComputeMD5.scala 100% <ø> (ø) ⬆️
...la/io/archivesunleashed/matchbox/ExtractDate.scala 60% <ø> (ø) ⬆️
.../io/archivesunleashed/matchbox/ExtractDomain.scala 100% <ø> (ø) ⬆️
...io/archivesunleashed/matchbox/DetectLanguage.scala 100% <ø> (ø) ⬆️
.../scala/io/archivesunleashed/app/WriteGraphML.scala 100% <ø> (ø) ⬆️
...scala/io/archivesunleashed/ArchiveRecordImpl.scala 81.57% <ø> (ø) ⬆️
...la/io/archivesunleashed/matchbox/ExtractUrls.scala 100% <ø> (ø) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3163ace...28fca42. Read the comment docs.

@ruebot ruebot requested review from lintool and ianmilligan1 Apr 9, 2018

@@ -29,6 +29,11 @@ import org.archive.io.arc.ARCRecord
import org.archive.io.warc.WARCRecord
import org.archive.util.ArchiveUtils

/** ArchiveRecord.
*
* @constructor an archive record.

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

I believe the style guide calls for a noun phrase without trailing period?
http://www.oracle.com/technetwork/java/javase/tech/index-137868.html

This comment has been minimized.

@ruebot

ruebot Apr 9, 2018

Author Member

That's Javadocs, not Scaladocs.

This comment has been minimized.

@ruebot

ruebot Apr 9, 2018

Author Member

...that said, it would be nice to have something other than "ArchiveRecord." here. Do you have a suggestion?

This comment has been minimized.

@ruebot

ruebot Apr 10, 2018

Author Member

@lintool are you good with this as is, or should we change it? I think this is the only outstanding left before we can merge.

This comment has been minimized.

@greebie

greebie Apr 10, 2018

Contributor

I had

The `ArchiveRecord` [class](https://docs.scala-lang.org/tour/classes.html) is used by 
`RecordLoader` to manage the ingestion of WARC and ARC files. It represents one 
record from a collection found in the file. It uses the Internet Archive's ARCRecord &
 WARCRecord APIs to extract data from web crawls such as the crawl date, content strings, urls and so on.

`ArchiveRecord` is a generic record that covers data from both ARCRecord and
 WARCRecord format. AUT will convert either format into the generic `ArchiveRecord`
 class.``` 
in the original reference document.

So maybe ```ArchiveRecord is a class for ingesting WARC and ARC files.```
or ```Used by RecordLoader to extract data from WARC and ARC files.``` 

@@ -21,27 +21,35 @@ import io.archivesunleashed.matchbox.{NER3Classifier, RemoveHTML}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

/**
* Extracts entities
/** Conducts Named Entity Recognition (NER) on a WARC or ARC file.

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

Conducts -> Performs?

*/
def extractFromRecords(iNerClassifierFile: String, inputRecordFile: String, outputFile: String, sc: SparkContext): RDD[(String, String, String)] = {
val rdd = RecordLoader.loadArchives(inputRecordFile, sc)
.map(r => (r.getCrawlDate, r.getUrl, RemoveHTML(r.getContentString)))
extractAndOutput(iNerClassifierFile, rdd, outputFile)
}

/**
/** Extracts NER entities from tuple-formatted derivatives scraped from a website.

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

NER stands for Named Entity Recognition, so either say "Extracts named entities" or "Performs NER"...

*
* @param verticesPath Filepath for vertices output
* @param edgesPath Filepath for edges output
* @return Unit().

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

no trailing period.

This comment has been minimized.

@ruebot

ruebot Apr 9, 2018

Author Member

Trailing period goes on @return, but not on @param.

object ExtractPopularImages {

/** Extracts the n most popular images from an RDD within a given size range.

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

italics n


/** Computes image size from a byte array using ImageIO.
*
* Used by `ExtractPopularImages` to calculate the size of

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

Is backtick notation properly interpreted by scaladocs?

This comment has been minimized.

@ruebot

ruebot Apr 9, 2018

Author Member

Yes.

screenshot from 2018-04-09 08-02-31

/** Detect mime type(s) from an input string.
*
* @param content a string of content for which to detect the MimeType
* @return mimetype(s) (e.g. "text/html" or "application/xml") or "N/A".

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

does this UDF really return multiple MIME types?

def apply(file: String) = {
serializedClassifier = file
}

/** Conducts NER classificiation based on NER Classifier.

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

Same comment as above...

object RemoveHTML {

/** Remove HTML markup.

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

Use 3rd person (descriptive) not 2nd person (prescriptive).

http://www.oracle.com/technetwork/java/javase/tech/index-137868.html

(Here and elsewhere)

*/
implicit class WARecordRDD(rdd: RDD[ArchiveRecord]) extends java.io.Serializable {

/** Removes all non-html-based data (images, executables etc.) from html text*/

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

missing period at end.

def keepMimeTypes(mimeTypes: Set[String]) = {
rdd.filter(r => mimeTypes.contains(r.getMimeType))
}

/** Removes all data that does not have selected data

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

missing period.

def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD) = {
rdd.filter(r => dates.contains(ExtractDate(r.getCrawlDate, component)))
}

/** Removes all data but selected exact urls

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

URL, Url, urls - inconsistent usage... it was "Url" above...

implicit class JsonTweet(tweet: JValue) {
implicit lazy val formats = org.json4s.DefaultFormats

/* Get Twitter status id. */

This comment has been minimized.

@lintool

lintool Apr 9, 2018

Member

do you mean to have /** ... */ here?

@lintool
Copy link
Member

lintool left a comment

Minor comments, okay to merge after cleanup...

@ianmilligan1
Copy link
Member

ianmilligan1 left a comment

Looks great to me!

@lintool

This comment has been minimized.

Copy link
Member

lintool commented Apr 10, 2018

lgtm

@ruebot

This comment has been minimized.

Copy link
Member Author

ruebot commented Apr 10, 2018

@lintool did you want me to add Used by RecordLoader to extract data from WARC and ARC files. in place of ArchiveRecord. above, or it looks good as it is right now?

@lintool

This comment has been minimized.

Copy link
Member

lintool commented Apr 10, 2018

either is fine.

@ruebot

This comment has been minimized.

Copy link
Member Author

ruebot commented Apr 10, 2018

@lintool cool. I'll add it, and then worth with @ianmilligan1 on getting it merged. Thanks for reviewing!

@ruebot ruebot dismissed stale reviews from lintool and ianmilligan1 via 28fca42 Apr 10, 2018

@ianmilligan1 ianmilligan1 merged commit 47f7a97 into master Apr 11, 2018

2 of 4 checks passed

codecov/patch 0% of diff hit (target 67.34%)
Details
codecov/project 67.29% (-0.06%) compared to 3163ace
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@ianmilligan1 ianmilligan1 deleted the issue-184 branch Apr 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.