Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.
Sign upImprove and clean-up Scaladocs; resolves #184 #193
Conversation
ruebot
and others
added some commits
Mar 20, 2018
This comment has been minimized.
This comment has been minimized.
greebie
commented on src/main/scala/io/archivesunleashed/matchbox/ExtractBoilerpipeText.scala
in 40400ea
Apr 6, 2018
Just a small typo here. ext => text. |
This comment has been minimized.
This comment has been minimized.
codecov
bot
commented
Apr 6, 2018
•
Codecov Report
@@ Coverage Diff @@
## master #193 +/- ##
==========================================
- Coverage 67.34% 67.29% -0.06%
==========================================
Files 32 32
Lines 637 636 -1
Branches 124 124
==========================================
- Hits 429 428 -1
Misses 167 167
Partials 41 41
Continue to review full report at Codecov.
|
ruebot
requested review from
lintool and
ianmilligan1
Apr 9, 2018
lintool
reviewed
Apr 9, 2018
@@ -29,6 +29,11 @@ import org.archive.io.arc.ARCRecord | |||
import org.archive.io.warc.WARCRecord | |||
import org.archive.util.ArchiveUtils | |||
|
|||
/** ArchiveRecord. | |||
* | |||
* @constructor an archive record. |
This comment has been minimized.
This comment has been minimized.
lintool
Apr 9, 2018
Member
I believe the style guide calls for a noun phrase without trailing period?
http://www.oracle.com/technetwork/java/javase/tech/index-137868.html
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ruebot
Apr 9, 2018
Author
Member
...that said, it would be nice to have something other than "ArchiveRecord." here. Do you have a suggestion?
This comment has been minimized.
This comment has been minimized.
ruebot
Apr 10, 2018
Author
Member
@lintool are you good with this as is, or should we change it? I think this is the only outstanding left before we can merge.
This comment has been minimized.
This comment has been minimized.
greebie
Apr 10, 2018
•
Contributor
I had
The `ArchiveRecord` [class](https://docs.scala-lang.org/tour/classes.html) is used by
`RecordLoader` to manage the ingestion of WARC and ARC files. It represents one
record from a collection found in the file. It uses the Internet Archive's ARCRecord &
WARCRecord APIs to extract data from web crawls such as the crawl date, content strings, urls and so on.
`ArchiveRecord` is a generic record that covers data from both ARCRecord and
WARCRecord format. AUT will convert either format into the generic `ArchiveRecord`
class.```
in the original reference document.
So maybe ```ArchiveRecord is a class for ingesting WARC and ARC files.```
or ```Used by RecordLoader to extract data from WARC and ARC files.```
lintool
reviewed
Apr 9, 2018
@@ -21,27 +21,35 @@ import io.archivesunleashed.matchbox.{NER3Classifier, RemoveHTML} | |||
import org.apache.spark.SparkContext | |||
import org.apache.spark.rdd.RDD | |||
|
|||
/** | |||
* Extracts entities | |||
/** Conducts Named Entity Recognition (NER) on a WARC or ARC file. |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
*/ | ||
def extractFromRecords(iNerClassifierFile: String, inputRecordFile: String, outputFile: String, sc: SparkContext): RDD[(String, String, String)] = { | ||
val rdd = RecordLoader.loadArchives(inputRecordFile, sc) | ||
.map(r => (r.getCrawlDate, r.getUrl, RemoveHTML(r.getContentString))) | ||
extractAndOutput(iNerClassifierFile, rdd, outputFile) | ||
} | ||
|
||
/** | ||
/** Extracts NER entities from tuple-formatted derivatives scraped from a website. |
This comment has been minimized.
This comment has been minimized.
lintool
Apr 9, 2018
Member
NER stands for Named Entity Recognition, so either say "Extracts named entities" or "Performs NER"...
lintool
reviewed
Apr 9, 2018
* | ||
* @param verticesPath Filepath for vertices output | ||
* @param edgesPath Filepath for edges output | ||
* @return Unit(). |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
object ExtractPopularImages { | ||
|
||
/** Extracts the n most popular images from an RDD within a given size range. |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
|
||
/** Computes image size from a byte array using ImageIO. | ||
* | ||
* Used by `ExtractPopularImages` to calculate the size of |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
/** Detect mime type(s) from an input string. | ||
* | ||
* @param content a string of content for which to detect the MimeType | ||
* @return mimetype(s) (e.g. "text/html" or "application/xml") or "N/A". |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
def apply(file: String) = { | ||
serializedClassifier = file | ||
} | ||
|
||
/** Conducts NER classificiation based on NER Classifier. |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
object RemoveHTML { | ||
|
||
/** Remove HTML markup. |
This comment has been minimized.
This comment has been minimized.
lintool
Apr 9, 2018
Member
Use 3rd person (descriptive) not 2nd person (prescriptive).
http://www.oracle.com/technetwork/java/javase/tech/index-137868.html
(Here and elsewhere)
lintool
reviewed
Apr 9, 2018
*/ | ||
implicit class WARecordRDD(rdd: RDD[ArchiveRecord]) extends java.io.Serializable { | ||
|
||
/** Removes all non-html-based data (images, executables etc.) from html text*/ |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
def keepMimeTypes(mimeTypes: Set[String]) = { | ||
rdd.filter(r => mimeTypes.contains(r.getMimeType)) | ||
} | ||
|
||
/** Removes all data that does not have selected data |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD) = { | ||
rdd.filter(r => dates.contains(ExtractDate(r.getCrawlDate, component))) | ||
} | ||
|
||
/** Removes all data but selected exact urls |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
implicit class JsonTweet(tweet: JValue) { | ||
implicit lazy val formats = org.json4s.DefaultFormats | ||
|
||
/* Get Twitter status id. */ |
This comment has been minimized.
This comment has been minimized.
lintool
reviewed
Apr 9, 2018
Minor comments, okay to merge after cleanup... |
ruebot
dismissed
lintool’s
stale review
via
abd4701
Apr 9, 2018
ianmilligan1
reviewed
Apr 9, 2018
Looks great to me! |
This comment has been minimized.
This comment has been minimized.
lgtm |
lintool
reviewed
Apr 10, 2018
This comment has been minimized.
This comment has been minimized.
@lintool did you want me to add |
This comment has been minimized.
This comment has been minimized.
either is fine. |
This comment has been minimized.
This comment has been minimized.
@lintool cool. I'll add it, and then worth with @ianmilligan1 on getting it merged. Thanks for reviewing! |
ruebot commentedApr 6, 2018
•
edited
GitHub issue(s):
#184
What does this Pull Request do?
Improve and clean-up Scaladocs so that we will have much improved Scaladocs for the next release.
Every single scala file was touched.
How should this be tested?
mvn site -DskipTests
and visittarget/site/scaladocs/index.html
with a browserauk
standard output since we pulled in #189 and #186.Interested parties
@lintool I'd appreciate your eyes on the doc comments in the code.
@ianmilligan1 can you do the smoke test, and any other review that you'd like to do?
Before this is merged let me know since we're squashing and merging this, the commit message is going to be pretty crazy with all the commits we've had go into this.
@greebie++ for his work on this.