Log closing of ARC and WARC files, per #156 (#301)

Status: Open (4 commits to be merged into master)

@jrwiebe
Contributor

jrwiebe commented Jan 30, 2019

GitHub issue(s): #156

What does this Pull Request do?

  • Adds an INFO-level log message each time an ARC or WARC file that has been read is closed.
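
For readers who want a feel for the pattern, here is a minimal sketch of logging once when an archive stream is closed. It is not the code in this PR: `LoggingArchiveStream`, `closedMessage`, and the use of `java.util.logging` are assumptions made for illustration, and the real change lives in `ArchiveRecordInputFormat.java`. Only the message text ("Closed archive file" followed by the path) is taken from the expected log output in this description.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Logger;

// Hypothetical wrapper, NOT the class touched by this PR. It only
// illustrates emitting one INFO message when an archive stream closes.
public class LoggingArchiveStream implements Closeable {
  private static final Logger LOG =
      Logger.getLogger(LoggingArchiveStream.class.getName());

  private final String path;     // archive location, used only in the message
  private final InputStream in;  // underlying stream for the archive
  private boolean closed = false;

  public LoggingArchiveStream(String path, InputStream in) {
    this.path = path;
    this.in = in;
  }

  // Builds the message; the wording mirrors the sample log line in this PR.
  public static String closedMessage(String path) {
    return "Closed archive file " + path;
  }

  @Override
  public void close() throws IOException {
    if (closed) {
      return; // already closed: do not log a second time
    }
    in.close();
    closed = true;
    LOG.info(closedMessage(path));
  }

  public boolean isClosed() {
    return closed;
  }
}
```

Guarding on `closed` keeps the log to one line per file even if `close()` is called more than once, which makes it easier to count closed files against input splits.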

How should this be tested?

In Spark shell:

scala> sc.setLogLevel("INFO")

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val r = RecordLoader.loadArchives("/PATH/TO/test-classes/arc/*.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

The logs should contain something similar to this:

2019-01-30 11:10:04 INFO  NewHadoopRDD:54 - Input split: file:/home/jrwiebe/aut/target/test-classes/arc/badexample.arc.gz:0+1221
2019-01-30 11:10:04 INFO  NewHadoopRDD:54 - Input split: file:/home/jrwiebe/aut/target/test-classes/arc/example.arc.gz:0+2012526
WARNING Record STARTING at 0 has 1761 trailing byte(s): file:/home/jrwiebe/aut/target/test-classes/arc/badexample.arc.gz: {subject-uri=filedesc://IAH-20080430204825-00000-blackbook.arc, ip-address=0.0.0.0, origin=InternetArchive, length=1300, absolute-offset=0, creation-date=20080430204825, content-type=text/plain, version=1.1}
2019-01-30 11:10:05 INFO  ArchiveRecordInputFormat:240 - Closed archive file file:/home/jrwiebe/aut/target/test-classes/arc/badexample.arc.gz

Additional Notes:

Interested parties

@dportabella

@codecov-io

codecov-io commented Jan 30, 2019

Codecov Report

Merging #301 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #301      +/-   ##
==========================================
+ Coverage   75.76%   75.84%   +0.08%     
==========================================
  Files          41       41              
  Lines        1147     1151       +4     
  Branches      202      202              
==========================================
+ Hits          869      873       +4     
  Misses        209      209              
  Partials       69       69

Impacted Files | Coverage Δ
...chivesunleashed/data/ArchiveRecordInputFormat.java | 75% <100%> (+1.92%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e69040...626ae72.

@ianmilligan1
Member

ianmilligan1 left a comment

Tested locally on a set of WARCs, and the "Closed archive file" messages appear. Great stuff, @jrwiebe, thanks for this.

ianmilligan1 requested a review from ruebot on Jan 30, 2019

@ruebot

Member

ruebot commented Jan 30, 2019

Once I get a 👍 from @dportabella, I'll merge this.
