
feature request: log when loadArchives opens and closes warc files in a dir #156

Closed
dportabella opened this Issue Dec 18, 2017 · 5 comments

dportabella (Contributor) commented Dec 18, 2017

val inputWarcDir = "/data/warcs/*.warc.gz"
val webPages: RDD[ArchiveRecord] = RecordLoader.loadArchives(inputWarcDir, sc)

Is it possible to add a log message showing when a WARC file is opened and closed?

ruebot added the enhancement label Dec 21, 2017

ianmilligan1 (Member) commented Dec 21, 2017

One quick solution might be passing

sc.setLogLevel("INFO")

in spark-shell, which gives you very verbose logging. It does include input information like this:

2017-12-21 09:59:00,434 [Executor task launch worker for task 6] INFO  NewHadoopRDD - Input split: file:/Users/ianmilligan1/Dropbox/git/aut/example.arc.gz:0+2012526
2017-12-21 09:59:00,435 [Executor task launch worker for task 7] INFO  NewHadoopRDD - Input split: file:/Users/ianmilligan1/Dropbox/git/aut/example2.arc.gz:0+2012526

It doesn't include file close information, but I have used it to debug things before (such as finding bad W/ARCs).
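
For reference, a minimal spark-shell sketch of that workflow (assuming aut is on the classpath; the exact RecordLoader import depends on the aut version, so it is omitted here):

// Raise verbosity before loading archives so the per-file "Input split" lines show up.
sc.setLogLevel("INFO")

val inputWarcDir = "/data/warcs/*.warc.gz"
val webPages = RecordLoader.loadArchives(inputWarcDir, sc)

// An action is needed to actually run the job and emit the log lines; count() is just an example.
webPages.count()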

ruebot added this to Done in 1.0.0 Release of AUT Jan 3, 2018

ruebot moved this from Done to To Do in 1.0.0 Release of AUT Jan 3, 2018

ianmilligan1 (Member) commented Jan 4, 2018

Just re-pinging on this, @dportabella: is this something you want above and beyond the Spark logging options?

dportabella (Contributor, Author) commented Jan 8, 2018

> Just re-pinging on this, @dportabella: is this something you want above and beyond the Spark logging options?

No. Your solution sc.setLogLevel("INFO") is fine for me, thx!

But I would also need file close information, though it is not an urgent issue.

jrwiebe (Contributor) commented Jan 29, 2019

@dportabella, my recent commit addresses your request. If you set the log level to "INFO", you will see messages like this:

2019-01-29 12:54:11 INFO  ArchiveRecordInputFormat:141 - Opening archive file file:/home/jrwiebe/aut/target/test-classes/arc/example.arc.gz
2019-01-29 12:54:11 INFO  ArchiveRecordInputFormat:240 - Closed archive file file:/home/jrwiebe/aut/target/test-classes/arc/example.arc.gz

Is this satisfactory?
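
For context (this is not the actual aut code, just a minimal Scala sketch, and LoggingArchiveReader is a hypothetical name): the idea is a Hadoop-style record reader that logs the split path when it opens the archive and again when it closes it.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.log4j.Logger

// Hypothetical reader: logs when an archive split is opened and when it is closed.
class LoggingArchiveReader {
  private val log = Logger.getLogger(classOf[LoggingArchiveReader])
  private var path: Path = _

  def initialize(split: FileSplit): Unit = {
    path = split.getPath
    log.info(s"Opening archive file $path")
    // ... open the underlying ARC/WARC stream here ...
  }

  def close(): Unit = {
    // ... close the underlying stream here ...
    log.info(s"Closed archive file $path")
  }
}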

dportabella (Contributor, Author) commented Jan 31, 2019

Great, thanks!

ruebot closed this in fc0178d Jan 31, 2019

1.0.0 Release of AUT automation moved this from In Progress to Done Jan 31, 2019
