Document Binary Object Extraction #133

Open
ianmilligan1 opened this issue Aug 12, 2019 · 4 comments

@ianmilligan1
Member

commented Aug 12, 2019

In #77 we are updating documentation to reflect new functionality in 0.18.0. This ticket can contain information on binary extraction that needs to be added to the DataFrame section.

ianmilligan1 self-assigned this Aug 12, 2019

@ianmilligan1
Member Author

commented Aug 12, 2019

PDF extraction is implemented in AUT #340. A script along these lines works:

import io.archivesunleashed._
import io.archivesunleashed.df._

// Load the WARCs and extract PDF details (including the raw bytes) as a DataFrame.
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/*.gz", sc).extractPDFDetailsDF()
// Save the contents of the "bytes" column to disk as PDF files.
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/pdfs/9/aut-302-test", "pdf")

@ianmilligan1
Member Author

commented Aug 12, 2019

A note to myself, mostly, as I would like to document all the binary objects at the same time.

For PDF, these two scripts worked and need to be documented. As with image extraction, we will have to make the export path for binary data clear.

// Script 1: write PDF metadata (URL, MIME type, MD5) to CSV.
import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc).extractPDFDetailsDF()
df.select($"url", $"mime_type", $"md5").orderBy(desc("md5")).write.csv("/Users/ianmilligan1/desktop/pdf/csv")

// Script 2: save the PDF binaries themselves to disk.
import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc).extractPDFDetailsDF()
val res = df.select($"bytes").saveToDisk("bytes", "/Users/ianmilligan1/desktop/pdf", "pdf")
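
On the export path point, here is a rough sketch of what the docs may need to spell out. It assumes the PDF export treats the second argument as a filename prefix rather than a directory, and names each file from the prefix, an MD5 of the bytes, and the extension. That naming scheme is an assumption and should be checked against AUT #340 before it goes into the docs. `ExportPathSketch`, `md5Hex`, and `outputPath` are hypothetical helpers for illustration only and are not part of AUT.

```scala
import java.security.MessageDigest

// Hypothetical helper, NOT part of AUT: illustrates an assumed
// "<prefix>-<md5>.<extension>" naming scheme for exported binaries.
object ExportPathSketch {
  // Hex-encoded MD5 digest of the raw bytes.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

  // Build the output file name for one extracted binary.
  // Assumption: the path argument is a prefix, not a directory.
  def outputPath(prefix: String, bytes: Array[Byte], extension: String): String =
    s"$prefix-${md5Hex(bytes)}.${extension.toLowerCase}"
}
```

If that assumption holds, a prefix of "/Users/ianmilligan1/desktop/pdf" would put files on the desktop with names starting with "pdf-", not inside a "pdf" directory, which is the kind of detail worth calling out.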

@ruebot
Member

commented Aug 12, 2019

FYI, this is going to change a bit after we do the rest of the extraction.

@ianmilligan1
Member Author

commented Aug 12, 2019

Sounds good!
