Document Binary Object Extraction #133

Open
ianmilligan1 opened this issue Aug 12, 2019 · 1 comment

@ianmilligan1
Member

commented Aug 12, 2019

In #77 we are updating documentation to reflect new functionality in 0.18.0. This ticket can contain information on binary extraction that needs to be added to the DataFrame section.

ianmilligan1 self-assigned this Aug 12, 2019

@ianmilligan1
Member Author

commented Aug 12, 2019

PDF extraction was implemented in AUT #340. A script along these lines:

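import io.archivesunleashed._
import io.archivesunleashed.df._

// Load the WARCs (sc is the SparkContext from spark-shell) and build a DataFrame of PDF details.
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/*.gz", sc).extractPDFDetailsDF()
// Write the contents of the "bytes" column to disk under the given path prefix, with the "pdf" extension.
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/pdfs/9/aut-302-test", "pdf")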