Document Binary Object Extraction #133
ianmilligan1
self-assigned this
Aug 12, 2019
PDF extraction is implemented in AUT #340. A script along these lines:
import io.archivesunleashed._
import io.archivesunleashed.df._
// Load the WARCs and build a DataFrame of PDF details.
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/*.gz", sc).extractPDFDetailsDF()
// Write the contents of the "bytes" column to disk as PDF files.
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/pdfs/9/aut-302-test", "pdf")
ianmilligan1 commented Aug 12, 2019
In #77 we are updating documentation to reflect new functionality in 0.18.0. This ticket can contain information on binary extraction that needs to be added to the DataFrame section.