Document Binary Object Extraction #133
Closed
PDF extraction was implemented in AUT #340. A script along these lines works:

import io.archivesunleashed._
import io.archivesunleashed.df._

// Load the WARCs and build a DataFrame of PDF details.
val df = RecordLoader.loadArchives("/tuna1/scratch/nruest/geocites/warcs/9/*.gz", sc).extractPDFDetailsDF();
// Write the contents of the "bytes" column to disk as .pdf files.
val res = df.select($"bytes").saveToDisk("bytes", "/tuna1/scratch/nruest/geocites/pdfs/9/aut-302-test", "pdf")
A note to myself, mostly, as I would like to document all the binary objects at the same time. For PDF, these two scripts worked and need to be documented. As with image extraction, we will have to be clear about the export path for binary data.

// Script 1: write the URL, MIME type, and MD5 hash of each PDF to CSV.
import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc).extractPDFDetailsDF();
df.select($"url", $"mime_type", $"md5").orderBy(desc("md5")).write.csv("/Users/ianmilligan1/desktop/pdf/csv")

// Script 2: write the PDF files themselves to disk.
import io.archivesunleashed._
import io.archivesunleashed.df._
val df = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz", sc).extractPDFDetailsDF();
val res = df.select($"bytes").saveToDisk("bytes", "/Users/ianmilligan1/desktop/pdf", "pdf")
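On the export path: a rough sketch of what probably needs spelling out, assuming the second argument to saveToDisk is treated as a file name prefix rather than a directory, and assuming each file ends up named `<prefix>-<md5>.<extension>`. Both are assumptions based on the paths used in the scripts above and are not confirmed in this thread. `ExportPathSketch`, `md5Hex`, and `exportPath` are hypothetical names for illustration, not part of AUT, and the sketch uses plain Scala with no Spark dependency.

```scala
import java.nio.file.{Files, Path, Paths}
import java.security.MessageDigest

object ExportPathSketch {
  // Hex-encoded MD5 of the payload, used only to build a unique file name.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

  // Hypothetical naming rule (an assumption, not confirmed AUT behaviour):
  // <prefix>-<md5>.<extension>
  def exportPath(prefix: String, bytes: Array[Byte], extension: String): Path =
    Paths.get(s"$prefix-${md5Hex(bytes)}.$extension")

  def main(args: Array[String]): Unit = {
    // Stand-in payload; a real run would use the bytes of an extracted PDF.
    val payload = "%PDF-1.4".getBytes("UTF-8")
    val dir = Files.createTempDirectory("pdf-export")
    // "aut-test" is the file name prefix, not a subdirectory.
    val target = exportPath(dir.resolve("aut-test").toString, payload, "pdf")
    Files.write(target, payload)
    println(target.getFileName)
  }
}
```

If that assumption holds, a path like "/Users/ianmilligan1/desktop/pdf" would produce files on the desktop whose names start with "pdf-", not files inside a "pdf" directory, which is the kind of thing the documentation should make explicit.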
FYI, this is going to change a bit after we do the rest of the extraction.
Sounds good!
Closed due to move to https://github.com/archivesunleashed/aut-docs.
ianmilligan1 commented Aug 12, 2019
In #77 we are updating documentation to reflect new functionality in 0.18.0. This ticket can contain information on binary extraction that needs to be added to the DataFrame section.