Add datathon derivatives to app (binary info, web pages, web graph) #447
Comments
ruebot added a commit that referenced this issue on Apr 21, 2020:
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor, PDFInformationExtractor, PresentationProgramInformationExtractor, SpreadsheetInformationExtractor, TextFilesInformationExtractor, VideoInformationExtractor, WebGraphExtractor, WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
ruebot added a commit that referenced this issue on Apr 21, 2020:
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor, PDFInformationExtractor, PresentationProgramInformationExtractor, SpreadsheetInformationExtractor, TextFilesInformationExtractor, VideoInformationExtractor, WebGraphExtractor, WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
ianmilligan1 pushed a commit that referenced this issue on Apr 21, 2020:
- Resolves #447
- Add AudioInformationExtractor, ImageInformationExtractor, PDFInformationExtractor, PresentationProgramInformationExtractor, SpreadsheetInformationExtractor, TextFilesInformationExtractor, VideoInformationExtractor, WebGraphExtractor, WordProcessorInformationExtractor
- Add tests for the new extractors
- Update CommandLineApp to use new extractors
- Add domain and language columns to WebPagesExtractor
- Change "TEXT" to "csv"
- Lower case "GEXF" and "GRAPHML"
ruebot commented Apr 20, 2020
Is your feature request related to a problem? Please describe.
The only way to create the derivatives we used for the recent datathon(s) is via the Spark shell. We should add them to the app.
Describe the solution you'd like
Add the following derivatives to app:
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
.webgraph()
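
For reference, a minimal sketch of how these two derivatives might be produced in spark-shell today. It assumes the toolkit jar is on the classpath and that `RemoveHTMLDF` and `RemoveHTTPHeaderDF` live in `io.archivesunleashed.df` (package names may differ between releases). The input and output paths are hypothetical placeholders.

```scala
// Run inside spark-shell, which provides `sc` and the $"col" syntax.
import io.archivesunleashed._
import io.archivesunleashed.df._ // assumed location of the *DF helpers

// Hypothetical paths; substitute your own collection and output directories.
val warcs = "/path/to/warcs/*.gz"
val out = "/path/to/derivatives"

// Web pages: crawl date, URL, MIME types, and content with HTTP headers and HTML removed.
RecordLoader.loadArchives(warcs, sc)
  .webpages()
  .select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika",
    RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
  .write.csv(out + "/webpages")

// Web graph: written as returned by webgraph(), with no extra column selection.
RecordLoader.loadArchives(warcs, sc)
  .webgraph()
  .write.csv(out + "/webgraph")
```

Moving these into the app would replace this manual step with a named extractor per derivative.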
Additional context
- webpages: should we add a domain column, so it is similar to the "full-text" derivative, or should it completely replace the "full-text" derivative?
- webgraph: should this just be the DomainGraphExtractor as "TEXT"?