-
- Follow on to 72cb5e2 - https://nvd.nist.gov/vuln/detail/CVE-2018-7489
-
Update jackson-databind version; resolves #279. (#280)
- CVE-2017-752 - See also: https://nvd.nist.gov/vuln/detail/CVE-2017-7525
-
-
ExtractBoilerpipeText to remove headers as well. #253 (#256)
* ExtractBoilerpipeText now removes headers.
-
Address main scalastyle errors - #196 (#248)
* Deal with wildcard import lint issues.
* Fix some magic numbers & duplicate string runs.
* Lint fixes, mostly explicit import warnings.
* All other scalastyle issues require refactoring.
-
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 (#245)
* pom.xml change for GraphX
* Changes for GraphXSLS
* Changes for SLS graph
* Changes for GraphX
* Changes for converting WARC RDD to GraphX object
* Rename extractor to ExtractGraphX
* Various lint fixes (usually Magic Numbers)
* Remove illegal imports from scala style (we use wildcard imports a lot)
* Add WriteGraphXMLTest.
-
Fix TravisCI build issues (#244)
* Make the TravisCI build less verbose since we're hitting the 4MB log limit.
* Pin site.plugin and project-info-reports.plugin so mvn site builds.
  - See:
    - https://stackoverflow.com/questions/51091539/maven-site-plugins-3-3-java-lang-classnotfoundexception-org-apache-maven-doxia
    - https://travis-ci.org/archivesunleashed/aut/jobs/408259462#L3201-L3202
-
Save images from dataframe to disk (#234)
* Save images from dataframe to disk
* Fix spacing
* Move save images to inline
* Refactor to chain and fix concurrency issue
* Add save image test
* Move saveToDisk to df
-
Add Extract Image Details API (#226); Addresses #220
* Add Extract Image Details API
* Change check for jpeg and fix spacing
* Add tiff parser
* Use AutoDetectParser and read Numeric fields
* Use ComputeImageSize
* Hex encode hash and base64 encode image bytes
* Fix test
* Change df column names
-
Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line (#225)
* Resolves issue 195. Implement DomainFrequency, DomainGraph and PlainText extractor that can be run via command line in spark-submit, along with their tests
* Restructure CommandLineAppRunner to make it more robust. Add option to write GEXF output for DomainGraphExtractor (enable via --output-format GEXF). Add support for multiple input files. Other polish and cleanup.
-
-
Extract Image Links DF API + Test (#221)
* Extract Image Links DF API
* Add extract image links text
* Remove unnecessary comment from test
* Add doc comments
* Addresses #220
-
Update Apache Spark to 2.3.0; resolves #218 (#219)
- Update tests to use workaround for SPARK-2243
- Comment out ExtractGraph test as per https://github.com/archivesunleashed/aut/pull/204/files#diff-4541b9834513985c360b64093fd45073
- Align Hadoop version with Apache Spark pom.xml https://github.com/apache/spark/blob/branch-2.3/pom.xml#L120
-
-
* Create issue templates
-
-
Initial DataFrames merge (#210); Partially addresses #190
* Initial stab at df.
* Initial stab of what link extraction would look like with DFs.
* Added test case.
* Docs.