aut-0.18.0 (2019-08-21)
Implemented enhancements:
- Add method for unknown extensions in binary extractions #343
- Use Tika's detected MIME type instead of ArchiveRecord getMimeType? #342
- Add filter/keep by http status to RecordLoader class #315
- Audio binary object extraction #307
- Video binary object extraction #306
- PowerPoint binary object extraction #305
- Doc binary object extraction #304
- Spreadsheet binary object extraction #303
- PDF binary object extraction #302
- Test aut with Apache Spark 2.4.0 #295
- Replace hashing of unique ids with .zipWithUniqueId() #243
- Integration of neural network models for image analysis #240
- More complete Twitter Ingestion #194
- Image Search Functionality #165
- feature request: log when loadArchives opens and closes warc files in a dir #156
Fixed bugs:
- DataFrame commands throwing java.lang.NullPointerException on example data #320
- Class issues when using aut-0.17.0-fatjar.jar #313
- Image extraction does not scale with number of WARCs #298
- ExtractDomain mistakenly checks source first then url #277
- Improve ExtractDomain to Better Isolate Domains #269
Closed issues:
- Inconsistency in ArchiveRecord.getContentBytes #334
- Rationalize computeHash and ComputeMD5 #333
- Test additional Java versions with TravisCI #324
- Remove Twitter/tweet analysis #322
- Trouble testing s3 connectivity #319
- Depfu Error: No dependency files found #309
- Strategy to deal with conflict between application and Spark distribution dependencies #308
- SaveImageTest.scala should delete saved image file #299
- Remove Deprecated ExtractGraph.scala file for next release. #291
- DetectLanguage.scala: class LanguageIdentifier in package language is deprecated #286
- CVE-2017-7525 -- com.fasterxml.jackson.core:jackson-databind #279
- Maven build warning during release #273
- Improve DataFrameLoader.scala test coverage #265
- Improve package.scala test coverage #263
- Discussion: Idiom for loading DataFrames #231
- DataFrame field names: open thread #229
- DataFrame performance comparison: Scala vs. Python #215
- TweetUtilsTest.scala doesn't test Spark, only underlying json4s library #206
- feature request: ArchiveRecord.archiveFile #164
- feature request: possibility to query about the progress #162
- Update to Apache Tika 1.19.1; security vulnerabilities in 1.12 #131
- Create tests for ExtractGraph.scala #49
- Setup Victims #5
Merged pull requests:
- Update LICENSE and license headers. #351 (ruebot)
- Add binary extraction DataFrames to PySpark. #350 (ruebot)
- Add method for determining binary file extension #349 (jrwiebe)
- Add keep and discard by http status. #347 (ruebot)
- Add office document binary extraction. #346 (ruebot)
- Use version of tika-parsers without a classifier #345 (jrwiebe)
- Use Tika's detected MIME type instead of ArchiveRecord getMimeType. #344 (ruebot)
- Add Audio & Video binary extraction #341 (ruebot)
- Extract PDF #340 (jrwiebe)
- More scalastyle work; addresses #196. #339 (ruebot)
- Replace computeHash with ComputeMD5; resolves #333. #338 (ruebot)
- Update Tika to 1.22; address security alerts. #337 (ruebot)
- Tests #336 (ruebot)
- Make ArchiveRecord.getContentBytes consistent, Resolve #334 #335 (ianmilligan1)
- Enable S3 access #332 (jrwiebe)
- Updates to pom following 0e701b2 #328 (ruebot)
- Move data frame fields names to snake_case. #327 (ruebot)
- Python formatting, and gitignore additions. #326 (ruebot)
- Test Java 8 & 11, and remove OracleJDK; resolves #324. #325 (ruebot)
- Remove Tweet utils. #323 (ruebot)
- Update to Spark 2.4.3 and update Tika to 1.20. #321 (ruebot)
- add image analysis w/ tensorflow #318 (h324yang)
- Makes ArchiveRecordImpl serializable #316 (jrwiebe)
- Resolve cobertura-maven-plugin class issue; resolves #313. #314 (ruebot)
- Update spark-core_2.11 to 2.3.1. #312 (ruebot)
- Log closing of ARC and WARC files, per #156 #301 (jrwiebe)
- Delete saved image file; resolves #299 #300 (jrwiebe)
- Remove Deprecated ExtractGraph app #293 (greebie)
- Add .getHttpStatus and .getFilename to ArchiveRecordImpl class #198 & #164 #292 (greebie)
- Update license headers for #208. #290 (ruebot)
- Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289 (greebie)
- CVE-2018-11771 update #288 (ruebot)
- CVE-2017-17485 update; follow-on to #281. #287 (ruebot)
- Update Apache Tika - security vulnerabilities; resolves #131. #285 (ruebot)
- [skip travis] Update README #284 (ruebot)
- Only trigger TravisCI on master. #283 (ruebot)
- Missed something for #208. #282 (ruebot)
- CVE-2018-7489 fix. #281 (ruebot)
- Update jackson-databind version; resolves #279. #280 (ruebot)
- Patch for #277: Fix bug and unit test for ExtractDomain #278 (borislin)
- Patch for #269: Replace backslash with forward slash in URL #276 (borislin)
- Clean-up pom.xml to remove plugin warnings; resolves #273. #274 (ruebot)
Assets (17)
- aut-0.18.0-fatjar.jar (217 MB)
- aut-0.18.0-fatjar.jar.md5 (56 Bytes)
- aut-0.18.0-fatjar.jar.sha1 (64 Bytes)
- aut-0.18.0-javadoc.jar (72.5 KB)
- aut-0.18.0-javadoc.jar.md5 (57 Bytes)
- aut-0.18.0-javadoc.jar.sha1 (65 Bytes)
- aut-0.18.0-test-javadoc.jar (48.3 KB)
- aut-0.18.0-test-javadoc.jar.md5 (62 Bytes)
- aut-0.18.0-test-javadoc.jar.sha1 (70 Bytes)
- aut-0.18.0.jar (538 KB)
- aut-0.18.0.jar.md5 (49 Bytes)
- aut-0.18.0.jar.sha1 (57 Bytes)
- aut.zip (1.04 KB)
- aut.zip.md5 (42 Bytes)
- aut.zip.sha1 (50 Bytes)
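
Each release asset ships with .md5 and .sha1 sidecar files, so a download can be checked before use. The following is a minimal Python sketch of that check, not part of aut itself. It assumes each sidecar file starts with the hex digest, optionally followed by the file name (the common md5sum/sha1sum layout, which the listed sidecar sizes are consistent with). The helper names `file_digest` and `verify_asset` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so a large jar is never fully loaded into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_asset(asset_path, sidecar_path, algorithm):
    """Compare the asset's digest with the first token of its sidecar file.

    Assumption: the sidecar begins with the hex digest; anything after the
    first whitespace (such as a file name) is ignored.
    """
    expected = Path(sidecar_path).read_text().split()[0].lower()
    return file_digest(asset_path, algorithm) == expected
```

With the fatjar and its sidecars in the current directory, `verify_asset("aut-0.18.0-fatjar.jar", "aut-0.18.0-fatjar.jar.sha1", "sha1")` would return `True` only if the download matches the published digest; pass `"md5"` with the `.md5` file for the other checksum.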