archivesunleashed/aut

Commits on Nov 6, 2019

Updates description. See archivesunleashed/aut-docs-new#18 (#373 )

ruebot authored and ianmilligan1 committed Nov 6, 2019

08b9ae3

Commits on Nov 5, 2019

Align NER output to WANE format; addresses #297 (#361 )

ruebot authored and ianmilligan1 committed Nov 5, 2019

- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)

379cc68

Various UDF implementation and cleanup for DF. (#370 )

lintool authored and ruebot committed Nov 5, 2019

- Replace ExtractBaseDomain with ExtractDomain
- Closes #367
- Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
- Closes #369
- Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
- Partially addresses #238
- Updates tests where necessary
- Punts on #368 UDF CaMeL cASe consistency issues

6686519

Commits on Oct 14, 2019

Update commons-compress to 1.19; CVE-2019-12402 (#365 )

ruebot authored and ianmilligan1 committed Oct 14, 2019

4e8b41d

Commits on Oct 9, 2019

Add ComputeSHA1 method; resolves #363 . (#364 )

ruebot authored and ianmilligan1 committed Oct 9, 2019

- Update tests where needed
- Add SHA1 method to ExtractImageDetails
- Add SHA1 to DataFrames binary extraction and analysis

03ac99c
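
For readers unfamiliar with the digest this commit adds, here is a minimal Python sketch of computing a SHA-1 hex digest over raw bytes. It uses only the standard library; the function name `compute_sha1` is hypothetical and this is not aut's Scala `ComputeSHA1` implementation.

```python
import hashlib


def compute_sha1(data: bytes) -> str:
    """Return the SHA-1 digest of raw bytes as a lowercase hex string."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    # Hash a small in-memory payload; real input would be a record's binary content.
    print(compute_sha1(b"hello"))
```

Hashing the raw bytes rather than a decoded string keeps the digest stable for binary content such as images.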

Commits on Sep 11, 2019

Update keepValidPages to include a filter on 200 OK. (#360 )

ruebot authored and ianmilligan1 committed Sep 11, 2019

- Add status code filter to keepValidPages
- Add MimeTypeTika to valid pages DF
- Update tests since we filter more and better now 😄
- Resolves #359

9b3e025
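
A rough Python sketch of the status-code part of this change: keep only records whose HTTP status is 200. The record shape (plain dicts with `url` and `http_status` keys) and the function name `keep_status_200` are hypothetical; aut's `keepValidPages` is a Scala method and applies further conditions that are not shown here.

```python
from typing import Dict, Iterable, List


def keep_status_200(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose HTTP status code is exactly "200"."""
    return [r for r in records if r.get("http_status") == "200"]


if __name__ == "__main__":
    sample = [
        {"url": "http://example.com/", "http_status": "200"},
        {"url": "http://example.com/missing", "http_status": "404"},
    ]
    for record in keep_status_200(sample):
        print(record["url"])
```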

Commits on Sep 3, 2019

Update to Spark 2.4.4 (#358 )

ruebot authored and ianmilligan1 committed Sep 3, 2019

7305ed7

Commits on Aug 27, 2019

[skip travis] Update links (#357 )

ruebot committed Aug 27, 2019

a454ed3

Commits on Aug 23, 2019

Add discardLanguage filter to RecordLoader. (#353 )

ruebot authored and ianmilligan1 committed Aug 23, 2019

- Clean up doc comments
- Add test
- Resolves #352

0284d33

Commits on Aug 22, 2019

Improve test coverage. (#354 )

ruebot authored and ianmilligan1 committed Aug 22, 2019

- Add tests for a few more filters in RecordLoader
- Add binary extraction DataFrameLoader tests

bced854

Commits on Aug 21, 2019

[maven-release-plugin] prepare for next development iteration

ruebot committed Aug 21, 2019

4313174

[maven-release-plugin] prepare release aut-0.18.0

ruebot committed Aug 21, 2019

95e5f03

Add binary extraction DataFrames to PySpark. (#350 )

ruebot authored and ianmilligan1 committed Aug 21, 2019

* Add binary extraction DataFrames to PySpark.
- Address #190
- Address #259
- Address #302
- Address #303
- Address #304
- Address #305
- Address #306
- Address #307
- Resolves #350 
- Update README

eda185b

Update LICENSE and license headers. (#351 )

ruebot authored and ianmilligan1 committed Aug 21, 2019

- Update LICENSE file to full Apache 2.0 license
- Reconfigure license-maven-plugin
- Update all license headers in java and scala files to include copyright year, and project name
- Move LICENSE_HEADER.txt to config
- Update scalastyle config

e32ae17

Commits on Aug 18, 2019

Add method for determining binary file extension. (#349 )

jrwiebe authored and ruebot committed Aug 18, 2019

This PR implements the strategy described in the discussion of the above issue to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox.

This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.)

Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could possibly bring in some more CSV and TSV files.

* Adds method for getting a file extension from a MIME type.
* Add getExtensions method to DetectMimeTypeTika.
* Matchbox object to get extension of URL
* Use GetExtensionMime for extraction methods; minor fixes.
* Remove tika-parsers classifier
* Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases.
* Fix GetExtensionMime case where URL has no extension but a MIME type is detected
* Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`.
* Remove filtering on URL for audio, video, and images.
* Remove filtering on URL for images; add DF fields to image extraction
* Remove saveImageToDisk and its test
* Remove robots.txt check and extraneous imports
* Close files so we don't get too many files open again.
* Add GetExtensionMimeTest
* Resolve #343

448601e
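
One way to picture getting an extension from a URL and a MIME type is sketched below in Python. The decision order (prefer the URL's extension when it is a known extension for the detected type, otherwise fall back to the type's first known extension, otherwise keep whatever the URL had) is an assumption for illustration, as are the function name `guess_extension` and the tiny `KNOWN_EXTENSIONS` table. The actual strategy is the one described in the linked issue discussion and implemented in `GetExtensionMime`.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical lookup table; a real implementation would ask a MIME registry.
KNOWN_EXTENSIONS = {
    "image/jpeg": ["jpg", "jpeg"],
    "application/pdf": ["pdf"],
    "text/plain": ["txt"],
}


def guess_extension(url: str, mime_type: str) -> str:
    """Pick a file extension from a URL and a detected MIME type."""
    path = urlparse(url).path  # drops the query string and fragment
    url_ext = posixpath.splitext(path)[1].lstrip(".").lower()
    candidates = KNOWN_EXTENSIONS.get(mime_type, [])
    if url_ext in candidates:
        return url_ext  # URL and MIME type agree
    if candidates:
        return candidates[0]  # trust the detected MIME type
    return url_ext  # unknown MIME type: keep the URL's extension, possibly ""
```

Lowercasing the URL extension before comparing mirrors the `toLowerCase` change mentioned in this commit.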

Commits on Aug 17, 2019

Add keep and discard by http status. (#347 )

ruebot authored and ianmilligan1 committed Aug 17, 2019

- Add keep and discard by http status RecordLoader
- Add tests
- Clean up/add doc comments in RecordLoader
- Resolve #315

018527a

Commits on Aug 16, 2019

Add office document binary extraction. (#346 )

ruebot authored and ianmilligan1 committed Aug 16, 2019

- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash

c824ad8

Commits on Aug 14, 2019

Use version of tika-parsers without a classifier. (#345 )

jrwiebe authored and ruebot committed Aug 14, 2019

Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact was unnecessary.

39831c2

Use Tika's detected MIME type instead of ArchiveRecord getMimeType. (#344)

ruebot authored and ianmilligan1 committed Aug 14, 2019

- Move audio, pdf, and video DF extraction to tuple map
- Provide two MimeType columns; mime_type_web_server and mime_type_tika
- Update tests
- Resolves #342

01d12b4

Commits on Aug 13, 2019

Add audio & video binary extraction (#341 )

ruebot authored and ianmilligan1 committed Aug 13, 2019

- Add Audio & Video binary extraction.
- Add filename and extension columns to audio, pdf, and video DF
- Pass binary bytes instead of string to DetectMimeTypeTika in DF (s/getContentString/getBinaryBytes)
- Updates saveToDisk to use file extension from DF column
- Adds tests for Audio, PDF, and Video DF extraction
- Add test fixtures for Audio, PDF, and Video DF extraction
- Rename SaveBytesTest to SaveImageBytes test
- Eliminate bytes->string->bytes conversion that was causing data loss in DetectMimeTypeTika
- Update tika-parsers dep from JitPack
- Remove tweet cruft
- Resolves #306
- Resolves #307
- Includes work by @jrwiebe, see #341 for all commits before squash

54c0c3e
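
The kind of data loss a bytes->string->bytes conversion can cause is easy to reproduce in general terms. This short Python sketch (not aut code; the helper name `round_trip_through_text` is hypothetical, and UTF-8 is chosen only for illustration) decodes arbitrary binary data as text and encodes it again, showing that bytes that are not valid UTF-8 do not survive the round trip.

```python
def round_trip_through_text(data: bytes) -> bytes:
    """Decode bytes as UTF-8 text (replacing invalid sequences), then re-encode."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    jpeg_magic = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # first bytes of a typical JPEG file
    mangled = round_trip_through_text(jpeg_magic)
    print(mangled == jpeg_magic)  # False: the invalid bytes were replaced
    print(round_trip_through_text(b"plain ascii") == b"plain ascii")  # True
```

Passing the original bytes straight to the detector avoids this problem entirely.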

Commits on Aug 12, 2019

Add PDF binary extraction. (#340 )

jrwiebe authored and ruebot committed Aug 12, 2019

Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome.

- Updates getImageBytes to getBinaryBytes
- Refactor SaveImage class to more general SaveBytes, and saveToDisk to saveImageToDisk
- Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
- Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
- Added generic saveToDisk method to save a bytes column of a DataFrame to files
- Updates tests
- Resolves #302
- Further addresses #308
- Includes work by @ruebot, see #340 for all commits before squash

73981a7
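
Deferring an expensive object's construction until first use, as this commit does for Tika, can be sketched in Python as follows. This illustrates the general lazy-initialization idea only; `ExpensiveDetector` and `get_detector` are hypothetical stand-ins, not Tika or aut APIs, and the detection rule is a toy.

```python
from functools import lru_cache


class ExpensiveDetector:
    """Stand-in for a detector that is costly to construct."""

    instances_created = 0

    def __init__(self) -> None:
        ExpensiveDetector.instances_created += 1

    def detect(self, data: bytes) -> str:
        # Toy rule for illustration: recognise the PDF magic number only.
        if data.startswith(b"%PDF-"):
            return "application/pdf"
        return "application/octet-stream"


@lru_cache(maxsize=None)
def get_detector() -> ExpensiveDetector:
    """Build the detector on first call and reuse the same instance afterwards."""
    return ExpensiveDetector()
```

Callers that never need detection never pay the construction cost, and repeated calls share one instance.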

Commits on Aug 8, 2019

More scalastyle work; addresses #196 . (#339 )

ruebot authored and ianmilligan1 committed Aug 8, 2019

- Remove all underscore imports, except shapeless
- Address all scalastyle warnings
- Update scalastyle config for magic numbers, and null (only used in tests)

b2d7394

Commits on Aug 7, 2019

Replace computeHash with ComputeMD5; resolves #333 . (#338 )

ruebot authored and jrwiebe committed Aug 7, 2019

* Replace computeHash with ComputeMD5; resolves #333.
* I suppose these are redundant.

9623c7a

Commits on Aug 6, 2019

Make ArchiveRecord.getContentBytes consistent,#334 (#335 )

ianmilligan1 authored and ruebot committed Aug 6, 2019

1818596

Update Tika to 1.22; address security alerts. (#337 )

ruebot authored and ianmilligan1 committed Aug 6, 2019

- Update Tika to 1.22
- pom.xml surgery to get aut to build again with --packages

2d14b92

Commits on Jul 31, 2019

Update test coverage for data frames (#336 ).

ruebot authored and ianmilligan1 committed Jul 31, 2019

- This commit will fall under @ruebot, but @jrwiebe did the heavy lifting here; see #336 for his commits before they were squashed down.
- Resolves #265
- Resolves #263
- Update Scaladocs

605afcc

Commits on Jul 25, 2019

Enable S3 access (#332 )

jrwiebe authored and ruebot committed Jul 25, 2019

* Update POM to access data stored in Amazon S3, per #319
* In RecordLoader detect FileSystem based on path.
* Resolves #319

64c1f1f

Commits on Jul 23, 2019

Updates to pom following 0e701b2 (#328 )

ruebot authored and ianmilligan1 committed Jul 23, 2019

- Remove explicit Guava dependency (should have been removed in 0e701b2)
- Update Scala to 2.11.12; aligns with Spark 2.4.3
- Update Scala test
- Update Shapeless
- Update Scala lang parsers
- Fix a typo in a test

19b49e1

Commits on Jul 18, 2019

Python formatting, and gitignore additions. (#326 )

ruebot authored and ianmilligan1 committed Jul 18, 2019

- Run black and isort on Python files.
- Move Spark config to example file.
- Update gitignore for 7a61f0e additions.

bd5ef14

Move data frame field names to snake_case. (#327 )

ruebot authored and ianmilligan1 committed Jul 18, 2019

- Resolves #229

f35d54e

Commits on Jul 17, 2019

Update to Spark 2.4.3 and update Tika to 1.20. (#321 )

ruebot authored and ianmilligan1 committed Jul 17, 2019

* Update to Spark 2.4.3 and update Tika to 1.20.

- Resolves #295
- Resolves #308
- Resolves #286
- Pulls in unfinished work by @jrwiebe and @borislin.

* Add patched lang-detector

0e701b2

Commits on Jul 15, 2019

Remove Tweet utils. (#323 )

ruebot authored and ianmilligan1 committed Jul 15, 2019

- Resolves #322
- Resolves #206
- Resolves #194

20ffeeb

Commits on Jul 8, 2019

Test Java 8 & 11, and remove OracleJDK; resolves #324 . (#325 )

ruebot authored and ianmilligan1 committed Jul 8, 2019

4ce59c8

Commits on Jul 5, 2019

Add image analysis and extraction w/TensorFlow (#318 )

h324yang authored and ruebot committed Jul 5, 2019

7a61f0e

Commits on Apr 22, 2019

Makes ArchiveRecordImpl serializable by removing non-serializable ARCRecord and WARCRecord variables. Also removes unused headerResponseFormat variable. (#316)

jrwiebe authored and ruebot committed Apr 22, 2019

5cb05f7
