Tree: 8eb43ff055
Commits on Dec 18, 2019
  1. Add additional filters for textFiles; resolves #362. (#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    - Add filedesc and dns filters (ARC files)
    - Add test case
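The filedesc and dns filters exclude ARC metadata records from textFiles output. A minimal sketch of the idea in plain Python (the record structure and field names here are illustrative, not the library's API):

```python
# Sketch: drop ARC metadata records whose URL marks them as
# archive file headers ("filedesc:") or DNS lookups ("dns:").
def is_text_file_candidate(record):
    url = record.get("url", "")
    return not (url.startswith("filedesc:") or url.startswith("dns:"))

records = [
    {"url": "filedesc://example.arc.gz"},
    {"url": "dns:example.com"},
    {"url": "http://example.com/notes.txt"},
]
kept = [r for r in records if is_text_file_candidate(r)]
```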
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - add discardMimeTypesDF
    - add discardDateDF
    - add discardUrlsDF
    - add discardDomainsDF
    - update tests
    - addresses #223
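The discard*DF functions are the inverse of the keep-style filters: they drop rows matching a condition. In Spark these are column filters on a DataFrame; the effect can be sketched in plain Python (names and row shape are illustrative):

```python
# Sketch: a discard-style filter removes rows whose value for a given
# field appears in a blocklist -- the inverse of a keep-style filter.
def discard_by(rows, field, blocklist):
    return [r for r in rows if r.get(field) not in blocklist]

rows = [
    {"url": "http://a.org/", "mime_type": "text/html"},
    {"url": "http://b.org/x.gif", "mime_type": "image/gif"},
]
html_only = discard_by(rows, "mime_type", {"image/gif"})
```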
  2. Add Serializable APIs for DataFrames (#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses #223
  3. Add and update tests, resolve textFiles bug. (#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve #390
    - Add test for conditional logic fix for #390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal
    (see:
    https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses #223
Commits on Nov 28, 2019
  1. Clean-up underscore import, and scalastyle warnings. (#386)

    ruebot authored and ianmilligan1 committed Nov 28, 2019
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  2. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223
  3. Rename pages() to webpages(). (#384)

    ruebot authored and ianmilligan1 committed Nov 21, 2019
    - Part of work on #233
Commits on Nov 19, 2019
  1. Append UDF with RDD or DF. (#381)

    ruebot authored and ianmilligan1 committed Nov 19, 2019
    - Addresses #223
Commits on Nov 18, 2019
  1. Extend more Matchbox utilities to DataFrames (#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses #223
Commits on Nov 17, 2019
  1. Rename DF functions to be consistent with Python DF functions. (#379)

    ruebot authored and ianmilligan1 committed Nov 17, 2019
    - Resolves #366
Commits on Nov 14, 2019
  1. Finalize converting NER Classifier to WANE Format (#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    - Fully resolves #297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
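The WANE-format override maps the classifier's labels onto plural JSON keys. A sketch of that mapping, assuming a simple (label, text) entity list as input (the entity representation is illustrative):

```python
import json

# Sketch: map Stanford NER labels to WANE-style keys
# (PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations).
WANE_KEYS = {
    "PERSON": "persons",
    "LOCATION": "locations",
    "ORGANIZATION": "organizations",
}

def to_wane(entities):
    out = {key: [] for key in WANE_KEYS.values()}
    for label, text in entities:
        key = WANE_KEYS.get(label)
        if key:
            out[key].append(text)
    return json.dumps(out, sort_keys=True)

wane = to_wane([("PERSON", "Ada Lovelace"), ("LOCATION", "London")])
```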
Commits on Nov 12, 2019
  1. Add df ExtractLinks udf; resolves #238. (#377)

    SinghGursimran authored and ruebot committed Nov 12, 2019
    - Add df ExtractLinks udf
    - Add test
Commits on Nov 10, 2019
  1. Update README.md (#376)

    lintool authored and ruebot committed Nov 10, 2019
    Tweaks the style of the license badge to look consistent with the other badges.
Commits on Nov 7, 2019
  1. Change RemoveHttpHeader to RemoveHTTPHeader. (#374)

    SinghGursimran authored and ruebot committed Nov 7, 2019
    Resolves #368.
Commits on Nov 6, 2019
  1. Updates description. See archivesunleashed/aut-docs#18 (#373)

    ruebot authored and ianmilligan1 committed Nov 6, 2019
Commits on Nov 5, 2019
  1. Align NER output to WANE format; addresses #297 (#361)

    ruebot authored and ianmilligan1 committed Nov 5, 2019
    - Update Stanford core NLP
    - Format NER output in json
    - Add getPayloadDigest to ArchiveRecord
    - Add test for getPayloadDigest
    - Add payload digest to NER output
    - Remove extractFromScrapeText
    - Remove extractFromScrapeText test
    - TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output) 🤢
  2. Various UDF implementation and cleanup for DF. (#370)

    lintool authored and ruebot committed Nov 5, 2019
    - Replace ExtractBaseDomain with ExtractDomain
    - Closes #367
    - Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
    - Closes #369
    - Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
    - Partially addresses #238
    - Updates tests where necessary
    - Punts on #368 UDF CaMeL cASe consistency issues
Commits on Oct 14, 2019
  1. Update commons-compress to 1.19; CVE-2019-12402 (#365)

    ruebot authored and ianmilligan1 committed Oct 14, 2019
Commits on Oct 9, 2019
  1. Add ComputeSHA1 method; resolves #363. (#364)

    ruebot authored and ianmilligan1 committed Oct 9, 2019
    - Update tests where needed
    - Add SHA1 method to ExtractImageDetails
    - Add SHA1 to DataFrames binary extraction and analysis
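Fixity columns like these are content digests of the extracted payload bytes. A self-contained sketch of computing them (column names are illustrative):

```python
import hashlib

# Sketch: fixity digests for an extracted binary payload, as exposed
# in the DataFrame binary extraction (field names are illustrative).
def digests(payload: bytes):
    return {
        "md5": hashlib.md5(payload).hexdigest(),
        "sha1": hashlib.sha1(payload).hexdigest(),
    }

d = digests(b"hello")
```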
Commits on Sep 11, 2019
  1. Update keepValidPages to include a filter on 200 OK. (#360)

    ruebot authored and ianmilligan1 committed Sep 11, 2019
    - Add status code filter to keepValidPages
    - Add MimeTypeTika to valid pages DF
    - Update tests since we filter more and better now 😄
    - Resolves #359
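With this change, a "valid page" must pass both a MIME-type check and an HTTP 200 check. A sketch of the combined predicate in plain Python (field names and string-typed status codes are illustrative, not the real schema):

```python
# Sketch: a "valid page" predicate combining the two checks --
# HTTP 200 OK and an HTML MIME type.
def is_valid_page(record):
    return (record.get("status_code") == "200"
            and record.get("mime_type") == "text/html")

pages = [
    {"url": "http://a.org/", "status_code": "200", "mime_type": "text/html"},
    {"url": "http://a.org/gone", "status_code": "404", "mime_type": "text/html"},
    {"url": "http://a.org/x.png", "status_code": "200", "mime_type": "image/png"},
]
valid = [p for p in pages if is_valid_page(p)]
```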
Commits on Sep 3, 2019
  1. Update to Spark 2.4.4 (#358)

    ruebot authored and ianmilligan1 committed Sep 3, 2019
Commits on Aug 23, 2019
  1. Add discardLanguage filter to RecordLoader. (#353)

    ruebot authored and ianmilligan1 committed Aug 23, 2019
    - Clean up doc comments
    - Add test
    - Resolves #352
Commits on Aug 22, 2019
  1. Improve test coverage. (#354)

    ruebot authored and ianmilligan1 committed Aug 22, 2019
    - Add tests for a few more filters in RecordLoader
    - Add binary extraction DataFrameLoader tests
Commits on Aug 21, 2019
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Aug 21, 2019
  2. Add binary extraction DataFrames to PySpark. (#350)

    ruebot authored and ianmilligan1 committed Aug 21, 2019
    - Address #190
    - Address #259
    - Address #302
    - Address #303
    - Address #304
    - Address #305
    - Address #306
    - Address #307
    - Resolves #350 
    - Update README
  3. Update LICENSE and license headers. (#351)

    ruebot authored and ianmilligan1 committed Aug 21, 2019
    - Update LICENSE file to full Apache 2.0 license
    - Reconfigure license-maven-plugin
    - Update all license headers in java and scala files to include
    copyright year, and project name
    - Move LICENSE_HEADER.txt to config
    - Update scalastyle config
Commits on Aug 18, 2019
  1. Add method for determining binary file extension. (#349)

    jrwiebe authored and ruebot committed Aug 18, 2019
    This PR implements the strategy described in the discussion of issue #343 to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox.
    
    This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.)
    
    Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could possibly bring in some more CSV and TSV files.
    
    * Adds method for getting a file extension from a MIME type.
    * Add getExtensions method to DetectMimeTypeTika.
    * Matchbox object to get extension of URL
    * Use GetExtensionMime for extraction methods; minor fixes.
    * Remove tika-parsers classifier
    * Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases.
    * Fix GetExtensionMime case where URL has no extension but a MIME type is detected
    * Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`.
    * Remove filtering on URL for audio, video, and images.
    * Remove filtering on URL for images; add DF fields to image extraction
    * Remove saveImageToDisk and its test
    * Remove robots.txt check and extraneous imports
    * Close files so we don't get too many files open again.
    * Add GetExtensionMimeTest
    * Resolve #343
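The strategy described above — prefer the (lowercased) extension from the URL, and fall back to one derived from the detected MIME type when the URL has none — can be sketched with Python's standard library (this mirrors the idea, not the GetExtensionMime implementation):

```python
import mimetypes
import os
from urllib.parse import urlparse

# Sketch: resolve a file extension from a URL, falling back to the
# detected MIME type when the URL path carries no extension.
def get_extension(url, mime_type):
    ext = os.path.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if ext:
        return ext
    guess = mimetypes.guess_extension(mime_type)
    return guess.lstrip(".") if guess else ""

pdf_ext = get_extension("http://example.com/report", "application/pdf")
url_ext = get_extension("http://example.com/IMG.JPEG", "image/jpeg")
```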
Commits on Aug 17, 2019
  1. Add keep and discard by http status. (#347)

    ruebot authored and ianmilligan1 committed Aug 17, 2019
    - Add keep and discard by http status RecordLoader
    - Add tests
    - Clean up/add doc comments in RecordLoader
    - Resolve #315
Commits on Aug 16, 2019
  1. Add office document binary extraction. (#346)

    ruebot authored and ianmilligan1 committed Aug 16, 2019
    - Add Word Processor DF and binary extraction
    - Add Spreadsheets DF and binary extraction
    - Add Presentation Program DF and binary extraction
    - Add Text files DF and binary extraction
    - Add tests for new DF and binary extractions
    - Add test fixtures for new DF and binary extractions
    - Resolves #303
    - Resolves #304
    - Resolves #305
    - Use aut-resources repo to distribute our shaded tika-parsers 1.22
    - Close TikaInputStream
    - Add RDD filters on MimeTypeTika values
    - Add CodeCov configuration yaml
    - Includes work by @jrwiebe, see #346 for all commits before squash
Commits on Aug 14, 2019
  1. Use version of tika-parsers without a classifier. (#345)

    jrwiebe authored and ruebot committed Aug 14, 2019
    Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact
    was unnecessary.
  2. Use Tika's detected MIME type instead of ArchiveRecord getMimeType. (#344)

    ruebot authored and ianmilligan1 committed Aug 14, 2019

    - Move audio, pdf, and video DF extraction to tuple map
    - Provide two MimeType columns; mime_type_web_server and mime_type_tika
    - Update tests
    - Resolves #342