archivesunleashed/aut

Commits on Jan 21, 2020

Clean up test descriptions, addresses #372. (#416)

ruebot authored and ianmilligan1 committed Jan 21, 2020

```
- Clean up test descriptions
- Rename typo filename
```
ffef735

Add ExtractImageDetailsDF. (#415)

SinghGursimran authored and ruebot committed Jan 21, 2020

```
- Add test
- Addresses #223
```
71b459c

Commits on Jan 18, 2020

Add crawl_date to binary DataFrames and imageLinks. (#414)

ruebot authored and ianmilligan1 committed Jan 18, 2020

```
- Resolves #413
- Update tests where necessary
```
9e357cc

Commits on Jan 17, 2020

Various DataFrame implementation updates for documentation clean-up; Addresses #372.

ruebot authored and ianmilligan1 committed Jan 17, 2020

- Rename .all() column HttpStatus to http_status_code
- Adds archive_filename to .all()
- Significant README updates for setup
- See also: archivesunleashed/aut-docs#39

9277e68

Commits on Jan 16, 2020

Use https for maven repo. (#405)

ruebot authored and ianmilligan1 committed Jan 16, 2020

- Looks like repos are forcing https to be used now:
[WARNING] repository metadata for: 'artifact joda-time:joda-time' could not be retrieved from repository: maven due to an error: Failed to transfer file: http://repo.maven.apache.org/maven2/joda-time/joda-time/maven-metadata.xml. Return code is: 501 , ReasonPhrase:HTTPS Required.

4c6875d

Commits on Jan 13, 2020

Test clean-up. (#404)

ruebot authored and ianmilligan1 committed Jan 13, 2020

```
- Clean-up variable names in RecordDFTest.scala
- Remove dos line endings on a number of files
```
75b7502

Commits on Jan 12, 2020

Add language detection column to webpages. (#403)

ruebot authored and ianmilligan1 committed Jan 12, 2020

```
- Addresses #402
```
bc0d663

Commits on Jan 10, 2020

Add more DataFrame Implementation Serializable APIs (#401).

SinghGursimran authored and ruebot committed Jan 10, 2020

- Partially addresses #223
- Add discardContentDF
- Add discardUrlPatternsDF
- Add discardLanguagesDF
- Add keepImagesDF
- Add keepContentDF
- Add keepUrlPatternsDF
- Add keepLanguagesDF
- Update tests

0ecc4f8
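
The keep/discard pairs listed in this commit are complementary filters. A minimal sketch of that idea for URL patterns, in plain Python rather than Spark; the function names `keep_url_patterns` and `discard_url_patterns`, and the choice to treat each pattern as a regular expression searched anywhere in the URL, are assumptions for illustration and not the project's actual implementation:

```python
import re
from typing import Iterable, List, Set


def _matches_any(url: str, patterns: Iterable[str]) -> bool:
    """Return True if any regular expression is found anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


def keep_url_patterns(urls: Iterable[str], patterns: Set[str]) -> List[str]:
    """Keep only the URLs that match at least one pattern (hypothetical helper)."""
    return [url for url in urls if _matches_any(url, patterns)]


def discard_url_patterns(urls: Iterable[str], patterns: Set[str]) -> List[str]:
    """Drop the URLs that match at least one pattern (hypothetical helper)."""
    return [url for url in urls if not _matches_any(url, patterns)]


urls = ["http://example.com/a.html", "http://archive.org/b.png"]
patterns = {r"archive\.org"}
print(keep_url_patterns(urls, patterns))
print(discard_url_patterns(urls, patterns))
```

Every input URL ends up in exactly one of the two results, which is the property the paired keep/discard naming suggests.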

Commits on Jan 8, 2020

Filter blank src/dest out of webgraph. (#400)

ruebot authored and ianmilligan1 committed Jan 8, 2020

3dc1545
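
A minimal sketch of what filtering blank source and destination values out of a web graph could look like, in plain Python rather than Spark. The edge layout `(crawl_date, src, dest)` and the helper name `drop_blank_edges` are assumptions for illustration, not the project's actual code:

```python
from typing import Iterable, List, Tuple

# Assumed edge layout: (crawl_date, src, dest).
Edge = Tuple[str, str, str]


def drop_blank_edges(edges: Iterable[Edge]) -> List[Edge]:
    """Keep only edges whose src and dest are both non-empty after trimming."""
    return [
        (date, src, dest)
        for date, src, dest in edges
        if src and src.strip() and dest and dest.strip()
    ]


edges = [
    ("20200108", "example.com", "archive.org"),
    ("20200108", "", "archive.org"),       # blank source
    ("20200108", "example.com", "   "),    # whitespace-only destination
]
print(drop_blank_edges(edges))
```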

Commits on Jan 7, 2020

Add more DF implementations for #223. (#399)

SinghGursimran authored and ruebot committed Jan 7, 2020

```
- Add discardHttpStatusDF
- Add keepMimeTypesDF
- Add keepMimeTypesTikaDF
- Update tests
```
be15375

Commits on Jan 5, 2020

Scala imports cleanup. (#398)

ruebot authored and ianmilligan1 committed Jan 5, 2020

d5c7bf7

Commits on Dec 29, 2019

Add more serializable APIs for DataFrames (#396)

SinghGursimran authored and ruebot committed Dec 29, 2019

```
- Partially addresses #223
- Add keepHttpStatusDF
- Add keepDateDF
- Add keepUrlsDF
- Add keepDomainsDF
- Add tests
```
b915f82

Commits on Dec 19, 2019

Remove redundant test; addresses #64. (#395)

ruebot authored and ianmilligan1 committed Dec 19, 2019

2c96ff3

Commits on Dec 18, 2019

Add doc comments for webpages and webgraph; resolves #392. (#394)

ruebot authored and ianmilligan1 committed Dec 18, 2019

99e9d06

Add additional filters for textFiles; resolves #362. (#393)

ruebot authored and ianmilligan1 committed Dec 18, 2019

```
* Add additional filters for textFiles; resolves #362.

- Add filedesc, and dns filter (arc files)
- Add test case
```
8eb43ff
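
A rough sketch of the kind of filter this commit describes: excluding ARC `filedesc` header records and `dns` records before treating records as text files. The prefix list and the helper names `is_text_candidate` and `filter_text_candidates` are assumptions for illustration, written in plain Python rather than the project's own code:

```python
from typing import Iterable, List

# Assumed URL prefixes: ARC header records use "filedesc:" URLs and
# DNS lookup records use "dns:" URLs.
SKIPPED_PREFIXES = ("filedesc:", "dns:")


def is_text_candidate(url: str) -> bool:
    """Return False for filedesc and dns records, True otherwise."""
    return not url.lower().startswith(SKIPPED_PREFIXES)


def filter_text_candidates(urls: Iterable[str]) -> List[str]:
    """Drop record URLs that should never be treated as text files."""
    return [url for url in urls if is_text_candidate(url)]


record_urls = [
    "filedesc://example.arc.gz",
    "dns:www.example.com",
    "http://www.example.com/robots.txt",
]
print(filter_text_candidates(record_urls))
```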

Commits on Dec 17, 2019

udf API implementations for DataFrame (#391)

SinghGursimran authored and ruebot committed Dec 17, 2019

```
- add discardMimeTypesDF
- add discardDateDF
- add discardUrlsDF
- add discardDomainsDF
- update tests
- addresses #223
```
40a59de

Add Serializable APIs for DataFrames (#389)

SinghGursimran authored and ruebot committed Dec 17, 2019

```
- Add keepValidPagesDF
- Add HTTP status code column to all()
- Add test for keepValidPagesDF
- Addresses #223
```
ca928d8

Add and update tests, resolve textFiles bug. (#388)

ruebot authored and ianmilligan1 committed Dec 17, 2019

- Add ExtractDateDF test
- Fix conditional logic of textFiles filter to resolve #390
- Add test for conditional logic fix for #390
- Remove cruft ExtractUrls, left over from Twitter analysis removal (see: https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
- Tweak null/nothing on a few tests

9e32284

Commits on Dec 5, 2019

Add new DataFrame matchbox udfs (#387)

SinghGursimran authored and ruebot committed Dec 5, 2019

- Add DetectLanguageDF
- Add ExtractBoilerpipeTextDF
- Add ExtractDateDF
- Update tests
- Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
- Partially addresses #223

079cd24

Commits on Nov 28, 2019

Clean-up underscore import, and scalastyle warnings. (#386)

ruebot authored and ianmilligan1 committed Nov 28, 2019

560ed2b

Commits on Nov 21, 2019

Add "Extract popular images" DataFrame implementation (#382).

SinghGursimran authored and ruebot committed Nov 21, 2019

```
- Add tests for ExtractPopularImagesDF
- Rename ExtractPopularImages to ExtractPopularImagesRDD
- Addresses #223
```
4042180

Add all() method and refactor DF UDFs (#383).

SinghGursimran authored and ruebot committed Nov 21, 2019

- Add `all()` DataFrame method 
- Refactor fixity DataFrame UDFs
- Add ComputeImageSize UDF
- Add Python implementation of `all()`
- Addresses #223

c4eaca9

Rename pages() to webpages(). (#384)

ruebot authored and ianmilligan1 committed Nov 21, 2019

```
- Part of work on #233
```
d8e8df3

Commits on Nov 19, 2019

Append UDF with RDD or DF. (#381)

ruebot authored and ianmilligan1 committed Nov 19, 2019

```
- Addresses #223
```
b98ba4b

Commits on Nov 18, 2019

Extend more Matchbox utilities to DataFrames (#380).

SinghGursimran authored and ruebot committed Nov 18, 2019

```
- Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
- Addresses #223
```
a081d7b

Commits on Nov 17, 2019

Rename DF functions to be consistent with Python DF functions. (#379)

ruebot authored and ianmilligan1 committed Nov 17, 2019

```
- Resolves #366
```
67ca17d

Commits on Nov 14, 2019

Finalize converting NER Classifier to WANE Format (#378).

SinghGursimran authored and ruebot committed Nov 14, 2019

```
- Fully resolves #297 
- Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
```
f9ce826
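
The label override described in this commit can be sketched as a simple lookup. This is an illustration in plain Python under stated assumptions, not the project's classifier code: the `(text, label)` input shape and the helper name `group_entities` are hypothetical, and only the three label-to-key mappings come from the commit message.

```python
import json
from typing import Dict, Iterable, List, Tuple

# Mapping taken from the commit message above.
LABEL_TO_KEY = {
    "PERSON": "persons",
    "LOCATION": "locations",
    "ORGANIZATION": "organizations",
}


def group_entities(tagged: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs under the renamed keys; skip other labels."""
    grouped: Dict[str, List[str]] = {key: [] for key in LABEL_TO_KEY.values()}
    for text, label in tagged:
        key = LABEL_TO_KEY.get(label)
        if key is not None:
            grouped[key].append(text)
    return grouped


tagged = [
    ("Ada Lovelace", "PERSON"),
    ("Toronto", "LOCATION"),
    ("Internet Archive", "ORGANIZATION"),
    ("yesterday", "DATE"),  # not one of the three mapped labels, so it is dropped
]
print(json.dumps(group_entities(tagged)))
```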

Commits on Nov 12, 2019

Add df ExtractLinks udf; resolves #238. (#377)

SinghGursimran authored and ruebot committed Nov 12, 2019

```
- Add df ExtractLinks udf
- Add test
```
c353dae

Commits on Nov 10, 2019

Update README.md (#376)

lintool authored and ruebot committed Nov 10, 2019

```
Tweaks the style of the license badge to look consistent with the other badges.
```
107def2

Commits on Nov 7, 2019

Change RemoveHttpHeader to RemoveHTTPHeader. (#374)

SinghGursimran authored and ruebot committed Nov 7, 2019

```
Resolves #368.
```
25ca5a9

Commits on Nov 6, 2019

Updates description. See archivesunleashed/aut-docs#18 (#373)

ruebot authored and ianmilligan1 committed Nov 6, 2019

08b9ae3

Commits on Nov 5, 2019

Align NER output to WANE format; addresses #297 (#361)

ruebot authored and ianmilligan1 committed Nov 5, 2019

- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output) 🤢

379cc68

Various UDF implementation and cleanup for DF. (#370)

lintool authored and ruebot committed Nov 5, 2019

- Replace ExtractBaseDomain with ExtractDomain
- Closes #367
- Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
- Closes #369
- Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
- Partially addresses #238
- Updates tests where necessary
- Punts on #368 UDF CaMeL cASe consistency issues

6686519

Commits on Oct 14, 2019

Update commons-compress to 1.19; CVE-2019-12402 (#365)

ruebot authored and ianmilligan1 committed Oct 14, 2019

4e8b41d

Commits on Oct 9, 2019

Add ComputeSHA1 method; resolves #363. (#364)

ruebot authored and ianmilligan1 committed Oct 9, 2019

```
- Update tests where needed
- Add SHA1 method to ExtractImageDetails
- Add SHA1 to DataFrames binary extraction and analysis
```
03ac99c
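
For readers unfamiliar with what a SHA1 helper over binary content does, here is a minimal sketch using Python's standard `hashlib`. The function name `compute_sha1` is a hypothetical stand-in and this is not the project's ComputeSHA1 code:

```python
import hashlib


def compute_sha1(data: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


print(compute_sha1(b"abc"))
# a9993e364706816aba3e25717850c26c9cd0d89d
```

A digest like this can sit next to an MD5 column so that identical binaries can be matched across crawls.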
