Tree: 8eb43ff055
Commits on Dec 18, 2019
  1. Add additional filters for textFiles; resolves #362. (#393)

    ruebot authored and ianmilligan1 committed Dec 18, 2019
    - Add filedesc and dns filters (ARC files)
    - Add test case
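The filedesc and dns filters exclude ARC metadata records from textFiles output. A minimal sketch of the idea in plain Python (the record structure and field names here are illustrative, not the library's API):

```python
# Sketch: drop ARC metadata records whose URL marks them as
# archive file headers ("filedesc:") or DNS lookups ("dns:").
def is_text_file_candidate(record):
    url = record.get("url", "")
    return not (url.startswith("filedesc:") or url.startswith("dns:"))

records = [
    {"url": "filedesc://example.arc.gz"},
    {"url": "dns:example.com"},
    {"url": "http://example.com/notes.txt"},
]
kept = [r for r in records if is_text_file_candidate(r)]
```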
Commits on Dec 17, 2019
  1. udf API implementations for DataFrame (#391)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - add discardMimeTypesDF
    - add discardDateDF
    - add discardUrlsDF
    - add discardDomainsDF
    - update tests
    - addresses #223
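The discard*DF functions are the inverse of the keep-style filters: they drop rows matching a condition. In Spark these are column filters on a DataFrame; the effect can be sketched in plain Python (names and row shape are illustrative):

```python
# Sketch: a discard-style filter removes rows whose value for a given
# field appears in a blocklist -- the inverse of a keep-style filter.
def discard_by(rows, field, blocklist):
    return [r for r in rows if r.get(field) not in blocklist]

rows = [
    {"url": "http://a.org/", "mime_type": "text/html"},
    {"url": "http://b.org/x.gif", "mime_type": "image/gif"},
]
html_only = discard_by(rows, "mime_type", {"image/gif"})
```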
  2. Add Serializable APIs for DataFrames (#389)

    SinghGursimran authored and ruebot committed Dec 17, 2019
    - Add keepValidPagesDF
    - Add HTTP status code column to all()
    - Add test for keepValidPagesDF
    - Addresses #223
  3. Add and update tests, resolve textFiles bug. (#388)

    ruebot authored and ianmilligan1 committed Dec 17, 2019
    - Add ExtractDateDF test
    - Fix conditional logic of textFiles filter to resolve #390
    - Add test for conditional logic fix for #390
    - Remove cruft ExtractUrls, left over from Twitter analysis removal
    (see:
    https://github.com/lintool/warcbase/blob/cab311ed8b0bceb666865fa76dd3bc5a86402e13/warcbase-core/src/test/scala/org/warcbase/spark/matchbox/ExtractUrlsTest.scala)
    - Tweak null/nothing on a few tests
Commits on Dec 5, 2019
  1. Add new DataFrame matchbox udfs (#387)

    SinghGursimran authored and ruebot committed Dec 5, 2019
    - Add DetectLanguageDF
    - Add ExtractBoilerpipeTextDF
    - Add ExtractDateDF
    - Update tests
    - Rename existing ExtractDate, ExtractBoilerpipeText, DetectLanguage udfs by appending RDD
    - Partially addresses #223
Commits on Nov 28, 2019
  1. Clean-up underscore import, and scalastyle warnings. (#386)

    ruebot authored and ianmilligan1 committed Nov 28, 2019
Commits on Nov 21, 2019
  1. Add "Extract popular images" DataFrame implementation (#382).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add tests for ExtractPopularImagesDF
    - Rename ExtractPopularImages to ExtractPopularImagesRDD
    - Addresses #223
  2. Add all() method and refactor DF UDFs (#383).

    SinghGursimran authored and ruebot committed Nov 21, 2019
    - Add `all()` DataFrame method 
    - Refactor fixity DataFrame UDFs
    - Add ComputeImageSize UDF
    - Add Python implementation of `all()`
    - Addresses #223
  3. Rename pages() to webpages(). (#384)

    ruebot authored and ianmilligan1 committed Nov 21, 2019
    - Part of work on #233
Commits on Nov 19, 2019
  1. Append UDF with RDD or DF. (#381)

    ruebot authored and ianmilligan1 committed Nov 19, 2019
    - Addresses #223
Commits on Nov 18, 2019
  1. Extend more Matchbox utilities to DataFrames (#380).

    SinghGursimran authored and ruebot committed Nov 18, 2019
    - Extend GetExtensionMime, ExtractImageLinks, ComputeMD5, and ComputeSHA1 to DataFrames
    - Addresses #223
Commits on Nov 17, 2019
  1. Rename DF functions to be consistent with Python DF functions. (#379)

    ruebot authored and ianmilligan1 committed Nov 17, 2019
    - Resolves #366
Commits on Nov 14, 2019
  1. Finalize converting NER Classifier to WANE Format (#378).

    SinghGursimran authored and ruebot committed Nov 14, 2019
    - Fully resolves #297 
    - Overrides NER Classifier output to PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations
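The WANE-format override maps the classifier's labels onto plural JSON keys. A sketch of that mapping, assuming a simple (label, text) entity list as input (the entity representation is illustrative):

```python
import json

# Sketch: map Stanford NER labels to WANE-style keys
# (PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations).
WANE_KEYS = {
    "PERSON": "persons",
    "LOCATION": "locations",
    "ORGANIZATION": "organizations",
}

def to_wane(entities):
    out = {key: [] for key in WANE_KEYS.values()}
    for label, text in entities:
        key = WANE_KEYS.get(label)
        if key:
            out[key].append(text)
    return json.dumps(out, sort_keys=True)

wane = to_wane([("PERSON", "Ada Lovelace"), ("LOCATION", "London")])
```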
Commits on Nov 12, 2019
  1. Add df ExtractLinks udf; resolves #238. (#377)

    SinghGursimran authored and ruebot committed Nov 12, 2019
    - Add df ExtractLinks udf
    - Add test
Commits on Nov 10, 2019
  1. Update README.md (#376)

    lintool authored and ruebot committed Nov 10, 2019
    Tweaks the style of the license badge to look consistent with the other badges.
Commits on Nov 7, 2019
  1. Change RemoveHttpHeader to RemoveHTTPHeader. (#374)

    SinghGursimran authored and ruebot committed Nov 7, 2019
    Resolves #368.
Commits on Nov 6, 2019
  1. Updates description. See archivesunleashed/aut-docs#18 (#373)

    ruebot authored and ianmilligan1 committed Nov 6, 2019
Commits on Nov 5, 2019
  1. Align NER output to WANE format; addresses #297 (#361)

    ruebot authored and ianmilligan1 committed Nov 5, 2019
    - Update Stanford core NLP
    - Format NER output in json
    - Add getPayloadDigest to ArchiveRecord
    - Add test for getPayloadDigest
    - Add payload digest to NER output
    - Remove extractFromScrapeText
    - Remove extractFromScrapeText test
    - TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output) 🤢
  2. Various UDF implementation and cleanup for DF. (#370)

    lintool authored and ruebot committed Nov 5, 2019
    - Replace ExtractBaseDomain with ExtractDomain
    - Closes #367
    - Address bug in ArcTest; RemoveHTML -> RemoveHttpHeader
    - Closes #369
    - Wraps RemoveHttpHeader and RemoveHTML for use in data frames.
    - Partially addresses #238
    - Updates tests where necessary
    - Punts on #368 UDF CaMeL cASe consistency issues
Commits on Oct 14, 2019
  1. Update commons-compress to 1.19; CVE-2019-12402 (#365)

    ruebot authored and ianmilligan1 committed Oct 14, 2019
Commits on Oct 9, 2019
  1. Add ComputeSHA1 method; resolves #363. (#364)

    ruebot authored and ianmilligan1 committed Oct 9, 2019
    - Update tests where needed
    - Add SHA1 method to ExtractImageDetails
    - Add SHA1 to DataFrames binary extraction and analysis
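Fixity columns like these are content digests of the extracted payload bytes. A self-contained sketch of computing them (column names are illustrative):

```python
import hashlib

# Sketch: fixity digests for an extracted binary payload, as exposed
# in the DataFrame binary extraction (field names are illustrative).
def digests(payload: bytes):
    return {
        "md5": hashlib.md5(payload).hexdigest(),
        "sha1": hashlib.sha1(payload).hexdigest(),
    }

d = digests(b"hello")
```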
Commits on Sep 11, 2019
  1. Update keepValidPages to include a filter on 200 OK. (#360)

    ruebot authored and ianmilligan1 committed Sep 11, 2019
    - Add status code filter to keepValidPages
    - Add MimeTypeTika to valid pages DF
    - Update tests since we filter more and better now 😄
    - Resolves #359
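With this change, a "valid page" must pass both a MIME-type check and an HTTP 200 check. A sketch of the combined predicate in plain Python (field names and string-typed status codes are illustrative, not the real schema):

```python
# Sketch: a "valid page" predicate combining the two checks --
# HTTP 200 OK and an HTML MIME type.
def is_valid_page(record):
    return (record.get("status_code") == "200"
            and record.get("mime_type") == "text/html")

pages = [
    {"url": "http://a.org/", "status_code": "200", "mime_type": "text/html"},
    {"url": "http://a.org/gone", "status_code": "404", "mime_type": "text/html"},
    {"url": "http://a.org/x.png", "status_code": "200", "mime_type": "image/png"},
]
valid = [p for p in pages if is_valid_page(p)]
```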
Commits on Sep 3, 2019
  1. Update to Spark 2.4.4 (#358)

    ruebot authored and ianmilligan1 committed Sep 3, 2019
Commits on Aug 23, 2019
  1. Add discardLanguage filter to RecordLoader. (#353)

    ruebot authored and ianmilligan1 committed Aug 23, 2019
    - Clean up doc comments
    - Add test
    - Resolves #352
Commits on Aug 22, 2019
  1. Improve test coverage. (#354)

    ruebot authored and ianmilligan1 committed Aug 22, 2019
    - Add tests for a few more filters in RecordLoader
    - Add binary extraction DataFrameLoader tests
Commits on Aug 21, 2019
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Aug 21, 2019
  2. Add binary extraction DataFrames to PySpark. (#350)

    ruebot authored and ianmilligan1 committed Aug 21, 2019
    - Address #190
    - Address #259
    - Address #302
    - Address #303
    - Address #304
    - Address #305
    - Address #306
    - Address #307
    - Resolves #350 
    - Update README
  3. Update LICENSE and license headers. (#351)

    ruebot authored and ianmilligan1 committed Aug 21, 2019
    - Update LICENSE file to full Apache 2.0 license
    - Reconfigure license-maven-plugin
    - Update all license headers in java and scala files to include
    copyright year, and project name
    - Move LICENSE_HEADER.txt to config
    - Update scalastyle config
Commits on Aug 18, 2019
  1. Add method for determining binary file extension. (#349)

    jrwiebe authored and ruebot committed Aug 18, 2019
    This PR implements the strategy described in the discussion of issue #343 to get an extension for a file described by a URL and a MIME type. It creates a GetExtensionMime object in the matchbox.
    
    This PR also removes most of the filtering by URL from the image, audio, video, presentation, spreadsheet, and word processor document extraction methods, since these were returning false positives. (CSV and TSV files are a special case, since Tika detects them as "text/plain" based on content.)
    
    Finally, I have inserted toLowerCase into the getUrl.endsWith() filter tests, which could possibly bring in some more CSV and TSV files.
    
    * Adds method for getting a file extension from a MIME type.
    * Add getExtensions method to DetectMimeTypeTika.
    * Matchbox object to get extension of URL
    * Use GetExtensionMime for extraction methods; minor fixes.
    * Remove tika-parsers classifier
    * Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases.
    * Fix GetExtensionMime case where URL has no extension but a MIME type is detected
    * Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`.
    * Remove filtering on URL for audio, video, and images.
    * Remove filtering on URL for images; add DF fields to image extraction
    * Remove saveImageToDisk and its test
    * Remove robots.txt check and extraneous imports
    * Close files so we don't get too many files open again.
    * Add GetExtensionMimeTest
    * Resolve #343
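The strategy described above — prefer the (lowercased) extension from the URL, and fall back to one derived from the detected MIME type when the URL has none — can be sketched with Python's standard library (this mirrors the idea, not the GetExtensionMime implementation):

```python
import mimetypes
import os
from urllib.parse import urlparse

# Sketch: resolve a file extension from a URL, falling back to the
# detected MIME type when the URL path carries no extension.
def get_extension(url, mime_type):
    ext = os.path.splitext(urlparse(url).path)[1].lower().lstrip(".")
    if ext:
        return ext
    guess = mimetypes.guess_extension(mime_type)
    return guess.lstrip(".") if guess else ""

pdf_ext = get_extension("http://example.com/report", "application/pdf")
url_ext = get_extension("http://example.com/IMG.JPEG", "image/jpeg")
```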
Commits on Aug 17, 2019
  1. Add keep and discard by http status. (#347)

    ruebot authored and ianmilligan1 committed Aug 17, 2019
    - Add keep and discard by http status RecordLoader
    - Add tests
    - Clean up/add doc comments in RecordLoader
    - Resolve #315
Commits on Aug 16, 2019
  1. Add office document binary extraction. (#346)

    ruebot authored and ianmilligan1 committed Aug 16, 2019
    - Add Word Processor DF and binary extraction
    - Add Spreadsheets DF and binary extraction
    - Add Presentation Program DF and binary extraction
    - Add Text files DF and binary extraction
    - Add tests for new DF and binary extractions
    - Add test fixtures for new DF and binary extractions
    - Resolves #303
    - Resolves #304
    - Resolves #305
    - Use aut-resources repo to distribute our shaded tika-parsers 1.22
    - Close TikaInputStream
    - Add RDD filters on MimeTypeTika values
    - Add CodeCov configuration yaml
    - Includes work by @jrwiebe, see #346 for all commits before squash
Commits on Aug 14, 2019
  1. Use version of tika-parsers without a classifier. (#345)

    jrwiebe authored and ruebot committed Aug 14, 2019
    Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact
    was unnecessary.
  2. Use Tika's detected MIME type instead of ArchiveRecord getMimeType. (#344)

    ruebot authored and ianmilligan1 committed Aug 14, 2019

    - Move audio, pdf, and video DF extraction to tuple map
    - Provide two MimeType columns; mime_type_web_server and mime_type_tika
    - Update tests
    - Resolves #342