Tree: 86fb5433b1
Commits on Aug 18, 2019
  1. Close files so we don't get too many files open again.

    ruebot committed Aug 18, 2019
Commits on Aug 17, 2019
  1. Remove robots.txt check and extraneous imports

    jrwiebe committed Aug 17, 2019
  2. Merge branch 'master' into get-extension

    jrwiebe committed Aug 17, 2019
  3. Add keep and discard by http status. (#347)

    ruebot authored and ianmilligan1 committed Aug 17, 2019
    - Add keep and discard by http status RecordLoader
    - Add tests
    - Clean up/add doc comments in RecordLoader
    - Resolve #315
  4. Make saveImageToDisk() extension lowercase

    jrwiebe committed Aug 17, 2019
  5. Use detected MIME type

    jrwiebe committed Aug 17, 2019
Commits on Aug 16, 2019
  1. Insert `toLowerCase` into `getUrl.endsWith()` calls in io.archivesunleashed.packages; apply to `FilenameUtils.getExtension` in `GetExtensionMime`.

    jrwiebe committed Aug 16, 2019
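
The case-insensitive extension matching described in this commit can be illustrated with a short sketch. This is not the project's Scala code: it is a Python analogue using only the standard library, and the helper name `url_extension` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the lowercased file extension of a URL's path, without the dot.

    Hypothetical helper: lowercasing before comparison means that
    "photo.JPG" and "photo.jpg" are treated the same way.
    """
    path = urlsplit(url).path              # drop query string and fragment
    _, ext = posixpath.splitext(path)      # ext includes the leading dot
    return ext.lstrip(".").lower()


def has_extension(url: str, wanted: str) -> bool:
    """Case-insensitive check that a URL ends with the given extension."""
    return url_extension(url) == wanted.lower()


print(has_extension("http://example.com/IMG_0001.JPG", "jpg"))
```

Without the lowercasing step, a plain suffix comparison against `.jpg` would miss URLs whose extension is written in upper or mixed case.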
  2. Remove most filtering by file extension from binary extraction methods; add CSV/TSV special cases.

    jrwiebe committed Aug 16, 2019
  3. Comments

    jrwiebe committed Aug 16, 2019
  4. Bring up to date with master

    jrwiebe committed Aug 16, 2019
  5. Add office document binary extraction. (#346)

    ruebot authored and ianmilligan1 committed Aug 16, 2019
    - Add Word Processor DF and binary extraction
    - Add Spreadsheets DF and binary extraction
    - Add Presentation Program DF and binary extraction
    - Add Text files DF and binary extraction
    - Add tests for new DF and binary extractions
    - Add test fixtures for new DF and binary extractions
    - Resolves #303
    - Resolves #304
    - Resolves #305
    - Use aut-resources repo to distribute our shaded tika-parsers 1.22
    - Close TikaInputStream
    - Add RDD filters on MimeTypeTika values
    - Add CodeCov configuration yaml
    - Includes work by @jrwiebe, see #346 for all commits before squash
  6. Merge remote-tracking branch 'remotes/origin/master' into get-extension

    jrwiebe committed Aug 16, 2019
    # Conflicts:
    #	src/main/scala/io/archivesunleashed/matchbox/DetectMimeTypeTika.scala
Commits on Aug 14, 2019
  1. Use version of tika-parsers without a classifier. (#345)

    jrwiebe authored and ruebot committed Aug 14, 2019
    Ivy couldn't handle it, and specifying one for the custom tika-parsers artifact
    was unnecessary.
  2. Use Tika's detected MIME type instead of ArchiveRecord getMimeType. (#344)

    ruebot authored and ianmilligan1 committed Aug 14, 2019
    
    - Move audio, pdf, and video DF extraction to tuple map
    - Provide two MimeType columns; mime_type_web_server and mime_type_tika
    - Update tests
    - Resolves #342
Commits on Aug 13, 2019
  1. Add audio & video binary extraction (#341)

    ruebot authored and ianmilligan1 committed Aug 13, 2019
    - Add Audio & Video binary extraction.
    - Add filename and extension columns to audio, pdf, and video DF
    - Pass binary bytes instead of string to DetectMimeTypeTika in DF (s/getContentString/getBinaryBytes)
    - Updates saveToDisk to use file extension from DF column
    - Adds tests for Audio, PDF, and Video DF extraction
    - Add test fixtures for Audio, PDF, and Video DF extraction
    - Rename SaveBytesTest to SaveImageBytes test
    - Eliminate bytes->string->bytes conversion that was causing data loss in DetectMimeTypeTika
    - Update tika-parsers dep from JitPack
    - Remove tweet cruft
    - Resolves #306
    - Resolves #307
    - Includes work by @jrwiebe, see #341 for all commits before squash
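
The data loss from a bytes->string->bytes conversion mentioned in this commit can be reproduced in a few lines. The sketch below is a Python illustration, not the project's Scala code, and it assumes a UTF-8 decode with replacement of invalid sequences; the commit does not say which charset was involved.

```python
# The first eight bytes of every PNG file. 0x89 is not valid as the
# start of a UTF-8 sequence, so it cannot survive a text round trip.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def round_trip_via_string(data: bytes) -> bytes:
    """Decode bytes as UTF-8 text, replacing invalid sequences, then re-encode.

    Assumption for illustration: invalid bytes are replaced with U+FFFD
    rather than raising an error.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


damaged = round_trip_via_string(PNG_SIGNATURE)

print(PNG_SIGNATURE.hex())
print(damaged.hex())
print(damaged == PNG_SIGNATURE)  # False
```

Once the leading byte has been replaced, the file signature is gone, and any detector that relies on magic bytes no longer sees a PNG. Passing the original bytes straight through avoids the problem.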
Commits on Aug 12, 2019
  1. Add PDF binary extraction. (#340)

    jrwiebe authored and ruebot committed Aug 12, 2019
    Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome.
    
    - Updates getImageBytes to getBinaryBytes
    - Refactor SaveImage class to more general SaveBytes, and saveToDisk to saveImageToDisk
    - Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
    - Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
    - Added generic saveToDisk method to save a bytes column of a DataFrame to files
    - Updates tests
    - Resolves #302
    - Further addresses #308
    - Includes work by @ruebot, see #340 for all commits before squash
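
The idea of constructing an expensive detector only when it is first referenced, and reusing it afterwards, can be sketched as follows. This is a Python analogue rather than the project's Scala singleton; `ExpensiveDetector` and `get_detector` are hypothetical names, and the detection logic is a stand-in rather than Tika.

```python
from functools import lru_cache


class ExpensiveDetector:
    """Stand-in for a detector that is costly to construct."""

    instances_created = 0  # counts constructions, for demonstration only

    def __init__(self) -> None:
        ExpensiveDetector.instances_created += 1

    def detect(self, data: bytes) -> str:
        # Toy magic-byte check; real detection is far more involved.
        if data.startswith(b"%PDF-"):
            return "application/pdf"
        return "application/octet-stream"


@lru_cache(maxsize=None)
def get_detector() -> ExpensiveDetector:
    """Build the detector on first call and return the cached instance after."""
    return ExpensiveDetector()


first = get_detector()
second = get_detector()
print(first is second)  # True
print(ExpensiveDetector.instances_created)
```

Nothing is constructed at import time, so code paths that never detect a MIME type never pay the construction cost.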
Commits on Aug 8, 2019
  1. More scalastyle work; addresses #196. (#339)

    ruebot authored and ianmilligan1 committed Aug 8, 2019
    - Remove all underscore imports, except shapeless
    - Address all scalastyle warnings
    - Update scalastyle config for magic numbers, and null (only used in
    tests)
Commits on Aug 7, 2019
  1. Replace computeHash with ComputeMD5; resolves #333. (#338)

    ruebot authored and jrwiebe committed Aug 7, 2019
    * Replace computeHash with ComputeMD5; resolves #333.
    
    * I suppose these are redundant.
Commits on Aug 6, 2019
  1. Make ArchiveRecord.getContentBytes consistent, #334 (#335)

    ianmilligan1 authored and ruebot committed Aug 6, 2019
  2. Update Tika to 1.22; address security alerts. (#337)

    ruebot authored and ianmilligan1 committed Aug 6, 2019
    - Update Tika to 1.22
    - pom.xml surgery to get aut to build again with --packages
Commits on Jul 31, 2019
  1. Update test coverage for data frames (#336).

    ruebot authored and ianmilligan1 committed Jul 31, 2019
    - This commit will fall under @ruebot, but @jrwiebe did the heavy lifting here; see #336 for his commits before they were squashed down.
    - Resolves #265
    - Resolves #263
    - Update Scaladocs
Commits on Jul 25, 2019
  1. Enable S3 access (#332)

    jrwiebe authored and ruebot committed Jul 25, 2019
    * Update POM to access data stored in Amazon S3, per #319
    * In RecordLoader detect FileSystem based on path.
    * Resolves #319
Commits on Jul 23, 2019
  1. Updates to pom following 0e701b2 (#328)

    ruebot authored and ianmilligan1 committed Jul 23, 2019
    - Remove explicit Guava dependency (should have been removed in
    0e701b2)
    - Update Scala to 2.11.12; aligns with Spark 2.4.3
    - Update Scala test
    - Update Shapeless
    - Update Scala lang parsers
    - Fix a typo in a test
Commits on Jul 18, 2019
  1. Python formatting, and gitignore additions. (#326)

    ruebot authored and ianmilligan1 committed Jul 18, 2019
    - Run black and isort on Python files.
    - Move Spark config to example file.
    - Update gitignore for 7a61f0e
    additions.