Tree: 73981a79bb
Commits on Aug 12, 2019
  1. Add PDF binary extraction. (#340)

    jrwiebe authored and ruebot committed Aug 12, 2019
    Introduces the new extractPDFDetailsDF() method and brings in changes to make our use of Tika's MIME type detection more efficient, as well as POM updates to use a shaded version of tika-parsers in order to eliminate a dependency version conflict that has long been troublesome.
    
    - Updates getImageBytes to getBinaryBytes
    - Refactor SaveImage class to more general SaveBytes, and saveToDisk to saveImageToDisk
    - Only instantiate Tika when the DetectMimeTypeTika singleton object is first referenced. See https://git.io/fj7g0.
    - Use TikaInputStream to enable container-aware detection. Until now we were only using the default Mime Magic detection. See https://tika.apache.org/1.22/detection.html#Container_Aware_Detection.
    - Added generic saveToDisk method to save a bytes column of a DataFrame to files
    - Updates tests
    - Resolves #302
    - Further addresses #308
    - Includes work by @ruebot, see #340 for all commits before squash
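
The "only instantiate Tika when the singleton is first referenced" change above is an instance of lazy initialization. A minimal sketch of the idea follows, in Python for illustration only (the project itself is Scala). `ExpensiveDetector` and `get_detector` are hypothetical names, and the magic-number check is a toy stand-in, not Tika's detection logic.

```python
from functools import lru_cache


class ExpensiveDetector:
    """Hypothetical stand-in for a detector that is costly to construct."""

    instances = 0  # counts how many times the detector has been built

    def __init__(self):
        ExpensiveDetector.instances += 1

    def detect(self, data: bytes) -> str:
        # Toy magic-number check; a real detector inspects far more.
        if data.startswith(b"%PDF-"):
            return "application/pdf"
        return "application/octet-stream"


@lru_cache(maxsize=None)
def get_detector() -> ExpensiveDetector:
    # The detector is built on the first call and reused afterwards.
    return ExpensiveDetector()
```

Nothing is constructed at import time; the first `get_detector()` call pays the construction cost and every later call returns the same object.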
Commits on Aug 8, 2019
  1. More scalastyle work; addresses #196. (#339)

    ruebot authored and ianmilligan1 committed Aug 8, 2019
    - Remove all underscore imports, except shapeless
    - Address all scalastyle warnings
    - Update scalastyle config for magic numbers, and null (only used in
    tests)
Commits on Aug 7, 2019
  1. Replace computeHash with ComputeMD5; resolves #333. (#338)

    ruebot authored and jrwiebe committed Aug 7, 2019
    * Replace computeHash with ComputeMD5; resolves #333.
    
    * I suppose these are redundant.
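
For readers unfamiliar with what an MD5 helper such as ComputeMD5 produces, here is a minimal sketch in Python (illustration only; the project itself is Scala). The function name `compute_md5` is hypothetical, and the log does not say how the project formats its digest.

```python
import hashlib


def compute_md5(data: bytes) -> str:
    # Return the MD5 digest of the bytes as a 32-character hex string.
    return hashlib.md5(data).hexdigest()
```

Identical byte content always yields the same digest, which is what makes it usable as a fingerprint for duplicate detection.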
Commits on Aug 6, 2019
  1. Make ArchiveRecord.getContentBytes consistent,#334 (#335)

    ianmilligan1 authored and ruebot committed Aug 6, 2019
  2. Update Tika to 1.22; address security alerts. (#337)

    ruebot authored and ianmilligan1 committed Aug 6, 2019
    - Update Tika to 1.22
    - pom.xml surgery to get aut to build again with --packages
Commits on Jul 31, 2019
  1. Update test coverage for data frames (#336).

    ruebot authored and ianmilligan1 committed Jul 31, 2019
    - This commit will fall under @ruebot, but @jrwiebe did the heavy lifting here; see #336 for his commits before they were squashed down.
    - Resolves #265
    - Resolves #263
    - Update Scaladocs
Commits on Jul 25, 2019
  1. Enable S3 access (#332)

    jrwiebe authored and ruebot committed Jul 25, 2019
    * Update POM to access data stored in Amazon S3, per #319
    * In RecordLoader detect FileSystem based on path.
    * Resolves #319
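
Detecting the file system from the path, as the S3 commit above describes, generally comes down to reading the URI scheme. A minimal sketch in Python (illustration only; the actual RecordLoader is Scala, and how it does this is not shown in this log). The function name `filesystem_scheme` and the `"file"` default are assumptions.

```python
from urllib.parse import urlparse


def filesystem_scheme(path: str, default: str = "file") -> str:
    # Paths such as "s3a://bucket/key" carry their file system in the scheme;
    # a bare path such as "/data/x.warc.gz" has none, so fall back to a default.
    scheme = urlparse(path).scheme
    return scheme if scheme else default
```

With this, `filesystem_scheme("s3a://bucket/warcs/a.warc.gz")` selects `s3a`, while a plain local path falls back to the default.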
Commits on Jul 23, 2019
  1. Updates to pom following 0e701b2 (#328)

    ruebot authored and ianmilligan1 committed Jul 23, 2019
    - Remove explicit Guava dependency (should have been removed in
    0e701b2)
    - Update Scala to 2.11.12; aligns with Spark 2.4.3
    - Update Scala test
    - Update Shapeless
    - Update Scala lang parsers
    - Fix a typo in a test
Commits on Jul 18, 2019
  1. Python formatting, and gitignore additions. (#326)

    ruebot authored and ianmilligan1 committed Jul 18, 2019
    - Run black and isort on Python files.
    - Move Spark config to example file.
    - Update gitignore for 7a61f0e
    additions.
  2. Move data frame fields names to snake_case. (#327)

    ruebot authored and ianmilligan1 committed Jul 18, 2019
    - Resolves #229
Commits on Jul 17, 2019
  1. Update to Spark 2.4.3 and update Tika to 1.20. (#321)

    ruebot authored and ianmilligan1 committed Jul 17, 2019
    * Update to Spark 2.4.3 and update Tika to 1.20.
    
    - Resolves #295
    - Resolves #308
    - Resolves #286
    - Pulls in unfinished work by @jrwiebe and @borislin.
    
    * Add patched lang-detector
Commits on Jul 15, 2019
  1. Remove Tweet utils. (#323)

    ruebot authored and ianmilligan1 committed Jul 15, 2019
    - Resolves #322
    - Resolves #206
    - Resolves #194
Commits on Jul 8, 2019
  1. Test Java 8 & 11, and remove OracleJDK; resolves #324. (#325)

    ruebot authored and ianmilligan1 committed Jul 8, 2019
Commits on Jul 5, 2019
  1. Add image analysis and extraction w/TensorFlow (#318)

    h324yang authored and ruebot committed Jul 5, 2019
Commits on Apr 22, 2019
  1. Makes ArchiveRecordImpl serializable by removing non-serializable ARC…

    jrwiebe authored and ruebot committed Apr 22, 2019
    …Record and WARCRecord variables. Also removes unused headerResponseFormat variable. (#316)
Commits on Mar 23, 2019
  1. Resolve cobertura-maven-plugin class issue; resolves #313. (#314)

    ruebot authored and jrwiebe committed Mar 23, 2019
    - Exclude slf4j binding logback-classic
    (mojohaus/cobertura-maven-plugin#6 (comment))
Commits on Mar 18, 2019
Commits on Jan 31, 2019
  1. Log closing of ARC and WARC files, resolves #156 (#301)

    jrwiebe authored and ruebot committed Jan 31, 2019
    * Log opening and closing of archive files as per #156
    * Remove redundant log message. Spark already logs the file that is to be read when an executor computes an RDD.
Commits on Jan 24, 2019
  1. Delete saved image file; resolves #299 (#300)

    jrwiebe authored and ruebot committed Jan 24, 2019
Commits on Nov 28, 2018
  1. Remove Deprecated ExtractGraph app; resolves #291. (#293)

    greebie authored and ruebot committed Nov 28, 2018
    * Remove deprecated ExtractGraph.scala file.
    * Remove deprecated ExtractGraphTest.scala file.
  2. Add .getHttpStatus and .getArchiveFile to ArchiveRecordImpl class #198

    greebie authored and ruebot committed Nov 28, 2018
    …& #164 (#292)
    
    * Resolves #198
    * Resolves #164
    * Add getHttpStatus to ArchiveRecord class & trait
      - add .getHttpStatus to potential outputs
      - add tests for .getHttpStatus calls
      - improve ArchiveRecord testing overall.
    * Add .getArchiveFile feature to ArchiveRecordImpl.
      - add getArchiveFile to trait
      - add getArchiveFile for ArchiveRecordImpl
      - add tests for getArchiveFile.
    * Other code style fixes.
    * Include updates to tests.
Commits on Nov 22, 2018
  1. Update license headers for #208. (#290)

    ruebot authored and ianmilligan1 committed Nov 22, 2018
  2. Change Id generation for graphs from using hashes for urls to using .…

    greebie authored and ruebot committed Nov 22, 2018
    …zipWithUniqueIds() (#289)
    
    * Resolves #243 
    * Create GEXF with proper ids instead of hash to avoid collisions.
    * Add WriteGEXF files.
    * Add WriteGraph file and test.
    * Add test for Graphml output.
    * Add xml escaping for edges.
    * Add test case for non-escaped edges.
    * Add additional tests to cover for more potential cases of graphml and gexf files.
    * Coverage for null cases in urls.
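
The two ideas in the graph commit above, collision-free node ids and XML escaping, can be sketched as follows. This is Python for illustration only (the project itself is Scala). It swaps in a plain sequential counter for Spark's `zipWithUniqueId`, which guarantees unique ids but not consecutive ones. `assign_ids` and `node_xml` are hypothetical names, and the element layout is a simplified assumption rather than the project's actual GEXF output.

```python
from xml.sax.saxutils import quoteattr


def assign_ids(urls):
    # Give each distinct URL its own integer id. Unlike a hash of the URL,
    # two different URLs can never end up with the same id.
    ids = {}
    for url in urls:
        if url not in ids:
            ids[url] = len(ids)
    return ids


def node_xml(node_id: int, url: str) -> str:
    # quoteattr escapes &, < and > and wraps the value in quotes,
    # so URLs with query strings do not break the XML.
    return "<node id=\"%d\" label=%s />" % (node_id, quoteattr(url))
```

Without the escaping step, a URL containing `&` would produce a file that XML parsers reject.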
Commits on Oct 19, 2018
  1. CVE-2018-11771 update (#288)

    ruebot authored and ianmilligan1 committed Oct 19, 2018
Commits on Oct 18, 2018
  1. CVE-2017-17485 update; follow-on to #281. (#287)

    ruebot authored and ianmilligan1 committed Oct 18, 2018
Commits on Oct 17, 2018
  1. Update Apache Tika - security vulnerabilities; resolves #131. (#285)

    ruebot authored and ianmilligan1 committed Oct 17, 2018
    - CVE-2018-1338
    - CVE-2018-11762
    - CVE-2018-11761
    - CVE-2016-6809
    - CVE-2018-1339
    - CVE-2018-11796
    - CVE-2016-4434
    - CVE-2018-1335
  2. Only trigger TravisCI on master. (#283)

    ruebot authored and ianmilligan1 committed Oct 17, 2018
  3. [skip travis] Update README (#284)

    ruebot authored and ianmilligan1 committed Oct 17, 2018
  4. Fix bug and unit test for ExtractDomain; resolves #277 (#278)

    borislin authored and ruebot committed Oct 17, 2018
  5. Replace backslash with forward slash in URL; resolves #269 (#276)

    borislin authored and ruebot committed Oct 17, 2018
    * Fix backslash in URL
    * Add backslash test in ExtractDomainTest
  6. Missed something for #208. (#282)

    ruebot authored and ianmilligan1 committed Oct 17, 2018
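
The backslash fix in the Oct 17 commits above can be sketched as a normalization step before the host is extracted. This is Python for illustration only (the project's ExtractDomain is Scala, and its exact behaviour is not shown in this log). The function name `extract_domain` and the empty-string fallback are assumptions.

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    # Some crawled URLs use "\" where "/" was intended; normalize first so
    # the path separator is not mistaken for part of the host name.
    cleaned = url.replace("\\", "/")
    host = urlparse(cleaned).hostname
    return host or ""
```

Without the replacement, a URL like `http://www.example.com\index.html` has no `/` after the host, so the backslash and everything after it would be treated as part of the host name.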
Commits on Oct 16, 2018
  1. CVE-2018-7489 fix. (#281)

    ruebot authored and ianmilligan1 committed Oct 16, 2018
  2. Update jackson-databind version; resolves #279. (#280)

    ruebot authored and ianmilligan1 committed Oct 16, 2018
Commits on Oct 9, 2018
  1. Clean-up pom.xml to remove plugin warnings; resolves #273. (#274)

    ruebot authored and ianmilligan1 committed Oct 9, 2018
Commits on Oct 4, 2018
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Oct 4, 2018