Commits on Aug 11, 2018
  1. ExtractBoilerpipeText to remove headers as well. #253 (#256)

    greebie authored and ruebot committed Aug 11, 2018
    * ExtractBoilerpipeText now removes headers.
  2. Add additional tweet fields to TweetUtils; partially address #194. (#254)

    ruebot authored and ianmilligan1 committed Aug 11, 2018
    - Adds:
      - retweet_count
      - favorite_count
      - in_reply_to_status_id_str
      - in_reply_to_user_id_str
      - in_reply_to_screen_name
      - source
      - user.protected
      - user.profile_image_url
      - user.description
      - user.location
      - user.name
      - user.url
      - user.time_zone
    - Updates some doc comments
    - Updates tests
Commits on Aug 10, 2018
  1. Add support for full_text in tweets; resolve #192. (#252)

    ruebot authored and ianmilligan1 committed Aug 10, 2018
  2. Get rid of 'filesystem-root relative reference' warning. (#251)

    ruebot authored and ianmilligan1 committed Aug 10, 2018
Commits on Aug 9, 2018
  1. Remove stray characters from example commands. (#250)

    ruebot authored and ianmilligan1 committed Aug 9, 2018
  2. Deal with final scalastyle assessments, and Convert nulls to Option(T). (#249)

    greebie authored and ruebot committed Aug 9, 2018
    * Fully resolves #196 
    * Resolves #212
Commits on Aug 1, 2018
  1. Address main scalastyle errors - #196 (#248)

    greebie authored and ruebot committed Aug 1, 2018
    * Deal with wildcard import lint issues.
    * Fix some magic numbers & duplicate string runs.
    * Lint fixes, mostly explicit import warnings.
    * All other scalastyle issues require refactoring.
Commits on Jul 29, 2018
  1. Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 (#245)

    greebie authored and ianmilligan1 committed Jul 29, 2018
    * pom.xml change for GraphX
    * Changes for GraphXSLS
    * Changes for SLS graph
    * Changes for GraphX
    * Changes for converting WARC RDD to GraphX object
    * Rename extractor to ExtractGraphX
    * Various lint fixes (usually Magic Numbers)
    * Remove illegal imports from scala style (we use wildcard imports a lot)
    * Add WriteGraphXMLTest.
Commits on Jul 27, 2018
  1. Fix TravisCI build issues (#244)

    ruebot authored and ianmilligan1 committed Jul 27, 2018
    * Make the TravisCI build less verbose since we're hitting the 4MB log limit.
    * Pin site.plugin and project-info-reports.plugin so mvn site builds.
      - See:
        - https://stackoverflow.com/questions/51091539/maven-site-plugins-3-3-java-lang-classnotfoundexception-org-apache-maven-doxia
        - https://travis-ci.org/archivesunleashed/aut/jobs/408259462#L3201-L3202
Commits on May 28, 2018
  1. Data frame implementation of extractors. Also added cmd arguments to resolve #235 (#236)

    TitusAn authored and ruebot committed May 28, 2018
    * initial implementation
    * Data frame implementation of extractors.
    * fix documentation.
Commits on May 25, 2018
  1. Save images from dataframe to disk (#234)

    JWZ2018 authored and lintool committed May 25, 2018
    * Save images from dataframe to disk
    * Fix spacing
    * Move save images to inline
    * Refactor to chain and fix concurrency issue
    * Add save image test
    * Move saveToDisk to df
Commits on May 22, 2018
  1. Add missing dependencies in; addresses #227. (#233)

    ruebot authored and lintool committed May 22, 2018
Commits on May 21, 2018
  1. ArchiveRecord + impl moved into same Scala file; code cleanup. (#230)

    lintool authored and ruebot committed May 21, 2018
  2. Add Extract Image Details API (#226); Addresses #220

    JWZ2018 authored and ruebot committed May 21, 2018
    * Add Extract Image Details API
    * Change check for jpeg and fix spacing
    * Add tiff parser
    * Use AutoDetectParser and read Numeric fields
    * Use ComputeImageSize
    * Hex encode hash and base64 encode image bytes
    * Fix test
    * Change df column names
Commits on May 16, 2018
  1. Implement DomainFrequency, DomainGraph and PlainText extractor that can be run from command line (#225)

    TitusAn authored and lintool committed May 16, 2018
    * Resolves issue 195. Implement DomainFrequency, DomainGraph and PlainText extractor that can be run via command line in spark-submit, along with their tests
    
    * Restructure CommandLineAppRunner to make it more robust. Add option to write GEXF output for DomainGraphExtractor (enable via --output-format GEXF). Add support for multiple input files. Other polish and cleanup.
Commits on May 15, 2018
  1. Remove duplicate call of keepValidPages (#224)

    JWZ2018 authored and ruebot committed May 15, 2018
  2. Extract Image Links DF API + Test (#221)

    JWZ2018 authored and ruebot committed May 15, 2018
    * Extract Image Links DF API
    * Add extract image links test
    * Remove unnecessary comment from test
    * Add doc comments
    * Addresses #220
Commits on May 14, 2018
  1. Update Apache Spark to 2.3.0; resolves #218 (#219)

    ruebot authored and ianmilligan1 committed May 14, 2018
    - Update tests to use workaround for SPARK-2243
    - Comment out ExtractGraph test as per https://github.com/archivesunleashed/aut/pull/204/files#diff-4541b9834513985c360b64093fd45073
    - Align Hadoop version with Apache Spark pom.xml https://github.com/apache/spark/blob/branch-2.3/pom.xml#L120
  2. Resolve archivesunleashed/docker-aut#17 (#217)

    ruebot authored and ianmilligan1 committed May 14, 2018
Commits on May 2, 2018
  1. Create issue templates (#216)

    ruebot authored and ianmilligan1 committed May 2, 2018
    * Create issue templates
  2. Exposing Scala DataFrames in PySpark (#214); resolves #209.

    lintool authored and ruebot committed May 2, 2018
    * DataFrameLoader - provides bridge to PySpark.
    * Initial python classes for aut.
    * Better packaging of Python modules.
Commits on Apr 27, 2018
  1. Update project description; resolves #208. (#211)

    ruebot authored and ianmilligan1 committed Apr 27, 2018
  2. Initial DataFrames merge (#210); Partially addresses #190

    lintool authored and ruebot committed Apr 27, 2018
    * Initial stab at df.
    * Initial stab of what link extraction would look like with DFs.
    * Added test case.
    * Docs.
  3. Add more instructions on how to use things to the README. (#207)

    ruebot authored and lintool committed Apr 27, 2018
Commits on Apr 26, 2018
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Apr 26, 2018
  2. [maven-release-plugin] prepare release aut-0.16.0

    ruebot committed Apr 26, 2018
  3. Downgrade json4s-jackson to unbork Twitter analysis; Resolves issue #197, see also: json4s/json4s#316 (#205)

    lintool authored and ruebot committed Apr 26, 2018
  4. Resolves #199: mime-type was incorrectly parsed from content-type when cha… (#200)

    dportabella authored and ruebot committed Apr 26, 2018
  5. Update README.md (#202)

    lintool authored and ianmilligan1 committed Apr 26, 2018
  6. Code reformatting. (#201)

    lintool authored and ruebot committed Apr 26, 2018
Commits on Apr 11, 2018
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Apr 11, 2018
  2. [maven-release-plugin] prepare release aut-0.15.0

    ruebot committed Apr 11, 2018
  3. Improve and clean-up Scaladocs; resolves #184 (#193)

    ruebot authored and ianmilligan1 committed Apr 11, 2018
    - update gitignore
    - add site build to TravisCI config
    - add scalastyle config
    - improve scala docs on every scala file
    - incorporate @greebie's work on scaladocs
Commits on Apr 6, 2018
  1. make ArchiveRecord a trait (#186)

    helgeho authored and ruebot committed Apr 6, 2018
  2. Major refactoring of package structure (#189)

    lintool authored and ruebot committed Apr 6, 2018
    * io.archivesunleashed.spark -> io.archivesunleashed package renaming.
    * Resolves #188
    * Resolves #178 
    * Resolves #179 
    * Addresses #180