Tree: 69007e2f88
Commits on May 19, 2020
  1. Implement Scala Matchbox UDFs in Python. (#463)

    ruebot committed May 19, 2020
    - Resolves #408
    - Alphabetizes DataFrameLoader functions
    - Alphabetizes UDF functions
    - Move DataFrameLoader to df packages
    - Move UDFs out of df into their own package
    - Rename UDFs (no more DF suffix on the names).
    - Update tests as necessary
    - Partially addresses #410, #409
    - Supersedes #412.
Commits on May 10, 2020
  1. Import clean-up for df package. (#462)

    ruebot committed May 10, 2020
Commits on May 4, 2020
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed May 4, 2020
  2. [skip travis] README updates (#460)

    ruebot committed May 4, 2020
    - `$` should only be used if output is also shown (mdl)
    - Add UserDoc badge, and yank buried documentation section
    - Additional formatting and typo fixes
  3. Set spark-submit app name to be "aut - extractorName". (#459)

    ruebot committed May 4, 2020
    - Resolves #458
Commits on Apr 27, 2020
  1. Add RemovePrefixWWWDF to DomainFrequencyExtractor. (#457)

    ruebot committed Apr 27, 2020
    - Resolves #456
    - Update test
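    For illustration only, the idea behind stripping a `www.` prefix before counting domain frequencies can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's implementation: `remove_prefix_www` and `domain_frequency` are hypothetical helper names, and the actual RemovePrefixWWWDF UDF and DomainFrequencyExtractor may behave differently.

    ```python
    from collections import Counter
    from urllib.parse import urlparse


    def remove_prefix_www(domain: str) -> str:
        """Strip one leading 'www.' label (hypothetical helper, not aut's code)."""
        return domain[4:] if domain.lower().startswith("www.") else domain


    def domain_frequency(urls):
        """Count URLs per domain after removing the 'www.' prefix."""
        domains = (remove_prefix_www(urlparse(u).netloc) for u in urls)
        # Skip URLs that have no network location at all.
        return Counter(d for d in domains if d)
    ```

    With this normalization, `http://www.example.com/a` and `http://example.com/b` are counted under the same domain rather than as two separate entries.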
Commits on Apr 23, 2020
Commits on Apr 22, 2020
  1. Add option to save to Parquet for app. (#454)

    ruebot committed Apr 22, 2020
    - Resolves #448
    - Update test
    - Add CSV headers to coalesce CSV output
    - Update README
  2. Update PlainTextExtractor to output a single column: text. (#453)

    ruebot committed Apr 22, 2020
    - Resolves #452
    - PlainTextExtractor runs ExtractBoilerplate on `content`
    - Update test
Commits on Apr 21, 2020
  1. Add a number of additional app extractors. (#451)

    ruebot committed Apr 21, 2020
    - Resolves #447
    - Add AudioInformationExtractor, ImageInformationExtractor,
    PDFInformationExtractor, PresentationProgramInformationExtractor,
    SpreadsheetInformationExtractor, TextFilesInformationExtractor,
    VideoInformationExtractor, WebGraphExtractor,
    WordProcessorInformationExtractor
    - Add tests for the new extractors
    - Update CommandLineApp to use new extractors
    - Add domain and language columns to WebPagesExtractor
    - Change "TEXT" to "csv"
    - Lower case "GEXF" and "GRAPHML"
Commits on Apr 20, 2020
  1. Remove RDD option in app; DataFrame only now. (#450)

    ruebot committed Apr 20, 2020
    - Resolves #449
    - Updates and renames tests where applicable
    - Update README to reflect updates
Commits on Apr 15, 2020
  1. [skip-travis] Add spark-submit option to README; resolves #444. (#446)

    ruebot committed Apr 15, 2020
  2. [maven-release-plugin] prepare for next development iteration

    ruebot committed Apr 15, 2020
Commits on Apr 14, 2020
  1. Remove WriteGraph; resolves #439. (#441)

    ruebot committed Apr 14, 2020
    * Cleanup WriteGraphML doc comments.
Commits on Apr 13, 2020
  1. Remove GraphX support; resolves #442. (#443)

    ruebot committed Apr 13, 2020
    - Remove graphx dependencies from pom
    - Remove ExtractGraphX and related tests
    - Remove WriteGraphXML and related tests
Commits on Apr 11, 2020
  1. Add graphml output to CommandLineApp and DomainGraphExtractor. (#438)

    ruebot committed Apr 11, 2020
    * Resolves #435
    * Adds GRAPHML option to CommandLineApp
    * Adds DataFrame method to DomainGraphExtractor
    * Updates CommandLineApp, and WriteGraphML tests
Commits on Apr 8, 2020
  1. Align RDD and DF output for DomainGraphExtractor. (#437)

    ruebot committed Apr 8, 2020
    - Resolves #436
    - Fix WWW prefix removal for RDD, which was double escaping
    - Update DF so it matches RDD output (it wasn't even close before
    🤦)
    - Update tests so they're basically testing the same thing
Commits on Apr 7, 2020
  1. Add imagegraph, and webgraph to command line app. (#432)

    ruebot committed Apr 7, 2020
    - Resolves #431
    - Adds webpages, and imagegraph to command line app
    - Adds tests for new functionality
    - Clean-up doc comments
    - Convert files with dos line endings to unix line endings
    - Update CommandLineApp tests
Commits on Mar 23, 2020
  1. Tweak hasDate to handle Seq. (#430)

    ruebot committed Mar 23, 2020
    - Addresses #425
    - Add test for hasDate
Commits on Mar 18, 2020
  1. Restyle keep/discard filter UDFs in the context of DataFrames (#429)

    ruebot committed Mar 18, 2020
    Co-authored-by: g285sing <g285sing@student.cs.uwaterloo.ca> (@SinghGursimran)
    
    - Resolves #425
    - Replace all keep/discard DF udfs with `hasXYZ()`
    - Update tests
Commits on Feb 20, 2020
  1. Update Spark and Hadoop versions. (#426)

    ruebot committed Feb 20, 2020
    - Update Spark to 2.4.5
    - Update Hadoop to 2.7.4 (for RADOS/S3 support)
    - Tweak README
Commits on Feb 12, 2020
  1. Add logic so UDFs that filter on url also filter on src. (#424)

    SinghGursimran and ruebot committed Feb 12, 2020
    - Resolves #418 
    - Update tests
    
    Co-authored-by: Nick Ruest <ruestn@gmail.com>
Commits on Feb 11, 2020
  1. [skip travis] Add pre-print link to README. (#423)

    ruebot committed Feb 11, 2020
Commits on Feb 10, 2020
  1. Add img alt text to imagegraph(); resolves #420. (#422)

    ruebot committed Feb 10, 2020
    - Update ExtractImageLinksRDD to grab alt text
    - Add alt_text column to imagegraph
    - Update tests
  2. Rename imageLinks to imagegraph; resolves #419 (#421)

    ruebot committed Feb 10, 2020
Commits on Feb 6, 2020
  1. [maven-release-plugin] prepare for next development iteration

    ruebot committed Feb 5, 2020
Commits on Feb 5, 2020
Commits on Jan 21, 2020
  1. Clean up test descriptions, addresses #372. (#416)

    ruebot authored and ianmilligan1 committed Jan 21, 2020
    - Clean up test descriptions
    - Rename typo filename
  2. Add ExtractImageDetailsDF. (#415)

    SinghGursimran authored and ruebot committed Jan 21, 2020
    - Add test
    - Addresses #223
Commits on Jan 18, 2020
  1. Add crawl_date to binary DataFrames and imageLinks. (#414)

    ruebot authored and ianmilligan1 committed Jan 18, 2020
    - Resolves #413
    - Update tests where necessary
Commits on Jan 17, 2020
  1. Various DataFrame implementation updates for documentation clean-up; Addresses #372.

    ruebot authored and ianmilligan1 committed Jan 17, 2020
    
    - Rename .all() column HttpStatus to http_status_code
    - Adds archive_filename to .all()
    - Significant README updates for setup
    - See also: archivesunleashed/aut-docs#39