@@ -0,0 +1,26 @@
require_ci_to_pass: yes

precision: 2
round: down
range: "50...80"

project: yes
patch: yes
changes: no

conditional: yes
loop: yes
method: no
macro: no

layout: "header, diff"
behavior: default
require_changes: no
@@ -0,0 +1,20 @@
@@ -0,0 +1,27 @@
dist: xenial
language: java

- master

- openjdk11

- "echo $JAVA_OPTS"
- "export JAVA_OPTS=-Xmx512m"
- "export MAVEN_OPTS=-Dorg.slf4j.simpleLogger.defaultLogLevel=warn"

- mvn install -B -V
- mvn javadoc:jar
- mvn javadoc:test-aggregate
- mvn site

- bash <(curl -s

secure: CCjvzkv9khqeAIgbMjXnIoQi0qZ55K6RtxGk9bqzY+r/xiUTmgat9N9+Alyuq3kK9rNNoQZQwR9rOyvPf9ymkifFnGxSglBSLHXzpxnftwOCasB0wf0OkENYfa8BrDhSk9EZPHsfGNqtcb5tm6/hLK2Kd49qGkYkT1ct3O0jWwWsn0SOmyNh2znIxMwCGKUMmrk/opVEKLvXmZRM7jCStCzFRrfR/d0QrPa9MYOLaFy75bVK8NcIJd4s6seOMf9OifBnfE34FY9DOL8fWnZEIx9eG6ajMYDP+6gn/v9JOZoNybfTojrpsWqCK1ytItzeToMAz9n8ULB0sUXAY0zk5u1VMaWQa9w/769hwATkNv49GI/MLM2apJY2HaBvzPizWIrVpR89uilM+pxUaH51D94cnWjtVLaSt7BMJ1K/dy2hpEaBElmG0iWYsqpdpKTJkVCDOYxs8sumEFsvIUWcQkiuk5EKrxfAjqcUpf5yTvkhFtkiIU2oxf2sGXXVFGocM+dpzbFXlhmk76caeRD+tw9bNfDAbuy7JjEfVS7ls3gmUHu3298JZhfiR89YxBx7BDZ7Kr9vurdXaYihoCqkXykw8D7MiZGRcdMJbGmRGmsILho9KtlJJsP7BNG6W3uA/z5gRzlV3RJVjXigWDCpOUxp+TVNP9ug4ymmSf2g6cQ=
@@ -0,0 +1,81 @@
# Archives Unleashed Project Code of Conduct

## Our Pledge

* The Archives Unleashed Project believes in supporting an open, inclusive, and
diverse community which respects the experience, expertise, and knowledge of
all community members.
* The Archives Unleashed community is dedicated to providing a harassment-free
experience for everyone, and welcomes individuals regardless of age, body size,
disability, ethnicity, gender identity and expression, level of experience,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
* To foster respectful collaborations, this code of conduct applies to all
Archives Unleashed spaces, including, but not limited to, GitHub, Slack,
Medium, social media platforms, and meeting spaces, both online and off.
* Anyone who violates this code of conduct may be sanctioned or expelled from
these spaces at the discretion of the Archives Unleashed Project Team.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at [][version]

@@ -0,0 +1,53 @@
# Welcome!

If you are reading this document then you are interested in contributing to The Archives Unleashed Project. All contributions are welcome: use cases, documentation, code, patches, bug reports, feature requests, etc. You do not need to be a programmer to speak up!

### Use cases

If you would like to submit a use case for The Archives Unleashed Toolkit, please submit an issue [here](, and begin the issue title with "Use Case:".

### Documentation

You can contribute documentation in two different ways. One way is to create an issue [here]( and begin the issue title with "Documentation:".

### Request a new feature

To request a new feature you should [open an issue]( or create a use case as described in the _Use cases_ section above, and summarize the desired functionality. Begin the issue title with "Enhancement:".

### Report a bug

To report a bug you should [open an issue]( that summarizes the bug. Set the label to "bug".

To help us understand and fix the bug, please provide:

1. The steps to reproduce the bug, including details such as the version of The Archives Unleashed Toolkit you were using and whether you ran it on a single node or a cluster.
2. The expected behavior.
3. The actual, incorrect behavior.
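
The environment details requested in step 1 can be gathered with a short script. The following is a minimal sketch, not part of the project: `report_env` is a hypothetical helper name, and it assumes that `java` and `spark-submit`, if installed, are on your `PATH`. Tools that are missing are reported rather than treated as errors. It cannot detect the toolkit version, so add that by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print environment details that are useful in a bug report.
report_env() {
  # Operating system name and release
  echo "OS: $(uname -sr)"

  if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, so merge it into stdout and keep the first line
    echo "Java: $(java -version 2>&1 | head -n 1)"
  else
    echo "Java: not found"
  fi

  if command -v spark-submit >/dev/null 2>&1; then
    echo "Spark:"
    # The version banner may go to stderr, so merge both streams
    spark-submit --version 2>&1
  else
    echo "Spark: not found"
  fi
}

report_env
```

Paste the output into the issue alongside the steps to reproduce.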

Feel free to search the issue queue for existing issues (aka tickets) that already describe the problem; if there is such a ticket please add your information as a comment.

### Contribute code

_If you are interested in contributing code to The Archives Unleashed Toolkit but do not know where to begin:_

In this case you should [browse open issues](

Contributions to The Archives Unleashed Toolkit codebase should be sent as GitHub pull requests. See the _Create a pull request_ section below for details. If there is any problem with the pull request we can work through it using the commenting features of GitHub.

* For _small patches_, feel free to submit pull requests directly for those patches.
* For _larger code contributions_, please use the following process. The idea behind this process is to prevent any wasted work and catch design issues early on.

1. [Open an issue](, if a similar issue does not exist already. If a similar issue does exist, then you may consider participating in the work on the existing issue.
2. Comment on the issue with your plan for implementing the issue. Explain what pieces of the codebase you are going to touch and how everything is going to fit together.
3. The Archives Unleashed Toolkit committers will work with you on the design to make sure you are on the right track.
4. Implement your issue, create a pull request (see below), and iterate from there.

### Create a pull request

Take a look at [Creating a pull request]( In a nutshell you need to:

1. [Fork]( The Archives Unleashed Toolkit GitHub repository at []( to your personal GitHub account.
2. Commit any changes to your fork.
3. Send a [pull request]( to The Archives Unleashed Toolkit GitHub repository that you forked in step 1. If your pull request is related to an existing issue -- for instance, because you reported a [bug/issue]( earlier -- prefix the title of your pull request with the corresponding issue number (e.g. `issue-123: ...`). Please also reference the issue in the description of the pull request, using '#' plus the issue number (e.g. '#123'). Also try to pick an appropriate name for the branch from which you are issuing the pull request.
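
The title and issue-reference conventions in step 3 can be sketched in shell. The helper names `pr_title` and `pr_issue_ref` and the sample summary are hypothetical and used only for illustration; the `issue-123: ...` prefix and the `#123` reference follow the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the naming conventions described above.

# Build a pull request title prefixed with the issue number, e.g. "issue-123: ...".
pr_title() {
  local issue="$1"
  local summary="$2"
  printf 'issue-%s: %s\n' "$issue" "$summary"
}

# Build the issue reference to include in the pull request description, e.g. "#123".
pr_issue_ref() {
  local issue="$1"
  printf '#%s\n' "$issue"
}

pr_title 123 "Fix date parsing"   # prints: issue-123: Fix date parsing
pr_issue_ref 123                  # prints: #123
```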

You may want to read [Syncing a fork]( for instructions on how to keep your fork up to date with the latest changes of the upstream (official) `twut` repository.
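
As a rough sketch of what syncing a fork usually involves, the helper below prints the git commands for review instead of running them, so nothing changes until you execute them yourself. The helper name `sync_fork_commands` is hypothetical, `UPSTREAM_URL` is a placeholder for the real repository URL, the remote names `upstream` and `origin` are common conventions rather than requirements, and the default branch `master` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the git commands that bring a fork's branch
# up to date with the upstream repository.
# Arguments: upstream repository URL, then branch name (assumed default: master).
sync_fork_commands() {
  local upstream_url="$1"
  local branch="${2:-master}"
  echo "git remote add upstream ${upstream_url}"   # one-time setup
  echo "git fetch upstream"                        # download upstream history
  echo "git checkout ${branch}"                    # switch to the local branch
  echo "git merge upstream/${branch}"              # merge upstream changes
  echo "git push origin ${branch}"                 # update the fork on GitHub
}

# UPSTREAM_URL is a placeholder; substitute the real repository URL.
sync_fork_commands "UPSTREAM_URL" master
```

Run the printed commands from inside your local clone of the fork once you have checked them.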
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2019 Archives Unleashed

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -1 +1,88 @@
# twut
# twut

[![Build Status](](
[![Maven Central](](
[![Contribution Guidelines](](./

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

## Getting Started

### Easy

If you have Apache Spark ready to go, it's as easy as:

$ spark-shell --packages "io.archivesunleashed:twut:0.0.1-SNAPSHOT"

### A little less easy

You can download the [latest release here]( and include it like so:

$ spark-shell --jars /path/to/twut-0.0.1-SNAPSHOT-fatjar.jar

## Usage

`twut` expects Tweets to be supplied in a DataFrame.


    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview
          /_/

    Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> import io.archivesunleashed.twut._
    import io.archivesunleashed.twut._

    scala> val tweets = "/home/nruest/Projects/au/twut/src/test/resources/10-sample.jsonl"
    tweets: String = /home/nruest/Projects/au/twut/src/test/resources/10-sample.jsonl

    scala> val tweetsDF =
    19/12/02 13:38:51 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
    tweetsDF: org.apache.spark.sql.DataFrame = [contributors: string, coordinates: string ... 33 more fields]

    scala> twut.ids(tweetsDF).show
    | id_str|
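
Because the input is line-oriented JSON (one tweet object per line), a quick sanity check before loading a file into Spark is to count its records. This is a minimal sketch: `count_records` is a hypothetical helper, and the two-record sample is made up for the demonstration and is not real Twitter data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count records in a line-oriented JSON file.
# Each record is one JSON object on its own line, so the number of
# non-empty lines equals the number of records.
count_records() {
  grep -c . "$1"
}

# Made-up two-record sample, created only for this demonstration.
sample="$(mktemp)"
printf '%s\n' '{"id_str":"1","text":"first"}' '{"id_str":"2","text":"second"}' > "$sample"

count_records "$sample"   # prints: 2
rm -f "$sample"
```

If the count is far from what you expect, the file may be pretty-printed JSON rather than one object per line.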

## Documentation! Or, how do I use this?

Once built or downloaded, you can follow the basic set of recipes and tutorials [here](

# License

Licensed under the [Apache License, Version 2.0](

# Acknowledgments

This work is primarily supported by the [Andrew W. Mellon Foundation]( Other financial and in-kind support comes from the [Social Sciences and Humanities Research Council](, [Compute Canada](, the [Ontario Ministry of Research, Innovation, and Science](, [York University Libraries](, [Start Smart Labs](, and the [Faculty of Arts]( and [David R. Cheriton School of Computer Science]( at the [University of Waterloo](

Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
