PySpark implementation; resolves #1.

- Implement PySpark functionality for all existing functionality.
- Update documentation
- It'd be nice if I could sort out calling the Scala code like we do in
aut. Probably tied to #2.
ruebot committed Dec 10, 2019
1 parent d77eacb commit 3c37dfcc6fd8308a75bbc32eeb6473b8868ed71f
@@ -44,8 +44,21 @@ only showing top 2 rows

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.ids(df).show(2, False)
+-------------------+
|id_str |
+-------------------+
|1201505319257403392|
|1201505319282565121|
+-------------------+
```
## Extract User Information

Multi-column DataFrame containing the following columns: `favourites_count`, `followers_count`, `friends_count`, `id_str`, `location`, `name`, `screen_name`, `statuses_count`, and `verified`.
@@ -71,11 +84,24 @@ only showing top 2 rows

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.userInfo(df).show(2, False)
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|favourites_count|followers_count|friends_count|id_str |location|name |screen_name |statuses_count|verified|
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|8302 |101 |133 |1027887558032732161|nct🌱 |车美 |M_chemei |3720 |false |
|2552 |73 |218 |2548066344 |null |ひーこ☆禿げても愛せ|heeko_gr_029|15830 |false |
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
```
## Extract Tweet Text

Multi-column DataFrame containing the following columns: `full_text` (if it is available), and `text`.
Single-column or two-column DataFrame (`text`, and `full_text` if it is available) containing tweet text.

### Scala DF

@@ -93,12 +119,25 @@ text(tweetsDF).show(2, false)
|Baket ang pogi mo??? |
|今日すげぇな!#安元江口と夜あそび|
+---------------------------------+
only showing top 2 rows
```

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.text(df).show(2, False)
+---------------------------------+
|text |
+---------------------------------+
|Baket ang pogi mo??? |
|今日すげぇな!#安元江口と夜あそび|
+---------------------------------+
```

## Extract Tweet Times

@@ -125,10 +164,26 @@ only showing top 2 rows

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.times(df).show(2, False)
+------------------------------+
|created_at |
+------------------------------+
|Mon Dec 02 14:16:05 +0000 2019|
|Mon Dec 02 14:16:05 +0000 2019|
+------------------------------+
```
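
The `created_at` values come back as plain strings. A minimal sketch of turning one into a timezone-aware `datetime` with the Python standard library, assuming every value follows the layout shown in the output above. `CREATED_AT_FORMAT` and `parse_created_at` are hypothetical names used for illustration and are not part of twut.

```python
from datetime import datetime

# Layout of the strings in the output above, e.g. "Mon Dec 02 14:16:05 +0000 2019".
# Assumption: all `created_at` values use this layout.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a `created_at` string into a timezone-aware datetime (hypothetical helper)."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


parsed = parse_created_at("Mon Dec 02 14:16:05 +0000 2019")
print(parsed.isoformat())  # 2019-12-02T14:16:05+00:00
```

Note that `%a` and `%b` are locale-dependent, so this sketch expects an English (or default C) locale.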

## Extract Tweet Sources

Single-column DataFrame containing the source of the tweet.

### Scala DF

```scala
@@ -157,10 +212,34 @@ sources(tweetsDF).show(10, false)

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.sources(df).show(10, False)
+------------------------------------------------------------------------------------+
|source |
+------------------------------------------------------------------------------------+
|<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>|
|<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>|
|<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> |
|<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> |
|<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> |
|<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a> |
|<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> |
|<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> |
|<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>|
|<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> |
+------------------------------------------------------------------------------------+
```
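
Each `source` value is an HTML anchor, so the client name usually needs one more step before it can be counted or grouped. A minimal sketch using only the Python standard library, assuming every value is a single `<a>` element like the rows above. `parse_source` is a hypothetical helper and is not part of twut.

```python
from html.parser import HTMLParser


class _AnchorParser(HTMLParser):
    """Collects the href and the visible text of an anchor element."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_source(source):
    """Return (client_name, href) for a `source` string (hypothetical helper)."""
    parser = _AnchorParser()
    parser.feed(source)
    parser.close()
    return "".join(parser.text_parts).strip(), parser.href


name, href = parse_source(
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
)
print(name)  # Twitter for Android
print(href)  # http://twitter.com/download/android
```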

## Extract Hashtags

Single-column DataFrame containing hashtags.

### Scala DF

```scala
@@ -180,10 +259,25 @@ hashtags(tweetsDF).show

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.hashtags(df).show()
+------------------+
| hashtags|
+------------------+
|安元江口と夜あそび|
+------------------+
```
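
For reference, a plain-Python sketch of where hashtags sit in a single tweet record. It assumes the standard Twitter API v1.1 layout, in which hashtags are listed under `entities.hashtags` with a `text` field. Whether twut reads them exactly this way is not confirmed here, and both `hashtags_of` and the hand-built record are hypothetical.

```python
def hashtags_of(tweet):
    """Return the hashtag texts of one tweet dict (hypothetical helper).

    Assumption: hashtags live under tweet["entities"]["hashtags"][i]["text"].
    """
    entities = tweet.get("entities") or {}
    return [tag["text"] for tag in (entities.get("hashtags") or [])]


# Hand-built record for illustration; it is not copied from the sample file.
record = {
    "text": "今日すげぇな!#安元江口と夜あそび",
    "entities": {"hashtags": [{"text": "安元江口と夜あそび"}]},
}

print(hashtags_of(record))  # ['安元江口と夜あそび']
print(hashtags_of({"text": "Baket ang pogi mo???"}))  # []
```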

## Extract Urls

Single-column DataFrame containing urls.

### Scala DF

```scala
@@ -204,10 +298,34 @@ urls(tweetsDF).show(10, false)

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.urls(df).show(10, False)
+-----------------------+
|url |
+-----------------------+
|https://t.co/hONLvNozJg|
|https://t.co/mI5HYXLXwy|
|https://t.co/OLcG6Vu6dI|
|https://t.co/6KANFBbSa2|
|https://t.co/NRKfpoyubk|
|https://t.co/K66y3z4o8U|
|https://t.co/k0k7VndNzw|
|https://t.co/Vf7qICr4v0|
|https://t.co/fqRf3Og2qw|
|https://t.co/kC3XMKWBR8|
+-----------------------+
```

## Extract Animated Gif Urls

Single-column DataFrame containing animated gif urls.

### Scala DF

```scala
@@ -229,10 +347,27 @@ animatedGifUrls(tweetsDF).show(10, false)

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.animatedGifUrls(df).show(10, False)
+-----------------------------------------------------------+
|animated_gif_url |
+-----------------------------------------------------------+
|https://pbs.twimg.com/tweet_video_thumb/EKyat33U4AEpVFf.jpg|
|https://pbs.twimg.com/tweet_video_thumb/EKyQ1fAU8AM7r1I.jpg|
|https://pbs.twimg.com/tweet_video_thumb/EKyau1OU8AAD_OZ.jpg|
+-----------------------------------------------------------+
```

## Extract Image Urls

Single-column DataFrame containing image urls.

### Scala DF

```scala
@@ -241,7 +376,7 @@ import io.archivesunleashed._
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)
imageUrls(tweetsDF).show(10, false)
imageUrls(tweetsDF).show(5, false)
+-----------------------------------------------+
|image_url |
@@ -256,10 +391,27 @@ imageUrls(tweetsDF).show(10, false)

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.imageUrls(df).show(5, False)
+-----------------------------------------------+
|image_url |
+-----------------------------------------------+
|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|
|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|
|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|
|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|
|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|
+-----------------------------------------------+
```
## Extract Media Urls

Single-column DataFrame containing animated gif urls, image urls, and video urls.

### Scala DF

```scala
@@ -283,10 +435,29 @@ mediaUrls(tweetsDF).show(5, false)

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.mediaUrls(df).show(5, False)
+-----------------------------------------------+
|image_url |
+-----------------------------------------------+
|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|
|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|
|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|
|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|
|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|
+-----------------------------------------------+
```

## Extract Video Urls

Single-column DataFrame containing video urls.

### Scala DF

```scala
@@ -310,10 +481,29 @@ videoUrls(tweetsDF).show(5, false)

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
SelectTweet.videoUrls(df).show(5, False)
+---------------------------------------------------------------------------------------------------+
|video_url |
+---------------------------------------------------------------------------------------------------+
|https://video.twimg.com/ext_tw_video/1201113203125583872/pu/pl/mLQJE9rIBSE6DaQ_.m3u8?tag=10 |
|https://video.twimg.com/ext_tw_video/1201113203125583872/pu/vid/460x258/o5wbkNtC_yVBiGvM.mp4?tag=10|
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/pl/1LRDIgIbWofMDpOa.m3u8?tag=10 |
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/vid/360x638/KrMl6qgy_8ugHBW-.mp4?tag=10|
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/vid/320x568/lMOyqZH6fnCoDGzI.mp4?tag=10|
+---------------------------------------------------------------------------------------------------+
```

## Remove Sensitive Tweets

Filters out tweets labeled as sensitive.

### Scala DF

```scala
@@ -328,10 +518,20 @@ res0: Long = 246

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
FilterTweet.removeSensitive(df).count()
246
```

## Remove Retweets

Filters out retweets.

### Scala DF

```scala
@@ -346,10 +546,20 @@ res0: Long = 230

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
FilterTweet.removeRetweets(df).count()
230
```

## Remove Non-verified User Tweets

Filters out tweets from non-verified users.

### Scala DF

```scala
@@ -364,8 +574,15 @@ res0: Long = 5

### Python DF

**TODO**
```python
from twut import *
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)
FilterTweet.removeNonVerified(df).count()
5
```
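
The same idea can be sketched without Spark, which may help when checking a small JSONL file by hand. This assumes each line is one tweet object with a nested `user.verified` boolean, matching the `verified` column shown under Extract User Information. `is_verified` and `keep_verified` are hypothetical helpers, the sample records are made up, and none of this is the twut implementation.

```python
import json


def is_verified(tweet):
    """True when the tweet's user is marked verified (hypothetical helper)."""
    user = tweet.get("user") or {}
    return user.get("verified") is True


def keep_verified(lines):
    """Yield parsed tweets from JSONL lines, dropping non-verified users."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        if is_verified(tweet):
            yield tweet


# Made-up records for illustration only.
sample_lines = [
    json.dumps({"id_str": "1", "user": {"screen_name": "a", "verified": False}}),
    json.dumps({"id_str": "2", "user": {"screen_name": "b", "verified": True}}),
    json.dumps({"id_str": "3", "user": {"screen_name": "c"}}),
]

kept = list(keep_verified(sample_lines))
print([tweet["id_str"] for tweet in kept])  # ['2']
```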

## Work with DataFrame Results
