More complete Twitter Ingestion #194

greebie · Apr 6, 2018

ruebot · Aug 10, 2018

I think we can strike retweeted from this list. Looking at a few of my really large datasets, I'm not seeing retweeted equal anything but false.
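For anyone who wants to repeat this check on their own dataset, here is a minimal Python sketch. It is not from the thread: the helper names `count_field_values` and `count_in_gzip` are hypothetical, and it assumes the file is gzipped, line-delimited tweet JSON like the WomensMarch file used elsewhere in this issue.

```python
import gzip
import json
from collections import Counter


def count_field_values(lines, field):
    """Tally the values of one top-level field across line-delimited tweet JSON."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # A missing field is counted under None.
        counts[tweet.get(field)] += 1
    return counts


def count_in_gzip(path, field):
    """Same tally, reading from a gzipped file (the path is supplied by the caller)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_field_values(handle, field)


# Tiny made-up sample, only to show the shape of the result.
sample = [
    '{"id_str": "1", "retweeted": false}',
    '{"id_str": "2", "retweeted": false}',
    '{"id_str": "3"}',
]
retweeted_counts = count_field_values(sample, "retweeted")
```

If the tally for a real dataset only ever has `False` (and possibly `None`) as keys, that supports dropping the field.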

ianmilligan1 · Aug 10, 2018

Sounds good to me @ruebot!

ruebot · Aug 10, 2018

place is a nested object rather than a single value. What do we want out of it?

Example:

$ zcat WomensMarch-20170123.json.gz | jq .place
{
  "full_name": "Holden, MA",
  "url": "https://api.twitter.com/1.1/geo/id/00173cd41f2b16d3.json",
  "country": "United States",
  "place_type": "city",
  "bounding_box": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -71.918466,
          42.312204
        ],
        [
          -71.918466,
          42.400332
        ],
        [
          -71.801795,
          42.400332
        ],
        [
          -71.801795,
          42.312204
        ]
      ]
    ]
  },
  "country_code": "US",
  "attributes": {},
  "id": "00173cd41f2b16d3",
  "name": "Holden"
}

ruebot · Aug 10, 2018

Each of the entities options is an array, so we'll have to map all of those out as well.

  "entities": {
    "hashtags": [],
    "symbols": [],
    "user_mentions": [],
    "urls": []
  }

user_mentions

$ zcat WomensMarch-20170123.json.gz | head -n 1 | jq .entities.user_mentions
[
  {
    "indices": [
      0,
      15
    ],
    "id_str": "18395177",
    "screen_name": "solangeknowles",
    "name": "solange knowles",
    "id": 18395177
  },
  {
    "indices": [
      16,
      28
    ],
    "id_str": "800942537083068416",
    "screen_name": "womensmarch",
    "name": "Women's March",
    "id": 800942537083068400
  }
]
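One detail in this output is worth flagging: for the second mention, the numeric id printed by jq (800942537083068400) does not match id_str (800942537083068416), presumably because the large integer was rounded when printed as a number. A minimal Python sketch of reading the id defensively follows; the helper name `mention_id` is hypothetical, and preferring id_str is an assumption of this sketch rather than something decided in the thread.

```python
def mention_id(mention):
    """Return a mention's user id as a string, preferring id_str over id."""
    id_str = mention.get("id_str")
    if id_str:
        return id_str
    # Fall back to the numeric field only when id_str is absent.
    return str(mention["id"])


# Values copied from the jq output above.
solange = {"id_str": "18395177", "screen_name": "solangeknowles", "id": 18395177}
womensmarch = {
    "id_str": "800942537083068416",
    "screen_name": "womensmarch",
    "id": 800942537083068400,
}

# True for the second mention: the printed number and the string disagree.
ids_disagree = str(womensmarch["id"]) != womensmarch["id_str"]
```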

hashtags

$ zcat WomensMarch-20170123.json.gz | head -n 1 | jq .entities.hashtags
[
  {
    "indices": [
      12,
      21
    ],
    "text": "BREAKING"
  },
  {
    "indices": [
      88,
      99
    ],
    "text": "TrumpLeaks"
  },
  {
    "indices": [
      100,
      117
    ],
    "text": "alternativefacts"
  },
  {
    "indices": [
      118,
      124
    ],
    "text": "amjoy"
  }
]

urls

$ zcat WomensMarch-20170123.json.gz | head -n 1 | jq .entities.urls
[
  {
    "url": "https://t.co/psP0GzBgZB",
    "indices": [
      100,
      123
    ],
    "expanded_url": "http://www.trendinalia.com/twitter-trending-topics/singapore/singapore-today.html",
    "display_url": "trendinalia.com/twitter-trendi…"
  }
]

greebie · Aug 10, 2018

I think country_code, full_name, and coordinates for location.
I think id and screen_name for mentions.
Just the text for the hashtags. If people really care about where the hashtag was placed, they should probably use a Twitter-specific tool.
expanded_url is all we need from the urls.
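The selection proposed here can be sketched in Python as follows. This is only an illustration, not the aut implementation: the function name `select_fields` and the output keys are hypothetical, the sample tweet is assembled from values shown earlier in this thread, and id_str is used in place of id for mentions as an assumption of the sketch.

```python
def select_fields(tweet):
    """Pull out only the place and entity fields proposed in the thread."""
    place = tweet.get("place") or {}
    box = place.get("bounding_box") or {}
    entities = tweet.get("entities") or {}
    return {
        "country_code": place.get("country_code"),
        "full_name": place.get("full_name"),
        "coordinates": box.get("coordinates"),
        # (id, screen_name) pairs; id_str is assumed here instead of id.
        "mentions": [
            (m.get("id_str"), m.get("screen_name"))
            for m in entities.get("user_mentions", [])
        ],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


# Sample assembled from the jq examples in this thread (not one real tweet).
tweet = {
    "place": {
        "full_name": "Holden, MA",
        "country_code": "US",
        "bounding_box": {
            "type": "Polygon",
            "coordinates": [[
                [-71.918466, 42.312204],
                [-71.918466, 42.400332],
                [-71.801795, 42.400332],
                [-71.801795, 42.312204],
            ]],
        },
    },
    "entities": {
        "hashtags": [{"indices": [12, 21], "text": "BREAKING"}],
        "symbols": [],
        "user_mentions": [
            {"indices": [0, 15], "id_str": "18395177", "screen_name": "solangeknowles"}
        ],
        "urls": [{
            "url": "https://t.co/psP0GzBgZB",
            "expanded_url": "http://www.trendinalia.com/twitter-trending-topics/singapore/singapore-today.html",
        }],
    },
}
selected = select_fields(tweet)
```

A tweet with no place or entities yields None for the place fields and empty lists for the rest, so callers do not need to special-case missing data.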

ruebot · Aug 10, 2018

@SamFritz can we add an item to our August 22nd agenda to discuss how we should pull out these entities? Basically, the question is how we want to store them: array, list, or comma-delimited strings. In the twarc tags.py utility, we just have a nested loop to get at them. I'm assuming we don't want to do something like that to get the multiple .entities.hashtags.text, .entities.user_mentions.name, etc.

...unless @lintool knows of some Scala tricks with json off the top of his head.
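To make the storage options concrete, here is a small Python sketch (Python only because twarc is Python; it says nothing about how aut should do this in Scala). The helper name `entity_values` is hypothetical and the sample tweets are made up from hashtags shown earlier. It shows the same data kept as per-tweet arrays and as comma-delimited strings, with the explicit nested loop replaced by a comprehension.

```python
from collections import Counter
from itertools import chain


def entity_values(tweet, entity, key):
    """Return one key from every item of an entity array, e.g. hashtags/text."""
    entities = tweet.get("entities") or {}
    return [item[key] for item in entities.get(entity, []) if key in item]


# Made-up tweets that reuse hashtags from the examples above.
tweets = [
    {"entities": {"hashtags": [{"text": "BREAKING"}, {"text": "TrumpLeaks"}]}},
    {"entities": {"hashtags": [{"text": "amjoy"}]}},
    {},  # a tweet with no entities at all
]

# Option 1: keep one array per tweet.
as_arrays = [entity_values(t, "hashtags", "text") for t in tweets]

# Option 2: collapse each array into a comma-delimited string.
as_strings = [",".join(values) for values in as_arrays]

# Arrays flatten cleanly when counts across the whole dataset are needed.
tag_counts = Counter(chain.from_iterable(as_arrays))
```

One trade-off to weigh: comma-delimited strings are easy to export, but they cannot be split back reliably for fields such as display names, which may themselves contain commas. Arrays keep that information.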

ruebot · Aug 10, 2018

As for .geo and .place, here is an example with both in it:

"place":
{
  "country_code": "US",
  "url": "https://api.twitter.com/1.1/geo/id/67b98f17fdcf20be.json",
  "country": "United States",
  "place_type": "city",
  "bounding_box": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -71.191421,
          42.227797
        ],
        [
          -71.191421,
          42.399542
        ],
        [
          -70.986004,
          42.399542
        ],
        [
          -70.986004,
          42.227797
        ]
      ]
    ]
  },
  "full_name": "Boston, MA",
  "attributes": {},
  "id": "67b98f17fdcf20be",
  "name": "Boston"
},
"geo":
{
  "type": "Point",
  "coordinates": [
    42.33887,
    -71.08839
  ]
}
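Reading this example, the two fields appear to order their coordinates differently: the bounding_box pairs look like [longitude, latitude], while geo looks like [latitude, longitude]. That reading is an assumption based on the values shown here. Under that assumption, a short Python sketch (helper names `bounding_box_extent` and `geo_in_place` are hypothetical) can check that the geo point falls inside the place's bounding box:

```python
def bounding_box_extent(place):
    """Return (min_lon, min_lat, max_lon, max_lat) for a place's bounding box.

    Assumes each pair in the first ring is [longitude, latitude].
    """
    ring = place["bounding_box"]["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return min(lons), min(lats), max(lons), max(lats)


def geo_in_place(geo, place):
    """Check whether a geo point lies within the place's bounding box.

    Assumes geo coordinates are [latitude, longitude].
    """
    lat, lon = geo["coordinates"]
    min_lon, min_lat, max_lon, max_lat = bounding_box_extent(place)
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# Values copied from the Boston example above.
place = {
    "full_name": "Boston, MA",
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[
            [-71.191421, 42.227797],
            [-71.191421, 42.399542],
            [-70.986004, 42.399542],
            [-70.986004, 42.227797],
        ]],
    },
}
geo = {"type": "Point", "coordinates": [42.33887, -71.08839]}

extent = bounding_box_extent(place)
inside = geo_in_place(geo, place)
```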

greebie · Aug 10, 2018

Good idea. My instinct is that we either use classes or just accept tuples. The coordinates seem complicated, though, and could turn into a quagmire of nested objects if we go the class route.
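One possible middle ground between the two, sketched in Python rather than Scala and purely as an illustration: a named tuple that flattens the bounding box into four numbers, so the record behaves like a plain tuple but has named fields and no nested objects. The type name `PlaceRecord`, its fields, and the helper `to_place_record` are all hypothetical, and the [longitude, latitude] ordering of the bounding box pairs is assumed from the examples above.

```python
from typing import NamedTuple, Optional


class PlaceRecord(NamedTuple):
    """A flat record for a tweet's place; the bounding box becomes four floats."""
    country_code: Optional[str]
    full_name: Optional[str]
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def to_place_record(place):
    """Flatten a place object; assumes bounding box pairs are [lon, lat]."""
    ring = place["bounding_box"]["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return PlaceRecord(
        country_code=place.get("country_code"),
        full_name=place.get("full_name"),
        min_lon=min(lons),
        min_lat=min(lats),
        max_lon=max(lons),
        max_lat=max(lats),
    )


# Values copied from the Holden example earlier in the thread.
holden = {
    "full_name": "Holden, MA",
    "country_code": "US",
    "bounding_box": {
        "type": "Polygon",
        "coordinates": [[
            [-71.918466, 42.312204],
            [-71.918466, 42.400332],
            [-71.801795, 42.400332],
            [-71.801795, 42.312204],
        ]],
    },
}
record = to_place_record(holden)
```

The flattening only loses information if a bounding box is not an axis-aligned rectangle; the two examples in this thread both are.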

ruebot self-assigned this Aug 10, 2018

ruebot referenced this issue Aug 10, 2018
Merged
Add support for full_text in tweets; resolve #192. #252

ruebot referenced this issue Aug 10, 2018
Merged
Add additional tweet fields to TweetUtils; partially address #194. #254

ruebot added enhancement feature labels Aug 20, 2018

ruebot added a commit to ruebot/aut that referenced this issue Aug 10, 2018

ruebot added a commit to ruebot/aut that referenced this issue Aug 11, 2018

ianmilligan1 added a commit that referenced this issue Aug 11, 2018
