Code and generated data for Master's Thesis Erik Dolstra - Record Linkage on consecutive share and post actions
Latest commit df6fc35 Jul 24, 2019
Type Name Latest commit message Commit time
README.md Update readme.md - 2 Jul 24, 2019
generated_events_x1.csv Code and Data as Demo Jul 24, 2019
generated_events_x4.csv
generated_tweets_x1.csv
generated_tweets_x4.csv
public_classification_synthetic_data.scala Code and Data as Demo Jul 24, 2019
public_generate_data.scala Code and Data as Demo Jul 24, 2019


RecordLinkageSharesAndPosts

Code and generated data for Master's Thesis Erik Dolstra - Record Linkage on consecutive share and post actions

This repository contains four CSV files and two Scala scripts. The code was written for a thesis for a Master's in Information Studies at the University of Amsterdam. The goal of the thesis is to develop an algorithm that matches Share Intents (people clicking a share button on an online article) with the corresponding Social Media Posts. A summary of the thesis is found below. The thesis was commissioned by the Persgroep. The data is based on Twitter Shares of articles from the website AD.nl.

The data is generated and fully anonymized. I removed the identifier that corresponds to a cookie ID on AD.nl and the identifier that corresponds to a Twitter Profile. The process of data generation is described further in the thesis. Additionally, the file locations (S3 buckets) in the code are fake.

The code is written in Scala for Apache Spark (https://spark.apache.org/docs/latest/api/scala/). The script "public_generate_data.scala" cannot be executed without real Shares and Posts as input, and for privacy reasons this input data cannot be published. Its output is the data included in this repository. The script "public_classification_synthetic_data.scala" can be executed with the CSV files in this repository as input.

Summary of the thesis: To harness the wealth of public data online, more companies combine data from online social networks with their own private data. A subset of these organizations publishes online articles, which readers can share to online social networks using share buttons. In this research, we show that the timestamp of an intent to share and the timestamp of a consecutive post on a social network can be compared in a record linkage algorithm. We apply a novel method of comparison to the time difference between the consecutive share and post actions. We call this time difference intent time. We implement record linkage with intent time on a test case from a Dutch newspaper and Twitter. We then validate our algorithm on synthetic datasets, where we measure the recall, precision and F1-score. Compared to a baseline probabilistic record linkage algorithm, a cost-based record linkage algorithm with intent time achieves a higher F1-score and recall, but a lower precision. Our research is applicable to anyone who wants to perform record linkage on timestamps of online article shares and prefers F1-score over precision.
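
To make the notions of intent time and the evaluation metrics concrete, here is a minimal sketch in plain Scala (no Spark). It is not the code from this repository and it does not reproduce the cost-based algorithm of the thesis: the greedy matching rule, the record types `Share` and `Post`, their field names, and the `maxIntentSeconds` threshold are all assumptions made for illustration. Only the definition of intent time (post timestamp minus share timestamp) and the standard precision, recall and F1 formulas are taken as given.

```scala
object IntentTimeSketch {
  // Hypothetical record types. The field names are illustrative and do not
  // reflect the columns of the CSV files in this repository.
  final case class Share(id: String, epochSeconds: Long)
  final case class Post(id: String, epochSeconds: Long)

  // Intent time: seconds between the share click and the post that follows it.
  def intentTime(share: Share, post: Post): Long =
    post.epochSeconds - share.epochSeconds

  // Greedy linkage (an assumed simplification, not the thesis's method):
  // shares are processed in time order, and each one is linked to the
  // earliest unused post that follows it within maxIntentSeconds.
  def link(shares: Seq[Share], posts: Seq[Post],
           maxIntentSeconds: Long): Set[(String, String)] = {
    val sortedPosts = posts.sortBy(_.epochSeconds)
    val start = (Set.empty[(String, String)], Set.empty[String])
    val (links, _) = shares.sortBy(_.epochSeconds).foldLeft(start) {
      case ((acc, used), share) =>
        val candidate = sortedPosts.find { p =>
          val dt = intentTime(share, p)
          !used(p.id) && dt >= 0 && dt <= maxIntentSeconds
        }
        candidate match {
          case Some(p) => (acc + ((share.id, p.id)), used + p.id)
          case None    => (acc, used)
        }
    }
    links
  }

  final case class Scores(precision: Double, recall: Double, f1: Double)

  // Precision, recall and F1 of predicted links against known true links.
  def evaluate(predicted: Set[(String, String)],
               truth: Set[(String, String)]): Scores = {
    val tp = predicted.intersect(truth).size.toDouble
    val precision = if (predicted.isEmpty) 0.0 else tp / predicted.size
    val recall = if (truth.isEmpty) 0.0 else tp / truth.size
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    Scores(precision, recall, f1)
  }
}
```

On synthetic data the true links are known, so `evaluate` can be applied directly to the output of `link`. A larger `maxIntentSeconds` would be expected to link more pairs and so trade precision for recall, which is the kind of trade-off the summary describes.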
