German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV
The repository provides the status IDs and associated scripts for creating a German-language Twitter dataset of 3,699,623 tweets from 2020/03/19 to 2020/06/26. The dataset will be extended continuously. For a brief introduction to the dataset, see
- Rieger, J. & von Nordheim, G. (2021). corona100d - German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV. DoCMA Working Paper #4.
In addition, this repository provides the data and scripts related to the talk
- von Nordheim, G. & Rieger, J. (2020). corona100d - Best practices in the creation process, analysis and publication of an open data corpus [corona100d – Best Practices bei der Genese, Analyse und Publikation eines Open-Data-Korpus]. SciCAR 2020.
For bug reports, comments and questions please use the issue tracker.
The scripts build on the following R packages:
- rtweet to scrape tweets.
- longurl to expand shortened URLs.
- urltools to extract URL cores from URLs (see the sketch after this list).
- tosca to manage and manipulate the text data into the structure requested by ldaPrototype.
- ldaPrototype to determine a prototype from a number of runs of Latent Dirichlet Allocation.
- batchtools to calculate (prototypes of) LDAs on the High Performance Compute Cluster LiDO3.
- data.table to manage data tables.
- lubridate to handle dates.
- tm and stringr to preprocess the text data.
- spelling to identify gibberish in texts.
- RCurl and RJSONIO to scrape articles with Diffbot.
- httr to scrape articles with Scrapinghub.
- RColorBrewer and ggwordcloud to visualize some statistics.
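To illustrate the URL-handling steps, here is a minimal sketch using longurl and urltools; the sample short URLs are made up for demonstration and the timeout value is an arbitrary choice:

```r
library(longurl)
library(urltools)

# Hypothetical shortened URLs as they might appear in tweets
short_urls <- c("https://bit.ly/3xmpl", "https://t.co/abc123")

# Resolve the short URLs; expand_urls() returns a data frame with
# columns orig_url, expanded_url and status_code
expanded <- expand_urls(short_urls, seconds = 5)

# Extract the URL core (the domain) from each expanded URL
domain(expanded$expanded_url)
```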
Please note: For legal reasons the repository cannot provide all data. Please let us know if you feel that there is anything missing that we could add.
The numbered scripts describe the general workflow during dataset creation. The main folder contains all scripts relevant for the creation of the raw dataset. The scicar folder contains consecutively numbered additional scripts for creating a follow-up corpus from the articles linked in the tweets. The subfolder scraping_articles holds the parsers for the article scrapers. The two txt files (status_id.txt in the main folder and status_id_Articles.txt in scicar) specify the 3,699,623 status IDs of the tweets in the base dataset and the 85,920 status IDs of the filtered dataset (tweets with links) on which an LDA was calculated.
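Because only the status IDs are published, the tweets themselves have to be rehydrated via the Twitter API. A minimal sketch, assuming status_id.txt contains one status ID per line and a Twitter API token is already set up for rtweet (lookup_statuses() is the function name in the rtweet versions current at the time of writing):

```r
library(rtweet)

# Read the published status IDs; keep them as character strings, since
# the numeric IDs exceed the precision of R's doubles
ids <- readLines("status_id.txt")

# Rehydrate the tweets; rtweet takes care of batching and rate limits,
# and tweets deleted or set to protected in the meantime are dropped
tweets <- lookup_statuses(ids)
```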
The scripts corona100d.R and wordcloud.R give a first insight into the base dataset corona100d (see also wordclouds.pdf and counts.pdf), while corona100dArticles.R contains code to fit LDAs. The data needed to model the LDA are given by docs.rds and vocab.rds; lda.R then shows code for a minimal evaluation of the LDA results.
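As a sketch of the modeling step with ldaPrototype, the following could serve as a starting point; the values for n, K and seeds are placeholders, not the settings behind the published results:

```r
library(ldaPrototype)

# Load the modeling data published in this repository
docs  <- readRDS("docs.rds")
vocab <- readRDS("vocab.rds")

# Run several LDAs and determine the prototype run;
# n, K and seeds are illustrative values only
res <- LDAPrototype(docs = docs, vocabLDA = vocab, n = 5, K = 10, seeds = 1:5)

# Look at the top words per topic of the prototype LDA
topics <- getTopics(getLDA(res))
tosca::topWords(topics, numWords = 10)
```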