German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV
The repository provides the status IDs and associated scripts for creating a German-language Twitter dataset of 3,699,623 tweets from 2020/03/19 to 2020/06/26. The dataset will be extended continuously. For a brief introduction to the dataset, see
- Rieger, J. & von Nordheim, G. (2021). corona100d - German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV. DoCMA Working Paper #4.
In addition, this repository provides the data and scripts related to the talk
- von Nordheim, G. & Rieger, J. (2020). corona100d - Best practices in the creation process, analysis and publication of an open data corpus [corona100d – Best Practices bei der Genese, Analyse und Publikation eines Open-Data-Korpus]. SciCAR 2020.
For bug reports, comments and questions please use the issue tracker.
The scripts build on the following R packages:
- rtweet to scrape tweets.
- longurl to expand shortened URLs.
- urltools to extract URL cores from URLs (see the sketch after this list).
- tosca to manage and manipulate the text data into the structure requested by ldaPrototype.
- ldaPrototype to determine a prototype from a number of runs of Latent Dirichlet Allocation.
- batchtools to calculate (prototypes of) LDAs on the High Performance Compute Cluster LiDO3.
- data.table to manage data tables.
- lubridate to handle dates.
- tm and stringr to preprocess the text data.
- spelling to identify gibberish in texts.
- RCurl and RJSONIO to scrape articles with Diffbot.
- httr to scrape articles with Scrapinghub.
- RColorBrewer and ggwordcloud to visualize some statistics.
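To illustrate the URL-handling steps, here is a minimal sketch using longurl and urltools; the sample short URLs are made up for demonstration and the timeout value is an arbitrary choice:

```r
library(longurl)
library(urltools)

# Hypothetical shortened URLs as they might appear in tweets
short_urls <- c("https://bit.ly/3xmpl", "https://t.co/abc123")

# Resolve the short URLs; expand_urls() returns a data frame with
# columns orig_url, expanded_url and status_code
expanded <- expand_urls(short_urls, seconds = 5)

# Extract the URL core (the domain) from each expanded URL
domain(expanded$expanded_url)
```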
Please note: For legal reasons the repository cannot provide all data. Please let us know if you feel that there is anything missing that we could add.
The numbered scripts describe the general workflow during dataset creation. The main folder contains all scripts relevant for the creation of the raw dataset. The scicar folder contains consecutively numbered additional scripts for creating a follow-up corpus from the articles linked in the tweets. The subfolder scraping_articles holds the parsers for the article scrapers. The two txt files (status_id.txt in the main folder and status_id_Articles.txt in scicar) specify the 3,699,623 status IDs of the tweets in the base dataset and the 85,920 status IDs of the filtered dataset (tweets with links) on which an LDA was calculated.
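Because only the status IDs are published, the tweets themselves have to be rehydrated via the Twitter API. A minimal sketch, assuming status_id.txt contains one status ID per line and a Twitter API token is already set up for rtweet (lookup_statuses() is the function name in the rtweet versions current at the time of writing):

```r
library(rtweet)

# Read the published status IDs; keep them as character strings, since
# the numeric IDs exceed the precision of R's doubles
ids <- readLines("status_id.txt")

# Rehydrate the tweets; rtweet takes care of batching and rate limits,
# and tweets deleted or set to protected in the meantime are dropped
tweets <- lookup_statuses(ids)
```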
The scripts corona100d.R and wordcloud.R give a first insight into the base dataset corona100d (see also wordclouds.pdf and counts.pdf), while corona100dArticles.R contains code to fit LDAs. The data needed to model the LDA are given by docs.rds and vocab.rds; lda.R then shows code for a minimal evaluation of the LDA results.
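As a sketch of the modeling step with ldaPrototype, the following could serve as a starting point; the values for n, K and seeds are placeholders, not the settings behind the published results:

```r
library(ldaPrototype)

# Load the modeling data published in this repository
docs  <- readRDS("docs.rds")
vocab <- readRDS("vocab.rds")

# Run several LDAs and determine the prototype run;
# n, K and seeds are illustrative values only
res <- LDAPrototype(docs = docs, vocabLDA = vocab, n = 5, K = 10, seeds = 1:5)

# Look at the top words per topic of the prototype LDA
topics <- getTopics(getLDA(res))
tosca::topWords(topics, numWords = 10)
```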