
Rieger, J. & von Nordheim, G. (2021). corona100d - German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV. DoCMA Working Paper #4.


JonasRieger/corona100d


corona100d

German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV

The repository provides the status IDs and associated scripts for creating a German-language Twitter dataset of 3,699,623 tweets posted from 2020/03/19 to 2020/06/26. The dataset will be continuously extended. For a brief introduction to the dataset, see

  • Rieger, J. & von Nordheim, G. (2021). corona100d - German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV. DoCMA Working Paper #4.

In addition, this repository provides the data and scripts related to the talk

  • von Nordheim, G. & Rieger, J. (2020). corona100d - Best practices in the creation process, analysis and publication of an open data corpus [corona100d – Best Practices bei der Genese, Analyse und Publikation eines Open-Data-Korpus]. SciCAR 2020.

For bug reports, comments and questions please use the issue tracker.

Related Software

Usage

Please note: For legal reasons the repository cannot provide all data. Please let us know if you feel that there is anything missing that we could add.

The numbered scripts describe the general workflow of dataset creation. The main folder contains all scripts relevant to creating the raw dataset. The scicar folder contains additional, consecutively numbered scripts for creating a follow-up corpus from the articles linked in the tweets; its subfolder scraping_articles holds the parsers for the article scrapers. Two txt files list status IDs: status_id.txt in the main folder contains the 3,699,623 status IDs of the tweets in the base dataset, and status_id_Articles.txt in scicar contains the 85,920 status IDs of the filtered dataset (tweets with links) on which an LDA was fitted.
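The repository's own scripts are written in R, and this README does not describe the layout of the ID files. As a rough illustration only, the following Python sketch shows one way to load status IDs and group them for rehydration. It assumes one status ID per line (an assumption about the file layout), and the function names and the batch size of 100 are hypothetical choices, the latter picked to match the per-request limit of Twitter's tweet lookup endpoints.

```python
from pathlib import Path
from typing import Iterator


def read_status_ids(path: str) -> list[str]:
    """Read status IDs from a text file.

    Assumes one ID per line; this layout is an assumption, not something
    the README states. IDs are kept as strings because they exceed 2**53
    and would lose precision if converted to floating-point numbers.
    """
    ids: list[str] = []
    seen: set[str] = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        status_id = line.strip()
        # Skip blank lines, non-numeric lines (e.g. a header), and duplicates.
        if status_id.isdigit() and status_id not in seen:
            seen.add(status_id)
            ids.append(status_id)
    return ids


def batches(ids: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Calling `read_status_ids("status_id.txt")` and iterating over `batches(...)` would give lists of IDs that can be passed to whichever lookup client is used. The actual API call is left out on purpose, since it depends on credentials and on the client library.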

The scripts corona100d.R and wordcloud.R give a first insight into the base dataset corona100d (see also wordclouds.pdf and counts.pdf), while corona100dArticles.R contains code to fit LDAs. The data needed to fit the LDA are provided in docs.rds and vocab.rds; lda.R then shows code for a minimal evaluation of the LDA results.
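LDA implementations typically take a vocabulary plus documents encoded as indices into that vocabulary. The README does not describe the internal layout of docs.rds and vocab.rds, so the following Python sketch only illustrates the general idea of such a docs/vocab pair. It is not the authors' R code, and the function names and toy tokens are hypothetical.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=1):
    """Return the sorted list of tokens occurring at least `min_count` times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return sorted(token for token, n in counts.items() if n >= min_count)


def encode_docs(tokenized_docs, vocab):
    """Replace each token by its 0-based position in `vocab`.

    Tokens that are not in the vocabulary are dropped.
    """
    index = {token: i for i, token in enumerate(vocab)}
    return [[index[t] for t in doc if t in index] for doc in tokenized_docs]


# Toy example (hypothetical tokens, not taken from the dataset).
toy_docs = [["corona", "maske", "corona"], ["maske", "lockdown"]]
toy_vocab = build_vocab(toy_docs)            # ['corona', 'lockdown', 'maske']
toy_encoded = encode_docs(toy_docs, toy_vocab)  # [[0, 2, 0], [2, 1]]
```

Keeping the vocabulary separate from the encoded documents means the topic model works on integers only, and the topic-word results can be mapped back to readable tokens afterwards.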
