Skip to content

ozanozyegen/TurkishWikiExtractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

Introduction

I have created this code to extract turkish wiki dumps for testing NLP algorithms. Other extractors I've tried have character issues and didn't work well with Turkish Wiki Dump.

This module contains code for manipulating wikipedia dumps available at http://download.wikimedia.org/backup-index.html

It is tested with ("https://dumps.wikimedia.org/trwiki/20170601/trwiki-20170601-pages-articles.xml.bz2") Turkish Wikipedia Dump. It is not tested with other dumps.

Installation

Required libraries are re, string, mwxml and cleantext. Written in Python 3.6

Program requires wiki dump to be named as "wiki.xml" in the root directory.

About

Wikipedia dump extractor for Turkish language

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages