Commit
1 parent 7bc6e07, commit a76f2fd
Showing 15 changed files with 431 additions and 174 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
data/* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# MediaWiki | ||
|
||
## Steps: | ||
|
||
1. Run `download.sh YYYYMMDD` to download the XML dumps | ||
2. Run `to_dolma.sh YYYYMMDD` (date must match) to convert to the dolma format | ||
3. Run `python preprocess.py --input ... --output ...` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
.DS_Store | ||
.idea | ||
*.log | ||
tmp/ | ||
|
||
*.tern-port | ||
node_modules/ | ||
npm-debug.log* | ||
yarn-debug.log* | ||
yarn-error.log* | ||
*.tsbuildinfo | ||
.npm | ||
.eslintcache | ||
logs/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# WTF WIKIPEDIA parsing server | ||
|
||
We use the dolma format and a server running `wtf_wikipedia` for wikitext parsing instead of `dumpster-dip`, as we want to be able to parse wikitext even when it is not in the standard XML format. | ||
|
||
## Starting the Server | ||
|
||
1. Install HAProxy `sudo apt install haproxy` | ||
2. Install nvm and node | ||
3. Install dependencies `npm install` | ||
4. Edit `haproxy.cfg` to include one `server ${name} 127.0.0.1:${port} check` line for each server you plan to run. | ||
5. Move or link `haproxy.cfg` to `/etc/haproxy/haproxy.cfg` | ||
6. Restart haproxy (`systemctl restart haproxy` on systemd-based systems) | ||
7. Run `./start ${numserver}`. The number should match the number of `server` lines in `haproxy.cfg` | ||
8. Go to `localhost:8404/stats` to check that each server is seen by haproxy | ||
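
Writing the `server` lines from step 4 by hand gets tedious as the server count grows. The following is a minimal sketch that generates them, assuming the `wtf{i}` naming and the `5000 + i` port scheme seen in this commit's `haproxy.cfg` and start script. The helper name `server_lines` is hypothetical and not part of the repository.

```python
def server_lines(num_servers: int, base_port: int = 5000,
                 host: str = "127.0.0.1") -> list[str]:
    """Return one haproxy `server` line per worker, numbered from 1.

    Assumption: worker i listens on base_port + i. The start script builds
    ports by string concatenation, which matches this only for ids 1..99.
    """
    if not 1 <= num_servers <= 99:
        raise ValueError("num_servers must be between 1 and 99")
    return [
        f"server wtf{i} {host}:{base_port + i} check"
        for i in range(1, num_servers + 1)
    ]


if __name__ == "__main__":
    # Paste the output into the `backend wtf_workers` section.
    print("\n".join(server_lines(16)))
```

Generating the lines from the same count that is passed to the start script keeps the two in sync, which is what step 7 asks for.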
|
||
## Why? | ||
|
||
Each server uses a worker pool with `1` worker. Because `wtf_wikipedia` is synchronous code, we need to run it in a thread so that timeouts can cancel execution on long-running documents. This also helps when parsing causes an OoM error: the error happens in the thread instead of the real server. | ||
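
The same isolation idea can be illustrated outside of node. The sketch below is a Python analogue, not the server's actual implementation: it runs a blocking job in a child process (rather than a worker thread) so that a timeout can kill it and a crash in the job cannot take down the caller. The function name `run_with_timeout` is hypothetical.

```python
import subprocess
import sys


def run_with_timeout(source: str, timeout_s: float) -> tuple[bool, str]:
    """Run Python `source` in a child process.

    Returns (True, stdout) on success, or (False, reason) if the child
    timed out or exited with a non-zero status (for example after a crash).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    return True, proc.stdout.strip()


if __name__ == "__main__":
    print(run_with_timeout("print(1 + 1)", 10))
    print(run_with_timeout("import time; time.sleep(30)", 0.5))
```

In both designs the caller stays responsive: a slow or failing document costs one worker, not the whole server.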
|
||
We then run multiple copies of the server behind the load balancer, which allows for recovery in cases where the main server itself crashes. | ||
|
||
### v8 garbage collection | ||
|
||
v8, and therefore node, seems to have a fairly complex garbage collector that includes things like separate heaps for persistent objects and for short-lived "young" objects. Despite various efforts to set the sizes of these heaps (they default to 64 and 32 GB in our code for each worker), I have seen a lot of JavaScript OoM errors, even though the errors seem to say that the heap is much smaller than the limits. The sizes are set in the options passed to the worker pool constructor. | ||
|
||
There were also cases where, using a large worker pool and a single server, the main server itself hit OoM errors. This crashes the whole server and grinds the dolma conversion to a halt. Even with command-line arguments to set the size of the heap, this still happened, again despite there seemingly not being much on the heap. When this happens, our load balancer stops routing traffic to this server and our start script brings a new instance online. Once it is live, it is added back to the pool. | ||
|
||
These errors tend to happen on pages that have over 2 million characters. | ||
|
||
## Settings | ||
|
||
It seems to be fast to keep each server working on one document while it has already received a second document to process next. As the Python code is synchronous, this means we need ~twice as many dolma processes as we have servers. Having extra Python processes means the server does not have to wait on Python string manipulations. | ||
|
||
On a Ryzen 9 7950X using 30 dolma processes and 16 servers, the whole system processes ~5.5k documents/second and takes ~4 hours and 15 minutes to process Wikipedia + talk pages and the other MediaWiki pages. |
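
As a back-of-envelope check on these figures, the sketch below multiplies the quoted throughput by the quoted runtime. It assumes throughput is constant over the whole run, which the text does not claim, so the result is only an order-of-magnitude estimate (about 84 million documents) and not a measured count.

```python
# All inputs are the approximate figures quoted in the text above.
DOCS_PER_SECOND = 5_500                # "~5.5k documents/second"
RUNTIME_SECONDS = 4 * 3600 + 15 * 60   # "~4 hours and 15 minutes"
NUM_SERVERS = 16
NUM_DOLMA_PROCESSES = 30

# Implied totals, assuming constant throughput (an assumption, not a measurement).
implied_total_docs = DOCS_PER_SECOND * RUNTIME_SECONDS
docs_per_second_per_server = DOCS_PER_SECOND / NUM_SERVERS
processes_per_server = NUM_DOLMA_PROCESSES / NUM_SERVERS

if __name__ == "__main__":
    print(f"implied documents processed: {implied_total_docs:,}")
    print(f"documents/second per server: {docs_per_second_per_server:.2f}")
    print(f"dolma processes per server:  {processes_per_server:.3f}")
```

The ratio of processes to servers comes out just under two, consistent with the "~twice as many dolma processes" guideline.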
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
defaults | ||
mode http | ||
timeout client 10m | ||
timeout connect 10m | ||
timeout server 10m | ||
timeout http-request 10m | ||
balance leastconn | ||
|
||
frontend stats | ||
mode http | ||
bind 127.0.0.1:8404 | ||
stats enable | ||
stats uri /stats | ||
stats refresh 5s | ||
stats admin if LOCALHOST | ||
|
||
frontend wtf | ||
bind 127.0.0.1:5000 | ||
default_backend wtf_workers | ||
|
||
backend wtf_workers | ||
option httpchk | ||
http-check send meth GET uri /health | ||
http-check expect status 200 | ||
server wtf1 127.0.0.1:5001 check | ||
server wtf2 127.0.0.1:5002 check | ||
server wtf3 127.0.0.1:5003 check | ||
server wtf4 127.0.0.1:5004 check | ||
server wtf5 127.0.0.1:5005 check | ||
server wtf6 127.0.0.1:5006 check | ||
server wtf7 127.0.0.1:5007 check | ||
server wtf8 127.0.0.1:5008 check | ||
server wtf9 127.0.0.1:5009 check | ||
server wtf10 127.0.0.1:5010 check | ||
server wtf11 127.0.0.1:5011 check | ||
server wtf12 127.0.0.1:5012 check | ||
server wtf13 127.0.0.1:5013 check | ||
server wtf14 127.0.0.1:5014 check | ||
server wtf15 127.0.0.1:5015 check | ||
server wtf16 127.0.0.1:5016 check |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
#!/usr/bin/env bash | ||
|
||
NUMSERVERS=${1:-16} | ||
|
||
function port { | ||
local id=${1} | ||
if [[ ${id} -ge 10 ]]; then | ||
echo "50${id}" | ||
else | ||
echo "500${id}" | ||
fi | ||
} | ||
|
||
function launch { | ||
local id=${1} | ||
node --max-old-space-size=65536 --max-semi-space-size=16384 parser.js --port $(port ${id}) --timeout 180 --maxworkers 1 >> ./logs/worker${id}.log 2>&1 & | ||
} | ||
|
||
function ping { | ||
local id=${1} | ||
curl -s -I -X GET "localhost:$(port "${id}")/health" 2> /dev/null | head -n 1 | cut -d" " -f2 | ||
} | ||
|
||
mkdir -p logs | ||
|
||
while true; do | ||
for i in $(seq 1 $NUMSERVERS); do | ||
if [[ $(ping ${i}) -ne "200" ]]; then | ||
echo "Worker ${i} not running, starting." | ||
launch ${i} | ||
fi | ||
done | ||
sleep 5 | ||
done |
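
The decision logic of the loop above (probe each worker's `/health` endpoint and relaunch any that do not answer `200`) can be sketched in Python as follows. This is an illustrative re-expression, not code from the commit: the names `port`, `http_status`, and `workers_to_restart` are hypothetical, and the probe is injectable so the logic can be exercised without running servers.

```python
import urllib.error
import urllib.request
from typing import Callable


def port(worker_id: int, base: int = 5000) -> int:
    """Port for a worker; matches the script's scheme for ids 1..99."""
    return base + worker_id


def http_status(worker_id: int, timeout_s: float = 2.0) -> int:
    """GET /health on the worker; return the status code, or 0 if unreachable."""
    url = f"http://localhost:{port(worker_id)}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but not with a 2xx status
    except OSError:
        return 0  # connection refused, timeout, and similar failures


def workers_to_restart(num_servers: int,
                       probe: Callable[[int], int] = http_status) -> list[int]:
    """Return the ids of workers whose health probe did not return 200."""
    return [i for i in range(1, num_servers + 1) if probe(i) != 200]


if __name__ == "__main__":
    # With no servers running locally, every id is reported.
    print(workers_to_restart(3))
```

Passing a fake `probe` makes the restart decision easy to unit test separately from the network call.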