WikiData Dump Processor
Processing JSON dumps from wikidata.org [1].
quick usage
    script
    ./wdq1.pl --date 2015-08-16
    ./wdq2.pl --date 2015-08-16 --scan
The scripts will run for several hours (2016-08-15 took 4.5 hours on my
machine), so it might be useful to record log messages into a transcript
file.
wdq1.pl
Takes the gzipped dump file, which is one gigantic JSON array, and
transcribes it element by element to a series of output files.
- properties are kept in a JSON structure and dumped at the end as props.json
- items are analyzed and filtered for properties of interest, but are also transcribed into a series of output files called “out/wdq#####.cmp”, each currently just over 500 MB.
- interesting information extracted by the filters mentioned above is written into a series of CSV files (tab separated):
filename | description |
---|---|
items.csv | item catalog |
props.csv | property catalog |
P####.csv | filtered property #### |
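The element-by-element reading can be sketched roughly like this. It is not the code from wdq1.pl; it assumes the dump keeps one entity per line, with "[" and "]" on lines of their own and a trailing comma after each element. The names parse_dump_line and scan_dump are made up for this example.

```perl
use strict;
use warnings;
use JSON::PP;
use IO::Uncompress::Gunzip qw($GunzipError);

# Decode one line of the dump. Returns a hash ref, or nothing for the
# "[" and "]" lines that open and close the array.
# Assumption: one entity per line, elements separated by a trailing comma.
sub parse_dump_line {
    my ($line) = @_;
    $line =~ s/\s+\z//;    # strip the line ending
    $line =~ s/,\z//;      # strip the separator between array elements
    return if $line eq '[' || $line eq ']' || $line eq '';
    return JSON::PP->new->utf8->decode($line);
}

# Walk a gzipped dump and call $cb->($entity, $line_no) for each element.
# Returns the number of lines read.
sub scan_dump {
    my ($path, $cb) = @_;
    my $z = IO::Uncompress::Gunzip->new($path, MultiStream => 1)
        or die "cannot open $path: $GunzipError";
    my $line_no = 0;
    while (defined(my $line = $z->getline)) {
        $line_no++;
        my $entity = parse_dump_line($line) or next;
        $cb->($entity, $line_no);
    }
    $z->close;
    return $line_no;
}
```

A callback could then branch on `$entity->{type}` to keep properties in memory and hand items to the filters, without ever holding the whole array in memory.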
wkt1.pl
TODO: describe …
TODO: gnd1.pl
TODO: write and describe …
wdq2.pl
Creates an index for items.csv to be able to load individual frames
from the item store and render them to STDOUT.
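As an illustration of what such an index could look like (not necessarily how wdq2.pl does it), the sketch below maps each item ID to its location in the item store, using the column numbers from the items.csv table in this README. The function name build_item_index is hypothetical, and whether items.csv starts with a header line is an assumption handled by skipping non-numeric rows.

```perl
use strict;
use warnings;

# Column numbers as listed in the items.csv table of this README.
use constant {
    COL_FO_COUNT => 2,
    COL_FO_BEG   => 3,
    COL_FO_END   => 4,
    COL_ID       => 5,
};

# Build { id => [fo_count, fo_pos_beg, fo_pos_end] } from an open handle.
sub build_item_index {
    my ($fh) = @_;
    my %index;
    while (my $row = <$fh>) {
        chomp $row;
        my @f = split /\t/, $row, -1;    # tab separated, keep empty fields
        # Skip short lines and a header line, if there is one (assumption).
        next unless @f > COL_ID && $f[COL_FO_COUNT] =~ /^\d+\z/;
        $index{ $f[COL_ID] } = [ @f[ COL_FO_COUNT, COL_FO_BEG, COL_FO_END ] ];
    }
    return \%index;
}
```

Looking up `$index->{Q42}` would then give the file number and byte range needed to load that item's frame.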
TODO:
- factor out at least the rendering step into a library for other scripts
to use.
data/out/wdq#####.cmp
Each item as a JSON structure is compressed individually and written to
a file with this name pattern. The positional information in the item
and P catalogs (items.csv, P####.csv) is intended for subsequent processing steps (see wdq2.pl).
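Loading a single frame from its recorded positions might look like the sketch below. Two things are assumptions here and not taken from the scripts: that each frame is a plain zlib stream (as produced by Compress::Zlib's compress), and that fo_pos_end points one byte past the end of the frame. The function name read_frame is hypothetical.

```perl
use strict;
use warnings;
use Compress::Zlib qw(uncompress);

# Read the bytes from $beg up to (not including) $end and inflate them.
# Assumptions: zlib-compressed frames, exclusive end position.
sub read_frame {
    my ($path, $beg, $end) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    seek($fh, $beg, 0) or die "seek failed in $path: $!";
    my $len = $end - $beg;
    my $got = read($fh, my $buf, $len);
    close $fh;
    die "short read in $path" unless defined $got && $got == $len;
    my $json = uncompress($buf);
    die "cannot uncompress frame at $beg in $path" unless defined $json;
    return $json;    # still JSON text; decode it with JSON::PP if needed
}
```

The file name itself would be derived from the fo_count column; that step is left to the caller here because the exact numbering scheme is not spelled out above.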
CSV files
items.csv
column | label | note |
---|---|---|
0 | line | input file line number |
1 | pos | input file begin byte position (within the decompressed stream) |
2 | fo_count | out/wdq file number |
3 | fo_pos_beg | out/wdq file begin position |
4 | fo_pos_end | out/wdq file end position |
5 | id | item ID |
6 | type | item type (should always be “item”) |
7 | cnt_label | number of labels |
8 | cnt_desc | number of descriptions |
9 | cnt_aliases | number of aliases |
10 | cnt_claims | number of claims |
11 | cnt_sitelink | number of sitelinks |
12 | lang | primary language |
13 | label | label string in that primary language |
14 | filtered_props | list of properties recorded in P####.csv files |
15 | claims | complete list of properties |
lang and label
Only one label is recorded; the first available language is selected from an ordered list:
    my @langs= qw(en de it fr);

props.csv
column | label | note |
---|---|---|
0 | prop | property ID |
1 | def_cnt | number of times this property was defined: should be 1 |
2 | use_cnt | number of times this property was used in claims in processed items |
3 | datatype | format of property values |
4 | label_en | property’s English label |
5 | descr_en | property’s English description |
TODO:
- [_] check if it makes sense to select a primary language for label and description.
P####.csv
column | label | note |
---|---|---|
0 | line | |
1 | pos | |
2 | fo_count | |
3 | fo_pos_beg | |
4 | fo_pos_end | |
5 | id | |
6 | type | |
7 | cnt_label | |
8 | cnt_desc | |
9 | cnt_aliases | |
10 | cnt_claims | |
11 | cnt_sitelink | |
12 | lang | |
13 | label | |
14 | val | item’s property value |
All other columns are the same as defined before under the heading “items.csv”.
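A consumer of one of these files only needs columns 5 (id) and 14 (val). The sketch below collects the values per item; it assumes that an item with several values for the property shows up as several rows, which is a guess, and the name read_prop_values is made up for this example.

```perl
use strict;
use warnings;

# Collect { item id => [ values ... ] } from an open P####.csv handle.
sub read_prop_values {
    my ($fh) = @_;
    my %values;
    while (my $row = <$fh>) {
        chomp $row;
        my @f = split /\t/, $row, -1;    # tab separated, keep empty fields
        # Skip short lines and a header line, if there is one (assumption).
        next unless @f >= 15 && $f[0] =~ /^\d+\z/;
        push @{ $values{ $f[5] } }, $f[14];
    }
    return \%values;
}
```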
TODO
- [X] take date parameter as a commandline argument and derive other parameters from that
- [X] write props.json into the output directory
- [_] fetch the dump from dumps server (check if file already exists or was changed)
- [_] add code (which should go into a library) to retrieve selected items from wdq files
- [_] add a section describing similar known projects
Wiktionary
Fetch dumps from [2], [3] and [4], and possibly other Wiktionaries:
{en,de,nl}wiktionary-YYYYMMDD-pages-meta-current.xml.bz2
e.g. https://dumps.wikimedia.org/enwiktionary/20170501/enwiktionary-20170501-pages-meta-current.xml.bz2
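The URL pattern above can be turned into a small helper. This is only a sketch: the function name wiktionary_dump_url is hypothetical, and it does not check whether a dump for the given date actually exists on the server.

```perl
use strict;
use warnings;

# Build the URL of a pages-meta-current dump for one Wiktionary.
# $lang is the language prefix (en, de, nl, ...), $date is YYYYMMDD.
sub wiktionary_dump_url {
    my ($lang, $date) = @_;
    die "language prefix expected, got '$lang'" unless $lang =~ /^[a-z]+\z/;
    die "date must be YYYYMMDD, got '$date'"    unless $date =~ /^\d{8}\z/;
    my $wiki = "${lang}wiktionary";
    return "https://dumps.wikimedia.org/$wiki/$date/"
         . "$wiki-$date-pages-meta-current.xml.bz2";
}

# Example: print the URLs for the three Wiktionaries listed above.
print wiktionary_dump_url($_, '20170501'), "\n" for qw(en de nl);
```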
Links
- [1] https://dumps.wikimedia.org/other/wikidata/
- [2] https://dumps.wikimedia.org/enwiktionary/
- [3] https://dumps.wikimedia.org/dewiktionary/
- [4] https://dumps.wikimedia.org/nlwiktionary/
TODO: add a way to get the proper date

    wget https://dumps.wikimedia.org/enwiktionary/20160801/enwiktionary-20160801-pages-meta-current.xml.bz2
    wget https://dumps.wikimedia.org/dewiktionary/20160801/dewiktionary-20160801-pages-meta-current.xml.bz2