README.textile
Wikidata Dump Processor
Processing JSON dumps from wikidata.org [1].
quick usage
bc. script
./wdq1.pl --date 2015-08-16
./wdq2.pl --date 2015-08-16 --scan
The scripts will run for several hours (the 2015-08-16 dump took about 4.5 hours on my
machine), so it may be useful to record log messages into a transcript
file.
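script(1) with a file argument writes the whole session to that file; the transcript name here is just an example:

bc. script wdq-2015-08-16.log

Type exit when the runs are done to close the transcript.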
wdq1.pl
Takes a gzipped dump file, which is one gigantic JSON array, and transcribes it
element by element into a series of output files; a sketch of this loop follows the table below.
- properties are kept in a JSON structure and dumped at the end as props.json
- items are analyzed and filtered for the properties of interest, and are also transcribed into a series of output files called “out/wdq#####.cmp”, each currently just a bit larger than 500 MB.
- interesting information extracted by the filters mentioned above is written into a series of CSV files (tab separated):
| filename | description |
|---|---|
| items.csv | item catalog |
| props.csv | property catalog |
| P####.csv | filtered property #### |
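A minimal sketch of the element-by-element transcription loop, assuming one JSON entity per line between the array brackets (which is how the Wikidata JSON dumps are laid out); the dump filename and the process_item handler are illustrative, not the script's actual names:

bc. #!/usr/bin/perl
use strict;
use warnings;
use JSON qw(decode_json);
# stream the gzipped dump instead of slurping the whole array
open(my $fh, '-|', 'zcat', 'dump.json.gz') or die "can't open dump: $!";
while (my $line= <$fh>)
{
  chomp($line);
  next if ($line eq '[' || $line eq ']'); # the array brackets sit on their own lines
  $line=~ s/,\s*$//;                      # strip the comma separating array elements
  my $item= decode_json($line);
  process_item($item);
}
close($fh);
sub process_item
{
  my ($item)= @_;
  print $item->{id}, "\n"; # this is where filtering and transcription would happen
}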
wkt1.pl
TODO: describe …
TODO: gnd1.pl
TODO: write and describe …
wdq2.pl
wdq2.pl --scan
Creates an index for items.csv so that individual frames can later be loaded
from the item store and rendered to STDOUT.
TODO:
- factor out at least the rendering step into a library for other scripts
to use.
wdq2.pl Q##### Q#####
Extracts Wikidata data from the processed dump files for the given Wikidata IDs.
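For example, to extract and render two items (Q42 and Q64 are just illustrative IDs; the index from --scan must already exist):

bc. ./wdq2.pl Q42 Q64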
data/out/wdq#####.cmp
Each item's JSON structure is compressed individually and written to
a file with this name pattern. The positional information in items.csv
and the P####.csv catalogs is intended for subsequent processing steps (see wdq2.pl).
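A minimal sketch of how one frame could be retrieved with the positional columns from items.csv; the five-digit file numbering and the zlib codec are assumptions, the actual compression format may differ:

bc. use strict;
use warnings;
use Compress::Zlib qw(uncompress);
# fetch one frame using the positions recorded in items.csv
# (columns fo_count, fo_pos_beg, fo_pos_end)
sub fetch_frame
{
  my ($fo_count, $pos_beg, $pos_end)= @_;
  my $fnm= sprintf('data/out/wdq%05d.cmp', $fo_count); # assumes 5-digit numbering
  open(my $fh, '<:raw', $fnm) or die "can't open $fnm: $!";
  seek($fh, $pos_beg, 0) or die "seek failed: $!";
  read($fh, my $buf, $pos_end - $pos_beg) or die "read failed: $!";
  close($fh);
  return uncompress($buf); # assumes zlib compression; the actual codec may differ
}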
CSV files
NOTE: all CSV files are really TSV files: tab-separated columns, with the first line giving the column names.
items.csv
| column | label | note |
|---|---|---|
| 0 | line | input file line number |
| 1 | pos | input file begin byte position (within the decompressed stream) |
| 2 | fo_count | out/wdq file number |
| 3 | fo_pos_beg | out/wdq file begin position |
| 4 | fo_pos_end | out/wdq file end position |
| 5 | id | item ID |
| 6 | type | item type (should always be “item”) |
| 7 | cnt_label | number of labels |
| 8 | cnt_desc | number of descriptions |
| 9 | cnt_aliases | number of aliases |
| 10 | cnt_claims | number of claims |
| 11 | cnt_sitelink | number of sitelinks |
| 12 | lang | primary language |
| 13 | label | label string in that primary language |
| 14 | filtered_props | list of properties recorded in P####.csv files |
| 15 | claims | complete list of properties |
lang and label
Only one label is recorded; the first available language is selected from an ordered list:

bc. my @langs= qw(en de it fr);

props.csv
| column | label | note |
|---|---|---|
| 0 | prop | property ID |
| 1 | def_cnt | number of times this property was defined: should be 1 |
| 2 | use_cnt | number of times this property was used in claims in processed items |
| 3 | datatype | format of property values |
| 4 | label_en | property’s English label |
| 5 | descr_en | property’s English description |
TODO:
- [_] check if it makes sense to select a primary language for label and description.
P####.csv
| column | label | note |
|---|---|---|
| 0 | line | |
| 1 | pos | |
| 2 | fo_count | |
| 3 | fo_pos_beg | |
| 4 | fo_pos_end | |
| 5 | id | |
| 6 | type | |
| 7 | cnt_label | |
| 8 | cnt_desc | |
| 9 | cnt_aliases | |
| 10 | cnt_claims | |
| 11 | cnt_sitelink | |
| 12 | lang | |
| 13 | label | |
| 14 | val | item’s property value |
All other columns are the same as defined before under the heading “items.csv”.
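A minimal sketch of reading one of these TSV files; P31.csv is just an illustrative filename:

bc. use strict;
use warnings;
open(my $fh, '<:utf8', 'P31.csv') or die "can't open: $!";
my $hdr= <$fh>; chomp($hdr);
my @cols= split(/\t/, $hdr);        # the first line carries the column names
while (my $l= <$fh>)
{
  chomp($l);
  my %row;
  @row{@cols}= split(/\t/, $l, -1); # tab-separated columns
  print $row{id}, "\t", $row{val}, "\n";
}
close($fh);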
TODO
- [x] take date parameter as a command-line argument and derive other parameters from that
- [x] write props.json into the output directory
- [x] fetch the dump from dumps server (check if file already exists or was changed) (wdq0.pl)
- [x] add code (which should go into a library) to retrieve selected items from wdq files (wdq2.pl)
- [_] add a section describing similar known projects
alternative download
see [5]
Wiktionary
fetch dumps from [2], [3] and [4] and possibly other wiktionaries:

bc. {en,de,nl}wiktionary-<date>-pages-meta-current.xml.bz2

e.g. https://dumps.wikimedia.org/enwiktionary/20170501/enwiktionary-20170501-pages-meta-current.xml.bz2
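For example, to fetch all three dumps for the 2017-05-01 date shown above (a shell loop matching the wget commands listed below):

bc. for w in en de nl ; do
wget https://dumps.wikimedia.org/${w}wiktionary/20170501/${w}wiktionary-20170501-pages-meta-current.xml.bz2
done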
Links
- [1] https://dumps.wikimedia.org/other/wikidata/
- [2] https://dumps.wikimedia.org/enwiktionary/
- [3] https://dumps.wikimedia.org/dewiktionary/
- [4] https://dumps.wikimedia.org/nlwiktionary/
- [5] https://dumps.wikimedia.org/wikidatawiki/entities/
TODO: add a way to get the proper date
bc. wget https://dumps.wikimedia.org/enwiktionary/20160801/enwiktionary-20160801-pages-meta-current.xml.bz2
wget https://dumps.wikimedia.org/dewiktionary/20160801/dewiktionary-20160801-pages-meta-current.xml.bz2