Gerhard Gonter
wikidata-dump-processor

WikiData Dump Processor
Processing JSON dumps from wikidata.org ¹.
quick usage
script
./wdq1.pl --date 2015-08-16
./wdq2.pl --date 2015-08-16 --scan

The scripts will run for several hours (the 2016-08-15 dump took 4.5 hours on my machine), so it might be useful to record log messages in a transcript file.
wdq1.pl
Takes the gzipped dump file, which is one gigantic JSON array, and transcribes it element by element into a series of output files.
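The element-by-element reading described above can be sketched roughly as follows. This is not the code of wdq1.pl, only a minimal illustration. It assumes that every array element sits on its own line, terminated by a comma, which is how the Wikidata JSON dumps are laid out. The sample data, the sub `parse_dump_line` and the variable names are made up for this example, and the real script reads the dump file instead of an in-memory buffer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Tiny made-up stand-in for a real dump: one entity per line in a JSON array.
my $sample = join("\n",
  '[',
  '{"id":"Q1","type":"item","labels":{"en":{"language":"en","value":"universe"}}},',
  '{"id":"P31","type":"property","datatype":"wikibase-item"}',
  ']') . "\n";
my $gz;
gzip(\$sample => \$gz) or die "gzip failed: $GzipError";

my $json = JSON::PP->new->utf8;

# Decode one line of the dump; returns undef for the array brackets.
sub parse_dump_line {
  my ($line) = @_;
  $line =~ s/\s+\z//;   # strip the line ending
  $line =~ s/,\z//;     # strip the separator between array elements
  return if $line eq '[' || $line eq ']' || $line eq '';
  return $json->decode($line);
}

my $in = IO::Uncompress::Gunzip->new(\$gz)
  or die "gunzip failed: $GunzipError";

my (%props, @items);
while (defined(my $line = $in->getline())) {
  my $entity = parse_dump_line($line) or next;
  if ($entity->{type} eq 'property') { $props{$entity->{id}} = $entity; }
  else                               { push @items, $entity->{id}; }
}
$in->close();

print "items: @items; properties: ", join(',', sort keys %props), "\n";
```

Decoding line by line keeps memory use flat, since the whole array never has to be held in memory at once.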

	properties are kept in a JSON structure and dumped at the end as props.json
	items are analyzed and filtered for properties of interest, and are also transcribed into a series of output files called “out/wdq#####.cmp”, each currently just a bit larger than 500 MB.
	interesting information extracted by the filters mentioned above is written into a series of CSV files (tab separated):


	filename    description
	items.csv   item catalog
	props.csv   property catalog
	P####.csv   filtered property ####
	

wkt1.pl
TODO: describe …

TODO: gnd1.pl
TODO: write and describe …
wdq2.pl
Creates an index for items.csv to be able to load individual frames from the item store and render them to STDOUT.
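How such an index might look is sketched below. The actual index format used by wdq2.pl is not described here, so this is purely an assumption: an in-memory hash that maps each item ID (column 5 of items.csv) to the byte offset of its row. The subs `build_index` and `fetch_row` and the sample rows are hypothetical, and the sketch assumes the file has no header row.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for items.csv: two made-up rows, columns 0..5 only.
my ($fh, $fnm) = tempfile(UNLINK => 1);
foreach my $r ([1, 0, 120, 'Q1'], [2, 120, 300, 'Q2']) {
  my ($line, $beg, $end, $id) = @$r;
  print {$fh} join("\t", $line, 0, 1, $beg, $end, $id), "\n";
}
close($fh) or die "close failed: $!";

# Map each item ID (column 5) to the byte offset of its row.
sub build_index {
  my ($file) = @_;
  open(my $in, '<', $file) or die "can't open $file: $!";
  my %index;
  while (1) {
    my $offset = tell($in);          # where the next row starts
    my $row = <$in>;
    last unless defined $row;
    chomp($row);
    my @f = split(/\t/, $row, -1);
    $index{$f[5]} = $offset;
  }
  close($in);
  return \%index;
}

# Jump straight to one row using the index.
sub fetch_row {
  my ($file, $index, $id) = @_;
  return unless exists $index->{$id};
  open(my $in, '<', $file) or die "can't open $file: $!";
  seek($in, $index->{$id}, 0) or die "seek failed: $!";
  my $row = <$in>;
  close($in);
  chomp($row);
  return split(/\t/, $row, -1);
}

my $index = build_index($fnm);
my @row   = fetch_row($fnm, $index, 'Q2');
print "Q2 is stored at $row[3]..$row[4] in out/wdq file $row[2]\n";
```

Storing only offsets keeps the index small; the full row is read from items.csv on demand.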
TODO:

	factor out at least the rendering step into a library for other scripts to use.

data/out/wdq#####.cmp
Each item, as a JSON structure, is compressed individually and written to a file with this name pattern.  The positional information in the item and P-catalogs is intended for subsequent processing steps (see wdq2.pl).
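A rough sketch of how one frame could be retrieved with that positional information follows. Two things are assumptions, not facts about the scripts: the compression method (zlib via Compress::Zlib is used here only as a stand-in) and the meaning of fo_pos_end (treated here as the offset just past the last byte of the frame). The sub `load_frame` and the sample frames are hypothetical.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);
use File::Temp qw(tempfile);

# Write two individually compressed frames to a stand-in output file
# and remember where each one begins and ends.
my ($fh, $fnm) = tempfile(UNLINK => 1);
binmode($fh);
my @catalog;
foreach my $json ('{"id":"Q1"}', '{"id":"Q2"}') {
  my $beg = tell($fh);
  print {$fh} compress($json);       # assumption: zlib compression
  push @catalog, { fo_pos_beg => $beg, fo_pos_end => tell($fh) };
}
close($fh) or die "close failed: $!";

# Read the bytes between $beg and $end and decompress them.
sub load_frame {
  my ($file, $beg, $end) = @_;
  open(my $in, '<:raw', $file) or die "can't open $file: $!";
  seek($in, $beg, 0) or die "seek failed: $!";
  my $len = $end - $beg;             # assumption: $end is exclusive
  my $got = read($in, my $buf, $len);
  close($in);
  die "short read" unless defined $got && $got == $len;
  return uncompress($buf);
}

my $frame = load_frame($fnm, @{$catalog[1]}{qw(fo_pos_beg fo_pos_end)});
print $frame, "\n";
```

Because every frame is compressed on its own, a single item can be read without decompressing anything before it in the file; fo_count from the catalog tells which file to open.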

CSV files
items.csv

	
	column  label           note
	0       line            input file line number
	1       pos             input file begin byte position (within the decompressed stream)
	2       fo_count        out/wdq file number
	3       fo_pos_beg      out/wdq file begin position
	4       fo_pos_end      out/wdq file end position
	5       id              item ID
	6       type            item type (should always be “item”)
	7       cnt_label       number of labels
	8       cnt_desc        number of descriptions
	9       cnt_aliases     number of aliases
	10      cnt_claims      number of claims
	11      cnt_sitelink    number of sitelinks
	12      lang            primary language
	13      label           label string in that primary language
	14      filtered_props  list of properties recorded in P####.csv files
	15      claims          complete list of properties
	

lang and label
Only one label is recorded; its language is the first one available from an ordered list:
my @langs= qw(en de it fr);
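The selection could work along these lines. This is a minimal sketch, not the script's own code: it assumes the labels arrive in the shape used by the Wikidata entity JSON (language code mapped to a structure with a `value` field). The sub `pick_label` and the sample labels are made up, and what the script records when none of the listed languages is present is not covered here.

```perl
use strict;
use warnings;

my @langs= qw(en de it fr);

# Return (language, label) for the first language in @langs that has a
# label, or an empty list if none of them does.
sub pick_label {
  my ($labels) = @_;
  foreach my $lang (@langs) {
    my $entry = $labels->{$lang} or next;
    return ($lang, $entry->{value});
  }
  return;
}

# Made-up labels without an English entry: German wins over French
# because it comes earlier in @langs.
my ($lang, $label) = pick_label({
  fr => { language => 'fr', value => 'Vienne' },
  de => { language => 'de', value => 'Wien' },
});
print "$lang\t$label\n";
```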
props.csv

	
	column  label     note
	0       prop      property ID
	1       def_cnt   number of times this property was defined: should be 1
	2       use_cnt   number of times this property was used in claims in processed items
	3       datatype  format of property values
	4       label_en  property’s English label
	5       descr_en  property’s English description
	

TODO:

	[_] check if it makes sense to select a primary language for label and description.

P####.csv

	
	column  label         note
	0       line
	1       pos
	2       fo_count
	3       fo_pos_beg
	4       fo_pos_end
	5       id
	6       type
	7       cnt_label
	8       cnt_desc
	9       cnt_aliases
	10      cnt_claims
	11      cnt_sitelink
	12      lang
	13      label
	14      val           item’s property value
	

All other columns are the same as defined before under the heading “items.csv”.
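Since the P####.csv files share their first 14 columns with items.csv, one column list can serve both when reading them. The sketch below illustrates this under a few assumptions: rows are plain tab-separated values without quoting, the files have no header row, and the sample row is entirely made up. The sub `parse_row` is hypothetical.

```perl
use strict;
use warnings;

# Columns 0..13 are common to items.csv and P####.csv.
my @common_cols = qw(line pos fo_count fo_pos_beg fo_pos_end id type
                     cnt_label cnt_desc cnt_aliases cnt_claims cnt_sitelink
                     lang label);
my @item_cols = (@common_cols, qw(filtered_props claims));
my @prop_cols = (@common_cols, 'val');

# Turn one tab-separated line into a hash keyed by column label.
sub parse_row {
  my ($line, @cols) = @_;
  chomp($line);
  my %row;
  @row{@cols} = split(/\t/, $line, -1);   # -1 keeps trailing empty fields
  return \%row;
}

# Made-up P####.csv row.
my $sample = join("\t", 1, 0, 1, 0, 120, 'Q1', 'item',
                  3, 2, 0, 5, 4, 'en', 'universe', 'example value') . "\n";
my $row = parse_row($sample, @prop_cols);
print "$row->{id} ($row->{label}): $row->{val}\n";
```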
TODO

	[X] take date parameter as a commandline argument and derive other parameters from that
	[X] write props.json into the output directory
	[_] fetch the dump from dumps server (check if file already exists or was changed)
	[_] add code (which should go into a library) to retrieve selected items from wdq files
	[_] add a section describing similar known projects

Wiktionary
Fetch dumps from ², ³ and ⁴, and possibly from other Wiktionaries:
{en,de,nl}wiktionary-YYYYMMDD-pages-meta-current.xml.bz2
e.g. https://dumps.wikimedia.org/enwiktionary/20170501/enwiktionary-20170501-pages-meta-current.xml.bz2
Links

	
¹ https://dumps.wikimedia.org/other/wikidata/
² https://dumps.wikimedia.org/enwiktionary/
³ https://dumps.wikimedia.org/dewiktionary/
⁴ https://dumps.wikimedia.org/nlwiktionary/

Todo: add a way to get the proper date
wget https://dumps.wikimedia.org/enwiktionary/20160801/enwiktionary-20160801-pages-meta-current.xml.bz2
wget https://dumps.wikimedia.org/dewiktionary/20160801/dewiktionary-20160801-pages-meta-current.xml.bz2
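Once a date is known, the download URLs follow the pattern of the two commands above and can be built mechanically. The following is only a small sketch of that step; it does not solve the open question of how to find the proper date. The sub `wiktionary_dump_url` is a hypothetical helper.

```perl
use strict;
use warnings;

# Build the dump URL for one Wiktionary language edition and dump date.
sub wiktionary_dump_url {
  my ($lang, $date) = @_;
  die "date must look like YYYYMMDD" unless $date =~ /\A\d{8}\z/;
  my $wiki = $lang . 'wiktionary';
  return sprintf(
    'https://dumps.wikimedia.org/%s/%s/%s-%s-pages-meta-current.xml.bz2',
    $wiki, $date, $wiki, $date);
}

print wiktionary_dump_url($_, '20160801'), "\n" for qw(en de);
```

The printed URLs can be passed to wget as shown above.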