WikiData Dump Processor

    Processing JSON dumps from wikidata.org [1].

    Quick usage

    ./wdq1.pl --date 2015-08-16
    ./wdq2.pl --date 2015-08-16 --scan
    

    The scripts will run for several hours (2016-08-15 took 4.5 hours on my
    machine), so it might be useful to record log messages into a transcript
    file.

    wdq1.pl

    Takes a gzipped dump file, which is one gigantic JSON array, and
    transcribes it element by element into a series of output files.

    • properties are kept in a JSON structure and dumped at the end as props.json
    • items are analyzed and filtered for properties of interest, but are also transcribed into a series of output files called “out/wdq#####.cmp”, each currently just a bit larger than 500 MB
    • interesting information extracted by the filters mentioned above is written into a series of CSV files (tab-separated):

      filename   description
      items.csv  item catalog
      props.csv  property catalog
      P####.csv  filtered property ####
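The transcription step can be sketched in Python: the dump is one giant JSON array, but each entity sits on its own line, so it can be parsed line by line without loading the whole array into memory. This is a sketch of the pattern, not the actual wdq1.pl code, and `iter_entities` is a hypothetical name:

```python
import gzip
import json

def iter_entities(path):
    """Stream entities from a gzipped Wikidata JSON dump.

    The dump is a single JSON array, but each entity occupies one
    line, so each line can be parsed on its own after skipping the
    array brackets and stripping the trailing comma.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line in ("[", "]", ""):
                continue  # array brackets and blank lines carry no entity
            yield json.loads(line.rstrip(","))
```

From such a stream, properties can be collected for props.json while items are filtered into the CSV and .cmp outputs.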

    wkt1.pl

    TODO: describe …

    TODO: gnd1.pl

    TODO: write and describe …

    wdq2.pl

    Creates an index for items.csv so that individual frames can be loaded
    from the item store and rendered to STDOUT.

    TODO:

    • factor out at least the rendering step into a library for other scripts
      to use.
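Such an index might simply map each item ID to its byte range in the item store; a minimal sketch using the column layout documented under “items.csv” below (`build_index` is an illustrative name, not wdq2.pl's actual code):

```python
import csv

def build_index(items_csv_path):
    """Map item ID -> (file number, begin, end) from items.csv.

    Column layout follows the items.csv table:
    0 line, 1 pos, 2 fo_count, 3 fo_pos_beg, 4 fo_pos_end, 5 id, ...
    """
    index = {}
    with open(items_csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            index[row[5]] = (int(row[2]), int(row[3]), int(row[4]))
    return index
```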

    data/out/wdq#####.cmp

    Each item is compressed individually as a JSON structure and written to
    a file with this name pattern. The positional information in the item
    and P-catalogs is intended for subsequent processing steps (see wdq2.pl).
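Given those positions, loading one frame is a seek-and-decompress. The sketch below assumes independently zlib-compressed blobs, which is only a guess at the .cmp codec; the real one is whatever wdq1.pl writes:

```python
import zlib

def load_item(store_dir, fo_count, beg, end):
    """Fetch one item frame by byte range from an out/wdq#####.cmp file.

    Assumes each frame is an independently zlib-compressed JSON blob;
    treat this as a sketch of the seek-and-decompress pattern only.
    """
    path = f"{store_dir}/wdq{fo_count:05d}.cmp"
    with open(path, "rb") as fh:
        fh.seek(beg)
        blob = fh.read(end - beg)
    return zlib.decompress(blob)
```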

    CSV files

    items.csv

    column  label           note
    0       line            input file line number
    1       pos             input file begin byte position (within the decompressed stream)
    2       fo_count        out/wdq file number
    3       fo_pos_beg      out/wdq file begin position
    4       fo_pos_end      out/wdq file end position
    5       id              item ID
    6       type            item type (should always be “item”)
    7       cnt_label       number of labels
    8       cnt_desc        number of descriptions
    9       cnt_aliases     number of aliases
    10      cnt_claims      number of claims
    11      cnt_sitelink    number of sitelinks
    12      lang            primary language
    13      label           label string in that primary language
    14      filtered_props  list of properties recorded in P####.csv files
    15      claims          complete list of properties
    lang and label

    Only one label is recorded; the first available language is selected
    from an ordered list:

    my @langs = qw(en de it fr);
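In Python the same fallback selection could look like this. The `labels` layout follows the public Wikidata JSON format; what wdq1.pl does when none of the four languages is present is not documented, so the `None` fallback here is an assumption:

```python
# Mirrors the Perl fallback list above; an entity's "labels" field
# maps language codes to {"language": ..., "value": ...} records,
# as in the Wikidata JSON format.
LANGS = ("en", "de", "it", "fr")

def primary_label(labels):
    """Return (lang, label) for the first available preferred language."""
    for lang in LANGS:
        if lang in labels:
            return lang, labels[lang]["value"]
    return None, None
```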

    props.csv

    column  label     note
    0       prop      property ID
    1       def_cnt   number of times this property was defined (should be 1)
    2       use_cnt   number of times this property was used in claims of processed items
    3       datatype  format of the property values
    4       label_en  property’s English label
    5       descr_en  property’s English description
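use_cnt can be accumulated while streaming the items. The sketch below counts one use per statement, though whether wdq1.pl counts statements or items is not specified; the claims layout follows the Wikidata JSON format and `count_property_uses` is a hypothetical name:

```python
from collections import Counter

def count_property_uses(items):
    """Tally how often each property appears in item claims.

    Each item's "claims" field maps property IDs (e.g. "P31") to a
    list of statements, as in the Wikidata JSON format.
    """
    use_cnt = Counter()
    for item in items:
        for prop, statements in item.get("claims", {}).items():
            use_cnt[prop] += len(statements)
    return use_cnt
```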

    TODO:

    • [_] check if it makes sense to select a primary language for label and description.

    P####.csv

    column  label         note
    0       line
    1       pos
    2       fo_count
    3       fo_pos_beg
    4       fo_pos_end
    5       id
    6       type
    7       cnt_label
    8       cnt_desc
    9       cnt_aliases
    10      cnt_claims
    11      cnt_sitelink
    12      lang
    13      label
    14      val           item’s property value

    All other columns are the same as defined before under the heading “items.csv”.

    TODO

    • [X] take the date parameter as a command-line argument and derive other parameters from it
    • [X] write props.json into the output directory
    • [_] fetch the dump from dumps server (check if file already exists or was changed)
    • [_] add code (which should go into a library) to retrieve selected items from wdq files
    • [_] add a section describing similar known projects

    Wiktionary

    fetch dumps from [2], [3] and [4], and possibly other wiktionaries:

    {en,de,nl}wiktionary-<date>-pages-meta-current.xml.bz2

    e.g. https://dumps.wikimedia.org/enwiktionary/20170501/enwiktionary-20170501-pages-meta-current.xml.bz2

    Links

    • [1] https://dumps.wikimedia.org/other/wikidata/
    • [2] https://dumps.wikimedia.org/enwiktionary/
    • [3] https://dumps.wikimedia.org/dewiktionary/
    • [4] https://dumps.wikimedia.org/nlwiktionary/

    TODO: add a way to determine the proper dump date

    wget https://dumps.wikimedia.org/enwiktionary/20160801/enwiktionary-20160801-pages-meta-current.xml.bz2
    wget https://dumps.wikimedia.org/dewiktionary/20160801/dewiktionary-20160801-pages-meta-current.xml.bz2
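Deriving these URLs from a date could look like the following sketch; the path scheme matches the wget examples above, and the function name is illustrative:

```python
def wiktionary_dump_url(lang, date):
    """Build the pages-meta-current dump URL for one wiktionary.

    `lang` is a wiki prefix like "en", "de" or "nl"; `date` is the
    dump date as YYYYMMDD, e.g. "20160801".
    """
    wiki = f"{lang}wiktionary"
    return (f"https://dumps.wikimedia.org/{wiki}/{date}/"
            f"{wiki}-{date}-pages-meta-current.xml.bz2")
```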