Gerhard Gonter / wikidata-dump-processor
Commit f6f7b544, authored 5 years ago by Gerhard Gonter
update a few notes
parent 1d0ab0b3
README.textile (17 additions, 9 deletions)
@@ -38,6 +38,8 @@ TODO: write and describe ...
 h2. wdq2.pl
+h3. wdq2.pl --scan
 Creates an index for items.csv to be able to load individual frames
 from the item store and render them to STDOUT.
@@ -45,15 +47,21 @@ TODO:
 * factor out at least the rendering step into a library for other scripts
 to use.
+h3. wdq2.pl Q##### Q#####
+Extracts wikidata data from processed dump file for give Wikidata IDs.
 h3. data/out/wdq#####.cmp
 Each item as a JSON structure is compressed individually and written to
 a file with this name pattern. The positional information in the items
 and P-catalogs are intended for subsequent processing steps (see wdq2.pl).
-h3. CSV files
+h2. CSV files
+NOTE: all csv files are really TSV files: Tab separated columns with first line giving the column names.
-h4. items.csv
+h3. items.csv
 |_. column |_. label |_. note |
 | 0 | line | input file line number |
@@ -73,14 +81,14 @@ h4. items.csv
 | 14 | filtered_props | list of properties recorded in P####.csv files |
 | 15 | claims | complete list of properties |
-h5. lang and label
+h4. lang and label
 Only one label is recorded, the first available language is selected from an ordered list:
 my @langs= qw(en de it fr);
-h4. props.csv
+h3. props.csv
 |_. column |_. label |_. note |
 | 0 | prop | property ID |
@@ -93,7 +101,7 @@ h4. props.csv
 TODO:
 * [_] check if it makes sense to select a primary language for label and description.
-h4. P####.csv
+h3. P####.csv
 |_. column |_. label |_. note |
 | 0 | line | |
@@ -114,15 +122,15 @@ h4. P####.csv
 All other columns are the same as defined before under the heading "items.csv".
-h3. TODO
+h2. TODO
 * [X] take date parameter as a commandline argument and derive other parameters from that
 * [X] write props.json into the output directory
-* [_] fetch the dump from dumps server (check if file already exists or was changed)
+* [x] fetch the dump from dumps server (check if file already exists or was changed) (wdq0.pl)
-* [_] add code (which should go into a library) to retrieve selected items from wdq files
+* [x] add code (which should go into a library) to retrieve selected items from wdq files (wdq2.pl)
 * [_] add a section describing similar known projects
-h3. alternative download
+h2. alternative download
 see [5]
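The README notes touched by this commit describe two conventions: the "csv" files are tab separated with a header line, and only one label is kept per item, taken from the first available language in an ordered list. The following is a minimal Python sketch of both ideas, not the project's code (the project itself is written in Perl). The function names `read_tsv` and `pick_label`, the shape of the `labels` mapping, and the sample row are assumptions made for illustration; only the language order `en de it fr` and the header-line convention come from the README.

```python
import csv
import io

# Preference order as listed in the README: my @langs= qw(en de it fr);
LANGS = ["en", "de", "it", "fr"]


def pick_label(labels):
    """Return (lang, label) for the first language in LANGS found in labels.

    `labels` is assumed to be a plain mapping of language code to label text;
    the real item structure in the dump may differ.
    Returns (None, None) when none of the preferred languages is present.
    """
    for lang in LANGS:
        if lang in labels:
            return lang, labels[lang]
    return None, None


def read_tsv(stream):
    """Yield one dict per data row of a tab-separated stream.

    The first line is taken as the list of column names, as the README states.
    Quoting is disabled because the files are described as plain TSV.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


if __name__ == "__main__":
    # Invented sample data, only to show the calling convention.
    sample = io.StringIO("line\tlang\tlabel\n1\ten\tEarth\n")
    for row in read_tsv(sample):
        print(row["line"], row["lang"], row["label"])

    print(pick_label({"fr": "Terre", "de": "Erde"}))
```

With a real file, `read_tsv(open(path, newline="", encoding="utf-8"))` would be the equivalent call; the path and encoding are likewise assumptions.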