diff --git a/README.md b/README.md index aad64be8cbc10bfe7d4e6fbb28cc0bec55fe2276..5e0a78c13015f67b2a46c856a8483a0ca984e2db 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,14 @@ -# Newseye +# Uses of the term telegraph in the context of journalism -Newseye Jupyter Notebook: János Békési, Martin Gasteiner +## Intro to Topic Modeling on a Data Set created by the Newseye Project + +A Jupyter Notebook by [János Békési](mailto:janos.bekesi@univie.ac.at) and [Martin Gasteiner](mailto:martin.gasteiner@univie.ac.at) ## Data -22 MB transkribus json data (down to article level) resp. 12 MB csv data of the same +22 MB of Transkribus OCR result JSON data (down to article level) and 12 MB of the resulting CSV data, +included in the repository in `./data-telegraf`. ## Workflow @@ -15,7 +18,16 @@ input data (datetime series with different starting or ending points, unforeseen the output formatting has to consider presentation quirks or sequence fittings. Though time spent will probably be more than estimated, any of those obstacles will be overcome with a bit of patience and insistence. + + +## The Notebook + +The presented Jupyter notebook `workflow.ipynb` contains a complete sequence of processing steps to generate topic models +from data OCRed by [Transkribus](https://readcoop.eu/transkribus), along with some visualizations to direct +the interpretation of results. + +To provide some guidance, we tried to prepend most cells with a short explanation; however, an intermediate +skill level regarding [Python](https://python.org) programming and [Jupyter](https://jupyter.org/) notebooks +might be quite helpful. -When preparing data for topic modelling, especially when using scanned data, it is crucious to allow for -some document structure (pages, sections, documents). diff --git a/newseye_tm.ipynb b/newseye_tm.ipynb deleted file mode 100644 index 625f64db15dbc839fffff6d7bcc858beb5fcc740..0000000000000000000000000000000000000000 --- a/newseye_tm.ipynb +++ /dev/null @@ -1,48 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Newseye Topic Modelling" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import datetime\n", - "import csv\n", - "from pathlib import Path\n", - "\n", - "import pandas as pd\n", - "import gensim\n", - "from gensim.utils import simple_preprocess\n", - "from gensim.models.coherencemodel import CoherenceModel" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.3" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/workflow.ipynb b/workflow.ipynb index 6678e93659b629f85b96603e84d1de255e6a4e68..372891f4d157553698c30d3534aa6971a0d258ea 100644 --- a/workflow.ipynb +++ b/workflow.ipynb @@ -4,15 +4,39 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Workflow of Processing Transcribus Data: From Scans to Topic Model Visualizations\n", + "# Uses of the Term *Telegraph* in the Context of Journalism\n", + "\n", + "## Intro to Topic Modeling on a Data Set created by the Newseye Project\n", + "\n", + "The data analysed in the following with the software libraries [Gensim](https://radimrehurek.com/gensim/) and [Mallet](http://mallet.cs.umass.edu/) were initially taken from
the [ANNO](https://anno.onb.ac.at/infos_zeizs.htm) system of the Austrian National Library, which was involved in the [Newseye](https://www.newseye.eu/) project (2018-2021). The data was then re-OCRed through the Transkribus programme and the layout of the newspapers was also analysed. Significantly improved Optical Character Recognition coupled with a form of Article Separation makes it possible to build a data platform that enables a new and very different quality of search, addressing and storage. After these enrichment processes, the data was imported into the Newseye platform, which is based on the discovery framework [Blacklight](https://blacklight-cms.net).\n", "The data used here is based on a general search, via the search engine [Solr](https://solr.apache.org/), for articles containing the word telegraph in the following time periods: 1864-1874, 1895-1901, 1911-1922.\n", "The search was carried out in the following German-language newspapers: *Neue Freie Presse*, *Innsbrucker Nachrichten*, *Arbeiter Zeitung* and *Illustrierte Kronen Zeitung*. The resulting data package was exported as JSON and processed with regard to topic modelling.\n", "The total number of hits was 14,949, of which the following is a breakdown by newspaper:\n", "\n", "```\n", " Neue Freie Presse 10,981\n", " Innsbrucker Nachrichten 2,212\n", " Arbeiter Zeitung 1,259\n", " Illustrierte Kronen Zeitung 497\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Workflow of Processing Transkribus Data: From Scans to Topic Model Visualizations\n", "\n", "First, the necessary libraries and modules are imported and some variables are initiated. Note the use of \n", - "the convenience module `tm_utils.py` containing helper functions for operations needed frequently. " + "the convenience module `tm_utils.py` containing helper functions for operations needed frequently. Above all, functions for saving and retrieving calculation results are defined there." ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -171,7 +195,7 @@ "source": [ "## Preprocessing and Stopwords\n", "\n", - "We remove short word components (`simple_preprocess`) like punctuation, articles and so on, and remove stopwords, too (\"Telegraph\" is added to the stopword list, because it was the search criterion for generating the document corpus in the first place). Bigrams and trigrams are searched and added, and finally we lemmatize each document, i.e. remove flections and word variants by using `spacy` (URL), a tool for natural language processing." + "We remove short word components (`simple_preprocess`) like punctuation, articles and so on, and remove stopwords, too (\"Telegraph\" is added to the stopword list, because it was the search criterion for generating the document corpus in the first place). Bigrams and trigrams are searched and added, and finally we lemmatize each document, i.e. remove inflections and word variants, by using [spacy.io](https://spacy.io/), a tool for natural language processing."
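For orientation, the following is a minimal, illustrative sketch of what such a preprocessing chain can look like with Gensim and spaCy. It is not the notebook's actual code (that lives in the hidden cells and in `tm_utils.py`); the variable names, the stopword set and the German model `de_core_news_sm` are assumptions made for the sake of the example.

```python
# Illustrative preprocessing sketch (not the notebook's tm_utils code):
# tokenize, drop stopwords, join frequent bigrams, lemmatize with spaCy.
import spacy
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phrases, Phraser

docs = ["Der Telegraph meldet aus Wien ...", "..."]    # assumed: raw article texts
stopwords = {"und", "der", "die", "das", "telegraph"}  # assumed: German stopwords plus the search term

# tokenization (also strips punctuation and very short tokens) and stopword removal
tokenized = [simple_preprocess(doc, deacc=True) for doc in docs]
tokenized = [[w for w in doc if w not in stopwords] for doc in tokenized]

# bigram detection: frequent token pairs are joined with "_"
bigram = Phraser(Phrases(tokenized, min_count=5, threshold=100))
tokenized = [bigram[doc] for doc in tokenized]

# lemmatization with spaCy, keeping only content words
nlp = spacy.load("de_core_news_sm", disable=["parser", "ner"])
data_lemmatized = [
    [t.lemma_ for t in nlp(" ".join(doc)) if t.pos_ in ("NOUN", "ADJ", "VERB", "ADV")]
    for doc in tokenized
]
```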
] }, { @@ -201,7 +225,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -226,7 +250,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -271,7 +295,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -279,7 +303,26 @@ "output_type": "stream", "text": [ "found 1452 bigrams\n", - "['service_bestehend', 'pension_carte', 'gesgesellschaft_oesterr', 'santen_bremen', 'wirkung_muskel', 'autoritativer_seite', 'mitleidenschaft_gezogen', 'bozen_dolomiten', 'binnen_kurzem', 'holzverkleidung_flugelthuren', 'privat_forstbank', 'speciell_empfohlen', 'million_pfund', 'grand_hotel', 'kuche_gedeckter', 'pensionats_genugt', 'romische_korrespondent', 'seehandlung_osterr', 'arbeiter_soldatenrates', 'rangsklasse_technischer']\n" + "['prompte_bedienung',\n", + " 'deutsche_nachrichtenburo',\n", + " 'versehen_benutzung',\n", + " 'renovirt_zimmer',\n", + " 'todten_verwundeten',\n", + " 'wiener_abendpost',\n", + " 'traunthaler_kohlenw',\n", + " 'restaurants_pensionats',\n", + " 'kuche_badezimmer',\n", + " 'ranges_ideale',\n", + " 'entzundungen_knochenbruchen',\n", + " 'veranda_bestehend',\n", + " 'vermiethet_anfragen',\n", + " 'comfort_ausgestattet',\n", + " 'weiten_strand',\n", + " 'spater_abendstunde',\n", + " 'wiener_maschin',\n", + " 'getotet_verwundet',\n", + " 'ausland_prospecte',\n", + " 'sausenstein_erbeten']\n" ] } ], @@ -289,7 +332,16 @@ "for row in data_lemmatized:\n", " [bigrams.add(x) for x in row if x.find('_') > -1]\n", "print(\"found {} bigrams\".format(len(bigrams)))\n", - "print(list(bigrams)[:20])" + "pprint(list(bigrams)[:20])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LDA: Using Latent Dirichlet Allocation to Generate Topics\n", "\n", "LDA (Latent Dirichlet Allocation) is one of several possible methods to generate probabilities for words constituting a \"topic\" (a bunch of words pertinent to a certain content or context). Algorithmically, this is achieved by cleverly iterating conditional probabilities for word occurrences. [Gensim](https://radimrehurek.com/gensim/) offers a robust and performant implementation in the Python language, which is why we are using this library in our explorations." ] }, { @@ -483,7 +535,7 @@ "## Visualization with pyLDAvis\n", "\n", "A very handy visualization can be rendered with `pyLDAvis`. If it is not displayed after the following code, you can use this [link](./data-telegraf/saved_lda_telegraph_20210324-183052.html) to download \n", - "the source file and view it locally in your browser. " + "the source file and view it locally in your browser; or, you can view this whole notebook in a [Jupyter notebook viewer](https://nbviewer.jupyter.org/urls/gitlab.phaidra.org/bekesij9/newseye-test/-/raw/ad821c9ff6b944114fdc0013eb6edd3750488660/workflow.ipynb) to access the visualization in its context." ] }, { @@ -564,34 +616,34 @@ "pyLDAvis.display(visu)\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Time Series Visualization: Topic Models along the Time Axis\n", + "\n", + "Time series visualization shows a development along the time axis for different topics. We divided the data into the respective sources, i.e. newspapers, to avoid overly noisy charts. Also, the original source data had to be sorted by timestamp, and we then generated a chart for each continuous section of data.
The next three cells were necessary to prepare the source data accordingly.\n", + "\n", + "In hindsight, a more sophisticated form of visualization can certainly be imagined, for instance a sort of overlay of different newspapers, with an interactive choice of newspaper and/or topic. However, this effort would only be worthwhile if the resulting insights were surprising, stunning or indeed counterintuitive. So, for the time being, we confine our output to \"dumb\" charts and leave the rest to the imagination, as is appropriate for an explorative enterprise like ours." ] }, { "cell_type": "code", - "execution_count": 44, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", - "['topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'topic_15', 'topic_16', 'topic_17', 'topic_18', 'topic_19', 'topic_20', 'topic_21', 'topic_22', 'topic_23', 'topic_24', 'topic_25', 'topic_26', 'topic_27']\n" - ] - } - ], + "outputs": [], "source": [ "# now for the timeseries...\n", "topics = {}\n", "topic_no = 28\n", "topic_headers = []\n", "trow_templ = [0 for x in range(topic_no)]\n", - "print(trow_templ)\n", "# collect topic words\n", "for t in lda_model.show_topics(num_topics=topic_no, num_words=10, log=False, formatted=False):\n", " tnum = t[0]\n", " tl = t[1]\n", " topics[tnum] = \", \".join([word for word, prop in tl])\n", " topic_headers.append(f\"topic_{tnum}\")\n", - "print(topic_headers)\n", "outputfile = DATA.joinpath(f\"{corpusname}_01.csv\")\n", "raw_df = pd.read_csv(outputfile, sep=\";\")\n", "#create an array for use in time series\n", @@ -616,33 +668,6 @@ "df.to_csv(DATA.joinpath(timeseries), sep=\";\", index=False)\n" ] }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "corpusname: telegraph_\n", - "telegraph_20210330-150450 telegraph\n", - "loading dict and corpus from data-telegraf/dict_telegraph_20210323-162000.dict, data-telegraf/corpus_telegraph_20210323-162000.mm\n", - "LdaModel(num_terms=61923, num_topics=28, decay=0.5, chunksize=100)\n" - ] - } - ], - "source": [ - "# retrieve ldamodel\n", - "data_lemmatized = tm_utils.get_lemmatized(corpusname=corpusname, datadir=DATA)\n", - "id2word, corpus = tm_utils.get_corpus_dictionary(data_lemmatized, \n", - " corpusname=corpusname, save=False, \n", - " datadir=DATA, from_file=True)\n", - "modelfile = str(DATA.joinpath(\"{}_lda_top28.model\".format(corpusname)))\n", - "lda_model = gensim.models.ldamodel.LdaModel.load(modelfile)\n", - "print(lda_model)" - ] - }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -684,8 +709,7 @@ " data.append([date] + values)\n", " #print(\"\\n\")\n", "df = pd.DataFrame(data, columns=[\"date\"] + topic_headers)\n", - "timeseries = f\"{corpusname}_time.csv\"\n", - "#df.to_csv(DATA.joinpath(timeseries), sep=\";\", index=False)\n" + "timeseries = f\"{corpusname}_time.csv\"\n" ] }, { @@ -725,10 +749,7 @@ " print(np)\n", " npdf = raw_df[raw_df['newspaper_id'] == np]\n", " short_id = \"\".join([x[0] for x in np.split(\"_\")])\n", - " # npdf.to_csv(DATA.joinpath(f\"time_{corpusname}_{short_id}.csv\"), sep=\";\", index=False)\n", " data = []\n", - " #if short_id == \"nfp\":\n", - " # continue\n", " for row in npdf.itertuples():\n", " old_id,
date = row[1], row[4]\n", " modeldata = lda_model[corpus][old_id]\n", @@ -748,7 +769,7 @@ "source": [ "## Plots for Topic Development along the Time Axis\n", "\n", - "If topics are viewed with regard to their ... during the time ... we can render nice diagrams of topic developments during time. Here we only display one of those diagrams, but all of them can be downloaded from the `./images` folder." + "If topics are viewed with regard to their ebb and flow in time, we can render nice diagrams of topic developments throughout the years. Here we only display one of those diagrams, but all of them can be downloaded from the `./images` folder alongside this notebook in the repository." ] }, { @@ -790,12 +811,14 @@ "print(len(topics), \"Topics\")\n", "npnames = ['innsbrucker_nachrichten', 'neue_freie_presse', 'arbeiter_zeitung',\n", " 'illustrierte_kronen_zeitung']\n", + "# intervals necessary to avoid gaps in the charts\n", "intervals = [\n", " ('1810-01-01', '1880-01-01'),\n", " ('1880-01-01', '1910-01-01'),\n", " ('1910-01-01', '1925-01-01'),\n", " ('1925-01-01','2010-01-01')\n", "]\n", + "# a few newspaper data are absent for certain intervals, hence some exceptions\n", "exception_map = {0: {'ikz': ('1880-01-01', '1925-01-01'),\n", " 'az': ('1880-01-01', '1925-01-01')},\n", " 1: {'ikz': ('1880-01-01', '1925-01-01'),\n", @@ -805,7 +828,6 @@ " short_id = \"\".join([x[0] for x in n.split(\"_\")])\n", " print(\"Processing\", n)\n", " data_raw = pd.read_csv(DATA.joinpath(f\"ts_topic_{short_id}_{corpusname}.csv\"), sep=\";\")\n", - " #data = data['']\n", " for iidx, intv in enumerate(intervals):\n", " ifrom = intv[0]\n", " ito = intv[1]\n", @@ -835,7 +857,6 @@ " if values.empty:\n", " continue\n", " maxval = max(values)\n", - " # print(\"col:\", col_topic_idx, \"topic: \", col, \"maxval:\", maxval)\n", " if maxval:\n", " try:\n", " xval = data.loc[data[f'topic_{col}'] == maxval].index.item()\n", " except:\n", " xval = None\n", " else:\n", " xval = None\n", - " # print(\"xval: {} for topic {} (max: {})\".format(xval, col, maxval))\n", " descr = \", \".join(topics[col].split(\", \")[:4])\n", " src = \"\"\n", " label=\"Topic {}: {}...
{}\".format(col+1, descr, src)\n", @@ -854,11 +874,9 @@ " plt.title(f'Newseye Topic Modelling {n}: \"Telegraph\")', fontsize=12)\n", " plt.xlabel(\"Telegraph Newspaper {} ({})\".format(n, data.shape[0]))\n", " plt.ylabel(\"Topic percentage / 100\")\n", - " # plt.xlim(-18, 700)\n", " plt.autoscale(enable=True, axis=\"x\")\n", " plt.savefig(str(imgdir.joinpath(f\"telegraph_topics_{short_id}_{iidx}.png\")) )\n", " if presentation:\n", - " # print(\"Displaying ikz time series only; charts are saved to ./images\")\n", " if short_id == 'ikz' and iidx == 0:\n", " plt.show()\n", " else:\n", @@ -887,7 +905,7 @@ ], "source": [ "%%time\n", - "# since only 2 topics raise above the 0.1 threshold (12, 18), we try with mallet (methodologically unsound, but for display's sake)\n", + "# since only 2 topics raise above the 0.1 threshold (12, 18), we try with mallet (methodologically unsound, but ok for display's sake)\n", "\n", "mallet_path = '/usr/local/bin/mallet-2.0.8/bin/mallet'\n", "\n", @@ -917,7 +935,7 @@ "* Most representative topics\n", "* Distribution of topics\n", "\n", - "We only show an abridged version of the dominant topic list; our `datasette` instance would render all data, so that the direct link to actual source data (column \"Link\")would be functional.\n", + "We only show an abridged version of the dominant topic list; our [Datasette](https://datasette.io/) instance would render all data, so that the direct link to actual source data (column \"Link\") would be functional.\n", "\n", "### Most Dominant Topic per Document\n", "\n",