
Chapter 8: Multimodal Structure

After handing in chapter 7 (Textual Function) and while waiting for feedback on those nearly 80 pages, I started working on the chapter on multimodal structure. This chapter is basically a core linguistic one and should contain analyses of the following aspects:

  • Macro Structure:
    • layout of blog pages (header, sidebars, body etc.)
    • blog pages as part of a network of pages (about, pictures, homepage etc.)
  • Meso Structure (is that a proper term?):
    • key elements of blog postings (meta links, tags, categories)
    • key elements of sidebars (meta links, blogrolls etc.)
    • thematic structure of blog postings (?)
  • Micro Structure:
    • language and image
    • register / style: key words, frequency counts, sentence and word length…
    • hyperlinks and their uses
    • topics, subtopics, topical coherence

As always, I had a rough idea of what the chapter should deal with, but I did not know how to gather the necessary data. So I did quite extensive research on corpus software, comparing the capabilities of particular programs and always asking myself whether I could actually use what they offered.

I came across the program TreeTagger by Helmut Schmid (described in detail in Schmid 1994). The software can be used on .txt files and creates a vertical .txt file (one token per line) with a POS tag added to each token the tagger knows. Installing the program on Windows is not easy for novices, as it was actually designed to run on Linux and still needs the command shell. There is, however, also a graphical interface, which I tried out (of course) and which works quite well.
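
To illustrate the vertical format: with the standard English parameter file and wrapper scripts, TreeTagger prints one token per line together with its POS tag and, by default, a lemma. The sentence in this sample is invented:

    The	DT	the
    first	JJ	first
    weblogs	NNS	weblog
    dealt	VVD	deal
    with	IN	with
    software	NN	software
    .	SENT	.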

TreeTagger serves as a POS tagger only. Meik Michalke provides a software package – koRpus – which works within the R framework. The koRpus package can tag .txt files using TreeTagger and afterwards run some frequency analyses on the text in question. As it is written by a psychologist, its focus lies on readability measures. My knowledge of R is quite limited, and (after everything had looked really promising for a while) I became disappointed with the measures available, and especially with the way the data generated by the analyses is stored and made available for further use – I was not able to figure that out, not even with the graphical R interface RKWard. So I decided not to use koRpus and to keep looking for other software.

And I found WordSmith, a piece of software that offers the following (unfortunately not on an open-source basis like the R packages, and therefore not free…):

  • word lists, frequency analyses and measures such as sentence and word length
  • key words in texts or groups of texts, computed from the word lists of single texts; key words can also be determined against established reference corpora such as the BNC (a sketch of the underlying measure follows this list)
  • concordances (even though I probably will not need those)
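
WordSmith computes the key word statistics internally, so the following is purely illustrative: a minimal Python sketch of keyness scoring with the log-likelihood measure that tools of this kind typically use. The toy word lists stand in for a blog text and a reference corpus and are invented for the example:

    import math
    from collections import Counter

    def log_likelihood(freq_t, size_t, freq_r, size_r):
        """Keyness of one word: target text vs. reference corpus."""
        expected_t = size_t * (freq_t + freq_r) / (size_t + size_r)
        expected_r = size_r * (freq_t + freq_r) / (size_t + size_r)
        ll = 0.0
        if freq_t > 0:
            ll += freq_t * math.log(freq_t / expected_t)
        if freq_r > 0:
            ll += freq_r * math.log(freq_r / expected_r)
        return 2 * ll

    # Invented toy data standing in for a blog text and a reference corpus.
    target = Counter("the new software links the web the mac".split())
    reference = Counter("the cat sat on the mat and the dog slept".split())
    n_t, n_r = sum(target.values()), sum(reference.values())

    scores = {w: log_likelihood(target[w], n_t, reference[w], n_r) for w in target}
    for word, ll in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(word, round(ll, 2))

A word scores high when its relative frequency in the text differs strongly from the reference corpus – which is exactly what makes it a "key" word.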

I was especially thrilled by the key word feature, as this makes it possible to identify key topics when the 10 to 20 most frequent nouns in a text (or in all texts of a period) are understood as indicators of the topics mostly dealt with. An example: I did a key word analysis on two texts of period one and found IT words among the most frequent nouns. This was what I expected, as the first weblog authors were mainly IT experts and their weblogs dealt (alongside other, more personal topics, as in EatonWeb, for instance) with IT stuff, software, new links on the web, Apple vs. Microsoft and so on and so forth. I now hope to use this key word tool for a broader analysis, aiming at extrapolating topical shifts across the periods.
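
A rough sketch of how this could be automated on the TreeTagger output described above (the file name and the result are invented; in TreeTagger's English tagset, noun tags start with NN):

    from collections import Counter

    def top_nouns(vertical_file, n=20):
        """Most frequent nouns in a TreeTagger vertical file (token, POS, lemma)."""
        nouns = Counter()
        with open(vertical_file, encoding="utf-8") as fh:
            for line in fh:
                parts = line.rstrip("\n").split("\t")
                if len(parts) >= 2 and parts[1].startswith("NN"):
                    nouns[parts[0].lower()] += 1
        return nouns.most_common(n)

    # Hypothetical call; the result is invented for illustration:
    # top_nouns("period1_blog.txt") -> [('software', 12), ('links', 9), ...]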

So, currently I am working my way through all 330 corpus texts again, performing the following steps (as always, I use SPSS for my statistics):

  1. I count the hyperlinks used in the entries, differentiating between external links (the URL points to another domain), internal links (the URL remains within the same domain; links to categories, for example, are internal as well), meta links (permalinks, trackbacks and comment links, mostly found at the end of postings) and other links (mailto:, downloads etc.). Category links do not count as meta links but as internal links, because some period I weblogs already offer internal category links but no other meta links, and I also want to get clean data on the categories. A sketch of this classification logic follows the list.
  2. I count other meso-structural features such as blogrolls, guest books and so on. Maybe some trends will emerge after some counting…
  3. I determine a layout type – Schlobinski & Siever (2005) suggested some types, and I extended their typology.
  4. I code the text in MAXQDA for special features like emoticons, rebus forms, oral features, graphostyle…
  5. I generate a PDF file from the website, which is imported into MAXQDA as well. This PDF file is used for coding the language-image interplay and image types. Currently, I am doing some rough coding, intending to get more fine-grained later on.
  6. I generate a .txt file with the postings of the weblog. This .txt file will later be used in WordSmith.
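
The link classification in step 1 is mechanical enough to sketch in Python. A minimal illustration of the decision logic – the blog domain, the meta-link cues and the example URLs are all assumptions made up for the sketch; in practice I decide by looking at the page:

    from urllib.parse import urlparse

    META_CUES = ("permalink", "trackback", "comment")  # assumed anchor-text cues

    def classify_link(url, anchor_text, blog_domain):
        """Return external / internal / meta / other, following the rules in step 1."""
        parsed = urlparse(url)
        if parsed.scheme == "mailto" or url.lower().endswith((".zip", ".exe")):
            return "other"                     # mailto:, downloads etc.
        if parsed.netloc and parsed.netloc != blog_domain:
            return "external"                  # URL points to another domain
        if any(cue in anchor_text.lower() for cue in META_CUES):
            return "meta"                      # permalinks, trackbacks, comments
        return "internal"                      # same domain, incl. category links

    print(classify_link("http://elsewhere.example/post", "a link", "myblog.example"))       # external
    print(classify_link("http://myblog.example/categories/tech", "Tech", "myblog.example")) # internal
    print(classify_link("http://myblog.example/2001/05/03#comments", "Comments (4)",
                        "myblog.example"))                                                  # meta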

This procedure takes a while. As it is quite exhausting as well, I can only analyse around 20 texts per day, which means around six weeks of work until I can move on to the WordSmith analyses and the language-image interplay (I’m really dreading that…).
