
Chapter 8: Multimodal Structure

After handing in chapter 7 (Textual Function) and while waiting for feedback on those nearly 80 pages, I started working on the chapter on multimodal structure. This chapter is basically a core linguistic one and should contain analyses of the following aspects:

  • Macro Structure:
    • layout of blog pages (header, sidebars, body etc.)
    • blog pages as part of a network of pages (about, pictures, homepage etc etc)
  • Meso Structure (is that a proper term?):
    • key elements of blog postings (meta links, tags, categories)
    • key elements of sidebars (meta links, blogrolls etc etc.)
    • thematic structure of blog postings (?)
  • Micro Structure:
    • language and image
    • register / style: key words, frequency counts, sentence and word length…
    • hyperlinks and their uses
    • topics, subtopics, topical coherence

As always, I did have a rough idea about what the chapter should deal with, but I did not know how to gather the necessary data. I did quite extensive research on corpus software, comparing the capabilities of particular programs, always asking myself whether I would actually need what was offered.

I came across the program TreeTagger by Helmut Schmid (described in detail in Schmid 1994). This software can be used on .txt files and creates a vertical .txt file (one token per line) with a POS tag added to each token the tagger knows. Installing the program on Windows is not easy for dummies, as it was actually designed to run on Linux and still needs the command shell. There is, however, also a graphical interface, which I tried out (of course) and which works quite well.
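To give an idea of what working with such a vertical file looks like, here is a minimal sketch in Python. It assumes the columns are tab-separated with the token first and the POS tag second; the function name `parse_vertical`, the placeholder tag `UNKNOWN` and the sample tags are illustrative assumptions, not part of TreeTagger itself.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str


def parse_vertical(text: str) -> List[Token]:
    """Parse vertical tagger output: one token per line, tab-separated columns."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        columns = line.split("\t")
        if len(columns) < 2:
            # Assumption: a line without a tag column is a token the tagger
            # did not annotate; keep it with a placeholder tag.
            tokens.append(Token(columns[0], "UNKNOWN"))
        else:
            tokens.append(Token(columns[0], columns[1]))
    return tokens


# Illustrative sample only; the tags are made up for this sketch.
sample = "The\tDT\nweblog\tNN\nlinks\tVVZ\nto\tTO\nApple\tNP\n"
tokens = parse_vertical(sample)
```

Any extra columns (a lemma, for instance) are simply ignored here, so the same function works whether or not the tagger writes them.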

TreeTagger serves as a POS tagger only. M.Eik Michalke provides a software package – koRpus – which works within the R framework. The koRpus package can tag .txt files using TreeTagger and afterwards do some frequency analyses on the text in question. As it is written by a psychologist, its focus lies on readability measures. My knowledge of R is quite limited, and after everything had looked really promising for a while, I became disappointed with the measures available and especially with the way the data generated by the analyses is stored and made available for further use (I was not able to really figure that out, not even using the graphical R interface RKWard). So I decided not to use koRpus and to look for other software.

And I found: WordSmith, a software that offers the following (unfortunately, not on an open-source basis like the R packages and therefore not for free…):

  • word lists, frequency analyses and measures such as sentence- and word length
  • key words in texts or groups of texts, based on the word lists of single texts; key words can be compared with established corpora such as the BNC
  • concordances (even though I probably will not need those)

I was especially thrilled by the key word feature, as this makes it possible to identify key topics when the 10 to 20 most frequent nouns in a text (or in all texts of a period) are understood as indicators of the topics mostly dealt with. An example: I did a key word analysis on two texts of period one and found IT words among the most frequent nouns. This was what I expected, as the first weblog authors were mainly IT experts and their weblogs dealt (among other, more personal topics as in EatonWeb, for instance) with IT stuff, software, new links on the web, Apple vs. Microsoft and so on and so forth. I now hope to use this key word tool for a broader analysis, aiming at extrapolating topical shifts across the periods.
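The idea of reading the most frequent nouns as topic indicators can be sketched in a few lines of Python. This is plain frequency counting on already tagged tokens, not WordSmith's own key word statistic; the function name `top_nouns`, the tag prefix `NN` and the sample data are assumptions made up for illustration.

```python
from collections import Counter


def top_nouns(tagged, n=20, noun_prefix="NN"):
    """Return the n most frequent nouns from (word, pos) pairs.

    Assumption: common-noun tags start with `noun_prefix`; adjust this to
    whatever tagset the tagger actually uses.
    """
    counts = Counter(
        word.lower() for word, pos in tagged if pos.startswith(noun_prefix)
    )
    return counts.most_common(n)


# Made-up sample data, not taken from the corpus.
tagged = [
    ("software", "NN"), ("links", "NNS"), ("is", "VBZ"), ("software", "NN"),
    ("Apple", "NP"), ("web", "NN"), ("software", "NN"), ("links", "NNS"),
]
top = top_nouns(tagged, n=2)
```

Running the same count per period and comparing the resulting lists would be one simple way to look for topical shifts.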

So, currently I am working my way through all corpus texts again (330), doing the following steps (as always, I use SPSS for my statistics):

  1. I count the hyperlinks used in the entries. I differentiate between external links (the URL points to another domain), internal links (the URL remains within the same domain; links to categories, for example, are internal as well), meta links (Permalinks, Trackbacks and Comment links, mostly at the end of postings) and other links (mailto:, download etc.). Categories do not count as meta links but as internal links, because some period I weblogs already offer internal category links but no other meta links; I also want to get neat data for the categories.
  2. I count other meso-structural features such as BlogRolls, guest books and so on. Maybe there are some trends that show after some counting…
  3. I determine a layout-type – Schlobinski & Siever (2005) suggested some and I extended their typology.
  4. I code the text in MAXQDA for special features like emoticons, rebus forms, oral features, graphostyle…
  5. I generate a PDF file from the website, which is imported into MAXQDA as well. This PDF file is used for coding the language-image interplay and image types. Currently, I am doing some rough coding, intending to get more fine-grained later on.
  6. I generate a .txt file with the postings of the weblog. This .txt file will later be used in WordSmith.
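The link typology from step 1 could be roughly automated along the following lines. This is only a sketch under stated assumptions: the function `classify_link`, the URL hints used to spot meta links and the list of download extensions are hypothetical, and in practice meta links are identified by their function on the page rather than by their URL alone.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical markers; real weblogs will need their own lists.
META_HINTS = ("permalink", "trackback", "#comments")
DOWNLOAD_EXTENSIONS = (".zip", ".pdf", ".mp3")


def classify_link(href: str, blog_domain: str) -> str:
    """Classify a link as 'external', 'internal', 'meta' or 'other'."""
    parsed = urlparse(href)
    if parsed.scheme == "mailto" or parsed.path.lower().endswith(DOWNLOAD_EXTENSIONS):
        return "other"
    if any(hint in href.lower() for hint in META_HINTS):
        return "meta"
    host = parsed.netloc.lower()
    # Relative URLs have no host and therefore stay within the same domain.
    if host == "" or host == blog_domain or host.endswith("." + blog_domain):
        return "internal"
    return "external"


# Made-up links for a hypothetical weblog at myblog.example.
links = [
    "http://example.org/some-article",
    "/category/tech",
    "http://myblog.example/2004/05/post#comments",
    "mailto:me@myblog.example",
    "http://myblog.example/files/theme.zip",
]
link_counts = Counter(classify_link(href, "myblog.example") for href in links)
```

Category links fall under "internal" here, in line with the counting rule above. The resulting counts per weblog could then be entered into SPSS like the manually counted ones.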

This procedure takes a while. As it is quite exhausting as well, I can only analyse around 20 texts per day. So that means around 6 weeks of work until I can move on to the WordSmith analyses and the language-image interplay (I’m really dreading that…).

Corpus Update

As I have pointed out in my first post, one comment about the diachronic corpus of Personal Weblogs my thesis is based on concerned the number of texts, especially in the later periods. (An outline of the corpus structure can be found in the talks “Anything goes – everything done?” and “Stability, Diversity, and Change. The Textual Functions of Personal Weblogs”.) People argued that a low number of texts was fine for period one, as there were only few weblogs around in those days. However, higher numbers of texts were expected for later periods, as access becomes easier the more recent the collection dates are.

I have been thinking about these comments ever since, trying to find arguments for not extending the corpus. What I found, however, were quite weak excuses. Even more, I started wondering how I could justify a particular number of texts for a period in question at all. I came up with the following line of reasoning:

  • I work with both qualitative and quantitative methods, even though my general focus lies on the qualitative end of the continuum. Text numbers, therefore, have to be justified both from a qualitative and a quantitative point of view.
  • The qualitative framework of my thesis is heavily inspired by Grounded Theory (e.g. following Glaser & Holton 2004). In Grounded Theory, there is a process called “Theoretical Sampling” combining data collection, coding and analysis. The basic idea is that data collection is guided by the emerging theory and strives for theoretical saturation. In other words: If nothing new is found, no conflicting cases, no cases challenging the categories established so far, the analyst has reached some point close enough to theoretical saturation to stop collecting samples. (footnote: He might as well have turned blind to new phenomena by excessive preceding analysis. Anyway, further collection of samples would not help the research project in that case, either.) So that’s exactly my qualitative part of the argumentation: Collecting text samples until nothing new or challenging is discovered. This point had already almost been reached after collecting and analysing 80 to 90 texts for the periods II.A to II.C, but it was good to put my categories to the test by collecting more texts and assimilating them into my theory.
  • From a quantitative point of view, a researcher has to make some kind of informed guess on how many cases will probably be enough to make some statistically sound statements. One formula suggested by Raithel (2008: 62) uses the number of variables to be joined in one analytical step (V, e.g. two in a correlation study of two variables) and the number of features per variable (K, e.g. two features for the variable “gender”); K is raised to the power of V and the result is multiplied by 10: n >= 10 * K^V. As I try to trace the change within several variables which are investigated apart from each other, my analytical steps quite often only contain one variable with a particular number of features. The variable with the highest number of features at present is the textual function with about ten distinct features (e.g. Update, Filter, Sharing Experience, as outlined in my last post). With V = 1 and K = 10 the formula gives n >= 100, so about 100 texts per period are roughly enough. This is quite a tight budget; if I want to correlate the variable “textual function” with the variable “gender of author” I have to point out that the results give some hint at a possible statistical connection but have to be taken with a pinch of salt.
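The estimate can be checked with a small Python sketch of the formula as quoted above. The function name `minimum_sample_size` is made up for illustration, and the two-variable case assumes the same number of features for both variables, which is a simplification (gender, for instance, has only two).

```python
def minimum_sample_size(features_per_variable: int, variables: int) -> int:
    """Estimate n >= 10 * K^V, with K features per variable and V variables."""
    return 10 * features_per_variable ** variables


# One variable (textual function) with about ten features.
single = minimum_sample_size(features_per_variable=10, variables=1)

# Two variables analysed together, both taken to have ten features (assumption).
paired = minimum_sample_size(features_per_variable=10, variables=2)
```

Under these assumptions `single` comes out at 100 and `paired` at 1000, which illustrates why 100 texts per period suffice for one variable at a time but leave little room for correlating two.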

I think that both arguments taken together form a fairly stable basis for the justification of the number of cases. I guess 100 texts in the periods II.A, II.B and II.C are also a good compromise between striving for ever higher case numbers and the feasibility of qualitatively and thoroughly analysing, say, 500 texts in each period.

So, after the extension phase that took me a bit more than one week of searching for texts, coding, basically repeating all analytical steps I had done before and updating the numbers in my thesis, the corpus looks like this now (snapshot from my screen, sorry for the quality):


Two conferences in three weeks…

The last three weeks were quite exhausting, exciting and in general a thrilling experience for me as a doctoral student.

In February, I had the opportunity to present my diachronic corpus of Personal Weblogs to an audience of media linguists and communication scientists at the conference of the DGPuK section “Mediensprache”. The focus of the talk was my methodology of collecting corpus candidates and selecting those that were added to the corpus. I also presented some ideas about the use of images in Personal Weblogs. The slides and the manuscript of the talk can be found on the “publication” page.

The feedback was quite positive. Michael Klemm suggested conducting interviews, especially concerning the question of media selection – the choice between a weblog, Facebook, Twitter and other forms of communication. I am thinking about this suggestion; probably I won’t have the time and space to include that in my doctoral thesis. I guess I should focus on the material I have gained from analysing the metablogging in my corpus texts. However, it might be a good idea to mention interviews as a matter for further research in my conclusion section.

Another comment concerned the size of my corpus, in particular the 80 texts in period II.C. I should be aware that people will always ask why there is a particular number of texts, why not more, why not less. I am thinking of extending the corpus to 100 texts per period in part II. This entails a lot of work; however, according to my estimation formula (I use Raithel’s (2008: 62) formula n >= 10 * K^V, with K being the number of features per variable and V being the number of variables analysed together), 100 texts are a safe number to work with, as none of my variables has more than 8-ish different features and my study does not need to look at more than 2 variables simultaneously. Be that as it may, I find this insistence on numbers a bit frustrating. I mean, I DO have 80 texts each for periods II.B and II.C and even 93 for period II.A. And I DO work with a sheer flood of examples from those texts – so why is that not enough to describe some patterns and their change(s)?

Another, very interesting suggestion was that of a connection between media development and topics – never thought about the fact that fashion blogs came into existence because of the ease of embedding images! Thanks to Christof Barth (Trier University) for that idea!

My second talk was last weekend (14th NLK) and dealt with the textual functions of the Personal Weblogs in DIABLOK. I presented my methodology – a combination of Grounded Theory-style content analysis (Glaser & Holton 2004; Mayring 2010) and linguistic analysis à la Klaus Brinker (1983, 2000, 2010). I basically work with ethnocategories here – so I try to find out what bloggers say they do functionwise and analyse these functional patterns linguistically. I suggested functional patterns called Update, Filter, and Sharing Experience.

My mentor Alexander Brock, who was also present at the conference, was not quite content with the names of the functional patterns, especially regarding the Update-function. I am not sure whether I get him right: His point is that “Update” actually only concerns a special kind of information structure, a ratio of new and old information. In my opinion, “Update” is a functional pattern that the blogging community has termed like that and which can be recognized by structural, contextual and functional features (see my slides for examples).

Our compromise, however (even though it might be the result of a misunderstanding), is quite a useful one: My mentor suggested not presenting all the ethnocategories as separate subchapters but rather grouping them according to their dominating functional component. So there will be subchapters on informational, appellative, and contact functions as well as on production-oriented functions (thinking by writing, releasing emotional tension, creative expression).

Apart from that, I got a highly interesting comment about the DarkNet with its utter anonymity and a possible comparison of my Personal Weblogs with the textual patterns to be found there. Thank you, Marco, for that – I will definitely follow this lead some day!