As I pointed out in my first post, one comment about the diachronic corpus of Personal Weblogs my thesis is based on concerned the number of texts, especially in the later periods. (An outline of the corpus structure can be found in the talks “Anything goes – everything done?” and “Stability, Diversity, and Change. The Textual Functions of Personal Weblogs”.) People argued that a low number of texts was fine for period one, as there were only a few weblogs around in those days. However, higher numbers of texts were expected for later periods, as access to texts becomes easier the more recent the collection date.
I have been thinking about these comments ever since, trying to find arguments for not extending the corpus. What I found, however, were quite weak excuses. What is more, I started wondering how I could justify any particular number of texts for a given period at all. I came up with the following line of reasoning:
- I work with both qualitative and quantitative methods, even though my general focus lies on the qualitative end of the continuum. Text numbers, therefore, have to be justified both from a qualitative and a quantitative point of view.
- The qualitative framework of my thesis is heavily inspired by Grounded Theory (e.g. following Glaser & Holton 2004). In Grounded Theory, there is a process called “Theoretical Sampling” that combines data collection, coding and analysis. The basic idea is that data collection is guided by the emerging theory and strives for theoretical saturation. In other words: if nothing new is found, no conflicting cases, no cases challenging the categories established so far, the analyst has reached a point close enough to theoretical saturation to stop collecting samples. (footnote: The analyst might just as well have become blind to new phenomena through excessive preceding analysis. In that case, further collection of samples would not help the research project either.) So that is exactly the qualitative part of my argument: collecting text samples until nothing new or challenging is discovered. This point had already almost been reached after collecting and analysing 80 to 90 texts for the periods II.A to II.C, but it was good to put my categories to the test by collecting more texts and assimilating them into my theory.
- From a quantitative point of view, a researcher has to make some kind of informed guess about how many cases will probably be enough to make statistically sound statements. One formula suggested by Raithel (2008: 62) uses the number of variables V to be combined in one analytical step (e.g. V = 2 for a correlation study of two variables) and the number of features K per variable (e.g. K = 2 for the variable “gender”); the value K^V is multiplied by 10: n >= 10 * K^V. As I try to trace the change within several variables which are investigated separately, my analytical steps quite often contain only one variable with a particular number of features. The variable with the highest number of features at present is the textual function, with about ten distinct features (e.g. Update, Filter, Sharing Experience, as outlined in my last post). With K = 10 and V = 1, the formula gives n >= 100, so about 100 texts per period are roughly enough. This is quite a tight budget; if I want to correlate the variable “textual function” with the variable “gender of author”, I have to point out that the results give some hint at a possible statistical connection but have to be taken with a pinch of salt.
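The arithmetic behind the rule of thumb n >= 10 * K^V can be checked with a few lines of Python. This is only a minimal sketch: the function name `min_cases` and its parameters are hypothetical, and passing a list of feature counts (so that variables with different numbers of features are multiplied together) is my own assumption for illustration, not part of the formula as cited. When all variables have the same number of features K, the product reduces to K^V.

```python
from math import prod


def min_cases(feature_counts, factor=10):
    """Rule-of-thumb minimum number of cases for one analytical step.

    feature_counts: one entry per variable combined in the step, each entry
    being that variable's number of features (K). With V equal entries the
    product is K**V, i.e. n >= factor * K**V.
    Multiplying unequal feature counts is an illustrative assumption.
    """
    if not feature_counts or any(k < 1 for k in feature_counts):
        raise ValueError("need at least one variable with at least one feature")
    return factor * prod(feature_counts)


# One variable (textual function) with about ten features
print(min_cases([10]))      # 100

# Textual function (ten features) combined with gender (two features),
# under the assumed product generalization
print(min_cases([10, 2]))   # 200
```

Under this reading, a combined analysis of textual function and gender would already call for more than 100 texts per period, which is why such correlations can only be reported as tentative.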
I think that both arguments taken together form a fairly stable basis for justifying the number of cases. I guess 100 texts in each of the periods II.A, II.B and II.C are also a good compromise between striving for ever higher case numbers and the feasibility of qualitatively and thoroughly analysing, say, 500 texts in each period.
So, after the extension phase, which took me a bit more than one week of searching for texts, coding, basically repeating all the analytical steps I had done before and updating the numbers in my thesis, the corpus now looks like this (snapshot from my screen, sorry for the quality):