Corpus Description

The following short description of the DIABLOC corpus is mainly adapted from a draft I submitted to language@internet (still under review). Please see Schildhauer (2014, German) and the forthcoming book “The Personal Weblog: A Linguistic History” (early 2016) for further details.

Corpus Design

I compiled the DIAchronic BLOg Corpus (DIABLOC) in 2012-2013. The Internet Archive enabled me to view blog pages as they appeared at a specific point in time. In compiling the corpus, I consistently followed an ethno-category-based approach, which means that I took as an instance of a genre (in my case: weblog and personal weblog, see below) whatever members of the blogging community declared to be just that. Following the discourse of the blogging community, I used the following ethno-genre labels (cf. Schildhauer 2014, p. 97-99):[i]

Weblog. This is the first blogging-related genre label, coined by Jorn Barger in 1997. In 1998 and 1999, two other weblog authors, Jesse James Garrett (Infosift) and Cameron Barrett (Camworld), applied the term to all websites resembling theirs. It thereby became a genre label for first-generation blogs, the practice of a fairly small community. Part I of the research corpus (see the table below) contains 30 of these weblogs from 1997-2000. In order to compile these texts, I used Garrett’s listing of ye olde school-weblogs as well as Cameron Barrett’s blog roll on his weblog Camworld, which the author used to collect “other sites like his” (Blood 2000). Both sources taken together[ii] provide a useful gateway to the weblogs of the first-generation community.

Personal weblog. After first-generation bloggers had used the term occasionally, the first blog directory, Globe of Blogs (founded in 2002), adopted it to label a meta-category that authors can use to classify their blogs. Since 2002, then, personal weblog has been used as a genre label within the blogging community. In order to compile part II of the research corpus (see the table below), I accessed the list pages of the personal weblog category via the Internet Archive. Part II contains 300 blog-pages[iii] in three sub-periods.

DIABLOC                               Blog-Pages   Postings   Tokens
Period I (weblogs)
  1997-2000                                   30              ca.  34,075
Period II (personal weblogs)
  II.A 2002-2005                             100              ca. 224,025
  II.B 2006-2008                             100              ca. 211,420
  II.C 2009-2012                             100              ca. 302,017
Total                                        330              ca. 771,537
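The totals in the table can be checked with a few lines of Python. This is only a sketch for readers who want to verify the arithmetic or derive the relative size of each sub-corpus; the dictionary names are my own, and the figures are the approximate counts from the table.

```python
# Approximate token counts and blog-page counts, copied from the table above.
tokens = {
    "I (1997-2000)": 34_075,
    "II.A (2002-2005)": 224_025,
    "II.B (2006-2008)": 211_420,
    "II.C (2009-2012)": 302_017,
}
blog_pages = {
    "I (1997-2000)": 30,
    "II.A (2002-2005)": 100,
    "II.B (2006-2008)": 100,
    "II.C (2009-2012)": 100,
}

total_tokens = sum(tokens.values())    # 771,537, matching the "Total" row
total_pages = sum(blog_pages.values())  # 330, matching the "Total" row

# Share of each sub-corpus in the whole corpus, in percent of tokens.
token_share = {
    period: round(100 * count / total_tokens, 1)
    for period, count in tokens.items()
}

if __name__ == "__main__":
    print(f"Total tokens: {total_tokens:,}")
    print(f"Total blog-pages: {total_pages}")
    for period, share in token_share.items():
        print(f"{period}: {share} % of all tokens")
```

The shares make visible that period I, with 30 blog-pages, is considerably smaller than each of the three period II sub-corpora.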

This corpus was complemented by further sources, such as essays written by blog-authors on genre issues (e.g. Barrett 1999a/b) and blogging manuals, which were used to reconstruct the respective communication forms.

Criteria for Selecting Period II Corpus Texts

Because so few period I weblogs were available at all, I included every blog from this period that I could get hold of in the corpus. The situation was different for period II: every blog I could reach via a Globe of Blogs list page was at first only a corpus candidate and had to pass several tests before eventually entering the corpus.

First of all, only blogs which were available in the Internet Archive for their target period (see the table above for the periods) were put to the following tests at all. A blog archived in 2012 (and not at some point between 2006 and 2008) could never have been used to represent period II.B, as it might have undergone profound changes in the intervening years (e.g. in its layout, or through the deletion or modification of entries).

The second and most essential test concerns the validity of an individual personal weblog for a specific corpus period (II.A, II.B or II.C). Globe of Blogs does not provide information on when a blog was registered. It is therefore possible that a blog found on a 2007 list had been registered as early as 2002, had developed further in the following years and would no longer have been classified as a personal weblog in 2007. Merely allocating every blog found on a 2007 list to period II.B would therefore not have created a collection representative of what bloggers in 2007 thought a personal weblog looked like. For that reason, I applied two activity tests:

  • As a precondition to be selected for the target period at all, a personal weblog had to be active – i.e. show current entries – in the respective period.
  • I selected only personal weblogs whose archives did not date back further than the period boundary. A personal weblog with archived entries from 2004 would not have been selected for II.B, for instance. If there were no archives in the sense of a list of links somewhere in the sidebar, I browsed the blog for its first entry. If neither was possible (e.g. because pages with earlier entries were not available in the Archive), the blog did not pass the test. Generally, I allowed a tolerance range of a couple of months. For instance, a blog found in 2006 with some entries dating back to November 2005 was still allocated to period II.B. One exception is the personal weblog it’s ok to cry here (II.B), whose archives date back to 1976 in order to document the author’s birth. As the actual entries begin in period II.B, I decided to include the blog in the corpus.
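The two activity tests above can be sketched in Python as follows. This is an illustration only, not a tool used in compiling the corpus: the class and function names are hypothetical, "active" is modelled as the most recent archived entry falling inside the period, and the "couple of months" tolerance is assumed to be 92 days because no exact figure is given.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Sub-period boundaries as given in the corpus table.
PERIODS = {
    "II.A": (date(2002, 1, 1), date(2005, 12, 31)),
    "II.B": (date(2006, 1, 1), date(2008, 12, 31)),
    "II.C": (date(2009, 1, 1), date(2012, 12, 31)),
}

# Assumption: "a couple of months" is modelled as 92 days.
TOLERANCE = timedelta(days=92)


@dataclass
class BlogCandidate:
    """Hypothetical record for one blog reached via a list page."""
    name: str
    latest_entry: Optional[date]  # most recent entry in the archived snapshot
    first_entry: Optional[date]   # earliest entry; None if it cannot be determined


def passes_activity_tests(blog: BlogCandidate, period: str) -> bool:
    """Return True if the blog passes both activity tests for the period."""
    start, end = PERIODS[period]

    # Test 1: the blog must show current entries within the target period.
    if blog.latest_entry is None or not (start <= blog.latest_entry <= end):
        return False

    # Test 2: the first entry must be determinable and must not predate
    # the period boundary by more than the tolerance range.
    if blog.first_entry is None:
        return False
    return blog.first_entry >= start - TOLERANCE


if __name__ == "__main__":
    candidate = BlogCandidate(
        name="example blog",
        latest_entry=date(2006, 5, 1),
        first_entry=date(2005, 11, 15),
    )
    print(passes_activity_tests(candidate, "II.B"))
```

A blog whose first entries date from November 2005 passes for II.B under this tolerance, whereas one with entries from 2004 does not. Manual exceptions such as it’s ok to cry here are not modelled here; they would have to be decided case by case.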

In order to narrow down the variation potentially caused by language and / or culture areas (Luginbühl 2014), I selected only blogs whose authors had located them on Globe of Blogs in the area of one of the main varieties of English (USA, Canada, UK, Australia, New Zealand). This is quite a broad range, and the test itself does not stand on very solid ground, since it relies on the authors’ self-reported location. It was, however, the only option available for focusing on genre-inherent rather than language- or culture-induced variation.

For the purpose of multimodal analysis, I collected only blogs which were archived with sufficient quality. In some cases, I tolerated missing images (hyperlinked to other servers, for instance) as long as the rest of the blog was sufficiently archived.

Finally, I applied an ethical criterion: whenever authors explicitly prohibited the copying, processing or any other further use of their data, the blog was not collected. Additionally, I followed the guidelines of the Association of Internet Researchers (Ess & AoIR ethics working committee 2002; see also Marx & Weidacher 2014: 16-24 and Paccagnella 1997).



[i] The term blog itself is used as a hyperonym for a number of blog-genres such as the corporate blog, the science blog and others and was therefore not relevant for the corpus compilation.

[ii] These sources were triangulated with Blood (2000) and several others (e.g. Barrett 1999a/b; Winer 2001 and Rosenberg 2009).

[iii] The corpus contains the homepage of each blog with the respective postings.


