A CAS fellow’s new blog aims to use data to explore some myths about the Norwegian language.

‘Written Norwegian is a popular topic of discussion, and assumptions abound,’ Helge Dyvik, a professor in the Department of Linguistic, Literary and Aesthetic Studies at the University of Bergen (UiB), writes on the blog. ‘Today it is easier to test these assumptions than ever before thanks to large collections of data -- or language resources -- that have been developed for an increasing number of languages over the last few years.’

One such resource is known as a treebank, which is a large collection of text in which each sentence has been analysed syntactically (and sometimes also semantically). Treebanks are often used in language technology development, as they give developers a database of examples of grammatical phenomena.

Dyvik’s blog is based on NorGramBank, a Norwegian treebank of about 70 million words that was developed as part of the Infrastructure for the Exploration of Syntax and Semantics (INESS) project, in which he has participated. The text comes from sources including newspaper articles, novels, parliamentary records, and other publications.

NorGramBank has also inspired the name of the blog: NorGram-Tall (literally ‘NorGram numbers’).

In the posts published so far, Dyvik has explored, among other topics, the use of the masculine indefinite article (‘en’) with the feminine word ‘jente’ (‘girl’); how often, and in which authors’ writing, the plural of neuter nouns ends in ‘-a’ or ‘-ene’; and the use of passive constructions.

This year, Dyvik is participating in the CAS project SynSem: From Form to Meaning - Integrating Linguistics and Computing.

‘This blog is being developed during my stay at CAS, where I am in close contact with leading international scholars in the fields of computational grammar development, syntactic and semantic analysis, and treebanks,’ Dyvik writes.