Thursday, December 20, 2007

The beta regression

Sorry folks, but this post is (for once) almost purely scientific, and does not include anything about eating or drinking. The reason is that the article I worked on for most of last spring has now been published, and as I think that most of you probably won’t bother to read it, I decided to write something about it in this blog. If advertising did not work, so much time and money would not be spent on it…

The aim of my PhD project is to develop and compare different methods for estimating forest canopy cover. As accurate field measurements of cover percentage are quite slow, one alternative is to use a statistical model to predict it from easily measurable forest variables. This issue has been around since my master’s thesis, where I made my first attempts to model the cover with simple linear regressions, using data collected near Metla’s research station in Suonenjoki. The results were relatively good, but linear regression models just don’t work very well for modelling percentages: no scientist wants to see percentages that are negative or larger than 100%. In addition, the old models did not work nearly as well when more data were included, so there was still a lot of work to be done when I started the process in early 2007.

I spent some time studying different modelling techniques (logistic regression, nonlinear curves and generalized linear models) to see if any of these would work better with this kind of data. Nothing very useful appeared, until I found this lecture material by Juha Heikkinen. There was a hint: Juha suggested using the beta distribution for modelling proportions. Fine, but how? Luckily, nowadays there is always someone to ask: Google. After some surfing I discovered that there was an R library called betareg, which was just the thing I was looking for: beta regression was both simple and suitable for this kind of problem. Another interesting thing was that even though a lot of statistics is used in forest science, I could not find any earlier forest-related studies where this technique had been used, only something related to psychology. So maybe this paper would have something “new” after all.
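For the curious, here is a minimal sketch of the idea behind beta regression, written in plain Python rather than R. It assumes the usual mean-precision form of the beta distribution, where an observed proportion y follows Beta(mu * phi, (1 - mu) * phi) and the mean mu is tied to a predictor through a logit link. The function names, the single predictor and the example numbers are made up for illustration; this is not the betareg code and not the model from the paper.

```python
import math


def inv_logit(eta):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_logpdf(y, mu, phi):
    """Log-density of Beta(mu*phi, (1-mu)*phi) at y, for 0 < y < 1.

    mu is the mean (0 < mu < 1) and phi > 0 is the precision:
    the larger phi is, the more tightly y clusters around mu.
    """
    a = mu * phi
    b = (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y)
            + (b - 1.0) * math.log(1.0 - y))


def log_likelihood(beta0, beta1, phi, xs, ys):
    """Log-likelihood of a one-predictor beta regression.

    The linear predictor beta0 + beta1*x is pushed through the
    inverse logit, so the modelled mean is always a valid proportion.
    """
    return sum(
        beta_logpdf(y, inv_logit(beta0 + beta1 * x), phi)
        for x, y in zip(xs, ys)
    )


# Hypothetical data: some stand variable x and canopy cover as a proportion.
xs = [5.0, 10.0, 15.0, 20.0, 25.0]
ys = [0.20, 0.35, 0.55, 0.70, 0.80]

# Hypothetical parameter values; a real fit would maximise this number.
ll = log_likelihood(-2.0, 0.15, 20.0, xs, ys)
```

Fitting the model means searching for the parameter values that make this log-likelihood as large as possible, which is the part a library such as betareg takes care of.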

What remained then was the tough job of curve fitting. In this case it meant writing numerous R scripts to handle the data, testing the different model shapes, and fixing the small bugs in the library. I tested many different model shapes without very good results, as the relationships were heavily nonlinear. In the end I decided to use cubic polynomials to make the models flexible enough. Cubic terms can behave wildly outside the range of the data, but fortunately the beta regression’s logistic link function takes care of the worst extrapolation problems, since it keeps the predictions between 0 and 100%. Once the analysis was done, what remained was the writing of the manuscript, which was finished in May. The manuscript was then submitted to Silva Fennica, and after some revisions it was accepted for publication. And now, here it is, and it is time for new challenges.
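To illustrate why the link function helps with extrapolation, here is a small Python sketch. The cubic coefficients are invented for the example and have nothing to do with the published models; the point is only that a cubic polynomial can run off to any value, while the same polynomial passed through the inverse logit cannot leave the range from 0 to 100%.

```python
import math

# Hypothetical coefficients (b0, b1, b2, b3) of a cubic linear predictor.
COEFFS = (-2.0, 0.25, -0.004, 0.00005)


def cubic(x, coeffs=COEFFS):
    """Cubic polynomial b0 + b1*x + b2*x^2 + b3*x^3 (unbounded)."""
    b0, b1, b2, b3 = coeffs
    return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3


def inv_logit(eta):
    """Inverse logit, written so that large |eta| cannot overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def predicted_cover_percent(x):
    """Cover prediction in percent; never below 0 or above 100."""
    return 100.0 * inv_logit(cubic(x))


# Far outside any plausible data range the raw cubic is huge,
# but the linked prediction is still a valid percentage.
for x in (-100.0, 0.0, 50.0, 200.0):
    print(x, cubic(x), predicted_cover_percent(x))
```

The link does not make an extrapolated prediction right, of course. It only guarantees that the number coming out is a possible percentage.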