
The Society for Research into Higher Education



The potential of automated text analysis for higher education research

by Stijn Daenekindt

Together with Jeroen Huisman, I recently published an article in which we mapped the field of research on higher education. In a previous blogpost we reflected on some key findings, but only briefly mentioned the method we used to analyse the abstracts of 16,928 research articles (which total over 2 million words). Obviously we did not read all these texts ourselves. Instead, we applied automated text analysis. In the current blogpost, I discuss this method to highlight its potential for higher education research.

Automated text analysis holds tremendous potential for research into higher education. This is because higher education institutions (ie our research subjects) ‘live’ in a world dominated by the written word. Much of what happens in and around higher education institutions eventually gets documented. Indeed, higher education institutions produce an enormous amount and variety of texts, eg grant proposals, peer reviews and rejection letters, academic articles and books, course descriptions, mission statements, commission reports, evaluations of departments and universities, policy reports, etc. Obviously, higher education researchers are aware of the value of these documents, and they have offered many insightful case studies based on close reading of them. However, for some types of research question, analysing a small sample of texts just doesn’t do the job. When we want to analyse amounts of text that are unfeasible for close reading by humans, automated text analysis can help us.

There are various forms of automated text analysis. One of the most popular techniques is topic modelling. This machine learning technique automatically extracts clusters of words (ie topics) by analysing patterns of word co-occurrence in documents to reveal latent themes. Two basic principles underlie a topic model. The first is that each document consists of a mixture of topics. So, imagine that we have a topic model that differentiates two topics: document A could consist of 20% topic 1 and 80% topic 2, while document B might consist of 50% topic 1 and 50% topic 2. The second principle is that every topic is a mixture of words. Imagine that we fit a topic model on every edition of a newspaper over the last ten years. One topic could include words such as ‘goal’, ‘score’, ‘match’, ‘competition’ and ‘injury’. A second topic could include words such as ‘stock’, ‘dow_jones’, ‘investment’, ‘stock_market’ and ‘wall_street’. The model can identify these clusters of words because they often co-occur in texts. That is, the word ‘goal’ is far more likely to co-occur with the word ‘match’ in a document than with the word ‘dow_jones’.
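These two principles can be sketched in a few lines of Python. The topics, words and probabilities below are invented for illustration (not the output of a fitted model): because a document is a mixture of topics and each topic is a mixture of words, a document’s expected word distribution is simply the weighted sum of its topics’ word distributions.

```python
# Principle 2: a topic is a probability distribution over words.
# (Numbers are made up for illustration.)
topics = {
    "sports":  {"goal": 0.4, "match": 0.4, "dow_jones": 0.2},
    "finance": {"goal": 0.1, "stock": 0.45, "dow_jones": 0.45},
}

# Principle 1: a document is a probability distribution over topics.
doc_a = {"sports": 0.2, "finance": 0.8}

def word_distribution(doc_topics, topics):
    """Expected word distribution of a document: a weighted mix of its topics."""
    words = {}
    for topic, weight in doc_topics.items():
        for word, prob in topics[topic].items():
            words[word] = words.get(word, 0.0) + weight * prob
    return words

dist = word_distribution(doc_a, topics)
# 'dow_jones': 0.2 * 0.2 + 0.8 * 0.45 = 0.40
```

A real topic model works in the opposite direction: it observes only the words of each document and infers both distributions from patterns of co-occurrence.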

Topic models allow us to reveal the structure of large amounts of textual data by identifying topics. A topic is basically a set of words; more formally, it is expressed as a set of word probabilities. To learn what a latent theme is about, we can order all the words in decreasing probability. The two illustrative topics (see previous paragraph) clearly deal with the general themes ‘sports’ and ‘financial investments’. In this way, what topic models do with texts closely resembles what exploratory factor analysis does with survey data, ie revealing latent dimensions that structure the data. But how is the model able to find interpretable topics? As David Blei explains, and this may help to build a more intuitive understanding of the method, topic models trade off two goals: (a) the model tries to assign the words of each document to as few topics as possible, and (b) the model tries, in each topic, to assign high probability to as few words as possible. These goals are at odds. For example, if the model allocates all the words of one document to one single topic, then (b) becomes unattainable. If, on the other hand, every topic consists of just a few words, then (a) becomes unattainable. It is by trading off both goals that the topic model is able to find interpretable sets of tightly co-occurring words.
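Ordering a topic’s words by decreasing probability is straightforward once topics are represented as word probabilities. A minimal sketch (the words and probabilities are invented for illustration):

```python
# A topic expressed as word probabilities; numbers are invented.
topic = {"match": 0.30, "goal": 0.25, "score": 0.20, "injury": 0.15, "stock": 0.10}

def top_words(topic, n=3):
    """Return the n most probable words of a topic, most probable first."""
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

print(top_words(topic))  # ['match', 'goal', 'score'] -> clearly a 'sports' theme
```

In practice, researchers inspect the top ten or so words of each topic to assign it a human-readable label.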

Topic models focus on the co-occurrence of words in texts. That is, they model the probability that a word co-occurs with another word anywhere in a document. To the model, it does not matter whether ‘score’ and ‘match’ are used in the same sentence, or whether one is used at the beginning of the document while the other is used at the end. This puts topic modelling in the larger group of ‘bag-of-words approaches’, a group of methods that treat documents as … well … bags of words. Ignoring word order is a way to simplify and reduce the text, which yields various nice statistical properties. On the other hand, this approach may result in a loss of meaning. For example, the sentences ‘I love teaching, but I hate grading papers’ and ‘I hate teaching, but I love grading papers’ obviously have different meanings, but this difference is ignored by bag-of-words techniques.
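The loss of meaning is easy to demonstrate: once word order is discarded, the two example sentences reduce to exactly the same bag of words.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Reduce a text to word counts, discarding word order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag_of_words("I love teaching, but I hate grading papers")
b = bag_of_words("I hate teaching, but I love grading papers")
print(a == b)  # True: identical bags, despite the opposite meanings
```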

So, while bag-of-words techniques are very useful for classifying texts and understanding what the texts are about, the results will not tell us much about how topics are discussed. Other methods from the larger set of methods of automated text analysis are better equipped for this. For example, sentiment analysis allows one to analyse opinions, evaluations and emotions. Another method, word embedding, focuses on the context in which a word is embedded. More specifically, the method finds words that share similar contexts. By subsequently inspecting a word’s nearest neighbours (ie the words that often occur in the neighbourhood of our word of interest) we get an idea of what that word means in the text. These are just a few examples of the wide range of existing methods of automated text analysis, and each of them has its pros and cons. Choosing between them ultimately comes down to finding the optimal match between a research question and a specific method.
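The nearest-neighbour idea can be sketched with toy two-dimensional vectors. Real embedding models learn vectors with hundreds of dimensions from large corpora; the words and coordinates below are invented for illustration. Words used in similar contexts end up with vectors pointing in similar directions, which is typically measured with cosine similarity.

```python
import math

# Toy 2-d "embeddings"; real models learn these from word contexts.
vectors = {
    "goal":      (0.9, 0.1),
    "match":     (0.8, 0.2),
    "dow_jones": (0.1, 0.9),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbour(word, vectors):
    """The other word whose vector points in the most similar direction."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest_neighbour("goal", vectors))  # 'match', not 'dow_jones'
```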

More collections of electronic text are becoming available every day. These massive collections of texts present massive opportunities for research on higher education, but at the same time they present us with a problem: how can we analyze these? Methods of automated text analysis can help us to understand these large collections of documents. These techniques, however, do not replace humans and close reading. Rather, these methods are, as aptly phrased by Justin Grimmer and Brandon Stewart, ‘best thought of as amplifying and augmenting careful reading and thoughtful analysis’. When using automated text analysis in this way, the opportunities are endless and I hope to see higher education researchers embrace these opportunities (more) in the future.

Stijn Daenekindt is a Postdoctoral Researcher at Ghent University (Department of Sociology). He has a background in sociology and in statistics and has published in various fields of research. Currently, he works at the Centre for Higher Education Governance Ghent. You can find an overview of his work at his Google Scholar page.



The (future) state of higher education research?

by Stijn Daenekindt and Jeroen Huisman

Parallel to the exponential growth of research on higher education, we see an increasing number of scientific contributions aiming to take stock of our field of research. Such stock-taking activities range from reflective and possibly somewhat impressionistic thoughts of seasoned scholars to in-depth reviews of salient higher education themes. Technological advancements (such as easy electronic access to research output and an increasingly broader set of analytical tools) obviously have made life easier for analysts. We recently embarked upon a project to explore the thematic diversity in the field of research in higher education. The results have recently been published in Higher Education. Our aim was to thematically map the field of research on higher education and to analyse how our field has evolved over time.

For this endeavour, we wanted our analysis to be large-scale. We aimed at including a number of articles that would do justice to the presumed variety in research into higher education. We did not, however, want the scale of our analysis to jeopardize the depth of our analysis. Therefore, we decided not to limit our analyses to, for example, an analysis of citation patterns or of keywords. Finally, to forestall bias (stemming from our personal knowledge about and experience in the field), we applied an inductive approach. These criteria led us to collect 16,928 journal articles on higher education published between 1991 and 2018 and to analyse each article’s abstract by applying topic modelling. Topic modelling is a method of automated text analysis and a follow-up blogpost (also on srheblog.com) will address the method. For now, it suffices to know that topic modelling is a machine learning technique that automatically analyses the co-occurrence of words to detect themes/topics and to find structure in a large collection of text.

In this blogpost, we present a glimpse of our findings and some additional thoughts for further discussion. In our analysis, we differentiate 31 research topics which inductively emerged from the data. For example, we found topics dealing with university ranking and performance, sustainability, substance use of college students, research ethics, etc. The bulk of these research topics were studied at the individual level (16 topics), with far fewer at the organisational (5) and system level (3). A final set of topics related either clearly to disciplines (eg teaching psychology) or to more generic themes (methods, academic writing, ethics). This evidences the richness of research into higher education. Indeed, our field of research certainly is not limited in terms of perspectives and unleashes “the whole shebang” of possible perspectives to gain new insights into higher education.

The existence of different perspectives also harbours potential dangers, however. Studies applying a certain approach to higher education (say, a system-level approach) may suffer from tunnel vision and lose sight of individual- and organisation-level aspects of higher education. This may be problematic, as processes at the different levels are obviously related to one another. In our analysis we find that studies indeed tend to focus on one level. For example, system-level topics tend to be combined exclusively with other system-level topics. This should not come as a big surprise, but there is a potential danger in it: it may hamper the development of a more integrated field of research on higher education.

In our analysis, we also find a certain reluctance to combine topics located at the same level. For example, topics on teaching practices are very rarely combined with topics on racial and ethnic minorities, even though both topics are situated at the individual level. To us, this was surprising, as the combination of ethnicity and educational experiences is a blossoming field in the sociology of education. The fact that topics at the same level are only rarely combined is less understandable than the fact that topics at different levels are rarely combined. We hope that our analysis helps other researchers to identify gaps in the literature and motivates them to address these gaps.

A second finding we wish to address here relates to specialisation. Our analysis suggests a trend of specialisation in our field of research: looking at the number of topics combined in articles, we see that topic diversity declines over time. On the one hand, this is not that surprising. Back in 1962, Kuhn already argued that the system of modern science encourages researchers towards further specialisation. So it makes sense that over time, parallel to the growth of the field of research on higher education, researchers specialise more and demarcate their own topics of expertise. On the other hand, it may be considered a problematic evolution, as it can hamper our field’s development towards further maturity.

But what should we make of the balance between healthy expansion and specialisation, on the one hand, and inefficient fragmentation, on the other? We lean towards evaluating the current state of higher education research as moving towards fragmentation. Other researchers, such as Malcolm Tight, Bruce Macfarlane and Sue Clegg, have similarly lamented the fragmented nature of our field of research. Our analysis adds to this by showing the trends over time: we observe more specialisation (not necessarily bad), but also signs of disintegration (not good). Other analyses we are currently carrying out also indicate thematic disintegration and suggest clear methodological boundaries. It looks like many researchers focusing on the same topic remain in their “comfort zone” and use a limited set of methods. For sure, many methodological choices are functional (as in fit-for-purpose), but the lack of diversity is striking. Moreover, we see that many higher education researchers stick to rather traditional techniques (surveys, interviews, case studies) and that new methods hardly get picked up in our field. A final observation is that we hardly see methodological debates in our field. In related disciplines we often see healthy methodological discussions that improve the available “toolkit” (for example here). In our field, scholars appear to shy away from such discussions, which suggests methodological conservatism and/or methodological tunnel vision.

There are still many things to investigate before we can arrive at a full assessment of the state of the art. One important question is how our field compares to other fields or disciplines. But if we accept the idea of fragmentation, it is pertinent to start thinking about how to combat it. Reversing this trend is obviously not straightforward, but here are a few ideas. Individual scholars could try to get out of their comfort zone by applying other perspectives to their favourite research object and/or by applying their favourite perspective to new research topics. Relatedly, researchers should be encouraged to use techniques less commonly used in our field and see whether they yield different outcomes (vignettes, experimental designs, network analysis, QCA/fuzzy logic, [auto-]ethnography and, of course, topic models). In addition, journal editors could be more flexible and inclusive in terms of the format of the submissions they consider. For example, they could explicitly welcome submissions in the format of a ‘commentary’ or ‘reply to’. This would stimulate debate, open up the floor for increased cross-fertilisation of research into higher education and, in general, signal the maturity of research into higher education. Finally, there is scope for alternative peer review processes. Currently, only editors (and sometimes peer reviewers seeing the outcome of a peer review process) gain full insight into the feedback offered by peers. If we made these processes more visible to a broader readership, for example through open peer review (which can still be double-blind), we would gain much more insight into methodological and theoretical debates, which would definitely support the healthy growth of our field.

This post is based on the article: Daenekindt, S and Huisman, J (2020) ‘Mapping the scattered field of research on higher education. A correlated topic model of 17,000 articles, 1991–2018’ Higher Education, 1-17. Stijn Daenekindt is a Postdoctoral Researcher at Ghent University (Department of Sociology). SRHE Fellow Jeroen Huisman is a Full Professor at Ghent University (Department of Sociology).




Australian HE reform could leave students worse off

By Marcia Devlin

Australia is in full election campaign mode. What a returned conservative government means for higher education is a little worrying, although what a change of government means is worrying for different reasons.

Two years ago, the then federal Minister for Education, Christopher Pyne, proposed a radical set of changes to higher education funding including, among other things, a 20% cut to funding and full fee deregulation. While the latter received support from some institutions and Vice-Chancellors, there were very few supporters of the whole package. Among those who did not support it were the ‘cross-benchers’ – the independent and minor party members of the Parliament of Australia who have held the balance of power since being elected in 2014 – and so the proposals were not passed.

The government has since introduced Senate voting reforms, which mean the minor parties will not be able to swap preferences to secure Senate seats as they have done in the past, making a future cross bench like this one less likely. That is a shame for higher education, in my view, as these folk actually listened to the sector and the public and responded accordingly. Mr Pyne has now moved on to other responsibilities. But just before he moved, this actually happened: https://www.youtube.com/watch?v=Hc9NRwp6fiI

The new and current Education Minister, Simon Birmingham, has released a discussion paper in lieu of budget measures.




Equality of opportunity: the first fifty years

By Simon Marginson

The article below is abridged from the keynote address given at the SRHE’s 50th Anniversary Colloquium at Church House, London on June 26th 2015. The full text of this keynote address is available via www.srhe.ac.uk/downloads/SimonMarginsonKeynote.pdf

Thomas Piketty’s Capital in the Twenty-first Century (2014) clarifies the distinction between (1) societies in which incomes are relatively equal and/or there is a high degree of middle class growth and social mobility, which include (albeit in different ways and for rather different reasons) both the Scandinavian countries and emerging East Asia; and (2) societies like the United States or the UK that are relatively closed in character, with highly unequal wage structures, growing capital concentrations, and static middle classes under considerable pressure to defend their past-gained economic and status positions.