topic models | SRHE Blog

by Yusuf Oldac and Francisco Olivos

We recently embarked upon a project to explore the development of higher education research topics over the last decades. The results were published in Review of Education. Our aim was to thematically map the field of research on higher education and to analyse how the field has evolved over time between 2000 and 2021. This blog post summarises our findings and reflects on the implications for HE research.

HE research continues to grow. HE researchers are located in globally diverse geographical locations and publish on diversifying topics. Studies focusing on the development of HE with a global-level analysis are increasingly emerging. However, most of these studies are limited to scientometric network analyses that do not include a content-related focus. In addition, they are deductive, meaning that they fit new findings into existing categories. Recently, Daenekindt and Huisman (2020) were able to capture the scholarly literature on higher education through an analysis of latent themes by utilising topic modelling. This approach attracted attention in the literature, and the study’s contribution was highlighted in an earlier SRHE blog post. We also found their study useful and built on it in our novel analysis. However, their analysis focused only on generating topics from a wide range of higher education journals and did not examine explanatory factors, such as change over the years or the location of publication. Having identified this gap, we set out to move one step further.

A central contribution of our study is the inclusion of a set of explanatory factors for research content, namely time, region, funding, collaboration type, and journal, to investigate the topics of HE research. In methodological terms, our study moves beyond describing topic prevalence to explaining it, using structural topic modelling (Roberts et al, 2013).
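To make the step from describing to explaining prevalence more concrete, here is a minimal sketch in Python. It is not structural topic modelling itself, which estimates topics and covariate effects jointly. It is a simplified two-step illustration that assumes each paper already has an estimated topic share. The field names (`funded`, `access_share`) and all the numbers are invented for illustration and are not taken from our dataset.

```python
from collections import defaultdict

# Hypothetical toy data: each paper has an estimated share of one topic
# (labelled "access" here) and a covariate recording whether funding was
# reported. The values are invented for illustration only.
papers = [
    {"funded": True, "access_share": 0.30},
    {"funded": True, "access_share": 0.20},
    {"funded": False, "access_share": 0.10},
    {"funded": False, "access_share": 0.05},
]


def mean_prevalence_by_group(rows, covariate, topic_key):
    """Average a topic's share within each level of a covariate."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[covariate]] += row[topic_key]
        counts[row[covariate]] += 1
    return {level: totals[level] / counts[level] for level in totals}


# Description: how prevalent is the topic in each group?
prevalence = mean_prevalence_by_group(papers, "funded", "access_share")

# Explanation (in a very loose sense): how does prevalence differ
# between funded and unfunded papers?
difference = prevalence[True] - prevalence[False]
print(prevalence, round(difference, 3))
```

A full structural topic model would also attach uncertainty to this difference and adjust for the other factors (time, region, collaboration type, journal) at the same time.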

Structural topic modelling is a machine learning technique that examines the content of provided text to learn patterns in word usage without human supervision in a replicable and transparent way (Mohr & Bogdanov, 2013). This powerful technique expands the methodological repertoire of higher education research. On one hand, computational methods make it possible to extract meaning from large datasets; on the other, they allow the prediction of emerging topics by integrating the strengths of both quantitative and qualitative approaches. Nevertheless, many scholars in HE remain reluctant to engage with such methods, reflecting a degree of methodological conservatism or tunnel vision (see Huisman and Daenekindt’s SRHE blog post).

In this blog post, our intention is not to go deep into the minute details of this methodological technique, but to share a glimpse of our main findings obtained with it. Our corpus comprised all papers published between 2000 and 2021 in the top six generalist journals of higher education, as listed by both Cantwell et al (2022) and Kwiek (2021), giving a dataset of 6,562 papers. As a result, we identified 15 emergent research topics and several major patterns that highlight the thematic changes over the last decades. Below, we share some of our findings, accompanied by relevant visualisations.

Glimpse at the main findings with relevant visuals

The emergent 15 higher education topics and three visibly rising ones

Our topic modelling analysis revealed 15 distinct topics, which are largely in line with the topics discussed in previous studies in this line of work (eg Teichler, 1996; Tight, 2003; Horta & Jung, 2014). However, there are added nuances in our analysis. For example, the most prevalent topics are policy and teaching/learning, which are widely acknowledged in the field, but new themes have emerged and strengthened over time. These themes include identity politics and discrimination, access, and employability. These areas, conceptually linked to social justice, have become central to higher education research, especially in US-based journals but not limited to them. The visual below demonstrates the changes over the years for all 15 topics.

The Influence of funding on higher education research topics

Research funding plays a crucial role in shaping certain topics, particularly gender inequality, access, and doctoral education. Studies that received funding exhibited a higher prevalence of these socially significant topics, underscoring the importance of targeted funding to support research with social impact. The data visualisation below summarises the influence of reported funding for each topic. The novelty of this pattern needs to be highlighted: we have not come across a previous study examining whether the presence of funding influences research topics in the higher education field.

The impact of collaboration on higher education research topics

Collaborative publications are more prevalent in topics such as teaching and learning, and diversity and social relations. By contrast, theoretical discussions, identity politics, policy, employability, and institutional management are more common in solo-authored papers. This pattern aligns with the nature of these topics and the data requirements for research. Please see the visualised data below.

We highlight that although the relationship between collaboration and citation impact or researcher productivity is well studied, we are not aware of any evidence of the effect of collaboration patterns on topic prevalence, particularly in studies focusing on higher education. So, this finding is a novel contribution to higher education research.

Higher education journals’ topic preferences

Although the six leading journals claim to be generalist, our analysis shows they have differing publication preferences. For example, Higher Education focuses on policy and university governance, while Higher Education Research and Development stands out for teaching/learning and indigenous knowledge. Journal of Higher Education and Review of Higher Education, two US-based journals, have the highest prevalence of identity politics and discrimination topics. Last, Studies in Higher Education has a significantly higher prevalence in teaching and learning, theoretical discussions, doctoral education, and emotions, burnout and coping than most of the journals.

Regional differences in higher education research topics

Topic focus varies significantly by the region of the first author. First, studies from Asia exhibit the highest prevalence of academic work and institutional management. Studies from Africa show a higher prevalence of identity politics and discrimination. Moreover, studies published by first authors from Eastern European countries stand out with a higher prevalence of employability. Lastly, the policy topic has a high prevalence across all regions. However, studies with first authors from Asia, Eastern Europe, Africa, and Latin America and the Caribbean showed a higher prevalence of policy research in higher education than those from North America and Western Europe. By contrast, indigenous knowledge is most prominent in Western Europe (including Australia and New Zealand). The figure below shows these patterns visually.

Concluding remarks

Higher education research has grown and diversified dramatically over the past two decades. The field is now established globally, with an ever-expanding array of topics and contributors. In this blog post, we shared the results of our analysis in relation to the influence of targeted funding, collaborative practices, regional differences, and journal preferences on higher education research topics. We have also indicated that certain topics have risen in prevalence in the last two decades. More patterns are included in the main research study published in Review of Education.

It is important to note that we could only include the higher education papers published up to 2021, the latest available data year when we started the analyses. The impact of generative artificial intelligence and recent major shifts in global geopolitics, including the new DEI policies in the US and the broader tendency towards the securitisation of science, may not be fully reflected in this dataset. These themes are very recent, and future studies, including replications with similar approaches, may help reveal newly emerging patterns.

Dr Yusuf Oldac is an Assistant Professor in the Department of Education Policy and Leadership at The Education University of Hong Kong. He holds a PhD degree from the University of Oxford, where he received a full scholarship. Dr Oldac’s research spans international and comparative higher education, with a current focus on global science and knowledge production in university settings.

Dr Francisco Olivos obtained his PhD in Sociology from The Chinese University of Hong Kong. He joined Lingnan University in August 2021. His research lies in the intersections between cultural sociology, social stratification, and subjective well-being, using quantitative and computational methods.

by Stijn Daenekindt

Together with Jeroen Huisman, I recently published an article in which we mapped the field of research on higher education. In a previous blogpost we reflected on some key findings, but only briefly mentioned the method we used to analyze the abstracts of 16,928 research articles (which total over 2 million words). Obviously we did not read all these texts ourselves. Instead, we applied automated text analysis. In the current blogpost, I will discuss this method to highlight its potential for higher education research.

Automated text analysis holds tremendous potential for research into higher education. This is because higher education institutions (ie our research subjects) ‘live’ in a world that is dominated by the written word. Much of what happens in and around higher education institutions eventually gets documented. Indeed, higher education institutions produce an enormous amount and variety of texts, eg grant proposals, peer reviews and rejection letters, academic articles and books, course descriptions, mission statements, commission reports, evaluations of departments and universities, policy reports, etc. Obviously, higher education researchers are aware of the value of these documents and they have offered a lot of insightful case studies by closely reading such documents. However, for some types of research questions, analysing a small sample of texts just doesn’t do the job. When we want to analyse huge amounts of text data, which are unfeasible for close reading by humans, automated text analysis can help us.

There are various forms of automated text analysis. One of the most popular techniques is topic modelling. This machine learning technique is able to automatically extract clusters of words (ie topics). A topic model analyses patterns of word co-occurrence in documents to reveal latent themes. Two basic principles underlie a topic model. The first is that each document consists of a mixture of topics. So, imagine that we have a topic model that differentiates two topics, then document A could consist of 20% topic 1 and 80% topic 2, while document B might consist of 50% topic 1 and 50% topic 2. The second principle of topic modelling is that every topic is a mixture of words. Imagine that we fit a topic model on every edition of a newspaper over the last ten years. A first possible topic could include words such as ‘goal’, ‘score’, ‘match’, ‘competition’ and ‘injury’. A second topic, then, could include words such as ‘stock’, ‘dow_jones’, ‘investment’, ‘stock_market’ and ‘wall_street’. The model can identify these clusters of words, because they often co-occur in texts. That is, it is far more likely that the word ‘goal’ co-occurs with the word ‘match’ in a document than that it co-occurs with the word ‘dow_jones’.
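The two principles can be combined in a few lines of Python. The sketch below is only an illustration under simple assumptions: the document mixtures are the 20/80 and 50/50 splits from the example above, while the three-word vocabulary and the word probabilities are invented toy numbers, not the output of a fitted model.

```python
# Principle 2: every topic is a mixture of words.
# Toy probabilities; each topic's values sum to 1.
topics = {
    "topic_1": {"goal": 0.5, "match": 0.4, "stock": 0.1},
    "topic_2": {"goal": 0.1, "match": 0.1, "stock": 0.8},
}

# Principle 1: every document is a mixture of topics.
documents = {
    "A": {"topic_1": 0.2, "topic_2": 0.8},
    "B": {"topic_1": 0.5, "topic_2": 0.5},
}


def word_probability(doc_mixture, topic_words, word):
    """Probability of a word in a document: the topic shares of the
    document weighted by each topic's probability for that word."""
    return sum(
        share * topic_words[topic].get(word, 0.0)
        for topic, share in doc_mixture.items()
    )


for name, mixture in documents.items():
    print(name, round(word_probability(mixture, topics, "stock"), 2))
```

Document A leans heavily on topic 2, so ‘stock’ is more likely to appear in it than in document B. Fitting a topic model runs this logic in reverse: it starts from the observed words and infers both sets of mixtures.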

Topic models allow us to reveal the structure of large amounts of textual data by identifying topics. Topics are basically a set of words. More formally, topics are expressed as a set of word probabilities. To learn what the latent theme is about we can order all the words in decreasing probability. The two illustrative topics (see previous paragraph) clearly deal with the general themes ‘sports’ and ‘financial investments’. In this way, what topic models do with texts actually closely resembles what exploratory factor analysis does with survey data, ie revealing latent dimensions that structure the data. But how is the model able to find interpretable topics? As David Blei explains, and this may help to get a more intuitive understanding of the method, topic models trade off two goals: (a) the model tries to assign the words of each document to as few topics as possible, and (b) the model tries, in each topic, to assign high probability to as few words as possible. These goals are at odds. For example, if the model allocates all the words of one document to one single topic, then (b) becomes unrealistic. If, on the other hand, every topic consists of just a few words, then (a) becomes unrealistic. It is by trading off both goals that the topic model is able to find interpretable sets of tightly co-occurring words.
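Ordering the words of a topic by decreasing probability is simple to sketch. The function name `top_words` and the probabilities below are hypothetical, chosen only to mirror the ‘sports’ example.

```python
def top_words(word_probs, n=3):
    """Return the n most probable words of a topic, highest first."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Toy topic expressed as word probabilities (invented values, sum to 1).
sports_topic = {
    "goal": 0.30,
    "score": 0.25,
    "match": 0.20,
    "injury": 0.15,
    "stock": 0.10,
}

print(top_words(sports_topic))
```

Reading the top of such a list is how a researcher decides what label (here, ‘sports’) to give a latent theme. The label is an interpretation, not something the model produces.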

Topic models focus on the co-occurrence of words in texts. That is, they model the probability that a word co-occurs with another word anywhere in a document. To the model, it does not matter if ‘score’ and ‘match’ are used in the same sentence in a document or if one is used in the beginning of the document while the other one is used at the end. This puts topic modelling in the larger group of ‘bag-of-words approaches’, a group of methods that treat documents as …well … bags of words. Ignoring word order is a way to simplify and reduce the text, which yields various nice statistical properties. On the other hand, this approach may result in the loss of meaning. For example, the sentences ‘I love teaching, but I hate grading papers’ and ‘I hate teaching, but I love grading papers’ obviously have different meanings, but this is ignored by bag-of-words techniques.
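The loss of word order can be checked directly. The short Python sketch below uses a deliberately simple tokenizer (lowercasing and keeping runs of letters), which is an assumption made for illustration. Real pipelines usually do more preprocessing.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, discarding word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


sentence_1 = "I love teaching, but I hate grading papers"
sentence_2 = "I hate teaching, but I love grading papers"

# Both sentences contain exactly the same words the same number of times,
# so their bags of words are indistinguishable.
same_bag = bag_of_words(sentence_1) == bag_of_words(sentence_2)
print(same_bag)  # True
```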

So, while bag-of-words techniques are very useful to classify texts and to understand what the texts are about, the results will not tell us much about how topics are discussed. Other methods from the larger set of methods of automated text analysis are better equipped for this. For example, sentiment analysis allows one to analyze opinions, evaluations and emotions. Another method, word embedding, focusses on the context in which a word is embedded. More specifically, the method finds words that share similar contexts. By subsequently inspecting a word’s nearest neighbors (ie the words that often occur in the neighborhood of our word of interest) we get an idea of what that word means in the text. These are just a few examples of the wide range of existing methods of automated text analysis and each of them has its pros and cons. Choosing between them ultimately comes down to finding the optimal match between a research question and a specific method.
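The idea of words sharing similar contexts can be sketched with plain co-occurrence counts. This is a simplification: actual word embedding methods learn dense vectors from very large corpora, whereas the sketch below counts which words appear in the same sentence in a four-sentence toy corpus invented for illustration. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus; each string is treated as one context window.
corpus = [
    "students enjoy the lecture",
    "students enjoy the seminar",
    "staff review the budget",
    "staff review the funding",
]


def context_vectors(sentences):
    """For each word, count the other words that share a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(word, vectors, n=1):
    """Words whose contexts are most similar to those of `word`."""
    scores = [
        (other, cosine(vectors[word], vectors[other]))
        for other in vectors
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:n]]


vectors = context_vectors(corpus)
print(nearest_neighbors("lecture", vectors))
```

In this toy corpus ‘lecture’ and ‘seminar’ never occur together, yet they end up as nearest neighbors because they appear in identical contexts.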

More collections of electronic text are becoming available every day. These massive collections of texts present massive opportunities for research on higher education, but at the same time they present us with a problem: how can we analyze these? Methods of automated text analysis can help us to understand these large collections of documents. These techniques, however, do not replace humans and close reading. Rather, these methods are, as aptly phrased by Justin Grimmer and Brandon Stewart, ‘best thought of as amplifying and augmenting careful reading and thoughtful analysis’. When using automated text analysis in this way, the opportunities are endless and I hope to see higher education researchers embrace these opportunities (more) in the future.

Stijn Daenekindt is a Postdoctoral Researcher at Ghent University (Department of Sociology). He has a background in sociology and in statistics and has published in various fields of research. Currently, he works at the Centre for Higher Education Governance Ghent. You can find an overview of his work at his Google Scholar page.

The Society for Research into Higher Education

A topic modelling analysis of higher education research published between 2000 and 2021

The potential of automated text analysis for higher education research