198844 VU Datenanalyse II: Fortgeschrittene computergestützte Textanalyse für Sozialwissenschaftler

Winter semester 2024/2025 | As of: 06.08.2024
VU 3
5
weekly
annually
English

This course introduces social scientists to advanced, deep learning-based quantitative text analysis methods.

Participants of this course will learn about the conceptual motivation and methodological foundations of text embedding methods and large neural language models (LLMs).

Moreover, they will gather plenty of practical experience with applying these methods in social science research using the Python programming language.

In addition to conveying a solid conceptual understanding of these methods and hands-on experience with applying them, the course puts a strong emphasis on introducing and discussing social science use cases as well as ethical considerations.

Having successfully completed this module, students will

- know and understand the methodological foundations of text embedding methods, transfer learning, Transformers, and large language models (LLMs)

- know and understand typical applications of these methods in computational social science research

- be able to apply these methods to analyze text data for social-scientific purposes

- be able to reflect critically on the application of these techniques in social science research

This course delves into advanced computational text analysis methods and the deep machine learning techniques that enable them.

People, media outlets, political parties, and corporations express their views, priorities, and objectives in public, contributing to an ever-growing amount of digitized text data.

When analyzed using quantitative text analysis methods, these communications offer researchers a unique perspective on social, political, and economic dynamics, facilitating the study of cultural norms, public opinion, media framing, political competition, international conflicts, and more.

However, traditional quantitative approaches like the "bag-of-words" method often struggle to capture the nuanced and complex ways in which humans communicate, limiting researchers' ability to accurately measure social science concepts with textual data.

Advancements in deep learning, text embedding, and neural language modeling help overcome these challenges and are hence increasingly adopted by computational social science researchers and analysts.

Participants in this course will explore recent developments in quantitative text analysis, including the application of large language models (LLMs) and generative Artificial Intelligence (AI) such as ChatGPT.

They will gain an overview of the expanding research that applies these methods to address significant questions in the social sciences.

Module I: Quantitative text analysis basics 

  • bag-of-words text pre-processing (“tokenization”) and representation (how to represent documents as word count vectors)
  • conceptual overview of common tasks (clustering, topic modeling, classification)
  • limitations and shortcomings of the bag-of-words approach for computational text analysis
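
As a minimal sketch of the bag-of-words representation outlined in this module, the following pure-Python example tokenizes two toy sentences and turns them into word count vectors over a shared vocabulary. The helper names (`tokenize`, `bag_of_words`) and the example sentences are illustrative assumptions, not course material. The example also hints at one limitation: because word order is discarded, two sentences with opposite meanings can receive identical vectors.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return a sorted vocabulary and one word count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for words that do not occur in a document
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors


docs = ["The party attacks the elite.", "The elite attacks the party."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['attacks', 'elite', 'party', 'the']
print(vectors)  # [[1, 1, 1, 2], [1, 1, 1, 2]]
```

Both sentences map to the same count vector even though they describe who attacks whom in opposite ways.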

Module II: Word and text embedding

  • conceptual foundations and deep learning background
  • applications in computational social science research
  • Python implementation with `gensim`
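
Embedding methods represent words as dense numeric vectors, and the similarity between two words is commonly measured as the cosine of the angle between their vectors. The sketch below illustrates this with hand-made three-dimensional toy vectors. The vectors and words are invented for illustration only; real embeddings, such as those trained with `gensim`, are learned from text and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration (not learned from data)
toy_vectors = {
    "parliament": [0.9, 0.1, 0.0],
    "government": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["parliament"], toy_vectors["government"])
sim_unrelated = cosine_similarity(toy_vectors["parliament"], toy_vectors["banana"])
print(sim_related > sim_unrelated)  # True
```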

Module III: Transformers 

  • conceptual motivation and methodological background
  • applications in computational social science research
  • transfer learning with Transformers
  • fine-tuning for text classification with the `transformers` Python library
  • clustering and topic modeling with sentence transformers and the `BERTopic` Python library

Module IV: Large Language Models

  • motivation and methodological background 
  • current applications in computational social science research
  • prompting LLMs to automate classic text analysis tasks in Python
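
To give a rough idea of what prompting an LLM for a classic text analysis task can look like, the sketch below builds a zero-shot sentiment classification prompt and maps the model's free-text answer back onto a fixed label set. Everything here is an assumption made for illustration: the label set, the prompt wording, and in particular `fake_llm`, a hypothetical stand-in that returns a fixed string instead of calling a real model or API.

```python
LABELS = ["positive", "negative", "neutral"]


def build_prompt(text, labels):
    """Assemble a zero-shot classification prompt for a single document."""
    options = ", ".join(labels)
    return (
        f"Classify the sentiment of the following text as one of: {options}.\n"
        f"Text: {text}\n"
        "Answer with the label only."
    )


def parse_label(response, labels, fallback="unknown"):
    """Map a free-text model response onto one of the allowed labels."""
    cleaned = response.strip().lower().rstrip(".")
    return cleaned if cleaned in labels else fallback


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; always returns the same answer."""
    return " Negative.\n"


prompt = build_prompt("The reform was a complete failure.", LABELS)
label = parse_label(fake_llm(prompt), LABELS)
print(label)  # negative
```

Separating prompt construction from response parsing makes it easier to validate model outputs and to handle answers that fall outside the label set.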

The course consists of lectures complemented by practical sessions and in-class group work. Assessment is based on take-home assignments and a final group project (see the Assessment section below).

All code examples, exercises, assignments, and group work in the course will be implemented in Python.

Introductory materials for (i) learning or refreshing Python programming basics, (ii) transitioning from R to Python, and (iii) installing and setting up Python will be provided by the instructor at the beginning of the course.

Please refer to the Prerequisites section for details.

The instructor will provide data and code templates for all in-class exercises and data and example solutions for take-home assignments.

Assessment will be based on 

  1. active participation in in-class discussions and group work throughout the course (contributing 10% to the final grade)
  2. two take-home assignments at the end of Modules II and III, respectively, of which participants must complete and pass one (contributing 30% to the final grade)
  3. a final group project (short presentation of project aims and planned approach mid-January + written report, contributing 60% to the final grade)

The take-home assignments will assess participants' ability to

  • apply and implement the learned methods to produce quantitative insights addressing the respective assignment's problem and task description
  • competently present and describe their findings
  • critically evaluate their findings

In the final group project, participants will collaborate in teams of 2-4 people.

In justified cases, the instructor reserves the right to grant individual participants the possibility to complete the final project on their own.

The final group project will assess participants' ability to

  • identify a social-scientific research question that can be addressed with the learned methods
  • develop and implement an original application that addresses their research question
  • justify their methodological choices based on the content covered in the course lectures and mandatory readings
  • competently present and describe their findings
  • critically evaluate potential limitations of their approach and findings 

The overall assessment will take into account 

  • the level of methods knowledge and understanding acquired by the students throughout the course
  • their capacity for thinking creatively, analytically, and critically
  • their capacity to design and evaluate solutions for concrete social-scientific research problems
  • their capacity to present findings and conclusions effectively

The full list of references will be made available in the syllabus before the course.

Quantitative text analysis viewed from various disciplinary perspectives

  • communication science: Boumans, J. W., & Trilling, D. (2016). Taking Stock of the Toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8–23. DOI:10.1080/21670811.2015.1096598
  • economics: Gentzkow, M., Kelly, B., & Taddy, M. (2019). Text as Data. Journal of Economic Literature, 57(3), 535-574. DOI:10.1257/jel.20181020
  • political psychology: Schoonvelde, M., Schumacher, G., & Bakker, B. N. (2019). Friends With Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology. Journal of Social and Political Psychology, 7(1), 124-143. DOI:10.5964/jspp.v7i1.964
  • political science: Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. DOI:10.1093/pan/mps028
  • sociology: Bonikowski, B., & Nelson, L. K. (2022). From Ends to Means: The Promise of Computational Text Analysis for Theoretically Driven Sociological Research. Sociological Methods & Research, 51(4), 1469-1483. DOI:10.1177/00491241221123088

Primers, textbooks and applications

  • Van Atteveldt, W., Trilling, D. & Arcila, C. (2021). Computational Analysis of Communication: A practical introduction to the analysis of texts, networks, and images with code examples in Python and R, Wiley-Blackwell, Chapters 9-11. https://cssbook.net/
  • Nielbo, K. L., et al. (2024). Quantitative text analysis. Nat Rev Methods Primers, 4. DOI:10.1038/s43586-024-00302-w
  • Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition. https://web.stanford.edu/~jurafsky/slp3/
  • Wankmüller, S. (2022). Introduction to Neural Transfer Learning With Transformers for Social Science Text Analysis. Sociological Methods & Research, 0(0). DOI:10.1177/00491241221134527
  • Garg, N., et al. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635-E3644. DOI:10.1073/pnas.1720347115
  • Licht, H., et al. (2024). Measuring and understanding parties' anti-elite strategies. The Journal of Politics, 0(0). DOI:10.1086/730711
  • O'Hagan, S. & Schein, A. (2023). Measurement in the Age of LLMs: An Application to Ideological Scaling. ArXiv:2312.09203
  • Basic knowledge of Python 
    • creating and manipulating strings, lists and dictionaries
    • interacting with objects and methods
    • reading and manipulating data frames with pandas
    • using loops
    • defining new functions
  • recommended (helpful but not required): Basic knowledge of quantitative research methods
    • basic understanding of matrix algebra
    • understanding of linear and logistic regression analysis
Therefore, it is recommended to take this course after Introduction to Programming: Programming in Python, and ideally also after Computer Programming Prerequisites and Data Analysis I: Data Analytics.

Students who are further along in completing the Minor Digital Science are given priority.

see dates
  • SDG 4 - Quality Education: Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all
Gruppe 0
Datum Uhrzeit Ort
Do 03.10.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 10.10.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 17.10.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 24.10.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 31.10.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 07.11.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 14.11.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 21.11.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 28.11.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 05.12.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 12.12.2024
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 09.01.2025
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 16.01.2025
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 23.01.2025
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Do 30.01.2025
15.00 - 17.45 UR 2 (Sowi) UR 2 (Sowi) Barrierefrei
Group | Registration period
198844-0 | 01.09.2024 00:00 - 21.09.2024 23:59
Licht H.