About Language & AI: an interview with Christine de Kock

Submitted by Jane Durkin on Wed, 11/08/2021 - 10:28

Artificial Intelligence (AI) is an increasingly central aspect of language science research, encompassing many areas: from digital humanities and corpus linguistics, through NLP applications like speech recognition and chat bots, to the use of machine learning to model human cognition.

Cambridge University is a world-leading centre for language and AI research. In this series of interviews, we talk to researchers from across Cambridge about their work in this field.

Christine de Kock is a second-year PhD candidate in the Department of Computer Science and Technology’s Natural Language and Information Processing Group.

Her research focuses on constructive disagreement in online conversations.

Together with her supervisor, Andreas Vlachos, she has developed a corpus of disagreements extracted from Wikipedia Talk pages called WikiDisputes.

She is currently working on the Language Sciences Incubator Fund project "Empirical evaluation of Graham's hierarchy of disagreement", which combines expertise in psychology, linguistics and NLP to examine the role of language in conflict resolution in online contexts.

Prior to coming to Cambridge, Christine worked as a data scientist for Media 24, a major South African media company, where she developed a news recommendation system and worked on hate speech in the company's comment platforms.

She has also recently completed an internship at Wikipedia, studying patterns of collaboration and the issues that arise on biographies of living people.

Tell me about your research

In collaboration with my PhD supervisor Andreas Vlachos, I am essentially researching the question of how to disagree well.

There’s a lot of work at the moment on hate speech and trolling. But online disagreement per se is not something that’s well studied, even though it occurs a lot of the time.

We believe that disagreements are not necessarily a bad thing. In some cases, they can be really good and can lead to more constructive collaborations.

We're trying to get an understanding of how disagreements work. With that in mind, we hope people can be educated on better ways to disagree online.

To give an example, in our recent paper "I Beg to Differ: A study of constructive disagreement in online conversations", we used a model to predict when something will happen based on how a conversation is going.

The two things we tried to predict were the end of a conversation, so the resolution of a disagreement, and secondly, when personal attacks will happen – hate speech and things like that.

After reading, say five turns in a conversation, you can often say whether this is likely to go well or not. We are trying to train models to be able to have that sort of judgment.
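As a rough illustration of judging a conversation from its first few turns, here is a minimal Python sketch. It is not the model from the paper: the word lists, the function names and the scoring rule are all assumptions made up for this example, and a real system would learn its features from labelled data.

```python
import re
from collections import Counter

# Hypothetical marker lists, chosen only for illustration.
HEDGES = {"perhaps", "maybe", "might", "could", "think", "seems"}
SECOND_PERSON = {"you", "your", "you're", "yours"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def prefix_features(turns, k=5):
    """Count marker words in the first k turns of a conversation."""
    counts = Counter()
    for turn in turns[:k]:
        for token in tokenize(turn):
            if token in HEDGES:
                counts["hedge"] += 1
            if token in SECOND_PERSON:
                counts["second_person"] += 1
    return counts


def escalation_score(turns, k=5):
    """Toy score: positive when second-person words outnumber hedges."""
    features = prefix_features(turns, k)
    return features["second_person"] - features["hedge"]


calm = [
    "I think this could be sourced better.",
    "Perhaps, maybe we might cite the 2010 survey.",
]
heated = [
    "You always do this.",
    "Your edits are wrong and you know it.",
]
print(escalation_score(calm), escalation_score(heated))
```

Only the first k turns are read, which mirrors the idea of forming a judgment early in the conversation rather than after it has ended.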

What about WikiDisputes?

WikiDisputes is a corpus of around 7,000 disagreements that we extracted from Wikipedia Talk pages.

Every article on Wikipedia has a Talk page, which is a forum for discussion of its content. Because the entire Wikipedia platform is created collaboratively, this is a way for people who edit it to talk to each other.

Of course, frequently people do not agree on the content of a page. Then there are disputes on the Talk pages, and a ‘dispute tag’ is added to the Wikipedia page to indicate there’s something going on.

We created the WikiDisputes corpus by processing the whole edit history of Wikipedia – about 20 terabytes of text – to align the dispute tags with relevant conversations on the Talk pages and find instances that we know are disagreements.
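The alignment step could look something like the Python sketch below. It assumes, purely for illustration, that we already know when a dispute tag was added to and removed from an article, and that each Talk page thread carries the timestamps of its comments. The `Thread` class and `threads_during_dispute` function are hypothetical names, not part of the actual WikiDisputes pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Thread:
    """A Talk page thread with the posting times of its comments (assumed structure)."""
    title: str
    comment_times: list = field(default_factory=list)


def threads_during_dispute(threads, tag_added, tag_removed):
    """Return threads with at least one comment posted while the tag was present."""
    return [
        thread
        for thread in threads
        if any(tag_added <= ts <= tag_removed for ts in thread.comment_times)
    ]


threads = [
    Thread("Lead section wording", [datetime(2015, 3, 2), datetime(2015, 3, 4)]),
    Thread("Infobox image", [datetime(2014, 11, 20)]),
]
matched = threads_during_dispute(
    threads,
    tag_added=datetime(2015, 3, 1),
    tag_removed=datetime(2015, 3, 10),
)
print([thread.title for thread in matched])
```

A time-window match like this is only a first filter; deciding which of the overlapping threads is actually about the dispute would need further checks.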

Why Wikipedia?

For a language person, Wikipedia is amazing.

It's this global ongoing conversation. Also, the fact that you have different versions of the same article, improved over time, allows you to study the process of how articles get better.

A common problem in natural language processing is that a corpus is domain specific. For instance, if you train a model on a legal text, it’s all language specific to the legal profession.

Although Wikipedia does have its own particular dialect and jargon, it is not constrained to any specific topic. We believe this allows you to better study the mechanics that underlie disagreements and how people resolve them.

What inspired you to research constructive disagreements?

I think there’s something inherently interesting in reading conversations online. People often say, “I came here to read the comments”. People are interested in what others think.

Before my PhD, I worked as a data scientist for a South African media company called Media 24.

South Africa is a very polarized country and hate speech in the comment platforms was a massive problem.

We ended up with a problem where we had to shut down all comment platforms because there was so much hate speech that the company was afraid that they would be sued.

I started reading up on what other large media companies like the New York Times and the Washington Post had done to combat hate speech in their comment spaces.

I came across this whole field and realized it's a problem all over the world and no one really knows what to do about it.

So I came up with this research proposal and contacted Andreas. Luckily, he wanted to do it.

What is the potential impact of this kind of research?

There are a number of tools I can envision. Wikipedia, for instance, might be interested in having editing tools to facilitate constructive discussions on the site.

Moderation is a difficult subject. You don't want to stop people having the conversation they're having.

But while people are typing you could perhaps have a pop-up or something to give people a pause and have them consider what they're saying.

What does the future hold?

I think there's a hopeful future for women in computer science. That’s quite wonderful.

When I first started in computer science there were not a lot of women around. That’s changed over the years I've been in the field.

Ethics in NLP is also a big topic of discussion.

There was recently a lot of success with language models trained on larger amounts of data than was previously possible. The quality of the language understanding and generation is amazing.

However, recently an ethics team at Google wrote a paper about these large-scale language models and the fact that because they are trained on the text scraped from the Internet, there are all these biases encoded in the models.

That's one reason why we decided to go with disagreements on Wikipedia as opposed to Facebook: we felt that the platform is generally known for low levels of toxicity. Another reason, of course, is that Wikipedia data is open sourced.

How can we create more opportunities for interdisciplinary collaboration?

My research is quite interdisciplinary. It’s frequently referred to as computational sociolinguistics.

We're currently working on a research proposal with someone from psychology, and we've also had discussions with people from sociology before.

I think that's super important in computer science, because we are developing these models, which are meant to be used by people. It’s important to remember who the users will be.

Interdisciplinary research can be good in that way. It can inform you about the broader implications.

I've found that most of the collaborations I've had, especially between different disciplines, have come from things like meeting someone at a dinner and figuring out our research is related or talking to someone at a conference.

For this reason it's so important we have these conferences.

Unfortunately, due to COVID, that's not been happening. It's all gone online. It's great in that it allows more people to be a part of it, but I do feel we have fewer conversations that lead to collaborations because we’re not looking each other in the eye and having informal conversations about the things we are passionate about.

Image: Photo of Christine de Kock taken off the coast of the sub-Antarctic island, Marion, while on expedition with the South African National Space Agency in 2018 / 2019
