Contribution Complete Research Paper
Fürstenberghaus - F3
02 - Big Data
Who Are We Listening to? Detecting User-generated Content (UGC) on the Web
The analysis of text-based user-generated content (UGC) on the Web has become a prominent topic in recent years, in both theory and practice. Because users can now participate and comment publicly on almost any webpage, UGC is scattered across the Web and mixed with other content types such as advertising texts, product descriptions, and editorial articles. Holistic research that aims to listen to the voice of the consumer therefore needs to separate UGC from non-UGC. Unfortunately, whether a text is UGC is not a directly observable attribute of the content. Because the amount of publicly available textual data on the Web is vast and growing rapidly, manual classification is not feasible in this "big data" environment. This creates a previously unmet need to perform UGC classification automatically, to which we make three contributions. First, we show that UGC carries signals that enable humans to decide, without context, whether a text has been written by another user. Second, we show that supervised machine learning can use these signals to perform UGC classification automatically. Third, we demonstrate and evaluate the fundamental feasibility of our approach on a dataset of German-language web texts.