Data Mining and Web Mining

Data mining is the non-trivial discovery of implicit, previously unknown, and potentially useful information from data in large databases. It is therefore a core element of knowledge discovery, and the two terms are often used synonymously. Before mining, the data is integrated and cleaned so that only the relevant data is retained. Data mining presents its findings in a form that is clear not only to data mining analysts but also to domain experts, who may use them to derive actionable recommendations. Successful applications of data mining include the analysis of genetic patterns, graph mining in finance, and the analysis of consumer behavior in marketing.

The Institute of Information Systems has researched and developed a wide spectrum of data mining applications, with a focus on web applications in education, B2C retail applications, and knowledge management. One focus is the analysis of the web as "the world's largest database." In particular, we analyze and develop methods and tools for the exploratory analysis of behavioral data. Another area of interest is the transition from temporal data analysis, which still plays an important role but implicitly assumes that the described domains are stationary, to the analysis of the dynamic aspects of such data. As a rule, these data are too complicated to examine using standard time series analysis techniques.

Web mining describes the application of traditional data mining techniques to web resources, and it has driven the further development of these techniques to take the specific structures of web data into account. The analyzed web resources comprise (1) the actual web sites, (2) the hyperlinks connecting these sites, and (3) the paths that online users take on the web to reach a particular site. Web usage mining then refers to the derivation of useful knowledge from these data inputs. The gap between the content of the raw data for web usage mining and the knowledge expected to be derived from it poses a special challenge. While the input data are mostly web server logs and other primarily technical data, the desired output is an understanding of user behavior in domains such as online information search, online shopping, and online learning. This requires, on the one hand, an understanding and formal modeling of the behavior examined in the domain and, on the other, a picture of how the input data figure in these models. We are investigating "semantic web" approaches as a promising avenue for the formal and computational aspects of this goal. The content-related aspects require an understanding of behavioral theories in the investigated domains and a highly interdisciplinary research approach. The eventual presentation of the mining results to domain experts should take into account general aspects of user interface design as well as domain-specific conventions. Further, our research efforts focus on the development of visualizations as an important design element of user-oriented mining systems.
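To make the step from web server logs to user paths more concrete, the following minimal sketch groups log entries into sessions, each an ordered list of requested URLs. It is an illustration only and not the tooling used at the institute. It assumes that the logs follow the Common Log Format, that the requesting host can stand in for a user, and that a gap of more than 30 minutes starts a new session; the names parse_line, sessionize, and SESSION_TIMEOUT are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed input: Common Log Format lines such as
# 192.0.2.1 - - [10/Oct/2003:13:55:36 +0200] "GET /index.html HTTP/1.0" 200 2326
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumed heuristic: more than 30 minutes of inactivity ends a session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, url) for a log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), when, match.group("url")


def sessionize(lines, timeout=SESSION_TIMEOUT):
    """Group requests by host and split each host's requests into sessions."""
    hits_by_host = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:  # silently skip malformed lines
            host, when, url = parsed
            hits_by_host[host].append((when, url))

    sessions = []
    for host, hits in hits_by_host.items():
        hits.sort(key=lambda hit: hit[0])  # order requests by time
        current, last_time = [], None
        for when, url in hits:
            if last_time is not None and when - last_time > timeout:
                sessions.append((host, current))
                current = []
            current.append(url)
            last_time = when
        if current:
            sessions.append((host, current))
    return sessions
```

Calling sessionize on an iterable of log lines returns a list of (host, path) pairs. Such paths are still purely technical data; relating them to behavior in a domain requires the modeling described above.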

User behavior and data availability tend to change over time. The dynamics of a domain are therefore an important question in every mining analysis and in each presentation of mining results to domain experts. Most data mining algorithms treat the dataset being analyzed as a static unit. However, a dataset may change in content and/or structure over time, either because of updates or simply because the data was collected over a long period. In the case of updates, it seems sufficient to update the patterns previously discovered in the data. Most of the "incremental mining techniques" proposed for this task are based on their static counterparts and reuse information from earlier mining runs to update the patterns. Data collection over a long period creates a different situation. In this case the data undergoes only one form of update: insertions. The distribution of entities in the dataset can change on account of external and/or internal factors. As a result, the patterns may also change over time (pattern evolution). There are two types of pattern change: changes in the content of a pattern, that is, in the relationship in the data that the pattern reflects, and changes in the statistical measures of the pattern. Both types of change can significantly influence the decision process and should therefore be monitored. Pattern monitoring requires a data model with a temporal component that records the state of a specific pattern at the corresponding point in time. A second question automatically comes into play: which patterns should be monitored? Even when relatively small amounts of data are examined, the number of discovered patterns is often very large. In these cases the analyst must choose a manageable subset of the patterns.
Our research focuses on formal descriptions of pattern evolution and monitoring, the development of efficient algorithms for these tasks, and the implementation of suitable tools.
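As an illustration of the second type of pattern change, the following sketch tracks the statistical measures of a single association rule across time periods and reports the periods in which they shift noticeably. It is a simplified example under stated assumptions, not the algorithms developed in our research: transactions are modeled as sets of items, support and confidence serve as the statistical measures, and the names RuleStats, rule_stats, and monitor_rule as well as the threshold of 0.1 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleStats:
    """Statistics of one rule in one period (the temporal component)."""
    period: str
    support: float
    confidence: float


def rule_stats(period, transactions, antecedent, consequent):
    """Compute support and confidence of antecedent -> consequent in one period."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    item_sets = [frozenset(t) for t in transactions]
    n_total = len(item_sets)
    n_antecedent = sum(1 for t in item_sets if antecedent <= t)
    n_both = sum(1 for t in item_sets if (antecedent | consequent) <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return RuleStats(period, support, confidence)


def monitor_rule(periods, antecedent, consequent, threshold=0.1):
    """Return the per-period history and the periods with a notable change.

    `periods` is a list of (label, transactions) pairs in temporal order.
    A period is flagged when support or confidence differs from the previous
    period by more than `threshold` (an assumed, arbitrary cut-off).
    """
    history = [
        rule_stats(label, transactions, antecedent, consequent)
        for label, transactions in periods
    ]
    alerts = []
    for previous, current in zip(history, history[1:]):
        if (abs(current.support - previous.support) > threshold
                or abs(current.confidence - previous.confidence) > threshold):
            alerts.append(current.period)
    return history, alerts
```

The sketch recomputes the statistics from each period's transactions and covers only changes in the measures of a known rule. Detecting changes in the content of a pattern, or choosing which patterns to monitor, would need additional logic.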

The area is closely related to knowledge management, data protection, and data security. Questions from knowledge management are particularly relevant because using the web usually implies access to information and therefore the construction of knowledge. This raises a number of e-privacy questions. Data collection and data analysis practices are coming under increasing scrutiny from legislation and from technical proposals that aim either at minimizing recording or at extending it.

Researchers in this Area

Prof. Dr. Bettina Berendt
Prof. Oliver Günther, Ph.D.
Maximilian Teltzrow

Selected Publications

Baron, S., Spiliopoulou, M., Günther, O.: Efficient Monitoring of Patterns in Data Mining Environments. In Proc. Seventh East-European Conference on Advances in Databases and Information Systems (ADBIS 2003), Dresden, Germany. Springer, 2003

Berendt, B.: Using site semantics to analyze, visualize, and support navigation. Data Mining and Knowledge Discovery, 6, 37-59, 2002

Berendt, B., Brenstein, E.: Visualizing Individual Differences in Web Navigation: STRATDYN, a Tool for Analyzing Navigation Patterns. Behavior Research Methods, Instruments, & Computers, 33, 243-257, 2001

Berendt, B., Spiliopoulou, M.: Analyzing navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9, 56-75, 2000

Spiliopoulou, M., Pohle, C., Teltzrow, M.: Modelling Web Site Usage with Sequences of Goal-Oriented Tasks. In Proc. Multikonferenz Wirtschaftsinformatik, in: E-Commerce - Netze, Märkte, Technologien, Physica-Verlag, Heidelberg, 2002

Humboldt-Universität zu Berlin - School of Business and Economics