Comparative analysis of system logs and streaming data anomaly detection algorithms

Authors: Ліщитович, А.А.; Павленко, В.В.; Шматок, О.О.; Фіненко, Ю.Ю.
Title (Ukrainian): Порівняльний аналіз системних журналів і потокових даних алгоритми виявлення аномалій
Institution: Київський національний університет імені Тараса Шевченка (Taras Shevchenko National University of Kyiv)
Year: 2020
Date issued: 2020-02-01
Repository record date: 2026-03-17
Language: uk
Type: Article
Keywords: detection of anomalies; system logs; decision tree; clustering; data analysis; hierarchical temporal memory
Citation (APA 7): Ліщитович, А., Павленко, В., Шматок, О., & Фіненко, Ю. (2020). Comparative analysis of system logs and streaming data anomaly detection algorithms. Безпека інформаційних систем і технологій, (1(2)), 50–59. https://doi.org/10.17721/ISTS.2020.1.50-59
Citation (ДСТУ): Comparative analysis of system logs and streaming data anomaly detection algorithms / А. Ліщитович et al. Безпека інформаційних систем і технологій. 2020. no. 1(2). P. 50–59. DOI: 10.17721/ISTS.2020.1.50-59 (date of access: 17.07.2026).
UDC: 004.8, 004.62, 004.93
DOI: 10.17721/ISTS.2020.1.50-59
URI: https://ir.library.knu.ua/handle/15071834/12635
License: Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/)

This paper describes and compares several commonly used approaches to analyzing system logs and streaming data, generated in large volumes by company IT infrastructure, with unattended anomaly detection. The importance of anomaly detection is dictated by the growing cost of system downtime caused by events that could have been predicted from log entries reporting abnormal data. Anomaly detection systems are built using a standard workflow of data collection, parsing, information extraction, and detection steps.
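The four-step workflow named in the abstract (collection, parsing, information extraction, detection) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the log line layout (`<session_id> <message>`), the masking rules, the sample messages, and above all the detection rule, which is a simple placeholder (flag any session containing an event template not in a known-normal set) standing in for the models the paper actually compares.

```python
import re
from collections import Counter, defaultdict

# Hypothetical log layout: "<session_id> <free-text message>".
BLOCK_RE = re.compile(r"blk_-?\d+")  # HDFS-style block identifiers
NUM_RE = re.compile(r"\d+")          # any remaining numeric field


def parse(line):
    """Parsing: split off the session id and mask variable fields
    so that the message collapses to an event template."""
    session_id, message = line.split(" ", 1)
    template = NUM_RE.sub("<NUM>", BLOCK_RE.sub("<BLK>", message.strip()))
    return session_id, template


def extract(lines):
    """Information extraction: build one event-count vector per session."""
    sessions = defaultdict(Counter)
    for line in lines:
        session_id, template = parse(line)
        sessions[session_id][template] += 1
    return dict(sessions)


def detect(sessions, known_templates):
    """Detection (placeholder rule, not the paper's method): flag sessions
    that contain at least one template outside the known-normal set."""
    return sorted(
        sid for sid, counts in sessions.items()
        if set(counts) - known_templates
    )


# Collection: in practice lines would be read from files or a log stream.
collected = [
    "s1 Receiving block blk_123 size 4096",
    "s1 Served block blk_123",
    "s2 Receiving block blk_456 size 1024",
    "s2 Exception while serving blk_456",
]
known = {"Receiving block <BLK> size <NUM>", "Served block <BLK>"}

sessions = extract(collected)
anomalous = detect(sessions, known)
print(anomalous)
```

In a fuller system the `detect` step is where a trained model (for example a decision tree or clustering over the count vectors) would replace the set-difference rule; the earlier steps stay largely the same.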
Detection of abnormal system behavior plays an important role in large-scale incident management systems. Timely detection allows IT administrators to identify issues quickly and resolve them immediately, which dramatically reduces system downtime. Most IT systems generate logs with detailed information about their operations, so logs are an ideal data source for anomaly detection solutions. The volume of logs makes manual analysis impossible and calls for automated approaches.

Most of the document concerns the anomaly detection step and algorithms such as regression, decision tree, SVM, clustering, principal component analysis, invariants mining, and the hierarchical temporal memory (HTM) model. Model-based anomaly detection algorithms and HTM algorithms were used to process the HDFS, BGL, and NAB datasets, with ~16 million log messages and ~365k streaming data points. The data was manually labeled to enable model training and accuracy calculation. According to the results, supervised anomaly detection systems achieve high precision but require significant training effort, while the HTM-based algorithm shows the highest detection precision with zero training.
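Since the comparison in the abstract is stated in terms of detection precision on manually labeled data, a short sketch of how such a score can be computed may help. The function name, the 0/1 label encoding, and the sample labels below are illustrative assumptions, not data or code from the paper; recall and F1 are included only because they are commonly reported alongside precision.

```python
def precision_recall_f1(labels, predictions):
    """Compare predicted anomaly flags with manual labels (1 = anomaly)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # anomalies correctly flagged
    fp = sum(1 for y, p in pairs if not y and p)    # normal items wrongly flagged
    fn = sum(1 for y, p in pairs if y and not p)    # anomalies that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for six log sessions, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(precision, recall, f1)
```

Precision alone rewards detectors that raise few false alarms; a detector that flags very little can score well on it while missing many anomalies, which is why recall is usually reported next to it.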