Archives for 2019

You are browsing the site archives by date.

Deflect Labs Report #6: Phishing and Web Attacks Targeting Uzbek Human Right Activists and Independent Media

Key Findings

  • We’ve discovered infrastructure used to launch and coordinate attacks targeting independent media and human rights activists from Uzbekistan
  • The campaign has been active since early 2016, using web and phishing attacks to suppress and exploit their targets
  • We have no evidence of who is behind this campaign but the target list points to a new threat actor targeting Uzbek activists and media

Introduction

The Deflect project was created to protect civil society websites from web attacks, following the publication of “Distributed Denial of Service Attacks Against Independent Media and Human Rights Sites report by the Berkman Center for Internet & Society. During that time we’ve investigated many DDoS attacks leading to the publication of several reports.

The attacks leading to the publication of this report quickly stood out from the daily onslaught of malicious traffic on Deflect, at first because they were using professional vulnerability scanning tools like Acunetix. The moment we discovered that the origin server of these scans was also hosting fake gmail domains, it became evident that something bigger was going on here.

In this report, we describe all the pieces put together about this campaign, with the hope to contribute to public knowledge about the methods and impact of such attacks against civil society.

Context : Human Rights and Surveillance in Uzbekistan

Emblem of Uzbekistan (wikipedia)

Uzbekistan is defined by many human-rights organizations as an authoritarian state, that has known strong repression of civil society. Since the collapse of the Soviet Union, two presidents have presided over a system that institutionalized torture and repressed freedom of expression, as documented over the years by Human Rights Watch, Amnesty International and Front Line Defenders, among many others. Repression extended to media and human rights activists in particular, many of whom had to leave the country and continue their work in diaspora.

Uzbekistan was one of the first to establish a pervasive Internet censorship infrastructure, blocking access to media and human rights websites. Hacking Team servers in Uzbekistan were identified as early as 2014 by the Citizen Lab. It was later confirmed that Uzbek National Security Service (SNB) were among the customers of Hacking Team solutions from leaked Hacking Team emails. A Privacy International report from 2015 describes the installation in Uzbekistan of several monitoring centers with mass surveillance capabilities provided by the Israeli branch of the US-based company Verint Systems and by the Israel-based company NICE Systems. A 2007 Amnesty International report entitled ‘We will find you anywhere’ gives more context on the utilisation of these capabilities, describing digital surveillance and targeted attacks against Uzbek journalists and human-right activists. Among other cases, it describes the unfortunate events behind the closure of uznews.net – an independent media website established by Galima Bukharbaeva in 2005 following the Andijan massacre. In 2014, she discovered that her email account had been hacked and information about the organization, including names and personal details journalists in Uzbekistan was published online. Galima is now the editor of Centre1, a Deflect client and one of the targets of this investigation.

A New Phishing and Web Attack Campaign

On the 16th of November 2018, we identified a large attack against several websites protected by Deflect. This attack used several professional security audit tools like NetSparker and WPScan to scan the websites eltuz.com and centre1.com.


Peak of traffic during the attack (16th of November 2018)

This attack was coming from the IP address 51.15.94.245 (AS12876 – Online AS but an IP range dedicated to Scaleway servers). By looking at older traffic from this same IP address, we found several cases of attacks on other Deflect protected websites, but we also found domains mimicking google and gmail domains hosted on this IP address, like auth.login.google.email-service[.]host or auth.login.googlemail.com.mail-auth[.]top. We looked into passive DNS databases (using the PassiveTotal Community Edition and other tools like RobTex) and crossed that information with attacks seen on Deflect protected websites with logging enabled. We uncovered a large campaign combining web and phishing attacks against media and activists. We found the first evidence of activity from this group in February 2016, and the first evidence of attacks in December 2017.

The list of Deflect protected websites chosen by this campaign, may give some context to the motivation behind them. Four websites were targeted:

  • Fergana News is a leading independent Russian & Uzbek language news website covering Central Asian countries
  • Eltuz is an independent Uzbek online media
  • Centre1 is an independent media organization covering news in Central Asia
  • Palestine Chronicle is a non-profit organization working on human-rights issues in Palestine

Three of these targets are prominent media focusing on Uzbekistan. We have been in contact with their editors and several other Uzbek activists to see if they had received phishing emails as part of this campaign. Some of them were able to confirm receiving such messages and forwarded them to us. Reaching out further afield we were able to get confirmations of phishing attacks from other prominent Uzbek activists who were not linked websites protected by Deflect.

Palestine Chronicle seems to be an outlier in this group of media websites focusing on Uzbekistan. We don’t have a clear hypothesis about why this website was targeted.

A year of web attacks against civil society

Through passive DNS, we identified three IPs used by the attackers in this operation :

  • 46.45.137.74 was used in 2016 and 2017 (timeline is not clear, Istanbul DC, AS197328)
  • 139.60.163.29 was used between October 2017 and August 2018 (HostKey, AS395839)
  • 51.15.94.245 was used between September 2018 and February 2019 (Scaleway, AS12876)

We have identified 15 attacks from the IPs 139.60.163.29 and 51.15.94.245 since December 2017 on Deflect protected websites:

Date IP Target Tools used
2017/12/17 139.60.163.29 eltuz.com WPScan
2018/04/12 139.60.163.29 eltuz.com Acunetix
2018/09/15 51.15.94.245 www.palestinechronicle.com eltuz.com www.fergana.info and uzbek.fergananews.com Acunetix and WebCruiser
2018/09/16 51.15.94.245 www.fergana.info Acunetix
2018/09/17 51.15.94.245 www.fergana.info Acunetix
2018/09/18 51.15.94.245 www.fergana.info NetSparker and Acunetix
2018/09/19 51.15.94.245 eltuz.com NetSparker
2018/09/20 51.15.94.245 www.fergana.info Acunetix
2018/09/21 51.15.94.245 www.fergana.info Acunetix
2018/10/08 51.15.94.245 eltuz.com, www.fergananews.com and news.fergananews.com Unknown
2018/11/16 51.15.94.245 eltuz.com, centre1.com and en.eltuz.com NetSparker and WPScan
2019/01/18 51.15.94.245 eltuz.com WPScan
2019/01/19 51.15.94.245 fergana.info www.fergana.info and fergana.agency Unknown
2019/01/30 51.15.94.245 eltuz.com and en.eltuz.com Unknown
2019/02/05 51.15.94.245 fergana.info Acunetix

Besides classic open-source tools like WPScan, these attacks show the utilization of a wide range of commercial security audit tools, like NetSparker or Acunetix. Acunetix offers a trial version that may have been used here, NetSparker does not, showing that the operators may have a consistent budget (standard offer is $4995 / year, a cracked version may have been used).

It is also surprising to see so many different tools coming from a single server, as many of them require a Graphical User Interface. When we scanned the IP 51.15.94.245, we discovered that it hosted a Squid proxy on port 3128, we think that this proxy was used to relay traffic from the origin operator computer.

Extract of nmap scan of 51.15.94.245 in December 2018 :

3128/tcp  open     http-proxy Squid http proxy 3.5.23
|_http-server-header: squid/3.5.23
|_http-title: ERROR: The requested URL could not be retrieved

A large phishing campaign

After discovering a long list of domains made to resemble popular email providers, we suspected that the operators were also involved in a phishing campaign. We contacted owners of targeted websites, along with several Uzbek human right activists and gathered 14 different phishing emails targeting two activists between March 2018 and February 2019 :

Date Sender Subject Link
12th of March 2018 g.corp.sender[@]gmail.com У Вас 2 недоставленное сообщение (You have 2 undelivered message) http://mail.gmal.con.my-id[.]top/
13th of June 2018 service.deamon2018[@]gmail.com Прекращение предоставления доступа к сервису (Termination of access to the service) http://e.mail.gmall.con.my-id[.]top/
18th of June 2018 id.warning.users[@]gmail.com Ваш новый адрес в Gmail: alexis.usa@gmail.com (Your new email address in Gmail: alexis.usa@gmail.com) http://e.mail.users.emall.com[.]my-id.top/
10th of July 2018 id.warning.daemons[@]gmail.com Прекращение предоставления доступа к сервису (Termination of access to the service) hxxp://gmallls.con-537d7.my-id[.]top/
10th of July 2018 id.warning.daemons[@]gmail.com Прекращение предоставления доступа к сервису (Termination of access to the service) http://gmallls.con-4f137.my-id[.]top/
18th of July 2018 service.deamon2018[@]gmail.com [Ticket#2011031810000512] – 3 undelivered messages http://login-auth-goglemail-com-7c94e3a1597325b849e26a0b45f0f068.my-id[.]top/
2nd of August 2018 id.warning.daemon.service[@]gmail.com [Important Reminder] Review your data retention settings None
16th of October 2018 lolapup.75[@]gmail.com Экс-хоким Ташкента (Ex-hokim of Tashkent) http://office-online-sessions-3959c138e8b8078e683849795e156f98.email-service[.]host/
23rd of October 2018 noreply.user.info.id[@]gmail.com Ваш аккаунт будет заблокировано (Your account will be blocked.) http://gmail-accounts-cb66d53c8c9c1b7c622d915322804cdf.email-service[.]host/
25th of October 2018 warning.service.suspended[@]gmail.com Ваш аккаунт будет заблокировано. (Your account will be blocked.) http://gmail-accounts-bb6f2dfcec87551e99f9cf331c990617.email-service[.]host/
18th of February 2019 service.users.blocked[@]gmail.com Важное оповещение системы безопасности (Important Security Alert) http://id-accounts-blocked-ac5a75e4c0a77cc16fe90cddc01c2499.myconnection[.]website/
18th of February 2019 mail.suspend.service[@]gmail.com Оповещения системы безопасности (Security Alerts) http://id-accounts-blocked-326e88561ded6371be008af61bf9594d.myconnection[.]website/
21st of February 2019 service.users.blocked[@]gmail.com Ваш аккаунт будет заблокирован. (Your account will be blocked.) http://id-accounts-blocked-ffb67f7dd7427b9e4fc4e5571247e812.myconnection[.]website/
22nd of February 2019 service.users.blocked[@]gmail.com Прекращение предоставления доступа к сервису (Termination of access to the service) http://id-accounts-blocked-c23102b28e1ae0f24c9614024628e650.myconnection[.]website/

Almost all these emails were mimicking Gmail alerts to entice the user to click on the link. For instance this email received on the 23rd of October 2018 pretends that the account will be closed soon, using images of the text hosted on imgur to bypass Gmail detection :

The only exception was an email received on the 16th of October 2018 pretending to give confidential information on the former Hokim (governor) of Tashkent :

Emails were using simple tricks to bypass detection, at times drw.sh url shortener (this tool belongs to a Russian security company Doctor Web) or by using open re-directions offered in several Google tools.

Every email we have seen used a different sub-domain, including emails from the same Gmail account and with the same subject line. For instance, two different emails entitled “Прекращение предоставления доступа к сервису” and sent from the same address used hxxp://gmallls.con-537d7.my-id[.]top/ and http://gmallls.con-4f137.my-id[.]top/ as phishing domains. We think that the operators used a different sub-domain for every email sent in order to bypass Gmail list of known malicious domains. This would explain the large number of sub-domains identified through passive DNS. We have identified 74 sub-domains for 26 second-level domains used in this campaign (see the appendix below for full list of discovered domains).

We think that the phishing page stayed online only for a short time after having sent the email in order to avoid detection. We got access to the phishing page of a few emails. We could confirm that the phishing toolkit checked if the password is correct or not (against the actual gmail account) and suspect that they implemented 2 Factor authentication for text messages and 2FA applications, but could not confirm this.

Timeline for the campaign

We found the first evidence of activity in this operation with the registration of domain auth-login[.]com on the 21st of February 2016. Because we discovered the campaign recently, we have little information on attacks during 2016 and 2017, but the domain registration date shows some activity in July and December 2016, and then again in August and October 2017. It is very likely that the campaign started in 2016 and continued in 2017 without any public reporting about it.

Here is a first timeline we obtained based on domain registration dates and dates of web attacks and phishing emails :

To confirm that this group had some activity during 2016 and 2017, we gathered encryption (TLS) certificates for these domains and sub-domains from the crt.sh Certificate Transparency Database. We identified 230 certificates generated for these domains, most of them created by Cloudfare. Here is a new timeline integrating the creation of TLS certificates :

We see here many certificates created since December 2016 and continuing over 2017, which shows that this group had some activity during that time. The large number of certificates over 2017 and 2018 comes from campaign operators using Cloudflare for several domains. Cloudflare creates several short-lived certificates at the same time when protecting a website.

It is also interesting to note that the campaign started in February 2016, with some activity in the summer of 2016, which happens to when the former Uzbek president Islam Karimov died, news first reported by Fergana News, one of the targets of this attack campaign.

Infrastructure Analysis

We identified domains and subdomains of this campaign through analysis of passive DNS information, using mostly the Community access of PassiveTotal. Many domains in 2016/2017 reused the same registrant email address, b.adan1@walla.co.il, which helped us identify other domains related to this campaign :

Based on this list, we identified subdomains and IP addresses associated with them, and discovered three IP addresses used in the operation. We used Shodan historical data and dates of passive DNS data to estimate the timeline of the utilisation of the different servers :

  • 46.45.137.74 was used in 2016 and 2017
  • 139.60.163.29 was used between October 2017 and August 2018
  • 51.15.94.245 was used between September and February 2019

We have identified 74 sub-domains for 26 second-level domains used in this campaign (see the appendix for a full list of IOCs). Most of these domains are mimicking Gmail, but there are also domains mimicking Yandex (auth.yandex.ru.my-id[.]top), mail.ru (mail.ru.my-id[.]top) qip.ru (account.qip.ru.mail-help-support[.]info), yahoo (auth.yahoo.com.mail-help-support[.]info), Live (login.live.com.mail-help-support[.]info) or rambler.ru (mail.rambler.ru.mail-help-support[.]info). Most of these domains are sub-domains of a few generic second-level domains (like auth-mail.com), but there are a few specific second-level domains that are interesting :

  • bit-ly[.]host mimicking bit.ly
  • m-youtube[.]top and m-youtube[.]org for Youtube
  • ecoit[.]email which could mimick https://www.ecoi.net
  • pochta[.]top likely mimick https://www.pochta.ru/, the Russian Post website
  • We have not found any information on vzlom[.]top and fixerman[.]top. Vzlom means “break into” in Russian, so it could have hosted or mimicked a security website

A weird Cyber-criminality Nexus

It is quite unusual to see connections between targeted attacks and cyber-criminal enterprises, however during this investigation we encountered two such links.

The first one is with the domain msoffice365[.]win which was registered by b.adan1@walla.co.il (as well as many other domains from this campaign) on the 7th of December 2016. This domain was identified as a C2 server for a cryptocurrency theft tool called Quant, as described in this Forcepoint report released in December 2017. Virus Total confirms that this domain hosted several samples of this malware in November 2017 (it was registered for a year). We have not seen any malicious activity from this domain related to our campaign, but as explained earlier, we have marginal access to the group’s activity in 2017.

The second link we have found is between the domain auth-login[.]com and the groups behind the Bedep trojan and the Angler exploit kit. auth-login[.]com was linked to this operation through the subdomain login.yandex.ru.auth-login[.]com that fit the pattern of long subdomains mimicking Yandex from this campaign and it was hosted on the same IP address 46.45.137.74 in March and April 2016 according to RiskIQ. This domain was registered in February 2016 by yingw90@yahoo.com (David Bowers from Grovetown, GA in the US according to whois information). This email address was also used to register hundreds of domains used in a Bedep campaign as described by Talos in February 2016 (and confirmed by several other reports). Angler exploit kit is one of the most notorious exploit kit, that was commonly used by cyber-criminals between 2013 and 2016. Bedep is a generic backdoor that was identified in 2015, and used almost exclusively with the Angler exploit kit. It should be noted that Trustwave documented the utilization of Bedep in 2015 to increase the number of views of pro-Russian propaganda videos.

Even if we have not seen any utilisation of these two domains in this campaign, these two links seem too strong to be considered cirmcumstantial. These links could show a collaboration between cyber-criminal groups and state-sponsored groups or services. It is interesting to remember the potential involvement of Russian hacking groups in attacks on Uznews.net editor in 2014, as described by Amnesty international.

Taking Down Servers is Hard

When the attack was discovered, we decided to investigate without sending any abuse requests, until a clearer picture of the campaign emerged. In January, we decided that we had enough knowledge of the campaign and started to send abuse requests – for fake Gmail addresses to Google and for the URL shorteners to Doctor Web. We did not receive any answer but noticed that the Doctor Web URLs were taken down a few days after.

Regarding the Scaleway server, we entered into an unexpected loop with their abuse process. Scaleway operates by sending the abuse request directly to the customer and then asks them for confirmation that the issue has been resolved. This process works fine in the case of a compromised server, but does not work when the server was rented intentionally for malicious activities. We did not want to send an abuse request because it would have involved giving away information to the operators. We contacted Scaleway directly and it took some time to find the right person on the security team. They acknowledged the difficulty of having an efficient Abuse Process, and after we sent them an anonymized version of this report along with proof that phishing websites were hosted on the server, they took down the server around the 25th of January 2019.

Being an infrastructure provider, we understand the difficulty of dealing with abuse requests. For a lot of hosting providers, the number of requests is what makes a case urgent or not. We encourage hosting providers to better engage with organisations working to protect Civil Society and establish trust relationships that help quickly mitigate the effects of malicious campaigns.

Conclusion

In this report, we have documented a prolonged phishing and web attack campaign focusing on media covering Uzbekistan and Uzbek human right activists. It shows that once again, digital attacks are a threat for human-right activists and independent media. There are several threat actors known to use both phishing and web attacks combined (like the Vietnam-related group Ocean Lotus), but this campaign shows a dual strategy targeting civil society websites and their editors at the same time.

We have no evidence of government involvement in this operation, but these attacks are clearly targeted on prominent voices of Uzbek civil society. They also share strong similarities with the hack of Uznews.net in 2014, where the editor’s mailbox was compromised through a phishing email that appeared as a notice from Google warning her that the account had been involved in distributing illegal pornography.

Over the past 10 years, several organisations like the Citizen Lab or Amnesty International have dedicated lots of time and effort to document digital surveillance and targeted attacks against Civil Society. We hope that this report will contribute to these efforts, and show that today, more than ever, we need to continue supporting civil society against digital surveillance and intrusion.

Counter-Measures Against such Attacks

If you think you are targeted by similar campaigns, here is a list of recommendations to protect yourself.

Against phishing attacks, it is important to learn to recognize classic phishing emails. We give some examples in this report, but you can read other similar reports by the Citizen Lab. You can also read this nice explanation by NetAlert and practice with this Google Jigsaw quizz. The second important point is to make sure that you have configured 2-Factor Authentication on your email and social media accounts. Two-Factor Authentication means using a second way to authenticate when you log-in besides your password. Common second factors include text messages, temporary password apps or hardware tokens. We recommend using either temporary password apps (like Google Authenticator; FreeOTP) or Hardware Keys (like YubiKeys). Hardware keys are known to be more secure and strongly recommended if you are an at-risk activist or journalist.

Against web attacks, if you are using a CMS like WordPress or Drupal, it is very important to update both the CMS and its plugins very regularly, and avoid using un-maintained plugins (it is very common to have websites compromised because of outdated plugins). Civil society websites are welcome to apply to Deflect for free website protection.

Appendix

Acknowledgement

We would like to thank Front Line Defenders and Scaleway for their help. We would also like to thank ipinfo.io and RiskIQ for their tools that helped us in the investigation.

Indicators of Compromise

Top level domains :

email-service.host
email-session.host
support-email.site
support-email.host
email-support.host
myconnection.website
ecoit.email
my-cabinet.com
my-id.top
msoffice365-online.org
secretonline.top
m-youtube.top
auth-mail.com
mail-help-support.info
mail-support.info
auth-mail.me
auth-login.com
email-x.com
auth-mail.ru
mail-auth.top
msoffice365.win
bit-ly.host
m-youtube.org
vzlom.top
pochta.top
fixerman.top

You can find a full list of indicators on github : https://github.com/equalitie/deflect_labs_6_indicators

Read More

Deflect Labs Report #5 – Baskerville

Using Machine Learning to Identify Cyber Attacks

The Deflect platform is a free website security service defending civil society and human rights groups from digital attack. Currently, malicious traffic is identified on the Deflect network by Banjax, a system that uses handwritten rules to flag IPs that are behaving like attacking bots, so that they can be challenged or banned. While Banjax is successful at identifying the most common bruteforce cyber attacks, the approach of using a static set of rules to protect against the constantly evolving tools available to attackers is fundamentally limited. Over the past year, the Deflect Labs team has been working to develop a machine learning module to automatically identify malicious traffic on the Deflect platform, so that our mitigation efforts can keep pace with the methods of attack as these grow in complexity and sophistication.

In this report, we look at the performance of the Deflect Labs’ new anomaly detection tool, Baskerville, in identifying a selection of the attacks seen on the Deflect platform during the last year. Baskerville is designed to consume incoming batches of web logs (either live from a Kafka stream, or from Elasticsearch storage), group them into request sets by host website and IP, extract the browsing features of each request set, and make a prediction about whether the behaviour is normal or not. At its core, Baskerville currently uses the Scikit-Learn implementation of the Isolation Forest anomaly detection algorithm to conduct this classification, though the engine is agnostic to the choice of algorithm and any trained Scikit-Learn classifier can be used in its place. This model is trained on normal web traffic data from the Deflect platform, and evaluated using a suite of offline tools incorporated in the Baskerville module. Baskerville has been designed in such a way that once the performance of the model is sufficiently strong, it can be used for real-time attack alerting and mitigation on the Deflect platform.

To showcase the current capabilities of the Baskerville module, we have replayed the attacks covered in the 2018 Deflect Labs report: Attacks Against Vietnamese Civil Society, passing the web logs from these incidents through the processing and prediction engine. This report was chosen for replay because of the variety of attacks seen across its constituent incidents. There were eight attacks in total considered in this report, detailed in the table below.

Date Start (approx.) Stop (approx.) Target
2018/04/17 08:00 10:00 viettan.org
2018/04/17 08:00 10:00 baotiengdan.com
2018/05/04 00:00 23:59 viettan.org
2018/05/09 10:00 12:30 viettan.org
2018/05/09 08:00 12:00 baotiengdan.com
2018/06/07 01:00 05:00 baotiengdan.com
2018/06/13 03:00 08:00 baotiengdan.com
2018/06/15 13:00 23:30

baotiengdan.com

Table 1: Attack time periods covered in this report. The time period of each attack was determined by referencing the number of Deflect and Banjax logs recorded for each site, relative to the normal traffic volume.

How does it work?

Given one request from one IP, not much can be said about whether or not that user is acting suspiciously, and thus how likely it is that they are a malicious bot, as opposed to a genuine user. If we instead group together all the requests to a website made by one IP over time, we can begin to build up a more complete picture of the user’s browsing behaviour. We can then train an anomaly detection algorithm to identify any IPs that are behaving outside the scope of normal traffic.

The boxplots below illustrate how the behaviour during the Vietnamese attack time periods differs from that seen during an average fortnight of requests to the same sites. To describe the browsing behaviour, 17 features (detailed in the Baskerville documentation) have been extracted based on the request sets (note that the feature values are scaled relative to average distributions, and do not have a physical interpretation). In particular, it can be seen that these attack time periods stand out by having far fewer unique paths requested (unique_path_to_request_ratio), a shorter average path depth (path_depth_average), a smaller variance in the depth of paths requested (path_depth_variance), and a lower payload size (payload_size_log_average). By the ‘path depth’, we mean the number of slashes in the requested URL (so ‘website.com’ has a path depth of zero, and ‘website.com/page1/page2’ has a path depth of two), and by ‘payload size’ we mean the size of the request response in bytes.

Figure 1: The distributions of the 17 scaled feature values during attack time periods (red) and non-attack time periods (blue). It can be seen that the feature distributions are notably different during the attack and non-attack periods.

The separation between the attack and non-attack request sets can be nicely visualised by projecting along the feature dimensions identified above. In the three-dimensional space defined by the average path depth, the average log of the payload size, and the unique path to request ratio, the request sets identified as malicious by Banjax (red) are clearly separated from those not identified as malicious (blue).

Figure 2: The distribution of request sets along three of the 17 feature dimensions for IPs identified as malicious (red) or benign (blue) by the existing banning module, Banjax. The features shown are the average path depth, the average log of the request payload size, and the ratio of unique paths to total requests, during each request set. The separation between the malicious (red) and benign (blue) IPs is evident along these dimensions.

Training a Model

A machine learning classifier enables us to more precisely define the differences between normal and abnormal behaviour, and predict the probability that a new request set comes from a genuine user. For this report, we chose to train an Isolation Forest; an algorithm that performs well on novelty detection problems, and scales for large datasets.

As an anomaly detection algorithm, the Isolation Forest took as training data all the traffic to the Vietnamese websites over a normal two-week period. To evaluate its performance, we created a testing dataset by partitioning out a selection of this data (assumed to represent benign traffic), and combining this with the set of all requests coming from IPs flagged by the Deflect platform’s current banning tool, Banjax (assumed to represent malicious traffic). There are a number of tunable parameters in the Isolation Forest algorithm, such as the number of trees in the forest, and the assumed contamination with anomalies of the training data. Using the testing data, we performed a gridsearch over these parameters to optimize the model’s accuracy.

Replaying the Attacks

The model chosen for use in this report has a precision of 0.90, a recall of 0.86, and a resultant f1 score of 0.88, when evaluated on the testing dataset formulated from the Vietnamese website traffic, described above. If we take the Banjax bans as absolute truth (which is almost certainly not the case), this means that 90% of the IPs predicted as anomalous by Baskerville were also flagged by Banjax as malicious, and that 88% of all the IPs flagged by Banjax as malicious were also identified as anomalous by Baskerville, across the attacks considered in the Vietnamese report. This is demonstrated visually in the graph below, which shows the overlap between the Banjax flag and the Baskerville prediction (-1 indicates malicious, and +1 indicates benign). It can be seen that Baskerville identifies almost all of the IPs picked up by Banjax, and additionally flags a fraction of the IPs not banned by Banjax.

Figure 3: The overlap between the Banjax results (x-axis) and the Baskerville prediction results (colouring). Where the Banjax flag is -1 and the prediction colour is red, both Banjax and Baskerville agree that the request set is malicious. Where the Banjax flag is +1 and the prediction colour is blue, both modules agree that the request set is benign. The small slice of blue where the Banjax flag is -1, and the larger red slice where the Banjax flag is +1, indicate request sets about which the modules do not agree.

The performance of the model can be broken down across the different attack time periods. The grouped bar chart below compares the number of Banjax bans (red) to the number of Baskerville anomalies (green). In general, Baskerville identifies a much greater number of request sets as being malicious than Banjax does, with the exception of the 17th April attack, for which Banjax picked up slightly more IPs than Baskerville. The difference between the two mitigation systems is particularly pronounced on the 13th and 15th June attacks, for which Banjax scarcely identified any malicious IPs at all, but Baskerville identified a high proportion of malicious IPs.

Figure 4: The verdicts of Banjax (left columns) and Baskerville (right columns) across the 6 attack periods. The red/green components show the number of request sets that Banjax/Baskerville labelled as malicious, while the blue/purple components show the number that they labelled as benign. The fact that the green bars are almost everywhere higher than the red bars indicates that Baskerville picks up more traffic as malicious than does Banjax.

This analysis highlights the issue of model validation. It can be seen that Baskerville is picking up more request sets as being malicious than Banjax, but does this indicate that Baskerville is too sensitive to anomalous behaviour, or that Baskerville is outperforming Banjax? In order to say for sure, and properly evaluate Baskerville’s performance, a large testing set of labelled data is needed.

If we look at the mean feature values across the different attacks, it can be seen that the 13th and 15th June attacks (the red and blue dots, respectively, in the figure below) stand out from the normal traffic in that they have a much lower than normal average path depth (path_depth_average), and a much higher than normal 400-code response rate (response4xx_to_request_ratio), which may have contributed to Baskerville identifying a large proportion of their constituent request sets as malicious. Since a low average path depth (e.g. lots of requests made to ‘/’) and a high 400 response code rate (e.g. lots of requests to non-existent pages) are indicative of an IP behaving maliciously, this may suggest that Baskerville’s predictions were valid in these cases. But more labelled data is required for us to be certain about this evaluation.

Figure 5: Breakdown of the mean feature values during the two attack periods (red, blue) for which Baskerville identified a high proportion of malicious IPs, but Banjax did not. These are compared to the mean feature values during a normal two-week period (green).

Putting Baskerville into Action

Replaying the Vietnamese attacks demonstrates that it is possible for the Baskerville engine to identify cyber attacks on the Deflect platform in real time. While Banjax mitigates attacks using a set of static human-written rules describing what abnormal traffic looks like, by comprehensively describing how normal traffic behaves, the Baskerville classifier is able to identify new types of malicious behaviour that have never been seen before.

Although the performance of the Isolation Forest in identifying the Vietnamese attacks is promising, we would require a higher level of accuracy before the Baskerville engine is used to automatically ban IPs from accessing Deflect websites. The model’s accuracy can be improved by increasing the amount of data it is trained on, and by performing additional feature engineering and parameter tuning. However, to accurately assess its skill, we require a large set of labelled testing data, more complete than what is offered by Banjax logs. To this end, we propose to first deploy Baskerville in a developmental stage, during which IPs that are suspected to be malicious will be served a Captcha challenge rather than being absolutely banned. The results of these challenges can be added to the corpus of labelled data, providing feedback on Baskerville’s performance.

In addition to the applications of Baskerville for attack mitigation on the Deflect platform, by grouping incoming logs by host and IP into request sets, and extracting features from these request sets, we have created a new way to visualise and analyse attacks after they occur. We can compare attacks not just by the IPs involved, but also by the type of behaviour displayed. This opens up new possibilities for connecting disparate attacks, and investigating the agents behind them.

Where Next?

The proposed future of Deflect monitoring is the Deflect Labs Information Sharing and Analysis Centre (DL-ISAC). The underlying idea behind this project, summarised in the schematic below, is to split the Baskerville engine into separate User Module and Clearinghouse components (dealing with log processing and model development, respectively), to enable a complete separation of personal data from the centralised modelling. Users would process their own web logs locally, and send off feature vectors (devoid of IP and host site details) to receive a prediction. This allows threat-sharing without compromising personally identifiable information (PII). In addition, this separation would enable the adoption of the DL-ISAC by a much broader range of clients than the Deflect-hosted websites currently being served. Increasing the user base of this software will also increase the amount of browsing data we are able to collect, and thus the strength of the models we are able to train.

Baskerville is an open-source project, with its first release scheduled next quarter. We hope this will represent the first step towards enabling a new era of crowd-sourced threat information sharing and mitigation, empowering internet users to keep their content online in an increasingly hostile web environment.

Figure 6: A schematic of the proposed structure of the DL-ISAC. The infrastructure is split into a log-processing user endpoint, and a central clearinghouse for prediction, analysis, and model development.

A Final Word: Bias in AI

In all applications of machine learning and AI, it is important to consider sources of algorithmic bias, and how marginalised users could be unintentionally discriminated against by the system. In the context of web traffic, we must take into account variations in browsing behaviour across different subgroups of valid, non-bot internet users, and ensure that Baskerville does not penalise underrepresented populations. For instance, checks should be put in place to prevent disadvantaged users with slower internet connections from being banned because their request behaviour differs from those users that benefit from high-speed internet. The Deflect Labs team is committed to prioritising these considerations in the future development of the DL-ISAC.

Read More