Shining Light into the Tunnel: Understanding and Classifying Network Traffic of Residential Proxies

Ronghong Huang¹, Dongfang Zhao², Xianghang Mi¹, Xiaofeng Wang²
¹University of Science and Technology of China, ²Indiana University Bloomington

Abstract

Residential proxies (RESIPs), which have emerged in recent years, differ from traditional network proxies (e.g., commercial VPNs) in several respects: they are deployed in residential networks rather than data center networks, they are distributed worldwide across tens of thousands of cities and ISPs, and they operate at a scale of millions of exit nodes. Together, these factors allow RESIP users to effectively masquerade their traffic flows as those of authentic residential users, which has led to the increasing adoption of RESIP services, especially in malicious online activities.

However, the current understanding of the (malicious) usage of RESIPs (i.e., what traffic is relayed by RESIPs) remains insufficient. In particular, previous works on RESIP traffic studied only the maliciousness of web traffic destinations and the suspicious patterns of visiting popular websites.

A general methodology for capturing large-scale RESIP traffic and analyzing it for security risks is also missing.

Furthermore, given that many RESIP nodes are found to be located in corporate networks and are deployed without proper authorization from device owners or network administrators, it is becoming increasingly necessary to detect and block RESIP traffic flows. This is unfortunately impeded by the scarcity of realistic RESIP traffic datasets and effective detection methodologies.

Residential proxies

Residential proxies (RESIPs) are web proxies located in either residential or cellular networks. As illustrated in Figure, a typical RESIP works in a backconnect mode: proxy traffic from a proxy customer is first sent to the gateway server, which in turn forwards it to a RESIP node, from which it exits to the traffic destination. Most RESIP services also allow proxy customers to specify the country and city from which the relayed traffic should exit. In addition, proxy customers can pin their traffic to the same exit node, either by passing the same session ID or by connecting to sticky proxy gateways, which usually stay bound to one exit node for 5 or 10 minutes at a time.
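
The exit-location and sticky-session options can be illustrated with a minimal sketch. It assumes a hypothetical gateway (`gateway.example.com:8000`) that reads these options from the proxy username. The option names and the username layout are illustrative assumptions only, since each RESIP service defines its own format.

```python
import uuid
from typing import Optional
from urllib.parse import quote


def build_proxy_url(user: str, password: str,
                    country: Optional[str] = None,
                    city: Optional[str] = None,
                    session_id: Optional[str] = None,
                    gateway: str = "gateway.example.com:8000") -> str:
    """Build a proxy URL for a hypothetical backconnect gateway.

    The 'user-country-xx-city-yy-session-zz' username layout is an
    assumption made for illustration, not the format of a specific service.
    """
    parts = [user]
    if country:
        parts += ["country", country.lower()]
    if city:
        parts += ["city", city.lower().replace(" ", "_")]
    if session_id:
        parts += ["session", session_id]
    username = "-".join(parts)
    # Percent-encode the credentials so they are safe inside a URL.
    return f"http://{quote(username, safe='')}:{quote(password, safe='')}@{gateway}"


def new_session_id() -> str:
    """Return a random identifier to use as a sticky-session key."""
    return uuid.uuid4().hex[:12]
```

Under this assumed scheme, reusing one session ID across requests asks the gateway to keep them on the same exit node, while generating a fresh ID asks for a new one.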

Methodology

Methodology Pipeline
  • The RESIP Traffic Collector
  • To collect RESIP traffic, we built a RESIP traffic collector consisting of two major components: a set of RESIP node runners and a central RESIP node dispatcher (i.e., controller). The dispatcher instructs the RESIP node runners on how to deploy RESIP nodes and how to back up the resulting network traffic along with other log files. A RESIP node runner, in turn, is responsible for running a RESIP node for a period of time by following the deployment instructions from the central dispatcher.

  • The RESIP Node Collector
  • As illustrated in Figure, to capture RESIP nodes, web clients and web servers are first deployed under our control. Then, probes in the form of simple HTTP requests are sent from the web clients, forwarded through proxy gateway servers, and exit at RESIP nodes before reaching the web servers. Through this process, the web servers are able to observe the public IP address of each involved RESIP node.

  • The RESIP Traffic Analyzer
  • The large scale of the captured RESIP traffic renders manual analysis impractical. Therefore, a RESIP traffic analyzer consisting of multiple traffic processing modules was designed and implemented. It contains the following five parts:

    • Signature-based RESIP traffic classification
    • Traffic destination address extraction
    • Querying and analyzing threat intelligence of traffic destinations
    • Abnormal traffic analysis
    • RESIP usage analysis
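
The probing step of the RESIP Node Collector described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `ProbeHandler`, `start_probe_server`, and `send_probe` are hypothetical, and the proxy URL passed to `send_probe` is assumed to point at a RESIP gateway. The web server simply records the peer address of every probe it receives, which for a proxied probe is the public IP address of the exit node.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


class ProbeHandler(BaseHTTPRequestHandler):
    """Web-server side: record the source IP of every incoming probe."""

    def do_GET(self):
        # The TCP peer address; for a probe relayed by a RESIP service,
        # this is the address of the exit node, not of the web client.
        ip = self.client_address[0]
        self.server.observed_ips.add(ip)
        body = ip.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_probe_server(host: str = "127.0.0.1", port: int = 0) -> HTTPServer:
    """Start the probe server in a background thread and return it."""
    server = HTTPServer((host, port), ProbeHandler)
    server.observed_ips = set()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_probe(url: str, proxy: Optional[str] = None) -> str:
    """Web-client side: send one HTTP probe, optionally through a proxy."""
    proxies = {"http": proxy} if proxy else {}  # {} disables any proxy
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    with opener.open(url, timeout=5) as resp:
        return resp.read().decode("ascii")
```

In a real deployment the server would listen on a public address and each probe would be sent with `proxy` set to a gateway URL, so that `server.observed_ips` accumulates exit-node addresses over time.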

Traffic Towards Sensitive Websites

We observe that miscreants have been abusing RESIPs to relay their traffic towards security-sensitive websites. This allows them to masquerade their visits as benign visits from local residents and thus lowers the likelihood of being detected or blocked.

  • Visiting government/military/education websites
  • A website is typically considered security sensitive when it is operated by government agencies, military organizations, or educational institutions. Such websites tend to be registered under special suffixes: a government website usually has gov in its suffix, an education website has edu, and a military website has mil. In total, 20,713 traffic flows were observed visiting one of 316 government websites, 9 military websites, and 560 education websites. Among these flows, 89.62% used HTTPS while the remaining flows used HTTP.

  • Visiting sensitive websites running on unusual TCP ports
  • Many RESIP flows were found to be visits to sensitive websites that appear not to be intended for public access. In total, we manually confirmed suspicious traffic towards 73 different security-sensitive websites. These websites belong to diverse categories, e.g., industrial control systems, airport control systems, watering control systems, and remote desktops inside government agencies. All of these websites are located in China, and visits to them were relayed through RESIP nodes deployed in China.
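
The two checks described above can be approximated with a short sketch. It is a simplified pre-filter under stated assumptions, not the analysis used in the study: the suffix rule only covers `gov`, `mil`, and `edu` as a top-level label or directly before a two-letter country code, and treating every port other than 80 and 443 as "unusual" is an assumption made here for illustration. Flows flagged this way would still need manual confirmation.

```python
from typing import Optional
from urllib.parse import urlsplit

SENSITIVE_LABELS = {"gov", "mil", "edu"}
COMMON_WEB_PORTS = {80, 443}  # assumption: any other port counts as unusual


def sensitive_category(hostname: str) -> Optional[str]:
    """Return 'gov', 'mil' or 'edu' if the hostname suffix suggests one."""
    labels = hostname.lower().rstrip(".").split(".")
    # Case 1: sensitive top-level label, e.g. portal.example.gov
    if labels[-1] in SENSITIVE_LABELS:
        return labels[-1]
    # Case 2: sensitive label before a two-letter country code,
    # e.g. www.example.edu.cn
    if len(labels) >= 2 and len(labels[-1]) == 2 and labels[-2] in SENSITIVE_LABELS:
        return labels[-2]
    return None


def uses_unusual_port(url: str) -> bool:
    """Return True if the URL names an explicit port outside 80/443."""
    port = urlsplit(url).port  # None when the scheme's default port is used
    return port is not None and port not in COMMON_WEB_PORTS
```

For example, `sensitive_category("www.example.edu.cn")` yields `"edu"`, and `uses_unusual_port("http://203.0.113.7:8081/login")` is true, whereas a hostname such as `edu.com` is not matched by either suffix rule.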

RESIP Traffic Classifications

To defend against the non-negligible security risks of RESIP traffic, one promising option is to deploy a RESIP traffic detector at various vantage points, e.g., a local device, the network gateway of an organization, or a router operated by an ISP. Therefore, we further pursue this task through the combination of machine learning, a set of robust features inherent in each traffic flow, and a large-scale ground truth dataset consisting of real-world relayed flows, tunnel flows, and non-RESIP traffic.

We explore two machine learning paradigms. One is to fine-tune a pre-trained transformer-based model, which allows us to automatically encode a traffic flow into a generic representation. The other is to manually design features and use classic classification algorithms to build the classification models. The following table shows the performance of the resulting models.

The performance of various classification models for RESIP flow detection.

Model      Precision  Recall  F1-Score  FPR¹
TF-BERT²   0.9036     0.8891  0.8963    0.0535
RF-BERT²   0.9287     0.9304  0.9296    0.0396
TF-RF²     0.9571     0.9139  0.9350    0.0223
RF-RF²     0.9620     0.8880  0.9235    0.0191

¹ FPR stands for the false positive rate.

² TF-BERT is a downstream task that fine-tunes the ET-BERT model for tunnel flow classification. RF-BERT is a downstream task that fine-tunes the ET-BERT model for relayed flow classification. TF-RF denotes Random Forest classifiers specialized in tunnel flow classification, while RF-RF denotes Random Forest classifiers designed for relayed flow classification.
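
As a reading aid for the table, the sketch below uses the standard definitions of the F1-score (the harmonic mean of precision and recall) and of the false positive rate, FP / (FP + TN). The precision, recall, and F1 values are copied from the table. The helper names are illustrative, and the confusion-matrix counts behind the reported FPR values are not given in the source, so `false_positive_rate` is shown only as a definition.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of negative (non-RESIP) flows wrongly flagged as RESIP."""
    return fp / (fp + tn)


# (precision, recall, reported F1-score) as listed in the table
REPORTED = {
    "TF-BERT": (0.9036, 0.8891, 0.8963),
    "RF-BERT": (0.9287, 0.9304, 0.9296),
    "TF-RF": (0.9571, 0.9139, 0.9350),
    "RF-RF": (0.9620, 0.8880, 0.9235),
}

for model, (p, r, reported_f1) in REPORTED.items():
    print(f"{model}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported_f1:.4f}")
```

The recomputed F1-scores agree with the reported ones up to rounding of the published precision and recall values.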
