Downloading and analyzing the source code of all available Chrome Extensions

January 6, 2021
plugins static-analysis scraping python bash

But, why?

From algorithmic fairness to adversarial machine learning, there are plenty of open problems around the security and privacy of machine learning systems in production. You can evade, poison, or even steal a machine learning model from API access alone, for instance. Then there are issues such as membership inference, etc.

example of evasion attacks

An example of an evasion attack on a STOP sign: an autonomous car can be fooled into believing it is a speed limit sign (source: mlconference.ai)

Could there be security holes lurking in the Chrome browser extensions (aka plugins) that use machine learning as a core part of their functionality? So, our goal was this: analyze all the Google Chrome extensions that use machine learning, and discover and enumerate potential security issues.

The goal was far from humble, since the total number of Google Chrome extensions could be very large and there is no obvious starting point either. The journey was going to be interesting, and it seemed we would need to do all of the following:

  1. Download all Chrome extensions.
  2. Examine them to identify the ones that use machine learning.
  3. Find, probe into, and report potential security issues (not covered here).

Tools of the Trade

The machine that I had access to had the following configuration:

CPU: Intel® Xeon® E5-2695 v4 @ 2.10 GHz
Cores: 72
RAM: 120 GB
OS: Ubuntu
Ping: 18 ms
Download bandwidth: 700 Mbps

Besides the machine, I primarily relied on Python, BASH, awk, sed, and (e)grep to get almost everything done.

But how do I crawl for all available extensions?

The first challenge is to download all the Chrome extensions available on the Chrome Web Store, and to do that, you first need URLs pointing to all the extensions. For instance, the URL of the Google Translate extension is this: https://chrome.google.com/webstore/detail/google-translate/aapbdbdomjkkjkaonfhkkikfgjllcleb. We will worry a bit later about how to actually get the source code from there.

There's a hard and decidedly slow way to do it. It goes something like this: run a headless browser with Selenium, go to the Chrome Web Store, and you will find featured extensions from each category. Then visit the page dedicated to each category and collect all the extension URLs there. Once you've done that, you will have enough 'seed' URLs.

related plugins

The 'Related' section for a given Chrome extension: scroll across and collect all the URLs in it

Now visit each of them, scroll down to the bottom, and gather all the extensions in the 'related' section. Then repeat the same for each of those extensions. The hope is that this way we can reach all the extensions. But there are a few problems with this approach.

  1. After some time, the discovery rate of previously unvisited extensions starts to slow down. I don't have an exact statistical summary, since I stopped this process after gathering 7k extension URLs over some 4 hours. I did this on my MacBook Pro (8 GB of memory).

  2. This method is tediously slow. Visiting each page reliably takes, on average, about 4 seconds. Even with 72 processes in parallel, visiting all 214,421 extension pages would have taken at least ~23 hrs. And once you account for the fact that the discovery rate of new items keeps dropping, you're looking at a few days of crawling alone! (Of course, you could parallelize downloading the code itself, but that was too much for me to juggle at once at the Bash terminal.)
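The crawling strategy above is essentially a breadth-first search over the 'related extensions' graph. A minimal sketch with a stubbed fetch function standing in for the Selenium-driven page visits (the graph here is invented purely for illustration):

```python
from collections import deque

# Stubbed fetch: the real crawler would drive a headless Selenium browser,
# open the extension's web-store page, scroll to the bottom, and scrape the
# URLs in the "Related" section. This toy graph is made up for illustration.
RELATED = {
    "ext-a": ["ext-b", "ext-c"],
    "ext-b": ["ext-a", "ext-d"],
    "ext-c": ["ext-d"],
    "ext-d": [],
}

def fetch_related(url):
    return RELATED.get(url, [])

def crawl(seeds):
    """Breadth-first search over the 'related extensions' graph."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        url = queue.popleft()
        for nxt in fetch_related(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(crawl(["ext-a"])))  # -> ['ext-a', 'ext-b', 'ext-c', 'ext-d']
```

The slowdown in discovery corresponds to `seen` filtering out more and more of each page's related links as the frontier saturates.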

What I had instead was a hybrid approach: roughly 7 hrs of crawling with Selenium, combined with a whole 118,518 extension URLs that I found by Google-dorking Google itself. That left me, for the next step, with 214,421 extensions hosted on the Chrome Web Store. Let's move on.

Downloading the Chrome extensions

There are multiple ways to do it.

  1. Automate the extension installation process and read the extensions from your system after installing each extension.
  2. Use another extension to download Chrome extensions; there's even a dedicated site for this.
distribution of plugin source sizes

Top Chrome extensions based on their sizes

The mean size of the 214k extensions was 2.094 MB. However, the 90th percentile was 4.64 MB, the 95th percentile 10.01 MB, and the 99th percentile 32.87 MB.
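Summary numbers like these are straightforward to compute once you have the per-extension sizes; a sketch with Python's statistics module, using made-up sizes rather than the real measurements:

```python
import statistics

# Hypothetical per-extension sizes in MB; the real numbers would come from
# stat'ing the 214k downloaded .crx files.
sizes_mb = [0.1, 0.4, 0.9, 1.2, 2.0, 2.5, 3.1, 4.8, 11.0, 35.0]

# statistics.quantiles with n=100 yields the 1st..99th percentiles.
pct = statistics.quantiles(sizes_mb, n=100, method="inclusive")
p90, p95, p99 = pct[89], pct[94], pct[98]
print(f"mean={statistics.mean(sizes_mb):.2f} MB, "
      f"p90={p90:.2f}, p95={p95:.2f}, p99={p99:.2f}")
```

A long right tail like this (mean well below the high percentiles) is exactly what the real data showed.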

When published by developers, Google Chrome extensions are packed into and hosted as .crx files. If you downloaded each one with curl and each extension took 2 seconds, you'd need at least 4.96 days! But since the downloading process is mostly I/O-bound, we can use GNU parallel. With 20 download jobs in parallel, it took less than 6 hours to download all the extensions as .crx files. The total size of the download was 448.15 GB!
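For the actual fetches, you need a direct .crx URL per extension. A minimal sketch of building one from the web-store URL, assuming the clients2.google.com download endpoint that was commonly used at the time (the exact query parameters are an assumption on my part and have changed across Chrome versions):

```python
from urllib.parse import quote

def crx_url(extension_id, chrome_version="89.0"):
    # Assumed endpoint for direct .crx downloads; parameters vary by era.
    inner = quote(f"id={extension_id}&uc")
    return ("https://clients2.google.com/service/update2/crx"
            f"?response=redirect&prodversion={chrome_version}"
            "&acceptformat=crx2,crx3"
            f"&x={inner}")

# The extension ID is the last path component of the web-store URL:
store_url = ("https://chrome.google.com/webstore/detail/"
             "google-translate/aapbdbdomjkkjkaonfhkkikfgjllcleb")
ext_id = store_url.rstrip("/").rsplit("/", 1)[-1]
print(crx_url(ext_id))
```

Printing one such URL per line into a file and running something like `parallel -j20 'curl -L -o {#}.crx {}' :::: urls.txt` reproduces the 20-job download.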

But one more thing. Before analyzing any of the extension source code, and since I didn't yet know how I'd proceed, I was better off extracting the .crx files up front. Extracting each is as simple as unzip "$crx" -d "$dir", but it took around 8 hours as a single process. The total size of the uncompressed extensions was 564 GB.
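The unzip step can equally be scripted in Python. The sketch below uses the zipfile module, which handles .crx files because the ZIP central directory is located from the end of the file, so the extra CRX header up front is simply skipped (the same reason plain unzip manages, with a warning). The demo builds a synthetic archive rather than using a real extension:

```python
import io
import pathlib
import zipfile

def extract_crx(crx_bytes, dest):
    # zipfile finds the central directory from the end of the file, so a
    # header prepended to the ZIP payload (as in .crx files) is tolerated.
    with zipfile.ZipFile(io.BytesIO(crx_bytes)) as zf:
        zf.extractall(dest)

# Build a synthetic '.crx': a fake header followed by a genuine ZIP archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("manifest.json", '{"name": "demo"}')
fake_crx = b"Cr24-fake-header" + buf.getvalue()

extract_crx(fake_crx, "demo_ext")
print(pathlib.Path("demo_ext", "manifest.json").read_text())  # {"name": "demo"}
```

Wrapping calls like this in a process pool would also have parallelized the 8-hour single-process extraction.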

Find the Extensions that are using Machine Learning

Now comes the fun part. Since we do not know a priori any method for this analysis, whatever method we pick should have very high recall. What we want here is a sort of static analysis, i.e., figuring out the functionality and implications of the code without actually running it. But before you go hunting for the right tools and building an elaborate static-analysis pipeline over 214,421 extensions comprising more than 500 GB of source material, let's pause for a moment and think about it.

Ask yourself this question: when was the last time you wrote your own machine learning library because you wanted to do some classification? The answer is never, except maybe once, when you were learning the basics. Based on this observation, you are better off looking for patterns of ML (deep learning) usage based on the well-known deep learning libraries.

Once you have done that basic research, you are almost ready to run a cheap and effective static analysis on this large dataset. It is cheap because it is literally a human-supervised pattern search over the raw text of the code, and it is effective because it works (and is fast enough). What tool do we use? I would suggest that the classic (e)grep is enough!

Let’s find all the extensions that are using any of these libraries.

find . -name '*.js' -print0 |\
  xargs -0 egrep -o -I -i '(brain|synaptic|convnet|webdnn|tensorflow)\.js'

By now, you'll have figured out that the majority of the true positives actually use only the TensorFlow library. So just run one more search with egrep -i tensorflow and call it a day. Interestingly enough, dropping the \.js suffix gave a very high false-positive rate, since the words brain, tensorflow, and synaptic were not rare in comments and strings.
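The effect of anchoring on the \.js suffix is easy to demonstrate with Python's re module; the sample strings below are made up for illustration:

```python
import re

# Same pattern as the egrep above: a library name immediately followed by ".js".
PATTERN = re.compile(r"(brain|synaptic|convnet|webdnn|tensorflow)\.js", re.I)

samples = {
    '<script src="tensorflow.js"></script>': True,   # genuine library reference
    'importScripts("Brain.JS");': True,              # case-insensitive match
    "// this will blow your brain": False,           # bare word, no ".js": no match
    "var net = new brain.NeuralNetwork();": False,   # "brain." but not "brain.js"
}
for text, expected in samples.items():
    assert bool(PATTERN.search(text)) == expected
print("all pattern checks passed")
```

Without the `\.js` part, the last two strings would both match on the bare word "brain", which is exactly the false-positive flood described above.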

Once you have gathered all the matching files, use sed and awk to clean up the noise and remove duplicates. Finally, you are left with only around 65 extensions that indeed use machine learning!

How can I be sure the method worked? Well, there were certain extensions that I knew beforehand to be using machine learning, and they served as canaries: the methods above were supposed to catch them all.

By the way, if you are curious, the top 5 Chrome extensions using machine learning (deep learning, to be precise), in order of number of downloads, are: AdBlock Plus (10 million+), Visual Effects for Google Meet (2 million+), Guardio: Antivirus and Malware Removal (300k+), Background Blur of Google Meet (100k+), and Reach: Emoji, GIFs filters for Google Meet (100k+). The top categories for these extensions using deep learning under the hood were Productivity, followed by Social and Communication, and Search Tools.
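The sed/awk cleanup essentially maps every match back to an extension ID and deduplicates. A Python sketch of the same idea, with hypothetical match lines:

```python
# Hypothetical lines of `egrep -o` output ("path:match"); in the real run
# the extension ID is the first directory component under the extraction root.
matches = [
    "./ext-aaaa/js/background.js:tensorflow.js",
    "./ext-aaaa/js/background.min.js:tensorflow.js",
    "./ext-bbbb/lib/brain.js:brain.js",
]

def extension_id(line):
    """Take the path part before ':' and return its first real component."""
    path = line.split(":", 1)[0]
    parts = [p for p in path.split("/") if p not in (".", "")]
    return parts[0]

unique_ids = sorted({extension_id(m) for m in matches})
print(unique_ids)  # -> ['ext-aaaa', 'ext-bbbb']: two extensions, three matches
```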

Conclusions

A maximum of 214,421 Chrome extensions were downloaded in less than a day, and their source code was examined in another few hours using the general-purpose tools that GNU/UNIX offers out of the box; Selenium and some Python scripting handled the initial crawl. In the end, we found that only around 65 extensions use deep learning frameworks to do machine learning within the browser context. In a future post, we will examine whether some of these extensions could have potential security issues.

If you have a message for the author, please write to: lambdacircle *at* protonmail *dot* com
