RASTA
Rasta stands for Reproducibility of Android Static Tools and Analysis.
This repository contains the source code for reproducing the experiments of the paper "Evaluating the Re-Usability of Android Static Analysis Tools", published at the ICSR 2024 conference.
The provided source code makes it possible to rebuild Docker and Singularity images for several static analysis tools from the literature, but pre-built images can be retrieved directly from the following locations:
- Docker images: https://hub.docker.com/u/histausse (or see full list at the end of this README)
- Singularity images: https://zenodo.org/records/10980349
The Docker images provide an interactive container for analyzing an APK file. The Singularity images are meant for running batch analyses over a dataset of applications on a Singularity cluster. Additionally, the source code contains scripts that extract the status of each APK analysis (failed/finished) and some characteristics (time, memory), and push these values into a database for further statistics.
The input data and pre-computed output data are provided from outside this repository.
If you want to reuse a specific analysis tool without installing it, by using our Docker images, have a look at the end of this README.
Data
Some data are needed to reproduce the experiment (at least the AndroZoo indexes we used to sample our dataset). These data are too large to be stored in a git repository, so they need to be downloaded from Zenodo to the root of this repository:
curl "https://zenodo.org/records/10137905/files/rasta_data_v1.0.tgz?download=1" | tar -xz
Dependencies
To run the Rasta experiment, some tools are required:
- Docker (e.g. version 24.0.6)
- Singularity (e.g. version 3.11.1)
- a modern version of Python (e.g. Python 3.10 or 3.11)
- gzip
- sqlite3
One way to install those tools is to use Nixpkgs (nix-shell -p docker singularity python310 python310Packages.numpy python310Packages.matplotlib sqlite); another way is to follow the installation instructions of each tool (https://docs.sylabs.io/guides/3.11/user-guide/, https://docs.docker.com/).
Warning
(One year later, 2025):
Since Ubuntu 23.10, AppArmor prevents the creation of unprivileged user namespaces by default. This means Singularity won't work without a specific AppArmor profile (which is not installed by nix-shell).
Fortunately, Ubuntu now has a package for Singularity: singularity-container. Using your distribution's package should be the preferred method for installing the tools.
There are also some Python dependencies that need to be installed in a virtual env:
python3 -m venv venv
source venv/bin/activate
pip install rasta_data_manipulation/
pip install -r rasta_exp/requirements.txt
From now on, all commands are run from inside this venv.
Re-generating datasets
The datasets we used (Drebin and Rasta, split in 10 balanced sets) are located in data/dataset:
- Drebin: drebin
- Rasta: set0, set1, ..., set9
It is possible to reproduce the generation of these datasets using latest.csv.gz and year_and_sdk.csv.gz, which come from AndroZoo. Use the following command to regenerate the Rasta dataset:
rasta-gen-dataset data/androzoo/latest.csv.gz data/androzoo/year_and_sdk.csv.gz -o data/dataset
Container Images
The containers are stored in data/imgs. They can be regenerated with:
cd rasta_exp
./build_docker_images.sh ../data/imgs
cd ..
The images can also be directly downloaded from the Zenodo archive using:
cd rasta_exp
./download_sif_images.sh ../data/imgs
cd ..
The container and binary of Perfchecker are not provided, as the Perfchecker binary is only available on request.
Running experiments
The results of the experiments are stored in data/results/archives/. They can be extracted with:
mkdir -p data/results/reports/rasta
mkdir -p data/results/reports/drebin
for archive in data/results/archives/status_set*.tgz; do tar -xzf "${archive}" --directory data/results/reports/rasta; done
tar -xzf data/results/archives/status_drebin.tgz --directory data/results/reports/drebin
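For readers who would rather drive this step from Python, the extraction can be sketched with the standard library. This is only an illustration of the shell commands shown here, not a script shipped with the repository; the function name extract_archives is hypothetical.

```python
import glob
import tarfile
from pathlib import Path


def extract_archives(archive_glob: str, dest: str) -> list[str]:
    """Extract every .tgz archive matching archive_glob into dest.

    Returns the sorted list of archives that were extracted.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    archives = sorted(glob.glob(archive_glob))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Recent Python versions can refuse members that would be
                # written outside dest_dir.
                tar.extractall(dest_dir, filter="data")
            else:
                tar.extractall(dest_dir)
    return archives
```

Calling extract_archives("data/results/archives/status_set*.tgz", "data/results/reports/rasta") and then extract_archives("data/results/archives/status_drebin.tgz", "data/results/reports/drebin") mirrors the shell commands.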
The results can also be regenerated by recomputing all our experiments, which takes weeks or months.
To run the experiment using Singularity images hosted on your own computer, you must simplify the settings.ini file, which is intended for running on a Singularity cluster. This file is located in the rasta_exp directory. The following 3 lines are sufficient to configure the experiment for running on your local computer:
[AndroZoo]
apikey = <KEY>
base_url = https://androzoo.uni.lu
Do not forget to replace <KEY> with your AndroZoo key.
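As a convenience, the minimal file can also be written and checked from Python. The sketch below assumes settings.ini follows standard INI syntax; how rasta_exp itself parses the file is not shown in this README, and the helper names write_local_settings and read_apikey are hypothetical.

```python
import configparser


def write_local_settings(path: str, apikey: str,
                         base_url: str = "https://androzoo.uni.lu") -> None:
    """Write a minimal settings.ini holding only the [AndroZoo] section."""
    # interpolation=None keeps a '%' in the key from being treated specially.
    config = configparser.ConfigParser(interpolation=None)
    config["AndroZoo"] = {"apikey": apikey, "base_url": base_url}
    with open(path, "w", encoding="utf-8") as handle:
        config.write(handle)


def read_apikey(path: str) -> str:
    """Return the AndroZoo key, refusing the unedited <KEY> placeholder."""
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(path, encoding="utf-8"):
        raise FileNotFoundError(path)
    apikey = config["AndroZoo"]["apikey"].strip()
    if not apikey or apikey == "<KEY>":
        raise ValueError("replace <KEY> with your AndroZoo API key")
    return apikey
```

Running read_apikey("rasta_exp/settings.ini") before launching a long experiment is a cheap way to catch a forgotten placeholder.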
Then, you can run the experiment for all tools on the Rasta and Drebin datasets by running:
./rasta_exp/run_exp_local.sh ./data/imgs ./data/dataset/drebin ./data/results/reports/drebin/status_drebin
for i in {0..9}; do
    ./rasta_exp/run_exp_local.sh ./data/imgs "./data/dataset/set${i}" "./data/results/reports/rasta/status_set${i}"
done
This takes a lot of time, probably several months. You should adapt this last script to reduce one or more of:
- the number of static analysis tools to evaluate
- the dataset size
- other parameters in the source code such as the timeout
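One way to reduce the dataset size is to draw a reproducible random subset of a dataset file before passing it to run_exp_local.sh. The sketch below ASSUMES that each dataset file (for example data/dataset/set0) is a text file listing one APK SHA-256 per line; check the actual file format before relying on it. The function name sample_dataset is hypothetical.

```python
import random


def sample_dataset(src: str, dst: str, size: int, seed: int = 0) -> int:
    """Write a reproducible random subset of the entries of src to dst.

    ASSUMPTION: the dataset file lists one APK SHA-256 per line.
    Returns the number of entries written.
    """
    with open(src, encoding="utf-8") as handle:
        entries = [line.strip() for line in handle if line.strip()]
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    subset = rng.sample(entries, min(size, len(entries)))
    with open(dst, "w", encoding="utf-8") as handle:
        handle.writelines(f"{entry}\n" for entry in sorted(subset))
    return len(subset)
```

The resulting file can then be given to run_exp_local.sh in place of the full set.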
Pushing results into a database
The generated reports are JSON files that can be parsed once the previous experiments have finished. The provided parsing script pushes some of this information into databases to help further analysis. We provide pre-computed dumps of the databases obtained at this stage. The dumps can be loaded with:
zcat data/results/drebin.sql.gz | sqlite3 data/results/drebin.db
zcat data/results/rasta.sql.gz | sqlite3 data/results/rasta.db
To re-generate the database from the JSON reports of the previous experiments:
./rasta_data_manipulation/make_db.sh ./data
Generating the database requires an AndroZoo API key and a lot of time, because we download the apks to get their total dex size (the value indicated in latest.csv only takes into account the size of classes.dex, not the sum of the sizes of all dex files when there is more than one).
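To make the notion of "total dex size" concrete, here is a minimal sketch of how it could be computed for a downloaded APK. It is not the code used by make_db.sh; it assumes the usual multidex naming at the root of the archive (classes.dex, classes2.dex, classes3.dex, ...), and the function name total_dex_size is hypothetical.

```python
import re
import zipfile

# classes.dex, classes2.dex, ..., classes10.dex, ... at the root of the APK
DEX_NAME = re.compile(r"classes(?:[2-9]|[1-9]\d+)?\.dex")


def total_dex_size(apk_path) -> int:
    """Sum the uncompressed sizes of all top-level dex files in an APK."""
    with zipfile.ZipFile(apk_path) as apk:
        return sum(info.file_size for info in apk.infolist()
                   if DEX_NAME.fullmatch(info.filename))
```

For an APK with a single classes.dex, this returns the same value as the size of that one file; the two only differ for multidex applications.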
Database Usage
Most of the results presented in the paper can be regenerated from the database using the following script:
./rasta_data_manipulation/extract_result.sh ./data
There are 4 tables in the database (apk, tool, exec and error), which we describe in the following.
Apk table
The data related to the apks of the dataset are in the apk table, which has the following columns:
- sha256: The hash of the apk
- first_seen_year: The first year the apk was seen
- apk_size: The total size of the apk
- vt_detection: The number of detections by VirusTotal
- min_sdk: The min SDK indicated by the apk
- max_sdk: The max SDK indicated by the apk
- target_sdk: The target SDK indicated by the apk
- apk_size_decile: The decile of apk size the apk belongs to
- dex_date: The date indicated in the dex file
- pkg_name: The name of the apk
- vt_scan_date: The year when the apk was provided to VirusTotal
- dex_size: The total size of the dex files
- added: The year the apk was added to AndroZoo
- markets: Where the apk was collected
- dex_size_decile: The decile of dex size the apk belongs to
- dex_size_decile_by_year: The decile of dex size for the first_seen_year of the apk
Tool table
The data related to the tools used by the experiment are in the tool table. Its columns are:
- tool_name: The name of the tool
- use_python: Whether the tool uses Python
- use_java: Whether the tool uses Java
- use_scala: Whether the tool uses Scala
- use_ocaml: Whether the tool uses OCaml
- use_ruby: Whether the tool uses Ruby
- use_prolog: Whether the tool uses Prolog
- use_soot: Whether the tool uses Soot
- use_androguard: Whether the tool uses Androguard
- use_apktool: Whether the tool uses Apktool
Exec table
The data related to the execution of an analysis are in the exec table. Its columns are:
- sha256: The hash of the tested apk
- tool_name: The name of the tested tool
- tool_status: The status of the analysis: FAILED, FINISHED, TIMEOUT, OTHER
- time: The duration of the analysis
- exit_status: The exit status code returned by the execution
- timeout: Whether the execution timed out
- max_rss_mem: The memory used by the analysis
There are other values collected by the time command during the analysis:
- avg_rss_mem
- page_size
- kernel_cpu_time
- user_cpu_time
- nb_major_page_fault
- nb_minor_page_fault
- nb_fs_input
- nb_fs_output
- nb_socket_msg_received
- nb_socket_msg_sent
- nb_signal_delivered
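As an illustration of how the exec table can be used, the following sketch computes, for each tool, the share of analyses whose tool_status is FINISHED. It only relies on the tool_name and tool_status columns described here; the function name finished_rate_by_tool is hypothetical and this is not one of the scripts of the repository.

```python
import sqlite3


def finished_rate_by_tool(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    """Return (tool_name, share of runs with status FINISHED), best first."""
    query = """
        SELECT tool_name,
               AVG(CASE WHEN tool_status = 'FINISHED' THEN 1.0 ELSE 0.0 END) AS rate
        FROM exec
        GROUP BY tool_name
        ORDER BY rate DESC, tool_name
    """
    return conn.execute(query).fetchall()
```

With the provided dump loaded, this would be called as finished_rate_by_tool(sqlite3.connect("data/results/rasta.db")).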
Error table
The errors collected during the analysis are stored in the error table.
Not all columns are used, depending on the error_type.
- tool_name: The name of the tool that raised the error
- sha256: The hash of the apk analyzed when the error was raised
- error_type: The type of error (Log4j, Java, Python, Xsb, Ocaml, Log4jSimpleMsg, Ruby)
- error: The name of the error
- msg: The message of the error
- cause: Rough estimation of the cause of the error
- first_line: The line number of the first line of the error in the log
- last_line: The line number of the last line of the error in the log
- logfile_name: The file in which the error was collected (usually stdout and stderr)
- file: The file of the ruby script that raised the error
- line: The line number of the instruction that raised the error
- function: The function that raised the error
- level: The level of the log (e.g. FATAL, CRITICAL)
- origin: The origin of the error (java class referred by log4j)
- raised_info: 'Raised at' information (for Ocaml errors)
- called_info: 'Called from' information (for Ocaml errors)
Database usage
The data can be explored using SQL queries. tool_name and sha256 are the usual foreign keys used for joins. For example, this SQL query gives the average time taken by failed analyses made by tools using Soot, together with the average bytecode size of the applications analyzed, grouped by decile of this size over the whole dataset:
$ sqlite3 data/results/rasta.db
sqlite> SELECT AVG(dex_size), AVG(time)
FROM exec
INNER JOIN apk ON exec.sha256=apk.sha256
INNER JOIN tool ON exec.tool_name=tool.tool_name
WHERE tool.use_soot = TRUE AND exec.tool_status = 'FAILED'
GROUP BY dex_size_decile
ORDER BY AVG(dex_size);
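The same kind of exploration can be scripted with Python's sqlite3 module. The sketch below joins exec to apk on sha256 and to tool on tool_name, and takes the status as a parameter; it assumes use_soot is stored as an integer boolean (in SQLite, TRUE is an alias for 1). The names QUERY and avg_time_by_dex_decile are hypothetical.

```python
import sqlite3

QUERY = """
    SELECT apk.dex_size_decile, AVG(apk.dex_size), AVG(exec.time)
    FROM exec
    INNER JOIN apk ON exec.sha256 = apk.sha256
    INNER JOIN tool ON exec.tool_name = tool.tool_name
    WHERE tool.use_soot = 1 AND exec.tool_status = ?
    GROUP BY apk.dex_size_decile
    ORDER BY AVG(apk.dex_size)
"""


def avg_time_by_dex_decile(conn: sqlite3.Connection, status: str = "FAILED"):
    """Average dex size and run time per dex-size decile for Soot-based tools."""
    return conn.execute(QUERY, (status,)).fetchall()
```

Passing status="FINISHED" instead gives the same breakdown for analyses that completed, which makes the two populations easy to compare.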
Reusing a Specific Tool
If you don't want to use the Docker Hub images, you can build them yourself using:
cd rasta_exp
./build_docker_images.sh ../data/imgs
cd ..
The resulting images are named histausse/rasta-<tool-name>:icsr2024, and the associated environment variables are in rasta_exp/envs/<tool-name>_docker.env. The build_docker_images.sh script can be edited to choose only one tool to build.
After building a tool, a container can be entered interactively by doing:
docker run --rm --env-file=rasta_exp/envs/mallodroid_docker.env -v /tmp/mnt:/mnt -it histausse/rasta-mallodroid:icsr2024 bash
Here, /tmp/mnt is mounted to /mnt in the container. Put the apk in /tmp/mnt to analyze it.
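Since the image name and the env file both follow the <tool-name> pattern, the command line can be generated for any tool. The helper below is a hypothetical convenience, not part of the repository; it only rebuilds the docker run invocation shown for mallodroid and assumes the same naming holds for the other tools.

```python
import shlex


def docker_run_command(tool: str, host_dir: str = "/tmp/mnt") -> list[str]:
    """Build the interactive `docker run` invocation for one tool's image."""
    return [
        "docker", "run", "--rm",
        f"--env-file=rasta_exp/envs/{tool}_docker.env",
        "-v", f"{host_dir}:/mnt",
        "-it", f"histausse/rasta-{tool}:icsr2024", "bash",
    ]
```

The list can be passed to subprocess.run, or printed with shlex.join(docker_run_command("mallodroid")) to obtain a command that can be pasted into a shell.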
To run the analysis of the APK, run /run.sh <apk>, where <apk> is the name of the apk in /mnt, without the /mnt prefix. The artifacts of the analysis are stored in /mnt, including the stdout, stderr and result of the time command.
root@e3c39c14e382:/# ls /mnt
E29CCE76464767F97DAE039DBA0A0AAE798DF1763AD02C6B4A45DE81762C23DA.apk
root@e3c39c14e382:/# /run.sh E29CCE76464767F97DAE039DBA0A0AAE798DF1763AD02C6B4A45DE81762C23DA.apk
root@e3c39c14e382:/# ls /mnt/
E29CCE76464767F97DAE039DBA0A0AAE798DF1763AD02C6B4A45DE81762C23DA.apk report stderr stdout
The report directory contains the result of the time command. The stdout and stderr files contain the execution trace of the tool on the APK. If extra files are generated by the tool, you should find them in this directory.
The run.sh script can be customized to modify the run parameters used for this tool. The script that is copied into the Docker image is located at rasta_exp/docker/<tool name>/home_build/run.sh.
Dockerhub images
The docker images are available on dockerhub under the names:
histausse/rasta-adagio:icsr2024
histausse/rasta-amandroid:icsr2024
histausse/rasta-anadroid:icsr2024
histausse/rasta-androguard-dad:icsr2024
histausse/rasta-androguard:icsr2024
histausse/rasta-apparecium:icsr2024
histausse/rasta-blueseal:icsr2024
histausse/rasta-dialdroid:icsr2024
histausse/rasta-didfail:icsr2024
histausse/rasta-droidsafe:icsr2024
histausse/rasta-flowdroid:icsr2024
histausse/rasta-gator:icsr2024
histausse/rasta-ic3-fork:icsr2024
histausse/rasta-ic3:icsr2024
histausse/rasta-iccta:icsr2024
histausse/rasta-mallodroid:icsr2024
histausse/rasta-redexer:icsr2024
histausse/rasta-saaf:icsr2024
histausse/rasta-wognsen:icsr2024
LICENSE
This repository is licensed under the GPLv3. Please note that this license does not apply to the tested tools.
Remember, this program is provided "as is" without warranty of any kind.