README review.

Jean-Francois Lalande 2023-11-16 16:41:47 +01:00
parent cd1e91bb99
commit b3aaf28ebf


@ -1,10 +1,14 @@
# RASTA
Reproducibility of the Rasta experiment.
Rasta stands for Reproducibility of Android Static Tools and Analysis.
This repository contains the source code for reproducing the experiments of the paper "Evaluating the Re-Usability of Android Static Analysis Tools". The provided source code makes it possible to rebuild Docker and Singularity images for several static analysis tools from the literature. The Docker images provide an interactive container to the user for analyzing an APK file. The Singularity images help to run batch analyses for a dataset of applications on a Singularity cluster. Additionally, the source code contains scripts for extracting the status of each APK analysis (failed/finished) and some characteristics (time, memory), and for pushing these values into a database for further statistics.
The input data and pre-computed output data are provided from [outside this repository](https://zenodo.org/records/10137905).
## Data
Some data are needed to reproduce the experiment (at the very least, the androzoo indexes we used to sample our dataset). Those data are too heavy to be stored in a git, so they need to be download from zenodo to the root of this repository:
Some data are needed to reproduce the experiment (at least, the androzoo indexes we used to sample our dataset). Those data are too heavy to be stored in a git repository, so they need to be downloaded from zenodo to the root of this repository:
```
curl 'https://zenodo.org/records/10137905/files/rasta_data_v1.0.tgz?download=1' | tar -xz
@ -30,17 +34,24 @@ pip install rasta_data_manipulation/
pip install -r rasta_exp/requirements.txt
```
From now end, all commands are run from inside the venv.
From now on, all commands are run from inside this venv.
## Dataset
## Re-generating datasets
The datasets we used (Drebin and Rasta, split in 10 balanced sets) are in `data/dataset`.
The datasets we used (Drebin and Rasta, split in 10 balanced sets) are located in `data/dataset`:
To reproduce the generation of the dataset, `latest.csv.gz` and `year_and_sdk.csv.gz` are required: `rasta-gen-dataset data/androzoo/latest.csv.gz data/androzoo/year_and_sdk.csv.gz -o data/dataset` (this will no generate the drebin dataset)
- Drebin: drebin
- Rasta: set0, set1, ..., set9
It is possible to reproduce the generation of these datasets, using the `latest.csv.gz` and `year_and_sdk.csv.gz` files that come from AndroZoo. Use the following command to regenerate the Rasta dataset:
```
rasta-gen-dataset data/androzoo/latest.csv.gz data/androzoo/year_and_sdk.csv.gz -o data/dataset
```
## Container Images
The containers are stored in `data/imgs`. They can be regenerated with
The containers are stored in `data/imgs`. They can be regenerated with:
```
cd rasta_exp
@ -48,13 +59,13 @@ cd rasta_exp
cd ..
```
(The container images will be released with the final release)
(To avoid rebuilding these containers, we will upload them to the Docker hub repository when the paper is published.)
The container and binary of Perfchecker is not provided as this tool is only available on demand.
The container and binary of Perfchecker are not provided, as the Perfchecker binary is only available on demand.
## Experiment
## Running experiments
The results of the experiment are stored in `data/results/archives/`. They can be extracted with:
The results of the experiments are stored in `data/results/archives/`. They can be extracted with:
```
mkdir -p data/results/reports/rasta
@ -63,9 +74,9 @@ for archive in $(ls data/results/archives/status_set*.tgz); do tar -xzf ${archiv
tar -xzf data/results/archives/status_drebin.tgz --directory data/results/reports/drebin
```
They can also be regenerated by running the experiment.
They can also be regenerated by recomputing our experiments.
To run the experiment local, first you must set the `settings.ini` file in `rasta_exp`. Replacing it by this is enough (don't forget to replace `<KEY>` by your AndroZoo key):
To run the experiment using a Singularity image hosted on your own computer, you must simplify the `settings.ini` file, which is intended to run on a Singularity cluster. This file is located in the `rasta_exp` directory. The following 3 lines are sufficient to configure the experiment for running on your local computer:
```
[AndroZoo]
@ -73,7 +84,9 @@ apikey = <KEY>
base_url = https://androzoo.uni.lu
```
Then, you can run the experiment with:
Do not forget to replace `<KEY>` by your AndroZoo key.
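The key substitution can also be scripted. The sketch below is a minimal example, not part of the repository: `set_androzoo_key` is a hypothetical helper name, and it assumes the settings file contains the literal `<KEY>` placeholder and that the key contains no `|`, `&` or `\` characters.

```shell
# Hypothetical helper (not provided by the repository): replace the literal
# <KEY> placeholder of a settings file with an AndroZoo API key.
# Assumption: the key contains no '|', '&' or '\' characters.
set_androzoo_key() {
    settings_file="$1"
    api_key="$2"
    tmp_file="$(mktemp)" || return 1
    # Write the substituted content to a temporary file, then move it in place.
    sed "s|<KEY>|${api_key}|" "$settings_file" > "$tmp_file" \
        && mv "$tmp_file" "$settings_file"
}

# Example usage (the environment variable name is an assumption):
#   set_androzoo_key rasta_exp/settings.ini "$ANDROZOO_KEY"
```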
Then, you can run the experiment for all tools and for the Rasta and Drebin datasets with:
```
./rasta_exp/run_exp_local.sh ./data/imgs ./data/dataset/drebin ./data/results/reports/drebin/status_drebin
@ -82,20 +95,22 @@ for i in {0..9}; do
done;
```
(This takes a lot of times)
This takes a lot of time, probably several months. You should adapt this last script to reduce either:
## Database
- the number of static analysis tools to evaluate
- the dataset size
- other parameters in the source code such as the timeout
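As an illustration of reducing the dataset size, the sketch below keeps only the first entries of a dataset file. It assumes that each file in `data/dataset` lists one APK per line, which this README does not confirm; `make_subset` and the `/tmp` paths are hypothetical names chosen for the example.

```shell
# Hypothetical helper: keep the first <count> lines of a dataset file.
# Assumption: the dataset file lists one APK per line.
make_subset() {
    src="$1"
    count="$2"
    dest="$3"
    head -n "$count" "$src" > "$dest"
}

# Example usage (paths under /tmp are illustrative):
#   make_subset ./data/dataset/set0 100 /tmp/set0_small
#   ./rasta_exp/run_exp_local.sh ./data/imgs /tmp/set0_small /tmp/status_set0_small
```

Taking the first lines is deterministic but not a random sample, so the subset may not preserve the balance of the original sets.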
The reports are parsed into databases to help analyzing them. The database can be extracted from their dumps or generated from the reports and dataset.
## Pushing results into a database
To extract the dumps:
The generated reports are JSON files that can be parsed once the previous experiments have finished. The provided parsing script helps to push some information into databases for further analysis. We provide pre-computed dumps of the databases that can be obtained at this stage. The databases can be restored from these dumps with:
```
zcat data/results/drebin.sql.gz | sqlite3 data/results/drebin.db
zcat data/results/rasta.sql.gz | sqlite3 data/results/rasta.db
```
To generate the databases:
To re-generate the database from the JSON reports of the previous experiments:
```
./rasta_data_manipulation/make_db.sh ./data
@ -106,19 +121,17 @@ Generating the database require an androzoo API key and a lot of times because w
## Database Usage
Most of the results used in the paper can be extracted with:
Most of the results presented in the paper can be extracted with:
```
./rasta_data_manipulation/extract_result.sh ./data
```
They are 4 tables in the database, `apk`, `tool`, `exec` and `error`.
There are 4 tables in the database, `apk`, `tool`, `exec` and `error`, which we describe in the following.
### Apk table
The data related to the apks of the dataset are in the `apk` table.
The entry of the `apk` table have the columns:
The data related to the APKs of the dataset are in the `apk` table, which has the following columns:
- `sha256`: The hash of the apk
- `first_seen_year`: The first year the apk has been seen
@ -139,9 +152,7 @@ The entry of the `apk` table have the columns:
### Tool table
The data related to the tools used by the experiment are in the `tool` table.
Its columns are:
The data related to the tools used by the experiment are in the `tool` table. Its columns are:
- `tool_name`: The name of the tool
- `use_python`: If the tool uses python
@ -156,7 +167,7 @@ Its columns are:
### Exec table
The data related to the execution of an analysis are in the `exec` table.
The data related to the execution of an analysis are in the `exec` table. Columns are:
- `sha256`: The hash of the tested apk
- `tool_name`: The name of the tested tool
@ -167,6 +178,7 @@ The data related to the execution of an analysis are in the `exec` table.
- `max_rss_mem`: The memory used by the analysis
There are other values collected by `time` during the analysis:
- `avg_rss_mem`
- `page_size`
- `kernel_cpu_time`
@ -201,13 +213,9 @@ All columns are not used, depending on the `error_type`.
- `raised_info`: 'Raised at' information (for Ocaml errors)
- `called_info`: 'Called from' information (for Ocaml errors)
### Usage
### Database usage
The data can be explored using SQL queries. `tool_name` and `sha256` are the usual foreign keys used for joins.
#### Exemple:
This SQL query gives the average time taken by an analysis made by tool using soot, associated with the average size of bytecode of the applications analysed, grouped by deciles of this size on the whole dataset:
The data can be explored using SQL queries. `tool_name` and `sha256` are the usual foreign keys used for joins. For example, this SQL query gives the average time taken by an analysis made by tools using soot, associated with the average size of bytecode of the applications analyzed, grouped by deciles of this size on the whole dataset:
```
$ sqlite3 data/results/rasta.db
@ -222,17 +230,25 @@ ORDER BY AVG(dex_size);
## Reusing a Specific Tool
The containers are not on docker hub yet, so they need to be built using `build_docker_images.sh`. The images are named `rasta-<tool-name>`, and the environment variables associated are in `rasta_exp/envs/<tool-name>_docker.env`.
The containers are not on docker hub yet, so they need to be built using:
To enter a container, run:
```
cd rasta_exp
./build_docker_images.sh ../data/imgs
cd ..
```
The obtained images are named `rasta-<tool-name>`, and the associated environment variables are in `rasta_exp/envs/<tool-name>_docker.env`. The `build_docker_images.sh` script can be edited to choose only one tool to be built.
After building a tool, a container can be entered interactively by doing:
```
docker run --rm --env-file=rasta_exp/envs/mallodroid_docker.env -v /tmp/mnt:/mnt -it rasta-mallodroid bash
```
Here, `/tmp/mnt` is mounted to `/mnt` in the container. Put the `apk` to analyze in it.
Here, `/tmp/mnt` is mounted to `/mnt` in the container. Put the `apk` in `/tmp/mnt` to analyze it.
To run the analysis, run `/run.sh <apk>` where `<apk>` is the name of the apk in `/mnt`, without the `/mnt` prefix. The artifact of the analysis are stored in `/mnt`, including the `stdout`, `stderr` and result of the `time` command.
To run the analysis of the APK, run `/run.sh <apk>` where `<apk>` is the name of the apk in `/mnt`, without the `/mnt` prefix. The artifacts of the analysis are stored in `/mnt`, including the `stdout`, `stderr` and the result of the `time` command.
```
root@e3c39c14e382:/# ls /mnt
@ -241,3 +257,8 @@ root@e3c39c14e382:/# /run.sh E29CCE76464767F97DAE039DBA0A0AAE798DF1763AD02C6B4A4
root@e3c39c14e382:/# ls /mnt/
E29CCE76464767F97DAE039DBA0A0AAE798DF1763AD02C6B4A45DE81762C23DA.apk report stderr stdout
```
The `run.sh` script can be customized to modify the run parameters used for this tool. The script that is copied into the Docker image is located at `rasta_exp/docker/<tool name>/home_build/run.sh`.
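The interactive steps above could also be combined into a single non-interactive command. The sketch below only prints such a command, following the image and env-file naming described above. `rasta_docker_cmd` is a hypothetical helper, and running `/run.sh` directly as the container command is an assumption that has not been checked against the images.

```shell
# Hypothetical helper: print a docker command that analyzes one APK with a tool.
# Assumptions: the image is named rasta-<tool-name>, the env file is
# rasta_exp/envs/<tool-name>_docker.env, and /run.sh can be used as the
# container command instead of an interactive bash session.
rasta_docker_cmd() {
    tool="$1"       # tool name, e.g. mallodroid
    mount_dir="$2"  # host directory containing the APK, mounted to /mnt
    apk="$3"        # APK file name, without the /mnt prefix
    printf 'docker run --rm --env-file=rasta_exp/envs/%s_docker.env -v %s:/mnt rasta-%s /run.sh %s\n' \
        "$tool" "$mount_dir" "$tool" "$apk"
}

# Example usage: print the command, review it, then run it by hand.
#   rasta_docker_cmd mallodroid /tmp/mnt app.apk
```

Printing the command instead of executing it keeps the sketch free of side effects and makes it easy to inspect before use.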