thesis/2_background/4_datasets.typ

23 lines
2.5 KiB
Typst

#import "../lib.typ": todo, etal, APK
== Application Datasets <sec:bg-datasets>
Computing if an application contains a possible information flow is an example of a static analysis goal.
Some datasets have been built especially for evaluating tools that are computing information flows inside Android applications.
One of the first well known dataset is DroidBench, that was released with the tool Flowdroid~@Arzt2014a.
Later, the dataset ICC-Bench was introduced with the tool Amandroid~@weiAmandroidPreciseGeneral2014 to complement DroidBench by introducing applications using Inter-Component data flows.
These datasets contain carefully crafted applications containing flows that the tools should be able to detect.
These hand-crafted applications can also be used for testing purposes or to detect any regression when the software code evolves.
Contrary to real world applications, the behavior of these hand-crafted applications is known in advance, thus providing the ground truth that the tools try to compute.
However, these datasets are not representative of real-world applications~@Pendlebury2018 and the obtained results can be misleading.
Contrary to DroidBench and ICC-Bench, some approaches use real-world applications.
Bosu #etal~@bosuCollusiveDataLeak2017 use DIALDroid to perform a threat analysis of Inter-Application communication and published DIALDroid-Bench, an associated dataset.
Similarly, Luo #etal released TaintBench~@luoTaintBenchAutomaticRealworld2022 a real-world dataset and the associated recommendations to build such a dataset.
These datasets are useful for carefully spotting missing taint flows, but contain only a few dozen of applications.
In addition to those datasets, Androzoo~@allixAndroZooCollectingMillions2016 collect applications from several application market places, including the Google Play store (the official Google application store), Anzhi and AppChina (two chinese stores), or FDroid (a store dedicated to free and open source applications).
Currently, Androzoo contains more than 25 millions applications, that can be downloaded by researchers from the SHA256 hash of the application.
Androzoo provide additionnal information about the applications, like the date the application was detected for the first time by Androzoo or the number of antivirus from VirusTotal that flaged the application as malicious.
In addition to providing researchers with an easy access to real world applications, Androzoo make it a lot easier to share datasets for reproducibility: instead of sharing hundreds of #APK files, the list of SHA256 is enough.