diff --git a/3_rasta/4_discussion.typ b/3_rasta/4_discussion.typ index e4dc058..dc08693 100644 --- a/3_rasta/4_discussion.typ +++ b/3_rasta/4_discussion.typ @@ -1,389 +1,434 @@ +#import "@local/template-thesis-matisse:0.0.1": todo, etal +#import "X_var.typ": * +#import "X_lib.typ": * + == Discussion -\subsection{State-of-the-art comparison} +=== State-of-the-art comparison -Our finding are consistent with the numerical results of Pauck {\it et al.} that showed that \mypercent{106}{180} of DIALDroid-Bench~\cite{bosuCollusiveDataLeak2017} real-world applications are analyzed successfully with the 6 evaluated tools~\cite{pauckAndroidTaintAnalysis2018}. -Six years after the release of DIALDroid-Bench, we obtain a lower ratio of \mypercent{40.05}{100} for the same set of 6 tools but using the Rasta dataset of \NBTOTALSTRING applications. -We extended this result to a set of \nbtoolsvariationsrun\xspace tools and obtained a global success rate of \resultratio. We confirmed that most tools require a significant amount of work to get them running~\cite{reaves_droid_2016}. -%Our investigations of crashes also confirmed that dependencies to older versions of Apktool are impacting the performances of Anadroid, Saaf and Wognsen {\it et al.} in addition to DroidSafe and IccTa, already identified by Pauck {\it et al.}. -% +Our findings are consistent with the numerical results of Pauck #etal that showed that #mypercent(106, 180) of DIALDroid-Bench@bosuCollusiveDataLeak2017 real-world applications are analyzed successfully with the 6 evaluated tools@pauckAndroidTaintAnalysis2018. +Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications. +We extended this result to a set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio. We confirmed that most tools require a significant amount of work to get them running@reaves_droid_2016. 
+// Our investigations of crashes also confirmed that dependencies to older versions of Apktool are impacting the performances of Anadroid, Saaf and Wognsen #etal in addition to DroidSafe and IccTa, already identified by Pauck #etal. -Investigating the reason behind tools' errors is a difficult task and will be investigated in a future work. For now, our manual investigations show that the nature of errors varies from one analysis to another, without any easy solution for the end user for fixing it. +Investigating the reasons behind the tools' errors is a difficult task that we leave for future work. +For now, our manual investigations show that the nature of errors varies from one analysis to another, with no easy fix available to the end user. - -\subsection{Recommendations} +=== Recommendations Finally, we summarize some takeaways that developers should follow to improve the success of reusing their developed software. -For improving the reliability of their software, developers should use classical development best practices, for example continuous integration, testing, code review. For improving the reusability developers should - write a documentation about the tool usage and provide a minimal working example and describe the expected results. Interactions with the running environment should be minimized, for example by using a docker container, a virtual environment or even a virtual machine. Additionally, a small dataset -should be provided for a more extensive test campaign and the publishing of the expected result on this dataset would ensure to be able to evaluate the reproducibility of experiments. +To improve the reliability of their software, developers should follow classical development best practices, for example continuous integration, testing and code review. +To improve reusability, developers should document the tool usage, provide a minimal working example and describe the expected results. 
+Interactions with the running environment should be minimized, for example by using a Docker container, a virtual environment or even a virtual machine. +Additionally, a small dataset should be provided for a more extensive test campaign, and publishing the expected results on this dataset would make it possible to evaluate the reproducibility of experiments. -Finally, an important remark concerns the libraries used by a tool. We have seen two types of libraries: - a)~internal libraries manipulating internal data of the tool; - b)~external libraries that are used to manipulate the input data (APKs, bytecode, resources). -We observed by our manual investigations that external libraries are the ones leading to crashes because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files). We believe that the developer should provide enough documentation to make possible a later upgrade of these external libraries. -%: for example, old versions of apktool are the top most libraries raising errors. +Finally, an important remark concerns the libraries used by a tool. +We have seen two types of libraries: +- internal libraries manipulating internal data of the tool; +- external libraries that are used to manipulate the input data (APKs, bytecode, resources). +Our manual investigations showed that external libraries are the ones leading to crashes, because of variations in recent APKs (file format, unknown bytecode instructions, multi-DEX files). +We believe that the developer should provide enough documentation to make a later upgrade of these external libraries possible. +//: for example, old versions of apktool are the top most libraries raising errors. -\subsection{Threats to validity} +=== Threats to validity Our application dataset is biased in favor of Androguard, because Androzoo have already used Androguard internally when collecting applications and discarded any application that cannot be processed with this tool. 
-Despite our best efforts, it is possible that we made mistakes when building or using the tools. It is also possible that we wrongly classified a result as a failure. To mitigate this possible problem we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion. %Before running the final experiment, we also ran the tools on a subset of our dataset and looked manually the most common errors to ensure that they are not trivial errors that can be solved. +Despite our best efforts, it is possible that we made mistakes when building or using the tools. +It is also possible that we wrongly classified a result as a failure. +To mitigate this risk, we contacted the authors of the tools to confirm that we used the right parameters and chose a valid failure criterion. +// Before running the final experiment, we also ran the tools on a subset of our dataset and looked manually the most common errors to ensure that they are not trivial errors that can be solved. -The timeout value, amount of memory are arbitrarily fixed. For mitigating their effect, a small extract of our dataset has been analyzed with more memory/time for measuring any difference. +The timeout value and the amount of memory are fixed arbitrarily. +To mitigate their effect, a small extract of our dataset was analyzed with more memory and time to measure any difference. -Finally, the use of VirusTotal for determining if an application is a malware or not may be wrong. For limiting this impact, we used a threshold of at most 5 antiviruses (resp. no more than 0) reporting an application as being a malware (resp. goodware) for taking a decision about maliciousness (resp. benignness). +Finally, relying on VirusTotal to determine whether an application is malware may yield wrong labels. +For limiting this impact, we used a threshold of at most 5 antiviruses (resp. no more than 0) reporting an application as being a malware (resp. 
goodware) for taking a decision about maliciousness (resp. benignness). +/* +== Discussion -% -%\section{Discussion} -%\label{sec:discussion} -% -%\newcommand{\mrc}[1]{\rotcell{\makebox[0pt][l]{#1}}} -% -%\settowidth\rotheadsize{androguarda} -% -%%\newcommand{\mynum}[1]{% -%% \ifthenelse{\equal{\first}{}}{\num[round-mode=places,round-precision=1]{#1}}{\textbf{\num[round-mode=places,round-precision=1]{#1}}} -%%} -%% -%% -%%\newcommand{\mynums}[1]{% -%% \ifthenelse{\equal{\first}{}}{\num[round-mode=places,round-precision=0]{#1}}{\textbf{\num[round-mode=places,round-precision=0]{#1}}} -%%} -%\newcommand{\mynum}[1]{\num[round-mode=places,round-precision=1]{#1}} -% -% -% -%\newcommand{\mynums}[1]{\num[round-mode=places,round-precision=0]{#1}} -% -% -% -%\newcommand{\mynumm}[1]{\num[round-mode=places,round-precision=1]{#1}} -% -% -% \begin{table*}[tb] -% \scriptsize -% \caption{Average number of errors, analysis time, memory per unitary analysis -- compared by exit status } -% \label{tab:avgerror} -% -% \begin{tabular}{r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r} -% \toprule -% Exit status& & \mrc{adagio} & \mrc{amandroid} & \mrc{anadroid} & \mrc{androguard} & \mrc{androguard\_dad} & \mrc{apparecium}& \mrc{blueseal} &\mrc{dialdroid}& \mrc{didfail}& \mrc{droidsafe}& \mrc{flowdroid}& \mrc{gator}& \mrc{ic3}& \mrc{ic3\_fork}& \mrc{iccta}& \mrc{mallodroid}& \mrc{perfchecker}& \mrc{redexer}& \mrc{saaf}& \mrc{wognsen\_et\_al} \\ -% \midrule -% & \multicolumn{21}{c}{\bf Average number of errors (and standard deviation)} \\ -%\cline{2-22} -% \csvreader[ -% late after line = \\, -% %separator=semicolon, -% head to column names, -% ]{average_number_of_error_by_exec.csv}{}{% -% \first & \type & \mynum{\adagio} & \mynum{\amandroid} & \mynum{\anadroid} & \mynum{\androguard} & \mynum{\androguarddad} & \mynum{\apparecium} & \mynum{\blueseal} & \mynum{\dialdroid} & \mynum{\didfail} & \mynum{\droidsafe} & \mynum{\flowdroid} & \mynum{\gator} & \mynum{\ic} & \mynum{\icfork} & \mynum{\iccta} 
& \mynum{\mallodroid} & \mynum{\perfchecker} & \mynum{\redexer} & \mynum{\saaf} & \mynum{\wognsenetal} -% }% -%\midrule -% & \multicolumn{21}{c}{\bf Average time (s)} \\ -%\cline{2-22} -% \csvreader[ -%late after line = \\, -%%separator=semicolon, -%head to column names, -%]{average_time-final.csv}{}{% -% \first & \type & \mynums{\adagio} & \mynums{\amandroid} & \mynums{\anadroid} & \mynums{\androguard} & \mynums{\androguarddad} & \mynums{\apparecium} & \mynums{\blueseal} & \mynums{\dialdroid} & \mynums{\didfail} & \mynums{\droidsafe} & \mynums{\flowdroid} & \mynums{\gator} & \mynums{\ic} & \mynums{\icfork} & \mynums{\iccta} & \mynums{\mallodroid} & \mynums{\perfchecker} & \mynums{\redexer} & \mynums{\saaf} & \mynums{\wognsenetal} -%}% -%\midrule -%& \multicolumn{21}{c}{\bf Average Memory (GB)} \\ -%\cline{2-22} -%\csvreader[ -%late after line = \\, -%%separator=semicolon, -%head to column names, -%]{average_mem-final.csv}{}{% -% \first & \type & \mynumm{\adagio} & \mynumm{\amandroid} & \mynumm{\anadroid} & \mynumm{\androguard} & \mynumm{\androguarddad} & \mynumm{\apparecium} & \mynumm{\blueseal} & \mynumm{\dialdroid} & \mynumm{\didfail} & \mynumm{\droidsafe} & \mynumm{\flowdroid} & \mynumm{\gator} & \mynumm{\ic} & \mynumm{\icfork} & \mynumm{\iccta} & \mynumm{\mallodroid} & \mynumm{\perfchecker} & \mynumm{\redexer} & \mynumm{\saaf} & \mynumm{\wognsenetal} -%}% -% \bottomrule -% \end{tabular} -%\end{table*} -% -% -%In this section, we investigate the reasons behind the high ratio of failures presented in Section~\ref{sec:xp}. Table~\ref{tab:avgerror} reports the average number of errors, the average time and memory consumption of the analysis of one APK file. We also compare our conclusions to the ones of the literature. 
-% -% -% -%\subsection{Failures Analysis} -%\label{sec:failure-analysis} -%%capture erreurs -%%fichiers -%%stdout, stderr -%%(only 4%) -%%android.jar en version 9 qui génère des erreurs -% -%During the running of our experiments we parse the standard output and error to capture: -% -%\begin{itemize} -% \item Java errors and stack traces -% \item Python errors and stack traces -% \item Ruby errors and stack traces -% \item Log4j messages with a ``ERROR'' or ``FATAL'' level -% \item XSB error messages -% \item Ocaml errors -%\end{itemize} -% -%% For example, Dialdroid reports in average \num{55.9} errors for one successful analysis. -%% On the contrary, some tools such as Blueseal report very few error at a time, making it easier to identify the cause of the failure. -% -%Because some tools send back a high number of errors in our logs (up to \num{46698} for one execution), we tried to determine the error that is linked to the failed status. -%Unfortunately, our manual investigations confirmed that the last error of a log output is not always the one that should be attributed to the global failure of the analysis. -%The error that seems to generate the failure can occur in the middle of the execution, be caught by the code and then other subsequent parts of the code may generate new errors as consequences of the first one. -%Similarly, the first error of a log is not always the cause of a failure. -%Sometimes errors successfully caught and handled are logged anyway. -%Thus, it is impossible to extract accurately the error responsible for a failed execution. -%Therefore, we investigated the nature of errors globally, without distinction between error messages in a log. 
-% -%\begin{figure*} -% \includegraphics[width=0.7\linewidth]{figs/repartition-of-error-types-among-tools.pdf} -% \caption{Heatmap of the ratio of errors reasons for all tools for the Rasta dataset} -% \label{fig:heatmap} -%\end{figure*} -% -%Figure~\ref{fig:heatmap} draws the most frequent error objects for each of the tools. -%A black square is an error type that represents more than 80\% of the errors raised by the considered tool. -%In between, gray squares show a ratio between 20\% and 80\% of the reported errors. -% -%First, the heatmap helps us to confirm that our experiments is running in adequate conditions. -%Regarding errors linked to memory, two errors should be investigated: \jv{OutOfMemoryError} and \jv{StackOverflowError}. -%The first one only appears for gator with a low ratio. Several tool have a low ratio of errors concerning the stack. -%These results confirm that the allocated heap and stack is sufficient for running the tools with the Rasta dataset. -%Regarding errors linked to the disk space, we observe few ratios for the exception \jv{IOException}, \jv{FileNotFoundError} and \jv{FileNotFoundException}. -%Manual inspections revealed that those errors are often a consequence of a failed apktool execution. -% -%Second, the black squares indicate frequent errors that need to be investigated separately. -%In the rest of this section, we manually analyzed, when possible, the code that generates this high ratio of errors and we give feedback about the possible causes and difficulties to write a bug fix. 
-% -% -% -%% Dialdroid: TODO -%% com.google.common.util.concurrent.ExecutionError -> memory error: java.lang.StackOverflowError, java.lang.OutOfMemoryError: Java heap space, java.lang.OutOfMemoryError: GC overhead limit exceeded +#figure({ + show table: set text(size: 0.50em) + show table.cell.where(y: 0): it => if it.x == 0 { it } else { rotate(-90deg, reflow: true, it) } + show table.cell.where(x: 0): it => text(size: 1.2em, it) + show "sigma": $sigma$ + let time_num(n) = num(calc.round(float(n), digits: 0)) + table( + columns: (auto, auto, ..for i in range(20) { (1fr,) }), + //inset: (x: 0% + 5pt, y: 0% + 2pt), + stroke: none, + align: center+horizon, + table.header( + [*Exit status*], [], + [adagio], [amandroid], [anadroid], [androguard], + [androguard_dad], [apparecium], [blueseal], [dialdroid], + [didfail], [droidsafe], [flowdroid], [gator], [ic3], + [ic3_fork], [iccta], [mallodroid], [perfchecker], + [redexer], [saaf], [wognsen_et_al], + ), + ..for i in range(2, 22) { + (table.vline(x: i, end: 3), + table.vline(x: i, start: 4)) + }, + + table.cell(colspan: 22, inset: 3pt)[], + table.hline(), + table.cell(colspan: 22)[*Average number of errors (and standard deviation)*], + table.hline(), + table.cell(colspan: 22, inset: 3pt)[], + + ..rasta_avg_nb_error_by_exec + .map(e => { + let row = ([#e.type], + num(e.adagio), num(e.amandroid), num(e.anadroid), num(e.androguard), + num(e.androguarddad), num(e.apparecium), num(e.blueseal), num(e.dialdroid), + num(e.didfail), num(e.droidsafe), num(e.flowdroid), num(e.gator), num(e.ic), + num(e.icfork), num(e.iccta), num(e.mallodroid), num(e.perfchecker), + num(e.redexer), num(e.saaf), num(e.wognsenetal) + ) + if e.first != "" { row.insert(0, table.cell(rowspan:2)[*#e.first*]) } + row + }).flatten(), + + table.cell(colspan: 22, inset: 3pt)[], + table.hline(), + table.cell(colspan: 22)[*Average time (s)*], + table.hline(), + table.cell(colspan: 22, inset: 3pt)[], + + ..rasta_avg_time + .map(e => { + let row = ([*#e.first*], + 
time_num(e.adagio), time_num(e.amandroid), time_num(e.anadroid), time_num(e.androguard), + time_num(e.androguarddad), time_num(e.apparecium), time_num(e.blueseal), time_num(e.dialdroid), + time_num(e.didfail), time_num(e.droidsafe), time_num(e.flowdroid), time_num(e.gator), time_num(e.ic), + time_num(e.icfork), time_num(e.iccta), time_num(e.mallodroid), time_num(e.perfchecker), + time_num(e.redexer), time_num(e.saaf), time_num(e.wognsenetal) + ) + if e.type != "" { row.insert(1, table.cell(rowspan:3)[*#e.type*]) } + row + }).flatten(), + + table.cell(colspan: 22, inset: 3pt)[], + table.hline(), + table.cell(colspan: 22)[*Average Memory (GB)*], + table.hline(), + table.cell(colspan: 22, inset: 3pt)[], + + ..rasta_avg_mem + .map(e => { + let row = ([*#e.first*], + num(e.adagio), num(e.amandroid), num(e.anadroid), num(e.androguard), + num(e.androguarddad), num(e.apparecium), num(e.blueseal), num(e.dialdroid), + num(e.didfail), num(e.droidsafe), num(e.flowdroid), num(e.gator), num(e.ic), + num(e.icfork), num(e.iccta), num(e.mallodroid), num(e.perfchecker), + num(e.redexer), num(e.saaf), num(e.wognsenetal) + ) + if e.type != "" { row.insert(1, table.cell(rowspan:3)[*#e.type*]) } + row + }).flatten(), + + table.cell(colspan: 22, inset: 3pt)[], + table.hline(), + ) + [ + ]}, + caption: [Average number of errors, analysis time, memory per unitary analysis -- compared by exit status], +) + + + + +In this section, we investigate the reasons behind the high ratio of failures presented in @sec:rasta-xp. +@tab:rasta-avgerror reports the average number of errors, the average time and memory consumption of the analysis of one APK file. +We also compare our conclusions to the ones of the literature. 
+ +=== Failures Analysis + +/* +error capture +files +stdout, stderr +(only 4%) +android.jar version 9 generates errors +*/ + +While running our experiments, we parse the standard output and standard error to capture: + +- Java errors and stack traces +- Python errors and stack traces +- Ruby errors and stack traces +- Log4j messages with an "ERROR" or "FATAL" level +- XSB error messages +- OCaml errors + +For example, Dialdroid reports on average #num(55.9) errors for one successful analysis. +On the contrary, some tools such as Blueseal report very few errors at a time, making it easier to identify the cause of the failure. + +Because some tools send back a high number of errors in our logs (up to #num(46698) for one execution), we tried to determine the error that is linked to the failed status. +Unfortunately, our manual investigations confirmed that the last error of a log output is not always the one that should be attributed to the global failure of the analysis. +The error that seems to generate the failure can occur in the middle of the execution and be caught by the code, after which subsequent parts of the code may generate new errors as consequences of the first one. +Similarly, the first error of a log is not always the cause of a failure. +Sometimes errors successfully caught and handled are logged anyway. +Thus, it is impossible to accurately extract the error responsible for a failed execution. +Therefore, we investigated the nature of errors globally, without distinction between error messages in a log. + +#todo()[alt text for rasta-heatmap] + +#figure( + image( + "figs/repartition-of-error-types-among-tools.svg", + width: 80%, + alt: "", + ), + caption: [Heatmap of the ratio of error causes for all tools on the Rasta dataset], +) + +@fig:rasta-heatmap shows the most frequent error objects for each of the tools. 
+A black square is an error type that represents more than 80% of the errors raised by the considered tool. In between, gray squares show a ratio between 20% and 80% of the reported errors. + +First, the heatmap helps us to confirm that our experiments ran in adequate conditions. +Regarding errors linked to memory, two errors should be investigated: `OutOfMemoryError` and `StackOverflowError`. +The first one only appears for gator with a low ratio. +Several tools have a low ratio of errors concerning the stack. +These results confirm that the allocated heap and stack are sufficient for running the tools with the Rasta dataset. +Regarding errors linked to the disk space, we observe low ratios for the exceptions `IOException`, `FileNotFoundError` and `FileNotFoundException`. +Manual inspections revealed that those errors are often a consequence of a failed apktool execution. + +Second, the black squares indicate frequent errors that need to be investigated separately. +In the rest of this section, we manually analyze, when possible, the code that generates this high ratio of errors, and we give feedback about the possible causes and the difficulty of writing a bug fix. + +/* +Dialdroid: TODO + com.google.common.util.concurrent.ExecutionError -> memory error: java.lang.StackOverflowError, java.lang.OutOfMemoryError: Java heap space, java.lang.OutOfMemoryError: GC overhead limit exceeded %% java.lang.RuntimeException: 'No call graph present in Scene. Maybe you want Whole Program mode (-w).', 'There were exceptions during IFDS analysis. Exiting.' 'Could not find method' -%% -%% -%% Didfail: DONE ? -%% java.lan.RuntimeException: "Could not find method" (1603), "not found: java.io.Serializable" (1362) ?, mostly originate from somewhere in soot -%% null pointer: mostly originate from somewhere in soot -%% File not found: error raised after a previous tool failed -%% -%% Gator: DONE -%% java.lang.RuntimeException: 'error: expected 1 element for annotation Deprecated. 
Got 1 instead.' (106 occ), misuse of `soot.dexpler.DexAnnotation.addAnnotation` ? as usual, buried under long list of call to soot, hard to pinpoint the cause. -%% java.lang.OutOfMemoryError: -%% java.io.IOException: No space left on device (169 occurences) -%% brut.androlib.AndrolibException: 198, various apktool, some ppb linked to java.io.IOException No space left on device -%% FileNotFoundError: ppb consequence of java.io.IOException: No such file or directory: '/tmp/gator-zxkd65ty/apktool.yml -%% -%% perfchecker: Done -%% java.lang.VerifyError: "Expecting a stackmap frame at branch target ", internet propose that it could be caused by Dexguard obfuscation -%% link error: probably problems with android.jar? -%% -%% redexer: -%% "File "src/ext/logging.ml", line 712, characters 12-17: Pattern matching failed": suspicious pattern matching but I don't know caml enough to debug. -%% -%% saaf: DONE -%% brut.androlib.AndrolibException: apktoool 1.5.2, "Could not decode arsc file" -%% de.rub.syssec.saaf.model.analysis.AnalysisException: encapsulate the apktool error -%% java.io.IOException: 'Expected: 0x001c0001, got: 0x00000000', still apktool -%% 38635 failures over the total of 38710 failures raise a 'brut.androlib.AndrolibException' apktool error. 
-%% -%% wognsen_et_al: -%% brut.androlib.AndrolibException: apktool 1.5.2, "Could not decode arsc file" -%% java.io.IOException: "Expected: 0x001c0001, got: 0x00000000|38598", apktool -%% java.lang.ArithmeticException: divide by zero, from apktool 'org.jf.dexlib.Code.Format.ArrayDataPseudoInstruction.getElementCount' -% -%% Amandroid: TODO -%% mainly java.lang.NullPointerException at org.argus.jawa.flow.pta.rfa.ReachingFactsAnalysis.process, line 68, don't speak scala well enought to understand what is null -% -% -%% Anadroid: DONE -%% subprocess.calledProcessError: subprocess.check_call([APK_TOOL, \"d\" , \"-f\", \"--no-src\", apk_fp, prj_d]) -%% java.io.IOException: somewhere in brut.androlib.res.decoder.ARSCDecoder.decode -%% brut.androlib.AndrolibException: raise by brut.androlib.res.decoder.ARSCDecoder.decode, somewhere in brut.apktool.Main.main -%% -%% main error msg for brut.androlib.AndrolibException is "Could not decode arsc file" -%% -%% Apktool v1.4.3, released December 8, 2011: two months after the parution of sdk 15 -%% min_sdk 9 to 13 ~50% of exec failled with "Could not decode arsc file", min_sdk 14 81%, 15 94%, >15 >=98%. 
-%% SELECT min_sdk, COUNT(*)*100/(SELECT COUNT(*) FROM apk AS apk2 WHERE apk2.min_sdk = apk.min_sdk) FROM error INNER JOIN apk ON error.sha256 = apk.sha256 WHERE tool_name = 'anadroid' AND msg='Could not decode arsc file' GROUP BY min_sdk ORDER BY min_sdk; -%% SELECT min_sdk, COUNT(*)*100/(SELECT COUNT(*) FROM apk AS apk2 WHERE apk2.min_sdk = apk.min_sdk) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256 WHERE tool_name = 'anadroid' AND tool_status = 'FAILED' GROUP BY min_sdk ORDER BY min_sdk; -%% SELECT AVG(cnt), MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM error WHERE tool_name = 'anadroid' AND msg='Could not decode arsc file' GROUP BY sha256); -% -%\paragraph{Androguard and Androguard\_dad} -% -%Surprisingly, while Androguard almost never fails to analyze an APK, the internal decompiler of Androguard (DAD) fails more than half of the time. -%The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit (255 characters on most file systems). -%It should be noticed that Androguard\_dad rarely fails on the Drebin dataset. -%This illustrate the importance to test tools on real and up-to-date APKs: even a bad handling of filenames can influence an analysis. 
-% -%% Androguard: Done -%% 35 error total, no real pattern, stuff like unexpected ID, uncrowned instructions, ect -%% -%% Androguard Dad: DONE -%% All 33819 OSError are '[Errno 36] File name too long: ': the tool try to create files with the name AND SIGNATURE of the disassembled methods, by the file name can be too long: -%% '/mnt/out/in/android/vyapar/paymentgateway/model/PaymentGatewayResponseModel$Data$AccountDetails/PaymentGatewayResponseModel$Data$AccountDetails copy$default (PaymentGatewayResponseModel$Data$AccountDetails String String String String String String String String String String String String String String List I Object)PaymentGatewayResponseModel$Data$AccountDetails.ag' -%% NullPointerException -%% dad: SError -% -%\paragraph{Mallodroid and Apparecium} -% -%Mallodroid and Apparecium stand out as the tools that raised the most errors in one run. -%They can raise more than \num{10000} error by analysis. -%However, it happened only for a few dozen of APKs, and conspicuously, the same APKs raised the same hight number of errors for both tools. -%The recurring error is a {\tt KeyError} raise by Androguard when trying to find a string by its identifier. -%Although this error is logged, it seems successfully handled and during a manual analysis of the execution, both tools seemingly perform there analysis without issue. -%This hight number of occurrences may suggest that the output is not valid. -%Still, the tools claim to return a result, so, from our perspective, we consider those analysis as successful. -% -%For other numerous errors, we could not identify the reason why those specific applications raise so many exceptions. -%However we noticed that Mallodroid and Apparecium use outdated version of Androguard (respectively the version 3.0 and 2.0), and neither Androguard v3.3.5 nor DAD with Androguard v3.3.5 raise those exceptions. 
-%This suggest the issue has been fixed by Androguard and that Mallodroid and Apparecium could benefit from a dependency upgrade. -% -%% Apparecium: DONE -%% The KeyError is raised from androguard when a non existing string is queried. It happens for only a few apks (~60), -%% but a lot of times. UnicodeEncodeError happened more frequently (2740 apks), also originate from androguard. -%% androguard version 2.0 -%% -%% mallodroid: Done -%% KeyError: from androguard `get_raw_string`, but do not lead to crash, 33 crash from androguard parsing xml. (androguard 3.0) -%% Instruction10x% -% -%\paragraph{Blueseal} -% -%Because Blueseal rarely log more than one error when crashing, it is easy to identify the relevant error. The majority of crashes comes from unsupported Android versions (due to the magic number of the DEX files not being supported by the version of back smali used by Blueseal) and methods whose implementation are not found (like native methods). -%% Blueseal: Done -%% Majority of runtimes error: 'No method source set for method ' are raised from soot.SceneTransformer.transform() called by edu.buffalo.cse.blueseal.BSFlow.CgTransformer.getDynamicEntryPoints(). -%% No idea how to fix. Update soot? version unclear ('trunk'...), but copyright up to 2010 so 2010? -%% -%\paragraph{Droidsafe and SAAF} -% -%Our investigation of the most common errors raised by Droidsafe and SAAF showed that they are often preceded by an error from apktool. -%Indeed, \num{28654} runs of Droidsafe and \num{38635} runs of SAAF failed after raising at least one of {\tt brut.androlib.AndrolibException} or \\ {\tt brut.androlib.err.UndefinedResObject}, suggesting that those tools would benefit from an upgrade of apktool. 
-% -% -%% Droidsafe: -%% UnknownHostException: 'normal', due to network isolation(?), from sfl4j, no impact on the reste of the tool -%% droidsafe.utils.CannotFindMethodException: 'Cannot find or resolve ' (eg: android.view.ViewTreeObserver: void removeOnGlobalLayoutListener), -%% mostly related to android API. First guest 'normal' as droidsafe model the android API and is not updated since ~SDK 19, but the error is replaced by an -%% apktool error for min sdk > 19.: 2.0.0rc2 -%% eg: android.view.ViewTreeObserver.removeOnGlobalLayoutListener: exist un android.jar for sdk 18 and 18, but no in droidsafe model -%% the error does not look fatals (it occurred in finished execution) but is more common on failed execution. (1 to 16 ratio) -%% TODO: conclusion? -%% -%% 28957 apk with an apktool error -%% -%% CannotFindMethodException -% -%\paragraph{Ic3 and Ic3\_fork} -% -%% ic3: DONE -%% jas.jasError: "Missing arguments for instruction ldc" or "Badly formatted number", old soot or bad dare? -%% 3778 / 10480 (~30) fails without error logged, probable that we don't capture dare failures. -%% -%% ic3_fork: DONE -%% java.lang.RuntimeException: "This operation requires resolving level SIGNATURES but is at resolving level DANGLING", and "Could not find method". Yet another case of error lost in a sea of soot -%% only 38 failures without error logged -%% -%% IccTa: Done -%% java.lang.RuntimeException: same number of "This operation requires resolving level SIGNATURES..." as ic3_fork, -%% lots of "No method source set for method ", half the time this occurs the exec failed (and ~30% of the time it finishes) -%% "Could not find method": fail every time, in edu.psu.cse.siis.ic3.SetupApplication.calculateSourcesSinksEntrypoints (and again, a lot of soot stack) -% -%We compared the number of errors between Ic3 and Ic3\_fork. 
-%Ic3\_fork reports less errors for all types of analysis which suggests that the author of the fork have removed the outputed errors from the original code: the thrown errors are captured in a generic {\tt RuntimeException} which removes the semantic, making it harder our investigations. -%Nevertheless, Ic3\_fork has more failures than Ic3: the number of errors reported by a tool is not correlated to the final success of its analysis. -% -% -%% jasError -% -%\paragraph{Flowdroid} -% -%Our exchanges with the authors of Flowdroid led us to expect more timeouts from too long executions than failed run. -%Surprisingly we only got \mypercent[2]{37}{\NBTOTAL} of timeout, and a hight number of failures. -%We tried to detect recurring causes of failures, but the complexity of Flowdroid make the investigation difficult. -%Most exceptions seems to be related to concurrency. %or display a generic messages. -%Other errors that came up regularly are {\tt java.nio.channels.ClosedChannelException} which is raised when Flowdoid fails to read from the APK, although we did not find the reason of the failure, null pointer exceptions when trying to check if a null value is in a {\tt ConcurrentHashMap} (in {\tt LazySummaryProvider.getClassFlows()}) and {\tt StackOverflowError} from {\tt StronglyConnectedComponentsFast.recurse()}. -%We randomly selected 20 APKs that generated stack overflows in Flowdroid and retried the analysis with 500G of RAM allocated to the JVM. -%18 of those runs still failed with a stack overflow without using all the allocated memory, the other two failed after raising null pointer exceptions from {\tt getClassFlows}. -%This shows that the lack of memory is not the primary cause of those failures. -% -%% Flowdroid: TODO java.nio.channels.ClosedChannelException cause or consequence? -%% java.nio.channels.ClosedChannelException: mosly the zip file reader that refuse an access (after another crash? 
hard to check) -%% java.lang.StackOverflowError: -%% java.lang.RuntimeException: mostly "There were exceptions during IFDS analysis. Exiting." -%% java.lang.NullPointerException: soot.jimple.infoflow.collect.ConcurrentHashSet.contains, from soot.jimple.infoflow.methodSummary.data.provider.LazySummaryProvider.getClassFlows -%% com.google.common.util.concurrent.ExecutionError: "java.lang.StackOverflowError" -%% -% -%%No hidden timeout, what do we believe? avg(time) = 80s, 30s when finished, 137 when failed, max(time) = 3639s when failed, 3284 when finished, 72 \% of the failures took less than a minute, 93\% less than 10, 92\% of failed exception raised a NullPointerException. -% -%% Pauck: Flowdroid avg 2m on DIALDroid-Bench (real worlds apks) -% -% -%\medskip -% -%As a conclusion, we observe that a lot of errors can be linked to bugs in dependencies. -%Our attempts to upgrade those dependencies led to new errors appearing: we conclude that this is a no trivial task that require familiarity with the inner code of the tools. -% -%\subsection{State of the art comparison} -% -%% Luo {\it et al.} released TaintBench~\cite{luoTaintBenchAutomaticRealworld2022} a real-world benchmark and the associated recommendations to build such a benchmark. These benchmarks confirmed that some tools such as Amandroid and Flowdroid are less efficient on real-world applications. -%% Pauck {\it et al.}~\cite{pauckAndroidTaintAnalysis2018} -%% Reaves {\it et al.}~\cite{reaves_droid_2016} -% -%We finally compare our results to the conclusions and discussions of previous papers~\cite{luoTaintBenchAutomaticRealworld2022, pauckAndroidTaintAnalysis2018, reaves_droid_2016}. -% -%First we confirm the hypothesis of Luo {\it et al.} that real-world applications lead to less efficient analysis than using hand crafted test applications or old datasets~\cite{luoTaintBenchAutomaticRealworld2022}. 
Even if Drebin is not hand-crafted, it is quite old and we obtained really good results compared to the Rasta dataset. -%When considering real-world applications, the size is rather different from hand crafted application, which impacts the success rate. -%We believe that it is explained by the fact that the complexity of the code increases with its size. -% -%%30*6 -%%180 -%%21+20+27+2+18+18 -%%106 -%%106/180*100 -%%58.88 -% -%Second, our finding are consistent with the numerical results of Pauck {\it et al.} that showed that \mypercent{106}{180} of DIALDroid-Bench 30 real-world applications are analyzed successfully with the 6 evaluated tools~\cite{pauckAndroidTaintAnalysis2018}. -%Six years after the release of DIALDroid-Bench, we obtain a lower ratio of \mypercent{40.05}{100} for the same set of 6 tools but using the Rasta dataset of \NBTOTALSTRING applications. -%We extended this result to our set of \nbtoolsvariationsrun\xspace tools and obtained a global success rate of \resultratio. -%Our investigations of crashes also confirmed that dependencies to older versions of Apktool are impacting the performances of Anadroid, Saaf and Wognsen {\it et al.} in addition to DroidSafe and IccTa, already identified by Pauck {\it et al.}. -% -%% Pauck: 235 micro bench, 30 real* -%% Confirm didfail failled for min_sdk >= 19, all successful run (only 4%) indicated "Only phantom classes loaded, skipping analysis..." 
-% -%% SELECT tool_status, COUNT(*), AVG(dex_size) FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 WHERE min_sdk >= 19 AND tool_name = 'didfail' GROUP BY tool_status; -%% FAILED|16651|13139071.2363221 -%% FINISHED|694|6617861.33717579 -%% TIMEOUT|98|6048999.2244898 -%% SELECT msg, COUNT(*) FROM (SELECT DISTINCT exec.sha256, msg FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 INNER JOIN error ON exec.sha256 = error.sha256 AND exec.tool_name = error.tool_name WHERE min_sdk >= 19 AND exec.tool_name = 'didfail' AND exec.tool_status = 'FINISHED') GROUP BY msg; -%% |77 -%% Only phantom classes loaded, skipping analysis...|694 -%% -%% DroidSafe and IccTa Failled for SDK > 19 because of old apktool -%% -%% We obsered: (nb success < 2000 for min_skd >= 20) -%% ['anadroid', 'blueseal', 'dialdroid', 'didfail', 'droidsafe', 'ic3_fork', 'iccta', 'perfchecker', 'saaf', 'wognsen_et_al'] -%% anadroid|0 -%% blueseal|521 -%% dialdroid|812 -%% didfail|343 -%% droidsafe|35 -%% ic3_fork|1393 -%% iccta|612 -%% perfchecker|1921 -%% saaf|1588 -%% wognsen_et_al|386 -% -%Third, we extended to \nbtoolsselected\xspace different tools the work done by Reaves {\it et al.} on the usability of analysis tools (4 tools are in common, we added 16 new tools and two variations). -%We confirmed that most tools require a significant amount of work to get them running. -%We encounter similar issues with libraries and operating system incompatibilities, and noticed that, with time, dependencies issues may impact the build process. -%For instance we encountered cases where the repository hosting the dependencies were closed, or cases where maven failed to download dependencies because the OS version did not support SSL, now mandatory to access maven central. -%%, and even one case were the could not find anywhere the compiled version of sbt used to build a tool. +Didfail: DONE ? 
+ java.lan.RuntimeException: "Could not find method" (1603), "not found: java.io.Serializable" (1362) ?, mostly originate from somewhere in soot + null pointer: mostly originate from somewhere in soot + File not found: error raised after a previous tool failed + +Gator: DONE + java.lang.RuntimeException: 'error: expected 1 element for annotation Deprecated. Got 1 instead.' (106 occ), misuse of `soot.dexpler.DexAnnotation.addAnnotation` ? as usual, buried under long list of call to soot, hard to pinpoint the cause. + java.lang.OutOfMemoryError: + java.io.IOException: No space left on device (169 occurences) + brut.androlib.AndrolibException: 198, various apktool, some ppb linked to java.io.IOException No space left on device + FileNotFoundError: ppb consequence of java.io.IOException: No such file or directory: '/tmp/gator-zxkd65ty/apktool.yml + +perfchecker: Done + java.lang.VerifyError: "Expecting a stackmap frame at branch target ", internet propose that it could be caused by Dexguard obfuscation + link error: probably problems with android.jar? + +redexer: + "File "src/ext/logging.ml", line 712, characters 12-17: Pattern matching failed": suspicious pattern matching but I don't know caml enough to debug. + +saaf: DONE + brut.androlib.AndrolibException: apktoool 1.5.2, "Could not decode arsc file" + de.rub.syssec.saaf.model.analysis.AnalysisException: encapsulate the apktool error + java.io.IOException: 'Expected: 0x001c0001, got: 0x00000000', still apktool + 38635 failures over the total of 38710 failures raise a 'brut.androlib.AndrolibException' apktool error. 
+ +wognsen_et_al: + brut.androlib.AndrolibException: apktool 1.5.2, "Could not decode arsc file" + java.io.IOException: "Expected: 0x001c0001, got: 0x00000000|38598", apktool + java.lang.ArithmeticException: divide by zero, from apktool 'org.jf.dexlib.Code.Format.ArrayDataPseudoInstruction.getElementCount' + +Amandroid: TODO + mainly java.lang.NullPointerException at org.argus.jawa.flow.pta.rfa.ReachingFactsAnalysis.process, line 68, don't speak scala well enought to understand what is null + +Anadroid: DONE + subprocess.calledProcessError: subprocess.check_call([APK_TOOL, \"d\" , \"-f\", \"--no-src\", apk_fp, prj_d]) + java.io.IOException: somewhere in brut.androlib.res.decoder.ARSCDecoder.decode + brut.androlib.AndrolibException: raise by brut.androlib.res.decoder.ARSCDecoder.decode, somewhere in brut.apktool.Main.main + + main error msg for brut.androlib.AndrolibException is "Could not decode arsc file" + + Apktool v1.4.3, released December 8, 2011: two months after the parution of sdk 15 + min_sdk 9 to 13 ~50% of exec failled with "Could not decode arsc file", min_sdk 14 81%, 15 94%, >15 >=98%. + + SELECT min_sdk, COUNT(*)*100/(SELECT COUNT(*) FROM apk AS apk2 WHERE apk2.min_sdk = apk.min_sdk) FROM error INNER JOIN apk ON error.sha256 = apk.sha256 WHERE tool_name = 'anadroid' AND msg='Could not decode arsc file' GROUP BY min_sdk ORDER BY min_sdk; + SELECT min_sdk, COUNT(*)*100/(SELECT COUNT(*) FROM apk AS apk2 WHERE apk2.min_sdk = apk.min_sdk) FROM exec INNER JOIN apk ON exec.sha256 = apk.sha256 WHERE tool_name = 'anadroid' AND tool_status = 'FAILED' GROUP BY min_sdk ORDER BY min_sdk; + SELECT AVG(cnt), MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM error WHERE tool_name = 'anadroid' AND msg='Could not decode arsc file' GROUP BY sha256); +*/ + +_Androguard and Androguard_dad_ +Surprisingly, while Androguard almost never fails to analyze an APK, the internal decompiler of Androguard (DAD) fails more than half of the time. 
+The analysis of the logs shows that the issue comes from the way the decompiled methods are stored: each method is stored in a file named after the method name and signature, and this file name can quickly exceed the size limit (255 characters on most file systems). +It should be noted that Androguard_dad rarely fails on the Drebin dataset. +This illustrates the importance of testing tools on real and up-to-date APKs: even poor handling of file names can affect an analysis. + +/* +Androguard: Done + 35 error total, no real pattern, stuff like unexpected ID, uncrowned instructions, ect + +Androguard Dad: DONE + All 33819 OSError are '[Errno 36] File name too long: ': the tool try to create files with the name AND SIGNATURE of the disassembled methods, by the file name can be too long: + '/mnt/out/in/android/vyapar/paymentgateway/model/PaymentGatewayResponseModel$Data$AccountDetails/PaymentGatewayResponseModel$Data$AccountDetails copy$default (PaymentGatewayResponseModel$Data$AccountDetails String String String String String String String String String String String String String String List I Object)PaymentGatewayResponseModel$Data$AccountDetails.ag' +NullPointerException +dad: SError +*/ + +_Mallodroid and Apparecium_ +Mallodroid and Apparecium stand out as the tools that raised the most errors in one run. +They can raise more than #num(10000) errors per analysis. +However, this happened for only a few dozen APKs and, conspicuously, the same APKs raised the same high number of errors for both tools. +The recurring error is a `KeyError` raised by Androguard when trying to find a string by its identifier. +Although this error is logged, it seems to be handled successfully: during a manual analysis of the execution, both tools seemingly performed their analysis without issue. +This high number of occurrences may suggest that the output is not valid. +Still, the tools claim to return a result, so, from our perspective, we consider those analyses successful.
+For the other numerous errors, we could not identify why those specific applications raise so many exceptions. +However, we noticed that Mallodroid and Apparecium use outdated versions of Androguard (versions 3.0 and 2.0, respectively), and neither Androguard v3.3.5 nor DAD with Androguard v3.3.5 raises those exceptions. +This suggests that the issue has been fixed in Androguard and that Mallodroid and Apparecium could benefit from a dependency upgrade. + +/* +Apparecium: DONE + The KeyError is raised from androguard when a non existing string is queried. It happens for only a few apks (~60), + but a lot of times. UnicodeEncodeError happened more frequently (2740 apks), also originate from androguard. + androguard version 2.0 + +mallodroid: Done + KeyError: from androguard `get_raw_string`, but do not lead to crash, 33 crash from androguard parsing xml. (androguard 3.0) +Instruction10x% +*/ + +_Blueseal_ +Because Blueseal rarely logs more than one error when crashing, it is easy to identify the relevant error. +The majority of crashes come from unsupported Android versions (the magic number of the DEX files is not supported by the version of baksmali used by Blueseal) and from methods whose implementation is not found (such as native methods). + +/* +Blueseal: Done + Majority of runtimes error: 'No method source set for method ' are raised from soot.SceneTransformer.transform() called by edu.buffalo.cse.blueseal.BSFlow.CgTransformer.getDynamicEntryPoints(). + No idea how to fix. Update soot? version unclear ('trunk'...), but copyright up to 2010 so 2010? +*/ + +_Droidsafe and SAAF_ +Our investigation of the most common errors raised by Droidsafe and SAAF showed that they are often preceded by an error from apktool.
+Indeed, #num(28654) runs of Droidsafe and #num(38635) runs of SAAF failed after raising at least one of `brut.androlib.AndrolibException` or `brut.androlib.err.UndefinedResObject`, suggesting that those tools would benefit from an upgrade of apktool. + +/* +Droidsafe: + UnknownHostException: 'normal', due to network isolation(?), from sfl4j, no impact on the reste of the tool + droidsafe.utils.CannotFindMethodException: 'Cannot find or resolve ' (eg: android.view.ViewTreeObserver: void removeOnGlobalLayoutListener), + mostly related to android API. First guest 'normal' as droidsafe model the android API and is not updated since ~SDK 19, but the error is replaced by an + apktool error for min sdk > 19.: 2.0.0rc2 + eg: android.view.ViewTreeObserver.removeOnGlobalLayoutListener: exist un android.jar for sdk 18 and 18, but no in droidsafe model + the error does not look fatals (it occurred in finished execution) but is more common on failed execution. (1 to 16 ratio) + TODO: conclusion? + +28957 apk with an apktool error +CannotFindMethodException +*/ + +_Ic3 and Ic3_fork_ +We compared the number of errors between Ic3 and Ic3_fork. +Ic3_fork reports fewer errors for all types of analysis, which suggests that the author of the fork removed the error output of the original code: the thrown errors are caught in a generic `RuntimeException`, which strips their semantics and makes our investigation harder. +Nevertheless, Ic3_fork has more failures than Ic3: the number of errors reported by a tool is not correlated with the final success of its analysis. + +/* +ic3: DONE + jas.jasError: "Missing arguments for instruction ldc" or "Badly formatted number", old soot or bad dare? + 3778 / 10480 (~30) fails without error logged, probable that we don't capture dare failures. + +ic3_fork: DONE + java.lang.RuntimeException: "This operation requires resolving level SIGNATURES but is at resolving level DANGLING", and "Could not find method".
Yet another case of error lost in a sea of soot + only 38 failures without error logged + +IccTa: Done + java.lang.RuntimeException: same number of "This operation requires resolving level SIGNATURES..." as ic3_fork, + lots of "No method source set for method ", half the time this occurs the exec failed (and ~30% of the time it finishes) + "Could not find method": fail every time, in edu.psu.cse.siis.ic3.SetupApplication.calculateSourcesSinksEntrypoints (and again, a lot of soot stack) +jasError +*/ + +_Flowdroid_ +Our exchanges with the authors of Flowdroid led us to expect more timeouts from overly long executions than failed runs. +#todo[Already said?: Surprisingly, we only got #mypercent(37,NBTOTAL) of timeouts, and a high number of failures.] +We tried to detect recurring causes of failures, but the complexity of Flowdroid makes the investigation difficult. +Most exceptions seem to be related to concurrency. //or display a generic messages. +Other errors that came up regularly are `java.nio.channels.ClosedChannelException`, which is raised when Flowdroid fails to read from the APK (we did not find the reason for this failure), null pointer exceptions raised when checking whether a null value is in a `ConcurrentHashMap` (in `LazySummaryProvider.getClassFlows()`), and `StackOverflowError` from `StronglyConnectedComponentsFast.recurse()`. +We randomly selected 20 APKs that generated stack overflows in Flowdroid and retried the analysis with 500G of RAM allocated to the JVM. +18 of those runs still failed with a stack overflow without using all the allocated memory; the other two failed after raising null pointer exceptions from `getClassFlows`. +This shows that a lack of memory is not the primary cause of those failures. + +/* +Flowdroid: TODO java.nio.channels.ClosedChannelException cause or consequence? + java.nio.channels.ClosedChannelException: mosly the zip file reader that refuse an access (after another crash?
hard to check) + java.lang.StackOverflowError: + java.lang.RuntimeException: mostly "There were exceptions during IFDS analysis. Exiting." + java.lang.NullPointerException: soot.jimple.infoflow.collect.ConcurrentHashSet.contains, from soot.jimple.infoflow.methodSummary.data.provider.LazySummaryProvider.getClassFlows + com.google.common.util.concurrent.ExecutionError: "java.lang.StackOverflowError" + + +No hidden timeout, what do we believe? avg(time) = 80s, 30s when finished, 137 when failed, max(time) = 3639s when failed, 3284 when finished, 72% of the failures took less than a minute, 93% less than 10, 92% of failed exception raised a NullPointerException. + +Pauck: Flowdroid avg 2m on DIALDroid-Bench (real worlds apks) +*/ + +In conclusion, we observe that many errors can be linked to bugs in dependencies. +Our attempts to upgrade those dependencies led to new errors: we conclude that this is a non-trivial task that requires familiarity with the internal code of the tools. + +=== State-of-the-art comparison + +Luo #etal released TaintBench@luoTaintBenchAutomaticRealworld2022, a real-world benchmark, along with recommendations for building such a benchmark. +These benchmarks confirmed that some tools, such as Amandroid and Flowdroid, are less efficient on real-world applications. +// Pauck {\it et al.}~\cite{pauckAndroidTaintAnalysis2018} +// Reaves {\it et al.}~\cite{reaves_droid_2016} + +Finally, we compare our results with the conclusions and discussions of previous papers@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016. +First, we confirm the hypothesis of Luo #etal that real-world applications lead to less efficient analysis than hand-crafted test applications or old datasets@luoTaintBenchAutomaticRealworld2022. +Even if Drebin is not hand-crafted, it is quite old, and we obtained much better results on it than on the Rasta dataset.
+Real-world applications differ considerably in size from hand-crafted applications, which impacts the success rate. +We believe this is because the complexity of the code increases with its size. + +/* +30*6 +180 +21+20+27+2+18+18 +106 +106/180*100 +58.88 +*/ + +Second, our findings are consistent with the numerical results of Pauck #etal, who showed that #mypercent(106, 180) of the analyses of the 30 real-world applications of DIALDroid-Bench by the 6 evaluated tools succeed@pauckAndroidTaintAnalysis2018. +Six years after the release of DIALDroid-Bench, we obtain a lower ratio of #mypercent(40.05, 100) for the same set of 6 tools but using the Rasta dataset of #NBTOTALSTRING applications. +We extended this result to our set of #nbtoolsvariationsrun tools and obtained a global success rate of #resultratio. +Our investigation of crashes also confirmed that dependencies on older versions of Apktool impact the performance of Anadroid, Saaf and Wognsen #etal, in addition to DroidSafe and IccTa, already identified by Pauck #etal. + +/* +Pauck: 235 micro bench, 30 real* +Confirm didfail failled for min_sdk >= 19, all successful run (only 4%) indicated "Only phantom classes loaded, skipping analysis..."
+ +SELECT tool_status, COUNT(*), AVG(dex_size) FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 WHERE min_sdk >= 19 AND tool_name = 'didfail' GROUP BY tool_status; +FAILED|16651|13139071.2363221 +FINISHED|694|6617861.33717579 +TIMEOUT|98|6048999.2244898 +SELECT msg, COUNT(*) FROM (SELECT DISTINCT exec.sha256, msg FROM exec INNER JOIN apk on exec.sha256 = apk.sha256 INNER JOIN error ON exec.sha256 = error.sha256 AND exec.tool_name = error.tool_name WHERE min_sdk >= 19 AND exec.tool_name = 'didfail' AND exec.tool_status = 'FINISHED') GROUP BY msg; +|77 +Only phantom classes loaded, skipping analysis...|694 + +DroidSafe and IccTa Failled for SDK > 19 because of old apktool + +We obsered: (nb success < 2000 for min_skd >= 20) + ['anadroid', 'blueseal', 'dialdroid', 'didfail', 'droidsafe', 'ic3_fork', 'iccta', 'perfchecker', 'saaf', 'wognsen_et_al'] +anadroid|0 +blueseal|521 +dialdroid|812 +didfail|343 +droidsafe|35 +ic3_fork|1393 +iccta|612 +perfchecker|1921 +saaf|1588 +wognsen_et_al|386 +*/ + +Third, we extended the work done by Reaves #etal on the usability of analysis tools to #nbtoolsselected different tools (4 tools are in common; we added 16 new tools and two variations). +We confirmed that most tools require a significant amount of work to get them running. +We encountered similar issues with library and operating system incompatibilities, and noticed that, over time, dependency issues may impact the build process. +For instance, we encountered cases where the repository hosting the dependencies had been closed, and cases where Maven failed to download dependencies because the OS version did not support SSL, now mandatory to access Maven Central. +//, and even one case were the could not find anywhere the compiled version of sbt used to build a tool.
+*/ + diff --git a/3_rasta/5_conclusion.typ b/3_rasta/5_conclusion.typ index 4c64ab0..c797b60 100644 --- a/3_rasta/5_conclusion.typ +++ b/3_rasta/5_conclusion.typ @@ -1,13 +1,14 @@ +#import "@local/template-thesis-matisse:0.0.1": etal +#import "X_var.typ": * + == Conclusion -This paper has assessed the suggested results of the literature~\cite{luoTaintBenchAutomaticRealworld2022, pauckAndroidTaintAnalysis2018, reaves_droid_2016} about the reliability of static analysis tools for Android applications. -With a dataset of \NBTOTALSTRING applications we established that \resultunusable of \nbtoolsselectedvariations\xspace tools are not reusable, when considering that a tool that has more than 50\% of time a failure is unusable. -In total, the analysis success rate of the tools that we could run for the entire dataset is \resultratio. +This paper has assessed the results suggested by the literature@luoTaintBenchAutomaticRealworld2022 @pauckAndroidTaintAnalysis2018 @reaves_droid_2016 about the reliability of static analysis tools for Android applications. +With a dataset of #NBTOTALSTRING applications, we established that #resultunusable of #nbtoolsselectedvariations tools are not reusable, considering a tool unusable when it fails more than 50% of the time. +In total, the analysis success rate of the tools that we could run for the entire dataset is #resultratio. The characteristics that have the most influence on the success rate is the bytecode size and min SDK version. Finally, we showed that malware APKs have a better finishing rate than goodware. -In future works, we plan to investigate deeper the reported errors of the tools in order to analyze the most common types of errors, in particular for Java based tools. We also plan to extend this work with a selection of more recent tools performing static analysis.
- -%Following Reaves {\it et al.} recommendations~\cite{reaves_droid_2016}, we publish the Docker and Singularity images we built to run our experiments alongside the Docker files. This will allow the research community to use directly the tools without the build and installation penalty. - -%\todo{check ce qui est dit sur ic3 et ic3fork} +In future work, we plan to investigate the errors reported by the tools in more depth in order to analyze the most common types of errors, in particular for Java-based tools. +We also plan to extend this work with a selection of more recent tools performing static analysis. +Following the recommendations of Reaves #etal@reaves_droid_2016, we publish the Docker and Singularity images we built to run our experiments alongside the Docker files. This will allow the research community to use the tools directly without the build and installation penalty. diff --git a/3_rasta/X_var.typ b/3_rasta/X_var.typ index 82e5f2b..1130e8e 100644 --- a/3_rasta/X_var.typ +++ b/3_rasta/X_var.typ @@ -17,3 +17,21 @@ delimiter: ";", row-type: dictionary, ) + +#let rasta_avg_nb_error_by_exec = csv( + "data/average_number_of_error_by_exec.csv", + delimiter: ",", + row-type: dictionary, +) + +#let rasta_avg_time = csv( + "data/average_time-final.csv", + delimiter: ",", + row-type: dictionary, +) + +#let rasta_avg_mem = csv( + "data/average_mem-final.csv", + delimiter: ",", + row-type: dictionary, +) diff --git a/3_rasta/data/average_mem-final.csv b/3_rasta/data/average_mem-final.csv new file mode 100644 index 0000000..69705b1 --- /dev/null +++ b/3_rasta/data/average_mem-final.csv @@ -0,0 +1,4 @@ +first,type,adagio,amandroid,anadroid,androguard,androguarddad,apparecium,blueseal,dialdroid,didfail,droidsafe,flowdroid,gator,ic,icfork,iccta,mallodroid,perfchecker,redexer,saaf,wognsenetal +FINISHED,memory,0.6,4.5,12.8,0.6,0.3,1.3,2.7,15.9,17.6,9.9,2,2,15.3,5,5,0,0,1,3,3 +FAILED,,0.3,4.9,2.8,0.4,1,0.6,1.7,3.9,68.3,14.8,5,41.5,130.9,5,12,0,1,1,1,1
+TIMEOUT,,0,19,82.6,0,68.1,2.1,15.4,37.2,99.8,0.2,20.2,1.1,81.1,20,2,0,1,0,9,0 diff --git a/3_rasta/data/average_number_of_error_by_exec.csv b/3_rasta/data/average_number_of_error_by_exec.csv new file mode 100644 index 0000000..0c14393 --- /dev/null +++ b/3_rasta/data/average_number_of_error_by_exec.csv @@ -0,0 +1,7 @@ +first,type,adagio,amandroid,anadroid,androguard,androguarddad,apparecium,blueseal,dialdroid,didfail,droidsafe,flowdroid,gator,ic,icfork,iccta,mallodroid,perfchecker,redexer,saaf,wognsenetal +FINISHED,errors,0.0,0.9,0.02,0.0,0.0,3.33,0.0,55.88,1.13,7.17,1.04,0.0,1.22,0.32,14.3,4.37,0.28,1.29,0.22,2.13 +,sigma,0.0,3.23,0.19,0.0,0.0,261.92,0.05,63.73,3.94,37.87,26.32,0.04,23.1,2.73,71.74,277.0,1.77,1.28,0.83,66.5 +FAILED,errors,0.0,2.79,2.34,1.35,1.0,21.63,1.02,33.79,6.6,12.53,14.64,0.32,3.66,1.29,17.34,1.0,1.15,3.45,6.35,4.11 +,sigma,0.0,8.7,0.94,0.48,0.02,466.97,0.21,108.56,31.56,74.01,49.07,0.78,18.06,0.71,42.81,0.0,4.7,4.52,22.97,48.81 +TIMEOUT,errors,0,9.78,0.01,0,0.0,4.3,0.01,60.94,1.06,26.64,0.75,0.0,2.13,0.91,3.68,0,1.24,0,91.29,1.31 +,sigma,0.0,9.76,0.11,0.0,0.0,79.98,0.11,101.73,2.98,97.18,1.72,0.0,5.19,3.19,15.33,0.0,4.3,0.0,353.75,3.42 diff --git a/3_rasta/data/average_time-final.csv b/3_rasta/data/average_time-final.csv new file mode 100644 index 0000000..15656fb --- /dev/null +++ b/3_rasta/data/average_time-final.csv @@ -0,0 +1,4 @@ +first,type,adagio,amandroid,anadroid,androguard,androguarddad,apparecium,blueseal,dialdroid,didfail,droidsafe,flowdroid,gator,ic,icfork,iccta,mallodroid,perfchecker,redexer,saaf,wognsenetal +FINISHED,time,17.19,405.02,149.29,15.59,26.56,97.54,158.38,767.67,270.1,676.08,29.11,33.19,156.19,158.7,90.2,27.79,4.31,16.16,56.44,696.09 +FAILED,,8.38,760.52,4.8,13.64,62.67,21.68,12.15,68.47,444.74,442.79,136.95,924.46,535.39,28.6,201.74,5.21,10.24,16.61,5.73,55.58 +TIMEOUT,,0,3600.84,3600.83,0,3603.59,3600.23,3600.73,3604.0,3600.02,3600.08,3601.18,3600.56,3601.03,3600.92,3600.38,0,3600.04,0,3602.02,3600.55 diff --git 
a/3_rasta/figs/repartition-of-error-types-among-tools.svg b/3_rasta/figs/repartition-of-error-types-among-tools.svg new file mode 100644 index 0000000..0c0c699 --- /dev/null +++ b/3_rasta/figs/repartition-of-error-types-among-tools.svg @@ -0,0 +1,2092 @@ [SVG markup not preserved; figure: repartition of error types among tools] diff --git a/3_rasta/rasta.typ b/3_rasta/rasta.typ index 41a6fe8..e0af201 100644 --- a/3_rasta/rasta.typ +++ b/3_rasta/rasta.typ @@ -10,5 +10,5 @@ #include("1_related_work.typ") #include("2_methodology.typ") #include("3_experiments.typ") -//#include("4_discussion.typ") -//#include("5_conclusion.typ") +#include("4_discussion.typ") +#include("5_conclusion.typ") diff --git a/main.pdf b/main.pdf index 46f6807..d6e3885 100644 Binary files a/main.pdf and b/main.pdf differ