The pseudo-code in @lst:renaming-algo shows the three steps of this algorithm:

* #todo[interrupting try blocks: catch blocks might expect temporary registers to still store the saved value]
*/
=== Implementation Details <sec:th-implem>

Our initial idea was to use Apktool, but in @sec:rasta, we found that many errors raised by tools were due to trying to parse Smali incorrectly.
Thus, we decided to avoid Apktool.

Most of the contributions of the state of the art that perform instrumentation rely on Soot.
Soot works on an intermediate representation, Jimple, that is easier to manipulate.
However, Soot can be cumbersome to set up and use, and we initially wanted better control over the modified bytecode.
In addition, although this might be because they perform more complex analyses, Soot-based tools showed a trend of consuming a lot of memory and failing with unclear errors, which supported our decision to avoid Soot.
For these reasons, we decided to write our own instrumentation library from scratch.

That library, Androscalpel, requires being able to parse, modify and generate valid #DEX files.
This was not as difficult as one would expect, thanks to the clear documentation of the Dalvik format from Google#footnote[https://source.android.com/docs/core/runtime/dex-format].
In addition, when we had doubts about the specification, we could check the implementation used by Apktool#footnote[https://github.com/JesusFreke/smali], or the code used by Android to check the integrity of #DEX files#footnote[https://cs.android.com/android/platform/superproject/main/+/main:art/libdexfile/dex/dex_file_verifier.cc;drc=11bd0da6cfa3fa40bc61deae0ad1e6ba230b0954].

We chose Rust to implement this library.
It offers both good performance and good ergonomics.
For instance, we could parallelise the parsing and generation of #DEX files without much effort.
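As a rough sketch of what this parallelisation can look like (using plain `std::thread`; `parse_dex` is a stand-in, not Androscalpel's actual API): each #DEX file of an #APK is independent of the others, so each can be parsed in its own thread.

```rust
use std::thread;

// Sketch only: `parse_dex` stands in for the real parser so that the parallel
// structure is visible.
fn parse_dex(path: &str) -> usize {
    // Stand-in for real parsing work: pretend the result is the path length.
    path.len()
}

// Spawn one thread per DEX file and collect the results in order.
fn parse_all(paths: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|p| thread::spawn(move || parse_dex(&p)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

In practice, a thread pool (#eg the rayon crate) would avoid spawning one thread per file.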

Because we are not using a high-level intermediate language like Jimple (used by Soot), the management of the Dalvik registers in the methods has to be done manually (by the user of the library), the same way it has to be done when using Apktool.
This poses a few challenges.

A method declares a number of internal registers it will use (let us call this number $n$), and has access to an additional number of registers used to store the parameters (let us call this number $p$).
Each register is referred to by a number from $0$ to $65535$.
The internal registers are numbered $[|0, n[|$ and the parameter registers $[|n, n+p[|$.
This means that when adding new registers to a method while instrumenting it (say we want to add $k$ registers), the new registers will be numbered $[|n, n+k[|$, and the parameter registers will be renumbered from $[|n, n+p[|$ to $[|n+k, n+k+p[|$.
In general, this is not an issue, but some instructions can only operate on low registers (#eg `array-length`, which stores the length of an array in a register, only works on registers numbered between $0$ and $16$ excluded).
This means that adding registers to a method can be enough to break it.
We solved this by adding instructions that move the content of registers $[|n+k, n+k+p[|$ to the registers $[|n, n+p[|$, and keeping the original register numbers ($[|n, n+p[|$) for the parameters in the rest of the body of the method.
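The renumbering and the parameter moves described above can be sketched as follows (the function names are illustrative, not Androscalpel's actual API):

```rust
// Sketch of the register renumbering described above, with n internal
// registers, p parameter registers, and k registers added by instrumentation.

/// Maps an old register number to its new one after k registers are added:
/// internal registers keep their numbers, parameter registers shift up by k.
fn renumber(old: u16, n: u16, k: u16) -> u16 {
    if old < n { old } else { old + k }
}

/// (source, destination) moves to prepend to the method body so that each
/// parameter is copied from its shifted slot back to its original number.
fn parameter_moves(n: u16, p: u16, k: u16) -> Vec<(u16, u16)> {
    (0..p).map(|i| (n + k + i, n + i)).collect()
}
```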
The next challenge arises when we need to use one of the new registers with an instruction that only accepts registers lower than $n+p$.
In such cases, a lower register must be used, and its content is temporarily saved in one of the new registers.
This is not as easy as it seems: the Dalvik instructions differ depending on whether the register stores a reference or a scalar value, and Android does check that the register types match the instructions.
The type of a register can be computed from the control flow graph of the method (we added the computation of such a graph, with the type of each register, as a feature of Androscalpel).
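The type computation can be sketched as a join over a small type lattice (the type names are illustrative, not Androscalpel's actual representation): when two control-flow paths merge, the facts known about a register are joined.

```rust
// Sketch: the verifier-visible type of a register, joined at CFG merge points.
#[derive(Clone, Copy, PartialEq, Debug)]
enum RegType {
    Unknown,   // no path has written the register yet
    Scalar,    // holds a primitive value
    Reference, // holds an object reference
    Conflict,  // paths disagree: the register cannot be used as-is
}

// Join the facts coming from two incoming CFG edges.
fn join(a: RegType, b: RegType) -> RegType {
    use RegType::*;
    match (a, b) {
        (Unknown, x) | (x, Unknown) => x,
        (x, y) if x == y => x,
        _ => Conflict,
    }
}
```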
An edge case that must not be overlooked is that each instruction inside a `try` block branches to each of the `catch` blocks.
This is a problem: it prevents us from restoring the registers to their original values before entering the `catch` blocks (or, if we restore the values at the beginning of the `catch` blocks and an exception is raised before the value is saved, the register will be overwritten by an invalid value).
This means that when modifying the content of a `try` block, the block must be split into several blocks to prevent impromptu branching.
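The splitting can be sketched on instruction-offset ranges (a simplified model for illustration; the real work happens on the #DEX try-item structures):

```rust
// Sketch: a try block covering the half-open offset range [start, end).
// If `len` instrumentation instructions are inserted at offset `at`, split the
// range so the inserted code is NOT covered by the original catch handlers,
// while the original (now shifted) instructions still are.
fn split_try(range: (u32, u32), at: u32, len: u32) -> Vec<(u32, u32)> {
    let (start, end) = range;
    assert!(start <= at && at <= end, "insertion point outside the try block");
    let mut out = Vec::new();
    if start < at {
        out.push((start, at)); // original instructions before the insertion
    }
    if at < end {
        out.push((at + len, end + len)); // shifted remainder, skipping inserted code
    }
    out
}
```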

We also found that some applications deliberately store files with names that will crash some parsing libraries.
For this reason, we also used our own library to modify the #APK files.
We took special care to process as few files as possible in the #APKs, and only strip the #DEX files and signatures, before appending the newly modified #DEX files at the end.
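A sketch of the entry filter this implies (the predicate is ours for illustration, not Androscalpel's actual API): only the #DEX files and the signature files are dropped, and every other entry is copied through byte-for-byte.

```rust
// Sketch: decide whether an APK (zip) entry should be stripped before the
// modified DEX files are appended.
fn should_strip(name: &str) -> bool {
    // Signature files: the old signature is invalid after modification anyway.
    let is_signature = name.starts_with("META-INF/")
        && (name.ends_with(".RSA") || name.ends_with(".DSA") || name.ends_with(".SF"));
    // Top-level classes.dex, classes2.dex, ...: replaced by the modified ones.
    let is_dex = !name.contains('/') && name.starts_with("classes") && name.ends_with(".dex");
    is_signature || is_dex
}
```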

Unfortunately, we did not have time to compare the robustness of our solution to that of existing tools like Apktool and Soot, but we did a quick performance comparison, summarised in @sec:th-lib-perf.
In hindsight, we probably should have taken the time to find a way to use smali/baksmali (the backend of Apktool) as a library, or to use SootUp for the instrumentation, but neither option documents how to instrument applications this way.
At the time of writing, the feature is still under development, but in the future, Androguard might also become an option to modify #DEX files.
Nevertheless, we published our instrumentation library, Androscalpel, for anyone who wants to use it. #todo[ref to code]

In red on the figure, however, we have the calls that were hidden by reflection.
  caption: [Call Graph of `Main.main()` generated by Androguard after patching],
) <fig:th-cg-after>
=== Androscalpel Performance <sec:th-lib-perf>

Because we implemented our own instrumentation library, we wanted to compare it to existing options.
Unfortunately, we did not have time to compare the robustness of the tools and the correctness of the generated applications.
However, we did compare the performance of our library, Androscalpel, to that of Apktool and Soot.

Due to time constraints, we could not test a complex transformation, as adding registers requires complex operations for both Androscalpel and Apktool (see @sec:th-implem for more details).
We decided to test two operations: traversing the instructions of an application (a read-only operation), and regenerating an application without modification (a read/write operation).
It should be noted that all three tested tools support multiprocessing, but we disabled this option when testing the generation of an application with Soot, as it raised errors.

#figure({
  let nb_col = 5
  table(
    columns: (1fr, 1fr, 1fr, 1fr, 1fr),
    align: center+horizon,
    table.header(
      table.cell(colspan: 2)[Tool], [Soot], [Apktool], [Androscalpel],
    ),
    table.cell(rowspan: 2)[Read],
    [Time], [], [], [],
    [Mem], [], [], [],
    table.cell(rowspan: 2)[Read/Write],
    [Time], [], [], [],
    [Mem], [], [], [],
  )},
  caption: [Average time and memory consumption of Soot, Apktool and Androscalpel],
) <tab:th-compare-perf>

@tab:th-compare-perf compares the resources consumed by each tool for each operation.

#todo[Conclude depending on the results of the experiment]

#midskip

To conclude, we showed that our approach indeed improves the results of analysis tools without much impact on their finishing rates.

Beyond the classic comparison of static versus dynamic, DroidRA has a similar goal.
Two notable comparison criteria would be the failure rate and the number of edges added to an application call graph.
The first criterion indicates how usable the results are for other tools, while the second indicates how effective the approaches are.

Because we elected to write our own software to modify the bytecode of #APKs, it would be insightful to compare the finishing rate and performance of simple transformations made with our tool to the same transformations made with Apktool, Soot or SootUp (we only compared the performance of regenerating an application without transformations).
An example of a transformation to test would be to log each method call and its return value.
More than finding which solution is best to instrument an application, this would allow us to compare the weaknesses of each tool and find out whether recurring issues in one tool can be solved using a technical solution implemented by another (#eg some applications deliberately include files with names that crash the standard Java zip library).

We tested our method on a subset of recent applications from the dataset of our first contribution.
The results of our dynamic analysis suggest that we failed to correctly explore many applications, hinting at weaknesses in our experimental setup.
Nonetheless, we did obtain some dynamic data, allowing us to pursue our experiment.
We compared the finishing rate of tools on the original and instrumented applications using the same experiment as in our first contribution, and found that, in general, the instrumentation only slightly reduces the finishing rate of analysis tools.
We also confirmed that the instrumentation improves the results of analysis tools, allowing them to compute more comprehensive call graphs of the applications, or to detect new data flows.

/*
*
In this section, we present what, in light of this thesis, we believe to be worthwhile avenues of work to improve the Android reverse engineering ecosystem.

The main issues that appeared in all our work appear to be engineering ones.
The errors we analysed in @sec:rasta showed that even something that should be basic, reading the content of an application, can be challenging.
@sec:cl also showed that reproducing the exact behaviour of Android is more difficult than it seems (in our specific case, it was the class loading algorithm, but we can expect other features to have similar edge cases).
As long as those issues are not solved, we cannot build robust analysis tools.

One avenue that is more research-oriented and should be investigated would be to reuse, for analysis purposes, the code actually used by Android.
For instance, the parsing of #DEX, #APK, and resource files could be done using the same code as the #ART.
This is possible thanks to #AOSP being open-source, and is already partially done by some Android build tools.
However, this is not an easy solution.
Dynamic analyses relying on patched versions of #AOSP have shown that it is difficult to maintain this kind of software over time.
Doing this would require limiting the modifications to the actual source code of Android, to minimise the changes needed at each Android update.
Another obstacle to overcome is decoupling the compilation of the tool from the rest of #AOSP: it is a massive dependency that needs a lot of resources to build.
Having such a dependency would be a barrier to entry, preventing others from modifying or improving the tool.
Should those issues be solved, directly using the code from #AOSP would allow such a tool to stay up to date with Android and limit discrepancies between what Android does and what the tool sees.

An orthogonal solution to this problem of not being able to analyse edge cases is to create a new benchmark that tests the capacity of a tool to handle real-life applications.
Benchmarks are usually targeted at a specific technique (#eg taint tracking), and accordingly test for issues specific to that technique (#eg accurately tracking data that passes through an array).
We suggest using a method similar to the one we used in @sec:rasta to keep the benchmark independent from the tested tools.
Instead of checking the correctness of the tools, this benchmark should test whether a tool is able to finish its analysis.
Applications in this benchmark could either be real-life applications that proved difficult to analyse (for instance, applications that crashed many of the tested tools in @sec:rasta), or hand-crafted applications reproducing corner cases or anti-reverse-engineering techniques encountered while analysing obfuscated applications (for instance, an application with gibberish binary file names inside `META-INF/` that can crash the Jadx zip reader).
The main challenge with such a benchmark is that it would need frequent updates to follow Android's evolution, and it would have to be diverse enough to cover a large spectrum of possible issues.

Lastly, our experience with dynamic analysis led us to believe that there is a need for a new protocol and/or #API for automatic testing.
We think that an #API or protocol that merges and delivers all this information in a structured way would be a significant improvement.

Integrating such a protocol into Android would open interesting perspectives.
For instance, we could imagine Google requiring applications that request critical permissions to provide test inputs with high code coverage (maybe even 100% coverage).
Those tests would incentivise application developers to provide better-quality code for applications handling sensitive data, but also provide a solution to the coverage issue that comes with dynamic analysis.
Requiring high code coverage would force developers to supply solutions for situations that normally require human interaction.
For example, if an application requires the user to authenticate themself, the developer would need to provide a testing account that can then be used for tests and analysis.
Of course, we can expect malicious applications to implement evasion techniques when they detect an analysis following the tests they provided, but code coverage can be checked, and imposing constraints on the coverage of the tests should mitigate evasion.
Those test inputs can then be used to analyse the application dynamically.