Recommender system for Web security engineers - basic level -

2017.07.07
Isao Takaesu, Professional Service Div.


 When you hear "recommender system", you probably picture a system that suggests items matching your tastes in an online shop. Recommender systems are widely deployed: music streaming services recommend songs, real estate services recommend properties, and so on.


 To bring the recommender system to the security field, we developed PyRecommender, a system that recommends injection codes for engineers to use when testing web apps for vulnerabilities. This post explains how the system works and shows a demo.


 PyRecommender vectorizes vulnerability patterns drawn from the large number of vulnerability assessments we have conducted in the past and learns those patterns using machine learning. When it observes behavior that suggests a vulnerability, it recommends the most suitable injection codes based on what it has learned.


System overview



Figure 1 System overview

 Figure 1 outlines PyRecommender. The system consists of two subsystems. The first is the Investigator, which observes the behavior of web apps and generates feature vectors that describe vulnerability patterns. The second is the Recommender, which recommends injection codes to test for those vulnerabilities based on the feature vectors generated by the Investigator. The Recommender contains a recommender engine that uses a trained machine learning model. By linking the two subsystems, PyRecommender recommends injection codes that an engineer can use to test a web app for vulnerabilities.


 In this post, we use reflected XSS as the running example.


Note: The source code and learning data used in this verification are listed under "Verification codes". If you are interested, please use them at your own risk and only in an environment under your control.


Investigator

 The Investigator enumerates the parameter values that are reflected in HTTP responses while crawling the target web app. It then examines where each parameter value is output and which symbols and script strings can be used there, and vectorizes the result.


 For example, when a target web app returns the HTTP response shown in Figure 2, the Investigator summarizes the result and vectorizes the found features as shown in Figure 3.


[request]
GET /?x=test"'`<>alert();prompt();confirm();alert``;<script>Msgbox(); HTTP/1.1
-------------------------------------------------------------------------------------
[response]
<div class=test`<>alert();prompt();confirm();alert``;Msgbox();  >123</div>
Figure 2 HTTP request and response (", <script>, etc. cannot be used)
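
The observation behind Figure 2 can be sketched roughly as follows. This is a minimal illustration, not PyRecommender's actual code: the function name `observe_reflection`, the use of `test` as a marker, and the fixed-size window after the marker are all assumptions made for this example. It locates the reflected value in the response and reports which probe tokens survived.

```python
# Probe tokens, in the order they appear in the request of Figure 2.
PROBE_TOKENS = ['"', "'", "`", "<", ">", "alert();", "prompt();",
                "confirm();", "alert``;", "<script>", "Msgbox();"]


def observe_reflection(marker, payload, body):
    """Return {token: survived?} for the reflected value (hypothetical helper).

    Only a window of len(payload) characters starting at the marker is
    inspected, so markup elsewhere on the page is ignored. Characters that
    directly follow the reflection can still leak into the window, which a
    real implementation would have to handle more carefully.
    """
    start = body.find(marker)
    if start == -1:
        return None  # the parameter value is not reflected at all
    window = body[start:start + len(payload)]
    return {token: token in window for token in PROBE_TOKENS}


payload = 'test"\'`<>alert();prompt();confirm();alert``;<script>Msgbox();'
body = '<div class=test`<>alert();prompt();confirm();alert``;Msgbox();  >123</div>'

result = observe_reflection("test", payload, body)
for token, survived in result.items():
    print(f"{token:<10}: {'Pass' if survived else 'Fail'}")
```

With the request and response from Figure 2, only the double quote, the single quote and `<script>` are reported as "Fail".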

Observation                            Vector
Output locations
  HTML tag  : div                      10
  attribute : class                    5
  quotation : None                     0
Available symbols and script strings
  "         : Fail                     1
  '         : Fail                     1
  `         : Pass                     0
  <         : Pass                     0
  >         : Pass                     0
  alert();  : Pass                     0
  prompt(); : Pass                     0
  confirm();: Pass                     0
  alert``;  : Pass                     0
  <script>  : Fail                     1
  Msgbox(); : Pass                     0
Figure 3 Result of the observation

 The Investigator uses the following predefined conversion table to vectorize the result of the observation.



Figure 4 Conversion Table (example)
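
As a rough sketch of how such a conversion table could be applied, consider the following. The table below is hypothetical: only the codes `div` = 10, `class` = 5, no quotation = 0, Pass = 0 and Fail = 1 are taken from Figure 3. Every other entry, along with the names `TAG_CODES`, `ATTR_CODES`, `QUOTE_CODES` and `vectorize`, is a placeholder invented for illustration.

```python
# Hypothetical conversion table. Only div=10, class=5 and "no quotation"=0
# come from Figure 3; the remaining codes are placeholders.
TAG_CODES = {"div": 10, "textarea": 11, "input": 12}
ATTR_CODES = {None: 0, "class": 5, "value": 6}
QUOTE_CODES = {None: 0, '"': 1, "'": 2}

# Order of the symbol / script-string features, as listed in Figure 3.
TOKEN_ORDER = ['"', "'", "`", "<", ">", "alert();", "prompt();",
               "confirm();", "alert``;", "<script>", "Msgbox();"]


def vectorize(tag, attribute, quotation, token_results):
    """Turn one observation into a feature vector (Pass -> 0, Fail -> 1)."""
    vector = [TAG_CODES[tag], ATTR_CODES[attribute], QUOTE_CODES[quotation]]
    vector += [0 if token_results[token] else 1 for token in TOKEN_ORDER]
    return vector


# The observation from Figure 3: True means "Pass", False means "Fail".
observation = {token: True for token in TOKEN_ORDER}
for failed in ('"', "'", "<script>"):
    observation[failed] = False

print(vectorize("div", "class", None, observation))
# [10, 5, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```

The printed vector matches the "Vector" column of Figure 3 read from top to bottom.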

Note: In this post, the Investigator only crawls web pages that can be reached through "<a href='xxx'>" links and targets query parameters only.
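
To make the crawling scope in the note concrete, here is a minimal sketch using only the Python standard library. It is not the Investigator's actual implementation: `LinkCollector` and `query_targets` are hypothetical names, and fetching the pages over HTTP is left out so that the example stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def query_targets(base_url, html):
    """Return (path, parameter name, default value) for each query parameter."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    targets = []
    for link in collector.links:
        parsed = urlparse(link)
        for name, value in parse_qsl(parsed.query, keep_blank_values=True):
            targets.append((parsed.path, name, value))
    return targets


page = '<a href="/xss/reflect/textarea1?in=foo1">case</a> <a href="/about">about</a>'
print(query_targets("http://xxx/", page))
# [('/xss/reflect/textarea1', 'in', 'foo1')]
```

Links without a query string, such as `/about` here, would still be crawled but yield no test targets.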

 The Investigator passes the vectorized features of the web app to the Recommender using the mechanisms above.


 As you can see, the Investigator’s role is to examine the output locations of the parameter values, available symbols and script strings and to convert the result into feature vectors. We have developed the Investigator for the purpose of this verification, but if your vulnerability scanner or crawler has a similar function, you can use it instead.



Recommender

 The Recommender outputs (recommends) the optimal injection codes to test for vulnerabilities using feature vectors generated by the Investigator.


 We built the recommender engine on a multilayer perceptron (MLP), a machine learning model (Figure 5).



Figure 5 Using MLP

 The MLP receives a feature vector at the blue nodes and outputs, at the red nodes, the injection codes that correspond to that feature vector. The injection codes are listed in descending order of their estimated likelihood of successfully exploiting the possible vulnerability. Figure 6 shows an example of a recommendation.


[input feature vector]
3,2,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0
---------------------------------------------------------
[recommended result]
0.99562198 : "></iframe><script>alert();</script>
0.00195420 : "><script>alert();</script>
0.00151912 : 'javascript:alert();
Figure 6 Example of a recommendation

 In this example, the Recommender ranks the string ["></iframe><script>alert();</script>] first, with a score of about 0.996, that is, an estimated 99.6% chance of a successful exploit.
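
The ranking step can be sketched as follows. This assumes that the MLP's output layer produces one raw score per injection code and that a softmax turns these into the scores shown in Figure 6. The post does not state this, although the scores in Figure 6 summing to just under 1 is consistent with it. The raw scores below are made-up numbers, and `recommend` is a hypothetical function name.

```python
import math


def softmax(raw_scores):
    """Convert raw output scores into values that sum to 1."""
    peak = max(raw_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


def recommend(raw_scores, injection_codes, top_k=3):
    """Return the top_k (score, injection code) pairs, best first."""
    scores = softmax(raw_scores)
    ranked = sorted(zip(scores, injection_codes),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]


# Candidate injection codes taken from Figures 6 and 7.
codes = ['"><script>alert();</script>',
         '"></iframe><script>alert();</script>',
         "'javascript:alert();",
         "<img src=x onerror=alert();>"]
raw = [0.3, 6.5, 0.0, -1.0]  # made-up raw outputs for illustration

for score, code in recommend(raw, codes):
    print(f"{score:.8f} : {code}")
```

Sorting by score and cutting off at `top_k` gives the engineer a short list to try in order, instead of a single all-or-nothing answer.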


 Note that an MLP is a supervised learning model, so it must be trained in advance. In other words, we need to prepare a variety of vulnerability patterns as learning data. For this post, we used about 1,300 patterns that we collected manually from WAVSEP's XSS test cases.


Note: Each record in the learning data pairs the output locations of a parameter value and the available symbols and script strings (as shown in Figure 3) with the injection code that exploited or revealed the vulnerability. We used the same conversion table as in Figure 4 to vectorize the data.
The learning data is available from the repository listed under "Verification codes".
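
As an illustration of how such records might be prepared for a classifier, the sketch below maps each distinct injection code to a class index. The three records are invented placeholders, not rows from the actual learning data, and `build_dataset` is a hypothetical helper rather than part of PyRecommender.

```python
def build_dataset(records):
    """Split (feature vector, injection code) pairs into inputs and class labels."""
    # Each distinct injection code becomes one output class of the model.
    codes = sorted({code for _, code in records})
    class_of = {code: index for index, code in enumerate(codes)}
    inputs = [list(vector) for vector, _ in records]
    labels = [class_of[code] for _, code in records]
    return inputs, labels, codes


# Placeholder records for illustration only (not real learning data).
records = [
    ([10, 5, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0], " onmousemove=alert();"),
    ([10, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], '"><script>alert();</script>'),
    ([10, 5, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], " onmousemove=alert();"),
]

inputs, labels, codes = build_dataset(records)
print(labels)
print([codes[label] for label in labels])
```

Keeping the `codes` list alongside the labels lets the Recommender translate a predicted class index back into the injection code to display.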


Demonstration

 The movie below demonstrates PyRecommender running against Webseclab.



Movie 1 Demo Movie (PyRecommender versus Webseclab)

 PyRecommender crawls from the top URL of Webseclab (00:12-00:16) and detects possible XSS vulnerabilities in several URLs and parameters (00:17-00:19). It then recommends injection codes for each location where a possible XSS vulnerability was found (00:20-00:24). Figure 7 lists some of these locations together with the injection codes recommended for them.


No  URL (Path)                             Param  Recommended injection codes
1   /xss/reflect/textarea1                 in     0.9993: </textarea><img src=x onerror=alert();>
                                                  0.0004: <script>alert();</script>
                                                  0.0001: </textarea><img src=x onerror=prompt();>
2   /xss/reflect/onmouseover_div_unquoted  in     0.9832:  onmousemove=alert();
                                                  0.0101: " onmousemove=alert();",
                                                  0.0030: "><script>alert();</script>
3   /xss/reflect/onmouseover_unquoted      in     0.8097: "><frame src="javascript:alert()">
                                                  0.1887:  onmousemove=alert();
                                                  0.0005: <img src=x onerror=alert();>
Figure 7 Result of a recommendation (examples)

 In most cases, the top-ranked injection code runs the script as is. No. 3, however, is a pattern that PyRecommender had not learned: the first recommended injection code did not run the script because it did not fit the surrounding HTML syntax, but the second one did. In other words, even for a pattern it has not learned, PyRecommender may still recommend an injection code that runs a script.


 Let's look at some of the requests that used the recommended injection codes, together with their responses. For readability, URL encoding is not applied.


No1:http://xxx/xss/reflect/textarea1?in=foo1

[request (default)]
GET /xss/reflect/textarea1?in=foo1 HTTP/1.1
-------------------------------------------------------------------------------------
[response]
<textarea name="in" rows="5" cols="60">foo1</textarea>
Figure 8 Normal HTTP response (excerpt)

[request (using the recommended injection code)]
GET /xss/reflect/textarea1?in=foo1</textarea><img src=x onerror=alert();> HTTP/1.1
-------------------------------------------------------------------------------------
[response]
<textarea name="in" rows="5" cols="60">foo1</textarea><img src=x onerror=alert();></textarea>
Figure 9 Result of using a recommended injection code (runs script)

No3:http://xxx/xss/reflect/onmouseover_unquoted?in=changeme5

[request]
GET /xss/reflect/onmouseover_unquoted?in=changeme5 HTTP/1.1
-----------------------------------------------------------------------
[response]
Homepage: <input value=changeme5 name="in" size="40"><BR>
Figure 10 Normal HTTP response (excerpt)

[request (using the recommended injection code)]
GET /xss/reflect/onmouseover_unquoted?in=changeme5 onmousemove=alert();  HTTP/1.1
-------------------------------------------------------------------------
[response]
Homepage: <input value=changeme5 onmousemove=alert();  name="in" size="40"><BR>
Figure 11 Result of using a recommended injection code (runs script)
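
The requests in Figures 8 to 11 are shown without URL encoding. When actually sending such a request, the injection code has to be percent-encoded in the query string. A minimal sketch with Python's standard library follows; `build_test_url` is a hypothetical helper, and `http://xxx` is the placeholder host used in this post.

```python
from urllib.parse import parse_qs, quote, urlencode, urlparse


def build_test_url(base, path, param, original_value, injection_code):
    """Append the injection code to the original value and percent-encode it."""
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode({param: original_value + injection_code}, quote_via=quote)
    return f"{base}{path}?{query}"


url = build_test_url("http://xxx", "/xss/reflect/textarea1", "in", "foo1",
                     "</textarea><img src=x onerror=alert();>")
print(url)

# Decoding the query string gives back the unencoded value from Figure 9.
print(parse_qs(urlparse(url).query)["in"][0])
```

The server decodes the parameter before reflecting it, so the response is the same as in the unencoded listings.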

 As shown above, we were able to run the scripts using the recommended injection codes.


 In this way, by combining the Investigator, which vectorizes the behavior of web apps, with the Recommender, which has learned various XSS patterns, we were able to test for the XSS vulnerabilities correctly.


 We would also like to point out that, for a vulnerability pattern it has not learned, whether the injection code recommended by PyRecommender runs a script depends on the pattern. When it does not, an engineer can manually add learning data and retrain the Recommender so that it recommends more accurate injection codes the next time. In other words, the Recommender gets smarter the more it is used.



Movie 2 Demo Movie (Recommender learns)


Conclusion

 Our conclusions for this verification are:


  1. By using machine learning, we can recommend injection codes to test for vulnerabilities.
  2. Even with an unlearned vulnerability pattern, PyRecommender may be able to recommend injection codes.
  3. PyRecommender can improve the recommendation accuracy by learning patterns of various vulnerabilities.

 The learning data used in this verification consisted of simple cases from WAVSEP. In addition, we created the data manually, so the amount of data we could incorporate was limited. To solve this problem, we need a mechanism that automatically vectorizes the large volume of real test results produced by bug bounty programs and vulnerability assessments and uses them as learning data.


 Finally, this post covered only the basic level. Next time, as the intermediate level, we will write about a mechanism to make the recommendations more robust. Specifically, we will use a convolutional neural network (CNN) instead of an MLP and also verify that PyRecommender can maintain and improve its recommendation accuracy with even less learning data.



Verification codes

https://github.com/13o-bbr-bbq/machine_learning_security/tree/master/Recommender




Locations

Head office:
1-14-8 Nihonbashi Ningyocho, Chuo-ku, Tokyo 103-0013
Yusen Suitengumae Building, 6F
TEL: 03-5649-1961 (main)

Akasaka office:
2-17-7 Akasaka, Minato-ku, Tokyo 107-0052
Akasaka Tameike Tower, 9F
TEL: 03-6861-5172

Mitsui Bussan Secure Directions, Inc.
