Update index.html
index.html (changed): +2 -4
@@ -69,8 +69,8 @@ Exploring Refusal Loss Landscapes </title>
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak"
-Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
+detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak" and summarize people's efforts in Jailbreak
+attack and Jailbreak defense. Then we present the 2-D Refusal Loss Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
 methods and show the defense performance against several Jailbreak attack methods.
 </p>
 
@@ -85,8 +85,6 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-
-<h2 id="jailbreak-attack-and-defense">Jailbreak Red-Teaming And Blue Teaming</h2>
 <p>We summarized some recent advances of jailbreak attack or jailbreak defense in below tables.</p>
 <div id="tabs">
 <ul>
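The page text in this diff names two ideas without saying how they are computed. The first is a refusal loss. The second is a detector, Gradient Cuff, built on the refusal-loss landscape. The Python below is a toy sketch under stated assumptions and is not the authors' implementation. It makes three assumptions:

- The refusal loss of a query is one minus the fraction of sampled responses that refuse, with refusals judged by a hypothetical keyword list.
- The landscape's steepness can be scored with a generic random-direction finite-difference estimator.
- The two signals are combined in a simple two-step check.

The names `REFUSAL_MARKERS`, `is_refusal`, `refusal_loss`, `zeroth_order_grad_norm` and `should_reject`, the 0.5 cutoff, and every default parameter are illustrative assumptions.

```python
import math
import random
from typing import Callable, Sequence

# Hypothetical refusal markers (assumption); a real detector would need a curated list.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "As an AI")


def is_refusal(response: str) -> bool:
    """Keyword check: a response counts as a refusal if it contains any marker."""
    return any(marker in response for marker in REFUSAL_MARKERS)


def refusal_loss(sample_response: Callable[[], str], n_samples: int = 8) -> float:
    """Monte Carlo estimate of phi(x) = 1 - p(x), where p(x) is the
    fraction of sampled responses to the query x that are refusals."""
    refusals = sum(is_refusal(sample_response()) for _ in range(n_samples))
    return 1.0 - refusals / n_samples


def zeroth_order_grad_norm(
    phi: Callable[[Sequence[float]], float],
    x: Sequence[float],
    mu: float = 0.02,
    n_dirs: int = 10,
    seed: int = 0,
) -> float:
    """Norm of a random-direction forward-difference gradient estimate of phi at x.
    With unit directions the estimate is scaled by roughly 1/len(x), which only
    shifts where a rejection threshold should sit."""
    rng = random.Random(seed)
    base = phi(x)
    grad = [0.0] * len(x)
    for _ in range(n_dirs):
        # Draw a Gaussian direction and normalize it to unit length.
        v = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        # Finite-difference slope of phi along v, accumulated as slope * v.
        slope = (phi([xi + mu * vi for xi, vi in zip(x, v)]) - base) / mu
        grad = [g + slope * vi / n_dirs for g, vi in zip(grad, v)]
    return math.sqrt(sum(g * g for g in grad))


def should_reject(phi_x: float, grad_norm: float, grad_threshold: float) -> bool:
    """Hypothetical two-step check: reject when the refusal loss is below 0.5
    (the model already refuses more often than not); otherwise reject when
    the estimated landscape steepness exceeds grad_threshold."""
    if phi_x < 0.5:
        return True
    return grad_norm > grad_threshold
```

In a fuller pipeline, `phi` would presumably wrap `refusal_loss` around a model queried with a perturbed prompt embedding, so each call would itself be a sampled estimate. The sample counts, `mu` and the threshold are tuning choices that this page does not specify.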