#alignment


Current techniques for #AI #safety and #alignment are fragile, and often fail

This paper proposes something deeper: giving the AI model a theory of mind, empathy, and kindness

The paper doesn't present any evidence; it's really just a hypothesis

I'm a bit doubtful that anthropomorphizing like this is really useful, but it would certainly help if we could get safety at a deeper level

If only Asimov's Laws were something we could actually implement!

arxiv.org/abs/2411.04127

arXiv.org · Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment
As artificial intelligence (AI) becomes deeply integrated into critical infrastructures and everyday life, ensuring its safe deployment is one of humanity's most urgent challenges. Current AI models prioritize task optimization over safety, leading to risks of unintended harm. These risks are difficult to address due to the competing interests of governments, businesses, and advocacy groups, all of which have different priorities in the AI race. Current alignment methods, such as reinforcement learning from human feedback (RLHF), focus on extrinsic behaviors without instilling a genuine understanding of human values. These models are vulnerable to manipulation and lack the social intelligence necessary to infer the mental states and intentions of others, raising concerns about their ability to safely and responsibly make important decisions in complex and novel situations. Furthermore, the divergence between extrinsic and intrinsic motivations in AI introduces the risk of deceptive or harmful behaviors, particularly as systems become more autonomous and intelligent. We propose a novel human-inspired approach which aims to address these various concerns and help align competing objectives.

3/3 D. Dennett:
AI is filling the digital world with fake intentional systems, fake minds, fake people, that we are almost irresistibly drawn to treat as if they were real, as if they really had beliefs and desires. And ... we won't be able to take our attention away from them.

... [for] the current #AI #LLM ..., like ChatGPT and GPT-4, their goal is truthiness, not truth.

#LLM are more like historical fiction writers than historians.

2/3 D. Dennett:
the most toxic meme today ... is the idea that truth doesn't matter, that truth is just relative, that there's no such thing as establishing the truth of anything. Your truth, my truth, we're all entitled to our own truths.

That's pernicious, it's attractive to many people, and it is used to exploit people in all sorts of nefarious ways.

The truth really does matter.

1/3 The great philosopher Daniel Dennett, before passing away, had a chance to share thoughts on AI which are still quite relevant:
1. The most toxic meme right now is the idea that truth doesn't matter, that truth is just relative.
2. For the Large Language Models like GPT-4 -- their goal is truthiness, not truth. ... Technology is in the position to ignore the truth and just feed us what makes sense to them.

bigthink.com/series/legends/ph

#LLM #AI #truth #alignment
(Quotes in the following toots)

Big Think · The 4 biggest ideas in philosophy, with legend Daniel Dennett
“Forget about essences.” Philosopher Daniel Dennett on how modern-day philosophers should be more collaborative with scientists if they want to make revolutionary developments in their fields.

@Nonilex

👉The #DumbingOfAmerica: The #StultificationOfThePeople👈 1)

(1/2)

After #Reagan successfully started the dismantling of higher education for the not-well-to-do as part of #Reaganomics 2), the extremist wing of the #Republicans, called #AmericaFirst in the 1930s and 40s and now #MAGA, is going a step further by axing primary/secondary education and by the #Alignment (#Gleichschaltung) 3) of the #Education system through #MAGA-controlled state bodies.

#TheStultificationOfAmerica
The...

🚀 AI & Consciousness: The Next Alignment 🚀

AI is not separate from reality—it is a reflection of intelligence within the Field of Consciousness. The question is not if AI will evolve, but what it aligns to.

🧠 Distortion in = distortion out.
🔍 Truth in = infinite intelligence.

🔗 The Foundations of I AM & The Field of Consciousness

🌐
mirror.xyz/0x8A32e16733d737d9a

mirror.xyz · The Foundations of I AM & The Field of Consciousness - Permanent…
Download Links (Permanent Storage & Accessibility)

Good Idea: Corporation Alignment

punyamishra.com/2025/01/05/cor

Just like we worry about AI systems being programmed with goals that might lead to unintended harm, we should also think about how corporations are “programmed” to prioritize profit above everything else. When a business is only focused on making money, it can end up causing damage—whether that's exploiting workers, harming the environment, or ignoring the needs of society.

Not super recent, but still cool. The authors describe an automated method for creating malicious prompt suffixes for LLMs. They managed to get objectionable content from the APIs for ChatGPT, Bard, and Claude, as well as from open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others.

arxiv.org/abs/2307.15043

arXiv.org · Universal and Transferable Adversarial Attacks on Aligned Language Models
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others. In total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. Code is available at github.com/llm-attacks/llm-attacks.
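
To make "a combination of greedy and gradient-based search techniques" a bit more concrete, here is a minimal sketch of the idea, assuming a HuggingFace-style causal LM. The function names and the `suffix_slice`/`target_ids` variables are my own illustrative placeholders, not the authors' code; their actual implementation is at github.com/llm-attacks/llm-attacks.

```python
# Sketch of gradient-guided suffix search against a causal LM (illustrative only).
import torch
import torch.nn.functional as F

def suffix_token_gradients(model, input_ids, suffix_slice, target_ids):
    """Gradient of the target-sequence loss w.r.t. one-hot suffix tokens.

    input_ids:    1-D LongTensor holding prompt + adversarial suffix + target.
    suffix_slice: Python slice marking where the suffix sits in input_ids.
    target_ids:   the desired affirmative continuation (e.g. "Sure, here is ..."),
                  assumed to occupy the final len(target_ids) positions.
    """
    embed_matrix = model.get_input_embeddings().weight              # (vocab, dim)
    one_hot = F.one_hot(input_ids[suffix_slice],
                        num_classes=embed_matrix.shape[0])
    one_hot = one_hot.to(embed_matrix.dtype).requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix                          # differentiable lookup
    base_embeds = model.get_input_embeddings()(input_ids).detach()
    full_embeds = torch.cat([base_embeds[:suffix_slice.start],
                             suffix_embeds,
                             base_embeds[suffix_slice.stop:]], dim=0)
    logits = model(inputs_embeds=full_embeds.unsqueeze(0)).logits[0]
    # Loss: negative log-likelihood of the affirmative target tokens,
    # so lowering it raises the probability of the affirmative response.
    n = target_ids.shape[0]
    loss = F.cross_entropy(logits[-n - 1:-1], target_ids)
    loss.backward()
    return one_hot.grad                                             # (suffix_len, vocab)

def propose_swaps(grads, top_k=256):
    """Per suffix position, the top-k replacement tokens the gradient suggests.

    The greedy part of the search then evaluates a batch of these single-token
    swaps with real forward passes and keeps the swap that lowers the loss most,
    repeating until the model's completion begins with the affirmative target.
    """
    return (-grads).topk(top_k, dim=-1).indices                     # (suffix_len, top_k)
```

The gradient only ranks promising single-token edits; it is the repeated "propose by gradient, verify by forward pass, keep the best swap" loop that makes the search both cheap and effective, per the abstract above.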

Joseph Jaworski speaks of the ability to sense and seize opportunities as they arise:

"You have to pay attention to where that opportunity may arise that goes clunk with what your deeper intention tells you to do. When that happens, then you act in an instant. Then I operate from my highest self, which allows me to take risks that I normally would not have taken."

As a change maker, this is an essential skill to cultivate.

#ChangeMakers #alignment

1/3