mathstodon.xyz is one of the many independent Mastodon servers you can use to participate in the fediverse.
A Mastodon instance for maths people. We have LaTeX rendering in the web interface!

#backpropagation


NoProp: Training Neural Networks without Back-propagation or Forward-propagation

arxiv.org/abs/2503.24322

arXiv.org: NoProp: Training Neural Networks without Back-propagation or Forward-propagation
The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations -- at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

I read Rumelhart's paper in 1986. It was a stunner. It changed my life.

cs.utoronto.ca/~hinton/absps/n

And I read Vaswani's in 2017. It was a groundbreaking paper. It changed the world.

proceedings.neurips.cc/paper_f

It could be argued that transformers, through their use in , have had far greater impact upon society, compared to backpropagation. On the other hand, there would be no modern , but for backpropagation. So, it's a toss-up.

But to me, Rumelhart's paper is superior to Vaswani's, at least in terms of clarity, concision, coherence, and other indicia of writing style.

On #biological vs #artificialintelligence and #neuralnetworks
Just skimmed through "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation" by Yuhang Song et al. nature.com/articles/s41593-023

Quite interesting but confusing, as I come from #backpropagation DL.
If I got it right, the authors focus on showing how and why biological neural networks would benefit from being Energy Based Models for Predictive Coding, instead of Feedforward Networks employing backpropagation.
I struggled to reach where they explain how to optimize a ConvNet in PyTorch as an EB model, but they do: there is an algorithm and formulae, but I'm curious about how long and stable training is, and whether all that generalizes to typical computer vision architectures (ResNets, MobileNets, ViTs, ...).
Code is also #opensource at github.com/YuhangSong/Prospect

I would like to sit a few hours at my laptop and try to see and understand this better, but I think in the coming days I will move on to Modern #HopfieldNetworks. These too are EB models, with an energy function that is optimised by the #transformer's dot product attention.
I think I got what attention does in Transformers, so I'm quite curious to get in what sense it's equivalent to consolidating/retrieving patterns in a Dense Associative Memory. In general, I think we're treating memory wrong with our deep neural networks. I see most of them as sensory processing, shortcut to "reasoning" without short or long term memory surrogates, but I could see how some current features may serve similar purposes...

Nature: Inferring neural activity before plasticity as a foundation for learning beyond backpropagation - Nature Neuroscience
This paper introduces ‘prospective configuration’, a new principle for learning in neural networks, which differs from backpropagation and is more efficient in learning and more consistent with data on neural activity and behavior.
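The remark above about Modern Hopfield Networks and dot product attention can be made concrete with a small sketch. It assumes the softmax-based retrieval update commonly associated with modern Hopfield networks, in which the state is replaced by a softmax-weighted average of the stored patterns; this has the same shape as attention where the stored patterns serve as both keys and values. The function names, the toy patterns, and the inverse temperature `beta` are illustrative assumptions, not taken from the post or the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(patterns, query, beta=4.0, steps=1):
    """Assumed update: state <- sum_i softmax(beta * <pattern_i, state>)_i * pattern_i.

    This mirrors dot product attention with the stored patterns
    acting as both keys and values, and the state as the query.
    """
    state = list(query)
    for _ in range(steps):
        scores = [beta * sum(p * s for p, s in zip(pattern, state))
                  for pattern in patterns]
        weights = softmax(scores)
        state = [sum(w * pattern[d] for w, pattern in zip(weights, patterns))
                 for d in range(len(state))]
    return state

# Hypothetical stored patterns and a corrupted version of the first one.
patterns = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0, 1.0],
    [1.0, -1.0, -1.0, 1.0],
]
noisy = [0.9, 0.8, -0.2, -1.0]
retrieved = hopfield_retrieve(patterns, noisy)
```

With these toy values the softmax weight concentrates almost entirely on the first pattern, so a single update step already lands very close to it. A smaller `beta` would blend several patterns instead of retrieving one.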

A new type of #neuralnetworks and #AI 1/3

I've been thinking that #backpropagation based neural networks will reach their peak (if they haven't already), and it may be interesting to search for a new learning method. Some observations and ideas:

The two main modes of #neuralnetworks, training (when weights are adjusted) and prediction (when states change), should be merged. After all, real-life brains do prediction and learning at the same time, and they are not restarted for every task. ...

Concept backpropagation: An Explainable AI approach for visualising learned concepts
arxiv.org/abs/2307.12601

* concept detection method (concept backpropagation) for analysing how information representing a concept is internalised in a neural network

* allows visualisation of the detected concept directly in the input space of the model, to see what information the model depends on for representing the described concept

arXiv.org: Concept backpropagation: An Explainable AI approach for visualising learned concepts in neural network models
Neural network models are widely used in a variety of domains, often as black-box solutions, since they are not directly interpretable for humans. The field of explainable artificial intelligence aims at developing explanation methods to address this challenge, and several approaches have been developed over the recent years, including methods for investigating what type of knowledge these models internalise during the training process. Among these, the method of concept detection, investigates which concepts neural network models learn to represent in order to complete their tasks. In this work, we present an extension to the method of concept detection, named concept backpropagation, which provides a way of analysing how the information representing a given concept is internalised in a given neural network model. In this approach, the model input is perturbed in a manner guided by a trained concept probe for the described model, such that the concept of interest is maximised. This allows for the visualisation of the detected concept directly in the input space of the model, which in turn makes it possible to see what information the model depends on for representing the described concept. We present results for this method applied to a various set of input modalities, and discuss how our proposed method can be used to visualise what information trained concept probes use, and the degree as to which the representation of the probed concept is entangled within the neural network model itself.
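The abstract above describes perturbing the model input, guided by a trained concept probe, so that the concept of interest is maximised. Below is a minimal sketch of that general idea only, not the paper's implementation: a frozen one-layer tanh model, a linear probe on its hidden activations, and plain gradient ascent on the input. The weights `W`, the probe `PROBE`, the step size, and all function names are hypothetical.

```python
import math

W = [[1.0, -0.5], [0.3, 0.8]]   # hypothetical frozen model weights (2 inputs, 2 hidden units)
PROBE = [1.0, -1.0]             # hypothetical trained linear concept probe on the hidden layer

def hidden(x):
    """Hidden activations of the frozen model: tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def concept_score(x):
    """Probe output: how strongly the concept is detected for input x."""
    return sum(p * h for p, h in zip(PROBE, hidden(x)))

def input_gradient(x):
    """Gradient of the probe score w.r.t. the input, via the chain rule through tanh."""
    h = hidden(x)
    delta = [p * (1.0 - hj * hj) for p, hj in zip(PROBE, h)]
    return [sum(W[j][i] * delta[j] for j in range(len(W))) for i in range(len(x))]

def maximise_concept(x, steps=20, step_size=0.1):
    """Gradient ascent on the input; model and probe stay fixed."""
    x = list(x)
    for _ in range(steps):
        grad = input_gradient(x)
        x = [xi + step_size * g for xi, g in zip(x, grad)]
    return x

x0 = [0.0, 0.0]
x_star = maximise_concept(x0)
```

Comparing `x_star` with `x0` shows which input directions the probe relies on, which is the kind of input-space visualisation the abstract refers to. A real setup would likely add a constraint that keeps the perturbed input close to the original.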

Extending the Forward Forward Algorithm
arxiv.org/abs/2307.04205

The Forward Forward algorithm (Geoffrey Hinton, 2022-11) is an alternative to backpropagation for training neural networks (NN)

Backpropagation - the most widely successful and used optimization algorithm for training NN - has 3 important limitations ...

Hinton's paper: cs.toronto.edu/~hinton/FFA13.p
Discussion: bdtechtalks.com/2022/12/19/for
...

arXiv.org: Extending the Forward Forward Algorithm
The Forward Forward algorithm, proposed by Geoffrey Hinton in November 2022, is a novel method for training neural networks as an alternative to backpropagation. In this project, we replicate Hinton's experiments on the MNIST dataset, and subsequently extend the scope of the method with two significant contributions. First, we establish a baseline performance for the Forward Forward network on the IMDb movie reviews dataset. As far as we know, our results on this sentiment analysis task mark the first instance of the algorithm's extension beyond computer vision. Second, we introduce a novel pyramidal optimization strategy for the loss threshold - a hyperparameter specific to the Forward Forward method. Our pyramidal approach shows that a good thresholding strategy causes a difference of up to 8% in test error. Lastly, we perform visualizations of the trained parameters and derive several significant insights, such as a notably larger (10-20x) mean and variance in the weights acquired by the Forward Forward network.
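To make the "loss threshold" hyperparameter mentioned above more tangible, here is a minimal sketch of a per-layer Forward Forward objective. It assumes the commonly described formulation: a layer's "goodness" is the sum of its squared activations, and the layer is trained locally so that goodness is above the threshold for positive (real) data and below it for negative data, via a logistic loss. The function names and example activations are hypothetical, and the pyramidal threshold schedule from the paper is not shown.

```python
import math

def goodness(activations):
    """Assumed goodness measure: sum of squared activations of one layer."""
    return sum(a * a for a in activations)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ff_layer_loss(pos_acts, neg_acts, threshold):
    """Local loss for one layer; no gradient flows to or from other layers.

    Low when positive goodness is well above the threshold and
    negative goodness is well below it.
    """
    return (softplus(threshold - goodness(pos_acts))
            + softplus(goodness(neg_acts) - threshold))

# Hypothetical activations: the layer responds strongly to positive data.
well_separated = ff_layer_loss([2.0, 2.0], [0.5, 0.5], threshold=2.0)
# Same activations with the roles swapped: the layer gets it wrong.
badly_separated = ff_layer_loss([0.5, 0.5], [2.0, 2.0], threshold=2.0)
```

Because the threshold sets where the loss changes sign for each data type, moving it changes how hard each layer is pushed, which is one way to read the sensitivity to thresholding reported above.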

Absorbing Phase Transitions in Artificial Deep Neural Networks
arxiv.org/abs/2307.02284

To summarize, we believe that this work places the order-to-chaos transition in the initialized artificial deep neural networks in the broader context of absorbing phase transitions, & serves as the first step toward the systematic comparison between natural/biological & artificial neural networks.
...

arXiv.org: Universal Scaling Laws of Absorbing Phase Transitions in Artificial Deep Neural Networks
We demonstrate that conventional artificial deep neural networks operating near the phase boundary of the signal propagation dynamics, also known as the edge of chaos, exhibit universal scaling laws of absorbing phase transitions in non-equilibrium statistical mechanics. We exploit the fully deterministic nature of the propagation dynamics to elucidate an analogy between a signal collapse in the neural networks and an absorbing state (a state that the system can enter but cannot escape from). Our numerical results indicate that the multilayer perceptrons and the convolutional neural networks belong to the mean-field and the directed percolation universality classes, respectively. Also, the finite-size scaling is successfully applied, suggesting a potential connection to the depth-width trade-off in deep learning. Furthermore, our analysis of the training dynamics under the gradient descent reveals that hyperparameter tuning to the phase boundary is necessary but insufficient for achieving optimal generalization in deep networks. Remarkably, nonuniversal metric factors associated with the scaling laws are shown to play a significant role in concretizing the above observations. These findings highlight the usefulness of the notion of criticality for analyzing the behavior of artificial deep neural networks and offer new insights toward a unified understanding of the essential relationship between criticality and intelligence.

A perspective on #chatGPT (or Large Language Models #LLMs in general): #Hype or milestone?

Rodney Brooks (spectrum.ieee.org/amp/gpt-4-ca) tells us that

What large language models are good at is saying what an answer should sound like, which is different from what an answer should be.

For a nice in-depth technical analysis, see this blog post by Stephen Wolfram (himself!) on "What is ChatGPT Doing ... and Why Does It Work? ". Worth reading -even for non-experts- in a non-trivial effort to make the whole process explainable. The different steps are:

  • #LLMs compute probabilities for the next word. To do this, they aggregate huge datasets of text so that they create a function that, given a sequence of words, computes for all possible words in the dictionary the probability that adding this new word is statistically congruent with past words. Interestingly, this probability, conditioned on what has been observed so far, falls off as a power law, just like the global probability of words in the dictionary,

  • These #probabilities are computed by a function that leans on the dataset to generate the best approximation. Wolfram makes a minute description of how to do such an approximation, starting from linear regression to using non-linearities. This leads to deep learning methods and their potential for universal function approximators,

  • Crucial is how these #models are trainable, in particular by way of #backpropagation. This leads the author to describe the process, but also to point out some limitations of the trained model, especially, as you might have guessed, compared to potentially more powerful systems, like #cellularautomata of course...

  • This now brings us to #embeddings, the crucial ingredient to define "words" in these #LLMs. To relate "alligator" to "crocodile" vs. a "vending machine," this technique computes distances between words based on their relative distance in a large text corpus, so that each word is assigned an address in a high-dimensional space, with the intuition that words that are semantically closer should be closer in the embedding space. It is highly non-trivial to understand the geometry of high-dimensional spaces - especially when we try to relate it to our physical 3D space - but this technique has proven to give excellent results. I highly recommend the #cemantix puzzle to test your intuition about word embeddings: cemantle.certitudes.org

  • Finally, these different parts are glued together by a humongous #transformer network. A standard #NeuralNetwork could perform a computation to predict the probabilities for the next word, but the results would mostly give nonsensical answers... Something more is needed to make this work. Just as traditional Convolutional Neural Networks #CNNs hardwire the fact that operations applied to an image should be applied to nearby pixels first, transformers do not operate uniformly on the sequence of words (i.e., embeddings), but weight them differently to ultimately get a better approximation. It is clear that much of the mechanism is a bunch of heuristics selected based on their performance - but we can understand the mechanism as giving different weights to different tokens - specifically based on the position of each token and its importance in the meaning of the current sentence. Based on this calculation, the sequence is reweighted so that a probability is ultimately computed. When applied to a sequence of words where words are added progressively, this creates a kind of loop in which the past sequence is constantly re-processed to update the generation.

  • Can we do more and include syntax? Wolfram discusses the internals of #chatGPT, and in particular how it was trained to "be a good bot" - and adds another possibility, which is to inject the knowledge that language is organized grammatically, and asks whether #transformers are able to learn such rules. This points to certain limitations of the architecture and the potential of using graphs as a generalization of geometric rules. The post ends with a comparison of #LLMs, which just aim to sound right, with rule-based models, a debate reminiscent of the older days of AI...
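The steps listed above (embeddings, attention-style reweighting, next-word probabilities) can be strung together in a deliberately tiny sketch. Everything here is an assumption made for illustration: the 2-dimensional embedding values are invented, there are no learned projections or positional information, and the same vectors double as keys, values and output embeddings. It is a toy stand-in for the idea, not how ChatGPT or any real transformer is implemented.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax: turns scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity, a common closeness measure in embedding space."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical toy embeddings; real models learn hundreds of dimensions.
embeddings = {
    "alligator": [0.9, 0.1],
    "crocodile": [0.8, 0.2],
    "vending": [-0.7, 0.6],
}

def attend(context_vectors):
    """Scaled dot-product attention with the last token as the query."""
    query = context_vectors[-1]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in context_vectors])
    return [sum(w * vec[d] for w, vec in zip(weights, context_vectors))
            for d in range(len(query))]

def next_word_probs(context_words):
    """Reweight the context, score every vocabulary word, normalise."""
    context = attend([embeddings[w] for w in context_words])
    vocab = list(embeddings)
    probs = softmax([dot(context, embeddings[w]) for w in vocab])
    return dict(zip(vocab, probs))

probs = next_word_probs(["alligator", "crocodile"])
```

Generation would then sample a word from `probs`, append it to the context, and repeat, which is the loop described in the transformer bullet above.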

IEEE Spectrum: Just Calm Down About GPT-4 Already. By Glenn Zorpette

Hinton is best known for an algorithm called #BackPropagation, which he first proposed with two colleagues in the 1980s. The technique, which allows artificial #NeuralNetworks to learn, today underpins nearly all #MachineLearning models. In a nutshell, backpropagation is a way to adjust the connections between artificial neurons over and over until a neural network produces the desired output.

Deep learning pioneer #GeoffreyHinton quits #Google | #AI
technologyreview.com/2023/05/0

MIT Technology Review: Deep learning pioneer Geoffrey Hinton quits Google. By Will Douglas Heaven
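The quoted description of backpropagation as adjusting connections "over and over until a neural network produces the desired output" can be illustrated with a minimal sketch: a network with one tanh hidden unit and one linear output, trained on a single example by applying the chain rule by hand. The starting weights, learning rate, input and target are arbitrary assumptions chosen for illustration.

```python
import math

def train_tiny_network(steps=500, lr=0.1):
    """Gradient descent on y = w2 * tanh(w1 * x) for one training example."""
    x, target = 1.0, 0.5          # hypothetical training example
    w1, w2 = 0.5, -0.3            # hypothetical initial connection weights
    losses = []
    for _ in range(steps):
        # Forward pass: compute the output from the current weights.
        h = math.tanh(w1 * x)
        y = w2 * h
        losses.append(0.5 * (y - target) ** 2)

        # Backward pass: propagate the error from the output to each weight.
        dy = y - target                   # dLoss/dy
        dw2 = dy * h                      # dLoss/dw2
        dh = dy * w2                      # dLoss/dh, error sent back to the hidden unit
        dw1 = dh * (1.0 - h * h) * x      # dLoss/dw1, through the tanh derivative

        # Adjust the connections a little, then repeat.
        w1 -= lr * dw1
        w2 -= lr * dw2
    return w1, w2, losses

w1, w2, losses = train_tiny_network()
```

The key step is `dh = dy * w2`: the output error is passed backwards through the second weight so that the first weight can be adjusted too, which is what distinguishes backpropagation from updating only the last layer.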

I was an electrical engineering student in college when Rumelhart published his seminal paper on backpropagation in 1986. It was a game changer for the community and a life changer for me.

Over the past four decades, I have lived through a few cycles of AI Seasons—both Winters and Springs. During that time, I have observed these disturbingly recurrent patterns: collectively, we tend to over promise and under deliver; the community tends to breed practitioners who are oblivious to our origin history and our foundational theories; these practitioners tend to use the technologies they do not fully grasp, relying exclusively on the raw input data and the apparently satisfactory results, never asking "why" and "how".

In the past, we had tiny computers, scant data, weak learning algorithms, and few practitioners. Today, however, we have massive compute clouds, seemingly inexhaustible amounts of data, powerful learning algorithms, and all techies and their grandmama are AI practitioners. So, unlike in the past, if we AI today, we will do immense harm to humanity. We must establish industry-wide guidelines.

Is #symbolic #reasoning a wall or hurdle for #deeplearning? In other words, is #backpropagation of errors via differentiable functions the only mechanism for #intelligence? If another mechanism is needed couldn’t it simply be learned by deep learning?

If deep learning were able to learn a whole new mechanism then this mechanism would work on its own as an independent system. But this contradicts the premise of a single mechanism.

noemamag.com/what-ai-can-tell-

NOEMA: What AI Can Tell Us About Intelligence
Can deep learning systems learn to manipulate symbols? The answers might change our understanding of how intelligence works and what makes humans unique.

Through scaling #DeepNeuralNetworks we have found in two different domains, #ReinforcementLearning and #LanguageModels, that these models learn to learn (#MetaLearning).

They spontaneously learn internal models with memory and learning capability which are able to exhibit #InContextLearning much faster and much more effectively than any of our standard #backpropagation based deep neural networks can.

These rather alien #LearningModels embedded inside the deep learning models are emulated by #neuron layers, but aren't necessarily deep learning models themselves.

I believe it is possible to extract these internal models which have learned to learn, out of the scaled up #DeepLearning #substrate they run on, and run them natively and directly on #hardware.

This allows those much more efficient learning models to be used either as #LearningAgents themselves, or as a further substrate for further meta-learning.

I have ongoing #embodiment #research with a related goal, focused specifically on extracting (or distilling) the models out of the meta-models, here:
github.com/keskival/embodied-e

It is of course an open research problem how to do this, but I have a lot of ideas!

If you're inspired by this, or if you think the same, let's chat!

GitHub: keskival/embodied-emulated-personas: A project space for Embodied Emulated Personas - Embodied neural networks trained by LLM chatbot teachers