I read #Rumelhart's #backpropagation paper in 1986. It was a stunner. It changed my life.
http://www.cs.utoronto.ca/~hinton/absps/naturebp.pdf
And I read #Vaswani's #transformer in 2017. It was a groundbreaking paper. It changed the world.
It could be argued that transformers, through their use in #LLMs, have had a far greater impact on society than backpropagation has. On the other hand, there would be no modern #ML but for backpropagation. So it's a toss-up.
But to me, Rumelhart's paper is superior to Vaswani's, at least in terms of clarity, concision, coherence, and other indicia of writing style.
Today's lesson in #machinelearning : you can't analytically differentiate a physical process.
#backpropagation failed.
Trying to apply reinforcement learning to control a cooling fan now.
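Roughly the shape of what I'm trying, as a toy sketch only - the plant model, reward, and all the numbers below are made-up stand-ins for the real rig, which can only be observed, not differentiated:

```python
# Minimal sketch: model-free policy gradient (REINFORCE) for a fan controller.
# The "plant" below is a toy stand-in for the real hardware.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)          # policy weights: duty = sigmoid(theta . [temp_error, 1])
alpha = 0.05                 # learning rate
setpoint = 45.0              # target temperature (made-up number)

def plant_step(temp, duty):
    """Toy thermal model: heating minus fan cooling plus noise (not the real fan)."""
    return temp + 1.0 - 2.0 * duty + rng.normal(0.0, 0.1)

def policy(temp, noise_scale=0.1):
    err = (temp - setpoint) / 10.0
    mean_duty = 1.0 / (1.0 + np.exp(-(theta[0] * err + theta[1])))
    duty = np.clip(mean_duty + rng.normal(0.0, noise_scale), 0.0, 1.0)
    return duty, err, mean_duty

for episode in range(500):
    temp, grads, rewards = 50.0, [], []
    for t in range(50):
        duty, err, mean_duty = policy(temp)
        temp = plant_step(temp, duty)
        # reward: stay near the setpoint, lightly penalise fan effort
        rewards.append(-abs(temp - setpoint) - 0.1 * duty)
        # score-function gradient of a Gaussian policy around mean_duty (up to a constant)
        d_mean = (duty - mean_duty) * mean_duty * (1.0 - mean_duty)
        grads.append(d_mean * np.array([err, 1.0]))
    ret = np.sum(rewards)
    theta += alpha * ret * np.mean(grads, axis=0)   # crude REINFORCE update
```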
On #biological vs #artificialintelligence and #neuralnetworks
Just skimmed through "Inferring neural activity before plasticity as a foundation for learning beyond backpropagation" by Yuhang Song et al. https://www.nature.com/articles/s41593-023-01514-1
Quite interesting but confusing, as I come from #backpropagation DL.
If I got it right, the authors focus on showing how and why biological neural networks would benefit from being Energy Based Models for Predictive Coding, instead of Feedforward Networks employing backpropagation.
It took me a while to reach the part where they explain how to optimize a ConvNet in PyTorch as an EB model, but they do: there is an algorithm and formulae. I'm still curious about how long and how stable training is, and whether all that generalizes to typical computer vision architectures (ResNets, MobileNets, ViTs, ...).
Code is also #opensource at https://github.com/YuhangSong/Prospective-Configuration
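To fix the idea for myself, here is a generic predictive-coding sketch: clamp input and target, relax the hidden activities to minimise an energy, then update the weights from the local errors. This is my own toy linear version of the energy-based idea, not the paper's prospective-configuration algorithm:

```python
# Generic predictive-coding sketch (linear, 2 layers): inference relaxes the
# hidden activities on an energy, learning then uses only local errors.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 2
W1 = rng.normal(0, 0.1, (d_hid, d_in))
W2 = rng.normal(0, 0.1, (d_out, d_hid))
lr_x, lr_w = 0.1, 0.01

for step in range(1000):
    x0 = rng.normal(size=d_in)              # toy input
    target = np.array([x0.sum(), x0[0]])    # toy target
    x1 = W1 @ x0                            # initialise hidden at its feedforward value
    x2 = target                             # clamp output to the target
    for _ in range(20):                     # inference: relax hidden activities
        e1, e2 = x1 - W1 @ x0, x2 - W2 @ x1
        x1 -= lr_x * (e1 - W2.T @ e2)       # gradient of the energy w.r.t. x1
    e1, e2 = x1 - W1 @ x0, x2 - W2 @ x1     # learning: local, Hebbian-like updates
    W1 += lr_w * np.outer(e1, x0)
    W2 += lr_w * np.outer(e2, x1)
```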
I would like to sit at my laptop for a few hours and try to understand it better, but I think in the next few days I will move on to Modern #HopfieldNetworks. These too are EB, with an energy function that is optimised by the #transformer 's dot-product attention.
I think I understand what attention does in Transformers, so I'm quite curious to see in what sense it's equivalent to consolidating/retrieving patterns in a Dense Associative Memory. In general, I think we're treating memory wrong in our deep neural networks. I see most of them as sensory processing, a shortcut to "reasoning" without short- or long-term memory surrogates, though I can see how some current features may serve similar purposes...
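To make that correspondence concrete for myself, a tiny sketch of retrieval in a modern (continuous) Hopfield network: one update step has exactly the form of dot-product attention over the stored patterns (the patterns and numbers below are toy choices of mine):

```python
# Retrieval in a modern Hopfield network: each update is attention with the
# stored patterns acting as both keys and values.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))                    # 16 stored patterns ("memories"), dim 64
query = X[3] + 0.3 * rng.normal(size=64)         # noisy cue of pattern 3
beta = 4.0                                       # inverse temperature / sharpness

for _ in range(3):                               # a few retrieval iterations
    query = X.T @ softmax(beta * (X @ query))    # == attention(query, keys=X, values=X)

print(np.argmax(X @ query))                      # should recover pattern 3
```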
Neural Networks
(1991) : Freeman, James A. Skapura, Dav...
isbn: 0201513765
#text_book #neural_network #simulated_annealing #algorithm #backpropagation #my_bibtex
A new type of #neuralnetworks and #AI 1/3
I've been thinking that #backpropagation-based neural networks will reach their peak (if they haven't already), and it may be interesting to search for a new learning method. Some observations and ideas:
The two main modes of #neuralnetworks - training, when weights are adjusted, and prediction, when states change - should be merged. After all, real-life brains do prediction and learning at the same time, and they are not restarted for every task. ...
Concept backpropagation: An Explainable AI approach for visualising learned concepts
https://arxiv.org/abs/2307.12601
* concept detection method (concept backpropagation) for analysing how information representing a concept is internalised in a neural network
* allows visualisation of the detected concept directly in the input space of the model, to see what information the model depends on for representing the described concept
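My own rough reading of how such a method could look: a concept probe on hidden activations, plus gradients propagated back into input space. The model, probe, and objective below are my assumptions for illustration, not the paper's exact formulation:

```python
# Generic sketch: backpropagate a concept probe's score to the input to
# visualise what input information expresses the concept.
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())  # stand-in model
probe = nn.Linear(128, 1)      # concept probe (assumed already fitted on activations)

x = torch.rand(1, 1, 28, 28, requires_grad=True)   # start from noise (or a real image)
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    concept_score = probe(backbone(x))              # how strongly the concept is expressed
    loss = -concept_score.mean() + 1e-3 * x.norm()  # maximise concept, keep the input small
    loss.backward()                                 # gradients flow back into x
    opt.step()

# x.detach() can then be displayed as the concept visualisation in input space
```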
Extending the Forward Forward Algorithm
https://arxiv.org/abs/2307.04205
The Forward Forward algorithm (Geoffrey Hinton, 2022-11) is an alternative to backpropagation for training neural networks (NNs); a rough sketch of its layer-local objective is included below.
Backpropagation - the most widely used and successful optimization algorithm for training NNs - has 3 important limitations ...
Hinton's paper: https://www.cs.toronto.edu/~hinton/FFA13.pdf
Discussion: https://bdtechtalks.com/2022/12/19/forward-forward-algorithm-geoffrey-hinton
...
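For my own notes, a minimal sketch of the layer-local "goodness" objective as I understand it from Hinton's paper: each layer is trained on its own to give high goodness (sum of squared activations) to positive data and low goodness to negative data. The threshold value and the toy data here are my own choices:

```python
# One Forward-Forward layer: no backprop through the network, just a local
# objective on "goodness" = sum of squared activations.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(784, 500)
opt = torch.optim.SGD(layer.parameters(), lr=0.03)
theta = 2.0                                   # goodness threshold (my choice)

def goodness(h):
    return h.pow(2).sum(dim=1)

x_pos = torch.rand(32, 784)                   # stand-in for positive data
x_neg = torch.rand(32, 784)                   # stand-in for negative (corrupted) data

for step in range(100):
    opt.zero_grad()
    g_pos = goodness(torch.relu(layer(x_pos)))
    g_neg = goodness(torch.relu(layer(x_neg)))
    # push positive goodness above the threshold, negative goodness below it
    loss = nn.functional.softplus(torch.cat([theta - g_pos, g_neg - theta])).mean()
    loss.backward()                           # gradients stay local to this layer
    opt.step()

# the layer's (length-normalised) output is then fed to the next layer,
# which is trained the same way with its own local loss
```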
Absorbing Phase Transitions in Artificial Deep Neural Networks
https://arxiv.org/abs/2307.02284
To summarize, we believe that this work places the order-to-chaos transition in initialized artificial deep neural networks in the broader context of absorbing phase transitions, & serves as a first step toward the systematic comparison between natural/biological & artificial neural networks.
...
A perspective on #chatGPT (or Large Language Models #LLMs in general): #Hype or milestone?
Rodney Brooks (https://spectrum.ieee.org/amp/gpt-4-calm-down-2660261157) tells us that
What large language models are good at is saying what an answer should sound like, which is different from what an answer should be.
For a nice in-depth technical analysis, see the blog post by Stephen Wolfram (himself!) on "What is ChatGPT Doing ... and Why Does It Work?". Worth reading - even for non-experts - as a non-trivial effort to make the whole process explainable. The different steps are:
#LLMs compute probabilities for the next word. To do this, they aggregate huge datasets of text to build a function that, given a sequence of words, computes, for every possible word in the dictionary, the probability that adding that word is statistically congruent with the past words. Interestingly, this probability, conditioned on what has been observed so far, falls off as a power law, just like the global probability of words in the dictionary,
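(A toy illustration of that conditional probability, with a bigram count table standing in for the learned function; a real LLM obviously learns a far richer function than counts:)

```python
# "Probability of the next word given the words so far", reduced to a bigram
# count model over a tiny corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    c = counts[prev]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_probs("the"))   # {'cat': 0.666..., 'mat': 0.333...}
```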
These #probabilities are computed by a function that leans on the dataset to generate the best approximation. Wolfram gives a detailed description of how to build such an approximation, starting from linear regression and moving to non-linearities. This leads to deep learning methods and their potential as universal function approximators,
Crucially, these #models are trainable, in particular by way of #backpropagation. This leads the author to describe the process, but also to point out some limitations of the trained model, especially, as you might have guessed, compared to potentially more powerful systems, like #cellularautomata of course...
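(To make the training step concrete, a minimal backpropagation loop in PyTorch on toy data of my own: forward pass, loss, backward pass, update:)

```python
# "Trainable by backpropagation" in a nutshell.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(256, 10)
y = x.sum(dim=1, keepdim=True)        # toy target: sum of the inputs

for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                   # backpropagation: d(loss)/d(every weight)
    opt.step()
```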
This now brings us to #embeddings, the crucial ingredient for defining "words" in these #LLMs. To relate "alligator" to "crocodile" rather than to a "vending machine," this technique computes distances between words based on their relative proximity in the large text corpus, so that each word is assigned an address in a high-dimensional space, with the intuition that words that are semantically closer should be closer in the embedding space. It is highly non-trivial to understand the geometry of high-dimensional spaces - especially when we try to relate it to our physical 3D space - but this technique has proven to give excellent results. I highly recommend the #cemantix puzzle to test your intuition about word embeddings: https://cemantle.certitudes.org
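(A toy illustration of "closer in the embedding space": cosine similarity between word vectors. The 3-d vectors here are made up for the example; real embeddings are learned and have hundreds of dimensions:)

```python
# Cosine similarity between (made-up) word vectors.
import numpy as np

emb = {
    "alligator":       np.array([0.9, 0.8, 0.1]),
    "crocodile":       np.array([0.85, 0.75, 0.15]),
    "vending machine": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["alligator"], emb["crocodile"]))        # high, close to 1
print(cosine(emb["alligator"], emb["vending machine"]))  # much lower
```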
Finally, these different parts are glued together by a humongous #transformer network. A standard #NeuralNetwork could perform a computation to predict the probabilities for the next word, but the results would mostly give nonsensical answers... Something more is needed to make this work. Just as traditional Convolutional Neural Networks #CNNs hardwire the fact that operations applied to an image should be applied to nearby pixels first, transformers do not operate uniformly on the sequence of words (i.e., embeddings), but weight them differently to ultimately get a better approximation. It is clear that much of the mechanism is a bunch of heuristics selected based on their performance - but we can understand the mechanism as giving different weights to different tokens - specifically based on the position of each token and its importance in the meaning of the current sentence. Based on this calculation, the sequence is reweighted so that a probability is ultimately computed. When applied to a sequence of words where words are added progressively, this creates a kind of loop in which the past sequence is constantly re-processed to update the generation.
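(And the reweighting step in a nutshell, as scaled dot-product attention; the shapes and numbers below are toy choices of mine:)

```python
# Scaled dot-product attention: each position's query scores every position's
# key, and the softmax weights then mix the values.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 16                       # toy sequence of 5 token embeddings
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
weights = softmax(Q @ K.T / np.sqrt(d))  # how much each token attends to each other token
out = weights @ V                        # reweighted (mixed) representations
print(weights.round(2))
```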
Can we do more and include syntax? Wolfram discusses the internals of #chatGPT, in particular how it was trained to "be a good bot" - and adds another possibility, which is to inject the knowledge that language is organized grammatically, and to ask whether #transformers are able to learn such rules. This points to certain limitations of the architecture and to the potential of using graphs as a generalization of geometric rules. The post ends by comparing #LLMs, which just aim to sound right, with rule-based models - a debate reminiscent of the older days of AI...
A Step by Step Backpropagation Example
(2015) : Mazur, Matt
url: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
#backpropagation #machine_learning #tutorial #my_bibtex
Hinton is best known for an algorithm called #BackPropagation, which he first proposed with two colleagues in the 1980s. The technique, which allows artificial #NeuralNetworks to learn, today underpins nearly all #MachineLearning models. In a nutshell, backpropagation is a way to adjust the connections between artificial neurons over and over until a neural network produces the desired output.
Deep learning pioneer #GeoffreyHinton quits #Google | #AI
https://www.technologyreview.com/2023/05/01/1072478/deep-learning-pioneer-geoffrey-hinton-quits-google/
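That "adjust the connections over and over" loop, written out by hand for a tiny 2-2-1 network with sigmoid units (toy inputs and target of my own, in the spirit of the step-by-step tutorial above):

```python
# Hand-coded backpropagation for a tiny network: forward pass, output error,
# error propagated backwards, connections adjusted.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.5, (2, 2)), rng.normal(0, 0.5, (1, 2))
x, target = np.array([0.05, 0.10]), np.array([0.9])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(1000):
    h = sigmoid(W1 @ x)                              # forward pass
    y = sigmoid(W2 @ h)
    delta_out = (y - target) * y * (1 - y)           # error at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)     # error propagated backwards
    W2 -= 0.5 * np.outer(delta_out, h)               # adjust the connections
    W1 -= 0.5 * np.outer(delta_hid, x)

print(sigmoid(W2 @ sigmoid(W1 @ x)))                 # -> close to the 0.9 target
```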
I was an electrical engineering student in college, when Rumelhart published his seminal paper on #backpropagation in 1986. It was a game changer for the #NeuralNetworks community and a life changer for me.
Over the past four decades, I have lived through a few cycles of AI Seasons, both Winters and Springs. During that time, I have observed some disturbingly recurrent patterns: collectively, we tend to over-promise and under-deliver; the community tends to breed practitioners who are oblivious to our history and our foundational theories; and these practitioners tend to use technologies they do not fully grasp, relying exclusively on the raw input data and the apparently satisfactory results, never asking "why" or "how".
In the past, we had tiny computers, scant data, weak learning algorithms, and few practitioners. Today, however, we have massive compute clouds, seemingly inexhaustible amounts of data, powerful learning algorithms, and all techies and their grandmama are AI practitioners. So, unlike in the past, if we #misuse AI today, we will do immense harm to humanity. We must establish industry-wide #ethical guidelines.
Currently watching a #DeepLearning experiment I'm running. I have two identical #NeuralNetworks. One is running standard #backpropagation. The other is being trained in two segments, with the second half using standard backprop and the first half being trained with a #SyntheticGradient. The synthetic gradient version is kicking standard backprop's ass, and it feels like a magic trick.
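For context, the setup I mean, as a toy sketch in the spirit of decoupled neural interfaces (the architecture, data, and numbers below are stand-ins, not my actual experiment): the first half updates immediately from a predicted gradient, and the predictor is itself trained to match the true gradient once it arrives.

```python
# Toy synthetic-gradient setup: grad_model predicts dLoss/d(hidden activations)
# at the boundary between the two halves of the network.
import torch
import torch.nn as nn

torch.manual_seed(0)
first_half = nn.Sequential(nn.Linear(10, 64), nn.ReLU())
second_half = nn.Sequential(nn.Linear(64, 1))
grad_model = nn.Linear(64, 64)            # the synthetic-gradient module

opt_first = torch.optim.SGD(first_half.parameters(), lr=0.01)
opt_second = torch.optim.SGD(second_half.parameters(), lr=0.01)
opt_gm = torch.optim.SGD(grad_model.parameters(), lr=0.001)

x = torch.randn(128, 10)
y = x.sum(dim=1, keepdim=True)            # toy regression target

for step in range(500):
    h = first_half(x)

    # 1) update the first half right away, using the *synthetic* gradient
    synth = grad_model(h.detach())
    opt_first.zero_grad()
    h.backward(synth.detach())
    opt_first.step()

    # 2) update the second half with the ordinary loss
    h2 = h.detach().requires_grad_(True)
    loss = nn.functional.mse_loss(second_half(h2), y)
    opt_second.zero_grad()
    loss.backward()
    opt_second.step()

    # 3) train the gradient model to match the true gradient at the boundary
    opt_gm.zero_grad()
    gm_loss = nn.functional.mse_loss(grad_model(h.detach()), h2.grad.detach())
    gm_loss.backward()
    opt_gm.step()
```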
Is #symbolic #reasoning a wall or a hurdle for #deeplearning? In other words, is #backpropagation of errors via differentiable functions the only mechanism for #intelligence? If another mechanism is needed, couldn't it simply be learned by deep learning?
If deep learning were able to learn a whole new mechanism, then that mechanism would work on its own as an independent system. But this contradicts the premise of a single mechanism.
https://www.noemamag.com/what-ai-can-tell-us-about-intelligence
Through scaling #DeepNeuralNetworks we have found in two different domains, #ReinforcementLearning and #LanguageModels, that these models learn to learn (#MetaLearning).
They spontaneously learn internal models with memory and learning capability, which exhibit #InContextLearning much faster and much more effectively than any of our standard #backpropagation-based deep neural networks can.
These rather alien #LearningModels embedded inside the deep learning models are emulated by #neuron layers, but aren't necessarily deep learning models themselves.
I believe it is possible to extract these internal models which have learned to learn, out of the scaled up #DeepLearning #substrate they run on, and run them natively and directly on #hardware.
This allows those much more efficient learning models to be used either as #LearningAgents themselves, or as a further substrate for further meta-learning.
I have ongoing #embodiment #research with a related goal, focused specifically on extracting (or distilling) these models out of the meta-models, here:
https://github.com/keskival/embodied-emulated-personas
It is of course an open research problem how to do this, but I have a lot of ideas!
If you're inspired by this, or if you think the same, let's chat!