Inverse Surveillance AI Hackathon 2021

This hackathon is part of the Inverse Surveillance AI research project.

Hackathon Challenge: With your help we can demonstrate the potential of Inverse Surveillance AI: using AI to surveil governments and bigger organizations in order to identify and predict wrongful behavior or systematic flaws, and by doing so empower citizens.

Everyone is welcome to join. (Individuals & Teams)
This includes students, researchers, professionals, etc.

What to Expect

The hackathon consists of two parts:
1. One month of preparation time (starting October 15, 2021)
2. Hackathon Weekend (19-21 November 2021)

Online, via Discord, English, CET (UTC+1h)

Participants with other obligations are not required to join all hackathon events, as long as they submit their code before the deadline.

Deliverables:
1. Concept for Inverse Surveillance AI
2. Proof of Concept of Inverse Surveillance AI
3. A (video) pitch explaining your Proof of Concept

You can download the full Hackathon briefing via the link below. It contains a full description, the challenge expectations, guiding questions, prizes, a detailed timeline and schedule, etc.

Timeline & Schedule

  • Preparation Month – Friday 15 Oct. – Friday 19 Nov.
    You are allowed to prepare your concept and write code
  • Pre-Hackathon Week – Friday 12 Nov. – Friday 19 Nov.
    • Q&A Session: Friday 12 Nov., 18:00-19:00 CET (UTC+1h)
  • Hackathon Day 1 – Friday 19 Nov. (18:30 – 20:30)
  • Hackathon Day 2 – Saturday 20 Nov. (09:00-18:00)
  • Hackathon Day 3 – Sunday 21 Nov. (09:00-18:30)
    • 15:00 CET (UTC+1h) Submit code, and (video) pitch

Join and make a difference!

Your Proof of Concept, in combination with the theoretical research and expert interview podcasts, will serve as a launchpad for future research and work into the topic of Inverse Surveillance AI.

Inverse Surveillance offers a new perspective on the dynamic between citizens and bigger organisations and governments. AI makes this dynamic feasible. Inverse Surveillance AI can empower citizens and turn them into auditors who keep bigger organisations and governments in check, and by doing so democratize technology.

Your proof of concept has the power to demonstrate the potential of Inverse Surveillance AI and get this idea rolling.

Sign-Up & Questions

For sign-ups you can e-mail Juliette van der Laarse at juliette@asimovinstitute.org or contact her through LinkedIn.

Two Examples for Inverse Surveillance

Authors: J.P.R. van der Laarse & N.L. Neuman
Publication Date: September 24, 2021

Here we provide some metaphors as examples to better illustrate Inverse Surveillance. These metaphors are a representation of how we see inverse surveillance in comparison to other forms of surveillance and sousveillance at this moment in time. Throughout this project we aim to continue to refine this concept, and more clearly describe the differences between the different forms of veillance. 

Defining Surveillance

We use the terms surveillance and sousveillance as stand-alone concepts in these metaphors, based on the consensus within academic research. But surveillance could also be seen as an umbrella term for all activities. And the same is true for the term sousveillance with respect to all surveillance activities carried out by citizens, including inverse surveillance. 

The definitions used in these metaphors are based on our framework for inverse surveillance research. Prof. Steve Mann, who coined the term sousveillance, uses a broad veillance framework that encompasses surveillance, sousveillance, inverse surveillance and other veillance concepts. He made the case for using veillance as the umbrella term instead of surveillance, which has different connotations.

1) Police Officer vs. Auditor

Inverse surveillance is by definition not anti-government in a dystopian sense, but pro-government from a utopian stance. Inverse surveillance provides citizens with leverage for holding a government accountable, which ought to be considered a positive effect in a functioning democratic society. For the Panopticon effect to work, there needs to be some level of threat. However, citizens will not take the role of a police officer, who issues fines based on criminal behaviour, and exercises power. Rather, citizens using inverse surveillance AI will essentially fulfill the role of an auditor. Auditors are also within their right to assess, correct, and sometimes enforce norms under the threat of specific consequences. However, an auditor is different from a police officer, since auditors report, while also offering organizations an opportunity for improvement. An auditor can be seen as an additional means of control to check that everything is running as it should within an organisation according to some normative framework. Despite the strict monitoring role of auditors, in which they directly hold organizations accountable for their behavior, independent auditors are frequently hired by organizations themselves to monitor their business and operations to ensure that they have everything in order when a formal audit occurs. This dynamic of organizations reaching out to auditors for help in auditing their systems and contributing ideas for improvement is exactly the kind of relationship our Inverse Surveillance project aims to stimulate between citizens and governments or other large organizations.

2) School examination

This metaphor relates to the different forms of veillance, and aims to illustrate the differences.

Surveillance: A teacher walks around during an exam to check if students are cheating. This is a form of power from above.

Counter-Surveillance: A student sits behind a pillar during an exam in protest, or sets their table up so that the teacher cannot perceive them properly. Whether the student cheats or not is irrelevant. The focus is on evading surveillance by the teacher. 

Sousveillance: The teacher walks past the tables and a student addresses their behaviour. For example, “Sir/Madam, I keep seeing you walking past the tables of students of colour. This is a form of discrimination”. The teacher’s surveillance is being observed and reported by a student.

Inverse Surveillance: The teacher walks past the students taking their exam, without the students paying attention to it. Surveillance is part of this process and the students are not necessarily concerned about it. However, the students have set up a student council to evaluate the teachers and school system. Are they working fairly? What exactly is being surveilled? Have any processes crept in that lead to, for example, occurrences of racism? Or are there patterns that can be identified that indicate corruption?

Panopticon for the Masses

Authors: J.P.R. van der Laarse & N.L. Neuman
Publication Date: May 7, 2021

With security cameras in public places, police making their regular rounds in neighborhoods, proctors watching students during exams, and government organizations monitoring suspicious behavior online, surveillance is a part of our daily lives. Not only does such surveillance help spot and punish criminal behavior, it also has a psychological effect, and it is this effect that makes surveillance so effective. This is known as the Panopticon effect, first coined by Jeremy Bentham in the 18th century. 

What is the panopticon effect? 
In short, it means that when you know you can be watched, you will behave better. In a public place you are less likely to show bad behavior because you are aware that you can be watched. Thus, you correct your own behavior without the police or other enforcement agents having to intervene. It is this psychological self-policing mechanism (Foucault, 1977) that makes surveillance such a powerful tool.

Bentham first looked at the panopticon model in the context of prisons, and he articulated the dynamic, and the requirements, needed to make the panopticon model work. Within this structure, the panopticon takes place inside an annular building of cell blocks, where at the center of the building a watchtower is positioned. Each person within a cell block (the subject) is sectioned off from the other ‘prisoners’ inside their cell block, leading to an individualization of the subjects. The officials within the tower (the observers) are invisible to the subjects; however, they have total visibility of the subjects themselves, leading to an asymmetrical power relation. The end result of this surveillance structure is that the subjects create a self-regulating mechanism that replaces the anxiety of being watched and thus adhere to the institutional categories of evaluation and behave as is expected from them. As Foucault explained, “the major effect of the panopticon: to induce in the inmate a state of conscious and permanent visibility that assures the automatic functioning of power” (Foucault, 1977; Jezierski, 2006).

According to Foucault, the panopticon model is as fascinating as it is frightening, and it illustrates Foucault’s views on the unequal power dynamic between citizens and government in general in the best possible way.

A Panopticon for the masses
To achieve Bentham’s form of a panopticon model, architectural change is required. It is the well-known dome prisons that are architecturally designed specifically for this purpose. Security cameras achieve the same effect. The subjects can be viewed undisturbed by the observers without the subjects being able to engage in dialogue with them. To make the panopticon a reality, either a lot of money is needed for architectural redesign, or enough money is needed to install means of large-scale observation, such as security cameras. Permission to build and install, as well as the financial resources, are often in the hands of the government and larger organizations.

The democratization of AI, however, can be a game-changer for this dynamic. A simple algorithm can be developed at relatively little cost and function as thousands of observers. Not only is this useful for governments in the analysis of big data, this same tool can now be used by citizens to create a panopticon effect. AI thus makes surveillance by citizens, Inverse Surveillance, possible.

We do not see inverse surveillance as a counter-action to surveillance by governments and other large organizations. We merely acknowledge that citizens can now, through AI, create a panopticon effect of their own and thereby take part in the activity of surveillance. This presents opportunities in democratizing surveillance AI that we think are worth exploring. Within this research, we recognize that the panopticon effect works and citizens too can successfully use it as a tool. 

References:

  1. Foucault, M. (1977). Discipline and punish : The birth of the prison. Translated by Sheridan, A. New York: Pantheon Books.
  2. Jezierski, W. (2006). Monasterium panopticum. Frühmittelalterliche Studien, 40(1), 167-182.

On Utopian Thinking

Authors: J.P.R. van der Laarse & N.L. Neuman
Publication Date: April 23, 2021

Surveillance AI is not exactly considered to be a positive development in this day and age, with controversial stories like China’s mass surveillance headlining many news platforms (Andersen, 2020; BBC, 2021). These news items evoke a negative perception of AI and remind us of movies like I, Robot, Terminator, 2001: A Space Odyssey, and Minority Report. This technophobic and dystopian view of Surveillance AI is part of the reason why ethical AI is a growing academic field. The focus of these studies lies primarily in preventing and countering this dystopian application of technology. However, although these studies from a dystopian perspective are very much needed, they mainly focus on limiting or governing these developments, and work from within existing structures and systems. This leaves little room for positive innovative developments.

Utopian Thinking
In order to get us to a future that opens up new possibilities with regard to Surveillance AI, instead of limiting them, we need a different approach to complement the dystopian work. Theory U teaches us that we need to be critical of our frame of mind, and preferably break out of our institutional bubble. This would enable innovation and accelerate the emerging future to take shape (Scharmer & Senge, 2016). We thus need a more out-of-the-box type of approach that is not limited by existing institutional structures. This approach stands at the center of Thomas More’s ‘Utopia’ (1516), imagining a perfect world in comparison to the world we are living in. Regardless of its attainability, we focus on the thinking method itself.


Utopian Thinking has been at the foundation of many great technological innovations, for example the World Wide Web and smartphones, not to mention groundbreaking ideas such as the theory of relativity and the abolition of apartheid (Hök, 2019). According to Brown (2015), it also facilitates collective thinking, which is essential for tackling complex problems “in these times of transformational change” (p.1). Bell and Pahl (2018) add that co-production, such as using a thinktank, is a Utopian Thinking method. According to them, Utopian Thinking methods are essential for reshaping the world as we know it for the better. In addition, Utopian Thinking encourages the public to become involved in the process (Fernando et al., 2018), which is precisely the type of citizen involvement we deem to be important for the design, development, and implementation of Inverse Surveillance AI.

It is for these reasons that we approach our research from a utopian perspective, and therefore we encourage imaginative, original, out-of-the-box thinking, which follows the example of great thinkers that stood at the basis of monumental innovations and ideas (Hök, 2019). We need to look past our current way of thinking within existing structures, and build a new vision of what is socially acceptable in order to drive the growth and implementation of Surveillance AI (Harari, 2018). As Albert Einstein emphasized: “we cannot solve our problems with the same thinking we used when we created them” (Kataria, 2019).

Bibliography:

  1. Andersen, R. (2020). The Panopticon Is Already Here. The Atlantic. Retrieved from https://www.theatlantic.com/magazine/archive/2020/09/china-ai-surveillance/614197/
  2. BBC. (2021). Uighur-identifying patent is ‘deeply disturbing’. BBC News. Retrieved from https://www.bbc.com/news/av/technology-55651932
  3. Bell, D.M., & Pahl, K. (2018). Co-production: Towards a utopian approach. International Journal of Social Research Methodology, 21(1), 105-117.
  4. Brown, V.A. (2015). Utopian thinking and the collective mind: Beyond transdisciplinarity. Futures : The Journal of Policy, Planning and Futures Studies, 65, 209-216.
  5. Fernando, J. W., Burden, N., Ferguson, A., O’Brien, L. V., Judge, M., & Kashima, Y. (2018). Functions of Utopia: How Utopian Thinking Motivates Societal Engagement. Personality and social psychology bulletin, 44(5), 779-792. https://doi.org/10.1177/0146167217748604
  6. Harari, Y. N. (2018). 21 lessons for the 21st century (First ed.). Random House USA.
  7. Hök, B.W.  (2019). Are great innovations driven by utopian ideas? Journal of Innovation Management, 6(4), 98-116.
  8. Kataria, V. (2019). 3 Lessons from Albert Einstein on Problem Solving. Medium, The Startup. Retrieved from https://medium.com/swlh/3-lessons-from-albert-einstein-on-problem-solving-c5438b2ac2b9
  9. More, T. (1516). Utopia. Retrieved from Planet Ebook: https://www.planetebook.com/utopia/
  10. Scharmer, C., & Senge, P. (2016). Theory U : Leading from the future as it emerges : The social technology of presencing (Second ed.).

Conceptualizing Inverse Surveillance

Authors: J.P.R. van der Laarse & N.L. Neuman
Publication Date: April 23, 2021

In our new project, we focus on unwrapping the concept of Inverse Surveillance, and how it can be used to empower citizens with AI technology. Since we wanted to place surveillance in the hands of citizens, the first name that came to mind to label this utopian vision on surveillance was ‘Inverse Surveillance’. After a quick Google search, we found out that this term has actually been used before, so we did a deep dive into the literature. We soon learned that Inverse Surveillance is often used as a synonym (or translation) for sousveillance (Mann, 2004), and is also mentioned in relation to counter-surveillance. However, neither of these concepts fully captures what we were going for. We decided to flesh out this concept a bit more and write down what we think are the main distinctions between the different types of surveillance.


For those interested, we will publish how we came to these distinctions and our definition of inverse surveillance based on the literature in another post, but in this post, we will focus on the table below, and our conclusions.

|  | Surveillance | Counter-surveillance | Sousveillance | Inverse Surveillance |
| --- | --- | --- | --- | --- |
| Agent | Top | Bottom | Bottom | Bottom |
| Subject | Bottom | Top | Top & Bottom | Top |
| Action | Surveillance | Evading & Undermining | Surveillance & gaining more insight and involvement | Surveillance |
| Goal | Controlling and influencing subject | Counter-reaction against surveillance of citizens | Counter-reaction against surveillance of citizens | Controlling and influencing subject |
| Power Dynamic | Centralization of Power | Challenging institutional power asymmetries | Reversing the balance of power (hierarchical sousveillance); levelling the balance of power (personal sousveillance) | Democratisation of Power |

Surveillance

Although ‘surveillance’ is also an umbrella term for the other concepts, in its colloquial use surveillance refers to the systematic monitoring (surveillance) of citizens (bottom) by governments or bigger organizations (top), in order to influence and control them (goal) and thus exercise power (power dynamic) (Ball et al., 2012; Hier & Greenberg, 2014; Lyon, 2007).

Counter-surveillance

In the case of counter-surveillance, citizens (bottom) actively evade and undermine surveillance by governments and bigger organizations (top) as a counter-reaction to the surveillance of citizens (goal) and by doing so are challenging institutional power asymmetries (power dynamic) (Monahan, 2006).

Sousveillance

Sousveillance happens when citizens (bottom) are surveilling governments and bigger organizations (top) with the goal to gain more insight and involvement into surveillance, as a counter-reaction against the surveillance of citizens (goal) and by doing so reversing or leveling the power balance (power dynamic) (Mann, 2004; Mann et al., 2002). 

Conceptualizing a fourth surveillance type

The exact definition of sousveillance is quite broad. Some articles focus on sousveillance as a means of gaining insight into surveillance done by governments and bigger organizations by surveilling the agent itself. In most articles, sousveillance often takes a ‘stance against’ surveillance. In other articles, all surveillance activities in which citizens partake in surveillance are included in the sousveillance concept. 

The latter is a bit closer to what we aim to focus on. Thus, according to existing terminology, our project would fall under sousveillance. We, however, wanted to clearly distinguish our concept from the ‘anti’ movement that is also present within sousveillance. We therefore decided to separate the term inverse surveillance from sousveillance and give it a bit more depth. Whether our definition of inverse surveillance can be viewed as part of the umbrella term sousveillance is up for debate, but it is not what we are focussing on.

Our definition
Inverse Surveillance

In the case of inverse surveillance, citizens (bottom) surveil governments and bigger organizations (top) in order to control and influence them (goal) and thus promote transparency and equality, and by doing so democratize power (power dynamic).

This definition is not definite yet, and it might change during the research. But we wanted to offer a clear starting point for fleshing out a new surveillance concept. 

What we want to emphasize with this distinction is that our perspective on surveillance as a method is closer to surveillance than it is to sousveillance. In our case the focus is not surveillance itself; surveillance is seen as a mere tool that we deem helpful in exercising power and control, and in influencing the subject. The difference with surveillance, however, and what puts us in line with sousveillance, is that in our case surveillance is done from the bottom to the top.

Facilitating Inverse Surveillance through Artificial Intelligence

Our suggestion to deepen the definition of Inverse Surveillance is the product of technological advancements through which ideas like these are becoming more realistic for the first time in history. In Foucault’s (1977) book, surveillance can only be used by those in power, due to the extensive resources needed to conduct large-scale surveillance (for example, by having a police force that can patrol). With the rise of AI, we no longer need hundreds of eyes to watch data, videos, or images. This makes AI a realistic tool not only for organizations to monitor individuals but also for individuals monitoring organizations, without needing the extensive resources organizations have. For this reason, our project focuses on employing AI to facilitate Inverse Surveillance.

Utopian Vision on Inverse Surveillance AI 

In this project, we focus on a utopian way of thinking. We realize that there are also many side effects to AI such as ethical complications, and these studies from a dystopian perspective are therefore also much needed. However, within this project, we are mainly looking for solutions, and innovative ideas to get this concept off the ground. Thus, from a utopian perspective, we focus not only on the possibilities of inverse surveillance but also on the broader role AI can play in society in this regard.

Throughout this project, our definition of inverse surveillance as elaborated upon here will serve as a starting point for our research. Building on this, we will focus on the utopian vision and the practical application of AI in the context of Inverse Surveillance. 

Bibliography:

  1. Ball, K., Haggerty, K., & Lyon, D. (2012). Routledge handbook of surveillance studies (Routledge international handbooks). Abingdon, Oxon ; New York: Routledge
  2. Foucault, M. (1977). Discipline and punish: The birth of the prison. New York: Pantheon Books.
  3. Hier, S., & Greenberg, J. (2014). Surveillance power, problems, and politics. Vancouver: UBC Press.
  4. Lyon, D. (2007). Surveillance studies : An overview. Cambridge, UK ; Malden, MA: Polity.
  5. Mann, S. (2004). Sousveillance: inverse surveillance in multimedia imaging. Proceedings of the 12th ACM International Conference on Multimedia, New York, NY, USA, October 10-16, 2004. 620-627. DOI: 10.1145/1027527.1027673.
  6. Mann, S., Nolan, J., & Wellman, B. (2002). Sousveillance: Inventing and Using Wearable Computing Devices for Data Collection in Surveillance Environments. Surveillance & Society, 1(3), 331-355.
  7. Monahan, T. (2006). Counter-surveillance as Political Intervention? Social Semiotics, 16(4), 515-534.

Podcast: Creativity and Constraint in Artificial and Biological Intelligence

The Brain Inspired podcast approached us for a conversation about Creativity and Constraint in Biological and Artificial Intelligence. We cover generating art with neural networks, AI’s challenges for neuroscience, and how the infamous frame problem in AI traces all the way back to Plato.

Listen to it on iTunes, Spotify, or below:

Brain Inspired podcast 062 Stefan Leijnen: Creativity and Constraint

Neural Network Zoo Prequel: Cells and Layers

Cells

The Neural Network Zoo shows different types of cells and various layer connectivity styles, but it doesn’t really go into how each cell type works. A number of cell types I originally gave different colours to differentiate the networks more clearly, but I have since found out that these cells work more or less the same way, so you’ll find descriptions under the basic cell images.

A basic neural network cell, the type one would find in a regular feed forward architecture, is quite simple. The cell is connected to other neurons via weights, i.e. it can be connected to all the neurons in the previous layer. Each connection has its own weight, which is often just a random number at first. A weight can be negative, positive, very small, very big or zero. The value of each of the cells it’s connected to is multiplied by its respective connection weight. The resulting values are all added together. On top of this, a bias is also added. A bias can prevent a cell from getting stuck on outputting zero and it can speed up some operations, reducing the number of neurons required to solve a problem. The bias is also a number, sometimes constant (often -1 or 1) and sometimes variable. This total sum is then passed through an activation function, the resulting value of which then becomes the value of the cell.
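As a rough sketch of the paragraph above, a single cell can be written as a weighted sum plus a bias, passed through an activation function. The function names and the choice of a sigmoid activation are assumptions made for illustration; the text does not prescribe a particular activation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cell_output(inputs, weights, bias, activation=sigmoid):
    """Value of one basic cell: activation(sum(input * weight) + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one connection weight")
    total = sum(value * weight for value, weight in zip(inputs, weights)) + bias
    return activation(total)


# Three incoming connections; the weighted sum and the bias cancel out to 0.
value = cell_output([1.0, 0.5, -1.0], [0.2, -0.4, 0.1], bias=0.1)
print(value)  # 0.5, because sigmoid(0) = 0.5
```

Swapping `activation` for another function (for example `math.tanh`) changes the cell's output range without changing how the connections are summed.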

Convolutional cells are much like feed forward cells, except they’re typically connected to only a few neurons from the previous layer. They are often used to preserve spatial information, because they are connected not to a few random cells but to all cells in a certain proximity. This makes them practical for data with lots of localised information, such as images and sound waves (but mostly images). Deconvolutional cells are just the opposite: these tend to decode spatial information by being locally connected to the next layer. Both cells often have a lot of clones which are trained independently; each clone having its own weights but connected exactly the same way. These clones can be thought of as being located in separate networks which all have the same structure. Both are essentially the same as regular cells, but they are used differently.
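The local connectivity described above can be illustrated with a one-dimensional sketch in which every output cell only sees a small window of neighbouring inputs. This is a simplified illustration, not a full convolutional layer: it assumes a single set of weights (`kernel`, a hypothetical name) slid along the signal, and it leaves out the activation function.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide one small set of weights over the signal.

    Output cell i is connected only to signal[i : i + len(kernel)],
    i.e. to inputs in its direct proximity.
    """
    width = len(kernel)
    if width == 0 or width > len(signal):
        raise ValueError("kernel must be non-empty and no longer than the signal")
    return [
        sum(signal[i + j] * kernel[j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]


# A kernel that compares each value with the one two positions further on.
print(conv1d([1, 2, 3, 4], [1, 0, -1]))
```

For images the same idea applies in two dimensions, with the window covering a small patch of pixels instead of a short run of samples.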

Pooling and interpolating cells are frequently combined with convolutional cells. These cells are not really cells, more just raw operations. Pooling cells take in the incoming connections and decide which connection gets passed through. In images, this can be thought of as zooming out on a picture. You can no longer see all the pixels, and it has to learn which pixels to keep and which to discard. Interpolating cells perform the opposite operation: they take in some information and map it to more information. The extra information is made up, as if one were to zoom in on a low-resolution picture. Interpolating cells are not the only reverse operation of pooling cells, but they are relatively common as they are fast and simple to implement. They are respectively connected much like convolutional and deconvolutional cells.
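To make the two operations concrete, here is a minimal one-dimensional sketch. It assumes max pooling (keep the largest value in each group) and nearest-neighbour interpolation (repeat each value); both are common choices, but the text does not single out either, and the function names are made up for this example.

```python
def max_pool(values, size):
    """Keep only the largest value from each consecutive group of `size` inputs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def upsample_nearest(values, factor):
    """Map each value to `factor` copies of itself; the extra values are made up."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [value for value in values for _ in range(factor)]


pooled = max_pool([1, 3, 2, 5, 4, 0], size=2)   # "zooming out"
restored = upsample_nearest(pooled, factor=2)    # "zooming in" again
print(pooled, restored)
```

Note that `restored` has the original length but not the original content: the information discarded by pooling is not recovered by interpolation.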

Mean and standard deviation cells (almost exclusively found in couples as probabilistic cells) are used to represent probability distributions. The mean is the average value and the standard deviation represents how far to deviate from this average (in both directions). For example, a probabilistic cell used for images could contain the information on how much red there is in a particular pixel. The mean would say for example 0.5, and the standard deviation 0.2. When sampling from these probabilistic cells, one would enter these values in a Gaussian random number generator, resulting in anything between 0.4 and 0.6 being quite likely results, with values further away from 0.5 being less and less likely (but still possible). They are often fully connected to either the previous or the next layer and they do not have biases.
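Sampling from such a pair of cells can be sketched with Python's standard Gaussian random number generator. The values 0.5 and 0.2 come from the example above; the function name, the fixed seed and the sample count are assumptions chosen only to make the sketch reproducible.

```python
import random


def sample_probabilistic_cell(mean, std, rng):
    """Draw one value from the Gaussian described by a mean cell and a std cell."""
    return rng.gauss(mean, std)


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [sample_probabilistic_cell(0.5, 0.2, rng) for _ in range(10_000)]

# Most samples land near the mean; values far from 0.5 are rare but possible.
average = sum(samples) / len(samples)
share_within_one_std = sum(0.3 <= s <= 0.7 for s in samples) / len(samples)
print(round(average, 2), round(share_within_one_std, 2))
```

For a Gaussian, roughly two thirds of the samples fall within one standard deviation of the mean, which is what `share_within_one_std` estimates.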

Recurrent cells have connections not just in the realm of layers, but also over time. Each cell internally stores its previous value. They are updated just like basic cells, but with extra weights: connected to the previous values of the cells and most of the time also to all the cells in the same layer. These weights between the current value and the stored previous value work much like a volatile memory (like RAM), inheriting both properties of having a certain “state” and vanishing if not fed. Because the previous value is a value passed through an activation function, and each update passes this activated value along with the other weights through the activation function, information is continually lost. In fact, the retention rate is so low, that only four or five iterations later, almost all of the information is lost.
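The information loss can be illustrated with a single recurrent cell that receives one input and is then left unfed. This is a minimal sketch under stated assumptions: a tanh activation, a recurrent weight of 0.5 and no connections to other cells in the layer; none of these values come from the text.

```python
import math


def recurrent_step(x, previous_value, w_in, w_rec, bias=0.0):
    """New cell value from the current input and the stored previous value."""
    return math.tanh(w_in * x + w_rec * previous_value + bias)


def run(inputs, w_in=1.0, w_rec=0.5):
    """Feed a sequence through one recurrent cell and record every value."""
    value = 0.0
    history = []
    for x in inputs:
        value = recurrent_step(x, value, w_in, w_rec)
        history.append(value)
    return history


# One pulse of input followed by silence: the stored value fades step by step.
history = run([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(v, 3) for v in history])
```

With these assumed weights the stored value roughly halves at every step, so after a handful of iterations very little of the original pulse is left.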

Long short term memory cells are used to combat the problem of the rapid information loss occurring in recurrent cells. LSTM cells are logic circuits, copied from how memory cells were designed for computers. Compared to RNN cells which store two states, LSTM cells store four: the current and last value of the output and the current and last values of the state of the “memory cell”. They have three “gates”: input, output, forget, and they also have just the regular input. Each of these gates has its own weight meaning that connecting to this type of cell entails setting up four weights (instead of just one). The gates function much like flow gates, not fence gates: they can let everything through, just a little bit, nothing, or, anything in between. This works by multiplying incoming information by a value ranging from 0 to 1, which is stored in this gate value. The input gate, then, determines how much of the input is allowed to be added to the cell value. The output gate determines how much of the output value can be seen by the rest of the network. The forget gate is not connected to the previous value of the output cell, but rather connected to the previous memory cell value. It determines how much of the last memory cell state to retain. Because it’s not connected to the output, much less information loss occurs, because no activation function is placed in the loop.
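A single LSTM update can be sketched for one scalar cell as follows. This follows the commonly used formulation in which all three gates read the current input and the previous output; it is a simplification and does not model a forget gate wired to the previous memory cell value as described above. `GateWeights` and `lstm_step` are hypothetical names introduced for this example.

```python
import math
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class GateWeights:
    """Weights of one gate (or of the regular input) for a scalar cell."""
    w_x: float         # weight on the current input
    w_h: float         # weight on the previous output
    bias: float = 0.0

    def pre_activation(self, x: float, h_prev: float) -> float:
        return self.w_x * x + self.w_h * h_prev + self.bias


def lstm_step(x, h_prev, c_prev, input_gate, forget_gate, output_gate, candidate):
    """One update of a scalar LSTM cell; returns (output, memory cell state)."""
    i = sigmoid(input_gate.pre_activation(x, h_prev))    # how much input to add
    f = sigmoid(forget_gate.pre_activation(x, h_prev))   # how much memory to keep
    o = sigmoid(output_gate.pre_activation(x, h_prev))   # how much output to show
    g = math.tanh(candidate.pre_activation(x, h_prev))   # the regular input
    c = f * c_prev + i * g   # the memory path itself has no activation function
    h = o * math.tanh(c)
    return h, c


# Forget gate held open and input gate held shut: the memory barely changes.
keep_open = GateWeights(0.0, 0.0, bias=20.0)
keep_shut = GateWeights(0.0, 0.0, bias=-20.0)
h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.3, h, c, keep_shut, keep_open, keep_open, GateWeights(1.0, 0.0))
print(c)
```

Because the memory cell state is only scaled by the forget gate and never squashed by an activation function, it can survive many more iterations than the value of a plain recurrent cell.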

Gated recurrent units (cells) are a variation of LSTM cells. They too use gates to combat information loss, but do so with just 2 gates: update and reset. This makes them slightly less expressive but also slightly faster, as they use fewer connections everywhere. In essence there are two differences between LSTM cells and GRU cells: GRU cells do not have a hidden cell state protected by an output gate, and they combine the input and forget gate into a single update gate. The idea is that if you want to allow a lot of new information, you can probably forget some old information (and the other way around).
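The coupling of "let new information in" and "forget old information" shows up directly in a scalar sketch of one GRU update. This follows one common formulation; conventions differ on which side of the update gate keeps the old value, and the function name and the `(w_x, w_h, bias)` tuples are assumptions made for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, update, reset, candidate):
    """One update of a scalar GRU cell.

    `update`, `reset` and `candidate` are (w_x, w_h, bias) tuples.
    """
    z = sigmoid(update[0] * x + update[1] * h_prev + update[2])  # update gate
    r = sigmoid(reset[0] * x + reset[1] * h_prev + reset[2])     # reset gate
    # The reset gate decides how much of the old value feeds the new candidate.
    h_new = math.tanh(candidate[0] * x + candidate[1] * (r * h_prev) + candidate[2])
    # A single gate trades old information against new information.
    return (1.0 - z) * h_prev + z * h_new


# Update gate shut: the old value is kept. Update gate open: it is replaced.
kept = gru_step(1.0, 0.8, update=(0, 0, -20), reset=(0, 0, 0), candidate=(1, 1, 0))
replaced = gru_step(1.0, 0.8, update=(0, 0, 20), reset=(0, 0, -20), candidate=(1, 1, 0))
print(round(kept, 3), round(replaced, 3))
```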

Layers

The most basic way of connecting neurons to form graphs is by connecting everything to absolutely everything. This is seen in Hopfield networks and Boltzmann machines. Of course, this means the number of connections grows quadratically with the number of neurons, but the expressiveness is uncompromised. This is referred to as completely (or fully) connected.

After a while it was discovered that breaking the network up into distinct layers is a useful feature, where the definition of a layer is a set or group of neurons which are not connected to each other, but only to neurons from other group(s). This concept is for instance used in Restricted Boltzmann Machines. The idea of using layers is nowadays generalised for any number of layers and it can be found in almost all current architectures. This is (perhaps confusingly) also called fully connected or completely connected, because actually completely connected networks are quite uncommon.
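Counting connections makes the difference between the two wiring schemes concrete. In a network where every pair of neurons is connected there are n(n-1)/2 connections, while in a layered network only neurons in adjacent layers are connected. The function names and the example layer sizes are illustrative assumptions.

```python
def complete_connections(n_neurons: int) -> int:
    """Connections when every neuron is connected to every other neuron."""
    return n_neurons * (n_neurons - 1) // 2


def layered_connections(layer_sizes) -> int:
    """Connections when each layer is fully connected to the next layer only."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# Ten neurons in total, either all-to-all or split into layers of 4, 3 and 3.
print(complete_connections(10), layered_connections([4, 3, 3]))
```

The gap widens quickly: doubling the number of neurons roughly quadruples the connection count of the completely connected network.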

Convolutionally connected layers are even more constrained than fully connected layers: we connect every neuron only to neurons in other groups that are close by. Images and sound waves contain a very high amount of information if used to feed directly one-to-one into a network (e.g. using one neuron per pixel). The idea of convolutional connections comes from the observation that spatial information is probably important to retain. It turned out that this is a good guess, as it’s used in many image and sound wave based neural network applications. This setup is however less expressive than fully connected layers. In essence it is a way of “importance” filtering, deciding which of the tightly grouped information packets are important; convolutional connections are great for dimensionality reduction. At what spatial distance neurons can still be connected depends on the implementation, but ranges higher than 4 or 5 neurons are rarely used. Note that “spatial” often refers to two-dimensional space, which is why most representations show three-dimensional sheets of neurons being connected; the connection range is applied in all dimensions.

Another option is of course to randomly connect neurons. This comes in two main variations as well: by allowing for some percentage of all possible connections, or by connecting some percentage of neurons between layers. Random connections help to linearly reduce the computational cost of the network and can be useful in large networks where fully connected layers run into performance problems. A slightly more sparsely connected layer with slightly more neurons can perform better in some cases, especially where a lot of information needs to be stored but not as much information needs to be exchanged (a bit similar to the effectiveness of convolutionally connected layers, but then randomised). Very sparsely connected systems (1 or 2%) are also used, as seen in ELMs, ESNs and LSMs. Especially in the case of spiking networks this makes a lot of sense, because the more connections a neuron has, the less energy each weight will carry over, meaning less propagating and repeating patterns.
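The first variation (keeping some percentage of all possible connections) can be sketched as follows. The helper `random_mask` and the layer sizes are assumptions chosen for illustration; the 2% density mirrors the very sparse case mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(n_in, n_out, density, rng):
    """Keep each possible connection independently with probability `density`."""
    return rng.random((n_in, n_out)) < density

mask = random_mask(100, 100, density=0.02, rng=rng)
weights = rng.normal(size=(100, 100)) * mask   # absent connections get weight 0
kept_fraction = mask.mean()                    # close to, but not exactly, 0.02
```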

Time delayed connections are connections between neurons (often from the same layer, and even connected with themselves) that don’t get information from the previous layer, but from a layer from the past (previous iteration, mostly). This allows temporal (time, sequence or order) related information to be stored. These types of connections are often manually reset from time to time, to clear the “state” of the network. The key difference with regular connections is that these connections are continuously changing, even when the network isn’t being trained.

The following image shows some small sample networks of the types described above, and their connections. I use it when I get stuck on just exactly what is connected to what (which is particularly likely when working with LSTM or GRU cells):

Analyzing Six Deep Learning Tools for Music Generation

As deep learning is gaining in popularity, creative applications are gaining traction as well. Looking at music generation through deep learning, new algorithms and songs are popping up on a weekly basis. In this post we will go over six major players in the field, and point out some difficult challenges these systems still face. GitHub links are provided for those who are interested in the technical details (or if you’re looking to generate some music of your own).

Magenta

Magenta is Google’s open source deep learning music project. They aim to use machine learning to generate compelling music. The project went open source in June 2016 and currently implements a regular RNN and two LSTMs.
GitHub: https://github.com/tensorflow/magenta
Great, because: It can handle any monophonic midi file. The documentation is good, so it’s relatively easy to set up. The team is actively improving the models and adding functionality. For every model Magenta has provided a training bundle that is trained on thousands of midi files. You can start generating new midi files right away using these pre-trained models.
Challenges: At this point, Magenta can only generate a single stream of notes. Efforts have been made to combine the generated melodies with drums and guitars – but based on human input, as of yet. Once a model that can process polyphonic music has been trained, it could start to create harmonies (or at least multiple streams of notes). This would indeed be a mighty step on their quest for the generation of some compelling music.
Sounds like: The piece below is generated by Magenta from the 8th note onward. Here they use their attention model with the provided pre-trained bundle.

DeepJazz

The result of a thirty-six-hour hackathon by Ji-Sung Kim. It uses a two-layer LSTM that learns from a midi file as its input source. DeepJazz has received quite some news coverage in the first six months of its existence.
GitHub: https://github.com/jisungk/deepjazz
Great, because: Can create some jazz by being trained on a single midi file. The project itself is also compelling proof that creating a working computational music prototype using deep learning techniques can be a matter of hours thanks to libraries like Keras, Theano & Tensorflow.
Challenges: While it can handle chords, it converts the jazz midi to a single pitch and single instrument. It would take a few more post-processing steps for the deep learning created melodies to sound more like human created jazz music.
Sounds like: The following piece is generated after 128 epochs (i.e. the training set consisting of a single midi file has cycled through the model that many times).

BachBot

A research project by Feynman Liang at Cambridge University, also using an LSTM. This time it is used to train itself on Bach chorales. Its goal is to generate and harmonize chorales indistinguishable from Bach’s own work. The website offers a test where one can listen to two streams and guess which one is an actual composition by Bach.
GitHub: https://github.com/feynmanliang/bachbot/
Great, because: Research found that people have a hard time distinguishing generated Bach from the real stuff. Also, this is one of the best efforts in handling polyphonic music as the algorithm can handle up to four voices.
Challenges: BachBot works best if one or more of the voices are fixed. Otherwise the algorithm just generates wandering chorales. The algorithm could be used to add chorales to a generated melody.
Sounds like: In the below example the notes for “Twinkle Twinkle Little Star” were fixed, with the chorales generated.

FlowMachines

In the picturesque city of Paris, a research team is working on a system that can help to keep an artist in a creative flow. Their system can generate leadsheets based on the style of a composer in a database filled with about 13000 sheets. Markov constraints are used here as the generation technique.
GitHub: not open source.
Great, because: The system has composed the first AI pop-songs.
Challenges: Getting from a generated leadsheet to these pop songs is not simply done at the click of a button: it still requires a well-skilled musician to create a compelling song like in the example below. Reducing the difficulty of these steps with the help of deep learning is still an open challenge.
Sounds like: The song is composed by the FlowMachines AI. In order to do so, the musician chose the “Beatles” style, and generated melody and harmony. Note the rest of the score (production, mixing, and assigning audio pieces to the notes) was produced by a human composer.

WaveNet

Researchers at Google’s DeepMind have created Wavenet. Wavenet is based on Convolutional Neural Networks, the deep learning technique that has worked very well in image classification and generation in the past few years. Its most promising purpose is to enhance text-to-speech applications by generating a more natural flow in vocal sound. However, the method can also be applied to music as both the input and output consist of raw audio.
GitHub: WaveNet’s code is not open source, but others have implemented it based on DeepMind’s documentation. For example: https://github.com/ibab/tensorflow-wavenet
Great, because: It uses raw audio as input. Therefore it can generate any kind of instrument, and even any kind of sound. It will be interesting to see what this technique is capable of once trained on hours of music.
Challenges: The algorithm is computationally expensive. It takes minutes to train on a second of sound. Some have started to create a faster version. Another researcher working for Google, Sageev Oore from the Magenta project, has written a blog post where he describes what can be learned from the musical output of Wavenet. One of his conclusions is that the algorithm can produce piano notes without a beginning, making them unplayable on a real piano. Interestingly, Wavenet can extend the current library of sounds that a piano can create and produce a new form of piano music – perhaps the next step in (generated) music.
Sounds like: Training on a dataset of piano music results in the following ten seconds of sound:

GRUV

A Stanford research project that, similar to Wavenet, also tries to use audio waveforms as input, but with LSTMs and GRUs rather than CNNs. They showed their proof of concept to the world in June 2015.
GitHub: https://github.com/MattVitelli/GRUV
Great, because: The Stanford researchers were among the first to show how to generate sounds with an LSTM using raw waveforms as input.
Challenges: The demonstration they provide seems over-fitted on a particular song, due to the small training corpus and the sheer number of layers of the NN. The researchers themselves had neither the time nor the computational power to experiment further with this. Fortunately, this void is starting to get filled by researchers from WaveNet and other enthusiasts. Jakub Fiala has used this code to generate an interesting amen drum break, see this blog post.
Sounds like: The tool trained on a variety of Madeon songs, resulted in the below sample. Until 1:10 is an excerpt of the creation after 100 up to 1000 iterations, after that is a mash-up of their best generated pieces. This excerpt is a recording of this video.

Notes VS Waves

The described deep learning music applications can be divided into two categories based on the input method. Magenta, DeepJazz, BachBot, and FlowMachines all use input in the form of note sequences, while GRUV and Wavenet use raw audio.

Input type | Note sequences | Raw audio
Computational complexity | Low (minutes – few hours) | High (few hours – days)
Editable result | Yes, can be imported in music production software | No, waveform itself has to be edited
Musical complexity | As complex as a single song from the corpus | As complex as the combination of the entire corpus

Can we call out a clear winner? In my opinion: no. Each has different applications and these methods can coexist until generating compelling music with raw audio becomes so fast that there is simply no point in doing it yourself.

Music will be easier to create by people who are assisted by an AI that can suggest a melody or harmony. However, these people still need to be musicians (for now). The moment it is possible to train a deep learning algorithm on your entire Spotify history in raw audio form, and generate new songs, everyone can be a musician.

Image classification and generation has been improved with neural network techniques, reaching higher benchmark scores than ever before, mostly thanks to the speed at which huge sets of pixels can be trained. For audio the overarching question is: when will raw audio overtake notes as the pixel of music?


Did you miss anything, or do you have any other feedback? Comments are greatly appreciated. At the Asimov Institute we do deep learning research and development, so be sure to follow us on Twitter for future updates and posts! In this post we did not go into the technical details, but if you’re new to deep learning or unfamiliar with a method, I refer you to one of our previous posts on neural networks.

We are currently working on generating electronic dance music using deep learning. If you want to share your ideas on this, or have some interesting data to show, please send a message to frankbrinkkemper@gmail.com. Thank you for reading!

The Neural Network Zoo

With new neural network architectures popping up every now and then, it’s hard to keep track of them all. Knowing all the abbreviations being thrown around (DCIGN, BiLSTM, DCGAN, anyone?) can be a bit overwhelming at first.

So I decided to compose a cheat sheet containing many of those architectures. Most of these are neural networks, some are completely different beasts. Though all of these architectures are presented as novel and unique, when I drew the node structures… their underlying relations started to make more sense.

The Neural Network Zoo (download or get the poster).

One problem with drawing them as node maps: it doesn’t really show how they’re used. For example, variational autoencoders (VAE) may look just like autoencoders (AE), but the training process is actually quite different. The use-cases for trained networks differ even more, because VAEs are generators, where you insert noise to get a new sample. AEs simply map whatever they get as input to the closest training sample they “remember”. I should add that this overview is in no way clarifying how each of the different node types work internally (but that’s a topic for another day).

It should be noted that while most of the abbreviations used are generally accepted, not all of them are. RNNs sometimes refer to recursive neural networks, but most of the time they refer to recurrent neural networks. That’s not the end of it though, in many places you’ll find RNN used as placeholder for any recurrent architecture, including LSTMs, GRUs and even the bidirectional variants. AEs suffer from a similar problem from time to time, where VAEs and DAEs and the like are called simply AEs. Many abbreviations also vary in the amount of “N”s to add at the end, because you could call it a convolutional neural network but also simply a convolutional network (resulting in CNN or CN).

Composing a complete list is practically impossible, as new architectures are invented all the time. Even published architectures can be quite challenging to find, even when you’re looking for them, and sometimes you simply overlook some. So while this list may provide you with some insights into the world of AI, please, by no means take this list for being comprehensive; especially if you read this post long after it was written.

For each of the architectures depicted in the picture, I wrote a very, very brief description. You may find some of these to be useful if you’re quite familiar with some architectures, but you aren’t familiar with a particular one.


Feed forward neural networks (FF or FFNN) and perceptrons (P) are very straightforward: they feed information from the front to the back (input and output, respectively). Neural networks are often described as having layers, where each layer consists of either input, hidden or output cells in parallel. A layer alone never has connections and in general two adjacent layers are fully connected (every neuron from one layer to every neuron in another layer). The simplest somewhat practical network has two input cells and one output cell, which can be used to model logic gates. One usually trains FFNNs through back-propagation, giving the network paired datasets of “what goes in” and “what we want to have coming out”. This is called supervised learning, as opposed to unsupervised learning where we only give it input and let the network fill in the blanks. The error being back-propagated is often some variation of the difference between the network’s output and the desired output (like MSE or just the linear difference). Given that the network has enough hidden neurons, it can theoretically always model the relationship between the input and output. Practically their use is a lot more limited but they are popularly combined with other networks to form new networks.
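The two-inputs-one-output network mentioned above can be written out in a few lines. This is a minimal sketch with hand-picked weights (an assumption for illustration; nothing is trained here) and a step function as activation.

```python
import numpy as np

def perceptron(x, w, b):
    """Single output cell: weighted sum of the inputs followed by a step function."""
    return (x @ w + b > 0).astype(int)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (not trained) weights; only the bias differs between the gates.
and_out = perceptron(inputs, w=np.array([1.0, 1.0]), b=-1.5)   # AND gate
or_out = perceptron(inputs, w=np.array([1.0, 1.0]), b=-0.5)    # OR gate
```

In a trained network these weights would instead be found by back-propagating the error between `and_out` and the desired truth table.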

Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.
Original Paper PDF


Radial basis function (RBF) networks are FFNNs with radial basis functions as activation functions. There’s nothing more to it. Doesn’t mean they don’t have their uses, but most FFNNs with other activation functions don’t get their own name. This mostly has to do with inventing them at the right time.
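As an illustration of what such an activation looks like, here is a Gaussian radial basis function, one common choice. The names `gaussian_rbf`, `centre` and `width` are assumptions for this sketch.

```python
import numpy as np

def gaussian_rbf(x, centre, width):
    """Activation depends only on the distance between the input and the centre."""
    dist_sq = np.sum((x - centre) ** 2, axis=-1)
    return np.exp(-dist_sq / (2.0 * width ** 2))

centre = np.array([0.0, 0.0])
at_centre = gaussian_rbf(np.array([0.0, 0.0]), centre, width=1.0)   # maximal response
far_away = gaussian_rbf(np.array([3.0, 4.0]), centre, width=1.0)    # almost no response
```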

Broomhead, David S., and David Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. No. RSRE-MEMO-4148. ROYAL SIGNALS AND RADAR ESTABLISHMENT MALVERN (UNITED KINGDOM), 1988.
Original Paper PDF


Recurrent neural networks (RNN) are FFNNs with a time twist: they are not stateless; they have connections between passes, connections through time. Neurons are fed information not just from the previous layer but also from themselves from the previous pass. This means that the order in which you feed the input and train the network matters: feeding it “milk” and then “cookies” may yield different results compared to feeding it “cookies” and then “milk”. One big problem with RNNs is the vanishing (or exploding) gradient problem where, depending on the activation functions used, information rapidly gets lost over time, just like very deep FFNNs lose information in depth. Intuitively this wouldn’t be much of a problem because these are just weights and not neuron states, but the weights through time is actually where the information from the past is stored; if the weight reaches a value of 0 or 1 000 000, the previous state won’t be very informative. RNNs can in principle be used in many fields as most forms of data that don’t actually have a timeline (i.e. unlike sound or video) can be represented as a sequence. A picture or a string of text can be fed one pixel or character at a time, so the time dependent weights are used for what came before in the sequence, not actually from what happened x seconds before. In general, recurrent networks are a good choice for advancing or completing information, such as autocompletion.
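The “milk and cookies” point can be checked with a tiny recurrent layer. This is a sketch with random, untrained weights; the one-hot encodings of the two words and the layer sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W_in = rng.normal(size=(3, 4))    # input -> hidden
W_rec = rng.normal(size=(4, 4))   # hidden (previous pass) -> hidden

def run_rnn(sequence):
    """Feed the inputs one at a time; the hidden state carries over between passes."""
    h = np.zeros(4)
    for x in sequence:
        h = np.tanh(x @ W_in + h @ W_rec)
    return h

milk = np.array([1.0, 0.0, 0.0])
cookies = np.array([0.0, 1.0, 0.0])

h_a = run_rnn([milk, cookies])
h_b = run_rnn([cookies, milk])    # same inputs, different order, different state
```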

Elman, Jeffrey L. “Finding structure in time.” Cognitive science 14.2 (1990): 179-211.
Original Paper PDF


Long / short term memory (LSTM) networks try to combat the vanishing / exploding gradient problem by introducing gates and an explicitly defined memory cell. These are inspired mostly by circuitry, not so much biology. Each neuron has a memory cell and three gates: input, output and forget. The function of these gates is to safeguard the information by stopping or allowing the flow of it. The input gate determines how much of the information from the previous layer gets stored in the cell. The output gate takes the job on the other end and determines how much of the next layer gets to know about the state of this cell. The forget gate seems like an odd inclusion at first but sometimes it’s good to forget: if it’s learning a book and a new chapter begins, it may be necessary for the network to forget some characters from the previous chapter. LSTMs have been shown to be able to learn complex sequences, such as writing like Shakespeare or composing primitive music. Note that each of these gates has a weight to a cell in the previous neuron, so they typically require more resources to run.
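The interplay of the three gates and the memory cell can be sketched as a single time step. This follows a common textbook formulation, which may differ in detail from the cited paper; the weights are random and the sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x, h_prev] to the four pre-activations."""
    z = np.concatenate([x, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget and output gates
    c = f * c_prev + i * np.tanh(g)                # forget old memory, store new input
    h = o * np.tanh(c)                             # output gate limits what is exposed
    return h, c

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 5
W = rng.normal(scale=0.1, size=(n_in + n_hidden, 4 * n_hidden))
b = np.zeros(4 * n_hidden)

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, b)
```

The weight matrix is four times as wide as in a plain recurrent layer, which is where the extra resource cost comes from.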

Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8 (1997): 1735-1780.
Original Paper PDF


Gated recurrent units (GRU) are a slight variation on LSTMs. They have one less gate and are wired slightly differently: instead of an input, output and a forget gate, they have an update gate and a reset gate. This update gate determines both how much information to keep from the last state and how much information to let in from the previous layer. The reset gate functions much like the forget gate of an LSTM but it’s located slightly differently. They always send out their full state, they don’t have an output gate. In most cases, they function very similarly to LSTMs, with the biggest difference being that GRUs are slightly faster and easier to run (but also slightly less expressive). In practice these tend to cancel each other out, as you need a bigger network to regain some expressiveness which then in turn cancels out the performance benefits. In some cases where the extra expressiveness is not needed, GRUs can outperform LSTMs.
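A single GRU step, sketched under the same caveats as any toy example: random weights, no bias terms, and one of the two common conventions for the update gate (some formulations swap the roles of `update` and `1 - update`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_update, W_reset, W_cand):
    """One GRU step without bias terms."""
    xh = np.concatenate([x, h_prev])
    update = sigmoid(xh @ W_update)   # how much of the state to overwrite
    reset = sigmoid(xh @ W_reset)     # how much of the old state feeds the candidate
    candidate = np.tanh(np.concatenate([x, reset * h_prev]) @ W_cand)
    return (1.0 - update) * h_prev + update * candidate   # the full state is the output

rng = np.random.default_rng(9)
n_in, n_hidden = 3, 5
shape = (n_in + n_hidden, n_hidden)
W_update, W_reset, W_cand = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = gru_step(rng.normal(size=n_in), np.zeros(n_hidden), W_update, W_reset, W_cand)
```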

Chung, Junyoung, et al. “Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014).
Original Paper PDF


Bidirectional recurrent neural networks, bidirectional long / short term memory networks and bidirectional gated recurrent units (BiRNN, BiLSTM and BiGRU respectively) are not shown on the chart because they look exactly the same as their unidirectional counterparts. The difference is that these networks are not just connected to the past, but also to the future. As an example, unidirectional LSTMs might be trained to predict the word “fish” by being fed the letters one by one, where the recurrent connections through time remember the last value. A BiLSTM would also be fed the next letter in the sequence on the backward pass, giving it access to future information. This trains the network to fill in gaps instead of advancing information, so instead of expanding an image on the edge, it could fill a hole in the middle of an image.

Schuster, Mike, and Kuldip K. Paliwal. “Bidirectional recurrent neural networks.” IEEE Transactions on Signal Processing 45.11 (1997): 2673-2681.
Original Paper PDF


Autoencoders (AE) are somewhat similar to FFNNs as AEs are more like a different use of FFNNs than a fundamentally different architecture. The basic idea behind autoencoders is to encode information (as in compress, not encrypt) automatically, hence the name. The entire network always resembles an hourglass-like shape, with smaller hidden layers than the input and output layers. AEs are also always symmetrical around the middle layer(s) (one or two depending on an even or odd number of layers). The smallest layers are almost always in the middle, the place where the information is most compressed (the chokepoint of the network). Everything up to the middle is called the encoding part, everything after the middle the decoding and the middle (surprise) the code. One can train them using backpropagation by feeding input and setting the error to be the difference between the input and what came out. AEs can be built symmetrically when it comes to weights as well, so the encoding weights are the same as the decoding weights.
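A minimal sketch of the encode/decode round trip with symmetric (tied) weights. The layer sizes and the `tanh` activation are assumptions, and no training takes place; the point is only the shape of the computation and of the error.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(8, 3))   # 8 inputs squeezed into a 3-value code

def encode(x):
    return np.tanh(x @ W)

def decode(code):
    return code @ W.T        # tied weights: the decoder reuses the encoder matrix

x = rng.normal(size=(5, 8))
code = encode(x)                               # the chokepoint of the network
reconstruction = decode(code)
error = np.mean((x - reconstruction) ** 2)     # the target is the input itself
```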

Bourlard, Hervé, and Yves Kamp. “Auto-association by multilayer perceptrons and singular value decomposition.” Biological cybernetics 59.4-5 (1988): 291-294.
Original Paper PDF


Variational autoencoders (VAE) have the same architecture as AEs but are “taught” something else: an approximated probability distribution of the input samples. It’s a bit back to the roots as they are a bit more closely related to BMs and RBMs. They do however rely on Bayesian mathematics regarding probabilistic inference and independence, as well as a re-parametrisation trick to achieve this different representation. The inference and independence parts make sense intuitively, but they rely on somewhat complex mathematics. The basics come down to this: take influence into account. If one thing happens in one place and something else happens somewhere else, they are not necessarily related. If they are not related, then the error propagation should consider that. This is a useful approach because neural networks are large graphs (in a way), so it helps if you can rule out influence from some nodes to other nodes as you dive into deeper layers.
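The re-parametrisation trick itself is short enough to show. The sketch below assumes a Gaussian code with a standard normal prior, as in the cited paper; the function names and the toy inputs are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def reparameterise(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as a deterministic function of mu, sigma and noise."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over code dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

mu = np.zeros((2, 3))        # encoder output: means of the code
log_var = np.zeros((2, 3))   # encoder output: log variances of the code
z = reparameterise(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)   # zero here, the code already matches the prior
```

Because the randomness sits in `eps` rather than in the network outputs, gradients can flow through `mu` and `log_var` during back-propagation.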

Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
Original Paper PDF


Denoising autoencoders (DAE) are AEs where we don’t feed just the input data, but we feed the input data with noise (like making an image more grainy). We compute the error the same way though, so the output of the network is compared to the original input without noise. This encourages the network not to learn details but broader features, as learning smaller features often turns out to be “wrong” due to it constantly changing with noise.
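The way the error is computed can be sketched without an actual network. `identity_model` below is a hypothetical stand-in for an autoencoder that merely copies its input; it shows that copying is penalised, because the comparison is against the clean data.

```python
import numpy as np

rng = np.random.default_rng(5)
clean = rng.random((4, 6))
noisy = clean + rng.normal(scale=0.1, size=clean.shape)   # what the network is fed

def identity_model(x):
    """Stand-in for an autoencoder; it simply returns its input."""
    return x

output = identity_model(noisy)
loss = np.mean((output - clean) ** 2)   # compared to the original input without noise
```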

Vincent, Pascal, et al. “Extracting and composing robust features with denoising autoencoders.” Proceedings of the 25th international conference on Machine learning. ACM, 2008.
Original Paper PDF


Sparse autoencoders (SAE) are in a way the opposite of AEs. Instead of teaching a network to represent a bunch of information in less “space” or nodes, we try to encode information in more space. So instead of the network converging in the middle and then expanding back to the input size, we blow up the middle. These types of networks can be used to extract many small features from a dataset. If one were to train a SAE the same way as an AE, you would in almost all cases end up with a pretty useless identity network (as in what comes in is what comes out, without any transformation or decomposition). To prevent this, instead of feeding back the input, we feed back the input plus a sparsity driver. This sparsity driver can take the form of a threshold filter, where only a certain error is passed back and trained, the other error will be “irrelevant” for that pass and set to zero. In a way this resembles spiking neural networks, where not all neurons fire all the time (and points are scored for biological plausibility).

Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. “Efficient learning of sparse representations with an energy-based model.” Proceedings of NIPS. 2007.
Original Paper PDF


Markov chains (MC or discrete time Markov Chain, DTMC) are kind of the predecessors to BMs and HNs. They can be understood as follows: from this node where I am now, what are the odds of me going to any of my neighbouring nodes? They are memoryless (i.e. Markov Property) which means that every state you end up in depends completely on the previous state. While not really a neural network, they do resemble neural networks and form the theoretical basis for BMs and HNs. MCs aren’t always considered neural networks, and the same goes for BMs, RBMs and HNs. Markov chains aren’t always fully connected either.
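A small worked example: the transition matrix below is made up for illustration, with a few zero entries to show a chain that is not fully connected.

```python
import numpy as np

# Row i holds the odds of moving from state i to each state; every row sums to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.3, 0.7]])

def walk(P, start, steps, rng):
    """Random walk where each move depends only on the current state."""
    state, path = start, [start]
    for _ in range(steps):
        state = int(rng.choice(len(P), p=P[state]))
        path.append(state)
    return path

path = walk(P, start=0, steps=10, rng=np.random.default_rng(6))
```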

Hayes, Brian. “First links in the Markov chain.” American Scientist 101.2 (2013): 252.
Original Paper PDF


A Hopfield network (HN) is a network where every neuron is connected to every other neuron; it is a completely entangled plate of spaghetti as even all the nodes function as everything. Each node is input before training, then hidden during training and output afterwards. The networks are trained by setting the value of the neurons to the desired pattern after which the weights can be computed. The weights do not change after this. Once trained for one or more patterns, the network will always converge to one of the learned patterns because the network is only stable in those states. Note that it does not always conform to the desired state (it’s not a magic black box sadly). It stabilises in part due to the total “energy” or “temperature” of the network being reduced incrementally during training. Each neuron has an activation threshold which scales to this temperature, which if surpassed by summing the input causes the neuron to take the form of one of two states (usually -1 or 1, sometimes 0 or 1). Updating the network can be done synchronously or more commonly one by one. If updated one by one, a fair random sequence is created to organise which cells update in what order (fair random being all options (n) occurring exactly once every n items). This is so you can tell when the network is stable (done converging): once every cell has been updated and none of them changed, the network is stable (annealed). These networks are often called associative memory because they converge to the most similar state as the input; if humans see half a table we can imagine the other half, this network will converge to a table if presented with half noise and half a table.
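The train-once, update-one-by-one procedure can be sketched as follows. This is the classic deterministic variant with a fixed threshold of zero and a Hebbian weight rule, so the temperature mentioned above is left out; the pattern and function names are assumptions for illustration.

```python
import numpy as np

def train(patterns):
    """Hebbian rule: weights are computed once from the patterns, then stay fixed."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)      # no self-connections
    return W

def recall(W, state, rng, max_sweeps=10):
    state = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(state)):    # every cell exactly once per sweep
            new = 1 if W[i] @ state >= 0 else -1
            if new != state[i]:
                state[i], changed = new, True
        if not changed:                          # a full sweep without changes: stable
            break
    return state

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1])
W = train(pattern[None, :])
noisy = pattern.copy()
noisy[:2] *= -1                                  # corrupt two cells
restored = recall(W, noisy, np.random.default_rng(7))
```

With a single stored pattern and only two corrupted cells, the network settles back into the stored pattern.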

Hopfield, John J. “Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the national academy of sciences 79.8 (1982): 2554-2558.
Original Paper PDF


Boltzmann machines (BM) are a lot like HNs, but: some neurons are marked as input neurons and others remain “hidden”. The input neurons become output neurons at the end of a full network update. It starts with random weights and learns through back-propagation, or more recently through contrastive divergence (a Markov chain is used to determine the gradients between two informational gains). Compared to a HN, the neurons mostly have binary activation patterns. As hinted by being trained by MCs, BMs are stochastic networks. The training and running process of a BM is fairly similar to a HN: one sets the input neurons to certain clamped values after which the network is set free (it doesn’t get a sock). While free the cells can get any value and we repetitively go back and forth between the input and hidden neurons. The activation is controlled by a global temperature value, which if lowered lowers the energy of the cells. This lower energy causes their activation patterns to stabilise. The network reaches an equilibrium given the right temperature.

Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and relearning in Boltzmann machines.” Parallel distributed processing: Explorations in the microstructure of cognition 1 (1986): 282-317.
Original Paper PDF


Restricted Boltzmann machines (RBM) are remarkably similar to BMs (surprise) and therefore also similar to HNs. The biggest difference between BMs and RBMs is that RBMs are more usable because they are more restricted. They don’t trigger-happily connect every neuron to every other neuron but only connect every different group of neurons to every other group, so no input neurons are directly connected to other input neurons and no hidden to hidden connections are made either. RBMs can be trained like FFNNs with a twist: instead of passing data forward and then back-propagating, you forward pass the data and then backward pass the data (back to the first layer). After that you train with forward-and-back-propagation.
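The forward pass followed by a backward pass to the first layer is the core of contrastive divergence, a widely used way to train RBMs. The sketch below shows a single CD-1 weight update under simplifying assumptions: no bias terms, binary units, random initial weights and a made-up input vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(8)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))   # only visible<->hidden weights

def cd1_update(v0, W, rng, lr=0.1):
    """One contrastive divergence (CD-1) step without bias terms."""
    p_h0 = sigmoid(v0 @ W)                        # forward pass
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0    # sample binary hidden states
    p_v1 = sigmoid(h0 @ W.T)                      # backward pass to the first layer
    p_h1 = sigmoid(p_v1 @ W)                      # forward pass on the reconstruction
    return W + lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
W_new = cd1_update(v0, W, rng)
```

The single weight matrix `W` reflects the restriction: there is simply no place to store visible-to-visible or hidden-to-hidden weights.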

Smolensky, Paul. Information processing in dynamical systems: Foundations of harmony theory. No. CU-CS-321-86. COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.
Original Paper PDF


Deep belief networks (DBN) is the name given to stacked architectures of mostly RBMs or VAEs. These networks have been shown to be effectively trainable stack by stack, where each AE or RBM only has to learn to encode the previous network. This technique is also known as greedy training, where greedy means making locally optimal solutions to get to a decent but possibly not optimal answer. DBNs can be trained through contrastive divergence or back-propagation and learn to represent the data as a probabilistic model, just like regular RBMs or VAEs. Once trained or converged to a (more) stable state through unsupervised learning, the model can be used to generate new data. If trained with contrastive divergence, it can even classify existing data because the neurons have been taught to look for different features.

Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.” Advances in neural information processing systems 19 (2007): 153.
Original Paper PDF


Convolutional neural networks (CNN or deep convolutional neural networks, DCNN) are quite different from most other networks. They are primarily used for image processing but can also be used for other types of input such as audio. A typical use case for CNNs is where you feed the network images and the network classifies the data, e.g. it outputs “cat” if you give it a cat picture and “dog” when you give it a dog picture. CNNs tend to start with an input “scanner” which is not intended to parse all the training data at once. For example, to input an image of 200 x 200 pixels, you wouldn’t want a layer with 40 000 nodes. Rather, you create a scanning input layer of say 20 x 20 which you feed the first 20 x 20 pixels of the image (usually starting in the upper left corner). Once you passed that input (and possibly use it for training) you feed it the next 20 x 20 pixels: you move the scanner one pixel to the right. Note that one wouldn’t move the input 20 pixels (or whatever scanner width) over, you’re not dissecting the image into blocks of 20 x 20, but rather you’re crawling over it. This input data is then fed through convolutional layers instead of normal layers, where not all nodes are connected to all nodes. Each node only concerns itself with close neighbouring cells (how close depends on the implementation, but usually not more than a few). These convolutional layers also tend to shrink as they become deeper, mostly by easily divisible factors of the input (so 20 would probably go to a layer of 10 followed by a layer of 5). Powers of two are very commonly used here, as they can be divided cleanly and completely by definition: 32, 16, 8, 4, 2, 1. Besides these convolutional layers, they also often feature pooling layers. Pooling is a way to filter out details: a commonly found pooling technique is max pooling, where we take say 2 x 2 pixels and pass on the pixel with the most red.
To apply CNNs for audio, you basically feed the input audio waves and inch over the length of the clip, segment by segment. Real world implementations of CNNs often glue an FFNN to the end to further process the data, which allows for highly non-linear abstractions. These networks are called DCNNs but the names and abbreviations between these two are often used interchangeably.
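
The scanning and pooling steps described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions of our own: a single-channel image stored as nested lists, a hand-picked 2 x 2 kernel instead of learned weights, and the hypothetical helper names `convolve2d_valid` and `max_pool`.

```python
def convolve2d_valid(image, kernel):
    """Crawl `kernel` over `image` one pixel at a time (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            total = 0.0
            # Each output node only looks at a small neighbourhood of the input
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


def max_pool(feature_map, size=2):
    """Keep only the largest value of every non-overlapping size x size block."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(block))
        out.append(row)
    return out


# Toy 5 x 5 "image" (assumption: one channel, pixel value = row * column)
image = [[float(r * c) for c in range(5)] for r in range(5)]
kernel = [[1.0, 0.0], [0.0, -1.0]]  # hand-picked, not learned

features = convolve2d_valid(image, kernel)  # 4 x 4 feature map
pooled = max_pool(features)                 # 2 x 2 after pooling
```

Here the 5 x 5 input shrinks to a 4 x 4 feature map and then to a 2 x 2 pooled map, `[[-1.0, -3.0], [-3.0, -5.0]]`. A real CNN would learn many kernels per layer and work on all colour channels at once.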

LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.
Original Paper PDF


Deconvolutional networks (DN), also called inverse graphics networks (IGNs), are reversed convolutional neural networks. Imagine feeding a network the word “cat” and training it to produce cat-like pictures, by comparing what it generates to real pictures of cats. DNs can be combined with FFNNs just like regular CNNs, but this is about the point where we draw the line on coming up with new abbreviations. They may be referred to as deep deconvolutional neural networks, but you could argue that when you stick FFNNs to the back and the front of DNs you have yet another architecture which deserves a new name. Note that in most applications one wouldn’t actually feed text-like input to the network, but more likely a binary classification input vector. Think <0, 1> being cat, <1, 0> being dog and <1, 1> being cat and dog. The pooling layers commonly found in CNNs are often replaced with similar inverse operations, mainly interpolation and extrapolation with biased assumptions (if a pooling layer uses max pooling, you can invent exclusively lower new data when reversing it).
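
Reversing a pooling layer can be illustrated with the simplest such operation, nearest-neighbour upsampling. This is a sketch under our own assumptions (nested lists, a hypothetical `upsample_nearest` helper) and not the method of the paper cited below: each value is copied into a 2 x 2 block, so none of the invented values exceeds the pooled maximum it came from.

```python
def upsample_nearest(feature_map, factor=2):
    """Grow a 2D map by repeating every value `factor` times in both directions."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


pooled = [[1, 2],
          [3, 4]]
restored = upsample_nearest(pooled)
# restored is a 4 x 4 map in which each pooled value fills one 2 x 2 block
```

Smoother interpolation schemes would replace the plain copies with values that blend neighbouring blocks.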

Zeiler, Matthew D., et al. “Deconvolutional networks.” Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
Original Paper PDF


Deep convolutional inverse graphics networks (DCIGN) have a somewhat misleading name, as they are actually VAEs but with CNNs and DNs for the respective encoders and decoders. These networks attempt to model “features” in the encoding as probabilities, so that they can learn to produce a picture with a cat and a dog together, having only ever seen one of the two in separate pictures. Similarly, you could feed one a picture of a cat with your neighbours’ annoying dog in it, and ask it to remove the dog, without it ever having done such an operation. Demos have shown that these networks can also learn to model complex transformations on images, such as changing the source of light or the rotation of a 3D object. These networks tend to be trained with back-propagation.

Kulkarni, Tejas D., et al. “Deep convolutional inverse graphics network.” Advances in Neural Information Processing Systems. 2015.
Original Paper PDF


Generative adversarial networks (GAN) are a different breed of network; they are twins: two networks working together. GANs consist of any two networks (although often a combination of FFs and CNNs), with one tasked to generate content and the other to judge content. The discriminating network receives either training data or generated content from the generative network. How well the discriminating network was able to correctly predict the data source is then used as part of the error for the generating network. This creates a form of competition where the discriminator is getting better at distinguishing real data from generated data and the generator is learning to become less predictable to the discriminator. This works well in part because even quite complex noise-like patterns are eventually predictable, but generated content similar in features to the input data is harder to learn to distinguish. GANs can be quite difficult to train, as you don’t just have to train two networks (either of which can pose its own problems) but their dynamics need to be balanced as well. If prediction or generation becomes too good compared to the other, a GAN won’t converge as there is intrinsic divergence.
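
To make the competition concrete, here is a minimal sketch of how the two error signals could be computed from the discriminator’s outputs. It assumes the discriminator returns a probability that its input is real and uses binary cross-entropy for both players; the function names and the example scores are made up for illustration, and no actual networks are trained.

```python
import math


def binary_cross_entropy(prediction, target):
    """Error of one probability `prediction` against a 0/1 `target`."""
    eps = 1e-12  # keeps log() away from zero
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(scores_real, scores_fake):
    """The judge wants real data scored as 1 and generated data scored as 0."""
    real = sum(binary_cross_entropy(s, 1.0) for s in scores_real) / len(scores_real)
    fake = sum(binary_cross_entropy(s, 0.0) for s in scores_fake) / len(scores_fake)
    return real + fake


def generator_loss(scores_fake):
    """The generator is penalised whenever its output is recognised as fake."""
    return sum(binary_cross_entropy(s, 1.0) for s in scores_fake) / len(scores_fake)


# Made-up discriminator outputs (probability that the input is real)
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])  # judge is doing well
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])  # judge is guessing
caught = generator_loss([0.1, 0.2])                     # fakes are spotted
fooling = generator_loss([0.6, 0.7])                    # fakes pass as real
```

A confident and correct discriminator has a low loss of its own while the generator’s loss is high, and the reverse holds when the generator fools it. That tug of war is what has to stay balanced during training.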

Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in Neural Information Processing Systems (2014).
Original Paper PDF


Liquid state machines (LSM) are similar soups, looking a lot like ESNs. The real difference is that LSMs are a type of spiking neural network: sigmoid activations are replaced with threshold functions and each neuron is also an accumulating memory cell. So when updating a neuron, the value is not set to the sum of the neighbours, but rather added to itself. Once the threshold is reached, it releases its energy to other neurons. This creates a spiking-like pattern, where nothing happens for a while until a threshold is suddenly reached.
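
The accumulate-then-fire behaviour of a single neuron can be sketched as follows. This is a simplification of our own (one neuron, no leak, reset to zero after a spike, hypothetical function name), not a full liquid state machine.

```python
def simulate_spiking_neuron(inputs, threshold=1.0):
    """Accumulate incoming values and emit a spike (1) once the threshold is reached."""
    potential = 0.0
    spikes = []
    for value in inputs:
        potential += value  # added to the stored value, not overwriting it
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # the neuron releases its energy and starts over
        else:
            spikes.append(0)
    return spikes


spikes = simulate_spiking_neuron([0.25] * 8)
# spikes == [0, 0, 0, 1, 0, 0, 0, 1]
```

Nothing happens for three steps, then the fourth input pushes the neuron over the threshold and the cycle repeats.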

Maass, Wolfgang, Thomas Natschläger, and Henry Markram. “Real-time computing without stable states: A new framework for neural computation based on perturbations.” Neural computation 14.11 (2002): 2531-2560.
Original Paper PDF


Extreme learning machines (ELM) are basically FFNNs but with random connections. They look very similar to LSMs and ESNs, but they are not recurrent nor spiking. They also do not use backpropagation. Instead, they start with random weights and train the weights in a single step according to the least-squares fit (lowest error across all functions). This results in a much less expressive network but it’s also much faster than backpropagation.
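
A minimal sketch of the single-step training could look like this. The hidden weights are drawn at random and never changed; only the output weights are solved for with regularised least squares. The tiny ridge term, the tanh activation, the XOR toy data and all function names are our own assumptions for illustration, and the small linear solver only stands in for a proper linear algebra library.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def hidden_layer(x, w_in, b_in):
    """Random hidden layer with tanh activations; it is never trained."""
    return [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
            for ws, b in zip(w_in, b_in)]


def train_elm(inputs, targets, n_hidden=10, ridge=1e-6, seed=0):
    rng = random.Random(seed)
    w_in = [[rng.uniform(-2, 2) for _ in inputs[0]] for _ in range(n_hidden)]
    b_in = [rng.uniform(-2, 2) for _ in range(n_hidden)]
    h = [hidden_layer(x, w_in, b_in) for x in inputs]
    # Normal equations of ridge-regularised least squares:
    # (H^T H + ridge * I) beta = H^T y
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(row[i] * t for row, t in zip(h, targets)) for i in range(n_hidden)]
    beta = solve(gram, rhs)  # the single training step
    return w_in, b_in, beta


def predict(x, model):
    w_in, b_in, beta = model
    return sum(weight * hi for weight, hi in zip(beta, hidden_layer(x, w_in, b_in)))


xor_inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
xor_targets = [0.0, 1.0, 1.0, 0.0]
model = train_elm(xor_inputs, xor_targets)
predictions = [predict(x, model) for x in xor_inputs]
```

There is no training loop at all: one linear solve replaces the many small updates backpropagation would need, which is where the speed comes from.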

Huang, Guang-Bin, et al. “Extreme learning machine: Theory and applications.” Neurocomputing 70.1-3 (2006): 489-501.
Original Paper PDF


Echo state networks (ESN) are yet another different type of (recurrent) network. This one sets itself apart from others by having random connections between the neurons (i.e. not organised into neat sets of layers), and they are trained differently. Instead of feeding input and back-propagating the error, we feed the input, forward it and update the neurons for a while, and observe the output over time. The input and the output layers have a slightly unconventional role as the input layer is used to prime the network and the output layer acts as an observer of the activation patterns that unfold over time. During training, only the connections between the observer and the (soup of) hidden units are changed.
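
The training scheme can be sketched as follows: a fixed random soup of neurons is driven by the input, every state is recorded, and only the weights of the observing output are fitted. Everything concrete here is an assumption for illustration: the reservoir size, the weight scaling, the toy task of recalling the previous input, and the use of plain gradient descent for the output weights (a least-squares fit is a common alternative).

```python
import math
import random


def run_reservoir(inputs, w_in, w_res):
    """Feed the inputs through the fixed random reservoir and record every state."""
    n = len(w_in)
    state = [0.0] * n
    states = []
    for u in inputs:
        state = [math.tanh(w_in[i] * u + sum(w_res[i][j] * state[j] for j in range(n)))
                 for i in range(n)]
        states.append(state)
    return states


def mean_squared_error(states, targets, w_out):
    errors = [sum(w * s for w, s in zip(w_out, st)) - t for st, t in zip(states, targets)]
    return sum(e * e for e in errors) / len(errors)


def train_readout(states, targets, lr=0.01, epochs=200):
    """Gradient descent on the output weights only; the reservoir is never touched."""
    n = len(states[0])
    w_out = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for st, t in zip(states, targets):
            err = sum(w * s for w, s in zip(w_out, st)) - t
            for i in range(n):
                grad[i] += 2.0 * err * st[i] / len(states)
        w_out = [w - lr * g for w, g in zip(w_out, grad)]
    return w_out


rng = random.Random(0)
n_units = 20
w_in = [rng.uniform(-1, 1) for _ in range(n_units)]
# Small recurrent weights, so that old inputs fade out instead of blowing up
w_res = [[rng.uniform(-1, 1) / n_units for _ in range(n_units)] for _ in range(n_units)]

signal = [rng.uniform(-1, 1) for _ in range(200)]
states = run_reservoir(signal[1:], w_in, w_res)
targets = signal[:-1]  # toy task: recall the previous input from the current state
w_out = train_readout(states, targets)

mse_before = mean_squared_error(states, targets, [0.0] * n_units)
mse_after = mean_squared_error(states, targets, w_out)
```

Because `w_in` and `w_res` stay fixed, the reservoir states only need to be computed once; all learning happens in `w_out`.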

Jaeger, Herbert, and Harald Haas. “Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication.” science 304.5667 (2004): 78-80.
Original Paper PDF


Deep residual networks (DRN) are very deep FFNNs with extra connections passing input from one layer to a later layer (often 2 to 5 layers) as well as the next layer. Instead of trying to find a solution for mapping some input to some output across say 5 layers, the network is forced to learn to map some input to some output + some input. Basically, it adds an identity to the solution, carrying the older input over and serving it freshly to a later layer. It has been shown that these networks are very effective at learning patterns up to 150 layers deep, much more than the regular 2 to 5 layers one could expect to train. However, it has been proven that these networks are in essence just RNNs without the explicit time-based construction and they’re often compared to LSTMs without gates.
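
The “output + input” idea fits in a few lines. The sketch below assumes a block of two small dense layers with a ReLU in between and uses hypothetical helper names; many implementations apply another activation after the addition, which is left out here so the identity behaviour is easy to see.

```python
def dense(x, weights, biases):
    """Plain fully connected layer: one weight row and one bias per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, w1, b1, w2, b2):
    """Two dense layers plus a skip connection: output = F(x) + x."""
    f = dense(relu(dense(x, w1, b1)), w2, b2)
    return [fi + xi for fi, xi in zip(f, x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]
x = [1.0, -2.0]

# With all weights at zero the block simply passes its input through
passthrough = residual_block(x, zeros, no_bias, zeros, no_bias)
# With non-zero weights the layers only have to learn the difference to the input
adjusted = residual_block(x, identity, no_bias, identity, no_bias)
```

An untrained block therefore starts out close to an identity mapping, which is one intuition for why stacking very many of them remains trainable.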

He, Kaiming, et al. “Deep residual learning for image recognition.” arXiv preprint arXiv:1512.03385 (2015).
Original Paper PDF


Neural Turing machines (NTM) can be understood as an abstraction of LSTMs and an attempt to un-black-box neural networks (and give us some insight into what is going on in there). Instead of coding a memory cell directly into a neuron, the memory is separated. It’s an attempt to combine the efficiency and permanency of regular digital storage and the efficiency and expressive power of neural networks. The idea is to have a content-addressable memory bank and a neural network that can read from and write to it. The “Turing” in Neural Turing Machines comes from them being Turing complete: the ability to read and write and change state based on what it reads means it can represent anything a Universal Turing Machine can represent.
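
A content-addressable memory bank with soft reads and writes can be sketched like this. The controller network is left out entirely: the key, erase and add vectors below are supplied by hand, and the function names and the sharpness value are assumptions for illustration.

```python
import math


def content_weights(memory, key, sharpness=5.0):
    """Address memory by content: rows similar to `key` receive more weight."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / (norm + 1e-8)

    scores = [math.exp(sharpness * cosine(row, key)) for row in memory]
    total = sum(scores)
    return [s / total for s in scores]


def read(memory, weights):
    """Weighted blend of all memory rows."""
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(len(memory[0]))]


def write(memory, weights, erase, add):
    """Partially erase each row, then add new content, both scaled by its weight."""
    return [[cell * (1.0 - w * e) + w * a for cell, e, a in zip(row, erase, add)]
            for w, row in zip(weights, memory)]


memory = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

weights = content_weights(memory, key=[0.0, 0.9, 0.1])  # focuses on the second row
recalled = read(memory, weights)
updated = write(memory, [0.0, 1.0, 0.0], erase=[1.0, 1.0, 1.0], add=[0.5, 0.5, 0.5])
```

Because reading and writing are weighted sums rather than hard lookups, the whole memory access stays differentiable and can be trained together with the network that drives it.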

Graves, Alex, Greg Wayne, and Ivo Danihelka. “Neural turing machines.” arXiv preprint arXiv:1410.5401 (2014).
Original Paper PDF


Differentiable Neural Computers (DNC) are enhanced Neural Turing Machines with scalable memory, inspired by how memories are stored by the human hippocampus. The idea is to take the classical Von Neumann computer architecture and replace the CPU with an RNN, which learns when and what to read from the RAM. Besides having a large bank of numbers as memory (which may be resized without retraining the RNN), the DNC also has three attention mechanisms. These mechanisms allow the RNN to query the similarity of a bit of input to the memory’s entries, the temporal relationship between any two entries in memory, and whether a memory entry was recently updated – which makes it less likely to be overwritten when there’s no empty memory available.

Graves, Alex, et al. “Hybrid computing using a neural network with dynamic external memory.” Nature 538 (2016): 471-476.
Original Paper PDF


Capsule Networks (CapsNet) are biologically inspired alternatives to pooling, where neurons are connected with multiple weights (a vector) instead of just one weight (a scalar). This allows neurons to transfer more information than simply which feature was detected, such as where a feature is in the picture or what colour and orientation it has. The learning process involves a local form of Hebbian learning that values correct predictions of output in the next layer.

Sabour, Sara, Frosst, Nicholas, and Hinton, G. E. “Dynamic Routing Between Capsules.” In Advances in neural information processing systems (2017): 3856-3866.
Original Paper PDF


Kohonen networks (KN, also self organising (feature) map, SOM, SOFM) utilise competitive learning to classify data without supervision. Input is presented to the network, after which the network assesses which of its neurons most closely match that input. These neurons are then adjusted to match the input even better, dragging along their neighbours in the process. How much the neighbours are moved depends on the distance of the neighbours to the best matching units.
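
One update step of such a map can be sketched as follows. The one-dimensional row of units, the Gaussian neighbourhood function, the learning rate and the function names are assumptions chosen for illustration.

```python
import math


def best_matching_unit(units, x):
    """Index of the unit whose weights are closest to the input."""
    return min(range(len(units)),
               key=lambda i: sum((w - xi) ** 2 for w, xi in zip(units[i], x)))


def som_step(units, x, lr=0.5, sigma=1.0):
    """Pull the best matching unit towards x and drag its neighbours along."""
    bmu = best_matching_unit(units, x)
    updated = []
    for i, weights in enumerate(units):
        # Influence shrinks with the distance to the best matching unit on the map
        influence = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
        updated.append([w + lr * influence * (xi - w) for w, xi in zip(weights, x)])
    return bmu, updated


units = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # three units on a 1D map
bmu, updated = som_step(units, x=[1.0, 1.2])
```

The middle unit wins and moves halfway to the input, while its two neighbours move by a smaller fraction. Repeating this over many inputs, usually with a shrinking learning rate and neighbourhood, lets the map organise itself.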

Kohonen, Teuvo. “Self-organized formation of topologically correct feature maps.” Biological cybernetics 43.1 (1982): 59-69.
Original Paper PDF


Attention networks (AN) can be considered a class of networks, which includes the Transformer architecture. They use an attention mechanism to combat information decay by separately storing previous network states and switching attention between the states. The hidden states of each iteration in the encoding layers are stored in memory cells. The decoding layers are connected to the encoding layers, but they also receive data from the memory cells filtered by an attention context. This filtering step adds context for the decoding layers, stressing the importance of particular features. The attention network producing this context is trained using the error signal from the output of the decoding layer. Moreover, the attention context can be visualized, giving valuable insight into which input features correspond with what output features.
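
The filtering step can be sketched as a weighted average over the stored encoder states. Dot-product scoring followed by a softmax is one common choice and is an assumption here, as are the function name and the toy vectors; the learned layers that would produce the states and the query are left out.

```python
import math


def attention_context(encoder_states, query):
    """Score every stored state against the query and blend them accordingly."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in encoder_states]
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * state[j] for w, state in zip(weights, encoder_states))
               for j in range(len(encoder_states[0]))]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one stored state per input step
weights, context = attention_context(encoder_states, query=[0.0, 2.0])
```

The returned `weights` sum to one and are exactly what gets plotted when attention is visualised: they show how strongly each input step contributes to the current output.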

Jaderberg, Max, et al. “Spatial Transformer Networks.” In Advances in neural information processing systems (2015): 2017-2025.
Original Paper PDF


Follow us on Twitter for future updates and posts. We welcome comments and feedback, and thank you for reading!

[Update 22 April 2019] Added Capsule Networks, Differentiable Neural Computers and Attention Networks to the Neural Network Zoo; removed Support Vector Machines; updated links to original articles. The previous version of this post can be found here.

Van Veen, F. & Leijnen, S. (2019). The Neural Network Zoo. Retrieved from https://www.asimovinstitute.org/neural-network-zoo

Artistic Style Transfer Blending

Transferring the style from one image to another has been done plenty of times before and has gotten a fair bit of media coverage lately. One thing we considered was the possibility of not just transferring the style from one image, but combining the styles of multiple images and transferring those: style transfer blending. After throwing around a few ideas, we came up with the thought of combining two images of different styles and feeding the result to existing style transfer applications. The results were quite interesting…

These are some of the input images we used for the various style combinations:

[Images: aurora, iceberg, roadrunner, starrynight]

We used three style permutations, each style being a compound of two input images. We tested each combined style on these three different images:

[Images: target3, target2, target1]

And here are some of the results after 200 iterations:

[Image: styleBlending]

There is definitely some potential in combining styles and transferring them to content. It may prove useful to designers looking for inspiration, providing a bigger and more diverse set of suggestions.