KERNEL-BASED MACHINE LEARNING METHODS FOR NEURAL BRAIN-MACHINE INTERFACES

Thesis submitted for the degree of “Doctor of Philosophy” by Lavi Shpigelman

Submitted to the Senate of the Hebrew University, October 2009

This work was carried out under the supervision of Professor Yoram Singer and Professor Eilon Vaadia

Abstract

A major focus of neuroscience is the interpretation of neural activity: understanding how it is affected by external stimuli, how it depends on its internal dynamics and how it affects behavior. This work is about the development of machine learning methods for interpreting neural activity in order to predict intended behavior. Development starts by learning to track behavior and ends in using these algorithms to artificially perform motor-control tasks in a (closed-loop) Brain-Machine Interface (BMI) setting.

Each of the studies in this thesis tackles a different aspect of modeling how neural activity is related to observed or desired behavior. The first study models how patterns of activity of a recorded neural population are compared by designing a similarity measure, known as a kernel. This novel kernel can be used as the core of many diverse prediction methods. It is tested within the Support Vector Regression setting in the first study and used again in the third study.

The second study is the development of a new feature selection method for finding a group of features of the neural activity that is (locally) optimal for tracking behavior. Feature selection can be viewed as a useful preprocessing step. It can find a better trade-off between signal and noise in the data than using the full set of inputs. In addition to improving results, it can help identify how the signal is represented in the data, which is important as a neuroscience tool. The main contribution of our method is that it captures a complex, non-linear dependency of the target function on its input and that it suggests a novel way of sifting through neural signals to select those that are most relevant to decoding.

The third study describes a novel kernel method that incorporates modeling of movement dynamics. It uses the kernel from the first study for comparing spike rate patterns and adds to it a model of the output dynamics, which results in improved reconstruction of hand movements.

The fourth study is an adaptation of a previously existing kernel method (one that can incorporate and further explore all the above modeling directions) to an on-line, adaptive, real-time setting and its implementation in a BMI lab experiment. Compared with the models in the third study, this method requires less computational power and also allows for a wider range of non-linearities (but this may be at the expense of learning models that are not optimal for their loss function). The main contribution of this study, however, is in finding solutions to the many challenges involved in converting a tracking application to a closed-loop BMI that can be used in a live experiment within the lab.

These four studies were all directed towards gaining a better understanding of the aspects of neural activity that are important in motor control and applying this understanding to producing the best possible working BMI setup from the algorithmic aspect. By successively adding complexity to the models, and intermittently removing some of the complexities in order to compare with simpler existing models, I show that the added complexities are justified in that they improve our tracking and prediction abilities. The end result is a state-of-the-art BMI setup that (once we stopped developing it) does not require the presence of a machine learning expert to operate, nor does it require a lab technician's monitoring. It continuously adapts to changes in the neural recordings and is useful from the second or third trial (starting from scratch in BMI mode), making it extremely quick in allowing the subject to gain control over it (e.g. there is no need for any pre-use training session, as is customary in other setups). It also appears to be very intuitive to use, as our BMI-naive monkey could easily use it in its very first BMI session.


The reader is invited to see some movies as a preview of the BMI work:

https://sites.google.com/site/shpigi/publications/NIPS2008_ShpigelmanLaVa.extra.zip

Contains off-line tracking of hand movement using only neural activity, and documentation of the first four days of the BMI experiment

https://sites.google.com/site/shpigi/publications/KARMA_3D_long.m4v

Shows a longer movie of continuous target-to-target reaching trials with the BMI in 3 dimensions

https://sites.google.com/site/shpigi/publications/KARMA_4D_BMI_demo_compressed.m4v

Shows a few continuous movements that require both reaching in 3D and rotation of the cursor (as a 4th movement dimension)


Contents

1 Introduction
  1.1 Subjective view of the neuroscience of motor control
  1.2 Brain-machine interfaces for motor control
  1.3 Existing algorithms for mapping neural activity to behavior
  1.4 Overview of the thesis

2 Results
  2.1 Spikernels: Predicting Arm Movements by Embedding Population Spike Rate Patterns in Inner-Product Spaces
  2.2 Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
  2.3 A Temporal Kernel-Based Model for Tracking Hand-Movements from Neural Activities
  2.4 Kernel-ARMA for Hand Tracking and Brain-Machine Interfacing During 3D Motor Control

3 Discussion

4 Bibliography

Chapter 1

Introduction

The process of making movements to achieve desired results is known as motor control and is one of the main functions of the brain. Primary motor cortex plays a central role in this function. By designing learning algorithms that decode activity in primary motor cortex into desired behavior we provide some understanding of how primary motor cortex operates. Furthermore, by building a good BMI we can equip patients with the means to overcome some handicaps, such as various paralyses that leave the motor cortex unharmed.

Several technologies have evolved in recent years to a point where experiments with brain-machine interfaces based on neuronal recordings have become feasible in a lab setting. This has resulted in several BMI experiments and a larger number of off-line experiments oriented towards BMI use. However, one area of BMI technology that, prior to this work, remained relatively unexplored in the full BMI setting is the study and design of novel machine learning algorithms for mapping neural activity to performed or desired behavior. Most BMI studies use variants of linear regression for this purpose. Development of advanced decoding methods can improve prediction accuracy and BMI usability, making better use of available information and achieving better levels of performance than simpler existing methods while using a smaller number of recording channels.


Finally, the search for good decoding models leads to better understanding of how information regarding behavior is stored and processed in the brain.

Another area that received little attention in previous studies is the protocols used to train BMIs. These are typically flawed, as they require the subjects to undergo tedious training at the beginning of each BMI session before they start using the BMI, and they produce models that are not learned from BMI interactions. Proper design of the setup and the machine learning algorithm can alleviate much of this hassle, shorten the time from session start until the BMI becomes usable, learn models that are derived on-line from the BMI interaction and provide an interface that is easier to use.

To understand the motivation, problems, challenges, solutions and conclusions that can be drawn from this work, several topics need to be reviewed. Section 1.1 presents my point of view on the neuroscience of motor control. BMIs and their use for motor control are introduced in Section 1.2. Section 1.3 discusses current algorithms for mapping neural activity to behavior, especially in BMI settings, and Section 1.4 provides an overview of my work and the papers of Chapter 2.

1.1 Subjective view of the neuroscience of motor control

The neuroscience of motor control aims at understanding how the brain solves problems relating to the performance of behavioral tasks involving movement. This includes such things as the planning of movement, integration of information from different sources, maintaining internal models of the external environment and of our bodies, predicting the outcome of actions and neural commands, compensating for feedback delays, taking corrective actions, adapting to changes, learning new tasks and skills and adjusting behavior as task goals and the external environment change.

Science in general, and neuroscience in particular, has gained much by divide-and-conquer approaches, prying large problems apart into smaller sub-problems. This is certainly true for motor control as well. Nevertheless, it is important to remember that models which ignore parts of the system make simplifying assumptions that can render them ineffective or detrimental to understanding of the full setting. Early motor-control research provided us with first approximations of what motor cortex does that were adequate at the time. However, given today's experimental and mathematical tools, most simplifications made by models dating 20-30 years ago are grossly over-simplifying, missing important parts of the puzzle.

Motor control experiments in awake behaving monkeys date back to the late 1960s. From then until the late 1980s, such experiments typically involved recording from a single neuron at a time while a monkey performed tasks involving multiple repetitions of relatively simple movements, such as single-joint flexion/extension (e.g. [12]) or straight reaching from point to point (e.g. [18]). The main questions being asked were framed in terms of scalar receptive fields, a.k.a. tuning curves: i.e., what parameter of the movement, and what values of this parameter, are the primary cause of an elevated firing rate in a neuron. Answers to these questions were obtained by having the animal repeat the same behavior as accurately as could be controlled over many trials and summing the number of spikes during various parts of the task. Using this method, numerous correlates were found between single neurons' activities and behavior, ranging from low-level skeleto-muscular features, such as sensitivity to specific activation of muscles and torques around joints, to higher-level parameters such as joint kinematics and visuo-motor parameters such as direction of movement, speed, target location and others (for a recent review see [45]). As longer recording times were achieved, more repetitions could be made, more cells could be recorded and finer details of neuronal activity could be discerned. Peri-Stimulus-Time-Histograms [20] were used to show timing relationships between behavioral events and neuronal activity, and Joint-Peri-Stimulus-Time-Histograms [1] were used to show stimulus- and time-dependent correlations between pairs of cells.
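The trial-averaging procedure described above can be made concrete with a short sketch of a PSTH computation. This is a minimal illustration using simulated Poisson spike trains; the trial count, firing rate and bin width are arbitrary choices, not values from the cited experiments:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented data: spike times (s) on 50 trials, aligned to an event at t = 0,
# drawn from a homogeneous Poisson process at 20 spikes/s over a 1 s window.
n_trials, rate = 50, 20.0
trials = [np.sort(rng.uniform(0.0, 1.0, rng.poisson(rate)))
          for _ in range(n_trials)]

def psth(trials, t_start, t_stop, bin_width):
    """Peri-Stimulus-Time-Histogram: trial-averaged firing rate per time bin."""
    edges = np.arange(t_start, t_stop + bin_width, bin_width)
    counts = sum(np.histogram(t, edges)[0] for t in trials)
    return edges[:-1], counts / (len(trials) * bin_width)   # spikes/s

t, r = psth(trials, 0.0, 1.0, 0.05)
print(r.mean())   # estimated rate; should be near the simulated 20 spikes/s
```

A real analysis would align spike times to a behavioral event on each trial and often smooth the histogram; the principle of averaging over many repetitions is the same, which is exactly why such methods require reproducible behavior.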

Unlike the findings in many sensory areas of the brain, it soon became clear that motor cortex is not organized in a solely hierarchical, somatotopic or functional manner. Nearby cells were often found to be primarily tuned to widely different values of the same movement parameter, or to entirely different movement parameters, or best described in relation to different reference frames, or active in different tasks. Many studies find cells that are highly correlated with behavior, but not in any of the standard engineering-suggested frames of reference. Also, tuning was sometimes shown to be task- and even trial-phase-dependent, changing as the task goal changed. For a recent review of these apparent inconsistencies see [44]. Furthermore, deciding what feature of the movement a neuron is most tuned to is difficult, since in many experimental settings there is a high degree of correlation between different features of the movement [55]. Designing experiments that differentiate between one feature and another is hard and sometimes impossible.

The apparent lack of fine structure in motor cortex conflicts with many engineering-inspired designs of how motor command signals could be produced in the brain. These theories typically suggest that simple receptive fields should be combined to form complex ones and that primarily sensory-based frames of reference should be transformed in a modular fashion to muscle-oriented ones and vice versa, depending on the direction of information flow. These conceptual problems continue to baffle researchers in the field to this day.

The failure to clearly identify the basic building blocks of motor-cortex computations implies that the divide-and-conquer (bottom-up) approach cannot be easily completed to yield agreeable models of how motor control is achieved by the brain. Various top-down approaches have been proposed that start by describing the problems faced in motor control and what characterizes natural behavior (see, for example, [63, 62]) and suggest various behavioral optimality principles. Some examples are the minimum-jerk principle [16] and the minimum-variance principle [21]. A more general framework applies optimal-control theory to the problem [56, 33]. These approaches yield models that fit observed behavior and neuronal activities (where applicable) and can predict behavior in novel situations. However, just as movement parameters are highly correlated, so are models of behavioral optimality, and therefore deciding which one better approximates the cost functions utilized by the brain is hard. The more general framework of optimal control can wriggle around this problem a bit by claiming that the choice of cost function is task-goal dependent, can therefore change from task to task and may not have a simple form. Neural correlates of the cost would also change accordingly, making the theory hard to prove or disprove.

In my view, the last 50 years of research point more at what should not be assumed than at what can be taken as safe modeling simplifications, because there is such great diversity in where emphasis is placed by different studies. This refers to both the functional level (what the motor cortex does) and the implementation level (how computations are implemented in the neural network).

Motor control involves a dynamic-systems problem in which the motor cortex is a key participant. External sensory, internal state-related and task-goal-related information converges onto it. Its output is directed at multiple sites and affects multiple internal computational loops, but it also contributes to the direct activation of muscles [26]. Neural activity in this region reflects practically every aspect of the problem. It seems to operate at multiple reference frames and time constants. It is involved both in planning and in executing actions. It reflects changes in task goals [7] and it is even sensitive to how these goals may be presented to the subject [69]. It changes as the subject is required to cope with changes in the external environment (such as a force field [48, 2, 32] or a sensory transformation [36, 37]) and is involved in the process of learning internal models of the external environment, known as forward models (allowing it to predict the outcome of neural commands and motor actions), and their inverses (allowing it to transform desired states and sensory information into the neural signals that will bring them about) [63, 28]. It is also sensitive to prior learning and experience [3].

Neural activity in motor cortex is hardly reproducible from trial to trial, even in highly controlled lab settings. This is most probably not due to an inability of the neural population to reproduce its activity, since there are other (e.g. sensory) areas of the brain (such as visual or auditory cortex) where activity can be reproduced with high accuracy by carefully repeating the same stimuli [4]. The inability to reproduce neuronal activity in motor cortex from trial to trial with high fidelity is probably due to an inability of experimenters to control aspects of the internal state of the system that affect motor cortex. Analysis methods that average over different repetitions of the same movements, or aggregate cells recorded at separate times, are therefore at fault in this regard (see [9] and [67] for further motivation for the importance of simultaneous recording of neurons and for alternative methods for analyzing neural activity in spite of the irreproducibility problems).

The complexity of the neural activity in motor cortex seems to mirror the complexity of the computational problems that it is involved in processing. Information regarding its computations can be retrieved from various features of the activity, ranging from individual neuronal firing rates (e.g. [19]) to pairwise correlations and their time or task dependency (e.g. [57]) to population-based attractors and whole-population dynamics (see, for example, [9] again and references therein).

Any mathematical model that attempts to describe a real-world mechanism is a compromise between accuracy and usefulness. The models presented in this work are no exception. However, as will be seen in Chapter 2, I have chosen to give more emphasis to making models that are good at representing the relation between neural activity and behavior than to making models that yield (possibly overly) simple interpretations at the expense of prediction accuracy.


[Figure 1.1 (diagram not reproduced): Possible integration options of BMIs, showing the nervous system, biological sensors, muscles, the external environment and the BMI, connected by numbered paths (1-8) and lettered paths (A-D). Input paths are in blue; output paths are in red. Solid lines are paths that were implemented in this work and dashed lines are paths that were not.]

1.2 Brain-machine interfaces for motor control

Brain-Machine Interfaces are artificial devices that allow interfacing brains and machines by direct connection with neural activity in the brain. Figure 1.1 illustrates some of the diverse ways in which BMIs can be integrated with the natural sensory-motor loop that we use to interact with our environment (paths A, B, C, D). Different integration schemes can be used to achieve many diverse functions. Some BMIs aid or replace damaged senses by using artificial sensors to directly stimulate the central nervous system (paths 5, 8 instead of, or in addition to, paths C, D in the figure). Examples include the highly successful cochlear implants [13], in use today by more than 100,000 people, and the less-than-successful visual implants [46, 23]. Some BMIs may be used to bridge damaged neurological pathways from brain to muscles (paths 1, 2). One such study applied cortically driven Functional Electrical Stimulation to activate hand muscles, thus bypassing a possibly damaged path (A) through the spinal cord [34]. Many BMIs (including those explored in this work) translate neural activity into artificial motor control (paths 1, 4 instead of, or in addition to, paths A, B). Other BMIs can be used to replace or augment paths, circuits and neural networks entirely within the central nervous system, as demonstrated by [24] (paths 1, 8 in the figure). All BMIs share the property that they create new biofeedback loops that touch the nervous system either at their input (recordings), at their output (stimulation), or both.

BMIs have potential value both as treatments for disabilities and as research tools. As a medical aid, BMIs for motor control may enable severely handicapped individuals to perform object manipulation, such as control of a cursor, a robot arm or one's own limbs. Other applications, which require lower bit-rates, include navigation using a wheelchair and dictating sentences through a computer. As a research tool, BMIs short-circuit the visuo-motor (perception-action) loop, allowing the experimenter to single out various parts of the system and interact with them in a more direct manner, bypassing some of the need to model those parts of the system that are removed from the loop. For example, when a group of recorded neurons is used to control a BMI in an experimenter-specified manner, changing how part of the population's activity is interpreted and observing the resulting changes in behavior and in neuronal tuning curves [25] allows for a unique investigation of the mechanisms underlying motor learning and neuronal plasticity, while skipping the need for modeling how neural activity affects behavior (since this part is controlled by the experimenter).

A survey of all the various forms of BMIs and the various methods for obtaining signals related to neural activity is beyond the scope of this introduction. For some recent reviews see [31] (which provides a broad survey of the field from a clinical perspective), [43], [30], [11] and [15].

Different recording methods vary widely in how invasive they are and, accordingly, in what is being recorded. A partial list of signals includes spikes, Local Field Potentials (LFPs), ECoG signals, EEG, MEG, PET, fMRI and optical imaging data. These signals also vary widely in their time and space resolutions, how feasible it might be to record them outside the lab, for how long they can be recorded before tissue damage ensues, how susceptible they are to external noise and perturbations, how much volitional control can be exerted over them and at what rates of information transfer.

In this work I studied BMIs based on simultaneous extra-cellular recordings of spike trains from multiple cortical units in the primary motor cortex of non-human primates. This choice stems partially from a belief that, in spite of the numerous biomedical-engineering difficulties which this highly invasive method entails, it will prove most useful for clinical applications in the long run. A second, equally important factor in this choice is our desire to use the BMI platform as a research tool for better understanding the problem of motor control. While all the methods above attempt to provide a picture of the neural activity in the brain, recording actual spike trains (albeit from a small section of cortex) seems to us to be most beneficial for understanding neuronal mechanisms at this stage.

BMIs based on extra-cellular neuronal recordings in monkeys have been around in the form of closed biofeedback loops since the late 1960s [14]. However, only in the last decade did several technological developments allow implementation of BMIs that are useful as neural prosthetics. These include the development of the tools and expertise needed for conducting simultaneous extra-cellular recording of multiple cortical units with chronically implanted electrode arrays. Another development is the real-time extraction of neural activity and spike sorting from hundreds of channels using dedicated hardware and software. The third is an increase in computer speeds that allows development and implementation of memory- and processor-hungry algorithms which work under real-time constraints. This resulted in a flurry of studies from several labs involving BMIs for motor control, mainly in non-human primates [22, 61, 47, 7, 29, 59, 25, 53, 54, 38, 40, 41, 8]. One study was also performed in human quadriplegics, with at least one patient attaining reasonable control over a 2D cursor [38]. The main research directions pursued by these studies can be categorized as:

1. Development of extracellular recording methods for better recording of larger numbers of units and from numerous cortical areas.

2. Increasing the BMI's Degrees Of Freedom (DOFs) from 1 to 2, 3 and 4 (3D position + grip).

3. Directing the output to robotic arms instead of the screen (usually with the monkey seeing only an on-screen representation of the robotic device).

The algorithmic side of the BMIs in these studies stayed relatively underdeveloped, at least in the closed-loop setting. My work focuses on development of this component.

1.3 Existing algorithms for mapping neural activity to behavior

Surprisingly, with only three exceptions (one of which is our work), the only machine learning algorithm for BMIs based on recording of spikes that has been applied in a true closed-loop setting to date is linear regression [22, 61, 54, 47, 7, 53, 29, 38, 59, 25] (after this thesis was submitted, another non-linear BMI work was published [27]). The exceptions to this rule are the work of Chapin et al. [8], which enabled binary lever pressing in rats using an artificial neural network; Santhanam et al. [41], which used a simple maximum-likelihood approach to enable target selections by a monkey from up to 16 targets; and our work [50], introduced in the next section, which allowed a monkey to perform unrestricted 3 and 4 DOF cursor movements. The various linear regression implementations in BMI studies differ in how the regression weights are learned, what pre-processing is done to the spike trains (mainly linear transformations of binned spike counts), whether one or several time lags are used per cell, and whether or not weight adaptation schemes are implemented. Nevertheless, linear regression from estimated neural spike rates to current velocity is the method of choice, not only in BMI implementations, but also as the prevalent method for analyzing neuronal activity. Linear regression, also known as the Population Vector [18, 17, 19], with its implied cosine tuning of neurons to movement direction (or velocity), is the predominant tool used in describing neuronal activity and its relation to behavior and to experiment-controlled task changes. This seems to be in stark contrast with the notion promoted in Section 1.1 that motor control and its associated neural activity are highly non-linear and affected by many more factors. Several justifications can be made:

- Linear regression is a good-enough first approximation. Many cells do exhibit significant correlations between their firing rate and hand velocities (or planned movement direction) in definable parts of the trials. When a large enough population is being recorded, this information is enough to produce useful BMIs. One of the most recent studies to date [59] achieved impressive control of a 4 DOF robotic arm, which a monkey used to feed itself, using linear regression.

- The learned regression models are very easy to interpret. The concepts of preferred direction, measures of how close to cosine-tuned a cell is, and measures of how preferred direction is affected by task changes, experience or learning are routinely used to describe neuronal processes. Linear algebra and statistical methods that involve linear relations between parameters are fully developed and provide the researcher with a very wide playing field.

- Linear regression is easy to implement. It requires very little computational and programming effort and could be made to work in real time even with the previous century's technology.

- Other, more pressing challenges and scientific questions were at the focus of most BMI research (and many other neuroscience studies) until now.

- Once linear regression has become the accepted norm, not using linear regression would draw attention away from the main findings and require an explanation (unless the focus of the BMI study is decoding methods).

- Getting advanced machine learning techniques to work in real-world settings requires machine learning experts. Not many machine learning experts are willing to commit themselves to the kind of diligent, painstaking laboratory work needed to study BMIs in monkeys.
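For concreteness, the population-vector scheme underlying these linear decoders (each cell assumed cosine-tuned to direction, with preferred directions summed weighted by firing-rate modulation, in the spirit of [18]) can be sketched as follows. The cell count, tuning parameters and noise level here are invented for illustration, not taken from any cited study:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_trials = 40, 200

# Invented cosine-tuned cells: rate = baseline + depth * cos(theta - preferred).
pd = rng.uniform(0.0, 2 * np.pi, n_cells)   # preferred directions
b0 = rng.uniform(8.0, 12.0, n_cells)        # baseline rates (spikes/s)
m = rng.uniform(2.0, 5.0, n_cells)          # modulation depths

theta = rng.uniform(0.0, 2 * np.pi, n_trials)          # movement directions
rates = b0[:, None] + m[:, None] * np.cos(theta[None, :] - pd[:, None])
rates += rng.normal(0.0, 1.0, rates.shape)             # trial-to-trial noise

def population_vector(r, pd, b0):
    """Sum each cell's preferred direction weighted by its rate above baseline."""
    w = r - b0
    return np.arctan2(w @ np.sin(pd), w @ np.cos(pd))

decoded = np.array([population_vector(rates[:, i], pd, b0)
                    for i in range(n_trials)])
err = np.angle(np.exp(1j * (decoded - theta)))         # wrapped angular error
print(np.abs(err).mean())                              # mean error in radians
```

The sketch also illustrates the model's fragility: its accuracy rests entirely on the cosine-tuning assumption and on preferred directions covering the circle reasonably uniformly, assumptions that the non-linear, task-dependent tuning described in Section 1.1 routinely violates.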

Many studies, including ours, have shown in the off-line setting (mapping neural activity to observed motor behavior) that more advanced methods can produce greatly improved results, both over linear regression and over standard off-the-shelf methods that are not specialized for the neural decoding task. Improvements stem from several modeling concepts:

- Better modeling of the instantaneous relation between neural activity and behavior, such as the introduction of non-linearities through kernel methods ([51, 49, 50]) or the assumption of Poisson firing rates in generative models (e.g. [6, 5, 68]).

- Performing feature selection and extraction as a pre-processing step before further application of machine learning techniques ([54, 66, 35, 67, 9]).

- Incorporating movement models that can also function as the algorithm's state space ([6, 49, 5, 64, 65, 68, 50]).

- Allowing the models to adapt to changes as time goes by ([54, 65, 50]).

Brown et al. [6] applied a generalized linear model to reconstruct a rat's foraging trajectory in a 2D circular arena from the spiking activity of hippocampal place cells. They assume a movement model of Brownian motion in position space and Poisson firing rates (dependent on position and the theta rhythm).

Wu et al. [64] applied the well-known Kalman filter (assuming linear state transitions and observations with Gaussian noise) to the problem of reconstructing 2D hand movements made by a monkey. They used the current position, velocity and acceleration of the hand as the filter's state space. In [65], Wu et al. test the Kalman filter with model adaptivity using a sliding window over trials and show an improvement in decoding results.²
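A minimal sketch may help make this style of state-space decoder concrete. Here the state holds 2D position and velocity, and the simulated "neural" observations are linear in the state with Gaussian noise, exactly the assumptions the Kalman filter makes; the dimensions and noise levels are illustrative, not the values used by Wu et al.:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T, n_cells = 0.1, 300, 20

# State x = (px, py, vx, vy): constant-velocity dynamics with process noise.
A = np.eye(4)
A[0, 2] = A[1, 3] = dt
Q = 0.01 * np.eye(4)

# Observation model: firing rates assumed linear in the state, Gaussian noise.
H = rng.normal(0.0, 1.0, (n_cells, 4))
R = 0.5 * np.eye(n_cells)

# Simulate a hand trajectory and matching "neural" observations.
x = np.zeros((T, 4))
z = np.zeros((T, n_cells))
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(4), Q)
    z[t] = H @ x[t] + rng.multivariate_normal(np.zeros(n_cells), R)

# Kalman filter recursion: predict from the movement model, correct with rates.
xhat = np.zeros((T, 4))
P = np.eye(4)
for t in range(1, T):
    xp = A @ xhat[t - 1]
    Pp = A @ P @ A.T + Q                               # predicted covariance
    K = Pp @ H.T @ np.linalg.inv(H @ Pp @ H.T + R)     # Kalman gain
    xhat[t] = xp + K @ (z[t] - H @ xp)                 # correct with innovation
    P = (np.eye(4) - K @ H) @ Pp

rmse = np.sqrt(np.mean((xhat[:, :2] - x[:, :2]) ** 2))
print(rmse)   # position RMSE of the reconstruction
```

The movement model (A, Q) is doing real work here: it smooths the estimate by propagating position and velocity between observations, which is precisely the role the state space plays in the decoders discussed in this section.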

Brockwell et al. [5] allow for non-linear probabilistic state transitions and observations by use of particle filtering. They model neuronal spiking activity as having a Poisson dependency on current velocity and assume a movement model of Brownian motion in velocity space.

Sanchez et al. [40] used a recurrent non-linear artificial neural network (trained by back-propagation through time) with 5 artificial neurons in an internal middle layer. The internal state-space representation is the result of the optimization procedure and is not pre-specified. It therefore has no obvious counterpart in terms of the observed behavior.

Yu et al. [68] use Bayesian recursive generalized linear models where neural activity is Poisson-dependent on the state (as in [6]). The state space is current position and velocity. State transition is assumed linear with Gaussian noise (as in Kalman filtering). Their main contribution is the addition of a probabilistic dependency on (discrete) target choice, which results in a probability mixture model.

Churchland et al. [9] do not reconstruct behavior. Instead, they employ non-linear dimensionality reduction to visualize population activity in 3D. They show that the reduced trajectories differ as the task goals (and behavior) change and suggest that this dimensionality reduction captures the essential features of the neural activity.

² This adaptive Kalman filter is similar to the one we compare with in [50], though we use a different adaptivity scheme.

Many of the prominent recent studies listed here (with the exception of [40], [9] and our work) applied the above concepts in a generative (as opposed to a discriminative) setting. The advantage of using generative models is that (as their name suggests) they yield probabilistic explanations of how the data is generated. If the assumptions that they make are acceptable, the result is a description of the real-world mechanism. The possible disadvantages of these models are that:

- If the assumptions that the generative models make are unrealistic, then the value of the learned models as explanations of the real-world mechanisms is greatly diminished.

- Other, discriminative models may do a better job at learning to predict the desired outputs (behavior) from the inputs (neural activity).

Our work, described in the next section and in Chapter 2, makes use of discriminative methods.

1.4 Overview of the thesis

My research proposal states the following main objectives:

1. Gain better understanding of what information can be extracted from activity of a neural population during a motor control learning task.

2. Suggest and test new models for predicting the activity of an animal from primarily neural recordings, by incorporation of previously untried learning algorithms and known biological insights.

3. Find ways for on-line co-learning by human and machine with a common goal of improving performance in a motor task.

4. Elucidate neural/other changes during adaptation (learning) to a brain-machine interface.

I made some progress in all four objectives, especially in objective number 2.


My work makes use of kernel methods. For an overview of kernel methods see [58], [10] and [42]. In short, kernel methods provide a framework in which data is implicitly mapped into high dimensional feature spaces by the specification of kernel functions, which take two data points as input and return the dot product of their representations in feature space (a reproducing kernel Hilbert space). Algorithms that use the kernel trick access data only through the dot products of data pairs (i.e. the kernel values). The actual (possibly infinite) vectors of features that a kernel function implies are not explicitly calculated, making the learned models difficult to interpret, as it is not clear from the learned models what features of the data were most salient. Kernel methods produce top results in many diverse learning problems. For this reason I chose to apply them in most of my research.
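To make the "access the data only through kernel values" point concrete, here is a minimal kernel ridge regression sketch (a stand-in for illustration only; the thesis itself uses Support Vector Regression, and all function names below are mine). The feature map of the Gaussian kernel is infinite-dimensional, yet fitting and prediction touch the data exclusively through the kernel matrix:

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x1_i - x2_j||^2): the only interface to the data."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, gamma=0.5, reg=1e-2):
    """Solve (K + reg*I) alpha = y; the feature map is never computed explicitly."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + reg * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=0.5):
    """Predictions are weighted sums of kernel values against the training set."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha
```

Swapping `gaussian_kernel` for any other valid kernel (such as the Spikernel developed in this thesis) changes the implicit feature space without changing the learning algorithm.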

My first study [51] (section 2.1) addressed the issue of selecting a representation space for patterns of neural population spike rates. In this work we develop a kernel for spike-count patterns (which we call the Spikernel) and test it against the linear and standard kernels (Gaussian and polynomial variants) for predicting hand velocities in a 2D center-out reaching task using Support Vector Regression (SVR). For a comprehensive tutorial on SVR see [52]. The development of this kernel follows some insights on neural activity. We provide some validation of the assumptions that are implied by using this kernel by showing that using it in SVR achieves better results than linear methods and off-the-shelf kernels.

Feature selection (or weighing) can be used as a means of searching for the most important features of the data. It may also serve as a pre-processing step for other machine-learning techniques, which would save data collection and processing efforts. This is useful when these efforts have real-world costs (such as money or available computing resources). In a joint study with Amir Navot [35] (section 2.2) we develop a feature subset selection method based on a weighted version of the k-nearest-neighbor algorithm. This method can capture complex dependency of the target function on its input. We use this method to select features of neural activity in the task of reconstructing hand velocities in the same center-out reaching task as in my first study [51]. By applying feature selection we are able to improve prediction quality. Inspection of which features were selected provides a means of exploring the relative importance of the various neural inputs.
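The method of [35] weighs features by gradient ascent on a smooth version of the k-nearest-neighbor objective. As a rough illustration of the underlying idea only (greedy subset selection scored by leave-one-out k-NN error instead of gradient-based weighting; all names are mine), one might write:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor regression: average the targets of the k nearest points."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    return y_train[nn].mean(axis=1)

def forward_select(X, y, n_features, k=3):
    """Greedily add the feature that most reduces leave-one-out kNN squared error."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_features):
        def loo_error(feats):
            err = 0.0
            for i in range(len(X)):
                mask = np.arange(len(X)) != i          # hold out sample i
                pred = knn_predict(X[mask][:, feats], y[mask], X[i:i + 1, feats], k)
                err += (pred[0] - y[i]) ** 2
            return err
        best = min(remaining, key=lambda f: loo_error(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the score is a k-NN fit rather than, say, a correlation, a feature can be selected even when its relation to the target is non-linear, which is the property emphasized above.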

The above two studies mapped neural activities in a sliding window to current behavior without modeling of the behavior. It is well known that hand movements are quite stereotypical. For example, point-to-point reaching movements tend to be straight and follow a bell-shaped velocity profile which approximately minimizes changes in acceleration [16]. In [49] (section 2.3) we propose a kernel method that combines a linear state transition model with a kernel-induced non-linear regression from available observations. One choice of model parameters (using a linear kernel) reduces it to a model that is similar to the steady-state Kalman filter (an ARMA model). A different choice (neglecting the movement model) reduces it to standard SVR (as in the first study, [51]). We show that using this algorithm (with the Spikernel from [51]) we improve over standard off-the-shelf algorithms (such as the Kalman filter and SVR with the same kernel) and over the linear variant of this same method. This work emphasizes the importance of learning not only the input-output relation but also the dynamics of the output itself as part of the system model.
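The structural idea of blending an output-dynamics model with a per-frame observation map can be sketched as a toy filter. This is not the actual KARMA algorithm of [49, 50], which learns both the autoregressive part and the kernel regressor jointly from data; here the coefficient is fixed and all names are illustrative:

```python
import numpy as np

def karma_style_filter(obs_seq, f, alpha=0.8, y0=0.0):
    """Blend a learned observation-to-output map f with a simple AR(1)
    movement model:  y_t = alpha * y_{t-1} + (1 - alpha) * f(x_t).

    alpha = 0 degenerates to pure per-frame regression (the SVR-like limit
    mentioned above); alpha > 0 smooths predictions with output dynamics."""
    y = y0
    out = []
    for x in obs_seq:
        y = alpha * y + (1 - alpha) * f(x)
        out.append(y)
    return np.array(out)
```

Even this toy version shows why modeling output dynamics helps: frame-by-frame noise in `f` is low-pass filtered through the autoregressive state instead of appearing directly in the predicted trajectory.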

In order to apply an algorithm in a BMI setting, several conditions should be met. The first is:

• Inference time is limited by the requirements of the interface.

In our lab setting, we wanted to update the cursor position on the screen every 50 milliseconds, which (given some system overhead) left approximately 40 ms for calculations. There are various ways of dealing with a situation where inference is occasionally not available in time, but they all introduce a wealth of complications ranging from annoying cursor freezes to difficulty in model state updates to added uncertainty in later analysis of the experiment. I therefore considered this condition as a must-have.

Several other conditions stem from our belief that the algorithm should be adaptive. Most BMI experiments (with the exception of [54]) use static models that are typically learned in the beginning of a session and held fixed. The non-adaptive approach is good for studying how the subject copes with learning the non-changing interface (requiring effort on the part of the subject). It also circumvents possible difficulties that may arise due to the interaction dynamics between subject and machine, in which these two learning entities may end up changing too quickly for one to understand the other. However, the non-adaptive approach has some drawbacks. It usually involves a pre-session training period in which a model is learned and then held fixed for the rest of the session. During this pre-session training, the subjects are either required to perform the task by making hand movements (as in [7]), which is impossible with real patients, or are required to watch movies of desired behavior made by someone else, as in [38], and think about performing those movements. The neural activity from this training period is used to learn a model which is then used during the BMI session to perform the task. This protocol is flawed since the model that is learned is not based on BMI interaction and, as many studies have shown, neural activity during BMI usage is different from that observed in other situations, even when those other situations were used for training the model (e.g. [54, 7, 29]). The advantages of using an adaptive method are that the model is trained based on BMI interactions, a pre-session training period is unnecessary since training can take place simultaneously with regular interaction, and changes in the recorded neural activity (whether due to recording instabilities or actual changes in the neural activity) can be accommodated by the machine side instead of requiring correction on the part of the subject.

The requirements arising from the introduction of model adaptivity are:

• Learning time should not be too long, so that a learned model can be put to use shortly after the data used to learn it was gathered. Otherwise the learned model would be outdated.

• The amount of data that is used to learn a new model cannot grow beyond a certain size (otherwise model learning would take longer and longer).

• The problem of providing labels to neural activity recorded during BMI interaction must be dealt with.

The last of these points requires an explanation: training a BMI requires handling a bootstrap problem. The BMI model should be learned based on BMI usage, since neural activity during BMI trials is different from neural activity during non-BMI trials. This involves guessing what the subject was trying to do with the cursor during a BMI trial, as opposed to what the cursor actually did (which is the product of using the current model), and using that guess to update the model. Our assumption is that a trained subject would be thinking both of the path actually taken by the cursor and of some desired path for the cursor. Knowing the cursor's actual path, where the cursor ended up (or where the trial's target is), and assuming that the subject would have liked to produce quicker and straighter movements with velocity profiles that start at rest and end at rest (approximately bell shaped), allows us to create new trajectories that reflect our view of the subject's desires and use these labels for updating the BMI model.
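Trajectories with the desired properties (straight, starting and ending at rest, bell-shaped speed) can be generated with the classic minimum-jerk profile, which is consistent with the movement regularities cited above [16]. A minimal sketch (function name is mine):

```python
import numpy as np

def min_jerk(x0, xf, n_steps):
    """Minimum-jerk point-to-point trajectory: position follows the polynomial
    10*t^3 - 15*t^4 + 6*t^5 on normalized time t in [0, 1], which yields a
    bell-shaped velocity profile that starts and ends at rest."""
    t = np.linspace(0.0, 1.0, n_steps)
    s = 10 * t**3 - 15 * t**4 + 6 * t**5
    return x0 + (xf - x0) * s
```

Applying this per coordinate from the cursor's current position to the trial's target produces the kind of surrogate "desired" trajectory described above.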

The study in [50] (section 2.4) describes the adaptation of an algorithm which we call KARMA (Kernel Auto-Regressive Moving Average) and its application in a BMI experiment which conforms to all the above requirements.


Chapter 2

Results

2.1 Spikernels: Predicting Arm Movements by Embedding Population Spike Rate Patterns in Inner-Product Spaces

LETTER

Communicated by Chris Watkins

Spikernels: Predicting Arm Movements by Embedding Population Spike Rate Patterns in Inner-Product Spaces Lavi Shpigelman [email protected] School of Computer Science and Engineering and Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

Yoram Singer [email protected] School of Computer Science and Engineering Hebrew University, Jerusalem 91904, Israel

Rony Paz [email protected]

Eilon Vaadia [email protected] Interdisciplinary Center for Neural Computation and Department of Physiology, Hadassah Medical School The Hebrew University Jerusalem, 91904, Israel

Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this letter is the construction of biologically motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach by comparing the Spikernel to various standard kernels in the task of predicting hand movement velocities from cortical recordings. All of the kernels that we tested in our experiments outperform the standard scalar product used in linear regression, with the Spikernel consistently achieving the best performance.

Neural Computation 17, 671–690 (2005)  © 2005 Massachusetts Institute of Technology

1 Introduction

Neuronal activity in primary motor cortex (MI) during multijoint arm reaching movements in 2D and 3D (Georgopoulos, Schwartz, & Kettner, 1986;

Georgopoulos, Kettner, & Schwartz, 1988) and drawing movements (Schwartz, 1994) has been used extensively as a test bed for gaining understanding of neural computations in the brain. Most approaches assume that information is coded by firing rates, measured on various timescales. The tuning curve approach models the average firing rate of a cortical unit as a function of some external variable, like the frequency of an auditory stimulus or the direction of a planned movement. Many studies of motor cortical areas (Georgopoulos, Kalaska, & Massey, 1983; Georgopoulos et al., 1988; Moran & Schwartz, 1999; Schwartz, 1994; Laubach, Wessberg, & Nicolelis, 2000) showed that while single units are broadly tuned to movement direction, a relatively small population of cells (tens to hundreds) carries enough information to allow for accurate prediction. Such broad tuning can be found in many parts of the nervous system, suggesting that computation by distributed populations of cells is a general cortical feature. The population vector method (Georgopoulos et al., 1983, 1988) describes each cell's firing rate as the dot product between that cell's preferred direction and the direction of hand movement. The vector sum of preferred directions, weighted by the measured firing rates, is used both as a way of understanding what the cortical units encode and as a means for estimating the velocity vector. Several recent studies (Fu, Flament, Coltz, & Ebner, 1997; Donchin, Gribova, Steinberg, Bergman, & Vaadia, 1998; Schwartz, 1994) propose that neurons can represent or process multiple parameters simultaneously, suggesting that it is the dynamic organization of the activity in neuronal populations that may represent temporal properties of behavior such as the computation of transformation from desired action in external coordinates to muscle activation patterns.
A few studies (Vaadia et al., 1995; Laubach, Schuler, & Nicolelis, 1999; Riehle, Grün, Diesmann, & Aertsen, 1997) support the notion that neurons can associate and dissociate rapidly to functional groups in the process of performing a computational task. The concepts of simultaneous encoding of multiple parameters and dynamic representation in neuronal populations together could explain some of the conundrums in motor system physiology. These concepts also invite the use of increasingly complex models for relating neural activity to behavior. Advances in computing power and recent developments of physiological recording methods allow recording of ever growing numbers of cortical units that can be used for real time analysis and modeling. These developments and new understandings have recently been used to reconstruct movements on the basis of neuronal activity in real time in an effort to facilitate the development of hybrid brain-machine interfaces that allow interaction between living brain tissue and artificial electronic or mechanical devices to produce brain-controlled movements (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Laubach et al., 2000; Nicolelis, 2001; Wessberg et al., 2000; Laubach et al., 1999; Nicolelis, Ghazanfar, Faggin, Votaw, & Oliveira, 1997; Isaacs, Weber, & Schwartz, 2000).
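The population vector estimate described above is simple to state in code (a minimal sketch; function and variable names are mine, and subtracting a per-cell baseline rate is one common convention):

```python
import numpy as np

def population_vector(rates, preferred_dirs, baselines):
    """Estimate movement direction as the sum of each cell's preferred
    direction weighted by its baseline-subtracted firing rate
    (the Georgopoulos-style population vector)."""
    w = rates - baselines                       # rate modulation per cell
    v = (w[:, None] * preferred_dirs).sum(axis=0)
    return v / np.linalg.norm(v)                # unit direction estimate
```

With cosine-tuned cells whose preferred directions cover the circle uniformly, this weighted sum points in the true movement direction, which is the property the method relies on.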


Most of the current attempts that focus on predicting movement from cortical activity rely on modeling techniques that employ parametric models to describe a neuron's tuning (instantaneous spike rate) as a function of current behavior. For instance, cosine-tuning estimation (population vector) is used by Taylor, Tillery, and Schwartz (2002), and linear regression was employed by Wessberg et al. (2000) and Serruya, Hatsopoulos, Paninski, Fellows, and Donoghue (2002). Brown, Frank, Tang, Quirk, and Wilson (1998) and Brockwell, Rojas, and Kass (2004) have applied filtering methods that, apart from assuming parametric models of spike rate generation, also incorporate a statistical model of the movement. An exception is the use of artificial neural nets (Wessberg et al., 2000) (though this study reports getting better results by linear regression). This article describes the tailoring of a kernel method for the task of predicting two-dimensional hand movements. The main difference between our method and the other approaches is that our nonparametric approach does not assume cell independence, imposes relatively few assumptions on the tuning properties of the cells and how the neural code is read, allows representation of behavior in highly specific neural patterns, and, as we demonstrate later, results in better predictions on our test sets. However, due to the implicit manner in which neural activity is mapped to feature space, improved results are often achieved at the expense of understanding tuning of individual cells. We attempt to partially overcome this difficulty by describing the feature space that our kernel induces and by explaining the way our kernel parameters affect the features. The letter is organized as follows. In section 2, we describe the problem setting that this article is concerned with. In section 3, we introduce and explain the main mathematical tool that we use: the kernel operator.
In section 4, we discuss the design and implementation of a biologically motivated kernel for neural activities. The experimental method is described in section 5. We report experimental results in section 6 and give conclusions in section 7.

2 Problem Setting

Consider the case where we monitor instantaneous spike rates from q cortical units during physical motor behavior of a subject (the method used for translating recorded spike times into rate measures is explained in section 5). Our goal is to learn a predictive model of some behavior parameter (such as hand velocities, as described in section 5) with the cortical activity as the input. Formally, let S ∈ R^{q×m} be a sequence of instantaneous firing rates from q cortical units consisting of m time samples. We use S, T to denote sequences of firing rates and denote by len(S) the time length of a sequence S (len(S) = m in the above definition). S_i ∈ R^q designates the instantaneous firing rates of all neurons at time i in the sequence S, and S_{i,k} ∈ R the rate of neuron k. We also use Ss to denote the concatenation of S with one more sample s ∈ R^q. The instantaneous firing rate of a unit k in sample s is then s_k ∈ R. We also need to employ a notation for subsequences. A consecutive subsequence of S from time t1 to time t2 is denoted S_{t1:t2}. Finally, throughout the work, we need to examine possibly nonconsecutive subsequences. We denote by i ∈ Z^n a vector of time indices into the sequence S such that 1 ≤ i_1 < i_2 < · · · < i_n ≤ len(S). S_i ∈ R^{q×n} is then an ordered selection of spike rates (of all neurons) from times i. Let y ∈ R^m denote a parameter of the movement that we would like to predict (e.g., the movement velocity in the horizontal direction, v_x). Our goal is to learn an approximation ŷ of the form f: R^{q×m} → R^m from neural firing rates to movement parameter. In general, information about movement can be found in neural activity both before and after the time of movement itself. Our long-term plan, though, is to design a model that can be used for controlling a neural prosthesis. We therefore confine ourselves to causal predictors that use an l-long (l ∈ Z) window of neural activity ending at time step t, S_{(t−l+1):t}, to predict y_t ∈ R. We would like to make ŷ_t = f_t(S_{(t−l+1):t}) as close as possible (in a sense that is explained in the sequel) to y_t.

3 Kernel Method for Regression

The major mathematical tool employed in this article is kernel operators. Kernel operators allow algorithms whose interface to the data is limited to scalar products to employ complicated premappings of the data into feature spaces by use of kernels. Formally, a kernel is an inner-product operator K: X × X → R, where X is some arbitrary vector space. In this work, X is the space of neural activities (specifically, population spike rates). An explicit way to describe K is via a mapping φ: X → H from X to a feature space H such that K(x, x′) = φ(x) · φ(x′). Given a kernel operator, we can use it to perform various statistical learning tasks.
One such task is support vector regression (SVR) (see, e.g., Smola & Schölkopf, 2004), which attempts to find a regression function for target values that is linear if observed in the (typically very high-dimensional) feature space mapped by the kernel. We give here a brief description of SVR for clarity. SVR employs the ε-insensitive loss function (Vapnik, 1995). Formally, this loss is defined as

  |y_t − f(S_{(t−l+1):t})|_ε = max{0, |y_t − f(S_{(t−l+1):t})| − ε}.

Examples that fall within ε of the target value (y_t) do not contribute to the error. Those that fall further away contribute linearly to the loss (see Figure 1, left). The form of the regression model is

  f(S_{(t−l+1):t}) = w · φ(S_{(t−l+1):t}) + b,    (3.1)

which is linear in the feature space H and is implicitly defined by the kernel. Combined with a loss function, f defines a hyperslab of width ε centered at the estimate (see Figure 1, right, for a single-dimensional illustration).

Figure 1: Illustration of the ε-insensitive loss (left) and an ε-insensitive area around a linear regression in a single-dimensional feature space (right). Examples that fall within distance ε have zero loss (shaded area).
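The ε-insensitive loss illustrated in Figure 1 is a direct transcription of the formula above (function name is mine):

```python
import numpy as np

def eps_insensitive_loss(y, y_hat, eps=0.1):
    """|y - y_hat|_eps = max(0, |y - y_hat| - eps): zero inside the eps-tube
    around the target, growing linearly outside it."""
    return np.maximum(0.0, np.abs(y - y_hat) - eps)
```

The flat region of width 2ε is what makes the resulting SVR solution sparse: examples inside the tube exert no force on the regressor.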

For each trial i and time index t designating a frame within the trial, we received a pair (S^i_{(t−l+1):t}, y^i_t) consisting of spike rates and a corresponding target value. The optimization problem employed by SVR is

  argmin_w  ½ ||w||² + C Σ_{i,t} |y^i_t − f(S^i_{(t−l+1):t})|_ε,    (3.2)

where ||·||² is the squared vector norm. This minimization casts a trade-off (weighed by C ∈ R) between a small empirical error and a regularization term that favors low sensitivity to feature values. Let φ(x) denote the feature vector that is implicitly implemented by the kernel function K(·, x). Then there exists a regression model that is equivalent to the one given in equation 3.1, is a minimum of equation 3.2, and takes the form

  f(T) = Σ_{t,i} A_{t,i} K(S^i_{(t−l+1):t}, T) + b.

Here, T is the observed neural activity, and A is a (typically sparse) matrix of Lagrange multipliers determined by the minimization problem in equation 3.2. In summary, SVR solves a quadratic optimization problem aimed at finding a linear regressor in an induced high-dimensional feature space. This regressor is the best penalized estimate for the ε-insensitive loss function, where the penalty is the squared norm of the regressor's weights.

4 Spikernels

The quality of kernel-based learning is highly dependent on how the data are embedded in the feature space via the kernel operator. For this reason, several studies have been devoted to developing new kernels (Jaakola & Haussler, 1998; Genton, 2001; Lodhi, Shawe-Taylor, Cristianini, & Watkins, 2000). In fact, a good kernel could render the work of a classification or regression algorithm trivial. With this in mind, we develop a kernel for neural spike rate activity.


Figure 2: Illustrative examples of pattern similarities. (a) Bin-by-bin comparison yields small differences. (b) Patterns with large bin-by-bin differences that can be eliminated with some time warping. (c) Patterns whose suffix (time of interest) is similar and prefix is different.

4.1 Motivation. Our goal in developing a kernel for spike trains is to map similar patterns to nearby areas of the feature space. In the description of our kernel, we attempt to capture some well-accepted notions on similarities between spike trains. We make the following assumptions regarding similarities between spike patterns: • A commonly made assumption is that firing patterns that have small differences in a bin-by-bin comparison are similar. This is the basis of linear regression. In Figure 2a, we show an example of two patterns that are bin-wise similar. • A cortical population may display highly specific patterns to represent specific information such as a particular stimulus or hand movement.


Though in this work we make use of spike counts within each time bin, we assume that a pattern of spike counts may be specific to an external stimulus or action and that the manner in which these patterns change as a function of the behavior may be highly nonlinear (Segev & Rall, 1998). Many models of neuronal computation are based on this assumption (see, e.g., Pouget & Snyder, 2000).

• Two patterns may be quite different from a simple bin-wise perspective, yet if they are aligned using a nonlinear time distortion or shifting, the similarity becomes apparent. An illustration of such patterns is given in Figure 2b. In comparing patterns, we would like to induce a higher score when the time shifts are small. Some biological plausibility for time-warped integration of inputs may stem from the spatial distribution of synaptic inputs on cortical cells' dendrites. In order for two signals to be integrated along a dendritic tree or at the axon hillock, their sources must be appropriately distributed in space and time, and this distribution may be sensitive to the current depolarization state of the tree (Magee, 2000).

• Patterns that are associated with identical values of an external stimulus at time t may be similar at that time but rather different at t ± δ when values of the external stimulus for these patterns are no longer similar (as illustrated in Figure 2c). We would like to give a higher similarity score to patterns that are similar around a time of interest (prediction time), even if they are rather different at other times.

4.2 Spikernel Definition. We describe our kernel construction, the Spikernel, by specifying the features that make up the feature space. Our construction of the feature space builds on the work of Haussler (1999), Watkins (1999), and Lodhi et al. (2000).
We chose to directly specify the features constituting the kernel rather than describe the kernel from a functional point of view, as it provides some insight into the structure of the feature space. We first need to introduce a few more notations. Let S be a sequence of length l = len(S). The set of all possible n-long index vectors defining a subsequence of S is

  I_{n,l} = { i : i ∈ Z^n, 1 ≤ i_1 < · · · < i_n ≤ l }.

Also, let d(α, β) (α, β ∈ R^q) denote a bin-wise distance over a pair of samples (population rates). We also overload notation and denote by d(S_i, U) = Σ_{k=1}^{n} d(S_{i_k}, U_k) a distance between firing rate sequences. The sequence distance is the sum of distances over the samples. One such function (the d we explore here) is the squared ℓ2 norm. Let µ, λ ∈ (0, 1). The U component of our (infinite) feature vector φ(S) is defined as

  φ^n_U(S) = Σ_{i ∈ I_{n,len(S)}} µ^{d(S_i, U)} λ^{len(S) − i_1},    (4.1)

where i_1 is the first entry in the index vector i. In words, φ^n_U(S) is a sum


over all n-long subsequences of S. Each subsequence S_i is compared to U (the feature coordinate) and contributes to the feature value by adding a number that is proportional to the bin-wise similarity between S_i and U and is inversely proportional to the amount of stretching S_i needs to undergo. That is, the feature entry indexed by U measures how similar U is to the latter part of the time series S. This definition seems to fit our assumptions on neural coding for the following reasons:

• A feature φ^n_U(S) gets large values only if U is bin-wise similar to subsequences of the pattern S.

• It allows for learning complex patterns: choosing small values of λ and µ (or concentrated d measures) means that each feature tends toward being either 1 or 0, depending on whether the feature coordinate U is almost identical to a suffix of S.

• We allow gaps in the indexes defining subsequences, thus allowing for time warping.

• Subpatterns that begin further from the required prediction time are penalized by an exponentially decaying weight; thus, more weight is given to activities close to the prediction time (the end of the sequence), which designates our time of interest.

4.3 Efficient Kernel Calculation. The definition of φ given by equation 4.1 requires the manipulation of an infinite feature space and an iteration over all subsequences that grows exponentially with the lengths of the sequences. Straightforward calculation of the feature values and performing the induced inner product is clearly impossible. Building on ideas from Lodhi et al. (2000), who describe a similar kernel (for finite sequence alphabets and no time of interest), we developed an indirect method for evaluating the kernel through a recursion that can be performed efficiently using dynamic programming. The solution we now describe is rather general, as it employs an arbitrary base feature vector ψ. We later describe our specific design choice for ψ. The definition in equation 4.1 is a special case of the following feature definition,

  φ^n_U(S) = Σ_{i ∈ I_{n,len(S)}} [ Π_k ψ_{U_k}(S_{i_k}) ] λ^{len(S) − i_1},    (4.2)

where n specifies the length of the subsequences, and for the Spikernel, we substitute ψ_{U_k}(S_{i_k}) = µ^{d(S_{i_k}, U_k)}.


The feature vector of equation 4.2 satisfies the following recursion:

  φ^n(S) = Σ_{i=1}^{len(S)} λ^{len(S)−i+1} ψ(S_i) φ^{n−1}(S_{1:i−1})   (len(S) ≥ n > 0)
  φ^n(S) = 0   (len(S) < n, n > 0)
  φ^0(S) = 1.    (4.3)

The recursive form is derived by decomposing the sum over all subsequences into a sum over the last subsequence index and a sum over the rest of the indices. The fact that φ^n(S) is the zero vector for sequences of length less than n is due to equation 4.2. The last value of φ^n, when n = 0, is defined to be the all-one vector. The kernel that performs the dot product between two feature vectors is then defined as

  K^n(S, T) = φ^n(S) · φ^n(T) = ∫_{R^{q×n}} φ^n_U(S) φ^n_U(T) dU,

where n designates subsequences of length n as before. We now plug in the recursive feature vector definition from equation 4.3:

  K^n(S, T) = ∫_{R^{q×n}} [ Σ_{i=1}^{len(S)} λ^{len(S)−i+1} ψ_{U_n}(S_i) φ^{n−1}_{U_{1:n−1}}(S_{1:i−1}) ] × [ Σ_{j=1}^{len(T)} λ^{len(T)−j+1} ψ_{U_n}(T_j) φ^{n−1}_{U_{1:n−1}}(T_{1:j−1}) ] dU.    (4.4)

Next, we note that

  ∫_{R^{q×(n−1)}} φ^{n−1}_U(S_{1:i−1}) φ^{n−1}_U(T_{1:j−1}) dU = φ^{n−1}(S_{1:i−1}) · φ^{n−1}(T_{1:j−1}) = K^{n−1}(S_{1:i−1}, T_{1:j−1}),

and also

  ∫_{R^q} ψ_{U_k}(S_i) ψ_{U_k}(T_j) dU_k = ψ(S_i) · ψ(T_j).

Defining K*(S_i, T_j) = ψ(S_i) · ψ(T_j), we now insert the integration in equation 4.4 into the sums to get

  K^n(S, T) = Σ_{i=1}^{len(S)} Σ_{j=1}^{len(T)} λ^{len(S)+len(T)−i−j+2} (ψ(S_i) · ψ(T_j)) (φ^{n−1}(S_{1:i−1}) · φ^{n−1}(T_{1:j−1}))
            = Σ_{i=1}^{len(S)} Σ_{j=1}^{len(T)} λ^{len(S)+len(T)−i−j+2} K*(S_i, T_j) K^{n−1}(S_{1:i−1}, T_{1:j−1}).    (4.5)


We have thus derived a recursive definition of the kernel, which inherits the initial conditions from equation 4.3:

  ∀ S, T:  K^0(S, T) = 1;   K^i(S, T) = 0 if min(len(S), len(T)) < i.    (4.6)

The kernel function K*(·, ·) may be any kernel operator that operates on population spike rates sampled at single time points. To obtain the Spikernel described above, we chose

  K*(α, β) = ∫_{R^q} µ^{d(u,α)} µ^{d(u,β)} du.
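The recursion in equation 4.5 with base cases 4.6 can be evaluated by dynamic programming over sequence prefixes. A sketch, taking d to be the squared ℓ2 distance so that K*(α, β) = µ^{½||α−β||²} with the constant factor dropped (function and variable names are mine, not from the letter):

```python
import numpy as np

def spikernel(S, T, n=3, lam=0.5, mu=0.9):
    """Order-n Spikernel K^n(S, T) via dynamic programming over prefixes.

    S, T: arrays of shape (length, q) holding population spike rates.
    Uses the prefix recursion equivalent to equation 4.5,
      K^n(Ss, Tt) = lam*K^n(S, Tt) + lam*K^n(Ss, T)
                    - lam**2 * K^n(S, T) + lam**2 * Kstar(s, t) * K^(n-1)(S, T),
    with K^0 = 1 and K^i = 0 when either sequence is shorter than i.
    """
    S, T = np.asarray(S, dtype=float), np.asarray(T, dtype=float)
    ls, lt = len(S), len(T)
    # Base bin-wise kernel for the squared-L2 distance: Kstar = mu**(0.5*||a-b||^2).
    d2 = ((S[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)
    kstar = mu ** (0.5 * d2)
    K_prev = np.ones((ls + 1, lt + 1))          # K^0 over all prefix pairs
    for order in range(1, n + 1):
        K = np.zeros((ls + 1, lt + 1))          # zero where a prefix is too short
        for i in range(order, ls + 1):
            for j in range(order, lt + 1):
                K[i, j] = (lam * K[i - 1, j] + lam * K[i, j - 1]
                           - lam ** 2 * K[i - 1, j - 1]
                           + lam ** 2 * kstar[i - 1, j - 1] * K_prev[i - 1, j - 1])
        K_prev = K
    return K_prev[ls, lt]
```

For instance, with n = 1 and two identical single-sample sequences the value is λ², matching a direct evaluation of equation 4.5. The two nested loops over prefix pairs, repeated for each order, give the O(len(S) len(T) n) cost discussed next.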

A more time-efficient description of the recursive computation of the kernel may be obtained if we notice that the sums in equation 4.5 are used, up to a constant λ, in calculating the kernel for prefixes of S and T. Caching these sums yields

  K^n(Ss, Tt) = λ K^n(S, Tt) + λ K^n(Ss, T) − λ² K^n(S, T) + λ² K*(s, t) K^{n−1}(S, T),

where s and t are the last samples in the sequences (and the initial conditions of equation 4.6 still hold). Calculating K^n(S, T) using the above recursion is computationally cheaper than equation 4.5 and comprises constructing a three-dimensional array, yielding an O(len(S) len(T) n) dynamic programming procedure.

4.4 Spikernel Variants. The kernels defined by equation 4.1 consider only patterns of fixed length (n). It makes sense to look at subsequences of various lengths. Since a linear combination of kernels is also a kernel, we can define our kernel to be

  K(S, T) = Σ_{i=1}^{n} p_i K^i(S, T),

where p ∈ R^n_+ is a vector of weights. The weighted kernel summation can be interpreted as performing the dot product between vectors that take the form [√p_1 φ^1(·), . . . , √p_n φ^n(·)], where φ^i(·) is the feature vector that kernel K^i(·, ·) represents. In the results, we use this definition with p_i = 1 (a better weighing was not explored). Different choices of K*(α, β) allow different methods of comparing population rate values (once the time bins for comparison were chosen). We chose d(α, β) to be the squared ℓ2 norm: ||α − β||²_2 = Σ_{k=1}^{q} (α_k − β_k)². The kernel K*(·, ·) then becomes

  K*(α, β) = c µ^{½ ||α−β||²_2},

Spikernels

with c = (π/(−2 ln µ))^{q/2}; however, c can be (and was) set to 1 without loss of generality. Note that K∗(α, β) can be written as K∗(α, β) = e^{−(ln(1/µ)/2)‖α−β‖²₂} and is therefore a simple Gaussian. It is also worth noticing that if λ ≈ 0, then

Kn(S, T) → c e^{−(ln(1/µ)/2)‖S−T‖²₂}.
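The recursion above translates directly into an O(len(S)·len(T)·n) dynamic program. The sketch below is an illustrative implementation, not the authors' code; it assumes K₀ ≡ 1 and Kᵢ = 0 whenever a prefix is shorter than i (standing in for the initial conditions of equation 4.6, which is not reproduced here), and uses the Gaussian base kernel with c = 1.

```python
import numpy as np

def k_star(a, b, mu):
    # Gaussian base kernel with c = 1: exp(-(ln(1/mu)/2) * ||a - b||^2)
    return np.exp(0.5 * np.log(mu) * np.sum((a - b) ** 2))

def spikernel(S, T, n, mu, lam):
    """K_n(S, T) by dynamic programming over prefixes.

    S, T: arrays of shape (length, q) holding binned population spike counts.
    Assumed initial conditions: K_0 = 1 for all prefixes, and K_i = 0 when
    either prefix is shorter than i.
    """
    ls, lt = len(S), len(T)
    K = np.zeros((n + 1, ls + 1, lt + 1))  # K[i, a, b] over S[:a], T[:b]
    K[0] = 1.0
    for i in range(1, n + 1):
        for a in range(i, ls + 1):
            for b in range(i, lt + 1):
                K[i, a, b] = (lam * K[i, a - 1, b]
                              + lam * K[i, a, b - 1]
                              - lam ** 2 * K[i, a - 1, b - 1]
                              + lam ** 2 * k_star(S[a - 1], T[b - 1], mu)
                                * K[i - 1, a - 1, b - 1])
    return K[n, ls, lt]
```

The Spikernel variant of section 4.4, K(S, T) = Σᵢ pᵢ Kᵢ(S, T) with pᵢ = 1, can be read off the same table by summing the entries K[i, len(S), len(T)] over i = 1, . . . , n.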

That is, in this limit, no time warping is allowed, and the Spikernel is reduced to the standard exponential kernel (up to a constant c).

5 Experimental Method

5.1 Data Collection. The data used in this work were recorded from the primary motor cortex of a rhesus (Macaca mulatta) monkey (approximately 4.5 kg). The animal’s care and surgical procedures accorded with the NIH Guide for the Care and Use of Laboratory Animals (rev. 1996) and with the Hebrew University guidelines supervised by the institutional committee for animal care and use. The monkey sat in a dark chamber, and eight electrodes were introduced into each hemisphere. The electrode signals were amplified, filtered, and sorted (MCP-PLUS, MSD, Alpha-Omega, Nazareth, Israel). The data used in this report were recorded on four different days during a one-month period of daily sessions. Up to 16 microelectrodes were inserted on each day. Recordings include spike times of single units (isolated by signal fit to a series of windows) and multiunits (detection by threshold crossing). The monkey used two planar-movement manipulanda to control two cursors (× and + shapes) on the screen to perform a center-out reaching task. Each trial began when the monkey centered both cursors on a central circle for 1.0 to 1.5 s. Either cursor could turn green, indicating the hand to be used in the trial (× for the right arm and + for the left). Then, after an additional hold period of 1.0 to 1.5 s, one of eight targets (0.8 cm diameter) appeared at a distance of 4 cm from the origin. After another 0.7 to 1.0 s, the center circle disappeared (referred to as the Go Signal), and the monkey had to move and reach the target in less than 2 s to receive a liquid reward. We obtained data from all channels that exhibited spiking activity (i.e., were not silent) during at least 90% of the trials. The number and types of recorded units and the number of trials used in each of the recording days are described in Table 1.
At the end of each session, we examined the activity of neurons evoked by passive manipulation of the limbs and applied intracortical microstimulation (ICMS) to evoke movements. The data presented here were recorded in penetration sites where ICMS evoked shoulder and elbow movements. Identification of the recording area was also aided by MRI. More information can be found in Paz, Boraud, Natan, Bergman, and Vaadia (2003).


L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia

Table 1: Summary of the Data Sets Obtained.

Data Set   Number of      Number of Multiunit   Mean Spike Rate   Number of           Net Cumulative
           Single Units   Channels              (spikes/sec)      Successful Trials   Movement Time (sec)
1          27             16                    8.6               317                 529
2          22             9                     11.6              333                 550
3          27             10                    8.0               155                 308
4          25             13                    6.5               307                 509

The results that we present here refer to prediction of instantaneous hand velocities (in 2D) during the movement time (from Go Signal to Target Reach times) of both hands in successful trials. Note that some of the trials required movement of the left hand while keeping the right hand steady, and vice versa. Therefore, although we considered only movement periods of the trials, we had to predict both movement and nonmovement for each hand.

5.2 Data Preprocessing and Modeling. The recordings of spike times were made at 1 ms precision, and hand positions were recorded at 500 Hz. Our kernel method is suited for use with windows of observed neural activity as time series and with the movement values at the ends of those windows. We therefore chose to partition spike trains into bins, count the spikes in each bin, and use a running window of such spike counts as our data, with the hand velocities at the end of the window as the label. Thus, a labeled example (St, vt) for time t consisted of the X (left-right) or Y (forward-backward) velocity of the left or right hand as the target label vt and the preceding window of spike counts from all q cortical units as the input sequence St. In one such preprocessing, we chose a running window of 10 bins of size 100 ms each (a 1-second-long window), and the time interval between two such windows was 100 ms. Two such consecutive examples would then have nine time bins of overlap. For example, the number of cortical units q in the first data set was 43 (27 single units + 16 multiunit channels), and the total length of all the trials used in that data set is 529 seconds. Hence, in that session, with the above preprocessing, there are 5290 consecutive examples, where each is a 43 × 10 matrix of spike counts, and there are 4 sequences of movement velocities (X/Y and R/L hand) of the same length. Before we could train an algorithm, we had to specify the time bin size, the number of bins, and the time interval between two observation windows.
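This windowing preprocessing can be sketched as follows. The function and variable names are ours, not from the text; the sketch handles one velocity component and assumes velocity samples on a sorted time grid.

```python
import numpy as np

def make_examples(spike_times, velocities, vel_times,
                  bin_size=0.1, n_bins=10, step=0.1):
    """Build (S_t, v_t) examples: a running window of per-unit spike
    counts, labeled with the hand velocity at the window's end.

    spike_times: list of arrays, one per cortical unit (seconds).
    velocities:  one velocity component; vel_times: its timestamps.
    """
    t_end = max(t[-1] for t in spike_times if len(t))
    edges = np.arange(0.0, t_end + bin_size, bin_size)
    # counts[unit, bin] = number of spikes of that unit in the bin
    counts = np.stack([np.histogram(t, bins=edges)[0] for t in spike_times])
    step_bins = int(round(step / bin_size))
    X, y = [], []
    for end in range(n_bins, counts.shape[1] + 1, step_bins):
        window = counts[:, end - n_bins:end]   # q x n_bins matrix of counts
        t_label = edges[end]                   # time at the window's end
        X.append(window)
        y.append(np.interp(t_label, vel_times, velocities))
    return np.array(X), np.array(y)
```

With bin_size = step = 100 ms and n_bins = 10, consecutive examples overlap by nine bins, as in the text.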
Furthermore, each kernel employs a few parameters (for the Spikernel, they are λ, µ, n, and pi , though we chose pi = 1 a priori) and the SVM regression setup requires setting two more parameters, ε and C. In order to test the algorithms, we first had to establish which parameters work best. We used the first half of data set 1 in fivefold cross-validation to choose the best


Table 2: Parameter Values Tried for Each Kernel.

Spikernel:                  λ ∈ {0.5, 0.7, 0.9, 0.99}, µ ∈ {0.7, 0.9, 0.99, 0.999}, n ∈ {3, 5, 10}, pi = 1, C = 0.1, ε = 0.1
Exponential:                γ ∈ {0.1, 0.01, 0.001, 0.005, 0.0001}, C ∈ {1, 5, 10}, ε = 0.1
Polynomial, second degree:  C ∈ {0.001, 10−4, 10−5}, ε = 0.1
Polynomial, third degree:   C ∈ {10−4, 10−5, 10−6}, ε = 0.1
Linear:                     C ∈ {100, 10, 1, 0.1, 0.01, 0.001, 10−4}, ε = 0.1

Notes: Not all combinations were tried for the Spikernel. n was chosen to be at most (number of bins)/2, and some peripheral combinations of µ and λ were not tried either.

preprocessing and kernel parameters by trying a wide variety of values for each parameter, picking the combinations that produced the best predictions on the validation sets. For all kernels, the prediction quality as a function of the parameters was single peaked, and only that set of parameters was chosen per kernel for testing on the rest of the data. Having tuned the parameters, we then used the rest of data set 1 and the other three data sets to learn and predict the movement velocities. Again, we employed fivefold cross-validation on each data set to obtain accuracy results on all the examples. The fivefold cross-validation was produced by randomly splitting the trials into five groups: four of the five groups were used for training, and the rest of the data was used for evaluation. This process was repeated five times by using each fifth of the data once as a test set. The kernels that we tested are the Spikernel, the exponential kernel K(S, T) = e^{−γ‖S−T‖²}, the homogeneous polynomial kernel K(S, T) = (S · T)^d, d = 2, 3, and the standard scalar product kernel K(S, T) = S · T, which boils down to a linear regression. In our initial work, we also tested polynomial kernels of higher order and nonhomogeneous polynomial kernels, but as these kernels produced worse results than those we describe here, we did not pursue further testing. During parameter fitting (on the first half of data set 1), we tried all combinations of bin sizes {50 ms, 100 ms, 200 ms} and numbers of bins {5, 10, 20}, except for combinations of bin size and number of bins resulting in windows of length 2 s or more (since full information regarding the required movement was not available to the monkey 2 s before the Go Signal, and therefore neural activity at that time was probably not informative). The time interval between two consecutive examples was set to 100 ms. For each preprocessing, we then tried each kernel with the parameter values described in Table 2.
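The splitting scheme can be sketched as follows (names illustrative). Note that the split is by whole trials, so overlapping windows from a single trial never straddle the train and test sets.

```python
import numpy as np

def five_fold_trial_cv(n_trials, seed=0):
    """Fivefold cross-validation as described: trial indices are randomly
    split into five groups, and each group serves once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_trials), 5)
    for i in range(5):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]
```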
6 Results

Prediction samples of the Spikernel and linear regression are shown in Figure 3. The samples are of consecutive trials taken from data set 4 without


Figure 3: Sample of actual movement, Spikernel prediction and linear regression prediction of left-hand velocity in data set 4 in a few trials: (top) vy , (bottom) vx (both normalized). Only intervals between Go Signal and Target Reach are shown (separated by vertical lines). Actual movement in thick gray, Spikernel in thick black, and linear regression in thin black. Time axis is seconds of accumulated prediction intervals from the beginning of the session.

prior inspection of their quality. We show the Spikernel and linear regression as prediction quality extremes. Both methods seem to underestimate the peak values and are somewhat noisy when there is no movement. The Spikernel prediction is better in terms of mean absolute error. It is also smoother than the linear regression. This may be due to the time warping quality of the features employed. We computed the correlation coefficient (r2 ) between the recorded and predicted velocities per fold for each kernel. The results are shown in Figure 4 as scatter plots. Each circle compares the Spikernel correlation coefficient scores to those of one of the other kernels. There are five folds for each of the four data sets and four correlation coefficients for each fold (one for each movement direction and hand), making up 80 scores (and circles in the plots) for each of the kernels compared with the Spikernel. Results above the diagonal are cases where the Spikernel outperformed. Note that though the correlation coefficient is informative and considered a standard performance indicator, the SVR algorithm minimizes the ε-insensitive loss. The results in terms of this loss are qualitatively the same (i.e., the kernels are ranked by this measure in the same manner), and we therefore omit them.
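For reference, the per-fold score can be computed as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def r2_score_cc(actual, predicted):
    """Squared Pearson correlation coefficient between the recorded and
    predicted velocity traces of one fold."""
    r = np.corrcoef(actual, predicted)[0, 1]
    return r ** 2
```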


Figure 4: Correlation coefficient comparisons of the Spikernel versus other kernels. Each scatter plot compares the Spikernel to one of the other kernels. Each circle shows the correlation coefficient values obtained by the Spikernel and by the other kernel in one fold of one of the data sets for one of the two axes of movement. (a) Compares the Spikernel to the exponential kernel. (b) Compares with the second-degree polynomial kernel. (c) Compares with the third-degree polynomial kernel. (d) Compares with linear regression. Results above the diagonals are cases where the Spikernel outperformed.


Table 3: Chosen Preprocessing Values, Parameter Values, and Mean Correlation Coefficient Across Folds for Each Kernel in Each Data Set.

Kernel                      Preprocessing    Kernel Parameters   Day 1            Day 2            Day 3            Day 4            Total Mean
                                                                 LhX RhX LhY RhY  LhX RhX LhY RhY  LhX RhX LhY RhY  LhX RhX LhY RhY
Spikernel                   10 bins 100 ms   µ = .99, λ = .7,    .71 .79 .74 .63  .77 .63 .87 .79  .75 .70 .67 .66  .69 .66 .79 .71  .72
                                             n = 5, C = 1
Exponential                 5 bins 200 ms    γ = .01, C = 5      .67 .76 .69 .45  .73 .57 .85 .74  .71 .67 .56 .61  .66 .63 .74 .65  .67
Polynomial, second degree   10 bins 100 ms   C = 10−4            .67 .76 .70 .51  .72 .54 .85 .74  .70 .68 .61 .63  .65 .62 .76 .68  .68
Polynomial, third degree    10 bins 100 ms   C = 10−5            .65 .72 .66 .51  .69 .52 .84 .71  .67 .68 .61 .62  .63 .61 .75 .64  .66
Linear (dot product)        10 bins 100 ms   C = 0.01            .54 .65 .67 .35  .62 .34 .75 .57  .66 .64 .50 .57  .51 .51 .70 .62  .57

Notes: Within each day, the four values are LhX, RhX, LhY, RhY (left/right hand, X/Y velocity). Total means (across data sets, folds, hands, and movement directions) are computed for each kernel. The Spikernel outperforms the rest, and the nonlinear kernels outperform the linear regression in all data sets.

Table 3 shows the best bin sizes, number of bins, and kernel parameters that were found on the first half of data set 1 and the mean correlation coefficients obtained with those parameters on the rest of the data. Some of the learning problems were easier than others (we observed larger differences between data sets than between folds of the same data set). Contrary to our preliminary findings (Shpigelman, Singer, Paz, & Vaadia, 2003), there was no significantly easier axis to learn. The mean scores define a ranking order on the kernels. The Spikernel produced the best mean results, followed by the polynomial second-degree kernel, the exponential kernel, the polynomial third-degree kernel, and finally, the linear regression (dot product kernel). To determine how consistent the ranking of the kernels is, we computed the difference in correlation coefficient scores between each pair of kernels in each fold and determined the 95% confidence interval of these differences. The Spikernel is significantly better than the other kernels and the linear regression (the mean difference is larger than the confidence interval). The other nonlinear kernels (exponential and polynomial second and third degree) are also significantly better than linear regression, but the differences between these kernels are not significant. For all kernels, windows of 1 s were chosen over windows of 0.5 s as best preprocessing (based on the parameter training data). However, within the 1 s window, different bin sizes were optimal for the different kernels. Specifically, for the exponential kernel, 5 bins of size 200 ms, were best and for the rest, 10 bins of 100 ms were best. Any processing of the data (such as time binning or PCA) can only cause loss of information (Cover & Thomas, 1991)


Table 4: Mean Correlation Coefficients (Over All Data) for the Spikernel with Different Combinations of Bin Sizes and Number of Bins.

Number of Bins   50 ms   100 ms   200 ms
5 bins           .52     .65      .68
10 bins          .66     .72      —
20 bins          .66     —        —

Note: The best fit to the data is still 10 bins of 100 ms. (— marks combinations excluded for yielding windows of 2 s or more.)

but may aid an algorithm. The question of the time resolution of cells is not answered here, but for the purpose of linear regression and SVR (with the kernels tested) on our MI data, it seems that binning was necessary. In many cases, a poststimulus time histogram, averaged over many trials, is used to get an approximation of firing rate that is accurate in both value and position in time (assuming that one can replicate the same experimental condition many times). In single-trial reconstruction, this approach cannot be used, and the trade-off between time precision and accurate approximation of rate must be balanced. To better understand what the right bin size is, we retested the Spikernel with bin sizes of 50 ms, 100 ms, and 200 ms and numbers of bins 5, 10, and 20, leaving out combinations resulting in windows of 2 seconds or more. The mean correlation coefficients (over all the data) are shown in Table 4. The 10 bins by 100 ms configuration is optimal for the whole data as well as for the parameter fitting step. Note that since the average spike count per second was less than 10, the average spike count in bins of 50 ms to 200 ms was 0.5 to 2 spikes. In this respect, the data supplied to the algorithms are on the border between a rate and a spike train (with crude resolution). Run-time issues were not the focus of this study, and little effort was made to make the code implementation as efficient as possible. Run time for the exponential and polynomial kernels (prediction time) was close to real time (the time it took for the monkey to make the movements) on a Pentium 4, 2.4 GHz CPU. Run time for the Spikernel is approximately 100 times longer than for the exponential kernel. Thus, real-time use is currently impossible. The main contribution of the Spikernel is the allowance of time warping.
To better understand the effect of λ on Spikernel prediction quality, we retested the Spikernel on all of the data, using 10 bins of size 100 ms (n = 5, µ = 0.99) and λ ∈ {0.5, 0.7, 0.9}. The mean correlation coefficients were {0.68, 0.72, 0.66}, respectively. We also retested the exponential kernel with γ = 0.005 on all the data with 5 bins of size 100 ms. This would correspond to a Spikernel with its previous parameters (µ = 0.99, n = 5) but no time warping. The correlation coefficient achieved was 0.64. Again, the relevance of time warping to neural activity cannot be directly addressed by our method. However, we can safely conclude that consideration of time-warped neural activity (in the Spikernel sense) does improve prediction. We believe that it would make sense to assume that this is relevant to cells as well, as mentioned in section 4.1, because the 3D structure of a cell’s dendritic tree can align the integration of differently placed inputs from different times.

7 Conclusion

In this article, we described an approach based on recent advances in kernel-based learning for predicting response variables from neural activities. On the data we collected, all the kernels we devised outperform the standard scalar product that is used in linear regression. Furthermore, the Spikernel, a biologically motivated kernel operator, consistently outperforms the other kernels. The main contribution of this kernel is the allowance of time warping when comparing two sequences of population spike counts. Our current research is focused in two directions. First, we are investigating adaptations of the Spikernel to other neural activities, such as local field potentials, and the development of kernels for spike trains. Our second goal is to devise statistical learning algorithms that use the Spikernel as part of a dynamical system that may incorporate biofeedback. We believe that such extensions are important and necessary steps toward operational neural prostheses.

Acknowledgments

We thank the anonymous reviewers for their constructive criticism. This study was partly supported by a Center of Excellence grant (8006/00) administrated by the ISF, by BMBF-DIP, and by the United States–Israel Binational Science Foundation. L.S. is supported by a Horowitz fellowship.

References

Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 1899–1907.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998).
A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411–7425.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience, 2, 664–670.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.


Donchin, O., Gribova, A., Steinberg, O., Bergman, H., & Vaadia, E. (1998). Primary motor cortex is involved in bimanual coordination. Nature, 395, 274–278.
Fu, Q. G., Flament, D., Coltz, J. D., & Ebner, T. J. (1997). Relationship of cerebellar Purkinje cell simple spike discharge to movement kinematics in the monkey. Journal of Neurophysiology, 78, 478–491.
Genton, M. G. (2001). Classes of kernels for machine learning: A statistical perspective. Journal of Machine Learning Research, 2, 299–312.
Georgopoulos, A. P., Kalaska, J., & Massey, J. (1983). Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor cortical populations. Experimental Brain Research (Supp.), 7, 327–336.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. Journal of Neuroscience, 8, 2913–2947.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No. UCSC-CRL-99-10). Santa Cruz, CA: University of California.
Isaacs, R. E., Weber, D. J., & Schwartz, A. B. (2000). Work toward real-time control of a cortical neural prosthesis. IEEE Trans. Rehabil. Eng., 8, 196–198.
Jaakkola, T. S., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Laubach, M., Wessberg, J., & Nicolelis, M. A. (2000). Cortical ensemble activity increasingly predicts behavior outcomes during learning of a motor task. Nature, 405(1), 141–154.
Laubach, M., Shuler, M., & Nicolelis, M. A. (1999). Independent component analyses for quantifying neuronal ensemble interactions. J. Neurosci. Methods, 94, 141–154.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. J. C. H. (2000).
Text classification using string kernels. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Magee, J. (2000). Dendritic integration of excitatory synaptic input. Nat. Rev. Neuroscience, 1, 181–190.
Moran, D. W., & Schwartz, A. B. (1999). Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology, 82, 2676–2692.
Nicolelis, M. A. (2001). Actions from thoughts. Nature, 409(18), 403–407.
Nicolelis, M. A., Ghazanfar, A. A., Faggin, B. M., Votaw, S., & Oliveira, L. M. (1997). Reconstructing the engram: Simultaneous, multi-site, many single neuron recordings. Neuron, 18, 529–537.
Paz, R., Boraud, T., Natan, C., Bergman, H., & Vaadia, E. (2003). Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8), 882–890.
Pouget, A., & Snyder, L. (2000). Computational approaches to sensorimotor transformations. Nature Neuroscience, 8, 1192–1198.


Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1952.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–542.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines: Earlier theoretical insights elucidate recent direct observations. TINS, 21, 453–459.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416, 141–142.
Shpigelman, L., Singer, Y., Paz, R., & Vaadia, E. (2003). Spikernels: Embedding spiking neurons in inner product spaces. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.
Taylor, D. M., Tillery, S. I. H., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 296, 1829–1832.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515–518.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Watkins, C. (1999). Dynamic alignment kernels. In P. Bartlett, B. Schölkopf, D. Schuurmans, & A. Smola (Eds.), Advances in large margin classifiers (pp. 39–50). Cambridge, MA: MIT Press.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(16), 361–365.

Received August 8, 2003; accepted August 5, 2004.

2.2 Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Amir Navot¹,², Lavi Shpigelman¹,², Naftali Tishby¹,², Eilon Vaadia²,³

¹ School of Computer Science and Engineering
² Interdisciplinary Center for Neural Computation
³ Dept. of Physiology, Hadassah Medical School
The Hebrew University, Jerusalem, 91904, Israel

Email for correspondence: {anavot,shpigi}@cs.huji.ac.il

Abstract

We present a simple yet effective non-linear feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in the motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.

1 Introduction

In many supervised learning tasks the input is represented by a very large number of features, many of which are not needed for predicting the labels. Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Feature selection reduces the computational complexity of learning and prediction algorithms and saves on the cost of measuring non-selected features. In many situations, feature selection can also enhance the prediction accuracy by improving the signal-to-noise ratio. Another benefit of feature selection is that the identity of the selected features can provide insights into the nature of the problem at hand. Therefore, feature selection is an important step in efficient learning of large multi-featured data sets.

Feature selection (variously known as subset selection, attribute selection or variable selection) has been studied extensively both in statistics and by the machine learning community over the last few decades. In the most common selection paradigm, an evaluation function is used to assign scores to subsets of features, and a search algorithm is used to search for a subset with a high score. The evaluation function can be based on the performance of a specific predictor (wrapper model, [1]) or on some general (typically cheaper to compute) relevance measure of the features to the prediction (filter model). In any case, an exhaustive search over all feature sets is generally intractable due to the exponentially large number of possible sets. Therefore, search methods are employed which apply a variety of heuristics, such as hill climbing and genetic algorithms. Other methods simply rank individual features, assigning a score to each feature independently. These methods are usually very fast,

but inevitably fail in situations where only a combined set of features is predictive of the target function. See [2] for a comprehensive overview of feature selection and [3] for a discussion of selection methods for linear regression. A possible choice of evaluation function is the leave-one-out (LOO) mean square error (MSE) of the k-Nearest-Neighbor (kNN) estimator ([4, 5]). This evaluation function has the advantage that it both gives a good approximation of the expected generalization error and can be computed quickly. [6] used this criterion on small synthetic problems (up to 12 features). They searched for good subsets using forward selection, backward elimination and an algorithm (called schemata) that races feature sets against each other (eliminating poor sets, keeping the fittest) in order to find a subset with a good score. All these algorithms perform a local search by flipping one or more features at a time. Since the space is discrete, the direction of improvement is found by trial and error, which slows the search and makes it impractical for large-scale real-world problems involving many features.

In this paper we develop a novel selection algorithm. We extend the LOO-kNN-MSE evaluation function to assign scores to weight vectors over the features, instead of just to feature subsets. This results in a smooth (“almost everywhere”) function over a continuous domain, which allows us to compute the gradient analytically and to employ stochastic gradient ascent to find a locally optimal weight vector. The resulting weights provide a ranking of the features, which we can then threshold in order to produce a subset. In this way we can apply an easy-to-compute, gradient-directed search, without relearning a regression model at each step, while employing a strong non-linear function estimator (kNN) that can capture complex dependency of the function on its features¹.
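The search just described, scoring weight vectors and following the gradient, can be sketched as follows. This is an illustration only: it uses a numerical two-sided difference in place of the analytic gradient, and all names are ours.

```python
import numpy as np

def gradient_weight_search(e, n_features, n_steps=200, lr=0.05,
                           eps=1e-4, seed=0):
    """Stochastic gradient ascent on an evaluation function e(w) over
    feature weights. `e` is any smooth score of a weight vector, e.g.
    the negative LOO error of a weighted kNN estimator."""
    rng = np.random.default_rng(seed)
    w = np.ones(n_features)
    for _ in range(n_steps):
        i = rng.integers(n_features)      # stochastic: one coordinate per step
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        w[i] += lr * (e(wp) - e(wm)) / (2 * eps)
    return w   # rank features by |w|; threshold to obtain a subset
```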
Our motivation for developing this method is to address a major computational neuroscience question: which features of the neural code are relevant to the observed behavior? This is an important element of enabling interpretability of neural activity, and feature selection is a promising tool for this task. Here, we apply our feature selection method to the task of reconstructing hand movements from neural activity, which is one of the main challenges in implementing brain computer interfaces [8]. We look at neural population spike counts, recorded in the motor cortex of a monkey while it performed hand movements, and locate the most informative subset of neural features. We show that it is possible to improve prediction results by wisely selecting a subset of cortical units and their time lags relative to the movement. Our algorithm, which considers feature subsets, outperforms methods that consider features on an individual basis, suggesting that complex dependency on a set of features exists in the code. The remainder of the paper is organized as follows: we describe the problem setting in section 2. Our method is presented in section 3. Next, we demonstrate its ability to cope with a complicated dependency of the target function on groups of features using synthetic data (section 4). The results of applying our method to the hand movement reconstruction problem are presented in section 5.

2 Problem Setting

First, let us introduce some notation. Vectors in Rⁿ are denoted by boldface small letters (e.g., x, w). Scalars are denoted by small letters (e.g., x, y). The i’th element of a vector x is denoted by xᵢ. Let f(x), f : Rⁿ → R, be a function that we wish to estimate. Given a set S ⊂ Rⁿ, the empirical mean square error (MSE) of an estimator f̂ for f is defined as

MSE_S(f̂) = (1/|S|) Σ_{x∈S} (f(x) − f̂(x))².

1 The design of this algorithm was inspired by work done by Gilad-Bachrach et al. ([7]) which used a large margin based evaluation function to derive feature selection algorithms for classification.

kNN Regression. k-Nearest-Neighbor (kNN) is a simple, intuitive and efficient way to estimate the value of an unknown function at a given point using its values at other (training) points. Let S = {x₁, . . . , xₘ} be a set of training points. The kNN estimator is defined as the mean function value of the nearest neighbors:

f̂(x) = (1/k) Σ_{x′∈N(x)} f(x′),

where N(x) ⊂ S is the set of the k nearest points to x in S and k is a parameter ([4, 5]). A softer version takes a weighted average, where the weight of each neighbor is proportional to its proximity. One specific way of doing this is

f̂(x) = (1/Z) Σ_{x′∈N(x)} f(x′) e^{−d(x,x′)/β},     (1)

where d(x, x′) = ‖x − x′‖²₂ is the squared ℓ₂ distance, Z = Σ_{x′∈N(x)} e^{−d(x,x′)/β} is a normalization factor, and β is a parameter. The soft kNN version will be used in the remainder of this paper. This regression method is a special form of locally weighted regression (see [5] for an overview of the literature on this subject). It has the desirable property that no learning (other than storage of the training set) is required for the regression. Also note that the Gaussian radial basis function has the form of a kernel ([9]) and can be replaced with any operator on two data points that decays as a function of the difference between them (e.g., kernel-induced distances). As will be seen in the next section, we use the MSE of a modified kNN regressor to guide the search for a set of features F ⊂ {1, . . . , n} that achieves a low MSE. However, the MSE and the Gaussian kernel can be replaced by other loss measures and kernels (respectively) as long as they are differentiable almost everywhere.
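Equation 1 has a direct rendering in code (a sketch with illustrative names):

```python
import numpy as np

def soft_knn(x, X_train, y_train, k=3, beta=1.0):
    """Soft kNN regression (equation 1): a distance-weighted mean of the
    k nearest training labels, with Gaussian weights exp(-d/beta)."""
    d = np.sum((X_train - x) ** 2, axis=1)   # squared l2 distances to x
    nn = np.argsort(d)[:k]                   # indices of the k nearest points
    g = np.exp(-d[nn] / beta)
    return np.dot(g, y_train[nn]) / g.sum()  # Z = g.sum()
```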

3 The Feature Selection Algorithm

In this section we present our selection algorithm, called RGS (Regression, Gradient guided, feature Selection). It can be seen as a filter method for general regression algorithms or as a wrapper for estimation by the kNN algorithm. Our goal is to find subsets of features that induce a small estimation error. As in most supervised learning problems, we wish to find subsets that induce a small generalization error, but since it is not known, we use an evaluation function on the training set. This evaluation function is defined not only for subsets but for any weight vector over the features. This is more general because a feature subset can be represented by a binary weight vector that assigns a value of one to features in the set and zero to the rest of the features. For a given weight vector over the features w ∈ R^n, we consider the weighted squared ℓ₂ norm induced by w, defined as ‖z‖²_w = Σ_i z_i² w_i². Given a training set S, we denote by f̂_w(x) the value assigned to x by a weighted kNN estimator, defined in equation 1, using the weighted squared ℓ₂ norm as the distance d(x, x′), where the nearest neighbors are found among the points of S excluding x. The evaluation function is defined as the negative (halved) squared error of the weighted kNN estimator:

e(w) = −(1/2) Σ_{x∈S} ( f(x) − f̂_w(x) )² .    (2)

This evaluation function scores weight vectors (w). A change of weights will cause a change in the distances and, possibly, in the identity of each point's nearest neighbors, which will change the function estimates. A weight vector that induces a distance measure in which neighbors have similar labels receives a high score. The mean factor 1/|S| is replaced with 1/2 to ease later differentiation. Note that there is no explicit regularization term in e(w). This is justified by the fact that, for each point, the estimate of its function value does not include that point as part of the training set. Thus, equation 2 is a leave-one-out cross validation error. Clearly, it is impossible to go over all the weight vectors (or even over all the feature subsets), and therefore some search technique is required.
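Equation 2 can be computed directly (a minimal sketch with our own naming): for each training point we estimate its value from its weighted-distance neighbors, excluding the point itself, and accumulate the negative halved squared error:

```python
import numpy as np

def eval_weights(w, X, y, k=5, beta=1.0):
    """Leave-one-out evaluation e(w) of equation 2 for a weighted soft kNN estimator.
    Distances are the w-weighted squared l2 norm ||x - x'||_w^2."""
    err = 0.0
    for i in range(len(X)):
        d = np.sum((w ** 2) * (X - X[i]) ** 2, axis=1)  # weighted squared distances
        d[i] = np.inf                                   # leave the point itself out
        nn = np.argsort(d)[:k]
        g = np.exp(-d[nn] / beta)
        err += (y[i] - np.dot(g, y[nn]) / g.sum()) ** 2
    return -0.5 * err
```

A weight vector that down-weights irrelevant features groups points with similar labels and therefore scores higher (closer to zero).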

Algorithm 1 RGS(S, k, β, T)
1. initialize w = (1, 1, . . . , 1)
2. for t = 1 . . . T
   (a) pick an instance x at random from S
   (b) calculate the gradient of e(w) at x:
       ∇e(w) = −( f(x) − f̂_w(x) ) ∇_w f̂_w(x) ,
       ∇_w f̂_w(x) = −(4/β) [ Σ_{x″,x′∈N(x)} f(x″) a(x′, x″) u(x′, x″) ] / [ Σ_{x″,x′∈N(x)} a(x′, x″) ] ,
       where a(x′, x″) = e^{−(‖x−x′‖²_w + ‖x−x″‖²_w)/β} and u(x′, x″) ∈ R^n is a vector with u_i = w_i [ (x_i − x′_i)² + (x_i − x″_i)² ]
   (c) w ← w + η_t ∇e(w), where η_t is a step-size decay factor

Our method finds a weight vector w that locally maximizes e(w) as defined in (2) and then uses a threshold in order to obtain a feature subset. The threshold can be set either by cross validation or by finding a natural cutoff in the weight values. However, we later show that using the distance measure induced by w in the regression stage compensates for taking too many features. Since e(w) is defined over a continuous domain and is smooth almost everywhere, we can use gradient ascent to maximize it. RGS (algorithm 1) is a stochastic gradient ascent over e(w). In each step the gradient is evaluated using one sample point and is added to the current weight vector. RGS considers the weights of all the features at the same time and thus it can handle dependency on a group of features. This is demonstrated in section 4. In this respect, it is superior to selection algorithms that score each feature independently. It is also faster than methods that try to find a good subset directly by trial and error. Note, however, that convergence to a global optimum is not guaranteed, and standard techniques to avoid local optima can be used. The parameters of the algorithm are k (number of neighbors), β (Gaussian decay factor), T (number of iterations) and {η_t}_{t=1}^T (step-size decay scheme). The value of k can be tuned by cross validation; however, a proper choice of β can compensate for a k that is too large. It makes sense to tune β to a value that places most neighbors in an active zone of the Gaussian. In our experiments, we set β to half of the mean distance between points and their k neighbors. It usually makes sense to use an η_t that decays over time to ensure convergence; however, on our data, convergence was also achieved with η_t = 1. The computational complexity of RGS is Θ(TNm), where T is the number of iterations, N is the number of features and m is the size of the training set S.
This is correct for a naive implementation which finds the nearest neighbors and their distances from scratch at each step by measuring the distances between the current point and all the other points. RGS is basically an online method which can be used in batch mode by running it in epochs on the training set. When it is run for only one epoch, T = m and the complexity is Θ(m²N). Matlab code for this algorithm (and those that we compare with) is available at http://www.cs.huji.ac.il/labs/learning/code/fsr/
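A compact sketch of the RGS loop (our own Python, not the published Matlab code): for brevity it estimates the per-sample gradient of e(w) by central finite differences instead of the closed-form expression of algorithm 1, which is valid almost everywhere since e(w) is smooth almost everywhere:

```python
import numpy as np

def rgs(X, y, k=5, beta=1.0, T=100, eta=0.1, h=1e-4, seed=0):
    """Stochastic gradient ascent on e(w), using a numerical per-sample gradient."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    w = np.ones(n)                            # step 1: uniform initial weights

    def sample_score(w, i):
        # negative halved squared LOO error of the weighted kNN estimate at point i
        d = np.sum((w ** 2) * (X - X[i]) ** 2, axis=1)
        d[i] = np.inf
        nn = np.argsort(d)[:k]
        g = np.exp(-d[nn] / beta)
        return -0.5 * (y[i] - g @ y[nn] / g.sum()) ** 2

    for _ in range(T):
        i = rng.integers(len(X))              # step 2(a): pick a random instance
        grad = np.zeros(n)
        for j in range(n):                    # step 2(b): finite-difference gradient
            e = np.zeros(n); e[j] = h
            grad[j] = (sample_score(w + e, i) - sample_score(w - e, i)) / (2 * h)
        w += eta * grad                       # step 2(c): ascent step
    return w
```

With the closed-form gradient each iteration costs Θ(Nm), matching the Θ(TNm) complexity stated above; the finite-difference variant here trades an extra factor of N for brevity.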

4 Testing on synthetic data

The use of synthetic data, where we can control the importance of each feature, allows us to illustrate the properties of our algorithm. We compare our algorithm with other common

Figure 1: (a)-(d): Illustration of the four synthetic target functions. The plots show the function value as a function of the first two features. (e),(f): demonstration of the effect of feature selection on estimating the second function using kNN regression (k = 5, β = 0.05); (e) using both features (mse = 0.03), (f) using the relevant feature only (mse = 0.004).

selection methods: infoGain [10], correlation coefficients (corrcoef) and forward selection (see [2]). infoGain and corrcoef simply rank features according to the mutual information² or the correlation coefficient (respectively) between each feature and the labels (i.e. the target function value). Forward selection (fwdSel) is a greedy method in which features are iteratively added into a growing subset. In each step, the feature showing the greatest improvement (given the previously selected subset) is added. This is a search method that can be applied with any evaluation function, and we use our criterion (equation 2) on feature subsets. This well known method has the advantages of considering feature subsets and of being usable with non-linear predictors. Another algorithm we compare with scores each feature independently using our evaluation function (2). This helps us in analyzing RGS, as it may help single out the respective contributions to performance of the evaluation function and the search method. We refer to this algorithm as SKS (Single feature, kNN regression, feature Selection). We look at four different target functions over R^50. The training sets include 20 to 100 points chosen randomly from the [−1, 1]^50 cube. The target functions are given in the top row of figure 2 and are illustrated in figure 1(a-d). Random Gaussian noise with zero mean and a variance of 1/7 was added to the function value of the training points. Clearly, only the first feature is relevant for the first two target functions, and only the first two features are relevant for the last two target functions. Note also that the last function is a smoothed version of the parity function and is considered hard for many feature selection algorithms [2]. First, to illustrate the importance of feature selection for regression quality, we use kNN to estimate the second target function.
Figure 1(e-f) shows the regression results for target (b), using either only the relevant feature or both the relevant and an irrelevant feature. The addition of one irrelevant feature degrades the MSE tenfold. Next, to demonstrate the capabilities of the various algorithms, we run them on each of the above problems with varying training set size. We measure their success by counting the number of times that the relevant features were assigned the highest rank (repeating the experiment 250 times by re-sampling the training set). Figure 2 presents success rate as a function of training set size. We can see that all the algorithms succeeded on the first function, which is monotonic and depends on one feature alone. infoGain and corrcoef fail on the second, non-monotonic function. The three kNN-based algorithms succeed because they only depend on local properties of the target function. We see, however, that RGS needs a larger training set to achieve a high success rate. The third target function depends on two features, but the dependency is simple as each of them alone is highly correlated with the function value. The fourth, XOR-like function exhibits a complicated dependency that requires consideration of the two relevant features simultaneously. SKS, which considers features separately, sees the effect of all other features as noise and, therefore, has only marginal success on the third
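The failure of corrcoef on the non-monotonic target (b) is easy to reproduce: sin(2πx₁ + π/2) = cos(2πx₁) integrates against x₁ to zero, so its sample correlation with the relevant feature is as tiny as with an irrelevant one, while a local (kNN-style) score separates them clearly. A small sketch (our own construction, not the paper's experiment code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 2))       # feature 0 is relevant, feature 1 is not
y = np.sin(2 * np.pi * X[:, 0] + np.pi / 2)  # target (b): non-monotonic in feature 0

# correlation assigns both features a near-zero score
r_rel = abs(np.corrcoef(X[:, 0], y)[0, 1])
r_irr = abs(np.corrcoef(X[:, 1], y)[0, 1])

def loo_knn_mse(x, y, k=5):
    """1-D leave-one-out kNN MSE: small iff nearby x values carry similar y values."""
    err = 0.0
    for i in range(len(x)):
        d = (x - x[i]) ** 2
        d[i] = np.inf
        nn = np.argsort(d)[:k]
        err += (y[i] - y[nn].mean()) ** 2
    return err / len(x)

mse_rel = loo_knn_mse(X[:, 0].copy(), y)     # small: feature 0 predicts y locally
mse_irr = loo_knn_mse(X[:, 1].copy(), y)     # large: feature 1 carries no information
```

Both correlations come out near zero, whereas the local score for the relevant feature is far smaller than for the irrelevant one.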

2 Feature and function values were “binarized” by comparing them to the median value.

[Figure 2 panel titles: (a) x1², (b) sin(2πx1 + π/2), (c) sin(2πx1 + π/2) + x2, (d) sin(2πx1) sin(2πx2); y-axis: success rate, x-axis: # examples; legend: corrcoef, infoGain, SKS, fwdSel, RGS]

Figure 2: Success rate of the different algorithms on 4 synthetic regression tasks (averaged over 250 repetitions) as a function of the number of training examples. Success is measured by the percent of the repetitions in which the relevant feature(s) received first place(s).

function and fails on the fourth altogether. RGS and fwdSel apply different search methods. fwdSel considers subsets but can evaluate only one additional feature in each step, giving it some advantage over RGS on the third function but causing it to fail on the fourth. RGS takes a step in all features simultaneously. Only such an approach can succeed on the fourth function.

5 Hand Movements Reconstruction from Neural Activity

To suggest an interpretation of neural coding we apply RGS and compare it with the alternatives presented in the previous section³ on the hand movement reconstruction task. The data sets were collected while a monkey performed a planar center-out reaching task with one or both hands [11]. 16 electrodes, inserted daily into novel positions in primary motor cortex, were used to detect and sort spikes in up to 64 channels (4 per electrode). Most of the channels detected isolated neuronal spikes by template matching. Some, however, had templates that were not tuned, producing spikes during only a fraction of the session. Others (about 25%) contained unused templates (resulting in a constant zero-producing channel or, possibly, a few random spikes). The rest of the channels (one per electrode) produced spikes by threshold passing. We construct a labeled regression data set as follows. Each example corresponds to one time point in a trial. It consists of the spike counts that occurred in the 10 previous consecutive 100 ms long time bins from all 64 channels (64 × 10 = 640 features), and the label is the X or Y component of the instantaneous hand velocity. We analyze data collected over 8 days. Each data set has an average of 5050 examples collected during the movement periods of the successful trials. In order to evaluate the different feature selection methods we separate the data into training and test sets. Each selection method is used to produce a ranking of the features. We then apply kNN (based on the training set) using different-size groups of top ranking features to the test set. We use the resulting MSE (or correlation coefficient between true and estimated movement) as our measure of quality. To test the significance of the results we apply 5-fold cross validation and repeat the process 5 times on different permutations of the trial ordering.
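The construction of the 640-dimensional examples amounts to windowing: for each time point, stack the counts of the 10 most recent 100 ms bins of every channel. A sketch of this preprocessing step (function name and array layout are our own assumptions):

```python
import numpy as np

def lagged_features(counts, lags=10):
    """counts: (T, channels) spike counts in consecutive 100 ms bins.
    Returns (T - lags + 1, channels * lags): each row stacks, channel by channel,
    the current bin and the previous lags-1 bins."""
    T, C = counts.shape
    rows = [counts[t - lags + 1 : t + 1].T.ravel() for t in range(lags - 1, T)]
    return np.array(rows)
```

With 64 channels and 10 lags each row has 640 entries, matching the feature count above.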
Figure 3 shows the average (over permutations, folds and velocity components) MSE as a function of the number of selected features on four of the different data sets (results on the rest are similar and omitted due to lack of space)⁴. It is clear that RGS achieves better results than the other methods throughout the range of feature numbers. To test whether the performance of RGS was consistently better than the other methods, we counted winning percentages (the percent of the times in which RGS achieved lower MSE than another algorithm) in all folds of all data sets and as a function of the number of features used.

3 fwdSel was not applied due to its intractably high run time complexity. Note that its run time is at least r times that of RGS, where r is the size of the optimal set, and is longer in practice.
4 We use k = 50 (approximately 1% of the data points). β is set automatically as described in section 3. These parameters were manually tuned for good kNN results and were not optimized for any of the feature selection algorithms. The number of epochs for RGS was set to 1 (i.e. T = m).

[Figure 3 panels: MSE vs. number of features (7 to 600), one panel per recording day; legend: RGS, SKS, infoGain, corrcoef]

Figure 3: MSE results for the different feature selection methods on the neural activity data sets. Each subfigure is a different recording day. MSEs are presented as a function of the number of features used. Each point is a mean over all 5 cross validation folds, 5 permutations on the data and the two velocity component targets. Note that some of the data sets are harder than others.


Figure 4 shows the winning percentages of RGS versus the other methods. For a very low number of features, while the error is still high, RGS winning scores are only slightly better than chance, but once there are enough features for good predictions the winning percentages are higher than 90%. In figure 3 we see that the MSE achieved when using only approximately 100 features selected by RGS is better than when using all the features. This difference is indeed statistically significant (win score of 92%). If the MSE is replaced by the correlation coefficient as the measure of quality, the average results (not shown due to lack of space) are qualitatively unchanged. RGS not only ranks the features but also gives them weights that achieve locally optimal results when using kNN regression. It therefore makes sense not only to select the features but to weigh them accordingly. Figure 5 shows the winning percentages of RGS using the weighted features versus RGS using uniformly weighted features. The corresponding MSEs (with and without weights) on the first data set are also displayed. It is clear that using the weights improves the results in a manner that becomes increasingly significant as the number of features grows, especially when the number of features is greater than the optimal number. Thus, using weighted features can compensate for choosing too many by diminishing the effect of the surplus features. To take a closer look at which features are selected, figure 6 shows the 100 highest ranking features for all algorithms on one data set. Similar selection results were obtained in the rest of the folds. One would expect to find that well isolated cells (template matching) are more informative than threshold-based spikes. Indeed, all the algorithms select isolated cells more frequently within the top 100 features (RGS does so 95% of the time and the rest 70%-80%).
A human selection of channels, based only on looking at raster plots and selecting channels with stable firing rates, was also available to us. This selection was independent of the template/threshold categorization. Once again, the algorithms selected the humanly preferred channels more frequently than the other channels. Another, more interesting, observation that can also be seen in the figure is that while corrcoef, SKS and infoGain tend to select all time lags of a channel, RGS's selections are more scattered (more channels and only a few time bins per channel). Since RGS achieves the best results, we


Figure 4: Winning percentages of RGS over the other algorithms. RGS achieves better MSEs consistently.


Figure 5: Winning percentages of RGS with and without weighting of features (black). Gray lines are corresponding MSEs of these methods on the first data set.

[Figure 6 panels, in order: RGS, SKS, corrCoef, infoGain]

Figure 6: 100 highest ranking features (grayed out) selected by the algorithms. Results are for one fold of one data set. In each subfigure, the bottom row is the (100 ms) time bin with the least delay and the higher rows correspond to longer delays. Each column is a channel (silent channels omitted).

conclude that this selection pattern is useful. Apparently RGS found these patterns thanks to its ability to evaluate complex dependency on feature subsets. This suggests that such dependency of the behavior on the neural activity does exist.

6 Summary

In this paper we present a new method of selecting features for function estimation and use it to analyze neural activity during a motor control task. We use the leave-one-out mean squared error of the kNN estimator and minimize it using gradient ascent on an “almost” smooth evaluation function. This yields a selection method which can handle a complicated dependency of the target function on groups of features yet can be applied to large scale problems. This is valuable since many common selection methods lack one of these properties. By comparing the results of our method to other selection methods on the motor control task, we show that consideration of complex dependency helps to achieve better performance. These results suggest that this is an important property of the code. Our future work is aimed at a better understanding of neural activity through the use of feature selection. One possibility is to perform feature selection on other kinds of neural data such as local field potentials or retinal activity. Another promising option is to explore the temporally changing properties of neural activity. Motor control is a dynamic process in which the input-output relation has a temporally varying structure. RGS can be used in online (rather than batch) mode to identify these structures in the code.

References

[1] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[2] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 2003.
[3] A. J. Miller. Subset Selection in Regression. Chapman and Hall, 1990.
[4] L. Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2), 1978.
[5] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11.
[6] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, volume 11, pages 193–225, April 1997.
[7] R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection - theory and algorithms. In Proc. 21st ICML, pages 337–344, 2004.
[8] D. M. Taylor, S. I. Tillery, and A. B. Schwartz. Direct cortical control of 3d neuroprosthetic devices. Science, 296(7):1829–1832, 2002.
[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[10] J. R. Quinlan. Induction of decision trees. In J. W. Shavlik and T. G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in Machine Learning, 1:81–106, 1986.
[11] R. Paz, T. Boraud, C. Natan, H. Bergman, and E. Vaadia. Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8):882–890, August 2003.

2.3 A Temporal Kernel-Based Model for Tracking Hand-Movements from Neural Activities

A Temporal Kernel-Based Model for Tracking Hand-Movements from Neural Activities
Lavi Shpigelman 1,2, Koby Crammer 1, Rony Paz 2,3, Eilon Vaadia 2,3, Yoram Singer 1
1 School of Computer Science and Engineering
2 Interdisciplinary Center for Neural Computation
3 Dept. of Physiology, Hadassah Medical School
The Hebrew University, Jerusalem, 91904, Israel

Email for correspondence: [email protected]

Abstract We devise and experiment with a dynamical kernel-based system for tracking hand movements from neural activity. The state of the system corresponds to the hand location, velocity, and acceleration, while the system's inputs are the instantaneous spike rates. The system's state dynamics is defined as a combination of a linear mapping from the previous estimated state and a kernel-based mapping tailored for modeling neural activities. In contrast to generative models, the activity-to-state mapping is learned using discriminative methods by minimizing a noise-robust loss function. We use this approach to predict hand trajectories on the basis of neural activity in motor cortex of behaving monkeys and find that the proposed approach is more accurate than both a static approach based on support vector regression and the Kalman filter.

1 Introduction

The paper focuses on the problem of tracking hand movements, which constitute smooth spatial trajectories, from spike trains of a neural population. We do so by devising a dynamical system which employs a tailored kernel for spike trains along with a linear mapping corresponding to the states' dynamics. Consider a situation where a subject performs free hand movements during a task that requires accurate space and time precision. In the lab, it may be a constrained reaching task, while in real life it may be an everyday task such as eating. We wish to track the hand position given only spike trains from a recorded neural population. The rationale of such an undertaking is twofold. First, this task can be viewed as a step towards the development of a Brain-Machine Interface (BMI), which is rapidly becoming a possible future solution for motor-disabled patients. Recent studies of BMIs [13, 3, 10] (being on-line and feedback enabled) show that a relatively small number of cortical units can be used to move a cursor or a robot effectively, even without generation of hand movements, and that training of the subjects improves the overall success of the BMIs. Second, open loop (off-line) movement decoding (see e.g. [2, 7, 1, 15, 11, 8]), while inappropriate for BMIs, is computationally less expensive, easier to implement and allows repeated analysis, thus providing a handle on understanding neural computation in the brain. Early studies [6] showed that the direction of arm movement is reflected by the population vector of preferred directions weighted by current firing rates, suggesting that intended

movement is encoded in the firing rate which, in turn, is modulated by the angle between a unit's preferred direction (PD) and the intended direction. This linear regression approach is still prevalent and is applied, with some variation of the learning methods, in closed and open loop settings. There is relatively little work on the development of dedicated nonlinear methods. Both movement and neural activity are dynamic and can therefore be naturally modeled by dynamical systems. Filtering methods often employ generative probabilistic models such as the well known Kalman filter [16] or more neurally specialized models [2, 1] in which a cortical unit's spike count is generated by a probability function of its underlying firing rate, which is tuned to movement parameters. The movement, being a smooth trajectory, is modeled as a linear transition with (typically additive Gaussian) noise. These methods have the advantage of being aware of the smooth nature of movement and provide models of what neurons are tuned to. However, the requirement of describing a neural population's firing probability as a function of movement state is hard to satisfy without making costly assumptions. The most prominent is the assumption of statistical independence of cells given the movement. Kernel-based methods have been shown to achieve state-of-the-art results in many application domains. Discriminative kernel methods, such as Support Vector Regression (SVR), forgo the task of modeling neuronal tuning functions. Furthermore, the construction of kernel-induced feature spaces lends itself to efficient implementation of distance measures over spike trains that are better suited to comparing two neural population trajectories than the Euclidean distance in the original space of spike counts per bin [11, 5].
However, SVR is a “static” method that does not take into account the smooth dynamics of the predicted movement trajectory which imposes a statistical dependency between consecutive examples. This paper introduces a kernel based regression method that incorporates linear dynamics of the predicted trajectories. In Sec. 2 we formally describe the problem setting. We introduce the movement tracking model and the associated learning framework in Sec. 3. The resulting learning problem yields a new kernel for linear dynamical systems. We provide an efficient calculation of this kernel and describe our dual space optimization method for solving the learning problem. The experimental method is presented in Sec. 4. Results, underscoring the merits of our algorithm are provided in Sec. 5 and conclusions are given in Sec. 6.

2 Problem Setting

Our training set contains m trials. Each trial (typically indexed by i or j) consists of a pair of movement and neural recordings, designated by (Y^i, O^i). Y^i = {y_t^i}_{t=1}^{t^i_end} is a time series of movement state values, where y_t^i ∈ R^d is the movement state vector at time t in trial i. We are interested in reconstructing position; however, for better modeling, y_t^i may be a vector of position, velocity and acceleration (as is the case in Sec. 4). This trajectory is observed during model learning and is the inference target. O^i = {o_t^i}_{t=1}^{t^i_end} is a time series of neural spike counts, where o_t^i ∈ R^q is a vector of spike counts from q cortical units at time t. We wish to learn a function z_t^i = f(O^i_{1:t}) that is a good estimate (in a sense formalized in the sequel) of the movement y_t^i. Thus, f is a causal filtering method. We confine ourselves to a causal setting since we plan to apply the proposed method in a closed loop scenario where real-time output is required. The partition into separate trajectories is a natural one in a setting where a session is divided into many trials, each consisting of one attempt at accomplishing the basic task (such as reaching movements to displayed targets). In tasks that involve no hitting of objects, hand movements are typically smooth.

End-point movement in small time steps is loosely approximated as having constant acceleration. On the other hand, neural spike counts (which are typically measured in bins of 50–100 ms) vary greatly from one time step to the next. In summary, our goal is to devise a dynamic mapping from sequences of neural activities ending at a given time to the instantaneous hand movement characterization (location, velocity, and acceleration).

3 Movement Tracking Algorithm

Our regression method is defined as follows: given a series O ∈ R^{q×t_end} of observations and, possibly, an initial state y_0, the predicted trajectory Z ∈ R^{d×t_end} is

z_t = A z_{t−1} + W φ(o_t) ,  t_end ≥ t > 0 ,    (1)

where z_0 = y_0, A ∈ R^{d×d} is a matrix describing linear movement dynamics and W ∈ R^{d×q} is a weight matrix. φ(o_t) is a feature vector of the observed spike trains at time t and is later replaced by a kernel operator (in the dual formulation to follow). Thus, the state transition is a linear transformation of the previous state with the addition of a non-linear effect of the observation. Note that unfolding the recursion in Eq. (1) yields z_t = A^t y_0 + Σ_{k=1}^t A^{t−k} W φ(o_k). Assuming that A describes stable dynamics (the eigenvalues of A are less than 1 in magnitude), the current prediction depends, in an exponentially decaying manner, on the previous observations. We further assume that A is fixed and wish to learn W (we describe our choice of A in Sec. 4). In addition, o_t may also encompass a series of previous spike counts in a window ending at time t (as is the case in Sec. 4). Also, note that this model (in its non-kernelized version) has an algebraic form which is similar to the Kalman filter (to which we compare our results later).

Primal Learning Problem: The optimization problem presented here is identical to the standard SVR learning problem (see, for example, [12]) with the exception that z_t^i is defined as in Eq. (1), while in standard SVR, z_t = W φ(o_t) (i.e. without the linear dynamics). Given a training set of fully observed trials {(Y^i, O^i)}_{i=1}^m we define the learning problem to be

min_W  (1/2) ‖W‖² + c Σ_{i=1}^m Σ_{t=1}^{t^i_end} Σ_{s=1}^d | (z_t^i)_s − (y_t^i)_s |_ε .    (2)
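A literal sketch of the tracker of Eq. (1) and the error term of Eq. (2) (our own naming; A, W, and the feature map φ are placeholders to be chosen or learned):

```python
import numpy as np

def track(A, W, phi, O, y0):
    """Eq. (1): z_t = A z_{t-1} + W phi(o_t), unrolled causally over one trial.
    A: (d, d) dynamics, W: (d, q) weights, phi: feature map on one observation,
    O: iterable of observations, y0: (d,) initial state. Returns (t_end, d)."""
    z = np.asarray(y0, dtype=float)
    Z = []
    for o in O:
        z = A @ z + W @ phi(o)   # linear dynamics plus observation-driven term
        Z.append(z.copy())
    return np.array(Z)

def eps_insensitive_error(Z, Y, eps):
    """Training-error term of Eq. (2): sum over times and movement dimensions
    of |z - y|_eps = max(0, |z - y| - eps)."""
    return np.maximum(np.abs(Z - Y) - eps, 0.0).sum()
```

Setting W = 0 recovers pure decaying dynamics z_t = A^t y_0, while A = 0 recovers the static SVR-style prediction z_t = W φ(o_t).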

where ‖W‖² = Σ_{a,b} (W)²_{ab} is the squared Frobenius norm. The second term is a sum of training errors (over all trials, times and movement dimensions). |·|_ε is the ε-insensitive loss, defined as |v|_ε = max{0, |v| − ε}. The first term is a regularization term that promotes small weights, and c is a fixed constant providing a trade-off between the regularization term and the training error. Note that to compensate for different units and scales of the movement dimensions, one could either define a different ε_s and c_s for each dimension of the movement or, conversely, scale the s'th movement dimension. The tracking method, combined with the optimization specified here, defines the complete algorithm. We name this method the Discriminative Dynamic Tracker, or DDT in short.

A Dual Solution: The derivation of the dual of the learning problem defined in Eq. (2) is rather mundane (e.g. [12]) and is thus omitted. Briefly, we replace the ε-loss with pairs of slack variables. We then write a Lagrangian of the primal problem and replace z_t^i with its (less standard) definition. We then differentiate the Lagrangian with respect to the slack variables and W and obtain a dual optimization problem. We present the dual problem in a top-down manner, starting with the general form and finishing with a kernel definition. The form of the dual is

max_{α,α*}  −(1/2) (α* − α)^T G (α* − α) + (α* − α)^T y − (α* + α)^T ε ,  s.t. α, α* ∈ [0, c] .    (3)

Note that the above expression conforms to the dual form of SVR. Let ℓ equal the size of the movement space (d), multiplied by the total number of time steps in all the training trajectories. α, α* ∈ R^ℓ are vectors of Lagrange multipliers, y ∈ R^ℓ is a column concatenation of all the training set movement trajectories [y_1^{1T} ⋯ y_{t^m_end}^{mT}]^T, ε = [ε, . . . , ε]^T ∈ R^ℓ,

and G ∈ R`×` is a Gram matrix (vT denotes transposition). One obvious difference between our setting and the standard SVR lies within the size of the vectors and Gram matrix. In addition, a major difference is the definition of G. We define G here in a hierarchical manner. Let i, j ∈ {1, . . . , m} be trajectory (trial) indexes. G is built from blocks indexed by Gij , which are in turn made from basic blocks, indexed by Kij tq as follows     ij 11 1m K11 ··· Kij G ··· G 1tj   .. ..   .. .. ij .. , G =  ...  , G = . . . . .   ij ij Kti 1 · · · Kti tj Gm1 · · · Gmm end

end end

where block G^{ij} refers to a pair of trials (i and j), and each basic block K_{tq}^{ij} refers to a pair of time steps t and q in trajectories i and j respectively; t_end^i, t_end^j are the time lengths of trials i and j. Basic blocks are defined as

K_{tq}^{ij} = Σ_{r=1}^{t} Σ_{s=1}^{q} A^{t−r} k_{rs}^{ij} (A^{q−s})^T ,    (4)

where k_{rs}^{ij} = k(o_r^i, o_s^j) is a (freely chosen) basic kernel between the two neural observations o_r^i and o_s^j at times r and s in trials i and j respectively. For an explanation of kernel operators we refer the reader to [14] and mention that the kernel operator can be viewed as computing φ(o_r^i) · φ(o_s^j), where φ is a fixed mapping to some inner product space. The choice of kernel (being the choice of feature space) reflects a modeling decision that specifies how similarities between neural patterns are measured. The resulting dual form of the tracker is z_t = Σ_k α_k G_{tk}, where G_t is the Gram matrix row of the new example.
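As a concrete illustration of this view of a kernel as an inner product in feature space, consider a toy homogeneous degree-2 polynomial kernel (not the Spikernel used in the experiments; names below are ours):

```python
import numpy as np
from itertools import product

def poly2_kernel(u, v):
    """Homogeneous degree-2 polynomial kernel: k(u, v) = (u . v)^2."""
    return float(np.dot(u, v)) ** 2

def poly2_features(u):
    """Explicit feature map phi(u) with one entry u_a * u_b per ordered
    pair (a, b), so that phi(u) . phi(v) = (u . v)^2."""
    return np.array([u[a] * u[b] for a, b in product(range(len(u)), repeat=2)])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly2_kernel(u, v))                     # kernel evaluated directly
print(poly2_features(u) @ poly2_features(v))  # same value via explicit features
```

The kernel computes the inner product without ever forming φ explicitly, which is what makes rich feature spaces (such as the one induced by the Spikernel) computationally feasible.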

It is therefore clear from Eq. (4) that the linear dynamic characteristics of DDT result in a Gram matrix whose entries depend on previous observations. This dependency decays exponentially as the time difference between events in the trajectories grows. Note that the solution of the dual optimization problem in Eq. (3) can be calculated by any standard quadratic programming optimization tool. Also note that direct calculation of G is inefficient; we describe an efficient method in the sequel.

Efficient Calculation of the Gram Matrix: Simple, straightforward calculation of the Gram matrix is time consuming. To illustrate this, suppose each trial is of length t_end^i = n; then calculation of each basic block would take Θ(n²) summation steps. We now describe a dynamic-programming procedure for calculating the Gram matrix in a constant number of operations per basic block. Omitting the indexing over trials to ease notation, we are interested in calculating the basic block K_{tq}. First, define B_{tq} = Σ_{k=1}^{t} k_{kq} A^{t−k}. The basic block K_{tq} can be recursively calculated in three different ways:

K_{tq} = K_{t(q−1)} A^T + B_{tq}    (5)

K_{tq} = A K_{(t−1)q} + (B_{qt})^T    (6)

K_{tq} = A K_{(t−1)(q−1)} A^T + (B_{qt})^T + B_{tq} − k_{tq} I .    (7)

Thus, by adding Eq. (5) to Eq. (6) and subtracting Eq. (7) we get

K_{tq} = A K_{(t−1)q} + K_{t(q−1)} A^T − A K_{(t−1)(q−1)} A^T + k_{tq} I .

B_{tq} (and the entailed summation) is eliminated in exchange for a 2D dynamic program with initial conditions: K_{11} = k_{11} I ,  K_{1q} = K_{1(q−1)} A^T + k_{1q} I ,  K_{t1} = A K_{(t−1)1} + k_{t1} I .
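A sketch of this dynamic program in Python (zero-based indices, NumPy; the function and variable names are ours, not from the thesis code):

```python
import numpy as np

def basic_blocks(A, k):
    """Compute all basic blocks K_tq of one Gram-matrix block G^{ij} via
    the 2D dynamic program, in constant work per block.
    A : (d, d) dynamics matrix
    k : (T1, T2) base-kernel values k(o_r^i, o_s^j)
    Returns K with K[t, q] the (d, d) basic block (zero-based t, q)."""
    d = A.shape[0]
    T1, T2 = k.shape
    I = np.eye(d)
    K = np.zeros((T1, T2, d, d))
    K[0, 0] = k[0, 0] * I                       # K_11 = k_11 I
    for q in range(1, T2):                      # first row: K_1q
        K[0, q] = K[0, q - 1] @ A.T + k[0, q] * I
    for t in range(1, T1):                      # first column: K_t1
        K[t, 0] = A @ K[t - 1, 0] + k[t, 0] * I
    for t in range(1, T1):                      # interior blocks, combined recursion
        for q in range(1, T2):
            K[t, q] = (A @ K[t - 1, q] + K[t, q - 1] @ A.T
                       - A @ K[t - 1, q - 1] @ A.T + k[t, q] * I)
    return K
```

Each block costs a constant number of d×d matrix products, versus the Θ(n²) summations of evaluating Eq. (4) directly.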

Table 1: Mean R², MAE_ε and MSE (across datasets, folds, hands and directions) for each algorithm.

                      |        R²         |       MAE_ε       |        MSE
Algorithm             | pos.  vel.  accl. | pos.  vel.  accl. | pos.  vel.  accl.
Kalman filter         | 0.64  0.58  0.30  | 0.40  0.15  0.37  | 0.78  0.27  1.16
DDT-linear            | 0.59  0.49  0.17  | 0.63  0.41  0.58  | 0.97  0.50  1.23
SVR-Spikernel         | 0.61  0.64  0.37  | 0.44  0.14  0.34  | 0.76  0.20  0.98
DDT-Spikernel         | 0.73  0.67  0.40  | 0.37  0.14  0.34  | 0.50  0.16  0.91

Figure 1: Correlation coefficients (R², of predicted and observed hand positions) comparing the DDT-Spikernel versus the Kalman filter (left), DDT-linear (center) and SVR-Spikernel (right). Each data point shows the R² values obtained by the DDT-Spikernel and by another method in one fold of one of the datasets, for one of the two axes of movement (circle/square) and one of the hands (filled/non-filled). Results above the diagonals are cases where the DDT-Spikernel outperforms.

Suggested Optimization Method: One possible way to solve the optimization problem (essentially, a modification of the method described in [4] for classification) is to sequentially solve a reduced problem with respect to a single example at a time. Define

δ_i = | Σ_j (α_j* − α_j) G_{ij} − y_i |_ε  −  min_{α_i, α_i* ∈ [0,c]} | Σ_j (α_j* − α_j) G_{ij} − y_i |_ε .

Then δ_i is the amount of ε-insensitive error that can be corrected for example i by keeping all α_{j≠i}^{(*)} constant and changing α_i^{(*)}. Optimality is reached by iteratively choosing the example with the largest δ_i and changing its α_i^{(*)} within the [0, c] limits to minimize the error for this example.

4  Experimental Setting

The data used in this work was recorded from the primary motor cortex of a Rhesus (Macaca mulatta) monkey (~4.5 kg). The monkey sat in a dark chamber, and up to 8 electrodes were introduced into the MI area of each hemisphere. The electrode signals were amplified, filtered and sorted. The data used in this report was recorded on 8 different days and includes hand positions, sampled at 500Hz, spike times of single units (isolated by signal fit to a series of windows) and of multi-units (detected by threshold crossing) sampled at 1ms precision. The monkey used two planar-movement manipulanda to control two cursors on the screen to perform a center-out reaching task. Each trial began when the monkey centered both cursors on a central circle. Either cursor could turn green, indicating the hand to be used in the trial. Then, one of eight targets appeared (the 'go signal'), the center circle disappeared, and the monkey had to move and reach the target to receive a liquid reward. The number of multi-unit channels ranged from 5 to 15, the number of single units was 20-27, and the average total was 34 units per dataset. The average spike rate per channel was 8.2 spikes/sec. More information on the recordings can be found in [9].


Figure 2: Comparison of R2 -performance between algorithms. Each algorithm is represented by a vertex. The weight of an edge between two algorithms is the fraction of tests in which the algorithm on top achieves higher R2 score than the other. A bold edge indicates a fraction higher than 95%. Graphs from left to right are for position, velocity, and acceleration respectively.

The results that we present here refer to prediction of instantaneous hand movements during the period from 'go signal' to 'target reach' times of both hands in successful trials. Note that some of the trials required movement of the left hand while keeping the right hand steady and vice versa. Therefore, although we considered only movement periods of the trials, we had to predict both movement and non-movement for each hand. The cumulative time length of all the datasets was about 67 minutes. Since the correlation between the movements of the two hands tends to zero, we predicted movement for each hand separately, choosing the movement space to be [x, y, v_x, v_y, a_x, a_y]^T for each of the hands (preliminary results using only [x, y, v_x, v_y]^T were less accurate). We preprocessed the spike trains into spike counts in running windows of 100ms (the choice of window size is based on previous experience [11]). Hand position, velocity and acceleration were calculated using the 500Hz recordings. Both spike counts and hand movement were then sampled at steps of 100ms (preliminary results with a step size of 50ms were negligibly different for all algorithms). A labeled example (y_t^i, o_t^i) for time t in trial i consisted of the previous 10 bins of population spike counts and the state, as a 6D vector, for the left or right hand. Two such consecutive examples thus have 9 time bins of spike count overlap. For example, the number of cortical units, q, in the first dataset was 43 (27 single and 16 multiple) and the total length of all the trials that were used in that dataset was 529 seconds. Hence in that session there are 5290 consecutive examples, where each is a 43×10 matrix of spike counts along with two 6D vectors of end-point movement.

In order to run our algorithm we had to choose base kernels, their parameters, A and c (and θ, to be introduced below). We used the Spikernel [11], a kernel designed to be used with spike rate patterns, and the simple dot product (i.e. linear regression). Kernel parameters and c were chosen (and subsequently held fixed) by 5-fold cross validation over half of the first dataset only. We compared DDT with the Spikernel and with the linear kernel to standard SVR using the Spikernel and to the Kalman filter. We also obtained tracking results using both DDT and SVR with the standard exponential kernel. These results were slightly less accurate on average than with the Spikernel and are therefore omitted here. The Kalman filter was learned assuming the standard state space model (y_t = A y_{t−1} + η, o_t = H y_t + ξ, where η, ξ are white Gaussian noise with appropriate correlation matrices), as in [16]. y belonged to the same 6D state space as described earlier. To ease the comparison, the same matrix A that was learned for the Kalman filter was used in our algorithm (though we show that it is not optimal for DDT), multiplied by a scaling parameter θ. This parameter was selected to produce the best position results on the training set. The selected θ value is 0.8. The figures that we show in Sec. 5 are of test results in 5-fold cross validation on the rest of the data. Each of the 8 remaining datasets was divided into 5 folds. 4/5 were used for

Figure 3: Effect of θ on R², MAE_ε, MSE and the number of support vectors.

Figure 4: Sample of tracking with the DDT-Spikernel and the SVR-Spikernel.

training (with the parameters obtained previously) and the remaining 1/5 as a test set. This process was repeated 5 times for each hand. Altogether we had 8 sets × 5 folds × 2 hands = 80 folds.
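The preprocessing described earlier (overlapping windows of 10 spike-count bins, each paired with a 6D hand state) can be sketched as follows; array names and shapes are illustrative, not from the thesis code:

```python
import numpy as np

def make_examples(counts, states, window=10):
    """Build labeled examples (y_t, o_t): o_t holds the previous `window`
    bins of population spike counts, y_t the hand state at bin t.
    counts : (q, T) spike counts in 100 ms bins for q cortical units
    states : (6, T) position, velocity and acceleration per bin"""
    q, T = counts.shape
    examples = []
    for t in range(window - 1, T):
        o_t = counts[:, t - window + 1 : t + 1]   # q x window count matrix
        y_t = states[:, t]                        # 6D movement state
        examples.append((y_t, o_t))
    return examples
```

Consecutive examples then overlap in window − 1 = 9 spike-count bins, as in the text.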

5  Results

We begin by showing average results across all datasets, folds, hands and X/Y directions for the four algorithms that are compared. Table 1 shows mean correlation coefficients (R², between recorded and predicted movement values), mean ε-insensitive absolute errors (MAE_ε) and mean square errors (MSE). R² is a standard performance measure, MAE_ε is the error minimized by DDT (subject to the regularization term) and MSE is minimized by the Kalman filter. Under all the above measures the DDT-Spikernel outperforms the rest, with the SVR-Spikernel and the Kalman filter alternating in second place. To understand whether the performance differences are statistically significant, we look at the distribution of position (X and Y) R² values in each of the separate tests (160 altogether). Figure 1 shows scatter plots of R² results for position predictions. Each plot compares the DDT-Spikernel (on the Y axis) with one of the other three algorithms (on the X axis). It is clear that although there are large differences in accuracy across the different datasets, the algorithm pairs achieve similar success, with the DDT-Spikernel achieving a better R² score in almost all cases. To summarize the significance of the R² differences, we computed the number of tests in which one algorithm achieved a higher R² value than another (for all pairs, in each of the position, velocity and acceleration categories). The results of this tournament between the algorithms are presented in Figure 2 as winning percentages. The graphs produce a ranking of the algorithms, and the percentages are the significances of the ranking between pairs. The DDT-Spikernel is significantly better than the rest in tracking position. The matrix A in use is not optimal for our algorithm. The choice of θ scales its effect. When θ = 0 we get the standard SVR algorithm (without state dynamics).
To illustrate the effect of θ, we present in Figure 3 the mean (over 5 folds, X/Y directions and hands) R² results on the first dataset as a function of θ. It is clear that the value chosen to minimize position error is not optimal for minimizing velocity and acceleration errors. Another important effect of θ is on the number of support patterns in the learned model, which drops considerably (by about one third) when the effect of the dynamics is increased. This means that more training points fall strictly within the ε-tube in training, suggesting that the kernel which tacitly results from the dynamical model is better suited for the problem. Lastly, we show a sample of test tracking results for the DDT-Spikernel and the SVR-Spikernel in Figure 4. Note that the acceleration values are not smooth and are therefore least aided by the dynamics of the model. However, adding acceleration to the model improves the prediction of position.

6  Conclusion

We described and reported experiments with a dynamical system that combines a linear state mapping with a nonlinear observation-to-state mapping. The estimation of the system's parameters is transformed to a dual representation and yields a novel kernel for temporal modelling. When a linear kernel is used, the DDT system has a form similar to the Kalman filter as t → ∞. However, the system's parameters are set so as to minimize the regularized ε-insensitive ℓ1 loss between state trajectories. DDT also bears similarity to SVR, which employs the same loss yet without the state dynamics. Our experiments indicate that by combining a kernel-induced feature space, linear state dynamics, and a robust loss, we are able to improve trajectory prediction accuracy and outperform common approaches. Our next step toward an accurate brain-machine interface for predicting hand movements is the development of a learning procedure for the state dynamics mapping A and further development of neurally motivated and compact representations.

Acknowledgments: This study was partly supported by a center of excellence grant (8006/00) administered by the ISF, by BMBF-DIP, by the U.S.-Israel BSF and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. L.S. is supported by a Horowitz fellowship.

References

[1] A. E. Brockwell, A. L. Rojas, and R. E. Kass. Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91:1899–1907, 2004.
[2] E. N. Brown, L. M. Frank, D. Tang, M. C. Quirk, and M. A. Wilson. A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18:7411–7425, 1998.
[3] J. M. Carmena, M. A. Lebedev, R. E. Crist, J. E. O'Doherty, D. M. Santucci, D. F. Dimitrov, P. G. Patil, C. S. Henriques, and M. A. L. Nicolelis. Learning to control a brain-machine interface for reaching and grasping by primates. PLOS Biology, 1(2):001–016, 2003.
[4] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[5] J. Eichhorn, A. Tolias, A. Zien, M. Kuss, C. E. Rasmussen, J. Weston, N. Logothetis, and B. Schölkopf. Prediction on spike data using kernel algorithms. In NIPS 16. MIT Press, 2004.
[6] A. P. Georgopoulos, J. Kalaska, and J. Massey. Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor cortical populations. Experimental Brain Research (Supp.), 7:327–336, 1983.
[7] R. E. Isaacs, D. J. Weber, and A. B. Schwartz. Work toward real-time control of a cortical neural prosthesis. IEEE Trans. Rehabil. Eng., 8:196–198, 2000.
[8] C. Mehring, J. Rickert, E. Vaadia, S. C. de Oliveira, A. Aertsen, and S. Rotter. Inference of hand movements from local field potentials in monkey motor cortex. Nature Neuroscience, 6(12), 2003.
[9] R. Paz, T. Boraud, C. Natan, H. Bergman, and E. Vaadia. Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8):882–890, August 2003.
[10] M. D. Serruya, N. G. Hatsopoulos, L. Paninski, M. R. Fellows, and J. P. Donoghue. Instant neural control of a movement signal. Nature, 416:141–142, March 2002.
[11] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia. Spikernels: Embedding spiking neurons in inner product spaces. In NIPS 15, Cambridge, MA, 2003. MIT Press.
[12] A. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT2 Technical Report, 1998.
[13] S. I. H. Tillery, D. M. Taylor, and A. B. Schwartz. Training in cortical control of neuroprosthetic devices improves signal extraction from small neuronal ensembles. Reviews in the Neurosciences, 14:107–119, 2003.
[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.
[15] J. Wessberg, C. R. Stambaugh, J. D. Kralik, P. D. Beck, M. Laubach, J. K. Chapin, J. Kim, J. Biggs, M. A. Srinivasan, and M. A. Nicolelis. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(16), November 2000.
[16] W. Wu, M. J. Black, Y. Gao, E. Bienenstock, M. Serruya, and J. P. Donoghue. Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter. In SAB02, pages 66–73, Edinburgh, Scotland (UK), 2002.

2.4  Kernel-ARMA for Hand Tracking and Brain-Machine Interfacing During 3D Motor Control

Kernel-ARMA for Hand Tracking and Brain-Machine Interfacing During 3D Motor Control

Lavi Shpigelman 1, Hagai Lalazar 2 and Eilon Vaadia 3
Interdisciplinary Center for Neural Computation, The Hebrew University of Jerusalem, Israel
1 [email protected], 2 [email protected], 3 [email protected]

Abstract Using machine learning algorithms to decode intended behavior from neural activity serves a dual purpose. First, these tools allow patients to interact with their environment through a Brain-Machine Interface (BMI). Second, analyzing the characteristics of such methods can reveal the relative significance of various features of neural activity, task stimuli, and behavior. In this study we adapted, implemented and tested a machine learning method called Kernel Auto-Regressive Moving Average (KARMA) for the task of inferring movements from neural activity in primary motor cortex. Our version of this algorithm is used in an online learning setting and is updated after a sequence of inferred movements is completed. We first used it to track real hand movements executed by a monkey in a standard 3D reaching task. We then applied it in a closed-loop BMI setting to infer intended movement, while the monkey's arms were comfortably restrained, thus performing the task using the BMI alone. KARMA is a recurrent method that learns a nonlinear model of output dynamics. It uses similarity functions (termed kernels) to compare between inputs. These kernels can be structured to incorporate domain knowledge into the method. We compare KARMA to various state-of-the-art methods by evaluating tracking performance and present results from the KARMA-based BMI experiments.

1  Introduction

Performing a behavioral action such as picking up a sandwich and bringing it to one's mouth is a motor control task achieved easily every day by millions of people. This simple action, however, is impossible for many patients with motor deficits. In the future, patients with enough cortical activity remaining may benefit from Brain-Machine Interfaces that will restore motor control with agility, precision, and degrees of freedom comparable to natural movements. Such high-quality BMIs are not yet available. The BMI framework involves recording neural activity, typically using chronically implanted electrodes, which is fed in real-time to a decoding algorithm. Such algorithms attempt to infer the subject's intended behavior. The algorithm's predictions can be used to artificially control an end-effector: a cursor on a screen, a prosthetic arm, a wheelchair, or the subject's own limbs by stimulation of their muscles. This study focuses on the algorithmic component. Motor control is a dynamic process involving many feedback loops, relevant time frames, and constraints of the body and neural processing. Neural activity in primary motor cortex (MI) is part of this process. An early approach to decoding movement from MI activity for BMI (see [1]) was rather simplistic. Instantaneous velocity of the cursor, across a set of movements, was linearly regressed against neuronal spike rates. This algorithm (known as the Population Vector Algorithm) is equivalent to modelling each neuron as a cosine function of movement velocity. This method is still used today for BMIs [2], and has become the standard model in many studies of encoding and learning in MI. Our understanding of motor cortex has progressed, and many other factors have been shown to

correlate with neuronal activity, but are typically overlooked in modeling. For example, MI activity has been shown to encode arm posture [3], the dynamic aspects of the movement (such as current acceleration, or interaction forces) and the interactions between neurons and their dynamics [4]. State-of-the-art movement decoding methods typically involve improved modeling of behavior, neural activity, and the relations between them. For example, Kalman filtering (see [5]) has been used to model the system state as being comprised of current hand position, velocity and acceleration. Thus, the hand movement is assumed to have roughly constant acceleration (with added Gaussian noise and, consequently, minimal jerk) and the neural activity is assumed to be a linear function of the hand state (with added Gaussian noise). Particle filtering, which relaxes some of the linearity and Gaussian assumptions, has also been applied in an offline setting (see [6]). Support Vector Regression (SVR) from neural activity to current hand velocity (see [7]) has the advantage of allowing for extraction of nonlinear information from neuronal interactions, but is missing a movement model. One of our previous studies ([8]) combines a linear movement model (as in Kalman filtering) with SVR-based nonlinear regression from neural activity. KARMA (see [9] for one of its first appearances, or [10] for a more recent one) is a kernelized version of the ARMA method [11]. It performs ARMA in a kernel-induced feature space (for a comprehensive explanation of this kernel-trick, see [12]). It estimates the next system state as a function of both the time window of previous state estimates (the Auto-Regressive part) and the time window of previous observations (the Moving-Average part). In our application, we extend its formulation to the Multi-Input Multi-Output (MIMO) case, allowing for better modeling of the system state. 
We apply it in an online learning paradigm, and by limiting the number of support vectors we turn it into an adaptive method. This allows for real-time inference, as is necessary for BMI. In section 2 we explain the lab setup, the problem setting, and introduce our notation. In section 3 we describe KARMA, and our online and adaptive version of it, in detail. We explain the available modeling options and how they can be used to either improve performance or to test ideas regarding motor control and neural encoding. Section 4 describes KARMA's performance in tracking hand movements and compares it with other state-of-the-art methods. Section 5 presents results from our BMI experiment using KARMA and, finally, we summarize in section 6.

2  Lab setup and problem setting

In our experiments, a monkey performed a visuomotor control task that involved moving a cursor on a screen from target to target in 3D virtual space. Neuronal activity from a population of single and multi-units was recorded with a chronically implanted array of 96 electrodes (Cyberkinetics, Inc.) in MI, and used to calculate spike rates (spike counts in 50ms time bins smoothed by a causal filter). In hand-control (open-loop) mode, the monkey used its hand (continuously tracked by an optical motion capture system; Phoenix Tech., Inc.) to move the cursor. Data collected from these sessions is used here to assess algorithm performance, by using the real arm movements as the target trajectories. In hand-control, behavior is segmented into sequences of continuous recordings, separated by time periods during which the monkey's hand is not in view of the tracking device (e.g. when the monkey stops to scratch itself). Each sequence is made up of target-to-target reaching trials (some successful and some not). The targets appeared randomly at the 27 corners, face centers, and middle of a virtual cube whose side was 6cm. The target radii were 2.4cm. A successful trial consisted of one reaching movement that started at rest in one target and ended at rest in the next target (with required target hold periods during which the cursor must not leave the target). The next target appears at some point during the hold period of the previous target. Food reward is provided through a tube after each success. In case of failure, the cursor disappears for 1-2 seconds (failure period). During this time the monkey's hand is still tracked. In the BMI (closed-loop) setting, the monkey's hands were comfortably restrained and the KARMA algorithm's inference was used to move the cursor. Trial failures occurred if the reaching movement took longer than 6 seconds or if the cursor was not kept inside the target during a 0.25s hold period.
During trial-failure periods the inference was stopped, and at the next trial the cursor reappeared where it left off. The trial-failure period was also used to pass the latest recorded trial sequence to the model-learning process, and to replace the working model with an updated one, if available. In this text, X (capital, bold) is a matrix, x (bold) is a vector (so are x_i, x_t and x_t^i) and x is a scalar. (x)^T signifies transposition. We will use x_t^i ∈ IR^q to designate the neural activity (of q cortical units) at time bin t in behavior sequence i, which we refer to as observations. Given a window size

s, x_{t−s+1:t}^i = [(x_{t−s+1}^i)^T, . . . , (x_t^i)^T]^T ∈ IR^{sq} is an sq-long vector comprising a concatenated window of observations ending at time t of trajectory i. x^i is short-hand notation for x_{1:t_f}^i, where t_f^i is the number of steps in the whole i-th trajectory. Similarly, y_t^i ∈ IR^d designates cursor position (d = 3). We refer to y^i as the state trajectory. Given a window size r, y_{t−r:t−1}^i ∈ IR^{rd} is a concatenated vector of states. Estimated states are differentiated from true states (or desired states, as will be explained later) by the addition of a hat: ŷ_t^i. Furthermore (given s and r), we will use v̂_t^i = [(ŷ_{t−r:t−1}^i)^T, (x_{t−s+1:t}^i)^T]^T ∈ IR^{rd+sq} to concatenate windows of estimated states and of neural observations, and v_t^i to concatenate true (rather than estimated) state values.

In the hand-control setting, we are given a (fully observed) data-set of neural activities and state trajectories: {(x^i, y^i)}_{i=1}^n. Our goal is to learn to reconstruct the state trajectories from the neural activities. We adhere to the online learning paradigm in which at each step, i, of the process we are given one observation sequence, x^i, predict ŷ^i, then receive the true y^i and update the model. This allows the model to adapt to changes in the input-output relation that occur over time. In BMI mode, since hand movements were not performed, we do not know the correct cursor movement. Instead, during learning we use the cursor movement generated in the BMI and the positions of the targets that the monkey was instructed to reach to guess a desired cursor trajectory, which replaces the missing true trajectory as feedback. The illustration on the right shows the BMI setup from an algorithmic point of view.

3  KARMA, modeling options, and online learning

As stated earlier, KARMA is a kernelized ARMA. In ARMA,

y_t^i = Σ_{k=1}^{r} A_k y_{t−k}^i + Σ_{l=1}^{s} B_l x_{t−l+1}^i + e_t^i ,

where {A_k}_{k=1}^r and {B_l}_{l=1}^s are the respective Auto-Regressive (AR) and Moving-Average (MA) parameters and e_t^i are residual error terms. Given these model parameters and initial state values, the rest of the state trajectory can be estimated from the observations by recursive application, replacing true state values with the estimated ones. Thus, ARMA inference is essentially the application of a linear (MIMO) IIR filter. Defining W = [A_r, . . . , A_1, B_s, . . . , B_1], the next state estimate is simply ŷ_t^i = W v̂_t^i (see notation section).

Kernelizing ARMA involves application of the kernel trick. A kernel function k(v_1, v_2) : IR^{rd+sq} × IR^{rd+sq} → IR is introduced which, conceptually, can be viewed as a dot product of feature vectors: k(v_1, v_2) = φ^T(v_1) φ(v_2), where the features are possibly complicated functions of both states and (neural) observations. Inference takes the form ŷ_t^i = Σ_{j,τ} α_{jτ} k(v̂_t^i, v_τ^j), where α_{jτ} ∈ IR^d are learned weight vectors and v_τ^j are examples from a training set, known as the support set. Conceptually, KARMA inference can be viewed as ŷ_t^i = W_φ φ(v̂_t^i) where, as compared with ARMA, v̂_t^i is replaced by its feature vector, W is replaced by W_φ = Σ_{j,τ} α_{jτ} φ^T(v_τ^j), and each recursive step of KARMA is linear regression in the feature space of observations + states. The weights {α_{jτ}} are learned so as to solve the following optimization problem (presented in its primal form):

argmin_{W_φ}  ‖W_φ‖² + c Σ_{i,t,k} |(y_t^i)_k − (W_φ φ(v_t^i))_k|_ε ,

where ‖W‖² = Σ_{a,b} (W)²_{ab} is the Frobenius matrix norm, the sum in the second term is over all trials, times and state dimensions of the examples in the training set, |v|_ε = max{0, |v| − ε} is the ε-insensitive absolute error, and c is a constant that determines the relative trade-off between the first (regularization) term and the second (error) term.
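The recursive train/infer structure just described can be sketched in a few lines. For simplicity, the sketch below uses linear kernels, for which KARMA reduces to (regularized) ARMA, and ridge regression in place of the ε-insensitive SVR loss; class and parameter names are ours, not from the thesis code:

```python
import numpy as np

class LinearARMA:
    """Minimal sketch of KARMA's train/infer structure with linear kernels.
    fit() regresses y_t on v_t = [y_{t-r:t-1}; x_{t-s+1:t}] using the true
    previous states (teacher forcing); infer() feeds estimates back."""
    def __init__(self, r=2, s=2, lam=1e-3):
        self.r, self.s, self.lam = r, s, lam

    def _v(self, ys, xs, t):
        # v_t: r previous states followed by the s latest observations
        past_y = np.concatenate([ys[t - k] for k in range(self.r, 0, -1)])
        past_x = np.concatenate([xs[t - l] for l in range(self.s - 1, -1, -1)])
        return np.concatenate([past_y, past_x])

    def fit(self, xs, ys):
        start = max(self.r, self.s - 1)
        V = np.array([self._v(ys, xs, t) for t in range(start, len(xs))])
        Y = np.array([ys[t] for t in range(start, len(xs))])
        # ridge estimate of W = [A_r .. A_1, B_s .. B_1]
        self.W = Y.T @ V @ np.linalg.inv(V.T @ V + self.lam * np.eye(V.shape[1]))
        return self

    def infer(self, xs, y_init):
        est = list(y_init)                       # known initial states
        for t in range(len(est), len(xs)):
            est.append(self.W @ self._v(est, xs, t))
        return np.array(est)
```

With the Spikernel (or any nonlinear kernel) in place of the linear one, the same structure is trained by a standard SVR solver on the preprocessed pairs (v_t^i, y_t^i), as described in the text.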
Note that during learning, the states are estimated using the true/desired previous state values as input instead of the estimated ones (contrary to what is done during inference). "Luckily", this optimization problem reduces to standard SVR, where x_t^i is replaced with v_t^i. This replacement can be done as a preprocessing step in learning, and a standard SVR solver can then be used to find the weights. Inference requires plugging in the previously estimated state values as part of the inputs between iterative calls to SVR inference.

Application of KARMA to a specific domain entails setting some key hyper-parameters that may have drastic effects on performance. The relatively simple ones are the window sizes (r and

s) and the trade-off parameter c. Beyond the necessary selection of the cursor trajectories as the states, augmented state dimensions (whose values are known during training and inferred during model testing) can be added in order to make the model use them as explicit features. This idea was tried in our hand-tracking experiments using features such as absolute velocity and current trial state (reaching target, holding at target, and trial-failure time). But since results did not improve significantly, we discontinued this line of research. The kernel function and its parameters must also be chosen. Note that the kernel in this algorithm is over structured data, which opens the door to a plethora of choices. Depending on one's view, this can be seen as an optimization curse or as a modeling blessing. It obviously complicates the search for effective solutions, but it allows one to introduce domain knowledge (or assumptions) into the problem. It can also be used as a heuristic for testing the relative contribution of the assumptions behind the modeling choices. For example, by choosing r = 0 the algorithm reduces to SVR and the state model (and its dynamics) is ignored. By selecting a kernel which is a linear sum of two kernels, one for states and one for observations, the user assumes that states and observations have no "synergy" (i.e. each series can be read without taking the other into account). This is because summing kernels is equivalent to calculating the features on their inputs separately and then concatenating the feature vectors. Selecting linear kernels reduces KARMA to ARMA (using its regularized loss function). In online learning, one may change the learned model between consecutive inferences of (whole) time series. At learning step k, all of {(x^i, y^i)}_{i=1}^k are available for training.
A naive solution would be to re-learn the model at every step; however, this would not be the best solution if one believes that the source of the input-output relation is changing (for example, in our BMI, cortical units may change their response properties, or units may appear or disappear during a recording session). Also, it may not be feasible to store all the sequences, or learning may take too long (opening up a delay between data acquisition and the readiness of a new model). If the resulting model has too many support vectors, too much time is required for each inference step (which must take less than 50ms in our BMI setup). We deal with all the issues above by limiting the number of examples (v_t^i) that are kept in memory (to 5000 in hand-control tracking and 3000 for real-time use in the BMI). At each online learning iteration, the latest data is added to the pool of examples one example at a time, and if the limit has been reached another example is selected at random (uniformly over the data-set) and thrown out. This scheme gives more mass to recent observations while allowing for a long tail of older observations. For a 3000-sized database and examples coming in at the rate of one per 50ms, the cache is filled after the first 150 seconds. Afterwards, the half-life (the time required for an example to have a 50% chance of being thrown out) of an example is approximately 104 seconds; conversely, at each point, approx. 63% of the examples in the database are from the last 150 seconds and the rest are earlier ones. This scheme keeps the inference time constant and seems reasonable in terms of rate of adaptation. We chose 5000 for the tracking (hand-control) experiments since in those experiments there is no real-time inference constraint and performance improves a bit (suggesting that the 3000 size is not optimal in terms of inference quality).
The similarity between consecutive examples is rather high, as they share large parts of their time windows (when the ARMA parameters r or s are large). Throwing away examples at random has the desired effect of lessening the dependency between the remaining examples.
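The half-life and 63% figures above follow directly from uniform random eviction; the following quick check derives them analytically (function names are ours; the cache size and 50ms arrival rate are from the text).

```python
import math

def half_life_seconds(cache_size, dt=0.05):
    """Time for a cached example to have a 50% eviction chance.

    Once the cache is full, each new arrival evicts one uniformly random
    stored example, so survival over m arrivals is (1 - 1/N)^m."""
    m = math.log(0.5) / math.log(1.0 - 1.0 / cache_size)
    return m * dt

def recent_fraction(cache_size):
    """Expected fraction of the cache coming from the last N arrivals."""
    expected = sum((1.0 - 1.0 / cache_size) ** j for j in range(cache_size))
    return expected / cache_size

print(round(half_life_seconds(3000)))    # ~104 s, matching the text
print(round(recent_fraction(3000), 3))   # ~0.632, i.e. approx. 63%
```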

4 Open-loop hand tracking testing

To test various parametrizations of KARMA and to compare its performance to other methods we used data from 11 hand-control recording days. These sessions vary in length from 80 to 120 minutes of relatively continuous performance on the part of the monkey. Success rates in this task were in the 65-85% range. Cortical units were sorted in an automated manner every day with additional experimenter tuning. The number of very well-isolated single units ranged from 21 to 41. The remaining units consisted of 140-150 medium-quality and multi-units, which the sorting software often split into more than one channel. Most of the different algorithms that are compared here have free hyper-parameters that need to be set (such as a Gaussian kernel width for spike rates, the maximal informative time window of neural activities, s, and the c trade-off parameter). We had a rough estimate for some of these from previous experiments using similar data (see [8]). To fine-tune these parameters, a brute-force grid search was performed on data from one of the 11 sessions in a (batch) 5-fold cross validation scheme. Those parameters were then kept fixed.

Earlier experiments showed the Gaussian kernel to be a good candidate for comparing neural spike rate vectors. It can also be calculated quickly, which is important for the BMI real-time constraint. We tested several variations of structured kernels on neuro-movement inputs. These variations consisted of all combinations of summing or multiplying Gaussian or linear kernels for the spike rates and movement states. Taking a sum of Gaussians or their product produced the best results (with no significant difference between these two options). We chose the sum (having the conceptual inference form: ŷ_t^i = W_ψ ψ(ŷ_{t−r:t−1}^i) + W_φ φ(x_{t−s:t}^i), where ψ, φ are the infinite feature vectors of the Gaussian kernel). The next best result was achieved by summing a Gaussian spike rate kernel and a linear movement kernel (which we will call lin-y-KARMA). The sum of linear kernels produces ARMA (which we also tested). The test results that are presented in this study are only for the remaining 10 recording sessions. The main performance measure that we use here is the (Pearson) correlation coefficient (CC) between true and estimated values of positions (in each of the 3 movement dimensions). To gauge changes in prediction quality over time we use CC in a running window of sequences (window size is chosen so as to decrease the noise in the CC estimate). In other cases, the CC for a whole data-set is computed. To illustrate KARMA's performance we provide a movie (see video 1 in supplementary material) showing the true hand position (in black) and KARMA's tracking estimate (in blue) during a continuous 150 second sequence of target-to-target reach movements. This clip is of a model that was learned in an online manner using the previous (180) sequences, using a support vector cache of 3000 (as described in section 3). The initial position of the cursor is not given to the algorithm. Instead, the average start position in previous sequences is given as the starting point.
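The kernel-sum observation from section 3 (summing kernels is equivalent to computing features separately and concatenating the feature vectors) can be checked exactly for linear kernels; for Gaussian kernels the same identity holds with the implicit infinite feature maps ψ, φ. A toy check with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
y1, x1 = rng.standard_normal(4), rng.standard_normal(6)  # state / observation windows
y2, x2 = rng.standard_normal(4), rng.standard_normal(6)

# Sum of per-part linear kernels...
k_sum = y1 @ y2 + x1 @ x2
# ...equals one linear kernel on the concatenated feature vectors.
k_cat = np.concatenate([y1, x1]) @ np.concatenate([y2, x2])
assert np.isclose(k_sum, k_cat)
```

This is exactly why a summed kernel encodes the "no synergy" assumption: each part contributes its features independently.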
The CC in the time window of the previous 40 sequences (0.9, 0.92 and 0.96 for the 3 dimensions) is given to provide a feeling of what such CC's look like. Similarly, Figure 1.B shows tracked and true position values for an 80 second segment towards the end of a different session. KARMA achieves good performance and it does so with relatively small amounts of data. Figure 1.A shows tracking quality in terms of a running window of CC's over time. CC's for the first sequences are calculated on predictions made up to those times. While these CC's are noisier, it is clear that a useful model is reached within the first 3 minutes (CC's all higher than 0.6) and close to peak performance is already available within the first 10 minutes.
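For reference, the running-window CC can be sketched as below (window size and data are illustrative, and the sketch runs over samples, whereas the paper windows over whole sequences):

```python
import numpy as np

def running_cc(true, est, window=20):
    """Pearson CC between true and estimated traces in a running window
    (shorter at the start), one movement dimension at a time."""
    ccs = []
    for t in range(1, len(true) + 1):
        a, b = true[max(0, t - window):t], est[max(0, t - window):t]
        ccs.append(np.corrcoef(a, b)[0, 1] if len(a) > 1 else np.nan)
    return np.array(ccs)

t = np.linspace(0, 1, 50)
true = np.sin(2 * np.pi * t)                                # toy position trace
noisy = true + 0.05 * np.random.RandomState(0).randn(50)    # toy estimate
cc = running_cc(true, noisy)
```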

Figure 1: KARMA performance: (A) Correlation coefficients in a running window of 20 sequences (or less at the session start) for a whole 95 minute session (mean CC window size is 9.7 minutes). (B) True (gray) vs. tracked positions in an 80 second segment at minute 90 of the session. (C) Effect of losing recording electrodes: tracking was performed over a full recording session using randomly chosen subsets of electrodes. For each selected number of electrodes (from the full 92 available down to 5 electrodes) 50 repetitions were run. CC's were calculated per run over the last two thirds of the session (to avoid effects of initial performance transients) and averaged over the 3 movement dimensions. Their distributions (over repetitions) are shown in terms of the median CC (red center line), quartile values (skirt containing 50% of the CC's) and extremal values (whiskers) for each number of electrodes. (D) Effect of delay time between training data and test data: For the session shown in (A), marked 'day 1', and for another session (marked 'day 2'), hand movements in 20 minute time windows towards the session ends were reconstructed in an online manner, but instead of using the same sequences as training data (with a 1 step delay), earlier sequences were used. Figure (A) shows the time window that corresponded to opening a maximal time difference of 60 minutes between the last inferred sequence (at minute 90) and the last learned sequence (at minute 30). CC's for the test windows (averaged over movement dimensions) are shown as a function of delay time for the two days.


Figure 2: Algorithm comparisons: 10 hand-movement sessions were each divided into 3 equally long blocks of 25-35 minutes (the last few minutes were discarded since during this time the monkey often stopped paying attention to the task) to create 30 data-sets. The following algorithms were run on each data-set in an online manner: KARMA, lin-y-KARMA, ARMA and SVR. All four were implemented as versions of KARMA by varying its parameters. In all cases a support vector cache of 5000 was enforced as described in section 4. A Kalman Filter was also implemented so as to allow for a window of observations as input and a window of movement positions as the state (this version was previously shown to outperform the standard Kalman Filter, which has r = s = 1). It was also learned in an online manner, replacing inverses with pseudo-inverses where necessary to avoid non-invertible matrices when data-sets are too small. Results are shown as scatter plots of CC values (30 data-sets and 3 movement dimensions produce 90 points per scatter plot). Each point compares KARMA to another algorithm in a specific data-set and movement dimension pair. Points above the diagonal mean a higher score for KARMA. The graph on the left shows win-scores for each pair of algorithms. Win-score is defined as the percentage of times one algorithm got a higher CC than another. Edge direction points to the loser. The movement reconstruction on the right (horizontal position only) shows KARMA vs. SVR in a sample 18 second window.

Probably the largest source of variability in BMI performance across subjects is the location and number of electrodes in the brain. To test how performance with KARMA would degrade if electrodes were lost we simulated an electrode dropping experiment (see figure 1.C). Let's consider a CC of 0.7 as a minimal reasonable performance quality. Let's also assume that with 50 repetitions, minimal values roughly represent a worst case scenario in terms of mishaps that do not involve movement of the whole array. Then it seems that we can get by easily with only a third of the array (28 electrodes) operational. In terms of medians (average bad luck) we could do with less. Maximal values are relevant in case we need to choose the good electrodes. This may be relevant in situations involving implanted chips that extract and wirelessly transmit neural activity and may have constraints in terms of energy expenditure or bandwidth. Most BMI experiments (e.g. [2, 13], with the exception of [1]) use fixed models that are learned once at the beginning of a session. Our version of KARMA is adaptive. In order to check the importance of adapting to changes in the recorded neural activity we ran an experiment in which variable delays were introduced between the inference times and the latest available data for learning, i.e. after inference of sequence y^i from x^i, the sequence pair (x^{i−k}, y^{i−k}), with k > 0, was the most recent one made available for learning. Figure 1.D shows a degradation in performance during the test periods as the delay grows for two recording sessions. This suggests that adaptability of the algorithm is important for keeping high performance levels. There are two possible reasons for the observed degradation. One is changes in the neural activity within the brain. The other is changes in the process that extracts the neural activity in the BMI (e.g. electrode movements). Differentiating between the two options is a subject of future work. In the BMI setting, feedback is involved.
The subject might be able to effortlessly modulate his neural activity and keep it in good fit with the algorithm. In section 5 we address this issue by running BMI sessions in which the model was frozen. Comparison of KARMA to other methods is shown in figure 2. It is clear that KARMA performs much better than ARMA and the Kalman Filter, suggesting that a nonlinear interpretation of neural activity is helpful. While KARMA is statistically significantly better than SVR, the differences in CC values are not very big (note the scaling difference of the scatter plots). Looking at the movement reconstruction comparison, it seems that SVR has a good average estimate of the current position; however, missing a movement model (SVR has the form ŷ_t^i = W_φ φ(x_{t−s:t}^i)), it fluctuates rapidly around the true value. This fluctuation may not be very apparent in the CC values, however it would make a BMI much more difficult to control. Lin-y-KARMA uses a linear movement model (and has the form ŷ_t^i = A ŷ_{t−r:t−1}^i + W_φ φ(x_{t−s:t}^i)). Its performance is inferior to full KARMA. Having a nonlinear movement model means that different areas of the state-space get treated in a locally relevant fashion. This might explain why full KARMA outperforms. Note that the difference between lin-y-KARMA and SVR is not very significant (win-score of only 65.6%). Comparison to the Population Vector Algorithm was also done, however the PVA achieved especially bad results for our long sequences since it accumulates errors without any decay (this is less of a problem in a BMI since the subject can correct accumulated errors). We therefore omit showing them here.

Figure 3: All graphs show success rates. The light, noisy plots are success rates in consecutive bins of 10 trials while the smoother plots are the result of running a smoothing filter on the noisy plots. Mixing mode was used on day 1 and part of day 2. Afterwards we switched to full BMI (mixing factor 100%). Hands were allowed to move freely until day 4. On day 4 both hands were comfortably restrained for the first time and though performance levels dropped, the monkey did not attempt to move its hands. On day 5 an attempt to freeze the model was made. When performance dropped and the monkey became agitated we restarted the BMI from scratch and performance improved rapidly. Day 6 consists of full BMI but with the targets not as far apart as with hand-control. This makes the task a bit easier and allowed for higher success rates. On days 7 and 8 a BMI block was interleaved with hand control blocks. Only the BMI blocks are shown in the top graph. The full three blocks of day 8 are shown in the bottom right graph. Bottom left graph shows a recording session during which the model was frozen.

5 BMI experiment

The BMI experiment was conducted after approximately four months of neural recordings during which the monkey learned and performed the hand-control task. Figure 3 shows a trace of the first days of the BMI task. A movie showing footage from these days is in the supplementary material (video 2). To make the transition to BMI smooth, the first 1.5 days consisted of work in a mixed mode, during which control of the cursor was a linear combination of the hand position and KARMA's predictions. We did this in order to see if the monkey would accept a noisier control scheme than it was used to. During the next 1.5 days the cursor was fully controlled by KARMA, but the monkey kept moving as if it was doing hand-control, i.e. it made movements and corrections with its hands. On day 4 the monkey's free hand was comfortably restrained. Despite our concerns that the monkey would stop performing, it seemed impervious to this change, except that control over the cursor became more difficult. On days 5 and 6 we made some tweaks to the algorithm (tried to freeze learning and change the way in which correct movement trajectories are generated) and the

task (tried to decrease target size and the distance between targets), which had some effect on task difficulty and on success rates. On days 8 and 9 we interleaved a BMI block with hand-control blocks. We saw that performance is better in hand-control than in BMI, but not drastically so. In the following sessions we discontinued all tweaking of the algorithm and saw steady improvement in performance. We repeated the freezing of model learning on two days (one of these sessions appears in figure 3). In all cases where we froze the model, we noticed that the monkey started experiencing difficulty in controlling the cursor after a period of 10-15 minutes and stopped working completely when this happened. As stated earlier, in most BMI experiments the subjects interact with fixed models. One possible explanation for the deterioration is that because our monkey was trained using an adaptive model, it does not have experience in conforming to such a model's constraints. Instead, the opposite burden (that of following a drifting source of neural activity) falls on the algorithm's shoulders. As mentioned in section 2, in BMI mode no hand movements are performed and therefore model learning is based on our guess of what the desired cursor trajectory is (the monkey's intended cursor movement). We chose to design the desired trajectory as a time-varying linear combination of the cursor trajectory that the monkey saw and the target location: y_t^i = (1 − t/t_f) ŷ_t^i + (t/t_f) ỹ^i, where ỹ^i is the target location on trial i. Note that this trajectory starts at the observed cursor location at the trial start and ends at the target location (regardless of where the cursor actually was at the end of the trial).
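The desired-trajectory blend can be written out directly; a minimal sketch (the function name and toy values are ours, not from the paper):

```python
import numpy as np

def desired_trajectory(cursor, target, t_f):
    """Blend y_t = (1 - t/t_f) * observed cursor + (t/t_f) * target:
    starts at the observed cursor position and ends exactly on the target."""
    w = np.arange(len(cursor))[:, None] / float(t_f)
    return (1.0 - w) * cursor + w * target

rng = np.random.default_rng(1)
cursor = np.cumsum(rng.standard_normal((11, 3)), axis=0)  # observed cursor path
target = np.array([1.0, 2.0, 3.0])                        # trial target location
desired = desired_trajectory(cursor, target, t_f=10)
```

The endpoint behavior described in the text (start at the observed cursor, end on the target) follows from the weights being 0 and 1 at the trial's start and end.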

6 Summary

This study was motivated by the view that lifting overly simplifying assumptions and integrating domain knowledge into machine learning algorithms can make significant improvements to BMI technology. In turn, this technology can be used as a testbed for improved modeling of the interaction between the brain and environment, especially in visuomotor control. We showed in open-loop hand tracking experiments that the incorporation of a nonlinear movement model, interpreting the neural activity as a whole (rather than as the sum of contributions made by single neurons) and allowing the model to adapt to changes, results in better predictions. The comparison was done against similar models that lack one or more of these properties. Finally, we showed that this model can be used successfully in a real-time BMI setting, and that the added mathematical ’complications’ result in a very intuitive and high quality interface.

References

[1] D. M. Taylor, S. I. Helms Tillery, and A. B. Schwartz. Direct cortical control of 3D neuroprosthetic devices. Science, 296(7):1829-1832, 2002.
[2] M. Velliste, S. Perel, M. C. Spalding, A. S. Whitford, and A. B. Schwartz. Cortical control of a prosthetic arm for self-feeding. Nature (online), May 2008.
[3] S. Kakei, D. S. Hoffman, and P. L. Strick. Muscle and movement representations in the primary motor cortex. Science, 285:2136-2139, 1999.
[4] E. Vaadia, I. Haalman, M. Abeles, H. Bergman, Y. Prut, H. Slovin, and A. Aertsen. Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373:515-518, February 1995.
[5] W. Wu, Y. Gao, E. Bienenstock, J. P. Donoghue, and M. J. Black. Bayesian population coding of motor cortical activity using a Kalman filter. Neural Computation, 18:80-118, 2005.
[6] A. E. Brockwell, A. L. Rojas, and R. E. Kass. Recursive Bayesian decoding of motor cortical signals by particle filtering. J Neurophysiol, 91(4):1899-1907, 2004.
[7] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia. Spikernels: Predicting hand movements by embedding population spike rates in inner-product spaces. Neural Computation, 17(3), 2005.
[8] L. Shpigelman, K. Crammer, R. Paz, E. Vaadia, and Y. Singer. A temporal kernel-based model for tracking hand movements from neural activities. In NIPS. 2005.
[9] P. M. L. Drezet and R. F. Harrison. Support vector machines for system identification. In UKACC, volume 1, pages 688-692, Sep 1998.
[10] M. Martínez-Ramón, J. Luis Rojo-Álvarez, G. Camps-Valls, J. Muñoz-Marí, A. Navia-Vázquez, E. Soria-Olivas, and A. R. Figueiras-Vidal. Support vector machines for nonlinear kernel ARMA system identification. IEEE Trans. Neural Net., 17(6):1617-1622, 2006.
[11] G. E. P. Box and G. M. Jenkins. Time Series Analysis: Forecasting and Control. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1994.
[12] B. Schoelkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.
[13] L. R. Hochberg, M. D. Serruya, G. M. Friehs, J. A. Mukand, M. Saleh, A. H. Caplan, A. Branner, D. Chen, R. D. Penn, and J. P. Donoghue. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature, 442:164-171, 2006.


Chapter 3

Discussion

This work, carried out over several years, involved constant activity in the lab: developing the 3D interactive setup, training monkeys, fixing hardware, adding real-time closed-loop abilities, testing different options and re-planning our next steps, while also researching machine learning methods for hand tracking and closed-loop BMI control. The result is a state-of-the-art setup whose algorithmic aspect is, in our opinion, (to date) the most advanced in the field. The four papers that comprise this work are all directed towards the combined goal of gaining a better understanding of how intended behavior is encoded by the motor cortex while contributing to a series of improvements in decoding quality. Each of the papers tackles the problem from a different angle. The Spikernel [51] provides a method for measuring distances between spike patterns of a large neuronal population. This kernel can capture dependencies between neurons at various time lags. It also allows (depending on parameter settings) for some time-warping and for setting the sensitivity of the bin-by-bin comparison. This kernel was shown to outperform standard kernels and the Euclidean distance measure (the linear kernel) in an SVR setting. Specific choices of the Spikernel parameters can reduce it to the Gaussian kernel (which allows no time warping). The fact that the Spikernel performs better than a Gaussian kernel shows that allowing time warping within this context is beneficial. However, whether time warping is a feature of neural patterns or a desirable trait in the SVR algorithmic setting is hard to say. The same critique goes for the setting of the optimal time-bin size (100ms) and number of time bins (10 bins, going back 1 second) in this work.

The work on feature selection [35] shows that appropriate selection of a subset of cortical units and their relevant time bins for use within a non-linear regression method (a regression version of k-nearest-neighbor) can improve movement reconstruction performance. This method, as opposed to the other methods in the paper (which offer inferior results) and the work in [60], treats the selected subset of features as a whole. If features were independent (as is often the assumption in analysis methods of neuronal data) then the difference in performance would probably have been smaller.

The novel algorithm developed in the third paper [49] incorporates a linear model of movement dynamics. This model allows integration of velocity and acceleration in order to reconstruct position. Because the movement model is linear, the dependency of new predictions on prior predictions within the same trial can be transformed into a (decaying) dependency on all the neural activity since the beginning of that trial (by opening the recursion). The transfer of dependency from outputs to inputs allows us to reduce the optimization problem to the same setting as in SVR, but with (yet another) novel kernel matrix which uses the Spikernel as a base element.
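"Opening the recursion" can be illustrated numerically: with a linear, stable movement model, the recursive prediction equals a sum of all past inputs weighted by decaying powers of the dynamics matrix. A toy sketch with made-up values:

```python
import numpy as np

A = np.array([[0.9, 0.05],
              [0.0, 0.8]])                 # toy stable linear movement model
rng = np.random.default_rng(0)
u = rng.standard_normal((30, 2))           # per-step neural terms (W_phi phi(x_t))

y = np.zeros(2)                            # recursive form: y_t = A y_{t-1} + u_t
for t in range(30):
    y = A @ y + u[t]

# Unrolled form: y_T = sum_k A^k u_{T-k}; old inputs decay with A^k.
y_unrolled = sum(np.linalg.matrix_power(A, k) @ u[29 - k] for k in range(30))
assert np.allclose(y, y_unrolled)
```

The unrolled form is exactly the decaying dependency on all past inputs that makes the SVR reduction possible.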

In a post-publication analysis I found that the learned model parameters produced a movement model (matrix A in the paper) that promotes movement with constant acceleration (which is then further steered by neural input). This movement model fits the minimum-jerk paradigm. The comparison with other algorithms allows us to show that the movement dynamics and the non-linear treatment of the input each contribute to the success of this method. This provides further reinforcement of the importance of these features in the neural decoding problem.


The fourth paper involves a change in the problem setup. We no longer reconstruct velocities or positions of performed behavior. In the BMI setting, the learning algorithm and the subject interact with each other to produce behavior. If the result is far from what the subject intended we may not only lose the trial but eventually also the subject's cooperation (whether our subject is a non-human or human primate). This problem intensifies our need to build a BMI with a fast learning curve. The combination of conditions and requirements outlined in the introduction rules out the use of Spikernels and the DDT method of the third paper. Calculating a Spikernel involves a dynamic programming method and takes much longer than a simple Gaussian kernel. The improvement in performance due to the use of a better kernel would likely be offset by the need to use a smaller number of support vectors in order to meet the required inference time. DDT, without further development, requires keeping the data from each training-set trial that precedes a support vector from that same trial. This non-sparsity causes difficulties in managing memory size. Furthermore, the Gram matrix width and height in DDT are the number of support vectors multiplied by the number of state dimensions (in the case of a six-dimensional state, as was presented in the paper, this means a 36-fold increase in the Gram matrix size as compared with regular SVR).

A new approach was needed for the BMI setting. KARMA is a natural replacement. It allows using either the same state space as in DDT or a state space that is comprised of several previous estimates. This allowed us to replace the state space that included position, velocity and acceleration with one that holds the previous 4 positions (estimates of the velocities and accelerations can be linearly derived from this state) without any significant loss in prediction quality, while gaining in terms of inference speed and memory footprint. KARMA treats neural activity in a similar manner to DDT and SVR (i.e. through the kernel's mapping to feature space). Our positive results in the previous studies suggested that this feature of the algorithm should be preserved. KARMA allows us to throw away examples not in the support set¹. KARMA (and DDT as well) allow the introduction of new state dimensions that are comprised of data that is available as part of the training set but is unknown during inference. These auxiliary state components may aid in predicting the main state components (hand position in our case). An example of such a possible auxiliary state component is the trial target's position. The motivation for adding another state component is that if this state component helps to explain the neural activity or has some direct bearing on the main state components, it could help predict those main components. By testing which auxiliary state components are easy to estimate or improve results we can infer what features of behavior are encoded in the neural activity. This notion was tested in some of my preliminary work. There is a trade-off, however, since inference time is limited. The improvement due to better state modeling must outweigh the degradation due to decreasing the number of support vectors and the longer times needed for model updating. I was unable to find auxiliary state components that merited addition under these conditions.
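The claim that velocities and accelerations are linearly derivable from a short window of positions is elementary finite differencing; a one-dimensional sketch (the matrix below is illustrative, not taken from the paper):

```python
import numpy as np

# State = last 4 positions [p_{t-3}, p_{t-2}, p_{t-1}, p_t] (one dimension).
# Position, finite-difference velocity and acceleration are linear maps of it:
D = np.array([[0.0, 0.0,  0.0, 1.0],   # p_t
              [0.0, 0.0, -1.0, 1.0],   # v_t = p_t - p_{t-1}
              [0.0, 1.0, -2.0, 1.0]])  # a_t = p_t - 2 p_{t-1} + p_{t-2}

positions = np.array([1.0, 4.0, 9.0, 16.0])   # p = t^2 at t = 1..4 (unit steps)
pos, vel, acc = D @ positions
assert np.allclose([pos, vel, acc], [16.0, 7.0, 2.0])
```

Since the map D is linear, no information is lost by storing positions only, which is why swapping the (position, velocity, acceleration) state for a position window costs nothing in expressiveness.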

KARMA allows non-linear treatment of the previous states through the use of a kernel that works on both states and observations. Using such kernels can allow modeling that treats neural activity in a way that is also dependent on the state. For example, by using a Gaussian kernel over both previous state and current neural activity (as one long concatenated vector) we can get a model that treats neural activity in a way that is localized by the current position or velocity. This idea was tested in the paper by comparing a kernel that uses a product of Gaussian kernels over states and observations with a kernel that uses a sum of those Gaussian kernels. The product-of-Gaussians kernel's performance was almost identical to that of the sum-of-Gaussians kernel. This finding does not support a need for context-dependent interpretation of neural activity (which I was hoping to find).

¹ This requires a bit of data manipulation, which enlarges the memory requirement per support vector.


The adaptive BMI setting introduces a need for us to guess the subject's intended behavior given the end-result of the trial, i.e. given the trajectory that the algorithm produced and, possibly, extra knowledge such as where the target was or which object was being reached for, etc. We can then generate a better estimate of the desired trajectory (as opposed to that which was generated online by our algorithm) and use the desired trajectory as labeling of the neural activity in order to adapt the model. In our study we tested one such scheme (using a time dependent linear combination of observed cursor trajectory and target position). There are various other ways that could have been tried and may have been more successful.

We risk degrading our model by continuously adapting it, especially if the subject stops cooperating and no longer intends to reach the presented targets. To deal with this situation we need to estimate whether the neural activity in the last trial reflects an intention to reach the target or perhaps reflects the subject's intention to get some sleep. In the lab setting we could check whether or not the trial ended in failure and allow neural activity from successful trials to have a larger influence on the model than neural activity from failed trials. In a real-world situation, where the intended action is not externally specified, we can make educated guesses by looking at parameters such as movement speed, and assume that any target that was reached (e.g. an icon on the computer screen) was reached intentionally and is the equivalent of a successful trial.

One potential BMI issue that had concerned us is the possibility that a continuously adapting model would cause the interface to change too quickly, and our subject would end up spending the session trying to adapt to the model without being able to effectively control it. Our experiments revealed an opposite effect; as long as model adaptation was turned on, our monkey had no trouble reaching the targets. Once the model was frozen, performance dropped noticeably within 15-20 minutes. It seems that by allowing our algorithm to adapt quickly, it did a good job of following changes in our subject's neural activity rather than causing them. This heightens the need for adaptivity; however, this need may be diminished by proper training of the subjects.

As mentioned in the introduction, choosing decoding methods and modeling options involves a trade-off between decoding power and model interpretability. The kernel methods used in my studies produce models that do not yield themselves to simple analysis. For example, it is quite difficult to deduce how a neuron's activity is influenced by current behavior, external stimuli and other neurons' activities from a learned KARMA model. The KARMA model implies that neuronal activity is influenced by all three, but manipulating the model's equations and isolating a neuron's spike rate as a function of the current activity of other neurons and the current state (which may include current behavior and current external stimuli) is intractable.

Although the models presented in my work do not yield simple interpretations of neural encoding, they can still be probed in order to gain some understanding of what they have learned.

I do so in the papers by manipulating model meta-parameters such as the choice of kernel or the relative contribution of the movement dynamics model. We learn something about the relative importance of a component by examining the effect of its removal on prediction power. This type of analysis is analogous to a lesioning study, where a deleted component is analogous to a lesioned part of the brain. The same critique that applies to lesioning studies (e.g. the lesion damages the operation of the remaining tissue) applies here (e.g. deleting an algorithmic component damages the workings of the remaining components). Nevertheless, this method does provide supportive evidence regarding the function and importance of the various components.
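To make the lesioning analogy concrete, here is a minimal numpy sketch on simulated data (not the thesis recordings): the same regressor is fitted with and without a group of input features, and the group's importance is read off the drop in held-out prediction power.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 "neurons", of which only the first 5 actually drive behavior.
X = rng.standard_normal((500, 20))
y = X[:, :5] @ np.array([1.0, 0.8, 0.6, 0.4, 0.2]) + 0.3 * rng.standard_normal(500)
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

def r2_with_features(cols):
    """Least-squares fit on a feature subset; R^2 on held-out data."""
    w, *_ = np.linalg.lstsq(Xtr[:, cols], ytr, rcond=None)
    resid = yte - Xte[:, cols] @ w
    return 1 - resid.var() / yte.var()

full = r2_with_features(list(range(20)))
# "Lesion" a component by deleting its features and refitting.
informative_removed = r2_with_features(list(range(5, 20)))
uninformative_removed = r2_with_features(list(range(15)))
# Deleting the informative group hurts prediction; deleting the rest barely does.
```

The same caveat from the text applies here: refitting after deletion changes the remaining weights, just as a lesion changes the remaining tissue.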

There are multiple possible future research directions that are natural extensions of this work. The following list is a small sample of the many open directions and questions that this work raises but leaves unanswered. These ideas would all extend our understanding of how motor control is carried out by our brain, and most should also improve our decoding / BMI abilities.



One research direction involves replacing or extending the decoding models' state space and state dynamics. For example, instead of assuming that the current state should only be the previous hand positions, we could use an EM-like method to find extra state dimensions (as in [40]). We could also learn to predict which part of the trial we are in (e.g. movement planning, holding at the current location, initial movement, final feedback-dependent corrections, etc.) and incorporate this into a possibly hierarchical model. By explicitly learning models that are time-localized to different parts of the trial (or different tasks) we may do better at identifying different modes of operation of the same recorded group of neurons in those different tasks (and, obviously, do better in both the off-line tracking and BMI settings).



Another way of producing interpretable models is to apply system identification techniques to a learned (difficult-to-interpret) model. One simple application of this idea would be to fit linear models to a non-linear one, where linearization is done locally at various locations in the model's state space. Comparing the linear models fitted at different locations would highlight which features are most important in different areas of state space.
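A sketch of this local-linearization idea, assuming only a black-box predictor `f` (here a made-up non-linear function standing in for a learned decoder): perturb the state around a chosen point, fit a linear model by least squares, and compare the local weights at different points.

```python
import numpy as np

def local_linear_fit(f, x0, eps=0.1, n_samples=200, rng=None):
    """Fit a local linear model y ~ b + w.(x - x0) to a black-box
    predictor f by least squares on small perturbations around x0."""
    rng = np.random.default_rng(rng)
    X = x0 + eps * rng.standard_normal((n_samples, x0.size))
    y = np.array([f(x) for x in X])
    A = np.hstack([np.ones((n_samples, 1)), X - x0])  # intercept + deltas
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]  # local intercept, local weights

# Hypothetical non-linear "decoder": which feature matters most
# changes with location in state space, as the text suggests probing.
f = lambda x: np.tanh(x[0]) + 0.5 * x[1] ** 2
b0, w0 = local_linear_fit(f, np.array([0.0, 0.0]), rng=0)
b1, w1 = local_linear_fit(f, np.array([2.0, 2.0]), rng=0)
# Near the origin feature 0 dominates the local weights; far away feature 1 does.
```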



Our recordings were done in primary motor cortex. Many studies show that other areas, such as the parietal reach region, SMA and areas in prefrontal cortex, may be less informative with regard to current movement parameters but may hold other information regarding target position or decisions regarding the task and its goals. Recording from such areas and using models that can predict higher-order features of the task may be a better approach than modeling hand dynamics alone. This direction has potential for improving decoding / BMI accuracy. By comparing the differences in results when using different recording locations, and by comparing differences in optimal model parameters across the various recording locations, we can also gain insight into the roles and interactions of these areas.



Combining different machine learning approaches is, quite often, more beneficial than continued optimization of a single approach. Different approaches often differ in their point of view of the problem, and the true nature of the problem is usually something in between. Combining learning algorithms of an orthogonal nature (e.g. Bayes nets, SVMs and decision trees) can help in capturing different aspects of the problem. This direction is aimed mainly at improving decoding / BMI accuracy but may potentially lead to a better understanding of how different aspects of neural activity (represented by the decoding power of the combined algorithms) combine to produce the desired behavior.
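A minimal sketch of such a combination, with illustrative stand-ins for the base learners (least squares for the global-linear view, k-nearest-neighbour regression for the local-non-linear view, rather than the Bayes nets, SVMs or trees named above): each base model is fitted alone, then a blend of their predictions is fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy decoding problem: the target depends on the inputs both linearly
# and non-linearly, so neither base model alone captures it fully.
X = rng.standard_normal((300, 2))
y = X[:, 0] + np.sin(3 * X[:, 1])
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

def linear_predict(Xtr, ytr, Xq):
    """Least-squares linear model (captures the linear part)."""
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return Xq @ w

def knn_predict(Xtr, ytr, Xq, k=5):
    """k-nearest-neighbour regression (captures local non-linearity)."""
    d = ((Xq[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :k]
    return ytr[idx].mean(axis=1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

p_lin = linear_predict(Xtr, ytr, Xva)
p_knn = knn_predict(Xtr, ytr, Xva)
# Blend weights fitted by least squares over the two prediction streams
# (in practice the blend should be validated on further held-out data).
P = np.stack([p_lin, p_knn], axis=1)
alpha, *_ = np.linalg.lstsq(P, yva, rcond=None)
p_mix = P @ alpha  # the combination is at least as accurate as either base
```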



Just as different machine-learning approaches can capture different aspects of the problem, different recording methods and neural-activity-derived signals (e.g. LFP, spiking activity and EEG) may reveal different types of information. In addition to improving decoding quality, combining recording methods and sources can also make a BMI more robust to changes and can help us contrast the different signals' contents. By seeing which sources of information are most helpful we can, again, deduce something about the calculations that are employed by the brain.



All the algorithms in this work fall within the supervised learning realm. A different approach would be to use unsupervised machine learning. Unsupervised learning is a great tool for finding structures within data. By not directing the learning algorithm at producing desired labels, and instead letting it describe the naturally occurring structures within the data (such as clusters or reduced dimensions), we can sometimes stumble upon a connection between a naturally occurring structure and some external feature of behavior. Depending on whether this connection was previously known or not, we can then either claim to have (somewhat objectively) found a new function of the relevant piece of cortex or, alternatively (and more often), claim to have validated a known function using a different tool. As a concrete example, we can take neural activity during the planning phase of center-out reach movements and learn an LLE model for it [39]. We can then reduce the dimensionality of the input data to 2 and plot a vector map where the vectors' points of origin are the positions chosen by LLE and their directions are dictated by the directions of movement (in 2D space as well). If we see a nice alignment, we can claim that the main reason for the variations in neural activity in MI is the differences in the planned movement directions. If we do not see a nice alignment, we can try to correlate the reduced neural activity with other features of the task and discover other aspects of MI computations. One example of a study taking this approach is that of Churchland et al. [9].
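The alignment check described above can be sketched as follows, on simulated planning-phase activity and with PCA as a simpler linear stand-in for LLE (scikit-learn's `LocallyLinearEmbedding` could be substituted for the SVD step): embed the population rates in 2D, then ask how well the embedding linearly predicts the planned-direction vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated planning-phase rates: 40 "neurons" cosine-tuned to the
# planned movement direction (the very hypothesis the text proposes testing).
n_trials, n_units = 200, 40
theta = rng.uniform(0, 2 * np.pi, n_trials)          # planned directions
pref = rng.uniform(0, 2 * np.pi, n_units)            # preferred directions
rates = 5 + 4 * np.cos(theta[:, None] - pref[None, :])
rates += 0.5 * rng.standard_normal(rates.shape)      # trial-to-trial noise

# Reduce to 2D with PCA (top-2 right singular vectors of centered data).
Xc = rates - rates.mean(0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
emb = Xc @ Vt[:2].T                                  # 2-D embedding

# Alignment test: can the 2-D embedding linearly reconstruct the
# planned-direction vector (cos theta, sin theta)?
target = np.stack([np.cos(theta), np.sin(theta)], 1)
A = np.hstack([emb, np.ones((n_trials, 1))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
resid = target - A @ coef
r2 = 1 - resid.var() / target.var()  # high r2 = "nice alignment"
```

With real recordings, a low `r2` here would be the cue to correlate the embedding with other task features instead, as the text suggests.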



My algorithms use estimates of the current spike rates as input. It is quite possible that additional information can be obtained from spike timing coincidences and other spike codes. Note that the kernel methods presented in my work can be adapted to spike codes by changing the inputs and replacing the kernel used with one that can handle such codes.
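As a hedged illustration of that last point, the sketch below defines a van Rossum-style spike-timing kernel and drops it into kernel ridge regression in place of a rate-based kernel. The spike trains and decoding target are simulated, and kernel ridge regression stands in for the SVR/KARMA machinery of the thesis; the point is only that swapping the kernel is all the adaptation required.

```python
import numpy as np

def spike_kernel(s, t, tau=0.02):
    """van Rossum-style inner product between two spike trains (arrays of
    spike times): for causal exponential filters exp(-t/tau) the inner
    product has the closed form (tau/2) * sum_ij exp(-|s_i - t_j| / tau)."""
    d = np.abs(np.asarray(s)[:, None] - np.asarray(t)[None, :])
    return 0.5 * tau * float(np.exp(-d / tau).sum())

def gram(trains, tau=0.02):
    n = len(trains)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = spike_kernel(trains[i], trains[j], tau)
    return K

# Kernel ridge regression: the rate-based kernel of any kernel method
# can be swapped for a spike-timing kernel like the one above.
def krr_fit(trains, y, lam=1e-4):
    K = gram(trains)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(trains, alpha, query):
    k = np.array([spike_kernel(query, t) for t in trains])
    return float(k @ alpha)

# Toy sanity check: decode the spike count of 100 ms spike trains.
rng = np.random.default_rng(0)
trains = [np.sort(rng.uniform(0, 0.1, rng.integers(1, 20))) for _ in range(60)]
y = np.array([len(t) for t in trains], dtype=float)
alpha = krr_fit(trains, y)
pred = np.array([krr_predict(trains, alpha, t) for t in trains])
```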



Another research direction uses the BMI setup that we developed in order to study neural plasticity. This could be done by requiring the subject to generate certain desired neuronal activities in order to accomplish the task, using conditioning methods. By specifying the desired activity and how differences between activities are measured, we can probe how much control the subject has over the neural activity and its dynamics.

Chapter 4

Bibliography


[1] A. M. Aertsen, G. L. Gerstein, M. K. Habib, and G. Palm. Dynamics of neuronal firing correlation: modulation of "effective connectivity". J Neurophysiol, 61(5):900-917, May 1989.

[2] Fritzie Arce, Itai Novick, Yael Mandelblat-Cerf, and Eilon Vaadia. Learning-induced modulation of motor cortical activity during adaptations to force fields with and without visual feedback. In Society for Neuroscience, 2007.

[3] Fritzie Arce, Itai Novick, Maayan Shahar, Yuval Link, Claude Ghez, and Eilon Vaadia. Differences in context and feedback result in different trajectories and adaptation strategies in reaching. PLoS ONE, 4(1):e4214+, January 2009.

[4] Wyeth Bair and Christof Koch. Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8(6):1185-1202, 1996.

[5] A. E. Brockwell, A. L. Rojas, and R. E. Kass. Recursive bayesian decoding of motor cortical signals by particle filtering. J Neurophysiol, 91(4):1899-1907, 2004.

[6] E. N. Brown, L. M. Frank, D. Tang, M. C. Quirk, and M. A. Wilson. A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18:7411-7425, 1998.

[7] Jose M. Carmena, Mikhail A. Lebedev, Roy E. Crist, Josef E. O'Doherty, David M. Santucci, Dragan F. Dimitrov, Parag G. Patil, Craig S. Henriques, and Miguel A. L. Nicolelis. Learning to control a brain-machine interface for reaching and grasping by primates. PLOS Biology, 1(2):001-016, 2003.

[8] J. K. Chapin, K. A. Moxon, R. S. Markowitz, and M. A. Nicolelis. Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience, 2:664-670, 1999.

[9] M. M. Churchland, B. M. Yu, M. Sahani, and K. V. Shenoy. Techniques for extracting single-trial activity patterns from large-scale neural recordings. Current Opinion in Neurobiology, 17(5):609-618, October 2007.

[10] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, New York, NY, USA, 1999.

[11] John P. Donoghue. Bridging the brain to the world: A perspective on neural interface systems. Neuron, 60(3):511-521, 2008.

[12] E. V. Evarts. Relation of pyramidal tract activity to force exerted during voluntary movement. Journal of Neurophysiology, 31(1):14-27, January 1968.

[13] James B. Fallon, Dexter R.F. Irvine, and Robert K. Shepherd. Cochlear implants and brain plasticity. Hearing Research, 238(1-2):110-117, 2008. The Auditory Brain - A Tribute to Dexter R.F. Irvine.

[14] Eberhard E. Fetz. Operant conditioning of cortical unit activity. Science, 163(3870):955-958, February 1969.

[15] Eberhard E. Fetz. Volitional control of neural activity: implications for brain-computer interfaces. J Physiol, 579(3):571-579, 2007.

[16] Tamar Flash and Neville Hogan. The coordination of arm movements: An experimentally confirmed mathematical model. The Journal of Neuroscience, 5(7):1688-1703, July 1985.

[17] A. P. Georgopoulos, A. B. Schwartz, and R. E. Kettner. Neuronal population coding of movement direction. Science, 233:1416-1419, 1986.

[18] Apostolos P. Georgopoulos, J.F. Kalaska, and J.T. Massey. Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor cortical populations. Experimental Brain Research (Supp), 7:327-336, 1983.

[19] Apostolos P. Georgopoulos, Ronald E. Kettner, and Andrew B. Schwartz. Primate motor cortex and free arm movements to visual targets in three-dimensional space. The Journal of Neuroscience, 8, August 1988.

[20] George L. Gerstein and Nelson Y. Kiang. An approach to the quantitative analysis of electrophysiological data from single neurons. Biophys. J., 1(1):15-28, September 1960.

[21] C. M. Harris and D. M. Wolpert. Signal-dependent noise determines motor planning. Nature, 394(6695):780-784, August 1998.

[22] Helms, D. M. Taylor, and A. B. Schwartz. The general utility of a neuroprosthetic device under direct cortical control. In Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 3:2043-2046, 2003.

[23] Parwez Hossain, Ian W Seetho, Andrew C Browning, and Winfried M Amoaku. Artificial means for restoring vision. BMJ, 330(7481):30-33, 2005.

[24] Andrew Jackson, Jaideep Mavoori, and Eberhard E. Fetz. Long-term motor cortex plasticity induced by an electronic neural implant. Nature, 444:56-60, 2006.

[25] Beata Jarosiewicz, Steven M. Chase, George W. Fraser, Meel Velliste, Robert E. Kass, and Andrew B. Schwartz. Functional network reorganization during learning in a brain-computer interface paradigm. Proceedings of the National Academy of Sciences, 105(49):19486-19491, 2008.

[26] Eric R. Kandel, James H. Schwartz, and Thomas M. Jessel. Principles of Neural Science. McGraw-Hill, 4th edition, 2000.

[27] Shinsuke Koyama, Steven Chase, Andrew Whitford, Meel Velliste, Andrew Schwartz, and Robert Kass. Comparison of brain computer interface decoding algorithms in open-loop and closed-loop control. Journal of Computational Neuroscience.

[28] H. Lalazar and E. Vaadia. Neural basis of sensorimotor learning: modifying internal models. Current Opinion in Neurobiology, December 2008.

[29] Mikhail A. Lebedev, Jose M. Carmena, Joseph E. O'Doherty, Miriam Zacksenhouse, Craig S. Henriquez, Jose C. Principe, and Miguel A. Nicolelis. Cortical ensemble adaptation to represent velocity of an artificial actuator controlled by a brain-machine interface. J. Neurosci., 25(19):4681-4693, May 2005.

[30] Mikhail A. Lebedev and Miguel A. L. Nicolelis. Brain-machine interfaces: past, present and future. Trends in Neurosciences, 9(9):536-546, 2006.

[31] E. C. Leuthardt, G. Schalk, D. Moran, and J. G. Ojemann. The emerging world of motor neuroprosthetics: a neurosurgical perspective. Neurosurgery, 59(1), July 2006.

[32] Chiang-Shan R. Li, Camillo Padoa-Schioppa, and Emilio Bizzi. Neuronal correlates of motor performance and motor learning in the primary motor cortex of monkeys adapting to an external force field. Neuron, 30(2):593-607, May 2001.

[33] D. Liu and E. Todorov. Evidence for the flexible sensorimotor strategies predicted by optimal feedback control. J Neurosci, 27(35):9354-9368, August 2007.

[34] Chet T. Moritz, Steve I. Perlmutter, and Eberhard E. Fetz. Direct control of paralysed muscles by cortical neurons. Nature, 456:639-642, 2008.

[35] Amir Navot, Lavi Shpigelman, Naftali Tishby, and Eilon Vaadia. Nearest neighbor based feature selection for regression and its application to neural activity. In Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2006.

[36] R. Paz, T. Boraud, C. Natan, H. Bergman, and E. Vaadia. Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8):882-890, August 2003.

[37] Rony Paz and Eilon Vaadia. Learning-induced improvement in encoding and decoding of specific movement directions by neurons in the primary motor cortex. PLoS Biology, 2(2):e45+, February 2004.

[38] Leigh R. Hochberg, Mijail D. Serruya, Gerhard M. Friehs, Jon A. Mukand, Maryam Saleh, Abraham H. Caplan, Almut Branner, David Chen, Richard D. Penn, and John P. Donoghue. Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature, 442:164-171, July 2006.

[39] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 2000.

[40] J. C. Sanchez, D. Erdogmus, M. A. Nicolelis, J. Wessberg, and J. C. Principe. Interpreting spatial and temporal neural activity through a recurrent neural network brain-machine interface. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 13(2):213-219, June 2005.

[41] Gopal Santhanam, Stephen I. Ryu, Byron M. Yu, Afsheen Afshar, and Krishna V. Shenoy. A high-performance brain-computer interface. Nature, 442(7099):195-198, 2006.

[42] B. Schoelkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

[43] A. Schwartz, X. Cui, D. Weber, and D. Moran. Brain-controlled interfaces: Movement restoration with neural prosthetics. Neuron, 52(1):205-220, October 2006.

[44] Stephen H. Scott. Inconvenient truths about neural processing in primary motor cortex. The Journal of Physiology, 586(5):1217-1224, March 2008.

[45] S. H. Scott. The role of primary motor cortex in goal-directed movements: insights from neurophysiological studies on non-human primates. Current Opinion in Neurobiology, 13(6):671-677, December 2003.

[46] Chris Sekirnjak, Pawel Hottowy, Alexander Sher, Wladyslaw Dabrowski, Alan M. Litke, and E. J. Chichilnisky. High-resolution electrical stimulation of primate retina for epiretinal implant design. J. Neurosci., 28(17):4446-4456, 2008.

[47] Mijail D. Serruya, Nicholas G. Hatsopoulos, Liam Paninski, Matthew R. Fellows, and John P. Donoghue. Instant neural control of a movement signal. Nature, 416:141-142, March 2002.

[48] R. Shadmehr and F. A. Mussa-Ivaldi. Adaptive representation of dynamics during learning of a motor task. J. Neurosci., 14(5):3208-3224, May 1994.

[49] Lavi Shpigelman, Koby Crammer, Rony Paz, Eilon Vaadia, and Yoram Singer. A temporal kernel-based model for tracking hand movements from neural activities. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1273-1280. MIT Press, Cambridge, MA, 2005.

[50] Lavi Shpigelman, Hagai Lalazar, and Eilon Vaadia. Kernel-ARMA for hand tracking and brain-machine interfacing during 3D motor control. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21. MIT Press, 2009.

[51] Lavi Shpigelman, Yoram Singer, Rony Paz, and Eilon Vaadia. Spikernels: Predicting hand movements by embedding population spike rates in inner-product spaces. Neural Computation, 17(3):671-690, March 2005.

[52] A.J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199-222, 2004.

[53] Dawn M. Taylor, Stephen I. Helms, and Andrew B. Schwartz. Information conveyed through brain-control: cursor versus robot. IEEE Trans Neural Syst Rehabil Eng., 11(2):195-199, June 2003.

[54] Dawn M. Taylor, Stephen I. Helms Tillery, and Andrew B. Schwartz. Direct cortical control of 3D neuroprosthetic devices. Science, 296(7):1829-1832, 2002.

[55] Emanuel Todorov. Direct cortical control of muscle activation in voluntary arm movements. Nature Neuroscience, 3(4):391-398, 2000.

[56] Emanuel Todorov and Michael I. Jordan. Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11):1226-1234, 2002.

[57] E. Vaadia, I. Haalman, M. Abeles, H. Bergman, Y. Prut, H. Slovin, and A. Aertsen. Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373:515-518, February 1995.

[58] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.

[59] Meel Velliste, Sagi Perel, M. Chance Spalding, Andrew S. Whitford, and Andrew B. Schwartz. Cortical control of a prosthetic arm for self-feeding. Nature, 453(7198):1098-1101, June 2008.

[60] R. Wahnoun, J. He, and S. I. Helms Tillery. Selection and parameterization of cortical neurons for neuroprosthetic control. Journal of Neural Engineering, 3(2):162-171, June 2006.

[61] Johan Wessberg, Christopher R. Stambaugh, Jerald D. Kralik, Pamela D. Beck, Mark Laubach, John K. Chapin, Jung Kim, James Biggs, Mandayam A. Srinivasan, and Miguel A. Nicolelis. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(16), November 2000.

[62] Daniel M. Wolpert, Zoubin Ghahramani, and J. Randall Flanagan. Perspectives on problems in motor learning. Trends in Cognitive Sciences, 5(11):487-494, November 2001.

[63] Daniel M. Wolpert and Zoubin Ghahramani. Computational principles of movement neuroscience. Nature Neuroscience, 3, November 2000.

[64] W. Wu, Y. Gao, E. Bienenstock, J.P. Donoghue, and M.J. Black. Bayesian population coding of motor cortical activity using a Kalman filter. Neural Computation, 18:80-118, 2005.

[65] Wei Wu and N. G. Hatsopoulos. Real-time decoding of nonstationary neural activity in motor cortex. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16(3):213-222, 2008.

[66] B. M. Yu, S. I. Ryu, G. Santhanam, M. M. Churchland, and K. V. Shenoy. Improving neural prosthetic system performance by combining plan and peri-movement activity. In 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEMBS '04), volume 2, pages 4516-4519, 2004.

[67] Byron Yu, Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Krishna Shenoy, and Maneesh Sahani. Extracting dynamical structure embedded in neural activity. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1545-1552. MIT Press, Cambridge, MA, 2006.

[68] Byron M. Yu, Caleb Kemere, Gopal Santhanam, Afsheen Afshar, Stephen I. Ryu, Teresa H. Meng, Maneesh Sahani, and Krishna V. Shenoy. Mixture of trajectory models for neural decoding of goal-directed movements. J Neurophysiol, 97(5):3763-3780, May 2007.

[69] Neta Zach, Dorrit Inbar, Yael Grinvald, Hagai Bergman, and Eilon Vaadia. Emergence of novel representations in primary motor cortex and premotor neurons during associative learning. J. Neurosci., 28(38):9545-9556, September 2008.

Acknowledgments

First and foremost I would like to thank my advisors, Eilon Vaadia and Yoram Singer. Eilon is a great advisor. His support during all my years at the Hebrew University was tremendous. I set out to do interdisciplinary work, which required me to do machine-learning related research while building a new lab setup, training monkeys and finally conducting the brain-machine interface experiment. Eilon provided all the support I could possibly ask for: from assistant programmers, trainers, and funds to lots of advice, hands-on assistance and guidance that kept me focused on my goals, while giving me the freedom and a show of confidence in letting me pursue the machine-learning aspects of the research as I saw fit.

Yoram Singer provided invaluable support as well. The first years of my work were concentrated on developing machine-learning methods of the kind that Yoram specializes in. Without his guidance in directing me towards viable solutions, and the support of his group of students, I would have probably lost my way in the maze of options.

The brilliant researchers at the Interdisciplinary Center for Neural Computation and the School of Computer Science and Engineering have been great sources of knowledge and inspiration to me. I would especially like to thank Naftali Tishby, whose deep insights in both of my fields of interest (and in several others) are often unique and raise more interesting research questions than can be addressed, certainly by one PhD.

The machine learning lab, the ICNC and the Physiology department of the Hadassah medical school are full of people that are great colleagues and friends. I would like to thank them all for their friendship and the fertile environment that they provided. Some of them deserve special mention. Amir Navot is a good friend who helped me understand my priorities and was of great help in solving the various problems that we encountered in our courses and research. Yuval Link provided an incredible amount of assistance in the lab and in working with the monkeys. Without his 24/7 willingness to help, the lab setup would not have been built. Hagai Lalazar took an abandoned monkey, made her into an obedient worker, got her successfully implanted and, after doing his own work with her, initiated our joint experiment. Without his get-things-done attitude we wouldn't have gone through with the BMI work. Koby Crammer was a great roommate who demonstrated to me how not to waste time while at one's desk. Michael Fink was another great roommate who demonstrated to me how to be successfully involved in 10 projects at once (the PhD not necessarily one of them). Shai Shalev-Schwartz was a surrogate advisor at times. Morning coffees with Matan Ninio were a nice break (Matan telling me to get that PhD submitted, I find hilarious).

Other people whose company I enjoyed are Fritzie Arce, Aharon Bar-Hillel, Yosef Barash, Gal Chechik, Gal Elidan, Ben Engelhard, Ran Gilad-Bachrach, Amir Globerson, Amit Gruber, Tamir Hazan, Tomer Hertz, Dorit Inbar, Tommy Kaplan, Roni Libel, Itai Novick, Naama Parush, Rony Paz, Boris Rosin and many more.

Finally, I want to take the opportunity to thank Shlomit, my parents and the rest of the family for everything.
