Published in Proceedings of ASSETS 98, Marina del Rey, California, April 1998
A Tool for Creating Eye–aware Applications that Adapt to Changes in User Behavior
Gregory Edwards
Advanced Eye Interpretation Project
CSLI, Stanford University
Stanford, CA 94305–4115
(650) 725–1725
gedwards@eyetracking.stanford.edu
http://eyetracking.stanford.edu
ABSTRACT
A development tool is described that can be used to create eye–aware software applications that adapt in real–time to changes in a user’s natural eye–movement behaviors and intentions. The research involved in developing this tool focuses on identifying patterns of eye–movement that describe three behaviors: Knowledgeable Movement, Searching, and Prolonged Searching. In the process of doing the research, two important features of eye–movement patterns were discovered: Revisits and Significant Fixations. These features complement the recognition of saccades, fixations, and blinks, and make it easier to recognize high–level patterns in users’ natural eye–movements.
Keywords
Eyetracking, eye–aware, eye interpretation engine, user intent, visual search, fixations, fixation duration, user–centered approach, human–computer interaction
INTRODUCTION
We present a description of a software development tool, the Eye Interpretation Engine, that recognizes three eye–movement behavior patterns, allowing applications that make use of the tool to adapt and react to changes in a user’s intentions. The work described in this paper grew out of our need to develop eyetracking software that could provide full keyboard and mouse control of a computer for a person who has late–stage ALS (Lou Gehrig’s disease). The software we developed had to enable a person who could move no part of his or her body other than the eyes to have full control of the keyboard and mouse of any computer and, thereby, have full access to the Internet and all the software that runs on a computer.
As part of this project we developed an on–screen keyboard that has keys that can be activated by looking at them. In effect, the user can type with his or her eyes. The idea of an eye–activated on–screen keyboard is not new—a number of disabled people who do not have the use of their hands and voices have used them for a number of years to communicate and work—but the approach we took in developing the keyboard is unique.
We wanted to write a program that would improve the interaction between a human user and a computer; and in particular, we wanted the computer to adapt to the human user instead of the human having to adapt to the computer. The eye–activated keyboard had to be fast and effective for experienced users, but easy to use for first time users—allowing them to search for letters at a leisurely pace without worry that a letter would be mistakenly selected should they look at it too long—and we wanted the system to automatically handle both types of users. Our goal was to create software that would recognize the individual characteristics of a user’s eye–movements and automatically adjust itself to the individual.
While writing the software, we realized that no high–level development tools exist to write eye–aware software that can work with the various eyetracking devices on the market. We created such a tool, which we call an Eye Interpretation Engine, that can recognize three eye–movement behaviors commonly practiced by people when using interfaces that have buttons that can be selected by eye. The power of being able to recognize the different behaviors is that it enables software written with the tool to adapt to the changing needs and expectations of the user.
Natural Movement of the Eye
People are not generally aware of how their eyes move and collect visual information. For instance, most people are not aware that their eyes typically move at least twice a second, jumping quickly in a straight line from looking at one location to another. These eye–movements are called saccades. People are also not generally aware that while they are searching for something, like a letter on a keyboard, their eyes pause up to four times each second, momentarily fixating on a key before moving on. It is during these fixations that the eye perceives the keys on a keyboard, the features of a person’s face, or any other object in the world. Jacob [4] gives a good, detailed overview of how the human eye works and moves.
When users tried our keyboard, we found that they were not aware that they fixated while they were searching for the letter they wanted. When asked, most users thought that their eyes moved smoothly and continuously while they were searching for a letter, and only stopped once they found it. In addition, users were not aware that some of the fixations they performed while they were searching were longer than some of the fixations they performed when they knew where the next letter was. From the users’ perspective, when they were using the keyboard, a letter was selected when they looked at a letter, but not while they were searching for one.
Though people are typically unaware of how their eyes move, eye–movements are a rich source of information about a person, which humans are adept at interpreting. Every day, people tailor how they explain subjects to others based on the listener’s eye–movements and body language. It is also not too difficult to judge when another person is searching for something, again, based on observing the person’s eye–movements.
CONTINUAL ADAPTATION TO THE USER
The Eye Interpretation Engine we created interprets the natural eye–movements of users in real–time, and attempts to categorize which behavior the user is currently engaged in. A software application that makes use of the tool can adjust how it reacts to the user, so that the user can select letters or targets quickly when the user knows where the desired target is, while at the same time allowing the user to search for targets at a leisurely pace.
Currently, we have defined three behaviors related to the use of an on–screen keyboard that can be recognized by patterns of eye–movement. The Eye Interpretation Engine can recognize:
An on–screen keyboard is an excellent example of an eye–aware application that can greatly benefit from a system that adjusts to changes in the user. New users require a great deal of searching since they are not familiar with the keyboard layout; the system needs to recognize this and slow down to the user’s speed. As the user begins to learn the location of commonly used letters, the user moves directly to those letters when she wants to select them, and when she does so the system should not make her wait. In general, it should be faster to select known targets than targets the user needs to search for. Finally, as the user becomes more and more expert, the system should speed up to match her expectations, until the moment when the user needs to search for a little–used letter, at which point the system needs to slow down and let the user search.
THE EYETRACKING FIELD
We believe there are four areas that need to succeed for eyetracking to become widely available.
The Need for Eye–aware Tools
Currently, it is far too difficult for average programmers to write eye–aware software applications. In order to create an eye–aware application, a programmer must handle either raw eye–position data or data that has been preprocessed to yield fixations, saccades, and blinks. This requires the programmer to be intimately familiar with how the eye works and with data filtering methods. Tools are limited to basic development kits that come with some hardware eyetrackers and only provide interpretations at the levels of saccades, fixations, and blinks.
Our aim is to integrate the knowledge derived from research into tools that non–research developers can use. We believe that good development tools will facilitate the expansion of the eyetracking market, which will increase the demand for eyetracking hardware and which will ultimately result in better, cheaper products for people with disabilities.
Requirements for Eye–aware Tools
The eyetracking field needs good programming tools that provide access to information at different levels, ranging from simple—for a programmer who knows nothing about eye–movement but who is familiar with event–driven programming—to complex, enabling researchers to focus on high–level issues more easily without having to recreate the required recognition of low–level features.
Ideally, a tool for creating eye–aware applications will:
In addition, because it is difficult to predict the problems involved in writing eye–aware applications and determine what exactly the user is doing and trying to do in real–time, a visualization tool that shows a complete history of the actions and patterns of movement that the user's eye makes is a necessary component for both developers and researchers.
DIFFICULTIES IN DEVELOPING EYE–AWARE APPLICATIONS
Currently, there are a number of difficult issues faced in developing eye–aware programs. The goal of the development tool described in the next section is to begin to solve these problems, making development of eye–aware applications more tractable.
Processing of a large mass of data: Hardware eyetracking devices sample where the eye is pointing up to thirty times a second. Consequently, up to 1,800 samples must be processed and interpreted each minute. Out of this mass of data, the eyetracking application needs to find what is significant and intended by the user and ignore what is not.
Breaking data into fixations and saccades: It is helpful to know when the user's eye is fixating, so a method is needed that can recognize beginnings and endings of fixations. The standard approach is described by Jacob [4], and an interesting alternative concept is described by Goldberg and Schryver [3].
Smoothing and filtering the data: Depending on the quality and characteristics of the hardware eyetracker, certain areas of the screen might not be well calibrated although the calibration procedure for the hardware has been completed. This problem is worse in some current eyetrackers than others and will eventually go away as eyetracking hardware improves; but until that time comes the problem must be handled. For examples of ways to do this, see Jacob [4] and Stampe and Reingold [7].
Recognizing when the user wants to make a selection: To date, the most common approach taken by those who use fixation duration as a means of selection is to pick a threshold value between 250–1000ms. If a fixation lasts longer than the threshold value then it is considered a selection or action fixation that causes an event to occur [4] [7] [8].
Recognizing when the user does not want to make a selection (the Midas Touch problem): Jacob [4] points out that allowing a user to select targets by eye can quickly become undesirable, in that the user cannot casually look at any target for too long for fear that it will cause the target to be selected. This problem is commonly called the Midas Touch problem and is the biggest difficulty on the software side of eyetracking.
To get around the Midas Touch problem, two categories of solutions can be applied. The first category includes those solutions that make use of a second device, such as a mouse button, a puff/sip tube, a tongue switch, or an eyebrow raise detector, to control the making of selections. These secondary devices can be effective, though we are interested in a more direct solution that does not necessitate another piece of hardware, and that all users of eyetracking can employ—from able–bodied people to a person with late–stage ALS (Lou Gehrig’s disease) who might be able to move only his or her eyes.
The second category includes those solutions that rely on blinks or movements of the eye. We feel that the use of double–blinks or slow–blinks is inappropriate and disconcerting to users since these methods require users to make selections while their eyes are closed, and because blinking up to 30 times a minute quickly becomes annoying in situations where many selections need to be made.
Another selection mechanism that relies on the eye is based on fixation dwell–time, which was mentioned above. This approach requires the user to look at the desired target until it selects. Typically, a threshold value is picked that is greater than the length of time a person naturally fixates. Velichkovsky, Sprenger, and Unema [8] make observations about the different cognitive processes that are associated with fixations of different lengths. This approach works well in that it does not require any additional hardware, is something all people can use, and is relatively fast and easy since the user is already looking at what he or she wants to select.
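The fixed dwell–time rule described above can be sketched in a few lines of Python. The Fixation record, the is_selection function, and the 500ms default are illustrative assumptions (the default is simply a value inside the 250–1000ms range cited earlier); none of them come from an actual eyetracker interface.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    """Hypothetical fixation record: screen position and duration."""
    x: float
    y: float
    duration_ms: float


def is_selection(fixation: Fixation, threshold_ms: float = 500.0) -> bool:
    """Fixed dwell-time rule: a fixation longer than the threshold selects."""
    return fixation.duration_ms > threshold_ms


if __name__ == "__main__":
    print(is_selection(Fixation(10.0, 20.0, 650.0)))  # long look at a key
    print(is_selection(Fixation(10.0, 20.0, 300.0)))  # brief glance
```

Because the threshold is a single constant, the rule cannot tell a long fixation made while searching from a deliberate one.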
A problem with any arbitrarily chosen duration threshold value, however, is that it is almost invariably either too fast or too slow. Allowing the user to adjust the value, and thereby the length of time needed to select a target, is not an optimum solution since
This last point is best illustrated with an example. Consider an experienced user who tends to take 500ms to select a target. The user is comfortable with that speed; however, when the user is searching for a target, his eyes most likely will fixate longer than 500ms on some targets that he does not wish to select. The eye–aware device needs to recognize this and adjust itself to match the expectations of the user, which in this case means that while the user is searching the eye–aware device should not cause a selection to occur at 500ms.
In the future, we expect that many computers and devices will be eye–aware. When a user approaches such a device, it should automatically adjust to the user without the user’s needing to set parameters.
Matching user expectations: Once the Midas Touch problem has been solved, it is beneficial to match as closely as possible the length of time required to select a target to the user's expectation of when that target should be selected. In order to match the user's expectations, the software must have an idea what the user's expectations might be, and so the software must recognize these high–level behaviors and intentions.
Visualization: The final problem that affects the development of eye–aware applications is that it is difficult to visualize what is happening from the eyetracking data, and what the user is trying to do. If the developer does not know what is happening in the data and there is no way to visualize it, then the developer cannot solve problems that arise.
DESCRIPTION OF THE TOOL
To solve the problems listed above, the Eye Interpretation Engine software works to recognize patterns of eye–movements caused by typical human behavior. We have taken a context–free approach in developing the tool, meaning that the interpretation provided by the tool is not tied to what is displayed on the screen. We are able to do this because we have found general patterns of eye–movement that are caused by several forms of human behavior. For example, the Eye Interpretation Engine can recognize patterns of eye–movement caused when a user searches for a target on a screen. This is not to say that we could not get better recognition of behaviors if we made the tool context-sensitive. If the tool had access to information pertaining to objects on the screen and the characteristics of those objects, it could make better, more accurate predictions. However, a context–sensitive approach ties the recognition to a particular application, and the tool ceases to be generally applicable.
In developing the current system, we have made a number of simplifying assumptions. For instance, we assume that we can determine saccades and fixations in a context-free way, using knowledge about properties of the eye and the way it moves. We may be correct in assuming that we can do this; but it is not proven, other than that it seems to work well when we test the system. Furthermore, we make a simplifying assumption about the amount of area that a person mentally perceives during each fixation, which is not a constant—the amount of peripheral vision that is used by a person changes depending on the scene viewed and the activity the user is engaged in. It is our intention to revisit these assumptions at a later date to try to understand them better and to develop better solutions.
Figure 1: Hierarchy of Recognized Artifacts
The diagram in figure 1 shows the basic hierarchy of artifacts that our current system recognizes, with the most basic artifact—Samples—positioned at the bottom and with more complex artifacts positioned higher in the diagram.
Recognition of Fixations and Saccades
The first things that must be recognized in the eye position history data are fixations and saccades.
Each data point returned by an eyetracking device represents the device's best estimate of where the user is looking. Because of a number of measurement and modeling difficulties [9], the data returned from any eyetracker must be treated as only a suggestion as to what the user is perceiving. Because current eyetrackers return only X, Y and, sometimes, Z coordinates for the eye position data, special thought must be given as to how to handle the data.
Our approach is to treat each sample returned from the eyetracking device as the center of an area that most likely contains the true area perceived by the user. For example, if an ideal eyetracking device returned a sample of (10,20), then we would interpret that to mean that the area that the user is perceiving most likely falls within the circle that has (10,20) as its center and has a radius of approximately a half–degree. The radius is increased for noisy eyetracking devices in order to handle the loss of confidence in where the eye is looking.
With multiple samples we can get a better idea of where the user is looking—or we can at least cut down some of the measurement noise of the device. The system does this by taking the overlap of a number of samples.
The overlapping area that the samples have in common can be treated in two ways. It can be treated as an area, which is useful if the application wishes to display the fixation area on the screen, since it gives a better approximation of what the user is perceiving than a single point on the screen. The overlap area can also be treated as a single point, which is computationally more tractable. To determine a single point, the centroid of the overlap area is computed.
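One way to picture the overlap computation is the numerical sketch below. It treats each sample as a circle of equal radius, finds the region common to all circles by testing points on a grid, and averages those points to approximate the centroid. The function name, the grid step, and the assumption that coordinates are expressed in degrees of visual angle are ours; the paper does not say how the engine computes the overlap.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def overlap_centroid(samples: List[Point], radius: float = 0.5,
                     step: float = 0.05) -> Optional[Point]:
    """Approximate the centroid of the area shared by all sample circles.

    Returns None if the circles have no common area at this grid resolution.
    """
    if not samples:
        return None
    # Any shared point must lie within `radius` of every sample, which
    # bounds the search to this rectangle.
    min_x = max(x for x, _ in samples) - radius
    max_x = min(x for x, _ in samples) + radius
    min_y = max(y for _, y in samples) - radius
    max_y = min(y for _, y in samples) + radius
    if min_x > max_x or min_y > max_y:
        return None
    r2 = radius * radius
    sum_x = 0.0
    sum_y = 0.0
    count = 0
    nx = int((max_x - min_x) / step) + 1
    ny = int((max_y - min_y) / step) + 1
    for i in range(nx):
        gx = min_x + i * step
        for j in range(ny):
            gy = min_y + j * step
            # Keep grid points that fall inside every sample's circle.
            if all((gx - sx) ** 2 + (gy - sy) ** 2 <= r2 for sx, sy in samples):
                sum_x += gx
                sum_y += gy
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)
```

A larger radius, as suggested for noisy devices, simply widens each circle and therefore the shared area.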
In our present system, the most recent two–tenths of a second's worth of data is kept; so as new data arrives and old data is dropped, the overlap area changes. A history is kept of the fixation data, so that pursuit motion can be recognized, as well as basic information about each fixation, such as position and duration values.
In our present system, saccades are recognized as at least 100ms of data that occur outside the overlap area.
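The windowing and saccade rules above might be sketched as follows. This is a simplification under stated assumptions: the overlap area is stood in for by a circle around the mean of the samples in the 200ms window, and a saccade is reported once samples have stayed outside that circle for at least 100ms. The class and method names are hypothetical and are not the engine's.

```python
from collections import deque
from math import hypot


class SaccadeDetector:
    WINDOW_MS = 200.0   # keep the most recent two-tenths of a second
    SACCADE_MS = 100.0  # time outside the overlap area that counts as a saccade

    def __init__(self, radius: float = 0.5):
        self.radius = radius         # assumed radius of the stand-in overlap area
        self.window = deque()        # (t_ms, x, y) samples of the current fixation
        self.outside_since = None    # time of the first consecutive outside sample

    def _center(self):
        n = len(self.window)
        return (sum(s[1] for s in self.window) / n,
                sum(s[2] for s in self.window) / n)

    def add_sample(self, t_ms: float, x: float, y: float) -> bool:
        """Feed one sample; return True when a saccade is recognized."""
        # Drop samples older than the window.
        while self.window and t_ms - self.window[0][0] > self.WINDOW_MS:
            self.window.popleft()
        if not self.window:
            self.window.append((t_ms, x, y))
            self.outside_since = None
            return False
        cx, cy = self._center()
        if hypot(x - cx, y - cy) <= self.radius:
            # Sample supports the current fixation.
            self.window.append((t_ms, x, y))
            self.outside_since = None
            return False
        if self.outside_since is None:
            self.outside_since = t_ms
        if t_ms - self.outside_since >= self.SACCADE_MS:
            # The eye has moved: start a new fixation window here.
            self.window.clear()
            self.window.append((t_ms, x, y))
            self.outside_since = None
            return True
        return False
```

A single stray sample resets nothing permanently: if the next sample falls back inside the area, the outside timer is cleared and the fixation continues.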
For each fixation and saccade, we measure four features:
There are two higher–level artifacts that the system recognizes from the sequence of fixations and saccades. Recognizing them greatly improves the recognition of patterns and of what the eye is doing.
Revisits
A Revisit is a fixation that goes back to the location of a recent previous fixation.
Quantitatively, a Revisit is a fixation whose center is within one degree of the center of any of the last five fixations, not including the immediately prior fixation. Thus, two consecutive fixations that are close to each other are not considered a Revisit, though two close fixations that are separated by one fixation would result in a Revisit.
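The quantitative definition translates fairly directly into code. The sketch below is one reading of it: take the five fixations preceding the current one, skip the immediately prior fixation, and test the remaining ones for a center within one degree. The function name and the assumption that fixation centers are given in degrees are ours.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


def is_revisit(current: Point, previous: List[Point],
               max_dist_deg: float = 1.0, lookback: int = 5) -> bool:
    """Return True if `current` revisits a recent fixation.

    `previous` holds earlier fixation centers, oldest first, and does not
    include `current`.
    """
    recent = previous[-lookback:]
    candidates = recent[:-1]  # the immediately prior fixation never counts
    return any(hypot(current[0] - px, current[1] - py) <= max_dist_deg
               for px, py in candidates)


if __name__ == "__main__":
    # Two close fixations separated by one other fixation: a Revisit.
    print(is_revisit((0.5, 0.0), [(0.0, 0.0), (5.0, 5.0)]))  # True
    # Two consecutive close fixations: not a Revisit.
    print(is_revisit((0.5, 0.0), [(0.0, 0.0)]))              # False
```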
Recognizing Revisits is important because it reveals patterns in the fixations and saccades. For example, they can be used to determine when a user has found the desired target after searching.
Significant Fixations
A Significant Fixation is a fixation that lasts longer than a variable threshold, referred to as the Significant Fixation Threshold (SFT). In a command interface, a Significant Fixation would result in a selection if it occurred over a button or selectable target.
Our current system chooses between two fixed values for the SFT based on its interpretation of the current behavior as defined below. Knowledgeable Movement causes an SFT of 600ms to be used, and Searching and Prolonged Searching cause an SFT of 1100ms to be used. These values err on the side of safety and tend to be perceived as a bit slow by an experienced user. This use of multiple SFTs proved to be beneficial to the users of the eye–aware on–screen keyboard that we created for disabled users.
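As a minimal sketch of this behavior–dependent threshold, the lookup below maps each behavior to the SFT value given above and tests a fixation duration against it. The enum, the table, and the function name are illustrative assumptions; only the 600ms and 1100ms values come from the text.

```python
from enum import Enum


class Behavior(Enum):
    KNOWLEDGEABLE_MOVEMENT = "knowledgeable movement"
    SEARCHING = "searching"
    PROLONGED_SEARCHING = "prolonged searching"


# Significant Fixation Threshold in milliseconds for each behavior.
SFT_MS = {
    Behavior.KNOWLEDGEABLE_MOVEMENT: 600,
    Behavior.SEARCHING: 1100,
    Behavior.PROLONGED_SEARCHING: 1100,
}


def is_significant(duration_ms: float, behavior: Behavior) -> bool:
    """A fixation is significant if it outlasts the SFT for the current behavior."""
    return duration_ms > SFT_MS[behavior]


if __name__ == "__main__":
    # The same 800ms fixation selects during Knowledgeable Movement only.
    print(is_significant(800, Behavior.KNOWLEDGEABLE_MOVEMENT))  # True
    print(is_significant(800, Behavior.SEARCHING))               # False
```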
We are currently working on an algorithm that dynamically sets the values for the SFT based on user eye–movement characteristics that will allow us to better tailor the SFT values to individual users.
Eye–Movement Patterns
Movement Patterns are patterns of saccades and features. Currently, the system recognizes two Patterns of Movement:
Overview Shifts of Focus occur when the user shifts his focus from one general area to another when searching. This method of search gives a broad overview of the entire scene, not narrowing in too much on the details. Local Shifts of Focus occur when the user believes he is in the right general area, but needs to pinpoint the desired target.
Eye–Behavior Patterns
Eye–Behavior Patterns build on Eye–Movement Patterns and represent the inferences that the system makes about user intent. Our system monitors the user’s eye–motions and looks for patterns of eye–movement that fit into one of three mutually exclusive categories of behavior that we have defined: Searching, Knowledgeable Movement, and Prolonged Searching.
Searching is defined by the presence of Overview Shifts of Focus or Local Shifts of Focus. The rationale for this is that, if the user knows where the next desired target is, he only needs one large saccade to get to the general area of the target, and then possibly a few small saccades in order to zero in on the specific location (see Figure 4). If the user needs more than one large saccade, or has a string of short saccades that cover a lot of area, then the user does not know where he is going.

Figure 2: Example of Searching (Overview Shifts of Focus)

Figure 3: Example of Searching (Local Shifts of Focus)
Knowledgeable Movement is the default behavior that becomes active after each Significant Fixation until another behavior is recognized. The reason is that, after a Significant Fixation, the user can clearly perceive the general area around the fixated spot. If the user makes a small saccade at this time, then the user knows where he is moving because the destination is already within peripheral vision. If the next desired target is not close to the location of the previous Significant Fixation, and if the user knows where the target is located, then he will make one large saccade to the correct vicinity of the target and will stay in that vicinity to make the selection.

Figure 4: Example of Knowledgeable Movement
During Searching (but not Prolonged Searching), a fixation that is a Revisit is treated as being in the Knowledgeable Movement category as long as that fixation lasts. This covers the situation when a user is searching, briefly perceives the desired target, moves to a new location before realizing that he just passed the desired target, and then moves back to (Revisits) the previous fixation. Recognizing Revisits makes it possible to transition back to Knowledgeable Movement after a user has been searching. It is relatively easy to recognize when a user has begun searching, but it is much harder to determine when the user has stopped searching.
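The behavior rules in this section, together with the Prolonged Searching rule of 10 or more saccades since the last Significant Fixation, can be summarized as a small state tracker. This is a sketch under assumptions: the detection of Overview or Local Shifts of Focus is taken as a boolean supplied by the caller, Prolonged Searching is given precedence over the other categories, and all names are hypothetical.

```python
from enum import Enum


class Behavior(Enum):
    KNOWLEDGEABLE_MOVEMENT = "knowledgeable movement"
    SEARCHING = "searching"
    PROLONGED_SEARCHING = "prolonged searching"


class BehaviorTracker:
    PROLONGED_SACCADES = 10  # saccades since the last Significant Fixation

    def __init__(self):
        self.saccades_since_significant = 0
        self.searching = False

    def on_significant_fixation(self) -> None:
        """Knowledgeable Movement becomes the default again."""
        self.saccades_since_significant = 0
        self.searching = False

    def on_saccade(self, shift_of_focus: bool) -> None:
        """Record a saccade; `shift_of_focus` marks an Overview or Local Shift."""
        self.saccades_since_significant += 1
        if shift_of_focus:
            self.searching = True

    def behavior(self, fixation_is_revisit: bool = False) -> Behavior:
        """Classify the behavior for the current fixation."""
        if self.saccades_since_significant >= self.PROLONGED_SACCADES:
            return Behavior.PROLONGED_SEARCHING
        if self.searching and not fixation_is_revisit:
            return Behavior.SEARCHING
        # Default state, or a Revisit made while Searching.
        return Behavior.KNOWLEDGEABLE_MOVEMENT
```

The Revisit flag only affects the classification of the fixation it is passed with, which mirrors the rule that a Revisit counts as Knowledgeable Movement only for as long as that fixation lasts.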
Prolonged Searching is defined as 10 or more saccades since the last Significant Fixation. Currently this "catch all" behavior covers the following situations:
HOW THE TOOL WAS DEVELOPED
The Eye Interpretation Engine was initially developed as a part of the on–screen keyboard described above. Searching and Knowledgeable Movement were the first two behaviors we focused on recognizing so that the on–screen keyboard would operate smoothly. As we were developing the system, we demonstrated it to numerous visitors to our lab, as well as at conferences where we would let anyone sit down and try it. Approximately seventy first–time users have tried the on–screen keyboard.
When we were testing the keyboard, we told new users to "just look at the letter you want until it flashes," which meant that the letter was selected. The layout used for the letters was purposefully not the standard QWERTY layout, so all new users were forced to search for letters. We asked everyone to start by spelling out "Hello my name is _____." After this initial sentence users could type anything they wished. The initial sentence has two instances of "e" and "m" so we often saw movement directly to these letters on the second instance. As the user typed more words and phrases, we saw more and more movement directly to the area of the desired letter. We could see instances where the system got the interpretation wrong, which caused an unexpected letter to be selected and typed—users fixed these false selections by looking at the on–screen keyboard’s delete key. We used this information as we developed the system to reduce the number of false selections and misinterpretations.
The original raw eyetracking data from each session was saved so that we could recreate the entire session. We often showed users a playback of what they had just done. While watching the playback, users made many useful comments regarding times when they were searching for letters, times when the target selected too fast or too slow, as well as how they felt about the system. When we were developing and testing the system’s ability to recognize searching behavior, we added an audible tone to the playback that sounded whenever the system thought that the user was searching. We asked the user to watch and listen to the playback and to report the inaccuracies in the interpretation. This information helped us fine–tune the mechanisms that enable the system to switch between interpretations and different values for the SFT.
The values for the SFT that were mentioned above—600ms for Knowledgeable Movement, and 1100ms for Searching and Prolonged Searching—were determined empirically to provide the best balance between matching the user’s expectation of when a selection should occur and the number of errors caused by mis-selections.
Five people from around our lab have used the system on a number of occasions as it developed and are considered experts. These users provided comments and data that were useful for comparing the differences between new and experienced users. Additionally, these users enabled us to make sure the system remained effective for longtime users.
Early on, we wondered whether users would perceive that the keyboard was making use of multiple SFT values that changed as the users’ behaviors changed; we were concerned that users might find it disconcerting to have the selection times change in this way. When we tested this, though, we found that the opposite was true: users disliked the use of one static SFT, and said the dynamically set SFT felt natural. When the keyboard only used one static SFT that was slow enough to minimize the number of mis-selections, many users said that the system was too slow. Alternatively, when faster static SFTs were used, users felt that incorrect targets were selected arbitrarily when they did not want them to be, especially when they were in the middle of searching for the desired target. We found that most users thought that their eyes moved smoothly and continuously while they were searching for a letter, and that their eyes only stopped once they found it. From the users’ perspective, when they were using the keyboard with dynamically set SFTs, a letter was selected when they looked at a letter, but not while they were searching for one. Most users did not perceive that there were two thresholds being used at different times. The multiple SFT values that changed based on the user’s behavior allowed the keyboard to match the user’s perception of what they were doing.
The user testing done so far has been used solely to improve the system without the expectation of publishing statistics regarding the system and approach. Formal testing of the system is now underway, and findings and results will be made available.
PROGRAMMER'S INTERFACE
The tool is implemented as a 32–bit DLL that currently runs under Windows 95. The Eye Interpretation Engine automatically connects to and reads data from the various eyetracking devices. An application interacts with the Eye Interpretation Engine by linking to the DLL.
The tool returns a smoothed value for the current fixation (if the user is fixating), a boolean value of whether the fixation is a Significant Fixation, and an interpretation of the state of the user. Applications can treat Significant Fixations as selections if they occur over a button or selectable area.
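To illustrate how an application might consume these three values, the sketch below models them as a plain record and performs the hit test against on–screen buttons. Everything here is hypothetical: the Reading and Button types and the selected_button function are invented for illustration and do not describe the actual interface exported by the DLL.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Reading:
    """Hypothetical snapshot of what the tool reports at one moment."""
    fixation: Optional[Tuple[float, float]]  # smoothed position, None if not fixating
    is_significant: bool                     # True for a Significant Fixation
    behavior: str                            # interpretation of the user's state


@dataclass
class Button:
    label: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def selected_button(reading: Reading, buttons: List[Button]) -> Optional[Button]:
    """Treat a Significant Fixation over a button as a selection."""
    if reading.fixation is None or not reading.is_significant:
        return None
    for button in buttons:
        if button.contains(*reading.fixation):
            return button
    return None
```

The application never inspects raw samples or durations in this sketch; it relies entirely on the significance flag, which is the division of labor the tool is meant to provide.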
Advanced interfaces that want to take advantage of the full power of an eye–aware interface can access any level of detail, including each pattern or artifact that went into making the most recent, highest–level interpretation. Applications can also access any previous artifact or interpretation, all of which are kept in a history that records the user’s eye–activities along with the accompanying interpretations.
An external viewer is included with the tool, which can run either on the same machine or on a second computer. The viewer enables the real–time observation of all the various levels of interpretation that occur moment–by–moment, as well as a trace of the fixations and saccades as the user makes them. The viewer has proven to be invaluable in helping us develop the tool, recognize new patterns, and catch errors in our logic.
We have gotten good results with two different eyetracking systems: the Eyegaze system, from LC Technologies [5], and QuickGlance, from EyeTech Digital Systems [2]. The tool is designed to work with any eyetracking hardware, and expanding the number of systems we support requires minimal effort.
CONCLUSIONS
The system described here has a number of unique features. Using multiple values for the Significant Fixation Threshold opens the possibility of having interfaces that adapt instantly to changes in the user’s behavior. We also describe simple patterns of eye movement that can be used to infer the user’s behavior and intentions. Enabling eye–aware software to recognize and react to high–level behaviors makes possible both command and noncommand interfaces that automatically customize themselves to each individual user's needs.
We believe that this is the only system that includes these abilities, and we have provided them in a software tool that others can use immediately. Bundling these capabilities into a developer's tool facilitates the creation of eye–aware software and acts as a democratizing force that increases the number of people who can contribute to the eyetracking field. With increased demand for eye–aware devices, the costs per system will drop, and many people with disabilities who could benefit from the use of eyetracking will have access to this mode of input.
ACKNOWLEDGMENTS
This work was supported by a grant from the Packard Foundation and by reserve funds from the Center for the Study of Language and Information (CSLI), Stanford University. The first eyetracker used for research was purchased by privately donated funds.
A patent is pending for this work, which is being funded by Stanford University's Office of Technology Licensing (OTL). OTL is also funding the commercialization effort of the Eye Interpretation Engine. Any interested party may contact Stanford's OTL at (650) 723–0651.
The author wishes to thank Michael Ross for his excellent work as a programming intern.
REFERENCES