Invited Paper
MVA 2000 (IAPR Workshop on Machine Vision Applications)
Yasuda Auditorium, University of Tokyo
November 2000
Evolution of Real-time Image Processing in Practical Applications
Masakazu Ejiri, Takafumi Miyatake, Hiroshi Sako, Akio Nagasaka, and Shigeki Nagaya
Central Research Laboratory, Hitachi, Ltd.,
Kokubunji, Tokyo 185-8601, Japan
Abstract
The history of machine vision and its applications pioneered by the authors and
their colleagues at Hitachi during the past three decades is reviewed. A
variety of applications, especially in factory automation and office
automation, were made possible by the evolution of real-time image-processing
techniques. In recent years, social automation is becoming another application
area of image processing, in which real-time color-video processing is expected
to be a key technology.
1 Introduction
Research on "artificial intelligence" began at the Massachusetts
Institute of Technology, Stanford University, and Stanford Research Institute
in the US in the mid-1960s. The main concern of the researchers was the
realization of intelligence by using a conventional computer, which had been
developed mainly for numerical computing. At that time, a hand-eye system was
thought to be an excellent research tool to visualize intelligence and
demonstrate its behavior inside the computer. The hand-eye system was, by
itself, soon recognized as an important research target, and it became known as
the "intelligent robot". One of the core technologies of the
intelligent robot was, of course, vision, and this vision research area was
called "computer vision", which is now regarded as a fundamental and
scientific approach to investigate how artificial vision can be best achieved
and what principles underlie it.
On the other hand, research on "machine vision" was launched in the
mid-1960s at the Hitachi Central Research Laboratory as one of the core
technologies towards attaining flexible factory automation. Other Japanese
companies also played an important role in its incubation and development. It
should be noted that the term "machine vision" is used preferentially in this
paper to represent a more pragmatic approach toward useful vision systems.
Though the road was not so smooth, we have fortunately achieved quite a few
successes in factory automation, and also in office automation, over the last
three decades. In this paper, we briefly introduce our pioneering applications
of machine vision, and we discuss the history of machine vision by focusing on
the image-processing technology underlying these applications. Recent technology
of real-time video processing, which may be applicable to future social
automation, is also explained. To begin with, the history of machine vision is
illustrated in Fig. 1 and is explained in the following sections in more
detail.
2 Early Years of Machine Vision
Our first attempt at machine vision, in 1964, was to automate the assembly
process (i.e., wire-bonding process) of transistors. In this development, we
used a very primitive optical sensor by combining a microscope and a
rotating-drum type scanner with two slits on its surface. By detecting the
reflection from the transistor surface with photo-multipliers, the position and
orientation of transistor chips were determined with a success rate of roughly
95%. However, this rate was still too low to enable us to replace human
workers; thus, our first attempt failed and was eventually abandoned after a
two-year struggle.
What we learned from this experience was the need for reliable artificial
vision comparable to human pattern recognition, which quickly captures an
image and then drastically reduces its information content until the
positional information is firmly determined. Our slit-type optical-scanning
method inherently lacked a sufficient quantity of captured information; thus,
the recognition result was easily affected by reflection noise.
In those days, however, microprocessors had not yet been developed and the
available computers were still too expensive, bulky, and slow particularly for
image processing and pattern recognition. Moreover, memory chips were extremely
costly, so the use of full-frame image memories was out of the question.
Though there was no indication that these circumstances of processors and
memories would soon improve, we started seminal research on flexible machines
in 1968. A generic intelligent machine conceived at that time is shown in Fig.
2. It consisted of three basic functions: intention understanding,
object/environment understanding, and decision-making. Based on this
conception, a prototype intelligent robot was developed in 1970. It could
assemble blocks into various forms by responding to the objectives presented
macroscopically by an assembly drawing [1]. The configuration of this
intelligent robot is shown in Fig. 3.
Besides our laboratory, two
other institutions, Electro-technical Laboratory (Japan) and Edinburgh
University (UK), also joined in the research on intelligent robots. Therefore,
six organizations in total, three in the US, two in Japan, and one in Europe, were
the centers of research on computer vision and intelligent robots in the early
1970s. From then on, many research groups were founded in various institutions
and companies, and computer vision and robotics became central research topics
throughout the world.
3 Factory Automation
Though our prototype intelligent robot in 1970 was nothing more than an
expensive toy, it revealed many basic problems underlying "flexible
machines" and gave us useful insights into future applications of
robotics. One significant problem we were confronted with was the robot's
extremely slow image-processing speed in object and environment understanding.
Our next effort was therefore focused on developing high-speed dedicated
hardware for image processing, with the minimum use of memory, instead of using
rather slow and expensive computers. One of the core ideas was to adaptively
threshold the image signal into a binary form by responding to the signal
behavior and to input it into a shift-register-type local memory that dynamically
stores the latest pixel data of several horizontal scan-lines. This
local-parallel-type configuration enabled us to simultaneously extract plural
pixel data from a 2-D local area in synchronization with image scanning. The
dedicated processing hardware for this extraction is shown in Fig. 4.
Local-parallel-type logical image processing was thus the first successful
solution to realize practical machine vision. By designing its logic circuit
according to the envisaged purpose, the processing hardware could be adapted to
many applications.
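
For today's reader, this local-parallel principle can be sketched in software
as follows; the fixed threshold (standing in for the adaptive thresholding
described above) and the 3x3 window size are assumptions for illustration only:

    import numpy as np

    def scanline_windows(image, threshold=128):
        # Binarize the incoming raster signal and keep only the last few
        # scan-lines in shift-register-like line buffers, yielding a 3x3
        # binary neighbourhood for every pixel in synchronization with the
        # scan; no full-frame image memory is ever held.
        h, w = image.shape
        line2 = np.zeros(w, dtype=np.uint8)   # signal delayed by two scan-lines
        line1 = np.zeros(w, dtype=np.uint8)   # signal delayed by one scan-line
        for y in range(h):
            current = (image[y] > threshold).astype(np.uint8)
            if y >= 2:
                for x in range(1, w - 1):
                    yield (x, y - 1), np.stack([line2[x - 1:x + 2],
                                                line1[x - 1:x + 2],
                                                current[x - 1:x + 2]])
            line2, line1 = line1, current

    # Example use: count fully-set 3x3 neighbourhoods while "scanning".
    # count = sum(int(win.all()) for _, win in scanline_windows(img))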
One useful yet simple
method using this local-parallel-type image processing was windowing. This
method involved setting up a few particular window areas in the image plane,
and the pixels in the windows were selectively counted to find the background
area and the object area occupying the windows.
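
In software, windowing reduces to counting object pixels inside a few fixed
rectangles, as in the following sketch; the window coordinates and the
occupancy threshold would be chosen per application:

    import numpy as np

    def window_occupancy(binary_image, windows):
        # 'windows' is a list of (x, y, width, height) rectangles placed in
        # the image plane; the returned ratios tell how much of each window
        # is occupied by object (non-zero) pixels.
        ratios = []
        for x, y, w, h in windows:
            patch = binary_image[y:y + h, x:x + w]
            ratios.append(patch.sum() / patch.size)
        return ratios

    # Example: an object is judged present in window i when, say,
    # ratios[i] > 0.5 (the threshold is application-dependent).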
In 1972, a bolting robot applying windowing was developed to automate
the molding process of concrete piles and poles [2]. It became the first
application of machine vision to moving objects. Another effective method
based on local-parallel architecture was erosion/dilation of patterns,
which was executed by simple AND/OR logic on a 2-D local area. This method
could detect defects in printed circuit boards (PCBs), and formed one of
the bases of today's morphological processing. This defect-detection machine
in 1972 also became the first application of machine vision to the automation
of visual inspection [3]. These two pioneering applications are illustrated
in Figs. 5 and 6.
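
To make the AND/OR formulation concrete, here is a minimal Python sketch of
3x3 binary erosion and dilation; the wrap-around border handling (np.roll) and
the opening-residue defect test at the end are simplifying assumptions for
illustration, not the circuit of [3]:

    import numpy as np

    def erode3x3(b):
        # Erosion = logical AND over the 3x3 local area (b is a boolean image).
        # np.roll wraps around at the borders; real hardware treats borders
        # explicitly, so this is a simplification.
        out = np.ones_like(b)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out &= np.roll(np.roll(b, dy, axis=0), dx, axis=1)
        return out

    def dilate3x3(b):
        # Dilation = logical OR over the same 3x3 local area.
        out = np.zeros_like(b)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= np.roll(np.roll(b, dy, axis=0), dx, axis=1)
        return out

    # Illustrative defect test: pattern details thinner than the 3x3 element
    # vanish under erosion and do not return after dilation, so the residue
    # pattern & ~dilate3x3(erode3x3(pattern)) highlights hairline features.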
Encouraged by the effectiveness of these machine-vision systems in actual
production lines, we again started to develop a new assembly machine for
transistors, which was, this time, based fully on image processing. Multiple
local pattern matching was extensively studied for the purpose of detecting
electrode positions of transistors. By implementing this matching on a
local-parallel-type image processor, as shown in Fig. 7, we finally achieved
fully automatic transistor assembly in 1973 [4]. This successful development
was the result of our ten-year effort since our first failed attempt. The
developed assembly machines were recognized as the world's first image-based
machines for fully automatic assembly of semiconductor devices. These machines
and their configuration are shown in Fig. 8.
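
A minimal software sketch of the multiple-local-pattern-matching idea follows;
the SAD matching criterion, the per-template offsets, and the median vote are
illustrative assumptions (the actual machine [4] used dedicated matching
hardware):

    import numpy as np

    def match_local_template(image, template):
        # Exhaustive search for the position with the smallest sum of
        # absolute differences (SAD); slow, but it shows the principle.
        ih, iw = image.shape
        th, tw = template.shape
        best, best_pos = None, (0, 0)
        for y in range(ih - th + 1):
            for x in range(iw - tw + 1):
                sad = np.abs(image[y:y + th, x:x + tw].astype(int)
                             - template.astype(int)).sum()
                if best is None or sad < best:
                    best, best_pos = sad, (x, y)
        return best_pos

    def locate_electrode(image, templates, offsets):
        # Each small local template is matched independently; each match,
        # shifted by that template's known offset from the electrode,
        # "votes" for the electrode position, and the votes are combined
        # robustly (here with a median).
        votes = []
        for tpl, (ox, oy) in zip(templates, offsets):
            x, y = match_local_template(image, tpl)
            votes.append((x - ox, y - oy))
        return tuple(np.median(np.array(votes), axis=0))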
After this development, our efforts were then
focused on expanding the machine-vision applications from transistors to other
semiconductor devices, such as ICs, hybrid ICs, and LSIs. Consequently, the
automatic assembly of all types of semiconductor devices was completed by 1977.
This automatic assembly gained widespread attention from semiconductor
manufacturers and expanded quickly into industry. As a result, the
semiconductor industry as a whole prospered by virtue of higher speed
production of higher quality products with more uniform performance than ever.
In the mid-1970s to early 1980s, our efforts
were also focused on other industrial applications of vision technology.
Examples of such applications during this period are a pump-hose connection
robot, an intra-factory physical distribution system, and an inspection machine
for printed marks and characters on electronic parts [5][6][7]. These examples
show that the key concept representing those years seemed to be the realization
of a productive society through factory automation, and the objectives of
machine vision were mainly:
・ position detection for assembly,
・ shape detection for classification, and
・ defect detection for inspection.
Assembly, classification, and inspection have
been crucial manufacturing processes for realizing productive factories.
The
most difficult but rewarding development in the mid-1980s was an inspection
machine for detecting defects in semiconductor wafers [8]. It was estimated
that even the world's largest super-computer available at that time would
require at least one month of computing to finish the defect detection of
a single 8-inch wafer. We therefore had to develop special hardware to lower
the processing time to less than 1 hour/wafer. The resulting hardware was a
network of local-parallel-type image processors that used a "design pattern referring method", as shown in Fig. 9.
In this machine, hardware-based knowledge processing, in which each processor
was regarded as a combination of IF-part and THEN-part logical circuits,
was first attempted [9].
Meanwhile, the processing speed of microprocessors had improved considerably
since their appearance in the mid-1970s, and the capacity of memories had
drastically increased without an excessive increase in cost. These improvements facilitated
the use of gray-scale images instead of binary ones. In addition, dedicated LSI
chips for image processing were developed in the mid-1980s [10]. These
developments all contributed to achieving more reliable, microprocessor-based
general-purpose machine vision systems with full-scale buffers for gray-level
images. As a result, applications of machine vision soon expanded from
circuit components, such as semiconductors and PCBs, to home-use electronic
equipment, such as VCRs and color TVs. A typical configuration of general-purpose
machine vision is shown in Fig. 10. Nowadays, machine vision systems are
found in various areas such as the electronics, machinery, medical, and food
industries.
4 Office Automation
Besides the above-described machine-vision
systems for factory applications, there has been extensive research on
character recognition in the area of office automation. For example, in the
mid-1960s, a FORTRAN program reader was developed to replace key punching
tasks, and mail-sorting machines were developed in the late 1960s to
automatically read handwritten postal codes. However, we were not too
interested in this character recognition technology until recently.
Our first effort to apply machine-vision
technology to areas other than factory automation was the automatic recognition
of monetary bills in 1976. This recognition system was extremely successful in
spurring the development of automatic teller machines (ATMs) for banks. Because
of the limited processing time, not all of the bill image could be captured.
Instead, several partial data obtained from optical sensors were combined with
data from magnetic sensors. This so-called sensor fusion, attempted here for
the first time, resulted in high-accuracy bill recognition with a theoretical
error rate of less than 1/10^15.
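As a rough illustration of how such a figure can arise: if several nearly
independent partial checks each have an error probability on the order of
10^-5, the probability that all of them fail simultaneously approaches the
product of the individual rates, i.e., the order of 1/10^15. (The independence
of the individual sensor checks is, of course, an idealizing assumption.)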
Our
next attempt, in the early 1980s, was the efficient handling of large amounts of
graphic data in the office [11]. The automatic digitization of paper-based
engineering drawings and maps was first studied. The recognition of these
drawings and maps was based on a vector representation technique, such as that
shown in Fig. 11. It was usually executed by spatially-parallel-type image
processors, in which each processor was assigned to a specific image area.
Currently, geographic information systems (GIS) based on these digital maps are
becoming popular and are being used by many service companies and local
governments to manage their electric power supply, gas supply, water supply,
and sewage service facilities. The use of digital maps was then extended to car
navigation systems and more recently to other information service systems via
the Internet. Machine-vision technology contributed, mainly in the early
developmental stage of these systems, to the digitization of original
paper-based maps into electronic form until these digital maps began to be
produced directly from measured data through computer-aided map production.
Spatially divided parallel processing was
also useful for large-scale images such as those from satellite data. One of
our early attempts in this area was the recognition of wind vectors, back in
1972, by comparing two simulated satellite images taken at a 30-minute
interval. This system formed a basis for weather forecasting using Japan's first
meteorological geo-stationary satellite "Himawari", launched a few
years later. Research on document understanding also originated as part of
machine-vision research in the mid-1980s [12]. During those years, electronic
document editing and filing became popular because of the progress in
word-processing technology covering more than 4,000 Kanji and Kana characters. The
introduction of an electronic patent-application system in Japan in 1990 was
quite a stimulus for further research on office automation. We developed
dedicated workstations and a parallel-disk-type distributed filing system for
use by patent examiners. This system enabled examiners to efficiently
retrieve and display the images of past documents for comparison. Nowadays,
however, many application forms used in offices, such as those of local
governments and insurance companies, continue to be hand-written by applicants. Therefore,
the recognition of these various form types and the contents written in the
forms is posing a challenge to further innovation.
The
recognition of handwritten postal addresses is the most recent topic in
machine-vision applications. In 1992, a decision was made by a government
committee to adopt a new 7-digit postal code system in Japan starting from
1998. We developed new automatic mail-sorting machines for post offices in
1997, and they are now in use for daily sorting and delivery. The new sorting
machine and its configuration are shown in Fig. 12. In this machine,
hand-written/printed addresses in Kanji characters are read together with the
7-digit postal codes, and the results are printed on each letter as a
transparent barcode encoding 20 digits of data. The letters are then
dispatched to other post offices for delivery. In a subsequent process, only
these barcodes are read, and prior to home delivery the letters are arranged by
the new sorting machine in such a way that the order of the letters corresponds
to the house order on the delivery route.
In these postal
applications, the recognition of all types of printed fonts and hand-written
Kanji characters was made possible by using a multi-microprocessor-type
image-processing system. A mail image is sent to one of the unoccupied
processors, and this designated processor analyzes the image, as shown in Fig.
13. The address recognition by a designated single processor usually requires a
processing time of 1.0 to 2.5 seconds, depending on the complexity of the
address image. As up to 32 microprocessors are used in parallel for
successively flowing letters, the equivalent recognition time of the whole system
is less than 0.1 seconds/letter, producing a maximum processing speed of 50,000
letters/hour.
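As a rough consistency check: if each address takes, say, an average of about
2.3 seconds and 32 processors work in parallel, the system handles about
32/2.3 ≈ 14 letters per second, which corresponds to roughly 50,000 letters
per hour.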
The previously described office applications
of vision technology show that the key concept representing those years seemed
to be the realization of an efficient society through office automation, and
the objectives of machine vision were mainly:
・ efficient handling of large-scale data and
・ high-precision, high-speed recognition and handling of paper-based
information.
Recent progress in network technology has
also increased the importance of office automation. To secure the reliability
of information systems and to realize more advanced network communication
systems, multi-media-type processing is becoming more important and will be one
key area for intensive research and development. This processing may include
image data compression, encryption, scrambling, watermarking, and personal-identification
technologies.
With
the progress in machine-vision technology, the size of images and the scale of
processing have drastically increased, as shown in Fig. 14. The processing time
needed for documents and drawings was not too restricted in the past, compared
to that needed in factory automation applications. However, the present speed
requirements are approaching the critical limit; thus, increasingly powerful
processing technology is needed for further development in office automation.
5 Social Automation
In
recent years, applications to social automation have become more feasible
because of the introduction of new technological improvements in sensors,
processors, algorithms, and networks. Probably the earliest attempt at social
automation was our elevator-eye project in 1977, in which we tried to implement
machine vision in an elevator system in order to control the human traffic in
large-scale buildings. The elevator hall on each floor was equipped with a
camera, and a vision system to which these cameras were connected surveyed
all floors in a time-sharing manner to estimate the number of persons
waiting for an elevator. The vision system then designated an
elevator to quickly serve the crowded floor [13]. The configuration of this
system is shown in Fig. 15. A robust change-finding algorithm based on edge
vectors was used in order to cope with the change in the brightness of the
surroundings. In this algorithm, the image plane was divided into several
blocks, and the edge-vector distribution in each block was compared with that
of the background image, which was occasionally updated automatically with a new
image when nobody was in the elevator hall. This system could minimize the
average waiting time for the elevator. Though a few systems were put into use
in the Tokyo area in the early 1980s, there has not been enough market demand
to develop the system further.
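
The following Python sketch illustrates the block-wise comparison of
edge-vector distributions; the gradient-orientation histogram, grid size, and
threshold are assumptions chosen for illustration rather than the exact
algorithm of [13]:

    import numpy as np

    def orientation_hist(patch, bins=8):
        # Edge-vector distribution of one block: a gradient-orientation
        # histogram weighted by gradient magnitude, normalized to sum to 1.
        gy, gx = np.gradient(patch.astype(float))
        hist, _ = np.histogram(np.arctan2(gy, gx), bins=bins,
                               range=(-np.pi, np.pi), weights=np.hypot(gx, gy))
        s = hist.sum()
        return hist / s if s > 0 else hist

    def changed_blocks(gray, background, grid=(4, 4), thresh=0.5):
        # Compare each block's edge distribution against that of the stored
        # background image; orientation distributions are largely insensitive
        # to global brightness changes, which is what makes the test robust.
        h, w = gray.shape
        bh, bw = h // grid[0], w // grid[1]
        out = np.zeros(grid, dtype=bool)
        for i in range(grid[0]):
            for j in range(grid[1]):
                sl = (slice(i * bh, (i + 1) * bh), slice(j * bw, (j + 1) * bw))
                d = np.abs(orientation_hist(gray[sl])
                           - orientation_hist(background[sl])).sum()
                out[i, j] = d > thresh
        return out  # the number of changed blocks estimates the waiting crowd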
More promising applications of image
recognition seem to be for monitoring road traffic, where license plates,
traffic jams, and illegally parked cars must be identified so that traffic can
be controlled smoothly and parking lots can be automatically allocated [14].
Charging tolls automatically at highway toll gates without stopping cars, by
means of a wireless IC card, is now being tested on a few highways as part of
the ITS (Intelligent Transport System) project. The system will be further
improved if the machine vision can quickly recognize other important
information such as license-plate numbers and even drivers' identities.
A
water-purity monitoring system using fish behavior [15] has been in operation
at a river control center in a local city for the past 10 years. A schematic
diagram of the system is shown in Fig. 16. The automatic observation of algae
in the water at sewage works was also studied. Volcanic lava flow was continuously
monitored at the base of Mt. Fugendake in Nagasaki, Japan, during the eruption
period in 1993. To optically send images from unmanned remote observation posts
to the central control station, laser communication routes were planned by
using 3-D undulation data derived from GIS digital contour maps. A GIS was also
constructed to assist in restoration after the earthquake in Kobe, Japan, in
1995. Aerial photographs after the earthquake were analyzed by matching them
with digital 3-D urban maps containing additional information on the height of
buildings. Buildings with damaged walls and roofs could thus be quickly
detected and given top priority for restoration [16].
Intruder detection is also becoming important
for crime prevention and for safety in dangerous areas, such as those around
high-voltage electric equipment. Railroad crossings can also be monitored
intensively by comparing the vertical line data in an image with that in a
background image updated automatically [17]. Arranging the image differences in
this vertical window gives a spatio-temporal image of objects intruding onto
the crossing. In almost all of these social applications, color-image
processing is becoming increasingly important for reliable detection and
recognition.
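
A minimal sketch of this vertical-window method might look as follows; the
update rate, the difference threshold, and the rule of updating the background
only when the line is empty are illustrative assumptions:

    import numpy as np

    class CrossingMonitor:
        # Watch a single pixel column (a vertical line across the crossing),
        # difference it against a slowly and automatically updated background
        # column, and stack the binary differences over time into a
        # spatio-temporal image in which intruding objects appear as streaks.
        def __init__(self, column_x, alpha=0.02, thresh=30):
            self.x = column_x
            self.bg = None          # background column, learned from the video
            self.alpha = alpha      # background update rate (assumed)
            self.thresh = thresh    # difference threshold (assumed)
            self.st_image = []      # spatio-temporal stack of binary columns

        def feed(self, frame):
            col = frame[:, self.x].astype(float)
            if self.bg is None:
                self.bg = col.copy()   # assume the first frame shows an empty crossing
            diff = np.abs(col - self.bg) > self.thresh
            if not diff.any():         # update the background only when the line is empty
                self.bg = (1 - self.alpha) * self.bg + self.alpha * col
            self.st_image.append(diff)
            return diff.any()          # True -> a possible intruder on the line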
As mentioned before, the application of image
processing to communications is increasingly promising as multimedia and
network technologies improve. Human-machine interfaces, as already shown in
Fig. 2, will be greatly improved if the machine is capable of recognizing every
medium used by humans. Human-to-human communication assisted by intelligent
machines and networks is also expected. Machine vision will contribute to this
communication in such fields as motion capturing, face recognition, facial
expression understanding, gesture recognition, sign language understanding, and
behavior understanding.
In addition, applications of machine vision
to the field of human welfare, medicine, and environment improvement will
become increasingly important in the future. Examples of these applications are
rehabilitation equipment, medical surgery assistance, and water purification in
lakes.
Thus, the key concept representing the future
seems to be the realization of a calm society, in which all uneasiness will be
relieved through networked social automation, and the most important objectives
of machine vision will eventually be the realization of two functions:
・ 24-hour/day abnormality monitoring via networks and
・ personal identification via networks.
In most of these social applications, dynamic image processing, which analyzes
video images in real-time, will be a key to success. There are already
some approaches for analyzing incoming video images in real-time by using
smaller-scale personal computers. One typical example is our "Mediachef",
which automatically cuts video images into a set of scenes by finding significant
changes between consecutive image frames [18]. The principle of the system
is shown in Fig. 17. This is one of the essential technologies for video
indexing and video-digest editing. To date, this technology has been put
into use in the video-inspection process at a broadcasting company so that
subliminal advertising can be detected before the video goes on the air.
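
The core inter-frame comparison can be sketched as follows; the coarse color
histogram and the cut threshold are assumptions for illustration, and the
actual system [18] is considerably more elaborate:

    import numpy as np

    def color_hist(frame, bins=4):
        # Coarse color histogram (bins^3 cells), normalized to sum to 1.
        h, _ = np.histogramdd(frame.reshape(-1, 3).astype(float),
                              bins=(bins,) * 3, range=((0, 256),) * 3)
        return h.ravel() / h.sum()

    def find_cuts(frames, thresh=0.4):
        # Declare a scene change wherever the histogram distance between
        # consecutive frames exceeds a threshold.
        cuts, prev = [], None
        for i, frame in enumerate(frames):
            h = color_hist(frame)
            if prev is not None and np.abs(h - prev).sum() > thresh:
                cuts.append(i)
            prev = h
        return cuts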
For the purpose of
searching scenes, we developed a real-time video coding technique that uses an
average color in each frame and represents its sequence by a "run"
between frames, as shown in Fig. 18. This method can compress 24-hour video
signals into a memory capacity of only 2 MB. This video-coding technology can
be applied to automatically detect the time of broadcast of a specific TV
commercial by continuously monitoring TV signals by means of a compact personal
computer. It will therefore allow manufacturers to verify that their commercials
were actually broadcast by the advertising company, thus providing evidence of
the broadcast.
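
A minimal sketch of this run-based coding, under the assumption that a run is
closed whenever the average color drifts beyond a tolerance, is:

    import numpy as np

    def encode_runs(frames, tol=8.0):
        # Each run is stored as [average color, run length]; a new run is
        # opened whenever the frame's average color drifts more than 'tol'
        # from the current run's color.
        runs = []
        for frame in frames:
            avg = frame.reshape(-1, 3).mean(axis=0)
            if runs and np.abs(avg - runs[-1][0]).max() <= tol:
                runs[-1][1] += 1
            else:
                runs.append([avg, 1])
        return runs

If each run is stored in, say, five bytes (a coarse color plus a run length),
2 MB accommodates some 400,000 runs, which makes the reported compression of a
24-hour signal plausible; detecting a specific commercial then reduces to
searching for its short run signature within this sequence.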
Real-time creation of panoramic
pictures may also be an interesting application of video-image processing [19].
A time series of each image frame from a video camera during panning and
tilting is spatially connected in real-time into a single still picture, as
shown in Fig. 19. Similarly, by connecting all the image frames obtained during
the zooming process, a high-resolution picture (having higher resolution in the
inner areas) can be obtained.
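
The frame-to-frame displacement needed for this connection can be estimated
cheaply from 1-D luminance projections (cf. [19]); in the following sketch,
the search range and the correlation normalization are illustrative
assumptions, and frames are assumed to be much wider than the search range:

    import numpy as np

    def shift_by_projection(prev, curr, max_shift=32):
        # Project each frame onto its axes (row/column sums of luminance)
        # and find the 1-D shift that best correlates the projections;
        # this estimates the displacement between consecutive frames
        # without any 2-D matching.
        def best_shift(p, q):
            best, arg = -np.inf, 0
            for s in range(-max_shift, max_shift + 1):
                a = p[s:] if s >= 0 else p[:s]
                b = q[:len(q) - s] if s >= 0 else q[-s:]
                c = np.dot(a - a.mean(), b - b.mean()) / len(a)
                if c > best:
                    best, arg = c, s
            return arg
        prev, curr = prev.astype(float), curr.astype(float)
        dx = best_shift(prev.sum(axis=0), curr.sum(axis=0))  # horizontal shift
        dy = best_shift(prev.sum(axis=1), curr.sum(axis=1))  # vertical shift
        return dx, dy

Each incoming frame is then pasted into the still picture at the accumulated
(dx, dy) offset.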
Another example of dynamic
video analysis is "Tour into the picture (TIP)" technology. As shown
in Fig. 20, a 2-D picture is scanned and interpreted into three-dimensional
data by manually fitting vanishing lines on the displayed picture. The picture
can then be looked at from different angles and distances [20]. A motion video
can thus be generated from a single picture and viewers can feel as if they
were taking a walk in an ancient city when an old picture of the city is
available.
The technology called
"Cyber BUNRAKU", in which human facial expressions are monitored by
small infrared-sensitive reflectors put on a performer’s face, is also
noteworthy. By combining the facial expressions thus obtained with the limb
motions of a "Bunraku doll" (used in traditional Japanese theatrical
performance), a 3-D character model in a computer can be automatically animated
in real-time to create video images [21], as shown in Fig. 21. This technology
is now being used to create multimedia programs much faster than through
traditional methods.
We have given a few
examples of real-time image processing technologies, which may be applicable to
social automation in the future. The most difficult technical problem facing
social automation, however, is how to make robust machine-vision systems that can
be used day or night in all types of weather conditions. To cope with the wide
changes in illumination, the development of a variable-sensitivity imaging
device with a wide dynamic range is still a stimulating challenge. Another key
towards achieving robust machine vision will be the establishment of
sensor-fusion technology, which combines image information with other
information obtained from a variety of different sensors.
6 Summary
The history of machine vision research was briefly reviewed by mainly focusing
on the topics we have studied at Hitachi. We can say that the progress in
factory automation and office automation has been greatly advanced by the
evolution of machine-vision technology, which has also been affected, in turn,
by the progress of memory technology and processor technology. The features of
machine vision in the future, together with (and in contrast to) those of the
past and present, are summarized in Table 1. In the future, the real-time
analysis of color video images will be an important key for realizing the calm
society through networked social automation and will, thus, be a central topic
for future research on machine vision.
References
1 Ejiri, M., Uno, T., Yoda, H., Goto, T., and
Takeyasu, K.: "A prototype intelligent robot that assembles objects
from plan drawings", IEEE Trans. Comput., C-21, 2, pp. 161-170 (1972)
2 Uno, T., Ejiri, M., and Tokunaga, T.:
"A method of real-time recognition of moving objects and its application",
Pattern Recognition, 8, pp. 201-208 (1976)
3 Ejiri, M., Uno, T., Mese, M., and Ikeda,
S.: "A process for detecting defects in complicated patterns",
Comp. Graphics & Image Processing, 2, pp.326-339 (1973)
4 Kashioka, S., Ejiri, M., and Sakamoto, Y.:
"A transistor wire-bonding system utilizing multiple local pattern
matching techniques", IEEE Trans. Syst. Man & Cybern., SMC-6, 8, pp.
562-570 (1976)
5 Ejiri, M.: "Machine vision: A
practical technology for advanced image processing", Gordon & Breach
Sci. Pub., New York (1989)
6 Ejiri, M.: "Recent image processing
applications in industry", Proc. 9th SCIA, Uppsala, pp. 1-13
(1995)
7 Ejiri, M.: "Machine vision: A
key technology for flexible automation", Proc. of Japan-U.S.A. Symposium
on Flexible Automation, Otsu, Japan, pp. 437-442 (1998)
8 Yoda, H., Ohuchi, Y., Taniguchi, Y., and
Ejiri, M.: "An automatic wafer inspection system using pipelined
image processing techniques" IEEE Trans. Pattern Analysis & Machine
Intelligence, PAMI-10, 1 (1988)
9 Ejiri, M., Yoda, H., and Sakou, H.:
"Knowledge-directed inspection for complex multi-layered patterns",
Machine Vision and Applications, 2, pp.155-166 (1989)
10 Fukushima, T., Kobayashi, Y., Hirasawa,
K., Bandoh, T., and Ejiri, M.: "Architecture of image signal
processor", Trans. IEICE, J-66C, 12, pp.959-966 (1983) (in Japanese)
11 Ejiri, M., Kakumoto, S., Miyatake, T.,
Shimada, S., and Matsushima, H.: "Automatic recognition of
engineering drawings and maps", Proc. Int. Conf. on Pattern Recognition,
Montreal, Canada, pp.1296-1305 (1984)
12 Ejiri, M.: "Knowledge-based
approaches to practical image processing", Proc. MIV-89, Inst. Ind. Sci,
Univ. of Tokyo, Tokyo, pp. 1-8 (1989)
13 Yoda, H., Motoike, J., Ejiri, M., and
Yuminaka, T.: "A measurement method of the number of passengers using
real-time TV image processing techniques", Trans. IEICE, J-69D, 11,
pp.1679-1686 (1986) (in Japanese)
14 Takahashi, K., Kitamura, T., Takatoo, M.,
Kobayashi, Y., and Satoh, Y.: "Traffic flow measuring system by
image processing", Proc. IAPR MVA, Tokyo, pp. 245-248 (1996)
15 Yahagi, H., Baba, K., Kosaka, H., and
Hara, N.: "Fish image monitoring system for detecting acute
toxicants in water", Proc. 5th IAWPRC, pp. 609-616 (1990)
16 Ogawa, Y., Kakumoto, S., and Iwamura,
K.: "Extracting regional features from aerial images based on 3-D
map matching", Trans. IEICE, D-II, 6, pp.1242-1250 (1998) (in Japanese)
17 Nagaya, S., Miyatake, T., Fujita, T.,
Itoh, W., and Ueda, H.: "Moving object detection by
time-correlation-based background judgment", Proc. ACCV'95 Second Asian
Conf. on Comp. Vision, pp. 717-721 (1995)
18 Nagasaka, A., Miyatake, T., and Ueda,
H.: "Video retrieval method using a sequence of representative
images in a scene", Proc. IAPR MVA, Kawasaki, pp. 79-82 (1994)
19 Nagasaka, A., and Miyatake, T.: "A
real-time video mosaics using luminance-projection correlation", Trans.
IEICE, J82-D-II, 10, pp.1572-1580 (1999) (in Japanese)
20 Horry, Y., Anjyo, K., and Arai, K.:
"Tour into the picture: Using a spidery mesh interface to make animation
from a single image", Proc. ACM SIGGRAPH 97, pp. 225-232 (1997)
21 Arai, K., and Sakamoto, H.: "Real-time animation of the upper half of
the body using a facial expression tracker and an articulated input
device", Research Report 96-CG-83, Information Processing Society of
Japan, 96, 125, pp. 1-6 (1996) (in Japanese)