Research on Dependable Distributed Systems for Smart Grid
Qilin Li
Production and Technology Department, Sichuan Electric Power Science and Research Institute, Chengdu, P.R.China
Email: li_qi_lin@163
Mingtian Zhou
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, P.R.China
Email: mtzhou@uestc.edu
Abstract—Within the last few years, smart grid has been one of the major trends in the electric power industry and has gained popularity in electric utilities, research institutes and communication companies. As applications for smart grid become more distributed and complex, the probability of faults undoubtedly increases. This fact has motivated the construction of dependable distributed systems for smart grid. However, dependable distributed systems are difficult to build. They present challenging problems to system designers. In this paper, we first examine the question of dependability and identify major challenges in the construction of dependable systems. Next, we present a view on the fault tolerance techniques for dependable distributed systems. As part of this view, we present distributed fault tolerance techniques for the construction of dependable distributed applications in smart grid. Subsequently, we propose a systematic solution based on middleware that supports dependable distributed systems for smart grid and study the combination of reflection and dependable middleware. Finally, we draw our conclusions and point out future directions of research.

Index Terms—smart grid, dependability, dependable middleware, fault-tolerance, fault, error, failure, error processing, fault treatment, replication, distributed recovery, partitioning, open implementation, reflection, inspection, adaptation
I. INTRODUCTION
Within the last few years, smart grid has been one of the major trends in the electric power industry and has gained popularity in electric utilities, research institutes and communication companies. The main purpose of smart grid is to meet future power demands and to provide higher supply reliability, excellent power quality and satisfactory services. Although smart grid brings great benefits to the electric power industry, such a new grid introduces new technical challenges to researchers and engineering practitioners.
As applications for smart grid become more distributed and complex, the probability of faults undoubtedly increases. Distributed systems are defined as a set of geographically distributed components that must cooperate correctly to carry out some common work. Each component runs on a computer. The operation of one component generally depends on the operation of other components that run on different computers[1] [2]. Although the reliability of computer hardware has improved during the last few decades, the probability of component failure still exists. Furthermore, as the number of interdependent components in a distributed system increases, so does the probability that the distributed service will be disrupted by the failure of one of those components[2]. This fact has motivated the construction of dependable distributed systems for smart grid.
Fault tolerance is needed in many different dependable distributed applications for smart grid. However, dependable distributed systems are difficult to build. They present challenging problems to system designers. System designers face the daunting requirement of providing dependability at the application level, while also dealing with the complexities of the distributed application itself, such as heterogeneity, scalability, performance, resource sharing, and the like. Few system designers have all of these skills. As a result, a systematic approach to achieving the desired dependability for distributed applications in smart grid is needed to simplify this difficult task.
Recently, middleware has emerged as an important architectural component in supporting the construction of dependable distributed systems. Dependable middleware can provide building blocks to be exploited by applications for enforcing non-functional properties, such as scalability, heterogeneity, fault-tolerance, performance, security, and so on[3]. These attractive features have made middleware a powerful tool in the construction of dependable distributed systems for smart grid [3].
This paper makes three contributions to the construction of dependable distributed systems for smart grid. First of all, we examine the question of dependability and identify major challenges during the construction of dependable systems. Subsequently, we attempt to present a view on the fault tolerance techniques for dependable distributed systems. As part of
this view, we present distributed fault tolerance techniques for building dependable distributed applications in smart grid. Finally, we propose a systematic solution based on middleware that supports dependable distributed systems for smart grid and study the combination of reflection and dependable middleware.
The remainder of this paper is organized as follows: Section II studies dependability matters for distributed systems in smart grid and identifies the major challenges for the construction of dependable systems. Section III introduces basic concepts and key approaches related to fault-tolerance. In Section IV, we discuss distributed fault-tolerant techniques for building dependable systems in smart grid. Section V introduces dependable middleware to address the ever increasing complexity of distributed systems for smart grid in a reusable way. Finally, Section VI draws our conclusions and points out future directions of research.
II. DEPENDABILITY MATTERS
Distributed systems are intended to form the backbone of emerging applications for smart grid, including supervisory control and data acquisition systems, distribution management systems, and so on. An obvious benefit of distributed systems is that they reflect the global business and social environments in which electric utilities operate. Another benefit is that they can improve the quality of service in terms of scalability, reliability, availability, and performance for complex power systems.
Dependability is an important quality of distributed power applications. In general terms, a system's dependability is defined as the degree to which reliance can justifiably be placed on the service it delivers [4]. The service delivered by a system is its behavior as it is perceived by its user(s); a user is another system (physical or human) which interacts with the former[4]. More specifically, dependability is a global concept that encapsulates the attributes of reliability (continuity of service), availability (readiness for usage), safety (avoidance of catastrophes), and security (prevention of unauthorized handling of information)[2] [4]. In distributed power environments, even small amounts of downtime can annoy customers, hurt sales, or endanger human lives. This fact has made it necessary to build dependable distributed systems for electric utilities.
Fault tolerance is an important aspect of dependability. It refers to the ability of a system to provide its specified service in spite of component failure[2] [4]. A fault-tolerant system's behavior is predictable despite partial failures, asynchrony, and run-time reconfiguration of the system. Moreover, fault-tolerant applications are highly available. The application can provide its essential services despite the failure of computing nodes, software object crashes, communication network partitions, and value faults [5]. However, building dependable distributed systems is complex and challenging. On the one hand, system designers have to deal explicitly with problems related to distribution, such as heterogeneity, scalability, resource sharing, partial failures, latency, concurrency control, and the like. On the other hand, system developers must have a deep knowledge of fault tolerance and must write fault-tolerant application software from scratch[2]. As a consequence, they face the daunting and error-prone task of providing fault tolerance at the application level [2].
Certain aspects of distributed systems make dependability more difficult to achieve. Distribution presents system developers with a number of inherent problems. For instance, partial failures are an inherent problem in distributed systems. A distributed service can easily be disrupted if any of the nodes involved should fail. As the number of computing nodes and communication links that constitute the system increases, the overall reliability of the system rapidly decreases.
Another inherent problem is concurrency control. System developers must address the complex execution states of concurrent programs. Distributed systems consist of a collection of components, distributed over various computers connected via a computer network. These components run in parallel on heterogeneous operating systems and hardware platforms and are therefore prone to race conditions, the failure of communication links, node crashes, and deadlocks. Thus, dependable distributed systems are often more difficult to develop, and application developers must cope explicitly with the complexities introduced by distribution.
In theory, the fault tolerance mechanisms of a dependable distributed system can be achieved with either a software or a hardware solution. However, the cost of custom hardware solutions is prohibitive, and software can provide more flexibility than its hardware counterpart[2]. As a result, software is a better choice for implementing the fault tolerance mechanisms and policies of dependable distributed systems[2]. However, the software approach to constructing dependable systems is also difficult. This is particularly true if a distributed system's dependability requirements change dynamically during the execution of an application. Further complicating matters are accidental problems such as the lack of widely reused higher-level application frameworks, primitive debugging tools, and non-scalable, unreliable software infrastructures. In that case, fault tolerance can be achieved using middleware [2]. Middleware can be devised to address these problems and to hide heterogeneity and the details of the underlying system software, communication protocols, and hardware. Built-in mechanisms and policies for fault tolerance can be provided by middleware, offering solutions to the problem of detecting and reacting to partial failures and to network partitioning. Middleware can provide a reusable software layer that supports standard interfaces and protocols to construct fault-tolerant distributed systems. Dependable middleware shields the underlying distributed environment's complexity by separating applications from explicit
protocol handling, disjoint memories, and data replication, and facilitates the construction of dependable applications [6].
III. FAULT TOLERANCE
A. Failure, Error and Fault
In order to construct a dependable distributed system, it is important to understand the concepts of failure, error, and fault. In a distributed system, a failure occurs when the delivered service of a system or a component deviates from its specification[4]. An error is that part of the system state that is liable to lead to subsequent failure. An error affecting the service is an indication that a failure occurs or has occurred[4]. A fault is the adjudged or hypothesized cause of an error [4].
In general terms, we think that an error is the manifestation of a fault in the distributed system, while
a failure is the effect of an error on the service. As a result, faults are potential sources of system failures.
Whether or not an error will actually lead to a failure depends on three major factors. One factor is the system composition, and especially the nature of the existing redundancy [4]. Another factor is the system activity. An error may be overwritten before creating damage[4]. A third factor is the definition of a failure from the user's viewpoint. What is a failure for a given user may be a bearable nuisance for another one [4].
Faults and their sources are extremely diversified. They can be categorized according to five main perspectives that are their phenomenological cause, their nature, their phase of creation or of occurrence, their situation with respect to the system boundaries, and their persistence [4].
B. Fault Models
When designing a distributed fault-tolerant system, we cannot tolerate all faults. As a consequence, we must define what types of faults the system is intended to tolerate. The definition of the types of faults to tolerate is referred to as the fault model, which abstractly describes the possible behaviors of faulty components[2] [4]. A system may not, and generally does not, always fail in the same way. The ways a system can fail are its fault modes. As a result, the fault model is an assumption about how components can fail [2] [4].
In distributed systems, a fault model is characterized by component and communication failures[2] [4]. It is commonly assumed that communication failures can only result in lost or delayed messages, since checksums can be used to detect and discard garbled messages[2] [4]. However, duplicated or disordered messages are also included in some models [2] [4].
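To make the checksum assumption concrete, the following minimal Python sketch (our own illustration; the function names are invented, not taken from the cited references) appends a CRC-32 checksum to each message so that the receiver can detect and discard garbled messages, effectively reducing corruption to message loss:

```python
import zlib

def encode(payload: bytes) -> bytes:
    # Append a CRC-32 checksum so the receiver can detect corruption.
    checksum = zlib.crc32(payload).to_bytes(4, "big")
    return payload + checksum

def decode(frame: bytes):
    # Return the payload, or None if the frame was garbled in transit.
    payload, received = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != received:
        return None  # garbled message: discard, i.e., treat it as lost
    return payload

frame = encode(b"meter reading: 42.7 kWh")
corrupted = b"\x00" + frame[1:]        # first byte flipped in transit
assert decode(frame) == b"meter reading: 42.7 kWh"
assert decode(corrupted) is None       # detected and discarded
```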
For a component, the most commonly assumed fault models are (in increasing order of generality): stopping failures or crashes, the timing fault model, the value fault model and the arbitrary fault model[2] [4]. Stopping failures or crashes are the simplest and most common assumption about faulty components[2] [4]. This model assumes that the only way a component can fail is by stopping the delivery of messages and that its internal state is lost [2] [4].
The timing fault model assumes that a component will respond with the correct value, but not within a given time specification [2] [4]. A timing fault can result in events arriving too soon or too late. The timing fault model includes delay and omission faults[2] [4]. A delay fault occurs when the message has the right content but arrives late[2] [4]. An omission fault occurs when no message is received. Sometimes, delay faults are called performance faults[2] [4]. In the value fault model, the value of the delivered service does not comply with the specification [2] [4].
The arbitrary fault model is the most general fault model, in which components can fail in an arbitrary way [2] [4]. As a result, if arbitrary faults are considered, no restrictive assumption is made[2] [4]. An arbitrarily faulty component might even send contradictory messages to different destinations (a so-called Byzantine fault)[2] [4]. This model can include all possible causes of fault, such as messages arriving too early or too late, messages with incorrect values, messages never sent at all, or malicious faults [2] [4].
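The following Python sketch is a simplified, hypothetical simulation of these four fault models; the class and method names are our own and are meant only to make the taxonomy tangible:

```python
import random
from enum import Enum, auto

class FaultModel(Enum):
    CRASH = auto()      # stops delivering messages altogether
    TIMING = auto()     # correct value, but delivered too late (delay fault)
    VALUE = auto()      # delivered on time, but with a wrong value
    ARBITRARY = auto()  # no restriction: anything may happen

class FaultyChannel:
    """Simulates a component that fails according to a chosen fault model."""

    def __init__(self, model: FaultModel, deadline_ms: float = 10.0):
        self.model = model
        self.deadline_ms = deadline_ms

    def deliver(self, value: float):
        """Return (value, latency_ms), or None if nothing is delivered."""
        if self.model is FaultModel.CRASH:
            return None                         # omission: no message at all
        if self.model is FaultModel.TIMING:
            return value, self.deadline_ms * 3  # right content, arrives late
        if self.model is FaultModel.VALUE:
            return value + 1.0, 1.0             # wrong content, on time
        # ARBITRARY: any value, any time, or nothing (possibly contradictory)
        return random.choice([None, (value, 1.0), (-value, 99.0)])
```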
C. Error Processing and Fault Treatment
Fault tolerance is a system's ability to continue to provide service in spite of faults [2] [4]. It can be achieved in two main forms: error processing and fault treatment [2] [4]. The purpose of error processing is to remove errors from the computational state, if possible before a failure occurs, whereas the purpose of fault treatment is to prevent faults from being activated again [2] [4].
In error processing, error detection, error diagnosis, and error recovery are commonly used approaches[2] [4]. Error detection and diagnosis first identify an erroneous state in the system, and then assess the damage caused by the detected error or by errors propagated before detection[2] [4]. After error detection and diagnosis, error recovery substitutes an error-free state for the erroneous state [2] [4].
Error recovery may take one of three forms: backward recovery, forward recovery, and compensation[2] [4]. In backward recovery, the erroneous state transformation consists of bringing the system back to a state already occupied prior to error occurrence[2] [4]. This entails the establishment of recovery points, which are points in time during the execution of a process for which the then current state may subsequently need to be restored [2] [4]. In forward recovery, the erroneous state transformation consists of finding a new state from which the system can operate[2] [4]. Error compensation provides enough redundancy that a system is able to deliver an error-free service from the erroneous state [2] [4].
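As an illustration of backward recovery, the following Python sketch (a minimal, hypothetical example, not drawn from the cited references) saves recovery points and restores the most recent one when an error is detected:

```python
import copy

class RecoverableProcess:
    """Backward recovery: restore a previously saved error-free state."""

    def __init__(self, state: dict):
        self.state = state
        self.recovery_points = []

    def save_recovery_point(self):
        # A recovery point is a deep copy of the current state.
        self.recovery_points.append(copy.deepcopy(self.state))

    def rollback(self):
        # Error recovery: substitute an error-free state for the erroneous one.
        if not self.recovery_points:
            raise RuntimeError("no recovery point: restart from initial state")
        self.state = self.recovery_points.pop()

p = RecoverableProcess({"breaker_open": False, "load_kw": 120})
p.save_recovery_point()
p.state["load_kw"] = -999          # an error corrupts the state...
p.rollback()                       # ...backward recovery undoes the damage
assert p.state["load_kw"] == 120
```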
The goal of fault treatment is to determine the cause of observed errors and to prevent faults from being activated again[2] [4]. The first step in fault treatment is fault diagnosis, which consists of determining the cause(s) of error(s), in terms of both location and nature [2] [4]. Then it
takes actions aimed at making it (them) passive[2] [4]. This is achieved by preventing the component(s)
identified as being faulty from being invoked in further executions[2] [4]. Fault treatment can be used to reconfigure a system to restore the level of redundancy so that the system is able to tolerate further faults [2] [4].
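The following Python sketch illustrates, in a deliberately simplified and hypothetical form, the fault treatment steps just described: the replica identified as faulty is made passive so that it is never invoked again, and a spare is activated to restore the level of redundancy:

```python
class ReplicaGroup:
    """Fault treatment: passivate a faulty replica and restore redundancy."""

    def __init__(self, replicas, spares, min_redundancy=3):
        self.active = set(replicas)
        self.spares = list(spares)
        self.quarantined = set()
        self.min_redundancy = min_redundancy

    def treat_fault(self, faulty):
        # Step 1 (fault diagnosis is assumed done): 'faulty' was identified.
        # Step 2: make the fault passive -- never invoke this replica again.
        self.active.discard(faulty)
        self.quarantined.add(faulty)
        # Step 3: reconfigure to restore the level of redundancy.
        while len(self.active) < self.min_redundancy and self.spares:
            self.active.add(self.spares.pop())

group = ReplicaGroup(replicas=["r1", "r2", "r3"], spares=["s1"])
group.treat_fault("r2")
assert "r2" not in group.active and "s1" in group.active
```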
IV. DISTRIBUTED FAULT TOLERANCE TECHNIQUES
A. Replication
In order to mask the effects of faults, distributed fault tolerance always requires some form of redundancy. Replication is a classic example of space redundancy. It exploits additional resources beyond what is needed for normal system operation to implement a distributed fault-tolerant service[2] [4]. The idea of replication is to manage a group of processes or replicas so as to mask the failures of some members of the group[2] [4]. By coordinating a group of components replicated on different computing nodes, distributed systems can provide continuity of service in the presence of failed nodes [2] [4].
There are three well-known replication schemes: active replication, passive replication, and semi-active replication. In the active replication scheme, every replica executes the same operations[2] [4]. Input messages are atomically multicast to all replicas, which all process them and update their internal states. All replicas generate output messages [2] [4].
Passive replication is a technique in which only one of the replicas (the primary) actively executes the operation, updates its internal state and sends output messages [2] [4]. The other replicas (the standby replicas) do not process
input messages; however, their internal state must be updated periodically by information sent by the primary [2] [4]. If the primary should fail, one of the standby replicas is elected to take its place [2] [4].
Semi-active replication is a technique similar to active replication[2] [4]. In semi-active replication, all replicas receive and process input messages. However, unlike active replication, the processing of messages is asymmetric in that one replica (the leader) takes responsibility for certain decisions (e.g., concerning message acceptance) [2] [4]. The leader replica can enforce its choice on the other replicas (the followers) without resorting to a consensus protocol [2] [4]. In one variant of semi-active replication, the leader replica may also take sole responsibility for sending output messages[2] [4]. Semi-active replication is primarily targeted at crash failures. However, under certain conditions, this strategy can also be extended to deal with arbitrary or Byzantine failures [2] [4].
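To make the schemes concrete, the following Python sketch illustrates passive (primary-backup) replication as described above. It is a minimal, single-process simulation with invented names, omitting real message passing and failure detection:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}

class PassiveReplicationGroup:
    """Passive (primary-backup) replication: only the primary executes
    operations; standby replicas are updated with state snapshots."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.primary = self.replicas[0]

    def execute(self, key, value):
        # Only the primary processes the request and produces the output.
        self.primary.state[key] = value
        self._checkpoint_to_standbys()
        return value

    def _checkpoint_to_standbys(self):
        # The primary periodically pushes its state to the standby replicas.
        for r in self.replicas:
            if r is not self.primary:
                r.state = dict(self.primary.state)

    def fail_over(self):
        # If the primary fails, a standby is elected to take its place.
        standbys = [r for r in self.replicas if r is not self.primary]
        if not standbys:
            raise RuntimeError("no standby left: service unavailable")
        self.replicas.remove(self.primary)
        self.primary = standbys[0]

group = PassiveReplicationGroup(["A", "B", "C"])
group.execute("feeder_7_voltage", 10.2)
group.fail_over()                    # primary A crashes; B takes over
assert group.primary.state["feeder_7_voltage"] == 10.2
```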
Continuity of service in the presence of failed nodes requires replication of processes or objects on multiple nodes[2] [4]. Replication can provide highly available service for a dependable distributed system. By replicating their constituent objects and distributing the replicas across different computers connected by the network, distributed applications can be made dependable [5]. The major challenge of the replication technique is to maintain replica consistency [7] [8] [9]. Replication will fail in its purpose if the replicas are not true copies of each other, both in state and in behavior [5] [10] [11] [12].
B. Distributed Recovery
In a dependable distributed system, some form of recovery is required to minimize the negative impact of a failed process or replica on the availability of a distributed service [4]. In its simplest form, this can be just a local recovery of the failed process or replica. However, distributed recovery occurs if the recovery of one process or replica requires remote processes or replicas also to undergo recovery[4]. In this case, processes or replicas must roll back to a set of checkpoints that together constitute a consistent global state [4].
There are several major approaches to creating checkpoints. One way is asynchronous checkpointing[4]. In asynchronous checkpointing, checkpoints are created independently by each process or replica, and when a failure occurs, a set of checkpoints must be found that represents a consistent global state [4]. This approach aims to minimize timing overheads during normal operation at the expense of a potentially large overhead when a global state is sought dynamically to perform the recovery[4]. The price to be paid for asynchronous checkpointing is the domino effect: if no consistent global state can be found, it might be necessary to roll all processes back to the initial state[4]. In order to avoid the domino effect, checkpoints can be taken in some coordinated fashion.
Another way is to structure process or replica interactions in conversations[4]. In a conversation, processes or replicas can communicate freely among themselves but not with other processes external to the conversation[4]. If processes or replicas all take a checkpoint when entering or leaving a conversation, recovery of one process or replica will only propagate to the other processes or replicas in the same conversation [4].
A third alternative is synchronous checkpointing [4] [13]. In this approach, checkpoint creation is dynamically coordinated so that a set of checkpoints always represents a consistent global state [4] [13]. As a consequence, the domino effect problem can be transparently avoided for software developers even if the processes or replicas are not deterministic[4]. At each instant, each process or replica possesses one or two checkpoints: a permanent checkpoint (constituting part of a global consistent state) and possibly a temporary checkpoint[4]. A temporary checkpoint may be undone or transformed into a permanent checkpoint. The creation of temporary checkpoints, and their transformation into permanent ones, is coordinated by a two-phase commit protocol to ensure that all permanent checkpoints effectively constitute a global consistent state [4].
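The following Python sketch illustrates the synchronous approach: each process holds a permanent and at most one temporary checkpoint, and a two-phase commit promotes temporary checkpoints to permanent ones only if every process is prepared. It is a simplified, hypothetical illustration; real protocols must also handle in-transit messages and coordinator failure:

```python
class CheckpointingProcess:
    """Each process keeps one permanent and at most one temporary checkpoint."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.permanent = 0      # part of the last global consistent state
        self.temporary = None

    def prepare(self):
        # Phase 1: take a temporary checkpoint and vote on the commit.
        self.temporary = self.state
        return True             # a real process could vote "no" here

    def commit(self):
        # Phase 2a: promote the temporary checkpoint to a permanent one.
        self.permanent, self.temporary = self.temporary, None

    def abort(self):
        # Phase 2b: undo the temporary checkpoint.
        self.temporary = None

def coordinated_checkpoint(processes):
    """Two-phase commit keeps all permanent checkpoints mutually consistent."""
    if all(p.prepare() for p in processes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False

procs = [CheckpointingProcess(f"p{i}") for i in range(3)]
procs[1].state = 42
assert coordinated_checkpoint(procs)
assert procs[1].permanent == 42     # part of the new global consistent state
```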
C. Partitioning Tolerance
A distributed system may partition into a finite number of components. The processes or replicas in different
components cannot communicate with each other[11]. Partitioning may occur due to normal operations, such as in mobile computing, or due to failures of processes or inter-process communication. Performance failures due to overload situations can cause ephemeral partitions that are difficult to distinguish from physical partitioning [4]. Partitioning is a very real concern and a common event in wide area networks[4]. If the network partitions, different operations may be performed on the processes or replicas in different components, leading to inconsistencies that must be resolved when communication is re-established and the components remerge[5]. One strategy for achieving this is to allow the components of a partition to continue some form of operation until the components can remerge [4] [11]. Once the components of a partitioned system remerge, the processes or replicas in the merged components must communicate their states, perform state transfer and reach a globally consistent state [5].
As another example, certain distributed fault-tolerance techniques adopt a dynamic linear voting protocol to ensure replica consistency in partitioned networks[5]. Voting protocols are based on quorums. In voting protocols, each node is assigned a number of votes. When a network is partitioned or remerged, if a majority of the last installed quorum is connected, a new quorum is established and updates can be performed within this partition [5].
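The following Python sketch illustrates the quorum idea with simple static majority voting over weighted votes; it is a deliberate simplification of the dynamic linear voting protocol cited above, which additionally tracks a majority of the last installed quorum:

```python
def has_quorum(connected_votes: dict, total_votes: int) -> bool:
    """Majority voting: a partition may perform updates only if it holds
    more than half of all votes, so at most one partition can be active."""
    return 2 * sum(connected_votes.values()) > total_votes

# Each node is assigned a number of votes (the weights here are invented).
votes = {"n1": 2, "n2": 1, "n3": 1, "n4": 1}   # 5 votes in total
total = sum(votes.values())

# The network partitions into {n1, n2} and {n3, n4}.
side_a = {k: votes[k] for k in ("n1", "n2")}   # 3 votes -> quorum
side_b = {k: votes[k] for k in ("n3", "n4")}   # 2 votes -> no quorum

assert has_quorum(side_a, total)       # this partition may apply updates
assert not has_quorum(side_b, total)   # this partition must wait to remerge
```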
V. DEPENDABLE MIDDLEWARE
In the past decade, middleware has emerged as a major building block in supporting the construction of distributed applications[14]. The development of distributed applications has been greatly enhanced by middleware. Middleware provides application developers with a reusable software layer that relieves them from dealing with frequently encountered problems related to distribution, such as heterogeneity, interoperability, security, scalability, and so on[14][15][16][17]. Implementation details are encapsulated inside the middleware itself and are shielded from both users and application developers, so that the infrastructure's diversities are homogenized by middleware [18] [19] [20] [21]. These attractive features have made middleware an important architectural component in distributed system development practice. Further, with applications becoming increasingly distributed and complex, middleware appears as a powerful tool for the development of software systems [14].
Recently, a strong incentive has been given to the research community to develop middleware that provides fault tolerance to distributed applications[2]. Middleware support for the construction of dependable distributed systems has the potential to relieve application developers of this burden by making the development process faster and easier and by significantly enhancing software reuse. Hence, such middleware can provide building blocks to be exploited by applications for enforcing the dependability property [2].
However, building a software infrastructure that achieves this dependability goal is not an easy task. Neither the standards nor the conventional implementations of middleware directly address complex problems related to dependable computing, such as partial failures, detection of and recovery from faults, network partitioning, real-time quality of service or high-speed performance, group communication, and causal ordering of events[9]. In order to cope with these limitations, many research efforts have focused on designing new middleware systems capable of supporting the requirements imposed by dependability [5].
A first issue that needs to be addressed by dependable middleware is interoperability[2]. Interoperability allows different software systems to exchange data via a common set of exchange formats, to read and write the same file formats, and to use the same protocols. As a result, in order to be useful, dependable middleware should be interoperable[2]. Through interoperability, dependable middleware can provide a platform-independent way for applications to interact with each other[2]. In other words, two systems running on different middleware platforms can interoperate with each other even when implemented with different programming languages, operating systems, or hardware facilities [2].
Another important problem concerns transparency. Dependable middleware should provide some form of transparency to applications[2]. Transparency allows the middleware to be added dynamically to an existing distributed application and to interfere as little as possible with the application at runtime. Therefore, many existing applications can benefit from dependable middleware [2]. Traditional middleware is built adhering to the metaphor of the black box. Application developers do not have to deal explicitly with problems introduced by distribution. Middleware developed upon network operating systems provides application developers with a higher level of abstraction. The infrastructure's diversities are hidden from both users and application developers, so that the system appears as a single integrated computing facility [16].
Although the transparency philosophy has proved successful in supporting the construction of traditional distributed systems, it cannot serve, in today's computing settings, as the guiding principle for developing the new abstractions and mechanisms needed by dependable middleware to foster the development of dependable distributed systems[15][18][19]. As a result, it is important to adopt an open implementation approach to the engineering of dependable middleware platforms, in the sense of allowing inspection and adaptation of underlying components at runtime[22][23][24][25].
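As a sketch of what inspection and adaptation can look like, the following hypothetical Python example exposes a meta-object through which an application can inspect the middleware's current replication policy and swap it at runtime; all names are invented for illustration:

```python
class MetaObject:
    """Open implementation: a meta-level interface through which an
    application can inspect and adapt the middleware at runtime."""

    def __init__(self, base):
        self.base = base

    def inspect(self):
        # Inspection: expose normally hidden internals (e.g., the current
        # replication policy) instead of hiding them behind a black box.
        return {"replication_policy": type(self.base.policy).__name__}

    def adapt(self, new_policy):
        # Adaptation: swap the fault tolerance strategy while running.
        self.base.policy = new_policy

class ActivePolicy:
    def invoke(self, request):
        return f"active replication of {request!r}"

class PassivePolicy:
    def invoke(self, request):
        return f"primary-backup handling of {request!r}"

class Middleware:
    def __init__(self):
        self.policy = ActivePolicy()
        self.meta = MetaObject(self)   # the reflective "meta" interface

    def invoke(self, request):
        return self.policy.invoke(request)

mw = Middleware()
print(mw.meta.inspect())               # {'replication_policy': 'ActivePolicy'}
mw.meta.adapt(PassivePolicy())         # adapt, say, to a degraded network
print(mw.invoke("read feeder state"))  # now served via primary-backup
```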
With networks becoming increasingly pervasive, major system requirements posed by today's networking infrastructure relate to openness and context-awareness [14]. This leads us to investigate new approaches to middleware with support for dependability and context-aware adaptability. However, in order to provide transparency, traditional middleware must make decisions on behalf of the application. This is inevitably