We collect a lot of data in the water/wastewater industry. Data is required for regulations, engineering, operations, and process improvement; we need to be able to collect, store, and retrieve quality data securely and reliably. Our methods of collecting this data range from random sticky notes jotted down by operators to voluminous databases filled with data from Internet of Things (IoT) devices. When we talk about data loss prevention, we generally assume that the data has already been collected and stored, and in the information technology (IT) world, this is mostly true. In the operational technology (OT) world, using the MSDR3 (Monitor, Security, Data Retention, Retrieval and Redundancy) method, data loss prevention begins with the instruments that provide the data and the operators who maintain them, and ends with our data deletion policies.
By leveraging existing OT systems combined with IT systems and procedures, we can ensure that the data we are working with will help lead to the desired outcomes. By focusing on maintaining data quality, we can utilize our existing security, network, and controls systems towards achieving that goal and filling in the gaps where they exist. With the convergence of IT/OT technologies, skills, and practices, we can form a single method that when implemented allows us to maintain IT/OT segmentation while ensuring we collect and deliver quality and reliable data.
“We are entering a new world where data may be more important than software.” - Tim O'Reilly
With the rise of industrial automation and artificial intelligence (AI) in manufacturing and production, increased regulations around data collection and retention, rapid developments in instrumentation and controls, and a marked increase in the need for secure data transmission, our methods of data handling need to evolve. We need to update the ways data sources are monitored; data is validated, verified, and stored; and retrieval is optimized. Between 2010 and 2020, the creation, capture, and consumption of data grew by a massive 5,000%, making the need for cohesive policies and strategies more important than ever before.
No longer are manual readings written on a log sheet on a clipboard. With instrumentation, programmable logic controllers (PLCs), and databases replacing them, we have dramatically reduced human error, but the potential for error remains. As systems become increasingly complex, they introduce a slew of high-level issues regarding how to store, handle, and deliver an exceedingly large amount of digital data reliably and securely.
The potential consequences of data loss or inaccurate data with respect to regulations range from multi-million dollar fines to possible criminal charges. When data is used for operational design and planning, inaccurate data can lead to multi-million dollar projects failing to achieve the desired outcomes, posing a threat to life, limb, and health, along with other operational consequences.
Knowledge of multiple technological fields, including network architecture and management, database management and scripting, security policy, instrumentation and controls, PLC and human-machine interface (HMI) development, and basic IT skills, is a valuable tool when developing our data handling policies. As discussed later in this article, IT/OT segmentation is crucial, but IT skills are still necessary. While specific products or applications may be mentioned, this is by no means an endorsement but a demonstration of real-world scenarios. The practices covered here do not require any particular application, program, or equipment and should be applied regardless of your preferred vendors or architecture.
The methods outlined in this article are broken into three principal components: Monitoring, Security, and Data Handling, with the latter subdivided into three categories (Retention, Retrieval, and Redundancy). An argument could be made for placing security above monitoring, and in a controls philosophy (safety through security) or in the IT realm, where confidential data exposure is possible (security through obfuscation), that would make sense; here, however, we are talking specifically about SCADA data and data handling. The argument, then, is that without reliable data, the need for secure data ceases to exist. First, we need to make sure the data is reliable and usable.
“Plans are useless, but planning is indispensable.” - Dwight D. Eisenhower
The question then becomes, “How do we make sure our data is reliable and useful?”
Which elements of our system can we monitor to ensure the data we are receiving meets the usability standards we have defined for our needs? What is the primary source of our data? In the industrial world, we could, and should, start with our edge devices: the instrumentation and equipment providing the values. While there are many measures and methods to determine the reliability of instrumentation and equipment, one of the most dependable is plausibility checking, which compares values against a normalized value range and an average rate of change.
Let's take a pH meter, for example. The generally accepted value range for pH is 0 to 14. By monitoring this value, we can determine that anything outside this range is no longer reliable data and could indicate an issue with the transducer, the transmitter, or any intermediary device between the data source and the collection system. We can also trend the rate of change of a particular value and alert on an abnormal rate of change. We can use several methods to make sure our edge devices are maintaining reliable connections and providing valuable data; using manufacturer-equipped fail and range functions gives us a device-level opportunity to monitor.
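As a rough illustration of these plausibility checks, the following Python sketch applies a range check and a rate-of-change check to incoming pH readings. The limits, rate threshold, and alerting approach are assumptions for the sake of the example; in practice, the same checks would typically live in the controller or SCADA layer rather than in a standalone script.

```python
# Minimal plausibility-check sketch; limits and thresholds are illustrative assumptions.
from datetime import datetime

PH_MIN, PH_MAX = 0.0, 14.0      # generally accepted pH range
MAX_RATE_PER_MIN = 1.0          # hypothetical maximum plausible pH change per minute

def check_ph(previous_value, previous_time, current_value, current_time):
    """Return a list of plausibility issues for a new pH reading."""
    issues = []

    # Range check: values outside 0-14 point to a transducer, transmitter,
    # or intermediary device problem rather than a real process condition.
    if not PH_MIN <= current_value <= PH_MAX:
        issues.append(f"value {current_value} outside plausible range {PH_MIN}-{PH_MAX}")

    # Rate-of-change check: flag abnormally fast swings between readings.
    minutes = (current_time - previous_time).total_seconds() / 60.0
    if minutes > 0:
        rate = abs(current_value - previous_value) / minutes
        if rate > MAX_RATE_PER_MIN:
            issues.append(f"rate of change {rate:.2f} pH/min exceeds {MAX_RATE_PER_MIN}")

    return issues

# Example usage: a jump from 7.2 to 13.9 in one minute trips the rate-of-change check.
for issue in check_ph(7.2, datetime(2024, 1, 1, 8, 0), 13.9, datetime(2024, 1, 1, 8, 1)):
    print("ALERT:", issue)
```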
Along with monitoring the physical connection, we also need to make sure the devices providing the values are providing reliable data. This typically involves calibration procedures and standard operating procedures (SOPs) regarding general maintenance, upkeep, and the security of the devices themselves. The methods used for these procedures vary from device to device, but are just as important as the physical connection, and thus should be monitored and tracked diligently.
Once we have defined methods for monitoring our raw data sources and determined that the data we are receiving is reliable and useful, we need to make sure the methods by which we transfer the data to our database remain reliable, and that we are collecting data at usable or required intervals. This could include monitoring the PLC or distributed control system (DCS) controller cycle time as well as uptime.
This allows us to catch hardware interruptions before they can affect the data itself. We can use plausibility checking to give us program- and controller-level monitoring, and we can monitor the log times of the most recent data as well as the time differentials between specific logged records to confirm proper intervals. Our transfer methods include everything from the physical connection between the PLC or SCADA network and our database server to the programmatic methods we use to move the data into the database, whether that is a Dynamic Data Exchange (DDE) or open database connectivity (ODBC) connection, an open platform communications (OPC) or OPC unified architecture (OPC UA) interface, a data logging function built into the SCADA software, a message queuing telemetry transport (MQTT) broker, or custom queries running directly from a data server.
While the methods used to monitor our data connections will vary, the basic idea is to make sure that the services driving the data transfer are functioning as required; that is, that the connections through which we are transferring the data remain reliable (discussed in more depth later) and that we are logging the data at the required intervals. The methods we use to achieve these goals will vary widely from system to system. With custom scripting (PowerShell, Python, Visual Basic for Applications [VBA], etc.) to monitor time differentials in logged data as well as the connection status of the database itself, we can ensure a high level of reliability.
We can also use software and custom scripts to monitor connections to the network interfaces as well as specific running services (e.g., SQL server agent, automated jobs, or data transfer executables). Depending on the system you are using, application and security event logs can provide valuable information and can be monitored. We should monitor the size of our database files to keep them within an acceptable range and reduce the potential for disruptions in data collection. We should have the ability to notify the relevant people if a script fails to run as scheduled, returns incorrect or invalid values, or the time differentials are out of range. This would typically be handled with email and text message capabilities as well as potentially alerting the operators through the SCADA or HMI alarming system when possible.
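As a sketch of what such a script might look like, the following Python example checks the timestamp of the most recently logged record and alerts when the gap exceeds the required interval, or when the database connection itself fails. The DSN, table and column names, and mail settings are placeholders, and database access is assumed to go through the pyodbc library.

```python
# Hedged monitoring sketch; DSN, table, columns, and mail settings are placeholders.
import pyodbc
import smtplib
from email.message import EmailMessage
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=15)   # required logging interval

def latest_log_time(conn_str="DSN=ScadaHistorian"):
    """Return the timestamp of the most recently logged record (hypothetical table)."""
    conn = pyodbc.connect(conn_str, timeout=10)
    try:
        row = conn.cursor().execute("SELECT MAX(LogTime) FROM dbo.TagLog;").fetchone()
        return row[0]
    finally:
        conn.close()

def notify(subject, body):
    """Send a plain-text alert; the mail host and addresses are placeholders."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "scada@example.local", "oncall@example.local"
    msg.set_content(body)
    with smtplib.SMTP("mail.example.local") as smtp:
        smtp.send_message(msg)

try:
    last = latest_log_time()
    if last is None or datetime.now() - last > MAX_GAP:
        notify("Data logging gap", f"Last logged record: {last}; expected a record every {MAX_GAP}.")
except pyodbc.Error as exc:
    # A failed connection or query is itself a monitored condition worth alerting on.
    notify("Historian connection failure", str(exc))
```

A scheduler (SQL Server Agent, Windows Task Scheduler, cron, etc.) would run a script like this on its own interval, independent of the data collection path it is watching.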
To summarize, monitoring is of the utmost importance; without reliable, quality data, the rest is irrelevant. The usefulness of our data depends entirely on its quality, so organizations should continually pursue oversight and verification of data quality.
Some of the elements of the monitoring component include:
1) Edge device value ranges and rates of change (plausibility checking),
2) Physical connections and manufacturer fail and range functions,
3) Calibration, maintenance, and security procedures for the devices themselves,
4) Controller cycle times and uptime,
5) Data logging times and the time differentials between logged records,
6) Data transfer services, network interfaces, and database connections,
7) Database file sizes, and,
8) Alerting and notification paths (email, text, and SCADA or HMI alarms).
The overall goal is to monitor many of the potential points of failure from the data source to the database itself. Using manufacturer-provided fail and range functions gives us a device-level opportunity for monitoring, while deep packet inspection gives us a network-level opportunity, and plausibility checking gives us program and controller level monitoring. Once we are certain that we are obtaining reliable and valuable data, the next step is to secure the data and prevent data loss or corruption.
“Everyone’s got a plan until they get hit.” - Mike Tyson
While the network element of the security component could constitute a section of its own, networking and security have intersected to a point that it is prudent we include them together. With the convergence of IT/OT networks necessary for businesses to keep up with production and reach their goals, we need to advance our strategies for securing not only our data, but the data sources themselves. We need to be able to detect, track down, and eliminate any anomalous behavior in a timely manner before it can influence our system.
The rise of the Fourth Industrial Revolution, colloquially known as “Industry 4.0,” has caused industries to integrate a variety of digital solutions into their networks in the form of “smart” devices and instrumentation to remain competitive, maintain compliance with regulations, and improve quality of service. These devices, while making data collection more efficient and available, also require us to evaluate the interconnectedness of our networks and refine our methods of network segmentation. What was traditionally an air-gapped network is now connected to IT networks along with new technologies, creating the IoT and subsequently the Industrial Internet of Things (IIoT), which consequently has introduced a myriad of cybersecurity issues.
While servers, database technologies, networking devices, and data analytics typically belong in the realm of IT, IIoT devices have allowed us to integrate a wide assortment of sensors for gathering real-world conditions. Devices like temperature, pressure, and chemical composition sensors as well as actuators that translate digital commands and instructions into physical actions (such as controlling valves and moving mechanisms) now provide a wide array of data that we can utilize in day-to-day operations as well as in design and engineering. Integrating IoT devices into our network and operations has required that we dive headfirst into the world of IT while maintaining the safety and availability of our systems.
While IT security generally prioritizes confidentiality and OT prioritizes safety, here we focus on securing our data to prevent potential loss or corruption. The type of data, as well as the treatment of data, differs significantly between IT and OT. Much of the data that IT deals with requires encryption due to its sensitive nature (passwords, records covered by HIPAA, social security numbers, etc.). However, a breach of IT data typically will not affect real-world process outcomes or threaten life and limb. OT data, by contrast, typically does not require encryption as strong (unless the data is proprietary, as with pharmaceutical or manufacturing procedures), but our data sources require more protection than IT data does, because OT data controls the processes and outcomes of the physical world.
With a large amount of our instrumentation and controls connected to our network, we should keep in mind that many of these devices have configuration capabilities and interfaces that provide a potential exploit into our system. When commissioning such devices, the first thing we should do is change the default username, password, and network settings. This allows us to reduce the possibility of a security breach and gives us a device level opportunity to protect the integrity of our data and device configuration.
Ideally, in a new system, you can segment your IT and OT networks. This is important for a variety of reasons, many of which would warrant an article of their own (and much has been written on the subject). For the sake of brevity, as the security component has the potential to get very complex (especially if segmentation was not taken into consideration while designing the system), much of this article will focus on network security and encryption, as those will be the primary sources of protection for our data.
While there are numerous network configurations, the most used in industrial settings are flat networks, firewall rules, dual-homed hosts, and remote access servers. Realistically, not all of these methods provide reliable security or segmentation.
While a flat network design can allow us to reduce cost in design and infrastructure by connecting everything to a single switch, we do so at the cost of increased security risk.
A screened subnet firewall network architecture has three components:
1) An exterior (screening) router or firewall facing the untrusted network,
2) A perimeter network (the screened subnet, or DMZ) that hosts externally reachable services, and,
3) An interior router or firewall separating the perimeter network from the trusted internal network.
The use of an additional “layer” and other aspects of the screened subnet firewall make it a viable choice for many high-traffic or high-speed sites. Screened subnet firewalls also help with throughput and flexibility.
Along with network segmentation, network management is also a crucial component of the security of your data. The idea is to manage and monitor the access of the network, traffic payload content (destination, source, type), minimize disruption of your data collection, maintain a standard of high performance, and avoid security issues. With a well-managed network, we can monitor the devices that have access to our network using numerous filtering techniques, monitor and maintain performance, and potentially catch issues before they can affect our network. Network management also allows us to get a jumpstart on redundancy by using a redundant configuration as well. More detail on this can be found in the next section.
“Humans continue to be the weakest link in information security. Whether deliberate or accidental, the actions of employees can quickly destroy a company. Organizations must keep this in mind when creating information security policies and while implementing safeguards.” - Rachel Holdgrafer
Because much of our SCADA or industrial control system (ICS) infrastructure runs on legacy software and platforms, it may lack sufficient security for our purposes and should be segmented from the network as much as possible while still allowing us to maintain the connection. There are solutions that can add security to these types of sites, like the Hirschmann Xenon Tofino industrial firewall, which adds a zone-level layer of security.
For applications and industries concerned with proprietary data or confidentiality, encryption should be considered for both storage and transport. Methods for encrypting our data storage include transparent data encryption; symmetric encryption with a set session key (advanced encryption standard [AES]-128 or AES-256, staying away from Rivest Cipher 4 [RC4], which is considered broken); and certificate-based asymmetric encryption to securely exchange session keys. Using cryptographic protocols such as secure sockets layer (SSL) or, preferably, transport layer security (TLS) to authenticate and encrypt data transfers between servers or systems provides endpoint encryption that prevents unauthorized access. We can also protect against web application attacks by using stored procedures to obfuscate our queries and table/database structures as well as protect against SQL injection attacks.
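As a small illustration of the stored procedure approach, the sketch below calls a hypothetical procedure through a parameterized ODBC query using pyodbc. The procedure name, parameters, and DSN are assumptions; the point is that user input is bound as data by the driver rather than concatenated into SQL text.

```python
# Hedged sketch: parameterized call to a hypothetical stored procedure via pyodbc.
import pyodbc

def get_tag_history(tag_name, start, end, conn_str="DSN=ScadaHistorian"):
    """Retrieve logged values through a stored procedure; the table structure stays hidden."""
    conn = pyodbc.connect(conn_str)
    try:
        cursor = conn.cursor()
        # The driver binds the parameters, so input like "'; DROP TABLE TagLog; --"
        # is treated as data rather than executable SQL.
        cursor.execute("{CALL dbo.usp_GetTagHistory (?, ?, ?)}", (tag_name, start, end))
        return cursor.fetchall()
    finally:
        conn.close()

# Example usage with placeholder tag name and date range.
rows = get_tag_history("EFF_PH", "2024-01-01", "2024-01-02")
```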
As we can see, the security component is quite complex and spans the gamut of our entire system from the devices providing the data to the applications used to retrieve and interact with our data. We can and should spend a significant amount of time during the design and implementation of our system focusing on security in detail.
To summarize:
1) Change default credentials and network settings when commissioning connected devices,
2) Segment IT and OT networks, using an architecture such as a screened subnet where appropriate,
3) Manage and monitor network access, traffic content, and performance,
4) Add zone-level protection (such as an industrial firewall) in front of legacy SCADA/ICS equipment,
5) Encrypt proprietary or confidential data in storage and in transit, and,
6) Use stored procedures and parameterized queries to protect against SQL injection.
“Experts often possess more data than judgment.” - Colin Powell
The first thing we need to consider when forming data retention policies is a destination for the data. There are many options for database selection, but we should keep a few things in mind when making the selection, such as what type of database we will use (Microsoft Access, Microsoft SQL Server, PostgreSQL, NoSQL, etc.; with SCADA application data generally requiring a more rigid structure, we would typically select a relational database such as SQL Server or PostgreSQL over a NoSQL type). What are the size, speed, and memory limits of the database (which can be a primary metric we monitor)? How is the database managed and by whom (this has traditionally been handled by IT)? What are the native capabilities for monitoring, security, redundancy, and retrieval? How will we communicate with the database? Does our database reside on its own machine or within our application server? In the case of an existing ICS system, we need to choose a database that is compatible and can communicate with it. Once we have selected a database, we can determine the methods of transferring data to it.
While most SCADA systems have data logging capabilities built in that have pre-defined table structures and handle the transfer of data, the ways in which these tables are structured or indexed (or not indexed) may not work for our needs, which is a factor that should be considered. The formats of these tables may also hinder our ability to run queries at the required speed to retrieve information and may need to be modified. The methods and policies regarding data retention will vary from industry to industry depending on regulatory needs, engineering and process needs, confidentiality agreements as well as other factors. While our retention needs may vary, there are a few best practices we can follow regardless of the needs of the data itself.
A good retention policy will outline what type of data we need to store, why the data is being stored, how long the records are kept, which people or applications have access to the data, how they will interface with the data, and how transactional data is protected along with how we will make exceptions to our policies in case of lawsuits or other disruptions.
Assuming a retention policy development team is already in place (if not, now would be a good time to form one), then the first task in constructing our retention policies is the selection and classification of data. We need to identify the important data and define the sensitivity and purpose of the data. What data do we need to retain? What is the data used for? Is the data of a sensitive or proprietary nature, and does it need to be encrypted? Our goal is to categorize various information assets based on the contents of the data and the audience to which it will be delivered and tag them as such.
Our classification policy should outline whether the data is public (it may be freely disclosed), internal (low sensitivity, but not meant for public disclosure), or restricted (a breach could negatively impact operations or create a risk to the organization). We also need to determine who can modify, delete, or move the data and how to track and audit those changes.
We need to keep in mind that SCADA data will typically have a different retention lifetime than that of IT or business data, as well as different sensitivity and security needs. While some data may be allowed to expire after a few minutes, we may need to retain other data for months or years depending on the purpose. There are cases where operational data will be used to update master plans, in rehabilitating or updating existing facilities, or in research for design and process development or improvement. In these cases, the data may not need to be as granular, but may be required to be stored for longer periods of time. In some cases, it may not be feasible or operationally efficient to store years’ worth of data in a single database. The larger the database gets, the longer it takes to retrieve data and run queries. Having archiving as part of retention policies allows us to deal with those issues whilst maintaining compliance. Our archiving policies allow us to free up application space, improve reporting and application performance, reduce storage costs, and reduce litigation exposure.
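The archiving step itself can be automated. Below is a minimal sketch, assuming a SQL Server historian reached through pyodbc, that moves rows older than a one-year retention window into an archive table and removes them from the transactional table in a single transaction; the table names, retention window, and DSN are placeholders.

```python
# Hedged archiving sketch; table names, retention window, and DSN are placeholders.
import pyodbc
from datetime import datetime, timedelta

RETAIN_DAYS = 365   # assumption: keep one year of granular data in the transactional table

def archive_expired(conn_str="DSN=ScadaHistorian"):
    cutoff = datetime.now() - timedelta(days=RETAIN_DAYS)
    conn = pyodbc.connect(conn_str)          # autocommit is off, so both steps share a transaction
    try:
        cursor = conn.cursor()
        # Copy expired rows into the archive table...
        cursor.execute(
            "INSERT INTO dbo.TagLogArchive (TagName, LogTime, Value) "
            "SELECT TagName, LogTime, Value FROM dbo.TagLog WHERE LogTime < ?;", cutoff)
        # ...then remove them from the transactional table using the same cutoff.
        cursor.execute("DELETE FROM dbo.TagLog WHERE LogTime < ?;", cutoff)
        conn.commit()                        # both steps succeed together or not at all
    except pyodbc.Error:
        conn.rollback()
        raise
    finally:
        conn.close()
```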
The second part of our retention policy should be compliance, which is a continual and ongoing process. Compliance policies ensure that our data retention adheres to government, business, industry, or agency rules as well as regulatory requirements. For instance, the North American Electric Reliability Corporation (NERC) instructs entities to maintain data for an entire compliance verification period, meaning the data collected must be retained for the whole 3- to 6-year auditing period. These compliance regulations vary across industries as well as geographical locations, so we would first identify our regulatory agency or agencies to obtain their specific compliance frameworks and requirements. At some point, the data we are retaining will exceed our required retention period and lose its useful value. It is at this point that we need a data deletion and purge policy, which supplements our retention policy and should include our archiving policies.
Most data has a finite life and should be disposed of once the retention, archival, legal, and regulatory requirements have been met. This lifecycle management should be performed systematically and within the bounds of our retention policy framework. We must be able to adapt to changes in retention policies (for example, changes in the regulatory framework regarding data collection interval requirements or data classification changes). These policies rely heavily on our data protection procedures to maintain the integrity of the data during its lifetime. For this integrity to remain trustworthy, the data should be stored in a manner that is immutable, meaning we can be sure the data has not been changed throughout its lifecycle. Once the data has reached the end of its lifecycle, it must be disposed of with as much care as it has been retained, as this is often a threat actor’s easiest means of data theft. Setting expiration dates and automating the deletion of data allows us to maintain a sound and consistent deletion policy as well as decrease storage and costs. We should always set a date or time retention range after which data should be archived as well.
We can use automated maintenance plans that allow us to perform tasks like rebuilding indexes, purging history, performing backups, clearing cached files, and other functions. All these tasks will help us maintain the speed of retrieval and integrity of our data. This automation also reduces the potential for human error in data maintenance. By automating these services, we can generally guarantee that the functions will run the same way every time. For instance, if we have a delete query to purge data beyond a certain date, we can be assured that a person is not going to accidentally use a less than (<) symbol rather than a greater than (>) symbol, potentially deleting data that we are required to retain rather than data that has expired, per our retention policies.
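As one example of what such automation might look like, the sketch below rebuilds the indexes on a logging table and purges backup history older than a 90-day window. It assumes a SQL Server database reached through pyodbc and would typically be triggered by SQL Server Agent or another scheduler; the object names are placeholders.

```python
# Hedged maintenance sketch; object names and the 90-day window are placeholders.
import pyodbc
from datetime import datetime, timedelta

def run_maintenance(conn_str="DSN=ScadaHistorian"):
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        cursor = conn.cursor()
        # Rebuild indexes on the logging table to keep retrieval and queries fast.
        cursor.execute("ALTER INDEX ALL ON dbo.TagLog REBUILD;")
        # Purge backup/restore history older than the retention window.
        oldest = datetime.now() - timedelta(days=90)
        cursor.execute("EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = ?;", oldest)
    finally:
        conn.close()
```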
In the age of seemingly unending memory and storage, it is tempting to hoard data, and this is an urge we need to resist. Hoarding data will cause a decrease in available storage and an increase in associated costs. Retain only the data that is required and do not be afraid to delete or archive it.
“Data are just summaries of thousands of stories—tell a few of those stories to help make the data meaningful.” - Dan Heath
With all this work to monitor, collect, secure, and retain data, we should not forget the purpose in doing so. We are collecting and maintaining data to be used. If we are not using the data, or do not see a future use for it, then there is no purpose in collecting it. A few things we need to decide in the retrieval and presentation of our data are who will have access to the data, how will we provide it, and how will we present it?
Our goal is to harness insights from the collected data. While we can gain a modicum of insight from data alone, a good presentation exponentially and immediately increases our ability to disseminate and analyze it. We are taking the data and turning it into information that can be consumed in a more viable and visual fashion. To make this a reality, we need to be working with reliable and quality data.
The retrieval process is directly affected by our security and data retention policies. We do not want direct interaction with our transactional data, as it opens us up to potential security issues as well as accidental deletion or modification of data that we want to remain immutable. During the retrieval and delivery of data, we can potentially open ourselves up to SQL injections, hence the recommendations for using encryption and stored procedures as part of our security policy. Using stored procedures also allows us to retrieve and present data at a faster rate, reduce network traffic, and avoid potential security issues. While the monitoring and security components have enabled us to ensure our transactional data is reliable and usable, and our retention policies make sure the data retrieved from our transactional server is secure and auditable, there is still the possibility of corruption during retrieval and delivery.
Even with the data in a data warehouse and our transactional data protected, we still need to restrict direct interaction with our raw data. Using views or virtual tables allows us to do just that.
A view is the result set of a stored query on the dataset. It is essentially a predefined query that retrieves a dataset from our raw data and creates a virtual table that our end users can interact with without affecting the raw data. This also gives us an opportunity to increase the speed at which we deliver data: by creating views that serve only the data necessary for the current task, we reduce the need to iterate through every row in the table. Set-based operations perform faster in almost every case and allow us to simplify the code.
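The sketch below shows the idea, assuming a SQL Server historian and pyodbc: a view exposes a daily pH summary from a hypothetical logging table, and report tools query the view rather than the raw table. The object names are assumptions, and the CREATE OR ALTER syntax requires a reasonably recent SQL Server version.

```python
# Hedged sketch: a view exposing only a daily pH summary; names are placeholders.
import pyodbc

CREATE_VIEW = """
CREATE OR ALTER VIEW dbo.vw_DailyEffluentPH AS
SELECT CAST(LogTime AS date) AS LogDate,
       MIN(Value) AS MinPH,
       MAX(Value) AS MaxPH,
       AVG(Value) AS AvgPH
FROM dbo.TagLog
WHERE TagName = 'EFF_PH'
GROUP BY CAST(LogTime AS date);
"""

conn = pyodbc.connect("DSN=ScadaHistorian", autocommit=True)
try:
    cursor = conn.cursor()
    cursor.execute(CREATE_VIEW)   # one-time (or deploy-time) step
    # Reports and end users query the view; the raw table is never touched directly.
    for row in cursor.execute("SELECT * FROM dbo.vw_DailyEffluentPH ORDER BY LogDate;"):
        print(row.LogDate, row.MinPH, row.MaxPH, row.AvgPH)
finally:
    conn.close()
```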
“Redundancy is ambiguous because it seems like a waste if nothing unusual happens. Except that something unusual happens—usually.” - Nassim Nicholas Taleb
The redundancy of our data begins at the instrumentation and OT assets (e.g., PLC, remote terminal unit [RTU], etc.). This is where we begin our data value (data point) redundancy. The goal of our data value redundancy is to maintain a reliable value that is accessible for retrieval in the case of a data transfer failure. Beyond the point of the data value redundancy, we get into database and data file redundancy, discussed further below.
Understandably, it is not generally cost-effective to maintain two complete systems for every data point (although for maximum redundancy, this is ideal). For example, if we have a pH transducer communicating with a transmitter that provides a value to our SCADA system, that edge device (the pH system) may be a single point of failure. If the probe fails, the data from the failed device is no longer reliable or quality data based on our standards. At that level of our system (the device level), this may be an acceptable failure as long as we have a spare transducer or transmitter on hand. Our monitoring system should have alerted us to the failure and allowed us time to correct the issue before it caused us to violate our data policies. Beyond the device level is where we can really begin to implement our redundancy.
Many devices that provide regulatory values have built-in or expandable storage. For example, a Hach SC200 has a secure digital (SD) card slot that allows onboard storage of around 30 days' worth of data. Assuming your transducer maintains reliability, this is a great point of redundancy for quality data in the case of a network or SCADA device failure. We can also use independent data logging devices to serve the same purpose as a low-cost, device-level redundancy solution.
Beyond device-level redundancy, we can use a store and forward system in the case of a database or data transfer failure. This generally requires an array of values at the controller level and timed interval data snapshots. For instance, we could be required to log a value every 15 minutes; to store a day's worth of data, this would require an array of 96 data points.
In this example of an Allen Bradley Logix system, we've built a data structure that includes an index and value array and another structure that includes an index and a timestamp.
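Since the controller structure itself is not reproduced here, the following Python sketch illustrates the same concept: a 96-slot buffer holding an index, a value array, and matching timestamps, from which any snapshots the database missed can be forwarded once the connection is restored. It is a conceptual stand-in for illustration, not controller code.

```python
# Conceptual Python stand-in for the controller-level store and forward buffer;
# in the PLC this would be a user-defined type with an index, value array, and timestamps.
from datetime import datetime

SLOTS = 96   # 24 hours of 15-minute snapshots

class StoreAndForward:
    def __init__(self):
        self.index = 0
        self.values = [None] * SLOTS
        self.timestamps = [None] * SLOTS

    def log_snapshot(self, value):
        """Called every 15 minutes; once full, the buffer wraps and overwrites the oldest slot."""
        self.values[self.index] = value
        self.timestamps[self.index] = datetime.now()
        self.index = (self.index + 1) % SLOTS

    def forward_pending(self, last_received_time):
        """After a transfer failure, return the snapshots the database has not yet received."""
        return sorted(
            (t, v) for t, v in zip(self.timestamps, self.values)
            if t is not None and t > last_received_time)
```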
The next layer of data value redundancy would be something like a historian module. There are many manufacturers of historian modules, but they all achieve the same goal: the historian collects data from multiple controllers at very high speeds. The memory on these historian modules varies but is generally 1 gigabyte (GB) and up. In a test case historian, an Allen Bradley 1756-HIST1G, 1,000 double integer points require a total memory of 61 megabytes (MB) of the 1,024 MB available (this memory includes device configuration and archive files). This allows us to store one month's worth of all regulatory data points.
Let's say we are required to maintain data in 15-minute intervals. This means we can lose up to 14 minutes' worth of data without violating regulatory policies; if we lose that 15th minute, we have a violation. In this case, we need to do a differential backup every 15 minutes. In the case of a SQL server, this would require us to back up our transaction logs every 15 minutes or perform a differential backup at the same interval. The transaction log in a SQL server records every transaction made to the database and is the most important component of the database when it comes to disaster recovery, as long as the file integrity is maintained. Every time the database is modified (a transaction occurs), a log record is written to the transaction log. These log backups are a recoverable snapshot of the last 15 minutes of data transactions. We do need to keep in mind that we cannot restore a transaction log without a previous full backup.
Our transaction logs should be backed up at the required intervals between the full and differential backups to allow for maximum data resilience. The full backup requires quite a bit of space, as it is the size of the entire database. Full backup files, as well as transaction log and differential backups, should be disposed of based on our data retention policies once they are no longer needed. Transaction logs, while they do not require as much space, can still consume storage if not managed properly, as we generate them at a much faster rate than full or differential backups. Typically, we can dispose of a transaction log backup once the data it contains has been captured by a subsequent full or differential backup.
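A minimal sketch of the 15-minute log backup follows, assuming a SQL Server database reached through pyodbc and an external scheduler (SQL Server Agent, Task Scheduler, cron, etc.) that invokes it on the required interval; the database name and backup path are placeholders.

```python
# Hedged sketch of the 15-minute transaction log backup; names and paths are placeholders.
import pyodbc
from datetime import datetime

def backup_transaction_log(database="ScadaHistorian", backup_dir=r"D:\Backups\Log",
                           conn_str="DSN=ScadaHistorian"):
    stamp = datetime.now().strftime("%Y%m%d_%H%M")
    backup_file = rf"{backup_dir}\{database}_log_{stamp}.trn"
    # BACKUP cannot run inside a user transaction, so connect with autocommit enabled.
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        cursor = conn.cursor()
        # The path is generated above, not taken from user input, so a literal is acceptable here.
        cursor.execute(f"BACKUP LOG [{database}] TO DISK = N'{backup_file}';")
        while cursor.nextset():   # consume informational result sets so the backup completes
            pass
    finally:
        conn.close()
```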
In the following scenario, data loss occurs at 1:40am. We can restore our full backup from 12:00am, the differential backup from 1:00am, and the transaction log backups taken between 1:00am and 1:30am. If the active transaction log is still intact, the log backup scheduled for 1:45am (taken as a tail-log backup) allows us to roll forward to as late as 1:39am, so only the changes made between roughly 1:39am and the failure are lost, keeping the data loss within an acceptable range. Our monitoring system should have alerted us to the data interval logging error in time to correct the issue.
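The recovery sequence for this scenario might look roughly like the following, assuming SQL Server, pyodbc, and placeholder file names: the full backup is restored first, then the differential, then each log backup, with the final log restored only up to the point just before the failure.

```python
# Hedged restore sketch for the timeline above; file names, paths, and DSN are placeholders.
# The connection must target a different database (e.g., master), and exclusive access is assumed.
import pyodbc

RESTORE_SEQUENCE = [
    "RESTORE DATABASE [ScadaHistorian] FROM DISK = N'D:\\Backups\\Full\\ScadaHistorian_full_0000.bak' "
    "WITH NORECOVERY, REPLACE;",
    "RESTORE DATABASE [ScadaHistorian] FROM DISK = N'D:\\Backups\\Diff\\ScadaHistorian_diff_0100.dif' "
    "WITH NORECOVERY;",
    "RESTORE LOG [ScadaHistorian] FROM DISK = N'D:\\Backups\\Log\\ScadaHistorian_log_0115.trn' WITH NORECOVERY;",
    "RESTORE LOG [ScadaHistorian] FROM DISK = N'D:\\Backups\\Log\\ScadaHistorian_log_0130.trn' WITH NORECOVERY;",
    # The tail-log (1:45am) backup is restored only up to just before the 1:40am failure.
    "RESTORE LOG [ScadaHistorian] FROM DISK = N'D:\\Backups\\Log\\ScadaHistorian_log_0145.trn' "
    "WITH STOPAT = '2024-01-01T01:39:59', RECOVERY;",
]

conn = pyodbc.connect("DSN=SqlMaster", autocommit=True)
try:
    cursor = conn.cursor()
    for statement in RESTORE_SEQUENCE:
        cursor.execute(statement)
        while cursor.nextset():   # consume informational messages so each restore completes
            pass
finally:
    conn.close()
```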
A good baseline for our backup copies is the 3-2-1 rule:
3) Keep three copies of your data (one primary backup and two copies),
2) Save your backups to two different types of media, and,
1) Keep at least one backup copy off-site.
Treat this as a general guideline and try to improve upon it.
We should have multiple locations for backup file destinations. Our initial backup file would reside on the database server, which allows us to recover and restore data in the case of a network outage. Once the backup is complete, multiple copies of the files should be made and copied to an off-site storage location (or ideally several different locations) to allow for recovery in the case of a catastrophic database system failure. We can also use something like an external hard drive, tape, rotated USB drives, or cloud solutions to allow copies on separate types of media.
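As a simple sketch of the copy step, the snippet below distributes a finished backup file to secondary destinations such as an off-site share and a rotated USB drive. The paths are placeholders, and a cloud copy would normally use the provider's own tooling rather than a plain file copy.

```python
# Hedged sketch of copying a finished backup to secondary destinations; paths are placeholders.
import shutil
from pathlib import Path

def distribute_backup(backup_file, destinations):
    """Copy a completed backup file to each secondary destination (NAS share, rotated USB, etc.)."""
    source = Path(backup_file)
    for dest in destinations:
        target_dir = Path(dest)
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / source.name)

distribute_backup(
    r"D:\Backups\Full\ScadaHistorian_full_0000.bak",
    [r"\\offsite-nas\scada_backups", r"E:\RotatedUSB\scada_backups"],
)
```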
Our complete redundancy from the device level to database would resemble something like the flow chart below:
There are a few different trains of thought as far as the connectivity of these redundant servers. While maintaining the server connection to the network allows you to automate the replication, it also leaves your server open to the same attacks that take down your primary server. If an operator plugs an infected USB drive carrying a self-replicating payload into an enabled port and it travels throughout your network, you may lose both your primary and secondary servers.
For this reason, a manual replication policy with an air-gapped secondary server is a more viable solution. This air gap protects you from any transient virus that may depend on the host maintaining connectivity and availability. Once this host is removed from the network and you've switched to your redundant backup, you have reduced downtime and can investigate and correct the issue with the primary server.
This is also a good practice for preventing a ransomware situation. Your primary server can get hit with a ransomware attack while your air-gapped secondary server remains protected, giving you the ability to fail over, rebuild your primary server, and avoid paying the ransom. Depending on your acceptable range of downtime, however, this may not be a viable solution, because it:
1) Requires more man-hours to make sure your secondary server replication is continually updated, and,
2) Reduces your ability to automatically fail over to the secondary server, thus increasing your downtime.
In this case, we could do something like keeping the secondary server off the domain, disabling all remote access capabilities, and adding a unique local admin password. There should be no access to the administrative shares if domain credentials are breached. While this is not an air gap, it does allow us to effectively restrict access through network resources.
Along with redundancy in data files and adding redundant server hardware, we need to make sure our server image backup policies are designed in a similar fashion. While we probably do not need 15-minute incremental backups of the server itself, we should have regular full and differential backups. Your backup policy could look something like a monthly full image backup with weekly (or daily, depending on the frequency of changes) differential backups.
With all our backup strategies in place for our database/data files as well as our servers, it would be remiss of us not to add resilience to our network. While we have covered our network design, our network topology allows us to add an additional layer of redundancy to our system. While this is a very complex subject, we will go over the basics of the most used redundant topologies.
While network redundancy does not typically add redundancy to end devices, as that redundancy would be provided by the device itself (think dual network interface controller [NIC] cards in a server with an application mechanism to facilitate failover and prevent loopbacks), it is responsible for the availability of the entire topology (depending on the topology you choose; many IIoT devices now support redundant ring topologies).
The primary consideration when choosing a network topology is recovery time. While having a fast stated recovery time is important, there are other factors that determine your actual recovery time, such as how much traffic is on the network, how many devices exist on the network (more devices equal more learned media access control [MAC] addresses), where the failure is in relation to the redundancy manager, transport layer protocols, and so on.
There are many proprietary protocols for network redundancy, but we will only cover the most common standardized ones. Rapid spanning tree protocol (RSTP) is the primary go-to redundancy protocol throughout most of the enterprise world. RSTP creates a resilient mesh network allowing more than two paths between network infrastructure devices. While very flexible, it comes at the cost of slower recovery and may not be the best option for mission-critical or industrial networks. It is easy to implement but requires quite a bit of fine-tuning to improve recovery speeds.
Redundant ring topology is the methodology of choice in the industrial world. Even though ring topologies only allow two paths out of each switch, they improve recovery times significantly compared to other redundant topologies. There are countless proprietary ring protocols, such as X-Ring, resilient ethernet protocol (REP), and N-Ring. Here, we will focus on two non-proprietary standardized protocols.
Device level ring (DLR) protocol allows a realized recovery time of less than 3 milliseconds and provides a nearly “bumpless” failover. If your requested packet interval (RPI) is longer than the ring's recovery time, you will not notice a network drop; it would appear as if nothing had happened, save for the ring fault notification. DLR is part of the Ethernet/IP standard and is supported by many current Ethernet/IP IIoT devices, which allows us to include our edge devices in the redundant network. DLR allows network faults and errors to be quickly remedied. A DLR network includes at least one node that acts as a ring supervisor and any number of normal ring nodes. It is assumed that all the ring nodes have at least two Ethernet ports and incorporate embedded switch technology.
Media redundancy protocol (MRP), originally developed by Hirschmann as the HiPER-ring protocol, is now standardized as part of the IEC 62439-2 industrial communication network standard and can guarantee a maximum recovery time of 200 milliseconds in an application with a maximum of 50 nodes. MRP redundancy failover is controlled by the ring manager (the Media Redundancy Manager), which blocks a port until it detects a failure in the ring; when a failure is detected, the port becomes unblocked and moves to a forwarding state.
To summarize:
1) Begin redundancy at the device level with onboard or independent data logging,
2) Use store and forward arrays and historian modules at the controller level,
3) Schedule full, differential, and transaction log backups to match the required logging intervals,
4) Follow the 3-2-1 rule for backup copies, media types, and off-site storage,
5) Protect the secondary server with an air gap or restricted network access,
6) Back up server images on a regular schedule, and,
7) Add network-level resilience with a redundant topology such as DLR or MRP.
We can see through this article that even though the individual components outlined can and do exist independently, they work as a whole to improve our ability to reliably collect, transmit, deliver, and present quality and usable data. By removing any one of the components, we drastically reduce the resiliency of our system and expose ourselves to the potential negative consequences that come with inaccurate, unreliable, or missing data.