Implementing Content Freshness protection in DFSR

https://blogs.technet.microsoft.com/askds/2009/11/18/implementing-content-freshness-protection-in-dfsr/

 

Background

Content Freshness is an admin-defined setting that you can set on a per-computer basis when using DFSR on Win2008 or Win2008 R2 – it does not exist on Windows Server 2003 R2. The DFSR database has a record for each Replicated Folder (RF) called CONTENT_SET_RECORD. This record contains a timestamp called “LastConnected”. We store this record on a per-Replicated-Folder basis because it’s possible for a replicated folder to be current when it’s connected to other members in that replication group. At the same time, another replicated folder can be stale because it is not connected with other members in its replication group. Every day, DFSR updates this timestamp to show the opportunity for replication occurred. When attempting replication for an RF between computers, the DFSR service checks if the last time replication was allowed is older than the freshness date. If the last-allowed-replicated date is newer, it replicates. If it’s not, we block replication.

By now, you’re asking yourself “why would I want to block replication.” Good question. DFSR has a JET database just like Active Directory, and it uses multi-master replication just like AD. This means that it must implement tombstones to deleted items to replicate. When a file is deleted in DFSR, the local database records the deletion as a tombstone in the database – a logical deletion. After 60 days DFSR garbage collects the record from the database and it is truly gone – a physical deletion. Online defragmentation of the database can now reclaim that whitespace. The 60 days allows all the replication partners to learn about the deletion and act on it.

And herein lays the problem. If a DFSR server cannot replicate an RF for more than 60 days, but then replication is allowed later, it can replicate out old deletions for files that are actually live or replicate out stale data and overwrite existing files. If you’ve ever worked on an Active Directory “lingering object” issue, you have seen what can happen when a DC that was offline for months is brought back up. This is why Strict Replication Consistency was invented for AD – Content Freshness protection is the same thing.

Being “unable to replicate” can mean any one of these scenarios:

  • Disabling the replication connections.
  • Deleting the replication connections (either one-way or in both directions).
  • Stopping the DFSR service.
  • Closing the schedule (i.e. setting “no replication”)
  • Keeping the server shut off.

This whole content freshness idea is novel enough that we went to the trouble of applying for a patent on it.

Implementing Content Freshness Protection

Content Freshness protection is not enabled by default. To turn it on you simply modify the DfsrMachineConfig setting for MaxOfflineTimeInDays on each DFSR server with:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=<some value>

The recommendation is to set the value to 60:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=60

Remember, this has to be done on all DFSR servers, as this change only affects the computer itself. This value is not stored in a central AD location, but instead in the DfsrMachineConfig.XML file that resides in the hidden operating system folder “%systemdrive%\system volume information\dfsr\config”:

image

You can also view your existing MaxOfflineTimeInDays with:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig get MaxOfflineTimeInDays

Remember, by default this protection is OFF and be assumed to be zero if there are no entries in the DfsrMachineConfig.xml.

Note: Sharp-eyed admins may notice that we actually have an AD attribute stamped on every Replication Group called ms-DFSR-TombstoneExpiryInMin that appears to control tombstone lifetime. It even has the value – in minutes – for 60 days. Sorry to disappoint you, but this attribute is never read by DFSR and changing it has no effect – tombstone lifetime garbage collection is always hard-coded to 60 days in the service and cannot be changed.

Protection in Action

Let’s see how all this works. My repro environment:

  • A pair of Windows Server 2008 R2 computers named 2008r2-fresh-01 and 2008r2-fresh-02
  • Replicating in a Replication Group named “RG1”
  • Using a Replicated Folder named “RF1”
  • Keeping a few user files in sync.
  • MaxOfflineTimeInDays set to 60 on 2008r2-fresh-02

Important note: I am going to simulate the offline time by rolling clocks forward. Never ever do this in production – this is for testing and demonstration purposes only. Also, I only set MaxOfflineTimeInDays on one server – you would do this on all servers.

So here’s my data:

image

Now I stop DFSR on 2008r2-fresh-02 and roll time forward to January 1st, 2010 on both servers – about 75 days from this writing. I then make a few changes on 2008r2-fresh-02.

image

And then I start the DFSR service back up on 2008r2-fresh-02.

  • My changed files do not replicate out
  • New files do not replicate in

I now have this event:

Log Name:      DFS Replication
Source:        DFSR
Date:          1/1/2010 3:37:14 PM
Event ID:      4012
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      2008r2-fresh-02.blueyonderairlines.com
Description:
The DFS Replication service stopped replication on the replicated folder at local path c:\rf1. It has been disconnected from other partners for 76 days, which is longer than the MaxOfflineTimeInDays parameter. Because of this, DFS Replication considers this data to be stale, and will replace it with data from other members of the replication group during the next replication. DFS Replication will move the stale files to the local Conflict folder. No user action is required.
Additional Information:
Error: 9061 (The replicated folder has been offline for too long.)
Replicated Folder Name: rf1
Replicated Folder ID: 5856C18F-CA72-4D2D-9D89-4CC1D8042D86
Replication Group Name: rg1
Replication Group ID: BC5976EF-997E-4149-819D-57193F21EC76
Member ID: FAEC4B17-E81F-4036-AAD9-78AA46814606

Note: this event has incorrect wording. The first two sentences in the description are good, but the following sentences are wrong. DFSR does not self-correct this situation, it does not move files into the ConflictAndDeleted folder, and you, the user, have actions you need to take. More on this later.

The DFSR Debug logs will show (edited for brevity):

20100101 15:37:14.410 1008 CSMG 5504 [WARN] ContentSetManager::CheckContentSetState This replicated folder has not connected to other partners for a long time. lastOnlineTime: [*** Logger Runtime Error:-114757888 ***]

20100101 15:37:14.410 1008 CSMG 7492 [ERROR] ContentSetManager::Initialize Failed to initialize ContentSetManager csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} csName:rf1 Error:

+ [Error:9061(0x2365) ContentSetManager::CheckContentSetState contentsetmanager.cpp:5596 1008 C The replicated folder has been offline for too long.]

20100101 15:37:14.410 1008 CSMG 7972 ContentSetManager::Run csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} csName:rf1 state:InitialBuilding

20100101 15:37:14.504 1948 SRTR 957 [WARN] SERVER_EstablishSession Failed to establish a replicated folder session. connId:{5E05AE2A-6117-4206-B745-7785DB316F74} csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} Error:

+ [Error:9028(0x2344) UpstreamTransport::EstablishSession upstreamtransport.cpp:808 1948 C The content set was not found]

The state of the replicated folder will be “In Error” – i.e. set to 5:

wmic.exe /namespace:\\root\microsoftdfs path DfsrReplicatedFolderInfo get ReplicationGroupName,ReplicatedFolderName,State

ReplicatedFolderName   ReplicationGroupName   State
rf1                               rg1                               5

The above is Content Freshness protection in action. It is protecting your DFSR environment from sending divergent data out to the rest of your working servers.

Recovering DFSR from Content Protection

Important note: Before repairing the blocked replication, get a backup of the data on the affected server and its partners. Failure to do will tempt Murphy’s Law to disastrous new heights. Understand that by following these steps below, any DFSR data that was on this server and never replicated will be moved to PreExisting and/or ConflictAndDeleted – this server goes through non-authoritative sync again and loses all conflicts with other DFSR servers. You have been warned!!!

Also, whatever is being done to stop replication from working needs to be ironed out – whether it is leaving the service off for months on end or not having any connections. Otherwise this is just going to happen again.

To get things back in order, do the following:

1. Start DFSMGMT.MSC on the affected server.

2. On any affected replication groups this server is a member of, select the computer on the Membership tab and “Disable” it.

image

3. Accept the warning prompt.

image

4. If the reason for replication never occurring was the schedule being set to “no replication” on the RG or RF, or no bi-directional connections being place between servers, fix that situation now.

5. Force AD Replication and verify it has converged.

6. On the affected server, run:

DFSRDIAG.EXE POLLAD

7. Wait for the 4008 and 4114 events being written to the DFSR event log to confirm that the replicated folder(s) are no longer being replicated.

8. In DFSMGMT.MSC, “Enable” the replication again on the affected replicated folders for that server.

9. Force AD replication and POLLAD again.

The server goes through non-authoritative initial sync, as if it was setup the first time. All matching data is unchanged and does not replicate. Any files on the server that do not exist on its authoritative partner are moved to the PreExisting folder. Any files on the server that have been changed locally are moved to the ConflictAndDeleted folder and the authoritative server’s copy is replicated inbound.

The Sum Up

Content Freshness protection is a good thing and putting it in place may someday save you some real pain. Trust me – we work cases here where Content Freshness being enabled would have stopped huge problems. All it takes is Windows Server 2008 or later, and a few moments of your time.

– Ned “Kool and the Gang” Pyle

Manually Clearing the ConflictAndDeleted Folder in DFSR

source: http://blogs.technet.com/b/askds/archive/2008/10/06/manually-clearing-the-conflictanddeleted-folder-in-dfsr.aspx

 

Scenario 1: We need to empty out the ConflictAndDeleted folder in a controlled manner as part of regular administration (i.e. we just lowered quota and we want to reclaim that space).

Scenario 2: The ConflictAndDeleted folder quota is not being honored due to an error condition and the folder is filling the drive.

Let’s walk through these now.

Emptying the folder normally

It’s possible to clean up the ConflictAndDeleted folder through the DFSMGMT.MSC and SERVICES.EXE snap-ins, but it’s disruptive and kind of gross (you could lower the quota, wait for AD replication, wait for DFSR polling, and then restart the DFSR service). A much faster and slicker way is to call the WMI method CleanupConflictDirectory from the command-line or a script:

1.  Open a CMD prompt as an administrator on the DFSR server.
2.  Get the GUID of the Replicated Folder you want to clean:

WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderconfig get replicatedfolderguid,replicatedfoldername

(This is all one line, wrapped)

Example output:

image

3.  Then call the CleanupConflictDirectory method:

WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where “replicatedfolderguid='<RF GUID>'” call cleanupconflictdirectory

Example output with a sample GUID:

WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where “replicatedfolderguid=’70bebd41-d5ae-4524-b7df-4eadb89e511e'” call cleanupconflictdirectory

image

4.  At this point the ConflictAndDeleted folder will be empty and the ConflictAndDeletedManifest.xml will be deleted.

Emptying the ConflictAndDeleted folder when in an error state

We’ve also seen a few cases where the ConflictAndDeleted quota was not being honored at all. In every single one of those cases, the customer had recently had hardware problems (specifically with their disk system) where files had become corrupt and the disk was unstable – even after repairing the disk (at least to the best of their knowledge), the ConflictAndDeleted folder quota was not being honored by DFSR.

Here’s where quota is set:

image

Usually when we see this problem, the ConflictAndDeletedManifest.XML file has grown to hundreds of MB in size. When you try to open the file in an XML parser or in Internet Explorer, you will receive an error like “The XML page cannot be displayed” or that there is an error at line X. This is because the file is invalid at some section (with a damaged element, scrambled data, etc).

To fix this issue:

  1. Follow steps 1-4 from above. This may clean the folder as well as update DFSR to say that cleaning has occurred. We always want to try doing things the ‘right’ way before we start hacking.
  2. Stop the DFSR service.
  3. Delete the contents of the ConflictAndDeleted folder manually (with explorer.exe or DEL).
  4. Delete the ConflictAndDeletedManifest.xml file.
  5. Start the DFSR service back up.

For a bit more info on conflict and deletion handling in DFSR, take a look at:

Staging folders and Conflict and Deleted folders (TechNet)
DfsrConflictInfo Class (MSDN)

Implementing Content Freshness protection in DFSR

source: http://blogs.technet.com/b/askds/archive/2009/11/18/implementing-content-freshness-protection-in-dfsr.aspx

 

Background

Content Freshness is an admin-defined setting that you can set on a per-computer basis when using DFSR on Win2008 or Win2008 R2 – it does not exist on Windows Server 2003 R2. The DFSR database has a record for each Replicated Folder (RF) called CONTENT_SET_RECORD. This record contains a timestamp called “LastConnected”. We store this record on a per-Replicated-Folder basis because it’s possible for a replicated folder to be current when it’s connected to other members in that replication group. At the same time, another replicated folder can be stale because it is not connected with other members in its replication group. Every day, DFSR updates this timestamp to show the opportunity for replication occurred. When attempting replication for an RF between computers, the DFSR service checks if the last time replication was allowed is older than the freshness date. If the last-allowed-replicated date is newer, it replicates. If it’s not, we block replication.

By now, you’re asking yourself “why would I want to block replication.” Good question. DFSR has a JET database just like Active Directory, and it uses multi-master replication just like AD. This means that it must implement tombstones to deleted items to replicate. When a file is deleted in DFSR, the local database records the deletion as a tombstone in the database – a logical deletion. After 60 days DFSR garbage collects the record from the database and it is truly gone – a physical deletion. Online defragmentation of the database can now reclaim that whitespace. The 60 days allows all the replication partners to learn about the deletion and act on it.

And herein lays the problem. If a DFSR server cannot replicate an RF for more than 60 days, but then replication is allowed later, it can replicate out old deletions for files that are actually live or replicate out stale data and overwrite existing files. If you’ve ever worked on an Active Directory “lingering object” issue, you have seen what can happen when a DC that was offline for months is brought back up. This is why Strict Replication Consistency was invented for AD – Content Freshness protection is the same thing.

Being “unable to replicate” can mean any one of these scenarios:

  • Disabling the replication connections.
  • Deleting the replication connections (either one-way or in both directions).
  • Stopping the DFSR service.
  • Closing the schedule (i.e. setting “no replication”)
  • Keeping the server shut off.

This whole content freshness idea is novel enough that we went to the trouble of applying for a patent on it.

Implementing Content Freshness Protection

Content Freshness protection is not enabled by default. To turn it on you simply modify the DfsrMachineConfig setting for MaxOfflineTimeInDays on each DFSR server with:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=<some value>

The recommendation is to set the value to 60:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=60

Remember, this has to be done on all DFSR servers, as this change only affects the computer itself. This value is not stored in a central AD location, but instead in the DfsrMachineConfig.XML file that resides in the hidden operating system folder “%systemdrive%\system volume information\dfsr\config”:

image

You can also view your existing MaxOfflineTimeInDays with:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig get MaxOfflineTimeInDays

Remember, by default this protection is OFF and be assumed to be zero if there are no entries in the DfsrMachineConfig.xml.

Note: Sharp-eyed admins may notice that we actually have an AD attribute stamped on every Replication Group called ms-DFSR-TombstoneExpiryInMin that appears to control tombstone lifetime. It even has the value – in minutes – for 60 days. Sorry to disappoint you, but this attribute is never read by DFSR and changing it has no effect – tombstone lifetime garbage collection is always hard-coded to 60 days in the service and cannot be changed.

Protection in Action

Let’s see how all this works. My repro environment:

  • A pair of Windows Server 2008 R2 computers named 2008r2-fresh-01 and 2008r2-fresh-02
  • Replicating in a Replication Group named “RG1”
  • Using a Replicated Folder named “RF1”
  • Keeping a few user files in sync.
  • MaxOfflineTimeInDays set to 60 on 2008r2-fresh-02

Important note: I am going to simulate the offline time by rolling clocks forward. Never ever do this in production – this is for testing and demonstration purposes only. Also, I only set MaxOfflineTimeInDays on one server – you would do this on all servers.

So here’s my data:

image

Now I stop DFSR on 2008r2-fresh-02 and roll time forward to January 1st, 2010 on both servers – about 75 days from this writing. I then make a few changes on 2008r2-fresh-02.

image

And then I start the DFSR service back up on 2008r2-fresh-02.

  • My changed files do not replicate out
  • New files do not replicate in

I now have this event:

Log Name:      DFS Replication
Source:        DFSR
Date:          1/1/2010 3:37:14 PM
Event ID:      4012
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      2008r2-fresh-02.blueyonderairlines.com
Description:
The DFS Replication service stopped replication on the replicated folder at local path c:\rf1. It has been disconnected from other partners for 76 days, which is longer than the MaxOfflineTimeInDays parameter. Because of this, DFS Replication considers this data to be stale, and will replace it with data from other members of the replication group during the next replication. DFS Replication will move the stale files to the local Conflict folder. No user action is required.
Additional Information:
Error: 9061 (The replicated folder has been offline for too long.)
Replicated Folder Name: rf1
Replicated Folder ID: 5856C18F-CA72-4D2D-9D89-4CC1D8042D86
Replication Group Name: rg1
Replication Group ID: BC5976EF-997E-4149-819D-57193F21EC76
Member ID: FAEC4B17-E81F-4036-AAD9-78AA46814606

 

Note: this event has incorrect wording. The first two sentences in the description are good, but the following sentences are wrong. DFSR does not self-correct this situation, it does not move files into the ConflictAndDeleted folder, and you, the user, have actions you need to take. More on this later.

The DFSR Debug logs will show (edited for brevity):

20100101 15:37:14.410 1008 CSMG 5504 [WARN] ContentSetManager::CheckContentSetState This replicated folder has not connected to other partners for a long time. lastOnlineTime: [*** Logger Runtime Error:-114757888 ***]

20100101 15:37:14.410 1008 CSMG 7492 [ERROR] ContentSetManager::Initialize Failed to initialize ContentSetManager csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} csName:rf1 Error:

+ [Error:9061(0x2365) ContentSetManager::CheckContentSetState contentsetmanager.cpp:5596 1008 C The replicated folder has been offline for too long.]

20100101 15:37:14.410 1008 CSMG 7972 ContentSetManager::Run csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} csName:rf1 state:InitialBuilding

20100101 15:37:14.504 1948 SRTR 957 [WARN] SERVER_EstablishSession Failed to establish a replicated folder session. connId:{5E05AE2A-6117-4206-B745-7785DB316F74} csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} Error:

+ [Error:9028(0x2344) UpstreamTransport::EstablishSession upstreamtransport.cpp:808 1948 C The content set was not found]

The state of the replicated folder will be “In Error” – i.e. set to 5:

wmic.exe /namespace:\\root\microsoftdfs path DfsrReplicatedFolderInfo get ReplicationGroupName,ReplicatedFolderName,State

ReplicatedFolderName   ReplicationGroupName   State
rf1                               rg1                               5

The above is Content Freshness protection in action. It is protecting your DFSR environment from sending divergent data out to the rest of your working servers.

Recovering DFSR from Content Protection

Important note: Before repairing the blocked replication, get a backup of the data on the affected server and its partners. Failure to do will tempt Murphy’s Law to disastrous new heights. Understand that by following these steps below, any DFSR data that was on this server and never replicated will be moved to PreExisting and/or ConflictAndDeleted – this server goes through non-authoritative sync again and loses all conflicts with other DFSR servers. You have been warned!!!

Also, whatever is being done to stop replication from working needs to be ironed out – whether it is leaving the service off for months on end or not having any connections. Otherwise this is just going to happen again.

To get things back in order, do the following:

1. Start DFSMGMT.MSC on the affected server.

2. On any affected replication groups this server is a member of, select the computer on the Membership tab and “Disable” it.

image

3. Accept the warning prompt.

image

4. If the reason for replication never occurring was the schedule being set to “no replication” on the RG or RF, or no bi-directional connections being place between servers, fix that situation now.

5. Force AD Replication and verify it has converged.

6. On the affected server, run:

DFSRDIAG.EXE POLLAD

7. Wait for the 4008 and 4114 events being written to the DFSR event log to confirm that the replicated folder(s) are no longer being replicated.

8. In DFSMGMT.MSC, “Enable” the replication again on the affected replicated folders for that server.

9. Force AD replication and POLLAD again.

The server goes through non-authoritative initial sync, as if it was setup the first time. All matching data is unchanged and does not replicate. Any files on the server that do not exist on its authoritative partner are moved to the PreExisting folder. Any files on the server that have been changed locally are moved to the ConflictAndDeleted folder and the authoritative server’s copy is replicated inbound.

The Sum Up

Content Freshness protection is a good thing and putting it in place may someday save you some real pain. Trust me – we work cases here where Content Freshness being enabled would have stopped huge problems. All it takes is Windows Server 2008 or later, and a few moments of your time.