Deadlock in TraceSource Initialization in Azure Web Site extension #428

Closed
Dmitry-Matveev opened this issue Feb 10, 2017 · 10 comments

@Dmitry-Matveev
Member

TelemetryConfigurationFactory creates an instance of InMemoryChannel which in turn creates an instance of the InMemoryTransmitter.

InMemoryTransmitter starts a new Task which logs with CoreEventSource.Log.LogVerbose().

TelemetryConfigurationFactory continues execution (it does not wait for the task) and may execute Event Source logging as well, leading to a possible deadlock between two threads: one making an Event Source call on the newly created task and one making an Event Source call on the original thread.
The deadlock seems possible because Event Source performs some initialization logic on the first-ever call, and there are two locks that can be acquired in reverse order:

Thread A:
TraceInternal.TraceEvent() took TraceInternal.critSec,
Switch.InitializeWithStatus() is waiting for Switch.InitializedLock

Thread B:
Switch.InitializeWithStatus() took Switch.InitializedLock,
DiagnosticsConfiguration.Initialize() is waiting on TraceInternal.critSec
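
To make the reversed lock order easier to see, here is a minimal, self-contained C# sketch of the same inversion. The lock objects and sleeps below are stand-ins for TraceInternal.critSec and Switch.InitializedLock, chosen only to reproduce the pattern reliably; this is not the actual framework code:

```csharp
using System;
using System.Threading;

// Hypothetical illustration of the lock-order inversion described above.
// "critSec" stands in for TraceInternal.critSec and "initializedLock" for
// Switch.InitializedLock; the real framework code paths are more involved.
class LockInversionSketch
{
    static readonly object critSec = new object();
    static readonly object initializedLock = new object();

    static void ThreadA()
    {
        lock (critSec)                 // TraceEvent-style path takes critSec first...
        {
            Thread.Sleep(100);         // widen the race window for the demo
            lock (initializedLock)     // ...then wants the switch-initialization lock
            {
                Console.WriteLine("Thread A finished");
            }
        }
    }

    static void ThreadB()
    {
        lock (initializedLock)         // initialization path takes the switch lock first...
        {
            Thread.Sleep(100);
            lock (critSec)             // ...then wants critSec
            {
                Console.WriteLine("Thread B finished");
            }
        }
    }

    static void Main()
    {
        var a = new Thread(ThreadA);
        var b = new Thread(ThreadB);
        a.Start();
        b.Start();
        a.Join();                      // with the sleeps above, this typically never returns
        b.Join();
    }
}
```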

While the nature of the deadlock is not entirely clear at the moment (a bug in EventSource or in the usage of Event Source?), the fix might be to simply initialize Event Source synchronously before we create InMemoryChannel.
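
A sketch of what that workaround could look like, assuming the trigger is the first trace call racing between the configuring thread and a freshly started background task; the TraceSource name and the warm-up message are hypothetical, and the SDK's actual change may differ:

```csharp
using System.Diagnostics;
using System.Threading.Tasks;

// Sketch of the workaround idea: issue one trace call synchronously on the
// configuring thread so the one-time Switch/DiagnosticsConfiguration
// initialization completes (and releases its locks) before any background
// task can race with it.
class WarmUpSketch
{
    static readonly TraceSource Source =
        new TraceSource("Microsoft.ApplicationInsights", SourceLevels.All);

    static void Main()
    {
        // Warm-up call on the main thread: forces the one-time initialization here.
        Source.TraceEvent(TraceEventType.Verbose, 0, "warm-up");

        // Only now start background work that also traces; it can no longer
        // race with the one-time initialization above.
        var background = Task.Run(() =>
            Source.TraceEvent(TraceEventType.Verbose, 0, "from background task"));
        background.Wait();
    }
}
```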

Dumps are available for additional investigation; please contact me to get them.

@Dmitry-Matveev Dmitry-Matveev added this to the 2.3-Beta3 milestone Feb 10, 2017
@cijothomas cijothomas self-assigned this Feb 10, 2017
@SergeyKanzhelev
Contributor

cc: @vancem, are those two sequences of lock acquisitions by design?

@vancem

vancem commented Feb 13, 2017

If you think this is a problem in EventSource, I think we need more explanation of the scenario. (Are we talking about TraceSource and not EventSource, by any chance?) I don't recognize any of the APIs above as methods in EventSource (in particular, I searched for TraceInternal).

Also, you don't get lock-leveling issues unless you CALL OUT from your API while holding a lock. So presumably this is happening because you have an EventListener involved, but again it is unclear from the explanation what exactly is happening, as EventListeners are not mentioned. I know of only one lock associated with EventSource, and it is called 'EventListenersLock', but that was also not mentioned.

So I can't really comment on the issue yet, as I can't relate the question to the code.

@brianrob

@SergeyKanzhelev
Contributor

@vancem as Dmitry reported, the issue was caused by the initialization order and the fact that we had EventSource tracing in initialization. @Dmitry-Matveev do you have full call stacks?

BTW, I found this comment in reference code:

Initialize() is also called by other components outside of Trace (such as PerformanceCounter)
as a result using one lock for this critical section and another for Trace API critical sections
(such as Trace.WriteLine) could potentially lead to deadlock between 2 threads that are
executing these critical sections (and consequently obtaining the 2 locks) in the reverse order.
Using the same lock for DiagnosticsConfiguration as well as TraceInternal avoids this issue.
Sequential locks on TraceInternal.critSec by the same thread is a non issue for this critical section.

@vancem

vancem commented Feb 13, 2017

Just to be clear, the reference source you are looking at is System.Diagnostics.TraceSource logic and has nothing to do with System.Diagnostics.Tracing.EventSource. This is fine; it is just that different people own the different pieces of code, and if we are going to fix things we certainly need to know which code actually has the problem.

Including @safern who has been updating TraceSource code recently.

Note also that TraceSource has been changed in .NET Core (our forward-looking code base) and no longer has the complex logic in DiagnosticsConfiguration.

Also, one of the locks, TraceInternal.critSec, is only present if the TraceListener asks for it. Generally speaking, such 'global' locks are not a good idea for a high-performance logging system (indeed, we don't recommend TraceSource for high-performance logging at all; it is really there only for compatibility reasons).

In short, it is more important than ever to carefully understand your scenario.

@SergeyKanzhelev
Contributor

Thanks for the clarification! We clearly need those stacks from @Dmitry-Matveev =)

@Dmitry-Matveev Dmitry-Matveev modified the milestones: 2.4-Beta1, 2.3-Beta3 Mar 8, 2017
@SergeyKanzhelev SergeyKanzhelev changed the title Deadlock in EventSource Initialization Deadlock in TraceSource Initialization in Azure Web Site extension Mar 8, 2017
@SergeyKanzhelev
Contributor

Changes should be made in this NuGet package to work around this problem: https://www.nuget.org/packages/Microsoft.ApplicationInsights.Azure.WebSites

Specifically in this class: Microsoft.ApplicationInsights.Extensibility.Implementation.Tracing.TraceSourceForEventSource

@vancem

vancem commented Mar 9, 2017

As mentioned, this bug does not exist in CoreCLR, only in the Desktop framework. It is being tracked by the following bug for the Desktop product.

 387336 Deadlock in TraceSource Configuration

There is a suggested fix

There are probably multiple fixes, but one 'obvious' fix is to use the 'TraceInternal.critSec' lock for all code in 'Switches' as well. (DiagnosticsConfiguration used the 'TraceInternal.critSec' lock to avoid exactly this kind of deadlock; the fix just did not go far enough, as they probably did not know about the lock in Switches.cs.)

which I believe @brianrob is looking at. However, as mentioned, changes to the Desktop framework propagate very slowly (people don't upgrade for months or years), so a workaround is suggested as well, which is what Sergey is referring to above.
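
For illustration, here is a minimal sketch of the principle behind the suggested fix quoted above: when both paths take the same lock, there is no acquisition order to invert, and re-acquiring that lock on the same thread is harmless. This is only an illustration of the single-lock idea, not the actual Switches.cs change:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// With a single shared lock (standing in for TraceInternal.critSec) the deadlock
// from the earlier two-lock sketch disappears: C# locks (Monitor) are reentrant,
// so nested acquisition on the same thread is a non-issue, and two threads can
// only ever wait on one lock, never on each other in a cycle.
class SingleLockSketch
{
    static readonly object critSec = new object();

    static void PathA()
    {
        lock (critSec)          // TraceEvent-style path
        {
            Thread.Sleep(100);
            lock (critSec)      // re-entering the same lock on the same thread is fine
            {
                Console.WriteLine("Path A finished");
            }
        }
    }

    static void PathB()
    {
        lock (critSec)          // switch-initialization-style path
        {
            Thread.Sleep(100);
            lock (critSec)
            {
                Console.WriteLine("Path B finished");
            }
        }
    }

    static void Main()
    {
        Task.WaitAll(Task.Run(PathA), Task.Run(PathB));  // completes; no deadlock
    }
}
```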

@brianrob
Member

brianrob commented Mar 9, 2017

I have a candidate fix for 387336. This is now pending validation by AppInsights. Per e-mail, @SergeyKanzhelev is looking into this.

@SergeyKanzhelev
Contributor

The class responsible for the deadlock was moved to GitHub. Moving the issue along to that repo.

@TimothyMothra
Member

Bug 387336 is in .NET Framework version 4.7.0.
The fix was released with 4.7.1.
http://referencesource.microsoft.com/#System/compmod/system/diagnostics/Switch.cs

Also, Application Insights no longer supports .NET 4.0, where this bug was discovered.
We believe it is safe to close this bug.
