multi-threading task under SolarMutex -> deadlock

Norbert Thiebaud

multi-threading task under SolarMutex -> deadlock

Recently the Linux dbgutil build has been hanging on occasion.

The issue is that drawinglayer recently started using a thread pool
( https://cgit.freedesktop.org/libreoffice/core/commit/?id=657413b5deea11a850970f23cba2cf34a5bdf8ea
)
and calls waitUntilEmpty() on that thread pool while holding the
SolarMutex...


The threaded work then calls raise() due to some memory problem,
and our signal handler tries to acquire the SolarMutex -> deadlock
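
The pattern can be reproduced in miniature with standard C++ primitives. The sketch below is not LibreOffice code: `g_bigLock` is a stand-in for the SolarMutex and `workerCanTakeLock` is a hypothetical helper. It uses a timed lock attempt so that it reports the problem instead of hanging the way the real build does.

```cpp
#include <chrono>
#include <future>
#include <mutex>

// Stand-in for the SolarMutex (hypothetical name, not the real VCL class).
static std::timed_mutex g_bigLock;

// Returns true if a worker thread manages to take g_bigLock while the
// calling thread waits for that worker. With holdLockWhileWaiting == true
// this mirrors the backtrace: the waiter owns the lock the worker needs.
bool workerCanTakeLock(bool holdLockWhileWaiting)
{
    std::unique_lock<std::timed_mutex> guard(g_bigLock);
    if (!holdLockWhileWaiting)
        guard.unlock();

    // The worker plays the role of the signal handler: it needs the lock.
    // The real handler would block forever; here we give up after 100 ms.
    std::future<bool> worker = std::async(std::launch::async, [] {
        if (!g_bigLock.try_lock_for(std::chrono::milliseconds(100)))
            return false;
        g_bigLock.unlock();
        return true;
    });

    // Equivalent of waitUntilEmpty(): block until the worker is done.
    return worker.get();
}
```

Calling `workerCanTakeLock(true)` is expected to return false (the worker times out, which in the real code is a permanent hang), while `workerCanTakeLock(false)` lets the worker through.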

relevant backtrace:

#0  0x00002af85e71c6d5 in pthread_cond_wait@@GLIBC_2.3.2 () at
/lib64/libpthread.so.0
#1  0x00002af85d8cd744 in osl_waitCondition(oslCondition, TimeValue
const*) (Condition=0x35fff90, pTimeout=0x0) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sal/osl/unx/conditn.cxx:228
#2  0x00002af866bf12b6 in osl::Condition::wait(TimeValue const*)
(this=0x3669e78, pTimeout=0x0) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/osl/conditn.hxx:84
#3  0x00002af866c556a1 in comphelper::ThreadPool::waitUntilEmpty()
(this=0x3669e60) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/comphelper/source/misc/threadpool.cxx:202
#4  0x00002af878ef7a4d in
drawinglayer::primitive2d::ScenePrimitive2D::create2DDecomposition(drawinglayer::geometry::ViewInformation2D
const&) const (this=0x2af8961bacd0, rViewInformation=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/primitive2d/sceneprimitive2d.cxx:439

^^^ wait for threadpool

#5  0x00002af878eae474 in
drawinglayer::primitive2d::BufferedDecompositionPrimitive2D::get2DDecomposition(drawinglayer::geometry::ViewInformation2D
const&) const (this=0x2af8961bacd0, rViewInformation=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/primitive2d/baseprimitive2d.cxx:99
#6  0x00002af878ef9009 in
drawinglayer::primitive2d::ScenePrimitive2D::get2DDecomposition(drawinglayer::geometry::ViewInformation2D
const&) const (this=0x2af8961bacd0, rViewInformation=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/primitive2d/sceneprimitive2d.cxx:695
#7  0x00002af878f63eb6 in
drawinglayer::processor2d::VclPixelProcessor2D::processBasePrimitive2D(drawinglayer::primitive2d::BasePrimitive2D
const&) (this=0x3b59700, rCandidate=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/vclpixelprocessor2d.cxx:1251
#8  0x00002af878f4192c in
drawinglayer::processor2d::BaseProcessor2D::process(drawinglayer::primitive2d::Primitive2DContainer
const&) (this=0x3b59700, rSource=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/baseprocessor2d.cxx:63
#9  0x00002af878f63ecf in
drawinglayer::processor2d::VclPixelProcessor2D::processBasePrimitive2D(drawinglayer::primitive2d::BasePrimitive2D
const&) (this=0x3b59700, rCandidate=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/vclpixelprocessor2d.cxx:1251
#10 0x00002af878f4192c in
drawinglayer::processor2d::BaseProcessor2D::process(drawinglayer::primitive2d::Primitive2DContainer
const&) (this=0x3b59700, rSource=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/baseprocessor2d.cxx:63
#11 0x00002af878f63ecf in
drawinglayer::processor2d::VclPixelProcessor2D::processBasePrimitive2D(drawinglayer::primitive2d::BasePrimitive2D
const&) (this=0x3b59700, rCandidate=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/vclpixelprocessor2d.cxx:1251
#12 0x00002af878f4192c in
drawinglayer::processor2d::BaseProcessor2D::process(drawinglayer::primitive2d::Primitive2DContainer
const&) (this=0x3b59700, rSource=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/baseprocessor2d.cxx:63
#13 0x00002af878f63ecf in
drawinglayer::processor2d::VclPixelProcessor2D::processBasePrimitive2D(drawinglayer::primitive2d::BasePrimitive2D
const&) (this=0x3b59700, rCandidate=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/vclpixelprocessor2d.cxx:1251
#14 0x00002af878f4192c in
drawinglayer::processor2d::BaseProcessor2D::process(drawinglayer::primitive2d::Primitive2DContainer
const&) (this=0x3b59700, rSource=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor2d/baseprocessor2d.cxx:63
#15 0x00002af87e332e5a in paintUsingPrimitivesHelper(OutputDevice&,
drawinglayer::primitive2d::Primitive2DContainer const&,
basegfx::B2DRange const&, basegfx::B2DRange const&)
(rOutputDevice=..., rSequence=..., rSourceRange=..., rTargetRange=...)
at /home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/doc/notxtfrm.cxx:744
#16 0x00002af87e334068 in SwNoTextFrame::PaintPicture(OutputDevice*,
SwRect const&) const (this=0x3252b80, pOut=0x3d97410, rGrfArea=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/doc/notxtfrm.cxx:1023
#17 0x00002af87e330f21 in SwNoTextFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x3252b80, rRenderContext=...,
rRect=...) at /home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/doc/notxtfrm.cxx:304
#18 0x00002af87e5f817f in SwLayoutFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x31a9d40, rRenderContext=...,
rRect=...) at /home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/layout/paintfrm.cxx:3681
#19 0x00002af87e5faf9b in SwFlyFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x31a9d40, rRenderContext=...,
rRect=...) at /home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/layout/paintfrm.cxx:4359
#20 0x00002af87e73d37b in SwFlyCntPortion::Paint(SwTextPaintInfo
const&) const (this=0x36e1bc0, rInf=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/text/porfly.cxx:241
#21 0x00002af87e730ab2 in SwTextPainter::DrawTextLine(SwRect const&,
SwSaveClip&, bool) (this=0x7ffe6b535250, rPaint=..., rClip=...,
bUnderSz=false) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/text/itrpaint.cxx:392
#22 0x00002af87e701bde in SwTextFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x2af8961b8000,
rRenderContext=..., rRect=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/text/frmpaint.cxx:691
#23 0x00002af87e5f817f in SwLayoutFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x2af8961b7000,
rRenderContext=..., rRect=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/layout/paintfrm.cxx:3681
#24 0x00002af87e5f817f in SwLayoutFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x2af8961b5000,
rRenderContext=..., rRect=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/layout/paintfrm.cxx:3681
#25 0x00002af87e5f6f80 in SwRootFrame::Paint(OutputDevice&, SwRect
const&, SwPrintData const*) const (this=0x3092820, rRenderContext=...,
rRect=..., pPrintData=0x0) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/layout/paintfrm.cxx:3400
#26 0x00002af87eac44b5 in SwViewShell::ImplEndAction(bool)
(this=0x3bdb160, bIdleEnd=false) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/view/viewsh.cxx:419

^^^^ holds a SolarMutexGuard


#27 0x00002af87e0f43e0 in SwViewShell::EndAction(bool)
(this=0x3bdb160, bIdleEnd=false) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/inc/viewsh.hxx:609
#28 0x00002af87eac521b in SwViewShell::MakeVisible(SwRect const&)
(this=0x3bdb160, rRect=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/view/viewsh.cxx:590
#29 0x00002af87e0f055b in SwCursorShell::MakeSelVisible()
(this=0x3bdb160) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/crsr/crsrsh.cxx:2807
#30 0x00002af87e508800 in SwFEShell::MakeSelVisible() (this=0x3bdb160)
at /home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/frmedt/feshview.cxx:2345
#31 0x00002af87e0eaab4 in SwCursorShell::UpdateCursor(unsigned short,
bool) (this=0x3bdb160, eFlags=7, bIdleEnd=false) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/crsr/crsrsh.cxx:1821
#32 0x00002af87e0e31cd in SwCursorShell::EndAction(bool, bool)
(this=0x3bdb160, bIdleEnd=false, DoSetPosX=false) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/crsr/crsrsh.cxx:294
#33 0x00002af87e5e49ab in SwRootFrame::EndAllAction(bool)
(this=0x3092820, bVirDev=false) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/layout/pagechg.cxx:1691
#34 0x00002af87e9c91c0 in UnoActionContext::~UnoActionContext()
(this=0x3bc4c00, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/core/unocore/unoobj2.cxx:261
#35 0x00002af87eee6fc7 in SwXTextDocument::unlockControllers()
(this=0x2af85d40f908) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sw/source/uibase/uno/unotxdoc.cxx:539
#36 0x00002af87a43c06e in
oox::core::FilterBase::filter(com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue>
const&) (this=0x2af8960fcdd8, rMediaDescSeq=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/oox/source/core/filterbase.cxx:497
#37 0x00002af895a199b2 in
WriterFilter::filter(com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue>
const&) (this=0x2af8961d0f68, aDescriptor=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/writerfilter/source/filter/WriterFilter.cxx:149
#38 0x00002af88121e8f3 in SfxObjectShell::ExportTo(SfxMedium&)
(this=0x37e0430, rMedium=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/objstor.cxx:2416
#39 0x00002af8812197c8 in SfxObjectShell::SaveTo_Impl(SfxMedium&,
SfxItemSet const*) (this=0x37e0430, rMedium=..., pSet=0x0) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/objstor.cxx:1540
#40 0x00002af881220c64 in
SfxObjectShell::PreDoSaveAs_Impl(rtl::OUString const&, rtl::OUString
const&, SfxItemSet&) (this=0x37e0430, rFileName=..., aFilterName=...,
rItemSet=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/objstor.cxx:2811
#41 0x00002af881220107 in
SfxObjectShell::CommonSaveAs_Impl(INetURLObject const&, rtl::OUString
const&, SfxItemSet&) (this=0x37e0430, aURL=..., aFilterName=...,
rItemSet=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/objstor.cxx:2681
#42 0x00002af881209f40 in SfxObjectShell::APISaveAs_Impl(rtl::OUString
const&, SfxItemSet&) (this=0x37e0430, aFileName=..., rItemSet=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/objserv.cxx:308
#43 0x00002af881260830 in SfxBaseModel::impl_store(rtl::OUString
const&, com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue>
const&, bool) (this=0x2af85d40fa38, sURL=..., seqArguments=...,
bSaveTo=true) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/sfxbasemodel.cxx:3041
#44 0x00002af881258fdd in SfxBaseModel::storeToURL(rtl::OUString
const&, com::sun::star::uno::Sequence<com::sun::star::beans::PropertyValue>
const&) (this=0x2af85d40fa38, rURL=..., rArgs=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sfx2/source/doc/sfxbasemodel.cxx:1672
#45 0x00002af87713fa51 in ChartTest::reload(rtl::OUString const&)
(this=0x2eb1f20, rFilterName=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/chart2/qa/extras/charttest.hxx:128





#0  0x00002af85e71ef4d in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00002af85e71ad1d in _L_lock_840 () at /lib64/libpthread.so.0
#2  0x00002af85e71ac3a in pthread_mutex_lock () at /lib64/libpthread.so.0
#3  0x00002af85d8dad33 in osl_acquireMutex(oslMutexImpl*)
(pMutex=0x23ab450) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sal/osl/unx/mutex.cxx:99
#4  0x00002af86ad37407 in osl::Mutex::acquire() (this=0x23b2a78) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/osl/mutex.hxx:56
#5  0x00002af86b49477e in SalYieldMutex::acquire() (this=0x23b2a70) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/vcl/unx/generic/app/geninst.cxx:54
#6  0x00002af86ad37abf in SolarMutexGuard::SolarMutexGuard()
(this=0x2af8919f4580) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/vcl/svapp.hxx:1461

^^^ Insanity: trying to acquire the SolarMutex in a signal handler.
And since the other thread is waiting for us to finish while holding
the SolarMutex -> deadlock

#7  0x00002af86b28a51a in VCLExceptionSignal_impl(void*,
oslSignalInfo*) (pInfo=0x2af8919f4610) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/vcl/source/app/svmain.cxx:137
#8  0x00002af85d8974bb in callSignalHandler(oslSignalInfo*)
(pInfo=0x2af8919f4610) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sal/osl/all/signalshared.cxx:59
#9  0x00002af85d8e8134 in (anonymous
namespace)::signalHandlerFunction(int) (signal=6) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/sal/osl/unx/signal.cxx:421
#10 0x00002af85e385670 in <signal handler called> () at /lib64/libc.so.6
#11 0x00002af85e3855f7 in raise () at /lib64/libc.so.6
#12 0x00002af85e386ce8 in abort () at /lib64/libc.so.6
#13 0x00002af85e3cc515 in free_check () at /lib64/libc.so.6

^^^ Oops, memory issue -> signal

#14 0x00002af86bdce568 in
__gnu_cxx::new_allocator<basegfx::B3DVector>::deallocate(basegfx::B3DVector*,
unsigned long) (this=0x3906610, __p=0x3c133d0) at
/usr/include/c++/4.8.2/ext/new_allocator.h:110
#15 0x00002af86bdc9564 in
std::__cxx1998::_Vector_base<basegfx::B3DVector,
std::allocator<basegfx::B3DVector>
>::_M_deallocate(basegfx::B3DVector*, unsigned long) (this=0x3906610,
__p=0x3c133d0, __n=4) at /usr/include/c++/4.8.2/bits/stl_vector.h:174
#16 0x00002af86bdc9148 in
std::__cxx1998::_Vector_base<basegfx::B3DVector,
std::allocator<basegfx::B3DVector> >::~_Vector_base() (this=0x3906610,
__in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/stl_vector.h:160
#17 0x00002af86bdc349f in std::__cxx1998::vector<basegfx::B3DVector,
std::allocator<basegfx::B3DVector> >::~vector() (this=0x3906610,
__in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/stl_vector.h:416
#18 0x00002af86bdbc4f8 in std::__debug::vector<basegfx::B3DVector,
std::allocator<basegfx::B3DVector> >::~vector() (this=0x3906610,
__in_chrg=<optimized out>) at /usr/include/c++/4.8.2/debug/vector:144
#19 0x00002af86bdb51fc in NormalsArray3D::~NormalsArray3D()
(this=0x3906610, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/basegfx/source/polygon/b3dpolygon.cxx:446
#20 0x00002af86bdb6fab in ImplB3DPolygon::~ImplB3DPolygon()
(this=0x3a7bb80, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/basegfx/source/polygon/b3dpolygon.cxx:872
#21 0x00002af86bdc604c in o3tl::cow_wrapper<ImplB3DPolygon,
o3tl::ThreadSafeRefCountingPolicy>::impl_t::~impl_t() (this=0x3a7bb80,
__in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/o3tl/cow_wrapper.hxx:178
#22 0x00002af86bdc60a5 in o3tl::cow_wrapper<ImplB3DPolygon,
o3tl::ThreadSafeRefCountingPolicy>::release() (this=0x390e680) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/o3tl/cow_wrapper.hxx:203
#23 0x00002af86bdc0706 in o3tl::cow_wrapper<ImplB3DPolygon,
o3tl::ThreadSafeRefCountingPolicy>::~cow_wrapper() (this=0x390e680,
__in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/o3tl/cow_wrapper.hxx:246
#24 0x00002af86bdb2104 in basegfx::B3DPolygon::~B3DPolygon()
(this=0x390e680, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/basegfx/source/polygon/b3dpolygon.cxx:1511
#25 0x00002af86bddfb64 in
std::_Destroy<basegfx::B3DPolygon>(basegfx::B3DPolygon*)
(__pointer=0x390e680) at
/usr/include/c++/4.8.2/bits/stl_construct.h:93
#26 0x00002af86bddf10a in
std::_Destroy_aux<false>::__destroy<basegfx::B3DPolygon*>(basegfx::B3DPolygon*,
basegfx::B3DPolygon*) (__first=0x390e680, __last=0x390e688) at
/usr/include/c++/4.8.2/bits/stl_construct.h:103
#27 0x00002af86bdde103 in
std::_Destroy<basegfx::B3DPolygon*>(basegfx::B3DPolygon*,
basegfx::B3DPolygon*) (__first=0x390e680, __last=0x390e688) at
/usr/include/c++/4.8.2/bits/stl_construct.h:126
#28 0x00002af86bddcf29 in std::_Destroy<basegfx::B3DPolygon*,
basegfx::B3DPolygon>(basegfx::B3DPolygon*, basegfx::B3DPolygon*,
std::allocator<basegfx::B3DPolygon>&) (__first=0x390e680,
__last=0x390e688) at /usr/include/c++/4.8.2/bits/stl_construct.h:151
#29 0x00002af86bddbded in std::__cxx1998::vector<basegfx::B3DPolygon,
std::allocator<basegfx::B3DPolygon> >::~vector() (this=0x3a7b920,
__in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/stl_vector.h:415
#30 0x00002af86bdda568 in std::__debug::vector<basegfx::B3DPolygon,
std::allocator<basegfx::B3DPolygon> >::~vector() (this=0x3a7b920,
__in_chrg=<optimized out>) at /usr/include/c++/4.8.2/debug/vector:144
#31 0x00002af86bdda494 in ImplB3DPolyPolygon::~ImplB3DPolyPolygon()
(this=0x3a7b920, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/basegfx/source/polygon/b3dpolypolygon.cxx:30
#32 0x00002af86bddcd1a in o3tl::cow_wrapper<ImplB3DPolyPolygon,
o3tl::ThreadSafeRefCountingPolicy>::impl_t::~impl_t() (this=0x3a7b920,
__in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/o3tl/cow_wrapper.hxx:178
#33 0x00002af86bddcd73 in o3tl::cow_wrapper<ImplB3DPolyPolygon,
o3tl::ThreadSafeRefCountingPolicy>::release() (this=0x2af88c3f4b28) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/o3tl/cow_wrapper.hxx:203
#34 0x00002af86bddbc80 in o3tl::cow_wrapper<ImplB3DPolyPolygon,
o3tl::ThreadSafeRefCountingPolicy>::~cow_wrapper()
(this=0x2af88c3f4b28, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/o3tl/cow_wrapper.hxx:246
#35 0x00002af86bdd8dd6 in basegfx::B3DPolyPolygon::~B3DPolyPolygon()
(this=0x2af88c3f4b28, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/basegfx/source/polygon/b3dpolypolygon.cxx:209
#36 0x00002af878f2dd6a in
drawinglayer::primitive3d::PolyPolygonMaterialPrimitive3D::~PolyPolygonMaterialPrimitive3D()
(this=0x2af88c3f4ac8, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/drawinglayer/primitive3d/polypolygonprimitive3d.hxx:42
#37 0x00002af878f2dda6 in
drawinglayer::primitive3d::PolyPolygonMaterialPrimitive3D::~PolyPolygonMaterialPrimitive3D()
(this=0x2af88c3f4ac8, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/drawinglayer/primitive3d/polypolygonprimitive3d.hxx:42
#38 0x00002af865a16320 in cppu::OWeakObject::release()
(this=0x2af88c3f4ac8) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/cppuhelper/source/weak.cxx:207
#39 0x00002af86596cad2 in cppu::WeakComponentImplHelperBase::release()
(this=0x2af88c3f4ac8) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/cppuhelper/source/implbase.cxx:88
#40 0x00002af878f23496 in
cppu::WeakComponentImplHelper1<com::sun::star::graphic::XPrimitive3D>::release()
(this=0x2af88c3f4ac8) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/cppuhelper/compbase1.hxx:58
#41 0x00002af878ec18d1 in
com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>::~Reference()
(this=0x3b91b90, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/com/sun/star/uno/Reference.hxx:110
#42 0x00002af878ec17c2 in
std::_Destroy<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>
>(com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*)
(__pointer=0x3b91b90) at
/usr/include/c++/4.8.2/bits/stl_construct.h:93
#43 0x00002af878ec1664 in
std::_Destroy_aux<false>::__destroy<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*>(com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*,
com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*)
(__first=0x3b91b90, __last=0x3b91d20) at
/usr/include/c++/4.8.2/bits/stl_construct.h:103
#44 0x00002af878ec1511 in
std::_Destroy<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*>(com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*,
com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*)
(__first=0x3b91b20, __last=0x3b91d20) at
/usr/include/c++/4.8.2/bits/stl_construct.h:126
#45 0x00002af878ec137d in
std::_Destroy<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*,
com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>
>(com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*,
com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>*,
std::allocator<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>
>&) (__first=0x3b91b20, __last=0x3b91d20) at
/usr/include/c++/4.8.2/bits/stl_construct.h:151
#46 0x00002af878ec10b5 in
std::__cxx1998::vector<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>,
std::allocator<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>
> >::~vector() (this=0x2af8919f5380, __in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/bits/stl_vector.h:415
#47 0x00002af878ec0fc6 in
std::__debug::vector<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>,
std::allocator<com::sun::star::uno::Reference<com::sun::star::graphic::XPrimitive3D>
> >::~vector() (this=0x2af8919f5380, __in_chrg=<optimized out>) at
/usr/include/c++/4.8.2/debug/vector:144
#48 0x00002af878ec0f4e in
drawinglayer::primitive3d::Primitive3DContainer::~Primitive3DContainer()
(this=0x2af8919f5380, __in_chrg=<optimized out>) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/include/drawinglayer/primitive3d/baseprimitive3d.hxx:56
#49 0x00002af878f752ce in
drawinglayer::processor3d::DefaultProcessor3D::processBasePrimitive3D(drawinglayer::primitive3d::BasePrimitive3D
const&) (this=0x3a8ee70, rBasePrimitive=...) at
/home/tdf/lode/jenkins/workspace/lo_tb_master_linux_dbg/drawinglayer/source/processor3d/defaultprocessor3d.cxx:582
_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
Thorsten Behrens

Re: multi-threading task under SolarMutex -> deadlock

Norbert Thiebaud wrote:
> The threaded work then calls raise() due to some memory problem, and our
> signal handler tries to acquire the SolarMutex -> deadlock
>
Eek, that's ugly. Then again, at the core is the OOM condition, which
needs solving independently. Per chance, is that happening on a box
with massive amounts of CPU threads?

Cheers,

-- Thorsten

Norbert Thiebaud

Re: multi-threading task under SolarMutex -> deadlock

On Tue, May 17, 2016 at 6:44 AM, Thorsten Behrens <[hidden email]> wrote:
> Norbert Thiebaud wrote:
>> The threaded work then calls raise() due to some memory problem, and our
>> signal handler tries to acquire the SolarMutex -> deadlock
>>
> Eek, that's ugly. Then again, at the core is the OOM condition, which
> needs solving independently. Per chance, is that happening on a box
> with massive amounts of CPU threads?

It is on the CI builder, so yes, 32 threads or so.

But I disagree that it is _at the core_.

At the core, this exhibits two things:
1/ we do a lot of things that are verboten in a signal handler.
2/ taking a lock that relies on another thread to move forward while
holding the SolarMutex is begging for deadlock.
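
One common way to address point 2/ is to give up the big lock for the duration of the blocking wait and take it again afterwards. What follows is only a sketch in standard C++ under that assumption: `ScopedUnlock` and `runWorkersWhileHoldingLock` are hypothetical names, `std::mutex` stands in for the SolarMutex, and nothing here is taken from the drawinglayer code.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// RAII helper (hypothetical): releases a held mutex for the lifetime of
// the object and re-acquires it on destruction.
class ScopedUnlock
{
public:
    explicit ScopedUnlock(std::mutex& m) : m_mutex(m) { m_mutex.unlock(); }
    ~ScopedUnlock() { m_mutex.lock(); }
    ScopedUnlock(const ScopedUnlock&) = delete;
    ScopedUnlock& operator=(const ScopedUnlock&) = delete;

private:
    std::mutex& m_mutex;
};

// Starts `count` workers that each need `bigLock` briefly, then waits for
// them. The caller is assumed to hold `bigLock` on entry and holds it
// again on return. Returns how many workers completed.
int runWorkersWhileHoldingLock(std::mutex& bigLock, int count)
{
    int done = 0; // protected by bigLock
    std::vector<std::thread> workers;
    for (int i = 0; i < count; ++i)
        workers.emplace_back([&bigLock, &done] {
            std::lock_guard<std::mutex> g(bigLock);
            ++done;
        });
    {
        ScopedUnlock release(bigLock); // give the lock up before blocking
        for (auto& t : workers)
            t.join();
    }
    return done; // safe to read: we hold bigLock again
}
```

The price of this approach is that other threads may change state protected by the lock while it is released, so the caller has to tolerate that before relying on anything it read earlier.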

Norbert
Armin Le Grand

Re: multi-threading task under SolarMutex -> deadlock

Hi Norbert,

thanks for also having an eye on this - I am looking for the failure reports on ci.libreoffice.org currently, too.
Last is from http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7195/, so last is from Friday, 13th (uhhh...)

Have you seen such or similar stacks anywhere else? In the meantime I tried ChartTest massively locally on Linux and Win, but could never locally reproduce.

The SolarMutex thing is surely not good, but I would guess it is only a symptom showing up. There are tests and code in SC, for example, that also use massive parallelism, not limited to an upper core count. The basic problem is that the MainThread always holds the SolarMutex, so also while calling waitUntilEmpty(). The consequence is that no WorkerThread is allowed to get the SolarMutex, which limits multithreaded actions to work that does not need it.

I knew that and made sure that the multithreaded 3DRenderer WorkerThreads do not need the SolarMutex for their work. I did not know yet that the memory fail handler tries to get the SolarMutex, too, but that is logical when it wants to bring up a dialog in some form.

But the deeper problem is that allocation - here extending a vector of pointers to a helper class from 1 to 2 entries - fails. Sometimes. And that only on many cores on that machine (up to now).

I checked all involved classes, their refcounting, and that the o3tl::cow_wrapper in use has the ThreadSafeRefCountingPolicy; it looks good so far. It is also not the case that the WorkerThreads need massive amounts of memory of their own, so I doubt that limiting to e.g. 8 threads would change this, except maybe making it less likely to happen. I looked at o3tl::cow_wrapper itself, and at the basic B2D/B3DPrimitive implementations, which internally use a comphelper::OBaseMutex, e.g. for creating buffered decompositions.

I have found no concrete reason so far; any tips/help much appreciated.

I will keep watching this - at least it has not happened in any of the builds since the 13th, nor on any other machine, so the task now is to somehow nail it down and make it reproducible. If someone has other traces, please send them! I would hate to take this back, especially because we will need multithreading more and more since Moore's law is faltering.

Sincerely,

Armin


On 17.05.2016 at 14:35, Norbert Thiebaud wrote:
> On Tue, May 17, 2016 at 6:44 AM, Thorsten Behrens [hidden email] wrote:
>> Norbert Thiebaud wrote:
>>> The threaded work then calls raise() due to some memory problem, and our
>>> signal handler tries to acquire the SolarMutex -> deadlock
>>>
>> Eek, that's ugly. Then again, at the core is the OOM condition, which
>> needs solving independently. Per chance, is that happening on a box
>> with massive amounts of CPU threads?
>
> It is on the CI builder, so yes, 32 threads or so.
>
> But I disagree that it is _at the core_.
>
> At the core, this exhibits two things:
> 1/ we do a lot of things that are verboten in a signal handler.
> 2/ taking a lock that relies on another thread to move forward while
> holding the SolarMutex is begging for deadlock.
>
> Norbert

-- 
--
ALG (PGP Key: EE1C 4B3F E751 D8BC C485 DEC1 3C59 F953 D81C F4A2)

Norbert Thiebaud

Re: multi-threading task under SolarMutex -> deadlock

On Wed, May 18, 2016 at 3:43 AM, Armin Le Grand <[hidden email]> wrote:

> Hi Norbert,
>
> thanks for also having an eye on this - I am looking for the failure reports
> on ci.libreoffice.org currently, too.
> Last is from http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7195/, so
> last is from Friday, 13th (uhhh...)
>
> Have you seen such or similar stacks anywhere else? In the meantime I tried
> ChartTest massively locally on Linux and Win, but could never locally
> reproduce.

I noticed that things started to hang regularly, and I still had to
cancel a hung job a few minutes ago.
I only looked at the particular case I sent the backtrace for.

>
> I knew that and made sure that the multithreaded 3DRenderer WorkerThreads do
> not need the SolarMutex for their work. I did not know yet that the memory
> fail handler tries to get the SolarMutex, too, but that is logical when it
> wants to bring up a dialog in some form.

Yeah, but that _is_ a major flaw. Code running in a signal handler is only
allowed to call async-signal-safe functions.
Ignoring that _will_ cause trouble. It is not your doing. It just
happens that your multithreading case seems to cause memory starvation,
which in turn triggers a signal that clearly exhibits why all the things
we try to do in a signal handler are not a good idea.
Not to mention that, even if that were allowed, trying to bring up a
dialog when we have already run out of memory is never going to end
well anyway.

for reference:

  Async-signal-safe functions
       A signal handling routine established by sigaction(2) or signal(2)
       must be very careful, since processing elsewhere may be interrupted
       at some arbitrary point in the execution of the program. POSIX has
       the concept of "safe function". If a signal interrupts the execution
       of an unsafe function, and handler calls an unsafe function, then
       the behavior of the program is undefined.

       POSIX.1-2004 (also known as POSIX.1-2001 Technical Corrigendum 2)
       requires an implementation to guarantee that the following functions
       can be safely called inside a signal handler:

           _Exit()
           _exit()
           abort()
           accept()
           access()
           aio_error()
           aio_return()
           aio_suspend()
           alarm()
           bind()
           cfgetispeed()
           cfgetospeed()
           cfsetispeed()
           cfsetospeed()
           chdir()
           chmod()
           chown()
           clock_gettime()
           close()
           connect()
           creat()
           dup()
           dup2()
           execle()
           execve()
           fchmod()
           fchown()
           fcntl()
           fdatasync()
           fork()
           fpathconf()
           fstat()
           fsync()
           ftruncate()
           getegid()
           geteuid()
           getgid()
           getgroups()
           getpeername()
           getpgrp()
           getpid()
           getppid()
           getsockname()
           getsockopt()
           getuid()
           kill()
           link()
           listen()
           lseek()
           lstat()
           mkdir()
           mkfifo()
           open()
           pathconf()
           pause()
           pipe()
           poll()
           posix_trace_event()
           pselect()
           raise()
           read()
           readlink()
           recv()
           recvfrom()
           recvmsg()
           rename()
           rmdir()
           select()
           sem_post()
           send()
           sendmsg()
           sendto()
           setgid()
           setpgid()
           setsid()
           setsockopt()
           setuid()
           shutdown()
           sigaction()
           sigaddset()
           sigdelset()
           sigemptyset()
           sigfillset()
           sigismember()
           signal()
           sigpause()
           sigpending()
           sigprocmask()
           sigqueue()
           sigset()
           sigsuspend()
           sleep()
           sockatmark()
           socket()
           socketpair()
           stat()
           symlink()
           sysconf()
           tcdrain()
           tcflow()
           tcflush()
           tcgetattr()
           tcgetpgrp()
           tcsendbreak()
           tcsetattr()
           tcsetpgrp()
           time()
           timer_getoverrun()
           timer_gettime()
           timer_settime()
           times()
           umask()
           uname()
           unlink()
           utime()
           wait()
           waitpid()
           write()

       POSIX.1-2008 removes fpathconf(), pathconf(), and sysconf()
from the above list, and adds the following functions:

           execl()
           execv()
           faccessat()
           fchmodat()
           fchownat()
           fexecve()
           fstatat()
           futimens()
           linkat()
           mkdirat()
           mkfifoat()
           mknod()
           mknodat()
           openat()
           readlinkat()
           renameat()
           symlinkat()
           unlinkat()
           utimensat()
           utimes()


>
> But the deeper problem is that allocation - here extending a vector of
> pointers to a helper class from 1 to 2 entries - fails.

the straw that broke the camel's back....
or as the french put it: the drop that made the water overflow...
bear in mind, iirc the default allocator strategy for a vector is to
'double' on out-of-space.
iow if you had a vector of 128 capacity.. when you try to push a 129th
element you end up with a 256-sized vector
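
A small sketch to observe that growth pattern. The helper name capacitySteps is made up for this illustration, and the growth factor is implementation-defined (doubling is what libstdc++ is commonly seen to do, other libraries may grow by a different factor), so the code only records the capacities instead of assuming them:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the capacity observed after each
// reallocation while a vector of pointers grows to `count` elements.
std::vector<std::size_t> capacitySteps(std::size_t count)
{
    std::vector<void*> v;
    std::vector<std::size_t> steps;
    std::size_t last = v.capacity();
    for (std::size_t i = 0; i < count; ++i)
    {
        v.push_back(nullptr);
        if (v.capacity() != last)
        {
            // The vector reallocated: while the elements are moved, the
            // old and the new buffer exist at the same time.
            last = v.capacity();
            steps.push_back(last);
        }
    }
    assert(v.size() == count);
    return steps;
}
```

Calling capacitySteps(129) and printing the result shows how few reallocations happen and how large the last jump is on a given standard library.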

also, tests are run in parallel.. so up to 32 tests may be running in // already.
if each one wants to run 32 threads.. that will seriously over-commit
the machine.. with 32*32 = 1024 threads fighting for cpu.


> Sometimes. And that
> only on many cores on that machine (up to now).

I do not mind the out-of-memory thing... unless it is a bug (run-away
allocation or something)
out of memory are bound to happen... what I mind is the consequences..
as bad as a crash is... a hung process is worse

fyi. tb75 has 2x8 cores, 32 threads and 64GB of RAM.


>
> I checked all involved classes, their refcounting and that the used
> o3tl::cow_wrapper uses the ThreadSafeRefCountingPolicy, looks good so far.
> It is also not the case that the WorkerThreads need massive amounts of own
> memory, so I doubt that limiting to e.g. 8 threads would change this,
> except maybe making it less probable to happen. I looked at
> o3tl::cow_wrapper itself, and the basic B2D/B3DPrimitive implementations
> which internally use a comphelper::OBaseMutex e.g. for creating buffered
> decompositions.
>
> I found no concrete reason until now, any tips/help much appreciated.

I honestly did not look at all _why_ it signaled. I only investigated
why that resulted in a deadlock.

>
> I keep watching this - at least it did not happen in all the builds since
> 13th and on no other machine, so the task now is to somehow nail it to get
> it reproducible.
It very likely happened again on
http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7260/
which I cancelled after more than 2 hours of runtime

> If someone has other traces, please send them! I would hate
> to take this back, esp. because we will need multithreading more and more
> since Moore's law is tilting.

If we want to use multi-threading, we will have, at least, to solve
the signal handler situation, since as it stands it makes the 'promise
that this worker thread won't try to hold the solarmutex' impossible
to uphold.
Maybe a way around that would be to make sure that all worker threads use
pthread_sigmask to essentially prevent any of them from being interrupted
for signal handling, by blocking all signals in these threads.
in the same vein we could have a dedicated signal handler thread
looping on a sigwait... then we can do the crazy things we do in a
handler without as many limitations
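
A rough sketch of that idea, not LibreOffice code: the function names are invented for illustration and the choice of SIGTERM and SIGUSR1 is an assumption. It only covers asynchronously delivered signals; a signal raised synchronously by a faulting thread (SIGSEGV, SIGBUS) cannot be diverted this way.

```cpp
#include <cassert>
#include <csignal>
#include <pthread.h>
#include <signal.h>
#include <thread>
#include <unistd.h>

// Blocks a set of asynchronous signals in the calling thread. Threads
// created afterwards inherit the mask, so workers are never interrupted
// by these signals. (Signal choice is an assumption for this sketch.)
bool blockAsyncSignals(sigset_t& set)
{
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGUSR1);
    return pthread_sigmask(SIG_BLOCK, &set, nullptr) == 0;
}

// Body of the dedicated thread: waits synchronously for one signal of
// the set. This is ordinary thread context, so the code reacting to the
// signal is not restricted to async-signal-safe functions.
int waitForSignal(const sigset_t& set)
{
    int sig = 0;
    if (sigwait(&set, &sig) != 0)
        return -1;
    assert(sig != 0);
    return sig;
}

// Demo: block first, start the handler thread, send a process-directed
// signal, and return the signal number the handler thread received.
int demoSignalThread()
{
    sigset_t set;
    if (!blockAsyncSignals(set))
        return -1;
    int received = 0;
    std::thread handler([&] { received = waitForSignal(set); });
    kill(getpid(), SIGUSR1);
    handler.join();
    return received;
}
```

The mask has to be set before any worker thread is created, otherwise a thread that does not block the signal may still be picked for delivery.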


The whole SolarMutex craziness is yet another story altogether....
_______________________________________________
LibreOffice mailing list
[hidden email]
https://lists.freedesktop.org/mailman/listinfo/libreoffice
Armin Le Grand-2

Re: multi-threading task under SolarMutex -> deadlock

Hi Norbert,

thanks a lot for the extensive answer. Comments inline...

On 18.05.2016 at 20:22, Norbert Thiebaud wrote:
On Wed, May 18, 2016 at 3:43 AM, Armin Le Grand [hidden email] wrote:
Hi Norbert,

thanks for also having an eye on this - I am looking for the failure reports
on ci.libreoffice.org currently, too.
Last is from http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7195/, so
last is from Friday, 13th (uhhh...)

Have you seen such or similar stacks anywhere else? In the meantime I tried
ChartTest massively locally on Linux and Win, but could never locally
reproduce.
I noticed, and I still had to cancel a hung job a few minutes ago, that
things started to hang regularly.
I only looked at that particular case I sent the backtrace about.

Yes, that is not acceptable in the long run, I agree.


I knew that and made sure that the multithreaded 3DRenderer WorkerThreads do
not need the SolarMutex for their work. I did not know yet that the memory
fail handler tries to get the SolarMutex, too, but it is logical when it wants to
bring up a dialog in some form.
yeah but that _is_ a major flaw. code running in a signal handler is only
allowed to call async-signal-safe functions.
Ignoring that _will_ cause trouble. it is not your doing. it just
happens that your multithreading case seems to cause memory starvation
which in turn triggers a signal that shows clearly why all the things
we try to do in a signal handler are not a good idea.
Not to mention that, even if that was allowed, trying to bring up a
dialog when we have already run out of memory is never going to end
well anyway.

But the deeper problem is that allocation - here extending a vector of
pointers to a helper class from 1 to 2 entries - fails.
the straw that broke the camel's back....
or as the french put it: the drop that made the water overflow...
bear in mind, iirc the default allocator strategy for a vector is to
'double' on out-of-space.
iow if you had a vector of 128 capacity.. when you try to push a 129th
element you end up with a 256-sized vector

True - I know. In the case of the stacktrace it is a vector of pointers, so not much mem involved. All objects used in 3D are ref-counted or cow-wrapped. Even the target mem for pixels is shared - thus I doubt that rendering in parallel uses much more than rendering in one thread. I would state that it will not need double the mem it needs in a single thread. Still, of course, a single byte makes the water overflow at the end.


also, tests are run in parallel.. so up to 32 tests may be running in // already.
if each one wants to run 32 threads.. that will seriously over-commit
the machine.. with 32*32 = 1024 threads fighting for cpu.

That might be more of a problem. Do the UnitTests run in one mem space? I doubt it. The overhead for a single WorkerTask should not be too big in itself - RegisterSet, Stack, ..?
BTW, that global ThreadPool and the WorkerThreads used get created when the office starts, and other (few for now) usages use it already, so that problem should have shown earlier...?



Sometimes. And that
only on many cores on that machine (up to now).
I do not mind the out-of-memory thing... unless it is a bug (run-away
allocation or something)
out of memory are bound to happen... what I mind is the consequences..
as bad as a crash is... a hung process is worse

fyi. tb75 has 2x8 cores, 32 threads and 64GB of RAM.

It probably is indeed running out of memory in extreme situations, even though I can hardly imagine that. The 3D raster conversion is hard-limited to upper bounds in pixels and, as said, it uses referenced objects, not really local memory objects of remarkable size - I have no idea yet.
Interestingly, in stack 7273, the crash is probably related, but happens in freeing one object. Does this not speak against the out-of-memory theory completely?



I checked all involved classes, their refcounting and that the used
o3tl::cow_wrapper uses the ThreadSafeRefCountingPolicy, looks good so far.
It is also not the case that the WorkerThreads need massive amounts of own
memory, so I doubt that limiting to e.g. 8 threads would change this,
except maybe making it less probable to happen. I looked at
o3tl::cow_wrapper itself, and the basic B2D/B3DPrimitive implementations
which internally use a comphelper::OBaseMutex e.g. for creating buffered
decompositions.

I found no concrete reason until now, any tips/help much appreciated.
I honestly did not look at all _why_ it signaled. I only investigated
why that resulted in a deadlock.

If I understand you correctly, your main concern is to have a clean crash instead of the office hanging? That would definitely be better. But how to do that? I see in the stack 7195:

#9  0x00002aea35f04670 in <signal handler called> () at /lib64/libc.so.6
#10 0x00002aea35f045f7 in raise () at /lib64/libc.so.6
#11 0x00002aea35f05ce8 in abort () at /lib64/libc.so.6
#12 0x00002aea35f4a1b8 in  () at /lib64/libc.so.6
#13 0x00002aea35f4d877 in _int_malloc () at /lib64/libc.so.6
#14 0x00002aea35f4dae6 in malloc_check () at /lib64/libc.so.6
#15 0x00002aea3570e0cd in operator new(unsigned long) () at /lib64/libstdc++.so.6

So how can we get in between, so that operator new, when it fails, does not end up in the signal handler but calls e.g. exit() instead..?
I see no way to do that. In consequence that would mean that operator new is not allowed in WorkerThreads at all - that would be too big a limitation I guess...?


I keep watching this - at least it did not happen in all the builds since
13th and on no other machine, so the task now is to somehow nail it to get
it reproducible.
It very likely happened again on
http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7260/
which I cancelled after more than 2 hours of runtime

Checked 7260, it contains no trace of 'B3D', so this one does not include the multithreaded 3D render. Must be something else, of course maybe related to the general signal handler in WorkerThread problem. Or not - 'worker' is also not included.


If someone has other traces, please send them! I would hate
to take this back, esp. because we will need multithreading more and more
since Moore's law is tilting.
If we want to use multi-threading, we will have, at least, to solve
the signal handler situation, since as it stands it makes the 'promise
that this worker thread won't try to hold the solarmutex' impossible
to uphold.

True, agreed. How?

I already tried to use SolarMutexReleaser as inverse to SolarMutexGuard around waitUntilEmpty(), but that lets the next message get processed, which is not good.

The only thing I could think of is to change the implementation of waitUntilEmpty() to not just wait for all ThreadTasks to finish, but to do this in a timer loop. This is polling with a timer, but would in that way - AFAIK - release the SolarMutex from time to time and thus allow a deadlock to be resolved and let the signal handler do its thing -> and allow the crash. Not nice, but pragmatic.
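
A minimal sketch of such a polling wait, under loud assumptions: std::mutex stands in for the SolarMutex, TaskCounter is a made-up replacement for the pool's bookkeeping, and none of this is the comphelper::ThreadPool API.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the pool's count of unfinished tasks.
struct TaskCounter
{
    std::atomic<int> pending{0};
};

// Polls until no task is pending. `outer` must be locked by the caller
// and is released during every sleep, so another thread that needs it
// can make progress. Returns the number of polling rounds.
int waitUntilEmptyPolling(TaskCounter& tasks,
                          std::unique_lock<std::mutex>& outer,
                          std::chrono::milliseconds interval)
{
    assert(outer.owns_lock());
    int rounds = 0;
    while (tasks.pending.load() > 0)
    {
        outer.unlock();
        std::this_thread::sleep_for(interval);
        outer.lock();
        ++rounds;
    }
    return rounds;
}

// Demo: the "worker" can only finish after it got the outer mutex,
// which would deadlock with a plain blocking wait.
int demoPollingWait()
{
    std::mutex outerMutex;
    TaskCounter tasks;
    tasks.pending = 1;
    std::unique_lock<std::mutex> outer(outerMutex);
    std::thread worker([&] {
        std::lock_guard<std::mutex> guard(outerMutex);
        tasks.pending = 0;
    });
    int rounds = waitUntilEmptyPolling(tasks, outer, std::chrono::milliseconds(5));
    worker.join();
    return rounds;
}
```

Note that every release lets any waiter take the mutex, not only the signal handler, so the same side effect seen with SolarMutexReleaser could apply here as well.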

Maybe a way around that would be to make sure that all worker threads use
pthread_sigmask to essentially prevent any of them from being interrupted
for signal handling, by blocking all signals in these threads.
in the same vein we could have a dedicated signal handler thread
looping on a sigwait... then we can do the crazy things we do in a
handler without as many limitations

Interesting, but too complicated as long as we have the SolarMutex mayhem...?



The whole SolarMutex craziness is yet another story altogether....

Yes, what a big can of worms, sigh...


    

--
ALG (PGP Key: EE1C 4B3F E751 D8BC C485 DEC1 3C59 F953 D81C F4A2)

sberg

Re: multi-threading task under SolarMutex -> deadlock

In reply to this post by Norbert Thiebaud
On 05/18/2016 08:22 PM, Norbert Thiebaud wrote:
> If we want to use multi-threading, we will have, at least, to solve
> the signal handler situation, since as it stands it makes the 'promise
> that this worker thread won't try to hold the solarmutex' impossible
> to uphold.

The crash handling code called from within the signal handler is just a
big mess of illegal activity (that apparently happens to work some of
the time, but is also known to fail spectacularly sometimes).  Locking
the SolarMutex is just one of many illegal aspects.  I wouldn't worry
trying to take care of it independently of cleaning up the big mess
completely.

> Maybe a way around that would be to make sure that all worker threads use
> pthread_sigmask to essentially prevent any of them from being interrupted
> for signal handling, by blocking all signals in these threads.
> in the same vein we could have a dedicated signal handler thread
> looping on a sigwait... then we can do the crazy things we do in a
> handler without as many limitations

You cannot block SIGBUS, SIGSEGV etc. synchronously generated by the OS.
Noel Grandin-2

Re: multi-threading task under SolarMutex -> deadlock



On 23 May 2016 at 12:57, Stephan Bergmann <[hidden email]> wrote:

The crash handling code called from within the signal handler is just a big mess of illegal activity (that apparently happens to work some of the time, but is also known to fail spectacularly sometimes).  Locking the SolarMutex is just one of many illegal aspects.  I wouldn't worry trying to take care of it independently of cleaning up the big mess completely.

Would it not be easier (and safer) to create some small, independent programs that use no other parts of LibreOffice, wrap the main executable, catch any and all exceptions, and then do whatever we need to do: upload a dump, set some kind of "attempt restore" flag in the registry, etc.?
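
On POSIX systems such a wrapper could look roughly like the sketch below. Everything in it (ChildResult, runAndWatch, runShell) is a hypothetical name for illustration, and a Windows variant would need a different mechanism. The wrapper starts the real program as a child and inspects how it ended, so any crash handling runs in a healthy process.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class ChildResult { ExitedCleanly, ExitedWithError, KilledBySignal, LaunchFailed };

// Starts `path` with `argv` as a child process and reports how it ended.
ChildResult runAndWatch(const char* path, char* const argv[])
{
    assert(path != nullptr && argv != nullptr);
    pid_t pid = fork();
    if (pid < 0)
        return ChildResult::LaunchFailed;
    if (pid == 0)
    {
        execv(path, argv);
        _exit(127); // only reached if execv failed
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0)
    {
        if (errno != EINTR)
            return ChildResult::LaunchFailed;
    }
    if (WIFSIGNALED(status))
        return ChildResult::KilledBySignal; // e.g. SIGSEGV, SIGABRT
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return ChildResult::ExitedCleanly;
    return ChildResult::ExitedWithError;
}

// Demo helper: runs a shell command as the watched child.
ChildResult runShell(const std::string& command)
{
    std::string sh = "sh", flag = "-c", cmd = command;
    char* argv[] = { &sh[0], &flag[0], &cmd[0], nullptr };
    return runAndWatch("/bin/sh", argv);
}
```

In the KilledBySignal branch the wrapper would do the dump upload or set the restore flag, using ordinary code with no async-signal-safety restrictions.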
 
