Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failed: (delref), function GemRB_RemoveView, with clang on FreeBSD #2074

Open
traveler-gemrb opened this issue Apr 12, 2024 · 24 comments
Labels
bug portability for things related to platform support

Comments

@traveler-gemrb
Copy link

Bug description

gemrb-git build as Release (-O3 -DNDEBUG) fails here with

[ResourceManager]: Found 'GTRBPSK.mos' in 'chitin.key'.
Assertion failed: (delref), function GemRB_RemoveView, file /usr/home/Kuba/GemRBGit/gemrb/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.

after (used git bisect)

commit 605519e
Author: Jaka Kranjc <[email protected]>
Date: Tue Mar 26 11:07:12 2024 +0100

cmake: propagate compiler flag changes to parent scope

fixes last known regression and hopefully the odd crashes we've been
having #2063

cmake/Helpers.cmake | 5 ++++-
gemrb/core/Geometry.cpp | 2 +-
2 files changed, 5 insertions(+), 2 deletions(-)

Env:

FreeBSD 14.0-STABLE #0 stable/14-e069c451f amd64

FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)

@traveler-gemrb
Copy link
Author

To reproduce, one need to load game (BG1) or just load menu (IWD1).

@lynxlynxlynx
Copy link
Member

Well please try what we talked about.

@traveler-gemrb
Copy link
Author

traveler-gemrb commented Apr 13, 2024

Hello,

I've already tested default Release, that is (-O3 -DNDEBUG), without any custom -flto / -march=native etc. I think I've also tried -O2 (same result). I've tested -O now (same).

https://pastebin.com/ABRCLKv5 (default cmake build configuration with clang)

The problem does not exist with gcc13, however I've noticed that GUI have changed (??).

head/master (gcc13)

2024-04-13-214422_1440x900_scrot

ddce87c (clang)

2024-04-13-220255_1440x900_scrot

@lynxlynxlynx
Copy link
Member

The GUI stuff is unrelated; if you build the same revision you should get the same results. It's just widescreen mod changes.

@traveler-gemrb
Copy link
Author

traveler-gemrb commented Apr 27, 2024

Thanks for a clarification, how can I get back to previous style? I think that was the intended look.

@lynxlynxlynx
Copy link
Member

I don't know which version you have, but from above reverting d47c4cf should do it (or copying the file).

@lynxlynxlynx lynxlynxlynx changed the title Assertion failed: (delref), function GemRB_RemoveView, in GUIScript.cpp, line 2015 Assertion failed: (delref), function GemRB_RemoveView, with clang on FreeBSD May 3, 2024
@traveler-gemrb
Copy link
Author

FWIW,

Same with 14.1-STABLE #0 stable/14-91df7d335 amd64 and FreeBSD clang version 18.1.5

@lynxlynxlynx
Copy link
Member

@bsdcode can you do a quick check if it still runs for you? It's odd that you two would have so different experiences on FreeBSD so close in time.

@bsdcode
Copy link
Contributor

bsdcode commented May 12, 2024

Unfortunately because of lack of time I hadn't compiled and tested GemRB for over a month now I think. So I just compiled the most recent commit now and can confirm the problem.

I'm on FreeBSD 14.0-RELEASE-p6 and compiled with clang 16.0.6. I tested BG1. Starting a new game produces this after character creation:

Assertion failed: (ref), function GemRB_View_AddSubview, file /usr/ports/games/gemrb-devel/work/gemrb-5cfd2e6/gemrb/plugins/GUIScript/GUIScript.cpp, line 1051.

Loading a saved game produces this:

Assertion failed: (delref), function GemRB_RemoveView, file /usr/ports/games/gemrb-devel/work/gemrb-5cfd2e6/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.

I can test with GCC later.

@lynxlynxlynx
Copy link
Member

It sounds like #1786 all over again. :s
It's worth a bisect between the end of February and now, since two of you reported everything was fine:
#1786 (comment)

@lynxlynxlynx lynxlynxlynx added the portability for things related to platform support label May 12, 2024
@traveler-gemrb
Copy link
Author

I did bisect and my problem appeared with 605519e

gcc13 here is fine at the moment, and was then too.

@lynxlynxlynx
Copy link
Member

Oh, right, forgot you already did it, sorry. That commit is needed, since before all the compiler flags weren't being set after the modularisation of the system. At that time I compared command-lines (it's all printed if you do make VERBOSE=1).

This little patch shows that on my system, there's no preset set for the values, so even if there was some problem with overriding, it's not coming into play. Do you get the same results?

diff --git a/CMakeLists.txt b/CMakeLists.txt
index a0cd50a88..fcabcae71 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -93,8 +93,14 @@ INCLUDE_DIRECTORIES(
 )
 
 INCLUDE(config)
+message(STATUS "INTRO VALS: ${CMAKE_CXX_FLAGS}")
+message(STATUS "INTRO VALS: ${CMAKE_SHARED_LINKER_FLAGS}")
+message(STATUS "INTRO VALS: ${CMAKE_MODULE_LINKER_FLAGS}")
 CONFIGURE_LINKING()
 CONFIGURE_COMPILER()
+message(STATUS "POST VALS: ${CMAKE_CXX_FLAGS}")
+message(STATUS "POST VALS: ${CMAKE_SHARED_LINKER_FLAGS}")
+message(STATUS "POST VALS: ${CMAKE_MODULE_LINKER_FLAGS}")
 CONFIGURE_PYTHON()
 CONFIGURE_SDL(${SDL_BACKEND})
 CONFIGURE_OPENGL(${OPENGL_BACKEND} ${SDL_BACKEND})

-- INTRO VALS:
-- INTRO VALS:
-- INTRO VALS:
-- POST VALS: -Werror -Wno-inline -Wno-error=cast-align -Wmissing-declarations -Wall -W -Wpointer-arith -Wno-format-y2k -Wno-long-long -pedantic -fsigned-char -fvisibility=hidden -ffast-math -frounding-math -fno-strict-aliasing -Wcast-align -Wno-error=stringop-truncation -Wno-error=stringop-overflow -Wno-error=stringop-overread -Wimplicit-fallthrough=2 -Woverloaded-virtual=0
-- POST VALS:
-- POST VALS: -Wl,--no-undefined

@bsdcode
Copy link
Contributor

bsdcode commented May 12, 2024

My results are similar to yours, I have less Warnings/no-errors:

-- INTRO VALS:
-- INTRO VALS:
-- INTRO VALS:
-- POST VALS: -Werror -Wno-inline -Wno-error=cast-align -Wmissing-declarations -Wall -W -Wpointer-arith -Wno-format-y2k -Wno-long-long -pedantic -fsigned-char -fvisibility=hidden -ffast-math -frounding-math -fno-strict-aliasing
-- POST VALS:
-- POST VALS: -Wl,--no-undefined

After I reverted 605519e in cmake/Helpers.cmake I got empty POST VALS, but GemRB works again.

@bsdcode
Copy link
Contributor

bsdcode commented May 12, 2024

Dropping -fvisibility=hidden from cmake/Helpers.cmake seems to fix the issue. But I'm a little bit out of the loop of symbol visibility and how it affects the C++/python interactions, so I don't know what could break, also in regards to compiling with GCC.

@lynxlynxlynx
Copy link
Member

That's odd, since even before the major cmake changes of 553ecf4, we were adding it if available:

       CHECK_CXX_COMPILER_FLAG("-fvisibility=hidden" VISIBILITY_HIDDEN)
       IF (VISIBILITY_HIDDEN AND NOT WIN32)
               string(APPEND CMAKE_CXX_FLAGS " -fvisibility=hidden")
       ENDIF ()

@bradallred
Copy link
Member

bradallred commented May 12, 2024

I feel like we've touched on this elsewhere at some point (I think I wanted to drop visibility=hidden from core), but the issue has nothing to do with Python. There is possibly more than one issue in play as well. The gist is:

  1. we use dynamic_cast and hiding symbols could conceivably result in RTTI info getting stripped as an optimization if the compiler deduces its not used. It's allowed to do this because it's used across module boundaries, which is implementation defined behavior, because C++ has no concept of dynamically loaded code. This might be fixed by explicitly adding a compiler flag to keep RTTI info.
  2. if duplicate symbols exist in both libcore and GUIScripts.so, which one is used would be UB. This is a chronic issue with templates.
  3. could be that some classes are missing GEM_EXPORT

I still say it makes no sense to hide symbols in libcore, and I don't remember why we do it. Not sure if that alone would resolve this issue tho. I think the disunion is in a closed PR from me if you want to go look.

@lynxlynxlynx
Copy link
Member

Let's not get sidetracked if not needed, as we've been conservative with visibility for over a decade and nobody reported any problems until we did @czarny247's mentioned cmake refactor. Let's try to get to the underlying trigger, is it just a newer clang version problem? Etc.

@traveler-gemrb / @bsdcode can you try building a3501e0?

@bradallred
Copy link
Member

I'm not getting side tracked, I'm sharing my experience from decades of cross platform development regarding symbol visibility issues.

I'm suggesting that we identify the problematic symbols and ensure they are exported, including all the requisite RTTI info. Even then, there is no guarantee if you are finding yourself in a situation where the same symbol exists in multiple binary images (eg tamplates). A GemRB::foo defined in libcore is not the same as a GemRB::foo defined in GUIScripts.so and cannot be dynamically_cast to one. There may be linker flags/magic you can do to resolve this, however. Otherwise, it's going to require preprocessor magic to extern those when building plugins to prevent duplication.

@lynxlynxlynx
Copy link
Member

And we worked fine on FreeBSD for ages, so it's not at all clear if that's an issue in practice. I'm not saying it shouldn't be addressed at some point, but here we currently have a clear smoking gun.

@bradallred
Copy link
Member

here we currently have a clear smoking gun

which is?

And we worked fine on FreeBSD for ages, so it's not at all clear if that's an issue in practice

Thats my point tho, the nature of UB/IB is that it is allowed to work just fine. It's also allowed to stop working. We just witnessed this same phenomenon with SDL_Quit and plugin lifetimes.

@lynxlynxlynx
Copy link
Member

The refactor ... we were miscompiling everything for several months.

@MarcelHB
Copy link
Collaborator

Don't know if it's helpful to narrow down the cause, but it's reproducible under a VM and the demo easily, and this is what returns nullptr here for whatever reason:

if (!attr) {
RuntimeError("Invalid Scripting reference, must have SCRIPT_GROUP attribute.");
return nullptr;
}

Originating from that piece of script:

console = consoleWin.ReplaceSubview (0, IE_GUI_CONSOLE, hist)

@traveler-gemrb
Copy link
Author

traveler-gemrb commented May 13, 2024

@traveler-gemrb / @bsdcode can you try building a3501e0?

Sorry, was away - same problem here with clang :

(Assertion failed: (delref), function GemRB_RemoveView, file /home/Kuba/GemRBGit/gemrb/gemrb/plugins/GUIScript/GUIScript.cpp, line 2015.)

$ git log
commit a3501e0 (HEAD)
Author: czarny247 [email protected]
Date: Sun Dec 17 19:29:57 2023 +0100

Add clangd .cache dir to .gitignore

I've checked again, ddce87c is ok.

@lynxlynxlynx
Copy link
Member

Ok, thanks, so we can let cmake rest and it is a worse problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug portability for things related to platform support
Projects
None yet
Development

No branches or pull requests

5 participants