Commit Graph

2 Commits

Author SHA1 Message Date
Abdullah Atta
205373dca3 core: use htmlparser2 for html rewriting
This replaces DOMParser with htmlparser2, which is much, much faster.
How much faster? 80%. The new implementation can parse at 50 MB/s,
which is insane! The old one could only do 5-10 MB/s.

We still haven't gotten rid of DOMParser, since HTML-to-MD conversion
still needs it. That will be done soon by using `dr-sax`.

This uses a custom implementation of htmlparser2, which is 50% faster
than the default one.
2022-11-10 15:16:13 +05:00
Abdullah Atta
e1fc116994 core: improve content conflict detection using proper HTML diffing (#1183)
Since HTML is a tree-like language, it is futile to compare it character
for character. `html1 === html2` is almost always false. This commit
introduces a simple diffing algorithm that only checks the text inside
the HTML plus a few other attributes to decide whether the two HTML
documents are actually different. This is obviously not foolproof, and it
ignores everything aesthetic (b, em, strong tags, etc.). This is actually
desirable because, in our case, only a difference in text should
warrant a conflict. Everything else can easily be brought back.
Similarly, this also ignores whitespace differences surrounding the
tags.

All in all, it'll provide a more reliable alternative to MD5-hashing the
two HTML strings.
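
The idea can be illustrated with a minimal sketch. This is not the code from
this commit: `INLINE_TAGS`, `textFingerprint` and `isContentDifferent` are
hypothetical names, the tag list is an assumption, and a regex stands in for
a real HTML parser (so it would mishandle things like `>` inside attribute
values). It also leaves out the attribute checks mentioned above and compares
only text.

```typescript
// Hypothetical sketch of text-only HTML comparison (assumed names throughout).
// Purely aesthetic inline tags: removed without leaving a gap, so that
// "Hel<em>lo</em>" and "Hello" yield the same text. The list is an assumption.
const INLINE_TAGS = new Set(["b", "em", "strong", "i", "u", "s", "span"]);

// Reduce an HTML string to its visible text with normalized whitespace.
function textFingerprint(html: string): string {
  return html
    // Drop inline tags; replace every other tag with a space so that text
    // from adjacent block elements does not run together.
    .replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g, (_tag: string, name: string) =>
      INLINE_TAGS.has(name.toLowerCase()) ? "" : " "
    )
    // Collapse whitespace runs so spacing around tags is ignored.
    .replace(/\s+/g, " ")
    .trim();
}

// Two documents conflict only when their text differs.
function isContentDifferent(html1: string, html2: string): boolean {
  return textFingerprint(html1) !== textFingerprint(html2);
}
```

Under these assumptions, `<p>Hello <b>world</b></p>` and `<p>Hello world</p>`
compare as equal even though `html1 === html2` is false, while a change to
the text itself is still reported as a difference.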
2022-10-13 19:22:32 +05:00