Created: 5 years, 2 months ago by wychen
Modified: 5 years, 1 month ago
CC: blink-reviews, blink-reviews-api_chromium.org, blink-reviews-dom_chromium.org, chromium-reviews, dglazkov+blink, eae+blinkwatch, rwlbuis, sof
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.
Description:
Add feature extraction for distillability to Blink
BUG=509869
TEST=webkit_unit_tests --gtest_filter=DocumentStatisticsCollectorTest.*
Committed: https://crrev.com/db4d18afb53ef9ac67a03edefa2bbbafe50723a7
Cr-Commit-Position: refs/heads/master@{#359158}
Patch Set 1
Total comments: 81
Patch Set 2 : address comments, add tests
Total comments: 51
Patch Set 3 : address comments, add saturation
Total comments: 13
Patch Set 4 : address comments, remove innerText
Total comments: 3
Patch Set 5 : fix linking issue
Patch Set 6 : fix assertion style
Patch Set 7 : don't trim textContent, remove debug msg
Patch Set 8 : add mobile friendly detection
Total comments: 8
Patch Set 9 : address dglazkov's comments
Patch Set 10 : wrap long line
Total comments: 24
Patch Set 11 : address esprehn's comments
Patch Set 12 : avoid sqrt in global ctor
Patch Set 13 : merge master
Total comments: 14
Patch Set 14 : address esprehn's comments
Total comments: 2
Patch Set 15 : stricter test
Messages
Total messages: 62 (17 generated)
wychen@chromium.org changed reviewers: + dglazkov@chromium.org, esprehn@chromium.org
The Blink-side of the following CL in patch set 13 is split here. PTAL. Thanks! https://codereview.chromium.org/1248643004/
Migrated some ongoing questions from https://codereview.chromium.org/1248643004 below for easier reply.

https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:45: class ExtractFeatureWalker {
On 2015/10/22 16:30:31, dglazkov wrote:
> This can just be a separate class, no need to hide it. Will likely need a
> different name then.

Do you mean creating new cpp/h files for this class in WebKit/Source/core/dom/? I don't understand why hiding it is not a good idea. This class is only used here. Or do you mean merging it with class DocumentStatisticsCollector?

https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:66: String hay = elem.getClassAttribute().lower() + " " + elem.getIdAttribute().lower();
On 2015/10/22 16:30:31, dglazkov wrote:
> We have lots of style machinery to do this correctly. Please look over
> StyleResolver and friends. Now that you're in Blink, you no longer need to treat
> that machinery as opaque.

Could you elaborate on how to use StyleResolver to improve this?
Is there a design doc on why you're collecting so much information about the entire document? This is going to be very expensive, when will this run? How do you avoid running on something like the HTML spec or wikipedia where crawling the entire document like this will take hundreds of ms. Also this needs unit tests inside blink, see the other web api tests for examples. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:19: using namespace WTF; don't using namespace WTF https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:20: using namespace Unicode; why do you need this? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:31: return root.textContent().stripWhiteSpace().length(); this is very expensive, you're creating a huge string of things inside the document, then allocating a big copy without start/end whitespace https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:37: EphemeralRange range = EphemeralRange::rangeOfContents(root); Why do you need innerText length? This is going to be very expensive to compute. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:53: bool isVisible(Element& elem) element, don't abbreviate in blink https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:56: m_document.domWindow()->getComputedStyle(&elem, String()); don't call through the domWindow(), you should almost never need that in C++, certainly not here. element->computedStyle() is all you need. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:58: style->getPropertyValue("display") == "none" You want to use ComputedStyle and the API surface on that https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:60: || style->getPropertyValue("opacity") == "0" these properties are not inherited, this code doesn't actually work, if your ancestor was visibility: hidden or opacity: 0 this doesn't catch it. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:64: bool matchName(Element& elem, const std::vector<String>& words) don't use std::vector in blink, don't abbreviate. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:66: String hay = elem.getClassAttribute().lower() + " " + elem.getIdAttribute().lower(); don't combine concat the strings like this, also don't convert them to lowercase. This is super expensive, you want to do: You want to check element.hasId() then use equalIgnoringCase with element.getIdAttribute() and iterate the words, note that this will be pretty expensive, do you really want to do this? Same with element.hasClass(), and using element.classNames(). https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:77: walk(*m_document.body(), false); what is false? 
https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: m_features.openGraph = hasOGArticle(*m_document.head()); don't abbreviate, what is OG? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:82: void walk(Element& root, bool underLi = false) what is underLi? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:84: DEFINE_STATIC_LOCAL(std::vector<String>, unlikelyCandidates, ({"banner", "combx", "comment", "community", "disqus", "extra", "foot", "header", "menu", "related", "remark", "rss", "share", "shoutbox", "sidebar", "skyscraper", "sponsor", "ad-break", "agegate", "pagination", "pager", "popup"})); don't use std::vector, use WTF types, also I'm pretty sure you want AtomicString here since it'll be faster, but I guess not if you want everything case insensitive... which means you actually want a HashSet<String, CaseFoldingHash> and to do set lookups as you traverse the tree. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:85: DEFINE_STATIC_LOCAL(std::vector<String>, okMaybeItsACandidate, ({"and", "article", "body", "column", "main", "shadow"})); ditto https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:87: for (Node& node : NodeTraversal::childrenOf(root)) { this doesn't handle shadow dom https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:90: m_features.textContentLength += text.length(); toText(node).length() https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:100: m_features.numAnchors++; anchorCount formCount etc. don't abbreviate with numFoo https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:104: if (element.getAttribute("type").lower() == "text") { typeAttr, you also want to use equalIgnoringCase. I'm also pretty sure you want to cast to HTMLInputElement and use ->type() instead https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:106: } else if (element.getAttribute("type").lower() == "pasword") { ditto https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:110: m_features.numPPRE++; Lets count these separately, then you can sum on the other side if you want https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:111: if (!underLi && isVisible(element) underListItem, also this isVisible thing doesn't work https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:112: && (!matchName(element, unlikelyCandidates) || matchName(element, okMaybeItsACandidate))) { okMaybeItsACandidate needs a better name, both sets seem like they need comments. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:115: m_features.mozScore += sqrt(len - 140); why 140? this needs a comment in the code or a constant https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:121: walk(element, element.hasTagName(liTag) || underLi); why do you care about ancestor li tags, what's the reason for this? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:127: for (const Node& node : NodeTraversal::childrenOf(head)) { again this doesn't handle shadow dom, but this is probably fine for the head https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:133: if ((element.getAttribute("name") == ("og:type")) || (element.getAttribute("property") == ("og:type"))) { toHTMLMetaElement and then use ->name(), also remove all the extra parens https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:134: WTF::CString content = element.getAttribute("content").upper().utf8(); don't convert to CString, you almost never want that, you're causing many copies on this line of code. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:135: if ((content) == "ARTICLE") { You want to use ->content() and equalIgnoringCase https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:152: WebDistillabilityFeatures features({0}); Can we add a constructor for the struct instead and set the variable defaults? we don't usually do {0} like this in blink. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:157: if (!document.hasFinishedParsing()) how does this ever get called when we're still parsing, should this be an assert? 
https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:160: ASSERT(document.body()); this doesn't have to be true, you can remove the body from an html document, why do you think this should exist here? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:163: ExtractFeatureWalker walker(document, features); why do you need a walker at all instead of just static recursive functions? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:164: walker.walk(); just use recursive functions https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:167: features.innerTextLength += innerTextLength(*document.body()); you can remove the body, also this doesn't have to be an HTMLDocument... you want a null check this stuff I think. This is also traversing the entire document a second time. Once inside walker and again here, that's super expensive, what are you trying to figure out? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:198: document.addConsoleMessage(consoleMessage); just fprintf, no reason to console log your debug code https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.h (right): https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.h:9: class String; remove, you don't depend on this https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.h:17: class DocumentStatisticsCollector {
this needs tests.
fyi: the existing extract_features.js takes 13ms on the Wikipedia cats page and 230ms on the HTML spec on a MacBook Pro. This code should be somewhat faster, but that still seems pretty expensive. What's the plan for when you trigger this? This is a really big (and very expensive) hammer.

https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:114: if (len >= 140) {
If you only care about this, we can write something faster that early-outs as soon as it sees something >= 140 chars, without allocating a buffer.

This is potentially allocating a massive string that contains all the text in the entire page (e.g. <body class="body"> causes this).
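The early-out idea in the comment above can be illustrated with a minimal, standalone sketch. It does not use Blink types: `ToyNode`, `accumulateTextLength`, and `saturatedTextLength` are hypothetical names invented for illustration, and the real patch would walk Blink's DOM instead. The point is only that the walk sums text lengths without building a string, and stops as soon as the limit is reached.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT a Blink type.
struct ToyNode {
    std::string text;                                // non-empty only for text nodes
    std::vector<std::unique_ptr<ToyNode>> children;  // children in document order
};

// Adds the text length found under |node| to |length|, clamped to |limit|.
// Returns true once the limit is reached so that callers stop walking.
// No concatenated string is ever allocated.
bool accumulateTextLength(const ToyNode& node, std::size_t limit, std::size_t& length)
{
    length = std::min(limit, length + node.text.size());
    if (length >= limit)
        return true;
    for (const auto& child : node.children) {
        if (accumulateTextLength(*child, limit, length))
            return true;
    }
    return false;
}

// Returns min(total text length under |root|, |limit|).
std::size_t saturatedTextLength(const ToyNode& root, std::size_t limit)
{
    std::size_t length = 0;
    accumulateTextLength(root, limit, length);
    return length;
}
```

With a limit of 140, a subtree holding the whole page's text costs no more than visiting the first few text nodes, which addresses the `<body class="body">` case mentioned above.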
On 2015/10/23 at 05:17:30, esprehn wrote:
> fyi: the existing extract_features.js takes 13ms on the Wikipedia cats page and
> 230ms on the HTML spec on a MacBook Pro. This code should be somewhat faster, but
> that still seems pretty expensive. What's the plan for when you trigger this?
> This is a really big (and very expensive) hammer.

You bring up a good point. A couple of things:

1) We do need to better understand the performance impact of feature extraction. Currently, this extraction happens twice during the page load. What do these 13ms translate into on a Nexus 5? FWIW, even 13ms is too long. Regardless of the next steps, the performance impact of a feature needs to be well understood.

2) We need to determine strategies for collecting this data without sacrificing performance. The tag/class stats could probably be collected at the time of DOM construction without noticeable performance degradation. However, the distiller needs the innerText of the page, and that's a full-document scan. Luckily (or unluckily), there is already an innerText happening for CapturePageInfo on all platforms. It might not be the best method or the best fit for Distiller, but it's already there. So maybe the strategy should be to use that text instead of adding a new document scan.
Thank you for the very detailed review! It's very appreciated. I'll fix the code accordingly later. Before that, some of the comments are replied below. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:31: return root.textContent().stripWhiteSpace().length(); On 2015/10/23 04:59:14, esprehn wrote: > this is very expensive, you're creating a huge string of things inside the > document, then allocating a big copy without start/end whitespace Would this be acceptable if it's rewritten so that it doesn't allocate the string here? Like how m_features.textContentLength is calculated, but with tweaks about whitespace. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:37: EphemeralRange range = EphemeralRange::rangeOfContents(root); On 2015/10/23 04:59:15, esprehn wrote: > Why do you need innerText length? This is going to be very expensive to compute. This is one of the features we fed to the machine learning model. The full list of current and potential features are discussed here: https://docs.google.com/spreadsheets/d/1oLxW-H_kSXSmiZqiBaFWT6OCL91ljz1iKxAC7... It might be possible to use the text in CapturePageInfo instead, but we are currently bound to features that are available in JS. Alternatively, we could let them use innerText instead if this doesn't affect translation, but this might be tricky. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:60: || style->getPropertyValue("opacity") == "0" On 2015/10/23 04:59:14, esprehn wrote: > these properties are not inherited, this code doesn't actually work, if your > ancestor was visibility: hidden or opacity: 0 this doesn't catch it. Opacity is indeed not inherited, but visibility is. This is a direct translation from the JS implementation in order to make sure the output is the same. We could certainly do better regarding opacity than JS land, but currently we could only use what's available in JS as features. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:64: bool matchName(Element& elem, const std::vector<String>& words) On 2015/10/23 04:59:15, esprehn wrote: > don't use std::vector in blink, don't abbreviate. I'll use Vector instead. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:66: String hay = elem.getClassAttribute().lower() + " " + elem.getIdAttribute().lower(); On 2015/10/23 04:59:14, esprehn wrote: > don't combine concat the strings like this, also don't convert them to > lowercase. This is super expensive, you want to do: > > You want to check element.hasId() then use equalIgnoringCase with > element.getIdAttribute() and iterate the words, note that this will be pretty > expensive, do you really want to do this? > > Same with element.hasClass(), and using element.classNames(). Since partial matching is expected, I'll be using findIgnoringCase. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:77: walk(*m_document.body(), false); On 2015/10/23 04:59:15, esprehn wrote: > what is false? The second argument should've been removed since there's a default. 
https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: m_features.openGraph = hasOGArticle(*m_document.head()); On 2015/10/23 04:59:14, esprehn wrote: > don't abbreviate, what is OG? Is hasOpenGraphArticle() a self-explanatory name? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:84: DEFINE_STATIC_LOCAL(std::vector<String>, unlikelyCandidates, ({"banner", "combx", "comment", "community", "disqus", "extra", "foot", "header", "menu", "related", "remark", "rss", "share", "shoutbox", "sidebar", "skyscraper", "sponsor", "ad-break", "agegate", "pagination", "pager", "popup"})); On 2015/10/23 04:59:14, esprehn wrote: > don't use std::vector, use WTF types, also I'm pretty sure you want AtomicString > here since it'll be faster, but I guess not if you want everything case > insensitive... which means you actually want a HashSet<String, CaseFoldingHash> > and to do set lookups as you traverse the tree. If we only want exact matching, this would be a brilliantly efficient way. Alas partial matching is expected, so iterating through the word list is still necessary. In the original JS code, this was a regex matching. Presumably this is still faster than that, right? Is it possible that Irregexp magic makes it faster than hand-written code here? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:87: for (Node& node : NodeTraversal::childrenOf(root)) { On 2015/10/23 04:59:15, esprehn wrote: > this doesn't handle shadow dom Could you elaborate how shadow dom should be handled? Skipping them as JS would? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:112: && (!matchName(element, unlikelyCandidates) || matchName(element, okMaybeItsACandidate))) { On 2015/10/23 04:59:14, esprehn wrote: > okMaybeItsACandidate needs a better name, both sets seem like they need > comments. Does highlyLikelyCandidates sound better? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:121: walk(element, element.hasTagName(liTag) || underLi); On 2015/10/23 04:59:14, esprehn wrote: > why do you care about ancestor li tags, what's the reason for this? This is part of the scoring heuristics. <p> or <pre> under <li> are excluded since they are less likely to be a useful signal. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:167: features.innerTextLength += innerTextLength(*document.body()); On 2015/10/23 04:59:14, esprehn wrote: > you can remove the body, also this doesn't have to be an HTMLDocument... you > want a null check this stuff I think. > > This is also traversing the entire document a second time. Once inside walker > and again here, that's super expensive, what are you trying to figure out? The ratio between innerText.length() and textContent.length() is a useful signal. The signal is actually much stronger if innerHTML is also available, but that is removed due to its high cost.
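The heuristics discussed in this reply (partial, case-insensitive matching of class/id against word lists, excluding <p>/<pre> under <li>, and adding sqrt(len - 140) per qualifying element) can be sketched in standalone C++ as follows. This is an illustration under assumptions, not the code in the patch: `ToyElement`, `containsIgnoringCase`, `matchesAny`, and `score` are hypothetical names, the matching is ASCII-only, and the visibility check is left out entirely.

```cpp
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an element; this is NOT a Blink type.
struct ToyElement {
    std::string tag;             // lower-case tag name, e.g. "p", "pre", "li"
    std::string className;
    std::string id;
    std::size_t textLength = 0;  // length of the text under this element
    std::vector<std::unique_ptr<ToyElement>> children;
};

// Case-insensitive (ASCII only) substring search, i.e. partial matching.
bool containsIgnoringCase(const std::string& haystack, const std::string& needle)
{
    auto it = std::search(haystack.begin(), haystack.end(), needle.begin(), needle.end(),
        [](unsigned char a, unsigned char b) { return std::tolower(a) == std::tolower(b); });
    return it != haystack.end();
}

// True if any word occurs inside the element's class or id.
bool matchesAny(const ToyElement& element, const std::vector<std::string>& words)
{
    for (const auto& word : words) {
        if (containsIgnoringCase(element.className, word) || containsIgnoringCase(element.id, word))
            return true;
    }
    return false;
}

// The 140-character threshold quoted in the review.
const std::size_t kParagraphLengthThreshold = 140;

// Sums sqrt(len - 140) over <p>/<pre> elements that are not under an <li> and
// that either match no "unlikely" word or match a "likely" word.
double score(const ToyElement& element, bool underListItem,
             const std::vector<std::string>& unlikely,
             const std::vector<std::string>& likely)
{
    double total = 0;
    const bool isParagraph = element.tag == "p" || element.tag == "pre";
    if (isParagraph && !underListItem
        && (!matchesAny(element, unlikely) || matchesAny(element, likely))
        && element.textLength >= kParagraphLengthThreshold) {
        total += std::sqrt(static_cast<double>(element.textLength - kParagraphLengthThreshold));
    }
    for (const auto& child : element.children)
        total += score(*child, underListItem || element.tag == "li", unlikely, likely);
    return total;
}
```

Because matching is by substring, an exact-match hash set cannot replace the loop over the word list, which is the trade-off raised in the reply above.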
Description was changed from

==========
Add feature extraction for distillability to Blink

BUG=509869
==========

to

==========
Add feature extraction for distillability to Blink

BUG=509869
TEST=webkit_unit_tests --gtest_filter=DocumentStatisticsCollectorTest.*
==========
Patchset #2 (id:20001) has been deleted
Thanks for your detailed feedback! This CL is in a better shape now. Could you take another look? https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:19: using namespace WTF; On 2015/10/23 04:59:14, esprehn wrote: > don't using namespace WTF Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:20: using namespace Unicode; On 2015/10/23 04:59:14, esprehn wrote: > why do you need this? Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:31: return root.textContent().stripWhiteSpace().length(); On 2015/10/23 18:56:35, wychen wrote: > On 2015/10/23 04:59:14, esprehn wrote: > > this is very expensive, you're creating a huge string of things inside the > > document, then allocating a big copy without start/end whitespace > > Would this be acceptable if it's rewritten so that it doesn't allocate the > string here? > Like how m_features.textContentLength is calculated, but with tweaks about > whitespace. Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:53: bool isVisible(Element& elem) On 2015/10/23 04:59:14, esprehn wrote: > element, don't abbreviate in blink Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:56: m_document.domWindow()->getComputedStyle(&elem, String()); On 2015/10/23 04:59:15, esprehn wrote: > don't call through the domWindow(), you should almost never need that in C++, > certainly not here. 
> > element->computedStyle() is all you need. Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:58: style->getPropertyValue("display") == "none" On 2015/10/23 04:59:14, esprehn wrote: > You want to use ComputedStyle and the API surface on that Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:82: void walk(Element& root, bool underLi = false) On 2015/10/23 04:59:15, esprehn wrote: > what is underLi? Added comment. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:90: m_features.textContentLength += text.length(); On 2015/10/23 04:59:14, esprehn wrote: > toText(node).length() Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:100: m_features.numAnchors++; On 2015/10/23 04:59:15, esprehn wrote: > anchorCount > formCount > > etc. > > don't abbreviate with numFoo Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:104: if (element.getAttribute("type").lower() == "text") { On 2015/10/23 04:59:14, esprehn wrote: > typeAttr, you also want to use equalIgnoringCase. > > I'm also pretty sure you want to cast to HTMLInputElement and use ->type() > instead Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:106: } else if (element.getAttribute("type").lower() == "pasword") { On 2015/10/23 04:59:14, esprehn wrote: > ditto Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:110: m_features.numPPRE++; On 2015/10/23 04:59:14, esprehn wrote: > Lets count these separately, then you can sum on the other side if you want Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:111: if (!underLi && isVisible(element) On 2015/10/23 04:59:14, esprehn wrote: > underListItem, also this isVisible thing doesn't work Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:114: if (len >= 140) { On 2015/10/23 05:17:30, esprehn wrote: > if you only care about this we can write something faster than early outs as > soon as it sees something >= 140 chars without allocating a buffer. > > this is potentially allocating a massive string that contains all the text in > the entire page (ex. <body class="body"> causes this). I've saturated the length so that we could early out. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:115: m_features.mozScore += sqrt(len - 140); On 2015/10/23 04:59:14, esprehn wrote: > why 140? this needs a comment in the code or a constant Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:127: for (const Node& node : NodeTraversal::childrenOf(head)) { On 2015/10/23 04:59:15, esprehn wrote: > again this doesn't handle shadow dom, but this is probably fine for the head Acknowledged. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:133: if ((element.getAttribute("name") == ("og:type")) || (element.getAttribute("property") == ("og:type"))) { On 2015/10/23 04:59:14, esprehn wrote: > toHTMLMetaElement and then use ->name(), also remove all the extra parens Done, but getAttribute("property") is still there. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:134: WTF::CString content = element.getAttribute("content").upper().utf8(); On 2015/10/23 04:59:14, esprehn wrote: > don't convert to CString, you almost never want that, you're causing many copies > on this line of code. Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:135: if ((content) == "ARTICLE") { On 2015/10/23 04:59:14, esprehn wrote: > You want to use ->content() and equalIgnoringCase Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:152: WebDistillabilityFeatures features({0}); On 2015/10/23 04:59:14, esprehn wrote: > Can we add a constructor for the struct instead and set the variable defaults? > we don't usually do {0} like this in blink. I'll use an initializer instead of adding a constructor. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:157: if (!document.hasFinishedParsing()) On 2015/10/23 04:59:15, esprehn wrote: > how does this ever get called when we're still parsing, should this be an > assert? Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:160: ASSERT(document.body()); On 2015/10/23 04:59:15, esprehn wrote: > this doesn't have to be true, you can remove the body from an html document, why > do you think this should exist here? Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:163: ExtractFeatureWalker walker(document, features); On 2015/10/23 04:59:15, esprehn wrote: > why do you need a walker at all instead of just static recursive functions? Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:164: walker.walk(); On 2015/10/23 04:59:14, esprehn wrote: > just use recursive functions Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:167: features.innerTextLength += innerTextLength(*document.body()); On 2015/10/23 04:59:14, esprehn wrote: > you can remove the body, also this doesn't have to be an HTMLDocument... you > want a null check this stuff I think. Done. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:198: document.addConsoleMessage(consoleMessage); On 2015/10/23 04:59:14, esprehn wrote: > just fprintf, no reason to console log your debug code I just wanted to quickly get message to adb log. https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.h (right): https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.h:9: class String; On 2015/10/23 04:59:15, esprehn wrote: > remove, you don't depend on this Done. 
https://codereview.chromium.org/1419033004/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.h:17: class DocumentStatisticsCollector { On 2015/10/23 04:59:15, esprehn wrote: > this needs tests. Done.
https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:28: unsigned trimmedTextContentLength(Element& root) Do pages usually have enough whitespace for this to matter in your training model? It seems like you can just use the toText(node).length() and adjust at scale. Doing this is pretty expensive. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:33: for (Node& node : NodeTraversal::inclusiveDescendantsOf(root)) { This doesn't understand shadow dom. ex. <p><content></content></p> maybe your code doesn't care though. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:59: TextIteratorAlgorithm<EditingStrategy> it(range.startPosition(), range.endPosition(), TextIteratorForInnerText); Why do you care about innerText length at all? That's just the length of textContent that's visible, you can figure that out yourself. The only reason to really look at innerText is if you care about visual order, and you don't (just length). Also I think you want to use TextIterator, not the Algorithm class directly. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:68: const blink::ComputedStyle* style = element.ensureComputedStyle(); this forces a style computation on elements that would otherwise not have them, this means you're possibly allocating ElementRareData and doing style resolve on lots of the page. You want to call computedStyle() and null check it https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:69: return !( run demorgans https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:80: for (const String& word: words) { missing space https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:86: if (UNLIKELY(element.hasID())) { remove UNLIKELY https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:87: const String& hay = element.getIdAttribute(); these already have a hasID() check in them, so I'd suggest removing the hasID() and hasClass() checks and just doing a single loop. const String& classes = element.getClassAttribute(); const String& id = element.getIdAttribute(); for (const auto& word : words) { if (classes.findIgnoringCase(word) != WTF::kNotFound || id.findIgnoringCase(word) != WTF::kNotFound) return true; } https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:88: for (const String& word: words) { missing space https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:98: void walk(Element& root, WebDistillabilityFeatures& features, bool underListItem = false) needs a better name. collectFeatures? https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:101: if (unlikelyCandidates.size() == 0) { isEmpty() https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:107: if (highlyLikelyCandidates.size() == 0) { isEmpty() https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:112: const unsigned kParagraphLengthThreshold = 140; why did you pick 140? Add a comment. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:131: if (equalIgnoringCase(input.type(), "text")) { input.type() == InputTypeNames::text https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:133: } else if (equalIgnoringCase(input.type(), "password")) { == InputTypeNames::password https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:143: && (!matchName(element, unlikelyCandidates) || matchName(element, highlyLikelyCandidates))) { matchAttributes? It's not really related to name at all https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:152: walk(element, features, element.hasTagName(liTag) || underListItem); this checks hasTagName(liTag) for every element, even though you know it's not an li in all the above cases. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:158: for (const Node& node : NodeTraversal::childrenOf(head)) { for (const Element* child = ElementTraversal::firstChild(*head); child; child = child = ElementTraversal::nextSibling(*child)) https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:165: if (meta.name() == "og:type" || element.getAttribute("property") == "og:type") { You want to declare static local AtomicString variables for this. DEFINE_STATIC_LOCAL(AtomicString, ogType, "og:type") DEFINE_STATIC_LOCAL(AtomicString, propertyAttr, "property") meta.name == ogType || element.getAttribute(propertyAttr) == ogType https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:192: // Next, traverse the Layout tree and collect statistics on innerText length. this needs to do document->updateLayout() so it's safe to use the text iterator. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:193: features.innerTextLength += innerTextLength(*document.body()); this really seems unnecessary, you can just collect the text content length of the visible nodes when doing the tree walk above. also = not += right? https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp (right): https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:5: * modification, are permitted provided that the following conditions are Use the modern short copyright. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:59: void setHtmlInnerHTML(const char*); const String& https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:72: document().documentElement()->setInnerHTML(String::fromUTF8(htmlContent), ASSERT_NO_EXCEPTION); from fromtUTF8 https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:73: document().view()->updateAllLifecyclePhases(); remove this, you don't need it. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/publ... File third_party/WebKit/public/platform/WebDistillability.h (right): https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/publ... third_party/WebKit/public/platform/WebDistillability.h:21: double mozScore; This needs a comment in the code or a link to what a mozScore is.
Patchset #3 (id:60001) has been deleted
For patch set 3, two distillability runs for https://en.wikipedia.org/wiki/Cat on Nexus 5 are:

Run 1:
openGraph: 0, elementCount: 9276, anchorCount: 2553, formCount: 1, textInputCount: 0, passwordInputCount: 0, pCount: 90, preCount: 0, innerTextLength: 98559, textContentLength: 116593, mozScore: 175.955, mozScoreAllSqrt: 189.737, mozScoreAllLinear: 6000
Elapsed time (ms): 3.304, openGraph time (ms): 0.0138283, innerText time (ms): 14.524

Run 2:
openGraph: 0, elementCount: 9318, anchorCount: 2562, formCount: 1, textInputCount: 0, passwordInputCount: 0, pCount: 90, preCount: 0, innerTextLength: 3815, textContentLength: 116864, mozScore: 175.955, mozScoreAllSqrt: 189.737, mozScoreAllLinear: 6000
Elapsed time (ms): 4.15921, openGraph time (ms): 0.015974, innerText time (ms): 0.833035

It seems updateLayout() + innerTextLength() dominates the running time.

https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:28: unsigned trimmedTextContentLength(Element& root)
On 2015/10/26 21:43:09, esprehn wrote:
> Do pages usually have enough whitespace for this to matter in your training
> model? It seems like you can just use the toText(node).length() and adjust at
> scale. Doing this is pretty expensive.

Trimming the whitespace should make this less noisy. I don't have quantitative results to say how useful this is, though. If I scan the ending backwards, the cost shouldn't be too high in normal cases, since the proportion of leading/trailing spaces should be low. I'll try not trimming and see if the accuracy degradation is acceptable.

https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:33: for (Node& node : NodeTraversal::inclusiveDescendantsOf(root)) { On 2015/10/26 21:43:08, esprehn wrote: > This doesn't understand shadow dom. ex. > > <p><content></content></p> > > maybe your code doesn't care though. The visibility model used here is the same as JavaScript as intended, before we can use non-JS features. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:59: TextIteratorAlgorithm<EditingStrategy> it(range.startPosition(), range.endPosition(), TextIteratorForInnerText); On 2015/10/26 21:43:09, esprehn wrote: > Why do you care about innerText length at all? That's just the length of > textContent that's visible, you can figure that out yourself. The only reason to > really look at innerText is if you care about visual order, and you don't (just > length). The length ratio between innerText and textContent is a good signal. It's even better with innerHTML. I'm not sure how to count visible textContent in the same pass manually and keep the functionality equivalent. If not done exactly the same, it might lead to some inconsistencies. > Also I think you want to use TextIterator, not the Algorithm class directly. Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:68: const blink::ComputedStyle* style = element.ensureComputedStyle(); On 2015/10/26 21:43:08, esprehn wrote: > this forces a style computation on elements that would otherwise not have them, > this means you're possibly allocating ElementRareData and doing style resolve on > lots of the page. > > You want to call computedStyle() and null check it This statistics collection happens right after the first layout after parsing and loading. It is highly likely that all the styles are computed and fresh, right? 
If computedStyle() is used instead, the result might be different from the JS version. Without updateAllLifecyclePhases() in the test, using computedStyle() failed the tests. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:69: return !( On 2015/10/26 21:43:09, esprehn wrote: > run demorgans Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:80: for (const String& word: words) { On 2015/10/26 21:43:09, esprehn wrote: > missing space Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:86: if (UNLIKELY(element.hasID())) { On 2015/10/26 21:43:09, esprehn wrote: > remove UNLIKELY Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:87: const String& hay = element.getIdAttribute(); On 2015/10/26 21:43:09, esprehn wrote: > these already have a hasID() check in them, so I'd suggest removing the hasID() > and hasClass() checks and just doing a single loop. > > const String& classes = element.getClassAttribute(); > const String& id = element.getIdAttribute(); > > for (const auto& word : words) { > if (classes.findIgnoringCase(word) != WTF::kNotFound || > id.findIgnoringCase(word) != WTF::kNotFound) > return true; > } Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:88: for (const String& word: words) { On 2015/10/26 21:43:08, esprehn wrote: > missing space Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:98: void walk(Element& root, WebDistillabilityFeatures& features, bool underListItem = false) On 2015/10/26 21:43:08, esprehn wrote: > needs a better name. collectFeatures? Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:101: if (unlikelyCandidates.size() == 0) { On 2015/10/26 21:43:09, esprehn wrote: > isEmpty() Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:107: if (highlyLikelyCandidates.size() == 0) { On 2015/10/26 21:43:08, esprehn wrote: > isEmpty() Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:112: const unsigned kParagraphLengthThreshold = 140; On 2015/10/26 21:43:08, esprehn wrote: > why did you pick 140? Add a comment. Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:131: if (equalIgnoringCase(input.type(), "text")) { On 2015/10/26 21:43:09, esprehn wrote: > input.type() == InputTypeNames::text Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:133: } else if (equalIgnoringCase(input.type(), "password")) { On 2015/10/26 21:43:08, esprehn wrote: > == InputTypeNames::password Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:143: && (!matchName(element, unlikelyCandidates) || matchName(element, highlyLikelyCandidates))) { On 2015/10/26 21:43:09, esprehn wrote: > matchAttributes? It's not really related to name at all Done. 
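For readers following along outside Blink, the single-loop class/id matching suggested above can be sketched in plain C++. This is only an illustration: std::string stands in for WTF::String, and containsIgnoringCase() is a hypothetical helper playing the role of findIgnoringCase(); neither is the code in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for WTF::String::findIgnoringCase(): returns true if
// `needle` occurs in `haystack`, comparing ASCII characters case-insensitively.
bool containsIgnoringCase(const std::string& haystack, const std::string& needle)
{
    auto equalChars = [](unsigned char a, unsigned char b) {
        return std::tolower(a) == std::tolower(b);
    };
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(), equalChars) != haystack.end();
}

// One loop over the word list, checking both attributes per word. An absent
// attribute is modeled as an empty string, so no separate hasID()/hasClass()
// checks are needed.
bool matchAttributes(const std::string& classes, const std::string& id,
                     const std::vector<std::string>& words)
{
    for (const auto& word : words) {
        if (containsIgnoringCase(classes, word) || containsIgnoringCase(id, word))
            return true;
    }
    return false;
}
```

With a word list such as {"comment", "sidebar"}, an element with class "post-Comment" matches, while one with class "article-body" and id "main" does not.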
https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:152: walk(element, features, element.hasTagName(liTag) || underListItem); On 2015/10/26 21:43:09, esprehn wrote: > this checks hasTagName(liTag) for every element, even though you know it's not > an li in all the above cases. Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:158: for (const Node& node : NodeTraversal::childrenOf(head)) { On 2015/10/26 21:43:09, esprehn wrote: > for (const Element* child = ElementTraversal::firstChild(*head); child; child = > child = ElementTraversal::nextSibling(*child)) Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:165: if (meta.name() == "og:type" || element.getAttribute("property") == "og:type") { On 2015/10/26 21:43:09, esprehn wrote: > You want to declare static local AtomicString variables for this. > > DEFINE_STATIC_LOCAL(AtomicString, ogType, "og:type") > DEFINE_STATIC_LOCAL(AtomicString, propertyAttr, "property") > > meta.name == ogType || element.getAttribute(propertyAttr) == ogType Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:192: // Next, traverse the Layout tree and collect statistics on innerText length. On 2015/10/26 21:43:08, esprehn wrote: > this needs to do document->updateLayout() so it's safe to use the text iterator. Done. I'm curious when should updateLayoutIgnorePendingStylesheets() be used though. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... 
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:193: features.innerTextLength += innerTextLength(*document.body()); On 2015/10/26 21:43:09, esprehn wrote: > this really seems unnecessary, you can just collect the text content length of > the visible nodes when doing the tree walk above. There seems to be much more than visibility in TextIterator. This could be a good cheap feature candidate after we support non-JS features though. > also = not += right? Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp (right): https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:5: * modification, are permitted provided that the following conditions are On 2015/10/26 21:43:09, esprehn wrote: > Use the modern short copyright. Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:59: void setHtmlInnerHTML(const char*); On 2015/10/26 21:43:09, esprehn wrote: > const String& Done. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:72: document().documentElement()->setInnerHTML(String::fromUTF8(htmlContent), ASSERT_NO_EXCEPTION); On 2015/10/26 21:43:09, esprehn wrote: > from fromtUTF8 I'm not quite sure I understand this comment. For this particular unit test, UTF8 conversion is not necessary, so I just deleted it. https://codereview.chromium.org/1419033004/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:73: document().view()->updateAllLifecyclePhases(); On 2015/10/26 21:43:09, esprehn wrote: > remove this, you don't need it. 
Without this line, there's an assertion error: ASSERTION FAILED: !node.needsDistributionRecalc() ../../third_party/WebKit/Source/core/dom/shadow/ComposedTreeTraversal.h(128) : static void blink::ComposedTreeTraversal::assertPrecondition(const blink::Node &) This is caused by ensureComputedStyle(). Without ensureComputedStyle(), CountScore couldn't pass.
https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:228: document.updateLayout();
Instead of updateLayout, just ASSERT to ensure the function is never called when layout is dirty -- since you'll be calling this only out of didMeaningfulLayout hook. This may seem counter to Elliott's earlier advice, but it's really the same thing :)
innerText length is essentially the same as the textContent length if you only accumulate visible text. Why do you care about the ratio? What are you trying to detect here? Actually using the text iterator forces us to walk the entire document in visual order, but you don't actually care about that. I also don't understand what you mean by "non js features", we have no plan to expose any non-JS features. Distiller should only be using web exposed APIs (or the conceptual equivalents). https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: const blink::ComputedStyle* style = element.ensureComputedStyle(); don't use ensureComputedStyle(), it forces us to compute the style for display: none nodes and allocate rare data. You just want to use computedStyle() and null check it, that won't force any new style computations. https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:182: DEFINE_STATIC_LOCAL(AtomicString, propertyAttr, ("property")); move these to the top of the function, not inside the loop. https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:209: double startTime = WTF::currentTime(); add tracing macros, remove your time based stuff. :)
https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: const blink::ComputedStyle* style = element.ensureComputedStyle(); On 2015/10/28 00:07:28, esprehn wrote: > don't use ensureComputedStyle(), it forces us to compute the style for display: > none nodes and allocate rare data. You just want to use computedStyle() and null > check it, that won't force any new style computations. Quick question. I want to use getBoundingClientRect().{height, width}==0 to replace display==none checking. Is this cheap enough to be used here?
https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: const blink::ComputedStyle* style = element.ensureComputedStyle(); On 2015/10/28 at 00:25:45, wychen wrote: > On 2015/10/28 00:07:28, esprehn wrote: > > don't use ensureComputedStyle(), it forces us to compute the style for display: > > none nodes and allocate rare data. You just want to use computedStyle() and null > > check it, that won't force any new style computations. > > Quick question. I want to use getBoundingClientRect().{height, width}==0 to replace display==none checking. Is this cheap enough to be used here? That'll be more expensive than just checking the display, it requires allocating a ClientRect object, why do you want to do that instead of just checking display?
https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: const blink::ComputedStyle* style = element.ensureComputedStyle(); On 2015/10/28 00:37:17, esprehn wrote: > On 2015/10/28 at 00:25:45, wychen wrote: > > On 2015/10/28 00:07:28, esprehn wrote: > > > don't use ensureComputedStyle(), it forces us to compute the style for > display: > > > none nodes and allocate rare data. You just want to use computedStyle() and > null > > > check it, that won't force any new style computations. > > > > Quick question. I want to use getBoundingClientRect().{height, width}==0 to > replace display==none checking. Is this cheap enough to be used here? > > That'll be more expensive than just checking the display, it requires allocating > a ClientRect object, why do you want to do that instead of just checking > display? Since display is not inherited, the current checking is not very accurate. getBoundingClientRect() is available in JS, and works for the child elements of a display=='none' element. I'm trying to find a way that is both more accurate, and available in JS. Are there better alternatives?
https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right): https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: const blink::ComputedStyle* style = element.ensureComputedStyle(); On 2015/10/28 at 00:49:46, wychen wrote: > ... > > > > That'll be more expensive than just checking the display, it requires allocating > > a ClientRect object, why do you want to do that instead of just checking > > display? > > Since display is not inherited, the current checking is not very accurate. getBoundingClientRect() is available in JS, and works for the child elements of a display=='none' element. I'm trying to find a way that is both more accurate, and available in JS. Are there better alternatives? var computedStyle = getComputedStyle(element); !computedStyle.width && !computedStyle.height I'm not sure if that's faster than getBoundingClientRect() though. For now I'd suggest just null checking the return value of computedStyle() until we figure out the right web exposed API here.
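A minimal sketch of the "call computedStyle() and null check it" advice, using hypothetical stand-in types (StyleSketch, ElementSketch) rather than Blink's real Element and ComputedStyle classes. The particular properties tested (display, visibility, opacity) are an assumption for illustration; the thread does not spell out the exact conditions in isVisible().

```cpp
#include <cassert>

// Hypothetical stand-ins, not Blink's real classes.
enum class Display { Block, Inline, None };
enum class Visibility { Visible, Hidden };

struct StyleSketch {
    Display display = Display::Block;
    Visibility visibility = Visibility::Visible;
    float opacity = 1.0f;
};

struct ElementSketch {
    // Null when no style has been computed for this element, which models
    // the case the reviewers describe for elements that were never styled.
    const StyleSketch* style = nullptr;
    const StyleSketch* computedStyle() const { return style; }
};

// Treats "no computed style" as "not visible" instead of forcing a style
// computation, and states the visible case as a plain conjunction.
bool isVisible(const ElementSketch& element)
{
    const StyleSketch* style = element.computedStyle();
    if (!style)
        return false;
    return style->display != Display::None
        && style->visibility != Visibility::Hidden
        && style->opacity != 0;
}
```

The early return is the whole point: the check reads whatever style already exists and never triggers new work for elements that have none.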
Patchset #4 (id:100001) has been deleted
innerText is removed as well. With these changes, the typical cost should be <5ms for most pages on N5. Does this look good except for the debugging messages?

https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:65: unsigned innerTextLength(Element& root)
I'll remove innerTextLength() for this version.

https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:78: const blink::ComputedStyle* style = element.ensureComputedStyle();
On 2015/10/28 00:57:19, esprehn wrote:
> var computedStyle = getComputedStyle(element);
> !computedStyle.width && !computedStyle.height
>
> I'm not sure if that's faster than getBoundingClientRect() though.

I just tried that. If the parent is display==none, width and height become "auto". I guess getBoundingClientRect() is still necessary to catch this case.

> For now I'd suggest just null checking the return value of computedStyle() until
> we figure out the right web exposed API here.

Done.

https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:182: DEFINE_STATIC_LOCAL(AtomicString, propertyAttr, ("property"));
On 2015/10/28 00:07:28, esprehn wrote:
> move these to the top of the function, not inside the loop.

Done.

https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:209: double startTime = WTF::currentTime();
On 2015/10/28 00:07:28, esprehn wrote:
> add tracing macros, remove your time based stuff. :)

Thanks for the tip. I gave it a try, but it involved more GUI than I'd like.
https://codereview.chromium.org/1419033004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:228: document.updateLayout();
On 2015/10/27 23:59:19, dglazkov wrote:
> Instead of updateLayout, just ASSERT to ensure the function is never called when
> layout is dirty -- since you'll be calling this only out of didMeaningfulLayout
> hook. This may seem counter to Elliott's earlier advice, but it's really the
> same thing :)

Done.

https://codereview.chromium.org/1419033004/diff/120001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/120001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:56: if (length > kTextContentLengthSaturation) {
With saturation, the total cost of trimmedTextContentLength in a page should not be too high.
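To illustrate why saturation bounds the cost, here is a self-contained sketch of a text-length walk that stops as soon as the running total passes a cap. NodeSketch is a hypothetical stand-in for a DOM node, and the cap value of 1000 is an assumption chosen for the example; only the name kTextContentLengthSaturation comes from the patch, and whether the patch returns the cap or the raw overshoot is not stated in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node: text nodes carry text, elements
// carry children.
struct NodeSketch {
    std::string text;
    std::vector<NodeSketch> children;
};

// Assumed value for illustration only.
const unsigned kTextContentLengthSaturation = 1000;

// Sums text lengths over the subtree, but gives up once the total exceeds
// the cap, so a huge subtree costs no more than a moderately sized one.
unsigned textContentLengthSaturated(const NodeSketch& root)
{
    unsigned length = 0;
    std::vector<const NodeSketch*> stack{&root};
    while (!stack.empty()) {
        const NodeSketch* node = stack.back();
        stack.pop_back();
        length += static_cast<unsigned>(node->text.size());
        if (length > kTextContentLengthSaturation)
            return kTextContentLengthSaturation;
        for (const NodeSketch& child : node->children)
            stack.push_back(&child);
    }
    return length;
}
```

Because each call exits early once the cap is passed, calling it on many nested elements no longer rescans arbitrarily large amounts of text per call.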
The CQ bit was checked by mdjones@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1419033004/160001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1419033004/160001
Patchset #7 (id:180001) has been deleted
Not trimming turned out to be not too bad, so I've changed that. Could you take another look? Thanks!
I'd appreciate it if you could let me know what I should modify before landing. Thanks a lot!
On 2015/11/02 at 17:43:33, wychen wrote: > The visibility model used here is the same as JavaScript as intended, before we can use non-JS features. What does this comment mean?
On 2015/11/02 19:00:46, dglazkov wrote: > On 2015/11/02 at 17:43:33, wychen wrote: > > The visibility model used here is the same as JavaScript as intended, before > we can use non-JS features. > > What does this comment mean? Oh. I thought JavaScript couldn't look deeply into shadow DOM. Later on I realized that it is actually possible on modern Chrome versions. Since our JavaScript feature extraction code doesn't look into shadow DOM, the native counterpart should follow.
On 2015/11/02 at 19:24:40, wychen wrote: > On 2015/11/02 19:00:46, dglazkov wrote: > > On 2015/11/02 at 17:43:33, wychen wrote: > > > The visibility model used here is the same as JavaScript as intended, before > > we can use non-JS features. > > > > What does this comment mean? > > Oh. I thought JavaScript couldn't look deeply into shadow DOM. Later on I realized that it is actually possible on modern Chrome versions. > > Since our JavaScript feature extraction code doesn't look into shadow DOM, the native counterpart should follow. I think the JS feature extraction is just wrong. Let's not blindly follow mistakes made by our parents :)
On 2015/11/02 19:25:38, dglazkov wrote:
> I think the JS feature extraction is just wrong. Let's not blindly follow
> mistakes made by our parents :)

Thanks for spotting this!

I wouldn't call it wrong though. There are many ways to tune the heuristics to make the extracted features more useful in the model, but some might not be worth the effort. For example, adding innerHTML would improve the score, but the cost is way too large; the same goes for innerText, or even trimming the textContent.

Given the low probability that shadow DOM is used in <p> elements in a long-form article, I guess it would not affect the result too much.

Another consideration for not looking into shadow DOM is for Bling to reuse it.

Besides this one, is there anything I should work on for this CL?
On 2015/11/02 at 23:57:21, wychen wrote:
> On 2015/11/02 19:25:38, dglazkov wrote:
> > I think the JS feature extraction is just wrong. Let's not blindly follow
> > mistakes made by our parents :)
>
> Thanks for spotting this!
>
> I wouldn't call it wrong though. There are many ways to tune the heuristics
> to make the extracted features more useful in the model, but some might not
> worth the effort. For example, adding innerHTML would improve the score,
> but the cost is way too large, or innerText, or even trimming the textContent.
>
> Given the probability a shadow DOM is used in <p> elements in a long-form
> article, I guess it might not affect the result too much.

It's the opposite case that I am worried about -- when the text is not actually visible because of Shadow DOM. Luckily, your isVisible check takes care of that.

> Another consideration for not looking into shadow DOM is for Bling to reuse
> it.

Blink will have to figure this out, too.

> Besides this one, is there anything I should work on for this CL?

Left a few more comments.
https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:34: // This skips shadow dom intentionally.
Please explain why in the comment.

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:50: if (!style) {
Don't need braces here.

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:131: unsigned length = textContentLengthSaturated(element);
Is this an O(NxM) built in here?

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:186: if (!document.body() || !document.head())
Both of these are traversals, so might be good to stash these away if you're using them later.
Also, can we add performance metrics around this code, so that we can track performance using UMA and deep/slow reports?
UMA is also added. PTAL.

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:34: // This skips shadow dom intentionally.
On 2015/11/03 04:45:37, dglazkov wrote:
> Please explain why in the comment.

Done.

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:50: if (!style) {
On 2015/11/03 04:45:37, dglazkov wrote:
> Don't need braces here.

Done.

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:131: unsigned length = textContentLengthSaturated(element);
On 2015/11/03 04:45:37, dglazkov wrote:
> Is this an O(NxM) built in here?

I'm not quite sure what you meant here. textContentLengthSaturated() should be O(subtree) with an early termination. If most of the child elements contribute to textContent, it should be more like O(1).

As for how many times textContentLengthSaturated() can be called, it should be the number of <p> elements in the page, with an early termination. Again, in normal cases it should be more like O(1).

The worst case here would be lots of <p> elements, with lots of children that don't contribute to textContent.

https://codereview.chromium.org/1419033004/diff/210001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:186: if (!document.body() || !document.head())
On 2015/11/03 04:45:37, dglazkov wrote:
> Both of these are traversals, so might be good to stash these away if you're
> using them later.

Done.
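A minimal sketch of the early-termination argument above, using a toy tree rather than Blink's DOM classes. `ToyNode`, `textLengthSaturated`, and the cap of 1000 are assumptions made for illustration only; the actual saturation value and traversal used in the CL are not shown in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is not Blink's Node class.
struct ToyNode {
    std::string text;               // non-empty only for text-like nodes
    std::vector<ToyNode> children;  // child nodes in document order
};

// Hypothetical cap; the real saturation limit is an assumption here.
constexpr unsigned kToySaturationLimit = 1000;

// Sums text lengths in document order and stops descending as soon as the
// running total reaches `limit`. `visited` counts the nodes actually touched,
// which is what the complexity argument is about.
unsigned textLengthSaturated(const ToyNode& node, unsigned& visited,
                             unsigned limit = kToySaturationLimit)
{
    ++visited;
    unsigned length = static_cast<unsigned>(node.text.size());
    for (const ToyNode& child : node.children) {
        if (length >= limit)
            break;  // early termination: the remaining siblings are skipped
        length += textLengthSaturated(child, visited, limit - length);
    }
    return length < limit ? length : limit;
}
```

When the first few children already carry enough text, only those children are visited, which is the "more like O(1)" case. The worst case described above corresponds to many children with empty `text`, where every node is visited before the cap is reached.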
https://codereview.chromium.org/1419033004/diff/120001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/120001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:71: ASSERT(style->display() != NONE);
this isn't true, you can still have a style and be display: none (ex. SVG does it). You want to both null check and look at the display().

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:10: #include "core/css/CSSComputedStyleDeclaration.h"
don't need this.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:13: #include "core/editing/iterators/TextIterator.h"
don't need this.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:21: #include "wtf/text/StringImpl.h"
you don't need StringImpl, StringBuilder

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:35: // This skips shadow dom intentionally, to match the JavaScript implementation.
Why?

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:53: ASSERT(style->display() != NONE);
this assert is wrong, you can have a style and be display() == NONE, you want to check it too.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:108: for (auto word : {"and", "article", "body", "column", "main", "shadow"}) {
I'd wrap this one too.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:122: features.textContentLength += toText(node).length();
this is going to add the length of every inline <script> or <style> too which seems bad, I don't think you want to do that.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:137: if (equalIgnoringCase(input.type(), InputTypeNames::text)) {
ditto == InputTypeNames::text

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:139: } else if (equalIgnoringCase(input.type(), InputTypeNames::password)) {
this is always lowercase, you can just do == ::password

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:152: && isVisible(element)
this is a crazy set of conditions, in blink we try not to do this. Instead we write a helper function with lots of early returns that ends in return true.

bool isFoo(a, b, c) {
    if (!underListItem)
        return false;
    ...
    return true;
}

that avoids this crazy nesting of && and ||.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:209:
TRACE_EVENT0("DocumentStatisticsCollector::collectStatistics")

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:210: HTMLElement* body = document.body();
needs a trace macro
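The early-return helper suggested above can be sketched roughly as follows. The `ToyCandidate` struct and its four flags are hypothetical placeholders chosen for illustration; they are not the actual conditions or signature used in DocumentStatisticsCollector.cpp.

```cpp
#include <cassert>

// Hypothetical inputs; the real code inspects an Element and a features struct.
struct ToyCandidate {
    bool visible;
    bool scoreSaturated;
    bool matchesUnlikely;
    bool matchesHighlyLikely;
};

// The style being discouraged: one expression mixing && and ||.
bool isGoodForScoringNested(const ToyCandidate& c)
{
    return c.visible
        && !c.scoreSaturated
        && (!c.matchesUnlikely || c.matchesHighlyLikely);
}

// The suggested style: one early return per disqualifying condition,
// ending in an unconditional "return true".
bool isGoodForScoring(const ToyCandidate& c)
{
    if (!c.visible)
        return false;
    if (c.scoreSaturated)
        return false;
    if (c.matchesUnlikely && !c.matchesHighlyLikely)
        return false;
    return true;
}
```

Both functions compute the same predicate; the second form keeps each disqualifying condition on its own line, which makes later additions and removals easier to review.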
https://codereview.chromium.org/1419033004/diff/120001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/120001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:71: ASSERT(style->display() != NONE);
On 2015/11/03 07:45:10, esprehn wrote:
> this isn't true, you can still have a style and be display: none (ex. SVG does
> it). You want to both null check and look at the display().

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:10: #include "core/css/CSSComputedStyleDeclaration.h"
On 2015/11/03 07:45:10, esprehn wrote:
> don't need this.

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:13: #include "core/editing/iterators/TextIterator.h"
On 2015/11/03 07:45:10, esprehn wrote:
> don't need this.

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:21: #include "wtf/text/StringImpl.h"
On 2015/11/03 07:45:10, esprehn wrote:
> you don't need StringImpl, StringBuilder

Done. Just curious, did you use an IDE to check for unnecessary headers, or do you just know?

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:35: // This skips shadow dom intentionally, to match the JavaScript implementation.
On 2015/11/03 07:45:10, esprehn wrote:
> Why?

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:53: ASSERT(style->display() != NONE);
On 2015/11/03 07:45:10, esprehn wrote:
> this assert is wrong, you can have a style and be display() == NONE, you want to
> check it too.

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:108: for (auto word : {"and", "article", "body", "column", "main", "shadow"}) {
On 2015/11/03 07:45:10, esprehn wrote:
> I'd wrap this one too.

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:122: features.textContentLength += toText(node).length();
On 2015/11/03 07:45:10, esprehn wrote:
> this is going to add the length of every inline <script> or <style> too which
> seems bad, I don't think you want to do that.

It is possible that the innerTextLength/textContentLength ratio is useful precisely because it catches this case: the model can learn from text that makes no visual difference.

Without innerTextLength, textContentLength by itself is not really useful, so I've removed it as well.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:137: if (equalIgnoringCase(input.type(), InputTypeNames::text)) {
On 2015/11/03 07:45:10, esprehn wrote:
> ditto == InputTypeNames::text

Good catch! I guess AtomicString == is faster.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:139: } else if (equalIgnoringCase(input.type(), InputTypeNames::password)) {
On 2015/11/03 07:45:10, esprehn wrote:
> this is always lowercase, you can just do == ::password

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:152: && isVisible(element)
On 2015/11/03 07:45:10, esprehn wrote:
> this is a crazy set of conditions, in blink we try not to do this. Instead we
> write a helper function with lots of early returns that ends in return true.
>
> bool isFoo(a, b, c) {
>     if (!underListItem)
>         return false;
>     ...
>     return true;
> }
>
> that avoids this crazy nesting of && and ||.

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:209:
On 2015/11/03 07:45:10, esprehn wrote:
> TRACE_EVENT0("DocumentStatisticsCollector::collectStatistics")

Done.

https://codereview.chromium.org/1419033004/diff/250001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:210: HTMLElement* body = document.body();
On 2015/11/03 07:45:10, esprehn wrote:
> needs a trace macro

I don't understand this comment. Did you mean DEFINE_TRACE?
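The guess above that `==` on an AtomicString is faster rests on string interning: equal strings share one stored copy, so equality can be decided by comparing pointers. The following is only a toy illustration of that idea; `ToyAtom` and its table are hypothetical and do not reflect WTF's actual AtomicString implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Toy interned string: every distinct value is stored once in a shared table.
class ToyAtom {
public:
    explicit ToyAtom(const std::string& value) : m_impl(intern(value)) {}

    // Equality is a pointer comparison, independent of string length.
    bool operator==(const ToyAtom& other) const { return m_impl == other.m_impl; }

    const std::string& string() const { return *m_impl; }

private:
    static const std::string* intern(const std::string& value)
    {
        // std::unordered_set keeps element addresses stable across rehashes,
        // so handing out pointers to its elements is safe here.
        static std::unordered_set<std::string> table;
        return &*table.insert(value).first;
    }

    const std::string* m_impl;
};
```

A case-insensitive comparison has to walk both strings, whereas the interned comparison does not; that only works when the value is already normalized, which matches the remark that the input type is always lowercase.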
The CQ bit was checked by mdjones@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1419033004/270001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1419033004/270001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
Patchset #12 (id:290001) has been deleted
So close!

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:60: const ComputedStyle* style = element.computedStyle();
you need

ASSERT(!element.document().needsLayoutTreeUpdate())

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:138: if (matchAttributes(element, unlikelyCandidates) && !matchAttributes(element, highlyLikelyCandidates))
I'd wrap at the && like you did above

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:170: if (isGoodForScoring(underListItem, features, element)) {
I'd probably move the underListItem check out so this becomes:

if (underListItem && isGoodForScoring(features, element)) {

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:212: return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
in blink we'd usually write:

if (FrameHost* frameHost = document.frameHost())
    return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
return false;

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:235: features.isMobileFriendly = true;
so if it's mobile friendly we don't need to collect stats?

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:241: // Traverse the DOM tree and collect statistics.
you either need to call updateLayoutTreeIfNeeded() here or you need the ASSERT() I mentioned above.
https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:60: const ComputedStyle* style = element.computedStyle();
On 2015/11/05 01:21:59, esprehn wrote:
> you need
>
> ASSERT(!element.document().needsLayoutTreeUpdate())

Skipped.

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:138: if (matchAttributes(element, unlikelyCandidates) && !matchAttributes(element, highlyLikelyCandidates))
On 2015/11/05 01:21:58, esprehn wrote:
> I'd wrap at the && like you did above

Done.

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:170: if (isGoodForScoring(underListItem, features, element)) {
On 2015/11/05 01:21:58, esprehn wrote:
> I'd probably move the underListItem check out so this becomes:
>
> if (underListItem && isGoodForScoring(features, element)) {

Done.

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:212: return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
On 2015/11/05 01:21:59, esprehn wrote:
> in blink we'd usually write:
>
> if (FrameHost* frameHost = document.frameHost())
>     return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
> return false;

Done. Is this style for performance? Like what LIKELY() would do?

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:235: features.isMobileFriendly = true;
On 2015/11/05 01:21:58, esprehn wrote:
> so if it's mobile friendly we don't need to collect stats?
Yes. We currently only trigger Reader Mode on non-mobile-friendly pages.

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:241: // Traverse the DOM tree and collect statistics.
On 2015/11/05 01:21:58, esprehn wrote:
> you either need to call updateLayoutTreeIfNeeded() here or you need the ASSERT()
> I mentioned assert above.

I'll skip the assertion above. It might be slightly faster this way.
I'll use updateLayoutTreeIfNeeded(), just to be more tolerant.
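The two options discussed here (assert that style is already clean, or update it on entry) can be contrasted with a toy model. `ToyStyleDocument`, `collectStrict`, and `collectTolerant` are hypothetical names for illustration; only the method names `needsLayoutTreeUpdate()` and `updateLayoutTreeIfNeeded()` are taken from the thread, and their behavior here is a simplified assumption.

```cpp
#include <cassert>

// Toy document that tracks whether computed style is stale.
struct ToyStyleDocument {
    bool styleDirty = true;
    int recalcCount = 0;

    bool needsLayoutTreeUpdate() const { return styleDirty; }

    // Does work only when style is stale, so repeated calls are cheap.
    void updateLayoutTreeIfNeeded()
    {
        if (!styleDirty)
            return;
        ++recalcCount;
        styleDirty = false;
    }
};

// Strict variant: the caller must have brought style up to date already.
void collectStrict(const ToyStyleDocument& document)
{
    assert(!document.needsLayoutTreeUpdate());
    // ... read computed styles here ...
}

// Tolerant variant: brings style up to date itself before reading it.
void collectTolerant(ToyStyleDocument& document)
{
    document.updateLayoutTreeIfNeeded();
    // ... read computed styles here ...
}
```

The tolerant variant is safe to call from any state, at the cost of possibly triggering a recalculation; the strict variant pushes that responsibility onto callers and fails loudly in debug builds if they forget.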
lgtm, dglazkov@ look good to you?

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:212: return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
Not for performance, it just makes the scope of your object clear and it's fewer lines of code.

https://codereview.chromium.org/1419033004/diff/350001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp (right):

https://codereview.chromium.org/1419033004/diff/350001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:55: document().view()->updateAllLifecyclePhases();
you can remove this if you do that.
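The scoping point can be shown with a self-contained sketch. The `Toy*` types below are hypothetical stand-ins for Document, FrameHost, and the visual viewport, and `isMobileFriendlyViewport` is an invented name; only the shape of the `if` statement mirrors the snippet quoted in the review.

```cpp
#include <cassert>

// Hypothetical stand-ins; these are not Blink classes.
struct ToyViewport {
    bool disableWorkarounds;
    bool shouldDisableDesktopWorkarounds() const { return disableWorkarounds; }
};

struct ToyFrameHost {
    ToyViewport viewport;
    const ToyViewport& visualViewport() const { return viewport; }
};

struct ToyDocument {
    ToyFrameHost* host;
    ToyFrameHost* frameHost() const { return host; }
};

bool isMobileFriendlyViewport(const ToyDocument& document)
{
    // frameHost is declared in the condition, so it is visible only inside
    // the if statement and cannot be dereferenced by mistake afterwards.
    if (ToyFrameHost* frameHost = document.frameHost())
        return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
    return false;
}
```

Declaring the pointer in the condition combines the null check and the declaration in one place, which is the "scope is clear, fewer lines" benefit mentioned above.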
https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp (right):

https://codereview.chromium.org/1419033004/diff/330001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollector.cpp:212: return frameHost->visualViewport().shouldDisableDesktopWorkarounds();
On 2015/11/05 01:54:17, esprehn wrote:
> Not for performance, it just makes the scope of your object clear and it's fewer
> lines of code.

I see. Thanks

https://codereview.chromium.org/1419033004/diff/350001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp (right):

https://codereview.chromium.org/1419033004/diff/350001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/dom/DocumentStatisticsCollectorTest.cpp:55: document().view()->updateAllLifecyclePhases();
On 2015/11/05 01:54:17, esprehn wrote:
> you can remove this if you do that.

Right! I forgot to update this one.
lgtm
The CQ bit was checked by wychen@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1419033004/370001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1419033004/370001
wychen@chromium.org changed reviewers: + jwd@chromium.org
jwd@, could you take a look at the UMA xml? Thanks!
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
lgtm
The CQ bit was checked by wychen@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from esprehn@chromium.org Link to the patchset: https://codereview.chromium.org/1419033004/#ps370001 (title: "stricter test")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1419033004/370001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1419033004/370001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: ios_rel_device_ninja on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios_rel_device_ni...)
The CQ bit was checked by wychen@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1419033004/370001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1419033004/370001
Message was sent while issue was closed.
Committed patchset #15 (id:370001)
Message was sent while issue was closed.
Patchset 15 (id:??) landed as https://crrev.com/db4d18afb53ef9ac67a03edefa2bbbafe50723a7 Cr-Commit-Position: refs/heads/master@{#359158}