Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query idea: 3 longest words in title of works with no main subject #1118

Open
Daniel-Mietchen opened this issue Jan 12, 2019 · 4 comments

Comments

Projects
None yet
1 participant
@Daniel-Mietchen
Copy link
Owner

commented Jan 12, 2019

The hope is that this could help with topic tagging.

Started to work on this last night and posted at
https://www.wikidata.org/w/index.php?title=Wikidata:Request_a_query&oldid=832978769#Find_longest_substrings_in_a_title but still don't have a real solution.

The partial solution of sorting works by the length of their first or second word, however, is already there, and it is useful for identifying topics that need attention.

@Daniel-Mietchen

This comment has been minimized.

Copy link
Owner Author

commented Jan 12, 2019

@Daniel-Mietchen

This comment has been minimized.

Copy link
Owner Author

commented Jan 13, 2019

Also from StackOverflow is a Jupyter notebook that does the per-string part (in Python): https://paws-public.wmflabs.org/paws-public/User:Luitzen/Stack/Split%20titles.ipynb?kernel_name=python3 (archived)

@Daniel-Mietchen

This comment has been minimized.

Copy link
Owner Author

commented Feb 21, 2019

Based on this query, here is an adaptation for papers by a given author:

SELECT (SAMPLE(DISTINCT ?x) AS ?item) ?w (COUNT(DISTINCT ?x) AS ?c) (STRLEN(?w) AS ?l) WHERE {
  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x schema:dateModified ?date_modified ;
#         wdt:P921 wd:Q202864 ;
         wdt:P50 wd:Q46155812 ;
         wdt:P1476 ?title.
      BIND (now() - ?date_modified as ?date_range)
      FILTER(STRLEN(?title) >= 6)
    }
    LIMIT 10000
  }
  FILTER NOT EXISTS { ?x wdt:P921 ?topic .}
  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  BIND(REPLACE(STRAFTER(?ltitle, ?w3), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w4)
  BIND(REPLACE(STRAFTER(?ltitle, ?w4), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w5)
  BIND(REPLACE(STRAFTER(?ltitle, ?w5), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w6)
  BIND(REPLACE(STRAFTER(?ltitle, ?w6), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w7)
  BIND(REPLACE(STRAFTER(?ltitle, ?w7), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w8)
  BIND(REPLACE(STRAFTER(?ltitle, ?w8), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w9)
  BIND(REPLACE(STRAFTER(?ltitle, ?w9), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w10)
  VALUES ?w_ { 1 2 3 4 5 6 7 8 9 10}
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, IF(?w_ = 3, ?w3, IF(?w_ = 4, ?w4, IF(?w_ = 5, ?w5, IF(?w_ = 6, ?w6, IF(?w_ = 7, ?w7, 
          IF(?w_ = 8, ?w8, IF(?w_ = 9, ?w9, ?w10))))))))) AS ?w)
  FILTER(REGEX(?w, "^\\w+$")) # since ?w may evaluate to an empty string, e.g. for one-word titles
}
GROUP BY ?item ?w
ORDER BY DESC(?l) DESC(?c)
LIMIT 2000
@Daniel-Mietchen

This comment has been minimized.

Copy link
Owner Author

commented Jul 6, 2019

A slight variation:

SELECT (SAMPLE(DISTINCT ?x) AS ?item) ?w (COUNT(DISTINCT ?x) AS ?c) (STRLEN(?w) AS ?l) WHERE {
  {
    SELECT DISTINCT ?x ?title WHERE {
      ?x schema:dateModified ?date_modified ;
#         wdt:P921 wd:Q202864 ;
         wdt:P50 wd:Q46155812 ;
         wdt:P1476 ?title.
      BIND (now() - ?date_modified as ?date_range)
      FILTER(STRLEN(?title) >= 5)
    }
    LIMIT 10000
  }
  FILTER NOT EXISTS { ?x wdt:P921 ?topic .}
  BIND(LCASE(?title) AS ?ltitle)
  BIND(REPLACE(?ltitle, "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w1)
  BIND(REPLACE(STRAFTER(?ltitle, ?w1), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w2)
  BIND(REPLACE(STRAFTER(?ltitle, ?w2), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w3)
  BIND(REPLACE(STRAFTER(?ltitle, ?w3), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w4)
  BIND(REPLACE(STRAFTER(?ltitle, ?w4), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w5)
  BIND(REPLACE(STRAFTER(?ltitle, ?w5), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w6)
  BIND(REPLACE(STRAFTER(?ltitle, ?w6), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w7)
  BIND(REPLACE(STRAFTER(?ltitle, ?w7), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w8)
  BIND(REPLACE(STRAFTER(?ltitle, ?w8), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w9)
  BIND(REPLACE(STRAFTER(?ltitle, ?w9), "^.*?(\\b\\w{6,}\\b).*$", "$1") AS ?w10)
  VALUES ?w_ { 1 2 3 4 5 6 7 8 9 10}
  BIND(IF(?w_ = 1, ?w1, IF(?w_ = 2, ?w2, IF(?w_ = 3, ?w3, IF(?w_ = 4, ?w4, IF(?w_ = 5, ?w5, IF(?w_ = 6, ?w6, IF(?w_ = 7, ?w7, 
          IF(?w_ = 8, ?w8, IF(?w_ = 9, ?w9, ?w10))))))))) AS ?w)
  FILTER(REGEX(?w, "^\\w+$")) # since ?w may evaluate to an empty string, e.g. for one-word titles
}
GROUP BY ?item ?w
ORDER BY DESC(?c) DESC(?l)
LIMIT 2000

Compared to the previous one, the differences are

  • strings from 5 characters onwards
  • sorting by count first, length second
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.