User talk:Drdee

From Wikipedia, the free encyclopedia

Welcome to Wikipedia!!!

Hello Drdee! Welcome to Wikipedia! Thank you for your contributions. If you decide that you need help, check out Wikipedia:Where to ask a question, ask me on my talk page, or place {{helpme}} on your talk page and someone will show up shortly to answer your questions. Please remember to sign your name on talk pages using four tildes (~~~~); this will automatically produce your name and the date. Below are some recommended guidelines to facilitate your involvement. Happy Editing! Piotr Konieczny aka Prokonsul Piotrus| talk 21:44, 9 November 2010 (UTC)
Getting Started
Getting your info out there
Getting more Wikipedia rules
Getting Help
Getting along
Getting technical

I'd gladly help, but the links you posted on my talk page are broken. --Piotr Konieczny aka Prokonsul Piotrus| talk 21:44, 9 November 2010 (UTC)

I believe Drdee was referring to Editor Trends Study at the Strategic Planning wiki. ~ Ningauble (talk) 22:36, 9 November 2010 (UTC)
Yes, thanks Ningauble. Drdee (talk) 01:02, 10 November 2010 (UTC)

Editor database

Apologies for taking so long to respond to your 9 November posting.

I'm a bit concerned about the database schema laid out here.

In particular, I think this is a mistake:

extract from each revision

  • username id
  • date edit
  • article id

because you're not including the diff # ("oldid"), which is the actual - sequentially assigned - edit number. Omitting this field/value means that you can't do sampling of edits, because the three values that you do have, for each edit, do not uniquely point to an edit. (The three values fail to point to a unique edit for those cases when an editor edits the same article twice or more in the same day.) To see the usefulness of the diff#/oldid, here, for example, is edit number 390,000,000:

http://wiki.riteme.site/w/index.php?oldid=390000000

In fact, it's unclear - since you can't uniquely match the three-tuple above to any particular edit - why you are storing the article id at all. (If you stored the diff/oldid, you could obtain the article id whenever you wanted, but the reverse is not true.)
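
A minimal sketch of the uniqueness problem described above. The records, field order, and values are made up for illustration and are not taken from the actual schema:

```python
from collections import Counter

# Hypothetical edit records: (rev_id, user_id, date, article_id).
# All values are invented for illustration only.
edits = [
    (1001, 7, "2010-11-09", 42),
    (1002, 7, "2010-11-09", 42),  # same editor, same article, same day
    (1003, 8, "2010-11-09", 42),
]

# Key each edit by the three stored fields only.
triple_counts = Counter((user, date, article) for _, user, date, article in edits)
ambiguous = [key for key, n in triple_counts.items() if n > 1]

# Key each edit by its revision id (the diff/oldid) instead.
rev_ids = {rev for rev, *_ in edits}

print(ambiguous)                   # triples that match more than one edit
print(len(rev_ids) == len(edits))  # True when every revision id is unique
```

With the revision id stored, any one edit can be picked out directly; with only the three-tuple, the two edits by editor 7 cannot be told apart.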

To return to the matter of the ability to do sampling: this is very important for a number of reasons. For example, it could distinguish vandals (who have all their edits reverted, and are then blocked) from frustrated editors (who have most or all of their edits reverted, but aren't blocked) and from constructive editors (who have few or no edits reverted). Or it could separate editors who just do minor fixes from ones who contribute significant new material to articles. Or any number of other things that researchers might want to do.

Also, it would be helpful to clarify the meaning of this: "sort edits by date and keep first 10 edits". Is that only the first ten edits ever made by that editor? Or, for an editor who started editing in 2006, is that the first 10 edits in 2006, the first 10 in 2007, the first 10 in 2008, etc.? Or the first ten edits made on each separate date? And none of those seems consistent with this sentence, regarding stored data: "so the first observation is the first edit made by that person while the last observation is the final edit made by that person". If a person made 15 edits on the final date covered by the database download, it doesn't appear that the final edit (or the final five edits, for that matter) is stored, just the first ten (or none at all, depending on the answers to the questions at the beginning of this paragraph).

The larger issue, of course, is why limit (to ten per editor per date) the number of edits being stored? At 40 characters of information stored per edit (which is generous, and includes the diff/oldid as well as article id), the entire database (of 400 million edits, to date) would only be about 16 gigabytes. [Actually, that overstates the size - the 400 million edits are across all namespaces, not just mainspace, where articles reside. I seem to recall that roughly half of Wikipedia edits are to articles; if so, then the size of the database would be less than 10 gigabytes, because you're not interested in anything other than article edits - those in namespace 0.]
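
As a quick check of that arithmetic, a short sketch. It assumes one byte per character and decimal gigabytes, and the one-half article share is only the rough recollection mentioned above, not a measured figure:

```python
# Figures from the paragraph above. ARTICLE_SHARE is a rough recollection,
# not a measured value; one byte per character is an assumption.
TOTAL_EDITS = 400_000_000
BYTES_PER_EDIT = 40
ARTICLE_SHARE = 0.5

all_namespaces_gb = TOTAL_EDITS * BYTES_PER_EDIT / 1e9
articles_only_gb = all_namespaces_gb * ARTICLE_SHARE

print(all_namespaces_gb)  # 16.0
print(articles_only_gb)   # 8.0
```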

Finally, I'm hoping that the statement "The Python scripts to create the dataset to answer the question 'Which editors are the ones that are leaving -- are they the new editors or the more tenured ones?'" was only a rough, initial draft. Because, of course, the answer could be "both". And the answer could vary with time - the cohort of those who started editing in (say) January 2007 could experience very different patterns of departure than the cohort of those whose first edit was in (say) January 2008. -- John Broughton (♫♫) 19:48, 2 December 2010 (UTC)

Dear John,

Thank you very much for your detailed comments. I will clarify some points: 1) You are right that with the current scheme you cannot identify an individual edit. The main reason is that, at this moment, we do not need that information. For example, Erik Zachte (among others) is looking at reverts. We are primarily concerned with the entry and exit rates of editors. If we do need this information later, it is extremely easy to add this variable to MongoDB. MongoDB is a schemaless db and making changes on the fly is very straightforward: just start saving the variable; there is no need to make changes to columns or deal with NULL values.
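
To illustrate what schemaless means here, a sketch in which plain Python dictionaries stand in for MongoDB documents. No real MongoDB calls are made, and the field names (including `rev_id`) are hypothetical:

```python
# Plain dicts stand in for MongoDB documents; this does not call MongoDB.
# Field names and values are hypothetical.
collection = [
    {"user_id": 7, "date": "2010-11-09", "article_id": 42},
]

# Later we decide to store the revision id as well: new documents simply
# carry the extra field, and older documents are left untouched.
collection.append(
    {"user_id": 7, "date": "2010-11-10", "article_id": 42, "rev_id": 1002}
)

# Readers supply a default for documents that predate the new field.
rev_ids = [doc.get("rev_id") for doc in collection]
print(rev_ids)  # [None, 1002]
```

Note that edits stored before the field was added would still lack it, so they would have to be reprocessed from the dump to fill it in.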

2) Clarification of the first 10 edits:

  • By first 10 edits we mean the first 10 edits an editor ever made (over the entire life span of being an active community member). Why 10 edits? This has historically been the cut-off value used within different Wikipedia projects (such as stats.wikimedia.org).
  • First edit: this is the very first edit made by an editor, and last edit is the very last edit made by an editor.
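
A rough sketch of how these per-editor variables could be derived. This is not the project's actual script; the function name, field names, and sample data are invented for illustration:

```python
from collections import defaultdict
from datetime import date

CUTOFF = 10  # number of earliest edits kept per editor


def summarize(edits):
    """Group (user_id, edit_date) pairs by editor and derive the variables."""
    by_editor = defaultdict(list)
    for user_id, edit_date in edits:
        by_editor[user_id].append(edit_date)

    summary = {}
    for user_id, dates in by_editor.items():
        dates.sort()  # oldest edit first
        summary[user_id] = {
            "first_edit": dates[0],
            "last_edit": dates[-1],
            "first_edits": dates[:CUTOFF],
        }
    return summary


# Invented sample: editor 7 makes 12 edits, editor 8 makes one.
sample = [(7, date(2010, 1, d)) for d in range(1, 13)] + [(8, date(2010, 2, 1))]
result = summarize(sample)
print(result[7]["last_edit"], len(result[7]["first_edits"]))
```

Here the last edit is taken over an editor's whole history, independently of the ten-edit cut-off, so an editor with more than ten edits still has a correct last edit.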

3) This is an omission from the initial description: I create two databases in Mongo: (a) a database collecting all edits by all editors for a particular project, and (b) the database I described, a sub-database of derived variables built from database (a). Database (b) will be exported as is, so it can be thought of as a view of the first database.

I hope this clarifies some of the points that you raised.

Best, Diederik

Re: Using Huggle for experiments

You can if you want, although I'm not sure I see why you want to improve retention of vandals, who should be the only people seeing warning messages from Huggle.

If you think unfriendly warning messages are deterring contributors -- and I would agree with you -- I would recommend directing your efforts to warning messages that legitimate new users are hassled by. For example:

  • protected page notice
  • warnings from overzealous abuse filters
  • vandalism warnings from automated bots
  • various other unfriendly notifications by automated bots
  • the spam blacklist
  • captcha when adding external links

...that sort of thing. Gurch (talk) 21:43, 18 July 2011 (UTC)

Re: Signpost October / DataChallenge

The goal of The Signpost is to inform. (Other outlets for the same material - I appreciate that the Recent Research is unique insofar as it does have other outlets - may choose to do things differently.) I can assure you that that and that alone was what I was intending to do by adding the sentence. Of course, I share your concerns about due balance, but I was personally very satisfied that the rest of the piece makes it clear that it was a successful competition. Unlike some real-world newspapers we are blessed with a discerning readership who would have had no problem picking that out.

Against this, it should be noted that *not* reporting this fact would have looked like an attempt to whitewash the view of the competition. While I appreciate that whitewashing is in fact very useful in many circumstances, it's in everyone's interests that the (perception *and* reality of) integrity and independence of The Signpost is preserved. Had we not mentioned this unfortunate event, I have no doubt that it would have appeared in the comments section of either the Tech Report or the Research Report, and said whitewashing would have been exposed (or we just look like idiots who don't research our stories properly).

Certainly, though, I agree that notification of such choices (which rest ultimately with the E-in-C(s), to whom you can always appeal) is a good thing. Unfortunately, it is difficult for me to notify all the authors of the report in a way that they would have read prior to publication. This is the same for all reports and it is advisable for all authors to check wording they have a particular concern over prior to publication (or as soon afterwards as they are available). This is, unfortunately, the reality of running a community-staffed volunteer newspaper and I wish it was otherwise as much as you, I can assure you. Many regards, - Jarry1250 [Weasel? Discuss.] 10:33, 3 November 2011 (UTC)

Nomination for deletion of Template:Ancestors of Duchess Luise of Brunswick-Wolfenbüttel

Template:Ancestors of Duchess Luise of Brunswick-Wolfenbüttel has been nominated for deletion. You are invited to comment on the discussion at the template's entry on the Templates for discussion page. Frietjes (talk) 22:16, 14 December 2018 (UTC)

researcher flag

Hello Drdee, in 2011 a special flag (researcher) was added to your account for a project. This flag is for accessing certain histories of deleted pages. Are you still working on this project and require this continuing access? Please reply here or on my talk page. Thank you, — xaosflux Talk 01:02, 5 December 2019 (UTC)

Feel free to remove it. --Drdee (talk) 01:30, 5 December 2019 (UTC)[reply]

Thank you for confirmation, processing at meta:Special:Diff/19635672. — xaosflux Talk 16:38, 13 December 2019 (UTC)[reply]