User:The Transhumanist/AWB tips

AWB, aka AutoWikiBrowser, is a powerful semi-automated batch editor. It can also be described as an "auto page loader".

It has a versatile "make list" feature to help you specify a batch of pages (articles, etc.). The program can be configured to make many edits automatically. When you press the "start" button, it loads the next page, makes its automatic edits, and then waits for you to inspect or work on the page yourself. Once you have decided whether you are satisfied with the revision, you press "Save" or "Skip"; AWB complies and then loads the next page in the batch. This continues until every page in the batch has been processed.

It is easy to plateau with this program, settling for using it to do simple search/replaces, without learning the full potential of the program.

Following are some tips on how to get more out of AWB...

Read the manual!

Read the manual before you start. There are lots of caveats and no-nos that you must know before you begin.

Also, AWB is packed with powerful features, all of which are covered in the manual. There's a natural temptation to skip the manual and just use the menu items that are familiar or self-explanatory, but then you miss out on some of the most powerful features.

If you read some of the manual for a few minutes every time you have an AWB session, you'll be surprised how fast you become an advanced user.

Yes, the manual leaves something to be desired. Keep in mind that it is editable, just like all the other pages on Wikipedia. (Hint, hint.)

Learn regex

Learn regex. "Regex" is short for "regular expression", which is a fancy term for "advanced search query". Regex is a pattern-matching language used to describe a string or set of strings that match a specific pattern. It enables very powerful search/replace.

AWB's search/replace feature optionally supports regex. You can use AWB for basic search and replace, but regex search/replace is far more powerful. Regex includes metacharacters that can match any character (like wild cards in poker), and it supports sets so that you can customize what gets matched; this lets you match a whole family of strings, rather than just specific words. And it does a whole lot more.
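To make the difference concrete, here is a minimal sketch in Python, whose `re` module treats these basic constructs the same way for this purpose. The sample sentence and the patterns are made up for illustration; in AWB you would type the patterns into the find and replace boxes rather than write code.

```python
import re

# Invented sample text containing two spellings of two words.
text = "The colour of the color-coded centre is grey, not gray."

# A plain search/replace handles exactly one literal spelling.
plain = text.replace("colour", "color")

# The "." metacharacter matches any single character (the wild card).
wild = re.findall(r"gr.y", text)

# A set, [ae], matches only the listed characters.
either = re.findall(r"gr[ae]y", text)

# "u?" makes the u optional, so one rule covers both spellings.
normalised = re.sub(r"colou?r", "color", text)

print(wild, either)
print(normalised)
```

The set is usually the safer choice: `gr.y` would also match something like "gr-y", while `gr[ae]y` matches only the two spellings you meant.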

The better you know regex, the more effective you will be with AWB.

Regex is pretty technical, and it may be difficult to remember its details in the beginning, or if you haven't used it in a while. For that reason, AWB has a regex cheat sheet to help jog your memory on what to put in regexes to get the effects you want.

Once you've learned it, you'll have a valuable skill you can apply in many other places. Regex can be used in user scripts, and is supported in a great many software programs, including editors such as Notepad++. Many programming languages support regex, so you can include regexes right in the source code of programs you write. It's also supported in Wikipedia's editor (in the advanced menu), and in the wikEd gadget on Wikipedia. AWB itself has another powerful feature that supports regex: its database scanner.

AWB is multi-instance capable

AWB is "multi-instance", which means you can run it in more than one window (on your computer) on the same Wikipedia account at the same time. You can also run it on more than one computer at the same time on the same account.

AWB is complementary

AWB does not force you to make an either/or decision between it and your browser. You can be logged into your Wikipedia account with both at the same time.

During an AWB session, you can switch to the window with your browser to inspect the work. Try to avoid mass producing errors. "Look before you take a thousand leaps" is a good motto. (See the next tip...)

Don't propagate errors en masse

Be careful.

When posting notices, newsletters, and mass edits in general with AWB, stop after the first page is saved, and proofread the result in another window or on another computer (I generally have two going). Do this regardless of how many times you've checked or proofread the settings before going live. Once live, you don't want to have to go back over the whole batch with corrections. (A batch can have up to 25,000 pages in it.) I generally...

  1. Look at the first page saved
  2. Do some fixes to the settings (and to the page)
  3. Look at the next page saved
  4. Do more fixes to the settings (and the page), etc.
  5. Continue doing this for the first few pages, until the results come out clean

There's something about a communication going live that heightens one's alertness. Use that to find errors.

Save sets of settings

When you get up to speed with AWB, you'll likely switch back and forth between tasks. Some tasks are so large and time-consuming that finishing them before you need to do something else isn't an option. That's what "Save settings as..." (in the File menu) is for: to save multiple configurations. Switching configs is faster than clicking a bunch of "enabled" check boxes in search every time you switch tasks.

Saved settings are also good for configuring multiple instances of AWB.

Consolidate skipping with pre-parse mode

Pre-parse mode (in Options in the upper menu, top of screen) lets you do all the skipping at one time, so you can go fix dinner, rather than have the skipping slow down your task while you are engaged in it. In other words, pre-parse mode does a dry run of your batch task without making any changes to the pages, and removes pages from your batch list that match your various "skip" settings. So, when you do the live run, it won't be slowed down by skipping, because those pages have already been screened out during the dry run.
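The idea of the dry run can be sketched in a few lines of Python. This is only a model of the concept, not how AWB implements it: the `pages` dictionary stands in for pages loaded from the wiki, and the single "skip if contains" pattern is a hypothetical stand-in for AWB's various skip settings.

```python
import re

# Hypothetical batch and page texts (in AWB these are loaded from the wiki).
batch = ["Alpha", "Beta", "Gamma"]
pages = {
    "Alpha": "{{Short description|First letter}}\nAlpha is a letter.",
    "Beta": "Beta is a letter.",
    "Gamma": "Gamma is a letter.",
}

# Hypothetical skip rule: skip pages that already have a short description.
skip_if_contains = re.compile(r"\{\{[Ss]hort description")


def pre_parse(batch, pages, skip_pattern):
    """Dry run: return the batch minus pages that would be skipped.

    No page text is modified; only the list shrinks.
    """
    return [title for title in batch if not skip_pattern.search(pages[title])]


live_batch = pre_parse(batch, pages, skip_if_contains)
print(live_batch)
```

The live run then works through `live_batch` only, so every page it loads is one that actually needs attention.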

Warning: Don't get hypnotized by pre-parse. Look away! :)

AWB metaphor

AWB can be characterized as a machine gun, because it can do a lot of damage quickly. But if you use it maliciously or negligently, you'll only get to do that once or twice before being blocked or having your AWB access revoked.

A much better analogy is that each instance of AWB is like a radio or TV station (one-to-many mode of mass communication) - you are communicating with lots of editors or lots of readers.

Regex anchors

Anchors in regex are symbols that match a position in the document rather than a character. Because a position contains no text, an anchor itself can't be replaced; anchors are used for matching only. So, ^A means "an A at the beginning of the line" and A$ means "an A at the end of the line". The A can be replaced, but the anchors are abstract and aren't actually present as characters in the text of the page. Whether "line" means each line or the whole page depends on the MultiLine setting, covered below. (The newline character is replaceable, but that is another matter. See the next section...)
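Here is a small Python sketch of that point. The three-line sample page is invented, and `re.MULTILINE` is Python's flag for making the anchors apply to every line.

```python
import re

# Invented three-line page.
page = "Apples\nA banana\nALPHA"

# ^A: only the "A" is consumed and replaced; the anchor is just a position.
starts = re.sub(r"^A", "a", page, flags=re.MULTILINE)

# A$: an "A" at the end of a line ("A banana" ends in a lowercase "a").
ends = re.sub(r"A$", "a", page, flags=re.MULTILINE)

# Matching the bare anchor replaces nothing; the text is inserted there.
bulleted = re.sub(r"^", "* ", page, flags=re.MULTILINE)

print(starts)
print(ends)
print(bulleted)
```

The last call is the useful trick: since the anchor matches a position and no characters, "replacing" it amounts to inserting text at that position.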

The newline character

"\n" represents the newline character. Memorize it, because it will be one of your best friends. It is what tells the computer to continue the text from that point on the next line. When you see a blank line, like right below this paragraph you are reading right now, there are two "\n" characters in a row that make that happen.

With "\n", you can do searches that span multiple lines. (Not to be confused with "MultiLine", below.)
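Two common uses, sketched in Python on an invented sample page: tidying up runs of blank lines, and matching text that straddles a line break.

```python
import re

# Invented page with too many blank lines after the first paragraph.
page = "First paragraph.\n\n\n\nSecond paragraph.\nStill second.\n\nThird."

# Three or more "\n" in a row means two or more blank lines;
# collapse them to exactly one blank line ("\n\n").
tidy = re.sub(r"\n{3,}", "\n\n", page)

# A pattern containing "\n" matches across a line break:
# join "Still second." onto the end of the previous line.
joined = re.sub(r"paragraph\.\nStill", "paragraph. Still", tidy)

print(repr(tidy))
print(repr(joined))
```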

Regex MultiLine vs. SingleLine

In AWB search, MultiLine and SingleLine are often misunderstood. They are poorly named (the names are industry standard), and they do not refer to how many lines are included in your search string. They are two independent options, and each one changes how certain metacharacters behave.

MultiLine changes the anchors. With it unchecked, the regex beginning-of-line and end-of-line anchors (^ and $) mean beginning-of-page and end-of-page. With it checked, they match at the beginning and end of every line.

SingleLine changes the dot. Normally "." matches any character except a newline ("\n"), so a pattern like ".*" stops at the end of the line. With SingleLine checked, "." matches newlines too, so ".*" can run across line breaks, as if the page were one long line.

An explicit "\n" in your search string matches a newline whichever boxes are checked.

So, when you want "." to match across line breaks, specify "SingleLine".

When you want to use anchors to specify the beginning or end of a line (and not of the whole page), specify "MultiLine".
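The behaviour can be tried out in Python, whose `re.MULTILINE` and `re.DOTALL` flags are the counterparts of the .NET Multiline and Singleline regex options. That AWB's two check boxes map onto those .NET options is an assumption of this sketch, and the sample page is invented.

```python
import re

# Invented five-line page containing a small template.
page = "Intro line\n{{Infobox\n|name=X\n}}\nLast line"

# No flags: ^ matches only at the very start of the page.
start_of_page = re.findall(r"^\w+", page)

# MULTILINE: ^ matches at the start of every line.
start_of_lines = re.findall(r"^\w+", page, flags=re.MULTILINE)

# No flags: "." stops at a newline, so the template is not found.
dot_default = re.search(r"\{\{Infobox.*\}\}", page)

# DOTALL (Singleline in .NET): "." matches newlines too.
dot_all = re.search(r"\{\{Infobox.*\}\}", page, flags=re.DOTALL)

# An explicit \n matches a newline with no flags at all.
explicit = re.search(r"Infobox\n\|name", page)

print(start_of_page, start_of_lines)
print(dot_default, dot_all.group(0) if dot_all else None)
```

The two flags can also be combined, since one governs the anchors and the other governs the dot.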

Search/replace input/output

Search/replaces work on the output of previous search/replaces.

Keep this in mind so that you don't undo or overwrite the work of previous search/replaces.

You can arrange your search/replaces in whatever order you want.
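A short Python sketch shows why the order matters. The `run_rules` helper and the `@@TMP@@` placeholder are hypothetical, introduced only for this illustration; in AWB the rules are simply rows in the find and replace list.

```python
import re


def run_rules(text, rules):
    """Apply (pattern, replacement) rules in order, each on the previous output."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


text = "cat and dog"

# Naive swap: the second rule also rewrites the output of the first.
naive = run_rules(text, [(r"cat", "dog"), (r"dog", "cat")])

# Safer swap: park one word in a temporary placeholder first.
safe = run_rules(
    text,
    [(r"cat", "@@TMP@@"), (r"dog", "cat"), (r"@@TMP@@", "dog")],
)

print(naive)
print(safe)
```

The placeholder has to be a string that cannot occur in the page, otherwise the last rule would rewrite text it was never meant to touch.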

Prepending and appending feature

AWB has a prepend and append feature, for inserting prose at the beginning or end of a page. Annoyingly, search/replace cannot edit the output of these in the same pass. Multiple passes are not the best solution here. See below.

More powerful prepending and appending using regex

There is another way to prepend and/or append: use regex anchors in an ordinary search/replace. The MultiLine box must be left unchecked for this to work, so that ^ and $ refer to the beginning and end of the whole page rather than of every line. Then the subsequent search/replaces will work on the prepended/appended output. This is especially powerful if you use AWB keywords, like %%title%% (which represents the page title).
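Here is a rough Python model of the technique. It uses `\A` and `\Z`, Python's unambiguous start-of-text and end-of-text anchors, in place of ^ and $. The page title, the inserted text, and the `expand` helper that imitates keyword expansion are all hypothetical; AWB expands %%title%% itself.

```python
import re

title = "Example article"      # hypothetical page title
page = "Some article text."    # hypothetical page content


def expand(replacement, title):
    """Imitate AWB's keyword expansion for %%title%% (illustration only)."""
    return replacement.replace("%%title%%", title)


rules = [
    (r"\A", "'''%%title%%''' is a stub.\n\n"),   # prepend at start of page
    (r"\Z", "\n\n[[Category:Stubs]]"),           # append at end of page
    (r"is a stub", "is a short article"),        # edits the prepended text
]

for pattern, replacement in rules:
    expanded = expand(replacement, title)
    # A function replacement inserts the text literally, with no escape processing.
    page = re.sub(pattern, lambda match: expanded, page)

print(page)
```

Because the prepend and append are themselves search/replace rules, the third rule sees their output in the same pass.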

Processing redirect pages

If you wish to process redirects, make sure "Follow redirects" (in the upper Options menu) is turned off.

AWB as list processor

Treat AWB as a multi-purpose tool (like a Swiss Army knife), not just a search/replace stream editor. For example, it is a powerful list processor as well.

List processor?

Batches are lists. And you can do a lot to a batch list without editing any pages.

With AWB, you can make lists, split lists, compare lists, combine lists, filter lists (using pre-parse), and so on.

You can also use AWB's database scanner to make lists. It has a feature to send its search results (which are page names) to AWB's batch list box.

Once you are done processing a list, you can copy/paste it to a Wikipedia page, and convert it to links.
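For readers who think in code, the list operations above look roughly like this in Python. The function names and the sample titles are invented for the sketch; AWB does all of this through its list box and list tools.

```python
# Two hypothetical batch lists of page titles.
list_a = ["Regular expression", "AutoWikiBrowser", "Wikipedia"]
list_b = ["Wikipedia", "Notepad++", "AutoWikiBrowser"]


def combine(a, b):
    """Merge two lists, dropping duplicates but keeping first-seen order."""
    return list(dict.fromkeys(a + b))


def only_in_first(a, b):
    """Titles that are in a but not in b."""
    exclude = set(b)
    return [title for title in a if title not in exclude]


def in_both(a, b):
    """Titles that appear in both lists, in a's order."""
    keep = set(b)
    return [title for title in a if title in keep]


def to_wikilinks(titles):
    """Format titles as a bulleted list of wiki links, ready to paste."""
    return "\n".join(f"* [[{title}]]" for title in titles)


print(combine(list_a, list_b))
print(only_in_first(list_a, list_b))
print(to_wikilinks(in_both(list_a, list_b)))
```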

Supercharged Wikipedia search using AWB's database scanner

The Database scanner (in the Tools menu) is exceedingly powerful. It lets you do regex searches of Wikipedia, by page title, by page content, or both. But, you have to download the database (a data dump) first, and have it stored on your computer.
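To give a feel for what such a scan does, here is a toy version in Python. It is a sketch under loud assumptions: the three-page `SAMPLE_DUMP` is invented and heavily simplified (real dumps are enormous and their XML elements carry a namespace, so the tag names would need qualifying), and the `scan` function is hypothetical, not the scanner's actual code.

```python
import io
import re
import xml.etree.ElementTree as ET

# Invented, simplified stand-in for a database dump (no XML namespace).
SAMPLE_DUMP = """<mediawiki>
  <page><title>Alpha</title><revision><text>Has {{Infobox thing}} here.</text></revision></page>
  <page><title>Beta</title><revision><text>Plain prose only.</text></revision></page>
  <page><title>List of alphas</title><revision><text>* [[Alpha]]</text></revision></page>
</mediawiki>"""


def scan(dump_file, title_pattern=None, text_pattern=None):
    """Return titles of pages whose title and/or text match the given regexes."""
    hits = []
    # iterparse streams the file, so the whole dump is never held in memory.
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title", default="")
        text = elem.findtext("revision/text", default="")
        title_ok = title_pattern is None or re.search(title_pattern, title)
        text_ok = text_pattern is None or re.search(text_pattern, text)
        if title_ok and text_ok:
            hits.append(title)
        elem.clear()  # free the page we have finished with
    return hits


def open_sample():
    """Open the sample dump as a fresh binary stream."""
    return io.BytesIO(SAMPLE_DUMP.encode("utf-8"))


by_title = scan(open_sample(), title_pattern=r"(?i)alpha")
by_text = scan(open_sample(), text_pattern=r"\{\{Infobox")
by_both = scan(open_sample(), title_pattern=r"(?i)alpha", text_pattern=r"\[\[")
print(by_title, by_text, by_both)
```

The result is a list of page titles, which is exactly the shape of thing you would feed into a batch list.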

More to come...

Well, that's all for now. I'll jot down more tips as I remember them. Be sure to reread the list above once you've learned the ropes, as it will make more sense then and will speed up your learning. -- The Transhumanist 09:51, 1 May 2024 (UTC)

P.S.: If you have any questions, feel free to ask me on my talk page. --TT