Tightening up security, restricting by country

Lately we’ve noticed that more and more traffic to our servers involves attempted brute-force logins. That means that a remote computer is connecting to our server and trying to guess usernames and passwords, with different variations of both, and hoping they’ll get in.

Things got to the point where we were seeing hundreds of thousands of login attempts daily for accounts like ‘admin’, ‘hortnet’, ‘wordpress’, variations of my name, and so on. That’s crazy! The attacks were also coming in bursts, so there might be a 30-second lull followed by thirty attempts within a few seconds. We weren’t overly concerned because we have other protections in place like two-factor authentication, but there were so many failed logins that our server performance was starting to suffer.

The first thing we did is what everyone else does: install software that blocks multiple failed attempts. OSSEC is great for that, but it only catches people after they’ve jiggled the door handle a few times. Even with OSSEC installed we were still seeing 50,000 attempts daily.

After studying the logs, a commonality became apparent. Almost all of the attacks were coming from other countries. Russia and China were the worst offenders, but just about every country around the globe was represented. What if we could limit administrative access only to computers based in the US? Nobody outside of the US needs to get in, and MaxMind maintains a database of IP to country code mappings that would work perfectly. Fortunately, they’re gracious enough to share a free version that will meet our needs.

We crafted some scripts and discovered that limiting access from only US systems reduced login attempts by 99.9 percent.

The first thing we did was install ipset on our server and configure the system to use it and our scripts (this is RedHat/CentOS-centric). Cut and paste the section below into your shell.

sh -c "PERL_MM_USE_DEFAULT=1 perl -MCPAN -e 'install Net::CIDR'"
yum install -y ipset perl-Net-CIDR perl-libwww-perl
echo "create -exist US hash:net family inet hashsize 2048 maxelem 65536" >> /etc/sysconfig/ipset
mkdir -p /var/cache/ipset
mkdir -p /usr/services/GeoIP
echo "ipset restore < /var/cache/ipset/us.conf" >> /etc/rc.local
echo "15 12 * * 2 root /usr/services/GeoIP/createUSipset" >> /etc/cron.d/ipset
/etc/rc.d/init.d/ipset start

This creates a set named ‘US’ in ipset that will store the IP ranges of all known US IPs. Note that the actual population of the set is done via /etc/rc.local — this is because /etc/sysconfig/ipset can’t contain a ‘restore’ command. We could store all of our IPs directly in /etc/sysconfig/ipset, but if you want to use ipset for anything different it becomes complicated. Also, note that our cron entry only downloads on Tuesday because that’s the day that MaxMind releases database changes.

Anyhow, once the stuff above is done, paste the section below into your shell to create the script that downloads the database.

cat <<'EOF' > /usr/services/GeoIP/createUSipset
#!/usr/bin/perl

use Net::CIDR;
use LWP::UserAgent;
use v5.10;
$|++;

my $ua    = LWP::UserAgent->new;
my $dir   = qq(/usr/services/GeoIP);
my $file  = qq(GeoIPCountryCSV.zip);
my $url   = qq(https://geolite.maxmind.com/download/geoip/database/$file);
my $cache = qq(/var/cache/ipset/us.conf);

exit if ! ($ua->mirror($url, qq($dir/$file)))->is_success;

open (IPSET, "|ipset restore");
say IPSET qq(create -exist US-new hash:net family inet);
open (COUNTRY, "zcat $dir/$file|");
while (<COUNTRY>) {
   next if $_ !~ /"US"/;
   my ($ipstart, $ipend) = ($_ =~ /^"([^"]+)","([^"]+)",/);
   foreach my $i (Net::CIDR::range2cidr($ipstart .  "-" .  $ipend)) {
      say IPSET qq(add US-new $i);
   }
}
close COUNTRY;
close IPSET;

system qq(ipset swap US US-new);
system qq(ipset save US | sed 's/create US/create -exist US/' > $cache);
system qq(ipset destroy US-new);
EOF

chmod ug+rx /usr/services/GeoIP/createUSipset
/usr/services/GeoIP/createUSipset

Now you have a set populated with US IP addresses — all we have left is to actually use it with our iptables rules. Here’s an example that blocks all non-US traffic to ssh, then saves the changes for a reboot:

# BE CAREFUL RUNNING THIS, AND ONLY DO IT IF YOU UNDERSTAND WHAT YOU'RE DOING
iptables -A INPUT -m set --set US src -p tcp -m tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp -m tcp --dport 22 -j DROP
iptables-save

The critical part above is the ‘-m set –set US’ bit; it tells iptables to compare the connecting IP against the set contained within ipset.

Detecting if a remote connection is mobile in Apache

hort.net-mobile

As many of you have heard, Google is requiring that sites pass their ‘mobile usability’ test if they want to show up in mobile search rankings. After April 22nd, Google will start rolling out changes to their listings, and those who don’t pass the test stand to lose a lot of traffic.

I was pretty concerned until I realized that our biggest problem is all of the small menus around our content. That stuff works fine on a desktop, but fat fingers (compared to a mouse pointer) can’t discern between tightly spaced links.

So… Why not just make all of the menus disappear for mobile users?

It seemed like the easiest way would be to create an environment variable from within Apache. This would let me either use Server Side Includes (SSI) or make programmatic changes in scripts.

The trick, of course, was figuring out which visitors were mobile and which ones weren’t. Setting an environment variable wouldn’t be of much use if I was identifying the wrong visitors.

I ended up finding a great Apache regular expression at detectmobilebrowers.com. Although the regular expression had what I needed, they were using it to rewrite mobile visitors to another URL. I just wanted to set a variable, not redirect people. Here’s what I came up with instead.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (android|bb\d+|meego).+mobile|avantgo|bada\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge\ |maemo|midp|mmp|mobile.+firefox|netfront|opera\ m(ob|in)i|palm(\ os)?|phone|p(ixi|re)\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\.(browser|link)|vodafone|wap|windows\ ce|xda|xiino [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a\ wa|abac|ac(er|oo|s\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\-m|r\ |s\ )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\-(n|u)|c55\/|capi|ccwa|cdm\-|cell|chtm|cldc|cmd\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\-s|devi|dica|dmob|do(c|p)o|ds(12|\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\-|_)|g1\ u|g560|gene|gf\-5|g\-mo|go(\.w|od)|gr(ad|un)|haie|hcit|hd\-(m|p|t)|hei\-|hi(pt|ta)|hp(\ i|ip)|hs\-c|ht(c(\-|\ |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\-(20|go|ma)|i230|iac(\ |\-|\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt(\ |\/)|klon|kpt\ |kwc\-|kyo(c|k)|le(no|xi)|lg(\ g|\/(k|l|u)|50|54|\-[a-w])|libw|lynx|m1\-w|m3ga|m50\/|ma(te|ui|xo)|mc(01|21|ca)|m\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\-|\ |o|v)|zz)|mt(50|p1|v\ )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\-2|po(ck|rt|se)|prox|psio|pt\-g|qa\-a|qc(07|12|21|32|60|\-[2-7]|i\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\-|oo|p\-)|sdk\/|se(c(\-|0|1)|47|mc|nd|ri)|sgh\-|shar|sie(\-|m)|sk\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\-|v\-|v\ )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\-|tdg\-|tel(i|m)|tim\-|t\-mo|to(pl|sh)|ts(70|m\-|m3|m5)|tx\-9|up(\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\-|\ )|webc|whit|wi(g\ |nc|nw)|wmlb|wonu|x700|yas\-|your|zeto|zte\-) [NC]
RewriteRule ^ - [E=MOBILE_TMP:1]
RequestHeader add MOBILEUSER %{MOBILE_TMP}e env=MOBILE_TMP

As you can see, the first three lines came from detectmobilebrowsers.com. The last two lines set the environment variable in a two-step process.

The mod_rewrite line that starts with RewriteRule sets a variable named MOBILE_TMP to ‘1’ if any of the regular expressions match. Unfortunately, Apache won’t pass that environment variable on to CGI scripts. mod_header, however, can pass variables that it sets.

The last line uses this feature in mod_header to set another variable named HTTP_MOBILEUSER to the value of MOBILE_TMP if MOBILE_TMP is set. Note that the variable is called HTTP_MOBILEUSER, not MOBILEUSER. The RequestHeader directive will prepend the string ‘HTTP_’ to whatever environment variable you pick.

Most of our pages at hort.net use SSI, so it’s very easy for me to tell the system to exclude blocks of code for mobile users now. Here’s a sample SSI snippet that only includes the files if it’s a mobile user. If not, they’re excluded.

<!--#if expr="${HTTP_MOBILEUSER} != /1/" -->
   <!--#include virtual="/include/root-top.html"-->
   <!--#include virtual="/include/vertical-spacer8.html"-->
   <!--#include virtual="/include/root-left.html"-->
<!--#endif -->

Another example sets the background color of a page to green for mobile users, but white for desktop users:

<!--#if expr="${HTTP_MOBILEUSER} != /1/" -->
<body bgcolor="#FFFFFF" vlink="#668F66" link="#000000">
<!--#else -->
<body bgcolor="#004400" vlink="#668F66" link="#000000">
<!--#endif -->

Of course, it would be ideal to use CSS for these, but when you have 800,000 legacy pages dating back nearly twenty years this is much simpler!

Sometimes our Perl scripts also had stylistic changes for mobile users too. In this case, we choose a different image for mobile visitors:

if ($ENV{'HTTP_MOBILEUSER'} == 1) {
   $imagedir = "/mnt/WWW/hort.net/images/gallery/mobile";
} else {
   $imagedir = "/mnt/WWW/hort.net/images/gallery/desktop";
}

Or:

  $imagedir = sprintf(qq(/mnt/WWW/hort.net/images/gallery/%s),
     ($ENV{'HTTP_MOBILEUSER'} == 1) ? "mobile" : "desktop"
  );

Making these changes ended up being easier than I thought, and for that I’m grateful. There are still some tweaks to make to hort.net, and we still need to add some mobile navigation options to our pages. But for now we’ll continue to show up in the search rankings, our traffic won’t disappear, and most importantly, people will be able to use our pages on mobile devices.

Who are the top plant patent holders?

So far we’ve queried our plant database to find out who the patent examiners are and which plants are most prolific in the patent system. Today we’re going to look at the top inventors and patent holders on granted patents.

First, let’s see who the top 20 plant patent holders are worldwide:

mysql> SELECT COUNT(application_applicant.entity_id) AS total, entity_name, entity_city, entity_country FROM application_applicant LEFT JOIN entity ON entity.entity_id = application_applicant.entity_id LEFT JOIN patent ON patent.application_id = application_applicant.application_id WHERE issue_date IS NOT NULL GROUP BY entity_name ORDER BY total DESC LIMIT 20;
+-------+-----------------------+------------------------+----------------+
| total | entity_name           | entity_city            | entity_country |
+-------+-----------------------+------------------------+----------------+
|   207 | Dümmen, Tobias        | Rheinberg              | DE             |
|   192 | Klemm, Nils           | Stuttgart              | DE             |
|   140 | Bergman, Wendy R.     | Lehigh Acres           | US             |
|   136 | Smith, Mark A.        | Fort Myers             | US             |
|   115 | Korlipara, Harini     | Canby                  | US             |
|   113 | Danziger, Gabriel     | Moshav Nir-Zvi         | IL             |
|   110 | Bradford, Lowell Glen | Le Grand               | US             |
|   110 | Egger, Janet N.       | Wilsonville            | US             |
|   106 | Olesen, Mogens N.     | Fredensborg            | DK             |
|   102 | Zaiger, Gary Neil     | Modesto                | US             |
|   102 | Gardner, Leith Marie  | Modesto                | US             |
|   102 | Zaiger, Grant Gene    | Modesto                | US             |
|    98 | Kordes, Tim-Hermann   | Offenseth-Sparrieshoop | DE             |
|    93 | Pieters, Dirk         | Oostnieuwkerke         | BE             |
|    88 | Sakazaki, Ushio       | Shiga                  | JP             |
|    79 | Trees, Scott C.       | Shell Beach            | US             |
|    77 | Kanaya, Takeshi       | Shiga                  | JP             |
|    76 | Meilland, Alain A.    | Antibes                | FR             |
|    70 | Maillard, Laurence    | Elne                   | FR             |
|    70 | Maillard, Arsene      | Elne                   | FR             |
+-------+-----------------------+------------------------+----------------+
20 rows in set (0.14 sec)

Germany is heavily represented by Tobias Dümmen (who seems to focus on Petunia, Pelargonium, Calibrachoa, Verbena, poinsettia, and a handful of other annuals) and Nils Klemm (who focuses on pretty much the same thing). Mr. Dümmen hasn’t had a patent granted on anything filed since he merged with Agribio in 2013. Mr. Klemm is affiliated with Selecta
(which is partnered with Ball), but also doesn’t have anything granted on a patent filed since September, 2012.

Suprisingly, many of the top patent holders in the US patent system above are not from the US.

mysql> SELECT COUNT(entity_country) AS total, countryname, population FROM entity LEFT JOIN application_applicant ON entity.entity_id = application_applicant.entity_id LEFT JOIN geonames.country ON entity.entity_country = geonames.country.countrycode LEFT JOIN patent ON patent.application_id = application_applicant.application_id WHERE issue_date IS NOT NULL GROUP BY entity_country ORDER BY total DESC limit 10;
+-------+----------------+------------+
| total | countryname    | population |
+-------+----------------+------------+
|  4676 | United States  |  303824000 |
|  1926 | Netherlands    |   16645000 |
|  1366 | Germany        |   82369000 |
|   931 | Japan          |  127288000 |
|   508 | United Kingdom |   60943000 |
|   476 | Denmark        |    5484000 |
|   409 | Australia      |   20600000 |
|   355 | France         |   64094000 |
|   305 | Israel         |    6500000 |
|   217 | New Zealand    |    4154000 |
+-------+----------------+------------+
10 rows in set (0.22 sec)

As you can see, the majority of granted plant patents are assigned to US residents. Netherlands, Germany, and Japan are high on the list too, and Netherlands has greatest number of patents per capita (at about 29 patents for every 250,000 people).

So who are the top US granted patent applicants?

mysql> SELECT COUNT(application_applicant.entity_id) AS total, entity_name, entity_city, entity_country FROM application_applicant LEFT JOIN entity ON entity.entity_id = application_applicant.entity_id LEFT JOIN patent ON patent.application_id = application_applicant.application_id WHERE issue_date IS NOT NULL AND entity_country = "US" GROUP BY entity_name ORDER BY total DESC LIMIT 20;
+-------+-------------------------+---------------+----------------+
| total | entity_name             | entity_city   | entity_country |
+-------+-------------------------+---------------+----------------+
|   140 | Bergman, Wendy R.       | Lehigh Acres  | US             |
|   136 | Smith, Mark A.          | Fort Myers    | US             |
|   115 | Korlipara, Harini       | Canby         | US             |
|   110 | Bradford, Lowell Glen   | Le Grand      | US             |
|   110 | Egger, Janet N.         | Wilsonville   | US             |
|   102 | Zaiger, Gary Neil       | Modesto       | US             |
|   102 | Gardner, Leith Marie    | Modesto       | US             |
|   102 | Zaiger, Grant Gene      | Modesto       | US             |
|    79 | Trees, Scott C.         | Shell Beach   | US             |
|    57 | Hanes, Mitchell         | Morgan Hill   | US             |
|    50 | Kobayashi, Ruth         | Carlsbad      | US             |
|    44 | Dupont, Sr., Robert J.  | Plaquemine    | US             |
|    40 | Zary, Keith W.          | Thousand Oaks | US             |
|    39 | Carruth, Thomas F.      | Altadena      | US             |
|    38 | Wood, Timothy D.        | Spring Lake   | US             |
|    34 | Vandenberg, Cornelis P. | Fort Myers    | US             |
|    32 | Hanes, Mitchell E.      | Morgan Hill   | US             |
|    30 | Lim, Peter P.           | Yamhill       | US             |
|    30 | Ren, Jianping           | Geneva        | US             |
|    29 | Lyrene, Paul M.         | Micanopy      | US             |
+-------+-------------------------+---------------+----------------+
20 rows in set (0.07 sec)

So there’s our list. Some quick Google searches give us background and affiliations for the top listings:

Bergman, Wendy R.

    Yoder Brothers, Inc. from 1996-2008 (now Aris), Syngenta from 2000-2013, Aris from 2009-2012. Specializes in Hibiscus and Chrysanthemums.

Korlipara, Harini

    Terra Nova lab manager and plant breeder, cytogenetics. Harini earned a M. Phil. and Ph.D in Cytogenetics and Plant Breeding from Nagarjuna University in India. Published dozens of articles in professional journals. Her major crops here include Echinacea and Coreopsis.

Bradford, Lowell Glen

    Fruit hybridizer, specializing in cherries, nectarines, plums, peaches…

Egger, Janet N.

    Terra Nova patent writer and manager, B.S. in Botany and a M.S. in Horticulture from the University of California, Davis, with Terra Nova since 2000, Heuchera and Heucherella specialist.

Zaiger, Gary Neil

    Inventor of pluot, works with nectarines, peaches, plums, almonds. Father of Grant Zaiger and Leith Marie Gardner.

Gardner, Leith Marie

    Daughter of Gary Neil Zaiger (above), with similar patents.

Zaiger, Grant Gene

    Son of Gary Neil Zaiger (above), with similar patents.

Trees, Scott

    Senior ornamental plant breeder for Ball Horticultural Company and PanAmerican Seed Company from 2005-present, V.P of Research for Pan American Seed from 1978-1988. Has a PhD in plant genetics from UC Davis. Specializes in Lantana, Pelargonium, Verbena, and Angelonia (and a few others).

Based on the totals above, Terra Nova Nurseries probably has the largest ornamental plant patent portfolio for US patent holders, and the Zaiger family has the largest number of plant patents on food crops.


What do you want to know about plant patents? Let us know in the comments and we’ll look it up. Or maybe we can make a tool that you can use regularly.

Which plants are most represented by patents?

As we looked at the different plant patents that have been granted, we started to wonder which genera are most represented.

This was pretty easy to do with a simple database query; we get a total for the first word (genus) of all plant names that start with the letters ‘A’ through ‘Z’ (which eliminates greges and the like).

mysql> SELECT COUNT(sci_name) AS total, SUBSTRING_INDEX(sci_name, ' ', 1) AS name FROM patent WHERE sci_name REGEXP '^[A-Z]' GROUP BY name ORDER BY total DESC LIMIT 50;
+-------+---------------+
| total | name          |
+-------+---------------+
|   742 | Rosa          |
|   613 | Chrysanthemum |
|   433 | Prunus        |
|   420 | Pelargonium   |
|   323 | Impatiens     |
|   317 | Petunia       |
|   265 | Calibrachoa   |
|   226 | Osteospermum  |
|   200 | Verbena       |
|   173 | Euphorbia     |
|   165 | Hydrangea     |
|   140 | Hibiscus      |
|   134 | Fragaria      |
|   131 | Begonia       |
|   130 | Echinacea     |
|   125 | Malus         |
|   121 | Dianthus      |
|   118 | Dahlia        |
|   115 | Phlox         |
|   109 | Alstroemeria  |
|   108 | Heuchera      |
|   108 | Vitis         |
|   100 | Kalanchoe     |
|    97 | Vaccinium     |
|    95 | Mandevilla    |
|    92 | Nemesia       |
|    81 | Clematis      |
|    78 | Rhododendron  |
|    70 | Argyranthemum |
|    69 | Sedum         |
|    68 | Lobelia       |
|    68 | Coreopsis     |
|    66 | Hosta         |
|    65 | Rubus         |
|    63 | Lantana       |
|    62 | Lavandula     |
|    61 | Salvia        |
|    60 | Diascia       |
|    57 | Helleborus    |
|    55 | Phalaenopsis  |
|    55 | Campanula     |
|    53 | Angelonia     |
|    52 | Aster         |
|    49 | Lilium        |
|    49 | Sutera        |
|    47 | Veronica      |
|    41 | Torenia       |
|    40 | Gerbera       |
|    38 | Citrus        |
|    38 | Caladium      |
+-------+---------------+
50 rows in set (0.02 sec)

So, there you have it. Those are the 50 most popular genera to receive patents in the last decade.

What do you want to know about plant patents? Let us know in the comments and we’ll look it up. Or maybe we can make a tool that you can use regularly.

Meet your plant patent examiners

We’ve added nearly 10,000 patents to our plant patent database at hort.net going back through 2005, and we thought that it would be fun to look at the patent examiners. How many are there? Which ones are busiest? This won’t reveal anything new, but we’re hoping it offers insights into what can be done with the data and you’ll give feedback about what you’d like to see.

If we query our database for unique primary patent examiners, this list is returned:

mysql&gt; SELECT COUNT(application_examiner.entity_id) AS total,entity_name FROM application_examiner LEFT JOIN entity ON entity.entity_id = application_examiner.entity_id WHERE examiner_type="primary" GROUP BY entity_name ORDER BY total DESC; 
+-------+----------------------------+
| total | entity_name                |
+-------+----------------------------+
|  1922 | Bell, Kent                 |
|  1271 | Bell, Kent L               |
|  1163 | Para, Annette H            |
|  1061 | Hwu, June                  |
|  1048 | Grunberg, Anne Marie       |
|   705 | Para, Annette              |
|   594 | McCormick Ewoldt, Susan    |18/15
|   415 | McCormick Ewoldt, Susan B  |
|   336 | Haas, Wendy C.             |
|   328 | Haas, Wendy C              |
|   218 | Bell, Kent L.              |
|   186 | Haas, Wendy                |
|   114 | Grunberg, Anne             |
|    98 | Locker, Howard J.          |
|    82 | Locker, Howard             |
|    64 | Campell, Bruce R.          |
|    31 | Para, Annette H.           |
|    30 | Ewoldt, Susan McCormick    |
|    21 | McCormick Ewoldt, S. B     |
|    18 | McCormick Ewoldt, S. B.    |
|    16 | Locker, Howard J           |
|     9 | Kruse, David H             |
|     6 | McCormick-Ewoldt, S. B.    |
|     2 | Grunberg, Ann Marie        |
|     2 | McCormick Ewoldt, Susan B. |
|     2 | Krawczewicz Myers, Louanne |
|     1 | Haas, W. C.                |
|     1 | Bell., Kent L.             |
|     1 | Grundberg, Anne Marie      |
|     1 | Para, Annetta H            |
|     1 | Hass, Wendy                |
|     1 | Bell, Kentl L              |
|     1 | Wu, June                   |
|     1 | Ball, Kent                 |
|     1 | Para, Annertte H           |
|     1 | Hwu, Jane                  |
|     1 | McCormick Edwoldt, Susan   |
|     1 | Helmer, Georgia            |
|     1 | Bell, Kenbt L              |
|     1 | Campell, Bruce R           |
|     1 | Ewoldt, S. B McCormick     |
|     1 | McCormick-Ewoldt, S. B     |
|     1 | McCormick, S. B.           |
|     1 | Grunbeg, Anne Marie        |
|     1 | Para, Annnette H           |
|     1 | Campbell, Bruce            |
|     1 | Campbell, Bruce R.         |
|     1 | Hass, Wendy C.             |
|     1 | McCormick-Ewoldt, Susan B  |
|     1 | Campell, Btuce R.          |
|     1 | Grunsberg, Anne Marie      |
|     1 | McCormack Ewoldt, Susan B  |
|     1 | McCormick Ewoldt, Sysan B  |
+-------+----------------------------+
53 rows in set (0.03 sec)

Unfortunately, the list doesn’t deal too well with the misspellings and variations that we talked about in our last post.

If we only group patent examiners based on the first four characters of their last name (which wouldn’t always work, but will in this case) we come up with a much shorter list:

mysql&gt; select count(application_examiner.entity_id) as total,left(entity_name,4) as shortname, entity_name from application_examiner left join entity on entity.entity_id = application_examiner.entity_id where examiner_type="primary" group by shortname order by total desc;
+-------+-----------+----------------------------+
| total | shortname | entity_name                |
+-------+-----------+----------------------------+
|  3414 | Bell      | Bell, Kent                 |
|  1902 | Para      | Para, Annette H            |
|  1167 | Grun      | Grunberg, Anne Marie       |
|  1062 | McCo      | McCormick Ewoldt, Susan    |
|  1062 | Hwu,      | Hwu, June                  |
|   851 | Haas      | Haas, Wendy C.             |
|   196 | Lock      | Locker, Howard J.          |
|    68 | Camp      | Campell, Bruce R.          |
|    31 | Ewol      | Ewoldt, S. B McCormick     |
|     9 | Krus      | Kruse, David H             |
|     2 | Kraw      | Krawczewicz Myers, Louanne |
|     2 | Hass      | Hass, Wendy                |
|     1 | Ball      | Ball, Kent                 |
|     1 | Helm      | Helmer, Georgia            |
|     1 | Wu,       | Wu, June                   |
+-------+-----------+----------------------------+
15 rows in set (0.01 sec)

It’s still not perfect, though. You can see that Kent Bell was once listed as Kent Ball, Wendy Haas was Wendy Hass twice, and June Hwu was once listed as June Wu. If we manually adjust the list, we come up with this:

+-------+-----------+----------------------------+
| total | shortname | entity_name                |
+-------+-----------+----------------------------+
|  3415 | Bell      | Bell, Kent                 |
|  1902 | Para      | Para, Annette H            |
|  1167 | Grun      | Grunberg, Anne Marie       |
|  1093 | McCo      | McCormick Ewoldt, Susan    |
|  1063 | Hwu,      | Hwu, June                  |
|   853 | Haas      | Haas, Wendy C.             |
|   196 | Lock      | Locker, Howard J.          |
|    68 | Camp      | Campell, Bruce R.          |
|     9 | Krus      | Kruse, David H             |
|     2 | Kraw      | Krawczewicz Myers, Louanne |
|     1 | Helm      | Helmer, Georgia            |
+-------+-----------+----------------------------+
11 rows in set (0.01 sec)

Kent Bell has granted a whopping 34.9% of plant patents in the past decade, followed by 19.4% by Annette H. Para, followed by roughly 11% each for Anne Marie Grunberg, Susan McCormick Ewoldy, and June Hwu. Wendy Haas rounds out the top six with 8.7% of plant patent grants.

There aren’t nearly as many examiners as we expected, and they’re busy. They don’t only handle plant patents, either.

So, there you go! Those are the primary plant patent examiners at the USPTO over the past ten years. Now we need to edit our database upload scripts to consolidate the names automatically.

Fun parsing plant patents

As we mentioned in our non-tech blog, hort.net has been processing plant patents to try to find and understand commonalities and trends in the industry. Besides that, it’s fun.

Well, sort of fun. It would be a lot more fun if the patent data was consistent.

The United States Patent and Trademark Office (USPTO) releases patent data in a machine-parseable format called XML. The layout of the XML is specified by a Document Type Definition (DTD), but these change regularly. Normally these changes are minor, but sometimes they’re subtle and more major.

For example, consider this bit of XML:

<parties>
   <applicants>
      <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
         <addressbook>
            <last-name>Hofmann</last-name>
            <first-name>Birgit Christa</first-name>
         </addressbook>
      </applicant>
   </applicants>
</parties>

It later became this:

<us-parties>
   <us-applicants>
      <us-applicant sequence="001" app-type="applicant-inventor" designation="us-only">
         <addressbook>
            <last-name>Hofmann</last-name>
            <first-name>Birgit Christa</first-name>
         </addressbook>
      </us-applicant>
   </us-applicants>
</us-parties>

Can you spot the difference? At one point ‘us-‘ was prepended to the parties, applicants, and applicant XML tags, so any code that was loading things into the database was suddenly coming up empty.

It’s not a big deal, and things change over time, but it would be nice if the USPTO just converted all of their old documents to the new DTD.

Other issues are less technical and more policy-driven. Consider the case of patent examiner Susan B. McCormick-Ewoldt. Of the patents we’ve processed so far, she examined nineteen different plant patents that were granted, and on all nineteen her name appears differently. At least, we assume it’s the same person. Here are the variations:

mysql> SELECT entity_name, patent_id, sci_name FROM application_examiner LEFT JOIN entity ON application_examiner.entity_id = entity.entity_id LEFT JOIN patent ON application_examiner.application_id = patent.application_id WHERE entity_name LIKE "mccor%";
+----------------------------+-----------+-------------------------------------------------+
| entity_name                | patent_id | sci_name                                        |
+----------------------------+-----------+-------------------------------------------------+
| McCormick, Susan B.        | 15460     | Impatiens hawkeri 'Fisnics Sweet Red'           |
| McCormick-Ewoldt, S. B.    | 15488     | Prunus persica var. nucipersica 'GBN-One'       |
| McCormick-Ewoldt, Susan B. | 16270     | Malus pumila 'Fugachee Fuji'                    |
| McCormick-Ewoldt, S B      | 15496     | Prunus persica 'Calara'                         |
| McCormick-Ewoldt, S B.     | 15794     | Rosa hybrida 'POULac007'                        |
| McCormick, S. B.           | 17451     | Chrysanthemum x morifolium 'Elegant Yomarjorie' |
| McCormick Ewoldt, S. B.    | 19011     | Baptisia x variicolor 'Twilite'                 |
| McCormick, S. B            | 19098     | Phlox hybrida 'USPHL03M'                        |
| McCormick/Ewoldt, S. B.    | 18988     | Lobelia erinus 'Balwalila'                      |
| McCormick Ewoldt, S. B     | 19664     | Styrax japonicus 'Fragrant Fountain'            |
| McCormick Ewoldt, S B      | 19933     | Scoparia hybrid 'USSCO401-3'                    |
| McCormick-Ewoldt, S. B     | 20130     | Begonia x hiemalis 'Binos Pinky White'          |
| McCormick Ewoldt, Susan B  | 20634     | Cordyline australis 'Sunrise'                   |
| McCormick Ewoldt, Susan B. | 20113     | Petunia hybrid 'KLEPH07140'                     |
| McCormick-Ewoldt, Susan B  | 20626     | Penstemon hartwegii benth 'Peni Vio09'          |
| McCormack Ewoldt, Susan B  | 20901     | Pelargonium x hortorum 'Pacneon'                |
| McCormick Ewoldt, Sysan B  | 20809     | Geranium x cantabrigiense 'ABPP'                |
| McCormick Edwoldt, Susan   | 22972     | Rosa hybrida  'AUStobias'                       |
| McCormick Ewoldt, Susan    | 25207     | Mandevilla hybrida 'Sunparaoros'                |
+----------------------------+-----------+-------------------------------------------------+
19 rows in set (0.00 sec)

We have initials, ‘Sysan’ (presumably a typo), ‘Susan’, ‘Susan’ with initials, ‘McCormack’ (presumably a typo), a hyphenated last name, a last name with a slash, and a last name with two words. It becomes very difficult to automate processing of patents if there’s no consistency in values, and that’s something USPTO will have to deal with procedurally.

Fixing Mail::ClamAV to work with >= clamav-0.98.4

The latest version of clamav relies on OpenSSL, but libclamav doesn’t automatically intialize that connection. This patch we threw together for Mail-ClamAV-0.29 fixes the problem by calling cl_initialize_crypto() first.

*** ClamAV.pm.orig      2014-10-28 16:27:30.000000000 -0500
--- ClamAV.pm   2014-10-28 16:26:48.000000000 -0500
***************
*** 205,210 ****
--- 205,215 ----
      if (stat(path, &st) != 0)
          croak("%s does not exist: %s\n", path, strerror(errno));
  
+     if ((status = cl_initialize_crypto()) != CL_SUCCESS) { 
+        error(status);
+        return &PL_sv_undef; 
+     } 
+ 
      if ((status = cl_init(CL_INIT_DEFAULT)) != CL_SUCCESS) {
          error(status);
          return &PL_sv_undef;

 

No Verizon tracking at hort.net

We’re pretty horrified by the recent revelation that Verizon is adding tracking codes to its customers’ web browsing.

Verizon claims that this code is in a database that doesn’t get shared, but that’s not good enough. As the article we linked to mentions, any site that can tie a user’s personal information to that tracking code can sell that information and a database can be constructed. At that point, any traffic from a phone, even if run in a private or incognito mode, can be tied to the phone’s owner by the site that’s being visited unless the site use the HTTPS protocol.

We strongly advise that sites disable this header on their systems. At hort.net we have set this Apache directive:

RequestHeader unset X_UIDH early
RequestHeader unset HTTP_X_UIDH early

This ensures that the tracking code will be removed before any of our scripts run, so we won’t be able to tie Verizon’s tracking to any of our visitors (even accidentally).

If you use Verizon, try to only connect to sites using the HTTPS protocol (secure web sites) or ones that are known to remove that tracker.


Tin can on a string

What Verizon users may have to do to in the future to communicate safely.


Upgrading hort.net to Redis, part 1

We’ve been playing with Redis at hort.net over the past year. Its speed and simplicity make system performance better and programming easier — what used to be 30 lines of MySQL statements can be condensed into five Redis queries that shave 1/10th of a second off page loading times. And that’s just for one part of the site. Redis also scales to multiple servers more easily, so as hort.net grows it won’t be painful to keep up technically.

After successfully moving our sHORTurl service to Redis (and getting the aforementioned performance gains), we started looking for other ways to use it here.

hort.net is entirely written in custom Perl (through mod_perl) under Apache with a MySQL backend.  Redis would eliminate the latency in database queries, but because Redis is stored in memory we only wanted to use it in highly repetitive cases that didn’t involve complicated MySQL statements.

Identifying MySQL queries that could shift to Redis

I restarted MySQL with the –log option so that every database command would be logged in a file.  After about 21 hours, a short shell command was run to identify the most frequently queried databases:

    
$ grep SELECT /var/log/mysql/query-log | awk -F'FROM' '{print $2}' | awk '{print $1}' | sort | uniq -c | sort -n | tail -10
1101 authentication
1294 account.email
2579 splits
4025 image
4383 plant
7737 vote
15248 wp_options
26336 topstory
29944
50506 navigation

The number on the left shows how many queries the database named on the right received during that timeframe.

navigation and topstory are both displayed on every hort.net page, and navigation is actually used in multiple places on each page. wp_options is used by WordPress, and vote, plant, and image are used when images in the plant gallery are viewed (or voted upon). image and plant are fairly large databases that involve some complicated database queries so they wouldn’t make sense to move to Redis. But these three databases looked like great candidates:

  • navigation
  • topstory
  • vote

topstory was very straightforward.  topstory just contained a story_id, headline, url, date, and a bit of text in a database, and the most common query retrieved the most recent horticultural news articles posted on hort.net.  vote is more complicated because it uses the AVG() and STDDEV() MySQL functions to display higher-voting images first, and navigation was moderately complex because it used a weighting system to sort navigation links. Atlhough topstory didn’t give us the largest database hit, it seemed like the easiest to migrate.  So we started there.

Creating the Redis schema

First I had to develop a schema. The old MySQL database had this schema and these queries:

+-------------+------------------+------+-----+---------+----------------+
| Field       | Type             | Null | Key | Default | Extra          |
+-------------+------------------+------+-----+---------+----------------+
| story_id    | int(10) unsigned | NO   | MUL | NULL    | auto_increment |
| headline    | varchar(255)     | NO   | MUL |         |                |
| url         | varchar(255)     | NO   |     |         |                |
| description | text             | YES  |     | NULL    |                |
| date        | datetime         | NO   | MUL | NULL    |                |
+-------------+------------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)

# get last 8 headlines and URLs, displayed on every page
SELECT SQL_CACHE url, headline FROM topstory ORDER BY date DESC LIMIT 8
# display last six hours of headlines in RSS
SELECT SQL_CACHE DATE_FORMAT(DATE_SUB(date, INTERVAL 6 HOUR), “%a, %d %b %Y %T GMT”) FROM topstory WHERE date <= NOW() ORDER BY DATE DESC LIMIT 1
SELECT SQL_CACHE story_id, headline, url, description, DATE_FORMAT(DATE_SUB(date, INTERVAL 6 HOUR), “%a, %d %b %Y %T GMT”) FROM topstory WHERE date <= NOW() ORDER BY DATE DESC
# display all stories on the topstories page
SELECT SQL_CACHE headline, url, date FROM topstory WHERE date <= NOW() ORDER BY date DESC

Since speed was a primary goal, I had to make sure that our Redis datastore allowed us to easily sort our stories by date since every query relied on that. I ended up with this:

Hash s:h [story_id] [headline]
Hash s:u [story_id] [url]
Hash s:d [story_id] [description]
Hash s:e [story_id] [epoch time]
SortedSet s:t [epoch time] [story_id]
Set s:id [story_id]

The SortedSet uses the time in seconds since 1970 as our sorting field; with the ZREVRANGEBYSCORE command we can tell Redis to sort our ids by date and only return the last eight, if desired.  Once the list of ids is returned the script will use them to do a HMGET for the url, headline, and if necessary, description.

Before going any further, I wanted to time hort.net page loads before that part of the site was switched to Redis. I threw together a quick script that did 12 page loads five seconds apart, then averaged the load time. This should take other server activity into account, and it should also leverage MySQL caching to give us a true apples-to-apples comparison. The script was actually run regularly throughout the day and the values below represent the averages.

#!/bin/sh
# hort.net front page load timing script

total=0
for i in {1..12}
do
   seconds=`(time lynx -dump -source http://www.hort.net/ > /dev/null ) |& grep real | sed -r 's/s$|0m//g' | awk '{print $NF}'`
   echo $seconds" seconds"
   sleep 5s
   total=$(echo $total + $seconds | bc) 
done

echo -n "Average load time: "; echo $total / 12 | bc -l
$ ./timeit
0.183 seconds
0.195 seconds
0.198 seconds
0.202 seconds
0.162 seconds
0.172 seconds
0.188 seconds
0.185 seconds
0.189 seconds
0.193 seconds
0.184 seconds
0.175 seconds
Average load time: .18550000000000000000

As you can see, the front page loaded in .1855 seconds on average.

Next I wrote a script to copy the topstory database into Redis.  It’s a script intended for onetime use, and since stories are manually added I didn’t have to worry about losing anything while migrating. It was fairly straightforward, and once it was run it took .979 seconds to move the 615 records into Redis.

Now came the hardest part — converting scripts running on the site to use Redis instead. The queries had to be rewritten, although that actually wasn’t as horrible as expected. The simple query that displayed the eight most recent stories was replaced with three Redis queries from Perl:

# PLACE ids OF 8 MOST RECENT STORIES INTO AN ARRAY CALLED @ids
my @ids = $GLOBAL::RedisHandle->zrevrangebyscore('s:t', '+INF', '-INF', 'LIMIT', 0, 8);
# GRAB THE HEADLINES FOR THOSE 8 ids
my @headlines = $GLOBAL::RedisHandle->hmget('s:h', @ids);
# GRAB THE URLS FOR THOSE 8 ids
my @urls = $GLOBAL::RedisHandle->hmget('s:u', @ids);

How did it perform?

So how did the timing go? It shaved off about 1/100th of a second, so the gains were minimal. I suspect that once the navbar is changed we’ll see improved gains like we did for sHORTurl because MySQL will no longer be used for the design framework, so the main Web page and things like the mailing list archives won’t have MySQL overhead anywhere. Pages like the gallery won’t see any real improvement because they still rely on MySQL for other things that aren’t appropriate for Redis.

We’ll tackle the navbar in Part 2 and see how much of a difference that makes.

$ ./timeit
0.178 seconds
0.168 seconds
0.162 seconds
0.175 seconds
0.175 seconds
0.191 seconds
0.174 seconds
0.167 seconds
0.169 seconds
0.167 seconds
0.179 seconds
0.177 seconds
Average load time: .17350000000000000000