Seasonal photo, (c) 2006 Christopher P. Lindsey, All Rights Reserved: do not copy

Fun parsing plant patents

As we mentioned in our non-tech blog, hort.net has been processing plant patents to try to find and understand commonalities and trends in the industry. Besides that, it’s fun.

Well, sort of fun. It would be a lot more fun if the patent data were consistent.

The United States Patent and Trademark Office (USPTO) releases patent data as XML, a machine-parseable format. The layout of the XML is specified by a Document Type Definition (DTD), but the DTD changes regularly. Most of these changes are minor, but some look subtle and still break things in a major way.

For example, consider this bit of XML:

<parties>
   <applicants>
      <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
         <addressbook>
            <last-name>Hofmann</last-name>
            <first-name>Birgit Christa</first-name>
         </addressbook>
      </applicant>
   </applicants>
</parties>

It later became this:

<us-parties>
   <us-applicants>
      <us-applicant sequence="001" app-type="applicant-inventor" designation="us-only">
         <addressbook>
            <last-name>Hofmann</last-name>
            <first-name>Birgit Christa</first-name>
         </addressbook>
      </us-applicant>
   </us-applicants>
</us-parties>

Can you spot the difference? At some point ‘us-’ was prepended to the parties, applicants, and applicant XML tags, so any code that looked for the old tag names while loading the database suddenly came up empty.

It’s not a big deal, and things change over time, but it would be nice if the USPTO just converted all of their old documents to the new DTD.
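Until that happens, a loader can simply look for both spellings. Here is an illustrative sketch, assuming Python and its standard ElementTree module. The function name find_applicants and the trimmed-down sample documents are hypothetical, and real USPTO files wrap these elements in much larger documents.

```python
import xml.etree.ElementTree as ET

# Old and new spellings of the applicant element; newer DTDs add "us-".
APPLICANT_TAGS = ("applicant", "us-applicant")

def find_applicants(root):
    """Return (last_name, first_name) pairs from either DTD layout."""
    names = []
    for tag in APPLICANT_TAGS:
        # iter() walks the whole tree, so the wrapper element names
        # (parties vs. us-parties) don't need to be spelled out.
        for applicant in root.iter(tag):
            book = applicant.find("addressbook")
            if book is None:
                continue
            names.append((
                book.findtext("last-name", default=""),
                book.findtext("first-name", default=""),
            ))
    return names

# Trimmed-down versions of the two layouts shown above.
OLD_XML = """<parties><applicants><applicant sequence="001">
<addressbook><last-name>Hofmann</last-name>
<first-name>Birgit Christa</first-name></addressbook>
</applicant></applicants></parties>"""

NEW_XML = """<us-parties><us-applicants><us-applicant sequence="001">
<addressbook><last-name>Hofmann</last-name>
<first-name>Birgit Christa</first-name></addressbook>
</us-applicant></us-applicants></us-parties>"""

for xml_text in (OLD_XML, NEW_XML):
    print(find_applicants(ET.fromstring(xml_text)))
```

Both documents yield the same single (last name, first name) pair, so the database load no longer depends on which DTD a given file follows.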

Other issues are less technical and more policy-driven. Consider the case of patent examiner Susan B. McCormick-Ewoldt. Of the patents we’ve processed so far, she examined nineteen plant patents that were granted, and her name is written differently on every one of them. At least, we assume it’s the same person. Here are the variations:

mysql> SELECT entity_name, patent_id, sci_name FROM application_examiner LEFT JOIN entity ON application_examiner.entity_id = entity.entity_id LEFT JOIN patent ON application_examiner.application_id = patent.application_id WHERE entity_name LIKE "mccor%";
+----------------------------+-----------+-------------------------------------------------+
| entity_name                | patent_id | sci_name                                        |
+----------------------------+-----------+-------------------------------------------------+
| McCormick, Susan B.        | 15460     | Impatiens hawkeri 'Fisnics Sweet Red'           |
| McCormick-Ewoldt, S. B.    | 15488     | Prunus persica var. nucipersica 'GBN-One'       |
| McCormick-Ewoldt, Susan B. | 16270     | Malus pumila 'Fugachee Fuji'                    |
| McCormick-Ewoldt, S B      | 15496     | Prunus persica 'Calara'                         |
| McCormick-Ewoldt, S B.     | 15794     | Rosa hybrida 'POULac007'                        |
| McCormick, S. B.           | 17451     | Chrysanthemum x morifolium 'Elegant Yomarjorie' |
| McCormick Ewoldt, S. B.    | 19011     | Baptisia x variicolor 'Twilite'                 |
| McCormick, S. B            | 19098     | Phlox hybrida 'USPHL03M'                        |
| McCormick/Ewoldt, S. B.    | 18988     | Lobelia erinus 'Balwalila'                      |
| McCormick Ewoldt, S. B     | 19664     | Styrax japonicus 'Fragrant Fountain'            |
| McCormick Ewoldt, S B      | 19933     | Scoparia hybrid 'USSCO401-3'                    |
| McCormick-Ewoldt, S. B     | 20130     | Begonia x hiemalis 'Binos Pinky White'          |
| McCormick Ewoldt, Susan B  | 20634     | Cordyline australis 'Sunrise'                   |
| McCormick Ewoldt, Susan B. | 20113     | Petunia hybrid 'KLEPH07140'                     |
| McCormick-Ewoldt, Susan B  | 20626     | Penstemon hartwegii benth 'Peni Vio09'          |
| McCormack Ewoldt, Susan B  | 20901     | Pelargonium x hortorum 'Pacneon'                |
| McCormick Ewoldt, Sysan B  | 20809     | Geranium x cantabrigiense 'ABPP'                |
| McCormick Edwoldt, Susan   | 22972     | Rosa hybrida  'AUStobias'                       |
| McCormick Ewoldt, Susan    | 25207     | Mandevilla hybrida 'Sunparaoros'                |
+----------------------------+-----------+-------------------------------------------------+
19 rows in set (0.00 sec)

We have initials, ‘Sysan’ (presumably a typo), ‘Susan’, ‘Susan’ with initials, ‘McCormack’ (presumably a typo), a hyphenated last name, a last name with a slash, and a last name with two words. It becomes very difficult to automate the processing of patents if there’s no consistency in the values, and that’s something the USPTO will have to deal with procedurally.
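In the meantime, some of this can be papered over in code. The sketch below is one possible approach and only a rough heuristic: it reduces each name to the first word of the surname plus the first initial of the given name, then uses Python's difflib to tolerate small typos. The function names, the 0.85 threshold, and the 'Smith, John' entry are all made up for illustration, and a rule this loose would happily merge two different examiners who share a surname and a first initial.

```python
import re
from difflib import SequenceMatcher

def name_key(entity_name):
    """Reduce 'Surname, Given' to (first surname word, first initial)."""
    surname, _, given = entity_name.partition(",")
    # Splitting on non-letters treats ' ', '-' and '/' the same way.
    parts = re.findall(r"[a-z]+", surname.lower())
    first_surname = parts[0] if parts else ""
    return first_surname, given.strip()[:1].lower()

def same_person(a, b, threshold=0.85):
    """Guess whether two raw names refer to the same examiner."""
    surname_a, initial_a = name_key(a)
    surname_b, initial_b = name_key(b)
    if initial_a != initial_b:
        return False
    # A similarity ratio lets 'McCormack' match 'McCormick'.
    ratio = SequenceMatcher(None, surname_a, surname_b).ratio()
    return ratio >= threshold

def group_names(names, threshold=0.85):
    """Greedily cluster names, comparing against each group's first entry."""
    groups = []
    for name in names:
        for group in groups:
            if same_person(group[0], name, threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

variants = [
    "McCormick, Susan B.",
    "McCormick-Ewoldt, S. B.",
    "McCormick/Ewoldt, S. B.",
    "McCormack Ewoldt, Susan B",
    "McCormick Ewoldt, Sysan B",
    "McCormick Edwoldt, Susan",
    "Smith, John",  # hypothetical unrelated examiner, not from our data
]

for group in group_names(variants):
    print(len(group), group[0])
```

With these inputs the six McCormick variants land in one group and the made-up Smith entry in another. Anything merged this way should still be reviewed by a person before it is written back to the database.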

Announcing our plant patent tracker

PP25095, submitted image

We’ve been interested in plant patents for a while at hort.net.  It’s fascinating to track them and see what’s in the pipes for floriculture, crop sciences, or landscaping.  We realized that it was becoming tedious to monitor this stuff using the standard tools that are out there, so we started populating our own database with plant patent data.

Every Tuesday we report on the past week’s issued plant patents.  Right now we just give you the names of the plants that patents were issued for, but we’re extending our database to include:

  • assignee
  • inventor (breeder/discoverer)
  • attorney
  • patent examiner
  • applicant city/state/country
  • expiration date
  • scientific name
  • all plant patents available electronically, back to December 1976

We think it would be interesting to be able to search for all of the patents issued to a given breeder, see which countries are specializing in hybridization of what species, etc.

What features would you like to see?  Comment below, and check back every Tuesday to see the latest updates!

Fixing Mail::ClamAV to work with >= clamav-0.98.4

The latest version of ClamAV relies on OpenSSL, but libclamav doesn’t automatically initialize its crypto support. This patch we threw together for Mail-ClamAV-0.29 fixes the problem by calling cl_initialize_crypto() before cl_init().

*** ClamAV.pm.orig      2014-10-28 16:27:30.000000000 -0500
--- ClamAV.pm   2014-10-28 16:26:48.000000000 -0500
***************
*** 205,210 ****
--- 205,215 ----
      if (stat(path, &st) != 0)
          croak("%s does not exist: %s\n", path, strerror(errno));
  
+     if ((status = cl_initialize_crypto()) != CL_SUCCESS) { 
+        error(status);
+        return &PL_sv_undef; 
+     } 
+ 
      if ((status = cl_init(CL_INIT_DEFAULT)) != CL_SUCCESS) {
          error(status);
          return &PL_sv_undef;

 

No Verizon tracking at hort.net

We’re pretty horrified by the recent revelation that Verizon is adding tracking codes to its customers’ web browsing.

Verizon claims that this code is in a database that doesn’t get shared, but that’s not good enough. As the article we linked to mentions, any site that can tie a user’s personal information to that tracking code can sell that information, and a database can be constructed from it. At that point, any traffic from a phone, even in a private or incognito mode, can be tied to the phone’s owner by the site being visited unless the site uses the HTTPS protocol.

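We strongly advise that sites strip this header on their systems. At hort.net we have set these Apache directives (they require mod_headers). Verizon sends the header as X-UIDH; the underscore spelling is also removed because scripts see both as the same HTTP_X_UIDH variable:

RequestHeader unset X-UIDH early
RequestHeader unset X_UIDH early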

This ensures that the tracking code will be removed before any of our scripts run, so we won’t be able to tie Verizon’s tracking to any of our visitors (even accidentally).
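Where the web server configuration is out of reach, the same idea can be applied inside the application. The following is a minimal sketch that assumes a Python WSGI application, which is an assumption made purely for illustration; the StripUIDH and demo_app names are hypothetical. WSGI servers expose the X-UIDH request header to the application as the HTTP_X_UIDH key, so removing that key before the application runs has the same effect.

```python
class StripUIDH:
    """WSGI middleware that drops the UIDH tracking header."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # The X-UIDH request header arrives as HTTP_X_UIDH in environ.
        environ.pop("HTTP_X_UIDH", None)
        return self.app(environ, start_response)

def demo_app(environ, start_response):
    """Tiny app that reports whether it can still see the tracker."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    seen = "HTTP_X_UIDH" in environ
    return [b"tracked" if seen else b"clean"]

# Wrap the application once; every request then passes through the filter.
application = StripUIDH(demo_app)
```

Wrapping the outermost application object keeps the tracking code away from every handler behind it, which is the application-level counterpart of running the Apache directive early.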

If you use Verizon, try to only connect to sites using the HTTPS protocol (secure web sites) or ones that are known to remove that tracker.


Tin can on a string

What Verizon users may have to do in the future to communicate safely.