Fun parsing plant patents

As we mentioned in our non-tech blog, we have been processing plant patents to find and understand commonalities and trends in the industry. Besides that, it’s fun.

Well, sort of fun. It would be a lot more fun if the patent data were consistent.

The United States Patent and Trademark Office (USPTO) releases patent data as XML, a machine-parseable format. The layout of the XML is specified by a Document Type Definition (DTD), but the DTD changes regularly. Most of the changes are minor, but some are subtle and still have major consequences for code that reads the data.

For example, consider this bit of XML:

      <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
            <first-name>Birgit Christa</first-name>

It later became this:

      <us-applicant sequence="001" app-type="applicant-inventor" designation="us-only">
            <first-name>Birgit Christa</first-name>

Can you spot the difference? At one point ‘us-’ was prepended to the parties, applicants, and applicant XML tags, so any code that looked for those elements under their old names to load them into the database suddenly came up empty.

It’s not a big deal, and things change over time, but it would be nice if the USPTO just converted all of their old documents to the new DTD.
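Until that happens, a loader can simply accept both spellings of the tag. The following is a minimal Python sketch of that idea, not the code we actually run: the function name `applicant_first_names` and the `<doc>` wrapper element in the samples are made up for illustration, and real USPTO files have far more structure around these elements.

```python
import xml.etree.ElementTree as ET

# Both spellings of the applicant tag, before and after the DTD change.
APPLICANT_TAGS = {"applicant", "us-applicant"}


def applicant_first_names(xml_text):
    """Return the first-name text of each applicant, whichever tag name is used."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():  # walks the tree in document order
        if element.tag in APPLICANT_TAGS:
            # Search at any depth, since the nesting may differ between DTDs.
            first = element.findtext(".//first-name")
            if first:
                names.append(first.strip())
    return names


# Hypothetical, heavily trimmed samples; <doc> is an illustrative wrapper only.
OLD_STYLE = """<doc>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
    <first-name>Birgit Christa</first-name>
  </applicant>
</doc>"""

NEW_STYLE = """<doc>
  <us-applicant sequence="001" app-type="applicant-inventor" designation="us-only">
    <first-name>Birgit Christa</first-name>
  </us-applicant>
</doc>"""

if __name__ == "__main__":
    print(applicant_first_names(OLD_STYLE))
    print(applicant_first_names(NEW_STYLE))
```

Keeping the accepted tag names in one set means the next rename only needs a one-line change, assuming the rest of the element’s layout stays the same.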

Other issues are less technical and more policy-driven. Consider the case of patent examiner Susan B. McCormick-Ewoldt. Of the patents we’ve processed so far, she examined nineteen different plant patents that were granted, and on all nineteen her name appears differently. At least, we assume it’s the same person. Here are the variations:

mysql> SELECT entity_name, patent_id, sci_name
    -> FROM application_examiner
    -> LEFT JOIN entity ON application_examiner.entity_id = entity.entity_id
    -> LEFT JOIN patent ON application_examiner.application_id = patent.application_id
    -> WHERE entity_name LIKE "mccor%";
+----------------------------+-----------+-------------------------------------------------+
| entity_name                | patent_id | sci_name                                        |
+----------------------------+-----------+-------------------------------------------------+
| McCormick, Susan B.        | 15460     | Impatiens hawkeri 'Fisnics Sweet Red'           |
| McCormick-Ewoldt, S. B.    | 15488     | Prunus persica var. nucipersica 'GBN-One'       |
| McCormick-Ewoldt, Susan B. | 16270     | Malus pumila 'Fugachee Fuji'                    |
| McCormick-Ewoldt, S B      | 15496     | Prunus persica 'Calara'                         |
| McCormick-Ewoldt, S B.     | 15794     | Rosa hybrida 'POULac007'                        |
| McCormick, S. B.           | 17451     | Chrysanthemum x morifolium 'Elegant Yomarjorie' |
| McCormick Ewoldt, S. B.    | 19011     | Baptisia x variicolor 'Twilite'                 |
| McCormick, S. B            | 19098     | Phlox hybrida 'USPHL03M'                        |
| McCormick/Ewoldt, S. B.    | 18988     | Lobelia erinus 'Balwalila'                      |
| McCormick Ewoldt, S. B     | 19664     | Styrax japonicus 'Fragrant Fountain'            |
| McCormick Ewoldt, S B      | 19933     | Scoparia hybrid 'USSCO401-3'                    |
| McCormick-Ewoldt, S. B     | 20130     | Begonia x hiemalis 'Binos Pinky White'          |
| McCormick Ewoldt, Susan B  | 20634     | Cordyline australis 'Sunrise'                   |
| McCormick Ewoldt, Susan B. | 20113     | Petunia hybrid 'KLEPH07140'                     |
| McCormick-Ewoldt, Susan B  | 20626     | Penstemon hartwegii benth 'Peni Vio09'          |
| McCormack Ewoldt, Susan B  | 20901     | Pelargonium x hortorum 'Pacneon'                |
| McCormick Ewoldt, Sysan B  | 20809     | Geranium x cantabrigiense 'ABPP'                |
| McCormick Edwoldt, Susan   | 22972     | Rosa hybrida  'AUStobias'                       |
| McCormick Ewoldt, Susan    | 25207     | Mandevilla hybrida 'Sunparaoros'                |
+----------------------------+-----------+-------------------------------------------------+
19 rows in set (0.00 sec)

We have initials with and without periods, ‘Susan’ alone, ‘Susan’ with a middle initial, ‘Sysan’ (presumably a typo), ‘McCormack’ and ‘Edwoldt’ (presumably typos as well), a hyphenated last name, a last name with a slash, a last name written as two words, and a last name cut down to ‘McCormick’. It becomes very difficult to automate the processing of patents when values are this inconsistent, and that’s something the USPTO will have to deal with procedurally.
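In the meantime, some of this variation can be collapsed in code. The sketch below is one possible heuristic, not what our pipeline does: it lowercases the surname, treats hyphens and slashes as spaces, keeps only the first given initial, and uses Python’s `difflib.SequenceMatcher` to tolerate small typos. The function names and the 0.85 threshold are assumptions chosen for this example, and a rule this loose could easily merge two different people, so the matches it produces would still need a human check.

```python
import re
from difflib import SequenceMatcher


def normalize_name(raw):
    """Split 'Surname, Given' into (lowercase surname words, first given initial)."""
    surname, _, given = raw.partition(",")
    # Hyphens, slashes, and runs of whitespace all separate surname words.
    surname_words = tuple(re.sub(r"[-/\s]+", " ", surname).strip().lower().split())
    # Drop periods and other punctuation, then keep only the first initial.
    given_words = re.sub(r"[^A-Za-z\s]", " ", given).split()
    initial = given_words[0][0].lower() if given_words else ""
    return surname_words, initial


def similar(a, b, threshold=0.85):
    """True when two words are close enough to differ only by a small typo."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def same_person(a, b):
    """Heuristic guess at whether two examiner name strings refer to one person."""
    (words_a, initial_a), (words_b, initial_b) = normalize_name(a), normalize_name(b)
    if not words_a or not words_b or initial_a != initial_b:
        return False
    # zip stops at the shorter surname, so 'McCormick' can match 'McCormick Ewoldt'.
    return all(similar(x, y) for x, y in zip(words_a, words_b))


# A few of the variants from the query output above.
VARIANTS = [
    "McCormick, Susan B.",
    "McCormick-Ewoldt, S B",
    "McCormick/Ewoldt, S. B.",
    "McCormack Ewoldt, Susan B",
    "McCormick Ewoldt, Sysan B",
    "McCormick Edwoldt, Susan",
]

if __name__ == "__main__":
    reference = "McCormick-Ewoldt, Susan B."
    for variant in VARIANTS:
        print(variant, same_person(reference, variant))
```

Matching on the first initial alone is what lets ‘Sysan’ through, which also shows the weakness of the approach: any examiner whose surname starts with ‘McCormick’ and whose given name starts with ‘S’ would be lumped in as well.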
