Big Unstructured Data

Have you ever tried to look up public records online, perhaps when looking to buy a house? It isn’t easy. Between working out where the data lives, how to get access to it, and how to read the different formats, you are looking at spending a lot of time. Not to mention that much of the information you want is probably not even digitized yet. Carl Malamud of Public.Resource.Org has been trying to spearhead a project to digitize government data from sources such as PACER, EDGAR, and the U.S. Patent Office. The project aims to start a “Federal Scanning Commission,” which would not only determine the scope of the effort but also make accessing and searching the information much easier. Scratch “much easier”; let’s go with *possible*.

The massive effort of scanning all of the information aside, let’s talk about searching through the information. Public information includes records like these:

  • Environmental records for suspected contamination, compliance and violation concerns, and more
  • Real estate information, including property reports, parcel maps, ownership traceability, tax information, and more
  • Aerial maps, including multi-year comparisons to see change

Let’s walk through the process, which you could apply to your own enterprise data.

  • Establish data models for the information. You need models not only for each type of information, but also for how the types link and relate to one another. For example, if you were looking to buy a new commercial property, you would want to see not only the property reports, but also the environmental records for the property and the surrounding area. Aerial maps showing how the property has changed over time would also help you make inferences about value and possible property issues.
  • Determine appropriate metadata for the information. You need to tag all of the newly digitized information with not only the type of data, but also keywords to aid in searching.
  • Account for variants of the information. For example, you could be searching for information on 1051 W. King St. However, that address is also legally known as 1051 W. 9th St. Search results should come back for both variants. Searches should also work whether West King Street or W King St is entered, and should help the user drill down if only King St is entered. The same of course applies to name information, product information, and more.
  • Establish indexes so that searches complete quickly even on a truly big data set.
  • Establish policies to determine when information can (and should) be available to the public. The policies should also include SLAs for the providing agencies on the completeness of the information and how well it conforms to the established standards (models, metadata, variants, and indexes).
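
To make the model, metadata, variant, and index steps a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Record` fields, the abbreviation and alias tables, and the `RecordIndex` class are hypothetical and do not come from Public.Resource.Org or any real records system. A production system would use much larger reference tables and a real search engine.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"w": "west", "e": "east", "n": "north", "s": "south",
                 "st": "street", "ave": "avenue", "rd": "road"}

# Hypothetical alias table: alternate legal street name -> canonical name.
STREET_ALIASES = {"west 9th street": "west king street"}


@dataclass
class Record:
    """Data model: one public record plus its search metadata."""
    record_id: str
    record_type: str                      # e.g. "property_report", "environmental"
    address: str
    keywords: set = field(default_factory=set)


def normalize_address(text: str) -> str:
    """Lowercase, drop punctuation, expand abbreviations, resolve street aliases."""
    tokens = [ABBREVIATIONS.get(t, t) for t in re.findall(r"[a-z0-9]+", text.lower())]
    number = ""
    if tokens and tokens[0].isdigit():    # split off a leading house number
        number, tokens = tokens[0], tokens[1:]
    street = " ".join(tokens)
    street = STREET_ALIASES.get(street, street)
    return f"{number} {street}".strip()


class RecordIndex:
    """Tiny in-memory index keyed by normalized address."""

    def __init__(self):
        self._by_address = defaultdict(list)  # normalized address -> records
        self._by_street = defaultdict(set)    # street name -> normalized addresses

    def add(self, record: Record) -> None:
        key = normalize_address(record.address)
        self._by_address[key].append(record)
        first, _, rest = key.partition(" ")
        street = rest if first.isdigit() and rest else key
        self._by_street[street].add(key)

    def search(self, query: str, record_type=None) -> list:
        """Exact lookup after normalization, optionally filtered by record type."""
        hits = self._by_address.get(normalize_address(query), [])
        return [r for r in hits if record_type is None or r.record_type == record_type]

    def suggest(self, query: str) -> list:
        """Drill-down: full addresses whose street name contains the partial query."""
        key = normalize_address(query)
        return sorted(addr for street, addrs in self._by_street.items()
                      if key in street for addr in addrs)


index = RecordIndex()
index.add(Record("prop-1", "property_report", "1051 W. King St.", {"commercial"}))
index.add(Record("env-1", "environmental", "1051 West 9th Street", {"compliance"}))

both = index.search("1051 w king st")   # finds both records, whichever name was filed
options = index.suggest("King St")      # ["1051 west king street"]
```

Normalizing once at indexing time and once at query time means every variant lands on the same key, so the lookup itself stays a plain dictionary access no matter how many spellings exist.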

Imagine coordinating this information across federal government agencies, state governments, and county governments, all of which have varying access policies, enabling technologies, and levels of information management maturity. If Public.Resource.Org can make progress on this problem, surely you can make progress within your own company!
