A Novel Algorithm for Data Mining Disparate Databases via Boolean Logic

In my nearly 25 years working as a contractor for the EPA, I developed only one thing that might be of lasting importance. Since I left the EPA in 2009, the agency has "modernized" and replaced the work I did, and given that I'm the only living person who remembers the algorithm, unless I write it down, it will be lost forever. Anyone reading this may consider this logic in the public domain, and my "gift" to humanity.

Background

In 2010 I wrote:

"In March, 1989, the Exxon Valdez ran aground in Prince William Sound, Alaska, spilling 11 million gallons of oil into Prudhoe Bay. It was the biggest environmental disaster in history.
 
Shortly afterwards, the White House called the Environmental Protection Agency, and wanted to know if Exxon could be trusted to clean up their own spill. The EPA's response? "We have no way to answer this question."
 
The White House was not happy, and EPA's inability to respond was reported as a Material Weakness under FMFIA (the Federal Managers' Financial Integrity Act). EPA was now under direct orders to correct this problem, and to do so immediately.
 
In January, 1990, EPA began developing IDEA (Integrated Data for Enforcement Analysis), a high-powered retrieval engine that would be able to integrate data from EPA's numerous, unconnected enforcement databases, presenting it to the user in a single report. Many people said it couldn't be done.
 
In February, 1991, just thirteen months later, the first version of IDEA was completed, and the system went into production. Using only a handful of people to design and implement IDEA, EPA had achieved one of the greatest success stories in its history.
 
At first, IDEA was only used internally by EPA enforcement personnel to target polluters, but it quickly became clear that the vast majority of the information should be made available to anyone who wanted it.
 
EPA took the bold step of developing a powerful, easy-to-use, interactive web interface, making IDEA-gathered data available to the general public in real-time, at no charge. In 2002, ECHO (Enforcement and Compliance History Online) went into production.
 
For years, EPA had a high-performance engine with IDEA. Now, with ECHO, they had placed that engine into a sports car, and handed everyone the keys.
Today, in 2010, ECHO is the most powerful environmental data-retrieval system in the world. The numbers speak for themselves: on any given day, over 4,000 real-time retrievals are performed. This year, there will be well over 1,000,000 - all completely free to users.
 
IDEA and ECHO are shining examples of the federal government at its best. Both were developed and are maintained on an incredibly small budget, yet have proven to be the workhorses that keep EPA performing, year after year."
 
The Problem
 
The EPA had different departments (Air (Clean Air Act), Water (Clean Water Act), etc.), each of which maintained its own database (using different database software), and none of which communicated with the others. After the Valdez accident, and because each database was its own separate entity, run by a separate Division, the EPA was completely unable to answer simple-sounding queries such as: "Show me all EPA-regulated facilities with violations in both Air and Water."
 
The Solution
 
Given the disparate databases, and the requirement to logically unify them, it was necessary to place them into a standard format. Robert Greenspun, the project supervisor (and father of this project - I was Scottie Pippen to his Michael Jordan), brought me in to help him write our own database, in a single consistent format, in order to handle cross-database queries. (A special mention also to Charles Marks and Lloyd Andrew, who were integral to this project.)

Fortunately, environmental data doesn't change rapidly (or, at least, it didn't back in 1990), so we had the luxury of performing a monthly extract of each database and putting it into our own standard format. This was itself a laborious and time-consuming process, but given the White House's deadline, it had to be done.

All EPA-regulated facilities (which include such diverse entities as oil refineries, chemical plants, and even dry cleaners) share a unifying field known as the EPA-ID - a unique identifier assigned to each facility, regardless of which database it was in - and it was this common identifier which made it possible to perform cross-database queries. After the extracts, we still had disparate databases, but they were in a standardized format and had a common identifier - now, all that remained was figuring out how to perform high-speed retrievals, in real time.
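To make the idea concrete, here is a toy sketch (the records, field names, and EPA-ID values are invented for illustration, not the real extract layout) of how a shared EPA-ID key makes the cross-database question answerable:

```python
# Toy picture of the monthly standardized extracts: each database keeps
# its own fields, but every record carries the EPA-ID, and that shared
# key is what makes cross-database queries possible. All records and
# field names here are invented.

air = {
    "VAD000012345": {"air_violation": True},
    "AKD000067890": {"air_violation": False},
}
water = {
    "VAD000012345": {"water_violation": True},
}

# "Show me all EPA-regulated facilities with violations in both Air and Water."
both = [epa_id for epa_id, rec in air.items()
        if rec["air_violation"]
        and water.get(epa_id, {}).get("water_violation", False)]
print(both)  # ['VAD000012345']
```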
 
Since one of our goals was to allow the general public to query on anything they wanted, we needed to develop an interface - a Parser - so that someone who works for Greenpeace, for example, could find out which facilities were "High-Priority Violators," i.e., which facilities had severe violations in more than one area (Air, Water, Toxic Releases, Hazardous Waste, etc.). Before this system was implemented, it would have been entirely possible for one fenced-in chemical plant to have severe violations with 1) a smokestack, 2) a pipe spilling chemicals into a stream, and 3) a hazardous-waste burial pit - all on one, single property - and nobody would have had any means to discover it.
 
I wrote the Parser myself, but won't go into any detail, since any trained programmer can write a Parser. Ours could handle complex, cross-database queries of the form: (DATABASE1.ELEMENT EQ VALUE AND DATABASE1.ELEMENT GT 20100630) OR (DATABASE2.ELEMENT NE 3 AND NOT DATABASE4.ELEMENT EQ "Y") - in other words, it was a fairly generic Parser that could handle Boolean Logic (AND, OR, NOT) with any indexed database elements. The Parser converted these simple commands into Postfix Notation for indexed retrieval.
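Since the original Parser is lost, here is a minimal sketch of the standard shunting-yard approach to the same job - a tokenized Boolean query converted to postfix notation. The token names and precedence table are my assumptions, not the original design:

```python
# Minimal shunting-yard conversion of a tokenized Boolean query to
# postfix notation. A generic reconstruction, not the original EPA
# Parser; precedences and token handling are assumptions.

PRECEDENCE = {"NOT": 3, "AND": 2, "OR": 1}

def to_postfix(tokens):
    """Operands stand in for Expressions like DATABASE1.ELEMENT EQ VALUE."""
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        elif tok == "NOT":
            stack.append(tok)  # unary and right-associative: just push
        elif tok in PRECEDENCE:
            # AND and OR are left-associative: pop operators of equal
            # or higher precedence before pushing this one.
            while (stack and stack[-1] != "(" and
                   PRECEDENCE.get(stack[-1], 0) >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # an indexed-retrieval Expression
    while stack:
        output.append(stack.pop())
    return output

# ( A AND B ) OR ( C AND D )  ->  ['A', 'B', 'AND', 'C', 'D', 'AND', 'OR']
print(to_postfix(["(", "A", "AND", "B", ")", "OR",
                  "(", "C", "AND", "D", ")"]))
```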
 
The Algorithm
 
Since we only had about one year to finish the entire project, the original index-retrieval algorithm was relatively primitive and slow. In the mid-to-late 1990s, I developed a novel, high-powered algorithm which could handle the one thing that was dogging our system with false positives - namely, queries of the form:
 
( A AND B ) OR ( C AND D ) <--- If you don't know why this is problematic, it's probably best to stop here.
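For readers who stayed: my reading of the trap, illustrated with invented toy data, is that rolling each Expression up to the facility level before performing the AND can match a facility even when no single detail record satisfies both terms at once:

```python
# Invented toy data: (facility_id, matches_A, matches_B), one row per
# inspection. Illustrates the false-positive trap as I read it.

inspections = [
    ("FAC-1", True,  False),  # A holds on one of FAC-1's inspections...
    ("FAC-1", False, True),   # ...B holds on a different one
    ("FAC-2", True,  True),   # A and B hold on the same inspection
]

# Wrong order: convert each term up to the facility level, then AND.
fac_a = {f for f, a, b in inspections if a}
fac_b = {f for f, a, b in inspections if b}
print(fac_a & fac_b)  # {'FAC-1', 'FAC-2'} - FAC-1 is a false positive

# Right order: AND at the inspection level, then convert upward.
print({f for f, a, b in inspections if a and b})  # {'FAC-2'}
```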
 
Each "Expression" (as we termed them) was of the form DATABASE.ELEMENT EQ VALUE (all Boolean operators were handled, including NOT at the beginning of an Expression).
 
For each Expression, an indexed retrieval was performed and the result stored as a binary string - a series of 1s and 0s, as long as the number of records in the database. For example, if the Air Database had 33,000 facilities, the binary string would be 33,000 bits long.
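A minimal sketch of that representation (the data and predicate are invented, and the real system built these strings from indexes rather than a linear scan):

```python
# One bit per record at a given database level, set to 1 where the
# Expression matches - e.g., a 33,000-bit string for a 33,000-facility
# Air database. Data and predicate are invented.

def retrieve(records, predicate):
    return [1 if predicate(r) else 0 for r in records]

air_facility_states = ["VA", "AK", "MD", "AK"]  # 4 records -> 4 bits
print(retrieve(air_facility_states, lambda s: s == "AK"))  # [0, 1, 0, 1]
```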
 
Note that, in a given database, there were multiple levels, which generally lent themselves to a hierarchical structure - for example, in the Air Database, one facility could have many inspections, one inspection could have many violations, one violation could have many repeat visits by inspectors, etc. Given this, the top level in the Air Database might have 33,000 entries, the next level down (inspections) might have 75,000, and so on. This is but one example, but it should adequately describe the situation as a whole.

Performing Boolean operations between Expressions required all binary strings to be at the same level, i.e., of the same length. This required indexed conversion tables that I won't go into, but they were simple (LevelX, LevelY) couplets. Since memory was at a premium in the 1990s, level conversion was a costly operation in terms of computer resources - especially when you needed to convert gigantic binary strings that were over one million bits long; nevertheless, you cannot perform Boolean operations on "Apples" AND "Oranges" without first converting them upwards to "Fruits."
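Here is a small sketch of an upward conversion driven by (child, parent) couplets, in the spirit of the conversion tables mentioned above (the mapping and sizes are invented):

```python
# Upward level conversion: OR each child-level bit into its parent's
# slot, so "any inspection of this facility matched" becomes a
# facility-level bit. The (child, parent) couplets are invented.

def convert_up(child_bits, child_to_parent, n_parents):
    parent_bits = [0] * n_parents
    for child_idx, bit in enumerate(child_bits):
        if bit:
            parent_bits[child_to_parent[child_idx]] = 1
    return parent_bits

child_to_parent = [0, 0, 1, 2, 2]   # 5 inspections under 3 facilities
inspection_hits = [0, 1, 0, 0, 1]   # inspections matching some query
print(convert_up(inspection_hits, child_to_parent, 3))  # [1, 0, 1]
```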
 
We were fortunate that IBM's computers performed Boolean operations directly in hardware, rather than using valuable CPU time (or, worse, disk retrievals), so we took advantage of this fact.
 
Here's the algorithm - an entire year of work, condensed into several instructions:
 
1) For each AND, ensure the binary strings are at the same Level, perform the AND, and carry forward one binary string.
2) For each OR, do nothing - carry forward both binary strings (if, by chance, they are at the same Level, it's best to perform the OR and carry forward one binary string instead of two).
3) For each NOT, perform the NOT, and carry forward one binary string (this is a controversial step, since bits representing non-existent data will be turned on, but these will be resolved by the conversions in the next step, i.e., they will not survive the conversion).
4) At the end of the query, convert all remaining binary strings (there can be many, many binary strings, given that OR operations carry forward two strings instead of one) upward to the lowest level they all share (generally, the EPA-ID level), and you have your answer. After this step, you can OR everything into the same binary string for convenience.
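Here is a condensed, runnable sketch of the four steps as a postfix evaluator, using a two-level hierarchy. A partial result is a dict mapping level to bit string (the "set of binary strings at different levels" described below); the hierarchy, the data, the distribution of AND across sets, and all names are my assumptions, not the original IDEA code:

```python
# Sketch of the four steps over two levels: "insp" (detail) and "fac"
# (top, standing in for the EPA-ID level). A partial result is
# {level: bit list}. Everything here is illustrative.

INSP_TO_FAC = [0, 0, 1, 2, 2]  # inspection i belongs to facility ...
N_FAC = 3

def up(bits):  # upward conversion: inspection level -> facility level
    out = [0] * N_FAC
    for i, b in enumerate(bits):
        if b:
            out[INSP_TO_FAC[i]] = 1
    return out

def to_fac(level, bits):
    return bits if level == "fac" else up(bits)

def evaluate(postfix):
    stack = []  # each item is a set of strings: {level: bits}
    for tok in postfix:
        if tok == "AND":
            right, left = stack.pop(), stack.pop()
            # Step 1: AND distributes over each set's deferred ORs, so
            # AND every pair, raising mixed-level pairs to "fac" first.
            out = {}
            for la, xa in left.items():
                for lb, xb in right.items():
                    if la == lb:
                        lvl, bits = la, [p & q for p, q in zip(xa, xb)]
                    else:
                        lvl = "fac"
                        bits = [p & q for p, q in
                                zip(to_fac(la, xa), to_fac(lb, xb))]
                    out[lvl] = ([p | q for p, q in zip(out[lvl], bits)]
                                if lvl in out else bits)
            stack.append(out)
        elif tok == "OR":
            right, left = stack.pop(), stack.pop()
            # Step 2: carry both sets forward, merging only strings that
            # already share a level - never convert levels for an OR.
            out = dict(left)
            for lvl, bits in right.items():
                out[lvl] = ([p | q for p, q in zip(out[lvl], bits)]
                            if lvl in out else bits)
            stack.append(out)
        elif tok == "NOT":
            # Step 3: NOT every string in the set; any non-existent bits
            # this turns on are meant to die in the final conversion.
            operand = stack.pop()
            stack.append({lvl: [1 - p for p in bits]
                          for lvl, bits in operand.items()})
        else:  # an operand: (level, bits) from an indexed retrieval
            lvl, bits = tok
            stack.append({lvl: bits})
    # Step 4: convert every surviving string up to the common top level
    # and OR them into a single answer string.
    answer = [0] * N_FAC
    for lvl, bits in stack.pop().items():
        answer = [p | q for p, q in zip(answer, to_fac(lvl, bits))]
    return answer

# ( A AND B ) OR ( C AND D ), all four at the inspection level:
A = ("insp", [0, 1, 0, 0, 1])
B = ("insp", [0, 1, 0, 1, 0])
C = ("insp", [1, 0, 0, 0, 0])
D = ("insp", [0, 0, 0, 0, 1])
print(evaluate([A, B, "AND", C, D, "AND", "OR"]))  # [1, 0, 0]
```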
 
And that's it. Robert Greenspun remained (and still remains) unconvinced that this algorithm is mathematically correct; I remained (and still remain) convinced that it is. The algorithm has never been mathematically proven or disproven. However, many millions of queries have been performed using it, and no reports of any wrong answers have ever occurred (at least, none that were caused by this algorithm).

One thing that's difficult to visualize from this explanation is that, in reality, an interface (EPA ECHO) was written that generated hundreds of Expressions for complex queries, so at the end of the algorithm there may remain dozens of binary strings needing to be resolved. Also, given that an OR generates two strings, it's very likely that one Boolean operation (well into the query) will be performed on, e.g., (one *set* of binary strings at different levels) AND (another *set* of binary strings at different levels) - it makes sense to combine all strings at the same level as early as possible in processing, in order to minimize the number of binary strings that you're toting around with you.

In each of the four steps, the term "binary string" actually means, in the general sense, "set of binary strings at different levels," with the Boolean operation performed on each string at a similar level (for example, since NOT is a unary operator, you simply NOT each and every string you have at that moment). Of note: the only operator that might require level conversions is an AND - don't ever convert levels for a NOT or an OR.
 
This algorithm plays fast-and-loose with invalid bits (because of the way it treats NOT operations), but the final conversions act as a sentry, preventing them from getting through, and (hopefully) no "real" bits are incorrectly activated before that time. No, I can't prove it, but I know I'm right: an AND can't activate them, an OR can't activate them, and a NOT will have them filtered out at the end.
 
Given Step 3, queries of the form NOT ( A AND B ) OR NOT ( C AND D ), for example, can appear intimidating and chaotic, but I'm convinced that this algorithm works, and that no invalid bits can possibly survive the conversion process at the end. (NOT NOT A is the exact same thing as A, and at the time I developed this, I had convinced myself that there were no false positives and no false negatives - but I'll leave that for future generations to prove or disprove.)
 
Lots of luck,
Don Rockwell
 
PS - There are a couple of details I've left out of this description, without which an implementation would be incorrect. For example, if the final operator is a NOT, the resultant, penultimate bit string must be ANDed against an "existence string" to eliminate non-existing bits. Also, I wrote a recursive procedure to pre-process and annotate our postfix structure in order to optimize levels. I think the only logic addition is the "existence mask AND" situation described here - just make sure that a NOT hasn't activated non-existent bits without performing this AND; the rest, e.g., annotating our postfix notation, was strictly for optimization (the use of dataspaces, hiperspaces, etc., which will no longer be needed in the future, as everything will be stored in memory or core).
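A tiny sketch of that existence-mask situation (the 8-slot layout and values are invented):

```python
# Existence-mask fix from the PS: when the final operation is a NOT,
# AND the result against a string marking which record slots actually
# exist, so empty slots can't leak into the answer. Layout is invented.

exists = [1, 1, 0, 1, 1, 0, 1, 1]  # 0 = slot holds no real record
hits_a = [0, 1, 0, 0, 1, 0, 0, 1]  # records matching Expression A
not_a  = [1 - b for b in hits_a]   # NOT A also turns on the empty slots
final  = [e & b for e, b in zip(exists, not_a)]
print(final)  # [1, 0, 0, 1, 0, 0, 1, 0]
```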