Saturday, January 26, 2013

DNA IMAGER

I have been writing programs for about 28 years now, beginning with Basic on TRS-80s, which had about as much memory as a cheap solar calculator you find these days. One of the first things I used a computer for was to turn data into images I could look at. I have always believed in the cliche that a picture is worth a thousand words. We perceive more data through vision than through any other sense, and we see patterns that we wouldn't detect by other means. The Human Genome files hold their data as a giant set of text containing the letters A, G, C, and T. I like to use the files with the extension FA. These are the easiest to convert for my use, although I have written several programs that will seek out the lines of a, c, g, and t in the other file formats. The program won't start imaging the data until it gets an entire line of just those four letters, so a regular sentence such as those typed here won't fool the program. If 40 a's, c's, g's, and t's don't appear side by side, it won't use the data at that point. FA files eliminate the need for The Sifter, as I call it.
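For anyone curious what The Sifter boils down to, here is a minimal sketch in Python. It is not my actual program. The names `sift` and `MIN_RUN` are made up for illustration, and it assumes a line only counts if it is at least 40 characters long and made up entirely of a, c, g, and t.

```python
import re

# Assumption: 40 is the minimum run of bases, taken from the description above.
MIN_RUN = 40

# Matches a string made up of nothing but a, c, g, and t (either case).
ONLY_BASES = re.compile(r"[acgt]+", re.IGNORECASE)


def sift(lines, min_run=MIN_RUN):
    """Yield only the lines that are pure base data and long enough to trust."""
    for line in lines:
        text = line.strip()
        if len(text) >= min_run and ONLY_BASES.fullmatch(text):
            yield text.upper()
```

A regular sentence fails the test because it contains spaces and other letters, and a short run of bases fails on length.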

I have imaged the entire Human Genome set. The last set I used was from February 2011 and isn't complete. I am not sure what this means. Is some of the data in doubt? Is some of the data being withheld? Is it just actually incomplete? We were all told that the entire human genome had been sequenced. Be that as it may, those missing stretches of data are denoted in the Human Genome files with the letter N. These appear as the black regions in the images. I originally excluded that data from my imaging but realized it would mess up the accounting part of the program. Not only am I imaging the entire genome, I am also testing the frequency of the bases in each chromosome as well as combinations of the bases, such as how often one letter/base occurs next to another and so forth.
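The accounting part can be sketched in a few lines of Python. This is only an illustration and not my real program. The function name `tally` is made up, and it assumes the N's are counted along with the four bases so the totals still add up.

```python
from collections import Counter


def tally(seq):
    """Count each letter and each side by side pair of letters in a sequence."""
    seq = seq.upper()
    singles = Counter(seq)
    # Each pair is one letter plus its right-hand neighbor, so pairs overlap.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return singles, pairs
```

Dividing a count by the length of the sequence gives the frequency of that base or pair in the chromosome.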

The imaging is simple. A small bmp file, 2 pixels by 2 pixels, stands in for each letter in the files. Adenine is red, Guanine is blue, Cytosine is green, and Thymine is yellow. Instead of setting all four pixels in a 2 x 2 block I set just 3, leaving 1 pixel black. This gives contrast to the image. By using small bmp files I am able to use images of small spheres, ovoids, or anything else I want to represent the bases.
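Here is a rough Python sketch of the 2 x 2 block idea. The exact RGB values and which of the four pixels stays black are assumptions for the sake of the example (I picked the bottom right one), and the names `COLORS` and `block` are made up.

```python
# Assumed RGB values for the four colors named above.
COLORS = {
    "A": (255, 0, 0),    # Adenine is red
    "G": (0, 0, 255),    # Guanine is blue
    "C": (0, 255, 0),    # Cytosine is green
    "T": (255, 255, 0),  # Thymine is yellow
}
BLACK = (0, 0, 0)


def block(base):
    """Return a 2 x 2 grid of RGB pixels for one base, with one pixel left black."""
    color = COLORS.get(base.upper(), BLACK)  # N and anything unknown stay black
    return [[color, color],
            [color, BLACK]]
```

Because N falls through to black, the missing stretches of data show up as black regions without any special handling.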

If I want to look at very large patterns on my largest computer screen, which has a resolution of 1400 x 900, I can use individual pixels for each base. This lets me look at a larger section of the data at one time.

I have found a multitude of patterns. I expect it will take me years to sift through all the data. Each chromosome, especially the larger ones, has as many as 250 million bases in it and generates several gigabytes of image files. I have done only a quick look through the entire set and pulled out those patterns that caught my eye. Originally I wasn't going to blog/publish any of these results, thinking I may be on to something. My thought is to ultimately get crowdsourced funding to continue the work, assuming someone doesn't just steal the work if they find it useful and call it their own. It wouldn't be the first time someone nearly anonymous got ripped off in the field of DNA research. Read the story of Rosalind Franklin if you are curious about that comment.

The first two panels below show a pattern I have seen across the entire genome, all 24 chromosomes. That's 22 numbered chromosomes and 2 sex chromosomes, X and Y. In the panel below is a pattern I call Red and Green Banding. All this work is preliminary, so I don't know what it means yet except that it is an obvious strategy employed in every chromosome. Off the top of my head I am thinking it is like file folders in Windows. Perhaps these sequences, which are folded in the chromosome, emit a magnetic signature that is like an index card along the sequence. It is my contention that all processes are based on electromagnetic properties, which I call profiles. This is in line with the study of Protein Folding. The imaging idea is loosely based on the idea of a Beta Pleated Sheet. The image is laid down left to right and then, at a point set by the matrix, it doubles back on itself like a snake winding its way down the screen. To save hard drive space and to facilitate side by side comparisons, I wrote a feature of the program that determines how many patterns can be put on a single screen, minus the control buttons on the side that you can't see in the screen shots. This folding would not occur in nature because the particular EM properties of the bases would not allow such an arrangement, if only because my representation is 2 dimensional and these molecules would be in 3 dimensions. This doesn't matter since I am looking for patterns in the bases. A similar string of bases, as you will see, will show a similar pattern.
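The snake layout can be sketched like this in Python. It is an illustration of the idea and not my actual program. It assumes every second row runs right to left, and the names `snake_position` and `snake_grid` are made up. The `matrix` argument is the number of bases laid down before the line doubles back.

```python
def snake_position(i, matrix):
    """Return (row, column) of the i-th base when the line doubles back every `matrix` bases."""
    row, offset = divmod(i, matrix)
    # Even rows run left to right, odd rows run right to left.
    col = offset if row % 2 == 0 else matrix - 1 - offset
    return row, col


def snake_grid(seq, matrix):
    """Lay a sequence out as rows of text using the snake layout."""
    n_rows = -(-len(seq) // matrix)  # ceiling division
    grid = [[" "] * matrix for _ in range(n_rows)]
    for i, base in enumerate(seq):
        row, col = snake_position(i, matrix)
        grid[row][col] = base
    return ["".join(row) for row in grid]
```

With a matrix of 4, the ten bases ACGTACGTAC come out as ACGT on the first row, TGCA on the second (the same four bases running backward), and AC on the third. Changing the matrix changes which bases end up stacked above one another, which is why the same data can show different patterns at different settings.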


Red and Green Banding in the number 5 Chromosome. The numbering is an arbitrary distinction made up by the earlier scientists in the field. I don't believe there is any particular order of chromosomes, although this research, when I really get down into the patterns, may prove otherwise.







Red and Green Banding in the number 3 Chromosome. Having looked through the thousands upon thousands of images my programs have generated, I have noticed that the red and green bands are "tighter". There are also Orange and Aqua colored bands. Those bands are more drawn out and not as "tight".






I expect these images to expand quite a bit when the viewer clicks on them. I am working with Blogger while it is acting up, so I am not sure what it is going to do. Assuming it works correctly, the user will see in the next panel what I call the Red Blue Jean Pattern and, in the same image, what I call the Slanted Pattern. The red blue jean pattern is fairly unique. This is the largest example of it. The slanted pattern occurs in half of the chromosomes, which is interesting to me. Of course the orientation of the pattern is affected by the matrix used. The matrix is the point at which the line of pixels, beads, or whatever colored image I am using loops back on itself. This particular panel has a matrix of 54.

I wrote a program that lets me change the matrix in real time while the image is being painted on the screen. The range of 50s to 70s produces some very striking patterns. In the pixel form of this program, which sets a single pixel per base, the matrix 121 is interesting. I am imaging the entire genome again using that method and comparing these areas. Changing the matrix changes patterns and reveals new ones the other matrix didn't.














The Red Blue Jean panel shows another pattern similar to the one in the first two image panels above. This feature is the slanted pattern, and in some chromosomes it is found in the middle of the data and continues for hundreds of thousands of bases. Those first two panels are from chromosomes 3 and 5. At the bottom of all the panels you will see numbers in white blocks. These are reference numbers for my continuing work. Another program I have written allows me to start imaging from any point within a file from the Human Genome set. These numbers help me to approximate where to start looking for patterns I find interesting. Since I have completed the entire imaging of the genome using a single matrix, 54, and have sifted through all the images, I am now writing the program that will isolate these patterns and their associated representation in the data files. Then I can begin to classify them.

The next panels show a pattern I found in chromosome 19 that prompted me to post these findings. If you look, you will see a pattern repeated in small chunks throughout this data set. Panel 1 shows one type of this pattern and panel 2 shows another. By type I mean that in the first panel the slants are in sets of 2, whereas in panel 2 they are in sets of 3. The patterns are similar but different. What makes them a group is that this pattern occurs as many as 200 times within the data file for chromosome 19.






Chromosome 19 is the oddest of the chromosomes I have imaged so far. After viewing all the other chromosomes using this method I got the feeling I was looking at the Frankenstein's Monster of the human genome, a chromosome that looks as if it is a hodgepodge of the features of the other 23. Chromosome Y is pretty interesting as well, but I will keep that to myself for now until I develop this idea further.

I have an ultimate theory which is the reason I am doing this work. This is not my job, this is my hobby. I enjoy science for its own sake, but I think this method may be of use in finding patterns we wouldn't otherwise come across. The imaging method uses the advantage that the human eye, or more correctly the human brain, offers for finding patterns. I do realize that we also see patterns that aren't necessarily there. For instance, I have a picture of a coffee stain on a coffee pot that looks just like the Coca Cola Polar Bear and one of mold in a coffee cup that looks just like the Egyptian Sphinx. So I am not oblivious to the fact that I may see and interpret things that are not really there.

This is why this work has gone on for several years now. I want to be sure I am seeing something that is persistent and not a transient effect of screen resolution or of another problem I once encountered while imaging large number files like Pi. I am collecting the data for my larger hypothesis, which has to do with the mechanism for evolution. Here is a hint. It deals with the phenomenon called Punctuated Equilibrium. I believe I know how it is possible, and this is how I am going about looking for it.

Along the way I am hoping to find what I call structures and sub structures. In a few days I will be going to the crowdsourcing sites looking for funding to complete at least one project from this research. Even if this turns out to have little or no scientific value, it has art value. And the completion of the art project I have in mind would also be useful in the scientific aspect of the work. I would like to image and print the entire human genome, lay the images end to end on a giant white wall, and stand back and see what I see. A look at the entire Human Genome at one time. There may be a super structure there that we cannot see without imaging the entire thing.

Since I do this in my spare time, the work is going to be slow. So far the work keeps generating more experiments, of which I have only completed one: the imaging of the entire genome with a single matrix. My computer screen size limits the size of a panel I can produce. With some time and effort I could manipulate the program to produce a block image of an entire chromosome. Even pixel by pixel this block would be several feet by several feet, but laid out on a large wall it might reveal something we didn't expect. I can think of 10 good experiments that can be done using this and other genome data. If you could use this technique to find a gene signature image or enzymes, wouldn't that be very interesting?

I also have a version that converts the codons that code for amino acids into images. It is at a very early stage. Finding 20 unique colors or tiny images to represent the 20 usual amino acids formed from the 64 possible combinations of 3 letter codons isn't easy, but the preliminary work shows that some of the patterns hold up and some don't, which is very interesting. This is made more interesting by the fact that different codons code for the same amino acid. This also has ramifications in the accounting program and in evolution itself. A single mutation, or even multiple mutations, may not cause a change in the amino acid, and hence in the protein it forms, if the mutated codon is in the same set as the ones that already code for that amino acid. For instance cct, ccc, cca, and ccg code for the amino acid proline. If there is a change in the third base from any one base to another, the same amino acid is coded for, so the question is does it make a difference? Are there "flavors" of the same amino acid? Electrochemically I am sure there are.
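To show the proline example concretely, here is a small Python sketch built on the standard genetic code. It is an illustration and not my codon imaging program. The names `CODON_TABLE`, `translate`, and `synonymous` are made up, stop codons are marked with `*`, and any codon containing something other than a, c, g, or t is marked `X`, which is an assumption of mine.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq):
    """Turn a string of bases into one-letter amino acid codes, 3 bases at a time."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # X marks a codon with an N or other letter
        for i in range(0, len(seq) - 2, 3)
    )


def synonymous(codon):
    """List every codon that codes for the same amino acid as the given codon."""
    amino = CODON_TABLE[codon.upper()]
    return sorted(c for c, a in CODON_TABLE.items() if a == amino)
```

Here `translate("cctcccccaccg")` gives `PPPP`, four prolines in a row, and `synonymous("cct")` lists CCA, CCC, CCG, and CCT. A change in the third base of any of those codons leaves the amino acid the same.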

A person could spend a lifetime at this, I suspect. I would need 20 computers and several printers and about 20 years to run just the experiments I have thought of so far. It would have to become my job. Doing this piecemeal won't get results very fast.

Below is another pattern found in chromosome 19. I thought I would throw it in for fun. This panel is actually the center panel of 3. In the data on either side of this panel the pattern continues. If you notice, it is similar in color scheme to the panel directly above. These are imaged at the same matrix, so apparently the "math" internally is different or it would have produced slants. The arrangement that produces an orderly slanted pattern above produces these "swiggles" below. Until I get into the file itself and look at the arrangement of the a's, c's, g's, and t's I won't know what is going on. Stay tuned for that post if you are interested.

























Finally you will see below a panel I have made by using a photo editing program. These are crops from various chromosome images. Another theme I have found is that at the beginning and ending of nearly every chromosome you find these strong vivid images. Some are what I call primary color bars, scales, waves, and interference. Collecting these snippets out of each image set is a little like bug collecting. I have to go through each image set for each chromosome panel by panel and crop and paste. During this operation I noticed that some of these same patterns appear in more than one chromosome. I haven't bothered to keep track of them yet. Like I said, since this is a hobby and not something that pays the bills, I have to focus the bulk of my work on my primary theory.

I just took the time out to post these images for any viewers who might be interested in the subject.


In conclusion, I always like to do a recap in case anyone is actually reading this blog and becomes confused about some of the content. This research is based on a pet theory I have about Punctuated Equilibrium and its mechanism. I believe I know how it is possible that we find sudden dramatic changes in a species within just a few layers of the fossil record. The classic view of Darwinian Evolution puts forth the idea of change over time by a build up of mutations and selection. According to the fossil record, though, sometimes this occurs so fast as to seem impossible based on the slow mutation/selection method, yet it does appear to happen. I think I know why, and the answer is actually kind of simple.


Until next time. Enjoy the post.  I will edit and spell check it later as always.

And that's what's inside my brain for today.